Ridiculous number of empty columns #182

Closed
opened 2015-02-25 16:30:41 +00:00 by davidworkman9 · 14 comments
davidworkman9 commented 2015-02-25 16:30:41 +00:00 (Migrated from github.com)

I'm currently using J to convert user submitted XLSX files to CSV on my backend. I've came across a few files that when ran through J generate a large number of empty columns. When you open the file in Excel it looks fine, only has about 25 columns. If I re-save it and run it through J again it exports as you'd expect. I'm assuming the problem is the version of Excel that generated the file in the first place. Unfortunately I can't share the XLSX file without getting an OK from the customer, which I'm working on. I thought I'd open this up for now anyways in case it's an easy fix.

I'm currently using J to convert user submitted XLSX files to CSV on my backend. I've came across a few files that when ran through J generate a large number of empty columns. When you open the file in Excel it looks fine, only has about 25 columns. If I re-save it and run it through J again it exports as you'd expect. I'm assuming the problem is the version of Excel that generated the file in the first place. Unfortunately I can't share the XLSX file without getting an OK from the customer, which I'm working on. I thought I'd open this up for now anyways in case it's an easy fix.
SheetJSDev commented 2015-02-25 16:41:34 +00:00 (Migrated from github.com)

If you have the XLSX file, can you check the sheet dimensions manually? Fortunately it's not too painful:

  • Rename the file to "test.zip"
  • unzip test.zip
  • open the file xl/worksheets/sheet1.xml
  • search for the tag <dimension in the file and please share the entire tag.

For example, for the AutoFilter.xlsx test file the relevant tag is <dimension ref="A1:E22"/>

FYI: The way I normally do this is to use vim to edit the file (zip.vim plugin is needed for this), browser xl/worksheets/sheet1.xml, and run the vim command :%!xmllint --format -

If you have the XLSX file, can you check the sheet dimensions manually? Fortunately it's not too painful: - Rename the file to "test.zip" - unzip test.zip - open the file `xl/worksheets/sheet1.xml` - search for the tag `<dimension` in the file and please share the entire tag. For example, for the [AutoFilter.xlsx test file](https://github.com/SheetJS/test_files/raw/master/AutoFilter.xlsx) the relevant tag is `<dimension ref="A1:E22"/>` FYI: The way I normally do this is to use vim to edit the file (zip.vim plugin is needed for this), browser xl/worksheets/sheet1.xml, and run the vim command `:%!xmllint --format -`
davidworkman9 commented 2015-02-25 17:08:42 +00:00 (Migrated from github.com)

<dimension ref="A1:XFD132"/>

Seems like that's the problem! But there's really nothing after Q:131 when opening the file in excel.

`<dimension ref="A1:XFD132"/>` Seems like that's the problem! But there's really nothing after Q:131 when opening the file in excel.
SheetJSDev commented 2015-02-25 17:20:06 +00:00 (Migrated from github.com)

@davidworkman9 Long story short: the parser sets the range of a worksheet to the hull of the reported range (according to that record) and the set of all cells (according to the worksheet), and the utilities reference that range. It may make more sense to just ignore the reported range entirely, but that'll have to be changed in every single format parser.

@elad @notatestuser thoughts? Simply put, if the worksheet reports the range (as specified in a Dimension record) as A1:E5, but the only actual cells are A1 D2 B4, should the range be A1:E5 or A1:D4?

@davidworkman9 Long story short: the parser sets the range of a worksheet to the hull of the reported range (according to that record) and the set of all cells (according to the worksheet), and the utilities reference that range. It may make more sense to just ignore the reported range entirely, but that'll have to be changed in every single format parser. @elad @notatestuser thoughts? Simply put, if the worksheet reports the range (as specified in a Dimension record) as `A1:E5`, but the only actual cells are `A1 D2 B4`, should the range be `A1:E5` or `A1:D4`?
SheetJSDev commented 2015-02-25 17:27:19 +00:00 (Migrated from github.com)

@davidworkman9 out of curiosity, what tool generated the file in question? Also, can you do one more test: Search for every instance of the tag <c r= and see if any of those have a column beyond Q. In vim: %!grep '<c r="' will filter for those lines. Here is a snippet from the AutoFilter.xlsx file:

      <c r="A22">
      <c r="B22" t="s">
      <c r="E22">
@davidworkman9 out of curiosity, what tool generated the file in question? Also, can you do one more test: Search for every instance of the tag `<c r=` and see if any of those have a column beyond Q. In vim: `%!grep '<c r="'` will filter for those lines. Here is a snippet from the AutoFilter.xlsx file: ``` <c r="A22"> <c r="B22" t="s"> <c r="E22"> ```
davidworkman9 commented 2015-02-25 18:40:00 +00:00 (Migrated from github.com)

There's a bunch of them past Q.

Most look like this:
<c r="AMJ1" s="0"/>

There's a bunch of them past Q. Most look like this: `<c r="AMJ1" s="0"/>`
davidworkman9 commented 2015-02-25 18:55:22 +00:00 (Migrated from github.com)

From what I can see, everything past Q has an s attribute with a value of either 14 or 0. Where it's 14 it's only 14 from R-Z then it goes back to 0.

From what I can see, everything past Q has an s attribute with a value of either 14 or 0. Where it's 14 it's only 14 from R-Z then it goes back to 0.
SheetJSDev commented 2015-02-25 18:59:27 +00:00 (Migrated from github.com)

@davidworkman9 so to summarize: the actual sheet range has many more columns than actual data. The code as it currently stands will skip those stub cells (unless sheetStubs is set to true), so the workaround of ignoring the reported range will work as expected.

Last question: if you use excel to save as CSV, does it show up to column XFD or column Q?

@davidworkman9 so to summarize: the actual sheet range has many more columns than actual data. The code as it currently stands will skip those stub cells (unless `sheetStubs` is set to true), so the workaround of ignoring the reported range will work as expected. Last question: if you use excel to save as CSV, does it show up to column XFD or column Q?
davidworkman9 commented 2015-02-25 19:09:27 +00:00 (Migrated from github.com)

Saved as CSV it shows up to column XFD.

I don't believe it's skipping those cells, as I'm currently using J's CLI tool and I get a CSV with a comma for every one of those columns.

Saved as CSV it shows up to column XFD. I don't believe it's skipping those cells, as I'm currently using J's CLI tool and I get a CSV with a comma for every one of those columns.
SheetJSDev commented 2015-02-25 19:15:10 +00:00 (Migrated from github.com)

@davidworkman9 If I understand correctly, right now the CSV generated by J is the same as what Excel generates when you save the file as CSV. The problem is that you disagree with how Excel (and therefore J) handles empty columns. If that's the case, then this behavior (ignoring empty columns) should be an option. If J is doing something different from excel's CSV saving, then we need to change the default behavior.

@davidworkman9 If I understand correctly, right now the CSV generated by J is the same as what Excel generates when you save the file as CSV. The problem is that you disagree with how Excel (and therefore J) handles empty columns. If that's the case, then this behavior (ignoring empty columns) should be an option. If J is doing something different from excel's CSV saving, then we need to change the default behavior.
davidworkman9 commented 2015-02-25 19:18:31 +00:00 (Migrated from github.com)

I see that makes sense. Would be very nice to have that as an option.

I see that makes sense. Would be very nice to have that as an option.
elad commented 2015-02-25 19:29:05 +00:00 (Migrated from github.com)

@SheetJSDev I think whatever the worksheet reports should be used. I personally dislike software that tries to be smart... :)

@SheetJSDev I think whatever the worksheet reports should be used. I personally dislike software that tries to be smart... :)
davidworkman9 commented 2015-02-25 19:49:47 +00:00 (Migrated from github.com)

@elad Running a regex replace that removes all commas at the end of each row that are empty(/,+/g) would not result in any data loss, and being an optional parameter I fail to see the harm.

@elad Running a regex replace that removes all commas at the end of each row that are empty(`/,+/g`) would not result in any data loss, and being an optional parameter I fail to see the harm.
elad commented 2015-02-25 19:52:03 +00:00 (Migrated from github.com)

@davidworkman9 as long as it's tunable and documented, I don't either - it might as well be the default behavior.

@davidworkman9 as long as it's tunable and documented, I don't either - it might as well be the default behavior.
SheetJSDev commented 2017-03-22 07:52:38 +00:00 (Migrated from github.com)

We added an option strip to sheet_to_csv that will now strip any trailing commas (or whatever the field separator happens to be).

We added an option `strip` to `sheet_to_csv` that will now strip any trailing commas (or whatever the field separator happens to be).
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#182
No description provided.