Ridiculous number of empty columns #182
Labels
No Label
DBF
Dates
Defined Names
Features
Formula
HTML
Images
Infrastructure
Integration
International
ODS
Operations
Performance
PivotTables
Pro
Protection
Read Bug
SSF
SYLK
Style
Write Bug
good first issue
No Milestone
No Assignees
1 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: sheetjs/sheetjs#182
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
I'm currently using J to convert user submitted XLSX files to CSV on my backend. I've came across a few files that when ran through J generate a large number of empty columns. When you open the file in Excel it looks fine, only has about 25 columns. If I re-save it and run it through J again it exports as you'd expect. I'm assuming the problem is the version of Excel that generated the file in the first place. Unfortunately I can't share the XLSX file without getting an OK from the customer, which I'm working on. I thought I'd open this up for now anyways in case it's an easy fix.
If you have the XLSX file, can you check the sheet dimensions manually? Fortunately it's not too painful:
xl/worksheets/sheet1.xml
<dimension
in the file and please share the entire tag.For example, for the AutoFilter.xlsx test file the relevant tag is
<dimension ref="A1:E22"/>
FYI: The way I normally do this is to use vim to edit the file (zip.vim plugin is needed for this), browser xl/worksheets/sheet1.xml, and run the vim command
:%!xmllint --format -
<dimension ref="A1:XFD132"/>
Seems like that's the problem! But there's really nothing after Q:131 when opening the file in excel.
@davidworkman9 Long story short: the parser sets the range of a worksheet to the hull of the reported range (according to that record) and the set of all cells (according to the worksheet), and the utilities reference that range. It may make more sense to just ignore the reported range entirely, but that'll have to be changed in every single format parser.
@elad @notatestuser thoughts? Simply put, if the worksheet reports the range (as specified in a Dimension record) as
A1:E5
, but the only actual cells areA1 D2 B4
, should the range beA1:E5
orA1:D4
?@davidworkman9 out of curiosity, what tool generated the file in question? Also, can you do one more test: Search for every instance of the tag
<c r=
and see if any of those have a column beyond Q. In vim:%!grep '<c r="'
will filter for those lines. Here is a snippet from the AutoFilter.xlsx file:There's a bunch of them past Q.
Most look like this:
<c r="AMJ1" s="0"/>
From what I can see, everything past Q has an s attribute with a value of either 14 or 0. Where it's 14 it's only 14 from R-Z then it goes back to 0.
@davidworkman9 so to summarize: the actual sheet range has many more columns than actual data. The code as it currently stands will skip those stub cells (unless
sheetStubs
is set to true), so the workaround of ignoring the reported range will work as expected.Last question: if you use excel to save as CSV, does it show up to column XFD or column Q?
Saved as CSV it shows up to column XFD.
I don't believe it's skipping those cells, as I'm currently using J's CLI tool and I get a CSV with a comma for every one of those columns.
@davidworkman9 If I understand correctly, right now the CSV generated by J is the same as what Excel generates when you save the file as CSV. The problem is that you disagree with how Excel (and therefore J) handles empty columns. If that's the case, then this behavior (ignoring empty columns) should be an option. If J is doing something different from excel's CSV saving, then we need to change the default behavior.
I see that makes sense. Would be very nice to have that as an option.
@SheetJSDev I think whatever the worksheet reports should be used. I personally dislike software that tries to be smart... :)
@elad Running a regex replace that removes all commas at the end of each row that are empty(
/,+/g
) would not result in any data loss, and being an optional parameter I fail to see the harm.@davidworkman9 as long as it's tunable and documented, I don't either - it might as well be the default behavior.
We added an option
strip
tosheet_to_csv
that will now strip any trailing commas (or whatever the field separator happens to be).