Spurious whitespace creates spurious columns and rows #945
Labels
No Label
DBF
Dates
Defined Names
Features
Formula
HTML
Images
Infrastructure
Integration
International
ODS
Operations
Performance
PivotTables
Pro
Protection
Read Bug
SSF
SYLK
Style
Write Bug
good first issue
No Milestone
No Assignees
1 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: sheetjs/sheetjs#945
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Expected
Cells containing only whitespace are treated as empty string.
Actual
Whitespace is treated as characters.
Background
In the case that I accidentally have a space in a random cell, everything up to that point ends up being imported, including empty columns/rows. Of course, it is important to leave blank columns between 2 actual columns of data (for spacing). But if I have a 5x5 spreadsheet that upon opening has 4 extra columns and 2 extra rows because at some point there was data in some far-off cell but was replaced with
" "
, things look strange especially since I can't see whitespace normally. While in Excel extra columns and rows appear by default, when using web-based tools like the one at http://oss.sheetjs.com/ extra rows and columns must be added explicitly.I have some code I plan to use in the mean time. Feel free to use it. Be warned, it is not yet untested. It is as follows:
@bean5 you can add this to a page in the wiki https://github.com/SheetJS/js-xlsx/wiki . Note that if you open the code block with
```js
GH will use syntax highlighting:As for the general question of whether cells consisting solely of whitespace should be preserved, here are a few cases to consider:
'
=""
'=" "
=CHAR(32)&CHAR(32)
I updated my code block to use highlighting.
Your cases there are interesting. My code avoids at least some of those by running after evaluation has occurred. I used regular expressions to replace whatever regular expression defines as whitespace.
Aside, I found something strange with one of your cases. If you copy the value of
=CHAR(32)&CHAR(32)
into another cell (I am using LibreOffice), then open the file with http://oss.sheetjs.com/ it will be imported and look like whitespace. If you click on the cell, it gets updated toundefined
.Whitespace is really starting to rear its ugly head.
Also, I'm not going to add the code to the wiki...my code still has bugs--plus it doesn't aim to preserve blank rows/columns sandwiched between non-whitespace data.
Since you're working with an array of arrays, and since you want to preserve blank columns and blank rows between valid columns and rows, you can fix the data in one pass by looping backwards:
Use this on array of arrays, like those you would get from
sheet_to_json
withheader:1
:Since my use cases are okay with removing blank rows and columns, I just use the forwards route. But yes, I believe the backwards route should be used at a minimum by default by this project. It would be nice to have a setting to optionally remove the rest. Normally I'd make a PR, but I am new to the codebase. I don't know where to centrally place such code.
These are nice little utility functions but don't really belong in the library code directly, which is why the wiki or a separate library is the best place to house them. For posterity sake, that code segment represents one obvious solution and is hopefully helpful to other people who stumble upon this page.
It is unclear where the "fix" should appear. For example, readers could omit string cells with no formula and whitespace as content (which already happens with HTML input).
sheet_to_json
,sheet_to_csv
, and other utilities could skip cells with whitespace content.I agree on many of those points. I ran into this when consuming the json
version of the some xls documents I have.
On Thu, Jan 11, 2018 at 2:19 PM SheetJSDev notifications@github.com wrote: