What is a wrong input file format? #641
Labels
No Label
DBF
Dates
Defined Names
Features
Formula
HTML
Images
Infrastructure
Integration
International
ODS
Operations
Performance
PivotTables
Pro
Protection
Read Bug
SSF
SYLK
Style
Write Bug
good first issue
No Milestone
No Assignees
1 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: sheetjs/sheetjs#641
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
I'm writing a test to verify that when my code reads a bad input file using
xlsx.readFile
, a proper error is returned. However, I'm having trouble finding a file that would causexlsx
not to return a parsed worksheet. Any text-based format seems to work, including YAML, JSON, .gitignore, even JS source files. Only a binary file would currently cause it to choke.Is this a desired behavior? Seems really easy for an attacker to trick my application into reading a malformed (but valid, according to
xlsx
) file. My only options are to check the file extension (seems silly, since the file contents is what really matters) and to check for presence of every column that my application cares about (it accepts a range of spreadsheet formats, so this is not a quick check that can be done early on).I saw a parsing option called
WTF
, but setting it doesn't causereadFile
to reject non-spreadsheets. Would it be possible to add another option that rejects anything that doesn't actually have "columns" and "rows"?@dskrvk try this: take the .gitignore file from this repo, rename to gitignore.csv and try opening in Excel. This is what I see:
Excel itself is extremely aggressive is reading files that may or may not be utter nonsense. You can do file extension checks (using
path.extname
) to filter for the file extensions you want, but it won't address a YAML-masquerading-as-CSV.The best bet is to check the desired worksheet. If it has only one column and you expect 2, then there's something wrong. To get the number of rows/columns:
On a side note: the WTF option is to force certain classes of errors to bubble through. They are not showstoppers but rather indicate where we are missing support for a feature (e.g. unrecognized XML tags)
If that's how Excel behaves, it makes sense. Thanks for the suggestions!
@dskrvk I thought about this a little more and decided that this answer wasn't satisfactory. Should we add an option that generates an error if the file appears to be a one-column text file? It would obviously generate false positives (like a single-column CSV) but it would also avoid the uncertainty raised in this issue
I'm open to a new option to address the issue, and there are two obvious candidates:
"book formats": some string which describes which formats to support. You could restrict to the core Excel formats (XLSX/XLSB/XLSM/XLS/SpreadsheetML), also include OpenDocument formats, also include multi-column plaintext formats, and also include all plaintext formats.
"single column": reject all single-worksheet formats where the data is a single column. This is effectively the check from an earlier comment built in.
Option 1 is not directly applicable to my case, but I suppose it would be nice to be able to provide the expected format (e.g. as one of the constants defined in
xlsx
) and get an error if the parser finds a different format (optionally of course).Option 2 is trickier. Technically, any text file (source code, novel text, application log) can be interpreted as a single-column "spreadsheet". The question is, what do "real" spreadsheets look like? E.g. do any of the CSV format specs restrict how single-field files may be formatted? If there are no formal restrictions, it becomes a heuristic such as "first N rows should have approximately the same length" or "one delimiter type is used consistently". I'm not sure this library should be in the business of making such checks. I suppose the correct solution then is, as you suggested, for the user to verify the presence of columns that they need.
Depends on how you define "real". If you take Excel as the authority on what constitutes a real spreadsheet, these files are indeed spreadsheets, but they are definitely surprising!
If you want to verify, in node if you install the module globally (
npm install -g xlsx
) you can call thexlsx
program from your shell:This generates a file:
gitignore.xlsx
. You can verify manually that it is a good XLSX file: If you convert it back to CSV you will see a single-column output that is identical to the original.gitignore
:Screenshots of the attached file when viewed in Excel 2016 and LibreOffice:
This behavior is now documented in https://github.com/sheetjs/js-xlsx#guessing-file-type