What is a wrong input file format? #641

Closed
opened 2017-04-21 15:37:27 +00:00 by dskrvk · 6 comments
dskrvk commented 2017-04-21 15:37:27 +00:00 (Migrated from github.com)

I'm writing a test to verify that when my code reads a bad input file using xlsx.readFile, a proper error is returned. However, I'm having trouble finding a file that would cause xlsx not to return a parsed worksheet. Any text-based format seems to work, including YAML, JSON, .gitignore, even JS source files. Only a binary file would currently cause it to choke.

Is this a desired behavior? Seems really easy for an attacker to trick my application into reading a malformed (but valid, according to xlsx) file. My only options are to check the file extension (seems silly, since the file contents is what really matters) and to check for presence of every column that my application cares about (it accepts a range of spreadsheet formats, so this is not a quick check that can be done early on).

I saw a parsing option called WTF, but setting it doesn't cause readFile to reject non-spreadsheets. Would it be possible to add another option that rejects anything that doesn't actually have "columns" and "rows"?

I'm writing a test to verify that when my code reads a bad input file using `xlsx.readFile`, a proper error is returned. However, I'm having trouble finding a file that would cause `xlsx` not to return a parsed worksheet. Any text-based format seems to work, including YAML, JSON, .gitignore, even JS source files. Only a binary file would currently cause it to choke. Is this a desired behavior? Seems really easy for an attacker to trick my application into reading a malformed (but valid, according to `xlsx`) file. My only options are to check the file extension (seems silly, since the file contents is what really matters) and to check for presence of every column that my application cares about (it accepts a range of spreadsheet formats, so this is not a quick check that can be done early on). I saw a parsing option called `WTF`, but setting it doesn't cause `readFile` to reject non-spreadsheets. Would it be possible to add another option that rejects anything that doesn't actually have "columns" and "rows"?
SheetJSDev commented 2017-04-21 15:48:34 +00:00 (Migrated from github.com)

@dskrvk try this: take the .gitignore file from this repo, rename to gitignore.csv and try opening in Excel. This is what I see:

screen shot 2017-04-21 at 11 42 04

Excel itself is extremely aggressive is reading files that may or may not be utter nonsense. You can do file extension checks (using path.extname) to filter for the file extensions you want, but it won't address a YAML-masquerading-as-CSV.

The best bet is to check the desired worksheet. If it has only one column and you expect 2, then there's something wrong. To get the number of rows/columns:

var range = XLSX.utils.decode_range(worksheet['!ref']);
var ncols = range.e.c - range.r.c + 1, nrows = range.e.r - range.s.r + 1;

On a side note: the WTF option is to force certain classes of errors to bubble through. They are not showstoppers but rather indicate where we are missing support for a feature (e.g. unrecognized XML tags)

@dskrvk try this: take the .gitignore file from this repo, rename to gitignore.csv and try opening in Excel. This is what I see: <img width="92" alt="screen shot 2017-04-21 at 11 42 04" src="https://cloud.githubusercontent.com/assets/6070939/25285274/aa465410-2687-11e7-94b4-b102586901c1.png"> Excel itself is extremely aggressive is reading files that may or may not be utter nonsense. You can do file extension checks (using `path.extname`) to filter for the file extensions you want, but it won't address a YAML-masquerading-as-CSV. The best bet is to check the desired worksheet. If it has only one column and you expect 2, then there's something wrong. To get the number of rows/columns: ```js var range = XLSX.utils.decode_range(worksheet['!ref']); var ncols = range.e.c - range.r.c + 1, nrows = range.e.r - range.s.r + 1; ``` On a side note: the WTF option is to force certain classes of errors to bubble through. They are not showstoppers but rather indicate where we are missing support for a feature (e.g. unrecognized XML tags)
dskrvk commented 2017-04-21 16:25:07 +00:00 (Migrated from github.com)

If that's how Excel behaves, it makes sense. Thanks for the suggestions!

If that's how Excel behaves, it makes sense. Thanks for the suggestions!
SheetJSDev commented 2017-04-24 08:00:55 +00:00 (Migrated from github.com)

@dskrvk I thought about this a little more and decided that this answer wasn't satisfactory. Should we add an option that generates an error if the file appears to be a one-column text file? It would obviously generate false positives (like a single-column CSV) but it would also avoid the uncertainty raised in this issue

I'm open to a new option to address the issue, and there are two obvious candidates:

  • "book formats": some string which describes which formats to support. You could restrict to the core Excel formats (XLSX/XLSB/XLSM/XLS/SpreadsheetML), also include OpenDocument formats, also include multi-column plaintext formats, and also include all plaintext formats.

  • "single column": reject all single-worksheet formats where the data is a single column. This is effectively the check from an earlier comment built in.

@dskrvk I thought about this a little more and decided that this answer wasn't satisfactory. Should we add an option that generates an error if the file appears to be a one-column text file? It would obviously generate false positives (like a single-column CSV) but it would also avoid the uncertainty raised in this issue I'm open to a new option to address the issue, and there are two obvious candidates: - "book formats": some string which describes which formats to support. You could restrict to the core Excel formats (XLSX/XLSB/XLSM/XLS/SpreadsheetML), also include OpenDocument formats, also include multi-column plaintext formats, and also include all plaintext formats. - "single column": reject all single-worksheet formats where the data is a single column. This is effectively the check from [an earlier comment](https://github.com/SheetJS/js-xlsx/issues/641#issuecomment-296228153) built in.
dskrvk commented 2017-04-25 22:50:57 +00:00 (Migrated from github.com)

Option 1 is not directly applicable to my case, but I suppose it would be nice to be able to provide the expected format (e.g. as one of the constants defined in xlsx) and get an error if the parser finds a different format (optionally of course).

Option 2 is trickier. Technically, any text file (source code, novel text, application log) can be interpreted as a single-column "spreadsheet". The question is, what do "real" spreadsheets look like? E.g. do any of the CSV format specs restrict how single-field files may be formatted? If there are no formal restrictions, it becomes a heuristic such as "first N rows should have approximately the same length" or "one delimiter type is used consistently". I'm not sure this library should be in the business of making such checks. I suppose the correct solution then is, as you suggested, for the user to verify the presence of columns that they need.

Option 1 is not directly applicable to my case, but I suppose it would be nice to be able to provide the expected format (e.g. as one of the constants defined in `xlsx`) and get an error if the parser finds a different format (optionally of course). Option 2 is trickier. Technically, any text file (source code, novel text, application log) can be interpreted as a single-column "spreadsheet". The question is, what do "real" spreadsheets look like? E.g. do any of the CSV format specs restrict how single-field files may be formatted? If there are no formal restrictions, it becomes a heuristic such as "first N rows should have approximately the same length" or "one delimiter type is used consistently". I'm not sure this library should be in the business of making such checks. I suppose the correct solution then is, as you suggested, for the user to verify the presence of columns that they need.
SheetJSDev commented 2017-04-26 06:33:05 +00:00 (Migrated from github.com)

The question is, what do "real" spreadsheets look like? E.g. do any of the CSV format specs restrict how single-field files may be formatted?

Depends on how you define "real". If you take Excel as the authority on what constitutes a real spreadsheet, these files are indeed spreadsheets, but they are definitely surprising!

If you want to verify, in node if you install the module globally (npm install -g xlsx) you can call the xlsx program from your shell:

$ xlsx -o gitignore.xlsx -X .gitignore

This generates a file: gitignore.xlsx. You can verify manually that it is a good XLSX file: If you convert it back to CSV you will see a single-column output that is identical to the original .gitignore:

$ xlsx gitignore.xlsx | md5
Sheet1
1382dabe917c69a2037540d85c3ecc9a
MD5 (.gitignore) = 1382dabe917c69a2037540d85c3ecc9a
$ xlsx ~/Desktop/gitignore.xlsx
Sheet1
node_modules
*.tgz
_book/
misc/coverage.html
misc/prof.js
v8.log
tmp
*.[tT][xX][tT]
*.[cC][sS][vV]
*.[dD][iIbB][fF]
*.[pP][rR][nN]
*.[sS][lL][kK]
*.socialcalc
*.[xX][lL][sSwWcC]
*.[xX][lL][sS][xXmMbB]
*.[oO][dD][sS]
*.[fF][oO][dD][sS]
*.[xX][mM][lL]
*.[uU][oO][sS]
*.[wW][kKqQbB][S1234567890]
*.[qQ][pP][wW]
*.123
*.htm
*.html
*.sheetjs
*.exe

Screenshots of the attached file when viewed in Excel 2016 and LibreOffice:

issue641excel2016 issue641libreoffice
> The question is, what do "real" spreadsheets look like? E.g. do any of the CSV format specs restrict how single-field files may be formatted? Depends on how you define "real". If you take Excel as the authority on what constitutes a real spreadsheet, these files are indeed spreadsheets, but they are definitely surprising! If you want to verify, in node if you install the module globally (`npm install -g xlsx`) you can call the `xlsx` program from your shell: ```bash $ xlsx -o gitignore.xlsx -X .gitignore ``` This generates a file: [`gitignore.xlsx`](https://github.com/SheetJS/js-xlsx/files/957419/gitignore.xlsx). You can verify manually that it is a good XLSX file: If you convert it back to CSV you will see a single-column output that is identical to the original `.gitignore`: ```bash $ xlsx gitignore.xlsx | md5 Sheet1 1382dabe917c69a2037540d85c3ecc9a MD5 (.gitignore) = 1382dabe917c69a2037540d85c3ecc9a $ xlsx ~/Desktop/gitignore.xlsx Sheet1 node_modules *.tgz _book/ misc/coverage.html misc/prof.js v8.log tmp *.[tT][xX][tT] *.[cC][sS][vV] *.[dD][iIbB][fF] *.[pP][rR][nN] *.[sS][lL][kK] *.socialcalc *.[xX][lL][sSwWcC] *.[xX][lL][sS][xXmMbB] *.[oO][dD][sS] *.[fF][oO][dD][sS] *.[xX][mM][lL] *.[uU][oO][sS] *.[wW][kKqQbB][S1234567890] *.[qQ][pP][wW] *.123 *.htm *.html *.sheetjs *.exe ``` Screenshots of the attached file when viewed in Excel 2016 and LibreOffice: <img width="186" alt="issue641excel2016" src="https://cloud.githubusercontent.com/assets/6070939/25420883/5a2be59a-2a28-11e7-9734-48c94bf1ba80.png"> <img width="358" alt="issue641libreoffice" src="https://cloud.githubusercontent.com/assets/6070939/25420906/63cd10a6-2a28-11e7-84f7-85d0ffe5c0cc.png">
SheetJSDev commented 2017-05-16 17:54:00 +00:00 (Migrated from github.com)

This behavior is now documented in https://github.com/sheetjs/js-xlsx#guessing-file-type

This behavior is now documented in https://github.com/sheetjs/js-xlsx#guessing-file-type
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#641
No description provided.