Large .xlsx file not getting .Sheets #792

Open
opened 2017-08-24 13:21:01 +00:00 by jensJJ · 9 comments
jensJJ commented 2017-08-24 13:21:01 +00:00 (Migrated from github.com)

Hello,
I am using the newest library of xlsx.js on the client-side (browser). It works reading xlsx files and rendering to HTML table for smaller files.

Now I have a larger xlsx file with 1mil rows. I am getting the sheetNames attributes, but I am not getting the .Sheets object. It is empty.
A smaller sub-set of this file reads perfectly.

What could be the error?
Thank you very much!

excel1

Update: I used the WTF: true read options and receive following errror:

xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js:17317 Uncaught RangeError: Invalid string length
    at Array.join (native)
    at Object.arrayLikeToString [as string] (http://localhost:3000/assets/jszip.self-dcc1caf9dd7a61892574ee6089fd0084b3f35682d2f9a06604edf5ba62b4c505.js?body=1:1932:19)
    at Object.exports.transformTo (http://localhost:3000/assets/jszip.self-dcc1caf9dd7a61892574ee6089fd0084b3f35682d2f9a06604edf5ba62b4c505.js?body=1:2052:50)
    at ZipObject.dataToString (http://localhost:3000/assets/jszip.self-dcc1caf9dd7a61892574ee6089fd0084b3f35682d2f9a06604edf5ba62b4c505.js?body=1:712:24)
    at ZipObject.asBinary (http://localhost:3000/assets/jszip.self-dcc1caf9dd7a61892574ee6089fd0084b3f35682d2f9a06604edf5ba62b4c505.js?body=1:760:29)
    at getdatastr (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:1556:38)
    at getdata (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:1573:95)
    at getzipdata (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:1594:19)
    at safe_parse_sheet (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:17300:14)
    at parse_zip (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:17429:3)
Hello, I am using the newest library of `xlsx.js` on the client-side (browser). It works reading `xlsx` files and rendering to HTML table for smaller files. Now I have a larger `xlsx` file with **1mil** rows. I am getting the `sheetNames` attributes, but I am not getting the `.Sheets `object. It is empty. A smaller sub-set of this file reads perfectly. What could be the error? Thank you very much! ![excel1](https://user-images.githubusercontent.com/4704155/29668227-b0d75c4c-88df-11e7-813c-e1a3c8fd1551.png) Update: I used the `WTF: true` read options and receive following errror: ``` xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js:17317 Uncaught RangeError: Invalid string length at Array.join (native) at Object.arrayLikeToString [as string] (http://localhost:3000/assets/jszip.self-dcc1caf9dd7a61892574ee6089fd0084b3f35682d2f9a06604edf5ba62b4c505.js?body=1:1932:19) at Object.exports.transformTo (http://localhost:3000/assets/jszip.self-dcc1caf9dd7a61892574ee6089fd0084b3f35682d2f9a06604edf5ba62b4c505.js?body=1:2052:50) at ZipObject.dataToString (http://localhost:3000/assets/jszip.self-dcc1caf9dd7a61892574ee6089fd0084b3f35682d2f9a06604edf5ba62b4c505.js?body=1:712:24) at ZipObject.asBinary (http://localhost:3000/assets/jszip.self-dcc1caf9dd7a61892574ee6089fd0084b3f35682d2f9a06604edf5ba62b4c505.js?body=1:760:29) at getdatastr (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:1556:38) at getdata (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:1573:95) at getzipdata (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:1594:19) at safe_parse_sheet (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:17300:14) at parse_zip (http://localhost:3000/assets/xlsx.self-4eba00dc4b89315f21b672499210ec51d971b33bc29739752e402683284d8353.js?body=1:17429:3) ```
sigod commented 2017-09-25 12:09:50 +00:00 (Migrated from github.com)

I have the same problem. I'm trying to process XLSX file (179772 rows, 38.1 MB) and .Sheets are just empty.

Edit: Tried 0.10.8 and 0.11.4 versions on server-side.

I have the same problem. I'm trying to process [XLSX file][0] (179772 rows, 38.1 MB) and `.Sheets` are just empty. Edit: Tried 0.10.8 and 0.11.4 versions on server-side. [0]: https://platform.prisma-capacity.eu/rest/reports/2329414/downloadReport
SheetJSDev commented 2017-09-25 14:43:33 +00:00 (Migrated from github.com)

@sigod it's surpassing a JS engine limit. XLSX files are ZIP containers, and the worksheet data is stored as XML text files within the ZIP. Rename to .zip and extract and you'll see the file:

$ unzip -l 20160503_Auction_Report.xlsx
Archive:  /tmp/20160503_Auction_Report.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
      592  05-03-2016 16:03   _rels/.rels
     1064  05-03-2016 16:03   [Content_Types].xml
      183  05-03-2016 16:03   docProps/app.xml
      438  05-03-2016 16:03   docProps/core.xml
  4245649  05-03-2016 16:03   xl/sharedStrings.xml
     2164  05-03-2016 16:03   xl/styles.xml
      357  05-03-2016 16:03   xl/workbook.xml
      570  05-03-2016 16:03   xl/_rels/workbook.xml.rels
392759204  05-03-2016 16:03   xl/worksheets/sheet1.xml <--
---------                     -------
397010221                     9 files

The current V8 string size limit (which applies to node and chrome) is (1<<28) - 16 = 268435440 bytes and the file is significantly larger than that.

The good news is that the limit is being increased in the future: https://v8project.blogspot.com/2017/09/v8-release-62.html

@sigod it's surpassing a JS engine limit. XLSX files are ZIP containers, and the worksheet data is stored as XML text files within the ZIP. Rename to .zip and extract and you'll see the file: ``` $ unzip -l 20160503_Auction_Report.xlsx Archive: /tmp/20160503_Auction_Report.xlsx Length Date Time Name --------- ---------- ----- ---- 592 05-03-2016 16:03 _rels/.rels 1064 05-03-2016 16:03 [Content_Types].xml 183 05-03-2016 16:03 docProps/app.xml 438 05-03-2016 16:03 docProps/core.xml 4245649 05-03-2016 16:03 xl/sharedStrings.xml 2164 05-03-2016 16:03 xl/styles.xml 357 05-03-2016 16:03 xl/workbook.xml 570 05-03-2016 16:03 xl/_rels/workbook.xml.rels 392759204 05-03-2016 16:03 xl/worksheets/sheet1.xml <-- --------- ------- 397010221 9 files ``` The current V8 string size limit (which applies to node and chrome) is `(1<<28) - 16 = 268435440` bytes and the file is significantly larger than that. The good news is that the limit is being increased in the future: https://v8project.blogspot.com/2017/09/v8-release-62.html
sigod commented 2017-09-25 17:05:13 +00:00 (Migrated from github.com)

@SheetJSDev Is there any good way to work around this issue?

@SheetJSDev Is there any good way to work around this issue?
idibidiart commented 2018-03-15 00:07:49 +00:00 (Migrated from github.com)

Even with WTF option, I get no errors and it makes up "Sheet1" where the file actually contains many sheets none of which is named Sheet1.

File(265892023) {name: "Contract_Insight_XXXX.xlsm", lastModified: 1520893001000, lastModifiedDate: Mon Mar 12 2018 15:16:41 GMT-0700 (PDT), webkitRelativePath: "", size: 265892023, …}lastModified: 1520893001000lastModifiedDate: Mon Mar 12 2018 15:16:41 GMT-0700 (PDT) {}name: "Contract_Insight_XXXX.xlsm"size: 265892023type: "application/vnd.ms-excel.sheet.macroenabled.12"webkitRelativePath: ""proto: File

demo.js:246 {
"SheetNames": [
"Sheet1"
],
"Sheets": {
"Sheet1": {}
}
}

I'm afraid the string limit increase won't help as this is 10X larger than the file @sigod linked to.

If my need is only to read the data itself (no cell meta data) does the Pro version offer a streaming solution? @SheetJSDev

Even with WTF option, I get no errors and it makes up "Sheet1" where the file actually contains many sheets none of which is named Sheet1. File(265892023) {name: "Contract_Insight_XXXX.xlsm", lastModified: 1520893001000, lastModifiedDate: Mon Mar 12 2018 15:16:41 GMT-0700 (PDT), webkitRelativePath: "", size: 265892023, …}lastModified: 1520893001000lastModifiedDate: Mon Mar 12 2018 15:16:41 GMT-0700 (PDT) {}name: "Contract_Insight_XXXX.xlsm"size: 265892023type: "application/vnd.ms-excel.sheet.macroenabled.12"webkitRelativePath: ""__proto__: File demo.js:246 { "SheetNames": [ "Sheet1" ], "Sheets": { "Sheet1": {} } } I'm afraid the string limit increase won't help as this is 10X larger than the file @sigod linked to. If my need is only to read the data itself (no cell meta data) does the Pro version offer a streaming solution? @SheetJSDev
pckhoi commented 2020-10-15 13:25:51 +00:00 (Migrated from github.com)

Is it possible to store the entire file in a typed array instead of string? In Chrome maximum buffer length is 2 ** 31 - 2 ** 12 = 2147479552 bytes. Firefox's is double that. That should be more than enough.

Is it possible to store the entire file in a typed array instead of string? In Chrome maximum buffer length is `2 ** 31 - 2 ** 12 = 2147479552` bytes. Firefox's is double that. That should be more than enough.
Kaiido commented 2021-02-25 08:59:10 +00:00 (Migrated from github.com)

I didn't check thoroughly the sources yet, and I don't have time to work on this myself right now, but from what I understand you are using JSZipSync to first unzip the xlsx file.
The sheets are then accessible in the object (obj) returned by JSZipSync as obj["files"]["xl/worksheets/sheetN.xml"] which is actually yet an other compressed file.
Calling the .getContent() method of this object will return an Uint8Array representing the xml file.

So theoretically, all it takes to have huge files support (and a place to fire progress events) is to hook here a stream-xml-parser (it seems there are many out-there), and to read that xml file by chunks (which can be done efficiently in a browser using a TextDecoder).
Am I missing something else here that would prevent implementing this more than necessary feature?

I didn't check thoroughly the sources yet, and I don't have time to work on this myself right now, but from what I understand you are using `JSZipSync` to first unzip the xlsx file. The sheets are then accessible in the object (`obj`) returned by `JSZipSync` as `obj["files"]["xl/worksheets/sheetN.xml"]` which is actually yet an other compressed file. Calling the `.getContent()` method of this object will return an Uint8Array representing the xml file. So theoretically, all it takes to have huge files support (and a place to fire progress events) is to hook here a stream-xml-parser (it seems there are many out-there), and to read that xml file by chunks (which can be done efficiently in a browser using a TextDecoder). Am I missing something else here that would prevent implementing this more than necessary feature?
Kaiido commented 2021-03-01 06:50:13 +00:00 (Migrated from github.com)

Today was holiday here and I had time to make a simple Proof-Of-Concept of my idea in https://github.com/SheetJS/sheetjs/issues/792#issuecomment-785733626

You can see it at https://xml-stream-parsing-poc.glitch.me/

Sorry it is very fast written since I lack time, and it might make your eyes get out of their orbit seeing how bad my xlsx parser is (I really just wanted to somehow log the values in my sheet), but it at least shows that it should be possible to implement such an xml stream parser in your library too.
Using that I was able to parse 60MB+ files in Chrome, while SheetJS fails to read these.

However the speed of parsing obviously suffers from this streaming method, but I believe it's preferable to have a slower parser than to not be able to parse some files, moreover when this allows to add some kind of a progress event.

Today was holiday here and I had time to make a simple Proof-Of-Concept of my idea in https://github.com/SheetJS/sheetjs/issues/792#issuecomment-785733626 You can see it at https://xml-stream-parsing-poc.glitch.me/ Sorry it is very fast written since I lack time, and it might make your eyes get out of their orbit seeing how bad my xlsx parser is (I really just wanted to somehow log the values in my sheet), but it at least shows that it should be possible to implement such an xml stream parser in your library too. Using that I was able to parse 60MB+ files in Chrome, while SheetJS fails to read these. However the speed of parsing obviously suffers from this streaming method, but I believe it's preferable to have a slower parser than to not be able to parse some files, moreover when this allows to add some kind of a progress event.
SheetJSDev commented 2021-10-18 19:51:43 +00:00 (Migrated from github.com)

@sigod so a full read takes 38 seconds currently, which is awful but it immediately addresses @jensJJ 's original error. V8 lifted the string limit 🥳 @idibidiart it is possible that your file works too, although it may take a while to test.

@pckhoi that is definitely doable and interesting in general. At this point, text processing is a very significant part of the runtime profile.

@Kaiido that is very cool!

Rewinding to the beginning, streaming is not currently supported because of limitations at multiple levels:

  1. Because of how the ZIP (XLSX/XLSB) and CFB (XLS) containers work, you have to find the directory to properly determine where the subfiles are located. ZIP EOCD record is at the end of the logical zip portion of the file, while the CFB directory can be anywhere (curiously, the pointer is in the file header). That means a correct streaming implementation would have to buffer the entire

  2. XLSX/XLSB should parse files in a very specific order (generally starting from [Content_Types].xml or _rels/.rels to determine the main workbook file) and the container structure does not guarantee the ordering of the subfiles within the ZIP. In fact, when the library writes XLSX and XLSB files, those are written at the very end

  3. XLSX/XLSB and XLS worksheets write interesting properties including merge cell locations after the sheet data, so a streaming approach necessarily loses critical metadata. For example, the XLSB writer has comments for each part. CELLTABLE is the actual cell data. The metadata that shows up before the cell data pertain to column properties, general worksheet formatting and window properties.

In parallel to "improving the sync performance", it's worth investigating the event/stream approach. It would be interesting to see if using an incremental deflate approach (off-thread unzipping) would improve throughput

@sigod so a full read takes 38 seconds currently, which is awful but it immediately addresses @jensJJ 's original error. V8 lifted the string limit 🥳 @idibidiart it is possible that your file works too, although it may take a while to test. @pckhoi that is definitely doable and interesting in general. At this point, text processing is a very significant part of the runtime profile. @Kaiido that is very cool! Rewinding to the beginning, [streaming is not currently supported](https://github.com/SheetJS/sheetjs#streaming-read) because of limitations at multiple levels: 1) Because of how the ZIP (XLSX/XLSB) and CFB (XLS) containers work, you have to find the directory to properly determine where the subfiles are located. ZIP EOCD record is at the end of the logical zip portion of the file, while the CFB directory can be anywhere (curiously, the pointer is in the file header). That means a correct streaming implementation would have to buffer the entire 2) XLSX/XLSB should parse files in a very specific order (generally starting from `[Content_Types].xml` or `_rels/.rels` to determine the main workbook file) and the container structure does not guarantee the ordering of the subfiles within the ZIP. In fact, when the library writes XLSX and XLSB files, [those are written at the very end](https://github.com/SheetJS/sheetjs/blob/master/bits/86_writezip.js#L123) 3) XLSX/XLSB and XLS worksheets write interesting properties including merge cell locations *after* the sheet data, so a streaming approach necessarily loses critical metadata. For example, [the XLSB writer](https://github.com/SheetJS/sheetjs/blob/master/bits/68_wsbin.js#L968) has comments for each part. `CELLTABLE` is the actual cell data. The metadata that shows up before the cell data pertain to column properties, general worksheet formatting and window properties. In parallel to "improving the sync performance", it's worth investigating the event/stream approach. It would be interesting to see if using an incremental deflate approach (off-thread unzipping) would improve throughput
kisaragi99 commented 2021-12-15 15:12:32 +00:00 (Migrated from github.com)

Need help.
I have the same error message as jensJJ had.
Tried to parse calendar_stress_test file. (74mb, 1 million rows)
https://github.com/SheetJS/sheetjs/issues/61#issuecomment-47676371
Converted it to xlsx.

Got this error.

image

If im not mistaken - the issue with V8 max string length is obsolete and the size of sheet1 (572mb) should be fine?

Archive:  calendar_stress_test_copy.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1348  01-01-1980 00:00   [Content_Types].xml
      733  01-01-1980 00:00   _rels/.rels
      831  01-01-1980 00:00   xl/_rels/workbook.xml.rels
     1393  01-01-1980 00:00   xl/workbook.xml
      383  01-01-1980 00:00   xl/sharedStrings.xml
     7646  01-01-1980 00:00   xl/theme/theme1.xml
     2285  01-01-1980 00:00   xl/styles.xml
572078353  01-01-1980 00:00   xl/worksheets/sheet1.xml
    38292  01-01-1980 00:00   docProps/thumbnail.jpeg
      792  01-01-1980 00:00   docProps/app.xml
168383066  01-01-1980 00:00   xl/calcChain.xml
      593  01-01-1980 00:00   docProps/core.xml
---------                     -------
740515715                     12 files

But, it worked perfectly in Firefox - approx 2 minutes.

(No browser extensions enabled in both chrome and firefox)

Need help. I have the same error message as jensJJ had. Tried to parse `calendar_stress_test` file. (74mb, 1 million rows) https://github.com/SheetJS/sheetjs/issues/61#issuecomment-47676371 Converted it to xlsx. Got this error. ![image](https://user-images.githubusercontent.com/58854608/146205982-300c5c42-c2b5-439f-8e8b-7109edd043d5.png) If im not mistaken - the issue with V8 max string length is obsolete and the size of sheet1 (572mb) should be fine? ``` Archive: calendar_stress_test_copy.zip Length Date Time Name --------- ---------- ----- ---- 1348 01-01-1980 00:00 [Content_Types].xml 733 01-01-1980 00:00 _rels/.rels 831 01-01-1980 00:00 xl/_rels/workbook.xml.rels 1393 01-01-1980 00:00 xl/workbook.xml 383 01-01-1980 00:00 xl/sharedStrings.xml 7646 01-01-1980 00:00 xl/theme/theme1.xml 2285 01-01-1980 00:00 xl/styles.xml 572078353 01-01-1980 00:00 xl/worksheets/sheet1.xml 38292 01-01-1980 00:00 docProps/thumbnail.jpeg 792 01-01-1980 00:00 docProps/app.xml 168383066 01-01-1980 00:00 xl/calcChain.xml 593 01-01-1980 00:00 docProps/core.xml --------- ------- 740515715 12 files ``` ### But, it worked perfectly in Firefox - approx 2 minutes. (No browser extensions enabled in both chrome and firefox)
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#792
No description provided.