Large .xlsx file not getting .Sheets #792
Reference: sheetjs/sheetjs#792
Hello,
I am using the newest version of the xlsx.js library on the client side (browser). It works for reading xlsx files and rendering them to an HTML table for smaller files. Now I have a larger xlsx file with 1 million rows. I am getting the `SheetNames` attribute, but not the `.Sheets` object. It is empty. A smaller subset of this file reads perfectly.
What could be the error?
Thank you very much!
Update: I used the `WTF: true` read option and received the following error:

I have the same problem. I'm trying to process an XLSX file (179,772 rows, 38.1 MB) and `.Sheets` is just empty.

Edit: Tried the 0.10.8 and 0.11.4 versions on the server side.
@sigod it's surpassing a JS engine limit. XLSX files are ZIP containers, and the worksheet data is stored as XML text files within the ZIP. Rename to .zip and extract and you'll see the file:
The current V8 string size limit (which applies to node and chrome) is `(1<<28) - 16 = 268435440` bytes and the file is significantly larger than that.

The good news is that the limit is being increased in the future: https://v8project.blogspot.com/2017/09/v8-release-62.html
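For reference, the limits discussed here work out as follows. The new 64-bit bound of `2**30 - 25` characters is taken from the linked V8 6.2 release post:

```javascript
// Old V8 string length limit quoted above (pre-6.2).
const oldLimit = (1 << 28) - 16;
console.log(oldLimit); // 268435440 (~256 MB)

// New 64-bit limit announced in the V8 6.2 release post.
const newLimit = 2 ** 30 - 25;
console.log(newLimit); // 1073741799 (~1 GB)

// A ~572 MB sheet XML (a size mentioned later in this thread) overflows
// the old bound but fits under the new one.
const sheetXmlBytes = 572 * 1024 * 1024;
console.log(sheetXmlBytes > oldLimit, sheetXmlBytes < newLimit); // true true
```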
@SheetJSDev Is there any good way to work around this issue?
Even with the WTF option, I get no errors, and it makes up "Sheet1" where the file actually contains many sheets, none of which is named Sheet1.
```
File(265892023) {
  name: "Contract_Insight_XXXX.xlsm",
  size: 265892023,
  type: "application/vnd.ms-excel.sheet.macroenabled.12",
  lastModified: 1520893001000,
  lastModifiedDate: Mon Mar 12 2018 15:16:41 GMT-0700 (PDT),
  webkitRelativePath: ""
}

demo.js:246 {
  "SheetNames": [
    "Sheet1"
  ],
  "Sheets": {
    "Sheet1": {}
  }
}
```
I'm afraid the string limit increase won't help, as this is 10X larger than the file @sigod linked to.
If my need is only to read the data itself (no cell metadata), does the Pro version offer a streaming solution? @SheetJSDev
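As an aside, the empty-`Sheets` symptom logged above can be detected programmatically. A minimal sketch, where `wb` mirrors the logged output rather than a real parse result, and the check assumes SheetJS's convention that sheet-level metadata keys start with `!`:

```javascript
// `wb` mimics the logged workbook: names present, but no cell data.
const wb = { SheetNames: ["Sheet1"], Sheets: { Sheet1: {} } };

// A sheet is "empty" if it has no keys besides "!"-prefixed metadata
// (e.g. "!ref", "!merges") -- here it has no keys at all.
const emptySheets = wb.SheetNames.filter((name) =>
  Object.keys(wb.Sheets[name] || {}).every((k) => k[0] === "!")
);
console.log(emptySheets); // ["Sheet1"]
```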
Is it possible to store the entire file in a typed array instead of a string? In Chrome the maximum buffer length is `2 ** 31 - 2 ** 12 = 2147479552` bytes, and Firefox's is double that. That should be more than enough.

I didn't check the sources thoroughly yet, and I don't have time to work on this myself right now, but from what I understand you are using `JSZipSync` to first unzip the xlsx file. The sheets are then accessible in the object (`obj`) returned by `JSZipSync` as `obj["files"]["xl/worksheets/sheetN.xml"]`, which is actually yet another compressed file. Calling the `.getContent()` method of this object will return a Uint8Array representing the xml file.

So theoretically, all it takes to support huge files (and to have a place to fire progress events) is to hook a stream XML parser here (it seems there are many out there), and to read that xml file in chunks (which can be done efficiently in a browser using a TextDecoder).

Am I missing something else here that would prevent implementing this much-needed feature?
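The chunked-TextDecoder idea can be sketched in isolation. The toy regex below stands in for a real streaming XML parser and only extracts `<v>` elements (the SpreadsheetML element holding cell values); everything else here is illustrative, not SheetJS code:

```javascript
// Decode a worksheet XML Uint8Array in pieces with TextDecoder's
// { stream: true } mode, yielding <v>...</v> cell values as they complete.
function* streamCellValues(chunks) {
  const decoder = new TextDecoder("utf-8");
  let buffer = "";
  for (const chunk of chunks) {
    buffer += decoder.decode(chunk, { stream: true });
    const re = /<v>([^<]*)<\/v>/g;
    let m, lastEnd = 0;
    while ((m = re.exec(buffer)) !== null) {
      yield m[1];
      lastEnd = m.index + m[0].length;
    }
    buffer = buffer.slice(lastEnd); // keep any partial element for next chunk
  }
}

// Tiny demo: a sheet XML fragment split so a <v> element straddles the
// chunk boundary.
const xml = '<row r="1"><c r="A1"><v>42</v></c><c r="B1"><v>7</v></c></row>';
const bytes = new TextEncoder().encode(xml);
const chunks = [bytes.slice(0, 26), bytes.slice(26)];
console.log([...streamCellValues(chunks)]); // ["42", "7"]
```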
Today was a holiday here and I had time to make a simple proof of concept of my idea in https://github.com/SheetJS/sheetjs/issues/792#issuecomment-785733626
You can see it at https://xml-stream-parsing-poc.glitch.me/
Sorry, it was written very quickly since I lack time, and my xlsx parser may make your eyes pop out of their sockets (I really just wanted to somehow log the values in my sheet), but it at least shows that it should be possible to implement such an xml stream parser in your library too.
Using that, I was able to parse 60MB+ files in Chrome, while SheetJS fails to read these.
The parsing speed obviously suffers from this streaming method, but I believe a slower parser is preferable to not being able to parse some files at all, especially since this allows adding some kind of progress event.
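The progress-event point is easy to illustrate independently of any parser. A hypothetical `processWithProgress` helper (not part of SheetJS) reports the fraction of bytes consumed after each chunk:

```javascript
// Report progress as each chunk of the file is handed to a parser.
function processWithProgress(chunks, totalBytes, onProgress) {
  let done = 0;
  for (const chunk of chunks) {
    // ... hand the chunk to the streaming XML parser here ...
    done += chunk.length;
    onProgress(done / totalBytes);
  }
}

const data = new Uint8Array(100);
const parts = [data.slice(0, 40), data.slice(40)];
const seen = [];
processWithProgress(parts, data.length, (p) => seen.push(p));
console.log(seen); // [0.4, 1]
```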
@sigod so a full read takes 38 seconds currently, which is awful, but it immediately addresses @jensJJ's original error. V8 lifted the string limit 🥳 @idibidiart it is possible that your file works too, although it may take a while to test.
@pckhoi that is definitely doable and interesting in general. At this point, text processing is a very significant part of the runtime profile.
@Kaiido that is very cool!
Rewinding to the beginning, streaming is not currently supported because of limitations at multiple levels:

- Because of how the ZIP (XLSX/XLSB) and CFB (XLS) containers work, you have to find the directory to properly determine where the subfiles are located. The ZIP EOCD record is at the end of the logical zip portion of the file, while the CFB directory can be anywhere (curiously, the pointer is in the file header). That means a correct streaming implementation would have to buffer the entire file.
- XLSX/XLSB should parse files in a very specific order (generally starting from `[Content_Types].xml` or `_rels/.rels` to determine the main workbook file), and the container structure does not guarantee the ordering of the subfiles within the ZIP. In fact, when the library writes XLSX and XLSB files, those are written at the very end.
- XLSX/XLSB and XLS worksheets write interesting properties, including merge cell locations, after the sheet data, so a streaming approach necessarily loses critical metadata. For example, the XLSB writer has comments for each part: `CELLTABLE` is the actual cell data, and the metadata that shows up before the cell data pertains to column properties, general worksheet formatting and window properties.

In parallel to "improving the sync performance", it's worth investigating the event/stream approach. It would be interesting to see if using an incremental deflate approach (off-thread unzipping) would improve throughput.
Need help.
I have the same error message as jensJJ had.
Tried to parse the `calendar_stress_test` file (74 MB, 1 million rows): https://github.com/SheetJS/sheetjs/issues/61#issuecomment-47676371
Converted it to xlsx.
Got this error.
If I'm not mistaken, the issue with the V8 max string length is obsolete and the size of sheet1 (572 MB) should be fine?
But it worked perfectly in Firefox - approx 2 minutes.
(No browser extensions enabled in either Chrome or Firefox)