Support large excel files #61

Open
opened 2014-05-16 00:45:28 +00:00 by oliverzy · 29 comments
oliverzy commented 2014-05-16 00:45:28 +00:00 (Migrated from github.com)

Currently this module loads the entire Excel file into memory and builds the JS model, but that doesn't work for large files because of memory limitations.

Is it possible to provide a stream-like API so that we can handle only part of the data at a time?

SheetJSDev commented 2014-05-16 01:05:04 +00:00 (Migrated from github.com)

It's certainly "possible". Here are the roadblocks:

  1. AFAICT the zip library (https://github.com/Stuk/jszip/) does not support the Node streams API. That will need to be replaced.

  2. The static interface will need to be replaced with some event-driven one (next Sheet, next Cell, etc.) if the in-memory object is too large.

  3. The converters will need to be reimplemented using the new event-driven interface.

On a side note, have you tried saving the file as XLSB? IME it's about half the size of the equivalent XLSX/XLSM.

hmalphettes commented 2014-06-27 02:52:06 +00:00 (Migrated from github.com)

I am interested in this too.

Going through XLSB will help with regard to the zip library, but what I am most interested in is number 2, to avoid building the parsed object in memory: nextSheet, nextRow is the interface I would use.
Would you guys need more granularity: nextCell, nextComment, etc.?

Also @SheetJSDev do you intend to add callbacks to the interface for async processing of the parsed data?

SheetJSDev commented 2014-06-27 12:11:16 +00:00 (Migrated from github.com)

@hmalphettes I used the phrase 'event-driven' but yes that will involve callback functions.

We'd have to look through the test suite, but my initial concern is that certain data won't be available until you process the entire file (for example, if dimensions are not included in the sheet, the parser keeps track of the first and last seen column as well as the first and last seen row when calculating the range for the worksheet). Comments, merge cells, and other metadata could potentially be stored after the actual sheet data (although we'd have to dig into some test cases to see what writers do).

hmalphettes commented 2014-07-01 01:08:26 +00:00 (Migrated from github.com)

OK @SheetJSDev.

antonabramovich commented 2014-07-01 13:58:48 +00:00 (Migrated from github.com)

@SheetJSDev, it would be nice to figure out what the max supported file size is today.

SheetJSDev commented 2014-07-01 16:12:14 +00:00 (Migrated from github.com)

@antonama it's a tricky question because XLSX/XLSM/XLSB are zip files and some very large sheets can be stored in a small file (and vice versa -- a small string-heavy sheet will appear to be big).

For example, the command line tool xlsx (install the xlsx module in nodejs: http://npm.im/xlsx) can "handle" the 43MB calendar stress test (https://github.com/SheetJS/test_files/blob/master/calendar_stress_test.xlsb?raw=true), with cells from A1 to H1048576. This file has only a few styles and no charts or other jazz, which arguably is the best-case scenario.

It "handles" the file insofar as it took a full minute on my computer. YMMV. IE6 seems to crash on files with more than 30 cells (it appears that IE6 tries to warn about a slow-running script, but that is thwarted by the fact that the script locks up the browser while processing)

Due to github's file size limitation, I don't have files larger than 50MB in the test suite.

antonabramovich commented 2014-07-02 13:32:36 +00:00 (Migrated from github.com)

@SheetJSDev, your file sizes are really impressive. I'm trying to convert a file ~2MB in size and I get blank output. And I don't have any other 'jazz' except raw text. Can you guess why?

SheetJSDev commented 2014-07-02 13:38:16 +00:00 (Migrated from github.com)

@antonama can you share that file (either email me directly -- sheetjs at the google email service -- or put the file up in a public location) and describe your conditions (OS, browser version, webpage/script) ?

antonabramovich commented 2014-07-02 14:36:46 +00:00 (Migrated from github.com)

@SheetJSDev, hmmm, I'm really sorry. I've just noticed macros in the file. Could that be the issue?

SheetJSDev commented 2014-07-02 14:40:54 +00:00 (Migrated from github.com)

I doubt macros are the issue.

If you are seeing a blank screen (no output) can you check the JavaScript console? In Chrome, hit Control-Shift-J on Windows / Command-Option-J on Mac.

nshopik commented 2014-09-17 12:29:18 +00:00 (Migrated from github.com)

Is a large document supposed to crash the tab process? I have a 4MB XLSX converted from a 17MB XLS file; after 15 seconds of processing the Chrome tab crashes.

SheetJSDev commented 2014-09-17 21:12:14 +00:00 (Migrated from github.com)

Could you share a file? If not, can you at least share how many cells are in the sheet?

Also, can you check if the node module (https://www.npmjs.org/package/xlsx) successfully handles the file? If you npm install -g xlsx then you can run xlsx your_file.xlsx.

nshopik commented 2014-09-18 08:07:20 +00:00 (Migrated from github.com)

Sure, here it is: https://yadi.sk/d/gB5Qe8yMbXosL
The original XLS is parsed correctly, but when I convert it to XLSX it always crashes the tab.

SheetJSDev commented 2014-09-18 14:49:29 +00:00 (Migrated from github.com)

The website uses the gh-pages branch. Can you check against the master branch? To do this:

$ git clone https://github.com/SheetJS/js-xlsx
$ cd js-xlsx
$ python -mSimpleHTTPServer

Then go to http://localhost:8000 and try to repeat the test. I can reproduce the problem on http://oss.sheetjs.com/js-xlsx (which uses the gh-pages branch), but this oddly works in the master branch.

nshopik commented 2014-09-19 10:16:07 +00:00 (Migrated from github.com)

I thought gh-pages was running on master. Anyway, the master branch works correctly for this file.

But I tried another, even more complex file (9.2MB), and this time the master branch crashes the process too. It's a more complex file with 126 sheets, with different numbers of rows per sheet. I can share it privately if you let me know how.

SheetJSDev commented 2014-09-19 14:00:01 +00:00 (Migrated from github.com)

You can email it to the gmail account listed at https://github.com/SheetJS or to the email address in the git log (the commit email is hosted on my own server directly)

nshopik commented 2014-09-19 14:32:54 +00:00 (Migrated from github.com)

So it appears disabling "Web Workers" helps with larger files, and from what I can see it's even a little bit faster.

osher commented 2016-01-30 12:11:06 +00:00 (Migrated from github.com)

I could offer a different API, if it makes things simpler; it should not require a heavy rewrite:

The options may accept an event emitter (or any object with the method "emit(name, ...args)").
When such an emitter is provided, the parser should emit events in the following spirit:

  • 'doc-open' - should pass the file path, when the doc is found valid and is being passed to parsing.
  • 'sheet-start' - should pass the file path, name of sheet, and index of sheet, when a sheet is being opened.
  • 'row' - should pass the row, on top of the args to sheet-start, when a row in the sheet has been fully read. (*) May not contain calculated values.
  • 'sheet-end' - should pass the sheet and the sheet index when the sheet is concluded.
  • 'doc-end' - should pass the doc. All values in this stage should be fully populated.

I'd recommend that all events be passed a context reference on which the passed arguments can be found.
A context, depending on the event it fires for, may contain attributes (or getters for entries) such as:

  • ctx.currentSheet
  • ctx.currentSheetIndex
  • ctx.currentRow
  • ctx.currentRowIndex
  • ctx.book

The only thing I cannot consider in full by myself is the implication of calculated fields.
I know that calculated cells in a given sheet may pull values from preceding sheets...

For this I could offer:

  • 'sheet-calculated' - should emit when the sheet is fully calculated (all data on which it depends has been loaded).
    However, this is obviously an advanced feature and may not be included in the first release of this feature.
    I have a feeling that a big workbook with inter-sheet formulas is an edge case.

What do you think?
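
For illustration, a minimal sketch of what calling code under this proposal might look like; the emitter option, the event names, and the ctx fields are only the suggestion above, not an existing xlsx API, and 'big.xlsx' is a placeholder:

var EventEmitter = require('events').EventEmitter;
var XLSX = require('xlsx');

/* user code: process a single row as it is emitted */
function handleRow(rowIndex, row) { console.log(rowIndex, row); }

var emitter = new EventEmitter();
emitter.on('sheet-start', function (ctx) {
  console.log('sheet', ctx.currentSheetIndex, ctx.currentSheet);
});
emitter.on('row', function (ctx) {
  handleRow(ctx.currentRowIndex, ctx.currentRow);
});
emitter.on('doc-end', function (ctx) {
  console.log('done; sheets:', ctx.book && ctx.book.SheetNames);
});

/* hypothetical option: neither `emitter` nor these events exist in the current xlsx API */
XLSX.readFile('big.xlsx', { emitter: emitter });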

SheetJSDev commented 2017-04-13 20:57:50 +00:00 (Migrated from github.com)

We've started addressing the general performance problem in node:

  • internally, instead of the address keys, worksheets are built up as arrays of arrays of cell objects. This behavior is controlled with the undocumented dense option. The read and write utilities know how to handle both types of worksheets, so if you are just translating formats you can enable it in your read interface, but raw manipulation requires a slightly different access pattern. The test suite uses a small helper function (https://github.com/SheetJS/js-xlsx/blob/master/test.js#L253-L257):
/* ws may be a dense (array-of-arrays) or sparse (address-keyed) worksheet; X is the xlsx module */
function get_cell(ws/*:Worksheet*/, addr/*:string*/) {
	if(!Array.isArray(ws)) return ws[addr];
	var a = X.utils.decode_cell(addr);
	return (ws[a.r]||[])[a.c];
}
  • the XLSX object now has a streaming CSV write function XLSX.stream.to_csv which takes the same parameters as the normal XLSX.utils.sheet_to_csv. Instead of building the whole CSV in memory, it builds one line at a time and passes the data.
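
A minimal sketch combining the two points above (file names are placeholders): read with the dense option, then stream the CSV out one line at a time instead of building the whole string in memory.

var fs = require('fs');
var XLSX = require('xlsx');

/* dense: worksheets are parsed as arrays of arrays instead of address-keyed objects */
var wb = XLSX.readFile('large.xlsx', { dense: true });
var ws = wb.Sheets[wb.SheetNames[0]];

/* XLSX.stream.to_csv emits one CSV line at a time */
XLSX.stream.to_csv(ws).pipe(fs.createWriteStream('large.csv'));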

Besides the general issue of non-conformant files, there are more hurdles preventing a true streaming read solution:

  1. you actually have to buffer the entire file before processing. For XLSX/XLSB the zip container metadata is at the end of the file (you have to read till the end anyway), so most "streaming" solutions buffer the file anyway

  2. lots of metadata is stored at the end of the worksheet streams. Merge cell metadata is a good example. You have to jump to the end of the worksheet XML/binary to find the part encoding the merge ranges, so the entire file has to be held in memory

  3. the dimension data reported in the worksheet may disagree with the true worksheet size, so you'd have to scan the whole thing to figure out the true worksheet size.

Now, it's possible to conserve memory at the expense of time by performing two passes of the worksheets: collect relevant metadata in the first pass, then emit cell/row events in the second pass. Most formats store in row-major order so it's fairly straightforward to implement (some Quattro Pro formats use column-major order but those aren't show-stoppers).

@osher the original stumbling block was the ZIP/CFB container format. You basically have to read the whole file into memory to start processing. If that behavior is acceptable, and if it's acceptable to limit the feature set, we could easily patch in support for streaming read

the-spyke commented 2017-07-04 11:58:54 +00:00 (Migrated from github.com)

jszip now supports streaming. In my tests it consumes even more memory than creating the document itself.

edbrannin commented 2021-03-17 19:24:36 +00:00 (Migrated from github.com)

FWIW, I've experienced much-worse-than-linear performance on very large spreadsheets even stored in plaintext SpreadsheetML (bookType: xlml), so I don't think this can be pinned just on the Zip library.

The spreadsheet I was filling out was pretty sparse, but it had columns going out to GB in Excel. I had a total of 100k rows to write, and I timed how long it would take to write all files when I split them into evenly-balanced chunks:

  • 12 chunks (about 9k rows per file) all completed in ~30 seconds
  • 9 chunks (about 11k rows per file) all completed in ~34 seconds
  • 6 chunks (about 16k rows per file) all completed in ~37 seconds
  • 3 chunks (about 33k rows per file) all completed in ~50 seconds
  • One chunk of 100k records exceeded the 90 seconds I gave it in a benchmarking setting.
    • When I tried that amount in my application, it churned for hours and never completed.
    • (The application is now working fine with output files split by max 20k rows)

(This was on node 12.20, via node-xlsx: https://www.npmjs.com/package/node-xlsx)
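
For reference, a rough sketch of the chunking workaround described above, assuming an array of row objects already in memory; the function name, file naming scheme, and chunk size are made up for illustration:

var XLSX = require('xlsx');

/* write `rows` (an array of row objects) into several SpreadsheetML files of at most chunkSize rows each */
function writeInChunks(rows, baseName, chunkSize) {
  for (var i = 0; i < rows.length; i += chunkSize) {
    var ws = XLSX.utils.json_to_sheet(rows.slice(i, i + chunkSize));
    var wb = XLSX.utils.book_new();
    XLSX.utils.book_append_sheet(wb, ws, 'Data');
    XLSX.writeFile(wb, baseName + '-' + (i / chunkSize + 1) + '.xls', { bookType: 'xlml' });
  }
}

/* e.g. writeInChunks(allRows, 'output', 20000); */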

UzairAhmad-1 commented 2021-10-28 10:02:08 +00:00 (Migrated from github.com)

@SheetJSDev Is it possible to read an Excel file with 1M rows where the file size is around 220MB? How can I read such a file?

kisaragi99 commented 2021-12-15 15:28:06 +00:00 (Migrated from github.com)

I had a case where a browser extension caused infinite parsing (the sheet_to_json method was not returning data).
When I disabled it, parsing of the file completed successfully.

Browser: Chrome
Extension: Metamask
File: 500k rows, 52MB

Smaller files were parsed successfully even with the Metamask extension enabled.
I have no time to run more tests.

Qwqw127 commented 2022-01-06 09:21:54 +00:00 (Migrated from github.com)

currently this module will load the excel file into memory and build the js model, but it doesn't work for large file because memory limitation.

Is it possible to provide a stream like api so that we can handle only part of data at one time?

Hello, how do you deal with this problem? I came across this too.

davidfoxio commented 2022-01-22 18:52:08 +00:00 (Migrated from github.com)

I'm getting the following error from a JavaScript script that I'm trying to run:

Error: Cannot create a string longer than 0x1fffffe8 characters
at ReadStream.emit (events.js:388:22) {
  code: 'ERR_STRING_TOO_LONG'
}

Here is the code I'm trying to run:

var fs = require('fs');
var XLSX = require('xlsx');

var parseData = async () => {
  var file = fs.createReadStream('./data/data.ods');
  function process_RS(stream, cb) {
    var buffers = [];
    stream.on('data', function (data) {
      buffers.push(data);
    });
    stream.on('end', function () {
      var buffer = Buffer.concat(buffers);
      var workbook = XLSX.read(buffer, { type: 'buffer' });

      /* DO SOMETHING WITH workbook IN THE CALLBACK */
      cb(workbook);
    });
  }

  process_RS(file, function (workbook) {
    /* DO SOMETHING WITH workbook */
    console.log(workbook.SheetNames);
  });
};

parseData();

Would really appreciate some help with this, thanks

kumarr10 commented 2022-02-09 21:12:20 +00:00 (Migrated from github.com)

I am trying to use the json_to_sheet API to create an Excel file on the client side/browser (I return 300K rows from a backend API). json_to_sheet seems to error out with this much data. Any help or alternative? Please help.

let jsonData = data.Table;
let ws = XLSX.utils.json_to_sheet(jsonData);
var wb = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(wb, ws);
XLSX.writeFile(wb, filename + '.xlsx');

Though I have created an issue for this, I thought the active community could help if they have already resolved it for themselves.

devgrigor commented 2022-02-28 13:14:22 +00:00 (Migrated from github.com)

Small suggestion about this. I was looking into the source code and augmented it to write the result of the parsed XML data to a file instead of a variable. This way there is no memory issue, although the final result must be rendered as well.

s[tag.r] = p;

This line can be replaced by a loop that writes to a file if the result is larger than a given number. I didn't want to open a PR with the feature because I ended up using the files without parsing them into a final result, but maybe there is a more generic solution that could be implemented in the library at some point? Something like a useFiles flag combined with a recordsPerFile number, or something similar.

SheetJSDev commented 2022-02-28 18:14:46 +00:00 (Migrated from github.com)

@davidfoxio ODS files are ZIP files under the hood and the library tries to decompress the content.xml entry before processing. The underlying file size is probably larger than 0x1fffffe8 characters, hence the error (this is a V8 limit)

@devgrigor The general approach would use a callback or iterator or stream, generating X rows at a time and allowing the end code to process. This would avoid building up the entire worksheet in memory. It works well for simple row-level operations like generating SQL statements or CSV rows but does not have the flexibility of the in-memory object approach
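
Until such an interface exists, a rough sketch of approximating row-level processing against an already-parsed worksheet: convert it in fixed-size windows with sheet_to_json and its range option. Note this still requires the full parsed worksheet in memory (exactly the limitation discussed above), and the file name and window size here are placeholders.

var XLSX = require('xlsx');

var wb = XLSX.readFile('big.xlsx');
var ws = wb.Sheets[wb.SheetNames[0]];
var range = XLSX.utils.decode_range(ws['!ref']);

var WINDOW = 1000; /* rows per batch */
for (var start = range.s.r; start <= range.e.r; start += WINDOW) {
  var end = Math.min(start + WINDOW - 1, range.e.r);
  var sub = { s: { r: start, c: range.s.c }, e: { r: end, c: range.e.c } };
  /* header: 1 returns each row as an array of cell values */
  var rows = XLSX.utils.sheet_to_json(ws, { range: XLSX.utils.encode_range(sub), header: 1 });
  /* process `rows` here, e.g. emit CSV lines or SQL statements */
}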

devgrigor commented 2022-03-01 06:19:52 +00:00 (Migrated from github.com)

@SheetJSDev I am afraid the file does not exceed the limit. The file I used was 55 MB. And it didn't throw an error, it just stopped working once it reached row 400231 :) When I opened the task manager it showed that the node process was using 2.7GB of RAM. I believe it was waiting for memory to become available before continuing. This was a simple node process (an empty JS file with nothing but the XLSX read operation). Basically the code was the following:

    let buf;
    await new Promise((resolve, reject) => {
        fs.readFile(inputFilePath, (err, data) => {
            buf= data
            resolve('')
        })
    })
    // code reaches here in seconds
    const xslJson = XLSX.read(buf, {type: 'buffer'});
    // code doesn't reach here for minutes

It is about having a single variable with that size; once the underlying code was modified to use files to write down the JSON, the read operation was done in 1,2 minutes. My solution was using those files directly after the read operation was done. I was suggesting adding that solution to the library: when the useFiles flag is true it would return a list of file paths (an array of strings) containing the information parsed in JSON format, so the end developer can access each file separately and not think about overusing RAM (obviously one will be obligated to handle the file removal as well, but that will be on them rather than on the lib itself).

Reference: sheetjs/sheetjs#61