XLSX.write() taking too long
#77
Open · opened 9 years ago by irfanyounas · 39 comments
Hi,
I tried writing a sheet with 3 columns and N rows; below are the times taken (on my i3 machine):

| N     | TIME (s) |
| ----- | -------- |
| 10000 | 9        |
| 20000 | 35       |
| 30000 | 90       |
| 65000 | 453      |

As the number of rows increases, the time taken by the write() function grows dramatically (it's not linear); it takes far too long for 65000 records.
Is there some other, more efficient way to do this?
I am not happy with the performance at all (granted, our first goal was to be correct). There are quite a few slow/inefficient operations that can be improved.
When I last investigated, the runtime was dominated by functions from the ZIP library (we are currently using https://github.com/Stuk/jszip), so I'm working on a new component right now (https://github.com/SheetJS/js-crc32 and https://github.com/SheetJS/js-adler32 are the first few pieces of the puzzle).
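(For the curious, the js-crc32 piece is already usable on its own; a minimal sketch, assuming the published `crc-32` npm package:)

var CRC32 = require("crc-32"); // the js-crc32 module mentioned above
console.log(CRC32.str("SheetJS")); // signed 32-bit CRC of a string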
Thanks, good work. I hope we will have an efficient solution soon.
@irfanyounas just so we are on the same page, can you share the script you used?
I am using the script given below:
src="//cdnjs.cloudflare.com/ajax/libs/xlsx/0.7.6-g/xlsx.core.min.js"
For creating the xlsx file, I followed this example:
http://sheetjs.com/demos/writexlsx.html
@irfanyounas how did you create a JS object with 65000 rows? Do you have numbers / text / dates? Are there missing cells (in the example, C2 is missing) or did you fill every cell?
@irfanyounas Here's an example writing numbers in a 3x100001 file: http://sheetjs.com/demos/100001.html
Timestamps are printed at three points in the process:
On Chrome 35 it takes 4 seconds to process a much larger test case than what you showed.
Thanks, I have 3 'text' (string) columns. The second column is empty sometimes.
I have used the following script:
@irfanyounas http://sheetjs.com/demos/write_num.html?n=1000000 is a more general demo (you can specify the number of rows via the `n` parameter). The runtime appears to be dominated by generating the source array-of-arrays (not by the process that converts it to a workbook, and not by the process that generates the XLSX file). Here are the times I see using the aforementioned page:
@irfanyounas just saw your last message -- where are you measuring the times? Your code sample doesn't show it. Also, can you put up the main page? (It looks like you shared only the worker logic.)
Thanks, I am looking into it; maybe the problem is on my side. Thanks for your help.
If you are using web workers, I recommend also timing the process in the main thread (avoiding web workers altogether). FWIW I found in some cases that transferring data to and from the worker is significantly slower than just doing the work in the main thread.
I have tested it using different kinds of data:
Scenario 1: I create a list of 20,000 records, where each record is a dict as given: {"status": "active", "name": "", "address": "sms:+45705846525"}. In this scenario all the records are the same (copies of the dict shown above).
Scenario 2: A list of 20,000 records, where each record is a dict, but each record is different.
The runtime for Scenario 1 is 4 sec, while for Scenario 2 it is around 14 sec.
My understanding is that the zip module may be taking long due to the differing records (not the same text) in Scenario 2.
Can you please try this scenario?
Here's a good example: http://sheetjs.com/demos/write_text.html?n=8000. The length of the data in the B column increases as you go down. I do see a massive performance issue.
Yes, and even if we keep the length of the data in the B column fixed, the performance issue will still be there when the data is different; e.g., in my case the data looks like 'sms:+45704055659', 'sms:+457683738392', 'sms:+457049821521', and so on.
I did some more testing by adding some timing statements (simple `console.log(new Date().getTime())` calls all over the place) and about 90% of the runtime is in the `zip.generate` operation, which suggests that it should be replaced.
Yes, that's right. Maybe we can try different zip modules and compare the performance. Thanks
@SheetJSDev Not to divert you from the main topic, but still: it's easier to type `console.time` and `console.timeEnd` instead of an explicit `console.log(new Date().getTime())`.
@Mithgol this is really neat! I think the MDN article needs to be updated: https://developer.mozilla.org/en-US/docs/Web/API/console.time claims that IE11 supports it, but https://developer.mozilla.org/en-US/docs/Web/API/console.timeEnd has a `?` for IE.
MDN is a wiki. If you think that it needs to be updated, update it.
(I would have updated it myself, but for some reason I'm on Windows XP currently, and IE11 is not installable on such an old OS.)
@Mithgol updated :) IE11 actually shows the elapsed time down to the 100-nsec level (milliseconds + 4 decimal places)
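(For reference, a minimal sketch of timing the write with these calls, assuming an existing workbook `wb`:)

console.time("xlsx-write");
var out = XLSX.write(wb, { bookType: "xlsx", type: "binary" });
console.timeEnd("xlsx-write"); // logs the elapsed time, e.g. "xlsx-write: 4200ms"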
When I export a large number of rows in Firefox I get a dialog saying something like "A script is broken or doesn't respond. [Debug Script] [Stop Script] [Continue Script]".
No matter how fast the export gets, it will always be possible to trigger this message by exporting even more data.
If Firefox gets some time slices for processing by adding `window.setTimeout(function(){...}, milliseconds);` it should solve the issue. I know that it adds time to the export, but as a user I understand that exporting this much data is a long-running task. If I show the user a message saying that an export is in progress, he knows why the UI is blocked. It would be even better if it were possible to cancel a running export.
@sandroboehme you can perform the write process in a Web Worker (which wouldn't lock up the UI). For example, http://oss.sheetjs.com/js-xlsx/ (it's hosted from the gh-pages branch) uses a web worker to read data.
The main reader demo (http://oss.sheetjs.com/, source https://github.com/SheetJS/SheetJS.github.io) shows a spinner when the read process may take a while.
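(A minimal sketch of the worker approach; the CDN URL and the `type: "array"` output reflect current builds and are assumptions, not the 2014-era API:)

// worker.js -- write off the main thread so the UI stays responsive
importScripts("https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js");
onmessage = function(e) {
  var ws = XLSX.utils.aoa_to_sheet(e.data); // e.data: array of arrays from the page
  var wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, "Sheet1");
  var buf = XLSX.write(wb, { bookType: "xlsx", type: "array" }); // ArrayBuffer
  postMessage(buf, [buf]); // transfer the buffer instead of copying it
};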
Regarding XLSX write: I have debugged the code and found that inside the function `write_ws_xml_cell()`, the following line is creating the performance issue:
`v = writetag('v', ''+get_sst_id(opts.Strings, cell.v));`
Specifically, the following call: `get_sst_id(opts.Strings, cell.v)`
@irfanyounas That function performs a scan in the shared string table. That behavior can be disabled by setting the option `bookSST` to false.
I agree that a linear scan is horrible here, but from my testing, it contributes less than 5% to the runtime at 20K rows. It does, however, explain the superlinear runtime.
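(A minimal sketch, assuming an existing workbook `wb`:)

// inline strings are written instead of a shared string table, so no SST scan
var out = XLSX.write(wb, { bookType: "xlsx", type: "binary", bookSST: false });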
If the text data is the same in all rows then it's true that it will be fast, but if you have unique data in each row then the performance will be poor.
For example, you can test it by creating data using the following code:
var data1 = [];
while (data1.length < 20000) {
  var randomnumber = Math.floor(Math.random() * (799999999 - 700000000 + 1)) + 700000000;
  var sms = "sms:+45" + randomnumber;
  data1[data1.length] = ["active", "abc", sms];
}
Thanks. I was wondering: what if we don't perform a scan in the shared string table? Does it affect the compressed file size?
@irfanyounas The original problem is that some readers (notably iOS Numbers) had issues with inline strings. The SST was the only way for strings to be displayed. Testing it now, it appears that the newest version of Numbers does support inline strings, so it might make sense to recommend inline strings.
Adding the shared string table is larger for files that don't have repeated elements (for example, with tables that have text labels and numeric bodies) but smaller for files with lots of repeated strings (for example, a table where one column describes a category).
The mystery here (which explains the difference in our understanding) is that as the size of the data increases, the zip time (as a percentage of the total write process) gets progressively worse. Basically, there are two effects: the effect of the linear scan and the effect of having to write a new file in the zip. To see this, add a `console.time("zip")` just before the switch statement in write_zip and a `console.timeEnd("zip")` in your script. I generate data this way; this ensures v8 doesn't perform some magic and ensures we have distinct strings.
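(The generator snippet itself was not preserved in this copy of the thread; a hypothetical sketch consistent with the description, where `w` is the size parameter referenced below:)

var w = 20000; // hypothetical size parameter (the "w" referenced below)
var data = [];
for (var i = 0; i < w; ++i) {
  // concatenation forces distinct string objects, so v8 cannot intern/share them
  data.push(["row" + i, "text" + (i * 7), "sms:+45" + (700000000 + i)]);
}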
For small `w`, the zip operations take less than half the total runtime, but as `w` is increased the zip operations almost completely dominate the runtime (in both cases). According to the v8 profile (you can get this in node by running with the `--prof` flag and then using the tick processor), the function that takes the most time is the CRC-32 function.
Thanks, that's right. You are doing really good work.
I'm having this issue too. Is there any form of the xls spec that doesn't require zipping? That might be a workaround till the zip issue is taken care of.
The jszip API offers the ability to not compress the zip file. Would that help?
https://stuk.github.io/jszip/documentation/api_jszip/generate.html
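(SheetJS itself exposes a `compression` write option, as noted later in this thread; a sketch of leaving entries uncompressed, assuming `wb` exists:)

// with compression off, zip entries are STOREd: larger file, no deflate time
var out = XLSX.write(wb, { bookType: "xlsx", type: "binary", compression: false });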
31 columns and 300K rows from an API; json_to_sheet / writeFile breaks on this. Please see if this can be addressed.
@kumarr10 live demo https://jsfiddle.net/xgr4sbk1/
Using 0.18.3 in Chrome on a 2018 MacBook Pro, it takes ~13 seconds to generate a fully populated 300K x 31 worksheet. You can play with the NR / NC variables. The new limit is the string length cap when generating the worksheet XML.
Hi All,
I struggled to export a large amount of data to XLSX. The data is 100 columns x 100K rows of JSON. The browser either freezes or you get an array length error from hitting the V8 string length limitation in Chrome.
The following is how I resolved this; hopefully it will shed some light for those having similar issues.
Example: JSON data =
[
  {a: 111, b: 333, c: 444},
  {a: 111, b: 333, c: 444}
]
Convert to data =
[
  [111, 333, 444],
  [111, 333, 444]
]
Then write with the following options:
type: "binary"
bookSST: true
compression: true
These are all based on my understanding of the code and various help from the developers when I raised issues to them. They have been a great help. Hopefully this will help others encountering the same problem.
Thanks.
Anand Muthu
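(A minimal sketch of the flow described above, using the toy data from the comment:)

var data = [[111, 333, 444], [111, 333, 444]]; // array-of-arrays form
var ws = XLSX.utils.aoa_to_sheet(data);
var wb = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(wb, ws, "Sheet1");
var out = XLSX.write(wb, { type: "binary", bookSST: true, compression: true });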
Thanks for your explanations @anandmuthu1978!
Thanks for sharing @anandmuthu1978! `bookSST: true` is helpful for files containing a large number of repeated strings (text labels for each row, for example) and `compression: true` obviously reduces the file size, but the tradeoff is increased runtime. Using `type: "array"` generates a `Uint8Array` rather than a string (which should be more efficient).
@anandmuthu1978 hello
Could you give me an example of how you managed to write a large number of lines?
With my code, I can save up to 60k lines. With 70k, it saves only 15 lines and doesn't show any error; I couldn't understand why.
Thanks
@goproclube
Hi.
In my case I converted the data from JSON to an array and used aoa_to_sheet(data).
Most likely in your case 70k is already hitting the memory limit. I noticed json_to_sheet takes a lot of memory, so I converted the data to array format. You lose the key-to-column mapping, but as long as you want all the columns it doesn't really matter.
Example: JSON data =
[
  {a: 111, b: 333, c: 444},
  {a: 111, b: 333, c: 444}
]
Convert to data =
[
  [111, 333, 444],
  [111, 333, 444]
]
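(A sketch of that conversion; the `jsonToAoa` helper is illustrative, not part of SheetJS:)

// flatten objects into rows, dropping the key-to-column mapping
function jsonToAoa(json) {
  var keys = Object.keys(json[0]);
  return json.map(function(row) {
    return keys.map(function(k) { return row[k]; });
  });
}
var aoa = jsonToAoa([{a: 111, b: 333, c: 444}, {a: 111, b: 333, c: 444}]); // [[111,333,444],[111,333,444]]
var ws = XLSX.utils.aoa_to_sheet(aoa);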
@SheetJSDev is it possible, when writing big files to .xlsx, to somehow append worksheets together?
I mean, imagine I have 700k rows and my CHUNK_SIZE = 350k. I create a worksheet, append it to a workbook, pass it to XLSX.write, and get a typed array back. On my next step I repeat the same steps, but now with another 350k piece of data. At the end I get 2 typed arrays. My question is: is it possible to merge the worksheets in order to get them into one file? If so, we would be able to pass chunked data to xlsx and the zipping manipulations would not take so long.
Thanks, anandmuthu1978.
I faced the performance issue using json_to_sheet; the problem was resolved by using aoa_to_sheet.
Sample code below for reference:
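(The referenced sample did not survive this copy of the thread; a minimal sketch of the aoa_to_sheet approach, with a hypothetical `rows` source array:)

var rows = [{ status: "active", name: "abc", address: "sms:+45705846525" }]; // hypothetical source objects
var aoa = rows.map(function(r) { return [r.status, r.name, r.address]; });
var ws = XLSX.utils.aoa_to_sheet(aoa);
var wb = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(wb, ws, "Export");
XLSX.writeFile(wb, "export.xlsx"); // writes the file in browser or Node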