parse_dom_table function performance #1626

Open
opened 2019-09-17 10:48:21 +00:00 by ThomasChan · 3 comments
ThomasChan commented 2019-09-17 10:48:21 +00:00 (Migrated from github.com)

https://github.com/SheetJS/js-xlsx/blob/master/xlsx.js#L19061

The `merges` for loop becomes extremely large and slow when handling a large table. I tested this with a 204-column by 250-row table: without the optimization, `ws['!merges']` ends up as a huge array, and almost every item merges a single cell with itself, which is useless.

Before the optimization, the export function took 18.6 s in my test case; after it, only 4.67 s.

On my customer's client, they were exporting a 193-column by 1277-row table. The export function took 6 minutes; after the optimization, it took only 15 s.
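To picture the observation about single-cell merges, here is a minimal sketch, not the library's code. It assumes SheetJS-style range objects of the form `{s: {r, c}, e: {r, c}}`; the helper names `isSingleCell` and `dropSingleCellMerges` are hypothetical.

```javascript
// A range that starts and ends on the same cell merges nothing.
function isSingleCell(range) {
  return range.s.r === range.e.r && range.s.c === range.e.c;
}

// Keep only the ranges that actually span more than one cell.
function dropSingleCellMerges(merges) {
  return merges.filter(function (m) { return !isSingleCell(m); });
}

var merges = [
  { s: { r: 0, c: 0 }, e: { r: 0, c: 0 } }, // A1:A1, a cell merged with itself
  { s: { r: 0, c: 1 }, e: { r: 1, c: 2 } }  // B1:C2, a real merge
];
var useful = dropSingleCellMerges(merges); // only B1:C2 remains
```

Avoiding such entries when the array is built, rather than filtering afterwards, would also keep the per-cell loop over `merges` short.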

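![table](https://user-images.githubusercontent.com/5715335/65030546-4fcf5e00-d972-11e9-9b4d-6f6a3e46b9cc.png)

![before](https://user-images.githubusercontent.com/5715335/65030545-4fcf5e00-d972-11e9-8471-da8f72e6517b.png)

![after](https://user-images.githubusercontent.com/5715335/65030544-4f36c780-d972-11e9-89f7-c7e2290debfc.png)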

Code change: https://github.com/henglabs/js-xlsx/commit/febac23e8e0c1cc12efe81ef7a7d1d369b55f352
ThomasChan commented 2019-09-18 08:47:35 +00:00 (Migrated from github.com)

Well, my code change above was wrong. After working out the logic of the `merges` for loop, I made another change that reduces the time complexity from O(merges.length * n) to O(merges.length).

![Screenshot 2019-09-18 4:37:44 PM](https://user-images.githubusercontent.com/5715335/65132437-87520f00-da33-11e9-9c4f-ca52af2b81e6.png)

When exporting a 200-column by 10000-row table, the parse_dom_table function no longer runs out of memory, but the jszip utf8ToBytes function does.

SheetJSDev commented 2019-09-18 19:17:38 +00:00 (Migrated from github.com)

@ThomasChan thanks for looking into this, and feel free to submit a PR.

There's definitely room for improvement. The weird loop is done that way to address a case like:

Row 1: A1:C2 (colspan 3, rowspan 2) | D1:E2 (colspan 2, rowspan 2) | F1
Row 2: F2

The first cell in the second row should be located at F2, but to determine that you need to look at the A1:C2 merge first then the D1:E2 merge.
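The lookup described here can be sketched as follows. This is an illustration under assumptions, not the actual parse_dom_table code: ranges use the SheetJS `{s: {r, c}, e: {r, c}}` shape with zero-based indices, and `nextFreeColumn` is a hypothetical helper. The loop restarts after every hit, which is what makes it expensive when there are many merges.

```javascript
// Find the first column at or after C in row R that no merge covers.
function nextFreeColumn(merges, R, C) {
  for (var i = 0; i < merges.length; ++i) {
    var m = merges[i];
    var covers = m.s.r <= R && R <= m.e.r && m.s.c <= C && C <= m.e.c;
    if (covers) {
      C = m.e.c + 1; // jump past the merged block
      i = -1;        // restart: an earlier merge may cover the new column
    }
  }
  return C;
}

var merges = [
  { s: { r: 0, c: 0 }, e: { r: 1, c: 2 } }, // A1:C2
  { s: { r: 0, c: 3 }, e: { r: 1, c: 4 } }  // D1:E2
];
// Second row (R = 1), starting at column A (C = 0):
// A1:C2 pushes the cell to D, then D1:E2 pushes it to F (column index 5).
var col = nextFreeColumn(merges, 1, 0);
```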

The implementation was designed expecting only a small number of merges. If you have many, the approach is extremely slow.

Given the parse order, it will always be sorted by starting row then by starting column. To reduce it to a single walk through the merge array, you might be able to sort by ending row then by starting column (sorting the array with a custom sort function). Then you'd keep track of a starting index into the array (elements before that point could never affect the result, so you can skip them).
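A rough sketch of a single-walk placement, under stated assumptions. It is a variant of the idea above rather than the exact scheme: instead of sorting by ending row and tracking a starting index, it drops finished merges at the start of each row and sorts the survivors by starting column. The input shape (rows of `{v, rowspan, colspan}` objects) and the name `layoutCells` are hypothetical, and malformed tables with overlapping spans are not handled.

```javascript
// rows: array of rows; each cell is { v, rowspan, colspan } (hypothetical shape).
function layoutCells(rows) {
  var placed = []; // { r, c, v } for every cell
  var merges = []; // all multi-cell ranges, { s: {r, c}, e: {r, c} }
  var active = []; // merges from earlier rows that still reach the current row
  for (var R = 0; R < rows.length; ++R) {
    // Drop merges that ended above this row, then order the rest by start column.
    active = active.filter(function (m) { return m.e.r >= R; });
    active.sort(function (a, b) { return a.s.c - b.s.c; });
    var k = 0, C = 0, fresh = [];
    for (var i = 0; i < rows[R].length; ++i) {
      var cell = rows[R][i];
      // Forward-only walk: each active merge is examined at most once per row.
      while (k < active.length && active[k].s.c <= C) {
        if (active[k].e.c >= C) C = active[k].e.c + 1;
        ++k;
      }
      var RS = cell.rowspan || 1, CS = cell.colspan || 1;
      placed.push({ r: R, c: C, v: cell.v });
      if (RS > 1 || CS > 1) {
        var m = { s: { r: R, c: C }, e: { r: R + RS - 1, c: C + CS - 1 } };
        merges.push(m);
        if (RS > 1) fresh.push(m); // only multi-row merges affect later rows
      }
      C += CS;
    }
    active = active.concat(fresh);
  }
  return { cells: placed, merges: merges };
}

var result = layoutCells([
  [{ v: 'A1:C2', rowspan: 2, colspan: 3 }, { v: 'D1:E2', rowspan: 2, colspan: 2 }, { v: 'F1' }],
  [{ v: 'F2' }]
]);
// result.cells[3] is { r: 1, c: 5, v: 'F2' }, i.e. the cell lands at F2.
```

The per-row sort costs extra when many merges stay active, so whether this beats the original loop in practice would need measuring on a real table.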

So that we have a sense for the performance, can you share a sample table that you are trying to convert?

ThomasChan commented 2019-09-19 13:15:45 +00:00 (Migrated from github.com)

Thanks for the reply. I have submitted a PR. The sample table is nothing special compared with any other HTML table: you can create a big table like the one I described above, 200 columns by 10000 rows, then use xlsx to export it like this:

```javascript
var title = 'Big Table';
var writeOptions = {
  Props: {
    Title: title,
    CreatedDate: new Date().toLocaleString(),
  },
  type: 'binary', // without 'binary', the export runs out of memory
};
var wb = XLSX.utils.book_new();
// `t` is the HTML table element to export
var ws = XLSX.utils.table_to_sheet(t, {
  dense: true, // without dense, the export also runs out of memory
  raw: true, // without raw, performance is slower
});
XLSX.utils.book_append_sheet(wb, ws, title);
XLSX.writeFile(
  wb,
  `${title}.xlsx`,
  writeOptions,
);
```
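For anyone reproducing this, a test table of a given size can be generated with a small helper. This is a sketch only: `buildTableHTML` is a hypothetical name, and the cell contents are arbitrary placeholders rather than the data from the reports above.

```javascript
// Build the markup for an nRows x nCols table with plain text cells.
function buildTableHTML(nRows, nCols) {
  var rows = [];
  for (var r = 0; r < nRows; ++r) {
    var cells = [];
    for (var c = 0; c < nCols; ++c) cells.push('<td>' + r + ',' + c + '</td>');
    rows.push('<tr>' + cells.join('') + '</tr>');
  }
  return '<table>' + rows.join('') + '</table>';
}

// Same dimensions as the first test case in this thread: 204 columns, 250 rows.
var html = buildTableHTML(250, 204);
```

In a browser, the string can be assigned to a container's `innerHTML`, and the resulting table element passed as `t` in the export snippet.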
Reference: sheetjs/sheetjs#1626