---
title: Large Datasets
pagination_prev: demos/extensions/index
pagination_next: demos/engines/index
sidebar_custom_props:
  summary: Dense Mode + Incremental CSV / HTML / JSON / XLML Export
---

import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

For maximal compatibility, SheetJS API functions read entire files into memory
|
|
|
|
and write files in memory. Browsers and other JS engines enforce tight memory
|
|
|
|
limits. The library offers alternate strategies to optimize for memory usage.
|
2022-10-31 00:58:49 +00:00
|
|
|
|
|
|
|
## Dense Mode
|
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
[Dense mode worksheets](/docs/csf/sheet#dense-mode), which store cells in arrays
|
|
|
|
of arrays, are designed to work around Google Chrome performance regressions.
|
|
|
|
For backwards compatibility, dense mode worksheets are not created by default.
|
|
|
|
|
2024-10-26 03:17:31 +00:00
|
|
|
:::tip pass
|
|
|
|
|
|
|
|
Dense worksheets were overhauled in version `0.19.0`. It is strongly recommended
|
|
|
|
to [upgrade to the latest version](/docs/getting-started/installation/).
|
|
|
|
|
|
|
|
:::
|
|
|
|
|
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
`read`, `readFile` and `aoa_to_sheet` accept the `dense` option. When enabled,
|
|
|
|
the methods create worksheet objects that store cells in arrays of arrays:
|
2022-10-31 00:58:49 +00:00
|
|
|
|
|
|
|
```js
|
|
|
|
var dense_wb = XLSX.read(ab, {dense: true});
|
|
|
|
|
2022-11-18 18:22:01 +00:00
|
|
|
var dense_sheet = XLSX.utils.aoa_to_sheet(aoa, {dense: true});
|
2022-10-31 00:58:49 +00:00
|
|
|
```
|
|
|
|
|
2024-04-08 04:47:04 +00:00
|
|
|
<details>
|
|
|
|
<summary><b>Historical Note</b> (click to show)</summary>
|
2022-10-31 00:58:49 +00:00
|
|
|
|
|
|
|
The earliest versions of the library aimed for IE6+ compatibility. In early
|
|
|
|
testing, both in Chrome 26 and in IE6, the most efficient worksheet storage for
|
|
|
|
small sheets was a large object whose keys were cell addresses.
|
|
|
|
|
|
|
|
Over time, V8 (the engine behind Chrome and NodeJS) evolved in a way that made
|
|
|
|
the array of arrays approach more efficient but reduced the performance of the
|
|
|
|
large object approach.
|
|
|
|
|
|
|
|
In the interest of preserving backwards compatibility, the library opts to make
|
|
|
|
the array of arrays approach available behind a special `dense` option.
|
|
|
|
|
|
|
|
</details>
|
|
|
|
|
|
|
|
The various API functions will seamlessly handle dense and sparse worksheets.
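
Code that reads cells directly, rather than through the API functions, has to
account for both layouts. The following is a minimal sketch, assuming the layout
used since version `0.19.0` where dense worksheets keep the array of arrays in
the `!data` property. `encode_a1` and `get_cell` are hypothetical helpers written
for this illustration (the worksheets are built by hand so the snippet runs
without the library):

```js
/* Hypothetical helper: convert zero-based row / column indices to an A1-style address */
function encode_a1(r, c) {
  var col = "";
  for(var n = c + 1; n > 0; n = Math.floor((n - 1) / 26)) {
    col = String.fromCharCode(65 + ((n - 1) % 26)) + col;
  }
  return col + (r + 1);
}

/* Hypothetical helper: fetch a cell from either worksheet layout */
function get_cell(ws, r, c) {
  /* dense (assumed layout): rows are arrays stored in the `!data` property */
  if(ws["!data"] != null) return (ws["!data"][r] || [])[c];
  /* sparse: keys of the worksheet object are cell addresses */
  return ws[encode_a1(r, c)];
}

/* hand-built worksheets holding the same two cells */
var sparse_ws = { "!ref": "A1:B1", A1: { t: "s", v: "Name" }, B1: { t: "n", v: 42 } };
var dense_ws = { "!ref": "A1:B1", "!data": [[{ t: "s", v: "Name" }, { t: "n", v: 42 }]] };

console.log(get_cell(sparse_ws, 0, 1).v, get_cell(dense_ws, 0, 1).v);
```

In a real script, the library utility functions for encoding addresses would
normally replace the hand-written `encode_a1`.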
|
|
|
|
|
|
|
|
## Streaming Write
|
|
|
|
|
|
|
|
The streaming write functions are available in the `XLSX.stream` object. They
|
|
|
|
take the same arguments as the normal write functions:
|
|
|
|
|
|
|
|
- `XLSX.stream.to_csv` is the streaming version of `XLSX.utils.sheet_to_csv`.
|
|
|
|
- `XLSX.stream.to_html` is the streaming version of `XLSX.utils.sheet_to_html`.
|
|
|
|
- `XLSX.stream.to_json` is the streaming version of `XLSX.utils.sheet_to_json`.
|
2024-07-18 22:19:02 +00:00
|
|
|
- `XLSX.stream.to_xlml` is the streaming SpreadsheetML2003 workbook writer.
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
These functions are covered in the ["Stream Export"](/docs/api/stream) section.
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
:::tip pass
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
This feature was expanded in version `0.20.3`. It is strongly recommended to
|
|
|
|
[upgrade to the latest version](/docs/getting-started/installation/).
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
:::
|
2022-10-31 00:58:49 +00:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### NodeJS
|
|
|
|
|
|
|
|
In a CommonJS context, NodeJS Streams and `fs` immediately work with SheetJS:
|
|
|
|
|
|
|
|
```js
|
|
|
|
const XLSX = require("xlsx"); // "just works"
|
|
|
|
```
|
|
|
|
|
2024-04-14 07:40:38 +00:00
|
|
|
:::danger ECMAScript Module Machinations
|
2023-05-30 06:41:09 +00:00
|
|
|
|
2022-10-31 00:58:49 +00:00
|
|
|
In NodeJS ESM, the dependency must be loaded manually:
|
|
|
|
|
|
|
|
```js
|
|
|
|
import * as XLSX from 'xlsx';
|
|
|
|
import { Readable } from 'stream';
|
|
|
|
|
|
|
|
XLSX.stream.set_readable(Readable); // manually load stream helpers
|
|
|
|
```
|
|
|
|
|
|
|
|
Additionally, for file-related operations in NodeJS ESM, `fs` must be loaded:
|
|
|
|
|
|
|
|
```js
|
|
|
|
import * as XLSX from 'xlsx';
|
|
|
|
import * as fs from 'fs';
|
|
|
|
|
|
|
|
XLSX.set_fs(fs); // manually load fs helpers
|
|
|
|
```
|
|
|
|
|
|
|
|
**It is strongly encouraged to use CommonJS in NodeJS whenever possible.**
|
|
|
|
|
|
|
|
:::
|
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
#### Text Streams
|
|
|
|
|
|
|
|
`to_csv`, `to_html`, and `to_xlml` emit strings. The data can be directly pushed
|
|
|
|
to a `Writable` stream. `fs.createWriteStream`[^1] is the recommended approach
|
|
|
|
for streaming to a file in NodeJS.
|
2023-05-30 06:41:09 +00:00
|
|
|
|
2022-10-31 00:58:49 +00:00
|
|
|
This example reads the file passed as an argument to the script, pulls the
first worksheet, converts it to CSV and writes to `SheetJSNodeJStream.csv`:
|
2022-10-31 00:58:49 +00:00
|
|
|
|
|
|
|
```js
var XLSX = require("xlsx"), fs = require("fs");

/* read file */
var wb = XLSX.readFile(process.argv[2], {dense: true});

/* get first worksheet */
var ws = wb.Sheets[wb.SheetNames[0]];

/* create CSV stream */
var csvstream = XLSX.stream.to_csv(ws);

/* create output stream */
var ostream = fs.createWriteStream("SheetJSNodeJStream.csv");

/* write data from CSV stream to output file */
// highlight-next-line
csvstream.pipe(ostream);
```
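
`pipe` returns immediately, so a script that needs to act once the file is
complete should wait for the `finish` event of the `Writable` stream. The
following is a minimal sketch of that pattern, not part of the demo script. To
keep it runnable without a spreadsheet, `Readable.from` stands in for
`XLSX.stream.to_csv(ws)` and an in-memory `Writable` stands in for
`fs.createWriteStream`. Both stand-ins are assumptions for illustration:

```js
var Readable = require("stream").Readable, Writable = require("stream").Writable;

/* stand-in for XLSX.stream.to_csv(ws): a stream emitting one CSV row per chunk */
var csvstream = Readable.from(["Name,Index\n", "Bill Clinton,42\n", "GeorgeW Bush,43\n"]);

/* stand-in for fs.createWriteStream: collect chunks in memory */
var chunks = [];
var ostream = new Writable({
  write: function(chunk, encoding, cb) { chunks.push(chunk.toString()); cb(); }
});

/* "finish" fires after the last chunk has been handed to the destination */
var finished = new Promise(function(resolve, reject) {
  ostream.on("finish", function() { resolve(chunks.join("")); });
  ostream.on("error", reject);
});

csvstream.pipe(ostream);
finished.then(function(text) { console.log(text); });
```

With a real file stream, the same `finish` handler is a reasonable place to log
completion or start a follow-up step.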
|
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
#### Object Streams
|
|
|
|
|
|
|
|
`to_json` uses Object-mode streams[^2]. A `Transform` stream[^3] can be used to
|
|
|
|
generate a text stream for streaming to a file or the screen.
|
2023-05-30 06:41:09 +00:00
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
The following example prints data by writing to the `process.stdout` stream:
|
2022-10-31 00:58:49 +00:00
|
|
|
|
|
|
|
```js
|
2023-05-30 06:41:09 +00:00
|
|
|
var XLSX = require("xlsx"), Transform = require("stream").Transform;
|
2024-07-18 22:19:02 +00:00
|
|
|
|
|
|
|
/* read file */
|
2023-05-30 06:41:09 +00:00
|
|
|
var wb = XLSX.readFile(process.argv[2], {dense: true});
|
2024-07-18 22:19:02 +00:00
|
|
|
|
|
|
|
/* get first worksheet */
|
2023-05-30 06:41:09 +00:00
|
|
|
var ws = wb.Sheets[wb.SheetNames[0]];
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
/* this Transform stream converts JS objects to text */
|
2022-10-31 00:58:49 +00:00
|
|
|
var conv = new Transform({writableObjectMode:true});
|
|
|
|
conv._transform = function(obj, e, cb){ cb(null, JSON.stringify(obj) + "\n"); };
|
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
/* pipe `to_json` -> transformer -> standard output */
|
2022-10-31 00:58:49 +00:00
|
|
|
// highlight-next-line
|
2023-05-30 06:41:09 +00:00
|
|
|
XLSX.stream.to_json(ws, {raw: true}).pipe(conv).pipe(process.stdout);
|
2022-10-31 00:58:49 +00:00
|
|
|
```
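
The transform above emits one JSON object per line. When a consumer expects a
single JSON array, the same pattern can add the brackets and commas. The
following is a minimal sketch under stated assumptions: `Readable.from` stands
in for `XLSX.stream.to_json(ws)` so the snippet runs without a spreadsheet, and
the `to_array` name is hypothetical:

```js
var stream = require("stream");

/* stand-in for XLSX.stream.to_json(ws): an object-mode stream of row objects */
var rows = stream.Readable.from([
  { Name: "Bill Clinton", Index: 42 },
  { Name: "GeorgeW Bush", Index: 43 }
]);

/* object -> text transform that emits one JSON array instead of one object per line */
var first = true;
var to_array = new stream.Transform({
  writableObjectMode: true,
  transform: function(obj, encoding, cb) {
    var prefix = first ? "[\n" : ",\n";
    first = false;
    cb(null, prefix + JSON.stringify(obj));
  },
  /* close the array once the source ends; an empty source yields "[]" */
  flush: function(cb) { cb(null, first ? "[]\n" : "\n]\n"); }
});
to_array.setEncoding("utf8");

/* collect the text; in a script the destination could be process.stdout or a file stream */
var out = "";
var json_text = new Promise(function(resolve, reject) {
  to_array.on("data", function(d) { out += d; });
  to_array.on("end", function() { resolve(out); });
  to_array.on("error", reject);
});
rows.pipe(to_array);
```

Only the current row is held by the transform, so memory use stays independent
of the number of rows.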
|
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
#### BunJS
|
|
|
|
|
|
|
|
BunJS is directly compatible with NodeJS streams.
|
|
|
|
|
|
|
|
:::caution Bun support is considered experimental.
|
|
|
|
|
|
|
|
Great open source software grows with user tests and reports. Any issues should
|
|
|
|
be reported to the Bun project for further diagnosis.
|
|
|
|
|
|
|
|
:::
|
|
|
|
|
|
|
|
#### NodeJS Demo
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-03-12 06:47:52 +00:00
|
|
|
:::note Tested Deployments
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-03-12 06:47:52 +00:00
|
|
|
This demo was tested in the following deployments:

| Node Version | Date       | Node Status when tested |
|:-------------|:-----------|:------------------------|
| `0.12.18`    | 2024-07-18 | End-of-Life             |
| `4.9.1`      | 2024-07-18 | End-of-Life             |
| `6.17.1`     | 2024-07-18 | End-of-Life             |
| `8.17.0`     | 2024-07-18 | End-of-Life             |
| `10.24.1`    | 2024-07-18 | End-of-Life             |
| `12.22.12`   | 2024-07-18 | End-of-Life             |
| `14.21.3`    | 2024-07-18 | End-of-Life             |
| `16.20.2`    | 2024-07-18 | End-of-Life             |
| `18.20.4`    | 2024-07-18 | Maintenance LTS         |
| `20.15.1`    | 2024-07-18 | Active LTS              |
| `22.5.0`     | 2024-07-18 | Current                 |

While streaming methods work in End-of-Life versions of NodeJS, production
|
|
|
|
deployments should upgrade to a Current or LTS version of NodeJS.
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
This demo was also tested against BunJS `1.1.18` on 2024-07-18.
|
|
|
|
|
2022-10-31 00:58:49 +00:00
|
|
|
:::
|
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
1) Install the [NodeJS module](/docs/getting-started/installation/nodejs)
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
<Tabs groupId="plat">
|
|
|
|
<TabItem value="node" label="NodeJS">
|
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
<CodeBlock language="bash">{`\
|
|
|
|
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz`}
|
|
|
|
</CodeBlock>
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
</TabItem>
|
|
|
|
<TabItem value="bun" label="BunJS">
|
|
|
|
|
|
|
|
<CodeBlock language="bash">{`\
|
|
|
|
bun i --save xlsx@https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz`}
|
|
|
|
</CodeBlock>
|
|
|
|
|
|
|
|
</TabItem>
|
|
|
|
</Tabs>
|
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
2) Download [`SheetJSNodeJStream.js`](pathname:///stream/SheetJSNodeJStream.js):
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
```bash
|
|
|
|
curl -LO https://docs.sheetjs.com/stream/SheetJSNodeJStream.js
|
|
|
|
```
|
|
|
|
|
2024-04-26 04:16:13 +00:00
|
|
|
3) Download [the test file](https://docs.sheetjs.com/pres.xlsx):
|
2023-05-30 06:41:09 +00:00
|
|
|
|
|
|
|
```bash
|
2024-04-26 04:16:13 +00:00
|
|
|
curl -LO https://docs.sheetjs.com/pres.xlsx
|
2023-05-30 06:41:09 +00:00
|
|
|
```
|
|
|
|
|
|
|
|
4) Run the script:
|
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
<Tabs groupId="plat">
|
|
|
|
<TabItem value="node" label="NodeJS">
|
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
```bash
|
|
|
|
node SheetJSNodeJStream.js pres.xlsx
|
|
|
|
```
|
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
</TabItem>
|
|
|
|
<TabItem value="bun" label="BunJS">
|
|
|
|
|
|
|
|
```bash
|
|
|
|
bun SheetJSNodeJStream.js pres.xlsx
|
|
|
|
```
|
|
|
|
|
|
|
|
</TabItem>
|
|
|
|
</Tabs>
|
|
|
|
|
2024-04-08 04:47:04 +00:00
|
|
|
<details>
|
|
|
|
<summary><b>Expected Output</b> (click to show)</summary>
|
2023-05-30 06:41:09 +00:00
|
|
|
|
|
|
|
The console will display a list of objects:
|
|
|
|
|
|
|
|
```json
|
|
|
|
{"Name":"Bill Clinton","Index":42}
|
|
|
|
{"Name":"GeorgeW Bush","Index":43}
|
|
|
|
{"Name":"Barack Obama","Index":44}
|
|
|
|
{"Name":"Donald Trump","Index":45}
|
|
|
|
{"Name":"Joseph Biden","Index":46}
|
|
|
|
```
|
|
|
|
|
|
|
|
The script will also generate `SheetJSNodeJStream.csv`:
|
|
|
|
|
|
|
|
```csv
|
|
|
|
Name,Index
|
|
|
|
Bill Clinton,42
|
|
|
|
GeorgeW Bush,43
|
|
|
|
Barack Obama,44
|
|
|
|
Donald Trump,45
|
|
|
|
Joseph Biden,46
|
|
|
|
```
|
2022-10-31 00:58:49 +00:00
|
|
|
|
|
|
|
</details>
|
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
### Browser
|
|
|
|
|
2024-03-12 06:47:52 +00:00
|
|
|
:::note Tested Deployments
|
2023-05-30 06:41:09 +00:00
|
|
|
|
2024-03-12 06:47:52 +00:00
|
|
|
Each browser demo was tested in the following environments:

| Browser     | Date       |
|:------------|:-----------|
| Chrome 126  | 2024-07-18 |
| Safari 17.4 | 2024-07-18 |

:::
|
|
|
|
|
2022-10-31 00:58:49 +00:00
|
|
|
NodeJS streaming APIs are not available in the browser. The following function
|
|
|
|
supplies a pseudo stream object compatible with the `to_csv` function:
|
|
|
|
|
|
|
|
```js
|
|
|
|
function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
|
|
|
|
XLSX.stream.set_readable(() => ({
|
|
|
|
__done: false,
|
|
|
|
// this function will be assigned by the SheetJS stream methods
|
|
|
|
_read: function() { this.__done = true; },
|
|
|
|
// this function is called by the stream methods
|
|
|
|
push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
|
|
|
|
resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
|
|
|
|
}));
|
|
|
|
return XLSX.stream.to_csv(ws, opts);
|
|
|
|
}
|
|
|
|
|
|
|
|
// assuming `workbook` is a workbook, stream the first sheet
|
|
|
|
const ws = workbook.Sheets[workbook.SheetNames[0]];
|
|
|
|
const strm = sheet_to_csv_cb(ws, (csv)=>{ if(csv != null) console.log(csv); });
|
|
|
|
strm.resume();
|
|
|
|
```
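
To see how the pseudo stream is driven, the following sketch replaces the
SheetJS writer with a hypothetical `fake_producer` that assigns `_read`, pushes
one row per call and pushes `null` after the last row. This mirrors the contract
described in the comments above, but it is an illustration under that
assumption and not the library code. `make_pseudo_stream` is likewise a
hypothetical name for the object literal from the previous snippet:

```js
/* same pseudo stream shape as the previous snippet */
function make_pseudo_stream(cb, batch) {
  return {
    __done: false,
    _read: function() { this.__done = true; },
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() {
      for(var i = 0; i < batch && !this.__done; ++i) this._read();
      if(!this.__done) setTimeout(pump.bind(this), 0);
    }
  };
}

/* hypothetical producer standing in for the SheetJS writer */
function fake_producer(strm, rows) {
  var R = 0;
  strm._read = function() { strm.push(R < rows.length ? rows[R++] : null); };
  return strm;
}

var received = [];
var strm = fake_producer(
  make_pseudo_stream(function(d) { received.push(d); }, 1000),
  ["a,b\n", "1,2\n", "3,4\n"]
);
strm.resume();
console.log(received);
```

The callback receives every row and then a final `null`, which is why the
callbacks in this section test `csv != null`. When there are more rows than
`batch`, `resume` yields with `setTimeout` between batches so the page stays
responsive.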
|
|
|
|
|
|
|
|
#### Web Workers
|
|
|
|
|
|
|
|
For processing large files in the browser, it is strongly encouraged to use Web
|
2023-04-29 11:21:37 +00:00
|
|
|
Workers. The [Worker demo](/docs/demos/bigdata/worker#streaming-write) includes
|
|
|
|
examples using the File System Access API.
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2024-04-08 04:47:04 +00:00
|
|
|
<details>
|
|
|
|
<summary><b>Web Worker Details</b> (click to show)</summary>
|
2023-05-30 06:41:09 +00:00
|
|
|
|
2022-10-31 00:58:49 +00:00
|
|
|
Typically, the file and stream processing occurs in the Web Worker. CSV rows
|
|
|
|
can be sent back to the main thread in the callback:
|
|
|
|
|
2023-05-03 03:40:40 +00:00
|
|
|
<CodeBlock language="js" title="worker.js">{`\
|
2022-10-31 00:58:49 +00:00
|
|
|
/* load standalone script from CDN */
|
2023-05-03 03:40:40 +00:00
|
|
|
importScripts("https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js");
|
|
|
|
\n\
|
2022-10-31 00:58:49 +00:00
|
|
|
function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
|
|
|
|
XLSX.stream.set_readable(() => ({
|
|
|
|
__done: false,
|
|
|
|
// this function will be assigned by the SheetJS stream methods
|
|
|
|
_read: function() { this.__done = true; },
|
|
|
|
// this function is called by the stream methods
|
|
|
|
push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
|
|
|
|
resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
|
|
|
|
}));
|
|
|
|
return XLSX.stream.to_csv(ws, opts);
|
|
|
|
}
|
2023-05-03 03:40:40 +00:00
|
|
|
\n\
|
2022-10-31 00:58:49 +00:00
|
|
|
/* this callback will run once the main context sends a message */
|
|
|
|
self.addEventListener('message', async(e) => {
|
|
|
|
try {
|
|
|
|
postMessage({state: "fetching " + e.data.url});
|
|
|
|
/* Fetch file */
|
|
|
|
const res = await fetch(e.data.url);
|
|
|
|
const ab = await res.arrayBuffer();
|
2023-05-03 03:40:40 +00:00
|
|
|
\n\
|
2022-10-31 00:58:49 +00:00
|
|
|
/* Parse file */
|
|
|
|
postMessage({state: "parsing"});
|
|
|
|
const wb = XLSX.read(ab, {dense: true});
|
|
|
|
const ws = wb.Sheets[wb.SheetNames[0]];
|
2023-05-03 03:40:40 +00:00
|
|
|
\n\
|
2022-10-31 00:58:49 +00:00
|
|
|
/* Generate CSV rows */
|
|
|
|
postMessage({state: "csv"});
|
|
|
|
const strm = sheet_to_csv_cb(ws, (csv) => {
|
|
|
|
if(csv != null) postMessage({csv});
|
|
|
|
else postMessage({state: "done"});
|
|
|
|
});
|
|
|
|
strm.resume();
|
|
|
|
} catch(e) {
|
|
|
|
/* Pass the error message back */
|
|
|
|
postMessage({error: String(e.message || e) });
|
|
|
|
}
|
2023-05-03 03:40:40 +00:00
|
|
|
}, false);`}
|
|
|
|
</CodeBlock>
|
2022-10-31 00:58:49 +00:00
|
|
|
|
|
|
|
The main thread will receive messages with CSV rows for further processing:
|
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
```js title="main.js"
|
2022-10-31 00:58:49 +00:00
|
|
|
worker.onmessage = function(e) {
|
|
|
|
if(e.data.error) { console.error(e.data.error); /* show an error message */ }
|
|
|
|
else if(e.data.state) { console.info(e.data.state); /* current state */ }
|
|
|
|
else {
|
|
|
|
/* e.data.csv is the row generated by the stream */
|
|
|
|
console.log(e.data.csv);
|
|
|
|
}
|
|
|
|
};
|
|
|
|
```
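
The handler can also be packaged as a small function that tracks progress and
reports once the worker posts the `"done"` state. The following is a minimal
sketch; `make_tracker` is a hypothetical helper, and the messages are simulated
so the snippet runs without a worker:

```js
/* hypothetical helper: returns an onmessage handler that tallies CSV rows
   and calls `done` once the worker reports the "done" state */
function make_tracker(done) {
  var stats = { rows: 0, chars: 0 };
  return function(e) {
    if(e.data.error) return done(new Error(e.data.error));
    if(e.data.state) { if(e.data.state == "done") done(null, stats); return; }
    /* e.data.csv is one row generated by the stream */
    stats.rows += 1; stats.chars += e.data.csv.length;
  };
}

/* simulate the messages posted by the worker */
var result = null;
var handler = make_tracker(function(err, stats) { result = err || stats; });
[{state: "csv"}, {csv: "a,b\n"}, {csv: "1,2\n"}, {state: "done"}].forEach(function(data) {
  handler({data: data});
});
console.log(result);
```

In the page, the handler would be attached with
`worker.onmessage = make_tracker(callback)`. Tallying rows instead of storing
them keeps main-thread memory use small for large sheets.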
|
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
</details>
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
### Live Demo
|
2022-10-31 00:58:49 +00:00
|
|
|
|
2023-05-30 06:41:09 +00:00
|
|
|
The following live demo fetches and parses a file in a Web Worker. The `to_csv`
streaming function generates CSV rows and passes them back to the main thread
for further processing.
|
|
|
|
|
2023-09-02 09:26:57 +00:00
|
|
|
:::note pass
|
2023-05-30 06:41:09 +00:00
|
|
|
|
|
|
|
For Chromium browsers, the File System Access API provides a modern worker-only
|
|
|
|
approach. [The Web Workers demo](/docs/demos/bigdata/worker#streaming-write)
|
|
|
|
includes a live example of CSV streaming write.
|
|
|
|
|
|
|
|
:::
|
|
|
|
|
|
|
|
The demo has a URL input box. Feel free to change the URL. For example,

`https://raw.githubusercontent.com/SheetJS/test_files/master/large_strings.xls`
is an XLS file over 50 MB.

`https://raw.githubusercontent.com/SheetJS/libreoffice_test-files/master/calc/xlsx-import/perf/8-by-300000-cells.xlsx`
is an XLSX file with 300000 rows (approximately 20 MB).
|
|
|
|
|
|
|
|
<CodeBlock language="jsx" live>{`\
|
|
|
|
function SheetJSFetchCSVStreamWorker() {
|
|
|
|
const [__html, setHTML] = React.useState("");
|
|
|
|
const [state, setState] = React.useState("");
|
|
|
|
const [cnt, setCnt] = React.useState(0);
|
2023-06-05 20:12:53 +00:00
|
|
|
const [url, setUrl] = React.useState("https://docs.sheetjs.com/test_files/large_strings.xlsx");
|
2023-05-03 03:40:40 +00:00
|
|
|
\n\
|
2023-05-30 06:41:09 +00:00
|
|
|
return ( <>
|
|
|
|
<b>URL: </b><input type="text" value={url} onChange={(e) => setUrl(e.target.value)} size="80"/>
|
|
|
|
<button onClick={() => {
|
|
|
|
/* this mantra embeds the worker source in the function */
|
|
|
|
const worker = new Worker(URL.createObjectURL(new Blob([\`\\
|
|
|
|
/* load standalone script from CDN */
|
|
|
|
importScripts("https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js");
|
|
|
|
\n\
|
|
|
|
function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
|
|
|
|
XLSX.stream.set_readable(() => ({
|
2022-10-31 00:58:49 +00:00
|
|
|
__done: false,
|
|
|
|
// this function will be assigned by the SheetJS stream methods
|
|
|
|
_read: function() { this.__done = true; },
|
|
|
|
// this function is called by the stream methods
|
2023-05-30 06:41:09 +00:00
|
|
|
push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
|
2022-10-31 00:58:49 +00:00
|
|
|
resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
|
|
|
|
}));
|
2023-05-30 06:41:09 +00:00
|
|
|
return XLSX.stream.to_csv(ws, opts);
|
2022-10-31 00:58:49 +00:00
|
|
|
}
|
2023-05-03 03:40:40 +00:00
|
|
|
\n\
|
2023-05-30 06:41:09 +00:00
|
|
|
/* this callback will run once the main context sends a message */
|
|
|
|
self.addEventListener('message', async(e) => {
|
|
|
|
try {
|
|
|
|
postMessage({state: "fetching " + e.data.url});
|
|
|
|
/* Fetch file */
|
|
|
|
const res = await fetch(e.data.url);
|
|
|
|
const ab = await res.arrayBuffer();
|
|
|
|
\n\
|
|
|
|
/* Parse file */
|
|
|
|
let len = ab.byteLength;
|
|
|
|
if(len < 1024) len += " bytes"; else { len /= 1024;
|
|
|
|
if(len < 1024) len += " KB"; else { len /= 1024; len += " MB"; }
|
|
|
|
}
|
|
|
|
postMessage({state: "parsing " + len});
|
|
|
|
const wb = XLSX.read(ab, {dense: true});
|
|
|
|
const ws = wb.Sheets[wb.SheetNames[0]];
|
|
|
|
\n\
|
|
|
|
/* Generate CSV rows */
|
|
|
|
postMessage({state: "csv"});
|
|
|
|
const strm = sheet_to_csv_cb(ws, (csv) => {
|
|
|
|
if(csv != null) postMessage({csv});
|
|
|
|
else postMessage({state: "done"});
|
|
|
|
});
|
|
|
|
strm.resume();
|
|
|
|
} catch(e) {
|
|
|
|
/* Pass the error message back */
|
|
|
|
postMessage({error: String(e.message || e) });
|
|
|
|
}
|
|
|
|
}, false);
|
|
|
|
\`])));
|
|
|
|
/* when the worker sends back data, add it to the DOM */
|
|
|
|
worker.onmessage = function(e) {
|
|
|
|
if(e.data.error) return setHTML(e.data.error);
|
|
|
|
else if(e.data.state) return setState(e.data.state);
|
|
|
|
setHTML(e.data.csv);
|
|
|
|
setCnt(cnt => cnt+1);
|
|
|
|
};
|
|
|
|
setCnt(0); setState("");
|
|
|
|
/* post a message to the worker with the URL to fetch */
|
|
|
|
worker.postMessage({url});
|
|
|
|
}}><b>Click to Start</b></button>
|
|
|
|
<pre>State: <b>{state}</b><br/>Number of rows: <b>{cnt}</b></pre>
|
|
|
|
<pre dangerouslySetInnerHTML={{ __html }}/>
|
|
|
|
</> );
|
|
|
|
}`}
|
|
|
|
</CodeBlock>
|
|
|
|
|
|
|
|
### Deno
|
|
|
|
|
|
|
|
Deno does not support NodeJS streams in normal execution, so a wrapper is used:
|
|
|
|
|
|
|
|
<CodeBlock language="ts">{`\
|
|
|
|
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
|
|
|
|
import { stream } from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
|
|
|
|
\n\
|
2022-10-31 00:58:49 +00:00
|
|
|
/* Callback invoked on each row (string) and at the end (null) */
|
|
|
|
const csv_cb = (d:string|null) => {
|
|
|
|
if(d == null) return;
|
|
|
|
/* The strings include line endings, so raw write ops should be used */
|
|
|
|
Deno.stdout.write(new TextEncoder().encode(d));
|
|
|
|
};
|
2023-05-03 03:40:40 +00:00
|
|
|
\n\
|
2023-05-30 06:41:09 +00:00
|
|
|
/* Prepare \`Readable\` function */
|
|
|
|
const Readable = () => ({
|
|
|
|
__done: false,
|
|
|
|
// this function will be assigned by the SheetJS stream methods
|
|
|
|
_read: function() { this.__done = true; },
|
|
|
|
// this function is called by the stream methods
|
|
|
|
push: function(d: any) {
|
|
|
|
if(!this.__done) csv_cb(d);
|
|
|
|
if(d == null) this.__done = true;
|
|
|
|
},
|
|
|
|
resume: function pump() {
|
|
|
|
for(var i = 0; i < 1000 && !this.__done; ++i) this._read();
|
|
|
|
if(!this.__done) setTimeout(pump.bind(this), 0);
|
|
|
|
}
|
|
|
|
})
|
|
|
|
/* Wire up */
|
|
|
|
stream.set_readable(Readable);
|
2023-05-03 03:40:40 +00:00
|
|
|
\n\
|
2023-05-30 06:41:09 +00:00
|
|
|
/* assuming \`workbook\` is a workbook, stream the first sheet */
|
|
|
|
const ws = workbook.Sheets[workbook.SheetNames[0]];
|
|
|
|
stream.to_csv(ws).resume();`}
|
2023-05-03 03:40:40 +00:00
|
|
|
</CodeBlock>
|
2023-05-30 06:41:09 +00:00
|
|
|
|
2024-03-12 06:47:52 +00:00
|
|
|
:::note Tested Deployments
|
2023-05-30 06:41:09 +00:00
|
|
|
|
2024-07-18 22:19:02 +00:00
|
|
|
This demo was last tested on 2024-07-18 against Deno `1.45.2`.
|
2023-05-30 06:41:09 +00:00
|
|
|
|
|
|
|
:::
|
|
|
|
|
|
|
|
[`SheetJSDenoStream.ts`](pathname:///stream/SheetJSDenoStream.ts) is a small
example script that downloads https://docs.sheetjs.com/pres.numbers and prints
CSV rows.
|
|
|
|
|
2024-03-12 06:47:52 +00:00
|
|
|
1) Run the script:
|
|
|
|
|
|
|
|
```bash
|
|
|
|
deno run -A https://docs.sheetjs.com/stream/SheetJSDenoStream.ts
|
|
|
|
```
|
|
|
|
|
2024-04-26 04:16:13 +00:00
|
|
|
This script will fetch [`pres.numbers`](https://docs.sheetjs.com/pres.numbers) and
|
2024-07-18 22:19:02 +00:00
|
|
|
generate CSV rows. The result will be printed to the terminal window.
|
|
|
|
|
|
|
|
[^1]: See [`fs.createWriteStream`](https://nodejs.org/api/fs.html#fscreatewritestreampath-options) in the NodeJS documentation.
|
|
|
|
[^2]: See ["Object mode"](https://nodejs.org/api/stream.html#object-mode) in the NodeJS documentation.
|
|
|
|
[^3]: See [`Transform`](https://nodejs.org/api/stream.html#class-streamtransform) in the NodeJS documentation.
|