---
title: Large Datasets
pagination_prev: demos/extensions/index
pagination_next: demos/engines/index
sidebar_custom_props:
  summary: Dense Mode + Incremental CSV / HTML / JSON Export
---
import current from '/version.js';
import CodeBlock from '@theme/CodeBlock';
For maximal compatibility, the library reads entire files at once and generates
entire files at once. Browsers and other JS engines enforce tight memory limits.
In these cases, the library offers strategies to reduce memory usage by using
platform-specific APIs.
## Dense Mode
`read`, `readFile` and `aoa_to_sheet` accept the `dense` option. When enabled,
the methods create worksheet objects that store cells in arrays of arrays:
```js
var dense_wb = XLSX.read(ab, {dense: true});
var dense_sheet = XLSX.utils.aoa_to_sheet(aoa, {dense: true});
```
<details>
<summary><b>Historical Note</b> (click to show)</summary>
The earliest versions of the library aimed for IE6+ compatibility. In early
testing, both in Chrome 26 and in IE6, the most efficient worksheet storage for
small sheets was a large object whose keys were cell addresses.
Over time, V8 (the engine behind Chrome and NodeJS) evolved in a way that made
the array of arrays approach more efficient but reduced the performance of the
large object approach.
In the interest of preserving backwards compatibility, the library opts to make
the array of arrays approach available behind a special `dense` option.
</details>
The various API functions will seamlessly handle dense and sparse worksheets.
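Code that reads cells directly, rather than through the API functions, has to account for both layouts. The following is a minimal sketch, assuming that dense worksheets keep their rows in the `"!data"` property (the layout used by recent releases) and that sparse worksheets use A1-style keys. The worksheet objects are hand-built for illustration, and `col_name` and `get_cell` are hypothetical helpers, not library functions:

```js
/* Hand-built worksheets for illustration; real ones come from `read` or `aoa_to_sheet` */

/* sparse layout: cells are stored under A1-style keys on the worksheet object */
var sparse_ws = {
  "!ref": "A1:B2",
  A1: {t: "s", v: "Name"},         B1: {t: "s", v: "Index"},
  A2: {t: "s", v: "Bill Clinton"}, B2: {t: "n", v: 42}
};

/* dense layout (assumption): rows are stored in the "!data" array of arrays */
var dense_ws = {
  "!ref": "A1:B2",
  "!data": [
    [ {t: "s", v: "Name"},         {t: "s", v: "Index"} ],
    [ {t: "s", v: "Bill Clinton"}, {t: "n", v: 42} ]
  ]
};

/* hypothetical helper: convert a 0-indexed column number to letters (0 -> "A", 26 -> "AA") */
function col_name(c) {
  var s = "";
  for(++c; c > 0; c = Math.floor((c - 1) / 26)) s = String.fromCharCode(((c - 1) % 26) + 65) + s;
  return s;
}

/* hypothetical helper: fetch the cell at 0-indexed row `r` and column `c` from either layout */
function get_cell(ws, r, c) {
  if(ws["!data"]) return (ws["!data"][r] || [])[c];
  return ws[col_name(c) + (r + 1)];
}

console.log(get_cell(sparse_ws, 1, 1).v, get_cell(dense_ws, 1, 1).v);
```

In a script that already loads the library, `XLSX.utils.encode_cell({r, c})` can replace the hand-written address helper.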
## Streaming Write
The streaming write functions are available in the `XLSX.stream` object. They
take the same arguments as the normal write functions:
- `XLSX.stream.to_csv` is the streaming version of `XLSX.utils.sheet_to_csv`.
- `XLSX.stream.to_html` is the streaming version of `XLSX.utils.sheet_to_html`.
- `XLSX.stream.to_json` is the streaming version of `XLSX.utils.sheet_to_json`.
"Stream" refers to the NodeJS push streams API.
<details>
<summary><b>Historical Note</b> (click to show)</summary>
NodeJS push streams were introduced in 2012. The text streaming methods `to_csv`
and `to_html` are supported in NodeJS v0.10 and later while the object streaming
method `to_json` is supported in NodeJS v0.12 and later.
The first streaming write function, `to_csv`, was introduced in early 2017. It
used and still uses the same NodeJS streaming API.
Years later, browser vendors are settling on a different stream API.
For maximal compatibility, the library uses NodeJS push streams.
</details>
### NodeJS
In a CommonJS context, NodeJS Streams and `fs` immediately work with SheetJS:
```js
const XLSX = require("xlsx"); // "just works"
```
:::danger ECMAScript Module Machinations
In NodeJS ESM, the dependency must be loaded manually:
```js
import * as XLSX from 'xlsx';
import { Readable } from 'stream';
XLSX.stream.set_readable(Readable); // manually load stream helpers
```
Additionally, for file-related operations in NodeJS ESM, `fs` must be loaded:
```js
import * as XLSX from 'xlsx';
import * as fs from 'fs';
XLSX.set_fs(fs); // manually load fs helpers
```
**It is strongly encouraged to use CommonJS in NodeJS whenever possible.**
:::
**`XLSX.stream.to_csv`**
This example reads the workbook file passed as an argument to the script, pulls
the first worksheet, converts it to CSV and writes to `SheetJSNodeJStream.csv`:
```js
var XLSX = require("xlsx"), fs = require("fs");
var wb = XLSX.readFile(process.argv[2]);
var ws = wb.Sheets[wb.SheetNames[0]];
var ostream = fs.createWriteStream("SheetJSNodeJStream.csv");
// highlight-next-line
XLSX.stream.to_csv(ws).pipe(ostream);
```
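`pipe` returns immediately and does not forward errors from the source stream. When a script needs to know that the file was fully written, NodeJS `stream.pipeline` is one option. The following is a rough sketch, not part of the demo script: `Readable.from` stands in for `XLSX.stream.to_csv(ws)`, and the output file name `SheetJSPipelineSketch.csv` is hypothetical:

```js
const fs = require("fs"), os = require("os"), path = require("path");
const { Readable, pipeline } = require("stream");

/* stand-in for XLSX.stream.to_csv(ws): a stream that emits one CSV row per chunk */
const csv_rows = Readable.from(["Name,Index\n", "Bill Clinton,42\n"]);

/* hypothetical output path in the temporary directory */
const out_path = path.join(os.tmpdir(), "SheetJSPipelineSketch.csv");

/* `pipeline` calls back once the write stream finishes or any stream fails */
const written = new Promise((resolve, reject) => {
  pipeline(csv_rows, fs.createWriteStream(out_path), (err) => {
    if(err) reject(err); else resolve(out_path);
  });
});

written.then((p) => console.log("wrote " + p), (err) => console.error(err));
```

In a real script, `csv_rows` would be replaced with the stream returned by `XLSX.stream.to_csv(ws)`.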
**`XLSX.stream.to_json`**
`stream.to_json` uses Object-mode streams. A `Transform` stream can be used to
generate a normal stream for streaming to a file or the screen:
```js
var XLSX = require("xlsx"), Transform = require("stream").Transform;
var wb = XLSX.readFile(process.argv[2], {dense: true});
var ws = wb.Sheets[wb.SheetNames[0]];
/* this Transform stream converts JS objects to text */
var conv = new Transform({writableObjectMode:true});
conv._transform = function(obj, e, cb){ cb(null, JSON.stringify(obj) + "\n"); };
/* pipe `to_json` -> transformer -> standard output */
// highlight-next-line
XLSX.stream.to_json(ws, {raw: true}).pipe(conv).pipe(process.stdout);
```
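The transformer above emits one JSON object per line. If a single JSON array is preferred, the `Transform` can add the brackets and separators. This is a sketch under assumptions: `Readable.from` stands in for `XLSX.stream.to_json(ws)`, and the names `to_array` and `json_text` are illustrative:

```js
const { Readable, Transform } = require("stream");

/* stand-in for XLSX.stream.to_json(ws): an object-mode stream of row objects */
const rows = Readable.from([
  {Name: "Bill Clinton", Index: 42},
  {Name: "GeorgeW Bush", Index: 43}
]);

/* add "[", "," and "]" so the combined text parses as one JSON array */
let first = true;
const to_array = new Transform({
  writableObjectMode: true,
  transform(obj, enc, cb) {
    const prefix = first ? "[\n" : ",\n";
    first = false;
    cb(null, prefix + JSON.stringify(obj));
  },
  /* `flush` runs after the last row; an empty stream yields an empty array */
  flush(cb) { cb(null, first ? "[]\n" : "\n]\n"); }
});

/* collect the text here; a real script could pipe to a file or process.stdout */
const json_text = (async () => {
  let out = "";
  for await (const chunk of rows.pipe(to_array)) out += chunk;
  return out;
})();

json_text.then((txt) => console.log(JSON.parse(txt).length + " rows"));
```

Only one row is stringified at a time, so memory usage on the writing side stays small even for long worksheets.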
**Demo**
:::note Tested Deployments
This demo was tested in the following deployments:
| Node Version | Date | Node Status when tested |
|:-------------|:-----------|:------------------------|
| `0.12.18` | 2024-02-23 | End-of-Life |
| `4.9.1` | 2024-02-23 | End-of-Life |
| `6.17.1` | 2024-02-23 | End-of-Life |
| `8.17.0` | 2024-02-23 | End-of-Life |
| `10.24.1` | 2024-02-23 | End-of-Life |
| `12.22.12` | 2024-02-23 | End-of-Life |
| `14.21.3` | 2024-02-23 | End-of-Life |
| `16.20.2` | 2024-02-23 | End-of-Life |
| `18.19.1` | 2024-02-23 | Maintenance LTS |
| `20.11.1` | 2024-02-23 | Active LTS |
| `22.0.0` | 2024-04-25 | Current |
While streaming methods work in End-of-Life versions of NodeJS, production
deployments should upgrade to a Current or LTS version of NodeJS.
:::
1) Install the [NodeJS module](/docs/getting-started/installation/nodejs)
<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz`}
</CodeBlock>
2) Download [`SheetJSNodeJStream.js`](pathname:///stream/SheetJSNodeJStream.js):
```bash
curl -LO https://docs.sheetjs.com/stream/SheetJSNodeJStream.js
```
3) Download [the test file](https://docs.sheetjs.com/pres.xlsx):
```bash
curl -LO https://docs.sheetjs.com/pres.xlsx
```
4) Run the script:
```bash
node SheetJSNodeJStream.js pres.xlsx
```
<details>
<summary><b>Expected Output</b> (click to show)</summary>
The console will display a list of objects:
```json
{"Name":"Bill Clinton","Index":42}
{"Name":"GeorgeW Bush","Index":43}
{"Name":"Barack Obama","Index":44}
{"Name":"Donald Trump","Index":45}
{"Name":"Joseph Biden","Index":46}
```
The script will also generate `SheetJSNodeJStream.csv`:
```csv
Name,Index
Bill Clinton,42
GeorgeW Bush,43
Barack Obama,44
Donald Trump,45
Joseph Biden,46
```
</details>
### Browser
:::note Tested Deployments
Each browser demo was tested in the following environments:
| Browser | Date |
|:------------|:-----------|
| Chrome 121 | 2024-02-23 |
| Safari 17.3 | 2024-02-23 |
:::
NodeJS streaming APIs are not available in the browser. The following function
supplies a pseudo stream object compatible with the `to_csv` function:
```js
function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
  XLSX.stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return XLSX.stream.to_csv(ws, opts);
}

// assuming `workbook` is a workbook, stream the first sheet
const ws = workbook.Sheets[workbook.SheetNames[0]];
const strm = sheet_to_csv_cb(ws, (csv)=>{ if(csv != null) console.log(csv); });
strm.resume();
```
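The handshake between the pseudo stream and the stream methods can be traced without loading the library. In the sketch below, `mock_to_csv` is a hypothetical stand-in for `XLSX.stream.to_csv`. Following the comments in the pseudo stream, it is assumed to install its own `_read` and to call `push` once per row, then once with `null` at the end. `make_readable` is an illustrative name for the same factory shape used above:

```js
/* same pseudo stream shape as above, built around a row callback */
function make_readable(cb, batch) {
  return () => ({
    __done: false,
    _read: function() { this.__done = true; },
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() {
      for(var i = 0; i < batch && !this.__done; ++i) this._read();
      if(!this.__done) setTimeout(pump.bind(this), 0);
    }
  });
}

/* hypothetical stand-in for XLSX.stream.to_csv: installs `_read` and pushes one row per call */
function mock_to_csv(rows, Readable) {
  var strm = Readable(), R = 0;
  strm._read = function() { strm.push(R < rows.length ? rows[R++] + "\n" : null); };
  return strm;
}

var seen = [], ended = false;
var strm = mock_to_csv(["Name,Index", "Bill Clinton,42"], make_readable(function(d) {
  if(d == null) ended = true; else seen.push(d);
}, 1000));

/* the data fits in one batch, so every row is delivered before resume() returns */
strm.resume();
console.log(seen.join(""), ended);
```

With more rows than `batch`, `resume` yields through `setTimeout` between batches, which keeps the page or worker responsive while rows are generated.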
#### Web Workers
For processing large files in the browser, it is strongly encouraged to use Web
Workers. The [Worker demo](/docs/demos/bigdata/worker#streaming-write) includes
examples using the File System Access API.
<details>
<summary><b>Web Worker Details</b> (click to show)</summary>
Typically, the file and stream processing occurs in the Web Worker. CSV rows
can be sent back to the main thread in the callback:
<CodeBlock language="js" title="worker.js">{`\
/* load standalone script from CDN */
importScripts("https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js");
\n\
function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
  XLSX.stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return XLSX.stream.to_csv(ws, opts);
}
\n\
/* this callback will run once the main context sends a message */
self.addEventListener('message', async(e) => {
  try {
    postMessage({state: "fetching " + e.data.url});
    /* Fetch file */
    const res = await fetch(e.data.url);
    const ab = await res.arrayBuffer();
\n\
    /* Parse file */
    postMessage({state: "parsing"});
    const wb = XLSX.read(ab, {dense: true});
    const ws = wb.Sheets[wb.SheetNames[0]];
\n\
    /* Generate CSV rows */
    postMessage({state: "csv"});
    const strm = sheet_to_csv_cb(ws, (csv) => {
      if(csv != null) postMessage({csv});
      else postMessage({state: "done"});
    });
    strm.resume();
  } catch(e) {
    /* Pass the error message back */
    postMessage({error: String(e.message || e) });
  }
}, false);`}
</CodeBlock>
The main thread will receive messages with CSV rows for further processing:
```js title="main.js"
worker.onmessage = function(e) {
if(e.data.error) { console.error(e.data.error); /* show an error message */ }
else if(e.data.state) { console.info(e.data.state); /* current state */ }
else {
/* e.data.csv is the row generated by the stream */
console.log(e.data.csv);
}
};
```
</details>
### Live Demo
The following live demo fetches and parses a file in a Web Worker. The `to_csv`
streaming function generates CSV rows, which are passed back to the main thread
for further processing.
:::note pass
For Chromium browsers, the File System Access API provides a modern worker-only
approach. [The Web Workers demo](/docs/demos/bigdata/worker#streaming-write)
includes a live example of CSV streaming write.
:::
The demo has a URL input box. Feel free to change the URL. For example:
- `https://raw.githubusercontent.com/SheetJS/test_files/master/large_strings.xls`
  is an XLS file over 50 MB.
- `https://raw.githubusercontent.com/SheetJS/libreoffice_test-files/master/calc/xlsx-import/perf/8-by-300000-cells.xlsx`
  is an XLSX file with 300000 rows (approximately 20 MB).
<CodeBlock language="jsx" live>{`\
function SheetJSFetchCSVStreamWorker() {
const [__html, setHTML] = React.useState("");
const [state, setState] = React.useState("");
const [cnt, setCnt] = React.useState(0);
const [url, setUrl] = React.useState("https://docs.sheetjs.com/test_files/large_strings.xlsx");
\n\
return ( <>
<b>URL: </b><input type="text" value={url} onChange={(e) => setUrl(e.target.value)} size="80"/>
<button onClick={() => {
/* this mantra embeds the worker source in the function */
const worker = new Worker(URL.createObjectURL(new Blob([\`\\
/* load standalone script from CDN */
importScripts("https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js");
\n\
function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
XLSX.stream.set_readable(() => ({
__done: false,
// this function will be assigned by the SheetJS stream methods
_read: function() { this.__done = true; },
// this function is called by the stream methods
push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
}));
return XLSX.stream.to_csv(ws, opts);
}
\n\
/* this callback will run once the main context sends a message */
self.addEventListener('message', async(e) => {
try {
postMessage({state: "fetching " + e.data.url});
/* Fetch file */
const res = await fetch(e.data.url);
const ab = await res.arrayBuffer();
\n\
/* Parse file */
let len = ab.byteLength;
if(len < 1024) len += " bytes"; else { len /= 1024;
if(len < 1024) len += " KB"; else { len /= 1024; len += " MB"; }
}
postMessage({state: "parsing " + len});
const wb = XLSX.read(ab, {dense: true});
const ws = wb.Sheets[wb.SheetNames[0]];
\n\
/* Generate CSV rows */
postMessage({state: "csv"});
const strm = sheet_to_csv_cb(ws, (csv) => {
if(csv != null) postMessage({csv});
else postMessage({state: "done"});
});
strm.resume();
} catch(e) {
/* Pass the error message back */
postMessage({error: String(e.message || e) });
}
}, false);
\`])));
/* when the worker sends back data, add it to the DOM */
worker.onmessage = function(e) {
if(e.data.error) return setHTML(e.data.error);
else if(e.data.state) return setState(e.data.state);
setHTML(e.data.csv);
setCnt(cnt => cnt+1);
};
setCnt(0); setState("");
/* post a message to the worker with the URL to fetch */
worker.postMessage({url});
}}><b>Click to Start</b></button>
<pre>State: <b>{state}</b><br/>Number of rows: <b>{cnt}</b></pre>
<pre dangerouslySetInnerHTML={{ __html }}/>
</> );
}`}
</CodeBlock>
### Deno
Deno does not support NodeJS streams in normal execution, so a wrapper is used:
<CodeBlock language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import { stream } from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
\n\
/* Callback invoked on each row (string) and at the end (null) */
const csv_cb = (d:string|null) => {
  if(d == null) return;
  /* The strings include line endings, so raw write ops should be used */
  Deno.stdout.write(new TextEncoder().encode(d));
};
\n\
/* Prepare \`Readable\` function */
const Readable = () => ({
  __done: false,
  // this function will be assigned by the SheetJS stream methods
  _read: function() { this.__done = true; },
  // this function is called by the stream methods
  push: function(d: any) {
    if(!this.__done) csv_cb(d);
    if(d == null) this.__done = true;
  },
  resume: function pump() {
    for(var i = 0; i < 1000 && !this.__done; ++i) this._read();
    if(!this.__done) setTimeout(pump.bind(this), 0);
  }
});
/* Wire up */
stream.set_readable(Readable);
\n\
/* assuming \`workbook\` is a workbook, stream the first sheet */
const ws = workbook.Sheets[workbook.SheetNames[0]];
stream.to_csv(ws).resume();`}
</CodeBlock>
:::note Tested Deployments
This demo was last tested on 2024-04-25 against Deno `1.42.4`.
:::
[`SheetJSDenoStream.ts`](pathname:///stream/SheetJSDenoStream.ts) is a small
example script that downloads https://docs.sheetjs.com/pres.numbers and prints
CSV rows.
1) Run the script:
```bash
deno run -A https://docs.sheetjs.com/stream/SheetJSDenoStream.ts
```
This script will fetch [`pres.numbers`](https://docs.sheetjs.com/pres.numbers) and
generate CSV rows. The result will be printed to the terminal window.