---
title: Large Datasets
pagination_prev: demos/extensions/index
pagination_next: demos/engines/index
sidebar_custom_props:
  summary: Dense Mode + Incremental CSV / HTML / JSON Export
---

For maximal compatibility, the library reads entire files at once and generates
files at once. Browsers and other JS engines enforce tight memory limits. In
these cases, the library offers strategies to optimize for memory or space by
using platform-specific APIs.

## Dense Mode

The `dense` option (supported in `read`, `readFile` and `aoa_to_sheet`) creates
worksheet objects that use arrays of arrays under the hood:

```js
var dense_wb = XLSX.read(ab, {dense: true});

var dense_sheet = XLSX.utils.aoa_to_sheet(aoa, {dense: true});
```

<details><summary><b>Historical Note</b> (click to show)</summary>

The earliest versions of the library aimed for IE6+ compatibility. In early
testing, both in Chrome 26 and in IE6, the most efficient worksheet storage for
small sheets was a large object whose keys were cell addresses.

Over time, V8 (the engine behind Chrome and NodeJS) evolved in a way that made
the array of arrays approach more efficient but reduced the performance of the
large object approach.

In the interest of preserving backwards compatibility, the library opts to make
the array of arrays approach available behind a special `dense` option.

</details>

The various API functions will seamlessly handle dense and sparse worksheets.
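
The two storage layouts can be pictured without the library. The following is a
conceptual sketch in plain JavaScript, not the library's internals: the helpers
`encode_address`, `get_sparse` and `get_dense` are hypothetical, the sample
cells are invented for illustration, and the property in which the library
keeps the dense rows is not covered in this section.

```js
/* sparse layout: one large object whose keys are cell addresses */
var sparse = { "A1": { t: "s", v: "Name" }, "B2": { t: "n", v: 42 } };

/* dense layout: array of row arrays, indexed by 0-based row and column */
var dense = [ [ { t: "s", v: "Name" } ], [ undefined, { t: "n", v: 42 } ] ];

/* hypothetical helper: 0-based row and column to an address like "B2" */
function encode_address(r, c) {
  var col = "";
  for(++c; c > 0; c = Math.floor((c - 1) / 26))
    col = String.fromCharCode(((c - 1) % 26) + 65) + col;
  return col + (r + 1);
}

/* hypothetical accessors for each layout */
function get_sparse(ws, r, c) { return ws[encode_address(r, c)]; }
function get_dense(rows, r, c) { return rows[r] ? rows[r][c] : undefined; }

console.log(get_sparse(sparse, 1, 1).v, get_dense(dense, 1, 1).v);
```

In this sketch, the sparse lookup builds an address string for every access,
while the dense lookup uses two array indices.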
## Streaming Write

The streaming write functions are available in the `XLSX.stream` object. They
take the same arguments as the normal write functions:

- `XLSX.stream.to_csv` is the streaming version of `XLSX.utils.sheet_to_csv`.
- `XLSX.stream.to_html` is the streaming version of `XLSX.utils.sheet_to_html`.
- `XLSX.stream.to_json` is the streaming version of `XLSX.utils.sheet_to_json`.

"Stream" refers to the NodeJS push streams API.

<details><summary><b>Historical Note</b> (click to show)</summary>

NodeJS push streams were introduced in 2012.

The first streaming write function, `to_csv`, was introduced in April 2017. It
used and still uses the same NodeJS streaming API.

Years later, browser vendors are settling on a different stream API.

For maximal compatibility, the library uses NodeJS push streams.

</details>
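
In a NodeJS push stream, the stream calls `_read` when the consumer is ready
for data and the producer answers by calling `push`, with `push(null)` marking
the end. The following is a minimal sketch that uses only the NodeJS `stream`
module and makes no SheetJS calls; `rows_to_stream` is a hypothetical helper
and the sample rows are made up.

```js
const { Readable } = require("stream");

/* hypothetical helper: emit each string in `rows` as one line of text */
function rows_to_stream(rows) {
  let i = 0;
  const readable = new Readable();
  /* _read is called whenever the consumer is ready for more data */
  readable._read = function() {
    if(i < rows.length) this.push(rows[i++] + "\n");
    else this.push(null); // null signals the end of the stream
  };
  return readable;
}

/* print the rows to the screen */
rows_to_stream(["a,b", "1,2"]).pipe(process.stdout);
```

The NodeJS examples in this section consume the `XLSX.stream` results in the
same way, by calling `pipe` with a writable destination.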
### NodeJS

:::note

In a CommonJS context, NodeJS Streams and `fs` immediately work with SheetJS:

```js
const XLSX = require("xlsx"); // "just works"
```

In NodeJS ESM, the `stream` dependency must be loaded manually:

```js
import * as XLSX from 'xlsx';
import { Readable } from 'stream';

XLSX.stream.set_readable(Readable); // manually load stream helpers
```

Additionally, for file-related operations in NodeJS ESM, `fs` must be loaded:

```js
import * as XLSX from 'xlsx';
import * as fs from 'fs';

XLSX.set_fs(fs); // manually load fs helpers
```

**It is strongly encouraged to use CommonJS in NodeJS whenever possible.**

:::

This example reads the workbook file passed as an argument to the script, pulls
the first worksheet, converts it to CSV and writes to `out.csv`:

```js
var XLSX = require("xlsx");
var fs = require("fs"); // required for fs.createWriteStream below
var workbook = XLSX.readFile(process.argv[2]);
var worksheet = workbook.Sheets[workbook.SheetNames[0]];
// highlight-next-line
var stream = XLSX.stream.to_csv(worksheet);

var output_file_name = "out.csv";
// highlight-next-line
stream.pipe(fs.createWriteStream(output_file_name));
```

`stream.to_json` uses Object-mode streams. A `Transform` stream can be used to
generate a normal stream for streaming to a file or the screen:

```js
var XLSX = require("xlsx");
var { Transform } = require("stream"); // required for the Transform stream below
var workbook = XLSX.readFile(process.argv[2], {dense: true});
var worksheet = workbook.Sheets[workbook.SheetNames[0]];
/* to_json returns an object-mode stream */
// highlight-next-line
var stream = XLSX.stream.to_json(worksheet, {raw:true});

/* this Transform stream converts JS objects to text and prints to screen */
var conv = new Transform({writableObjectMode:true});
conv._transform = function(obj, e, cb){ cb(null, JSON.stringify(obj) + "\n"); };
conv.pipe(process.stdout);

// highlight-next-line
stream.pipe(conv);
```
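
The conversion step can also be packaged as a reusable factory by passing a
`transform` function to the `Transform` constructor. This is a sketch that uses
only the NodeJS `stream` module; `to_ndjson_line` and `make_ndjson_transform`
are hypothetical names, and the sample rows stand in for the objects produced
by an object-mode stream.

```js
const { Readable, Transform } = require("stream");

/* hypothetical helper: serialize one row object as one line of text */
function to_ndjson_line(obj) { return JSON.stringify(obj) + "\n"; }

/* hypothetical factory: accepts objects, emits text */
function make_ndjson_transform() {
  return new Transform({
    writableObjectMode: true,
    transform: function(obj, encoding, callback) {
      callback(null, to_ndjson_line(obj));
    }
  });
}

/* stand-in for an object-mode stream of rows (sample data) */
Readable.from([{ Name: "Bill", Index: 42 }, { Name: "Barack", Index: 44 }])
  .pipe(make_ndjson_transform())
  .pipe(process.stdout);
```

Piping the transform into a writable file stream instead of `process.stdout`
writes one JSON object per line to a file.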
### Browser

<details><summary><b>Live Demo</b> (click to show)</summary>

The following live demo fetches and parses a file in a Web Worker. The `to_csv`
streaming function is used to generate CSV rows and pass back to the main thread
for further processing.

:::note

For Chromium browsers, the File System Access API provides a modern worker-only
approach. [The Web Workers demo](/docs/demos/bigdata/worker#streaming-write)
includes a live example of CSV streaming write.

:::

The demo has a URL input box. Feel free to change the URL. For example:

- `https://raw.githubusercontent.com/SheetJS/test_files/master/large_strings.xls`
  is an XLS file over 50 MB
- `https://raw.githubusercontent.com/SheetJS/libreoffice_test-files/master/calc/xlsx-import/perf/8-by-300000-cells.xlsx`
  is an XLSX file with 300000 rows (approximately 20 MB)

```jsx live
function SheetJSFetchCSVStreamWorker() {
  const [__html, setHTML] = React.useState("");
  const [state, setState] = React.useState("");
  const [cnt, setCnt] = React.useState(0);
  const [url, setUrl] = React.useState("https://oss.sheetjs.com/test_files/large_strings.xlsx");

  return ( <>
    <b>URL: </b><input type="text" value={url} onChange={(e) => setUrl(e.target.value)} size="80"/>
    <button onClick={() => {
      /* this mantra embeds the worker source in the function */
      const worker = new Worker(URL.createObjectURL(new Blob([`\
/* load standalone script from CDN */
importScripts("https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js");

function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
  XLSX.stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return XLSX.stream.to_csv(ws, opts);
}

/* this callback will run once the main context sends a message */
self.addEventListener('message', async(e) => {
  try {
    postMessage({state: "fetching " + e.data.url});
    /* Fetch file */
    const res = await fetch(e.data.url);
    const ab = await res.arrayBuffer();

    /* Parse file */
    let len = ab.byteLength;
    if(len < 1024) len += " bytes"; else { len /= 1024;
      if(len < 1024) len += " KB"; else { len /= 1024; len += " MB"; }
    }
    postMessage({state: "parsing " + len});
    const wb = XLSX.read(ab, {dense: true});
    const ws = wb.Sheets[wb.SheetNames[0]];

    /* Generate CSV rows */
    postMessage({state: "csv"});
    const strm = sheet_to_csv_cb(ws, (csv) => {
      if(csv != null) postMessage({csv});
      else postMessage({state: "done"});
    });
    strm.resume();
  } catch(e) {
    /* Pass the error message back */
    postMessage({error: String(e.message || e) });
  }
}, false);
      `])));
      /* when the worker sends back data, add it to the DOM */
      worker.onmessage = function(e) {
        if(e.data.error) return setHTML(e.data.error);
        else if(e.data.state) return setState(e.data.state);
        setHTML(e.data.csv);
        setCnt(cnt => cnt+1);
      };
      setCnt(0); setState("");
      /* post a message to the worker with the URL to fetch */
      worker.postMessage({url});
    }}><b>Click to Start</b></button>
    <pre>State: <b>{state}</b><br/>Number of rows: <b>{cnt}</b></pre>
    <pre dangerouslySetInnerHTML={{ __html }}/>
  </> );
}
```

</details>

NodeJS streaming APIs are not available in the browser. The following function
supplies a pseudo stream object compatible with the `to_csv` function:

```js
function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
  XLSX.stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return XLSX.stream.to_csv(ws, opts);
}

// assuming `workbook` is a workbook, stream the first sheet
const ws = workbook.Sheets[workbook.SheetNames[0]];
const strm = sheet_to_csv_cb(ws, (csv)=>{ if(csv != null) console.log(csv); });
strm.resume();
```
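
The `resume` function in the pseudo stream processes rows in batches and yields
to the event loop between batches so that other work can run. The pattern can
be sketched on its own, assuming nothing about the library; `pump_in_batches`
and its `schedule` parameter are hypothetical.

```js
/* hypothetical helper: call `step` up to `batch` times, then yield and repeat.
   `step` returns false when there is no more work. */
function pump_in_batches(step, batch, schedule) {
  /* by default, yield to the event loop with a zero-delay timer */
  if(!schedule) schedule = function(fn) { setTimeout(fn, 0); };
  (function pump() {
    for(var i = 0; i < batch; ++i) if(!step()) return;
    schedule(pump);
  })();
}

/* count to 2500 in batches of 1000 */
var count = 0;
pump_in_batches(function() { return ++count < 2500; }, 1000);
```

A larger `batch` value means fewer timer callbacks but longer blocking between
yields. A smaller value yields more often at the cost of more timer callbacks.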
#### Web Workers

For processing large files in the browser, it is strongly encouraged to use Web
Workers. The [Worker demo](/docs/demos/bigdata/worker#streaming-write) includes
examples using the File System Access API.

Typically, the file and stream processing occurs in the Web Worker. CSV rows
can be sent back to the main thread in the callback:

```js title="worker.js"
/* load standalone script from CDN */
importScripts("https://cdn.sheetjs.com/xlsx-latest/package/dist/xlsx.full.min.js");

function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
  XLSX.stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return XLSX.stream.to_csv(ws, opts);
}

/* this callback will run once the main context sends a message */
self.addEventListener('message', async(e) => {
  try {
    postMessage({state: "fetching " + e.data.url});
    /* Fetch file */
    const res = await fetch(e.data.url);
    const ab = await res.arrayBuffer();

    /* Parse file */
    postMessage({state: "parsing"});
    const wb = XLSX.read(ab, {dense: true});
    const ws = wb.Sheets[wb.SheetNames[0]];

    /* Generate CSV rows */
    postMessage({state: "csv"});
    const strm = sheet_to_csv_cb(ws, (csv) => {
      if(csv != null) postMessage({csv});
      else postMessage({state: "done"});
    });
    strm.resume();
  } catch(e) {
    /* Pass the error message back */
    postMessage({error: String(e.message || e) });
  }
}, false);
```

The main thread will receive messages with CSV rows for further processing:

```js
worker.onmessage = function(e) {
  if(e.data.error) { console.error(e.data.error); /* show an error message */ }
  else if(e.data.state) { console.info(e.data.state); /* current state */ }
  else {
    /* e.data.csv is the row generated by the stream */
    console.log(e.data.csv);
  }
};
```
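
When the output is small enough to keep in memory, the three message shapes
(`error`, `state` and `csv`) can be folded into a single result object. This is
a minimal sketch; `new_result` and `handle_worker_message` are hypothetical
helpers, and the messages are simulated rather than sent by a real worker.

```js
/* hypothetical helper: empty result object */
function new_result() { return { state: "", error: null, rows: [] }; }

/* hypothetical helper: fold one worker message into the result */
function handle_worker_message(result, data) {
  if(data.error) result.error = data.error;
  else if(data.state) result.state = data.state;
  else if(data.csv != null) result.rows.push(data.csv);
  return result;
}

/* simulate the messages posted by the worker */
var result = new_result();
[ {state: "parsing"}, {state: "csv"}, {csv: "a,b\n"}, {csv: "1,2\n"}, {state: "done"} ]
  .forEach(function(data) { handle_worker_message(result, data); });
console.log(result.state, result.rows.length); // done 2
```

In a page, the same function can be called from the handler, for example
`worker.onmessage = function(e) { handle_worker_message(result, e.data); };`.
For very large sheets, it is better to process each row as it arrives than to
collect every row.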
### Deno

Deno does not support NodeJS streams in normal execution, so a wrapper is used.
This example fetches <https://sheetjs.com/pres.numbers> and prints CSV rows:

```ts title="sheet2csv.ts"
// @deno-types="https://cdn.sheetjs.com/xlsx-latest/package/types/index.d.ts"
import { stream, Sheet2CSVOpts, WorkSheet } from 'https://cdn.sheetjs.com/xlsx-latest/package/xlsx.mjs';

interface Resumable { resume:()=>void; };
/* Generate row strings from a worksheet */
function sheet_to_csv_cb(ws: WorkSheet, cb:(d:string|null)=>void, opts: Sheet2CSVOpts = {}, batch = 1000): Resumable {
  stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d: any) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return stream.to_csv(ws, opts) as Resumable;
}

/* Callback invoked on each row (string) and at the end (null) */
const csv_cb = (d:string|null) => {
  if(d == null) return;
  /* The strings include line endings, so raw write ops should be used */
  Deno.stdout.write(new TextEncoder().encode(d));
};

/* Fetch https://sheetjs.com/pres.numbers, parse, and get first worksheet */
import { read } from 'https://cdn.sheetjs.com/xlsx-latest/package/xlsx.mjs';
const ab = await (await fetch("https://sheetjs.com/pres.numbers")).arrayBuffer();
const wb = read(ab, { dense: true });
const ws = wb.Sheets[wb.SheetNames[0]];

/* Create and start CSV stream */
sheet_to_csv_cb(ws, csv_cb).resume();
```