Best practice for streaming to HTTP clients? #2834
Reference: sheetjs/sheetjs#2834
My understanding is that the entire workbook buffer is always created in memory due to the forward pointers and such. Is that correct?
If so, it sounds like the best streaming strategy would actually be to write quickly to the fastest medium - likely local files (or maybe S3), NOT the network - and then stream from that to the slow clients.
Something like this:
Do you know of any nuance I should be aware of or have any other thoughts on that?
To avoid blocking the main thread, the best approach for modern NodeJS is worker threads, probably using some sort of thread pool. The idea is to use `writeFile` in the worker and pass the path back to the API route callback. The rest is similar to how you have laid out the code. We probably should make a demo of this approach.
If you are comfortable changing the API design and are exporting large files, you can make the server respond with a random name and add a separate route for downloading exports given the name. Typically you would cleanup exports after X hours or days. This is spiritually similar to how services like Twitter make exports of personal data.
Under the hood, `writeFile` assembles the file in memory and writes. This approach "just works" in legacy and modern browsers, NodeJS and a number of other runtimes. We've shied away from streaming XLSX write in the past since the current state of browser streaming is pretty abysmal. In case you are curious, https://docs.sheetjs.com/docs/demos/worker#file-system-access-api shows XLSX -> CSV conversion. Locally, against the file with 300K rows, writing at once takes ~2 sec while committing each row takes ~112 sec in Chrome 106.
If streaming XLSX write in NodeJS is of interest, we can take a closer look. The issues about forward references apply to streaming parse (XLSX is a ZIP-based file format and the "table of contents" is stored at the end of the file).
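The "table of contents at the end" point can be made concrete: a ZIP reader locates the End Of Central Directory (EOCD) record, signature `0x06054b50`, by scanning backwards from the end of the file, which is why a strict forward-only stream cannot know the member list up front. A minimal locator sketch:

```javascript
// Find the ZIP End Of Central Directory record by scanning backwards.
// The EOCD is at least 22 bytes and carries the central directory offset,
// so a parser effectively has to read the end of the file first.
function findEOCD(buf) {
  for (let i = buf.length - 22; i >= 0; i--) {
    if (buf.readUInt32LE(i) === 0x06054b50) return i; // EOCD signature "PK\x05\x06"
  }
  return -1; // not a (complete) ZIP file
}
```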
Re: streaming
My concern was more having memory backing up, but I can see what you're saying about using a lot of CPU at once and blocking the main thread.
Re: thread pool
Honestly, that thread pool stuff is way too complicated. Node just isn't the right tool for the job when you need multi-threading, IMO.
I'd recommend anyone just write a little API microservice instead and call out to it.
(we do this for LibreOffice pdf conversion https://github.com/savvi-legal/libreoffice-as-a-service/blob/main/routes/convert.js - though... I used it as an excuse to try out `fastify`, so it's probably equally complicated 🫤...)

Thanks
Great detail and suggestions. I appreciate it. :)
Mirroring the soffice example, you can write a NodeJS script and shell out to it (https://docs.sheetjs.com/docs/demos/cli; a similar script was in our first release back in 2012!). You can write your CLI script to query the backend for data and write the file.
To show worker threads, here's a small XLSX conversion service powered by ExpressJS. It will accept file uploads, parse the data and send back the converted data in the XLSX format.

Install the SheetJS library, express, and formidable. Save the worker script as `task_processor.js` and the pool as `worker_pool.js`, then run the main server (`main.mjs`) and test with `pres.numbers`. This will generate `SheetJSPool.xlsx`, which you can open.