---
title: Synthetic DOM
pagination_prev: demos/net/headless/index
---

import current from '/version.js';
import CodeBlock from '@theme/CodeBlock';

[SheetJS](https://sheetjs.com) is a JavaScript library for reading and writing
data from spreadsheets.

SheetJS offers three methods to directly process HTML DOM TABLE elements:

- `table_to_sheet`[^1] generates a SheetJS worksheet[^2] from a TABLE element
- `table_to_book`[^3] generates a SheetJS workbook[^4] from a TABLE element
- `sheet_add_dom`[^5] adds data from a TABLE element to an existing worksheet

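
The three methods listed above can be combined. The following is a minimal
sketch, not part of the original demo: `tablesToWorkbook` is a hypothetical
helper that receives the SheetJS module and an array of TABLE elements as
arguments, so it does not depend on a particular DOM library. It assumes that
the `origin: -1` option of `sheet_add_dom` appends below the existing data and
that `book_new` and `book_append_sheet` are used to wrap the worksheet.

```js
/**
 * Build a single-sheet workbook from a list of TABLE elements.
 * `XLSX` is the SheetJS module and `tables` is an array of TABLE elements.
 * The helper name and the default sheet name are illustrative assumptions.
 */
function tablesToWorkbook(XLSX, tables, sheetName = "Tables") {
  if (tables.length === 0) throw new Error("no TABLE elements supplied");

  /* table_to_sheet: the first table becomes a new worksheet */
  const ws = XLSX.utils.table_to_sheet(tables[0]);

  /* sheet_add_dom: each remaining table is appended to the same worksheet */
  for (const tbl of tables.slice(1)) {
    XLSX.utils.sheet_add_dom(ws, tbl, { origin: -1 });
  }

  /* wrap the worksheet in a new workbook */
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, sheetName);
  return wb;
}
```

When there is only one table, `table_to_book` covers the whole sequence in a
single call.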

These methods work in the web browser. NodeJS and other server-side platforms
traditionally lack a DOM implementation, but third-party modules fill the gap.

This demo covers synthetic DOM implementations for non-browser platforms. We'll
explore how to use SheetJS DOM methods in server-side environments to parse
tables and export data to spreadsheets.

:::tip pass

The most robust approach for server-side processing is to automate a headless
web browser. ["Browser Automation"](/docs/demos/net/headless) includes demos.

:::

## Integration Details

Synthetic DOM implementations typically provide a function that accepts an HTML
string and returns an object that represents `document`. An API method such as
`getElementsByTagName` or `querySelector` can pull TABLE elements.

```mermaid
flowchart LR
  subgraph Synthetic DOM Operations
    html(HTML\nstring)
    doc{{`document`\nDOM Object}}
  end
  subgraph SheetJS Operations
    table{{DOM\nTable}}
    wb(((SheetJS\nWorkbook)))
    file(workbook\nfile)
  end
  html --> |Library\n\n| doc
  doc --> |DOM\nAPI| table
  table --> |`table_to_book`\n\n| wb
  wb --> |`writeFile`\n\n| file
```
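
The last three arrows of the diagram can be sketched as one library-agnostic
function. This is an illustration under stated assumptions rather than code
from the demo: `exportFirstTable` is a hypothetical name, and both the SheetJS
module and the `document` object are passed in as arguments so that any
synthetic DOM library can supply the document.

```js
/**
 * Pull the first TABLE from a `document` object and write it to a file.
 * `XLSX` is the SheetJS module; `document` comes from a synthetic DOM library.
 * The function name and the error message are illustrative assumptions.
 */
function exportFirstTable(XLSX, document, filename) {
  /* DOM API step: locate TABLE elements */
  const tables = document.getElementsByTagName("table");
  if (!tables || tables.length === 0) {
    throw new Error("document has no TABLE element");
  }

  /* SheetJS steps: TABLE element -> workbook -> file */
  const workbook = XLSX.utils.table_to_book(tables[0]);
  XLSX.writeFile(workbook, filename);
  return workbook;
}
```

The first arrow (HTML string to `document`) is the only library-specific step.
The sections below show how each library creates the `document` object.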

SheetJS methods use features that may be missing from some DOM implementations.

### Table rows

The `rows` property of TABLE elements is a list of TR row children. This list
automatically updates when rows are added and deleted.

SheetJS methods do not mutate `rows`. Assuming there are no nested tables, the
`rows` property can be created using `getElementsByTagName`:

```js
tbl.rows = Array.from(tbl.getElementsByTagName("tr"));
```

### Row cells

The `cells` property of TR elements is a list of TD cell children. This list
automatically updates when cells are added and deleted.

SheetJS methods do not mutate `cells`. Assuming there are no nested tables, the
`cells` property can be created using `getElementsByTagName`:

```js
tbl.rows.forEach(row => row.cells = Array.from(row.getElementsByTagName("td")));
```
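
The two patches can be folded into one helper that only fills in properties
the DOM implementation does not already provide. This is a sketch under
assumptions: `patchTable` and `fakeEl` are hypothetical names, the helper
assumes elements expose `childNodes` and `tagName`, and it collects TH header
cells as well as TD cells by filtering direct children of each row. The
stand-in elements exist only so the snippet runs without a DOM library.

```js
/* Add `rows` and `cells` only when the DOM implementation lacks them */
function patchTable(tbl) {
  if (!tbl.rows) tbl.rows = Array.from(tbl.getElementsByTagName("tr"));
  for (const row of Array.from(tbl.rows)) {
    if (row.cells) continue;
    /* direct children only: keeps document order and skips nested tables */
    row.cells = Array.from(row.childNodes).filter(
      node => /^(td|th)$/i.test(node.tagName || "")
    );
  }
  return tbl;
}

/* Stand-in elements for demonstration (a real script uses library objects) */
const fakeEl = (tagName, childNodes = []) => ({
  tagName,
  childNodes,
  getElementsByTagName(name) {
    const out = [];
    for (const child of childNodes) {
      if (child.tagName === name) out.push(child);
      out.push(...child.getElementsByTagName(name));
    }
    return out;
  }
});

const demo = patchTable(fakeEl("table", [
  fakeEl("tr", [fakeEl("th"), fakeEl("td")]),
  fakeEl("tr", [fakeEl("td"), fakeEl("td")])
]));
console.log(demo.rows.length, demo.rows[0].cells.length); // prints "2 2"
```

Because the helper checks for existing properties first, it leaves compliant
implementations untouched.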

## NodeJS

### JSDOM

[JSDOM](https://git.io/jsdom) is a DOM implementation for NodeJS. The synthetic
DOM elements are compatible with SheetJS methods.

The following example scrapes the first table from the file `SheetJSTable.html`
and generates an XLSX workbook:

```js title="SheetJSDOM.js"
const XLSX = require("xlsx");
const { readFileSync } = require("fs");
const { JSDOM } = require("jsdom");

/* obtain HTML string. This example reads from SheetJSTable.html */
const html_str = readFileSync("SheetJSTable.html", "utf8");

// highlight-start
/* get first TABLE element */
const tbl = new JSDOM(html_str).window.document.querySelector("table");

/* generate workbook */
const workbook = XLSX.utils.table_to_book(tbl);
// highlight-end

XLSX.writeFile(workbook, "SheetJSDOM.xlsx");
```

<details>
  <summary><b>Complete Demo</b> (click to show)</summary>

:::note Tested Deployments

This demo was last tested on 2024 January 27 against JSDOM `24.0.0`.

:::

1) Install SheetJS and JSDOM libraries:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz jsdom@24.0.0`}
</CodeBlock>

2) Save the previous codeblock to `SheetJSDOM.js`.

3) Download [the sample `SheetJSTable.html`](pathname:///dom/SheetJSTable.html):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSTable.html
```

4) Run the script:

```bash
node SheetJSDOM.js
```

The script will create a file `SheetJSDOM.xlsx` that can be opened in a
spreadsheet editor.

</details>
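
The JSDOM example exports only the first table. When a page holds several
tables, one option is to place each table in its own worksheet. The following
sketch is not part of the tested demo: `allTablesToWorkbook` is a hypothetical
helper, the `Table1`, `Table2`, ... sheet names are arbitrary, and the SheetJS
module and `document` object are passed in as arguments.

```js
/**
 * Create one worksheet per TABLE element in a document.
 * `XLSX` is the SheetJS module; `document` comes from a synthetic DOM library.
 * The helper name and sheet naming scheme are illustrative assumptions.
 */
function allTablesToWorkbook(XLSX, document) {
  /* note: getElementsByTagName also returns nested tables */
  const tables = Array.from(document.getElementsByTagName("table"));
  if (tables.length === 0) throw new Error("document has no TABLE element");

  const wb = XLSX.utils.book_new();
  tables.forEach((tbl, idx) => {
    /* each table becomes a separate worksheet */
    const ws = XLSX.utils.table_to_sheet(tbl);
    XLSX.utils.book_append_sheet(wb, ws, "Table" + (idx + 1));
  });
  return wb;
}
```

With the variables from the JSDOM example, the workbook could be written with
`XLSX.writeFile(allTablesToWorkbook(XLSX, new JSDOM(html_str).window.document), "SheetJSDOM.xlsx")`.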

### HappyDOM

HappyDOM provides a DOM framework for NodeJS. For the tested version (`13.3.1`),
the following patches were needed:

- TABLE `rows` property (explained above)
- TR `cells` property (explained above)

<details>
  <summary><b>Complete Demo</b> (click to show)</summary>

:::note Tested Deployments

This demo was last tested on 2024 January 27 against HappyDOM `13.3.1`.

:::

1) Install SheetJS and HappyDOM libraries:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz happy-dom@13.3.1`}
</CodeBlock>

2) Download [the sample script `SheetJSHappyDOM.js`](pathname:///dom/SheetJSHappyDOM.js):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSHappyDOM.js
```

3) Download [the sample `SheetJSTable.html`](pathname:///dom/SheetJSTable.html):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSTable.html
```

4) Run the script:

```bash
node SheetJSHappyDOM.js
```

The script will create a file `SheetJSHappyDOM.xlsx` that can be opened in a
spreadsheet editor.

</details>

### XMLDOM

[XMLDOM](https://xmldom.org/) provides a DOM framework for NodeJS. For the
tested version (`0.8.10`), the following patches were needed:

- TABLE `rows` property (explained above)
- TR `cells` property (explained above)
- Element `innerHTML` property:

```js
Object.defineProperty(tbl.__proto__, "innerHTML", { get: function() {
  var outerHTML = new XMLSerializer().serializeToString(this);
  if(outerHTML.match(/</g).length == 1) return "";
  return outerHTML.slice(0, outerHTML.lastIndexOf("</")).replace(/<[^"'>]*(("[^"]*"|'[^']*')[^"'>]*)*>/, "");
}});
```
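
The string handling inside the `innerHTML` getter can be studied in isolation.
The following sketch extracts the same logic into a standalone function that
operates on a serialized element string. `innerFromOuter` is a hypothetical
name used for illustration. The function is not part of XMLDOM or SheetJS.

```js
/* Derive inner markup from the serialized form of a single element */
function innerFromOuter(outerHTML) {
  /* one "<" means a lone self-closing tag, so there is no inner content */
  if ((outerHTML.match(/</g) || []).length <= 1) return "";
  return outerHTML
    /* drop the final closing tag */
    .slice(0, outerHTML.lastIndexOf("</"))
    /* drop the opening tag; quoted attribute values may contain ">" */
    .replace(/<[^"'>]*(("[^"]*"|'[^']*')[^"'>]*)*>/, "");
}

console.log(innerFromOuter("<td>abc</td>"));                  // prints "abc"
console.log(innerFromOuter('<td class="a>b"><b>x</b></td>')); // prints "<b>x</b>"
```

The regular expression skips over quoted attribute values, which is why the
`>` inside `class="a>b"` does not end the opening tag early.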

<details>
  <summary><b>Complete Demo</b> (click to show)</summary>

:::note Tested Deployments

This demo was last tested on 2024 March 12 against XMLDOM `0.8.10`.

:::

1) Install SheetJS and XMLDOM libraries:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz @xmldom/xmldom@0.8.10`}
</CodeBlock>

2) Download [the sample script `SheetJSXMLDOM.js`](pathname:///dom/SheetJSXMLDOM.js):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSXMLDOM.js
```

3) Run the script:

```bash
node SheetJSXMLDOM.js
```

The script will create a file `SheetJSXMLDOM.xlsx` that can be opened in a
spreadsheet editor.

</details>

### CheerioJS

:::caution pass

Cheerio does not support a number of fundamental properties out of the box. They
can be shimmed, but it is strongly recommended to use a more compliant library.

:::

[CheerioJS](https://cheerio.js.org/) provides a DOM-like framework for NodeJS.
[`SheetJSCheerio.js`](pathname:///dom/SheetJSCheerio.js) implements the missing
features to ensure that SheetJS DOM methods can process TABLE elements.

<details>
  <summary><b>Complete Demo</b> (click to show)</summary>

:::note Tested Deployments

This demo was last tested on 2024 March 12 against Cheerio `1.0.0-rc.12`.

:::

1) Install SheetJS and CheerioJS libraries:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz cheerio@1.0.0-rc.12`}
</CodeBlock>

2) Download [the sample script `SheetJSCheerio.js`](pathname:///dom/SheetJSCheerio.js):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSCheerio.js
```

3) Download [the sample `SheetJSTable.html`](pathname:///dom/SheetJSTable.html):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSTable.html
```

4) Run the script:

```bash
node SheetJSCheerio.js
```

The script will create a file `SheetJSCheerio.xlsx` that can be opened in a
spreadsheet editor.

</details>


## Other Platforms

### DenoDOM

[DenoDOM](https://deno.land/x/deno_dom) provides a DOM framework for Deno. For
the tested version (`0.1.46`), the following patches were needed:

- TABLE `rows` property (explained above)
- TR `cells` property (explained above)

This example fetches [a sample table](pathname:///dom/SheetJSTable.html):

<CodeBlock language="ts" title="SheetJSDenoDOM.ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
\n\
import { DOMParser } from 'https://deno.land/x/deno_dom@v0.1.46/deno-dom-wasm.ts';
\n\
const doc = new DOMParser().parseFromString(
  await (await fetch('https://docs.sheetjs.com/dom/SheetJSTable.html')).text(),
  "text/html",
)!;
// highlight-start
const tbl = doc.querySelector("table");
\n\
/* patch DenoDOM element */
tbl.rows = tbl.querySelectorAll("tr");
tbl.rows.forEach(row => row.cells = row.querySelectorAll("td, th"));
\n\
/* generate workbook */
const workbook = XLSX.utils.table_to_book(tbl);
// highlight-end
XLSX.writeFile(workbook, "SheetJSDenoDOM.xlsx");`}
</CodeBlock>

<details>
  <summary><b>Complete Demo</b> (click to show)</summary>

:::note Tested Deployments

This demo was tested in the following deployments:

| Architecture | DenoDOM | Deno   | Date       |
|:-------------|:--------|:-------|:-----------|
| `darwin-x64` | 0.1.46  | 1.44.4 | 2024-06-19 |
| `darwin-arm` | 0.1.46  | 1.44.4 | 2024-06-19 |

:::

1) Save the previous codeblock to `SheetJSDenoDOM.ts`.

2) Run the script with the `--allow-net` and `--allow-write` permission flags:

```bash
deno run --allow-net --allow-write SheetJSDenoDOM.ts
```

The script will create a file `SheetJSDenoDOM.xlsx` that can be opened in a
spreadsheet editor.

</details>

[^1]: See [`table_to_sheet` in "HTML" Utilities](/docs/api/utilities/html#create-new-sheet)
[^2]: See ["Worksheet Object" in "SheetJS Data Model"](/docs/csf/sheet) for more details.
[^3]: See [`table_to_book` in "HTML" Utilities](/docs/api/utilities/html#create-new-sheet)
[^4]: See ["Workbook Object" in "SheetJS Data Model"](/docs/csf/book) for more details.
[^5]: See [`sheet_add_dom` in "HTML" Utilities](/docs/api/utilities/html#add-to-sheet)