---
title: Synthetic DOM
---

import current from '/version.js';
import CodeBlock from '@theme/CodeBlock';

SheetJS offers three methods to directly process HTML DOM TABLE elements[^1]:

- `table_to_sheet` generates a SheetJS worksheet[^2] from a TABLE element
- `table_to_book` generates a SheetJS workbook[^3] from a TABLE element
- `sheet_add_dom` adds data from a TABLE element to an existing worksheet

These methods work in the web browser. NodeJS and other server-side platforms
traditionally lack a DOM implementation, but third-party modules fill the gap.

:::tip pass

The most robust approach for server-side processing is to automate a headless
web browser. ["Browser Automation"](/docs/demos/net/headless) includes demos.

:::

This demo covers synthetic DOM implementations for non-browser platforms.

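The relationship between the three methods can be sketched as follows. This is a minimal sketch under stated assumptions, not the library implementation: `export_tables` is a hypothetical helper, the `XLSX` argument is assumed to be the SheetJS module, and `tbl1` and `tbl2` are assumed to be TABLE elements from a compatible DOM. The `origin: -1` option asks `sheet_add_dom` to append below the existing data.

```javascript
/* hypothetical helper: `XLSX` is the SheetJS module, `tbl1` and `tbl2` are TABLE elements */
function export_tables(XLSX, tbl1, tbl2) {
  /* table_to_sheet: one TABLE element -> one worksheet */
  const ws = XLSX.utils.table_to_sheet(tbl1);

  /* sheet_add_dom: write a second TABLE into the same worksheet.
     origin: -1 starts writing at the first row after the existing data */
  XLSX.utils.sheet_add_dom(ws, tbl2, { origin: -1 });

  /* wrap the worksheet in a new workbook */
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, "Tables");
  return wb;
}
```

When one table maps to one single-sheet workbook, `table_to_book` covers the `table_to_sheet` and workbook steps in a single call.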
## Integration Details

The SheetJS DOM methods rely on DOM features that some synthetic DOM
implementations do not provide. The missing properties can be patched.

### Table rows

The `rows` property of TABLE elements is a list of TR row children. This list
automatically updates when rows are added and deleted.

SheetJS does not mutate `rows`, so a static array suffices. Assuming there are
no nested tables, the `rows` property can be created using
`getElementsByTagName`:

```js
tbl.rows = Array.from(tbl.getElementsByTagName("tr"));
```

### Row cells

The `cells` property of TR elements is a list of TD cell children. This list
automatically updates when cells are added and deleted.

SheetJS does not mutate `cells`, so a static array suffices. Assuming there are
no nested tables, the `cells` property can be created using
`getElementsByTagName`:

```js
tbl.rows.forEach(row => row.cells = Array.from(row.getElementsByTagName("td")));
```
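
Both patches can be combined into one helper. The following is a minimal sketch: `patch_table` is a hypothetical name, and the sketch assumes that `childNodes` is array-like in the target DOM. Unlike the snippets above, it also collects TH header cells and only inspects direct children of each row. Those choices are assumptions that go beyond the original patches.

```javascript
/* hypothetical helper: add `rows` and `cells` to a TABLE element when missing */
function patch_table(tbl) {
  /* keep native implementations (browsers, JSDOM) untouched */
  if(tbl.rows == null) tbl.rows = Array.from(tbl.getElementsByTagName("tr"));

  Array.from(tbl.rows).forEach(row => {
    if(row.cells != null) return;
    /* direct element children of the row that are TD or TH elements */
    row.cells = Array.from(row.childNodes).filter(node =>
      node.nodeType == 1 && /^t[dh]$/i.test(node.tagName)
    );
  });
  return tbl;
}
```

The patched element can then be passed to `XLSX.utils.table_to_book` or the other DOM methods.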

## NodeJS

### JSDOM

JSDOM is a DOM implementation for NodeJS. The synthetic DOM elements are
compatible with SheetJS methods.

The following example scrapes the first table from the file `SheetJSTable.html`
and generates an XLSX workbook:

```js title="SheetJSDOM.js"
const XLSX = require("xlsx");
const { readFileSync } = require("fs");
const { JSDOM } = require("jsdom");

/* obtain HTML string. This example reads from SheetJSTable.html */
const html_str = readFileSync("SheetJSTable.html", "utf8");

// highlight-start
/* get first TABLE element */
const doc = new JSDOM(html_str).window.document.querySelector("table");

/* generate workbook */
const workbook = XLSX.utils.table_to_book(doc);
// highlight-end

XLSX.writeFile(workbook, "SheetJSDOM.xlsx");
```

<details><summary><b>Complete Demo</b> (click to show)</summary>

:::note

This demo was last tested on 2023 September 10 against JSDOM `22.1.0`.

:::

1) Install SheetJS and JSDOM libraries:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz jsdom@22.1.0`}
</CodeBlock>

2) Save the previous codeblock to `SheetJSDOM.js`.

3) Download [the sample `SheetJSTable.html`](pathname:///dom/SheetJSTable.html):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSTable.html
```

4) Run the script:

```bash
node SheetJSDOM.js
```

The script will create a file `SheetJSDOM.xlsx` that can be opened.

</details>
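
The JSDOM example exports only the first table. A page with several tables could be handled along the following lines. This is a sketch rather than part of the tested demo: `tables_to_book` is a hypothetical helper, the `XLSX` argument is assumed to be the SheetJS module, `document` is assumed to be a DOM document such as `new JSDOM(html_str).window.document`, and the `Table1`, `Table2`, ... sheet names are arbitrary.

```javascript
/* hypothetical helper: `XLSX` is the SheetJS module, `document` is a DOM document */
function tables_to_book(XLSX, document) {
  const wb = XLSX.utils.book_new();
  Array.from(document.querySelectorAll("table")).forEach((tbl, idx) => {
    /* one worksheet per TABLE element, named Table1, Table2, ... */
    const ws = XLSX.utils.table_to_sheet(tbl);
    XLSX.utils.book_append_sheet(wb, ws, "Table" + (idx + 1));
  });
  return wb;
}
```

If the document has no tables, the returned workbook has no worksheets. Callers should check `wb.SheetNames.length` before writing the file.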

### HappyDOM

HappyDOM provides a DOM framework for NodeJS. For the tested version (`11.0.2`),
the following patches were needed:

- TABLE `rows` property (explained above)
- TR `cells` property (explained above)

<details><summary><b>Complete Demo</b> (click to show)</summary>

:::note

This demo was last tested on 2023 September 10 against HappyDOM `11.0.2`.

:::

1) Install SheetJS and HappyDOM libraries:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz happy-dom@11.0.2`}
</CodeBlock>

2) Download [the sample script `SheetJSHappyDOM.js`](pathname:///dom/SheetJSHappyDOM.js):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSHappyDOM.js
```

3) Run the script:

```bash
node SheetJSHappyDOM.js
```

The script will create a file `SheetJSHappyDOM.xlsx` that can be opened.

</details>

### XMLDOM

XMLDOM provides a DOM framework for NodeJS. For the tested version (`0.8.10`),
the following patches were needed:

- TABLE `rows` property (explained above)
- TR `cells` property (explained above)
- Element `innerHTML` property:

```js
/* define an `innerHTML` getter on the element prototype */
Object.defineProperty(tbl.__proto__, "innerHTML", { get: function() {
  /* serialize the element, including its own start and end tags */
  var outerHTML = new XMLSerializer().serializeToString(this);
  /* a single `<` means the element is self-closing and has no content */
  if(outerHTML.match(/</g).length == 1) return "";
  /* drop the end tag, then the start tag (attribute values may contain `>`) */
  return outerHTML.slice(0, outerHTML.lastIndexOf("</")).replace(/<[^"'>]*(("[^"]*"|'[^']*')[^"'>]*)*>/, "");
}});
```
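
The string manipulation in the getter can be studied in isolation. The following sketch moves the same logic into a standalone function. `inner_from_outer` is a hypothetical name, and the `|| []` guard is an addition that avoids a `null` match when the input has no tags.

```javascript
/* hypothetical helper: derive the inner markup from a serialized element */
function inner_from_outer(outerHTML) {
  /* at most one `<`: self-closing element, so there is no inner markup */
  if((outerHTML.match(/</g) || []).length <= 1) return "";
  return outerHTML
    /* drop the end tag: everything from the last `</` onwards */
    .slice(0, outerHTML.lastIndexOf("</"))
    /* drop the start tag; quoted attribute values may contain `>` */
    .replace(/<[^"'>]*(("[^"]*"|'[^']*')[^"'>]*)*>/, "");
}

console.log(inner_from_outer('<td/>'));                             // prints an empty line
console.log(inner_from_outer('<td id="a>b">Hi <b>there</b></td>')); // prints: Hi <b>there</b>
```

Only the first start tag is removed because the regular expression has no `g` flag, so nested elements are preserved.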

<details><summary><b>Complete Demo</b> (click to show)</summary>

:::note

This demo was last tested on 2023 September 10 against XMLDOM `0.8.10`.

:::

1) Install SheetJS and XMLDOM libraries:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz @xmldom/xmldom@0.8.10`}
</CodeBlock>

2) Download [the sample script `SheetJSXMLDOM.js`](pathname:///dom/SheetJSXMLDOM.js):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSXMLDOM.js
```

3) Run the script:

```bash
node SheetJSXMLDOM.js
```

The script will create a file `SheetJSXMLDOM.xlsx` that can be opened.

</details>

### CheerioJS

:::caution pass

Cheerio does not support a number of fundamental properties out of the box. They
can be shimmed, but it is strongly recommended to use a more compliant library.

:::

CheerioJS provides a DOM-like framework for NodeJS. Many features that the
SheetJS DOM methods expect are missing.
[`SheetJSCheerio.js`](pathname:///dom/SheetJSCheerio.js) implements the missing
features to ensure that SheetJS DOM methods can process TABLE elements.

<details><summary><b>Complete Demo</b> (click to show)</summary>

:::note

This demo was last tested on 2023 September 10 against Cheerio `1.0.0-rc.12`.

:::

1) Install SheetJS and CheerioJS libraries:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz cheerio@1.0.0-rc.12`}
</CodeBlock>

2) Download [the sample script `SheetJSCheerio.js`](pathname:///dom/SheetJSCheerio.js):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSCheerio.js
```

3) Download [the sample `SheetJSTable.html`](pathname:///dom/SheetJSTable.html):

```bash
curl -LO https://docs.sheetjs.com/dom/SheetJSTable.html
```

4) Run the script:

```bash
node SheetJSCheerio.js
```

The script will create a file `SheetJSCheerio.xlsx` that can be opened.

</details>

## Other Platforms

### DenoDOM

DenoDOM provides a DOM framework for Deno. For the tested version (`0.1.38`),
the following patches were needed:

- TABLE `rows` property (explained above)
- TR `cells` property (explained above)

This example fetches [a sample table](pathname:///dom/SheetJSTable.html):

<CodeBlock language="ts" title="SheetJSDenoDOM.ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
\n\
import { DOMParser } from 'https://deno.land/x/deno_dom@v0.1.38/deno-dom-wasm.ts';
\n\
const doc = new DOMParser().parseFromString(
  await (await fetch('https://docs.sheetjs.com/dom/SheetJSTable.html')).text(),
  "text/html",
)!;
// highlight-start
const tbl = doc.querySelector("table");
\n\
/* patch DenoDOM element */
tbl.rows = tbl.querySelectorAll("tr");
tbl.rows.forEach(row => row.cells = row.querySelectorAll("td, th"));
\n\
/* generate workbook */
const workbook = XLSX.utils.table_to_book(tbl);
// highlight-end
XLSX.writeFile(workbook, "SheetJSDenoDOM.xlsx");`}
</CodeBlock>

<details><summary><b>Complete Demo</b> (click to show)</summary>

:::note

This demo was last tested on 2023 September 10 against DenoDOM `0.1.38`.

:::

1) Save the previous codeblock to `SheetJSDenoDOM.ts`.

2) Run the script with the `--allow-net` and `--allow-write` permissions:

```bash
deno run --allow-net --allow-write SheetJSDenoDOM.ts
```

The script will create a file `SheetJSDenoDOM.xlsx` that can be opened.

</details>

[^1]: See ["HTML Table Input" in "Utilities"](/docs/api/utilities/html#html-table-input)
[^2]: See ["Worksheet Object" in "SheetJS Data Model"](/docs/csf/book) for more details.
[^3]: See ["Workbook Object" in "SheetJS Data Model"](/docs/csf/book) for more details.