docs.sheetjs.com/docz/docs/03-demos/03-net/08-headless/index.md
2025-01-05 21:51:20 -05:00

---
title: Browser Automation
pagination_prev: demos/net/email/index
pagination_next: demos/net/dom
---

import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

Headless automation involves controlling "headless browsers" to access websites and submit or download data. It is also possible to automate browsers using custom browser extensions.

The SheetJS standalone scripts can be added to any website by inserting a SCRIPT tag. Headless browsers usually provide utility functions for running custom snippets in the browser and passing data back to the automation script.

Use Case

This demo focuses on exporting table data to a workbook. Headless browsers do not generally support passing objects between the browser context and the automation script, so the file data must be generated in the browser context and sent back to the automation script for saving in the file system.
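
To see why the file is generated in the browser context, it helps to look at what a serialization boundary does to binary data. The sketch below uses a JSON round trip as a stand-in for that boundary. This is an assumption for illustration only, since each automation tool has its own serialization rules, and the helper name `crossBoundary` is hypothetical.

```javascript
/* Stand-in for the browser-to-controller boundary (assumption: JSON-like) */
function crossBoundary(value) {
  return JSON.parse(JSON.stringify(value));
}

/* first four bytes of a ZIP container, used here as sample data */
const bytes = new Uint8Array([0x50, 0x4b, 0x03, 0x04]);

/* typed arrays come back as plain objects, not as Uint8Array */
const arr = crossBoundary(bytes);
console.log(arr instanceof Uint8Array); // false

/* a binary string (one character per byte) survives unchanged */
const bstr = String.fromCharCode.apply(null, bytes);
const str = crossBoundary(bstr);
console.log(str === bstr); // true
console.log(str.length); // 4
```

Strings cross this kind of boundary intact, which is one reason to return a string rather than a byte array from the page.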

This demo exports data from https://sheetjs.com/demos/table.

:::note pass

It is also possible to parse files from the browser context, but parsing from the automation context is more efficient and strongly recommended.

:::

Key Steps

sequenceDiagram
  autonumber off
  actor U as User
  participant C as Controller
  participant B as Browser
  U->>C: run script
  rect rgba(255, 0, 0, 0.25)
    C->>B: launch browser
    C->>B: load URL
  end
  rect rgba(0, 127, 0, 0.25)
    C->>B: add SheetJS script
  end
  rect rgba(255, 0, 0, 0.25)
    C->>B: ask for file
    Note over B: scrape tables
    Note over B: generate workbook
    B->>C: file bytes
  end
  rect rgba(0, 127, 0, 0.25)
    C->>U: save file
  end

1. Launch the headless browser and load the target site.

2. Add the standalone SheetJS build to the page in a SCRIPT tag.

3. Add a script to the page (in the browser context) that will:

- Make a SheetJS workbook object[^1] from the first table using the SheetJS `table_to_book`[^2] method.
- Generate the bytes for an XLSB file using the SheetJS `write`[^3] method.
- Send the bytes back to the automation script.

4. When the automation context receives data, save to a file.

Puppeteer

Puppeteer enables headless Chromium automation for NodeJS and BunJS. Releases ship with a script that installs a headless browser.

Binary strings are the favored data type. They can be safely passed from the browser context to the automation script. NodeJS provides an API to write binary strings to file (`fs.writeFileSync` using encoding `binary`).

The key steps are commented below:

<CodeBlock language="js" title="SheetJSPuppeteer.js">{`\
const fs = require("fs");
const puppeteer = require('puppeteer');
(async () => {
  /* (1) Load the target page */
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.on("console", msg => console.log("PAGE LOG:", msg.text()));
  await page.setViewport({width: 1920, height: 1080});
  await page.goto('https://sheetjs.com/demos/table');
\n\
  /* (2) Load the standalone SheetJS build from the CDN */
  await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js' });
\n\
  /* (3) Run the snippet in browser and return data */
  const bin = await page.evaluate(() => {
    /* NOTE: this function will be evaluated in the browser context.
       \`page\`, \`fs\` and \`puppeteer\` are not available.
       \`XLSX\` will be available thanks to step 2 */
\n\
    /* find first table */
    var table = document.body.getElementsByTagName('table')[0];
\n\
    /* call table_to_book on first table */
    var wb = XLSX.utils.table_to_book(table);
\n\
    /* generate XLSB and return binary string */
    return XLSX.write(wb, {type: "binary", bookType: "xlsb"});
  });
\n\
  /* (4) write data to file */
  fs.writeFileSync("SheetJSPuppeteer.xlsb", bin, { encoding: "binary" });
\n\
  await browser.close();
})();`}
</CodeBlock>
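
The `binary` encoding in step 4 maps each character code of the string to one byte. The following sketch runs in NodeJS without a browser: it writes a short hand-made binary string to a temporary file and reads the bytes back. The sample string and the file name `SheetJSSketch.bin` are made up for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

/* sample binary string: every character code is in the range 0-255 */
const bin = "PK\x03\x04\xff";

/* write with the "binary" (latin1) encoding: one byte per character */
const file = path.join(os.tmpdir(), "SheetJSSketch.bin");
fs.writeFileSync(file, bin, { encoding: "binary" });

/* read the raw bytes back and compare with the character codes */
const buf = fs.readFileSync(file);
console.log(buf.length); // 5
console.log(buf[4]); // 255

/* with the default UTF-8 encoding, "\xff" would take 2 bytes */
console.log(Buffer.byteLength(bin, "utf8")); // 6

/* remove the temporary file */
fs.unlinkSync(file);
```

Omitting the encoding option would therefore change the bytes on disk for any character code above 127, so the option matters for workbook data.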

Demo

:::note Tested Deployments

This demo was tested in the following deployments:

| Puppeteer | Date       |
|:----------|:-----------|
| 23.11.1   | 2024-12-31 |
| 22.15.0   | 2024-12-31 |
| 21.11.0   | 2024-12-31 |
| 20.9.0    | 2024-12-31 |
| 15.5.0    | 2024-12-31 |
| 10.4.0    | 2024-12-31 |

:::

  1. Install SheetJS and Puppeteer:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz puppeteer@23.11.1`}
</CodeBlock>

<CodeBlock language="bash">{`\
bun install https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz puppeteer@23.11.1`}
</CodeBlock>

  2. Save the SheetJSPuppeteer.js code snippet to SheetJSPuppeteer.js.

  3. Run the script:

node SheetJSPuppeteer.js
bun SheetJSPuppeteer.js

When the script finishes, the file SheetJSPuppeteer.xlsb will be created. This file can be opened with a spreadsheet editor that supports XLSB workbooks.
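
As a quick sanity check before opening the file: XLSB workbooks are ZIP containers, so the output should start with the bytes `PK\x03\x04`. The sketch below uses a hypothetical helper, `looksLikeZip`, that is not part of SheetJS or Puppeteer. It only inspects the signature and does not validate the workbook.

```javascript
const fs = require("fs");

/* hypothetical helper: true if the data starts with the ZIP local file header signature */
function looksLikeZip(buf) {
  return buf.length >= 4 &&
    buf[0] === 0x50 && buf[1] === 0x4b && buf[2] === 0x03 && buf[3] === 0x04;
}

/* check the file written by the demo if it exists in the current directory */
const name = "SheetJSPuppeteer.xlsb";
if (fs.existsSync(name)) {
  const ok = looksLikeZip(fs.readFileSync(name));
  console.log(name, ok ? "looks like a ZIP container" : "does not look like a ZIP container");
} else {
  console.log(name, "not found; run the demo script first");
}
```

A failed check often means the string was written without the `binary` encoding or that the page returned something other than the workbook data.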

Playwright

Playwright presents a unified scripting framework for Chromium, WebKit, and other browsers. It draws inspiration from Puppeteer. In fact, the example code is almost identical!

Differences from the Puppeteer example are highlighted below:

<CodeBlock language="js" title="SheetJSPlaywright.js">{`\
const fs = require("fs");
// highlight-next-line
const { webkit } = require('playwright'); // import desired browser
(async () => {
  /* (1) Load the target page */
  // highlight-next-line
  const browser = await webkit.launch(); // launch desired browser
  const page = await browser.newPage();
  page.on("console", msg => console.log("PAGE LOG:", msg.text()));
  // highlight-next-line
  await page.setViewportSize({width: 1920, height: 1080}); // different name :(
  await page.goto('https://sheetjs.com/demos/table');
\n\
  /* (2) Load the standalone SheetJS build from the CDN */
  await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js' });
\n\
  /* (3) Run the snippet in browser and return data */
  const bin = await page.evaluate(() => {
    /* NOTE: this function will be evaluated in the browser context.
       \`page\`, \`fs\` and the browser engine are not available.
       \`XLSX\` will be available thanks to step 2 */
\n\
    /* find first table */
    var table = document.body.getElementsByTagName('table')[0];
\n\
    /* call table_to_book on first table */
    var wb = XLSX.utils.table_to_book(table);
\n\
    /* generate XLSB and return binary string */
    return XLSX.write(wb, {type: "binary", bookType: "xlsb"});
  });
\n\
  /* (4) write data to file */
  fs.writeFileSync("SheetJSPlaywright.xlsb", bin, { encoding: "binary" });
\n\
  await browser.close();
})();`}
</CodeBlock>

Demo

:::note Tested Deployments

This demo was tested in the following deployments:

| Playwright | Date       |
|:-----------|:-----------|
| 1.49.1     | 2024-12-31 |

:::

  1. Install SheetJS and Playwright:

<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz playwright@1.49.1`}
</CodeBlock>

<CodeBlock language="bash">{`\
bun install https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz playwright@1.49.1`}
</CodeBlock>

  2. Save the SheetJSPlaywright.js code snippet to SheetJSPlaywright.js.

  3. Run the script:

node SheetJSPlaywright.js
bun SheetJSPlaywright.js

When the script finishes, the file SheetJSPlaywright.xlsb will be created. This file can be opened with a spreadsheet editor that supports XLSB workbooks.

:::caution pass

The command may fail with a message such as:

╔═════════════════════════════════════════════════════════════════════════╗
║ Looks like Playwright Test or Playwright was just installed or updated. ║
║ Please run the following command to download new browsers:              ║
║                                                                         ║
║     npx playwright install                                              ║
║                                                                         ║
║ <3 Playwright Team                                                      ║
╚═════════════════════════════════════════════════════════════════════════╝

Running the recommended command will download and install browser engines:

npx playwright install

After installing engines, re-run the script.

:::

PhantomJS

PhantomJS is a headless web browser powered by WebKit.

:::danger pass

This information is provided for legacy deployments. PhantomJS development has been suspended and there are known vulnerabilities, so new projects should use alternatives. For WebKit automation, new projects should use Playwright.

:::

Binary strings are the favored data type. They can be safely passed from the browser context to the automation script. PhantomJS provides an API to write binary strings to file (`fs.write` using mode `wb`).

Integration Details and Demo (click to show)

The steps are marked in the comments:

<CodeBlock language="js" title="SheetJSPhantom.js">{`\
var page = require('webpage').create();
page.onConsoleMessage = function(msg) { console.log(msg); };
\n\
/* (1) Load the target page */
page.open('https://sheetjs.com/demos/table', function() {
\n\
  /* (2) Load the standalone SheetJS build from the CDN */
  page.includeJs("https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js", function() {
\n\
    /* (3) Run the snippet in browser and return data */
    var bin = page.evaluateJavaScript([
      "function(){",
\n\
        /* find first table */
        "var table = document.body.getElementsByTagName('table')[0];",
\n\
        /* call table_to_book on first table */
        "var wb = XLSX.utils.table_to_book(table);",
\n\
        /* generate XLSB file and return binary string */
        "return XLSX.write(wb, {type: 'binary', bookType: 'xlsb'});",
      "}"
    ].join(""));
\n\
    /* (4) write data to file */
    require("fs").write("SheetJSPhantomJS.xlsb", bin, "wb");
\n\
    phantom.exit();
  });
});`}
</CodeBlock>

:::caution pass

PhantomJS is very finicky and will hang if there are script errors. It is strongly recommended to add verbose logging and to lint scripts before use.

:::
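
One lightweight way to catch syntax errors before PhantomJS sees them is to parse the joined snippet in NodeJS first. The sketch below is an assumption about workflow and is not part of the original demo. The helper `checkSnippet` is hypothetical, and NodeJS accepts newer syntax than PhantomJS, so a passing check does not guarantee that PhantomJS will accept the script.

```javascript
/* hypothetical helper: returns null if the function source parses, or the error message */
function checkSnippet(src) {
  try {
    /* the parentheses make the source parse as a function expression; it is never called */
    new Function("return (" + src + ");");
    return null;
  } catch (e) {
    return e.message;
  }
}

/* same array-join pattern as the PhantomJS script */
const good = [
  "function(){",
    "var table = document.body.getElementsByTagName('table')[0];",
    "var wb = XLSX.utils.table_to_book(table);",
    "return XLSX.write(wb, {type: 'binary', bookType: 'xlsb'});",
  "}"
].join("");
console.log(checkSnippet(good)); // null

/* dropping the closing brace yields an error message here */
const bad = good.slice(0, -1);
console.log(checkSnippet(bad) !== null); // true
```

Because the snippet is only parsed and never run, names such as `document` and `XLSX` do not need to exist in NodeJS.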

Demo

:::note Tested Deployments

This demo was tested in the following environments:

| Architecture | PhantomJS | Date       |
|:-------------|:----------|:-----------|
| darwin-x64   | 2.1.1     | 2024-12-17 |
| win11-x64    | 2.1.1     | 2024-05-22 |
| linux-x64    | 2.1.1     | 2024-04-25 |

:::

  1. Download and extract PhantomJS.

  2. Save the SheetJSPhantom.js code snippet to SheetJSPhantom.js.

  3. Run the phantomjs program and pass the script as the first argument.

For example, if the macOS Archive Utility unzipped the 2.1.1 release, binaries will be placed in phantomjs-2.1.1-macosx/bin/ and the command will be:

./phantomjs-2.1.1-macosx/bin/phantomjs SheetJSPhantom.js

When the script finishes, the file SheetJSPhantomJS.xlsb will be created. This file can be opened with Excel.

:::caution pass

When this demo was last tested on Linux, there were multiple errors.

This application failed to start because it could not find or load the Qt platform plugin "xcb".

Setting the environment variable `QT_QPA_PLATFORM=phantom` resolves this issue, but a different error then appears:

140412268664640:error:25066067:DSO support routines:DLFCN_LOAD:could not load the shared library:dso_dlfcn.c:185:filename(libproviders.so): libproviders.so: cannot open shared object file: No such file or directory
140412268664640:error:25070067:DSO support routines:DSO_load:could not load the shared library:dso_lib.c:244:
140412268664640:error:0E07506E:configuration file routines:MODULE_LOAD_DSO:error loading dso:conf_mod.c:285:module=providers, path=providers
140412268664640:error:0E076071:configuration file routines:MODULE_RUN:unknown module name:conf_mod.c:222:module=providers

This error is resolved by ignoring SSL errors. The complete command is:

env OPENSSL_CONF=/dev/null QT_QPA_PLATFORM=phantom ./phantomjs-2.1.1-linux-x86_64/bin/phantomjs --ignore-ssl-errors=true SheetJSPhantom.js

:::


[^1]: See "Workbook Object" for more details about the SheetJS workbook object.

[^2]: See `table_to_book` in "HTML" Utilities.

[^3]: See `write` in "Writing Files".