15 KiB
title | sidebar_label | description | pagination_prev | pagination_next | sidebar_custom_props | ||
---|---|---|---|---|---|---|---|
Spreadsheet Processing in Mathematica | Mathematica | Build complex data pipelines in Mathematica Notebooks. Seamlessly create datasets with SheetJS. Leverage the Mathematica ecosystem to analyze data from Excel workbooks. | demos/cloud/index | demos/bigdata/index |
|
import current from '/version.js'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import CodeBlock from '@theme/CodeBlock';
Mathematica is a software system for mathematics and scientific computing. It supports command-line tools and JavaScript extensions.
SheetJS is a JavaScript library for reading and writing data from spreadsheets.
This demo uses SheetJS to pull data from a spreadsheet for further analysis within Mathematica. We'll explore how to run an external tool to generate CSV data from opaque spreadsheets and parse the data from Mathematica.
:::note Tested Deployments
This demo was tested by SheetJS users in the following deployments:
Architecture | Version | Date |
---|---|---|
darwin-x64 |
14.0 |
2024-06-05 |
win10-x64 |
14.0 |
2024-06-05 |
:::
Integration Details
The SheetJS NodeJS module can be
loaded in NodeJS scripts, including scripts invoked using the "NodeJS"
mode
of the ExternalEvaluate
1 Mathematica function.
However, the current cross-platform recommendation involves a dedicated command line tool that leverages SheetJS libraries to to perform spreadsheet processing.
External Engines
The following diagram depicts the workbook waltz:
flowchart LR
subgraph `ExternalEvaluate`
file[(workbook\nfile)]
csvstr(CSV\nString)
end
data[(Dataset)]
file --> |NodeJS\nSheetJS Ops| csvstr
csvstr --> |ImportString\nMathematica| data
Mathematica
NodeJS can be activated from Mathematica using RegisterExternalEvaluator
2.
Once activated, JavaScript code can be run using ExternalEvaluate
3. If the
NodeJS code returns CSV data, ImportString
4 can generate a Dataset
5.
SheetJS
For a file residing on the filesystem, the SheetJS readFile
function6 can
generate a workbook object. The exact location can be determined by printing
require("process").cwd()
7 in ExternalEvaluate
:
In[1]:= ExternalEvaluate["NodeJS", "require('process').cwd()"]
Out[1]= "C:\Users\Me\Documents"
After pulling the first worksheet8, the SheetJS sheet_to_csv
function9
generates a CSV string.
Complete Function
The following function reads a file, parses the first worksheet and returns a Dataset object assuming one header row.
(* Import file stored in the Documents folder (e.g. C:\Users\Me\Documents) *)
SheetJSImportFileEE[filename_]:=Module[{csv}, (
(* This was required in local testing *)
RegisterExternalEvaluator["NodeJS","/usr/local/bin/node"];
(* Generate CSV from first sheet *)
csv:=ExternalEvaluate["NodeJS", StringJoin[
(* module installed in home directory *)
"var XLSX = require('xlsx');",
(* read specified filename *)
"var wb = XLSX.readFile('",filename,"');",
(* grab first worksheet *)
"var ws = wb.Sheets[wb.SheetNames[0]];",
(* convert to CSV *)
"XLSX.utils.sheet_to_csv(ws)"
]];
(* Parse CSV into a dataset *)
Return[ImportString[csv, "Dataset", "HeaderLines"->1]];
)]
(* Import file stored in the Documents folder (e.g. C:\Users\Me\Documents) *)
SheetJSImportFileEE[filename_]:=Module[{csv}, (
(* This was required in local testing *)
RegisterExternalEvaluator["NodeJS","C:\\Program Files\\nodejs\\node.exe"];
(* Generate CSV from first sheet *)
csv:=ExternalEvaluate["NodeJS", StringJoin[
(* module installed in home directory *)
"var XLSX = require('xlsx');",
(* read specified filename *)
"var wb = XLSX.readFile('",filename,"');",
(* grab first worksheet *)
"var ws = wb.Sheets[wb.SheetNames[0]];",
(* convert to CSV *)
"XLSX.utils.sheet_to_csv(ws)"
]];
(* Parse CSV into a dataset *)
Return[ImportString[csv, "Dataset", "HeaderLines"->1]];
)]
Command-Line Tools
The "Command-Line Tools" demo creates xlsx-cli
, a
command-line tool that reads a spreadsheet file and generates CSV rows from the
first worksheet.
ExternalEvaluate
10 can run command-line tools and capture standard output.
The following snippet processes ~/Downloads/pres.numbers
and pulls CSV data
into a variable in Mathematica:
cmd = "/usr/local/bin/xlsx-cli ~/Downloads/pres.numbers"
csvdata = ExternalEvaluate["Shell" -> "StandardOutput", cmd];
ImportString
11 can interpret the CSV data as a Dataset
12. Typically the
first row of the CSV output is the header row. The HeaderLines
13 option
controls how Mathematica parses the data:
data = ImportString[csvdata, "Dataset", "HeaderLines" -> 1]
The following diagram depicts the workbook waltz:
flowchart LR
subgraph `ExternalEvaluate`
file[(workbook\nfile)]
csvstr(CSV\nString)
end
data[(Dataset)]
file --> |`xlsx-cli`\nSheetJS Ops| csvstr
csvstr --> |ImportString\nMathematica| data
Complete Demo
This demo tests the NodeJS external engine and dedicated command line tools.
NodeJS Engine
-
Install NodeJS. When the demo was tested, version
20.14.0
was installed. -
Install dependencies in the Home folder (
~
or$HOME
or%HOMEPATH%
):
{\ npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz zeromq@6.0.0-beta.19
}
- Open a new Mathematica Notebook and register NodeJS. When the example was tested in Windows, the commands were:
RegisterExternalEvaluator["NodeJS","/usr/local/bin/node"]
FindExternalEvaluators["NodeJS"]
The second argument to RegisterExternalEvaluator
should be the path to the
node
program, which can be found by running the following command in a new
terminal window:
which node
RegisterExternalEvaluator["NodeJS","C:\\Program Files\\nodejs\\node.exe"]
FindExternalEvaluators["NodeJS"]
The second argument to RegisterExternalEvaluator
should be the path to the
node.exe
program, which can be found by running the following command in a new
PowerShell window:
Get-Command node.exe
If NodeJS is registered, the value in the "Registered" column will be "True".
- To determine the base folder, run
require("process").cwd()
from NodeJS:
ExternalEvaluate["NodeJS", "require('process').cwd()"]
-
Download
pres.numbers
and move the file to the base folder as shown in the previous step. -
Copy, but do not run, the following snippet into the running notebook:
(* Import file stored in the Documents folder (e.g. C:\Users\Me\Documents) *)
SheetJSImportFileEE[filename_]:=Module[{csv}, (
(* This was required in local testing *)
/* highlight-next-line */
RegisterExternalEvaluator["NodeJS","/usr/local/bin/node"];
(* Generate CSV from first sheet *)
csv:=ExternalEvaluate["NodeJS", StringJoin[
(* module installed in home directory *)
"var XLSX = require('xlsx');",
(* read specified filename *)
"var wb = XLSX.readFile('",filename,"');",
(* grab first worksheet *)
"var ws = wb.Sheets[wb.SheetNames[0]];",
(* convert to CSV *)
"XLSX.utils.sheet_to_csv(ws)"
]];
(* Parse CSV into a dataset *)
Return[ImportString[csv, "Dataset", "HeaderLines"->1]];
)]
(* Import file stored in the Documents folder (e.g. C:\Users\Me\Documents) *)
SheetJSImportFileEE[filename_]:=Module[{csv}, (
(* This was required in local testing *)
/* highlight-next-line */
RegisterExternalEvaluator["NodeJS","C:\\Program Files\\nodejs\\node.exe"];
(* Generate CSV from first sheet *)
csv:=ExternalEvaluate["NodeJS", StringJoin[
(* module installed in home directory *)
"var XLSX = require('xlsx');",
(* read specified filename *)
"var wb = XLSX.readFile('",filename,"');",
(* grab first worksheet *)
"var ws = wb.Sheets[wb.SheetNames[0]];",
(* convert to CSV *)
"XLSX.utils.sheet_to_csv(ws)"
]];
(* Parse CSV into a dataset *)
Return[ImportString[csv, "Dataset", "HeaderLines"->1]];
)]
After pasting, edit the highlighted line to reflect the path of the node
or
node.exe
binary. This path was discovered in Step 2.
After editing the snippet, run the expression.
- Run the function and confirm the result is a proper Dataset:
SheetJSImportFileEE["pres.numbers"]
Standalone Binary
- Create the standalone
xlsx-cli
binary14. The commands should be run in a Terminal or PowerShell window:
{\ npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz exit-on-epipe commander@2 curl -LO https://docs.sheetjs.com/cli/xlsx-cli.js npx -y nexe -t 14.15.3 xlsx-cli.js
}
- Move the generated
xlsx-cli
to a fixed location in/usr/local/bin
:
mkdir -p /usr/local/bin
mv xlsx-cli /usr/local/bin/
:::note pass
If there are permission errors, the command should be run with the root user:
sudo mv xlsx-cli /usr/local/bin/
:::
- Find the current directory:
pwd
The generated binary will be xlsx-cli.exe
in the displayed path.
Reading a Local File
- In a new Mathematica notebook, run the following snippet:
SheetJSImportFile[x_] := ImportString[Block[{Print}, ExternalEvaluate[
"Shell" -> "StandardOutput",
// highlight-next-line
"/usr/local/bin/xlsx-cli " <> x
]], "Dataset", "HeaderLines" -> 1]
- In a new Mathematica notebook, copy but do not run the following snippet:
SheetJSImportFile[x_] := ImportString[Block[{Print}, ExternalEvaluate[
"Shell" -> "StandardOutput",
// highlight-next-line
"/usr/local/bin/xlsx-cli " <> x
]], "Dataset", "HeaderLines" -> 1]
Change /usr/local/bin/xlsx-cli
in the string to the path to the generated
xlsx-cli.exe
binary. For example, if the path in step 2 was
C:\Users\Me\Documents\
, then the code should be:
SheetJSImportFile[x_] := ImportString[Block[{Print}, ExternalEvaluate[
"Shell" -> "StandardOutput",
// highlight-next-line
"C:\\Users\\Me\\Documents\\xlsx-cli.exe " <> x
]], "Dataset", "HeaderLines" -> 1]
:::info pass
Mathematica requires the \
characters must be doubled.
:::
After making the change, run the snippet.
- Download https://docs.sheetjs.com/pres.numbers and save to Downloads folder:
cd ~/Downloads/
curl -LO https://docs.sheetjs.com/pres.numbers
- In the Mathematica notebook, run the new function. If the file was saved to
the Downloads folder, the path will be
"~/Downloads/pres.numbers"
in macOS:
data = SheetJSImportFile["~/Downloads/pres.numbers"]
On Windows, the absolute path to the file must be used. To find this path, run the following commands in PowerShell:
cd $HOME\Downloads
pwd
Append \\pres.numbers
to the displayed path. For example, if the path was
C:\Users\Me\Downloads
, the command will be
data = SheetJSImportFile["C:\\Users\\Me\\Downloads\\pres.numbers"]
The \
characters must be doubled.
The result should be displayed in a concise table.
Reading from a URL
FetchURL
15 downloads a file from a specified URL and returns a path to the
file. This function will be wrapped in a new function called SheetJSImportURL
.
- In the same notebook, run the following:
Needs["Utilities`URLTools`"];
SheetJSImportURL[x_] := Module[{path},(
path = FetchURL[x];
SheetJSImportFile[path]
)];
- Test by downloading the test file in the notebook:
data = SheetJSImportURL["https://docs.sheetjs.com/pres.numbers"]
-
See the
ExternalEvaluate
Node.js example in the Mathematica documentation. ↩︎ -
See
RegisterExternalEvaluator
in the Mathematica documentation. ↩︎ -
See
ExternalEvaluate
in the Mathematica documentation. ↩︎ -
See
ImportString
in the Mathematica documentation. ↩︎ -
A
Dataset
will be created when using the"Dataset"
element inImportString
↩︎ -
See
process.cwd()
in the NodeJS documentation. ↩︎ -
The
Sheets
andSheetNames
properties of workbook objects are described in "Workbook Object" ↩︎ -
See
ExternalEvaluate
in the Mathematica documentation. ↩︎ -
See
ImportString
in the Mathematica documentation. ↩︎ -
A
Dataset
will be created when using the"Dataset"
element inImportString
↩︎ -
See
HeaderLines
in the Mathematica documentation. ↩︎ -
See "Command-line Tools" for more details. ↩︎