This commit is contained in:
SheetJS 2022-07-07 20:04:16 -04:00
parent 0847c803f2
commit 551cd9f093
6 changed files with 325 additions and 77 deletions

@ -0,0 +1,318 @@
---
sidebar_position: 8
title: Typed Arrays and ML
---
<head>
<script src="https://unpkg.com/@tensorflow/tfjs@3.18.0/dist/tf.min.js"></script>
</head>
Machine learning libraries in JS typically use "Typed Arrays". Typed Arrays are
not JS Arrays! SheetJS expects bona fide JS Arrays. With some data wrangling,
translating between SheetJS worksheets and typed arrays is straightforward.
This demo covers conversions between worksheets and Typed Arrays for use with
[TensorFlow.js](https://js.tensorflow.org/js/) and other ML libraries.
:::note
The live code blocks in this demo load the standalone TensorFlow.js build:
```html
<script src="https://unpkg.com/@tensorflow/tfjs@3.18.0/dist/tf.min.js"></script>
```
:::
## CSV Data Interchange
`tf.data.csv` generates a Dataset from CSV data. The function expects a URL.
Fortunately blob URLs are supported, making data import straightforward:
```js
function worksheet_to_csv_url(worksheet) {
/* generate CSV */
const csv = XLSX.utils.sheet_to_csv(worksheet);
/* CSV -> Uint8Array -> Blob */
const u8 = new TextEncoder().encode(csv);
const blob = new Blob([u8], { type: "text/csv" });
/* generate a blob URL */
return URL.createObjectURL(blob);
}
```
[This demo mirrors TFjs docs](https://js.tensorflow.org/api/latest/#data.csv),
fetching [an XLSX export of the example dataset](https://sheetjs.com/bht.xlsx).
<details><summary><b>TF CSV Demo using XLSX files</b> (click to show)</summary>
```jsx live
function SheetJSToTFJSCSV() {
const [output, setOutput] = React.useState("");
const doit = React.useCallback(async () => {
/* fetch file */
const f = await fetch("https://sheetjs.com/bht.xlsx");
const ab = await f.arrayBuffer();
/* parse file and get first worksheet */
const wb = XLSX.read(ab);
const ws = wb.Sheets[wb.SheetNames[0]];
/* generate CSV */
const csv = XLSX.utils.sheet_to_csv(ws);
/* generate blob URL */
const u8 = new TextEncoder().encode(csv);
const blob = new Blob([u8], {type: "text/csv"});
const url = URL.createObjectURL(blob);
/* feed to tfjs */
const dataset = tf.data.csv(url, {columnConfigs:{"medv":{isLabel:true}}});
/* this part mirrors the tf.data.csv docs */
const flat = dataset.map(({xs,ys}) => ({xs: Object.values(xs), ys: Object.values(ys)})).batch(10);
const model = tf.sequential();
model.add(tf.layers.dense({inputShape: [(await dataset.columnNames()).length - 1], units: 1}));
model.compile({ optimizer: tf.train.sgd(0.000001), loss: 'meanSquaredError' });
let base = output;
await model.fitDataset(flat, { epochs: 10, callbacks: { onEpochEnd: async (epoch, logs) => {
setOutput(base += "\n" + epoch + ":" + logs.loss);
}}});
model.summary();
});
return ( <pre><b><a href="https://js.tensorflow.org/api/latest/#data.csv">Original CSV demo</a></b><br/><br/>
<button onClick={doit}>Click to run</button>
{output}
</pre> );
}
```
</details>
In the other direction, `XLSX.read` will readily parse CSV exports.
## JS Array Interchange
[The official Linear Regression tutorial](https://www.tensorflow.org/js/tutorials/training/linear_regression)
loads data from a JSON file:
```json
[
{
"Name": "chevrolet chevelle malibu",
"Miles_per_Gallon": 18,
"Cylinders": 8,
"Displacement": 307,
"Horsepower": 130,
"Weight_in_lbs": 3504,
"Acceleration": 12,
"Year": "1970-01-01",
"Origin": "USA"
},
{
"Name": "buick skylark 320",
"Miles_per_Gallon": 15,
"Cylinders": 8,
"Displacement": 350,
"Horsepower": 165,
"Weight_in_lbs": 3693,
"Acceleration": 11.5,
"Year": "1970-01-01",
"Origin": "USA"
},
// ...
]
```
In real use cases, data is stored in [spreadsheets](https://sheetjs.com/cd.xls)
![cd.xls screenshot](pathname:///files/cd.png)
Following the tutorial, the data fetching method is easily adapted. Differences
from the official example are highlighted below:
```js
/**
* Get the car data reduced to just the variables we are interested
* and cleaned of missing data.
*/
async function getData() {
// highlight-start
/* fetch file */
const carsDataResponse = await fetch('https://sheetjs.com/cd.xls');
/* get file data (ArrayBuffer) */
const carsDataAB = await carsDataResponse.arrayBuffer();
/* parse */
const carsDataWB = XLSX.read(carsDataAB);
/* get first worksheet */
const carsDataWS = carsDataWB.Sheets[carsDataWB.SheetNames[0]];
/* generate array of JS objects */
const carsData = XLSX.utils.sheet_to_json(carsDataWS);
// highlight-end
const cleaned = carsData.map(car => ({
mpg: car.Miles_per_Gallon,
horsepower: car.Horsepower,
}))
.filter(car => (car.mpg != null && car.horsepower != null));
return cleaned;
}
```
## Low-Level Operations
:::caution
While it is more efficient to use low-level operations, JS or CSV interchange
is strongly recommended when possible.
:::
### Data Transposition
A typical dataset in a spreadsheet will start with one header row and represent
each data record in its own row. For example, the Iris dataset might look like
![Iris dataset](pathname:///files/iris.png)
`XLSX.utils.sheet_to_json` will translate this into an array of row objects:
```js
var aoo = [
{"sepal length": 5.1, "sepal width": 3.5, ...},
{"sepal length": 4.9, "sepal width": 3, ...},
...
];
```
TF.js and other libraries tend to operate on individual columns, equivalent to:
```js
var sepal_lengths = [5.1, 4.9, ...];
var sepal_widths = [3.5, 3, ...];
```
When a 2D tensor can be exported, it will look different from the spreadsheet:
```js
var data_set_2d = [
[5.1, 4.9, ...],
[3.5, 3, ...],
...
]
```
This is the transpose of how people use spreadsheets!
#### Typed Arrays and Columns
A single typed array can be converted to a pure JS array with `Array.from`:
```js
var column = Array.from(dataset_typedarray);
```
Similarly, `Float32Array.from` generates a typed array from a normal array:
```js
var dataset = Float32Array.from(column);
```
### Exporting Datasets to a Worksheet
`XLSX.utils.aoa_to_sheet` can generate a worksheet from an array of arrays.
ML libraries typically provide APIs to pull an array of arrays, but it will
be transponsed
a row-major array of arrays. To export multiple data
sets, "transpose" the data:
```js
/* assuming data is an array of typed arrays */
var aoa = [];
for(var i = 0; i < data.length; ++i) {
for(var j = 0; j < data[i].length; ++j) {
if(!aoa[j]) aoa[j] = [];
aoa[j][i] = data[i][j];
}
}
/* aoa can be directly converted to a worksheet object */
var ws = XLSX.utils.aoa_to_sheet(aoa);
```
### Importing Data from a Spreadsheet
`sheet_to_json` with the option `header:1` will generate a row-major array of
arrays that can be transposed. However, it is more efficient to walk the sheet
manually:
```js
/* find worksheet range */
var range = XLSX.utils.decode_range(ws['!ref']);
var out = []
/* walk the columns */
for(var C = range.s.c; C <= range.e.c; ++C) {
/* create the typed array */
var ta = new Float32Array(range.e.r - range.s.r + 1);
/* walk the rows */
for(var R = range.s.r; R <= range.e.r; ++R) {
/* find the cell, skip it if the cell isn't numeric or boolean */
var cell = ws[XLSX.utils.encode_cell({r:R, c:C})];
if(!cell || cell.t != 'n' && cell.t != 'b') continue;
/* assign to the typed array */
ta[R - range.s.r] = cell.v;
}
out.push(ta);
}
```
If the data set has a header row, the loop can be adjusted to skip those rows.
### TF.js Tensors
A single `Array#map` can pull individual named fields from the result, which
can be used to construct TensorFlow.js tensor objects:
```js
const aoo = XLSX.utils.sheet_to_json(worksheet);
const lengths = aoo.map(row => row["sepal length"]);
const tensor = tf.tensor1d(lengths);
```
`tf.Tensor` objects can be directly transposed using `transpose`:
```js
var aoo = XLSX.utils.sheet_to_json(worksheet);
// "x" and "y" are the fields we want to pull from the data
var data = aoo.map(row => ([row["x"], row["y"]]));
// create a tensor representing two column datasets
var tensor = tf.tensor2d(data).transpose();
// individual columns can be accessed
var col1 = tensor.slice([0,0], [1,tensor.shape[1]]).flatten();
var col2 = tensor.slice([1,0], [1,tensor.shape[1]]).flatten();
```
For exporting, `stack` can be used to linearize the columns:
```js
/* pull data into a Float32Array */
var result = tf.stack([col1, col2]).transpose();
var shape = tensor.shape;
var f32 = tensor.dataSync();
/* construct an array of arrays of the data in spreadsheet order */
var aoa = [];
for(var j = 0; j < shape[0]; ++j) {
aoa[j] = [];
for(var i = 0; i < shape[1]; ++i) aoa[j][i] = f32[j * shape[1] + i];
}
/* add headers to the top */
aoa.unshift(["x", "y"]);
/* generate worksheet */
var worksheet = XLSX.utils.aoa_to_sheet(aoa);
```

@ -11,7 +11,7 @@ The demo projects include small runnable examples and short explainers.
- [`XMLHttpRequest and fetch`](https://github.com/SheetJS/SheetJS/tree/master/demos/xhr/)
- [`Clipboard Data`](./clipboard)
- [`Typed Arrays and Math`](https://github.com/SheetJS/SheetJS/tree/master/demos/array/)
- [`Typed Arrays for Machine Learning`](./ml)
### Frameworks

@ -732,13 +732,15 @@ the optional `opts` argument in more detail.
["Complete Example"](../example) contains a detailed example "Get Data
from a JSON Endpoint and Generate a Workbook"
[`x-spreadsheet`](https://github.com/myliang/x-spreadsheet) is an interactive
data grid for previewing and modifying structured data in the web browser. The
[demo](https://github.com/sheetjs/sheetjs/tree/master/demos/xspreadsheet)
includes a sample script with the `xtos` function for converting from
x-spreadsheet to a workbook. Live Demo: <https://oss.sheetjs.com/sheetjs/x-spreadsheet>
["Typed Arrays and ML"](../getting-started/demos/ml) covers strategies for
creating worksheets from ML library exports (datasets stored in Typed Arrays).
<details>
<summary><b>Records from a database query (SQL or no-SQL)</b> (click to show)</summary>
@ -748,44 +750,6 @@ databases and query results.
</details>
<details>
<summary><b>Numerical Computations with TensorFlow.js</b> (click to show)</summary>
`@tensorflow/tfjs` and other libraries expect data in simple arrays, well-suited
for worksheets where each column is a data vector. That is the transpose of how
most people use spreadsheets, where each row is a vector.
When recovering data from `tfjs`, the returned data points are stored in a typed
array. An array of arrays can be constructed with loops. `Array#unshift` can
prepend a title row before the conversion:
```js
const XLSX = require("xlsx");
const tf = require('@tensorflow/tfjs');
/* suppose xs and ys are vectors (1D tensors) -> tfarr will be a typed array */
const tfdata = tf.stack([xs, ys]).transpose();
const shape = tfdata.shape;
const tfarr = tfdata.dataSync();
/* construct the array of arrays */
const aoa = [];
for(let j = 0; j < shape[0]; ++j) {
aoa[j] = [];
for(let i = 0; i < shape[1]; ++i) aoa[j][i] = tfarr[j * shape[1] + i];
}
/* add headers to the top */
aoa.unshift(["x", "y"]);
/* generate worksheet */
const worksheet = XLSX.utils.aoa_to_sheet(aoa);
```
The [`array` demo](https://github.com/SheetJS/SheetJS/tree/master/demos/array/) shows a complete example.
</details>
## Processing HTML Tables
#### API

@ -450,6 +450,9 @@ simple VueJS 3 data table. It is featured in the
### Example: Data Loading
["Typed Arrays and ML"](../getting-started/demos/ml) covers strategies for
generating typed arrays and tensors from worksheet data.
<details>
<summary><b>Populating a database (SQL or no-SQL)</b> (click to show)</summary>
@ -458,44 +461,7 @@ includes examples of working with databases and query results.
</details>
<details>
<summary><b>Numerical Computations with TensorFlow.js</b> (click to show)</summary>
`@tensorflow/tfjs` and other libraries expect data in simple arrays, well-suited
for worksheets where each column is a data vector. That is the transpose of how
most people use spreadsheets, where each row is a vector.
A single `Array#map` can pull individual named rows from `sheet_to_json` export:
```js
const XLSX = require("xlsx");
const tf = require('@tensorflow/tfjs');
const key = "age"; // this is the field we want to pull
const ages = XLSX.utils.sheet_to_json(worksheet).map(r => r[key]);
const tf_data = tf.tensor1d(ages);
```
All fields can be processed at once using a transpose of the 2D tensor generated
with the `sheet_to_json` export with `header: 1`. The first row, if it contains
header labels, should be removed with a slice:
```js
const XLSX = require("xlsx");
const tf = require('@tensorflow/tfjs');
/* array of arrays of the data starting on the second row */
const aoa = XLSX.utils.sheet_to_json(worksheet, {header: 1}).slice(1);
/* dataset in the "correct orientation" */
const tf_dataset = tf.tensor2d(aoa).transpose();
/* pull out each dataset with a slice */
const tf_field0 = tf_dataset.slice([0,0], [1,tensor.shape[1]]).flatten();
const tf_field1 = tf_dataset.slice([1,0], [1,tensor.shape[1]]).flatten();
```
The [`array` demo](https://github.com/SheetJS/SheetJS/tree/master/demos/array/) shows a complete example.
</details>
## Generating HTML Tables

BIN
docz/static/files/cd.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 57 KiB

BIN
docz/static/files/iris.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 62 KiB