---
title: Sheets in TensorFlow
sidebar_label: TensorFlow.js
pagination_prev: demos/index
pagination_next: demos/frontend/index
---

<head>
  <script src="https://docs.sheetjs.com/tfjs/tf.min.js"></script>
</head>

[TensorFlow.js](https://www.tensorflow.org/js) (shortened to TF.js) is a library
for machine learning in JavaScript.

[SheetJS](https://sheetjs.com) is a JavaScript library for reading and writing
data from spreadsheets.

This demo uses TensorFlow.js and SheetJS to process data in spreadsheets. We'll
explore how to load spreadsheet data into TF.js datasets and how to export
results back to spreadsheets.

- ["CSV Data Interchange"](#csv-data-interchange) uses SheetJS to process sheets
and generate CSV data that TF.js can import.

- ["JS Array Interchange"](#js-array-interchange) uses SheetJS to process
sheets and generate rows of objects that can be post-processed.

:::info pass

Live code blocks in this page use the TF.js `4.14.0` standalone build.

For use in web frameworks, the `@tensorflow/tfjs` module should be used.

For use in NodeJS, the native bindings module is `@tensorflow/tfjs-node`.

:::

:::note Tested Deployments

Each browser demo was tested in the following environments:

| Browser     | TF.js version | Date       |
|:------------|:--------------|:-----------|
| Chrome 119  | `4.14.0`      | 2023-12-09 |
| Safari 16.6 | `4.14.0`      | 2023-12-09 |

:::

## CSV Data Interchange

`tf.data.csv`[^1] generates a Dataset from CSV data. The function expects a URL.

:::note pass

When this demo was last tested, there was no direct method to pass a CSV string
to the underlying parser.

:::

Fortunately, blob URLs are supported.

```mermaid
flowchart LR
  ws((SheetJS\nWorksheet))
  csv(CSV\nstring)
  url{{Data\nURL}}
  dataset[(TF.js\nDataset)]
  ws --> |sheet_to_csv\nSheetJS| csv
  csv --> |JavaScript\nAPIs| url
  url --> |tf.data.csv\nTensorFlow.js| dataset
```

The SheetJS `sheet_to_csv` method[^2] generates a CSV string from a worksheet
object. Using standard JavaScript techniques, a blob URL can be constructed:

```js
function worksheet_to_csv_url(worksheet) {
  /* generate CSV */
  const csv = XLSX.utils.sheet_to_csv(worksheet);

  /* CSV -> Uint8Array -> Blob */
  const u8 = new TextEncoder().encode(csv);
  const blob = new Blob([u8], { type: "text/csv" });

  /* generate a blob URL */
  return URL.createObjectURL(blob);
}
```

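Blob URLs keep a reference to the underlying data until they are revoked. The following sketch is not part of the original demo: it wraps the CSV-to-URL steps in a hypothetical `with_csv_url` helper that revokes the URL once the caller is done. It assumes an environment where `Blob`, `TextEncoder` and `URL.createObjectURL` are available (modern browsers and recent NodeJS versions).

```js
/* hypothetical helper: create a blob URL from CSV text, use it, then revoke it */
async function with_csv_url(csv, callback) {
  /* CSV -> Uint8Array -> Blob -> blob URL (same steps as worksheet_to_csv_url) */
  const u8 = new TextEncoder().encode(csv);
  const blob = new Blob([u8], { type: "text/csv" });
  const url = URL.createObjectURL(blob);
  try {
    /* the callback would typically hand the URL to tf.data.csv and consume the dataset */
    return await callback(url);
  } finally {
    /* release the reference so the blob data can be reclaimed */
    URL.revokeObjectURL(url);
  }
}
```

Since TF.js datasets are generally read lazily, the callback should finish iterating (for example, complete training) before it returns; otherwise the URL may be revoked while it is still needed.
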
### CSV Demo

This demo shows a simple model fitting using the "cars" dataset from TensorFlow.
The [sample XLS file](https://sheetjs.com/data/cd.xls) contains the data. The
data processing mirrors the official "Making Predictions from 2D Data" demo[^3].

```mermaid
flowchart LR
  file[(Remote\nFile)]
  subgraph SheetJS Operations
    ab[(Data\nBytes)]
    wb(((SheetJS\nWorkbook)))
    ws((SheetJS\nWorksheet))
    csv(CSV\nstring)
  end
  subgraph TensorFlow.js Operations
    url{{Data\nURL}}
    dataset[(TF.js\nDataset)]
    results((Results))
  end
  file --> |fetch\n\n| ab
  ab --> |read\n\n| wb
  wb --> |select\nsheet| ws
  ws --> |sheet_to_csv\n\n| csv
  csv --> |JS\nAPI| url
  url --> |tf.data.csv\nTF.js| dataset
  dataset --> |fitDataset\nTF.js| results
```

The demo builds a model for predicting MPG from Horsepower data. It:

- fetches <https://sheetjs.com/data/cd.xls>
- parses the data with the SheetJS `read`[^4] method
- selects the first worksheet[^5] and converts to CSV using `sheet_to_csv`[^6]
- generates a blob URL from the CSV text
- generates a TF.js dataset with `tf.data.csv`[^7] and selects data columns
- builds a model and trains with `fitDataset`[^8]
- predicts MPG from a set of sample inputs and displays results in a table

<details><summary><b>Live Demo</b> (click to show)</summary>

:::caution pass

In some test runs, the results did not make sense given the underlying data.
The dependent and independent variables are expected to be anti-correlated.

**This is a known issue in TF.js and affects the official demos.**

:::

:::caution pass

If the live demo shows a message

```
ReferenceError: tf is not defined
```

please refresh the page. This is a known bug in the documentation generator.

:::

```jsx live
function SheetJSToTFJSCSV() {
  const [output, setOutput] = React.useState("");
  const [results, setResults] = React.useState([]);
  const [disabled, setDisabled] = React.useState(false);

  function worksheet_to_csv_url(worksheet) {
    /* generate CSV */
    const csv = XLSX.utils.sheet_to_csv(worksheet);

    /* CSV -> Uint8Array -> Blob */
    const u8 = new TextEncoder().encode(csv);
    const blob = new Blob([u8], { type: "text/csv" });

    /* generate a blob URL */
    return URL.createObjectURL(blob);
  }

  const doit = React.useCallback(async () => {
    setResults([]); setOutput(""); setDisabled(true);
    try {
      /* fetch file */
      const f = await fetch("https://sheetjs.com/data/cd.xls");
      const ab = await f.arrayBuffer();
      /* parse file and get first worksheet */
      const wb = XLSX.read(ab);
      const ws = wb.Sheets[wb.SheetNames[0]];

      /* generate blob URL */
      const url = worksheet_to_csv_url(ws);

      /* feed to tf.js */
      const dataset = tf.data.csv(url, {
        hasHeader: true,
        configuredColumnsOnly: true,
        columnConfigs:{
          "Horsepower": {required: false, default: 0},
          "Miles_per_Gallon":{required: false, default: 0, isLabel:true}
        }
      });

      /* pre-process data */
      let flat = dataset
        .map(({xs,ys}) =>({xs: Object.values(xs), ys: Object.values(ys)}))
        .filter(({xs,ys}) => [...xs,...ys].every(v => v>0));

      /* normalize manually :( */
      let minX = Infinity, maxX = -Infinity, minY = Infinity, maxY = -Infinity;
      await flat.forEachAsync(({xs, ys}) => {
        minX = Math.min(minX, xs[0]); maxX = Math.max(maxX, xs[0]);
        minY = Math.min(minY, ys[0]); maxY = Math.max(maxY, ys[0]);
      });
      flat = flat.map(({xs, ys}) => ({xs:xs.map(v => (v-minX)/(maxX - minX)),ys:ys.map(v => (v-minY)/(maxY-minY))}));
      flat = flat.batch(32);

      /* build and train model */
      const model = tf.sequential();
      model.add(tf.layers.dense({inputShape: [1], units: 1}));
      model.compile({ optimizer: tf.train.sgd(0.000001), loss: 'meanSquaredError' });
      await model.fitDataset(flat, { epochs: 100, callbacks: { onEpochEnd: async (epoch, logs) => {
        setOutput(`${epoch}:${logs.loss}`);
      }}});

      /* predict values */
      const inp = tf.linspace(0, 1, 9);
      const pred = model.predict(inp);
      const xs = await inp.dataSync(), ys = await pred.dataSync();
      setResults(Array.from(xs).map((x, i) => [ x * (maxX - minX) + minX, ys[i] * (maxY - minY) + minY ]));
      setOutput("");

    } catch(e) { setOutput(`ERROR: ${String(e)}`); } finally { setDisabled(false);}
  });
  return ( <>
    <button onClick={doit} disabled={disabled}>Click to run</button><br/>
    {output && <pre>{output}</pre> || <></>}
    {results.length && <table><thead><tr><th>Horsepower</th><th>MPG</th></tr></thead><tbody>
    {results.map((r,i) => <tr key={i}><td>{r[0]}</td><td>{r[1].toFixed(2)}</td></tr>)}
    </tbody></table> || <></>}
  </> );
}
```

</details>

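The live demo scales both columns to the range `[0, 1]` before training and maps predictions back to the original units. A minimal sketch of that min-max step on plain arrays follows. `minmax_scaler` and the sample values are hypothetical, chosen for illustration, and are not part of the demo.

```js
/* hypothetical helper: build scale / unscale functions from an array of numbers */
function minmax_scaler(values) {
  let min = Infinity, max = -Infinity;
  for (const v of values) { min = Math.min(min, v); max = Math.max(max, v); }
  /* note: span is 0 when every value is identical, which would divide by zero */
  const span = max - min;
  return {
    min, max,
    /* map a raw value into [0, 1] */
    scale: (v) => (v - min) / span,
    /* map a normalized value back to the original units */
    unscale: (n) => n * span + min
  };
}

/* illustrative horsepower values */
const hp = [50, 100, 150, 200];
const scaler = minmax_scaler(hp);
const normalized = hp.map(scaler.scale);         // values between 0 and 1
const restored = normalized.map(scaler.unscale); // back to horsepower units
```

The demo performs the same arithmetic inline with `minX`, `maxX`, `minY` and `maxY`, applying the inverse mapping to the predicted values before display.
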
## JS Array Interchange

[The official Linear Regression tutorial](https://www.tensorflow.org/js/tutorials/training/linear_regression)
loads data from a JSON file:

```json
[
  {
    "Name": "chevrolet chevelle malibu",
    "Miles_per_Gallon": 18,
    "Cylinders": 8,
    "Displacement": 307,
    "Horsepower": 130,
    "Weight_in_lbs": 3504,
    "Acceleration": 12,
    "Year": "1970-01-01",
    "Origin": "USA"
  },
  // ...
]
```

In real use cases, data is often stored in [spreadsheets](https://sheetjs.com/data/cd.xls):

![cd.xls screenshot](pathname:///files/cd.png)

Following the tutorial, the data fetching method can be adapted to handle arrays
of objects, such as those generated by the SheetJS `sheet_to_json` method[^9].

Differences from the official example are highlighted below:

```js
/**
 * Get the car data reduced to just the variables we are interested
 * and cleaned of missing data.
 */
async function getData() {
  // highlight-start
  /* fetch file */
  const carsDataResponse = await fetch('https://sheetjs.com/data/cd.xls');
  /* get file data (ArrayBuffer) */
  const carsDataAB = await carsDataResponse.arrayBuffer();
  /* parse */
  const carsDataWB = XLSX.read(carsDataAB);
  /* get first worksheet */
  const carsDataWS = carsDataWB.Sheets[carsDataWB.SheetNames[0]];
  /* generate array of JS objects */
  const carsData = XLSX.utils.sheet_to_json(carsDataWS);
  // highlight-end
  const cleaned = carsData.map(car => ({
    mpg: car.Miles_per_Gallon,
    horsepower: car.Horsepower,
  }))
  .filter(car => (car.mpg != null && car.horsepower != null));

  return cleaned;
}
```

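The `filter` step matters for spreadsheet data because blank cells typically produce row objects with missing keys. The sketch below runs the same cleaning logic on a small hand-written array standing in for `sheet_to_json` output. The sample rows are made up for illustration, and the missing properties assume the default `sheet_to_json` behavior (no `defval` option).

```js
/* stand-in for XLSX.utils.sheet_to_json output (illustrative rows) */
const carsData = [
  { Name: "car a", Miles_per_Gallon: 18, Horsepower: 130 },
  { Name: "car b", Horsepower: 165 },      // MPG cell was blank
  { Name: "car c", Miles_per_Gallon: 25 }, // Horsepower cell was blank
  { Name: "car d", Miles_per_Gallon: 32, Horsepower: 65 }
];

/* same cleaning logic as getData */
const cleaned = carsData.map(car => ({
  mpg: car.Miles_per_Gallon,
  horsepower: car.Horsepower,
}))
.filter(car => (car.mpg != null && car.horsepower != null));

console.log(cleaned.length); // 2
```

The loose comparison `!= null` rejects both `null` and `undefined`, so rows with either kind of missing value are dropped.
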
## Low-Level Operations

### Data Transposition

A typical dataset in a spreadsheet will start with one header row and represent
each data record in its own row. For example, the Iris dataset might look like:

![Iris dataset](pathname:///files/iris.png)

The SheetJS `sheet_to_json` method[^10] will translate worksheet objects into an
array of row objects:

```js
var aoo = [
  {"sepal length": 5.1, "sepal width": 3.5, ...},
  {"sepal length": 4.9, "sepal width": 3, ...},
  ...
];
```

TF.js and other libraries tend to operate on individual columns, equivalent to:

```js
var sepal_lengths = [5.1, 4.9, ...];
var sepal_widths = [3.5, 3, ...];
```

When a `tensor2d` is exported, the data looks different from the spreadsheet:

```js
var data_set_2d = [
  [5.1, 4.9, ...],
  [3.5, 3, ...],
  ...
]
```

This is the transpose of how people use spreadsheets!

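Moving from row objects to per-column arrays takes a small amount of plain JavaScript. The sketch below pulls named columns out of an array of row objects. `columns_from_rows` is a hypothetical helper and the rows are illustrative values in the style of the Iris dataset.

```js
/* hypothetical helper: array of row objects -> one array per named column */
function columns_from_rows(aoo, fields) {
  return fields.map(field => aoo.map(row => row[field]));
}

/* illustrative rows, shaped like sheet_to_json output */
const aoo = [
  { "sepal length": 5.1, "sepal width": 3.5 },
  { "sepal length": 4.9, "sepal width": 3 },
  { "sepal length": 4.7, "sepal width": 3.2 }
];

/* each entry of `cols` holds one column of the spreadsheet */
const cols = columns_from_rows(aoo, ["sepal length", "sepal width"]);
const sepal_lengths = cols[0]; // [5.1, 4.9, 4.7]
const sepal_widths = cols[1];  // [3.5, 3, 3.2]
```
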
### Exporting Datasets to a Worksheet

The `aoa_to_sheet` method[^11] can generate a worksheet from an array of arrays.
ML libraries typically provide APIs to pull an array of arrays, but each inner
array holds one column rather than one row. To export multiple data sets, the
data should be transposed:

```js
/* assuming data is an array of typed arrays */
var aoa = [];
for(var i = 0; i < data.length; ++i) {
  for(var j = 0; j < data[i].length; ++j) {
    if(!aoa[j]) aoa[j] = [];
    aoa[j][i] = data[i][j];
  }
}
/* aoa can be directly converted to a worksheet object */
var ws = XLSX.utils.aoa_to_sheet(aoa);
```

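As a quick check of the transposition loop, the sketch below wraps it in a hypothetical `transpose_to_aoa` function and runs it on two short `Float32Array` columns. The sample values are illustrative and were chosen to be exactly representable as 32-bit floats.

```js
/* hypothetical wrapper around the transposition loop */
function transpose_to_aoa(data) {
  const aoa = [];
  for (let i = 0; i < data.length; ++i) {
    for (let j = 0; j < data[i].length; ++j) {
      if (!aoa[j]) aoa[j] = [];
      aoa[j][i] = data[i][j];
    }
  }
  return aoa;
}

/* two columns of three records each */
const data = [
  new Float32Array([1.5, 2.5, 3.5]),
  new Float32Array([10, 20, 30])
];

const rows = transpose_to_aoa(data);
console.log(JSON.stringify(rows)); // [[1.5,10],[2.5,20],[3.5,30]]
```

The result has one inner array per spreadsheet row, which is the layout `aoa_to_sheet` expects.
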
### Importing Data from a Spreadsheet

`sheet_to_json` with the option `header:1`[^12] will generate a row-major array
of arrays that can be transposed. However, it is more efficient to walk the
sheet manually:

```js
/* find worksheet range */
var range = XLSX.utils.decode_range(ws['!ref']);
var out = [];
/* walk the columns */
for(var C = range.s.c; C <= range.e.c; ++C) {
  /* create the typed array */
  var ta = new Float32Array(range.e.r - range.s.r + 1);
  /* walk the rows */
  for(var R = range.s.r; R <= range.e.r; ++R) {
    /* find the cell, skip it if the cell isn't numeric or boolean */
    var cell = ws["!data"] ? (ws["!data"][R]||[])[C] : ws[XLSX.utils.encode_cell({r:R, c:C})];
    if(!cell || cell.t != 'n' && cell.t != 'b') continue;
    /* assign to the typed array */
    ta[R - range.s.r] = cell.v;
  }
  out.push(ta);
}
```

If the data set has a header row, the loop can be adjusted to skip that row, for
example by starting the row walk at `range.s.r + 1` and shrinking each typed
array by one element.

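For comparison, the `header:1` route can be sketched without touching cell objects. The block below starts from a hand-written array of arrays standing in for `sheet_to_json(ws, {header: 1})` output, skips the header row, and builds one `Float32Array` per column. `aoa_to_columns` is a hypothetical helper and the sample data is illustrative. As in the manual walk, values that are not numeric or boolean are left as `0`.

```js
/* hypothetical helper: row-major array of arrays -> array of Float32Array columns */
function aoa_to_columns(aoa, header_rows) {
  const body = aoa.slice(header_rows);
  /* the widest row determines the number of columns */
  const ncols = body.reduce((acc, row) => Math.max(acc, row.length), 0);
  const out = [];
  for (let C = 0; C < ncols; ++C) {
    const ta = new Float32Array(body.length);
    for (let R = 0; R < body.length; ++R) {
      const v = body[R][C];
      /* skip values that are not numeric or boolean */
      if (typeof v !== "number" && typeof v !== "boolean") continue;
      ta[R] = v;
    }
    out.push(ta);
  }
  return out;
}

/* stand-in for sheet_to_json(ws, {header: 1}) output */
const aoa = [
  ["x", "y"],
  [1, 2.5],
  [2, "n/a"],
  [3, 7.5]
];
const columns = aoa_to_columns(aoa, 1);
console.log(Array.from(columns[0])); // 1, 2, 3
console.log(Array.from(columns[1])); // 2.5, 0, 7.5
```
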
### TF.js Tensors

A single `Array#map` can pull individual named fields from the `sheet_to_json`
result, which can be used to construct TensorFlow.js tensor objects:

```js
const aoo = XLSX.utils.sheet_to_json(worksheet);
const lengths = aoo.map(row => row["sepal length"]);
const tensor = tf.tensor1d(lengths);
```

`tf.Tensor` objects can be directly transposed using `transpose`:

```js
var aoo = XLSX.utils.sheet_to_json(worksheet);
// "x" and "y" are the fields we want to pull from the data
var data = aoo.map(row => ([row["x"], row["y"]]));

// create a tensor representing two column datasets
var tensor = tf.tensor2d(data).transpose();

// individual columns can be accessed
var col1 = tensor.slice([0,0], [1,tensor.shape[1]]).flatten();
var col2 = tensor.slice([1,0], [1,tensor.shape[1]]).flatten();
```

For exporting, `stack` can be used to combine the columns into a single tensor
whose data can be pulled as a linear array:

```js
/* pull data into a Float32Array */
var result = tf.stack([col1, col2]).transpose();
var shape = result.shape;
var f32 = result.dataSync();

/* construct an array of arrays of the data in spreadsheet order */
var aoa = [];
for(var j = 0; j < shape[0]; ++j) {
  aoa[j] = [];
  for(var i = 0; i < shape[1]; ++i) aoa[j][i] = f32[j * shape[1] + i];
}

/* add headers to the top */
aoa.unshift(["x", "y"]);

/* generate worksheet */
var worksheet = XLSX.utils.aoa_to_sheet(aoa);
```

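The nested loop relies on the flat buffer being in row-major order: element `(j, i)` of a tensor with shape `[rows, cols]` sits at index `j * cols + i`. The sketch below isolates that indexing in a hypothetical `unflatten` helper and applies it to a hand-written `Float32Array`, so it runs without TF.js. The values are illustrative.

```js
/* hypothetical helper: flat row-major buffer + [rows, cols] shape -> array of arrays */
function unflatten(flat, shape) {
  const [rows, cols] = shape;
  const aoa = [];
  for (let j = 0; j < rows; ++j) {
    aoa[j] = [];
    for (let i = 0; i < cols; ++i) aoa[j][i] = flat[j * cols + i];
  }
  return aoa;
}

/* stand-in for the dataSync() output of a tensor with shape [3, 2] */
const f32 = new Float32Array([1, 10, 2, 20, 3, 30]);
const aoa = unflatten(f32, [3, 2]);

/* add headers to the top, as in the export example */
aoa.unshift(["x", "y"]);
console.log(JSON.stringify(aoa)); // [["x","y"],[1,10],[2,20],[3,30]]
```
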
[^1]: See [`tf.data.csv`](https://js.tensorflow.org/api/latest/#data.csv) in the TensorFlow.js documentation
[^2]: See [`sheet_to_csv` in "CSV and Text"](/docs/api/utilities/csv#delimiter-separated-output)
[^3]: The ["Making Predictions from 2D Data" example](https://codelabs.developers.google.com/codelabs/tfjs-training-regression/) uses a hosted JSON file. The [sample XLS file](https://sheetjs.com/data/cd.xls) includes the same data.
[^4]: See [`read` in "Reading Files"](/docs/api/parse-options)
[^5]: See ["Workbook Object"](/docs/csf/book)
[^6]: See [`sheet_to_csv` in "CSV and Text"](/docs/api/utilities/csv#delimiter-separated-output)
[^7]: See [`tf.data.csv`](https://js.tensorflow.org/api/latest/#data.csv) in the TensorFlow.js documentation
[^8]: See [`tf.LayersModel.fitDataset`](https://js.tensorflow.org/api/latest/#tf.LayersModel.fitDataset) in the TensorFlow.js documentation
[^9]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)
[^10]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)
[^11]: See [`aoa_to_sheet` in "Utilities"](/docs/api/utilities/array#array-of-arrays-input)
[^12]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)