---
title: Sheets in TensorFlow
sidebar_label: TensorFlow.js
pagination_prev: demos/index
pagination_next: demos/frontend/index
---

TensorFlow.js (shortened to TF.js) is a library for machine learning in JavaScript.

SheetJS is a JavaScript library for reading and writing data from spreadsheets.

This demo uses TensorFlow.js and SheetJS to process data in spreadsheets. We'll explore how to load spreadsheet data into TF.js datasets and how to export results back to spreadsheets.

:::info pass

Live code blocks in this page use the TF.js 4.14.0 standalone build.

For use in web frameworks, the `@tensorflow/tfjs` module should be used.

For use in NodeJS, the native bindings module is `@tensorflow/tfjs-node`.

:::
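
The following is a minimal import sketch for a bundled web app, assuming the `@tensorflow/tfjs` and SheetJS (`xlsx`) packages are installed; the live demos on this page instead use standalone builds that attach global `tf` and `XLSX` objects.

```js
/* sketch: module imports when bundling with a web framework */
import * as tf from "@tensorflow/tfjs";  // TF.js browser module
import { read, utils } from "xlsx";      // SheetJS parsing and utility functions
```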

:::note Tested Deployments

Each browser demo was tested in the following environments:

| Browser     | TF.js version | Date       |
|:------------|:--------------|:-----------|
| Chrome 122  | `4.14.0`      | 2024-04-07 |
| Safari 17.4 | `4.14.0`      | 2024-03-23 |

:::

## CSV Data Interchange

`tf.data.csv` generates a `Dataset` from CSV data. The function expects a URL.

:::note pass

When this demo was last tested, there was no direct method to pass a CSV string to the underlying parser.

:::

Fortunately blob URLs are supported.

```mermaid
flowchart LR
  ws((SheetJS\nWorksheet))
  csv(CSV\nstring)
  url{{Data\nURL}}
  dataset[(TF.js\nDataset)]
  ws --> |sheet_to_csv\nSheetJS| csv
  csv --> |JavaScript\nAPIs| url
  url --> |tf.data.csv\nTensorFlow.js| dataset
```

The SheetJS `sheet_to_csv` method generates a CSV string from a worksheet object. Using standard JavaScript techniques, a blob URL can be constructed:

```js
function worksheet_to_csv_url(worksheet) {
  /* generate CSV */
  const csv = XLSX.utils.sheet_to_csv(worksheet);

  /* CSV -> Uint8Array -> Blob */
  const u8 = new TextEncoder().encode(csv);
  const blob = new Blob([u8], { type: "text/csv" });

  /* generate a blob URL */
  return URL.createObjectURL(blob);
}
```
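
The URL returned by this helper can be passed straight to `tf.data.csv`. The following is a minimal sketch, assuming a SheetJS `worksheet` object is already in scope; the column options are omitted here and configured in the full demo below.

```js
/* sketch: feed the blob URL for a SheetJS worksheet to TF.js */
const url = worksheet_to_csv_url(worksheet);
const dataset = tf.data.csv(url, { hasHeader: true });
```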

### CSV Demo

This demo shows a simple model fitting using the "cars" dataset from TensorFlow. The sample XLS file contains the data. The data processing mirrors the official "Making Predictions from 2D Data" demo.

```mermaid
flowchart LR
  file[(Remote\nFile)]
  subgraph SheetJS Operations
    ab[(Data\nBytes)]
    wb(((SheetJS\nWorkbook)))
    ws((SheetJS\nWorksheet))
    csv(CSV\nstring)
  end
  subgraph TensorFlow.js Operations
    url{{Data\nURL}}
    dataset[(TF.js\nDataset)]
    results((Results))
  end
  file --> |fetch\n\n| ab
  ab --> |read\n\n| wb
  wb --> |select\nsheet| ws
  ws --> |sheet_to_csv\n\n| csv
  csv --> |JS\nAPI| url
  url --> |tf.data.csv\nTF.js| dataset
  dataset --> |fitDataset\nTF.js| results
```

The demo builds a model for predicting MPG from Horsepower data. It:

- fetches https://sheetjs.com/data/cd.xls
- parses the data with the SheetJS `read` method
- selects the first worksheet and converts it to CSV using `sheet_to_csv`
- generates a blob URL from the CSV text
- generates a TF.js dataset with `tf.data.csv` and selects data columns
- builds a model and trains it with `fitDataset`
- predicts MPG from a set of sample inputs and displays the results in a table

**Live Demo** (click to show)

:::caution pass

In some test runs, the results did not make sense given the underlying data. The dependent and independent variables are expected to be anti-correlated.

This is a known issue in TF.js and affects the official demos.

:::

:::caution pass

If the live demo shows a message

```
ReferenceError: tf is not defined
```

please refresh the page. This is a known bug in the documentation generator.

:::

```jsx live
function SheetJSToTFJSCSV() {
  const [output, setOutput] = React.useState("");
  const [results, setResults] = React.useState([]);
  const [disabled, setDisabled] = React.useState(false);

  function worksheet_to_csv_url(worksheet) {
    /* generate CSV */
    const csv = XLSX.utils.sheet_to_csv(worksheet);

    /* CSV -> Uint8Array -> Blob */
    const u8 = new TextEncoder().encode(csv);
    const blob = new Blob([u8], { type: "text/csv" });

    /* generate a blob URL */
    return URL.createObjectURL(blob);
  }

  const doit = React.useCallback(async () => {
    setResults([]); setOutput(""); setDisabled(true);
    try {
    /* fetch file */
    const f = await fetch("https://sheetjs.com/data/cd.xls");
    const ab = await f.arrayBuffer();
    /* parse file and get first worksheet */
    const wb = XLSX.read(ab);
    const ws = wb.Sheets[wb.SheetNames[0]];

    /* generate blob URL */
    const url = worksheet_to_csv_url(ws);

    /* feed to tf.js */
    const dataset = tf.data.csv(url, {
      hasHeader: true,
      configuredColumnsOnly: true,
      columnConfigs:{
        "Horsepower": {required: false, default: 0},
        "Miles_per_Gallon":{required: false, default: 0, isLabel:true}
      }
    });

    /* pre-process data */
    let flat = dataset
      .map(({xs,ys}) =>({xs: Object.values(xs), ys: Object.values(ys)}))
      .filter(({xs,ys}) => [...xs,...ys].every(v => v>0));

    /* normalize manually :( */
    let minX = Infinity, maxX = -Infinity, minY = Infinity, maxY = -Infinity;
    await flat.forEachAsync(({xs, ys}) => {
      minX = Math.min(minX, xs[0]); maxX = Math.max(maxX, xs[0]);
      minY = Math.min(minY, ys[0]); maxY = Math.max(maxY, ys[0]);
    });
    flat = flat.map(({xs, ys}) => ({xs:xs.map(v => (v-minX)/(maxX - minX)),ys:ys.map(v => (v-minY)/(maxY-minY))}));
    flat = flat.batch(32);

    /* build and train model */
    const model = tf.sequential();
    model.add(tf.layers.dense({inputShape: [1], units: 1}));
    model.compile({ optimizer: tf.train.sgd(0.000001), loss: 'meanSquaredError' });
    await model.fitDataset(flat, { epochs: 100, callbacks: { onEpochEnd: async (epoch, logs) => {
      setOutput(`${epoch}:${logs.loss}`);
    }}});

    /* predict values */
    const inp = tf.linspace(0, 1, 9);
    const pred = model.predict(inp);
    const xs = await inp.dataSync(), ys = await pred.dataSync();
    setResults(Array.from(xs).map((x, i) => [ x * (maxX - minX) + minX, ys[i] * (maxY - minY) + minY ]));
    setOutput("");

    } catch(e) { setOutput(`ERROR: ${String(e)}`); } finally { setDisabled(false);}
  });
  return ( <>
    <button onClick={doit} disabled={disabled}>Click to run</button><br/>
    {output && <pre>{output}</pre> || <></>}
    {results.length && <table><thead><tr><th>Horsepower</th><th>MPG</th></tr></thead><tbody>
    {results.map((r,i) => <tr key={i}><td>{r[0]}</td><td>{r[1].toFixed(2)}</td></tr>)}
    </tbody></table> || <></>}
  </> );
}
```

## JS Array Interchange

The official Linear Regression tutorial loads data from a JSON file:

```js
[
  {
    "Name": "chevrolet chevelle malibu",
    "Miles_per_Gallon": 18,
    "Cylinders": 8,
    "Displacement": 307,
    "Horsepower": 130,
    "Weight_in_lbs": 3504,
    "Acceleration": 12,
    "Year": "1970-01-01",
    "Origin": "USA"
  },
  // ...
]
```

In real use cases, data is stored in spreadsheets:

*(screenshot of cd.xls)*

Following the tutorial, the data fetching method can be adapted to handle arrays of objects, such as those generated by the SheetJS `sheet_to_json` method.

Differences from the official example are highlighted below:

```js
/**
 * Get the car data reduced to just the variables we are interested
 * and cleaned of missing data.
 */
async function getData() {
  // highlight-start
  /* fetch file */
  const carsDataResponse = await fetch('https://sheetjs.com/data/cd.xls');
  /* get file data (ArrayBuffer) */
  const carsDataAB = await carsDataResponse.arrayBuffer();
  /* parse */
  const carsDataWB = XLSX.read(carsDataAB);
  /* get first worksheet */
  const carsDataWS = carsDataWB.Sheets[carsDataWB.SheetNames[0]];
  /* generate array of JS objects */
  const carsData = XLSX.utils.sheet_to_json(carsDataWS);
  // highlight-end
  const cleaned = carsData.map(car => ({
    mpg: car.Miles_per_Gallon,
    horsepower: car.Horsepower,
  }))
  .filter(car => (car.mpg != null && car.horsepower != null));

  return cleaned;
}
```
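
Following the rest of the tutorial, the cleaned objects can be mapped to input and label tensors. This is a sketch under the tutorial's horsepower-to-MPG framing, not part of the official code:

```js
/* sketch: build tensors from the cleaned rows returned by getData() */
const data = await getData();
const inputs = data.map(d => d.horsepower);
const labels = data.map(d => d.mpg);
const inputTensor = tf.tensor2d(inputs, [inputs.length, 1]);
const labelTensor = tf.tensor2d(labels, [labels.length, 1]);
```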

## Low-Level Operations

### Data Transposition

A typical dataset in a spreadsheet will start with one header row and represent each data record in its own row. For example, the Iris dataset might look like the following:

*(screenshot of the Iris dataset)*

The SheetJS `sheet_to_json` method will translate worksheet objects into an array of row objects:

```js
var aoo = [
  {"sepal length": 5.1, "sepal width": 3.5, ...},
  {"sepal length": 4.9, "sepal width":   3, ...},
  ...
];
```

TF.js and other libraries tend to operate on individual columns, equivalent to:

```js
var sepal_lengths = [5.1, 4.9, ...];
var sepal_widths = [3.5, 3, ...];
```

When a `tensor2d` is exported back to JavaScript arrays, the data will look different from the spreadsheet:

```js
var data_set_2d = [
  [5.1, 4.9, ...],
  [3.5, 3, ...],
  ...
];
```

This is the transpose of how people use spreadsheets!

### Exporting Datasets to a Worksheet

The `aoa_to_sheet` method can generate a worksheet from an array of arrays. ML libraries typically provide APIs to pull data out as an array of arrays, but the result will be transposed relative to the spreadsheet layout. To export multiple data sets, the data should be transposed:

```js
/* assuming data is an array of typed arrays */
var aoa = [];
for(var i = 0; i < data.length; ++i) {
  for(var j = 0; j < data[i].length; ++j) {
    if(!aoa[j]) aoa[j] = [];
    aoa[j][i] = data[i][j];
  }
}
/* aoa can be directly converted to a worksheet object */
var ws = XLSX.utils.aoa_to_sheet(aoa);
```
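
From there, standard SheetJS write methods can export the worksheet. A short sketch follows; the workbook, sheet name, and file name are illustrative:

```js
/* sketch: place the worksheet in a new workbook and write a file */
var wb = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(wb, ws, "Data");
XLSX.writeFile(wb, "dataset.xlsx");
```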

### Importing Data from a Spreadsheet

`sheet_to_json` with the option `header: 1` will generate a row-major array of arrays that can be transposed. However, it is more efficient to walk the sheet manually:

```js
/* find worksheet range */
var range = XLSX.utils.decode_range(ws['!ref']);
var out = [];
/* walk the columns */
for(var C = range.s.c; C <= range.e.c; ++C) {
  /* create the typed array */
  var ta = new Float32Array(range.e.r - range.s.r + 1);
  /* walk the rows */
  for(var R = range.s.r; R <= range.e.r; ++R) {
    /* find the cell, skip it if the cell isn't numeric or boolean */
    var cell = ws["!data"] ? (ws["!data"][R]||[])[C] : ws[XLSX.utils.encode_cell({r:R, c:C})];
    if(!cell || cell.t != 'n' && cell.t != 'b') continue;
    /* assign to the typed array */
    ta[R - range.s.r] = cell.v;
  }
  out.push(ta);
}
```

If the data set has a header row, the loop can be adjusted to skip those rows.
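
For example, assuming exactly one header row, the typed array length and the row loop shift down by one row. This sketch reuses the `range`, `ws`, and `out` variables from the previous listing:

```js
/* sketch: same column walk, skipping a single header row */
var first_row = range.s.r + 1;
for(var C = range.s.c; C <= range.e.c; ++C) {
  var ta = new Float32Array(range.e.r - first_row + 1);
  for(var R = first_row; R <= range.e.r; ++R) {
    var cell = ws["!data"] ? (ws["!data"][R]||[])[C] : ws[XLSX.utils.encode_cell({r:R, c:C})];
    if(!cell || cell.t != 'n' && cell.t != 'b') continue;
    ta[R - first_row] = cell.v;
  }
  out.push(ta);
}
```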

### TF.js Tensors

A single `Array#map` call can pull individual named fields from the result, which can be used to construct TensorFlow.js tensor objects:

```js
const aoo = XLSX.utils.sheet_to_json(worksheet);
const lengths = aoo.map(row => row["sepal length"]);
const tensor = tf.tensor1d(lengths);
```

`tf.Tensor` objects can be directly transposed using `transpose`:

```js
var aoo = XLSX.utils.sheet_to_json(worksheet);
// "x" and "y" are the fields we want to pull from the data
var data = aoo.map(row => ([row["x"], row["y"]]));

// create a tensor representing two column datasets
var tensor = tf.tensor2d(data).transpose();

// individual columns can be accessed
var col1 = tensor.slice([0,0], [1,tensor.shape[1]]).flatten();
var col2 = tensor.slice([1,0], [1,tensor.shape[1]]).flatten();
```

For exporting, `stack` can be used to collapse the columns into a linear array:

```js
/* pull data into a Float32Array */
var result = tf.stack([col1, col2]).transpose();
var shape = result.shape;
var f32 = result.dataSync();

/* construct an array of arrays of the data in spreadsheet order */
var aoa = [];
for(var j = 0; j < shape[0]; ++j) {
  aoa[j] = [];
  for(var i = 0; i < shape[1]; ++i) aoa[j][i] = f32[j * shape[1] + i];
}

/* add headers to the top */
aoa.unshift(["x", "y"]);

/* generate worksheet */
var worksheet = XLSX.utils.aoa_to_sheet(aoa);
```