---
title: Summary Statistics
sidebar_label: Summary Statistics
pagination_prev: demos/index
pagination_next: demos/frontend/index
---
import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';
export const bs = ({borderStyle:"none", background:"none", textAlign:"left" });
Summary statistics help people quickly understand datasets and make informed
decisions. Many interesting datasets are stored in spreadsheet files.
[SheetJS](https://sheetjs.com) is a JavaScript library for reading and writing
data from spreadsheets.
This demo uses SheetJS to process data in spreadsheets. We'll explore how to
extract spreadsheet data and how to compute simple summary statistics. This
demo will focus on two general data representations:
- ["Arrays of Objects"](#arrays-of-objects) simplifies processing by translating
from the SheetJS data model to a more idiomatic data structure.
- ["Dense Worksheets"](#dense-worksheets) directly analyzes SheetJS worksheets.
:::tip pass
The [Import Tutorial](/docs/getting-started/examples/import) is a guided example
of extracting data from a workbook. It is strongly recommended to review the
tutorial first.
:::
:::note Tested Deployments
This browser demo was tested in the following environments:
| Browser | Date |
|:------------|:-----------|
| Chrome 126 | 2024-06-21 |
| Safari 17.4 | 2024-06-20 |
:::
## Data Representations
Many worksheets include one header row followed by a number of data rows. Each
row is an "observation" and each column is a "variable".
:::info pass
The "Arrays of Objects" approach uses more idiomatic JavaScript patterns. It is
suitable for smaller datasets.
The "Dense Worksheets" approach is more performant, but the code patterns are
reminiscent of C. The low-level approach is only encouraged when the traditional
patterns are prohibitively slow.
:::
### Arrays of Objects
The idiomatic JavaScript representation of the dataset is an array of objects.
Variable names are typically taken from the first row. Those names are used as
keys in each observation.
<table>
<thead><tr><th>Spreadsheet</th><th>JS Data</th></tr></thead>
<tbody><tr><td>
![`pres.xlsx` data](pathname:///pres.png)
</td><td>
```js
[
{ Name: "Bill Clinton", Index: 42 },
{ Name: "GeorgeW Bush", Index: 43 },
{ Name: "Barack Obama", Index: 44 },
{ Name: "Donald Trump", Index: 45 },
{ Name: "Joseph Biden", Index: 46 }
]
```
</td></tr></tbody></table>
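To make the mapping concrete, the following sketch builds an array of objects by
hand from rows of plain values. The `rows_to_objects` helper and the sample rows
are illustrative assumptions. This is not the SheetJS implementation, which
handles many more cases:

```js
/* hypothetical helper: first row holds variable names, other rows hold observations */
function rows_to_objects(rows) {
  const [header, ...body] = rows;
  return body.map(row => {
    const obj = {};
    header.forEach((name, C) => {
      /* skip missing values so that the key is absent from the object */
      if(row[C] != null) obj[name] = row[C];
    });
    return obj;
  });
}

const rows = [
  ["Name", "Index"],
  ["Bill Clinton", 42],
  ["GeorgeW Bush", 43]
];
const objects = rows_to_objects(rows);
console.log(objects[1].Name, objects[1].Index); // GeorgeW Bush 43
```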
The SheetJS `sheet_to_json` method[^1] can generate arrays of objects from a
worksheet object. For example, the following snippet fetches a test file and
creates an array of objects from the first sheet:
```js
const url = "https://docs.sheetjs.com/typedarray/iris.xlsx";
/* fetch file and pull file data into an ArrayBuffer */
const file = await (await fetch(url)).arrayBuffer();
/* parse workbook */
const workbook = XLSX.read(file, {dense: true});
/* first worksheet */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
/* generate array of objects */
// highlight-next-line
const aoo = XLSX.utils.sheet_to_json(first_sheet);
```
### Dense Worksheets
SheetJS "dense" worksheets[^2] store cells in an array of arrays. The SheetJS
`read` method[^3] accepts a special `dense` option to create dense worksheets.
The following example fetches a file:
```js
/* fetch file and pull file data into an ArrayBuffer */
const url = "https://docs.sheetjs.com/typedarray/iris.xlsx";
const file = await (await fetch(url)).arrayBuffer();
/* parse workbook */
// highlight-next-line
const workbook = XLSX.read(file, {dense: true});
/* first worksheet */
const first_dense_sheet = workbook.Sheets[workbook.SheetNames[0]];
```
The `"!data"` property of a dense worksheet is an array of arrays of cell
objects[^4]. Each cell object stores properties such as the data type and value.
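As a rough illustration of the layout, the following sketch indexes into a small
hand-built object shaped like a dense worksheet. The `sample_sheet` object is an
assumption for demonstration. Cells produced by the parser typically carry
additional properties:

```js
/* hand-built object shaped like a dense worksheet (illustrative data) */
const sample_sheet = {
  "!ref": "A1:B3",
  "!data": [
    [ { t: "s", v: "Name" },         { t: "s", v: "Index" } ],
    [ { t: "s", v: "Bill Clinton" }, { t: "n", v: 42 } ],
    [ { t: "s", v: "GeorgeW Bush" }, { t: "n", v: 43 } ]
  ]
};

/* cell B2 is in the second row (index 1) and second column (index 1) */
const cell_B2 = sample_sheet["!data"][1][1];
console.log(cell_B2.t, cell_B2.v); // n 42
```

The first index selects the row and the second index selects the column. Both
are zero-based.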
## Analyzing Variables
Individual variables can be extracted by looping through the array of objects
and accessing specific keys. For example, using the Iris dataset:
![Iris dataset](pathname:///typedarray/iris.png)
<Tabs groupId="style">
<TabItem name="aoo" value="Array of Objects">
The following snippet shows the first entry in the array of objects:
```js
{
"sepal length": 5.1,
"sepal width": 3.5,
"petal length": 1.4,
"petal width": 0.2,
"class ": "Iris-setosa"
}
```
The values for the `sepal length` variable can be extracted by indexing each
object. The following snippet prints the sepal lengths:
```js
for(let i = 0; i < aoo.length; ++i) {
const row = aoo[i];
const sepal_length = row["sepal length"];
console.log(sepal_length);
}
```
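The same extraction can be sketched with array methods. The `sample_aoo` rows
below are illustrative stand-ins for the parsed data, and the `typeof` filter is
an added assumption to skip rows where the variable is missing:

```js
/* sample rows shaped like the parsed Iris data (values are illustrative) */
const sample_aoo = [
  { "sepal length": 5.1, "sepal width": 3.5 },
  { "sepal length": 4.9, "sepal width": 3.0 },
  { "sepal width": 3.2 } // missing sepal length
];

/* pull the variable from each row and keep only numeric values */
const sepal_lengths = sample_aoo
  .map(row => row["sepal length"])
  .filter(value => typeof value == "number");

console.log(sepal_lengths); // [ 5.1, 4.9 ]
```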
```jsx live
function SheetJSAoOExtractColumn() {
const [col, setCol] = React.useState([]);
React.useEffect(() => { (async() => {
const ab = await (await fetch("/typedarray/iris.xlsx")).arrayBuffer();
const wb = XLSX.read(ab, {dense: true});
const aoo = XLSX.utils.sheet_to_json(wb.Sheets[wb.SheetNames[0]]);
/* store first 5 sepal lengths in an array */
const col = [];
for(let i = 0; i < aoo.length; ++i) {
const row = aoo[i];
const sepal_length = row["sepal length"];
col.push(sepal_length); if(col.length >= 5) break;
}
setCol(col);
})(); }, []);
return ( <>
<b>First 5 Sepal Length Values</b><br/>
<table><tbody><tr>{col.map(sw => (<td>{sw}</td>))}</tr></tbody></table>
</>
);
}
```
</TabItem>
<TabItem name="ws" value="Dense Worksheet">
The column for the `sepal length` variable can be determined by testing the cell
values in the first row.
**Finding the column index for the variable**
The first row of cells will be the first row in the `"!data"` array:
```js
const first_row = first_dense_sheet["!data"][0];
```
When looping over the cells in the first row, each cell must be tested in the
following order:
- confirm the cell object exists (entry is not null)
- cell is a text cell (the `t` property will be `"s"`[^5])
- cell value (`v` property[^6]) matches `"sepal length"`
```js
let C = -1;
for(let i = 0; i < first_row.length; ++i) {
let cell = first_row[i];
/* confirm cell exists */
if(!cell) continue;
/* confirm cell is a text cell */
if(cell.t != "s") continue;
/* compare the text */
if(cell.v.localeCompare("sepal length") != 0) continue;
/* save column index */
C = i; break;
}
/* throw an error if the column cannot be found */
if(C == -1) throw new Error(`"sepal length" column cannot be found! `);
```
**Finding the values for the variable**
After finding the column index, the rest of the rows can be scanned. This time,
the cell type will be `"n"`[^7] (numeric). The following snippet prints values:
```js
const number_of_rows = first_dense_sheet["!data"].length;
for(let R = 1; R < number_of_rows; ++R) {
/* confirm row exists */
let row = first_dense_sheet["!data"][R];
if(!row) continue;
/* confirm cell exists */
let cell = row[C];
if(!cell) continue;
/* confirm cell is a numeric cell */
if(cell.t != "n") continue;
/* print raw value */
console.log(cell.v);
}
```
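The two steps can be combined into one function. The following is a minimal
sketch, assuming a hypothetical helper name `dense_column_values` and a
hand-built object shaped like a dense worksheet. It compares header text with
`==` instead of `localeCompare`:

```js
/* hypothetical helper: numeric values of the column whose header matches `name` */
function dense_column_values(ws, name) {
  const data = ws["!data"];
  const first_row = data[0] || [];
  /* find the column index by testing the header cells */
  const C = first_row.findIndex(cell => cell && cell.t == "s" && cell.v == name);
  if(C == -1) throw new Error(`"${name}" column cannot be found!`);
  /* collect raw values from numeric cells in the remaining rows */
  const out = [];
  for(let R = 1; R < data.length; ++R) {
    const cell = data[R] && data[R][C];
    if(cell && cell.t == "n") out.push(cell.v);
  }
  return out;
}

/* hand-built data with a missing row and a text cell in the numeric column */
const sample_ws = { "!data": [
  [ { t: "s", v: "sepal length" }, { t: "s", v: "class" } ],
  [ { t: "n", v: 5.1 }, { t: "s", v: "Iris-setosa" } ],
  ,
  [ { t: "s", v: "n/a" }, { t: "s", v: "Iris-setosa" } ],
  [ { t: "n", v: 4.9 }, { t: "s", v: "Iris-setosa" } ]
] };
console.log(dense_column_values(sample_ws, "sepal length")); // [ 5.1, 4.9 ]
```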
**Live Demo**
The following demo extracts the first 5 sepal lengths from the dense worksheet:
```jsx live
function SheetJSDensExtractColumn() {
const [msg, setMsg] = React.useState("");
const [col, setCol] = React.useState([]);
React.useEffect(() => { (async() => {
const ab = await (await fetch("/typedarray/iris.xlsx")).arrayBuffer();
const wb = XLSX.read(ab, {dense: true});
/* first worksheet */
const first_dense_sheet = wb.Sheets[wb.SheetNames[0]];
/* find column index */
const first_row = first_dense_sheet["!data"][0];
let C = -1;
for(let i = 0; i < first_row.length; ++i) {
let cell = first_row[i];
/* confirm cell exists */
if(!cell) continue;
/* confirm cell is a text cell */
if(cell.t != "s") continue;
/* compare the text */
if(cell.v.localeCompare("sepal length") != 0) continue;
/* save column index */
C = i; break;
}
/* show an error message if the column cannot be found */
if(C == -1) return setMsg(`"sepal length" column cannot be found! `);
/* store first 5 sepal lengths in an array */
const col = [];
const number_of_rows = first_dense_sheet["!data"].length;
for(let R = 1; R < number_of_rows; ++R) {
/* confirm row exists */
let row = first_dense_sheet["!data"][R];
if(!row) continue;
/* confirm cell exists */
let cell = row[C];
if(!cell) continue;
/* confirm cell is a numeric cell */
if(cell.t != "n") continue;
/* add raw value */
const sepal_length = cell.v;
col.push(sepal_length); if(col.length >= 5) break;
}
setCol(col);
setMsg("First 5 Sepal Length Values");
})(); }, []);
return ( <><b>{msg}</b><br/><table><tbody>
{col.map(sw => (<tr><td>{sw}</td></tr>))}
</tbody></table></> );
}
```
</TabItem>
</Tabs>
## Average (Mean)
For a given sequence of numbers $x_1\mathellipsis x_{count}$ the mean $M$ is
defined as the sum of the elements divided by the count:
$$
M[x;count] = \frac{1}{count}\sum_{i=1}^{count} x_i
$$
In JavaScript terms, the mean of an array is the sum of the numbers in the array
divided by the total number of numeric values.
Non-numeric elements and array holes do not affect the sum and do not contribute
to the count. Algorithms are expected to explicitly track the count and cannot
assume the array `length` property will be the correct count.
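The difference between the array length and the count can be seen with a plain
array. In this sketch, `mean_of_numbers` is an illustrative name and the sample
array is made up for demonstration:

```js
/* mean of the numeric entries; `mean_of_numbers` is an illustrative name */
function mean_of_numbers(arr) {
  let sum = 0, cnt = 0;
  for(let i = 0; i < arr.length; ++i) {
    const value = arr[i];
    /* array holes read as undefined, so they are skipped along with strings */
    if(typeof value != "number") continue;
    sum += value; ++cnt;
  }
  return cnt == 0 ? 0 : sum / cnt;
}

const values = [1, "two", , 3, null, 8];
console.log(values.length);           // 6
console.log(mean_of_numbers(values)); // 4
```

Only three of the six slots hold numbers, so the sum `12` is divided by `3`.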
:::info pass
This definition aligns with the spreadsheet `AVERAGE` function.
`AVERAGEA` differs from `AVERAGE` in its treatment of string and Boolean values:
string values are treated as zeroes and Boolean values map to their coerced
numeric equivalent (`true` is `1` and `false` is `0`).
:::
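The two treatments can be compared with plain arrays. The following sketch
follows the description above. The function names are illustrative and the code
does not attempt to reproduce every spreadsheet edge case (such as error values):

```js
/* AVERAGE-style mean: only numbers contribute */
function average(values) {
  let sum = 0, cnt = 0;
  for(const v of values) if(typeof v == "number") { sum += v; ++cnt; }
  return cnt == 0 ? 0 : sum / cnt;
}

/* AVERAGEA-style mean: strings count as 0, Booleans count as 1 or 0 */
function averagea(values) {
  let sum = 0, cnt = 0;
  for(const v of values) {
    if(typeof v == "number") { sum += v; ++cnt; }
    else if(typeof v == "boolean") { sum += v ? 1 : 0; ++cnt; }
    else if(typeof v == "string") { ++cnt; }
  }
  return cnt == 0 ? 0 : sum / cnt;
}

const mixed = [2, 4, "text", true];
console.log(average(mixed));  // 3
console.log(averagea(mixed)); // 1.75
```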
:::note JavaScript Ecosystem
Some JavaScript libraries implement functions for computing array means.
| Library | Implementation |
|:------------------------|:----------------------------------------------|
| `jStat`[^8] | Textbook sum (divide at end) |
| `simple-statistics`[^9] | Neumaier compensated sum (divide at end) |
| `stdlib.js`[^10] | Trial mean (`mean`) / van Reeken (`incrmean`) |
:::
### Textbook Sum
The mean of a sequence of values can be calculated by computing the sum and
dividing by the count.
<Tabs groupId="style">
<TabItem name="aoo" value="Array of Objects">
The following function accepts an array of objects and a key.
```js
function aoa_average_of_key(aoo, key) {
let sum = 0, cnt = 0;
for(let R = 0; R < aoo.length; ++R) {
const row = aoo[R];
if(typeof row == "undefined") continue;
const field = row[key];
if(typeof field != "number") continue;
sum += field; ++cnt;
}
return cnt == 0 ? 0 : sum / cnt;
}
```
<details>
<summary><b>Live Demo</b> (click to show)</summary>
```jsx live
function SheetJSAoOAverageKey() {
const [avg, setAvg] = React.useState(NaN);
function aoa_average_of_key(aoo, key) {
let sum = 0, cnt = 0;
for(let R = 0; R < aoo.length; ++R) {
const row = aoo[R];
if(typeof row == "undefined") continue;
const field = row[key];
if(typeof field != "number") continue;
sum += field; ++cnt;
}
return cnt == 0 ? 0 : sum / cnt;
}
React.useEffect(() => { (async() => {
const ab = await (await fetch("/typedarray/iris.xlsx")).arrayBuffer();
const wb = XLSX.read(ab, {dense: true});
const aoo = XLSX.utils.sheet_to_json(wb.Sheets[wb.SheetNames[0]]);
setAvg(aoa_average_of_key(aoo, "sepal length"));
})(); }, []);
return ( <b>The average Sepal Length is {avg}</b> );
}
```
</details>
</TabItem>
<TabItem name="ws" value="Dense Worksheet">
The following function accepts a SheetJS worksheet and a column index.
```js
function ws_average_of_col(ws, C) {
const data = ws["!data"];
let sum = 0, cnt = 0;
for(let R = 1; R < data.length; ++R) {
const row = data[R];
if(typeof row == "undefined") continue;
const field = row[C];
if(!field || field.t != "n") continue;
sum += field.v; ++cnt;
}
return cnt == 0 ? 0 : sum / cnt;
}
```
<details>
<summary><b>Live Demo</b> (click to show)</summary>
```jsx live
function SheetJSDenseAverageKey() {
const [avg, setAvg] = React.useState(NaN);
function ws_average_of_col(ws, C) {
const data = ws["!data"];
let sum = 0, cnt = 0;
for(let R = 1; R < data.length; ++R) {
const row = data[R];
if(typeof row == "undefined") continue;
const field = row[C];
if(!field || field.t != "n") continue;
sum += field.v; ++cnt;
}
return cnt == 0 ? 0 : sum / cnt;
}
React.useEffect(() => { (async() => {
const ab = await (await fetch("/typedarray/iris.xlsx")).arrayBuffer();
const wb = XLSX.read(ab, {dense: true});
const ws = wb.Sheets[wb.SheetNames[0]];
/* find column index */
const first_row = ws["!data"][0];
let C = -1;
for(let i = 0; i < first_row.length; ++i) {
let cell = first_row[i];
/* confirm cell exists */
if(!cell) continue;
/* confirm cell is a text cell */
if(cell.t != "s") continue;
/* compare the text */
if(cell.v.localeCompare("sepal length") != 0) continue;
/* save column index */
C = i; break;
}
setAvg(ws_average_of_col(ws, C));
})(); }, []);
return ( <b>The average Sepal Length is {avg}</b> );
}
```
</details>
</TabItem>
</Tabs>
:::caution pass
The textbook method suffers from numerical issues when many values of similar
magnitude are summed. As the number of elements grows, the absolute value of the
sum grows to orders of magnitude larger than the absolute values of the
individual values and significant figures are lost.
:::
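The effect can be reproduced with a few terms by starting from a large value.
This sketch uses made-up numbers, where the first term stands in for a running
total that has already grown large:

```js
/* one large value followed by four small values (illustrative data) */
const terms = [1e16, 1, 1, 1, 1];

/* textbook running sum */
let running_sum = 0;
for(let i = 0; i < terms.length; ++i) running_sum += terms[i];

/* the exact sum is 1e16 + 4, but each `+ 1` is rounded away */
console.log(running_sum == 1e16);     // true
console.log(running_sum == 1e16 + 4); // false
```

Adjacent double-precision values near `1e16` are `2` apart, so adding `1` to the
running total has no effect.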
### van Reeken
Some of the issues in the textbook approach can be addressed with a differential
technique. Instead of computing the whole sum, it is possible to calculate and
update an estimate for the mean.
The van Reeken array mean can be implemented in one line of JavaScript code:
```js
for(var n = 1, mean = 0; n <= x.length; ++n) mean += (x[n-1] - mean)/n;
```
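The one-line loop assumes that `x` is an array where every element is a number.
As a quick check, the following sketch wraps the loop in a function (the name
`van_reeken_mean` is illustrative) and runs it on a small array:

```js
/* running mean over a plain array of numbers (wraps the one-line loop) */
function van_reeken_mean(x) {
  let mean = 0;
  for(let n = 1; n <= x.length; ++n) mean += (x[n-1] - mean) / n;
  return mean;
}

console.log(van_reeken_mean([1, 2, 3, 4])); // 2.5
```

The estimate after each element is `1`, `1.5`, `2` and `2.5`. The estimate never
grows beyond the magnitude of the data.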
<details>
<summary><b>Math details</b> (click to show)</summary>
Let $M[x;m] = \frac{1}{m}\sum_{i=1}^{m}x_i$ be the mean of the first $m$ elements. Then:
<table style={bs}>
<tbody style={bs}>
<tr style={bs}>
<td style={bs}>
$M[x;m+1]$
</td>
<td style={bs}>
$= \frac{1}{m+1}\sum_{i=1}^{m+1} x_i$
</td>
</tr>
<tr style={bs}>
<td style={bs}>&nbsp;</td>
<td style={bs}>
$= \frac{1}{m+1}\sum_{i=1}^{m} x_i + \frac{x_{m+1}}{m+1}$
</td>
</tr>
<tr style={bs}>
<td style={bs}>&nbsp;</td>
<td style={bs}>
$= \frac{m}{m+1}(\frac{1}{m}\sum_{i=1}^{m} x_i) + \frac{x_{m+1}}{m+1}$
</td>
</tr>
<tr style={bs}>
<td style={bs}>&nbsp;</td>
<td style={bs}>
$= \frac{m}{m+1}M[x;m] + \frac{x_{m+1}}{m+1}$
</td>
</tr>
<tr style={bs}>
<td style={bs}>&nbsp;</td>
<td style={bs}>
$= (1 - \frac{1}{m+1})M[x;m] + \frac{x_{m+1}}{m+1}$
</td>
</tr>
<tr style={bs}>
<td style={bs}>&nbsp;</td>
<td style={bs}>
$= M[x;m] + \frac{x_{m+1}}{m+1} - \frac{1}{m+1}M[x;m]$
</td>
</tr>
<tr style={bs}>
<td style={bs}>&nbsp;</td>
<td style={bs}>
$= M[x;m] + \frac{1}{m+1}(x_{m+1}-M[x;m])$
</td>
</tr>
<tr style={bs}>
<td style={bs}>
$new\_mean$
</td>
<td style={bs}>
$= old\_mean + (x_{m+1}-old\_mean) / (m+1)$
</td>
</tr>
</tbody>
</table>
Switching to zero-based indexing, the relation matches the following expression:
```js
new_mean = old_mean + (x[m] - old_mean) / (m + 1);
```
This update can be succinctly implemented in JavaScript:
```js
mean += (x[m] - mean) / (m + 1);
```
</details>
<Tabs groupId="style">
<TabItem name="aoo" value="Array of Objects">
The following function accepts an array of objects and a key.
```js
function aoa_mean_of_key(aoo, key) {
let mean = 0, cnt = 0;
for(let R = 0; R < aoo.length; ++R) {
const row = aoo[R];
if(typeof row == "undefined") continue;
const field = row[key];
if(typeof field != "number") continue;
mean += (field - mean) / ++cnt;
}
return cnt == 0 ? 0 : mean;
}
```
<details>
<summary><b>Live Demo</b> (click to show)</summary>
```jsx live
function SheetJSAoOMeanKey() {
const [avg, setAvg] = React.useState(NaN);
function aoa_mean_of_key(aoo, key) {
let mean = 0, cnt = 0;
for(let R = 0; R < aoo.length; ++R) {
const row = aoo[R];
if(typeof row == "undefined") continue;
const field = row[key];
if(typeof field != "number") continue;
mean += (field - mean) / ++cnt;
}
return cnt == 0 ? 0 : mean;
}
React.useEffect(() => { (async() => {
const ab = await (await fetch("/typedarray/iris.xlsx")).arrayBuffer();
const wb = XLSX.read(ab, {dense: true});
const aoo = XLSX.utils.sheet_to_json(wb.Sheets[wb.SheetNames[0]]);
setAvg(aoa_mean_of_key(aoo, "sepal length"));
})(); }, []);
return ( <b>The average Sepal Length is {avg}</b> );
}
```
</details>
</TabItem>
<TabItem name="ws" value="Dense Worksheet">
The following function accepts a SheetJS worksheet and a column index.
```js
function ws_mean_of_col(ws, C) {
const data = ws["!data"];
let mean = 0, cnt = 0;
for(let R = 1; R < data.length; ++R) {
const row = data[R];
if(typeof row == "undefined") continue;
const field = row[C];
if(!field || field.t != "n") continue;
mean += (field.v - mean) / ++cnt;
}
return cnt == 0 ? 0 : mean;
}
```
<details>
<summary><b>Live Demo</b> (click to show)</summary>
```jsx live
function SheetJSDenseMeanKey() {
const [avg, setAvg] = React.useState(NaN);
function ws_mean_of_col(ws, C) {
const data = ws["!data"];
let mean = 0, cnt = 0;
for(let R = 1; R < data.length; ++R) {
const row = data[R];
if(typeof row == "undefined") continue;
const field = row[C];
if(!field || field.t != "n") continue;
mean += (field.v - mean) / ++cnt;
}
return cnt == 0 ? 0 : mean;
}
React.useEffect(() => { (async() => {
const ab = await (await fetch("/typedarray/iris.xlsx")).arrayBuffer();
const wb = XLSX.read(ab, {dense: true});
const ws = wb.Sheets[wb.SheetNames[0]];
/* find column index */
const first_row = ws["!data"][0];
let C = -1;
for(let i = 0; i < first_row.length; ++i) {
let cell = first_row[i];
/* confirm cell exists */
if(!cell) continue;
/* confirm cell is a text cell */
if(cell.t != "s") continue;
/* compare the text */
if(cell.v.localeCompare("sepal length") != 0) continue;
/* save column index */
C = i; break;
}
setAvg(ws_mean_of_col(ws, C));
})(); }, []);
return ( <b>The average Sepal Length is {avg}</b> );
}
```
</details>
</TabItem>
</Tabs>
:::note Historical Context
This algorithm is generally attributed to Welford[^11]. However, the original
paper does not propose this algorithm for calculating the mean!
Programmers including Neely[^12] attributed a different algorithm to Welford.
van Reeken[^13] reported success with the algorithm presented in this section.
Knuth[^14] erroneously attributed this implementation of the mean to Welford.
:::
[^1]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)
[^2]: See ["Dense Mode" in "Sheet Objects"](/docs/csf/sheet#dense-mode)
[^3]: See [`read` in "Reading Files"](/docs/api/parse-options)
[^4]: See ["Dense Mode" in "Sheet Objects"](/docs/csf/sheet#dense-mode)
[^5]: See ["Cell Types" in "Cell Objects"](/docs/csf/cell#cell-types)
[^6]: See ["Underlying Values" in "Cell Objects"](/docs/csf/cell#underlying-values)
[^7]: See ["Cell Types" in "Cell Objects"](/docs/csf/cell#cell-types)
[^8]: See [`mean()`](https://jstat.github.io/all.html#mean) in the `jStat` documentation.
[^9]: See [`mean`](http://simple-statistics.github.io/docs/#mean) in the `simple-statistics` documentation.
[^10]: See [`incrsum`](https://stdlib.io/docs/api/latest/@stdlib/stats/incr/sum) in the `stdlib.js` documentation.
[^11]: See "Note on a Method for Calculating Corrected Sums of Squares and Products" in Technometrics Vol 4 No 3 (1962 August).
[^12]: See "Comparison of Several Algorithms for Computation of Means, Standard Deviations and Correlation Coefficients" in CACM Vol 9 No 7 (1966 July).
[^13]: See "Dealing with Neely's Algorithms" in CACM Vol 11 No 3 (1968 March).
[^14]: See "The Art of Computer Programming: Seminumerical Algorithms" Third Edition page 232.