---
title: Data Processing in GitHub
sidebar_label: GitHub
pagination_prev: demos/local/index
pagination_next: demos/extensions/index
---

import current from '/version.js';
import CodeBlock from '@theme/CodeBlock';

Many official data releases by governments and organizations include XLSX or
XLS files. Unfortunately some data sources do not retain older versions.

[Git](https://git-scm.com/) is a popular system for organizing a historical
record of text files and changes. Git can also store and track spreadsheets.

[GitHub](https://github.com/) hosts Git repositories and provides infrastructure
to run scheduled tasks. ["Flat Data"](https://octo.github.com/projects/flat-data)
explores storing and comparing versions of structured CSV and JSON data.

[SheetJS](https://sheetjs.com) is a JavaScript library for reading and writing
data from spreadsheets.

This demo uses SheetJS in GitHub to process spreadsheets. We'll explore how to
fetch and process spreadsheets at regular intervals, and how to keep track of
changes over time.

:::info pass

["Excel to CSV"](https://octo.github.com/projects/flat-data#:~:text=Excel) is an
official example that pulls XLSX workbooks from an endpoint and uses SheetJS to
parse the workbooks and generate CSV files.

:::

The following diagram depicts the data dance:

```mermaid
sequenceDiagram
  autonumber
  participant R as GH Repo
  participant A as GH Action
  participant S as Data Source
  loop Regular Interval (cron)
    A->>R: clone repo
    R->>A: old repo
    A->>S: fetch file
    S->>A: spreadsheet
    Note over A: SheetJS<br/>convert to CSV
    alt Data changed
      Note over A: commit new data
      A->>R: push new commit
    end
  end
```

## Flat Data

Flat Data is a project from GitHub, so the entire lifecycle uses GitHub offerings:

- GitHub offers free hosting for Git repositories
- GitHub Actions[^1] infrastructure runs tasks at regular intervals
- `githubocto/flat`[^2]: Action to help fetch data and automate post-processing
- `flat-postprocessing`[^3]: Post-processing helper functions and examples
- "Flat Viewer"[^4]: Web viewer for structured CSV and JSON data on GitHub

:::caution pass

A GitHub account is required. When the demo was tested, free GitHub accounts had
no Actions usage limits for public repositories.

Using private GitHub repositories is not recommended because the Flat Viewer
cannot access private repositories.

:::

### Data Source

Any publicly available spreadsheet can be a valid data source. The process will
fetch the data at specified intervals or on specified events.

For this demo, <https://docs.sheetjs.com/pres.xlsx> will be used.

### Action

The `githubocto/flat` action can be added as a step in a workflow:

```yaml
- name: Fetch data
  uses: githubocto/flat@v3
  with:
    http_url: https://docs.sheetjs.com/pres.xlsx
    downloaded_filename: data.xlsx
    postprocess: ./postprocess.ts
```

This action performs the following steps:

1) `http_url` will be fetched and saved to `downloaded_filename` in the repo.
This can be approximated with the following command:

```bash
curl -L -o data.xlsx https://docs.sheetjs.com/pres.xlsx
```

2) After saving, the `postprocess` script will be run. When a `.ts` file is the
script, it will run the script in the Deno runtime. The `postprocess` script is
expected to read the downloaded file and create or overwrite files in the repo.
This can be approximated with the following command:

```bash
deno run -A ./postprocess.ts data.xlsx
```

3) The action will compare the contents of the repo, creating a new commit if
the source data or artifacts from the `postprocess` script changed.

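Step 3 can be pictured as a comparison between two snapshots of the repo files.
The following TypeScript sketch is a simplified model of that idea and is not
the code used by `githubocto/flat`. The `Snapshot` type and the `changed_paths`
and `needs_commit` functions are hypothetical names introduced for illustration:

```typescript
/* hypothetical model: repo-relative path -> file contents */
type Snapshot = Record<string, string>;

/* read an entry without picking up inherited object properties */
function read_entry(snap: Snapshot, path: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(snap, path) ? snap[path] : undefined;
}

/* list paths that were added, removed, or modified between two snapshots */
function changed_paths(before: Snapshot, after: Snapshot): string[] {
  const all_paths = Object.keys(before).concat(Object.keys(after));
  const changed: string[] = [];
  for (const path of all_paths) {
    const differs = read_entry(before, path) !== read_entry(after, path);
    if (differs && changed.indexOf(path) === -1) changed.push(path);
  }
  return changed.sort();
}

/* a new commit is only needed when at least one path changed */
function needs_commit(before: Snapshot, after: Snapshot): boolean {
  return changed_paths(before, after).length > 0;
}
```

In this model, a run where the data source returns the same workbook produces
identical snapshots, so `needs_commit` returns `false` and no commit is created.
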
### Post-Processing Data

:::warning pass

The `flat-postprocessing` library includes a number of utilities for different
data formats. The `readXLSX` helper uses SheetJS under the hood.

The library uses an older version of the SheetJS library. To use the latest
releases, the examples import from the SheetJS CDN:

<CodeBlock language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';`}
</CodeBlock>

See [the "Deno" installation section](/docs/getting-started/installation/deno)
for more details.

:::

#### Post-Process Script

The first argument to the post-processing script is the filename.

The SheetJS `readFile` method[^5] will read the file and generate a SheetJS
workbook object[^6]. After extracting the first worksheet, `sheet_to_csv`[^7]
generates a CSV string.

After generating a CSV string, the string should be written to the filesystem
using `Deno.writeFileSync`[^8]. By convention, the CSV should preserve the file
name stem and replace the extension with `.csv`:

<CodeBlock title="postprocess.ts" language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
/* load the codepage support library for extended support with older formats */
import * as cptable from 'https://cdn.sheetjs.com/xlsx-${current}/package/dist/cpexcel.full.mjs';
XLSX.set_cptable(cptable);
\n\
/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/.xlsx$/, ".csv");
\n\
/* read file */
// highlight-next-line
const workbook = XLSX.readFile(in_file);
\n\
/* generate CSV from first worksheet */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
// highlight-next-line
const csv = XLSX.utils.sheet_to_csv(first_sheet);
\n\
/* write CSV */
// highlight-next-line
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));`}
</CodeBlock>

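The `replace` call in the script only rewrites names that end in `xlsx`. If the
downloaded file had a different extension (for example `data.xls`), `out_file`
would equal `in_file` and the CSV would overwrite the download. The following
sketch shows one way to derive the CSV name for any extension. The
`csv_path_for` helper is a hypothetical name and is not part of SheetJS or the
Flat Data libraries:

```typescript
/* hypothetical helper: swap the final extension for ".csv" */
function csv_path_for(in_file: string): string {
  /* find the last path separator to isolate the file name */
  const slash = Math.max(in_file.lastIndexOf("/"), in_file.lastIndexOf("\\"));
  const dot = in_file.lastIndexOf(".");
  /* no extension in the file name, or a leading dot: append ".csv" */
  if (dot <= slash + 1) return in_file + ".csv";
  /* keep the stem and replace everything from the last dot onwards */
  return in_file.slice(0, dot) + ".csv";
}
```

With this helper, both `data.xlsx` and `data.xls` map to `data.csv`, matching
the naming convention described above.
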
## Complete Example

:::note

This was last tested by SheetJS users on 2023 September 24 using the GitHub UI.

:::

:::info pass

<https://github.com/SheetJS/flat-sheet> is an example from a previous test. The
Flat Viewer URL for the repo is <https://flatgithub.com/SheetJS/flat-sheet/>.

:::

### Create Project

0) Create a free GitHub account or sign into the GitHub web interface.

1) Create a new repository (click the "+" icon in the upper-right corner).

- When prompted, enter a repository name of your choosing.
- Ensure "Public" is selected.
- Check "Add a README file".
- Click "Create repository" at the bottom.

You will be redirected to the new project.

### Add Code

2) In the browser URL bar, change "github.com" to "github.dev". For example, if
the URL was originally `https://github.com/SheetJS/flat-sheet`, the new URL
should be `https://github.dev/SheetJS/flat-sheet`. Press Enter.

3) In the left "EXPLORER" panel, double-click just below README.md. A text box
will appear just above README. Type `postprocess.ts` and press Enter.

The main panel will show a `postprocess.ts` tab. Copy the following code to
the main editor window:

<CodeBlock title="postprocess.ts" language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
/* load the codepage support library for extended support with older formats */
import * as cptable from 'https://cdn.sheetjs.com/xlsx-${current}/package/dist/cpexcel.full.mjs';
XLSX.set_cptable(cptable);
\n\
/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/.xlsx$/, ".csv");
\n\
/* read file */
const workbook = XLSX.readFile(in_file);
\n\
/* generate CSV */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
const csv = XLSX.utils.sheet_to_csv(first_sheet);
\n\
/* write CSV */
// highlight-next-line
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));`}
</CodeBlock>

4) In the left "EXPLORER" panel, double-click just below README.md. A text box
will appear. Type `.github/workflows/data.yaml` and press Enter.

Copy the following code into the main area. It will create an action that
runs roughly once an hour:

```yaml title=".github/workflows/data.yaml"
name: flatsheet

on:
  workflow_dispatch:
  schedule:
    - cron: '0 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Setup deno
        uses: denoland/setup-deno@main
        with:
          deno-version: v1.x
      - name: Check out repo
        uses: actions/checkout@v2
      - name: Fetch data
        uses: githubocto/flat@v3
        with:
          http_url: https://docs.sheetjs.com/pres.xlsx
          downloaded_filename: data.xlsx
          postprocess: ./postprocess.ts
```

5) Click on the source control icon (a little blue circle with the number 2).
In the left panel, select the Message box, type `init` and press `Ctrl+Enter`
on Windows (`Command+Enter` on Mac).

6) Click the `☰` icon and click "Go to Repository" to return to the repo page.

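The `cron: '0 * * * *'` line in the step 4 workflow requests a run at minute 0
of every hour. GitHub evaluates schedules in UTC and may start runs late, which
is why the action runs "roughly" once an hour. The following sketch computes the
nominal next trigger time for this one expression. It is not a general cron
parser, and `next_hourly_run` is a hypothetical helper introduced here:

```typescript
const HOUR_MS = 60 * 60 * 1000;

/* nominal next trigger strictly after `now` for the schedule '0 * * * *' */
function next_hourly_run(now: Date): Date {
  /* epoch milliseconds are aligned to UTC hour boundaries */
  const current_hour = Math.floor(now.getTime() / HOUR_MS) * HOUR_MS;
  return new Date(current_hour + HOUR_MS);
}

/* a run requested at 10:15:30 UTC is next due at 11:00:00 UTC */
const example_next = next_hourly_run(new Date("2023-09-24T10:15:30Z"));
console.log(example_next.toISOString()); // 2023-09-24T11:00:00.000Z
```

The `workflow_dispatch` trigger in the same workflow allows manual runs, so
testing does not require waiting for the next scheduled time.
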
### Test Action

7) Click "Settings" to see the repository settings. In the left column, click
"Actions" to expand the submenu and click "General".

Scroll down to "Workflow permissions" and select "Read and write permissions"
if it is not selected. Click "Save".

8) Click "Actions" to see the workflows. In the left column, click `flatsheet`.

This is the page for the action. Every time the action is run, a new entry
will be added to the list.

Click "Run workflow", then click the "Run workflow" button in the popup.
This will start a new run. After about 30 seconds, a new row should show up
in the main area. The icon should be a white `✓` in a green circle.

9) Click "Code" to return to the main view. It should have a file listing that
includes `data.xlsx` (downloaded file) and `data.csv` (generated data).

10) Repeat step 8 to run the action a second time. Click "Code" again.

### Viewer

11) Go to the URL bar and change "github.com" to "flatgithub.com". For example,
if the URL was originally `https://github.com/SheetJS/flat-sheet`, the new
URL should be `https://flatgithub.com/SheetJS/flat-sheet`. Press Enter.

You will see the "Flat Viewer". In the top bar, the "Commit" option allows
for switching to an older version of the data.

The following screenshot shows the viewer in action:

![Flat Viewer for SheetJS/flat-sheet](pathname:///github/viewer.png)

The column chart in the Index column is a histogram.

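The URL edits in steps 2 and 11 follow the same pattern: only the host changes
and the owner and repository path is preserved. A minimal sketch, assuming repo
URLs of the form `https://github.com/OWNER/REPO`. The `swap_github_host` helper
is a hypothetical name introduced for illustration:

```typescript
const GITHUB_PREFIX = "https://github.com/";

/* replace the github.com host while keeping the owner/repo path */
function swap_github_host(repo_url: string, new_host: string): string {
  if (repo_url.indexOf(GITHUB_PREFIX) !== 0) {
    throw new Error("expected a URL starting with " + GITHUB_PREFIX);
  }
  return "https://" + new_host + "/" + repo_url.slice(GITHUB_PREFIX.length);
}

const repo_url = "https://github.com/SheetJS/flat-sheet";
/* step 2: web editor */
const editor_url = swap_github_host(repo_url, "github.dev");
/* step 11: Flat Viewer */
const viewer_url = swap_github_host(repo_url, "flatgithub.com");
```
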
[^1]: See ["GitHub Actions documentation"](https://docs.github.com/en/actions).
[^2]: See [`githubocto/flat`](https://github.com/githubocto/flat) repo on GitHub.
[^3]: See [`githubocto/flat-postprocessing`](https://github.com/githubocto/flat-postprocessing) repo on GitHub.
[^4]: The hosted version is available at <https://flatgithub.com/>.
[^5]: See [`readFile` in "Reading Files"](/docs/api/parse-options).
[^6]: See ["Workbook Object"](/docs/csf/book).
[^7]: See [`sheet_to_csv` in "CSV and Text"](/docs/api/utilities/csv#delimiter-separated-output).
[^8]: See [`Deno.writeFileSync`](https://deno.land/api?s=Deno.writeFileSync) in the Deno Runtime APIs documentation.