docs.sheetjs.com/docz/docs/03-demos/30-cloud/18-github.md

337 lines
12 KiB
Markdown
Raw Normal View History

2022-08-29 20:34:30 +00:00
---
2024-01-03 06:47:00 +00:00
title: Flat Data Processing in GitHub
2023-09-25 07:30:54 +00:00
sidebar_label: GitHub
2023-02-28 11:40:44 +00:00
pagination_prev: demos/local/index
pagination_next: demos/extensions/index
2022-08-29 20:34:30 +00:00
---
2023-04-27 09:12:19 +00:00
import current from '/version.js';
import CodeBlock from '@theme/CodeBlock';
2023-09-25 07:30:54 +00:00
[Git](https://git-scm.com/) is a popular system for organizing a historical
record of text files and changes. Git can also store and track spreadsheets.
2023-02-12 04:20:11 +00:00
2024-01-03 06:47:00 +00:00
GitHub hosts Git repositories and provides infrastructure to execute workflows.
The ["Flat Data" project](https://octo.github.com/projects/flat-data) explores
storing and comparing versions of structured data using GitHub infrastructure.
2023-09-25 07:30:54 +00:00
[SheetJS](https://sheetjs.com) is a JavaScript library for reading and writing
data from spreadsheets.
This demo uses SheetJS in GitHub to process spreadsheet. We'll explore how to
fetch and process spreadsheets at regular intervals, and how to keep track of
changes over time.
:::info pass
["Excel to CSV"](https://octo.github.com/projects/flat-data#:~:text=Excel) is an
official example that pulls XLSX workbooks from an endpoint and uses SheetJS to
2024-01-03 06:47:00 +00:00
parse the workbooks and generate CSV files.
2023-09-25 07:30:54 +00:00
:::
The following diagram depicts the data dance:
2023-02-12 04:20:11 +00:00
```mermaid
sequenceDiagram
autonumber
2024-01-03 06:47:00 +00:00
participant R as GitHub Repo
participant A as GitHub Action
2023-02-12 04:20:11 +00:00
participant S as Data Source
loop Regular Interval (cron)
A->>R: clone repo
R->>A: old repo
A->>S: fetch file
S->>A: spreadsheet
Note over A: SheetJS<br/>convert to CSV
alt Data changed
Note over A: commit new data
A->>R: push new commit
end
end
```
2022-08-29 20:34:30 +00:00
## Flat Data
2024-01-03 06:47:00 +00:00
Many official data releases by governments and organizations include XLSX or
XLS files. Unfortunately some data sources do not retain older versions.
Software developers typically use version control systems such as Git to track
changes in source code.
The "Flat Data" project starts from the idea that the same version control
systems can be used to track changes in data. Third-party data sources can be
snapshotted at regular intervals and stored in Git repositories.
### Components
2022-08-29 20:34:30 +00:00
As a project from the company, the entire lifecycle uses GitHub offerings:
2024-01-03 06:47:00 +00:00
- GitHub.com[^1] offers free hosting for Git repositories
- GitHub Actions[^2] infrastructure runs tasks at regular intervals
- `githubocto/flat`[^3] library helps fetch data and automate post-processing
- `flat-postprocessing`[^4] library provides post-processing helper functions
- "Flat Viewer"[^5] displays structured CSV and JSON data from Git repositories
2022-08-29 20:34:30 +00:00
2023-09-24 03:59:48 +00:00
:::caution pass
2022-08-29 20:34:30 +00:00
2023-11-30 07:10:37 +00:00
A GitHub account is required. When the demo was last tested, "GitHub Free"
2024-01-03 06:47:00 +00:00
accounts had no Actions usage limits for public repositories[^6].
2022-08-29 20:34:30 +00:00
2023-11-30 07:10:37 +00:00
Private GitHub repositories can be used for processing data, but the Flat Viewer
will not be able to display private data.
2022-08-29 20:34:30 +00:00
:::
### Data Source
Any publicly available spreadsheet can be a valid data source. The process will
fetch the data on specified intervals or events.
2022-11-07 10:41:00 +00:00
For this demo, <https://docs.sheetjs.com/pres.xlsx> will be used.
2022-08-29 20:34:30 +00:00
### Action
The `githubocto/flat` action can be added as a step in a workflow:
```yaml
- name: Fetch data
uses: githubocto/flat@v3
with:
2022-11-07 10:41:00 +00:00
http_url: https://docs.sheetjs.com/pres.xlsx
2022-08-29 20:34:30 +00:00
downloaded_filename: data.xlsx
postprocess: ./postprocess.ts
```
2023-02-12 04:20:11 +00:00
This action performs the following steps:
1) `http_url` will be fetched and saved to `downloaded_filename` in the repo.
2022-08-29 20:34:30 +00:00
This can be approximated with the following command:
```bash
2022-11-07 10:41:00 +00:00
curl -L -o data.xlsx https://docs.sheetjs.com/pres.xlsx
2022-08-29 20:34:30 +00:00
```
2023-02-12 04:20:11 +00:00
2) After saving, the `postprocess` script will be run. When a `.ts` file is the
2022-08-29 20:34:30 +00:00
script, it will run the script in the Deno runtime. The `postprocess` script is
expected to read the downloaded file and create or overwrite files in the repo.
This can be approximated with the following command:
```bash
deno run -A ./postprocess.ts data.xlsx
```
2023-02-12 04:20:11 +00:00
3) The action will compare the contents of the repo, creating a new commit if
2022-08-29 20:34:30 +00:00
the source data or artifacts from the `postprocess` script changed.
### Post-Processing Data
2023-09-19 19:08:29 +00:00
:::warning pass
2022-08-29 20:34:30 +00:00
The `flat-postprocessing` library includes a number of utilities for different
data formats. The `readXLSX` helper uses SheetJS under the hood.
The library uses an older version of the SheetJS library. To use the latest
releases, the examples import from the SheetJS CDN:
2023-04-27 09:12:19 +00:00
<CodeBlock language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';`}
</CodeBlock>
2022-08-29 20:34:30 +00:00
2023-09-25 07:30:54 +00:00
See [the "Deno" installation section](/docs/getting-started/installation/deno)
for more details.
2022-09-02 05:52:23 +00:00
2022-08-29 20:34:30 +00:00
:::
#### Post-Process Script
2023-09-25 07:30:54 +00:00
The first argument to the post-processing script is the filename.
2024-01-03 06:47:00 +00:00
The SheetJS `readFile` method[^7] will read the file and generate a SheetJS
workbook object[^8]. After extracting the first worksheet, `sheet_to_csv`[^9]
2023-09-25 07:30:54 +00:00
generates a CSV string.
After generating a CSV string, the string should be written to the filesystem
2024-01-03 06:47:00 +00:00
using `Deno.writeFileSync`[^10]. By convention, the CSV should preserve the file
2023-09-25 07:30:54 +00:00
name stem and replace the extension with `.csv`:
2022-08-29 20:34:30 +00:00
2023-04-27 09:12:19 +00:00
<CodeBlock title="postprocess.ts" language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
2022-08-29 20:34:30 +00:00
/* load the codepage support library for extended support with older formats */
2023-04-27 09:12:19 +00:00
import * as cptable from 'https://cdn.sheetjs.com/xlsx-${current}/package/dist/cpexcel.full.mjs';
2022-08-29 20:34:30 +00:00
XLSX.set_cptable(cptable);
2023-04-27 09:12:19 +00:00
\n\
2022-08-29 20:34:30 +00:00
/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/.xlsx$/, ".csv");
2023-04-27 09:12:19 +00:00
\n\
2022-08-29 20:34:30 +00:00
/* read file */
// highlight-next-line
const workbook = XLSX.readFile(in_file);
2023-04-27 09:12:19 +00:00
\n\
2022-08-29 20:34:30 +00:00
/* generate CSV from first worksheet */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
// highlight-next-line
const csv = XLSX.utils.sheet_to_csv(first_sheet);
2023-04-27 09:12:19 +00:00
\n\
2022-08-29 20:34:30 +00:00
/* write CSV */
// highlight-next-line
2023-04-27 09:12:19 +00:00
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));`}
</CodeBlock>
2022-08-29 20:34:30 +00:00
## Complete Example
2023-11-30 07:10:37 +00:00
:::note Tested Deployments
2022-08-29 20:34:30 +00:00
2023-12-05 03:46:54 +00:00
This was last tested by SheetJS users on 2023 December 04.
2022-08-29 20:34:30 +00:00
:::
2023-09-25 07:30:54 +00:00
:::info pass
<https://github.com/SheetJS/flat-sheet> is an example from a previous test. The
Flat Viewer URL for the repo is <https://flatgithub.com/SheetJS/flat-sheet/>
:::
### Create Project
2022-08-29 20:34:30 +00:00
0) Create a free GitHub account or sign into the GitHub web interface.
1) Create a new repository (click the "+" icon in the upper-right corner).
- When prompted, enter a repository name of your choosing.
- Ensure "Public" is selected
- Check "Add a README file"
- Click "Create repository" at the bottom.
You will be redirected to the new project.
2023-09-25 07:30:54 +00:00
### Add Code
2022-08-29 20:34:30 +00:00
2) In the browser URL bar, change "github.com" to "github.dev". For example, if
the URL was originally `https://github.com/SheetJS/flat-sheet` , the new URL
should be `https://github.dev/SheetJS/flat-sheet` . Press Enter.
3) In the left "EXPLORER" panel, double-click just below README.md. A text box
will appear just above README. Type `postprocess.ts` and press Enter.
The main panel will show a `postprocess.ts` tab. Copy the following code to
the main editor window:
2023-04-27 09:12:19 +00:00
<CodeBlock title="postprocess.ts" language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
2022-08-29 20:34:30 +00:00
/* load the codepage support library for extended support with older formats */
2023-04-27 09:12:19 +00:00
import * as cptable from 'https://cdn.sheetjs.com/xlsx-${current}/package/dist/cpexcel.full.mjs';
2022-08-29 20:34:30 +00:00
XLSX.set_cptable(cptable);
2023-04-27 09:12:19 +00:00
\n\
2022-08-29 20:34:30 +00:00
/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/.xlsx$/, ".csv");
2023-04-27 09:12:19 +00:00
\n\
2022-08-29 20:34:30 +00:00
/* read file */
const workbook = XLSX.readFile(in_file);
2023-04-27 09:12:19 +00:00
\n\
2022-08-29 20:34:30 +00:00
/* generate CSV */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
const csv = XLSX.utils.sheet_to_csv(first_sheet);
2023-04-27 09:12:19 +00:00
\n\
2022-08-29 20:34:30 +00:00
/* write CSV */
2022-09-05 10:00:35 +00:00
// highlight-next-line
2023-04-27 09:12:19 +00:00
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));`}
</CodeBlock>
2022-08-29 20:34:30 +00:00
4) In the left "EXPLORER" panel, double-click just below README.md. A text box
will appear. Type `.github/workflows/data.yaml` and press Enter.
Copy the following code into the main area. It will create an action that
runs roughly once an hour:
```yaml title=".github/workflows/data.yaml"
name: flatsheet
on:
workflow_dispatch:
schedule:
- cron: '0 * * * *'
jobs:
scheduled:
runs-on: ubuntu-latest
steps:
- name: Setup deno
uses: denoland/setup-deno@main
with:
deno-version: v1.x
- name: Check out repo
uses: actions/checkout@v2
- name: Fetch data
uses: githubocto/flat@v3
with:
2022-11-07 10:41:00 +00:00
http_url: https://docs.sheetjs.com/pres.xlsx
2022-08-29 20:34:30 +00:00
downloaded_filename: data.xlsx
postprocess: ./postprocess.ts
```
5) Click on the source control icon (a little blue circle with the number 2).
In the left panel, select Message box, type `init` and press `Ctrl+Enter` on
Windows (`Command+Enter` on Mac).
6) Click the `☰` icon and click "Go to Repository" to return to the repo page.
2023-09-25 07:30:54 +00:00
### Test Action
2023-04-07 08:30:20 +00:00
7) Click "Settings" to see the repository settings. In the left column, click
"Actions" to expand the submenu and click "General".
Scroll down to "Workflow permissions" and select "Read and write permissions"
if it is not selected. Click "Save".
8) Click "Actions" to see the workflows. In the left column, click `flatsheet`.
2022-08-29 20:34:30 +00:00
This is the page for the action. Every time the action is run, a new entry
will be added to the list.
Click "Run workflow", then click the "Run workflow" button in the popup.
This will start a new run. After about 30 seconds, a new row should show up
in the main area. The icon should be a white `✓` in a green circle.
2023-04-07 08:30:20 +00:00
9) Click "Code" to return to the main view. It should have a file listing that
2022-08-29 20:34:30 +00:00
includes `data.xlsx` (downloaded file) and `data.csv` (generated data)
2023-09-25 07:30:54 +00:00
10) Repeat step 8 to run the action a second time. Click "Code" again.
### Viewer
2022-08-29 20:34:30 +00:00
2023-09-25 07:30:54 +00:00
11) Go to the URL bar and change "github.com" to "flatgithub.com". For example,
2022-08-29 20:34:30 +00:00
if the URL was originally `https://github.com/SheetJS/flat-sheet` , the new
URL should be `https://flatgithub.com/SheetJS/flat-sheet` . Press Enter.
You will see the "Flat Viewer". In the top bar, the "Commit" option allows
for switching to an older version of the data.
2023-09-25 07:30:54 +00:00
The following screenshot shows the viewer in action:
![Flat Viewer for SheetJS/flat-sheet](pathname:///github/viewer.png)
The column chart in the Index column is a histogram.
2023-07-23 21:01:30 +00:00
2024-01-03 06:47:00 +00:00
[^1]: See ["Repositories documentation"](https://docs.github.com/en/repositories) in the GitHub documentation.
[^2]: See ["GitHub Actions documentation"](https://docs.github.com/en/actions) in the GitHub documentation.
[^3]: See [`githubocto/flat`](https://github.com/githubocto/flat) repo on GitHub.
[^4]: See [`githubocto/flat-postprocessing`](https://github.com/githubocto/flat-postprocessing) repo on GitHub.
[^5]: The hosted version is available at <https://flatgithub.com/>
[^6]: See ["About billing for GitHub Actions"](https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions) in the GitHub documentation.
[^7]: See [`readFile` in "Reading Files"](/docs/api/parse-options)
[^8]: See ["Workbook Object"](/docs/csf/book)
[^9]: See [`sheet_to_csv` in "CSV and Text"](/docs/api/utilities/csv#delimiter-separated-output)
[^10]: See [`Deno.writeFileSync`](https://deno.land/api?s=Deno.writeFileSync) in the Deno Runtime APIs documentation.