---
title: GitHub
pagination_prev: demos/ml
pagination_next: solutions/input
---

Many official data releases by governments and organizations include XLSX or
XLS files. Unfortunately some data sources do not retain older versions.

Git is a popular system for organizing a historical record of source code and
changes. Git can also store and track binary data artifacts.

GitHub is a popular host for Git repositories. GitHub's "Flat Data" project
explores storing and comparing versions of structured CSV and JSON data. The
official "Excel to CSV" example uses SheetJS to generate CSV data from files:

```mermaid
sequenceDiagram
  autonumber
  participant R as GH Repo
  participant A as GH Action
  participant S as Data Source
  loop Regular Interval (cron)
    A->>R: clone repo
    R->>A: old repo
    A->>S: fetch file
    S->>A: spreadsheet
    Note over A: SheetJS<br/>convert to CSV
    alt Data changed
      Note over A: commit new data
      A->>R: push new commit
    end
  end
```

This demo covers implementation details elided in the official write-up.

## Flat Data

As a GitHub project, Flat Data uses GitHub offerings for the entire lifecycle:

- GitHub offers free hosting for Git repositories
- GitHub Actions provide the main engine for running tasks at regular intervals
- `githubocto/flat`: Action to help fetch data and automate post-processing
- `flat-postprocessing`: Post-processing helper functions and examples
- "Flat Viewer": Web viewer for structured CSV and JSON data on GitHub

:::caution

A GitHub account is required. At the time of writing (2023 February 11), free
GitHub accounts have no Actions usage limits for public repositories.

Using private GitHub repositories is not recommended because the Flat Viewer
cannot access private repositories.

:::

### Data Source

Any publicly available spreadsheet can be a valid data source. The process will
fetch the data at specified intervals or in response to specified events.

For this demo, <https://docs.sheetjs.com/pres.xlsx> will be used.

### Action

The `githubocto/flat` action can be added as a step in a workflow:

```yaml
      - name: Fetch data
        uses: githubocto/flat@v3
        with:
          http_url: https://docs.sheetjs.com/pres.xlsx
          downloaded_filename: data.xlsx
          postprocess: ./postprocess.ts
```

This action performs the following steps:

1) `http_url` will be fetched and saved to `downloaded_filename` in the repo.
This can be approximated with the following command:

```bash
curl -L -o data.xlsx https://docs.sheetjs.com/pres.xlsx
```

2) After saving, the `postprocess` script will be run. When the script is a
`.ts` file, the action runs it in the Deno runtime. The `postprocess` script is
expected to read the downloaded file and create or overwrite files in the repo.
This can be approximated with the following command:

```bash
deno run -A ./postprocess.ts data.xlsx
```

3) The action will compare the contents of the repo, creating a new commit if
the source data or artifacts from the `postprocess` script changed.

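
The comparison in step 3 can be pictured as a byte-level check between the
previously committed file and the freshly generated one. The following is a
minimal sketch of that idea, not the action's actual implementation (its
internals are not covered here, and it may simply defer to Git). The names
`bytesEqual` and `shouldCommit` are hypothetical and used only for illustration:

```ts
/* hypothetical helper: true when both byte arrays have identical contents */
function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; ++i) if (a[i] !== b[i]) return false;
  return true;
}

/* hypothetical helper: decide whether a new commit is warranted.
   `previous` is null when the file has never been committed */
function shouldCommit(previous: Uint8Array | null, current: Uint8Array): boolean {
  return previous === null || !bytesEqual(previous, current);
}

const enc = new TextEncoder();
const old_csv = enc.encode("a,b\n1,2\n");
const same_csv = enc.encode("a,b\n1,2\n");
const new_csv = enc.encode("a,b\n1,2\n3,4\n");

console.log(shouldCommit(null, old_csv));     // true  (first run)
console.log(shouldCommit(old_csv, same_csv)); // false (nothing changed)
console.log(shouldCommit(old_csv, new_csv));  // true  (data changed)
```

When nothing changed, no commit is created, so the repository history only
records runs where the data actually differed.
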
### Post-Processing Data

:::warning

The `flat-postprocessing` library includes a number of utilities for different
data formats. The `readXLSX` helper uses SheetJS under the hood.

The library uses an older version of the SheetJS library. To use the latest
releases, the examples import from the SheetJS CDN:

```ts
// @deno-types="https://cdn.sheetjs.com/xlsx-latest/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-latest/package/xlsx.mjs';
```

The SheetJS package on the official Deno registry is out of date. This is a
known registry bug.

:::

#### Post-Process Script

The first argument to the post-processing script is the filename. The file can
be read with `XLSX.readFile` directly. `XLSX.utils.sheet_to_csv` generates CSV:

```ts title="postprocess.ts"
// @deno-types="https://cdn.sheetjs.com/xlsx-latest/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-latest/package/xlsx.mjs';
/* load the codepage support library for extended support with older formats */
import * as cptable from 'https://cdn.sheetjs.com/xlsx-latest/package/dist/cpexcel.full.mjs';
XLSX.set_cptable(cptable);

/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/\.xlsx$/, ".csv");

/* read file */
// highlight-next-line
const workbook = XLSX.readFile(in_file);

/* generate CSV from first worksheet */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
// highlight-next-line
const csv = XLSX.utils.sheet_to_csv(first_sheet);

/* write CSV */
// highlight-next-line
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));
```
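
Since data sources may publish XLS as well as XLSX files, the output path logic
may need to handle more than one extension. The following is a sketch under
that assumption. `toCsvPath` is a hypothetical helper for illustration and is
not part of SheetJS or `flat-postprocessing`:

```ts
/* hypothetical helper: swap a trailing .xlsx or .xls extension for .csv
   (case-insensitive). Names without a known extension get .csv appended,
   so the output path never equals the input path */
function toCsvPath(in_file: string): string {
  const re = /\.(xlsx|xls)$/i;
  return re.test(in_file) ? in_file.replace(re, ".csv") : in_file + ".csv";
}

console.log(toCsvPath("data.xlsx")); // data.csv
console.log(toCsvPath("DATA.XLS"));  // DATA.csv
console.log(toCsvPath("report"));    // report.csv
```

The dot is escaped because an unescaped `.` in a pattern matches any character.
Appending `.csv` in the fallback case avoids writing the CSV over the
downloaded spreadsheet when the file name has an unexpected extension.
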
## Complete Example

:::note

This was tested on 2023 February 11 using the GitHub UI.

:::

0) Create a free GitHub account or sign into the GitHub web interface.

1) Create a new repository (click the "+" icon in the upper-right corner).

- When prompted, enter a repository name of your choosing.
- Ensure "Public" is selected.
- Check "Add a README file".
- Click "Create repository" at the bottom.

You will be redirected to the new project.

2) In the browser URL bar, change "github.com" to "github.dev". For example, if
the URL was originally `https://github.com/SheetJS/flat-sheet`, the new URL
should be `https://github.dev/SheetJS/flat-sheet`. Press Enter.

3) In the left "EXPLORER" panel, double-click just below README.md. A text box
will appear just above README. Type `postprocess.ts` and press Enter.

The main panel will show a `postprocess.ts` tab. Copy the following code to
the main editor window:

```ts title="postprocess.ts"
// @deno-types="https://cdn.sheetjs.com/xlsx-latest/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-latest/package/xlsx.mjs';
/* load the codepage support library for extended support with older formats */
import * as cptable from 'https://cdn.sheetjs.com/xlsx-latest/package/dist/cpexcel.full.mjs';
XLSX.set_cptable(cptable);

/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/\.xlsx$/, ".csv");

/* read file */
const workbook = XLSX.readFile(in_file);

/* generate CSV */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
const csv = XLSX.utils.sheet_to_csv(first_sheet);

/* write CSV */
// highlight-next-line
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));
```

4) In the left "EXPLORER" panel, double-click just below README.md. A text box
will appear. Type `.github/workflows/data.yaml` and press Enter.

Copy the following code into the main area. It will create an action that
runs roughly once an hour:

```yaml title=".github/workflows/data.yaml"
name: flatsheet

on:
  workflow_dispatch:
  schedule:
    - cron: '0 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Setup deno
        uses: denoland/setup-deno@main
        with:
          deno-version: v1.x
      - name: Check out repo
        uses: actions/checkout@v2
      - name: Fetch data
        uses: githubocto/flat@v3
        with:
          http_url: https://docs.sheetjs.com/pres.xlsx
          downloaded_filename: data.xlsx
          postprocess: ./postprocess.ts
```

5) Click on the source control icon (a little blue circle with the number 2).
In the left panel, select the Message box, type `init` and press `Ctrl+Enter`
on Windows (`Command+Enter` on Mac).

6) Click the `☰` icon and click "Go to Repository" to return to the repo page.

7) Click "Actions" to see the workflows. In the left column, click `flatsheet`.

This is the page for the action. Every time the action is run, a new entry
will be added to the list.

Click "Run workflow", then click the "Run workflow" button in the popup.
This will start a new run. After about 30 seconds, a new row should show up
in the main area. The icon should be a white `✓` in a green circle.

8) Click "Code" to return to the main view. It should have a file listing that
includes `data.xlsx` (downloaded file) and `data.csv` (generated data).

Now repeat step 7 to run the action a second time. Click "Code" again.

9) Go to the URL bar and change "github.com" to "flatgithub.com". For example,
if the URL was originally `https://github.com/SheetJS/flat-sheet`, the new
URL should be `https://flatgithub.com/SheetJS/flat-sheet`. Press Enter.

You will see the "Flat Viewer". In the top bar, the "Commit" option allows
for switching to an older version of the data.

The update process will run once an hour. If you return in a few hours and
refresh the page, there should be more commits in the selection list.

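
To make the `0 * * * *` schedule concrete: the five cron fields are minute,
hour, day of month, month, and day of week, so this expression fires at minute
0 of every hour. GitHub evaluates schedules in UTC, and scheduled runs can
start late when the service is busy, which is why the workflow runs "roughly"
once an hour. The sketch below estimates the next nominal run time.
`nextHourlyRun` is a hypothetical helper for illustration and is not part of
any GitHub tooling:

```ts
/* hypothetical helper: next nominal firing time of the cron '0 * * * *' */
function nextHourlyRun(now: Date): Date {
  const next = new Date(now.getTime());
  /* jump to the top of the following hour in UTC (hour 24 rolls over
     to 00:00 of the next day) */
  next.setUTCHours(next.getUTCHours() + 1, 0, 0, 0);
  return next;
}

const now = new Date("2023-02-11T07:13:53Z");
console.log(nextHourlyRun(now).toISOString()); // 2023-02-11T08:00:00.000Z
```

A new commit appears in the Flat Viewer only when a run detects changed data,
so the commit list may grow more slowly than once per hour.
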