title | pagination_prev | pagination_next |
---|---|---|
GitHub | demos/local/index | demos/extensions/index |
import current from '/version.js';
import CodeBlock from '@theme/CodeBlock';
Many official data releases by governments and organizations include XLSX or XLS files. Unfortunately, some data sources do not retain older versions.
Git is a popular system for organizing a historical record of source code and changes. Git can also store and track binary data artifacts.
GitHub is a popular host for Git repositories. GitHub's "Flat Data" project explores storing and comparing versions of structured CSV and JSON data. The official "Excel to CSV"[^1] example uses SheetJS to generate CSV data from spreadsheet files:
```mermaid
sequenceDiagram
  autonumber
  participant R as GH Repo
  participant A as GH Action
  participant S as Data Source
  loop Regular Interval (cron)
    A->>R: clone repo
    R->>A: old repo
    A->>S: fetch file
    S->>A: spreadsheet
    Note over A: SheetJS<br/>convert to CSV
    alt Data changed
      Note over A: commit new data
      A->>R: push new commit
    end
  end
```
This demo covers implementation details elided in the official write-up.
Flat Data
Because Flat Data is a GitHub project, the entire lifecycle uses GitHub offerings:
- GitHub offers free hosting for Git repositories
- GitHub Actions provide the main engine for running tasks at regular intervals
- `githubocto/flat`: Action to help fetch data and automate post-processing
- `flat-postprocessing`: Post-processing helper functions and examples
- "Flat Viewer": Web viewer for structured CSV and JSON data on GitHub
:::caution
A GitHub account is required. When the demo was tested, free GitHub accounts had no Actions usage limits for public repositories.
Using private GitHub repositories is not recommended because the Flat Viewer cannot access private repositories.
:::
Data Source
Any publicly available spreadsheet can be a valid data source. The process will fetch the data at specified intervals or in response to specified events.
For this demo, https://docs.sheetjs.com/pres.xlsx will be used.
Action
The `githubocto/flat` action can be added as a step in a workflow:
```yaml
- name: Fetch data
  uses: githubocto/flat@v3
  with:
    http_url: https://docs.sheetjs.com/pres.xlsx
    downloaded_filename: data.xlsx
    postprocess: ./postprocess.ts
```
This action performs the following steps:
- `http_url` will be fetched and saved to `downloaded_filename` in the repo. This can be approximated with the following command:

```bash
curl -L -o data.xlsx https://docs.sheetjs.com/pres.xlsx
```

- After saving, the `postprocess` script will be run. When the script is a `.ts` file, the action will run it in the Deno runtime. The `postprocess` script is expected to read the downloaded file and create or overwrite files in the repo. This can be approximated with the following command:

```bash
deno run -A ./postprocess.ts data.xlsx
```

- The action will compare the contents of the repo, creating a new commit if the source data or artifacts from the `postprocess` script changed.
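
The comparison in the last step can be pictured as a diff between two snapshots of the repo. The following TypeScript sketch is illustrative only: the `Snapshot` type and the `changedPaths` and `needsCommit` helpers are hypothetical names, and the real action presumably relies on Git itself rather than on in-memory comparisons.

```typescript
/** Hypothetical snapshot of repo files: path -> contents (text, for simplicity). */
type Snapshot = Record<string, string>;

/** Own-property check, so inherited names such as "toString" are not counted. */
function hasPath(snapshot: Snapshot, path: string): boolean {
  return Object.prototype.hasOwnProperty.call(snapshot, path);
}

/** Sorted list of paths that were added, removed, or modified. */
function changedPaths(before: Snapshot, after: Snapshot): string[] {
  const changed: string[] = [];
  for (const path of Object.keys(after)) {
    // new file, or same path with different contents
    if (!hasPath(before, path) || before[path] !== after[path]) changed.push(path);
  }
  for (const path of Object.keys(before)) {
    // file existed before but is now missing
    if (!hasPath(after, path)) changed.push(path);
  }
  return changed.sort();
}

/** A commit is only needed when at least one file differs. */
function needsCommit(before: Snapshot, after: Snapshot): boolean {
  return changedPaths(before, after).length > 0;
}
```

Under these assumptions, a run where the source spreadsheet is unchanged produces identical `data.xlsx` and `data.csv` files, so `needsCommit` returns `false` and no commit is created.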
Post-Processing Data
:::warning pass
The `flat-postprocessing` library includes a number of utilities for different data formats. The `readXLSX` helper uses SheetJS under the hood.
The library uses an older version of the SheetJS library. To use the latest releases, the examples import from the SheetJS CDN:
<CodeBlock language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';`}
</CodeBlock>
The official Deno registry is out of date. This is a known registry bug.
:::
Post-Process Script
The first argument to the post-processing script is the filename. The file can be read with `XLSX.readFile` directly. `XLSX.utils.sheet_to_csv` generates CSV:
<CodeBlock language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
/* load the codepage support library for extended support with older formats */
import * as cptable from 'https://cdn.sheetjs.com/xlsx-${current}/package/dist/cpexcel.full.mjs';
XLSX.set_cptable(cptable);
\n\
/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/.xlsx$/, ".csv");
\n\
/* read file */
// highlight-next-line
const workbook = XLSX.readFile(in_file);
\n\
/* generate CSV from first worksheet */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
// highlight-next-line
const csv = XLSX.utils.sheet_to_csv(first_sheet);
\n\
/* write CSV */
// highlight-next-line
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));`}
</CodeBlock>
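
The `out_file` pattern in the script is sufficient for `data.xlsx`. For other inputs it is looser than it looks: the unescaped `.` matches any character, and a name such as `data.xls` does not match at all, so `out_file` would equal `in_file` and the CSV would overwrite the download. A minimal sketch of a stricter helper follows. The name `csvPathFor` and the extension list are assumptions for illustration, not part of SheetJS or `flat-postprocessing`.

```typescript
/** Spreadsheet extensions recognized by this sketch (an illustrative list). */
const SPREADSHEET_EXT = /\.(xlsx|xlsm|xlsb|xls)$/i;

/** Derive the CSV output path for a downloaded spreadsheet. */
function csvPathFor(inFile: string): string {
  if (SPREADSHEET_EXT.test(inFile)) {
    // swap the recognized extension for ".csv"
    return inFile.replace(SPREADSHEET_EXT, ".csv");
  }
  // unknown extension: append ".csv" so the input file is never overwritten
  return inFile + ".csv";
}
```

In the script, `const out_file = csvPathFor(in_file);` would take the place of the `in_file.replace(...)` line.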
Complete Example
:::note
This was last tested on 2023 April 06 using the GitHub UI.
:::
- Create a free GitHub account or sign into the GitHub web interface.

- Create a new repository (click the "+" icon in the upper-right corner).

  - When prompted, enter a repository name of your choosing.
  - Ensure "Public" is selected.
  - Check "Add a README file".
  - Click "Create repository" at the bottom.

  You will be redirected to the new project.
- In the browser URL bar, change "github.com" to "github.dev". For example, if the URL was originally `https://github.com/SheetJS/flat-sheet`, the new URL should be `https://github.dev/SheetJS/flat-sheet`. Press Enter.

- In the left "EXPLORER" panel, double-click just below README.md. A text box will appear just above README. Type `postprocess.ts` and press Enter.

  The main panel will show a `postprocess.ts` tab. Copy the following code to the main editor window:
<CodeBlock language="ts">{`\
// @deno-types="https://cdn.sheetjs.com/xlsx-${current}/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-${current}/package/xlsx.mjs';
/* load the codepage support library for extended support with older formats */
import * as cptable from 'https://cdn.sheetjs.com/xlsx-${current}/package/dist/cpexcel.full.mjs';
XLSX.set_cptable(cptable);
\n\
/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/.xlsx$/, ".csv");
\n\
/* read file */
const workbook = XLSX.readFile(in_file);
\n\
/* generate CSV */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
const csv = XLSX.utils.sheet_to_csv(first_sheet);
\n\
/* write CSV */
// highlight-next-line
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));`}
</CodeBlock>
- In the left "EXPLORER" panel, double-click just below README.md. A text box will appear. Type `.github/workflows/data.yaml` and press Enter.

  Copy the following code into the main area. It will create an action that runs roughly once an hour:
```yaml
name: flatsheet

on:
  workflow_dispatch:
  schedule:
    - cron: '0 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Setup deno
        uses: denoland/setup-deno@main
        with:
          deno-version: v1.x
      - name: Check out repo
        uses: actions/checkout@v2
      - name: Fetch data
        uses: githubocto/flat@v3
        with:
          http_url: https://docs.sheetjs.com/pres.xlsx
          downloaded_filename: data.xlsx
          postprocess: ./postprocess.ts
```
- Click on the source control icon (a little blue circle with the number 2). In the left panel, select the Message box, type `init` and press `Ctrl+Enter` on Windows (`Command+Enter` on Mac).

- Click the `☰` icon and click "Go to Repository" to return to the repo page.

- Click "Settings" to see the repository settings. In the left column, click "Actions" to expand the submenu and click "General".

  Scroll down to "Workflow permissions" and select "Read and write permissions" if it is not selected. Click "Save".
- Click "Actions" to see the workflows. In the left column, click `flatsheet`.

  This is the page for the action. Every time the action is run, a new entry will be added to the list.

  Click "Run workflow", then click the "Run workflow" button in the popup. This will start a new run. After about 30 seconds, a new row should show up in the main area. The icon should be a white `✓` in a green circle.

- Click "Code" to return to the main view. It should have a file listing that includes `data.xlsx` (downloaded file) and `data.csv` (generated data).

  Now repeat the previous step ("Actions", then "Run workflow") to run the action a second time. Click "Code" again.
- Go to the URL bar and change "github.com" to "flatgithub.com". For example, if the URL was originally `https://github.com/SheetJS/flat-sheet`, the new URL should be `https://flatgithub.com/SheetJS/flat-sheet`. Press Enter.

  You will see the "Flat Viewer". In the top bar, the "Commit" option allows for switching to an older version of the data.

  The update process will run once an hour. If you return in a few hours and refresh the page, there should be more commits in the selection list.
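
To make "once an hour" concrete: the `'0 * * * *'` schedule in the workflow is due at minute 0 of every hour, and GitHub evaluates cron schedules in UTC. The sketch below computes the next due time after a given instant. `nextHourlyRun` is a hypothetical helper written for this page, and actual runs may start later than the computed time because scheduled workflows can be delayed.

```typescript
/** Next time a `0 * * * *` schedule is due, strictly after `now` (UTC). */
function nextHourlyRun(now: Date): Date {
  const next = new Date(now.getTime());
  // drop minutes, seconds and milliseconds to reach the top of the current hour
  next.setUTCMinutes(0, 0, 0);
  // advance to the following hour; hour 24 rolls over to the next day
  next.setUTCHours(next.getUTCHours() + 1);
  return next;
}
```

For example, a check at 10:15 UTC would report 11:00 UTC as the next due time. New commits only appear for runs where the downloaded data or generated CSV actually changed.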
[^1]: See "Excel to CSV" in the "Flat Data" writeup.