title | sidebar_label | description | pagination_prev | pagination_next |
---|---|---|---|---|
Spreadsheet Data in Pandas | Python (Pandas) | Process structured data in Python with Pandas. Seamlessly integrate spreadsheets into your workflow with SheetJS. Analyze complex Excel spreadsheets with confidence. | demos/cloud/index | demos/bigdata/index |
import current from '/version.js'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import CodeBlock from '@theme/CodeBlock';
Pandas1 is a Python software library for data analysis.
SheetJS is a JavaScript library for reading and writing data from spreadsheets.
This demo uses SheetJS to process data from a spreadsheet and translate to the Pandas DataFrame format. We'll explore how to load SheetJS from Python scripts, generate DataFrames from workbooks, and write DataFrames back to workbooks.
:::note
This demo was tested in the following deployments:
Architecture | V8 version | Pandas | Python | Date |
---|---|---|---|---|
darwin-x64 | 11.5.150.16 | 2.0.3 | 3.11.4 | 2023-07-29 |
:::
:::info pass
Pandas includes limited support for reading spreadsheets (`pandas.read_excel`) and writing XLSX spreadsheets (`pandas.DataFrame.to_excel`).
The SheetJS approach supports many common spreadsheet formats that are not supported by the current set of Pandas codecs and offers greater flexibility in processing complex worksheets.
:::
Integration Details
JS code cannot be run directly in the Python interpreter. To run JS code from Python, JavaScript engines2 can be embedded in CPython modules.
Loading SheetJS
This demo uses the `STPyV8` module3 to access the V8 JavaScript engine.
Initialize V8
The engine library provides a convenient context manager `JSContext` for context resource management. Within the context, the `eval` method can evaluate code:
from STPyV8 import JSContext

# Initialize JS context
with JSContext() as ctxt:
    # Run code
    res = ctxt.eval("'Sheet' + 'JS'")
    # print result
    print(res)
`STPyV8` handles data interchange for common types. Arrays and JS objects can be translated to Python `list` and `dict` respectively. The following `convert` function is used in the test suite4:
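from STPyV8 import JSArray, JSObject

# from `tests/test_Wrapper.py` in the STPyV8 library
# License: Apache 2.0
def convert(obj):
    # JS arrays become Python lists, converting each element recursively
    if isinstance(obj, JSArray):
        return [convert(v) for v in obj]
    # JS objects become Python dicts keyed by property name
    if isinstance(obj, JSObject):
        return dict([[str(k), convert(obj.__getattr__(str(k)))] for k in obj.__dir__()])
    # primitive values are returned unchanged
    return obj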
Loading the Library
The Standalone scripts can be parsed and evaluated from the JS engine. Once evaluated, the `XLSX` variable is available as a global.

Assuming the standalone library is in the same directory as the source file, the script can be evaluated with `eval`:
# Within a JSContext, open `xlsx.full.min.js` and evaluate
with open("xlsx.full.min.js") as f:
    ctxt.eval(f.read())
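The loading step can also be wrapped in a small helper. This is only a sketch: `load_script` and its `evaluate` parameter are hypothetical names, not part of STPyV8 or SheetJS. In this demo `evaluate` would be `ctxt.eval`. Reading the script with an explicit UTF-8 encoding is an assumption made here so the result does not depend on the platform default.

```python
from pathlib import Path

def load_script(evaluate, script_path="xlsx.full.min.js"):
    """Read a JS source file and hand its text to an evaluator callable."""
    # hypothetical helper: `evaluate` is any callable accepting JS source text,
    # such as the `eval` method of an active JSContext
    source = Path(script_path).read_text(encoding="utf-8")
    return evaluate(source)
```

Within a `JSContext`, the call would look like `load_script(ctxt.eval)`.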
Reading Files
The following diagram depicts the spreadsheet salsa:
flowchart LR
file[(workbook\nfile)]
subgraph SheetJS operations
base64(Base64\nstring)
wb((SheetJS\nWorkbook))
aoo(array of\nobjects)
end
subgraph Pandas operations
lod(list of\nrecords)
df[(Pandas\nDataFrame)]
end
file --> |`open`/`read`\nPython ops| base64
base64 --> |`XLSX.read`\nParse Bytes| wb
wb --> |`sheet_to_json`\nExtract Data| aoo
aoo --> |`convert`\nPython ops|lod
lod --> |`from_records`\nPandas ops| df
At a high level:
- Pure Python operations read the file and generate a Base64 string
- SheetJS libraries parse the string and generate JS records
- JS engine operations translate the rows to Python `list` of `dict`s
- Pandas operations translate the Python data to a DataFrame
Read files
The safest format for data interchange is Base64-encoded strings:
from base64 import b64encode

with open(path, mode="rb") as f:
    file_bytes = f.read()
b64 = b64encode(file_bytes)
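The read step can be wrapped in a small helper. This is a minimal sketch: the name `file_to_base64` is hypothetical, and decoding the result to an ASCII `str` is an assumption made here so the value is plain text. The snippet above keeps the `bytes` object returned by `b64encode`.

```python
from base64 import b64encode

def file_to_base64(path):
    """Read a file in binary mode and return its contents as a Base64 str."""
    # hypothetical helper; binary mode avoids newline and encoding translation
    with open(path, mode="rb") as f:
        file_bytes = f.read()
    # b64encode returns bytes; the ASCII decode (an assumption) yields a str
    return b64encode(file_bytes).decode("ascii")
```

For example, `file_to_base64("pres.numbers")` would return the Base64 text for the test file used later in this demo.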
Parse bytes
From JS code, `XLSX.read`5 parses the Base64 string:
wb = ctxt.eval("(b64 => XLSX.read(b64, {type: 'base64', dense: true}))")(b64)
The `wb` object follows the "Common Spreadsheet Format"6, an in-memory format for representing workbooks, worksheets, cells, and spreadsheet features.
Get First Worksheet
As explained in the "Workbook Object"7 section:
- the `SheetNames` property is an ordered list of the sheet names in the workbook
- the `Sheets` property of the workbook object is an object whose keys are sheet names and whose values are sheet objects.
For use in Python, the `SheetNames` array must be converted to a `list`:
sheet_names = convert(wb.SheetNames)
first_sheet_name = sheet_names[0]
Since utility functions will process the worksheet object from JavaScript, it is preferable not to convert the object:
first_sheet = wb.Sheets[first_sheet_name] # do not convert
Generate List of Records
In JavaScript, the equivalent of the "`list` of `dict`s" or "`list` of records" is "array of objects". They can be created with `XLSX.utils.sheet_to_json`8:
rows = convert(ctxt.eval("(ws => XLSX.utils.sheet_to_json(ws))")(first_sheet))
Generate Pandas DataFrame
`rows` is a `list` of `dict` objects. `from_records`9 understands this data shape and generates a proper DataFrame:
import pandas as pd
df = pd.DataFrame.from_records(rows)
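To see the data shape in isolation, the following sketch builds a DataFrame from hand-written records. The rows are copied from the sample output shown later in this demo and stand in for the result of `convert`. No JS engine is involved.

```python
import pandas as pd

# Stand-in for the `list` of `dict` objects produced by `convert`
rows = [
    {"Name": "Bill Clinton", "Index": 42},
    {"Name": "GeorgeW Bush", "Index": 43},
    {"Name": "Barack Obama", "Index": 44},
]

# Each dict becomes one row; the dict keys become the column labels
df = pd.DataFrame.from_records(rows)
print(list(df.columns))
print(len(df))
```

The column order follows the key order of the records.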
Writing Files
The writing process looks similar to the reading process in reverse:
flowchart LR
subgraph Pandas operations
df[(Pandas\nDataFrame)]
json(JSON\nString)
end
subgraph SheetJS operations
aoo(array of\nobjects)
wb((SheetJS\nWorkbook))
base64(Base64\nstring)
end
file[(workbook\nfile)]
df --> |`to_json`\nPandas ops| json
json --> |`JSON.parse`\nJS Engine| aoo
aoo --> |`json_to_sheet`\nSheetJS Ops| wb
wb --> |`XLSX.write`\nBase64| base64
base64 --> |`open`/`write`\nPython ops| file
At a high level:
- Pandas operations translate the Python data to a JSON string
- JS engine operations translate the JSON string to an array of objects
- SheetJS libraries process the array and generate a Base64-encoded workbook
- Pure Python operations decode the Base64 string and write the bytes to file.
Generate JSON
`DataFrame#to_json`10 with the option `orient="records"` generates a JSON string that encodes an array of objects:
json = df.to_json(orient="records")
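The shape of the generated string can be checked from Python alone. This sketch uses a small hand-built DataFrame with sample values and parses the string back with the standard `json` module. The result is named `json_text` here, rather than `json`, so that the standard module is not shadowed.

```python
import json
import pandas as pd

# Small sample DataFrame standing in for real spreadsheet data
df = pd.DataFrame({"Name": ["Bill Clinton", "GeorgeW Bush"], "Index": [42, 43]})

# orient="records" encodes one JSON object per row
json_text = df.to_json(orient="records")

# Parsing the string recovers a list of dicts, mirroring what JSON.parse sees
records = json.loads(json_text)
print(records)
```

Each parsed record has the column labels as keys, which is the "array of objects" shape expected on the JavaScript side.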
Generate Worksheet
In JavaScript, `JSON.parse` will interpret the string as an array of objects. `XLSX.utils.json_to_sheet`11 generates a SheetJS worksheet object:
sheet = ctxt.eval("(json => XLSX.utils.json_to_sheet(JSON.parse(json)) )")(json)
Export Enhancements
At this point, there are many options for improving the appearance of the sheet. For example, the "Export Tutorial"12 shows how to adjust column widths.
:::tip pass
SheetJS Pro offers additional styling options such as cell styling and frozen rows.
"Pro Edit" offers a special approach for inserting data into an existing file.
:::
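As an illustration of the column width adjustment mentioned above, widths can be computed on the Python side before they are handed to the worksheet. This is a rough sketch: `column_widths` and its `minimum` parameter are hypothetical, and the 10-character floor is an assumption. The output uses the `{"wch": n}` shape (width in characters) that SheetJS reads from the worksheet `!cols` property. Assigning the list to the worksheet through the JS engine is not shown.

```python
def column_widths(rows, minimum=10):
    """Return one {"wch": n} dict per column, sized to the longest cell text."""
    widths = {}
    for row in rows:
        for key, value in row.items():
            # consider both the header text and the cell text
            longest = max(len(str(key)), len(str(value)))
            widths[key] = max(widths.get(key, minimum), longest)
    # dicts preserve insertion order, so widths follow the column order
    return [{"wch": w} for w in widths.values()]

rows = [
    {"Name": "Bill Clinton", "Index": 42},
    {"Name": "GeorgeW Bush", "Index": 43},
]
cols = column_widths(rows)
print(cols)
```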
Generate Workbook
`XLSX.utils.book_new`13 creates a new workbook and `XLSX.utils.book_append_sheet`14 appends a worksheet to the workbook. The new worksheet will be called "Export":
:::note pass
The code in the string literal is reproduced below:
(ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
}
:::
book = ctxt.eval("""((ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
})""")(sheet, "Export")
Generate File
`XLSX.write`15 with the option `type: "base64"` generates the file data and returns it as a Base64 string:
b64 = ctxt.eval("(wb => XLSX.write(wb, {type:'base64', bookType:'xls'}))")(book)
With the Base64 string, standard Python operations can create a file:
from base64 import b64decode

raw = b64decode(b64)
with open("export.xls", mode="wb") as f:
    f.write(raw)
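The write step mirrors the read step and can be wrapped the same way. This is a minimal sketch. `write_base64_file` is a hypothetical name, and returning the byte count is an assumption added here as a quick sanity check.

```python
from base64 import b64decode

def write_base64_file(b64, path):
    """Decode a Base64 string (or bytes) and write the raw bytes to `path`."""
    # hypothetical helper; b64decode accepts both str and bytes input
    raw = b64decode(b64)
    # binary mode writes the workbook bytes without any translation
    with open(path, mode="wb") as f:
        f.write(raw)
    # the number of bytes written is returned as a simple check
    return len(raw)
```

For example, `write_base64_file(b64, "export.xls")` would write the XLS data generated in the previous step.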
Complete Demo
This example will extract data from an Apple Numbers spreadsheet and generate a DataFrame. The DataFrame will be exported to a legacy XLS spreadsheet.
Engine Setup
- Follow the official installation instructions16.
Instructions for macOS 12
- Install the `boost-python3` package using `brew`:
brew install boost-python3
- Identify the Python version:
python3 --version
:::note pass
When the demo was last tested, the version was 3.11.4
:::
curl -LO https://github.com/cloudflare/stpyv8/releases/download/v11.5.150.16/stpyv8-macos-12-python-3.11.zip
- Extract ZIP file and enter folder
unzip stpyv8-macos-12-python-3.11.zip
cd stpyv8-macos-12-3.11
- Move `icudtl.dat` to `/Library/Application Support/STPyV8/`:
sudo mkdir -p /Library/Application\ Support/STPyV8
sudo mv icudtl.dat /Library/Application\ Support/STPyV8/
- Install wheel:
sudo python3 -m pip install --upgrade *.whl
cd ..
Demo
- Follow the standalone script instructions to download the script:
curl -LO https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js
- Install Pandas. On macOS:
sudo python3 -m pip install pandas
- Download the following test scripts and files:
curl -LO https://sheetjs.com/pres.numbers
curl -LO https://docs.sheetjs.com/pandas/sheetjs.py
curl -LO https://docs.sheetjs.com/pandas/SheetJSPandas.py
- Run the script:
python3 SheetJSPandas.py pres.numbers
If successful, it will display data rows in the file:
Reading from sheet Sheet1
{'Name': 'Bill Clinton', 'Index': 42}
{'Name': 'GeorgeW Bush', 'Index': 43}
{'Name': 'Barack Obama', 'Index': 44}
{'Name': 'Donald Trump', 'Index': 45}
{'Name': 'Joseph Biden', 'Index': 46}
If Pandas is installed, the script will display DataFrame metadata:
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Index 5 non-null int64
dtypes: int64(1), object(1)
It will also export to `pres.xls`. The file can be read in a spreadsheet editor.
- The official documentation site is https://pandas.pydata.org/ and the official distribution point is https://pypi.org/project/pandas/
- See "Other Languages" for more examples.
- `STPyV8` is a fork of the original `PyV8` project. It is available under the permissive Apache 2.0 License. Special thanks to Flier Lu and CloudFlare!
- See `tests/test_Wrapper.py` in the `STPyV8` code repository.
- See "Workbook Object"
- See `pandas.DataFrame.from_records` in the Pandas documentation.
- See `pandas.DataFrame.to_json` in the Pandas documentation.
- See "Clean up Workbook" in "Export Tutorial".
- See "Installing" in the `STPyV8` project documentation