---
title: Spreadsheet Data in Pandas
sidebar_label: Python + Pandas
description: Process structured data in Python with Pandas. Seamlessly integrate spreadsheets into your workflow with SheetJS. Analyze complex Excel spreadsheets with confidence.
pagination_prev: demos/index
pagination_next: demos/frontend/index
---

import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

Pandas[^1] is a Python software library for data analysis.

[SheetJS](https://sheetjs.com) is a JavaScript library for reading and writing
data from spreadsheets.

This demo uses SheetJS to process data from a spreadsheet and translate it to
the Pandas DataFrame format. We'll explore how to load SheetJS from Python
scripts, generate DataFrames from workbooks, and write DataFrames back to
workbooks.

The ["Complete Example"](#complete-example) includes a wrapper library that
simplifies importing and exporting spreadsheets.

:::info pass

Pandas includes limited support for reading spreadsheets (`pandas.read_excel`)
and writing XLSX spreadsheets (`pandas.DataFrame.to_excel`).

**SheetJS supports common spreadsheet formats that Pandas cannot process.**

SheetJS operations also offer more flexibility in processing complex worksheets.

:::

:::note Tested Environments

This demo was tested in the following deployments:

| Architecture | JS Engine       | Pandas | Python | Date       |
|:-------------|:----------------|:-------|:-------|:-----------|
| `darwin-x64` | Duktape `2.7.0` | 2.0.3  | 3.11.7 | 2024-01-29 |
| `linux-x64`  | Duktape `2.7.0` | 1.5.3  | 3.11.3 | 2024-01-29 |

:::

## Integration Details

[`sheetjs.py`](pathname:///pandas/sheetjs.py) is a wrapper script that provides
helper methods for reading and writing spreadsheets. Installation notes are
included in the ["Complete Example"](#complete-example) section.

### JS in Python

JS code cannot be directly evaluated in Python implementations.

To run JS code from Python, JavaScript engines[^2] can be embedded in Python
modules or dynamically loaded using the `ctypes` foreign function library[^3].
This demo uses `ctypes` with the [Duktape engine](/docs/demos/engines/duktape).

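
As a rough illustration of the `ctypes` approach, the following sketch loads a
shared library and calls one of its functions. It uses the C standard library
instead of Duktape, since that library is present on most Unix-like systems.
The choice of `strlen` is an assumption made for illustration and is not taken
from `sheetjs.py`, which works with the Duktape shared library instead.

```python
import ctypes
import ctypes.util

# Locate and load the C standard library (assumes a Unix-like platform)
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Python bytes objects are passed as C strings
length = libc.strlen(b"SheetJS")
print(length)
```

Declaring `argtypes` and `restype` up front lets `ctypes` convert arguments and
return values correctly instead of assuming C `int` everywhere.
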
### Wrapper

The script exports a class named `SheetJSWrapper`. It is a context manager that
initializes the Duktape engine and executes SheetJS scripts on entry. All work
should be performed within the context:

```python title="Complete Example"
#!/usr/bin/env python3
from sheetjs import SheetJSWrapper

with SheetJSWrapper() as sheetjs:

    # Parse file
    wb = sheetjs.read_file("pres.numbers")
    print("Loaded file pres.numbers")

    # Get first worksheet name
    first_ws_name = wb.get_sheet_names()[0]
    print(f"Reading from sheet {first_ws_name}")

    # Generate DataFrame from first worksheet
    df = wb.get_df(first_ws_name)
    print(df.info())

    # Export DataFrame to XLSB
    sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")
```

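
The context manager pattern can be sketched with a small stand-in class. This
is a minimal illustration, not the `SheetJSWrapper` implementation: the class
name `EngineWrapper` and its `ready` and `log` attributes are hypothetical.

```python
class EngineWrapper:
    """Hypothetical stand-in for a wrapper that owns an engine handle."""

    def __init__(self):
        self.ready = False
        self.log = []

    def __enter__(self):
        # Setup runs on entry. A real wrapper would create the JS engine
        # and evaluate the library scripts at this point.
        self.ready = True
        self.log.append("init")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Cleanup runs on exit, even if the body raised an exception.
        self.ready = False
        self.log.append("cleanup")
        return False  # do not suppress exceptions


with EngineWrapper() as engine:
    inside = engine.ready  # the resource is usable inside the block

after = engine.ready  # the resource has been released after the block
```

Tying setup and cleanup to the `with` block is why work should happen inside
the context: once the block ends, the underlying resource is no longer valid.
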
### Reading Files

`sheetjs.read_file` accepts a path to a spreadsheet file. It will parse the file
and return an object representing the workbook.

The `get_sheet_names` method of the workbook returns a list of sheet names.

The `get_df` method of the workbook generates a DataFrame from a worksheet. The
worksheet is selected by passing its name.

For example, the following code reads `pres.numbers` and generates a DataFrame
from the second worksheet:

```python title="Generating a DataFrame from the second worksheet"
from sheetjs import SheetJSWrapper

with SheetJSWrapper() as sheetjs:
    # Parse file
    wb = sheetjs.read_file("pres.numbers")

    # Generate DataFrame from second worksheet
    ws_name = wb.get_sheet_names()[1]
    df = wb.get_df(ws_name)

    # Print metadata
    print(df.info())
```

Under the hood, `sheetjs.py` performs the following steps:

```mermaid
flowchart LR
  file[(workbook\nfile)]
  subgraph SheetJS operations
    bytes(Byte\nstring)
    wb((SheetJS\nWorkbook))
    csv(CSV\nstring)
  end
  subgraph Pandas operations
    stream(CSV\nStream)
    df[(Pandas\nDataFrame)]
  end
  file --> |`open`/`read`\nPython ops| bytes
  bytes --> |`XLSX.read`\nParse Bytes| wb
  wb --> |`sheet_to_csv`\nExtract Data| csv
  csv --> |`StringIO`\nPython ops| stream
  stream --> |`read_csv`\nParse CSV| df
```

1) Pure Python operations read the spreadsheet file and generate a byte string.

2) SheetJS libraries parse the byte string and generate a clean CSV.

- The `read` method[^4] parses file bytes into a SheetJS workbook object[^5]
- After selecting a worksheet, `sheet_to_csv`[^6] generates a CSV string

3) Python operations convert the CSV string to a stream object.[^7]

4) The Pandas `read_csv` method[^8] ingests the stream and generates a DataFrame.

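
Steps 3 and 4 can be sketched in isolation. The CSV text below is made-up
sample data standing in for the output of `sheet_to_csv`. The snippet assumes
Pandas is installed and is not copied from `sheetjs.py`.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text standing in for the string produced by sheet_to_csv
csv = "Name,Index\nBill Clinton,42\nGeorgeW Bush,43\n"

# Step 3: wrap the string in a file-like stream object
stream = StringIO(csv)

# Step 4: Pandas parses the stream; the first row becomes the column labels
df = pd.read_csv(stream)

print(df.shape)
```

Passing a stream lets `read_csv` treat the in-memory string like an open file,
so no temporary file is needed.
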
### Writing Files

`sheetjs.write_df` accepts a DataFrame and a path. It will attempt to export
the data to a spreadsheet file.

For example, the following code exports a DataFrame to `SheetJSPandas.xlsb`:

```python title="Exporting a DataFrame to XLSB"
with SheetJSWrapper() as sheetjs:
    # Export DataFrame to XLSB
    sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")
```

Under the hood, `sheetjs.py` performs the following steps:

```mermaid
flowchart LR
  subgraph Pandas operations
    df[(Pandas\nDataFrame)]
    json(JSON\nString)
  end
  subgraph SheetJS operations
    aoo(array of\nobjects)
    wb((SheetJS\nWorkbook))
    u8a(File\nbytes)
  end
  file[(workbook\nfile)]
  df --> |`to_json`\nPandas ops| json
  json --> |`JSON.parse`\nJS Engine| aoo
  aoo --> |`json_to_sheet`\nSheetJS Ops| wb
  wb --> |`XLSX.write`\nUint8Array| u8a
  u8a --> |`open`/`write`\nPython ops| file
```

1) The Pandas DataFrame `to_json` method[^9] generates a JSON string.

2) JS engine operations translate the JSON string to an array of objects.

3) SheetJS libraries process the data array and generate file bytes.

- The `json_to_sheet` method[^10] creates a SheetJS sheet object from the data.
- The `book_new` method[^11] creates a SheetJS workbook that includes the sheet.
- The `write` method[^12] generates the spreadsheet file bytes.

4) Pure Python operations write the bytes to file.

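
Step 1 can be sketched on its own. The example assumes the `records`
orientation, which emits one JSON object per row and so matches the array of
objects described in step 2. The orientation actually used by `sheetjs.py` is
not stated here, and the sample data is made up. Parsing with Python's `json`
module stands in for the `JSON.parse` call that would run in the JS engine.

```python
import json

import pandas as pd

# Made-up sample data
df = pd.DataFrame({"Name": ["Bill Clinton", "GeorgeW Bush"], "Index": [42, 43]})

# Assumption: the "records" orientation yields a JSON array of row objects
payload = df.to_json(orient="records")

# Parsing the string shows the array-of-objects shape
rows = json.loads(payload)

print(rows)
```
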
## Complete Example

This example will extract data from an Apple Numbers spreadsheet and generate a
DataFrame. The DataFrame will be exported to the binary XLSB spreadsheet format.

0) Install Pandas:

```bash
sudo python3 -m pip install pandas
```

:::caution pass

On Arch Linux-based platforms including the Steam Deck, the install may fail:

```
error: externally-managed-environment
```

In these situations, Pandas must be installed through the package manager:

```bash
sudo pacman -Syu python-pandas
```

:::

1) Build the Duktape shared library:

```bash
curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
cd duktape-2.7.0
make -f Makefile.sharedlibrary
cd ..
```

2) Copy the shared library to the current folder. When the demo was last tested,
the shared library file name differed by platform:

| OS     | name                      |
|:-------|:--------------------------|
| Darwin | `libduktape.207.20700.so` |
| Linux  | `libduktape.so.207.20700` |

```bash
cp duktape-*/libduktape.* .
```

3) Download the SheetJS Standalone script and the shim script, and move both to
the project directory:

<ul>
<li><a href={`https://cdn.sheetjs.com/xlsx-${current}/package/dist/shim.min.js`}>shim.min.js</a></li>
<li><a href={`https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js`}>xlsx.full.min.js</a></li>
</ul>

<CodeBlock language="bash">{`\
curl -LO https://cdn.sheetjs.com/xlsx-${current}/package/dist/shim.min.js
curl -LO https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js`}
</CodeBlock>

4) Download the following test scripts and files:

- [`pres.numbers` test file](https://sheetjs.com/pres.numbers)
- [`sheetjs.py` script](pathname:///pandas/sheetjs.py)
- [`SheetJSPandas.py` script](pathname:///pandas/SheetJSPandas.py)

```bash
curl -LO https://sheetjs.com/pres.numbers
curl -LO https://docs.sheetjs.com/pandas/sheetjs.py
curl -LO https://docs.sheetjs.com/pandas/SheetJSPandas.py
```

5) Edit the `sheetjs.py` script.

The `lib` variable declares the path to the library:

```python title="sheetjs.py (edit highlighted line)"
# highlight-next-line
lib = "libduktape.207.20700.so"
```

<Tabs groupId="triple">
  <TabItem value="darwin-x64" label="macOS">

The name of the library is `libduktape.207.20700.so`:

```python title="sheetjs.py (change highlighted line)"
# highlight-next-line
lib = "libduktape.207.20700.so"
```

  </TabItem>
  <TabItem value="linux-x64" label="Linux">

The name of the library is `libduktape.so.207.20700`:

```python title="sheetjs.py (change highlighted line)"
# highlight-next-line
lib = "libduktape.so.207.20700"
```

  </TabItem>
</Tabs>

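
As an alternative to editing the value by hand, the library name could be
chosen at runtime. The following is a sketch under stated assumptions: the
helper name `duktape_lib_name` is hypothetical and not part of `sheetjs.py`,
and the file names are the ones observed when the demo was last tested.

```python
import sys


def duktape_lib_name(platform=sys.platform):
    """Hypothetical helper: pick the shared library name for a platform."""
    if platform == "darwin":
        return "libduktape.207.20700.so"
    if platform.startswith("linux"):
        return "libduktape.so.207.20700"
    raise OSError(f"no known Duktape library name for {platform}")


# Explicit arguments are used here so the example runs on any platform
print(duktape_lib_name("darwin"))
print(duktape_lib_name("linux"))
```
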
6) Run the script:

```bash
python3 SheetJSPandas.py pres.numbers
```

If successful, the script will display DataFrame metadata:

```
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Index   5 non-null      int64
dtypes: int64(1), object(1)
```

It will also export the DataFrame to `SheetJSPandas.xlsb`. The file can be
inspected with a spreadsheet editor that supports XLSB files.

[^1]: The official documentation site is <https://pandas.pydata.org/> and the official distribution point is <https://pypi.org/project/pandas/>
[^2]: See ["Other Languages"](/docs/demos/engines/) for more examples.
[^3]: See [`ctypes`](https://docs.python.org/3/library/ctypes.html) in the Python documentation.
[^4]: See [`read` in "Reading Files"](/docs/api/parse-options)
[^5]: See ["Workbook Object"](/docs/csf/book)
[^6]: See [`sheet_to_csv` in "Utilities"](/docs/api/utilities/csv#delimiter-separated-output)
[^7]: See [the examples in "IO tools"](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) in the Pandas documentation.
[^8]: See [`pandas.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) in the Pandas documentation.
[^9]: See [`pandas.DataFrame.to_json`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html) in the Pandas documentation.
[^10]: See [`json_to_sheet` in "Utilities"](/docs/api/utilities/array#array-of-objects-input)
[^11]: See [`book_new` in "Utilities"](/docs/api/utilities/wb)
[^12]: See [`write` in "Writing Files"](/docs/api/write-options)