docs.sheetjs.com/docz/docs/03-demos/42-engines/14-pandas.md
2023-09-22 02:44:32 -04:00

title: Spreadsheet Data in Pandas
sidebar_label: Python (Pandas)
description: Process structured data in Python with Pandas. Seamlessly integrate spreadsheets into your workflow with SheetJS. Analyze complex Excel spreadsheets with confidence.
pagination_prev: demos/cloud/index
pagination_next: demos/bigdata/index

import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

Pandas[^1] is a Python software library for data analysis.

SheetJS is a JavaScript library for reading and writing data from spreadsheets.

This demo uses SheetJS to process data from a spreadsheet and translate it to the Pandas DataFrame format. We'll explore how to load SheetJS from Python scripts, generate DataFrames from workbooks, and write DataFrames back to workbooks.

:::note

This demo was tested in the following deployments:

| Architecture | V8 version  | Pandas | Python | Date       |
|:-------------|:------------|:-------|:-------|:-----------|
| darwin-x64   | 11.5.150.16 | 2.0.3  | 3.11.4 | 2023-07-29 |

:::

:::info pass

Pandas includes limited support for reading spreadsheets (pandas.read_excel) and writing XLSX spreadsheets (pandas.DataFrame.to_excel).

The SheetJS approach supports many common spreadsheet formats that are not supported by the current set of Pandas codecs and offers greater flexibility in processing complex worksheets.

:::

Integration Details

JS code cannot literally be run in the Python interpreter. To run JS code from Python, JavaScript engines[^2] can be embedded in CPython modules.

Loading SheetJS

This demo uses the STPyV8 module[^3] to access the V8 JavaScript engine.

Initialize V8

The engine library provides a convenient context manager JSContext for context resource management. Within the context, the eval method can evaluate code:

from STPyV8 import JSContext

# Initialize JS context
with JSContext() as ctxt:
  # Run code
  res = ctxt.eval("'Sheet' + 'JS'")

  # print result
  print(res)

STPyV8 handles data interchange for common types. Arrays and JS objects can be translated to Python list and dict objects respectively. The following convert function is used in the test suite[^4]:

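# from `tests/test_Wrapper.py` in the STPyV8 library
# License: Apache 2.0
from STPyV8 import JSArray, JSObject

def convert(obj):
  # JS arrays become Python lists (elements converted recursively)
  if isinstance(obj, JSArray):
    return [convert(v) for v in obj]
  # JS objects become Python dicts keyed by property name
  if isinstance(obj, JSObject):
    return dict([[str(k), convert(obj.__getattr__(str(k)))] for k in obj.__dir__()])
  # primitives pass through unchanged
  return obj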
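
To see the recursion in `convert` without a JS engine, the sketch below mirrors it with two hypothetical stand-in classes. `FakeJSArray` and `FakeJSObject` are not part of STPyV8; they only imitate an array wrapper and an object wrapper whose properties are exposed as attributes. This illustrates the resulting data shape and is not the STPyV8 implementation.

```python
class FakeJSArray(list):
    """Hypothetical stand-in for a JS array wrapper (not an STPyV8 class)."""


class FakeJSObject:
    """Hypothetical stand-in for a JS object wrapper (not an STPyV8 class)."""

    def __init__(self, **props):
        # Store each JS-style property as a plain Python attribute
        self.__dict__.update(props)


def convert_sketch(obj):
    # Arrays become lists, converting each element recursively
    if isinstance(obj, FakeJSArray):
        return [convert_sketch(v) for v in obj]
    # Objects become dicts keyed by property name
    if isinstance(obj, FakeJSObject):
        return {str(k): convert_sketch(v) for k, v in vars(obj).items()}
    # Primitives (numbers, strings, booleans) pass through unchanged
    return obj


js_rows = FakeJSArray([
    FakeJSObject(Name="Bill Clinton", Index=42),
    FakeJSObject(Name="GeorgeW Bush", Index=43),
])
py_rows = convert_sketch(js_rows)
print(py_rows)
```

The result is an ordinary list of dicts, the same shape that the later steps hand to Pandas.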

Loading the Library

The SheetJS Standalone scripts can be parsed and evaluated from the JS engine. Once evaluated, the XLSX variable is available as a global.

Assuming the standalone library is in the same directory as the source file, the script can be evaluated with eval:

  # Within a JSContext, open `xlsx.full.min.js` and evaluate
  with open("xlsx.full.min.js") as f:
    ctxt.eval(f.read())
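
Note that `open("xlsx.full.min.js")` resolves the name against the current working directory. A minimal sketch, assuming the script should be looked up in an explicit directory instead, is shown below. The helper name `read_standalone_script` and its parameters are hypothetical, and decoding the script as UTF-8 is an assumption.

```python
from pathlib import Path


def read_standalone_script(script_dir, name="xlsx.full.min.js"):
    """Return the text of the standalone script stored in script_dir."""
    path = Path(script_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"SheetJS standalone script not found: {path}")
    # Assumption: the script is decoded as UTF-8 text
    return path.read_text(encoding="utf-8")
```

Within a `JSContext`, the returned text could be evaluated with `ctxt.eval(read_standalone_script(Path(__file__).parent))`.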

Reading Files

The following diagram depicts the spreadsheet salsa:

flowchart LR
  file[(workbook\nfile)]
  subgraph SheetJS operations
    base64(Base64\nstring)
    wb((SheetJS\nWorkbook))
    aoo(array of\nobjects)
  end
  subgraph Pandas operations
    lod(list of\nrecords)
    df[(Pandas\nDataFrame)]
  end
  file --> |`open`/`read`\nPython ops| base64
  base64 --> |`XLSX.read`\nParse Bytes| wb
  wb --> |`sheet_to_json`\nExtract Data| aoo
  aoo --> |`convert`\nPython ops|lod
  lod --> |`from_records`\nPandas ops| df

At a high level:

  1. Pure Python operations read the file and generate a Base64 string

  2. SheetJS libraries parse the string and generate JS records

  3. JS engine operations translate the rows to Python list of dicts

  4. Pandas operations translate the Python data to a DataFrame

Read files

The safest format for data interchange is Base64-encoded strings:

from base64 import b64encode

with open(path, mode="rb") as f:
  file_bytes = f.read()
  b64 = b64encode(file_bytes)
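
`b64encode` returns a `bytes` object rather than a `str`. If a text string is preferred for interchange, the result can be decoded as ASCII. The helper below is a minimal sketch (the name `to_base64_text` is hypothetical) that also checks the round trip with `b64decode`:

```python
from base64 import b64decode, b64encode


def to_base64_text(data: bytes) -> str:
    """Encode raw bytes as a Base64 text string."""
    # b64encode returns bytes; the Base64 alphabet is pure ASCII
    return b64encode(data).decode("ascii")


sample = b"SheetJS"  # stand-in for the bytes of a real workbook file
encoded = to_base64_text(sample)
print(encoded)

# Decoding recovers the original bytes exactly
assert b64decode(encoded) == sample
```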

Parse bytes

From JS code, XLSX.read[^5] parses the Base64 string:

wb = ctxt.eval("(b64 => XLSX.read(b64, {type: 'base64', dense: true}))")(b64)

The wb object follows the "Common Spreadsheet Format"[^6], an in-memory format for representing workbooks, worksheets, cells, and spreadsheet features.

Get First Worksheet

As explained in the "Workbook Object"[^7] section:

  • the SheetNames property is an ordered list of the sheet names in the workbook
  • the Sheets property of the workbook object is an object whose keys are sheet names and whose values are sheet objects.

For use in Python, the SheetNames array must be converted to a list:

sheet_names = convert(wb.SheetNames)
first_sheet_name = sheet_names[0]

Since utility functions will process the worksheet object from JavaScript, it is preferable not to convert the object:

first_sheet = wb.Sheets[first_sheet_name] # do not convert

Generate List of Records

In JavaScript, the equivalent of the "list of dicts" or "list of records" is "array of objects". They can be created with XLSX.utils.sheet_to_json[^8]:

rows = convert(ctxt.eval("(ws => XLSX.utils.sheet_to_json(ws))")(first_sheet))

Generate Pandas DataFrame

rows is a list of dict objects. from_records[^9] understands this data shape and generates a proper DataFrame:

import pandas as pd

df = pd.DataFrame.from_records(rows)
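
As a quick check of the data shape, the sketch below builds a DataFrame from a hand-written list of records. The sample rows are copied from the expected output of the complete demo at the end of this page; no spreadsheet is parsed here.

```python
import pandas as pd

# Sample records mirroring what `convert` returns for the demo worksheet
rows = [
    {"Name": "Bill Clinton", "Index": 42},
    {"Name": "GeorgeW Bush", "Index": 43},
    {"Name": "Barack Obama", "Index": 44},
]

# Each dict becomes a row; the dict keys become the column labels
df = pd.DataFrame.from_records(rows)
print(df)
```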

Writing Files

The writing process looks similar to the reading process in reverse:

flowchart LR
  subgraph Pandas operations
    df[(Pandas\nDataFrame)]
    json(JSON\nString)
  end
  subgraph SheetJS operations
    aoo(array of\nobjects)
    wb((SheetJS\nWorkbook))
    base64(Base64\nstring)
  end
  file[(workbook\nfile)]
  df --> |`to_json`\nPandas ops| json
  json --> |`JSON.parse`\nJS Engine| aoo
  aoo --> |`json_to_sheet`\nSheetJS Ops| wb
  wb --> |`XLSX.write`\nBase64| base64
  base64 --> |`open`/`write`\nPython ops| file

At a high level:

  1. Pandas operations translate the Python data to a JSON string

  2. JS engine operations translate the JSON string to an array of objects

  3. SheetJS libraries parse the array and generate a Base64-encoded workbook

  4. Pure Python operations decode the Base64 string and write the bytes to file.

Generate JSON

DataFrame#to_json[^10] with the option orient="records" generates a JSON string that encodes an array of objects:

json = df.to_json(orient="records")
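
To see what `orient="records"` produces, the sketch below serializes a small hand-written DataFrame and parses the string back with the standard `json` module. The variable is named `json_str` here (an illustrative choice) so that the `json` module name is not shadowed.

```python
import json

import pandas as pd

df = pd.DataFrame.from_records([
    {"Name": "Bill Clinton", "Index": 42},
    {"Name": "GeorgeW Bush", "Index": 43},
])

# One JSON object per row, wrapped in a JSON array
json_str = df.to_json(orient="records")
print(json_str)

# Parsing the string recovers a list of dicts, one per row
records = json.loads(json_str)
```

This is the same structure that `JSON.parse` yields on the JavaScript side: an array with one object per DataFrame row.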

Generate Worksheet

In JavaScript, JSON.parse will interpret the string as an array of objects. XLSX.utils.json_to_sheet[^11] generates a SheetJS worksheet object:

sheet = ctxt.eval("(json => XLSX.utils.json_to_sheet(JSON.parse(json)) )")(json)

Export Enhancements

At this point, there are many options for improving the appearance of the sheet. For example, the "Export Tutorial"[^12] shows how to adjust column widths.

:::tip pass

SheetJS Pro offers additional styling options such as cell styling and frozen rows.

"Pro Edit" offers a special approach for inserting data into an existing file.

:::

Generate Workbook

XLSX.utils.book_new[^13] creates a new workbook and XLSX.utils.book_append_sheet[^14] appends a worksheet to the workbook. The new worksheet will be called "Export":

:::note pass

The code in the string literal is reproduced below:

(ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
}

:::

book = ctxt.eval("""((ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
})""")(sheet, "Export")

Generate File

XLSX.write[^15] with the option type: "base64" attempts to create a file and return its contents as a Base64 string. The bookType option selects the output format (xls in this example):

b64 = ctxt.eval("(wb => XLSX.write(wb, {type:'base64', bookType:'xls'}))")(book)

With the Base64 string, standard Python operations can create a file:

from base64 import b64decode

raw = b64decode(b64)
with open("export.xls", mode="wb") as f:
  f.write(raw)
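
Before writing, the decoded bytes can be sanity-checked. The sketch below assumes the `xls` output is stored in a Compound File Binary (CFB) container, whose files begin with a fixed 8-byte signature. The helper name `looks_like_cfb` is hypothetical, and the check does not validate the rest of the file.

```python
from base64 import b64decode, b64encode

# Signature at the start of every CFB (OLE2) container file
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_cfb(raw: bytes) -> bool:
    """Return True if the bytes start with the CFB signature."""
    return raw.startswith(CFB_SIGNATURE)


# Stand-in payloads; a real Base64 string would come from XLSX.write
fake_xls = b64encode(CFB_SIGNATURE + b"\x00" * 8)
fake_txt = b64encode(b"Name,Index\n")

print(looks_like_cfb(b64decode(fake_xls)))
print(looks_like_cfb(b64decode(fake_txt)))
```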

Complete Demo

This example will extract data from an Apple Numbers spreadsheet and generate a DataFrame. The DataFrame will be exported to a legacy XLS spreadsheet.

Engine Setup

  1. Follow the official installation instructions[^16].
Instructions for macOS 12
  • Install boost-python3 package using brew:
brew install boost-python3
  • Identify python version:
python3 --version

:::note pass

When the demo was last tested, the version was 3.11.4

:::

  • Download the STPyV8 release archive that matches the Python version:
curl -LO https://github.com/cloudflare/stpyv8/releases/download/v11.5.150.16/stpyv8-macos-12-python-3.11.zip
  • Extract ZIP file and enter folder
unzip stpyv8-macos-12-python-3.11.zip
cd stpyv8-macos-12-3.11
  • Move icudtl.dat to /Library/Application Support/STPyV8/:
sudo mkdir -p /Library/Application\ Support/STPyV8
sudo mv icudtl.dat /Library/Application\ Support/STPyV8/
  • Install wheel:
sudo python3 -m pip install --upgrade *.whl
cd ..

Demo

  1. Download the SheetJS Standalone script and move to the project directory:
  • xlsx.full.min.js

curl -LO https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js

  2. Install Pandas. On macOS:
sudo python3 -m pip install pandas
  3. Download the following test scripts and files:
curl -LO https://sheetjs.com/pres.numbers
curl -LO https://docs.sheetjs.com/pandas/sheetjs.py
curl -LO https://docs.sheetjs.com/pandas/SheetJSPandas.py
  4. Run the script:
python3 SheetJSPandas.py pres.numbers

If successful, it will display data rows in the file:

Reading from sheet Sheet1
{'Name': 'Bill Clinton', 'Index': 42}
{'Name': 'GeorgeW Bush', 'Index': 43}
{'Name': 'Barack Obama', 'Index': 44}
{'Name': 'Donald Trump', 'Index': 45}
{'Name': 'Joseph Biden', 'Index': 46}

If Pandas is installed, the script will display DataFrame metadata:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Index   5 non-null      int64
dtypes: int64(1), object(1)

It will also export to pres.xls. The file can be read in a spreadsheet editor.


[^1]: The official documentation site is https://pandas.pydata.org/ and the official distribution point is https://pypi.org/project/pandas/

[^2]: See "Other Languages" for more examples.

[^3]: STPyV8 is a fork of the original PyV8 project. It is available under the permissive Apache 2.0 License. Special thanks to Flier Lu and Cloudflare!

[^4]: See tests/test_Wrapper.py in the STPyV8 code repository.

[^5]: See read in "Reading Files"

[^6]: See "SheetJS Data Model"

[^7]: See "Workbook Object"

[^8]: See sheet_to_json in "Utilities"

[^9]: See pandas.DataFrame.from_records in the Pandas documentation.

[^10]: See pandas.DataFrame.to_json in the Pandas documentation.

[^11]: See json_to_sheet in "Utilities"

[^12]: See "Clean up Workbook" in "Export Tutorial".

[^13]: See book_new in "Utilities"

[^14]: See book_append_sheet in "Utilities"

[^15]: See write in "Writing Files"

[^16]: See "Installing" in the STPyV8 project documentation