docs.sheetjs.com/14-pandas.md at d6abde0e8e2e7ecf3cb92123ed70b40850887a1d

sheetjs/docs.sheetjs.com

2023-09-22 02:44:32 -04:00


title	sidebar_label	description	pagination_prev	pagination_next
Spreadsheet Data in Pandas	Python (Pandas)	Process structured data in Python with Pandas. Seamlessly integrate spreadsheets into your workflow with SheetJS. Analyze complex Excel spreadsheets with confidence.	demos/cloud/index	demos/bigdata/index

import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

Pandas¹ is a Python software library for data analysis.

SheetJS is a JavaScript library for reading and writing data from spreadsheets.

This demo uses SheetJS to process data from a spreadsheet and translate it to a Pandas DataFrame. We'll explore how to load SheetJS from Python scripts, generate DataFrames from workbooks, and write DataFrames back to workbooks.

:::note

This demo was tested in the following deployments:

| Architecture | V8 version    | Pandas | Python | Date       |
|:-------------|:--------------|:-------|:-------|:-----------|
| `darwin-x64` | `11.5.150.16` | 2.0.3  | 3.11.4 | 2023-07-29 |

:::

:::info pass

Pandas includes limited support for reading spreadsheets (pandas.read_excel) and writing XLSX spreadsheets (pandas.DataFrame.to_excel).

The SheetJS approach supports many common spreadsheet formats that are not supported by the current set of Pandas codecs and offers greater flexibility in processing complex worksheets.

:::

Integration Details

JS code cannot be run directly in the Python interpreter. To run JS code from Python, JavaScript engines² can be embedded in CPython modules.

Loading SheetJS

This demo uses the STPyV8 module³ to access the V8 JavaScript engine.

Initialize V8

STPyV8 provides the JSContext context manager, which sets up and tears down a JS context. Within the context, the eval method evaluates code:

from STPyV8 import JSContext

# Initialize JS context
with JSContext() as ctxt:
  # Run code
  res = ctxt.eval("'Sheet' + 'JS'")

  # print result
  print(res)

STPyV8 handles data interchange for common types. JS arrays and objects can be translated to Python list and dict objects respectively. The following convert function is used in the test suite⁴:

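# from `tests/test_Wrapper.py` in the STPyV8 library
# License: Apache 2.0
from STPyV8 import JSArray, JSObject

def convert(obj):
  # JS arrays become Python lists (elements converted recursively)
  if isinstance(obj, JSArray):
    return [convert(v) for v in obj]
  # JS objects become Python dicts keyed by property name
  if isinstance(obj, JSObject):
    return dict([[str(k), convert(obj.__getattr__(str(k)))] for k in obj.__dir__()])
  # primitives (strings, numbers, booleans) pass through unchanged
  return obj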

Loading the Library

The SheetJS Standalone scripts can be parsed and evaluated from the JS engine. Once evaluated, the XLSX variable is available as a global.

Assuming the standalone library is in the same directory as the source file, the script can be evaluated with eval:

  # Within a JSContext, open `xlsx.full.min.js` and evaluate
  with open("xlsx.full.min.js") as f:
    ctxt.eval(f.read())
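As a minimal sketch, the loading step can be wrapped in a small helper that also confirms the global exists. `load_sheetjs` is a hypothetical name, `ctxt` is assumed to be an active `JSContext`, and the check relies on the `XLSX.version` string exposed by the library:

```python
def load_sheetjs(ctxt, script_path="xlsx.full.min.js"):
    """Evaluate the SheetJS standalone script in `ctxt` and return its version.

    `ctxt` is assumed to be an active STPyV8 JSContext (inside a `with` block).
    """
    # Read the standalone script and evaluate it in the JS engine
    with open(script_path) as f:
        ctxt.eval(f.read())

    # After evaluation the `XLSX` global should exist; read its version string
    return ctxt.eval("XLSX.version")
```

If the script was not evaluated correctly, the second `eval` call is expected to raise an error because `XLSX` is not defined.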

Reading Files

The following diagram depicts the spreadsheet salsa:

flowchart LR
  file[(workbook\nfile)]
  subgraph SheetJS operations
    base64(Base64\nstring)
    wb((SheetJS\nWorkbook))
    aoo(array of\nobjects)
  end
  subgraph Pandas operations
    lod(list of\nrecords)
    df[(Pandas\nDataFrame)]
  end
  file --> |`open`/`read`\nPython ops| base64
  base64 --> |`XLSX.read`\nParse Bytes| wb
  wb --> |`sheet_to_json`\nExtract Data| aoo
  aoo --> |`convert`\nPython ops|lod
  lod --> |`from_records`\nPandas ops| df

At a high level:

1. Pure Python operations read the file and generate a Base64 string
2. SheetJS libraries parse the string and generate JS records
3. JS engine operations translate the rows to a Python list of dicts
4. Pandas operations translate the Python data to a DataFrame

Read files

The safest format for data interchange is Base64-encoded strings:

from base64 import b64encode

with open(path, mode="rb") as f:
  file_bytes = f.read()
  b64 = b64encode(file_bytes)
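The same step can be sketched as a reusable function. `file_to_base64` is a hypothetical name. `b64encode` returns a `bytes` object; the final `decode` call is an optional choice made here so the helper hands back a plain `str`:

```python
from base64 import b64encode

def file_to_base64(path):
    """Read `path` as raw bytes and return the contents as a Base64 `str`."""
    # Binary mode avoids newline translation and text decoding
    with open(path, mode="rb") as f:
        file_bytes = f.read()

    # b64encode returns `bytes`; Base64 output is pure ASCII, so this is lossless
    return b64encode(file_bytes).decode("ascii")
```

Because Base64 text is limited to ASCII characters, it can cross the Python / JS boundary without any encoding ambiguity.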

Parse bytes

From JS code, XLSX.read⁵ parses the Base64 string:

wb = ctxt.eval("(b64 => XLSX.read(b64, {type: 'base64', dense: true}))")(b64)

The wb object follows the "Common Spreadsheet Format"⁶, an in-memory format for representing workbooks, worksheets, cells, and spreadsheet features.

Get First Worksheet

As explained in the "Workbook Object"⁷ section:

- the SheetNames property is an ordered list of the sheet names in the workbook
- the Sheets property is an object whose keys are sheet names and whose values are sheet objects

For use in Python, the SheetNames array must be converted to a list:

sheet_names = convert(wb.SheetNames)
first_sheet_name = sheet_names[0]

Since utility functions will process the worksheet object from JavaScript, it is preferable not to convert the object:

first_sheet = wb.Sheets[first_sheet_name] # do not convert

Generate List of Records

In JavaScript, the equivalent of the "list of dicts" or "list of records" is "array of objects". They can be created with XLSX.utils.sheet_to_json⁸:

rows = convert(ctxt.eval("(ws => XLSX.utils.sheet_to_json(ws))")(first_sheet))
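The JS-side steps so far can be combined into one function. This is a sketch under stated assumptions: `first_sheet_rows` is a hypothetical name, `ctxt` is an active `JSContext` in which the SheetJS script has already been evaluated, and `convert` is the helper from the STPyV8 test suite shown earlier:

```python
def first_sheet_rows(ctxt, b64, convert):
    """Parse Base64 workbook data and return the first sheet as a list of dicts.

    `ctxt`: active JSContext with the `XLSX` global available (assumption)
    `b64`: Base64-encoded file contents
    `convert`: function translating JS arrays / objects to list / dict
    """
    # Build the JS helper functions once; each eval returns a callable
    parse = ctxt.eval("(b64 => XLSX.read(b64, {type: 'base64', dense: true}))")
    to_rows = ctxt.eval("(ws => XLSX.utils.sheet_to_json(ws))")

    wb = parse(b64)

    # SheetNames is converted to a Python list to pick the first name
    first_sheet_name = convert(wb.SheetNames)[0]

    # The worksheet stays a JS object since it is passed back to JS code
    first_sheet = wb.Sheets[first_sheet_name]

    # Only the final array of objects is converted to Python data
    return convert(to_rows(first_sheet))
```

Passing `ctxt` and `convert` as arguments keeps the function independent of how the context was created.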

Generate Pandas DataFrame

rows is a list of dict objects. from_records⁹ understands this data shape and generates a proper DataFrame:

import pandas as pd
df = pd.DataFrame.from_records(rows)
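To illustrate the data shape, here is a small sketch with made-up rows standing in for converted `sheet_to_json` output. The sample values are hypothetical. The second record omits the `Index` key, which is how a blank cell typically appears when `sheet_to_json` is called without the `defval` option:

```python
import pandas as pd

# Hypothetical rows, shaped like converted `sheet_to_json` output.
# The second record has no "Index" key, as if that cell were blank.
rows = [
    {"Name": "Bill Clinton", "Index": 42},
    {"Name": "SheetJS"},
]

df = pd.DataFrame.from_records(rows)

# Columns are the union of the record keys, in order of first appearance
print(df.columns.tolist())

# The missing value is filled with NaN, so the check is False then True
print(df["Index"].isna().tolist())
```

Since missing keys turn into NaN, columns that look like integers in the spreadsheet may be loaded as floating point values.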

Writing Files

The writing process looks similar to the reading process in reverse:

flowchart LR
  subgraph Pandas operations
    df[(Pandas\nDataFrame)]
    json(JSON\nString)
  end
  subgraph SheetJS operations
    aoo(array of\nobjects)
    wb((SheetJS\nWorkbook))
    base64(Base64\nstring)
  end
  file[(workbook\nfile)]
  df --> |`to_json`\nPandas ops| json
  json --> |`JSON.parse`\nJS Engine| aoo
  aoo --> |`json_to_sheet`\nSheetJS Ops| wb
  wb --> |`XLSX.write`\nBase64| base64
  base64 --> |`open`/`write`\nPython ops| file

At a high level:

1. Pandas operations translate the Python data to a JSON string
2. JS engine operations translate the JSON string to an array of objects
3. SheetJS libraries process the array and generate a Base64-encoded workbook
4. Pure Python operations decode the Base64 string and write the bytes to a file

Generate JSON

DataFrame#to_json¹⁰ with the option orient="records" generates a JSON string that encodes an array of objects:

json = df.to_json(orient="records")
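The shape of the generated string can be inspected from Python. In this sketch the sample DataFrame is hypothetical, and the standard library `json.loads` stands in for the `JSON.parse` call that runs later in the JS engine:

```python
import json
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Bill Clinton", "George W. Bush"],
    "Index": [42, 43],
})

# One JSON object per row, keyed by column name
records_json = df.to_json(orient="records")

# Parsing the string mirrors what JSON.parse will see in the JS engine
records = json.loads(records_json)
print(records)
```

The variable is named `records_json` here so that it does not shadow the standard library `json` module.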

Generate Worksheet

In JavaScript, JSON.parse will interpret the string as an array of objects. XLSX.utils.json_to_sheet¹¹ generates a SheetJS worksheet object:

sheet = ctxt.eval("(json => XLSX.utils.json_to_sheet(JSON.parse(json)) )")(json)

Export Enhancements

At this point, there are many options for improving the appearance of the sheet. For example, the "Export Tutorial"¹² shows how to adjust column widths.

:::tip pass

SheetJS Pro offers additional styling options such as cell styling and frozen rows.

"Pro Edit" offers a special approach for inserting data into an existing file.

:::

Generate Workbook

XLSX.utils.book_new¹³ creates a new workbook and XLSX.utils.book_append_sheet¹⁴ appends a worksheet to the workbook. The new worksheet will be called "Export":

:::note pass

The code in the string literal is reproduced below:

(ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
}

:::

book = ctxt.eval("""((ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
})""")(sheet, "Export")

Generate File

XLSX.write¹⁵ with the option type: "base64" generates the file data in memory and returns it as a Base64 string. The bookType option selects the output format, in this case legacy XLS:

b64 = ctxt.eval("(wb => XLSX.write(wb, {type:'base64', bookType:'xls'}))")(book)

With the Base64 string, standard Python operations can create a file:

from base64 import b64decode

raw = b64decode(b64)
with open("export.xls", mode="wb") as f:
  f.write(raw)
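The writing steps can be chained in a single function. This is a sketch under stated assumptions: `export_records_json` is a hypothetical name, `ctxt` is an active `JSContext` with the SheetJS script already evaluated, and `records_json` is a string produced by `to_json` with `orient="records"`:

```python
from base64 import b64decode

def export_records_json(ctxt, records_json, path, sheet_name="Export", book_type="xls"):
    """Write a JSON array-of-objects string to a workbook file.

    Returns the number of bytes written. `ctxt` is assumed to be an active
    JSContext with the `XLSX` global available.
    """
    # JSON string -> SheetJS worksheet
    make_sheet = ctxt.eval("(json => XLSX.utils.json_to_sheet(JSON.parse(json)))")

    # worksheet -> new workbook with a single named sheet
    make_book = ctxt.eval("""((ws, name) => {
      const wb = XLSX.utils.book_new();
      XLSX.utils.book_append_sheet(wb, ws, name);
      return wb;
    })""")

    # workbook -> Base64 string in the requested file format
    write_b64 = ctxt.eval(
        "((wb, bookType) => XLSX.write(wb, {type: 'base64', bookType: bookType}))"
    )

    book = make_book(make_sheet(records_json), sheet_name)
    b64 = write_b64(book, book_type)

    # Base64 string -> raw bytes -> file on disk
    raw = b64decode(b64)
    with open(path, mode="wb") as f:
        f.write(raw)
    return len(raw)
```

A call such as `export_records_json(ctxt, df.to_json(orient="records"), "export.xls")` would reproduce the steps in this section. The file extension in `path` should match `book_type`.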

Complete Demo

This example will extract data from an Apple Numbers spreadsheet and generate a DataFrame. The DataFrame will be exported to a legacy XLS spreadsheet.

Engine Setup

Follow the official installation instructions¹⁶.

Instructions for macOS 12

Install boost-python3 package using brew:

brew install boost-python3

Identify python version:

python3 --version