Pandas[^1] is a Python software library for data analysis.

[SheetJS](https://sheetjs.com) is a JavaScript library for reading and writing
data from spreadsheets.

This demo uses SheetJS to process data from a spreadsheet and translate it to
the Pandas DataFrame format. We'll explore how to load SheetJS from Python
scripts, generate DataFrames from workbooks, and write DataFrames back to
workbooks.

:::note

This demo was tested in the following deployments:

| Architecture | V8 version    | Pandas | Python | Date       |
|:-------------|:--------------|:-------|:-------|:-----------|
| `darwin-x64` | ``            | 2.0.3  | 3.11.4 | 2023-07-29 |

:::

:::info pass

Pandas includes limited support for reading spreadsheets (`pandas.read_excel`)
and writing XLSX spreadsheets (`pandas.DataFrame.to_excel`).

The SheetJS approach supports many common spreadsheet formats that are not
supported by the current set of Pandas codecs and offers greater flexibility in
processing complex worksheets.

:::

## Integration Details

JS code cannot be run directly in the Python interpreter. To run JS code from
Python, JavaScript engines[^2] can be embedded in CPython modules.

### Loading SheetJS

This demo uses the `STPyV8` module[^3] to access the V8 JavaScript engine.

_Initialize V8_

The engine library provides a convenient context manager `JSContext` for context
resource management. Within the context, the `eval` method can evaluate code:

```py
from STPyV8 import JSContext

# Initialize JS context
with JSContext() as ctxt:
    # Run code
    res = ctxt.eval("'Sheet' + 'JS'")

# print result
print(res)
```

`STPyV8` handles data interchange for common types. Arrays and JS objects can be
translated to Python `list` and `dict` respectively.
The following `convert` function is used in the test suite[^4]:

```py
from STPyV8 import JSArray, JSObject

# from `tests/test_Wrapper.py` in the STPyV8 library
# License: Apache 2.0
def convert(obj):
    # JS arrays become Python lists (converted recursively)
    if isinstance(obj, JSArray):
        return [convert(v) for v in obj]

    # JS objects become Python dicts keyed by property name
    if isinstance(obj, JSObject):
        return dict([[str(k), convert(obj.__getattr__(str(k)))] for k in obj.__dir__()])

    # primitives are returned unchanged
    return obj
```

_Loading the Library_

The [Standalone scripts](/docs/getting-started/installation/standalone) can be
parsed and evaluated from the JS engine. Once evaluated, the `XLSX` variable is
available as a global.

Assuming the standalone library is in the same directory as the source file,
the script can be evaluated with `eval`:

```py
# Within a JSContext, open `xlsx.full.min.js` and evaluate
with open("xlsx.full.min.js") as f:
    ctxt.eval(f.read())
```

### Reading Files

The following diagram depicts the spreadsheet salsa:

```mermaid
flowchart LR
  file[(workbook\nfile)]
  subgraph SheetJS operations
    base64(Base64\nstring)
    wb((SheetJS\nWorkbook))
    aoo(array of\nobjects)
  end
  subgraph Pandas operations
    lod(list of\nrecords)
    df[(Pandas\nDataFrame)]
  end
  file --> |`open`/`read`\nPython ops| base64
  base64 --> |`XLSX.read`\nParse Bytes| wb
  wb --> |`sheet_to_json`\nExtract Data| aoo
  aoo --> |`convert`\nPython ops|lod
  lod --> |`from_records`\nPandas ops| df
```

At a high level:

1) Pure Python operations read the file and generate a Base64 string

2) SheetJS libraries parse the string and generate JS records

3) JS engine operations translate the rows to a Python `list` of `dict`s

4) Pandas operations translate the Python data to a DataFrame

#### Read files

The safest format for data interchange is Base64-encoded strings:

```py
from base64 import b64encode

# `path` is the location of the workbook file
with open(path, mode="rb") as f:
    file_bytes = f.read()

# encode the raw bytes as Base64
b64 = b64encode(file_bytes)
```

#### Parse bytes

From JS code, `XLSX.read`[^5] parses the Base64 string:

```py
wb = ctxt.eval("(b64 => XLSX.read(b64, {type: 'base64', dense: true}))")(b64)
```

The `wb` object follows the "Common Spreadsheet
Format"[^6], an in-memory format for representing workbooks, worksheets, cells,
and spreadsheet features.

#### Get First Worksheet

As explained in the "Workbook Object"[^7] section:

- the `SheetNames` property is an ordered list of the sheet names in the workbook

- the `Sheets` property of the workbook object is an object whose keys are sheet
  names and whose values are sheet objects.

For use in Python, the `SheetNames` array must be converted to a `list`:

```py
sheet_names = convert(wb.SheetNames)
first_sheet_name = sheet_names[0]
```

Since utility functions will process the worksheet object from JavaScript, it
is preferable not to convert the object:

```py
first_sheet = wb.Sheets[first_sheet_name] # do not convert
```

#### Generate List of Records

In JavaScript, the equivalent of the "`list` of `dict`s" or "`list` of records"
is "array of objects". They can be created with `XLSX.utils.sheet_to_json`[^8]:

```py
rows = convert(ctxt.eval("(ws => XLSX.utils.sheet_to_json(ws))")(first_sheet))
```

#### Generate Pandas DataFrame

`rows` is a `list` of `dict` objects.
`from_records`[^9] understands this data shape and generates a proper DataFrame:

```py
import pandas as pd

df = pd.DataFrame.from_records(rows)
```

### Writing Files

The writing process looks similar to the reading process in reverse:

```mermaid
flowchart LR
  subgraph Pandas operations
    df[(Pandas\nDataFrame)]
    json(JSON\nString)
  end
  subgraph SheetJS operations
    aoo(array of\nobjects)
    wb((SheetJS\nWorkbook))
    base64(Base64\nstring)
  end
  file[(workbook\nfile)]
  df --> |`to_json`\nPandas ops| json
  json --> |`JSON.parse`\nJS Engine| aoo
  aoo --> |`json_to_sheet`\nSheetJS Ops| wb
  wb --> |`XLSX.write`\nBase64| base64
  base64 --> |`open`/`write`\nPython ops| file
```

At a high level:

1) Pandas operations translate the Python data to a JSON string

2) JS engine operations translate the JSON string to an array of objects

3) SheetJS libraries process the array and generate a Base64-encoded workbook

4) Pure Python operations decode the Base64 string and write the bytes to file.

#### Generate JSON

`DataFrame#to_json`[^10] with the option `orient="records"` generates a JSON
string that encodes an array of objects:

```py
json = df.to_json(orient="records")
```

#### Generate Worksheet

In JavaScript, `JSON.parse` will interpret the string as an array of objects.
`XLSX.utils.json_to_sheet`[^11] generates a SheetJS worksheet object:

```py
sheet = ctxt.eval("(json => XLSX.utils.json_to_sheet(JSON.parse(json)) )")(json)
```

#### Export Enhancements

At this point, there are many options for improving the appearance of the sheet.
For example, the "Export Tutorial"[^12] shows how to adjust column widths.

:::tip pass

[SheetJS Pro](https://sheetjs.com/pro) offers additional styling options such as
cell styling and frozen rows. "Pro Edit" offers a special approach for inserting
data into an existing file.

:::

#### Generate Workbook

`XLSX.utils.book_new`[^13] creates a new workbook and
`XLSX.utils.book_append_sheet`[^14] appends a worksheet to the workbook.
The new worksheet will be called "Export":

:::note pass

The code in the string literal is reproduced below:

```js
(ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
}
```

:::

```py
book = ctxt.eval("""((ws, name) => {
  const wb = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(wb, ws, name);
  return wb;
})""")(sheet, "Export")
```

#### Generate File

`XLSX.write`[^15] with the option `type: "base64"` generates the file data and
returns it as a Base64 string:

```py
b64 = ctxt.eval("(wb => XLSX.write(wb, {type:'base64', bookType:'xls'}))")(book)
```

With the Base64 string, standard Python operations can create a file:

```py
from base64 import b64decode

# decode the Base64 string to raw bytes and write them to disk
raw = b64decode(b64)
with open("export.xls", mode="wb") as f:
    f.write(raw)
```

## Complete Demo

This example will extract data from an Apple Numbers spreadsheet and generate a
DataFrame. The DataFrame will be exported to a legacy XLS spreadsheet.

### Engine Setup

0) Follow the official installation instructions[^16].
Instructions for macOS 12:

- Install `boost-python3` package using `brew`:

```bash
brew install boost-python3
```

- Identify python version:

```bash
python3 --version
```

:::note pass

When the demo was last tested, the version was `3.11.4`

:::

- [Download latest release](https://github.com/cloudflare/stpyv8/releases)

```bash
curl -LO https://github.com/cloudflare/stpyv8/releases/download/v11.5.150.16/stpyv8-macos-12-python-3.11.zip
```

- Extract ZIP file and enter folder

```bash
unzip stpyv8-macos-12-python-3.11.zip
cd stpyv8-macos-12-3.11
```

- Move `icudtl.dat` to `/Library/Application Support/STPyV8/`:

```bash
sudo mkdir -p /Library/Application\ Support/STPyV8
sudo mv icudtl.dat /Library/Application\ Support/STPyV8/
```

- Install wheel:

```bash
sudo python3 -m pip install --upgrade *.whl
cd ..
```
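
Once the wheel is installed, a quick way to confirm that the engine loads is to evaluate a trivial expression. The following is a minimal sketch and not part of the official instructions: `check_engine` is a hypothetical helper name, and the guarded import is an assumption made so that the script reports a missing module instead of raising.

```python
def check_engine():
    """Return the result of a trivial JS evaluation, or None if STPyV8 is missing."""
    try:
        # Same import used in the "Loading SheetJS" section of this demo
        from STPyV8 import JSContext
    except ImportError:
        return None

    # Evaluate a small expression inside a fresh JS context
    with JSContext() as ctxt:
        return ctxt.eval("'Sheet' + 'JS'")


result = check_engine()
if result is None:
    print("STPyV8 is not installed")
else:
    print("V8 evaluated:", result)
```

If the module imports but the context fails to start, the installation steps above (in particular the `icudtl.dat` location) are worth rechecking.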
### Demo

1) Follow the [standalone script](/docs/getting-started/installation/standalone)
instructions to download the script:

```bash
curl -LO https://cdn.sheetjs.com/xlsx-${current}/package/dist/xlsx.full.min.js
```

2) Install Pandas. On macOS:

```bash
sudo python3 -m pip install pandas
```

3) Download the following test scripts and files:

- [`pres.numbers` test file](https://sheetjs.com/pres.numbers)
- [`sheetjs.py` wrapper](pathname:///pandas/sheetjs.py)
- [`SheetJSPandas.py` script](pathname:///pandas/SheetJSPandas.py)

```bash
curl -LO https://sheetjs.com/pres.numbers
curl -LO https://docs.sheetjs.com/pandas/sheetjs.py
curl -LO https://docs.sheetjs.com/pandas/SheetJSPandas.py
```

4) Run the script:

```bash
python3 SheetJSPandas.py pres.numbers
```

If successful, it will display data rows in the file:

```
Reading from sheet Sheet1
{'Name': 'Bill Clinton', 'Index': 42}
{'Name': 'GeorgeW Bush', 'Index': 43}
{'Name': 'Barack Obama', 'Index': 44}
{'Name': 'Donald Trump', 'Index': 45}
{'Name': 'Joseph Biden', 'Index': 46}
```

If Pandas is installed, the script will display DataFrame metadata:

```
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Index   5 non-null      int64
dtypes: int64(1), object(1)
```

It will also export to `pres.xls`. The file can be read in a spreadsheet editor.
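
The file-facing steps at both ends of the pipeline (reading a workbook into a Base64 string, and writing a Base64 string back to disk) are plain Python and can be exercised without a JS engine. The sketch below is illustrative only: `file_to_b64` and `b64_to_file` are hypothetical helper names, and decoding the `b64encode` result to `str` is an assumption made here so that a text string, rather than `bytes`, would be handed to the engine.

```python
import os
import tempfile
from base64 import b64decode, b64encode


def file_to_b64(path):
    """Read a file and return its contents as a Base64 text string."""
    with open(path, mode="rb") as f:
        # b64encode returns bytes; decode to get a str
        return b64encode(f.read()).decode("ascii")


def b64_to_file(b64, path):
    """Decode a Base64 string and write the raw bytes to `path`."""
    with open(path, mode="wb") as f:
        f.write(b64decode(b64))


# Round-trip arbitrary binary data through both helpers
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.bin")
    dst = os.path.join(tmp, "out.bin")
    payload = bytes(range(256))

    with open(src, mode="wb") as f:
        f.write(payload)

    encoded = file_to_b64(src)
    b64_to_file(encoded, dst)

    with open(dst, mode="rb") as f:
        round_tripped = f.read()
```

In the demo flow, the string from `file_to_b64` would take the place of `b64` passed to `XLSX.read`, and the string returned by `XLSX.write` would be passed to `b64_to_file`.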