NUMBERS primary cell storage parse

docs clarified row and column props (fixes #2486) (fixes #2511)
SheetJS 2022-02-04 00:29:01 -05:00
parent 0000a09c41
commit 55924668a7
19 changed files with 1089 additions and 940 deletions

@ -15,6 +15,7 @@ OData
OpenDocument
OpenFormula
PivotTable
PivotTables
Quattro
SpreadsheetML
Unhide
@ -96,12 +97,14 @@ encodings
filesystem
globals
javascript
lifecycle
metadata
natively
pre-built
pre-generated
prepend
prepended
programmatically
repo
runtime
serverless

564
README.md

@ -1,30 +1,13 @@
# [SheetJS](https://sheetjs.com)
Parser and writer for various spreadsheet formats. Pure-JS cleanroom
implementation from official specifications, related documents, and test files.
Emphasis on parsing and writing robustness, cross-format feature compatibility
with a unified JS representation, and ES3/ES5 browser compatibility back to IE6.
The SheetJS Community Edition offers battle-tested open-source solutions for
extracting useful data from almost any complex spreadsheet and generating new
spreadsheets that will work with legacy and modern software alike.
This is the community version. We also offer a pro version with performance
enhancements, additional features like styling, and dedicated support.
Community Translations of this README:
- [Simplified Chinese](https://github.com/rockboom/SheetJS-docs-zh-CN)
[**Pro Version**](https://sheetjs.com/pro)
[**Commercial Support**](https://sheetjs.com/support)
[**Rendered Documentation**](https://docs.sheetjs.com/)
[**In-Browser Demos**](https://sheetjs.com/demos)
[**Source Code**](https://git.io/xlsx)
[**Issues and Bug Reports**](https://github.com/sheetjs/sheetjs/issues)
[SheetJS Pro](https://sheetjs.com/pro) offers solutions beyond data processing:
Edit complex templates with ease; let out your inner Picasso with styling; make
custom sheets with images/graphs/PivotTables; evaluate formula expressions and
port calculations to web apps; automate common spreadsheet tasks, and much more!
![License](https://img.shields.io/github/license/SheetJS/sheetjs)
[![Build Status](https://img.shields.io/github/workflow/status/sheetjs/sheetjs/Tests:%20node.js)](https://github.com/SheetJS/sheetjs/actions)
@ -54,11 +37,11 @@ Community Translations of this README:
<!-- toc -->
- [Installation](#installation)
- [Getting Started](#getting-started)
* [Installation](#installation)
* [Usage](#usage)
+ [The Zen of SheetJS](#the-zen-of-sheetjs)
* [JS Ecosystem Demos](#js-ecosystem-demos)
* [Optional Modules](#optional-modules)
* [ECMAScript 5 Compatibility](#ecmascript-5-compatibility)
- [Philosophy](#philosophy)
- [Parsing Workbooks](#parsing-workbooks)
* [Parsing Examples](#parsing-examples)
* [Streaming Read](#streaming-read)
@ -89,8 +72,7 @@ Community Translations of this README:
+ [Miscellaneous Workbook Properties](#miscellaneous-workbook-properties)
* [Document Features](#document-features)
+ [Formulae](#formulae)
+ [Column Properties](#column-properties)
+ [Row Properties](#row-properties)
+ [Row and Column Properties](#row-and-column-properties)
+ [Number Formats](#number-formats)
+ [Hyperlinks](#hyperlinks)
+ [Cell Comments](#cell-comments)
@ -112,27 +94,6 @@ Community Translations of this README:
* [HTML Output](#html-output)
* [JSON](#json)
- [File Formats](#file-formats)
* [Excel 2007+ XML (XLSX/XLSM)](#excel-2007-xml-xlsxxlsm)
* [Excel 2.0-95 (BIFF2/BIFF3/BIFF4/BIFF5)](#excel-20-95-biff2biff3biff4biff5)
* [Excel 97-2004 Binary (BIFF8)](#excel-97-2004-binary-biff8)
* [Excel 2003-2004 (SpreadsheetML)](#excel-2003-2004-spreadsheetml)
* [Excel 2007+ Binary (XLSB, BIFF12)](#excel-2007-binary-xlsb-biff12)
* [Delimiter-Separated Values (CSV/TXT)](#delimiter-separated-values-csvtxt)
* [Other Workbook Formats](#other-workbook-formats)
+ [Lotus 1-2-3 (WKS/WK1/WK2/WK3/WK4/123)](#lotus-1-2-3-wkswk1wk2wk3wk4123)
+ [Quattro Pro (WQ1/WQ2/WB1/WB2/WB3/QPW)](#quattro-pro-wq1wq2wb1wb2wb3qpw)
+ [Works for DOS / Windows Spreadsheet (WKS/XLR)](#works-for-dos--windows-spreadsheet-wksxlr)
+ [Numbers 3.0+ / iWork 2013+ Spreadsheet (NUMBERS)](#numbers-30--iwork-2013-spreadsheet-numbers)
+ [OpenDocument Spreadsheet (ODS/FODS)](#opendocument-spreadsheet-odsfods)
+ [Uniform Office Spreadsheet (UOS1/2)](#uniform-office-spreadsheet-uos12)
* [Other Single-Worksheet Formats](#other-single-worksheet-formats)
+ [dBASE and Visual FoxPro (DBF)](#dbase-and-visual-foxpro-dbf)
+ [Symbolic Link (SYLK)](#symbolic-link-sylk)
+ [Lotus Formatted Text (PRN)](#lotus-formatted-text-prn)
+ [Data Interchange Format (DIF)](#data-interchange-format-dif)
+ [HTML](#html)
+ [Rich Text Format (RTF)](#rich-text-format-rtf)
+ [Ethercalc Record Format (ETH)](#ethercalc-record-format-eth)
- [Testing](#testing)
* [Node](#node)
* [Browser](#browser)
@ -149,7 +110,9 @@ Community Translations of this README:
</details>
## Installation
## Getting Started
### Installation
In the browser, just add a script tag:
@ -188,6 +151,157 @@ With [bower](https://bower.io/search/?q=js-xlsx):
$ bower install js-xlsx
```
<details>
<summary><b>Optional features</b> (click to show)</summary>
The node version automatically requires modules for additional features. Some
of these modules are rather large in size and are only needed in special
circumstances, so they do not ship with the core. For browser use, they must
be included directly:
```html
<!-- international support from js-codepage -->
<script src="dist/cpexcel.js"></script>
```
An appropriate version for each dependency is included in the dist/ directory.
The complete single-file version is generated at `dist/xlsx.full.min.js`.
A slimmer build is generated at `dist/xlsx.mini.min.js`. Compared to full build:
- codepage library skipped (no support for XLS encodings)
- XLSX compression option not currently available
- no support for XLSB / XLS / Lotus 1-2-3 / SpreadsheetML 2003
- node stream utils removed
Webpack and Browserify builds include optional modules by default. Webpack can
be configured to remove support with `resolve.alias`:
```js
/* uncomment the lines below to remove support */
resolve: {
alias: { "./dist/cpexcel.js": "" } // <-- omit international support
}
```
</details>
<details>
<summary><b>ECMAScript 3 Compatibility</b> (click to show)</summary>
For broad compatibility with JavaScript engines, the library is written using
ECMAScript 3 language dialect as well as some ES5 features like `Array#forEach`.
Older browsers require shims to provide missing functions.
To use the shim, add the shim before the script tag that loads `xlsx.js`:
```html
<!-- add the shim first -->
<script type="text/javascript" src="shim.min.js"></script>
<!-- after the shim is referenced, add the library -->
<script type="text/javascript" src="xlsx.full.min.js"></script>
```
The script also includes `IE_LoadFile` and `IE_SaveFile` for loading and saving
files in Internet Explorer versions 6-9. The `xlsx.extendscript.js` script
bundles the shim in a format suitable for Photoshop and other Adobe products.
</details>
### Usage
Most scenarios involving spreadsheets and data can be broken into 5 parts:
1) **Acquire Data**: Data may be stored anywhere: local or remote files,
databases, HTML TABLE, or even generated programmatically in the web browser.
2) **Extract Data**: For spreadsheet files, this involves parsing raw bytes to
read the cell data. For general JS data, this involves reshaping the data.
3) **Process Data**: From generating summary statistics to cleaning data
records, this step is the heart of the problem.
4) **Package Data**: This can involve making a new spreadsheet or serializing
with `JSON.stringify` or writing XML or simply flattening data for UI tools.
5) **Release Data**: Spreadsheet files can be uploaded to a server or written
locally. Data can be presented to users in an HTML TABLE or data grid.
A common problem involves generating a valid spreadsheet export from data stored
in an HTML table. In this example, an HTML TABLE on the page will be scraped,
a row will be added to the bottom with the date of the report, and a new file
will be generated and downloaded locally. `XLSX.writeFile` takes care of
packaging the data and attempting a local download:
```js
// Acquire Data (reference to the HTML table)
var table_elt = document.getElementById("my-table-id");
// Extract Data (create a workbook object from the table)
var workbook = XLSX.utils.table_to_book(table_elt);
// Process Data (add a new row)
var worksheet = workbook.Sheets["Sheet1"];
XLSX.utils.sheet_add_aoa(worksheet, [["Created "+new Date().toISOString()]], {origin:-1});
// Package and Release Data (`writeFile` tries to write and save an XLSB file)
XLSX.writeFile(workbook, "Report.xlsb");
```
This library tries to simplify steps 2 and 4 with functions to extract useful
data from spreadsheet files (`read` / `readFile`) and generate new spreadsheet
files from data (`write` / `writeFile`).
This documentation and various demo projects cover a number of common scenarios
and approaches for steps 1 and 5.
Utility functions help with step 3.
#### The Zen of SheetJS
_File formats are implementation details_
The parser covers a wide gamut of common spreadsheet file formats to ensure that
"HTML-saved-as-XLS" files work as well as actual XLS or XLSX files.
The writer supports a number of common output formats for broad compatibility
with the data ecosystem.
_Data processing should fit in any workflow_
The library does not impose a separate lifecycle. It fits nicely in websites
and apps built using any framework. The plain JS data objects play nice with
Web Workers and future APIs.
["Parsing Workbooks"](#parsing-workbooks) describes solutions for common data
import scenarios involving actual spreadsheet files.
["Writing Workbooks"](#writing-workbooks) describes solutions for common data
export scenarios involving actual spreadsheet files.
["Utility Functions"](#utility-functions) details utility functions for
translating JSON Arrays and other common JS structures into worksheet objects.
_JavaScript is a powerful language for data processing_
The ["Common Spreadsheet Format"](#common-spreadsheet-format) is a simple object
representation of the core concepts of a workbook. The various functions in the
library provide low-level tools for working with the object.
For friendly JS processing, there are utility functions for converting parts of
a worksheet to/from an Array of Arrays. For example, summing columns from an
array of arrays can be implemented in a single Array reduce operation:
```js
var aoa = XLSX.utils.sheet_to_json(worksheet, {header: 1});
var sum_of_column_B = aoa.reduce((acc, row) => acc + (+row[1]||0), 0);
```
### JS Ecosystem Demos
The [`demos` directory](demos/) includes sample projects for:
@ -228,100 +342,6 @@ The [`demos` directory](demos/) includes sample projects for:
Other examples are included in the [showcase](demos/showcase/).
### Optional Modules
<details>
<summary><b>Optional features</b> (click to show)</summary>
The node version automatically requires modules for additional features. Some
of these modules are rather large in size and are only needed in special
circumstances, so they do not ship with the core. For browser use, they must
be included directly:
```html
<!-- international support from js-codepage -->
<script src="dist/cpexcel.js"></script>
```
An appropriate version for each dependency is included in the dist/ directory.
The complete single-file version is generated at `dist/xlsx.full.min.js`.
A slimmer build is generated at `dist/xlsx.mini.min.js`. Compared to full build:
- codepage library skipped (no support for XLS encodings)
- XLSX compression option not currently available
- no support for XLSB / XLS / Lotus 1-2-3 / SpreadsheetML 2003
- node stream utils removed
Webpack and Browserify builds include optional modules by default. Webpack can
be configured to remove support with `resolve.alias`:
```js
/* uncomment the lines below to remove support */
resolve: {
alias: { "./dist/cpexcel.js": "" } // <-- omit international support
}
```
</details>
### ECMAScript 5 Compatibility
Since the library uses functions like `Array#forEach`, older browsers require
[shims to provide missing functions](https://oss.sheetjs.com/sheetjs/shim.js).
To use the shim, add the shim before the script tag that loads `xlsx.js`:
```html
<!-- add the shim first -->
<script type="text/javascript" src="shim.min.js"></script>
<!-- after the shim is referenced, add the library -->
<script type="text/javascript" src="xlsx.full.min.js"></script>
```
The script also includes `IE_LoadFile` and `IE_SaveFile` for loading and saving
files in Internet Explorer versions 6-9. The `xlsx.extendscript.js` script
bundles the shim in a format suitable for Photoshop and other Adobe products.
## Philosophy
<details>
<summary><b>Philosophy</b> (click to show)</summary>
Prior to SheetJS, APIs for processing spreadsheet files were format-specific.
Third-party libraries either supported one format, or they involved a separate
set of classes for each supported file type. Even though XLSB was introduced in
Excel 2007, nothing outside of SheetJS or Excel supported the format.
To promote a format-agnostic view, SheetJS starts from a pure-JS representation
that we call the ["Common Spreadsheet Format"](#common-spreadsheet-format).
Emphasizing a uniform object representation enables new features like format
conversion (reading an XLSX template and saving as XLS) and circumvents the mess
of classes. By abstracting the complexities of the various formats, tools
need not worry about the specific file type!
A simple object representation combined with careful coding practices enables
use cases in older browsers and in alternative environments like ExtendScript
and Web Workers. It is always tempting to use the latest and greatest features,
but they tend to require the latest versions of browsers, limiting usability.
Utility functions capture common use cases like generating JS objects or HTML.
Most simple operations should only require a few lines of code. More complex
operations generally should be straightforward to implement.
Excel pushes the XLSX format as default starting in Excel 2007. However, there
are other formats with more appealing properties. For example, the XLSB format
is spiritually similar to XLSX but files often end up taking less than half the
space and open much faster! Even though an XLSX writer is available, other
format writers are available so users can take advantage of the unique
characteristics of each format.
The primary focus of the Community Edition is correct data interchange, focused
on extracting data from any compatible data representation and exporting data in
various formats suitable for any third party interface.
</details>
## Parsing Workbooks
For parsing, the first step is to read the file. This involves acquiring the
@ -890,6 +910,11 @@ Write options are described in the [Writing Options](#writing-options) section.
Utilities are available in the `XLSX.utils` object and are described in the
[Utility Functions](#utility-functions) section:
**Constructing:**
- `book_new` creates an empty workbook
- `book_append_sheet` adds a worksheet to a workbook
**Importing:**
- `aoa_to_sheet` converts an array of arrays of JS data to a worksheet.
@ -1391,7 +1416,23 @@ formulae and Lotus Parsed formulae have to be explicitly unwound. OpenFormula
formulae can be converted with regular expressions.
</details>
#### Column Properties
#### Row and Column Properties
<details>
<summary><b>Format Support</b> (click to show)</summary>
**Row Properties**: XLSX/M, XLSB, BIFF8 XLS, XLML, SYLK, DOM, ODS
**Column Properties**: XLSX/M, XLSB, BIFF8 XLS, XLML, SYLK, DOM
</details>
Row and Column properties are not extracted by default when reading from a file
and are not persisted by default when writing to a file. The option
`cellStyles: true` must be passed to the relevant read or write function.
_Column Properties_
The `!cols` array in each worksheet, if present, is a collection of `ColInfo`
objects which have the following properties:
@ -1412,6 +1453,30 @@ type ColInfo = {
};
```
_Row Properties_
The `!rows` array in each worksheet, if present, is a collection of `RowInfo`
objects which have the following properties:
```typescript
type RowInfo = {
/* visibility */
hidden?: boolean; // if true, the row is hidden
/* row height is specified in one of the following ways: */
hpx?: number; // height in screen pixels
hpt?: number; // height in points
level?: number; // 0-indexed outline / group level
};
```
_Outline / Group Levels Convention_
The Excel UI displays the base outline level as `1` and the max level as `8`.
Following JS conventions, SheetJS uses 0-indexed outline levels wherein the base
outline level is `0` and the max level is `7`.
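As a rough illustration of the convention, a row shown at outline level 2 in the
Excel UI would be stored with `level: 1`. The sketch below builds a `!rows`
array by hand on a plain object. The helper name `excel_level_to_sheetjs` is
hypothetical and is not part of the library.

```js
// Hypothetical helper (not a library function): convert the 1-based outline
// level displayed in the Excel UI to the 0-indexed `level` stored in RowInfo.
function excel_level_to_sheetjs(ui_level) {
  if (ui_level < 1 || ui_level > 8) throw new Error("Excel UI levels run from 1 to 8");
  return ui_level - 1;
}

// A plain worksheet object; only the row metadata matters for this sketch.
var worksheet = { "!ref": "A1:A3" };
worksheet["!rows"] = [];
worksheet["!rows"][0] = { hpt: 24 };                          // first row: 24pt tall
worksheet["!rows"][1] = { hidden: true };                     // second row: hidden
worksheet["!rows"][2] = { level: excel_level_to_sheetjs(2) }; // third row: UI level 2
```

Rows without an entry in `!rows` are simply left undefined in the array.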
<details>
<summary><b>Why are there three width types?</b> (click to show)</summary>
@ -1442,6 +1507,20 @@ when changing the pixel width, delete the `wch` and `width` properties.
<details>
<summary><b>Implementation details</b> (click to show)</summary>
_Row Heights_
Excel internally stores row heights in points. The default resolution is 72 DPI
or 96 PPI, so the pixel and point size should agree. For different resolutions
they may not agree, so the library separates the concepts.
Even though all of the information is made available, writers are expected to
follow the priority order:
1) use `hpx` pixel height if available
2) use `hpt` point height if available
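The priority order can be sketched as a small selection function.
`pick_row_height` is a hypothetical name used only for illustration. It reports
which field wins and performs no unit conversion.

```js
// Hypothetical helper: choose the row height field following the stated
// priority (hpx first, then hpt). Returns null when neither is present.
function pick_row_height(row) {
  if (row && row.hpx != null) return { unit: "px", value: row.hpx };
  if (row && row.hpt != null) return { unit: "pt", value: row.hpt };
  return null;
}

var both = pick_row_height({ hpx: 20, hpt: 15 }); // hpx takes priority
var points_only = pick_row_height({ hpt: 15 });   // falls back to hpt
var neither = pick_row_height({ hidden: true });  // no height specified
```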
_Column Widths_
Given the constraints, it is possible to determine the MDW without actually
inspecting the font! The parsers guess the pixel width by converting from width
to pixels and back, repeating for all possible MDW and selecting the MDW that
@ -1454,41 +1533,7 @@ follow the priority order:
1) use `width` field if available
2) use `wpx` pixel width if available
3) use `wch` character count if available
</details>
#### Row Properties
The `!rows` array in each worksheet, if present, is a collection of `RowInfo`
objects which have the following properties:
```typescript
type RowInfo = {
/* visibility */
hidden?: boolean; // if true, the row is hidden
/* row height is specified in one of the following ways: */
hpx?: number; // height in screen pixels
hpt?: number; // height in points
level?: number; // 0-indexed outline / group level
};
```
Note: Excel UI displays the base outline level as `1` and the max level as `8`.
The `level` field stores the base outline as `0` and the max level as `7`.
<details>
<summary><b>Implementation details</b> (click to show)</summary>
Excel internally stores row heights in points. The default resolution is 72 DPI
or 96 PPI, so the pixel and point size should agree. For different resolutions
they may not agree, so the library separates the concepts.
Even though all of the information is made available, writers are expected to
follow the priority order:
1) use `hpx` pixel height if available
2) use `hpt` point height if available
</details>
#### Number Formats
@ -2518,10 +2563,12 @@ range limits will be silently truncated:
Excel 2003 SpreadsheetML range limits are governed by the version of Excel and
are not enforced by the writer.
### Excel 2007+ XML (XLSX/XLSM)
<details>
<summary>(click to show)</summary>
<summary><b>File Format Details</b> (click to show)</summary>
**Core Spreadsheet Formats**
- **Excel 2007+ XML (XLSX/XLSM)**
XLSX and XLSM files are ZIP containers containing a series of XML files in
accordance with the Open Packaging Conventions (OPC). The XLSM format, almost
@ -2531,12 +2578,7 @@ The format is standardized in ECMA-376 and later in ISO/IEC 29500. Excel does
not follow the specification, and there are additional documents discussing how
Excel deviates from the specification.
</details>
### Excel 2.0-95 (BIFF2/BIFF3/BIFF4/BIFF5)
<details>
<summary>(click to show)</summary>
- **Excel 2.0-95 (BIFF2/BIFF3/BIFF4/BIFF5)**
BIFF 2/3 XLS are single-sheet streams of binary records. Excel 4 introduced
the concept of a workbook (`XLW` files) but also had single-sheet `XLS` format.
@ -2548,12 +2590,7 @@ files in these formats, so record lengths and fields were determined by writing
in all of the supported formats and comparing files. Excel 2016 can generate
BIFF5 files, enabling a full suite of file tests starting from XLSX or BIFF2.
</details>
### Excel 97-2004 Binary (BIFF8)
<details>
<summary>(click to show)</summary>
- **Excel 97-2004 Binary (BIFF8)**
BIFF8 exclusively uses the Compound File Binary container format, splitting some
content into streams within the file. At its core, it still uses an extended
@ -2562,24 +2599,14 @@ version of the binary record format from older versions of BIFF.
The `MS-XLS` specification covers the basics of the file format, and other
specifications expand on serialization of features like properties.
</details>
### Excel 2003-2004 (SpreadsheetML)
<details>
<summary>(click to show)</summary>
- **Excel 2003-2004 (SpreadsheetML)**
Predating XLSX, SpreadsheetML files are simple XML files. There is no official
and comprehensive specification, although MS has released documentation on the
format. Since Excel 2016 can generate SpreadsheetML files, mapping features is
pretty straightforward.
</details>
### Excel 2007+ Binary (XLSB, BIFF12)
<details>
<summary>(click to show)</summary>
- **Excel 2007+ Binary (XLSB, BIFF12)**
Introduced in parallel with XLSX, the XLSB format combines the BIFF architecture
with the content separation and ZIP container of XLSX. For the most part nodes
@ -2588,12 +2615,7 @@ in an XLSX sub-file can be mapped to XLSB records in a corresponding sub-file.
The `MS-XLSB` specification covers the basics of the file format, and other
specifications expand on serialization of features like properties.
</details>
### Delimiter-Separated Values (CSV/TXT)
<details>
<summary>(click to show)</summary>
- **Delimiter-Separated Values (CSV/TXT)**
Excel CSV deviates from RFC4180 in a number of important ways. The generated
CSV files should generally work in Excel although they may not work in RFC4180
@ -2602,32 +2624,20 @@ writer proactively generates cells for formulae if values are unavailable.
Excel TXT uses tab as the delimiter and code page 1200.
Notes:
Like in Excel, files starting with `0x49 0x44 ("ID")` are treated as Symbolic
Link files. Unlike Excel, if the file does not have a valid SYLK header, it
will be proactively reinterpreted as CSV. There are some files with semicolon
delimiter that align with a valid SYLK file. For the broadest compatibility,
all cells with the value of `ID` are automatically wrapped in double-quotes.
- Like in Excel, files starting with `0x49 0x44 ("ID")` are treated as Symbolic
Link files. Unlike Excel, if the file does not have a valid SYLK header, it
will be proactively reinterpreted as CSV. There are some files with semicolon
delimiter that align with a valid SYLK file. For the broadest compatibility,
all cells with the value of `ID` are automatically wrapped in double-quotes.
**Miscellaneous Workbook Formats**
</details>
### Other Workbook Formats
<details>
<summary>(click to show)</summary>
Support for other formats is generally far XLS/XLSB/XLSX support, due in large
Support for other formats is generally far behind XLS/XLSB/XLSX support, due in
part to a lack of publicly available documentation. Test files were produced in
the respective apps and compared to their XLS exports to determine structure.
The main focus is data extraction.
</details>
#### Lotus 1-2-3 (WKS/WK1/WK2/WK3/WK4/123)
<details>
<summary>(click to show)</summary>
- **Lotus 1-2-3 (WKS/WK1/WK2/WK3/WK4/123)**
The Lotus formats consist of binary records similar to the BIFF structure. Lotus
did release a specification decades ago covering the original WK1 format. Other
@ -2637,23 +2647,13 @@ Generated WK1 worksheets are compatible with Lotus 1-2-3 R2 and Excel 5.0.
Generated WK3 workbooks are compatible with Lotus 1-2-3 R9 and Excel 5.0.
</details>
#### Quattro Pro (WQ1/WQ2/WB1/WB2/WB3/QPW)
<details>
<summary>(click to show)</summary>
- **Quattro Pro (WQ1/WQ2/WB1/WB2/WB3/QPW)**
The Quattro Pro formats use binary records in the same way as BIFF and Lotus.
Some of the newer formats (namely WB3 and QPW) use a CFB enclosure just like
BIFF8 XLS.
</details>
#### Works for DOS / Windows Spreadsheet (WKS/XLR)
<details>
<summary>(click to show)</summary>
- **Works for DOS / Windows Spreadsheet (WKS/XLR)**
All versions of Works were limited to a single worksheet.
@ -2669,12 +2669,7 @@ exact Workbook stream for the XLR and the 97-2003 XLS export. Works 6 XLS
includes two empty worksheets but the main worksheet has an identical encoding.
XLR also includes a `WksSSWorkBook` stream similar to Lotus FM3/FMT files.
</details>
#### Numbers 3.0+ / iWork 2013+ Spreadsheet (NUMBERS)
<details>
<summary>(click to show)</summary>
- **Numbers 3.0+ / iWork 2013+ Spreadsheet (NUMBERS)**
iWork 2013 (Numbers 3.0 / Pages 5.0 / Keynote 6.0) switched from a proprietary
XML-based format to the current file format based on the iWork Archive (IWA).
@ -2684,39 +2679,24 @@ The parser focuses on extracting raw data from tables. Numbers technically
supports multiple tables in a logical worksheet, including custom titles. This
parser will generate one worksheet per Numbers table.
</details>
#### OpenDocument Spreadsheet (ODS/FODS)
<details>
<summary>(click to show)</summary>
- **OpenDocument Spreadsheet (ODS/FODS)**
ODS is an XML-in-ZIP format akin to XLSX while FODS is an XML format akin to
SpreadsheetML. Both are detailed in the OASIS standard, but tools like LO/OO
add undocumented extensions. The parsers and writers do not implement the full
standard, instead focusing on parts necessary to extract and store raw data.
</details>
#### Uniform Office Spreadsheet (UOS1/2)
<details>
<summary>(click to show)</summary>
- **Uniform Office Spreadsheet (UOS1/2)**
UOS is a very similar format, and it comes in 2 varieties corresponding to ODS
and FODS respectively. For the most part, the difference between the formats
is in the names of tags and attributes.
</details>
### Other Single-Worksheet Formats
**Miscellaneous Worksheet Formats**
Many older formats supported only one worksheet:
#### dBASE and Visual FoxPro (DBF)
<details>
<summary>(click to show)</summary>
- **dBASE and Visual FoxPro (DBF)**
DBF is really a typed table format: each column can only hold one data type and
each record omits type information. The parser generates a header row and
@ -2727,12 +2707,7 @@ Multi-file extensions like external memos and tables are currently unsupported,
limited by the general ability to read arbitrary files in the web browser. The
reader understands DBF Level 7 extensions like DATETIME.
</details>
#### Symbolic Link (SYLK)
<details>
<summary>(click to show)</summary>
- **Symbolic Link (SYLK)**
There is no real documentation. All knowledge was gathered by saving files in
various versions of Excel to deduce the meaning of fields. Notes:
@ -2740,23 +2715,13 @@ various versions of Excel to deduce the meaning of fields. Notes:
- Plain formulae are stored in the RC form.
- Column widths are rounded to integral characters.
</details>
#### Lotus Formatted Text (PRN)
<details>
<summary>(click to show)</summary>
- **Lotus Formatted Text (PRN)**
There is no real documentation, and in fact Excel treats PRN as an output-only
file format. Nevertheless we can guess the column widths and reverse-engineer
the original layout. Excel's 240 character width limitation is not enforced.
</details>
#### Data Interchange Format (DIF)
<details>
<summary>(click to show)</summary>
- **Data Interchange Format (DIF)**
There is no unified definition. Visicalc DIF differs from Lotus DIF, and both
differ from Excel DIF. Where ambiguous, the parser/writer follows the expected
@ -2769,12 +2734,7 @@ behavior from Excel. In particular, Excel extends DIF in incompatible ways:
- DIF technically has no support for formulae, but Excel will automatically
convert plain formulae. Array formulae are not preserved.
</details>
#### HTML
<details>
<summary>(click to show)</summary>
- **HTML**
Excel HTML worksheets include special metadata encoded in styles. For example,
`mso-number-format` is a localized string containing the number format. Despite
@ -2785,22 +2745,12 @@ looks for those tags and overrides the default interpretation. For example, text
like `<td>12345</td>` will be parsed as numbers but `<td t="s">12345</td>` will
be parsed as text.
</details>
#### Rich Text Format (RTF)
<details>
<summary>(click to show)</summary>
- **Rich Text Format (RTF)**
Excel RTF worksheets are stored in clipboard when copying cells or ranges from a
worksheet. The supported codes are a subset of the Word RTF support.
</details>
#### Ethercalc Record Format (ETH)
<details>
<summary>(click to show)</summary>
- **Ethercalc Record Format (ETH)**
[Ethercalc](https://ethercalc.net/) is an open source web spreadsheet powered by
a record format reminiscent of SYLK wrapped in a MIME multi-part message.

@ -59,6 +59,13 @@ var NUMBERS = (function() {
x = (x & 858993459) + (x >> 2 & 858993459);
return (x + (x >> 4) & 252645135) * 16843009 >>> 24;
};
var readDecimal128LE = function(buf, offset) {
var exp = (buf[offset + 15] & 127) << 7 | buf[offset + 14] >> 1;
var mantissa = buf[offset + 14] & 1;
for (var j = offset + 13; j >= offset; --j)
mantissa = mantissa * 256 + buf[j];
return (buf[offset + 15] & 128 ? -mantissa : mantissa) * Math.pow(10, exp - 6176);
};
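To sanity-check the layout that `readDecimal128LE` reads (sign in the top bit of
byte 15, a 14-bit exponent biased by 6176, and the coefficient in the remaining
low-order bits, all little-endian), the function can be exercised on hand-built
buffers. The `make_d128` helper below is hypothetical and exists only for this
sketch. It handles coefficients below 2^32 and nothing else.

```js
// Copied from the hunk above so the sketch is self-contained.
var readDecimal128LE = function(buf, offset) {
  var exp = (buf[offset + 15] & 127) << 7 | buf[offset + 14] >> 1;
  var mantissa = buf[offset + 14] & 1;
  for (var j = offset + 13; j >= offset; --j)
    mantissa = mantissa * 256 + buf[j];
  return (buf[offset + 15] & 128 ? -mantissa : mantissa) * Math.pow(10, exp - 6176);
};

// Hypothetical test helper: pack a small coefficient, a power-of-ten exponent
// and a sign into 16 bytes using the same bit positions the reader expects.
function make_d128(coefficient, exp10, negative) {
  var buf = new Uint8Array(16);
  for (var i = 0; i < 4; ++i) buf[i] = (coefficient / Math.pow(256, i)) & 255;
  var biased = exp10 + 6176;
  buf[14] = (biased & 127) << 1;                   // low 7 exponent bits
  buf[15] = (biased >> 7) | (negative ? 128 : 0);  // high 7 exponent bits + sign
  return buf;
}

var a = readDecimal128LE(make_d128(42, 0, false), 0);    // 42 * 10^0
var b = readDecimal128LE(make_d128(12345, 0, true), 0);  // -12345 * 10^0
var c = readDecimal128LE(make_d128(15, -1, false), 0);   // 15 * 10^-1, about 1.5
```

Because the result is an ordinary JS number, coefficients beyond 2^53 lose
precision; that is a property of the reader, not of this sketch.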
// src/proto.ts
function parse_varint49(buf, ptr) {
@ -279,10 +286,10 @@ var NUMBERS = (function() {
return out;
}
// src/prebnccell.ts
function parseit(buf, sst, rsst, version) {
// src/cell.ts
function parse_old_storage(buf, sst, rsst) {
var dv = u8_to_dataview(buf);
var ctype = buf[version == 4 ? 1 : 2];
var ctype = buf[buf[0] == 4 ? 1 : 2];
var flags = dv.getUint32(4, true);
var data_offset = 12 + popcnt(flags & 3470) * 4;
var ridx = -1, sidx = -1, ieee = NaN, dt = new Date(2001, 0, 1);
@ -342,14 +349,79 @@ var NUMBERS = (function() {
}
return ret;
}
function parse_storage(buf, sst, rsst) {
var dv = u8_to_dataview(buf);
var ctype = buf[1];
var flags = dv.getUint32(8, true);
var data_offset = 12;
var ridx = -1, sidx = -1, d128 = NaN, ieee = NaN, dt = new Date(2001, 0, 1);
if (flags & 1) {
d128 = readDecimal128LE(buf, data_offset);
data_offset += 16;
}
if (flags & 2) {
ieee = dv.getFloat64(data_offset, true);
data_offset += 8;
}
if (flags & 4) {
dt.setTime(dt.getTime() + dv.getFloat64(data_offset, true) * 1e3);
data_offset += 8;
}
if (flags & 8) {
sidx = dv.getUint32(data_offset, true);
data_offset += 4;
}
if (flags & 16) {
ridx = dv.getUint32(data_offset, true);
data_offset += 4;
}
var ret;
switch (ctype) {
case 0:
break;
case 2:
ret = { t: "n", v: d128 };
break;
case 3:
ret = { t: "s", v: sst[sidx] };
break;
case 5:
ret = { t: "d", v: dt };
break;
case 6:
ret = { t: "b", v: ieee > 0 };
break;
case 7:
ret = { t: "n", v: ieee };
break;
case 8:
ret = { t: "e", v: 0 };
break;
case 9:
{
if (ridx > -1)
ret = { t: "s", v: rsst[ridx] };
else
throw new Error("Unsupported cell type ".concat(ctype, " : ").concat(flags & 31, " : ").concat(buf.slice(0, 4)));
}
break;
case 10:
ret = { t: "n", v: d128 };
break;
default:
throw new Error("Unsupported cell type ".concat(ctype, " : ").concat(flags & 31, " : ").concat(buf.slice(0, 4)));
}
return ret;
}
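The optional fields in `parse_storage` sit back to back after the 12-byte
header, in flag-bit order. A minimal sketch of that offset arithmetic follows.
`storage_field_offsets` and `STORAGE_FIELDS` are hypothetical names that mirror
the branches above; they are not part of the library.

```js
// Field sizes in bytes, keyed by flag bit, in the order parse_storage reads them.
var STORAGE_FIELDS = [
  { bit: 1,  name: "d128", size: 16 }, // decimal128 number
  { bit: 2,  name: "ieee", size: 8 },  // IEEE754 double
  { bit: 4,  name: "date", size: 8 },  // seconds added to 2001-01-01
  { bit: 8,  name: "sidx", size: 4 },  // index into `sst`
  { bit: 16, name: "ridx", size: 4 }   // index into `rsst`
];

// Return the byte offset of every field present for the given flags,
// plus `end`, the offset just past the last field.
function storage_field_offsets(flags) {
  var offset = 12, out = {};
  STORAGE_FIELDS.forEach(function(f) {
    if (!(flags & f.bit)) return;
    out[f.name] = offset;
    offset += f.size;
  });
  out.end = offset;
  return out;
}

var number_and_string = storage_field_offsets(1 | 8);
var double_and_rich = storage_field_offsets(2 | 16);
```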
function parse(buf, sst, rsst) {
var version = buf[0];
switch (version) {
switch (buf[0]) {
case 3:
case 4:
return parseit(buf, sst, rsst, version);
return parse_old_storage(buf, sst, rsst);
case 5:
return parse_storage(buf, sst, rsst);
default:
throw new Error("Unsupported pre-BNC version ".concat(version));
throw new Error("Unsupported payload version ".concat(buf[0]));
}
}
@ -387,6 +459,10 @@ var NUMBERS = (function() {
};
function parse_numbers(cfb) {
var out = [];
cfb.FullPaths.forEach(function(p) {
if (p.match(/\.iwpv2/))
throw new Error("Unsupported password protection");
});
cfb.FileIndex.forEach(function(s) {
if (!s.name.match(/\.iwa$/))
return;
@ -460,16 +536,30 @@ var NUMBERS = (function() {
return data;
}
function parse_TST_TileRowInfo(u8) {
var _a, _b, _c, _d, _e, _f, _g, _h, _i, _j;
var pb = parse_shallow(u8);
var R = varint_to_i32(pb[1][0].data) >>> 0;
var storage = pb[3][0].data;
var offsets = u8_to_dataview(pb[4][0].data);
var pre_bnc = (_b = (_a = pb[3]) == null ? void 0 : _a[0]) == null ? void 0 : _b.data;
var pre_bnc_offsets = ((_d = (_c = pb[4]) == null ? void 0 : _c[0]) == null ? void 0 : _d.data) && u8_to_dataview(pb[4][0].data);
var storage = (_f = (_e = pb[6]) == null ? void 0 : _e[0]) == null ? void 0 : _f.data;
var storage_offsets = ((_h = (_g = pb[7]) == null ? void 0 : _g[0]) == null ? void 0 : _h.data) && u8_to_dataview(pb[7][0].data);
var wide_offsets = ((_j = (_i = pb[8]) == null ? void 0 : _i[0]) == null ? void 0 : _j.data) && varint_to_i32(pb[8][0].data) > 0 || false;
var width = wide_offsets ? 4 : 1;
var cells = [];
for (var C = 0; C < offsets.byteLength / 2; ++C) {
var off = offsets.getUint16(C * 2, true);
if (off > storage.length)
continue;
cells[C] = storage.subarray(off, offsets.getUint16(C * 2 + 2, true));
var off = 0;
for (var C = 0; C < pre_bnc_offsets.byteLength / 2; ++C) {
if (storage && storage_offsets) {
off = storage_offsets.getUint16(C * 2, true) * width;
if (off < storage.length) {
cells[C] = storage.subarray(off, storage_offsets.getUint16(C * 2 + 2, true) * width);
continue;
}
}
if (pre_bnc && pre_bnc_offsets) {
off = pre_bnc_offsets.getUint16(C * 2, true) * width;
if (off < pre_bnc.length)
cells[C] = pre_bnc.subarray(off, pre_bnc_offsets.getUint16(C * 2 + 2, true) * width);
}
}
return { R: R, cells: cells };
}

@ -1,30 +1,13 @@
# [SheetJS](https://sheetjs.com)
Parser and writer for various spreadsheet formats. Pure-JS cleanroom
implementation from official specifications, related documents, and test files.
Emphasis on parsing and writing robustness, cross-format feature compatibility
with a unified JS representation, and ES3/ES5 browser compatibility back to IE6.
The SheetJS Community Edition offers battle-tested open-source solutions for
extracting useful data from almost any complex spreadsheet and generating new
spreadsheets that will work with legacy and modern software alike.
This is the community version. We also offer a pro version with performance
enhancements, additional features like styling, and dedicated support.
Community Translations of this README:
- [Simplified Chinese](https://github.com/rockboom/SheetJS-docs-zh-CN)
[**Pro Version**](https://sheetjs.com/pro)
[**Commercial Support**](https://sheetjs.com/support)
[**Rendered Documentation**](https://docs.sheetjs.com/)
[**In-Browser Demos**](https://sheetjs.com/demos)
[**Source Code**](https://git.io/xlsx)
[**Issues and Bug Reports**](https://github.com/sheetjs/sheetjs/issues)
[SheetJS Pro](https://sheetjs.com/pro) offers solutions beyond data processing:
Edit complex templates with ease; let out your inner Picasso with styling; make
custom sheets with images/graphs/PivotTables; evaluate formula expressions and
port calculations to web apps; automate common spreadsheet tasks, and much more!
![License](https://img.shields.io/github/license/SheetJS/sheetjs)
[![Build Status](https://img.shields.io/github/workflow/status/sheetjs/sheetjs/Tests:%20node.js)](https://github.com/SheetJS/sheetjs/actions)

@ -1,4 +1,6 @@
## Installation
## Getting Started
### Installation
In the browser, just add a script tag:
@ -37,3 +39,60 @@ With [bower](https://bower.io/search/?q=js-xlsx):
$ bower install js-xlsx
```
<details>
<summary><b>Optional features</b> (click to show)</summary>
The node version automatically requires modules for additional features. Some
of these modules are rather large in size and are only needed in special
circumstances, so they do not ship with the core. For browser use, they must
be included directly:
```html
<!-- international support from js-codepage -->
<script src="dist/cpexcel.js"></script>
```
An appropriate version for each dependency is included in the dist/ directory.
The complete single-file version is generated at `dist/xlsx.full.min.js`.
A slimmer build is generated at `dist/xlsx.mini.min.js`. Compared to full build:
- codepage library skipped (no support for XLS encodings)
- XLSX compression option not currently available
- no support for XLSB / XLS / Lotus 1-2-3 / SpreadsheetML 2003
- node stream utils removed
Webpack and Browserify builds include optional modules by default. Webpack can
be configured to remove support with `resolve.alias`:
```js
/* uncomment the lines below to remove support */
resolve: {
alias: { "./dist/cpexcel.js": "" } // <-- omit international support
}
```
</details>
<details>
<summary><b>ECMAScript 3 Compatibility</b> (click to show)</summary>
For broad compatibility with JavaScript engines, the library is written using
ECMAScript 3 language dialect as well as some ES5 features like `Array#forEach`.
Older browsers require shims to provide missing functions.
To use the shim, add the shim before the script tag that loads `xlsx.js`:
```html
<!-- add the shim first -->
<script type="text/javascript" src="shim.min.js"></script>
<!-- after the shim is referenced, add the library -->
<script type="text/javascript" src="xlsx.full.min.js"></script>
```
The script also includes `IE_LoadFile` and `IE_SaveFile` for loading and saving
files in Internet Explorer versions 6-9. The `xlsx.extendscript.js` script
bundles the shim in a format suitable for Photoshop and other Adobe products.
</details>

@ -1,55 +0,0 @@
### Optional Modules
<details>
<summary><b>Optional features</b> (click to show)</summary>
The node version automatically requires modules for additional features. Some
of these modules are rather large in size and are only needed in special
circumstances, so they do not ship with the core. For browser use, they must
be included directly:
```html
<!-- international support from js-codepage -->
<script src="dist/cpexcel.js"></script>
```
An appropriate version for each dependency is included in the dist/ directory.
The complete single-file version is generated at `dist/xlsx.full.min.js`
A slimmer build is generated at `dist/xlsx.mini.min.js`. Compared to full build:
- codepage library skipped (no support for XLS encodings)
- XLSX compression option not currently available
- no support for XLSB / XLS / Lotus 1-2-3 / SpreadsheetML 2003
- node stream utils removed
Webpack and Browserify builds include optional modules by default. Webpack can
be configured to remove support with `resolve.alias`:
```js
/* uncomment the lines below to remove support */
resolve: {
alias: { "./dist/cpexcel.js": "" } // <-- omit international support
}
```
</details>
### ECMAScript 5 Compatibility
Since the library uses functions like `Array#forEach`, older browsers require
[shims to provide missing functions](https://oss.sheetjs.com/sheetjs/shim.js).
To use the shim, add the shim before the script tag that loads `xlsx.js`:
```html
<!-- add the shim first -->
<script type="text/javascript" src="shim.min.js"></script>
<!-- after the shim is referenced, add the library -->
<script type="text/javascript" src="xlsx.full.min.js"></script>
```
The script also includes `IE_LoadFile` and `IE_SaveFile` for loading and saving
files in Internet Explorer versions 6-9. The `xlsx.extendscript.js` script
bundles the shim in a format suitable for Photoshop and other Adobe products.

94
docbits/13_usage.md Normal file

@ -0,0 +1,94 @@
### Usage
Most scenarios involving spreadsheets and data can be broken into 5 parts:
1) **Acquire Data**: Data may be stored anywhere: local or remote files,
databases, HTML TABLE, or even generated programmatically in the web browser.
2) **Extract Data**: For spreadsheet files, this involves parsing raw bytes to
read the cell data. For general JS data, this involves reshaping the data.
3) **Process Data**: From generating summary statistics to cleaning data
records, this step is the heart of the problem.
4) **Package Data**: This can involve making a new spreadsheet or serializing
with `JSON.stringify` or writing XML or simply flattening data for UI tools.
5) **Release Data**: Spreadsheet files can be uploaded to a server or written
locally. Data can be presented to users in an HTML TABLE or data grid.
A common problem involves generating a valid spreadsheet export from data stored
in an HTML table. In this example, an HTML TABLE on the page will be scraped,
a row will be added to the bottom with the date of the report, and a new file
will be generated and downloaded locally. `XLSX.writeFile` takes care of
packaging the data and attempting a local download:
```js
// Acquire Data (reference to the HTML table)
var table_elt = document.getElementById("my-table-id");
// Extract Data (create a workbook object from the table)
var workbook = XLSX.utils.table_to_book(table_elt);
// Process Data (add a new row)
var worksheet = workbook.Sheets["Sheet1"];
XLSX.utils.sheet_add_aoa(worksheet, [["Created "+new Date().toISOString()]], {origin:-1});
// Package and Release Data (`writeFile` tries to write and save an XLSB file)
XLSX.writeFile(workbook, "Report.xlsb");
```
This library tries to simplify steps 2 and 4 with functions to extract useful
data from spreadsheet files (`read` / `readFile`) and generate new spreadsheet
files from data (`write` / `writeFile`).
This documentation and various demo projects cover a number of common scenarios
and approaches for steps 1 and 5.
Utility functions help with step 3.
#### The Zen of SheetJS
_File formats are implementation details_
The parser covers a wide gamut of common spreadsheet file formats to ensure that
"HTML-saved-as-XLS" files work as well as actual XLS or XLSX files.
The writer supports a number of common output formats for broad compatibility
with the data ecosystem.
_Data processing should fit in any workflow_
The library does not impose a separate lifecycle. It fits nicely in websites
and apps built using any framework. The plain JS data objects play nice with
Web Workers and future APIs.
["Parsing Workbooks"](#parsing-workbooks) describes solutions for common data
import scenarios involving actual spreadsheet files.
["Writing Workbooks"](#writing-workbooks) describes solutions for common data
export scenarios involving actual spreadsheet files.
["Utility Functions"](#utility-functions) details utility functions for
translating JSON Arrays and other common JS structures into worksheet objects.
_JavaScript is a powerful language for data processing_
The ["Common Spreadsheet Format"](#common-spreadsheet-format) is a simple object
representation of the core concepts of a workbook. The various functions in the
library provide low-level tools for working with the object.
For friendly JS processing, there are utility functions for converting parts of
a worksheet to/from an Array of Arrays. For example, summing columns from an
array of arrays can be implemented in a single Array reduce operation:
```js
var aoa = XLSX.utils.sheet_to_json(worksheet, {header: 1});
var sum_of_column_B = aoa.reduce((acc, row) => acc + (+row[1]||0), 0);
```
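The same reduce pattern extends to any column. The sketch below is self-contained: a literal array of arrays stands in for the `sheet_to_json` output, and `sum_column` is a hypothetical helper name, not part of the library.

```js
// Hypothetical helper: sum one column of an array of arrays.
// Non-numeric and missing cells count as 0, matching the `+row[1]||0` idiom.
function sum_column(aoa, C) {
  return aoa.reduce(function(acc, row) { return acc + (+row[C] || 0); }, 0);
}

// Stand-in for XLSX.utils.sheet_to_json(worksheet, {header: 1})
var aoa = [
  ["Name", "Qty"], // header row: "Qty" is not numeric, so it contributes 0
  ["A", 1],
  ["B", 2],
  ["C"]            // short row: column B is missing
];
var sum_of_column_B = sum_column(aoa, 1);
```

Header rows and ragged rows need no special handling because every cell is coerced with unary `+` and falls back to `0`.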

@ -1,39 +0,0 @@
## Philosophy
<details>
<summary><b>Philosophy</b> (click to show)</summary>
Prior to SheetJS, APIs for processing spreadsheet files were format-specific.
Third-party libraries either supported one format, or they involved a separate
set of classes for each supported file type. Even though XLSB was introduced in
Excel 2007, nothing outside of SheetJS or Excel supported the format.
To promote a format-agnostic view, SheetJS starts from a pure-JS representation
that we call the ["Common Spreadsheet Format"](#common-spreadsheet-format).
Emphasizing a uniform object representation enables new features like format
conversion (reading an XLSX template and saving as XLS) and circumvents the mess
of classes. By abstracting the complexities of the various formats, tools
need not worry about the specific file type!
A simple object representation combined with careful coding practices enables
use cases in older browsers and in alternative environments like ExtendScript
and Web Workers. It is always tempting to use the latest and greatest features,
but they tend to require the latest versions of browsers, limiting usability.
Utility functions capture common use cases like generating JS objects or HTML.
Most simple operations should only require a few lines of code. More complex
operations generally should be straightforward to implement.
Excel pushes the XLSX format as default starting in Excel 2007. However, there
are other formats with more appealing properties. For example, the XLSB format
is spiritually similar to XLSX but files often end up taking less than half the
space and open much faster! Even though an XLSX writer is available, other
format writers are available so users can take advantage of the unique
characteristics of each format.
The primary focus of the Community Edition is correct data interchange, focused
on extracting data from any compatible data representation and exporting data in
various formats suitable for any third party interface.
</details>

@ -33,6 +33,11 @@ Write options are described in the [Writing Options](#writing-options) section.
Utilities are available in the `XLSX.utils` object and are described in the
[Utility Functions](#utility-functions) section:
**Constructing:**
- `book_new` creates an empty workbook
- `book_append_sheet` adds a worksheet to a workbook
**Importing:**
- `aoa_to_sheet` converts an array of arrays of JS data to a worksheet.

@ -1,4 +1,20 @@
#### Column Properties
#### Row and Column Properties
<details>
<summary><b>Format Support</b> (click to show)</summary>
**Row Properties**: XLSX/M, XLSB, BIFF8 XLS, XLML, SYLK, DOM, ODS
**Column Properties**: XLSX/M, XLSB, BIFF8 XLS, XLML, SYLK, DOM
</details>
Row and Column properties are not extracted by default when reading from a file
and are not persisted by default when writing to a file. The option
`cellStyles: true` must be passed to the relevant read or write function.
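A minimal sketch of opting in on both sides of a round trip. `with_cell_styles` is a hypothetical helper introduced here for illustration; the `type` and `bookType` values are example choices, and the commented calls assume the library is loaded as `XLSX`.

```js
// Hypothetical helper: copy an options object and turn on `cellStyles`.
function with_cell_styles(opts) {
  var out = {};
  for (var k in opts) {
    if (Object.prototype.hasOwnProperty.call(opts, k)) out[k] = opts[k];
  }
  out.cellStyles = true;
  return out;
}

var read_opts = with_cell_styles({ type: "array" });
var write_opts = with_cell_styles({ bookType: "xlsx", type: "array" });

// Intended use (requires the library):
//   var workbook = XLSX.read(data, read_opts);    // `!rows` / `!cols` are extracted
//   var out = XLSX.write(workbook, write_opts);   // `!rows` / `!cols` are persisted
```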
_Column Properties_
The `!cols` array in each worksheet, if present, is a collection of `ColInfo`
objects which have the following properties:
@ -19,6 +35,30 @@ type ColInfo = {
};
```
_Row Properties_
The `!rows` array in each worksheet, if present, is a collection of `RowInfo`
objects which have the following properties:
```typescript
type RowInfo = {
/* visibility */
hidden?: boolean; // if true, the row is hidden
/* row height is specified in one of the following ways: */
hpx?: number; // height in screen pixels
hpt?: number; // height in points
level?: number; // 0-indexed outline / group level
};
```
_Outline / Group Levels Convention_
The Excel UI displays the base outline level as `1` and the max level as `8`.
Following JS conventions, SheetJS uses 0-indexed outline levels wherein the base
outline level is `0` and the max level is `7`.
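The off-by-one between the two conventions can be sketched as a pair of conversions. Both function names are hypothetical and are not part of the library; the `!rows` assignment at the end is shown as a comment because it needs a worksheet object.

```js
// Hypothetical helpers for converting between the level shown in the Excel UI
// (1 through 8) and the 0-indexed `level` stored in a RowInfo object (0 through 7).
function ui_level_to_rowinfo(ui_level) {
  if (ui_level < 1 || ui_level > 8) throw new Error("Excel UI level must be 1-8");
  return ui_level - 1;
}
function rowinfo_level_to_ui(level) {
  // a missing `level` is treated as the base level
  return (level || 0) + 1;
}

// Build a `!rows` array in which the second and third rows are grouped one level deep
var rows = [];
rows[1] = { level: ui_level_to_rowinfo(2) };
rows[2] = { level: ui_level_to_rowinfo(2) };
// worksheet["!rows"] = rows;
```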
<details>
<summary><b>Why are there three width types?</b> (click to show)</summary>
@ -49,6 +89,20 @@ when changing the pixel width, delete the `wch` and `width` properties.
<details>
<summary><b>Implementation details</b> (click to show)</summary>
_Row Heights_
Excel internally stores row heights in points. The default resolution is 72 DPI
or 96 PPI, so the pixel and point size should agree. For different resolutions
they may not agree, so the library separates the concepts.
Even though all of the information is made available, writers are expected to
follow the priority order:
1) use `hpx` pixel height if available
2) use `hpt` point height if available
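The priority order above can be sketched as a small lookup. `pick_row_height` is a hypothetical name, and the sketch only reports which field a writer would consult; it does not attempt any pixel/point conversion.

```js
// Hypothetical helper: choose the height field according to the priority order.
// Returns null when the RowInfo object carries no height information.
function pick_row_height(row) {
  if (!row) return null;
  if (row.hpx != null) return { unit: "px", value: row.hpx }; // 1) pixel height
  if (row.hpt != null) return { unit: "pt", value: row.hpt }; // 2) point height
  return null;
}

var both = pick_row_height({ hpx: 24, hpt: 18 }); // `hpx` takes priority
var points_only = pick_row_height({ hpt: 18 });
var hidden_only = pick_row_height({ hidden: true });
```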
_Column Widths_
Given the constraints, it is possible to determine the MDW without actually
inspecting the font! The parsers guess the pixel width by converting from width
to pixels and back, repeating for all possible MDW and selecting the MDW that
@ -61,40 +115,6 @@ follow the priority order:
1) use `width` field if available
2) use `wpx` pixel width if available
3) use `wch` character count if available
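A minimal sketch of the same idea for columns, plus the cleanup step recommended when changing the pixel width. Both function names are hypothetical and the sample values are arbitrary; no conversion between the three units is attempted.

```js
// Hypothetical helper: report which ColInfo field a writer would consult first.
function pick_col_width(col) {
  if (!col) return null;
  if (col.width != null) return { field: "width", value: col.width };
  if (col.wpx != null) return { field: "wpx", value: col.wpx };
  if (col.wch != null) return { field: "wch", value: col.wch };
  return null;
}

// Hypothetical helper: set a pixel width and delete the `wch` and `width`
// properties so that stale values cannot take priority over the new `wpx`.
function set_col_wpx(col, wpx) {
  delete col.wch;
  delete col.width;
  col.wpx = wpx;
  return col;
}

var col = { width: 9.14, wpx: 64, wch: 8 };
var before = pick_col_width(col); // the `width` field wins
set_col_wpx(col, 120);
var after = pick_col_width(col);  // only `wpx` remains
```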
</details>
#### Row Properties
The `!rows` array in each worksheet, if present, is a collection of `RowInfo`
objects which have the following properties:
```typescript
type RowInfo = {
/* visibility */
hidden?: boolean; // if true, the row is hidden
/* row height is specified in one of the following ways: */
hpx?: number; // height in screen pixels
hpt?: number; // height in points
level?: number; // 0-indexed outline / group level
};
```
Note: Excel UI displays the base outline level as `1` and the max level as `8`.
The `level` field stores the base outline as `0` and the max level as `7`.
<details>
<summary><b>Implementation details</b> (click to show)</summary>
Excel internally stores row heights in points. The default resolution is 72 DPI
or 96 PPI, so the pixel and point size should agree. For different resolutions
they may not agree, so the library separates the concepts.
Even though all of the information is made available, writers are expected to
follow the priority order:
1) use `hpx` pixel height if available
2) use `hpt` point height if available
</details>

@ -53,10 +53,12 @@ range limits will be silently truncated:
Excel 2003 SpreadsheetML range limits are governed by the version of Excel and
are not enforced by the writer.
### Excel 2007+ XML (XLSX/XLSM)
<details>
<summary>(click to show)</summary>
<summary><b>File Format Details</b> (click to show)</summary>
**Core Spreadsheet Formats**
- **Excel 2007+ XML (XLSX/XLSM)**
XLSX and XLSM files are ZIP containers containing a series of XML files in
accordance with the Open Packaging Conventions (OPC). The XLSM format, almost
@ -66,12 +68,7 @@ The format is standardized in ECMA-376 and later in ISO/IEC 29500. Excel does
not follow the specification, and there are additional documents discussing how
Excel deviates from the specification.
</details>
### Excel 2.0-95 (BIFF2/BIFF3/BIFF4/BIFF5)
<details>
<summary>(click to show)</summary>
- **Excel 2.0-95 (BIFF2/BIFF3/BIFF4/BIFF5)**
BIFF 2/3 XLS are single-sheet streams of binary records. Excel 4 introduced
the concept of a workbook (`XLW` files) but also had single-sheet `XLS` format.
@ -83,12 +80,7 @@ files in these formats, so record lengths and fields were determined by writing
in all of the supported formats and comparing files. Excel 2016 can generate
BIFF5 files, enabling a full suite of file tests starting from XLSX or BIFF2.
</details>
### Excel 97-2004 Binary (BIFF8)
<details>
<summary>(click to show)</summary>
- **Excel 97-2004 Binary (BIFF8)**
BIFF8 exclusively uses the Compound File Binary container format, splitting some
content into streams within the file. At its core, it still uses an extended
@ -97,24 +89,14 @@ version of the binary record format from older versions of BIFF.
The `MS-XLS` specification covers the basics of the file format, and other
specifications expand on serialization of features like properties.
</details>
### Excel 2003-2004 (SpreadsheetML)
<details>
<summary>(click to show)</summary>
- **Excel 2003-2004 (SpreadsheetML)**
Predating XLSX, SpreadsheetML files are simple XML files. There is no official
and comprehensive specification, although MS has released documentation on the
format. Since Excel 2016 can generate SpreadsheetML files, mapping features is
pretty straightforward.
</details>
### Excel 2007+ Binary (XLSB, BIFF12)
<details>
<summary>(click to show)</summary>
- **Excel 2007+ Binary (XLSB, BIFF12)**
Introduced in parallel with XLSX, the XLSB format combines the BIFF architecture
with the content separation and ZIP container of XLSX. For the most part nodes
@ -123,12 +105,7 @@ in an XLSX sub-file can be mapped to XLSB records in a corresponding sub-file.
The `MS-XLSB` specification covers the basics of the file format, and other
specifications expand on serialization of features like properties.
</details>
### Delimiter-Separated Values (CSV/TXT)
<details>
<summary>(click to show)</summary>
- **Delimiter-Separated Values (CSV/TXT)**
Excel CSV deviates from RFC4180 in a number of important ways. The generated
CSV files should generally work in Excel although they may not work in RFC4180
@ -137,32 +114,20 @@ writer proactively generates cells for formulae if values are unavailable.
Excel TXT uses tab as the delimiter and code page 1200.
Notes:
Like in Excel, files starting with `0x49 0x44 ("ID")` are treated as Symbolic
Link files. Unlike Excel, if the file does not have a valid SYLK header, it
will be proactively reinterpreted as CSV. There are some files with semicolon
delimiter that align with a valid SYLK file. For the broadest compatibility,
all cells with the value of `ID` are automatically wrapped in double-quotes.
- Like in Excel, files starting with `0x49 0x44 ("ID")` are treated as Symbolic
Link files. Unlike Excel, if the file does not have a valid SYLK header, it
will be proactively reinterpreted as CSV. There are some files with semicolon
delimiter that align with a valid SYLK file. For the broadest compatibility,
all cells with the value of `ID` are automatically wrapped in double-quotes.
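The two behaviors in this note can be sketched in isolation. Both function names are hypothetical, and the general quoting rule in `csv_field` (quote fields containing a double quote, comma or line break) is an assumption for illustration rather than a description of the actual writer.

```js
// True if the data starts with the bytes 0x49 0x44 ("ID")
function starts_with_ID(bytes) {
  return bytes.length >= 2 && bytes[0] === 0x49 && bytes[1] === 0x44;
}

// Hypothetical CSV field writer: the literal value `ID` is always wrapped in
// double quotes so the output cannot be mistaken for a SYLK header.
function csv_field(value) {
  var s = String(value);
  if (s === "ID") return '"ID"';
  if (/[",\n\r]/.test(s)) return '"' + s.replace(/"/g, '""') + '"';
  return s;
}

var line = ["ID", "Name", 'say "hi"'].map(csv_field).join(",");
var unquoted_looks_like_sylk = starts_with_ID([0x49, 0x44, 0x2c]); // bytes of `ID,`
var quoted_looks_like_sylk = starts_with_ID([0x22, 0x49, 0x44]);   // bytes of `"ID`
```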
**Miscellaneous Workbook Formats**
</details>
### Other Workbook Formats
<details>
<summary>(click to show)</summary>
Support for other formats is generally far XLS/XLSB/XLSX support, due in large
Support for other formats is generally far behind XLS/XLSB/XLSX support, due in
part to a lack of publicly available documentation. Test files were produced in
the respective apps and compared to their XLS exports to determine structure.
The main focus is data extraction.
</details>
#### Lotus 1-2-3 (WKS/WK1/WK2/WK3/WK4/123)
<details>
<summary>(click to show)</summary>
- **Lotus 1-2-3 (WKS/WK1/WK2/WK3/WK4/123)**
The Lotus formats consist of binary records similar to the BIFF structure. Lotus