docs.sheetjs.com/docz/docs/08-api/03-parse-options.md

237 lines
12 KiB
Markdown
Raw Normal View History

2022-05-16 03:26:04 +00:00
---
2024-05-14 02:43:07 +00:00
title: Reading Files
2023-05-18 09:21:08 +00:00
sidebar_position: 3
2022-05-16 03:26:04 +00:00
hide_table_of_contents: true
---
2024-05-14 02:43:07 +00:00
The main SheetJS method for reading files is `read`. It expects developers to
supply the actual data in a supported representation.
The `readFile` helper method accepts a filename and tries to read the specified
file using standard APIs. *It does not work in web browsers!*
**Parse file data and generate a SheetJS workbook object**
```js
var wb = XLSX.read(data, opts);
```
2022-05-16 03:26:04 +00:00
2023-08-21 23:07:34 +00:00
`read` attempts to parse `data` and return [a workbook object](/docs/csf/book)
2022-05-16 03:26:04 +00:00
2024-05-14 02:43:07 +00:00
The [`type`](#input-type) property of the `opts` object controls how `data` is
2023-08-21 23:07:34 +00:00
interpreted. For string data, the default interpretation is Base64.
2024-05-14 02:43:07 +00:00
**Read a specified file and generate a SheetJS workbook object**
```js
var wb = XLSX.readFile(filename, opts);
```
2023-08-21 23:07:34 +00:00
`readFile` attempts to read a local file with specified `filename`.
:::caution pass
2024-05-14 02:43:07 +00:00
`readFile` works in specific platforms. **It does not support web browsers!**
2023-08-21 23:07:34 +00:00
The [NodeJS installation note](/docs/getting-started/installation/nodejs#usage)
includes additional instructions for non-standard use cases.
:::
## Parsing Options
2022-05-16 03:26:04 +00:00
The read functions accept an options argument:
| Option Name | Default | Description |
2024-05-14 02:43:07 +00:00
|:------------|:--------|:-----------------------------------------------------|
|`type` | | [Input data representation](#input-type) |
|`raw` | `false` | If true, plain text parsing will not parse values ** |
|`dense` | `false` | If true, use a dense worksheet representation ** |
2022-05-16 03:26:04 +00:00
|`codepage` | | If specified, use code page when appropriate ** |
2024-05-14 02:43:07 +00:00
|`cellFormula`| `true` | Save [formulae to the `.f` field](#formulae) |
|`cellHTML` | `true` | Parse rich text and save HTML to the `.h` field |
|`cellNF` | `false` | Save number format string to the `.z` field |
|`cellStyles` | `false` | Save style/theme info to the `.s` field |
|`cellText` | `true` | Generated formatted text to the `.w` field |
|`cellDates` | `false` | Store dates as type `d` (default is `n`) |
2022-05-16 03:26:04 +00:00
|`dateNF` | | If specified, use the string for date code 14 ** |
2024-05-14 02:43:07 +00:00
|`sheetStubs` | `false` | Create cell objects of type `z` for stub cells |
|`sheetRows` | `0` | If >0, read the [specified number of rows](#range) |
|`bookDeps` | `false` | If true, parse calculation chains |
|`bookFiles` | `false` | If true, add raw files to book object ** |
|`bookProps` | `false` | If true, only parse enough to get book metadata ** |
|`bookSheets` | `false` | If true, only parse enough to get the sheet names |
|`bookVBA` | `false` | If true, generate [VBA blob](#vba) |
|`password` | `""` | If defined and file is encrypted, use password ** |
|`WTF` | `false` | If true, throw errors on unexpected file features ** |
2022-05-16 03:26:04 +00:00
|`sheets` | | If specified, only parse specified sheets ** |
2024-05-14 02:43:07 +00:00
|`nodim` | `false` | If true, calculate [worksheet ranges](#range) |
|`PRN` | `false` | If true, allow parsing of PRN files ** |
|`xlfn` | `false` | If true, [preserve prefixes](#formulae) in formulae |
2022-05-16 03:26:04 +00:00
|`FS` | | DSV Field Separator override |
2024-05-14 02:43:07 +00:00
|`UTC` | `true` | If explicitly false, parse text dates in local time |
2022-05-16 03:26:04 +00:00
- Even if `cellNF` is false, formatted text will be generated and saved to `.w`
- In some cases, sheets may be parsed even if `bookSheets` is false.
- Excel aggressively tries to interpret values from CSV and other plain text.
This leads to surprising behavior! The `raw` option suppresses value parsing.
- `bookSheets` and `bookProps` combine to give both sets of information
- `Deps` will be an empty object if `bookDeps` is false
- `bookFiles` behavior depends on file type:
* `keys` array (paths in the ZIP) for ZIP-based formats
* `files` hash (mapping paths to objects representing the files) for ZIP
* `cfb` object for formats using CFB containers
- By default all worksheets are parsed. `sheets` restricts based on input type:
* number: zero-based index of worksheet to parse (`0` is first worksheet)
* string: name of worksheet to parse (case insensitive)
* array of numbers and strings to select multiple worksheets.
- `codepage` is applied to BIFF2 - BIFF5 files without `CodePage` records and to
CSV files without BOM in `type:"binary"`. BIFF8 XLS always defaults to 1200.
- `PRN` affects parsing of text files without a common delimiter character.
- Currently only XOR encryption is supported. Unsupported error will be thrown
for files employing other encryption methods.
2022-08-23 03:20:02 +00:00
- `WTF` is mainly for development. By default, the parser will suppress read
2022-05-16 03:26:04 +00:00
errors on single worksheets, allowing you to read from the worksheets that do
parse properly. Setting `WTF:true` forces those errors to be thrown.
2022-10-24 01:28:03 +00:00
- By default, "sparse" mode worksheets are generated. Individual cells are
2022-10-31 00:58:49 +00:00
accessed by indexing the worksheet object with an A1-Style address. "dense"
2022-10-24 01:28:03 +00:00
worksheets store cells in an array of arrays at `sheet["!data"]`.
2023-06-23 20:24:44 +00:00
- `UTC` applies to CSV, Text and HTML formats. When explicitly set to `false`,
the parsers will assume the files are specified in local time. By default, as
is the case for other file formats, dates and times are interpreted in UTC.
2022-05-16 03:26:04 +00:00
2024-05-14 02:43:07 +00:00
#### Range
Some file formats, including XLSX and XLS, can self-report worksheet ranges. The
self-reported ranges are used by default.
If the `sheetRows` option is set, up to `sheetRows` rows will be parsed from the
worksheets. `sheetRows-1` rows will be generated when looking at the JSON object
output (since the header row is counted as a row when parsing the data).
The `nodim` option instructs the parser to ignore self-reported ranges and use
the actual cells in the worksheet to determine the range. This addresses known
issues with non-compliant third-party exporters.
#### Formulae
For some file formats, the `cellFormula` option must be explicitly enabled to
ensure that formulae are extracted.
Newer Excel functions are serialized with the `_xlfn.` prefix, hidden from the
user. SheetJS will strip `_xlfn.` normally. The `xlfn` option preserves them.
[The "Formulae" docs](/docs/csf/features/formulae#prefixed-future-functions)
covers this in more detail.
["Formulae"](/docs/csf/features/formulae) covers the features in more detail.
#### VBA
When a macro-enabled file is parsed, if the `bookVBA` option is `true`, the raw
VBA blob will be stored in the `vbaraw` property of the workbook.
["VBA and Macros"](/docs/csf/features/vba) covers the features in more detail.
<details>
<summary><b>Implementation Details</b> (click to show)</summary>
The `bookVBA` merely exposes the raw VBA CFB object. It does not parse the data.
XLSM and XLSB store the VBA CFB object in `xl/vbaProject.bin`. BIFF8 XLS mixes
the VBA entries alongside the core Workbook entry, so the library generates a
new blob from the XLS CFB container that works in XLSM and XLSB files.
</details>
2022-05-16 03:26:04 +00:00
### Input Type
2024-05-14 02:43:07 +00:00
The `type` parameter for `read` controls how data is interpreted:
2022-05-16 03:26:04 +00:00
| `type` | expected input |
2023-06-25 09:36:58 +00:00
|:-----------|:----------------------------------------------------------------|
2022-05-16 03:26:04 +00:00
| `"base64"` | string: Base64 encoding of the file |
| `"binary"` | string: binary string (byte `n` is `data.charCodeAt(n)`) |
2022-08-25 08:22:28 +00:00
| `"string"` | string: JS string (only appropriate for UTF-8 text formats) |
2022-05-16 03:26:04 +00:00
| `"buffer"` | nodejs Buffer |
2023-06-25 09:36:58 +00:00
| `"array"` | array: array of 8-bit unsigned integers (byte `n` is `data[n]`) |
2022-05-16 03:26:04 +00:00
| `"file"` | string: path of file that will be read (nodejs only) |
2022-11-13 20:45:13 +00:00
Some common types are automatically deduced from the data input type, including
NodeJS `Buffer` objects, `Uint8Array` and `ArrayBuffer` objects, and arrays of
numbers.
When a JS `string` is passed with no `type`, the library assumes the data is a
Base64 string. `FileReader#readAsBinaryString` or ASCII data requires `"binary"`
type. DOM strings including `FileReader#readAsText` should use type `"string"`.
2022-05-16 03:26:04 +00:00
### Guessing File Type
<details>
<summary><b>Implementation Details</b> (click to show)</summary>
Excel and other spreadsheet tools read the first few bytes and apply other
heuristics to determine a file type. This enables file type punning: renaming
files with the `.xls` extension will tell your computer to use Excel to open the
file but Excel will know how to handle it. This library applies similar logic:
| Byte 0 | Raw File Type | Spreadsheet Types |
|:-------|:--------------|:----------------------------------------------------|
| `0xD0` | CFB Container | BIFF 5/8 or protected XLSX/XLSB or WQ3/QPW or XLR |
| `0x09` | BIFF Stream | BIFF 2/3/4/5 |
| `0x3C` | XML/HTML | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x50` | ZIP Archive | XLSB or XLSX/M or ODS or UOS2 or NUMBERS or text |
| `0x49` | Plain Text | SYLK or plain text |
| `0x54` | Plain Text | DIF or plain text |
2022-08-25 08:22:28 +00:00
| `0xEF` | UTF-8 Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0xFF` | UTF-16 Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
2022-05-16 03:26:04 +00:00
| `0x00` | Record Stream | Lotus WK\* or Quattro Pro or plain text |
| `0x7B` | Plain text | RTF or plain text |
| `0x0A` | Plain text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x0D` | Plain text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x20` | Plain text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
DBF files are detected based on the first byte as well as the third and fourth
bytes (corresponding to month and day of the file date)
2022-08-25 08:22:28 +00:00
Works for Windows files are detected based on the `BOF` record with type `0xFF`
2022-05-16 03:26:04 +00:00
Plain text format guessing follows the priority order:
| Format | Test |
|:-------|:--------------------------------------------------------------------|
| XML | `<?xml` appears in the first 1024 characters |
2024-05-14 02:43:07 +00:00
| HTML | starts with `<` and HTML tags appear in the first 1024 characters |
2022-05-16 03:26:04 +00:00
| XML | starts with `<` and the first tag is valid |
| RTF | starts with `{\rt` |
2024-03-12 06:47:52 +00:00
| DSV | starts with `sep=` followed by field delimiter and line separator |
| DSV | more unquoted `\|` chars than `;` `\t` or `,` in the first 1024 |
2022-05-16 03:26:04 +00:00
| DSV | more unquoted `;` chars than `\t` or `,` in the first 1024 |
| TSV | more unquoted `\t` chars than `,` chars in the first 1024 |
| CSV | one of the first 1024 characters is a comma `","` |
| ETH | starts with `socialcalc:version:` |
| PRN | `PRN` option is set to true |
| CSV | (fallback) |
2024-05-14 02:43:07 +00:00
HTML tags include `html`, `table`, `head`, `meta`, `script`, `style`, `div`
2022-05-16 03:26:04 +00:00
</details>
2023-08-21 23:07:34 +00:00
<details open>
<summary><b>Why are random text files valid?</b> (click to hide)</summary>
2022-05-16 03:26:04 +00:00
2024-05-14 02:43:07 +00:00
Excel is extremely aggressive in reading files. Adding the XLS extension to any
text file (where the only characters are ANSI display chars) tricks Excel into
processing the file as if it were a CSV or TSV file, even if the result is not
useful! This library attempts to replicate that behavior.
2022-05-16 03:26:04 +00:00
The best approach is to validate the desired worksheet and ensure it has the
expected number of rows or columns. Extracting the range is extremely simple:
```js
var range = XLSX.utils.decode_range(worksheet['!ref']);
var ncols = range.e.c - range.s.c + 1, nrows = range.e.r - range.s.r + 1;
```
</details>