docs.sheetjs.com/docz/docs/08-api/03-parse-options.md

243 lines
12 KiB
Markdown
Raw Permalink Normal View History

2022-05-16 03:26:04 +00:00
---
2024-05-14 02:43:07 +00:00
title: Reading Files
2023-05-18 09:21:08 +00:00
sidebar_position: 3
2022-05-16 03:26:04 +00:00
hide_table_of_contents: true
---
2024-05-14 02:43:07 +00:00
The main SheetJS method for reading files is `read`. It expects developers to
supply the actual data in a supported representation.
The `readFile` helper method accepts a filename and tries to read the specified
file using standard APIs. *It does not work in web browsers!*
**Parse file data and generate a SheetJS workbook object**
```js
var wb = XLSX.read(data, opts);
```
2022-05-16 03:26:04 +00:00
2023-08-21 23:07:34 +00:00
`read` attempts to parse `data` and return [a workbook object](/docs/csf/book)
2022-05-16 03:26:04 +00:00
2024-05-14 02:43:07 +00:00
The [`type`](#input-type) property of the `opts` object controls how `data` is
2023-08-21 23:07:34 +00:00
interpreted. For string data, the default interpretation is Base64.
2024-05-14 02:43:07 +00:00
**Read a specified file and generate a SheetJS workbook object**
```js
var wb = XLSX.readFile(filename, opts);
```
2023-08-21 23:07:34 +00:00
`readFile` attempts to read a local file with specified `filename`.
:::caution pass
2024-05-14 02:43:07 +00:00
`readFile` works in specific platforms. **It does not support web browsers!**
2023-08-21 23:07:34 +00:00
The [NodeJS installation note](/docs/getting-started/installation/nodejs#usage)
includes additional instructions for non-standard use cases.
:::
## Parsing Options
2022-05-16 03:26:04 +00:00
The read functions accept an options argument:
| Option Name | Default | Description |
2024-05-14 02:43:07 +00:00
|:------------|:--------|:-----------------------------------------------------|
|`type` | | [Input data representation](#input-type) |
|`raw` | `false` | If true, plain text parsing will not parse values ** |
2024-07-08 08:18:18 +00:00
|`dense` | `false` | If true, use a [dense sheet representation](#dense) |
2022-05-16 03:26:04 +00:00
|`codepage` | | If specified, use code page when appropriate ** |
2024-05-14 02:43:07 +00:00
|`cellFormula`| `true` | Save [formulae to the `.f` field](#formulae) |
|`cellHTML` | `true` | Parse rich text and save HTML to the `.h` field |
|`cellNF` | `false` | Save number format string to the `.z` field |
|`cellStyles` | `false` | Save style/theme info to the `.s` field |
|`cellText` | `true` | Generated formatted text to the `.w` field |
|`cellDates` | `false` | Store dates as type `d` (default is `n`) |
2022-05-16 03:26:04 +00:00
|`dateNF` | | If specified, use the string for date code 14 ** |
2024-05-14 02:43:07 +00:00
|`sheetStubs` | `false` | Create cell objects of type `z` for stub cells |
|`sheetRows` | `0` | If >0, read the [specified number of rows](#range) |
|`bookDeps` | `false` | If true, parse calculation chains |
|`bookFiles` | `false` | If true, add raw files to book object ** |
|`bookProps` | `false` | If true, only parse enough to get book metadata ** |
|`bookSheets` | `false` | If true, only parse enough to get the sheet names |
|`bookVBA` | `false` | If true, generate [VBA blob](#vba) |
|`password` | `""` | If defined and file is encrypted, use password ** |
|`WTF` | `false` | If true, throw errors on unexpected file features ** |
2022-05-16 03:26:04 +00:00
|`sheets` | | If specified, only parse specified sheets ** |
2024-05-14 02:43:07 +00:00
|`nodim` | `false` | If true, calculate [worksheet ranges](#range) |
|`PRN` | `false` | If true, allow parsing of PRN files ** |
|`xlfn` | `false` | If true, [preserve prefixes](#formulae) in formulae |
2022-05-16 03:26:04 +00:00
|`FS` | | DSV Field Separator override |
2024-05-14 02:43:07 +00:00
|`UTC` | `true` | If explicitly false, parse text dates in local time |
2022-05-16 03:26:04 +00:00
- Even if `cellNF` is false, formatted text will be generated and saved to `.w`
- In some cases, sheets may be parsed even if `bookSheets` is false.
- Excel aggressively tries to interpret values from CSV and other plain text.
This leads to surprising behavior! The `raw` option suppresses value parsing.
- `bookSheets` and `bookProps` combine to give both sets of information
- `Deps` will be an empty object if `bookDeps` is false
- `bookFiles` behavior depends on file type:
* `keys` array (paths in the ZIP) for ZIP-based formats
* `files` hash (mapping paths to objects representing the files) for ZIP
* `cfb` object for formats using CFB containers
- By default all worksheets are parsed. `sheets` restricts based on input type:
* number: zero-based index of worksheet to parse (`0` is first worksheet)
* string: name of worksheet to parse (case insensitive)
* array of numbers and strings to select multiple worksheets.
- `codepage` is applied to BIFF2 - BIFF5 files without `CodePage` records and to
CSV files without BOM in `type:"binary"`. BIFF8 XLS always defaults to 1200.
- `PRN` affects parsing of text files without a common delimiter character.
- Currently only XOR encryption is supported. Unsupported error will be thrown
for files employing other encryption methods.
2022-08-23 03:20:02 +00:00
- `WTF` is mainly for development. By default, the parser will suppress read
2022-05-16 03:26:04 +00:00
errors on single worksheets, allowing you to read from the worksheets that do
parse properly. Setting `WTF:true` forces those errors to be thrown.
2023-06-23 20:24:44 +00:00
- `UTC` applies to CSV, Text and HTML formats. When explicitly set to `false`,
the parsers will assume the files are specified in local time. By default, as
is the case for other file formats, dates and times are interpreted in UTC.
2022-05-16 03:26:04 +00:00
2024-07-08 08:18:18 +00:00
#### Dense
The ["Cell Storage"](/docs/csf/sheet#cell-storage) section of the SheetJS Data
Model documentation explains the worksheet representation in more detail.
:::note pass
[Utility functions that process SheetJS workbook objects](/docs/api/utilities/)
typically process both sparse and dense worksheets.
:::
2024-05-14 02:43:07 +00:00
#### Range
Some file formats, including XLSX and XLS, can self-report worksheet ranges. The
self-reported ranges are used by default.
If the `sheetRows` option is set, up to `sheetRows` rows will be parsed from the
worksheets. `sheetRows-1` rows will be generated when looking at the JSON object
2024-07-08 08:18:18 +00:00
output (since the header row is counted as a row when parsing the data). The
`!ref` property of the worksheet will hold the adjusted range. For formats that
self-report sheet ranges, the `!fullref` property will hold the original range.
2024-05-14 02:43:07 +00:00
The `nodim` option instructs the parser to ignore self-reported ranges and use
the actual cells in the worksheet to determine the range. This addresses known
issues with non-compliant third-party exporters.
#### Formulae
For some file formats, the `cellFormula` option must be explicitly enabled to
ensure that formulae are extracted.
Newer Excel functions are serialized with the `_xlfn.` prefix, hidden from the
user. SheetJS will strip `_xlfn.` normally. The `xlfn` option preserves them.
[The "Formulae" docs](/docs/csf/features/formulae#prefixed-future-functions)
covers this in more detail.
["Formulae"](/docs/csf/features/formulae) covers the features in more detail.
#### VBA
When a macro-enabled file is parsed, if the `bookVBA` option is `true`, the raw
VBA blob will be stored in the `vbaraw` property of the workbook.
["VBA and Macros"](/docs/csf/features/vba) covers the features in more detail.
<details>
<summary><b>Implementation Details</b> (click to show)</summary>
The `bookVBA` merely exposes the raw VBA CFB object. It does not parse the data.
XLSM and XLSB store the VBA CFB object in `xl/vbaProject.bin`. BIFF8 XLS mixes
the VBA entries alongside the core Workbook entry, so the library generates a
new blob from the XLS CFB container that works in XLSM and XLSB files.
</details>
2022-05-16 03:26:04 +00:00
### Input Type
2024-05-14 02:43:07 +00:00
The `type` parameter for `read` controls how data is interpreted:
2022-05-16 03:26:04 +00:00
| `type` | expected input |
2023-06-25 09:36:58 +00:00
|:-----------|:----------------------------------------------------------------|
2022-05-16 03:26:04 +00:00
| `"base64"` | string: Base64 encoding of the file |
| `"binary"` | string: binary string (byte `n` is `data.charCodeAt(n)`) |
2022-08-25 08:22:28 +00:00
| `"string"` | string: JS string (only appropriate for UTF-8 text formats) |
2022-05-16 03:26:04 +00:00
| `"buffer"` | nodejs Buffer |
2023-06-25 09:36:58 +00:00
| `"array"` | array: array of 8-bit unsigned integers (byte `n` is `data[n]`) |
2022-05-16 03:26:04 +00:00
| `"file"` | string: path of file that will be read (nodejs only) |
2022-11-13 20:45:13 +00:00
Some common types are automatically deduced from the data input type, including
NodeJS `Buffer` objects, `Uint8Array` and `ArrayBuffer` objects, and arrays of
numbers.
When a JS `string` is passed with no `type`, the library assumes the data is a
Base64 string. `FileReader#readAsBinaryString` or ASCII data requires `"binary"`
type. DOM strings including `FileReader#readAsText` should use type `"string"`.
2022-05-16 03:26:04 +00:00
### Guessing File Type
<details>
<summary><b>Implementation Details</b> (click to show)</summary>
Excel and other spreadsheet tools read the first few bytes and apply other
heuristics to determine a file type. This enables file type punning: renaming
files with the `.xls` extension will tell your computer to use Excel to open the
file but Excel will know how to handle it. This library applies similar logic:
| Byte 0 | Raw File Type | Spreadsheet Types |
|:-------|:--------------|:----------------------------------------------------|
| `0xD0` | CFB Container | BIFF 5/8 or protected XLSX/XLSB or WQ3/QPW or XLR |
| `0x09` | BIFF Stream | BIFF 2/3/4/5 |
| `0x3C` | XML/HTML | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x50` | ZIP Archive | XLSB or XLSX/M or ODS or UOS2 or NUMBERS or text |
| `0x49` | Plain Text | SYLK or plain text |
| `0x54` | Plain Text | DIF or plain text |
2022-08-25 08:22:28 +00:00
| `0xEF` | UTF-8 Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0xFF` | UTF-16 Text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
2022-05-16 03:26:04 +00:00
| `0x00` | Record Stream | Lotus WK\* or Quattro Pro or plain text |
| `0x7B` | Plain text | RTF or plain text |
| `0x0A` | Plain text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x0D` | Plain text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x20` | Plain text | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
DBF files are detected based on the first byte as well as the third and fourth
bytes (corresponding to month and day of the file date)
2022-08-25 08:22:28 +00:00
Works for Windows files are detected based on the `BOF` record with type `0xFF`
2022-05-16 03:26:04 +00:00
Plain text format guessing follows the priority order:
| Format | Test |
|:-------|:--------------------------------------------------------------------|
| XML | `<?xml` appears in the first 1024 characters |
2024-05-14 02:43:07 +00:00
| HTML | starts with `<` and HTML tags appear in the first 1024 characters |
2022-05-16 03:26:04 +00:00
| XML | starts with `<` and the first tag is valid |
| RTF | starts with `{\rt` |
2024-03-12 06:47:52 +00:00
| DSV | starts with `sep=` followed by field delimiter and line separator |
| DSV | more unquoted `\|` chars than `;` `\t` or `,` in the first 1024 |
2022-05-16 03:26:04 +00:00
| DSV | more unquoted `;` chars than `\t` or `,` in the first 1024 |
| TSV | more unquoted `\t` chars than `,` chars in the first 1024 |
| CSV | one of the first 1024 characters is a comma `","` |
| ETH | starts with `socialcalc:version:` |
| PRN | `PRN` option is set to true |
| CSV | (fallback) |
2024-05-14 02:43:07 +00:00
HTML tags include `html`, `table`, `head`, `meta`, `script`, `style`, `div`
2022-05-16 03:26:04 +00:00
</details>
2023-08-21 23:07:34 +00:00
<details open>
2024-10-11 19:36:18 +00:00
<summary><b>Why are random files valid?</b> (click to hide)</summary>
2022-05-16 03:26:04 +00:00
2024-05-14 02:43:07 +00:00
Excel is extremely aggressive in reading files. Adding the XLS extension to any
2024-10-11 19:36:18 +00:00
file tricks Excel into processing the file.
2022-05-16 03:26:04 +00:00
2024-10-11 19:36:18 +00:00
If the file matches certain heuristics, Excel will use a format-specific parser.
2022-05-16 03:26:04 +00:00
2024-10-11 19:36:18 +00:00
If it cannot deduce the file type, Excel will parse the unknown file as if it
were CSV or TSV. SheetJS attempts to replicate that behavior.
2022-05-16 03:26:04 +00:00
</details>