
## Parsing Options

The exported `read` and `readFile` functions accept an options argument:

| Option Name | Default | Description                                          |
| :---------- | ------: | :--------------------------------------------------- |
|`type`       |         | Input data encoding (see Input Type below)           |
|`raw`        | false   | If true, plain text parsing will not parse values ** |
|`codepage`   |         | If specified, use code page when appropriate **      |
|`cellFormula`| true    | Save formulae to the `.f` field                      |
|`cellHTML`   | true    | Parse rich text and save HTML to the `.h` field      |
|`cellNF`     | false   | Save number format string to the `.z` field          |
|`cellStyles` | false   | Save style/theme info to the `.s` field              |
|`cellText`   | true    | Generate formatted text to the `.w` field            |
|`cellDates`  | false   | Store dates as type `d` (default is `n`)             |
|`dateNF`     |         | If specified, use the string for date code 14 **     |
|`sheetStubs` | false   | Create cell objects of type `z` for stub cells       |
|`sheetRows`  | 0       | If >0, read the first `sheetRows` rows **            |
|`bookDeps`   | false   | If true, parse calculation chains                    |
|`bookFiles`  | false   | If true, add raw files to book object **             |
|`bookProps`  | false   | If true, only parse enough to get book metadata **   |
|`bookSheets` | false   | If true, only parse enough to get the sheet names    |
|`bookVBA`    | false   | If true, copy VBA blob to `vbaraw` field **          |
|`password`   | ""      | If defined and file is encrypted, use password **    |
|`WTF`        | false   | If true, throw errors on unexpected file features ** |
|`sheets`     |         | If specified, only parse specified sheets **         |

- Even if `cellNF` is false, formatted text will be generated and saved to `.w`
- In some cases, sheets may be parsed even if `bookSheets` is false.
- Excel aggressively tries to interpret values from CSV and other plain text.
  This leads to surprising behavior! The `raw` option suppresses value parsing.
- `bookSheets` and `bookProps` combine to give both sets of information
- `Deps` will be an empty object if `bookDeps` is false
- `bookFiles` behavior depends on file type:
    * `keys` array (paths in the ZIP) for ZIP-based formats
    * `files` hash (mapping paths to objects representing the files) for ZIP
    * `cfb` object for formats using CFB containers
- `sheetRows-1` rows will be generated when looking at the JSON object output
  (since the header row is counted as a row when parsing the data)
- By default all worksheets are parsed.  `sheets` restricts based on input type:
    * number: zero-based index of worksheet to parse (`0` is first worksheet)
    * string: name of worksheet to parse (case insensitive)
    * array of numbers and strings to select multiple worksheets.
- `bookVBA` merely exposes the raw VBA CFB object.  It does not parse the data.
  XLSM and XLSB store the VBA CFB object in `xl/vbaProject.bin`. BIFF8 XLS mixes
  the VBA entries alongside the core Workbook entry, so the library generates a
  new XLSB-compatible blob from the XLS CFB container.
- `codepage` is applied to BIFF2 - BIFF5 files without `CodePage` records and to
  CSV files without BOM in `type:"binary"`.  BIFF8 XLS always defaults to 1200.
- Currently only XOR encryption is supported.  An error will be thrown for
  files employing other encryption methods.
- `WTF` is mainly for development.  By default, the parser will suppress read
  errors on single worksheets, allowing you to read from the worksheets that do
  parse properly. Setting `WTF:1` forces those errors to be thrown.
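
The `sheets` selection rules listed above can be sketched in isolation.  The
helper below is hypothetical (`resolveSheets` is not a library function) and
only models which worksheets a given `sheets` value would select.  Ignoring
unmatched entries and preserving request order are assumptions of this sketch:

```js
// Hypothetical helper: resolve a `sheets` value against the workbook sheet names.
// Not part of the library; it only mirrors the rules described in the notes.
function resolveSheets(sheetNames, sheets) {
  // No restriction: every worksheet is selected
  if (sheets == null) return sheetNames.slice();
  var wanted = Array.isArray(sheets) ? sheets : [sheets];
  var lower = sheetNames.map(function (n) { return n.toLowerCase(); });
  var picked = [];
  wanted.forEach(function (s) {
    var idx = -1;
    if (typeof s === "number" && Number.isInteger(s)) idx = s;  // zero-based index
    else if (typeof s === "string") idx = lower.indexOf(s.toLowerCase()); // case insensitive
    // Assumption: out-of-range or unknown entries are skipped, duplicates collapse
    if (idx >= 0 && idx < sheetNames.length && picked.indexOf(sheetNames[idx]) === -1) {
      picked.push(sheetNames[idx]);
    }
  });
  return picked;
}

var names = ["Summary", "Data", "Notes"];
console.log(resolveSheets(names, 0));              // [ 'Summary' ]
console.log(resolveSheets(names, "data"));         // [ 'Data' ]
console.log(resolveSheets(names, [2, "SUMMARY"])); // [ 'Notes', 'Summary' ]
```

With the library itself, the same values would be passed directly as the
`sheets` property of the options argument to `read` or `readFile`.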

### Input Type

Strings can be interpreted in multiple ways.  The `type` parameter for `read`
tells the library how to parse the data argument:

| `type`     | expected input                                                  |
|------------|-----------------------------------------------------------------|
| `"base64"` | string: Base64 encoding of the file                             |
| `"binary"` | string: binary string (byte `n` is `data.charCodeAt(n)`)        |
| `"string"` | string: JS string (characters interpreted as UTF8)              |
| `"buffer"` | nodejs Buffer                                                   |
| `"array"`  | array: array of 8-bit unsigned int (byte `n` is `data[n]`)      |
| `"file"`   | string: path of file that will be read (nodejs only)            |
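
As a rough illustration of the table, the snippet below builds the same few
bytes in the shape each `type` value expects.  It assumes a nodejs environment
(for `Buffer`); the sample CSV content and variable names are made up for the
example, and the `XLSX.read` calls are left as comments since they assume the
library is loaded as `XLSX`:

```js
// The same file bytes expressed in the shapes that each `type` value expects
var csv = "a,b\n1,2\n";
var buf = Buffer.from(csv, "utf8");   // type: "buffer" (nodejs Buffer)
var arr = Array.from(buf);            // type: "array"  (byte n is arr[n])
var bin = buf.toString("binary");     // type: "binary" (byte n is bin.charCodeAt(n))
var b64 = buf.toString("base64");     // type: "base64"
var str = csv;                        // type: "string" (JS string)

// Every representation should describe the same byte sequence
var sameBytes = arr.every(function (byte, n) { return bin.charCodeAt(n) === byte; });
var roundTrip = Buffer.from(b64, "base64").equals(buf);
console.log(sameBytes, roundTrip);    // true true

// Assuming the library is loaded as `XLSX`, each value pairs with its type:
//   XLSX.read(buf, { type: "buffer" });
//   XLSX.read(arr, { type: "array" });
//   XLSX.read(bin, { type: "binary" });
//   XLSX.read(b64, { type: "base64" });
//   XLSX.read(str, { type: "string" });
```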

### Guessing File Type

<details>
  <summary><b>Implementation Details</b> (click to show)</summary>

Excel and other spreadsheet tools read the first few bytes and apply other
heuristics to determine a file type.  This enables file type punning: a file
renamed with the `.xls` extension will be opened in Excel by your computer, and
Excel will still detect the real format and handle it.  This library applies
similar logic:

| Byte 0 | Raw File Type | Spreadsheet Types                                   |
|:-------|:--------------|:----------------------------------------------------|
| `0xD0` | CFB Container | BIFF 5/8 or password-protected XLSX/XLSB or WQ3/QPW |
| `0x09` | BIFF Stream   | BIFF 2/3/4/5                                        |
| `0x3C` | XML/HTML      | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x50` | ZIP Archive   | XLSB or XLSX/M or ODS or UOS2 or plain text         |
| `0x49` | Plain Text    | SYLK or plain text                                  |
| `0x54` | Plain Text    | DIF or plain text                                   |
| `0xEF` | UTF8 Encoded  | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0xFF` | UTF16 Encoded | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x00` | Record Stream | Lotus WK\* or Quattro Pro or plain text             |
| `0x7B` | Plain text    | RTF or plain text                                   |
| `0x0A` | Plain text    | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x0D` | Plain text    | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |
| `0x20` | Plain text    | SpreadsheetML / Flat ODS / UOS1 / HTML / plain text |

DBF files are detected based on the first byte as well as the third and fourth
bytes (corresponding to month and day of the file date).
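
The first-byte table can be read as a simple lookup.  The sketch below is not
the library's code: `FIRST_BYTE` and `sniffFirstByte` are hypothetical names,
the DBF check and all secondary heuristics are omitted, and the fallback for
unlisted bytes is an assumption:

```js
// Lookup keyed by the first byte; values paraphrase the table above
var FIRST_BYTE = {
  0xD0: "CFB Container",
  0x09: "BIFF Stream",
  0x3C: "XML/HTML",
  0x50: "ZIP Archive",
  0x49: "Plain Text (SYLK candidate)",
  0x54: "Plain Text (DIF candidate)",
  0xEF: "UTF8 Encoded",
  0xFF: "UTF16 Encoded",
  0x00: "Record Stream",
  0x7B: "Plain text (RTF candidate)",
  0x0A: "Plain text",
  0x0D: "Plain text",
  0x20: "Plain text"
};

// Hypothetical helper: classify a byte array (or Buffer) by its first byte only
function sniffFirstByte(bytes) {
  if (!bytes || bytes.length === 0) return "empty";
  var kind = FIRST_BYTE[bytes[0]];
  // Assumption for this sketch: unlisted bytes fall through to plain text guessing
  return kind || "Plain text (fallback)";
}

console.log(sniffFirstByte([0x50, 0x4B, 0x03, 0x04])); // ZIP Archive
console.log(sniffFirstByte(Buffer.from("<html>")));    // XML/HTML
```

Since most rows end in "or plain text", a match on the first byte only narrows
the candidates; the actual format still has to be confirmed by further parsing.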

Plain text format guessing follows the priority order:

| Format | Test                                                                |
|:-------|:--------------------------------------------------------------------|
| XML    | `<?xml` appears in the first 1024 characters                        |
| HTML   | starts with `<` and HTML tags appear in the first 1024 characters * |
| XML    | starts with `<`                                                     |
| RTF    | starts with `{\rt`                                                  |
| DSV    | starts with `/sep=.$/`, separator is the specified character        |
| DSV    | more unquoted `";"` chars than `"\t"` or `","` in the first 1024    |
| TSV    | more unquoted `"\t"` chars than `","` chars in the first 1024       |
| CSV    | one of the first 1024 characters is a comma `","`                   |
| ETH    | starts with `socialcalc:version:`                                   |
| PRN    | (default)                                                           |

- HTML tags include: `html`, `table`, `head`, `meta`, `script`, `style`, `div`
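
The priority order above can be approximated with a short function.  This is a
sketch, not the library's implementation: `guessTextFormat` and `countUnquoted`
are hypothetical names, quoting is handled only for double quotes, and the DSV
row is interpreted as "more semicolons than both tabs and commas":

```js
// Tags from the note above, matched as opening tags
var HTML_TAGS = /<(html|table|head|meta|script|style|div)[\s>]/i;

// Count separator characters that sit outside double-quoted fields
function countUnquoted(text, ch) {
  var count = 0, inQuote = false;
  for (var i = 0; i < text.length; ++i) {
    if (text[i] === '"') inQuote = !inQuote;
    else if (!inQuote && text[i] === ch) ++count;
  }
  return count;
}

// Hypothetical helper: apply the tests in the priority order of the table
function guessTextFormat(text) {
  var head = text.slice(0, 1024);
  if (head.indexOf("<?xml") !== -1) return "XML";
  if (head[0] === "<" && HTML_TAGS.test(head)) return "HTML";
  if (head[0] === "<") return "XML";
  if (head.slice(0, 4) === "{\\rt") return "RTF";
  if (/^sep=.(\r?\n|$)/.test(head)) return "DSV";   // explicit separator line
  var semis = countUnquoted(head, ";");
  var tabs = countUnquoted(head, "\t");
  var commas = countUnquoted(head, ",");
  if (semis > tabs && semis > commas) return "DSV";
  if (tabs > commas) return "TSV";
  if (head.indexOf(",") !== -1) return "CSV";
  if (head.indexOf("socialcalc:version:") === 0) return "ETH";
  return "PRN";
}

console.log(guessTextFormat("a;b;c\n1;2;3"));      // DSV
console.log(guessTextFormat('a,b\n"x;y;z",2'));    // CSV
console.log(guessTextFormat("hello world"));       // PRN
```

The second call shows why only unquoted separators are counted: the semicolons
inside the quoted field do not outvote the commas.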

</details>

<details>
  <summary><b>Why are random text files valid?</b> (click to show)</summary>

Excel is extremely aggressive in reading files.  Adding an XLS extension to any
display text file (where the only characters are ANSI display chars) tricks
Excel into thinking that the file is potentially a CSV or TSV file, even if it
is only one column!  This library attempts to replicate that behavior.

The best approach is to validate the desired worksheet and ensure it has the
expected number of rows or columns.  Extracting the range is extremely simple:

```js
var range = XLSX.utils.decode_range(worksheet['!ref']);
var ncols = range.e.c - range.s.c + 1, nrows = range.e.r - range.s.r + 1;
```

</details>