# iWork 2013+
There are three different styles of iWork files:
1) The macOS applications generate ZIP files which contain the metadata and
special `.iwa` files which hold the file data.
2) iCloud persistence on macOS is a folder based structure containing an
`Index.zip` file which is "similar" to the macOS standalone file structure.
3) [The web iCloud editors](https://icloud.com) export ZIP files which contain
an `Index.zip` file similar to iCloud persistence. Note that this is literally
a ZIP file within a ZIP file.

The `Index.zip` file has an identical structure to an actual file generated by
the macOS applications, so the discussion below applies to all three file styles.
The ZIP container holds a number of Mac binary "property list" files (`.plist`)
which can be safely ignored or blanked. It can also hold preview images, which
can likewise be safely ignored.
## File Structure
The iWork file (`.KEY`, `.NUMBERS`, `.PAGES`) is a ZIP file containing a number
of `.iwa` entries. The primary entrypoint is `/Index/Document.iwa`.
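
As a rough sketch of the container handling described above, the following Python collects the `.iwa` entries from an iWork file held in memory. The function name `read_iwa_entries` is hypothetical, and the sketch assumes that a nested `Index.zip` entry, when present, should be opened in place of the outer archive. The folder-based iCloud persistence layout is not handled.

```python
import io
import zipfile


def read_iwa_entries(data: bytes) -> dict[str, bytes]:
    """Return {entry name: raw bytes} for every `.iwa` entry in an iWork ZIP."""
    archive = zipfile.ZipFile(io.BytesIO(data))
    names = archive.namelist()
    # Assumption: iCloud web exports wrap the real document in `Index.zip`,
    # so recurse into it instead of reading the outer archive.
    nested = [n for n in names if n == "Index.zip" or n.endswith("/Index.zip")]
    if nested:
        return read_iwa_entries(archive.read(nested[0]))
    # Property lists and preview images are skipped; only `.iwa` data is kept.
    return {n: archive.read(n) for n in names if n.endswith(".iwa")}
```

Note that `zipfile` reports entry names without a leading slash, so the entrypoint would appear as `Index/Document.iwa`.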
`TSPersistence.framework` handles the byte-level operations for the files.
`.iwa` files are sequential blocks of compressed data. Each block starts with
a 4-byte header consisting of a `0` byte followed by the compressed length
(stored as a 3-byte little-endian integer).
Each block follows the Snappy compressed format as described in
[the format description from the snappy repo](./snappy_format.txt). iWork
apps do not expect a particular compression level, and it is possible to create
the equivalent of a "STORED" block.
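
The block framing can be sketched in Python as follows. `split_blocks` walks the 4-byte headers, and `stored_block` builds a literal-only block, assuming the raw Snappy layout from the linked description (a varint with the uncompressed length, followed by a single literal element). Both function names are hypothetical, and Snappy decompression itself is not shown.

```python
def split_blocks(data: bytes) -> list[bytes]:
    """Split raw `.iwa` bytes into the compressed payload of each block."""
    blocks = []
    pos = 0
    while pos < len(data):
        if data[pos] != 0:
            raise ValueError(f"unexpected header byte {data[pos]:#x} at {pos}")
        size = int.from_bytes(data[pos + 1:pos + 4], "little")
        blocks.append(data[pos + 4:pos + 4 + size])
        pos += 4 + size
    return blocks


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Base 128 varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def stored_block(payload: bytes) -> bytes:
    """Wrap payload in one uncompressed ("STORED"-like) block."""
    if not payload:
        raise ValueError("payload must not be empty")
    n = len(payload) - 1
    if n < 60:
        # Short literal: (length - 1) lives in the upper 6 bits of the tag.
        tag = bytes([n << 2])
    else:
        # Long literal: tag values 60..63 mean 1..4 length bytes follow.
        extra = (n.bit_length() + 7) // 8
        tag = bytes([(59 + extra) << 2]) + n.to_bytes(extra, "little")
    body = encode_varint(len(payload)) + tag + payload
    if len(body) > 0xFFFFFF:
        raise ValueError("block does not fit the 3-byte length field")
    return b"\x00" + len(body).to_bytes(3, "little") + body
```

A writer built on this sketch would need to split large payloads so each block stays within the 3-byte length limit.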
## Protocol Buffers
Most of the data is stored in Protocol Buffer ("protobuf") wire messages.
The iWork apps (Keynote, Numbers, Pages) include embedded Protocol Buffers
definitions as part of the file format processors.
The [`otorp` package on `npm`](https://npm.im/otorp) ships with a command-line
tool for extracting definitions from a Mach-O binary.
Note that some fields marked as `required` have been changed to `optional` in
later versions. File parsers should assume all fields are optional.
### App-Specific Definitions
The listed definitions only appear in one app:
**Keynote**
- `KNArchives.proto`
- `KNArchives.sos.proto`
- `KNCommandArchives.proto`
- `KNCommandArchives.sos.proto`
**Numbers**
- `TNArchives.proto`
- `TNArchives.sos.proto`
- `TNCommandArchives.proto`
- `TNCommandArchives.sos.proto`
**Pages**
- `TPArchives.proto`
- `TPCommandArchives.proto`
- `TPCommandArchives.sos.proto`
The other files are common across the apps.
## Data Storage
The decompressed data is a series of chunks.
Each chunk starts with a `length` stored in a Base 128 `varint`, followed by a
protobuf packet of type `.TSP.ArchiveInfo`.
The `.TSP.ArchiveInfo` message contains a number of `.TSP.MessageInfo` messages
(tag 2). Each `MessageInfo` has a `length` field (tag 3, type `uint32`) for the
actual message body. The data for the message bodies are stored immediately
after the `ArchiveInfo`, in the same order as the `MessageInfo` parts.
The message type from the `MessageInfo` (tag 1, type `uint32`) corresponds to a
dynamic registry spread across the embedded frameworks. The actual message data
is a protobuf packet.
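
The chunk layout above can be walked with a small hand-rolled protobuf wire reader. This is a minimal sketch rather than a full parser: the names `read_varint`, `read_fields` and `read_chunks` are hypothetical, only the tags mentioned above are interpreted, and all other fields are skipped.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a Base 128 varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_fields(buf: bytes):
    """Yield (field number, wire type, value) for each field of a message."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 7
        if wire == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire == 2:  # length-delimited
            size, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + size], pos + size
        elif wire == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire}")
        yield field, wire, value


def read_chunks(data: bytes) -> list[tuple[int, bytes]]:
    """Return (message type, message body) pairs from decompressed data."""
    out = []
    pos = 0
    while pos < len(data):
        info_len, pos = read_varint(data, pos)
        archive_info = data[pos:pos + info_len]
        pos += info_len
        for field, wire, value in read_fields(archive_info):
            if field != 2 or wire != 2:
                continue  # not a MessageInfo entry
            msg_type = msg_len = 0
            for f, w, v in read_fields(value):
                if f == 1 and w == 0:
                    msg_type = v
                elif f == 3 and w == 0:
                    msg_len = v
            # Bodies follow the ArchiveInfo in MessageInfo order.
            out.append((msg_type, data[pos:pos + msg_len]))
            pos += msg_len
    return out
```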
### Dynamic Registry and Message Types
The `.TSP.Reference` type acts as a pointer, referencing another message. The
references do not include message type info, so readers and writers must be
aware of the message types and their interpretations.
Each framework is responsible for registering message types with the master
registry by sending a message to the `TSPRegistry`. The actual types can be
discovered from the frameworks. Some common message types are listed below:
| type | message |
|-----:|:-------------------------|
| 1 | `.TN.DocumentArchive` |
| 2 | `.TN.SheetArchive` |
| 6000 | `.TST.TableInfoArchive` |
| 6001 | `.TST.TableModelArchive` |
| 6002 | `.TST.Tile` |
All referenced types must be registered, but ancillary types do not need to be
registered. For example:
```proto
message .TST.TableInfoArchive {
// ...
required .TSP.Reference tableModel = 2;
// ...
}
message .TST.TableModelArchive {
// ...
required .TST.DataStore base_data_store = 4;
// ...
}
```
The reference in field 2 from `.TST.TableInfoArchive` is expected to be of type
`.TST.TableModelArchive` so the latter must be registered.
`.TST.DataStore` is the type of field 4 from `.TST.TableModelArchive`. Since it
is embedded directly rather than referenced through a `.TSP.Reference`, the
message type does not have to be registered.
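
A reader typically needs a lookup table from type numbers to message names. The sketch below hard-codes only the entries listed in the table above (types 1 and 2 are the Numbers values); the names `MESSAGE_TYPES` and `message_name` are hypothetical, and a real reader would need the full registry recovered from the frameworks.

```python
# Partial registry: only the message types listed in the table above.
MESSAGE_TYPES = {
    1: ".TN.DocumentArchive",
    2: ".TN.SheetArchive",
    6000: ".TST.TableInfoArchive",
    6001: ".TST.TableModelArchive",
    6002: ".TST.Tile",
}


def message_name(type_id: int) -> str:
    """Return the message name for a type number, or a placeholder."""
    return MESSAGE_TYPES.get(type_id, f"<unknown type {type_id}>")
```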
## Data Storage in Numbers files
The root message (type 1) has the following structure:
```proto
message .TN.DocumentArchive {
repeated .TSP.Reference sheets = 1;
// ...
}
```
The message referenced in field 1 (type 2) has the following structure:
```proto
message .TN.SheetArchive {
required string name = 1;
repeated .TSP.Reference drawable_infos = 2;
// ...
}
```
`name` is the name of the worksheet. Each worksheet can contain multiple tables.
The messages referenced in field 2 (type 6000) are `.TST.TableInfoArchive`
messages.
### Table Storage in iWork
Table structure is shared across iWork apps. The protobuf definitions are
identical. The root element for tables is the `.TST.TableInfoArchive`:
```proto
message .TST.TableInfoArchive {
required .TSP.Reference tableModel = 2;
// ...
}
```
The message referenced in field 2 (type 6001) has the following structure:
```proto
message .TST.TableModelArchive {
required .TST.DataStore base_data_store = 4;
required uint32 number_of_rows = 6;
required uint32 number_of_columns = 7;
// ...
}
message .TST.DataStore {
required .TST.TileStorage tiles = 3;
required .TSP.Reference stringTable = 4;
optional .TSP.Reference formulaErrorTable = 12;
optional .TSP.Reference rich_text_table = 17;
// ...
}
message .TST.TileStorage {
message .TST.TileStorage.Tile {
required uint32 tileid = 1;
required .TSP.Reference tile = 2;
}
repeated .TST.TileStorage.Tile tiles = 1;
// ...
}
```
Numbers uses a "shared string table" like Excel. Excel stores both plaintext and
rich strings in the same table, while Numbers has two separate tables.
The message referenced in the tiles (type 6002) has the following structure:
```proto
message .TST.Tile {
repeated .TST.TileRowInfo rowInfos = 5;
// ...
}
message .TST.TileRowInfo {
required uint32 tile_row_index = 1;
required uint32 cell_count = 2;
required bytes cell_storage_buffer_pre_bnc = 3;
required bytes cell_offsets_pre_bnc = 4;
// ...
}
```
Each `.TST.TileRowInfo` message holds the data and property references for a
single row in the table.
The cell offset fields are arrays of 16-bit integers, one per column, that
describe offsets within the respective storage buffers. An offset of `0xFFFF`
indicates that the row has no cell at that column index.
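
A minimal sketch of slicing a row's storage buffer by its offset array follows. It rests on assumptions the text does not spell out: the 16-bit offsets are little-endian, cells are laid out in increasing offset order, and each cell runs up to the next present offset (or to the end of the buffer). The function name `split_row_cells` is hypothetical.

```python
import struct


def split_row_cells(storage: bytes, offsets: bytes) -> dict[int, bytes]:
    """Map column index -> raw cell bytes for one TileRowInfo."""
    count = len(offsets) // 2
    # Assumption: offsets are little-endian unsigned 16-bit integers.
    values = struct.unpack(f"<{count}H", offsets[:count * 2])
    present = [(col, off) for col, off in enumerate(values) if off != 0xFFFF]
    cells = {}
    for i, (col, start) in enumerate(present):
        # Assumption: a cell ends where the next present cell begins.
        end = present[i + 1][1] if i + 1 < len(present) else len(storage)
        cells[col] = storage[start:end]
    return cells
```

Here `storage` and `offsets` would come from `cell_storage_buffer_pre_bnc` and `cell_offsets_pre_bnc` respectively.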
A 32-bit flag field is stored at byte offset 4 of each cell, describing which
fields are present in the cell (sizes are in bytes):
| field description | bit mask | size | notes |
|:------------------|---------:|-----:|-------------------------------------|
| Error index | `0x0100` | 4 | index into formula error table |
| Rich text index | `0x0200` | 4 | index into rich shared string table |
| Plaintext index | `0x0010` | 4 | index into shared string table |
| Double value | `0x0020` | 8 | raw value (IEEE754 double) |
| Datetime value | `0x0040` | 8 | number of seconds since 1/1/2001 |
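The starting offset of each field depends on the cell storage version, which is
stored in the first byte of each cell. Versions `0-1` use the "v1" offsets and
versions `2-3` use the "v3" offsets. In the formulae, `f` is the 32-bit flag: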
| description | v1 offset | v3 offset |
|:----------------|---------------------------:|----------------------------:|
| Error index |`8 + POPCNT(f & 0x008E) * 4`|`12 + POPCNT(f & 0x0C8E) * 4`|
| Rich text index |`8 + POPCNT(f & 0x018E) * 4`|`12 + POPCNT(f & 0x0D8E) * 4`|
| Plaintext index |`8 + POPCNT(f & 0x138E) * 4`|`12 + POPCNT(f & 0x3F8E) * 4`|
| Double value |`8 + POPCNT(f & 0x139E) * 4`|`12 + POPCNT(f & 0x3F9E) * 4`|
| Datetime value |`8 + POPCNT(f & 0x13BE) * 4`|`12 + POPCNT(f & 0x3FBE) * 4`|
The cell type is stored at byte offset 2:
| type | value |
|-----:|:-----------------------------------------------------------------|
| `0` | "blank cell" (no value) |
| `2` | "Double value" (IEEE754 double) |
| `3` | get value from shared string table at "Plaintext index" |
| `5` | interpret "Datetime value" as number of seconds since 1/1/2001 |
| `6` | `true` if "Double value" is greater than zero, `false` otherwise |
| `7` | interpret "Double value" as number of seconds (Duration) |
| `8` | get error from formula error table at "Error index" |
| `9` | get value from rich shared string table at "Rich text index" |
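
Putting the three tables together, a cell decoder might look like the sketch below. The masks and type codes are copied from the tables. The little-endian byte order and the reading of the datetime field as an IEEE754 double are assumptions, and the names `FIELDS`, `read_cell` and `cell_value` are hypothetical.

```python
import struct
from datetime import datetime, timedelta

# (name, flag bit, v1 mask, v3 mask, struct format); masks from the tables.
FIELDS = [
    ("error_index", 0x0100, 0x008E, 0x0C8E, "<I"),
    ("rich_index", 0x0200, 0x018E, 0x0D8E, "<I"),
    ("plain_index", 0x0010, 0x138E, 0x3F8E, "<I"),
    ("double", 0x0020, 0x139E, 0x3F9E, "<d"),
    ("datetime", 0x0040, 0x13BE, 0x3FBE, "<d"),  # assumed to be a double
]


def read_cell(cell: bytes) -> dict:
    """Extract the cell type and every field flagged as present."""
    version, cell_type = cell[0], cell[2]
    (flags,) = struct.unpack_from("<I", cell, 4)  # assumed little-endian
    new_layout = version >= 2
    base = 12 if new_layout else 8
    fields = {"type": cell_type}
    for name, bit, mask_v1, mask_v3, fmt in FIELDS:
        if flags & bit:
            mask = mask_v3 if new_layout else mask_v1
            offset = base + bin(flags & mask).count("1") * 4  # POPCNT
            (fields[name],) = struct.unpack_from(fmt, cell, offset)
    return fields


def cell_value(fields: dict, strings=(), rich_strings=(), errors=()):
    """Resolve a decoded cell to a Python value using the type table."""
    kind = fields["type"]
    if kind == 0:
        return None
    if kind == 2:
        return fields["double"]
    if kind == 3:
        return strings[fields["plain_index"]]
    if kind == 5:
        return datetime(2001, 1, 1) + timedelta(seconds=fields["datetime"])
    if kind == 6:
        return fields["double"] > 0
    if kind == 7:
        return timedelta(seconds=fields["double"])
    if kind == 8:
        return errors[fields["error_index"]]
    if kind == 9:
        return rich_strings[fields["rich_index"]]
    raise ValueError(f"unknown cell type {kind}")
```

The `strings`, `rich_strings` and `errors` arguments stand in for the shared string table, rich text table and formula error table, which must be loaded separately.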
## Misc
### Determining File Type
All three file types use the same message tag (1) for the root `DocumentArchive`
message. However, the required fields vary between formats.
In the 12.1 apps, the required fields are:
```proto
// Keynote optional fields 4
message .KN.DocumentArchive {
required .TSA.DocumentArchive super = 3;
required .TSP.Reference show = 2;
}
// Numbers optional fields 1, 3, 7, 9, 10 - 12
message .TN.DocumentArchive {
required .TSA.DocumentArchive super = 8;
required .TSP.Reference stylesheet = 4;
required .TSP.Reference sidebar_order = 5;
required .TSP.Reference theme = 6;
}
// Pages optional fields 2 - 7, 11 - 14, 16, 17, 20, 21, 30 - 50
message .TP.DocumentArchive {
required .TSA.DocumentArchive super = 15;
}
```
Pages is the only format to use and require field 15. Keynote requires field 2,
a field that does not appear in Numbers.
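
One way to apply this rule is to collect the field numbers present in the root message and test them in order. The sketch below is based only on the 12.1 field lists above; the function name `detect_app` is hypothetical. Field 15 is checked first because Pages lists field 2 as optional.

```python
def detect_app(root_fields: set[int]) -> str:
    """Guess the app from the field numbers seen in the root message."""
    if 15 in root_fields:
        return "Pages"  # only Pages uses field 15
    if 2 in root_fields:
        return "Keynote"  # required in Keynote, absent from Numbers
    if 8 in root_fields:
        return "Numbers"  # `super` is field 8 in Numbers
    raise ValueError("unrecognized root message")
```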
### MD5 Checksums
- [11.1](./111.md)
- [11.2](./112.md)
- [12.0](./120.md)
- [12.1](./121.md)