docs.sheetjs.com/docz/docs/02-getting-started/02-examples/06-loader.md

708 lines
20 KiB
Markdown
Raw Normal View History

2024-06-19 11:22:00 +00:00
---
title: Loader Tutorial
pagination_prev: getting-started/installation/index
pagination_next: getting-started/roadmap
sidebar_position: 6
---
import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';
Many existing systems and platforms include support for loading data from CSV
files. Many users prefer to work in spreadsheet software and multi-sheet file
formats including XLSX. SheetJS libraries help bridge the gap by translating
complex workbooks to simple CSV data.
The goal of this example is to load spreadsheet data into a vector store and use
a large language model to generate queries based on English language input. The
existing tooling supports CSV but does not support real spreadsheets.
In ["SheetJS Conversion"](#sheetjs-conversion), we will use SheetJS libraries to
generate CSV files for the LangChain CSV loader. These conversions can be run in
a preprocessing step without disrupting existing CSV workflows.
In ["SheetJS Loader"](#sheetjs-loader), we will use SheetJS libraries in a custom
loader to directly generate documents and metadata.
:::note Tested Deployments
This demo was tested in the following configurations:
| Date | Platform |
|:-----------|:--------------------------------------------------------------|
| 2024-06-19 | Apple M2 Max 12-Core CPU + 30-Core GPU (32 GB unified memory) |
| 2024-06-19 | NVIDIA RTX 4080 SUPER (16 GB VRAM) + i9-10910 (128 GB RAM) |
This explanation was verified against LangChain 0.2.
:::
## CSV Loader
Document loaders generate data objects ("documents") and associated metadata
from data sources.
LangChain offers a `CSVLoader`[^1] component for loading CSV data from a file:
```js title="Generating Documents from a CSV file"
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
const loader = new CSVLoader("pres.csv");
const docs = await loader.load();
console.log(docs);
```
The CSV loader uses the first row to determine column headers and generates one
document per data row. For example, the following CSV holds Presidential data:
```csv
Name,Index
Bill Clinton,42
GeorgeW Bush,43
Barack Obama,44
Donald Trump,45
Joseph Biden,46
```
Each data row is translated to a document whose content is a list of attributes
and values. For example, the third data row is shown below:
<table>
<thead><tr><th>CSV Row</th><th>Document Content</th></tr></thead>
<tbody><tr><td>
```
Name,Index
Barack Obama,44
```
</td><td>
```
Name: Barack Obama
Index: 44
```
</td></tr></tbody>
</table>
The LangChain CSV loader will include source metadata in the document:
```js title="Document generated by the CSV loader"
Document {
pageContent: 'Name: Barack Obama\nIndex: 44',
metadata: { source: 'pres.csv', line: 3 }
}
```
## SheetJS Conversion
The [SheetJS NodeJS module](/docs/getting-started/installation/nodejs) can be
imported in NodeJS scripts that use LangChain and other JavaScript libraries.
A simple pre-processing step can convert workbooks to spreadsheets
```mermaid
flowchart LR
file[(Workbook\nXLSX/XLS)]
subgraph SheetJS Structures
wb(((SheetJS\nWorkbook)))
ws((SheetJS\nWorksheet))
end
csv(CSV\nstring)
docs[[Documents\nArray]]
file --> |readFile\n\n| wb
wb --> |wb.Sheets\nselect sheet| ws
ws --> |sheet_to_csv\n\n| csv
csv --> |CSVLoader\n\n| docs
linkStyle 0,1,2 color:blue,stroke:blue;
```
The SheetJS `readFile` method[^2] can read general workbooks. The method returns
a workbook object that conforms to the SheetJS data model[^3].
Workbook objects represent multi-sheet workbook files. They store individual
worksheet objects and other metadata.
Each worksheet in the workbook can be written to CSV text using the SheetJS
`sheet_to_csv`[^4] method.
For example, the following NodeJS script reads `pres.xlsx` and displays CSV rows
from the first worksheet:
```js title="Print CSV data from the first worksheet"
/* Load SheetJS Libraries */
import { readFile, set_fs, utils } from 'xlsx';
/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);
/* Parse `pres.xlsx` */
const wb = readFile("pres.xlsx");
/* Print CSV rows from first worksheet */
const first_ws = wb.Sheets[wb.SheetNames[0]];
const csv = utils.sheet_to_csv(first_ws);
console.log(csv);
```
### Single Worksheet
For a single worksheet, a SheetJS pre-processing step can write the CSV rows to
file and the `CSVLoader` can load the newly written file.
<details open>
<summary><b>Code example</b> (click to hide)</summary>
```js title="Pulling data from the first worksheet of a workbook"
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
import { readFile, set_fs, utils } from 'xlsx';
/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);
/* Parse `pres.xlsx`` */
const wb = readFile("pres.xlsx");
/* Generate CSV and write to `pres.xlsx.csv` */
const first_ws = wb.Sheets[wb.SheetNames[0]];
const csv = utils.sheet_to_csv(first_ws);
fs.writeFileSync("pres.xlsx.csv", csv);
/* Create documents with CSVLoader */
const loader = new CSVLoader("pres.xlsx.csv");
const docs = await loader.load();
console.log(docs);
// ...
```
</details>
### Workbook
A workbook is a collection of worksheets. Each worksheet can be exported to a
separate CSV. If the CSVs are written to a subfolder, a `DirectoryLoader`[^5]
can process the files in one step.
<details open>
<summary><b>Code example</b> (click to hide)</summary>
In this example, the script creates a subfolder named `csv`. Each worksheet in
the workbook will be processed and the generated CSV will be stored to numbered
files. The first worksheet will be stored to `csv/0.csv`.
```js title="Pulling data from the each worksheet of a workbook"
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { readFile, set_fs, utils } from 'xlsx';
/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);
/* Parse `pres.xlsx`` */
const wb = readFile("pres.xlsx");
/* Create a folder `csv` */
try { fs.mkdirSync("csv"); } catch(e) {}
/* Generate CSV data for each worksheet */
wb.SheetNames.forEach((name, idx) => {
const ws = wb.Sheets[name];
const csv = utils.sheet_to_csv(ws);
fs.writeFileSync(`csv/${idx}.csv`, csv);
});
/* Create documents with DirectoryLoader */
const loader = new DirectoryLoader("csv", {
".csv": (path) => new CSVLoader(path)
});
const docs = await loader.load();
console.log(docs);
// ...
```
</details>
## SheetJS Loader
The `CSVLoader` that ships with LangChain does not add any Document metadata and
does not generate any attributes. A custom loader can work around limitations in
the CSV tooling and potentially include metadata that has no CSV equivalent.
```mermaid
flowchart LR
file[(Workbook\nXLSX/XLS)]
subgraph SheetJS Structures
wb(((SheetJS\nWorkbook)))
ws((SheetJS\nWorksheet))
end
aoo[(Array of\nObjects)]
docs[[Documents\nArray]]
file --> |readFile\n\n| wb
wb --> |wb.Sheets\nEach worksheet| ws
ws --> |sheet_to_json\n\n| aoo
aoo --> |new Document\nEach Row| docs
linkStyle 0,1,2 color:blue,stroke:blue;
```
The demo [`LoadOfSheet` loader](pathname:///loadofsheet/loadofsheet.mjs) will
generate one Document per data row across all worksheets. It will also attempt
to build metadata and attributes for use in self-querying retrievers.
<details>
<summary><b>Sample SheetJS Loader</b> (click to show)</summary>
This example loader pulls data from each worksheet. It assumes each worksheet
includes one header row and a number of data rows.
```js title="loadofsheet.mjs"
import { Document } from "@langchain/core/documents";
import { BufferLoader } from "langchain/document_loaders/fs/buffer";
import { read, utils } from "xlsx";
/**
* Document loader that uses SheetJS to load documents.
*
* Each worksheet is parsed into an array of row objects using the SheetJS
* `sheet_to_json` method and projected to a `Document`. Metadata includes
* original sheet name, row data, and row index
*/
export default class LoadOfSheet extends BufferLoader {
/** @type {import("langchain/chains/query_constructor").AttributeInfo[]} */
attributes = [];
/**
* Document loader that uses SheetJS to load documents.
*
* @param {string|Blob} filePathOrBlob Source Data
*/
constructor(filePathOrBlob) {
super(filePathOrBlob);
this.attributes = [];
}
/**
* Parse document
*
* NOTE: column labels in multiple sheets are not disambiguated!
*
* @param {Buffer} raw Raw data Buffer
* @param {Document["metadata"]} metadata Document metadata
* @returns {Promise<Document[]>} Array of Documents
*/
async parse(raw, metadata) {
/** @type {Document[]} */
const result = [];
this.attributes = [
{ name: "worksheet", description: "Sheet or Worksheet Name", type: "string" },
{ name: "rowNum", description: "Row index", type: "number" }
];
const wb = read(raw, {type: "buffer", WTF:1});
for(let name of wb.SheetNames) {
const fields = {};
const ws = wb.Sheets[name];
if(!ws) return;
const aoo = utils.sheet_to_json(ws);
aoo.forEach((row, idx) => {
result.push({
pageContent: "Row " + (idx + 1) + " has the following content: \n" + Object.entries(row).map(kv => `- ${kv[0]}: ${kv[1]}`).join("\n") + "\n",
metadata: {
worksheet: name,
rowNum: row["__rowNum__"],
...metadata,
...row
}
});
Object.entries(row).forEach(([k,v]) => { if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true } );
});
Object.entries(fields).forEach(([k,v]) => this.attributes.push({
name: k, description: k, type: Object.keys(v).join(" or ")
}));
}
return result;
}
};
```
</details>
### From Text to Binary
Many libraries and platforms offer generic "text" loaders that process files
assuming the UTF8 encoding. This corrupts many spreadsheet formats including
XLSX, XLSB, XLSM and XLS.
:::note pass
This issue affects many JavaScript tools. Various demos cover workarounds:
- [ViteJS plugins](/docs/demos/static/vitejs#plugins) receive the relative path
to the workbook file and can read the file directly.
- [Webpack Plugins](/docs/demos/static/webpack#sheetjs-loader) have a special
option to instruct the library to pass raw binary data rather than text.
:::
The `CSVLoader` extends a special `TextLoader` that forces UTF8 text parsing.
There is a separate `BufferLoader` class, used by the PDF loader, that passes
the raw data using NodeJS `Buffer` objects.
<table>
<thead><tr><th>Binary</th><th>Text</th></tr></thead>
<tbody><tr><td>
```ts title="pdf.ts (structure)"
export class PDFLoader extends BufferLoader {
// ...
public async parse(
raw: Buffer,
metadata: Document["metadata"]
): Promise<Document[]> {
// ...
}
// ...
}
```
</td><td>
```ts title="csv.ts (structure)"
export class CSVLoader extends TextLoader {
// ...
protected async parse(
raw: string
): Promise<string[]> {
// ...
}
// ...
}
```
</td></tr></tbody>
</table>
### NodeJS Buffers
The SheetJS `read` method supports NodeJS Buffer objects directly[^6]:
```js title="Parsing a workbook in a BufferLoader"
import { BufferLoader } from "langchain/document_loaders/fs/buffer";
import { read, utils } from "xlsx";
export default class LoadOfSheet extends BufferLoader {
// ...
async parse(raw, metadata) {
// highlight-next-line
const wb = read(raw, {type: "buffer"});
// At this point, `wb` is a SheetJS workbook object
// ...
}
}
```
The `read` method returns a SheetJS workbook object[^7].
### Generating Content
The SheetJS `sheet_to_json` method[^8] returns an array of data objects whose
keys are drawn from the first row of the worksheet.
<table>
<thead><tr><th>Spreadsheet</th><th>Array of Objects</th></tr></thead>
<tbody><tr><td>
![`pres.xlsx` data](pathname:///pres.png)
</td><td>
```js
[
{ Name: "Bill Clinton", Index: 42 },
{ Name: "GeorgeW Bush", Index: 43 },
{ Name: "Barack Obama", Index: 44 },
{ Name: "Donald Trump", Index: 45 },
{ Name: "Joseph Biden", Index: 46 }
]
```
</td></tr></tbody></table>
The original `CSVLoader` wrote one row for each key-value pair. This text can be
generated by looping over the keys and values of the data row object. The
`Object.entries` helper function simplifies the conversion:
```js
function make_csvloader_doc_from_row_object(row) {
return Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n");
}
```
### Generating Documents
The loader must generate row objects for each worksheet in the workbook.
In the SheetJS data model, the workbook object has two relevant fields:
- `SheetNames` is an array of sheet names
- `Sheets` is an object whose keys are sheet names and values are sheet objects.
A `for..of` loop can iterate across the worksheets:
```js title="Looping over a workbook (skeleton)"
const wb = read(raw, {type: "buffer", WTF:1});
for(let name of wb.SheetNames) {
const ws = wb.Sheets[name];
const aoa = utils.sheet_to_json(ws);
// at this point, `aoa` is an array of objects
}
```
This simplified `parse` function uses the snippet from the previous section:
```js title="BufferLoader parse function (skeleton)"
async parse(raw, metadata) {
/* array to hold generated documents */
const result = [];
/* read workbook */
const wb = read(raw, {type: "buffer", WTF:1});
/* loop over worksheets */
for(let name of wb.SheetNames) {
const ws = wb.Sheets[name];
const aoa = utils.sheet_to_json(ws);
/* loop over data rows */
aoa.forEach((row, idx) => {
/* generate a new document and add to the result array */
result.push({
pageContent: Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n")
});
});
}
return result;
}
```
### Metadata and Attributes
It is strongly recommended to generate additional metadata and attributes for
self-query retrieval applications.
<details>
<summary><b>Implementation Details</b> (click to show)</summary>
**Metadata**
Metadata is attached to each document object. The following example appends the
raw row data to the document metadata:
```js title="Document with metadata (snippet)"
/* generate a new document and add to the result array */
result.push({
pageContent: Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n"),
metadata: {
worksheet: name, // name of the worksheet
rowNum: idx, // data row index
...row // raw row data
}
});
```
**Attributes**
Each attribute object specifies three properties:
- `name` corresponds to the field in the document metadata
- `description` is a description of the field
- `type` is a description of the data type.
While looping through data rows, a simple type check can keep track of the data
type for each column:
```js title="Tracking column types (sketch)"
for(let name of wb.SheetNames) {
/* track column types */
const fields = {};
// ...
aoo.forEach((row, idx) => {
result.push({/* ... */});
/* Check each property */
Object.entries(row).forEach(([k,v]) => {
/* Update fields entry to reflect the new data point */
if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true
});
});
// ...
}
```
Attributes can be generated after writing the worksheet data. Storing attributes
in a loader property will make it accessible to scripts that use the loader.
```js title="Adding Attributes to a Loader (sketch)"
export default class LoadOfSheet extends BufferLoader {
// highlight-next-line
attributes = [];
// ...
async parse(raw, metadata) {
// Add the worksheet name and row index attributes
// highlight-start
this.attributes = [
{ name: "worksheet", description: "Sheet or Worksheet Name", type: "string" },
{ name: "rowNum", description: "Row index", type: "number" }
];
// highlight-end
const wb = read(raw, {type: "buffer", WTF:1});
for(let name of wb.SheetNames) {
// highlight-next-line
const fields = {};
// ...
const aoo = utils.sheet_to_json(ws);
aoo.forEach((row, idx) => {
result.push({/* ... */});
/* Check each property */
Object.entries(row).forEach(([k,v]) => {
/* Update fields entry to reflect the new data point */
if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true
});
});
/* Add one attribute per metadata field */
// highlight-start
Object.entries(fields).forEach(([k,v]) => this.attributes.push({
name: k, description: k,
/* { number: true, string: true } -> "number or string" */
type: Object.keys(v).join(" or ")
}));
// highlight-end
}
// ...
}
```
</details>
## SheetJS Loader Demo
The demo performs the query "Which rows have over 40 miles per gallon?" against
a [sample cars dataset](pathname:///cd.xls) and displays the results.
:::caution pass
This demo was tested using the ChatQA-1.5 model[^9] in Ollama[^10].
The tested model requires 9.2GB VRAM. It is strongly recommended to run the demo
on a newer Apple Silicon Mac or a PC with an Nvidia GPU with at least 12GB VRAM.
:::
0) Create a new project:
```bash
mkdir sheetjs-loader
cd sheetjs-loader
npm init -y
```
1) Download the demo scripts:
- [`loadofsheet.mjs`](pathname:///loadofsheet/loadofsheet.mjs)
- [`query.mjs`](pathname:///loadofsheet/query.mjs)
```bash
curl -LO https://docs.sheetjs.com/loadofsheet/query.mjs
curl -LO https://docs.sheetjs.com/loadofsheet/loadofsheet.mjs
```
2) Install the SheetJS NodeJS module:
<CodeBlock language="bash">{`\
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz`}
</CodeBlock>
3) Install LangChain and HNSWLib dependencies:
```bash
npm i --save @langchain/community@0.2.0 @langchain/core@0.2.6 langchain@0.2.5 hnswlib-node@3.0.0 peggy@3.0.2
```
4) Download the [cars dataset](pathname:///cd.xls):
```bash
curl -LO https://docs.sheetjs.com/cd.xls
```
5) Install the `llama3-chatqa:8b-v1.5-q8_0` model using Ollama:
```bash
ollama pull llama3-chatqa:8b-v1.5-q8_0
```
:::note pass
If the command cannot be found, install Ollama[^10] and run the command in a new
terminal window.
:::
6) Run the demo script
```bash
node query.mjs
```
The demo performs the query "Which rows have over 40 miles per gallon?". It will
print the following nine results:
```js title="Expected output"
{ Name: 'volkswagen rabbit custom diesel', MPG: 43.1 }
{ Name: 'vw rabbit c (diesel)', MPG: 44.3 }
{ Name: 'renault lecar deluxe', MPG: 40.9 }
{ Name: 'honda civic 1500 gl', MPG: 44.6 }
{ Name: 'datsun 210', MPG: 40.8 }
{ Name: 'vw pickup', MPG: 44 }
{ Name: 'mazda glc', MPG: 46.6 }
{ Name: 'vw dasher (diesel)', MPG: 43.4 }
{ Name: 'vw rabbit', MPG: 41.5 }
```
To find the expected results:
- Open the `cd.xls` spreadsheet in Excel
- Select Home > Sort &amp; Filter > Filter in the Ribbon
- Select the filter option for column B (`Miles_per_Gallon`)
- In the popup, select "Greater Than" in the Filter dropdown and type 40
The filtered results should match the following screenshot:
![Expected Results](pathname:///loadofsheet/expected.png)
[^1]: See ["How to load CSV data"](https://js.langchain.com/v0.2/docs/how_to/document_loader_csv) in the LangChain documentation
[^2]: See [`readFile` in "Reading Files"](/docs/api/parse-options)
[^3]: See ["SheetJS Data Model"](/docs/csf/)
[^4]: See [`sheet_to_csv` in "CSV and Text"](/docs/api/utilities/csv#delimiter-separated-output)
[^5]: See ["Folders with multiple files"](https://js.langchain.com/v0.2/docs/integrations/document_loaders/file_loaders/directory/) in the LangChain documentation
[^6]: See ["Supported Output Formats" type in "Writing Files"](/docs/api/write-options#supported-output-formats)
[^7]: See ["Workbook Object"](/docs/csf/book)
[^8]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)
[^9]: See [the official ChatQA website](https://chatqa-project.github.io/) for the ChatQA paper and other model details.
[^10]: See [the official Ollama website](https://ollama.com/download) for installation instructions.