forked from sheetjs/docs.sheetjs.com
708 lines
20 KiB
Markdown
708 lines
20 KiB
Markdown
|
---
|
||
|
title: Loader Tutorial
|
||
|
pagination_prev: getting-started/installation/index
|
||
|
pagination_next: getting-started/roadmap
|
||
|
sidebar_position: 6
|
||
|
---
|
||
|
|
||
|
import current from '/version.js';
|
||
|
import Tabs from '@theme/Tabs';
|
||
|
import TabItem from '@theme/TabItem';
|
||
|
import CodeBlock from '@theme/CodeBlock';
|
||
|
|
||
|
Many existing systems and platforms include support for loading data from CSV
|
||
|
files. Many users prefer to work in spreadsheet software and multi-sheet file
|
||
|
formats including XLSX. SheetJS libraries help bridge the gap by translating
|
||
|
complex workbooks to simple CSV data.
|
||
|
|
||
|
The goal of this example is to load spreadsheet data into a vector store and use
|
||
|
a large language model to generate queries based on English language input. The
|
||
|
existing tooling supports CSV but does not support real spreadsheets.
|
||
|
|
||
|
In ["SheetJS Conversion"](#sheetjs-conversion), we will use SheetJS libraries to
|
||
|
generate CSV files for the LangChain CSV loader. These conversions can be run in
|
||
|
a preprocessing step without disrupting existing CSV workflows.
|
||
|
|
||
|
In ["SheetJS Loader"](#sheetjs-loader), we will use SheetJS libraries in a custom
|
||
|
loader to directly generate documents and metadata.
|
||
|
|
||
|
:::note Tested Deployments
|
||
|
|
||
|
This demo was tested in the following configurations:
|
||
|
|
||
|
| Date | Platform |
|
||
|
|:-----------|:--------------------------------------------------------------|
|
||
|
| 2024-06-19 | Apple M2 Max 12-Core CPU + 30-Core GPU (32 GB unified memory) |
|
||
|
| 2024-06-19 | NVIDIA RTX 4080 SUPER (16 GB VRAM) + i9-10910 (128 GB RAM) |
|
||
|
|
||
|
This explanation was verified against LangChain 0.2.
|
||
|
|
||
|
:::
|
||
|
|
||
|
## CSV Loader
|
||
|
|
||
|
Document loaders generate data objects ("documents") and associated metadata
|
||
|
from data sources.
|
||
|
|
||
|
LangChain offers a `CSVLoader`[^1] component for loading CSV data from a file:
|
||
|
|
||
|
```js title="Generating Documents from a CSV file"
|
||
|
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
|
||
|
|
||
|
const loader = new CSVLoader("pres.csv");
|
||
|
const docs = await loader.load();
|
||
|
|
||
|
console.log(docs);
|
||
|
```
|
||
|
|
||
|
The CSV loader uses the first row to determine column headers and generates one
|
||
|
document per data row. For example, the following CSV holds Presidential data:
|
||
|
|
||
|
```csv
|
||
|
Name,Index
|
||
|
Bill Clinton,42
|
||
|
GeorgeW Bush,43
|
||
|
Barack Obama,44
|
||
|
Donald Trump,45
|
||
|
Joseph Biden,46
|
||
|
```
|
||
|
|
||
|
Each data row is translated to a document whose content is a list of attributes
|
||
|
and values. For example, the third data row is shown below:
|
||
|
|
||
|
<table>
|
||
|
<thead><tr><th>CSV Row</th><th>Document Content</th></tr></thead>
|
||
|
<tbody><tr><td>
|
||
|
|
||
|
```
|
||
|
Name,Index
|
||
|
Barack Obama,44
|
||
|
```
|
||
|
|
||
|
</td><td>
|
||
|
|
||
|
```
|
||
|
Name: Barack Obama
|
||
|
Index: 44
|
||
|
```
|
||
|
|
||
|
</td></tr></tbody>
|
||
|
</table>
|
||
|
|
||
|
The LangChain CSV loader will include source metadata in the document:
|
||
|
|
||
|
```js title="Document generated by the CSV loader"
|
||
|
Document {
|
||
|
pageContent: 'Name: Barack Obama\nIndex: 44',
|
||
|
metadata: { source: 'pres.csv', line: 3 }
|
||
|
}
|
||
|
```
|
||
|
|
||
|
## SheetJS Conversion
|
||
|
|
||
|
The [SheetJS NodeJS module](/docs/getting-started/installation/nodejs) can be
|
||
|
imported in NodeJS scripts that use LangChain and other JavaScript libraries.
|
||
|
|
||
|
A simple pre-processing step can convert workbooks to spreadsheets
|
||
|
|
||
|
```mermaid
|
||
|
flowchart LR
|
||
|
file[(Workbook\nXLSX/XLS)]
|
||
|
subgraph SheetJS Structures
|
||
|
wb(((SheetJS\nWorkbook)))
|
||
|
ws((SheetJS\nWorksheet))
|
||
|
end
|
||
|
csv(CSV\nstring)
|
||
|
docs[[Documents\nArray]]
|
||
|
file --> |readFile\n\n| wb
|
||
|
wb --> |wb.Sheets\nselect sheet| ws
|
||
|
ws --> |sheet_to_csv\n\n| csv
|
||
|
csv --> |CSVLoader\n\n| docs
|
||
|
linkStyle 0,1,2 color:blue,stroke:blue;
|
||
|
```
|
||
|
|
||
|
The SheetJS `readFile` method[^2] can read general workbooks. The method returns
|
||
|
a workbook object that conforms to the SheetJS data model[^3].
|
||
|
|
||
|
Workbook objects represent multi-sheet workbook files. They store individual
|
||
|
worksheet objects and other metadata.
|
||
|
|
||
|
Each worksheet in the workbook can be written to CSV text using the SheetJS
|
||
|
`sheet_to_csv`[^4] method.
|
||
|
|
||
|
For example, the following NodeJS script reads `pres.xlsx` and displays CSV rows
|
||
|
from the first worksheet:
|
||
|
|
||
|
```js title="Print CSV data from the first worksheet"
|
||
|
/* Load SheetJS Libraries */
|
||
|
import { readFile, set_fs, utils } from 'xlsx';
|
||
|
|
||
|
/* Load 'fs' for readFile support */
|
||
|
import * as fs from 'fs';
|
||
|
set_fs(fs);
|
||
|
|
||
|
/* Parse `pres.xlsx` */
|
||
|
const wb = readFile("pres.xlsx");
|
||
|
|
||
|
/* Print CSV rows from first worksheet */
|
||
|
const first_ws = wb.Sheets[wb.SheetNames[0]];
|
||
|
const csv = utils.sheet_to_csv(first_ws);
|
||
|
console.log(csv);
|
||
|
```
|
||
|
|
||
|
### Single Worksheet
|
||
|
|
||
|
For a single worksheet, a SheetJS pre-processing step can write the CSV rows to
|
||
|
file and the `CSVLoader` can load the newly written file.
|
||
|
|
||
|
<details open>
|
||
|
<summary><b>Code example</b> (click to hide)</summary>
|
||
|
|
||
|
```js title="Pulling data from the first worksheet of a workbook"
|
||
|
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
|
||
|
import { readFile, set_fs, utils } from 'xlsx';
|
||
|
|
||
|
/* Load 'fs' for readFile support */
|
||
|
import * as fs from 'fs';
|
||
|
set_fs(fs);
|
||
|
|
||
|
/* Parse `pres.xlsx`` */
|
||
|
const wb = readFile("pres.xlsx");
|
||
|
|
||
|
/* Generate CSV and write to `pres.xlsx.csv` */
|
||
|
const first_ws = wb.Sheets[wb.SheetNames[0]];
|
||
|
const csv = utils.sheet_to_csv(first_ws);
|
||
|
fs.writeFileSync("pres.xlsx.csv", csv);
|
||
|
|
||
|
/* Create documents with CSVLoader */
|
||
|
const loader = new CSVLoader("pres.xlsx.csv");
|
||
|
const docs = await loader.load();
|
||
|
|
||
|
console.log(docs);
|
||
|
// ...
|
||
|
```
|
||
|
|
||
|
</details>
|
||
|
|
||
|
### Workbook
|
||
|
|
||
|
A workbook is a collection of worksheets. Each worksheet can be exported to a
|
||
|
separate CSV. If the CSVs are written to a subfolder, a `DirectoryLoader`[^5]
|
||
|
can process the files in one step.
|
||
|
|
||
|
<details open>
|
||
|
<summary><b>Code example</b> (click to hide)</summary>
|
||
|
|
||
|
In this example, the script creates a subfolder named `csv`. Each worksheet in
|
||
|
the workbook will be processed and the generated CSV will be stored to numbered
|
||
|
files. The first worksheet will be stored to `csv/0.csv`.
|
||
|
|
||
|
```js title="Pulling data from the each worksheet of a workbook"
|
||
|
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
|
||
|
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
|
||
|
import { readFile, set_fs, utils } from 'xlsx';
|
||
|
|
||
|
/* Load 'fs' for readFile support */
|
||
|
import * as fs from 'fs';
|
||
|
set_fs(fs);
|
||
|
|
||
|
/* Parse `pres.xlsx`` */
|
||
|
const wb = readFile("pres.xlsx");
|
||
|
|
||
|
/* Create a folder `csv` */
|
||
|
try { fs.mkdirSync("csv"); } catch(e) {}
|
||
|
|
||
|
/* Generate CSV data for each worksheet */
|
||
|
wb.SheetNames.forEach((name, idx) => {
|
||
|
const ws = wb.Sheets[name];
|
||
|
const csv = utils.sheet_to_csv(ws);
|
||
|
fs.writeFileSync(`csv/${idx}.csv`, csv);
|
||
|
});
|
||
|
|
||
|
/* Create documents with DirectoryLoader */
|
||
|
const loader = new DirectoryLoader("csv", {
|
||
|
".csv": (path) => new CSVLoader(path)
|
||
|
});
|
||
|
const docs = await loader.load();
|
||
|
|
||
|
console.log(docs);
|
||
|
// ...
|
||
|
```
|
||
|
|
||
|
</details>
|
||
|
|
||
|
## SheetJS Loader
|
||
|
|
||
|
The `CSVLoader` that ships with LangChain does not add any Document metadata and
|
||
|
does not generate any attributes. A custom loader can work around limitations in
|
||
|
the CSV tooling and potentially include metadata that has no CSV equivalent.
|
||
|
|
||
|
```mermaid
|
||
|
flowchart LR
|
||
|
file[(Workbook\nXLSX/XLS)]
|
||
|
subgraph SheetJS Structures
|
||
|
wb(((SheetJS\nWorkbook)))
|
||
|
ws((SheetJS\nWorksheet))
|
||
|
end
|
||
|
aoo[(Array of\nObjects)]
|
||
|
docs[[Documents\nArray]]
|
||
|
file --> |readFile\n\n| wb
|
||
|
wb --> |wb.Sheets\nEach worksheet| ws
|
||
|
ws --> |sheet_to_json\n\n| aoo
|
||
|
aoo --> |new Document\nEach Row| docs
|
||
|
linkStyle 0,1,2 color:blue,stroke:blue;
|
||
|
```
|
||
|
|
||
|
The demo [`LoadOfSheet` loader](pathname:///loadofsheet/loadofsheet.mjs) will
|
||
|
generate one Document per data row across all worksheets. It will also attempt
|
||
|
to build metadata and attributes for use in self-querying retrievers.
|
||
|
|
||
|
<details>
|
||
|
<summary><b>Sample SheetJS Loader</b> (click to show)</summary>
|
||
|
|
||
|
This example loader pulls data from each worksheet. It assumes each worksheet
|
||
|
includes one header row and a number of data rows.
|
||
|
|
||
|
```js title="loadofsheet.mjs"
|
||
|
import { Document } from "@langchain/core/documents";
|
||
|
import { BufferLoader } from "langchain/document_loaders/fs/buffer";
|
||
|
import { read, utils } from "xlsx";
|
||
|
|
||
|
/**
|
||
|
* Document loader that uses SheetJS to load documents.
|
||
|
*
|
||
|
* Each worksheet is parsed into an array of row objects using the SheetJS
|
||
|
* `sheet_to_json` method and projected to a `Document`. Metadata includes
|
||
|
* original sheet name, row data, and row index
|
||
|
*/
|
||
|
export default class LoadOfSheet extends BufferLoader {
|
||
|
/** @type {import("langchain/chains/query_constructor").AttributeInfo[]} */
|
||
|
attributes = [];
|
||
|
|
||
|
/**
|
||
|
* Document loader that uses SheetJS to load documents.
|
||
|
*
|
||
|
* @param {string|Blob} filePathOrBlob Source Data
|
||
|
*/
|
||
|
constructor(filePathOrBlob) {
|
||
|
super(filePathOrBlob);
|
||
|
this.attributes = [];
|
||
|
}
|
||
|
|
||
|
/**
|
||
|
* Parse document
|
||
|
*
|
||
|
* NOTE: column labels in multiple sheets are not disambiguated!
|
||
|
*
|
||
|
* @param {Buffer} raw Raw data Buffer
|
||
|
* @param {Document["metadata"]} metadata Document metadata
|
||
|
* @returns {Promise<Document[]>} Array of Documents
|
||
|
*/
|
||
|
async parse(raw, metadata) {
|
||
|
/** @type {Document[]} */
|
||
|
const result = [];
|
||
|
|
||
|
this.attributes = [
|
||
|
{ name: "worksheet", description: "Sheet or Worksheet Name", type: "string" },
|
||
|
{ name: "rowNum", description: "Row index", type: "number" }
|
||
|
];
|
||
|
|
||
|
const wb = read(raw, {type: "buffer", WTF:1});
|
||
|
for(let name of wb.SheetNames) {
|
||
|
const fields = {};
|
||
|
const ws = wb.Sheets[name];
|
||
|
if(!ws) return;
|
||
|
|
||
|
const aoo = utils.sheet_to_json(ws);
|
||
|
aoo.forEach((row, idx) => {
|
||
|
result.push({
|
||
|
pageContent: "Row " + (idx + 1) + " has the following content: \n" + Object.entries(row).map(kv => `- ${kv[0]}: ${kv[1]}`).join("\n") + "\n",
|
||
|
metadata: {
|
||
|
worksheet: name,
|
||
|
rowNum: row["__rowNum__"],
|
||
|
...metadata,
|
||
|
...row
|
||
|
}
|
||
|
});
|
||
|
Object.entries(row).forEach(([k,v]) => { if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true } );
|
||
|
});
|
||
|
Object.entries(fields).forEach(([k,v]) => this.attributes.push({
|
||
|
name: k, description: k, type: Object.keys(v).join(" or ")
|
||
|
}));
|
||
|
}
|
||
|
|
||
|
return result;
|
||
|
}
|
||
|
};
|
||
|
```
|
||
|
|
||
|
</details>
|
||
|
|
||
|
### From Text to Binary
|
||
|
|
||
|
Many libraries and platforms offer generic "text" loaders that process files
|
||
|
assuming the UTF8 encoding. This corrupts many spreadsheet formats including
|
||
|
XLSX, XLSB, XLSM and XLS.
|
||
|
|
||
|
:::note pass
|
||
|
|
||
|
This issue affects many JavaScript tools. Various demos cover workarounds:
|
||
|
|
||
|
- [ViteJS plugins](/docs/demos/static/vitejs#plugins) receive the relative path
|
||
|
to the workbook file and can read the file directly.
|
||
|
|
||
|
- [Webpack Plugins](/docs/demos/static/webpack#sheetjs-loader) have a special
|
||
|
option to instruct the library to pass raw binary data rather than text.
|
||
|
|
||
|
:::
|
||
|
|
||
|
The `CSVLoader` extends a special `TextLoader` that forces UTF8 text parsing.
|
||
|
|
||
|
There is a separate `BufferLoader` class, used by the PDF loader, that passes
|
||
|
the raw data using NodeJS `Buffer` objects.
|
||
|
|
||
|
<table>
|
||
|
<thead><tr><th>Binary</th><th>Text</th></tr></thead>
|
||
|
<tbody><tr><td>
|
||
|
|
||
|
```ts title="pdf.ts (structure)"
|
||
|
export class PDFLoader extends BufferLoader {
|
||
|
// ...
|
||
|
public async parse(
|
||
|
raw: Buffer,
|
||
|
metadata: Document["metadata"]
|
||
|
): Promise<Document[]> {
|
||
|
// ...
|
||
|
}
|
||
|
// ...
|
||
|
}
|
||
|
```
|
||
|
|
||
|
</td><td>
|
||
|
|
||
|
```ts title="csv.ts (structure)"
|
||
|
export class CSVLoader extends TextLoader {
|
||
|
// ...
|
||
|
protected async parse(
|
||
|
raw: string
|
||
|
|
||
|
): Promise<string[]> {
|
||
|
// ...
|
||
|
}
|
||
|
// ...
|
||
|
}
|
||
|
```
|
||
|
|
||
|
</td></tr></tbody>
|
||
|
</table>
|
||
|
|
||
|
### NodeJS Buffers
|
||
|
|
||
|
The SheetJS `read` method supports NodeJS Buffer objects directly[^6]:
|
||
|
|
||
|
```js title="Parsing a workbook in a BufferLoader"
|
||
|
import { BufferLoader } from "langchain/document_loaders/fs/buffer";
|
||
|
import { read, utils } from "xlsx";
|
||
|
|
||
|
export default class LoadOfSheet extends BufferLoader {
|
||
|
// ...
|
||
|
async parse(raw, metadata) {
|
||
|
// highlight-next-line
|
||
|
const wb = read(raw, {type: "buffer"});
|
||
|
// At this point, `wb` is a SheetJS workbook object
|
||
|
// ...
|
||
|
}
|
||
|
}
|
||
|
```
|
||
|
|
||
|
The `read` method returns a SheetJS workbook object[^7].
|
||
|
|
||
|
### Generating Content
|
||
|
|
||
|
The SheetJS `sheet_to_json` method[^8] returns an array of data objects whose
|
||
|
keys are drawn from the first row of the worksheet.
|
||
|
|
||
|
<table>
|
||
|
<thead><tr><th>Spreadsheet</th><th>Array of Objects</th></tr></thead>
|
||
|
<tbody><tr><td>
|
||
|
|
||
|
![`pres.xlsx` data](pathname:///pres.png)
|
||
|
|
||
|
</td><td>
|
||
|
|
||
|
```js
|
||
|
[
|
||
|
{ Name: "Bill Clinton", Index: 42 },
|
||
|
{ Name: "GeorgeW Bush", Index: 43 },
|
||
|
{ Name: "Barack Obama", Index: 44 },
|
||
|
{ Name: "Donald Trump", Index: 45 },
|
||
|
{ Name: "Joseph Biden", Index: 46 }
|
||
|
]
|
||
|
```
|
||
|
|
||
|
</td></tr></tbody></table>
|
||
|
|
||
|
The original `CSVLoader` wrote one row for each key-value pair. This text can be
|
||
|
generated by looping over the keys and values of the data row object. The
|
||
|
`Object.entries` helper function simplifies the conversion:
|
||
|
|
||
|
```js
|
||
|
function make_csvloader_doc_from_row_object(row) {
|
||
|
return Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n");
|
||
|
}
|
||
|
```
|
||
|
|
||
|
### Generating Documents
|
||
|
|
||
|
The loader must generate row objects for each worksheet in the workbook.
|
||
|
|
||
|
In the SheetJS data model, the workbook object has two relevant fields:
|
||
|
|
||
|
- `SheetNames` is an array of sheet names
|
||
|
- `Sheets` is an object whose keys are sheet names and values are sheet objects.
|
||
|
|
||
|
A `for..of` loop can iterate across the worksheets:
|
||
|
|
||
|
```js title="Looping over a workbook (skeleton)"
|
||
|
const wb = read(raw, {type: "buffer", WTF:1});
|
||
|
for(let name of wb.SheetNames) {
|
||
|
const ws = wb.Sheets[name];
|
||
|
const aoa = utils.sheet_to_json(ws);
|
||
|
// at this point, `aoa` is an array of objects
|
||
|
}
|
||
|
```
|
||
|
|
||
|
This simplified `parse` function uses the snippet from the previous section:
|
||
|
|
||
|
```js title="BufferLoader parse function (skeleton)"
|
||
|
async parse(raw, metadata) {
|
||
|
/* array to hold generated documents */
|
||
|
const result = [];
|
||
|
|
||
|
/* read workbook */
|
||
|
const wb = read(raw, {type: "buffer", WTF:1});
|
||
|
|
||
|
/* loop over worksheets */
|
||
|
for(let name of wb.SheetNames) {
|
||
|
const ws = wb.Sheets[name];
|
||
|
const aoa = utils.sheet_to_json(ws);
|
||
|
|
||
|
/* loop over data rows */
|
||
|
aoa.forEach((row, idx) => {
|
||
|
/* generate a new document and add to the result array */
|
||
|
result.push({
|
||
|
pageContent: Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n")
|
||
|
});
|
||
|
});
|
||
|
}
|
||
|
|
||
|
return result;
|
||
|
}
|
||
|
```
|
||
|
|
||
|
### Metadata and Attributes
|
||
|
|
||
|
It is strongly recommended to generate additional metadata and attributes for
|
||
|
self-query retrieval applications.
|
||
|
|
||
|
<details>
|
||
|
<summary><b>Implementation Details</b> (click to show)</summary>
|
||
|
|
||
|
**Metadata**
|
||
|
|
||
|
Metadata is attached to each document object. The following example appends the
|
||
|
raw row data to the document metadata:
|
||
|
|
||
|
```js title="Document with metadata (snippet)"
|
||
|
/* generate a new document and add to the result array */
|
||
|
result.push({
|
||
|
pageContent: Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n"),
|
||
|
metadata: {
|
||
|
worksheet: name, // name of the worksheet
|
||
|
rowNum: idx, // data row index
|
||
|
...row // raw row data
|
||
|
}
|
||
|
});
|
||
|
```
|
||
|
|
||
|
**Attributes**
|
||
|
|
||
|
Each attribute object specifies three properties:
|
||
|
|
||
|
- `name` corresponds to the field in the document metadata
|
||
|
- `description` is a description of the field
|
||
|
- `type` is a description of the data type.
|
||
|
|
||
|
While looping through data rows, a simple type check can keep track of the data
|
||
|
type for each column:
|
||
|
|
||
|
```js title="Tracking column types (sketch)"
|
||
|
for(let name of wb.SheetNames) {
|
||
|
/* track column types */
|
||
|
const fields = {};
|
||
|
// ...
|
||
|
|
||
|
aoo.forEach((row, idx) => {
|
||
|
result.push({/* ... */});
|
||
|
/* Check each property */
|
||
|
Object.entries(row).forEach(([k,v]) => {
|
||
|
/* Update fields entry to reflect the new data point */
|
||
|
if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true
|
||
|
});
|
||
|
});
|
||
|
// ...
|
||
|
}
|
||
|
```
|
||
|
|
||
|
Attributes can be generated after writing the worksheet data. Storing attributes
|
||
|
in a loader property will make it accessible to scripts that use the loader.
|
||
|
|
||
|
```js title="Adding Attributes to a Loader (sketch)"
|
||
|
export default class LoadOfSheet extends BufferLoader {
|
||
|
// highlight-next-line
|
||
|
attributes = [];
|
||
|
// ...
|
||
|
|
||
|
async parse(raw, metadata) {
|
||
|
// Add the worksheet name and row index attributes
|
||
|
// highlight-start
|
||
|
this.attributes = [
|
||
|
{ name: "worksheet", description: "Sheet or Worksheet Name", type: "string" },
|
||
|
{ name: "rowNum", description: "Row index", type: "number" }
|
||
|
];
|
||
|
// highlight-end
|
||
|
const wb = read(raw, {type: "buffer", WTF:1});
|
||
|
for(let name of wb.SheetNames) {
|
||
|
// highlight-next-line
|
||
|
const fields = {};
|
||
|
// ...
|
||
|
const aoo = utils.sheet_to_json(ws);
|
||
|
aoo.forEach((row, idx) => {
|
||
|
result.push({/* ... */});
|
||
|
/* Check each property */
|
||
|
Object.entries(row).forEach(([k,v]) => {
|
||
|
/* Update fields entry to reflect the new data point */
|
||
|
if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true
|
||
|
});
|
||
|
});
|
||
|
/* Add one attribute per metadata field */
|
||
|
// highlight-start
|
||
|
Object.entries(fields).forEach(([k,v]) => this.attributes.push({
|
||
|
name: k, description: k,
|
||
|
/* { number: true, string: true } -> "number or string" */
|
||
|
type: Object.keys(v).join(" or ")
|
||
|
}));
|
||
|
// highlight-end
|
||
|
}
|
||
|
// ...
|
||
|
}
|
||
|
```
|
||
|
|
||
|
</details>
|
||
|
|
||
|
## SheetJS Loader Demo
|
||
|
|
||
|
The demo performs the query "Which rows have over 40 miles per gallon?" against
|
||
|
a [sample cars dataset](pathname:///cd.xls) and displays the results.
|
||
|
|
||
|
:::caution pass
|
||
|
|
||
|
This demo was tested using the ChatQA-1.5 model[^9] in Ollama[^10].
|
||
|
|
||
|
The tested model requires 9.2GB VRAM. It is strongly recommended to run the demo
|
||
|
on a newer Apple Silicon Mac or a PC with an Nvidia GPU with at least 12GB VRAM.
|
||
|
|
||
|
:::
|
||
|
|
||
|
0) Create a new project:
|
||
|
|
||
|
```bash
|
||
|
mkdir sheetjs-loader
|
||
|
cd sheetjs-loader
|
||
|
npm init -y
|
||
|
```
|
||
|
|
||
|
1) Download the demo scripts:
|
||
|
|
||
|
- [`loadofsheet.mjs`](pathname:///loadofsheet/loadofsheet.mjs)
|
||
|
- [`query.mjs`](pathname:///loadofsheet/query.mjs)
|
||
|
|
||
|
```bash
|
||
|
curl -LO https://docs.sheetjs.com/loadofsheet/query.mjs
|
||
|
curl -LO https://docs.sheetjs.com/loadofsheet/loadofsheet.mjs
|
||
|
```
|
||
|
|
||
|
2) Install the SheetJS NodeJS module:
|
||
|
|
||
|
<CodeBlock language="bash">{`\
|
||
|
npm i --save https://cdn.sheetjs.com/xlsx-${current}/xlsx-${current}.tgz`}
|
||
|
</CodeBlock>
|
||
|
|
||
|
3) Install LangChain and HNSWLib dependencies:
|
||
|
|
||
|
```bash
|
||
|
npm i --save @langchain/community@0.2.0 @langchain/core@0.2.6 langchain@0.2.5 hnswlib-node@3.0.0 peggy@3.0.2
|
||
|
```
|
||
|
|
||
|
4) Download the [cars dataset](pathname:///cd.xls):
|
||
|
|
||
|
```bash
|
||
|
curl -LO https://docs.sheetjs.com/cd.xls
|
||
|
```
|
||
|
|
||
|
5) Install the `llama3-chatqa:8b-v1.5-q8_0` model using Ollama:
|
||
|
|
||
|
```bash
|
||
|
ollama pull llama3-chatqa:8b-v1.5-q8_0
|
||
|
```
|
||
|
|
||
|
:::note pass
|
||
|
|
||
|
If the command cannot be found, install Ollama[^10] and run the command in a new
|
||
|
terminal window.
|
||
|
|
||
|
:::
|
||
|
|
||
|
6) Run the demo script
|
||
|
|
||
|
```bash
|
||
|
node query.mjs
|
||
|
```
|
||
|
|
||
|
The demo performs the query "Which rows have over 40 miles per gallon?". It will
|
||
|
print the following nine results:
|
||
|
|
||
|
```js title="Expected output"
|
||
|
{ Name: 'volkswagen rabbit custom diesel', MPG: 43.1 }
|
||
|
{ Name: 'vw rabbit c (diesel)', MPG: 44.3 }
|
||
|
{ Name: 'renault lecar deluxe', MPG: 40.9 }
|
||
|
{ Name: 'honda civic 1500 gl', MPG: 44.6 }
|
||
|
{ Name: 'datsun 210', MPG: 40.8 }
|
||
|
{ Name: 'vw pickup', MPG: 44 }
|
||
|
{ Name: 'mazda glc', MPG: 46.6 }
|
||
|
{ Name: 'vw dasher (diesel)', MPG: 43.4 }
|
||
|
{ Name: 'vw rabbit', MPG: 41.5 }
|
||
|
```
|
||
|
|
||
|
To find the expected results:
|
||
|
|
||
|
- Open the `cd.xls` spreadsheet in Excel
|
||
|
- Select Home > Sort & Filter > Filter in the Ribbon
|
||
|
- Select the filter option for column B (`Miles_per_Gallon`)
|
||
|
- In the popup, select "Greater Than" in the Filter dropdown and type 40
|
||
|
|
||
|
The filtered results should match the following screenshot:
|
||
|
|
||
|
![Expected Results](pathname:///loadofsheet/expected.png)
|
||
|
|
||
|
[^1]: See ["How to load CSV data"](https://js.langchain.com/v0.2/docs/how_to/document_loader_csv) in the LangChain documentation
|
||
|
[^2]: See [`readFile` in "Reading Files"](/docs/api/parse-options)
|
||
|
[^3]: See ["SheetJS Data Model"](/docs/csf/)
|
||
|
[^4]: See [`sheet_to_csv` in "CSV and Text"](/docs/api/utilities/csv#delimiter-separated-output)
|
||
|
[^5]: See ["Folders with multiple files"](https://js.langchain.com/v0.2/docs/integrations/document_loaders/file_loaders/directory/) in the LangChain documentation
|
||
|
[^6]: See ["Supported Output Formats" type in "Writing Files"](/docs/api/write-options#supported-output-formats)
|
||
|
[^7]: See ["Workbook Object"](/docs/csf/book)
|
||
|
[^8]: See [`sheet_to_json` in "Utilities"](/docs/api/utilities/array#array-output)
|
||
|
[^9]: See [the official ChatQA website](https://chatqa-project.github.io/) for the ChatQA paper and other model details.
|
||
|
[^10]: See [the official Ollama website](https://ollama.com/download) for installation instructions.
|