cfb_add and write performance issues #2

Open
opened 2018-04-07 14:58:55 +00:00 by rossj · 6 comments
rossj commented 2018-04-07 14:58:55 +00:00 (Migrated from github.com)

Hi there,

I'm working on a program which converts .pst files to .msg files, primarily in Node but also in the browser, and it uses this library in a very write-heavy way for saving the .msg files. Through testing and profiling, I've noticed a couple write-related performance issues that I wanted to share.

With some modifications, I've been able to reduce the output generation time of my primary "large" test case (4300 .msg files from 1 .pst) by a factor of 8 from about 16 minutes to 2 minutes (running on Node).

The first issue, which may just be a matter of documentation, is that using `cfb_add` repeatedly to add all streams to a new `doc` is very slow, as it calls `cfb_gc` and `cfb_rebuild` every time. We switched from `cfb_add` to pushing directly to `cfb.FileIndex` and `cfb.FullPaths` (and then calling `cfb_rebuild` once at the end), which reduced the output time from 16 minutes to 3.5 minutes.
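
For context, a minimal sketch of that workaround, assuming the npm package name `cfb`, a `FileIndex` entry shape of `{name, type, content, size}`, and `cfb_gc` as the way to trigger the final rebuild; the exact fields should be checked against the library source:

```js
var CFB = require('cfb');

var cfb = CFB.utils.cfb_new();

// Push directory entries directly instead of calling cfb_add per stream
// (the field names below are assumptions about the container's internal shape).
function add_stream_direct(cfb, name, content) {
	cfb.FullPaths.push(cfb.FullPaths[0] + name); // FullPaths[0] is the root path, e.g. "Root Entry/"
	cfb.FileIndex.push({
		name: name,
		type: 2,          // 2 = stream object in the CFB spec
		content: content,
		size: content.length
	});
}

add_stream_direct(cfb, "stream0", Buffer.from("first stream"));
add_stream_direct(cfb, "stream1", Buffer.from("second stream"));

// Rebuild the directory tree once, after all streams have been added.
CFB.utils.cfb_gc(cfb);
CFB.writeFile(cfb, "out.msg");
```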

The second issue is that the `_write` and `WriteShift` functions do not use `Buffer` capabilities when they are available. By using `Buffer.alloc()` for the initial creation (which guarantees zero-filled initialization), along with `Buffer.copy` for content streams, `Buffer.write` for hex / utf16le strings, and `Buffer`'s various int / uint write methods, we were able to further reduce the output time from 3.5 minutes to 2 minutes.
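
To illustrate, these are the kinds of `Buffer` calls being referred to (the offsets and values here are made up, not the library's actual layout):

```js
var out = Buffer.alloc(512);                  // zero-filled allocation
var content = Buffer.from("stream bytes");

content.copy(out, 0);                         // bulk copy of a content stream
out.write("e11ab1a1e011cfd0", 64, "hex");     // hex string written as raw bytes
out.write("Root Entry", 128, "utf16le");      // UTF-16LE string
out.writeUInt32LE(3, 256);                    // little-endian unsigned int write
```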

If you wish, I would be happy to share my changes, or to work on a pull request which uses `Buffer` functions when available. My current changes don't do any feature detection and instead rely on `Buffer` being available (even in the browser we use [feross/buffer](https://github.com/feross/buffer)), so it would need some more work to maintain functionality in non-Buffer environments.

Thanks

SheetJSDev commented 2018-04-07 15:47:30 +00:00 (Migrated from github.com)

On performance: none of our tests or workflows deal with hundreds of files, let alone thousands of files. The function that sees the most write activity [reshapes XLS VBA blobs to XLSB form](https://github.com/SheetJS/js-xlsx/blob/master/bits/59_vba.js), and the most extreme case I've seen involved about 250 extremely small files, so it's not at all surprising that there are ways to improve performance when adding a large number of files. When this was built, node was still in the `0.8.x` series and the node developers were still working out performance kinks.

  1. The original intention was to ensure that the representation was valid and "complete" after each operation. Among other things, it ensures that all "parent directory" entries are created. But those are re-created at the end anyway, so removing the GC call makes sense.

  2. Older versions of node didn't automatically zero-fill. I agree `Buffer.alloc` should be used when available (it is not available in the `4.x` series IIRC, so a check is necessary). As for the individual utilities like `__writeUInt32LE`, at the time the buffer utility functions were dramatically slower than the hand-rolled solution (this was in the `0.8.x` and `0.10.x` series), and the performance may have improved since then.
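
For illustration, a guarded allocation along these lines (the helper name is made up; the library's actual check may differ):

```js
// Zero-filled allocation that also works on node versions without Buffer.alloc.
function alloc0(sz) {
	if (typeof Buffer !== 'undefined') {
		if (typeof Buffer.alloc === 'function') return Buffer.alloc(sz); // node >= 4.5
		var b = new Buffer(sz);
		b.fill(0); // older node: new Buffer() is not zero-filled by default
		return b;
	}
	return new Uint8Array(sz); // typed arrays are always zero-filled
}
```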

Contributions would be awesome :). To start, we'd accept a PR that just removed the call to `cfb_gc` within the `cfb_add` function.

P.S.: Here's a throwback issue about node function performance https://github.com/nodejs/node-v0.x-archive/issues/7809 affecting the 0.8.x and 0.10.x series

rossj commented 2018-04-07 16:34:59 +00:00 (Migrated from github.com)

Great, thank you for the quick response and background info. I'm sure that my use case is outside of what was originally intended / tested, so no fault to the library for not being optimized just for me :).

Am I correct in thinking that `cfb.flow.js` is the primary source file, and the other `.js` files are derived from it in a build step?

SheetJSDev commented 2018-04-07 16:41:12 +00:00 (Migrated from github.com)

The primary source files are the `bits` files, which are [concatenated in a build step](https://github.com/SheetJS/js-cfb/blob/master/Makefile#L27). The approach is a bit weird given what people use in 2018, but if you make the changes to `cfb.flow.js` directly we can always amend the commit to update the `bits` files.

SheetJSDev commented 2018-04-09 06:59:24 +00:00 (Migrated from github.com)

@rossj we just pushed 1.0.6 with the first part guarded behind the option `unsafe:true`:

```js
CFB.utils.cfb_add(cfb, path, content, {unsafe:true});
```

Runkit unfortunately imposes a memory limit, but https://runkit.com/5acb0cf21599f20012a3e001/5acb0cf2aeee9400120ba682 should demonstrate 4000 files. It uses a small test script that adds 5000-byte files to the FAT and 500-byte files to the mini FAT:

```js
var CFB = require('./');
var cfb = CFB.utils.cfb_new();
var cnt = 20000;
console.log("alloc", new Date());
var bufs = [];
for(var i = 0; i < cnt; ++i) bufs[i] = [Buffer.alloc(500, i), Buffer.alloc(5000, i)];
console.log("start", new Date());
for(var i = 0; i < cnt; ++i) {
	if(!(i%100)) console.log(i, new Date());
	// unsafe:true skips the per-add gc/rebuild
	CFB.utils.cfb_add(cfb, "/short/" + i.toString(16), bufs[i][0], {unsafe:true});
	CFB.utils.cfb_add(cfb, "/long/"  + i.toString(16), bufs[i][1], {unsafe:true});
}
console.log("prewrite", new Date());
// single gc/rebuild pass before writing
CFB.utils.cfb_gc(cfb);
CFB.writeFile(cfb, "out.bin");
console.log("done", new Date());
var cfb2 = CFB.read("out.bin", {type:"file"});
console.log("read", new Date());
for(var i = 0; i < cnt; i += 100) {
	var file = CFB.find(cfb2, "/short/" + i.toString(16));
	if(0 != Buffer.compare(file.content, bufs[i][0])) throw new Error("short " + i);
	file = CFB.find(cfb2, "/long/" + i.toString(16));
	if(0 != Buffer.compare(file.content, bufs[i][1])) throw new Error("long " + i);
}
```
SheetJSDev commented 2021-09-05 00:45:52 +00:00 (Migrated from github.com)

@rossj Before a new release is cut, is there any other change you recommend?

rossj commented 2021-09-05 17:35:59 +00:00 (Migrated from github.com)

First, sorry for being slow with my PRs. I just submitted a change that uses `Buffer.copy` for the file contents, which shows performance improvements for my use case. It relies on the output `Buffer` being zero-filled beforehand. I'm now wondering whether it may be faster to skip the pre-fill and instead use `allocUnsafe()` with `Buffer.fill()` to zero out only the extra padding / byte-alignment bytes between file entries. I'll see if I can run some benchmarks to determine whether one is clearly better than the other.
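
As a rough sketch of the two strategies being weighed (the sector size and variable names here are just for the example):

```js
var SECTOR = 512;
var content = Buffer.from("stream bytes");
var padded = SECTOR * Math.ceil(content.length / SECTOR);

// Strategy A: zero-filled allocation up front, then copy the content in.
var a = Buffer.alloc(padded);
content.copy(a, 0);

// Strategy B: uninitialized allocation, copy the content, then zero only the padding tail.
var b = Buffer.allocUnsafe(padded);
content.copy(b, 0);
b.fill(0, content.length); // zero just the bytes after the content
```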
