long formatted strings are lost (undefined) in workbook.Sheets[] cell values #25

New Issue

2013-04-30T21:57:49Z

mitchellsundt commented

2013-04-30 21:57:49 +00:00

(Migrated from github.com)

When js-xlsx processes a XLSX file with a long formatted text string in a cell, it somehow drops or does not process the cell value -- the value is undefined.

See the XLSX file attached to this issue: http://code.google.com/p/opendatakit/issues/detail?id=820

When js-xlsx processes a XLSX file with a long formatted text string in a cell, it somehow drops or does not process the cell value -- the value is undefined. See the XLSX file attached to this issue: http://code.google.com/p/opendatakit/issues/detail?id=820

mitchellsundt commented

2013-04-30 22:42:55 +00:00

(Migrated from github.com)

First half of fix is:

/* 18.4 Shared String Table */
function parseStrs(data) {
var s = [];
var sst = data.match(new RegExp("<sst ([^>]*)>([\\s\\S]*)<\/sst>","m"));
if(sst) {
    s = sst[2].replace(/<si>/g,"").split(/<\/si>/).map(function(x) { var z = {};
        var y=x.match(/<([^>]*)>([\s\S]*)<\/[^>]*>/);
        if(y) z[y[1].split(" ")[0]]=utf8read(unescapexml(y[2]));
        return z;
    });

    sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount;
}
return s;
}

(the inner match expression has changed).

Since this is apparently a rich text string, it has a 'r' instead of a 't' value, so the retrieval in parseSheet, which assumes:

            case 's': p.v = strs[parseInt(p.v, 10)].t; break;

Will still return undefined.

This should probably still be undefined, and the readme should be updated to note that rich text in cells is not supported by the converter.

The fix to the inner match, however, should be done so that expressions are properly extracted into the Strings array.

First half of fix is: ``` /* 18.4 Shared String Table */ function parseStrs(data) { var s = []; var sst = data.match(new RegExp("<sst ([^>]*)>([\\s\\S]*)<\/sst>","m")); if(sst) { s = sst[2].replace(/<si>/g,"").split(/<\/si>/).map(function(x) { var z = {}; var y=x.match(/<([^>]*)>([\s\S]*)<\/[^>]*>/); if(y) z[y[1].split(" ")[0]]=utf8read(unescapexml(y[2])); return z; }); sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount; } return s; } ``` (the inner match expression has changed). Since this is apparently a rich text string, it has a 'r' instead of a 't' value, so the retrieval in parseSheet, which assumes: ``` case 's': p.v = strs[parseInt(p.v, 10)].t; break; ``` Will still return undefined. This should probably still be undefined, and the readme should be updated to note that rich text in cells is not supported by the converter. The fix to the inner match, however, should be done so that expressions are properly extracted into the Strings array.

Niggler commented

2013-04-30 23:00:47 +00:00

(Migrated from github.com)

@mitchellsundt thanks for reporting this.

I think the "Right Thing" looks like this:

cells have a .r or some other property for the rich text
function to convert those values to proper HTML (replacing special tags
like rFont)
for each cell, the .t will hold a plaintext rendering (strip out the
formatting)

I'm not in front of computer now; I expect to push this later tonight

On Tue, Apr 30, 2013 at 6:42 PM, Mitch Sundt notifications@github.comwrote:

First half of fix is:

/* 18.4 Shared String Table /
function parseStrs(data) {
var s = [];
var sst = data.match(new RegExp("])>([\s\S])</sst>","m"));
if(sst) {
s = sst[2].replace(//g,"").split(/</si>/).map(function(x) { var z = {};
var y=x.match(/<([^>])>([\s\S])</[^>]>/);
if(y) z[y[1].split(" ")[0]]=utf8read(unescapexml(y[2]));
return z;
});
sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount;
}
return s;

}

(the inner match expression has changed).

Since this is apparently a rich text string, it has a 'r' instead of a 't'
value, so the retrieval in parseSheet, which assumes:
        case 's': p.v = strs[parseInt(p.v, 10)].t; break;
Will still return undefined.

This should probably still be undefined, and the readme should be updated
to note that rich text in cells is not supported by the converter.

The fix to the inner match, however, should be done so that expressions
are properly extracted into the Strings array.

—
Reply to this email directly or view it on GitHubhttps://github.com/Niggler/js-xlsx/issues/25#issuecomment-17259038
.

@mitchellsundt thanks for reporting this. I think the "Right Thing" looks like this: - cells have a `.r` or some other property for the rich text - function to convert those values to proper HTML (replacing special tags like `rFont`) - for each cell, the `.t` will hold a plaintext rendering (strip out the formatting) I'm not in front of computer now; I expect to push this later tonight On Tue, Apr 30, 2013 at 6:42 PM, Mitch Sundt notifications@github.comwrote: > First half of fix is: > > /\* 18.4 Shared String Table _/ > function parseStrs(data) { > var s = []; > var sst = data.match(new RegExp("]_)>([\s\S]_)<\/sst>","m")); > if(sst) { > s = sst[2].replace(//g,"").split(/<\/si>/).map(function(x) { var z = {}; > var y=x.match(/<([^>]_)>([\s\S]_)<\/[^>]_>/); > if(y) z[y[1].split(" ")[0]]=utf8read(unescapexml(y[2])); > return z; > }); > > ``` > sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount; > ``` > > } > return s; > > } > > (the inner match expression has changed). > > Since this is apparently a rich text string, it has a 'r' instead of a 't' > value, so the retrieval in parseSheet, which assumes: > > ``` > case 's': p.v = strs[parseInt(p.v, 10)].t; break; > ``` > > Will still return undefined. > > This should probably still be undefined, and the readme should be updated > to note that rich text in cells is not supported by the converter. > > The fix to the inner match, however, should be done so that expressions > are properly extracted into the Strings array. > > — > Reply to this email directly or view it on GitHubhttps://github.com/Niggler/js-xlsx/issues/25#issuecomment-17259038 > .

mitchellsundt commented

2013-04-30 23:26:46 +00:00

(Migrated from github.com)

Here's a revised parseStrs that ensures that the <t>...</t> tags match at the beginning and end and do not have any XML (open angle brackets) between those tags. Strings[] objects then have .t for simple text, and .raw for XML text (because that is a raw series of XML elements -- see below)

/* 18.4 Shared String Table */
function parseStrs(data) {
    var s = [];
    var sst = data.match(new RegExp("<sst ([^>]*)>([\\s\\S]*)<\/sst>","m"));
    if(sst) {
        s = sst[2].replace(/<si>/g,"").split(/<\/si>/).map(function(x) { var z = {};
            var simpleString=x.match(/^<t>([^<]*)<\/t>$/);
            if (simpleString) {
            if (simpleString) {
                z['t'] = utf8read(unescapexml(simpleString[1]));
            } else {
                z['raw'] = x;
            }
            return z;
        });

        sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount;
    }
    return s;
}

In the parseSheet function, I did this for the 's' case:

            case 's': {
                    var sval = strs[parseInt(p.v, 10)];
                    if ( "t" in sval ) {
                        p.v = sval.t;
                    } else {
                        delete p.v;
                        p.raw = sval.raw;
                    }
                }
                break;

I.e., put the XML string into the .raw field. My example XLSX file has the following for the 'raw' string:

"<r><t xml:space="preserve">This inserts a 'joins' entry into the column_definitions table for the </t></r><r><rPr><b/><sz val="10"/><color rgb="FF000000"/><rFont val="Arial"/><family val="2"/></rPr><t>household_id</t></r><r><rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t xml:space="preserve"> column of the </t></r><r><rPr><b/><sz val="10"/><color rgb="FF000000"/><rFont val="Arial"/><family val="2"/></rPr><t>household</t></r><r><rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t xml:space="preserve"> table_id of the form: 

"[ { table_id: household_member, element_name: household_id } ]"
</t></r>"

Note that this is just a mixed series of <r> and <t> tags so it isn't wrapped by any enclosing tag

Here's a revised parseStrs that ensures that the <t>...</t> tags match at the beginning and end and do not have any XML (open angle brackets) between those tags. Strings[] objects then have .t for simple text, and .raw for XML text (because that is a raw series of XML elements -- see below) ``` /* 18.4 Shared String Table */ function parseStrs(data) { var s = []; var sst = data.match(new RegExp("<sst ([^>]*)>([\\s\\S]*)<\/sst>","m")); if(sst) { s = sst[2].replace(/<si>/g,"").split(/<\/si>/).map(function(x) { var z = {}; var simpleString=x.match(/^<t>([^<]*)<\/t>$/); if (simpleString) { if (simpleString) { z['t'] = utf8read(unescapexml(simpleString[1])); } else { z['raw'] = x; } return z; }); sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount; } return s; } ``` In the parseSheet function, I did this for the 's' case: ``` case 's': { var sval = strs[parseInt(p.v, 10)]; if ( "t" in sval ) { p.v = sval.t; } else { delete p.v; p.raw = sval.raw; } } break; ``` I.e., put the XML string into the .raw field. My example XLSX file has the following for the 'raw' string: ``` "<r><t xml:space="preserve">This inserts a 'joins' entry into the column_definitions table for the </t></r><r><rPr><b/><sz val="10"/><color rgb="FF000000"/><rFont val="Arial"/><family val="2"/></rPr><t>household_id</t></r><r><rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t xml:space="preserve"> column of the </t></r><r><rPr><b/><sz val="10"/><color rgb="FF000000"/><rFont val="Arial"/><family val="2"/></rPr><t>household</t></r><r><rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t xml:space="preserve"> table_id of the form: "[ { table_id: household_member, element_name: household_id } ]" </t></r>" ``` Note that this is just a mixed series of <r> and <t> tags so it isn't wrapped by any enclosing tag

Niggler commented

2013-05-01 05:05:54 +00:00

(Migrated from github.com)

@mitchellsundt I pushed an update that puts the plaintext in the .t field even if the text is rich (and populates the .r field for plain text as well as rich text), so for plaintext uses (like CSV) this should suffice. The other half (actually parsing the rich text and generating HTML or other output) is in the pipeline.

As for missing support, the ultimate goal is 100% coverage (so any bug or missing feature is very much undesired

@mitchellsundt I pushed an update that puts the plaintext in the .t field even if the text is rich (and populates the .r field for plain text as well as rich text), so for plaintext uses (like CSV) this should suffice. The other half (actually parsing the rich text and generating HTML or other output) is in the pipeline. As for missing support, the ultimate goal is 100% coverage (so any bug or missing feature is very much undesired

mitchellsundt commented

2013-05-01 16:30:25 +00:00

(Migrated from github.com)

Thanks!

On Tue, Apr 30, 2013 at 10:05 PM, Niggler notifications@github.com wrote:

@mitchellsundt https://github.com/mitchellsundt I pushed an update that
puts the plaintext in the .t field even if the text is rich (and populates
the .r field for plain text as well as rich text), so for plaintext uses
(like CSV) this should suffice. The other half (actually parsing the rich
text and generating HTML or other output) is in the pipeline.

As for missing support, the ultimate goal is 100% coverage (so any bug or
missing feature is very much undesired

—
Reply to this email directly or view it on GitHubhttps://github.com/Niggler/js-xlsx/issues/25#issuecomment-17268924
.

Mitch Sundt
Software Engineer
University of Washington
mitchellsundt@gmail.com

Thanks! On Tue, Apr 30, 2013 at 10:05 PM, Niggler notifications@github.com wrote: > @mitchellsundt https://github.com/mitchellsundt I pushed an update that > puts the plaintext in the .t field even if the text is rich (and populates > the .r field for plain text as well as rich text), so for plaintext uses > (like CSV) this should suffice. The other half (actually parsing the rich > text and generating HTML or other output) is in the pipeline. > > As for missing support, the ultimate goal is 100% coverage (so any bug or > missing feature is very much undesired > > — > Reply to this email directly or view it on GitHubhttps://github.com/Niggler/js-xlsx/issues/25#issuecomment-17268924 > . ## Mitch Sundt Software Engineer University of Washington mitchellsundt@gmail.com

Sign in to join this conversation.