Support for parsing Comments #43
No reviewers
Labels
No Label
DBF
Dates
Defined Names
Features
Formula
HTML
Images
Infrastructure
Integration
International
ODS
Operations
Performance
PivotTables
Pro
Protection
Read Bug
SSF
SYLK
Style
Write Bug
good first issue
No Milestone
No Assignees
1 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: sheetjs/sheetjs#43
Loading…
Reference in New Issue
No description provided.
Delete Branch "comment_support"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Thanks for a great library!
Here is my best attempt at contributing support to parse comments.
I used an existing file already in the project for the regressions tests.
I realize I coded the changes directly in the built xlsx.
Let me know if you want me to change something.
Thanks!
Comments parts listed in the [Content Types] are parsed.
Sheets's relationships are parsed.
Comments parts are correlated to their corresponding sheets parts.
Comments's contents are added to the ref'ed cells.
Rich text styling properties are currently ignored.
For example:
The test failed because I moved the POI test cases to a separate file. I pushed changes to the test_files repo and this repo's tests.lst -- can you pull the changes,
make init
thenmake test
?OK thanks for your attention. We rebased our branch on the top of your latest master.
We are working on putting the changes into the bits files to support the tests as expected by Travis.
@kinwahlai I can reconcile that later. According to https://travis-ci.org/SheetJS/js-xlsx/jobs/17037467, Travis is complaining because the file
SimpleWithComments.xlsx
was renamed toapachepoi_SimpleWithComments.xlsx
. The fix is in the test script:Hmm @kinwah @kinwahlai I thought I made a comment yesterday but that seemed to have disappeared ...
Can you take a look at https://github.com/SheetJS/test_files/blob/master/comments_stress_test.xlsx ? In the first sheet, there is a rich-text comment and the raw text is broken up. It may make more sense to concatenate rather than push the raw comments.
Also, do you need the rich text as well? There is some basic HTML conversion (search for 18.4.7 in the code) which really should be moved out of the shared string table function
Hi @SheetJSDev; well in our usecase we need to know the internal structure of the comments. And we are not interested by the styling as html.
For example: to find out that a google-spreadsheet's note matches a comment we make sure we see single rich text in the comment. We put some JSON in some of those notes and we dont want the html styling to appear in there.
When we see multiple rich texts we discard them as they are not notes.
In fact we have the same requirement with number formats: we need to know about the number-formats as defined in the spreadsheet and we are not interested in the formatting of the raw values for them.
For the sake of consistency with the rest of the code we can certainly help with doing to the comments the same type of things than what happens for the other rich texts.
What do you think?
@kinwahlai my point was according to the spec, the text tags don't have author information. It's at the outer comment tag level (see my annotations below):
Hence I would think it makes sense to combine the t tags. I have not played with google docs much, so I don't know how they do it (and my comment may not make sense for gdocs). Can you share an example file with comments from multiple users?
Also, Excel generates an empty tag for comments with no text. See the comments_stress_test.xlsx file (https://github.com/SheetJS/test_files/blob/master/comments_stress_test.xlsx):
You need to guard for the possibility that
x.match(/<text>([^\u2603]*)<\/text>/m)
is nullHi @SheetJSDev, @kinwahlai and I are pairing on this code.
The example sheet we were looking at in the openxml spec had 2 rich texts; I mistook it for the author and body of the comment.
Thanks for clarifying this. In fact we are parsing the author as expected.
Thanks for noting the bug regarding the potentially null text element that would crash the parser.
OK to concatenate the comments as a first step and to pull out the code that applies the comments styles as html styles. It does not really match our usecase and I think it would make our life harder if we were trying to generate xlsx files from the generated model.
But we don't know better than that for now at the moment ;-)
We will add a few more commits to that branch for your kind review then.
@kinwahlai @hmalphettes Keep in mind that the goal here is to find a solution that makes "sense". There is no standard solution here, which is why discussion is good :)
If you use the Excel UI "Insert Comment" option, it inserts the author name followed by a colon (in bold). However, since that is actual content in the comment, you can change it or remove it (and that's how you generate the empty comment in the sample sheet).
Google docs does some wild stuff with the comments. A friend and I tried to make individual comments as well as comment "conversation", and apparently gdocs is not writing author information using the author tag (authors are identified using a dash followed by the name):
complete comments.xml from exported XLSX:
@SheetJSDev thanks for trying all these examples.
And thanks for reminding us of the bigger cause.
In our narrow usecase what we wanted to access were google's 'notes'
Once saved as an xlsx, they show up like this:
The simplest case of them all so far.
@hmalphettes @kinwahlai Playing with gdocs a bit more:
google docs comments and notes are plaintext. The UI is a text field and the generated xml only uses
t
tagseach comment has only one text tag. I can't seem to generate a case where gdocs generates a comment with multiple text tag children.
Google treats conversations and comments differently (notes and comments are separated with
----
when a note is present, but do not appear in a cell with only comments):OK @SheetJSDev thanks again for looking into more spreadsheets. I am convinced.
I have pulled up the code that parses the rich text for sst and reused it to parse the comments.
It does the concatenation and passes the raw values along.
That would work fine for our use case.
Here is what the comment of the test xlsx looks like now:
Let us know!
@hmalphettes @kinwahlai Thanks for the patches! Now for the difficult part (figuring out how to do the same thing for https://github.com/SheetJS/js-xls)
Oh boy xls! Sorry we are putting this new feature on your plate.
Cheers!
Hugues