This whole situation began with a simple question, "How do I remove XML comments in JavaScript?". The Internet hivemind converged on one general approach: <b>regular expressions</b>.
This has one big problem. If you're not careful, parsing relatively small amounts of data could lead to browsers or servers freezing for extended periods of time. This weakness is formally catalogued as CWE-1333[^5], "Inefficient Regular Expression Complexity". Some resources also use the phrase "catastrophic backtracking" to describe the issue.
This discussion focuses on what we've come to call "Regexide", which we've defined as the act of identifying and replacing flawed regular expressions with other techniques that better reflect the intended effect. Let's look at a few examples.
For the purposes of this discussion, it's important to understand exactly what XML comments are. XML comments are special notes that parsers should not treat as data. XML comments start with `<!--` and end with `-->`. Technically XML comments must not contain the string `--` within the comment body. Many programs and people write invalid XML comments, so parsers will typically allow for nested `--`.
[Nunjucks](https://github.com/mozilla/nunjucks/blob/ea0d6d5396d39d9eed1b864febb36fbeca908f23/nunjucks/src/filters.js#L491) used this regular expression within the `striptags` filter:
[PrettierJS](https://github.com/prettier/prettier/blob/45ad4668ebc133621c7f94e678ce399cab318068/scripts/lint-changelog.js#L51) used this regular expression in the build sequence:
[RollupJS](https://github.com/rollup/rollup/blob/18372035f167ec104280e1e91ef795e4f7033f1e/scripts/release-helpers.js#L76) used this regular expression in the build sequence:
[ViteJS](https://github.com/vitejs/vite/blob/9fc5d9cb3a1b9df067e00959faa9da43ae03f776/packages/vite/src/node/optimizer/scan.ts#L259) used the nascent `s` flag to ensure `.` matches newline characters:
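As a small illustration of the flag itself, the `s` ("dotAll") flag makes `.` behave like the older `[\s\S]` idiom. The patterns and variable names below are illustrative assumptions and are not copied from the ViteJS source:

```js
// Illustrative patterns only; not the exact expressions used by ViteJS.
const withoutDotAll = /<!--.*?-->/; // `.` stops at line terminators
const withDotAll = /<!--.*?-->/s; // `.` also matches "\n"
const classicIdiom = /<!--[\s\S]*?-->/; // pre-`s` idiom with the same effect

const multilineComment = "<!-- line one\nline two -->";

console.log(withoutDotAll.test(multilineComment)); // false
console.log(withDotAll.test(multilineComment)); // true
console.log(classicIdiom.test(multilineComment)); // true
```

The flag changes what `.` matches, not how the engine searches, so the `s` form presumably shares the performance characteristics of the `[\s\S]` form.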
[WordPress](https://github.com/WordPress/WordPress/blob/master/wp-admin/js/word-count.js#L73) used regular expressions in the word count calculator:
```js
HTMLcommentRegExp: /<!--[\s\S]*?-->/g,
```
[Element Plus](https://github.com/element-plus/element-plus/blob/4ac4750158fa634aa9da186111bce86c2898fda2/internal/build/src/tasks/helper.ts#L60) used a similar regular expression to match blocks starting with `<del>` and ending with `</del>`:
It's surprising to see that most resources recommend this approach. A prominent O'Reilly textbook, "Regular Expressions Cookbook"[^3], explicitly recommends `/<!--[\s\S]*?-->/` in section 9.9 for matching XML comments. **StackOverflow Answers** recommend this regular expression and variants such as `/<!--[\s\S\n]*?-->/` (which are, for all practical purposes, equivalent). **ChatGPT4** has also recommended the same regular expression. It also generated code for a completely unrelated tag. 🙄
Consider a string that repeats the header part `<!--` many times. In general, this type of string can be generated in JavaScript using `String.prototype.repeat`:
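A minimal sketch of such a test, assuming the lazy-quantifier pattern discussed above. The helper names `buildInput` and `timeReplace` are hypothetical, and the repetition counts are kept small so the script finishes quickly:

```js
// The lazy-quantifier pattern discussed above.
const lazyCommentRegExp = /<!--[\s\S]*?-->/g;

// Build a string with many comment openers and no terminator.
function buildInput(repetitions) {
  return "<!--".repeat(repetitions);
}

// Time a single replace call; `elapsed` is in milliseconds.
function timeReplace(repetitions) {
  const input = buildInput(repetitions);
  const start = Date.now();
  const output = input.replace(lazyCommentRegExp, "");
  const elapsed = Date.now() - start;
  // Nothing matches, so the output should be identical to the input.
  return { length: input.length, unchanged: output === input, elapsed };
}

// Doubling the repetition count makes the growth trend visible.
for (const n of [2000, 4000, 8000]) {
  console.log(n, timeReplace(n));
}
```

Absolute timings depend on the engine and hardware, so only the trend across the rows is meaningful.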
Results are from local tests on a 2019 Intel i9 MacBook Pro. The following chart displays runtime in seconds (vertical axis) as a function of repetitions (horizontal axis). The quadratic trend line closely fits the data.
<img src="./data/js.png" style="max-height:200px" alt="javascript performance test - quadratic complexity"/>
[Download the raw data as a CSV](./data/js.csv)
When the number of repetitions doubled, the runtime roughly quadrupled. This is a "quadratic" relationship.
The regular expression matches a string that starts with `<!--` and ends with `-->`. Consider a function that repeatedly looks for the `<!--` string and tries to find the first `-->` that appears afterwards. Computer scientists classify this algorithm as "Backtracking"[^4]:
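A minimal sketch of that search, assuming it only needs to report the first match. The function name `findFirstComment` is hypothetical; `start_index` follows the naming used in this discussion:

```js
// Return the first substring that starts with "<!--" and ends with "-->",
// or null when there is no such substring.
function findFirstComment(str) {
  let start_index = str.indexOf("<!--");
  while (start_index !== -1) {
    // Scan forward from just past the opener, looking for a terminator.
    const end_index = str.indexOf("-->", start_index + 4);
    if (end_index !== -1) {
      return str.slice(start_index, end_index + 3);
    }
    // No terminator found: give up on this opener and try the next one.
    // This retry step is the "backtracking".
    start_index = str.indexOf("<!--", start_index + 1);
  }
  return null;
}

console.log(findFirstComment("a<!-- b -->c")); // "<!-- b -->"
console.log(findFirstComment("<!--".repeat(5))); // null
```

This sketch models the behavior described here; real engines are more elaborate, but the retry-from-the-next-opener structure is the part that matters for the analysis.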
Everyone writes high-performance code in Rust, right? Rust does not have built-in support for regular expressions. The Rust `regress`[^6] crate is designed for JavaScript regular expressions. It represents a true apples-to-apples comparison with JavaScript. `regress` shows the same quadratic behavior as other JavaScript regular expression engines.
If `-->` is not in the string, the scan `str.indexOf("-->", start_index + 4)` will look at every character in the string starting from `start_index + 4`. In the worst case, with repeated `<!--`, the scan will start from index `4`, then index `8`, then index `12`, etc.
The following diagram shows the first three scans when running the function against the string formed by repeating `<!--` 5 times. The `<!--` matches are highlighted in yellow and the scans for the `-->` are highlighted in blue.
In the worst case, the number of characters scanned is roughly proportional to the square of the length of the string. In "Big-O Notation", the complexity is $O(L^2)$. This is colloquially described as a "quadratic blowup".
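To see where the square comes from, suppose the string is `<!--` repeated $n$ times, so $L = 4n$. The $k$-th scan starts at index $4k$ and, finding no `-->`, runs to the end of the string, examining roughly $L - 4k$ characters. Summing over all scans gives an estimate that ignores small boundary effects:

$$
\sum_{k=1}^{n} (L - 4k) = 4n^2 - 2n(n+1) = 2n(n-1) \approx \frac{L^2}{8}
$$

Doubling $n$ therefore roughly quadruples the work, which matches the measured trend.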
There are a few general approaches to address the issue.
### Use a Different Engine
By limiting the supported feature set, other regular expression engines can offer stricter performance guarantees.
#### NodeJS
The `re2`[^7] C++ engine sacrifices backreference and lookaround support for performance. There are bindings for many server-side programming languages.
The `re2`[^8] NodeJS package is a native binding to the C++ engine and can be used in server-side environments. With modern versions of NodeJS, normal regular expressions can be wrapped with `RE2`:
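A minimal sketch, assuming the `re2` package has been installed with `npm install re2`. The fallback to the built-in `RegExp` is an addition for illustration only, so the snippet still runs where the native module is unavailable. The names `wrappedRegExp` and `stripWithRE2` are hypothetical:

```js
// Assumes `npm install re2`. If the native module cannot be loaded, fall
// back to the built-in RegExp (which keeps the quadratic behavior).
let RE2 = null;
try {
  RE2 = require("re2");
} catch (err) {
  RE2 = null;
}

const wrappedSource = /<!--[\s\S]*?-->/g;
// RE2 accepts an existing RegExp object, preserving its source and flags.
const wrappedRegExp = RE2 ? new RE2(wrappedSource) : wrappedSource;

function stripWithRE2(str) {
  // RE2 objects can be passed to String.prototype.replace.
  return str.replace(wrappedRegExp, "");
}

console.log(stripWithRE2("a<!-- note -->b")); // "ab"
```

In production code the fallback would normally be dropped so that a missing module fails loudly instead of silently reverting to the vulnerable engine.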
[PrettierJS](https://github.com/prettier/prettier/blob/ff83d55d05e92ceef10ec0cb1c0272ab894a00a0/src/language-markdown/mdx.js#L28) uses a regular expression in the MDX parser that enforces the XML constraint:
Commonly-used regular expression engines can optimize for this pattern and avoid backtracking.
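A minimal sketch of one way to encode that constraint, following the `Comment` production in the XML 1.0 grammar. This is not necessarily the exact expression used by PrettierJS, and the names `strictXmlComment` and `stripStrictXmlComments` are hypothetical:

```js
// The comment body is a run of non-hyphen characters, or a single hyphen
// followed by a non-hyphen. A "--" in the body therefore ends the attempt.
const strictXmlComment = /<!--(?:[^-]|-[^-])*-->/g;

function stripStrictXmlComments(str) {
  return str.replace(strictXmlComment, "");
}

console.log(stripStrictXmlComments("a<!-- ok - fine -->b")); // "ab"

// A body containing "--" is not a valid XML comment and is left in place.
console.log(stripStrictXmlComments("a<!-- bad -- body -->b"));
// "a<!-- bad -- body -->b"
```

Because every `<!--` opener itself contains `--`, an attempt that starts at one opener cannot run past the next opener, which is why the repeated-`<!--` input no longer triggers long rescans.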
!!! info Spreadsheet Engines
The XML parser in Excel powering the [Excel Workbook (XLSX) format](https://docs.sheetjs.com/docs/miscellany/formats/#excel-2007-xml-xlsxxlsm) expects proper XML comments with no `--` in the comment body.
The XML parser in Excel powering the [Excel 2003-2004 (SpreadsheetML) format](https://docs.sheetjs.com/docs/miscellany/formats#excel-2003-2004-spreadsheetml) allows `--` in the comment body.
#### HTML Comments
The HTML5 standard[^11] permits `--` but forbids `<!--` within comment text. For example, the following comment is not valid according to the standard:
<pre class="language-text">
&lt;!-- I used to be a programmer like you, then I took an <span style="text-decoration-line:underline;text-decoration-color:red;text-decoration-style:wavy;">&lt;!--</span> in the Kleene -->
</pre>
[yt-dlp](https://github.com/yt-dlp/yt-dlp/blob/95e82347b398d8bb160767cdd975edecd62cbabd/yt_dlp/extractor/common.py#L1709) uses a regular expression with a negative lookahead to ensure `<!--` does not appear in the body:
```python
html = re.sub(r'<!--(?:(?!<!--).)*-->', '', html)
```
This expression allows `--` but disallows `<!--` in the comment body. In practice, it will match comments starting from the innermost `<!--`. Using the previous example:
<pre class="language-text">
&lt;!-- I used to be a programmer like you, then I took an <span style="background-color: #FFFF00">&lt;!-- in the Kleene --></span>
</pre>
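The same behavior can be reproduced in JavaScript. This is an approximate rendering of the Python pattern, not code from yt-dlp, and the variable names are hypothetical:

```js
// As in Python without re.DOTALL, `.` does not match line terminators here.
const innermostComment = /<!--(?:(?!<!--).)*-->/g;

const nestedSample =
  "<!-- I used to be a programmer like you, then I took an <!-- in the Kleene -->";

// Only the span starting at the innermost opener is removed.
const strippedSample = nestedSample.replace(innermostComment, "");
console.log(JSON.stringify(strippedSample));
// "<!-- I used to be a programmer like you, then I took an "
```

The outer `<!--` survives because the negative lookahead refuses to step over the inner opener, so no match can begin at the outer one.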
!!! info Web Browsers
Web browsers generally allow `<!--` in comments. Text between the first `<!--` and the first `-->` is treated as a comment. For example, consider the following HTML:
```html
<pre><!-- this is a nested comment <!----> --> more text</pre>
| |^^^^^^^^^^^^^^ --- content
| this is interpreted as a comment |
```
This exact HTML code is added below:
<pre><!-- this is a nested comment <!----> --> more text</pre>
Chromium and other browsers will display `--> more text`.
### Remove the Regular Expression
Regular expression operations can be reimplemented using standard string operations.
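A minimal sketch of such a replacement, assuming the goal is to match what `str.replace(/<!--[\s\S]*?-->/g, "")` would return. The function name `removeComments` is hypothetical:

```js
// Remove every span that starts with "<!--" and ends with the next "-->".
function removeComments(str) {
  const kept = [];
  let pos = 0;
  while (pos < str.length) {
    const start = str.indexOf("<!--", pos);
    if (start === -1) break;
    const end = str.indexOf("-->", start + 4);
    // If no terminator exists after this opener, none exists after any
    // later opener either, so stop instead of rescanning.
    if (end === -1) break;
    kept.push(str.slice(pos, start));
    pos = end + 3;
  }
  // Keep whatever follows the last removed comment.
  kept.push(str.slice(pos));
  return kept.join("");
}

console.log(removeComments("a<!-- x -->b<!-- y -->c")); // "abc"
```

The early exit on a missing `-->` is the design choice that matters: each character is examined a bounded number of times, so the repeated-`<!--` input is processed in roughly linear time.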
In the places where ViteJS used the vulnerable regular expression, the text was validated using a separate HTML parser.
It is still strongly recommended to replace the regular expression.
### Limit to Trusted Data
PrettierJS and RollupJS use the vulnerable regular expression in internal scripts. The expressions are not used or added in websites. The data sources are trusted and malformed data can be corrected manually.
## Special Thanks
Special thanks to [Asadbek](https://asadbek.dev/), [Jardel](http://francoatmega.com/), and members of the [SheetJS team](https://sheetjs.com) for early feedback.
[^1]: See ["Origin and Goals"](https://www.w3.org/TR/REC-xml/#sec-origin-goals) in the Extensible Markup Language (XML) 1.0 specification.
[^2]: The theoretical underpinnings of modern regular expressions were established in the working paper ["Representation of Events in Nerve Nets and Finite Automata"](https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM704.pdf).
[^3]: See ["9.9 Remove XML-Style Comments"](https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch09s10.html) on the official site for the book.
[^4]: See [the Wikipedia article for "Backtracking"](https://en.wikipedia.org/wiki/Backtracking) for more details and resources.
[^5]: See [the definition in the "CWE List"](https://cwe.mitre.org/data/definitions/1333.html) for more details and resources.
[^6]: See [the listing for the `regress` crate](https://crates.io/crates/regress) for more details.
[^7]: See [the `google/re2` project on GitHub](https://github.com/google/re2) for more details.
[^8]: See [the listing for the `re2` NodeJS package](https://www.npmjs.com/package/re2) for more details.
[^9]: See [the listing for the `regex` crate](https://crates.io/crates/regex) for more details.
[^10]: See ["Comments"](https://www.w3.org/TR/REC-xml/#sec-comments) in the XML 1.0 specification.
[^11]: See ["Comments"](https://html.spec.whatwg.org/multipage/syntax.html#comments) in the WHATWG HTML Living Standard.