src: support UTF-8 in compiled-in JS source files#11129
src: support UTF-8 in compiled-in JS source files#11129bnoordhuis merged 2 commits intonodejs:masterfrom
Conversation
|
I rather feared this approach may be a bit too forward-looking for some compilers... I'll see if I can dumb it down. |
tools/js2c.py
Outdated
| # Treat non-ASCII as UTF-8 and convert it to UTF-16. | ||
| if any(ord(c) > 127 for c in lines): | ||
| ctype = 'uint16_t' | ||
| data = map(ord, lines.decode('utf-8').encode('utf-16be')) |
There was a problem hiding this comment.
Maybe 'utf-16be' if sys.byteorder == 'little' else 'utf-16le'?
There was a problem hiding this comment.
The big-endian encoding is intentional, the line below stitches the individual octets together again into uint16s. (.encode() returns a byte string in case you're wondering.)
|
Just curious, have there been specific issues or is this purely preventative? |
|
What's the reason for using UTF-16 rather than UTF-8 (out of interest) to store the files in the binary? |
|
I would have to assume it is because the external string API does not directly support UTF8. That is, if you look at |
|
@gibfahn so that they can be used as external strings. UTF-8 strings must be copied to V8's managed memory. See #5458 by @bnoordhuis. |
src/node_javascript.cc
Outdated
| using v8::Object; | ||
| using v8::String; | ||
|
|
||
| template <typename T, size_t N, T P> |
There was a problem hiding this comment.
Tiny suggestion: Could we be a bit more explicit and rename P to Ptr or Pointer? 😄
|
@bnoordhuis ping. Did you have time to dumb it down? 😉 This patch seems to work on GCC 4.8+. Diff
diff --git a/src/node_javascript.cc b/src/node_javascript.cc
index 8c665381db..07a0daaa83 100644
--- a/src/node_javascript.cc
+++ b/src/node_javascript.cc
@@ -16,16 +16,16 @@ using v8::String;
template <typename T, size_t N, T P>
struct ExternalStringResource;
-template <size_t N, const char (&P)[N]>
-struct ExternalStringResource<const char[N], N, P>
+template <size_t N, const char* P>
+struct ExternalStringResource<const char*, N, P>
: public String::ExternalOneByteStringResource {
const char* data() const override { return P; }
size_t length() const override { return N; }
void Dispose() override { /* Default calls `delete this`. */ }
};
-template <size_t N, const uint16_t (&P)[N]>
-struct ExternalStringResource<const uint16_t[N], N, P>
+template <size_t N, const uint16_t* P>
+struct ExternalStringResource<const uint16_t*, N, P>
: public String::ExternalStringResource {
const uint16_t* data() const override { return P; }
size_t length() const override { return N; }
@@ -34,7 +34,7 @@ struct ExternalStringResource<const uint16_t[N], N, P>
// id##_data is defined in node_natives.h.
#define V(id) \
- static ExternalStringResource<decltype(id##_data), \
+ static ExternalStringResource<decltype(&id##_data[0]), \
arraysize(id##_data), \
id##_data> id##_external_data;
NODE_NATIVES_MAP(V)
diff --git a/tools/js2c.py b/tools/js2c.py
index e6c56cf075..8104e50e5a 100755
--- a/tools/js2c.py
+++ b/tools/js2c.py
@@ -189,7 +189,7 @@ NODE_NATIVES_MAP = """\
SOURCES = """\
static const uint8_t {escaped_id}_name[] = {{
{name}}};
-static const {ctype} {escaped_id}_data[] = {{
+static constexpr const {ctype} {escaped_id}_data[] = {{
{data}}};
"""
|
|
@aqrln Thanks. I tried a couple of different approaches but I ended up settling for something that's pretty similar to your patch. :-) I added a second commit that replaces a few U+2019 quotes with ASCII quotes so that their files can be stored as one-byte strings. Can I get a quick LGTM or two? EDIT: The CI scoreboard is currently linking to a previous CI build. The correct one is https://ci.nodejs.org/job/node-test-commit/7713/. |
Probably one for another PR, but this sounds like a good candidate for a lint rule. |
|
Interesting VS build failure...
I suppose the compiler is technically correct, and wasn't that the best kind of correct? EDIT: New attempt: https://ci.nodejs.org/job/node-test-pull-request/6243/ |
|
Different take: moved all the logic to js2c.py. CI: https://ci.nodejs.org/job/node-test-pull-request/6357/ EDIT: Sigh... |
|
CI is finally green except for that silly issue where the ARM buildbots don't report their status properly. @addaleax The PR changed quite a bit, perhaps you want to take another look? |
node.gyp
Outdated
There was a problem hiding this comment.
You'd expect INTERMEDIATE_DIR to work here but it doesn't: js2c generates the file but the build doesn't pick it up (and I updated the 'sources' list in case you're wondering.) I decided to leave well enough alone for now.
Detect it when source files in lib/ are not ASCII. Decode them as UTF-8 and store them as UTF-16 in the binary so they can be used as external string resources without non-ASCII characters getting mangled. Fixes: nodejs#10673 PR-URL: nodejs#11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: James M Snell <jasnell@gmail.com>
The previous commit stores baked-in files with non-ASCII characters as UTF-16. Replace the \u2019 with a regular quote character so that the files they're in can be stored as one-byte strings. The UTF-16 functionality is still tested by the Unicode diagram in lib/timers.js. PR-URL: nodejs#11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: James M Snell <jasnell@gmail.com>
Detect it when source files in lib/ are not ASCII. Decode them as UTF-8 and store them as UTF-16 in the binary so they can be used as external string resources without non-ASCII characters getting mangled. Fixes: nodejs#10673 PR-URL: nodejs#11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: James M Snell <jasnell@gmail.com>
The previous commit stores baked-in files with non-ASCII characters as UTF-16. Replace the \u2019 with a regular quote character so that the files they're in can be stored as one-byte strings. The UTF-16 functionality is still tested by the Unicode diagram in lib/timers.js. PR-URL: nodejs#11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: James M Snell <jasnell@gmail.com>
In order to allow using Unicode characters inside comments of built-in JavaScript libraries without forcing them to be stored as UTF-16 data in Node's binary, update the tooling to strip comments during build process. All line breaks are preserved so that line numbers in stack traces aren't broken. Refs: nodejs#11129 Refs: nodejs#11371 (comment)
This test ensures that UTF-8 characters can be used in core JavaScript modules built into Node's binary. Refs: nodejs#11129
Notable changes:
* deps:
* update V8 to 5.5 (Michaël Zasso) [#11029](#11029)
* upgrade libuv to 1.11.0 (cjihrig) [#11094](#11094)
* add node-inspect 1.10.2 (Jan Krems) [#10187](#10187)
* lib: build `node inspect` into `node` (Anna Henningsen) [#10187](#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [#9469](#9469)
* inspector: add --inspect-brk (Josh Gavant) [#11149](#11149)
* fs: allow WHATWG URL and file: URLs as paths (James M Snell) [#10739](#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [#11129](#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [#10857](#10857)
PR-URL: #11185
Notable changes:
* deps:
* update V8 to 5.5 (Michaël Zasso) [nodejs#11029](nodejs#11029)
* upgrade libuv to 1.11.0 (cjihrig) [nodejs#11094](nodejs#11094)
* add node-inspect 1.10.4 (Jan Krems) [nodejs#10187](nodejs#10187)
* upgrade zlib to 1.2.11 (Sam Roberts) [nodejs#10980](nodejs#10980)
* lib: build `node inspect` into `node` (Anna Henningsen) [nodejs#10187](nodejs#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [nodejs#9469](nodejs#9469)
* inspector: add --inspect-brk (Josh Gavant) [nodejs#11149](nodejs#11149)
* fs: allow WHATWG URL objects as paths (James M Snell) [nodejs#10739](nodejs#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [nodejs#11129](nodejs#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [nodejs#10857](nodejs#10857)
PR-URL: nodejs#11185
Notable changes:
* deps:
* update V8 to 5.5 (Michaël Zasso) [nodejs#11029](nodejs#11029)
* upgrade libuv to 1.11.0 (cjihrig) [nodejs#11094](nodejs#11094)
* add node-inspect 1.10.4 (Jan Krems) [nodejs#10187](nodejs#10187)
* upgrade zlib to 1.2.11 (Sam Roberts) [nodejs#10980](nodejs#10980)
* lib: build `node inspect` into `node` (Anna Henningsen) [nodejs#10187](nodejs#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [nodejs#9469](nodejs#9469)
* inspector: add --inspect-brk (Josh Gavant) [nodejs#11149](nodejs#11149)
* fs: allow WHATWG URL objects as paths (James M Snell) [nodejs#10739](nodejs#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [nodejs#11129](nodejs#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [nodejs#10857](nodejs#10857)
PR-URL: nodejs#11185
Notable changes:
* deps:
* update V8 to 5.5 (Michaël Zasso) [#11029](nodejs/node#11029)
* upgrade libuv to 1.11.0 (cjihrig) [#11094](nodejs/node#11094)
* add node-inspect 1.10.4 (Jan Krems) [#10187](nodejs/node#10187)
* upgrade zlib to 1.2.11 (Sam Roberts) [#10980](nodejs/node#10980)
* lib: build `node inspect` into `node` (Anna Henningsen) [#10187](nodejs/node#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [#9469](nodejs/node#9469)
* inspector: add --inspect-brk (Josh Gavant) [#11149](nodejs/node#11149)
* fs: allow WHATWG URL objects as paths (James M Snell) [#10739](nodejs/node#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [#11129](nodejs/node#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [#10857](nodejs/node#10857)
PR-URL: nodejs/node#11185
Signed-off-by: Ilkka Myller <ilkka.myller@nodefield.com>
|
Needs a backport PR to land in v4. |
Detect it when source files in lib/ are not ASCII. Decode them as UTF-8 and store them as UTF-16 in the binary so they can be used as external string resources without non-ASCII characters getting mangled. Fixes: #10673 PR-URL: #11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: James M Snell <jasnell@gmail.com>
Detect it when source files in lib/ are not ASCII. Decode them as UTF-8 and store them as UTF-16 in the binary so they can be used as external string resources without non-ASCII characters getting mangled. Fixes: #10673 PR-URL: #11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: James M Snell <jasnell@gmail.com>
This test ensures that UTF-8 characters can be used in core JavaScript modules built into Node's binary. PR-URL: #11423 Ref: #11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com>
This test ensures that UTF-8 characters can be used in core JavaScript modules built into Node's binary. PR-URL: nodejs#11423 Ref: nodejs#11129 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com>
Detect it when source files in lib/ are not ASCII. Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.
CI: https://ci.nodejs.org/job/node-test-pull-request/6176/