This repository was archived by the owner on Jan 25, 2022. It is now read-only.
forked from goyakin/es-regexp-named-groups
Options for disambiguating the \k backreferences #7
Closed
Description
Backreferences to named capture groups have the surface syntax \k<foo>, but that already has semantics in non-Unicode RegExps. A few options have been discussed for the issue:
1. Named capture groups only usable in Unicode mode
This was the original proposal, and it's what V8 implements behind a flag. This is one reason why we reserved extra sequences like \k to be a syntax error in Unicode mode: so we could add new features this way. It's the simplest option, and it would give people a carrot to upgrade to using the Unicode flag. Starting from 1, we could "always" add 2 or 3 "later". @mathiasbynens and @hashseed have argued for this minimal option.
2. Named capture groups can be used outside of Unicode mode, but named backreferences are only available with Unicode mode on
There seemed to be some concern from the committee that this is something of an unexpected cliff in the middle of the feature. Another argument against it is that we shouldn't add new things to non-Unicode RegExps, so as to encourage people to flip the flag on. This was my funny idea.
3. Disambiguate by making \k have the new semantics if there are any named capture groups
This is definitely possible, but more complicated than one might think at first. If there are no named capture groups, then \k can be anywhere, but otherwise, it needs to be followed by <IdentifierName>; this complicates the grammar. Another piece of complexity is that an implementation can't determine whether there are named capture groups on-line if lookbehind is in play (because lookbehind semantics execute the RegExp backwards, and this affects captured groups). For example, /(?<=\k<a>(?<a>.))/ matches a zero-length sequence which is preceded by the same character twice. It's definitely unambiguous, just complicated. This was @bakkot's suggestion.
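To make the cases above concrete, here is a small sketch. It assumes an engine that implements named groups and lookbehind as later standardized in ES2018, where non-Unicode patterns follow the rule described in option 3. The compiles helper is hypothetical and only there for illustration; it is not part of any proposal text.

```javascript
// Hypothetical helper: reports whether a pattern/flags pair compiles,
// treating a SyntaxError as "does not compile".
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Existing semantics: with no named groups and no `u` flag, `\k` is an
// identity escape, so this pattern matches the literal text "k<foo>".
const legacyMatches = /\k<foo>/.test("k<foo>");

// Unicode mode does not accept a stray `\k<foo>` (no group named foo).
const unicodeCompiles = compiles("\\k<foo>", "u");

// Option 3 rule: once the pattern contains any named group, `\k` must be
// followed by a group name, even without the `u` flag.
const bareKWithGroup = compiles("(?<a>.)\\k");
const namedBackref = /(?<a>.)\k<a>/.test("xx");

// The lookbehind example: the backreference appears textually before the
// group, but lookbehind runs backwards, so the group captures first.
const lookbehind = /(?<=\k<a>(?<a>.))/;
const hit = lookbehind.exec("xaab");
const miss = lookbehind.exec("xab");

console.log({ legacyMatches, unicodeCompiles, bareKWithGroup, namedBackref });
console.log(hit && { index: hit.index, a: hit.groups.a }, miss);
```

The lookbehind case is the one that makes a single left-to-right pass awkward: the parser meets \k<a> before it has seen any named group, so it cannot yet tell which meaning of \k applies.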
At the September TC39 meeting, we seemed to come to consensus on 3; however, this was without incorporating some feedback from people not present in the room, and without a full understanding of the complexity of 3. With the complexity of 3, and the weird cliff of 2, I'm personally leaning back towards 1. OTOH, 3 feels the most "1JS-y" to me. Any thoughts?