Implement auxiliary routines to deal with unicode text (#1077)
Conversation
I'm okay with all of this. As you mention it could be filled out more.
Once this is done, we should adjust mathicsscript and mathics-django to use the information.
Something that occurred to me when writing this was whether to be more specific and look for those strings only within some context. For example, in MathML the substitutions should appear only in <mtext>.
However that means actually parsing the entire text.
I am sure @mmatera has encountered this issue in iwolfram, so I'd love to hear his thoughts on how to proceed.
In IWolfram, I just pass formatted output through TeXForm, and then most of the weird characters are converted to a simple LaTeX/ASCII form. For example, in WMA, gives the output and, with MathML form you get which is properly rendered in Firefox, at least, without special fonts installed. On the other hand, on the input side, I haven't fixed it yet, but I guess the way to go would be a translation table...
@mmatera What does this do for
These are the outputs: For these less usual characters, the behavior is poorer...
For these characters, I would translate them just to
The thing I find confusing: for things like "FormalA", where something defined in WL is not defined in TeX, okay, we have a problem. But it is not clear that the way to do this in TeX (I am sure there are many ways to put a dot under a letter aside from having it in a font) is also the most natural way to do this in MathML, if you consider going from WL to MathML without going through TeX. It might be that the "best" way to do this in TeX is to go into math mode and put a dot under an A, while MathML has a symbol for that already: the Unicode character, if there isn't a variant of that name which matches WL.
Well, wouldn't it be better then to either use a translation table to convert characters (as would be done on input), or better, to change the Mathics core so that it produces Unicode when requested to do so? In terms of fixing bugs in this area, it feels like this would be simpler to debug and fix. Your thoughts?
I'll generate First of all, the mapping in the spreadsheet I created is not one-to-one, which means that there are multiple named characters that map to the same Unicode representation. This means our mapping doesn't have an inverse, so we wouldn't be able to generate Furthermore, not every cell in the Unicode equivalent column consists of a single Unicode code point. In other words, some of the entries in Unicode equivalent are technically composed of multiple characters, even though they represent a single character. This could cause unexpected behaviour if we're not careful with how we generate Suppose that
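A minimal sketch of the two problems described above. The WL names and table entries here are made up for illustration; they are not the project's actual tables.

```python
# Hypothetical WL-name -> Unicode mapping with a deliberate collision.
wl_to_unicode = {
    r"\[DifferenceDelta]": "\u2206",    # INCREMENT
    r"\[HypotheticalDelta]": "\u2206",  # same code point: a collision (made up)
    r"\[CapitalDelta]": "\u0394",       # GREEK CAPITAL LETTER DELTA
}

# Naively inverting the dict silently drops one of the colliding keys,
# so the mapping has no well-defined inverse.
unicode_to_wl = {v: k for k, v in wl_to_unicode.items()}
assert len(unicode_to_wl) == 2  # one of the three entries was lost

# Some "Unicode equivalent" cells hold more than one code point: a base
# letter plus a combining mark still renders as a single glyph.
a_dot_below = "A\u0323"  # 'A' followed by COMBINING DOT BELOW
assert len(a_dot_below) == 2
```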
Ok, another not-computed mapping is fine.
Ok, noted. That is theoretical; we're interested in this particular situation though. I suspect there is an easy way to avoid the problems by avoiding composite characters. A with a dot under it is not composite.
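A small sketch supporting the point above: "A with a dot under it" exists in Unicode as a single precomposed code point, so a table can avoid combining sequences entirely.

```python
import unicodedata

combining = "A\u0323"   # 'A' followed by COMBINING DOT BELOW (two code points)
precomposed = "\u1ea0"  # LATIN CAPITAL LETTER A WITH DOT BELOW (one code point)

assert len(combining) == 2 and len(precomposed) == 1
# NFC normalization composes the two-code-point sequence into the single one.
assert unicodedata.normalize("NFC", combining) == precomposed
```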
We could sort the keys based on how long they are (if the longer ones come first in the regex, the issue is avoided).
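A sketch of the longest-first trick: build one substitution regex whose alternatives are sorted by length, so a key that is a prefix of another (e.g. "->" vs "->>") can never shadow the longer match. The table entries are made up for illustration.

```python
import re

table = {"->": "\u2192", "->>": "\u21a0", "<->": "\u2194"}

# Longer keys go first in the alternation, so they match first.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(table, key=len, reverse=True))
)

def substitute(text: str) -> str:
    return pattern.sub(lambda m: table[m.group(0)], text)

print(substitute("a ->> b -> c"))  # "->>" wins over "->": a ↠ b → c
```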
I was thinking about this too. For each Unicode equivalent, there's usually a WL character that fits its precise description. We could map each WL character that has a Unicode equivalent to this specific Unicode equivalent. Of course, we'd have to manually build a table for this. There's another issue I forgot to mention: as mentioned before, there are WL named characters that don't have a Unicode equivalent, such as Finally, I was also thinking about pickling the dictionaries instead of writing them inline. That way we can procedurally generate them at installation time to make sure they are kept up-to-date (we may want to revise the mapping at some point). We could also store them as JSON or something and then read them from disk when the kernel is loaded.
If it is in WL it should be in the table. If there's no equivalent, well, that's the way it is.
Pickle is Python-specific, so you limit yourself when you use that format. JSON, XML, and YAML aren't limiting. But you don't have to get everything right and perfect initially. It's okay to come back to this and make more changes as needs dictate.
Makes sense.
I agree, JSON or YAML is a better choice here. I think I'll just write the dicts inline though; we can change it later if we ever need to.
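If the dicts ever do move to disk, a JSON round trip preserves the mapping, including non-ASCII values. The entries here are illustrative, not the project's actual tables.

```python
import json

wl_to_unicode = {"\\[CapitalDelta]": "\u0394", "\\[Rule]": "\u2192"}

# ensure_ascii=False keeps the Unicode characters readable in the file.
text = json.dumps(wl_to_unicode, ensure_ascii=False)
assert json.loads(text) == wl_to_unicode
```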
Force-pushed from dd0ffb4 to 8848bf6.
mathics/core/util.py (outdated)
"": "𝕔",
"": "⋱",
"": "⨯",
"∆": "Δ",
@GarkGarcia This looks the same on both sides. Should this be in this table and in the reverse?
Yep, they should. In Unicode, ∆ (U+2206) stands for "Increment", while in WL it stands for \[DifferenceDelta]. The guys from Wolfram probably used U+2206 instead of U+0394 (Δ, "Greek Capital Letter Delta") to make it distinguishable from \[CapitalDelta] (visually they are the same, but they represent different symbols in WL). However, U+0394 more accurately fits the description of \[DifferenceDelta], so the entry should stay in the dictionary.
I think you are mistaken here. I believe Increment and DifferenceDelta mean the same thing but use different words to express it.
They are both distinct from CapitalDelta because they are to be used differently: a CapitalDelta could be used as a variable name, while the others are thought of more as a function or operator. Note that Unicode's Increment is placed in the "Mathematical Operators" block.
And in fact the Unicode Increment and CapitalDelta symbols are visually different in some renderings; if you look above you will see some slight differences. In GNU Emacs the difference is more pronounced, with DifferenceDelta looking more like what it should as an operator. (I guess they adjusted CapitalDelta to better match what they feel is capital-ness.)
The point of this was to take symbols that render in a confusing way and render them in a way that isn't confusing. When the symbol renders properly, it should be used. Otherwise in this case we have two things that map to the same unicode symbol so you can't invert that properly.
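For the record, the two code points in question really are distinct characters with distinct official Unicode names, even where a font renders them with the same glyph. This can be checked with the standard-library `unicodedata` module:

```python
import unicodedata

print(unicodedata.name("\u2206"))  # INCREMENT
print(unicodedata.name("\u0394"))  # GREEK CAPITAL LETTER DELTA
```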
It occurs to me that this would have been less confusing if a comment giving the WL and corresponding Unicode symbol names were added on each line. Your CSV tables had this information, but it is lost where it is most useful.
It feels to me that we've lost sight of the use case that is needed in turning this thing into some other project (listing all WL symbols as a CSV) that you say you don't have a use case for yet.
The point of this was to take symbols that render in a confusing way and render them in a way that isn't confusing. When the symbol renders properly, it should be used. Otherwise in this case we have two things that map to the same unicode symbol so you can't invert that properly.
I agree, but there are some issues with this. My understanding is that replace_wl_with_unicode is supposed to replace special characters with the Unicode characters they are supposed to represent. Even though "Increment" looks the same as \[DifferenceDelta], we can't rely on visuals alone. This is an accessibility concern.
Imagine someone uses a screenreader to interact with our programs. If we use "Increment" instead of "Greek Letter Capital Delta" the expression \[DifferenceDelta] x would be rendered as "Increment x" by a screenreader, instead of the more appropriate "Delta x".
Anyway, I had this in mind when creating the table. The characters in the "Unicode equivalent" column represent how the characters are supposed to be rendered, not what they actually mean code-wise.
Imagine someone uses a screenreader to interact with our programs. If we use "Increment" instead of "Greek Letter Capital Delta" the expression \[DifferenceDelta] x would be rendered as "Increment x" by a screenreader, instead of the more appropriate "Delta x".
This is why it's important for us to use Unicode characters that fit the description of any given WL named character, not only its appearance: if we do so, users who can't see the characters on screen are still able to understand what the output means.
It feels to me that we've lost sight of the use case that is needed in turning this thing into some other project (listing all WL symbols as a CSV) that you say you don't have a use case for yet.
Well, this PR is the primary use case for it. I think it's better for us to have this information indexed somewhere, in a way that can be programmatically manipulated, than to have to fill it in by hand on a case-by-case basis, as we were doing before. Keep in mind we can always change the CSV tables, and we can deviate from them if needed. The tables are meant as a starting point.
Anyway, if you're concerned about having multiple WL named characters map to the same Unicode character, we could use the unicode-to-wl-conversion.csv table, which is a one-to-one mapping (so replace_unicode_with_wl(replace_wl_with_unicode(s)) is guaranteed to be the same as s).
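A sketch of the round-trip guarantee a one-to-one table provides. The table contents and the simple replace-based implementation here are illustrative, not the project's actual code.

```python
# One-to-one Unicode -> WL table (illustrative entries).
unicode_to_wl = {"\u0394": "\\[CapitalDelta]", "\u2192": "\\[Rule]"}
# Because it is one-to-one, inverting it loses nothing.
wl_to_unicode = {v: k for k, v in unicode_to_wl.items()}

def replace_wl_with_unicode(s: str) -> str:
    for wl, uni in wl_to_unicode.items():
        s = s.replace(wl, uni)
    return s

def replace_unicode_with_wl(s: str) -> str:
    for uni, wl in unicode_to_wl.items():
        s = s.replace(uni, wl)
    return s

s = "\\[CapitalDelta] x \\[Rule] 0"
assert replace_unicode_with_wl(replace_wl_with_unicode(s)) == s
```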
Ok. But let's not lose sight that the end result is what is needed and drove this.
I'd appreciate it if you'd add the Unicode and WL names as a comment to each line so that everyone is better informed as to what's going on and we can catch problems like the one with Increment.
Also, somewhere I meant to mention that after this PR is done both mathicsscript and mathics-django should be adjusted to use this info rather than their own incomplete copy.
I'd appreciate it if you'd add the Unicode and WL names as a comment to each line so that everyone is better informed as to what's going on and we can catch problems like the one with Increment.
Great idea! I'll work on it.
Also, somewhere I meant to mention that after this PR is done both mathicsscript and mathics-django should be adjusted to use this info rather than their own incomplete copy.
Sure, I can work on it too after this is done.
Also added comments with the Unicode names of the Unicode characters used by WL to encode named characters.
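Such per-line comments can be generated rather than written by hand, using the official Unicode character names. A minimal sketch, with illustrative table entries:

```python
import unicodedata

table = {"\\[CapitalDelta]": "\u0394", "\\[Rule]": "\u2192"}

# Emit each entry with the Unicode character name as a trailing comment.
for wl, uni in table.items():
    name = unicodedata.name(uni, "<unnamed>")
    print(f'    "{wl}": "{uni}",  # {name}')
```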
I took the time to commit the changes we made here to https://github.com/Mathics3/mathics-development-guide too.
Sure.
This is a follow-up to #1075. I still need to fill in the translation table: I didn't quite understand how it's supposed to work.