Implement auxiliary routines to deal with unicode text#1077

Merged
rocky merged 16 commits into master from unicode-escape
Jan 8, 2021
Conversation

@GarkGarcia
Contributor

This is a follow-up to #1075. I still need to fill in the translation table: I didn't quite understand how it's supposed to work.

@GarkGarcia GarkGarcia requested review from mmatera and rocky December 28, 2020 18:15
@GarkGarcia GarkGarcia marked this pull request as draft December 28, 2020 18:15
Member

@rocky rocky left a comment
I'm okay with all of this. As you mention it could be filled out more.

Once this is done, we should adjust mathicsscript and mathics-django to use the information.

Something that occurred to me when writing this was whether to be more specific and look for those strings only within some context. For example, in MathML, the substitutions should appear only in <mtext>.

However, that means actually parsing the entire text.

I am sure @mmatera has encountered this issue in iwolfram, so I'd love to hear his thoughts on how to proceed.

@mmatera
Contributor

mmatera commented Dec 29, 2020

I am sure @mmatera has encountered this issue in iwolfram, so I'd love to hear his thoughts on how to proceed.

In IWolfram, I just pass the output formatted by TeXForm, and then most of the weird characters are converted to a simple LaTeX/ASCII form. For example, in WMA,

ToString[TeXForm[Integrate[f[x],x]]]

gives the output
Out[11]= \int f(x) \, dx

and, with MathML form

ToString[MathMLForm[Integrate[f[x],x]]]

you get

Out[13]=
<math><mrow> <mo>&#8747;</mo>
  <mrow>   <mrow>
    <mi>f</mi>    <mo>&#8289;</mo>    <mo>(</mo>
    <mi>x</mi>    <mo>)</mo>
   </mrow>   <mo>&#8290;</mo>
   <mrow>    <mo>&#8518;</mo>
    <mi>x</mi>   </mrow>  </mrow> </mrow></math> 

which is rendered properly, in Firefox at least, without special fonts installed. On the other hand, on the input side, I haven't fixed it yet, but I guess the way to do it would be a translation table...

Contributor

@mmatera mmatera left a comment
Looks good to me.

@rocky
Member

rocky commented Dec 29, 2020

I am sure @mmatera has encountered this issue in iwolfram, so I'd love to hear his thoughts on how to proceed.

In IWolfram, I just pass the output formatted by TeXForm, and then most of the weird characters are converted to a simple LaTeX/ASCII form. For example, in WMA,

ToString[TeXForm[Integrate[f[x],x]]]

@mmatera What does this do for \[FormalA]? Does this work properly?

@mmatera
Contributor

mmatera commented Dec 29, 2020

These are the outputs:

\[FormalA]
Out[2]= 
ToString[TeXForm[\[FormalA]]]
Out[4]= \unicode{f800}
ToString[MathMLForm[\[FormalA]]]
  Out[5]= <math>
 <mi>&#63488;</mi>
</math>

For these less usual characters, the behavior is poorer...

@mmatera
Contributor

mmatera commented Dec 29, 2020

For these characters, I would translate them just to \[FormalA]

@rocky
Member

rocky commented Dec 29, 2020

The thing I find confusing about ToString[TeXForm[Integrate[f[x],x]]] is why it works. I spent the last couple of minutes trying to come up with an explanation of why, or whether, this is right: what assumptions do you need to make to assume this is correct? Do we have to assume that TeX \int is the same as MathML \int for all symbols like \int?

For things like "FormalA" where something defined in WL is not defined in TeX, okay, we have a problem in the TeXForm part that needs fixing.

But it is not clear that the way to do this in TeX (I am sure there are many ways to put a dot under a letter aside from having it in a font) is also the most natural way to do this in MathML, if you considered going from WL to MathML without going through TeX.

It might be that the "best" way to do this in TeX is to go into math mode and put a dot under an A, while MathML already has a symbol for that: the Unicode character, if there isn't a named variation that matches WL.
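For concreteness, one standard way to put a dot under a letter in LaTeX is the `\d` text-mode accent (a sketch of the general technique only; this is not necessarily what TeXForm emits):

```latex
% \d{...} is the standard LaTeX "dot below" accent in text mode.
\documentclass{article}
\begin{document}
The formal symbol \d{A} is rendered as an A with a dot beneath it,
whereas MathML can use the corresponding Unicode code point directly.
\end{document}
```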

@rocky
Member

rocky commented Dec 29, 2020

For these characters, I would translate them just to \[FormalA]

Well, wouldn't it be better then to either use a translation table to convert characters (as would be done on input) or, better, change the Mathics core so that it produces Unicode when requested to do so?

In terms of fixing bugs in this area it feels like this would be simpler to debug and fix.

Your thoughts?

@GarkGarcia
Contributor Author

GarkGarcia commented Dec 31, 2020

I'll generate WL_TO_UNICODE from #1075 (comment), but there are a couple of issues we still need to figure out.

First of all, the mapping in the spreadsheet I created is not one-to-one: multiple named characters map to the same Unicode representation. This means our mapping doesn't have an inverse, so we wouldn't be able to generate UNICODE_REPLACE_DICT in the way we're currently doing.

Furthermore, not every cell in the Unicode equivalent column consists of a single Unicode code-point. In other words, some of the entries in Unicode equivalent are technically comprised of multiple characters, even though they represent a single character. This could cause unexpected behaviour if we're not careful with how we generate UNICODE_REPLACE_RE.

Suppose that UNICODE_REPLACE_DICT = {'a': 'c', 'ab': 'd'}. Currently, we define UNICODE_REPLACE_RE by UNICODE_REPLACE_RE = re.compile("|".join(re.escape(k) for k in UNICODE_REPLACE_DICT.keys())). So UNICODE_REPLACE_RE could potentially be /a|ab/ (this depends on the order in which Python iterates over the keys of UNICODE_REPLACE_DICT). In that case, the result of UNICODE_REPLACE_RE.sub(lambda m: UNICODE_REPLACE_DICT[m.group(0)], 'ab') would be cb instead of the expected d:

Python 3.6.9 (default, Oct  8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> r = re.compile('a|ab')
>>> r.sub(lambda m: {'a': 'c', 'ab': 'd'}[m.group(0)], 'ab')
'cb'
>>>

@rocky
Member

rocky commented Dec 31, 2020

so we wouldn't be able to generate UNICODE_REPLACE_DICT

Ok, another not-computed mapping is fine.

Furthermore, not every cell in the Unicode equivalent column consists of a single Unicode code-point. In other words, some of the entries in Unicode equivalent are technically comprised of multiple characters, even though they represent a single character. This could cause unexpected behaviour

OK, noted. That is theoretical; we aren't interested in this particular situation, though. I suspect there is an easy way to avoid the problems by avoiding composite characters. An A with a dot under it is not composite.

@GarkGarcia
Contributor Author

OK, noted. That is theoretical; we aren't interested in this particular situation, though. I suspect there is an easy way to avoid the problems by avoiding composite characters. An A with a dot under it is not composite.

We could sort the keys by length (if the longer ones come first in the regex, the issue is avoided).
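A minimal sketch of that fix, reusing the hypothetical two-key table from the earlier example (UNICODE_REPLACE_DICT and UNICODE_REPLACE_RE are the names used in this PR; the table contents here are purely illustrative):

```python
import re

# Illustrative table where one key is a prefix of another.
UNICODE_REPLACE_DICT = {"a": "c", "ab": "d"}

# Sorting the keys longest-first guarantees "ab" appears before "a"
# in the alternation, so the longer key wins when both could match.
UNICODE_REPLACE_RE = re.compile(
    "|".join(
        re.escape(k)
        for k in sorted(UNICODE_REPLACE_DICT, key=len, reverse=True)
    )
)

def replace_chars(s: str) -> str:
    # m.group(0) is the matched (unescaped) key, so it indexes the dict directly.
    return UNICODE_REPLACE_RE.sub(
        lambda m: UNICODE_REPLACE_DICT[m.group(0)], s
    )

print(replace_chars("ab"))  # d, not cb
```

re.sub's alternation tries branches left to right at each position, which is why ordering the keys is enough here.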

Ok, another not-computed mapping is fine.

I was thinking about this too. For each Unicode equivalent, there's usually a WL character that fits its precise description. We could map each WL character that has a Unicode equivalent to that specific Unicode equivalent. Of course, we'd have to build such a table manually.

There's another issue I forgot to mention. As mentioned before, there are WL named characters that don't have a Unicode equivalent, such as \[Wolf]. @rocky @mmatera would you guys prefer to map the named character \[Wolf] to the literal string "\[Wolf]" or to omit it from the map entirely?

Finally, I was also thinking about pickling the dictionaries instead of writing them inline. That way we can procedurally generate them at installation time to make sure they are kept up to date (we may want to revise the mapping at some point). We could also store them as JSON or something and then read them from disk when the kernel is loaded.
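A sketch of what the JSON route could look like, assuming the tables are plain str-to-str dicts (the table fragment and file handling here are hypothetical, not the PR's actual code):

```python
import json
import tempfile

# Hypothetical fragment of a WL-to-Unicode table:
# \[DifferenceDelta] (U+2206) -> GREEK CAPITAL LETTER DELTA (U+0394)
WL_TO_UNICODE = {"\u2206": "\u0394"}

# Dump the table once, e.g. at build/installation time...
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", encoding="utf-8", delete=False
) as f:
    json.dump(WL_TO_UNICODE, f, ensure_ascii=False)
    path = f.name

# ...and load it back when the kernel starts.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

assert loaded == WL_TO_UNICODE  # the round trip preserves the mapping
```

Unlike pickle, the resulting file is language-neutral, so mathicsscript or mathics-django could read the same data.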

@rocky
Member

rocky commented Dec 31, 2020

would you guys prefer to map the named character \[Wolf] to the unicode string "\[Wolf]" or to omit it from the map entirely?

If it is in WL it should be in the table. If there's no equivalent, well, that's the way it is.

I was also thinking about pickling the dictionaries instead of writing them inline.

Pickle is Python-specific, so you limit yourself when you use that format. JSON, XML, and YAML aren't limiting.
I don't think compression is at issue here.

But you don't have to get everything right and perfect initially. It's okay to come back to this and make more changes as needs dictate.

@GarkGarcia
Contributor Author

If it is in WL it should be in the table. If there's no equivalent, well, that's the way it is.

Makes sense.

Pickle is Python-specific, so you limit yourself when you use that format. JSON, XML, and YAML aren't limiting.
I don't think compression is at issue here.

But you don't have to get everything right and perfect initially. It's okay to come back to this and make more changes as needs dictate.

I agree, JSON or YAML is a better choice here. I think I'll just write the dicts inline for now, though; we can change it later if we ever need to.

@GarkGarcia GarkGarcia marked this pull request as ready for review January 2, 2021 14:25
@GarkGarcia GarkGarcia changed the title WIP: Implement auxiliary routines to deal with unicode text Implement auxiliary routines to deal with unicode text Jan 2, 2021
"": "𝕔",
"": "⋱",
"": "⨯",
"∆": "Δ",
Member
@GarkGarcia This looks the same on both sides. Should this be in this table and in the reverse?

Contributor Author
Yep, they should. In Unicode, U+2206 stands for "Increment", while in WL it stands for \[DifferenceDelta]. The folks at Wolfram probably used U+2206 instead of U+0394 (Δ, "Greek Capital Letter Delta") to make it distinguishable from \[CapitalDelta] (visually they are the same, but they represent different symbols in WL). However, U+0394 more accurately fits the description of \[DifferenceDelta], so the entry should stay in the dictionary.

Member

@rocky rocky Jan 4, 2021
I think you are mistaken here. I believe Increment and DifferenceDelta mean the same thing but use different words to express it.

They are both distinct from CapitalDelta because they are used differently: a CapitalDelta could be used as a variable name, while the others are thought of more as a function or operator. Note that Unicode puts Increment in the "Mathematical Operators" block.

And in fact the Unicode Increment and CapitalDelta symbols are visually different in some renderings. If you look above you will see some slight differences. In GNU Emacs the difference is more pronounced, with DifferenceDelta looking more like what it should as an operator. (I guess they adjusted CapitalDelta to better match what they feel is capital-ness.)

The point of this was to take symbols that render in a confusing way and render them in a way that isn't confusing. When a symbol renders properly, it should be used. Otherwise, in this case we have two things that map to the same Unicode symbol, so you can't invert that properly.

Member
It occurs to me that this would have been less confusing if the WL and corresponding Unicode symbol names had been mentioned on each line as a comment. Your CSV tables had this information, but it is lost where it is most useful.

It feels to me that we've lost sight of the use case that drove this, and turned it into some other project (listing all WL symbols as a CSV) that you say you don't have a use case for yet.

Contributor Author

@GarkGarcia GarkGarcia Jan 4, 2021
The point of this was to take symbols that render in a confusing way and render them in a way that isn't confusing. When the symbol renders properly, it should be used. Otherwise in this case we have two things that map to the same unicode symbol so you can't invert that properly.

I agree, but there are some issues with this. My understanding is that replace_wl_with_unicode is supposed to replace special characters with the Unicode characters they are supposed to represent. Even though "Increment" looks the same as \[DifferenceDelta], we can't rely on visuals alone. This is an accessibility concern.

Imagine someone uses a screenreader to interact with our programs. If we use "Increment" instead of "Greek Letter Capital Delta" the expression \[DifferenceDelta] x would be rendered as "Increment x" by a screenreader, instead of the more appropriate "Delta x".

Anyway, I had this in mind when creating the table. The characters in the "Unicode equivalent" column represent how the characters are supposed to be rendered, not what they actually mean code-wise.

Contributor Author

@GarkGarcia GarkGarcia Jan 4, 2021
Imagine someone uses a screenreader to interact with our programs. If we use "Increment" instead of "Greek Letter Capital Delta" the expression \[DifferenceDelta] x would be rendered as "Increment x" by a screenreader, instead of the more appropriate "Delta x".

This is why it's important for us to use Unicode characters that fit the description of any given WL named character, not only its appearance: if we do so, users who can't see the characters on screen are still able to understand what the output means.

Contributor Author
It feels to me that we've lost sight of the use case that is needed in turning this thing into some other project (listing all WL symbols as a CSV) that you say you don't have a use case for yet.

Well, this PR is the primary use case for it. I think it's better for us to have this information indexed somewhere in a way that can be programmatically manipulated than to have to fill it in by hand on a case-by-case basis, as we were doing before. Keep in mind we can always change the CSV tables, and we can deviate from them if needed. The tables are meant as a starting point.

Contributor Author

@GarkGarcia GarkGarcia Jan 4, 2021
Anyway, if you're concerned about having multiple WL named characters map to the same Unicode character, we could use the unicode-to-wl-conversion.csv table, which is a one-to-one mapping (so replace_unicode_with_wl(replace_wl_with_unicode(s)) is guaranteed to be the same as s).
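A sketch of why a one-to-one table makes that round trip the identity (the table fragment and the naive replace helpers here are illustrative only, not the PR's actual implementation):

```python
# Illustrative one-to-one fragment; the real data would come from
# unicode-to-wl-conversion.csv.
UNICODE_TO_WL = {"\u222b": "\\[Integral]", "\u0394": "\\[CapitalDelta]"}

# Inverting a one-to-one dict yields a well-defined reverse table.
WL_TO_UNICODE = {wl: uni for uni, wl in UNICODE_TO_WL.items()}

def replace_wl_with_unicode(s: str) -> str:
    # Naive sequential replacement, fine for this tiny illustration.
    for wl, uni in WL_TO_UNICODE.items():
        s = s.replace(wl, uni)
    return s

def replace_unicode_with_wl(s: str) -> str:
    for uni, wl in UNICODE_TO_WL.items():
        s = s.replace(uni, wl)
    return s

s = "\\[Integral] f[x] \\[CapitalDelta]x"
assert replace_unicode_with_wl(replace_wl_with_unicode(s)) == s
```

Because every Unicode character maps back to exactly one WL name, composing the two replacements recovers the original string.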

Member
OK. But let's not lose sight of the fact that the end result is what is needed and what drove this.

I'd appreciate it if you'd add the Unicode and WL names as a comment to each line so that everyone is better informed as to what's going on and we can catch problems like the one with Increment.

Also, somewhere I meant to mention that after this PR is done both mathicsscript and mathics-django should be adjusted to use this info rather than their own incomplete copy.

Contributor Author
I'd appreciate it if you'd add the Unicode and WL names as a comment to each line so that everyone is better informed as to what's going on and we can catch problems like the one with Increment.

Great idea! I'll work on it.

Also, somewhere I meant to mention that after this PR is done both mathicsscript and mathics-django should be adjusted to use this info rather than their own incomplete copy.

Sure, I can work on it too after this is done.

Also added comments with the unicode names of the unicode characters used by WL to encode named characters
Contributor Author

@GarkGarcia GarkGarcia left a comment
Ok. But it has to be tomorrow or probably Wednesday. I have paid work I need to get to.

No problem. Whenever you can do it is fine.

@GarkGarcia
Contributor Author

I took the time to commit the changes we made here to https://github.com/Mathics3/mathics-development-guide too.

@GarkGarcia
Contributor Author

@rocky @mmatera I guess we can merge this?

@rocky
Member

rocky commented Jan 8, 2021

Sure.

Member

@rocky rocky left a comment
LGTM

@rocky rocky deleted the unicode-escape branch February 21, 2021 09:29