Implement auxiliary routines to deal with unicode text#1077

Merged
rocky merged 16 commits into master from unicode-escape
Jan 8, 2021
Conversation

@GarkGarcia
Contributor

This is a follow-up to #1075. I still need to fill in the translation table: I didn't quite understand how it's supposed to work.

@GarkGarcia GarkGarcia requested review from mmatera and rocky December 28, 2020 18:15
@GarkGarcia GarkGarcia marked this pull request as draft December 28, 2020 18:15
Member

@rocky rocky left a comment
I'm okay with all of this. As you mention it could be filled out more.

Once this is done, we should adjust mathicsscript and mathics-django to use the information.

Something that occurred to me when writing this was whether to be more specific and look for those strings only within some context. For example, in MathML, the substitutions should appear only in <mtext>.

However, that means actually parsing the entire text.

I am sure @mmatera has encountered this issue in iwolfram, so I'd love to hear his thoughts on how to proceed.

@mmatera
Contributor

mmatera commented Dec 29, 2020

I am sure @mmatera has encountered this issue in iwolfram, so I'd love to hear his thoughts on how to proceed.

In IWolfram, I just pass the output formatted by TeXForm, and then most of the weird characters are converted to a simple LaTeX/ASCII form. For example, in WMA,

ToString[TeXForm[Integrate[f[x],x]]]

gives the output
Out[11]= \int f(x) \, dx

and, with MathML form

ToString[MathMLForm[Integrate[f[x],x]]]

you get

Out[13]=
<math><mrow> <mo>&#8747;</mo>
  <mrow>   <mrow>
    <mi>f</mi>    <mo>&#8289;</mo>    <mo>(</mo>
    <mi>x</mi>    <mo>)</mo>
   </mrow>   <mo>&#8290;</mo>
   <mrow>    <mo>&#8518;</mo>
    <mi>x</mi>   </mrow>  </mrow> </mrow></math> 

which is rendered properly, in Firefox at least, without special fonts installed. On the other hand, on the input side, I haven't fixed it yet, but I guess the way to do it would be a translation table...

Contributor

@mmatera mmatera left a comment
Looks good to me.

@rocky
Member

rocky commented Dec 29, 2020

I am sure @mmatera has encountered this issue in iwolfram, so I'd love to hear his thoughts on how to proceed.

In IWolfram, I just pass the output formatted by TeXForm, and then most of the weird characters are converted to a simple LaTeX/ASCII form. For example, in WMA,

ToString[TeXForm[Integrate[f[x],x]]]

@mmatera What does this do for \[FormalA]? Does this work properly?

@mmatera
Contributor

mmatera commented Dec 29, 2020

These are the outputs:

\[FormalA]
Out[2]= 
ToString[TeXForm[\[FormalA]]]
Out[4]= \unicode{f800}
ToString[MathMLForm[\[FormalA]]]
  Out[5]= <math>
 <mi>&#63488;</mi>
</math>

For these less usual characters, the behavior is poorer...

@mmatera
Contributor

mmatera commented Dec 29, 2020

For these characters, I would translate them just to \[FormalA]

@rocky
Member

rocky commented Dec 29, 2020

The thing I find confusing about ToString[TeXForm[Integrate[f[x],x]]] is why it works. I spent the last couple of minutes trying to come up with an explanation of why, or whether, this is right: what assumptions do you need to make to assume this is correct? Do we have to assume that TeX \int is the same as MathML \int for all symbols like \int?

For things like "FormalA" where something defined in WL is not defined in TeX, okay, we have a problem in the TeXForm part that needs fixing.

But it is not clear that the way to do this in TeX (I am sure there are many ways to put a dot under a letter aside from having it in a font) is also the most natural way to do this in MathML, if you considered going from WL to MathML without going through TeX.

It might be that the "best" way to do this in TeX is to go into math mode and put a dot under an A, while MathML already has a symbol for that: the Unicode character, if there isn't a named variation that matches WL.
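For concreteness, one standard way to put a dot under a letter in LaTeX is the `\d` text-mode accent (a sketch of the general technique only; this is not necessarily what TeXForm emits):

```latex
% \d{...} is the standard LaTeX "dot below" accent in text mode.
\documentclass{article}
\begin{document}
The formal symbol \d{A} is rendered as an A with a dot beneath it,
whereas MathML can use the corresponding Unicode code point directly.
\end{document}
```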

@rocky
Member

rocky commented Dec 29, 2020

For these characters, I would translate them just to \[FormalA]

Well, wouldn't it be better then to either use a translation table to convert characters (as would be done on input) or, better, change the Mathics core so that it produces Unicode when requested to do so?

In terms of fixing bugs in this area it feels like this would be simpler to debug and fix.

Your thoughts?

@GarkGarcia
Contributor Author

GarkGarcia commented Dec 31, 2020

I'll generate WL_TO_UNICODE from #1075 (comment), but there are a couple of issues we still need to figure out.

First of all, the mapping in the spreadsheet I created is not one-to-one: multiple named characters map to the same Unicode representation. This means our mapping doesn't have an inverse, so we wouldn't be able to generate UNICODE_REPLACE_DICT in the way we're currently doing.

Furthermore, not every cell in the Unicode equivalent column consists of a single Unicode code-point. In other words, some of the entries in Unicode equivalent are technically comprised of multiple characters, even though they represent a single character. This could cause unexpected behaviour if we're not careful with how we generate UNICODE_REPLACE_RE.

Suppose that UNICODE_REPLACE_DICT = {'a': 'c', 'ab': 'd'}. Currently, we define UNICODE_REPLACE_RE by UNICODE_REPLACE_RE = re.compile("|".join(re.escape(k) for k in UNICODE_REPLACE_DICT.keys())). So UNICODE_REPLACE_RE could potentially be /a|ab/ (this depends on the order in which Python iterates over the keys of UNICODE_REPLACE_DICT). In that case, the result of UNICODE_REPLACE_RE.sub(lambda m: UNICODE_REPLACE_DICT[m.group(0)], 'ab') would be cb instead of the expected d:

Python 3.6.9 (default, Oct  8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> r = re.compile('a|ab')
>>> r.sub(lambda m: {'a': 'c', 'ab': 'd'}[m.group(0)], 'ab')
'cb'
>>>

@rocky
Member

rocky commented Dec 31, 2020

so we wouldn't be able to generate UNICODE_REPLACE_DICT

Ok, another not-computed mapping is fine.

Furthermore, not every cell in the Unicode equivalent column consists of a single Unicode code-point. In other words, some of the entries in Unicode equivalent are technically comprised of multiple characters, even though they represent a single character. This could cause unexpected behaviour

OK, noted. That is theoretical; we aren't interested in this particular situation, though. I suspect there is an easy way to avoid the problems by avoiding composite characters. An A with a dot under it is not composite.

@GarkGarcia
Contributor Author

OK, noted. That is theoretical; we aren't interested in this particular situation, though. I suspect there is an easy way to avoid the problems by avoiding composite characters. An A with a dot under it is not composite.

We could sort the keys by length (if the longer ones come first in the regex, the issue is avoided).
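A minimal sketch of that fix, reusing the hypothetical two-key table from the earlier example (UNICODE_REPLACE_DICT and UNICODE_REPLACE_RE are the names used in this PR; the table contents here are purely illustrative):

```python
import re

# Illustrative table where one key is a prefix of another.
UNICODE_REPLACE_DICT = {"a": "c", "ab": "d"}

# Sorting the keys longest-first guarantees "ab" appears before "a"
# in the alternation, so the longer key wins when both could match.
UNICODE_REPLACE_RE = re.compile(
    "|".join(
        re.escape(k)
        for k in sorted(UNICODE_REPLACE_DICT, key=len, reverse=True)
    )
)

def replace_chars(s: str) -> str:
    # m.group(0) is the matched (unescaped) key, so it indexes the dict directly.
    return UNICODE_REPLACE_RE.sub(
        lambda m: UNICODE_REPLACE_DICT[m.group(0)], s
    )

print(replace_chars("ab"))  # d, not cb
```

re.sub's alternation tries branches left to right at each position, which is why ordering the keys is enough here.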

Ok, another not-computed mapping is fine.

I was thinking about this too. For each Unicode equivalent, there's usually a WL character that fits its precise description. We could map each WL character that has a Unicode equivalent to that specific Unicode equivalent. Of course, we'd have to build such a table manually.

There's another issue I forgot to mention. As mentioned before, there are WL named characters that don't have a Unicode equivalent, such as \[Wolf]. @rocky @mmatera would you guys prefer to map the named character \[Wolf] to the literal string "\[Wolf]" or to omit it from the map entirely?

Finally, I was also thinking about pickling the dictionaries instead of writing them inline. That way we can procedurally generate them at installation time to make sure they are kept up to date (we may want to revise the mapping at some point). We could also store them as JSON or something and then read them from disk when the kernel is loaded.
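A sketch of what the JSON route could look like, assuming the tables are plain str-to-str dicts (the table fragment and file handling here are hypothetical, not the PR's actual code):

```python
import json
import tempfile

# Hypothetical fragment of a WL-to-Unicode table:
# \[DifferenceDelta] (U+2206) -> GREEK CAPITAL LETTER DELTA (U+0394)
WL_TO_UNICODE = {"\u2206": "\u0394"}

# Dump the table once, e.g. at build/installation time...
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", encoding="utf-8", delete=False
) as f:
    json.dump(WL_TO_UNICODE, f, ensure_ascii=False)
    path = f.name

# ...and load it back when the kernel starts.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

assert loaded == WL_TO_UNICODE  # the round trip preserves the mapping
```

Unlike pickle, the resulting file is language-neutral, so mathicsscript or mathics-django could read the same data.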

@rocky
Member

rocky commented Dec 31, 2020

would you guys prefer to map the named character \[Wolf] to the unicode string "\[Wolf]" or to omit it from the map entirely?

If it is in WL it should be in the table. If there's no equivalent, well, that's the way it is.

I was also thinking about pickling the dictionaries instead of writing them inline.

Pickle is Python-specific, so you limit yourself when you use that format. JSON, XML, and YAML aren't limiting.
I don't think compression is at issue here.

But you don't have to get everything right and perfect initially. It's okay to come back to this and make more changes as needs dictate.

@GarkGarcia
Contributor Author

If it is in WL it should be in the table. If there's no equivalent, well, that's the way it is.

Makes sense.

Pickle is Python-specific, so you limit yourself when you use that format. JSON, XML, and YAML aren't limiting.
I don't think compression is at issue here.

But you don't have to get everything right and perfect initially. It's okay to come back to this and make more changes as needs dictate.

I agree, JSON or YAML is a better choice here. I think I'll just write the dicts inline for now, though; we can change it later if we ever need to.

@GarkGarcia GarkGarcia marked this pull request as ready for review January 2, 2021 14:25
@GarkGarcia GarkGarcia changed the title WIP: Implement auxiliary routines to deal with unicode text Implement auxiliary routines to deal with unicode text Jan 2, 2021
"": "𝕔",
"": "⋱",
"": "⨯",
"∆": "Δ",
Member
@GarkGarcia This looks the same on both sides. Should this be in this table and in the reverse?

Contributor Author
Yep, they should. In Unicode, U+2206 stands for "Increment", while in WL it stands for \[DifferenceDelta]. The folks at Wolfram probably used U+2206 instead of U+0394 (Δ, "Greek Capital Letter Delta") to make it distinguishable from \[CapitalDelta] (visually they are the same, but they represent different symbols in WL). However, U+0394 more accurately fits the description of \[DifferenceDelta], so the entry should stay in the dictionary.

Member

@rocky rocky Jan 4, 2021
I think you are mistaken here. I believe Increment and DifferenceDelta mean the same thing but use different words to express it.

They are both distinct from CapitalDelta because they are used differently: a CapitalDelta could be used as a variable name, while the others are thought of more as a function or operator. Note that Unicode puts Increment in the "Mathematical Operators" block.

And in fact the Unicode Increment and CapitalDelta symbols are visually different in some renderings. If you look above you will see some slight differences. In GNU Emacs the difference is more pronounced, with DifferenceDelta looking more like what it should as an operator. (I guess they adjusted CapitalDelta to better match what they feel is capital-ness.)

The point of this was to take symbols that render in a confusing way and render them in a way that isn't confusing. When a symbol renders properly, it should be used. Otherwise, in this case we have two things that map to the same Unicode symbol, so you can't invert that properly.

Member
It occurs to me that this would have been less confusing if the WL and corresponding Unicode symbol names had been mentioned on each line as a comment. Your CSV tables had this information, but it is lost where it is most useful.

It feels to me that we've lost sight of the use case that drove this, and turned it into some other project (listing all WL symbols as a CSV) that you say you don't have a use case for yet.

Contributor Author

@GarkGarcia GarkGarcia Jan 4, 2021
The point of this was to take symbols that render in a confusing way and render them in a way that isn't confusing. When the symbol renders properly, it should be used. Otherwise in this case we have two things that map to the same unicode symbol so you can't invert that properly.

I agree, but there are some issues with this. My understanding is that replace_wl_with_unicode is supposed to replace special characters with the Unicode characters they are supposed to represent. Even though "Increment" looks the same as \[DifferenceDelta], we can't rely on visuals alone. This is an accessibility concern.

Imagine someone uses a screenreader to interact with our programs. If we use "Increment" instead of "Greek Letter Capital Delta" the expression \[DifferenceDelta] x would be rendered as "Increment x" by a screenreader, instead of the more appropriate "Delta x".

Anyway, I had this in mind when creating the table. The characters in the "Unicode equivalent" column represent how the characters are supposed to be rendered, not what they actually mean code-wise.

Contributor Author

@GarkGarcia GarkGarcia Jan 4, 2021
Imagine someone uses a screenreader to interact with our programs. If we use "Increment" instead of "Greek Letter Capital Delta" the expression \[DifferenceDelta] x would be rendered as "Increment x" by a screenreader, instead of the more appropriate "Delta x".

This is why it's important for us to use Unicode characters that fit the description of any given WL named character, not only its appearance: if we do so, users who can't see the characters on screen are still able to understand what the output means.

Contributor Author
It feels to me that we've lost sight of the use case that is needed in turning this thing into some other project (listing all WL symbols as a CSV) that you say you don't have a use case for yet.

Well, this PR is the primary use case for it. I think it's better for us to have this information indexed somewhere in a way that can be programmatically manipulated than to have to fill it in by hand on a case-by-case basis, as we were doing before. Keep in mind we can always change the CSV tables, and we can deviate from them if needed. The tables are meant as a starting point.

Contributor Author

@GarkGarcia GarkGarcia Jan 4, 2021
Anyway, if you're concerned about having multiple WL named characters map to the same Unicode character, we could use the unicode-to-wl-conversion.csv table, which is a one-to-one mapping (so replace_unicode_with_wl(replace_wl_with_unicode(s)) is guaranteed to be the same as s).
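A sketch of why a one-to-one table makes that round trip the identity (the table fragment and the naive replace helpers here are illustrative only, not the PR's actual implementation):

```python
# Illustrative one-to-one fragment; the real data would come from
# unicode-to-wl-conversion.csv.
UNICODE_TO_WL = {"\u222b": "\\[Integral]", "\u0394": "\\[CapitalDelta]"}

# Inverting a one-to-one dict yields a well-defined reverse table.
WL_TO_UNICODE = {wl: uni for uni, wl in UNICODE_TO_WL.items()}

def replace_wl_with_unicode(s: str) -> str:
    # Naive sequential replacement, fine for this tiny illustration.
    for wl, uni in WL_TO_UNICODE.items():
        s = s.replace(wl, uni)
    return s

def replace_unicode_with_wl(s: str) -> str:
    for uni, wl in UNICODE_TO_WL.items():
        s = s.replace(uni, wl)
    return s

s = "\\[Integral] f[x] \\[CapitalDelta]x"
assert replace_unicode_with_wl(replace_wl_with_unicode(s)) == s
```

Because every Unicode character maps back to exactly one WL name, composing the two replacements recovers the original string.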

Member
OK. But let's not lose sight of the fact that the end result is what is needed and what drove this.

I'd appreciate it if you'd add the Unicode and WL names as a comment to each line so that everyone is better informed as to what's going on and we can catch problems like the one with Increment.

Also, somewhere I meant to mention that after this PR is done both mathicsscript and mathics-django should be adjusted to use this info rather than their own incomplete copy.

Contributor Author
I'd appreciate it if you'd add the Unicode and WL names as a comment to each line so that everyone is better informed as to what's going on and we can catch problems like the one with Increment.

Great idea! I'll work on it.

Also, somewhere I meant to mention that after this PR is done both mathicsscript and mathics-django should be adjusted to use this info rather than their own incomplete copy.

Sure, I can work on it too after this is done.

Also added comments with the unicode names of the unicode characters used by WL to encode named characters
Contributor Author

@GarkGarcia GarkGarcia left a comment
Ok. But it has to be tomorrow or probably Wednesday. I have paid work I need to get to.

No problem. Whenever you can do it is fine.

@GarkGarcia
Contributor Author

I took the time to commit the changes we made here to https://github.com/Mathics3/mathics-development-guide too.

@GarkGarcia
Contributor Author

@rocky @mmatera I guess we can merge this?

@rocky
Member

rocky commented Jan 8, 2021

Sure.

Member

@rocky rocky left a comment
LGTM

@rocky rocky deleted the unicode-escape branch February 21, 2021 09:29