Update Tokenizer to treat Markdown code as text instead of HTML #1

Merged
acjh merged 2 commits into MarkBind:master from danielbrzn:markdown-parsing-fix on Feb 3, 2018

Conversation

danielbrzn commented Jan 25, 2018

This fix allows Markdown code to contain '<' and '<=' without affecting other HTML elements, because the code is now treated as a text element. Furthermore, no spaces are required around these symbols within the backticks.

As such, inequalities like the above can be rendered normally as shown below.

image

Resolves MarkBind/markbind#101
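
A minimal sketch of the idea, not the PR's actual diff: a scanner keeps a boolean flag that flips on every backtick, and a '<' only counts as the start of a tag while the flag is off. The function name `findTagOpenings` is hypothetical; only the flag name mirrors the `_isMarkdownCode` field that appears in the review threads.

```javascript
// Hypothetical helper: returns the indices of every '<' that a
// backtick-aware scanner would still treat as the start of an HTML tag.
function findTagOpenings(input) {
	var isMarkdownCode = false; // flips on every backtick
	var openings = [];
	for (var i = 0; i < input.length; i++) {
		var c = input.charAt(i);
		if (c === "`") {
			isMarkdownCode = !isMarkdownCode;
		} else if (c === "<" && !isMarkdownCode) {
			openings.push(i);
		}
	}
	return openings;
}

// The '<' inside the backticks is skipped; the ones in <b> and </b> are not.
console.log(findTagOpenings("`x<y` and <b>bold</b>"));
```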

Comment thread lib/Tokenizer.js Outdated
this._ended = false;
this._xmlMode = !!(options && options.xmlMode);
this._decodeEntities = !!(options && options.decodeEntities);
this._isMarkdownCode = false;

Collaborator

Tabs vs spaces 😨

Let's be consistent with the rest of the file (tabs).

Comment thread lib/Tokenizer.js Outdated
while(this._index < this._buffer.length && this._running){
var c = this._buffer.charAt(this._index);
// Detect Markdown code so that it is parsed as text instead of HTML
if (c === '`')

Collaborator

No space before opening parentheses 😢

Let's be consistent with L152 of the file (braces even for single line of code).

acjh commented Jan 25, 2018

Collaborator

Off-topic: Add a white space around operators :)

x < y
x <= y

We don't have a JS coding standard but:

@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from c3efc5a to 28804b6 Compare January 25, 2018 17:12
danielbrzn commented Jan 25, 2018

Author

Updated with the requested changes. Somehow my WebStorm was set to indent with spaces, and I didn't catch the difference in the editor.

Thanks for the tip about the white space!

Comment thread lib/Tokenizer.js Outdated

Tokenizer.prototype._stateText = function(c){
if(c === "<"){
// parse open tags if it is not Markdown

Collaborator

  • Parse (capital P)
  • tag (singular)

Comment thread lib/Tokenizer.js Outdated
while(this._index < this._buffer.length && this._running){
var c = this._buffer.charAt(this._index);
// Detect Markdown code so that it is parsed as text instead of HTML
if (c === '`') {

Collaborator

No spaces before/after parentheses.

@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from 28804b6 to 43fe9b5 Compare January 25, 2018 17:31
- Allows Markdown code to contain '<' , '<=' without having it affect other HTML elements
@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from 43fe9b5 to 3880c5c Compare January 25, 2018 18:00
@acjh acjh requested a review from Gisonrg January 26, 2018 04:00
acjh commented Jan 26, 2018

Collaborator

We should treat <␣ and <= as text as well:

  • a < b
  • a <= b

This is reasonable: create a .html file with the above and open it in your browser (tested in Chrome).
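
The browser behaviour described above is consistent with the "tag open" rules of the HTML Standard's tokenizer: after '<', only an ASCII letter, '/', '!' or '?' begins markup, and anything else (including a space or '=') leaves the '<' as plain text. A rough sketch of that check follows; the name `startsMarkup` is hypothetical and is not part of Tokenizer.js.

```javascript
// Hypothetical helper: does the character after the '<' at ltIndex begin markup?
// Loosely follows the "tag open state" of the HTML Standard's tokenizer.
function startsMarkup(input, ltIndex) {
	var next = input.charAt(ltIndex + 1); // "" when '<' is the last character
	return /^[A-Za-z\/!?]$/.test(next);
}

console.log(startsMarkup("a < b", 2));  // false
console.log(startsMarkup("a <= b", 2)); // false
console.log(startsMarkup("<b>", 0));    // true
console.log(startsMarkup("</b>", 0));   // true
```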

danielbrzn commented Jan 26, 2018

Author

Seems like the first case is handled fine. I've modified Tokenizer.js to treat <= as text, but there's a peculiar bug with the beautifying process that uses js-beautify.

x <= y will get beautified to x <=y. I've tried using this fix mentioned here but it doesn't work. I reckon the beautifier thinks <= is a valid open tag. Any ideas on how I could fix this?

acjh commented Jan 26, 2018

Collaborator

> I've modified Tokenizer.js to treat <= as text, but there's a peculiar bug with the beautifying process that uses js-beautify

Can you commit and push, so we can attempt to repro?

> Any ideas on how I could fix this?

Try updating js-beautify from 1.6.12 to 1.7.5 and see if the problem still exists.

@danielbrzn

Author

js-beautify is now at version 1.7.5, and the problem unfortunately persists.

image

acjh commented Jan 26, 2018

Collaborator

No repro:

@danielbrzn

Author

Are you generating the site from an index.md or an index.html? I get the bug when it's an HTML file, but not when it's an md file.

acjh commented Jan 27, 2018

Collaborator

Ah, I see that I suggested to "create a .html file" to see how the browser treats those strings.
Repro-ed when using markbind build with a .html file.

We don't have to solve that in this PR since:

  • it works with .md files which we're primarily concerned with,
  • it doesn't break anything, and
  • it's not caused by bad code in this PR.

So it's partial support for .html files: Given "a <= b", this PR gives "a <=b" instead of just "a".

Comment thread lib/Tokenizer.js Outdated
this._sectionStart = this._index;
} else if(this._isInequality){
// Next character should be parsed normally
this._isInequality = !this._isInequality;

Collaborator

This should be this._isInequality = false; since it's not a toggle.

Comment thread lib/Tokenizer.js Outdated
Tokenizer.prototype._stateText = function(c){
if(c === "<"){
// Parse open tag if it is not Markdown and not part of an inequality
if(c === "<" && !this._isMarkdownCode && !this._isInequality){

Collaborator

Why is && !this._isInequality necessary?

Author

This is so that the Tokenizer doesn't treat the < of a <= as the start of an open HTML tag.

Comment thread lib/Tokenizer.js Outdated
} else if(c === '<'){
var nextChar = this._buffer.charAt(this._index + 1);
if(nextChar === '='){
this._isInequality = !this._isInequality;

Collaborator

Should this be this._isInequality = true;?

Comment thread lib/Tokenizer.js Outdated
this._xmlMode = !!(options && options.xmlMode);
this._decodeEntities = !!(options && options.decodeEntities);
this._isMarkdownCode = false;
this._isInequality = false;

Collaborator

Reorder just these 2 in alphabetical order.

@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from 38ac8da to 21ddebe Compare January 27, 2018 06:21
Comment thread lib/Tokenizer.js Outdated
this._sectionStart = this._index;
} else if(this._isInequality){
// Next character should be parsed normally
this._isInequality = false;

Collaborator

Can this be the first if condition?

Author

If it's the first if condition, this._isInequality would be set to false and < would then be treated as a valid open tag.

Collaborator

It won't enter the else if block though?

Author

Woops, yes that's right. Will resolve it ASAP.

Comment thread lib/Tokenizer.js Outdated
this._isMarkdownCode = !this._isMarkdownCode;
} else if(c === '<'){
var nextChar = this._buffer.charAt(this._index + 1);
if(nextChar === '='){

Collaborator

  • This needs a comment for consistency.
  • Index should also be checked: if(c === '<' && this._index + 1 < this._buffer.length){
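
A small sketch of the lookahead with the bounds check written out. Note that `String.prototype.charAt` returns an empty string for an out-of-range index, so the comparison with '=' is already false at the end of the buffer; the explicit check mainly documents the intent. `isLessThanOrEqual` is a hypothetical name used only for this illustration.

```javascript
// Hypothetical helper: is the '<' at `index` immediately followed by '='?
function isLessThanOrEqual(buffer, index) {
	return buffer.charAt(index) === "<" &&
		index + 1 < buffer.length &&
		buffer.charAt(index + 1) === "=";
}

console.log(isLessThanOrEqual("a<=b", 1)); // true
console.log(isLessThanOrEqual("a<", 1));   // false: '<' is the last character
console.log("a<".charAt(2) === "");        // true: out-of-range charAt yields ""
```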

Author

Good point about the index, will do so.

Should the comment be inside the else if block or outside of it?

Collaborator

It can be inside if you add a section name.

Comment thread lib/Tokenizer.js Outdated
if(nextChar === '='){
this._isInequality = true;
}
}

Collaborator

  • Add a newline before and after this entire block.
  • Maybe add a section name like the ones below.

Author

By section name, do you mean changing '=' into something like EQUALS?

if(nextChar === EQUALS){
    this._isInequality = true;
}

Collaborator

Author

Would special conditions be an appropriate section name?

Collaborator

Fine for now.

@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from 21ddebe to ed1f971 Compare January 27, 2018 14:59
Comment thread lib/Tokenizer.js Outdated
if(this._isInequality){
// Next character will be parsed normally
this._isInequality = false;
} else if(c === "<" && !this._isMarkdownCode && !this._isInequality){

Collaborator

&& !this._isInequality should be removed.
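
Why the extra check is redundant, as a small sketch: an `else if` branch only runs when the preceding `if` condition was false, so the inequality flag is already known to be false there. `classify` and its return values are hypothetical, written only to mirror the shape of the reviewed branch.

```javascript
// Hypothetical function mirroring the shape of the reviewed branch.
function classify(c, isMarkdownCode, isInequality) {
	if (isInequality) {
		return "skip"; // the next character will be parsed normally
	} else if (c === "<" && !isMarkdownCode) {
		// isInequality is necessarily false here, so testing it again adds nothing
		return "open tag";
	}
	return "text";
}

console.log(classify("<", false, true));  // "skip"
console.log(classify("<", false, false)); // "open tag"
console.log(classify("<", true, false));  // "text"
```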

Comment thread lib/Tokenizer.js Outdated
* special conditions
*/
if(c === '`'){
// Detect Markdown code to be parsed as text

Collaborator

"Detect" → "Toggle"

Comment thread lib/Tokenizer.js Outdated
this._isMarkdownCode = !this._isMarkdownCode;
} else if(c === '<' && this._index + 1 < this._buffer.length){
var nextChar = this._buffer.charAt(this._index + 1);
// Detect '<=' inequality to be parsed as text

Collaborator

"Detect" → "Set"

Also, move this comment into the if block.

@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from ed1f971 to c09d6ee Compare February 1, 2018 05:12
danielbrzn commented Feb 1, 2018

Author

Made the necessary changes.

@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from c09d6ee to f11e76a Compare February 1, 2018 05:20
Comment thread lib/Tokenizer.js Outdated
this._isInequality = true;
}
}

@acjh acjh Feb 1, 2018

Collaborator

Hmm, this still looks out-of-place.

Let's introduce a new state MARKDOWN instead of tracking this._isMarkdownCode and this._isInequality.

Near top of file:

MARKDOWN                  = i++,
TEXT                      = i++, // No change

In this function:

if(this._state === MARKDOWN) {
	this._stateMarkdown(c);
} else if (this._state === TEXT) {
	this._stateText(c); // No change

Other functions:

Tokenizer.prototype._stateMarkdown = function(c){
	if(c === '`'){
		this._state = TEXT;
	}
}

Tokenizer.prototype._stateText = function(c){
	if(c === '`'){
		this._state = MARKDOWN;
	} else if(c === "<"){
		let isInequality = (this._index + 1 < this._buffer.length) && this._buffer.charAt(this._index + 1) === '=';
		if(!isInequality){
			if(this._index > this._sectionStart){
				this._cbs.ontext(this._getSection());
			}
			this._state = BEFORE_TAG_NAME;
			this._sectionStart = this._index;
		}
	}
}
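
To see the suggestion run end to end, here is a self-contained sketch. It is not the htmlparser2 Tokenizer: `tokenize` is a hypothetical function with only `TEXT` and `MARKDOWN` states plus a simplified `IN_TAG` state that stands in for `BEFORE_TAG_NAME` and everything after it, and it reports sections as plain objects instead of calling `this._cbs.ontext`.

```javascript
var TEXT = 0, MARKDOWN = 1, IN_TAG = 2; // IN_TAG is a simplification for this sketch

// Hypothetical mini tokenizer: splits a buffer into "text" and "tag" sections.
function tokenize(buffer) {
	var state = TEXT, sectionStart = 0, tokens = [];
	for (var index = 0; index < buffer.length; index++) {
		var c = buffer.charAt(index);
		if (state === MARKDOWN) {
			// Inside backticks everything is text until the closing backtick
			if (c === "`") {
				state = TEXT;
			}
		} else if (state === TEXT) {
			if (c === "`") {
				state = MARKDOWN;
			} else if (c === "<") {
				var isInequality = buffer.charAt(index + 1) === "=";
				if (!isInequality) {
					if (index > sectionStart) {
						tokens.push({ type: "text", data: buffer.slice(sectionStart, index) });
					}
					state = IN_TAG;
					sectionStart = index;
				}
			}
		} else if (c === ">") {
			// IN_TAG: the tag ends here, so go back to text
			tokens.push({ type: "tag", data: buffer.slice(sectionStart, index + 1) });
			state = TEXT;
			sectionStart = index + 1;
		}
	}
	if (sectionStart < buffer.length) {
		// Whatever is left over is reported as text in this sketch
		tokens.push({ type: "text", data: buffer.slice(sectionStart) });
	}
	return tokens;
}

console.log(tokenize("<p>`a<b` and c <= d</p>"));
```

Both the '<' inside the backticks and the '<' of '<=' stay inside a single text section, while `<p>` and `</p>` are still reported as tags.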

@acjh acjh added this to the v3.10.0-markbind.1 milestone Feb 1, 2018
@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch 2 times, most recently from 6943bc6 to 7aecd9b Compare February 1, 2018 15:34
@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from 7aecd9b to 369b0ba Compare February 1, 2018 15:37
Comment thread lib/Tokenizer.js Outdated

i = 0,


Collaborator

Remove whitespace.

Comment thread lib/Tokenizer.js Outdated
} else if(c === "<"){
var isInequality = (this._index + 1 < this._buffer.length) && (this._buffer.charAt(this._index + 1) === '=');
if(!isInequality){
if (this._index > this._sectionStart) {

Collaborator

@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from 369b0ba to b348228 Compare February 1, 2018 16:41
Comment thread lib/Tokenizer.js
xmlMap = require("entities/maps/xml.json"),

i = 0,

Collaborator

Restore newline (without whitespace).

Author

Done.

Collaborator

You added whitespace again 😕

Author

Sorry, fixed it now.

@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from b348228 to 89cde72 Compare February 2, 2018 07:18
@danielbrzn danielbrzn force-pushed the markdown-parsing-fix branch from 89cde72 to 6e614fb Compare February 2, 2018 14:55
Gisonrg commented Feb 2, 2018

Have we tested the code block (```) case?

@danielbrzn

Author

Do you mean whether code block cases render as before?

image

Just tried this out, seems to be fine. Is there something else I should test?

In the current version of the CS2103 website, however, this fix will cause the rest of the page to not render as intended because there's an extra backtick, specifically on this page under the code snippet where it says:
//Solution below adpated from https://stackoverflow.com/a/16252290`

If this backtick is removed, the page renders as per normal.
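
Both observations follow from flipping between text and code on every single backtick, assuming the tokenizer works as in the reviewed snippets: a fenced block contributes three backticks at each end, six in total, so the scanner ends up back in text mode, whereas one stray backtick leaves it in code mode for the rest of the page. A sketch of that reasoning, using the hypothetical helper `endsInsideCode`:

```javascript
// Hypothetical helper: after scanning `text`, is a scanner that flips
// between text and code on every backtick still in code mode?
function endsInsideCode(text) {
	var inCode = false;
	for (var i = 0; i < text.length; i++) {
		if (text.charAt(i) === "`") {
			inCode = !inCode;
		}
	}
	return inCode;
}

console.log(endsInsideCode("```\nx <= y\n```"));  // false: 3 + 3 backticks
console.log(endsInsideCode("// some comment`"));  // true: one stray backtick
```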

damithc commented Feb 3, 2018

> In the current version of the CS2103 website however, this fix will cause the rest of the page to not render as intended as there's an extra backtick;

Removed the extra backtick.

@Gisonrg Gisonrg left a comment

Great work :P

@acjh acjh merged commit 815b507 into MarkBind:master Feb 3, 2018
acjh added a commit to acjh/markbind that referenced this pull request Dec 7, 2019
Let's patch Tokenizer to treat Markdown code as text instead of HTML.

From MarkBind/htmlparser2#1:

> This fix allows Markdown code to contain '<' , '<=' without having it
> affect other HTML elements as it is now treated as a text element.
> Furthermore, no spaces are required when typing these symbols within
> the back ticks.
>
> As such, inequalities like the above can be rendered normally as
> shown below.
>
> `x<y`
> `<`
> `<=`
> `x<=y`

Development

Successfully merging this pull request may close these issues.

Code snippets: text in angle brackets become lowercase

4 participants