First pass lexical model compiler #1

Merged
mcdurdin merged 11 commits into master from basic-compiler
Feb 13, 2019

Conversation

@mcdurdin
Member

No description provided.

(new LexicalModelCompiler).compile({
format: 'trie-1.0',
wordBreaking: {
allowedCharacters: { initials: 'abcdefghijklmnopqrstuvwxyz', medials: 'abcdefghijklmnopqrstuvwxyz', finals: 'abcdefghijklmnopqrstuvwxyz' },
Contributor

Although I like how compact this notation is, JavaScript's strings may not work well, because there's no explicit way to denote full characters that are greater than one code point (e.g., <X̱> in Northern Haida, two code points in NFC). Also, characters that are outside the BMP will be a pain, and error-prone.

Member Author

@mcdurdin mcdurdin Jan 31, 2019

So could we support a string for simple situations, or an array of strings for others? e.g. ['a','b',...,'X̱']?

I am unsure on how this is going to be used -- I would be happy to have the source format support both but compiled into array of strings for simplicity of consumption.

Contributor

I like this idea!
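
A minimal sketch of the idea discussed above, assuming the source format accepts either a plain string or an array of strings and the compiler always emits an array. The names `CharacterSet` and `normalizeCharacterSet` are hypothetical and do not appear in this PR.

```typescript
// Hypothetical source-side type: a compact string for simple alphabets,
// or an explicit array when a "character" spans several code points.
type CharacterSet = string | string[];

// Hypothetical helper: always produce an array of strings for the compiled model.
function normalizeCharacterSet(source: CharacterSet): string[] {
  // Array.from iterates a string by code point, so characters outside the BMP
  // stay whole instead of being split into surrogate halves.
  const items = typeof source === 'string' ? Array.from(source) : source;
  // Normalize each entry to NFC, then drop duplicates while keeping order.
  const normalized = items.map(item => item.normalize('NFC'));
  return normalized.filter((item, index) => normalized.indexOf(item) === index);
}

// Simple case: one entry per code point.
const simple = normalizeCharacterSet('abc');

// Explicit case: X + COMBINING MACRON BELOW stays a single entry.
const explicit = normalizeCharacterSet(['a', 'b', 'X\u0331']);
```

The string form still splits on code points only, so a multi-code-point character such as X̱ has to be given in the array form.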

return indexes;
}

return indexesOf(text);
Contributor

Do you feel the indices in the string will be more useful than returning the actual words as strings? If word breaker functions return both the start and the end index of a word, then we can have it both ways: default is to return indices, and a simple function call can convert indices into words as strings.

Member Author

Good questions. Am happy to adjust this per the requirements of the wordbreaking when we get to it. I've defaulted to the most basic approach possible now of just marking the characters which are wordbreakers -- but I know that is naive. There are plenty of other wordbreak algorithms that we should research as well (yay!)
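
For illustration, the "both ways" suggestion could look roughly like the sketch below: the word breaker returns start and end indices, and a separate call turns them into strings. `Span`, `breakWords`, and `spansToWords` are hypothetical names, and the single-character breaker test is the same naive approach described above, not a real word-breaking algorithm.

```typescript
// Hypothetical span type: offsets are UTF-16 code unit indices, end is exclusive.
interface Span {
  start: number;
  end: number;
}

// Naive breaker: any run of non-breaker characters counts as one word.
function breakWords(text: string, isBreaker: (ch: string) => boolean): Span[] {
  const spans: Span[] = [];
  let start = -1; // -1 means "not currently inside a word"
  for (let i = 0; i < text.length; i++) {
    if (isBreaker(text[i])) {
      if (start >= 0) {
        spans.push({ start, end: i });
        start = -1;
      }
    } else if (start < 0) {
      start = i;
    }
  }
  // Close a word that runs to the end of the text.
  if (start >= 0) {
    spans.push({ start, end: text.length });
  }
  return spans;
}

// Convert index spans into the words themselves.
function spansToWords(text: string, spans: Span[]): string[] {
  return spans.map(span => text.slice(span.start, span.end));
}

const sample = 'hello  world';
const spans = breakWords(sample, ch => ch === ' ');
const words = spansToWords(sample, spans);
```

Returning spans keeps the position information for callers that need it, while `spansToWords` covers callers that only want the strings.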

Comment thread resources/util.sh
# Define terminal colours
#

if [ -t 2 ]; then
Contributor

@eddieantonio eddieantonio Jan 28, 2019

I had to look up what [ -t 2 ] is. Could you please add a comment explaining why this is here? I reckon it's to define ANSI colours ONLY when stderr is attached to a terminal (as opposed to redirected to a file).

Member Author

Sure :)

@mcdurdin mcdurdin added the enhancement New feature or request label Jan 31, 2019
@mcdurdin mcdurdin added this to the P5S3 milestone Jan 31, 2019
<FileVersion>12.0</FileVersion>
</System>
<Options>
<FollowKeyboardVersion/>
Contributor

?

How would the model follow the keyboard version? As I recall, models and keyboards aren't allowed within the same package.

Member Author

Yes, this is not great... but, we will overload this tag name to mean "Follow Lexical Model Version" as well. We may bubble that in the UI more appropriately. It should have been called "Follow Content File Version" but my crystal ball was on the blink on the day I chose the tag name :)

Comment thread tools/index.ts
return fs.readFileSync(path.join(sourcePath, source), 'utf8');
});

let oc: LexicalModelCompiled = {id:model_info.id, format:o.format, wordBreaking:o.wordBreaking};
Contributor

format and wordBreaking are not listed as properties in tools/lexical-model.ts.

Member Author

interface LexicalModelCompiled extends LexicalModel :)
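
A small sketch of why the assignment type-checks, assuming the shapes below. Only `format`, `wordBreaking`, `allowedCharacters`, and `id` are taken from the snippets in this PR; the exact field types and the sample id are assumptions.

```typescript
// Assumed, trimmed-down version of the interface in tools/lexical-model.ts.
interface LexicalModel {
  readonly format: 'trie-1.0' | 'fst-foma-1.0' | 'custom-1.0';
  readonly wordBreaking?: {
    // Field types are an assumption based on the compile() call in this PR.
    allowedCharacters?: { initials?: string; medials?: string; finals?: string };
  };
}

// `extends` pulls in format and wordBreaking, so only id is declared here.
interface LexicalModelCompiled extends LexicalModel {
  readonly id: string;
}

const o: LexicalModel = {
  format: 'trie-1.0',
  wordBreaking: { allowedCharacters: { initials: 'abc', medials: 'abc', finals: 'abc' } },
};

// Same pattern as the line under review; 'example.en.sketch' is a made-up id.
const oc: LexicalModelCompiled = {
  id: 'example.en.sketch',
  format: o.format,
  wordBreaking: o.wordBreaking,
};
```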

Comment thread tools/index.ts Outdated
//
// Filename expectations
//
const kpsFileName = '../source/'+model_info.id+'.model.kps';
Contributor

Suggestion: String templates may be cleaner here:

const kpsFileName = `../source/${model_info.id}.model.kps`;

Comment thread tools/kmp-json-file.ts
@@ -0,0 +1,79 @@
interface KmpJsonFile {
Contributor

Where will the official documentation (i.e., meaning of each field) for the JSON file exist?

Member Author

I have an open PR on the help site (currently private, although shortly hoping to move it public) which documents these fields. See https://help.keyman.com/developer/11.0/reference/file-types/metadata (probably will be in 12.0 URL when that lands).

Comment thread tools/lexical-model.ts

interface LexicalModel {
readonly format: 'trie-1.0'|'fst-foma-1.0'|'custom-1.0',
readonly wordBreaking?: {
Contributor

Note that a custom model may provide its own word breaking.

Comment thread tools/lexical-model.ts
//... metadata ...
}

interface LexicalModelPrediction {
Contributor

@eddieantonio eddieantonio Feb 5, 2019

Are these interfaces to correspond with the interfaces in the main repo?

See: https://github.com/keymanapp/keyman/blob/master/common/predictive-text/message.d.ts#L145-L224

Member Author

Yes, this is really a stub for the real thing. I will update LexicalModelPrediction to correspond more cleanly with the Transform interface (but will not rely on the one in the main repo for now -- refactor can come once everything stabilises I think).

Comment thread tools/lexical-model.ts
fst: string;
}

interface LexicalModelCompiledCustom extends LexicalModelCompiled {
Contributor

This will most likely also contain src: string, but we'll cross that bridge when we get there :D

Comment thread tools/model-info-file.ts
@@ -0,0 +1,33 @@
interface ModelInfoFile {
Contributor

Where will the official documentation for this live?

Member Author

It will live in the https://help.keyman.com/developer/cloud/ area, alongside the https://help.keyman.com/DEVELOPER/cloud/keyboard_info/1.0/ .keyboard_info file. Again, I have a PR open for this.

There is also a PR open against api.keyman.com/schemas for the corresponding JSON schema (and the updated kmp.json schema as well). Also hopefully to go public soon.

Member Author

These PRs are both #1 because we've just moved api and help sites to public source.

@eddieantonio
Contributor

My major concerns so far are: where the JSON documentation will exist; and whether the prediction interfaces will be duplicated and compatible with those in keymanapp/keyman.

Aside from that, LGTM!

@mcdurdin mcdurdin modified the milestones: P5S3, P5S4 Feb 8, 2019
@mcdurdin
Member Author

Okay, I've addressed a bunch of bits and pieces with the compiler and documentation, and tried to cover the review comments. At this point, there are still a bunch of unfinished bits and pieces but I think that's fine, because this still gives us a base for generating .kmp files for use in the target applications, and for establishing CI.

Comment thread resources/compile.sh Outdated
pushd build
mkdir obj
-../../../../node_modules/.bin/tsc --outDir ./obj ../source/model.ts
+../../../../node_modules/.bin/tsc --module commonjs --target es6 --outDir ./obj ../source/model.ts
Contributor

  1. Would it be cleaner to use npx tsc instead of ../../../../node_modules/.bin/tsc?
  2. Is it worth extracting the compiler options to an explicit tsconfig.json?

Member Author

  1. Updated to use npx. Note, on Windows, this requires Node.js version 10.0 or later.

  2. Am I right in thinking this would require an explicit tsconfig.json for each model? If so for now I'll leave it in the script.

@mcdurdin mcdurdin merged commit 4bf71a5 into master Feb 13, 2019
@mcdurdin mcdurdin deleted the basic-compiler branch February 13, 2019 23:30
DavidLRowe pushed a commit that referenced this pull request Mar 10, 2022
mcdurdin pushed a commit that referenced this pull request Mar 28, 2022
chore: Fixup broken build script
DavidLRowe pushed a commit that referenced this pull request Jul 11, 2022
DavidLRowe pushed a commit that referenced this pull request Jan 29, 2025
add lexical model for chechen latin
Labels

enhancement New feature or request

3 participants