Support other ngram groups. #7

Open
wants to merge 2 commits into
base: master

shaleh
Contributor

@shaleh shaleh commented May 30, 2024

Refactored ngrams.rs into a module directory to support other groups of ngrams. Added 'programming' as the first new group. It comprises the top 100 keywords among popular programming languages. To make this easier I defined a trait called NgramData, plus some matching enums and empty structs.

Yes, there are overlaps with the standard ngrams for English. I had two thoughts here.

  1. Thinking ahead to when people might want to use this in their native language instead of English. The programming ngrams might be helpful there, and establishing the trait makes adding other groups easier than forcing everyone to use a file.
  2. I really wanted to include symbol pairs like <= or ->. However, this does not work because the current keymap definitions do not have symbols: they do not know about shifts on qwerty and the like. On top of that, the on-screen keyboard does not show the numbers/symbols row, and things are not set up for it.

Together, that means this is something of an aspirational PR. Might be worth applying it now and then iterating on improvements until the final state is reached?

Also, this PR has functions to break words into ngrams of size N. Assuming the logic is correct, that could also make it easier for people to use word files: they can provide a dictionary, ask for trigrams, and let the code parse it out.
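
For readers who want a concrete picture of breaking a word into ngrams of size N, here is a minimal sketch using only the standard library. The function name `char_ngrams` and its signature are assumptions made for illustration; they are not taken from the PR, whose actual implementation may differ. It slices on character boundaries so multi-byte characters are not split.

```rust
/// Hypothetical helper (not the PR's code): yield every contiguous window
/// of `n` characters in `source` as a borrowed subslice.
fn char_ngrams(source: &str, n: usize) -> impl Iterator<Item = &str> + '_ {
    // Byte offsets where each character starts, plus the end of the string,
    // so that bounds[i]..bounds[i + n] always lands on character boundaries.
    let bounds: Vec<usize> = source
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(source.len()))
        .collect();

    // A word with k characters has k - n + 1 windows; none if n is 0 or n > k.
    let count = if n == 0 {
        0
    } else {
        bounds.len().saturating_sub(n)
    };

    (0..count).map(move |start| &source[bounds[start]..bounds[start + n]])
}
```

Calling `char_ngrams("while", 3)` yields the slices `whi`, `hil`, and `ile`; a word shorter than N yields nothing.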

@shaleh
Contributor Author

shaleh commented May 30, 2024

What is inspiring my adoption of this app is that I am learning Colemak on a split ortho keyboard with 52 keys. My symbols are on a layer using QMK, so having a way to practice them and evaluate various layouts would be nice. But if this is the wrong tool, that is totally OK.

@shaleh
Contributor Author

shaleh commented May 30, 2024

I used "group" instead of "language" because maybe there are other things than languages here -- like programming keywords. Very much an arbitrary decision and open to better clarity if there are suggestions.
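
To make the "group" idea concrete, here is a rough sketch of how a trait plus an enum and empty structs could fit together. Only the names `NgramData` and `ProgrammingData` come from this PR; the `NgramGroup` enum, the `EnglishData` struct, the `words` method, and the `data_for` function are assumptions for illustration, and the word lists are placeholders rather than real frequency data.

```rust
/// Hypothetical enum of selectable groups (name assumed, not from the PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NgramGroup {
    English,
    Programming,
}

/// Assumed shape of the trait; the real `NgramData` methods may differ.
pub trait NgramData {
    /// Source words from which ngrams would be derived.
    fn words(&self) -> &'static [&'static str];
}

/// Empty structs that only carry the trait implementation.
pub struct EnglishData();
pub struct ProgrammingData();

impl NgramData for EnglishData {
    fn words(&self) -> &'static [&'static str] {
        // Placeholder sample, not the project's English data.
        &["the", "and"]
    }
}

impl NgramData for ProgrammingData {
    fn words(&self) -> &'static [&'static str] {
        // A few keywords taken from the list shown in this PR.
        &["while", "yield", "void"]
    }
}

/// Hypothetical lookup from a group to its data source.
pub fn data_for(group: NgramGroup) -> Box<dyn NgramData> {
    match group {
        NgramGroup::English => Box::new(EnglishData()),
        NgramGroup::Programming => Box::new(ProgrammingData()),
    }
}
```

With this shape, adding another group would mean one new enum variant, one new empty struct, and one trait implementation, without touching the callers.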

use itertools::Itertools;

pub struct ProgrammingData();

Contributor Author

I used the common data instead of making this Rust specific. People might use the tool and not know or care about the language. That means not all of the Rust keywords made the cut.

"this", "throw", "true", "try", "type", "typedef", "typeof", "union", "unsigned", "until",
"using", "var", "void", "volatile", "when", "where", "while", "with", "xor", "yield",
];

Contributor Author

This is the "magic" that breaks up the words into ngrams. In theory this could be moved to mod.rs and exposed. Then it could be used to parse dictionaries the user provides instead of requiring them to have ngram lists.

.map(|(to, c)| &source[from..from + to + c.len_utf8()])
})
}

Contributor Author

All of the methods below use the itertools unique method to ensure there are no duplicate entries.

Contributor Author

Also, they use map(String::from) instead of map(|s| s.to_string()).
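
As an illustration of the two points above, here is a small sketch of order-preserving deduplication followed by `map(String::from)`. Note that it swaps in a standard-library `HashSet` filter in place of the itertools `unique` method the PR uses, so that the example has no external dependencies. The function name `unique_owned` is hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical helper: drop repeated entries (keeping the first occurrence)
/// and convert the survivors to owned `String`s.
fn unique_owned<'a>(items: impl IntoIterator<Item = &'a str>) -> Vec<String> {
    let mut seen = HashSet::new();
    items
        .into_iter()
        // `insert` returns false when the value was already present,
        // so only the first occurrence of each entry passes the filter.
        .filter(|s| seen.insert(*s))
        // Equivalent to `.map(|s| s.to_string())`, without the closure.
        .map(String::from)
        .collect()
}
```

For the input `["th", "he", "th", "in", "he"]` this keeps `th`, `he`, and `in`, in that order.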

@wintermute-cell
Owner

Looks interesting, I'll take a closer look at this soon. I'm pretty busy at the moment so it might be a while until I can find the time!

@shaleh
Contributor Author

shaleh commented Jun 2, 2024

All good. No rush.
