Added JavaScript/wasm target via emscripten #171

Merged 23 commits into bab2min:main on Jun 30, 2024


Contributor

@RicBent RicBent commented Jun 13, 2024

This pull request adds a build target for a JavaScript library that is usable in any modern browser.
It is built with the emscripten toolchain.

Solves #169

You can check out a running demo here: https://ricbent.github.io/Kiwi/demo/

Things that can be improved:

  • More of the base API functionality (multiple results, custom builder arguments, etc)
  • Implementation of wasm 128-bit SIMD functions (supported in all current browsers: https://caniuse.com/wasm-simd)
  • Addition of GitHub actions to build this new target
  • Creating a proper npm package (package.json, helpers like the WebWorker wrapper from the demo, TypeScript types, etc), deployed via GitHub actions
  • Usage of wizer to pre-initialize the wasm module with a dictionary. This would entirely bypass the slow Kiwi initialization times.
  • Multi-thread support via WebWorkers
  • Add Documentation
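The WebWorker wrapper mentioned in the list above could look roughly like the following. This is only a minimal sketch of the general request/response pattern, not the wrapper from the demo: `WorkerLike`, `RpcRequest`, `RpcResponse`, `WorkerClient` and `createFakeWorker` are all hypothetical names, and the whitespace-splitting `tokenize` handler is a stand-in rather than Kiwi itself.

```typescript
// Structural type for the part of a Worker this sketch relies on.
// A browser Worker has this shape; the name itself is illustrative.
interface WorkerLike {
  postMessage(message: unknown): void;
  onmessage: ((event: { data: any }) => void) | null;
}

// Hypothetical message formats exchanged with the worker.
interface RpcRequest { id: number; method: string; args: unknown[]; }
interface RpcResponse { id: number; result?: unknown; error?: string; }

type Pending = { resolve: (value: unknown) => void; reject: (error: Error) => void };

// Turns postMessage/onmessage traffic into promise-returning calls.
class WorkerClient {
  private worker: WorkerLike;
  private nextId = 0;
  private pending = new Map<number, Pending>();

  constructor(worker: WorkerLike) {
    this.worker = worker;
    worker.onmessage = (event) => {
      const response = event.data as RpcResponse;
      const entry = this.pending.get(response.id);
      if (!entry) return; // ignore replies nobody is waiting for
      this.pending.delete(response.id);
      if (response.error !== undefined) entry.reject(new Error(response.error));
      else entry.resolve(response.result);
    };
  }

  call(method: string, ...args: unknown[]): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      const request: RpcRequest = { id, method, args };
      this.worker.postMessage(request);
    });
  }
}

// In-process stand-in for a real Worker, so the sketch runs without a browser.
function createFakeWorker(handlers: Record<string, (...args: any[]) => unknown>): WorkerLike {
  const worker: WorkerLike = {
    onmessage: null,
    postMessage(message: unknown) {
      const request = message as RpcRequest;
      // Reply asynchronously, as a real worker would.
      Promise.resolve().then(() => {
        let response: RpcResponse;
        try {
          const handler = handlers[request.method];
          if (!handler) throw new Error(`unknown method: ${request.method}`);
          response = { id: request.id, result: handler(...request.args) };
        } catch (e) {
          response = { id: request.id, error: (e as Error).message };
        }
        worker.onmessage?.({ data: response });
      });
    },
  };
  return worker;
}

// Usage: the handler below is a placeholder, not the Kiwi tokenizer.
const demoClient = new WorkerClient(
  createFakeWorker({ tokenize: (text: string) => text.split(" ") })
);
demoClient.call("tokenize", "a b c").then((tokens) => console.log(tokens));
```

Matching replies to requests by `id` keeps the wrapper correct when several calls are in flight at once, which matters because initialization and analysis in a worker can take a noticeable amount of time.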

Because I haven't set up GitHub actions for this target yet, you need to manually build. This requires the emscripten toolchain to be installed:

mkdir build
cd build
emcmake cmake -DCMAKE_BUILD_TYPE=Release -DKIWI_USE_CPUINFO=OFF -DKIWI_BUILD_TEST=OFF -DKIWI_USE_MIMALLOC=OFF -DKIWI_BUILD_CLI=OFF -DKIWI_BUILD_EVALUATOR=OFF -DKIWI_BUILD_MODEL_BUILDER=OFF ../
make

This will generate kiwi-wasm.js and kiwi-wasm.wasm. The former can be directly used in any modern browser.

Sorry if I didn't follow some contribution guidelines correctly as I don't speak much Korean (yet).


Things changed since the creation of the PR:

  • Bindings implement the same functionality as the Java bindings, instead of just the tokenize/build functions
  • TypeScript typings for the entire API
  • Inline code documentation; HTML docs can be generated, just like for the C++/C/Python bindings
  • The npm package should be publishable by the release workflow
  • A demo project that uses the npm package is provided, replicating the linked demo, now with proper typing

@RicBent
Contributor Author

RicBent commented Jun 13, 2024

Is this change correct? RicBent@3f0eb0c

Passing 0 for numThreads to prevent the ThreadPool creation in KiwiBuilder did not work because it triggered an assertion.

@RicBent
Contributor Author

RicBent commented Jun 14, 2024

Attempted to implement the release workflow. Not sure how to test it properly without triggering a release.

I guess a workflow that triggers on new pull requests still needs to be added.

@bab2min
Owner

bab2min commented Jun 15, 2024

@RicBent
Wow, it's amazing! Thank you for your contribution! The demo seems to work very well.
I'll review it as soon as possible.

@RicBent
Contributor Author

RicBent commented Jun 17, 2024

Great!

The latest commit adds a wrapper package that can be imported directly in any npm project, along with proper types for the entire API exposed so far (https://github.com/bab2min/Kiwi/blob/2161cf137b996471383e7c7b65370e9f28981f14/bindings/wasm/package/src/kiwi.ts).

I'd be happy to implement the remaining API functionality to bring it to the same level as the Python library once you have had a look at the PR.

@bab2min
Owner

bab2min commented Jun 18, 2024

@RicBent
Oh, it looks good to me. 👍
I would really appreciate it if the rest of the functionality was completed as well.
If it is difficult for you to implement all the functionality, I think it is okay to implement only the core, merge and release them first, and then supplement the rest later.

@RicBent
Contributor Author

RicBent commented Jun 19, 2024

I had a look at the Java bindings and it seems like the only missing API is the following:

KiwiBuilder:

  • Init with TypoTransformer
  • Add word
  • Add pre-analyzed word
  • Load additional dictionaries

Kiwi:

  • Analyze/tokenize with blocklist
  • Analyze/tokenize with pre-tokenized spans
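To make "pre-tokenized spans" more concrete, here is a small sketch of how a caller might sanity-check spans before handing them to an analyzer. Everything in it is an assumption made for illustration: the `PretokenizedSpan` shape, the half-open `[start, end)` convention, and the rule that spans must be in range and non-overlapping are not taken from this PR or from the actual bindings.

```typescript
// Hypothetical span shape: a half-open range [start, end) over the input string,
// measured in UTF-16 code units (the unit used by JavaScript string indices).
interface PretokenizedSpan {
  start: number;
  end: number;
}

// Returns a list of problems; an empty list means the spans passed all checks.
function validateSpans(text: string, spans: PretokenizedSpan[]): string[] {
  const problems: string[] = [];
  // Sort a copy by start offset so overlaps can be found in a single pass.
  const sorted = [...spans].sort((a, b) => a.start - b.start);
  let previousEnd = 0;
  for (const span of sorted) {
    if (!Number.isInteger(span.start) || !Number.isInteger(span.end)) {
      problems.push(`non-integer span [${span.start}, ${span.end})`);
      continue;
    }
    if (span.start < 0 || span.end > text.length || span.start >= span.end) {
      problems.push(`out-of-range or empty span [${span.start}, ${span.end})`);
      continue;
    }
    if (span.start < previousEnd) {
      problems.push(`span [${span.start}, ${span.end}) overlaps an earlier span`);
    }
    previousEnd = Math.max(previousEnd, span.end);
  }
  return problems;
}

// Usage: two adjacent, non-overlapping spans over an 11-character string.
const spanProblems = validateSpans("hello world", [
  { start: 0, end: 5 },
  { start: 6, end: 11 },
]);
console.log(spanProblems.length === 0 ? "spans look usable" : spanProblems.join("\n"));
```

Returning a list of problems instead of throwing on the first one lets a caller report every bad span at once.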

I would also like to add documentation to the bindings. However, I can only write it in English, not in Korean. Is that a problem, @bab2min?

Oh and we would also need an npm package name as 'kiwi' is already taken. I chose 'kiwi-nlp' for now, but I can change it if you have a better suggestion.

@bab2min
Owner

bab2min commented Jun 21, 2024

@RicBent
It's okay to write the documentation in English. If you write the documentation in English, I can translate it for Korean documentation.
I think kiwi-nlp is a good name for the package, as it shows that this is an NLP library.
Thank you for your contribution!

@RicBent
Contributor Author

RicBent commented Jun 24, 2024

Alright, I am almost there then. I have already written nearly all of the required documentation, and only the four KiwiBuilder items are still missing to match the Java API functionality.

I also amended the release workflow to automatically build and publish the resulting package to npm:

build-emscripten:
  name: Emscripten
  runs-on: ubuntu-latest
  steps:
  - uses: actions/checkout@v2
    with:
      submodules: true
      lfs: true
  - uses: mymindstorm/setup-emsdk@v14
  - name: Build
    run: |
      cd bindings/wasm
      ./build.sh
  - uses: JS-DevTools/npm-publish@v3
    with:
      token: ${{ secrets.NPM_TOKEN }}
      package: bindings/wasm/package

The workflow requires NPM_TOKEN to be added to the repo's secrets on your end when this gets merged.

To generate an npm token you need to do the following:

  1. Register an account on npm if you haven't already
  2. Follow this to create a token (the token will need write access to the kiwi-nlp package)

To add it to the repo's secrets you can follow this:
https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions

@RicBent
Contributor Author

RicBent commented Jun 25, 2024

@bab2min everything on my end should be done now:

  • The TypeScript bindings implement the same functionality as the Java bindings
  • Inline code documentation is done, HTML docs can be generated, just like the Python bindings
  • npm package should be publishable by release workflow (needs token as described)
  • demo project that uses the npm package is provided, replicating this: https://ricbent.github.io/Kiwi/demo/

Let me know if anything needs to be improved/changed!

@bab2min
Owner

bab2min commented Jun 27, 2024

@RicBent
Great!!! Thank you for your contribution.
I'll add the Korean documentation and the token for the release workflow.

@bab2min bab2min merged commit b2709dc into bab2min:main Jun 30, 2024
14 checks passed