-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use full path for inputs. #2
base: master
Are you sure you want to change the base?
Conversation
Why don't you put the full path or use normalizePath in your training script? |
Mostly because I didn't expect to use full path in BOTH When I first use this function I immediately got the error of not finding |
I think there was another reason why I didn't put the full path there but I forgot why. Something about the sentencepiece C++ code or something about how to save objects but I forgot... Don't know what to do know. |
All I know is that spm_train --input=data/sent.txt --model_prefix=model/spm At least this works for |
yes that was the reason. If you specify the full path to the file in model_prefix, as specified in the sentencepiece docs, I no longer correctly load the resulting model at the end of the sentencepiece call in https://github.com/bnosac/sentencepiece/blob/master/R/sentencepiece.R#L88. That's why I didn't specify the full path. Or it least I got lost in this path setup which was both imposed by sentencepiece and cran |
Decouple the `args` expert use from other arguments. The current design may confuse user about using `model_dir` and `model_prefix` with directory being involved. So we make it clear that when `args` is used, user will take full control of the underlying `spm_train` command line. Since the original program doesn't have `verbose` option so we keep only this argument to be still relevant to expert use. Under the new design, the following two calls are equivalent: ```R sentencepiece("data/sentences.txt", model_prefix="spm", model_dir="model") sentencepiece(args=c("--input=data/sentences.txt --model_prefix=model/spm")) ``` Both supports relative path based on the same working directory. Also I change the regex of `model_prefix` matching pattern from `"model_prefix=.+ -"` to `"model_prefix=.+( -|$)"` so `--model_prefix` can also be the last argument.
Let's give it another try. Copy-paste note from the new commit: Decouple the Under the new design, the following two calls are equivalent: sentencepiece("data/sentences.txt", model_prefix="spm", model_dir="model")
sentencepiece(args=c("--input=data/sentences.txt --model_prefix=model/spm")) Both supports relative path based on the same working directory. Also I change the regex of |
Okay, but I think what I was trying to point at that model_prefix can be the full path to a file where to store the model, where the complexity came from. But thinking of this, this is probably not related to the use of |
Currently if we specify
model_dir
the inputx
will need to be relative to that path, which could be counter intuitive.The scenario is that we put the training files in a data directory while we want the resulting model in another dedicated model directory.
By expanding
x
to its full path at the very beginning of the function call we shall be able to avoid any ambiguity resolving the path of both data and model directories.