Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use full path for inputs. #2

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

everdark
Copy link

Currently if we specify model_dir the input x will need to be relative to that path, which could be counter intuitive.
The scenario is that we put the training files in a data directory while we want the resulting model in another dedicated model directory.
By expanding x to its full path at the very beginning of the function call we shall be able to avoid any ambiguity resolving the path of both data and model directories.

@jwijffels
Copy link
Contributor

Why don't you put the full path or use normalizePath in your training script?

@everdark
Copy link
Author

Why don't you put the full path or use normalizePath in your training script?

Mostly because I didn't expect to use full path in BOTH x and model_dir at all.
Of course that makes thing very clear. Then maybe we shall be clear in document that the path should be full path, otherwise when model_dir is supplied, the path to x will be relative to it if not given in full path.
To me it is just more user-friendly to support relative path in a intuitive way. :)

When I first use this function I immediately got the error of not finding x, I couldn't understand since file.exists(x) is definitely a TRUE. So I dived into the source code of the function sentencepiece and realized that x is indeed relative to model_dir.
I believe this can be common for other users, too.

@jwijffels
Copy link
Contributor

I think there was another reason why I didn't put the full path there but I forgot why. Something about the sentencepiece C++ code or something about how to save objects but I forgot... Don't know what to do know.

@everdark
Copy link
Author

All I know is that spm_train doesn't have a separate argument to specify model output directory.
By default it writes the model and vocab file to the current directory.
However we are able to use --model_prefix to specify the model directory:

spm_train --input=data/sent.txt --model_prefix=model/spm

At least this works for sentencepiece 0.1.84.
Not sure if this is related to your forgotten reason though...

@jwijffels
Copy link
Contributor

jwijffels commented Jul 27, 2020

yes that was the reason. If you specify the full path to the file in model_prefix, as specified in the sentencepiece docs, I no longer correctly load the resulting model at the end of the sentencepiece call in https://github.com/bnosac/sentencepiece/blob/master/R/sentencepiece.R#L88. That's why I didn't specify the full path. Or it least I got lost in this path setup which was both imposed by sentencepiece and cran

Decouple the `args` expert use from other arguments.
The current design may confuse user about using `model_dir` and `model_prefix` with directory being involved.
So we make it clear that when `args` is used, user will take full control of the underlying `spm_train` command line.
Since the original program doesn't have `verbose` option so we keep only this argument to be still relevant to expert use.

Under the new design, the following two calls are equivalent:

```R
sentencepiece("data/sentences.txt", model_prefix="spm", model_dir="model")
sentencepiece(args=c("--input=data/sentences.txt --model_prefix=model/spm"))
```
Both supports relative path based on the same working directory.

Also I change the regex of `model_prefix` matching pattern from `"model_prefix=.+ -"` to `"model_prefix=.+( -|$)"` so `--model_prefix` can also be the last argument.
@everdark
Copy link
Author

Let's give it another try. Copy-paste note from the new commit:


Decouple the args expert use from other arguments.
The current design may confuse user about using model_dir and model_prefix with directory being involved.
So we make it clear that when args is used, user will take full control of the underlying spm_train command line.
Since the original program doesn't have verbose option so we keep only this argument to be still relevant to expert use.

Under the new design, the following two calls are equivalent:

sentencepiece("data/sentences.txt", model_prefix="spm", model_dir="model")
sentencepiece(args=c("--input=data/sentences.txt --model_prefix=model/spm"))

Both supports relative path based on the same working directory.

Also I change the regex of model_prefix matching pattern from "model_prefix=.+ -" to "model_prefix=.+( -|$)" so --model_prefix can also be the last argument.

@jwijffels
Copy link
Contributor

jwijffels commented Jul 27, 2020

Okay, but I think what I was trying to point at that model_prefix can be the full path to a file where to store the model, where the complexity came from. But thinking of this, this is probably not related to the use of x. The reason why capture.output was used was at request of cran to suppress training evolution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants