Use full path for inputs. #2

everdark · 2020-07-25T03:36:15Z

Currently if we specify model_dir the input x will need to be relative to that path, which could be counter intuitive.
The scenario is that we put the training files in a data directory while we want the resulting model in another dedicated model directory.
By expanding x to its full path at the very beginning of the function call we shall be able to avoid any ambiguity resolving the path of both data and model directories.

jwijffels · 2020-07-27T12:08:30Z

Why don't you put the full path or use normalizePath in your training script?

everdark · 2020-07-27T12:51:15Z

Why don't you put the full path or use normalizePath in your training script?

Mostly because I didn't expect to use full path in BOTH x and model_dir at all.
Of course that makes thing very clear. Then maybe we shall be clear in document that the path should be full path, otherwise when model_dir is supplied, the path to x will be relative to it if not given in full path.
To me it is just more user-friendly to support relative path in a intuitive way. :)

When I first use this function I immediately got the error of not finding x, I couldn't understand since file.exists(x) is definitely a TRUE. So I dived into the source code of the function sentencepiece and realized that x is indeed relative to model_dir.
I believe this can be common for other users, too.

jwijffels · 2020-07-27T12:56:03Z

I think there was another reason why I didn't put the full path there but I forgot why. Something about the sentencepiece C++ code or something about how to save objects but I forgot... Don't know what to do know.

everdark · 2020-07-27T13:24:38Z

All I know is that spm_train doesn't have a separate argument to specify model output directory.
By default it writes the model and vocab file to the current directory.
However we are able to use --model_prefix to specify the model directory:

spm_train --input=data/sent.txt --model_prefix=model/spm

At least this works for sentencepiece 0.1.84.
Not sure if this is related to your forgotten reason though...

jwijffels · 2020-07-27T13:28:05Z

yes that was the reason. If you specify the full path to the file in model_prefix, as specified in the sentencepiece docs, I no longer correctly load the resulting model at the end of the sentencepiece call in https://github.com/bnosac/sentencepiece/blob/master/R/sentencepiece.R#L88. That's why I didn't specify the full path. Or it least I got lost in this path setup which was both imposed by sentencepiece and cran

Decouple the `args` expert use from other arguments. The current design may confuse user about using `model_dir` and `model_prefix` with directory being involved. So we make it clear that when `args` is used, user will take full control of the underlying `spm_train` command line. Since the original program doesn't have `verbose` option so we keep only this argument to be still relevant to expert use. Under the new design, the following two calls are equivalent: ```R sentencepiece("data/sentences.txt", model_prefix="spm", model_dir="model") sentencepiece(args=c("--input=data/sentences.txt --model_prefix=model/spm")) ``` Both supports relative path based on the same working directory. Also I change the regex of `model_prefix` matching pattern from `"model_prefix=.+ -"` to `"model_prefix=.+( -|$)"` so `--model_prefix` can also be the last argument.

everdark · 2020-07-27T14:21:59Z

Let's give it another try. Copy-paste note from the new commit:

Decouple the args expert use from other arguments.
The current design may confuse user about using model_dir and model_prefix with directory being involved.
So we make it clear that when args is used, user will take full control of the underlying spm_train command line.
Since the original program doesn't have verbose option so we keep only this argument to be still relevant to expert use.

Under the new design, the following two calls are equivalent:

sentencepiece("data/sentences.txt", model_prefix="spm", model_dir="model")
sentencepiece(args=c("--input=data/sentences.txt --model_prefix=model/spm"))

Both supports relative path based on the same working directory.

Also I change the regex of model_prefix matching pattern from "model_prefix=.+ -" to "model_prefix=.+( -|$)" so --model_prefix can also be the last argument.

jwijffels · 2020-07-27T19:38:28Z

Okay, but I think what I was trying to point at that model_prefix can be the full path to a file where to store the model, where the complexity came from. But thinking of this, this is probably not related to the use of x. The reason why capture.output was used was at request of cran to suppress training evolution.

Use full path for inputs.

47716cf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use full path for inputs. #2

Use full path for inputs. #2

everdark commented Jul 25, 2020

jwijffels commented Jul 27, 2020

everdark commented Jul 27, 2020

jwijffels commented Jul 27, 2020

everdark commented Jul 27, 2020

jwijffels commented Jul 27, 2020 •

edited

Loading

everdark commented Jul 27, 2020

jwijffels commented Jul 27, 2020 •

edited

Loading

Use full path for inputs. #2

Are you sure you want to change the base?

Use full path for inputs. #2

Conversation

everdark commented Jul 25, 2020

jwijffels commented Jul 27, 2020

everdark commented Jul 27, 2020

jwijffels commented Jul 27, 2020

everdark commented Jul 27, 2020

jwijffels commented Jul 27, 2020 • edited Loading

everdark commented Jul 27, 2020

jwijffels commented Jul 27, 2020 • edited Loading

jwijffels commented Jul 27, 2020 •

edited

Loading

jwijffels commented Jul 27, 2020 •

edited

Loading