Data set requires multiple #7

Open
diziet opened this issue Mar 26, 2019 · 3 comments
Labels
bug Something isn't working

Comments

@diziet

diziet commented Mar 26, 2019

I tried a simple dataset to play around with this, and I am running into

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

I believe this is because scikit-learn needs every value of the variable you're trying to predict to appear at least twice in the data. Might want to add that to the docs.
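For anyone who wants to reproduce the error outside automl_gs, here is a minimal sketch. It assumes the split in the traceback is scikit-learn's `StratifiedShuffleSplit` (the `_iter_indices` frame suggests that, but I have not checked the pipeline code). The helper name `try_stratified_split` and the label lists are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def try_stratified_split(y):
    """Return None if one stratified split succeeds, else the error message."""
    y = np.asarray(y)
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
    try:
        # split() is a generator, so the check only runs once we ask for a split.
        next(splitter.split(np.zeros(y.shape[0]), y))
    except ValueError as exc:
        return str(exc)
    return None


# Every class appears at least twice: the split succeeds.
ok = try_stratified_split([0, 0, 0, 0, 1, 1, 1, 1])

# Class 2 appears only once: scikit-learn refuses to stratify.
bad = try_stratified_split([0, 0, 0, 0, 1, 1, 1, 2])

print(ok)
print(bad)
```

If the target column holds a distinct value in every row and is treated as a class label, every "class" has exactly one member, which would trigger the same check.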

Input:
square.txt

Full log if needed:

>$ automl_gs square.csv square
/usr/local/lib/python3.7/site-packages/automl_gs/utils_automl.py:270: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  metrics = yaml.load(f)
Solving a classification problem, maximizing accuracy using tensorflow.

Modeling with field specifications:
real: numeric
fake1: numeric
fake2: numeric
fake3: numeric
fake4: numeric
text: categorical
bool: categorical
/usr/local/lib/python3.7/site-packages/automl_gs/utils_automl.py:126: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  hps = yaml.load(f)
  0%|          | 0/100 [00:00<?, ?trial/s
/usr/local/lib/python3.7/site-packages/automl_gs/utils_automl.py:199: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  metrics = yaml.load(f)[problem_type]
Traceback (most recent call last):
  File "model.py", line 47, in <module>
    model_train(df, encoders, args, model)
  File "../automodel/automl_train/pipeline.py", line 408, in model_train
    for train_indices, val_indices in split.split(np.zeros(y.shape[0]), y):
  File "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1315, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1695, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
          | 0/20 [00:00<?, ?epoch/s]
Traceback (most recent call last):
  File "/usr/local/bin/automl_gs", line 10, in <module>
    sys.exit(cmd())
  File "/usr/local/lib/python3.7/site-packages/automl_gs/automl_gs.py", line 175, in cmd
    tpu_address=args.tpu_address)
  File "/usr/local/lib/python3.7/site-packages/automl_gs/automl_gs.py", line 87, in automl_grid_search
    "metadata", "results.csv"))
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'automl_train/metadata/results.csv' does not exist: b'automl_train/metadata/results.csv'
@minimaxir
Owner

minimaxir commented Mar 26, 2019

Judging from the provided dataset, the issue is that automl_gs is treating this as a classification problem instead of a regression problem. Apart from that, I don't believe the error itself is wrong.

I removed problem_type as a parameter since I thought the heuristic was fine for that; I was apparently wrong. I may need to refine it.

In the meantime, if the square field is a float (its values have a decimal point), it should work.
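A rough sketch of that workaround with pandas, assuming the target column is named `square` as in the command from the log. The sample rows are invented, and whether the heuristic then picks regression is not verified here.

```python
import io

import pandas as pd

# Hypothetical stand-in for square.csv; only the column names come from the log.
csv_text = "real,square\n1,1\n2,4\n3,9\n"

df = pd.read_csv(io.StringIO(csv_text))

# Integer targets are read as int64; cast them so they are written with a decimal point.
df["square"] = df["square"].astype(float)

# to_csv() without a path returns the CSV text; pass a filename to write it to disk.
out = df.to_csv(index=False)
print(out)
```

Writing the result back to a file and passing that file to automl_gs would then hand it a float-valued target.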

@avinash-mishra

I haven't gone through the code yet, but I think this could also be a stratification problem. I have seen this kind of error when using stratify inside train_test_split.
I'll check and update.
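To illustrate the stratification angle, here is one possible guard: count the members of each class first and only pass `stratify` to `train_test_split` when every class has at least two. This is only a sketch, not what automl_gs does; `safe_split` and the toy data are hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split


def safe_split(X, y, test_size=0.5, seed=0):
    """Split X and y, stratifying only when every class has at least 2 members."""
    can_stratify = min(Counter(y).values()) >= 2
    parts = train_test_split(
        X,
        y,
        test_size=test_size,
        random_state=seed,
        stratify=y if can_stratify else None,
    )
    return parts, can_stratify


X = list(range(8))

# Class 2 has a single member, so the guard falls back to a plain shuffle split.
(X_train, X_test, y_train, y_test), stratified = safe_split(X, [0, 0, 0, 0, 1, 1, 1, 2])

# Both classes have four members, so stratification is kept.
_, stratified_ok = safe_split(X, [0, 0, 0, 0, 1, 1, 1, 1])

print(stratified, stratified_ok, len(X_train), len(X_test))
```

Dropping stratification silently changes how classes are distributed across the splits, so a warning in the fallback branch might be worth adding.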

@germanjoey

Allowing problem_type as a parameter would be much appreciated.
