Multilabel Classification Dataset Loading #6
Hi @angrymeir! What do you think about having two separate classes for loading datasets from disk? One for "standard" single-label datasets and another one for multilabel datasets. Do you think we should provide support for another format/structure too? For instance, having a file holding document name and category label pairs, like so:
And a folder containing the actual documents. In this case, we should let the user specify the file where these pairs are stored (and also the separator/delimiter used: tab? comma? etc.) and the path to the folder holding the actual documents. The same should apply to your approach: the user should be able to provide the separator for the labels in the labels.txt file, which in your case is a semicolon (;). What do you think the default values for these separators should be?
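To make this alternative format concrete, here is a minimal sketch of such a loader. Everything in it is hypothetical and only for illustration: the file names, the tab delimiter, the `label_sep` default, and the `load_pairs_dataset` helper are not part of PySS3.

```python
import os

def load_pairs_dataset(pairs_path, docs_dir, delimiter="\t", label_sep=";"):
    """Load a dataset described by a (document name, labels) pairs file.

    Each line of the pairs file looks like: "doc_0001.txt<TAB>sports;politics".
    The actual document text lives in `docs_dir/doc_0001.txt`.
    """
    docs, labels = [], []
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            doc_name, label_str = line.split(delimiter, 1)
            with open(os.path.join(docs_dir, doc_name), encoding="utf-8") as doc_f:
                docs.append(doc_f.read())
            labels.append(label_str.split(label_sep))
    return docs, labels
```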
Could you recommend any particular dataset to work with while implementing full multilabel support? This dataset will also be the one used for the tutorial introducing multilabel support, similar to the tutorials that are already available. I'm currently using a Kaggle dataset for toxic comment classification.
I just realized we would need two
What do you think about that?
Hey @sergioburdisso,
**Format/Structure**
This would imply that there were 2^6 = 64 possible label combinations (in the toxic comment dataset), which seems just not feasible to annotate as separate categories... Giving the user the option to specify both delimiters makes absolute sense! I also agree about the default parameters.

**Dataset**
😊 Following your suggestion, I've added a method called `load_from_files_multilabel` to carry out this task, supporting both dataset structures/formats. I've decided to put "multilabel" at the end of the name to stay consistent with `load_from_files`. Now, following your example, you should be able to load your dataset simply by:
In case you need a different separator for the labels, for instance commas, you could use the corresponding optional argument:
And, finally, in case you need to use a document separator other than the newline character, there is an argument for that as well:
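Since the original snippets did not survive here, the following standalone sketch illustrates the linewise loading behavior described above: documents split on a configurable document separator, and one label set per document split on a configurable label separator. This only mimics the idea; it is not PySS3's implementation, and the argument names `sep_label` and `sep_doc` are assumptions (check the API documentation for the real signature).

```python
def load_multilabel_linewise(docs_text, labels_text, sep_label=";", sep_doc="\n"):
    """Split raw file contents into parallel (documents, label lists).

    docs_text: documents separated by `sep_doc`.
    labels_text: one line per document; labels separated by `sep_label`.
    """
    docs = [d.strip() for d in docs_text.split(sep_doc) if d.strip()]
    labels = [ln.strip().split(sep_label)
              for ln in labels_text.splitlines() if ln.strip()]
    if len(docs) != len(labels):
        raise ValueError("each document needs exactly one label line")
    return docs, labels

# Default separators (';' for labels, newline for documents)
docs, labels = load_multilabel_linewise("doc one\ndoc two",
                                        "sports\npolitics;economy")

# Commas for labels instead of semicolons
docs2, labels2 = load_multilabel_linewise("doc one\ndoc two",
                                          "sports\npolitics,economy",
                                          sep_label=",")

# A custom document separator instead of the newline character
docs3, labels3 = load_multilabel_linewise("doc one<end>doc two",
                                          "sports\npolitics;economy",
                                          sep_doc="<end>")
```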
More details are given in the API documentation. 👍
A new method (``load_from_files_multilabel``) was added to the ``Dataset`` class to load multilabel datasets from disk. More details about this new method can be found in the API documentation (https://pyss3.rtfd.io/en/latest/api/index.html#pyss3.util.Dataset.load_from_files_multilabel). Resolves: #6
The dataset is a subset of the CMU Movie Summary Corpus (http://www.cs.cmu.edu/~ark/personas/) with 32,985 summaries and only 10 movie genres. The dataset is structured as described in #6, i.e., there are two files, one for the labels and another for the movie plot summaries.
PySS3 now fully supports multi-label classification! :)

- The ``load_from_files_multilabel()`` function was added to the ``Dataset`` class (7ece7ce, resolved #6)
- The ``Evaluation`` class now supports multi-label classification (#5)
- Add multi-label support to ``train()/fit()`` (4d00476)
- Add multi-label support to ``Evaluation.test()`` (0a897dd)
- Add multi-label support to ``show_best()`` and ``get_best()`` (ef2419b)
- Add multi-label support to ``kfold_cross_validation()`` (aacd3a0)
- Add multi-label support to ``grid_search()`` (925156d, 79f1e9d)
- Add multi-label support to the 3D Evaluation Plot (42bbc65)
- The Live Test tool now supports multi-label classification as well (15657ee, b617bb7, resolved #9)
- Category names are no longer case-insensitive (4ec009a, resolved #8)
Hey @sergioburdisso,
for multilabel classification, the file structure described in the topic categorization tutorial is not efficient, since text associated with multiple labels has to be stored in multiple files.
My current approach is to write the documents to one file, line by line, and the respective labels to another file, also line by line.
The result is the following:
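For illustration, here is a small sketch that produces that two-file layout on disk. The example texts and labels are made up; the semicolon label separator follows the convention used in this thread.

```python
import os
import tempfile

# Line i of labels.txt holds the labels for line i of docs.txt;
# multiple labels on one line are separated by ';'.
docs = [
    "The match ended in a dramatic penalty shootout.",
    "New tax legislation passed the senate today.",
]
labels = ["sports", "politics;economy"]

out_dir = tempfile.mkdtemp()  # stand-in for the dataset folder
with open(os.path.join(out_dir, "docs.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(docs))
with open(os.path.join(out_dir, "labels.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(labels))
```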
It would be great if `util.Dataset.load_from_files` could be adjusted to also support this! But I'm also open to other suggestions on how to tackle this problem :)
Thanks for your hard work!