Phrases and Phraser allow a generator corpus #1099

Merged: 3 commits merged into piskvorky:develop on Jan 27, 2017

Conversation

@ELind77 (Contributor) commented Jan 20, 2017

Allow Phrases and Phraser models to take a generator
function/expression as input to the transformation method. Previously,
only indexable iterables could be used which is problematic for large
corpora.

Add additional tests to test_phrases using a generator as input.

I was trying to build a Phrases model with a corpus bigger than 1GB. My corpora are always generators and usually that's fine with gensim but it didn't work this time so I fixed it. The fix is basically just adding a function in phrases.py to do some more checking as to the type of the input. It also accommodates a generator of generators in case each document/token is undergoing some kind of transformation as it's pulled through.

This PR does still need a small modification to the docstrings in Phraser and Phrases in order to indicate that the input type can be a generator and not just a list, but my rst isn't all that great. Maybe if someone reviews this I could get a pointer on that?
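
For context, here is a minimal usage sketch of what this change enables; the corpus file name and the tokenization are placeholders, not part of the PR:

from gensim.models.phrases import Phrases, Phraser

def gen_sentences():
    # Stream tokenized sentences from disk instead of materializing a list.
    # 'corpus.txt' is a placeholder: one whitespace-tokenized sentence per line.
    with open('corpus.txt') as fin:
        for line in fin:
            yield line.split()

# Training iterates over the sentences once, so a generator works there as well.
bigram = Phrases(gen_sentences(), min_count=5, threshold=10.0)
phraser = Phraser(bigram)

# With this change, the transformation ([] lookup) also accepts a one-shot
# generator and lazily yields phrased sentences, so the corpus never has to
# fit in memory.
for phrased_sentence in phraser[gen_sentences()]:
    pass  # process each phrased sentence as it streams by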

@tmylk (Contributor) left a review comment:

What is the error that the code produces when given a generator? The call to _apply used here is pretty standard in Gensim. Interested to know what is particular about Phrases.

bigram2_seen = False

for s in self.bigram[gen_sentences()]:
    if not bigram1_seen and 'response_time' in s:
Review comment (Contributor):

What would be lost if this were just: assert 'response_time' in self.bigram[gen_sentences()]?

@ELind77 (Contributor, Author):

This is a cop-out, but I just copied the format of the existing test :P

To streamline that section, you could do:

assert len(set(['response_time', 'graph_minors']).intersection(set(it.chain.from_iterable(self.bigram[gen_sentences()])))) == 2

However, the current formulation in both my test and testBigramConstruction short-circuits (sketched below) so that it doesn't have to go through the entire input. This would allow longer test corpora in the future.

In either case, I didn't want to question or rethink the work of whoever designed the tests originally, I just wanted to make sure my changes were non-breaking. Do you want me to change the tests?
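
For reference, the short-circuiting formulation being described looks roughly like this (reconstructed from the excerpt above; the exact variable names in test_phrases.py may differ):

bigram1_seen = False
bigram2_seen = False
for s in self.bigram[gen_sentences()]:
    if not bigram1_seen and 'response_time' in s:
        bigram1_seen = True
    elif not bigram2_seen and 'graph_minors' in s:
        bigram2_seen = True
    if bigram1_seen and bigram2_seen:
        break  # stop as soon as both expected bigrams have been observed
self.assertTrue(bigram1_seen and bigram2_seen)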

@tmylk (Contributor) commented Jan 20, 2017

Would it be easier to modify _apply so that other methods can benefit from this as well?

@gojomo (Collaborator) commented Jan 20, 2017

I like the goal and thank you for including tests! But I'm concerned this is a bit of a convoluted approach: it's got twisty conditionals, a catch-all exception handler, and is named like a predicate function (is_…) but returns a tuple.

Maybe it has to be, to continue to offer phrase-ification via __getitem__() index-lookup for input of multiple shapes. The existing is_single test to achieve that is already kind of clunky, and the idea of treating phrase-ification as if it were a []-keyed lookup has always been non-intuitive to me, though I know some other parts of the gensim API work similarly.

In the interest of explicitness/simplicity, what if phrasify_one(sentence) and phrasify_many(sentences) were explicit parts of the API, so the user of the code indicates what they expect done, rather than leaving it to a bunch of type-checking?

If a __getitem__() accept-and-adapt-to-whatever entry point remains, I believe that peeking at the first item of the argument (which should always be an iterable of some sort) should be enough. If it's a string, treat it as a single text; if it's anything else (generator or not), consider it an iterable-of-iterables. For example:

def __getitem__(self, sentence_or_sentences):
    sent_iter = iter(sentence_or_sentences)
    try:
        peek = sent_iter.next()
        sentence_or_sentences = it.chain([peek], sent_iter)
    except StopIteration:
        return []
    if isinstance(peek, string_types):
        return phrasify_one(sentence_or_sentences)
    else:
        return phrasify_many(sentence_or_sentences)

@ELind77 (Contributor, Author) commented Jan 20, 2017

@tmylk I don't think there's any problem with _apply. The issue is in the check where Phrases/Phraser decides whether the input is a single document or an entire corpus.

@gojomo I think you're right, it's convoluted; I was more concerned with changing as little of the existing code as possible than with making the new approach clear. I think your proposed peek with phrasify_one and phrasify_many is the better approach. Is there any argument for trying to unify Phrases and Phraser? The phrasify_one code would be so close, but not quite close enough to actually be easy to do well.

Also, should the docstrings for __getitem__ be modified to indicate the more versatile input types? E.g. here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L231

Last question: should I merge my branch back into my fork's develop branch?

@tmylk (Contributor) commented Jan 20, 2017

@ELind77 This PR won't be merged as it is. Waiting for an improvement with phrasify_one/many.

@piskvorky (Owner) commented Jan 21, 2017

My suggestion would be to only have one transform() function, which accepts a stream of multiple items (here, sentences). Calling it on one sentence is just transform([one_sentence]), no need to clutter the API.

And yes, we could offer a similar function (called transform to keep in sync with sklearn terminology) in other modules too, in addition to the existing [ ] notation.

Questions around separating model creation vs. model use (model static but more efficient in memory/CPU) seem to crop up a lot. We did something similar with word2vec, where we introduced KeyedVectors for the same reason. Something for @tmylk to think about -- do we need a new API to make this concept clean and consistent across gensim?
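
To illustrate that suggestion, a minimal sketch (transform() is not part of this PR; model stands for a trained Phrases or Phraser instance):

def transform(model, sentences):
    # sklearn-style transform(): take a stream of token lists and lazily
    # yield each sentence with its detected phrases joined.
    # model[sentence] is the existing single-sentence [] lookup.
    for sentence in sentences:
        yield model[sentence]

# A single sentence is then just a one-element stream:
# phrased = next(transform(bigram, [one_sentence]))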

@ELind77 (Contributor, Author) commented Jan 21, 2017

After going back and looking at the code again I don't think it makes sense to change the public API. I think it's better and more consistent to keep the API exactly as it is.

I think the main issue @gojomo pointed out is the complexity of _is_single. In order to improve that I took a look at how TfidfModel does it and was pleasantly surprised when I discovered that utils.is_corpus behaves in exactly the same way as my _is_single function. I think we can resolve the problems pointed out above by simplifying _is_single (I can make it public too if that's preferable) and giving it a contract and documentation like is_corpus has in the utilities.

A simple modification to @gojomo's code, with some documentation inspired by is_corpus, gives this:

def _is_single(obj):
    """
    Check whether `obj` is a single document or an entire corpus.
    Returns (is_single, new) 2-tuple, where `new` yields the same
    sequence as `obj`.

    `obj` is a single document if it is an iterable of strings.  It
    is a corpus if it is an iterable of documents.
    """
    obj_iter = iter(obj)
    try:
        peek = obj_iter.next()
        obj_iter = it.chain([peek], obj_iter)
    except StopIteration:
        # An empty object is a single document
        return True, obj
    if isinstance(peek, string_types):
        # It's a document, return
        return True, obj_iter
    else:
        return False, obj_iter

What do you guys think? I think the naming could use some tweaking, but it is exactly the same problem and solution that is_corpus solves for raw inputs instead of transformed corpora.
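
For illustration, a rough sketch of how __getitem__ could consume this 2-tuple (not necessarily the exact merged code; _apply is the existing gensim helper that returns a lazily transformed corpus):

def __getitem__(self, sentence):
    is_single, sentence = _is_single(sentence)
    if not is_single:
        # An entire corpus: hand it off to _apply, which returns a lazily
        # transformed stream instead of a fully realized list.
        return self._apply(sentence)
    # ... the existing single-sentence phrase-detection logic continues here ...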

@tmylk (Contributor) commented Jan 24, 2017

@ELind77 Looks concise. Would you like to update the PR with that version?
Changing the return order to (is_single, sentence) is also a good idea, to match is_corpus.

Adding a test for self.bigram[[]] would be good too.

Commit: The _is_single function in phrases.py is now simpler and has the same contract as the is_corpus function in utils.
@ELind77 (Contributor, Author) commented Jan 24, 2017

@tmylk I've pushed the changes to _is_single. What were the changes you wanted me to make to the tests? My original PR adds two tests that repeat some of the existing test functions with a generator input instead of lists.

Also, what did I break in the compatibility tests? I gather I did something that Python 3.x doesn't like, but I'm not sure what.

@ELind77 (Contributor, Author) commented Jan 24, 2017

I took a look at the error from Travis and it looks like one of my tests is failing on Python 3.x, but I'm not sure why. Something different in the way 3.x handles generator expressions, perhaps?

@tmylk (Contributor) commented Jan 24, 2017

next is different between Python 2 and 3

from six import next
next(g)

And an explicit test that no output and no exception is produced for an empty input [] is needed: self.bigram[[]]
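
A minimal sketch of such a test (the method name is an assumption, not necessarily what was committed):

def testEmptyInputsOnBigramConstruction(self):
    """An empty input should produce empty output and no exception."""
    self.assertEqual(list(self.bigram[[]]), [])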

@ELind77 (Contributor, Author) commented Jan 26, 2017

Ok, changes made and test added.

I also removed the TODO comments in phrases.py concerning the docstrings as I think the existing module documentation already implies that the phrase models are consistent with iterators.

@tmylk merged commit 34759be into piskvorky:develop on Jan 27, 2017
@tmylk (Contributor) commented Jan 27, 2017

@ELind77 Thanks for the PR!
