Gensim: a generator is not an iterator
When using Gensim word2vec on a dataset stored in a database, I was pleased to see that the library accepts an iterator to represent the corpus, which makes it possible to process datasets bigger than memory. So I wrote a generator function to stream text directly from the database, and ran into a strange message:
TypeError: You can't pass a generator as the sentences argument. Try an iterator.
Looking at the Gensim code, this is intentional and there is a good reason for it: Gensim is fine with iterating over the dataset lazily, but it needs to iterate over it more than once (one pass to build the vocabulary, then one pass per training epoch). A generator can be consumed only once, and then it’s over.
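The one-shot behavior is easy to reproduce in plain Python. In this minimal sketch, `tokens_generator` is a stand-in that yields two hard-coded sentences rather than rows from a database:

```python
def tokens_generator():
    # Stand-in for a database stream: yields one tokenized sentence at a time.
    yield ["hello", "world"]
    yield ["good", "morning"]


generator = tokens_generator()

first_pass = list(generator)   # consumes every sentence
second_pass = list(generator)  # the generator is already exhausted

print(len(first_pass))   # 2
print(len(second_pass))  # 0
```

A second pass over the same generator object silently yields nothing, which is why a multi-pass training loop cannot work with it.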
So, I did this:
    class SentencesIterator():
        def __init__(self, generator_function):
            self.generator_function = generator_function
            self.generator = self.generator_function()

        def __iter__(self):
            # reset the generator
            self.generator = self.generator_function()
            return self

        def __next__(self):
            result = next(self.generator)
            if result is None:
                raise StopIteration
            else:
                return result
Maybe someone will find it useful.
This is the simplest wrapper I could write: it has an iterator interface but uses the generator under the hood. The generator function is stored as well, so the generator can be reset every time a new iteration starts. It can be used in Gensim like this:
    from gensim.models import FastText

    # [...]

    sentences = SentencesIterator(tokens_generator)
    model = FastText(sentences)
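If all that is needed is the ability to loop over the corpus several times, a shorter variant is possible. This is a sketch of an alternative, not the wrapper used above: `__iter__` returns a fresh generator on every call, so no `__next__` method is required. The name `RestartableSentences` and the sample `tokens_generator` are made up for illustration.

```python
class RestartableSentences:
    """Iterable that builds a new generator each time iteration starts."""

    def __init__(self, generator_function):
        self.generator_function = generator_function

    def __iter__(self):
        # Every for-loop (or training pass) gets its own fresh generator.
        return self.generator_function()


def tokens_generator():
    # Hypothetical stand-in for a stream of tokenized sentences.
    yield ["hello", "world"]
    yield ["good", "morning"]


sentences = RestartableSentences(tokens_generator)
first_pass = list(sentences)
second_pass = list(sentences)
print(first_pass == second_pass)  # True
```

Because the object itself is not a generator, it should get past the same type check, and two loops running at the same time would not interfere with each other since they do not share state.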
I used it to generate word embeddings and LSI indexes from a chat dataset stored in Postgres.
To lazily iterate over query results in Postgres I use Psycopg2 and named cursors.
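A generator function built on a named cursor could look roughly like the sketch below. It is not the code from this project: the function name `stream_sentences`, the cursor name, the batch size, and the whitespace tokenization are all assumptions, and `connection` is assumed to be an open Psycopg2 connection whose `query` selects a single text column.

```python
def stream_sentences(connection, query, batch_size=1000):
    """Yield one tokenized sentence per row returned by `query`.

    Assumes `connection` is an open psycopg2 connection and that `query`
    selects exactly one text column.
    """
    # Passing a name makes psycopg2 create a server-side cursor, so rows
    # stay in Postgres until they are requested.
    with connection.cursor(name="sentences_cursor") as cursor:
        cursor.itersize = batch_size  # rows fetched per network round trip
        cursor.execute(query)
        for (text,) in cursor:
            # Placeholder tokenization: split on whitespace.
            yield text.split()
```

Wrapped as `SentencesIterator(lambda: stream_sentences(connection, query))`, each pass over the corpus would open a new named cursor and run the query again.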
Complete code here.