Gensim Word2Vec Reproducibility & Saving to S3

1 min read · Jun 8, 2019

Assuming you already have data in the format gensim requires to train a Word2Vec model, the two lines of code shown in the sections below are all you need: one for reproducible training and one for saving the model to S3.

Reproducibility

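For reproducibility, set a seed and set workers to 1. With more than one worker thread, the order in which updates are applied depends on OS thread scheduling, so results can differ between runs.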

from gensim.models import Word2Vec
model = Word2Vec(train_data, seed=100, workers=1)

This makes training slower, since only a single worker thread is used.

In Python 3 you may also have to set the PYTHONHASHSEED environment variable before the interpreter starts, because string hashing is otherwise randomized between runs.
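PYTHONHASHSEED only takes effect if it is set before the Python process starts; assigning it to os.environ inside an already running script does not change hashing for that process. Here is a minimal sketch to check this, using a child interpreter. The helper name string_hash_in_child is hypothetical and only there for illustration.

```python
import os
import subprocess
import sys


def string_hash_in_child(hash_seed: str, text: str = "word2vec") -> int:
    """Start a fresh interpreter with PYTHONHASHSEED set and return hash(text) from it."""
    # Copy the current environment and add the hash seed for the child process only.
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Two separate interpreter launches with the same seed agree on the string hash.
    print(string_hash_in_child("0") == string_hash_in_child("0"))
```

In practice this usually means launching training as something like PYTHONHASHSEED=0 python train.py, where train.py is a placeholder for your own training script.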

Saving to S3

Saving to S3 is tricky. When the model is large, gensim stores its big arrays in separate part files next to the main model file, and as of now this causes compatibility issues when saving to S3.

The solution is to set the separately parameter to [] and sep_limit to a very large number (4 GiB, i.e. 4 × 1024³ = 4294967296, in the code below), so that everything is written to a single file.

model.save(model_write_path, separately=[], sep_limit=4294967296)
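If writing directly to an S3 path still gives you trouble, one alternative (not what the line above does, just a sketch) is to save the single file locally and upload it afterwards. The function name save_single_file_and_upload is hypothetical. It assumes model has a gensim-style save method and s3_client behaves like a boto3 S3 client, whose upload_file takes a local filename, a bucket and a key.

```python
import os
import tempfile


def save_single_file_and_upload(model, s3_client, bucket, key, sep_limit=4 * 1024**3):
    """Save `model` as one local file, then upload that file to S3.

    Assumptions: `model.save` accepts `separately` and `sep_limit` (as gensim
    models do) and `s3_client.upload_file(filename, bucket, key)` uploads a
    local file (as a boto3 S3 client does).
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "model.bin")
        model.save(local_path, separately=[], sep_limit=sep_limit)

        # If part files were written anyway, fail loudly instead of
        # uploading an incomplete model.
        written = sorted(os.listdir(tmp_dir))
        if written != ["model.bin"]:
            raise RuntimeError(f"expected a single file, found: {written}")

        # Upload while the temporary directory still exists.
        s3_client.upload_file(local_path, bucket, key)
```

With real objects the call would look like save_single_file_and_upload(model, boto3.client("s3"), "my-bucket", "models/w2v.model"), where the bucket and key are placeholders.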

You can estimate a value for sep_limit by training a model and saving it locally to see how much space it needs, then setting sep_limit = safety_factor * estimated_size. A safety_factor in the range 1.5–3 should work, depending on your use case.
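The estimate above can be sketched in a few lines. The helper name estimate_sep_limit is hypothetical. It adds up the main model file and any part files, assuming part files share the model path as a prefix (which is how gensim names them, as far as I know).

```python
import glob
import os


def estimate_sep_limit(model_path: str, safety_factor: float = 2.0) -> int:
    """Return safety_factor * total on-disk size (bytes) of a locally saved model."""
    # Matches the main file plus anything that starts with the same path,
    # e.g. part files. Unrelated files with the same prefix would be counted
    # too, so save into an otherwise empty directory.
    paths = glob.glob(glob.escape(model_path) + "*")
    estimated_size = sum(os.path.getsize(p) for p in paths)
    return int(safety_factor * estimated_size)
```

For example, if the locally saved files add up to 1 GiB, a safety_factor of 2.0 gives a sep_limit of 2 GiB.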

Note that storage requirements change with the embedding size: the larger the embedding size, the more space is required.
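To get a feel for how space grows with embedding size, here is a rough back-of-the-envelope sketch. It assumes the weights are float32 (4 bytes per value) and that the model keeps two matrices of shape vocabulary size × embedding size (word vectors plus output layer weights). Both are assumptions about a typical setup, not something measured here, and the vocabulary itself is ignored. The function name is hypothetical.

```python
def estimate_array_bytes(vocab_size: int, vector_size: int,
                         n_matrices: int = 2, bytes_per_value: int = 4) -> int:
    """Rough size in bytes of the weight matrices, under the assumptions above."""
    return vocab_size * vector_size * n_matrices * bytes_per_value


if __name__ == "__main__":
    # Tripling the embedding size triples the estimate.
    print(estimate_array_bytes(1_000_000, 100))  # 800000000
    print(estimate_array_bytes(1_000_000, 300))  # 2400000000
```

Under these assumptions the size scales linearly with the embedding size, so a sep_limit that was comfortable for 100 dimensions may be too small for 300.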

Written by Abhay Shukla
