Gensim Word2Vec Reproducibility & Saving to S3

Abhay Shukla
1 min read · Jun 8, 2019


Assuming you already have data in the format gensim's Word2Vec expects, the following two lines of code are all you need: one to train reproducibly and one to save the model to S3.

Reproducibility

For reproducibility, set a seed and set workers to 1.

from gensim.models import Word2Vec
model = Word2Vec(train_data, seed=100, workers=1)

This makes training slower, since only a single worker thread is used.

In Python 3, you may also have to set the PYTHONHASHSEED environment variable to get reproducible results across interpreter launches.
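Hash randomization is fixed when the interpreter starts, so PYTHONHASHSEED has to be in the environment before Python launches; assigning it to os.environ inside an already running script is too late. The sketch below uses only the standard library to illustrate this: it starts fresh child interpreters with a fixed PYTHONHASHSEED and compares the hash they compute for the same string. The helper name string_hash_in_child and the seed value 0 are illustrative assumptions, not part of gensim.

```python
import os
import subprocess
import sys


def string_hash_in_child(text, hash_seed):
    """Return hash(text) as computed by a fresh interpreter
    started with PYTHONHASHSEED set to hash_seed."""
    # Copy the current environment and pin the hash seed for the child only.
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate interpreter launches with the same seed agree on the hash.
first = string_hash_in_child("word", 0)
second = string_hash_in_child("word", 0)
print(first == second)
```

The same idea applies to the training script itself: launch it with the variable already set, for example PYTHONHASHSEED=0 python train.py, where train.py is a hypothetical script name.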

Saving to S3

Saving to S3 is tricky. When the model is large, gensim splits it into part files, which as of now causes compatibility issues when saving to S3.

The solution is to set the separately parameter to [] and sep_limit to a very large number (4 GB in the code below).

model.save(model_write_path, separately=[], sep_limit=4294967296)

You can estimate a value for sep_limit by training a model and saving it locally to see how much space it needs, then setting sep_limit = safety_factor * estimated_size. A safety_factor in the range of 1.5–3 is reasonable, depending on your use case.

Note that storage requirements change with the embedding size: the larger the size, the more space is required.
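As a minimal sketch of that estimate, the helper below measures a model that has already been saved to local disk and multiplies its size by a safety factor. The function name estimate_sep_limit and the default factor of 2.0 are assumptions made for illustration. It also assumes that any part files gensim wrote share the main file's path as a prefix, so they are counted too.

```python
import glob
import os


def estimate_sep_limit(local_model_path, safety_factor=2.0):
    """Estimate sep_limit (in bytes) from a model previously saved locally."""
    # Match the main file plus any part files that start with the same path.
    # Assumption: part files are named by appending a suffix to the main path.
    paths = glob.glob(glob.escape(local_model_path) + "*")
    estimated_size = sum(os.path.getsize(p) for p in paths if os.path.isfile(p))
    return int(safety_factor * estimated_size)
```

The result could then be passed as sep_limit when saving, for example sep_limit=estimate_sep_limit("w2v.model"), where "w2v.model" is a hypothetical local path.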
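To get a feel for how the embedding size drives storage, here is a rough back-of-the-envelope calculation. It assumes 32-bit floats and two weight matrices of shape vocabulary size by embedding size (the word vectors plus one output-layer matrix), and it ignores the vocabulary itself and other metadata. The function name approx_embedding_bytes and the vocabulary size are hypothetical, so treat the numbers as an order-of-magnitude guide only.

```python
def approx_embedding_bytes(vocab_size, vector_size, bytes_per_value=4, num_matrices=2):
    """Rough size in bytes of the weight matrices of a word embedding model."""
    # Assumption: each matrix holds vocab_size * vector_size float32 values.
    return vocab_size * vector_size * bytes_per_value * num_matrices


# Compare two embedding sizes for a hypothetical vocabulary of one million words.
for vector_size in (100, 300):
    size_in_bytes = approx_embedding_bytes(vocab_size=1_000_000, vector_size=vector_size)
    print(vector_size, size_in_bytes)
```

Under these assumptions, tripling the embedding size from 100 to 300 triples the estimate from 800,000,000 to 2,400,000,000 bytes, which is why sep_limit should be revisited whenever the embedding size changes.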

