How to Get Started with Data Science

Abhay Shukla
4 min read · Jun 12, 2019
Source: manypixels.co

At its core, learning Data Science should begin with a solid theoretical foundation. Once you have a grasp of the fundamentals, get practical experience by implementing what you learned in theory or by applying the techniques to data science problems.

I cannot stress this enough: you will not be able to go far, or appreciate the wide and continuously evolving areas of data science, unless you do it the hard way and really get into the math of it.

For that, you need to start strong (which means using the best resources, listed below), be patient (because it will not be easy), and keep going.

Machine Learning

  1. Start with Andrew Ng’s Machine Learning course on Coursera.
    This course will give you a map of the things you need to know to build a good foundation. It touches the surface of a lot of techniques without getting into too much depth.
  2. After watching a few weeks of Ng’s course (or alongside it), start Yaser Abu-Mostafa’s Machine Learning course taught at Caltech. It’s available on YouTube!
    This course will take you to the right amount of theoretical depth while giving you intuitions along the way. For starters, there is no better course on Machine Learning than Abu-Mostafa’s.
    Added bonus: he makes hilarious geeky jokes as well :D
  3. Then pick a problem on Kaggle, e.g. the Titanic survival classification problem or the House Prices regression problem, and try to build different models. It will give you a good practical understanding.
  4. Then go for other topics based on your interest:
    a) Deep Learning: Andrew Ng’s Deep Learning Specialization on Coursera is a good start.
    b) NLP: Christopher Manning’s updated NLP course at Stanford is on YouTube and is one of the best resources.
    c) CNN: CNN for Visual Recognition, taught by Fei-Fei Li and her PhD students, is pretty good.
    d) Machine Learning Stanford lectures by Andrew Ng: a deep, math-heavy course, better watched after gaining some high-level familiarity and practical experience with ML techniques.
  5. Keep participating in competitions on Kaggle and AnalyticsVidhya (AV also has a lot of good tutorials on concepts with code).
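
To make step 3 concrete, here is a rough sketch of what "try to build different models" could look like with sklearn. It is only an illustration under assumptions: the data is synthetic (generated with make_classification as a stand-in for a Kaggle table, not the real Titanic data), and the three models and their settings are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a Kaggle dataset (assumption: not real competition data).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Candidate models; the hyper-parameters here are illustrative, not tuned.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}


def compare_models(models, X, y, folds=5):
    """Return the mean cross-validated accuracy of each model."""
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


scores = compare_models(models, X, y)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

On a real competition you would replace the synthetic X and y with features you engineered from the provided training file, then ask why one model beats another.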

All this is going to take a lot of time, especially if you are already working. I don’t want to tell you that you will learn everything in 3 months, because even if you rush through the material, it will take time to sink in and for you to appreciate it.

Be realistic in expectations and give it time.

Source: undraw.co

Since it can be daunting, I am giving you a list of things below that will keep you focused.

Core ML Techniques (Must Know)

The machine learning techniques you should absolutely know are:

  1. Linear Regression
  2. Logistic Regression
  3. SVM (MIT’s SVM Lecture for Introduction, Andrew Ng’s Stanford course reading on SVM)
  4. Decision Trees
  5. Random Forests
  6. Gradient Boosting Machines (GBM) (MIT’s Boosting Lecture for Introduction)
  7. Singular Value Decomposition (SVD)
  8. K-Means

Learn the math of these techniques in detail. Spend time to understand these well if you plan to pursue Data Science seriously. You might not end up using many of these in the beginning but a lot of ideas are hidden in these techniques which are used everywhere, even in the latest research papers.
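
As one small example of what getting into the math can mean, the sketch below fits Linear Regression two ways with numpy: the closed-form least-squares solution, and gradient descent on the mean squared error. The data, the true weights, the learning rate and the number of steps are all made up for illustration.

```python
import numpy as np

# Made-up data: y = 2*x1 - 3*x2 + 0.5 + small noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([2.0, -3.0]), 0.5
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is learned as one more weight.
Xb = np.hstack([X, np.ones((len(X), 1))])

# Closed form: least-squares solution of Xb @ w = y.
w_closed, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Gradient descent on MSE: gradient = (2/n) * Xb.T @ (Xb @ w - y).
w_gd = np.zeros(Xb.shape[1])
learning_rate = 0.1
for _ in range(500):
    residual = Xb @ w_gd - y
    w_gd -= learning_rate * (2.0 / len(y)) * (Xb.T @ residual)

print("closed form:     ", np.round(w_closed, 3))
print("gradient descent:", np.round(w_gd, 3))
```

Both routes should land on nearly the same weights. Working out why (and when gradient descent would fail to converge) is the kind of understanding that carries over to the other techniques in the list.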

Core Concepts (Must Know)

Concepts you should know are:

  1. Bayes’ theorem
  2. Likelihood
  3. Underfitting/Overfitting — regularization
  4. Bias-Variance
  5. Cross-validation
  6. Hyper-parameter optimization
  7. A/B testing
  8. Train/Validation/Test split
  9. Out-of-time testing
  10. Bagging vs Boosting
  11. Outliers
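
To give a feel for one item on this list, here is a minimal sketch of how k-fold cross-validation splits the data, written in plain Python. The function name k_fold_indices and its parameters are my own for illustration; in practice you would likely use the splitters that sklearn provides.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), "train /", len(val_idx), "validation")
```

Every sample lands in exactly one validation fold, so each one is used for validation once and for training in the remaining rounds.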

Metrics / Loss functions (Must Know)

  1. Log-loss or Binary Cross-entropy loss
  2. Cross-entropy loss
  3. MSE/RMSE
  4. R-Square, Adjusted R-Square
  5. Precision, Recall, F-Score
  6. Accuracy
  7. Entropy, Information Gain, Gini Impurity
  8. MAE, MAPE
  9. ROC-AUC
  10. nDCG
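
A few of these are short enough to write by hand, which is a good way to check that you understand them. The sketch below implements log-loss, RMSE, and precision/recall/F-score in plain Python. The function names and the tiny label lists are mine, chosen only for illustration.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative labels and predictions.
labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]
print(precision_recall_f1(labels, predictions))
print(log_loss([1, 0], [0.5, 0.5]))
print(rmse([1, 2, 3], [1, 2, 5]))
```

Note that a model predicting 0.5 for everything gets a log-loss of ln 2, which is a handy baseline to remember for binary classification.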

Statistics (Must Know)

At the very least you should know:

  1. Distributions: Bernoulli, Binomial, Poisson, Uniform, Normal, t, Chi-squared
  2. Interpretation of p-value
  3. Concept of hypothesis testing
  4. z-test, two sample t-test, paired t-test, chi-squared test
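
As a small worked example of hypothesis testing, the sketch below runs a two-sided z-test for the difference between two proportions, the kind of test you might use to read an A/B experiment. It uses only the Python standard library. The function name and the conversion counts are hypothetical, and the normal approximation it relies on assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for H0: both groups share the same success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Under H0 the best estimate of the common rate pools both groups.
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # p-value: probability of a |z| at least this large if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical A/B test: 100/1000 conversions in A, 130/1000 in B.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The p-value is not the probability that the null hypothesis is true. It is the probability of seeing a difference at least this extreme if there were no real difference between the groups.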

Language / Tools

  1. Python, R (Python is the preferred language now)
  2. R Libraries — data.table, dplyr, glm, e1071, xgboost, gbm, H2O, ggplot2, randomForest, lightgbm etc.
  3. Python Libraries — numpy, pandas, sklearn (most ML algorithms are available in sklearn), xgboost, lightgbm, matplotlib etc.
  4. Deep Learning — Keras in Python

This list should be sufficient to have a solid start in Data Science.

Hopefully, this gives you a picture of the arsenal you need to build in order to do well, and you can use it as your map to start learning the field. I must say, if you do it the hard way, you will start enjoying the field a lot more. Otherwise, it will just be an exercise in tuning the parameters of a library until it works for you, and that will become annoying very soon. I will go back to my earlier suggestion:

Be realistic in expectations and give it time.

At some point, I will write a part 2 of this on how to go to the next level in Data Science. Till then, keep learning :)

Source: giphy.com


Written by Abhay Shukla

Data Science @ Meesho, Ex- Airtel, Swiggy, [24]7.ai https://www.linkedin.com/in/shuklaabhay/ #DataScience #ML #AI #Statistics #Reading #Music #Running
