How to Get Started with Data Science

4 min read · Jun 12, 2019

At its core, learning Data Science should begin with a solid theoretical foundation. Once you have a grasp of the fundamentals, get practical experience by implementing what you learned in theory or by applying the techniques to data science problems.

I cannot stress this enough: you will not get far, or appreciate the wide and continuously evolving areas of data science, unless you do it the hard way and really get into the math.

For that, you need to start strong (which means using the best resources, listed below), be patient (because it will not be easy), and keep going.

Machine Learning

Start with Andrew Ng’s Machine Learning course on Coursera.
This course gives you a map of what you need to know to build a good foundation. It touches the surface of a lot of techniques without going into much depth.
After a few weeks of Ng’s course (or alongside it), start Yaser Abu-Mostafa’s Machine Learning course taught at Caltech. It is available on YouTube!
This course goes to the right theoretical depth while giving you intuitions along the way. For beginners, there is no better course on Machine Learning than Abu-Mostafa’s.
Added bonus: he makes hilarious geeky jokes as well :D
Then pick a problem on Kaggle, e.g. the Titanic survival classification problem or the House Prices regression problem, and try building different models. It will give you a good practical understanding.
Then go for other topics based on your interest:
a) Deep Learning: Andrew Ng’s Deep Learning Specialization on Coursera is a good start.
b) NLP: Christopher Manning’s updated NLP course at Stanford is on YouTube and is one of the best resources.
c) CNN: CNN for Visual Recognition, taught by Fei-Fei Li and her PhD students, is pretty good.
d) Machine Learning Stanford lectures by Andrew Ng: a deep, math-heavy course, better watched after gaining some high-level familiarity and practical experience with ML techniques.
Keep participating in competitions on Kaggle and Analytics Vidhya (AV also has a lot of good tutorials on concepts, with code).

All this is going to take a lot of time, especially if you are already working. I don’t want to tell you that you will learn everything in 3 months, because even if you rush through the material, it will take time to sink in and for you to appreciate it.

Be realistic in expectations and give it time.

Since it can be daunting, I am giving you a list of things below that will keep you focused.

Core ML Techniques (Must Know)

The machine learning techniques you should absolutely know are:

Linear Regression
Logistic Regression
SVM (MIT’s SVM Lecture for Introduction, Andrew Ng’s Stanford course reading on SVM)
Decision Trees
Random Forests
Gradient Boosting Machines (GBM) (MIT’s Boosting Lecture for Introduction)
Singular Value Decomposition (SVD)
K-Means

Learn the math of these techniques in detail. Spend time understanding them well if you plan to pursue Data Science seriously. You might not use many of them in the beginning, but they contain a lot of ideas that are used everywhere, even in the latest research papers.
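To make "learning the math" concrete, here is a minimal sketch of the first technique on the list, linear regression with a single feature, written without any ML library. It uses the closed-form least-squares solution (slope = covariance of x and y divided by variance of x). The function name and the data points are made up for illustration.

```python
from statistics import fmean


def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```

Once the one-feature case makes sense, deriving the same result by setting the gradient of the squared error to zero is a good exercise before moving on to library implementations.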

Core Concepts (Must Know)

Concepts you should know are:

Bayes’ theorem
Likelihood
Underfitting/Overfitting — regularization
Bias-Variance
Cross-validation
Hyper-parameter optimization
A/B testing
Train/Validation/Test split
Out-of-time testing
Bagging vs Boosting
Outliers
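As one illustration from this list, cross-validation can be sketched in a few lines. The helper below is hypothetical (not taken from any library) and only produces the index splits: each sample lands in the validation set exactly once, and the model would be trained on the remaining indices each time. It does not shuffle, which is an assumption made to keep the sketch short; in practice you would usually shuffle first, except for time-ordered data.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx
        start = stop


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Averaging a metric over the folds gives a more stable estimate of model quality than a single train/validation split.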

Metrics / Loss functions (Must Know)

Log-loss or Binary Cross-entropy loss
Cross-entropy loss
MSE/RMSE
R-squared, Adjusted R-squared
Precision, Recall, F-Score
Accuracy
Entropy, Information Gain, Gini Impurity
MAE, MAPE
ROC-AUC
nDCG
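A few of these metrics are short enough to write by hand, which is a good way to check that you understand them. The sketch below computes precision, recall, F1 and binary log-loss for a made-up set of labels and predicted probabilities; the function names and the 0.5 decision threshold are illustrative choices, not a standard API.

```python
import math


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
loss = binary_log_loss(y_true, y_prob)
print(precision, recall, f1, loss)
```

Note that precision, recall and F1 depend on the chosen threshold, while log-loss is computed directly from the predicted probabilities.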

Statistics (Must Know)

At the very least you should know:

Distributions: Bernoulli, Binomial, Poisson, Uniform, Normal, t, chi-squared
Interpretation of p-value
Concept of hypothesis testing
z-test, two sample t-test, paired t-test, chi-squared test
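To tie hypothesis testing and the p-value together, here is a small sketch of a two-sided one-sample z-test using only the Python standard library. The numbers are made up, and the test assumes the population standard deviation is known. The p-value is the probability, assuming the null hypothesis is true, of seeing a test statistic at least as extreme as the observed one; it is not the probability that the null hypothesis is true.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_std, n):
    """Two-sided z-test of H0: the true mean equals pop_mean."""
    standard_error = pop_std / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Probability of a statistic at least this extreme in either tail under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up example: 100 observations with mean 103, tested against a
# hypothesized mean of 100 with known standard deviation 15.
z, p_value = one_sample_z_test(sample_mean=103.0, pop_mean=100.0, pop_std=15.0, n=100)
print(round(z, 2), round(p_value, 4))
```

Here z is 2.0 and the p-value is a little under 0.05, so at a 5% significance level the null hypothesis would be rejected. When the population standard deviation is unknown and estimated from the sample, a t-test is the appropriate replacement.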

Language / Tools

Python, R (Python is the preferred language now)
R libraries: data.table, dplyr, e1071, xgboost, gbm, H2O, ggplot2, randomForest, lightgbm, plus the built-in glm function, etc.
Python libraries: numpy, pandas, sklearn (most ML algorithms are available in sklearn), xgboost, lightgbm, matplotlib, etc.
Deep Learning: Keras in Python

This list should be sufficient to have a solid start in Data Science.

Hopefully, this gives you a picture of the arsenal you need to build in order to do well, and you can use it as your map to start learning the field. I must say, if you do it the hard way, you will enjoy the field a lot more. Otherwise, it will just be an exercise in tuning the parameters of a library until it works for you, and that will become annoying very soon. I will go back to my earlier suggestion:

Be realistic in expectations and give it time.

At some point, I will write a part 2 on how to get to the next level in Data Science. Till then, keep learning :)

Written by Abhay Shukla
