How to become a data scientist?
The Google search engine processes close to three and a half billion searches every day. The question in bold italics above is, without any doubt, one of the significant contributors to that number… :)
With data science being the buzzword in the market, all of us are naturally interested in knowing more about it. Some working professionals are putting in hard work to acquire the skills they need on their resume for it to qualify as a “data scientist resume”, according to their “data scientist friends”.
Having close to a decade of industry experience in data science, and having spent a significant part of it training and mentoring people in the field, I also get these queries from all over.
When it comes to learning data science, I do not believe in a “one size fits all” model. The reason is that your learning journey into data science should be personalized, depending on your current skill set. To give an example: if someone with a master's degree in statistics asks me for a learning path, I would advise them to get their hands dirty with programming, coding, databases, SQL and so on. On the other hand, if a computer science graduate asks me the same question, I would advise them to get a good grasp of statistics, mathematics, hypothesis testing, probability theory and so on.
However, before you get to the step of deciding how to learn data science, my suggestion is to ask yourself a few questions that will help you understand whether data science is for you.
The very first thing to check is whether there is any connection between what you do currently and what happens in the data science, analytics and machine learning space.
Assuming you are starting from zero, let me put it simply: “Machine learning is a way to make machines learn from data.” For this learning to happen, both data and methodologies are needed.
You can refer to this YouTube video from Andrew Ng to better understand machine learning:
Lecture 1.1 — Introduction What Is Machine Learning — [ Machine Learning | Andrew Ng ]
To summarize this part: as a data scientist, your life will revolve around “data” and the “methods” used to make machines learn. Hence, if you aspire to be a data scientist, your affinity for data and coding should be high.
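To make the “data” plus “method” idea concrete, here is a minimal sketch in Python. The numbers are made up purely for illustration, and the method (a one-nearest-neighbour classifier, which gives a new point the label of the closest known example) is just one simple choice of mine, not the only way to do it.

```python
import math

# Hypothetical toy data: (hours studied, hours slept) -> exam result.
training_data = [
    ((1.0, 4.0), "fail"),
    ((2.0, 5.0), "fail"),
    ((6.0, 7.0), "pass"),
    ((8.0, 6.0), "pass"),
]


def predict(features, examples):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(
        examples,
        key=lambda example: math.dist(example[0], features),
    )
    return nearest_label


# The "machine" has no hard-coded rule; it answers from the examples it was given.
print(predict((7.0, 7.0), training_data))
```

If you feed it different examples, its answers change without any change to the code. That is the sense in which the machine learns from data.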
As for the learning path, it has to be personalized. However, I am more than happy to offer a generic structure that can help people kick off their journey.
One important thing here is to cover, in breadth, the areas listed below:
- SQL: This is one of the most important skills to have if you want to become a data scientist. Improving your SQL, or learning it from scratch, takes effort. There are a lot of websites where you can run SQL queries and practice. w3schools is one of my favorites for beginners, though there are many more. The link for w3schools is here. If you consider yourself a level above beginner, you can install any RDBMS on your computer and play around with data sets. The link for MySQL, a good open source RDBMS, is here. This installer will help you install all the needed components.
- Coding/Algorithms: You may come from a coding or a non-coding background. Either way, the R language needs to be part of your data science resume. For people from a non-coding background, the good news is that R is relatively easy to learn. You can install RStudio (one of the most sought-after tools in the industry) and start practicing R. There are some useful posts on the installation process for RStudio; please refer to the link here for more detailed steps. It's quite easy to do. Also, if you do not want to install R and RStudio for now, you can practice online as well; refer to this link. This book, available for free, will help you start getting your hands dirty in R.
- Statistics: Statistics is one of the skills you must not ignore before jumping into a data science use case. To make your journey smoother, and assuming you are a beginner, I advise you to read the ISLR book (An Introduction to Statistical Learning with Applications in R). Make sure you finish this book at least once before moving on to the next step. The purpose of the ISLR book is to provide an introduction to statistical learning methods. It is aimed at upper-level undergraduate students, master's students and Ph.D. students in the non-mathematical sciences. The book also includes R practice material, which will help you understand the statistical concepts and get better at R.
- Visualization: You should also be good at data exploration using techniques such as charts, graphs and distribution plots. For this part, try to get a good command of the R and Python libraries that support visualization, for example ggplot2 in R and matplotlib/seaborn in Python. If you can get your hands on dedicated visualization tools such as Power BI or Tableau, it's an added advantage.
- Model building: Ahhh!! So you are equipped with a query language, R code and an understanding of statistics, and hence you qualify to touch your first data science use case. Congratulations! Do not stop learning in any of the fields mentioned above, from all the sources you have. In parallel, however, start building some simple machine learning models such as linear regression, logistic regression and decision trees. You will find packages in R that will run these models for you. Please try to understand what is going on internally when you run these models on your data. For example, you should be able to explain R-squared and adjusted R-squared if you are running a linear regression model. Do not depend too much on “built-in libraries”. There are a few websites out there where you can find data to practice on; please refer to link 1 and link 2.
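As a small illustration of looking inside a model instead of relying on a library, the sketch below fits a one-variable linear regression by ordinary least squares in plain Python and computes R-squared and adjusted R-squared by hand. The data points and the function names are hypothetical, chosen only for this example; with real data you would compare these numbers against what your R package reports.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(y, y_pred):
    """Share of the variance in y that the predictions explain."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # unexplained part
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """R-squared penalized for using p predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Made-up toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_linear_regression(x, y)
predictions = [intercept + slope * xi for xi in x]
r2 = r_squared(y, predictions)
adj_r2 = adjusted_r_squared(r2, n=len(x), p=1)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"R-squared={r2:.4f} adjusted R-squared={adj_r2:.4f}")
```

R-squared tells you how much of the variation in the target your model explains. Adjusted R-squared applies a penalty for every predictor, so adding a useless feature does not automatically look like an improvement.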
Once you start getting a grasp of how to run a machine learning model, go to forums where multiple people are working on the same data set. Kaggle is one good platform, so you can start there. Create a free account and start practicing on the data provided. The most important thing to learn on this platform is what others are doing with the same data. How are they approaching the same problem statement? How are they using the features? Are they able to think differently? How? Why? Please allow yourself time to digest these learnings. Remember, learning is a gradual process.
If you follow the above steps properly and regularly, and you are able to answer the questions below, then you can put data science as a skill on your resume.
- What have you done in the entire model building process?
- Why did you do a particular step in model building, and what is it useful for?
- How did you improve your model?
- How is your model beneficial for the business?
Learning is always a continuous, gradual process, so keep practicing, keep learning, keep improving. New challenges and concepts keep arriving in the data science world; prepare yourself for them.
Wish you all the best!
Thanks for reading. Share it with friends if you liked the story.
You can join my Facebook group “Unfold data science”, where I keep mentoring folks.
You can also connect with me on LinkedIn here.