- Data Visualization and Storytelling with Data
As a Data Scientist, you must understand data and be able to explain what is in it. It is therefore essential that you can tell a story with data visualizations.
There are many ways to visualize data. Visualizing data helps you not only understand what is going on in it but also tell its story.
Through data storytelling, you communicate the insights in your data. You should be able to create a storyboard of visuals describing what is, or may be, in the data. This will lead you to different hypotheses and possibly to a data model that captures the story in the data. The ability to capture insights from data, and then to forecast, is an essential part of a data scientist's work.
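As a minimal sketch of pairing a visual with a story, the snippet below plots a small invented series with matplotlib (assumed installed) and gives the chart a headline-style title; the data and labels are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures, invented for illustration
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 145, 170, 190]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_title("Monthly Sales Trend")  # the headline of the story
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
fig.savefig("monthly_sales.png")
```

A title that states the takeaway, rather than just naming the variables, is what turns a chart into a sentence of the story.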
- Python – Pandas and TensorFlow
Python has become the programming language of choice for Data Scientists.
Learning Python syntax is easy, but you should be able to write efficient scripts and leverage its wide range of libraries and packages. Python programming is a building block for tasks such as manipulating data, building machine learning models, writing DAG files, and so on.
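As a small sketch of the kind of data manipulation Pandas makes easy (assuming pandas is installed; the order data here is invented):

```python
import pandas as pd

# Hypothetical order data, invented for illustration
orders = pd.DataFrame({
    "customer": ["alice", "bob", "alice", "carol"],
    "amount": [25.0, 40.0, 15.0, 60.0],
})

# Group and aggregate: total spend per customer
totals = orders.groupby("customer")["amount"].sum()
print(totals.to_dict())  # {'alice': 40.0, 'bob': 40.0, 'carol': 60.0}
```

One idiomatic `groupby` line replaces the loop-and-accumulate code you would otherwise write by hand.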
- SQL + ETL
SQL is the universal language for interacting with relational database management systems (RDBMS). Though NoSQL databases are gaining traction, most data is still stored in an RDBMS such as Oracle, Postgres, MS SQL Server, or MySQL. Having a good handle on writing queries to extract data for visualization and storytelling will hugely benefit your career.
Data is stored and optimized for applications and reporting, not with analytics in view. Therefore, you will need to Extract, Transform, and Load (ETL) your data for analytics. A good grasp of how to extract (SQL), how to transform (this is where the math comes in), and how to load the result into your favorite toolset will help you excel in your Data Science career.
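The three ETL steps can be sketched end-to-end with the standard library's `sqlite3` module standing in for a real RDBMS; the table and column names are invented for illustration.

```python
import sqlite3

# Build an in-memory source database standing in for an application RDBMS
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (region TEXT, amount REAL)")
src.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# Extract: pull the raw rows out with a SQL query
rows = src.execute("SELECT region, amount FROM sales").fetchall()

# Transform: aggregate amounts per region in Python
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# Load: write the reshaped data into an analytics table
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE region_totals (region TEXT, total REAL)")
dst.executemany("INSERT INTO region_totals VALUES (?, ?)", totals.items())

print(dict(dst.execute("SELECT region, total FROM region_totals")))
# {'east': 150.0, 'west': 250.0}
```

In practice the transform step is where domain logic and math live; the extract and load steps are mostly plumbing, which is why SQL fluency pays off.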
- GIT – Version Control
Git is a distributed version control system that allows you to store and track your work – code, documents, and so on. There are many hosted Git services; two major free ones are GitHub and Bitbucket.
You should master how to use Git to version your work. Create a repository and store your current code, sample work, or whatever you are dabbling in.
On GitHub, you can create a profile that showcases your skills and sample work.
- Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing; that is, it lets you distribute the processing of your model across a large dataset. It is advantageous to have a good understanding of how a distributed system works.
As a Data Scientist, you are more a consumer of technology than a creator or maintainer of it. However, good knowledge and hands-on experience will make you an invaluable member of an analytics team.
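Spark itself is used through APIs such as PySpark, but the partition/map/reduce idea it distributes across a cluster can be sketched in plain Python; here every stage runs locally, and the sentences are invented for illustration.

```python
from functools import reduce

# A toy word count in the partition/map/reduce style that engines like
# Spark run across many machines; here everything executes locally.
lines = ["spark makes big data simple", "big data big insights"]

# Partition: split the dataset into chunks a cluster would scatter to workers
partitions = [lines[:1], lines[1:]]

def count_words(chunk):
    """Map stage: each worker counts words in its own partition."""
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(a, b):
    """Reduce stage: combine partial counts from two workers."""
    for word, n in b.items():
        a[word] = a.get(word, 0) + n
    return a

partial_counts = [count_words(p) for p in partitions]  # parallel on a real cluster
totals = reduce(merge, partial_counts, {})
print(totals["big"])  # 3
```

The key property is that the map stage touches only its own partition and the reduce stage only merges partial results, which is what makes the computation distributable.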
- Docker
Docker is a container platform that lets you package an operating system layer and an application together and run them as one unit. It is one more level of abstraction above a virtual machine, and you can run Docker containers on most operating systems.
A good working knowledge of Docker will enable you to deploy your model quickly. This comes in very handy when you are training your models.
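A minimal Dockerfile sketch for packaging a model-serving script might look like the following; `requirements.txt` and `serve_model.py` are hypothetical file names, not part of any standard.

```dockerfile
# Sketch of a Dockerfile for serving a trained model;
# requirements.txt and serve_model.py are hypothetical names.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model artifact and serving code into the image
COPY . .

CMD ["python", "serve_model.py"]
```

Copying `requirements.txt` before the rest of the code is a common layer-caching trick: dependency installation is rerun only when the requirements change, not on every code edit.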
- Airflow
Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. Why is Airflow worth learning? It provides a toolset for creating data pipelines that clean up your data and help you do ETL.
This tool gives you a powerful way to pipeline your data from ETL to cleansing to feature discovery. Because it works with Python, it is gaining traction in industry.
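An Airflow DAG file is ordinary Python, but the core idea – tasks with declared dependencies executed in order – can be illustrated with the standard library alone. This is a toy scheduler, not Airflow's API, and the task names are invented.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A toy pipeline in the spirit of an Airflow DAG: each task declares
# which upstream tasks must finish before it runs.
ran = []
tasks = {
    "extract":   lambda: ran.append("extract"),
    "transform": lambda: ran.append("transform"),
    "load":      lambda: ran.append("load"),
}
# task -> set of upstream tasks it depends on
deps = {"transform": {"extract"}, "load": {"transform"}}

# Run every task after all of its dependencies, as a scheduler would
for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(ran)  # ['extract', 'transform', 'load']
```

Airflow adds scheduling, retries, and monitoring on top of this dependency-ordering idea, which is why DAG files are such a natural fit for ETL pipelines.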