Demystify Data Science

Manish Kulkarni
Published in thinkAI.org
Oct 13, 2020

Data Science is one of the hottest topics and trends in the job market. Because it is so popular, there are also a lot of misconceptions about it. In this post I would like to put forth my thoughts on Data Science. Data Science is basically an assortment of different skills:

1. Data Collection

2. Data Storage

3. Data Processing

4. Data Description

5. Data Modeling

6. Model Deployment

Data Collection

Data collection is the first step. This involves finding the data. The data could be in-house data (for example, employee or product details at an engineering company), data collected from outside (for example, a paid survey), or new data gathered by conducting experiments (for example, to find the effectiveness of a drug during a new virus outbreak).

This requires good programming skills, statistics and domain knowledge.

Data Storage

Data storage is an important decision for any data project. It primarily depends on two factors: the type of data and the time period of storage. We have the following types of data: structured, semi-structured and unstructured.

Structured data is stored in database systems or a data warehouse, depending on the time period of storage. Structured data has a schema.

Unstructured data is typically images, audio files, text files, logs and so on. These do not have a schema and are often huge in size. They require a big data system for storage, assuming the data is generated at high velocity, in high volume and with high variety. They are converted into a specific format for storage.

Semi-structured data is data that does not conform to a data model but has some structure. It lacks a fixed or rigid schema. It does not reside in a relational database, but it has some organizational properties that make it easier to analyze. With some processing, we can store it in a relational database.
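As a rough illustration of that last point, here is a minimal sketch of how semi-structured JSON records could be flattened into relational tables. The records, the table names (person, skill) and the function load_into_sqlite are all made up for this example, and SQLite is used only because it ships with Python.

```python
import json
import sqlite3

# Hypothetical semi-structured records: the fields vary from record to record.
raw_records = """
[
  {"id": 1, "name": "Asha", "skills": ["python", "sql"]},
  {"id": 2, "name": "Ravi", "email": "ravi@example.com"},
  {"id": 3, "name": "Meera", "skills": ["statistics"], "email": "meera@example.com"}
]
"""


def load_into_sqlite(json_text, conn):
    """Flatten JSON records into two relational tables."""
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    conn.execute("CREATE TABLE skill (person_id INTEGER, skill TEXT)")
    for record in json.loads(json_text):
        # An optional field becomes NULL when it is absent.
        conn.execute(
            "INSERT INTO person (id, name, email) VALUES (?, ?, ?)",
            (record["id"], record["name"], record.get("email")),
        )
        # A nested list becomes rows in a child table.
        for skill in record.get("skills", []):
            conn.execute(
                "INSERT INTO skill (person_id, skill) VALUES (?, ?)",
                (record["id"], skill),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_into_sqlite(raw_records, conn)
rows = conn.execute(
    "SELECT p.name, s.skill FROM person p "
    "JOIN skill s ON s.person_id = p.id ORDER BY p.id, s.skill"
).fetchall()
print(rows)
```

The "some processing" here is the decision about which fields become columns and which nested parts become separate tables. That decision is what gives the data a schema.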

Data Processing

This is the most important and time-consuming task in Data Science.

It has the following steps:

1. Data Wrangling:

How do we handle multiple data sources, and how do we recognize the same information when it is stored in different tables under different column names?
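A small sketch of this idea in plain Python: two sources describe the same employees under different column names, so we first map the names onto one shared vocabulary and then combine the rows. The data, the column mappings and the helper names (rename_columns, outer_join) are assumptions made for illustration.

```python
# Hypothetical rows from two sources describing the same employees.
hr_rows = [
    {"emp_id": 101, "full_name": "Asha Rao"},
    {"emp_id": 102, "full_name": "Ravi Kumar"},
]
payroll_rows = [
    {"EmployeeID": 101, "MonthlyPay": 5000},
    {"EmployeeID": 103, "MonthlyPay": 4200},
]

# Map each source's column names onto one shared vocabulary.
HR_COLUMNS = {"emp_id": "employee_id", "full_name": "name"}
PAYROLL_COLUMNS = {"EmployeeID": "employee_id", "MonthlyPay": "monthly_pay"}


def rename_columns(rows, mapping):
    """Return copies of the rows with their keys renamed according to mapping."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def outer_join(left, right, key):
    """Combine rows that share a key; keep rows found in only one source."""
    merged = {}
    for row in left + right:
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


combined = outer_join(
    rename_columns(hr_rows, HR_COLUMNS),
    rename_columns(payroll_rows, PAYROLL_COLUMNS),
    key="employee_id",
)
for row in combined:
    print(row)
```

Employees 102 and 103 appear in only one source, so their combined rows are incomplete. Gaps like these are what the cleaning step has to deal with.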

2. Data Cleaning:

The curated data used in many tutorials is like a teddy bear. Real world data, on the other hand, is messy and is like a bear with blood in its mouth, ready to hunt. Data cleaning is a vital step that includes a missing value treatment strategy, handling of outliers, spelling correction and so on. We can also do scaling, normalizing and standardizing in this step.
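The sketch below shows one possible version of three of these steps using only Python's statistics module: median imputation for missing values, capping outliers with the common 1.5 x IQR rule, and standardizing. The sample ages and the function names are hypothetical, and these particular strategies are just one choice among many.

```python
import statistics

# Hypothetical ages with a missing value (None) and an obvious outlier (420).
ages = [23, 25, None, 31, 29, 420, 27]


def impute_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def cap_outliers(values, k=1.5):
    """Clip values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


imputed = impute_missing(ages)
capped = cap_outliers(imputed)
scaled = standardize(capped)
print(imputed)
print(capped)
print([round(v, 2) for v in scaled])
```

The order matters in this sketch: the outlier is capped before standardizing, otherwise the single value 420 would dominate the mean and the standard deviation.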

Data Description

This step is more about understanding hidden patterns, finding correlations and finding the central value of the data. Data visualization tools like Power BI and Tableau come in handy. One can also use R’s ggplot or Python’s seaborn and plotly to accomplish the same.
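Before reaching for a visualization tool, the central value and a correlation can be computed in a few lines. This is a minimal sketch on made-up numbers; the variable names and the pearson helper are assumptions for illustration.

```python
import statistics

# Hypothetical observations: hours studied and the exam score obtained.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 68, 74]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summary = {
    "mean": statistics.mean(exam_score),
    "median": statistics.median(exam_score),
    "stdev": statistics.stdev(exam_score),
    "correlation": pearson(hours_studied, exam_score),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A correlation close to 1 on this toy data says that the score rises almost linearly with the hours studied. It says nothing about why.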

Data Modeling

This part of the Data Science skill set has created the most hype in the market. It has two broad categories: statistical modeling and algorithmic modeling.

Statistical models are simpler models that focus on the underlying data distribution, understanding the relationships between variables, and hypothesis testing.

Algorithmic models are more complex models that require more data and can work with a large set of variables (columns). These models focus more on prediction.
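To make the contrast concrete, here is a toy sketch with one model of each flavor on the same made-up data. The least squares line stands in for statistical modeling because its slope can be read as a relationship. The k-nearest-neighbors predictor stands in for algorithmic modeling because it only produces predictions. Both choices, the data and the function names are assumptions for illustration, and real algorithmic models are usually far larger.

```python
import statistics

# Hypothetical data: advertising spend (x) and sales (y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def knn_predict(xs, ys, query, k=2):
    """Average the targets of the k training points nearest to the query."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return statistics.mean([y for _, y in nearest])


intercept, slope = fit_line(spend, sales)
print(f"each extra unit of spend adds about {slope:.2f} units of sales")

line_prediction = intercept + slope * 2.4
knn_prediction = knn_predict(spend, sales, query=2.4)
print(f"line predicts {line_prediction:.2f}, k-NN predicts {knn_prediction:.2f}")
```

The fitted line hands back a slope that can be interpreted and tested. The nearest-neighbor predictor hands back only a number for each query.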

Model Deployment

The above steps are focused on the developer's point of view. The developed solution, i.e. the model, needs to be put to use. A typical use of models is auto-tagging of photos. These models are deployed to in-house clouds or to cloud platforms such as Microsoft Azure, Google’s GCP or Amazon’s AWS, to name a few.

I have tried my best to give an overview of Data Science. Please do provide feedback for improvements.
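Whatever the platform, deployment usually comes down to two things: saving the trained model as an artifact, and wrapping it in a function that answers requests. The sketch below shows that shape with a toy linear model stored as JSON. The model parameters, the file name and the handle_request function are hypothetical, and a real service would sit behind a web framework or a cloud endpoint.

```python
import json
import os
import tempfile

# Hypothetical trained model: a line with an intercept and a slope.
trained_model = {"intercept": 1.09, "slope": 1.97}


def save_model(model, path):
    """Persist the model parameters so a separate service can load them."""
    with open(path, "w") as f:
        json.dump(model, f)


def load_model(path):
    """Load the model parameters written by save_model."""
    with open(path) as f:
        return json.load(f)


def handle_request(model, request_body):
    """Turn a JSON request like {"x": 3} into a JSON response."""
    try:
        x = float(json.loads(request_body)["x"])
    except (ValueError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a numeric 'x'"})
    prediction = model["intercept"] + model["slope"] * x
    return json.dumps({"prediction": prediction})


# "Training side": write the artifact. "Serving side": load it and answer.
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(trained_model, model_path)
served_model = load_model(model_path)

response = handle_request(served_model, '{"x": 3}')
print(response)
print(handle_request(served_model, "not json"))
```

Keeping the saved artifact separate from the request handler means the model can be retrained and swapped without touching the serving code.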

Originally published at https://www.thinkai.org on October 13, 2020.

Reference: https://www.geeksforgeeks.org/what-is-semi-structured-data/
