Data Science (Part-1)

Pres1dent
3 min read · Apr 14, 2021


We often see amazing charts and graphs like the ones shown above and wonder how they were made. We often have a clear idea of how the data should be represented, but we don't know how or where to begin. I started with a similar mindset. Don't worry: it isn't that complicated, and it will be fun to learn little by little. So let's not wait any longer and dive right in. Let's begin with the first basic question.

What is Data Science?

When someone says "data science", most people form some idea of what it means, and that idea may or may not be accurate. So we will define data science in terms of the tasks a data scientist usually performs: data science is the science of collecting, storing, processing, describing and modelling data.

Data Science Pipeline

Collection of Data: How data is collected usually depends on the question the data scientist is trying to answer and on the environment in which the data scientist works.

For example, a data scientist working at an e-commerce company like Amazon doesn't have to venture out to collect data, because the data is already stored by the company. A data scientist working for a political party, on the other hand, has to crawl and scrape data from social media sources such as Facebook and Reddit, where people discuss new policies. A data scientist working with farmers to test a new type of fertilizer has no existing data to draw on, so they have to design and conduct experiments to measure the effect of factors such as seed type, fertilizer and irrigation. These experiments can be designed by trying different combinations of these factors.
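As a rough illustration of the farming example, the sketch below lists every combination of factor levels, which is one simple way to lay out such an experiment (a full factorial design). The factor levels and the helper name `design_experiments` are made up for this example and are not taken from any real study.

```python
from itertools import product

# Hypothetical factor levels, invented purely for illustration.
seeds = ["seed_A", "seed_B"]
fertilizers = ["old_fertilizer", "new_fertilizer"]
irrigation = ["drip", "sprinkler", "flood"]


def design_experiments(*factors):
    """Return every combination of factor levels (a full factorial design)."""
    return list(product(*factors))


plots = design_experiments(seeds, fertilizers, irrigation)

# Each tuple describes one experimental plot to set up.
for seed, fertilizer, water in plots:
    print(seed, fertilizer, water)

print("Number of plots needed:", len(plots))
```

With 2 seed types, 2 fertilizers and 3 irrigation methods, this gives 2 × 2 × 3 = 12 plots. In practice the number of combinations grows quickly, so real experiments often test only a chosen subset.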

Storing Data: Once the data is collected, it has to be stored somewhere that meets the organization's requirements. Data can exist in an organization in three forms:

A) Transactional & Operational Data:

This includes a variety of structured data such as patient records, employee records, insurance claims and telephone bills. Since this structured data may contain a large number of columns, relational databases are usually used to store and query it.
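To make this concrete, here is a minimal sketch of storing and querying employee records in a relational database, using Python's built-in `sqlite3` module with an in-memory database. The table name, columns and sample rows are assumptions made up for this example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table layout, invented for illustration.
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)"
)

# Made-up sample records.
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [
        ("Asha", "Engineering", 85000.0),
        ("Ravi", "Engineering", 90000.0),
        ("Meena", "Sales", 60000.0),
    ],
)
conn.commit()

# Pick only the columns and rows we care about.
rows = conn.execute(
    "SELECT name, salary FROM employees WHERE department = ? ORDER BY name",
    ("Engineering",),
).fetchall()

print(rows)
conn.close()
```

The query selects just two columns for one department, which is the kind of structured access a relational database is designed for.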

B) Data from multiple databases:

This type of data arises when a company wants to answer a question that spans several of its databases. Consider a telecom company such as Airtel that wants to know whether the people who use its SIM cards also prefer its broadband service, or whether its broadband isn't as good as its SIM cards. The data from these separate databases is integrated into a common repository, which helps answer such questions with the help of some analytics.
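The idea of integrating separate databases can be sketched in a few lines. The example below assumes two hypothetical sources, one for mobile customers and one for broadband customers, that share a `customer_id` field. All names and records are invented for illustration, and a real repository would be far larger and live in a dedicated system rather than in Python lists.

```python
# Hypothetical records from two separate systems, invented for illustration.
mobile_customers = [
    {"customer_id": 1, "plan": "prepaid"},
    {"customer_id": 2, "plan": "postpaid"},
    {"customer_id": 3, "plan": "prepaid"},
    {"customer_id": 4, "plan": "postpaid"},
]
broadband_customers = [
    {"customer_id": 2, "speed_mbps": 100},
    {"customer_id": 4, "speed_mbps": 40},
]


def integrate(mobile, broadband):
    """Combine both sources into one list of records keyed by customer_id."""
    broadband_by_id = {row["customer_id"]: row for row in broadband}
    merged = []
    for row in mobile:
        match = broadband_by_id.get(row["customer_id"])
        merged.append(
            {
                "customer_id": row["customer_id"],
                "mobile_plan": row["plan"],
                "has_broadband": match is not None,
            }
        )
    return merged


repository = integrate(mobile_customers, broadband_customers)

# A simple piece of analytics on the combined data.
share = sum(r["has_broadband"] for r in repository) / len(repository)
print(f"{share:.0%} of mobile customers also use broadband")
```

Once the two sources sit in one place, a question that neither database could answer alone becomes a one-line calculation.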

C) Unstructured Data:

Because data is now so easy to generate and capture, a huge amount of it is produced in unstructured formats such as text, images, video and speech. This data is characterized by high volume, high variety and high velocity.

The next part will be published soon… Stay tuned!

