Data Science from a Different Perspective

What is Data Science

Data science is an interdisciplinary field that uses scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.

Data Science Process

The three components involved in data science are organising, packaging and delivering data. Together they form the three-step OPD (Organise, Package, Deliver) data science process.

Step 1. Organise Data.

Organising data involves the physical storage and format of data, and incorporates best practices in data management.

Step 2. Package Data.

Packaging data involves logically manipulating and joining the underlying raw data into a new representation and package.
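As a rough illustration of packaging, the sketch below joins two small raw tables into a new per-customer summary. The tables, field names and the package_orders helper are all made up for this example.

```python
# Hypothetical raw data: two "tables" stored as lists of dicts.
customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 250.0},
    {"order_id": 11, "customer_id": 1, "amount": 100.0},
    {"order_id": 12, "customer_id": 2, "amount": 75.0},
]


def package_orders(customers, orders):
    """Join orders to customers and total the spend per customer."""
    # Lookup table: customer_id -> name (the join key).
    names = {c["customer_id"]: c["name"] for c in customers}
    totals = {}
    for order in orders:
        name = names[order["customer_id"]]
        totals[name] = totals.get(name, 0.0) + order["amount"]
    return totals


spend_per_customer = package_orders(customers, orders)
print(spend_per_customer)
```

The raw tables stay untouched; the packaged result is a new representation built on top of them.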

Step 3. Deliver Data.

Delivering data involves ensuring that the message the data carries reaches the people who need to hear it.

Intro to Machine Learning

Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem.

Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.

Types of Machine Learning Systems

There are so many different types of Machine Learning systems that it is useful to classify them in broad categories based on:

Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)
Whether or not they can learn incrementally on the fly (online versus batch learning)
Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

Supervised Learning

Supervised learning is where you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output, Y = f(X). The goal is to approximate the mapping function so well that, when you have new input data (X), you can predict the output variable (Y) for that data.

Supervised learning problems can be further grouped into regression and classification problems.

Classification:

A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.
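A minimal sketch of a classifier, assuming a single numeric input and two categories. It learns one cut-off value from labelled examples (a one-feature decision stump, chosen here only for brevity). The readings and the function names are hypothetical.

```python
def fit_threshold(values, labels):
    """Pick the cut-off that misclassifies the fewest training examples.

    The rule being learned is: predict "disease" when value >= cut-off.
    """
    best_cut, best_errors = None, len(values) + 1
    for cut in sorted(set(values)):
        errors = sum(
            (v >= cut) != (label == "disease")
            for v, label in zip(values, labels)
        )
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


def predict(cut, value):
    return "disease" if value >= cut else "no disease"


# Hypothetical labelled training data (one test reading per patient).
readings = [4.1, 4.8, 5.0, 5.3, 6.9, 7.2, 7.8, 8.4]
labels = ["no disease"] * 4 + ["disease"] * 4

cut = fit_threshold(readings, labels)
print(cut, predict(cut, 7.5), predict(cut, 5.1))
```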

Regression:

A regression problem is when the output variable is a real value, such as an amount in rupees or a weight.
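A minimal sketch of regression, assuming one input variable and using ordinary least squares to fit a straight line. The data points are invented so that the true relationship is y = 2x + 1.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict Y for a new input x = 5
print(slope, intercept, prediction)
```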

Some popular examples of supervised machine learning algorithms are:

Linear regression for regression problems,
Random forest for classification and regression problems,
Support vector machines (SVM) for classification problems.

In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.

Unsupervised Machine Learning

Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

Unsupervised learning problems can be further grouped into clustering and association problems.

Clustering:

A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
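The sketch below groups customers by monthly spend with a bare-bones one-dimensional k-means. The spend figures are invented, and the choice of k-means and of the starting centroids (smallest and largest value) are assumptions made for the example.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(
                range(len(centroids)), key=lambda i: abs(v - centroids[i])
            )
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer. No labels are provided.
monthly_spend = [20, 25, 30, 200, 220, 240]

centroids, clusters = kmeans_1d(
    monthly_spend, centroids=[min(monthly_spend), max(monthly_spend)]
)
print(centroids, clusters)
```

The algorithm is never told which customers are "low" or "high" spenders; the two groups emerge from the data alone.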

Association:

An association rule learning problem is where you want to discover rules that describe large portions of your data, such as "people who buy X also tend to buy Y".
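One common way to score a rule like "people who buy X also tend to buy Y" is with its support (how often X and Y appear together) and its confidence (how often Y appears given X). The sketch below computes both for a handful of invented shopping baskets; rule_stats is a hypothetical helper.

```python
def rule_stats(baskets, x, y):
    """Return (support, confidence) for the rule x -> y."""
    n_x = sum(1 for b in baskets if x in b)
    n_xy = sum(1 for b in baskets if x in b and y in b)
    support = n_xy / len(baskets)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical transactions, one set of items per basket.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk"},
]

support, confidence = rule_stats(baskets, "bread", "butter")
print(support, confidence)
```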

Supervised vs. Unsupervised

Supervised learning
- Trying to predict a specific quantity
- Has training examples with labels
- Can measure accuracy directly
Unsupervised learning
- Trying to understand the data
- Looking for structures or unusual patterns
- Not looking for something specific (unlike supervised learning)
- Does not require labelled data
- Evaluation is usually indirect or qualitative
Semisupervised learning
- Uses unsupervised methods to improve supervised algorithms
- Usually a few labelled examples plus a lot of unlabelled examples

Semisupervised learning

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data.

Some photo-hosting services, such as Google Photos, are good examples of this.
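As a toy illustration of learning from a few labels and many unlabelled points, the sketch below uses a simple pseudo-labelling loop: it repeatedly takes the unlabelled point closest to an already labelled one and copies that label. This is only one of several semisupervised approaches and is not a description of how any photo service works. The data and the self_train helper are invented.

```python
def self_train(labeled, unlabeled):
    """Spread labels outward from the few labelled points.

    labeled: dict mapping a numeric value to its label.
    unlabeled: list of numeric values (assumed distinct in this sketch).
    """
    labeled = dict(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Find the (unlabelled, labelled) pair that is closest together.
        point, neighbour = min(
            ((u, l) for u in remaining for l in labeled),
            key=lambda pair: abs(pair[0] - pair[1]),
        )
        labeled[point] = labeled[neighbour]  # pseudo-label the point
        remaining.remove(point)
    return labeled


# Two labelled examples and six unlabelled ones (hypothetical values).
seed_labels = {1.0: "A", 10.0: "B"}
unlabeled = [2.0, 3.0, 4.0, 7.0, 8.0, 9.0]

result = self_train(seed_labels, unlabeled)
print(result)
```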

Why Use Machine Learning?

To summarize, Machine Learning is great for:

Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
Fluctuating environments: a Machine Learning system can adapt to new data.
Getting insights about complex problems and large amounts of data.

Reinforcement Learning

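Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself the best strategy, called a policy, to get the most reward over time.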
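A minimal sketch of an agent learning from rewards, assuming a very simple environment with two possible actions and a fixed reward for each (a two-armed bandit). The agent uses an epsilon-greedy rule, which is one common choice and an assumption here: mostly pick the action that currently looks best, occasionally try a random one.

```python
import random


def run_agent(rewards, steps=500, epsilon=0.2, seed=0):
    """Estimate the value of each action by trial and error."""
    rng = random.Random(seed)
    values = [0.0] * len(rewards)  # estimated reward per action
    counts = [0] * len(rewards)    # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))  # explore
        else:
            action = max(range(len(rewards)), key=lambda a: values[a])  # exploit
        reward = rewards[action]  # the environment responds with a reward
        counts[action] += 1
        # Incremental average of the rewards seen for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values


# Hypothetical environment: action 0 pays nothing, action 1 pays 1.0.
values = run_agent(rewards=[0.0, 1.0])
best_action = max(range(len(values)), key=lambda a: values[a])
print(values, best_action)
```

Nobody tells the agent which action is correct; it discovers the better one purely from the rewards it receives.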

Batch and Online Learning

Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.

Batch learning

In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data.

Online learning

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
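The sketch below shows the idea with a linear model updated one instance at a time by stochastic gradient descent. The stream generator, the learning rate and the underlying relationship y = 2x + 1 are all assumptions made for the example.

```python
def data_stream(n):
    """Hypothetical stream: noise-free samples of y = 2x + 1."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0


def online_fit(stream, learning_rate=0.1):
    """Update the model after every instance; never store the data."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y
        # One cheap gradient step on the squared error for this instance.
        w -= learning_rate * error * x
        b -= learning_rate * error
    return w, b


w, b = online_fit(data_stream(5000))
print(w, b)
```

Each instance is used once and then discarded, so memory use stays constant however long the stream runs.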

Instance-Based Versus Model-Based Learning

One more way to categorize Machine Learning systems is by how they generalize.

Instance-based learning

The system learns the examples by heart, then generalizes to new cases using a similarity measure.
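A k-nearest-neighbours classifier is a typical instance-based learner. In the sketch below the stored examples are the whole "model", and Euclidean distance plays the role of the similarity measure. The points, labels and k = 3 are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(examples, query, k=3):
    """Classify query by majority vote among its k closest examples.

    examples: list of ((x, y), label) pairs kept in memory as-is.
    """
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical memorised examples.
examples = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue"),
]

print(knn_predict(examples, (2, 2)), knn_predict(examples, (7, 9)))
```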

Model-based learning

Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning.
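In the sketch below the model is just one average point (centroid) per class. Once it is fitted, the training examples can be thrown away and predictions use the model alone. The nearest-centroid model, the data and the function names are assumptions chosen to keep the example short.

```python
import math


def fit_centroids(examples):
    """Summarise each class by the mean of its points."""
    sums, counts = {}, {}
    for (x, y), label in examples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {
        label: (sx / counts[label], sy / counts[label])
        for label, (sx, sy) in sums.items()
    }


def predict(model, point):
    """Return the label whose centroid is closest to point."""
    return min(model, key=lambda label: math.dist(model[label], point))


# Hypothetical training examples.
examples = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue"),
]

model = fit_centroids(examples)  # two centroids; examples no longer needed
print(model, predict(model, (2, 2)), predict(model, (7, 9)))
```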
