Updated: May 5, 2020
The term ‘Machine Learning’ refers to the automated detection of meaningful patterns in data. Machine Learning (ML) is sometimes also called ‘Automated Learning’.
In this article, I am going to talk about Data Science, Machine Learning, Statistics and Probability, and several more detailed subjects that a Data Scientist uses in their routine job.
When do we need Machine Learning?
Tasks where there can be a defect: Humans perform many tasks routinely; however, they sometimes fail. Ex: Driving. In such cases, if we program efficiently and provide enough training data, the program will learn from its experience and achieve satisfactory results.
Tasks where humans cannot go or perform: Humans are one of the most intelligent species on earth; however, some tasks involve data that is too huge or too minute for a human to handle. Ex: Datasets of astronomy and pharma/medical knowledge.
Adaptive: Traditional programs are known for their rigidity: once a program is installed, it does not change. Behavior varies from person to person, and an ML program adapts to its inputs. Ex: Siri or Alexa. The program is trained on datasets, yet it can adapt to almost all users.
Types of Machine Learning:
ML is a vast domain; however, it can broadly be categorized into:
Supervised Machine Learning:
As the name suggests, the program learns under supervision: the training data comes with known answers (labels) provided by the user or the environment.
Example: A dataset with 100 pictures of dogs, each labeled with its breed, is provided as training input to an ML program. In the real world, a picture of a dog with a new size or shape is then given to the program as test data. The program generalizes to the new picture and predicts the correct breed. In this case, the program learnt from the training data in order to predict on the test data.
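To make the idea of learning from labeled examples concrete, here is a minimal sketch in Python. It is not one of the popular algorithms; it is a toy nearest-neighbour classifier that I picked only because it fits in a few lines. The measurements, the breed names and the function name are all made up for illustration.

```python
import math

def nearest_neighbour_predict(training_data, new_point):
    """Return the label of the training example closest to new_point."""
    _, closest_label = min(
        training_data,
        key=lambda example: math.dist(example[0], new_point),
    )
    return closest_label

# Hypothetical labeled training data: (height in cm, weight in kg) -> breed
training_data = [
    ((25.0, 3.0), "Chihuahua"),
    ((23.0, 2.5), "Chihuahua"),
    ((60.0, 30.0), "Labrador"),
    ((57.0, 28.0), "Labrador"),
]

# A new dog the program has never seen (the "test data")
predicted_breed = nearest_neighbour_predict(training_data, (58.0, 31.0))
print(predicted_breed)  # Labrador
```

The program never saw a dog measuring exactly 58 cm and 31 kg, yet it predicts a breed by comparing the new dog with the labeled examples it was trained on.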
The popular algorithms used in Supervised ML are:
Support Vector Machines
Un-Supervised Machine Learning:
In Un-Supervised ML, unlike Supervised ML, we do not have any target variable ‘Y’ to predict. Instead, the program tries to group the data into different segments or clusters for better understanding.
Example: Dividing the population based on gender. In banking, customers can be segmented based on their spending behavior to identify prospects for other banking products.
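The banking example can be sketched with a very small k-means clustering routine in plain Python. This is only an illustration under my own assumptions: the monthly spend figures, the two starting centroids and the function name are invented, and a real project would normally use a library implementation.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around the given starting centroids (plain k-means)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put every value in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(value - centroids[i]),
            )
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer; note there is no 'Y' column
monthly_spend = [120, 150, 130, 900, 950, 1000, 140, 980]
centroids, clusters = k_means_1d(monthly_spend, centroids=[100, 1000])

print(centroids)  # [135.0, 957.5]
print(clusters)   # [[120, 150, 130, 140], [900, 950, 1000, 980]]
```

The program was never told which customers are low or high spenders; the two segments emerge from the data itself.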
The popular algorithms used in Un-Supervised ML are:
Hierarchical Clustering Algorithm
Artificial Intelligence (AI):
Artificial intelligence dates back to the 1950s and 1960s; however, there was not enough data back then for AI to learn from or thrive. Of late, huge volumes of data are being captured, and analysts would like to leverage AI to learn the patterns in the data and predict the required variables.
There is often confusion about how AI and ML relate. The picture below clarifies it: Deep Learning and Machine Learning are both subsets of AI.
Popular Programming Languages for Machine Learning:
All the above can be accessed from an Anaconda navigator and
Python is one of my personal favorites for Machine Learning, and I am coming up with yet another detailed article/blog on Python. So, stay tuned.
Statistics and Probability:
When we talk about Data Science, we need to shed some light on Statistics and Probability. These 2 subjects are often given lower importance in school. However, Statistics and Probability are part of our everyday thinking.
Ex: In a cricket match, even a small child can use past data to predict whether Sachin is going to hit a century or not. Or while playing Ludo with a die, every player hopes for a ‘six’ on each roll, yet usually does not get one, as the probability of rolling a ‘six’ is 1/6.
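The 1/6 figure is easy to check with a quick simulation. The sketch below assumes a fair six-sided die; the number of rolls and the random seed are arbitrary choices of mine.

```python
import random
from fractions import Fraction

# Theoretical probability of a 'six' on one roll of a fair die
theoretical = Fraction(1, 6)

# Simulate many rolls and count how often a 'six' shows up
rng = random.Random(42)  # fixed seed so the run is repeatable (arbitrary value)
rolls = [rng.randint(1, 6) for _ in range(60_000)]
observed = rolls.count(6) / len(rolls)

print(f"theoretical: {float(theoretical):.4f}")
print(f"observed:    {observed:.4f}")
```

With enough rolls, the observed share of sixes settles close to the theoretical 1/6, which is why a ‘six’ turns up far less often than a player would like.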
Data refer to facts and statistics collected together for reference or analysis.
Data can be:
Types of Data:
Everyone is surrounded by different sets of data. For instance, when we enter our office, we have a finite number of people using a finite number of gadgets; however, each of them may use resources measured on a continuous scale, like power, water or even WiFi data.
Data can be categorized into:
1. Quantitative Data: As the name suggests, this is related to quantity. Quantitative Data again has 2 types:
Discrete: Countable values. Ex: Strength of a classroom.
Continuous: Any value within a range, so infinitely many values are possible. Ex: Weight of a person.
2. Qualitative Data: In contrast, this is related to quality. It can be classified into 2 types:
Nominal: Categories with no natural order. Ex: Gender of the population (Male/Female/LGBT).
Ordinal: Categories with a natural order. Ex: Ranking in the school or CGPA grades (A, B, C).
Statistics is an area of applied mathematics concerned with data collection, analysis, interpretation and presentation.
Real time check:
We have 10 years of sales data for a product, and the business would like to analyze the data and come up with ways to improve the Key Performance Indicators (KPIs).
This is a Statistical problem statement and can be solved by Statistics.
When we talk about statistical data, the complete large set of data is called the ‘Population’ and a much smaller chunk of it is called a ‘Sample’.
Often, when we work on a statistical problem, we start by gaining knowledge from a small subset of the population, picked using one of several sampling techniques.
Sampling techniques again fall into 2 kinds: Probability sampling and Non-Probability sampling.
We will be focusing more on Probability sampling technique, which is widely used in business cases.
Random Sampling: Each and every person in the population has an equal probability of being picked for the sample.
Systematic Sampling: Every nth record/person is picked from the population. For example, every 2nd person from a row of 100 people is picked for the sample.
Stratified Sampling: The entire population is divided into strata based on some common attributes, and random sampling is then performed within each stratum. This is one of the best approaches being followed.
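The three techniques above can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the population of 100 records, the urban/rural attribute used for the strata, the sample sizes and the helper function name.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable (arbitrary value)

# Hypothetical population of 100 people: 75 'urban' and 25 'rural'
population = [
    {"id": i, "group": "urban" if i % 4 else "rural"}
    for i in range(1, 101)
]

# Random sampling: every person has the same chance of being picked
random_sample = rng.sample(population, k=10)

# Systematic sampling: pick every 2nd person from the row of 100
systematic_sample = population[1::2]

def stratified_sample(records, key, fraction, rng):
    """Split records into strata by `key`, then sample randomly in each stratum."""
    strata = {}
    for record in records:
        strata.setdefault(record[key], []).append(record)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Stratified sampling: take 20% from each stratum
stratified = stratified_sample(population, key="group", fraction=0.2, rng=rng)

print(len(random_sample), len(systematic_sample), len(stratified))  # 10 50 20
```

Because the stratified sample takes the same fraction from each stratum, it keeps the 75/25 split of the population (15 urban and 5 rural here), which a purely random sample of the same size does not guarantee.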
Types of statistics:
There are two types of statistics.
1. Descriptive statistics
2. Inferential Statistics.
Descriptive statistics: Descriptive Statistics uses the data to provide a description of the population. It focuses on the characteristics of the data.
1. Measures of Center: Mean, Median, Mode
2. Measures of Spread: Range, Inter Quartile Range, Variance, Standard Deviation
Mean: The average of all values in a sample.
Median: The central value of a sample set once the values are sorted in ascending/descending order.
If the sample has an even number of values, we take the mean of the 2 central values as the Median.
Mode: The value that occurs most often in the sample.
Range: How spread apart the values in a dataset are, measured as the difference between the largest and the smallest value.
Inter Quartile Range (IQR): The sorted sample data is divided into four quartiles. The IQR covers the middle 50% of the data, that is, the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
Variance: Describes how much a random variable differs from its expected value. In a nutshell, it is the average squared difference between the values and the mean.
Standard Deviation: Describes the dispersion of a set of data from its mean. It is the square root of the variance.
Information Gain: Measures how much information a characteristic (feature) gives us about the final outcome.
Entropy: Measure of the uncertainty present in the data set.
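Here is a small Python sketch of the measures listed above, using only the standard library. The numbers and the tiny ‘play’ table are made up. I use the population versions of variance and standard deviation (`statistics.variance` and `statistics.stdev` give the sample versions), and the entropy and information gain helpers are my own illustrative functions using the usual base-2 (Shannon) definition.

```python
import math
import statistics
from collections import Counter

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

# Measures of center
mean = statistics.mean(values)      # 5
median = statistics.median(values)  # 4.5, the mean of the 2 central values
mode = statistics.mode(values)      # 4

# Measures of spread
value_range = max(values) - min(values)         # 7
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1                                   # spread of the middle 50%
variance = statistics.pvariance(values)         # 4, population variance
std_dev = statistics.pstdev(values)             # 2.0, square root of the variance

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, feature, target):
    """Entropy of the target minus its weighted entropy after splitting on feature."""
    parent_entropy = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    child_entropy = sum(
        len(group) / len(rows) * entropy(group) for group in groups.values()
    )
    return parent_entropy - child_entropy

# Hypothetical data: does 'outlook' or 'windy' tell us more about 'play'?
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
```

In this toy table, ‘outlook’ separates the outcomes perfectly, so its information gain equals the full entropy of ‘play’ (1 bit), while ‘windy’ tells us nothing and its gain is 0.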
Inferential statistics: Inferential statistics makes inferences and predictions about a population based on a sample of data taken from the population.
All these are used in Data Science techniques.
Stay tuned... coming soon: Python for Machine Learning