# DATA ANALYSIS STATISTICAL TECHNIQUES A DATA SCIENTIST SHOULD KNOW

Statistical data analysis is the process of applying statistical operations to data. It is a form of quantitative research: it seeks to quantify the data and typically applies some form of statistical analysis. Quantitative data includes descriptive data, such as survey data and observational data. Statistical data analysis generally relies on statistical tools that require some statistical knowledge to use. Here are the top statistical data analysis techniques.

#### Linear Regression

Linear regression is a technique used to predict a target variable by finding the best linear relationship between the dependent and independent variables, where "best fit" means that the sum of the distances between the fitted line and the actual observations at each data point is as small as possible. There are two main types of linear regression:

Simple Linear Regression: uses a single independent variable to predict a dependent variable by fitting the best linear relationship.

Multiple Linear Regression: uses more than one independent variable to predict the dependent variable by fitting the best linear relationship.
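As an illustration of the idea, here is a minimal pure-Python sketch of simple linear regression fitted by ordinary least squares; the function name `fit_simple_linear` and the toy data are invented for this example:

```python
# Simple linear regression via ordinary least squares (pure Python).
# Fits y = a + b*x by minimizing the sum of squared residuals.

def fit_simple_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope b = cov(x, y) / var(x)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x  # intercept
    return a, b

# Toy data lying exactly on y = 2x + 1, so OLS should recover a = 1, b = 2.
a, b = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
```

Multiple linear regression extends the same least-squares idea to several predictors, and is typically solved with matrix algebra rather than the closed form above.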

#### Classification

As a data mining technique, classification assigns categories to a collection of data to enable more accurate predictions and analysis. Common classification techniques include:

Logistic Regression: a regression analysis technique used when the dependent variable is dichotomous (binary). It is a predictive analysis used to explain data and the relationship between one binary dependent variable and one or more nominal independent variables.
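To make this concrete, here is a hedged pure-Python sketch of logistic regression with a single predictor, fitted by plain gradient descent (the function names, learning rate, and toy data are all invented for the example):

```python
import math

# Logistic regression for a binary outcome, fit by gradient descent.
# Model: P(y = 1 | x) = sigmoid(w*x + b)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradients of the average negative log-likelihood
        gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        gb = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Separable toy data: small x -> class 0, large x -> class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```

After fitting, `sigmoid(w * x + b)` gives the estimated probability that an observation at `x` belongs to class 1.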

Discriminant Analysis: two or more known clusters (populations) are given a priori, and new observations are assigned to one of them based on computed features. It models the distribution of the predictors X separately in each response class and applies Bayes' theorem to obtain estimates of the probability of each response class, given the value of X.
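A minimal sketch of the Bayes-rule classification step, assuming a single predictor and a Gaussian model per class (all function names and data here are illustrative):

```python
import math

# One-dimensional Gaussian discriminant analysis sketch:
# model each class with a normal density, then classify by Bayes' rule.

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_class(xs):
    # estimate the class mean and variance from its observations
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def classify(x, params, priors):
    # posterior is proportional to prior * likelihood (Bayes' theorem)
    scores = {k: priors[k] * gaussian_pdf(x, *params[k]) for k in params}
    return max(scores, key=scores.get)

class0 = [1.0, 1.2, 0.8, 1.1]
class1 = [4.0, 4.2, 3.8, 4.1]
params = {0: fit_class(class0), 1: fit_class(class1)}
priors = {0: 0.5, 1: 0.5}
```

A new observation near 1.0 is assigned to class 0 and one near 4.0 to class 1, because those classes give it the higher posterior probability.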

#### Resampling Methods

Resampling is a non-parametric method of statistical inference in which repeated samples are drawn from the original data. It produces a new sampling distribution based on the actual data, using empirical rather than analytical methods to generate that distribution. To understand resampling, the following techniques also need to be understood:

Bootstrapping: used for validating a predictive model and its performance, for ensemble methods, and for estimating the bias and variance of a model. It works by sampling with replacement from the original data and treating the "not selected" data points as test samples.
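For instance, the bootstrap can estimate the standard error of the sample mean by resampling with replacement many times and measuring the spread of the resulting statistics. This sketch uses only the standard library; the helper name `bootstrap_se` is invented:

```python
import random

# Bootstrap estimate of the standard error of the sample mean.

def bootstrap_se(data, n_resamples=1000, seed=0):
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # sampling with replacement from the original data
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    grand = sum(means) / len(means)
    # standard deviation of the resampled means
    return (sum((m - grand) ** 2 for m in means) / len(means)) ** 0.5

data = [2.1, 2.5, 2.2, 3.0, 2.8, 2.4, 2.9, 2.6]
se = bootstrap_se(data)
```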

Cross-Validation: used to validate model performance. The training data is divided into K parts; K−1 parts serve as the training set and the remaining part as the test set. The process is repeated K times, and the average of the K scores is taken as the performance estimate.
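The K-fold procedure above can be sketched in a few lines of pure Python; the function names, the trivial mean-predictor "model", and the toy data are invented for illustration:

```python
# K-fold cross-validation sketch: split indices into K parts, train on K-1
# parts, score on the held-out part, and average the K scores.

def kfold_indices(n, k):
    folds, start = [], 0
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_score(xs, ys, fit, score, k=4):
    scores = []
    for test_idx in kfold_indices(len(xs), k):
        train_idx = [i for i in range(len(xs)) if i not in test_idx]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / k  # average of the K scores

# Toy "model": predict the mean of the training targets; score by MSE.
fit = lambda xs, ys: sum(ys) / len(ys)
score = lambda m, xs, ys: sum((y - m) ** 2 for y in ys) / len(ys)
xs = list(range(8))
ys = [1.0] * 8  # constant target, so the cross-validated MSE should be 0
mse = cross_val_score(xs, ys, fit, score, k=4)
```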

#### Tree-based Methods

Tree-based methods are among the most commonly used techniques for both regression and classification problems. They work by splitting the predictor space into a number of simple regions; because the splitting rules used to segment the predictor space can be summarized in a tree, they are also known as decision-tree methods.

Bagging: reduces the variance of a prediction by generating additional training data from the original dataset, using combinations with repetition to produce multiple sets of the same size as the original data. Enlarging the training set this way cannot improve the model's predictive strength, but it reduces the variance, narrowing the prediction toward the expected outcome.
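A toy sketch of the idea, in which each "model" is simply the mean of one bootstrap resample and the bagged prediction averages them (the names, seed, and data are invented; real bagging would train a decision tree on each resample):

```python
import random

# Bagging sketch: build several "models" on bootstrap resamples and
# average their predictions; averaging reduces variance.

def bootstrap_sample(data, rng):
    # combinations with repetition: sample with replacement, same size
    return [rng.choice(data) for _ in data]

def bagged_mean(data, n_models=50, seed=1):
    rng = random.Random(seed)
    # each "model" here is just the mean of its bootstrap sample
    preds = [sum(s) / len(s)
             for s in (bootstrap_sample(data, rng) for _ in range(n_models))]
    return sum(preds) / n_models

data = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = bagged_mean(data)
```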

Boosting: computes the outcome using several models and then combines the results using a weighted-average approach. By combining the strengths and weaknesses of the individual models through a suitable weighting formula, good predictive performance can be obtained over a wide range of input data.
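The weighted-average combination described above can be sketched as follows. Note that this is only the combination step: a full boosting algorithm such as AdaBoost also reweights the training points between rounds, which this toy example omits; all names and numbers are invented:

```python
# Weighted-average ensemble: combine several models' predictions with
# weights reflecting how much each model is trusted.

def weighted_ensemble(predictions, weights):
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, predictions)) / total
            for i in range(len(predictions[0]))]

# Two hypothetical models scoring the same 3 inputs; model A is trusted
# three times as much as model B.
model_a = [0.9, 0.1, 0.8]
model_b = [0.5, 0.5, 0.5]
combined = weighted_ensemble([model_a, model_b], weights=[3.0, 1.0])
```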

#### Unsupervised Learning

Unsupervised learning techniques apply when the groups or categories within the data are not known in advance. Clustering and association rules are common examples of unsupervised learning, in which sets of data items are grouped into closely related categories.

Principal Component Analysis: PCA helps produce a low-dimensional representation of the dataset by identifying a set of mutually uncorrelated linear combinations of features with maximum variance. It also helps uncover latent interactions among the variables in an unsupervised setting.
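For two-dimensional data, the first principal component can be computed directly from the 2x2 covariance matrix. This pure-Python sketch (with invented names and toy points) finds the direction of maximum variance:

```python
import math

# First principal component of 2-D data: the direction of maximum
# variance, taken from the eigenvectors of the 2x2 covariance matrix.

def first_pc(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # covariance matrix entries [[sxx, sxy], [sxy, syy]]
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # largest eigenvalue of the covariance matrix
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # corresponding eigenvector, normalized to unit length
    vx, vy = lam - syy, sxy
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Points lying near the line y = x: the first PC should point along (1, 1).
pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]
vx, vy = first_pc(pts)
```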

K-Means Clustering: partitions the data into k distinct clusters based on each point's distance to the cluster centroid.
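A minimal one-dimensional sketch of Lloyd's algorithm for k-means; the function name, seed, and data are invented for illustration:

```python
import random

# Lloyd's algorithm for k-means: assign each point to the nearest
# centroid, then move each centroid to the mean of its assigned points.

def kmeans_1d(data, k=2, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            # assign to the nearest centroid
            i = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[i].append(x)
        # update each centroid to its cluster mean (keep old if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated groups around 1.0 and 9.0.
data = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
c = kmeans_1d(data)
```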

Hierarchical Clustering: builds a multilevel hierarchy of clusters by constructing a cluster tree.
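A small sketch of bottom-up (agglomerative) hierarchical clustering with single linkage, using invented toy data: each point starts as its own cluster, and the two closest clusters are repeatedly merged:

```python
# Agglomerative hierarchical clustering sketch with single linkage:
# the distance between clusters is that of their nearest members.

def single_linkage(a, b):
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # find the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_linkage(clusters[ij[0]],
                                                 clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return [sorted(c) for c in clusters]

# Three well-separated groups; stopping at 3 clusters should recover them.
pts = [1.0, 1.1, 5.0, 5.2, 9.9, 10.0]
clusters = agglomerate(pts, 3)
```

Recording the sequence of merges, rather than stopping at a fixed count, yields the full cluster tree (dendrogram).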