Bias and Variance + Cross Validation
Bias and variance
Reference : https://www.youtube.com/watch?v=EuBBz3bI-aA
split the data into train and test sets
ml algo = linear regression - it fits a straight line to the training set - a straight line can never capture the true (curved) relationship between the variables in the dataset
the inability of an ML algo to capture the true relationship in a dataset is called BIAS
since the straight line can't curve, it has a high amount of bias
to compare fits, measure the distances from the data points to each fitted line (the squiggly line and the straight line), square them, and add them up - do this first on the training set, then repeat on the testing set
the squiggly line fits the training set almost perfectly, but fails terribly on the testing set
in ML lingo, the difference in fits between datasets (training vs testing) is called VARIANCE.
low bias and high variance = squiggly line
high bias and low variance = straight line
an ideal ML algo has both low bias and low variance
- the straight line gives good (not great) predictions, but those good predictions are consistent across datasets
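a quick sketch of this (not from the video - synthetic data, numpy only): fit a straight line and a high-degree polynomial as a stand-in for the "squiggly line", then compare the sums of squared residuals on train vs test

```python
# Illustrative sketch: straight line (high bias) vs high-degree polynomial
# ("squiggly line", high variance). Data is made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * 1.5 * x) + rng.normal(0, 0.3, size=x.size)  # true relationship is curved

# simple train/test split: every other point
x_train, x_test = x[::2], x[1::2]
y_train, y_test = y[::2], y[1::2]

def train_test_sse(degree):
    # fit a polynomial of the given degree on the training set,
    # then sum the squared residuals on both sets
    fit = np.poly1d(np.polyfit(x_train, y_train, degree))
    return (np.sum((fit(x_train) - y_train) ** 2),
            np.sum((fit(x_test) - y_test) ** 2))

line_train, line_test = train_test_sse(1)     # straight line
squig_train, squig_test = train_test_sse(10)  # "squiggly" line

# the squiggly line always wins on the training set (lower SSE there),
# but that says nothing about how it does on the testing set
```

the flexible fit is guaranteed to have the lower training SSE; the whole point of the testing set is that this advantage often does not carry over.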
# Cross Validation
how do we decide which ML algo to use for a given problem?
CROSS VALIDATION helps us compare different ML methods and see how well they will work in practice.
- Estimate the parameters (training the algorithm)
- Evaluate the method (testing the algorithm)

a single 75% train + 25% test split uses only one way of dividing the data - cross validation instead divides the data into blocks and tries every combination of train and test blocks, keeping track of the results - this is then repeated for each ML algo being compared
Since we divided the data into 4 blocks, it's called four-fold cross validation
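the block mechanics can be sketched "by hand" (synthetic data, numpy only): each of the 4 blocks takes one turn as the testing set while the other 3 train the method

```python
# Sketch of four-fold cross validation by hand, evaluating a
# straight-line fit. Data is made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)

k = 4
folds = np.array_split(np.arange(x.size), k)  # divide the samples into 4 blocks

test_sses = []
for i in range(k):
    test_idx = folds[i]                                             # one block held out
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # other 3 blocks
    coeffs = np.polyfit(x[train_idx], y[train_idx], 1)   # train on 3 blocks
    pred = np.polyval(coeffs, x[test_idx])               # test on the held-out block
    test_sses.append(np.sum((pred - y[test_idx]) ** 2))

mean_sse = np.mean(test_sses)  # average test error over the 4 folds
```

the average (or total) test error across folds is the number used to compare one ML method against another.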
the number of blocks is arbitrary
when each individual sample takes a turn as the test set, it's called leave-one-out cross validation
in practice, ten-fold cross validation is common
for a tuning parameter (consider ridge regression, where the penalty is the tuning parameter) we can use ten-fold cross validation to find the best value of that parameter
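a sketch of that last idea, assuming scikit-learn is available (the ridge penalty is called `alpha` there; the data and the alpha grid below are made up):

```python
# Sketch: pick the ridge penalty with ten-fold cross validation.
# Assumes scikit-learn; synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.0, 3.0])
y = X @ true_w + rng.normal(0, 0.5, size=100)

# RidgeCV scores each candidate penalty with 10-fold cross validation
# and keeps the one that does best across the folds
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=10)
model.fit(X, y)
best_alpha = model.alpha_  # the penalty chosen by cross validation
```

the same pattern (grid of candidate values + k-fold scoring) applies to any tuning parameter, not just the ridge penalty.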