Regression - deals with outputs that are continuous values - any value in a range - as opposed to discrete values
Classification - deals with categorizing each result into one of a discrete set of categories
Convergence - the point at which the model reaches its optimum performance; training the model with more data after this point has little effect, because the error rate becomes roughly constant
Bias vs Variance trade-off?
Bias refers to error from the overly simplified assumptions learnt by the model
Variance refers to the model's learnt sensitivity to fluctuations in the training data
High bias + low variance - underfitting ; low bias + high variance - overfitting
The sweet spot is where total generalization error is minimized by an appropriate level of model complexity.
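A rough sketch of the trade-off (not from the notes, on hypothetical noisy sine data): a degree-1 fit underfits (high bias) while a degree-10 fit on only 15 points overfits (high variance); comparing train vs test MSE shows the effect of model complexity.

```python
# Sketch: underfitting (degree 1) vs overfitting (degree 10) with NumPy
# polynomial least squares on hypothetical noisy data.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.2, n)  # hypothetical target + noise
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of the fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

for degree in (1, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    # Higher degree always lowers training error, but can raise test error.
    print(degree, mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
```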
CNN vs RNN -
CNNs - Learns hierarchies of spatial features - best for spatial data like images, while
RNNs - Maintains a hidden memory that is updated step by step - uses previous context - designed for sequential data like text, speech, and time series.
in and out of sample - in-sample error (E_in) refers to the error rate of the model on the fitted data, i.e., the data it has been trained on. Out-of-sample error (E_out) refers to the error rate when the model predicts on data outside its training (fitted) set.
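A minimal illustration (hypothetical fixed hypothesis, not a trained model): E_in is the error fraction on the points we fitted on, and the error on fresh points estimates E_out.

```python
# Sketch: measuring E_in vs E_out for a fixed 1D threshold hypothesis.
import random

random.seed(0)

def true_label(x):
    return x > 0.5          # the (unknown) target rule

def hypothesis(x):
    return x > 0.45         # our slightly-off learned hypothesis

def error_rate(xs):
    """Fraction of points the hypothesis misclassifies."""
    return sum(hypothesis(x) != true_label(x) for x in xs) / len(xs)

train_xs = [random.random() for _ in range(50)]     # data we "fitted" on
test_xs = [random.random() for _ in range(1000)]    # unseen data
e_in = error_rate(train_xs)     # in-sample error
e_out = error_rate(test_xs)     # estimate of out-of-sample error
print(e_in, e_out)              # both small; hypothesis errs only in (0.45, 0.5]
```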
Curse of dimensionality?
Linear models equations?
Sigmoid function translates the output of the model to a probability that can be understood. Usually it is present in the last layer or output layer of the model
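The function itself is one line; a sketch showing how a raw score maps into (0, 1):

```python
import math

def sigmoid(z):
    """Squash a real-valued score into (0, 1) so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # 0.5: a score of 0 means maximal uncertainty
print(sigmoid(4.0))    # close to 1: strongly positive score
print(sigmoid(-4.0))   # close to 0: strongly negative score
```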
Cross entropy is a cost function. So for example in binary classification we would use the binary cross-entropy cost function, and in multi-class classification we would use categorical cross-entropy.
Cost function is calculated at the end of the network, the output layer and not after every neuron activation
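A sketch of binary cross-entropy over a batch, computed from the output-layer probabilities (the clipping `eps` is a common numerical guard, not part of the definition):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy over (label, predicted-probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions give a low loss:
print(binary_cross_entropy([1, 0], [0.9, 0.1]))   # ≈ 0.105
# A maximally uncertain prediction gives ln 2:
print(binary_cross_entropy([1], [0.5]))           # ≈ 0.693
```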
KNN - supervised classification algo
works well with low dimensional data
VC Dimension - pick a specific set of n points; if for every one of the 2^n possible +/− labelings of those same points there exists some classifier in the family (e.g., some straight line with some slope/intercept) that gets them all right, then that set is shattered; the VC dimension is the largest n for which such a set exists. [ largest number of points it can shatter ie. realize every possible 0/1 labeling of those points.] [ shatter = some classifier from the class gets them all right]
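The definition can be brute-forced on tiny examples. The sketch below checks separability by running the perceptron rule: it is guaranteed to converge when a labeling is linearly separable, and hitting the epoch cap is taken as "not separable" (a heuristic, but reliable for these tiny point sets). It confirms that 3 non-collinear points are shattered by 2D lines, while the XOR labeling of 4 points is not realizable, matching VC dimension 3 for the 2D perceptron.

```python
# Sketch: brute-force shattering check for lines in 2D.
def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron-based separability test; labels are +1/-1."""
    w1 = w2 = b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else -1
            if pred != y:                 # perceptron update on a mistake
                w1 += y * x1
                w2 += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True                   # converged: a separating line exists
    return False                          # cap reached: assume not separable

def shattered(points):
    """True if every one of the 2^n labelings is linearly separable."""
    n = len(points)
    return all(
        linearly_separable(points, [1 if (m >> i) & 1 else -1 for i in range(n)])
        for m in range(2 ** n)
    )

print(shattered([(0, 0), (1, 0), (0, 1)]))           # True: 3 points shattered
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))   # False: XOR labeling fails
```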
The perceptron algorithm is a classifier algorithm and works well with linearly separable data
Cross-validation is a technique which helps us gauge the generalization of a model. K-fold cross-validation is one of the most common variants; leave-one-out is another. Then we also have the simple train/validation/test split, which is one of the simplest validation schemes.
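A sketch of K-fold cross-validation with a toy "predict the mean" model standing in for a real learner (`train_fn` / `eval_fn` are hypothetical plug-in hooks):

```python
# Sketch: K-fold cross-validation, averaging the validation score over folds.
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_fn, eval_fn):
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, holdout in enumerate(folds):
        # Fit on the other k-1 folds, score on the held-out fold.
        train = [data[j] for jj, f in enumerate(folds) if jj != i for j in f]
        model = train_fn(train)
        scores.append(eval_fn(model, [data[j] for j in holdout]))
    return sum(scores) / k          # average validation score

def train_mean(xs):                 # toy "model": just the training mean
    return sum(xs) / len(xs)

def mean_abs_err(model, xs):
    return sum(abs(x - model) for x in xs) / len(xs)

print(cross_validate(list(range(10)), 5, train_mean, mean_abs_err))   # 3.1
```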
Regularization is a technique by which we penalize large coefficients in a model, decreasing the possibility of the model's coefficients becoming high (and hence of overfitting).
L1 (lasso) does induce sparsity by driving many coefficients exactly to zero, enabling embedded feature selection in high-dimensional settings, but L2 (ridge) generally does not set coefficients to zero; it shrinks them smoothly toward zero so all features typically retain small, nonzero weights. L2 also helps deal with multicollinearity.
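A sketch of why this happens, using the per-coefficient shrinkage each penalty induces (the L1 case is the soft-thresholding / proximal operator; `lam` is a hypothetical penalty strength):

```python
# Sketch: L1 soft-thresholding zeroes small weights; L2 only scales them.
def l1_shrink(w, lam):
    """Soft-thresholding: the proximal operator of the L1 penalty."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0                      # small coefficients driven exactly to zero

def l2_shrink(w, lam):
    """Ridge-style shrinkage: scales toward (but never to) zero."""
    return w / (1.0 + lam)

weights = [3.0, 0.05, -0.02, -2.0]
print([l1_shrink(w, 0.1) for w in weights])   # small weights become 0.0
print([l2_shrink(w, 0.1) for w in weights])   # every weight stays nonzero
```

This is the mechanism behind lasso's embedded feature selection: coefficients inside the threshold band are exactly zeroed, while ridge keeps all features with smaller weights.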
The perceptron activation function helps us arrive at either 0 or 1 because we're basically classifying. If the threshold is 0.75 and the value we get after our computation (the weighted sum) is less than 0.75, the input is classified as 0; otherwise as 1. This thresholding is also known as the Heaviside step function. The activation function in a perceptron is a mathematical function that determines the output of the neuron based on the weighted sum of its inputs and a bias term.
Generalization can be explained with the help of 2 theories, namely VC theory and the bias-variance trade-off. A model with an infinite VC dimension cannot generalize.
VC theory (Generalization bound) - it’s an inequality that says test error ≤ training error + a cushion term. So even if training error is tiny, test error is only guaranteed to be small when that cushion term is also small.
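The cushion term can be written out explicitly. One standard form of the VC generalization bound (a sketch in the notation common to VC-theory treatments; $m_{\mathcal{H}}$ is the growth function, $N$ the training-set size, $\delta$ the confidence parameter) says that with probability at least $1 - \delta$:

```latex
E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) \;+\; \sqrt{\frac{8}{N}\,\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}}
```

When the VC dimension is finite, $m_{\mathcal{H}}(2N)$ grows only polynomially in $N$, so the cushion shrinks as $N$ grows; with infinite VC dimension the cushion never becomes small, which is why such a model cannot be guaranteed to generalize.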
The bias-variance tradeoff - particularly suitable for real-valued functions (regression problems)
LSTM - Vanilla RNNs struggle to carry information across many time steps due to vanishing gradients, whereas LSTMs selectively store, update, and expose information, making them effective for tasks like translation, speech recognition, and long time‑series.
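The selective store/update/expose behaviour comes from the standard LSTM gating equations ($\sigma$ = sigmoid, $\odot$ = elementwise product, $[h_{t-1}, x_t]$ = concatenation):

```latex
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate: what to discard)}\\
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate: what to store)}\\
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \quad \text{(candidate cell state)}\\
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad \text{(selective update of memory)}\\
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(c_t) \quad \text{(selective exposure)}
```

The mostly-additive update of $c_t$ is what lets gradients flow across many time steps without vanishing the way they do through repeated multiplications in a vanilla RNN.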
Pooling is a simple way to shrink feature maps while keeping the strongest or most representative signals.
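For example, 2x2 max pooling with stride 2 keeps the largest activation in each non-overlapping window, halving each spatial dimension (a pure-Python sketch):

```python
# Sketch: 2x2 max pooling with stride 2 on a small feature map.
def max_pool_2x2(fmap):
    """Keep the strongest signal in each non-overlapping 2x2 window."""
    pooled = []
    for i in range(0, len(fmap) - 1, 2):
        row = []
        for j in range(0, len(fmap[0]) - 1, 2):
            row.append(max(fmap[i][j], fmap[i][j + 1],
                           fmap[i + 1][j], fmap[i + 1][j + 1]))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
print(max_pool_2x2(fmap))   # [[4, 2], [2, 8]]
```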
The growth function measures how expressive a hypothesis class is by counting how many distinct labelings it can realize on any sample of size m.
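A concrete case that can be brute-forced: for "positive rays" $h_t(x) = \mathbb{1}[x > t]$ in 1D, the growth function on $m$ distinct points is $m + 1$, since sliding the threshold across the sorted points produces exactly $m + 1$ distinct labelings (a sketch):

```python
# Sketch: counting the dichotomies positive rays realize on m points.
def growth_positive_rays(points):
    """Number of distinct labelings h_t(x) = (x > t) can produce on points."""
    xs = sorted(points)
    # Enough candidate thresholds: below all points, between each
    # neighbouring pair, and above all points.
    thresholds = ([xs[0] - 1]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1])
    labelings = {tuple(x > t for x in points) for t in thresholds}
    return len(labelings)

print(growth_positive_rays([0.3, 1.2, 2.5, 4.0]))   # 5, i.e. m + 1
```

Since $m + 1 < 2^m$ already for $m = 2$, positive rays cannot shatter 2 points, so their VC dimension is 1; this is how the growth function and VC dimension connect.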
The learning rate is a single hyperparameter that controls how much the model’s learned parameters (weights) are adjusted on each optimization step
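A sketch on a hypothetical one-parameter objective $f(w) = (w - 3)^2$: a moderate learning rate converges to the minimum, while one that is too large makes the iterates diverge.

```python
# Sketch: how the learning rate scales each gradient-descent update.
def gradient_descent(lr, steps=100, w=0.0):
    """Minimize f(w) = (w - 3)^2 by gradient descent from w = 0."""
    for _ in range(steps):
        grad = 2 * (w - 3)      # derivative of (w - 3)^2
        w -= lr * grad          # the learning rate controls the step size
    return w

print(gradient_descent(lr=0.1))    # converges near the minimum at w = 3
print(gradient_descent(lr=1.1))    # too large: the iterates overshoot and diverge
```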