data mining
- Train, validation and testing split
- Training data - used to train the model
- training error is back propagated, learning algorithms
- Validation data - every now and then, the interim model is tested on the validation data
- Validation error - wrong classification of data
- If you are not satisfied with the interim model then you will change the model's hyperparameters and re-train the model
- Q : When does hyperparameter optimization occur? After seeing the validation error or after running against the testing dataset? (After the validation error - the test set is only used once, at the very end.)
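The three-way split above can be sketched with scikit-learn (a common recipe, not from the lecture): two chained `train_test_split` calls, first carving off the test set, then splitting the rest into train and validation.

```python
# Sketch of a 60/20/20 train/validation/test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy feature matrix
y = np.arange(50) % 2               # toy binary labels

# first split: 80% train+val / 20% test
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# second split: 25% of the remaining 80% becomes validation -> 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```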
- Feature engineering
- Feature encoding - Why don't we just replace brands with different numbers? Because that would add ordinality to the data - the model might assume that the higher the number, the more superior the brand.
- Feature binning - grouping continuous values into discrete bins/intervals (e.g. ages into age ranges)
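The one-hot encoding idea from the bullets above, as a minimal pandas sketch (the column name "brand" is made up): the categorical column is replaced with 0/1 indicator columns, so no category is numerically "greater" than another.

```python
# One-hot encoding with pandas: no ordinality is introduced.
import pandas as pd

df = pd.DataFrame({"brand": ["audi", "bmw", "audi", "vw"]})
encoded = pd.get_dummies(df, columns=["brand"])
print(encoded.columns.tolist())  # ['brand_audi', 'brand_bmw', 'brand_vw']
```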
- 22nd April - Lecture 3
  - Downsampling and upsampling
  - You always fit feature scaling on the training data only, never on the testing data (then apply it to both).
  - The more features we have, the easier it is for the model to classify or distinguish - up to a certain threshold, until the optimal number of features is reached.
  - Beyond that, more features mean more sparsity in our model, i.e. too many input variables with less data actually representing each feature.
  - To solve this we use dimensionality reduction.
  - Learn about EIGENVALUES AND EIGENVECTORS (brush up linear algebra).
  - Exercise: pandas has Series and DataFrames - practice groupby, aggregate, lambda; label-based and position-based indexes.
  - PCA is good for visualization and can be used for downstream analysis and ML, but t-SNE is only suitable for visualization.
  - Classification metrics - F1 score etc.
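The down/upsampling note above can be sketched with `sklearn.utils.resample` (the data here is a made-up imbalanced toy set): the minority class is upsampled with replacement until both classes have equal size.

```python
# Upsampling the minority class with sklearn.utils.resample.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})
majority = df[df.label == 0]
minority = df[df.label == 1]

# draw (with replacement) until the minority matches the majority's size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print((balanced.label == 0).sum(), (balanced.label == 1).sum())  # 8 8
```

Downsampling is the mirror image: `resample(majority, replace=False, n_samples=len(minority))`.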
- Lecture 27/05
  - Mean squared error is the loss function that helps us gauge the error of a linear regression model.
  - The predicted value is bias + weighted sum of the inputs.
  - Find the parameters that minimize the cost function - to do this we take the derivative of the function.
  - In linear regression, since the cost is a convex function, there is only one minimum.
  - Logistic regression is linear regression + the sigmoid function.
  - For multi-class classification in logistic regression - the softmax function, a generalization of the sigmoid function.
  - Here the first step is to convert our labels into one-hot encoded vectors so that we can compare them against the output probabilities (from the softmax function):

    | Value | Label |
    |------:|:-----:|
    |   0.0 |   0   |
    |   0.7 |   1   |
    |   0.3 |   0   |

    The probabilities (the Value column) have to add up to 1.
  - In a neural network, the number of inputs depends on the number of features in the dataset.
  - Each node (neuron) in the hidden layer is a weighted sum + activation function.
  - The number of nodes depends on the problem we are solving - it is problem-specific.
  - Backpropagating the gradients = backpropagation.
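The softmax-vs-one-hot comparison above can be sketched in a few lines (a toy example, not from the lecture): the softmax output sums to 1 and is compared against a one-hot ground-truth vector.

```python
# Softmax output vs one-hot label, as in the table above.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([1.0, 3.0, 0.5])
probs = softmax(logits)
print(round(probs.sum(), 6))  # 1.0 - the probabilities add up to 1

one_hot = np.array([0, 1, 0])  # ground truth: class 1 is correct
```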
- A linear regression model assumes that the relationship between the target variable and the independent variables is linear.
- The loss function in linear regression is the mean squared error.
- The gradient descent algorithm is used to optimize the weights.
- L1 regularization encourages sparsity - if a specific feature is important it keeps its weight, but if not it sets that feature's weight to 0.
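The L1-sparsity claim above can be demonstrated with scikit-learn's `Lasso` (an illustration on made-up data, not from the lecture): one feature drives the target, the other is pure noise, and L1 regularization drives the noise feature's weight toward zero.

```python
# L1 regularization (Lasso) encourages sparse weights.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 matters

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # the noise feature's weight is (near) zero
```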
- 17th June
- Training an LLM requires no manual labels - it learns from raw text by treating parts of the text as input and the next word as the target. = self-supervised training
- What is the probability of the next word given the previous words as inputs
- Training an LLM:
  1. Gather a large corpus of data.
  2. Tokenization - text is converted into chunks called tokens. The split into words or characters depends on the tokenizer. The larger the vocabulary size, the more unique tokens we will have.
  3. Predict the next token (training loop): causal masking - the model learns to generate the next token.
- What does an LLM output at its output layer? - a softmax (probability distribution over the vocabulary)
- What is the size of the output layer in the LLM? - equals to the vocabulary size
- Greedy decoding - choose the next token based on the highest probability
- Auto-regression: the process of iteratively adding the predicted token to the input
- Every time the model outputs a token, it has to compute again with the newly generated token appended to the input
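Greedy decoding plus auto-regression from the bullets above, as a toy loop. The "model" here is a made-up bigram probability table, not a real LLM; it only illustrates the mechanics of picking the argmax token and feeding it back in.

```python
# Toy greedy decoding + auto-regression over a made-up bigram table.
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "</s>"]
# next-token probabilities given the current (last) token; rows sum to 1
table = {
    "<s>":  [0.0, 0.9, 0.1, 0.0, 0.0],
    "the":  [0.0, 0.0, 0.8, 0.2, 0.0],
    "cat":  [0.0, 0.0, 0.0, 0.9, 0.1],
    "sat":  [0.0, 0.0, 0.0, 0.0, 1.0],
    "</s>": [0.0, 0.0, 0.0, 0.0, 1.0],
}

tokens = ["<s>"]
while tokens[-1] != "</s>":
    probs = table[tokens[-1]]
    next_id = int(np.argmax(probs))   # greedy: pick the highest probability
    tokens.append(vocab[next_id])     # auto-regression: append and continue
print(tokens)  # ['<s>', 'the', 'cat', 'sat', '</s>']
```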
- Temperature-based sampling
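A minimal sketch of temperature-based sampling: logits are divided by a temperature T before the softmax, so T < 1 sharpens the distribution (closer to greedy) and T > 1 flattens it (more random). The logits here are made up.

```python
# Softmax with temperature, then sampling from the resulting distribution.
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits) / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, T=0.5)
hot = softmax_with_temperature(logits, T=2.0)
print(cold.max() > hot.max())  # True: low temperature concentrates the mass

# sampling step: draw the next token index from the flattened distribution
rng = np.random.default_rng(0)
next_token = rng.choice(len(logits), p=hot)
```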
- RAG (Retrieval-Augmented Generation) uses sentence embeddings, which are essentially an extension of word embeddings, aiming to encode a sequence of words instead of singular words
- In embedding space, we need distance metrics as we work with vectors
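One such distance metric is cosine similarity between embedding vectors; a minimal sketch with made-up vectors (real sentence embeddings would have hundreds of dimensions):

```python
# Cosine similarity: a standard metric for comparing embedding vectors.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [1, 0]))  # 1.0 (identical direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
```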
- Exercise 9
  - Convert the 32x32 pixels into 1 dimension - we are flattening the input.
  - A logistic regression model computes a weighted sum - every feature has a weight attached to it, and the weighted sum is passed through a sigmoid.
  - Since we have 10 categories (multi-class classification), we have to use the softmax function. If we were dealing with only 2 categories, we could have directly used logistic regression (binary classification).
  - Name of the cost/loss function for classification in a NN with logistic regression = cross entropy (equation is important for the exam).
  - Cost function for linear regression = MSE.
  - Apply one-hot encoding to the labels for the ground truth, so that they can be used in the cross entropy. Now "ship" will be [0 0 0 0 0 0 0 0 0 1]. In the cross-entropy sum, only the term where the 1 is contributes.
  - Input will be 32x32x3 (in the NN).
  - Which optimizer do we use? We use it to update the weights from the gradients - the gradient descent algorithm.
  - Stochastic gradient descent - compute the gradient for one single data point, or for a batch of the dataset (mini-batch stochastic gradient descent).
  - Learning rate, batch size and number of epochs - parameters of the optimizer, i.e. (stochastic) gradient descent.
  - What happens when you change the learning rate? With a too-large learning rate the graph makes big jumps - it will not converge.
  - If we have 2 balanced classes, a random classifier's precision will be 50-50.
  - Formulas are important!!! - gradient descent, cross entropy, softmax, sigmoid. F1 score, specificity (evaluation metrics) are not required.
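The cross-entropy computation mentioned above, as a minimal sketch with a made-up softmax output: with one-hot labels, only the term at the true class survives, so the loss reduces to minus the log of the probability assigned to the correct class.

```python
# Cross entropy between a one-hot ground truth and a softmax output.
import numpy as np

def cross_entropy(one_hot, probs, eps=1e-12):
    probs = np.clip(probs, eps, 1.0)         # avoid log(0)
    return float(-np.sum(one_hot * np.log(probs)))

y_true = np.array([0, 0, 1])        # class 2 is the correct label
y_pred = np.array([0.1, 0.2, 0.7])  # softmax output
print(round(cross_entropy(y_true, y_pred), 4))  # 0.3567  (= -ln 0.7)
```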
- 24th June
  - Explainability is about making a model's decisions understandable to humans.
  - There is a trade-off between explainability and accuracy.
  - Neural networks are more accurate but their explainability is low - it is unclear which parameters are being used to predict the outcome.
  - Explainability is critical in high-risk fields.
  - Feature importance analysis - in some scenarios we are not interested in the predictions themselves - we are interested in the key features that drive the predictions.
  - Example: disease prediction using an explainable AI approach - once the model is trained, we leverage it to get the risk factors and use them to identify patients with the outcome.
- Local explainability - explain a single prediction
- Global explainability - understand the model's overall behaviour
  - Glass-box models vs black-box models
  - Post-hoc explainability (applies to both, but is more relevant to black-box models)
- Permutation importance algorithm (feature permutations)
- Train a model
- Evaluate baseline performance - use a scoring function (accuracy, R-squared)
- Take each feature and randomly shuffle it, breaking its relationship with the target. If the performance on this randomized dataset decreases (compared to the baseline score), that feature is important; if the performance barely changes, it is not.
- Feature importance via permutation - the formula is important for the final exam!
- Limitations - it doesn't work well with correlated features. The workaround is to run a collinearity test or PCA first and then use a single principal component.
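The permutation recipe above has a ready-made helper in scikit-learn; a hedged sketch (the dataset and model are arbitrary examples, not from the lecture):

```python
# Permutation importance with scikit-learn's built-in helper.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# shuffle each feature n_repeats times and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.shape)  # (4,) - one importance per feature
```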
- Feature importance via SHAPLEY Values (Equation is not required for final exam)
- List all possible subsets of features excluding the feature you want to check - i.e. check the importance, or how much this feature contributes to the model.
- Evaluate the contribution of this feature across all the different subsets - and then average.
Explainable AI
- We want to understand how the AI is making a certain decision, not necessarily the result of the output.
Examination
- Press the Windows key, open the Anaconda PowerShell prompt - you should be in the home folder >> conda activate exam_DM
- Copy the notebooks from the questions folder, run the conda env there, and save the notebooks in the answer folder. Only the notebooks will be checked.
- Run the jupyter notebook in the correct folder.
- No cheatsheet
- GenAI exercise not required for preparation, but the theory/topic is important (the workflow is important)
- Explainable AI, pandas is important
- No tasks from exercise 1 and exercise 2 (diff. between list and dictionary, print the head of a df) - these will not be there. You will not have pure pandas, python or numpy tasks - it's about data mining using these libraries.
- Coding / writing functions ("you have a dataset, now build a model for it") will not be asked - you will have small slides
- If you don't understand random forest, how will you set a specific parameter in the function?
- No derivations, no questions like explain large language models
- Answers are short and brief
- Most of the tasks will be independent from each other
- Process the dataset, evaluate the model
- Equations : MSE, Entropy, (binary) cross entropy, Softmax, Sigmoid, Scaling (min, max, mean, median), no dimensionality reduction, softmax with temperature, gradient descent, stochastic gradient descent (SGD), minibatch SGD, R2 score for linear regression
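As a study aid, the listed equations in standard notation (my reconstruction - double-check against the slides):

```latex
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
\qquad
\text{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i\log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\big]

\sigma(z) = \frac{1}{1+e^{-z}}
\qquad
\text{softmax}(z)_j = \frac{e^{z_j/T}}{\sum_k e^{z_k/T}} \quad (T=1 \text{ gives plain softmax})

w \leftarrow w - \eta\,\nabla_w J(w) \quad \text{(gradient descent; SGD/minibatch use one sample/batch per step)}
\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```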
Topics to study: scaling, PCA, t-SNE, implementation of k-means clustering, neural networks, NN with different learning rates, SHAP, ROC curve, accuracy, cross-validation, (stochastic) gradient descent
Study notes
K-means clustering - an unsupervised clustering algorithm. k = the number of clusters we want to have.
- Randomly choose k centroids.
- Calculate the Euclidean distance of every data point from each centroid: if we have 2 centroids, we calculate the Euclidean distance of all data points from both centroid 1 and centroid 2.
- Each data point joins the cluster of the centroid it is closest to. This is the first iteration of the algorithm.
- Then we take all the data points of each cluster and calculate their mean - using this new mean value we update the centroid, then calculate the Euclidean distances of all points again. Based on proximity to the centroids, each point is assigned to a cluster.
- This process is repeated until the model converges, i.e. no data points switch between clusters after an iteration.
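Since "implementation of k-means clustering" is on the topic list, here is a minimal pure-numpy sketch of the steps above (a simplified version on toy 2-D data, not the lecture code):

```python
# Minimal k-means: assign points to nearest centroid, update centroids
# to cluster means, repeat until the centroids stop moving.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # Euclidean distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # join the closest centroid
        # update each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: no movement
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels)  # points 0-2 share one label, points 3-5 the other
```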
Classification:
- Decision trees - a tree makes a statement and then makes a decision based on whether the statement is true or false.
- When a decision tree classifies things into categories, it's called a classification tree.
- When a decision tree predicts numerical values, it's called a regression tree.
- Entropy: the degree of uncertainty or impurity in the data. Lower entropy implies greater predictability.
- Information gain - the amount by which entropy is reduced due to a split.
- To reduce entropy, split on a feature (also called partitioning the data based on feature values) - this increases predictability.
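Entropy and information gain from the bullets above, as a small worked example (toy labels, a perfect binary split):

```python
# Entropy of a label set, and the information gain of a split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(-p * np.log2(p)))

parent = [0, 0, 1, 1]          # maximally mixed -> entropy 1.0
left, right = [0, 0], [1, 1]   # a perfect split -> entropy 0 on each side

# gain = parent entropy minus the size-weighted child entropies
gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
print(entropy(parent), gain)  # 1.0 1.0
```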
Generalization - the model works well on testing data and real world data
One-hot encoding + scaling
for both you first fit (on the training data) and then transform - you only scale the feature set, not the target variable - the split for train and test is always in (X, y) format
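The fit-then-transform pattern above, sketched with a scaler (the same pattern applies to encoders): statistics come from the training data only, and the test data reuses them.

```python
# Fit on train, transform both: the scaler never "sees" the test data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0]])

scaler = StandardScaler()
scaler.fit(X_train)                  # mean/std computed from training data
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)  # test data reuses the training statistics
print(X_test_s)  # [[0.]] - 2.0 is exactly the training mean
```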
PCA (Principal Component Analysis) [dimensionality reduction]
Ref : https://www.youtube.com/watch?v=ZgyY3JuGQY8
- Principal components capture the most important information in the dataset, compared to the other features.
- Helps us limit the curse of dimensionality.
- Helps us minimize overfitting (models that generalize poorly to new data that was not part of their training set).
- The larger the variability captured by the first PC, the more information is retained from the dataset; no other PC can have higher variability than PC1.
- There is no correlation between PC1 and PC2.
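A hedged sklearn sketch of the PC1-carries-the-most-variance claim (random toy data with one deliberately redundant feature):

```python
# PCA with scikit-learn: explained_variance_ratio_ is sorted descending,
# so PC1 always captures the largest share of variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2.0 * X[:, 0]        # a redundant, perfectly correlated feature

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_
print(ratios[0] >= ratios[1])  # True
```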
t-SNE
Ref : https://www.youtube.com/watch?v=NEaUSP4YerM
- Takes a high-dimensional dataset and reduces it to a low-dimensional graph.
- We measure the distance of the point of interest to all the other points.
- Scale the similarities so they add up to 1.
- t-SNE has a perplexity parameter, roughly equal to the expected number of neighbours (density) around each point.
- The width of the distribution (standard deviation) is based on the density of the surrounding points.
- We use the t-distribution, and that's why it's called t-SNE - its tails are elongated; otherwise all the clusters would be clumped up in the middle.
- Similarity scores in the high-dimensional space are matched against similarity scores on the low-dimensional map.
- It is sensitive to noise and computationally expensive.
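A minimal usage sketch with scikit-learn's `TSNE` (for visualization only, per the notes above; dataset is an arbitrary example):

```python
# t-SNE reduces high-dimensional data to 2-D for plotting.
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)
# perplexity roughly sets the expected neighborhood size per point
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (150, 2)
```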
Neural Networks
- Sort of a generalization of linear regression; neural networks are meant for non-linear problems.
- Activation functions are the key that makes neural networks work with non-linear data.
- Bias is the baseline of the output - a "baseline adjuster".
- Cost/loss functions help us gauge the error of the predicted output vs the actual output, thereby enabling us to tweak the weights and biases of each neuron during backward propagation.
- Basic neurons contain weights + bias + activation function.
- A neuron is essentially a mathematical function: it takes multiple features (inputs), applies learned weights to each input, adds a learned bias, passes the result through an activation function, and produces one output.
- The number of outputs produced by a layer = the number of neurons in that layer. If you have, for example, 10 neurons in a layer, then after forward propagation you will have a vector of 10 outputs - one from each neuron.
- Forward propagation = weighted sums (incl. biases) + activation functions.
- Backward propagation = loss/cost function + activation function derivatives (gradients).
- Weights are the actual parameters that determine how much influence each input has. Each feature has its own weight.
- Gradients are the directions and amounts by which we should adjust those weights.
- The learning rate (often denoted α or η) directly multiplies the gradient to determine how much to change each weight, i.e. new_weight = old_weight - learning_rate × gradient.
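The "neuron as a mathematical function" description above, as a tiny sketch (the weight and bias values are made up):

```python
# One neuron: weighted sum of inputs + bias, passed through a sigmoid.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    return sigmoid(np.dot(w, x) + b)  # one output per neuron

x = np.array([0.5, -1.0, 2.0])   # three input features
w = np.array([0.4, 0.3, -0.2])   # one learned weight per feature
b = 0.1                          # learned bias ("baseline adjuster")
print(round(neuron(x, w, b), 4))  # 0.4013 = sigmoid(-0.4)
```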
Regularization (technique) is the process of adding constraints or penalties during training to prevent overfitting. It helps increase generalization (goal). Some regularization techniques are:
- L1/L2: adds a penalty for large weights to the loss function; encourages simpler models that generalize better
- Dropout: randomly deactivates neurons during training; forces the network to not rely on specific neurons
- Early stopping: monitors validation performance during training; stops training when validation performance starts declining
- Batch/layer normalization: normalizes activations (across the batch or across the layer); stabilizes training and has a mild regularizing effect
- Optimizers are the algorithms that determine how to update the weights based on the gradients; optimizers exist specifically to minimize the cost function. Without an optimizer we only have basic gradient descent, which has limitations.
- The optimizer helps us smooth out the adjustment of weights in backpropagation - otherwise there is a risk of drastic changes in the weights (oscillating to and fro without converging).
Gradient Descent Algos
- It's gradient descent and not gradient ascent because it focuses on decreasing the cost/loss function, not increasing it.
- Each parameter has its own gradient. If there are 40 features in the dataset, each neuron receives 40 inputs during forward propagation. During optimization (backward propagation with gradient descent), each neuron has one gradient per weight, as there is one weight per feature. The updated weight (e.g. w1) is obtained by averaging that weight's gradients across all data samples (rows) in the batch.
- Features are columns and data samples are rows; if there are 40 features, each data sample has 40 feature values.
- In stochastic gradient descent, the weights are updated using the gradient of one randomly chosen data sample - however, it is very noisy.
- In minibatch gradient descent, the dataset is divided into multiple mini-batches, and the weights are updated using the averaged gradients of each batch. One epoch contains multiple mini-batches, which together cover the entire dataset. Noise and speed are balanced.
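The minibatch description above, as a pure-numpy sketch on a toy 1-feature linear model (all numbers are made up): each epoch shuffles the data, and each minibatch update averages the gradients within the batch.

```python
# Minibatch SGD on y = w*x: average the MSE gradient over each batch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 3.0 * X                      # true weight is 3, no noise

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]      # one mini-batch of indices
        y_hat = w * X[b]
        # gradient of MSE w.r.t. w, averaged over the minibatch
        grad = np.mean(2.0 * (y_hat - y[b]) * X[b])
        w -= lr * grad
print(round(w, 3))  # 3.0 - recovers the true weight
```

Setting `batch_size = 1` turns this into plain stochastic gradient descent; `batch_size = len(X)` gives full-batch gradient descent.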
- Forward pass: Input goes through the network → compute predictions.
- Compute loss: Compare predictions to true values using the loss function.
- Backward pass (backpropagation): Compute gradients of the loss w.r.t. each parameter.
- Gradient descent step: Use those gradients to update weights and reduce loss.
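The four steps above for a single linear unit and one data point (a toy example with made-up numbers, gradients computed by hand):

```python
# One training step: forward pass, loss, backward pass, weight update.
w, b, lr = 0.5, 0.0, 0.1
x, y_true = 2.0, 3.0

y_hat = w * x + b                   # 1. forward pass: compute prediction
loss = (y_hat - y_true) ** 2        # 2. compute loss (squared error)
grad_w = 2 * (y_hat - y_true) * x   # 3. backward pass: dL/dw
grad_b = 2 * (y_hat - y_true)       #    and dL/db
w -= lr * grad_w                    # 4. gradient descent step
b -= lr * grad_b
print(w, b)  # 1.3 0.4 - the weights moved toward the target
```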
- Optimization techniques
  - Learning rate decay: a technique that gradually reduces the learning rate during training.
  - Weight initialization: the process of setting the starting values for all weights and biases in a neural network before training begins. If you initialize all weights to zero, every neuron in a layer will compute the same output and receive the same gradient updates - they will all learn the same thing, making the network effectively have only one neuron per layer!
  - Gradient clipping: a technique that prevents exploding gradients by capping the magnitude of the gradients during backpropagation - apply a maximum threshold value.
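Gradient clipping by global norm, as a minimal sketch (a common variant; the threshold value is arbitrary): if the gradient vector's norm exceeds the threshold, rescale it down to that threshold.

```python
# Clip a gradient vector to a maximum norm to prevent exploding gradients.
import numpy as np

def clip_by_norm(grad, max_norm):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)  # rescale, keeping the direction
    return grad

g = np.array([3.0, 4.0])               # norm 5 - too large
clipped = clip_by_norm(g, max_norm=1.0)
print(round(np.linalg.norm(clipped), 6))  # 1.0
```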