Homework 2 - balancing, transformations, and dimensionality reduction (deadline 21. 12. 2025, 23:59)
In short, the main task is to play with balancing, binning, transformations, and dimensionality reduction to obtain the best results for the binary classification task.
The instructions are not given in detail: It is up to you to come up with ideas on how to fulfill the particular tasks as best you can!
However, we strongly recommend and require the following:
- Follow the assignment step by step. Number each step.
- Properly comment on all your steps. Use Markdown cells and visualizations. Comments are evaluated for 2 points of the total, together with the final presentation of the solution. However, it is not desirable to write novels!
- This task is both time-consuming and computationally intensive. Do not leave it until the last minute.
- Most steps specify the number of features to be treated. You may preprocess more features, but doing so does not by itself earn more points. Focus on quality, not quantity.
- Hand in a notebook that has already been run (i.e., do not delete outputs before handing in).
- Download the dataset here. Split the dataset into a train, validation, and test set and use these parts correctly (!) in the following steps.
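A minimal sketch of the three-way split, using a synthetic placeholder in place of the downloaded dataset (the array shapes and column contents are assumptions; the real target column is called 'y'). A stratified split keeps the class ratio the same in all three parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))       # placeholder features
y = rng.integers(0, 2, size=1000)    # placeholder binary target

# 60 % train, 20 % validation, 20 % test, stratified on the target
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Fit everything (scalers, resamplers, selectors, the classifier) on the train part only; use validation to compare preprocessing variants and touch the test set only for the final evaluation.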
- Choose at least one classification algorithm whose performance is to be improved in the following steps.
- Use at least two binning methods (on features of your choice, with your choice of parameters) and comment on their effects on classification performance. I.e., one kind of classifier trained for each binning and a comparison of the effect of binning methods against each other. (4 points, depends on creativity)
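One possible pair of binning methods is equal-width vs equal-frequency binning via scikit-learn's `KBinsDiscretizer`. The sketch below uses a synthetic skewed feature (an assumption; in the homework you would bin selected columns of the real dataset and then retrain your classifier on each variant):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = rng.exponential(size=(500, 1))  # one skewed continuous feature

# Equal-width bins vs equal-frequency (quantile) bins
uniform = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
quantile = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

xu = uniform.fit_transform(x)
xq = quantile.fit_transform(x)

# Quantile bins are balanced by construction; uniform bins on skewed
# data concentrate most samples in the lowest bins
counts_q = np.bincount(xq.ravel().astype(int))
print(np.bincount(xu.ravel().astype(int)), counts_q)
```

On skewed features the two strategies produce very different bin populations, which is exactly the kind of effect worth discussing when comparing classifier performance.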
- Use at least two data balancing techniques on the correct part of the dataset and comment on their effects on classification performance. Focus on the comparison of methods between each other. I.e., one type of classifier trained for each balancing and a comparison of the effect of balancing methods against each other. Just copied code from tutorial four will not be accepted. (6 points, depends on creativity)
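A hedged sketch of two simple balancing techniques, random oversampling of the minority class and random undersampling of the majority class, implemented directly with NumPy on synthetic data (libraries such as imbalanced-learn additionally offer SMOTE and related methods). Note that resampling belongs on the training part only, never on validation or test:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)   # ~10 % positive class

maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)

# Random oversampling: duplicate minority rows until the classes match
over = rng.choice(mino, size=len(maj), replace=True)
X_over = np.vstack([X[maj], X[over]])
y_over = np.concatenate([y[maj], y[over]])

# Random undersampling: drop majority rows until the classes match
under = rng.choice(maj, size=len(mino), replace=False)
X_under = np.vstack([X[under], X[mino]])
y_under = np.concatenate([y[under], y[mino]])

print(y.mean(), y_over.mean(), y_under.mean())
```

Oversampling keeps all majority information at the cost of duplicated minority rows; undersampling discards majority rows and shrinks the training set. Comparing these trade-offs on the same classifier is the point of this step.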
- Transform the features appropriately and prepare new ones (i.e., feature engineering) - focus on the increase in the model's performance (possibly in combination with further steps). (5 points, depends on creativity)
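A small illustration of typical transformations and engineered features on a synthetic frame (the column names `reg_01`, `car_01`, `car_02` are invented for the sketch and are not the real dataset's columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "reg_01": rng.exponential(size=200),   # skewed continuous feature
    "car_01": rng.normal(size=200),
    "car_02": rng.normal(size=200),
})

df["reg_01_log"] = np.log1p(df["reg_01"])             # tame the skew
df["car_01_x_02"] = df["car_01"] * df["car_02"]       # interaction feature
df["car_sum"] = df[["car_01", "car_02"]].sum(axis=1)  # group aggregate

print(df.shape)  # (200, 6)
```

Whether such features help is an empirical question: keep only those that improve validation performance, possibly in combination with the later selection and PCA steps.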
- Try to find some suitable subset of features - use at least two feature selection methods. Evaluate your choice on the validation set and discuss the influence. Do not use PCA (principal component analysis) in this step. Manual selection will not be accepted. (4 points, depends on creativity)
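Two non-PCA selection methods that satisfy this step are, for example, a univariate filter (mutual information) and a model-based wrapper (recursive feature elimination). A sketch on synthetic data (the choice of `k=3` and the logistic-regression estimator are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Filter method: score each feature independently, keep the top k
filt = SelectKBest(mutual_info_classif, k=3).fit(X, y)

# Wrapper method: repeatedly drop the weakest feature of a fitted model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)

print(filt.get_support().sum(), rfe.support_.sum())  # 3 3
```

The two methods can disagree on which features to keep; evaluating each selected subset on the validation set is what the discussion should be based on.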
- Use PCA to reduce the dimensionality. Discuss the influence of the number of principal components. (4 points)
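The influence of the number of principal components is usually summarized by the cumulative explained variance ratio. A sketch on synthetic standardized data (standardization matters, since PCA is scale-sensitive; the 95 % threshold is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=300, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)   # PCA is scale-sensitive

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components retaining at least 95 % of the variance
k = int(np.searchsorted(cum, 0.95)) + 1
print(k, cum)
```

Plotting `cum` against the component index (a scree-style curve) is a natural way to support the discussion of how many components to keep.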
- Try to find the best combination of the previous steps and run final classification tests on the correct part of the dataset - first for the original data, second for the best-found combination of the previous preprocessing steps. Compare the results and discuss (give a comment, use graphs, and so on). (5 points)
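The final comparison can be sketched as training the same classifier twice: once on the raw features and once inside a preprocessing pipeline. Everything below (the classifier, scaling plus PCA as the preprocessing, and the synthetic data) is illustrative, not a prescription for the best-found combination:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=15, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: classifier on the original data
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Candidate: the same classifier behind a preprocessing pipeline
pipe = make_pipeline(StandardScaler(), PCA(n_components=8),
                     LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

auc_base = roc_auc_score(y_te, base.predict_proba(X_te)[:, 1])
auc_pipe = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
print(round(auc_base, 3), round(auc_pipe, 3))
```

Wrapping the preprocessing in a `Pipeline` guarantees that every step is fitted on the training part only, which is exactly the "use these parts correctly" requirement above.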
All your steps, choices, and the following code must be commented on! For text comments (discussion, etc., not code comments), use Markdown cells. Comments are evaluated for 2 points together with the final presentation of the solution.
If you do all this properly, you will obtain 30 points.
- Select the appropriate metric to evaluate the classification results.
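A short illustration of why the metric choice matters on an imbalanced target: a trivial classifier that always predicts "no claim" scores very well on accuracy while being useless for the minority class (the 5 % positive rate below is synthetic):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)   # ~5 % positives
y_pred = np.zeros(100, dtype=int)       # always predict the majority class

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks great
print(f1_score(y_true, y_pred))         # 0.0  -- useless for the minority
```

Metrics such as F1, ROC AUC, or precision-recall AUC are therefore more informative choices for this task than plain accuracy.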
- In steps 2 and 3, you are comparing methods against each other, not their effect on unadjusted data. However, you can comment on that, too.
- The subset of features is chosen only in step 2, for the binning methods; in the other steps, you work with the whole dataset.
- Please follow the technical instructions from https://courses.fit.cvut.cz/NI-PDD/homeworks/index.html.
- Methods that are more complex and were not shown during the tutorials are considered more creative and should be described in detail.
- English is not compulsory.
- The dataset can be downloaded here.
- The data are devoted to the binary classification task. The aim is to predict the probability that a driver will initiate an auto insurance claim next year.
- The target feature is called 'y' and signifies whether a claim was filed for that policyholder.
- To fulfill the task, one does not need to know the meaning of predictors.
- Predictors that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and the postfix cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation.
- When using train_test_split, control the shuffling of the data with the random_state parameter. Do not use shuffle=False; an unshuffled split can introduce systematic error.