Note technique
1. Convert numerical to categorical
https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
Basic
- A categorical variable has too many levels. This pulls down the performance of the model.
- A categorical variable has levels that rarely occur. Many of these levels have minimal chance of making a real impact on model fit.
- There is one level that almost always occurs, i.e. in most of the observations the variable takes a single level. Such variables fail to improve model performance due to very low variation.
- If the categorical variable is masked (anonymized), deciphering its meaning is a difficult task.
- We can't fit categorical variables into a regression equation in their raw form.
We should iterate the modeling process with different techniques and then evaluate model performance. Below are the methods:
Convert to number
- Label encoder
For example, we have two features: "age" (range 0-80) and "city" (81 different levels). When we apply a label encoder to the "city" variable, it represents "city" with numeric values ranging from 0 to 80. The "city" variable now looks similar to the "age" variable, since both have similar data points, which is certainly not the right approach.
- Convert numeric bins to number
- Mean or mode
- Two new features: one lower bound, one upper bound.
- Combine levels
- Using business logic.
- Using frequency and response rate (positive response/total).
- Dummy encoding
keyword: feature hashing
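A minimal sketch of three of the basic methods above (label encoding, frequency encoding, and combining rare levels), using pandas on a hypothetical toy "city" column:

```python
import pandas as pd

# Hypothetical toy data: "city" is a nominal variable with several levels.
df = pd.DataFrame({"city": ["hanoi", "hue", "hanoi", "saigon", "hue", "hanoi"]})

# Label encoding: maps each level to an integer (imposes a fake order).
df["city_label"] = df["city"].astype("category").cat.codes

# Frequency encoding: replace each level by how often it occurs.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Combine levels: bucket rare levels (here, frequency < 0.2) into "other".
rare = freq[freq < 0.2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")
```

The 0.2 threshold is an arbitrary choice for illustration; in practice it would come from business logic or the response rate per level.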
Advanced
https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
Categorical data
Ordinal data: The categories have an inherent order.
Nominal data: The categories do not have an inherent order.
1. Label encoding or ordinal encoding
2. One hot encoding
3. Dummy encoding
** Dummy encoding is similar to one-hot encoding. This categorical data encoding method transforms the categorical variable into a set of binary variables (also known as dummy variables). In the case of one-hot encoding, for N categories in a variable, it uses N binary variables. Dummy encoding is a small improvement over one-hot encoding: it uses N-1 features to represent N labels/categories.
4. Effect encoding: https://www.researchgate.net/publication/256349393_Categorical_Variables_in_Regression_Analysis_A_Comparison_of_Dummy_and_Effect_Coding
5. Hash encoding
6. Binary encoding
7. Base N encoding
8. Target encoding
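The one-hot vs. dummy encoding distinction (N vs. N-1 columns) can be seen directly with pandas `get_dummies` on a made-up "color" column:

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: N binary columns for N categories.
one_hot = pd.get_dummies(colors["color"], prefix="color")

# Dummy encoding: drop_first=True keeps N-1 columns for N categories;
# the dropped level is represented by all zeros.
dummy = pd.get_dummies(colors["color"], prefix="color", drop_first=True)
```

With 3 categories, `one_hot` has 3 columns and `dummy` has 2; the dropped level ("blue", first in alphabetical order) is the all-zeros row.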
2. L2 regularization of linear regression
The pseudo-inverse solution may not exist because the matrix X · Xᵀ is not invertible when the rows of X are not linearly independent. After adding L2 regularization and taking the derivative, we get X · Xᵀ + λI, which is always invertible (positive definite).
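A minimal NumPy sketch of this point, using the rows-as-samples convention (so the Gram matrix is XᵀX): a duplicated column makes XᵀX singular, while XᵀX + λI can always be solved.

```python
import numpy as np

# Toy design matrix with linearly dependent columns
# (columns 2 and 3 are both multiples of column 1),
# so X.T @ X is singular and has no ordinary inverse.
X = np.array([[1.0, 2.0, 2.0],
              [2.0, 4.0, 4.0],
              [3.0, 6.0, 6.0],
              [4.0, 8.0, 8.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

gram = X.T @ X                      # rank-deficient: rank 1, not 3

# Adding lambda * I (the L2 term after differentiation) makes the
# matrix positive definite, hence invertible.
lam = 0.1
w = np.linalg.solve(gram + lam * np.eye(gram.shape[0]), X.T @ y)
```

Calling `np.linalg.solve(gram, X.T @ y)` without the `lam * np.eye(...)` term would raise `LinAlgError: Singular matrix` here.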
3. C trong L2 regularization logistic regression
C is the inverse of the regularization strength; larger C allows a more "complex" model.
C small -> regularization is strong -> underfit.
C large -> regularization is weak -> overfit.
C is a hyperparameter to tune (e.g. via cross-validation).
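A quick sketch with scikit-learn (assumed available; the synthetic dataset is made up): in `LogisticRegression`, smaller C means stronger L2 regularization, which shrinks the weights toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)   # strong regularization
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)    # weak regularization

# Strong regularization produces a smaller weight vector.
print(np.linalg.norm(strong.coef_), np.linalg.norm(weak.coef_))
```

In practice C would be chosen with something like `GridSearchCV` over a log-spaced range rather than compared by hand.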
4. Initializing the bias and weights to 0 in Logistic Regression
Logistic regression does not have a hidden layer, so zero initialization is not a problem. If you initialize the weights to zero, the model outputs sigmoid(0) = 0.5 for every example at first, but the derivatives depend on the input x (because there is no hidden layer), and x is not zero. So after the first update, the weights follow x's distribution and become different from each other, as long as x is not a constant vector.
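A minimal sketch of this with NumPy on made-up data: starting from w = 0, a single gradient-descent step already produces nonzero, distinct weights because the gradient is a function of X.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed): 4 samples, 2 features, x is not a constant vector.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [0.0, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(2)   # zero initialization
b = 0.0

p = sigmoid(X @ w + b)            # all 0.5 on the first forward pass
grad_w = X.T @ (p - y) / len(y)   # depends on X, so it is not zero
w = w - 0.1 * grad_w              # weights leave zero and differ
```

This is exactly why the same trick fails for a hidden layer: there, zero initialization makes all hidden units compute identical gradients, a symmetry logistic regression does not have.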
5. Normalize inputs. Why?
- Normalize the training set and test set in the same way, using the mean (mu) and standard deviation (sigma) computed from the training set.
- Why normalize inputs? -> The cost function: its contours become more symmetric (less elongated), so gradient descent converges faster and tolerates a larger learning rate.
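A minimal sketch of the first point with NumPy (synthetic data for illustration): compute mu and sigma on the training set only, then apply the same transform to both sets.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 2))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 2))

# Statistics come from the TRAINING set only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ...and the SAME mu and sigma are applied to both sets,
# so the test set sees the exact transform the model was trained on.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma
```

Computing separate statistics on the test set would silently leak test-set information and shift the test distribution relative to training.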