Note technique
1. Convert numerical to categorical
https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
Basic
- A categorical variable has too many levels. This pulls down the performance of the model.
- A categorical variable has levels that rarely occur. Many of these levels have minimal chance of making a real impact on model fit.
- There is one level that almost always occurs, i.e. in most of the observations the variable takes a single level. Such variables fail to improve model performance due to very low variation.
- If the categorical variable is masked (anonymized), deciphering its meaning is a difficult task.
- We can't fit categorical variables into a regression equation in their raw form.
We should iterate the modeling process with different techniques and then evaluate model performance. Below are the methods:
Convert to number
- Label encoder
For example, we have two features: "age" (range 0-80) and "city" (81 different levels). When we apply a label encoder to the "city" variable, it represents "city" with numeric values ranging from 0 to 80. The "city" variable now looks similar to the "age" variable, since both have similar data points, which is certainly not the right approach.
- Convert numeric bins to number
- Mean or mode
- Two new features: one lower bound, one upper bound.
- Combine levels
- Using business logic.
- Using frequency and response rate (positive response/total).
- Dummy encoding
keyword: feature hashing
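A minimal sketch of three of the basic methods above (label encoding, frequency encoding, and combining rare levels), using pandas on a hypothetical toy "city" column:

```python
import pandas as pd

# Hypothetical toy data: "city" is a nominal variable with several levels.
df = pd.DataFrame({"city": ["hanoi", "hue", "hanoi", "saigon", "hue", "hanoi"]})

# Label encoding: maps each level to an integer (imposes a fake order).
df["city_label"] = df["city"].astype("category").cat.codes

# Frequency encoding: replace each level by how often it occurs.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Combine levels: bucket rare levels (here, frequency < 0.2) into "other".
rare = freq[freq < 0.2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")
```

The 0.2 threshold is an arbitrary choice for illustration; in practice it would come from business logic or the response rate per level.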
Advanced
https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
Categorical data
Ordinal data: The categories have an inherent order.
Nominal data: The categories do not have an inherent order.
1. Label encoding or ordinal encoding
2. One hot encoding
3. Dummy encoding
** Dummy encoding is similar to one-hot encoding. This categorical data encoding method transforms the categorical variable into a set of binary variables (also known as dummy variables). In the case of one-hot encoding, for N categories in a variable, it uses N binary variables. Dummy encoding is a small improvement over one-hot encoding: it uses N-1 features to represent N labels/categories.
4. Effect encoding: https://www.researchgate.net/publication/256349393_Categorical_Variables_in_Regression_Analysis_A_Comparison_of_Dummy_and_Effect_Coding
5. Hash encoding
6. Binary encoding
7. Base N encoding
8. Target encoding
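The one-hot vs. dummy encoding distinction (N vs. N-1 columns) can be seen directly with pandas `get_dummies` on a made-up "color" column:

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: N binary columns for N categories.
one_hot = pd.get_dummies(colors["color"], prefix="color")

# Dummy encoding: drop_first=True keeps N-1 columns for N categories;
# the dropped level is represented by all zeros.
dummy = pd.get_dummies(colors["color"], prefix="color", drop_first=True)
```

With 3 categories, `one_hot` has 3 columns and `dummy` has 2; the dropped level ("blue", first in alphabetical order) is the all-zeros row.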
2. L2 regularization of linear regression
The pseudo-inverse solution may not exist because the matrix X · Xᵀ is not invertible when the rows of X are not linearly independent. After adding L2 regularization and taking the derivative, we get X · Xᵀ + λI, which is always invertible (positive definite).
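A minimal NumPy sketch of this point, using the rows-as-samples convention (so the Gram matrix is XᵀX): a duplicated column makes XᵀX singular, while XᵀX + λI can always be solved.

```python
import numpy as np

# Toy design matrix with linearly dependent columns
# (columns 2 and 3 are both multiples of column 1),
# so X.T @ X is singular and has no ordinary inverse.
X = np.array([[1.0, 2.0, 2.0],
              [2.0, 4.0, 4.0],
              [3.0, 6.0, 6.0],
              [4.0, 8.0, 8.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

gram = X.T @ X                      # rank-deficient: rank 1, not 3

# Adding lambda * I (the L2 term after differentiation) makes the
# matrix positive definite, hence invertible.
lam = 0.1
w = np.linalg.solve(gram + lam * np.eye(gram.shape[0]), X.T @ y)
```

Calling `np.linalg.solve(gram, X.T @ y)` without the `lam * np.eye(...)` term would raise `LinAlgError: Singular matrix` here.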
3. C trong L2 regularization logistic regression
C is the inverse of the regularization strength; larger C allows a more "complex" model.
C small -> regularization is strong -> underfit.
C large -> regularization is weak -> overfit.
C is a hyperparameter to tune (e.g. via cross-validation).
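A quick sketch with scikit-learn (assumed available; the synthetic dataset is made up): in `LogisticRegression`, smaller C means stronger L2 regularization, which shrinks the weights toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)   # strong regularization
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)    # weak regularization

# Strong regularization produces a smaller weight vector.
print(np.linalg.norm(strong.coef_), np.linalg.norm(weak.coef_))
```

In practice C would be chosen with something like `GridSearchCV` over a log-spaced range rather than compared by hand.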
4. Initializing the bias and weights to 0 in Logistic Regression
Logistic regression does not have a hidden layer, so zero initialization is not a problem. If you initialize the weights to zero, the model outputs sigmoid(0) = 0.5 for every example at first, but the derivatives depend on the input x (because there is no hidden layer), and x is not zero. So after the first update, the weights follow x's distribution and become different from each other, as long as x is not a constant vector.
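A minimal sketch of this with NumPy on made-up data: starting from w = 0, a single gradient-descent step already produces nonzero, distinct weights because the gradient is a function of X.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed): 4 samples, 2 features, x is not a constant vector.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [0.0, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(2)   # zero initialization
b = 0.0

p = sigmoid(X @ w + b)            # all 0.5 on the first forward pass
grad_w = X.T @ (p - y) / len(y)   # depends on X, so it is not zero
w = w - 0.1 * grad_w              # weights leave zero and differ
```

This is exactly why the same trick fails for a hidden layer: there, zero initialization makes all hidden units compute identical gradients, a symmetry logistic regression does not have.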
5. Normalize inputs. Why?
- Normalize the training set and test set in the same way, using the mean (mu) and standard deviation (sigma) computed from the training set.
- Why normalize inputs? -> The cost function: its contours become more symmetric (less elongated), so gradient descent converges faster and tolerates a larger learning rate.
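A minimal sketch of the first point with NumPy (synthetic data for illustration): compute mu and sigma on the training set only, then apply the same transform to both sets.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 2))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 2))

# Statistics come from the TRAINING set only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ...and the SAME mu and sigma are applied to both sets,
# so the test set sees the exact transform the model was trained on.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma
```

Computing separate statistics on the test set would silently leak test-set information and shift the test distribution relative to training.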