Power Transformation [To Gaussian] {Box-Cox}
Description
This technique is similar to log transformation but allows for a broader range of transformations. The most common power transformation is the Box-Cox transformation, which raises the feature values to a power that is determined using maximum likelihood estimation.
Formula
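\[
y = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x), & \lambda = 0 \end{cases}
\]
Here, \(x\) is the original (strictly positive) feature value, \(y\) is the transformed value, and \(\lambda\) is the power parameter that is estimated using maximum likelihood.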
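As a rough illustration, the standard Box-Cox definition, \((x^{\lambda} - 1)/\lambda\) for \(\lambda \neq 0\) and \(\ln(x)\) for \(\lambda = 0\), can be written by hand and compared against `scipy.stats.boxcox`. The helper name `box_cox` and the sample values are assumptions made for this sketch and are not part of the survey example.

```python
import numpy as np
from scipy import stats


def box_cox(x, lam):
    """Hand-written Box-Cox transform (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return np.log(x)            # limiting case: lambda = 0
    return (x ** lam - 1.0) / lam   # general case: lambda != 0


x = np.array([1.0, 10.0, 100.0, 1000.0])  # illustrative values only

manual_half = box_cox(x, 0.5)
manual_zero = box_cox(x, 0.0)
scipy_half = stats.boxcox(x, lmbda=0.5)   # SciPy with a fixed lambda

print(np.allclose(manual_half, scipy_half))  # the two implementations agree
print(np.allclose(manual_zero, np.log(x)))   # lambda = 0 is the natural log
```

When `lmbda` is passed, `scipy.stats.boxcox` only applies the transform; when it is omitted, SciPy also estimates \(\lambda\) and returns it as a second value.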
Varieties
The Box-Cox transform is one of the most widely used members of the power transform family. It requires the numeric values being transformed to be strictly positive (the same restriction the log transform has). If the data contain zeros or negative values, shift them by a constant first so that every value is positive.
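The shifting step might look like the following minimal sketch. The data are made up, and the choice to shift so that the minimum becomes 1 is an arbitrary assumption; any constant that makes all values strictly positive would satisfy the prerequisite.

```python
import numpy as np
from scipy import stats

values = np.array([-3.0, -1.0, 0.0, 2.0, 5.0, 12.0])  # illustrative data

# Shift so the smallest value becomes 1 (the target of 1 is an arbitrary choice).
shift = 1.0 - values.min()
shifted = values + shift

# With no lmbda argument, boxcox returns the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(shifted)

print("shift applied:", shift)
print("fitted lambda:", fitted_lambda)
```

Note that the fitted \(\lambda\) depends on the shift, so the same constant has to be applied to any new data before reusing the transform.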
The resulting transformed output y is a function of the input x and the transformation parameter \(\lambda\) such that when \(\lambda = 0\), the resultant transform is the natural log transform which we discussed earlier. The optimal value of \(\lambda\) is usually determined by maximum likelihood estimation, that is, by choosing the \(\lambda\) that maximizes the log-likelihood of the transformed data under a normal model.
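To make the estimation step concrete, the sketch below evaluates the Box-Cox log-likelihood on a grid of candidate \(\lambda\) values with `scipy.stats.boxcox_llf` and compares the best grid point with the value returned by `scipy.stats.boxcox`. The synthetic log-normal data, the seed, and the grid range are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(mean=10.0, sigma=1.0, size=2000)  # synthetic, right-skewed

# SciPy's built-in maximum likelihood estimate of lambda.
_, mle_lambda = stats.boxcox(data)

# Brute-force alternative: evaluate the log-likelihood on a grid of lambdas.
grid = np.linspace(-1.0, 1.0, 201)
llf = np.array([stats.boxcox_llf(lam, data) for lam in grid])
grid_lambda = grid[np.argmax(llf)]

print("lambda from boxcox:", mle_lambda)
print("lambda from grid:  ", grid_lambda)
```

Because the data here are log-normal, the estimate should land close to 0, which is the case where Box-Cox coincides with the log transform.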
Example
Let's now apply the Box-Cox transform on our developer income feature. First we get the optimal lambda value from the data distribution, after removing the null values, as follows.
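import numpy as np
import scipy.stats as spstats
# Keep only the non-missing incomes before estimating lambda
income = np.array(fcc_survey_df["Income"])
income_clean = income[~np.isnan(income)]
# boxcox returns the transformed values and the maximum likelihood lambda
l, opt_lambda = spstats.boxcox(income_clean)
print("Optimal lambda value:", opt_lambda)
# Optimal lambda value: 0.117991239456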
Now that we have obtained the optimal \(\lambda\) value, let us apply the Box-Cox transform to the developer Income feature for two values of \(\lambda\): \(\lambda = 0\) and \(\lambda = \lambda_{optimal}\).
fcc_survey_df["Income_boxcox_lambda_0"] = spstats.boxcox(
    (1 + fcc_survey_df["Income"]),
    lmbda=0
)
fcc_survey_df["Income_boxcox_lambda_opt"] = spstats.boxcox(
    fcc_survey_df["Income"],
    lmbda=opt_lambda
)
Info
The transformed features are depicted in the above data frame. Just as we expected, Income_log and Income_boxcox_lambda_0 have the same values, because the Box-Cox transform with \(\lambda = 0\) is the natural log transform.
With the optimal \(\lambda\), the distribution looks more normal-like, similar to what we obtained after the log transform.
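One way to put a number on "more normal-like" is to compare sample skewness before and after the transform. The sketch below does this on synthetic log-normal data standing in for an income-like feature; the data, seed, and variable names are assumptions and do not reproduce the survey results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
income_like = rng.lognormal(mean=10.5, sigma=0.8, size=5000)  # synthetic stand-in

log_values = np.log(income_like)
boxcox_values, fitted_lambda = stats.boxcox(income_like)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_raw = stats.skew(income_like)
skew_log = stats.skew(log_values)
skew_boxcox = stats.skew(boxcox_values)

print("skew (raw):    ", skew_raw)
print("skew (log):    ", skew_log)
print("skew (Box-Cox):", skew_boxcox)
```

Skewness is only one symptom of non-normality, so a histogram or Q-Q plot remains a useful complementary check.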
