Data scientist need to do many repetitive tasks. Most of the time, we copy and paste chunks of code repetitively. For example, normalization of a variable is highly recommended before we run a machine learning algorithm. The formula to normalize a variable is:
We already know how to use the min() and max() function in R. We use the tibble library to create the data frame. Tibble is so far the most convenient function to create a data set from scratch.
library(tibble)
# Create a data frame
data_frame <- tibble(
c1 = rnorm(50, 5, 1.5),
c2 = rnorm(50, 5, 1.5),
c3 = rnorm(50, 5, 1.5),
)
We will proceed in two steps to compute the function described above. In the first step, we will create a variable called c1_norm which is the rescaling of c1. In step two, we just copy and paste the code of c1_norm and change with c2 and c3.
Detail of the function with the column c1:
Nominator: : data_frame$c1 -min(data_frame$c1))
Denominator: max(data_frame$c1)-min(data_frame$c1))
Therefore, we can divide them to get the normalized value of column c1:
(data_frame$c1 -min(data_frame$c1))/(max(data_frame$c1)-min(data_frame$c1))
We can create c1_norm, c2_norm and c3_norm:
Create c1_norm: rescaling of c1
data_frame$c1_norm <- (data_frame$c1 -min(data_frame$c1))/(max(data_frame$c1)-min(data_frame$c1))
# show the first five values
head(data_frame$c1_norm, 5)