Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables.

In a dataset, we can distinguish two types of variables: categorical and continuous.

A categorical variable takes one of a limited set of values, usually drawn from a fixed, finite group. Examples include country, year, gender, and occupation.

A continuous variable, by contrast, can take any numeric value, whole or fractional. Examples include revenue and the price of a share.

Categorical Variables

R stores categorical variables as factors. The code below converts a character vector into a factor. Many machine learning algorithms cannot work with character strings directly; a factor addresses this by storing each value as an integer code that points to a label (its level).

Syntax

factor(x = character(), levels, labels = levels, ordered = is.ordered(x))

Arguments:

x: A vector of data, typically character strings or integers. Decimal values are accepted but rarely make sense as categories.

levels: An optional vector of the values that x can take. By default, it is the sorted unique values of x.

labels: Optional labels for the levels, given in the same order as levels. For example, 1 can take the label `male` and 0 the label `female`.

ordered: A logical flag that determines whether the levels are treated as ordered, in the sequence given by levels.
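The levels and labels arguments described above can be sketched as follows. This is a minimal illustration, not part of the original example: the gender_codes vector and its 0/1 coding are hypothetical data made up for the purpose.

```r
# Hypothetical coded data (an assumption): 1 stands for male, 0 for female
gender_codes <- c(1, 0, 0, 1, 1)

# levels lists the raw values; labels renames them in the same order
gender_factor <- factor(gender_codes,
                        levels = c(0, 1),
                        labels = c("female", "male"))

gender_factor           # male female female male male
levels(gender_factor)   # "female" "male"
```

Because labels is matched to levels by position, swapping the order of one without the other would silently mislabel the data.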

Example:

Let's create a character vector and convert it to a factor.

# Create gender vector

gender_vector <- c("Male", "Female", "Female", "Male", "Male")

class(gender_vector)

# Convert gender_vector to a factor

factor_gender_vector <- factor(gender_vector)

class(factor_gender_vector)

Output:

## [1] "character"

## [1] "factor"
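To see the integer codes that a factor stores behind its labels, the same vector can be inspected with levels() and as.integer(). This is a small sketch that repeats the vector from the example so it runs on its own.

```r
gender_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_gender_vector <- factor(gender_vector)

# Levels default to the sorted unique values
levels(factor_gender_vector)       # "Female" "Male"

# Each element is stored as the position of its level
as.integer(factor_gender_vector)   # 2 1 1 2 2
```

"Female" sorts before "Male", so it receives code 1 and "Male" receives code 2.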

Ordinal Categorical Variable

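Ordinal categorical variables have a natural ordering. To record it, pass ordered = TRUE and list the levels from lowest to highest in the levels argument. With ordered = FALSE (the default), the factor is unordered and its levels carry no ranking. R also accepts the abbreviation order = TRUE through partial argument matching, but the full name is clearer.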

Example:

The code below creates an ordered factor for the time of day. Once the factor exists, summary() can count the values in each level.

# Create Ordinal categorical vector

day_vector <- c('evening', 'morning', 'afternoon', 'midday', 'midnight', 'evening')

# Convert `day_vector` to a factor with ordered levels

factor_day <- factor(day_vector, ordered = TRUE, levels = c('morning', 'midday', 'afternoon', 'evening', 'midnight'))

# Print the new variable

factor_day

Output:

## [1] evening   morning   afternoon midday    midnight  evening

## Levels: morning < midday < afternoon < evening < midnight
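As a short sketch of what the ordering makes possible, the snippet below rebuilds the same factor, counts the values per level with summary(), and compares two elements. The name day_counts is introduced here only for illustration.

```r
day_vector <- c("evening", "morning", "afternoon", "midday", "midnight", "evening")
factor_day <- factor(day_vector, ordered = TRUE,
                     levels = c("morning", "midday", "afternoon", "evening", "midnight"))

# Counts are reported in level order, not alphabetical order
day_counts <- summary(factor_day)
day_counts
# morning midday afternoon evening midnight
#       1      1         1       2        1

# Comparison operators are defined for ordered factors
factor_day[2] < factor_day[1]   # morning < evening, so TRUE
```

On an unordered factor the same comparison returns NA with a warning, which is one practical reason to set ordered = TRUE for ordinal data.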

Continuous Variables

Numeric is the default type for numbers in R, so continuous variables are stored as numeric or integer vectors. The built-in dataset mtcars illustrates this. It gathers information on different car models and is available by name without any import. Checking the class of its variable mpg (miles per gallon) returns "numeric", which indicates a continuous variable.
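The check described above can be sketched as follows. The name mpg_class is introduced here only for illustration; mtcars ships with base R in the datasets package.

```r
# mtcars is available by name in a standard R session
mpg_class <- class(mtcars$mpg)
mpg_class      # "numeric"

# Whole numbers written with an L suffix are stored as integer instead
class(42L)     # "integer"
```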