R Dplyr: Data Cleaning functions

The following four functions are important for tidying data:

  • gather(): Transform the data from wide to long
  • spread(): Transform the data from long to wide
  • separate(): Split one variable into two
  • unite(): Unite two variables into one

These functions come from the tidyr library, which is part of the tidyverse, a collection of libraries for manipulating, cleaning and visualizing data. If R was installed with Anaconda, tidyr should already be installed; the package is listed at https://anaconda.org/r/r-tidyr.

If it is not installed already, enter the following command to install tidyr:

install.packages("tidyr")

gather()

The objective of the gather() function is to transform the data from wide to long.

gather(data, key, value, ..., na.rm = FALSE)
Arguments:
-data: The data frame to reshape
-key: Name of the new column that stores the former column names
-value: Name of the new column that stores the values
-...: The columns to gather. If omitted, all columns are gathered
-na.rm: Remove rows whose value is missing. FALSE by default
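
As a minimal sketch of the wide-to-long reshape described above, the snippet below gathers four quarterly columns into one pair of key/value columns. The data frame `messy` and its column names are hypothetical and chosen only for illustration; the columns to gather are passed after `key` and `value`.

```r
library(tidyr)

# Hypothetical wide data: one row per country, one column per quarter
messy <- data.frame(
  country = c("A", "B", "C"),
  q1_2017 = c(0.03, 0.05, 0.01),
  q2_2017 = c(0.05, 0.07, 0.02),
  q3_2017 = c(0.04, 0.05, 0.01),
  q4_2017 = c(0.03, 0.02, 0.04)
)

# Stack the four quarter columns: their names go into "quarter",
# their values go into "growth"
tidier <- gather(messy, key = "quarter", value = "growth", q1_2017:q4_2017)

print(tidier)
```

The result has one row per country and quarter (3 x 4 = 12 rows) and three columns: country, quarter and growth.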

spread()

The spread() function does the opposite of gather().

spread(data, key, value)
Arguments:
  • data: The data frame to reshape
  • key: The column whose values become the names of the new columns
  • value: The column whose values fill the new columns

We can reshape the tidier dataset back to its messy (wide) form with spread().
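
The long-to-wide reshape can be sketched as follows. The data frame `tidier` is a small hypothetical example built inline so that the snippet runs on its own.

```r
library(tidyr)

# Hypothetical long data: one row per country and quarter
tidier <- data.frame(
  country = rep(c("A", "B"), times = 2),
  quarter = rep(c("q1_2017", "q2_2017"), each = 2),
  growth  = c(0.03, 0.05, 0.05, 0.07)
)

# Each distinct value of "quarter" becomes a column filled from "growth"
messy <- spread(tidier, key = quarter, value = growth)

print(messy)
```

The result has one row per country and one column per quarter.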

separate()

The separate() function splits a column into two according to a separator. This is helpful, for example, when the variable is a date: the analysis may need to focus on month and year separately, so we split the column into two new variables.

Syntax:

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE)
Arguments:
-data: The data frame to reshape
-col: The column to split
-into: The names of the new variables
-sep: The symbol that separates the values, e.g. "-", "_", "&". By default, any run of non-alphanumeric characters
-remove: Remove the old column. TRUE by default

We can split the quarter from the year in the tidier dataset by applying the separate() function.
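
A minimal sketch of that split, assuming a hypothetical `tidier` data frame whose `quarter` column holds values such as "q1_2017". The new column names `Qrt` and `year` are illustrative.

```r
library(tidyr)

# Hypothetical long data with quarter and year fused in one column
tidier <- data.frame(
  country = c("A", "B"),
  quarter = c("q1_2017", "q2_2017"),
  growth  = c(0.03, 0.07)
)

# Split "quarter" at the underscore into two new character columns
separated <- separate(tidier, col = quarter, into = c("Qrt", "year"), sep = "_")

print(separated)
```

Because remove is TRUE by default, the original quarter column is dropped and the two new columns take its place.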

unite()

The unite() function concatenates two columns into one.

Syntax:

unite(data, col, ..., sep = "_", remove = TRUE)
Arguments:
-data: The data frame to reshape
-col: Name of the new column
-...: The columns to concatenate
-sep: The symbol placed between the united values, e.g. "-", "_", "&". "_" by default
-remove: Remove the old columns. TRUE by default
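
As a sketch of the reverse operation, the snippet below glues two columns back together. The data frame `separated` and its column names are hypothetical; the columns to concatenate are listed after the name of the new column.

```r
library(tidyr)

# Hypothetical data with quarter and year in separate columns
separated <- data.frame(
  country = c("A", "B"),
  Qrt     = c("q1", "q2"),
  year    = c("2017", "2017"),
  growth  = c(0.03, 0.07)
)

# Concatenate "Qrt" and "year" into a single "quarter" column
united <- unite(separated, col = "quarter", Qrt, year, sep = "_")

print(united)
```

The new quarter column holds "q1_2017" and "q2_2017", and the two source columns are removed because remove is TRUE by default.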
