Summarizing a variable gives you a first idea of the data. Summarizing it by group, however, tells you more about how the data is distributed.
In this tutorial, you will learn how to summarize a dataset by group with the dplyr library.
Before you compute the summary, you will do the following steps to prepare the data:
- Step 1: Import the data
- Step 2: Select the relevant variables
- Step 3: Sort the data
library(dplyr)
# Step 1
data <- read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/lahman-batting.csv") %>%
  # Step 2
  select(c(playerID, yearID, AB, teamID, lgID, G, R, HR, SH)) %>%
  # Step 3
  arrange(playerID, teamID, yearID)
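To see what the select and sort steps do without downloading the file, here is a minimal sketch on a small made-up data frame. The data frame `toy`, its values, and the column `extra` are hypothetical; only the other column names mirror the batting data.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  playerID = c("b01", "a01", "a01"),
  yearID   = c(2001, 2003, 2002),
  teamID   = c("NYA", "BOS", "BOS"),
  HR       = c(10, 25, 5),
  extra    = c(1, 2, 3)  # a column we do not need
)

prepared <- toy %>%
  select(playerID, yearID, teamID, HR) %>%  # keep only the relevant columns
  arrange(playerID, teamID, yearID)         # sort by player, then team, then year

print(prepared)
```

The column `extra` is dropped, and the two rows of player a01 come first, ordered by year.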
A good practice when you import a dataset is to use the glimpse() function to get an idea of its structure.
# Structure of the data
glimpse(data)
The syntax of summarise() is simple and consistent with the other verbs included in the dplyr library.
summarise(df, variable_name=condition)
Arguments:
- `df`: Dataset used to construct the summary statistics
- `variable_name=condition`: Formula to create the new variable
Look at the code below:
summarise(data, mean_run = mean(R))
- summarise(data, mean_run = mean(R)): Creates a variable named mean_run, which is the average of the column R (runs) from the dataset data.
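As a rough illustration of what an ungrouped summary returns, the sketch below runs the same call on a tiny made-up data frame. The data frame `toy` and its values are assumptions for the example, not part of the batting data.

```r
library(dplyr)

# Hypothetical toy data: runs scored (R) for four player-seasons
toy <- data.frame(R = c(10, 20, 30, 40))

# Without grouping, summarise() collapses the whole data frame into one row
overall <- summarise(toy, mean_run = mean(R))

print(overall)  # one row, one column: mean_run is 25
```

If the column contains missing values, mean() returns NA unless you pass na.rm = TRUE.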
Group_by vs no group_by
The function summarise() is of limited use without group_by(): on its own, it returns a single row for the whole dataset. Combined with group_by(), it creates summary statistics by group. The dplyr library automatically applies the function to each group you passed to the verb group_by().
Note that group_by() also works with the other verbs (e.g. mutate(), filter(), arrange(), ...).
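The following sketch shows group_by() combined with mutate() and filter() on a small made-up data frame. The data frame `toy`, its values, and the new column `league_mean` are hypothetical names chosen for the example.

```r
library(dplyr)

# Hypothetical toy data: home runs (HR) in two leagues
toy <- data.frame(
  lgID = c("AL", "AL", "NL", "NL"),
  HR   = c(10, 30, 5, 15)
)

above_mean <- toy %>%
  group_by(lgID) %>%
  mutate(league_mean = mean(HR)) %>%  # mean is computed within each league
  filter(HR > league_mean) %>%        # keep rows above their own league's mean
  ungroup()                           # drop the grouping once it is no longer needed

print(above_mean)
```

Here the AL mean is 20 and the NL mean is 10, so only the rows with 30 and 15 home runs remain.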
It is convenient to use the pipe operator when you have more than one step. You can compute the average number of home runs by baseball league.
data %>%
  group_by(lgID) %>%
  summarise(mean_run = mean(HR))
- data: Dataset used to construct the summary statistics
- group_by(lgID): Group the data by the variable `lgID`, so the summary is computed per league
- summarise(mean_run = mean(HR)): Compute the average number of home runs
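To check the result of a grouped summary by hand, the same pattern can be tried on a small made-up data frame. This is only a sketch: the data frame `toy`, its values, and the output names `mean_home_run` and `n_rows` are assumptions for the example. It also shows that summarise() accepts several summaries at once.

```r
library(dplyr)

# Hypothetical toy data: home runs (HR) in two leagues
toy <- data.frame(
  lgID = c("AL", "AL", "NL", "NL", "NL"),
  HR   = c(10, 30, 5, 15, 10)
)

by_league <- toy %>%
  group_by(lgID) %>%
  summarise(
    mean_home_run = mean(HR),  # average home runs within each league
    n_rows        = n()        # number of rows in each league
  )

print(by_league)  # one row per league: AL (mean 20, 2 rows), NL (mean 10, 3 rows)
```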