R Aggregate Function: Summarise & Group_by() Example

1 Answer

A summary of a variable gives a quick idea of the data. Summarizing a variable by group, however, gives better information on how the data are distributed.

In this tutorial, you will learn how to summarize a dataset by group with the dplyr library.

Before you compute the summary, you will do the following steps to prepare the data:

  • Step 1: Import the data
  • Step 2: Select the relevant variables
  • Step 3: Sort the data

library(dplyr)

# Step 1: import the data
data <- read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/lahman-batting.csv") %>%
  # Step 2: select the relevant variables
  select(c(playerID, yearID, AB, teamID, lgID, G, R, HR, SH)) %>%
  # Step 3: sort the data
  arrange(playerID, teamID, yearID)

A good practice when you import a dataset is to call the glimpse() function to get an idea of its structure.

# Structure of the data
glimpse(data)

Summarise()

The syntax of summarise() is basic and consistent with the other verbs included in the dplyr library.

summarise(df, variable_name=condition) 
arguments: 
- `df`: Dataset used to construct the summary statistics 
- `variable_name=condition`: Formula to create the new variable

Look at the code below:

summarise(data, mean_run = mean(R))

Code Explanation

  • summarise(data, mean_run = mean(R)): Creates a variable named mean_run, which is the average of the column R (runs) from the dataset data.
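
As a minimal sketch of the same idea, the example below computes several summary statistics in a single summarise() call. The data frame toy is a hypothetical stand-in for the batting data, and na.rm = TRUE is added on the assumption that a column may contain missing values.

```r
library(dplyr)

# Hypothetical toy data standing in for the batting dataset
toy <- data.frame(
  playerID = c("a", "a", "b", "b"),
  R        = c(10, 20, 30, NA),
  HR       = c(1, 3, 5, 7)
)

# Without group_by(), summarise() collapses the whole data frame to one row
overall <- summarise(
  toy,
  mean_run = mean(R, na.rm = TRUE),  # skip the missing value
  max_hr   = max(HR),                # largest home run count
  n_rows   = n()                     # number of rows summarised
)

print(overall)
```

Each name = expression pair becomes one column of the one-row result.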
Group_by vs no group_by

The function summarise() without group_by() returns a single summary for the whole dataset. It becomes far more useful together with group_by(), because it then creates the summary statistics by group. The dplyr library automatically applies the function to each group you pass to the verb group_by().

Note that group_by() also works with the other dplyr verbs (i.e. mutate(), filter(), arrange(), ...).
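
To illustrate, here is a small sketch of group_by() combined with mutate() and filter(). The data frame toy and the column name league_mean_hr are hypothetical and chosen only for this example.

```r
library(dplyr)

# Hypothetical toy data: home runs for players in two leagues
toy <- data.frame(
  lgID = c("AL", "AL", "NL", "NL"),
  HR   = c(2, 4, 1, 9)
)

above_league_mean <- toy %>%
  group_by(lgID) %>%
  mutate(league_mean_hr = mean(HR)) %>%  # group mean repeated on every row of the group
  filter(HR > league_mean_hr) %>%        # keep rows above their own league's mean
  ungroup()                              # drop the grouping once it is no longer needed

print(above_league_mean)
```

Unlike summarise(), mutate() keeps every row, so the group statistic can be compared with the individual values.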

The pipe operator %>% is convenient when you have more than one step. For example, you can compute the average number of home runs by baseball league.

data %>%
  group_by(lgID) %>%
  summarise(mean_run = mean(HR))

Code Explanation

  • data: Dataset used to construct the summary statistics
  • group_by(lgID): Computes the summary by grouping on the variable lgID
  • summarise(mean_run = mean(HR)): Computes the average number of home runs
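
The grouped summary can be checked on a small example. This is only a sketch: the data frame toy and the column names mean_home_run and players are hypothetical, and the values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: one row per player season
toy <- data.frame(
  lgID = c("AL", "AL", "NL", "NL", "NL"),
  HR   = c(2, 4, 1, 3, 8)
)

by_league <- toy %>%
  group_by(lgID) %>%
  summarise(
    mean_home_run = mean(HR),  # average home runs within each league
    players       = n()        # number of rows in each league
  )

print(by_league)
```

The result has one row per value of lgID, which is the difference from calling summarise() on ungrouped data.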
