R and SQL (Part 2)

This is part 2 of a series of posts demonstrating how equivalent analysis can be carried out in SQL and R. Here we look at joins, grouped data, and subqueries.


Introduction

In Part 1 of this series we looked at the similarities between SQL and R when selecting, sorting and filtering data. In this second part we are going to extend the comparison to include joins, grouped data and subqueries.



The data

We will use the same sample dataset from mysqltutorial.org that was used in the first part of this post, as described here. The schema is reproduced below for ease of reference:


Instructions on how to load the sample dataset into MySQL and R were provided here.

(Top)

1. Common joins

A join allows you to combine variables or columns from two tables. It first matches observations by their keys, then copies across variables from one table to another.

To understand joins, we are going to use the explanation provided by Hadley Wickham in R for Data Science, and the excellent animations that were produced by Garrick Aden-Buie. These illustrate the different ways that we can join the data in table x, below, with the data in table y. The colored columns represent the “key” variable that is used to match the rows between the tables.

This section looks at four common joins:

  • Inner join
  • Left join
  • Right join
  • Full join
(Top)

1.1 Inner join

An inner join shows all rows from x where there are matching values in y, and all columns from x and y:

Example 1.1a:

Say we want to list the name of every employee from the employees table who has one or more customers recorded in the customers table.

This is an example of an inner join, as we only want records that have a match in both tables. In this case our keys - the columns that we will be matching - are the employeeNumber in the employees table and the salesRepEmployeeNumber in the customers table.

We would like the code to return the following information:

  • employee number
  • employee first name
  • employee second name
  • customer name(s)

With SQL:

To do this in SQL we would use INNER JOIN, followed by the name of the table we are joining to. We then use ON to specify the names of the columns (keys) that are being used to match the data in the two tables:

SELECT 
  employees.employeeNumber, 
  employees.lastName, 
  employees.firstName, 
  customers.customerName
FROM   
  employees 
  INNER JOIN 
  customers 
    ON employees.employeeNumber = customers.salesRepEmployeeNumber
ORDER BY 
  lastName, customerName

Note that we’ve specified the table from which each column is taken; for example, using employees.employeeNumber rather than just employeeNumber. This is necessary in cases where the same column name appears in more than one table. In this particular example, we could have written the code without the table prefixes, but they have been included for illustrative purposes.

With R:

The same join could be carried out in R using the inner_join function. This includes the by argument (equivalent to SQL’s ON), in which we list the names of the keys:

employees %>%   #left-hand table
  #after 'by' we list the key from the left-hand table first:
  inner_join(customers, by = c("employeeNumber" = "salesRepEmployeeNumber")) %>%
  select(employeeNumber, lastName, firstName, customerName) %>% 
  arrange(lastName, customerName)
Both the SQL and R code above returns the following 100 records. Note that each employee’s name appears more than once if they have more than one customer.
employeeNumber lastName firstName customerName
1337 Bondur Loui Auto Canal+ Petit
1337 Bondur Loui La Corne D’abondance, Co. 
1337 Bondur Loui Lyon Souveniers
1337 Bondur Loui Marseille Mini Autos
1337 Bondur Loui Reims Collectables
1337 Bondur Loui Saveley & Henriot, Co. 
1501 Bott Larry AV Stores, Co. 
1501 Bott Larry Double Decker Gift Stores, Ltd
1501 Bott Larry giftsbymail.co.uk
1501 Bott Larry Oulu Toy Supplies, Inc. 
1501 Bott Larry Stylish Desk Decors, Co. 
1501 Bott Larry Suominen Souveniers
1501 Bott Larry Toys of Finland, Co. 
1501 Bott Larry UK Collectables, Ltd. 
1401 Castillo Pamela Amica Models & Co. 
1401 Castillo Pamela Danish Wholesale Imports
1401 Castillo Pamela Frau da Collezione
1401 Castillo Pamela Heintze Collectables
1401 Castillo Pamela L’ordine Souveniers
1401 Castillo Pamela Mini Auto Werke
1401 Castillo Pamela Petit Auto
1401 Castillo Pamela Rovelli Gifts
1401 Castillo Pamela Royale Belge
1401 Castillo Pamela Salzburg Collectables
1188 Firrelli Julie Cambridge Collectables Co. 
1188 Firrelli Julie Classic Gift Ideas, Inc
1188 Firrelli Julie Collectables For Less Inc. 
1188 Firrelli Julie Diecast Collectables
1188 Firrelli Julie Mini Creations Ltd. 
1188 Firrelli Julie Online Mini Collectables
1611 Fixter Andy Anna’s Decorations, Ltd
1611 Fixter Andy Australian Collectables, Ltd
1611 Fixter Andy Australian Collectors, Co. 
1611 Fixter Andy Australian Gift Network, Co
1611 Fixter Andy Souveniers And Things Co. 
1702 Gerard Martin CAF Imports
1702 Gerard Martin Corrida Auto Replicas, Ltd
1702 Gerard Martin Enaco Distributors
1702 Gerard Martin Iberia Gift Imports, Corp. 
1702 Gerard Martin Precious Collectables
1702 Gerard Martin Vida Sport, Ltd
1370 Hernandez Gerard Alpha Cognac
1370 Hernandez Gerard Atelier graphique
1370 Hernandez Gerard Auto Associés & Cie.
1370 Hernandez Gerard Daedalus Designs Imports
1370 Hernandez Gerard Euro+ Shopping Channel
1370 Hernandez Gerard La Rochelle Gifts
1370 Hernandez Gerard Mini Caravy
1165 Jennings Leslie Corporate Gift Ideas Co. 
1165 Jennings Leslie Mini Gifts Distributors Ltd. 
1165 Jennings Leslie Mini Wheels Co. 
1165 Jennings Leslie Signal Collectibles Ltd. 
1165 Jennings Leslie Technics Stores Inc. 
1165 Jennings Leslie The Sharp Gifts Warehouse
1504 Jones Barry Baane Mini Imports
1504 Jones Barry Bavarian Collectables Imports, Co. 
1504 Jones Barry Blauer See Auto, Co. 
1504 Jones Barry Clover Collections, Co. 
1504 Jones Barry Herkku Gifts
1504 Jones Barry Norway Gifts By Mail, Co. 
1504 Jones Barry Scandinavian Gift Ideas
1504 Jones Barry Toms Spezialitäten, Ltd
1504 Jones Barry Volvo Model Replicas, Co
1612 Marsh Peter Down Under Souveniers, Inc
1612 Marsh Peter Extreme Desk Decorations, Ltd
1612 Marsh Peter GiftsForHim.com
1612 Marsh Peter Handji Gifts& Co
1612 Marsh Peter Kelly’s Gift Shop
1621 Nishi Mami Cruz & Sons Co. 
1621 Nishi Mami Dragon Souveniers, Ltd. 
1621 Nishi Mami King Kong Collectables, Co. 
1621 Nishi Mami Osaka Souveniers Co. 
1621 Nishi Mami Tokyo Collectables, Ltd
1216 Patterson Steve Auto-Moto Classics Inc. 
1216 Patterson Steve Diecast Classics Inc. 
1216 Patterson Steve FunGiftIdeas.com
1216 Patterson Steve Gifts4AllAges.com
1216 Patterson Steve Marta’s Replicas Co. 
1216 Patterson Steve Online Diecast Creations Co. 
1166 Thompson Leslie Boards & Toys Co. 
1166 Thompson Leslie Collectable Mini Designs Co. 
1166 Thompson Leslie Men ‘R’ US Retailers, Ltd. 
1166 Thompson Leslie Signal Gift Stores
1166 Thompson Leslie Toys4GrownUps.com
1166 Thompson Leslie West Coast Collectables Co. 
1286 Tseng Foon Yue American Souvenirs Inc
1286 Tseng Foon Yue Classic Legends Inc. 
1286 Tseng Foon Yue Microscale Inc. 
1286 Tseng Foon Yue Muscle Machine Inc
1286 Tseng Foon Yue Québec Home Shopping Network
1286 Tseng Foon Yue Super Scale Inc. 
1286 Tseng Foon Yue Vitachrome Inc. 
1323 Vanauf George Canadian Gift Exchange Network
1323 Vanauf George Gift Depot Inc. 
1323 Vanauf George Gift Ideas Corp. 
1323 Vanauf George Land of Toys Inc. 
1323 Vanauf George Mini Classics
1323 Vanauf George Motor Mint Distributors Inc. 
1323 Vanauf George Royal Canadian Collectables, Ltd. 
1323 Vanauf George Tekni Collectables Inc. 
Example 1.1b:

Say that we wanted to list every customer from the customers table who has one or more orders in the orders table.

This is another example of an inner join, this time matching records using the ‘customerNumber’ column as our key, which is the same in both tables.

We want the query to return the following information:

  • customer number
  • customer name
  • order number
  • current order status

With SQL:

The fact that the same column name is used in both tables means we can simplify our code with USING instead of ON, as follows:

SELECT 
  c.customerNumber, 
  c.customerName, 
  o.orderNumber, 
  o.status
FROM
  customers AS c 
  INNER JOIN 
  orders AS o
    USING (customerNumber)
ORDER BY
  customerName

This simplifies the code as we only need to list one column after USING.

In the example above, we’ve also used aliases in place of the full table names, which further simplifies the code. This has assigned the short name ‘c’ to the ‘customers’ table and ‘o’ to the ‘orders’ table. This is done with the keyword AS, though even this could be excluded. Aliases makes the code shorter and easier to read.

With R:

The equivalent R code is shown below. Note that we now only need to specify the one column name in the by argument.

customers %>% 
  inner_join(orders, by = "customerNumber") %>%
  select(customerNumber, customerName, orderNumber, status) %>% 
  arrange(customerName)


This code returns 326 rows, the first 50 of which are shown below:
customerNumber customerName orderNumber status
242 Alpha Cognac 10136 Shipped
242 Alpha Cognac 10178 Shipped
242 Alpha Cognac 10397 Shipped
249 Amica Models & Co.  10280 Shipped
249 Amica Models & Co.  10293 Shipped
276 Anna’s Decorations, Ltd 10148 Shipped
276 Anna’s Decorations, Ltd 10169 Shipped
276 Anna’s Decorations, Ltd 10370 Shipped
276 Anna’s Decorations, Ltd 10391 Shipped
103 Atelier graphique 10123 Shipped
103 Atelier graphique 10298 Shipped
103 Atelier graphique 10345 Shipped
471 Australian Collectables, Ltd 10193 Shipped
471 Australian Collectables, Ltd 10265 Shipped
471 Australian Collectables, Ltd 10415 Disputed
114 Australian Collectors, Co.  10120 Shipped
114 Australian Collectors, Co.  10125 Shipped
114 Australian Collectors, Co.  10223 Shipped
114 Australian Collectors, Co.  10342 Shipped
114 Australian Collectors, Co.  10347 Shipped
333 Australian Gift Network, Co 10152 Shipped
333 Australian Gift Network, Co 10174 Shipped
333 Australian Gift Network, Co 10374 Shipped
198 Auto-Moto Classics Inc.  10130 Shipped
198 Auto-Moto Classics Inc.  10290 Shipped
198 Auto-Moto Classics Inc.  10352 Shipped
256 Auto Associés & Cie. 10216 Shipped
256 Auto Associés & Cie. 10304 Shipped
406 Auto Canal+ Petit 10211 Shipped
406 Auto Canal+ Petit 10252 Shipped
406 Auto Canal+ Petit 10402 Shipped
187 AV Stores, Co.  10110 Shipped
187 AV Stores, Co.  10306 Shipped
187 AV Stores, Co.  10332 Shipped
121 Baane Mini Imports 10103 Shipped
121 Baane Mini Imports 10158 Shipped
121 Baane Mini Imports 10309 Shipped
121 Baane Mini Imports 10325 Shipped
415 Bavarian Collectables Imports, Co.  10296 Shipped
128 Blauer See Auto, Co.  10101 Shipped
128 Blauer See Auto, Co.  10230 Shipped
128 Blauer See Auto, Co.  10300 Shipped
128 Blauer See Auto, Co.  10323 Shipped
219 Boards & Toys Co.  10154 Shipped
219 Boards & Toys Co.  10376 Shipped
344 CAF Imports 10177 Shipped
344 CAF Imports 10231 Shipped
173 Cambridge Collectables Co.  10228 Shipped
173 Cambridge Collectables Co.  10249 Shipped
202 Canadian Gift Exchange Network 10206 Shipped
(Top)

1.2 Left join

A left join (or outer left join) returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA or NULL values in the new columns.

Example 1.2:

In the example above we wanted to see the names customers in the customers table who had a corresponding order in the orders table. But what if we wanted to see names of all customers and their order information, even if they don’t have any orders?

This is an example of a left join. We can do this using LEFT JOIN in SQL and left_join in R, as follows:

With SQL:

SELECT 
  c.customerNumber, 
  c.customerName, 
  o.orderNumber, 
  o.status
FROM  
  customers AS c
  LEFT JOIN 
  orders AS o
    ON c.customerNumber = o.customerNumber
ORDER BY customerName

With R:

customers %>% 
  left_join(orders, by = "customerNumber") %>% 
  select(customerNumber, customerName, orderNumber, status) %>% 
  arrange(customerName)

This returns 350 records, the first 50 of which are shown in the table below. If you scroll down, you will see that some customers have ‘NA’ for orderNumber and status (this will appear as ‘NULL’ in MySQL). These are customers who had no corresponding orders in the orders table. Note that each customer will appear more than once if they have more than one order.


customerNumber customerName orderNumber status
242 Alpha Cognac 10136 Shipped
242 Alpha Cognac 10178 Shipped
242 Alpha Cognac 10397 Shipped
168 American Souvenirs Inc NA NA
249 Amica Models & Co.  10280 Shipped
249 Amica Models & Co.  10293 Shipped
237 ANG Resellers NA NA
276 Anna’s Decorations, Ltd 10148 Shipped
276 Anna’s Decorations, Ltd 10169 Shipped
276 Anna’s Decorations, Ltd 10370 Shipped
276 Anna’s Decorations, Ltd 10391 Shipped
465 Anton Designs, Ltd.  NA NA
206 Asian Shopping Network, Co NA NA
348 Asian Treasures, Inc.  NA NA
103 Atelier graphique 10123 Shipped
103 Atelier graphique 10298 Shipped
103 Atelier graphique 10345 Shipped
471 Australian Collectables, Ltd 10193 Shipped
471 Australian Collectables, Ltd 10265 Shipped
471 Australian Collectables, Ltd 10415 Disputed
114 Australian Collectors, Co.  10120 Shipped
114 Australian Collectors, Co.  10125 Shipped
114 Australian Collectors, Co.  10223 Shipped
114 Australian Collectors, Co.  10342 Shipped
114 Australian Collectors, Co.  10347 Shipped
333 Australian Gift Network, Co 10152 Shipped
333 Australian Gift Network, Co 10174 Shipped
333 Australian Gift Network, Co 10374 Shipped
198 Auto-Moto Classics Inc.  10130 Shipped
198 Auto-Moto Classics Inc.  10290 Shipped
198 Auto-Moto Classics Inc.  10352 Shipped
256 Auto Associés & Cie. 10216 Shipped
256 Auto Associés & Cie. 10304 Shipped
406 Auto Canal+ Petit 10211 Shipped
406 Auto Canal+ Petit 10252 Shipped
406 Auto Canal+ Petit 10402 Shipped
187 AV Stores, Co.  10110 Shipped
187 AV Stores, Co.  10306 Shipped
187 AV Stores, Co.  10332 Shipped
121 Baane Mini Imports 10103 Shipped
121 Baane Mini Imports 10158 Shipped
121 Baane Mini Imports 10309 Shipped
121 Baane Mini Imports 10325 Shipped
415 Bavarian Collectables Imports, Co.  10296 Shipped
293 BG&E Collectables NA NA
128 Blauer See Auto, Co.  10101 Shipped
128 Blauer See Auto, Co.  10230 Shipped
128 Blauer See Auto, Co.  10300 Shipped
128 Blauer See Auto, Co.  10323 Shipped
219 Boards & Toys Co.  10154 Shipped
(Top)

1.3 Right join

A right join (or outer right join) returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA or NULL values in the new columns.

This is exactly the same as a left join, described in the section above, but this time we’re returning all rows from the right-hand table (in this case, y) rather than those from the left-hand table.

A right join in which y is designated as the right-hand table (as in the animation above) would give exactly the same result as a left join in which y was designated as the left-hand table. The choice of whether a table is designated on the left or the right is entirely arbitrary.

Right joins can be carried out using RIGHT JOIN in SQL and right_join in R. No example is provided here given the similarity to left joins.

(Top)

1.4 Full join

A full join (or full outer join) returns all rows and all columns from both x and y. Where there is no match for a particular key value, the missing values will appear as NA or NULL.


Full joins are not actually supported in MySQL. However, a full join is equivalent to a combination of a left join and right join, as described above. To combine the results of these two joins, we need to use union.

Union returns all unique rows from tables x and y as follows:

Example 1.4:

Say we want to list the name of every employee from the employees table, and every customer from the customers table. Where possible, we want to match the name of each employee to his/her customers. However we also want to return the names of all employees without customers, and the names of all customers without an associated employee. This is an example of a full join.

Assume that we want our query to return:

  • each employee’s last name
  • each employee’s first name
  • each customer’s name

With SQL:

To simulate a full join in SQL, we are going to use a LEFT JOIN, a RIGHT JOIN and the UNION of the two, as follows:

SELECT 
  e.lastName, e.firstName, c.customerName 
FROM 
  employees AS e
  LEFT JOIN 
  customers AS c 
    ON e.employeeNumber = c.salesRepEmployeeNumber

UNION

SELECT 
  e.lastName, e.firstName, c.customerName 
FROM 
  employees AS e
  RIGHT JOIN 
  customers AS c 
    ON e.employeeNumber = c.salesRepEmployeeNumber

With R:

In R, the syntax is much simpler as we can use full_join:

employees %>% 
  full_join(customers, by = c('employeeNumber' = 'salesRepEmployeeNumber')) %>% 
  select(lastName, firstName, customerName)

The two sets of code above return the following 130 records. As expected, this shows the name of each employee matched to their customer(s), where such a match exists, but also the names of each employee and customer even if there is no match:


lastName firstName customerName
Murphy Diane NA
Patterson Mary NA
Firrelli Jeff NA
Patterson William NA
Bondur Gerard NA
Bow Anthony NA
Jennings Leslie Mini Gifts Distributors Ltd. 
Jennings Leslie Mini Wheels Co. 
Jennings Leslie Technics Stores Inc. 
Jennings Leslie Corporate Gift Ideas Co. 
Jennings Leslie The Sharp Gifts Warehouse
Jennings Leslie Signal Collectibles Ltd. 
Thompson Leslie Signal Gift Stores
Thompson Leslie Toys4GrownUps.com
Thompson Leslie Boards & Toys Co. 
Thompson Leslie Collectable Mini Designs Co. 
Thompson Leslie Men ‘R’ US Retailers, Ltd. 
Thompson Leslie West Coast Collectables Co. 
Firrelli Julie Cambridge Collectables Co. 
Firrelli Julie Online Mini Collectables
Firrelli Julie Mini Creations Ltd. 
Firrelli Julie Classic Gift Ideas, Inc
Firrelli Julie Collectables For Less Inc. 
Firrelli Julie Diecast Collectables
Patterson Steve Diecast Classics Inc. 
Patterson Steve Auto-Moto Classics Inc. 
Patterson Steve Marta’s Replicas Co. 
Patterson Steve Gifts4AllAges.com
Patterson Steve Online Diecast Creations Co. 
Patterson Steve FunGiftIdeas.com
Tseng Foon Yue Muscle Machine Inc
Tseng Foon Yue American Souvenirs Inc
Tseng Foon Yue Vitachrome Inc. 
Tseng Foon Yue Québec Home Shopping Network
Tseng Foon Yue Classic Legends Inc. 
Tseng Foon Yue Super Scale Inc. 
Tseng Foon Yue Microscale Inc. 
Vanauf George Land of Toys Inc. 
Vanauf George Gift Depot Inc. 
Vanauf George Canadian Gift Exchange Network
Vanauf George Royal Canadian Collectables, Ltd. 
Vanauf George Mini Classics
Vanauf George Tekni Collectables Inc. 
Vanauf George Gift Ideas Corp. 
Vanauf George Motor Mint Distributors Inc. 
Bondur Loui Saveley & Henriot, Co. 
Bondur Loui La Corne D’abondance, Co. 
Bondur Loui Lyon Souveniers
Bondur Loui Marseille Mini Autos
Bondur Loui Reims Collectables
Bondur Loui Auto Canal+ Petit
Hernandez Gerard Atelier graphique
Hernandez Gerard La Rochelle Gifts
Hernandez Gerard Euro+ Shopping Channel
Hernandez Gerard Daedalus Designs Imports
Hernandez Gerard Mini Caravy
Hernandez Gerard Alpha Cognac
Hernandez Gerard Auto Associés & Cie.
Castillo Pamela Danish Wholesale Imports
Castillo Pamela Heintze Collectables
Castillo Pamela Amica Models & Co. 
Castillo Pamela Rovelli Gifts
Castillo Pamela Petit Auto
Castillo Pamela Royale Belge
Castillo Pamela Salzburg Collectables
Castillo Pamela L’ordine Souveniers
Castillo Pamela Mini Auto Werke
Castillo Pamela Frau da Collezione
Bott Larry Toys of Finland, Co. 
Bott Larry AV Stores, Co. 
Bott Larry UK Collectables, Ltd. 
Bott Larry giftsbymail.co.uk
Bott Larry Oulu Toy Supplies, Inc. 
Bott Larry Stylish Desk Decors, Co. 
Bott Larry Suominen Souveniers
Bott Larry Double Decker Gift Stores, Ltd
Jones Barry Baane Mini Imports
Jones Barry Blauer See Auto, Co. 
Jones Barry Volvo Model Replicas, Co
Jones Barry Herkku Gifts
Jones Barry Clover Collections, Co. 
Jones Barry Toms Spezialitäten, Ltd
Jones Barry Norway Gifts By Mail, Co. 
Jones Barry Bavarian Collectables Imports, Co. 
Jones Barry Scandinavian Gift Ideas
Fixter Andy Australian Collectors, Co. 
Fixter Andy Anna’s Decorations, Ltd
Fixter Andy Souveniers And Things Co. 
Fixter Andy Australian Gift Network, Co
Fixter Andy Australian Collectables, Ltd
Marsh Peter Handji Gifts& Co
Marsh Peter Down Under Souveniers, Inc
Marsh Peter GiftsForHim.com
Marsh Peter Extreme Desk Decorations, Ltd
Marsh Peter Kelly’s Gift Shop
King Tom NA
Nishi Mami Dragon Souveniers, Ltd. 
Nishi Mami Osaka Souveniers Co. 
Nishi Mami King Kong Collectables, Co. 
Nishi Mami Cruz & Sons Co. 
Nishi Mami Tokyo Collectables, Ltd
Kato Yoshimi NA
Gerard Martin Enaco Distributors
Gerard Martin Vida Sport, Ltd
Gerard Martin CAF Imports
Gerard Martin Precious Collectables
Gerard Martin Corrida Auto Replicas, Ltd
Gerard Martin Iberia Gift Imports, Corp. 
NA NA Havel & Zbyszek Co
NA NA Porto Imports Co. 
NA NA Asian Shopping Network, Co
NA NA Natürlich Autos
NA NA ANG Resellers
NA NA Messner Shopping Network
NA NA Franken Gifts, Co
NA NA BG&E Collectables
NA NA Schuyler Imports
NA NA Der Hund Imports
NA NA Cramer Spezialitäten, Ltd
NA NA Asian Treasures, Inc. 
NA NA SAR Distributors, Co
NA NA Kommission Auto
NA NA Lisboa Souveniers, Inc
NA NA Stuttgart Collectable Exchange
NA NA Feuer Online Stores, Inc
NA NA Warburg Exchange
NA NA Anton Designs, Ltd. 
NA NA Mit Vergnügen & Co. 
NA NA Kremlin Collectables, Co. 
NA NA Raanan Stores, Inc
(Top)

2. Grouped data

In SQL and R, it is possible to group a set of rows into a summary row. This summary row can include an aggregate measure of each group; for example, the sum, the average, or the count of records in each group.

It is also possible to filter the data based on these aggregates. For example, you could filter to only show rows from groups with more than 20 records, or only show rows from groups where the average value of a certain variable was greater than 100.


2.1 Summarizing grouped data

Grouped data is often summarized using aggregate functions such as SUM, AVG, MAX, MIN and COUNT. Here we look at a couple of examples using our sample database.

Example A

Say we want to find the total value (in $) of each order in our sample data set.

To do this, we need to (1) calculate the value of each product in every order, i.e. the quantity times the price; and (2) group each product sold by its order number and add these values together.

With SQL:

In SQL we group the data using the GROUP BY clause, and specify our aggregate calculation in the SELECT clause, as follows:

SELECT 
  orderNumber,
  SUM(quantityOrdered * priceEach) AS value
FROM 
  orderdetails
GROUP BY 
  orderNumber

With R:

In R, we would use group_by() to group the data, and summarize() to apply the aggregate function to each group:

orderdetails %>% 
  group_by(orderNumber) %>% 
  summarize(value = sum(quantityOrdered * priceEach)) %>% 
  select(orderNumber, value)

The first 10 records (out of 326) are shown in the table below:

## `summarise()` ungrouping output (override with `.groups` argument)
orderNumber value
10100 10223.83
10101 10549.01
10102 5494.78
10103 50218.95
10104 40206.20
10105 53959.21
10106 52151.81
10107 22292.62
10108 51001.22
10109 25833.14
Example B

We can also group using an expression rather than values that already exist in a column.

For example, say we wanted to find the total value of orders in each year, where the order has been shipped.

To do this, we need to do a combination of the following:

  • Join the orders table, which has the date of each order, with the orderdetails table, which contains information on the quantity and price of the products ordered
  • Use the YEAR() function in SQL, and year() in R (using the lubridate package), which the returns the year corresponding to each order date. This is will be used to group the data
  • Filter the data to only include orders whose status is ‘shipped’.

This can be done in the following code:

With SQL:

SELECT 
  YEAR(orderDate) AS year,
  SUM(quantityOrdered * priceEach) AS value
FROM
  orders
  INNER JOIN
  orderdetails 
    USING (orderNumber)
WHERE
  status = 'Shipped'
GROUP BY 
  YEAR(orderDate)

With R:

orders %>% 
  inner_join(orderdetails, by = 'orderNumber') %>%
  filter(status == 'Shipped') %>% 
  group_by(year = lubridate::year(orderDate)) %>%
  summarize(value = sum(quantityOrdered * priceEach))

Which gives the total value of shipped orders for the three years:

## `summarise()` ungrouping output (override with `.groups` argument)
year value
2003 3223096
2004 4300603
2005 1341396
(Top)

2.2 Filtering grouped data

In SQL and R we can specify filter conditions for a group of rows. This assumes that we have already grouped our data using GROUP BY in SQL or group_by() in R.

When filtering grouped data in SQL we use the HAVING clause. Note that we do not use the WHERE clause, which is only used to filter ungrouped data.

In R, we filter our grouped data using filter(), which was also used with ungrouped data.

Example A

Say we want to show all order numbers in which more than 650 individual items were ordered. This requires us to group by order number (orderNumber) from the orderdetails table, and to filter the results based on the total quantity of products in each order.

With SQL:

In SQL we apply our filter using the HAVING clause, as follows:

SELECT 
  orderNumber, 
  sum(quantityOrdered) as quantity
FROM
  orderdetails
GROUP BY 
  orderNumber
HAVING 
  sum(quantityOrdered) > 650
ORDER BY 
  quantity DESC

With R:

With R, we use the filter() function:

orderdetails %>% 
  group_by(orderNumber) %>% 
  summarize(quantity = sum(quantityOrdered)) %>% 
  filter(quantity > 650) %>% 
  arrange(desc(quantity)) 

This gives the following three orders with more than 650 items:

## `summarise()` ungrouping output (override with `.groups` argument)
orderNumber quantity
10222 717
10106 675
10165 670
Example B

Assume we want to find customers with an average order value over $3,800, for orders that have been shipped.

This example is slightly more complex as we need to join three tables, as follows:

With SQL:

SELECT 
  a.customerName,
  b.status,
  avg(c.quantityOrdered * c.priceEach) as avg_value
FROM
  customers a
  INNER JOIN
  orders b 
    USING (customerNumber)
  INNER JOIN
  orderdetails c 
    USING (orderNumber)
GROUP BY 
  customerName, status
HAVING 
  avg_value > 3800 AND status = 'Shipped'
ORDER BY 
  avg_value DESC

Note that in MySQL we can use the alias for average value (avg_value) in the HAVING clause, which was defined in the SELECT clause. Apparently not all versions of SQL support the use of aliases in the HAVING clause.

With R:

customers %>% 
  left_join(orders, by = 'customerNumber') %>% 
  left_join(orderdetails, by = 'orderNumber') %>% 
  group_by(customerName, status) %>% 
  summarize(avg_value = mean(quantityOrdered * priceEach)) %>% 
  filter(avg_value > 3800 & status == 'Shipped') %>% 
  arrange(desc(avg_value))

We find that the following five customers have average shipped order sizes that are over $3,800:

## `summarise()` regrouping output by 'customerName' (override with `.groups` argument)
customerName status avg_value
Tekni Collectables Inc.  Shipped 4253.501
Super Scale Inc.  Shipped 4139.921
UK Collectables, Ltd.  Shipped 4077.812
Mini Caravy Shipped 3992.596
Gift Depot Inc.  Shipped 3816.985
(Top)

3. Subqueries

In SQL, a subquery is where one query (the ‘inner query’) nested within another query (the ‘outer query’). The result of the inner query is used as part of the outer query.

This concept is illustrated in the two examples below: one in which a subquery is used in the WHERE clause, and one in which it is used in the FROM clause.

(Top)

3.1 Subquery in the WHERE clause

In this example, we use a subquery find customers whose payments are greater than the average payment.

With SQL:

SELECT 
  customerNumber, checkNumber, amount
FROM
  payments
WHERE
  amount > 
   (SELECT AVG(amount)
    FROM payments)

With R:

The equivalent code in R is as follows:

payments %>% 
  filter(amount > mean(amount))

Which gives 134 customers, the first five of which are shown below:

customerNumber checkNumber paymentDate amount
112 HQ55022 2003-06-06 32641.98
112 ND748579 2004-08-20 33347.88
114 GG31455 2003-05-20 45864.03
114 MA765515 2004-12-15 82261.22
114 NR27552 2004-03-10 44894.74
(Top)

3.2 Subquery in the FROM clause (derived tables)

We can also use subqueries in the FROM clause. This creates what’s known as a ‘derived table’ - a virtual table which is only stored for the duration of the query.

In this example we want to find the maximum, minimum and average number of items in an order. To do this we create a subquery / derived table which shows the count of different items that were included in each order.

With SQL:

In MySQL we use a subquery to create the derived table called ‘itemcount’. This counts the number of items in each order (‘items’). Note that in MySQL derived tables must always be given an alias.

SELECT
  select max(items), min(items), floor(avg(items))
FROM( 
  SELECT 
    orderNumber, count(orderNumber) as items
  FROM orderdetails
  GROUP BY orderNumber) as itemcount

With R:

orderdetails %>% 
  # This is equivalent to the inner query
  group_by(orderNumber) %>% 
  summarise(items = n()) %>% 
  # This is equivalent to the outer query
  summarise(
    max = max(items),
    min = min(items),
    avg = floor(mean(items))
  )

This shows that the maximum number of different items in an single order was 18, the minimum was 1, and the average was 9 (rounded).

(Top)

3.3 Correlated subqueries

In the previous two examples the subqueries were independent. This means the the inner query did not depend on the outer query, and could have been run as a standalone query.

A correlated subquery is one that uses data from the outer query. In other words, the inner query depends on the outer query. A correlated subquery is evaluated once for each row in the outer query.

For example: say we want to select products with a buy price that is greater than the average buy price for its product line.

With SQL:

SELECT 
  productname, buyprice
FROM 
  products as p1
WHERE buyprice >
  (SELECT avg(buyprice)
  FROM products
  WHERE productLine = p1.productLine)

Here the inner query is finding the average buy price corresponding to the product line listed in each row of the outer query, and only selecting rows where the product price exceeds this average.

With R:

The same result can be achieved in R using a helper column called ‘avgBuyPrice’ which, for each row, shows the average price for the associated product line. The data is then filter to only show products whose price is higher than this average:

products %>%
  # This is equivalent to the inner query
  # Creates a column with the avg price of the relevant product line
  group_by(productLine) %>% 
  mutate(avgBuyPrice = mean(buyPrice)) %>% 
  ungroup() %>%
  # This is equivalent to the outer query
  filter(buyPrice > avgBuyPrice) %>% 
  select(productName, buyPrice)

This code returns 55 rows, the first five of which are shown below:

productName buyPrice
1952 Alpine Renault 1300 98.58
1996 Moto Guzzi 1100i 68.99
2003 Harley-Davidson Eagle Drag Bike 91.02
1972 Alfa Romeo GTA 85.68
1962 LanciaA Delta 16V 103.42
(Top)

3.4 Common table expressions

A common table expression (CTE), like a derived table, is a named temporary table that exists only for the duration of a query. It is created using the WITH clause.

Unlike a derived table, a CTE can be referenced multiple times in the same query. In addition, a CTE provides better readability and performance in comparison with a derived table.

As a simple illustration, we can repeat the example above (see section 3.2) where we found the maximum, minimum and average number of items in an order. This time, instead of using a subquery, the inner query can be expressed as a common table expression called ‘itemcount’

With SQL:

This time we create the temporary table called ‘itemcount’ using the WITH clause. We then select the relevant information from this CTE:

WITH itemcount AS (
    SELECT orderNumber, count(orderNumber) as items
  FROM orderdetails
  GROUP BY orderNumber)
    
SELECT 
  max(items), min(items), floor(avg(items))
FROM 
  itemcount

With R:

In R, we create a separate dataframe which is analogous to the CTE above:

# This is equivalent to the CTE table
itemcount <- orderdetails %>% 
  group_by(orderNumber) %>% 
  summarise(items = n())

# This is the code which makes use of the CTE
itemcount %>%
  summarize(
    max = max(items), 
    min = min(items), 
    avg = floor(mean(items))
  )

Here we get the same results as above:

max min avg
18 1 9
(Top)