Find and Delete Duplicate Rows in MySQL with GROUP BY and HAVING Clauses
Introduction
Sometimes, you might see more than one copy of the same record appearing in a MySQL database table. This usually happens after importing data from external sources such as spreadsheet applications that don't have the features of relational databases. Having identical copies of the same record may negatively affect your application and business logic. For instance, if customers get registered multiple times in your invoicing software, this might complicate the process of allocating credit limits. To overcome this challenge, you must delete the duplicate customers' records from your database.
In this guide, you'll create a test_db
database and a sample customers
table. You'll then populate the table with a random list of customers' details containing some duplicates. Finally, you'll use the MySQL GROUP BY
and HAVING
clauses to find and DROP
duplicate rows.
Prerequisites
To follow along with this guide, make sure you have the following:
- An Ubuntu 20.04 server.
- A sudo user.
- A LAMP Stack. To test this guide, you only need to install the MySQL database server.
Create a test_db
Database
First, SSH
to your server and run the command below to log in to your MySQL database server as root.
$ sudo mysql -u root -p
When prompted, enter the root password of your MySQL server and press Enter to continue. Then, run the command below to create a test_db
database.
mysql> CREATE DATABASE test_db;
Use the test_db
database.
mysql> USE test_db;
Next, set up a customers
table. In this table, you'll uniquely identify the customers by referring to the customer_id
column, which should be auto-populated since you'll define it with a PRIMARY KEY
statement. The table will then record the first_name
, last_name
and phone
number of the customer.
Create the customers
table.
mysql> CREATE TABLE customers (
customer_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
phone VARCHAR(15)
) ENGINE = InnoDB;
Next, populate the customers
table with some records. Please note that it is possible for you to INSERT
details of the same customer multiple times in this table. As earlier indicated, you might be importing some records from an external data source that lacks the functionalities of a relational database.
Run the INSERT
commands one by one to populate the table.
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('JOHN', 'DOE', '111111');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('MARY', 'SMITH', '222222');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('JACOB', 'JAMES', '444444');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('STEVE', 'JACKES', '888888');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('MARY', 'SMITH', '222222');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('RIAN', 'WHITE', '101010');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('APPLE', 'GRABRIEL', '242424');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('RIAN', 'WHITE', '101010');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('JACOB', 'JAMES', '444444');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('JOHN', 'DOE', '111111');
After each INSERT
command, you should get the following output indicating that the record was inserted successfully.
...
Query OK, 1 row affected (0.01 sec)
Ensure the customer's details are in place by running the following SELECT
statement against the customers
table.
mysql> SELECT
customer_id,
first_name,
last_name,
phone
FROM customers;
Confirm the output below.
+-------------+------------+-----------+--------+
| customer_id | first_name | last_name | phone |
+-------------+------------+-----------+--------+
| 1 | JOHN | DOE | 111111 |
| 2 | MARY | SMITH | 222222 |
| 3 | JACOB | JAMES | 444444 |
| 4 | STEVE | JACKES | 888888 |
| 5 | MARY | SMITH | 222222 |
| 6 | RIAN | WHITE | 101010 |
| 7 | APPLE | GRABRIEL | 242424 |
| 8 | RIAN | WHITE | 101010 |
| 9 | JACOB | JAMES | 444444 |
| 10 | JOHN | DOE | 111111 |
+-------------+------------+-----------+--------+
10 rows in set (0.00 sec)
As you can see from the output above, JOHN DOE's
, MARY SMITH's
, RIAN WHITE
and JACOB JAMES'
records have been repeated. You can notice this difference by physically examining the output from the SELECT
statement above because you just have a few records in your table.
Your table might contain thousands or millions of records in a production environment, and finding and eliminating duplicates manually may not be a feasible solution. In the next step, you'll learn how to do this using a single SQL statement.
Determine the Duplicate Rows
In MySQL, you can find the duplicate rows by executing a GROUP BY
clause against the target column and then using the HAVING
clause to check a group having more than 1
record.
For instance, to find duplicates in your sample customers
table, use the MySQL CONCAT
statement to concatenate the first_name
, last_name
and phone
fields to a single derived column that you'll use to group the records. Then, use the HAVING
clause to check any resulting group having multiple records.
When completed, your SQL syntax should be similar to the following statement.
mysql> SELECT
customer_id,
first_name,
last_name,
phone,
COUNT(CONCAT(first_name, last_name, phone)) as total_count
FROM
customers
GROUP BY
CONCAT(first_name, last_name, phone)
HAVING total_count > 1 ;
Execute the command above. Then, examine the output below. As you can see from the following list, the SQL command has retrieved all customers having more than one record.
+-------------+------------+-----------+--------+-------------+
| customer_id | first_name | last_name | phone | total_count |
+-------------+------------+-----------+--------+-------------+
| 1 | JOHN | DOE | 111111 | 2 |
| 2 | MARY | SMITH | 222222 | 2 |
| 3 | JACOB | JAMES | 444444 | 2 |
| 6 | RIAN | WHITE | 101010 | 2 |
+-------------+------------+-----------+--------+-------------+
4 rows in set (0.01 sec)
The next step is determining the rows that should remain when the duplicates are deleted. Use the command below to get the PRIMARY KEYs
of these rows.
mysql> SELECT MIN(customer_id)
FROM CUSTOMERS
GROUP BY CONCAT(first_name, last_name, phone);
Essentially, the MySQL MIN()
function returns the first record in each group of the customer's records when grouped by the concatenated column. If a customer name is repeated several times, the MIN()
function ensures that only the PRIMARY KEY
of the record that appears first for each group is retrieved, as shown below.
+------------------+
| MIN(customer_id) |
+------------------+
| 1 |
| 2 |
| 3 |
| 4 |
| 6 |
| 7 |
+------------------+
6 rows in set (0.00 sec)
Drop the Duplicate Rows
Once you've determine your clean list of the customer_ids
, get rid of the duplicate rows by telling MySQL to delete all records from the customers
table(DELETE FROM customers....
) EXCEPT
those that appear in the allow list. In other words, delete everything from the customers
table apart from the allowlisted records(NOT IN
).
Please note, in MySQL, you can't modify the same table when using a SELECT
command against it in the same SQL statement. To avoid encountering an error, nest the ....SELECT MIN()...
statement one level deeper as shown below.
mysql> DELETE FROM customers
WHERE customer_id NOT IN
(
SELECT
customer_id
FROM
(
SELECT MIN(customer_id) as customer_id
FROM CUSTOMERS
GROUP BY CONCAT(first_name, last_name, phone)
) AS duplicate_customer_ids
);
Once you execute the command above, the MySQL server should delete the 4
duplicate records as you can confirm from the output below.
Query OK, 4 rows affected (0.01 sec)
Ensure that your customers
table doesn't contain any more duplicates by executing the GROUP BY
and HAVING
statement one more time.
mysql> SELECT
customer_id,
first_name,
last_name,
phone,
COUNT(CONCAT(first_name, last_name, phone)) as total_count
FROM
customers
GROUP BY
CONCAT(first_name, last_name, phone)
HAVING total_count > 1 ;
This time around, you should get an Empty set
since there are no duplicates.
Empty set (0.00 sec)
Query the customers
table again.
mysql> SELECT
customer_id,
first_name,
last_name,
phone
FROM customers;
Even by physically examining the list below, you can see that the duplicate records have been removed completely.
+-------------+------------+-----------+--------+
| customer_id | first_name | last_name | phone |
+-------------+------------+-----------+--------+
| 1 | JOHN | DOE | 111111 |
| 2 | MARY | SMITH | 222222 |
| 3 | JACOB | JAMES | 444444 |
| 4 | STEVE | JACKES | 888888 |
| 6 | RIAN | WHITE | 101010 |
| 7 | APPLE | GRABRIEL | 242424 |
+-------------+------------+-----------+--------+
6 rows in set (0.00 sec)
Conclusion
In this guide, you've learned how to use the MySQL GROUP BY
and HAVING
clauses to find and delete duplicate records in a database. Use the logic each time you import data from non-relational data sources to eliminate identical rows that might interfere with your business logic.