Latest Row (deduplicate)

Latest Row (deduplicate)

Query to return only the latest row based on a column or columns

With BigQuery it is common to have duplicates in your tables especially datalake ones so we need to remove duplicates and often want the latest version of each row. We can make use of the ROW_NUMBER window/analytical function to get a count of each record with the same key and then using QUALIFY to only return the 1st row order by the latest first.

Example

Here we have some rows with duplicates. I’ve added the ROW_NUMBER so we can see the values it returns.

WITH
  sample_data AS (
  SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
  UNION ALL
  SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
  UNION ALL
  SELECT 3 AS product_id,'connector' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
  UNION ALL
  SELECT 1 AS product_id,'widgit' AS description,'Inactive' AS status,'2019-01-01' AS created_date,'2021-02-01' AS modified_date
  UNION ALL
  SELECT 2 AS product_id,'borble thing' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-03-01' AS modified_date )
SELECT
  *, ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY modified_date DESC, created_date DESC) as row_no
FROM
  sample_data
|product_id|description |status  |created_date|modified_date|row_no|
+----------+------------+--------+------------+-------------+------+
|1         |widgit      |Inactive|2019-01-01  |2021-02-01   |1     |
|1         |widgit      |Active  |2019-01-01  |2021-01-01   |2     |
|2         |borble thing|Active  |2019-01-01  |2021-03-01   |1     |
|2         |borble      |Active  |2019-01-01  |2021-01-01   |2     |
|3         |connector   |Active  |2019-01-01  |2021-01-01   |1     |
+----------+------------+--------+------------+-------------+------+

We can use QUALIFY to only return a single row. Also if we use a named subquery/cte we can work with the deduped as if it was just a table.
The product_id can be replaced with a list of columns to specify the unique rows.
Note
QUALIFY always needs a WHERE

WITH
  sample_data AS (
  SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
  UNION ALL
  SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
  UNION ALL
  SELECT 3 AS product_id,'connector' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
  UNION ALL
  SELECT 1 AS product_id,'widgit' AS description,'Inactive' AS status,'2019-01-01' AS created_date,'2021-02-01' AS modified_date
  UNION ALL
  SELECT 2 AS product_id,'borble thing' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-03-01' AS modified_date 
  ),
data_deduped as (
SELECT
  *
FROM
  sample_data
WHERE
  1=1 
QUALIFY ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY modified_date DESC, created_date DESC)=1
  )
SELECT
  *
FROM
  data_deduped
+----------+------------+--------+------------+-------------+
|product_id|description |status  |created_date|modified_date|
+----------+------------+--------+------------+-------------+
|1         |widgit      |Inactive|2019-01-01  |2021-02-01   |
|2         |borble thing|Active  |2019-01-01  |2021-03-01   |
|3         |connector   |Active  |2019-01-01  |2021-01-01   |
+----------+------------+--------+------------+-------------+

References

3 Likes

Hi, thanks for this!
I made an interactive version of this snippet here: Latest Row (deduplicate)

1 Like

I’ve updated the sample data to make it more intersting

Hi, I am not sure this dataset counts as duplicate since each row is unique?