FinDL

A Faster Way to Check Existence in MariaDB: Replacing EXISTS with Sequences

Here's a fancy problem and somewhat elegant solution to it, using MariaDB's sequence engine. Couldn't find any general name for the problem, so just follow along. The gist is in replacing the intuitive EXISTS-subquery with some sequence generation and slicing, leading to significantly faster query execution time.

You can leave comment or suggest a name for the problem in this HackerNews post

Problem

The problem is following:

Question: For a sales-date in location, were there any special products available during the day?

Database schema and data initialization

SELECT @min_date := '2025-01-01'
;
-- What product is available in which location, in what time-span
CREATE TABLE product_location (
  product int,
  location int,
  start date NOT NULL,
  end date NOT NULL,
  PRIMARY KEY (product, location, start)
);
INSERT INTO product_location
SELECT ROUND(RAND() * 1000), ROUND(RAND() * 1000),
@d := @min_date + INTERVAL ROUND(RAND() * 1000) DAY, @d + INTERVAL 1 MONTH
FROM seq_0_to_99
ON DUPLICATE KEY UPDATE end = VALUES(end)
;
-- daily sales by location
CREATE TABLE sales (
  date date,
  location int,
  sales int,
  PRIMARY KEY (date, location)
);
INSERT INTO sales
SELECT @min_date + INTERVAL ROUND(RAND() * 1000) DAY, ROUND(RAND() * 1000),
ROUND(RAND() * 100)
FROM seq_0_to_999999
ON DUPLICATE KEY UPDATE sales = VALUES(sales)

We can check that product_location has 100 rows and sales around 600k (because of duplicates).

What doesn't work

The product_location table can't be just JOINed to the sales, since one location might have many products available in one day, leading to duplicate rows. Maybe DISTINCT could be used, but that would be too simple and just hides the problem...

First approach with EXISTS

Quite literally, for every sales date and location we search if there's any products on the location during the date:

WITH a AS (
SELECT s.*, EXISTS (
  SELECT *
  FROM product_location pl
  WHERE pl.location = s.location
  AND s.date BETWEEN pl.start AND pl.end
) AS product_available
FROM sales s
)
SELECT COUNT(*) FROM a WHERE product_available
+----------+
| COUNT(*) |
+----------+
|     1945 |
+----------+
1 row in set (17.854 sec)

By examining the query using MariaDB's ANALYZE FORMAT=JSON, we can see that the EXISTS-subquery takes most of the time:

...
"table": {
  "table_name": "pl",
  "access_type": "ALL",
  "r_loops": 632692,
  "rows": 100,
  "r_rows": 99.84876053,
  "r_table_time_ms": 11438.31424,
  "r_other_time_ms": 3904.123227,
  "r_engine_stats": {
    "pages_accessed": 632692
  },
  "filtered": 100,
  "r_filtered": 0.003078822,
  "attached_condition": "pl.location = s.location and s.`date` between pl.`start` and pl.`end`"
}
...

Notice also that access_type is ALL, i.e. no indexes were used, likely because of product_location has only 100 rows.

Faster approach using seq + JOIN

We start by creating a combination of all dates and locations, and then filtering it with ranges from the product_location:

SELECT @min_date := MIN(date) FROM sales;
SELECT DISTINCT @min_date + INTERVAL seq DAY AS date, location
FROM seq_0_to_999999, product_location
WHERE @min_date + INTERVAL seq DAY BETWEEN start AND end

This can be thought as generating a N*M matrix, where N is number of dates and M is number of product-location pairs, which is sliced by valid dates and locations in the product_location-table. Using DISTINCT ensures that even if one location has overlapping products, it will show only once for the date. seq can be any big number here, since it's only used for generating dates and it gets filtered by the product-ranges.

After creation the location-ranges, they can be simply joined to the sales table straightforwardly:

SELECT @min_date := MIN(date) FROM sales;
WITH ranges AS (
SELECT DISTINCT @min_date + INTERVAL seq DAY AS date, location
FROM seq_0_to_99999, product_location
WHERE @min_date + INTERVAL seq DAY BETWEEN start AND end
)
, a AS (
SELECT s.*, r.location IS NOT NULL AS product_available
FROM sales s
LEFT JOIN ranges r ON r.date = s.date AND r.location = s.location
)
SELECT COUNT(*) FROM a WHERE product_available
| COUNT(*) |
+----------+
|     1945 |
+----------+
1 row in set (2.687 sec)

The query is now about 5-6x faster. From the query plan we can see that most of the time is used in creating temporary table for the range:

...
"r_total_time_ms": 2536.334956,
"temporary_table": {
"nested_loop": [
{
"table": {
"table_name": "product_location",
"access_type": "ALL",
"r_loops": 1,
...
"block-nl-join": {
"table": {
"table_name": "seq_0_to_99999",
"access_type": "index",
...
},
"buffer_type": "flat",
"buffer_size": "1Kb",
"join_type": "BNL",
"attached_condition": "@min_date + interval seq_0_to_99999.seq day between product_location.`start` and product_location.`end`",
"r_loops": 100,
"r_filtered": 0.03143,
"r_unpack_time_ms": 564.1461964,
"r_other_time_ms": 1964.448763,

The operation, however, is done only once and its result is materialized, and the JOIN afterwards stays fast.

Setup considerations

In terms of reproducibility, number of rows in each table is quite significant. This setup, with many rows in sales table and some in product_location, suits well for the seq-approach. When either table has low(er) number of rows, the EXISTS-subquery won't take that much time, leading same type of results from both approaches. Additionally, larger date ranges might make the seq-version slower, when more dates have to be stored per location. If the products need to be filtered as well, EXISTS-version should be more vulnerable, since a) the subquery gets more complicated b) fewer seq-ranges have to be computed and the filtering must be done only once. The seq-approach should work also when answering how many? instead of are there any?.

#mariadb #optimization