Binning in MySQL

A versatile approach to binning data in MySQL

·

4 min read

Binning, and eventually graphing as a histogram, is an effective and useful way to analyze and understand trends in your data. BI tools like Tableau provide easy binning capabilities so you can bin on ranges like month or year. But is there a way without a BI tool?

Presented here is a simple, effective, and highly versatile method for binning your data directly in MySQL.

To have a highly versatile method, we should have some initial CTE to define our bins, our Select Statement to get our data, and a Group By. This method will just allow us to swap our select statement very easily without changing too much of the code.

The first example will bin on predefined ranges; the second example will use recursive CTEs to 'auto-magically' create bins.

First some fake data:

CREATE TABLE `books` ( 
    `id` INT NOT NULL AUTO_INCREMENT, 
    `name` VARCHAR(255)  NOT NULL, 
    `pages` VARCHAR(255)  NOT NULL,
    `created_at` DATE  NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1;

INSERT INTO books (name, pages, created_at) VALUES 
("Book 1",  228, "2000-01-03") ,
("Book 2",  235, "2000-01-07") ,
("Book 3",  197, "2000-02-09") ,
("Book 4",  179, "2000-03-14") ,
("Book 5",  197, "2000-04-12") ,
("Book 6",  230, "2000-04-27") ,
("Book 7",  217, "2000-05-29") ,
("Book 8",  197, "2000-05-08") ,
("Book 9",  241, "2000-05-29") ,
("Book 10", 240, "2000-05-02") ,
("Book 11", 230, "2000-06-01") ,
("Book 12", 198, "2000-07-09") ,
("Book 13", 234, "2000-08-17") ,
("Book 14", 240, "2000-09-30") ,
("Book 15", 238, "2000-10-03");

Example 1: Bin over page number
We can define some ranges and then bin over them. Some useful ranges would be

  • 0 - 10 page
  • 11 - 100 pages
  • 101 - 200 pages
  • 201 + pages

We then left join our books table onto our bins with a where statement and then we can group.

WITH bins (bin_name, bin_start, bin_end) AS (
   SELECT   '0-10' AS bin_name,  0 AS bin_start, 10 AS bin_end
   UNION SELECT '11-100' , 11 , 100
   UNION SELECT '101-200' , 101 , 200
   UNION SELECT '201+' , 201 , NULL
),
T AS (
    SELECT
        pages AS `value`
   FROM books
)
SELECT
       bins.bin_name,
       COUNT(T.value) AS count,
       COUNT(T.value) / (SELECT COUNT(*) FROM T) * 100 AS percentages
FROM bins
LEFT JOIN T ON T.value
  WHERE  (
     T.value >= bins.bin_start 
     AND (T.value < bins.bin_end OR bins.bin_end IS NULL)
  )
GROUP BY 1;

Giving us:

+----------+-------+-------------+
| bin_name | count | percentages |
+----------+-------+-------------+
| 0-10     |     1 |      6.6667 |
| 11-100   |     2 |     13.3333 |
| 101-200  |     3 |     20.0000 |
| 201+     |     8 |     53.3333 |
+----------+-------+-------------+

It is so versatile because we can swap in any table you'd like for T and adjust the bin ranges and it just works.

Example 2: Bin with recursive CTE
What if we want to bin over something like months, and don't want to list out all the dates. We can use a recursive CTE to do the work for us!

WITH RECURSIVE bins as (
  SELECT '2000-01-01' as dt, MONTH('2000-01-01') as `month`, YEAR('2000-01-01') as `year`
  UNION
    SELECT
        DATE_ADD(bins.dt, INTERVAL 1 MONTH),
        MONTH(DATE_ADD(bins.dt, INTERVAL 1 MONTH)),
        YEAR(DATE_ADD(bins.dt, INTERVAL 1 MONTH))
    FROM bins WHERE DATE_ADD(bins.dt, INTERVAL 1 MONTH) <= '2001-01-01'
),
T AS (
    SELECT
        id,
        MONTH(created_at) AS `month`,
        YEAR(created_at) AS `year`
   FROM books
)
SELECT
   bins.year,
   bins.month,
   COUNT(T.id) AS count,
   COUNT(T.id) / (SELECT COUNT(T.id) FROM T) * 100 AS percentage
FROM bins
LEFT JOIN T ON (T.month = bins.month and T.year = bins.year)
GROUP BY 1, 2
ORDER BY 1, 2

And we get:

+------+-------+-------+------------+
| year | month | count | percentage |
+------+-------+-------+------------+
| 2000 |     1 |     2 |    13.3333 |
| 2000 |     2 |     1 |     6.6667 |
| 2000 |     3 |     1 |     6.6667 |
| 2000 |     4 |     2 |    13.3333 |
| 2000 |     5 |     4 |    26.6667 |
| 2000 |     6 |     1 |     6.6667 |
| 2000 |     7 |     1 |     6.6667 |
| 2000 |     8 |     1 |     6.6667 |
| 2000 |     9 |     1 |     6.6667 |
| 2000 |    10 |     1 |     6.6667 |
| 2000 |    11 |     0 |     0.0000 |
| 2000 |    12 |     0 |     0.0000 |
| 2001 |     1 |     0 |     0.0000 |
+------+-------+-------+------------+

With these two examples you can now bin by pre-defined ranges or by dates in a clean, simple, and versatile manner!