TL;DR
BigQuery features arrays, a powerful addition to standard SQL. In MySQL, JSON arrays offer similar functionality, but generating sequences requires some workarounds. MySQL's support for recursive CTEs offers even greater flexibility.
BigQuery vs MySQL
Cloud-based distributed data warehouses like BigQuery are all the rage these days. However, they rarely stand alone as containers of static data sets. Data is often first generated within operational applications backed by systems like MySQL that allow for real-time inserts and updates. These operational data feeds are then typically shipped to the warehouse through connectors like Fivetran. In addition, emerging architectures like Oracle's MySQL HeatWave allow MySQL to take on compute loads that are traditionally placed within a warehouse.
In an environment of fast-moving and developing technologies, it is desirable to stay within common patterns and not rely too heavily on special features that one vendor offers and another does not. Hence the interest in comparing SQL on BigQuery and MySQL for similar tasks.
Generate a sequence of numbers n = 1,2,3,...,N
Leopold Kronecker once said, "God made the integers, all the rest is the work of man". His quote attests to the fundamental nature of the sequence of counting numbers. It is all the more surprising how cumbersome it is to generate such numbers in SQL within modern data systems.
Fortunately, Google's BigQuery extension to standard SQL comes to the rescue with the ARRAY data type. Within BigQuery, arrays can be specified as comma-separated lists enclosed in square brackets.
Executing on the BigQuery console
select [1,2,3] as arr
returns a single row that contains the array in the following form.
Row arr
1   1
    2
    3
What makes this powerful is that a sequence can be generated dynamically using the built-in generate_array(start, end, stepsize) function.
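For example,
select generate_array(1,5,2) as odd_numbers
returns a single row holding the array [1, 3, 5].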
Finally, UNNEST() transforms array elements back into individual rows, as shown below
select odd_number from UNNEST(generate_array(1,5,2)) as odd_number
resulting in a table where each row contains a single number.
Row odd_number
1 1
2 3
3 5
Using this technique we can tackle such fun problems as summing all numbers between 1 and 100
select sum(number) as total from UNNEST(generate_array(1,100,1)) as number
which, by the way, Johann Carl Friedrich Gauss had already figured out as 50*101=5050 (without having to google it), estimating PI from a simulation of 1,000,000 samples
select
-- a random point (rand(), rand()) in the unit square lies inside the quarter
-- circle of radius 1 with probability pi/4, so 4 times the hit rate estimates pi
4*avg(if(pow(rand(),2)+pow(rand(),2)<1,1,0)) as pi
from
UNNEST(generate_array(1,1000000,1)) as i
or approximating the value of PI from the series formula of Bailey-Borwein-Plouffe.
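The series reads
pi = sum over k = 0,1,2,... of 16^(-k) * ( 4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6) )
and translates to SQL as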
with s as
(
select
k,
pow(16,-k)*
(
4/(8*k+1)
-2/(8*k+4)
-1/(8*k+5)
-1/(8*k+6)
) as n
from UNNEST(generate_array(0,11,1)) as k
)
select
k,
sum(n) over(order by k) as bbp_pi,
acos(-1) as pi,
sum(n) over(order by k)-acos(-1) as error
from s
which converges quickly: each additional term shrinks the error by better than a factor of 16, adding at least one more hexadecimal digit of precision.
Row k bbp_pi pi error
1 0 3.1333333333333333 3.141592653589793 -0.008259320256459812
2 1 3.1414224664224664 3.141592653589793 -1.7018716732675188E-4
3 2 3.141587390346581 3.141592653589793 -5.26324321192817E-6
4 3 3.141592457567435 3.141592653589793 -1.9602235790117106E-7
5 4 3.141592645460336 3.141592653589793 -8.12945710748636E-9
6 5 3.1415926532280873 3.141592653589793 -3.6170577644156765E-10
7 6 3.1415926535728804 3.141592653589793 -1.6912693467929785E-11
8 7 3.1415926535889724 3.141592653589793 -8.206768598029157E-13
9 8 3.141592653589752 3.141592653589793 -4.1300296516055823E-14
10 9 3.141592653589791 3.141592653589793 -2.220446049250313E-15
11 10 3.1415926535897927 3.141592653589793 -4.440892098500626E-16
12 11 3.141592653589793 3.141592653589793 0.0
Histograms, counting zip-codes
Being able to generate sequences of numbers comes in handy when one needs an independent data axis to report on. Let's consider the example of census data that is reported as population counts by zip code. Counts are provided in the table by cohort as well as rolled up into subtotals, so to retrieve the overall total for each zip code we fetch its row with the maximum value. We want to generate a report of the number of zip codes whose total population falls within a given range bucket.
Below we first generate buckets of size 10,000 and then count the number of zip code regions that fall into each bucket.
with total_population as
(
select
zipcode,
max(population) as population -- the row with the largest amount has the aggregate total
from
`bigquery-public-data.census_bureau_usa.population_by_zip_2010` -- backticks are required because of the hyphens in the project name
group by
zipcode
)
select
population_bucket,
count(zipcode) as number_of_zipcodes
from UNNEST(generate_array(10000,150000,10000)) as population_bucket
left join total_population
on
population between population_bucket-10000 and population_bucket
group by
population_bucket
order by
population_bucket
Generating the buckets independently of the population table ensures that a bucket is reported even when no zip code region falls into it, as shown in the rows corresponding to buckets 130,000 and above. (Note that between is inclusive on both ends, so a population sitting exactly on a bucket boundary is counted in two adjacent buckets.)
Row population_bucket number_of_zipcodes
1 10000 23520
2 20000 3779
3 30000 2621
4 40000 1664
5 50000 831
6 60000 399
7 70000 167
8 80000 75
9 90000 34
10 100000 22
11 110000 8
12 120000 2
13 130000 0
14 140000 0
15 150000 0
MySQL 8.0: Hello JSON_ARRAY
To experiment with MySQL, we use version 8.0 on dbfiddle.uk. MySQL does not feature arrays as directly as BigQuery; as a workaround, however, it supports JSON types that can represent arrays.
select JSON_ARRAY(1,2,3);
yields
[1, 2, 3]
So far so good. Unfortunately, there is no generate_json_array function. To continue our quest to generate a sequence of numbers with arrays, we need to resort to workarounds, such as first generating a carefully formatted string that can then be cast into a JSON value.
The following first step looks promising
select cast(concat("[1",repeat(",1",3-1),"]") as JSON)
which gives a JSON array of repeated unit elements
[1,1,1]
Next, we transform this back to a table of sequential numbers using the JSON_TABLE function. Similar to UNNEST in the BigQuery case, JSON_TABLE extracts each element of the JSON array into rows of a single column.
select number from
JSON_TABLE
(
cast(concat("[1", repeat(",1", 3-1), "]") as JSON),
"$[*]" -- selects all elements in the JSON array
COLUMNS(number FOR ORDINALITY) -- enumerates rows
) numbers;
Rather than taking the actual value of each element, we extract its ordinal position, which yields the desired sequence of increasing numbers.
number
1
2
3
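As an aside, JSON_TABLE can also return the element values themselves through a PATH column; a small sketch, not needed for our sequence trick:
select val from
JSON_TABLE
(
JSON_ARRAY(10, 20, 30),
"$[*]" -- all elements of the array
COLUMNS(val INT PATH "$") -- the element value itself, not its position
) vals;
which yields the three rows 10, 20, and 30.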
With these methods at hand, the MySQL version of the Gauss summation becomes
select sum(number) from
JSON_TABLE
(
cast(concat("[1",repeat(",1",100-1),"]") as JSON),
"$[*]"
COLUMNS(number FOR ORDINALITY)
) numbers;
which again yields 5050, and the BBP series for PI translates to
with s as
(
select
k,
pow(16,-k+1)*
(
4/(8*k-7)
-2/(8*k-4)
-1/(8*k-3)
-1/(8*k-2)
) as n
FROM
JSON_TABLE
(
cast(concat("[1",repeat(",1",11-1),"]") as JSON),
"$[*]"
COLUMNS(k FOR ORDINALITY)
) as numbers
)
select
k,
sum(n) over(order by k) as bbp_pi,
acos(-1) as pi,
sum(n) over(order by k)-acos(-1) as error
from s
where we adjusted the terms for the fact that FOR ORDINALITY starts counting at k=1 instead of k=0: substituting k-1 into the series turns 16^(-k) into 16^(-k+1) and the denominators 8k+1, 8k+4, 8k+5, 8k+6 into 8k-7, 8k-4, 8k-3, 8k-2.
Recursions: let's do the time warp again
It's astounding: MySQL has yet another trick up its sleeve in supporting recursive common table expressions. It's just a jump to the left, and Gauss gets his summation of the first 100 numbers as
set @@cte_max_recursion_depth = 100;
with recursive numbers(n) as
(
select 1
union all -- no duplicates arise here, so skip the distinct check of a plain union
select n+1 from numbers where n<100
)
select sum(n) from numbers;
where we set cte_max_recursion_depth to the required capacity (the default of 1000 would already have sufficed here). That's fantastic, so why bother with JSON arrays when recursions are so powerful? One limitation is that the permissible recursion depth, cte_max_recursion_depth for CTEs or max_sp_recursion_depth for stored procedures, might be restricted on your particular system. The other aspect that makes arrays interesting is their ability to store more complex structures than just individual numbers and strings. That shall be a topic for another time.
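For illustration, the BBP summation from above can be rewritten as a recursive CTE with no JSON detour at all; a minimal sketch, relying on the default cte_max_recursion_depth of 1000:
with recursive terms(k, n) as
(
-- the k = 0 term; the literal 4e0 makes n a double column so that
-- the small later terms are not truncated to a fixed decimal scale
select 0, 4e0 - 2/4 - 1/5 - 1/6
union all
select
k+1,
pow(16,-(k+1))*
(
4/(8*(k+1)+1)
-2/(8*(k+1)+4)
-1/(8*(k+1)+5)
-1/(8*(k+1)+6)
)
from terms where k < 11
)
select sum(n) as bbp_pi from terms;
The recursion carries k along and computes each term directly, which reads much closer to the mathematical series than the repeated-string workaround.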