BigQuery Arrays vs MySQL JSON: Generating sequences.

TL;DR

BigQuery features arrays, a powerful addition to standard SQL. In MySQL, JSON arrays offer similar functionality, but generating sequences requires some workarounds. MySQL's support for recursive queries offers even greater flexibility.

BigQuery vs MySQL

Cloud-based distributed data warehouses like BigQuery are all the rage these days; however, they rarely stand alone as containers of static data sets. Data is often first generated within operational applications backed by systems like MySQL that allow for real-time inserts and updates. Those operational data feeds are then usually shipped to the warehouse through connectors like Fivetran. In addition, emerging architectures like Oracle HeatWave allow MySQL to take on compute loads that are traditionally placed within a warehouse.

In an environment of fast-moving and evolving technologies, it becomes desirable to stay within common patterns and not rely too heavily on special features that one vendor offers but the other does not. Hence the interest in comparing SQL on BigQuery and MySQL for similar tasks.

Generate a sequence of numbers n = 1,2,3,...,N

Leopold Kronecker once said, "God made the integers, all the rest is the work of man". His quote attests to the fundamental nature of the sequence of counting numbers. It is all the more surprising how cumbersome it is to generate such numbers in SQL within modern data systems.

Fortunately, Google's BigQuery extension to standard SQL features the ARRAY data type, which comes to the rescue. Within BigQuery, arrays can be specified as comma-separated lists in square brackets.

Executing on the BigQuery console

select [1,2,3] as arr

returns a single row that contains the array in the following form.

Row    arr    
 1       1    
         2
         3

What makes this powerful is that a sequence can be generated dynamically using the built-in generate_array(start, end, stepsize) function. Finally, UNNEST() transforms the array elements back into individual rows, as shown by

select odd_number from UNNEST(generate_array(1,5,2)) as odd_number

which results in a table where each row contains a single number.

Row odd_number
1     1
2     3
3     5

Using this technique we can tackle such fun problems as summing all numbers from 1 to 100

select sum(number) as total from UNNEST(generate_array(1,100,1)) as number

which, by the way, Johann Carl Friedrich Gauss had already figured out as 50*101=5050 (without having to google it), to estimate PI from a simulation of 1,000,000 samples (the fraction of random points in the unit square that land inside the unit quarter circle approaches PI/4)

select 
          4*avg(if(pow(rand(),2)+pow(rand(),2)<1,1,0)) as pi
from 
           UNNEST(generate_array(1,1000000,1)) as i

or to approximate the value of PI from the series formula of Bailey-Borwein-Plouffe.
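
For reference, the series that the query below implements term by term is

\pi = \sum_{k=0}^{\infty} 16^{-k}\left(\frac{4}{8k+1}-\frac{2}{8k+4}-\frac{1}{8k+5}-\frac{1}{8k+6}\right)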

with s as 
(
  select
   k,
   pow(16,-k)*
   (
    4/(8*k+1)
    -2/(8*k+4)
    -1/(8*k+5)
    -1/(8*k+6)
   ) as n
from UNNEST(generate_array(0,11,1)) as k  
)
select 
  k, 
  sum(n) over(order by k) as bbp_pi, 
  acos(-1) as pi, 
  sum(n) over(order by k)-acos(-1) as error 
from s

The sum converges pretty quickly.

Row    k    bbp_pi             pi                error    
1      0    3.1333333333333333 3.141592653589793 -0.008259320256459812
2      1    3.1414224664224664 3.141592653589793 -1.7018716732675188E-4
3      2    3.141587390346581  3.141592653589793 -5.26324321192817E-6    
4      3    3.141592457567435  3.141592653589793 -1.9602235790117106E-7
5      4    3.141592645460336  3.141592653589793 -8.12945710748636E-9
6      5    3.1415926532280873 3.141592653589793 -3.6170577644156765E-10    
7      6    3.1415926535728804 3.141592653589793 -1.6912693467929785E-11
8      7    3.1415926535889724 3.141592653589793 -8.206768598029157E-13    
9      8    3.141592653589752  3.141592653589793 -4.1300296516055823E-14    
10     9    3.141592653589791  3.141592653589793 -2.220446049250313E-15
11    10    3.1415926535897927 3.141592653589793 -4.440892098500626E-16
12    11    3.141592653589793  3.141592653589793  0.0

Histograms, counting zip-codes

Being able to generate sequences of numbers comes in handy when one needs an independent data axis to report on. Consider the example of census data that is reported as population counts by zip code. Counts are provided in the table by cohort as well as rolled up into subtotals, so to retrieve the overall total we fetch the row with the maximum value for each zip code. We want to generate a report of the number of zip codes that fall into each total-population bucket.

Below we first generate buckets of size 10,000 and then count the number of zip code regions that fall into each population bucket.

with total_population as
(
select 
            zipcode,
            max(population) as population  --  the row with the largest value holds the aggregate total
from 
            `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
group by 
            zipcode
)

select 
            population_bucket, 
            count(zipcode) as number_of_zipcodes 
from 
            UNNEST(generate_array(10000,150000,10000)) as population_bucket 
            left join total_population
on 
            population between population_bucket-10000 and population_bucket
group by 
            population_bucket
order by 
            population_bucket

Generating the buckets independently of the population table ensures that a bucket is reported even when there are no corresponding zip code regions in the table, as shown in the rows for buckets 130,000 and above.

Row    population_bucket    number_of_zipcodes    
1      10000                      23520
2      20000                      3779
3      30000                      2621
4      40000                      1664
5      50000                      831
6      60000                      399
7      70000                      167
8      80000                      75
9      90000                      34
10    100000                      22
11    110000                      8
12    120000                      2
13    130000                      0
14    140000                      0
15    150000                      0

MySQL 8.0: Hello JSON_ARRAY

To experiment with MySQL we use version 8.0 on dbfiddle.uk. MySQL does not feature arrays as directly as BigQuery does; however, as a sort of workaround, it supports JSON types that can represent arrays.

select JSON_ARRAY(1,2,3);

yields

[1, 2, 3]

So far so good. Unfortunately, there is no generate_json_array function. To continue our quest to generate a sequence of numbers with arrays, we need to resort to workarounds, such as first generating a carefully formatted string that can then be cast into a JSON value.

The following first step looks promising

select cast(concat("[1",repeat(",1",3-1),"]") as JSON)

which gives a JSON array of repeated unit elements

[1, 1, 1]

Next, we transform this back into a table of sequential numbers using the JSON_TABLE function. Similar to UNNEST in the BigQuery case, JSON_TABLE turns each element of the JSON array into a row of a single-column table.

select number from
JSON_TABLE
( 
       cast(concat("[1", repeat(",1", 3-1), "]") as JSON),
       "$[*]"    -- selects all elements in the JSON array
       COLUMNS(number FOR ORDINALITY)    -- enumerates rows
) numbers;

Rather than taking the actual value of each element, we extract its ordinal position, which generates the desired sequence of increasing numbers.

number
1
2
3
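
For comparison, here is a minimal sketch of how the same JSON_TABLE call can also expose the element values themselves through a PATH column next to the ordinal position (the column names seq and val are just illustrative; every value is the dummy 1 we repeated):

select seq, val from
JSON_TABLE
( 
       cast('[1,1,1]' as JSON),
       "$[*]"
       COLUMNS(
           seq FOR ORDINALITY,    -- row number: 1, 2, 3
           val INT PATH "$"       -- the element itself: always 1 here
       )
) numbers;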

With these methods at hand, the MySQL version of the Gauss summation becomes

select sum(number) from
JSON_TABLE
( 
   cast(concat("[1",repeat(",1",100-1),"]") as JSON),
   "$[*]"
   COLUMNS(number FOR ORDINALITY)
) numbers;

and the formula for generating PI translates to

with s as 
(
 select
   k,
   pow(16,-k+1)*
   (
     4/(8*k-7) 
    -2/(8*k-4) 
    -1/(8*k-3) 
    -1/(8*k-2) 
   ) as n
FROM
JSON_TABLE
( 
           cast(concat("[1",repeat(",1",11-1),"]") as JSON),
           "$[*]"
           COLUMNS(k FOR ORDINALITY)
) as numbers

)
select 
          k,
          sum(n) over(order by k) as bbp_pi,  
          acos(-1) as pi, 
          sum(n) over(order by k)-acos(-1) as error 
from s

where we adjusted the terms to account for the fact that k now starts at 1 instead of 0.
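
For completeness, the earlier Monte Carlo estimate of PI can be translated with the same JSON_TABLE trick; a sketch, with the sample count reduced to 100,000 so that the generated helper string stays small:

select 
          4*avg(if(pow(rand(),2)+pow(rand(),2)<1,1,0)) as pi
from
JSON_TABLE
( 
           cast(concat("[1",repeat(",1",100000-1),"]") as JSON),
           "$[*]"
           COLUMNS(i FOR ORDINALITY)
) as samples;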

Recursions: let's do the time warp again

It's astounding: MySQL has another trick up its sleeve by supporting recursive common table expressions. It's just a jump to the left, and Gauss gets his summation of the first 100 numbers as

set @@cte_max_recursion_depth = 100;
with recursive numbers(n) as
(
  select 1
  union
  select n+1 from numbers where n<100
)
select sum(n) from numbers;

where we set cte_max_recursion_depth to the required capacity. That's fantastic, so why bother with JSON_ARRAYs when recursions are so powerful? One limitation is that the allowed recursion depth (cte_max_recursion_depth, or max_sp_recursion_depth inside stored routines) might be restricted on your particular system; see the sketch below. The other aspect that makes arrays interesting is their ability to store more complex structures than just individual numbers and strings. That shall be a topic for another time.
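
As a minimal sketch of that limit (assuming the session is allowed to change the variable, which may not be the case on managed systems):

-- Inspect the current limit; the default is 1000
show variables like 'cte_max_recursion_depth';

-- Raise it for this session and generate a longer sequence
set session cte_max_recursion_depth = 100000;
with recursive numbers(n) as
(
  select 1
  union all
  select n+1 from numbers where n < 100000
)
select count(*) as how_many, sum(n) as total from numbers;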
