Data source

Every pipeline starts somewhere. Ours starts with a data source exposed as a set of HTTP APIs. There are several of them, and each represents a different piece of the puzzle. We have to download, shape and combine the data from all of them to get the full picture. So let's analyse what this data source looks like.

Population

This one is probably the most self-explanatory. It provides population statistics broken down by NSW postal codes. Pay attention to "postal codes"; we will come back to them later. For now, let's see what this API has to offer:

curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/population.json
[
    {
        "POA_NAME16": 2006,
        "Combined": "THE UNIVERSITY OF SYDNEY",
        "Tot_p_p": 1259
    },
    {
        "POA_NAME16": 2007,
        "Combined": "BROADWAY,ULTIMO",
        "Tot_p_p": 8845
    },
    {
        "POA_NAME16": 2008,
        "Combined": "CHIPPENDALE,DARLINGTON,GOLDEN GROVE",
        "Tot_p_p": 11712
    },
...
  • POA_NAME16 - looks like a postal code value
  • Combined - comma-separated names of suburbs combined under the same postal code
  • Tot_p_p - total population
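Since the postal code is what the other APIs are keyed on, it can help to index the population records by it. The following is a minimal Python sketch, not the author's implementation: `index_population` is a hypothetical helper, and zero-padding the code to four digits is an assumption made because `POA_NAME16` arrives here as a number while the other APIs return it as a string.

```python
from typing import Any

# Sample records copied from the response above.
population_records = [
    {"POA_NAME16": 2006, "Combined": "THE UNIVERSITY OF SYDNEY", "Tot_p_p": 1259},
    {"POA_NAME16": 2007, "Combined": "BROADWAY,ULTIMO", "Tot_p_p": 8845},
    {"POA_NAME16": 2008, "Combined": "CHIPPENDALE,DARLINGTON,GOLDEN GROVE", "Tot_p_p": 11712},
]


def index_population(records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Index population records by postal code (hypothetical helper)."""
    indexed: dict[str, dict[str, Any]] = {}
    for record in records:
        # Assumption: normalise the numeric code to a 4-character string
        # so it compares equal to the string codes used by the other APIs.
        postcode = str(record["POA_NAME16"]).zfill(4)
        indexed[postcode] = {
            "suburbs": record["Combined"].split(","),
            "population": record["Tot_p_p"],
        }
    return indexed


population_by_postcode = index_population(population_records)
print(population_by_postcode["2007"])
```

Splitting `Combined` on commas turns the suburb names into a list, which is easier to search than the raw string.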

Cases

Next is the core statistical data across the state. It represents a combination of parameters related to COVID cases:

curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/data_cases2.json
{
    "data": [
        {
            "Recovered": 5,
            "POA_NAME16": "2536",
            "Deaths": 0,
            "Cases": 5,
            "Date": "12-Jul"
        },
...
  • Recovered - total recovered cases
  • POA_NAME16 - postal code
  • Deaths - total fatal cases
  • Cases - total cases
  • Date - measurement date
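Note that `Date` comes as a day and an abbreviated month with no year, so it has to be completed before it can be sorted or compared. Here is a small Python sketch of one way to do that. `parse_measurement_date` is a hypothetical helper, and supplying the year from the outside (2020 in the usage line) is an assumption, since the API itself does not state it.

```python
from datetime import date, datetime


def parse_measurement_date(value: str, year: int) -> date:
    """Parse a 'DD-Mon' string such as '12-Jul' (hypothetical helper).

    The API value carries no year, so the caller has to supply one.
    """
    # %b matches abbreviated month names in the current locale; this
    # sketch assumes an English (or default C) locale.
    return datetime.strptime(f"{value}-{year}", "%d-%b-%Y").date()


print(parse_measurement_date("12-Jul", 2020))  # 2020-07-12
```

Appending the year before parsing, rather than parsing first and replacing the year afterwards, keeps dates such as 29-Feb valid in leap years.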

Tests

This one is served as a separate API, although it could have been combined with the previous one into a single call. It reports the number of tests per postal code.

curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/data_tests.json
{
    "data": [
        {
            "Recent": 796,
            "POA_NAME16": "2260",
            "Number": 4178,
            "Date": "12-Jul"
        },
...
  • Recent - total number of tests for a short time interval in the past
  • POA_NAME16 - postal code
  • Number - total number of tests
  • Date - measurement date

Postal codes

Finally, the API that ties them all together. This one is a bit different, though: it returns the postal codes in the form of GeoJSON[1], a format for encoding geographic data structures. In our case, that would be NSW suburbs. The API is formatted this way so that we can visualize it with any web map viewer capable of reading the format.

curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/nswpostcodes_final.json
{
    "type":"FeatureCollection",
    "features": [
        {
            "type":"Feature",
            "geometry":{
                "type":"Polygon",
                "coordinates":[[[130.85017131100005,-12.453012270999977],...]]
            },
            "properties": {"POA_CODE16":"0800","POA_NAME16":"0800","AREASQKM16":3.1734 }
        }
...
  • geometry - describes the shape of every postal code area on the map
  • coordinates - the shape itself
  • properties - a bag of key-value pairs that could represent any information not related specifically to how the shape is rendered (metadata)
  • POA_NAME16 - postal code
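The `properties` bag is where values from the other APIs could be attached before the shapes are handed to a map viewer. The Python sketch below illustrates the idea under a few assumptions: it uses the standard GeoJSON layout, in which `properties` sits beside `geometry` inside each feature, `attach_properties` is a hypothetical helper, and the ring of coordinates and the `Cases` value are placeholders rather than real data.

```python
from typing import Any

feature_collection: dict[str, Any] = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                # Placeholder ring, not the real boundary of the area.
                "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]],
            },
            "properties": {"POA_CODE16": "0800", "POA_NAME16": "0800", "AREASQKM16": 3.1734},
        }
    ],
}


def attach_properties(
    collection: dict[str, Any],
    extra_by_postcode: dict[str, dict[str, Any]],
) -> dict[str, Any]:
    """Copy extra per-postcode values into each feature's properties.

    The collection is modified in place and returned for convenience.
    """
    for feature in collection["features"]:
        postcode = feature["properties"]["POA_NAME16"]
        # Features without a matching postcode are left unchanged.
        feature["properties"].update(extra_by_postcode.get(postcode, {}))
    return collection


# Hypothetical value, used only to show the mechanics.
enriched = attach_properties(feature_collection, {"0800": {"Cases": 0}})
print(enriched["features"][0]["properties"])
```

Keeping the geometry untouched and writing only into `properties` means the result is still a valid feature collection for any viewer that reads the format.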

Those are all the parts of the data source. A few important things to note here. The postal code is the parameter that unites all the APIs, combining them into one data source. Date is another parameter on which most of the measurements intersect.

The Population and Postal codes APIs have no Date field. This rests on the assumption that their results do not change within the time frames considered.
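With all four endpoints known, downloading them can be sketched in a few lines of Python using only the standard library. This is an illustration rather than the pipeline's actual code: the names `ENDPOINTS`, `fetch_json` and `fetch_all` are hypothetical, the 30-second timeout is an arbitrary choice, and the URLs are the ones used in the `curl` calls above.

```python
import json
from typing import Any
from urllib.request import urlopen

BASE_URL = "https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles"

# Short names (hypothetical) mapped to the endpoints shown earlier.
ENDPOINTS = {
    "population": f"{BASE_URL}/population.json",
    "cases": f"{BASE_URL}/data_cases2.json",
    "tests": f"{BASE_URL}/data_tests.json",
    "postcodes": f"{BASE_URL}/nswpostcodes_final.json",
}


def fetch_json(url: str, timeout: float = 30.0) -> Any:
    """Download one endpoint and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_all() -> dict[str, Any]:
    """Download every endpoint; performs network requests when called."""
    return {name: fetch_json(url) for name, url in ENDPOINTS.items()}
```

Calling `fetch_all()` returns one dictionary with the four decoded responses, which is a convenient starting point for the shaping and combining steps.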

Another interesting observation comes from inspecting the API results as a whole over a few days in a row. The data source represents a slice of the data in time: the APIs return a sliding window that starts about half a month in the past and ends at the present day.


time
----------------------------------->
        |                          |                01.XX.2020
      start                       end
        -------------|--------------
                Set of points

----------------------------------------->
            |                            |          02.XX.2020
          start                         end
            --------------|---------------
                     Set of points

----------------------------------------------->
                  |                            |    03.XX.2020
                start                         end
                  --------------|---------------
                           Set of points
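The bounds of the window in a given snapshot can be checked by looking at the earliest and latest `Date` values it contains. Below is a minimal Python sketch of that check. `window_bounds` is a hypothetical helper, the year is passed in because the API omits it, and apart from "12-Jul" the dates in the sample are made up for illustration.

```python
from datetime import date, datetime

# One snapshot of a dated API. Only "12-Jul" comes from the samples
# above; the other dates are made up to illustrate the window.
snapshot = [
    {"POA_NAME16": "2536", "Date": "28-Jun"},
    {"POA_NAME16": "2536", "Date": "05-Jul"},
    {"POA_NAME16": "2536", "Date": "12-Jul"},
]


def window_bounds(records: list[dict], year: int) -> tuple[date, date]:
    """Return the earliest and latest measurement date in one snapshot."""
    dates = [
        # Assumes an English locale for the abbreviated month name (%b).
        datetime.strptime(f"{record['Date']}-{year}", "%d-%b-%Y").date()
        for record in records
    ]
    return min(dates), max(dates)


start, end = window_bounds(snapshot, 2020)
print(start, end, (end - start).days)
```

Running the same check on snapshots taken on consecutive days should show both bounds moving forward together, which is the sliding behaviour drawn in the diagram.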

References

  1. GeoJSON