Data wrangling
Once we have the data source, the next step is to convert it into a form that fits our use case (Data wrangling[1]). This step is arguably the most important one. By design, our pipeline feeds a static website, so a natural data representation is JSON. As we saw earlier, the data source APIs already provide data in this format. What is left is to shape the data so it suits both statistical analysis and visualization, where "suits best" mostly means the performance of computations.
Data accumulation #
As we discovered before, we never get the dataset as a whole sequence of observations, only a sliding window. So every time we receive a new set of points/measurements from the source, we have to combine them with all the previous sliding-window results and store them in one place. We are going to store the results in the file system next to the web application itself, so the browser can load them with a simple fetch request to a particular measurement JSON file (cases, tests, etc.).
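Reading one of these files back in the front end then boils down to a single fetch call. A minimal sketch (the file name and path are assumptions that depend on where the pipeline writes its output):
// Hypothetical front-end read of an accumulated measurement file.
const loadCases = async () => {
    const response = await fetch('/cases-total.json'); // path is an assumption
    return response.json();
};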
Download from source #
Here is how we download a file from the source:
const https = require('https')

const download = (url) => new Promise((resolve, reject) => {
    let data = '';
    https.get(url, (resp) => {
        // accumulate the response body chunk by chunk
        resp.on('data', (chunk) => {
            data += chunk;
        });
        // the whole body has arrived, resolve with it
        resp.on('end', () => {
            resolve(data);
        });
    })
    .on('error', (err) => {
        console.error(err);
        reject(err);
        process.exit(1);
    });
})
const population = await download('https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/population.json')
const postCodes = await download('https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/nswpostcodes_final.json')
We could also use the same method to download previously stored information:
const cases = await download('https://raw.githubusercontent.com/maxgherman/nsw-corona-virus/gh-pages/cases-total.json')
const tests = await download('https://raw.githubusercontent.com/maxgherman/nsw-corona-virus/gh-pages/tests-total.json')
Writing a file into a file system is also straightforward:
const fs = require('fs')
const path = require('path')
const util = require('util')
const writeFile = util.promisify(fs.writeFile)
await writeFile(path.resolve(baseDir, 'build/population.json'), population)
await writeFile(path.resolve(baseDir, 'build/post-codes.json'), postCodes)
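One implicit assumption here is that the build directory already exists. If that is not guaranteed, it could be created first; a small sketch using the same promisify approach:
const mkdir = util.promisify(fs.mkdir)

// create the output directory if it is missing; `recursive` makes this a no-op when it already exists
await mkdir(path.resolve(baseDir, 'build'), { recursive: true })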
Merge #
The population and postal code results can be stored as is, but cases and tests have to be merged with the previously accumulated results before they can be saved. Since the data source is fragmented by time and postal code, we merge the parts on those two keys: the Date and POA_NAME16 fields. To speed things up, we use a Hash table[2] where the key is the measurement day and the value is another hash table keyed by postal code, holding the measurement itself as the value.
Map { '01-10' => Map { '2053' => { Cases: 11, ... }, ... }, ... }
This way we traverse both datasets (the previously merged one and the newly downloaded one) in O(n) time. We also sort the final result by date, which saves us a headache later during data visualization.
const arrangeByDate = (data, store) => data.data.reduce((acc, curr) => {
    const currentYear = new Date().getFullYear();
    const date = curr.Date;
    // dates like '01-Oct' come without a year, so append the current one
    const parsedDate = /^[0-9]{2}-[a-zA-Z]{3}$/.test(date) ? `${date}-${currentYear}` : curr.Date;
    curr.Date = parsedDate;
    // group measurements by date first, then by postal code
    const entry = acc.has(parsedDate) ? acc.get(parsedDate) : new Map();
    entry.set(curr.POA_NAME16, curr);
    acc.set(parsedDate, entry);
    return acc;
}, store);
const mergeCases = async (casesUrl, baseCasesUrl, activeCasesUrl) => {
    const cases = await download(casesUrl).then(JSON.parse);
    const baseCases = await download(baseCasesUrl).then(JSON.parse);
    const activeCases = await download(activeCasesUrl).then(JSON.parse);

    // merge the newly downloaded cases into the previously accumulated ones
    const caseStore = arrangeByDate(cases, new Map());
    const mergedCases = arrangeByDate(baseCases, caseStore);

    activeCases.data.forEach((activeCase) => {
        Object.keys(activeCase).forEach((activeCaseKey) => {
            if (activeCaseKey === 'POA_NAME16') {
                return;
            }

            // daysMap (defined elsewhere) translates active-cases column names into date keys
            if (daysMap.has(activeCaseKey)) {
                const dateKey = daysMap.get(activeCaseKey);

                if (mergedCases.has(dateKey)) {
                    const casesByDate = mergedCases.get(dateKey);

                    if (casesByDate.has(activeCase.POA_NAME16)) {
                        const caseItem = casesByDate.get(activeCase.POA_NAME16);
                        const active = activeCase[activeCaseKey];
                        caseItem.Recovered = caseItem.Cases - active;
                    }
                }
            }
        });
    });

    // flatten the nested maps and sort the result by date
    const maps = [...mergedCases.values()]
        .map((item) => [...item.values()])
        .flat();

    maps.sort(sort);

    return {
        data: maps
    };
};
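The snippet above leans on two helpers defined elsewhere in the pipeline: daysMap, which translates the column names of the active-cases file into date keys, and sort, the date comparator. As a rough idea of the latter, here is a hypothetical sketch assuming the DD-MMM-YYYY strings produced by arrangeByDate:
// Hypothetical comparator for maps.sort(sort); assumes Date values like '01-Oct-2020'.
const months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

const toTimestamp = (value) => {
    const [day, month, year] = value.split('-');
    return new Date(Number(year), months.indexOf(month), Number(day)).getTime();
};

const sort = (a, b) => toTimestamp(a.Date) - toTimestamp(b.Date);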
We can run this procedure on a schedule (say, once a day) and keep all the historical results stored alongside the website's static assets, ready to be served.
Data preparation #
At this stage, our transformed data source is not much different from the original one, apart from the historical accumulation. It is time to start thinking about how to use it within the web application itself (the browser front end) so it can participate in visualizations and analysis. Optimization matters here: the data structures chosen to represent the data in memory directly affect manipulation and traversal procedures and, as a result, can degrade performance. The key is to identify the most important points of the dataset, that is, the measurements from which the rest of the measurements can be reached. Think of the dataset as a graph and of the points as vertices. Essentially, we are looking for strongly connected components[3] within that graph: the stronger the connection, the shorter the paths, and the faster we can traverse the data structure.
Earlier we noticed that time and location form a connected component. Most of the measurements are based on Date and postal code points.
We could also look at the dataset as two separate slices:
- time, as a summary of all postal codes, and
- postal code, as a summary of chronological events.
So it makes sense to represent our data structures in a way where time and location are points reachable in one step. Sound familiar? Yes, we used the same idea to transform the original data source: one hash table represents points in time, another represents locations, and together they give access to any other point.
Map {
    '01-10' => Map {
        '2053' => { Cases: 11, Active: 3, ... },
        '2057' => { Cases: 3, Active: 0, ... },
        ...
    },
    '02-10' => Map {
        '2053' => { Cases: 11, Active: 2, ... },
        '2057' => { Cases: 5, Active: 0, ... },
        ...
    },
    ...
}
const formatMap = (data) =>
    data.data.reduce((acc, curr) => {
        // group by date first, then by postal code
        const rootEntry = acc.has(curr.Date) ? acc.get(curr.Date) : new Map();
        acc.set(curr.Date, rootEntry);
        rootEntry.set(curr.POA_NAME16, curr);
        return acc;
    }, new Map());
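With this shape, any measurement is one lookup away from a point in time and a location. A small usage sketch, assuming casesData is the parsed cases file and using the example keys from the structure above:
// reach the measurements for postcode 2053 on 01-10 in a single step
const casesByDate = formatMap(casesData);
const entry = casesByDate.get('01-10').get('2053');
console.log(entry.Cases);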
Population is slightly different because there is no time measurement, but for the sake of optimization, the same idea could be applied:
const formatPopulation = (data) =>
    data.reduce((acc, curr) => {
        const key = curr.POA_NAME16.toString();
        const population = acc.population;
        // keep the first entry seen for each postal code
        const entry = population.has(key) ? population.get(key) : curr;
        population.set(key, entry);
        acc.suburbs.push({ postCode: key, name: curr.Combined });
        return acc;
    }, {
        population: new Map(),
        suburbs: []
    });
Here we are forming a population Map and a list of suburbs to be used later.
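The result can then be queried by postal code or iterated as a flat list. A usage sketch, assuming populationData is the parsed population file (Tot_p_p and Combined are the field names used in mergeData below):
const { population, suburbs } = formatPopulation(populationData);

// total population of a particular postcode
console.log(population.get('2053').Tot_p_p);

// suburb entries, handy for building a search or dropdown list
console.log(suburbs[0]); // { postCode: '...', name: '...' }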
Geo Map #
Since our data represents the spread of accumulated numbers across state suburbs, it is a good idea to present it visually on a map. Recall that the postal codes file is in GeoJSON format, so we just need a library that can render it as a map of NSW in the browser. For our application we are going to use Mapbox[4]: it understands GeoJSON and has a free tier.
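A minimal sketch of wiring the GeoJSON into Mapbox GL JS could look like this (the access token, container id, style, map position and the postCodesGeometry variable are all assumptions):
import mapboxgl from 'mapbox-gl';

mapboxgl.accessToken = '<your access token>';

const map = new mapboxgl.Map({
    container: 'map',                          // id of the host element
    style: 'mapbox://styles/mapbox/light-v10',
    center: [151.21, -33.87],                  // roughly Sydney
    zoom: 6
});

map.on('load', () => {
    // register the postal codes GeoJSON as a source and draw it as a fill layer
    map.addSource('post-codes', { type: 'geojson', data: postCodesGeometry });
    map.addLayer({
        id: 'post-codes-fill',
        type: 'fill',
        source: 'post-codes',
        paint: { 'fill-color': '#088', 'fill-opacity': 0.4 }
    });
});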
The postal codes file could be rendered as is, but the rest of the dataset (cases, active, recovered, ...) needs to be merged into it manually. We can use GeoJSON Feature properties[5] to attach additional metadata for extra visualizations. Features represent NSW suburbs, and every feature gets a set of additional properties carrying the numerical values of various statistics (a sketch of how the ranges might be computed follows the list):
- Total cases - number of total cases
- Total cases range - severity level of the number of total cases (0, 10, 20, 30, 40, 50, > 50)
- Active cases - number of active cases
- Active cases range - severity level of the number of active cases (0, 1, 3, 8, 21, 51, > 51)
- Recovered cases - number of recovered cases
- Recovered cases range - progression level of the number of recovered cases (0, 2, 3, 8, 21, 51, > 51)
- Tests - number of tests taken
- Tests range - progression level of the number of tests (0, 250, 500, 1000, 2000, 3000)
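The range values are what the getLevelKey helpers in mergeData below produce. Their implementation is not shown in this section, but a hypothetical sketch of the idea, using the total-cases boundaries from the list, could be:
// Hypothetical sketch of the range helpers; caseLevels, activeLevels,
// recoveredLevels and testLevels are assumed to be arrays of bucket boundaries.
const caseLevels = [0, 10, 20, 30, 40, 50];

const getLevelKey = (levels) => (value) => {
    // index of the first boundary the value fits under,
    // or levels.length for the "greater than the last boundary" bucket
    const index = levels.findIndex((level) => value <= level);
    return index === -1 ? levels.length : index;
};

const caseLevelKey = getLevelKey(caseLevels);
caseLevelKey(35); // 4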
As a data structure, feature properties are just a bag of key-value pairs, which means we are free to add any key-value pair to them:
export const mergeData = ({
    postCodes: postCodesGeometry,
    cases: casesInitial,
    population: populationInitial,
    tests: testsInitial
}) => {
    if (!postCodesGeometry) {
        return {};
    }

    const caseLevelKey = getLevelKey(caseLevels);
    const testLevelKey = getLevelKey(testLevels);
    const activeLevelsKey = getLevelKey(activeLevels);
    const recoveredLevelsKey = getLevelKey(recoveredLevels);

    const cases = formatMap(casesInitial);
    const tests = formatMap(testsInitial);
    const { population: populationByCode, suburbs } = formatPopulation(populationInitial);

    const dates = Array.from(cases.keys());
    const selectedDate = dates[dates.length - 1];

    postCodesGeometry.features.forEach((feature) => {
        dates.forEach((date) => {
            const caseEntry = cases.get(date);
            const testsEntry = tests.get(date);

            const {
                testsKey, activeKey, totalKey, recoveredKey, deadKey,
                rangeTestsKey, rangeActiveKey, rangeRecoveredKey
            } = bagKeys(date);

            // initialize properties with zeroes
            feature.properties[date] = 0;
            feature.properties[testsKey] = 0;
            feature.properties[activeKey] = 0;
            feature.properties[totalKey] = 0;
            feature.properties[recoveredKey] = 0;
            feature.properties[deadKey] = 0;

            // ranges
            feature.properties[rangeTestsKey] = 0;
            feature.properties[rangeActiveKey] = 0;
            feature.properties[rangeRecoveredKey] = 0;

            const hasCasesPOA = caseEntry.has(feature.properties.POA_NAME16);

            if (hasCasesPOA) {
                const caseEntryValue = caseEntry.get(feature.properties.POA_NAME16);
                const total = parseInt(caseEntryValue.Cases, 10);
                const recovered = parseInt(caseEntryValue.Recovered, 10);
                const dead = parseInt(caseEntryValue.Deaths, 10);

                caseEntryValue.Active = recovered === 0 ? 0 : total - (recovered + dead);

                feature.properties[date] = caseLevelKey(total);
                feature.properties[rangeActiveKey] = activeLevelsKey(caseEntryValue.Active);
                feature.properties[rangeRecoveredKey] = recoveredLevelsKey(recovered);
                feature.properties[totalKey] = total;
                feature.properties[activeKey] = caseEntryValue.Active;
                feature.properties[recoveredKey] = recovered;
                feature.properties[deadKey] = dead;
            }

            if (testsEntry && testsEntry.has(feature.properties.POA_NAME16)) {
                const testEntryValue = testsEntry.get(feature.properties.POA_NAME16);
                const total = parseInt(testEntryValue.Number, 10);
                const testRange = testLevelKey(total);

                if (hasCasesPOA) {
                    const caseEntryValue = caseEntry.get(feature.properties.POA_NAME16);
                    caseEntryValue.Tests = testEntryValue.Number;
                    caseEntryValue.RecentTests = testEntryValue.Recent;
                }

                feature.properties[testsKey] = total;
                feature.properties[rangeTestsKey] = testRange;
            }
        });

        if (populationByCode.has(feature.properties.POA_NAME16)) {
            const populationEntry = populationByCode.get(feature.properties.POA_NAME16);
            feature.properties.population = populationEntry.Tot_p_p;
            feature.properties.suburbName = populationEntry.Combined;
        }
    });

    return {
        postCodesGeometry,
        suburbs,
        cases,
        populationByCode,
        population: populationInitial,
        tests,
        selectedDate,
        dates
    };
};
The same bag of key-value properties will be used during map visualizations.
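As a preview of where these properties end up, a choropleth layer could read the per-date bucket straight from each feature (a hedged sketch that assumes the Mapbox setup above and the bucket indices produced by the level helpers):
// color each postcode polygon by the severity bucket stored under the selected date
map.addLayer({
    id: 'cases-choropleth',
    type: 'fill',
    source: 'post-codes',
    paint: {
        'fill-color': [
            'step', ['get', selectedDate],
            '#f1eef6',
            1, '#d4b9da', 2, '#c994c7', 3, '#df65b0',
            4, '#e7298a', 5, '#ce1256', 6, '#91003f'
        ],
        'fill-opacity': 0.6
    }
});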