Analysis

Published: 05 Feb 2021

Now that we know how to display statistical results using different UI elements, understanding how those numbers are calculated completes the picture of all the data transformations involved, from data gathering to data display.

Daily numbers

As mentioned before, the dashboard contains summaries of the data slices represented by the map (total cases, active cases, recovered, and total tests). It also shows the difference between these numbers and the same calculations for the previous day, so we can see not only the latest results but also the trend of the observations.

Reminder: the final dataset is a hash table of hash tables (a Map of Maps) of daily observations per suburb, as described in the merge strategy.
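As a minimal sketch of that shape, assume the outer keys are date strings and the inner keys are postcodes. The date format, the postcodes, and all the numbers below are made up for illustration; only the `Cases` and `Active` field names come from the code in this chapter.

```javascript
// Hypothetical sample data: outer Map keyed by date, inner Map keyed by postcode.
const sampleCases = new Map([
  ['2021-02-03', new Map([
    ['2000', { Cases: 10, Active: 2 }],
    ['2010', { Cases: 4, Active: 1 }],
  ])],
  ['2021-02-04', new Map([
    ['2000', { Cases: 12, Active: 3 }],
    ['2010', { Cases: 4, Active: 0 }],
  ])],
]);

// Dates in insertion (chronological) order, used to walk the outer Map.
const sampleDates = [...sampleCases.keys()];

// Look up one observation: date first, then suburb postcode.
const observation = sampleCases.get('2021-02-04').get('2000');
console.log(observation.Cases); // 12
```

Keeping the date as the outer key makes "all suburbs on a given day" a single lookup, which is what the daily summaries need.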

Here is how we calculate daily numbers for the active cases:

import { useContext, useMemo } from 'react';

// Inside the dashboard component (DataContext is the application's data context):
const { cases, dates } = useContext(DataContext);

const { totalActive, progressActive } = useMemo(() => {
  // Sum of active cases across all suburbs for the last day of observations
  const lastDateData = [...cases.get(dates[dates.length - 1]).values()]
    .reduce((acc, value) => acc + (value.Active ? value.Active : 0), 0);

  // The same sum for the second-to-last day
  const secondLastDateData = [...cases.get(dates[dates.length - 2]).values()]
    .reduce((acc, value) => acc + (value.Active ? value.Active : 0), 0);

  // Relative change in percent; report 0 if the previous day had no active cases
  // to avoid dividing by zero
  const percent = secondLastDateData === 0
    ? 0
    : ((lastDateData - secondLastDateData) * 100) / secondLastDateData;

  return { totalActive: lastDateData, progressActive: percent };
}, [cases, dates]);

We first calculate the summary for the last day of observations (reduce → lastDateData), then the same summary for the second‑to‑last day (secondLastDateData). The final value (progressActive) is the relative change^[1] between the two, expressed as a percentage:

                            Actual change      x1 - x2
Relative change (x1, x2) = --------------- = -----------
                                 x2              x2
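As a small worked example of the formula, here is a sketch with a hypothetical helper named `relativeChangePercent`. Returning 0 when x2 is 0 is an assumption of this sketch; the formula itself is undefined in that case.

```javascript
// Relative change between the latest value (x1) and the previous one (x2),
// expressed in percent. Returns 0 when x2 is 0 to avoid dividing by zero
// (an assumption of this sketch, not part of the formula).
const relativeChangePercent = (x1, x2) => (
  x2 === 0 ? 0 : ((x1 - x2) * 100) / x2
);

console.log(relativeChangePercent(120, 100)); // 20
console.log(relativeChangePercent(80, 100)); // -20
```

A positive result means the number grew compared to the previous day, a negative one means it shrank.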

The same idea applies to the rest of the dataset slices represented by the dashboard.

Cumulative totals

For the cumulative totals we need two running summaries: the total number of cases as of each day, and the absolute change between each pair of consecutive days. The first day of observations has no previous day, so its change is the total itself.

Here is how we calculate total cases per day:

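const calcTotal = (cases, date) => {
  // No observations for this date
  if (!cases.has(date)) {
    return 0;
  }

  // Sum the cases of all suburbs, treating a missing value as 0
  return [...cases.get(date).values()]
    .reduce((acc, curr) => acc + (curr.Cases ? curr.Cases : 0), 0);
};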

The daily change is the difference between the value for a particular day and the value for the previous day, clamped so that it never drops below zero:

Absolute daily change (x1, x2) = max(0, (x1 - x2))
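To see the clamping in action, here is a sketch that turns a cumulative series into daily changes. The helper name `toDailyChanges` and the sample numbers are hypothetical; the dip from 15 to 14 stands in for a cumulative figure that was revised downwards.

```javascript
// Turn a cumulative series into daily changes. The first day has no
// predecessor, so its change is the total itself; negative differences
// are clamped to zero.
const toDailyChanges = (cumulative) => cumulative.map((total, i) => (
  i === 0 ? total : Math.max(0, total - cumulative[i - 1])
));

console.log(toDailyChanges([10, 15, 14, 20])); // [10, 5, 0, 6]
```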

Here is how we calculate totals per suburb as well as the difference per day:

/*
    cases - original dataset (Map of Maps)
    date - selected date
    prevDate - previous date (null for the first date)
    suburbs - list of suburbs to calculate statistics for
    result - calculation result: a Map from suburb postcode to an array of { value, diff }
*/
const calcSuburbsTotal = (cases, date, prevDate, suburbs, result) => {
  suburbs.forEach((suburb) => {
    const entry = result.has(suburb.postCode) ? result.get(suburb.postCode) : [];
    const value = cases.has(date) && cases.get(date).has(suburb.postCode)
      ? cases.get(date).get(suburb.postCode).Cases : 0;

    // Without a previous observation for this suburb, the change is the value itself
    const diff = !prevDate || !cases.has(prevDate) || !cases.get(prevDate).has(suburb.postCode)
      ? value
      : value - cases.get(prevDate).get(suburb.postCode).Cases;
    entry.push({ value, diff: Math.max(0, diff) });
    result.set(suburb.postCode, entry);
  });
};

And here is how we calculate the final dataset for the UI page:

// selectedSuburbs is defined elsewhere in the component (not shown here)
const { cases, dates } = useContext(DataContext);

const data = useMemo(() => {
  return dates.reduce((acc, curr) => {
    const {
      prev, daily, cumulative, prevDate
    } = acc;
    const total = calcTotal(cases, curr);

    cumulative.push(total);

    // The first day has no predecessor, so its daily value is the total itself
    if (!prev) {
      daily.push(total);
    } else {
      const diff = total - prev;
      daily.push(Math.max(0, diff));
    }

    acc.prev = total;

    // Skip per-suburb statistics for dates without observations
    if (!cases.has(curr)) {
      return acc;
    }

    calcSuburbsTotal(cases, curr, prevDate, selectedSuburbs, acc.suburbs);
    acc.prevDate = curr;
    return acc;
  }, {
    prev: null, daily: [], cumulative: [], prevDate: null, suburbs: new Map()
  });
}, [cases, dates, selectedSuburbs]);

Distribution

Our data analysis wouldn't be complete without checking how the values in different slices of the data are distributed (probability distribution^[2]). As we discussed in a previous chapter, a histogram is used as an approximate visualization. For the calculations we are using the compute‑histogram package^[3]. In its basic form, it just needs an array of values and returns an array of pairs, where the first item in a pair is the bin index and the second item is the number of observations in that bin:

/**
 * Calculates the values required to draw a histogram based on the input array and the number of bins.
 * Tails can be removed by limiting the calculation to a specific percentile.
 * The number of bins can be automatically calculated using a heuristic.
 *
 * @param arr
 * @param numBins If numBins === 0, then max of the Sturges and Freedman–Diaconis' choice methods is used
 *        See: https://en.wikipedia.org/wiki/Histogram
 * @param trimTailPercentage removes the right and left tails from the distribution
 * @returns Array Two dimensional array. First dimension is the index of the bin, and the second index
 *          is the count. This allows for direct import into ChartJS without having to change the data shape
 */
function calculateHistogram(arr) {
  // ... (body omitted; see the package source)
}

Usage is relatively straightforward. We just need to filter out missing or non‑numeric values before passing the dataset in:

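import computeHistogram from 'compute-histogram';

const values = computeHistogram(
  population
    // Keep only rows that have a postal area name and a numeric population total
    .filter((item) => !!item.POA_NAME16 && !isNaN(Number(item.Tot_p_p)))
    // Pass numbers (not strings) to the histogram
    .map((item) => Number(item.Tot_p_p))
);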
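To make the result shape concrete, here is a minimal equal‑width binning sketch. It is not the compute‑histogram implementation: the helper name `simpleHistogram` is hypothetical, the number of bins is passed in explicitly rather than chosen by a heuristic, and a non‑empty array of numbers is assumed.

```javascript
// Illustrative equal-width binning; NOT the compute-histogram implementation.
// Assumes a non-empty array of finite numbers and numBins >= 1.
const simpleHistogram = (values, numBins) => {
  const min = Math.min(...values);
  const max = Math.max(...values);
  // Fall back to width 1 when all values are equal (zero range)
  const width = (max - min) / numBins || 1;
  const counts = new Array(numBins).fill(0);

  values.forEach((value) => {
    // The maximum value would land in bin numBins, so clamp it into the last bin
    const index = Math.min(numBins - 1, Math.floor((value - min) / width));
    counts[index] += 1;
  });

  // Return [binIndex, count] pairs
  return counts.map((count, index) => [index, count]);
};

console.log(simpleHistogram([1, 2, 2, 3, 9, 10], 3)); // [[0, 4], [1, 0], [2, 2]]
```

Each pair can be fed to a bar chart as is: the bin index goes on the horizontal axis and the count on the vertical one.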

Correlation

With correlation^[4] we measure the level of dependence between two variables by calculating their correlation coefficient. The Simple Statistics^[5] package does the heavy lifting for the numerical part of the calculations (the coefficients themselves). It just needs two arrays of data of the same length and returns the calculated coefficient.

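import { sampleCorrelation } from 'simple-statistics';

const value = useMemo(() => {
  // toFixed returns a string rounded to 5 decimal places, ready for display
  return sampleCorrelation(
    dataSlice.map((item) => item.x),
    dataSlice.map((item) => item.y)
  )
    .toFixed(5);
}, [dataSlice]);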
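For intuition about what the coefficient measures, here is the sample Pearson correlation written out by hand. This is an illustrative sketch with hypothetical helper names (`mean`, `pearson`), not the library's source, and it assumes two arrays of equal length with at least two values and non‑zero variance.

```javascript
// Sample Pearson correlation coefficient, written out for illustration only.
const mean = (values) => values.reduce((acc, v) => acc + v, 0) / values.length;

const pearson = (xs, ys) => {
  const meanX = mean(xs);
  const meanY = mean(ys);
  let sumXY = 0;
  let sumXX = 0;
  let sumYY = 0;

  xs.forEach((x, i) => {
    const dx = x - meanX;
    const dy = ys[i] - meanY;
    sumXY += dx * dy;
    sumXX += dx * dx;
    sumYY += dy * dy;
  });

  // The (n - 1) normalisation factors cancel out, so raw sums are enough
  return sumXY / Math.sqrt(sumXX * sumYY);
};

console.log(pearson([1, 2, 3, 4], [2, 4, 6, 8]).toFixed(5)); // "1.00000"
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.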

Regression

Simple Statistics provides linear regression^[6] calculations as well. As a convenience, it can also build a regression line function for us.

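import { linearRegression, linearRegressionLine } from 'simple-statistics';

const lineData = useMemo(() => {
  // linearRegression expects an array of [x, y] pairs
  const regressionData = data.map((item) => [item.x, item.y]);
  // linearRegressionLine turns the fitted coefficients into a function of x
  const line = linearRegressionLine(
    linearRegression(regressionData)
  );

  return data.map((item) => ({
    x: item.x, y: Number(line(item.x).toFixed(3))
  }));
}, [data]);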
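The library returns the slope and intercept of the fitted line as `m` and `b`. The sketch below shows where such numbers come from using ordinary least squares; the helper name `fitLine` and its `predict` function are hypothetical, and it assumes at least two points with distinct x values.

```javascript
// Ordinary least squares for y = m * x + b, for illustration only.
const fitLine = (points) => {
  const n = points.length;
  const meanX = points.reduce((acc, [x]) => acc + x, 0) / n;
  const meanY = points.reduce((acc, [, y]) => acc + y, 0) / n;

  let sumXY = 0;
  let sumXX = 0;
  points.forEach(([x, y]) => {
    sumXY += (x - meanX) * (y - meanY);
    sumXX += (x - meanX) ** 2;
  });

  const m = sumXY / sumXX; // slope
  const b = meanY - m * meanX; // intercept
  return { m, b, predict: (x) => m * x + b };
};

// Three points that lie exactly on y = 2x + 1
const fitted = fitLine([[0, 1], [1, 3], [2, 5]]);
console.log(fitted.m, fitted.b, fitted.predict(3)); // 2 1 7
```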

References