This exercise challenges you to access, understand, and write data using a variety of methods. You will get to use

  • CSV (comma-separated values) for tabular data
  • language-specific binaries or serializations
    • R: RDS files
    • Python: pickles
  • an SQL database
  • an Excel file

The story

Scenario: workflow with Pat's database and script

You are part of a multi-author team preparing a revised manuscript on the COVID-19 epidemic in southern Africa. Unfortunately, a key team member, Pat, moved on to another job last September, and is not able to update plots from the original manuscript, like this plot which only goes through August 2022.

You have been tasked with updating and expanding the plots by

  • Adding data through the end of 2022 for South Africa
  • Making an additional plot with new cases per million for all southern African countries
  • Adding vaccinations for South Africa
  • Making this into a supplementary data repository

That sounds hard, but Pat has made it easier by sending you two files containing the missing information, along with a brief email of instructions.

Hello! I heard that you are going to add the new data and clean this up for publication. You probably won’t be able to get the db connection to work in the script, but that’s no problem, just launch “sqlite3 AfricaCOVID.db” and run the SELECT statements for whatever countries you want. If you type “.header on”, “.mode csv”, and “.once ‘myfilename.csv’”, then the results of the next SELECT statement will be written to the named csv file, and then you can work with the data directly.

Jan provided two data files to make the requested updates. The Excel file (newer-african-data.xlsx) has data on cases and deaths from October 2022 onward. Beware that Excel can’t keep dates straight. The vax data are in african-vax-data.rds. Both files have data for all the African countries.

Everything else should work fine. Good luck,

Pat
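Pat’s warning that Excel “can’t keep dates straight” often shows up as dates arriving in your script as raw Excel serial numbers. If that happens with the new workbook, the serials can be converted in pandas. A minimal sketch, using made-up serial values rather than numbers from the actual file:

```python
import pandas as pd

# Excel stores dates as serial day counts; the Windows epoch is 1899-12-30
# (offset by the famous fake 1900 leap day). These serials are invented
# examples, not values from newer-african-data.xlsx.
serials = pd.Series([44835, 44866, 44896])

dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates.dt.strftime("%Y-%m-%d").tolist())
# ['2022-10-01', '2022-11-01', '2022-12-01']
```

Whether you hit this at all depends on how the workbook was saved and which reader you use, so inspect the date column before converting anything.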

The attachment to Pat’s email.

Pat’s R script below is only about a dozen lines of code. It uses a database connector that is no longer supported, but Pat’s email suggests a way to deal with that.

library("BadRSQLite")
library("ggplot2")

# don't need this anymore bc ISO codes for Region are in ccodes table
# region <- c("AGO", "BWA", "LSO", "MWI", "MOZ", "NAM", "ZAF", "SWZ", "ZMB", "ZWE")

# get the db link and query it for southern Africa data
mydblink <- dbConnect(BadRSQLite::SQLite(), "AfricaCOVID.db")
regiondat <- dbGetQuery(mydblink, "SELECT COVID.* FROM COVID JOIN ccodes ON 
   COVID.iso_code=ccodes.ISO3 WHERE ccodes.Region='Southern Africa';")
dbDisconnect(mydblink)

# make the date column into R dates, numeric columns into R numeric type
regiondat$date <- as.Date(regiondat$date) 
regiondat[5:16] <- apply(regiondat[5:16], 2, as.numeric)
zafdat <- subset(regiondat, iso_code == 'ZAF')

# plot the ZAF data with new cases and deaths, save as file 
p <- ggplot(zafdat, aes(x = date, y = new_cases_smoothed_per_million, 
   group = location)) + 
  geom_line() + scale_x_date(date_breaks = "1 month", date_labels = "%b") + 
  ylab(label = "New cases or deaths per million, smoothed") 
p2 <- p + geom_line(aes(y = new_deaths_smoothed_per_million), colour= "red")
ggsave("zaf-plot.jpg", p2, width=20, height=8, units="cm")

# plot the data for all southern African countries 
p3 <- ggplot(regiondat, aes(x = date, y = new_cases_smoothed_per_million, 
   group = location, color = location)) + 
  geom_line() + scale_x_date(date_breaks = "1 month", date_labels = "%b") + 
  ylab(label = "New cases per million, smoothed") 
# ggsave("southern-plot.jpg", p3, width=20, height=8, units="cm")

Python version of the story

The Python version of the story is the same, except the script is in Python, and there is a pickle file instead of an RDS.

# import libraries 
import os
import badsqlite3
import pandas as pd

# setup db connection
dbpath = os.path.abspath("AfricaCOVID.db")
con = badsqlite3.connect(dbpath)
cur = con.cursor()

# get all the southern africa data 
result = cur.execute("SELECT COVID.* FROM COVID JOIN ccodes ON "
   "COVID.iso_code=ccodes.ISO3 WHERE ccodes.Region='Southern Africa'")
regiondat = pd.DataFrame(result.fetchall())

# name the columns
regiondat.columns = ["iso_code", "continent", "location", "date", "total_cases", 
   "new_cases", "new_cases_smoothed", "total_deaths", "new_deaths", 
   "new_deaths_smoothed", "total_cases_per_million", "new_cases_per_million", 
   "new_cases_smoothed_per_million", "total_deaths_per_million", 
   "new_deaths_per_million", "new_deaths_smoothed_per_million"]
regiondat.head(10)

# make the date column into datetimes, numeric columns into numbers
regiondat['date'] = pd.to_datetime(regiondat['date'])
regiondat[regiondat.columns[4:]] = regiondat[regiondat.columns[4:]].apply(
   pd.to_numeric, errors="coerce")

# plot results 
regiondat.set_index('date', inplace=True)
regiondat.groupby('location')['new_cases_smoothed_per_million'].plot(legend=True)
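If the badsqlite3 connector cannot be revived, Pat’s command-line recipe (“.header on”, “.mode csv”, “.once”) has a scriptable equivalent in Python’s standard-library sqlite3 module. A sketch of the idea, using a tiny in-memory stand-in for AfricaCOVID.db (the real database has many more columns, so treat the schema here as illustrative only):

```python
import csv
import sqlite3

# Build a small stand-in with the two tables the script expects; against the
# real AfricaCOVID.db you would connect to the file and skip this setup.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ccodes (ISO3 TEXT, Region TEXT)")
con.execute("CREATE TABLE COVID (iso_code TEXT, location TEXT, date TEXT, new_cases REAL)")
con.execute("INSERT INTO ccodes VALUES ('ZAF', 'Southern Africa'), ('KEN', 'Eastern Africa')")
con.execute("INSERT INTO COVID VALUES ('ZAF', 'South Africa', '2022-10-01', 1000.0)")
con.execute("INSERT INTO COVID VALUES ('KEN', 'Kenya', '2022-10-01', 500.0)")

# Same JOIN as the script; write a header row plus the results,
# which is what .header on / .mode csv / .once produce at the CLI.
cur = con.execute("SELECT COVID.* FROM COVID JOIN ccodes "
                  "ON COVID.iso_code=ccodes.ISO3 "
                  "WHERE ccodes.Region='Southern Africa'")
with open("southern-africa.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([c[0] for c in cur.description])  # column names
    writer.writerows(cur.fetchall())
```

Once the CSV exists, either language can read it back with ordinary CSV tools, which is exactly the escape hatch Pat’s email describes.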

What you’ll need to complete this task

Here is what Pat and the team have provided

  • the script make-SA-plot.R
  • the sqlite3 database AfricaCOVID.db (4.9 MB file)
  • the Excel file with new data newer-african-data.xlsx
  • the RDS (or pickle) file with vax data african-vax-data.rds
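If you are working through the Python version, the vax pickle can be read with pandas. The filename african-vax-data.pkl, and the assumption that it holds a DataFrame, are guesses based on the RDS description; the sketch below round-trips a small stand-in frame the same way:

```python
import pandas as pd

# Stand-in for the real vax data: a tiny frame with plausible column names
# (invented for illustration, not taken from Pat's file).
vax = pd.DataFrame({"iso_code": ["ZAF", "ZAF"],
                    "date": ["2022-10-01", "2022-10-02"],
                    "total_vaccinations": [37000000, 37010000]})
vax.to_pickle("african-vax-data.pkl")

# Reading a pickle back is one call if it holds a pandas object
vaxdat = pd.read_pickle("african-vax-data.pkl")
print(vaxdat.shape)
```

Keep in mind that pickles are Python-specific, can break across library versions, and can execute code when loaded, so only unpickle files from people you trust. That fragility is worth remembering when you answer the format questions at the end.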

And you will need to supplement that with tools of your own, such as the sqlite3 command-line shell and libraries for reading Excel and serialized files.

How to proceed

The goal is to get the updated plots ready for submission, along with a data package (if you have time). There are many ways to do this; the problem is deliberately set up to make you use multiple I/O approaches.
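For the merging step, one workable pattern is to stack the database extract with the rows from the newer file and then drop overlapping country-date pairs, keeping the newer file’s version. A sketch with small hypothetical frames standing in for the two sources (column names follow Pat’s script, values are invented):

```python
import pandas as pd

# 'olddat' stands in for the database extract (through September 2022),
# 'newdat' for rows read from newer-african-data.xlsx (October onward);
# one date deliberately overlaps with a revised value.
olddat = pd.DataFrame({"iso_code": ["ZAF", "ZAF"],
                       "date": pd.to_datetime(["2022-09-29", "2022-09-30"]),
                       "new_cases": [800.0, 760.0]})
newdat = pd.DataFrame({"iso_code": ["ZAF", "ZAF"],
                       "date": pd.to_datetime(["2022-09-30", "2022-10-01"]),
                       "new_cases": [750.0, 900.0]})

# Stack both sources, then keep only one row per (iso_code, date);
# keep="last" prefers the newer file where the two overlap.
combined = (pd.concat([olddat, newdat])
            .drop_duplicates(subset=["iso_code", "date"], keep="last")
            .sort_values("date")
            .reset_index(drop=True))
print(combined["new_cases"].tolist())  # [800.0, 750.0, 900.0]
```

Whether you prefer the old or the new value for overlapping dates is a judgment call; the point is to decide explicitly rather than let duplicated rows distort the plots.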

Whatever approach you take, remember the following: (1) make a stepwise plan, (2) check what you are doing along the way, i.e., use available tools to examine data sources so that you know what is happening, and (3) don’t get stuck, and be prepared to revise your plan.

Checkpoints

Who is having issues or would like some help with

  • accessing the database?
  • getting data out of the database into another form?
  • the plotting code?
  • the RDS or pickle file?
  • the Excel file?
  • merging the old and new data sources?

Questions

How difficult is it to get data out of an SQL database, an Excel file, a csv file, or a language-specific serialization format? What makes a format more or less accessible?

How difficult is it to use the data once you get it? Is there a difference between how easy it is to get access to data and how easy it is to use the data effectively? What is the basis of that difference?

Which way of storing and transferring data is the most robust? Which is likely to be in use 10 years from now?

For simple tables, a csv file is usually the best choice. When would you consider a binary serialization? An SQL database? An Excel file?

How did you decide on filenames? How did you decide on the number and content of files to make?

What did you learn about formats from this exercise? What would you do differently in the future based on what you learned?