This exercise presents you with a challenge that involves accessing, understanding, and writing data using various methods. You will get to use
- CSV (comma-separated values) for tabular data
- language-specific binary serialization formats
- R: RDS files
- Python: pickles
- an SQL database
- an Excel file
The story
You are part of a multi-author team preparing a revised manuscript on the COVID-19 epidemic in southern Africa. Unfortunately, a key team member, Pat, moved on to another job last September and is not able to update plots from the original manuscript, like this plot, which only goes through August 2022.
You have been tasked with updating and expanding the plots by
- Adding data through the end of 2022 for South Africa
- Making an additional plot with new cases per million for all southern African countries
- Adding vaccinations for South Africa
- Making this into a supplementary data repository
That sounds hard, but Pat has made it easier by sending you two files containing the missing data, along with a brief email of instructions:
Hello! I heard that you are going to add the new data and clean this up for publication. You probably won’t be able to get the db connection to work in the script, but that’s no problem, just launch “sqlite3 AfricaCOVID.db” and run the SELECT statements for whatever countries you want. If you type “.header on”, “.mode csv”, and “.once ‘myfilename.csv’”, then the results of the next SELECT statement will be written to the named csv file, and then you can work with the data directly.
Jan provided 2 data files to make the requested updates. The Excel file (newer-african-data.xlsx) has data on cases and deaths from October 2022 onward. Beware that Excel can’t keep dates straight. The vax data are in african-vax-data.rds. Both of these have data for all the African countries.
Everything else should work fine. Good luck,
Pat
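The workaround Pat describes uses the sqlite3 command-line shell, but if you are working in Python, the standard library's built-in sqlite3 module can run the same query without any unsupported connector. Here is a minimal sketch; since AfricaCOVID.db itself isn't reproduced here, it builds a tiny in-memory stand-in with the same two-table shape (a COVID table joined to a ccodes lookup on ISO code), so the table and column names below are just illustrative.

```python
# Sketch: running Pat's JOIN query with Python's built-in sqlite3 module,
# as an alternative to the unsupported connector or the CLI route.
# AfricaCOVID.db isn't available here, so we build a tiny stand-in database.
import sqlite3

con = sqlite3.connect(":memory:")  # use "AfricaCOVID.db" in the real exercise
cur = con.cursor()
cur.executescript("""
    CREATE TABLE ccodes (ISO3 TEXT, Region TEXT);
    CREATE TABLE COVID (iso_code TEXT, location TEXT, new_cases INTEGER);
    INSERT INTO ccodes VALUES ('ZAF', 'Southern Africa'), ('EGY', 'Northern Africa');
    INSERT INTO COVID VALUES ('ZAF', 'South Africa', 100), ('EGY', 'Egypt', 50);
""")
rows = cur.execute(
    "SELECT COVID.* FROM COVID JOIN ccodes ON "
    "COVID.iso_code=ccodes.ISO3 WHERE ccodes.Region='Southern Africa'"
).fetchall()
print(rows)  # only the ZAF row survives the join filter
con.close()
```

Against the real database, the same `fetchall()` result can be handed straight to `pandas.DataFrame`, as in the script below.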
The attachment to Pat’s email.
Pat’s short R script below uses a database connector that is no longer supported, but Pat’s email suggests a way to deal with that.
library("BadRSQLite")
library("ggplot2")
# don't need this anymore bc ISO codes for Region are in ccodes table
# region <- c("AGO", "BWA", "LSO", "MWI", "MOZ", "NAM", "ZAF", "SWZ", "ZMB", "ZWE")
# get the db link and query it for southern Africa data
mydblink <- dbConnect(BadRSQLite::SQLite(), "AfricaCOVID.db")
regiondat <- dbGetQuery(mydblink, "SELECT COVID.* FROM COVID JOIN ccodes ON
COVID.iso_code=ccodes.ISO3 WHERE ccodes.Region='Southern Africa';")
dbDisconnect(mydblink)
# make the date column into R dates, numeric columns into R numeric type
regiondat$date <- as.Date(regiondat$date)
regiondat[5:16] <- lapply(regiondat[5:16], as.numeric)
zafdat <- subset(regiondat, iso_code == 'ZAF')
# plot the ZAF data with new cases and deaths, save as file
p <- ggplot(zafdat, aes(x = date, y = new_cases_smoothed_per_million,
group = location)) +
geom_line() + scale_x_date(date_breaks = "1 month", date_labels = "%b") +
ylab(label = "New cases or deaths per million, smoothed")
p2 <- p + geom_line(aes(y = new_deaths_smoothed_per_million), colour = "red")
ggsave("zaf-plot.jpg", p2, width=20, height=8, units="cm")
# plot the data for all southern African countries
p3 <- ggplot(regiondat, aes(x = date, y = new_cases_smoothed_per_million,
group = location, color = location)) +
geom_line() + scale_x_date(date_breaks = "1 month", date_labels = "%b") +
ylab(label = "New cases per million, smoothed")
# ggsave("southern-plot.jpg", p3, width=20, height=8, units="cm")
Python version of the story
The Python version of the story is the same, except the script is in Python, and there is a pickle file instead of an RDS.
# import libraries
import os
import badsqlite3
import pandas as pd
# set up the db connection
dbpath = os.path.abspath("AfricaCOVID.db")
con = badsqlite3.connect(dbpath)
cur = con.cursor()
# get all the southern Africa data
result = cur.execute("SELECT COVID.* FROM COVID JOIN ccodes ON "
                     "COVID.iso_code=ccodes.ISO3 WHERE ccodes.Region='Southern Africa'")
regiondat = pd.DataFrame(result.fetchall())
# name the columns
regiondat.columns = ["iso_code", "continent", "location", "date", "total_cases",
"new_cases", "new_cases_smoothed", "total_deaths", "new_deaths",
"new_deaths_smoothed", "total_cases_per_million", "new_cases_per_million",
"new_cases_smoothed_per_million", "total_deaths_per_million",
"new_deaths_per_million", "new_deaths_smoothed_per_million"]
regiondat.head(10)
# plot results (dates must be real datetimes for a sensible x-axis)
import matplotlib.pyplot as plt
regiondat['date'] = pd.to_datetime(regiondat['date'])
regiondat.set_index('date', inplace=True)
regiondat.groupby('location')['new_cases_smoothed_per_million'].plot(legend=True)
plt.show()
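In the Python version, the vax data arrive as a pickle rather than an RDS, and pandas can read one in a single call. A minimal sketch of the round trip is below; the real pickle's filename and column layout aren't specified in the handout, so the frame here is a made-up stand-in.

```python
# Sketch: round-tripping a DataFrame through a pickle file with pandas.
# The column names and values here are illustrative stand-ins, not the
# real contents of Pat's vax pickle.
import os
import tempfile
import pandas as pd

vax = pd.DataFrame({"iso_code": ["ZAF", "BWA"],
                    "date": ["2022-10-01", "2022-10-01"],
                    "total_vaccinations": [37_000_000, 1_700_000]})
path = os.path.join(tempfile.mkdtemp(), "vax-demo.pkl")
vax.to_pickle(path)
vax_back = pd.read_pickle(path)  # reading is all you need for Pat's file
print(vax_back.equals(vax))  # True
```

Note that pickles are Python-specific and can execute arbitrary code when loaded, so only unpickle files from people you trust (here, Pat).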
What you’ll need to complete this task
Here is what Pat and the team have provided
- the script
make-SA-plot.R
- the sqlite3 database
AfricaCOVID.db
(4.9 MB file)
- the Excel file with new data
newer-african-data.xlsx
- the RDS (or pickle) file with vax data
african-vax-data.rds
And you will need to supplement that by
- using your skills in R or Python
- looking things up online
- working in a programming environment like RStudio or Python Notebook
- installing a pre-compiled binary of SQLite3
- for a beginner lesson and a test database, see Software Carpentry’s SQL lesson
How to proceed
The goal is to get those updated plots ready for submission, along with a data package (if you have time). There are many ways to do this; the problem is deliberately set up to make you use multiple I/O approaches.
Whatever approach you take, remember the following: (1) make a stepwise plan, (2) check what you are doing along the way, i.e., use available tools to examine data sources so that you know what is happening, and (3) don’t get stuck, and be prepared to revise your plan.
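For step (2), examining an unfamiliar SQLite database before querying it takes only a couple of introspection queries: `sqlite_master` lists the tables, and `PRAGMA table_info` lists each table's columns. The sketch below demonstrates this on a tiny in-memory stand-in, since the real AfricaCOVID.db isn't reproduced here; its table shapes are only assumed.

```python
# Sketch: inspecting tables and columns in a SQLite database before
# querying it. Uses a small in-memory stand-in for AfricaCOVID.db.
import sqlite3

con = sqlite3.connect(":memory:")  # "AfricaCOVID.db" in the real exercise
con.executescript("""
    CREATE TABLE COVID (iso_code TEXT, date TEXT, new_cases INTEGER);
    CREATE TABLE ccodes (ISO3 TEXT, Region TEXT);
""")
tables = sorted(r[0] for r in
                con.execute("SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)  # ['COVID', 'ccodes']
for t in tables:
    cols = [r[1] for r in con.execute(f"PRAGMA table_info({t})")]
    print(t, cols)
con.close()
```

The sqlite3 CLI equivalents are `.tables` and `.schema`, which work just as well if you follow Pat's command-line route.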
Checkpoints
Who is having issues or would like some help with
- accessing the database?
- getting data out of the database into another form?
- the plotting code?
- the RDS or pickle file?
- the Excel file?
- merging the old and new data sources?
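On the last checkpoint: if both sources share country and date key columns, merging usually amounts to concatenating rows and dropping duplicate country-days. A minimal pandas sketch is below; the frames are stand-ins, and which copy to keep on overlapping dates (here, the newer one) is a choice you should make deliberately.

```python
# Sketch: merging old (database) and new (Excel) rows on shared key
# columns. The data here are made-up stand-ins with one overlapping day.
import pandas as pd

old = pd.DataFrame({"iso_code": ["ZAF", "ZAF"],
                    "date": ["2022-09-29", "2022-09-30"],
                    "new_cases": [10, 12]})
new = pd.DataFrame({"iso_code": ["ZAF", "ZAF"],
                    "date": ["2022-09-30", "2022-10-01"],  # 09-30 overlaps
                    "new_cases": [12, 15]})
merged = (pd.concat([old, new], ignore_index=True)
            .drop_duplicates(subset=["iso_code", "date"], keep="last")
            .sort_values(["iso_code", "date"], ignore_index=True))
print(len(merged))  # 3 distinct country-days
```

Before trusting the merge, it is worth checking that the overlapping days agree between the two sources; a disagreement would signal a units or date-parsing problem (Excel dates, for instance).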
Questions
How difficult is it to get data out of an SQL database, an Excel file, a CSV file, or a language-specific serialization format? What makes a format more or less accessible?
How difficult is it to use the data once you have it? Is there a difference between how easy it is to access data and how easy it is to use it effectively? What accounts for that difference?
Which way of storing and transferring data is the most robust? Which is likely to be in use 10 years from now?
For simple tables, a csv file is usually the best. When would you consider a binary serialization? An SQL database? An Excel file?
How did you decide on filenames? How did you decide on the number and content of files to make?
What did you learn about formats from this exercise? What would you do differently in the future based on what you learned?