Overview: guiding principles

Our goal as scientists is to create useful knowledge (useful may be defined on a very long time scale). Knowledge doesn’t exist if people don’t have access to it, and it’s not useful if they can’t engage with it.

Think in Terms of the Product

We want “software” (e.g., the combination of scripts, analysis code, data management, reference management, external tools devoted to addressing a particular research question or area) that is:

easy to use correctly
easy to verify (i.e., it does what is intended) and validate (i.e., there is a clear expectation of output for given inputs)
easy to tell when it is being used incorrectly, and what to change to use it correctly
easy to understand, both as a whole and individual parts
plausible to deploy in other settings (e.g., distributed computation, GUI tool for non-technical users)
generalizable (e.g., different dataset / context)

Think in Terms of the Process

Organization should support:

shifting data (new data, updates to existing data, changes to schema)
effective collaboration with people in different roles (e.g., theoreticians, modelers, field scientists)
portability
publication (results as well as recipe)
subsequent extension

Resources

DataONE primer on data management (PDF)
Wilson, et al. Best Practices for Scientific Computing
Ten Simple Rules for Reproducible Computational Research

Topics

Plan to manage your resources

Stages and activities in the data life-cycle

Storage and backup. Safeguard against accidental loss or corruption.
Organization. See below.
File encodings
Describe and document. See below.
Sharing and re-use
Preservation
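
One common way to guard against silent corruption in the storage and backup stage is to record a checksum for each file and compare it with the checksum of the backup copy. The following is a minimal sketch, not a prescribed workflow: the function names `sha256_of` and `copies_match` are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, backup):
    """Return True if the two files have identical contents (compared by SHA-256)."""
    return sha256_of(original) == sha256_of(backup)
```

Storing the digests alongside the data (for example, in the project manifest) lets you re-check a backup later without access to the original.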

Resources

UK Data Archive - a good general read, with specific sections on how to organize and save your data

Using version control

Resources

SO: Why Should I Use Version Control? and Academia SE: Why Use VC for Writing a Paper?
BioMed Central blog - several links to other publications on the value of version control in science
Git Tutorials:
- with GitHub
- visualization
- undoing
- writing good commit messages

Protocol questions

what goes under version control, and what doesn’t
branching protocol
gitflow and other patterns

Organizing project resources

Your future self is probably the top stakeholder. Think about designing products that will be distributed, either to your future self or to others.

If you can’t keep everything in one directory, maintain an up-to-date Project Map document that points to all resources. Think of your project as a network of resources. It needs a clearly identifiable root from which you can reach everything else: no orphans.
Use a directory hierarchy for major dimensions (e.g., studies, data types), and file names for minor dimensions (e.g., replicates, dates).
- Use “Archive” subdirs to keep your project tidy (leave them out of a distributed package)
- Use “README.txt” files to explain major directories
- Apply a consistent model, e.g., CamelCase.txt, kebab-case.txt, snake_case.txt.
- Make names from letters, numbers, and dot (.), dash (-) or underscore (_). Other characters can hinder automated processing.
- Be concise, e.g., “data-table” is redundant in ir-expt01-sample028-data-table.csv.
- Make sure dates and numbers sort as desired. Use ISO 8601 (YYYYMMDD). Pad counting numbers with leading zeros: 2 sorts after 12, but 02 sorts before 12.
Create a file list (manifest) and refer to it in your top-level README file
- ls -AFR1 path_to_my_dir > MANIFEST
- tar -cvf - path_to_my_dir 2>&1 >/dev/null | sort | sed 's/^a //' > MANIFEST
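
The naming rules and the manifest step above can also be sketched in Python, which may be convenient on systems without `ls` or `tar`. This is an illustration under stated assumptions: the helper names `is_safe_name`, `make_name`, and `build_manifest` are hypothetical, and the name pattern (study, zero-padded sample number, ISO 8601 date) is just one possible convention.

```python
import os
import re
from datetime import date

# Letters, digits, dot, dash, and underscore only, as recommended above.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_name(name):
    """Return True if `name` contains only characters that are safe for automation."""
    return SAFE_NAME.fullmatch(name) is not None


def make_name(study, sample, day, ext="csv", width=3):
    """Build a name with a zero-padded sample number and a YYYYMMDD date."""
    return f"{study}-sample{sample:0{width}d}-{day:%Y%m%d}.{ext}"


def build_manifest(root):
    """Return a sorted list of all file paths under `root`, relative to `root`."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            paths.append(os.path.relpath(full_path, root))
    return sorted(paths)


if __name__ == "__main__":
    # Hypothetical example values.
    print(make_name("ir-expt01", 28, date(2016, 3, 7)))
    # Plain string sorting shows why padding matters.
    print(sorted(["2", "12", "02"]))
```

Writing the result of `build_manifest` to a MANIFEST file, one path per line, gives a listing comparable to the shell commands above.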

Resources

Dryad’s instructions to Name files and directories in a consistent and descriptive manner
Dryad’s instructions to Organize files in a logical schema

Issue tracking

GitHub.com and GitLab.com both offer built-in issue tracking systems.

However, issue tracking systems are not just for software. Customer service departments use them to record complaints and other issues, and to track how those issues are addressed. Many people use issue trackers such as Trello as personal productivity tools.

Some of the key concepts are

opening and closing. An issue or ticket is “open” until it is resolved, then it is “closed”. Often tickets move through a flow of 4 or 5 stages:
- backlog. low priority or unprioritized issues, e.g., something we might do some day
- ready. issues prioritized for action
- in progress. a developer is working on this
- needs review. developer is done, now look at results and decide whether to close or put back in “ready”
- closed. this issue is done. open new tickets for any separate tasks provoked by this issue.
assignment. Issues can be assigned to one or more persons. A programmer who completes work on a ticket may assign it to a second person for review.
monitoring and notification. Action on issues can be monitored with automatic notification. At the level of the whole project, you can monitor how many issues are closed each month.
tagging and grouping. Issues can be tagged (“bug”, “feature request”) and grouped into sets corresponding to milestones or sprints.
prioritization. A team may get together regularly to review open tickets and prioritize them for action.
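
The ticket flow described above can be modeled as a small state machine. The sketch below is only an illustration, not how GitHub, GitLab, or Trello implement issues: the `Stage` and `Ticket` names are hypothetical, and the table of allowed transitions is an assumption inferred from the stage descriptions.

```python
from enum import Enum


class Stage(Enum):
    BACKLOG = "backlog"
    READY = "ready"
    IN_PROGRESS = "in progress"
    NEEDS_REVIEW = "needs review"
    CLOSED = "closed"


# Assumed transitions: review either closes a ticket or sends it back to "ready".
TRANSITIONS = {
    Stage.BACKLOG: {Stage.READY},
    Stage.READY: {Stage.IN_PROGRESS},
    Stage.IN_PROGRESS: {Stage.NEEDS_REVIEW},
    Stage.NEEDS_REVIEW: {Stage.CLOSED, Stage.READY},
    Stage.CLOSED: set(),
}


class Ticket:
    def __init__(self, title, tags=()):
        self.title = title
        self.tags = set(tags)      # e.g., "bug", "feature request"
        self.assignees = set()     # one or more people
        self.stage = Stage.BACKLOG

    @property
    def is_open(self):
        """A ticket is open until it reaches the closed stage."""
        return self.stage is not Stage.CLOSED

    def move_to(self, new_stage):
        """Move the ticket to `new_stage`, rejecting moves the workflow does not allow."""
        if new_stage not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value!r} to {new_stage.value!r}"
            )
        self.stage = new_stage
```

Making the allowed transitions explicit is one way for a team to agree on its protocol, for example that nothing is closed without passing through review.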

Resources

Wikipedia issue tracking system
Comparison of issue-tracking systems

Using tests

What do you test? Focus first on functional requirements. This applies both to low-level testing and to high-level testing. Your software (and its functions or methods) is supposed to generate some useful output from valid inputs. Make sure it does. Then you can start testing for other things, like how well it handles exceptions such as invalid inputs.

Some concepts

“regression testing” means checking that the software still does what it did previously
unit tests focus on individual functions or methods
integration tests depend on multiple functions or methods working together
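
As a minimal sketch of this ordering, the unit tests below use Python's `unittest` to check a functional requirement first and an invalid input second. The `mean` function is a hypothetical stand-in for your own code.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTests(unittest.TestCase):
    def test_valid_input(self):
        # Functional requirement first: valid input gives the expected output.
        self.assertEqual(mean([2, 4, 6]), 4)

    def test_empty_input_raises(self):
        # Then exceptional cases: invalid input should fail loudly.
        with self.assertRaises(ValueError):
            mean([])
```

If this is saved in a file such as test_mean.py (a hypothetical name), running `python -m unittest test_mean` executes both tests. Re-running the same tests after every change is a simple form of regression testing.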

Resources

20 practical testing tips
Wikipedia on software testing
R unit-testing library testthat
Python unittest

Packaging resources for distribution

Resources

General Discussion for R Packages
Python Project Template Guide - challenge: write a bash script to automate this approach
…or try any of several pre-packaged options

Documenting code

Resources

a beginner’s guide to writing docs
Sphinx
in-line documentation
- Doxygen (for C, C++, Java, Python, some other languages)
- Python in-line DocStrings explanation
- Writing documentation with roxygen2 (R in-line documentation)
- Perl’s POD (plain old documentation). Always remember with POD that you have to put blank lines before and after each command.
why writing better code reduces the need for commenting
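
To make the idea of in-line documentation concrete, here is a small sketch of a Python docstring. The function `celsius_to_kelvin` is a hypothetical example, and the NumPy-style section layout is only one assumed convention among several in common use.

```python
def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin.

    Parameters
    ----------
    temp_c : float
        Temperature in degrees Celsius; must not be below -273.15.

    Returns
    -------
    float
        Temperature in kelvin.

    Raises
    ------
    ValueError
        If `temp_c` is below absolute zero.
    """
    if temp_c < -273.15:
        raise ValueError("temperature below absolute zero")
    return temp_c + 273.15
```

The docstring travels with the code: `help(celsius_to_kelvin)` displays it in an interactive session, and documentation generators can extract it.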

IDEs and other tools

What is an IDE and why use one (or not)?
Broad Comparison of IDEs - what seems to be important?
RStudio IDE Project
PyCharm IDE Project

Collaboration technology and protocols

How are you going to share work? Communicate? Keep records of decisions? Track progress on goals? Choose a set of assistive technologies for sharing, and develop protocols for using them with your team.

Technologies

Decide what types of communication and collaboration technology facilitate the success of your project. Choose (ideally) just one tool of each type.

communication and collaboration technology types
- real-time communication by phone, email, or chat
- virtual meetings (Skype, Hangouts, Zoom, etc.)
- comment threads on tickets in your issue tracker
file-sharing
- version control
- Dropbox, Drive, etc.
document-sharing and collaborative editing
- Drive docs, etherpad

How much can you do with one platform? For instance, maintain code and issues in GitHub, maintain planning docs on GitHub as well, add Gitter for chat, add Waffle as a multi-repo project tracker, and Jekyll for CI.

Protocols

Agree on which technologies are used for which things.
- What goes into version control in a shared repo, and what goes in document sharing?
- Public vs. private is a big issue here. You may have a public code repo, but you need a private channel to discuss sensitive project issues, and a private file-sharing space for things like manuscripts in progress. Do you want to keep planning documents private? Do you want a private communication channel?
- Which kinds of plans are discussed in a public chatroom, and which are reserved for private channels?
Agree on how to use a technology.
- shared editing. Wikipedia style, or is one person the lead author or owner?