Overview: guiding principles
Our goal as scientists is to create useful knowledge (useful may be defined on a very long time scale). Knowledge doesn’t exist if people don’t have access to it, and it’s not useful if they can’t engage with it.
Think in Terms of the Product
We want “software” (e.g., the combination of scripts, analysis code, data management, reference management, external tools devoted to addressing a particular research question or area) that is:
- easy to use correctly
- easy to verify (i.e., it does what is intended) and validate (i.e., there is a clear expectation of output for given inputs)
- easy to recognize when it is being used incorrectly, and to see what to change to use it correctly
- easy to understand, both as a whole and in its individual parts
- plausible to deploy in other settings (e.g., distributed computation, GUI tool for non-technical users)
- generalizable (e.g., different dataset / context)
Think in Terms of the Process
Organization should support:
- shifting data (new data, updates to existing data, changes to schema)
- effective collaboration with people in different roles (e.g., theoreticians, modelers, field scientists)
- portability
- publication (results as well as recipe)
- subsequent extension
Resources
- DataONE primer on data management (PDF)
- Wilson et al. Best Practices for Scientific Computing
- Ten Simple Rules for Reproducible Computational Research
Topics
Plan to manage your resources
Stages and activities in the data life-cycle
- Storage and backup. Safeguard against accidental loss or corruption.
- Organization. See below.
- File encodings
- Describe and document. See below.
- Sharing and re-use
- Preservation
Resources
- FAIR data principles
- DataONE primer on data management (PDF)
- Introduction to Open Science: Why data versioning and data care practices are key for science and social science.
- Data Management Discussion
- UK Data Archive - a good general read; certain sections are especially pertinent to how to organize and save your data
Using version control
Resources
- SO: Why Should I Use Version Control? and Academia SE: Why Use VC for Writing a Paper?
- Biomed Central Blog - several links to other publications on value of version control in science
- Git Tutorials:
  - with GitHub
  - visualization
  - undoing
  - writing good commit messages
Protocol questions
- what goes under version control, and what doesn’t
- branching protocol
- gitflow and other patterns
Organizing project resources
Your future self is probably the top stakeholder. Think about designing products that will be distributed, either to your future self or to others.
- If you can’t keep everything in one directory, maintain an up-to-date Project Map document that points to all resources. Think of your project as a network of resources. It needs a clearly identifiable root from which you can reach everything else, with no orphans.
- Use a directory hierarchy for major dimensions (e.g., studies, data types), and file names for minor dimensions (e.g., replicates, dates).
- Use “Archive” subdirs to keep your project tidy (leave them out of a distributed package)
- Use “README.txt” files to explain major directories
- Apply a consistent naming convention, e.g., CamelCase.txt, kebab-case.txt, or snake_case.txt.
- Make names from letters, numbers, and dot (.), dash (-) or underscore (_). Other characters can hinder automated processing.
- Be concise, e.g., “data-table” is redundant in ir-expt01-sample028-data-table.csv.
- Make sure dates and numbers sort as desired. Use ISO 8601 dates (YYYYMMDD or YYYY-MM-DD). Pad counting numbers with leading zeros: in a character-by-character sort, 2 sorts after 12, but 02 sorts before 12.
- Create a file list (manifest) and refer to it in your top-level README file
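# List every file and directory recursively:
ls -AFR1 path_to_my_dir > MANIFEST
# Or a sorted list of full paths (-f - sends the archive to stdout, which is discarded):
tar -cvf - path_to_my_dir 2>&1 >/dev/null | sort | sed 's/^a //' > MANIFEST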
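The naming rules above can be illustrated with a minimal Python sketch. The helper make_name and the experiment, sample, and date values are hypothetical, chosen only for illustration. The point is that zero-padded counters and ISO 8601 dates make a plain character-by-character sort agree with the intended order.

```python
from datetime import date


def make_name(expt: int, sample: int, day: date, ext: str = "csv") -> str:
    """Build a kebab-case file name with zero-padded counters and an ISO 8601 date.

    The 'ir-exptNN-sampleNNN' pattern follows the example in the text; the
    function itself is a hypothetical helper, not part of any library.
    """
    return f"ir-expt{expt:02d}-sample{sample:03d}-{day:%Y%m%d}.{ext}"


# Unpadded counters sort character by character, so 12 lands before 2.
unpadded = sorted(["sample2", "sample12", "sample1"])

# Zero-padded counters sort in counting order.
padded = sorted(["sample002", "sample012", "sample001"])

# Illustrative (made-up) samples and dates, sorted by file name alone.
names = sorted(
    make_name(1, sample, day)
    for sample, day in [
        (28, date(2020, 3, 1)),
        (3, date(2019, 12, 31)),
        (112, date(2020, 1, 15)),
    ]
)

if __name__ == "__main__":
    print(unpadded)
    print(padded)
    for name in names:
        print(name)
```

Because the sample number comes before the date in the name, the files sort by sample first and by date second. Put the date first if chronological order matters more.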
Resources
- Dryad’s instructions to Name files and directories in a consistent and descriptive manner
- Dryad’s instructions to Organize files in a logical schema
Issue tracking
GitHub.com and GitLab.com both offer built-in issue tracking systems.
However, issue tracking systems are not just for software. Customer service departments use them to record complaints and other issues, and to track how those issues are addressed. Many people use issue trackers such as Trello as personal productivity tools.
Some of the key concepts are
- opening and closing. An issue or ticket is “open” until it is resolved, then it is “closed”. Often there is a flow for tickets with 4 or 5 stages:
  - backlog. low priority or unprioritized issues, e.g., something we might do some day
  - ready. issues prioritized for action
  - in progress. a developer is working on this
  - needs review. developer is done, now look at results and decide whether to close or put back in “ready”
  - closed. this issue is done. open new tickets for any separate tasks provoked by this issue.
- assignment. Issues can be assigned to one or more persons. A programmer who completes work on a ticket may assign it to a second person for review.
- monitoring and notification. Action on issues can be monitored with automatic notification. At the level of the whole project, you can monitor how many issues are closed each month.
- tagging and grouping. Issues can be tagged (“bug”, “feature request”) and grouped into sets corresponding to milestones or sprints.
- prioritization. A team may get together regularly to review open tickets and prioritize them for action.
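The ticket flow described above can be sketched as a small state machine. This is a toy model, not how GitHub, GitLab, or Trello implement issues. The class names, the allowed transitions, and the example ticket are all assumptions made for illustration.

```python
from enum import Enum


class Stage(Enum):
    BACKLOG = "backlog"
    READY = "ready"
    IN_PROGRESS = "in progress"
    NEEDS_REVIEW = "needs review"
    CLOSED = "closed"


# Allowed moves between stages. This table is an assumption; real trackers
# usually let each team configure its own flow.
TRANSITIONS = {
    Stage.BACKLOG: {Stage.READY},
    Stage.READY: {Stage.IN_PROGRESS, Stage.BACKLOG},
    Stage.IN_PROGRESS: {Stage.NEEDS_REVIEW},
    Stage.NEEDS_REVIEW: {Stage.CLOSED, Stage.READY},
    Stage.CLOSED: set(),
}


class Ticket:
    """A hypothetical ticket with a stage, tags, and assignees."""

    def __init__(self, title, tags=()):
        self.title = title
        self.tags = set(tags)
        self.assignees = []
        self.stage = Stage.BACKLOG

    @property
    def is_open(self):
        # A ticket is open in every stage except "closed".
        return self.stage is not Stage.CLOSED

    def move_to(self, new_stage):
        # Refuse moves that skip a stage, e.g., backlog straight to closed.
        if new_stage not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {new_stage.value}"
            )
        self.stage = new_stage


ticket = Ticket("Plot axis labels are swapped", tags=["bug"])
ticket.move_to(Stage.READY)            # prioritized for action
ticket.move_to(Stage.IN_PROGRESS)      # a developer picks it up
ticket.assignees.append("reviewer")    # hand off to a second person
ticket.move_to(Stage.NEEDS_REVIEW)
ticket.move_to(Stage.CLOSED)

if __name__ == "__main__":
    print(ticket.title, "->", ticket.stage.value)
```

Writing the flow down this explicitly, even on paper, helps a team agree on what each stage means before choosing a tool.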
Resources
Using tests
What do you test? Focus first on functional requirements. This applies both to low-level testing and to high-level testing. Your software (and its functions or methods) is supposed to generate some useful output from valid inputs. Make sure it does. Then you can start testing for other things, like how well it handles exceptions such as invalid inputs.
Some concepts
- regression tests check that the software still does what it did previously
- unit tests focus on individual functions or methods
- integration tests depend on multiple functions or methods working together
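A minimal sketch of these concepts using Python's standard unittest module follows. The functions mean and center are hypothetical examples, not taken from any particular project. The tests check the functional requirement first, then the handling of invalid input.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


def center(values):
    """Subtract the mean from every value (relies on mean())."""
    m = mean(values)
    return [v - m for v in values]


class TestMean(unittest.TestCase):
    def test_valid_input(self):
        # Unit test of the functional requirement: valid input, useful output.
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input(self):
        # Only after the main requirement is covered, test invalid input.
        self.assertRaises(ValueError, mean, [])


class TestCenter(unittest.TestCase):
    def test_centered_values_sum_to_zero(self):
        # Integration-style test: center() and mean() must work together.
        self.assertAlmostEqual(sum(center([1.0, 2.0, 6.0])), 0.0)

    def test_known_output(self):
        # Regression-style test: pin down an output that was correct before,
        # so a later change that alters it is noticed.
        self.assertEqual(center([1, 2, 3]), [-1.0, 0.0, 1.0])
```

Saved as, say, test_stats.py (a hypothetical file name), the tests can be run with python -m unittest test_stats.py.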
Resources
- 20 practical testing tips
- Wikipedia on software testing
- R unit-testing library testthat
- Python unittest
Packaging resources for distribution
Resources
- General Discussion for R Packages
- Python Project Template Guide - challenge: write a bash script to automate this approach
- …or try any of several pre-packaged options
Documenting code
Resources
- a beginner’s guide to writing docs
- Sphinx
- in-line documentation
  - Doxygen (for C, C++, Java, Python, some other languages)
  - Python in-line DocStrings explanation
  - Writing documentation with roxygen2 (R in-line documentation)
  - Perl’s POD (plain old documentation). Always remember with POD that you have to put blank lines before and after each command.
- why writing better code reduces the need for commenting
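As a small illustration of in-line documentation, here is a sketch of a Python docstring whose examples double as tests via the standard doctest module. The function pad_counter is a hypothetical example, and the Args/Returns layout is one common convention among several.

```python
import doctest


def pad_counter(n, width=3):
    """Return ``n`` as a zero-padded string.

    Args:
        n: A non-negative integer counter (e.g., a sample number).
        width: Minimum number of digits in the result.

    Returns:
        The counter as a string, left-padded with zeros.

    Example:
        >>> pad_counter(28)
        '028'
        >>> pad_counter(7, width=2)
        '07'
    """
    return str(n).zfill(width)


if __name__ == "__main__":
    # Run the examples embedded in the docstrings as tests.
    doctest.testmod()
```

The same text is shown by help(pad_counter), so the documentation lives next to the code it describes, and the embedded examples fail loudly if they drift out of date.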
IDEs and other tools
- What is an IDE and why use one (or not)?
- Broad Comparison of IDEs - what seems to be important?
- RStudio IDE Project
- PyCharm IDE Project
Collaboration technology and protocols
How are you going to share work? Communicate? Keep records of decisions? Track progress on goals? Choose a set of assistive technologies for sharing, and develop protocols for using them with your team.
Technologies
Decide what types of communication and collaboration technology facilitate the success of your project. Choose (ideally) just one tool of each type.
- communication and collaboration technology types
  - real-time communication by phone, email, or chat
  - virtual meetings (Skype, Hangouts, Zoom, etc.)
  - comment threads on tickets in your issue tracker
  - file-sharing
    - version control
    - Dropbox, Drive, etc.
  - document-sharing and collaborative editing
    - Drive docs, Etherpad
How much can you do with one platform? For instance, maintain code, issues, and planning docs in GitHub, then add Gitter for chat, Waffle as a multi-repo project tracker, and Jekyll for CI.
Protocols
- Agree on which technologies are used for which things.
  - What goes into version control in a shared repo, and what goes in document sharing?
  - Public vs. private is a big issue here. You may have a public code repo, but you need a private channel to discuss sensitive project issues, and a private file-sharing space for things like manuscripts in progress. Do you want to keep planning documents private? Do you want a private communication channel?
  - Which kinds of plans are discussed in a public chatroom, and which are reserved for private channels?
- Agree on how to use a technology.
  - shared editing. Wikipedia style, or is one person the lead author or owner?