Overview: guiding principles

Our goal as scientists is to create useful knowledge (useful may be defined on a very long time scale). Knowledge doesn’t exist if people don’t have access to it, and it’s not useful if they can’t engage with it.

Think in Terms of the Product

We want “software” (e.g., the combination of scripts, analysis code, data management, reference management, and external tools devoted to addressing a particular research question or area) that is:

  • easy to use correctly
  • easy to verify (i.e., it does what is intended) and validate (i.e., there is a clear expectation of the output for given inputs)
  • easy to recognize when it is being used incorrectly, and to see what to change to use it correctly
  • easy to understand, both as a whole and in its individual parts
  • plausible to deploy in other settings (e.g., distributed computation, GUI tool for non-technical users)
  • generalizable (e.g., different dataset / context)

Think in Terms of the Process

Organization should support:

  • shifting data (new data, updates to existing data, changes to schema)
  • effective collaboration with people in different roles (e.g., theoreticians, modelers, field scientists)
  • portability
  • publication (results as well as recipe)
  • subsequent extension

Resources

Topics

Plan to manage your resources

Stages and activities in the data life-cycle

  • Storage and backup. Safeguard against accidental loss or corruption.
  • Organization. See below.
  • File encodings
  • Describe and document. See below.
  • Sharing and re-use
  • Preservation

Resources

  • UK Data Archive - a good general read; some sections are specifically pertinent to organizing and saving your own data

Using version control

Resources

Protocol questions

  • what goes under version control, and what doesn’t
  • branching protocol
  • gitflow and other patterns

Organizing project resources

Your future self is probably the top stakeholder. Think about designing products that will be distributed, either to your future self or to others.

  • If you can’t keep everything in one directory, maintain an up-to-date Project Map document that points to all resources. Think of your project as a network of resources. It needs a clearly identifiable root from which you can reach everything else, with no orphans.
  • Use a directory hierarchy for major dimensions (e.g., studies, data types), and file names for minor dimensions (e.g., replicates, dates).
    • Use “Archive” subdirs to keep your project tidy (leave them out of a distributed package)
    • Use “README.txt” files to explain major directories
    • Apply a consistent naming model, e.g., CamelCase.txt, kebab-case.txt, or snake_case.txt.
    • Make names from letters, numbers, and dot (.), dash (-) or underscore (_). Other characters can hinder automated processing.
    • Be concise, e.g., “data-table” is redundant in ir-expt01-sample028-data-table.csv.
    • Make sure dates and numbers sort as desired. Use ISO 8601 (YYYYMMDD). Pad counting numbers with leading zeros: 2 sorts after 12, but 02 sorts before 12.
  • Create a file list (manifest) and refer to it in your top-level README file
    • ls -AFR1 path_to_my_dir > MANIFEST
    • tar -cvf - path_to_my_dir 2>&1 >/dev/null | sort | sed 's/^a //' > MANIFEST
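
The naming rules and the manifest step above can be sketched in Python. This is only an illustration under stated assumptions: the function names (make_name, is_safe_name, write_manifest) and the field layout study-exptNN-sampleNNN-YYYYMMDD are hypothetical, chosen for the example rather than taken from any existing tool.

```python
from __future__ import annotations

import re
from datetime import date
from pathlib import Path

# Letters, numbers, dot, dash, and underscore only (the rule stated above).
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def make_name(study: str, expt: int, sample: int, when: date, ext: str = "csv") -> str:
    """Build a file name with zero-padded counters and a YYYYMMDD date."""
    # Hypothetical layout: study-exptNN-sampleNNN-YYYYMMDD.ext
    return f"{study}-expt{expt:02d}-sample{sample:03d}-{when:%Y%m%d}.{ext}"


def is_safe_name(name: str) -> bool:
    """Return True if the name uses only the characters recommended above."""
    return SAFE_NAME.fullmatch(name) is not None


def write_manifest(root: Path, manifest: Path) -> list[str]:
    """Write a sorted list of every file under root (relative paths) to manifest."""
    entries = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    manifest.write_text("\n".join(entries) + "\n", encoding="utf-8")
    return entries


if __name__ == "__main__":
    print(make_name("ir", 1, 28, date(2020, 3, 5)))
    print(is_safe_name("my data (final).csv"))
    # Names sort as text, so padding decides the order.
    print(sorted(["sample2", "sample12"]), sorted(["sample02", "sample12"]))
```

A manifest written inside the directory it describes will list itself on the next run, so in this sketch it is simplest to write it to a path outside root or to regenerate it from scratch each time.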

Resources

Issue tracking

GitHub.com and GitLab.com both offer built-in issue tracking systems.

However, issue tracking systems are not just for software. Customer service departments use them to record complaints and other issues, and to track how those issues are addressed. Many people use issue trackers such as Trello as personal productivity tools.

Some of the key concepts are

  • opening and closing. An issue or ticket is “open” until it is resolved, then it is “closed”. Often there is a flow for tickets with 4 or 5 stages:
    • backlog. low priority or unprioritized issues, e.g., something we might do some day
    • ready. issues prioritized for action
    • in progress. a developer is working on this
    • needs review. developer is done, now look at results and decide whether to close or put back in “ready”
    • closed. this issue is done. open new tickets for any separate tasks provoked by this issue.
  • assignment. Issues can be assigned to one or more persons. A programmer who completes work on a ticket may assign it to a second person for review.
  • monitoring and notification. Action on issues can be monitored with automatic notification. At the level of the whole project, you can monitor how many issues are closed each month.
  • tagging and grouping. Issues can be tagged (“bug”, “feature request”) and grouped into sets corresponding to milestones or sprints.
  • prioritization. A team may get together regularly to review open tickets and prioritize them for action.

Resources

Using tests

What do you test? Focus first on functional requirements. This applies both to low-level testing and to high-level testing. Your software (and its functions or methods) is supposed to generate some useful output from valid inputs. Make sure it does. Then you can start testing for other things, like how well it handles exceptions such as invalid inputs.

Some concepts

  • “regression testing” means checking that the software still does what it did previously
  • unit tests focus on individual functions or methods
  • integration tests depend on multiple functions or methods working together
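
The three kinds of test can be illustrated with Python's standard unittest module. This is a minimal sketch: parse_readings and mean are hypothetical functions invented for the example, and the regression reference value is made up rather than recorded from a real earlier run.

```python
from __future__ import annotations

import unittest


def parse_readings(text: str) -> list[float]:
    """Parse comma-separated numeric readings, e.g. "1.5, 2.5"."""
    return [float(item) for item in text.split(",")]


def mean(values: list[float]) -> float:
    """Return the arithmetic mean of a non-empty list."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


class TestReadings(unittest.TestCase):
    def test_mean_unit(self):
        # Unit test: one function, valid input, known output.
        self.assertEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_parse_then_mean_integration(self):
        # Integration test: two functions working together.
        self.assertAlmostEqual(mean(parse_readings("1.5, 2.5")), 2.0)

    def test_mean_regression(self):
        # Regression test: compare against a reference value from an earlier
        # run (made up here) so that behavior cannot change silently.
        self.assertAlmostEqual(mean([0.1, 0.2, 0.3]), 0.2)

    def test_mean_rejects_empty_input(self):
        # Only after the functional cases: handling of invalid input.
        with self.assertRaises(ValueError):
            mean([])
```

If the block is saved as, say, test_readings.py (a hypothetical file name), the tests can be run with python -m unittest test_readings. The functional cases come first and the invalid-input case last, following the order suggested above.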

Resources

Packaging resources for distribution

Resources

Documenting code

Resources

IDEs and other tools

Collaboration technology and protocols

How are you going to share work? Communicate? Keep records of decisions? Track progress on goals? Choose a set of supporting technologies for sharing, and develop protocols for using them with your team.

Technologies

Decide what types of communication and collaboration technology facilitate the success of your project. Choose (ideally) just one tool of each type.

  1. communication
    • direct communication by phone, email, or chat
    • virtual meetings (Skype, Hangouts, Zoom, etc.)
    • comment threads on tickets in your issue tracker
  2. file-sharing
    • version control
    • Dropbox, Google Drive, etc.
  3. document-sharing and collaborative editing
    • Google Docs, Etherpad

How much can you do with one platform? For instance, maintain code, issues, and planning docs on GitHub, then add Gitter for chat, Waffle as a multi-repo project tracker, and Jenkins for continuous integration (CI).

Protocols

  1. Agree on which technologies are used for which things.
    • What goes into version control in a shared repo, and what goes in document sharing?
    • Public vs. private is a big issue here. You may have a public code repo, but you need a private channel to discuss sensitive project issues, and a private file-sharing space for things like manuscripts in progress. Do you want to keep planning documents private? Do you want a private communication channel?
    • Which kinds of plans are discussed in a public chatroom, and which are reserved for private channels?
  2. Agree on how to use a technology
    • shared editing. Wikipedia style, or is one person the lead author or owner?