REQUIRED

Tomorrow, you will need to re-present your project. Recall the general feedback on the purpose of these presentations: to think about your work from the perspective of how it is engineered to achieve your results, and less about the results / analysis per se.

Spend some time preparing that today. There will be some additional time tomorrow, and you can work on it outside of formal course time, but we do want you to spend some time thinking about the prompts below about your project's inputs and outputs. Express your project in terms of the flow from start to finish AND identify what element of that flow you want to work on next week. “Work on” might take many forms - e.g., it could be replacing some element with a library, instrumenting some step to understand how long it takes, documenting code to make it more reusable, and so on.

But: be clear about what the outcome of that work would be if accomplished AND have a plan that has incremental, small steps. That is, have the work plan start with a step that you think you can accomplish in a very short period of time: a morning at most. Assemble your plan out of several such steps.

Code vs Input

  • Does your project currently have “code” that might alternatively be thought of as input? E.g., parameter or configuration values that are included directly in your scripts or source code, instead of in input files. If so, make an inventory of those files, a tabulation of the parameters, or some other sort of key as to what’s currently being hard-coded.
  • For those cases, do you find yourself regularly changing those values, e.g. to run a new scenario? If other people were to use your code, might they want to change those values? What is the “nature” of that data: key-value, tabular, …? Annotate the above list with those conclusions.
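
As one illustration of treating hard-coded values as input, here is a minimal sketch that keeps defaults in the script but lets a key-value JSON file override them. The parameter names (n_samples, seed, output_dir), the file name scenario.json, and the function load_config are all hypothetical; substitute whatever your own inventory turns up.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults: stand-ins for values currently hard-coded in a script.
DEFAULTS = {"n_samples": 1000, "seed": 42, "output_dir": "results"}


def load_config(path, defaults=DEFAULTS):
    """Return the defaults, overridden by any key-value pairs in a JSON file."""
    config = dict(defaults)
    path = Path(path)
    if path.exists():
        overrides = json.loads(path.read_text())
        # Reject misspelled or unexpected keys instead of silently ignoring them.
        unknown = set(overrides) - set(defaults)
        if unknown:
            raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
        config.update(overrides)
    return config


# Demo: a scenario file that changes only one parameter.
with tempfile.TemporaryDirectory() as tmp:
    scenario = Path(tmp) / "scenario.json"
    scenario.write_text(json.dumps({"n_samples": 50}))
    print(load_config(scenario))
```

Running a new scenario then means writing a new small file rather than editing the source, and the keys of DEFAULTS double as the tabulation of parameters asked for above.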

Kinds of Input & Output

  • List your inputs and outputs, and label those data with their best fit as tabular, relational, hierarchical, key-value, or human-oriented (e.g., a plot or sound file).
  • For your inputs and outputs, which are general purpose formats (e.g., csv, json, sqlite, protobuf) with libraries for many languages, specialized (e.g., rds, docx) with application-specific access, or custom (e.g., human readable, but with specialized parsing rules)? What are your reasons for using general vs specialized vs custom formats?
  • For your inputs and outputs, which are human-readable (plaintext) vs machine-only readable (binary)? What are your reasons for using plaintext vs binary formats for different parts of your project?
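
One practical consequence of format choice can be seen with a small round trip. The sketch below, using made-up records and only the Python standard library, writes the same tabular data as csv and as json and reads it back: both are general-purpose plaintext formats, but csv returns every field as a string while json preserves the number type.

```python
import csv
import io
import json

# Made-up tabular records, for illustration only.
rows = [{"site": "A", "count": 3}, {"site": "B", "count": 5}]


def to_csv_text(records):
    """Serialize a list of flat dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv_text(text):
    """Parse CSV text back into a list of dicts (all values are strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


csv_round_trip = from_csv_text(to_csv_text(rows))
json_round_trip = json.loads(json.dumps(rows))

print(csv_round_trip[0])   # count comes back as the string "3"
print(json_round_trip[0])  # count comes back as the integer 3
```

Neither behaviour is wrong, but it is the kind of reason worth writing down when you explain why a given input or output uses the format it does.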

Intermediate Results

  • Compared to your project's overall outputs, are there intermediate files that are only used internally to your project? Thinking of your project in pseudocode or a flow-diagram, what steps are those files associated with?
  • Does your project have a “resume” capability? If so, how do you implement it? If not, how might it benefit from one?
  • Do you cache any intermediate results? If so, why?
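
If your project has no “resume” capability yet, one simple way to get one is to cache each step's result in an intermediate file and skip the step when that file already exists. The following is a minimal sketch of that idea, assuming the step's result can be stored as JSON; the function name cached_step and the file name step1.json are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def cached_step(cache_path, compute):
    """Return (result, was_cached): load from cache_path if present, else compute and save."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text()), True
    result = compute()
    # Write under a temporary name first, then rename, so that an interrupted
    # run does not leave a half-written cache file behind.
    partial = cache_path.with_name(cache_path.name + ".partial")
    partial.write_text(json.dumps(result))
    partial.replace(cache_path)
    return result, False


def slow_step():
    """Stand-in for an expensive computation."""
    return {"total": sum(range(1000))}


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "step1.json"
    print(cached_step(target, slow_step))  # computed on the first call
    print(cached_step(target, slow_step))  # loaded from the file on the second call
```

Note that this sketch never invalidates the cache: if the inputs or the code for a step change, the stale file has to be deleted by hand, which is itself something to mention when describing your intermediate results.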

Testing & Validation

  • Which of your project requirements specifically concern an input or output file feature? E.g., must read / write particular format, size, …? Or more generally the creation or consumption of specific data in specific formats?
  • Does your project include any testing of inputs or outputs? E.g., validation of inputs, checking for missing data, confirmation of output consistency with a validation standard, …? If not, have you ever had to troubleshoot issues to do with invalid inputs or outputs? Would a validator have been useful for that troubleshooting?
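
To make “validator” concrete, here is a minimal sketch of one for a csv input. It assumes a hypothetical file layout with required columns id and value, where value must be numeric; the column names and the function validate_rows are illustrative, not a prescribed design.

```python
import csv
import io

REQUIRED_COLUMNS = ("id", "value")  # hypothetical required columns
NUMERIC_COLUMNS = ("value",)        # hypothetical columns that must parse as numbers


def validate_rows(text, required=REQUIRED_COLUMNS, numeric=NUMERIC_COLUMNS):
    """Return a list of human-readable problems found in CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {missing}"]
    problems = []
    # Line 1 is the header; this numbering assumes one record per line.
    for line_number, row in enumerate(reader, start=2):
        for column in required:
            cell = (row[column] or "").strip()
            if not cell:
                problems.append(f"line {line_number}: missing {column}")
            elif column in numeric:
                try:
                    float(cell)
                except ValueError:
                    problems.append(
                        f"line {line_number}: {column} is not numeric: {cell!r}"
                    )
    return problems


print(validate_rows("id,value\n1,2.5\n"))
print(validate_rows("id,value\n1,\n2,abc\n"))
```

Reporting every problem with its line number, rather than stopping at the first one, is a deliberate choice here: it tends to make troubleshooting a bad input file much quicker.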

Amounts & Locations

  • How much input does your project rely on? Estimate in kilobytes, megabytes, gigabytes, (terabytes?) as appropriate.
  • …how much output does it create? Again estimate the amount.
  • Is the input used or output created by your project all local? Or does some travel over a network (e.g. from a supercomputer to your machine, to users over the internet, shared to collaborators, …)?
  • How much time does your project spend loading data (e.g., reading in a csv) vs analyzing the information in the file? If you don’t know, how might you measure this?
  • How much time does your project spend writing data (e.g., generating and saving a plot) vs generating the time series being plotted? If you don’t know, how might you measure this?
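
If you don't know where the time goes, one low-effort way to measure it is to wrap each stage in a timer. The sketch below uses time.perf_counter from the Python standard library; the load and analyze functions and the synthetic data are placeholders for your own stages, and a profiler would give a finer-grained picture.

```python
import csv
import io
import time


def timed(function, *args):
    """Call function(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start


def load(text):
    """Placeholder loading stage: parse one numeric column from CSV text."""
    return [float(row["value"]) for row in csv.DictReader(io.StringIO(text))]


def analyze(values):
    """Placeholder analysis stage: compute the mean."""
    return sum(values) / len(values)


# Synthetic input: a header line followed by the numbers 0..9999.
synthetic = "value\n" + "\n".join(str(i) for i in range(10000))

values, load_seconds = timed(load, synthetic)
mean, analyze_seconds = timed(analyze, values)
print(f"load: {load_seconds:.4f} s, analyze: {analyze_seconds:.4f} s, mean: {mean}")
```

The same wrapper can be put around the step that writes a plot or output file to compare writing time against generating time.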

Sharing

  • Are your raw inputs shared? Cleaned up inputs? Why or why not?
  • How do your collaborators manage data? Which services or processes did your group consider before settling on the current one(s)?
  • What (if any) agreements or policies are in place governing how you handle your data and results? For example, if your data relates to people, are there restrictions regarding identifiability of individuals? What restrictions are there on where your data may be stored?
  • If there are restrictions on sharing inputs or outputs, how are those restrictions made clear to other researchers (e.g., those who are evaluating which results can be publicized)?