REQUIRED
Later today, you will re-present your project. Recall the general feedback on the purpose of these presentations: to think of your work from the perspective of how it is engineered to achieve your results, and less about the results / analysis per se. Express your project in terms of the flow from start to finish AND identify what element of that flow you want to work on next week. “Work on” might take many forms, e.g. replacing some element with a library, instrumenting some step to understand how long it takes, documenting code to make it more reusable, and so on. Be clear about what the outcome of that work would be if accomplished AND have a plan made of small, incremental steps. That is, have the work plan start with a step that you think you can accomplish in a very short period of time: a morning at most. Assemble your plan out of several such steps.
Spend some time finalizing that today. Once that’s done, please upload your slides to the shared Google Drive folder.
However, we do want you to spend some time thinking about the prompts below relative to HPC. Your presentation need not be perfect; it only needs to be sufficient to elicit feedback on your plan for the hack-a-thon next week. Hence the emphasis on “some”.
Performance of your code
- Do you know how long it takes for your project to run start to finish? How much memory it uses? How much it reads from / writes to disk?
- Do you know what the relative contribution of each of the sub-steps is to that run-time, memory use, and read/write volume?
- If you do know these durations etc, how did you get that information? What tools did you use to measure this? If you don’t know, what tools are out there to make this measurement? What might you need to change in your code to make these measurements?
- Measure how long it takes for the operating system to start up and shut down your program
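As one possible starting point for the measurement questions above, here is a minimal Python sketch using only the standard library. It assumes your project can be split into named steps; the `measure` helper, the `startup_cost` function, and the two-step "load" / "transform" pipeline are hypothetical stand-ins for your own code. Note that `tracemalloc` only sees memory allocated by Python itself, so treat the memory figure as a rough lower bound, and do not nest `measure` blocks as written.

```python
import subprocess
import sys
import time
import tracemalloc
from contextlib import contextmanager

timings = {}  # step name -> (elapsed seconds, peak bytes allocated by Python)


@contextmanager
def measure(name):
    """Record wall-clock time and peak Python-allocated memory for one step."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        timings[name] = (elapsed, peak)


def startup_cost():
    """Time launching and exiting a Python interpreter that does no work."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


# Hypothetical two-step pipeline standing in for your project.
with measure("load"):
    data = list(range(200_000))
with measure("transform"):
    squares = [x * x for x in data]

# Report each step's absolute time and its share of the measured total.
total = sum(seconds for seconds, _ in timings.values())
for name, (seconds, peak) in timings.items():
    share = seconds / total
    print(f"{name}: {seconds:.4f} s ({share:.0%} of total), peak {peak / 1e6:.1f} MB")
print(f"interpreter start + exit: {startup_cost():.4f} s")
```

Wrapping each sub-step this way gives the relative contributions directly. Timing an empty interpreter run gives a baseline for how much of a short job is launch overhead rather than your own code.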
Parallel vs Serial
- Thinking about a pseudocode / flow diagram perspective of your overall project: is there (or could there be) a “repeat X for …” step? If so, do the steps depend on each other? For example, does the program need the results from step N to accomplish step N+1? Or are all those “repeat …” steps independent? Some mixture of those two?
- If there is a “repeat X for …” step, is your code currently organized such that those steps can be done independently? I.e., if they are parallelizable, do they each have independent access to input and output space? If serial, does each step produce a “thing” that could be stored and then used in the next step?
- Think about how to organize inputs, sections of code, and outputs in order to run your code on a parallel platform. Sketch this on paper.
- Does it make sense to parallelize some part of your code? Why or why not?
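To make the “repeat X for …” questions above concrete, here is a small Python sketch of a loop whose iterations are independent: each one reads only its own input and writes only its own output file, so the same function can run serially or through a worker pool. Everything in it (`process_item`, the file naming, the input list) is a hypothetical stand-in for your project. It uses `ThreadPoolExecutor`, which mainly helps steps that wait on disk or network; for CPU-bound Python work, `ProcessPoolExecutor` from the same module offers the same `map` interface.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from pathlib import Path


def process_item(n, out_dir):
    """Hypothetical independent step: uses only its own input, writes only its own file."""
    result = sum(i * i for i in range(n))
    out_path = Path(out_dir) / f"result_{n}.txt"
    out_path.write_text(str(result))
    return out_path


def run_serial(inputs, out_dir):
    """Run the steps one after another."""
    return [process_item(n, out_dir) for n in inputs]


def run_parallel(inputs, out_dir, workers=4):
    """Run the same steps through a pool; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(partial(process_item, out_dir=out_dir), inputs))


inputs = [1_000, 2_000, 3_000, 4_000]
with tempfile.TemporaryDirectory() as serial_dir, \
        tempfile.TemporaryDirectory() as parallel_dir:
    serial = [p.read_text() for p in run_serial(inputs, serial_dir)]
    parallel = [p.read_text() for p in run_parallel(inputs, parallel_dir)]

print("results match:", serial == parallel)
```

If step N+1 instead needed the output of step N, the stored file from step N would be its input and the steps would have to stay in order. In that case, checking that each step produces a storable “thing” is the first move rather than adding a pool.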
Optional Tasks
- If you answered yes to the last question, start parallelizing your code!
- Investigate some HPC tools. Are any of these relevant to your project?
  - TensorFlow, e.g. this quick start tutorial
  - Hadoop / MapReduce, e.g. this quick start tutorial
  - CUDA, for a GPU-based approach
- Explore HPC / cloud computing service providers:
  - Google Colab
  - AWS
  - Azure
  - Google Cloud
  - others?