Reproducible Research Workflow in MATLAB

Chad Gilbert on 14 Sep 2018
Commented: Chad Gilbert on 15 Apr 2019
I'm intrigued by this article about Reproducible Research with R by Lincoln Mullen and would like to try the same idea out, but in MATLAB.
In the workflow for my current project, I am performing several long-running queries on a database and storing the results in .mat files locally (call this process A). Then I do some post-processing, which also takes a while, and then cache those results locally again (call this B). Finally, I produce some plots and a custom output file that will feed into some other analysis tools (call this C).
Currently, if I change process C, I reload the results from B and then redo C, and similarly for B and A. This saves me time, but as my code becomes more complex, I suspect I'm at risk of changing some earlier code in A or B and then forgetting to re-run it, perhaps re-running only C.
If I could define the results files as targets in a way similar to GNU Make, then I'd be able to ensure that I don't forget to re-do any steps needed, without ever needlessly waiting for these long-running processes to complete. If someone else wants to reproduce my result, they should not have to learn how to run the several commands in sequence - they should just have to run make or something equivalent.
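A minimal sketch of the Make-style check described above, assuming each target and each dependency is a single file on disk: a target needs rebuilding when it is missing or older than any of its dependencies. The helper names `mtime` and `isStale` and the throwaway files are hypothetical, made up for illustration; this is not a full build tool.

```matlab
% Modification time of an existing file as a serial date number.
mtime = @(f) getfield(dir(f), 'datenum');

% A target is stale if it does not exist yet, or if any dependency
% was modified after the target was written.
isStale = @(target, deps) exist(target, 'file') ~= 2 || ...
    any(cellfun(mtime, deps) > mtime(target));

% Demonstration with throwaway files standing in for a source and its cached result.
src = [tempname '.txt'];
out = [tempname '.mat'];
fclose(fopen(src, 'w'));                  % create an empty "source" file

needsBuildBefore = isStale(out, {src});   % target has not been built yet

x = 1;
save(out, 'x');                           % "build" the target after the source
needsBuildAfter = isStale(out, {src});    % target is now at least as new as src

delete(src);
delete(out);
```

In the A/B/C workflow, the same check would guard each step, for example re-running B only when the cached A results or B's own code are newer than the cached B results.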
Driving MATLAB with Make itself doesn't seem like a solution because of MATLAB's long start-up time. I'm also on a Windows machine for this work, so it would be inconvenient to set that up. So I believe I should be running something in MATLAB to do this. Does anybody know of a MATLAB library or similar tool that can support that kind of workflow?

Answers (1)

Jonathan A on 11 Apr 2019
I was looking for similar tools in MATLAB a few years ago. However, the ones I found each focused on only one part of the solution needed for the kind of behavior you are describing. Therefore, I tried to implement a class which performs automatic persistent memoization by defining a directed acyclic graph (DAG) whose nodes are MATLAB functions. If (a) the node's main function and all the sub-functions called during the node's execution and (b) the input variables of the node remain unchanged, then the results are retrieved from disk and not re-computed. This holds true even across different MATLAB processes, as it is based on the file system.
For (a), the major time consumer is the code analysis that finds the sub-functions involved in the main function's execution. Therefore, I also persist this "dependency" information, keyed on the last-modified date of the function files.
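As a rough illustration of (a), and not necessarily how this class does it, MATLAB's built-in `matlab.codetools.requiredFilesAndProducts` can list the program files a function depends on; the newest modification date among them can then serve as a stamp for the node's code. The functions `stepB` and `helperB` are hypothetical stand-ins, written to a temporary folder so the sketch is self-contained.

```matlab
% Create two throwaway function files: stepB calls helperB.
workDir = tempname;
mkdir(workDir);
fid = fopen(fullfile(workDir, 'stepB.m'), 'w');
fprintf(fid, 'function y = stepB(x)\ny = helperB(x) + 1;\nend\n');
fclose(fid);
fid = fopen(fullfile(workDir, 'helperB.m'), 'w');
fprintf(fid, 'function y = helperB(x)\ny = 2 * x;\nend\n');
fclose(fid);
addpath(workDir);

% Static analysis: which program files does stepB need in order to run?
fList = matlab.codetools.requiredFilesAndProducts(fullfile(workDir, 'stepB.m'));

% Newest modification time across the node's code. If this value is later
% than the time the cached result was written, the node must be re-run.
codeStamp = max(cellfun(@(f) getfield(dir(f), 'datenum'), fList));

% Clean up the throwaway files.
rmpath(workDir);
rmdir(workDir, 's');
```

Because the analysis itself is slow, the resulting file list is worth storing and recomputing only when one of the listed files changes.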
For (b), either the variable content is hashed or the modification date of the variable's MAT-file is used to check that the variables did not change.
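For the hashing variant of (b), one possible sketch, again not necessarily what the class does: hash the raw bytes of a numeric input with Java's `MessageDigest` (this assumes MATLAB is running with its JVM) and use the digest as the name of the cache file. The variable names, the `cache_` file prefix, and the stand-in computation are made up for illustration.

```matlab
data = magic(4);                           % stand-in for a real query result

% SHA-256 over the raw bytes of the array. This covers numeric content only;
% a general tool would also need to account for class, size, and nested types.
md = java.security.MessageDigest.getInstance('SHA-256');
md.update(typecast(data(:), 'uint8'));
key = sprintf('%02x', typecast(md.digest(), 'uint8'));

% The hash names the cache file, so unchanged input maps to the same file.
cacheFile = fullfile(tempdir, ['cache_' key '.mat']);
if exist(cacheFile, 'file') == 2
    s = load(cacheFile, 'result');         % cache hit: reuse the stored result
    result = s.result;
else
    result = sum(data(:));                 % stand-in for slow post-processing
    save(cacheFile, 'result');             % cache miss: compute and store
end
```

Running the block a second time with the same `data` takes the cache-hit branch, and changing any element of `data` produces a different key and therefore a fresh computation.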
Moreover, I wanted to declare the DAG purely in MATLAB rather than drive it with Make, as you wrote, and I also wanted to visualize the execution of the DAG. Therefore, I implemented the following class:
Not sure if it is the kind of tool you are looking for, but if in the meantime you have found other solutions or other relevant discussions, I would be curious to hear about them :-) Thanks!
  1 Comment
Chad Gilbert on 15 Apr 2019
This looks fantastic, thanks for making it and for replying to my question.
I am not using MATLAB much in my current projects at work, but I will be back to it some time, I'm sure. I'll make a note to let you know how it's going on your File Exchange page and/or your project's GitHub.
I appreciate one of the links you made from your GitHub as well: Sandve, Geir Kjetil, et al. I hadn't seen that yet. I think the list will be handy.
