Cvapipeline_analysis - Can I download a smaller dataset from Quilt to test the integrated modeling and analysis pipeline?

Hello,

I would like to follow the steps in cvapipeline_analysis to better understand the integrated cell modeling and analysis workflow. However, cvapipe_analysis loaddata run downloads 7.3 TB of data. I don’t have that much space, and I don’t think I need all of the raw and segmented data. All I want to do is 1) see the format of the input data (single cell segmented stacks), 2) try running the whole modeling and analysis pipeline, and 3) figure out how to feed my own data into the pipeline. So a smaller set of sample data should be enough.

Can I download a smaller dataset, such as only the segmented images for 1 cell line (as in quilt-data-access-tutorials, tutorial 2)? Would the following step, cvapipe_analysis shapemode run work properly as is, or would I need to modify the code somehow?

Thanks,
Mary

hello @MaryMirvis,

You can certainly download the single cell data for only one cell line. To do this, follow tutorial 2 to see how to select one cell line, but download only the single cell data, “crop_raw” and “crop_seg” (see Example 2 in tutorial 1). Feel free to let me know if you have trouble with this. I can certainly make a new tutorial for this purpose.
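As a rough sketch of that selection, assuming the dataset manifest has been saved locally as a CSV with columns named `structure_name`, `crop_raw`, and `crop_seg` (the column names, the helper function, and the example rows are assumptions for illustration, so check them against the tutorials), the rows for one cell line can be filtered before anything is downloaded:

```python
import csv
import io


def select_single_cell_paths(manifest_file, structure, columns=("crop_raw", "crop_seg")):
    """Return the image paths listed in `columns` for every cell of one structure."""
    paths = []
    for row in csv.DictReader(manifest_file):
        if row["structure_name"] == structure:
            paths.extend(row[col] for col in columns)
    return paths


# Tiny made-up manifest standing in for the real metadata table.
example_manifest = io.StringIO(
    "CellId,structure_name,crop_raw,crop_seg\n"
    "1,TUBA1B,crop_raw/1.tiff,crop_seg/1.tiff\n"
    "2,LMNB1,crop_raw/2.tiff,crop_seg/2.tiff\n"
    "3,TUBA1B,crop_raw/3.tiff,crop_seg/3.tiff\n"
)

selected = select_single_cell_paths(example_manifest, "TUBA1B")
print(selected)
```

The returned list contains only the single cell image paths for the chosen cell line. The actual fetch from Quilt is left out here and would follow the tutorials.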

As for how you can use cvapipe_analysis on a subset, I will ping my colleague to comment on that.

Thanks,
Jianxu

Hi @MaryMirvis, I am in a similar situation to you where I want to apply some steps of the cell variance pipeline on my own dataset. It looks like they recently added the ability to download a smaller test dataset when running the loaddata step. I haven’t tried it myself because I’ve been testing steps using my own locally stored data, but I think if you update your local repository and run cvapipe_analysis loaddata run --test=True then you could download a smaller test dataset from Quilt to try out the subsequent steps. Looks like this would download a subset of 12 interphase cells per structure, which is a useful feature for those of us wanting to try things out ourselves without downloading tons of data!

Anyone who knows more about this can feel free to correct me. I just happened to notice this while looking over some recent changes and figured I would mention it in case anyone still wanted this info. :slight_smile:

Hi @lynn. Thank you for following up on this and watching the development of cvapipe_analysis so closely. Yes, as you noticed, we are refactoring all the code used in the paper and adding new functionality to it. One addition is exactly the ability to download a small subset of data for testing: 300 cells in total (12 per structure). We are also improving the documentation throughout the code, so please stay tuned. Any feedback is very welcome.

Hi @vianamp, I very much appreciate the additional features and documentation; this is cool stuff. If you would like feedback, I have a few suggestions that might improve things from my perspective as an external user:

  • Glad to see there are plans to add a config file - this would be great for things like microscope resolution in xy, which seems to be hard-coded in a few places (edit: I think I misremembered this, it might actually have been in cvapipe that I saw this). Would also be useful if the config file could define where the local_staging directory is created, which defaults to the current working directory in my hands. datastep seems to suggest that this could be provided in a config file but I haven’t yet figured out how to do this.

  • Currently, the loaddata step seems geared toward downloading Quilt data (which makes sense for the purposes of the Institute). I was able to work around this, but it would be useful if the data loading step were more agnostic to where the data are coming from. Not sure what the best solution is, but it might be good enough to have the option to use local data and indicate where the data are stored on the command line (hm, maybe I will try to implement this on my local repo).

  • Right now, some steps are dependent on having membrane, DNA, and structures labeled - but some users may want only to analyze shapemodes on images with only the membrane and/or DNA channel, for instance. I have a branch that I’ve been tweaking to work on “membrane-only” images for the time being, but it’s more of a workaround than a robust solution. I think it would be more ideal if the user could specify this at the beginning and have subsequent steps adapt. This might be more challenging to implement, though, and I’d understand if it’s not a priority at this time.

Not sure if this is the kind of feedback you all are looking for, feel free to ignore if it’s not, haha. Again, very cool stuff, and I’m thankful to have access to the code as it’s being developed, it’s already been helpful for my work. I don’t know if there’s much I can do to help given my skill level, but I will definitely be keeping an eye on this project in the near future.

Hi @lynn. Thank you for your message. About your first point: you can change the default staging directory by creating a file named workflow_config.json inside the root of cvapipe_analysis. The content of this file should look like:

{
"project_local_staging_dir": "my/new/staging/folder"
}
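If it is more convenient to create that file from a script, a minimal sketch could look like the following. The helper name `write_workflow_config` is hypothetical, and the example writes into a temporary directory; in practice the first argument would be the root of your cvapipe_analysis checkout.

```python
import json
import tempfile
from pathlib import Path


def write_workflow_config(repo_root, staging_dir):
    """Write workflow_config.json into repo_root and return its path."""
    config_path = Path(repo_root) / "workflow_config.json"
    config = {"project_local_staging_dir": str(staging_dir)}
    config_path.write_text(json.dumps(config, indent=4) + "\n")
    return config_path


# Demonstrated in a temporary directory so nothing in a real checkout is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = write_workflow_config(tmp, "my/new/staging/folder")
    loaded = json.loads(path.read_text())
    print(loaded)
```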

We are also working on creating a global config file that will make it easier to set this and other parameters used throughout the code. Your 2nd point is one of the next items on my list.

Your 3rd point was not on the roadmap, but I will bring it to the others and see what they think. As a hack, you can always duplicate one of the segmentations and pretend you always have cell and nuclear segmentations available.
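A minimal sketch of that duplication hack, assuming the segmentation is held as a NumPy array and that the result should carry one nuclear and one cell channel. The function name and the channel layout are hypothetical and not taken from cvapipe_analysis:

```python
import numpy as np


def fake_missing_segmentation(mem_seg):
    """Stack a membrane mask twice so it fills both the nuclear and cell slots."""
    mem_seg = np.asarray(mem_seg)
    # Channel 0 plays the role of the nuclear mask, channel 1 the cell mask.
    return np.stack([mem_seg.copy(), mem_seg], axis=0)


# Toy 3D mask (z, y, x) with a single labelled block in the middle.
mem_seg = np.zeros((4, 8, 8), dtype=np.uint8)
mem_seg[1:3, 2:6, 2:6] = 1

seg = fake_missing_segmentation(mem_seg)
print(seg.shape)
```

Any nuclear measurement computed from the copied channel simply repeats the cell shape, so it should be ignored downstream.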