class: title, smokescreen
background-image: url(https://media.defense.gov/2017/Jun/22/2001767169/-1/-1/0/170519-F-OR423-002.JPG)

# Organizing Data Science Projects

## CBIIT/FNL Workshop -- December 12, 2019

---
class: col-2

# About

## Justin M. Fear

- IRTA Fellow (NIDDK/NIH)
- <i class="fas fa-dna"></i> Genomics
- <i class="fas fa-project-diagram"></i> Gene Regulation
- <img src="/datascience_presentations/images/drosophila-wt.jpg" width="4%"/> Drosophila
- Contact:
  - <i class="fab fa-github"></i> [@jfear](https://github.com/jfear)
  - <i class="fas fa-envelope-open-text"></i> justin.fear@nih.gov

.qrcode.db.pa-3.w-90pct.ml-4[]

<http://geneticsunderground.com/talk>

---
class: title, smokescreen
background-image: url("/datascience_presentations/images/organization.jpg")

# Why Project Organization <!-- Part 1 -->

---
class: img-right-full, roomy

# Find things quickly

![](https://cdn.pixabay.com/photo/2017/02/13/16/10/scooby-doo-2063042_960_720.png)

- Find the code used to generate a result
- Tweak a plot
- Pick up where you left off

.myheader[\> Why Project Organization]

---
class: img-right, roomy

# Share code and results

![](https://i.imgflip.com/1ulazs.jpg)

- Send snippets to collaborator
- Show colleague what you did
- Track tangential analysis

.myheader[\> Why Project Organization]

---
class: img-right, roomy

# Recover from data disasters

![](https://live.staticflickr.com/4017/4529320513_ba3d22388f_z.jpg)

- Oops, we swapped sample names
- Forgot to give you these additional 10 samples
- The file we sent you was truncated
- I accidentally deleted your folder on the share drive

.myheader[\> Why Project Organization]

---
class: img-caption, roomy

![](https://miro.medium.com/max/1838/1*ACoKY0p7DSt4FGcZNjkP2Q.jpeg)

# Reproducible Research

.myheader[\> Why Project Organization]

---
class: title, smokescreen
background-image: url(https://summary.org/wp-content/uploads/2018/12/Messy02.jpeg)

# Don't Do This <!-- Part 2 -->

## How to make a mess

---
class: col-2

# Poor uses of file names

## Version Control

<pre><code class="Shell">
├── deg_lmm_v1.sh
├── deg_lmm_v2.sh
├── deg_lmm_final.sh
<mark class="highlight">├── deg_jmf_final_v2.sh</mark>
</code></pre>

- If you name something final, you will always have another version.

## Workflow

<pre><code class="Shell">
├── deg_step1_jmf_v1.sh
├── deg_step2_jmf_v1.sh
<mark class="highlight">├── deg_step2a_jmf_v1.sh</mark>
├── deg_step3_jmf_v1.sh
</code></pre>

- Adding or re-ordering steps is confusing at best.

.absolute.center.w-6-12th.pa-1.l-3-12th.b-1.ba.bw-3.br-4.bg-white-60pct[
Make file names descriptive and concise.
]

.myheader[\> Don't do this]

---
class: img-right-full, compact

# Poor uses of folders

![](/datascience_presentations/images/lots_o_files.png)

## One folder to rule them all

- Hard to browse ≥ 30 files
- Search requires you to know what you are looking for

## Too many folders

- Lots of folder levels are hard to browse too
- Easy to lose files

.w-10-12th.pa-1.ba.bw-3.br-4.bg-white-60pct.fs-90[
Make your own folder hierarchy and stick to it.
]

.myheader[\> Don't do this]

---
class: col-2, compact

# Poor uses of scripts

## Comment and Uncomment

```bash
# Run first
# wget ...
# export FILE="./this_file.txt"

# ... 300 more lines of code ...

# Run third
export FILE="./that_file.yaml"
do_more_stuff()
```

- Doesn't track what was done
- Generates different results if run in a different order

## Copy and Paste

- A script is meant to be run
- Don't copy and paste from a script

> Beginners often write lots of comments describing each step.
> They then copy and paste from the script onto the command line.

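
One way to avoid both habits is to pass the file and the step as command-line arguments, so the script always runs top to bottom and the invocation records what was done. This is a minimal sketch in Python: the option names `--input` and `--step` and the placeholder `run` function are hypothetical, not taken from the example project.

```python
import argparse


def build_parser():
    """Describe what varies between runs as arguments, not as commented-out lines."""
    parser = argparse.ArgumentParser(description="Run one analysis step on one file.")
    parser.add_argument("--input", required=True, help="file to process")
    parser.add_argument("--step", choices=["download", "analyze"], default="analyze")
    return parser


def run(args):
    # Placeholder for the real work (hypothetical); it only reports the request.
    return f"{args.step}: {args.input}"


def main(argv=None):
    # argv=None makes argparse read the real command line.
    args = build_parser().parse_args(argv)
    message = run(args)
    print(message)
    return message


# A real script would call main() under `if __name__ == "__main__":`.
# The arguments are passed explicitly here so the example runs anywhere.
result = main(["--input", "./that_file.yaml", "--step", "analyze"])
```

The same file now serves every step, and the command that was run can be saved in a workflow file or shell history instead of living in comments.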
.myheader[\> Don't do this]

---
class: title, smokescreen
background-image: url(http://www.freestufffinder.com/wp-content/uploads/2018/08/tina-marie-kondo-folding-method.jpg)

# How to get organized <!-- Part 3 -->

---
class: img-right-full, compact

# Master Your Weapons <!-- Part 3a: Tools -->

![](https://p2.piqsels.com/preview/213/564/1010/yoda-lego-legoland-star-wars.jpg)

- Version Control
- Workflow Tools
- Development Environment(s)

---
class: compact

# Version Control System (VCS)

## .center[A.K.A. track changes]

![](/datascience_presentations/images/git-branch.png# absolute t-4 r-0)

.absolute.t-3-12th.l-3[
### Popular Tools

- Git
- Mercurial
- VCS
- CVS

### Cloud Storage

- GitHub
- GitLab
- Bitbucket
]

.absolute.t-3-12th.l-4-12th[
### Data Is Different

- git-lfs
]

.myheader[\> Master Your Weapons]

---
class: img-caption

# Workflow Management

![](/datascience_presentations/images/rnaseq.png)

.absolute.t-1.l-1[Galaxy]
.absolute.t-1.r-1[Snakemake]
.absolute.b-2.l-1[Make]
.absolute.b-2.r-1[Airflow]

.myheader[\> Master Your Weapons]

---
class: compact

# Development Environment

![](/datascience_presentations/images/vscode.png# w-60pct absolute b-1 r-1)

- Syntax Highlighting
- Code Completion
- Refactoring Tools
- Debugging Tools
- Version control
- Containers/Environments
- Remote development over SSH

## Text Editors

- vim
- emacs
- nano

.myheader[\> Master Your Weapons]

---
class: compact

# Cloud Development

![](/datascience_presentations/images/gigantum_notebook.png# w-60pct absolute r-2 t-0)
![](/datascience_presentations/images/gigantum.png# w-30pct absolute l-5-12th b-0)

## Examples

- VScode Online
- Azure Notebooks
- Google Colab
- Datalore
- CoCalc
- Binder

## Gigantum

- Jupyter + RStudio in the cloud
- Container-based environment
- Automatic Version Control

.myheader[\> Master Your Weapons]

---
class: col-2, compact

# Project Organization

## General best practices

- Folder Structure
- Separate data from scripts
- Use workflow tools to orchestrate
- Split out configuration
- Modularize
- Use a defined style
- Use containers and environments
- Document everything

## Personal preferences

- Folder structure
- Folder names

[Example Project](https://github.com/jfear/example_project)

---
.absolute.l-7-12th.w-40pct[
# 1. Same folder structure and names across projects

## But, don't be afraid to tweak
]

![](/datascience_presentations/images/example_folder.png# absolute l-0 t-0 h-100pct)

.myheader-right[\> Project Organization]

---
class: compact

# 2. Separate data from code

## .red[Data is typically NOT stored in version control]

.fl.db.w-40pct[
```bash
├── data             # original and external
├── lcdb-references  # multi-project
├── output           # generated output
```

- Improves mobility
- Delineates what you generated
- Allows reuse of common data across projects

.bq-shrink[
> I work on multiple computers.
> I store data in a single location and mount the drive remotely.
> I can do more locally instead of messing with Biowulf.
]
]

.fr.db.w-55pct[
```bash
data                                  # original and external
├── external
│   ├── DroID_DPiM_2018-03-29.txt     # website
│   ├── Ferrari_et_al_2006.tsv        # paper
│   ├── Ferrari_et_al_2006.readme     # paper details
│   ├── FlyBase/                      # community
│   └── maria/                        # collaborator
├── rnaseq_samples                    # our data
│   ├── ...
│   └── w1118_LG_m_r4_B_C12.fastq.gz
└── singleCellSeqData                 # our data
    ├── ...
    └── SV_9_10X_Te/
```
]

.myheader[\> Project Organization]

---
.absolute.l-2.w-30pct[
# 3. Workflow Orchestration

```bash
./example1-wf
├── config
│   ├── config.yaml
│   └── sampletable.tsv
├── scripts/
└── Snakefile
```
]

![](/datascience_presentations/images/snakefile.png# absolute l-40pct t-0 h-100pct)

.myheader[\> Project Organization]

---
class: compact

# 4. Modularize code

```bash
lcdb-wf@56c948d   # submodules
src/              # project level package
├── my_project
│   ├── io.py
│   ├── plotting.py
│   └── stats.py
├── tests/
│   ├── test_io.py
│   └── test_stats.py
└── setup.py
```

![](/datascience_presentations/images/modularize.png# absolute l-50pct t-0 w-50pct)

.myheader[\> Project Organization]

---
class: compact

.absolute.l-2.w-40pct[
# 5. Style guides and linters

- Consistent style improves readability
- Google `my language` and style guide
- Linters catch syntax errors and point out style problems.
  - `pylint # python`
  - `lintr # R`
- Fix ugly code with software
  - `black # python`
  - `styler # R`
]

.fr.db.w-50pct[
## Fix ugly code the easy way

<pre><code class="R" style="font-size: .9rem">
for (i in seq(10)) { for (j in seq(100)) {
if (i == j) {print(TRUE)} else if (i %% j == 0) {
print("modulo") } else {print(FALSE)}}}
</code></pre>

<pre><code class="R" style="font-size: 1rem">
for (i in seq(10)) {
  for (j in seq(100)) {
    if (i == j) {
      print(TRUE)
    } else if (i %% j == 0) {
      print("modulo")
    } else {
      print(FALSE)
    }
  }
}
</code></pre>
]

.myheader[\> Project Organization]

---
class: compact

# 6. Split out configuration for consistency

.absolute.l-2.w-40pct[
```bash
./config             # Project config
├── common.yaml
├── gene_sets.yaml
└── colors.yaml

./example1-wf        # Workflow config
├── config
│   ├── config.yaml
│   └── sampletable.tsv
```
]

.fr.db.w-50pct[
## Project config

Contains info that is needed across the project.

- Project name and GitHub URL
- Assembly and Annotation
- alpha level

## Workflow config

Anything you may tweak in the future.

- Various thresholds
- Workflow specific references
- Various Mappings (e.g. file name to title)
]

.myheader[\> Project Organization]

---
class: fit-h1, compact

# 7. Containers and environments (portability and reproducibility)

## .red[One of the hardest problems in data science is managing software.]

.absolute.l-2.w-40pct[
```bash
./environment.yaml   # project env
./envs               # tool-specific conda envs
├── deseq2.yaml
├── scrublet.yaml
├── seurat2.yaml
└── seurat3.yaml
```
]

.fr.db.w-50pct[
## Containers (Docker, Singularity)

- Completely reproducible system
- OS libraries and software (the host kernel is shared)

## Environments (Conda, pipenv)

- Install and manage software versions
- Different versions of software can be installed in different environments
]

.myheader[\> Project Organization]

---
class: title, smokescreen
background-image: url(https://www.mayerdan.com/assets/img/ring-binders.jpg)

# 8., 9., 10. Documentation <!-- Part 4 -->

## What is not documented, stays not documented

---

# What to document (Everything!)

.ul-space[
- How the data was generated
- Record all "experiments"
.red[
  - failed attempts
  - comparing different methods
]
- Record the reasoning for any decision points
- Clearly describe how to get final results
]

**6 months from now, your future-self will thank you!**

.myheader[\> Project Organization > Documentation]

---

# Where to document (Everywhere!)

- Sample/Resource Table
- README
- Top of scripts
- Function/Class Docstrings
- Code comments (but not too many)
- Literate Programming (i.e. notebooks)
- Project Blog

.myheader[\> Project Organization > Documentation]

---
class: compact

# Sample Table

```bash
./example1-wf
├── config
│   ├── config.yaml
│   └── sampletable.tsv
```

| samplename | orig_filename         | group | wellID | row | col | num_parts | sex | testis | ovary | fatbody | ercc |
|------------|-----------------------|-------|--------|-----|-----|-----------|-----|--------|-------|---------|------|
| A1_OCP     | ....A1_OCP_1.fastq.gz | OCP   | A1     | A   | 1   | 25        | f   | 0      | 1     | 0       | A    |
| A6_TCP     | ....A6_TCP_2.fastq.gz | TCP   | A6     | A   | 6   | 15        | m   | 1      | 0     | 0       | B    |

Add as much information about your samples as you can.

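
A well-filled sample table also lets scripts select samples by metadata instead of by file name. The sketch below assumes the table is tab-separated and uses the two rows shown above, keeping the truncated `....` file names as they appear. `load_sampletable` and `REQUIRED_COLUMNS` are hypothetical names chosen for illustration, not part of the example project.

```python
import csv
import io

# Two rows copied from the table above; the "...." prefixes are truncated
# in the original and are kept as-is.
SAMPLETABLE_TSV = (
    "samplename\torig_filename\tgroup\twellID\trow\tcol\tnum_parts\tsex\ttestis\tovary\tfatbody\tercc\n"
    "A1_OCP\t....A1_OCP_1.fastq.gz\tOCP\tA1\tA\t1\t25\tf\t0\t1\t0\tA\n"
    "A6_TCP\t....A6_TCP_2.fastq.gz\tTCP\tA6\tA\t6\t15\tm\t1\t0\t0\tB\n"
)

# Hypothetical minimum set of columns a workflow might insist on.
REQUIRED_COLUMNS = {"samplename", "orig_filename", "group", "sex"}


def load_sampletable(handle):
    """Parse a tab-separated sample table into a dict keyed by samplename."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"sample table is missing columns: {sorted(missing)}")
    samples = {}
    for row in reader:
        name = row["samplename"]
        if name in samples:
            raise ValueError(f"duplicate samplename: {name}")
        samples[name] = row
    return samples


samples = load_sampletable(io.StringIO(SAMPLETABLE_TSV))
# Select samples by metadata rather than by parsing file names.
males = [name for name, row in samples.items() if row["sex"] == "m"]
print(males)
```

In a real workflow the handle would come from opening `config/sampletable.tsv` rather than an in-memory string.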
.myheader[\> Project Organization > Documentation > Where]

---
class: compact

# Top of Scripts

.w-50pct[
- Describe what the script does
- Any major decisions that you made
- Anything to help you remember
]

![](/datascience_presentations/images/script_doc.png# absolute l-6-12th t-0 h-100pct)

.myheader[\> Project Organization > Documentation > Where]

---
class: compact

# Functions and Classes

.fl.db.w-50pct[
- Any function you will call from another script.
- Add type hints if it is confusing what goes in.
- Add examples to clearly show what the function does.
]

![](/datascience_presentations/images/modularize.png# absolute l-6-12th t-0 w-50pct)

.myheader[\> Project Organization > Documentation > Where]

---
class: compact

# Literate Programming

.fl.db.w-50pct[
```bash
./notebook
├── 2019-08-01_bulk_deg.Rmd
└── 2019-08-10_bulk_ma.ipynb

./docs
├── cell_number_counts.ipynb
└── permutation_summary.ipynb
```

- Jupyter Notebooks
- R Notebooks and Rmarkdown
]

![](/datascience_presentations/images/jupyter.png# absolute l-6-12th t-0 w-50pct)

.myheader[\> Project Organization > Documentation > Where]

.footer[https://github.com/markusschanta/awesome-jupyter]

---
class: compact

# Dedicated Project Blog

- Aggregate notebooks
  - `bookdown # R`
  - `jupyter webbook # python`
- Static site generators
  - Pelican
  - Nikola
  - jekyll
  - **hugo**

![](/datascience_presentations/images/blog.png# absolute l-6-12th t-0 w-50pct)

.myheader[\> Project Organization > Documentation > Where]

---

# 10 Best Practices <!-- Summary -->

<ol>
<li>Use the same structure and names across projects</li>
<li>Separate original data, generated data, and scripts</li>
<li>Use workflows to orchestrate</li>
<li>Modularize reusable code</li>
<li>Use a style guide and linters</li>
<li>Split out configuration for consistency</li>
<li>Use containers and environments</li>
<li style="font-size: 1em;">Document as you go</li>
<li style="font-size: 1.2em;">Document as you go</li>
<li style="font-size: 1.6em; font-weight: bold">Document as you go!</li>
</ol>

---

# Links and Examples

.fl.w-50pct[
## Mine

- [Example Project](https://github.com/jfear/example_project)
- [scRNASeq Project](https://github.com/jfear/larval_gonad)
- [Large Remapping Project](https://github.com/jfear/ncbi_remap)
- [PacBio Project](https://github.com/jfear/dmel_pacbio)
]

.fr.w-50pct[
## Others

- [Cookiecutter Example](https://drivendata.github.io/cookiecutter-data-science/)
- [Summary of Noble Paper](https://davetang.org/muse/2018/02/09/organising-computational-biology-projects-cookiecutter/)
- [Updated concepts from Noble paper](https://medium.com/outlier-bio-blog/a-quick-guide-to-organizing-data-science-projects-updated-for-2016-4cbb1e6dac71)
- [Short Blog Post](https://towardsdatascience.com/manage-your-data-science-project-structure-in-early-stage-95f91d4d0600)
- [Short Blog Post](https://www.thinkingondata.com/how-to-organize-data-science-projects/)
]