This lesson is still being designed and assembled (Pre-Alpha version)

Running the example in Snakemake

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • First learning objective. (FIXME)

Snakemake

GNU make + Python = Snakemake

GNU make: software to manage projects with several intermediate steps. “You only build what needs to be built” Instructions are written in a makefile, which looks like this:

all proteins.faa
proteins.faa: transcripts.fna
  getorf -sequence transcripts.fna -outseq proteins.faa

Why Snakemake instead of make?

Assembling and running the example Workflow in Snakemake

Step-by-Step

Download transcripts

TRANSCRIPTS_URL = "http://ftp.ensembl.org/pub/release-105/fasta/naja_naja/cdna/Naja_naja.Nana_v5.cdna.all.fa.gz"

rule all:
	input:
		"output/transcripts.fa.gz",


rule fetch_transcriptome:
	output:
		"output/transcripts.fa.gz"
  params:
    url = TRANSCRIPTS_URL
	shell:
		"""
		mkdir -p output/
		wget {TRANSCRIPTS_URL} -O output/transcripts.fa.gz
		"""

Filter by length

ENVIRONMENTS = "../envs"

Point at environments.

rule all:
	input:
		"output/transcripts.fa.gz",
		"output/transcripts.length_filtered.fa.gz",

Add filtered to rule all.

rule filter_by_length:
	input:
		transcripts = rules.fetch_transcriptome.output.transcripts
	output:
		filtered = rules.fetch_transcriptome.output.transcripts.replace(".fa.gz", ".length_filtered.fa.gz")
	params:
		minlen = 300
	conda:
		f"{ENVIRONMENTS}/bbmap.yml"
	shell:
		"""
		bbduk.sh in={input.transcripts} out={output.filtered} minlen={params.minlen}
		"""

Splitting transcriptome (“scatter”)

Add the rule to split the transcriptome.

rule split_transcriptome:
	input:
		rules.filter_by_length.output.filtered
	output:
		txome_chunks = expand(os.path.join("output", "txome_chunks", "chunk-{chunk}.fa"), chunk=range(10))
	params:
		chunksize = 2935
	shell:
		"""
		mkdir -p output/txome_chunks/
		gzip -dc {input[0]} | awk 'BEGIN {{n=0;m=0;}} /^>/ {{ if (n%{params.chunksize}==0) {{f=sprintf(\"output/txome_chunks/chunk-%d.fa\",m); m++;}}; n++; }} {{ print >> f }}'
		"""

Check the steps

This is currently the structure of your Snakemake file.

TRANSCRIPTS_URL = ...
ENVIRONMENTS = ...

rule all:
  input:
    ...

rule fetch_transcriptome:
  ...

rule filter_by_length:
  ...

rule split_transcriptome:
  ...

Find and translate ORFs

rule find_and_translate_orfs:
	input:
		chunk = os.path.join("output", "txome_chunks", "chunk-{chunk}.fa"),
	output:
		orfs = os.path.join("output", "orf_chunks", "chunk-{chunk}.orfs.fa"),
	params:
		minlen = 300
	conda:
		f"{ENVIRONMENTS}/emboss.yml"
	shell:
		"""
		mkdir -p output/orf_chunks/
		getorf -sequence {input.chunk} -outseq {output.orfs} -minsize {params.minlen}
		"""

Extract toxins

rule extract_toxins:
	input:
		toxin_proteins_blast = rules.combine_chunks_and_filter.output.toxin_proteins_blast,
		proteins = rules.combine_proteins.output.all_proteins
	output:
		toxin_proteins = os.path.join("output", "toxins.faa")
	conda:
		f"{ENVIRONMENTS}/seqtk.yml"
	shell:
		"""
		seqtk subseq {input.proteins} <(cut -f 1 {input.toxin_proteins_blast} | sort -u) > {output.toxin_proteins}
		"""

Final result

# TRANSCRIPTS_URL = "http://ftp.ensembl.org/pub/release-105/fasta/naja_naja/cdna/Naja_naja.Nana_v5.cdna.all.fa.gz"

# ENVIRONMENTS = "../envs"

# report: "workflow_report.rst"

rule all:
	input:
		transcripts = "output/transcripts.fa.gz",
		txome_chunks = expand(os.path.join("output", "txome_chunks", "chunk-{chunk}.fa"), chunk=range(10)),
		orf_chunks = expand(os.path.join("output", "orf_chunks", "chunk-{chunk}.orfs.fa"), chunk=range(10)),
		toxin_chunks = expand(os.path.join("output", "toxin_chunks", "chunk-{chunk}.toxins.tsv"), chunk=range(10)),
		all_proteins = os.path.join("output", "all_proteins.faa"),
		blast_output = os.path.join("output", "toxin_blast.filtered.tsv"),
		toxins = os.path.join("output", "toxins.faa")

	shell:
		"""
		rm -rvf {input.transcripts} $(dirname {input.txome_chunks}) $(dirname {input.orf_chunks}) $(dirname {input.toxin_chunks}) {input.all_proteins} {input.blast_output}
		"""


rule fetch_transcriptome:
	output:
		transcripts = "output/transcripts.fa.gz"

	params:
		url = config["TRANSCRIPTS_URL"]

	shell:
		"""
		mkdir -p output/
		wget {params.url} -O {output.transcripts}
		"""


rule filter_by_length:
	input:
		transcripts = rules.fetch_transcriptome.output.transcripts
	output:
		filtered = rules.fetch_transcriptome.output.transcripts.replace(".fa.gz", ".length_filtered.fa.gz")
	params:
		minlen = config["transcript_criteria"]["minimum_transcript_length"]
	conda:
		f"{config['ENVIRONMENTS']}/bbmap.yml"
	shell:
		"""
		bbduk.sh in={input.transcripts} out={output.filtered} minlen={params.minlen}
		"""


rule split_transcriptome:
	input:
		rules.fetch_transcriptome.output.transcripts
	output:
		txome_chunks = expand(os.path.join("output", "txome_chunks", "chunk-{chunk}.fa"), chunk=range(10))
	params:
		chunksize = config["transcript_criteria"]["chunksize"]
	shell:
		"""
		mkdir -p output/txome_chunks/
		gzip -dc {input[0]} | awk 'BEGIN {{n=0;m=0;}} /^>/ {{ if (n%{params.chunksize}==0) {{f=sprintf(\"output/txome_chunks/chunk-%d.fa\",m); m++;}}; n++; }} {{ print >> f }}'
		"""


rule find_and_translate_orfs:
	input:
		chunk = os.path.join("output", "txome_chunks", "chunk-{chunk}.fa"),
	output:
		orfs = os.path.join("output", "orf_chunks", "chunk-{chunk}.orfs.fa"),
	params:
		minlen = 1000
	conda:
		f"{config['ENVIRONMENTS']}/emboss.yml"
	shell:
		"""
		mkdir -p output/orf_chunks/
		getorf -sequence {input.chunk} -outseq {output.orfs} -minsize {params.minlen}
		"""


rule find_toxins:
	input:
		proteins = os.path.join("output", "orf_chunks", "chunk-{chunk}.orfs.fa"),
		db = config["VENOM_DATABASE"]
	output:
		toxin_hits = os.path.join("output", "toxin_chunks", "chunk-{chunk}.toxins.tsv")
	conda:
		f"{config['ENVIRONMENTS']}/blast.yml"
	threads:
		2
	shell:
		"""
		mkdir -p output/toxin_chunks/
		blastp -db {input.db} -query {input.proteins} -num_threads {threads} -out {output.toxin_hits} -outfmt '6 qseqid sseqid pident qstart qend sstart send qlen slen length nident mismatch positive gapopen gaps evalue bitscore'
		"""


rule combine_chunks_and_filter:
	input:
		expand(os.path.join("output", "toxin_chunks", "chunk-{chunk}.toxins.tsv"), chunk=range(10))
	output:
		toxin_proteins_blast = os.path.join("output", "toxin_blast.filtered.tsv")
	params:
		eval_threshold = config['transcript_criteria']['eval_threshold']
	shell:
		"""
		awk '$16 < {params.eval_threshold}' {input} > {output.toxin_proteins_blast}
		"""

rule combine_proteins:
	input:
		expand(os.path.join("output", "orf_chunks", "chunk-{chunk}.orfs.fa"), chunk=range(10))
	output:
		all_proteins = os.path.join("output", "all_proteins.faa")
	shell:
		"""
		cat {input} > {output.all_proteins}
		"""


rule extract_toxins:
	input:
		toxin_proteins_blast = rules.combine_chunks_and_filter.output.toxin_proteins_blast,
		proteins = rules.combine_proteins.output.all_proteins
	output:
		toxin_proteins = os.path.join("output", "toxins.faa")
	conda:
		f"{config['ENVIRONMENTS']}/seqtk.yml"
	shell:
		"""
		seqtk subseq {input.proteins} <(cut -f 1 {input.toxin_proteins_blast} | sort -u) > {output.toxin_proteins}
		"""
	

Key Points

  • First key point. Brief Answer to questions. (FIXME)