Technical Details

Technical details regarding internals of the off-the-shelf LENS workflow are described below.

Off-the-shelf defaults

Default references

Workflow

Reference type

Reference

DNA alignment

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

DNA alignment post-processing

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

DNA alignment post-processing

BED

hg38_exome.bed

DNA alignment post-processing

Known sites VCF

Homo_sapiens_assembly38.dbsnp138.vcf.gz

RNA alignment

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

RNA alignment

GTF

gencode.v37.annotation.with.hervs.gtf

Transcript quantification

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

Transcript quantification

GTF

gencode.v37.annotation.with.hervs.gtf

Somatic variant calling

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

Somatic variant calling

BED

hg38_exome.bed

Somatic variant calling

Panel of normals VCF

1000g_pon.hg38.vcf.gz

Somatic variant calling

Allele frequencies VCF

af-only-gnomad.hg38.vcf.gz

Somatic variant calling

Known sites VCF

small_exac_common_3.hg38.vcf.gz

Variant annotation

snpEff annotation file

GRCh38.GENCODEv37

Germline variant calling

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

Germline variant calling

BED

hg38_exome.bed

Variant phasing

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

Variant phasing

GTF

gencode.v37.annotation.with.hervs.gtf

Splice variant calling

Tool-specific reference

snaf-data

Virus detection

Virus-specific (no Homo sapiens homology) sequences

virus_masked_hg38.fa

Virus detection

Virus-specific sequences

virus.cds.2024f2.fa

Fusion detection

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

Fusion detection

Tool-specific reference

GRCh38_gencode_v37_CTAT_lib_Mar012021.plug-n-play

Tumor purity detection

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

Tumor purity detection

BED

hg38_exome.bed

Copy number variant detection

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

Copy number variant detection

BED

hg38_exome.bed

CTA pMHC generation

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

CTA pMHC generation

GTF

gencode.v37.annotation.with.hervs.gtf

CTA pMHC generation

CTA gene list

cta_and_self_antigen.homo_sapiens.gene_list

ERV pMHC generation

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

ERV pMHC generation

ERV annotations

Hsap38.txt

SNV pMHC generation

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

SNV pMHC generation

GTF

gencode.v37.annotation.with.hervs.gtf

SNV pMHC generation

Canonical protein reference

gencode.v37.pc_translations.fa

InDel pMHC generation

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

InDel pMHC generation

GTF

gencode.v37.annotation.with.hervs.gtf

InDel pMHC generation

Canonical protein reference

gencode.v37.pc_translations.fa

Fusion pMHC generation

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

Fusion pMHC generation

GTF

gencode.v37.annotation.with.hervs.gtf

pMHC characterization

Tool-specific reference

mhcflurry

CTA annotation

CTA metadata

canonical_txs.mtec.norm.subcell.annot.tsv

ERV annotation

ERV metadata

erv_scores.25SEP2023.tsv

Sample swap detection

Genomic reference

Homo_sapiens.assembly38.no_ebv.fa

Sample swap detection

Known sites VCF

somalier.sites.hg38.vcf.gz

Default tools

Workflow

Tool

Tool version

DNA alignment

fastp

v0.24.0

DNA alignment

bwa-mem2

v2.2.1

DNA alignment post-processing

samblaster

v0.1.26

DNA alignment

fastp

v0.24.0

RNA alignment

star

v2.7.3a

Transcript quantification

salmon

v1.10.3

Somatic variant calling

mutect2

v4.6.1.0

Somatic variant calling

varscan2

v2.4.6

Somatic variant calling

strelka2

v2.2.9

Somatic variant filtering

bcftools

v1.21

Variant annotation

snpeff

v4.3k

Somatic SNV/InDel filtering

snpsift

v4.3k

Variant unionizing

jacquard

v1.1.4

Germline variant calling

deepvariant

v1.8.0

Somatic and germline merging

jacquard

v1.1.4

Variant phasing

whatshap

v2.4

HLA Typing

seq2hla

v2.2

Splice variant calling

snaf

v0.7.0

Fusion detection

starfusion

v1.14.0 (v1.8.1b for mouse)

Tumor purity detection

sequenza

v3.0.0

Copy number variant detection

cnvkit

v0.9.12

Copy number variant detection

cnvkit

v0.9.12

CTA pMHC generation

lenstools

v1.8

ERV pMHC generation

lenstools

v1.8

SNV pMHC generation

lenstools

v1.8

InDel pMHC generation

lenstools

v1.8

Viral pMHC generation

lenstools

v1.8

Splice pMHC generation

lenstools

v1.8

Fusion pMHC generation

lenstools

v1.8

pMHC characterization

mhcflurry

v2.1.1

Sample swap detection

somalier

v0.2.19

Default parameters

Workflow

Tool

Parameters

DNA alignment

fastp

--in1 ${fq1} --out1 ${dataset}-${pat_name}-${run}_1.trimmed.fq.gz --in2 ${fq2} --out2 ${dataset}-${pat_name}-${run}_2.trimmed.fq.gz  --failed_out ${dataset}-${pat_name}-${run}.fastp_fails.fq.gz --thread ${task.cpus}

DNA alignment

bwa-mem2

-R "@RG\\tID:${dataset}-${pat_name}-${run}\\tSM:${dataset}-${pat_name}-${run}\\tLB:NULL\\tPL:Illumina" ${fa} ${fq1} ${fq2} -t ${task.cpus - 1}

DNA alignment post-processing

samblaster

N/A

RNA alignment

fastp

--in1 ${fq1} --out1 ${dataset}-${pat_name}-${run}_1.trimmed.fq.gz --in2 ${fq2} --out2 ${dataset}-${pat_name}-${run}_2.trimmed.fq.gz --failed_out ${dataset}-${pat_name}-${run}.fastp_fails.fq.gz --thread ${task.cpus}

RNA alignment

star

-quantMode TranscriptomeSAM --outSAMtype BAM SortedByCoordinate --twopassMode Basic --outSAMunmapped Within

Transcript quantification

salmon

quant --threads ${task.cpus} -t ${fa} -l a -a ${aln} -o .

Somatic variant calling

mutect2

Placeholder

Somatic variant calling

strelka2

Placeholder

Somatic variant calling

varscan2

Placeholder

Somatic variant filtering (mutect2)

bcftools

FILTER="PASS"

Somatic variant filtering (strelka2)

bcftools

FILTER="PASS"

Somatic variant filtering (varscan2)

bcftools

FILTER="PASS"

Variant annotation

snpeff

N/A

Somatic SNV filtering

snpsift

ANN[*].EFFECT has 'missense_variant'

Somatic InDel filtering

snpsift

(ANN[*].EFFECT has 'conservative_inframe_insertion') || (ANN[*].EFFECT has 'conservative_inframe_deletion') || (ANN[*].EFFECT has 'disruptive_inframe_insertion') || (ANN[*].EFFECT has 'disruptive_inframe_deletion') || (ANN[*].EFFECT has 'frameshift_variant')

Variant unionizing

jacquard

--include_format_tags=\"GT,AF,AU,CU,GU,TU,TAR,TIR,FREQ,VAF

Germline variant calling

deepvariant

--model_type WES

Variant merging

jacquard

N/A

Variant phasing

whatshap

--ignore-read-groups

Splice variant filtering

split_snaf_by_sample

snaf_awk_filter: '\$5 == "True" && \$3 == 0.0', snaf_tumor_exp_threshold: '1000'

Fusion detection

star

--outReadsUnmapped None --twopassMode Basic --outSAMstrandField intronMotif --outSAMunmapped Within --chimSegmentMin 12 --chimJunctionOverhangMin 8 --chimOutJunctionFormat 1 --alignSJDBoverhangMin 10 --alignMatesGapMax 100000 --alignIntronMax 100000 --alignSJstitchMismatchNmax 5 -1 5 5 --chimMultimapScoreRange 3 --chimScoreJunctionNonGTAG -4 --chimMultimapNmax 20 --chimNonchimScoreDropMin 10 --peOverlapNbasesMin 12 --peOverlapMMp 0.1 --alignInsertionFlush Right --alignSplicedMateMapLminOverLmate 0 --alignSplicedMateMapLmin 30

Fusion detection

starfusion

--examine_coding_effect

Tumor purity detection

sequenza

sequenza_gc_wiggle: '-w 50', sequenza_bam2seqz: '', sequenza_seqz_binning: '-w 50'

Copy number variant detection

cnvkit

cnvkit_segment: '--drop-low-coverage', cnvkit_call: '--ploidy 2'

Expressed CTA detection

lenstools_filter_expressed_self_genes

-p 95

Expressed ERV detection

lenstools_filter_ervs_by_rna_coverage

--mean-depth 60

Expressed SNV detection

lenstools_filter_expressed_variants_parameters

-p 75

Expressed InDel detection

lenstools_filter_expressed_variants_parameters

-p 75

Expressed virus detection

lenstools_filter_viruses_by_rna_coverage

--coverage 25

pMHC characterization

mhcflurry

8,9,10,11

pMHC filtering

lenstools

<500 nM

LENS workflow flowchart

The flowchart below summarizes the module-level data flow in manifest_to_lens. It is intended as a human-readable map of the major branches and merge points.

digraph lens_workflow {
  rankdir=LR;
  graph [fontsize=11, fontname="Helvetica", nodesep=0.35, ranksep=0.5];
  node [shape=box, style="rounded,filled", fontsize=10, fontname="Helvetica", color="#4a4a4a", fillcolor="#f8f8f8"];
  edge [color="#555555", arrowsize=0.7];

  input [label="Inputs\nManifest + FASTQs/BAMs + references", fillcolor="#f2f2f2"];
  align [label="Alignment + preprocessing\nmanifest_to_*_alns\nalns_to_*_procd_alns", fillcolor="#e6f0ff"];

  somatic [label="Somatic path\nalns_to_som_vars\nfilter/norm/isec/union\nsnpeff + snpsift", fillcolor="#ffeedd"];
  germline [label="Germline path\nalns_to_germ_vars\nfilter + bgzip/index", fillcolor="#ffeedd"];
  phased [label="Tumor VCF + phasing\ngerm_and_som_vars_to_tumor_vars\nmake_phased_*_vars", fillcolor="#ffeedd"];

  rna [label="RNA quant + signatures\nalns_to_transcript_counts\nquants_to_gene_signatures", fillcolor="#eaffea"];
  hla [label="HLA typing + expression\nseq2hla/optitype/arcas\nconsensus_hla_alleles", fillcolor="#eaffea"];
  alt [label="Splice / Virus / Fusion\nalns_to_splice_variants\nalns_to_viruses\nprocd_fqs_to_fusions", fillcolor="#eaffea"];
  onco [label="Tumor context\npurity + CNA + LOH + CCF", fillcolor="#fff8dd"];

  neos [label="Neo generation\nsnvs_to_neos, indels_to_neos\nselfs_to_neos, ervs_to_neos\nviruses_to_neos, fusions_to_neos", fillcolor="#efe6ff"];
  subjoin [label="Group-aware subjoin\nget_groups_wrapper + join_run*\ncombine_peptide_fastas\ncombine_nt_fastas", fillcolor="#efe6ff"];

  pmhc [label="pMHC scoring + filtering\npeps_and_alleles_to_antigen_stats\naggregate_pmhc_summaries\nlenstools_annotate_pmhcs\nmultipass + mutant filtering", fillcolor="#e6f7ff"];
  annotate [label="Annotation + prioritization\nCCF/variant/HLA/tx annotations\npriority scores + APPM", fillcolor="#e6f7ff"];
  report [label="Final outputs\nlenstools_make_lens_report\nlenstools_multigroup_analysis", fillcolor="#ddfbff"];
  qc [label="QC + funnels\ncomprehensive_qc\nsample_swap_check\nlenstools_*_funnel_wrapper", fillcolor="#f2f2f2"];

  input -> align;
  align -> somatic;
  align -> germline;
  somatic -> phased;
  germline -> phased;
  align -> rna;
  align -> hla;
  align -> alt;
  align -> onco;
  phased -> neos;
  rna -> neos;
  alt -> neos;
  neos -> subjoin;
  hla -> pmhc;
  subjoin -> pmhc;
  onco -> annotate;
  pmhc -> annotate;
  annotate -> report;
  align -> qc;
  report -> qc;
}

Somatic Nucleotide Variants (SNVs)

Somatic single-nucleotide variants (SNVs), or variants present within tumor tissue but absent from germline tissue, can be a source of tumor-specific immunogenic peptides.

This section describes:

  • How LENS generates a set of patient-specific predicted SNV-derived peptides from filtered VCFs.

  • Provides explanations for design decisions.

  • Lists shortcomings and caveats of the current approach (and their planned fixes).

  • General and technical notes.

Variant calling

  • SNVs are called using three separate variant callers: mutect2, strelka2, and varscan2.

  • mutect2 raw VCFs are processed through the variant filtering workflow (raw VCFs -> LearnReadOrientationModel -> GetPileupSummaries -> CalculateContamination -> FilterMutectCalls).

  • Germline variants are called using deepvariant.

Variant filtering

  • Resulting somatic and germline VCFs are filtered for the PASS filter using bcftools.

  • Variants are classified as PASS by each variant caller.

  • strelka2 uses its empirical scoring (EVS) module.

  • mutect2 uses FilterMutectCalls.

  • varscan2 uses its internal filtering protocol

  • deepvariant uses its internal variant qualifying strategy.

Note

Users can modify the variant filtering strategy (e.g. adding hard filters) by modifying the params.lens$somatic$som_vars_to_filtd_som_vars$vcf_filtering_tool_parameters parameter. These filters are done on a per-variant caller basis, and are provided in a key:value structure. For example: params.lens$somatic$som_vars_to_filtd_som_vars$vcf_filtering_tool_parameters = "['mutect2': '-i \'MIN(FMT/DP)>10 & MIN(FMT/GQ)>15\'', 'strelka2':'\'MIN(FMT/DP)>10 & MIN(FMT/GQ)>15\'']"

Variant combining

A union of somatic variant calls is combined using jacquard. jacquard is called with parameters --include_format_tags="GT,AF,AU,CU,GU,TU,TAR,TIR,FREQ,VAF".

Variant annotation

Intersected somatic VCF annotated using snpeff.

Filtering variants for expression

A salmon quant.sf (from patient’s tumor RNA sequencing data), the annotated unioned somatic VCF, and a user-provided percentile (default: 75) are used to create a list of transcripts harboring somatic SNVs (and InDels) that are in the specified percentile.

Phasing variants with read-backed phasing

Read-backed phasing is performed for tumor and normal tissues separately using whatshap. The tumor phased VCF is created by phasing (using tumor DNA and tumor RNA reads) somatic variants and germline variants. Germline phased VCF is created by phasing (using normal DNA reads) germline variants.

Creating variant-specific VCFs

Variant-specific VCFs are generated from the tumor and germline phased VCFs.

This is performed separately for tumor and germline VCFs in order to create the variant context within the tumor around the variant of interest and to create the matching normal variant context around the same genomic position. Variant-specific VCFs include the somatic variant of interest as well as any variants that are either germline homozygous or germline heterozygous phased with the variant of interest.

Variant-specific VCF and haplotype logic

For each somatic variant of interest (VOI), LENS builds a variant-specific VCF for tumor and normal contexts using phased VCFs.

Current behavior (as implemented in lenstools_make_variant_specific_vcfs):

  • Always include the VOI itself.

  • Include homozygous alternate variants (GT="AA") on the VOI chromosome.

  • If the VOI has a usable numeric phase set (PS), include additional variants with matching PS on the VOI chromosome.

This strategy is intended to preserve the local phased context around each VOI while retaining homozygous alterations that affect both haplotypes.

Creating variant-specific, transcript-specific sequences

Exonic sequences from expressed transcripts harboring somatic variants of interest are extracted from the reference FASTA using the annotation GTF and samtools faidx into a BED file. The BED file is provided to samtools faidx to create a FASTA in which each entry is an exon sequence from an expressed transcripts harboring somatic variants of interest. Exonic FASTA file is combined with each variant-specific VCF to create the tumor and normal sequences for each transcript.

for each variant in annotated, intersected VCF:
    for each transcript listed in variant's annotation:
        if transcript is listed as expressed:
            apply all variants (focal somatic and neighboring germline) to transcript's exonic sequences using bcftools consensus
            store resulting exonic sequences into variant-specific, transcript-specific, tissue-specific FASTA

Creating SNV-derived peptides

Intersected annotated VCF and transcript and variant-specific exonic FASTAs are combined to create SNV-derived peptides with lenstools make-snv-peptides-context.

for each somatic SNV of interest:
    for each expressed transcript harboring SNV:
        concatenate exonic sequences from transcript to create full transcript
        extract region of interest from transcript (based on snpEff information)
        if tumor sequence:
            translate and store peptide in \*mt_aa.fa
            store DNA coding sequence in \*mt_nt..fa
        if normal sequence:
            translate and store peptide in \*wt_aa.fa

Calculating binding affinity and presentation score

The tumor sequences are processed by mhcflurry.

Peptide quantification using RNA reads

Each peptide’s DNA coding sequence (from lenstools make-snv-peptides-context) are combined with patient’s RNA sequencing reads for peptide quantification. RNA reads that fully overlap the peptide’s genomic origin (and are therefore capable of containing the full coding sequence) are queried for the coding sequence. A count of reads containing the coding sequence is included as the rna_reads_covering_genomic_origin_with_peptide_cds value. A subset of these reads will be primary alignments and may be a better representation of the actual transcript abundance of the peptide’s coding sequence. The count of primary alignments containing the coding sequence is the primary_aln_rna_reads_covering_genomic_origin_with_peptide_cds value.

Shortcomings and caveats

Design decisions

The FASTAs generated by the lenstools make-snv-peptides-context step contain headers with several pieces of metadata associated with the peptide. More information regarding the contents and usage of the FASTA headers can be found in the Technical section.

Notes

SNVs are described here in isolation (e.g., not considering InDels), but many of the steps (e.g., bedtools intersect, snpEff ann, etc.) are performed on VCFs with all variant types (SNVs and InDels). Somatic missense SNVs are not handled in isolation until after snpSift filter.

Technical

MD5 checksums were utilized for creating uniquely identifiable peptides within the mutant peptide FASTA file provided to netMHCpan. netMHCpan has an internal string length limitation for the IDENTITY column. The checksum is performed on the string <CHR>:<POS>:<TRANSCRIPT_ID>:<REF>:<ALT>. Checksum values are also applied to the reference peptide FASTA file headers in order to allow matching of peptides for agretopicity calculations.

The mutant peptide FASTA header has been expanded to include a variety of relevant metadata. These data are later included (after being match to binding affinity data using the unique MD5 chekcsum) in the final peptide metadata report. The mutant peptide FASTA headers include the following information: MD5, VARIANT_POS, TRANSCRIPT, REF, ALT, SNV_TYPE, PROTEIN_CONTEXT, GENOMIC_CONTEXT.

An example of a mutant peptide FASTA header:

>MD5:be460b5e5095139a VARIANT_POS:chr14_69230557 TRANSCRIPT:ENST00000409242.5 REF:C ALT:[T] SNV_TYPE:missense PROTEIN_CONTEXT:TVLNFPLDKSLLLCCSNWDAETLTEDQ GENOMIC_CONTEXT:ACTGTTTTGAACTTTCCCCTTGACAAGTCCCTTCTACTTTGTTGCAGCAACTGGGATGCTGAGACTCTCACAGAGGACCAG

Somatic Insertion and Deletion Variants (InDels)

Somatic insertion/deletion variants (InDels), or variants present within tumor tissue but absent from germline tissue, can be a source of tumor-specific immunogenic peptides.

Splice Variants

Fusion Events

Endogenous Retroviruses (ERVs)

Viruses

Cancer-testis Antigens (CTAs)

Aberrantly expressed genes (e.g. CTAs) can be a source of tumor-associated immunogenic peptides.

Determining expressed CTA and self-antigens

CTA/Self-antigens that are included in the user-provided list. The default list is available in /path/to/raft/references/homo_sapiens/cta_self/cta_and_self_antigen.homo_sapiens.gene_list. This list includes loci described in the CTDatabase (http://www.cta.lncc.br/). Targetable CTA/Self-antigen peptides are generated using the coding sequence of CTA transcripts that exceed the user-provided expression percentile (default: 95%).

Generating patient-specific CTA coding sequences

LENS performs germline variant calling (default: DeepVariant) as part of its workflow.