Technical Details
Technical details regarding internals of the off-the-shelf LENS workflow are described below.
Off-the-shelf defaults
Default references
Workflow |
Reference type |
Reference |
|---|---|---|
DNA alignment |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
DNA alignment post-processing |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
DNA alignment post-processing |
BED |
hg38_exome.bed |
DNA alignment post-processing |
Known sites VCF |
Homo_sapiens_assembly38.dbsnp138.vcf.gz |
RNA alignment |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
RNA alignment |
GTF |
gencode.v37.annotation.with.hervs.gtf |
Transcript quantification |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
Transcript quantification |
GTF |
gencode.v37.annotation.with.hervs.gtf |
Somatic variant calling |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
Somatic variant calling |
BED |
hg38_exome.bed |
Somatic variant calling |
Panel of normals VCF |
1000g_pon.hg38.vcf.gz |
Somatic variant calling |
Allele frequencies VCF |
af-only-gnomad.hg38.vcf.gz |
Somatic variant calling |
Known sites VCF |
small_exac_common_3.hg38.vcf.gz |
Variant annotation |
snpEff annotation file |
GRCh38.GENCODEv37 |
Germline variant calling |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
Germline variant calling |
BED |
hg38_exome.bed |
Variant phasing |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
Variant phasing |
GTF |
gencode.v37.annotation.with.hervs.gtf |
Splice variant calling |
Tool-specific reference |
snaf-data |
Virus detection |
Virus-specific (no Homo sapiens homology) sequences |
virus_masked_hg38.fa |
Virus detection |
Virus-specific sequences |
virus.cds.2024f2.fa |
Fusion detection |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
Fusion detection |
Tool-specific reference |
GRCh38_gencode_v37_CTAT_lib_Mar012021.plug-n-play |
Tumor purity detection |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
Tumor purity detection |
BED |
hg38_exome.bed |
Copy number variant detection |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
Copy number variant detection |
BED |
hg38_exome.bed |
CTA pMHC generation |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
CTA pMHC generation |
GTF |
gencode.v37.annotation.with.hervs.gtf |
CTA pMHC generation |
CTA gene list |
cta_and_self_antigen.homo_sapiens.gene_list |
ERV pMHC generation |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
ERV pMHC generation |
ERV annotations |
Hsap38.txt |
SNV pMHC generation |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
SNV pMHC generation |
GTF |
gencode.v37.annotation.with.hervs.gtf |
SNV pMHC generation |
Canonical protein reference |
gencode.v37.pc_translations.fa |
InDel pMHC generation |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
InDel pMHC generation |
GTF |
gencode.v37.annotation.with.hervs.gtf |
InDel pMHC generation |
Canonical protein reference |
gencode.v37.pc_translations.fa |
Fusion pMHC generation |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
Fusion pMHC generation |
GTF |
gencode.v37.annotation.with.hervs.gtf |
pMHC characterization |
Tool-specific reference |
mhcflurry |
CTA annotation |
CTA metadata |
canonical_txs.mtec.norm.subcell.annot.tsv |
ERV annotation |
ERV metadata |
erv_scores.25SEP2023.tsv |
Sample swap detection |
Genomic reference |
Homo_sapiens.assembly38.no_ebv.fa |
Sample swap detection |
Known sites VCF |
somalier.sites.hg38.vcf.gz |
Default tools
Workflow |
Tool |
Tool version |
|---|---|---|
DNA alignment |
fastp |
v0.24.0 |
DNA alignment |
bwa-mem2 |
v2.2.1 |
DNA alignment post-processing |
samblaster |
v0.1.26 |
DNA alignment |
fastp |
v0.24.0 |
RNA alignment |
star |
v2.7.3a |
Transcript quantification |
salmon |
v1.10.3 |
Somatic variant calling |
mutect2 |
v4.6.1.0 |
Somatic variant calling |
varscan2 |
v2.4.6 |
Somatic variant calling |
strelka2 |
v2.2.9 |
Somatic variant filtering |
bcftools |
v1.21 |
Variant annotation |
snpeff |
v4.3k |
Somatic SNV/InDel filtering |
snpsift |
v4.3k |
Variant unionizing |
jacquard |
v1.1.4 |
Germline variant calling |
deepvariant |
v1.8.0 |
Somatic and germline merging |
jacquard |
v1.1.4 |
Variant phasing |
whatshap |
v2.4 |
HLA Typing |
seq2hla |
v2.2 |
Splice variant calling |
snaf |
v0.7.0 |
Fusion detection |
starfusion |
v1.14.0 (v1.8.1b for mouse) |
Tumor purity detection |
sequenza |
v3.0.0 |
Copy number variant detection |
cnvkit |
v0.9.12 |
Copy number variant detection |
cnvkit |
v0.9.12 |
CTA pMHC generation |
lenstools |
v1.8 |
ERV pMHC generation |
lenstools |
v1.8 |
SNV pMHC generation |
lenstools |
v1.8 |
InDel pMHC generation |
lenstools |
v1.8 |
Viral pMHC generation |
lenstools |
v1.8 |
Splice pMHC generation |
lenstools |
v1.8 |
Fusion pMHC generation |
lenstools |
v1.8 |
pMHC characterization |
mhcflurry |
v2.1.1 |
Sample swap detection |
somalier |
v0.2.19 |
Default parameters
Workflow |
Tool |
Parameters |
|---|---|---|
DNA alignment |
fastp |
|
DNA alignment |
bwa-mem2 |
|
DNA alignment post-processing |
samblaster |
|
RNA alignment |
fastp |
|
RNA alignment |
star |
|
Transcript quantification |
salmon |
|
Somatic variant calling |
mutect2 |
|
Somatic variant calling |
strelka2 |
|
Somatic variant calling |
varscan2 |
|
Somatic variant filtering (mutect2) |
bcftools |
|
Somatic variant filtering (strelka2) |
bcftools |
|
Somatic variant filtering (varscan2) |
bcftools |
|
Variant annotation |
snpeff |
|
Somatic SNV filtering |
snpsift |
|
Somatic InDel filtering |
snpsift |
|
Variant unionizing |
jacquard |
|
Germline variant calling |
deepvariant |
|
Variant merging |
jacquard |
|
Variant phasing |
whatshap |
|
Splice variant filtering |
split_snaf_by_sample |
|
Fusion detection |
star |
|
Fusion detection |
starfusion |
|
Tumor purity detection |
sequenza |
|
Copy number variant detection |
cnvkit |
|
Expressed CTA detection |
lenstools_filter_expressed_self_genes |
|
Expressed ERV detection |
lenstools_filter_ervs_by_rna_coverage |
|
Expressed SNV detection |
lenstools_filter_expressed_variants_parameters |
|
Expressed InDel detection |
lenstools_filter_expressed_variants_parameters |
|
Expressed virus detection |
lenstools_filter_viruses_by_rna_coverage |
|
pMHC characterization |
mhcflurry |
8,9,10,11 |
pMHC filtering |
lenstools |
<500 nM |
LENS workflow flowchart
The flowchart below summarizes the module-level data flow in manifest_to_lens.
It is intended as a human-readable map of the major branches and merge points.
![digraph lens_workflow {
rankdir=LR;
graph [fontsize=11, fontname="Helvetica", nodesep=0.35, ranksep=0.5];
node [shape=box, style="rounded,filled", fontsize=10, fontname="Helvetica", color="#4a4a4a", fillcolor="#f8f8f8"];
edge [color="#555555", arrowsize=0.7];
input [label="Inputs\nManifest + FASTQs/BAMs + references", fillcolor="#f2f2f2"];
align [label="Alignment + preprocessing\nmanifest_to_*_alns\nalns_to_*_procd_alns", fillcolor="#e6f0ff"];
somatic [label="Somatic path\nalns_to_som_vars\nfilter/norm/isec/union\nsnpeff + snpsift", fillcolor="#ffeedd"];
germline [label="Germline path\nalns_to_germ_vars\nfilter + bgzip/index", fillcolor="#ffeedd"];
phased [label="Tumor VCF + phasing\ngerm_and_som_vars_to_tumor_vars\nmake_phased_*_vars", fillcolor="#ffeedd"];
rna [label="RNA quant + signatures\nalns_to_transcript_counts\nquants_to_gene_signatures", fillcolor="#eaffea"];
hla [label="HLA typing + expression\nseq2hla/optitype/arcas\nconsensus_hla_alleles", fillcolor="#eaffea"];
alt [label="Splice / Virus / Fusion\nalns_to_splice_variants\nalns_to_viruses\nprocd_fqs_to_fusions", fillcolor="#eaffea"];
onco [label="Tumor context\npurity + CNA + LOH + CCF", fillcolor="#fff8dd"];
neos [label="Neo generation\nsnvs_to_neos, indels_to_neos\nselfs_to_neos, ervs_to_neos\nviruses_to_neos, fusions_to_neos", fillcolor="#efe6ff"];
subjoin [label="Group-aware subjoin\nget_groups_wrapper + join_run*\ncombine_peptide_fastas\ncombine_nt_fastas", fillcolor="#efe6ff"];
pmhc [label="pMHC scoring + filtering\npeps_and_alleles_to_antigen_stats\naggregate_pmhc_summaries\nlenstools_annotate_pmhcs\nmultipass + mutant filtering", fillcolor="#e6f7ff"];
annotate [label="Annotation + prioritization\nCCF/variant/HLA/tx annotations\npriority scores + APPM", fillcolor="#e6f7ff"];
report [label="Final outputs\nlenstools_make_lens_report\nlenstools_multigroup_analysis", fillcolor="#ddfbff"];
qc [label="QC + funnels\ncomprehensive_qc\nsample_swap_check\nlenstools_*_funnel_wrapper", fillcolor="#f2f2f2"];
input -> align;
align -> somatic;
align -> germline;
somatic -> phased;
germline -> phased;
align -> rna;
align -> hla;
align -> alt;
align -> onco;
phased -> neos;
rna -> neos;
alt -> neos;
neos -> subjoin;
hla -> pmhc;
subjoin -> pmhc;
onco -> annotate;
pmhc -> annotate;
annotate -> report;
align -> qc;
report -> qc;
}](_images/graphviz-3b3277d7decb799556923ff2397b3bf2ca351437.png)
Somatic Nucleotide Variants (SNVs)
Somatic single-nucleotide variants (SNVs), or variants present within tumor tissue but absent from germline tissue, can be a source of tumor-specific immunogenic peptides.
This section describes:
How LENS generates a set of patient-specific predicted SNV-derived peptides from filtered VCFs.
Provides explanations for design decisions.
Lists shortcomings and caveats of the current approach (and their planned fixes).
General and technical notes.
Variant calling
SNVs are called using three separate variant callers:
mutect2,strelka2, andvarscan2.mutect2raw VCFs are processed through the variant filtering workflow (raw VCFs -> LearnReadOrientationModel -> GetPileupSummaries -> CalculateContamination -> FilterMutectCalls).Germline variants are called using
deepvariant.
Variant filtering
Resulting somatic and germline VCFs are filtered for the
PASSfilter usingbcftools.Variants are classified as PASS by each variant caller.
strelka2uses its empirical scoring (EVS) module.mutect2uses FilterMutectCalls.varscan2uses its internal filtering protocoldeepvariantuses its internal variant qualifying strategy.
Note
Users can modify the variant filtering strategy (e.g. adding hard filters)
by modifying the params.lens$somatic$som_vars_to_filtd_som_vars$vcf_filtering_tool_parameters
parameter. These filters are done on a per-variant caller basis, and are
provided in a key:value structure. For example:
params.lens$somatic$som_vars_to_filtd_som_vars$vcf_filtering_tool_parameters = "['mutect2': '-i \'MIN(FMT/DP)>10 & MIN(FMT/GQ)>15\'', 'strelka2':'\'MIN(FMT/DP)>10 & MIN(FMT/GQ)>15\'']"
Variant combining
A union of somatic variant calls is combined using jacquard. jacquard
is called with parameters
--include_format_tags="GT,AF,AU,CU,GU,TU,TAR,TIR,FREQ,VAF".
Variant annotation
Intersected somatic VCF annotated using snpeff.
Filtering variants for expression
A salmon quant.sf (from patient’s tumor RNA sequencing data), the
annotated unioned somatic VCF, and a user-provided percentile (default: 75) are
used to create a list of transcripts harboring somatic SNVs (and InDels) that
are in the specified percentile.
Phasing variants with read-backed phasing
Read-backed phasing is performed for tumor and normal tissues separately
using whatshap. The tumor phased VCF is created by phasing (using tumor DNA
and tumor RNA reads) somatic variants and germline variants. Germline phased
VCF is created by phasing (using normal DNA reads) germline variants.
Creating variant-specific VCFs
Variant-specific VCFs are generated from the tumor and germline phased VCFs.
This is performed separately for tumor and germline VCFs in order to create the variant context within the tumor around the variant of interest and to create the matching normal variant context around the same genomic position. Variant-specific VCFs include the somatic variant of interest as well as any variants that are either germline homozygous or germline heterozygous phased with the variant of interest.
Variant-specific VCF and haplotype logic
For each somatic variant of interest (VOI), LENS builds a variant-specific VCF for tumor and normal contexts using phased VCFs.
Current behavior (as implemented in lenstools_make_variant_specific_vcfs):
Always include the VOI itself.
Include homozygous alternate variants (
GT="AA") on the VOI chromosome.If the VOI has a usable numeric phase set (
PS), include additional variants with matchingPSon the VOI chromosome.
This strategy is intended to preserve the local phased context around each VOI while retaining homozygous alterations that affect both haplotypes.
Creating variant-specific, transcript-specific sequences
Exonic sequences from expressed transcripts harboring somatic variants of
interest are extracted from the reference FASTA using the annotation GTF and
samtools faidx into a BED file.
The BED file is provided to samtools faidx to create a FASTA in which each
entry is an exon sequence from an expressed transcripts harboring somatic
variants of interest.
Exonic FASTA file is combined with each variant-specific VCF to create the
tumor and normal sequences for each transcript.
for each variant in annotated, intersected VCF:
for each transcript listed in variant's annotation:
if transcript is listed as expressed:
apply all variants (focal somatic and neighboring germline) to transcript's exonic sequences using bcftools consensus
store resulting exonic sequences into variant-specific, transcript-specific, tissue-specific FASTA
Creating SNV-derived peptides
Intersected annotated VCF and transcript and variant-specific exonic FASTAs are
combined to create SNV-derived peptides with
lenstools make-snv-peptides-context.
for each somatic SNV of interest:
for each expressed transcript harboring SNV:
concatenate exonic sequences from transcript to create full transcript
extract region of interest from transcript (based on snpEff information)
if tumor sequence:
translate and store peptide in \*mt_aa.fa
store DNA coding sequence in \*mt_nt..fa
if normal sequence:
translate and store peptide in \*wt_aa.fa
Calculating binding affinity and presentation score
The tumor sequences are processed by mhcflurry.
Peptide quantification using RNA reads
Each peptide’s DNA coding sequence (from lenstools
make-snv-peptides-context) are combined with patient’s RNA sequencing reads
for peptide quantification.
RNA reads that fully overlap the peptide’s genomic origin (and are therefore
capable of containing the full coding sequence) are queried for the coding
sequence.
A count of reads containing the coding sequence is included as the
rna_reads_covering_genomic_origin_with_peptide_cds value. A subset of
these reads will be primary alignments and may be a better representation of
the actual transcript abundance of the peptide’s coding sequence. The count of
primary alignments containing the coding sequence is the
primary_aln_rna_reads_covering_genomic_origin_with_peptide_cds value.
Shortcomings and caveats
Design decisions
The FASTAs generated by the lenstools make-snv-peptides-context step contain
headers with several pieces of metadata associated with the peptide. More
information regarding the contents and usage of the FASTA headers can be found
in the Technical section.
Notes
SNVs are described here in isolation (e.g., not considering InDels), but many of the steps (e.g., bedtools intersect, snpEff ann, etc.) are performed on VCFs with all variant types (SNVs and InDels). Somatic missense SNVs are not handled in isolation until after snpSift filter.
Technical
MD5 checksums were utilized for creating uniquely identifiable peptides within the mutant peptide FASTA file provided to netMHCpan. netMHCpan has an internal string length limitation for the IDENTITY column. The checksum is performed on the string <CHR>:<POS>:<TRANSCRIPT_ID>:<REF>:<ALT>. Checksum values are also applied to the reference peptide FASTA file headers in order to allow matching of peptides for agretopicity calculations.
The mutant peptide FASTA header has been expanded to include a variety of relevant metadata. These data are later included (after being match to binding affinity data using the unique MD5 chekcsum) in the final peptide metadata report. The mutant peptide FASTA headers include the following information: MD5, VARIANT_POS, TRANSCRIPT, REF, ALT, SNV_TYPE, PROTEIN_CONTEXT, GENOMIC_CONTEXT.
An example of a mutant peptide FASTA header:
>MD5:be460b5e5095139a VARIANT_POS:chr14_69230557 TRANSCRIPT:ENST00000409242.5 REF:C ALT:[T] SNV_TYPE:missense PROTEIN_CONTEXT:TVLNFPLDKSLLLCCSNWDAETLTEDQ GENOMIC_CONTEXT:ACTGTTTTGAACTTTCCCCTTGACAAGTCCCTTCTACTTTGTTGCAGCAACTGGGATGCTGAGACTCTCACAGAGGACCAG
Somatic Insertion and Deletion Variants (InDels)
Somatic insertion/deletion variants (InDels), or variants present within tumor tissue but absent from germline tissue, can be a source of tumor-specific immunogenic peptides.
Splice Variants
Fusion Events
Endogenous Retroviruses (ERVs)
Viruses
Cancer-testis Antigens (CTAs)
Aberrantly expressed genes (e.g. CTAs) can be a source of tumor-associated immunogenic peptides.
Determining expressed CTA and self-antigens
CTA/Self-antigens that are included in the user-provided list. The default list is available in /path/to/raft/references/homo_sapiens/cta_self/cta_and_self_antigen.homo_sapiens.gene_list. This list includes loci described in the CTDatabase (http://www.cta.lncc.br/). Targetable CTA/Self-antigen peptides are generated using the coding sequence of CTA transcripts that exceed the user-provided expression percentile (default: 95%).
Generating patient-specific CTA coding sequences
LENS performs germline variant calling (default: DeepVariant) as part of its workflow.