Technical Details
Technical details regarding the internals of the off-the-shelf LENS workflow are described below.
Off-the-shelf defaults
Default references
| Workflow | Reference type | Reference |
|---|---|---|
| DNA alignment | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| DNA alignment post-processing | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| DNA alignment post-processing | BED | hg38_exome.bed |
| DNA alignment post-processing | Known sites VCF | Homo_sapiens_assembly38.dbsnp138.vcf.gz |
| RNA alignment | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| RNA alignment | GTF | gencode.v37.annotation.with.hervs.gtf |
| Transcript quantification | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| Transcript quantification | GTF | gencode.v37.annotation.with.hervs.gtf |
| Somatic variant calling | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| Somatic variant calling | BED | hg38_exome.bed |
| Somatic variant calling | Panel of normals VCF | 1000g_pon.hg38.vcf.gz |
| Somatic variant calling | Allele frequencies VCF | af-only-gnomad.hg38.vcf.gz |
| Somatic variant calling | Known sites VCF | small_exac_common_3.hg38.vcf.gz |
| Variant annotation | snpEff annotation file | GRCh38.GENCODEv37 |
| Germline variant calling | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| Germline variant calling | BED | hg38_exome.bed |
| Variant phasing | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| Variant phasing | GTF | gencode.v37.annotation.with.hervs.gtf |
| Splice variant calling | Tool-specific reference | snaf-data |
| Virus detection | Virus-specific (no Homo sapiens homology) sequences | virus_masked_hg38.fa |
| Virus detection | Virus-specific sequences | virus.cds.2024f2.fa |
| Fusion detection | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| Fusion detection | Tool-specific reference | GRCh38_gencode_v37_CTAT_lib_Mar012021.plug-n-play |
| Tumor purity detection | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| Tumor purity detection | BED | hg38_exome.bed |
| Copy number variant detection | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| Copy number variant detection | BED | hg38_exome.bed |
| CTA pMHC generation | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| CTA pMHC generation | GTF | gencode.v37.annotation.with.hervs.gtf |
| CTA pMHC generation | CTA gene list | cta_and_self_antigen.homo_sapiens.gene_list |
| ERV pMHC generation | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| ERV pMHC generation | ERV annotations | Hsap38.txt |
| SNV pMHC generation | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| SNV pMHC generation | GTF | gencode.v37.annotation.with.hervs.gtf |
| SNV pMHC generation | Canonical protein reference | gencode.v37.pc_translations.fa |
| InDel pMHC generation | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| InDel pMHC generation | GTF | gencode.v37.annotation.with.hervs.gtf |
| InDel pMHC generation | Canonical protein reference | gencode.v37.pc_translations.fa |
| Fusion pMHC generation | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| Fusion pMHC generation | GTF | gencode.v37.annotation.with.hervs.gtf |
| pMHC characterization | Tool-specific reference | mhcflurry |
| CTA annotation | CTA metadata | canonical_txs.mtec.norm.subcell.annot.tsv |
| ERV annotation | ERV metadata | erv_scores.25SEP2023.tsv |
| Sample swap detection | Genomic reference | Homo_sapiens.assembly38.no_ebv.fa |
| Sample swap detection | Known sites VCF | somalier.sites.hg38.vcf.gz |
Default tools
| Workflow | Tool | Tool version |
|---|---|---|
| DNA alignment | fastp | v0.24.0 |
| DNA alignment | bwa-mem2 | v2.2.1 |
| DNA alignment post-processing | samblaster | v0.1.26 |
| RNA alignment | fastp | v0.24.0 |
| RNA alignment | star | v2.7.3a |
| Transcript quantification | salmon | v1.10.3 |
| Somatic variant calling | mutect2 | v4.6.1.0 |
| Somatic variant calling | varscan2 | v2.4.6 |
| Somatic variant calling | strelka2 | v2.2.9 |
| Somatic variant filtering | bcftools | v1.21 |
| Variant annotation | snpeff | v4.3k |
| Somatic SNV/InDel filtering | snpsift | v4.3k |
| Variant unionizing | jacquard | v1.1.4 |
| Germline variant calling | deepvariant | v1.8.0 |
| Somatic and germline merging | jacquard | v1.1.4 |
| Variant phasing | whatshap | v2.4 |
| HLA Typing | seq2hla | v2.2 |
| Splice variant calling | snaf | v0.7.0 |
| Fusion detection | starfusion | v1.14.0 (v1.8.1b for mouse) |
| Tumor purity detection | sequenza | v3.0.0 |
| Copy number variant detection | cnvkit | v0.9.12 |
| CTA pMHC generation | lenstools | v1.8 |
| ERV pMHC generation | lenstools | v1.8 |
| SNV pMHC generation | lenstools | v1.8 |
| InDel pMHC generation | lenstools | v1.8 |
| Viral pMHC generation | lenstools | v1.8 |
| Splice pMHC generation | lenstools | v1.8 |
| Fusion pMHC generation | lenstools | v1.8 |
| pMHC characterization | mhcflurry | v2.1.1 |
| Sample swap detection | somalier | v0.2.19 |
Default parameters
| Workflow | Tool | Parameters |
|---|---|---|
| DNA alignment | fastp | |
| DNA alignment | bwa-mem2 | |
| DNA alignment post-processing | samblaster | |
| RNA alignment | fastp | |
| RNA alignment | star | |
| Transcript quantification | salmon | |
| Somatic variant calling | mutect2 | |
| Somatic variant calling | strelka2 | |
| Somatic variant calling | varscan2 | |
| Somatic variant filtering (mutect2) | bcftools | |
| Somatic variant filtering (strelka2) | bcftools | |
| Somatic variant filtering (varscan2) | bcftools | |
| Variant annotation | snpeff | |
| Somatic SNV filtering | snpsift | |
| Somatic InDel filtering | snpsift | |
| Variant unionizing | jacquard | |
| Germline variant calling | deepvariant | |
| Variant merging | jacquard | |
| Variant phasing | whatshap | |
| Splice variant filtering | split_snaf_by_sample | |
| Fusion detection | star | |
| Fusion detection | starfusion | |
| Tumor purity detection | sequenza | |
| Copy number variant detection | cnvkit | |
| Expressed CTA detection | lenstools_filter_expressed_self_genes | |
| Expressed ERV detection | lenstools_filter_ervs_by_rna_coverage | |
| Expressed SNV detection | lenstools_filter_expressed_variants_parameters | |
| Expressed InDel detection | lenstools_filter_expressed_variants_parameters | |
| Expressed virus detection | lenstools_filter_viruses_by_rna_coverage | |
| pMHC characterization | mhcflurry | 8,9,10,11 |
| pMHC filtering | lenstools | <500 nM |
LENS workflow flowchart
Somatic Nucleotide Variants (SNVs)
Somatic single-nucleotide variants (SNVs), or variants present within tumor tissue but absent from germline tissue, can be a source of tumor-specific immunogenic peptides.
This section:
Describes how LENS generates a set of patient-specific predicted SNV-derived peptides from filtered VCFs.
Provides explanations for design decisions.
Lists shortcomings and caveats of the current approach (and their planned fixes).
Provides general and technical notes.
Variant calling
SNVs are called using three separate variant callers:
mutect2, strelka2, and varscan2. mutect2 raw VCFs are processed through the variant filtering workflow (raw VCFs -> LearnReadOrientationModel -> GetPileupSummaries -> CalculateContamination -> FilterMutectCalls). Germline variants are called using
deepvariant.
Variant filtering
The resulting somatic and germline VCFs are filtered for the
PASS filter using bcftools. Variants are classified as PASS by each variant caller.
strelka2 uses its Empirical Variant Scoring (EVS) module. mutect2 uses FilterMutectCalls. varscan2 uses its internal filtering protocol. deepvariant uses its internal variant qualifying strategy.
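As an illustration of what PASS filtering does, the following minimal Python sketch keeps only the records whose FILTER column is PASS. It is not the LENS implementation: LENS uses bcftools for this step, and the effect shown is comparable to `bcftools view -f PASS`. The function name `keep_pass_records` is hypothetical.

```python
def keep_pass_records(vcf_lines):
    """Yield header lines plus records whose FILTER column is exactly PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Header and column-name lines are always kept.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # In the VCF format, the seventh column (index 6) is FILTER.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line
```

The same function applies to somatic and germline VCFs, because each caller writes its own PASS decision into the same column.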
Note
Users can modify the variant filtering strategy (e.g., by adding hard filters)
by modifying the params.lens$somatic$som_vars_to_filtd_som_vars$vcf_filtering_tool_parameters
parameter. These filters are applied on a per-variant-caller basis and are
provided in a key:value structure. For example:
params.lens$somatic$som_vars_to_filtd_som_vars$vcf_filtering_tool_parameters = "['mutect2': '-i \'MIN(FMT/DP)>10 & MIN(FMT/GQ)>15\'', 'strelka2': '-i \'MIN(FMT/DP)>10 & MIN(FMT/GQ)>15\'']"
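To make the meaning of the example expression concrete, here is a small Python sketch of the condition it describes: the lowest per-sample depth (FMT/DP) must exceed 10 and the lowest per-sample genotype quality (FMT/GQ) must exceed 15. The function `passes_hard_filter` is hypothetical, and the treatment of missing values is an assumption that may differ from bcftools.

```python
def passes_hard_filter(depths, genotype_qualities, min_dp=10, min_gq=15):
    """Return True when the lowest DP and the lowest GQ across samples exceed the thresholds."""
    if not depths or not genotype_qualities:
        # Assumption for this sketch: a site with no values fails the filter.
        return False
    return min(depths) > min_dp and min(genotype_qualities) > min_gq
```

Both comparisons are strict, so a site where one sample has a depth of exactly 10 is excluded.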
Variant combining
Somatic variant calls from the three callers are combined into a union using jacquard. jacquard
is called with the parameter
--include_format_tags="GT,AF,AU,CU,GU,TU,TAR,TIR,FREQ,VAF".
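Conceptually, a union keeps every variant reported by at least one caller. The sketch below shows that idea in plain Python, keyed on (chromosome, position, REF, ALT). It does not reproduce how jacquard merges FORMAT tags, and the function name `union_calls` is hypothetical.

```python
from collections import defaultdict


def union_calls(calls_by_caller):
    """Map each (chrom, pos, ref, alt) key to the sorted list of callers that reported it."""
    union = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            union[variant].add(caller)
    return {variant: sorted(callers) for variant, callers in union.items()}
```

Recording which callers support each variant makes it possible to apply a stricter consensus rule later without re-running the merge.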
Variant annotation
The intersected somatic VCF is annotated using snpeff.
Filtering variants for expression
A salmon quant.sf (from the patient’s tumor RNA sequencing data), the
annotated unioned somatic VCF, and a user-provided percentile (default: 75) are
used to create a list of transcripts that harbor somatic SNVs (and InDels) and
whose expression meets the specified percentile.
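A minimal Python sketch of this kind of expression filter follows. It reads the Name and TPM columns of a salmon quant.sf and keeps candidate transcripts at or above the percentile cutoff. The source does not say how LENS computes the percentile, so the nearest-rank method over all transcripts and the inclusive comparison are assumptions. The function name `expressed_transcripts` is hypothetical.

```python
import csv
import io
import math


def expressed_transcripts(quant_sf_text, candidate_transcripts, percentile=75.0):
    """Return the candidates whose TPM is at or above the percentile cutoff."""
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    tpm_by_tx = {row["Name"]: float(row["TPM"]) for row in reader}
    tpms = sorted(tpm_by_tx.values())
    if not tpms:
        return set()
    # Nearest-rank percentile (assumption): the value at rank ceil(p/100 * N).
    rank = max(1, math.ceil(percentile / 100.0 * len(tpms)))
    cutoff = tpms[rank - 1]
    # Candidates missing from quant.sf are treated as not expressed.
    return {tx for tx in candidate_transcripts if tpm_by_tx.get(tx, 0.0) >= cutoff}
```

Here `candidate_transcripts` stands for the transcript IDs taken from the variant annotations in the VCF.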
Phasing variants with read-backed phasing
Read-backed phasing is performed for the tumor and normal tissues separately
using whatshap. The tumor phased VCF is created by phasing somatic and germline
variants using tumor DNA and tumor RNA reads. The germline phased
VCF is created by phasing germline variants using normal DNA reads.
Creating variant-specific VCFs
Variant-specific VCFs are generated from the tumor and germline phased VCFs.
This is performed separately for tumor and germline VCFs in order to create the variant context within the tumor around the variant of interest and to create the matching normal variant context around the same genomic position. Variant-specific VCFs include the somatic variant of interest as well as any variants that are either germline homozygous or germline heterozygous phased with the variant of interest.
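The selection rule for a variant-specific VCF can be sketched in Python as below. This is an illustration only: it assumes that "phased with the variant of interest" means sharing the same phase set and the same haplotype as the focal variant, which the source does not state. The `PhasedCall` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhasedCall:
    pos: int
    genotype: str                    # e.g. "1/1", "0|1", "1|0", "0/1"
    phase_set: Optional[int] = None  # phase set identifier; None if unphased
    somatic: bool = False


def alt_haplotype(genotype):
    """Return 0 or 1 for the haplotype carrying ALT in a phased het call, else None."""
    if "|" not in genotype:
        return None
    alleles = genotype.split("|")
    if sorted(alleles) != ["0", "1"]:
        return None
    return alleles.index("1")


def variant_specific_calls(focal, calls):
    """Keep the focal somatic variant, homozygous germline variants, and
    heterozygous germline variants on the same haplotype as the focal variant."""
    focal_hap = alt_haplotype(focal.genotype)
    selected = [focal]
    for call in calls:
        if call == focal or call.somatic:
            continue  # other somatic variants get their own variant-specific VCF
        if call.genotype.replace("|", "/") == "1/1":
            selected.append(call)
        elif (focal_hap is not None
              and call.phase_set is not None
              and call.phase_set == focal.phase_set
              and alt_haplotype(call.genotype) == focal_hap):
            selected.append(call)
    return sorted(selected, key=lambda c: c.pos)
```

Unphased heterozygous germline variants are left out because their haplotype relative to the focal variant is unknown.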
Creating variant-specific, transcript-specific sequences
Exon coordinates for expressed transcripts harboring somatic variants of
interest are extracted from the annotation GTF into a BED file.
The BED file and the reference FASTA are provided to samtools faidx to create a FASTA in which each
entry is an exon sequence from an expressed transcript harboring a somatic
variant of interest.
The exonic FASTA file is combined with each variant-specific VCF to create the
tumor and normal sequences for each transcript.
for each variant in annotated, intersected VCF:
    for each transcript listed in variant's annotation:
        if transcript is listed as expressed:
            apply all variants (focal somatic and neighboring germline) to transcript's exonic sequences using bcftools consensus
            store resulting exonic sequences into variant-specific, transcript-specific, tissue-specific FASTA
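The "apply all variants" step can be illustrated with a small pure-Python stand-in for bcftools consensus. The sketch handles single-nucleotide substitutions only and is meant to show the idea, not to replace the tool. The function name `apply_snvs` is hypothetical.

```python
def apply_snvs(exon_seq, exon_start, variants):
    """Apply SNVs to an exon sequence.

    exon_start is the 1-based genomic position of the first exon base;
    variants is an iterable of (1-based position, REF, ALT) tuples.
    """
    bases = list(exon_seq)
    for pos, ref, alt in variants:
        offset = pos - exon_start
        if not 0 <= offset < len(bases):
            continue  # the variant lies outside this exon
        if bases[offset].upper() != ref.upper():
            raise ValueError(f"REF mismatch at {pos}: expected {ref}, found {bases[offset]}")
        bases[offset] = alt
    return "".join(bases)
```

Under this sketch, the tumor sequence would come from applying the focal somatic variant together with its neighboring germline variants, and the matching normal sequence from applying the germline variants alone.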
Creating SNV-derived peptides
The intersected, annotated VCF and the variant-specific, transcript-specific exonic FASTAs are
combined to create SNV-derived peptides with
lenstools make-snv-peptides-context.
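The internals of lenstools make-snv-peptides-context are not described here. As a rough illustration of deriving candidate peptides around a missense SNV, the Python sketch below enumerates every k-mer that contains the mutated residue. The default lengths 8, 9, 10, and 11 are taken from the mhcflurry row of the default parameters table and are assumed to be peptide lengths. The function name `peptides_overlapping` is hypothetical.

```python
def peptides_overlapping(protein, mutated_index, lengths=(8, 9, 10, 11)):
    """Return all k-mers of the given lengths that contain the 0-based mutated residue."""
    peptides = []
    for k in lengths:
        # The earliest window that still ends at or after the mutated residue.
        first = max(0, mutated_index - k + 1)
        # The latest window that starts at the mutated residue and fits in the protein.
        last = min(mutated_index, len(protein) - k)
        for start in range(first, last + 1):
            peptides.append(protein[start:start + k])
    return peptides
```

Windows that would run past either end of the protein are skipped, so mutations near the termini yield fewer peptides.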
Calculating binding affinity and presentation score
The tumor sequences are processed by mhcflurry.
Peptide quantification using RNA reads
Each peptide’s DNA coding sequence (from lenstools
make-snv-peptides-context) is combined with the patient’s RNA sequencing reads
for peptide quantification.
RNA reads that fully overlap the peptide’s genomic origin (and are therefore
capable of containing the full coding sequence) are queried for the coding
sequence.
A count of reads containing the coding sequence is included as the
rna_reads_covering_genomic_origin_with_peptide_cds value. A subset of
these reads will be primary alignments and may be a better representation of
the actual transcript abundance of the peptide’s coding sequence. The count of
primary alignments containing the coding sequence is the
primary_aln_rna_reads_covering_genomic_origin_with_peptide_cds value.
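The two counts can be sketched in Python as follows. Reads are modeled with a hypothetical `AlignedRead` class rather than a real BAM library, and strand handling is ignored: a coding sequence from the minus strand would need to be reverse-complemented first, which this sketch does not do.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    start: int        # 0-based inclusive reference start
    end: int          # 0-based exclusive reference end
    sequence: str     # read bases in reference orientation
    is_primary: bool  # False for secondary or supplementary alignments


def count_supporting_reads(reads, origin_start, origin_end, peptide_cds):
    """Count reads that span the peptide's genomic origin and contain its coding sequence."""
    total = 0
    primary = 0
    for read in reads:
        spans_origin = read.start <= origin_start and read.end >= origin_end
        if spans_origin and peptide_cds in read.sequence:
            total += 1
            primary += int(read.is_primary)
    return {
        "rna_reads_covering_genomic_origin_with_peptide_cds": total,
        "primary_aln_rna_reads_covering_genomic_origin_with_peptide_cds": primary,
    }
```

The returned keys mirror the two report values named above, and the primary count can never exceed the total.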
Shortcomings and caveats
Design decisions
The FASTAs generated by the lenstools make-snv-peptides-context step contain headers with several pieces of metadata associated with the peptide. More information regarding the contents and usage of the FASTA headers can be found in the Technical section.
Notes
SNVs are described here in isolation (i.e., without considering InDels), but many of the steps (e.g., bedtools intersect and snpEff ann) are performed on VCFs containing all variant types (SNVs and InDels). Somatic missense SNVs are not handled in isolation until after snpSift filter.
Technical
MD5 checksums were utilized for creating uniquely identifiable peptides within the mutant peptide FASTA file provided to netMHCpan. netMHCpan has an internal string length limitation for the IDENTITY column. The checksum is performed on the string <CHR>:<POS>:<TRANSCRIPT_ID>:<REF>:<ALT>. Checksum values are also applied to the reference peptide FASTA file headers in order to allow matching of peptides for agretopicity calculations.
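A checksum of this form can be computed with Python's standard hashlib module, as in the sketch below. The function name `variant_checksum` is hypothetical, and whether LENS uses the full 32-character hexadecimal digest or a shortened form is not stated here.

```python
import hashlib


def variant_checksum(chrom, pos, transcript_id, ref, alt):
    """Return the MD5 hex digest of '<CHR>:<POS>:<TRANSCRIPT_ID>:<REF>:<ALT>'."""
    key = f"{chrom}:{pos}:{transcript_id}:{ref}:{alt}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Because the digest depends only on the variant and transcript, the same identifier can be written to both the mutant and the reference peptide FASTA headers to pair them later.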
The mutant peptide FASTA header has been expanded to include a variety of relevant metadata. These data are later included (after being matched to binding affinity data using the unique MD5 checksum) in the final peptide metadata report. The mutant peptide FASTA headers include the following information: MD5, VARIANT_POS, TRANSCRIPT, REF, ALT, SNV_TYPE, PROTEIN_CONTEXT, GENOMIC_CONTEXT.
An example of a mutant peptide FASTA header:
Somatic Insertion and Deletion Variants (InDels)
Somatic insertion/deletion variants (InDels), or variants present within tumor tissue but absent from germline tissue, can be a source of tumor-specific immunogenic peptides.
Splice Variants
Fusion Events
Endogenous Retroviruses (ERVs)
Viruses
Cancer-testis Antigens (CTAs)
Aberrantly expressed genes (e.g. CTAs) can be a source of tumor-associated immunogenic peptides.
Determining expressed CTA and self-antigens
Expressed CTAs and self-antigens are identified from the genes included in the user-provided list. The default list is available in /path/to/raft/references/homo_sapiens/cta_self/cta_and_self_antigen.homo_sapiens.gene_list. This list includes loci described in the CTDatabase (http://www.cta.lncc.br/). Targetable CTA/self-antigen peptides are generated using the coding sequences of CTA transcripts that exceed the user-provided expression percentile (default: 95).
Generating patient-specific CTA coding sequences
LENS performs germline variant calling (default: DeepVariant) as part of its workflow.
Reference generation notes
Using NetMHC tools in LENS
LENS uses MHCflurry to estimate pMHC-specific binding affinities and presentation scores. Nevertheless, users may wish to include other tools, such as NetMHCpan or NetMHCstabpan, to further describe pMHCs of interest. LENS supports these tools, but users must provide their own Docker images due to restrictive licensing. Instructions for creating the Docker images and modifying LENS to use these tools are below.
Creating NetMHC Docker images
Users will need to create their own Docker images after obtaining their own license for each tool. Docker images can be created using these example Dockerfiles:
NetMHCpan
FROM ubuntu:20.04
RUN apt-get update && apt-get upgrade -y
RUN apt-get install -y wget vim tcsh
COPY netMHCpan-4.1b.Linux.tar.gz /
RUN tar xvf /netMHCpan-4.1b.Linux.tar.gz
RUN wget https://services.healthtech.dtu.dk/services/NetMHCpan-4.1/data.tar.gz
RUN tar xvf data.tar.gz
RUN mv /data /netMHCpan-4.1/data
RUN cp -r /netMHCpan-4.1/Linux_x86_64/bin /netMHCpan-4.1/bin
ENV NETMHCpan="/netMHCpan-4.1"
RUN rm /data.tar.gz
RUN rm /netMHCpan-4.1b.Linux.tar.gz
RUN sed -i 's/\/net\/sund-nas.win.dtu.dk\/storage\/services\/www\/packages\/netMHCpan\/4.1\/netMHCpan-4.1/\/netMHCpan-4.1\/data\//g' /netMHCpan-4.1/netMHCpan
ENV TMPDIR="/tmp"
CMD ["/bin/bash"]
NetMHCstabpan
FROM ubuntu:20.04
RUN apt-get update -y
RUN apt-get upgrade -y
RUN apt-get install -y wget vim tcsh gawk
COPY netMHCstabpan-1.0a.Linux.tar.gz /
RUN tar xvf /netMHCstabpan-1.0a.Linux.tar.gz
RUN wget https://services.healthtech.dtu.dk/services/NetMHCstabpan-1.0/data.tar.gz
RUN tar xvf data.tar.gz
RUN mv /data /netMHCstabpan-1.0/data
RUN cp -r /netMHCstabpan-1.0/Linux_x86_64/bin /netMHCstabpan-1.0/bin
ENV NETMHCstabpan="/netMHCstabpan-1.0"
RUN rm /data.tar.gz
RUN rm /netMHCstabpan-1.0a.Linux.tar.gz
RUN sed -i 's/\/net\/sund-nas.win.dtu.dk\/storage\/services\/www\/packages\/netMHCstabpan\/1.0\/netMHCstabpan-1.0/\/netMHCstabpan-1.0\/data\//g' /netMHCstabpan-1.0/netMHCstabpan
RUN chmod -R 777 /netMHCstabpan-1.0
ENV TMPDIR="/tmp"
CMD ["/bin/bash"]
NetCTLpan
FROM ubuntu:20.04
RUN apt-get update && apt-get upgrade -y
RUN apt-get install -y wget vim tcsh gawk
COPY netCTLpan-1.1b.Linux.tar.Z /
RUN tar xvf /netCTLpan-1.1b.Linux.tar.Z
RUN mkdir /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/tmp
RUN cp -r /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/bin/* /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/tmp
RUN mv /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/tmp /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/bin/bin
RUN wget https://services.healthtech.dtu.dk/services/NetCTLpan-1.1/data_netCTLpan-1.1.tar.Z
RUN tar xvf data_netCTLpan-1.1.tar.Z
RUN mv data_netCTLpan-1.1 /netCTLpan-1.1/data
RUN wget https://services.healthtech.dtu.dk/services/NetCTLpan-1.1/data_netMHCpan-2.3.tar.Z
RUN tar xvf data_netMHCpan-2.3.tar.Z
RUN cp -r data_netMHCpan-2.3 /netCTLpan-1.1/netMHCpan-2.3/data
RUN cp -r data_netMHCpan-2.3 /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/bin/data
RUN cp -r /netCTLpan-1.1/Linux_x86_64/bin /netCTLpan-1.1/bin
RUN sed -i 's/\/usr\/cbs\/packages\/netCTLpan\/1.1\/netCTLpan-1.1/\/netCTLpan-1.1/g' /netCTLpan-1.1/netCTLpan
ENV NETCTLpan="/netCTLpan-1.1"
ENV NETMHCpan="/netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/bin"
RUN rm /netCTLpan-1.1b.Linux.tar.Z
ENV TMPDIR="/tmp"
CMD ["/bin/bash"]