================= Technical Details ================= Technical details are regarding the internals to the off-the-shelf LENS workflow are described below. Off-the-shelf defaults ====================== Default references ------------------ .. list-table:: :header-rows: 1 * - Workflow - Reference type - Reference * - DNA alignment - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - DNA alignment post-processing - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - DNA alignment post-processing - BED - hg38_exome.bed * - DNA alignment post-processing - Known sites VCF - Homo_sapiens_assembly38.dbsnp138.vcf.gz * - RNA alignment - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - RNA alignment - GTF - gencode.v37.annotation.with.hervs.gtf * - Transcript quantification - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - Transcript quantification - GTF - gencode.v37.annotation.with.hervs.gtf * - Somatic variant calling - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - Somatic variant calling - BED - hg38_exome.bed * - Somatic variant calling - Panel of normals VCF - 1000g_pon.hg38.vcf.gz * - Somatic variant calling - Allele frequencies VCF - af-only-gnomad.hg38.vcf.gz * - Somatic variant calling - Known sites VCF - small_exac_common_3.hg38.vcf.gz * - Variant annotation - snpEff annotation file - GRCh38.GENCODEv37 * - Germline variant calling - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - Germline variant calling - BED - hg38_exome.bed * - Variant phasing - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - Variant phasing - GTF - gencode.v37.annotation.with.hervs.gtf * - Splice variant calling - Tool-specific reference - snaf-data * - Virus detection - Virus-specific (no Homo sapiens homology) sequences - virus_masked_hg38.fa * - Virus detection - Virus-specific sequences - virus.cds.2024f2.fa * - Fusion detection - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - Fusion detection - Tool-specific reference - GRCh38_gencode_v37_CTAT_lib_Mar012021.plug-n-play * - Tumor purity detection - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - Tumor purity detection - BED - hg38_exome.bed * - Copy number variant detection - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - Copy number variant detection - BED - hg38_exome.bed * - CTA pMHC generation - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - CTA pMHC generation - GTF - gencode.v37.annotation.with.hervs.gtf * - CTA pMHC generation - CTA gene list - cta_and_self_antigen.homo_sapiens.gene_list * - ERV pMHC generation - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - ERV pMHC generation - ERV annotations - Hsap38.txt * - SNV pMHC generation - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - SNV pMHC generation - GTF - gencode.v37.annotation.with.hervs.gtf * - SNV pMHC generation - Canonical protein reference - gencode.v37.pc_translations.fa * - InDel pMHC generation - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - InDel pMHC generation - GTF - gencode.v37.annotation.with.hervs.gtf * - InDel pMHC generation - Canonical protein reference - gencode.v37.pc_translations.fa * - Fusion pMHC generation - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - Fusion pMHC generation - GTF - gencode.v37.annotation.with.hervs.gtf * - pMHC characterization - Tool-specific reference - mhcflurry * - CTA annotation - CTA metadata - canonical_txs.mtec.norm.subcell.annot.tsv * - ERV annotation - ERV metadata - erv_scores.25SEP2023.tsv * - Sample swap detection - Genomic reference - Homo_sapiens.assembly38.no_ebv.fa * - Sample swap detection - Known sites VCF - somalier.sites.hg38.vcf.gz Default tools ------------- .. list-table:: :header-rows: 1 * - Workflow - Tool - Tool version * - DNA alignment - fastp - v0.24.0 * - DNA alignment - bwa-mem2 - v2.2.1 * - DNA alignment post-processing - samblaster - v0.1.26 * - DNA alignment - fastp - v0.24.0 * - RNA alignment - star - v2.7.3a * - Transcript quantification - salmon - v1.10.3 * - Somatic variant calling - mutect2 - v4.6.1.0 * - Somatic variant calling - varscan2 - v2.4.6 * - Somatic variant calling - strelka2 - v2.2.9 * - Somatic variant filtering - bcftools - v1.21 * - Variant annotation - snpeff - v4.3k * - Somatic SNV/InDel filtering - snpsift - v4.3k * - Variant unionizing - jacquard - v1.1.4 * - Germline variant calling - deepvariant - v1.8.0 * - Somatic and germline merging - jacquard - v1.1.4 * - Variant phasing - whatshap - v2.4 * - HLA Typing - seq2hla - v2.2 * - Splice variant calling - snaf - v0.7.0 * - Fusion detection - starfusion - v1.14.0 (v1.8.1b for mouse) * - Tumor purity detection - sequenza - v3.0.0 * - Copy number variant detection - cnvkit - v0.9.12 * - Copy number variant detection - cnvkit - v0.9.123 * - CTA pMHC generation - lenstools - v1.8 * - ERV pMHC generation - lenstools - v1.8 * - SNV pMHC generation - lenstools - v1.8 * - InDel pMHC generation - lenstools - v1.8 * - Viral pMHC generation - lenstools - v1.8 * - Splice pMHC generation - lenstools - v1.8 * - Fusion pMHC generation - lenstools - v1.8 * - pMHC characterization - mhcflurry - v2.1.1 * - Sample swap detection - somalier - v0.2.19 Default parameters ------------------ .. list-table:: :header-rows: 1 * - Workflow - Tool - Parameters * - DNA alignment - fastp - ``--in1 ${fq1} --out1 ${dataset}-${pat_name}-${run}_1.trimmed.fq.gz --in2 ${fq2} --out2 ${dataset}-${pat_name}-${run}_2.trimmed.fq.gz --failed_out ${dataset}-${pat_name}-${run}.fastp_fails.fq.gz --thread ${task.cpus}`` * - DNA alignment - bwa-mem2 - ``-R "@RG\\tID:${dataset}-${pat_name}-${run}\\tSM:${dataset}-${pat_name}-${run}\\tLB:NULL\\tPL:Illumina" ${fa} ${fq1} ${fq2} -t ${task.cpus - 1}`` * - DNA alignment post-processing - samblaster - ```` * - RNA alignment - fastp - ``--in1 ${fq1} --out1 ${dataset}-${pat_name}-${run}_1.trimmed.fq.gz --in2 ${fq2} --out2 ${dataset}-${pat_name}-${run}_2.trimmed.fq.gz --failed_out ${dataset}-${pat_name}-${run}.fastp_fails.fq.gz --thread ${task.cpus}`` * - RNA alignment - star - ``-quantMode TranscriptomeSAM --outSAMtype BAM SortedByCoordinate --twopassMode Basic --outSAMunmapped Within`` * - Transcript quantification - salmon - ``quant --threads ${task.cpus} -t ${fa} -l a -a ${aln} -o .`` * - Somatic variant calling - mutect2 - ``Placeholder`` * - Somatic variant calling - strelka2 - ``Placeholder`` * - Somatic variant calling - varscan2 - ``Placeholder`` * - Somatic variant filtering (mutect2) - bcftools - ``FILTER="PASS"`` * - Somatic variant filtering (strelka2) - bcftools - ``FILTER="PASS"`` * - Somatic variant filtering (varscan2) - bcftools - ``FILTER="PASS"`` * - Variant annotation - snpeff - ```` * - Somatic SNV filtering - snpsift - ``ANN[*].EFFECT has 'missense_variant'`` * - Somatic InDel filtering - snpsift - ``(ANN[*].EFFECT has 'conservative_inframe_insertion') || (ANN[*].EFFECT has 'conservative_inframe_deletion') || (ANN[*].EFFECT has 'disruptive_inframe_insertion') || (ANN[*].EFFECT has 'disruptive_inframe_deletion') || (ANN[*].EFFECT has 'frameshift_variant')`` * - Variant unionizing - jacquard - ``--include_format_tags=\"GT,AF,AU,CU,GU,TU,TAR,TIR,FREQ,VAF`` * - Germline variant calling - deepvariant - ``--model_type WES`` * - Variant merging - jacquard - ```` * - Variant phasing - whatshap - ``--ignore-read-groups`` * - Splice variant filtering - split_snaf_by_sample - ``snaf_awk_filter: '\$5 == "True" && \$3 == 0.0', snaf_tumor_exp_threshold: '1000'`` * - Fusion detection - star - ``--outReadsUnmapped None --twopassMode Basic --outSAMstrandField intronMotif --outSAMunmapped Within --chimSegmentMin 12 --chimJunctionOverhangMin 8 --chimOutJunctionFormat 1 --alignSJDBoverhangMin 10 --alignMatesGapMax 100000 --alignIntronMax 100000 --alignSJstitchMismatchNmax 5 -1 5 5 --chimMultimapScoreRange 3 --chimScoreJunctionNonGTAG -4 --chimMultimapNmax 20 --chimNonchimScoreDropMin 10 --peOverlapNbasesMin 12 --peOverlapMMp 0.1 --alignInsertionFlush Right --alignSplicedMateMapLminOverLmate 0 --alignSplicedMateMapLmin 30`` * - Fusion detection - starfusion - ``--examine_coding_effect`` * - Tumor purity detection - sequenza - ``sequenza_gc_wiggle: '-w 50', sequenza_bam2seqz: '', sequenza_seqz_binning: '-w 50'`` * - Copy number variant detection - cnvkit - ``cnvkit_segment: '--drop-low-coverage', cnvkit_call: '--ploidy 2'`` * - Expressed CTA detection - lenstools_filter_expressed_self_genes - ``-p 95`` * - Expressed ERV detection - lenstools_filter_ervs_by_rna_coverage - ``--mean-depth 60`` * - Expressed SNV detection - lenstools_filter_expressed_variants_parameters - ``-p 75`` * - Expressed InDel detection - lenstools_filter_expressed_variants_parameters - ``-p 75`` * - Expressed virus detection - lenstools_filter_viruses_by_rna_coverage - ``--coverage 25`` * - pMHC characterization - mhcflurry - 8,9,10,11 * - pMHC filtering - lenstools - <500 nM LENS workflow flowchart ======================= Somatic Nucleotide Variants (SNVs) ================================== Somatic single-nucleotide variants (SNVs), or variants present within tumor tissue but absent from germline tissue, can be a source of tumor-specific immunogenic peptides. This section describes: - How LENS generates a set of patient-specific predicted SNV-derived peptides from filtered VCFs. - Provies explanations for design decisions. - Lists shortcomings and caveats of the current approach (and their planned fixes). - General and technical notes. Variant calling --------------- - SNVs are called using three separate variant callers: ``mutect2``, ``strelka2``, and ``varscan2``. - ``mutect2`` raw VCFs are processed through the variant filtering workflow (raw VCFs -> LearnReadOrientationModel -> GetPileupSummaries -> CalculateContamination -> FilterMutectCalls). - Germline variants are called using ``deepvariant``. Variant filtering ----------------- - Resulting somatic and germline VCFs are filtered for the ``PASS`` filter using ``bcftools``. - Variants are classified as PASS by each variant caller. - ``strelka2`` uses its empirical scoring (EVS) module. - ``mutect2`` uses FilterMutectCalls. - ``varscan2`` uses its internal filtering protocoll - ``deepvariant`` uses its internal variant qualifying strategy. .. note:: Users can modify the variant filtering strategy (e.g. adding hard filters) by modifying the ``params.lens$somatic$som_vars_to_filtd_som_vars$vcf_filtering_tool_parameters`` parameter. These filters are done on a per-variant caller basis, and are provided in a key:value structure. For example: ``params.lens$somatic$som_vars_to_filtd_som_vars$vcf_filtering_tool_parameters = "['mutect2': '-i \'MIN(FMT/DP)>10 & MIN(FMT/GQ)>15\'', 'strelka2':'\'MIN(FMT/DP)>10 & MIN(FMT/GQ)>15\'']"`` Variant combining ----------------- A union of somatic variant calls is combined using ``jacquard``. ``jacquard`` is called with parameters ``--include_format_tags="GT,AF,AU,CU,GU,TU,TAR,TIR,FREQ,VAF"``. Variant annotation ------------------ Intersected somatic VCF annotated using ``snpeff``. Filtering variants for expression --------------------------------- A ``salmon`` ``quant.sf`` (from patient's tumor RNA sequencing data), the annotated unioned somatic VCF, and a user-provided percentile (default: 75) are used to create list of transcripts harboring somatic SNV (and InDels) and in the specified percentile. Phasing variants with read-backed phasing ----------------------------------------- Read-backed phasing is performed for the tumor and normal tissues separately using ``whatshap``. The tumor phased VCF is created by phasing (using tumor DNA and tumor RNA reads) somatic variants and germline variants. Germline phased VCF is created by phasing (using normal DNA reads) germline variants. Creating variant-specific VCFs ------------------------------ Variant-specific VCFs are generated from the tumor and germline phased VCFs. This is performed separately for tumor and germline VCFs in order to create the variant context within the tumor around the variant of interest and to create the matching normal variant context around the same genomic position. Variant-specific VCFs include the somatic variant of interest as well as any variants that are either germline homozygous or germline heterozygous phased with the variant of interest. Creating variant-specific, transcript-specific sequences -------------------------------------------------------- Exonic sequences from expressed transcripts harboring somatic variants of interest are extracted from the reference FASTA using the annotation GTF and ``samtools faidx`` into a BED file. The BED file is provided to samtools faidx to create a FASTA in which each entry is an exon sequence from an expressed transcripts harboring somatic variants of interest. Exonic FASTA file is combined with each variant-specific VCF to create the tumor and normal sequences for each transcript. .. code-block:: console for each variant in annotated, intersected VCF: for each transcript listed in variant's annotation: if transcript is listed as expressed: apply all variants (focal somatic and neighboring germline) to transcript's exonic sequences using bcftools consensus store resulting exonic sequences into variant-specific, transcript-specific, tissue-specific FASTA Creating SNV-derived peptides ----------------------------- Intersected annotated VCF and transcript and variant-specific exonic FASTAs are combined to create SNV-derived peptides with ``lenstools make-snv-peptides-context``. .. code-block:: console for each somatic SNV of interest: for each expressed transcript harboring SNV: concatenate exonic sequences from transcript to create full transcript extract region of interest from transcript (based on snpEff information) if tumor sequence: translate and store peptide in \*mt_aa.fa store DNA coding sequence in \*mt_nt..fa if normal sequence: translate and store peptide in \*wt_aa.fa .. note: Peptides are tagged with checksums (calculated by variant position, transcript, ref allele, and alt allele) for downstream identification. Calculating binding affinity and presentation score --------------------------------------------------- The tumor sequencesare processed by ``mhcflurry``. Peptide quantification using RNA reads -------------------------------------- Each peptide's DNA coding sequence (from ``lenstools make-snv-peptides-context``) are combined with patient's RNA sequencing reads for peptide quantification. RNA reads that fully overlap the peptide's genomic origin (and are therefore capable of containing the full coding sequence) are queried for the coding sequence. A count of reads containing the coding sequence is included as the ``rna_reads_covering_genomic_origin_with_peptide_cds`` value. A subset of these reads will be primary alignments and may be a better representation of the actual transcript abundance of the peptide's coding sequence. The count of primary alignments containing the coding sequence is the ``primary_aln_rna_reads_covering_genomic_origin_with_peptide_cds`` value. Shortcomings and caveats ------------------------ Design decisions ---------------- The FASTAs generated by the lenstools make-snv-peptides-context step contains headers with several pieces of metadata associated with the peptide. More information regarding the contents and usage of the FASTA headers can be found in the Technical section. Notes ----- SNVs are described here in isolation (e.g. not considering InDels), but many of the steps (e.g., bedtools intersect, snpEff ann, etc.) are performed on VCFs with all variant types (SNVs and InDels). Somatic missense SNVs are not handled in isolation until after snpSift filter. Technical --------- MD5 checksums were utilized for creating uniquely identifiable peptides within the mutant peptide FASTA file provided to netMHCpan. netMHCpan has an internal string length limitation for the IDENTITY column. The checksum is performed on the string ::::. Checksum values are also applied to the reference peptide FASTA file headers in order to allow matching of peptides for agretopicity calculations. The mutant peptide FASTA header has been expanded to include a variety of relevant metadata. These data are later included (after being match to binding affinity data using the unique MD5 chekcsum) in the final peptide metadata report. The mutant peptide FASTA headers include the following information: MD5, VARIANT_POS, TRANSCRIPT, REF, ALT, SNV_TYPE, PROTEIN_CONTEXT, GENOMIC_CONTEXT. An example of a mutant peptide FASTA header: .. code-block:: console >MD5:be460b5e5095139a VARIANT_POS:chr14_69230557 TRANSCRIPT:ENST00000409242.5 REF:C ALT:[T] SNV_TYPE:missense PROTEIN_CONTEXT:TVLNFPLDKSLLLCCSNWDAETLTEDQ GENOMIC_CONTEXT:ACTGTTTTGAACTTTCCCCTTGACAAGTCCCTTCTACTTTGTTGCAGCAACTGGGATGCTGAGACTCTCACAGAGGACCAG Somatic Insertion and Deletion Variants (InDels) ================================================ Somatic insertion/deletion variants (InDels), or variants present within tumor tissue but absent from germline tissue, can be a source of tumor-specific immunogenic peptides. Splice Variants =============== Fusion Events ============= Endogenous Retroviruses (ERVs) ============================== Viruses ======= Cancer-testis Antigens (CTAs) ============================= Aberrantly expressed genes (e.g. CTAs) can be a source of tumor-associated immunogenic peptides. Determining expressed CTA and self-antigens ------------------------------------------- CTA/Self-antigens that are included in the user-provided list. The default list is available in ``/path/to/raft/references/homo_sapiens/cta_self/cta_and_self_antigen.homo_sapiens.gene_list``. This list includes loci described in the CTDatabase (http://www.cta.lncc.br/). Targetable CTA/Self-antigen peptides are generated using the coding sequence of CTA transcripts that exceed the user-provided expression percentile (default: 95%). Generating patient-specific CTA coding sequences ------------------------------------------------ LENS performs germline variant calling (default: DeepVariant) as part of its workflow. Reference generation notes ========================== Using NetMHC tools in LENS ========================== LENS utilizes MHCflurry to estimate pMHC-specific binding affinities and presentation scores. Nevertheless, users may wish to include other tools, such as ``NetMHCpan`` or ``NetMHCstabpan`` to further describe pMHCs of interest. LENS supports these tools, but users must provide their own Docker images due to restrictive licensing. Informations for creating the Docker images and modifying LENS to use these tools are below. Creating NetMHC Docker images ----------------------------- Users will need to create their own Docker images after obtaining their own license for each tool. Docker images can be created using these example Dockerfiles: **NetMHCpan** .. code-block:: console FROM ubuntu:20.04 RUN apt-get update && apt-get upgrade -y RUN apt-get install -y wget vim tcsh COPY netMHCpan-4.1b.Linux.tar.gz / RUN tar xvf /netMHCpan-4.1b.Linux.tar.gz RUN wget https://services.healthtech.dtu.dk/services/NetMHCpan-4.1/data.tar.gz RUN tar xvf data.tar.gz RUN mv /data /netMHCpan-4.1/data RUN cp -r /netMHCpan-4.1/Linux_x86_64/bin /netMHCpan-4.1/bin ENV NETMHCpan="/netMHCpan-4.1" RUN rm /data.tar.gz RUN rm /netMHCpan-4.1b.Linux.tar.gz RUN sed -i 's/\/net\/sund-nas.win.dtu.dk\/storage\/services\/www\/packages\/netMHCpan\/4.1\/netMHCpan-4.1/\/netMHCpan-4.1\/data\//g' /netMHCpan-4.1/netMHCpan ENV TMPDIR /tmp CMD ["/bin/bash"] **NetMHCstabpan** .. code-block:: console FROM ubuntu:20.04 RUN apt-get update -y RUN apt-get upgrade -y RUN apt-get install -y wget vim tcsh gawk COPY netMHCstabpan-1.0a.Linux.tar.gz / RUN tar xvf /netMHCstabpan-1.0a.Linux.tar.gz RUN wget https://services.healthtech.dtu.dk/services/NetMHCstabpan-1.0/data.tar.gz RUN tar xvf data.tar.gz RUN mv /data /netMHCstabpan-1.0/data RUN cp -r /netMHCstabpan-1.0/Linux_x86_64/bin /netMHCstabpan-1.0/bin ENV NETMHCstabpan="/netMHCstabpan-1.0" RUN rm /data.tar.gz RUN rm /netMHCstabpan-1.0a.Linux.tar.gz RUN sed -i 's/\/net\/sund-nas.win.dtu.dk\/storage\/services\/www\/packages\/netMHCstabpan\/1.0\/netMHCstabpan-1.0/\/netMHCstabpan-1.0\/data\//g' /netMHCstabpan-1.0/netMHCstabpan RUN chmod -R 777 /netMHCstabpan-1.0 ENV TMPDIR /tmp CMD ["/bin/bash"] **NetCTLpan** .. code-block:: console FROM ubuntu:20.04 RUN apt-get update && apt-get upgrade -y RUN apt-get install -y wget vim tcsh gawk COPY netCTLpan-1.1b.Linux.tar.Z / RUN tar xvf /netCTLpan-1.1b.Linux.tar.Z RUN mkdir /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/tmp RUN cp -r /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/bin/* /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/tmp RUN mv /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/tmp /netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/bin/bin RUN wget https://services.healthtech.dtu.dk/services/NetCTLpan-1.1/data_netCTLpan-1.1.tar.Z RUN tar xvf data_netCTLpan-1.1.tar.Z RUN mv data_netCTLpan-1.1 /netCTLpan-1.1/data RUN wget https://services.healthtech.dtu.dk/services/NetCTLpan-1.1/data_netMHCpan-2.3.tar.Z RUN tar xvf data_netMHCpan-2.3.tar.Z RUN cp -r data_netMHCpan-2.3 /netCTLpan-1.1/netMHCpan-2.3/data RUN cp -r data_netMHCpan-2.3 netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/bin/data RUN cp -r /netCTLpan-1.1/Linux_x86_64/bin /netCTLpan-1.1/bin RUN sed -i 's/\/usr\/cbs\/packages\/netCTLpan\/1.1\/netCTLpan-1.1/\/netCTLpan-1.1/g' /netCTLpan-1.1/netCTLpan ENV NETCTLpan="/netCTLpan-1.1" ENV NETMHCpan="/netCTLpan-1.1/netMHCpan-2.3/Linux_x86_64/bin" RUN rm /netCTLpan-1.1b.Linux.tar.Z ENV TMPDIR /tmp CMD ["/bin/bash"]