Manifest

Purpose

The RAFT manifest defines the relationships among samples, patients, and datasets. Multiple samples are required per patient and each sample plays a crucial role in predicting pMHCs. For example, the normal and tumor DNA samples are required for calling somatic variants while the RNA tumor sample is essential in determining tumor expression of a variety of tumor antigens (ERVs, viruses, somatic variants, splice variants, and fusions).

A RAFT manifest for LENS requires the following samples per patient:

  • At least one tumor DNA sample

  • At least one normal DNA sample

  • At least one tumor RNA sample

  • At least one tissue-matched, patient-matched normal RNA sample (Optional)

  • At least one tissue-matched, non-patient-matched normal RNA sample (e.g. GTEx)

Users can run a reduced LENS analysis (without any SNV or InDel pMHCs reported) using the following samples per patient:

  • At least one tumor RNA sample

  • At least one tissue-matched, patient-matched normal RNA sample (Optional)

  • At least one tissue-matched, non-patient-matched normal RNA sample (e.g. GTEx)

RAFT manifests may contain one or more patients. On many compute clusters, multiple patients can be run in parallel to reduce run time.

Sample, patient, and dataset hierarchy

The organizational hierarchy within RAFT is as follows:

Sample ∈ Patient ∈ Dataset

In other words, samples belong to patients (patients can have multiple samples) and patients belong to datasets (datasets can have multiple patients). This hierarchy allows for accurate and efficient combining of samples for processing.
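The hierarchy can be pictured as a nested mapping from dataset to patient to samples. The following is a minimal sketch only, not RAFT's internal representation; the function name build_hierarchy and the second patient (Pt02) are hypothetical and used purely for illustration.

```python
from collections import defaultdict


def build_hierarchy(records):
    """Nest (dataset, patient, sample) records as dataset -> patient -> [samples]."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    for dataset, patient, sample in records:
        hierarchy[dataset][patient].append(sample)
    # Convert to plain dicts so that looking up a missing key raises KeyError
    # instead of silently creating an empty entry.
    return {dataset: dict(patients) for dataset, patients in hierarchy.items()}


# Hypothetical records: two patients in one dataset.
records = [
    ("AML", "Pt01", "ad-Pt01-03A"),
    ("AML", "Pt01", "nd-Pt01-11A"),
    ("AML", "Pt02", "ar-Pt02-03A"),
]
tree = build_hierarchy(records)
print(tree["AML"]["Pt01"])  # ['ad-Pt01-03A', 'nd-Pt01-11A']
```

Grouping by dataset first and patient second mirrors the Sample ∈ Patient ∈ Dataset relationship: every sample is reachable only through its patient, and every patient only through its dataset.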

Contents

A RAFT manifest must have at least the columns defined in the table below. Columns can be in any order and other columns containing non-RAFT metadata are also allowed.

RAFT Columns

Column             Description                                 Allowed values
Dataset            Name for collection of patients             Free text
Patient_Name       Name for collection of samples              Free text (except UNIV – see note below)
Run_Name           Name for the specific sample                Free text (see note below)
File_Prefix        Base name (or full path) of input files     Free text
Sequencing_Method  Sequencing protocol for sample              RNA-seq, WES, WXS, WGS
Normal             Is the sample normal or abnormal (tumor)?   TRUE, FALSE
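The column requirements above can be checked with a few lines of Python before a workflow is started. This is an illustrative sketch, not the logic used by raft.py; the function name check_rows is hypothetical, the underscore-separated column names are taken from the example manifests in this document, and comparing Sequencing_Method case-insensitively is an assumption (the examples in this document write RNA-Seq while the table lists RNA-seq).

```python
REQUIRED_COLUMNS = {
    "Dataset", "Patient_Name", "Run_Name",
    "File_Prefix", "Sequencing_Method", "Normal",
}
# Assumption: the method is matched case-insensitively.
SEQUENCING_METHODS = {"rna-seq", "wes", "wxs", "wgs"}
NORMAL_VALUES = {"TRUE", "FALSE"}


def check_rows(header, rows):
    """Return a list of problems found; an empty list means none were found."""
    missing = REQUIRED_COLUMNS - set(header)
    if missing:
        return ["missing columns: " + ", ".join(sorted(missing))]
    problems = []
    for number, row in enumerate(rows, start=1):
        method = row["Sequencing_Method"]
        if method.lower() not in SEQUENCING_METHODS:
            problems.append(f"row {number}: unexpected Sequencing_Method {method!r}")
        if row["Normal"] not in NORMAL_VALUES:
            problems.append(f"row {number}: Normal must be TRUE or FALSE")
    return problems


header = ["Patient_Name", "Run_Name", "Dataset",
          "File_Prefix", "Sequencing_Method", "Normal"]
rows = [
    dict(zip(header, ["Pt01", "ad-Pt01-03A", "AML", "9f7f7", "WES", "FALSE"])),
    dict(zip(header, ["Pt01", "ar-Pt01-03A", "AML", "cdb288", "RNA-Seq", "FALSE"])),
]
print(check_rows(header, rows))  # []
```

Extra columns holding non-RAFT metadata pass this check untouched, since only the required columns are inspected.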

Note

A sample’s Run_Name is instrumental in guiding samples through some RAFT workflows. A sample’s Run_Name should have a two-letter prefix that describes the type of sample and a delimiter (- or _) followed by an arbitrary unique identifier. The first letter of the prefix is either a (for abnormal) or n (for normal). The second letter is either r (for RNA) or d (for DNA). For example, a sample with an ar- (or ar_) prefix is an abnormal (tumor) RNA sample while a sample with a nd- prefix is a normal DNA sample.
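The prefix convention can be expressed as a small parser. This is a sketch of the rule as described in this note, not RAFT's own implementation; the function name parse_run_name and the returned dictionary layout are hypothetical.

```python
import re

# First letter: a (abnormal) or n (normal); second letter: r (RNA) or d (DNA);
# then a '-' or '_' delimiter and an arbitrary identifier.
RUN_NAME_PATTERN = re.compile(r"^([an])([rd])[-_](.+)$")


def parse_run_name(run_name):
    """Split a Run_Name into its sample-type prefix and unique identifier."""
    match = RUN_NAME_PATTERN.match(run_name)
    if match is None:
        raise ValueError(f"Run_Name {run_name!r} lacks a valid two-letter prefix")
    status, molecule, identifier = match.groups()
    return {
        "normal": status == "n",
        "molecule": "RNA" if molecule == "r" else "DNA",
        "identifier": identifier,
    }


print(parse_run_name("ar-Pt01-03A"))
# {'normal': False, 'molecule': 'RNA', 'identifier': 'Pt01-03A'}
```

A name such as xx-sample raises ValueError, which makes mislabeled samples visible before they reach a workflow.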

Note

The patient name UNIV (short for UNIVersal) is reserved and intended for samples that are expected to be shared among patients. For example, if not all of your patients have a patient-matched, tissue-matched normal RNA sample, then an external tissue-matched sample (e.g. GTEx) can be utilized with the patient name UNIV. Patients that have the appropriate patient-matched, tissue-matched normal RNA sample will use their own sample while other patients will use the UNIV sample instead.

Each line in the manifest after the header corresponds to a sample and provides the necessary data for running a RAFT workflow. The samples described within the manifest may, in some cases, be effectively independent (as in, the workflow does not attempt to pair samples from a patient), but in other cases, users must be careful that samples are properly labeled. For example, somatic variant calling generally requires a normal DNA sample and a tumor DNA sample. For RAFT to properly pair these samples together, they must have the correct sample prefix (ad- for the DNA tumor sample and nd- for the DNA normal sample) and share the same patient (Patient_Name field) and dataset (Dataset field). Consider the following example:

Patient_Name  Run_Name     Dataset  File_Prefix  Sequencing_Method  Normal
Pt01          ad-Pt01-03A  AML      9f7f7        WES                FALSE
Pt01          nd-Pt01-11A  AML      8e74a        WES                TRUE
Pt01          ar-Pt01-03A  AML      cdb288       RNA-Seq            FALSE
UNIV          nr-CTRL      AML      CD34-U       RNA-Seq            TRUE

Note

Both the tumor DNA sample (ad-Pt01-03A) and the normal DNA sample (nd-Pt01-11A) belong to the same patient (Pt01) and the same dataset (AML).
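The pairing rule described above (matching prefix, patient, and dataset) can be sketched as follows. This is an illustration under stated assumptions rather than RAFT's actual pairing code: the function name pair_dna_samples is hypothetical, and rows are assumed to have already been read into dictionaries keyed by column name.

```python
from collections import defaultdict


def pair_dna_samples(rows):
    """Map (Dataset, Patient_Name) to a list of (tumor DNA, normal DNA) pairs."""
    by_patient = defaultdict(lambda: {"ad": [], "nd": []})
    for row in rows:
        prefix = row["Run_Name"][:2]
        if prefix in ("ad", "nd"):
            key = (row["Dataset"], row["Patient_Name"])
            by_patient[key][prefix].append(row["Run_Name"])
    pairs = {}
    for key, samples in by_patient.items():
        # A pair exists only when the same patient in the same dataset
        # has both a tumor (ad) and a normal (nd) DNA sample.
        if samples["ad"] and samples["nd"]:
            pairs[key] = [(t, n) for t in samples["ad"] for n in samples["nd"]]
    return pairs


rows = [
    {"Patient_Name": "Pt01", "Run_Name": "ad-Pt01-03A", "Dataset": "AML"},
    {"Patient_Name": "Pt01", "Run_Name": "nd-Pt01-11A", "Dataset": "AML"},
    {"Patient_Name": "Pt01", "Run_Name": "ar-Pt01-03A", "Dataset": "AML"},
    {"Patient_Name": "UNIV", "Run_Name": "nr-CTRL", "Dataset": "AML"},
]
print(pair_dna_samples(rows))
# {('AML', 'Pt01'): [('ad-Pt01-03A', 'nd-Pt01-11A')]}
```

If either DNA sample carried the wrong prefix, patient name, or dataset, no pair would be produced for Pt01, which is why careful labeling matters.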

Complex sample sets

Users may encounter situations that require more than one sample per sample type per patient. For example, users may have a single set of DNA samples (normal DNA and tumor DNA), but may have multiple RNA-seq samples (e.g. multiple timepoints). Users can use RAFT’s subjoin functionality to support these cases. Specifically, each sample in the manifest requires an additional column (Group). By default, LENS assumes that all samples belonging to a single Patient_Name should be used together (as in the example above). Users should use group identifiers to specify which samples are used together. For example:

Patient_Name  Run_Name     Dataset  File_Prefix  Sequencing_Method  Normal  Group
Pt01          ad-Pt01-03A  AML      9f7f7        WES                FALSE   1-2
Pt01          nd-Pt01-11A  AML      8e74a        WES                TRUE    1-2
Pt01          ar-Pt01-03A  AML      cdb288       RNA-Seq            FALSE   1
Pt01          ar-Pt01-03B  AML      cdb289       RNA-Seq            FALSE   2
UNIV          nr-CTRL      AML      CD34-U       RNA-Seq            TRUE    *

Note

UNIV samples can be assigned a * Group value, which allows them to be paired with all possible combinations (where a patient-specific sample of that type is not already present). UNIV samples can also be limited to specific groups by using Group identifiers (e.g. 1-2).

This scenario depicts a patient (Pt01) that has a DNA normal sample, a DNA tumor sample, and two RNA-seq samples. For this example, let’s assume the first RNA-seq sample (ar-Pt01-03A) is a pre-treatment sample while the second RNA-seq sample (ar-Pt01-03B) is a post-treatment sample and you, as an investigator, are interested in understanding how treatment affected the tumor antigen landscape. In this case, LENS will be run on two distinct sample sets that differ only in their tumor RNA-seq sample:

The pre-treatment sample set:

  • ad-Pt01-03A

  • nd-Pt01-11A

  • ar-Pt01-03A

  • nr-CTRL

and the post-treatment sample set:

  • ad-Pt01-03A

  • nd-Pt01-11A

  • ar-Pt01-03B

  • nr-CTRL

Both of these sample sets will have their own individual LENS report after the completion of the LENS workflow.
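The way Group values resolve into the two sample sets above can be sketched in Python. This is a guess at the semantics based only on the example, not RAFT's subjoin implementation: the function name expand_sample_sets is hypothetical, and treating "-" as the separator between group identifiers (so that 1-2 means groups 1 and 2) is an assumption that would not hold for identifiers that themselves contain hyphens.

```python
def expand_sample_sets(rows):
    """Return {(Patient_Name, group): [Run_Name, ...]} from rows with a Group column."""
    patient_rows = [r for r in rows if r["Patient_Name"] != "UNIV"]
    univ_rows = [r for r in rows if r["Patient_Name"] == "UNIV"]
    sets = {}
    for row in patient_rows:
        # Assumption: '-' separates the groups a sample belongs to.
        for group in row["Group"].split("-"):
            sets.setdefault((row["Patient_Name"], group), []).append(row["Run_Name"])
    for (patient, group), members in sets.items():
        # Two-letter prefixes of the patient's own samples in this group.
        present = {name[:2] for name in members}
        for row in univ_rows:
            applies = row["Group"] == "*" or group in row["Group"].split("-")
            # A UNIV sample fills in only when the patient lacks that sample type.
            if applies and row["Run_Name"][:2] not in present:
                members.append(row["Run_Name"])
    return sets


rows = [
    {"Patient_Name": "Pt01", "Run_Name": "ad-Pt01-03A", "Group": "1-2"},
    {"Patient_Name": "Pt01", "Run_Name": "nd-Pt01-11A", "Group": "1-2"},
    {"Patient_Name": "Pt01", "Run_Name": "ar-Pt01-03A", "Group": "1"},
    {"Patient_Name": "Pt01", "Run_Name": "ar-Pt01-03B", "Group": "2"},
    {"Patient_Name": "UNIV", "Run_Name": "nr-CTRL", "Group": "*"},
]
for key, members in expand_sample_sets(rows).items():
    print(key, members)
# ('Pt01', '1') ['ad-Pt01-03A', 'nd-Pt01-11A', 'ar-Pt01-03A', 'nr-CTRL']
# ('Pt01', '2') ['ad-Pt01-03A', 'nd-Pt01-11A', 'ar-Pt01-03B', 'nr-CTRL']
```

The shared DNA samples appear in both sets because their Group value lists both groups, while each RNA-seq sample appears only in its own group.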

Note

Group identifiers do not have to be numbers (as in the example above). They can also be descriptive (e.g. pre-treatment and post-treatment).

Checking manifest

LENS automatically checks manifest integrity when RAFT is run in either run-ots or run-workflow mode.

Users can manually verify the integrity of their manifest files using:

raft.py check-manifest -m </PATH/TO/MANIFEST>