Manifest
========

Purpose
________________
The RAFT manifest defines the relationships among samples, patients, and
datasets. Multiple samples are required per patient and each sample plays a
crucial role in predicting peptide-MHC complexes (pMHCs). For example, the
normal and tumor DNA samples are required for calling somatic variants while
the tumor RNA sample is essential for determining tumor expression of a variety
of tumor antigens (ERVs, viruses, somatic variants, splice variants, and
fusions).

A RAFT manifest for LENS requires the following samples per patient:

- At least one tumor DNA sample
- At least one normal DNA sample
- At least one tumor RNA sample
- At least one tissue-matched, patient-matched normal RNA sample (Optional)
- At least one tissue-matched, non-patient-matched normal RNA sample (e.g. GTEx)

Users can run a reduced LENS analysis (without any SNV or InDel pMHCs reported) using the following samples per patient:

- At least one tumor RNA sample
- At least one tissue-matched, patient-matched normal RNA sample (Optional)
- At least one tissue-matched, non-patient-matched normal RNA sample (e.g. GTEx)

.. note::
   The tissue-matched, non-patient-matched normal RNA sample is only required
   if one or more patients do not have a tissue-matched, patient-matched normal
   RNA sample.

RAFT manifests may contain one or more patients. Many computer clusters will
allow for multiple patients to be run in parallel to reduce run time.

Sample, patient, and dataset hierarchy
______________________________________

The organizational hierarchy within RAFT follows

.. code-block:: console

  Sample ∈ Patient ∈ Dataset

In other words, samples belong to patients (patients can have multiple samples)
and patients belong to datasets (datasets can have multiple patients). This
hierarchy allows for accurate and efficient combining of samples for
processing.
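
As an illustration only, the hierarchy can be pictured as a nested mapping from
dataset to patient to samples. The Python sketch below is not RAFT code: the
function name ``build_hierarchy``, the tuple layout, and the second patient
(``Pt02``) are assumptions made for this example.

.. code-block:: python

  from collections import defaultdict


  def build_hierarchy(rows):
      """Group (dataset, patient, sample) tuples into dataset -> patient -> samples."""
      hierarchy = defaultdict(lambda: defaultdict(list))
      for dataset, patient, sample in rows:
          hierarchy[dataset][patient].append(sample)
      # Convert the nested defaultdicts into plain dictionaries.
      return {dataset: dict(patients) for dataset, patients in hierarchy.items()}


  # Hypothetical rows: one dataset, two patients.
  rows = [
      ("AML", "Pt01", "ad-Pt01-03A"),
      ("AML", "Pt01", "nd-Pt01-11A"),
      ("AML", "Pt02", "ar-Pt02-03A"),
  ]
  hierarchy = build_hierarchy(rows)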

Contents
_________________

A RAFT manifest must have at least the columns defined in the table below. Columns can
be in any order and other columns containing non-RAFT metadata are also
allowed.

.. list-table:: RAFT Columns
 :widths: 25 25 25
 :header-rows: 1

 * - Column
   - Description
   - Allowed values
 * - ``Dataset``
   - Name for collection of patients
   - Free text
 * - ``Patient_Name``
   - Name for collection of samples
   - Free text (except ``UNIV`` -- see note below)
 * - ``Run_Name``
   - Name for the specific sample
   - Free text (see note below)
 * - ``File_Prefix``
   - Base name (or full path) of input files
   - Free text
 * - ``Sequencing_Method``
   - Sequencing protocol for sample
   - (RNA-seq, WES, WXS, WGS)
 * - ``Normal``
   - Is the sample normal or abnormal (tumor)?
   - (TRUE, FALSE)
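
The column rules in the table can be checked with a short script before a
workflow is launched. The sketch below is not RAFT's own validation. It assumes
a tab-separated manifest, the underscore column names used in the examples on
this page, and a case-insensitive match on the sequencing method. The function
name ``validate_manifest`` and the sample values ``Exome`` and ``yes`` are
hypothetical.

.. code-block:: python

  import csv
  import io

  # Column names as they appear in the example manifests (assumed spelling).
  REQUIRED_COLUMNS = {
      "Dataset", "Patient_Name", "Run_Name",
      "File_Prefix", "Sequencing_Method", "Normal",
  }
  # Compared case-insensitively (an assumption of this sketch).
  ALLOWED_METHODS = {"rna-seq", "wes", "wxs", "wgs"}
  ALLOWED_NORMAL = {"TRUE", "FALSE"}


  def validate_manifest(text):
      """Return a list of problems found in a tab-separated manifest."""
      reader = csv.DictReader(io.StringIO(text), delimiter="\t")
      missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
      if missing:
          return ["missing columns: " + ", ".join(sorted(missing))]
      problems = []
      # The header is line 1, so the first sample is on line 2.
      for line_no, row in enumerate(reader, start=2):
          method = row["Sequencing_Method"] or ""
          normal = row["Normal"] or ""
          if method.lower() not in ALLOWED_METHODS:
              problems.append(f"line {line_no}: bad Sequencing_Method {method!r}")
          if normal not in ALLOWED_NORMAL:
              problems.append(f"line {line_no}: bad Normal value {normal!r}")
      return problems


  manifest = (
      "Patient_Name\tRun_Name\tDataset\tFile_Prefix\tSequencing_Method\tNormal\n"
      "Pt01\tad-Pt01-03A\tAML\t9f7f7\tWES\tFALSE\n"
      "Pt01\tnd-Pt01-11A\tAML\t8e74a\tExome\tyes\n"
  )
  problems = validate_manifest(manifest)

Extra columns are ignored by this check, which matches the statement that
non-RAFT metadata columns are allowed.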

.. note::
  A sample's ``Run_Name`` is instrumental in guiding samples through some RAFT
  workflows. It should consist of a two-letter prefix that describes the type
  of sample, a delimiter (``-`` or ``_``), and an arbitrary unique identifier.
  The first letter of the prefix is either ``a`` (for abnormal) or ``n`` (for
  normal). The second letter is either ``r`` (for RNA) or ``d`` (for DNA). For
  example, a sample with an ``ar-`` (or ``ar_``) prefix is an abnormal (tumor)
  RNA sample while a sample with an ``nd-`` prefix is a normal DNA sample.
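
The ``Run_Name`` convention can be expressed as a small parser. This is a
minimal sketch of the naming rule described in the note above, not the logic
RAFT itself uses. The names ``RUN_NAME_PATTERN`` and ``parse_run_name`` are
hypothetical.

.. code-block:: python

  import re

  # Two-letter prefix, a '-' or '_' delimiter, then a non-empty identifier.
  RUN_NAME_PATTERN = re.compile(r"^([an])([rd])[-_](.+)$")


  def parse_run_name(run_name):
      """Split a Run_Name into its sample type and unique identifier."""
      match = RUN_NAME_PATTERN.match(run_name)
      if match is None:
          raise ValueError(f"Run_Name {run_name!r} lacks a valid prefix")
      status, molecule, identifier = match.groups()
      return {
          "normal": status == "n",  # 'a' = abnormal (tumor), 'n' = normal
          "molecule": "RNA" if molecule == "r" else "DNA",
          "identifier": identifier,
      }


  info = parse_run_name("ar-Pt01-03A")
  # info["normal"] is False and info["molecule"] is "RNA"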

.. note::
  The patient name ``UNIV`` (short for UNIVersal) is reserved and intended for
  samples that are expected to be shared among patients. For example, if not all of your
  patients have a patient-matched, tissue-matched normal RNA sample, then an
  external tissue-matched sample (e.g. GTEx) can be utilized with the patient name ``UNIV``.
  Patients that have the appropriate patient-matched, tissue-matched normal RNA
  sample will use their own sample while other patients will use the ``UNIV``
  sample instead.
  

Each line in the manifest after the header corresponds to a sample and
provides the necessary data for running a RAFT workflow. The samples described
within the manifest may, in some cases, be effectively independent (as in, the
workflow does not attempt to pair samples from a patient), but in other cases,
users must be careful that samples are properly labeled. For example, somatic
variant calling generally requires a normal DNA sample and a tumor DNA
sample. For RAFT to pair these samples correctly, they must have the correct
sample prefix (``ad-`` for the tumor DNA sample and ``nd-`` for the normal DNA
sample) and share the same patient (``Patient_Name`` field) and dataset
(``Dataset`` field). Consider the following example:

.. code-block:: console

  Patient_Name Run_Name Dataset File_Prefix Sequencing_Method Normal
  Pt01	ad-Pt01-03A	AML	9f7f7	WES	FALSE
  Pt01	nd-Pt01-11A	AML	8e74a	WES	TRUE
  Pt01	ar-Pt01-03A	AML	cdb288	RNA-Seq	FALSE
  UNIV	nr-CTRL		AML	CD34-U	RNA-Seq	TRUE

.. note::
  Both the tumor DNA sample (ad-Pt01-03A) and the normal DNA
  (nd-Pt01-11A) sample belong to the same patient (Pt01) and the same dataset
  (AML).
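
The per-patient sample requirements for a full LENS run can be checked along the
same lines. The following sketch is an illustration, not RAFT's pairing logic.
It assumes that sample types are identified by the two-letter ``Run_Name``
prefix and that a ``UNIV`` normal RNA sample applies to patients in the same
dataset. The function name ``missing_sample_types`` is hypothetical.

.. code-block:: python

  from collections import defaultdict


  def missing_sample_types(samples):
      """Map (dataset, patient) to the sample prefixes that patient still lacks.

      samples: list of (dataset, patient_name, run_name) tuples.
      """
      prefixes = defaultdict(set)
      for dataset, patient, run_name in samples:
          prefixes[(dataset, patient)].add(run_name[:2])

      missing = {}
      for (dataset, patient), found in prefixes.items():
          if patient == "UNIV":
              continue  # UNIV is a shared pool, not a patient
          univ = prefixes.get((dataset, "UNIV"), set())
          # Tumor DNA, normal DNA, and tumor RNA must belong to the patient.
          lacking = [p for p in ("ad", "nd", "ar") if p not in found]
          # Normal RNA may come from the patient or from the UNIV sample.
          if "nr" not in found and "nr" not in univ:
              lacking.append("nr")
          if lacking:
              missing[(dataset, patient)] = lacking
      return missing


  samples = [
      ("AML", "Pt01", "ad-Pt01-03A"),
      ("AML", "Pt01", "nd-Pt01-11A"),
      ("AML", "Pt01", "ar-Pt01-03A"),
      ("AML", "UNIV", "nr-CTRL"),
  ]
  complete = missing_sample_types(samples)       # no patient lacks a sample
  incomplete = missing_sample_types(samples[:1] + samples[2:])  # nd sample dropped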

Complex sample sets
___________________

Users may encounter situations that require more than one sample
per sample type per patient. For example, users may have a single set of DNA
samples (normal DNA and tumor DNA), but may have multiple RNA-seq samples (e.g.
multiple timepoints). Users can use RAFT's ``subjoin`` functionality to support
these cases. Specifically, the manifest requires an additional column
(``Group``) with a value for each sample. By default, LENS assumes that all
samples belonging to a single ``Patient_Name`` are used together (as in the
example above). Group identifiers specify which samples are used together
instead. For example,


.. code-block:: console

  Patient_Name Run_Name Dataset File_Prefix Sequencing_Method Normal Group
  Pt01	ad-Pt01-03A	AML	9f7f7	WES	FALSE 1-2
  Pt01	nd-Pt01-11A	AML	8e74a	WES	TRUE 1-2
  Pt01	ar-Pt01-03A	AML	cdb288	RNA-Seq	FALSE 1
  Pt01	ar-Pt01-03B	AML	cdb289	RNA-Seq	FALSE 2
  UNIV	nr-CTRL		AML	CD34-U	RNA-Seq	TRUE *

.. note::
  ``UNIV`` samples can be assigned a ``*`` Group value which will allow them
  to be paired to all possible combinations (where a patient-specific sample of
  that type is not already present). ``UNIV`` samples can also be limited to
  specific groups by using ``Group`` identifiers (e.g. ``1-2``).

This scenario depicts a patient (``Pt01``) that has a normal DNA sample, a
tumor DNA sample, and **two** tumor RNA-seq samples. For this example, assume
that the first RNA-seq sample (``ar-Pt01-03A``) is a pre-treatment sample while
the second RNA-seq sample (``ar-Pt01-03B``) is a post-treatment sample, and
that the investigator is interested in understanding how treatment affected
the tumor antigen landscape. In this case, LENS will be run on two distinct
sample sets (tumor RNA-seq samples in bold for emphasis):

The pre-treatment sample set:

 - ad-Pt01-03A
 - nd-Pt01-11A
 - **ar-Pt01-03A**
 - nr-CTRL

and the post-treatment sample set:

 - ad-Pt01-03A
 - nd-Pt01-11A
 - **ar-Pt01-03B**
 - nr-CTRL

Both of these sample sets will have their own individual LENS report after the
completion of the LENS workflow.
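
The expansion of ``Group`` values into sample sets can be sketched in Python.
This is not RAFT's ``subjoin`` implementation, whose parsing rules are not
described here. The sketch assumes that a value such as ``1-2`` means
membership in groups ``1`` and ``2`` (so it only suits identifiers without
hyphens), that ``*`` matches every group, and that a ``UNIV`` sample is added
only when the set has no sample with the same two-letter prefix. The function
name ``build_sample_sets`` is hypothetical.

.. code-block:: python

  def build_sample_sets(samples):
      """Map (patient, group_id) to the Run_Names used together.

      samples: list of (patient_name, run_name, group) tuples.
      """
      patient_rows = [s for s in samples if s[0] != "UNIV"]
      univ_rows = [s for s in samples if s[0] == "UNIV"]

      sets = {}
      # Assumption: '-' separates the group identifiers a sample belongs to.
      for patient, run_name, group in patient_rows:
          for group_id in group.split("-"):
              sets.setdefault((patient, group_id), []).append(run_name)

      # Add UNIV samples where the patient has no sample of that type.
      for (patient, group_id), members in sets.items():
          present = {name[:2] for name in members}
          for _, run_name, group in univ_rows:
              applies = group == "*" or group_id in group.split("-")
              if applies and run_name[:2] not in present:
                  members.append(run_name)
      return sets


  samples = [
      ("Pt01", "ad-Pt01-03A", "1-2"),
      ("Pt01", "nd-Pt01-11A", "1-2"),
      ("Pt01", "ar-Pt01-03A", "1"),
      ("Pt01", "ar-Pt01-03B", "2"),
      ("UNIV", "nr-CTRL", "*"),
  ]
  sample_sets = build_sample_sets(samples)

With the example manifest, ``sample_sets`` holds one entry for group ``1`` and
one for group ``2``, matching the pre-treatment and post-treatment sets listed
above.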

.. note::
   ``Group`` identifiers do not have to be numbers (like the example above).
   The identifiers can also be descriptive identifiers (e.g. ``pre-treatment``
   and ``post-treatment``).

Checking manifest
_________________

LENS automatically checks manifest integrity when ``raft`` is run in either
``run-ots`` or ``run-workflow`` mode.

Users can manually verify the integrity of their manifest files using:

.. code-block:: console

  raft.py check-manifest -m </PATH/TO/MANIFEST>