Quick Usage
python MuMRescueLite.py <input file> <output file> <window>
What Does This Script Do?
Sequence tags that map to multiple genomic loci (multi-mapping tags, or MuMs) are routinely omitted from further analysis, leading to experimental bias and reduced coverage. MuMRescueLite probabilistically reincorporates multi-mapping tags into mapped short read data with acceptable computational requirements. See the reference articles for more details.
More Detailed Usage
This program requires the following arguments:
- input file: the data file to be processed. Its format is described in the next section.
- output file: the rescue result is written to this file. Its format is described in the Output File Format section.
- window: the size, in bases, of the region around each MuM location that is searched for single-mapped tags; MuMRescueLite searches window/2 bases upstream and window/2 bases downstream of a given MuM location.
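As a rough illustration of how the window argument translates into a search region, the sketch below computes the interval for one MuM location. The function name search_interval is hypothetical, and two details are assumptions rather than documented behavior: that the upstream half is measured from the start coordinate and the downstream half from the end coordinate, and that the lower bound is clamped at position 1.

```python
def search_interval(start, end, window):
    """Return (low, high) bounds of the region searched around a MuM location.

    Assumption: window/2 bases are taken upstream of `start` and
    window/2 bases downstream of `end`; the script itself may anchor
    the window differently.
    """
    half = window // 2                # integer half of the window
    low = max(1, start - half)        # assumed clamp at the first base
    high = end + half
    return low, high
```

Under these assumptions, a window of 200 applied to a tag mapped at 66975161 to 66975185 gives a search region of 66975061 to 66975285.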
Input File Format
MuMRescueLite.py accepts a tab-delimited ASCII text file with one header line as input. The columns of this file must be, in order:
- identifier of the tag: a unique ID or a unique sequence.
- total number of mapped locations of this tag.
- mapped chromosome or name of assembly.
- mapped strand.
- start of genomic mapped position; must be less than or equal to the end.
- end of genomic mapped position.
- number of times (a count) the sequence was observed in this experimental condition.
The start of an example input file:
#ID locations chromosome strand start end count
s3.25mer.txt-1 1 chr12 + 105579297 105579321 1
s3.25mer.txt-4 1 chr8 + 95642182 95642206 1
s3.25mer.txt-7 6 chr13 + 66975161 66975185 1
s3.25mer.txt-7 6 chr13 - 72592620 72592644 1
s3.25mer.txt-7 6 chr14 - 46332831 46332855 1
s3.25mer.txt-7 6 chr19 - 32540873 32540897 1
s3.25mer.txt-7 6 chr1 - 113777719 113777743 1
s3.25mer.txt-7 6 chr2 + 70297183 70297207 1
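A file in this layout could be read along the following lines. This is a minimal sketch, not the parser used by MuMRescueLite.py: the names Mapping and read_mappings are hypothetical, and treating the count column as an integer is an assumption based on the example rows.

```python
import csv
from collections import namedtuple

# Hypothetical record type; field order follows the column list above.
Mapping = namedtuple(
    "Mapping", "tag_id locations chromosome strand start end count"
)

def read_mappings(lines):
    """Yield one Mapping per data row of a tab-delimited input file.

    `lines` is any iterable of text lines, such as an open file.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)                      # skip the single header line
    for row in reader:
        if not row:                         # tolerate blank lines
            continue
        tag_id, locations, chrom, strand, start, end, count = row[:7]
        record = Mapping(tag_id, int(locations), chrom, strand,
                         int(start), int(end), int(count))  # int count: assumption
        if record.start > record.end:
            raise ValueError("start exceeds end for tag %s" % tag_id)
        yield record
```

Passing an open file handle, for example read_mappings(open(path)), yields one record per mapped location, so a tag with six locations produces six records sharing the same identifier.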
Output File Format
MuMRescueLite.py writes its results as a tab-delimited ASCII text file with one header line, appending a "weight" column to each input line. The columns are, in order:
- identifier of the tag: a unique ID or a unique sequence.
- total number of mapped locations of this tag.
- mapped chromosome or name of assembly.
- mapped strand.
- start of genomic mapped position; must be less than or equal to the end.
- end of genomic mapped position.
- number of times (a count) the sequence was observed in this experimental condition.
- weight: the probability assigned to this tag at this mapped position; 1.0 for single-mapped tags, and between 0.0 and 1.0 for multi-mapped tags.
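The rules stated for the weight column can be checked when reading the output. The sketch below parses one data line and verifies that the weight lies between 0.0 and 1.0 and equals 1.0 for single-mapped tags. The function name parse_output_line is hypothetical, and the small floating-point tolerance is an assumption about how weights are written to the file.

```python
def parse_output_line(line):
    """Parse one data line of the output file and sanity-check its weight."""
    fields = line.rstrip("\n").split("\t")
    tag_id, locations, chrom, strand, start, end, count, weight = fields[:8]
    locations = int(locations)
    weight = float(weight)
    # Weights are described as probabilities, so they must lie in [0, 1].
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight out of range for tag %s" % tag_id)
    # Single-mapped tags are documented to carry a weight of 1.0.
    if locations == 1 and abs(weight - 1.0) > 1e-9:  # tolerance: assumption
        raise ValueError("single-mapped tag %s has weight != 1.0" % tag_id)
    return (tag_id, locations, chrom, strand,
            int(start), int(end), int(count), weight)
```

Skip the header line before calling this function on each remaining line of the output file.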
System Requirements
- Python 2.4 or later is required.
- Approximately 512 MB of free memory is needed to analyze ten million reads against the mouse or human genome; this requirement will vary greatly with genome and experiment size. Note: we do not recommend combining different biological samples for rescue, as this will lead to tags from different biological contexts rescuing one another.
References
- Faulkner, G.J., et al. (2008) A rescue strategy for multi-mapping short sequence tags refines surveys of transcriptional activity by CAGE, Genomics.
- Hashimoto, T., et al. (2009) Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite, Bioinformatics.
License
MIT license; see LICENCE.txt
Contact
g.faulkner@expressiongenomics.org
README Authors
Takehiro Hashimoto, Michiel J. L. de Hoon, Geoffrey J. Faulkner