Fundamentals of parabam¶
Here are some fundamental concepts relating to parabam that will make interacting with parabam easier.
The subset and stat operations¶
Currently parabam allows you to conduct two separate “operations”. These are subset and stat.
The subset operation allows the user to create BAMs containing a specific subset of reads from within a given BAM file. For example, a subset BAM containing all of the unmapped reads in the file, or all the reads which contain a certain sequence.
The stat operation allows the user to collect data concerning the reads in a given BAM file. These data are collected and output in a .csv file at the end of operation. Data can be collected as an int, float or NumPy array. For instance, the user may wish to collect the amount of reads in a file that contain a specific sequence, or statistics like the average insert size of a BAM file.
Ways of interacting with parabam¶
There are several ways to interact with parabam. The most direct way is to write a short instruction script in Python and then run this on the command line. For details on how to write instructions files for the various parabam operations consult the subset and stat sections.
Users may also incorporate parabam into their own programs. This is as simple as importing the relevant Interface class for your intended operation (subset or stat) and then providing the relevant arguments.
Examples for both of these methods may be found on the subset examples and stat examples pages.
Engines and constants¶
parabam works by applying a rule (a user defined function) to every read in the BAM file. Engines are simple functions that only take three arguments. These are:
- read - A sequencing read. This is an object of type pysam.AlignedSegment
- constants - A set of user defined constants. This is a simple python dict.
- master - An object which represents the sample BAM file from which the read originated. This is an object of type pysam.AlignmentFile
For examples which show in detail how to create rules for both subset and stat operations, see the subset and stat sections.
Output Files¶
Certain invocations of parabam will create many output files, particularly runs using multiple statistics.
Pysam classes: read and master¶
parabam uses the package pysam to process BAM files. As such the objects we interact with are pysam classes.
The most prevalent examples of pysam objects are the read and master arguments when defining a rule. The read is of type AlignedSegment and master is of type AlignmentFile. Both of these are pysam classes. Their documentation is reproduced here for reference. Pysam is entirely the work of Andreas Heger and contributors.
Please Note: Users of parabam don’t need to know a great deal about pysam classes (most of this is taken care of behind the scenes), but when writing rules it will help to know the different variables and functions that you may reference with the read and master arguments.
read (AlignedSegment)¶
Any of the following functions or variables may be referenced when dealing with the read
- class pysam.AlignedSegment¶
AlignedSegment() Class representing an aligned segment.
This class stores a handle to the samtools C-structure representing an aligned read. Member read access is forwarded to the C-structure and converted into python objects. This implementation should be fast, as only the data needed is converted.
For write access, the C-structure is updated in-place. This is not the most efficient way to build BAM entries, as the variable length data is concatenated and thus needs to be resized if a field is updated. Furthermore, the BAM entry might be in an inconsistent state.
One issue to look out for is that the sequence should always be set before the quality scores. Setting the sequence will also erase any quality scores that were set previously.
- __hash__¶
x.__hash__() <==> hash(x)
- __str__¶
return string representation of alignment.
The representation is an approximate sam format.
An aligned read might not be associated with a AlignmentFile. As a result tid is shown instead of the reference name.
Similarly, the tags field is returned in its parsed state.
- bin¶
properties bin
- cigarstring¶
the cigar alignment as a string.
The cigar string is a string of alternating integers and characters denoting the length and the type of an operation.
Note
The order length,operation is specified in the SAM format. It is different from the order of the cigar property.
Returns None if not present.
To unset the cigarstring, assign None or the empty string.
- cigartuples¶
the cigar alignment. The alignment is returned as a list of tuples of (operation, length).
If the alignment is not present, None is returned.
The operations are:
M BAM_CMATCH 0 I BAM_CINS 1 D BAM_CDEL 2 N BAM_CREF_SKIP 3 S BAM_CSOFT_CLIP 4 H BAM_CHARD_CLIP 5 P BAM_CPAD 6 = BAM_CEQUAL 7 X BAM_CDIFF 8 Note
The output is a list of (operation, length) tuples, such as [(0, 30)]. This is different from the SAM specification and the cigarstring property, which uses a (length, operation) order, for example: 30M.
To unset the cigar property, assign an empty list or None.
- compare(self, AlignedSegment other)¶
return -1,0,1, if contents in this are binary <,=,> to other
- flag¶
properties flag
- get_aligned_pairs(self)¶
a list of aligned read and reference positions.
- get_blocks(self)¶
a list of start and end positions of aligned gapless blocks.
The start and end positions are in genomic coordinates.
Blocks are not normalized, i.e. two blocks might be directly adjacent. This happens if the two blocks are separated by an insertion in the read.
- get_overlap(self, uint32_t start, uint32_t end)¶
return number of aligned bases of read overlapping the interval start and end on the reference sequence.
Return None if cigar alignment is not available.
- get_reference_positions(self, full_length=False)¶
a list of reference positions that this read aligns to.
By default, this method only returns positions in the reference that are within the alignment. If full_length is set, None values will be included for any soft-clipped or unaligned positions within the read. The returned list will thus be of the same length as the read.
- infer_query_length(self, always=True)¶
inferred read length from CIGAR string.
If always is set to True, the read length will be always inferred. If set to False, the length of the read sequence will be returned if it is available.
Returns None if CIGAR string is not present.
- is_duplicate¶
true if optical or PCR duplicate
- is_paired¶
true if read is paired in sequencing
- is_proper_pair¶
true if read is mapped in a proper pair
- is_qcfail¶
true if QC failure
- is_read1¶
true if this is read1
- is_read2¶
true if this is read2
- is_reverse¶
true if read is mapped to reverse strand
- is_secondary¶
true if not primary alignment
- is_unmapped¶
true if read itself is unmapped
- mapping_quality¶
mapping quality
- mate_is_reverse¶
true is read is mapped to reverse strand
- mate_is_unmapped¶
true if the mate is unmapped
- next_reference_id¶
the reference id of the mate/next read.
- next_reference_start¶
the position of the mate/next read.
- opt(self, tag)¶
retrieves optional data given a two-letter tag
- overlap(self)¶
- query_aligment_length¶
length of the aligned query sequence.
This is equal to qend - qstart
- query_alignment_end¶
end index of the aligned query portion of the sequence (0-based, exclusive)
- query_alignment_length¶
length of the query template. This includes soft-clipped bases and is equal to len(seq).
This property is read-only.
Returns 0 if not available.
- query_alignment_qualities¶
aligned query sequence quality values (None if not present). These are the quality values that correspond to query, that is, they exclude qualities of soft clipped bases. This is equal to qual[qstart:qend].
Quality scores are returned as a python array of unsigned chars. Note that this is not the ASCII-encoded value typically seen in FASTQ or SAM formatted files. Thus, no offset of 33 needs to be subtracted.
This property is read-only.
- query_alignment_sequence¶
aligned portion of the read.
This is a substring of seq that excludes flanking bases that were soft clipped (None if not present). It is equal to seq[qstart:qend].
SAM/BAM files may include extra flanking bases that are not part of the alignment. These bases may be the result of the Smith-Waterman or other algorithms, which may not require alignments that begin at the first residue or end at the last. In addition, extra sequencing adapters, multiplex identifiers, and low-quality bases that were not considered for alignment may have been retained.
- query_alignment_start¶
start index of the aligned query portion of the sequence (0-based, inclusive).
This the index of the first base in seq that is not soft-clipped.
- query_length¶
the length of the query/read.
This value corresponds to the length of the sequence supplied in the BAM/SAM file. The length of a query is 0 if there is no sequence in the BAM/SAM file. In those cases, the read length can be inferred from the CIGAR alignment, see pysam.AlignmentFile.infer_query_length.().
This property can be set by providing a sequence.
- query_name¶
the query template name (None if not present)
- query_qualities¶
read sequence base qualities, including soft clipped bases (None if not present).
Quality scores are returned as a python array of unsigned chars. Note that this is not the ASCII-encoded value typically seen in FASTQ or SAM formatted files. Thus, no offset of 33 needs to be subtracted.
Note that to set quality scores the sequence has to be set beforehand as this will determine the expected length of the quality score array.
This method raises a ValueError if the length of the quality scores and the sequence are not the same.
- query_sequence¶
read sequence bases, including soft clipped bases (None if not present).
Note that assigning to seq will invalidate any quality scores. Thus, to in-place edit the sequence and quality scores, copies of the quality scores need to be taken. Consider trimming for example:
q = read.qual read.seq = read.seq[5:10] read.qual = q[5:10]
The sequence is returned as it is stored in the BAM file. Some mappers might have stored a reverse complement of the original read sequence.
- reference_end¶
aligned reference position of the read on the reference genome.
aend points to one past the last aligned residue. Returns None if not available.
- reference_id¶
reference ID
Note
This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.AlignmentFile.getrname()
- reference_length¶
aligned length of the read on the reference genome.
This is equal to aend - pos. Returns None if not available.
- reference_start¶
0-based leftmost coordinate
- setTag(self, tag, value, value_type=None, replace=True)¶
Set optional field of alignment tag to value. value_type may be specified, but if not the type will be inferred based on the Python type of value
An existing value of the same tag will be overwritten unless replace is set to False.
the tags in the AUX field.
This property permits convenience access to the tags. Changes it the returned list will not update the tags automatically. Instead, the following is required for adding a new tag:
read.tags = read.tags + [("RG",0)]
This method will happily write the same tag multiple times.
- template_length¶
the observed query template length
master (AlignmentFile)¶
Any of the following functions or variables may be referenced when dealing with the master
- class pysam.AlignmentFile¶
- *(filename, mode=None, template = None,
- referencenames=None, referencelengths = None, text=NULL, header=None, add_sq_text=False, check_header=True, check_sq=True)*
A SAM/BAM formatted file. The file is automatically opened.
mode should be r for reading or w for writing. The default is text mode (SAM). For binary (BAM) I/O you should append b for compressed or u for uncompressed BAM output. Use h to output header information in text (TAM) mode.
If b is present, it must immediately follow r or w. Valid modes are r, w, wh, rb, wb, wbu and wb0. For instance, to open a BAM formatted file for reading, type:
f = pysam.AlignmentFile('ex1.bam','rb')
If mode is not specified, we will try to auto-detect in the order ‘rb’, ‘r’, thus both the following should work:
f1 = pysam.AlignmentFile('ex1.bam') f2 = pysam.AlignmentFile('ex1.sam')
If an index for a BAM file exists (.bai), it will be opened automatically. Without an index random access to reads via fetch() and pileup() is disabled.
For writing, the header of a SAM file/BAM file can be constituted from several sources (see also the samtools format specification):
- If template is given, the header is copied from a another AlignmentFile (template must be of type AlignmentFile).
- If header is given, the header is built from a multi-level dictionary. The first level are the four types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.
- If text is given, new header text is copied from raw text.
- The names (referencenames) and lengths (referencelengths) are supplied directly as lists. By default, ‘SQ’ and ‘LN’ tags will be added to the header text. This option can be changed by unsetting the flag add_sq_text.
By default, if a file is opened in mode ‘r’, it is checked for a valid header (check_header = True) and a definition of chromosome names (check_sq = True).
- __enter__(self)¶
- __exit__(self, exc_type, exc_value, traceback)¶
- __iter__¶
x.__iter__() <==> iter(x)
- __next__()¶
python version of next().
- close(self)¶
closes the pysam.AlignmentFile.
- count(self, reference=None, start=None, end=None, region=None, until_eof=False)¶
(reference = None, start = None, end = None, region = None, callback = None, until_eof = False)
count reads region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.
Note that a SAM file does not allow random access. If region or reference are given, an exception is raised.
- fetch(self, reference=None, start=None, end=None, region=None, tid=None, callback=None, until_eof=False, multiple_iterators=False)¶
fetch aligned reads in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.
Without reference or region all mapped reads will be fetched. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file.
If until_eof is given, all reads from the current file position will be returned in order as they are within the file. Using this option will also fetch unmapped reads.
Set multiple_iterators to true if you will be using multiple iterators on the same file at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file creates some overhead, so beware.
If only reference is set, all reads aligned to reference will be fetched.
Note that a SAM file does not allow random access. If region or reference are given, an exception is raised.
- filename¶
number of filename associated with this object.
- getrname(self, tid)¶
convert numerical tid into reference name.
- gettid(self, reference)¶
convert reference name into numerical tid
returns -1 if reference is not known.
- head(self, n, multiple_iterators=True)¶
return iterator over the first n alignments.
This is useful for inspecting the bam-file.
multiple_iterators is set to True by default in order to avoid changing the current file position.
- header¶
header information within the sam file. The records and fields are returned as a two-level dictionary.
The first level contains the record (HD, SQ, etc) and the second level contains the fields (VN, LN, etc).
The parser is validating and will raise an AssertionError if if encounters any record or field tags that are not part of the SAM specification. Use the pysam.AlignmentFile.text attribute to get the unparsed header.
The parsing follows the SAM format specification with the exception of the CL field. This option will consume the rest of a header line irrespective of any additional fields. This behaviour has been added to accommodate command line options that contain characters that are not valid field separators.
- lengths¶
tuple of the lengths of the reference sequences. The lengths are in the same order as pysam.AlignmentFile.references
- mapped¶
total number of mapped alignments in file.
- mate(self, AlignedSegment read)¶
return the mate of AlignedSegment read.
Throws a ValueError if read is unpaired or the mate is unmapped.
Note
Calling this method will change the file position. This might interfere with any iterators that have not re-opened the file.
Note
This method is too slow for high-throughput processing. If a read needs to be processed with its mate, work from a read name sorted file or, better, cache reads.
- next¶
x.next() -> the next value, or raise StopIteration
- nocoordinate¶
total number of reads without coordinates
- nreferences¶
number of reference sequences in the file.
- pileup(self, reference=None, start=None, end=None, region=None, **kwargs)¶
perform a pileup within a region. The region is specified by reference, start and end (using 0-based indexing). Alternatively, a samtools region string can be supplied.
Without reference or region all reads will be used for the pileup. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file.
The method returns an iterator of type pysam.IteratorColumn unless a callback is provided. If a *callback is given, the callback will be executed for each column within the region.
Note that SAM formatted files do not allow random access. In these files, if a region or reference are given an exception is raised.
Optional kwargs to the iterator:
- stepper
The stepper controlls how the iterator advances. Possible options for the stepper are
- all
- use all reads for pileup.
- pass
- skip reads in which any of the following flags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
- samtools
- same filter and read processing as in csamtools pileup. This requires a fastafile to be given.
- fastafile
- A FastaFile object. This is required for some of the steppers.
- mask
- Skip all reads with bits set in mask if mask=True.
- max_depth
- Maximum read depth permitted. The default limit is 8000.
truncate
By default, the samtools pileup engine outputs all reads overlapping a region (see note below). If truncate is True and a region is given, only output columns in the exact region specificied.Note
all reads which overlap the region are returned. The first base returned will be the first base of the first read not necessarily the first base of the region used in the query.
- references¶
tuple with the names of reference sequences.
- reset(self)¶
reset file position to beginning of file just after the header.
- seek(self, uint64_t offset, int where=0)¶
move file pointer to position offset, see pysam.AlignmentFile.tell().
- tell(self)¶
return current file position.
- text¶
full contents of the sam file header as a string
See pysam.AlignmentFile.header to get a parsed representation of the header.
- unmapped¶
total number of unmapped reads in file.
- write(self, AlignedSegment read) → int¶
write a single pysam.AlignedSegment to disk.
returns the number of bytes written.