Map module

This module performs iterative mapping of reads in FASTQ files to a reference genome.

The main function is iterative_mapping(), which requires an input FASTQ file, an output SAM file, and a suitable mapper. The available mappers are currently Bowtie2Mapper and BwaMapper. By subclassing Mapper, you can write your own mapper implementation that takes full advantage of the iterative mapping capabilities of FAN-C. See the code of Bowtie2Mapper for an example.

Example usage:

import fanc
mapper = fanc.BwaMapper("bwa-index/hg19_chr18_19.fa", min_quality=3)
fanc.iterative_mapping("SRR4271982_chr18_19_1.fastq.gzip", "SRR4271982_chr18_19_1.bam",
                       mapper, threads=4, restriction_enzyme="HindIII")
class fanc.map.Bowtie2Mapper(bowtie2_index, min_quality=30, additional_arguments=(), threads=1, _bowtie2_path='bowtie2', **kwargs)

Bases: fanc.map.Mapper

Bowtie2 Mapper for aligning reads against a reference genome.

Implements Mapper by calling the command line “bowtie2” program.

bowtie2_index

Path to the bowtie2 index of the reference genome of choice.

min_quality

Minimum MAPQ of an alignment so that it won’t be resubmitted in iterative mapping.

additional_arguments

Arguments passed to the “bowtie2” command in addition to -x, -U, --no-unal, --threads, and -S.
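To make the role of additional_arguments concrete, here is a minimal sketch of how such a command line could be assembled. The helper build_bowtie2_command, the argument order, the clash check, and the file paths are assumptions for illustration and are not taken from the FAN-C source; only the bowtie2 flags themselves come from the description above.

```python
def build_bowtie2_command(index, fastq, sam, threads=1,
                          additional_arguments=(), bowtie2_path="bowtie2"):
    """Assemble a bowtie2 argument list (hypothetical helper, not FAN-C code)."""
    # Flags that the mapper sets itself, according to the description above.
    managed = {"-x", "-U", "--no-unal", "--threads", "-S"}
    clashes = managed.intersection(additional_arguments)
    if clashes:
        raise ValueError("already set by the mapper: " + ", ".join(sorted(clashes)))
    command = [bowtie2_path, "-x", index, "-U", fastq, "--no-unal",
               "--threads", str(threads), "-S", sam]
    # User-supplied options are appended unchanged.
    command.extend(additional_arguments)
    return command


# Placeholder paths; "--very-sensitive" is a standard bowtie2 preset.
cmd = build_bowtie2_command("bowtie2-index/genome", "reads.fastq", "out.sam",
                            threads=4, additional_arguments=("--very-sensitive",))
print(" ".join(cmd))
```

Rejecting user arguments that repeat a managed flag is a design choice of this sketch; it keeps the input, output, and thread settings unambiguous.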

threads

Number of threads for this mapping process.

close()

Final operations after mapping completes.

map(input_file, output_folder=None)

Map reads in the given FASTQ file using _map() implementation.

Will internally map the FASTQ reads to a SAM file in a temporary folder (use output_folder to choose a specific folder), split the SAM output into (i) valid alignments according to _resubmit() and (ii) invalid alignments that get caught by the resubmission filter. A FASTQ file will be constructed from the invalid alignments, extending the reads by a given step size, which can then be used to iteratively repeat the mapping process until a valid alignment is found or the full length of the read has been restored.
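The split step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the FAN-C implementation: split_alignments and the should_resubmit predicate are hypothetical names, the read extension step is omitted, and the example SAM lines are made up.

```python
# Translation table for reverse-complementing reads aligned to the reverse strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def split_alignments(sam_lines, should_resubmit):
    """Split SAM lines into valid alignments and FASTQ records to resubmit."""
    valid, fastq_records = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip empty lines and SAM header lines
        fields = line.split("\t")
        if not should_resubmit(fields):
            valid.append(line)
            continue
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & 0x10:
            # SAM stores reverse-strand reads reverse-complemented; restore the read.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        fastq_records.append("@{}\n{}\n+\n{}\n".format(name, seq, qual))
    return valid, fastq_records


def low_quality_or_unmapped(fields, min_quality=30):
    # FLAG bit 0x4 marks an unmapped read; column 5 holds the MAPQ.
    return bool(int(fields[1]) & 0x4) or int(fields[4]) < min_quality


example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr18\t100\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTACGT\tIIIIHHHH",
    "r3\t16\tchr19\t200\t1\t8M\t*\t0\t0\tAAACCCGG\tABCDEFGH",
]
valid, to_resubmit = split_alignments(example, low_quality_or_unmapped)
```

Here r1 is kept as a valid alignment, while r2 (unmapped) and r3 (low MAPQ, reverse strand) are turned back into FASTQ records.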

Parameters:
  • input_file – Path to FASTQ file
  • output_folder – (optional) path to temporary folder for SAM output
Returns:

tuple, path to valid SAM alignments, path to resubmission FASTQ

resubmit(sam_fields)

Determine if an alignment should be resubmitted.

Filters unmappable reads by default. Additional criteria can be implemented using the _resubmit() method.

Parameters: sam_fields – The individual fields in a SAM line (split by tab)
Returns: True if read should be extended and aligned to the reference again, False if the read passes the validity criteria.
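The relationship between resubmit() and the _resubmit() hook can be illustrated with a stand-in class. SketchMapper and MinQualityMapper below are hypothetical and only mimic the behaviour described above; treating the SAM FLAG bit 0x4 as "unmappable" is an assumption of this sketch.

```python
class SketchMapper:
    """Stand-in base class (not fanc.map.Mapper) showing the hook pattern."""

    def resubmit(self, sam_fields):
        # Assumption: a read counts as unmappable when FLAG bit 0x4 is set.
        if int(sam_fields[1]) & 0x4:
            return True
        # Defer any additional criteria to the overridable hook.
        return self._resubmit(sam_fields)

    def _resubmit(self, sam_fields):
        # Default hook: no extra criteria, so mapped reads are accepted.
        return False


class MinQualityMapper(SketchMapper):
    """Resubmits mapped reads whose MAPQ is below min_quality."""

    def __init__(self, min_quality=30):
        self.min_quality = min_quality

    def _resubmit(self, sam_fields):
        return int(sam_fields[4]) < self.min_quality


mapper = MinQualityMapper(min_quality=30)
unmapped = "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII".split("\t")
confident = "r2\t0\tchr18\t100\t42\t4M\t*\t0\t0\tACGT\tIIII".split("\t")
ambiguous = "r3\t0\tchr18\t100\t3\t4M\t*\t0\t0\tACGT\tIIII".split("\t")
print(mapper.resubmit(unmapped), mapper.resubmit(confident), mapper.resubmit(ambiguous))
```

Keeping the unmapped check in resubmit() and the tunable criteria in _resubmit() means a subclass only has to express what is specific to it.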
class fanc.map.BwaMapper(bwa_index, min_quality=0, additional_arguments=(), threads=1, algorithm='mem', memory_map=False, _bwa_path='bwa')

Bases: fanc.map.Mapper

BWA Mapper for aligning reads against a reference genome.

Implements Mapper by calling the command line “bwa” program.

bwa_index

Path to the BWA index of the reference genome of choice.

min_quality

Minimum MAPQ of an alignment so that it won’t be resubmitted in iterative mapping.

additional_arguments

Arguments passed to the “bwa” command in addition to -t and -o.

threads

Number of threads for this mapping process.

algorithm

BWA algorithm to use for mapping. Uses “mem” by default. See http://bio-bwa.sourceforge.net/bwa.shtml for other options.
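As a rough illustration of how algorithm, threads, and additional_arguments could combine into a “bwa” call, consider the sketch below. build_bwa_command is a hypothetical helper, the paths are placeholders, and the sketch is deliberately limited to “bwa mem”, since other algorithms take different options.

```python
def build_bwa_command(index, fastq, sam, threads=1, algorithm="mem",
                      additional_arguments=(), bwa_path="bwa"):
    """Assemble a 'bwa mem' argument list (hypothetical helper, not FAN-C code)."""
    if algorithm != "mem":
        raise ValueError("this sketch only covers 'bwa mem'")
    # Options come first, followed by the index prefix and the FASTQ file.
    return [bwa_path, algorithm, "-t", str(threads), "-o", sam,
            *additional_arguments, index, fastq]


bwa_cmd = build_bwa_command("bwa-index/genome.fa", "reads.fastq", "out.sam",
                            threads=4, additional_arguments=("-k", "19"))
print(" ".join(bwa_cmd))
```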

close()

Final operations after mapping completes.

map(input_file, output_folder=None)

Map reads in the given FASTQ file using _map() implementation.

Will internally map the FASTQ reads to a SAM file in a temporary folder (use output_folder to choose a specific folder), split the SAM output into (i) valid alignments according to _resubmit() and (ii) invalid alignments that get caught by the resubmission filter. A FASTQ file will be constructed from the invalid alignments, extending the reads by a given step size, which can then be used to iteratively repeat the mapping process until a valid alignment is found or the full length of the read has been restored.

Parameters:
  • input_file – Path to FASTQ file
  • output_folder – (optional) path to temporary folder for SAM output
Returns:

tuple, path to valid SAM alignments, path to resubmission FASTQ

resubmit(sam_fields)

Determine if an alignment should be resubmitted.

Filters unmappable reads by default. Additional criteria can be implemented using the _resubmit() method.

Parameters: sam_fields – The individual fields in a SAM line (split by tab)
Returns: True if read should be extended and aligned to the reference again, False if the read passes the validity criteria.
class fanc.map.SimpleBowtie2Mapper(bowtie2_index, additional_arguments=(), threads=1, _bowtie2_path='bowtie2')

Bases: fanc.map.Bowtie2Mapper

Bowtie2 Mapper for aligning reads against a reference genome without resubmission.

Implements Mapper by calling the command line “bowtie2” program. Does not resubmit reads under any circumstance.

bowtie2_index

Path to the bowtie2 index of the reference genome of choice.

additional_arguments

Arguments passed to the “bowtie2” command in addition to -x, -U, --no-unal, --threads, and -S.

threads

Number of threads for this mapping process.

close()

Final operations after mapping completes.

map(input_file, output_folder=None)

Map reads in the given FASTQ file using _map() implementation.

Will internally map the FASTQ reads to a SAM file in a temporary folder (use output_folder to choose a specific folder), split the SAM output into (i) valid alignments according to _resubmit() and (ii) invalid alignments that get caught by the resubmission filter. A FASTQ file will be constructed from the invalid alignments, extending the reads by a given step size, which can then be used to iteratively repeat the mapping process until a valid alignment is found or the full length of the read has been restored.

Parameters:
  • input_file – Path to FASTQ file
  • output_folder – (optional) path to temporary folder for SAM output
Returns:

tuple, path to valid SAM alignments, path to resubmission FASTQ

resubmit(sam_fields)

Determine if an alignment should be resubmitted.

Filters unmappable reads by default. Additional criteria can be implemented using the _resubmit() method.

Parameters: sam_fields – The individual fields in a SAM line (split by tab)
Returns: True if read should be extended and aligned to the reference again, False if the read passes the validity criteria.
class fanc.map.SimpleBwaMapper(bwa_index, additional_arguments=(), threads=1, memory_map=False, _bwa_path='bwa')

Bases: fanc.map.BwaMapper

BWA Mapper for aligning reads against a reference genome without resubmission.

Implements Mapper by calling the command line “bwa” program. Does not resubmit reads under any circumstance, i.e. does not perform iterative mapping.

bwa_index

Path to the BWA index of the reference genome of choice.

min_quality

Minimum MAPQ of an alignment so that it won’t be resubmitted in iterative mapping.

additional_arguments

Arguments passed to the “bwa” command in addition to -t and -o.

threads

Number of threads for this mapping process.

algorithm

BWA algorithm to use for mapping. Uses “mem” by default. See http://bio-bwa.sourceforge.net/bwa.shtml for other options.

close()

Final operations after mapping completes.

map(input_file, output_folder=None)

Map reads in the given FASTQ file using _map() implementation.

Will internally map the FASTQ reads to a SAM file in a temporary folder (use output_folder to choose a specific folder), split the SAM output into (i) valid alignments according to _resubmit() and (ii) invalid alignments that get caught by the resubmission filter. A FASTQ file will be constructed from the invalid alignments, extending the reads by a given step size, which can then be used to iteratively repeat the mapping process until a valid alignment is found or the full length of the read has been restored.

Parameters:
  • input_file – Path to FASTQ file
  • output_folder – (optional) path to temporary folder for SAM output
Returns:

tuple, path to valid SAM alignments, path to resubmission FASTQ

resubmit(sam_fields)

Determine if an alignment should be resubmitted.

Filters unmappable reads by default. Additional criteria can be implemented using the _resubmit() method.

Parameters: sam_fields – The individual fields in a SAM line (split by tab)
Returns: True if read should be extended and aligned to the reference again, False if the read passes the validity criteria.
fanc.map.iterative_mapping(fastq_file, sam_file, mapper, tmp_folder=None, threads=1, min_size=25, step_size=5, batch_size=200000, trim_front=False, restriction_enzyme=None)

Iteratively map sequencing reads using the provided mapper.

Will attempt to align a read using mapper. If unsuccessful, will truncate the read by step_size and attempt to align again. This is repeated until a successful alignment is found or the read gets truncated below min_size.
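Read literally, the paragraph above implies a series of progressively shorter versions of each read. The sketch below lists those candidates for a single read; truncation_schedule is a hypothetical helper, and the order in which FAN-C actually attempts the lengths may differ.

```python
def truncation_schedule(read, min_size=25, step_size=5, trim_front=False):
    """List the truncated versions of a read, from longest to shortest."""
    if step_size < 1:
        raise ValueError("step_size must be at least 1")
    candidates = []
    length = len(read)
    while length >= min_size and length > 0:
        if trim_front:
            # Drop bases from the start of the read, keep the end.
            candidates.append(read[len(read) - length:])
        else:
            # Drop bases from the end of the read, keep the start.
            candidates.append(read[:length])
        length -= step_size
    return candidates


back = truncation_schedule("ACGTACGTAC", min_size=4, step_size=3)
front = truncation_schedule("ACGTACGTAC", min_size=4, step_size=3, trim_front=True)
print(back)
print(front)
```

A read shorter than min_size yields no candidates at all, which matches the statement that mapping stops once the read is truncated below min_size.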

Parameters:
  • fastq_file – An input FASTQ file path with reads to align
  • sam_file – An output file path for the mapping results. If it ends with ‘.bam’, the output is compressed in BAM format.
  • mapper – An instance of Mapper, e.g. Bowtie2Mapper. Subclass Mapper to create your own custom mappers.
  • tmp_folder – A temporary folder for outputting subsets of FASTQ files
  • threads – Number of mapper threads to use in parallel.
  • min_size – Minimum length of read for which an alignment is attempted.
  • step_size – Number of base pairs by which to truncate read.
  • batch_size – Maximum number of reads processed in one batch
  • trim_front – Trim bases from front of read instead of back
  • restriction_enzyme – If provided, will calculate the expected ligation junction between reads and split reads accordingly. Mapping is then attempted for both resulting parts. Can be the name of a restriction enzyme or a restriction pattern (e.g. A^AGCT_T)
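To illustrate what splitting at the expected ligation junction could look like for a restriction pattern such as A^AGCT_T, here is a minimal sketch. It assumes fill-in and blunt-end ligation of a 5' overhang (or a blunt cut), handles only the first junction in a read, and uses the hypothetical helpers ligation_halves and split_read; FAN-C's own handling may differ.

```python
def ligation_halves(pattern):
    """Return the two halves of the expected ligation junction for a pattern.

    '^' marks the cut on the forward strand, '_' the cut on the reverse strand.
    """
    if pattern.count("^") != 1 or pattern.count("_") != 1:
        raise ValueError("pattern needs exactly one '^' and one '_'")
    site = pattern.replace("^", "").replace("_", "")
    forward_cut = pattern.replace("_", "").index("^")
    reverse_cut = pattern.replace("^", "").index("_")
    if forward_cut > reverse_cut:
        raise ValueError("this sketch only covers 5' overhangs and blunt cuts")
    # After fill-in, the left fragment ends at the reverse-strand cut and
    # the right fragment starts at the forward-strand cut.
    return site[:reverse_cut], site[forward_cut:]


def split_read(read, pattern):
    """Split a read at the first expected ligation junction, if present."""
    left, right = ligation_halves(pattern)
    position = read.find(left + right)
    if position == -1:
        return [read]
    cut = position + len(left)
    return [read[:cut], read[cut:]]


halves = ligation_halves("A^AGCT_T")
parts = split_read("GGGGAAGCTAGCTTCCCC", "A^AGCT_T")
print(halves, parts)
```

Each part keeps its half of the junction, so both ends still carry the filled-in restriction site sequence when they are mapped separately.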