bismark

Version:

0.20.1, 0.22.3

Category:

bio

Cluster:

Loki

Author / Distributor

https://felixkrueger.github.io/Bismark/

Description

Bismark is a program to map bisulfite-treated sequencing reads to a genome of interest and perform methylation calls in a single step. The output can be easily imported into a genome viewer such as SeqMonk, or processed with Samtools, and enables a researcher to analyze the methylation levels of their samples straight away. Bismark requires a working Perl installation and is run from the command line. Bowtie 2 or HISAT2 must also be installed. Samtools, which is required for BAM output, is included.

Documentation

USAGE: bismark [options] <genome_folder> {-1 <mates1> -2 <mates2> | <singles>}


ARGUMENTS:

<genome_folder>          The path to the folder containing the unmodified reference genome
                        as well as the subfolders created by the Bismark_Genome_Preparation
                        script (/Bisulfite_Genome/CT_conversion/ and /Bisulfite_Genome/GA_conversion/).
                        Bismark expects one or more fastA files in this folder (file extension: .fa, .fa.gz
                        or .fasta or .fasta.gz). The path can be relative or absolute. The path may also be set
                        as '--genome_folder /path/to/genome/folder/'.

-1 <mates1>              Comma-separated list of files containing the #1 mates (filename usually includes
                        "_1", e.g. flyA_1.fq,flyB_1.fq). Sequences specified with this option must
                        correspond file-for-file and read-for-read with those specified in <mates2>.
                        Reads may be a mix of different lengths. Bismark will produce one mapping result
                        and one report file per paired-end input file pair.

-2 <mates2>              Comma-separated list of files containing the #2 mates (filename usually includes
                        "_2", e.g. flyA_2.fq,flyB_2.fq). Sequences specified with this option must
                        correspond file-for-file and read-for-read with those specified in <mates1>.
                        Reads may be a mix of different lengths.

<singles>                A comma- or space-separated list of files containing the reads to be aligned (e.g.
                        lane1.fq,lane2.fq lane3.fq). Reads may be a mix of different lengths. Bismark will
                        produce one mapping result and one report file per input file. Please note that
                        one should not supply a list of files in conjunction with --basename, as the output
                        files will constantly overwrite each other.
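
The paired-end and single-end forms of the USAGE line can be sketched as follows. This is only an illustration: the genome path and the read file names are placeholders taken from the examples above, and the helper count_items is made up here to check that both mate lists contain the same number of files. The commands are printed rather than executed so they can be inspected first.

```shell
# Placeholder inputs (assumptions, replace with your own paths and files).
GENOME=/path/to/genome/folder/
MATES1=flyA_1.fq,flyB_1.fq
MATES2=flyA_2.fq,flyB_2.fq
SINGLES=lane1.fq,lane2.fq

count_items() {
    # Count the comma-separated entries in a list such as "a.fq,b.fq".
    old_ifs=$IFS
    IFS=,
    set -- $1
    IFS=$old_ifs
    echo "$#"
}

# Paired-end: the -1 and -2 lists must correspond file-for-file.
if [ "$(count_items "$MATES1")" -eq "$(count_items "$MATES2")" ]; then
    paired_cmd="bismark $GENOME -1 $MATES1 -2 $MATES2"
    echo "$paired_cmd"
else
    echo "mate lists differ in length" >&2
fi

# Single-end: the read files follow the genome folder directly.
single_cmd="bismark $GENOME $SINGLES"
echo "$single_cmd"
```

Once the bismark module is loaded, the printed lines can be pasted into the shell to start the alignment.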



OPTIONS:


Input:

--se/--single_end <list> Sets single-end mapping mode explicitly giving a list of file names as <list>.
                        The filenames may be provided as a comma [,] or colon [:] separated list.

-q/--fastq               The query input files (specified as <mate1>,<mate2> or <singles>) are FASTQ
                        files (usually having extension .fq or .fastq). This is the default. See also
                        --solexa-quals.

-f/--fasta               The query input files (specified as <mate1>,<mate2> or <singles>) are FASTA
                        files (usually having extensions .fa, .mfa, .fna or similar). All quality values
                        are assumed to be 40 on the Phred scale. FASTA files are expected to contain both
                        the read name and the sequence on a single line (and not spread over several lines).
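
If a FASTA file wraps its sequences over several lines, it can be flattened before alignment. The following is a minimal sketch using awk. The function name unwrap_fasta and the sample reads are made up for illustration, and it assumes the requirement above means one header line followed by one sequence line per read.

```shell
unwrap_fasta() {
    # Read FASTA on stdin; write each record as one header line plus one sequence line.
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }'
}

# Demo input: read1 is wrapped over two lines, read2 is already on one line.
printf '>read1\nACGT\nTTGA\n>read2\nCCCC\n' | unwrap_fasta
# prints:
# >read1
# ACGTTTGA
# >read2
# CCCC
```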

-s/--skip <int>          Skip (i.e. do not align) the first <int> reads or read pairs from the input.

-u/--upto <int>          Only aligns the first <int> reads or read pairs from the input. Default: no limit.

--phred33-quals          FASTQ qualities are ASCII chars equal to the Phred quality plus 33. Default: ON.

--phred64-quals          FASTQ qualities are ASCII chars equal to the Phred quality plus 64. Default: off.

--path_to_bowtie2        The full path </../../> to the Bowtie 2 installation on your system. If not
                        specified it is assumed that Bowtie 2 is in the PATH.

--path_to_hisat2         The full path </../../> to the HISAT2 installation on your system. If not
                        specified it is assumed that HISAT2 is in the PATH.

Alignment:


-N <int>                 Sets the number of mismatches allowed in a seed alignment during multiseed alignment.
                        Can be set to 0 or 1. Setting this higher makes alignment slower (often much slower)
                        but increases sensitivity. Default: 0. This option is only available for Bowtie 2 (for
                        Bowtie 1 see -n).

-L <int>                 Sets the length of the seed substrings to align during multiseed alignment. Smaller values
                        make alignment slower but more sensitive. Default: the --sensitive preset of Bowtie 2 is
                        used, which sets -L to 20. The maximum value of -L is 32. The seed length affects
                        alignment speed dramatically: the larger -L, the faster the alignment.
                        This option is only available for Bowtie 2 (for Bowtie 1 see -l).

--ignore-quals           When calculating a mismatch penalty, always consider the quality value at the mismatched
                        position to be the highest possible, regardless of the actual value. I.e. input is treated
                        as though all quality values are high. This is also the default behavior when the input
                        doesn't specify quality values (e.g. in -f mode). This option is invariable and on by default.

-I/--minins <int>        The minimum insert size for valid paired-end alignments. E.g. if -I 60 is specified and
                        a paired-end alignment consists of two 20-bp alignments in the appropriate orientation
                        with a 20-bp gap between them, that alignment is considered valid (as long as -X is also
                        satisfied). A 19-bp gap would not be valid in that case. Default: 0.

-X/--maxins <int>        The maximum insert size for valid paired-end alignments. E.g. if -X 100 is specified and
                        a paired-end alignment consists of two 20-bp alignments in the proper orientation with a
                        60-bp gap between them, that alignment is considered valid (as long as -I is also satisfied).
                        A 61-bp gap would not be valid in that case. Default: 500.
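
The arithmetic behind the -I and -X examples can be checked with a small helper. This is an illustrative sketch, not part of Bismark: the function name insert_ok is made up, and it assumes the insert size is the sum of both mate alignment lengths and the gap between them, as in the examples above.

```shell
insert_ok() {
    # $1 = -I (minins), $2 = -X (maxins)
    # $3 = mate 1 alignment length, $4 = gap, $5 = mate 2 alignment length (all in bp)
    insert=$(( $3 + $4 + $5 ))
    if [ "$insert" -ge "$1" ] && [ "$insert" -le "$2" ]; then
        echo "valid ($insert bp)"
    else
        echo "invalid ($insert bp)"
    fi
}

insert_ok 60 500 20 20 20   # valid (60 bp)
insert_ok 60 500 20 19 20   # invalid (59 bp)
insert_ok 0 100 20 60 20    # valid (100 bp)
insert_ok 0 100 20 61 20    # invalid (101 bp)
```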

--parallel <int>         (May also be --multicore <int>) Sets the number of parallel instances of Bismark to be run concurrently.
                        This forks the Bismark alignment step very early on so that each individual instance of Bismark processes
                        only every n-th sequence (n being set by --parallel). Once all processes have completed,
                        the individual BAM files, mapping reports, unmapped or ambiguous FastQ files are merged
                        into single files in very much the same way as they would have been generated running Bismark
                        conventionally with only a single instance.

                        If system resources are plentiful this is a viable option to speed up the alignment process
                        (we observed a near linear speed increase for up to --parallel 8 tested). However, please note
                        that a typical Bismark run will use several cores already (Bismark itself, 2 or 4 threads of
                        Bowtie2/HISAT2, Samtools, gzip etc...) and ~10-16GB of memory depending on the choice of aligner
                        and genome. WARNING: Bismark Parallel (BP?) is resource hungry! Each value of --parallel specified
                        will effectively lead to a linear increase in compute and memory requirements, so --parallel 4 for
                        e.g. the GRCm38 mouse genome will probably use ~20 cores and eat ~40GB of RAM, but at the same time
                        reduce the alignment time to ~25-30%. You have been warned.
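
Before requesting resources for a --parallel run, the linear scaling described above can be estimated roughly. This is a back-of-the-envelope sketch only: the function name estimate_resources is made up, and the per-instance defaults (5 cores, 10 GB) are assumptions derived from the "--parallel 4, ~20 cores, ~40GB" figure quoted above. Real usage depends on the aligner and the genome.

```shell
estimate_resources() {
    # $1 = value passed to --parallel
    # $2 = cores per Bismark instance (assumed default: 5)
    # $3 = GB of RAM per Bismark instance (assumed default: 10)
    n=$1
    cores_each=${2:-5}
    gb_each=${3:-10}
    echo "cores=$(( n * cores_each )) mem_gb=$(( n * gb_each ))"
}

estimate_resources 4        # cores=20 mem_gb=40
estimate_resources 8 5 16   # cores=40 mem_gb=128
```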

--local                  In this mode, it is not required that the entire read aligns from one end to the other. Rather, some
                        characters may be omitted ("soft-clipped") from the ends in order to achieve the greatest possible
                        alignment score. For Bowtie 2, the match bonus --ma (default: 2) is used in this mode, and the best possible
                        alignment score is equal to the match bonus (--ma) times the length of the read. This is mutually exclusive with
                        end-to-end alignments. For HISAT2, it is currently not exactly known how the best alignment is calculated.
                        DEFAULT: OFF.
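
For Bowtie 2, the best possible --local alignment score described above is simply the match bonus times the read length. A one-line sketch follows; the function name best_local_score is made up for illustration.

```shell
best_local_score() {
    # $1 = read length in bp, $2 = match bonus --ma (default: 2)
    echo $(( $1 * ${2:-2} ))
}

best_local_score 100     # 200
best_local_score 150 3   # 450
```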


Output:

--non_directional        The sequencing library was constructed in a non-strand-specific manner; alignments to all four
                        bisulfite strands will be reported. Default: OFF.

                        (The current Illumina protocol for BS-Seq is directional, in which case the strands complementary
                        to the original strands are merely theoretical and should not exist in reality. Specifying directional
                        alignments (which is the default) will only run 2 alignment threads to the original top (OT)
                        or bottom (OB) strands in parallel and report these alignments. This is the recommended option
                        for strand-specific libraries).

--pbat                   This option may be used for PBAT-Seq libraries (Post-Bisulfite Adapter Tagging; Kobayashi et al.,
                        PLoS Genetics, 2012). This is essentially the exact opposite of alignments in 'directional' mode,
                        as it will only launch two alignment threads to the CTOT and CTOB strands instead of the normal OT
                        and OB ones. Use this option only if you are certain that your libraries were constructed following
                        a PBAT protocol (if you don't know what PBAT-Seq is you should not specify this option). The option
                        --pbat works only for FastQ files (in both Bowtie and Bowtie 2 mode) and with uncompressed
                        temporary files only.

--sam-no-hd              Suppress SAM header lines (starting with @). This might be useful when very large input files are
                        split up into several smaller files to run concurrently and the output files are to be merged.

--rg_tag                 Write out a Read Group tag to the resulting SAM/BAM file. This will write the following line to the
                        SAM header: @RG PL: ILLUMINA ID:SAMPLE SM:SAMPLE ; to set ID and SM see --rg_id and --rg_sample.
                        In addition each read receives an RG:Z:RG-ID tag. Default: OFF.

--rg_id <string>         Sets the ID field in the @RG header line. The default is 'SAMPLE'.

--rg_sample <string>     Sets the SM field in the @RG header line; can't be set without setting --rg_id as well. The default is
                        'SAMPLE'.

-un/--unmapped           Write all reads that could not be aligned to a file in the output directory. Written reads will
                        appear as they did in the input, without any translation of quality values that may have
                        taken place within Bowtie or Bismark. Paired-end reads will be written to two parallel files with _1
                        and _2 inserted in their filenames, i.e. _unmapped_reads_1.txt and _unmapped_reads_2.txt. Reads
                        with more than one valid alignment with the same number of lowest mismatches (ambiguous mapping)
                        are also written to _unmapped_reads.txt unless the option --ambiguous is specified as well.

--ambiguous              Write all reads which produce more than one valid alignment with the same number of lowest
                        mismatches or other reads that fail to align uniquely to a file in the output directory.
                        Written reads will appear as they did in the input, without any of the translation of quality
                        values that may have taken place within Bowtie or Bismark. Paired-end reads will be written to two
                        parallel files with _1 and _2 inserted in their filenames, i.e. _ambiguous_reads_1.txt and
                        _ambiguous_reads_2.txt. These reads are not written to the file specified with --un.

-o/--output_dir <dir>    Write all output files into this directory. By default the output files will be written into
                        the same folder as the input file(s). If the specified folder does not exist, Bismark will attempt
                        to create it first. The path to the output folder can be either relative or absolute.

--temp_dir <dir>         Write temporary files to this directory instead of into the same directory as the input files. If
                        the specified folder does not exist, Bismark will attempt to create it first. The path to the
                        temporary folder can be either relative or absolute.

--non_bs_mm              Optionally, outputs an extra column specifying the number of non-bisulfite mismatches a read has.
                        This option is only available in end-to-end mode. The value is just the number of actual non-bisulfite
                        mismatches and ignores potential insertions or deletions.
                        The format for single-end reads and read 1 of paired-end reads is 'XA:Z:number of mismatches'
                        and 'XB:Z:number of mismatches' for read 2 of paired-end reads.

--gzip                   Temporary bisulfite conversion files will be written out in a GZIP compressed form to save disk
                        space. This option is available for most alignment modes but is not available for paired-end FastA
                        files. This option might be somewhat slower than writing out uncompressed files, but this awaits
                        further testing.

--sam                    The output will be written out in SAM format instead of the default BAM format. Be warned that this
                        requires ~10 times more disk space. --sam is not compatible with the option --parallel.

--bam                    Bismark will attempt to use the path to Samtools that was specified with '--samtools_path', or, if it hasn't
                        been specified, attempt to find Samtools in the PATH. If no installation of Samtools can be found,
                        the SAM output will be compressed with GZIP instead (yielding a .sam.gz output file). Default: ON.

--cram                   Writes the output to a CRAM file instead of BAM. This requires the use of Samtools 1.2 or higher.

--cram_ref <ref_file>    CRAM output requires you to specify a reference genome as a single FastA file. If this single-FastA
                        reference file is not supplied explicitly it will be regenerated from the genome .fa sequence(s)
                        used for the Bismark run and written to a file called 'Bismark_genome_CRAM_reference.mfa' into the
                        output directory.

--samtools_path          The path to your Samtools installation, e.g. /home/user/samtools/. Does not need to be specified
                         explicitly if Samtools is in the PATH already.

--prefix <prefix>        Prefixes <prefix> to the output filenames. Trailing dots will be replaced by a single one. For
                        example, '--prefix test' with 'file.fq' would result in the output file 'test.file.fq_bismark.sam' etc.

-B/--basename <basename> Write all output to files starting with this base file name. For example, '--basename foo'
                        would result in the files 'foo.bam' and 'foo_SE_report.txt' (or its paired-end equivalent). Takes
                        precedence over --prefix. Be advised that you should not use this option in conjunction with supplying
                        lists of files to be processed consecutively, as all output files will constantly overwrite each other.

--old_flag               Only in paired-end SAM mode, uses the FLAG values used by Bismark v0.8.2 and before. In addition,
                        this option appends /1 and /2 to the read IDs for reads 1 and 2 relative to the input file. Since
                        both the appended read IDs and custom FLAG values may cause problems with some downstream tools
                        such as Picard, new defaults were implemented as of version 0.8.3.


                                            default                         old_flag
                                      ===================              ===================
                                      Read 1       Read 2              Read 1       Read 2

                             OT:         99          147                  67          131

                             OB:         83          163                 115          179

                             CTOT:      147           99                  67          131

                             CTOB:      163           83                 115          179
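
The FLAG values in the table are sums of the standard SAM flag bits. The sketch below decodes a value into its bits so that the default and old_flag columns can be compared. The function name decode_flag and the short bit labels are made up for illustration; the bit values themselves follow the SAM format.

```shell
decode_flag() {
    # $1 = SAM FLAG value; prints the names of the bits that are set.
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "$flag = $out"
}

decode_flag 99    # 99 = paired,proper_pair,mate_reverse,first_in_pair
decode_flag 147   # 147 = paired,proper_pair,reverse,second_in_pair
decode_flag 67    # 67 = paired,proper_pair,first_in_pair
```

Decoded this way, the old_flag value 67 for an OT read 1 carries no strand bits at all, while the default value 99 also records that the mate aligned in reverse orientation.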

--ambig_bam              For reads that have multiple alignments a random alignment is written out to a special file ending in
                        '.ambiguous.bam'. The alignments are in Bowtie2 format and do not contain any Bismark specific
                        entries such as the methylation call etc. These ambiguous BAM files are intended to be used as
                        coverage estimators for variant callers.

--nucleotide_coverage    Calculates the mono- and di-nucleotide sequence composition of covered positions in the analysed BAM
                        file and compares it to the genomic average composition once alignments are complete by calling 'bam2nuc'.
                        Since this calculation may take a while, bam2nuc attempts to write the genomic sequence composition
                        into a file called 'genomic_nucleotide_frequencies.txt' inside the reference genome folder so it can
                        be re-used the next time round instead of calculating it once again. If a file 'nucleotide_stats.txt' is
                        found with the Bismark reports it will be automatically detected and used for the Bismark HTML report.
                        This option works only for BAM or CRAM files.

--icpc                   This option will truncate read IDs at the first space or tab it encounters, which are sometimes used to add
                        comments to a FastQ entry (instead of replacing them with underscores (_) as is the default behaviour). The
                        option is deliberately somewhat cryptic ("I couldn't possibly comment"), as it only becomes relevant when R1 and R2
                        of read pairs are mapped separately in single-end mode, and then re-paired afterwards (the SAM format dictates
                        that R1 and R2 have the same read ID). Paired-end mapping already creates BAM files with identical read IDs.
                        For more information please see here: https://github.com/FelixKrueger/Bismark/issues/236. Default: OFF.
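
The difference between the default read ID handling and --icpc can be illustrated with two made-up read IDs. This sketch only mimics the behaviour described above using tr and sed; it is not Bismark's own code, and the function names default_id and icpc_id are hypothetical.

```shell
default_id() {
    # Default behaviour as described above: spaces and tabs become underscores.
    printf '%s\n' "$1" | tr ' \t' '__'
}

icpc_id() {
    # --icpc behaviour as described above: truncate at the first space or tab.
    printf '%s\n' "$1" | sed 's/[[:space:]].*//'
}

# Made-up IDs for read 1 and read 2 of the same pair.
R1='@SEQ_1 1:N:0:ATCACG'
R2='@SEQ_1 2:N:0:ATCACG'

default_id "$R1"   # @SEQ_1_1:N:0:ATCACG
default_id "$R2"   # @SEQ_1_2:N:0:ATCACG
icpc_id "$R1"      # @SEQ_1
icpc_id "$R2"      # @SEQ_1
```

With the default handling the two IDs differ after conversion, which is why reads mapped separately in single-end mode cannot be re-paired by ID. With truncation both mates end up with the same ID.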

Examples/Usage

  • List available modules:

    $ module avail bismark
    
  • Load the bismark module:

    $ module load bio/Bismark/0.22.3
    
  • Check the loaded modules:

    $ module list
    
  • Unload the bismark module:

    $ module unload bio/Bismark/0.22.3
    

Installation

Source code is obtained from Bismark