htseq

Version:

2.0.2

Category:

bio

Cluster:

Loki

Author / Distributor

https://htseq.readthedocs.io/en/master/history.html

Description

Htseq is a group of python packages that provide intrastructure to process data from high throughput sequencing arrays. While the main purpose of HSTeq is to allow you to write your own analysis scripts, customized to your own needs. There are also some stand-alone scripts for tasks that can be used without any Python knowledge.

Documentation

usage: htseq-count [-h] [--version] [-f {sam,bam,auto}] [-r {pos,name}] [--max-reads-in-buffer MAX_BUFFER_SIZE] [-s {yes,no,reverse}] [-a MINAQUAL] [-t FEATURE_TYPE] [-i IDATTR]
                  [--additional-attr ADDITIONAL_ATTRIBUTES] [--add-chromosome-info] [-m {union,intersection-strict,intersection-nonempty}] [--nonunique {none,all,fraction,random}]
                  [--secondary-alignments {score,ignore}] [--supplementary-alignments {score,ignore}] [-o SAMOUTS] [-p {SAM,BAM,sam,bam}] [-d OUTPUT_DELIMITER] [-c OUTPUT_FILENAME] [--counts-output-sparse]
                  [--append-output] [-n NPROCESSES] [--feature-query FEATURE_QUERY] [-q] [--with-header]
                  samfilenames [samfilenames ...] featuresfilename

This script takes one or more alignment files in SAM/BAM format and a feature file in GFF format and calculates for each feature the number of reads mapping to it. See
http://htseq.readthedocs.io/en/master/count.html for details.

positional arguments:
 samfilenames          Path to the SAM/BAM files containing the mapped reads. If '-' is selected, read from standard input
 featuresfilename      Path to the GTF file containing the features

optional arguments:
 -h, --help            show this help message and exit
 --version             Show software version and exit
 -f {sam,bam,auto}, --format {sam,bam,auto}
                       Type of <alignment_file> data. DEPRECATED: file format is detected automatically. This option is ignored.
 -r {pos,name}, --order {pos,name}
                       'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be
                       specified. Ignored for single-end data.
 --max-reads-in-buffer MAX_BUFFER_SIZE
                       When <alignment_file> is paired end sorted by position, allow only so many reads to stay in memory until the mates are found (raising this number will use more memory). Has no effect
                       for single end or paired end sorted by name
 -s {yes,no,reverse}, --stranded {yes,no,reverse}
                       Whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand interpretation
 -a MINAQUAL, --minaqual MINAQUAL
                       Skip all reads with MAPQ alignment quality lower than the given minimum value (default: 10). MAPQ is the 5th column of a SAM/BAM file and its usage depends on the software used to map
                       the reads.
 -t FEATURE_TYPE, --type FEATURE_TYPE
                       Feature type (3rd column in GTF file) to be used, all features of other type are ignored (default, suitable for Ensembl GTF files: exon)
 -i IDATTR, --idattr IDATTR
                       GTF attribute to be used as feature ID (default, suitable for Ensembl GTF files: gene_id). All feature of the right type (see -t option) within the same GTF attribute will be added
                       together. The typical way of using this option is to count all exonic reads from each gene and add the exons but other uses are possible as well. You can call this option multiple
                       times: in that case, the combination of all attributes separated by colons (:) will be used as a unique identifier, e.g. for exons you might use -i gene_id -i exon_number.
 --additional-attr ADDITIONAL_ATTRIBUTES
                       Additional feature attributes (default: none, suitable for Ensembl GTF files: gene_name). Use multiple times for more than one additional attribute. These attributes are only used as
                       annotations in the output, while the determination of how the counts are added together is done based on option -i.
 --add-chromosome-info
                       Store information about the chromosome of each feature as an additional attribute (e.g. colunm in the TSV output file).
 -m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}
                       Mode to handle reads overlapping more than one feature (choices: union, intersection-strict, intersection-nonempty; default: union)
 --nonunique {none,all,fraction,random}
                       Whether and how to score reads that are not uniquely aligned or ambiguously assigned to features (choices: none, all, fraction, random; default: none)
 --secondary-alignments {score,ignore}
                       Whether to score secondary alignments (0x100 flag)
 --supplementary-alignments {score,ignore}
                       Whether to score supplementary alignments (0x800 flag)
 -o SAMOUTS, --samout SAMOUTS
                       Write out all SAM alignment records into SAM/BAM files (one per input file needed), annotating each line with its feature assignment (as an optional field with tag 'XF'). See the -p
                       option to use BAM instead of SAM.
 -p {SAM,BAM,sam,bam}, --samout-format {SAM,BAM,sam,bam}
                       Format to use with the --samout option.
 -d OUTPUT_DELIMITER, --delimiter OUTPUT_DELIMITER
                       Column delimiter in output (default: TAB).
 -c OUTPUT_FILENAME, --counts_output OUTPUT_FILENAME
                       Filename to output the counts to instead of stdout.
 --counts-output-sparse
                       Store the counts as a sparse matrix (mtx, h5ad, loom).
 --append-output       Append counts output to an existing file instead of creating a new one. This option is useful if you have already creates a TSV/CSV/similar file with a header for your samples (with
                       additional columns for the feature name and any additionl attributes) and want to fill in the rest of the file.
 -n NPROCESSES, --nprocesses NPROCESSES
                       Number of parallel CPU processes to use (default: 1). This option is useful to process several input files at once. Each file will use only 1 CPU. It is possible, of course, to split
                       a very large input SAM/BAM files into smaller chunks upstream to make use of this option.
 --feature-query FEATURE_QUERY
                       Restrict to features descibed in this expression. Currently supports a single kind of expression: attribute == "one attr" to restrict the GFF to a single gene or transcript, e.g.
                       --feature-query 'gene_name == "ACTB"' - notice the single quotes around the argument of this option and the double quotes around the gene name. Broader queries might become available
                       in the future.
 -q, --quiet           Suppress progress report
 --with-header         Whether to add a column header to the output TSV file indicating which column corresponds to which input BAM file. Only used if output to console or tsv or csv file. Default to False.

Examples/Usage

  • List available modules:

    $ module avail htseq
    
  • Load the htseq module:

    $ module load bio/htseq/2.0.2
    
  • Check the loaded modules:

    $ module list
    
  • Unload the htseq module:

    $ module unload bio/htseq/2.0.2
    
  • Access the help information:

    $ htseq-count --help
    

Installation

Source code is obtained from htseq