BioReadQualityStatistics class

Quality statistics from a short-read sequence file

Description

The BioReadQualityStatistics class contains quality statistics data from short-read sequences and provides a standard set of quality control plots for such data.

Construct a BioReadQualityStatistics object from short-read sequence data stored in FASTQ, SAM, or BAM files. Use the object's methods to analyze data quality and generate quality control plots of the average quality score for each base position, the average quality score distribution, the read count percentage for each base position, the percentage of G and C nucleotides for each base position, the G and C content distribution, and the distribution of all nucleotides. The object lets you parse a sequence file without creating a BioRead object and interact with the quality data in order to compare different data sets or filtering options and to create customized plots.

Construction

QSObj = BioReadQualityStatistics(File) constructs QSObj, a BioReadQualityStatistics object, from the data stored in File, a FASTQ-, SAM-, or BAM-formatted file.

QSObj = BioReadQualityStatistics(Obj) constructs QSObj, a BioReadQualityStatistics object, from the data stored in Obj, a BioRead or BioMap object.

QSObj = BioReadQualityStatistics(___,Name,Value) constructs a BioReadQualityStatistics object using options specified by one or more name-value pair arguments.

    Note: QSObj is an immutable object. You cannot modify its properties after it is created.

Input Arguments

File

String specifying a FASTQ-, SAM-, or BAM-formatted file. The string can contain the path or folder location of the file.

Obj

A BioRead or BioMap object.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'Encoding' - Encoding format
'Illumina18' (default) | 'Sanger' | 'Illumina13' | 'Illumina15' | 'Solexa'

Encoding format, specified as 'Sanger', 'Illumina13', 'Illumina15', 'Illumina18', or 'Solexa'. This format determines how sequence information and quality scores are encoded as characters in a FASTQ file.

Example: 'Encoding','Sanger'

'FilterLength' - Number of characters
[] (default) | positive integer

Number of characters to use from each read, specified as a positive integer. If you specify an empty array, which is the default value, no filtering is applied.

Example: 'FilterLength',40

'QualityScoreThreshold' - Average quality threshold
-Inf (default) | real number

Average quality threshold, specified as a real number. Any read whose average quality score is less than the specified threshold is ignored.

Example: 'QualityScoreThreshold',10
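
The following sketch shows one way the name-value pair arguments might be combined to compare filtering options. It assumes that the sample file SRR005164_1_50.fastq used in the Examples section is on the MATLAB path. The variable names are illustrative only.

```matlab
% Sample FASTQ file from the Examples section (assumed to be on the MATLAB path)
fastqFile = 'SRR005164_1_50.fastq';

% No filtering: default 'FilterLength' ([]) and 'QualityScoreThreshold' (-Inf)
QSAll = BioReadQualityStatistics(fastqFile);

% Use the first 40 characters of each read and ignore reads whose
% average quality score is below 10
QSFiltered = BioReadQualityStatistics(fastqFile, ...
    'FilterLength', 40, 'QualityScoreThreshold', 10);

% Display the read counts that each object reports
fprintf('Reads without filtering: %d\n', QSAll.NumberOfReads);
fprintf('Reads with filtering:    %d\n', QSFiltered.NumberOfReads);
```

Because the object is immutable, each set of filtering options requires constructing a new object rather than changing the properties of an existing one.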

Properties

FileName

Name of the file used to create the BioReadQualityStatistics object.

FileType

Type of file from which the BioReadQualityStatistics object was created. Supported file types are FASTQ, SAM, and BAM formats.

Encoding

String specifying the format of the character encoding sequence information and quality scores in the file. Supported formats are: 'Sanger', 'Illumina13', 'Illumina15', 'Illumina18', and 'Solexa'. The default format is 'Illumina18'.

CharOffset

Integer specifying the ASCII code at which the quality score scale begins for a sequence, that is, the character offset of the encoding.
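
To illustrate what the character offset means, the sketch below decodes a quality string by hand. It assumes an offset of 33, which is the CharOffset value shown for the 'Illumina18' encoding in the example output on this page. The quality string is made up, and the code is not meant to represent the internal implementation of the class.

```matlab
% Hypothetical quality string, as found on the fourth line of a FASTQ record
qualityString = 'II5+!';

% Character offset of the encoding (33 is an assumption for this sketch)
charOffset = 33;

% Subtract the offset from each ASCII code to recover the Phred scores
phredScores = double(qualityString) - charOffset;   % [40 40 20 10 0]

% Convert each Phred score Q to an error probability, P = 10^(-Q/10)
errorProbability = 10 .^ (-phredScores / 10);

% Average quality score of this read
averageQuality = mean(phredScores);                 % 22
```

An average computed this way is the kind of per-read value that a threshold such as 'QualityScoreThreshold' would be compared against.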

NumberOfReads

Integer representing the number of short-read sequences that the BioReadQualityStatistics object contains.

MaxReadLength

Integer representing the maximum length of a short-read sequence among all sequences of the BioReadQualityStatistics object.

MinEncodingPhred

Integer specifying the minimum Phred quality score [1] among all short-read sequences of the BioReadQualityStatistics object.

MaxEncodingPhred

Integer specifying the maximum Phred quality score among all short-read sequences of the BioReadQualityStatistics object.

SkipPhred

Integer specifying the number of Phred scores that are not considered in the quality score range.

PerSeqAverageQualityDist

Vector of integers representing the distribution of average quality scores per sequence.

PerPosQualities

s-by-p matrix of integers that represents quality scores (s) per base position (p).

PerSeqGCDist

Vector of integers representing the distribution of G and C nucleotides per sequence.

PerPosBaseDist

n-by-p matrix of integers that represents the distribution of all nucleotides (n = 5) per base position (p).
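
To make the G and C statistics more concrete, here is a small sketch that computes GC percentages by hand for three made-up reads of equal length. It only illustrates the quantities involved. How the class bins these values into PerSeqGCDist, and the row order of PerPosBaseDist, are not described on this page, so neither is assumed here.

```matlab
% Three hypothetical reads of equal length, one read per row
reads = ['ACGT'; 'GGCA'; 'TTGC'];

% Logical mask marking the G and C nucleotides
isGC = (reads == 'G') | (reads == 'C');

% Percentage of G and C at each base position (mean down each column)
gcPerPosition = 100 * mean(isGC, 1);   % approximately [33.3 66.7 100 33.3]

% Percentage of G and C in each read (mean across each row)
gcPerSequence = 100 * mean(isGC, 2);   % [50; 75; 50]
```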

Name

String describing the user-defined name for the object.

MaxScore

Integer representing the maximum sequence quality score among all scores.

MinScore

Integer representing the minimum sequence quality score among all scores.

FilterLength

Positive integer specifying the length of each read used in quality analysis.

QualityScoreThreshold

Scalar value specifying the minimum average quality threshold for a read. Any read whose average quality score is less than the specified threshold is ignored. The default value is -Inf, which causes all reads to be considered.

Subset

Vector of integers specifying the indices of the subset of the original sequence data used in the analysis.

Methods

plotPerPositionCountByQuality    Plot fractions of reads with Phred scores in ranges
plotPerPositionGC                Plot percentages of G or C nucleotides at each base position
plotPerPositionQuality           Plot Phred score distributions
plotPerSequenceGC                Plot G or C nucleotide distribution
plotPerSequenceQuality           Plot distribution of average quality scores
plotSummary                      Plot summary statistics of a BioReadQualityStatistics object
plotTotalGC                      Plot distribution of all nucleotides of short-read sequences

Examples

Create a BioReadQualityStatistics object and plot its summary statistics

This example shows how to create a BioReadQualityStatistics object and plot its summary statistics.

Create a BioReadQualityStatistics object from a FASTQ file, using only the first 40 characters of each read and ignoring reads whose average quality score is less than 5.

QSObj = BioReadQualityStatistics('SRR005164_1_50.fastq', 'FilterLength',...
                                    40, 'QualityScoreThreshold', 5)
QSObj = 

  BioReadQualityStatistics with properties:

                    FileName: '/mathworks/devel/bat/Bdoc14b/build/matlab/t...'
                    FileType: 'FASTQ'
                    Encoding: 'Illumina18'
                  CharOffset: 33
               NumberOfReads: 50
               MaxReadLength: 40
            MinEncodingPhred: 0
            MaxEncodingPhred: 62
                   SkipPhred: []
    PerSeqAverageQualityDist: [1x62 double]
             PerPosQualities: [63x40 double]
                PerSeqGCDist: [0 0 0 0 3 3 8 5 9 7 6 5 2 2 0 0 0 0 0 0]
              PerPosBaseDist: [5x40 double]
                        Name: ''
                    MaxScore: 34
                    MinScore: 1
                FilterLength: 40
       QualityScoreThreshold: 5
                      Subset: NaN

Plot the summary statistics of the object.

plotSummary(QSObj)
ans =

    1.0099
    2.0099
    3.0099
    4.0099
    5.0099
    6.0099
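
Individual quality control plots can be produced from the same object with the other methods listed in the Methods section. The following is a minimal sketch, assuming that the sample file is on the MATLAB path and that each method accepts the object as its only required input, as plotSummary does above. The calls to figure are added here only so that each plot appears in its own window.

```matlab
% Recreate the object used in the example above
QSObj = BioReadQualityStatistics('SRR005164_1_50.fastq', ...
    'FilterLength', 40, 'QualityScoreThreshold', 5);

% Phred score distribution at each base position
figure
plotPerPositionQuality(QSObj)

% Percentage of G or C nucleotides at each base position
figure
plotPerPositionGC(QSObj)

% Distribution of average quality scores across reads
figure
plotPerSequenceQuality(QSObj)
```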

References

[1] Wikipedia. (2012). Phred quality score, http://en.wikipedia.org/wiki/Phred_quality_score
