seqsplit

Split sequences into separate files based on barcodes

Syntax

seqsplit(fastqFile,barcodeFile)
seqsplit(___,Name,Value)
[outFiles,N] = seqsplit(___)

Description

seqsplit(fastqFile,barcodeFile) splits the sequences in fastqFile according to the barcodes in barcodeFile and saves them in separate files. By default, each output file name consists of the input file name followed by the suffix '_split' and the barcode identifier. Sequences that do not match any provided barcode, or that match multiple barcodes ambiguously, are saved in a file with the suffix '_unmatched' instead of the barcode identifier.

seqsplit(___,Name,Value) uses additional options specified by one or more Name,Value pair arguments.

[outFiles,N] = seqsplit(___) returns the names of the output files in a cell array outFiles. N is a vector containing the number of sequences saved in each output file.

Examples

Create a tab-delimited file with barcode IDs and barcode sequences.

barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
writetable(cell2table(barcodeInfo), 'barcodeExample.txt', ...
        'Delimiter', '\t', 'WriteVariableNames', false);

Split sequences into separate output files based on the barcode sequences. By default, the function assumes that the barcode is located at the 5' end of each sequence, and no mismatches are allowed during barcode matching.

[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt');

Check the number of sequences in each output file after splitting.

N
N =

     2
     1
     1

Allow up to two mismatches during barcode matching, and specify a custom output suffix.

[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt', ...
        'MaxMismatches',2,'OutputSuffix','_MM2_split');
N
N =

     5
     9
     5

Input Arguments

fastqFile

Names of FASTQ-formatted files with sequence and quality information, specified as a character vector or cell array of character vectors.

Example: 'SRR005164_1_50.fastq'

barcodeFile

Name of the file with barcode information, specified as a character vector. The file must be tab-delimited, with each barcode ID followed by its barcode sequence. All barcode sequences must have the same length.

Example: 'barcodeExample.txt'
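
Because all barcode sequences must have the same length, it can help to check a barcode file before splitting. The following is a minimal sketch of such a check and is not part of seqsplit. It assumes the two-column, tab-delimited layout described above, and the variable names (barcodes, ids, seqs, lens, allSameLength) are illustrative.

```matlab
% Write a small tab-delimited barcode file (same data as the example on this page).
barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
writetable(cell2table(barcodeInfo), 'barcodeExample.txt', ...
    'Delimiter', '\t', 'WriteVariableNames', false);

% Read the file back as two text columns: barcode ID and barcode sequence.
barcodes = readtable('barcodeExample.txt', 'Delimiter', '\t', ...
    'ReadVariableNames', false);
ids  = barcodes{:, 1};   % cell array of barcode IDs
seqs = barcodes{:, 2};   % cell array of barcode sequences

% Check that every barcode sequence has the same length.
lens = cellfun(@length, seqs);
allSameLength = all(lens == lens(1));
```

If allSameLength is false, the file does not satisfy the requirement stated above and should be corrected before it is passed to seqsplit.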

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'MaxMismatches',2 allows up to 2 mismatches during barcode matching.

'MaxMismatches'

Maximum number of mismatches allowed during barcode matching, specified as a nonnegative integer. The default is 0, meaning that no mismatches are allowed.

Example: 'MaxMismatches',2
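
To make the idea of a mismatch threshold concrete, the following sketch counts position-by-position differences between a barcode and the start of a read. This page does not describe how seqsplit matches barcodes internally, so the comparison shown here is an assumption for illustration only. The read and all variable names are hypothetical.

```matlab
% Illustrative only: count mismatches between a barcode and the 5' end of a read.
barcode = 'AGATT';
read    = 'AGCTTGGTACCA';   % hypothetical read, not taken from the sample file
maxMismatches = 1;          % plays the role of the 'MaxMismatches' value

% Take as many leading bases as the barcode is long.
prefix = read(1:numel(barcode));

% Count the positions where the two strings differ.
numMismatches = sum(prefix ~= barcode);

% The read counts as a match if it stays within the allowed threshold.
isMatch = numMismatches <= maxMismatches;
```

Here the prefix differs from the barcode in one position, so the read would be rejected with a threshold of 0 and accepted with a threshold of 1 or more.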

'BarcodeFormat'

Location of the barcode to match, specified as 3 or 5. A value of 5 corresponds to a barcode located at the 5' end of each sequence, and 3 corresponds to the 3' end. The default is 5.

Example: 'BarcodeFormat',3
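
The difference between the two settings can be sketched by selecting the bases at either end of a read. This is an illustration under assumptions, not the function's implementation: the read is hypothetical, the variable names are illustrative, and this page does not state whether seqsplit applies any further processing to a 3' barcode before comparing it.

```matlab
% Illustrative only: pick the candidate barcode region from either end of a read.
read = 'AAAACGGTACCTTGACTT';   % hypothetical read
barcodeLength = 5;
barcodeFormat = 3;             % 5 for the 5' end, 3 for the 3' end

if barcodeFormat == 5
    % Leading bases of the read.
    candidate = read(1:barcodeLength);
else
    % Trailing bases of the read.
    candidate = read(end-barcodeLength+1:end);
end
```

With barcodeFormat set to 3, candidate holds the last five bases of the read. Setting it to 5 selects the first five bases instead.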

'RemoveBarcode'

Whether to remove the barcode and the corresponding quality information from the matched sequences, specified as true or false. The default is true.

Example: 'RemoveBarcode',false
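
Removing a barcode affects both the sequence and its quality string, because the two must stay the same length. The following sketch shows that trimming step for a 5' barcode. It is an assumption-based illustration rather than the code seqsplit runs, and the read, quality string, and variable names are hypothetical.

```matlab
% Illustrative only: trim a 5' barcode from a read and its quality string.
seq  = 'AAAACGGTACCTTG';   % hypothetical read that starts with barcode AAAAC
qual = 'IIIIIHHHHGGGFF';   % hypothetical quality string, same length as seq
barcodeLength = 5;

% Drop the barcode bases and the quality characters at the same positions.
trimmedSeq  = seq(barcodeLength+1:end);
trimmedQual = qual(barcodeLength+1:end);
```

After trimming, the sequence and quality strings are each shorter by the barcode length and still align position by position.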

'WriteUnmatched'

Whether to save unmatched sequences and the corresponding quality information in a separate output file, specified as true or false. The name of this output file has the suffix '_unmatched' instead of the barcode ID.

Example: 'WriteUnmatched',true

'OutputDir'

Relative or absolute path to the output file directory, specified as a character vector. The default is the current directory.

Example: 'OutputDir','F:\results'

'OutputSuffix'

Suffix to use in the output file name, specified as a character vector. The suffix is inserted after the input file name and before the barcode ID. The default is '_split'.

Example: 'OutputSuffix','_MisMatches2_split'

'UseParallel'

Whether to perform the computation in parallel, specified as true or false.

For parallel computing, you must have Parallel Computing Toolbox™. If a parallel pool does not exist, one is created automatically when the auto-creation option is enabled in your parallel preferences. Otherwise, computation runs in serial mode.

Note

There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.

Example: 'UseParallel',true

Output Arguments

outFiles

Output file names, returned as a cell array of character vectors. By default, the name of each output file consists of the input file name followed by the output suffix ('_split') and the barcode identifier.

N

Number of sequences saved in each output file, returned as a scalar or an n-by-1 vector, where n is the number of output files. If there are multiple output files, the order of the elements in N corresponds to the order of the output files.

Introduced in R2016b
