For developers
Adding new protocols to dropTag
For each new protocol you need to write a new C++ class that inherits from TagsFinderBase
(“TagsSearch/TagsFinderBase.cpp”). There are several examples for
existing protocols (see all classes with the suffix “TagsFinder”). All you
really need to do there is define the function parse_fastq_records. After
that, add the new protocol type to the function get_tags_finder
in “droptag.cpp”. It takes only a few lines, e.g.:
if (protocol_type == "ddSEQ")
return ...
With this in place, you can set “TagsSearch/protocol” to “ddSEQ” in config.xml and work with ddSEQ like any other supported protocol, with the same efficient, parallel processing.
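As a rough illustration of these two steps, here is a minimal, self-contained sketch. Everything in it is a hypothetical stand-in: FastqRecord, ParsedRead, ToyTagsFinder, the "toy" protocol name and the signature of parse_fastq_records are made up for the example and do not match the real declarations in TagsSearch. The real get_tags_finder also builds the finder from the config rather than from hard-coded lengths.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real types; the actual signatures differ.
struct FastqRecord { std::string id, sequence, quality; };

struct ParsedRead {
    std::string cell_barcode, umi, umi_quality;
    FastqRecord gene;
};

class TagsFinderBase {
public:
    virtual ~TagsFinderBase() = default;
    // Turns one synchronized group of records (one per input file)
    // into a single gene record with its parameters.
    virtual bool parse_fastq_records(const std::vector<FastqRecord> &records,
                                     ParsedRead &out) const = 0;
};

// Illustrative layout only: file 1 holds cell barcode + UMI, file 2 the gene read.
class ToyTagsFinder : public TagsFinderBase {
public:
    ToyTagsFinder(std::size_t barcode_length, std::size_t umi_length)
        : _barcode_length(barcode_length), _umi_length(umi_length) {}

    bool parse_fastq_records(const std::vector<FastqRecord> &records,
                             ParsedRead &out) const override {
        if (records.size() != 2) return false;
        const FastqRecord &tag = records[0];
        const std::size_t tag_length = _barcode_length + _umi_length;
        if (tag.sequence.size() < tag_length || tag.quality.size() < tag_length)
            return false;  // read too short to contain barcode and UMI

        out.cell_barcode = tag.sequence.substr(0, _barcode_length);
        out.umi = tag.sequence.substr(_barcode_length, _umi_length);
        out.umi_quality = tag.quality.substr(_barcode_length, _umi_length);
        out.gene = records[1];
        return true;
    }

private:
    std::size_t _barcode_length, _umi_length;
};

// Factory sketch: one branch per protocol, as in the snippet above.
std::unique_ptr<TagsFinderBase> get_tags_finder(const std::string &protocol_type) {
    if (protocol_type == "toy")
        return std::unique_ptr<TagsFinderBase>(new ToyTagsFinder(4, 3));
    throw std::runtime_error("Unknown protocol: " + protocol_type);
}
```

The point of the design is that the protocol-specific class only describes how to split reads into barcode, UMI and gene parts; reading, compression, output and threading stay in the base class.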
General workflow of dropTag under the hood
The entry point is the main() function in “droptag.cpp”. Most of it just parses
CLI parameters and sets up logging, so there are only two important places:
(1) get_tags_finder, a factory-like function that creates a TagsFinder for a
specific protocol based on the config, and (2) finder->run, the TagsFinder method
that does the actual work.
The workflow of finder->run is the following:
- Fastq records are read from the files that contain gene and barcode reads. Reading is performed synchronously across all files, because records in different files correspond to each other.
- This set of fastq records is parsed into a single gene record with its parameters
(i.e. cell barcode, UMI and its quality). There are two options for storing the
parameters: in the read name of the gene record or in a
separate data structure. In the latter case, this information is stored as a gzipped
tsv file (see the -s option).
- The parsed record is converted to a string and gzipped.
- The gzipped data is written to the output file.
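The two storage options from the second step can be sketched as below. This is only an illustration under stated assumptions: the struct names, the `!` and `#` separators in the read name and the column order of the tsv line are hypothetical and are not taken from the dropTag sources. The sketch keeps only the barcode and UMI in the read name, and the gzip step is left out.

```cpp
#include <string>

// Hypothetical containers for one parsed read.
struct ReadParams { std::string cell_barcode, umi, umi_quality; };
struct GeneRecord { std::string id, sequence, quality; };

// Option 1: keep the parameters in the read name of the gene record.
// The '!' and '#' separators are assumptions made for this example.
std::string fastq_with_params_in_name(const GeneRecord &gene, const ReadParams &params) {
    return "@" + gene.id + "!" + params.cell_barcode + "#" + params.umi + "\n" +
           gene.sequence + "\n+\n" + gene.quality + "\n";
}

// Option 2: keep the gene record untouched and emit the parameters as one
// tab-separated line, which would go to a separate (gzipped) tsv file.
std::string params_tsv_line(const GeneRecord &gene, const ReadParams &params) {
    return gene.id + "\t" + params.cell_barcode + "\t" + params.umi + "\t" +
           params.umi_quality + "\n";
}
```

In both cases the result is a plain string, which is what the third step then compresses.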
The actual code is more complicated than that, because it is multi-threaded
and a simple “parallel map” style does not work here.
The parallelism looks more like MPI (though of course it is implemented with C++11
threads) and works as follows (see the TagsFinderBase::run_thread
function):
- All data is stored in concurrent queues with a limited maximum size.
- Each worker independently iterates over all 4 tasks and checks whether it can do some work on each. If a task is single-threaded and another worker is already doing it, or if the corresponding queue is already full, the worker moves on to the next task.
This scheme achieves roughly 10 times higher performance.
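The worker scheme described above can be sketched with plain C++11 threads. This is a toy model, not the code of TagsFinderBase::run_thread: BoundedQueue, ToyPipeline and the step functions are hypothetical names, reading and writing are treated as the single-threaded tasks by assumption, and parsing and gzipping are replaced by simple string prefixes.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Thread-safe queue with a soft size limit: workers check full() before
// starting a task, so the queue can overshoot by at most one item per worker.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t max_size) : _max_size(max_size) {}
    bool full() const {
        std::lock_guard<std::mutex> lock(_mutex);
        return _items.size() >= _max_size;
    }
    void push(T value) {
        std::lock_guard<std::mutex> lock(_mutex);
        _items.push_back(std::move(value));
    }
    bool try_pop(T &out) {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_items.empty()) return false;
        out = std::move(_items.front());
        _items.pop_front();
        return true;
    }
private:
    mutable std::mutex _mutex;
    std::deque<T> _items;
    std::size_t _max_size;
};

// Toy pipeline with 4 tasks: read -> parse -> pack -> write.
// queue_size must be at least 1, otherwise no task can ever start.
class ToyPipeline {
public:
    ToyPipeline(std::vector<std::string> input, std::size_t queue_size)
        : _input(std::move(input)), _raw(queue_size), _parsed(queue_size), _packed(queue_size) {}

    std::vector<std::string> run(unsigned n_threads) {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n_threads; ++i)
            workers.emplace_back(&ToyPipeline::run_thread, this);
        for (std::thread &worker : workers) worker.join();
        return _output;
    }

private:
    // Each worker cycles over all tasks until every record read was written.
    void run_thread() {
        while (!(_read_done.load() && _n_written.load() == _n_read.load())) {
            bool worked = read_step();
            worked = transform_step(_raw, _parsed, "parsed:") || worked;
            worked = transform_step(_parsed, _packed, "packed:") || worked;
            worked = write_step() || worked;
            if (!worked) std::this_thread::yield();
        }
    }
    // Single-threaded task: skipped if another worker already holds the lock.
    bool read_step() {
        std::unique_lock<std::mutex> lock(_read_mutex, std::try_to_lock);
        if (!lock.owns_lock() || _read_done.load() || _raw.full()) return false;
        if (_next_input == _input.size()) { _read_done = true; return false; }
        _raw.push(_input[_next_input++]);
        ++_n_read;
        return true;
    }
    // Multi-threaded task: skipped if the target queue is full or the source is empty.
    bool transform_step(BoundedQueue<std::string> &from, BoundedQueue<std::string> &to,
                        const std::string &prefix) {
        std::string item;
        if (to.full() || !from.try_pop(item)) return false;
        to.push(prefix + item);
        return true;
    }
    // Single-threaded task: only one worker at a time appends to the output.
    bool write_step() {
        std::unique_lock<std::mutex> lock(_write_mutex, std::try_to_lock);
        std::string item;
        if (!lock.owns_lock() || !_packed.try_pop(item)) return false;
        _output.push_back(std::move(item));
        ++_n_written;
        return true;
    }

    std::vector<std::string> _input;
    std::size_t _next_input = 0;
    std::vector<std::string> _output;
    BoundedQueue<std::string> _raw, _parsed, _packed;
    std::mutex _read_mutex, _write_mutex;
    std::atomic<bool> _read_done{false};
    std::atomic<std::size_t> _n_read{0}, _n_written{0};
};
```

For example, `ToyPipeline(records, 8).run(4)` processes the records with four workers. In this sketch the output order may differ from the input order, because the middle tasks run concurrently. No worker is tied to one stage, so a slow stage automatically attracts more workers, while the queue limits keep memory use bounded.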