RabbitFX

RabbitFX Tutorial

Installation

Dependency

  1. c++11

  2. zlib

  3. ISA-L

Using Cmake

Copy folder io to your program, and then you can integrate RabbitFX in your CMakeLists.txt:

AUX_SOURCE_DIRECTORY(. SOURCE_LIST)
ADD_LIBRARY(rabbitfx ${SOURCE_LIST})
TARGET_LINK_LIBRAIES(rabbitfx z)

Using build libray

mkdir build
cmake .. -DCMAKE_INSTALL_PREFIX=/your/install/path
make && make install
g++ -std=c++11 YOURFILE.cpp -I/your/install/path/include -L/your/install/path/lib -lrabbitfx -lz -lpthread

Build with libisal libray for processing gziped file faster

  • step1: install ISA-L

  • step2: build RabbitFX

mkdir build
cmake .. -DCMAKE_INSTALL_PREFIX=/your/install/path -DIGZIP_PREFIX=/path/to/libisal
make && make install
g++ -std=c++11 YOURFILE.cpp -I/your/install/path/include -L/your/install/path/lib -lrabbitfx -lz -lpthread

Highlight

  • RabbitFX highly support multi-core paltform

  • RabbitFX efficiency processing FASTQ/FASTA files

  • Concurrency data pool and data queue.

  • Non-copy read format

  • Bit-based sequencing processing

Illustration

Pipeline

Case study

Runing Example in main.cpp and TestCount.cpp

cd RabbitFX
mkdir build && cd build
cmake ..
make
#then there is an test file in build file
time ./test
time ./testcount

Note: We integtated CLI as default command line patser for user convenience

FASTQ data example

Single-end data processing example

  1. Example of define a mult-threading task

int test_fastq_se(int argc, char** argv){
  std::string file = "/home/old_home/haoz/workspace/QC/out_1.fq";
  //---------------cmd parser----------------
  CLI::App app{"Wellcome to RabbitFX"};
  CLI::Option* opt;
  std::string filename ;
  int th;
  app.add_option("-f, --file", filename, "input file name")
    ->default_val(file);
  app.add_option("-t, --threads", th, "worktreads")
    ->default_val(2);
  //----------------------------------------
  CLI11_PARSE(app, argc, argv);
  if(app.count("-f"))
    std::cout << "filename: " << filename << std::endl;
  else{
    std::cout << "-f not find, use default: " << filename << std::endl;
  }
  rabbit::fq::FastqDataPool *fastqPool = new rabbit::fq::FastqDataPool(32, 1<<22);
  rabbit::core::TDataQueue<rabbit::fq::FastqDataChunk> queue1(64, 1);
  std::thread producer(producer_fastq_task, filename, fastqPool, std::ref(queue1));
  std::thread** threads = new std::thread*[th];
  for(int t = 0; t < th; t++){
    threads[t] = new std::thread(std::bind(consumer_fastq_task, fastqPool, std::ref(queue1)));
  }
  producer.join();
  for(int t = 0; t < th; t++){
    threads[t]->join();
  }
  //-----free
  delete fastqPool;
  delete[] threads;
  return 0;
}
  1. example of define producer and consumer task

int producer_fastq_task(std::string file, rabbit::fq::FastqDataPool* fastqPool, rabbit::core::TDataQueue<rabbit::fq::FastqDataChunk> &dq){
  rabbit::fq::FastqFileReader *fqFileReader;
  fqFileReader = new rabbit::fq::FastqFileReader(file, *fastqPool);
  rabbit::int64 n_chunks = 0;
  while(true){
    rabbit::fq::FastqDataChunk* fqdatachunk;// = new rabbit::fq::FastqDataChunk;
    fqdatachunk = fqFileReader->readNextChunk();
    if (fqdatachunk == NULL) break;
    n_chunks++;
    //std::cout << "readed chunk: " << n_chunks << std::endl;
    dq.Push(n_chunks, fqdatachunk);
  }
  dq.SetCompleted();
  std::cout << "file " << file << " has " << n_chunks << " chunks" << std::endl;
  return 0;
}

void consumer_fastq_task(rabbit::fq::FastqDataPool* fastqPool, rabbit::core::TDataQueue<rabbit::fq::FastqDataChunk> &dq){
    long line_sum = 0;
    rabbit::int64 id = 0;
    std::vector<neoReference> data;
    rabbit::fq::FastqDataChunk* fqdatachunk;// = new rabbit::fq::FastqDataChunk;
    data.resize(10000);
    while(dq.Pop(id, fqdatachunk)){
      line_sum += rabbit::fq::chunkFormat(fqdatachunk, data);
      fastqPool->Release(fqdatachunk);
    }
    std::cout << "line_sum: " << line_sum << std::endl;
}

Pair-end data processing example

An example of processing Pair-end sequencing data is showed in file TestCount.cpp. It is tested that compared to FQReader, in the task of counting ATCG of pair-end data, RabbitFX is 2 times faster in 20 thread.

RabbitFX is about 2G/s I/O speed now

FASTA data example

this is an example of reading and processing FASTA file

  • example code of using only one thread (count chunk number of input file):

int proces_fasta_task(std::string file) {
  rabbit::fa::FastaDataPool *fastaPool = new rabbit::fa::FastaDataPool(256, 1 << 22);
  rabbit::fa::FastaFileReader *faFileReader;
  faFileReader = new rabbit::fa::FastaFileReader(file, *fastaPool, false);
  int n_chunks = 0;
  int line_sum = 0;
  while (true) {
    rabbit::fa::FastaChunk *fachunk = new rabbit::fa::FastaChunk;
    fachunk = faFileReader->readNextChunkList();
    if (fachunk == NULL) break;
    n_chunks++;
    //-----relaease
    rabbit::fa::FastaDataChunk *tmp = fachunk->chunk;
    do {
      fastaPool->Release(tmp);
      tmp = tmp->next;
    } while (tmp != NULL);
    // line_sum += rabbit::fa::chunkFormat(*fachunk, data);
  }
  std::cout << "file " << file << " has " << line_sum << " lines" << std::endl;
  return 0;
}

int test_fasta(int argc, char** argv){
  producer_fasta_task("/home/old_home/haoz/workspace/data/hg19/hg19.fa");
  return 0;
}

Cite

RabbitFX paper is under review now.