Search / Alignment FAQ

Informal FAQ: Some frequently asked questions about the alignment algorithm, some common misconceptions, and some things that people should ask.

Use the  form below to add questions.

 

Need

Q: Another tool, really?

A: This is not a tool (yet), this is a different algorithmic approach.
Since this is a new approach, one can write tools with new properties. Some of these properties can be speed and design flexibility.
There is a prototype implementation, which demonstrates some properties and works like an alignment tool, but I am not proposing that this prototype should replace existing tools as is.

Also: traditional alignment is probably not the most interesting application for this algorithm. What other things can be done when this type of string search is “free?”

 

Q: Why would anyone need faster algorithms? Do alignment or assembly take any significant computation resources or slow down any step of the analysis?

A: Sequencing throughput is rapidly growing. Speed is a problem in some applications  now, and it is likely to become a bigger problem in the future.
This algorithm is designed mainly for problems where search is a real limiting factor: large repetetive references and large datasets.

 

Q: I am not interested in algorithms, theory, prototypes, DNA statistics, speed, flexibility or new applications.

A: At this point, I am wasting your time.

 

Implementation

Q: How fast can it be?

A: The current first implementation is faster than comparable tools. The implementation does not use SIMD/SSE and only indexes one direction of the reference. Some later variations of the implementation demonstrate some of the ways to increase the speed and accuracy.
The efficient mature software will be A LOT faster.

 

 Q: These algorithms require a supercomputer!

A: The memory requirement is roughly proportional to the size of the reference. Alignment to a human genome require less than 16GB of RAM. If memory is an issue, there are versions that require much less RAM. More RAM can give more flexibility and better performance.
RAM is cheap. One can buy 16GB of RAM for as little as $50 (Dec 2012, and probably less by Dec 2013). If the speed and flexibility do not justify the investment in additional RAM, one can use a version that requires less memory, but the task is probably small enough that one can use alternative solutions.

 

Q: The program does not have a name.

A: The current implementation is a basic prototype. I don’t want people to mistake it for a mature tool, so I didn’t give it a name. Feel free to propose names in the comments below.

 

Q: Why do you sometimes say “approximate alignment” or “mapping to the right area”.

A: When I measure alignment results, I sometimes count an alignments as correct even if they are a few base-pairs off. I do this because different programs (not only the prototype) have slightly different reporting standards. In any case, the shifts are so small that the approximate alignment can easily be fixed before it is used (contact me if you are looking for an algorithm for this).
Some versions of the first prototype can be a few base-pairs off in some cases of indels. This is a known issue in the prototype, that should be fixed in future versions, but it does not affect the results.

 

Q: Why doesn’t the tool do X?

A: For the same reason it does not do Y, because it is not a tool, but rather a prototype. If you have an interesting experiment involving X, contact me.

 

Q: I tried to use the prototype for X, and it segfaulted.

A: Great, you found a bug in the prototype. If you tried the example in the readme file, and if your the input files are similar in format to the example, let me know about the problem.
This is not a mature tool. It is a prototype implementation, designed to demonstrate permutations-based algorithms. If you are looking for ways to crash this prototype, I will provide you with many ideas.

 

Q: Why isn’t the output of the prototype 100% standard?

A: So that users do not try to plug it into existing pipelines without understanding the properties of this implementation.

 

Q: The prototype is not fully randomized.

A: True. A better implementation should manage indexes better. This prototype still demonstrates the key properties.

 

Q: It takes a long time to index a reference.

A: Indexes are created once and used many times, so the indexing step was never optimized. Indexing can be done much faster.

 

Q: It takes a long time to load a large index.

A: If one has a large alignment project, this is negligible.

 

Q: Where can I download the tool?

A: The “software” page. Contact me if you would like to experiment with other versions.

 

Search Capabilities

Q: It can’t do indels.

A: Yes. and no. The basic permutations based algorithm is designed for mismatches. However, we have demonstrated some of the many ways to handle indels in many scenarios. There are additional solutions for indels, please contact me if you are looking for better solutions for indels.
For the case of homopolymer-length-errors (454, ionTorrent) we showed that the combination of permutations-based alignment and “Homopolymer-Length-Filters” produce a very fast and very sensitive aligner.
This is not the algorithm for traditional alignment of reads with many random indels, like “pacbio” reads. But we are examining different applications for “pacbio”.

 

Q: The algorithm requires all reads to have the same length.

A: The basic algorithm does require a uniform read length, because each index is designed for some read length.
In practice, each index can be used for a range of lengths, so it is enough to have several indexes for several ranges of length. When a set of reads contains reads of very different lengths, it can be divided into several subsets, and each subset can be aligned using the appropriate index.

 

Comparisons

Q: X is better at Y, if you measure Z.

A: Probably. Permutations-based algorithms are not better than everything else in every possible way. The first implementation of a permutation-based algorithm is definitely not better than existing tools in every way. Permutations based algorithms may be appropriate for V, or when you measure W, or if you have a large scale problem and finite computation resources.

 

Q: Why didn’t you compare to X?

A: For the same reason I didn’t compare to Y. The first prototype was designed to demonstrate the key properties of permutations-based algorithms, which I believe it does. The prototype was not designed to replace existing software, so exhaustive comparisons add very little to the discussion.

In fact, I probably did compare to X and verified that permutations-based algorithms are advantageous in some ways in various scenarios.

 

Q: Why didn’t you measure X?

A: Usually, because measure X was designed to compare between prefix-tree or hash-table programs.

In fact, I probably did examine X and found that permutations-based algorithms perform well in terms X.

 

Q: Can one do anything to reduce the error probability even more?

A: Yes! permutations-based algorithms provide plenty of statistical information that can be used. There are many ways in which permutations-based algorithms can identify and fix errors using information that is usually not available to other software – more on this in a future post/report.

 

Terminology

Q:Why do you say character / nucleotides / base-pairs instead of base-pairs / character / nucleotides?

A: To have a uniform terminology when I describe the algorithm and the application to biology.

 

Q: Isn’t this just a traditional prefix-tree / hash-table / gapped-hashing algorithm?

A: No, but the classification is terminology.

 

Q: Why does the tool …. 

A: Because it is not a tool.

 

Q: Why doesn’t the tool ….

A: Because it is not a tool.

 

Q: You keep calling the tool “a prototype” and “an implementation”, but I want a tool.

A: Known bug.