Nullomers
Background
The human genome is made up of approximately 3.2 billion ‘A’s, ‘C’s, ‘T’s, and ‘G’s (these are bases). K-mers are sequences within the human genome of length k. This means that there are 4^k possible k-mers for an arbitrary length. If the human genome were completely random at the macro scale, trivial probability tells us that that all k-mers up to k = 15 should exist. But our genome is not random, it has undergone selection since the first organisms lived off hydrothermal vent fumes. In practice, the shortest nullomers (k-mers not present in a genome) occur at length 11.
The naive method for finding nullomers, used in most academic papers on the topic, is to generate all possible k-mers, count the frequency of k-mers contained within a genome string, and return all k-mers that do not exist. An example of this is found here here. I propose two minor improvement methods that are much faster.
Sets Approach
The first approach is a simple set-based approach. Generate A: the set of all k-mers and B: the set of all k-mers within the genome. Nullomers equals the set difference A-B.
fasta_string = load_fasta_as_string("path")
k = 11 #nullomer length
all_nmers = {fasta_string[i:i + k] for i in range(len(fasta_string) - k + 1)}
all_possible_nmers = {''.join(nmer) for nmer in product("ATCG", repeat=k)}
nullomers = all_possible_nmers - all_nmers_set
If you plan to keep your codebase in purely python, this is the fastest reasonable approach. I see a roughly 2x speedup on my laptop with this approach, taking 15-20 minutes as opposed to 30+ for a 3.2 billion length genome with k = 11. But we can get faster.
Bitset Approach
As there are only 4 bases, we can use a bitset of size 4^k (from 2^(2k)) to contain all bases. For k = 11, we create a bitset of size 4194304. Any arbitrary k-mer can be read to and from a unique address in the bitset.
constexpr size_t dna_to_index(const std::string& dna) {
size_t index = 0;
for (char c : dna) {
index <<= 2;
switch (c) {
case 'A': index |= 0b00; break;
case 'C': index |= 0b01; break;
case 'T': index |= 0b10; break;
case 'G': index |= 0b11; break;
default: throw std::invalid_argument("Invalid character: " + c);
}
}
return index;
}
std::string index_to_dna(size_t index, size_t length) {
std::string dna;
for (size_t i = 0; i < length; ++i) {
size_t nucleotides = index & 0b11;
switch (nucleotides) {
case 0b00: dna = 'A' + dna; break;
case 0b01: dna = 'C' + dna; break;
case 0b10: dna = 'T' + dna; break;
case 0b11: dna = 'G' + dna; break;
}
index >>= 2;
}
return dna;
}
We then simply iterate over a DNA string, setting the value at the corresponding index of each sequence to 1. All the bitset indices that contain a zero will be our nullomers.
std::string fasta = load_fasta("path");
for (size_t i = 0; i <= fasta.size() - 11; i++) {
b.set(dna_to_index(fasta.substr(i, 11)));
}
//Write output to file
std::ofstream outputFile("nullomer_ouput.txt");
for (size_t i = 0; i < 4194304; i++) {
if (!b.test(i)) {
std::string dnaSequence = index_to_dna(i, 11);
outputFile << dnaSequence << std::endl;
}
}
This runs in around 9 minutes, but has the major advantage of being very easy to run in parallel. This code is linked in the next section.
The parallel version takes just over a minute to run, and can probably be optimized further by someone better with concurrency. This speedup makes the collection of nullomer data computationally trivial, which hopefully allows better usage of them. Read about some use cases of nullomers in the next section.
A final speedup which I have not yet worked on is to use a file-encoding for DNA that uses the 4-bit representation for bases.
Links and Further Reading
Python Data Loading (Used to generate files for CPP)
Good Papers (Not Me):