Using suffix array algorithm for Burrows Wheeler transform - c++

I've successfully implemented a BWT stage (using regular string sorting) for a compression testbed I'm writing. I can apply the BWT and then the inverse BWT transform, and the output matches the input. Now I wanted to speed up creation of the BW index table using suffix arrays. I have found 2 relatively simple, supposedly fast O(n) algorithms for suffix array creation, DC3 and SA-IS, which both come with C++/C source code. I tried using the sources (an out-of-the-box compiling SA-IS source can also be found here), but failed to get a proper suffix array / BWT index table out. Here's what I've done:
T=input data, SA=output suffix array, n=size of T, K=alphabet size, BWT=BWT index table
I work on 8-bit bytes, but both algorithms need a unique sentinel / EOF marker in the form of a zero byte (DC3 needs 3, SA-IS needs one), so I convert all my input data to 32-bit integers, increase all symbols by 1 and append the sentinel zero values. This is T.
I create an integer output array SA (of size n for DC3, n+1 for SA-IS) and apply the algorithms. I get results similar to my sorting BWT transform, but some values are wrong (see UPDATE 1). Also the results of the two algorithms differ slightly. The SA-IS algorithm produces an excess index value at the front, so all results need to be copied left by one index (SA[i] = SA[i+1]).
To convert the suffix array to the proper BWT indices, I subtract 1 from the suffix array values and take the result modulo n, which should give the BWT indices (according to this): BWT[i] = (SA[i] - 1) % n.
This is my code to feed the SA algorithms and convert to BWT. You should be able to more or less just plug in the SA construction code from the papers:
std::vector<int32_t> SuffixArray::generate(const std::vector<uint8_t> & data)
{
    std::vector<int32_t> SA;
    if (data.size() >= 2)
    {
        //--- DC3 variant ---
        //copy data over. we need to append 3 zero bytes,
        //as the algorithm expects T[n]=T[n+1]=T[n+2]=0
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K]
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 3, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 3), [](int32_t & n){ n++; });
        SA.resize(data.size());
        SA_DC3(T.data(), SA.data(), data.size(), 256);

        //--- OR the SA-IS variant ---
        //copy data over. we need to append a zero byte,
        //as the algorithm expects T[n-1]=0 (where n is the size of its input data)
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K]
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 1, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 1), [](int32_t & n){ n++; });
        SA.resize(data.size() + 1); //crashes if not one extra byte at the end
        SA_IS((unsigned char *)T.data(), SA.data(), data.size() + 1, 256, 4); //algorithm expects size including sentinel
        std::rotate(SA.begin(), std::next(SA.begin()), SA.end()); //rotate left by one to get same result as DC3
        SA.resize(data.size());
    }
    else
    {
        SA.push_back(0);
    }
    return SA;
}
void SuffixArray::toBWT(std::vector<int32_t> & SA)
{
    //map each suffix index to the BWT index: BWT[i] = (SA[i] - 1) mod n
    const int32_t n = static_cast<int32_t>(SA.size());
    std::for_each(SA.begin(), SA.end(), [n](int32_t & v){ v = (v == 0) ? (n - 1) : (v - 1); });
}
What am I doing wrong?
UPDATE 1
When applying the algorithms to short amounts of test text data like "yabbadabbado" / "this is a test." / "abaaba" or a big text file (alice29.txt from the Canterbury corpus) they work fine. Actually the toBWT() function isn't even necessary.
When applying the algorithms to binary data from a file containing the full 8-bit byte alphabet (an executable etc.), they don't seem to work correctly. Comparing the results of the algorithms to those of the regular BWT indices, I notice erroneous indices (4 in my case) at the front. The number of indices (coincidentally?) corresponds to the recursion depth of the algorithms. The indices point to where the original source data had the last occurrences of 0s (before I converted them to 1s when building T)...
UPDATE 2
There are more differing values when I binary-compare the regular BWT array and the suffix array. This might be expected, as AFAIR the suffix-array ordering does not necessarily have to match that of a standard sort, BUT the data transformed by the two arrays should be the same. It is not.
UPDATE 3
I tried modifying a simple input string until both algorithms "failed". After changing two bytes of the string "this is a test." to 255 or 0 (from 74686973206973206120746573742Eh to e.g. 746869732069732061FF74657374FFh; the last byte has to be changed!) the indices and the transformed string are not correct anymore. It also seems to be enough to change the last character of the string to a character already occurring in the string, e.g. "this is a tests" 746869732069732061207465737473h. Then two indices and two characters of the transformed strings will be swapped (comparing the regular sorting BWT and the BWT that uses SAs).
I find the whole process of having to convert the data to 32-bit integers a bit awkward. If somebody has a better solution (a paper, or better yet, some source code) to generate a suffix array DIRECTLY from a string with a 256-character alphabet, I'd be happy.

I have now figured this out. My solution was two-fold. Some people suggested using a library, which I did: SAIS-lite by Yuta Mori.
The real solution was to duplicate and concatenate the input string and run the SA generation on this doubled string. When saving the output string you need to filter out all SA indices above the original data size. This is not an ideal solution, because you need to allocate twice as much memory, copy the data twice and run the transform on twice the amount of data, but it is still 50-70% faster than std::sort. If you have a better solution, I'd love to hear it.
You can find the updated code here.
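For illustration, here is a minimal sketch of that doubled-input trick (this is not the linked code; it reuses the generate() function from the question, assumes it can be called like this, and produces the transformed bytes directly instead of an index table):

std::vector<uint8_t> bwtViaDoubledInput(const std::vector<uint8_t> & data)
{
    const std::size_t n = data.size();
    std::vector<uint8_t> doubled(data);
    doubled.insert(doubled.end(), data.begin(), data.end()); //the input twice in a row

    std::vector<int32_t> SA = SuffixArray::generate(doubled); //SA of the doubled string

    std::vector<uint8_t> bwt;
    bwt.reserve(n);
    for (int32_t idx : SA)
    {
        if (static_cast<std::size_t>(idx) < n) //keep only suffixes starting in the first copy
            bwt.push_back(data[(idx + n - 1) % n]); //character preceding this rotation
    }
    return bwt;
}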

Related

Given a string, find two identical subsequences with consecutive indexes C++

I need to construct an algorithm (not necessarily efficient) that, given a string, finds and prints two identical subsequences (by print I mean color them, for example). What's more, the union of the sets of indexes of these two subsequences has to be a set of consecutive natural numbers (a full segment of integers).
In mathematics, the thing I am looking for is called "tight twins", if that helps. (E.g., see the paper (PDF) here.)
Let me give a few examples:
1) consider string 231213231
It has two subsequences I am looking for in the form of "123". To see it better look at this image:
The first subsequence is marked with underlines and the second with overlines. As you can see they have all the properties I need.
2) consider string 12341234
3) consider string 12132344.
Now it gets more complicated:
4) consider string: 13412342
It is also not that easy:
I think that these examples explain well enough what I meant.
I've been thinking a long time about an algorithm that could do that but without success.
For coloring, I wanted to use this piece of code:
#include <windows.h>

HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hConsole, k);

where k is the color.
Any help, even hints, would be highly appreciated.
Here's a simple recursion that tests for tight twins. When there's a duplicate, it splits the decision tree in case the duplicate is still part of the first twin. You'd have to run it on each substring of even length. Other optimizations for longer substrings could include hashing tests for char counts, as well as matching the non-duplicate portions of the candidate twins (characters that only appear twice in the whole substring).
Explanation of the function:
First, a hash is created with each character as key and the indexes it appears in as values. Then we traverse the hash: if a character count is odd, the function returns false; and indexes of characters with a count greater than 2 are added to a list of duplicates - characters half of which belong in one twin but we don't know which.
The basic rule of the recursion is to only increase i when a match for it is found later in the string, while maintaining a record of chosen matches (js) that i must skip without looking for a match. It works because if we find n/2 matches, in order, by the time j reaches the end, that's basically just another way of saying the string is composed of tight twins.
JavaScript code:
function isTightTwins(s){
  var n = s.length,
      char_idxs = {};
  for (var i=0; i<n; i++){
    if (char_idxs[s[i]] == undefined){
      char_idxs[s[i]] = [i];
    } else {
      char_idxs[s[i]].push(i);
    }
  }
  var duplicates = new Set();
  for (var i in char_idxs){
    // character with odd count
    if (char_idxs[i].length & 1){
      return false;
    }
    if (char_idxs[i].length > 2){
      for (let j of char_idxs[i]){
        duplicates.add(j);
      }
    }
  }
  function f(i,j,js){
    // base case positive
    if (js.size == n/2 && j == n){
      return true;
    }
    // base case negative
    if (j > n || (n - j < n/2 - js.size)){
      return false;
    }
    // i is not less than j
    if (i >= j) {
      return f(i,j + 1,js);
    }
    // this i is in the list of js
    if (js.has(i)){
      return f(i + 1,j,js);
    // yet to find twin, no match
    } else if (s[i] != s[j]){
      return f(i,j + 1,js);
    } else {
      // maybe it's a twin and maybe it's a duplicate
      if (duplicates.has(j)) {
        var _js = new Set(js);
        _js.add(j);
        return f(i,j + 1,js) | f(i + 1,j + 1,_js);
      // it's a twin
      } else {
        js.add(j);
        return f(i + 1,j + 1,js);
      }
    }
  }
  return f(0,1,new Set());
}
console.log(isTightTwins("1213213515")); // true
console.log(isTightTwins("11222332")); // false
WARNING: Commenter גלעד ברקן points out that the algorithm below gives the wrong answer of 6 (higher than should be possible!) for the string 1213213515. My implementation gets the same wrong answer, so there seems to be a serious problem with this algorithm. I'll try to figure out what the problem is, but in the meantime DO NOT TRUST THIS ALGORITHM!
I've thought of a solution that will take O(n^3) time and O(n^2) space, which should be usable on strings of up to length 1000 or so. It's based on a tweak to the usual notion of longest common subsequences (LCS). For simplicity I'll describe how to find a minimal-length substring with the "tight twin" property that starts at position 1 in the input string, which I assume has length 2n; just run this algorithm 2n times, each time starting at the next position in the input string.
"Self-avoiding" common subsequences
If the length-2n input string S has the "tight twin" (TT) property, then it has a common subsequence with itself (or equivalently, two copies of S have a common subsequence) that:
is of length n, and
obeys the additional constraint that no character position in the first copy of S is ever matched with the same character position in the second copy.
In fact we can safely tighten the latter constraint to no character position in the first copy of S is ever matched to an equal or lower character position in the second copy, due to the fact that we will be looking for TT substrings in increasing order of length, and (as the bottom section shows) in any minimal-length TT substring, it's always possible to assign characters to the two subsequences A and B so that for any matched pair (i, j) of positions in the substring with i < j, the character at position i is assigned to A. Let's call such a common subsequence a self-avoiding common subsequence (SACS).
The key thing that makes efficient computation possible is that no SACS of a length-2n string can have more than n characters (since clearly you can't cram more than 2 sets of n characters into a length-2n string), so if such a length-n SACS exists then it must be of maximum possible length. So to determine whether S is TT or not, it suffices to look for a maximum-length SACS between S and itself, and check whether this in fact has length n.
Computation by dynamic programming
Let's define f(i, j) to be the length of the longest self-avoiding common subsequence of the length-i prefix of S with the length-j prefix of S. To actually compute f(i, j), we can use a small modification of the usual LCS dynamic programming formula:
f(0, _) = 0
f(_, 0) = 0
f(i>0, j>0) = max(f(i-1, j), f(i, j-1), m(i, j))
m(i, j) = (if S[i] == S[j] && i < j then 1 else 0) + f(i-1, j-1)
As you can see, the only difference is the additional condition && i < j. As with the usual LCS DP, computing it takes O(n^2) time, since the 2 arguments each range between 0 and n, and the computation required outside of recursive steps is O(1). (Actually we need only compute the "upper triangle" of this DP matrix, since every cell (i, j) below the diagonal will be dominated by the corresponding cell (j, i) above it -- though that doesn't alter the asymptotic complexity.)
To determine whether the length-2j prefix of the string is TT, we need the maximum value of f(i, 2j) over all 0 <= i <= 2n -- that is, the largest value in column 2j of the DP matrix. This maximum can be computed in O(1) time per DP cell by recording the maximum value seen so far and updating as necessary as each DP cell in the column is calculated. Proceeding in increasing order of j from j=1 to j=2n lets us fill out the DP matrix one column at a time, always treating shorter prefixes of S before longer ones, so that when processing column 2j we can safely assume that no shorter prefix is TT (since if there had been, we would have found it earlier and already terminated).
Let the string length be N.
There are two approaches.
Approach 1. This approach is always exponential-time.
For each possible subsequence of length 1..N/2, list all occurrences of this subsequence. For each occurrence, list the positions of all its characters.
For example, for 123123 it should be:
(1, ((1), (4)))
(2, ((2), (5)))
(3, ((3), (6)))
(12, ((1,2), (4,5)))
(13, ((1,3), (4,6)))
(23, ((2,3), (5,6)))
(123, ((1,2,3),(4,5,6)))
(231, ((2,3,4)))
(312, ((3,4,5)))
The latter two are not necessary, as they appear only once.
One way to do it is to start with subsequences of length 1 (i.e. characters), then proceed to subsequences of length 2, etc. At each step, drop all subsequences which appear only once, as you don't need them.
Another way to do it is to check all 2**N binary strings of length N. Whenever a binary string has not more than N/2 "1" digits, add it to the table. At the end drop all subsequences which appear only once.
Now you have a list of subsequences which appear more than 1 time. For each subsequence, check all the pairs, and check whether such a pair forms a tight twin.
Approach 2. Seek for tight twins more directly. For each of the N*(N-1)/2 substrings, check whether the substring has even length and each character appears in it an even number of times; then, with L being its length, check whether it contains two tight twins of length L/2. There are 2**L ways to divide it; the simplest thing you can do is check all of them, as sketched below. There are more interesting ways to seek for t.t., though.
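For the inner check, the simplest exhaustive test could look like this sketch (C++ for consistency with the question; it is exponential in the substring length, so only usable for short substrings):

#include <string>

// Try every way of assigning the positions of t to the two twins and
// compare the resulting subsequences. Only sensible for small t.size().
bool splitsIntoTightTwins(const std::string & t)
{
    const std::size_t L = t.size();
    if (L == 0 || L % 2 != 0)
        return false;
    for (unsigned long long mask = 0; mask < (1ULL << L); ++mask)
    {
        std::string a, b;
        for (std::size_t i = 0; i < L; ++i)
            ((mask >> i) & 1 ? a : b) += t[i];
        if (a.size() == L / 2 && a == b)
            return true; //both twins are identical and together cover all positions
    }
    return false;
}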
I would like to approach this as a dynamic programming/pattern matching problem. We deal with characters one at a time, left to right, and we maintain a herd of Non-Deterministic Finite Automata / NDFA, which correspond to partial matches. We start off with a single null match, and with each character we extend each NDFA in every possible way, with each NDFA possibly giving rise to many children, and then de-duplicate the result - so we need to minimise the state held in the NDFA to put a bound on the size of the herd.
I think an NDFA needs to remember the following:
1) That it skipped a stretch of k characters before the match region.
2) A suffix which is a p-character string, representing characters not yet matched which will need to be matched by overlines.
I think that you can always assume that the p-character string needs to be matched with overlines because you can always swap overlines and underlines in an answer if you swap throughout the answer.
When you see a new character you can extend NDFAs in the following ways:
a) An NDFA with nothing except skips can add a skip.
b) An NDFA can always add the new character to its suffix, which may be null
c) An NDFA with a p character string whose first character matches the new character can turn into an NDFA with a p-1 character string which consists of the last p-1 characters of the old suffix. If the string is now of zero length then you have found a match, and you can work out what it was if you keep links back from each NDFA to its parent.
I thought I could use a neater encoding which would guarantee only a polynomial herd size, but I couldn't make that work, and I can't prove polynomial behaviour here. I do notice, though, that some cases of degenerate behaviour are handled reasonably, because they lead to multiple ways to get to the same suffix.
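A rough, hypothetical C++ sketch of the herd idea (the state representation and names are only one possible reading of the description above; it reports the first matching segment rather than the coloring, and the herd can still grow exponentially in bad cases):

#include <iostream>
#include <set>
#include <string>
#include <utility>

// Each state is (start of the match region, pending suffix that still has to
// be matched by overline characters).
bool findTightTwinSegment(const std::string & s, std::size_t & start, std::size_t & end)
{
    std::set<std::pair<std::size_t, std::string>> states;
    states.insert({0, ""}); // the single null match we start off with
    for (std::size_t i = 0; i < s.size(); i++)
    {
        std::set<std::pair<std::size_t, std::string>> next;
        for (const auto & st : states)
        {
            if (st.second.empty()) // a) nothing matched yet: may skip this character
                next.insert({i + 1, ""});
            // b) assign the character to the first twin (append it to the suffix)
            next.insert({st.first, st.second + s[i]});
            // c) match it against the first pending character of the suffix
            if (!st.second.empty() && st.second.front() == s[i])
            {
                std::string rest = st.second.substr(1);
                if (rest.empty()) { start = st.first; end = i; return true; }
                next.insert({st.first, rest});
            }
        }
        states = std::move(next); // the set de-duplicates identical NDFAs
    }
    return false;
}

int main()
{
    std::size_t a = 0, b = 0;
    if (findTightTwinSegment("231213231", a, b))
        std::cout << "tight twins found in [" << a << ", " << b << "]\n";
}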

Huffman's Data compression filltable and invert code problems

I just began learning about Huffman's data compression algorithm and I need help with the following functions: fillTable() and invertCodes().
I don't understand why a codetable array is needed.
while (n>0){
    copy = copy * 10 + n %10;
    n /= 10;
}
Please help me understand what is going on in this part of the function, and why n is divided by ten while it is larger than 0 - it seems to me it would always stay greater than 0 no matter how many times you divide it.
Link for code: http://www.programminglogic.com/implementing-huffman-coding-in-c/
void fillTable(int codeTable[], Node *tree, int Code){
    if (tree->letter<27)
        codeTable[(int)tree->letter] = Code;
    else{
        fillTable(codeTable, tree->left, Code*10+1);
        fillTable(codeTable, tree->right, Code*10+2);
    }
    return;
}

void invertCodes(int codeTable[],int codeTable2[]){
    int i, n, copy;
    for (i=0;i<27;i++){
        n = codeTable[i];
        copy = 0;
        while (n>0){
            copy = copy * 10 + n %10;
            n /= 10;
        }
        codeTable2[i]=copy;
    }
}
** edit **
To make this question more clear: I don't need an explanation of Huffman encoding and decoding, but I need an explanation of how these two functions work and why code tables are necessary.
n is an int. Therefore, it will reduce to 0 over time. If n starts at 302, it will be reduced to 30 after the first n /= 10;. At the end of the second iteration of the while loop, it will be reduced to 3. At the end of the third iteration, it will equal 0 ( int 3 / int 10 = int 0 ).
It is integer math. No decimal bits to extend to infinity.
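To make the arithmetic concrete, here is the loop traced for n = 302 (every division is integer division, so the fractional part is discarded):

start:       n = 302, copy = 0
iteration 1: copy = 0*10 + 302%10 = 2,   n = 302/10 = 30
iteration 2: copy = 2*10 + 30%10  = 20,  n = 30/10  = 3
iteration 3: copy = 20*10 + 3%10  = 203, n = 3/10   = 0   (loop ends)

So copy ends up as 203, the decimal digits of n in reverse order, which is exactly what invertCodes stores for each entry of codeTable.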
I made a minor update to the example program to include an end of data code. The original example code may append an extra letter to the end of the original data when decompressing. Also there's a lot of stuff "hard coded" in this code, such as the number of codes, which was 27, and which I changed to 28 to include the end of data code that I added, and also the output file names which I changed to "compress.bin" (if compressing) or "output.txt" (if decompressing). It's not an optimal implementation, but it's ok to use as a learning example. It would help if you follow the code with a source level debugger.
http://rcgldr.net/misc/huffmanx.zip
A more realistic Huffman program would use tables to do the encode and decode. The encode table is indexed with the input symbol, and each table entry contains two values, the number of bits in the code, and the code itself. The decode table is indexed with a code composed of the minimum number of bits from the input stream required to determine the code (it's at least 9 bits, but may need to be 10 bits), and each entry in that table contains two values, the actual number of bits, and the character (or end of data) represented by that code. Since the actual number of bits may be less than the number of bits used to index the table, the left-over bits will need to be buffered and used before reading more data from the compressed file.
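For illustration only, such tables could be laid out roughly like this (a sketch of the description above, not code from the linked example; the names and sizes are assumptions):

#include <cstdint>

struct EncodeEntry {
    uint8_t  bitCount; // number of bits in the code for this symbol
    uint32_t code;     // the code itself, right-aligned
};

struct DecodeEntry {
    uint8_t bitCount;  // how many of the looked-up bits this code actually uses
    uint8_t symbol;    // the decoded character, or a special end-of-data value
};

EncodeEntry encodeTable[28];      // indexed by the input symbol (28 codes here)
DecodeEntry decodeTable[1 << 10]; // indexed by the next 10 bits of the stream,
                                  // so short codes occupy several entries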
One variation of a Huffman like process is to have the length of the code determined by the leading bits of each code, to reduce the size of the decode table.

Random pairs of different bits

I have the following problem. I have a number in binary representation. I need a way to randomly select two bits of it that are different (i.e. find a 1 and a 0). Besides this, I run other operations on that number (reversing sequences, permuting sequences, ...). These are the approaches I have already used:
Keep track of all the ones and the zeros. When I create the binary representation of the number, I store the positions of the 0's and 1's. That way I can choose one index from one list and one index from the other, and I then have two different bits. To run my other operations I built them from an elementary swap operation which updates the indices of the 1's and 0's while manipulating the number. Therefore I have a third list that stores, for each bit, its index into the list of ones or zeros. If a bit is 1, I know where to find it in the list with all the indices of the ones (same goes for zeros).
The method above adds some overhead for operations that do not require the bits to be different. So another way would be to create the lists only when different bits are needed.
Does anyone have a better idea for doing this? I need these operations to be really fast (I am working with popcount, clz, and other bit operations).
I don't feel as though I have enough information to assess the tradeoffs properly, but perhaps you'll find this idea useful. To find a random 1 in a word (find a 1 over multiple words by popcount and reservoir sampling; find a 0 by complementing), first test the popcount. If the popcount is high, then generate indexes uniformly at random and test them until a one is found. If the popcount is medium, then take bitwise ANDs with uniform random masks (but keep the original if the AND is zero) to reduce the popcount. When the popcount is low, use clz to compile the (small) list of candidates efficiently and then sample uniformly at random.
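A rough C++ sketch of that strategy for a single 64-bit word (the popcount threshold is arbitrary, the mask-reduction middle step is omitted, and to find a 0 you would pass the complemented word):

#include <bit>
#include <cstdint>
#include <random>
#include <vector>

// Returns the position of a uniformly random 1 bit of w, or -1 if w == 0.
int randomSetBit(std::uint64_t w, std::mt19937_64 & rng)
{
    int ones = std::popcount(w);
    if (ones == 0)
        return -1;
    if (ones >= 16) // "high popcount": rejection-sample bit positions
    {
        std::uniform_int_distribution<int> idx(0, 63);
        for (;;)
        {
            int i = idx(rng);
            if ((w >> i) & 1)
                return i;
        }
    }
    std::vector<int> candidates; // "low popcount": enumerate the few set bits
    while (w)
    {
        candidates.push_back(std::countr_zero(w));
        w &= w - 1; // clear the lowest set bit
    }
    std::uniform_int_distribution<std::size_t> pick(0, candidates.size() - 1);
    return candidates[pick(rng)];
}

(std::popcount and std::countr_zero are from the C++20 <bit> header; older compilers have equivalent intrinsics.)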
I think the following might be a rather efficient algorithm to do what you are asking. You only iterate over each bit in the number once, and for each element, you have to generate a random number (not exactly sure how costly that is, but I believe there are some optimized CPU instructions for getting random numbers).
Idea is to iterate over all the bits, and with the right probability, update the index to the current index you are visiting.
Generic pseudocode for getting an element from a stream/array:
p = 1
e = null
for s in stream:
    with probability 1/p:
        replace e with s
    p++
return e
Java version:
int[] getIdx(int n){
    int oneIdx = 0;
    int zeroIdx = 0;
    int ones = 1;
    int zeros = 1;
    // this loop depends on whether you want to select all the prepended zeros
    // in a 32/64 bit representation. Alter to your liking...
    for(int i = n, j = 0; i > 0; i = i >>> 1, j++){
        if((i & 1) == 1){ // current element is 1
            if(Math.random() < 1/(float)ones){
                oneIdx = j;
            }
            ones++;
        } else{ // element is 0
            if(Math.random() < 1/(float)zeros){
                zeroIdx = j;
            }
            zeros++;
        }
    }
    return new int[]{zeroIdx,oneIdx};
}
An optimization you might look into is to do the probability selection using ints instead of floats; it might be slightly faster. Here is a short proof I did some time ago that this works: here. I believe the algorithm is attributed to Knuth but I can't remember exactly.

Increase string overlap matrix building efficiency

I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on, this method takes days.
In case you didn't guess, these are DNA fragments. X indicates a "wild card" (the probe gave a quality score below threshold) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but that method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
//part that compares them all to each other
for (int j=0; j<counter; j++) //counter holds # of DNA
{
    for (int k=j+1; k<counter; k++)
    {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        //boring stuff
    }
}

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (int j=0; j<str2.length(); j++)
    {
        int counter = 0, offset = 0;
        //bounds checks added so the loop cannot run off the end of either string
        while (offset < (int)str1.length() && j+offset < (int)str2.length() &&
               str1[offset] == str2[j+offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need n^2/2 comparisons, and you will need one half terabyte of memory even if you just store one byte per string pair.
However, I assume what you really are interested in is long matches, those that have more than, say, 20 or 30 or even more than 80 characters in common, and you probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing if that fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken up by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't really have to, store the indexes of the original strings together with each substring. You'll get something like (example)
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one, else enter it into the table.
e) Repeat 3d with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random / you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
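As a rough illustration of steps 1 and 2, assuming the "relevant substrings" are simply the maximal runs without 'X' (ordering by length would be a separate step, e.g. copying the map into a vector and sorting):

#include <map>
#include <string>
#include <vector>

// Count how often each maximal readable (X-free) run occurs across all reads.
std::map<std::string, int> buildSubstringTable(const std::vector<std::string> & reads)
{
    std::map<std::string, int> counts;
    for (const std::string & r : reads)
    {
        std::size_t i = 0;
        while (i < r.size())
        {
            if (r[i] == 'X') { ++i; continue; }
            std::size_t j = i;
            while (j < r.size() && r[j] != 'X') ++j;
            ++counts[r.substr(i, j - i)]; //one maximal readable run
            i = j;
        }
    }
    return counts;
}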
I don't see many ways around the fact that you need to compare each string with every other one, including shifting them, and that is by itself very long; a computation cluster seems the best approach.
The only thing I see how to improve is the string comparison by itself: replace A,C,T,G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea though, but is still a possible option to investigate), and then compare them quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you have. That's just a way to handle the wildcard; sorry, I don't have a better idea to reduce the complexity of the overall comparison.
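A small sketch of that encoding and of counting a matching run with it (hypothetical helper names; note that under this scheme the wildcard X overlaps every base, which differs slightly from the original code, where X breaks the run):

#include <cstdint>
#include <vector>

// One nibble per symbol; the AND of two symbols is non-zero iff they are
// compatible (equal bases, or at least one of them is the X wildcard).
std::uint8_t encodeBase(char c)
{
    switch (c)
    {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F; // 'X'
    }
}

// Length of the run of compatible positions starting at offsets i and j.
int matchRun(const std::vector<std::uint8_t> & a, const std::vector<std::uint8_t> & b,
             std::size_t i, std::size_t j)
{
    int len = 0;
    while (i < a.size() && j < b.size() && (a[i] & b[j]))
    {
        ++len; ++i; ++j;
    }
    return len;
}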

Generate a new element different from 1000 elements of an array

I was asked this question in an interview. Consider the scenario of punched cards, where each punched card has a 64-bit pattern. It was suggested to treat each card as an int, since an int is a collection of bits.
Also, consider that I have an array which already contains 1000 such cards. I have to generate a new element every time which is different from the previous 1000 cards. The integers (aka cards) in the array are not necessarily sorted.
Moreover, how would that be possible? The question was for C++: where does the 64-bit int come from, and how can I generate a new card that is different from all the elements already present in the array?
There are 2^64 64-bit integers, a number that is so much larger than 1000 that the simplest solution would be to just generate a random 64-bit number, and then verify that it isn't in the table of already generated numbers. (The probability that it is is infinitesimal, but you might as well be sure.)
Since most random number generators do not generate 64-bit values, you are left with either writing your own, or (much simpler) combining the values, say by generating 8 random bytes and memcpying them into a uint64_t.
As for verifying that the number isn't already present, std::find is just fine for one or two new numbers; if you have to do a lot of lookups, sorting the table and using a binary search would be worthwhile. Or some sort of a hash table.
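A minimal sketch of this approach, using std::mt19937_64 (which already produces full 64-bit values) instead of assembling bytes by hand:

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Returns a 64-bit value that is not among the existing cards.
std::uint64_t newCard(const std::vector<std::uint64_t> & cards, std::mt19937_64 & rng)
{
    for (;;)
    {
        std::uint64_t candidate = rng();
        if (std::find(cards.begin(), cards.end(), candidate) == cards.end())
            return candidate; // not one of the 1000 existing cards
    }
}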
I may be missing something, but most of the other answers appear to me as overly complicated.
Just sort the original array and then start counting from zero: if the current count is in the array, skip it; otherwise you have your next number. This algorithm is O(n), where n is the number of newly generated numbers: both sorting the fixed-size array and skipping existing numbers amount to constant overhead. Here's an example:
#include <algorithm>
#include <iostream>

unsigned array[] = { 98, 1, 24, 66, 20, 70, 6, 33, 5, 41 };
unsigned count = 0;
unsigned index = 0;

int main() {
    std::sort(array, array + 10);
    while ( count < 100 ) {
        if ( index < 10 && count > array[index] ) // don't read past the end of the array
            ++index;
        else {
            if ( index >= 10 || count < array[index] )
                std::cout << count << std::endl;
            ++count;
        }
    }
}
Here's an O(n) algorithm:
int64 generateNewValue(list_of_cards)
{
return find_max(list_of_cards)+1;
}
Note: As #amit points out below, this will fail if INT64_MAX is already in the list.
As far as I'm aware, this is the only way you're going to get O(n). If you want to deal with that (fairly important) edge case, then you're going to have to do some kind of proper sort or search, which will take you to O(n log n).
#arne is almost there. What you need is a self-balancing interval tree, which can be built in O(n lg n) time.
Then take the top node, which will store some interval [i, j]. By the properties of an interval tree, both i-1 and j+1 are valid candidates for a new key, unless i = UINT64_MIN or j = UINT64_MAX. If both are true, then you've stored 2^64 elements and you can't possibly generate a new element. Store the new element, which takes O(lg n) worst-case time.
I.e.: init takes O(n lg n), generate takes O(lg n). Both are worst-case figures. The greatest thing about this approach is that the top node will keep "growing" (storing larger intervals) and merging with its successor or predecessor, so the tree will actually shrink in terms of memory use and eventually the time per operation decays to O(1). You also won't waste any numbers, so you can keep generating until you've got 2^64 of them.
This algorithm has O(N lg N) initialisation, O(1) query and O(N) memory usage. I assume you have some integer type which I will refer to as int64 and that it can represent the integers [0, int64_max].
Sort the numbers
Create a linked list containing intervals [u, v]
Insert [1, first number - 1]
For each of the remaining numbers, insert [prev number + 1, current number - 1]
Insert [last number + 1, int64_max]
You now have a list representing the numbers which are not used. You can simply iterate over them to generate new numbers.
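A sketch of that initialisation, using a vector of [lo, hi] pairs instead of a linked list and assuming the usable range is [0, UINT64_MAX]:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Build the list of intervals of numbers that are NOT used by any card.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
freeIntervals(std::vector<std::uint64_t> cards)
{
    std::sort(cards.begin(), cards.end());
    cards.erase(std::unique(cards.begin(), cards.end()), cards.end());
    std::vector<std::pair<std::uint64_t, std::uint64_t>> gaps;
    std::uint64_t next = 0; // lowest value not yet covered
    for (std::uint64_t c : cards)
    {
        if (c > next)
            gaps.push_back({next, c - 1}); // the gap before this card
        next = c + 1;
    }
    if (cards.empty() || cards.back() != UINT64_MAX)
        gaps.push_back({next, UINT64_MAX}); // the tail above the largest card
    return gaps;
}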
I think the way to go is to use some kind of hashing. So you store your cards in buckets based on, let's say, a MOD operation. Until you create some sort of indexing you are stuck with looping over the whole array.
If you have a look at the HashSet implementation in Java you might get a clue.
Edit: I assume you wanted them to be random numbers; if you don't mind a sequence, MAX+1 below is a good solution :)
You could build a binary tree of the already existing elements and traverse it until you find a node whose depth is not 64 and which has fewer than two child nodes. You can then construct a "missing" child node and have a new element. This should be fairly quick, in the order of about O(n) if I'm not mistaken.
bool seen[1001] = { false };

for each element of the original array
    if the element is in the range 0..1000
        seen[element] = true

find the index for the first false value in seen
Initialization:
Don't sort the list.
Create a new array 1000 long containing 0..999.
Iterate the list and, if any number is in the range 0..999, invalidate it in the new array by replacing the value in the new array with the value of the first item in the list.
Insertion:
Use an incrementing index to the new array. If the value in the new array at this index is not the value of the first element in the list, add it to the list, else check the value from the next position in the new array.
When the new array is used up, refill it using 1000..1999 and invalidating existing values as above. Yes, this is looping over the list, but it doesn't have to be done for each insertion.
Near O(1) until the list gets so large that occasionally iterating it for invalidation of the 'new' new array becomes significant. Maybe you could mitigate this by using a new array that grows, maybe always the size of the list?
Rgds,
Martin
Put them all into a hash table of size > 1000, and find the empty cell (this is the parking problem). Generate a key for that. This will of course work better for bigger table size. The table needs only 1-bit entries.
EDIT: this is the pigeonhole principle.
This needs "modulo tablesize" (or some other "semi-invertible" function) for a hash function.
unsigned hashtab[1001] = {0,};
unsigned long long numbers[1000] = { ... };

void init(void)
{
    unsigned idx;
    for (idx = 0; idx < 1000; idx++) {
        hashtab[ numbers[idx] % 1001 ] += 1;
    }
}

unsigned long long generate(void)
{
    unsigned idx;
    for (idx = 0; idx < 1001; idx++) {
        if ( !hashtab[idx] ) break;
    }
    return idx + (unsigned long long)rand() * 1001; //any value with this remainder is unused
}
Based on the solution here: question on array and number
Since there are 1000 numbers, if we consider their remainders with 1001, at least one remainder will be missing. We can pick that as our missing number.
So we maintain an array of counts: C[1001], which will maintain the number of integers with remainder r (upon dividing by 1001) in C[r].
We also maintain a set of numbers for which C[j] is 0 (say using a linked list).
When we move the window over, we decrement the count of the first element (say remainder i), i.e. decrement C[i]. If C[i] becomes zero we add i to the set of numbers. We update the C array with the new number we add.
If we need one number, we just pick a random element from the set of j for which C[j] is 0.
This is O(1) for new numbers and O(n) initially.
This is similar to other solutions but not quite.
How about something simple like this:
1) Partition the array into numbers equal to or below 1000, and numbers above 1000.
2) If all the numbers fit within the lower partition then choose 1001 (or any number greater than 1000) and we're done.
3) Otherwise we know that there must exist a number between 1 and 1000 that doesn't exist within the lower partition.
4) Create a 1000 element array of bools, or a 1000-element long bitfield, or whatnot and initialize the array to all 0's
5) For each integer in the lower partition, use its value as an index into the array/bitfield and set the corresponding bool to true (ie: do a radix sort)
6) Go over the array/bitfield and pick any unset value's index as the solution
This works in O(n) time, or since we've bounded everything by 1000, technically it's O(1), but O(n) time and space in general. There are three passes over the data, which isn't necessarily the most elegant approach, but the complexity remains O(n).
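For what it's worth, a compact sketch of this idea, folding the partition, the marking pass and the search into one function (using the value range 0..999 for simplicity; names are illustrative):

#include <bitset>
#include <cstdint>
#include <vector>

// Because there are only 1000 cards, either some value in 0..999 is missing,
// or every value in 0..999 is taken and 1000 itself is free.
std::uint64_t newCard(const std::vector<std::uint64_t> & cards)
{
    std::bitset<1000> present;
    for (std::uint64_t c : cards)
        if (c < 1000)
            present.set(static_cast<std::size_t>(c));
    for (std::size_t i = 0; i < 1000; ++i)
        if (!present.test(i))
            return i; // a gap below 1000
    return 1000; // all values 0..999 are taken, so 1000 is unused
}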
you can create a new array with the numbers that are not in the original array, then just pick one from this new array.
¿O(1)?