I am writing a C++ program that goes through long streams of symbols, and I need to store, for further analysis, where in the stream symbol sequences of a certain length appear. For instance, in the binary stream
100110010101
I have sequences of, for example, length 6 like this:
100110 starting on position 0
001100 starting on position 1
011001 starting on position 2
etc.
What I need to store are vectors of all positions where each particular sequence can be found. So the result should be something like a table, maybe resembling a hash table, that looks like this:
sequence/ positions
10010101 | 1 13 147 515
01011011 | 67 212 314 571
00101010 | 2 32 148 322 384 419 455
etc.
Now, I figured mapping strings to integers is slow, so because I have information about the symbols in the stream up front, I can use it to map these fixed-length sequences to integers.
The next step was to create a map that maps these "representing integers" to the corresponding index in the table, where I add the next occurrence of the sequence. However, this is slow, much slower than I can afford. I tried both the ordered and unordered maps of both the std and boost libraries, and none was efficient enough. I have measured it: the map is the real bottleneck here.
And here is the loop in pseudocode:
for (int i=seqleng-1;i<stream.size();i++) {
    //compute characteristic value for the sequence by adding one symbol
    charval*=symb_count;
    charval+=sdata[j][i]-'0';
    //sampspacesize is the number of all possible sequences with this symbol count and this length
    charval%=sampspacesize;
    map<uint64,uint64>::iterator it=map.find(charval);
    //if index exists, add starting position of the sequence to the table
    if (it!=map.end()) {
        table[it->second].add(i-seqleng+1);
    }
    //if current sequence is found for the first time, extend the table and add the index
    else {
        table.add_row();
        map[charval]=table.last_index;
        table[table.last_index].add(i-seqleng+1);
    }
}
So the question is: can I use something better than a map to keep track of the corresponding indices in the table, or is this the best way possible?
NOTE: I know there is a fast way here, which is to create storage large enough for every possible symbol sequence (meaning that if I have sequences of length 10 and 4 symbols, I reserve 4^10 slots and can omit the mapping). However, I am going to need to work with lengths and symbol counts that would require reserving an amount of memory way beyond the computer's capacity. The actual number of used slots will not exceed 100 million (which is guaranteed by the maximal stream length), and that can be stored in a computer just fine.
Please ask if anything is unclear. This is my first large question here, so I lack the experience to express myself in a way others would understand.
An unordered map with pre-allocated space is usually the fastest way to store any kind of sparse data.
Given that std::string has SSO I can't see why something like this won't be about as fast as it gets:
(I have used an unordered_multimap but I may have misunderstood the requirements)
#include <unordered_map>
#include <string>
#include <iostream>
using sequence = std::string; /// #todo - perhaps replace with something faster if necessary
using sequence_position_map = std::unordered_multimap<sequence, std::size_t>;
int main()
{
auto constexpr sequence_size = std::size_t(6);
sequence_position_map sequences;
std::string input = "11000111010110100011110110111000001111010101010101111010";
if (sequence_size <= input.size()) {
sequences.reserve(input.size() - sequence_size);
auto first = std::size_t(0);
auto last = input.size();
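// note: with '<' the final window (starting at input.size() - sequence_size) is skipped; use '<=' to include it
while (first + sequence_size < last) {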
sequences.emplace(input.substr(first, sequence_size), first);
++first;
}
}
std::cout << "results:\n";
auto first = sequences.begin();
auto last = sequences.end();
while(first != last) {
auto range = sequences.equal_range(first->first);
std::cout << "sequence: " << first->first;
std::cout << " at positions: ";
const char* sep = "";
while (first != range.second) {
std::cout << sep << first->second;
sep = ", ";
++first;
}
std::cout << "\n";
}
}
output:
results:
sequence: 010101 at positions: 38, 40, 42, 44
sequence: 000011 at positions: 30
sequence: 000001 at positions: 29
sequence: 110000 at positions: 27
sequence: 011100 at positions: 25
sequence: 101110 at positions: 24
sequence: 010111 at positions: 46
sequence: 110111 at positions: 23
sequence: 011011 at positions: 22
sequence: 111011 at positions: 19
sequence: 111000 at positions: 26
sequence: 111101 at positions: 18, 34, 49
sequence: 011110 at positions: 17, 33, 48
sequence: 001111 at positions: 16, 32
sequence: 110110 at positions: 20
sequence: 101010 at positions: 37, 39, 41, 43
sequence: 010001 at positions: 13
sequence: 101000 at positions: 12
sequence: 101111 at positions: 47
sequence: 110100 at positions: 11
sequence: 011010 at positions: 10
sequence: 101101 at positions: 9, 21
sequence: 010110 at positions: 8
sequence: 101011 at positions: 7, 45
sequence: 111010 at positions: 5, 35
sequence: 011101 at positions: 4
sequence: 001110 at positions: 3
sequence: 100000 at positions: 28
sequence: 000111 at positions: 2, 15, 31
sequence: 100011 at positions: 1, 14
sequence: 110001 at positions: 0
sequence: 110101 at positions: 6, 36
After many suggestions in the comments and the answer, I tested most of them and picked the fastest option. It reduced the bottleneck caused by the mapping to almost the same time the program takes without the "map" (that version produces incorrect data, but I needed it to find the minimum time this can be reduced to).
This was achieved by replacing the unordered_map<uint64,uint> and vector<vector<uint>> with just unordered_map<uint64, vector<uint> >, more precisely boost::unordered_map. I also tested unordered_map<string,vector<uint>>, and it surprised me that it was not as much slower as I expected. It was slower, however.
Also, probably because the ordered map has to keep its internal tree balanced, map<uint64, vector<uint>> was a bit slower than map<uint64,uint> together with vector<vector<uint>>. Since the unordered map does not have to rebalance anything during the computation, it seems to be the fastest configuration one can use.
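For illustration, the final configuration described above could look roughly like the sketch below. It is not the original program: the function name index_sequences is made up, the symbols are assumed to be the characters '0', '1', ... as in the pseudocode, and symb_count^seq_len is assumed to fit into 64 bits. Instead of the modulo step it subtracts the oldest symbol before shifting in the new one, which gives the same characteristic value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: maps the characteristic value of every window of
// seq_len symbols to the list of positions where that window starts.
// Assumes symbols are '0' .. '0' + symb_count - 1 and that
// symb_count^seq_len fits into a 64-bit unsigned integer.
std::unordered_map<std::uint64_t, std::vector<std::uint32_t>>
index_sequences(const std::string& stream, std::size_t seq_len, std::uint64_t symb_count)
{
    std::unordered_map<std::uint64_t, std::vector<std::uint32_t>> table;
    if (seq_len == 0 || stream.size() < seq_len) return table;

    // Weight of the oldest symbol in a window: symb_count^(seq_len - 1).
    std::uint64_t high = 1;
    for (std::size_t k = 1; k < seq_len; ++k) high *= symb_count;

    // The number of windows is an upper bound on the number of distinct keys.
    table.reserve(stream.size() - seq_len + 1);

    std::uint64_t charval = 0;
    for (std::size_t i = 0; i < stream.size(); ++i) {
        if (i >= seq_len) {
            // Remove the symbol that just left the window.
            charval -= static_cast<std::uint64_t>(stream[i - seq_len] - '0') * high;
        }
        // Shift the window by one symbol and add the new one.
        charval = charval * symb_count + static_cast<std::uint64_t>(stream[i] - '0');
        if (i + 1 >= seq_len) {
            // operator[] creates the vector the first time a key is seen.
            table[charval].push_back(static_cast<std::uint32_t>(i + 1 - seq_len));
        }
    }
    return table;
}
```

Calling index_sequences("100110010101", 6, 2) on the stream from the question stores position 0 under the key for 100110, position 1 under the key for 001100, and so on. The single lookup through operator[] replaces the find/insert pair and the separate table of the original loop.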
I'm using C++. Using sort from STL is allowed.
I have an array of int, like this :
1 4 1 5 145 345 14 4
The numbers are stored in a char* (I read them from a binary file, 4 bytes per number)
I want to do two things with this array :
swap each number with the one after that
4 1 5 1 345 145 4 14
sort it by group of 2
4 1 4 14 5 1 345 145
I could code it step by step, but it wouldn't be efficient. What I'm looking for is speed. O(n log n) would be great.
Also, this array can be bigger than 500MB, so memory usage is an issue.
My first idea was to sort the array starting from the end (to swap the numbers 2 by 2) and treating it as a long* (to force the sorting to take 2 int each time). But I couldn't manage to code it, and I'm not even sure it would work.
I hope I was clear enough, thanks for your help : )
This is the most memory efficient layout I could come up with. Obviously the vector I'm using would be replaced by the data blob you're using, assuming endian-ness is all handled well enough. The premise of the code below is simple.
Generate 1024 random values in pairs, each pair consisting of the first number between 1 and 500, the second number between 1 and 50.
Iterate the entire list, flipping all even-index values with their following odd-index brethren.
Send the entire thing to std::qsort with an item width of two (2) int32_t values and a count of half the original vector.
The comparator function simply sorts on the immediate value first, and on the second value if the first is equal.
The sample below does this for 1024 items. I've tested it without output for 134217728 items (exactly 536870912 bytes) and the results were pretty impressive for a measly macbook air laptop, about 15 seconds, only about 10 of that on the actual sort. What is ideally most important is no additional memory allocation is required beyond the data vector. Yes, to the purists, I do use call-stack space, but only because q-sort does.
I hope you get something out of it.
Note: I only show the first part of the output, but I hope it shows what you're looking for.
#include <iostream>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <vector>
#include <cstdint>
#include <cstdlib>
#include <ctime>
// a most-wacked-out random generator. every other call will
// pull from a rand modulo either the first, or second template
// parameter, in alternation.
template<int N,int M>
struct randN
{
int i = 0;
int32_t operator ()()
{
i = (i+1)%2;
return (i ? rand() % N : rand() % M) + 1;
}
};
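// compare two pairs of integer values by address: first value first,
// second value as the tie-breaker. Uses comparisons instead of
// subtraction so that large values cannot overflow.
int pair_cmp(const void* arg1, const void* arg2)
{
    const int32_t *left = (const int32_t*)arg1;
    const int32_t *right = (const int32_t *)arg2;
    if (left[0] != right[0])
        return (left[0] < right[0]) ? -1 : 1;
    return (left[1] > right[1]) - (left[1] < right[1]);
}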
int main()
{
    // a crapload of int values
    static const std::size_t N = 1024;
    // seed rand()
    std::srand((unsigned)std::time(0));
    // get a huge array of random pairs: first value from 1..500, second from 1..50
    std::vector<int32_t> data;
    data.reserve(N);
    std::generate_n(std::back_inserter(data), N, randN<500,50>());
    // flip all the values
    for (std::size_t i=0;i+1<data.size();i+=2)
    {
        int32_t tmp = data[i];
        data[i] = data[i+1];
        data[i+1] = tmp;
    }
    // now sort in pairs. using qsort only because it lends itself
    // *very* nicely to performing block-based sorting.
    std::qsort(&data[0], data.size()/2, sizeof(data[0])*2, pair_cmp);
    std::cout << "After sorting..." << std::endl;
    std::copy(data.begin(), data.end(), std::ostream_iterator<int32_t>(std::cout,"\n"));
    std::cout << std::endl << std::endl;
    return EXIT_SUCCESS;
}
Output
After sorting...
1
69
1
83
1
198
1
343
1
367
2
12
2
30
2
135
2
169
2
185
2
284
2
323
2
325
2
347
2
367
2
373
2
382
2
422
2
492
3
286
3
321
3
364
3
377
3
400
3
418
3
441
4
24
4
97
4
153
4
210
4
224
4
250
4
354
4
356
4
386
4
430
5
14
5
26
5
95
5
145
5
302
5
379
5
435
5
436
5
499
6
67
6
104
6
135
6
164
6
179
6
310
6
321
6
399
6
409
6
425
6
467
6
496
7
18
7
65
7
71
7
84
7
116
7
201
7
242
7
251
7
256
7
324
7
325
7
485
8
52
8
93
8
156
8
193
8
285
8
307
8
410
8
456
8
471
9
27
9
116
9
137
9
143
9
190
9
190
9
293
9
419
9
453
With some additional constraints on both your input and your platform, you can probably use an approach like the one you are thinking of. These constraints would include
Your input contains only positive numbers (i.e. can be treated as unsigned)
Your platform provides uint8_t and uint64_t in <cstdint>
You address a single platform with known endianness.
In that case you can divide your input into groups of 8 bytes, do some byte shuffling to arrange each group as one uint64_t with the "first" number from the input in the lower-valued half, and run std::sort on the resulting array. Depending on endianness, you may need to do more byte shuffling to rearrange each sorted 8-byte group as a pair of uint32_t in the expected order.
If you can't code this on your own, I'd strongly advise you not to take this approach.
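For reference, here is a minimal sketch of the packing idea described above. It is an illustration under assumptions rather than a drop-in solution: the function name swap_and_sort_packed is made up, the values are assumed to be non-negative and already decoded into uint32_t, and it uses shifts instead of byte shuffling so that it does not depend on endianness. The price is a temporary array of keys, which an in-place byte-shuffling version would avoid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: turns every pair (a, b) into (b, a) and sorts the
// pairs lexicographically by packing each pair into one 64-bit key.
// Assumes non-negative values and an even number of elements.
void swap_and_sort_packed(std::vector<std::uint32_t>& data)
{
    assert(data.size() % 2 == 0);

    std::vector<std::uint64_t> keys;
    keys.reserve(data.size() / 2);
    for (std::size_t i = 0; i < data.size(); i += 2) {
        // The second input number goes into the high half, so it is compared
        // first; the first input number ends up in the low half.
        keys.push_back((static_cast<std::uint64_t>(data[i + 1]) << 32) | data[i]);
    }

    std::sort(keys.begin(), keys.end());

    // Unpack: high half first, which writes the pairs back already swapped.
    for (std::size_t i = 0; i < keys.size(); ++i) {
        data[2 * i]     = static_cast<std::uint32_t>(keys[i] >> 32);
        data[2 * i + 1] = static_cast<std::uint32_t>(keys[i] & 0xFFFFFFFFu);
    }
}
```

Applied to the sample 1 4 1 5 145 345 14 4 from the question, this produces 4 1 4 14 5 1 345 145.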
A better and more portable approach (you have some inherent non-portability by starting from a not clearly specified binary file format), would be:
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

std::vector<int> swap_and_sort_int_pairs(const unsigned char buffer[], std::size_t buflen) {
    const std::size_t intsz = sizeof(int);
    // We have to assume that the binary format in buffer is compatible with our int representation
    // we also require an even number of integers
    assert(buflen % (2*intsz) == 0);
    // load pairwise
    std::vector< std::pair<int,int> > pairs;
    pairs.reserve(buflen/(2*intsz));
    for (const unsigned char* bufp=buffer; bufp<buffer+buflen; bufp+= 2*intsz) {
        // It would be better to have a more portable binary -> int conversion;
        // memcpy at least avoids the alignment and aliasing problems of a pointer cast
        int first_value, second_value;
        std::memcpy(&first_value, bufp, intsz);
        std::memcpy(&second_value, bufp + intsz, intsz);
        // swap each pair here
        pairs.emplace_back( second_value, first_value );
    }
    // less<pair<..>> does lexicographical ordering, which is what you are looking for
    std::sort(pairs.begin(), pairs.end());
    // convert back to linear vector
    std::vector<int> result;
    result.reserve(2*pairs.size());
    for (auto& entry : pairs) {
        result.push_back(entry.first);
        result.push_back(entry.second);
    }
    return result;
}
Both the initial parse/swap pass (which you need anyway) and the final conversion are O(N), so the total complexity is still O(N log N).
If you can continue to work with pairs, you can save the final conversion. The other way to save that conversion would be to use a hand-coded sort with two-int strides and two-int swap: much more work - and possibly still hard to get as efficient as a well-tuned library sort.
Do one thing at a time. First, give your data some structure. It seems that every 8 bytes form a unit of the form
struct unit {
    int key;
    int value;
};
If the endianness is right, you can do this in O(1) with a reinterpret_cast. If it isn't, you'll have to live with an O(n) conversion effort. Both vanish compared to the O(n log n) sorting effort.
When you have an array of these units, you can use std::sort like:
bool compare_units(const unit& a, const unit& b) {
return a.key < b.key;
}
std::sort(array, array + length, compare_units);
The key to this solution is that you do the "swapping" and byte-interpretation first and then do the sorting.
You get 10 numbers that you have to split into two lists where the sum of numbers in the lists have the smallest difference possible.
so let's say you get:
10 29 59 39 20 17 29 48 33 45
How would you sort this into two lists where the difference in the sum of the lists is as small as possible?
So in this case, the answer (I think) would be:
59 48 29 17 10 = 163
45 39 33 29 20 = 166
I'm using mIRC script as the language but perl or C++ is just as good for me.
edit: actually there can be multiple answers such as in this scenario, it could also be:
59 48 29 20 10 = 166
45 39 33 29 17 = 163
to me, it doesn't matter so long as the end result is that the difference of the sum of the lists is as small as possible
edit 2: each list must contain 5 numbers.
What you have listed is exactly the partition problem (for more details look at http://en.wikipedia.org/wiki/Partition_problem).
The point is that this is an NP-complete problem, so no algorithm is known that solves every instance efficiently (i.e. in polynomial time) as the amount of numbers grows.
But if your problem always has only ten numbers to divide into two lists of exactly five items each, then it becomes feasible to naively try all possible solutions: there are at most p^N assignments, where p=2 is the number of partitions and N=10 is the number of integers, thus only 2^10=1024 combinations, of which just C(10,5)=252 put exactly five numbers in each list, and each takes only O(N) to verify (i.e. to compute the difference).
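The exhaustive search described above can be sketched in C++ (one of the languages the question accepts) as follows. This is only an illustration: the Split struct and the function name best_five_five_split are made up, and the sketch is hard-wired to ten numbers and two lists of five.

```cpp
#include <array>
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical result type: the two lists and the absolute difference of their sums.
struct Split {
    std::vector<int> a;
    std::vector<int> b;
    int diff;
};

// Tries every 10-bit mask, keeps only those with exactly five bits set
// (five numbers in list a, the other five in list b) and remembers the
// split with the smallest difference of sums.
Split best_five_five_split(const std::array<int, 10>& nums)
{
    int total = 0;
    for (int x : nums) total += x;

    Split best{{}, {}, -1};  // diff == -1 means "nothing found yet"
    for (unsigned mask = 0; mask < (1u << 10); ++mask) {
        int count = 0;
        int sum_a = 0;
        for (int i = 0; i < 10; ++i) {
            if (mask & (1u << i)) { ++count; sum_a += nums[i]; }
        }
        if (count != 5) continue;

        // sum_b = total - sum_a, so |sum_a - sum_b| = |total - 2 * sum_a|
        int diff = std::abs(total - 2 * sum_a);
        if (best.diff < 0 || diff < best.diff) {
            best.a.clear();
            best.b.clear();
            for (int i = 0; i < 10; ++i) {
                if (mask & (1u << i)) best.a.push_back(nums[i]);
                else                  best.b.push_back(nums[i]);
            }
            best.diff = diff;
        }
    }
    return best;
}
```

For the numbers in the question, the function can be called as best_five_five_split({10, 29, 59, 39, 20, 17, 29, 48, 33, 45}); the returned lists always hold five numbers each.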
Otherwise you can implement the greedy algorithm described on the Wikipedia page. It is simple to implement, but there is no guarantee of optimality, as you can see from this implementation in Java:
static void partition() {
int[] set = {10, 29, 59, 39, 20, 17, 29, 48, 33, 45}; // array of data
Arrays.sort(set); // sorts data in ascending order (not descending)
ArrayList<Integer> A = new ArrayList<Integer>(5); //first list
ArrayList<Integer> B = new ArrayList<Integer>(5); //second list
String stringA=new String(); //only to print result
String stringB=new String(); //only to print result
int sumA = 0; //sum of items in A
int sumB = 0; //sum of items in B
for (int i : set) {
if (sumA <= sumB) {
A.add(i); //add item to first list
sumA+=i; //update sum of first list
stringA+=" "+i;
} else {
B.add(i); //add item to second list
sumB+=i; //update sum of second list
stringB+=" "+i;
}
}
System.out.println("First list:" + stringA + " = " + sumA);
System.out.println("Second list:"+ stringB+ " = " + sumB);
System.out.println("Difference (first-second):" + (sumA-sumB));
}
It does not return a good result:
First list: 10 20 29 39 48 = 146
Second list: 17 29 33 45 59 = 183
Difference (first-second):-37
I'm trying to read numbers from a file into an array, discarding duplicates. For instance, say the following numbers are in a file:
41 254 14 145 244 220 254 34 135 14 34 25
Though the number 34 occurs twice in the file, I would only like to store it once in the array. How would I do this?
(fixed, but I guess a better term would be a 64 bit Unsigned int) (was using numbers above 255)
vector<int64_t> v;
copy(istream_iterator<int64_t>(cin), istream_iterator<int64_t>(), back_inserter(v));
set<int64_t> s;
vector<int64_t> ov; ov.reserve(v.size());
for( auto i = v.begin(); i != v.end(); ++i ) {
    // insert() reports through .second whether the value was new
    if ( s.insert(*i).second )
        ov.push_back(*i);
}
// ov contains only unique numbers in the same order as the original input file.
I'm looking for some regex/automata help. I'm limited to + or the Kleene star. Parsing through a string representing a ternary number (like binary, just base 3), I need to be able to tell whether the value is 1 less than a multiple of 4.
So, for example 120 = 0*1+2*3+1*9 = 9+6 = 15 = 16-1 = 4(n)-1.
Even a pointer to the pattern would be really helpful!
You can generate a series of values to do some observation with bc in bash:
for n in {1..40}; do v=$((4*n-1)); echo -en $v"\t"; echo "ibase=10;obase=3;$v" | bc ; done
3 10
7 21
11 102
15 120
19 201
23 212
27 1000
31 1011
...
Notice that each digit's value (in decimal) is either 1 more or 1 less than something divisible by 4, alternately. So the 1 (lsb) digit is one more than 0, the 3 (2nd) digit is one less than 4, the 9 (3rd) digit is 1 more than 8, the 27 (4th) digit is one less than 28, etc.
If you sum up all the even-placed digits and all the odd-placed digits (counting places from 1, starting at the least significant digit), then add 1 to the odd-placed sum, the two results should be equal modulo 4.
In your example: odd: (0+1)+1 = 2, even: 2. They are equal, so the number is of the form 4n-1.
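As a quick way to check the rule on more inputs, here is a small C++ sketch of it. It is a plain function rather than a regex or an automaton, the name is_one_less_than_multiple_of_4 is made up, and the input is assumed to contain only the characters '0', '1' and '2' with the most significant digit first.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the ternary numeral s denotes a
// value of the form 4n - 1. Since 3 is one less than 4, the place values
// 1, 3, 9, 27, ... alternate between "one more" and "one less" than a
// multiple of 4, so only the two digit sums matter.
bool is_one_less_than_multiple_of_4(const std::string& s)
{
    int odd_sum = 0;   // digits at places 1, 3, 5, ... (place 1 is the least significant)
    int even_sum = 0;  // digits at places 2, 4, 6, ...
    std::size_t place = 1;
    for (auto it = s.rbegin(); it != s.rend(); ++it, ++place) {
        int digit = *it - '0';
        if (place % 2 == 1) odd_sum = (odd_sum + digit) % 4;
        else                even_sum = (even_sum + digit) % 4;
    }
    // The value is congruent to odd_sum - even_sum modulo 4, so it has the
    // form 4n - 1 exactly when odd_sum + 1 and even_sum agree modulo 4.
    return (odd_sum + 1 - even_sum + 4) % 4 == 0;
}
```

Because the function only keeps two remainders modulo 4 while reading the digits, the same bookkeeping could in principle be turned into the states of a finite automaton.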
I'm new to C++. Only been programming for 2 days so this will probably look messy. The purpose of the program is that you enter a word, and then the program randomizes the placement of the letters in the word.
I have three questions.
Why, if the same string is entered twice, will the same "random" numbers be output?
How can I make sure no random number is picked twice. I already tried an IF statement nested inside the FOR statement but it just made things worse.
What will make this work?
The code:
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>
#include <stdio.h>
#include <string.h>
using namespace std;
int main () {
cout << "Enter word to be randomized: ";
char rstring[30];
char rstring2[30];
cin >> rstring;
strcpy(rstring2, rstring);
int length;
length = strlen(rstring);
int max=length;
int min=0;
int randint;
for (int rdm=0; rdm<length; rdm++) {
randint=rand()%(max-min)+min;
cout << rstring[rdm]; //This is temporary. Just a visualization of what I'm doing.
cout << randint << endl; //Temporary as well.
rstring2[randint]=rstring[rdm];
}
cout << endl << rstring2 << endl;
return 0;
}
If you compile and run this you will notice that the same random numbers are output for the same text. Like "hello" outputs 24330. Why is this random generator generating nonrandom numbers?
You need to seed your random number generator to get different results with each run. Otherwise, (as you have noticed) you will get the same random numbers with each run.
Put this at the start of the program:
srand(time(NULL));
This will seed the random number generator with time - which will likely be different between runs.
Note that you'll also need #include <time.h> to access the time() function.
You're not using a random number generator. You're calling rand(), a pseudo-random number generator, which produces sequences of numbers that share many properties with truly random numbers (e.g. mean, standard deviation, frequency spectrum will all be correct).
To get a different sequence, you have to initialize the seed using srand(). The usual way to do this is:
srand(time(NULL));
Furthermore, a sequence that guarantees the same number cannot be picked twice, is no longer a sequence of i.i.d. (independent identically distributed) random numbers. (the sequence is highly dependent) Most uses of random numbers rely on the i.i.d. property, so the library-provided functions are i.i.d. However, filtering out repeats yourself is not especially hard.
If you don't want to change the cardinality (number of occurrences) of each character in the string, the easiest thing to do is not pick one character after the other, but randomly pick a pair to swap. By only swapping, you change order but not cardinality.
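A common systematic form of the swapping idea is the Fisher-Yates shuffle, sketched below. Note that this is a swapped-in technique, not code from the question: the function name shuffle_word is made up, and it uses the C++11 <random> facilities instead of rand().

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <utility>

// Hypothetical helper: returns a copy of word with its letters in random
// order. Only swaps are used, so every letter keeps its number of occurrences.
std::string shuffle_word(std::string word, std::mt19937& rng)
{
    // Fisher-Yates: walk from the back and swap each position with a
    // randomly chosen position at or before it.
    for (std::size_t i = word.size(); i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(word[i - 1], word[pick(rng)]);
    }
    return word;
}
```

A generator seeded with std::random_device, for example std::mt19937 rng(std::random_device{}());, would typically give a different order on each run, and the standard library's std::shuffle does the same job in one call.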
You always get the same random numbers because you don't seed this random number generator. Call srand() before your first call to rand(). Examples: http://www.cplusplus.com/reference/clibrary/cstdlib/srand/
The random number generated by rand() is pseudo-random. The C++ rand() documentation says the following:
rand() Returns a pseudo-random integral number in the range 0 to RAND_MAX.
This number is generated by an algorithm that returns a sequence of apparently non-related numbers each time it is called. This algorithm uses a seed to generate the series, which should be initialized to some distinctive value using srand.
This happens because (at least on Linux) pseudo-random number generators are seeded with the same value by default (to make programs more deterministic, so two consecutive identical runs give the same answers).
You could seed your PRNG with a different value (the time, the pid, whatever). On Linux you could also consider reading the /dev/urandom (or, much more rarely, even the /dev/random) pseudo file, often to seed your PRNG.
The code below remembers which random numbers were previously picked, so each number is generated only once.
It stores the results in an array; when rand() produces a number that already exists in the array, that number is discarded and a new one is drawn.
#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;
int main()
{
    int size=100;
    int random_once[100];
    srand(time(0));
    cout<<"generating unique random numbers between [0 and "<<size <<"] only once \n\n";
    for (int i=0;i<size;i++) // generate random numbers
    {
        random_once[i]=rand() % size;
        //if number already exists, redo this slot so the duplicate gets overwritten
        for(int j=0;j<i;j++) if (random_once[j]==random_once[i]) { i--; break; }
    }
    for (int i=0;i<size;i++) cout<<" "<<random_once[i]<<"\t";
    cout<<"\n";
    return 0;
}
Output :
generating unique random numbers between [0 and 100] only once
50 80 99 16 11 56 48 36 21 34
90 87 33 85 96 77 63 5 60 52
59 4 84 30 7 95 25 1 45 49
10 43 44 82 22 74 32 68 70 86
57 24 39 51 83 2 81 71 42 94
78 72 41 73 92 35 76 9 3 58
19 40 37 67 31 23 55 69 8 17
64 46 93 27 28 91 26 65 47 14
15 75 79 88 62 97 54 12 18 89
13 38 61 0 29 66 53 6 98 20
Press any key to continue