C++ char search in a long string (random locations) - c++

So basically I have a character such as 'g' and I want to find the instances of the char in a string such as 'george'. The twist is that I want to return the location of the character randomly.
I have it working with string::find, which just returns the first instance of the location of the character, so in the above example it would be 0. But there is also a 'g' at 4.
I want my code to randomly return a location of the character in the string aka 0 or 4 instead of just returning the first instance of the letter. I was thinking of using a regex statement but I will admit I am not very confident in my regex skills.
Any guidance is greatly appreciated, thanks in advance :)

One solution could follow these steps:
Find all occurrences of the character in the string and store them in a vector.
Generate a random number using the rand() function, in the range 0 to (length of the vector - 1).
Use the generated number to index an element from the match vector and return the result.
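A minimal sketch of those three steps (the function name randomFind is mine; seeding rand() is left to the caller):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Return a random position of `c` in `s`, or -1 if `c` is absent.
int randomFind(const std::string& s, char c) {
    std::vector<int> positions;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i] == c)
            positions.push_back(static_cast<int>(i));
    if (positions.empty())
        return -1;
    // rand() % N has a slight modulo bias, but mirrors the steps above.
    return positions[std::rand() % positions.size()];
}
```

For the question's example, `randomFind("george", 'g')` returns either 0 or 4.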

You could write a function that stores all the occurrences of the char into an array, then picks a random index from that array.
something like this...
int findX(char x, char* s) {
    int* indexes = new int[strlen(s)]; // worst case: every char matches
    int count = 0;
    // findFirst(x, s, from) is assumed to return the index of the first
    // occurrence of x in s at or after `from`, or -1 if there is none.
    int index = findFirst(x, s, 0);
    while (index != -1) {
        indexes[count++] = index;
        index = findFirst(x, s, index + 1); // resume past the last hit
    }
    if (count > 0) {
        // generateRandom(n) is assumed to return a value in [0, n).
        index = indexes[generateRandom(count)];
    } else {
        index = -1;
    }
    delete[] indexes;
    return index;
}

One possible solution is to find all instances of the character in a loop (just iterate over all of the string and compare the characters). Save the positions of the letters in a vector.
Then randomly select one of the elements in the vector of positions to return.
For the random selection I suggest std::uniform_int_distribution.
If the data is read from a large file (and with "large" I mean multi-megabytes or larger) then instead of just a single loop over the string, consider using threads. Divide the string into smaller chunks, and have each thread go through its own chunk in parallel, adding to its own vector of positions. Then when all threads are done merge the position vectors into a single vector and randomly choose the position from that collected vector.
If the file is very large (multi-gigabytes) and it's stored on an SSD, have each thread read its own chunk as well. Otherwise you could memory-map the file contents and have each thread just go through the mapped memory as a large array. Memory-mapping such large files requires a 64-bit system though.
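A minimal sketch of the threaded scan described above (the function name, thread count, and chunk sizing are my own arbitrary choices; for small strings a single loop is faster):

```cpp
#include <algorithm>
#include <string>
#include <thread>
#include <vector>

// Each thread scans one chunk of `text` for `target`, writing positions
// into its own vector; results are merged afterwards, so no locking is needed.
std::vector<std::size_t> parallelFindAll(const std::string& text, char target,
                                         unsigned numThreads = 4) {
    std::vector<std::vector<std::size_t>> partial(numThreads);
    std::vector<std::thread> workers;
    std::size_t chunk = text.size() / numThreads + 1;
    for (unsigned t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(text.size(), begin + chunk);
        workers.emplace_back([&, t, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                if (text[i] == target)
                    partial[t].push_back(i);
        });
    }
    for (auto& w : workers) w.join();
    std::vector<std::size_t> merged;
    for (auto& p : partial)
        merged.insert(merged.end(), p.begin(), p.end());
    return merged;  // positions come out in ascending order
}
```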

You can use the pseudo-random number generator rand() from &lt;cstdlib&gt;. Here are more details on how to use it: http://www.cplusplus.com/reference/cstdlib/rand/
You are encouraged to use the C++11 random generators instead: http://en.cppreference.com/w/cpp/numeric/random
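For example, a C++11 replacement for rand() % n might look like this (a minimal sketch; the helper name is mine):

```cpp
#include <random>

// mt19937 seeded from random_device, with an unbiased distribution over
// a closed range -- the C++11 replacement for rand() % n.
int randomInRange(int lo, int hi) {
    static std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<int> dist(lo, hi);  // inclusive bounds
    return dist(gen);
}
```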

Related

Get string of characters from a vector of strings where the resulting string is equivalent to the majority of the same chars in the nth pos in strings

I'm reading from a file a series of strings, one for each line.
The strings all have the same length.
File looks like this:
010110011101
111110010100
010000100110
010010001001
010100100011
[...]
Result : 010110000111
I'll need to compare each 1st char of every string to obtain one single string at the end.
If the majority of the nth char in the strings is 1, the result string in that index will be 1, otherwise it's going to be 0 and so on.
For reference, the example I provided should return the value shown in the code block, as the majority of the chars in the first index of all strings is 0, the second one is 1 and so on.
I'm pretty new to C++ as I'm trying to move from Web Development to Software Development.
Also, I thought about making this with vectors, but maybe there is a better way.
Thanks.
First off, you show the result of your example input should be 010110000111 but it should actually be 010110000101 instead, because the 11th column has more 0s than 1s in it.
That being said, what you are asking for is simple. Just put the strings into a std::vector, and then run a loop for each string index, running a second loop through the vector counting the number of 1s and 0s at the current string index, eg:
vector<string> vec;
// fill vec as needed...

string result(12, '\0');
for (size_t i = 0; i < 12; ++i) {
    int digits[2]{};
    for (const auto &str : vec) {
        digits[str[i] - '0']++;
    }
    result[i] = (digits[1] > digits[0]) ? '1' : '0';
}
// use result as needed...

String encoding for memory optimization

I have a stream of strings in a format something like this: a:b, d:a, t:w, i:r, etc. Since I keep appending these strings, in the end it becomes a very large string.
I am trying to encode, for example:
a:b -> 1
d:a -> 2
etc.
My intention is to keep the final string as small as possible to save on memory. Hence I need to give a single-digit value to the string occurring the maximum number of times.
I have the following method in mind:
Create a map&lt;string, int&gt; - this will keep each string and its count. In the end I will replace the string with the maximum count with 1, the next with 2, and so on until the last element of the map.
Currently size of final string are ~100,000 characters.
I can't compromise on speed, please suggest if anyone has better technique to achieve this.
If I understand correctly, your input strings are in the range "a:a"..."z:z" and you simply need to count the appearances of each in the stream, regardless of order. If your distribution is even enough, you can fit each count in a uint16_t.
A map is implemented using a tree, so an array is much more efficient than a map both in memory and time.
So you can define an array
array<array<uint16_t, 26>, 26> counters = {{}};
and assuming your input is, for example input = "c:d", you can fill up the array as follows
counters[input[0]-'a'][input[2]-'a']++;
Then finally you can print out the frequencies of the input like this
for (size_t i = 0; i < counters.size(); ++i) {
    for (size_t j = 0; j < counters[i].size(); ++j) {
        cout << char(i + 'a') << ":" << char(j + 'a') << " " << counters[i][j] << endl;
    }
}
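Building on that counter array, the code-assignment step the question asks about could look like this (a sketch; assignCodes and the sequential numbering are my own assumptions):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Given the 26x26 counter array, order the "x:y" pairs by descending
// frequency and hand out codes 1, 2, 3, ... so the most common pair
// gets the smallest code.
std::vector<std::pair<std::string, int>>
assignCodes(const std::array<std::array<uint16_t, 26>, 26>& counters) {
    std::vector<std::pair<std::string, uint16_t>> seen;
    for (int i = 0; i < 26; ++i)
        for (int j = 0; j < 26; ++j)
            if (counters[i][j] > 0)
                seen.push_back({std::string(1, char('a' + i)) + ":" + char('a' + j),
                                counters[i][j]});
    std::sort(seen.begin(), seen.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::vector<std::pair<std::string, int>> codes;
    for (const auto& s : seen)
        codes.push_back({s.first, static_cast<int>(codes.size()) + 1});
    return codes;
}
```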

C++: Remove repeated numbers in a matrix

I want to remove numbers from a matrix that represents coordinates with the format 'x y z'. One example:
1.211 1.647 1.041
2.144 2.684 1.548
1.657 2.245 1.021
1.657 0.984 2.347
2.154 0.347 2.472
1.211 1.647 1.041
In this example lines 1 and 6 are the same (x, y and z all match) and I want to remove them, but I do not want to remove cases where only one value is equal (such as lines 3 and 4, which share the same x-coordinate).
These values are in a text file and I want to print the coordinates without duplication in another file or even in the same one.
A very simple solution would be to treat each line as a string and use a set of strings. As you traverse the file line-wise, you check if the current line exists in the set and if not, you insert and print it.
Complexity: O(n log n); extra memory needed: almost the size of your input file in the worst case.
With the same complexity and exactly the same worst-case memory consumption as the previous solution, you can load the file in memory, sort it line-wise, and then easily skip duplicates while printing. The same can be done inside the file if you are allowed to re-order it, and this way you need very little extra memory, but it will be much slower.
If memory and storage are an issue (I'm assuming so, since you can't duplicate the file), you can use the simple method of comparing the current line with all previous lines before printing, with O(n^2) complexity but no extra memory. This however is a rather bad solution, since you have to read from the file multiple times, which can be really slow compared to main memory.
How to do this if you want to preserve the order.
Read the coordinates into an array of structures like this
struct Coord
{
    double x, y, z;
    int pos;
    bool deleted;
};
pos is the line number, deleted is set to false.
Sort the structs by whatever axis tends to show the greatest variation.
Run through the array comparing the value of the axis you were using in the sort from the previous item to the value in the current item. If the difference is less than a certain preset delta (i.e., if you care about three digits after the decimal point you would look for a difference of 0.000999999 or so) you compare the remaining values and set deleted for any line where x, y and z are close enough.
for (int i = 1; i < count; i++)
{
    if (fabs(arr[i].x - arr[i-1].x) < 0.001)
        if (fabs(arr[i].y - arr[i-1].y) < 0.001)
            if (fabs(arr[i].z - arr[i-1].z) < 0.001)
                arr[i].deleted = true;
}
Sort the array again, this time ascending by pos, to restore the order.
Go through the array and output all items where deleted is false.
In C++, you can use the power of the STL to solve this problem. Use a map and store the three coordinates x, y and z as a key in the map. The mapped value for the key will store the count of that key.
Key_type = pair<pair<float,float>,float>
mapped_type = int
Create a map m with the above given key_type and mapped_type and insert all the rows into the map updating the count for each row. Let's assume n is the total number of rows.
for (i = 0; i < n; i++) {
    m[make_pair(make_pair(x, y), z)]++;
}
Each insertion takes O(log n) and you have to insert n times, so the overall time complexity will be O(n log n). Now, loop over all the rows of the matrix again; if the mapped value of a row is 1, then it is unique.

Increase string overlap matrix building efficiency

I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates "wild card" (probe gave below a threshold quality score) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
// part that compares them all to each other
for (int j = 0; j < counter; j++)        // counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++)
    {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        // boring stuff
    }
//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    // basically just tries overlapping the strings every possible way
    for (int j = 0; j < str2.length(); j++)
    {
        int counter = 0, offset = 0;
        // bounds checks added so we never read past either string's end
        while (offset < str1.length() && j + offset < str2.length() &&
               str1[offset] == str2[j+offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} // this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need n^2/2 comparisons, and you will need half a terabyte of memory even if you just store one byte per string pair.
However, I assume what you are really interested in is long overlaps, those that have more than, say, 20 or 30 or even more than 80 characters in common, and you probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing if that fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't really have to, store the indexes of every original string together with the substring. You'll get something like (example)
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one, else enter it into the table.
e) Repeat 3d with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
If this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random/you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
I don't see many ways around the fact that you need to compare each string with every other one, including shifting them, and that is by itself extremely expensive; a computation cluster seems the best approach.
The only thing I see how to improve is the string comparison by itself: replace A,C,T,G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea though, but still a possible option to investigate), and then compare them quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you have. That's just a way to process the wildcard; sorry, I don't have a better idea to reduce the complexity of the overall comparison.
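A sketch of that bit-pattern idea (function names are mine; note that with AND an 'X' matches everything, unlike the question's original code where 'X' never matched, so pick whichever semantics your application needs):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// One nibble per base; 'X' has all four bits set, so the AND of two
// encoded positions is non-zero exactly when they are compatible.
uint8_t encodeBase(char c) {
    switch (c) {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F;  // 'X' wildcard
    }
}

std::vector<uint8_t> encode(const std::string& s) {
    std::vector<uint8_t> out(s.size());
    std::transform(s.begin(), s.end(), out.begin(), encodeBase);
    return out;
}

// Longest run of compatible positions at a fixed alignment.
int longestCompatibleRun(const std::vector<uint8_t>& a,
                         const std::vector<uint8_t>& b) {
    int best = 0, run = 0;
    std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] & b[i]) best = std::max(best, ++run);
        else run = 0;
    }
    return best;
}
```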

Fast way to pick randomly from a set, with each entry picked only once?

I'm working on a program to solve the n queens problem (the problem of putting n chess queens on an n x n chessboard such that none of them is able to capture any other using the standard chess queen's moves). I am using a heuristic algorithm, and it starts by placing one queen in each row and picking a column randomly out of the columns that are not already occupied. I feel that this step is an opportunity for optimization. Here is the code (in C++):
vector<int> colsleft;
// fills the vector sequentially with integer values
for (int c = 0; c < size; c++)
    colsleft.push_back(c);

for (int i = 0; i < size; i++)
{
    vector<int>::iterator randplace = colsleft.begin() + rand() % colsleft.size();
    /* chboard is an integer array, with each entry representing a row
       and holding the column position of the queen in that row */
    chboard[i] = *randplace;
    colsleft.erase(randplace);
}
If it is not clear from the code: I start by building a vector containing an integer for each column. Then, for each row, I pick a random entry in the vector, assign its value to that row's entry in chboard[]. I then remove that entry from the vector so it is not available for any other queens.
I'm curious about methods that could use arrays and pointers instead of a vector. Or <list>s? Is there a better way of filling the vector sequentially, other than the for loop? I would love to hear some suggestions!
The following should fulfill your needs:
#include <algorithm>
#include <random>
...
int randplace[size];
for (int i = 0; i < size; i++)
    randplace[i] = i;

// std::random_shuffle was deprecated in C++14 and removed in C++17;
// std::shuffle with an explicit engine is the portable replacement.
std::shuffle(randplace, randplace + size, std::mt19937{std::random_device{}()});
You can do the same stuff with vectors, too, if you wish.
Source: http://gethelp.devx.com/techtips/cpp_pro/10min/10min1299.asp
Couple of random answers to some of your questions :):
As far as I know, there's no way to fill an array with consecutive values without iterating over it first. HOWEVER, if you really just need consecutive values, you do not need to fill the array - just use the cell indices as the values: a[0] is 0 and a[100] is 100 - when you get a random number, treat the number as the value.
You can implement the same with a list<> and remove cells you already hit, or...
For better performance, rather than removing cells, why not put an "already used" value in them (like -1) and check for that. Say you get a random number like 73, and a[73] contains -1, you just get a new random number.
Finally, describing item 3 reminded me of a re-hashing function. Perhaps you can implement your algorithm as a hash-table?
Your colsleft.erase(randplace); line is really inefficient, because erasing an element in the middle of the vector requires shifting all the ones after it. A more efficient approach that will satisfy your needs in this case is to simply swap the element with the one at index (size - i - 1) (the element whose index will be outside the range in the next iteration, so we "bring" that element into the middle, and swap the used one out).
And then we don't even need to bother deleting that element -- the end of the array will accumulate the "chosen" elements. And now we've basically implemented an in-place Knuth shuffle.