String encoding for memory optimization - c++

I have a stream of strings in a format like this: a:b, d:a, t:w, i:r, etc. Since I keep appending these strings, the result eventually becomes a very large string.
I am trying to encode, for example:
a:b -> 1
d:a -> 2
etc.
My intention is to keep the final string as small as possible to save memory. Hence I need to give a single-digit value to the string occurring the maximum number of times.
I have following method in mind:
Create a map<string, int> - this will keep each string and its count. In the end I will replace the string with the maximum count with 1, the next with 2, and so on until the last element of the map.
Currently the final string is ~100,000 characters long.
I can't compromise on speed, so please suggest a better technique to achieve this if you have one.

If I understand correctly, your input strings are in the range "a:a"..."z:z" and you simply need to count the appearances of each in the stream, regardless of order. If your distribution is even enough, you can count them using a uint16_t.
A std::map is implemented as a tree, so an array is much more efficient than a map in both memory and time.
So you can define an array
array<array<uint16_t, 26>, 26> counters = {{}};
and, assuming your input is, for example, input = "c:d", you can fill the array as follows
counters[input[0]-'a'][input[2]-'a']++;
Then finally you can print out the frequencies of the input like this
for (size_t i = 0; i < counters.size(); ++i) {
    for (size_t j = 0; j < counters[i].size(); ++j) {
        cout << char(i + 'a') << ":" << char(j + 'a') << " " << counters[i][j] << endl;
    }
}
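
To get from these counts to the encoding asked about in the question (shortest codes for the most frequent pairs), one possible follow-up is sketched below. The demo data, the sorting step and the code table are my own assumptions, not part of the answer above:

#include <algorithm>
#include <array>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    array<array<uint16_t, 26>, 26> counters = {{}};

    // Tiny demo input standing in for the real stream (hypothetical data).
    vector<string> stream = {"a:b", "d:a", "a:b", "t:w", "a:b", "d:a"};
    for (const string &s : stream)
        counters[s[0] - 'a'][s[2] - 'a']++;

    // Collect (count, pair-index) entries and sort by descending count.
    vector<pair<uint16_t, int>> freq;
    for (int i = 0; i < 26; ++i)
        for (int j = 0; j < 26; ++j)
            if (counters[i][j] > 0)
                freq.push_back({counters[i][j], i * 26 + j});
    sort(freq.begin(), freq.end(),
         [](const auto &a, const auto &b) { return a.first > b.first; });

    // Assign code 1 to the most frequent pair, 2 to the next, and so on.
    array<int, 26 * 26> code = {};
    for (size_t rank = 0; rank < freq.size(); ++rank)
        code[freq[rank].second] = static_cast<int>(rank) + 1;

    cout << code[('a' - 'a') * 26 + ('b' - 'a')] << '\n'; // a:b -> 1
    cout << code[('d' - 'a') * 26 + ('a' - 'a')] << '\n'; // d:a -> 2
}

Note that only the nine most frequent pairs get single-digit codes; anything beyond that necessarily takes more than one character, whatever scheme is used.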

Related

Get string of characters from a vector of strings where the resulting string is equivalent to the majority of the same chars in the nth pos in strings

I'm reading from a file a series of strings, one for each line.
The strings all have the same length.
File looks like this:
010110011101
111110010100
010000100110
010010001001
010100100011
[...]
Result : 010110000111
I'll need to compare each 1st char of every string to obtain one single string at the end.
If the majority of the nth char in the strings is 1, the result string in that index will be 1, otherwise it's going to be 0 and so on.
For reference, the example I provided should return the value shown in the code block, as the majority of the chars in the first index of all strings is 0, the second one is 1 and so on.
I'm pretty new to C++ as I'm trying to move from Web Development to Software Development.
Also, I thought about doing this with vectors, but maybe there is a better way.
Thanks.
First off, you show the result of your example input should be 010110000111 but it should actually be 010110000101 instead, because the 11th column has more 0s than 1s in it.
That being said, what you are asking for is simple. Just put the strings into a std::vector, and then run a loop for each string index, running a second loop through the vector counting the number of 1s and 0s at the current string index, eg:
vector<string> vec;
// fill vec as needed...

string result(12, '\0');
for (size_t i = 0; i < 12; ++i) {
    int digits[2]{};
    for (const auto &str : vec) {
        digits[str[i] - '0']++;
    }
    result[i] = (digits[1] > digits[0]) ? '1' : '0';
}
// use result as needed...
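For completeness, here is a minimal sketch of how vec could be filled from a file like the one described in the question; the file name input.txt and the rest of the scaffolding are assumptions, not part of the answer above:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector<string> vec;
    ifstream in("input.txt"); // assumed file name
    string line;
    while (getline(in, line)) {
        if (!line.empty())
            vec.push_back(line);
    }
    if (vec.empty()) return 0;

    // Majority vote per column, as in the answer above.
    const size_t width = vec[0].size();
    string result(width, '0');
    for (size_t i = 0; i < width; ++i) {
        int digits[2]{};
        for (const auto &str : vec)
            digits[str[i] - '0']++;
        result[i] = (digits[1] > digits[0]) ? '1' : '0';
    }
    cout << result << '\n';
}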

find a frequency of words in text file

Please can anybody help me? I'm a beginner and I have a hard assignment.
I need to write a c++ program that does the following :
Ask the user to enter two text files: the first one contains the text, the second one contains a list of words in one column, regardless of their number, like this:
//output
Enter the text file: text_file.txt
Enter the keywords file: keywords_file.txt
Search for the keywords from the keywords file in the text file
Count the appearances of each word, like this:
system: 55 times
analysis: 21 times
Then write the output to a new text file (ofstream).
This is my code. It works, but it asks the user to enter the words. I want it to take the words from the input text file and show the output in the output text file (ofstream).
This is part of the keywords file contents:
//they are as a list in the original file
model,
management,
e-commerce,
system,
cluster
infrastructure,
computer,
knowledge,
metadata,
process,
alter,
simulate,
stock,
inventory,
strategy,
plan,
historical,
deices,
exact,
Analyst,
break even point,
SWOT,
tactic,
develop,
prototype,
feasible,
Inferences,
busy,
cloud compute,
schema,
enterprise,
custom,
expert system,
structure,
data mine,
data warehouse,
organism,
data mart,
operate,
quality assure,
forecast,
report,
This is part of the book contents:
the circuit is characterised by long straights and chicanes. this means the cars’ engines
are at full throttle for over 75% of the lap, a higher percentage than most other circuits.
the track requires heavier-than-average braking over a given lap, as the cars repeatedly
decelerate at the end of some of the world’s fasTest straights for the slow chicanes.
the chicanes are lined by rugged kerbs. riding over these hard is crucial for fast laps.
the long straights require small wings for minimum drag. this means lower downforce,
resulting in lower grip on corners and under braking, and less stability over bumps.
the main high-speed corners lesmo 1, lesmo 2 and parabolica are all right turns.
parts of the circuit are surrounded by trees, which means leaves can be blown
onto the track.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

int main() {
    ofstream ofs("output.txt");
    ofs << "Keywords " << endl;

    string kw_file;
    cout << "enter the file name :" << " ";
    getline(cin, kw_file);
    ifstream keywordFile(kw_file.c_str());

    string text_file;
    cout << "enter the file name :" << " ";
    getline(cin, text_file);
    ifstream textFile(text_file.c_str());

    if (!keywordFile.is_open() || !textFile.is_open()) {
        std::cout << "Error in opening files\n";
        return 1;
    }

    { // vector method
        std::vector<std::string> keywordVector;
        std::string keyword;
        while (keywordFile >> std::ws >> keyword) {
            keywordVector.push_back(keyword);
        }

        std::vector<int> countVector(keywordVector.size());
        std::string textWord;
        while (textFile >> std::ws >> textWord) {
            for (size_t i = 0; i < keywordVector.size(); ++i) {
                if (keywordVector[i] == textWord) {
                    countVector[i]++;
                }
            }
        }

        for (size_t i = 0; i < keywordVector.size(); ++i) {
            ofs << "The number of times [" << keywordVector[i]
                << "] appears in textFile is " << countVector[i] << '\n';
        }
    }

    keywordFile.clear();
    textFile.clear();
    keywordFile.seekg(0);
    textFile.seekg(0);
}
Preliminary
Before even starting to solve this problem, let's think about what is actually required from this:
Read in the keywords from keywords_file.txt
Store these keywords in a data structure for later use.
Read in words from text_file.txt
Use the data structure from Step#2 to compare the words read in Step#3
Data structure simply refers to how your program stores the data that it works on. When you create an array such as
int arr[] = {1, 2, 3};
to later do some manipulations on, that too is a data structure.
Choosing the Data Structure for the Problem
From the above steps, reading in words from the file is a simple problem to solve. The most pressing problem is to figure out the data structure in Step#2.
Let's try to start with a basic one: vector. We are going to use vector instead of arrays like in the C-language primarily for 2 reasons:
Fixed arrays are simple to handle and are the preferred data structure when we have fixed data. But in this problem, we have dynamic data because the number of words in the keywords_file.txt could vary. It's much easier to use vectors as dynamic arrays rather than C-style dynamic arrays because it's much easier to manage memory with vector than with a plain array.
We are using C++ and not C, so let's try to use the data structures available in the Standard Library of C++.
If we choose a vector as the data structure to store the words from keywords_file.txt, then we would also need a data structure to store the count of each word, i.e. the number of times it's found in text_file.txt. We can again use a vector for this. We could have some code like (pseudo-code only)
Read in a keyword (using a std::string) from keywords_file.txt
Store it in a vector (using the push_back() function) [let's call this vector keywordVector]
Repeat Step#1 and #2 until all the words from keywords_file.txt are read in and added to keywordVector.
Create another vector (let's call it countVector) for storing the counts of the words, of the same size as keywordVector.
Initialize all the values in countVector with 0.
Read in a word (again a std::string) from text_file.txt
Search for this word in the vector.
If the word is found in the vector, increment that word's count by increasing the count in countVector for that word, else ignore the word
Repeat Step#6 to Step#8 till all the words from text_file.txt are read in.
Note for Step#8, if word from text_file.txt matches with word at keywordVector[i], then increment the value of the corresponding index in the countVector i.e. ++countVector[i]
This should work for small programs and should solve the problem. Here's what the code would look like:
std::vector<std::string> keywordVector; // empty vector of strings
std::string keyword; // string to store each word read in from keywords_file.txt
while (keywordFile >> std::ws >> keyword) { // std::ws just ignores any whitespace
    keywordVector.push_back(keyword);
}

// create vector of same size as keywordVector to keep track of the counts
std::vector<int> countVector(keywordVector.size());
std::string textWord;
while (textFile >> std::ws >> textWord) {
    // Search through the keywordVector to check if any word matches
    for (size_t i = 0; i < keywordVector.size(); ++i) {
        if (keywordVector[i] == textWord) {
            countVector[i]++; // Found a match, increment the count
        }
    }
}
Is that all there is?
No! The above method should work fine for small programs but things become problematic when the number of keywords becomes large and/or the number of words in text_file.txt becomes large.
Imagine this. Your keywords_file.txt becomes very large since it stores all the words in the Oxford English Dictionary and has around a million words. The first part of reading in the keywords would be fine since we require it anyway. But the next part is where the problems start. Imagine all the words in your text_file.txt were the word zebra. While searching for zebra in this list, you would have to go through that inner loop once for each word. If there are a billion words in your text_file.txt, then you would end up doing a million iterations for each word that you read, and the total would be 1 billion × 1 million = 1 quadrillion iterations (1 followed by 15 zeroes). That is far too many, and your program would take a very long time to complete. Assuming each iteration takes 1 nanosecond, the total time would be 1 million seconds (10^15 × 10^-9), which is ((1000000/60)/60)/24 = around 11.5 days!! Surely that is not acceptable!
We realize that the basic problem is that we have to search the entire vector every time we read in a word from text_file.txt. If only there were a way to just look up the word in the keywordVector directly, without having to iterate over the entire vector each and every time.
This is exactly what a map data structure helps with. A hash map uses a hash function to quickly look things up in a collection. Maps store what are called key-value pairs, and you use the key to look up its value.
e.g. if we have something like
{ 101: "Alice", 202: "Bob", 303: "Charlie"}
Then 101 is the key and Alice is the value of that key, 202 is the key, Bob is the value etc.
In the C++ STL there are 2 data structures built upon this concept: std::unordered_map and std::map. As the names suggest, the first one stores the keys in no particular order (using a hash table), while the second one stores them in sorted order (using a tree).
Given this very basic intro to maps, we can see how this might be helpful for our case. We don't need to store the words from keywords_file.txt in any particular order, so we can use std::unordered_map and use the keyword as the key for this map. We'll store the number of times that word appears as the value of this key. Here's some code:
/* Create a map to store key-value pairs of
   string and int, where each string is a keyword
   from the keywords_file.txt
*/
std::unordered_map<std::string, int> keywordMap;
std::string keyword;
while (keywordFile >> std::ws >> keyword) {
    // Initialize each word's count to 0
    keywordMap[keyword] = 0;
}

std::string textWord;
while (textFile >> std::ws >> textWord) {
    /* We do a find for the textWord in the map.
       This find isn't a linear scan (unlike the vector approach)
       but uses hashing to quickly look up whether textWord exists
       in the map or not
    */
    if (keywordMap.find(textWord) != end(keywordMap)) {
        // If it exists, then we can just directly increment the count
        keywordMap[textWord]++;
    }
}
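
The snippet above only counts; to produce the "word: N times" output the question asks for, a short follow-up along these lines could be added (ofs is the ofstream from the question's code, and the exact output format is an assumption):

// Write each keyword and its count to the output file.
for (const auto &entry : keywordMap) {
    ofs << entry.first << ": " << entry.second << " times" << '\n';
}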
Using our time calculations, this time around the lookup of each of a billion words from the text_file.txt would take only on the order of a few nanoseconds, since unordered_map::find() has average-case constant time complexity, unlike the earlier approach's linear complexity.
So for a billion words it takes on the order of a billion nanoseconds, which is just 1 second! Imagine the drastic difference in times: the earlier method took days and this takes seconds. Hashing is a very powerful concept and finds applications in a lot of problems.
Followup
Here's the full code, if you want to use it. This is a basic solution to finding the frequency of words in a file. Since you're starting out in C++, I'd suggest you take your time to read in depth about all the data structures used here and use this example to build upon your understanding. Also, if complexity is new to you, please do acquaint yourself with the topic.

C++ char search in a long string (random locations)

So basically I have a character such as 'g' and I want to find the instances of the char in a string such as 'george'. The twist is that I want to return the location of the character randomly.
I have it working with string.find, which just returns the first instance of the location of the character, so in the above example it would be 0. But there is also a 'g' at 4.
I want my code to randomly return a location of the character in the string aka 0 or 4 instead of just returning the first instance of the letter. I was thinking of using a regex statement but I will admit I am not very confident in my regex skills.
Any guidance is greatly appreciated, thanks in advance :)
One solution could follow the following steps:
Find all occurrences of a character in a string, store them in a vector
Generate a random number using the rand() function, between 0 and the length of the vector minus 1.
Use the generated number to index an element from the match vector and return the result.
You could write a function that stores into an array all the occurrences of the char, then picks a random index from that array.
something like this...
// Returns a randomly chosen index of x in s, or -1 if x does not occur.
// (Needs <cstring> for strlen and <cstdlib> for rand.)
int findX(char x, const char* s) {
    int* indexes = new int[strlen(s)]; // reserve room for every possible occurrence
    int count = 0;
    for (size_t i = 0; s[i] != '\0'; ++i) {
        if (s[i] == x)
            indexes[count++] = static_cast<int>(i);
    }
    int index = -1;
    if (count > 0) {
        int randomIndex = rand() % count; // pick one occurrence at random
        index = indexes[randomIndex];
    }
    delete[] indexes;
    return index;
}
One possible solution is to find all instances of the character in a loop (just iterate over all of the string and compare the characters). Save the positions of the letters in a vector.
Then randomly select one of the elements in the vector of positions to return.
For the random selection I suggest std::uniform_int_distribution.
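A minimal sketch of that approach, assuming C++11 is available (the function name and example string are mine, not part of the answer):

#include <iostream>
#include <random>
#include <string>
#include <vector>
using namespace std;

// Return a randomly chosen index of ch in s, or string::npos if ch is absent.
size_t randomPosition(const string &s, char ch) {
    vector<size_t> positions;
    for (size_t i = 0; i < s.size(); ++i)
        if (s[i] == ch)
            positions.push_back(i);
    if (positions.empty())
        return string::npos;
    static mt19937 gen{random_device{}()};
    uniform_int_distribution<size_t> dist(0, positions.size() - 1);
    return positions[dist(gen)];
}

int main() {
    cout << randomPosition("george", 'g') << '\n'; // prints 0 or 4
}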
If the data is read from a large file (and with "large" I mean multi-megabytes or larger) then instead of just a single loop over the string, consider using threads. Divide the string into smaller chunks, and have each thread go through its own chunk in parallel, adding to its own vector of positions. Then when all threads are done merge the position vectors into a single vector and randomly choose the position from that collected vector.
If the file is very large (multi-gigabytes) and it's stored on an SSD, have each thread read its own chunk as well. Otherwise you could memory map the file contents and have each thread just go through the mapped memory as a large array. Memory mapping such large files requires a 64-bit system though.
You can use the C pseudo-random number generation function rand(). Here are more details on how to use it: http://www.cplusplus.com/reference/cstdlib/rand/
You are encouraged to use C++11 random generators http://en.cppreference.com/w/cpp/numeric/random

Caesar Cipher w/Frequency Analysis how to proceed next?

I understand this has been asked before, and I somewhat have a grasp on how to compare frequency tables between the cipher and English (the language I'm assuming the text is in for my program), but I'm unsure about how to turn this into code.
void frequencyUpdate(std::vector< std::vector< std::string> > &file, std::vector<int> &freqArg) {
    for (int itr_1 = 0; itr_1 < file.size(); ++itr_1) {
        for (int itr_2 = 0; itr_2 < file.at(itr_1).size(); ++itr_2) {
            for (int itr_3 = 0; itr_3 < file.at(itr_1).at(itr_2).length(); ++itr_3) {
                file.at(itr_1).at(itr_2).at(itr_3) = toupper(file.at(itr_1).at(itr_2).at(itr_3));
                if (!((int)file.at(itr_1).at(itr_2).at(itr_3) < 65 || (int)file.at(itr_1).at(itr_2).at(itr_3) > 90)) {
                    int temp = (int)file.at(itr_1).at(itr_2).at(itr_3) - 65;
                    freqArg.at(temp) += 1;
                }
            }
        }
    }
}
This is how I get the frequency for a given file whose contents are split into lines and then into words, hence the double vector of strings, using the ASCII values of the chars minus 65 as indices. The resulting vector of ints that holds the frequencies is saved.
Now is where I don't know how to proceed. Should I hardcode a const std::vector<int> with the English letter frequencies and then somehow do the comparison? How would I compare efficiently, rather than simply comparing the vectors element by element, which is probably not an efficient method?
This comparison is for getting an appropriate shift value for the Caesar cipher in order to decrypt a text. I don't want to use brute force and shift one at a time until the text is readable. Any advice on how to approach this? Thanks.
Take your frequency vector and the frequency vector for "typical" English text, and find the cross-correlation.
The highest values of the cross-correlation correspond to the most likely shift values. At that point you'll need to use each one to decrypt, and see whether the output is sensible (i.e. forms real words and coherent sentences).
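A sketch of that cross-correlation, assuming a 26-entry count vector like the one built by frequencyUpdate above; the English frequency values are approximate and the function name is my own:

#include <array>
#include <vector>
using namespace std;

// Approximate English letter frequencies for 'a'..'z', in percent.
const array<double, 26> english = {
    8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77, 4.03, 2.41,
    6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07};

// Given observed ciphertext letter counts, return the most likely Caesar shift.
int likelyShift(const vector<int> &freq) {
    int bestShift = 0;
    double bestScore = -1.0;
    for (int shift = 0; shift < 26; ++shift) {
        double score = 0.0;
        for (int c = 0; c < 26; ++c) {
            // Under this key, ciphertext letter c would decrypt to (c - shift) mod 26.
            score += freq[c] * english[(c - shift + 26) % 26];
        }
        if (score > bestScore) {
            bestScore = score;
            bestShift = shift;
        }
    }
    return bestShift;
}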
In English, 'e' has the highest frequency. So whatever most frequent letter you got from your ciphertext, it most likely maps to 'e'.
Since e --> X, the key should be the difference between your most frequent letter X and 'e'.
If this is not the right key (because a too-short ciphertext distorts the statistics), try to match your most frequent ciphertext letter with the second most frequent letter in English, i.e. 't'.
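As a tiny illustration of that idea (assuming freq is the 26-entry count vector from the question's code and <algorithm> is included):

// Index of the most frequent ciphertext letter (0 = 'a', 25 = 'z').
int mostFrequent = static_cast<int>(max_element(freq.begin(), freq.end()) - freq.begin());
// If that letter stands for 'e', the candidate Caesar shift is:
int shift = (mostFrequent - ('e' - 'a') + 26) % 26;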
I would suggest a graph traversal algorithm. Your starting node has no substitutions assigned and has 26 connected nodes, one for each possible letter substitution for the most frequently occurring ciphertext letter. The next node has another 25 connected nodes for the possible letters for the second most frequent ciphertext letter (one less, since you've already used one possible letter). Which destination node you choose should be based on which letters are most likely given a normal frequency distribution for the target language.
At each node, you can test for success by doing your substitutions into the ciphertext, and finding all the resulting words that now match entries in a dictionary file. The more matches you've found, the more likely you've got the correct substitution key.
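A hedged sketch of that success test (the dictionary container, the word splitting and the function name are my own assumptions):

#include <sstream>
#include <string>
#include <unordered_set>
using namespace std;

// Count how many whitespace-separated words of a candidate plaintext
// appear in a dictionary; a higher score means a more plausible key.
int dictionaryScore(const string &candidate, const unordered_set<string> &dictionary) {
    istringstream iss(candidate);
    string word;
    int matches = 0;
    while (iss >> word) {
        if (dictionary.count(word))
            ++matches;
    }
    return matches;
}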

Increase string overlap matrix building efficiency

I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates "wild card" (probe gave below a threshold quality score) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
//part that compares them all to each other
for (int j = 0; j < counter; j++) { //counter holds # of DNA
    for (int k = j + 1; k < counter; k++) {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        //boring stuff
    }
}

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (size_t j = 0; j < str2.length(); j++)
    {
        int counter = 0;
        size_t offset = 0;
        while (offset < str1.length() && j + offset < str2.length() &&
               str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = (int)j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need n^2/2 comparisons, and you will need one half terabyte of memory even if you just store one byte per string pair.
However, I assume what you are really interested in is long overlaps: those that have more than, say, 20 or 30 or even more than 80 characters in common. You probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing if that fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application that i don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't really have to, store the indexes of every original string together with the substring. You'll get something like (example)
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurence counter by one, else enter it into the table.
e) Repeat 3d) with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and on which comparisons you're really interested in. If your data is very random or you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
I don't see many ways around the fact that you need to compare each string with every other one, including shifting them, and that by itself takes very long; a computing cluster seems the best approach.
The only thing I can see to improve is the string comparison itself: replace A, C, T, G and X with bit patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item on 4 bits, i.e. two per byte (this might not be a good idea though, but still a possible option to investigate), and then compare them quickly with a AND operation, so that you 'just' have to count how many consecutive non zero values you have. That's just a way to process the wildcard, sorry I don't have a better idea to reduce the complexity of the overall comparison.