Working with big text files - c++

I have a file in following format:
[1]
Parameter1=Value1
.
.
.
End
[2]
.
.
The number in brackets is the id of the entity. There are around 4500 entities. I need to parse all entities and pick the ones matching my parameters and values. The file is around 20 MB. My first approach was to read the file line by line and store the entities in structs like:
struct Component{
std::string parameter;
std::string value;
};
struct Entity{
std::string id;
std::list<Component> components;
};
std::list<Entity> g_entities;
But this approach took an enormous amount of memory and was very slow. I've also tried storing only the entities that match my parameters/values, but that was also really slow and still took quite some memory. Ideally I would like to keep all the data in memory, so that I don't have to load the file every time I need to filter by parameters/values, if that is possible with a reasonable amount of memory.
Edit 1:
I read file line by line:
std::ifstream readTemp(filePath);
std::stringstream dataStream;
dataStream << readTemp.rdbuf();
readTemp.close();
while (std::getline(dataStream, line)){
if (line.find('[') != std::string::npos){
// Create Entity
Entity entity;
// Set entity id
entity.id = line.substr(line.find('[') + 1, line.find(']') - 1);
// Read all lines until EnumEnd=0
while (1){
std::getline(dataStream, line);
// Break loop if end of entity
if (line.find("EnumEnd=0") != std::string::npos){
if (CheckMatch(entity))
entities.push_back(entity);
entity.components.clear();
break;
}
Component comp;
int pos_eq = line.find('=');
comp.parameterId = line.substr(0, pos_eq);
comp.value = line.substr(pos_eq + 1);
entity.components.push_back(comp);
}
}
}

PS: After your edit and the comment concerning memory consumption
500MB / 20MB = 25.
If each line is 25 chars long, the memory consumption looks OK.
You could use a look-up table for mapping parameter names to numbers.
If the set of names is small, this can cut the consumption by up to a factor of 2.
Your data structure could look like this:
std::map<int, std::map<int, std::string> > my_ini_file_data;
std::map<std::string, int> param_to_idx;
(provided the parameter names within sections (entities as you call it) are not unique)
Putting the data is:
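std::string param = "Param";
std::string value = "Val";
int entity_id = 0;
if ( param_to_idx.find(param) == param_to_idx.end() )
{
// take the size first: before C++17 the evaluation order of
// param_to_idx[param] = param_to_idx.size() is unspecified
int idx = param_to_idx.size();
param_to_idx[param] = idx;
}
my_ini_file_data[entity_id][ param_to_idx[param] ] = value;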
getting the data is:
value = my_ini_file_data[entity_id][ param_to_idx[param] ];
If the values-set is also considerably smaller than the number of entries,
you could even map values to numbers:
std::map<int, std::map<int, int> > my_ini_file_data;
std::map<std::string, int> param_to_idx;
std::map<std::string, int> value_to_idx;
std::map<int, std::string> idx_to_value;
Putting the data is:
std::string param = "Param";
std::string value = "Val";
int entity_id = 0;
if ( param_to_idx.find(param) == param_to_idx.end() )
{
int idx = param_to_idx.size(); // take the size before inserting
param_to_idx[param] = idx;
}
if ( value_to_idx.find(value) == value_to_idx.end() )
{
int idx = value_to_idx.size();
value_to_idx[value] = idx;
idx_to_value[idx] = value;
}
my_ini_file_data[entity_id][ param_to_idx[param] ] = value_to_idx[value];
getting the data is:
value = idx_to_value[my_ini_file_data[entity_id][ param_to_idx[param] ] ];
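A minimal sketch of how the index maps above could be wrapped into put and filter helpers. The names IniData, index_of, put and entities_matching are made up for illustration, the reverse mapping is a std::vector instead of a std::map, and the sketch assumes that a parameter name occurs at most once per entity.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical wrapper around the maps described above.
struct IniData {
    std::map<int, std::map<int, int>> data;   // entity id -> (param idx -> value idx)
    std::map<std::string, int> param_to_idx;
    std::map<std::string, int> value_to_idx;
    std::vector<std::string> idx_to_value;    // reverse mapping, position == value idx

    // Return the index of name, assigning the next free index if it is new.
    static int index_of(std::map<std::string, int>& table, const std::string& name) {
        auto it = table.find(name);
        if (it != table.end()) return it->second;
        int idx = static_cast<int>(table.size());
        table.emplace(name, idx);
        return idx;
    }

    void put(int entity_id, const std::string& param, const std::string& value) {
        int p = index_of(param_to_idx, param);
        int v = index_of(value_to_idx, value);
        // A new value always gets the index idx_to_value.size().
        if (v == static_cast<int>(idx_to_value.size())) idx_to_value.push_back(value);
        data[entity_id][p] = v;
    }

    // Collect the ids of all entities where param has the given value.
    std::vector<int> entities_matching(const std::string& param,
                                       const std::string& value) const {
        std::vector<int> ids;
        auto p = param_to_idx.find(param);
        auto v = value_to_idx.find(value);
        if (p == param_to_idx.end() || v == value_to_idx.end()) return ids;
        for (const auto& entity : data) {
            auto c = entity.second.find(p->second);
            if (c != entity.second.end() && c->second == v->second)
                ids.push_back(entity.first);
        }
        return ids;
    }
};
```

Filtering then compares integers only; the strings are looked up once per query.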
Hope, this helps.
Initial answer
Concerning memory, I wouldn't care unless you have a kind of embedded system with very small memory.
Concerning the speed, I could give you some suggestions:
Find out, what is the bottleneck.
Use std::list! With std::vector the storage is reallocated each time the vector outgrows its capacity. If for some reason you need a vector at the end, create the vector reserving the required number of entries, which you get by calling list::size().
Write a while loop, there you only call getline. If this alone is
already slow, read the entire block at once, create a reader-stream
out of the char* block and read line by line from the stream.
If the speed of the simple reading is OK, optimize your parsing code. You
can reduce the number of find-calls by storing the position. e.g.
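std::size_t pos_begin = line.find('[');
if (pos_begin != std::string::npos){
std::size_t pos_end = line.find(']', pos_begin + 1);
if (pos_end != std::string::npos){
// the second argument of substr is a length, not an end position
entity.id = line.substr(pos_begin + 1, pos_end - pos_begin - 1);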
// Read all lines until EnumEnd=0
while (1){
std::getline(readTemp, line);
// Break loop if end of entity
if (line.find("EnumEnd=0") != std::string::npos){
if (CheckMatch(entity))
entities.push_back(entity);
break;
}
Component comp;
int pos_eq = line.find('=');
comp.parameter= line.substr(0, pos_eq);
comp.value = line.substr(pos_eq + 1);
entity.components.push_back(comp);
}
}
}
Depending on how big your entities are, check if CheckMatch is slow. The smaller the entities, the slower the code - in this case.
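Two of the suggestions above (work on the whole file content at once, and keep the positions returned by find instead of searching again) might be combined roughly as follows. This is only a sketch: parse_entities and its end_marker parameter are hypothetical names, the file is assumed to have been read into a std::string already, and every entity is assumed to end with a line containing the marker.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>

struct Component { std::string parameter; std::string value; };
struct Entity { std::string id; std::list<Component> components; };

// Parse text that holds the complete file content.
// Assumption: an entity starts with "[id]" and ends with a line containing end_marker.
std::list<Entity> parse_entities(const std::string& text,
                                 const std::string& end_marker = "EnumEnd=0") {
    std::list<Entity> entities;
    Entity current;
    bool inside = false;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Cut out one line without going through a stream.
        std::size_t eol = text.find('\n', pos);
        if (eol == std::string::npos) eol = text.size();
        std::string line = text.substr(pos, eol - pos);
        pos = eol + 1;
        if (!line.empty() && line.back() == '\r') line.pop_back();

        if (!inside) {
            std::size_t open = line.find('[');
            if (open == std::string::npos) continue;
            std::size_t close = line.find(']', open + 1);
            if (close == std::string::npos) continue;
            current = Entity{};
            current.id = line.substr(open + 1, close - open - 1);
            inside = true;
        } else if (line.find(end_marker) != std::string::npos) {
            entities.push_back(std::move(current));
            current = Entity{};
            inside = false;
        } else {
            std::size_t eq = line.find('=');
            if (eq == std::string::npos) continue;  // skip malformed lines
            current.components.push_back({line.substr(0, eq), line.substr(eq + 1)});
        }
    }
    return entities;
}
```

A filter such as CheckMatch could be applied right before the push_back if only matching entities should be kept.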

You can use less memory by interning your params and values, so as not to store multiple copies of them.
You could have a map of strings to unique numeric IDs, that you create when loading the file, and then just use the IDs when querying your data structure. At the expense of possibly slower parsing initially, working with these structures afterwards should be faster, as you'd only be matching 32-bit integers rather than comparing strings.
Sketchy proof of concept for storing each string once:
#include <unordered_map>
#include <string>
#include <iostream>
using namespace std;
int string_id(const string& s) {
static unordered_map<string, int> m;
static int id = 0;
auto it = m.find(s);
if (it == m.end()) {
m[s] = ++id;
return id;
} else {
return it->second;
}
}
int main() {
// prints 1 2 2 1
cout << string_id("hello") << " ";
cout << string_id("world") << " ";
cout << string_id("world") << " ";
cout << string_id("hello") << endl;
}
The unordered_map will end up storing each string once, so you're set for memory. Depending on your matching function, you can define
struct Component {
int parameter;
int value;
};
and then your matching can be something like myComponent.parameter == string_id("some_key") or even myComponent.parameter == some_stored_string_id. If you want your strings back, you'll need a reverse mapping as well.
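The reverse mapping mentioned above could look roughly like this. StringInterner, id_of and string_of are illustrative names, not part of the original answer, and unlike string_id above the ids here start at 0 so that they can double as vector positions.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interner: string -> id and id -> string.
class StringInterner {
public:
    // Return the id of s, assigning a new one the first time s is seen.
    int id_of(const std::string& s) {
        auto it = ids_.find(s);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(strings_.size());
        ids_.emplace(s, id);
        strings_.push_back(s);
        return id;
    }

    // Reverse lookup; throws std::out_of_range for an id that was never handed out.
    const std::string& string_of(int id) const { return strings_.at(id); }

private:
    std::unordered_map<std::string, int> ids_;
    std::vector<std::string> strings_;
};
```

Note that this simple version keeps two copies of every distinct string (one as map key, one in the vector), which is still far less than one copy per occurrence in the file.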

Related

How can I write a function that checks whether a word is repeated two or more times in a vector and outputs how many times it is repeated, in C++?

First time on Stack here, I hope to learn from you guys!
My code reads a passage from a text file and adds each word to a vector. That vector is passed to a word-count function, which prints how many times each word occurs.
For example:
count per word: age = 2, belief = 1, best = 1, it = 10
However, I'm trying to come up with a function that loops over the same vector and prints the words that are repeated more than two times. In this case the word "it" is repeated more than two times.
map<string, int> get_word_count(const vector<string>& words) {
map<string, int> word_count{};
for (string word : words) {
auto search = word_count.find(word);
if (search == word_count.end()) {
word_count[word] = 1; // not found - add word with count of 1
}
else {
word_count[word] += 1; // found - increment count for word
}
}
return word_count;
}
This is the snippet of my code that counts how many times each word in the vector occurs. However, I'm struggling to figure out how to add a condition that checks whether a word is repeated twice or more. I tried adding a condition if word_count > 2 and then printing the repeated word, but it did not work. I'd appreciate any hints, thanks.
No need to check: std::map's operator[] automatically checks whether the entry exists. If not, it creates a new one with a count of 0; if it does, it returns the existing count, so incrementing works in both cases.
Simply loop over the std::map, which holds the words and their counts, and apply whatever condition you need. See the full example.
#include <iostream>
#include <map>
#include <string>
#include <vector>
int main()
{
std::vector< std::string > words{ "belief","best","it","it","it" };
std::map< std::string, int > counts;
for ( const auto& s: words )
{
counts[s]++;
}
for ( const auto& p: counts )
{
if ( p.second > 2 ) { std::cout << p.first << " repeats " << p.second << std::endl; }
}
}
Hint:
If you write auto x: y you get a copy of each element of y which is typically not what you want. If you write const auto& x: y you get a const reference to the element of your container, which is in your case much faster/efficient, as there is no need to create a copy. The compiler is maybe able to "see" that the copy is not needed and optimize it out, but it is more clear to the reader of the source code, what is intended!
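The difference can be made visible with a small type that counts its copies. This is a sketch for illustration only; Tracked and the two helper functions are made-up names.

```cpp
#include <vector>

// Small type that counts how often it is copy-constructed (illustration only).
struct Tracked {
    static int copies;
    Tracked() = default;
    Tracked(const Tracked&) { ++copies; }
};
int Tracked::copies = 0;

// Iterate by value: every element is copied into x.
int copies_by_value(const std::vector<Tracked>& v) {
    Tracked::copies = 0;
    for (auto x : v) { (void)x; }
    return Tracked::copies;
}

// Iterate by const reference: no copies are made.
int copies_by_const_ref(const std::vector<Tracked>& v) {
    Tracked::copies = 0;
    for (const auto& x : v) { (void)x; }
    return Tracked::copies;
}
```

For a vector of three elements the first function reports three copies and the second none. Because the copy constructor here has a visible side effect, the compiler is not allowed to drop those copies; for std::string the cost is an allocation and a character copy per element.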
First of all, I really suggest you explore the C++ documentation before coding.
Your code can actually be rewritten like this:
map<string, int> get_word_count(const vector<string>& words) {
map<string, int> word_count{};
for (const string& word : words) {
word_count[word] += 1;
}
return word_count;
}
This works because map::operator[] (like unordered_map::operator[]) doesn't behave like map::at (which throws a std::out_of_range exception if the key is not in the map). The difference is that operator[] returns a reference to the value at the given key, and if the key is not already in the map, it is inserted with a value-initialized value (in your case an int initialized to 0).
Operator[] on cppreference
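A short sketch contrasting the two accessors; the helper names at_throws_for_missing and increment are made up for this illustration.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Returns true if at() throws for key, i.e. the key is absent from m.
// at() never modifies the map, which is why it works on a const reference.
bool at_throws_for_missing(const std::map<std::string, int>& m, const std::string& key) {
    try {
        (void)m.at(key);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// operator[] inserts a value-initialized int (0) for a missing key,
// then the increment turns it into 1.
int increment(std::map<std::string, int>& m, const std::string& key) {
    return ++m[key];
}
```

Calling increment twice on an empty map with the same key returns 1 and then 2, and the map ends up with exactly one entry.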
In order to add the "twice or more than twice" part you can modify your code by adding the condition in the for loop.
map<string, int> get_word_count(const vector<string>& words) {
map<string, int> word_count{};
for (const string& word : words) {
auto& map_ref = word_count[word];
map_ref += 1;
if(map_ref == 2){
// Here goes your code
}
}
return word_count;
}
If you're interested in how many times each word is repeated, you have to scan the map again with a loop afterwards.

I need to create MultiMap using hash-table but I get time-limit exceeded error (C++)

I'm trying to solve an algorithm task: I need to create a MultiMap(key,(values)) using a hash table. I can't use the Set and Map libraries. I submit the code to a testing system, but I get a time-limit-exceeded error on test 20. I don't know what exactly this test contains. The code must support the following operations:
put x y - add pair (x,y).If pair exists, do nothing.
delete x y - delete pair(x,y). If pair doesn't exist, do nothing.
deleteall x - delete all pairs with first element x.
get x - print the number of pairs with first element x, followed by their second elements.
The amount of operations <= 100000
Time limit - 2s
Example:
multimap.in:
put a a
put a b
put a c
get a
delete a b
get a
deleteall a
get a
multimap.out:
3 b c a
2 c a
0
#include <iostream>
#include <fstream>
#include <vector>
using namespace std;
inline long long h1(const string& key) {
long long number = 0;
const int p = 31;
int pow = 1;
for(auto& x : key){
number += (x - 'a' + 1 ) * pow;
pow *= p;
}
return abs(number) % 1000003;
}
inline void Put(vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key, const string& value) {
int checker = 0;
for(int i = 0; i < Hash_table[hash].size();i++) {
if(Hash_table[hash][i].first == key && Hash_table[hash][i].second == value) {
checker = 1;
break;
}
}
if(checker == 0){
pair <string,string> key_value = make_pair(key,value);
Hash_table[hash].push_back(key_value);
}
}
inline void Delete(vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key, const string& value) {
for(int i = 0; i < Hash_table[hash].size();i++) {
if(Hash_table[hash][i].first == key && Hash_table[hash][i].second == value) {
Hash_table[hash].erase(Hash_table[hash].begin() + i);
break;
}
}
}
inline void Delete_All(vector<vector<pair<string,string>>>& Hash_table,const long long& hash,const string& key) {
for(int i = Hash_table[hash].size() - 1;i >= 0;i--){
if(Hash_table[hash][i].first == key){
Hash_table[hash].erase(Hash_table[hash].begin() + i);
}
}
}
inline string Get(const vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key) {
string result="";
int counter = 0;
for(int i = 0; i < Hash_table[hash].size();i++){
if(Hash_table[hash][i].first == key){
counter++;
result += Hash_table[hash][i].second + " ";
}
}
if(counter != 0)
return to_string(counter) + " " + result + "\n";
else
return "0\n";
}
int main() {
vector<vector<pair<string,string>>> Hash_table;
Hash_table.resize(1000003);
ifstream input("multimap.in");
ofstream output("multimap.out");
string command;
string key;
int k = 0;
string value;
while(true) {
input >> command;
if(input.eof())
break;
if(command == "put") {
input >> key;
long long hash = h1(key);
input >> value;
Put(Hash_table,hash,key,value);
}
if(command == "delete") {
input >> key;
input >> value;
long long hash = h1(key);
Delete(Hash_table,hash,key,value);
}
if(command == "get") {
input >> key;
long long hash = h1(key);
output << Get(Hash_table,hash,key);
}
if(command == "deleteall"){
input >> key;
long long hash = h1(key);
Delete_All(Hash_table,hash,key);
}
}
}
How can I make my code run faster?
First of all, a matter of design: normally, one would pass only the key to the function and calculate the hash inside it. Your variant allows a user to place elements anywhere within the hash table (using bad hash values), so a user could easily break it.
So e. g. put:
using HashTable = std::vector<std::vector<std::pair<std::string, std::string>>>;
void put(HashTable& table, std::string const& key, std::string const& value)
{
auto hash = h1(key);
// ...
}
If at all, the hash function could be parametrised, but then you'd write a separate class for it (wrapping the vector of vectors) and provide the hash function in the constructor, so that a user cannot exchange it arbitrarily (and again break the hash table). A class would come with additional benefits, most importantly better encapsulation (hiding the vector away, so a user cannot change it through vector's own interface):
class HashTable
{
public:
// IF you want to provide hash function:
template <typename Hash>
HashTable(Hash hash) : hash(hash) { }
void put(std::string const& key, std::string const& value);
void remove(std::string const& key, std::string const& value); //(delete is keyword!)
// ...
private:
std::vector<std::vector<std::pair<std::string, std::string>>> data;
// if hash function parametrized:
std::function<size_t(std::string)> hash; // requires #include <functional>
};
I'm not 100% sure how efficient std::function really is, so for high-performance code you would preferably use your hash function h1 directly (not implementing the constructor as illustrated above).
Coming to optimisations:
For the hash key I would prefer unsigned value: Negative indices are meaningless anyway, so why allow them at all? long long (signed or unsigned) might be a bad choice if testing system is a 32 bit system (might be unlikely, but still...). size_t covers both issues at once: it is unsigned and it is selected in size appropriately for given system (if interested in details: actually adjusted to address bus size, but on modern systems, this is equal to register size as well, which is what we need). Select type of pow to be the same.
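A sketch of what the hash could look like with size_t throughout. The name hash_key and the table_size parameter are assumptions made for this example, and the raw character value is used instead of x - 'a' + 1, so keys are not restricted to lowercase letters.

```cpp
#include <cstddef>
#include <string>

// Polynomial hash with base 31, reduced modulo table_size at every step.
// All arithmetic is unsigned, so a wrap-around is well defined and the
// result is always a valid index below table_size.
// Assumption: table_size is not zero.
std::size_t hash_key(const std::string& key, std::size_t table_size) {
    const std::size_t p = 31;
    std::size_t number = 0;
    std::size_t pow = 1;
    for (unsigned char c : key) {
        number = (number + (c % table_size) * pow) % table_size;
        pow = (pow * p) % table_size;
    }
    return number;
}
```

Reducing pow in every iteration keeps the intermediate products small, which also avoids the signed overflow that an int power variable runs into for longer keys.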
deleteAll is implemented inefficiently: With each element you erase you move all the subsequent elements one position towards front. If you delete multiple elements, you do this repeatedly, so one single element can get moved multiple times. Better:
auto pos = vector.begin();
for(auto& pair : vector)
{
if(pair.first != keyToDelete)
*pos++ = std::move(pair); // move semantics: faster than copying!
}
vector.erase(pos, vector.end());
This will move each element at most once, erasing all surplus elements in one single go. Apart from the final erasing (which you have to do explicitly then), this is more or less what std::remove and std::remove_if from the algorithm library do as well. Are you allowed to use it? Then your code might look like this:
auto condition = [&keyToDelete](std::pair<std::string, std::string> const& p)
{ return p.first == keyToDelete; };
vector.erase(std::remove_if(vector.begin(), vector.end(), condition), vector.end());
and you profit from an already highly optimised algorithm.
Just a minor performance gain, but still: You can spare variable initialisation, assignment and conditional branch (the latter one can be relatively expensive operation on some systems) within put if you simply return if an element is found:
//int checker = 0;
for(auto& pair : hashTable[hash]) // just a little more comfortable to write...
{
if(pair.first == key && pair.second == value)
return;
}
auto key_value = std::make_pair(key, value);
hashTable[hash].push_back(key_value);
Again, with algorithm library:
auto key_value = std::make_pair(key, value);
// condition must test p.first == key && p.second == value, as in the loop above
if(std::find_if(vector.begin(), vector.end(), condition) == vector.end())
{
vector.push_back(key_value);
}
Then, at most 100000 operations does not mean that each operation requires a separate key/value pair. We might expect that keys are added, removed, re-added, ..., so you most likely don't have to cope with 100000 different values. I'd assume your table is much too large (be aware that it requires initialisation of 1000003 vectors as well). I'd assume a much smaller one should suffice already (possibly 1009 or 10007? You might have to experiment a little...).
Keeping the inner vectors sorted might give you some performance boost as well:
put: You could use a binary search to find the two elements between which the new one is to be inserted (if one of these two is equal to the given one, no insertion, of course)
delete: Use binary search to find the element to delete.
deleteAll: Find upper and lower bounds for elements to be deleted and erase whole range at once.
get: find lower and upper bound as for deleteAll, distance in between (number of elements) is a simple subtraction and you could print out the texts directly (instead of first building a long string). Which of outputting directly or creating a string really is more efficient is to be found out, though, as outputting directly involves multiple system calls, which in the end might cost previously gained performance again...
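The four operations on a sorted inner vector could be sketched as below, assuming the algorithm library is allowed. Bucket, KeyLess and the function names are illustrative; each bucket is kept sorted by (key, value).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::string>;
using Bucket = std::vector<Pair>;  // kept sorted by (key, value)

// Compares a pair against a bare key, for range lookups by key only.
struct KeyLess {
    bool operator()(const Pair& p, const std::string& k) const { return p.first < k; }
    bool operator()(const std::string& k, const Pair& p) const { return k < p.first; }
};

// put: insert at the sorted position unless the pair is already present.
void put_sorted(Bucket& b, const std::string& key, const std::string& value) {
    Pair kv(key, value);
    auto it = std::lower_bound(b.begin(), b.end(), kv);
    if (it == b.end() || *it != kv) b.insert(it, std::move(kv));
}

// delete: binary search for the exact pair.
void remove_sorted(Bucket& b, const std::string& key, const std::string& value) {
    Pair kv(key, value);
    auto it = std::lower_bound(b.begin(), b.end(), kv);
    if (it != b.end() && *it == kv) b.erase(it);
}

// deleteall: erase the whole range of pairs with this key at once.
void remove_all_sorted(Bucket& b, const std::string& key) {
    auto range = std::equal_range(b.begin(), b.end(), key, KeyLess{});
    b.erase(range.first, range.second);
}

// get: the number of pairs is the distance between the two bounds.
std::size_t count_sorted(const Bucket& b, const std::string& key) {
    auto range = std::equal_range(b.begin(), b.end(), key, KeyLess{});
    return static_cast<std::size_t>(range.second - range.first);
}
```

One caveat: with sorted buckets the values of a key are printed in sorted order rather than insertion order, so this only helps if the testing system accepts any order.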
Considering your input loop:
Checking for eof() only is dangerous! If there is an error in the file, you'll end up in an endless loop: once the fail bit is set, operator>> won't read anything any more, so you never reach the end of the file. This might even be the reason for your 20th test failing.
Additionally: you have line-based input (each command on a separate line), so reading a whole line at once and parsing it only afterwards will spare you some system calls. If some argument is missing, you will detect it correctly instead of (illegally) reading the next command (e. g. put) as an argument; similarly, you won't interpret a surplus argument as the next command. If a line is invalid for whatever reason (bad number of arguments as above, or unknown command), you can then decide individually what you want to do (just ignore the line or abort processing entirely). So:
std::string line;
while(std::getline(std::cin, line))
{
// parse the string; if line is invalid, appropriate error handling
// (ignoring the line, exiting from loop, ...)
}
if(!std::cin.eof())
{
// some error occurred, print error message!
}
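The parsing step inside that loop might look like the following sketch. Command and parse_command are hypothetical names; the four command names are taken from the task description.

```cpp
#include <sstream>
#include <string>

// Hypothetical parsed form of one input line.
struct Command {
    std::string name;
    std::string key;
    std::string value;  // stays empty for get and deleteall
};

// Parse one line. Returns false if the command is unknown or the
// number of arguments does not match; out is left untouched in that case.
bool parse_command(const std::string& line, Command& out) {
    std::istringstream in(line);
    Command c;
    if (!(in >> c.name >> c.key)) return false;
    const bool needs_value = (c.name == "put" || c.name == "delete");
    if (!needs_value && c.name != "get" && c.name != "deleteall") return false;
    if (needs_value && !(in >> c.value)) return false;
    std::string extra;
    if (in >> extra) return false;  // surplus argument
    out = c;
    return true;
}
```

Inside the getline loop one would call parse_command(line, cmd) and either skip the line or stop processing when it returns false.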

C++ - Get the "difference" of 2 strings like git

I'm currently working on a project which includes a Win32 console program on my Windows 10 PC and an app for my Windows 10 Mobile Phone. It's about controlling the master and audio session volumes on my PC over the app on my Windows Phone.
The "little" problem I have right now is to get the "difference" between 2 strings.
Let's take these 2 strings for example:
std::string oldVolumes = "MASTER:50:SYSTEM:50:STEAM:100:UPLAY:100";
std::string newVolumes = "MASTER:30:SYSTEM:50:STEAM:100:ROCKETLEAGUE:80:CHROME:100";
Now I want to compare these 2 strings. Lets say I explode each string to a vector with the ":" as delimiter (I have a function named explode to cut the given string by the delimiter and write the string before into a vector).
Good enough. But as you can see, in the old string there's UPLAY with the value 100, but it's missing in the new string. Also, there are 2 new values (RocketLeague and Chrome), which are missing in the old one. But not only the "audio sessions/names" are different, the values are different too.
What I want now is, for each session that is in both strings (like master and system), to compare the values, and if the new value is different from the old one, I want to append this change to another string, like:
std::string volumeChanges = "MASTER:30"; // Cause Master is changed, System not
If there's a session in the old string, but not in the new one, I want to append:
std::string volumeChanges = "MASTER:30:REMOVE:UPLAY";
If there's a session in the new one, which is missing in the old string, I want to append it like that:
std::string volumeChanges = "MASTER:30:REMOVE:UPLAY:ADD:ROCKETLEAGUE:ROCKETLEAGUE:80:ADD:CHROME:CHROME:100";
The volumeChanges string is just to show you, what I need. I'll try to make a better one afterwards.
Do you have any ideas of how to implement such a comparison? I don't need a specific code example or something, just some ideas of how I could do that in theory. It's like GIT at least. If you make changes in a text file, you see in red the deleted text and in green the added one. Something similar to this, just with strings or vectors of strings.
Lets say I explode each string to a vector with the ":" as delimiter (I have a function named explode to cut the given string by the delimiter and write the string before into a vector).
I'd advise you to extend that logic further and separate the tokens into property objects that each hold a name + value:
struct property {
std::string name;
int32_t value; // needs #include <cstdint>
bool same_name(property const& o) const {
return name == o.name;
}
bool same_value(property const& o) const {
return value == o.value;
}
bool operator==(property const& o) const {
return same_name(o) && same_value(o);
}
bool operator<(property const& o) const {
if(!same_name(o)) return name < o.name;
else return value < o.value;
}
};
This will dramatically simplify the logic needed to work out which properties were changed/added/removed.
The logic for "tokenizing" this kind of string isn't too difficult:
std::set<property> tokenify(std::string input) {
bool finding_name = true;
property prop;
std::set<property> properties;
while (input.size() > 0) {
auto colon_index = input.find(':');
if (finding_name) {
prop.name = input.substr(0, colon_index);
finding_name = false;
}
else {
prop.value = std::stoi(input.substr(0, colon_index));
finding_name = true;
properties.insert(prop);
}
if(colon_index == std::string::npos)
break;
else
input = input.substr(colon_index + 1);
}
return properties;
}
Then, the function to get the difference:
std::string get_diff_string(std::string const& old_props, std::string const& new_props) {
std::set<property> old_properties = tokenify(old_props);
std::set<property> new_properties = tokenify(new_props);
std::string output;
//We first scan for properties that were either removed or changed
for (property const& old_property : old_properties) {
auto predicate = [&](property const& p) {
return old_property.same_name(p);
};
auto it = std::find_if(new_properties.begin(), new_properties.end(), predicate);
if (it == new_properties.end()) {
//We didn't find the property, so we need to indicate it was removed
output.append("REMOVE:" + old_property.name + ':');
}
else if (!it->same_value(old_property)) {
//Found the property, but the value changed.
output.append(it->name + ':' + std::to_string(it->value) + ':');
}
}
//Finally, we need to see which were added.
for (property const& new_property : new_properties) {
auto predicate = [&](property const& p) {
return new_property.same_name(p);
};
auto it = std::find_if(old_properties.begin(), old_properties.end(), predicate);
if (it == old_properties.end()) {
//We didn't find the property, so we need to indicate it was added
output.append("ADD:" + new_property.name + ':' + new_property.name + ':' + std::to_string(new_property.value) + ':');
}
//The previous loop detects changes, so we don't need to bother here.
}
if (output.size() > 0)
output = output.substr(0, output.size() - 1); //Trim off the last colon
return output;
}
And we can demonstrate that it's working with a simple main function:
int main() {
std::string diff_string = get_diff_string("MASTER:50:SYSTEM:50:STEAM:100:UPLAY:100", "MASTER:30:SYSTEM:50:STEAM:100:ROCKETLEAGUE:80:CHROME:100");
std::cout << "Diff String was \"" << diff_string << '\"' << std::endl;
}
Which yields an output (according to IDEONE.com):
Diff String was "MASTER:30:REMOVE:UPLAY:ADD:CHROME:CHROME:100:ADD:ROCKETLEAGUE:ROCKETLEAGUE:80"
Which, although the contents are in a slightly different order than your example, still contains all the correct information. The contents are in different order because std::set implicitly sorted the attributes by name when tokenizing the properties; if you want to disable that sorting, you'd need to use a different data structure which preserves entry order. I chose it because it eliminates duplicates, which could cause odd behavior otherwise.
In this particular instance, you could do it as follows:
Split the old and new strings by the delimiter, and store the results in a vector.
Loop over the vector with the old data. Look for each word in the vector with new data: e.g. find("MASTER").
If not found add "REMOVE:MASTER" to your results.
If found, compare the numbers and add it to the results if it has been changed.
The added string can be found by looping over the new string and searching for the words in the old string.
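The steps above can be sketched with plain vectors, which also keeps the sessions in their input order. The names Session, split_sessions, find_session and diff_sessions are made up for this example, volumes are compared as text, and the input is assumed to consist of well-formed NAME:VALUE pairs.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Session = std::pair<std::string, std::string>;  // (name, volume as text)

// Split "NAME:VALUE:NAME:VALUE..." into (name, value) pairs, keeping input order.
std::vector<Session> split_sessions(const std::string& s) {
    std::vector<Session> out;
    std::istringstream in(s);
    std::string name, value;
    while (std::getline(in, name, ':') && std::getline(in, value, ':'))
        out.emplace_back(name, value);
    return out;
}

// Linear search for a session by name; returns nullptr if it is absent.
const Session* find_session(const std::vector<Session>& v, const std::string& name) {
    for (const Session& s : v)
        if (s.first == name) return &s;
    return nullptr;
}

std::string diff_sessions(const std::string& old_str, const std::string& new_str) {
    const std::vector<Session> olds = split_sessions(old_str);
    const std::vector<Session> news = split_sessions(new_str);
    std::vector<std::string> parts;
    // Removed or changed sessions, in the order of the old string.
    for (const Session& o : olds) {
        const Session* n = find_session(news, o.first);
        if (!n) { parts.push_back("REMOVE"); parts.push_back(o.first); }
        else if (n->second != o.second) { parts.push_back(n->first); parts.push_back(n->second); }
    }
    // Added sessions, in the order of the new string.
    for (const Session& n : news) {
        if (!find_session(olds, n.first)) {
            parts.push_back("ADD"); parts.push_back(n.first);
            parts.push_back(n.first); parts.push_back(n.second);
        }
    }
    // Join the parts with ':' without a trailing delimiter.
    std::string result;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i) result += ':';
        result += parts[i];
    }
    return result;
}
```

The linear searches make this quadratic in the number of sessions, which should not matter for a handful of audio sessions.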
I suggest that you enumerate some features (in your case for example: UPLAY present, REMOVE is present, ...)
for every one of those assign a weight if the two strings differs for the given feature.
At the end sum up weights for the features presents in one string and absent in the other and get a number.
This number should represent what you are looking for.
You can adjust weights until you are satisfied with the result.
Maybe my answer will give you some new thoughts. In fact, by tweaking the current code, you can find all the missing words.
std::vector<std::string> splitString(const std::string& str, const char delim)
{
std::vector<std::string> out;
std::stringstream ss(str);
std::string s;
while (std::getline(ss, s, delim)) {
out.push_back(s);
}
return out;
}
std::vector<std::string> missingWords(const std::string& first, const std::string& second)
{
std::vector<std::string> missing;
const auto firstWords = splitString(first, ' ');
const auto secWords = splitString(second, ' ');
size_t i = 0, j = 0;
for(; i < firstWords.size();){
auto findSameWord = std::find(secWords.begin() + j, secWords.end(), firstWords[i]);
if(findSameWord == secWords.end()) {
missing.push_back(firstWords[i]);
j++;
} else {
j = distance(secWords.begin(), findSameWord);
}
i++;
}
return missing;
}

How to iterate over std::vector<char> and find null-terminated c-strings

I have three questions based on the following code fragments
I have a list of strings. It just happens to be a vector but could potentially be any source
vector<string> v1_names = boost::assign::list_of("Antigua and Barbuda")( "Brasil")( "Papua New Guinea")( "Togo");
The following is to store lengths of each name
vector<int> name_len;
the following is where I want to store the strings
std::vector<char> v2_names;
estimate memory required to copy names from v1_names
v2_names.reserve( v1_names.size()*20 + 4 );
Question: is this the best way to estimate storage? I fix the max length at 20, which is OK, then add space for the null terminators.
Now copy the names
for( std::vector<std::string>::size_type i = 0; i < v1_names.size(); ++i)
{
std::string val( v1_names[i] );
name_len.push_back(val.length());
for(std::string::iterator it = val.begin(); it != val.end(); ++it)
{
v2_names.push_back( *it );
}
v2_names.push_back('\0');
}
Question: is this the most efficient way to copy the elements from v1_name to v2_names?
Main Question: How do I iterate over v2_names and print the country names contained in v2_names
Use simple join, profit!
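#include <boost/algorithm/string/join.hpp>
#include <boost/assign/list_of.hpp>
#include <string>
#include <vector>
#include <iostream>
int main(int, char **)
{
std::vector<std::string> v1_names = boost::assign::list_of("Antigua and Barbuda")( "Brasil")( "Papua New Guinea")( "Togo");
// "\0" as a string literal is an empty C string, so build the one-character separator explicitly
std::string joined = boost::algorithm::join(v1_names, std::string(1, '\0'));
}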
To estimate storage, you should probably measure the strings, rather than rely on a hard-coded constant 20. For example:
size_t total = 0;
for (std::vector<std::string>::iterator it = v1_names.begin(); it != v1_names.end(); ++it) {
total += it->size() + 1;
}
The main inefficiency in your loop is probably that you take an extra copy of each string in turn: std::string val( v1_names[i] ); could instead be const std::string &val = v1_names[i];.
To append each string, you can use the insert function:
v2_names.insert(v2_names.end(), val.begin(), val.end());
v2_names.push_back(0);
This isn't necessarily the most efficient, since there's a certain amount of redundant checking of available space in the vector, but it shouldn't be too bad and it's simple. An alternative would be to size v2_names at the start rather than reserving space, and then copy data (with std::copy) rather than appending it. But either one of them might be faster, and it shouldn't make a lot of difference.
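The alternative mentioned above (size the buffer up front, then copy with std::copy) might look like this sketch; flatten is a made-up name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Flatten names into one buffer of null-terminated strings.
std::vector<char> flatten(const std::vector<std::string>& names) {
    // First pass: measure, one extra byte per string for the terminator.
    std::size_t total = 0;
    for (const std::string& n : names) total += n.size() + 1;

    // The vector is value-initialized, so every element starts out as '\0'.
    std::vector<char> buffer(total);
    auto out = buffer.begin();
    for (const std::string& n : names) {
        out = std::copy(n.begin(), n.end(), out);
        ++out;  // step over the terminator, which is already '\0'
    }
    return buffer;
}
```

The price of this variant is that the buffer is zero-filled before it is overwritten, so it is not obviously faster than reserve plus insert.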
For the main question, if all you have is v2_names and you want to print the strings, you could do something like this:
const char *p = &v2_names.front();
while (p <= &v2_names.back()) {
std::cout << p << "\n";
p += strlen(p) + 1;
}
If you also have name_len:
size_t offset = 0;
for (std::vector<int>::iterator it = name_len.begin(); it != name_len.end(); ++it) {
std::cout << &v2_names[offset] << "\n";
offset += *it + 1;
}
Beware that the type of name_len is technically wrong - it's not guaranteed that you can store a string length in an int. That said, even if int is smaller than size_t in a particular implementation, strings that big will still be pretty rare.
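If neither strlen nor name_len should be relied on, the buffer can also be walked with iterators. A sketch, with split_c_strings as a made-up name:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Split a buffer of null-terminated strings back into std::string objects.
// A final string without a terminator is still returned, and the search
// never reads past the end of the buffer.
std::vector<std::string> split_c_strings(const std::vector<char>& buffer) {
    std::vector<std::string> out;
    auto begin = buffer.begin();
    while (begin != buffer.end()) {
        auto end = std::find(begin, buffer.end(), '\0');
        out.emplace_back(begin, end);
        if (end == buffer.end()) break;  // last string had no terminator
        begin = end + 1;                 // continue after the '\0'
    }
    return out;
}
```

Printing the country names is then a plain loop over the returned vector. This copies the characters again, so it trades some speed for not depending on the buffer ending in '\0'.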
The best way to compute the required storage is to sum up the length of each string in v1_names.
For your second question: instead of the inner for loop, you could just use vector's iterator-range insert method with begin() and end() of the string.
For your third question: Just don't do that. Iterate over v1_names's strings instead. The only reason to ever create such a thing as v2_names is to pass it into a legacy C API and then you don't have to worry about iterating over it.
If you want to concatenate all the strings, you could just use a single pass and rely on amortized O(1) insertions:
name_len.reserve(v1_names.size());
// v2_names.reserve( ??? ); // only if you have a good heuristic or
// if you can determine this efficiently
for (auto it = v1_names.cbegin(); it != v1_names.cend(); ++it)
{
name_len.push_back(it->size());
v2_names.insert(v2_names.end(), it->c_str(), it->c_str() + it->size() + 1);
}
You could precompute the total length by another loop before this and call reserve if you think this will help. It depends on how well you know the strings. But perhaps there's no point worrying, since in the long run the insertions are O(1).

Sorting a file with 55K rows and varying Columns

I want to find a programmatic solution using C++.
I have 900 files, each 27 MB in size (just to give an idea of the scale).
Each file has 55K rows and a varying number of columns, but the header names the columns.
I want to sort the rows by the value of one column.
I wrote a sorting algorithm for this (definitely a newbie attempt, you may say).
The algorithm works for a small number of rows, but fails for larger numbers.
Here is the code:
Basic functions I defined for use in the main code:
int getNumberOfColumns(const string& aline)
{
int ncols=0;
istringstream ss(aline);
string s1;
while(ss>>s1) ncols++;
return ncols;
}
vector<string> getWordsFromSentence(const string& aline)
{
vector<string>words;
istringstream ss(aline);
string tstr;
while(ss>>tstr) words.push_back(tstr);
return words;
}
bool findColumnName(vector<string> vs, const string& colName)
{
vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
if ( it != vs.end())
return true;
else return false;
}
int getIndexForColumnName(vector<string> vs, const string& colName)
{
if ( !findColumnName(vs,colName) ) return -1;
else {
vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
return it - vs.begin();
}
}
// I like recursive functions, so I tried to write one here.
// It worked for small inputs, say 20 rows, but for 55K rows it core dumps.
void sort2D(vector<string>vn, vector<string> &srt, int columnIndex)
{
vector<double> pVals;
for ( int i = 0; i < vn.size(); i++) {
vector<string>meancols = getWordsFromSentence(vn[i]);
pVals.push_back(stringToDouble(meancols[columnIndex]));
}
srt.push_back(vn[max_element(pVals.begin(), pVals.end())-pVals.begin()]);
if (vn.size() > 1 ) {
vn.erase(vn.begin()+(max_element(pVals.begin(), pVals.end())-pVals.begin()) );
vector<string> vn2 = vn;
//cout<<srt[srt.size() -1 ]<<endl;
sort2D(vn2 , srt, columnIndex);
}
}
Now the main code:
for ( int i = 0; i < TissueNames.size() -1; i++)
{
for ( int j = i+1; j < TissueNames.size(); j++)
{
//string fname = path+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
//string fname2 = sortpath2+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+"Sorted.txt";
string fname = path+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
string fname2 = sortpath2+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+"4Columns.txt";
vector<string>AllLinesInFile;
BioInputStream fin(fname);
string aline;
getline(fin,aline);
replace (aline.begin(), aline.end(), '"',' ');
string headerline = aline;
vector<string> header = getWordsFromSentence(aline);
int pindex = getIndexForColumnName(header,"p-raw");
int xcindex = getIndexForColumnName(header,"xC");
int xeindex = getIndexForColumnName(header,"xE");
int prbindex = getIndexForColumnName(header,"X");
string newheaderline = "X\txC\txE\tp-raw";
BioOutputStream fsrt(fname2);
fsrt<<newheaderline<<endl;
int newpindex=3;
while ( getline(fin, aline) ){
replace (aline.begin(), aline.end(), '"',' ');
istringstream ss2(aline);
string tstr;
ss2>>tstr;
tstr = ss2.str().substr(tstr.length()+1);
vector<string> words = getWordsFromSentence(tstr);
string values = words[prbindex]+"\t"+words[xcindex]+"\t"+words[xeindex]+"\t"+words[pindex];
AllLinesInFile.push_back(values);
}
vector<string>SortedLines;
sort2D(AllLinesInFile, SortedLines,newpindex);
for ( int si = 0; si < SortedLines.size(); si++)
fsrt<<SortedLines[si]<<endl;
cout<<"["<<i<<","<<j<<"] = "<<SortedLines.size()<<endl;
}
}
Can someone suggest a better way of doing this?
Why does it fail for larger inputs?
The function of primary interest for this question is sort2D.
thanks for the time and patience.
prasad.
I'm not sure why your code is crashing, but recursion in that case is only going to make the code less readable. I doubt it's a stack overflow, however, because you're not using much stack space in each call.
C++ already has std::sort, why not use that instead? You could do it like this:
// functor to compare 2 strings
class CompareStringByValue : public std::binary_function<string, string, bool>
{
public:
CompareStringByValue(int columnIndex) : idx_(columnIndex) {}
bool operator()(const string& s1, const string& s2) const
{
double val1 = stringToDouble(getWordsFromSentence(s1)[idx_]);
double val2 = stringToDouble(getWordsFromSentence(s2)[idx_]);
return val1 < val2;
}
private:
int idx_;
};
To then sort your lines you would call
std::sort(vn.begin(), vn.end(), CompareStringByValue(columnIndex));
Now, there is one problem. This will be slow because stringToDouble and getWordsFromSentence are called many times on the same string (once per comparison it takes part in). You would probably want to generate a separate vector that holds the precalculated value for each string, and then have CompareStringByValue just use that vector as a lookup table.
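A minimal sketch of the lookup-table idea, assuming whitespace-separated columns as in the question. The names sortByColumn and columnValue are made up for illustration; columnValue stands in for stringToDouble(getWordsFromSentence(line)[idx]) and returns 0 when the column is missing, which is my assumption and not the question's behaviour. It also assumes no value parses to NaN.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: numeric value of the columnIndex-th whitespace-separated
// word of line, or 0.0 if the line has too few columns (an assumption).
static double columnValue(const std::string& line, int columnIndex)
{
    assert(columnIndex >= 0);
    std::istringstream ss(line);
    std::string word;
    int read = 0;
    while (read <= columnIndex && (ss >> word))
        ++read;
    return (read == columnIndex + 1) ? std::strtod(word.c_str(), nullptr) : 0.0;
}

// Sorts lines in ascending order of the given column, parsing each line once.
void sortByColumn(std::vector<std::string>& lines, int columnIndex)
{
    // Lookup table: (parsed value, original position) for every line.
    std::vector<std::pair<double, std::size_t>> keys;
    keys.reserve(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i)
        keys.emplace_back(columnValue(lines[i], columnIndex), i);

    // Pairs compare by value first, then by original position.
    std::sort(keys.begin(), keys.end());

    // Rebuild the lines in sorted order.
    std::vector<std::string> sorted;
    sorted.reserve(lines.size());
    for (const auto& key : keys)
        sorted.push_back(std::move(lines[key.second]));
    lines.swap(sorted);
}
```

This gives ascending order, matching the comparator above. For the largest-first order that sort2D produces, iterating over keys in reverse should do.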
Another way you can do this is to insert the strings into a std::multimap<double, std::string>. Just insert the entries as (value, str) and then iterate over the map, which yields them in ascending order of value. This is simpler but slower (though it has the same big-O complexity).
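The multimap variant might look roughly like this. Again, columnValue is a hypothetical stand-in for the question's stringToDouble/getWordsFromSentence pair, and sortedViaMultimap is an illustrative name.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: numeric value of the columnIndex-th whitespace-separated
// word of line, or 0.0 if the line has too few columns (an assumption).
static double columnValue(const std::string& line, int columnIndex)
{
    assert(columnIndex >= 0);
    std::istringstream ss(line);
    std::string word;
    int read = 0;
    while (read <= columnIndex && (ss >> word))
        ++read;
    return (read == columnIndex + 1) ? std::strtod(word.c_str(), nullptr) : 0.0;
}

// Returns the lines ordered by ascending column value.
std::vector<std::string> sortedViaMultimap(const std::vector<std::string>& lines,
                                           int columnIndex)
{
    // The multimap keeps its entries ordered by key as they are inserted.
    std::multimap<double, std::string> byValue;
    for (const std::string& line : lines)
        byValue.insert(std::make_pair(columnValue(line, columnIndex), line));

    // Reading the map front to back yields the lines in sorted order.
    std::vector<std::string> out;
    out.reserve(byValue.size());
    for (const auto& entry : byValue)
        out.push_back(entry.second);
    return out;
}
```

Since C++11, entries with equal keys keep their insertion order, so rows with the same value come out in file order.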
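EDIT: Cleaned up some incorrect code and derived from binary_function. (Note that std::binary_function was deprecated in C++11 and removed in C++17; the functor works just as well without that base class.)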
You could try a method that doesn't involve recursion. If your program crashes in sort2D with large inputs, then you're probably overflowing the stack (a danger of using recursion with a large number of nested calls). Try another sorting method, maybe one using a loop.
sort2D crashes because every call allocates a fresh array of strings to sort and then passes it on by value, so all N levels of the recursion hold their own copies (vn and vn2) at once, in effect using O(N^2) memory. If you really want to keep your recursive function, simply pass vn by reference and don't bother with vn2. And if you don't want to modify the original vn, move the body of sort2D into another function (say, sort2Drecursive) and call that from sort2D.
You might want to take another look at sort2D in general, since you are doing O(N^2) work for something that should take O(N+N*log(N)).
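The pass-by-reference restructuring described above could be sketched as follows. sort2Drecursive is the name suggested in this answer; columnValue is a hypothetical stand-in for the question's stringToDouble/getWordsFromSentence pair. The sketch also avoids keeping a per-call vector of parsed values alive across the recursion. It is still O(N^2) work and still recurses N levels deep, so treat it as a way to keep the recursive shape, not as the recommended fix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: numeric value of the columnIndex-th whitespace-separated
// word of line, or 0.0 if the line has too few columns (an assumption).
static double columnValue(const std::string& line, int columnIndex)
{
    assert(columnIndex >= 0);
    std::istringstream ss(line);
    std::string word;
    int read = 0;
    while (read <= columnIndex && (ss >> word))
        ++read;
    return (read == columnIndex + 1) ? std::strtod(word.c_str(), nullptr) : 0.0;
}

// Recursive worker: consumes vn in place, so no call copies the rows.
static void sort2Drecursive(std::vector<std::string>& vn,
                            std::vector<std::string>& srt, int columnIndex)
{
    if (vn.empty())
        return;

    // Find the first row holding the largest value, as max_element would.
    std::size_t best = 0;
    double bestVal = columnValue(vn[0], columnIndex);
    for (std::size_t i = 1; i < vn.size(); ++i)
    {
        const double v = columnValue(vn[i], columnIndex);
        if (v > bestVal)
        {
            bestVal = v;
            best = i;
        }
    }

    srt.push_back(vn[best]);
    vn.erase(vn.begin() + static_cast<std::ptrdiff_t>(best));
    sort2Drecursive(vn, srt, columnIndex);
}

// Entry point with the original signature: vn is copied exactly once here,
// so the caller's vector is left untouched.
void sort2D(std::vector<std::string> vn, std::vector<std::string>& srt,
            int columnIndex)
{
    sort2Drecursive(vn, srt, columnIndex);
}
```

As in the question's version, the rows come out largest value first.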
The problem is less your code than the tool you chose for the job. This is purely a text processing problem, so choose a tool good at that. In this case on Unix the best tool for the job is Bash and the GNU coreutils. On Windows you can use PowerShell, Python or Ruby. Python and Ruby will work on any Unix-flavoured machine too, but roughly all Unix machines have Bash and the coreutils installed.
Let $FILES hold the list of files to process, delimited by whitespace. Here's the code for Bash:
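for FILE in $FILES; do
echo "Processing file $FILE ..."
# keep the header (line 1) as is, then sort the remaining lines
head --lines=1 "$FILE" >"$FILE.tmp"
tail --lines=+2 "$FILE" | sort >>"$FILE.tmp"
mv "$FILE.tmp" "$FILE"
done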