how to improve natural sort program for decimals? - c++

I have std::strings whose leading section contains a number, and I need to sort on that number. The numbers can be integers or floats.
A plain vector<std::string> sort did not give the order I need, so I found the following natural sort program, which was much better. I still have a small issue with numbers smaller than one that do not sort quite right. Does anyone have a suggestion to improve it? We're using Visual Studio 2003.
The complete program follows.
TIA,
Bert
#include <list>
#include <string>
#include <iostream>
using namespace std;
class MyData
{
public:
string m_str;
MyData(string str) {
m_str = str;
}
long field1() const
{
int second = m_str.find_last_of("-");
int first = m_str.find_last_of("-", second-1);
return atol(m_str.substr(first+1, second-first-1).c_str());
}
long field2() const
{
return atol(m_str.substr(m_str.find_last_of("-")+1).c_str());
}
bool operator < (const MyData& rhs)
{
if (field1() < rhs.field1()) {
return true;
} else if (field1() > rhs.field1()) {
return false;
} else {
return field2() < rhs.field2();
}
}
};
int main()
{
// Create list
list<MyData> mylist;
mylist.push_front(MyData("93.33"));
mylist.push_front(MyData("0.18"));
mylist.push_front(MyData("485"));
mylist.push_front(MyData("7601"));
mylist.push_front(MyData("1001"));
mylist.push_front(MyData("0.26"));
mylist.push_front(MyData("0.26"));
// Sort the list
mylist.sort();
// Dump the list to check the result
for (list<MyData>::const_iterator elem = mylist.begin(); elem != mylist.end(); ++elem)
{
cout << (*elem).m_str << endl;
}
return 1;
}
GOT:
0.26
0.26
0.18
93.33
485
1001
7601
EXPECTED:
0.18
0.26
0.26
93.33
485
1001
7601

Use atof() instead of atol() to have the comparison take the fractional part of the number into account. You will also need to change the return types to doubles.
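For illustration, a minimal sketch of that change applied to the question's field1()/field2() (assuming the same parsing logic; atof lives in <cstdlib>/<stdlib.h>, just like atol):
double field1() const
{
    int second = m_str.find_last_of("-");
    int first = m_str.find_last_of("-", second - 1);
    return atof(m_str.substr(first + 1, second - first - 1).c_str());
}
double field2() const
{
    return atof(m_str.substr(m_str.find_last_of("-") + 1).c_str());
}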

If it's just float strings, I'd rather suggest creating a table with two columns (the first column contains the original string, the second column the string converted to float), sorting it by the float column and then outputting/using the sorted string column.
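A minimal sketch of that table idea (hedged: the function name and the use of std::pair are illustrative, and atof is used for the conversion as in the answer above):
#include <algorithm>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Build a (numeric value, original string) table, sort it by the numeric
// column, then read the strings back out in that order.
std::vector<std::string> naturalSort(const std::vector<std::string>& input)
{
    std::vector<std::pair<double, std::string> > table;
    for (size_t i = 0; i < input.size(); ++i)
        table.push_back(std::make_pair(atof(input[i].c_str()), input[i]));
    std::sort(table.begin(), table.end()); // pairs compare by first, then second
    std::vector<std::string> sorted;
    for (size_t i = 0; i < table.size(); ++i)
        sorted.push_back(table[i].second);
    return sorted;
}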

If the data are all numbers I would create a new class to contain the data.
It can hold the string, but it then allows you to write better methods to model the behaviour - in this case especially to implement operator <.
The implementation could also use a library that calculates to exact precision, e.g. GNU Multiple Precision (GMP); this would do the comparison and the conversion from string (or, if the numbers do not have that many significant figures, you could use doubles).

I would compute the values once and store them.
Because they are not actually part of the object's state (they are just calculated values), mark them as mutable. Then they can also be set during const methods.
Also note that MyData is a friend of itself and thus can access the private members of another object of the same class. So there is no need for the extraneous accessor methods. Remember: accessor methods are there to protect other classes from changes in the implementation, not the class you are implementing.
The problem with the ordering is that atol() only reads the integer part (i.e. it stops at the '.' character). Thus all your numbers smaller than one have a zero value for the comparison, and so they appear in an arbitrary order. To compare against the full value you need to extract it as a floating point value (double).
#include <string>
#include <boost/lexical_cast.hpp>

class MyData
{
    private:
        mutable bool   gotPos;
        mutable double f1;
        mutable double f2;
    public:
        /*
         * Why is this public?
         */
        std::string m_str;

        MyData(std::string str)
            : gotPos(false)
            , m_str(str) // Use the initializer list
        {
            // If you are always going to build f1,f2 then call buildPos()
            // here and then you don't need the test in operator <
        }
        bool operator < (const MyData& rhs) const
        {
            if (!gotPos)
            {
                buildPos();
            }
            if (!rhs.gotPos)
            {
                rhs.buildPos();
            }
            if (f1 < rhs.f1) return true;
            if (f1 > rhs.f1) return false;
            return f2 < rhs.f2;
        }
    private:
        void buildPos() const
        {
            int second = m_str.find_last_of("-");
            int first  = m_str.find_last_of("-", second - 1);
            // Use boost::lexical_cast as it handles doubles
            // as well as integers.
            f1 = boost::lexical_cast<double>(m_str.substr(first + 1, second - first - 1));
            f2 = boost::lexical_cast<double>(m_str.substr(second + 1));
            gotPos = true;
        }
};

Is mutex() needed to safely access different elements of an array with 2 threads at once?

I am working with weather data (lightning energy detected from a weather satellite). I have written a function that takes satellite data (int) and inserts it into a multidimensional array after deciding which element it needs to be placed in.
The array is :
int conus_grid[1180][520];
This has worked flawlessly, but it has taken too long to process and so I have written 2 functions that split the array so I can run 2 threads using std::thread. This is where the trouble happens... and I am doing my best to keep my examples to a minimum.
Here is my original function that accesses the array, and works fine. You can see my two loops to access the array: one being 0-1180 (x) and the other 0-520 (y) :
void writeCell(long double latitude, long double longitude, int energy)
{
double lat = latitude;
double lon = longitude;
for(int x=0;x<1180;x++)
{
for(int y=0;y<520;y++)
{
// Check every cell for one that matches current lat and lon selection, then write into that cell.
if(lon < conus_grid_west[x][y] && lon > conus_grid_east[x][y] && lat < conus_grid_north[x][y] && lat > conus_grid_south[x][y])
{
grid_used[x][y] = 1;
conus_grid[x][y] = conus_grid[x][y] + energy; // this is where it accesses the array
}
}
}
}
When I converted the code to take advantage of multithreading, I created the following functions (based on the one above, replacing it). The only difference is that they each access only one specific portion of the array. (Exactly one half each)
This first handles X... 0 to 590, and Y... 0 to 260 :
void writeCellT1(long double latitude, long double longitude, int energy)
{
double lat = latitude;
double lon = longitude;
for(int x=0;x<590;x++)
{
for(int y=0;y<260;y++)
{
// Check every cell for one that matches current lat and lon selection, then write into that cell.
if(lon < conus_grid_west[x][y] && lon > conus_grid_east[x][y] && lat < conus_grid_north[x][y] && lat > conus_grid_south[x][y])
{
grid_used[x][y] = 1;
conus_grid[x][y] = conus_grid[x][y] + energy; // this is where it accesses the array
}
}
}
}
The second handles the other half- X is 590-1180 and Y is 260-520 :
void writeCellT2(long double latitude, long double longitude, int energy)
{
double lat = latitude;
double lon = longitude;
for(int x=590;x<1180;x++)
{
for(int y=260;y<520;y++)
{
// Check every cell for one that matches current lat and lon selection, then write into that cell.
if(lon < conus_grid_west[x][y] && lon > conus_grid_east[x][y] && lat < conus_grid_north[x][y] && lat > conus_grid_south[x][y])
{
grid_used[x][y] = 1;
conus_grid[x][y] = conus_grid[x][y] + energy; // this is where it accesses the array
}
}
}
}
The program does not crash, but data is missing in the array once it completes - only part of the data is there. It's hard for me to track exactly which elements it does not write, but it is clear that with one function doing this task it works, whereas with two threads accessing the array through the two functions, the array is not filled completely.
I figured it was worth a try to use a mutex like this:
m.lock();
grid_used[x][y] = 1;
conus_grid[x][y] = conus_grid[x][y] + energy;
m.unlock();
However, this does not work either: it gives the same result, failing to write all the data to the array. Any idea as to why this would be happening? This is only my 3rd day working with std::thread, so I hope it's something simple that I overlooked in the tutorials.
Is mutex() needed to safely access different elements of an array with 2 threads at once?
If you don't write to elements that may be written to or read by another thread at the same time, you don't need a mutex.
The program does not crash but there is data that is missing in the array once it completes
As #G.M. implied, you should only split on one range (and it's X in this case); otherwise you only handle half of the cells: one thread covers one quarter of the grid and the other thread a different quarter. You should split on X because you want each thread to work on data placed as closely together in memory as possible.
Note that data in 2D arrays is stored in row-major order in memory (which is why people usually use the notation [Y][X]) but it's fine to do as you do too. Splitting on X gives one thread half the memory rows and the other thread the other half.
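For illustration, a minimal sketch of the corrected split (hedged: it assumes the same global grids as in the question and keeps the original structure; only the loop bounds change, and both halves cover the full Y range, so every cell is visited and no element is shared between the threads):
// Thread 1: first half of X, all of Y.
void writeCellT1(long double latitude, long double longitude, int energy)
{
    double lat = latitude;
    double lon = longitude;
    for (int x = 0; x < 590; x++)
        for (int y = 0; y < 520; y++)
            if (lon < conus_grid_west[x][y] && lon > conus_grid_east[x][y] &&
                lat < conus_grid_north[x][y] && lat > conus_grid_south[x][y])
            {
                grid_used[x][y] = 1;
                conus_grid[x][y] += energy;
            }
}
// Thread 2: second half of X, all of Y.
void writeCellT2(long double latitude, long double longitude, int energy)
{
    double lat = latitude;
    double lon = longitude;
    for (int x = 590; x < 1180; x++)
        for (int y = 0; y < 520; y++)
            if (lon < conus_grid_west[x][y] && lon > conus_grid_east[x][y] &&
                lat < conus_grid_north[x][y] && lat > conus_grid_south[x][y])
            {
                grid_used[x][y] = 1;
                conus_grid[x][y] += energy;
            }
}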
An alternative could be to not do the thread management yourself. C++17 added execution policies which lets you write loops where the body of the loop can be executed in different threads, usually picked from an internal thread pool. How many threads that will be used is then up to the C++ implementation and the hardware your program is executed on.
I've made an example where I've swapped X and Y and made some assumptions about the actual types you are using, for which I've created aliases.
#include <algorithm> // std::for_each
#include <array>
#include <execution> // std::execution::par
#include <iostream>
#include <memory>
#include <type_traits>
// a class to keep everything together
struct conus {
static constexpr size_t y_size = 520, x_size = 1180;
// row aliases
using conus_int_row_t = std::array<int, x_size>;
using conus_bool_row_t = std::array<bool, x_size>;
using conus_real_row_t = std::array<double, x_size>;
// 2D array aliases
using conus_grid_int_t = std::array<conus_int_row_t, y_size>;
using conus_grid_bool_t = std::array<conus_bool_row_t, y_size>;
using conus_grid_real_t = std::array<conus_real_row_t, y_size>;
// a class to store the arrays
struct conus_data_t {
conus_grid_int_t conus_grid{};
conus_grid_bool_t grid_used{};
conus_grid_real_t conus_grid_west{}, conus_grid_east{},
conus_grid_north{}, conus_grid_south{};
// an iterator to be able to loop over the row number in the arrays
class iterator {
public:
using iterator_category = std::forward_iterator_tag;
using value_type = unsigned;
using difference_type = std::make_signed_t<value_type>;
using pointer = value_type*;
using reference = value_type&;
iterator(unsigned y = 0) : current(y) {}
iterator& operator++() {
++current;
return *this;
}
bool operator!=(const iterator& rhs) const {
return current != rhs.current;
}
unsigned operator*() { return current; }
private:
unsigned current;
};
// create iterators to use in loops
iterator begin() { return {0}; }
iterator end() { return {static_cast<unsigned>(conus_grid.size())}; }
};
// create arrays on the heap to save the stack
std::unique_ptr<conus_data_t> data = std::make_unique<conus_data_t>();
void writeCell(double lat, double lon, int energy) {
// Below is the std::execution::parallel_policy in use.
// A lambda, capturing its surrounding by reference, is called for each "y".
std::for_each(std::execution::par, data->begin(), data->end(), [&](unsigned y) {
// here we're most probably in a thread from the thread pool
// references to the rows
conus_int_row_t& row_grid = data->conus_grid[y];
conus_bool_row_t& row_used = data->grid_used[y];
conus_real_row_t& row_west = data->conus_grid_west[y];
conus_real_row_t& row_east = data->conus_grid_east[y];
conus_real_row_t& row_north = data->conus_grid_north[y];
conus_real_row_t& row_south = data->conus_grid_south[y];
for(unsigned x = 0; x < x_size; ++x) {
// Check every cell for one that matches current lat
// and lon selection, then write into that cell.
if(lon < row_west[x] && lon > row_east[x] &&
lat < row_north[x] && lat > row_south[x])
{
row_used[x] = true;
// this is where it accesses the array
row_grid[x] += energy;
}
}
});
}
};
If you use g++ or clang++ on Linux, you must link with tbb (add -ltbb when linking). Other compilers may have other library demands to be able to use execution policies. Visual Studio 2019 compiles and links it out-of-the-box if you select C++17 as your language.
I've often found that using std::execution::par is a quick and semi-easy way to speed things up, but you'll have to try it out yourself to see if it becomes faster on your target machine.

I need to create MultiMap using hash-table but I get time-limit exceeded error (C++)

I'm trying to solve an algorithm task: I need to create a MultiMap(key,(values)) using a hash table. I can't use the Set and Map libraries. I send the code to the testing system, but I get a time-limit-exceeded error on test 20. I don't know what exactly this test contains. The code must do the following tasks:
put x y - add pair (x,y).If pair exists, do nothing.
delete x y - delete pair(x,y). If pair doesn't exist, do nothing.
deleteall x - delete all pairs with first element x.
get x - print the number of pairs with first element x, followed by their second elements.
The number of operations <= 100000
Time limit - 2s
Example:
multimap.in:
put a a
put a b
put a c
get a
delete a b
get a
deleteall a
get a
multimap.out:
3 b c a
2 c a
0
#include <iostream>
#include <fstream>
#include <vector>
using namespace std;
inline long long h1(const string& key) {
long long number = 0;
const int p = 31;
int pow = 1;
for(auto& x : key){
number += (x - 'a' + 1 ) * pow;
pow *= p;
}
return abs(number) % 1000003;
}
inline void Put(vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key, const string& value) {
int checker = 0;
for(int i = 0; i < Hash_table[hash].size();i++) {
if(Hash_table[hash][i].first == key && Hash_table[hash][i].second == value) {
checker = 1;
break;
}
}
if(checker == 0){
pair <string,string> key_value = make_pair(key,value);
Hash_table[hash].push_back(key_value);
}
}
inline void Delete(vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key, const string& value) {
for(int i = 0; i < Hash_table[hash].size();i++) {
if(Hash_table[hash][i].first == key && Hash_table[hash][i].second == value) {
Hash_table[hash].erase(Hash_table[hash].begin() + i);
break;
}
}
}
inline void Delete_All(vector<vector<pair<string,string>>>& Hash_table,const long long& hash,const string& key) {
for(int i = Hash_table[hash].size() - 1;i >= 0;i--){
if(Hash_table[hash][i].first == key){
Hash_table[hash].erase(Hash_table[hash].begin() + i);
}
}
}
inline string Get(const vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key) {
string result="";
int counter = 0;
for(int i = 0; i < Hash_table[hash].size();i++){
if(Hash_table[hash][i].first == key){
counter++;
result += Hash_table[hash][i].second + " ";
}
}
if(counter != 0)
return to_string(counter) + " " + result + "\n";
else
return "0\n";
}
int main() {
vector<vector<pair<string,string>>> Hash_table;
Hash_table.resize(1000003);
ifstream input("multimap.in");
ofstream output("multimap.out");
string command;
string key;
int k = 0;
string value;
while(true) {
input >> command;
if(input.eof())
break;
if(command == "put") {
input >> key;
long long hash = h1(key);
input >> value;
Put(Hash_table,hash,key,value);
}
if(command == "delete") {
input >> key;
input >> value;
long long hash = h1(key);
Delete(Hash_table,hash,key,value);
}
if(command == "get") {
input >> key;
long long hash = h1(key);
output << Get(Hash_table,hash,key);
}
if(command == "deleteall"){
input >> key;
long long hash = h1(key);
Delete_All(Hash_table,hash,key);
}
}
}
How can I make my code work faster?
First of all, a matter of design: normally, one would pass only the key to the function and calculate the hash within. Your variant allows a user to place elements anywhere within the hash table (by passing bad hash values), so a user could easily break it.
So e. g. put:
using HashTable = std::vector<std::vector<std::pair<std::string, std::string>>>;
void put(HashTable& table, std::string& key, std::string const& value)
{
auto hash = h1(key);
// ...
}
If anything, the hash function could be parametrised, but then you'd write a separate class for it (wrapping the vector of vectors) and provide the hash function in the constructor so that a user cannot exchange it arbitrarily (and again break the hash table). A class would come with additional benefits, most importantly better encapsulation (hiding the vector away, so the user cannot change it via the vector's own interface):
class HashTable
{
public:
// IF you want to provide hash function:
template <typename Hash>
HashTable(Hash hash) : hash(hash) { }
void put(std::string const& key, std::string const& value);
void remove(std::string const& key, std::string const& value); // (delete is a keyword!)
// ...
private:
std::vector<std::vector<std::pair<std::string, std::string>>> data;
// if hash function parametrized:
std::function<size_t(std::string)> hash; // needs #include <functional>
};
I'm not 100% sure how efficient std::function really is, so for high-performance code you'd preferably use your hash function h1 directly (i.e. not implement the constructor illustrated above).
Coming to optimisations:
For the hash key I would prefer an unsigned value: negative indices are meaningless anyway, so why allow them at all? long long (signed or unsigned) might be a bad choice if the testing system is a 32-bit system (unlikely, but still...). size_t covers both issues at once: it is unsigned and its width is chosen appropriately for the given system (if interested in details: it is actually adjusted to the address bus width, but on modern systems this equals the register width as well, which is what we need). Select the type of pow to be the same.
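A minimal sketch of that change, assuming the same polynomial hashing scheme and table size as in the question:
inline size_t h1(const std::string& key) {
    size_t number = 0;
    const size_t p = 31;
    size_t pow = 1;
    for (char x : key) {
        number += (x - 'a' + 1) * pow;
        pow *= p;
    }
    return number % 1000003; // unsigned: no abs() needed, overflow simply wraps
}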
deleteAll is implemented inefficiently: with each element you erase, you move all the subsequent elements one position towards the front. If you delete multiple elements, you do this repeatedly, so a single element can get moved multiple times. Better:
auto pos = vector.begin();
for(auto& pair : vector)
{
    if(pair.first != keyToDelete)
        *pos++ = std::move(pair); // move semantics: faster than copying!
}
vector.erase(pos, vector.end());
This will move each element at most once, erasing all surplus elements in one single go. Apart from the final erase (which you then have to do explicitly), this is more or less what std::remove and std::remove_if from the algorithm library do as well. Are you allowed to use them? Then your code might look like this:
auto condition = [&keyToDelete](std::pair<std::string, std::string> const& p)
{ return p.first == keyToDelete; };
vector.erase(std::remove_if(vector.begin(), vector.end(), condition), vector.end());
and you profit from an already highly optimised algorithm.
Just a minor performance gain, but still: you can spare the variable initialisation, the assignment and a conditional branch (the latter can be a relatively expensive operation on some systems) within put if you simply return as soon as an element is found:
//int checker = 0;
for(auto& pair : hashTable[hash]) // just a little more comfortable to write...
{
if(pair.first == key && pair.second == value)
return;
}
auto key_value = std::make_pair(key, value);
hashTable[hash].push_back(key_value);
Again, with the algorithm library:
auto key_value = std::make_pair(key, value);
// here the condition needs to match both key and value:
auto put_condition = [&key_value](std::pair<std::string, std::string> const& p)
                     { return p == key_value; };
if(std::find_if(vector.begin(), vector.end(), put_condition) == vector.end())
{
    vector.push_back(key_value);
}
Then, fewer than 100000 operations does not mean that each operation introduces a separate key/value pair. We can expect that keys are added, removed, re-added, ..., so you most likely don't have to cope with 100000 different keys. I'd assume your table is much too large (be aware that it requires initialisation of 1000003 vectors as well). A much smaller one should suffice already (possibly 1009 or 10007 buckets? You might have to experiment a little...).
Keeping the inner vectors sorted might give you some performance boost as well (a sketch follows this list):
put: You could use a binary search to find the two elements in between a new one is to be inserted (if one of these two is equal to given one, no insertion, of course)
delete: Use binary search to find the element to delete.
deleteAll: Find upper and lower bounds for elements to be deleted and erase whole range at once.
get: find the lower and upper bound as for deleteAll; the distance in between (the number of elements) is a simple subtraction, and you could print out the texts directly (instead of first building a long string). Whether outputting directly or building a string is really more efficient remains to be measured, though, as outputting directly involves multiple system calls, which in the end might cost the previously gained performance again...
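A minimal sketch of the sorted-bucket idea (hedged: the names are illustrative, and each bucket is assumed to be a std::vector<std::pair<std::string, std::string>> that is kept sorted at all times):
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Bucket = std::vector<std::pair<std::string, std::string>>;

// put: binary search for the insertion point; do nothing if the pair exists.
void putSorted(Bucket& bucket, const std::string& key, const std::string& value) {
    auto kv = std::make_pair(key, value);
    auto it = std::lower_bound(bucket.begin(), bucket.end(), kv);
    if (it == bucket.end() || *it != kv)
        bucket.insert(it, kv);
}

// deleteall: erase the whole contiguous range of pairs with the given key.
void deleteAllSorted(Bucket& bucket, const std::string& key) {
    auto lo = std::lower_bound(bucket.begin(), bucket.end(), key,
        [](const std::pair<std::string, std::string>& p, const std::string& k) { return p.first < k; });
    auto hi = std::upper_bound(bucket.begin(), bucket.end(), key,
        [](const std::string& k, const std::pair<std::string, std::string>& p) { return k < p.first; });
    bucket.erase(lo, hi);
}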
Considering your input loop:
Checking for eof() (only) is problematic! If there is an error in the file, you'll end up in an endless loop: the fail bit gets set, operator>> won't actually read anything any more, and you will never reach the end of the file. This might even be the reason for your 20th test failing.
Additionally: you have line-based input (each command on a separate line), so reading a whole line at once and only parsing it afterwards will spare you some system calls. If an argument is missing, you will detect it correctly instead of (illegally) reading the next command (e.g. put) as the argument; similarly, you won't interpret a surplus argument as the next command. If a line is invalid for whatever reason (wrong number of arguments as above, or an unknown command), you can then decide individually what to do (just ignore the line, or abort processing entirely). So:
std::string line;
while(std::getline(std::cin, line))
{
// parse the string; if line is invalid, appropriate error handling
// (ignoring the line, exiting from loop, ...)
}
if(!std::cin.eof())
{
// some error occured, print error message!
}
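A minimal sketch of parsing one such line with std::istringstream (hedged: validation only, the commands are the ones from the question, and the caller is assumed to dispatch on the extracted strings):
#include <sstream>
#include <string>

// Returns true and fills command/key/value if the line is well formed.
bool parseLine(const std::string& line, std::string& command,
               std::string& key, std::string& value)
{
    std::istringstream in(line);
    if (!(in >> command >> key))
        return false;                          // empty or truncated line
    if (command == "put" || command == "delete")
        return static_cast<bool>(in >> value); // these take a second argument
    return command == "get" || command == "deleteall";
}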

fast way to compare two vector containing strings

I have a vector of strings that I pass to my function, and I need to compare it with some pre-defined values. What is the fastest way to do this?
The following code snippet shows what I need to do (This is how I am doing it, but what is the fastest way of doing this):
bool compare(vector<string> input1, vector<string> input2)
{
    if(input1.size() != input2.size())
    {
        return false;
    }
    for(int i = 0; i < input1.size(); i++)
    {
        if(input1[i] != input2[i])
        {
            return false;
        }
    }
    return true;
}
int compare(vector<string> inputData)
{
if (compare(inputData,{"Apple","Orange","three"}))
{
return 129;
}
if (compare(inputData,{"A","B","CCC"}))
{
return 189;
}
if (compare(inputData,{"s","O","quick"}))
{
return 126;
}
if (compare(inputData,{"Apple","O123","three","four","five","six"}))
{
return 876;
}
if (compare(inputData,{"Apple","iuyt","asde","qwe","asdr"}))
{
return 234;
}
return 0;
}
Edit1
Can I compare two vectors like this:
if(inputData=={"Apple","Orange","three"})
{
return 129;
}
You are asking what is the fastest way to do this, and you are indicating that you are comparing against a set of fixed and known strings. I would argue that you would probably have to implement it as a kind of state machine. Not that this is very beautiful...
if (inputData.size() != 3) return 0;
if (inputData[0].size() == 0) return 0;
const char inputData_0_0 = inputData[0][0];
if (inputData_0_0 == 'A') {
// possibly "Apple" or "A"
...
} else if (inputData_0_0 == 's') {
// possibly "s"
...
} else {
return 0;
}
The weakness of your approach is its linearity. You want a binary search for teh speedz.
By utilising the sortedness of a map, the binaryness of finding in one, and the fact that equivalence between vectors is already defined for you (no need for that first compare function!), you can do this quite easily:
std::map<std::vector<std::string>, int> lookup{
{{"Apple","Orange","three"}, 129},
{{"A","B","CCC"}, 189},
// ...
};
int compare(const std::vector<std::string>& inputData)
{
auto it = lookup.find(inputData);
if (it != lookup.end())
return it->second;
else
return 0;
}
Note also the reference passing for extra teh speedz.
(I haven't tested this for exact syntax-correctness, but you get the idea.)
However! As always, we need to be context-aware in our designs. This sort of approach is more useful at larger scale. At the moment you only have a few options, so the addition of some dynamic allocation and sorting and all that jazz may actually slow things down. Ultimately, you will want to take my solution, and your solution, and measure the results for typical inputs and whatnot.
Once you've done that, if you still need more speed for some reason, consider looking at ways to reduce the dynamic allocations inherent in both the vectors and the strings themselves.
To answer your follow-up question: almost; you do need to specify the type:
// new code is here
// ||||||||||||||||||||||||
if (inputData == std::vector<std::string>{"Apple","Orange","three"})
{
return 129;
}
As explored above, though, let std::map::find do this for you instead. It's better at it.
One key to efficiency is eliminating needless allocation.
Thus, it becomes:
bool compare(
std::vector<std::string> const& a,
std::initializer_list<const char*> b
) noexcept {
return std::equal(begin(a), end(a), begin(b), end(b));
}
Alternatively, make them static const, and accept the slight overhead.
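A minimal sketch of that static-const alternative (hedged: the values are copied from the question, and only the first two cases are shown):
#include <string>
#include <vector>

int compare(const std::vector<std::string>& inputData)
{
    // Built once on first use, then reused on every call.
    static const std::vector<std::string> case129{"Apple", "Orange", "three"};
    static const std::vector<std::string> case189{"A", "B", "CCC"};

    if (inputData == case129) return 129;
    if (inputData == case189) return 189;
    // ... remaining cases follow the same pattern
    return 0;
}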
As an aside, using C++17 std::string_view (or boost::string_view before that) and C++20 std::span (or span from the Guideline Support Library (GSL)) also allows a nicer alternative:
bool compare(std::span<std::string> a, std::span<std::string_view> b) noexcept {
    return std::equal(begin(a), end(a), begin(b), end(b));
}
The other is minimizing the number of comparisons. You can either use hashing, binary search, or manual ordering of comparisons.
Unfortunately, transparent comparators are a C++14 thing, so you cannot use std::map.
If you want a fast way to do it where the vectors to compare to are not known in advance, but are reused so can have a little initial run-time overhead, you can build a tree structure similar to the compile time version Dirk Herrmann has. This will run in O(n) by just iterating over the input and following a tree.
In the simplest case, you might build a tree for each letter/element. A partial implementation could be:
typedef std::vector<std::string> Vector;
typedef Vector::const_iterator Iterator;
typedef std::string::const_iterator StrIterator;
struct Node
{
std::unique_ptr<Node> children[256];
std::unique_ptr<Node> new_str_child;
int result;
bool is_result;
};
Node root;
int compare(Iterator vec_it, Iterator vec_end, StrIterator str_it, StrIterator str_end, const Node *node);
int compare(const Vector &input)
{
return compare(input.begin(), input.end(), input.front().begin(), input.front().end(), &root);
}
int compare(Iterator vec_it, Iterator vec_end, StrIterator str_it, StrIterator str_end, const Node *node)
{
if (str_it != str_end)
{
// Check next character
auto next_child = node->children[(unsigned char)*str_it].get();
if (next_child)
return compare(vec_it, vec_end, str_it + 1, str_end, next_child);
else return -1; // No string matched
}
// At end of input string
++vec_it;
if (vec_it != vec_end)
{
auto next_child = node->new_str_child.get();
if (next_child)
return compare(vec_it, vec_end, vec_it->begin(), vec_it->end(), next_child);
else return -1; // Have another string, but not in tree
}
// At end of input vector
if (node->is_result)
return node->result; // Got a match
else return -1; // Run out of input, but all possible matches were longer
}
Which can also be done without recursion. For use cases like yours you will find most nodes only have a single success value, so you can collapse those into prefix substrings, to use the OP example:
"A"
|-"pple" - new vector - "O" - "range" - new vector - "three" - ret 129
| |- "i" - "uyt" - new vector - "asde" ... - ret 234
| |- "0" - "123" - new vector - "three" ... - ret 876
|- new vector "B" - new vector - "CCC" - ret 189
"s" - new vector "O" - new vector "quick" - ret 126
You could make use of the std::equal function like below:
bool compare(vector<string> input1, vector<string> input2)
{
    if(input1.size() != input2.size())
    {
        return false;
    }
    return std::equal(input1.begin(), input1.end(), input2.begin());
}
Can I compare two vectors like this?
The answer is no: you need to compare a vector with another vector, like this:
vector<string>data = {"ab", "cd", "ef"};
if(data == vector<string>{"ab", "cd", "efg"})
cout << "Equal" << endl;
else
cout << "Not Equal" << endl;
What is the fastest way to do this?
I'm not an expert on asymptotic analysis, but:
Using the equality operator (==) you have a shortcut to compare two vectors: it first checks the sizes and then compares the elements one by one. That is linear in the size of the vector (O(n), where n is the number of elements), but each element is a string that has to be compared in turn, which is in general another linear operation (O(m), where m is the length of the string).
Supposing each string has the same length m and the vector has size n, each comparison behaves like O(n*m).
So:
if you want a shortcut to compare two vectors, you can use the equality
operator;
if you want the comparison itself to be as fast as possible, you should look at dedicated string-comparison algorithms.

C++ double sorting data with multiple elements

I have multiple data entries that contain the following information:
id_number
name1
date
name2
It is possible to put this into a struct like this:
struct entry {
    int id_number;
    string name1;
    int date;
    string name2;
};
In my data, I have many such entries and I would like to sort. First, I want to sort alphabetically based on name1, then sort by date. However, the sort by date is a subset of the alphabetical sort, e.g. if I have two entries with the same name1, I then want to order those entries by date. Furthermore, when I sort, I want the elements of the entry to remain together, so all four values go together.
My questions are the following:
1) What type of data structure should I use to hold this data so I can keep the set of four elements together when I sort any by any one of them?
2) What is the quickest way to do this sorting (in terms of the amount of time it takes to write the code)? Ideally, I want to use something like std::sort from <algorithm> since it is already built in.
3) Does STL have some built in data structure that can handle the double sorting I described efficiently?
The struct you have is fine, except that you may want to add an overload of operator< to do comparison. Here I'm doing the "compare by name, then date" comparison:
// Add this as a member function to `entry`.
bool operator<(entry const &other) const {
if (name1 < other.name1)
return true;
if (name1 > other.name1)
return false;
// otherwise name1 == other.name1
// so we now fall through to use the next comparator.
if (date < other.date)
return true;
return false;
}
[Edit: What's required is called a "strict weak ordering". If you want to get into detail about what the means, and what alternatives are possible, Dave Abrahams wrote quite a detailed post on C++ Next about it.
In the case above, we start by comparing the name1 fields of the two. If a<b, then we immediately return true. Otherwise, we check for a>b, and if so we return false. At that point, we've eliminated a<b and a>b, so we've determined that a==b, in which case we test the dates -- if a<b, we return true. Otherwise, we return false -- either the dates are equal, or b>a, either of which means the test for a<b is false. If the sort needs to sort out (no pun intended) which of those is the case, it can call the function again with the arguments swapped. The names will still be equal, so it'll still come down to the dates -- if we get false, the dates are equal. If we get true on the swapped dates, then what started as the second date is actually greater. ]
The operator< you define in the structure defines the order that will be used by default. When/if you want you can specify another order for the sorting to use:
struct byid {
    bool operator()(entry const &a, entry const &b) const {
        return a.id_number < b.id_number;
    }
};
std::vector<entry> entries;
// sort by name, then date
std::sort(entries.begin(), entries.end());
// sort by ID
std::sort(entries.begin(), entries.end(), byid());
That data structure right there should work just fine. What you should do is overload the less-than operator; then you could just insert them all into a map and they would be sorted. Here is more info on the comparison operators for a map.
Update: upon further reflection, I would use a set, and not a map, because there is no need for a value (a set variant is sketched after the program below). But here is proof it still works.
Proof this works:
#include<string>
#include<map>
#include<stdio.h>
#include <sstream>
using namespace std;
struct entry {
int m_id_number;
string m_name1;
int m_date;
string m_name2;
entry( int id_number, string name1, int date, string name2) :
m_id_number(id_number),
m_name1(name1),
m_date(date),
m_name2(name2)
{
}
// Add this as a member function to `entry`.
// Compare by name1, then name2, then date (a strict weak ordering).
bool operator<(entry const &other) const {
    if (m_name1 != other.m_name1)
        return m_name1 < other.m_name1;
    if (m_name2 != other.m_name2)
        return m_name2 < other.m_name2;
    return m_date < other.m_date;
}
string toString() const
{
string returnValue;
stringstream out;
string dateAsString;
out << m_date;
dateAsString = out.str();
returnValue = m_name1 + " " + m_name2 + " " + dateAsString;
return returnValue;
}
};
int main(int argc, char *argv[])
{
string names1[] = {"Dave", "John", "Mark", "Chris", "Todd"};
string names2[] = {"A", "B", "C", "D", "E", "F", "G"};
std::map<entry, int> mymap;
for(int x = 0; x < 100; ++x)
{
mymap.insert(pair<entry, int>(entry(0, names1[x%5], x, names2[x%7]), 0));
}
std::map<entry, int>::iterator it = mymap.begin();
for(; it != mymap.end() ;++it)
{
printf("%s\n ", it->first.toString().c_str());
}
return 0;
}
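As mentioned in the update above, a set avoids the unused mapped value. A minimal sketch of that variant (hedged: it reuses the same entry struct, constructor, operator< and toString() from the program above; the inserted values are made up):
#include <set>
int mainWithSet()
{
    std::set<entry> entries;
    entries.insert(entry(0, "Chris", 2, "B"));
    entries.insert(entry(1, "Chris", 1, "A"));
    entries.insert(entry(2, "Dave", 3, "C"));
    // Iteration visits the entries in sorted order (name1, name2, date).
    for (std::set<entry>::const_iterator it = entries.begin(); it != entries.end(); ++it)
        printf("%s\n", it->toString().c_str());
    return 0;
}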
Actually you can use a function object to implement your sorting criteria.
Suppose that you would like to store the entries in a set:
//EntrySortCriteria.h
class EntrySortCriteria
{
public:
    bool operator()(const entry &e1, const entry &e2) const
    {
        return e1.name1 < e2.name1 ||
               (!(e2.name1 < e1.name1) && e1.date < e2.date);
    }
};
//main.cc
#include <iostream>
#include <set>
#include "EntrySortCriteria.h" // the entry struct must also be visible here
using namespace std;
int main(int argc, char **argv)
{
    set<entry, EntrySortCriteria> entrySet;
    //then you can put entries into this set,
    //they will be sorted automatically according to your criteria
    //syntax of set:
    //entrySet.insert(newEntry);
    //where newEntry is an object of your entry type
}

What is the best way to create a sparse array in C++?

I am working on a project that requires the manipulation of enormous matrices, specifically pyramidal summation for a copula calculation.
In short, I need to keep track of a relatively small number of values (usually a value of 1, and in rare cases more than 1) in a sea of zeros in the matrix (multidimensional array).
A sparse array allows the user to store a small number of values, and assume all undefined records to be a preset value. Since it is not physically possible to store all values in memory, I need to store only the few non-zero elements. This could be several million entries.
Speed is a huge priority, and I would also like to dynamically choose the number of variables in the class at runtime.
I currently work on a system that uses a binary search tree (b-tree) to store entries. Does anyone know of a better system?
For C++, a map works well. Several million objects won't be a problem. 10 million items took about 4.4 seconds and about 57 meg on my computer.
My test application is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <map>
class triple {
public:
int x;
int y;
int z;
bool operator<(const triple &other) const {
if (x < other.x) return true;
if (other.x < x) return false;
if (y < other.y) return true;
if (other.y < y) return false;
return z < other.z;
}
};
int main(int, char**)
{
std::map<triple,int> data;
triple point;
int i;
for (i = 0; i < 10000000; ++i) {
point.x = rand();
point.y = rand();
point.z = rand();
//printf("%d %d %d %d\n", i, point.x, point.y, point.z);
data[point] = i;
}
return 0;
}
Now, to dynamically choose the number of variables, the easiest solution is to represent the index as a string, and then use the string as a key for the map. For instance, an item located at [23][55] can be represented via the string "23,55". We can also extend this solution for higher dimensions; for three dimensions an arbitrary index will look like "34,45,56". A simple implementation of this technique is as follows:
std::map<string, int> data;
char ix[100];
sprintf(ix, "%d,%d", x, y); // 2 vars
data[ix] = i;
sprintf(ix, "%d,%d,%d", x, y, z); // 3 vars
data[ix] = i;
The accepted answer recommends using strings to represent multi-dimensional indices.
However, constructing strings is needlessly wasteful for this. If the size isn’t known at compile time (and thus std::tuple doesn’t work), std::vector works well as an index, both with hash maps and ordered trees. For std::map, this is almost trivial:
#include <vector>
#include <map>
using index_type = std::vector<int>;
template <typename T>
using sparse_array = std::map<index_type, T>;
For std::unordered_map (or similar hash table-based dictionaries) it’s slightly more work, since std::vector does not specialise std::hash:
#include <vector>
#include <unordered_map>
#include <numeric>
using index_type = std::vector<int>;
struct index_hash {
std::size_t operator()(index_type const& i) const noexcept {
// Like boost::hash_combine; there might be some caveats, see
// <https://stackoverflow.com/a/50978188/1968>
auto const hash_combine = [](auto seed, auto x) {
return std::hash<int>()(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
};
return std::accumulate(i.begin() + 1, i.end(), i[0], hash_combine);
}
};
template <typename T>
using sparse_array = std::unordered_map<index_type, T, index_hash>;
Either way, the usage is the same:
int main() {
using i = index_type;
auto x = sparse_array<int>();
x[i{1, 2, 3}] = 42;
x[i{4, 3, 2}] = 23;
std::cout << x[i{1, 2, 3}] + x[i{4, 3, 2}] << '\n'; // 65
}
Boost has a templated implementation of BLAS called uBLAS that contains a sparse matrix.
https://www.boost.org/doc/libs/release/libs/numeric/ublas/doc/index.htm
Eigen is a C++ linear algebra library that has an implementation of a sparse matrix. It even supports matrix operations and solvers (LU factorization etc) that are optimized for sparse matrices.
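For illustration, a minimal sketch of building a sparse matrix with Eigen (hedged: assumes Eigen 3 and its <Eigen/Sparse> module; the sizes and values are made up):
#include <vector>
#include <Eigen/Sparse>

int main() {
    // Collect the few non-zero entries as (row, column, value) triplets...
    std::vector<Eigen::Triplet<double>> triplets;
    triplets.emplace_back(3, 7, 1.0);
    triplets.emplace_back(100, 2000, 1.0);

    // ...then build a compressed sparse matrix from them in one go.
    Eigen::SparseMatrix<double> m(10000, 10000);
    m.setFromTriplets(triplets.begin(), triplets.end());

    double v = m.coeff(3, 7); // read access that does not insert a new entry
    return v == 1.0 ? 0 : 1;
}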
A complete list of storage formats can be found on Wikipedia. For convenience, I have quoted the relevant sections below.
https://en.wikipedia.org/wiki/Sparse_matrix#Dictionary_of_keys_.28DOK.29
Dictionary of keys (DOK)
DOK consists of a dictionary that maps (row, column)-pairs to the
value of the elements. Elements that are missing from the dictionary
are taken to be zero. The format is good for incrementally
constructing a sparse matrix in random order, but poor for iterating
over non-zero values in lexicographical order. One typically
constructs a matrix in this format and then converts to another more
efficient format for processing.[1]
List of lists (LIL)
LIL stores one list per row, with each entry containing the column
index and the value. Typically, these entries are kept sorted by
column index for faster lookup. This is another format good for
incremental matrix construction.[2]
Coordinate list (COO)
COO stores a list of (row, column, value) tuples. Ideally, the entries
are sorted (by row index, then column index) to improve random access
times. This is another format which is good for incremental matrix
construction.[3]
Compressed sparse row (CSR, CRS or Yale format)
The compressed sparse row (CSR) or compressed row storage (CRS) format
represents a matrix M by three (one-dimensional) arrays, that
respectively contain nonzero values, the extents of rows, and column
indices. It is similar to COO, but compresses the row indices, hence
the name. This format allows fast row access and matrix-vector
multiplications (Mx).
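To make the CSR description above concrete, here is a small hand-written sketch (hedged: an illustrative toy, not one of the library formats mentioned elsewhere in this thread):
#include <vector>

// Compressed sparse row storage: one value and one column index per non-zero
// element, plus row_start[r] .. row_start[r+1] delimiting row r's entries.
struct CsrMatrix {
    std::vector<double> values;
    std::vector<int>    col_index;
    std::vector<int>    row_start; // size = number of rows + 1

    double at(int row, int col) const {
        for (int k = row_start[row]; k < row_start[row + 1]; ++k)
            if (col_index[k] == col)
                return values[k];
        return 0.0;                // missing entries are implicitly zero
    }
};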
Small detail in the index comparison. You need to do a lexicographical compare, otherwise:
a= (1, 2, 1); b= (2, 1, 2);
(a<b) == (b<a) is true, but b!=a
Edit: So the comparison should probably be:
return lhs.x<rhs.x
? true
: lhs.x==rhs.x
? lhs.y<rhs.y
? true
: lhs.y==rhs.y
? lhs.z<rhs.z
: false
: false
Hash tables have a fast insertion and look up. You could write a simple hash function since you know you'd be dealing with only integer pairs as the keys.
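A minimal sketch of that suggestion (hedged: an illustrative hash for integer-pair keys in a std::unordered_map; the packing scheme is arbitrary):
#include <cstdint>
#include <unordered_map>
#include <utility>

struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        // Pack the two 32-bit indices into one 64-bit value and hash that.
        std::uint64_t packed = (std::uint64_t(std::uint32_t(p.first)) << 32)
                             | std::uint32_t(p.second);
        return std::hash<std::uint64_t>()(packed);
    }
};

// Usage: grid[{row, col}] = value; keys that are absent count as zero.
using SparseGrid = std::unordered_map<std::pair<int, int>, double, PairHash>;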
The best way to implement sparse matrices is not to implement them - at least not on your own. I would suggest BLAS (which I think is part of LAPACK), which can handle really huge matrices.
Since only the values at [a][b][c]...[w][x][y][z] are of consequence, we only store the indices themselves, not the value 1 which is just about everywhere - it is always the same, and there is nothing to gain from hashing it. Noting that the curse of dimensionality is present, I suggest going with some established tool such as NIST or Boost; at the least, read their sources to circumvent needless blunders.
If the work needs to capture the temporal dependence distributions and parametric tendencies of unknown data sets, then a Map or B-Tree with a uni-valued root is probably not practical. We can store only the indices themselves, hashed if ordering (sensibility for presentation) can be subordinated to reducing run time, for all the 1 values. Since non-zero values other than one are few, an obvious candidate for those is whatever data structure you can find readily and understand. If the data set is truly vast-universe sized, I suggest some sort of sliding window that manages the file/disk/persistent I/O yourself, moving portions of the data into scope as needed (writing code that you can understand). If you are committed to providing an actual solution to a working group, failing to do so leaves you at the mercy of consumer-grade operating systems whose sole goal is to take your lunch away from you.
Here is a relatively simple implementation that should provide reasonably fast lookup (using a hash table) as well as fast iteration over the non-zero elements in a row/column.
// Copyright 2014 Leo Osvald
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_
#define UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_
#include <algorithm>
#include <limits>
#include <map>
#include <type_traits>
#include <unordered_map>
#include <utility>
#include <vector>
// A simple time-efficient implementation of an immutable sparse matrix
// Provides efficient iteration of non-zero elements by rows/cols,
// e.g. to iterate over a range [row_from, row_to) x [col_from, col_to):
// for (int row = row_from; row < row_to; ++row) {
// for (auto col_range = sm.nonzero_col_range(row, col_from, col_to);
// col_range.first != col_range.second; ++col_range.first) {
// int col = *col_range.first;
// // use sm(row, col)
// ...
// }
template<typename T = double, class Coord = int>
class SparseMatrix {
struct PointHasher;
typedef std::map< Coord, std::vector<Coord> > NonZeroList;
typedef std::pair<Coord, Coord> Point;
public:
typedef T ValueType;
typedef Coord CoordType;
typedef typename NonZeroList::mapped_type::const_iterator CoordIter;
typedef std::pair<CoordIter, CoordIter> CoordIterRange;
SparseMatrix() = default;
// Reads a matrix stored in MatrixMarket-like format, i.e.:
// <num_rows> <num_cols> <num_entries>
// <row_1> <col_1> <val_1>
// ...
// Note: the header (lines starting with '%') is ignored.
template<class InputStream, size_t max_line_length = 1024>
void Init(InputStream& is) {
rows_.clear(), cols_.clear();
values_.clear();
// skip the header (lines beginning with '%', if any)
decltype(is.tellg()) offset = 0;
for (char buf[max_line_length + 1];
is.getline(buf, sizeof(buf)) && buf[0] == '%'; )
offset = is.tellg();
is.seekg(offset);
size_t n;
is >> row_count_ >> col_count_ >> n;
values_.reserve(n);
while (n--) {
Coord row, col;
typename std::remove_cv<T>::type val;
is >> row >> col >> val;
values_[Point(--row, --col)] = val;
rows_[col].push_back(row);
cols_[row].push_back(col);
}
SortAndShrink(rows_);
SortAndShrink(cols_);
}
const T& operator()(const Coord& row, const Coord& col) const {
static const T kZero = T();
auto it = values_.find(Point(row, col));
if (it != values_.end())
return it->second;
return kZero;
}
CoordIterRange
nonzero_col_range(Coord row, Coord col_from, Coord col_to) const {
CoordIterRange r;
GetRange(cols_, row, col_from, col_to, &r);
return r;
}
CoordIterRange
nonzero_row_range(Coord col, Coord row_from, Coord row_to) const {
CoordIterRange r;
GetRange(rows_, col, row_from, row_to, &r);
return r;
}
Coord row_count() const { return row_count_; }
Coord col_count() const { return col_count_; }
size_t nonzero_count() const { return values_.size(); }
size_t element_count() const { return size_t(row_count_) * col_count_; }
private:
typedef std::unordered_map<Point,
typename std::remove_cv<T>::type,
PointHasher> ValueMap;
struct PointHasher {
size_t operator()(const Point& p) const {
return p.first << (std::numeric_limits<Coord>::digits >> 1) ^ p.second;
}
};
static void SortAndShrink(NonZeroList& list) {
for (auto& it : list) {
auto& indices = it.second;
indices.shrink_to_fit();
std::sort(indices.begin(), indices.end());
}
// insert a sentinel vector to handle the case of all zeroes
if (list.empty())
list.emplace(Coord(), std::vector<Coord>(Coord()));
}
static void GetRange(const NonZeroList& list, Coord i, Coord from, Coord to,
CoordIterRange* r) {
auto lr = list.equal_range(i);
if (lr.first == lr.second) {
r->first = r->second = list.begin()->second.end();
return;
}
auto begin = lr.first->second.begin(), end = lr.first->second.end();
r->first = lower_bound(begin, end, from);
r->second = lower_bound(r->first, end, to);
}
ValueMap values_;
NonZeroList rows_, cols_;
Coord row_count_, col_count_;
};
#endif /* UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_ */
For simplicity, it's immutable, but you can make it mutable; be sure to change std::vector to std::set if you want reasonably efficient "insertions" (changing a zero to a non-zero).
I would suggest doing something like:
typedef std::tuple<int, int, int> coord_t;
typedef boost::hash<coord_t> coord_hash_t;
typedef std::unordered_map<coord_t, int, coord_hash_t> sparse_array_t;
sparse_array_t the_data;
the_data[ { x, y, z } ] = 1; /* list-initialization is cool */
for( const auto& element : the_data ) {
    int xx, yy, zz;
    std::tie( xx, yy, zz ) = element.first;
    int val = element.second;
    /* ... */
}
To help keep your data sparse, you might want to write a subclass of unordered_map whose iterators automatically skip over (and erase) any items with a value of 0.