Fastest way to sort a data structure in C++

I have a data structure that consists of three int values that represent a coordinate, and a double that represents a value at that coordinate. I would like to store them together, and sort them on value. Values are not unique. Right now, I have them in a struct and I sort them using a lambda, as is shown in the code below. As this is a piece of performance-critical code, I am looking for an implementation that gives the fastest sorting. The list will contain 10^6 to 10^7 elements.
What is the most elegant way to solve this? I am currently using std::sort, but I am mostly asking whether storing the data in a struct is the best solution, or whether there are better alternatives.
#include <vector>
#include <algorithm>
#include <iostream>
struct Data
{
int i;
int j;
int k;
double d;
};
int main()
{
std::vector<Data> v;
v.push_back({1,2,3,0.6});
v.push_back({1,2,3,0.2});
v.push_back({1,2,3,0.5});
v.push_back({1,2,3,0.1});
v.push_back({1,2,3,0.4});
std::sort(v.begin(), v.end(), [](const Data& a, const Data& b)
{ return a.d < b.d; });
for (auto d : v)
std::cout << d.i << ", " << d.j << ", "
<< d.k << ", " << d.d << std::endl;
return 0;
}

The fastest way to sort them is to not have to sort them.
At the expense of some slightly slower insertion, you could store your entire container sorted, and insert only in the correct place. A std::set could help you here, or you could roll your own.
edit: A std::multiset would provide the same advantages if you need to allow values that compare equal.
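A minimal sketch of the always-sorted container idea, assuming the Data struct from the question. The names ByValue, SortedData and valuesInOrder are hypothetical and only serve the illustration. Whether a node-based std::multiset actually beats one std::sort over a std::vector at 10^6 to 10^7 elements is something to measure rather than assume.

```cpp
#include <set>
#include <vector>

// Same layout as the struct in the question.
struct Data
{
    int i;
    int j;
    int k;
    double d;
};

// Orders elements by value only; the coordinates are ignored.
struct ByValue
{
    bool operator()(const Data& a, const Data& b) const { return a.d < b.d; }
};

// Hypothetical alias: stays sorted on d and accepts duplicate values.
using SortedData = std::multiset<Data, ByValue>;

// Hypothetical helper: copies the values out in iteration (ascending) order.
std::vector<double> valuesInOrder(const SortedData& s)
{
    std::vector<double> out;
    out.reserve(s.size());
    for (const Data& e : s)
        out.push_back(e.d);
    return out;
}
```

Each insert costs O(log n), and iterating the container afterwards visits the elements in ascending order of d without any separate sorting step.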

Duplicate question: "Fastest way to search and sort vectors" has a far better answer than I could give.
Summary:
You need a better sample set; 5 entries isn't going to tell you anything. You're not going to be able to beat std::sort. In your particular case, the floating-point comparison will be the painful bit.

Related

How to get the top 100 names in a collection

I am creating a program that summarizes a data file. The data file has information about first names, etc., stored as fields in a CSV file. The fields in the data file are included as instance variables in the class. I created setter and getter methods to return the data for one person, and I created vectors to hold the collection of variables.
I am having trouble understanding how to create a list of the 100 most common first names of all people in the collection. The list must be in descending order of occurrence.
I was able to print all the names and their frequencies, but I am unable to print the 100 most common names. I tried to sort the vector and got the following error:
class std::pair<const std::string, int> has no member begin and end
Please help me resolve this issue. All processing of data in the vector must be done with iterators. I am not sure how to fix this since I am a beginner.
std::vector<std::string> commonNamesFirst; //vector
for (auto x : census) {
commonNamesFirst.push_back(x.getFirstName()); //populate vector
}
std::map<std::string, int> frequencies;
for (auto& x : census) { ++frequencies[x.getFirstName()]; }
for (auto& freq : frequencies) {
sort(freq.begin(), freq.end(), greater <>()); //error, need to sort in descending order
cout << freq.first << ": " << freq.second << endl; //print the 100 common names in descending order
}
std::map<std::string, int> frequencies;
This is generally the right direction. You're using this to count how many times each word occurs.
for (auto& freq : frequencies) {
This iterates over each individual word and a count of how many times it occurred. This no longer makes any logical sense. You are looking for the 100 most common ones, the ones with the highest count values. Iterating and looking at each one individually, in the manner that's done here, does not get you there.
sort(freq.begin(), freq.end(), greater <>());
freq, here, is a single word and how many times it occurred. You are using freq to iterate over all of the frequencies. Therefore, this is just one of the words and its frequency value. This is a single std::pair value, and it does not have anything called begin or end. That's what your compiler is telling you, directly.
Furthermore, you cannot sort a std::map in the first place. This is not a sortable container. The simplest option is to extract the contents of the now-complete map into something that's sortable. Like, for example, a vector:
std::vector<std::pair<std::string, int>> vfrequencies{
frequencies.begin(),
frequencies.end()
};
So, you've now copied the contents of a map into a vector. Not the most efficient approach, but a workable one.
And now, you can sort this vector. Rather easily.
However, as one last detail, you can't just drop std::greater<> and expect the right thing to happen.
You are looking to sort on the frequency count value only, which is the .second of these std::pairs. A plain std::greater is not going to do this for you. The std::greater overload for a std::pair is not going to do what you think it will do, here.
You will need to provide your own custom lambda for the third parameter of std::sort, that compares the second value of the std::pairs in that vector.
And then, the first 100 most common words will be the first 100 values in the vector. Mission accomplished.
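The steps described above (copy the map into a vector, sort on .second with a custom lambda, keep the first entries) could look roughly like this. The function name mostCommon and its limit parameter are hypothetical; only the frequencies map comes from the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns up to `limit` (name, count) pairs,
// ordered from most to least frequent.
std::vector<std::pair<std::string, int>>
mostCommon(const std::map<std::string, int>& frequencies, std::size_t limit)
{
    // Copy the map contents into a sortable container.
    std::vector<std::pair<std::string, int>> v(frequencies.begin(), frequencies.end());

    // Compare on the count only, largest first.
    std::sort(v.begin(), v.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) { return a.second > b.second; });

    // Keep at most `limit` entries; fewer if the map is smaller.
    if (v.size() > limit)
        v.resize(limit);
    return v;
}
```

Calling mostCommon(frequencies, 100) and walking the result with iterators then prints the names in descending order of occurrence. Names with equal counts come out in no particular order, since std::sort is not stable.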
You cannot (re-)sort a std::map, but you can copy the frequencies into a vector or a std::multimap as an intermediate:
std::map<std::string, int> frequencies;
for (auto& x : census) { ++frequencies[x.getFirstName()]; }
std::vector<std::pair<std::string, int>> freqs{frequencies.begin(), frequencies.end()};
const std::size_t n = std::min<std::size_t>(100, freqs.size()); // do not run past the end
std::partial_sort(freqs.begin(), freqs.begin() + n, freqs.end(),
[](const auto& lhs, const auto& rhs){ return lhs.second > rhs.second; });
for (std::size_t i = 0; i != n; ++i)
{
std::cout << freqs[i].second << ":" << freqs[i].first << std::endl; // count:name
}
Building on @Michał Kaczorowski's answer, you are trying to sort the values in each pair instead of the pairs in the map. However, as Sam mentioned, you cannot sort a std::map (the internal implementation stores things sorted by the key value, or the name in this case). You'd have to get the values out of the map and sort them then, or use a priority queue and heapsort (faster constant factor), or a monotonic queue (linear time but harder to implement). Here is an example heapsort implementation:
vector<string> commonNamesFirst; //vector
for (auto x : census) {
commonNamesFirst.push_back(x.getFirstName()); //populate vector
}
std::map<std::string, int> frequencies;
for (auto& x : census) { ++frequencies[x.getFirstName()]; }
std::priority_queue<pair<int, std::string> > top_names; // put the frequency before the name to take advantage of default pair compare
for (auto& freq : frequencies) top_names.push(std::make_pair(freq.second, freq.first));
for (int i=0; i<100 && !top_names.empty(); ++i) // stop early if there are fewer than 100 names
{
outputFile << top_names.top().second << ": " << top_names.top().first << endl; //print the 100 common names in descending order
top_names.pop();
}
The error you get, says it all. You are trying to sort individual std::pair. I think the best way would be to transform your map into a std::vector of pairs and then sort that vector. Then just go through first 100 elements in a loop and print results.

Can we get an iterator that filters a vector from a predicate in C++?

Is it possible to get an iterator over a vector that filters some element with a predicate, i.e. showing a view of the vector?
I think remove_if does something similar but I have not found whether I can use it as I want to or not.
Something like:
auto it = filter(vec.begin(), vec.end(), predicate);
// I can reuse the iterator like:
for (auto i = it; i != vec.end(); i++)
// ...
Edit: (A bit more context to get the best answer) I am doing a lot of queries in an sqlite database of log data in order to print a report.
The performance is not good at the moment because of the number of requests needed. I believe querying the database once and storing the result in a vector of smart pointers (unique_ptr if possible), then querying the vector with pure C++, may be faster.
Using copy_if is a good way to do the queries, but I don't need to copy everything and it might cost too much in the end (not sure about that). I should have mentioned that the data are immutable in my case.
As @Jarod42 mentioned in the comments, one solution would be to use ranges:
#include <algorithm>
#include <iostream>
#include <vector>
#include <range/v3/view/filter.hpp>
#include <range/v3/view/transform.hpp>
int main()
{
std::vector<int> numbers = { 1, 2, 3 ,4, 5 };
auto predicate = [](int& n){ return n % 2 == 0; };
auto evenNumbers = numbers | ranges::view::filter(predicate);
auto result = numbers | ranges::view::filter(predicate)
| ranges::view::transform([](int n) { return n * 2; });
for (int n : evenNumbers)
{
std::cout << n << ' ';
}
std::cout << '\n';
for (int n : result)
{
std::cout << n << ' ';
}
}
evenNumbers is a range view adapter which sticks to the numbers range and changes the way it iterates.
result is a range of numbers that have been filtered by the predicate and then had a function applied to them.
see the compile at compiler-explorer
credit: fluentcpp
Your question
Can we get an iterator that filters a vector from a predicate in C++?
in the sense you asked it, can only be answered with: no, not at the moment (C++17). To meet your requirement, the iterator would have to store the predicate and check it on every change of position and on every dereference, because other code could modify your std::vector in the meantime. Standard functionality like begin, end and distance would also become rather complicated.
So you could create your own iterator by deriving from an existing iterator: store the predicate and overload most of the functions to take the predicate into account. That is very complicated, a lot of work, and maybe not what you want. But it would be the only way to get exactly the functionality you requested.
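To give an idea of what such a hand-written iterator involves, here is a deliberately reduced sketch. It wraps a std::vector<int>::const_iterator instead of deriving from one, supports only forward traversal, and is not a conforming standard iterator. FilterIterator and collect are hypothetical names, and the sketch assumes the vector is not modified while the iterator is in use.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical forward-only iterator that skips elements failing a predicate.
class FilterIterator
{
public:
    using Base = std::vector<int>::const_iterator;

    FilterIterator(Base current, Base end, std::function<bool(int)> pred)
        : current_(current), end_(end), pred_(std::move(pred))
    {
        skipRejected(); // land on the first matching element, if any
    }

    int operator*() const { return *current_; }

    FilterIterator& operator++()
    {
        ++current_;
        skipRejected();
        return *this;
    }

    bool operator!=(const FilterIterator& other) const
    {
        return current_ != other.current_;
    }

private:
    // Advance until the predicate accepts the element or the end is reached.
    void skipRejected()
    {
        while (current_ != end_ && !pred_(*current_))
            ++current_;
    }

    Base current_;
    Base end_;
    std::function<bool(int)> pred_;
};

// Hypothetical helper: walks the filtered view and copies what it sees.
std::vector<int> collect(const std::vector<int>& v, std::function<bool(int)> pred)
{
    FilterIterator it(v.begin(), v.end(), pred);
    FilterIterator last(v.end(), v.end(), pred);
    std::vector<int> out;
    for (; it != last; ++it)
        out.push_back(*it);
    return out;
}
```

Even this stripped-down version has to carry the end position and the predicate around with every iterator. Making it bidirectional, generic, and robust against modification of the vector is where most of the work would go.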
For workarounds, there are many other possible solutions. People will show you them here.
But if I read your statement
"showing a view of the vector"
then life becomes easier. You can easily create a view of a vector by copying it conditionally with std::copy_if, as oblivion has written. That is in my opinion the best answer. It is non-destructive. But it is a snapshot and not the original data, so treat it as read-only: it does not take into account changes made to the original std::vector after the snapshot was taken.
The second option, a combination of std::remove_if and the vector's erase member function, will destroy the original data. Or better said, it will discard the filtered-out data. You could also std::copy_if the unwanted data to a backup area, remove them, and add them to the vector again at the end.
All these methods are problematic if the original data is modified.
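For reference, the destructive option that combines std::remove_if with the vector's erase member function could be sketched as follows. The helper name keepEven is hypothetical, and the predicate mirrors the even-number example used elsewhere in this answer.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: removes every odd number from v in place.
void keepEven(std::vector<int>& v)
{
    // remove_if shifts the kept elements to the front and returns the new
    // logical end; erase then drops the leftover tail.
    v.erase(std::remove_if(v.begin(), v.end(),
                           [](int n) { return n % 2 != 0; }),
            v.end());
}
```

After the call the filtered-out elements are gone from the vector, which is why this only fits when the original data is no longer needed.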
Maybe for you the standard std::copy_if is the best way to create a view. You would then take an iterator into the copy and work with that.
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator> // std::back_inserter
int main()
{
std::vector<int> testVector{ 1,2,3,4,5,6,7 }; // Test data
std::vector<int> testVectorView{}; // The view
// Create predicate
auto predForEvenNumbers = [](const int& i) -> bool { return (i % 2 == 0); };
// And filter. Take a snapshot
std::copy_if(testVector.begin(), testVector.end(), std::back_inserter(testVectorView), predForEvenNumbers);
// Show example result
std::vector<int>::iterator iter = testVectorView.begin();
std::cout << *iter << '\n';
return 0;
}
Please note: for big std::vectors, this will become a very expensive solution.

Tracking node traversals in calls to std::map::find?

I'm performing a large number of lookups, inserts and deletes on a std::map. I'm considering adding some code to optimize for speed, but I'd like to collect some statistics about the current workload. Specifically, I'd like to keep track of how many nodes 'find' has to traverse on each call so I can keep a running tally.
I'm thinking that if most changes in my map occur at the front, I might be better off searching the first N entries before using the tree that 'find' uses.
Find will have to compare elements using the map's compare function so you can provide a custom compare function that counts the number of times it is called in order to see how much work it is doing on each call (essentially how many nodes are traversed).
I don't see how searching the first N entries before calling find() could help in this case though. Iterating through the entries in a map just traverses the tree in sorted order so it can't be more efficient than just calling find() unless somehow your comparison function is much more expensive than a check for equality.
Example code:
#include <algorithm>
#include <iostream>
#include <map>
#include <numeric>
#include <vector>
using namespace std;
int main() {
vector<int> v(100);
iota(begin(v), end(v), 0);
vector<pair<int, int>> vp(v.size());
transform(begin(v), end(v), begin(vp), [](int i) { return make_pair(i, i); });
int compareCount = 0;
auto countingCompare = [&](int x, int y) { ++compareCount; return x < y; };
map<int, int, decltype(countingCompare)> m(begin(vp), end(vp), countingCompare);
cout << "Compares during construction: " << compareCount << "\n";
compareCount = 0;
auto pos = m.find(50);
cout << "Compares during find(): " << compareCount << "\n";
}
If it's feasible for your key/value structures it is worth considering unordered_map (in C++11 or TR1) as an alternative. std::map, being a balanced tree, is not likely to perform well under this usage profile, and hybrid approaches where you search the first N seem like a lot of work to me with no guaranteed payoff.
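A minimal sketch of the unordered_map alternative, assuming int keys and values since the question does not say what the map stores. Table and valueOr are hypothetical names. Keep in mind that std::unordered_map does not keep its elements in key order, so it only fits if nothing relies on sorted iteration.

```cpp
#include <unordered_map>

// Hypothetical key/value types; the question does not specify them.
using Table = std::unordered_map<int, int>;

// Hypothetical helper: returns the value stored for `key`,
// or `fallback` when the key is absent.
int valueOr(const Table& table, int key, int fallback)
{
    const auto it = table.find(key); // hash lookup, no tree traversal
    return it != table.end() ? it->second : fallback;
}
```

Inserts and deletes keep the familiar interface (operator[], emplace, erase), so swapping the container type in for a measurement is usually a small change.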

How can I get vector's index by its data in C++?

Let's assume I have a vector<node> containing 10000 objects:
vect[0] to vect[9999]
struct node
{
int data;
};
And let's say I want to find the vector index that contains this data ("444"), which happens to be in node 99.
Do I really have to do a for-loop to loop through all the elements then use
if (data == c[i].data)
Or is there a quicker way? Consider that my data is distinct and won't repeat in other nodes.
For this answer I am assuming that you've made an informed decision to use a std::vector over the other containers available.
Do I really have to do a for-loop to loop through all the elements?
No, you do not have to roll a for-loop to find an element. The idiomatic way of finding an element in a container is to use an algorithm from the standard library. Whether you should roll your own really depends on the situation.
To help you decide...
Alternative 1:
std::find() requires that there is a suitable equality comparator for your node data type, which may be as simple as this:
bool operator ==(node const& l, node const& r)
{
return l.data == r.data;
}
Then, given a required node, you can search for the element. This returns an iterator (or a pointer if you're using a plain old array). If you need the index, this requires a little calculation:
auto i = std::find(v.begin(), v.end(), required);
if (i != v.end())
{
std::cout << i->data << " found at index " << i - v.begin() << std::endl;
}
else
{
std::cout << "Item not found" << std::endl;
}
Alternative 2:
If creating a node is too expensive or you don't have an equality operator, a better approach would be to use std::find_if(), which takes a predicate (here I use a lambda because it's succinct, but you could use a functor like in this answer):
// Alternative linear search, using a predicate...
auto i = std::find_if(v.begin(), v.end(), [](node const& n){return n.data == 444;});
if (i != v.end())
{
std::cout << i->data << " found at index " << i - v.begin() << std::endl;
}
else
{
std::cout << "Item not found" << std::endl;
}
Or is there a quicker way?
Again, it depends. std::find() and std::find_if() run in linear time (O(n)), the same as your for-loop.
That said, using std::find() or std::find_if() won't involve random access or indexing into the container (they use iterators) but they may require a little bit of extra code compared with your for-loop.
Alternative 3:
If running time is critical and your array is sorted (say with std::sort()), you could perform a binary search, which runs in logarithmic time (O(log n)). std::lower_bound() implements a binary search for the first element that is not less than the given value. By default it requires a suitable less-than comparator for your node data type (an overload that accepts a custom comparison function also exists), such as:
bool operator <(node const& l, node const& r)
{
return l.data < r.data;
}
The invocation is similar to std::find() and returns an iterator, but requires an extra check:
auto i = std::lower_bound(v.begin(), v.end(), required);
if (i != v.end() && i->data == required.data)
{
std::cout << i->data << " found at index " << i - v.begin() << std::endl;
}
else
{
std::cout << "Item not found" << std::endl;
}
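std::lower_bound also has an overload that accepts a comparison function, which lets you search by the plain int key without constructing a node or defining operator<. A minimal sketch, assuming the vector is already sorted by data; indexOf is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct node
{
    int data;
};

// Hypothetical helper: index of the node holding `key` in a vector that is
// sorted by data, or v.size() when the key is absent.
std::size_t indexOf(const std::vector<node>& v, int key)
{
    // The comparator receives (element, searched value) in that order.
    auto it = std::lower_bound(v.begin(), v.end(), key,
                               [](const node& n, int k) { return n.data < k; });
    if (it != v.end() && it->data == key)
        return static_cast<std::size_t>(it - v.begin());
    return v.size();
}
```

The extra equality check is still needed, because lower_bound only finds the first position where the key could be inserted while keeping the order.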
These functions from the Algorithms Library work with any container supplying an iterator, so switching to another container from std::vector would be quick and easy to test and to maintain.
The decision is yours!
You should use std::find. You can't get faster than linear complexity (O(n)) if you know nothing about the vector beforehand (like it being sorted).
If you often need to find elements in the container, then vector is not the right data structure. You should use an ordered container such as std::set or std::map. Since elements in these containers are kept sorted, we can find elements in O(log(n)) time, as opposed to linear time for an unsorted sequence.
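If the vector itself has to stay, one way to apply this suggestion is a side table that maps each data value to its index. A minimal sketch, assuming the data values are distinct as stated in the question and that the vector does not change after the table is built; buildIndex is a hypothetical helper name.

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct node
{
    int data;
};

// Hypothetical helper: builds a data -> index lookup table for a vector
// whose data values are distinct.
std::map<int, std::size_t> buildIndex(const std::vector<node>& v)
{
    std::map<int, std::size_t> index;
    for (std::size_t i = 0; i < v.size(); ++i)
        index[v[i].data] = i;
    return index;
}
```

Building the table is a one-off O(n log n) cost; each lookup with find() afterwards is O(log n). The table has to be rebuilt or updated whenever elements are inserted into or removed from the vector.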
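Use std::find (or std::find_if here, since the vector holds node structs rather than plain ints):
auto it = std::find_if(vect.begin(), vect.end(), [](const node& n) { return n.data == 444; });
Note that if you have a sorted vector, you can make it faster.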
A neat solution would be to add an extra int index member to the node struct to provide data-to-index mapping when you have an instance of the struct. In such a case, you should probably wrap std::vector in a NodeVector class which will handle the updating of indices when, say, you remove an item (it's enough to subtract 1 from the indices of the elements which follow the element being removed in such a case) etc. If the vector doesn't change the number of elements, that's not even an issue. Other than that, if you can't have an instance of the struct grow in size, use std::map. Iterating over the container to find one item is not very smart, unless you need to do it very rarely and making anything complicated isn't worth the trouble.

C++ arrays [from:to]

How can I do this in C++?
In Python it is:
example = [u'one', u'two', u'three', u'four']
print example[1:3]
How can I do that in C++? (I am missing this feature.)
I need to rewrite this in C++:
while i<len(a)-1:
if (a[i]=='\x00' or a[i]=='\x04') and (eval("0x"+(a[i-1].encode("hex"))) in range(32-(4*eval((a[i].encode("hex")))),128-(12*eval((a[i].encode("hex")))))):
st+=a[i-1:i+1]
i+=2;continue
elif st=='':
i+=1;continue
elif len(st)>=4 and (a[i-1:i+1]=='\x00\x00' or a[i-1:i+1]=='\x0a\x00' or a[i-1:i+1]=='\x09\x00' or a[i-1:i+1]=='\x0d\x00'):
s.STRINGS.append([st.decode("utf-16le"),0xffffff])
s.INDEX.append(iCodeOffset+i-1-len(st))
st=''
i=i-1;continue
else:
st=''
i=i-1;continue
I need a list of strings from binary files without using string.exe.
Thanks in advance,
Benecore
Here is a function that returns a new sliced vector given the old one. It does only the most basic slicing (from:to), and only in one direction. Note two differences from Python: to is inclusive here, and when from is greater than to the bounds are swapped, whereas Python would return an empty list.
template<typename T>
std::vector<T> splice(const std::vector<T>& in, int from, int to)
{
if (to < from) std::swap(to, from);
std::vector<T> ret(to - from + 1);
for (to -= from; to + 1; to--)
{
ret[to] = in[from + to];
}
return ret;
}
First of all, there is no immediate replacement for this in C++, as C++ is not python and has its own idioms that work differently.
To begin with, for strings you can use the specific std::string::substr.
For more generic containers you should know C++ usually works iterator based when operating on elements of said container. For example suppose you want to compare elements in a vector, you'd do something like the following:
#include <iostream>
#include <algorithm>
#include <vector>
int main()
{
std::vector<int> a = {1,2,3,4};
std::vector<int> b = {1,2,10,4};
std::cout << "Whole vectors equal? " << (std::equal(a.begin(), a.end(), b.begin())?"yes":"no") << std::endl;
}
Now, suppose we only want to compare the first two values (like [:2]), Then we would rewrite the last statement to something like this:
std::cout << "First 2 values equal? " << (std::equal(a.begin(), a.begin()+2, b.begin())?"yes":"no") << std::endl;
Suppose we want to compare the last two values we would do this:
std::cout << "Last 2 values equal? " << (std::equal(a.end()-2, a.end(), b.end()-2)?"yes":"no") << std::endl;
See the pattern emerging? x.begin()+i, x.begin()+j is roughly equal to [i:j], and x.end()-i, x.end()-j is roughly equal to [-i:-j]. Note that you can mix these, of course.
So in general, when working on containers you will work on a range of iterators, and this iterator range can be specified very much like Python's list slicing. It is more verbose and it is a different idiom (sliced lists are lists again, but iterators are not containers), but you get the same result.
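When a real copy of the sub-range is wanted, the same iterator pair can be handed to the std::vector range constructor. A minimal sketch, assuming non-negative indices only; slice is a hypothetical helper name, and negative Python-style indices are not handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring Python's v[from:to] for non-negative indices:
// `to` is exclusive and out-of-range bounds are clamped to the vector size.
template <typename T>
std::vector<T> slice(const std::vector<T>& v, std::size_t from, std::size_t to)
{
    to = std::min(to, v.size());
    from = std::min(from, to); // from >= to yields an empty result, as in Python
    return std::vector<T>(v.begin() + static_cast<std::ptrdiff_t>(from),
                          v.begin() + static_cast<std::ptrdiff_t>(to));
}
```

With the list from the question, slice(example, 1, 3) would hold "two" and "three", matching example[1:3] in Python.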
Some final notes:
I wrote x.begin() to make the code a bit clearer, you can also write std::begin(x), which is more generic and also works on arrays. The same goes for std::end
Take a look at the algorithms library before writing your own for loops over iterators.
Yes, you can write your own for loops (something like for(auto it = a.begin(); it != a.end(); it++)), but often it's easier and more consistent to pass a function or lambda to std::for_each.
Really remember C++ is not python or vice versa.