How do I sort two tuples of containers? - c++

I have two tuples, each containing containers of different types.
std::tuple<containerA<typesA>...> tupleA;
std::tuple<containerB<typesB>...> tupleB;
So, as an example tupleA might be defined like this:
std::tuple<list<int>, list<float>> tupleA;
The two containers, containerA and containerB are different types. typesA and typesB do not intersect. I want to sort the containers inside the tuples by their sizes, and be able to access them by their type. So, as an example
std::tuple<list<int>, list<float>> tupleA {{2}, {3.3f, 4.2f}};
std::tuple<deque<double>, deque<uint8_t>> tupleB {{2.0, 1.2, 4.4}, {}};
auto sortedArray = sort(tupleA, tupleB);
sortedArray == {deque<uint8_t>, list<int>, list<float>, deque<double>};
sortedArray.get<list<float>>() == {3.3f, 4.2f};
std::get<list<int>>(tupleA).push_back(4);
std::get<list<int>>(tupleA).push_back(5);
std::get<list<int>>(tupleA).push_back(6);
sortedArray = sort(tupleA, tupleB);
sortedArray == {deque<uint8_t>, list<float>, deque<double>, list<int>};
The most important part is that I have to store sortedArray and that the sizes of the elements in the tuple may change. I have not been able to create such a sort function. The actual syntax of accessing the sortedArray is not important.
I tried to use naive indexes to the container data inside the sortedArray, such as using std::pair<A, 1> to refer to the second element in tuple A, however I can't access the information of the tuple through it because it is not a constexpr

This is possible but not easy.
First, you need to generate a compile-time list of every permutation of N elements (there are N factorial of them).
Write both a runtime and compile time permutation object (separate).
Make a runtime map from permutation to index. (map step)
Convert your tuple to a vector of (index, size). Sort this by size. Extract the permutation. Map this to the index of the set of permutations. (Sort step, uses map step)
Write a "magic switch" that takes a function object f and a permutation index, and invokes the f with the compile time permutation. (magic step)
Write code that takes a compile time permutation and reorders a tuple based on it. (reorder step)
Write code that takes a function object f and a tuple. Do (sort step). Do (magic step), feeding it a second function object g that takes the passed in permutation and the tuple and (reorder step), then calls f with it.
Call the function that does this bob.
std::tuple<list<int>, list<float>> tupleA {{2}, {3.3f, 4.2f}};
std::tuple<deque<double>, deque<uint8_t>> tupleB {{2.0, 1.2, 4.4}, {}};
bob(concat_tuple_tie(tupleA, tupleB), [&](auto&& sorted_array){
assert( std::is_same<
std::tuple<deque<uint8_t>&, list<int>&, list<float>&, deque<double>&>,
std::decay_t<decltype(sorted_array)>
>::value, "wrong order" );
sortedArray.get<list<float>>() == {3.3f, 4.2f};
});
std::get<list<int>>(tupleA).push_back(4);
std::get<list<int>>(tupleA).push_back(5);
std::get<list<int>>(tupleA).push_back(6);
bob(concat_tuple_tie(tupleA, tupleB), [&](auto&& sorted_array){
assert( std::is_same<
std::tuple<deque<uint8_t>&, list<float>&, deque<double>&, list<int>&>,
std::decay_t<decltype(sorted_array)>
>::value, "wrong order" );
});
Personally, I doubt you need to do this.
I could do this, but it might take me hours, so I'm not going to do it for a SO answer. You can look at my magic switch code for the most magic of the above. The other hard part is that permutation step.
Note that the code uses continuation passing style. Also note that every permutation of the lambda is instantiated, including the wrong ones, so your code must be valid for every permutation. This can result in an insane amount of code bloat.
An alternative solution could involve creating type-erasure wrappers around your containers and simply sorting those, but that is not what you asked for.

The problem you describe sounds very much like you want dynamic polymorphism, not static polymorphism — and your difficulty in solving it is because you're stuck on trying to express with tools for doing static polymorphism.
i.e. you need a type (hierarchy) for collections (or pointers to collections) that let you select the type of collection and the type of data at run-time, but still provides a unified interface to functions you need like size (e.g. by inheritance and virtual functions).
e.g. a beginning of a sketch might look like
struct SequencePointer
{
virtual size_t size() const = 0;
virtual boost::any operator[](size_t n) const = 0;
};
template <typename Sequence>
struct SLSequencePointer
{
const Sequence *ptr;
size_t size() const { return ptr->size(); }
boost::any operator[](size_t n) const { return (*ptr)[n]; }
};

Related

Good hash function over C++ unordered_set

I'm looking to implement a hash function over a C++ std::unordered_set<char>. I initially tried using boost::hash_range:
namespace std
{
template<> struct hash<unordered_set<char> >
size_t operator(const unordered_set<char> &s)(
{
return boost::hash_range(begin(s), end(s))
};
}
But then I realised that because the set is unordered, the iteration order isn't stable, and the hash function is thus wrong. What are some better options for me? I guess I could std::set instead of std::unordered_set, but using an ordered set just because it's easier to hash seems ... wrong.
A very similar question, albeit in C#, was asked here:
Hash function on list independant of order of items in it
Over there, Per gave a nice language-independent answer that should put you on the right track. In short, for the input
x1, …, xn
you should map it to
f(x1) op … op f(xn)
where
f is a good hash function for single elements (integer in your case)
op is a commutative operator, such as xor or plus
Hashing an integer may seam pointless at first, but your goal is to make two neighboring integers be dissimilar from each other, so that when combined with op do not create the same result. e.g. if you use + as the operator, you want f(1)+f(2) to give a different result than f(0)+f(3).
If standard hashing functions are not good candidates for f and you cannot find one, check the linked answer for more details...
You could try simply adding which is independent of order and returning the hash of that:
template<> struct hash<unordered_set<char> >
size_t operator(const unordered_set<char> &s) {
long long sum{0};
for ( auto e : s )
sum += s;
return std::hash(sum);
};

Map, pair-vector or two vectors...?

I read through some posts and "wikis" but still cannot decide what approach is suitable for my problem.
I create a class called Sample which contains a certain number of compounds (lets say this is another class Nuclide) at a certain relative quantity (double).
Thus, something like (pseudo):
class Sample {
map<Nuclide, double>;
}
If I had the nuclides Ba-133, Co-60 and Cs-137 in the sample, I would have to use exactly those names in code to access those nuclides in the map. However, the only thing I need to do, is to iterate through the map to perform calculations (which nuclides they are is of no interest), thus, I will use a for- loop. I want to iterate without paying any attention to the key-names, thus, I would need to use an iterator for the map, am I right?
An alternative would be a vector<pair<Nuclide, double> >
class Sample {
vector<pair<Nuclide, double> >;
}
or simply two independent vectors
Class Sample {
vector<Nuclide>;
vector<double>;
}
while in the last option the link between a nuclide and its quantity would be "meta-information", given by the position in the respective vector only.
Due to my lack of profound experience, I'd ask kindly for suggestions of what approach to choose. I want to have the iteration through all available compounds to be fast and easy and at the same time keep the logical structure of the corresponding keys and values.
PS.: It's possible that the number of compunds in a sample is very low (1 to 5)!
PPS.: Could the last option be modified by some const statements to prevent changes and thus keep the correct order?
If iteration needs to be fast, you don't want std::map<...>: its iteration is a tree-walk which quickly gets bad. std::map<...> is really only reasonable if you have many mutations to the sequence and you need the sequence ordered by the key. If you have mutations but you don't care about the order std::unordered_map<...> is generally a better alternative. Both kinds of maps assume you are looking things up by key, though. From your description I don't really see that to be the case.
std::vector<...> is fast to iterated. It isn't ideal for look-ups, though. If you keep it ordered you can use std::lower_bound() to do a std::map<...>-like look-up (i.e., the complexity is also O(log n)) but the effort of keeping it sorted may make that option too expensive. However, it is an ideal container for keeping a bunch objects together which are iterated.
Whether you want one std::vector<std::pair<...>> or rather two std::vector<...>s depends on your what how the elements are accessed: if both parts of an element are bound to be accessed together, you want a std::vector<std::pair<...>> as that keeps data which is accessed together. On the other hand, if you normally only access one of the two components, using two separate std::vector<...>s will make the iteration faster as more iteration elements fit into a cache-line, especially if they are reasonably small like doubles.
In any case, I'd recommend to not expose the external structure to the outside world and rather provide an interface which lets you change the underlying representation later. That is, to achieve maximum flexibility you don't want to bake the representation into all your code. For example, if you use accessor function objects (property maps in terms of BGL or projections in terms of Eric Niebler's Range Proposal) to access the elements based on an iterator, rather than accessing the elements you can change the internal layout without having to touch any of the algorithms (you'll need to recompile the code, though):
// version using std::vector<std::pair<Nuclide, double> >
// - it would just use std::vector<std::pair<Nuclide, double>::iterator as iterator
auto nuclide_projection = [](Sample::key& key) -> Nuclide& {
return key.first;
}
auto value_projecton = [](Sample::key& key) -> double {
return key.second;
}
// version using two std::vectors:
// - it would use an iterator interface to an integer, yielding a std::size_t for *it
struct nuclide_projector {
std::vector<Nuclide>& nuclides;
auto operator()(std::size_t index) -> Nuclide& { return nuclides[index]; }
};
constexpr nuclide_projector nuclide_projection;
struct value_projector {
std::vector<double>& values;
auto operator()(std::size_t index) -> double& { return values[index]; }
};
constexpr value_projector value_projection;
With one pair these in-place, for example an algorithm simply running over them and printing them could look like this:
template <typename Iterator>
void print(std::ostream& out, Iterator begin, Iterator end) {
for (; begin != end; ++begin) {
out << "nuclide=" << nuclide_projection(*begin) << ' '
<< "value=" << value_projection(*begin) << '\n';
}
}
Both representations are entirely different but the algorithm accessing them is entirely independent. This way it is also easy to try different representations: only the representation and the glue to the algorithms accessing it need to be changed.

Most efficient way to return array of struct member

I have a struct like:
struct ohlc{
double open,high,low,close;
};
Part of my application makes use of a collection of these. Sometimes time-stamped. Another part of my app uses a third party (closed) library that requires arrays of doubles e.g. close[] or open[] etc.
What would be the most suitable container and method to return a double array of open[], close[] etc. Currently, I use vector and iterate over the whole collection to create the arrays. Is there a more efficient way.
I may even be completely wrong with my current use of the struct? What I have is a price feed of bid/ask prices. I try to maintain a collection of M1, M5, M15 and H1 candle sticks i.e. OHLC data. Typically I only require 100 hrs worth of data. As a new minute of prices comes in, I can delete the oldest minute thus maintaining 100 hrs worth of data at any time. As the H1, M15, M5, M1 can all be created from the base data of time-stamped ask/bid prices, do I still need to hold independent H1, M15 etc for performance reasons. I ask this because it is a duplication of data?
EDIT: My current method is fine for my usual purpose but now I am 'back-testing', I'm throwing millions of bid/ask prices at my code and need it to be as efficient as possible. The back tests can currently take hours to perform
I return from my collection of structs as follows:
std::vector<double> Series::EODSeries::open( const_iterator iter, unsigned long num ) const
{
vector<double> v;
if( iter == end() )
return v;
// reverse iterator init skips the first element in collection. We must manually insert the current element.
v.insert(v.begin(), iter->second.open);
unsigned i = 1;
for( const_reverse_iterator rev_iter(iter); i < num && rev_iter != rend(); ++rev_iter, ++i )
v.insert(v.begin(), rev_iter->second.open);
return v;
}
You say the third party application requires arrays of doubles but this is slightly misleading since arrays cannot be arguments in C++ – they always decay to pointers.
So what you can do is simply pass a pointer to the first argument of your vector. This is guaranteed to work.
// Assuming the following signature:
void the_method(double arg[]);
// is actually the same as:
// void the_method(double* arg);
std::vector<double> open; // your vector
the_method(&open[0]);
However, if I misunderstood you and you actually have a std::vector<ohlc> then you are essentially out of luck – you indeed need to copy the open and close members out of this vector’s elements into its own container. But even here I’d advise you to use a vector in your code, not a C array.

C++ - Check if One Array of Strings Contains All Elements of Another

I've recently been porting a Python application to C++, but am now at a loss as to how I can port a specific function. Here's the corresponding Python code:
def foo(a, b): # Where `a' is a list of strings, as is `b'
for x in a:
if not x in b:
return False
return True
I wish to have a similar function:
bool
foo (char* a[], char* b[])
{
// ...
}
What's the easiest way to do this? I've tried working with the STL algorithms, but can't seem to get them to work. For example, I currently have this (using the glib types):
gboolean
foo (gchar* a[], gchar* b[])
{
gboolean result;
std::sort (a, (a + (sizeof (a) / sizeof (*a))); // The second argument corresponds to the size of the array.
std::sort (b, (b + (sizeof (b) / sizeof (*b)));
result = std::includes (b, (b + (sizeof (b) / sizeof (*b))),
a, (a + (sizeof (a) / sizeof (*a))));
return result;
}
I'm more than willing to use features of C++11.
I'm just going to add a few comments to what others have stressed and give a better algorithm for what you want.
Do not use pointers here. Using pointers doesn't make it c++, it makes it bad coding. If you have a book that taught you c++ this way, throw it out. Just because a language has a feature, does not mean it is proper to use it anywhere you can. If you want to become a professional programmer, you need to learn to use the appropriate parts of your languages for any given action. When you need a data structure, use the one appropriate to your activity. Pointers aren't data structures, they are reference types used when you need an object with state lifetime - i.e. when an object is created on one asynchronous event and destroyed on another. If an object lives it's lifetime without any asynchronous wait, it can be modeled as a stack object and should be. Pointers should never be exposed to application code without being wrapped in an object, because standard operations (like new) throw exceptions, and pointers do not clean themselves up. In other words, pointers should always be used only inside classes and only when necessary to respond with dynamic created objects to external events to the class (which may be asynchronous).
Do not use arrays here. Arrays are simple homogeneous collection data types of stack lifetime of size known at compiletime. They are not meant for iteration. If you need an object that allows iteration, there are types that have built in facilities for this. To do it with an array, though, means you are keeping track of a size variable external to the array. It also means you are enforcing external to the array that the iteration will not extend past the last element using a newly formed condition each iteration (note this is different than just managing size - it is managing an invariant, the reason you make classes in the first place). You do not get to reuse standard algorithms, are fighting decay-to-pointer, and generally are making brittle code. Arrays are (again) useful only if they are encapsulated and used where the only requirement is random access into a simple type, without iteration.
Do not sort a vector here. This one is just odd, because it is not a good translation from your original problem, and I'm not sure where it came from. Don't optimise early, but don't pessimise early by choosing a bad algorithm either. The requirement here is to look for each string inside another collection of strings. A sorted vector is an invariant (so, again, think something that needs to be encapsulated) - you can use existing classes from libraries like boost or roll your own. However, a little bit better on average is to use a hash table. With amortised O(N) lookup (with N the size of a lookup string - remember it's amortised O(1) number of hash-compares, and for strings this O(N)), a natural first way to translate "look up a string" is to make an unordered_set<string> be your b in the algorithm. This changes the complexity of the algorithm from O(NM log P) (with N now the average size of strings in a, M the size of collection a and P the size of collection b), to O(NM). If the collection b grows large, this can be quite a savings.
In other words
gboolean foo(vector<string> const& a, unordered_set<string> const& b)
Note, you can now pass constant to the function. If you build your collections with their use in mind, then you often have potential extra savings down the line.
The point with this response is that you really should never get in the habit of writing code like that posted. It is a shame that there are a few really (really) bad books out there that teach coding with strings like this, and it is a real shame because there is no need to ever have code look that horrible. It fosters the idea that c++ is a tough language, when it has some really nice abstractions that do this easier and with better performance than many standard idioms in other languages. An example of a good book that teaches you how to use the power of the language up front, so you don't build bad habits, is "Accelerated C++" by Koenig and Moo.
But also, you should always think about the points made here, independent of the language you are using. You should never try to enforce invariants outside of encapsulation - that was the biggest source of savings of reuse found in Object Oriented Design. And you should always choose your data structures appropriate for their actual use. And whenever possible, use the power of the language you are using to your advantage, to keep you from having to reinvent the wheel. C++ already has string management and compare built in, it already has efficient lookup data structures. It has the power to make many tasks that you can describe simply coded simply, if you give the problem a little thought.
Your first problem is related to the way arrays are (not) handled in C++. Arrays live a kind of very fragile shadow existence where, if you as much as look at them in a funny way, they are converted into pointers. Your function doesn't take two pointers-to-arrays as you expect. It takes two pointers to pointers.
In other words, you lose all information about the size of the arrays. sizeof(a) doesn't give you the size of the array. It gives you the size of a pointer to a pointer.
So you have two options: the quick and dirty ad-hoc solution is to pass the array sizes explicitly:
gboolean foo (gchar** a, int a_size, gchar** b, int b_size)
Alternatively, and much nicer, you can use vectors instead of arrays:
gboolean foo (const std::vector<gchar*>& a, const std::vector<gchar*>& b)
Vectors are dynamically sized arrays, and as such, they know their size. a.size() will give you the number of elements in a vector. But they also have two convenient member functions, begin() and end(), designed to work with the standard library algorithms.
So, to sort a vector:
std::sort(a.begin(), a.end());
And likewise for std::includes.
Your second problem is that you don't operate on strings, but on char pointers. In other words, std::sort will sort by pointer address, rather than by string contents.
Again, you have two options:
If you insist on using char pointers instead of strings, you can specify a custom comparer for std::sort (using a lambda because you mentioned you were ok with them in a comment)
std::sort(a.begin(), a.end(), [](gchar* lhs, gchar* rhs) { return strcmp(lhs, rhs) < 0; });
Likewise, std::includes takes an optional fifth parameter used to compare elements. The same lambda could be used there.
Alternatively, you simply use std::string instead of your char pointers. Then the default comparer works:
gboolean
foo (const std::vector<std::string>& a, const std::vector<std::string>& b)
{
gboolean result;
std::sort (a.begin(), a.end());
std::sort (b.begin(), b.end());
result = std::includes (b.begin(), b.end(),
a.begin(), a.end());
return result;
}
Simpler, cleaner and safer.
The sort in the C++ version isn't working because it's sorting the pointer values (comparing them with std::less as it does with everything else). You can get around this by supplying a proper comparison functor. But why aren't you actually using std::string in the C++ code? The Python strings are real strings, so it makes sense to port them as real strings.
In your sample snippet your use of std::includes is pointless since it will use operator< to compare your elements. Unless you are storing the same pointers in both your arrays the operation will not yield the result you are looking for.
Comparing adresses is not the same thing as comparing the true content of your c-style-strings.
You'll also have to supply std::sort with the neccessary comparator, preferrably std::strcmp (wrapped in a functor).
It's currently suffering from the same problem as your use of std::includes, it's comparing addresses instead of the contents of your c-style-strings.
This whole "problem" could have been avoided by using std::strings and std::vectors.
Example snippet
#include <iostream>
#include <algorithm>
#include <cstring>
typedef char gchar;
gchar const * a1[5] = {
"hello", "world", "stack", "overflow", "internet"
};
gchar const * a2[] = {
"world", "internet", "hello"
};
...
int
main (int argc, char *argv[])
{
auto Sorter = [](gchar const* lhs, gchar const* rhs) {
return std::strcmp (lhs, rhs) < 0 ? true : false;
};
std::sort (a1, a1 + 5, Sorter);
std::sort (a2, a2 + 3, Sorter);
if (std::includes (a1, a1 + 5, a2, a2 + 3, Sorter)) {
std::cerr << "all elements in a2 was found in a1!\n";
} else {
std::cerr << "all elements in a2 wasn't found in a1!\n";
}
}
output
all elements in a2 was found in a1!
A naive transcription of the python version would be:
bool foo(std::vector<std::string> const &a,std::vector<std::string> const &b) {
for(auto &s : a)
if(end(b) == std::find(begin(b),end(b),s))
return false;
return true;
}
It turns out that sorting the input is very slow. (And wrong in the face of duplicate elements.) Even the naive function is generally much faster. Just goes to show again that premature optimization is the root of all evil.
Here's an unordered_set version that is usually somewhat faster than the naive version (or was for the values/usage patterns I tested):
bool foo(std::vector<std::string> const& a,std::unordered_set<std::string> const& b) {
for(auto &s:a)
if(b.count(s) < 1)
return false;
return true;
}
On the other hand, if the vectors are already sorted and b is relatively small ( less than around 200k for me ) then std::includes is very fast. So if you care about speed you just have to optimize for the data and usage pattern you're actually dealing with.

Difference between two vector<MyType*> A and B

I've got two vector<MyType*> objects called A and B. The MyType class has a field ID and I want to get the MyType* which are in A but not in B. I'm working on a image analysis application and I was hoping to find a fast/optimized solution.
The unordered approach will typically have quadratic complexity unless the data is sorted beforehand (by your ID field), in which case it would be linear and would not require repeated searches through B.
struct CompareId
{
bool operator()(const MyType* a, const MyType* b) const
{
return a>ID < b->ID;
}
};
...
sort(A.begin(), A.end(), CompareId() );
sort(B.begin(), B.end(), CompareId() );
vector<MyType*> C;
set_difference(A.begin(), A.end(), B.begin(), B.end(), back_inserter(C) );
Another solution is to use an ordered container like std::set with CompareId used for the StrictWeakOrdering template argument. I think this would be better if you need to apply a lot of set operations. That has its own overhead (being a tree) but if you really find that to be an efficiency problem, you could implement a fast memory allocator to insert and remove elements super fast (note: only do this if you profile and determine this to be a bottleneck).
Warning: getting into somewhat complicated territory.
There is another solution you can consider which could be very fast if applicable and you never have to worry about sorting data. Basically, make any group of MyType objects which share the same ID store a shared counter (ex: pointer to unsigned int).
This will require creating a map of IDs to counters and require fetching the counter from the map each time a MyType object is created based on its ID. Since you have MyType objects with duplicate IDs, you shouldn't have to insert to the map as often as you create MyType objects (most can probably just fetch an existing counter).
In addition to this, have a global 'traversal' counter which gets incremented whenever it's fetched.
static unsigned int counter = 0;
unsigned int traversal_counter()
{
// make this atomic for multithreaded applications and
// needs to be modified to set all existing ID-associated
// counters to 0 on overflow (see below)
return ++counter;
}
Now let's go back to where you have A and B vectors storing MyType*. To fetch the elements in A that are not in B, we first call traversal_counter(). Assuming it's the first time we call it, that will give us a traversal value of 1.
Now iterate through every MyType* object in B and set the shared counter for each object from 0 to the traversal value, 1.
Now iterate through every MyType* object in A. The ones that have a counter value which doesn't match the current traversal value(1) are the elements in A that are not contained in B.
What happens when you overflow the traversal counter? In this case, we iterate through all the counters stored in the ID map and set them back to zero along with the traversal counter itself. This will only need to occur once in about 4 billion traversals if it's a 32-bit unsigned int.
This is about the fastest solution you can apply to your given problem. It can do any set operation in linear complexity on unsorted data (and always, not just in best-case scenarios like a hash table), but it does introduce some complexity so only consider it if you really need it.
Sort both vectors (std::sort) according to ID and then use std::set_difference. You will need to define a custom comparator to pass to both of these algorithms, for example
struct comp
{
bool operator()(MyType * lhs, MyType * rhs) const
{
return lhs->id < rhs->id;
}
};
First look at the problem. You want "everything in A not in B". That means you're going to have to visit "everything in A". You'll also have to visit everything in B to have knowledge of what is and is not in B. So that suggests there should be an O(n) + O(m) solution, or taking liberty to elide the difference between n and m, O(2n).
Let's consider the std::set_difference approach. Each sort is O(n log n), and set_difference is O(n). So the sort-sort-set_difference approach is O(n + 2n log n). Let's call that O(4n).
Another approach would be to first place the elements of B in a set (or map). Iteration across B to create the set is O(n) plus insertion O(log n) of each element, followed by iteration across A O(n), with a lookup for each element of A (log n), gives a total: O(2n log n). Let's call that O(3n), which is slightly better.
Finally, using an unordered_set (or unordered_map), and assuming we get average case of O(1) insertion and O(1) lookup, we have an approach that is O(2n). A-ha!
The real win here is that unordered_set (or map) is probably the most natural choice to represent your data in the first place, i.e., the proper design yields the optimized implementation. That doesn't always happen, but it's nice when it does!
If B preexists to A, then while populating A, you can bookkeep in a C vector.