I'm doing some coding at work in C++, and a lot of the things that I work on involve analyzing sets of data. Very often I need to select some elements from an STL container, and I frequently end up writing code like this:
using std::vector;
vector< int > numbers;
for ( int i = -10; i <= 10; ++i ) {
numbers.push_back( i );
}
vector< int > positive_numbers;
for ( vector< int >::const_iterator it = numbers.begin(), end = numbers.end();
it != end; ++it
) {
if ( *it > 0 ) {
positive_numbers.push_back( *it );
}
}
Over time this for loop and the logic contained within it gets a lot more complicated and unreadable. Code like this is less satisfying than the analogous SELECT statement in SQL, assuming that I have a table called numbers with a column named "num" rather than a std::vector< int > :
SELECT * INTO positive_numbers FROM numbers WHERE num > 0
That's a lot more readable to me, and it also scales better. Over time a lot of the if-statement logic in our codebase has become complicated, order-dependent and unmaintainable. If we could write SQL-like statements in C++ without having to go to a database, I think the state of the code might be better.
Is there a simpler way that I can implement something like a SELECT statement in C++ where I can create a new container of objects by only describing the characteristics of the objects that I want? I'm still relatively new to C++, so I'm hoping that there's something magic with either template metaprogramming or clever iterators that would solve this. Thanks!
Edit based on first two answers. Thanks, I had no idea that's what LINQ actually was. I program on Linux and OSX systems primarily, and am interested in something cross-platform across OSX, Linux and Windows. So a more educated version of this question would be - is there a cross-platform implementation of something like LINQ for C++?
You've almost exactly described LINQ. It's a .NET 3.5 feature, so you should be able to use it from managed C++ (C++/CLI), though not from standard C++.
The functionality you're describing is commonly found in functional languages that support concepts such as closures, predicates, functors, etc.
The problem with the code above is that it combines:
1. Logic for iterating over the collection (the for loop)
2. The condition that must be satisfied for an element to be copied to another collection
3. Logic for copying an element from one collection to another
In reality (1) and (3) are boilerplate, insofar as every time you need to iterate over a collection copying some elements to another collection, it's probably only the conditional code that will change each time. Languages with support for functional programming eliminate this boilerplate. For example, in Groovy you can replace your for loop above with just
def positive_numbers = numbers.findAll{it > 0}
Even though C++ is not a functional language, there may be libraries which provide support for functional-style programming with STL collections. For example, Apache Commons Collections (and possibly also Google's collections library) provides support for functional-style programming with Java collections, even though Java itself is not a functional language.
I think you have described LINQ (a C# and .NET 3.5 feature). Have you looked into that?
LINQ is the obvious answer for .NET (or Mono on non-Windows platforms), but in C++ it shouldn't be that difficult to write something like it yourself on top of the STL.
Use the Boost.Iterator library to write a "select" iterator, for example, one which skips all elements that do not satisfy a given predicate.
Boost already has a few relevant examples in their documentation I believe.
Or http://www.boost.org/doc/libs/1_39_0/libs/iterator/doc/filter_iterator.html might actually do what you need out of the box.
In any case, in C++, you could achieve the same effect basically by layering iterators.
If you have a regular iterator, which visits every element in the sequence, you can wrap that in a filter iterator, which increments the underlying iterator until it finds a value satisfying the condition. Then you could even wrap that in a "select" iterator transforming the value to the desired format.
It seems like a fairly obvious idea, but I'm not aware of any complete implementations of it.
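For what it's worth, the layering described above can be sketched in plain standard C++ without Boost. This is only a minimal, forward-only illustration: FilterIterator, at_end, is_positive and positives are names made up for this sketch, and it is not how Boost's filter_iterator is implemented.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical forward-only filter iterator: wraps a [cur, end) pair and
// only ever stops on elements for which pred(*cur) is true.
template <typename It, typename Pred>
class FilterIterator {
public:
    FilterIterator(It cur, It end, Pred pred)
        : cur_(cur), end_(end), pred_(pred) { skip(); }

    // Dereference the underlying iterator (must not be at the end).
    typename std::iterator_traits<It>::reference operator*() const {
        assert(cur_ != end_);
        return *cur_;
    }

    // Step once, then keep stepping until the predicate holds or we hit end.
    FilterIterator& operator++() { ++cur_; skip(); return *this; }

    bool at_end() const { return cur_ == end_; }

private:
    void skip() { while (cur_ != end_ && !pred_(*cur_)) ++cur_; }

    It cur_;
    It end_;
    Pred pred_;
};

inline bool is_positive(int n) { return n > 0; }

// Collects the positive values of `numbers` by walking a FilterIterator.
inline std::vector<int> positives(const std::vector<int>& numbers) {
    typedef FilterIterator<std::vector<int>::const_iterator, bool (*)(int)> Iter;
    std::vector<int> out;
    for (Iter it(numbers.begin(), numbers.end(), is_positive); !it.at_end(); ++it)
        out.push_back(*it);
    return out;
}
```

A "select" layer that transforms each value could wrap this in the same way, by applying a function inside operator*.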
You're using STL containers. I would recommend using STL algorithms, which are largely straight out of set theory. A SQL select is translated to repeated applications of std::find_if, or a combination of std::lower_bound and std::upper_bound (on sorted containers). The performance will be about the same as looping, but the syntax is a little more declarative.
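As a rough sketch of the declarative style, assuming a C++11 compiler: std::copy_if (a different algorithm from the std::find_if mentioned above) copies every element that satisfies a predicate in one call. The helper names select_where and positive_numbers are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical helper: returns the elements of `in` that satisfy `pred`,
// roughly "SELECT * FROM in WHERE pred".
template <typename T, typename Pred>
std::vector<T> select_where(const std::vector<T>& in, Pred pred) {
    std::vector<T> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out), pred);
    return out;
}

// Usage: keep only the positive numbers.
inline std::vector<int> positive_numbers(const std::vector<int>& numbers) {
    return select_where(numbers, [](int n) { return n > 0; });
}
```

The loop and the copy are hidden inside the algorithm, so only the condition changes from one query to the next.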
LINQ will give you similar syntax and operations, but unless used over IQueryables (i.e., data in a database) you're not going to get any performance gains either.
Your best bet after that is putting the data into files, whether that's Berkeley DB, NetCDF, HDF5, STXXL, etc. File access is slow, but doing this allows you to work on more data than fits in memory.
For what you're describing, std::vector isn't a terribly good choice. It is the SQL equivalent of a table with no indexes. On top of that, filling one container with the contents of another container is possibly a reasonable performance optimization, but not very readable, and not quite idiomatic, either. There are a number of ways of solving this portably (i.e., without relying on managed .NET code).
The first choice is to pick a better container. If you don't need to have stable iteration, then you should use std::set or std::multiset. These containers use a balanced search tree to store the values in order. This is equivalent to a simple SQL index over all columns.
std::set< int > numbers;
for ( int i = -10; i <= 10; ++i ) {
numbers.insert( i );
}
std::set< int >::iterator first = numbers.upper_bound( 0 ); // first element greater than 0
std::set< int >::iterator end = numbers.end();
Now you can iterate from first to end with no wasted effort beyond the O(n log(n)) fill and the O(log(n)) seek. Each increment of a std::set iterator is amortized O(1).
If, for some reason you must use a vector, you can get more idiomatic C++ using std::find_if (see Max Lybbert's answer)
bool isPositive(int n) { return n > 0; }
std::vector< int > numbers;
for ( int i = -10; i <= 10; ++i ) {
numbers.push_back( i );
}
for ( std::vector< int >::const_iterator end = numbers.end(),
iter = std::find_if( numbers.begin(), end, isPositive ); // <- first positive value
iter != end;
iter = std::find_if( iter + 1, end, isPositive ) // <- advance iter to the next positive
) {
// *iter is guaranteed to be positive here, do something with it!
}
If you want something even more evocative of SQL without actually connecting to a database, you should look at Boost, particularly the boost::multi_index container and boost iterators.
Check out Mono if you want to try out LINQ on Linux / OS X. It's a port of the .NET Framework, and LINQ is included now, I believe.
Recently I was dealing with what I am sure is a very common problem, which essentially boils down to the following:
Given a long text, calculate the frequency of each word occurring in the text.
I was able to solve this problem using std::unordered_map. This, however, turned quite ugly: for every word in the text that had already been encountered, I had to do a find, an erase, and then a re-insert into the map with the value incremented.
I realise there are other ways of doing this, such as using a hashing function on top of a vanilla array/vector and incrementing the value there, but I was wondering if there was a more elegant way of solving this problem, like an STL component or function that would have a similar interface to Python's collections.Counter.
I know that, C++ being C++, I can't really expect such high-level concepts to always be implemented for me, but I was just wondering if you guys knew about anything (or at least your Googling skills are superior to mine) which could make my code a little nicer.
I'm not quite sure why an std::unordered_map (or just std::map) would involve much complexity. I'd write the code something like this:
std::unordered_map<std::string, int> words;
std::string word;
while (input >> word) // input: any std::istream, e.g. std::cin or a std::ifstream
++words[word];
There's no need for any kind of find/erase/reinsert.
In case it's not clear how/why this works: operator[] will create an entry for a value if none exists yet in the map. The associated value will be a value-initialized object of the specified type, which will be zero in the case of an int (or similar). We then increment that every time we encounter the word.
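Here is a small self-contained sketch of that idea, assuming the text is already in a std::string and that words are simply whitespace-separated (no punctuation handling). The function name count_words is made up for this example.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

// Counts whitespace-separated words in `text`. operator[] value-initializes
// a missing entry to 0, so ++counts[word] needs no find/erase/insert.
inline std::unordered_map<std::string, int> count_words(const std::string& text) {
    std::unordered_map<std::string, int> counts;
    std::istringstream input(text);
    std::string word;
    while (input >> word)
        ++counts[word];
    return counts;
}
```

Note that looking a word up with operator[] also inserts it if it is missing, so use count() or find() when you only want to test for presence.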
An alternative solution:
std::multiset<std::string> m;
for (const auto& w : words) m.insert(w); // words: any range of std::string
m.count("some word");
The advantage is that you don't have to rely on the 'trick' with operator[], making the code more readable.
EDIT: As Kerrek pointed out in the comments, this solution is slower. multiset stores all the elements you insert, even if they compare as equivalent (they might still differ in some aspect that the ordering comparison does not check). This causes a significant overhead compared to unordered_map<std::string, int>, which only has to store each word once.
(As a side note, processing the complete works of William Shakespeare using the map solution takes about 0.33s on my machine, as opposed to 0.78s for the multiset solution.)
Consider a std::vector v of N elements, and consider that the first n elements have already been sorted, with n < N and where (N-n)/N is very small:
Is there a clever way using the STL algorithms to sort this vector more rapidly than with a complete std::sort(std::begin(v), std::end(v)) ?
EDIT: a clarification: the (N-n) unsorted elements should be inserted at the right position within the n first elements already sorted.
EDIT2: bonus question: and how to find n ? (which corresponds to the first unsorted element)
Sort the other range only, and then use std::inplace_merge.
void foo( std::vector<int> & tab, int n ) {
std::sort( begin(tab)+n, end(tab));
std::inplace_merge(begin(tab), begin(tab)+n, end(tab));
}
For EDIT2 (finding n):
auto it = std::adjacent_find(begin(tab), end(tab), std::greater<int>() );
if (it!=end(tab)) {
it++;
std::sort( it, end(tab));
std::inplace_merge(begin(tab), it, end(tab));
}
The optimal solution would be to sort the tail portion independently and then perform in-place merge, as described here
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.5750
The algorithm is quite convoluted and is usually regarded as "not worth the effort".
Of course, with C++ you can use readily available std::inplace_merge. However, the name of that algorithm is highly misleading. Firstly, there's no guarantee that std::inplace_merge actually works in-place. And when it is actually in-place, there's no guarantee that it is not implemented as a full-blown sort. In practice it boils down to trying it and seeing whether it is good enough for your purposes.
But if you really want to make it in-place and formally more efficient than a full sort, then you will have to implement it manually. STL might help with a few utility algorithms, but it does not offer any solid solutions of "just a few calls to standard functions" kind.
Using insertion sort on the last N - n elements:
template <typename IT>
void mysort(IT begin, IT end) {
for (IT it = std::is_sorted_until(begin, end); it != end; ++it) {
IT insertPos = std::lower_bound(begin, it, *it);
IT endRotate = it;
std::rotate(insertPos, it, ++endRotate);
}
}
The Timsort sorting algorithm is a hybrid algorithm developed by Pythonista Tim Peters. It makes optimal use of already-sorted subsegments anywhere inside the array, including in the beginning. Although you may find a faster algorithm if you know for sure that in particular the first n elements are already sorted, this algorithm should be useful for the overall class of problems involved. Wikipedia describes it as:
The algorithm finds subsets of the data that are already ordered, and uses that knowledge to sort the remainder more efficiently.
In Tim Peters' own words,
It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.
Full details are described in this undated text document by Tim Peters. The examples are in Python, but Python should be quite readable even to people not familiar with its syntax.
Use std::partition_point (or is_sorted_until) to find n. Then, if N - n is small, do an insertion sort (linear search + std::rotate).
I assume that your question has two aims:
improve runtime (using a clever way)
with little effort (restricting to STL)
Considering these aims, I'd strongly recommend against this specific optimization, unless you are sure that the effort is worth the benefit.
As far as I remember, std::sort() is typically implemented as introsort, a quicksort variant, and running it on presorted input is almost as fast as determining whether, or how much of, the input is already sorted.
Instead of meddling with std::sort you can try changing the data structure to a sorted/prioritized queue.
Python's itertools implements a chain iterator which essentially concatenates a number of different iterators to provide everything from a single iterator.
Is there something similar in C++ ? A quick look at the boost libraries didn't reveal something similar, which is quite surprising to me. Is it difficult to implement this functionality?
Came across this question while investigating for a similar problem.
Even if the question is old, now in the time of C++ 11 and boost 1.54 it is pretty easy to do using the Boost.Range library. It features a join-function, which can join two ranges into a single one. Here you might incur performance penalties, as the lowest common range concept (i.e. Single Pass Range or Forward Range etc.) is used as new range's category and during the iteration the iterator might be checked if it needs to jump over to the new range, but your code can be easily written like:
#include <boost/range/join.hpp>
#include <iostream>
#include <vector>
#include <deque>
int main()
{
std::deque<int> deq = {0,1,2,3,4};
std::vector<int> vec = {5,6,7,8,9};
for(auto i : boost::join(deq,vec))
std::cout << "i is: " << i << std::endl;
return 0;
}
In C++, an iterator usually doesn't make sense outside of the context of the begin and end of a range. The iterator itself doesn't know where the start and the end are. So in order to do something like this, you instead need to chain together ranges of iterators, where a range is a (start, end) pair of iterators.
Take a look at the boost::range documentation. It may provide tools for constructing a chain of ranges. The one difference is that they will have to be the same type and return the same type of iterator. It may be possible to make this more generic, chaining together different types of ranges with something like any_iterator, but maybe not.
I've written one before (actually, just to chain two pairs of iterators together). It's not that hard, especially if you use boost's iterator_facade.
Making an input iterator (which is effectively what Python's chain does) is an easy first step. Finding the correct category for an iterator chaining a combination of different iterator categories is left as an exercise for the reader ;-).
Check the Views Template Library (VTL). It may not provide a 'chained iterator' directly, but I think it has all the necessary tools/templates available for implementing your own 'chained iterator'.
From the VTL Page:
A view is a container adaptor, that provides a container interface to
parts of the data or
a rearrangement of the data or
transformed data or
a suitable combination of the data sets
of the underlying container(s). Since views themselves provide the container interface, they can be easily combined and stacked. Because of template trickery, views can adapt their interface to the underlying container(s). More sophisticated template trickery makes this powerful feature easy to use.
Compared with smart iterators, views are just smart iterator factories.
What you are essentially looking for is a facade iterator that abstracts away the traversing through several sequences.
Since you are coming from a python background I'll assume that you care more about flexibility rather than speed. By flexibility I mean the ability to chain-iterate through different sequence types together (vector, array, linked list, set etc....) and by speed I mean only allocating memory from the stack.
If this is the case then you may want to look at the any_iterator from adobe labs:
http://stlab.adobe.com/classadobe_1_1any__iterator.html
This iterator will give you the ability to iterate through any sequence type at runtime. To chain you would have a vector (or array) of 3-tuple any_iterators, that is, three any_iterators for each range you chain together (you need three to iterate forward or backward, if you just want to iterate forward two will suffice).
Let's say that you wanted to chain-iterate through a sequence of integers:
(Untested pseudo-C++ code)
typedef adobe::any_iterator AnyIntIter;
struct AnyRange {
AnyIntIter begin;
AnyIntIter curr;
AnyIntIter end;
};
You could define a range such as:
int int_array[] = {1, 2, 3, 4};
AnyRange sequence_0 = {int_array, int_array, int_array + ARRAYSIZE(int_array)};
Your RangeIterator class would then have an std::vector<AnyRange>.
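class RangeIterator {
public:
    RangeIterator() : curr_range_index(0) {}
    template <typename Container>
    void AddAnyRange(Container& c) {
        AnyRange any_range = { c.begin(), c.begin(), c.end() };
        ranges.push_back(any_range);
    }
    // Here's what the operator++() looks like, everything else omitted.
    // It steps to the next element, moving on to the next range when the
    // current one is exhausted, and returns the element it lands on.
    int operator++() {
        while (curr_range_index < ranges.size()) {
            AnyRange& curr_range = ranges[curr_range_index];
            if (curr_range.curr != curr_range.end) {
                ++curr_range.curr;
                if (curr_range.curr != curr_range.end) {
                    return *curr_range.curr;
                }
            }
            // Current range is exhausted: try the start of the next one.
            ++curr_range_index;
            if (curr_range_index < ranges.size() &&
                ranges[curr_range_index].curr != ranges[curr_range_index].end) {
                return *ranges[curr_range_index].curr;
            }
        }
        assert(false && "iterated too far");
        return 0;
    }
private:
    std::vector<AnyRange> ranges;
    std::size_t curr_range_index;
};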
I do want to note, however, that this solution is very slow. The better, more C++-like approach is just to store pointers to all the objects that you want to operate on and iterate through those. Alternatively, you can apply a functor or a visitor to your ranges.
Not in the standard library. Boost might have something.
But really, such a thing should be trivial to implement. Just make yourself an iterator with a vector of iterators as a member. Some very simple code for operator++, and you're there.
No functionality exists in boost that implements this, to the best of my knowledge - I did a pretty extensive search.
I thought I'd implement this easily last week, but I ran into a snag: the STL that comes with Visual Studio 2008, when range checking is on, doesn't allow comparing iterators from different containers (i.e., you can't compare somevec1.end() with somevec2.end() ). All of a sudden it became much harder to implement this and I haven't quite decided yet on how to do it.
I wrote other iterators in the past using iterator_facade and iterator_adaptor from Boost, which are better than writing 'raw' iterators, but I still find writing custom iterators in C++ rather messy.
If someone can post some pseudocode on how this could be done /without/ comparing iterators from different containers, I'd be much obliged.
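One possible way around that restriction, sketched here under the assumption that all chained ranges share one iterator type: keep a (current, end) pair per range plus an index, so that an iterator is only ever compared with the end of its own container. ChainCursor and its members are hypothetical names, and this is a cursor-style sketch rather than a full STL-conforming iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical helper: walks several same-typed ranges back to back.
template <typename It>
class ChainCursor {
public:
    typedef typename std::iterator_traits<It>::reference reference;

    // Append a [first, last) range to the chain.
    void add(It first, It last) {
        ranges_.push_back(std::make_pair(first, last));
        skip_empty();
    }

    // True once every range has been exhausted.
    bool done() const { return index_ == ranges_.size(); }

    reference operator*() const {
        assert(!done());
        return *ranges_[index_].first;
    }

    ChainCursor& operator++() {
        assert(!done());
        ++ranges_[index_].first;
        skip_empty();
        return *this;
    }

private:
    // Each comparison is between two iterators of the same range.
    void skip_empty() {
        while (index_ < ranges_.size() &&
               ranges_[index_].first == ranges_[index_].second)
            ++index_;
    }

    std::vector<std::pair<It, It> > ranges_;
    std::size_t index_ = 0;
};
```

End-of-chain is detected through done() instead of by comparing against an end iterator from another container, which is the comparison the checked STL rejects.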
I will be parsing 60GB of text and doing a lot of inserts and lookups in maps.
I just started using boost::unordered_set and boost::unordered_map.
As my program fills these containers they grow bigger and bigger, and I was wondering whether it would be a good idea to preallocate memory for them.
Something like
mymap::get_allocator().allocate(N); ?
Or should I just leave them to allocate and figure out growth factors by themselves?
The code looks like this:
boost::unordered_map <string,long> words_vs_frequency, wordpair_vs_frequency;
boost::unordered_map <string,float> word_vs_probability, wordpair_vs_probability,
wordpair_vs_MI;
//... ... ...
N = words_vs_frequency.size();
long y =0; float MIWij =0.0f, maxMI=-999999.0f;
for (boost::unordered_map <string,long>::iterator i=wordpair_vs_frequency.begin();
i!=wordpair_vs_frequency.end(); ++i){
if (i->second >= BIGRAM_OCCURANCE_THRESHOLD)
{
y++;
Wij = i->first;
WordPairToWords(Wij, Wi,Wj);
MIWij = log ( wordpair_vs_probability[Wij] /
(word_vs_probability[Wi] * word_vs_probability[Wj])
);
// keeping only the pairs which MI value greater than
if (MIWij > MUTUAL_INFORMATION_THRESHOLD)
wordpair_vs_MI[ Wij ] = MIWij;
if(MIWij > maxMI )
maxMI = MIWij;
}
}
Thanks in advance
According to the documentation, both unordered_set and unordered_map have a method
void rehash(size_type n);
that regenerates the hashtable so that it contains at least n buckets. (It sounds like it does what reserve() does for STL containers).
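As a hedged illustration, here is the same idea with the C++11 std::unordered_map, swapped in so the sketch does not depend on Boost. The standard container has both rehash(n) (a bucket count) and reserve(n) (an element count). The function name preallocate is invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Asks the table to make room for `expected_words` entries up front, so
// that filling it to that size should not trigger intermediate rehashes.
inline void preallocate(std::unordered_map<std::string, long>& table,
                        std::size_t expected_words) {
    table.reserve(expected_words);
}
```

Whether this actually helps with a 60GB input still needs to be measured. It only removes the cost of repeated rehashing, not the cost of allocating the individual nodes.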
I would try it both ways, which will let you generate hard data showing whether one method works better than the other. We can speculate all day about which method will be optimal, but as with most performance questions, the best thing to do is try it out and see what happens (and then fix the parts that actually need fixing).
That being said, the Boost authors seem to be very smart, so it quite possibly will work fine as-is. You'll just have to test and see.
Honestly, I think you would be best off writing your own allocator. You could, for instance, make an allocator with a method called preallocate(int N) which would reserve N bytes, then use unordered_map::get_allocator() for all your fun. In addition, with your own allocator, you could tell it to grab huge chunks at a time.
I have a list of about a hundred unique strings in C++. I need to check if a value exists in this list, preferably lightning fast.
I am currently using a hash_set with std::strings (since I could not get it to work with const char*) like so:
stdext::hash_set<const std::string> _items;
_items.insert("LONG_NAME_A_WITH_SOMETHING");
_items.insert("LONG_NAME_A_WITH_SOMETHING_ELSE");
_items.insert("SHORTER_NAME");
_items.insert("SHORTER_NAME_SPECIAL");
stdext::hash_set<const std::string>::const_iterator it = _items.find( "SHORTER_NAME" );
if( it != _items.end() ) {
std::cout << "item exists" << std::endl;
}
Does anybody else have a good idea for a faster search method without building a complete hashtable myself?
The list is a fixed list of strings which will not change. It contains a list of names of elements which are affected by a certain bug and should be repaired on-the-fly when opened with a newer version.
I've built hashtables before using Aho-Corasick but I'm not really willing to add too much complexity.
I was amazed by the number of answers. I ended up testing a few methods for their performance and ended up using a combination of kirkus and Rob K.'s answers. I had tried a binary search before but I guess I had a small bug implementing it (how hard can it be...).
The results were shocking... I thought I had a fast implementation using a hash_set... well, it turns out I did not. Here are some statistics (and the eventual code):
Random lookup of 5 existing keys and 1 non-existent key, 50,000 times
My original algorithm took on average 18.62 seconds
A linear search took on average 2.49 seconds
A binary search took on average 0.92 seconds.
A search using a perfect hashtable generated by gperf took on average 0.51 seconds.
Here's the code I use now:
bool searchWithBinaryLookup(const std::string& strKey) {
static const char* const arrAffectedSymbols[NUM_ITEMS] = { /* list of items, sorted ascending */ };
/* Binary lookup */
int low, mid, high;
low = 0;
high = NUM_ITEMS;
while( low < high ) {
mid = (low + high) / 2;
if(arrAffectedSymbols[mid] > strKey) {
high = mid;
}
else if(arrAffectedSymbols[mid] < strKey) {
low = mid + 1;
}
else {
return true;
}
}
return false;
}
NOTE: This is Microsoft VC++ so I'm not using the std::hash_set from SGI.
I did some tests this morning using gperf as VardhanDotNet suggested and this is quite a bit faster indeed.
If your list of strings is fixed at compile time, use gperf
http://www.gnu.org/software/gperf/
QUOTE:
gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
The output of gperf is not governed by the GPL or LGPL, AFAIK.
You could try a PATRICIA Trie if none of the standard containers meet your needs.
Worst-case lookup is bounded by the length of the string you're looking up. Also, strings share common prefixes, so it is really easy on memory. So if you have lots of relatively short strings, this could be beneficial.
Check it out here.
Note: PATRICIA = Practical Algorithm to Retrieve Information Coded in Alphanumeric
If it's a fixed list, sort the list and do a binary search? I can't imagine, with only a hundred or so strings on a modern CPU, you're really going to see any appreciable difference between algorithms, unless your application is doing nothing but searching said list 100% of the time.
What's wrong with std::vector? Load it, sort(v.begin(), v.end()) once and then use lower_bound() to see if the string is in the vector. lower_bound is guaranteed to be O(log2 N) on a sorted random access iterator. I can't understand the need for a hash if the values are fixed. A vector takes less room in memory than a hash and makes fewer allocations.
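A minimal sketch of that approach, assuming C++11: sort once, then check membership with std::lower_bound. The class name SortedStringSet and its contains method are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fixed-set wrapper: sorts once at construction, then answers
// membership queries with a binary search.
class SortedStringSet {
public:
    explicit SortedStringSet(std::vector<std::string> items)
        : items_(std::move(items)) {
        std::sort(items_.begin(), items_.end());
    }

    bool contains(const std::string& key) const {
        // lower_bound finds the first element not less than key, so the
        // key is present only if that element exists and equals it.
        std::vector<std::string>::const_iterator it =
            std::lower_bound(items_.begin(), items_.end(), key);
        return it != items_.end() && *it == key;
    }

private:
    std::vector<std::string> items_;
};
```

std::binary_search would do the same membership test in one call. lower_bound is shown because it is the function named above.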
I doubt you'd come up with a better hashtable; if the list varies from time to time you've probably got the best way.
The fastest way would be to construct a finite state machine to scan the input. I'm not sure what the best modern tools are (it's been over ten years since I did anything like this in practice), but Lex/Flex was the standard Unix constructor.
A FSM has a table of states, and a list of accepting states. It starts in the beginning state, and does a character-by-character scan of the input. Each state has an entry for each possible input character. The entry could either be to go into another state, or to abort because the string isn't in the list. If the FSM gets to the end of the input string without aborting, it checks the final state it's in, which is either an accepting state (in which case you've matched the string) or it isn't (in which case you haven't).
Any book on compilers should have more detail, or you can doubtless find more information on the web.
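To make the state-table idea concrete, here is a small hand-built sketch. It is shaped like a trie and uses a std::map per state for the transitions, which is an assumption made for brevity and not what Lex/Flex would generate. StringMatcher, add and matches are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical table-driven matcher: states are trie nodes, state 0 is the
// start state, and a missing transition means "abort, not in the list".
class StringMatcher {
public:
    StringMatcher() : states_(1) {}

    // Adds the transitions needed to accept `word`.
    void add(const std::string& word) {
        std::size_t state = 0;
        for (std::size_t i = 0; i < word.size(); ++i) {
            std::map<char, std::size_t>::iterator next =
                states_[state].next.find(word[i]);
            if (next == states_[state].next.end()) {
                states_.push_back(State());
                states_[state].next[word[i]] = states_.size() - 1;
                state = states_.size() - 1;
            } else {
                state = next->second;
            }
        }
        states_[state].accepting = true;
    }

    // Scans `input` character by character; true only if the scan ends
    // in an accepting state.
    bool matches(const std::string& input) const {
        std::size_t state = 0;
        for (std::size_t i = 0; i < input.size(); ++i) {
            std::map<char, std::size_t>::const_iterator next =
                states_[state].next.find(input[i]);
            if (next == states_[state].next.end()) return false;
            state = next->second;
        }
        return states_[state].accepting;
    }

private:
    struct State {
        State() : accepting(false) {}
        std::map<char, std::size_t> next;
        bool accepting;
    };
    std::vector<State> states_;
};
```

A generated scanner would normally replace the per-state map with a flat array indexed by character, which is where most of the speed comes from.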
If the set of strings to check for numbers in the hundreds as you say, and this is when doing I/O (loading a file, which I assume comes from a disk, commonly), then I'd say: profile what you've got, before looking for more exotic/complex solutions.
Of course, it could be that your "documents" contain hundreds of millions of these strings, in which case I guess it really starts to take time ... Without more detail, it's hard to say for sure.
What I'm saying boils down to "consider the use-case and typical scenarios, before (over)optimizing", which I guess is just a specialization of that old thing about roots of evil ... :)
100 unique strings? If this isn't called frequently, and the list doesn't change dynamically, I'd probably use a straightforward const char array with a linear search. Unless you search it a lot, something that small just isn't worth the extra code. Something like this:
const char _items[][MAX_ITEM_LEN] = { ... };
int i = 0;
for (; i < NUM_ITEMS && strcmp( a, _items[i] ) > 0; ++i ); // assumes _items is sorted ascending; bounds check comes first
bool found = i < NUM_ITEMS && strcmp( a, _items[i] ) == 0;
For a list that small, I think your implementation and maintenance costs with anything more complex would probably outweigh the run time costs, and you're not really going to get cheaper space costs than this. To gain a little more speed, you could do a hash table of first char -> list index to set the initial value of i;
For a list this small, you probably won't get much faster.
You're using binary search, which is O(log(n)). You should look at interpolation search, which is not as good in the worst case, but its average case is better: O(log(log(n))).
I don't know which kind of hashing function MS uses for strings, but maybe you could come up with something simpler (= faster) that works in your special case. The container should allow you to use a custom hashing class.
If it's an implementation issue of the container, you can also try whether Boost's std::tr1::unordered_set gives better results.
A hash table is a good solution, and by using a pre-existing implementation you are likely to get good performance. An alternative, though, is what I believe is called "indexing".
Keep some pointers around to convenient locations. E.g. if it's using letters for the sorting, keep a pointer to everything starting aa, ab, ac... ba, bc, bd... This is a few hundred pointers, but it means that you can skip to a part of the list which is quite near to the result before continuing. E.g. if an entry is "afunctionname" then you can binary search between the pointers for af and ag, much faster than searching the whole lot... If you have a million records in total you will likely only have to binary search a list of a few thousand.
I re-invented this particular wheel, but there may be plenty of implementations out there already, which will save you the headache of implementing it and are likely faster than any code I could paste in here. :)