Using boost::iostreams::mapped_file_source with std::multimap - c++

I have a rather large amount of data to analyse - each file is about 5 GB. Each file is of the following format:
xxxxx yyyyy
Both key and value can repeat, but the keys are sorted in increasing order. I'm trying to use a memory mapped file for this purpose and then find the required keys and work with them. This is what I've written:
if (data_file != "")
{
clock_start = clock();
data_file_mapped.open(data_file);
data_multimap = (std::multimap<double, unsigned int> *)data_file_mapped.data();
if (data_multimap != NULL)
{
std::multimap<double, unsigned int>::iterator it = data_multimap->find(keys_to_find[4]);
if (it != data_multimap->end())
{
std::cout << "Element found.";
for (std::multimap<double, unsigned int>::iterator it = data_multimap->lower_bound(keys_to_find[4]); it != data_multimap->upper_bound(keys_to_find[5]); ++it)
{
std::cout << it->second;
}
std::cout << "\n";
clock_end = clock();
std::cout << "Time taken to read in the file: " << (clock_end - clock_start)/CLOCKS_PER_SEC << "\n";
}
else
std::cerr << "Element not found at all" << "\n";
}
else
std::cerr << "Nope - no data received."<< "\n";
}
Basically, I need to locate ranges of keys and pull those chunks out to work on. I get a segfault the first time I try to use a method on the multimap. For example, when the find method is called. I tried the upper_bound, lower_bound and other methods too, and still get a segfault.
This is what gdb gives me:
Program received signal SIGSEGV, Segmentation fault.
_M_lower_bound (this=<optimized out>, __k=<optimized out>, __y=<optimized out>, __x=0xa31202030303833) at /usr/include/c++/4.9.2/bits/stl_tree.h:1261
1261 if (!_M_impl._M_key_compare(_S_key(__x), __k))
Could someone please point out what I'm doing wrong? I've only been able to find simplistic examples on memory mapped files - nothing like this yet. Thanks.
EDIT: More information:
The file I described above is basically a two column plain text file which a neural simulator gives me as the output of my simulations. It's simple like this:
$ du -hsc 201501271755.e.ras
4.9G 201501271755.e.ras
4.9G total
$ head 201501271755.e.ras
0.013800 0
0.013800 1
0.013800 10
0.013800 11
0.013800 12
0.013800 13
0.013800 14
0.013800 15
0.013800 16
0.013800 17
The first column is time, the second column is the neurons that fired at this time - (it's a spike time raster file). Actually, the output is a file like this from each MPI rank that is being used to run the simulation. The various files have been combined to this master file using sort -g -m. More information on the file format is here: http://www.fzenke.net/auryn/doku.php?id=manual:ras
To calculate the firing rate and other metrics of the neuron set at certain times of the simulation, I need to locate a time in the file, pull out the chunk between [time - 1, time], and run some metrics on that chunk. This file is quite small, and I expect the size to increase quite a bit as my simulations get more complicated and run for longer time periods. That is why I began looking into memory mapped files. I hope that clarifies the problem statement somewhat. I only need to read the output file to process the information it contains; I do not need to modify this file at all.
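In multimap terms, the query I'm after would look roughly like this (just a sketch, assuming the data were already sitting in an ordinary in-memory std::multimap; data and t are placeholder names):
std::multimap<double, unsigned int> data;   // pretend the file's contents were loaded here
double t = 0.5;                             // some query time
auto first = data.lower_bound(t - 1.0);     // first spike at or after time - 1
auto last  = data.upper_bound(t);           // one past the last spike at or before time
for (auto it = first; it != last; ++it)
{
    // it->second is a neuron that fired within [t - 1, t]
}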
To process the data, I'll use multiple threads again, but since all my operations on the map are read-only, I don't expect to run into trouble there.

Multi maps aren't laid out sequentially in memory. (They're node-based containers, but I digress). In fact, even if they were, chances would be slim that the layout would match that of the text input.
There are basically two ways you can make this work:
Keep using the multimap but use a custom allocator (so that all allocations are done in the mapped memory region). This is the "nicest" approach from a high-level C++ viewpoint, but you will need to change your file to a binary format.
If you can, this is what I'd suggest. Boost Container + Boost Interprocess have everything you need to make this relatively painless.
You write a custom container "abstraction" that works directly on the mapped data. You could either
recognize a "xxxx yyyy" pair from anywhere (line ends?) or
build an index of all line starts in the file.
Using these you can devise an iterator (Boost Iterator's iterator_facade) that you can use to implement higher-level operations (lower_bound, upper_bound and equal_range).
Once you have these, you're basically all set to query this memory map as a readonly key-value database.
Sadly, this kind of memory representation would be extremely bad for performance if you also want to support mutating operations (insert, remove).
If you have an actual sample of the file, I could do a demonstration of either of the approaches described.
Update
Quick Samples:
With boost::interprocess you can (very) simply define the multimap you desire:
namespace shared {
    namespace bc = boost::container;

    template <typename T> using allocator = bip::allocator<T, bip::managed_mapped_file::segment_manager>;

    template <typename K, typename V>
    using multi_map = bc::flat_multimap<
        K, V, std::less<K>,
        allocator<typename bc::flat_multimap<K, V>::value_type> >;
}
Notes:
I chose a flatmap (flat_multimap, actually) because it is likely more storage efficient, and is much more comparable to the second approach (given below);
Note that this choice affects iterator/reference stability and heavily favours read-only operations. If you need iterator stability and/or many mutating operations, use a regular map (or, for very high volumes, a hash_map) instead of the flat variants.
I chose a managed_mapped_file segment for this demonstration (so you get persistence). The demo shows how 10G is sparsely pre-allocated, but only the space actually allocated is used on disk. You could equally well use a managed_shared_memory.
If you have binary persistence, you might discard the text datafile altogether.
I parse the text data into a shared::multi_map<double, unsigned> from a mapped_file_source using Boost Spirit. The implementation is fully generic.
There is no need to write iterator classes, start_of_line(), end_of_line(), lower_bound(), upper_bound(), equal_range() or any of those, since they're already standard in the multi_map interface, so all we need to do is write main:
Live On Coliru
#define NDEBUG
#undef DEBUG
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/container/flat_map.hpp>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>

namespace bip = boost::interprocess;
namespace qi  = boost::spirit::qi;

namespace shared {
    namespace bc = boost::container;

    template <typename T> using allocator = bip::allocator<T, bip::managed_mapped_file::segment_manager>;

    template <typename K, typename V>
    using multi_map = bc::flat_multimap<
        K, V, std::less<K>,
        allocator<typename bc::flat_multimap<K, V>::value_type> >;
}

#include <iostream>

bip::managed_mapped_file msm(bip::open_or_create, "lookup.bin", 10ul<<30);

template <typename K, typename V>
shared::multi_map<K,V>& get_or_load(const char* fname) {
    using Map = shared::multi_map<K, V>;
    Map* lookup = msm.find_or_construct<Map>("lookup")(msm.get_segment_manager());

    if (lookup->empty()) {
        // only read input file if not already loaded
        boost::iostreams::mapped_file_source input(fname);
        auto f(input.data()), l(f + input.size());

        bool ok = qi::phrase_parse(f, l,
                (qi::auto_ >> qi::auto_) % qi::eol >> *qi::eol,
                qi::blank, *lookup);

        if (!ok || (f!=l))
            throw std::runtime_error("Error during parsing at position #" + std::to_string(f - input.data()));
    }

    return *lookup;
}

int main() {
    // parse text file into shared memory binary representation
    auto const& lookup = get_or_load<double, unsigned int>("input.txt");

    auto const e = lookup.end();

    for(auto&& line : lookup)
    {
        std::cout << line.first << "\t" << line.second << "\n";

        auto er = lookup.equal_range(line.first);

        if (er.first != e)  std::cout << " lower: " << er.first->first  << "\t" << er.first->second  << "\n";
        if (er.second != e) std::cout << " upper: " << er.second->first << "\t" << er.second->second << "\n";
    }
}
I implemented it exactly as I described:
simple container over the raw const char* region mapped;
using boost::iterator_facade to make an iterator that parses the text on dereference;
for printing the input lines I use boost::string_ref - which avoids dynamic allocations for copying strings.
parsing is done with Spirit Qi:
if (!qi::phrase_parse(
b, _data.end,
qi::auto_ >> qi::auto_ >> qi::eoi,
qi::space,
_data.key, _data.value))
Qi was chosen for speed and genericity: you can choose the Key and Value types at instantiation time:
text_multi_lookup<double, unsigned int> tml(map.data(), map.data() + map.size());
I've implemented lower_bound, upper_bound and equal_range member functions that take advantage of the underlying contiguous storage. Even though the "line" iterator is not random-access but bidirectional, we can still jump to the mid_point of such an iterator range because we can get the start_of_line from any const char* into the underlying mapped region. This makes binary searching efficient.
Note that this solution parses lines on dereference of the iterator. This might not be efficient if the same lines are dereferenced a lot of times.
But, for infrequent lookups, or lookups that are not typical in the same region of the input data, this is about as efficient as it can possibly get (doing only minimum required parsing and O(log n) binary searching), all the while completely bypassing the initial load time by mapping the file instead (no access means nothing needs to be loaded).
Live On Coliru (including test data)
#define NDEBUG
#undef DEBUG
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/optional.hpp>
#include <boost/spirit/include/qi.hpp>
#include <thread>
#include <iomanip>
namespace io = boost::iostreams;
namespace qi = boost::spirit::qi;
template <typename Key, typename Value>
struct text_multi_lookup {
text_multi_lookup(char const* begin, char const* end)
: _map_begin(begin),
_map_end(end)
{
}
private:
friend struct iterator;
enum : char { nl = '\n' };
using rawit = char const*;
rawit _map_begin, _map_end;
rawit start_of_line(rawit it) const {
while (it > _map_begin) if (*--it == nl) return it+1;
assert(it == _map_begin);
return it;
}
rawit end_of_line(rawit it) const {
while (it < _map_end) if (*it++ == nl) return it;
assert(it == _map_end);
return it;
}
public:
struct value_type final {
rawit beg, end;
Key key;
Value value;
boost::string_ref str() const { return { beg, size_t(end-beg) }; }
};
struct iterator : boost::iterator_facade<iterator, boost::string_ref, boost::bidirectional_traversal_tag, value_type> {
iterator(text_multi_lookup const& d, rawit it) : _region(&d), _data { it, nullptr, Key{}, Value{} } {
assert(_data.beg == _region->start_of_line(_data.beg));
}
private:
friend text_multi_lookup;
text_multi_lookup const* _region;
value_type mutable _data;
void ensure_parsed() const {
if (!_data.end)
{
assert(_data.beg == _region->start_of_line(_data.beg));
auto b = _data.beg;
_data.end = _region->end_of_line(_data.beg);
if (!qi::phrase_parse(
b, _data.end,
qi::auto_ >> qi::auto_ >> qi::eoi,
qi::space,
_data.key, _data.value))
{
std::cerr << "Problem in: " << std::string(_data.beg, _data.end)
<< "at: " << std::setw(_data.end-_data.beg) << std::right << std::string(_data.beg,_data.end);
assert(false);
}
}
}
static iterator mid_point(iterator const& a, iterator const& b) {
assert(a._region == b._region);
return { *a._region, a._region->start_of_line(a._data.beg + (b._data.beg -a._data.beg)/2) };
}
public:
value_type const& dereference() const {
ensure_parsed();
return _data;
}
bool equal(iterator const& o) const {
return (_region == o._region) && (_data.beg == o._data.beg);
}
void increment() {
_data = { _region->end_of_line(_data.beg), nullptr, Key{}, Value{} };
assert(_data.beg == _region->start_of_line(_data.beg));
}
};
using const_iterator = iterator;
const_iterator begin() const { return { *this, _map_begin }; }
const_iterator end() const { return { *this, _map_end }; }
const_iterator cbegin() const { return { *this, _map_begin }; }
const_iterator cend() const { return { *this, _map_end }; }
template <typename CompatibleKey>
const_iterator lower_bound(CompatibleKey const& key) const {
auto f(begin()), l(end());
while (f!=l) {
auto m = iterator::mid_point(f,l);
if (m->key < key) {
f = m;
++f;
}
else {
l = m;
}
}
return f;
}
template <typename CompatibleKey>
const_iterator upper_bound(CompatibleKey const& key) const {
return upper_bound(key, begin());
}
private:
template <typename CompatibleKey>
const_iterator upper_bound(CompatibleKey const& key, const_iterator f) const {
auto l(end());
while (f!=l) {
auto m = iterator::mid_point(f,l);
if (key < m->key) {
l = m;
}
else {
f = m;
++f;
}
}
return f;
}
public:
template <typename CompatibleKey>
std::pair<const_iterator, const_iterator> equal_range(CompatibleKey const& key) const {
auto lb = lower_bound(key);
return { lb, upper_bound(key, lb) };
}
};
#include <iostream>
int main() {
io::mapped_file_source map("input.txt");
text_multi_lookup<double, unsigned int> tml(map.data(), map.data() + map.size());
auto const e = tml.end();
for(auto&& line : tml)
{
std::cout << line.str();
auto er = tml.equal_range(line.key);
if (er.first != e) std::cout << " lower: " << er.first->str();
if (er.second != e) std::cout << " upper: " << er.second->str();
}
}
For the curious: here's the disassembly. Note how all the algorithmic stuff is inlined right into main: http://paste.ubuntu.com/9946135/

data_multimap = (std::multimap<double, unsigned int> *)data_file_mapped.data(); - as far as I can tell from the Boost documentation, you have misunderstood that function. That cast will not work; you need to fill the multimap yourself from the char* provided by data().
Edit, to add a bit more detail: for example, after the mapping you can do
std::getline(data_file_mapped, oneString);
And after that, split the content of the line (you can use a stringstream for that task) and fill your multimap.
Repeat the process until the end of the file.
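A minimal sketch of that approach (my assumption: std::getline needs a real istream, so the mapped source is wrapped in a boost::iostreams::stream first; the function and variable names here are made up):
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/stream.hpp>
#include <map>
#include <sstream>
#include <string>

std::multimap<double, unsigned int> load_ras(const std::string& fname)
{
    boost::iostreams::mapped_file_source src(fname);
    boost::iostreams::stream<boost::iostreams::mapped_file_source> in(src); // istream view over the mapping

    std::multimap<double, unsigned int> result;
    std::string oneString;
    while (std::getline(in, oneString)) {
        std::istringstream line(oneString);
        double key;
        unsigned int value;
        if (line >> key >> value)
            result.emplace(key, value); // note: this copies everything into ordinary heap memory
    }
    return result;
}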

Related

Best way to calculate a running hash for an unordered_map?

I've got a simple wrapper-class for std::unordered_map that updates a running hash-code for the unordered_map's contents, as key-value pairs are added or removed; that way I never have to iterate over the entire contents to get the current hash code for the set. It does this by adding to the _hash member-variable whenever a new key-value pair is added, and subtracting from the _hash member-variable whenever an existing key-value pair is removed. This all works fine (but see the toy implementation below if you want a code-example of what I mean).
My only concern is that I suspect that simply adding and subtracting values from _hash might not be the optimal thing to do from the perspective of minimizing the likelihood of hash-value collisions. Is there a mathematically better way to compute the running-hash-code for the table, that would still preserve my ability to efficiently add/remove items from the table (i.e. without forcing me to iterate over the table to rebuild a hash code from scratch every time?)
#include <functional>
#include <unordered_map>
#include <string>
#include <iostream>

template<typename KeyType, typename ValueType> class UnorderedMapWithHashCode
{
public:
    UnorderedMapWithHashCode() : _hash(0) {/* empty */}

    void Clear() {_map.clear(); _hash = 0;}

    void Put(const KeyType & k, const ValueType & v)
    {
        Remove(k); // to deduct any existing value from _hash
        _hash += GetHashValueForPair(k, v);
        _map[k] = v;
    }

    void Remove(const KeyType & k)
    {
        if (_map.count(k) > 0)
        {
            _hash -= GetHashValueForPair(k, _map[k]);
            _map.erase(k);
        }
    }

    const std::unordered_map<KeyType, ValueType> & GetContents() const {return _map;}

    std::size_t GetHashCode() const {return _hash;}

private:
    std::size_t GetHashValueForPair(const KeyType & k, const ValueType & v) const
    {
        return std::hash<KeyType>()(k) + std::hash<ValueType>()(v);
    }

    std::unordered_map<KeyType, ValueType> _map;
    std::size_t _hash;
};

int main(int, char **)
{
    UnorderedMapWithHashCode<std::string, int> map;
    std::cout << "A: Hash is " << map.GetHashCode() << std::endl;
    map.Put("peanut butter", 5);
    std::cout << "B: Hash is " << map.GetHashCode() << std::endl;
    map.Put("jelly", 25);
    std::cout << "C: Hash is " << map.GetHashCode() << std::endl;
    map.Remove("peanut butter");
    std::cout << "D: Hash is " << map.GetHashCode() << std::endl;
    map.Remove("jelly");
    std::cout << "E: Hash is " << map.GetHashCode() << std::endl;
    return 0;
}
Your concept's perfectly fine, just the implementation could be improved:
you could take the hash functions to use as template arguments that default to the relevant std::hash instantiations. Note that for numbers it's common (GCC, Clang, Visual C++) for std::hash<> to be an identity hash, which is moderately collision prone. GCC and Clang mitigate that somewhat by having a prime number of buckets (vs. Visual C++'s power-of-2 choice), but you need to avoid having distinct key/value entries collide in the size_t hash-value space, rather than post-mod-bucket-count, so you would be better off using a meaningful hash function. Similarly, Visual C++'s std::string hash only incorporates 10 characters spaced along the string (so it's constant time), but if your key and value were both similar same-length long strings only differing in a few characters, that would be horribly collision prone too. GCC uses a proper hash function for strings - MURMUR32.
return std::hash<KeyType>()(k) + std::hash<ValueType>()(v); is a mediocre idea in general and an awful idea when using an identity hash function (e.g. h({k,v}) == k + v, so h({4,2}) == h({2,4}) == h({1,5}) etc.)
consider using something based on boost::hash_combine instead (assuming you do adopt the above advice to have template parameters provide the hash functions):
auto key_hash = KeyHashPolicy(key);
return (key_hash ^ ValueHashPolicy(value)) +
0x9e3779b9 + (key_hash << 6) + (key_hash >> 2);
you could dramatically improve the efficiency of your operations by avoiding unnecessary hash table lookups (your Put does 2-4 table lookups, and Remove does 1-3):
void Put(const KeyType& k, const ValueType& v)
{
    auto it = _map.find(k);
    if (it == _map.end()) {
        _map[k] = v;
    } else {
        if (it->second == v) return;
        _hash -= GetHashValueForPair(k, it->second);
        it->second = v;
    }
    _hash += GetHashValueForPair(k, v);
}

void Remove(const KeyType& k)
{
    auto it = _map.find(k);
    if (it == _map.end()) return;
    _hash -= GetHashValueForPair(k, it->second);
    _map.erase(it);
}
if you want to optimise further, you can create a version of GetHashValueForPair that takes the precomputed KeyHashPolicy(key) value as an argument, so you avoid hashing the key twice in Put.
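For illustration, a minimal sketch of that last point, reusing the KeyHashPolicy/ValueHashPolicy callables and the hash_combine-style mixing from the snippets above (the two-argument GetHashValueForPair overload is my own addition, not part of the original class):
std::size_t GetHashValueForPair(std::size_t key_hash, const ValueType& v) const
{
    return (key_hash ^ ValueHashPolicy(v)) +
           0x9e3779b9 + (key_hash << 6) + (key_hash >> 2);
}

void Put(const KeyType& k, const ValueType& v)
{
    const std::size_t key_hash = KeyHashPolicy(k); // hash the key exactly once per Put
    auto it = _map.find(k);
    if (it == _map.end()) {
        _map[k] = v;
    } else {
        if (it->second == v) return;
        _hash -= GetHashValueForPair(key_hash, it->second);
        it->second = v;
    }
    _hash += GetHashValueForPair(key_hash, v);
}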

Compare guarantees for insert with hint for ordered associative containers

I want to insert a new (unique) element into a known place (generally somewhere in the middle) of an ordered associative container std::set/std::multiset/std::map/std::multimap using insert (w/ hint) or emplace_hint.
During the insertion operation I am absolutely sure that the place to insert is right before the "hint" iterator. Generally I can compare any two non-neighbouring elements in the container, but this operation is heavyweight. To avoid the overhead imposed, I provide a custom comparator for the container, which contains references to pointers to both neighbouring elements (they always become known right before the insertion/emplacement operation).
#include <map>
#include <set>
static std::size_t counter = 0;
template< typename T >
struct less
{
T const * const & pl;
T const * const & pr;
bool operator () (T const & l, T const & r) const
{
if (&l == &r) {
return false;
}
if (pl) {
if (&l == pl) {
return true;
}
if (&r == pl) {
return false;
}
}
if (pr) {
if (&r == pr) {
return true;
}
if (&l == pr) {
return false;
}
}
++counter;
return l < r; // very expensive, it is desirable this line to be unrecheable
}
};
#include <iostream>
#include <algorithm>
#include <iterator>
#include <cassert>
int main()
{
using T = int;
T const * pl = nullptr;
T const * pr = nullptr;
less< T > less_{pl, pr};
std::set< T, less< T > > s{less_};
s.insert({1, 2,/* 3, */4, 5});
std::copy(std::cbegin(s), std::cend(s), std::ostream_iterator< T >(std::cout, " "));
std::cout << '\n';
auto const hint = s.find(4);
// now I want to insert 3 right before 4 (and, of course, after 2)
pl = &*std::prev(hint); // prepare comparator to make a cheap insertion
pr = &*hint;
// if hint == std::end(s), then pr = nullptr
// else if hint == std::begin(s), then pl = nullptr
// if I tried to insert w/o hint, then pl = pr = nullptr;
{
std::size_t const c = counter;
s.insert(hint, 3);
assert(counter == c);
}
std::copy(std::cbegin(s), std::cend(s), std::ostream_iterator< T >(std::cout, " "));
std::cout << '\n';
}
Current libc++/libstdc++ implementations allow me to use the described comparator, but is there undefined behaviour if I rely on their current behaviour? Can I rely on insert (w/ hint parameter) or emplace_hint (and the modern insert_or_assign/try_emplace w/ hint parameter for map/multimap) not touching any elements other than those pointed to by pl and pr? Is this an implementation-defined thing?
Why do I want this strange thing? IRL I tried to implement Fortune's algorithm to find the Voronoi diagram on the plane using the STL's native self-balancing binary search trees. std::set is used to store the current state of a part of the so-called beach line: a chain of sorted endpoints. When I add a new endpoint I always know the place to insert it right before the insertion. It would be best if I could add assert(false); before, or throw std::logic_error{};/__builtin_unreachable(); instead of, the last return in the comparator functor. I can only do that if there is a corresponding logical guarantee. Can I do this?

What container to store unique values?

I've got the following problem. I have a game which runs on average 60 frames per second. Each frame I need to store values in a container and there must be no duplicates.
It probably has to store fewer than 100 items per frame, but the number of insert calls will be a lot more (and many rejected, because the values have to be unique). Only at the end of the frame do I need to traverse the container. So about 60 traversals of the container per second, but a lot more insertions.
Keep in mind the items to store are simple integers.
There are a bunch of containers I can use for this but I cannot make up my mind what to pick. Performance is the key issue for this.
Some pros/cons that I've gathered:
vector
(PRO): Contiguous memory, a huge factor.
(PRO): Memory can be reserved first; very few allocations/deallocations afterwards
(CON): No alternative but to traverse the container (std::find) on each insert() to check uniqueness? The comparison is simple though (integers) and the whole container can probably fit in cache
set
(PRO): Simple, clearly meant for this
(CON): Not constant insert time
(CON): A lot of allocations/deallocations per frame
(CON): Not contiguous memory. Traversing a set of hundreds of objects means jumping around a lot in memory.
unordered_set
(PRO): Simple, clearly meant for this
(PRO): Average-case constant-time insert
(CON): Seeing as I store integers, the hash operation is probably a lot more expensive than anything else
(CON): A lot of allocations/deallocations per frame
(CON): Not contiguous memory. Traversing a set of hundreds of objects means jumping around a lot in memory.
I'm leaning toward the vector route because of the memory access patterns, even though set is clearly meant for this problem. The big issue that is unclear to me is whether traversing the vector on each insert is more costly than the allocations/deallocations and scattered memory lookups of set (especially considering how often this must be done).
I know ultimately it all comes down to profiling each case, but if nothing else, as a head start or just theoretically, what would probably be best in this scenario? Are there any pros/cons I might have missed as well?
EDIT: As I didn't mention it above: the container is cleared at the end of each frame.
I did timing with a few different methods that I thought were likely candidates. Using std::unordered_set was the winner.
Here are my results:
Using UnorderedSet: 0.078s
Using UnsortedVector: 0.193s
Using OrderedSet: 0.278s
Using SortedVector: 0.282s
Timing is based on the median of five runs for each case.
compiler: gcc version 4.9.1
flags: -std=c++11 -O2
OS: ubuntu 4.9.1
CPU: Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
Code:
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <numeric> // for std::accumulate used below
#include <random>
#include <set>
#include <unordered_set>
#include <vector>
using std::cerr;
static const size_t n_distinct = 100;
template <typename Engine>
static std::vector<int> randomInts(Engine &engine,size_t n)
{
auto distribution = std::uniform_int_distribution<int>(0,n_distinct);
auto generator = [&]{return distribution(engine);};
auto vec = std::vector<int>();
std::generate_n(std::back_inserter(vec),n,generator);
return vec;
}
struct UnsortedVectorSmallSet {
std::vector<int> values;
static const char *name() { return "UnsortedVector"; }
UnsortedVectorSmallSet() { values.reserve(n_distinct); }
void insert(int new_value)
{
auto iter = std::find(values.begin(),values.end(),new_value);
if (iter!=values.end()) return;
values.push_back(new_value);
}
};
struct SortedVectorSmallSet {
std::vector<int> values;
static const char *name() { return "SortedVector"; }
SortedVectorSmallSet() { values.reserve(n_distinct); }
void insert(int new_value)
{
auto iter = std::lower_bound(values.begin(),values.end(),new_value);
if (iter==values.end()) {
values.push_back(new_value);
return;
}
if (*iter==new_value) return;
values.insert(iter,new_value);
}
};
struct OrderedSetSmallSet {
std::set<int> values;
static const char *name() { return "OrderedSet"; }
void insert(int new_value) { values.insert(new_value); }
};
struct UnorderedSetSmallSet {
std::unordered_set<int> values;
static const char *name() { return "UnorderedSet"; }
void insert(int new_value) { values.insert(new_value); }
};
int main()
{
//using SmallSet = UnsortedVectorSmallSet;
//using SmallSet = SortedVectorSmallSet;
//using SmallSet = OrderedSetSmallSet;
using SmallSet = UnorderedSetSmallSet;
auto engine = std::default_random_engine();
std::vector<int> values_to_insert = randomInts(engine,10000000);
SmallSet small_set;
namespace chrono = std::chrono;
using chrono::system_clock;
auto start_time = system_clock::now();
for (auto value : values_to_insert) {
small_set.insert(value);
}
auto end_time = system_clock::now();
auto& result = small_set.values;
auto sum = std::accumulate(result.begin(),result.end(),0u);
auto elapsed_seconds = chrono::duration<float>(end_time-start_time).count();
cerr << "Using " << SmallSet::name() << ":\n";
cerr << " sum=" << sum << "\n";
cerr << " elapsed: " << elapsed_seconds << "s\n";
}
I'm going to put my neck on the block here and suggest that the vector route is probably most efficient when the size is 100 and the objects being stored are integral values. The simple reason for this is that set and unordered_set allocate memory for each insert, whereas the vector needn't allocate more than once.
You can increase search performance dramatically by keeping the vector ordered, since then all searches can be binary searches and therefore complete in log2N time.
The downside is that the inserts will take a tiny fraction longer due to the memory moves, but it sounds as if there will be many more searches than inserts, and moving (average) 50 contiguous memory words is an almost instantaneous operation.
Final word:
Write the correct logic now. Worry about performance when the users are complaining.
EDIT:
Because I couldn't help myself, here's a reasonably complete implementation:
template<typename T>
struct vector_set
{
using vec_type = std::vector<T>;
using const_iterator = typename vec_type::const_iterator;
using iterator = typename vec_type::iterator;
vector_set(size_t max_size)
: _max_size { max_size }
{
_v.reserve(_max_size);
}
/// @returns: pair of iterator, bool
/// If the value has been inserted, the bool will be true
/// the iterator will point to the value, or end if it wasn't
/// inserted due to space exhaustion
auto insert(const T& elem)
-> std::pair<iterator, bool>
{
if (_v.size() < _max_size) {
auto it = std::lower_bound(_v.begin(), _v.end(), elem);
if (_v.end() == it || *it != elem) {
return make_pair(_v.insert(it, elem), true);
}
return make_pair(it, false);
}
else {
return make_pair(_v.end(), false);
}
}
auto find(const T& elem) const
-> const_iterator
{
auto vend = _v.end();
auto it = std::lower_bound(_v.begin(), vend, elem);
if (it != vend && *it != elem)
it = vend;
return it;
}
bool contains(const T& elem) const {
return find(elem) != _v.end();
}
const_iterator begin() const {
return _v.begin();
}
const_iterator end() const {
return _v.end();
}
private:
vec_type _v;
size_t _max_size;
};
using namespace std;
BOOST_AUTO_TEST_CASE(play_unique_vector)
{
vector_set<int> v(100);
for (size_t i = 0 ; i < 1000000 ; ++i) {
v.insert(int(random() % 200));
}
cout << "unique integers:" << endl;
copy(begin(v), end(v), ostream_iterator<int>(cout, ","));
cout << endl;
cout << "contains 100: " << v.contains(100) << endl;
cout << "contains 101: " << v.contains(101) << endl;
cout << "contains 102: " << v.contains(102) << endl;
cout << "contains 103: " << v.contains(103) << endl;
}
As you said you have many insertions and only one traversal, I'd suggest using a vector and pushing the elements in regardless of whether they are already in the vector. This is done in O(1).
Only when you need to go through the vector do you sort it and remove the duplicate elements. I believe this can be done in O(n) since they are bounded integers.
EDIT: Sorting can be done in linear time with counting sort, as presented in this video. If that's not feasible, then you are back to O(n lg(n)).
You will have very few cache misses because of the contiguity of the vector in memory, and very few allocations (especially if you reserve enough memory in the vector).
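A minimal sketch of that idea (assuming, as the question states, that the container is cleared every frame; the names are placeholders):
#include <algorithm>
#include <vector>

std::vector<int> frame_values; // reused every frame

void record(int v) { frame_values.push_back(v); } // O(1), duplicates allowed for now

void end_of_frame() {
    std::sort(frame_values.begin(), frame_values.end());
    frame_values.erase(std::unique(frame_values.begin(), frame_values.end()),
                       frame_values.end());       // deduplicate in place
    // ... traverse the now-unique values here ...
    frame_values.clear();
}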

How to find the index of current object in range-based for loop?

Assume I have the following code:
vector<int> list;
for(auto& elem:list) {
int i = elem;
}
Can I find the position of elem in the vector without maintaining a separate iterator?
Yes you can, it just takes some massaging ;)
The trick is to use composition: instead of iterating over the container directly, you "zip" it with an index along the way.
Specialized zipper code:
template <typename T>
struct iterator_extractor { typedef typename T::iterator type; };

template <typename T>
struct iterator_extractor<T const> { typedef typename T::const_iterator type; };

template <typename T>
class Indexer {
public:
    class iterator {
        typedef typename iterator_extractor<T>::type inner_iterator;
        typedef typename std::iterator_traits<inner_iterator>::reference inner_reference;
    public:
        typedef std::pair<size_t, inner_reference> reference;

        iterator(inner_iterator it): _pos(0), _it(it) {}

        reference operator*() const { return reference(_pos, *_it); }

        iterator& operator++() { ++_pos; ++_it; return *this; }
        iterator operator++(int) { iterator tmp(*this); ++*this; return tmp; }

        bool operator==(iterator const& it) const { return _it == it._it; }
        bool operator!=(iterator const& it) const { return !(*this == it); }

    private:
        size_t _pos;
        inner_iterator _it;
    };

    Indexer(T& t): _container(t) {}

    iterator begin() const { return iterator(_container.begin()); }
    iterator end() const { return iterator(_container.end()); }

private:
    T& _container;
}; // class Indexer

template <typename T>
Indexer<T> index(T& t) { return Indexer<T>(t); }
And using it:
#include <iostream>
#include <iterator>
#include <limits>
#include <vector>

// Zipper code here

int main() {
    std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8, 9};

    for (auto p: index(v)) {
        std::cout << p.first << ": " << p.second << "\n";
    }
}
You can see it at ideone, though it lacks the for-range loop support so it's less pretty.
EDIT:
Just remembered that I should check Boost.Range more often. Unfortunately there is no zip range, but I did find a pearl: boost::adaptors::indexed. However, it requires access to the iterator to pull off the index. Shame :x
Otherwise with the counting_range and a generic zip I am sure it could be possible to do something interesting...
In the ideal world I would imagine:
int main() {
std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8, 9};
for (auto tuple: zip(iota(0), v)) {
std::cout << tuple.at<0>() << ": " << tuple.at<1>() << "\n";
}
}
With zip automatically creating a view as a range of tuples of references and iota(0) simply creating a "false" range that starts from 0 and just counts toward infinity (or well, the maximum of its type...).
jrok is right: range-based for loops are not designed for that purpose.
However, in your case it is possible to compute it using pointer arithmetic since vector stores its elements contiguously (*)
vector<int> list;
for(auto& elem:list) {
int i = elem;
int pos = &elem-&list[0]; // pos contains the position in the vector
// also a &-operator overload proof alternative (thanks to ildjarn) :
// int pos = addressof(elem)-addressof(list[0]);
}
But this is clearly bad practice since it obfuscates the code and makes it more fragile (it easily breaks if someone changes the container type, overloads the & operator, or replaces 'auto&' with 'auto' - good luck debugging that!).
NOTE: Contiguity is guaranteed for vector in C++03, and array and string in C++11 standard.
No, you can't (at least, not without effort). If you need the position of an element, you shouldn't use range-based for. Remember that it's just a convenience tool for the most common case: execute some code for each element. In the less-common circumstances where you need the position of the element, you have to use the less-convenient regular for loop.
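For reference, the regular loop this falls back to, in its minimal form:
vector<int> list;
for (size_t i = 0; i < list.size(); ++i) {
    int value = list[i]; // i is the position you were after
}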
Based on the answer from @Matthieu, there is a very elegant solution using the mentioned boost::adaptors::indexed:
std::vector<std::string> strings{10, "Hello"};
int main(){
strings[5] = "World";
for(auto const& el: strings | boost::adaptors::indexed(0))
std::cout << el.index() << ": " << el.value() << std::endl;
}
You can try it
This works pretty much like the "ideal world solution" mentioned above, has pretty syntax and is concise. Note that the type of el in this case is something like boost::foobar<const std::string&, int>, so it handles the reference there and no copying is performed. It is even incredibly efficient: https://godbolt.org/g/e4LMnJ (the code is equivalent to keeping your own counter variable, which is as good as it gets)
For completeness the alternatives:
size_t i = 0;
for(auto const& el: strings) {
std::cout << i << ": " << el << std::endl;
++i;
}
Or using the contiguous property of a vector:
for(auto const& el: strings) {
size_t i = &el - &strings.front();
std::cout << i << ": " << el << std::endl;
}
The first generates the same code as the boost adapter version (optimal) and the last is 1 instruction longer: https://godbolt.org/g/nEG8f9
Note: If you only want to know, if you have the last element you can use:
for(auto const& el: strings) {
bool isLast = &el == &strings.back();
std::cout << isLast << ": " << el << std::endl;
}
This works for every standard container, but auto&/auto const& must be used (same as above) - that is recommended anyway. Depending on the input this might also be pretty fast (especially when the compiler knows the size of your vector).
Replace the &foo by std::addressof(foo) to be on the safe side for generic code.
If you have a compiler with C++14 support you can do it in a functional style:
#include <iostream>
#include <string>
#include <vector>
#include <functional>
template<typename T>
void for_enum(T& container, std::function<void(int, typename T::value_type&)> op)
{
int idx = 0;
for(auto& value : container)
op(idx++, value);
}
int main()
{
std::vector<std::string> sv {"hi", "there"};
for_enum(sv, [](auto i, auto v) {
std::cout << i << " " << v << std::endl;
});
}
Works with clang 3.4 and gcc 4.9 (not with 4.8); for both you need to set -std=c++1y. The reason you need C++14 is the auto parameters in the lambda function.
If you insist on using range-based for and want to know the index, it is pretty trivial to maintain it yourself, as shown below.
I do not think there is a cleaner/simpler solution for range-based for loops. But really, why not use a standard for(;;)? That would probably make your intent and code the clearest.
vector<int> list;
int idx = 0;
for(auto& elem:list) {
int i = elem;
//TODO whatever made you want the idx
++idx;
}
There is a surprisingly simple way to do this
vector<int> list;
for(auto& elem:list) {
int i = (&elem-&*(list.begin()));
}
where i will be your required index.
This takes advantage of the fact that C++ vectors are always contiguous.
Here's a quite beautiful solution using c++20:
#include <array>
#include <iostream>
#include <ranges>
template<typename T>
struct EnumeratedElement {
std::size_t index;
T& element;
};
auto enumerate(std::ranges::range auto& range)
-> std::ranges::view auto
{
return range | std::views::transform(
[i = std::size_t{}](auto& element) mutable {
return EnumeratedElement{i++, element};
}
);
}
auto main() -> int {
auto const elements = std::array{3, 1, 4, 1, 5, 9, 2};
for (auto const [index, element] : enumerate(elements)) {
std::cout << "Element " << index << ": " << element << '\n';
}
}
The major features used here are c++20 ranges, c++20 concepts, c++11 mutable lambdas, c++14 lambda capture initializers, and c++17 structured bindings. Refer to cppreference.com for information on any of these topics.
Note that element in the structured binding is in fact a reference and not a copy of the element (not that it matters here). This is because any qualifiers around the auto only affect a temporary object that the fields are extracted from, and not the fields themselves.
The generated code is identical to the code generated by this (at least by gcc 10.2):
#include <array>
#include <iostream>
#include <ranges>
auto main() -> int {
auto const elements = std::array{3, 1, 4, 1, 5, 9, 2};
for (auto index = std::size_t{}; auto& element : elements) {
std::cout << "Element " << index << ": " << element << '\n';
index++;
}
}
Proof: https://godbolt.org/z/a5bfxz
I read from your comments that one reason you want to know the index is to know if the element is the first/last in the sequence. If so, you can do
for(auto& elem:list) {
// loop code ...
if(&elem == &*std::begin(list)){ ... special code for first element ... }
if(&elem == &*std::prev(std::end(list))){ ... special code for last element ... }
// if(&elem == &*std::rbegin(list)){... (C++14 only) special code for last element ...}
// loop code ...
}
EDIT: For example, this prints a container, skipping the separator after the last element. It works for most containers I can imagine (including arrays) (online demo: http://coliru.stacked-crooked.com/a/9bdce059abd87f91):
#include <iostream>
#include <vector>
#include <list>
#include <set>
using namespace std;
template<class Container>
void print(Container const& c){
for(auto& x:c){
std::cout << x;
if(&x != &*std::prev(std::end(c))) std::cout << ", "; // special code for last element
}
std::cout << std::endl;
}
int main() {
std::vector<double> v{1.,2.,3.};
print(v); // prints 1,2,3
std::list<double> l{1.,2.,3.};
print(l); // prints 1,2,3
std::initializer_list<double> i{1.,2.,3.};
print(i); // prints 1,2,3
std::set<double> s{1.,2.,3.};
print(s); // print 1,2,3
double a[3] = {1.,2.,3.}; // works for C-arrays as well
print(a); // print 1,2,3
}
Tobias Widlund wrote a nice MIT-licensed, Python-style, header-only enumerate (C++17 though):
GitHub
Blog Post
Really nice to use:
std::vector<int> my_vector {1,3,3,7};
for(auto [i, my_element] : en::enumerate(my_vector))
{
// do stuff
}
If you want to avoid having to write an auxiliary function while keeping the index variable local to the loop, you can use a lambda with a mutable variable:
int main() {
std::vector<char> values = {'a', 'b', 'c'};
std::for_each(begin(values), end(values), [i = size_t{}] (auto x) mutable {
std::cout << i << ' ' << x << '\n';
++i;
});
}
Here's a macro-based solution that probably beats most others on simplicity, compile time, and code generation quality:
#include <iostream>
#define fori(i, ...) if(size_t i = -1) for(__VA_ARGS__) if(i++, true)
int main() {
fori(i, auto const & x : {"hello", "world", "!"}) {
std::cout << i << " " << x << std::endl;
}
}
Result:
$ g++ -o enumerate enumerate.cpp -std=c++11 && ./enumerate
0 hello
1 world
2 !

Something about reading portions of data from file using a C++ istream_iterator

Target: there is a text file (on HDD) containing integers separated by some kind of delimiter.
Example:
5245
234224
6534
1234
I need to read them into an STL container.
int main(int argc, char * argv[]) {
using namespace std;
// 1. prepare the file stream
string fileName;
if (argc > 1)
fileName = argv[1];
else {
cout << "Provide the filename to read from: ";
cin >> fileName;
}
unique_ptr<ifstream, ifstream_deleter<ifstream>> ptrToStream(new ifstream(fileName, ios::out));
if (!ptrToStream->good()) {
cerr << "Error opening file " << fileName << endl;
return -1;
}
// 2. value by value reading will be too slow on large data so buffer data
typedef unsigned int values_type;
const int BUFFER_SIZE(4); // 4 is for testing purposes. 16MB or larger in real life
vector<values_type> numbersBuffer(BUFFER_SIZE);
numbersBuffer.insert(numbersBuffer.begin(), istream_iterator<values_type>(*ptrToStream), istream_iterator<values_type>());
// ...
The main drawback of this code: how can I handle the case when the file is extremely large, so that I cannot store all of its contents in memory?
I also do not want to use push_back, as it is inefficient in comparison to a range insert.
So, the question is: how can I read no more than BUFFER_SIZE elements from the file efficiently using the STL?
The approach to limiting reads from input iterators is to create a wrapper which counts the number of elements processed so far and whose end iterator compares against this count. Doing this generically isn't quite trivial, but doing it specifically for std::istream_iterator<T> shouldn't be too hard. That said, I think the easiest way to do it is this:
std::vector<T> buffer;
buffer.reserve(size);

std::istream_iterator<T> it(in), end; // istream_iterator (not istreambuf_iterator) for formatted extraction of T
for (typename std::vector<T>::size_type count(0), capacity(size);
     it != end && count != capacity; ++it, ++count) {
    buffer.push_back(*it);
}
I realize that you don't want to push_back() because it is allegedly slow. However, compared to the I/O operation I doubt that you'll be able to measure the small overhead, especially with typical implementations of the I/O library.
Just to round things off with an example of a wrapped iterator: below is an example of how a counting wrapper for std::istream_iterator<T> could look. There are many different ways this could be done; this is just one of them.
#include <iostream>
#include <iterator>
#include <vector>
#include <sstream>
template <typename T>
class counted_istream_iterator:
public std::iterator<std::input_iterator_tag, T, std::ptrdiff_t>
{
public:
explicit counted_istream_iterator(std::istream& in): count_(), it_(in) {}
explicit counted_istream_iterator(size_t count): count_(count), it_() {}
T const& operator*() { return *this->it_; }
T const* operator->() { return this->it_.operator->(); }
counted_istream_iterator& operator++() {
++this->count_; ++this->it_; return *this;
}
counted_istream_iterator operator++(int) {
counted_istream_iterator rc(*this); ++*this; return rc;
}
bool operator== (counted_istream_iterator const& other) const {
return this->count_ == other.count_ || this->it_ == other.it_;
}
bool operator!= (counted_istream_iterator const& other) const {
return !(*this == other);
}
private:
std::ptrdiff_t count_;
std::istream_iterator<T> it_;
};
void read(int count)
{
std::istringstream in("0 1 2 3 4 5 6 7 8 9");
std::vector<int> vec;
vec.insert(vec.end(), counted_istream_iterator<int>(in),
counted_istream_iterator<int>(count));
std::cout << "size=" << vec.size() << "\n";
}
int main()
{
read(4);
read(100);
}
Here is a possible way to solve my problem:
// 2. value by value reading will be too slow on large data so buffer data
typedef unsigned int values_type;
const int BUFFER_SIZE(4);
vector<values_type> numbersBuffer;
numbersBuffer.reserve(BUFFER_SIZE);
istream_iterator<values_type> begin(*ptrToStream), end;
while (begin != end) {
    numbersBuffer.clear(); // reuse the reserved capacity for each chunk
    copy_n(begin, BUFFER_SIZE, back_inserter(numbersBuffer)); // writing through begin() of an empty vector would be undefined behaviour
    for_each(numbersBuffer.begin(), numbersBuffer.end(), [](values_type const &val){ std::cout << val << std::endl; });
    ++begin;
}
But it has one drawback. If the input file contains the following:
8785
245245454545
7767
then 8785 will be read, but 245245454545 and 7767 will not, because 245245454545 cannot be converted to unsigned int. The error will be silent. :(
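One way to make that failure visible (a sketch, not part of the original code): when an istream_iterator stops early, the underlying stream's flags distinguish a conversion failure from a genuine end of file.
if (ptrToStream->fail() && !ptrToStream->eof()) {
    std::cerr << "Stopped early: the next token could not be parsed as an unsigned int" << std::endl;
}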