Need for a fast map between strings and integers - C++

I have a map from string to unsigned, in which I store each word and its frequency, in the following form:
map<string,unsigned> mapWordFrequency; //contains 1 billion such mappings
Then I read a huge file (100 GB) and retain only those words in the file whose frequency is greater than 1000. I check the frequency of each word in the file using mapWordFrequency[word] > 1000. However, since mapWordFrequency has 1 billion mappings and my file is huge, checking mapWordFrequency[word] > 1000 for every word in the file is very slow and takes more than 2 days. Can someone please suggest how I can improve the efficiency of this code?
The map does not fit in my RAM, and swapping is consuming a lot of time.
Would erasing all words having frequency < 1000 with map's erase function help?

I suggest you use an unordered_map as opposed to a map. As already discussed in the comments, the former will give you an average insertion/retrieval time of O(1), as opposed to O(log n) in a map.
As you have already said, memory swapping is consuming a lot of time. So how about tackling the problem incrementally: load as much of the data into an unordered_map as fits in memory, count it, write the partial result out, and continue. After one pass you will have several partial unordered_maps, and you can combine them in subsequent passes.
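A minimal sketch of that incremental idea, assuming whitespace-separated words; the chunk size and the partial_N.txt file names are made up here and would need tuning to your memory budget:
#include <fstream>
#include <string>
#include <unordered_map>

// Count words one chunk at a time, spilling partial counts to disk,
// then merge the partial counts in a later pass.
void countInChunks(const char* bigFile) {
    const std::size_t kMaxWordsPerChunk = 50000000; // tune to your RAM
    std::ifstream in(bigFile);
    std::unordered_map<std::string, unsigned> partial;
    std::string word;
    int chunk = 0;
    while (in >> word) {
        ++partial[word];
        if (partial.size() >= kMaxWordsPerChunk) { // chunk is full: spill it
            std::ofstream out("partial_" + std::to_string(chunk++) + ".txt");
            for (const auto& kv : partial)
                out << kv.first << ' ' << kv.second << '\n';
            partial.clear();
        }
    }
    // Spill the last chunk the same way, then merge the partial_*.txt files,
    // e.g. by summing counts per word across files in a second pass.
}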
You could improve the speed by doing this in a distributed manner: process the pieces of data on different computers and then combine the results (which will be in the form of unordered maps). However, I have no prior experience in distributed computing, so I can't help beyond this.
Also, if implementing something like this is too cumbersome, I suggest you use external merge sort. It is a method of sorting a file too large to fit into memory by sorting smaller chunks and combining them. The reason I'm suggesting this is that external merge sort is a pretty common technique, and you might find already-implemented solutions for your need. Even though the time complexity of sorting is higher than your idea of using a map, it will reduce the swapping overhead compared to a map. As pointed out in the comments, sort on Linux implements external merge sort.

You can use a hash map, where the hashed string is the key and the occurrence count is the value. It will be faster. You can choose a good string hash function based on your requirements. Here is a link to some good hashing functions:
http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
You can also use some third-party libraries for this.
EDIT:
Pseudo code:
int mapWordFrequency[MAX_SIZE] = {0}; // if MAX_SIZE is large, allocate the array dynamically
int someHashMethod(string input);

for each currString in ListOfString:
    int key = someHashMethod(currString);
    ++mapWordFrequency[key];
    if (mapWordFrequency[key] > 1000)
        doSomeThing();
Update:
As @Jens pointed out, there can be cases where someHashMethod() returns the same int (hash) for two different strings. In that case we have to resolve the collision, and the lookup time will then be more than constant. Also, since the input size is very large, creating a single array of that size may not be possible. In that case we may use distributed computing concepts, but the actual lookup time will again go up compared to a single machine.

Depending on the statistical distribution of your words, it may be worth compressing each word before adding it to the map. As long as this is lossless compression you can recover the original words after filtering. The idea being you may be able to reduce the average word size (hence saving memory, and time comparing keys). Here's a simple compression/decompression procedure you could use:
#include <string>
#include <sstream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/iostreams/copy.hpp>
inline std::string compress(const std::string& data)
{
    std::stringstream decompressed {data};
    boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
    stream.push(boost::iostreams::zlib_compressor());
    stream.push(decompressed);
    std::stringstream compressed {};
    boost::iostreams::copy(stream, compressed);
    return compressed.str();
}

inline std::string decompress(const std::string& data)
{
    std::stringstream compressed {data};
    boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
    stream.push(boost::iostreams::zlib_decompressor());
    stream.push(compressed);
    std::stringstream decompressed;
    boost::iostreams::copy(stream, decompressed);
    return decompressed.str();
}
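If you go this route, here is a rough sketch of how it could plug into the original map; `file` is assumed to be an already-open std::ifstream, and whether this actually saves memory depends on your words, since zlib adds per-string overhead and very short words may even grow:
// Hypothetical usage (needs <map>, <fstream>, <iostream>): count compressed
// keys, then recover the original words that pass the threshold.
std::map<std::string, unsigned> mapWordFrequency;
std::string word;
while (file >> word)
    ++mapWordFrequency[compress(word)];

for (const auto& kv : mapWordFrequency)
    if (kv.second > 1000)
        std::cout << decompress(kv.first) << '\n';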
In addition to using a std::unordered_map as others have suggested, you could also move any words that have already been seen more than 1000 times out of the map and into a std::unordered_set. This would also require checking the set before the map, but you may see better hash performance by doing this. It may also be worth rehashing occasionally if you employ this strategy.
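A short sketch of that two-container idea, assuming a per-word helper along these lines (the names are illustrative):
#include <string>
#include <unordered_map>
#include <unordered_set>

std::unordered_map<std::string, unsigned> counts;
std::unordered_set<std::string> frequent; // words already seen more than 1000 times

void addWord(const std::string& word) {
    if (frequent.count(word))      // already over the threshold: cheap path
        return;
    if (++counts[word] > 1000) {   // just crossed the threshold: move it
        frequent.insert(word);
        counts.erase(word);        // keeps the big map from growing further
    }
}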

You need another approach to your problem; your data is too big to be processed all at once.
For example, you could split your file into multiple files; the easiest would be to split them logically by starting letter.
100 GB / 26 letters ≈ 3.85 GB
Now you'll have 26 files of roughly 3.85 GB each.
You know that the words in any one of the files cannot be part of any other file; this helps, since you won't have to merge the results.
With a ~4 GB file, it now gets easier to work in RAM.
std::map has a problem when you start using a lot of memory, since it fragments a lot. Try std::unordered_map, and if that's still not performing well, you may be able to load the file into memory and sort it. Counting occurrences will then be quite easy.
Assuming you have many duplicates, your map or unordered_map will have a significantly lower memory footprint.
Run your code in a loop, once per file, and append the results to another file.
You should be done quite quickly.
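A rough sketch of that partitioning pass, assuming lowercase ASCII words; the bucket_X.txt names are made up, and words not starting with a letter would need an extra bucket:
#include <cctype>
#include <fstream>
#include <string>

// First pass: split the huge file into one bucket file per starting letter,
// so each bucket can later be counted independently in RAM.
void partitionByFirstLetter(const char* bigFile) {
    std::ifstream in(bigFile);
    std::ofstream buckets[26];
    for (int i = 0; i < 26; ++i)
        buckets[i].open(std::string("bucket_") + char('a' + i) + ".txt");
    std::string word;
    while (in >> word) {
        char c = std::tolower(static_cast<unsigned char>(word[0]));
        if (c >= 'a' && c <= 'z')
            buckets[c - 'a'] << word << '\n';
        // else: route to a catch-all bucket for non-letter starts
    }
}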

The main problem seems to be the memory footprint, so we are looking for a solution that uses up little memory. A way to save memory is to use sorted vectors instead of a map. Now, vector has a lookup time with ~log(n) comparisons and an average insert time of n/2, which is bad. The upside is you have basically no memory overhead, the memory to be moved is small due to separation of data and you get sequential memory (cache-friendliness) which can easily outperform a map. The required memory would be 2 (wordcount) + 4 (index) + 1 (\0-char) + x (length of word) bytes per word. To achieve that we need to get rid of the std::string, because it is just too big in this case.
You can split your map into a vector<char> that saves the strings one after another separated by \0-characters, a vector<unsigned int> for the index and a vector<short int> for the word count. The code would look something like this (tested):
#include <vector>
#include <algorithm>
#include <cstring>
#include <string>
#include <fstream>
#include <iostream>
std::vector<char> strings;         // all words, back to back, separated by '\0'
std::vector<unsigned int> indexes; // offset of each word in `strings`, kept sorted by word
std::vector<short int> wordcount;  // count per word, parallel to `indexes`
const int countlimit = 1000;

void insertWord(const std::string &str) {
    // find the word
    auto stringfinder = [](unsigned int lhs, const std::string &rhs) {
        return &strings[lhs] < rhs;
    };
    auto index = lower_bound(begin(indexes), end(indexes), str, stringfinder);
    // increment counter
    if (index == end(indexes) || strcmp(&strings[*index], str.c_str())) { // unknown word
        wordcount.insert(begin(wordcount) + (index - begin(indexes)), 1);
        indexes.insert(index, strings.size());
        strings.insert(end(strings), str.c_str(), str.c_str() + str.size() + 1);
    }
    else { // known word
        auto &count = wordcount[index - begin(indexes)];
        if (count <= countlimit) // stop counting once past the limit (also prevents overflow)
            count++;
    }
}

int main() {
    std::ifstream f("input.txt");
    std::string s;
    while (f >> s) { // not a good way to read in words
        insertWord(s);
    }
    for (size_t i = 0; i < indexes.size(); ++i) {
        if (wordcount[i] > countlimit) {
            std::cout << &strings[indexes[i]] << ": " << wordcount[i] << '\n';
        }
    }
}
This approach still saves all words in memory. According to Wolfram Alpha the average word length in the English language is 5.1 characters. This gives you a total memory requirement of (5.1 + 7) * 1bn bytes = 12.1bn bytes = 12.1GB. Assuming you have a halfway modern computer with 16+GB of RAM you can fit it all into RAM.
If this fails (because you don't have English words and they don't fit in memory), the next approach would be memory mapped files. That way you can make indexes point to the memory mapped file instead of strings, so you can get rid of strings, but the access time would suffer.
If this fails due to low performance you should look into map-reduce which is very easy to apply to this case. It gives you as much performance as you have computers.

@TonyD Can you please give a little example with a trie? – Rose Sharma
Here's an example of a trie approach to this problem:
#include <iostream>
#include <string>
#include <limits>
#include <array>
#include <cstdint>
class trie
{
public:
    void insert(const std::string& s)
    {
        node_.insert(s.c_str());
    }

    friend std::ostream& operator<<(std::ostream& os, const trie& t)
    {
        return os << t.node_;
    }

private:
    struct Node
    {
        Node() : freq_(0) { }
        uint16_t freq_;
        std::array<Node*, 26> next_letter_{};

        void insert(const char* p)
        {
            if (*p)
            {
                Node*& p_node = next_letter_[*p - 'a'];
                if (!p_node)
                    p_node = new Node;
                p_node->insert(++p);
            }
            else
                if (freq_ < std::numeric_limits<decltype(freq_)>::max()) ++freq_;
        }
    } node_;

    friend std::ostream& operator<<(std::ostream& os, const Node& n)
    {
        os << '(';
        if (n.freq_) os << n.freq_ << ' ';
        for (size_t i = 0; i < 26; ++i)
            if (n.next_letter_[i])
                os << char('a' + i) << *(n.next_letter_[i]);
        return os << ')';
    }
};

int main()
{
    trie my_trie;
    my_trie.insert("abc");
    my_trie.insert("abcd");
    my_trie.insert("abc");
    my_trie.insert("bc");
    std::cout << my_trie << '\n';
}
Output:
(a(b(c(2 d(1 ))))b(c(1 )))
The output is a compressed/tree-like representation of your word-frequency histogram: abc appears 2 times, abcd 1, bc 1. The parentheses can be thought of as pushing and popping characters from a "stack" to form the current prefix or - when there's a number - word.
Whether it improves much on a map depends on the variation in the input words, but it's worth a try. A more memory-efficient implementation might use a vector, a set, or even a string of, say, space-separated suffixes when there are few elements beneath the current prefix, then switch to the array of 26 pointers when that's likely to need less memory.

Related

Appending strings in C++ [duplicate]

This question already has answers here:
C++ equivalent of StringBuffer/StringBuilder?
Consider this piece of code:
public String joinWords(String[] words) {
    String sentence = "";
    for (String w : words) {
        sentence = sentence + w;
    }
    return sentence;
}
On each concatenation a new copy of the string is created, so that the overall complexity is O(n^2). Fortunately in Java we could solve this with a StringBuffer, which has O(1) complexity for each append, then the overall complexity would be O(n).
While in C++, std::string::append() has complexity of O(n), and I'm not clear about the complexity of stringstream.
In C++, are there methods like those in StringBuffer with the same complexity?
C++ strings are mutable, and pretty much as dynamically sizable as a StringBuffer. Unlike its equivalent in Java, this code wouldn't create a new string each time; it just appends to the current one.
std::string joinWords(std::vector<std::string> const &words) {
    std::string result;
    for (auto &word : words) {
        result += word;
    }
    return result;
}
This runs in linear time if you reserve the size you'll need beforehand. The question is whether looping over the vector to get the sizes would be slower than letting the string auto-resize. That, I couldn't tell you. Time it. :)
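For completeness, a sketch of the reserve-first variant that paragraph alludes to:
std::string joinWords(std::vector<std::string> const &words) {
    std::size_t total = 0;
    for (auto &word : words)
        total += word.size();   // one cheap pass to compute the final length
    std::string result;
    result.reserve(total);      // single allocation up front
    for (auto &word : words)
        result += word;
    return result;
}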
If you don't want to use std::string itself for some reason (and you should consider it; it's a perfectly respectable class), C++ also has string streams.
#include <sstream>
...
std::string joinWords(std::vector<std::string> const &words) {
    std::ostringstream oss;
    for (auto &word : words) {
        oss << word;
    }
    return oss.str();
}
It's probably not any more efficient than using std::string, but it's a bit more flexible in other cases -- you can stringify just about any primitive type with it, as well as any type that has an operator<<(ostream&, its_type&) overload.
This is somewhat tangential to your Question, but relevant nonetheless. (And too big for a comment!!)
On each concatenation a new copy of the string is created, so that the overall complexity is O(n^2).
In Java, the complexity of s1.concat(s2) or s1 + s2 is O(M1 + M2) where M1 and M2 are the respective String lengths. Turning that into the complexity of a sequence of concatenations is difficult in general. However, if you assume N concatenations of Strings of length M, then the complexity is indeed O(M * N * N) which matches what you said in the Question.
Fortunately in Java we could solve this with a StringBuffer, which has O(1) complexity for each append, then the overall complexity would be O(n).
In the StringBuilder case, the amortized complexity of N calls to sb.append(s) for strings of size M is O(M*N). The key word here is amortized. When you append characters to a StringBuilder, the implementation may need to expand its internal array. But the expansion strategy is to double the array's size. And if you do the math, you will see that each character in the buffer is going to be copied on average one extra time during the entire sequence of append calls. So the complexity of the entire sequence of appends still works out as O(M*N) ... and, as it happens M*N is the final string length.
So your end result is correct, but your statement about the complexity of a single call to append is not correct. (I understand what you mean, but the way you say it is facially incorrect.)
Finally, I'd note that in Java you should use StringBuilder rather than StringBuffer unless you need the buffer to be thread-safe.
As an example of a really simple structure that has O(n) complexity in C++11:
template<typename TChar>
struct StringAppender {
    std::vector<std::basic_string<TChar>> buff;

    StringAppender& operator+=( std::basic_string<TChar> v ) {
        buff.push_back(std::move(v));
        return *this;
    }

    explicit operator std::basic_string<TChar>() {
        std::basic_string<TChar> retval;
        std::size_t total = 0;
        for( auto&& s : buff )
            total += s.size();
        retval.reserve(total+1);
        for( auto&& s : buff )
            retval += std::move(s);
        return retval;
    }
};
use:
StringAppender<char> append;
append += s1;
append += s2;
std::string s3(append); // direct-initialization, since the conversion operator is explicit
This takes O(n), where n is the number of characters.
Finally, if you know how long all of the strings are, just doing a reserve with enough room makes append or += take a total of O(n) time. But I agree that is awkward.
Use of std::move with the above StringAppender (ie, sa += std::move(s1)) will significantly increase performance for non-short strings (or using it with xvalues etc)
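For instance (a small usage sketch; the moved-from strings s1 and s2 are left valid but unspecified, typically empty):
StringAppender<char> append;
append += std::move(s1);  // moves s1's buffer instead of copying it
append += std::move(s2);
std::string s3(append);   // one reserve, then a single concatenation pass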
I do not know the complexity of std::ostringstream, but ostringstream is for pretty-printing formatted output, or cases where high performance is not important. I mean, they aren't bad, and they may even outperform scripted/interpreted/bytecode languages, but if you are in a rush, you need something else.
As usual, you need to profile, because constant factors are important.
An rvalue-reference-to-this operator+ might also be a good addition, but few compilers implement rvalue references to *this.

Best way to concatenate and condense a std::vector<std::string>

Disclaimer: This problem is more of a theoretical, rather than a practical interest. I want to find out various different ways of doing this, with speed as icing on the new year cake.
The Problem
I want to be able to store a list of strings, and be able to quickly combine them into 1 if needed.
In short, I want to condense a structure (currently a std::vector<std::string>) that looks like
["Hello, ", "good ", "day ", " to", " you!"]
to
["Hello, good day to you!"]
Is there any idiomatic way to achieve this, ala python's [ ''.join(list_of_strings) ]?
What is the best way to achieve this in C++, in terms of time?
Possible Approaches
The first idea I had is to
loop over the vector,
append each element to the first,
simultaneously delete the element.
We will be concatenating with += and reserve(). I assume that max_size() will not be reached.
Approach 1 (The Greedy Approach)
So called because it ignores conventions and operates in-place.
#if APPROACH == 'G'
// Greedy Approach
void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
{
    // Reserve the size for all characters, less than max_size()
    my_strings[0].reserve(total_characters_in_list);
    // There are strings left, ...
    for (auto itr = my_strings.begin()+1; itr != my_strings.end();)
    {
        // append, and...
        my_strings[0] += *itr;
        // delete, until...
        itr = my_strings.erase(itr);
    }
}
#endif
Now I know, you would say that this is risky and bad. So:
loop over the vector,
append each element to another std::string,
clear the vector and make the string first element of the vector.
Approach 2 (The "Safe" Haven)
So called because it does not modify the container while iterating over it.
#if APPROACH == 'H'
// Safe Haven Approach
void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
{
    // Store the whole vector here
    std::string condensed_string;
    condensed_string.reserve(total_characters_in_list);
    // There are strings left...
    for (auto itr = my_strings.begin(); itr != my_strings.end(); ++itr)
    {
        // append, until...
        condensed_string += *itr;
    }
    // remove all elements except the first
    my_strings.resize(1);
    // and set it to condensed_string
    my_strings[0] = condensed_string;
}
#endif
Now for the standard algorithms...
Using std::accumulate from <numeric>
Approach 3 (The Idiom?)
So called simply because it is a one-liner.
#if APPROACH == 'A'
// Accumulate Approach
void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
{
    // Reserve the size for all characters, less than max_size()
    my_strings[0].reserve(total_characters_in_list);
    // Accumulate all the strings
    my_strings[0] = std::accumulate(my_strings.begin(), my_strings.end(), std::string(""));
    // And resize
    my_strings.resize(1);
}
#endif
Why not try to store it all in a stream?
Using std::stringstream from <sstream>.
Approach 4 (Stream of Strings)
So called due to the analogy of C++'s streams with flow of water.
#if APPROACH == 'S'
// Stringstream Approach
void condense(std::vector< std::string >& my_strings, int) // you can remove the int
{
    // Create out stream
    std::stringstream buffer(my_strings[0]);
    // There are strings left, ...
    for (auto itr = my_strings.begin(); itr != my_strings.end(); ++itr)
    {
        // add until...
        buffer << *itr;
    }
    // resize and assign
    my_strings.resize(1);
    my_strings[0] = buffer.str();
}
#endif
However, maybe we can use another container rather than std::vector?
In that case, what else?
(Possible) Approach 5 (The Great Indian "Rope" Trick)
I have heard about the rope data structure, but have no idea if (and how) it can be used here.
Benchmark and Verdict:
Ordered by their time efficiency (currently, and surprisingly), the results are1:

Approach             Vector Size: 40         Vector Size: 1600       Vector Size: 64000
SAFE_HAVEN           0.1307962699997006      0.12057728999934625     0.14202970000042114
STREAM_OF_STRINGS    0.12656566000077873     0.12249500000034459     0.14765803999907803
ACCUMULATE_WEALTH    0.11375975999981165     0.12984520999889354     3.748660090001067
GREEDY_APPROACH      0.12164988000004087     0.13558526000124402     22.6994204800023
timed with2:
import timeit

NUM_OF_ITERATIONS = 100
test_cases = [ 'greedy_approach', 'safe_haven' ]

for approach in test_cases:
    time_taken = timeit.timeit(
        f'system("{approach + ".exe"}")',
        'from os import system',
        number = NUM_OF_ITERATIONS
    )
    print(approach + ": ", time_taken / NUM_OF_ITERATIONS)
Can we do better?
Update: I have tested 4 approaches so far, as many as I could manage in my limited time. More incoming soon. It would have been better to fold the code so that more approaches could be added to this post, but that was declined.
1 Note that these readings are only for a rough estimate. There are a lot of things that influence the execution time, and note that there are some inconsistencies here as well.
2 This is the old code, used to test only the first two approaches. The current code is a good deal longer, and more integrated, so I am not sure I should add it here.
Conclusions:
Deleting elements is very costly.
You should just copy the strings somewhere, and resize the vector.
In fact, it is better to reserve enough space too, if copying to another string.
You could also try std::accumulate:
auto s = std::accumulate(my_strings.begin(), my_strings.end(), std::string());
Won't be any faster, but at least it's more compact.
With range-v3 (and soon with C++20 ranges), you might do:
std::vector<std::string> v{"Hello, ", "good ", "day ", " to", " you!"};
std::string s = v | ranges::view::join;
By default, I would use std::stringstream. Simply construct the stream, stream in all the strings from the vector, and then return the output string. It isn't very efficient, but it is clear what it does.
In most cases, one doesn't need fast method when dealing with strings and printing - so the "easy to understand and safe" methods are better. Plus, compilers nowadays are good at optimizing inefficiencies in simple cases.
The most efficient way... it is a hard question. Some applications require efficiency on multiple fronts. In these cases you might need to utilize multithreading.
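A sketch of that straightforward stream-based version (the function name is just illustrative):
#include <sstream>
#include <string>
#include <vector>

std::string condense(const std::vector<std::string>& strings) {
    std::ostringstream oss;
    for (const auto& s : strings)
        oss << s;          // clear, if not the fastest option
    return oss.str();
}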
Personally, I'd construct a second vector to hold a single "condensed" string, construct the condensed string, and then swap vectors when done.
void Condense(std::vector<std::string> &strings)
{
    std::vector<std::string> condensed(1);     // one default constructed std::string
    std::string &constr = condensed.front();   // reference to first element of condensed
    for (const auto &str : strings)
        constr.append(str);
    std::swap(strings, condensed);             // swap newly constructed vector into original
}
If an exception is thrown for some reason, then the original vector is left unchanged, and cleanup occurs - i.e. this function gives a strong exception guarantee.
Optionally, to reduce resizing of the "condensed" string, after initialising constr in the above, one could do
// optional: compute the length of the condensed string and reserve
std::size_t total_characters_in_list = 0;
for (const auto &str : strings)
    total_characters_in_list += str.size();
constr.reserve(total_characters_in_list);
// end optional reservation
As to how efficient this is compared with alternatives, that depends. I'm also not sure it's relevant - if strings keep on being appended to the vector, and needing to be appended, there is a fair chance that the code that obtains the strings from somewhere (and appends them to the vector) will have a greater impact on program performance than the act of condensing them.

Finding occurrences of a specific string of length L inside a txt file

Suppose I have a hexadecimal string y of length N of the form y{N}y{N-1}...y{1}.
Then given another hexadecimal string x of length L (L less than N), I want to check how many times (if at all) this string appears inside y... say like y{N}...x{L}x{L-1}...x{1}...y{j}..x{L}x{L-1}...x{1}....y{1}.
Which is the most efficient way to do this in C++? I need a really efficient implementation, as I would like to run this on a large database.
"Hexadecimal" doesn't mean a thing here. C++ is a computer language, and works on bits. "Hexadecimal" is just a convenient way to group 4 bits together for human consumption.
Similarly, C++ doesn't index strings like y{N}y{N-1}...y{1}. It indexes them as y[0], y[1], ..., y[N-1]. (There's no y[N].)
Under normal circumstances, std::string::find is going to be faster than your disk, which means it's fast enough.
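For example, a small sketch counting non-overlapping occurrences once the file's contents are in a std::string:
#include <string>

// Count non-overlapping occurrences of `needle` in `haystack`.
std::size_t countOccurrences(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = haystack.find(needle);
         pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size()))
        ++count;
    return count;
}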
Your request is a simple string search algorithm.
There are many algorithms to do that.
Most of them will give you a good answer in O(L+N) with preprocessing.
You can also use a suffix tree, which will provide a faster answer in O(L + Z), where Z is the number of occurrences of x in y.
A suffix tree can take a lot of memory in practice (a naive suffix trie is O(N²), and even a linear-space suffix tree has large constants), so it might not be the ideal choice here.
Which is the most efficient way to do this in C++ ?
Try std::search across an std::istream_iterator of your input file, like this:
#include <string>
#include <iterator>
#include <iostream>
#include <algorithm>
int main () {
    // std::ifstream input("input.txt");
    std::istream& input(std::cin);
    std::string search_for("1234");
    std::istream_iterator<char> last;
    std::istream_iterator<char> it(input);
    int count(0);
    while ((it = std::search(it, last, search_for.begin(), search_for.end())) != last) {
        count++;
        ++it; // step past the match so the next search doesn't find it again
    }
    std::cout << count << "\n";
}
If that isn't fast enough, you might try std::istreambuf_iterator.
If that isn't fast enough, you might try memory-mapping the file and using the initial and final pointers as your iterators.
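A POSIX-only sketch of that last suggestion (Windows would need CreateFileMapping instead); error handling is kept minimal and the pattern is assumed non-empty:
#include <algorithm>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole file and hand the raw character range to std::search.
std::size_t countInFile(const char* path, const std::string& pattern) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    fstat(fd, &st);
    void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) { close(fd); return 0; }
    const char* first = static_cast<const char*>(mapped);
    const char* last = first + st.st_size;
    std::size_t count = 0;
    for (const char* it = std::search(first, last, pattern.begin(), pattern.end());
         it != last;
         it = std::search(it + pattern.size(), last, pattern.begin(), pattern.end()))
        ++count;
    munmap(mapped, st.st_size);
    close(fd);
    return count;
}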

C++ std::map creation taking too long?

UPDATED:
I am working on a program whose performance is very critical. I have a vector of structs that are NOT sorted. I need to perform many search operations in this vector. So I decided to cache the vector data into a map like this:
std::map<long, int> myMap;
for (int i = 0; i < myVector.size(); ++i)
{
    const Type& theType = myVector[i];
    myMap[theType.key] = i;
}
When I search the map, the results of the rest of the program are much faster. However, the remaining bottleneck is the creation of the map itself (it is taking about 0.8 milliseconds on average to insert about 1,500 elements in it). I need to figure out a way to trim this time down. I am simply inserting a long as the key and an int as the value. I don't understand why it is taking this long.
Another idea I had was to create a copy of the vector (can't touch the original one) and somehow perform a faster sort than the std::sort (it takes way too long to sort it).
Edit:
Sorry everyone. I meant to say that I am creating a std::map where the key is a long and the value is an int. The long value is the struct's key value and the int is the index of the corresponding element in the vector.
Also, I did some more debugging and realized that the vector is not sorted at all. It's completely random. So doing something like a stable_sort isn't going to work out.
ANOTHER UPDATE:
Thanks everyone for the responses. I ended up creating a vector of pairs (std::vector of std::pair(long, int)). Then I sorted the vector by the long value. I created a custom comparator that only looked at the first part of the pair. Then I used lower_bound to search for the pair. Here's how I did it all:
typedef std::pair<long,int> Key2VectorIndexPairT;
typedef std::vector<Key2VectorIndexPairT> Key2VectorIndexPairVectorT;
bool Key2VectorIndexPairComparator(const Key2VectorIndexPairT& pair1, const Key2VectorIndexPairT& pair2)
{
    return pair1.first < pair2.first;
}
...
Key2VectorIndexPairVectorT sortedVector;
sortedVector.reserve(originalVector.capacity());
// Assume "original" vector contains unsorted elements.
for (int i = 0; i < originalVector.size(); ++i)
{
    const TheStruct& theStruct = originalVector[i];
    sortedVector.push_back(Key2VectorIndexPairT(theStruct.key, i));
}
std::sort(sortedVector.begin(), sortedVector.end(), Key2VectorIndexPairComparator);
...
const long keyToSearchFor = 20;
const Key2VectorIndexPairVectorT::const_iterator cItorKey2VectorIndexPairVector = std::lower_bound(sortedVector.begin(), sortedVector.end(), Key2VectorIndexPairT(keyToSearchFor, 0 /* Provide dummy index value for search */), Key2VectorIndexPairComparator);
if (cItorKey2VectorIndexPairVector != sortedVector.end() && cItorKey2VectorIndexPairVector->first == keyToSearchFor)
{
    const int vectorIndex = cItorKey2VectorIndexPairVector->second;
    const TheStruct& theStruct = originalVector[vectorIndex];
    // Now do whatever you want...
}
else
{
    // Could not find element...
}
This yielded a modest performance gain for me. Before the total time for my calculations were 3.75 milliseconds and now it is down to 2.5 milliseconds.
Both std::map and std::set are built on a binary tree, so adding items does dynamic memory allocation. If your map is largely static (i.e. initialized once at the start and then rarely or never has new items added or removed), you'd probably be better off using a sorted vector and std::lower_bound to look up items via binary search.
Maps take a lot of time for two reasons:
You need to do a lot of memory allocation for your data storage.
You need to perform O(n lg n) comparisons for the sort.
If you are just creating this as one batch, then throwing the whole map out, using a custom pool allocator may be a good idea here - eg, boost's pool_alloc. Custom allocators can also apply optimizations such as not actually deallocating any memory until the map's completely destroyed, etc.
Since your keys are integers, you may want to consider writing your own container based on a radix tree (on the bits of the key) as well. This may give you significantly improved performance, but since there is no STL implementation, you may need to write your own.
If you don't need to sort the data, use a hash table, such as std::unordered_map; these avoid the significant overhead needed for sorting data, and also can reduce the amount of memory allocation needed.
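If you don't need ordering, a sketch of the hash-map variant of the original loop (reusing myVector and Type from the question; reserving up front avoids rehashing during the build):
#include <unordered_map>

std::unordered_map<long, int> myMap;
myMap.reserve(myVector.size());           // avoid rehashes while inserting
for (int i = 0; i < myVector.size(); ++i)
    myMap.emplace(myVector[i].key, i);    // keeps the first index if a key repeats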
Finally, depending on the overall design of the program, it may be helpful to simply reuse the same map instead of recreating it over and over. Just delete and add keys as needed, rather than building a new vector, then building a new map. Again, this may not be possible in the context of your program, but if it is, it would definitely help you.
I suspect it's the memory management and tree rebalancing that's costing you here.
Obviously profiling may be able to help you pinpoint the issue.
I would suggest as a general idea to just copy the long/int data you need into another vector and since you said it's almost sorted, use stable_sort on it to finish the ordering. Then use lower_bound to locate the items in the sorted vector.
std::find is a linear scan (it has to be, since it works on unsorted data). If you can sort the data (std::sort guarantees n log(n) behavior), then you can use std::binary_search to get log(n) searches. But as pointed out by others, it may be that the copy time is the problem.
If keys are solid and short, perhaps try std::hash_map instead. From MSDN's page on hash_map Class:
The main advantage of hashing over sorting is greater efficiency; a successful hashing performs insertions, deletions, and finds in constant average time as compared with a time proportional to the logarithm of the number of elements in the container for sorting techniques.
Map creation can be a performance bottleneck (in the sense that it takes a measurable amount of time) if you're creating a large map and you're copying large chunks of data into it. You're also using the obvious (but suboptimal) way of inserting elements into a std::map - if you use something like:
myMap.insert(std::make_pair(theType.key, theType));
this should improve the insertion speed, but it will result in a slight change in behaviour if you encounter duplicate keys - using insert will result in values for duplicate keys being dropped, whereas using your method, the last element with the duplicate key will be inserted into the map.
I would also look into avoiding a making a copy of the data (for example by storing a pointer to it instead) if your profiling results determine that it's the copying of the element that is expensive. But for that you'll have to profile the code, IME guesstimates tend to be wrong...
Also, as a side note, you might want to look into storing the data in a std::set with a custom comparator, as your type contains the key already. That, however, will not really result in a big speed-up, as constructing a set in this case is likely to be as expensive as inserting into a map.
I'm not a C++ expert, but it seems that your problem stems from copying the Type instances, instead of a reference/pointer to the Type instances.
std::map<Type> myMap; // <-- this is wrong, since std::map requires two template parameters, not one
If you add elements to the map and they're not pointers, then I believe the copy constructor is invoked and that will certainly cause delays with a large data structure. Use the pointer instead:
std::map<KeyType, ObjectType*> myMap;
Furthermore, your example is a little confusing since you "insert" a value of type int in the map when you're expecting a value of type Type. I think you should assign the reference to the item, not the index.
myMap[theType.key] = &myVector[i];
Update:
The more I look at your example, the more confused I get. If you're using the std::map, then it should take two template types:
map<T1,T2> aMap;
So what are you REALLY mapping? map<Type, int> or something else?
It seems that you're using the Type.key member field as a key to the map (it's a valid idea), but unless key is of the same type as Type, then you can't use it as the key to the map. So is key an instance of Type??
Furthermore, you're mapping the current vector index to the key in the map, which indicates that you're just want the index to the vector so you can later access that index location fast. Is that what you want to do?
Update 2.0:
After reading your answer it seems that you're using std::map<long,int> and in that case there is no copying of the structure involved. Furthermore, you don't need to make a local reference to the object in the vector. If you just need to access the key, then access it by calling myVector[i].key.
You're building a copy of the data from the broken example you give, and not just a reference.
Why Can't I store references in an STL map in C++?
Whatever you store in the map, it relies on you not changing the vector.
Try a lookup map only.
typedef std::vector<Type> Stuff;
Stuff myVector;

typedef std::map<long, Type*> LookupMap;
LookupMap myMap;

LookupMap::iterator hint = myMap.begin();
for (Stuff::iterator it = myVector.begin(); myVector.end() != it; ++it)
{
    hint = myMap.insert(hint, std::make_pair(it->key, &*it));
}
Or perhaps drop the vector and just store it in the map??
Since your vector is already partially ordered, you may want to instead create an auxiliary array referencing (indices of) the elements in your original vector. Then you can sort the auxiliary array using Timsort which has good performance for partially sorted data (such as yours).
I think you've got some other problem. Creating a vector of 1500 <long, int> pairs, and sorting it based on the longs should take considerably less than 0.8 milliseconds (at least assuming we're talking about a reasonably modern, desktop/server type processor).
To try to get an idea of what we should see here, I did a quick bit of test code:
#include <vector>
#include <algorithm>
#include <cstdlib>
#include <time.h>
#include <iostream>

int main() {
    const int size = 1500;
    const int reps = 100;

    std::vector<std::pair<long, int> > init;
    long total = 0;

    // Generate "original" array
    for (int i = 0; i < size; i++)
        init.push_back(std::make_pair(rand(), i));

    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        // copy the original array
        std::vector<std::pair<long, int> > data(init.begin(), init.end());
        // sort the copy
        std::sort(data.begin(), data.end());
        // use data that depends on sort to prevent it being optimized away
        total += data[10].first;
        total += data[size-10].first;
    }
    clock_t stop = clock();

    std::cout << "Ignore: " << total << "\n";

    clock_t ticks = stop - start;
    double seconds = ticks / (double)CLOCKS_PER_SEC;
    double ms = seconds * 1000.0;
    double ms_p_iter = ms / reps;
    std::cout << ms_p_iter << " ms/iteration.";
    return 0;
}
Running this on my somewhat "trailing edge" (~5 year-old) machine, I'm getting times around 0.1 ms/iteration. I'd expect searching in this (using std::lower_bound or std::upper_bound) to be somewhat faster than searching in an std::map as well (since the data in the vector is allocated contiguously, we can expect better locality of reference, leading to better cache usage).

Sorting 1000-2000 elements with many cache misses

I have an array of 1000-2000 elements which are pointers to objects. I want to keep my array sorted and obviously I want to do this as quick as possible. They are sorted by a member and not allocated contiguously so assume a cache miss whenever I access the sort-by member.
Currently I'm sorting on-demand rather than on-add, but because of the cache misses and [presumably] the non-inlining of the member access, the inner loop of my quicksort is slow.
I'm doing tests and trying things now (and seeing what the actual bottleneck is), but can anyone recommend a good alternative to speed this up?
Should I do an insertion sort instead of quicksorting on-demand, or should I try to change my model to make the elements contiguous and reduce cache misses?
OR, is there a sort algorithm I've not come across which is good for data that is going to cache miss?
Edit: Maybe I worded this wrong :). I don't actually need my array sorted all the time (I'm not iterating through it sequentially for anything); I just need it sorted when I'm doing a binary chop to find a matching object, and doing that quicksort at that time (when I want to search) is currently my bottleneck, because of the cache misses and jumps (I'm using a < operator on my object, but I'm hoping that inlines in release).
Simple approach: insertion sort on every insert. Since your elements are not contiguous in memory, I'm guessing a linked list. If so, then you could transform it into a linked list with jumps to the 10th element, the 100th and so on. This is kind of similar to the next suggestion.
Or you reorganize your container structure into a binary tree (or whatever tree you like: B, B*, red-black, ...) and insert elements like you would insert them into a search tree.
Running a quicksort on each insertion is enormously inefficient. Doing a binary search and insert operation would likely be orders of magnitude faster. Using a binary search tree instead of a linear array would reduce the insert cost.
Edit: I missed that you were doing sort on extraction, not insert. Regardless, keeping things sorted amortizes sorting time over each insert, which almost has to be a win, unless you have a lot of inserts for each extraction.
If you want to keep the sort on-extract methodology, then maybe switch to merge sort, or another sort that has good performance for mostly-sorted data.
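A sketch of the binary-search-and-insert idea on the existing vector of pointers (Obj and its key member are hypothetical stand-ins for your type):
#include <algorithm>
#include <vector>

struct Obj { int key; /* ... */ };   // hypothetical element type

// Insert while keeping the vector sorted by key: one binary search
// (O(log n) comparisons) plus one contiguous shift, instead of a full
// quicksort at lookup time.
void insertSorted(std::vector<Obj*>& v, Obj* p) {
    auto pos = std::upper_bound(v.begin(), v.end(), p,
        [](const Obj* a, const Obj* b) { return a->key < b->key; });
    v.insert(pos, p);
}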
I think the best approach in your case would be to change your data structure to something logarithmic and rethink your architecture. The bottleneck of your application is not the sorting itself, but the fact that you have to sort everything and then try to compensate with an on-demand sort.
Another thing you could try (based on your current implementation) is implementing an external pointer-to-key mapping table (or function) and sorting those secondary keys, but I actually doubt it would be of benefit in this case.
Instead of the array of the pointers you may consider an array of structs which consist of both a pointer to your object and the sort criteria. That is:
Instead of
struct MyType {
    // ...
    int m_SomeField; // this is the sort criteria
};
std::vector<MyType*> arr;
You may do this:
struct ArrayElement {
    MyType* m_pObj;      // the actual object
    int m_SortCriteria;  // should always be equal to m_pObj->m_SomeField
};
std::vector<ArrayElement> arr;
You may also remove the m_SomeField field from your struct, if you only access your object via this array.
This way, in order to sort your array you won't need to dereference m_pObj on every iteration. Hence you'll make better use of the cache.
Of course you must keep the m_SortCriteria always synchronized with m_SomeField of the object (in case you're editing it).
As you mention, you're going to have to do some profiling to determine if this is a bottleneck and if other approaches provide any relief.
Alternatives to using an array are std::set or std::multiset which are normally implemented as R-B binary trees, and so have good performance for most applications. You're going to have to weigh using them against the frequency of the sort-when-searched pattern you implemented.
In either case, I wouldn't recommend rolling-your-own sort or search unless you're interested in learning more about how it's done.
I would think that sorting on insertion would be better. We are talking O(log N) comparisons here, so roughly ceil(log N) + 1 retrievals of the sort key per insert.
For 2000 elements, that amounts to about 12.
What's great about this is that you can buffer the sort key of the element to be inserted, which is why you only pay those few retrievals to actually insert.
You may wish to look at some inlining, but do profile before you're sure THIS is the tight spot.
Nowadays you could use a set: either a std::set, if the values of your structure member are unique, or a std::multiset if there are duplicate values.
One side note: the concept of using raw pointers is in general not advisable.
STL containers (if used correctly) give you nearly always an optimized performance.
Anyway. Please see some example code:
#include <iostream>
#include <array>
#include <algorithm>
#include <set>
#include <iterator>
// Demo data structure, whatever
struct Data {
    int i{};
};

// -----------------------------------------------------------------------------------------
// All in the below section is executed during compile time. Not during runtime
// It will create an array to some thousands pointer
constexpr std::size_t DemoSize = 4000u;
using DemoPtrData = std::array<const Data*, DemoSize>;
using DemoData = std::array<Data, DemoSize>;

consteval DemoData createDemoData() {
    DemoData dd{};
    int k{};
    for (Data& d : dd)
        d.i = k++ * 2;
    return dd;
}
constexpr DemoData demoData = createDemoData();

consteval DemoPtrData createDemoPtrData(const DemoData& dd) {
    DemoPtrData dpd{};
    for (std::size_t k{}; k < dpd.size(); ++k)
        dpd[k] = &dd[k];
    return dpd;
}
constexpr DemoPtrData dpd = createDemoPtrData(demoData);
// -----------------------------------------------------------------------------------------

struct Comp { bool operator () (const Data* d1, const Data* d2) const { return d1->i < d2->i; } };
using MySet = std::multiset<const Data*, Comp>;

int main() {
    // Add some thousand pointers. Will be sorted according to struct member
    MySet mySet{ dpd.begin(), dpd.end() };

    // Extract a range of data. integer values between 42 and 52
    const Data* p42 = dpd[21];
    const Data* p52 = dpd[26];

    // Show result
    for (auto iptr = mySet.lower_bound(p42); iptr != mySet.upper_bound(p52); ++iptr)
        std::cout << (*iptr)->i << '\n';

    // Insert a new element
    Data d1{ 47 };
    mySet.insert(&d1);

    // Show again
    std::cout << "\n\n";
    for (auto iptr = mySet.lower_bound(p42); iptr != mySet.upper_bound(p52); ++iptr)
        std::cout << (*iptr)->i << '\n';
}