How to get all combinations by matching fields - c++

I have four classes: Jacket, Shirt, Tie, and Outfit.
class Outfit {
    //...
    shared_ptr<Jacket> jacket;
    shared_ptr<Shirt> shirt;
    shared_ptr<Tie> tie;
    //...
};

class Jacket {
public:
    Jacket(const string &brand, const string &color, const string &size);
    // ... getters, setters etc. ...
private:
    string brand, color, size;
};

// The same for Shirt and Tie: a brand, followed by a color or size
I need to get all the possible matches between jackets and shirts, and between jackets and ties, like this:
vector<shared_ptr<Jacket>> jackets {
    make_shared<Jacket>("some_brand", "blue", "15"),
    make_shared<Jacket>("some_other_brand", "red", "14")
};
vector<shared_ptr<Shirt>> shirts {
    make_shared<Shirt>("brand1", "doesnotmatterformatchingcriteria", "15"),
    make_shared<Shirt>("brand6", "blue", "15"),
    make_shared<Shirt>("brand3", "blue", "14"),
    make_shared<Shirt>("brand5", "red", "15"),
    make_shared<Shirt>("brand6", "red", "14")
};
vector<shared_ptr<Tie>> ties {
    make_shared<Tie>("other_brand1", "blue"),
    make_shared<Tie>("other_brand2", "blue"),
    make_shared<Tie>("other_brand6", "blue"),
    make_shared<Tie>("other_brand7", "blue")
};
void getAllPosibilities(vector<Outfit> &outfits) {
    for (const auto &jacket : jackets) {
        for (const auto &shirt : shirts) {
            if (jacket->getSizeAsString() == shirt->getSizeAsString()) {
                for (const auto &tie : ties) {
                    if (jacket->getColor() == tie->getColor()) {
                        outfits.push_back(Outfit(jacket, shirt, tie));
                    }
                }
            }
        }
    }
}
So basically I want all the combinations, regardless of the brand name, matching only on the fields I specify. But I think this is painfully slow, considering that I keep nesting for loops. In my actual problem I have even more fields to match and more classes, so I don't think this approach is ideal at all.
Is there any better/simpler solution than doing this?

What you're doing here is commonly known in databases as a join. As an SQL query, your code would look like this:
select * from jacket, shirt, tie where jacket.size = shirt.size and jacket.color = tie.color;
Algorithmic Ideas
Nested Loop Join
Now, what you've implemented is known as a nested loop join, which usually would have complexity O(jackets * shirts * ties). However, the size check inside your shirt loop means the tie loop only runs for matching shirts, which reduces it to O(jackets * shirts + jackets * matching_shirts * ties).
For small data sets, as the one you provided, this nested loop join can be very effective and is usually chosen. However, if the data gets bigger, the algorithm can quickly become slow. Depending on how much additional memory you can spare and whether sorting the input sequences is okay with you, you might want to use approaches that utilize sorting, as #Deduplicator initially pointed out.
Sort Merge Join
The sort-merge-join usually is used for two input sequences, so that after both sequences have been sorted, you only need to traverse each sequence once, giving you complexity of O(N log N) for the sorting and O(N) for the actual join phase. Check out the Wikipedia article for a more in-depth explanation. However, when you have more than two input sequences, the logic can become hard to maintain, and you have to traverse one of the input sequences more than once. Effectively, we will have O(N log N) for the sorting of jackets, shirts and ties and O(jackets + shirts + colors * ties) complexity for the actual joining part.
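To make the two-input case concrete, here is a minimal sketch of a merge join of jackets and shirts on size, using the getters from the question and the same std using-declarations as the rest of the code here (the function name and the output pair vector are made up for illustration; the tie join would work the same way on color):
void mergeJoinBySize(vector<shared_ptr<Jacket>> jackets,   // taken by value so the
                     vector<shared_ptr<Shirt>> shirts,     // caller's vectors stay untouched
                     vector<pair<shared_ptr<Jacket>, shared_ptr<Shirt>>> &pairs) {
    auto bySize = [](const auto &a, const auto &b) {
        return a->getSizeAsString() < b->getSizeAsString();
    };
    sort(jackets.begin(), jackets.end(), bySize);
    sort(shirts.begin(), shirts.end(), bySize);

    auto js = shirts.begin();
    for (auto pj = jackets.begin(); pj != jackets.end(); ++pj) {
        // Advance past shirts that are smaller than the current jacket's size.
        while (js != shirts.end() && (*js)->getSizeAsString() < (*pj)->getSizeAsString())
            ++js;
        // Emit one pair per shirt in the block of equal sizes; the block is
        // re-scanned for every jacket of that size, as described above.
        for (auto s = js; s != shirts.end()
                          && (*s)->getSizeAsString() == (*pj)->getSizeAsString(); ++s)
            pairs.emplace_back(*pj, *s);
    }
}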
Sort + Binary Search Join
In the answer #Deduplicator gave, they utilized sorting, but in a different way: Instead of sequentially going through the input sequences, they used binary search to quickly find the block of matching elements. This gives O(N log N) for the sorting and O(jackets * log shirts + log ties * matching jacket-shirt combinations + output_elements) for the actual join phase.
Hash Join
However, all of these approaches can be trumped if the elements you have can easily be hashed, as hashmaps allow us to store and find potential join partners incredibly fast. Here, the approach is to iterate over all but one input sequence once and store the elements in a hashmap. Then, we iterate over the last input sequence once, and, using the hash map, find all matching join partners. Again, Wikipedia has a more in-depth explanation. We can utilize std::unordered_map here. We should be able to find a good hashing function here, so this gives us a total runtime of O(jackets + shirts + ties + total_output_tuples), which is a lower bound on how fast you can be: You need to process all input sequences once, and you need to produce the output sequence.
Also, the code for it is rather beautiful (imho) and easy to understand:
void hashJoin(vector<Outfit> &outfits) {
    using shirtsize_t = decltype(Shirt::size);
    std::unordered_map<shirtsize_t, vector<shared_ptr<Shirt>>> shirts_by_sizes;
    using tiecolor_t = decltype(Tie::color);
    std::unordered_map<tiecolor_t, vector<shared_ptr<Tie>>> ties_by_colors;

    for (const auto& shirt : shirts) {
        shirts_by_sizes[shirt->size].push_back(shirt);
    }
    for (const auto& tie : ties) {
        ties_by_colors[tie->color].push_back(tie);
    }
    for (const auto& jacket : jackets) {
        for (const auto& matching_shirt : shirts_by_sizes[jacket->size]) {
            for (const auto& matching_tie : ties_by_colors[jacket->color]) {
                outfits.push_back(Outfit{jacket, matching_shirt, matching_tie});
            }
        }
    }
}
However, in the worst case, if our hashing does not give us uniform hashes, we might experience worse complexity, as the O(1) access can become worse. You would want to inspect the hash maps to spot this, and replace the hash function in this case.
Implementations
I've posted a working implementation of all four algorithms I discussed here on godbolt. However, since they are rather long, I only included the (superior) hash join algorithm in this answer.
Lower bounds and output elements
As #Dominique pointed out, there is no way to have a better run time complexity than O(output_element_count), since you have to produce each output element at least once. So, if you expect that your result is (asymptotically) close to jackets * shirts * ties, the nested loop join is the variant with the lowest overhead, and thus should be fastest. However, I think this will not be the case here.

Sorting is often a good idea:
static constexpr auto size_comp = [](auto&& a, auto&& b){
    return a->getSizeAsString() < b->getSizeAsString(); };
static constexpr auto color_comp = [](auto&& a, auto&& b){
    return a->getColor() < b->getColor(); };

std::sort(jackets.begin(), jackets.end(), size_comp);
std::sort(shirts.begin(), shirts.end(), size_comp);
std::sort(ties.begin(), ties.end(), color_comp);

auto ps = shirts.begin();
for (auto pj = jackets.begin(); pj != jackets.end() && ps != shirts.end(); ++pj) {
    ps = std::lower_bound(ps, shirts.end(), *pj, size_comp);
    auto pt = std::lower_bound(ties.begin(), ties.end(), *pj, color_comp);
    for (auto psx = ps; psx != shirts.end() && !size_comp(*pj, *psx); ++psx) {
        for (auto ptx = pt; ptx != ties.end() && !color_comp(*pj, *ptx); ++ptx)
            outfits.emplace_back(*pj, *psx, *ptx);
    }
}

I don't think it's possible to optimise this if you are interested in all combinations, for one simple reason:
Imagine you have 2 brands of jackets, 5 of shirts and 4 of ties; then you are looking for 2*5*4 = 40 possibilities. That's exactly the amount of output you have to produce, so there is nothing to optimise away. In that case the following loop is fine:
for all jackets_brands:
    for all shirts_brands:
        for all ties_brands:
            ...
However, imagine you have some criteria, like some of the 5 brands don't go together with some of the 4 brands of the ties. In that case, it might be better to alter the sequence of the for-loops, as you can see here:
for all shirts_brands:
    for all ties_brands:
        if go_together(this_particular_shirts_brand, this_particular_ties_brand)
            then for all jackets_brands:
                ...
In this way, you might avoid some unnecessary loops.
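As an illustration only (go_together and the brand containers are hypothetical, not part of the question), the reordered loops might look like this in C++:
#include <string>
#include <vector>

// Hypothetical filter: returns false for shirt/tie brand pairs that must not be combined.
bool go_together(const std::string &shirtBrand, const std::string &tieBrand);

void listAllowedCombinations(const std::vector<std::string> &shirtBrands,
                             const std::vector<std::string> &tieBrands,
                             const std::vector<std::string> &jacketBrands) {
    for (const auto &shirtBrand : shirtBrands) {
        for (const auto &tieBrand : tieBrands) {
            if (!go_together(shirtBrand, tieBrand))
                continue;                     // prune before the innermost loop runs
            for (const auto &jacketBrand : jacketBrands) {
                // ... record the (jacketBrand, shirtBrand, tieBrand) combination ...
            }
        }
    }
}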


What are the benefits of lazy directory iteration?

I am reading the Phobos docs and found the dirEntries function, which "lazily iterates a given directory". But I can't understand the real benefit of it.
As I understand it, a lazy function is a function that is only evaluated at the time its result is needed.
Let's look at the following code:
auto files = dirEntries(...);
auto cnt = files.count;
foreach( file; files ) { }
How many times would dirEntries be called? Once or twice? Please explain the logic to me.
Or take splitter, for example.
To me it makes the code much harder to understand.
Lazy evaluation can be much more efficient if used correctly.
Say you have a somewhat expensive function that does something and you apply it to a whole range:
auto arr = iota(0, 100000); // a range of numbers from 0 up to (but not including) 100000
arr.map!(number => expensiveFunc(number))
.take(5)
.writeln;
If map wasn't lazy, it would execute expensiveFunc for all 100000 items in the range, and then pop off the first 5 of them.
But because map is lazy, expensiveFunc will only be called for the 5 items actually popped from the range.
Similarly with splitter: say you have a CSV string with some data in it and you want to keep summing values until you meet a negative value.
string csvStr = "100,50,-1,1000,10,24,51";
int sum;
foreach(val; csvStr.splitter(",")){
    immutable asNumber = val.to!int;
    if(asNumber < 0) break;
    sum += asNumber;
}
writeln(sum);
The above will only do the expensive "splitting" work three times, as splitter is lazy and we only had to read three items. That saves us from splitting csvStr all the way to the end, even though we don't need the rest of the values.
So, in summary, the benefit of lazy evaluation is that only the work that NEEDS to be done is actually done.

Choosing an efficient data structure to find rhymes

I've been working on a program that reads in a whole dictionary and uses the WordNet data from CMU, which splits every word into its pronunciation.
The goal is to utilize the dictionary to find the best rhymes and alliterations of a given word, given the number of syllables in the word we need to find and its part of speech.
I've decided to use std::map<std::string, vector<Sound> > and std::multimap<int, std::string> where the map maps each word in the dictionary to its pronunciation in a vector, and the multimap is returned from a function that finds all the words that rhyme with a given word.
The int is the number of syllables of the corresponding word, and the string holds the word.
I've been working on the efficiency, but can't seem to get it to be more efficient than O(n). The way I'm finding all the words that rhyme with a given word is
vector<string> *rhymingWords = new vector<string>;
for (const auto &entry : soundMap) {
    if (rhymingSyllables(word, entry.first) >= 1 && entry.first != word) {
        rhymingWords->push_back(entry.first);
    }
}
return rhymingWords;
And when I find the best rhyme for a word (a word that rhymes the most syllables with the given word), I do
vector<string> rhymes = *getAllRhymes(rhymesWith);
string bestRhyme;
int maxRhymes = 0;
for (const string &s : rhymes) {
    if (countSyllables(s) == numberOfSyllables) {
        int a = rhymingSyllables(s, rhymesWith);
        if (a > maxRhymes) {
            maxRhymes = a;
            bestRhyme = s;
        }
    }
}
return bestRhyme;
The drawback is the O(n) access time in terms of the number of words in the dictionary. I'm thinking of ideas to drop this down to O(log n), but I seem to hit a dead end every time. I've considered using a tree structure, but can't work out the specifics.
Any suggestions? Thanks!
The rhymingSyllables function is implemented as follows:
int rhymingSyllables(const string &word1, const string &word2) {
    int syllableCount = 0;
    if ((soundMap.count(word1) == 0) || (soundMap.count(word2) == 0)) {
        return 0;
    }
    vector<Sound> &firstSounds = soundMap.at(word1), &secondSounds = soundMap.at(word2);
    for (int i = firstSounds.size() - 1, j = secondSounds.size() - 1; i >= 0 && j >= 0; --i, --j) {
        if (firstSounds[i] != secondSounds[j]) return syllableCount;
        else if (firstSounds[i].isVowel()) ++syllableCount;
    }
    return syllableCount;
}
P.S.
The vector<Sound> is the pronunciation of the word, where Sound is a class that contains every different pronunciation of a morpheme in English, i.e.:
AA (vowel), AE (vowel), AH (vowel), AO (vowel), AW (vowel), AY (vowel), B (stop), CH (affricate), D (stop), DH (fricative), EH (vowel), ER (vowel), EY (vowel), F (fricative), G (stop), HH (aspirate), IH (vowel), IY (vowel), JH (affricate), K (stop), L (liquid), M (nasal), N (nasal), NG (nasal), OW (vowel), OY (vowel), P (stop), R (liquid), S (fricative), SH (fricative), T (stop), TH (fricative), UH (vowel), UW (vowel), V (fricative), W (semivowel), Y (semivowel), Z (fricative), ZH (fricative)
Perhaps you could group the morphemes that will be matched during rhyming and compare not the vectors of morphemes, but vectors of associated groups. Then you can sort the dictionary once and get a logarithmic search time.
After looking at rhymingSyllables implementation, it seems that you convert words to sounds, and then match any vowels to each other, and match other sounds only if they are the same. So applying advice above, you could introduce an extra auxiliary sound 'anyVowel', and then during dictionary building convert each word to its sound, replace all vowels with 'anyVowel' and push that representation to dictionary. Once you're done sort the dictionary. When you want to search a rhyme for a word - convert it to the same representation and do a binary search on the dictionary, first by last sound as a key, then by previous and so on. This will give you m*log(n) worst case complexity, where n is dictionary size and m is word length, but typically it will terminate faster.
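A minimal sketch of that idea, assuming sounds are stored by their ARPAbet symbols; the helper names (isVowelSymbol, rhymeKey, wordsWithEnding) are made up for illustration, not taken from the question:
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: true for ARPAbet vowel symbols (AA, AE, AH, ...).
bool isVowelSymbol(const std::string &sound);

// Key = the word's sounds, last sound first, with every vowel collapsed to "V".
// A word ending in ... L AH T therefore gets the key "T V L ".
std::string rhymeKey(const std::vector<std::string> &sounds) {
    std::string key;
    for (auto it = sounds.rbegin(); it != sounds.rend(); ++it)
        key += (isVowelSymbol(*it) ? std::string("V") : *it) + ' ';
    return key;
}

// dict holds (rhymeKey(word), word) pairs and is sorted once by key, so all
// words sharing a given ending form one contiguous block. lower_bound finds
// the start of that block in O(log n); the loop then only touches matches.
std::vector<std::string> wordsWithEnding(
        const std::vector<std::pair<std::string, std::string>> &dict,
        const std::string &keyPrefix) {
    std::vector<std::string> out;
    auto it = std::lower_bound(dict.begin(), dict.end(),
                               std::make_pair(keyPrefix, std::string()));
    for (; it != dict.end() && it->first.compare(0, keyPrefix.size(), keyPrefix) == 0; ++it)
        out.push_back(it->second);
    return out;
}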
You could also exploit the fact that for the best rhyme you only consider words with a certain number of syllables, and maintain a separate dictionary per syllable count. Then you count the number of syllables in the word you are looking for rhymes for and search the appropriate dictionary. Asymptotically it doesn't give you any gain, but the speedup may be useful in your application.
I've been thinking about this and I could probably suggest an approach to an algorithm.
I would first take the dictionary and divide it into multiple buckets or batches, where each batch holds the words with a given number of syllables. Distributing the words into buckets is linear, as you traverse a large vector of strings once. The first bucket holds all the one-syllable words, so there is nothing more to do for it; every bucket after that needs each of its words split into its syllables. If you have, say, 25 buckets, the first few and the last few will not hold many words, so their work is insignificant and can be done first; the buckets in the middle, with roughly 3-6 syllables per word, will be the largest, so you could run each of those on its own thread if its size is over a certain amount and process them in parallel. Once you are done, each bucket should return a std::vector<std::shared_ptr<Word>>, where your structure might look like this:
enum SpeechSound {
    SS_AA,
    SS_AE,
    // ...
    SS_ZH
};

enum SpeechSoundType {
    ASPIRATE,
    // ...
    VOWEL
};

struct SyllableMorpheme {
    SpeechSound sound;
    SpeechSoundType type;
};

class Word {
private:
    std::string m_strWord;
    // These Two Containers Should Match In Size! One String For Each
    // Syllable & One Matching Struct From Above Containing Two Enums.
    std::vector<std::string> m_vSyllables;
    std::vector<SyllableMorpheme> m_vMorphemes;
public:
    explicit Word( const std::string& word );

    std::string getWord() const;
    std::string getSyllable( unsigned index ) const;
    unsigned getSyllableCount() const;
    SyllableMorpheme getMorpheme( unsigned index ) const;

    bool operator==( const Word& other ) const;
    bool operator!=( const Word& other ) const;

private:
    Word( const Word& c );              // Not Implemented
    Word& operator=( const Word& other ); // Not Implemented
};
This time you will now have new buckets, or vectors of shared pointers to these class objects. Then you can easily write a function to traverse each bucket, or even multiple buckets, since the buckets all have the same signature, just a different number of syllables. Remember: each bucket should already be sorted alphabetically, since we only added words by syllable count and never changed the order they were read in from the dictionary.
With this you can easily compare whether two words are equal or not while checking for matching syllables and morphemes. And since these are contained in std::vector<std::shared_ptr<Word>>, you don't have to worry about memory cleanup as much either.
The idea is to use linear search, separation and comparison as much as possible; yet if your container gets too large, then create buckets and run multiple threads in parallel, or maybe use a hash table if it suits your needs.
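A rough sketch of the bucketing step (assuming the Word class above; the map-of-vectors layout and the function name are just illustrative choices, and the threading is left out):
#include <map>
#include <memory>
#include <vector>

// One linear pass over the dictionary; buckets are keyed by syllable count
// and each bucket keeps the original (alphabetical) dictionary order.
std::map<unsigned, std::vector<std::shared_ptr<Word>>>
bucketBySyllables(const std::vector<std::shared_ptr<Word>> &dictionary) {
    std::map<unsigned, std::vector<std::shared_ptr<Word>>> buckets;
    for (const auto &word : dictionary)
        buckets[word->getSyllableCount()].push_back(word);
    return buckets;
}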
Another possibility with this class structure is that you could add more to it later on if you wanted or needed to, such as another std::vector for its definitions, and another std::vector<string> for its part of speech {noun, verb, etc.}. You could even add other vector<string>s for things such as homonyms and homophones, and even a vector<string> for a list of all words that rhyme with it.
Now for your specific task of finding the best matching rhyme you may find that some words may end up having a list of Words that would all be considered a Best Match or Fit! Due to this you wouldn't want to store or return a single string, but rather a vector of strings!
Case Example:
To Too Two Blue Blew Hue Hew Knew New,
Bare Bear Care Air Ayre Heir Fair Fare There Their They're
Plain, Plane, Rain, Reign, Main, Mane, Maine
Yes these are all single syllable rhyming words, but as you can see there are many cases where there are multiple valid answers, not just a single best case match. This is something that does need to be taken into consideration.

How to use couchdb reduce for a map that has multiple elements for values

I can't seem to find an answer for this anywhere, so maybe this is not allowed, but I can't find any CouchDB info that confirms it. Here is the scenario:
Suppose for a map function, within Futon, I'm emitting a value for a key, e.g. K(1). This value comprises two separate floating-point numbers A(1) and B(1) for key K(1). I would like to have a reduce function compute the sample average of the ratio A(N)/B(N) over all K(N) from 1 to N. The issue I keep running into in the reduce function is with the "values" parameter. Each key is associated with a value pair of (A,B), but I can't break the A and B floating-point numbers out of "values".
I can't seem to find any examples on how to do this. I've already tried accessing multi-level javascript arrays for "values" but it doesn't work, below is my map function.
function(doc) {
    if (doc['Reqt.ID']) {
        doc['Reqt.ID'].forEach(function(reqt) {
            row_index = doc['Reqt.ID'].indexOf(reqt);
            if (doc.Resource[row_index] == "Joe Smith")
                emit({rid: reqt},
                     {acthrs: doc['Spent.Hours'][row_index],
                      esthrs: doc['Estimate.Total.Hours'][row_index]});
        });
    }
}
I can get this to work (i.e. avg ratio) if I just produce a map that emits a single element value of A/B within the map function, but I'm curious about this case of multiple value elements.
How is this generally done within the Futon reduce function?
I've already tried various JSON Javascript notations such as values[key index].esthrs[0] within a for loop of the keys, but none of my combinations work.
Thank you so much.
There are two ways you could approach this. The first, and my recommendation, is to change your map function to make it more of a "keys are keys and values are values" situation. In your particular case, since you have two "values" you'd like to work with, Spent.Hours and Estimate.Total.Hours, that probably means you'll need two views; although you can cheat a little by issuing multiple emit()s per row in the same view, for example:
emit(["Spent.Hours", reqt], doc['Spent.Hours'][row_index]);
emit(["Estimate.Total.Hours", reqt], doc['Estimate.Total.Hours'][row_index]);
With that approach, you can just use the predefined _stats reduce function.
Alternatively, you can define a "smart" stats function, which can do the statistics for more elaborate documents.
The standard _stats function provides count, sum, average and standard deviation. The algorithm it uses is to take the sum of the values, the sum of the values squared, and the count of values; from just these, the average and standard deviation can be calculated (and are embedded, for convenience, in the reduced view).
Roughly, that might look like:
function(key, values, rereduce) {
    function getstats(getter) {
        var c = 0, s = 0, s2 = 0;
        values.forEach(function (row) {
            var value = getter(row);
            if (rereduce) {
                c += value.count;
                s += value.sum;
                s2 += value.sumsq;
            } else {
                c += 1;
                s += value;
                s2 += value * value;
            }
        });
        return {
            count: c,
            sum: s,
            sumsq: s2,
            average: s / c,
            stddev: Math.sqrt(c * s2 - s * s) / c
        };
    }
    return {esthrs: getstats(function(x){ return x.esthrs; }),
            acthrs: getstats(function(x){ return x.acthrs; })};
}

How to figure out "progress" while sorting?

I'm using stable_sort to sort a large vector.
The sorting takes on the order of a few seconds (say, 5-10 seconds), and I would like to display a progress bar to the user showing how much of the sorting is done so far.
But (even if I was to write my own sorting routine) how can I tell how much progress I have made, and how much more there is left to go?
I don't need it to be exact, but I need it to be "reasonable" (i.e. reasonably linear, not faked, and certainly not backtracking).
Standard library sort uses a user-supplied comparison function, so you can insert a comparison counter into it. The total number of comparisons for either quicksort/introsort or mergesort will be very close to N * log2(N) (where N is the number of elements in the vector). So that's what I'd export to a progress bar: number of comparisons / (N * log2(N)).
Since you're using mergesort, the comparison count will be a very precise measure of progress. It might be slightly non-linear if the implementation spends time permuting the vector between comparison runs, but I doubt your users will see the non-linearity (and anyway, we're all used to inaccurate non-linear progress bars :) ).
Quicksort/introsort would show more variance, depending on the nature of the data, but even in that case it's better than nothing, and you could always add a fudge factor on the basis of experience.
A simple counter in your compare class will cost you practically nothing. Personally I wouldn't even bother locking it (the locks would hurt performance); it's unlikely to get into an inconsistent state, and anyway the progress bar won't start radiating lizards just because it gets an inconsistent progress number.
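A counting comparator along those lines might look like this; it is only a sketch, and the N * log2(N) estimate and the hypothetical reporting spot are the parts added here:
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Comparator that counts how often it is invoked.
struct CountingLess {
    std::size_t *count;
    bool operator()(int a, int b) const {
        ++*count;                 // practically free next to the comparison itself
        return a < b;
    }
};

void sortWithProgress(std::vector<int> &v) {
    std::size_t comparisons = 0;
    const double expected = v.size() * std::log2(static_cast<double>(v.size()));
    // In a real UI you would poll `comparisons` from the GUI thread while the
    // sort runs (and perhaps make it atomic); progress ~= comparisons / expected.
    std::stable_sort(v.begin(), v.end(), CountingLess{&comparisons});
    double progress = std::min(1.0, comparisons / expected);
    (void)progress;               // feed this to the progress bar
}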
Split the vector into several equal sections, the quantity depending upon the granularity of progress reporting you desire. Sort each section separately. Then start merging with std::merge. You can report your progress after sorting each section, and after each merge. You'll need to experiment to determine what percentage the sorting of the sections should count for compared to the merging.
Edit:
I did some experiments of my own and found the merging to be insignificant compared to the sorting, and this is the function I came up with:
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

template<typename It, typename Comp, typename Reporter>
void ReportSort(It ibegin, It iend, Comp cmp, Reporter report, double range_low = 0.0, double range_high = 1.0)
{
    double range_span = range_high - range_low;
    double range_mid = range_low + range_span / 2.0;
    using namespace std;
    auto size = iend - ibegin;
    if (size < 32768) {
        stable_sort(ibegin, iend, cmp);
    } else {
        ReportSort(ibegin, ibegin + size / 2, cmp, report, range_low, range_mid);
        report(range_mid);
        ReportSort(ibegin + size / 2, iend, cmp, report, range_mid, range_high);
        inplace_merge(ibegin, ibegin + size / 2, iend, cmp);
    }
}

int main()
{
    std::vector<int> v(100000000);
    std::iota(v.begin(), v.end(), 0);
    // std::random_shuffle was removed in C++17; std::shuffle is the replacement.
    std::shuffle(v.begin(), v.end(), std::mt19937{std::random_device{}()});
    std::cout << "starting...\n";
    double percent_done = 0.0;
    auto report = [&](double d) {
        if (d - percent_done >= 0.05) {
            percent_done += 0.05;
            std::cout << static_cast<int>(percent_done * 100) << "%\n";
        }
    };
    ReportSort(v.begin(), v.end(), std::less<int>(), report);
}
Stable sort is based on merge sort. If you wrote your own version of merge sort then (ignoring some speed-up tricks) you would see that it consists of log N passes. Each pass starts with 2^k sorted lists and produces 2^(k-1) lists, with the sort finished when it merges two lists into one. So you could use the value of k as an indication of progress.
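A bottom-up formulation makes those passes explicit. The following is only a sketch (the function name and the per-pass report callback are made up), but it shows where k, the pass number, falls out naturally:
#include <algorithm>
#include <cstddef>
#include <vector>

// Bottom-up merge sort: after each pass the sorted runs double in width,
// so pass / totalPasses is a natural progress value.
template <typename T, typename Report>
void mergeSortWithProgress(std::vector<T> &v, Report report) {
    const std::size_t n = v.size();
    std::size_t totalPasses = 0;
    for (std::size_t w = 1; w < n; w *= 2) ++totalPasses;   // ~log2(n) passes

    std::size_t pass = 0;
    for (std::size_t width = 1; width < n; width *= 2) {
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
            std::size_t mid = lo + width;
            std::size_t hi = std::min(lo + 2 * width, n);
            std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
        }
        ++pass;
        report(static_cast<double>(pass) / totalPasses);    // k-th pass finished
    }
}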
If you were going to run experiments, you might instrument the comparison object to count the number of comparisons made and try and see if the number of comparisons made is some reasonably predictable multiple of n log n. Then you could keep track of progress by counting the number of comparisons done.
(Note that with the C++ stable sort, you have to hope that it finds enough store to hold a copy of the data. Otherwise the cost goes from N log N to perhaps N (log N)^2 and your predictions will be far too optimistic).
Select a small subset of indices and count inversions. You know its maximal value, and you know when you are done the value is zero. So, you can use this value as a "progressor". You can think of it as a measure of entropy.
Easiest way to do it: sort a small vector and extrapolate the time assuming O(n log n) complexity.
t(n) = C * n * log(n) ⇒ t(n1) / t(n2) = n1/n2 * log(n1)/log(n2)
If sorting 10 elements takes 1 μs, then 100 elements will take 1 μs * 100/10 * log(100)/log(10) = 20 μs.
Quicksort is basically
1. partition the input using a pivot element
2. sort the smallest part recursively
3. sort the largest part using tail recursion
All the work is done in the partition step. You could do the outer partition directly and then report progress as the smallest part is done.
So there would be an additional step between 2 and 3 above.
Update progressor
Here is some code.
template <typename RandomAccessIterator>
void sort_wReporting(RandomAccessIterator first, RandomAccessIterator last)
{
    double done = 0;
    double whole = static_cast<double>(std::distance(first, last));
    typedef typename std::iterator_traits<RandomAccessIterator>::value_type value_type;
    while (first != last && first + 1 != last)
    {
        auto d = std::distance(first, last);
        value_type pivot = *(first + std::rand() % d);
        auto iter = std::partition(first, last,
            [pivot](const value_type& x){ return x < pivot; });
        auto lower = std::distance(first, iter);
        auto upper = std::distance(iter, last);
        if (lower < upper)
        {
            std::sort(first, iter);
            done += lower;
            first = iter;
        }
        else
        {
            std::sort(iter, last);
            done += upper;
            last = iter;
        }
        std::cout << done / whole << std::endl;
    }
}
I spent almost one day figuring out how to display the progress for shell sort, so I will leave my simple formula here. Given an array of colors, it displays the progress by blending the colors from red to yellow and then to green; the last position of the array, blue, is shown once the data is sorted. For shell sort, the passes over the array take roughly proportional time each, so the progress becomes pretty accurate.
(Code in Dart/Flutter)
List<Color> colors = [
    Color(0xFFFF0000),
    Color(0xFFFF5500),
    Color(0xFFFFAA00),
    Color(0xFFFFFF00),
    Color(0xFFAAFF00),
    Color(0xFF55FF00),
    Color(0xFF00FF00),
    Colors.blue,
];
[...]
style: TextStyle(
color: colors[(((pass - 1) * (colors.length - 1)) / (log(a.length) / log(2)).floor()).floor()]),
It is basically a cross-multiplication.
a means the array. (log(a.length) / log(2)).floor() rounds down log2(N), where N is the number of items. I tested this with several combinations of array sizes, array values, and sizes for the array of colors, so I think it is good to go.

how to get median value from sorted map

I am using a std::map. Sometimes I will do an operation like finding the median value of all items. E.g.
if I add
1 "s"
2 "sdf"
3 "sdfb"
4 "njw"
5 "loo"
then the median is 3.
Is there some solution without iterating over half the items in the map?
I think you can solve the problem by using two std::maps: one for the smaller half of the items (mapL) and a second for the other half (mapU). When you have an insert operation, it will be one of these cases:
add the item to mapU and move the smallest element to mapL
add the item to mapL and move the greatest element to mapU
If the maps have different sizes and you insert the element into the one with the smaller number of elements, you skip the move step.
The basic idea is that you keep your maps balanced, so the maximum size difference is 1 element.
As far as I know, all the STL operations involved work in O(log n) time. Accessing the smallest and greatest element in a map can be done using iterators.
When you have an n-th position query, just check the map sizes and return the greatest element in mapL or the smallest element in mapU.
The above usage scenario is for inserting only, but you can extend it to deleting items as well; you just have to keep track of which map holds the item, or try to delete from both.
Here is my code with sample usage:
#include <iostream>
#include <string>
#include <map>
using namespace std;

typedef pair<int,string> pis;
typedef map<int,string>::iterator itis;

map<int,string> Left;
map<int,string> Right;

itis get_last(map<int,string> &m){
    return (--m.end());
}

int add_element(int key, string val){
    if (Left.empty()){
        Left.insert(make_pair(key,val));
        return 1;
    }
    pis maxl = *get_last(Left);
    if (key <= maxl.first){
        Left.insert(make_pair(key,val));
        if (Left.size() > Right.size() + 1){
            itis to_rem = get_last(Left);
            pis cpy = *to_rem;
            Left.erase(to_rem);
            Right.insert(cpy);
        }
        return 1;
    } else {
        Right.insert(make_pair(key,val));
        if (Right.size() > Left.size()){
            itis to_rem = Right.begin();
            pis cpy = *to_rem;
            Right.erase(to_rem);
            Left.insert(cpy);
        }
        return 2;
    }
}

pis get_mid(){
    int size = Left.size() + Right.size();
    if (Left.size() >= size / 2){
        return *(get_last(Left));
    }
    return *(Right.begin());
}

int main(){
    Left.clear();
    Right.clear();
    int key;
    string val;
    while (cin >> key >> val){
        add_element(key,val);
        pis mid = get_mid();
        cout << "mid " << mid.first << " " << mid.second << endl;
    }
}
I think the answer is no. You cannot just jump to the N / 2 item past the beginning because a std::map uses bidirectional iterators. You must iterate through half of the items in the map. If you had access to the underlying Red/Black tree implementation that is typically used for the std::map, you might be able to get close like in Dani's answer. However, you don't have access to that as it is encapsulated as an implementation detail.
Try:
typedef std::map<int,std::string> Data;
Data data;
Data::iterator median = data.begin();
std::advance(median, data.size() / 2);
Works if the size() is odd. I'll let you work out how to do it when size() is even.
In a self-balancing binary tree (which std::map is, I think) a good approximation would be the root.
For the exact value, just cache it together with a balance indicator: each time an item is added below the median, decrease the indicator, and increase it when an item is added above. When the indicator reaches 2/-2, move the median one step upwards/downwards and reset the indicator.
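One way to realize that cached-median idea is sketched below; instead of the ±2 counter it adjusts a lower-median iterator using the new size's parity, which is a variant of the same bookkeeping (the class name is made up, and a multiset stands in for the map to keep the sketch short):
#include <set>

// Keeps an iterator to the lower median up to date; after every insert the
// iterator moves at most one step, so median() is O(1) after an O(log n) insert.
class MedianSet {
    std::multiset<int> data;
    std::multiset<int>::iterator mid;   // lower median, valid once data is non-empty
public:
    void insert(int key) {
        data.insert(key);
        if (data.size() == 1) {
            mid = data.begin();
        } else if (key < *mid) {
            if (data.size() % 2 == 0) --mid;   // median index shifted left
        } else {
            if (data.size() % 2 == 1) ++mid;   // median index shifted right
        }
    }
    int median() const { return *mid; }
};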
If you can switch data structures, store the items in a std::vector and sort it. That will enable accessing the middle item positionally without iterating. (It can be surprising, but a sorted vector often out-performs a map, due to locality. For lookups by the sort key you can use binary search, and it will have much the same performance as a map anyway. See Scott Meyers' Effective STL.)
If you know the map will be sorted, then get the element at floor(length / 2). If you're in a bit twiddly mood, try (length >> 1).
I know of no way to get the median from a pure STL map quickly for big maps. If your map is small or you need the median rarely, you should use the linear advance to n/2 anyway, I think, for the sake of simplicity and staying standard.
You can use the map to build a new container that offers the median: Jethro suggested using two maps; based on this, perhaps better would be a single map plus a continuously updated median iterator. These methods suffer from the drawback that you have to reimplement every modifying operation, and in Jethro's case even the reading operations.
A custom-written container will also do what you want, probably most efficiently, but for the price of custom code. You could try, as was suggested, to modify an existing STL map implementation. You can also look for existing implementations.
There is a super efficient C implementation that offers most map functionality and also random access called Judy Arrays. These work for integer, string and byte array keys.
Since it sounds like insert and find are your two common operations while median is rare, the simplest approach is to use the map and std::advance( m.begin(), m.size()/2 ); as originally suggested by David Rodríguez. This is linear time, but easy to understand so I'd only consider another approach if profiling shows that the median calls are too expensive relative to the work your app is doing.
The nth_element() algorithm is there for you for this :) It implements the partition part of quicksort, and you don't need your vector (or array) to be sorted.
Its time complexity is O(n) (while for sorting you need to pay O(n log n)).
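For example (a sketch only; it assumes the values have been copied into a vector, since nth_element needs random-access iterators and cannot run on the map directly):
#include <algorithm>
#include <vector>

// Median of an unsorted vector in O(n) average time; the vector is only
// partially reordered, not fully sorted.
int medianOf(std::vector<int> values) {
    auto mid = values.begin() + values.size() / 2;
    std::nth_element(values.begin(), mid, values.end());
    return *mid;    // upper median for even sizes
}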
For a sorted list, here it is in Java code, but I assume it's very easy to port to C++:
if (input.length % 2 != 0) {
    return input[(input.length + 1) / 2 - 1];
} else {
    return 0.5d * (input[(input.length / 2 - 1)] + input[(input.length / 2 + 1) - 1]);
}