I have to design an order-book data structure that lets me query the highest price among the orders that have been inserted and not yet deleted.
Insert and delete operations are given upfront in a file in which each operation looks like one of the following two:
TIMESTAMP insert ID PRICE
TIMESTAMP delete ID
where ID is an integer identifier of an order, timestamps are always increasing, and each ID appears exactly twice: once in an insert and once in a delete operation, in that order.
From this list of operations, I need to output the time-weighted average of the highest price.
As an example (writing I for insert and E for delete), let's say we have the following input:
10 I 1 10
20 I 2 13
22 I 3 13
24 E 2
25 E 3
40 E 1
We can say that after the ith operation, the max is
10, 13, 13, 13, 10
and the time-weighted average is
(10*(20-10) + 13*(22-20) + 13*(24-22) + 13*(25-24) + 10*(40-25)) / (40-10) = 315 / 30 = 10.5
because 10 is the max price in the intervals [10,20] and [25,40], while 13 is the max in the rest.
I was thinking of using an unordered_map<ID,price> and a multiset<price> to support:
insert in O(log(n))
delete in O(log(n))
getMax in O(1)
Here is an example of what I came up with:
#include <unordered_map>
#include <set>
#include <limits>

using namespace std;
typedef unsigned int uint; // not a standard type, so define it explicitly

struct order {
int timestamp, id;
char type;
double price;
};
unordered_map<uint, order> M;
multiset<double> maxPrices;
double totaltime = 0;
double avg = 0;
double lastTS = 0;
double getHighest() {
return !maxPrices.empty() ? *maxPrices.rbegin()
: std::numeric_limits<double>::quiet_NaN();
}
void update(const uint timestamp) {
const double timeLeg = timestamp - lastTS;
totaltime += timeLeg;
avg += timeLeg * getHighest();
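// note: avg accumulates the time-weighted sum; divide by totaltime at the end for the average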
lastTS = timestamp;
}
void insertOrder(const order& ord) {
if (!maxPrices.empty()) {
if (ord.price >= getHighest()) {
// we have a new maxPrice
update(ord.timestamp);
}
} else // if there are no orders yet, this is the max for sure
lastTS = ord.timestamp;
M[ord.id] = ord;
maxPrices.insert(ord.price);
}
void deleteOrder(
const uint timestamp,
const uint id_ord) { // id_ord is assumed to exist in both M and maxPrices
order ord = M[id_ord];
if (ord.price >= getHighest()) {
update(timestamp);
}
auto it = maxPrices.find(ord.price);
maxPrices.erase(it);
M.erase(id_ord);
}
This approach costs O(log n) per insert/delete operation, where n is the number of active orders, so O(N log n) overall for N operations.
Is there any faster asymptotic and/or more elegant approach to solving this problem?
I recommend you take the database approach.
Place all your records into a std::vector.
Create an index table, std::map</* key type */, size_t>, which will contain a key value and the index of the record in the vector. If you want the key sorted in descending order, also supply a comparison functor.
This strategy allows you to create many index tables without having to re-sort all of your data. The map will give good search times for your keys. You can also iterate through the map to list all the keys in order.
Note: with modern computers, you may need a huge amount of data before you see a significant timing difference between a binary search (map) and a linear search (vector).
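A minimal sketch of that idea (the names and the choice of price as the key are just illustrative assumptions; a multimap is used here so that duplicate prices survive):

#include <functional>
#include <map>
#include <vector>

struct order { int timestamp, id; double price; };

std::vector<order> records;                                   // all records live here
std::multimap<double, size_t, std::greater<double>> byPrice;  // index table: price -> position in the vector

void add(const order& o) {
    records.push_back(o);
    byPrice.insert({o.price, records.size() - 1});
}

// the first key of the descending index is the current highest price
double highest() { return byPrice.begin()->first; }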
I have a list of key-value pairs and I want to filter it down to a list in which every key occurs only once.
So that a list of e.g. {Pair(1,2), Pair(1,4), Pair(2,2)} becomes {Pair(1,2), Pair(2,2)}.
It doesn't matter which Pair gets filtered out as I only need the size
(maybe there's a different way to get the amount of pairs with pairwise different key values?).
All of this happens inside another stream over an array of lists (of key-value pairs), and the results are all added up.
I basically want the amount of collisions in a hashmap.
I hope you understand what I mean; if not please ask.
public int collisions() {
return Stream.of(t)
.filter(l -> l.size() > 1)
.filter(/*Convert l to list of Pairs with pairwise different Keys*/)
.mapToInt(l -> l.size() - 1)
.sum();
}
EDIT:
public int collisions() {
return Stream.of(t)
.forEach(currentList = stream().distinct().collect(Collectors.toList())) //Compiler Error, how do I do this?
.filter(l -> l.size() > 1)
.mapToInt(l -> l.size() - 1)
.sum();
}
I overrode the equals of Pair to return true if the keys are identical, so now I can use distinct to remove "duplicates" (Pairs with equal keys).
Is it possible to, in forEach, replace the currentElement with the same List "distincted"? If so, how?
Regards,
Claas M
I'm not sure whether you want the sum of the number of collisions per list, or the number of collisions if all lists were merged into a single one first. I assumed the former, but if it's the latter the idea does not change by much.
This how you could do it with Streams:
int collisions = Stream.of(lists)
.flatMap(List::stream)
.mapToInt(l -> l.size() - (int) l.stream().map(p -> p.k).distinct().count())
.sum();
Stream.of(lists) will give you a Stream<List<List<Pair<Integer, Integer>>>> with a single element.
Then you flatMap it, so that you have a Stream<List<Pair<Integer, Integer>>>.
From there, you mapToInt each list by subtracting from its original size the number of Pairs with unique keys it contains (l.stream().map(p -> p.k).distinct().count()).
Finally, you call sum to have the overall amount of collisions.
Note that you could use mapToLong to get rid of the cast but then collisions has to be a long (which is maybe more correct if each list has a lot of "collisions").
For example given the input:
List<Pair<Integer, Integer>> l1 = Arrays.asList(new Pair<>(1,2), new Pair<>(1,4), new Pair<>(2,2));
List<Pair<Integer, Integer>> l2 = Arrays.asList(new Pair<>(2,2), new Pair<>(1,4), new Pair<>(2,2));
List<Pair<Integer, Integer>> l3 = Arrays.asList(new Pair<>(3,2), new Pair<>(3,4), new Pair<>(3,2));
List<List<Pair<Integer, Integer>>> lists = Arrays.asList(l1, l2, l3);
It will output 4 as you have 1 collision in the first list, 1 in the second and 2 in the third.
Don't use a stream. Dump the list into a SortedSet with a custom comparator and diff the sizes:
List<Pair<K, V>> list; // given this
Set<Pair<K, V>> set = new TreeSet<>((a, b) -> a.getKey().compareTo(b.getKey()));
set.addAll(list);
int collisions = list.size() - set.size();
If the key type isn't comparable, alter the comparator lambda accordingly.
I can't seem to find an answer for this anywhere, so maybe this is not allowed but I can't find any couchdb info that confirms this. Here is a scenario:
Suppose for a map function, within Futon, I'm emitting a value for a key, ex. K(1). This value is comprised of two separate floating point numbers A(1) and B(1) for key K(1). I would like to have a reduction perform the sample average of the ratio A(N)/B(N) over all K(N) from 1 to N. The issue I'm always running into in the reduce function is for the "values" parameter. Each key is associated with a value pair of (A,B), but I can't break out the A, B floating numbers from "values".
I can't seem to find any examples of how to do this. I've already tried accessing multi-level JavaScript arrays for "values", but it doesn't work; below is my map function.
function(doc) {
if(doc['Reqt.ID']) {
doc['Reqt.ID'].forEach(function(reqt) {
row_index=doc['Reqt.ID'].indexOf(reqt);
if(doc.Resource[row_index]=="Joe Smith")
emit({rid:reqt}, {acthrs:doc['Spent.Hours'][row_index], esthrs:doc['Estimate.Total.Hours'][row_index]});
});
}
}
I can get this to work (i.e. avg ratio) if I just produce a map that emits a single element value of A/B within the map function, but I'm curious about this case of multiple value elements.
How is this generally done within the Futon reduce function?
I've already tried various JSON Javascript notations such as values[key index].esthrs[0] within a for loop of the keys, but none of my combinations work.
Thank you so much.
There are two ways you could approach this. The first, my recommendation, is to change your map function so that it is more of a "keys are keys and values are values" arrangement. In your particular case, since you have two values you'd like to work with, Spent.Hours and Estimate.Total.Hours, that would normally mean two views; you can cheat a little, though, by issuing multiple emit()'s per row in the same view, for example:
emit(["Spent.Hours", reqt], doc['Spent.Hours'][row_index]);
emit(["Estimate.Total.Hours", reqt], doc['Estimate.Total.Hours'][row_index]);
With that approach, you can just use the predefined _stats reduce function.
Alternatively, you can define a "smart" stats function, which can do the statistics for more elaborate documents.
The standard _stats function provides count, sum, average and standard deviation. The algorithm it uses is to keep the sum of the values, the sum of the squared values, and the count of values; from just these, the average and standard deviation can be calculated (average = s/c and stddev = sqrt(c*s2 - s*s)/c, with s the sum, s2 the sum of squares and c the count), and they are embedded in the reduced value for convenience.
Roughly, that might look like:
function(key, values, rereduce) {
  function getstats(getter) {
    var c = 0, s = 0, s2 = 0;
    values.forEach(function (row) {
      var value = getter(row);
      if (rereduce) {
        // partial results already carry count/sum/sumsq; just combine them
        c += value.count;
        s += value.sum;
        s2 += value.sumsq;
      } else {
        c += 1;
        s += value;
        s2 += value * value;
      }
    });
    return {
      count: c,
      sum: s,
      sumsq: s2,
      average: s / c,
      stddev: Math.sqrt(c * s2 - s * s) / c
    };
  }
  return {esthrs: getstats(function(x){return x.esthrs;}),
          acthrs: getstats(function(x){return x.acthrs;})};
}
For example, if it is given to make all the choices between 1 and 5, the answer goes like this:
1,2,3,4,5,
1-2,1-3,1-4,1-5,2-3,2-4,2-5,3-4,3-5,4-5,
1-2-3,1-2-4,1-2-5,1-3-4,
.....,
1-2-3-4-5.
can anyone suggest a fast algorithm?
Just generate all the integers from one (or zero if you want to include the empty set) to 2^N - 1. Your sets are indicated by the set bits in the number. For example if you had 5 elements {A,B,C,D,E} the number 6 = 00110 would represent the subset {C,D}.
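A minimal sketch of that enumeration (here bit i of the mask stands for the (i+1)-th element, counted from the least-significant end, so the printed order differs from the left-to-right reading of 00110 above):

#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> elems = {"A", "B", "C", "D", "E"};
    unsigned n = elems.size();
    for (unsigned mask = 1; mask < (1u << n); ++mask) {   // 1 .. 2^N - 1
        for (unsigned i = 0; i < n; ++i)
            if (mask & (1u << i))                         // bit i set => element i is in the subset
                std::cout << elems[i] << ' ';
        std::cout << '\n';
    }
}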
You want to find the powerset
In mathematics, given a set S, the power set (or powerset) of S, written P(S), is the set of all subsets of S.
There is an algorithm to find the power set at this link.
You basically take the first element, say 1, and find all subsets: {{},{1}}. Call this the power set.
Take the next element, 2, add it to every set in the power set to get {{2},{1,2}}, and take the union with the power set:
{{},{1}} U {{2},{1,2}} = {{},{1},{2},{1,2}}
But an easy way to do it is described in the answers above. Here is a link which explains it in detail.
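A minimal sketch of that incremental construction (the names are illustrative only): start from {{}} and, for each new element, append a copy of every existing subset with that element added.

#include <iostream>
#include <vector>

int main() {
    std::vector<std::vector<int>> power = {{}};        // start with just the empty set
    for (int x = 1; x <= 5; ++x) {                     // elements 1..5
        std::size_t existing = power.size();
        for (std::size_t i = 0; i < existing; ++i) {   // copy each current subset and add x
            std::vector<int> withX = power[i];
            withX.push_back(x);
            power.push_back(withX);
        }
    }
    for (const auto& s : power) {                      // prints all 2^5 = 32 subsets
        for (int v : s) std::cout << v << ' ';
        std::cout << '\n';
    }
}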
The fastest is by using template metaprogramming, which will trade compile time and code size for execution time. But this will only be practical for lowish numbers of combinations, and you have to know them ahead of time. But, you said "fast" :)
#include <iostream>
using namespace std;
typedef unsigned int my_uint;
template <my_uint M>
struct ComboPart {
ComboPart<M-1> rest;
void print() {
rest.print();
for(my_uint i = 0; i < sizeof(my_uint) * 8; i++)
if(M & (1<<i)) cout << (i + 1) << " ";
cout << endl;
}
};
template <>
struct ComboPart<0> {
void print() {};
};
template <my_uint N>
struct TwoPow {
enum {value = 2 * TwoPow<N-1>::value};
};
template <>
struct TwoPow<0> {
enum {value = 1};
};
template <my_uint N>
struct Combos {
ComboPart<TwoPow<N>::value - 1> part;
void print() {
part.print();
}
};
int main(int argc, char *argv[]) {
Combos<5> c5 = Combos<5>();
c5.print();
return 0;
}
This one constructs all the combinations at compile time.
can anyone suggest a fast algorithm?
Algorithms can be expressed in many languages, here is the power set in Haskell:
power [] = [[]]
power (x:xs) = rest ++ map (x:) rest
where rest = power xs
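For instance, power [1,2] expands to rest ++ map (1:) rest with rest = power [2] = [[],[2]], giving [[],[2],[1],[1,2]].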
You want combinations, not permutations (i.e. {1,2} is the same as {2,1})
C(n,k) = n!/(k!(n-k)!)
Answer = sum(k = 1 to n) C(n,k)
( i.e. C(n,1)+C(n,2)...+C(n,n) )
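For instance, with n = 5: C(5,1)+C(5,2)+C(5,3)+C(5,4)+C(5,5) = 5+10+10+5+1 = 31, which is 2^5 - 1, matching the enumeration of the numbers 1 to 2^N - 1 suggested above.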
What you want is called choose in combinatorics. This and this should get you started.
I have a list of around 1000 elements. Each element (an object that I read from a file, hence I can arrange them efficiently at the beginning) contains 4 variables. So now I am doing the following, which is very inefficient in the grand scheme of things:
void func(double value1, double value2, double value3)
{
foo fooArr[1000];   // each foo holds the four numeric fields a, b, c, d
for(int i = 0; i < 1000; ++i)
{
//they are all numeric! ranges are < 1000
if(fooArr[i].a == value1
&& fooArr[i].b >= value2
&& fooArr[i].c <= value2 //yes again value2
&& fooArr[i].d <= value3)
{
/* yay found now do something!*/
}
}
}
Space is not too important!
MODIFIED per REQUEST
If space isn't too important, the easiest thing to do is to create a hash keyed on "a". Depending on how many conflicts you get on "a", it may make sense to have each node in the hash table point to a binary tree keyed on "b". If "b" has a lot of conflicts, do the same for "c".
That first index into the hash, depending on how many conflicts there are, will save you a lot of time for very little coding or data-structure work.
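A minimal sketch of that layering, assuming a foo struct with the fields a, b, c, d from the question (the hash resolves a exactly, the inner tree narrows b; c and d are still checked one by one):

#include <functional>
#include <map>
#include <unordered_map>

struct foo { int a, b, c, d; };

// hash on a; each bucket holds a tree ordered by descending b, so entries with b >= value2 come first
std::unordered_map<int, std::multimap<int, foo*, std::greater<int>>> index;

void build(foo* arr, int n) {
    for (int i = 0; i < n; ++i)
        index[arr[i].a].insert({arr[i].b, &arr[i]});
}

void func(int value1, int value2, int value3) {
    auto bucket = index.find(value1);
    if (bucket == index.end()) return;
    for (const auto& entry : bucket->second) {     // walk b in descending order
        if (entry.first < value2) break;           // everything from here on fails b >= value2
        const foo* p = entry.second;
        if (p->c <= value2 && p->d <= value3) { /* yay, found it! */ }
    }
}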
First, sort the list on increasing a and decreasing b. Then build an index on a (values are integers from 0 to 999), so we've got:
int a_index[1001]; // contains starting subscript for each value
a_index[1000] = 1000;
for (i = a_index[value1]; i < a_index[value1 + 1] && fooArr[i].b >= value2; ++i)
{
if (fooArr[i].c <= value2 && fooArr[i].d <= value3) /* do stuff */
}
Assuming I haven't made a mistake here, this limits the search to the subscripts where a and b are valid, which is likely to cut your search times drastically.
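The construction of a_index is left implicit above; a minimal sketch of one way it could be filled, assuming fooArr has already been sorted by increasing a (and decreasing b within equal a):

int a_index[1001];                  // starting subscript for each value of a
int i = 0;
for (int v = 0; v <= 1000; ++v) {   // one pass over the sorted array
    while (i < 1000 && fooArr[i].a < v) ++i;
    a_index[v] = i;                 // first subscript whose a is >= v
}
// a_index[1000] ends up as 1000, the sentinel used by the search loop above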
Since you only have a few properties to match, you could use a hash table. When performing a search, you use the hash table (which indexes the a-property) to find all entries where a matches SomeConstant. After that you check whether b and c also match your constants. This way you can reduce the number of comparisons. I think this would speed the search up quite a bit.
Other than that you could build three binary search trees. One sorted by each property. After searching all three of them you perform your action for those which match your values in each tree.
Based on what you've said (in both the question and the comments) there are only a very few values for a (something like 10).
That being the case, I'd build an index on the values of a where each one points directly to all the elements in the fooArr with that value of a:
std::vector<std::vector<foo *> > index(num_a_values);
for (int i=0; i<1000; i++)
index[fooArr[i].a].push_back(&fooArr[i]);
Then when you get a value to look up an item, you go directly to those for which fooArr[i].a==value1:
std::vector<foo *> const &values = index[value1];
for (int i=0; i<values.size(); i++) {
if (value2 <= values[i]->b
&& value2 >= values[i]->c
&& value3 >= values[i]->d) {
// yay, found something
}
}
This way, instead of looking at 1000 items in fooArray each time, you look at an average of 100 each time. If you want still more speed, the next step would be to sort the items in each vector in the index based on the value of b. This will let you find the lower bound for value2 using a binary search instead of a linear search, reducing ~50 comparisons to ~10. Since you've sorted it by b, from that point onward you don't have to compare value2 to b -- you know exactly where the rest of the numbers that satisfy the inequality are, so you only have to compare to c and d.
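A minimal sketch of that refinement, assuming a foo struct with fields a, b, c, d and that each index[value] vector has been pre-sorted by descending b; std::lower_bound finds where b drops below value2, and everything before that point only needs the c and d checks:

#include <algorithm>
#include <vector>

void lookup(const std::vector<std::vector<foo*>>& index,
            int value1, double value2, double value3) {
    const std::vector<foo*>& values = index[value1];
    auto firstTooSmall = std::lower_bound(values.begin(), values.end(), value2,
        [](const foo* p, double v) { return p->b >= v; });   // true while b >= value2
    for (auto it = values.begin(); it != firstTooSmall; ++it) {
        if ((*it)->c <= value2 && (*it)->d <= value3) {
            // yay, found something
        }
    }
}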
You might also consider another approach based on the limited range of the numbers: 0 to 1000 can be represented in 10 bits. Using some bit-twiddling, you could combine three fields into a single 32-bit number, which would let the compiler compare all three at once, instead of in three separate operations. Getting this right is a little tricky, but once you do, it could roughly triple the speed again.
I think using kd-tree would be appropriate.
If there aren't many conflicts with a then hashing/indexing a might resolve your problem.
Anyway if that doesn't work I suggest using kd-tree.
First do a table of multiple kd-trees. Index them with a.
Then implement a kd-tree for each a value with 3-dimensions in directions b, c, d.
Then when searching, first index into the appropriate kd-tree with a, and then search that kd-tree with your limits. Basically you'll do a range search.
Kd-tree
You'll get your answer in O(L^(2/3)+m), where L is the number of elements in appropriate kd-tree and m is the number of matching points.
Something better that I found is Range Tree. This might be what you are looking for.
It's fast. It'll answer your query in O(log^3(L)+m). (Unfortunately I don't know much about range trees.)
Well, let's have a go.
First of all, the == operator calls for a pigeon-hole approach. Since we are talking about int values in the [0,1000] range, a simple table is good.
std::vector<Bucket1> myTable(1001, /*MAGIC_1*/); // suspense
The idea of course is that you will find YourObject instance in the bucket defined for its a attribute value... nothing magic so far.
Now on the new stuff.
&& fooArr[i].b >= value2
&& fooArr[i].c <= value2 //yes again value2
&& fooArr[i].d <= value3
The use of value2 is tricky, but you said you did not care for space right ;) ?
typedef std::vector<Bucket2> Bucket1;
/*MAGIC_1*/ <-- Bucket1(1001, /*MAGIC_2*/) // suspense ?
A Bucket1 instance will have in its ith position all instances of YourObject for which yourObject.c <= i <= yourObject.b
And now, same approach with the d.
typedef std::vector< std::vector<YourObject*> > Bucket2;
/*MAGIC_2*/ <-- Bucket2(1001)
The idea is that the std::vector<YourObject*> at index ith contains a pointer to all instances of YourObject for which yourObject.d <= i
Putting it altogether!
class Collection
{
public:
Collection(size_t aMaxValue, size_t bMaxValue, size_t dMaxValue);
// prefer to use unsigned type for unsigned values
void Add(const YourObject& i);
// Pred is a unary operator taking a YourObject& and returning void
template <class Pred>
void Apply(int value1, int value2, int value3, Pred pred);
// Pred is a unary operator taking a const YourObject& and returning void
template <class Pred>
void Apply(int value1, int value2, int value3, Pred pred) const;
private:
// List behaves nicely with removal,
// if you don't plan to remove, use a vector
// and store the position within the vector
// (NOT an iterator because of reallocations)
typedef std::list<YourObject> value_list;
typedef std::vector<value_list::iterator> iterator_vector;
typedef std::vector<iterator_vector> bc_buckets;
typedef std::vector<bc_buckets> a_buckets;
typedef std::vector<a_buckets> buckets_t;
value_list m_values;
buckets_t m_buckets;
}; // class Collection
Collection::Collection(size_t aMaxValue, size_t bMaxValue, size_t dMaxValue) :
m_values(),
m_buckets(aMaxValue+1,
a_buckets(bMaxValue+1, bc_buckets(dMaxValue+1))
)
{
}
void Collection::Add(const YourObject& object)
{
value_list::iterator iter = m_values.insert(m_values.end(), object);
a_buckets& a_bucket = m_buckets[object.a];
for (int i = object.c; i <= object.b; ++i)
{
bc_buckets& bc_bucket = a_bucket[i];
// the object must be reachable for every value3 >= object.d (the query checks d <= value3)
for (int j = object.d; j < (int)bc_bucket.size(); ++j)
{
bc_bucket[j].push_back(iter);
}
}
} // Collection::Add
template <class Pred>
void Collection::Apply(int value1, int value2, int value3, Pred pred)
{
iterator_vector const& indexes = m_buckets[value1][value2][value3];
BOOST_FOREACH(value_list::iterator it, indexes)
{
pred(*it);
}
} // Collection::Apply<Pred>
template <class Pred>
void Collection::Apply(int value1, int value2, int value3, Pred pred) const
{
iterator_vector const& indexes = m_buckets[value1][value2][value3];
// Promotion from value_list::iterator to value_list::const_iterator is ok
// The reverse is not, which is why we keep iterators
BOOST_FOREACH(value_list::const_iterator it, indexes)
{
pred(*it);
}
} // Collection::Apply<Pred>
So, admittedly, adding and removing items to that collection will cost.
Furthermore, you have (aMaxValue + 1) * (bMaxValue + 1) * (dMaxValue + 1) std::vector<value_list::iterator> stored, which is a lot.
However, Collection::Apply complexity is roughly k applications of Pred where k is the number of items which match the parameters.
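A hypothetical usage of the Collection above (the Print functor, allObjects and the bounds are illustrative only):

struct Print {
    void operator()(const YourObject& o) const { /* do something with the match */ }
};

Collection coll(1000, 1000, 1000);                 // aMaxValue, bMaxValue, dMaxValue
for (const YourObject& obj : allObjects)           // allObjects: your 1000 input records
    coll.Add(obj);
coll.Apply(value1, value2, value3, Print());       // runs Print on every matching object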
I am looking for a review there, not sure I got all the indexes right oO
If your app is already using a database then just put them in a table and use a query to find it. I use mysql in a few of my apps and would recommend it.
First, use a different table for each value of a...
Make a table num of the numbers that share that a.
Then make 2 index tables, each with 1000 rows.
Each index-table row contains an integer (a bitmask) that records which numbers are involved.
For example let's say you have values in the array
(ignoring a because we have a table for each a value)
b = 96 46 47 27 40 82 9 67 1 15
c = 76 23 91 18 24 20 15 43 17 10
d = 44 30 61 33 21 52 36 70 98 16
then the index-table values for rows 50 and 20 are:
idx[a].bc[50] = 0000010100
idx[a].d[50] = 1101101001
idx[a].bc[20] = 0001010000
idx[a].d[20] = 0000000001
so let's say you do func(a, 20, 50).
Then to get which numbers are involved you do:
g = idx[a].bc[20] & idx[a].d[50];
Then g has a 1 bit for each number you have to deal with. If you don't
need the array values then you can just do a population count on g and
do the inner thing popCount(g) times.
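A minimal sketch of how those tables could be built and queried, assuming a foo struct with the fields a, b, c, d, at most 64 elements per a-group, and bit i standing for element i counted from the least-significant end (so these masks are simply the reverse of the left-to-right strings shown above):

#include <cstdint>
#include <vector>

struct foo { int a, b, c, d; };

struct Index {
    std::uint64_t bc[1001];  // bit i set  <=>  elems[i].c <= v && v <= elems[i].b
    std::uint64_t d[1001];   // bit i set  <=>  elems[i].d <= v
};

void build(Index& idx, const std::vector<foo>& elems) {  // elems: all entries sharing one a
    for (int v = 0; v <= 1000; ++v) {
        idx.bc[v] = idx.d[v] = 0;
        for (std::size_t i = 0; i < elems.size(); ++i) {
            if (elems[i].c <= v && v <= elems[i].b) idx.bc[v] |= std::uint64_t(1) << i;
            if (elems[i].d <= v)                    idx.d[v]  |= std::uint64_t(1) << i;
        }
    }
}

// query for func(value1, value2, value3): every set bit of g marks a matching element
// std::uint64_t g = idx.bc[value2] & idx.d[value3];   // idx is the Index built for a == value1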
You can do
unsigned long long tg = g;
int n = 0;
while (tg > 0) {
    if (tg & 1) {
        // do your stuff with element n
    }
    tg >>= 1;   // C++ has no >>>; >> on an unsigned value is already a logical shift
    n++;
}
Maybe it can be improved in the tg >>= 1; n++; part by skipping over runs of zeros, but I have no idea if that's possible. It should be considerably faster than your current approach because all the variables for the loop are in registers.
As pmg said, the idea is to eliminate as many comparisons as possible. Obviously you won't have 4000 comparisons. That would require that all 1000 elements pass the first test, which would then be redundant. Apparently there are only 10 values of a, hence 10% passes that check. So, you'd do 1000 + 100 + ? + ? checks. Let's assume +50+25, for a total of 1175.
You'd need to know how a,b,c,d and value1, 2 and 3 are distributed to decide exactly what's fastest. We only know that a can have 10 values, and we presume that value1 has the same domain. In that case, binning by a can reduce it to an O(1) operation to get the right bin, plus the same 175 checks further on. But if b,c and value2 effectively form 50 buckets, you could find the right bucket again in O(1). Yet each bucket would now have an average of 20 elements, so you'd only need 35 tests (80% reduction). So, data distribution matters here. Once you understand your data, the algorithm will be clear.
Look, this is just a linear search. It would be nice if you could do a search that scales up better, but your complex matching requirements make it unclear to me whether it's even possible to, say, keep it sorted and use a binary search.
Having said this, perhaps one possibility is to generate some indexes. The main index might be a dictionary keyed on the a property, associating it with a list of elements with the same value for that property. Assuming the values for this property are well-distributed, it would immediately eliminate the overwhelming majority of comparisons.
If the property has a limited number of values, then you could consider adding an additional index which sorts items by b and maybe even another that sorts by c (but in the opposite order).
You can use hash_set from the Standard Template Library (STL); this will give you a very efficient implementation. The complexity of your search would be O(1).
Here is the link: http://www.sgi.com/tech/stl/hash_set.html
--EDIT--
Declare a new struct which will hold your variables, overload the comparison operators and make a hash_set of this new struct. Every time you want to search, create a new object with your variables and pass it to the hash_set method "find".
It seems that hash_set is not mandatory in the STL, so you can use set instead and it will give you O(log N) complexity for searching.
Here is an example:
#include <cstdlib>
#include <iostream>
#include <set>
using namespace std;
struct Obj{
public:
Obj(double a, double b, double c, double d){
this->a = a;
this->b = b;
this->c = c;
this->d = d;
}
double a;
double b;
double c;
double d;
friend bool operator < ( const Obj &l, const Obj &r ) {
if(l.a != r.a) return l.a < r.a;
if(l.b != r.b) return l.b < r.b;
if(l.c != r.c) return l.c < r.c;
if(l.d != r.d) return l.d < r.d;
return false;
}
};
int main(int argc, char *argv[])
{
set<Obj> A;
A.insert( Obj(1,2,3,4));
A.insert( Obj(16,23,36,47));
A.insert(Obj(15,25,35,43));
Obj c(1,2,3,4);
A.find(c);
cout << A.count(c);
system("PAUSE");
return EXIT_SUCCESS;
}