I have to design an order-book data structure that lets me query the highest price among orders that have been inserted and not yet deleted.
Insert and delete operations are given upfront in a file, where each operation looks like one of the following two:
TIMESTAMP insert ID PRICE
TIMESTAMP delete ID
where ID is an integer identifier of an order, timestamps are always increasing, and each ID appears exactly twice: once in an insert and once in a delete operation, in that order.
From this list of operations, I need to output the time-weighted average of the highest price.
As an example (using I for insert and E for delete), let's say we have the following input:
10 I 1 10
20 I 2 13
22 I 3 13
24 E 2
25 E 3
40 E 1
We can say that after each of the first five operations, the max is
10, 13, 13, 13, 10
and the time-weighted average is
(10*(20-10) + 13*(22-20) + 13*(24-22) + 13*(25-24) + 10*(40-25)) / (40-10) = 315/30 = 10.5
because 10 is the max price in the intervals [10,20] and [25,40], while 13 is the max in [20,25].
I was thinking of using an unordered_map<ID,price> and a multiset<price> to support:
insert in O(log(n))
delete in O(log(n))
getMax in O(1)
Here is what I came up with:
#include <limits>
#include <set>
#include <unordered_map>

struct order {
    int timestamp, id;
    char type;
    double price;
};

std::unordered_map<unsigned int, order> M;  // id -> order
std::multiset<double> maxPrices;            // all active prices, sorted
double totaltime = 0;  // total time during which the book was non-empty
double avg = 0;        // accumulated (max price * time); divide by totaltime at the end
double lastTS = 0;     // timestamp of the last event that could change the max

double getHighest() {
    return !maxPrices.empty() ? *maxPrices.rbegin()
                              : std::numeric_limits<double>::quiet_NaN();
}

// Accumulate the contribution of the current max over [lastTS, timestamp].
void update(const unsigned int timestamp) {
    const double timeLapse = timestamp - lastTS;
    totaltime += timeLapse;
    avg += timeLapse * getHighest();
    lastTS = timestamp;
}

void insertOrder(const order& ord) {
    if (!maxPrices.empty()) {
        if (ord.price >= getHighest()) {
            // we have a new max price: close the previous interval first
            update(ord.timestamp);
        }
    } else { // if there are no orders, this is the max for sure
        lastTS = ord.timestamp;
    }
    M[ord.id] = ord;
    maxPrices.insert(ord.price);
}

void deleteOrder(
    const unsigned int timestamp,
    const unsigned int id_ord) { // id_ord is assumed to exist in both M and maxPrices
    order ord = M[id_ord];
    if (ord.price >= getHighest()) {
        // the current max is being removed: close its interval first
        update(timestamp);
    }
    auto it = maxPrices.find(ord.price); // erase one instance only, not all equal prices
    maxPrices.erase(it);
    M.erase(id_ord);
}
// final result: avg / totaltime
This approach has O(n log n) complexity, where n is the number of operations (each insert or delete costs O(log n) in the multiset).
Is there any asymptotically faster and/or more elegant approach to solving this problem?
I recommend you take the database approach.
Place all your records into a std::vector.
Create an index table, std::map</* key type */, size_t>, which maps each key value to the index of its record in the vector. If you want the keys sorted in descending order, also supply a comparison functor.
This strategy allows you to create many index tables without having to re-sort all of your data. The map will give good search times for your keys. You can also iterate through the map to list all the keys in order.
Note: with modern computers, you may need a huge amount of data before there is a significant timing difference between a binary search (map) and a linear search (vector).
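A minimal sketch of this layout (Record, addRecord, and the choice of price as the key are illustrative, not prescribed above); note that std::map keys are unique, so duplicate prices would need std::multimap:

#include <cstddef>
#include <functional>
#include <map>
#include <vector>

struct Record {    // illustrative record type
    int    id;
    double price;
};

std::vector<Record> records;                                  // the "table"
std::map<double, std::size_t, std::greater<double>> byPrice;  // index: price -> row

void addRecord(const Record& r) {
    records.push_back(r);
    byPrice[r.price] = records.size() - 1;  // index the new row
}

// The highest price is simply the first key of the descending index:
//   double highest = byPrice.begin()->first;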
So I am pretty new to C++ and I am not sure if there is an existing data structure that facilitates what I am trying to do (so I do not reinvent the wheel):
What I am trying to do
I am reading a file that I need to parse, doing some calculations on every floating-point value in every row, and returning the top 10 results from the file in ascending order.
What I am trying to optimize
I am dealing with a 1k-row file and a 1.9-million-row file. Each row yields 72 results, so for the 1k rows I would need to allocate a vector of 72,000 elements, and for the 1.9 million rows... well, you get the idea.
What I have so far
I am currently collecting the results in a vector, which I then sort and resize to 10.
const unsigned int vector_space = circularVector.size()*72;
//vector for the results
std::vector<ResultType> results;
results.reserve(vector_space);
but this is extremely inefficient.
What I want to accomplish
I want to keep only a vector of size 10, and whenever I perform a calculation, simply insert the value into the vector and remove the largest floating-point value in it, thus maintaining the top 10 results in ascending order.
Is there a structure already in C++ that has this behavior?
Thanks!
EDIT: Changed to use the 10 lowest elements rather than the highest, as the question now makes clear that this is what's required.
You can use a std::vector of 10 elements as a max heap, in which the elements are partially sorted such that the first element always contains the maximum value. Note that the following is all untested, but hopefully it should get you started.
// Create an empty vector to hold the lowest values
std::vector<ResultType> results;

// Iterate over the first 10 entries in the file and put the results in the vector
for (... ; i < 10; i++) {
    // Calculate the value of this row
    ResultType r = ....
    // Add it to the vector
    results.push_back(r);
}

// Now that the vector is "full", turn it into a heap
std::make_heap(results.begin(), results.end());

// Iterate over all the remaining rows, adding values which are lower than the
// current maximum
for (i = 10; .....) {
    // Calculate the value for this row
    ResultType r = ....
    // Compare it to the max element in the heap
    if (r < results.front()) {
        // Move the existing maximum to the back of the vector
        std::pop_heap(results.begin(), results.end());
        // Overwrite it with the new value (replacing in place keeps the
        // heap's precondition valid and the vector at size 10)
        results.back() = r;
        // Sift the new value into place, restoring the heap
        std::push_heap(results.begin(), results.end());
    }
}

// Finally, sort the results to put them all in order
// (using sort_heap just because we can)
std::sort_heap(results.begin(), results.end());
Yes. What you want is a priority queue or heap, set up so that the largest value can be removed; do such a removal whenever the size after an insertion is greater than 10. You should be able to do this with STL classes.
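For instance, a minimal sketch with std::priority_queue (ResultType is a stand-in for the question's type): the default max-heap keeps its largest element on top, so popping whenever the size exceeds 10 leaves the 10 lowest values.

#include <queue>

using ResultType = double;  // stand-in for the question's result type

std::priority_queue<ResultType> lowest10;  // max-heap: top() is the largest kept value

void consider(const ResultType& r) {
    lowest10.push(r);            // O(log 10)
    if (lowest10.size() > 10)
        lowest10.pop();          // drop the current largest, keeping the 10 lowest
}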
You can use std::set for this, since std::set keeps its values sorted from min to max. (Note that std::set discards duplicate values; use std::multiset if duplicate results must be kept.)
void insert_value(std::set<ResultType>& myset, const ResultType& value) {
    myset.insert(value);
    const std::size_t limit = 10;
    if (myset.size() > limit) {
        myset.erase(std::prev(myset.end()));  // drop the largest, keeping the 10 lowest
    }
}
I think a max heap will work for this problem.
1- Create a max heap of size 10.
2- Fill the heap with the first 10 elements.
3- For the 11th element, compare it with the largest element, i.e. the root (the element at index 0).
4- If the 11th element is smaller, replace the root node with it and heapify again.
Repeat the same steps until the whole file is parsed.
I have a vector that contains monthyear values:
Jan2013
Jan2013
Jan2013
Jan2014
Jan2014
Jan2014
Jan2014
Feb2014
Feb2014
Basically what I want to do is search through the vector and, for every identical record, group them together, e.g.:
total count for Jan2013 = 3;
total count for Jan2014 = 4;
total count for Feb2014 = 2;
Of course, as we know, we could simply write multiple ifs to solve it:
if (monthyear == "Jan2013") {
    // add count
}
if (monthyear == "Jan2014") {
    // add count
}
if (monthyear == "Feb2014") {
    // add count
}
but no programmer would code it this way.
What if there are additional monthyears: march2014, april2014, may2014, all the way to dec2014,
and jan2015-dec2015, and so on?
I don't think I should adopt this kind of hard-coding method in the
long run, and I am looking for a more dynamic approach.
I'm not asking for code, just some steps, and perhaps some hints on which C++ methods I should research.
Thanks in advance
You can use std::map. For example:
std::map<std::string, size_t> m;
for (const std::string &s : v) ++m[s]; // v is the vector of monthyear strings
I'd probably do a std::map<monthyear, int>. For each member of your vector, increment that member of the map.
Just for completeness: the solution by @VladfromMoscow is optimal for the general case in which you have little knowledge about your input. It is of O(N log N) complexity for an input of length N.
Equivalently, you can first sort your input in O(N log N), and then iterate in O(N) over the sorted input and store the counts in a std::vector<std::pair<std::string, int>>.
However, if you have a priori information on the range of your input (say you know for sure it runs from Jan 2013 until Jan 2014), you can also directly run over your input and update a pre-allocated std::vector<std::pair<std::string, int>> in O(N) complexity; a sketch follows.
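A minimal sketch of that O(N) variant, assuming the range is known to be Jan2013 through Dec2014 (slotOf and the 24-slot layout are illustrative):

#include <string>
#include <utility>
#include <vector>

static const char* kMonths[] = {"Jan","Feb","Mar","Apr","May","Jun",
                                "Jul","Aug","Sep","Oct","Nov","Dec"};

// Map "Jan2013".."Dec2014" to 0..23; assumes well-formed input.
int slotOf(const std::string& my) {
    int month = 0;
    for (int i = 0; i < 12; ++i)
        if (my.compare(0, 3, kMonths[i]) == 0) { month = i; break; }
    return (std::stoi(my.substr(3)) - 2013) * 12 + month;
}

std::vector<std::pair<std::string, int>> countAll(const std::vector<std::string>& v) {
    std::vector<std::pair<std::string, int>> counts(24);  // pre-allocated
    for (const std::string& s : v) {
        auto& slot = counts[slotOf(s)];
        slot.first = s;   // remember the label
        ++slot.second;    // O(1) per element, O(N) overall
    }
    return counts;
}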
I am writing a little program with a GUI in C++ and Qt.
It is supposed to be similar to a vocabulary trainer. I will use it for my own studying.
I have a QList of objects (name and description as string for example).
Then I have a second QList with ints in it. For every object in my other list, there is an int in this list. The start value is 50 for every object; if the user answers correctly, it gets decremented, and vice versa.
So an object with value 70 should be shown to the user more often than an object with value 30. So in the correct-answer method I increase/decrease the value, sort the QList, and use my random algorithm:
if (packList.count() == 0) // the QList with objects
    return;

int Min = 0;
int Max = packList.count() - 1; // -1 because I need the index

qsrand(QTime::currentTime().msec());
if (Min > Max)
{
    int Temp = Min;
    Min = Max;
    Max = Temp;
}
int randNum = ((qrand() % (Max - Min + 1)) + Min); // qrand() to match qsrand()
setPage(randNum); // randNum will be used as an index in this method
Now what I need is a way to implement my priority in this random algorithm. I don't want the ones with a higher value to appear 90% of the time, but just more often, just like a vocabulary trainer.
First, a remark: you should call qsrand only once, at the beginning of the program.
Now to your algorithm: first compute the sum of all your values; call it sumValues. Then draw a random number between 0 and sumValues-1. Walk through your list, accumulating the values into a variable currentSum, until it is strictly greater than your random number, and use the index of that entry. This will be more efficient if you sort your list by decreasing values.
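A minimal sketch of that selection in standard C++ (using <random> rather than qsrand/qrand; pickIndex and std::mt19937 are my choices, not part of the answer above):

#include <numeric>
#include <random>
#include <vector>

// Pick an index with probability proportional to values[i].
int pickIndex(const std::vector<int>& values, std::mt19937& rng) {
    const int sumValues = std::accumulate(values.begin(), values.end(), 0);
    std::uniform_int_distribution<int> dist(0, sumValues - 1);
    const int r = dist(rng);
    int currentSum = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        currentSum += values[i];
        if (currentSum > r)   // first prefix sum that exceeds r
            return static_cast<int>(i);
    }
    return static_cast<int>(values.size()) - 1;  // not reached for positive weights
}

The standard library's std::discrete_distribution implements exactly this kind of weighted choice, if you'd rather not hand-roll it.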
I'm trying to obtain the top, say, 100 scores from a list of scores being generated by my program. Unfortunately the list is huge (on the order of millions to billions), so sorting is a time-intensive portion of the program.
What's the best way of doing the sorting to get the top 100 scores?
The only two methods I can think of so far are either generating all the scores into a massive array and then sorting it and taking the top 100, or generating X scores at a time, sorting them and truncating to the top 100, then continuing to generate more scores, adding them to the truncated list, and sorting again.
Either way I do it, it still takes more time than I would like; any ideas on how to do it in an even more efficient way? (I've never taken programming courses; maybe those of you with comp sci degrees know about efficient algorithms to do this, at least that's what I'm hoping.)
Lastly, what's the sorting algorithm used by the standard sort() function in C++?
Thanks,
-Faken
Edit: Just for anyone who is curious...
I did a few time trials on the before and after and here are the results:
Old program (performs sorting after each outer loop iteration):
top 100 scores: 147 seconds
top 10 scores: 147 seconds
top 1 scores: 146 seconds
Sorting disabled: 55 seconds
New program (tracking only the top scores, using the default sorting function):
top 100 scores: 350 seconds <-- hmm...worse than before
top 10 scores: 103 seconds
top 1 scores: 69 seconds
Sorting disabled: 51 seconds
New rewrite (optimized data storage, hand-written sorting algorithm):
top 100 scores: 71 seconds <-- Very nice!
top 10 scores: 52 seconds
top 1 scores: 51 seconds
Sorting disabled: 50 seconds
Done on a Core 2, 1.6 GHz... I can't wait till my Core i7 860 arrives...
There are a lot of other, even more aggressive optimizations for me to work out (mainly in the area of reducing the number of iterations I run), but as it stands right now, the speed is more than good enough; I might not even bother to work out those algorithmic optimizations.
Thanks to everyone for their input!
1. Take the first 100 scores and sort them in an array.
2. Take the next score and insertion-sort it into the array (starting at the "small" end).
3. Drop the 101st value.
4. Continue with the next value, at step 2, until done.
Over time, the list will resemble the 100 largest values more and more, so more and more often the insertion sort will abort immediately, finding that the new value is smaller than the smallest of the candidates for the top 100. A C++ sketch follows.
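A minimal C++ sketch of that bookkeeping (Score and the binary-search insertion are illustrative; the array is kept sorted ascending, smallest candidate first):

#include <algorithm>
#include <vector>

using Score = int;  // stand-in for the real score type

// Maintain the 100 largest scores seen so far, sorted ascending.
void consider(std::vector<Score>& top, Score s) {
    if (top.size() < 100) {
        top.insert(std::lower_bound(top.begin(), top.end(), s), s);
    } else if (s > top.front()) {  // beats the smallest of the candidates
        top.erase(top.begin());    // drop what is now the 101st value
        top.insert(std::lower_bound(top.begin(), top.end(), s), s);
    }   // otherwise the insertion "aborts" immediately, as described above
}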
You can do this in O(n) time for a fixed number of top elements (O(n log k) for the top k), without sorting the whole list, using a heap:
#!/usr/bin/python

import heapq

def top_n(l, n):
    top_n = []
    smallest = None

    for elem in l:
        if len(top_n) < n:
            top_n.append(elem)
            if len(top_n) == n:
                heapq.heapify(top_n)
                smallest = heapq.nsmallest(1, top_n)[0]
        else:
            if elem > smallest:
                heapq.heapreplace(top_n, elem)
                smallest = heapq.nsmallest(1, top_n)[0]

    return sorted(top_n)

def random_ints(n):
    import random
    for i in range(0, n):
        yield random.randint(0, 10000)

print top_n(random_ints(1000000), 100)
Times on my machine (Core2 Q6600, Linux, Python 2.6, measured with bash time builtin):
100000 elements: .29 seconds
1000000 elements: 2.8 seconds
10000000 elements: 25.2 seconds
Edit/addition: In C++, you can use std::priority_queue in much the same way as Python's heapq module is used here. You'll want to use the std::greater ordering instead of the default std::less, so that the top() member function returns the smallest element instead of the largest one. C++'s priority queue doesn't have the equivalent of heapreplace, which replaces the top element with a new one, so instead you'll want to pop the top (smallest) element and then push the newly seen value. Other than that the algorithm translates quite cleanly from Python to C++.
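A sketch of that translation, assuming integer scores (in the same spirit as the Python above):

#include <functional>
#include <queue>
#include <vector>

// Keep the n largest values; the min-heap's top() is the smallest survivor.
std::vector<int> top_n(const std::vector<int>& l, std::size_t n) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int elem : l) {
        if (heap.size() < n) {
            heap.push(elem);
        } else if (elem > heap.top()) {
            heap.pop();        // no heapreplace in C++: pop the smallest...
            heap.push(elem);   // ...then push the newly seen value
        }
    }
    std::vector<int> result;   // drain the heap; comes out ascending
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;
}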
Here's the 'natural' C++ way to do this:
std::vector<Score> v;
// fill in v
std::partial_sort(v.begin(), v.begin() + 100, v.end(), std::greater<Score>());
std::sort(v.begin(), v.begin() + 100);
This is effectively linear in the number of scores (std::partial_sort is O(N log 100) here, and log 100 is a constant).
The algorithm used by std::sort isn't specified by the standard, but libstdc++ (used by g++) uses an "adaptive introsort", which is essentially a median-of-3 quicksort down to a certain level, followed by an insertion sort.
Declare an array where you can put the 100 best scores. Loop through the huge list and check for each item if it qualifies to be inserted in the top 100. Use a simple insert sort to add an item to the top list.
Something like this (C# code, but you get the idea):
Score[] toplist = new Score[100];
int size = 0;
foreach (Score score in hugeList) {
    int pos = size;
    while (pos > 0 && toplist[pos - 1] < score) {
        pos--;
        if (pos < 99) toplist[pos + 1] = toplist[pos];
    }
    if (size < 100) size++;
    if (pos < size) toplist[pos] = score;
}
I tested it on my computer (Core 2 Duo 2.54 GHz, Win 7 x64) and it can process 100,000,000 items in 369 ms.
Since speed is of the essence here, and 40,000 possible high-score values is easily manageable by any of today's computers, I'd resort to bucket sort for simplicity. My guess is that it would outperform any of the algorithms proposed thus far. The downside is that you'd have to determine some upper limit for the high-score values.
So, let's assume your max high-score value is 40,000:
Make an array of 40,000 entries. Loop through your high-score values. Each time you encounter high-score x, increment array[x] by one. After this, all you have to do is count the top entries in your array, starting from the highest index, until you have counted 100 high scores.
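A minimal sketch of that bucket counting, assuming non-negative integer scores below 40,000 (the limit is illustrative):

#include <vector>

std::vector<int> counts(40000, 0);  // counts[x] = occurrences of score x

void record(int score) {
    ++counts[score];  // O(1) per score
}

// Walk from the top bucket down until 100 scores have been counted.
std::vector<int> top100() {
    std::vector<int> result;
    for (int x = 40000 - 1; x >= 0 && result.size() < 100; --x)
        for (int c = counts[x]; c > 0 && result.size() < 100; --c)
            result.push_back(x);  // descending order
    return result;
}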
You can do it in Haskell like this:
largest100 xs = take 100 $ sortBy (flip compare) xs
This looks like it sorts all the numbers into descending order (the "flip compare" bit reverses the arguments to the standard comparison function) and then returns the first 100 entries from the list. But Haskell is lazily evaluated, so the sortBy function does just enough sorting to find the first 100 numbers in the list, and then stops.
Purists will note that you could also write the function as
largest100 = take 100 . sortBy (flip compare)
This means just the same thing, but illustrates the Haskell style of composing a new function out of the building blocks of other functions rather than handing variables around the place.
You want the absolute largest X numbers, so I'm guessing you don't want some sort of heuristic. How unsorted is the list? If it's pretty random, your best bet really is just to do a quick sort on the whole list and grab the top X results.
If you can filter scores during the list generation, that's way way better. Only ever store X values, and every time you get a new value, compare it to those X values. If it's less than all of them, throw it out. If it's bigger than one of them, throw out the new smallest value.
If X is small enough you can even keep your list of X values sorted. Then comparing a new number against the smallest element is an O(1) check that tells you whether to throw it out; otherwise, a quick binary search finds where the new value goes in the list, and you can throw away the first value of the array (assuming the first element is the smallest).
Place the data into a balanced tree structure (probably a red-black tree) that keeps it sorted. Insertions should be O(lg n), and grabbing the highest x scores should be O(x lg n) at worst.
You can prune the tree every once in a while if you find you need optimizations at some point; a capped-size sketch is below.
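In C++, std::multiset is typically implemented as a red-black tree, so one way to sketch this (pruning eagerly on every insert rather than occasionally) is:

#include <set>

std::multiset<int> scores;  // red-black tree under the hood, sorted ascending

void consider(int s) {
    scores.insert(s);                  // O(lg n)
    if (scores.size() > 100)
        scores.erase(scores.begin());  // prune the smallest, keep the top 100
}
// Iterate scores in reverse to read the highest 100 values in descending order.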
If you only need to report the value of top 100 scores (and not any associated data), and if you know that the scores will all be in a finite range such as [0,100], then an easy way to do it is with "counting sort"...
Basically, create an array representing all possible values (e.g. an array of size 101 if scores can range from 0 to 100 inclusive), and initialize all its elements to 0. Then iterate through the list of scores, incrementing the corresponding entry; that is, tally the number of times each score in the range has been achieved. Then, working from the end of the array to the beginning, you can pick out the top X scores. Here is some pseudo-code:
let type Score be an integer ranging from 0 to 100, inclusive.
let scores be an array of Score objects
let scorerange be an array of integers of size 101.

for i in [0,100]:
    set scorerange[i] = 0
for each score in scores:
    set scorerange[score] = scorerange[score] + 1

let top be the number of top scores to report
let idx be an integer initialized to the end of scorerange (i.e. 100)

while (top > 0) and (idx >= 0):
    if scorerange[idx] > 0:
        report "There are " scorerange[idx] " scores with value " idx
        top = top - scorerange[idx]
    idx = idx - 1
I answered this as an interview question in 2008, implementing a templatized priority queue in C#.
using System;
using System.Collections.Generic;
using System.Text;

namespace CompanyTest
{
    // Based on pre-generics C# implementation at
    // http://www.boyet.com/Articles/WritingapriorityqueueinC.html
    // and wikipedia article
    // http://en.wikipedia.org/wiki/Binary_heap
    class PriorityQueue<T>
    {
        struct Pair
        {
            T val;
            int priority;
            public Pair(T v, int p)
            {
                this.val = v;
                this.priority = p;
            }
            public T Val { get { return this.val; } }
            public int Priority { get { return this.priority; } }
        }
        #region Private members
        private System.Collections.Generic.List<Pair> array = new System.Collections.Generic.List<Pair>();
        #endregion
        #region Constructor
        public PriorityQueue()
        {
        }
        #endregion
        #region Public methods
        public void Enqueue(T val, int priority)
        {
            Pair p = new Pair(val, priority);
            array.Add(p);
            bubbleUp(array.Count - 1);
        }
        public T Dequeue()
        {
            if (array.Count <= 0)
                throw new System.InvalidOperationException("Queue is empty");
            else
            {
                Pair result = array[0];
                array[0] = array[array.Count - 1];
                array.RemoveAt(array.Count - 1);
                if (array.Count > 0)
                    trickleDown(0);
                return result.Val;
            }
        }
        #endregion
        #region Private methods
        private static int ParentOf(int index)
        {
            return (index - 1) / 2;
        }
        private static int LeftChildOf(int index)
        {
            return (index * 2) + 1;
        }
        private static bool ParentIsLowerPriority(Pair parent, Pair item)
        {
            return (parent.Priority < item.Priority);
        }
        // Move high priority items from the bottom up the heap
        private void bubbleUp(int index)
        {
            Pair item = array[index];
            int parent = ParentOf(index);
            while ((index > 0) && ParentIsLowerPriority(array[parent], item))
            {
                // Parent is lower priority -- move it down
                array[index] = array[parent];
                index = parent;
                parent = ParentOf(index);
            }
            // Write the item once in its correct place
            array[index] = item;
        }
        // Push low priority items from the top of the heap down
        private void trickleDown(int index)
        {
            Pair item = array[index];
            int child = LeftChildOf(index);
            while (child < array.Count)
            {
                bool rightChildExists = ((child + 1) < array.Count);
                if (rightChildExists)
                {
                    bool rightChildIsHigherPriority = (array[child].Priority < array[child + 1].Priority);
                    if (rightChildIsHigherPriority)
                        child++;
                }
                // array[child] points at higher priority sibling -- move it up
                array[index] = array[child];
                index = child;
                child = LeftChildOf(index);
            }
            // Put the former root in its correct place
            array[index] = item;
            bubbleUp(index);
        }
        #endregion
    }
}
Median of medians algorithm: a selection algorithm that finds the 100th-largest score in guaranteed O(n) time; partitioning around it then yields the top 100 without a full sort.
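In C++ the selection step is available off the shelf as std::nth_element (typically introselect, linear on average rather than the guaranteed-linear median of medians); a sketch for the top 100, assuming integer scores:

#include <algorithm>
#include <functional>
#include <vector>

// Partition so the 100 largest scores occupy the front, then sort just those.
void top100(std::vector<int>& scores) {
    if (scores.size() > 100) {
        std::nth_element(scores.begin(), scores.begin() + 100, scores.end(),
                         std::greater<int>());
        scores.resize(100);  // keep only the winners
    }
    std::sort(scores.begin(), scores.end(), std::greater<int>());  // descending
}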