SplayTreeSet index of element in logarithmic time - list

I am storing a collection of objects in a SplayTreeSet
class Obj {
  final DateTime date;
  Obj(this.date);
}

var list = [Obj(DateTime(2020)), Obj(DateTime(2021))]; // can contain many elements
var listInOrder = SplayTreeSet<Obj>.from(list, (a, b) => b.date.compareTo(a.date));
I would like to take slices of listInOrder to get the elements between certain dates. The range of dates is small compared to the total number of objects.
Of course I could do that by looping and skipping / taking elements:
List<Obj> getObjsBetween(DateTime from, DateTime to) {
  return listInOrder
      .skipWhile((e) => e.date.isAfter(to))
      .takeWhile((e) => e.date.isAfter(from))
      .toList();
}
I wonder if I could do better and use the fact that the list is in order to find the start of the region that I want to get.
SplayTreeSet itself does not look like it provides the functionality I am looking for.
There is a binarySearch function, but it takes a List, so I would need to make a copy of the data, which is not a good option.
How could I get a slice of an ordered set in O(log n + k) time (k being the number of objects in the range)?

How to get all combinations by matching fields

I have 4 classes:
Jacket, Shirt, Tie and Outfit.
class Outfit {
    //...
    shared_ptr<Jacket> jacket;
    shared_ptr<Shirt> shirt;
    shared_ptr<Tie> tie;
    //...
};
class Jacket {
public:
    Jacket(const string &brand, const string &color, const string &size);
    // ... getters, setters etc. ...
private:
    string brand, color, size;
};
// The same for shirt and tie, brand, followed by color or size
I need to get all the possible matches between jacket and shirts, jacket and ties respectively. Like this:
vector<shared_ptr<Jacket>> jackets {
    make_shared<Jacket>("some_brand", "blue", "15"),
    make_shared<Jacket>("some_other_brand", "red", "14")
};
vector<shared_ptr<Shirt>> shirts {
    make_shared<Shirt>("brand1", "doesnotmatterformatchingcriteria", "15"),
    make_shared<Shirt>("brand6", "blue", "15"),
    make_shared<Shirt>("brand3", "blue", "14"),
    make_shared<Shirt>("brand5", "red", "15"),
    make_shared<Shirt>("brand6", "red", "14")
};
vector<shared_ptr<Tie>> ties {
    make_shared<Tie>("other_brand1", "blue"),
    make_shared<Tie>("other_brand2", "blue"),
    make_shared<Tie>("other_brand6", "blue"),
    make_shared<Tie>("other_brand7", "blue"),
};
void getAllPosibilities(vector<Outfit> &outfits) {
    for (const auto &jacket : jackets) {
        for (const auto &shirt : shirts) {
            if (jacket->getSizeAsString() == shirt->getSizeAsString()) {
                for (const auto &tie : ties) {
                    if (jacket->getColor() == tie->getColor()) {
                        outfits.push_back(Outfit(jacket, shirt, tie));
                    }
                }
            }
        }
    }
}
So basically I want all the combinations, regardless of the brand name, matching only on the fields I specify. But I think this is painfully slow, considering I keep nesting for loops. In my actual problem I have even more fields to match and more classes, and I think this is not ideal at all.
Is there any better/simpler solution than doing this?
What you're doing here is commonly known in databases as a join. As an SQL query, your code would look like this:
select * from jacket, shirt, tie where jacket.size = shirt.size and jacket.color = tie.color;
Algorithmic Ideas
Nested Loop Join
Now, what you've implemented is known as a nested loop join, which usually would have complexity O(jackets * shirts * ties), however, you have an early return inside your shirt-loop, so you save some complexity there and reduce it to O(jackets * shirts + jackets * matching_shirts * ties).
For small data sets, as the one you provided, this nested loop join can be very effective and is usually chosen. However, if the data gets bigger, the algorithm can quickly become slow. Depending on how much additional memory you can spare and whether sorting the input sequences is okay with you, you might want to use approaches that utilize sorting, as #Deduplicator initially pointed out.
Sort Merge Join
The sort-merge-join usually is used for two input sequences, so that after both sequences have been sorted, you only need to traverse each sequence once, giving you complexity of O(N log N) for the sorting and O(N) for the actual join phase. Check out the Wikipedia article for a more in-depth explanation. However, when you have more than two input sequences, the logic can become hard to maintain, and you have to traverse one of the input sequences more than once. Effectively, we will have O(N log N) for the sorting of jackets, shirts and ties and O(jackets + shirts + colors * ties) complexity for the actual joining part.
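For illustration, here is a rough sketch of my own (not the implementations linked on godbolt below) of what the merge phase could look like for jackets and shirts joined on size, assuming the question's getSizeAsString() accessor; the resulting jacket-shirt pairs would then be joined against the ties in the same fashion:
#include <algorithm>
#include <memory>
#include <utility>
#include <vector>

// Sketch only: assumes the question's Jacket/Shirt types with getSizeAsString().
// Takes copies so the caller's vectors stay unsorted.
std::vector<std::pair<std::shared_ptr<Jacket>, std::shared_ptr<Shirt>>>
mergeJoinBySize(std::vector<std::shared_ptr<Jacket>> jackets,
                std::vector<std::shared_ptr<Shirt>> shirts) {
    auto bySize = [](const auto &a, const auto &b) {
        return a->getSizeAsString() < b->getSizeAsString();
    };
    std::sort(jackets.begin(), jackets.end(), bySize);
    std::sort(shirts.begin(), shirts.end(), bySize);

    std::vector<std::pair<std::shared_ptr<Jacket>, std::shared_ptr<Shirt>>> matches;
    auto js = jackets.begin();
    auto ss = shirts.begin();
    while (js != jackets.end() && ss != shirts.end()) {
        if (bySize(*js, *ss)) { ++js; continue; }   // jacket size is smaller: advance jackets
        if (bySize(*ss, *js)) { ++ss; continue; }   // shirt size is smaller: advance shirts
        // equal sizes: pair this jacket with the whole block of equal-size shirts
        for (auto s2 = ss; s2 != shirts.end() && !bySize(*js, *s2); ++s2)
            matches.emplace_back(*js, *s2);
        ++js;  // ss stays at the block start in case the next jacket has the same size
    }
    return matches;
}
Because both sequences are sorted by size, the two iterators only ever move forward, apart from the rescan of each matching block, which is bounded by the size of the output.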
Sort + Binary Search Join
In the answer #Deduplicator gave, they utilized sorting, but in a different way: Instead of sequentially going through the input sequences, they used binary search to quickly find the block of matching elements. This gives O(N log N) for the sorting and O(jackets * log shirts + log ties * matching jacket-shirt combinations + output_elements) for the actual join phase.
Hash Join
However, all of these approaches can be trumped if the elements you have can easily be hashed, as hashmaps allow us to store and find potential join partners incredibly fast. Here, the approach is to iterate over all but one input sequence once and store the elements in a hashmap. Then, we iterate over the last input sequence once, and, using the hash map, find all matching join partners. Again, Wikipedia has a more in-depth explanation. We can utilize std::unordered_map here. We should be able to find a good hashing function here, so this gives us a total runtime of O(jackets + shirts + ties + total_output_tuples), which is a lower bound on how fast you can be: You need to process all input sequences once, and you need to produce the output sequence.
Also, the code for it is rather beautiful (imho) and easy to understand:
void hashJoin(vector<Outfit> &outfits) {
    using shirtsize_t = decltype(Shirt::size);
    std::unordered_map<shirtsize_t, vector<shared_ptr<Shirt>>> shirts_by_sizes;
    using tiecolor_t = decltype(Tie::color);
    std::unordered_map<tiecolor_t, vector<shared_ptr<Tie>>> ties_by_colors;
    for (const auto& shirt : shirts) {
        shirts_by_sizes[shirt->size].push_back(shirt);
    }
    for (const auto& tie : ties) {
        ties_by_colors[tie->color].push_back(tie);
    }
    for (const auto& jacket : jackets) {
        for (const auto& matching_shirt : shirts_by_sizes[jacket->size]) {
            for (const auto& matching_tie : ties_by_colors[jacket->color]) {
                outfits.push_back(Outfit{jacket, matching_shirt, matching_tie});
            }
        }
    }
}
However, in the worst case, if our hashing does not give us uniform hashes, the O(1) access can degrade and we might experience worse complexity. You would want to inspect the hash maps to spot this, and replace the hash function in that case.
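Swapping in your own hasher for std::unordered_map only takes an extra template argument; the sketch below uses a deliberately simple (made-up) polynomial string hash and assumes the keys are std::string:
#include <cstddef>
#include <string>
#include <unordered_map>

// A simple replacement hasher; anything with a better distribution
// over your actual key set can be dropped in the same way.
struct SizeHash {
    std::size_t operator()(const std::string &s) const noexcept {
        std::size_t h = 0;
        for (unsigned char c : s) h = h * 131 + c;  // plain polynomial rolling hash
        return h;
    }
};

// The hasher goes in as the third template argument, e.g. for the shirts-by-size map:
// std::unordered_map<std::string, std::vector<std::shared_ptr<Shirt>>, SizeHash> shirts_by_sizes;
std::unordered_map<std::string, int, SizeHash> example_map;  // standalone example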
Implementations
I've posted working implementations of all four algorithms I discussed here on godbolt. However, since they are rather long, I only included the (superior) hash join algorithm in this answer.
Lower bounds and output elements
As #Dominique pointed out, there is no way to have a better run time complexity than O(output_element_count), since you have to produce each output element at least once. So, if you expect that your result is (asymptotically) close to jackets * shirts * ties, the nested loop join is the variant with the lowest overhead, and thus should be fastest. However, I think this will not be the case here.
Sorting is often a good idea:
static constexpr auto size_comp = [](auto&& a, auto&& b){
    return a->getSizeAsString() < b->getSizeAsString(); };
static constexpr auto color_comp = [](auto&& a, auto&& b){
    return a->getColor() < b->getColor(); };

std::sort(jackets.begin(), jackets.end(), size_comp);
std::sort(shirts.begin(), shirts.end(), size_comp);
std::sort(ties.begin(), ties.end(), color_comp);

auto ps = shirts.begin();
for (auto pj = jackets.begin(); pj != jackets.end() && ps != shirts.end(); ++pj) {
    ps = std::lower_bound(ps, shirts.end(), *pj, size_comp);
    auto pt = std::lower_bound(ties.begin(), ties.end(), *pj, color_comp);
    for (auto psx = ps; psx != shirts.end() && !size_comp(*pj, *psx); ++psx) {
        for (auto ptx = pt; ptx != ties.end() && !color_comp(*pj, *ptx); ++ptx)
            outfits.emplace_back(*pj, *psx, *ptx);
    }
}
I don't think it's possible to optimise if you are interested in all combinations, for this simple reason:
Imagine you have 2 brands of jackets, 5 of shirts and 4 of ties; then you are looking at 2*5*4 = 40 possibilities. That's exactly the amount you're looking for, so there is nothing to optimise. In that case the following loop is fine:
for all jackets_brands:
    for all shirts_brands:
        for all ties_brands:
            ...
However, imagine you have some criteria, like some of the 5 brands don't go together with some of the 4 brands of the ties. In that case, it might be better to alter the sequence of the for-loops, as you can see here:
for all shirts_brands:
    for all ties_brands:
        if go_together(this_particular_shirts_brand, this_particular_ties_brand)
            then for all jackets_brands:
                ...
In this way, you might avoid some unnecessary loops.

Find minimum value at each index after queries which tell you minimum value over a range

Assume that initially each element of array a has the value infinity.
Now M queries of the form l r x are given.
Here l to r is the range in which a[i] needs to be updated to x whenever a[i] > x, for l <= i <= r (with 1 <= l <= r <= n).
After the M queries you need to output the minimum value at each index.
One way to do this is brute force:
memset(a, inf, sizeof(a));
while (j < m)
{
    scanf("%d %d %d", &l, &r, &c);
    for (i = l - 1; i < r; i++)
    {
        if (a[i] > c)
            a[i] = c;
    }
    j++;
}
for (i = 0; i < n; i++)
    printf("%d ", a[i]);
This takes O(mn) time, since a single query can cover up to n elements in the worst case.
What are more efficient ways to solve this in lesser time complexity?
There is an approach with a different asymptotic complexity. It involves keeping a sorted list of the begin and end points of the queries. To avoid actually sorting, I'm using a sparse array the size of a.
Now, the general idea is that you store the queries and, while iterating over the array, keep a heap containing the queries whose range you are currently in:
# size of array (n)
count = ...
# for each array index you have a list of queries that
# start or end at this index
list<queries> l[count]
list<queries> r[count]
heap<queries> h    # ordered by query value
for i in range(count):
    for q in l[i]:
        h.push(q)
    if h is empty:
        output(inf)
    else:
        output(h.lowest().value)
    for q in r[i]:
        h.remove(q)
The actual performance of this (and other algorithms) greatly depends on the size of the array and density of the queries, none of which is covered in the asymptotic complexity of this algorithm though. Finding an optimal algorithm can't be done while ignoring the actual input data. It could also be worthwhile to change algorithms depending on the data.
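For concreteness, here is a minimal C++ rendering of that sweep (my sketch, not the answerer's code), using a std::multiset in place of the heap; names like starting/ending are made up:
#include <cstdio>
#include <set>
#include <vector>

int main() {
    int n, m;
    std::scanf("%d %d", &n, &m);
    std::vector<std::vector<int>> starting(n), ending(n); // queries starting/ending at each index
    for (int j = 0; j < m; ++j) {
        int l, r, x;                                      // 1-based inclusive range [l, r], value x
        std::scanf("%d %d %d", &l, &r, &x);
        starting[l - 1].push_back(x);                     // query becomes active at index l-1
        ending[r - 1].push_back(x);                       // and is removed after index r-1
    }
    std::multiset<int> active;                            // values of all queries covering the current index
    for (int i = 0; i < n; ++i) {
        for (int x : starting[i]) active.insert(x);
        if (active.empty()) std::printf("inf ");
        else                std::printf("%d ", *active.begin());
        for (int x : ending[i]) active.erase(active.find(x));
    }
    return 0;
}
Each query is inserted and erased exactly once, so with the multiset the whole sweep runs in O((n + m) log m).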
Note: my answer assumes that the problem is online, so you must execute updates and queries as they arrive. An advantage of this is that my solution is more robust, allowing you to add more types of updates and queries in the same complexity. The disadvantage is that it might not be the absolute best choice for your problem if you're dealing with an offline problem.
You can use a segment tree. Have each node in the segment tree store the minimum value set for its associated interval (initially infinity, something very large) and use a lazy update and query scheme.
Update(left, right, c)
Update(node, left, right, c):
    if node.interval does not intersect [left, right]:
        return
    if node.interval included in [left, right]:
        node.minimum = min(c, node.minimum)
        return
    Update(node.left, left, right, c)
    Update(node.right, left, right, c)
Query(index)
Query(node, minimum = infinity, index):
    if node.interval == [index, index]:
        return min(minimum, node.minimum)
    if index included in node.left.interval:
        return Query(node.left, min(minimum, node.minimum), index)
    return Query(node.right, min(minimum, node.minimum), index)
Total complexity: O(log n) for each update and query operation. You need to call Query for every element in the end.
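A compact C++ sketch of this scheme (my own naming; best[node] holds the minimum value "set" on that node's interval) might look like this:
#include <algorithm>
#include <climits>
#include <vector>

struct SegTree {
    int n;
    std::vector<int> best; // minimum value "set" on each node's interval

    explicit SegTree(int n) : n(n), best(4 * n, INT_MAX) {}

    // apply "a[i] = min(a[i], c)" for every i in [l, r]
    void update(int node, int lo, int hi, int l, int r, int c) {
        if (r < lo || hi < l) return;             // no intersection
        if (l <= lo && hi <= r) {                 // node interval fully inside [l, r]
            best[node] = std::min(best[node], c);
            return;
        }
        int mid = (lo + hi) / 2;
        update(2 * node, lo, mid, l, r, c);
        update(2 * node + 1, mid + 1, hi, l, r, c);
    }

    // fold in the minima stored on the path from the root down to leaf i
    int query(int node, int lo, int hi, int i) const {
        if (lo == hi) return best[node];
        int mid = (lo + hi) / 2;
        int child = (i <= mid) ? query(2 * node, lo, mid, i)
                               : query(2 * node + 1, mid + 1, hi, i);
        return std::min(best[node], child);
    }

    void update(int l, int r, int c) { update(1, 0, n - 1, l, r, c); }
    int query(int i) const { return query(1, 0, n - 1, i); }
};
After processing all updates, calling query(i) for each i from 0 to n-1 yields the final array; each update touches O(log n) nodes and each query walks one root-to-leaf path.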

Java list: get amount of Pairs with pairwise different Keys using lambda expressions

I have a list of key-value pairs and I want to filter the list so that every key occurs only once.
So that a list of e.g. {Pair(1,2), Pair(1,4), Pair(2,2)} becomes {Pair(1,2), Pair(2,2)}.
It doesn't matter which Pair gets filtered out as I only need the size
(maybe there's a different way to get the amount of pairs with pairwise different key values?).
All of this again happens inside another stream over an array of lists (of key-value pairs), and the results are all added up.
I basically want the amount of collisions in a hashmap.
I hope you understand what I mean; if not please ask.
public int collisions() {
    return Stream.of(t)
        .filter(l -> l.size() > 1)
        .filter(/* Convert l to list of Pairs with pairwise different Keys */)
        .mapToInt(l -> l.size() - 1)
        .sum();
}
EDIT:
public int collisions() {
    return Stream.of(t)
        .forEach(currentList = stream().distinct().collect(Collectors.toList())) // Compiler Error, how do I do this?
        .filter(l -> l.size() > 1)
        .mapToInt(l -> l.size() - 1)
        .sum();
}
I overrode the equals of Pair to return true if the keys are identical, so now I can use distinct to remove "duplicates" (Pairs with equal keys).
Is it possible to, in forEach, replace the currentElement with the same List "distincted"? If so, how?
Regards,
Claas M
I'm not sure whether you want the sum of the collision counts per list, or the number of collisions if all lists were merged into a single one first. I assumed the former, but if it's the latter the idea does not change much.
This is how you could do it with Streams:
int collisions = Stream.of(lists)
    .flatMap(List::stream)
    .mapToInt(l -> l.size() - (int) l.stream().map(p -> p.k).distinct().count())
    .sum();
Stream.of(lists) will give you a Stream<List<List<Pair<Integer, Integer>>>> with a single element.
Then you flatMap it, so that you have a Stream<List<Pair<Integer, Integer>>>.
From there, you mapToInt each list by subtracting from its original size the number of distinct keys it contains (l.stream().map(p -> p.k).distinct().count()).
Finally, you call sum to have the overall amount of collisions.
Note that you could use mapToLong to get rid of the cast but then collisions has to be a long (which is maybe more correct if each list has a lot of "collisions").
For example given the input:
List<Pair<Integer, Integer>> l1 = Arrays.asList(new Pair<>(1,2), new Pair<>(1,4), new Pair<>(2,2));
List<Pair<Integer, Integer>> l2 = Arrays.asList(new Pair<>(2,2), new Pair<>(1,4), new Pair<>(2,2));
List<Pair<Integer, Integer>> l3 = Arrays.asList(new Pair<>(3,2), new Pair<>(3,4), new Pair<>(3,2));
List<List<Pair<Integer, Integer>>> lists = Arrays.asList(l1, l2, l3);
It will output 4 as you have 1 collision in the first list, 1 in the second and 2 in the third.
Don't use a stream. Dump the list into a SortedSet with a custom comparator and diff the sizes:
List<Pair<K, V>> list; // given this
Set<Pair<K, V>> set = new TreeSet<>((a, b) -> a.getKey().compareTo(b.getKey()));
set.addAll(list);
int collisions = list.size() - set.size();
If the key type isn't comparable, alter the comparator lambda accordingly.

Fastest way to check list of integers against a list of Ranges in scala?

I have a list of integers and I need to find out the range each one falls in. I have a list of ranges which might be of size 2 to 15 at most. Currently, for every integer, I check through the list of ranges and find its location. But this takes a lot of time, as the list of integers I need to check contains a few thousand elements.
//list of integers
val numList : List[(Int,Int)] = List((1,4),(6,20),(8,15),(9,15),(23,27),(21,25))
//list of ranges
val rangesList:List[(Int,Int)] = List((1,5),(5,10),(15,30))
import scala.util.control.Breaks

def checkRegions(numPos: (Int, Int), posList: List[(Int, Int)]): Unit = {
  val loop = new Breaks()
  loop.breakable {
    for (va <- 0 until posList.length) {
      if (numPos._1 >= posList(va)._1 && numPos._2 <= posList(va)._2) {
        // i save "va"
        loop.break()
      }
    }
  }
}
Currently, for every element in numList, I go through rangesList to find its range and save the range's location. Is there any faster/better approach to this?
Update: It's actually a list of tuples that is compared against a list of ranges.
First of all, using apply on a List is problematic, since it takes linear run time.
List(1,2,3)(2) has to traverse the whole list to finally get the last element at index 2.
If you want your code to be efficient, you should either find a way around it or choose another data structure. Data structures like IndexedSeq have constant time indexing.
You should also avoid breaks, as to my knowledge, it works via exceptions and that is not a good practice. There are always ways around it.
You can do something like this:
val numList: List[(Int, Int)] = List((1,4), (6,20), (8,15), (9,15), (23,27), (21,25))
val rangeList: List[(Int, Int)] = List((1,5), (5,10), (15,30))

def getRegions(numList: List[(Int, Int)], rangeList: List[(Int, Int)]) = {
  val indexedRangeList = rangeList.zipWithIndex
  numList.map { case (a, b) =>
    indexedRangeList
      .find { case ((min, max), index) => a >= min && b <= max }
      .fold(-1)(_._2)
  }
}
And use it like this:
getRegions(numList, rangeList)
//yields List(0, -1, -1, -1, 2, 2)
I chose to yield -1 when no range matches. The key thing is that you zip the ranges with their indices beforehand; therefore we know, for each range, which index it has, and we never use apply.
If you use this method to get the indices to again access the ranges in rangeList via apply, you should consider changing to IndexedSeq.
The apply will of course only be costly when the number of ranges gets big. If, as you mentioned, it is only 2-15, then it is no problem. I just want to give you the general idea.
One approach uses parallel collections with par, together with indexWhere, which delivers the index of the first item in a collection that satisfies a condition.
For readability consider this predicate for checking interval inclusion,
def isIn( n: (Int,Int), r: (Int,Int) ) = (r._1 <= n._1 && n._2 <= r._2)
Thus,
val indexes = numList.par.map {n => rangesList.indexWhere(r => isIn(n,r))}
indexes: ParVector(0, -1, -1, -1, 2, 2)
delivers, for each number, the index in the ranges collection where it is included. Value -1 indicates the condition did not hold.
For associating numbers with range indexes, consider this,
numList zip indexes
res: List(((1,4), 0), ((6,20),-1), ((8,15),-1),
((9,15),-1), ((23,27),2), ((21,25),2))
Parallel collections may prove more efficient than the non-parallel counterpart for performing computations on a very large number of items.

How to use couchdb reduce for a map that has multiple elements for values

I can't seem to find an answer for this anywhere, so maybe this is not allowed but I can't find any couchdb info that confirms this. Here is a scenario:
Suppose for a map function, within Futon, I'm emitting a value for a key, ex. K(1). This value is comprised of two separate floating point numbers A(1) and B(1) for key K(1). I would like to have a reduction perform the sample average of the ratio A(N)/B(N) over all K(N) from 1 to N. The issue I'm always running into in the reduce function is for the "values" parameter. Each key is associated with a value pair of (A,B), but I can't break out the A, B floating numbers from "values".
I can't seem to find any examples on how to do this. I've already tried accessing multi-level javascript arrays for "values" but it doesn't work, below is my map function.
function(doc) {
  if (doc['Reqt.ID']) {
    doc['Reqt.ID'].forEach(function(reqt) {
      var row_index = doc['Reqt.ID'].indexOf(reqt);
      if (doc.Resource[row_index] == "Joe Smith")
        emit({rid: reqt},
             {acthrs: doc['Spent.Hours'][row_index], esthrs: doc['Estimate.Total.Hours'][row_index]});
    });
  }
}
I can get this to work (i.e. avg ratio) if I just produce a map that emits a single element value of A/B within the map function, but I'm curious about this case of multiple value elements.
How is this generally done within the Futon reduce function?
I've already tried various JSON Javascript notations such as values[key index].esthrs[0] within a for loop of the keys, but none of my combinations work.
Thank you so much.
There are two ways you could approach this. The first, my recommendation, is to change your map function to make it more of a "keys are keys and values are values" design. In your particular case, since you have two "values" you'd like to work with, Spent.Hours and Estimate.Total.Hours, that probably means you'll need two views; although you can cheat a little by issuing multiple emit()'s per row in the same view, for example:
emit(["Spent.Hours", reqt], doc['Spent.Hours'][row_index]);
emit(["Estimate.Total.Hours", reqt], doc['Estimate.Total.Hours'][row_index]);
With that approach, you can just use the predefined _stats reduce function.
Alternatively, you can define a "smart" stats function, which can do the statistics for more elaborate documents.
The standard _stats function provides count, sum, average and standard deviation. The algorithm it uses is to track the sum of the values, the sum of the values squared, and the count of values; from just these, the average (sum / count) and the standard deviation (sqrt(count * sumsq - sum^2) / count) can be calculated, and both are embedded, for convenience, in the reduced value.
Roughly, that might look like:
function(key, values, rereduce) {
  function getstats(getter) {
    var c = 0, s = 0, s2 = 0;
    values.forEach(function (row) {
      var value = getter(row);
      if (rereduce) {
        c += value.count;
        s += value.sum;
        s2 += value.sumsq;
      } else {
        c += 1;
        s += value;
        s2 += value * value;
      }
    });
    return {
      count: c,
      sum: s,
      sumsq: s2,
      average: s / c,
      stddev: Math.sqrt(c * s2 - s * s) / c
    };
  }
  return {esthrs: getstats(function(x){return x.esthrs;}),
          acthrs: getstats(function(x){return x.acthrs;})};
}