Subsets of similar objects by absolute decimal difference


Greetings!
I have the question below, which I solved naively with nested loops, and I'm looking for a better approach. I tried looking at counting sort, but I'm not sure how it would work with decimals. I think some dynamic programming may be needed here.
Anyone who can provide an answer with a better running time than O(N^2), please do :)
The Question:
Given a list of objects, each with an X (double) and a Y (double) value and an ID (string) value.
Two objects are considered "similar" if their X values are within 1.5 units of each other and their Y values are within 1.5 units of each other.
Print the IDs of all objects that have no similar objects.
Print the ID of the object with the most similar objects, with each similar object listed under it.
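One way to avoid comparing every pair is to bucket the objects into a grid of 1.5 x 1.5 cells, so that each object only needs to be compared against objects in its own cell and the 8 neighbouring cells. The following is a minimal sketch under that assumption, not a definitive solution: the names Obj, similar and similar_lists are hypothetical, "within 1.5 units" is read as inclusive, and floating-point edge cases exactly on a cell boundary are ignored.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type matching the question: an ID plus X and Y values.
struct Obj {
    std::string id;
    double x;
    double y;
};

// Two objects are similar when both coordinate differences are within tol.
bool similar(const Obj& a, const Obj& b, double tol) {
    return std::fabs(a.x - b.x) <= tol && std::fabs(a.y - b.y) <= tol;
}

// For each object, returns the indices of every other object similar to it.
std::vector<std::vector<std::size_t>> similar_lists(const std::vector<Obj>& objs,
                                                    double tol = 1.5) {
    typedef std::pair<long long, long long> Cell;
    std::map<Cell, std::vector<std::size_t>> grid;

    // Bucket each object into a square cell of side tol.
    std::vector<Cell> cell_of(objs.size());
    for (std::size_t i = 0; i < objs.size(); ++i) {
        cell_of[i] = Cell(static_cast<long long>(std::floor(objs[i].x / tol)),
                          static_cast<long long>(std::floor(objs[i].y / tol)));
        grid[cell_of[i]].push_back(i);
    }

    // Similar objects can only sit in the same cell or one of its 8 neighbours.
    std::vector<std::vector<std::size_t>> result(objs.size());
    for (std::size_t i = 0; i < objs.size(); ++i) {
        for (long long dx = -1; dx <= 1; ++dx) {
            for (long long dy = -1; dy <= 1; ++dy) {
                Cell c(cell_of[i].first + dx, cell_of[i].second + dy);
                std::map<Cell, std::vector<std::size_t>>::const_iterator it = grid.find(c);
                if (it == grid.end()) continue;
                for (std::size_t k = 0; k < it->second.size(); ++k) {
                    std::size_t j = it->second[k];
                    if (j != i && similar(objs[i], objs[j], tol)) result[i].push_back(j);
                }
            }
        }
    }
    return result;
}
```

Objects whose list comes back empty have no similar objects, and the longest list identifies the object with the most. Bucketing costs O(N log N) with std::map; the comparison work depends on how many objects share neighbouring cells, so tightly clustered input can still approach O(N^2).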


Uniquely identify an arbitrary object in c++

I'm trying to create a general memoizer for multiple arbitrary functions.
For each function std::function<ReturnType(Args...)> that we want to memoize, we keep an unordered_map<Args ..., ReturnType> (I'm keeping things simple on purpose).
The big problem comes when the memoized function has some really big argument in Args ...: for example, suppose that our function sorts a vector of 10 million numbers and then returns the sorted vector, so something like std::function<vector<double>(vector<double>)>.
As you can imagine, after inserting fewer than 100 vectors, we have already filled 8 GB of memory. Note that this may be caused by the combination of huge vectors and the memory required by the sorting algorithm (I didn't investigate the causes).
So what if, instead of the structure described above, we define unordered_map<UUID(Args ...), ReturnType> (where UUID = Universally Unique Identifier)? We would have to relax determinism (so we might return a wrong result), but only with a very low probability.
The problem is that, since I have never used UUIDs, I don't know if there are suitable implementations for this application.
So my questions are:
Is there a better solution than UUIDs for this problem?
Which UUID implementation is best suited to this problem?
Is boost uuid a possible candidate?
Unfortunately, this could solve the problem for Args ... but not for ReturnType, so is there a solution for the memoized result?
Note that the UUID generated for an object x should be the same even across different runs and machines.
Note that if two different objects get the same UUID (so we return the wrong value) with a really low probability, that could be acceptable... let's say that this would be a "probabilistic memoizer".
I know that this application doesn't make sense in a memoization context (what are the odds that a user asks twice to sort the same 10-million-element vector?), but it's expensive in time and memory (so good for benchmarking and for introducing the memory problem stated above), so please don't whip and crucify me because this is an absurd memoization application.

Identifying any object is easy. The address is "object identity" in C++. This is also the reason that even empty classes cannot have zero size.
Now, what you want is value equivalence. That's strictly not in the language domain. It's solidly in the application/library logic domain.
You should consider using something like Boost.Flyweight (boost::flyweight). It has precisely this facility, and makes it "easy" to customize the equivalence semantics for your types.
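To illustrate the probabilistic memoization idea from the question, here is a minimal sketch that keys the cache on a 64-bit hash of the argument's contents instead of on the argument itself. This swaps in a plain FNV-1a content hash rather than a UUID library or Boost.Flyweight, and the names content_hash and SortMemo are hypothetical. It assumes doubles have the same byte layout wherever the hash must match, treats 0.0 and -0.0 as different inputs, and accepts that a hash collision returns a wrong cached result. It only shrinks the key; the stored results are still full vectors.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <utility>
#include <vector>

// 64-bit FNV-1a over the raw bytes of every element.
std::uint64_t content_hash(const std::vector<double>& v) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (std::size_t i = 0; i < v.size(); ++i) {
        unsigned char bytes[sizeof(double)];
        std::memcpy(bytes, &v[i], sizeof(double));
        for (std::size_t k = 0; k < sizeof(double); ++k) {
            h ^= bytes[k];
            h *= 0x100000001b3ULL;  // FNV prime
        }
    }
    return h;
}

// Memoizes "sort a vector", storing only the hash of the input as the key.
class SortMemo {
public:
    const std::vector<double>& sorted(const std::vector<double>& input) {
        const std::uint64_t key = content_hash(input);
        std::unordered_map<std::uint64_t, std::vector<double>>::iterator it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;  // Wrong if two different inputs share a hash.
        }
        std::vector<double> out(input);
        std::sort(out.begin(), out.end());
        return cache_.emplace(key, std::move(out)).first->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::unordered_map<std::uint64_t, std::vector<double>> cache_;
};
```

With a 64-bit key, collisions are unlikely for small caches but not impossible; a wider hash lowers the odds further at the cost of more hashing work.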

3D-Grid of bins: nested std::vector vs std::unordered_map

Pros, I need some opinions on performance for the following:
1st Question:
I want to store objects in a 3D-Grid-Structure, overall it will be ~33% filled, i.e. 2 out of 3 gridpoints will be empty.
Short image to illustrate:
Maybe Option A)
vector<vector<vector<deque<Obj>>>> grid; // (SizeX, SizeY, SizeZ)
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add and delete deques when data is added or deleted. Probably less memory used, but maybe slower? What do you think?
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entities, say object types ObjA, ObjB, ObjC per grid point, then my data essentially becomes 4D?
Another illustration:
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure
Give me all Objects out of ObjB-buckets for a set of grid-positions
Which is the nearest non-empty ObjC-bucket to position x,y,z?
PS:
I had also thought about a tree-based data structure before, reading about nearest neighbour approaches. Since my data is so regular, I thought I'd skip all the tree-building division of the cells into smaller pieces and just make a static 3D grid of the final leaves. That's how I came to ask here about the best way to store this grid.
Question associated with this, if I have a map<int, Obj> is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way the building of the above mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array that has fortran-ordering. It's a little bit like the chunks games like minecraft use. Which is a little like using a kd-tree with fixed leaf-size and fixed amount of leaves? Works pretty fast now so I'm happy with this approach.

Answer to 1st question
As @Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector or an array if you will. std::vector brings easier maintenance at a minimal overhead, so I'd prefer that. It consumes O(N^3) space, where N is the number of grid points along one dimension. To get the best performance when iterating over the data, let the last index (z in grid[x][y][z]) vary fastest, in the innermost loop, and the first index vary slowest, in the outermost loop.
B) If you instead wish to keep things as small as possible, use a hash map, and use one which is optimized for space. That would result in O(N) space, with N being the number of elements. Here is a benchmark comparing several hash maps. I have had good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add to your build system.
If you need a mixture of speed and small data, or don't know the size of each dimension in advance, use a hash map as well.
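Option B from the question can be sketched roughly as follows. The question names Pos3D, Pos3DHash, Pos3DEqual and Pos3DMap but does not show them, so their definitions here are assumptions, as are the placeholder Obj type and the helpers add_object and pop_object. The hash uses a common hash-combining recipe, which is one choice among many.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <unordered_map>

struct Obj { int value; };  // Placeholder payload type (assumption).

struct Pos3D { int x, y, z; };

struct Pos3DEqual {
    bool operator()(const Pos3D& a, const Pos3D& b) const {
        return a.x == b.x && a.y == b.y && a.z == b.z;
    }
};

struct Pos3DHash {
    std::size_t operator()(const Pos3D& p) const {
        // Fold the three coordinates into one hash value.
        std::size_t h = std::hash<int>()(p.x);
        h ^= std::hash<int>()(p.y) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<int>()(p.z) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

typedef std::unordered_map<Pos3D, std::deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;

// Adds an object, creating the deque for that grid point on first use.
void add_object(Pos3DMap& grid, const Pos3D& p, const Obj& o) {
    grid[p].push_back(o);
}

// Removes the oldest object at p and erases the deque once it is empty,
// so only occupied grid points take up space in the map.
bool pop_object(Pos3DMap& grid, const Pos3D& p) {
    Pos3DMap::iterator it = grid.find(p);
    if (it == grid.end() || it->second.empty()) return false;
    it->second.pop_front();
    if (it->second.empty()) grid.erase(it);
    return true;
}
```

Use grid.find(p) rather than grid[p] for read-only lookups, because operator[] inserts an empty deque for positions that are not yet present.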
Answer to 2nd question
I'd say your data is 4D if you have a variable number of elements along the 4th dimension, or a fixed large number of elements. With option 1B) you'd indeed add the bucket index; for 1A) you'd add another vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a k-d tree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is used by the Point Cloud Library, which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.
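The follow-up in the question about fetching "all objects with keys between 780 and 790" from a map<int, Obj> does not need a separate tree: std::map is ordered, so lower_bound and upper_bound find both ends of the range in logarithmic time. A minimal sketch, using std::string as a stand-in for Obj and a hypothetical helper name values_in_range:

```cpp
#include <map>
#include <string>
#include <vector>

// Returns the values whose keys lie in the inclusive range [lo, hi].
std::vector<std::string> values_in_range(const std::map<int, std::string>& m,
                                         int lo, int hi) {
    std::vector<std::string> out;
    if (lo > hi) return out;  // Empty range; also keeps the loop below safe.
    std::map<int, std::string>::const_iterator first = m.lower_bound(lo);  // first key >= lo
    std::map<int, std::string>::const_iterator last = m.upper_bound(hi);   // first key > hi
    for (std::map<int, std::string>::const_iterator it = first; it != last; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

The cost is O(log n) to locate the range plus time linear in the number of matches.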

How to handle very large matrices (e.g. 1000000 by 1000000)

My question is very general, and it is not a duplicate.
When we declare something like int mat[1000000][1000000];
it is sure to give an error saying the matrix size is too large.
I have seen many problems on competitive programming websites where we seem to need a 2D matrix with 10^6 rows and columns. I know there is always some trick associated with it to reduce the matrix size.
So I just want to ask what possible ways or tricks we can use in such cases to minimize the size. I mean, which types of algorithms are generally required to solve it: DP, or something else?

In DP, if the current row depends only on the previous row, you can use int mat[2][1000000];. After calculating the current row, you can immediately discard the previous row and swap the roles of current and previous.
Sometimes it is possible to use std::map instead of a 2D array.
I have encountered many such questions in programming contests, and the solutions differ from case to case, so if you mention a specific case, I can possibly give you a better-targeted solution.
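As an illustration of the two-row trick, here is a minimal sketch using the longest common subsequence length as the example problem. The choice of problem and the name lcs_length are assumptions made for the example; the point is only that each row of the table depends on the previous row, so two rows are enough.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common subsequence of a and b,
// using two rows instead of a full (|a|+1) x (|b|+1) table.
std::size_t lcs_length(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1, 0);
    std::vector<std::size_t> cur(b.size() + 1, 0);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            if (a[i - 1] == b[j - 1]) {
                cur[j] = prev[j - 1] + 1;
            } else {
                cur[j] = std::max(prev[j], cur[j - 1]);
            }
        }
        std::swap(prev, cur);  // The finished row becomes "previous".
    }
    return prev[b.size()];
}
```

Memory drops from O(|a| * |b|) to O(|b|); the running time is unchanged.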

That depends very much on the specific task. There is no universal "trick" that will always work. You'll have to look for something in the particular problem that allows you to solve it in a different way.
That said, if I could really see no other way, I'd start thinking about how many elements of that matrix will really be non-zero (perhaps I can use a sparse array or a map (dictionary) instead). Or maybe I don't need to store all the elements in memory, but can instead re-calculate them every time I need them.
At any rate, a matrix that large (or any kind of fake representation of it) will NOT be useful. Not just because you don't have enough memory, but also because filling such an array with data will take anywhere from several hours to many months. That should be your first concern: figuring out how to solve the task with less data and fewer computations. Once you figure that out, you'll also see what data structure is appropriate.
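The sparse-array idea mentioned above can be sketched with a map keyed by (row, column) that stores only the non-zero cells. The class name SparseMatrix and its interface are hypothetical, and this only helps when the number of non-zero cells is small compared with the full matrix.

```cpp
#include <cstddef>
#include <map>
#include <utility>

// Stores only non-zero cells; every other cell reads as 0.
class SparseMatrix {
public:
    void set(long long row, long long col, int value) {
        const std::pair<long long, long long> key(row, col);
        if (value == 0) {
            cells_.erase(key);  // Keep zero cells out of the map.
        } else {
            cells_[key] = value;
        }
    }

    int get(long long row, long long col) const {
        std::map<std::pair<long long, long long>, int>::const_iterator it =
            cells_.find(std::make_pair(row, col));
        return it == cells_.end() ? 0 : it->second;
    }

    // Number of cells actually held in memory.
    std::size_t stored() const { return cells_.size(); }

private:
    std::map<std::pair<long long, long long>, int> cells_;
};
```

Each access costs O(log k) for k stored cells, instead of O(1) for a real array.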

Could the S of S.O.L.I.D be extended for every single element of the code?

The S in the famous object-oriented design principles stands for:
Single responsibility principle, the notion that an object should have only a single responsibility.
I was wondering, can this principle be extended even to arrays, variables, and all the elements of a program?
For example, let's say we have:
int A[100];
And we use it to store the result of a function, but somehow we also use the same A[100] to track, for example, which indexes of A we have already checked and processed.
Could this be considered wrong? Shouldn't we create another element to store, for example, the indexes that we have already checked? Isn't this a hint of future messy code?
PS: I'm sorry if my question is not comprehensible but English is not my primary language. If you have any problem understanding the point of it please let me know in a comment below.

If the same A instance is used in different portions of the program code, you must follow this principle. If A is an auxiliary variable, a local one for example, I think you don't need to worry about it.

If you are tracking the use of bits of the array that have been updated, then you probably shouldn't be using an array, but a map instead.
In any case, if you need that sort of extra control over the array, then basically, you should be considering a class that contains both the contents of the array and the various information about what has and hasn't been done. So your array becomes local to the class object, as do your controls, and voila. You have single responsibility again.
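A minimal sketch of such a class, assuming the 100-element int array from the question; the name ResultTable and its member functions are hypothetical. The results and the "already processed" flags live in separate members, and callers can only change them together through store.

```cpp
#include <array>
#include <cstddef>

// Owns both the results and the bookkeeping about which slots are filled.
class ResultTable {
public:
    // Records a result and marks the slot as processed in one step.
    void store(std::size_t index, int value) {
        values_.at(index) = value;  // at() throws std::out_of_range on a bad index
        done_.at(index) = true;
    }

    bool is_done(std::size_t index) const { return done_.at(index); }

    int value(std::size_t index) const { return values_.at(index); }

private:
    std::array<int, 100> values_{};  // function results, zero-initialized
    std::array<bool, 100> done_{};   // true once the slot has been processed
};
```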

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
//Return if iTest appears in a set of numbers.
}
The list of numbers is known at application start-up (but is not always the same between two instances of the same application) and will not change (or be added to) for the lifetime of the program. The integers themselves may be large and span a large range, so it is not efficient to use a vector<bool>. Performance is an issue, as the function sits in a hot spot. I have heard about perfect hashing but could not find any good advice. Any pointers would be helpful. Thanks.
p.s. I'd ideally like if the solution isn't a third party library because I can't use them here. Something simple enough to be understood and manually implemented would be great if it were possible.

I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom Filter is a data structure that specializes in answering the question "Is this element part of the set?", and does so with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special: Is this element part of the set ?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.
As such, a Bloom Filter is ideal as a doorman for a Binary Tree or a Hash Map. Carefully tuned, it will let only very few false positives pass. For example, gcc uses one.
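A minimal sketch of the doorman arrangement, under several assumptions: two hash functions derived from one 64-bit mixing function, 16 filter bits per element, and std::set as the exact container behind the filter (std::set is used here instead of std::map because only membership is needed). The class names BloomFilter and IntSet are hypothetical, and the sizes are untuned.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

class BloomFilter {
public:
    explicit BloomFilter(std::size_t bits) : bits_(bits == 0 ? 1 : bits, false) {}

    void insert(int v) {
        bits_[index1(v)] = true;
        bits_[index2(v)] = true;
    }

    // false means "definitely not in the set"; true means "maybe".
    bool maybe_contains(int v) const {
        return bits_[index1(v)] && bits_[index2(v)];
    }

private:
    // 64-bit mixing function used to spread the input bits.
    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }
    std::size_t index1(int v) const {
        return static_cast<std::size_t>(mix(static_cast<std::uint32_t>(v)) % bits_.size());
    }
    std::size_t index2(int v) const {
        return static_cast<std::size_t>(
            mix(static_cast<std::uint32_t>(v) + 0x9e3779b97f4a7c15ULL) % bits_.size());
    }

    std::vector<bool> bits_;
};

// The filter rejects most absent values cheaply; the set confirms the rest.
class IntSet {
public:
    explicit IntSet(const std::vector<int>& values)
        : filter_(values.size() * 16), exact_(values.begin(), values.end()) {
        for (std::size_t i = 0; i < values.size(); ++i) filter_.insert(values[i]);
    }

    bool IsInList(int iTest) const {
        if (!filter_.maybe_contains(iTest)) return false;  // guaranteed absent
        return exact_.count(iTest) > 0;                    // resolve the "maybe"
    }

private:
    BloomFilter filter_;
    std::set<int> exact_;
};
```

Whether this beats a plain lookup depends on how often the tested value is absent: the filter only saves work on the "No" answers.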

What comes to my mind is gperf. It is based on strings and not on numbers, but part of the calculation can be tweaked to use numbers as input for the hash generator.

integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The Dot Product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here, find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.
@54:30 The Prof. draws a picture of a standard way of doing a perfect hash. The perfect minimal hash is beyond this lecture. (good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
With std::map you get very good performance in 99.9% of scenarios. If your hot spot tests the same iTest value(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.

A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.

For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
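In C++ the standard library already provides the search, so a minimal sketch needs only a sorted vector and std::binary_search. The class name SortedIntList is hypothetical; IsInList follows the signature from the question.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

class SortedIntList {
public:
    // Sorts once at load time; the list never changes afterwards.
    explicit SortedIntList(std::vector<int> values) : values_(std::move(values)) {
        std::sort(values_.begin(), values_.end());
        values_.erase(std::unique(values_.begin(), values_.end()), values_.end());
    }

    // O(log N) membership test over contiguous memory.
    bool IsInList(int iTest) const {
        return std::binary_search(values_.begin(), values_.end(), iTest);
    }

private:
    std::vector<int> values_;
};
```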

It's not necessary or practical to aim for mapping N distinct randomly dispersed integers to N contiguous buckets - i.e. a perfect minimal hash - the important thing is to identify an acceptable ratio. To do this at run-time, you can start by configuring a worst-acceptable ratio (say 1 to 20) and a no-point-being-better-than-this ratio (say 1 to 4), then randomly vary (e.g. changing prime numbers used) a fast-to-calculate hash algorithm to see how easily you can meet increasingly difficult ratios. For worst-acceptable you don't time out, or you fall back on something slower but reliable (container or displacement lists to resolve collisions). Then, allow a second or ten (configurable) for each X% better until you can't succeed at that ratio or reach the no-point-being-better ratio....
Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptable to simply say "integer inputs form a hash", because there are collisions when %-ed into any sane array size. But you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.

Original Question
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique, perfect hash.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In your case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fallback position (binary search, or map, or something like that), so that you can guarantee that the container will work in all cases.
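The modulus search can be sketched as follows. This is a variant rather than the exact procedure described: it searches upward from the number of values for the smallest collision-free modulus instead of downward from H, and it gives up at a caller-supplied limit so the caller can fall back to binary search or a map. The names slot, find_perfect_modulus and ModuloSet are hypothetical, and the input values are assumed to be distinct.

```cpp
#include <cstddef>
#include <vector>

// Maps v into [0, m), also for negative v.
std::size_t slot(int v, std::size_t m) {
    long long r = static_cast<long long>(v) % static_cast<long long>(m);
    if (r < 0) r += static_cast<long long>(m);
    return static_cast<std::size_t>(r);
}

// Smallest modulus in [n, max_modulus] under which no two values collide,
// or 0 if there is none. Duplicate values always collide.
std::size_t find_perfect_modulus(const std::vector<int>& values, std::size_t max_modulus) {
    const std::size_t start = values.empty() ? 1 : values.size();
    for (std::size_t m = start; m <= max_modulus; ++m) {
        std::vector<bool> used(m, false);
        bool collision = false;
        for (std::size_t i = 0; i < values.size() && !collision; ++i) {
            const std::size_t s = slot(values[i], m);
            collision = used[s];
            used[s] = true;
        }
        if (!collision) return m;
    }
    return 0;
}

class ModuloSet {
public:
    // Returns false when no modulus was found; use a fallback container then.
    bool build(const std::vector<int>& values, std::size_t max_modulus) {
        modulus_ = find_perfect_modulus(values, max_modulus);
        if (modulus_ == 0) return false;
        table_.assign(modulus_, 0);
        occupied_.assign(modulus_, false);
        for (std::size_t i = 0; i < values.size(); ++i) {
            const std::size_t s = slot(values[i], modulus_);
            table_[s] = values[i];
            occupied_[s] = true;
        }
        return true;
    }

    // One modulus, one comparison: no probing, because slots never collide.
    bool IsInList(int iTest) const {
        if (modulus_ == 0) return false;
        const std::size_t s = slot(iTest, modulus_);
        return occupied_[s] && table_[s] == iTest;
    }

private:
    std::size_t modulus_ = 0;
    std::vector<int> table_;
    std::vector<bool> occupied_;
};
```

The search costs up to O(n * max_modulus) at load time, which is paid once; lookups afterwards are constant time.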

A trie or perhaps a van Emde Boas tree might be a better bet for creating a space-efficient set of integers with lookup time being constant against the number of objects in the data structure, assuming that even std::bitset would be too large.
