Declarative Data Mining: Frequent Itemset Tiling - data-mining

For a course in my Computer Science studies, I have to come up with a set of constraints and a score-definition to find a tiling for frequent itemset mining. The matrix with the data consists of ones and zeroes.
My task is to come up with a set of constraints for the tiling (with a fixed number of tiles) and a score function that needs to be maximized. Since I started working out a solution that allows overlapping tiles, I tried to find a score function that calculates the total "area" covered by all tiles. Bear in mind that the score function has to be evaluated for every candidate solution, so I can't simply walk over the whole matrix (which contains about 100k elements) and check whether each element is part of a tile.
However, I only took overlap between 2 tiles into account, and came up with the following:
TotalArea = Sum_{a in Tiles} Area(a) - Sum_{pairs {a,b} in Tiles} Overlap(a,b)
Silly me, I didn't consider a possible overlap between 3 tiles. My question is the following:
Is it possible to come up with a generic score-function for n tiles, considering only area per tile and area per overlap between 2 (or more) tiles, and if so, how would I program it?
I could provide some code, but then again it has to be programmed in some obscure language called Comet :(
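For what it's worth, the inclusion-exclusion principle gives a generic formula for any number of tiles: add the single-tile areas, subtract the pairwise overlaps, add back the triple overlaps, and so on with alternating sign. Below is a minimal C++ sketch (not Comet, which I don't know) under the assumption that a tile is a set of rows times a set of columns, so the intersection of several tiles is just the intersection of their row sets times the intersection of their column sets; the struct and function names are mine.

// Hedged sketch: total area covered by n possibly overlapping tiles, via the
// inclusion-exclusion principle. A "tile" is assumed to be a set of row
// indices times a set of column indices, so the intersection of several tiles
// is the intersection of their row sets times the intersection of their
// column sets. Names are mine; adapt to your Comet model as needed.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

struct Tile {
    std::set<int> rows;
    std::set<int> cols;
};

// Cells covered by the intersection of all tiles selected by `mask`.
long long intersection_area(const std::vector<Tile>& tiles, std::uint32_t mask) {
    std::set<int> rows, cols;
    bool first = true;
    for (std::size_t i = 0; i < tiles.size(); ++i) {
        if (!(mask & (1u << i))) continue;
        if (first) { rows = tiles[i].rows; cols = tiles[i].cols; first = false; continue; }
        std::set<int> r, c;
        std::set_intersection(rows.begin(), rows.end(),
                              tiles[i].rows.begin(), tiles[i].rows.end(),
                              std::inserter(r, r.begin()));
        std::set_intersection(cols.begin(), cols.end(),
                              tiles[i].cols.begin(), tiles[i].cols.end(),
                              std::inserter(c, c.begin()));
        rows.swap(r);
        cols.swap(c);
    }
    return static_cast<long long>(rows.size()) * static_cast<long long>(cols.size());
}

// area(union) = sum over non-empty subsets S of (-1)^(|S|+1) * area(intersection of S).
// Enumerating 2^n - 1 subsets is fine because the number of tiles is small and fixed.
long long total_area(const std::vector<Tile>& tiles) {
    long long total = 0;
    std::uint32_t n = static_cast<std::uint32_t>(tiles.size());
    for (std::uint32_t mask = 1; mask < (1u << n); ++mask) {
        int bits = 0;
        for (std::uint32_t m = mask; m != 0; m >>= 1) bits += static_cast<int>(m & 1u);
        long long a = intersection_area(tiles, mask);
        total += (bits % 2 == 1) ? a : -a;
    }
    return total;
}

int main() {
    std::vector<Tile> tiles = {
        {{0, 1, 2}, {0, 1}},   // 3x2 tile, area 6
        {{1, 2, 3}, {1, 2}},   // 3x2 tile, overlaps the first in 2x1 = 2 cells
    };
    std::cout << total_area(tiles) << "\n";   // prints 10 = 6 + 6 - 2
}

Enumerating all subsets is only viable because the number of tiles is small and fixed, which matches your setting.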

Related

Understanding the loss function in Yolo v1 research paper

I'm not able to understand the following piece of text from the YOLO v1 research paper:
"We use sum-squared error because it is easy to optimize,
however it does not perfectly align with our goal of
maximizing average precision. It weights localization error
equally with classification error which may not be ideal.
Also, in every image many grid cells do not contain any
object. This pushes the “confidence” scores of those cells
towards zero, often overpowering the gradient from cells
that do contain objects. This can lead to model instability,
causing training to diverge early on.
To remedy this, we increase the loss from bounding box
coordinate predictions and decrease the loss from confidence
predictions for boxes that don’t contain objects. We
use two parameters, lambda(coord) and lambda(noobj) to accomplish this. We
set lambda(coord) = 5 and lambda(noobj) = .5"
What is the meaning of "overpowering" in the first paragraph, and why would we decrease the loss from the confidence predictions (shouldn't it already be low, especially for boxes that don't contain any object) and increase the loss from the bounding box predictions?
There are grid cells that contain objects and many more that do not. The model is often very confident about the absence of an object in a cell (confidence around zero), and because those empty cells are so numerous, their combined gradient can be much greater than the gradient from the cells that do contain objects but are predicted with less extreme confidence (around 0.7-0.8); that is what "overpowering" means here.
Since those confidence scores are not very "fair", we want them to count for less; this is implemented by making the weight for the coordinate predictions larger and the weight for the no-object confidence predictions smaller.
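To make the weighting concrete, here is a heavily simplified C++ sketch of a YOLO-v1-style loss (one box per cell, no class term, no square roots on width/height; these simplifications and the struct names are mine, not the paper's). It only illustrates where lambda(coord) and lambda(noobj) enter the sum-squared error:

// Hedged sketch: how lambda(coord) and lambda(noobj) enter a YOLO-v1-style
// sum-squared loss. One box per cell, no class term; a simplification of the
// paper's full formulation.
#include <cstddef>
#include <vector>

struct CellTarget {
    bool has_object;      // does an object's center fall in this grid cell?
    float x, y, w, h;     // ground-truth box (only meaningful if has_object)
};

struct CellPrediction {
    float x, y, w, h;     // predicted box
    float confidence;     // predicted objectness score
};

float yolo_like_loss(const std::vector<CellTarget>& targets,
                     const std::vector<CellPrediction>& preds,
                     float lambda_coord = 5.0f,
                     float lambda_noobj = 0.5f) {
    float loss = 0.0f;
    for (std::size_t i = 0; i < targets.size(); ++i) {
        const CellTarget& t = targets[i];
        const CellPrediction& p = preds[i];
        if (t.has_object) {
            // Localization error, up-weighted so it is not drowned out.
            loss += lambda_coord * ((p.x - t.x) * (p.x - t.x) + (p.y - t.y) * (p.y - t.y)
                                  + (p.w - t.w) * (p.w - t.w) + (p.h - t.h) * (p.h - t.h));
            // Confidence should approach 1 where an object is present.
            loss += (p.confidence - 1.0f) * (p.confidence - 1.0f);
        } else {
            // Confidence should approach 0 here, and since such cells are the
            // vast majority, their contribution is down-weighted by lambda(noobj).
            loss += lambda_noobj * p.confidence * p.confidence;
        }
    }
    return loss;
}

Because cells without objects vastly outnumber cells with objects, down-weighting their confidence term by 0.5 while up-weighting the coordinate term by 5 keeps the empty cells from dominating the gradient.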

C++ Molecular Dynamics Simulation Code

I am working on a project to simulate a hard sphere model of a gas. (Similar to the ideal gas model.)
I have written my entire project, and it is working. To give you an idea of what I have done, there is a loop which does the following: (Pseudo code)
Get_Next_Collision(); // Figure out when the next collision will occur
Step_Time_Forwards(); // Step to time of collision
Process_Collision(); // Process collision between 2 particles
(Repeat)
For a large number of particles (say N particles), O(N*N) checks must be made to figure out when the next collision occurs. It is clearly inefficient to follow the above procedure, because in the vast majority of cases, collisions between pairs of particles are unaffected by the processing of a collision elsewhere. Therefore it is desirable to have some form of priority queue which stores the next event for each particle. (Actually, since a collision involves 2 particles, only half that number of events will be stored, because if A collides with B then B also collides with A, and at exactly the same time.)
I am finding it difficult to write such an event/collision priority queue.
I would like to know if there are any Molecular Dynamics simulators whose source code I could look at in order to understand how I might implement such a priority queue.
Having done a Google search, it is clear to me that many MD programs have been written; however, many of them are either vastly too complex or not suitable.
This may be because they have huge functionality, including the ability to produce visualizations or ability to compute the simulation for particles which have interacting forces acting between them, etc.
Some simulators are not suitable because they do calculations for a different model, ie: something other than the energy conserving, hard sphere model with elastic collisions. For example, particles interacting with potentials or non-spherical particles.
I have tried looking at the source code for LAMMPS, but it's vast and I struggle to make any sense of it.
I hope that is enough information about what I am trying to do. If not I can probably add some more info.
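Not an existing simulator, but here is a minimal sketch of the standard event-driven approach, in case it helps: push predicted pair collisions into a std::priority_queue ordered by time, and lazily discard events that were invalidated because one of the two particles collided with something else in the meantime (tracked by a per-particle collision counter). The Particle layout and the quadratic-formula collision-time prediction are my own simplifications; your existing code presumably already has equivalents.

// Hedged sketch of a lazy-invalidation event queue for event-driven
// hard-sphere MD. Names and the collision-time prediction are illustrative.
#include <cmath>
#include <functional>
#include <limits>
#include <queue>
#include <vector>

struct Particle {
    double x, y, z, vx, vy, vz, radius;
    long collision_count = 0;   // bumped every time this particle collides
};

struct Event {
    double time;                // absolute time of the predicted collision
    int a, b;                   // indices of the two particles involved
    long count_a, count_b;      // collision counts at scheduling time
    bool operator>(const Event& other) const { return time > other.time; }
};

// Absolute time at which spheres a and b touch, or +infinity if they never do.
double predict_collision_time(const Particle& a, const Particle& b, double now) {
    double rx = b.x - a.x, ry = b.y - a.y, rz = b.z - a.z;
    double vx = b.vx - a.vx, vy = b.vy - a.vy, vz = b.vz - a.vz;
    double sigma = a.radius + b.radius;
    double rv = rx * vx + ry * vy + rz * vz;                        // r . v
    if (rv >= 0) return std::numeric_limits<double>::infinity();    // moving apart
    double v2 = vx * vx + vy * vy + vz * vz;
    double r2 = rx * rx + ry * ry + rz * rz;
    double disc = rv * rv - v2 * (r2 - sigma * sigma);
    if (disc < 0) return std::numeric_limits<double>::infinity();   // they miss
    return now + (-rv - std::sqrt(disc)) / v2;
}

class EventQueue {
public:
    // Predict and enqueue the collision (if any) between particles a and b.
    void schedule(const std::vector<Particle>& p, int a, int b, double now) {
        double t = predict_collision_time(p[a], p[b], now);
        if (t < std::numeric_limits<double>::infinity())
            queue_.push({t, a, b, p[a].collision_count, p[b].collision_count});
    }

    // Pop events until one is still valid, i.e. neither particle has collided
    // since the event was scheduled. Stale events are simply discarded.
    bool next_valid(const std::vector<Particle>& p, Event& out) {
        while (!queue_.empty()) {
            Event e = queue_.top();
            queue_.pop();
            if (p[e.a].collision_count == e.count_a &&
                p[e.b].collision_count == e.count_b) {
                out = e;
                return true;
            }
        }
        return false;
    }

private:
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue_;
};

After processing a collision you bump collision_count on both particles and re-schedule only their new candidate collisions; everything else in the queue stays untouched, and stale entries are discarded lazily when they surface. This avoids having to delete events from the middle of the queue, which std::priority_queue does not support anyway.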
A basic version of a locality-aware system could look like this:
1. Divide the universe into a cubic grid (where each cube has side A, and volume A^3), where each cube is sufficiently large, but sufficiently smaller than the total volume of the system. Each grid cube is further divided into 4 sub-cubes whose particles it can theoretically give to its neighboring cubes (and lend for calculations).
2. Each grid cube registers the particles that are contained within it and is aware of its neighboring grid cubes' contained particles.
3. Define a particle's observable universe to have a radius of (grid dimension/2). Define timestep = (griddim/2) / max_speed. This postulates that particles from a maximum of four adjacent grid cubes can theoretically interact in that time period.
4. For every particle in every grid cube, run your traditional collision detection algorithm (with mini_timestep < timestep), where each particle is checked for possible collisions with the other particles in its observable universe. Store the collisions in any structure sorted by time of collision, even just a sorted array.
5. The first collision that happens within a mini_timestep resets your universe (and universe clock) to (last_time + time_to_collide), where time_to_collide < mini_timestep. I suppose that does not differ from your current algorithm. Important note: particles' absolute coordinates are updated, but which grid cube and sub-cube they belong to are not updated.
6. Repeat step 5 until the large timestep has passed. Then update the ownership of particles by each grid cube.
The advantage of this system is that for each time window, we have (assuming uniform distribution of particles) O(universe_particles * grid_size) instead of O(universe_particles * universe_size) checks for collision. In good conditions (depending on universe size, speed and density of particles), you could improve the computation efficiency by orders of magnitude.
I didn't understand how the 'priority queue' approach would work, but I have an alternative approach that may help you. It is what I think @Boyko Perfanov meant by 'make use of locality'.
You can sort the particles into 'buckets', such that you don't have to check each particle against each other ( O(n²) ). This uses the fact that particles can only collide if they are already quite close to each other. Create buckets that represent a small area/volume, and fill in all particles that are currently in the area/volume of the bucket ( O(n) worst case ). Then check all particles inside a bucket against the other particles in the bucket ( O(m*(n/m)²) average case, m = number of buckets ). The buckets need to be overlapping for this to work, or else you could also check the particles from neighboring buckets.
Update: If the particles can travel for a longer distance than the bucket size, an obvious 'solution' is to decrease the time-step. However this will increase the running time of the algorithm again, and it works only if there is a maximum speed.
Another solution applicable even when there is no maximum speed, would be to create an additional 'high velocity' bucket. Since the velocity distribution is usually a gaussian curve, not many particles would have to be placed into that bucket, so the 'bucket approach' would still be more efficient than O(n²).
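A minimal sketch of that bucket idea (a uniform cell grid hashed into an unordered_map; the cell size, struct names, and hash constants are illustrative choices of mine):

// Hedged sketch of the bucket idea: hash each particle into a uniform grid
// cell and only test pairs that share a cell or sit in adjacent cells. If the
// cell side is at least the particle diameter plus twice the distance a
// particle can travel per step, any pair that can collide within the step
// ends up in the same or a neighbouring cell.
#include <cmath>
#include <unordered_map>
#include <utility>
#include <vector>

struct P { double x, y, z; };

struct CellKey {
    int i, j, k;
    bool operator==(const CellKey& o) const { return i == o.i && j == o.j && k == o.k; }
};

struct CellKeyHash {
    std::size_t operator()(const CellKey& c) const {
        return std::hash<long long>()(
            (static_cast<long long>(c.i) * 73856093LL) ^
            (static_cast<long long>(c.j) * 19349663LL) ^
            (static_cast<long long>(c.k) * 83492791LL));
    }
};

// Candidate pairs (by index) that are close enough to possibly collide soon.
std::vector<std::pair<int, int>>
candidate_pairs(const std::vector<P>& particles, double cell_size) {
    std::unordered_map<CellKey, std::vector<int>, CellKeyHash> grid;
    for (int idx = 0; idx < static_cast<int>(particles.size()); ++idx) {
        const P& p = particles[idx];
        CellKey key{static_cast<int>(std::floor(p.x / cell_size)),
                    static_cast<int>(std::floor(p.y / cell_size)),
                    static_cast<int>(std::floor(p.z / cell_size))};
        grid[key].push_back(idx);
    }
    std::vector<std::pair<int, int>> pairs;
    for (const auto& entry : grid) {
        const CellKey& key = entry.first;
        const std::vector<int>& members = entry.second;
        // Visit all 27 neighbouring cells (including the cell itself); the
        // index ordering a < b guarantees each pair is generated exactly once.
        for (int di = -1; di <= 1; ++di)
            for (int dj = -1; dj <= 1; ++dj)
                for (int dk = -1; dk <= 1; ++dk) {
                    auto it = grid.find(CellKey{key.i + di, key.j + dj, key.k + dk});
                    if (it == grid.end()) continue;
                    for (int a : members)
                        for (int b : it->second)
                            if (a < b) pairs.emplace_back(a, b);
                }
    }
    return pairs;
}

Sizing the cells this way plays the role of the 'overlapping buckets' mentioned above, while checking the 26 neighbouring cells plays the role of checking neighbouring buckets.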

Ray-mesh intersection or AABB tree implementation in C++ with little overhead?

Can you recommend me...
either a proven lightweight C / C++ implementation of an AABB tree?
or, alternatively, another efficient data-structure, plus a lightweight C / C++ implementation, to solve the problem of intersecting a large number of rays with a large number of triangles?
"Large number" means several 100k for both rays and triangles.
I am aware that AABB trees are part of the CGAL library and probably of game physics libraries like Bullet. However, I don't want the overhead of an enormous additional library in my project. Ideally, I'd like a small, float-type-templated, header-only implementation. I would also go for something with a bunch of CPP files, as long as it integrates easily into my project. A dependency on Boost is OK.
Yes, I have googled, but without success.
I should mention that my application context is mesh processing, and not rendering. In a nutshell, I'm transferring the topology of a reference mesh to the geometry of a mesh from a 3D scan. I'm shooting rays from vertices and along the normals of the reference mesh towards the 3D scan, and I need to recover the intersection of these rays with the scan.
Edit
Several answers / comments pointed to nearest-neighbor data structures. I have created a small illustration regarding the problems that arise when ray-mesh intersections are approached with nearest neighbor methods. Nearest neighbors methods can be used as heuristics that work in many cases, but I'm not convinced that they actually solve the problem systematically, like AABB trees do.
While this code is a bit old and using the 3DS Max SDK, it gives a fairly good tree system for object-object collision deformations in C++. Can't tell at a glance if it is Quad-tree, AABB-tree, or even OBB-tree (comments are a bit skimpy too).
http://www.max3dstuff.com/max4/objectDeform/help.html
It will require translation from Max to your own system but it may be worth the effort.
Try the ANN library:
http://www.cs.umd.edu/~mount/ANN/
It's "Approximate Nearest Neighbors". I know, you're looking for something slightly different, but here's how you can use this to speed up your data processing:
Feed points into ANN.
Query a user-selectable (think of this as a "per-mesh knob") radius around each vertex that you want to ray-cast from and find out the mesh vertices that are within range.
Select only the triangles that are within that range, and ray trace along the normal to find the one you want.
By judiciously choosing the search radius, you will definitely get a sizable speed-up without compromising on accuracy.
If there are no real-time requirements, I'd first try brute force.
1M * 1M ray->triangle tests shouldn't take much more than a few minutes to run (on the CPU).
If that's a problem, the second best thing to do would be to restrict the search area by calculating an adjacency graph/relation between the triangles/polygons in the target mesh. After an initial guess fails, one can try the adjacent triangles. This of course relies on a lack of self-occlusion / multiple hit points (which I think is one interpretation of "visibility doesn't apply to this problem").
Also depending on how pathological the topologies are, one could try environment mapping the target mesh on a unit cube (each pixel would consists of a list of triangles projected on it) and test the initial candidate by a single ray->aabb test + lookup.
Given the feedback, there's one more simple option to consider: space partitioning with a simple 3D grid, where each dimension can be subdivided by the histogram of the x/y/z locations or even regularly.
A 100x100x100 grid has a very manageable size of 1e6 entries.
The maximum number of cubes to visit is proportional to the diameter (at most 300).
There are ~60000 extreme cells, which suggests on the order of 10 triangles per cell.
Caveat: triangles must be placed in every cell they occupy; a conservative algorithm will also place them in cells they don't actually touch, and large triangles will probably require clipping and reassembly.
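For reference, the single ray->aabb test mentioned above is short enough to write inline. Here is a standard "slab" test sketch (my own naming, not taken from any particular library):

// Hedged sketch: the standard slab test for ray vs. axis-aligned box.
// Returns true and the entry parameter t_hit if the ray hits the box.
#include <algorithm>
#include <cmath>
#include <limits>

struct Vec3 { double x, y, z; };

bool ray_aabb(const Vec3& origin, const Vec3& dir,          // dir need not be normalized
              const Vec3& box_min, const Vec3& box_max,
              double& t_hit) {
    double t_min = 0.0;
    double t_max = std::numeric_limits<double>::infinity();
    const double o[3]  = {origin.x, origin.y, origin.z};
    const double d[3]  = {dir.x, dir.y, dir.z};
    const double lo[3] = {box_min.x, box_min.y, box_min.z};
    const double hi[3] = {box_max.x, box_max.y, box_max.z};
    for (int axis = 0; axis < 3; ++axis) {
        if (std::fabs(d[axis]) < 1e-12) {
            // Ray is parallel to this slab: miss if the origin lies outside it.
            if (o[axis] < lo[axis] || o[axis] > hi[axis]) return false;
        } else {
            double inv = 1.0 / d[axis];
            double t0 = (lo[axis] - o[axis]) * inv;
            double t1 = (hi[axis] - o[axis]) * inv;
            if (t0 > t1) std::swap(t0, t1);
            t_min = std::max(t_min, t0);
            t_max = std::min(t_max, t1);
            if (t_min > t_max) return false;     // slab intervals don't overlap: miss
        }
    }
    t_hit = t_min;
    return true;
}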

Group of soldiers moving on grid map together

I am making an RTS game, and the whole terrain is a grid (cells with x and y coordinates). I have a couple of soldiers in a group (military unit) and I want to send them from point A to point B (there are obstacles between points A and B). I can solve this for one soldier using the A* algorithm; that is not a problem. How can I make my group of soldiers always move together? (I noticed a couple of corner cases where they split up and take different routes to the same destination. I can choose a leader for the group, but I don't need the soldiers to walk on the same cells as the leader; for example, a couple on the right side and a couple on the left side, if that is possible.) Has anyone solved a similar problem before? Any ideas for modifying the algorithm?
You want a flocking algorithm, where you have the leader of the pack follow the A* directions and the others follow the leader in formation.
In case you have very large formations, you are going to run into issues like "how to fit all those soldiers through this small hole", and that's where you will need to get smart.
An example could be to enforce a single-line formation for tight spots; other options would involve breaking the group down into smaller squads.
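A minimal sketch of the leader/follower idea (only the leader consults the A* path; each follower steers toward a slot defined relative to the leader; the slot layout and names are illustrative, not a full flocking implementation):

// Hedged sketch: followers steer toward formation slots behind the leader.
#include <cmath>

struct Vec2 { double x, y; };

// Two columns of slots behind the leader: slot 0 left-rear, slot 1 right-rear,
// slot 2 further left-rear, and so on.
Vec2 formation_slot(const Vec2& leader_pos, const Vec2& leader_dir, int index,
                    double spacing = 1.5) {
    Vec2 back{-leader_dir.x, -leader_dir.y};          // opposite of heading
    Vec2 side{-leader_dir.y, leader_dir.x};           // perpendicular (left)
    int row = index / 2 + 1;                          // 1, 1, 2, 2, ...
    double lateral = (index % 2 == 0) ? 1.0 : -1.0;   // alternate left/right
    return Vec2{leader_pos.x + back.x * spacing * row + side.x * spacing * lateral,
                leader_pos.y + back.y * spacing * row + side.y * spacing * lateral};
}

// Each frame a follower simply moves toward its slot, clamped to max_step;
// the leader is the only unit that follows the A* path.
Vec2 follower_step(const Vec2& pos, const Vec2& slot, double max_step) {
    Vec2 d{slot.x - pos.x, slot.y - pos.y};
    double len = std::sqrt(d.x * d.x + d.y * d.y);
    if (len <= max_step || len == 0.0) return slot;
    return Vec2{pos.x + d.x / len * max_step, pos.y + d.y / len * max_step};
}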
If you don't have too many soldiers, a straightforward modification would be to treat this as a multidimensional problem, with each soldier contributing 2 dimensions. You can add constraints to this multidimensional space to ensure that your soldiers keep close to each other. However, this might become computationally expensive.
Artificial potential fields are usually less expensive and easy to implement, and they can be extended to cooperative strategies. If combined with graph search techniques, you cannot get stuck in local minima. Google gives plenty of starting points: http://www.google.com/search?ie=UTF-8&oe=utf-8&q=motion+planning+potential+fields
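A minimal sketch of one potential-field step (attractive pull toward the goal, repulsive push from nearby obstacles; the gains and influence radius are tuning knobs I made up, not values from any particular paper):

// Hedged sketch of an artificial potential field force for one soldier.
#include <cmath>
#include <vector>

struct Vec2 { double x, y; };

Vec2 potential_field_force(const Vec2& pos, const Vec2& goal,
                           const std::vector<Vec2>& obstacles,
                           double k_att = 1.0, double k_rep = 5.0,
                           double influence_radius = 3.0) {
    // Attractive component: proportional to the vector toward the goal.
    Vec2 force{k_att * (goal.x - pos.x), k_att * (goal.y - pos.y)};

    // Repulsive component: only obstacles within the influence radius push
    // back, with a magnitude that grows as the soldier gets closer.
    for (const Vec2& ob : obstacles) {
        double dx = pos.x - ob.x, dy = pos.y - ob.y;
        double dist = std::sqrt(dx * dx + dy * dy);
        if (dist > influence_radius || dist < 1e-9) continue;
        double repulse = k_rep * (1.0 / dist - 1.0 / influence_radius) / (dist * dist);
        force.x += repulse * dx / dist;
        force.y += repulse * dy / dist;
    }
    return force;   // move the soldier a small step in this direction each tick
}

The same repulsive term can be applied between soldiers to keep them spread out, while the shared attractive term keeps the group moving toward the same goal.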

Minimizing Sum of Distances: Optimization Problem

The actual question goes like this:
McDonald's is planning to open a number of joints (say n) along a straight highway. These joints require warehouses to store their food. A warehouse can store food for any number of joints, but has to be located at one of the joints only. McD has a limited number of warehouses (say k) available, and wants to place them in such a way that the average distance of joints from their nearest warehouse is minimized.
Given an array (n elements) of coordinates of the joints and an integer 'k', return an array of 'k' elements giving the coordinates of the optimal positioning of warehouses.
Sorry, I don't have any examples available since I'm writing this down from memory. Anyway, one sample could be:
array={1,3,4,5,7,7,8,10,11} (n=9)
k=1
Ans: {7}
This is what I've been thinking: for k=1, we can simply find the median of the set, which gives the optimal location of the warehouse. For k>1, the given set should be divided into k subsets (disjoint, and consisting of contiguous elements of the superset), and the median of each subset would give the warehouse locations. However, I don't understand on what basis the k subsets should be formed. Thanks in advance.
EDIT: There's a variation to this problem also: Instead of sum/avg, minimize the maximum distance between a joint and its closest warehouse. I don't get this either..
The straight highway makes this an exercise in dynamic programming, working from left to right along the highway. A partial solution can be described by the location of the rightmost warehouse and the number of warehouses placed. The cost of the partial solution will be the total distance to the nearest warehouse (for fixed k, minimising this is the same as minimising the average), or the maximum distance so far to the closest warehouse for the variant problem.
At each stage you have worked out the answers for the leftmost N joints and have them indexed by the number of warehouses used and the position of the rightmost warehouse; you need to save only the best cost. Now consider the next joint and work out the best solution for N+1 joints, for all possible values of k and of the rightmost warehouse, using the answers you have stored for N joints to speed this up. Once you have worked out the best-cost solution covering all the joints, you know where its rightmost warehouse is, which gives you the location of one warehouse. Go back to the stored partial solution whose rightmost warehouse is at that position and find out which solution it was based on; that gives you one more rightmost warehouse, and so you can work your way back to the locations of all the warehouses in the best solution.
I tend to get the cost of working this out wrong, but with N joints and k warehouses to place you have N steps to take, each of them based on considering no more than Nk previous solutions, so I reckon the cost is about kN^2.
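Here is a compact C++ sketch of that dynamic programme (my own notation; it assumes the joint positions are sorted and, for your sample, computes a total distance of 23 with the single warehouse at the median, 7). cost(i, j) is the cheapest way to serve joints i..j with one warehouse, i.e. the sum of distances to the median of that range, evaluated in O(1) with prefix sums:

// Hedged sketch: k-warehouse placement on a line by dynamic programming.
#include <algorithm>
#include <iostream>
#include <limits>
#include <vector>

int main() {
    std::vector<long long> x = {1, 3, 4, 5, 7, 7, 8, 10, 11};
    int k = 1;
    int n = static_cast<int>(x.size());
    std::sort(x.begin(), x.end());

    // prefix[i] = x[0] + ... + x[i-1], to evaluate range costs in O(1).
    std::vector<long long> prefix(n + 1, 0);
    for (int i = 0; i < n; ++i) prefix[i + 1] = prefix[i] + x[i];

    // Sum of |x[t] - x[median]| for t in [i, j], median at index m = (i+j)/2.
    auto cost = [&](int i, int j) -> long long {
        int m = (i + j) / 2;
        long long left  = x[m] * (m - i + 1) - (prefix[m + 1] - prefix[i]);
        long long right = (prefix[j + 1] - prefix[m + 1]) - x[m] * (j - m);
        return left + right;
    };

    const long long INF = std::numeric_limits<long long>::max() / 2;
    // dp[w][j] = minimum cost of serving the first j joints with w warehouses.
    std::vector<std::vector<long long>> dp(k + 1, std::vector<long long>(n + 1, INF));
    dp[0][0] = 0;
    for (int w = 1; w <= k; ++w)
        for (int j = 1; j <= n; ++j)
            for (int i = 1; i <= j; ++i)            // joints i..j served by warehouse w
                if (dp[w - 1][i - 1] < INF)
                    dp[w][j] = std::min(dp[w][j], dp[w - 1][i - 1] + cost(i - 1, j - 1));

    std::cout << "total distance with k=" << k << ": " << dp[k][n] << "\n";
    // For the sample data this prints 23 (one warehouse at the median, 7).
}

Recording, for each dp[w][j], the split index i that achieved the minimum lets you backtrack to the actual warehouse positions, exactly as described above; the triple loop with O(1) range costs matches the roughly kN^2 estimate.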
This is NOT a clustering problem; it's a special case of a facility location problem. You can solve it using a general integer / linear programming package, but because the problem is on a line, there may be more efficient (and less expensive, software-wise) algorithms that would work. You might consider dynamic programming, since there are probably combinations of facilities that can be eliminated rather quickly. Look into the p-median problem for more info.