I am currently working on an MD simulation. It stores the molecule positions in a vector. For each time step, that vector is stored for display in a second vector, resulting in
std::vector<std::vector<molecule> > data;
The size of data is <time steps> * <number of molecules> * sizeof(molecule), where sizeof(molecule) is already reduced to 3*sizeof(double), i.e. just the position vector. Still, I run into memory problems for larger numbers of time steps and molecules.
So, is there any further way to decrease the amount of data? My current workflow is to calculate all molecules first, store them, and then render them using the data of each molecule for each step; the rendering is done with Irrlicht (maybe later with Blender).
If the trajectories are smooth, you can consider compressing the data by storing only every Nth step and restoring the intermediate positions by interpolation.
If the time step is small, linear interpolation will do. The best quality is provided by cubic splines, but the computation of the spline coefficients is a global operation that you can only perform at the end, and it requires extra storage (!). You might therefore prefer Cardinal splines, which can be built locally from four consecutive positions.
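For instance, a Catmull-Rom spline (one common choice of Cardinal spline) reconstructs a position between two stored samples p1 and p2 from just the four surrounding samples. A minimal sketch, assuming your vector type (e.g. Irrlicht's vector3df) supports the usual +, - and scalar * operators:

// Interpolate between p1 and p2 at parameter t in [0,1], using the
// neighbouring stored samples p0 and p3 (standard Catmull-Rom form).
template <typename Vec3>
Vec3 catmullRom(const Vec3& p0, const Vec3& p1,
                const Vec3& p2, const Vec3& p3, float t)
{
    const float t2 = t * t;
    const float t3 = t2 * t;
    return ((p1 * 2.0f) +
            (p2 - p0) * t +
            (p0 * 2.0f - p1 * 5.0f + p2 * 4.0f - p3) * t2 +
            (p1 * 3.0f - p0 - p2 * 3.0f + p3) * t3) * 0.5f;
}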
You could gain a factor of 2 improvement by storing the positions in single precision rather than double - it will be sufficient for rendering, if not for the simulation.
But ultimately you will need to store the results in a file and render offline.
I am trying to create a distance matrix to run the DBSCAN algorithm for clustering purposes. The final distance matrix has 174,000 x 174,000 entries that are all floating-point numbers between 0 and 1. I have the individual lists (all 174,000 of them) saved, with the numbers stored as ints, but when trying to consolidate them into a single array I keep running out of memory.
Is there a way to compress the data (I have tried HDF5, but that also seems to struggle) so that it can handle such a large data set?
I have a grayscale texture (8000*8000); the value of each pixel is an ID (actually, this ID is the ID of the triangle to which the fragment belongs; I want to use this method to determine how many triangles, and which triangles, are visible in my scene).
Now I need to count how many unique IDs there are and what they are. I want to implement this with GLSL and minimize the data transfer between GPU memory and main memory.
The initial idea I came up with is to use a shader storage buffer, bound to an array in GLSL whose size is totalTriangleNum, then iterate over the ID texture in a shader and increment the array element whose index equals the ID in the texture.
After that, read the buffer back into the OpenGL application and get what I want. Is this an efficient way to do it? Or are there better solutions, such as a compute shader (which I'm not familiar with) or something else?
"I want to use this method to determine how many triangles, and which triangles, are visible in my scene"
Given your description of your data, let me rephrase that a bit:
You want to determine how many distinct values there are in your dataset, and how often each value appears.
This is commonly known as a histogram. Unfortunately (for you), generating histograms is among the problems that are not trivially solved on GPUs. Essentially you have to divide your image into smaller and smaller subimages (BSP, quadtree, etc.) until you are down to single pixels, on which you perform the evaluation. Then you backtrack, propagating the sub-histograms upwards, essentially performing an insertion or merge sort on the histogram.
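To illustrate the merge step, here is a CPU-side sketch (plain C++, not the GPU code itself): each subimage produces its own small histogram, and the per-subimage histograms are combined on the way back up the hierarchy.

#include <cstdint>
#include <unordered_map>
#include <vector>

using Histogram = std::unordered_map<std::uint32_t, std::uint32_t>; // ID -> count

// Histogram of one subimage (a flat list of the IDs it contains).
Histogram histogramOfTile(const std::vector<std::uint32_t>& ids)
{
    Histogram h;
    for (std::uint32_t id : ids)
        ++h[id];
    return h;
}

// Merge a sub-histogram into its parent while backtracking.
void mergeInto(Histogram& total, const Histogram& part)
{
    for (const auto& entry : part)
        total[entry.first] += entry.second;
}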
Generating histograms with GPUs is still actively researched, so I suggest you read up on the published academic works (usually accompanied by source code). Keywords: histogram, GPU.
This one is a nice paper done by the AMD GPU researchers: https://developer.amd.com/wordpress/media/2012/10/GPUHistogramGeneration_preprint.pdf
I am working on a project to simulate a hard sphere model of a gas. (Similar to the ideal gas model.)
I have written my entire project, and it is working. To give you an idea of what I have done, there is a loop which does the following: (Pseudo code)
Get_Next_Collision(); // Figure out when the next collision will occur
Step_Time_Forwards(); // Step to time of collision
Process_Collision(); // Process collision between 2 particles
(Repeat)
For a large number of particles (say N particles), O(N*N) checks must be made to figure out when the next collision occurs. It is clearly inefficient to follow the above procedure, because in the vast majority of cases, collisions between pairs of particles are unaffected by the processing of a collision elsewhere. Therefore it is desirable to have some form of priority queue which stores the next event for each particle. (Actually, since a collision involves 2 particles, only half that number of events will be stored, because if A collides with B then B also collides with A, and at exactly the same time.)
I am finding it difficult to write such an event/collision priority queue.
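For concreteness, this is roughly the kind of event record and queue I have in mind (a rough, untested sketch with illustrative names):

#include <queue>
#include <vector>

struct Event {
    double time;     // absolute time at which the pair is predicted to collide
    int particleA;
    int particleB;
};

// Comparator so that the earliest event sits on top of the queue.
struct LaterThan {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

std::priority_queue<Event, std::vector<Event>, LaterThan> eventQueue;

// The part I find difficult: once a collision is processed, every queued event
// involving either of the two particles becomes invalid and has to be
// discarded or recomputed somehow.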
I would like to know if there are any Molecular Dynamics simulators that have been written whose source code I could look at, in order to understand how I might implement such a priority queue.
Having done a Google search, it is clear to me that there are many MD programs which have been written; however, many of them are either vastly too complex or not suitable.
This may be because they have huge functionality, including the ability to produce visualizations or to simulate particles with interacting forces acting between them, etc.
Some simulators are not suitable because they do calculations for a different model, i.e. something other than the energy-conserving, hard-sphere model with elastic collisions (for example, particles interacting through potentials, or non-spherical particles).
I have tried looking at the source code for LAMMPS, but it's vast and I struggle to make any sense of it.
I hope that is enough information about what I am trying to do. If not I can probably add some more info.
A basic version of a locality-aware system could look like this:
1. Divide the universe into a cubic grid (where each cube has side A and volume A^3), where each cube is sufficiently large, but sufficiently smaller than the total volume of the system. Each grid cube is further divided into 4 sub-cubes whose particles it can theoretically give to its neighboring cubes (and lend for calculations).
2. Each grid cube registers the particles that are contained within it and is aware of its neighboring grid cubes' contained particles.
3. Define a particle's observable universe to have a radius of (grid dimension/2). Define timestep = (griddim/2) / max_speed. This postulates that particles from a maximum of four adjacent grid cubes can theoretically interact in that time period.
4. For every particle in every grid cube, run your traditional collision detection algorithm (with mini_timestep < timestep), where each particle is checked for possible collisions with other particles in its observable universe. Store the collisions in any structure sorted by time, even just an array sorted by the time of collision.
5. The first collision that happens within a mini_timestep resets your universe (and universe clock) to (last_time + time_to_collide), where time_to_collide < mini_timestep. I suppose that does not differ from your current algorithm. Important note: particles' absolute coordinates are updated, but which grid cube and sub-cube they belong to is not updated.
6. Repeat step 5 until the large timestep has passed. Then update the ownership of particles by each grid cube.
The advantage of this system is that for each time window, we have (assuming uniform distribution of particles) O(universe_particles * grid_size) instead of O(universe_particles * universe_size) checks for collision. In good conditions (depending on universe size, speed and density of particles), you could improve the computation efficiency by orders of magnitude.
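A minimal sketch of the bookkeeping behind steps 1-3 (the particle layout, type and function names are only illustrative):

#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Particle {
    double x, y, z;     // absolute position
    double vx, vy, vz;  // velocity
};

// Index of the grid cube that owns a position; 'griddim' is the cube side A.
struct GridIndex {
    int ix, iy, iz;
    bool operator==(const GridIndex& o) const { return ix == o.ix && iy == o.iy && iz == o.iz; }
};

struct GridIndexHash {
    std::size_t operator()(const GridIndex& g) const {
        return (std::size_t)g.ix * 73856093u ^ (std::size_t)g.iy * 19349663u ^ (std::size_t)g.iz * 83492791u;
    }
};

GridIndex cubeOf(const Particle& p, double griddim)
{
    return { (int)std::floor(p.x / griddim),
             (int)std::floor(p.y / griddim),
             (int)std::floor(p.z / griddim) };
}

// Step 3: the large timestep during which only adjacent cubes can interact.
double largeTimestep(double griddim, double max_speed)
{
    return (griddim / 2.0) / max_speed;
}

// Step 2: each cube registers the indices of the particles it contains.
std::unordered_map<GridIndex, std::vector<int>, GridIndexHash>
buildGrid(const std::vector<Particle>& particles, double griddim)
{
    std::unordered_map<GridIndex, std::vector<int>, GridIndexHash> grid;
    for (int i = 0; i < (int)particles.size(); ++i)
        grid[cubeOf(particles[i], griddim)].push_back(i);
    return grid;
}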
I didn't understand how the 'priority queue' approach would work, but I have an alternative approach that may help you. It is what I think @Boyko Perfanov meant by 'make use of locality'.
You can sort the particles into 'buckets', so that you don't have to check each particle against every other one ( O(n²) ). This uses the fact that particles can only collide if they are already quite close to each other. Create buckets that represent a small area/volume, and fill each one with the particles currently inside its area/volume ( O(n) worst case ). Then check all particles inside a bucket against the other particles in the same bucket ( O(m*(n/m)²) average case, where m = number of buckets ). The buckets need to overlap for this to work; alternatively, you can also check the particles from neighboring buckets.
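A sketch of the per-bucket pair check (assuming the buckets have already been filled with particle indices; the pair handler is a placeholder for your actual collision test):

#include <cstddef>
#include <functional>
#include <vector>

// With n particles spread over m buckets, this is the O(m*(n/m)^2)
// average case mentioned above: only pairs within the same bucket are tested.
void checkBuckets(const std::vector<std::vector<int>>& buckets,
                  const std::function<void(int, int)>& handlePair)
{
    for (const auto& bucket : buckets)
        for (std::size_t i = 0; i < bucket.size(); ++i)
            for (std::size_t j = i + 1; j < bucket.size(); ++j)
                handlePair(bucket[i], bucket[j]);
}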
Update: if the particles can travel a distance longer than the bucket size, an obvious 'solution' is to decrease the time step. However, this increases the running time of the algorithm again, and it only works if there is a maximum speed.
Another solution, applicable even when there is no maximum speed, would be to create an additional 'high velocity' bucket. Since the velocity distribution is usually a Gaussian curve, not many particles would have to be placed into that bucket, so the 'bucket approach' would still be more efficient than O(n²).
I have a stream of (x,y) data from which I want to determine velocity and acceleration. The data is pretty typical and can be thought of as representing, say, a car driving around.
A new data point arrives every 2 ms, and I would prefer not to accumulate/store unnecessary values, so I thought of using a boost::accumulator.
Is there a simpler way to handle this type of task? Or perhaps existing libraries that already do this? Or am I on the right track with my thinking? I'm not yet sure what tags I'm going to use, but I like the idea that the container keeps an updated value for a given property and doesn't store the old positional data.
Another idea is to use a circular buffer (e.g. of size 200) and calculate the acceleration based on the last 50 values and the velocity based on all the values in the buffer. However, if the buffer stores raw positional data, this requires looping over all elements every time the acceleration and velocity are calculated. This could be improved by keeping some sort of rolling acceleration and velocity value which is updated by removing the contribution of the oldest element and adding that of the newly inserted element (with weight 1/number of elements in the buffer). However, this seems to me like some sort of Boost rolling weighted accumulator.
You probably want to apply some sort of Kalman filter to the data. Old data needs to be there to help reduce the impact of noise; new data needs to be there too, weighted higher, so that the answer is sensitive to the latest information.
A fairly simple approach for the position, let's call it X, where each new sample is x, is:
X = (1-w) * X + w * x
as each new value comes in. The weight w adjusts how sensitive you are to new information vs. old information: w = 1 means you don't care about history at all, and w = 0 means you don't care about new information at all (and obviously means that X will never change).
The instantaneous velocity can be calculated by computing the difference between successive points and dividing this difference by the time interval. This can in turn be filtered with a Kalman filter.
Acceleration is the difference in sequential velocities, again divided by the time interval. You can filter these as well.
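A minimal sketch of that scheme (exponential smoothing of the position, then divided differences for velocity and acceleration; the class name, w and dt are illustrative, and this is a simple stand-in rather than a full Kalman filter):

struct Vec2 { double x, y; };

// Exponentially smoothed position; velocity and acceleration are estimated
// from divided differences of the smoothed values. 'w' weights new samples,
// 'dt' is the fixed sampling interval, e.g. 0.002 s for a 2 ms stream.
class MotionEstimator {
public:
    MotionEstimator(double w, double dt) : w_(w), dt_(dt) {}

    void addSample(Vec2 p) {
        if (!initialized_) {
            pos_ = p;
            initialized_ = true;
            return;
        }
        const Vec2 newPos { (1 - w_) * pos_.x + w_ * p.x,
                            (1 - w_) * pos_.y + w_ * p.y };
        const Vec2 newVel { (newPos.x - pos_.x) / dt_, (newPos.y - pos_.y) / dt_ };
        acc_ = { (newVel.x - vel_.x) / dt_, (newVel.y - vel_.y) / dt_ };
        vel_ = newVel;
        pos_ = newPos;
    }

    Vec2 position()     const { return pos_; }
    Vec2 velocity()     const { return vel_; }   // unreliable for the first couple of samples
    Vec2 acceleration() const { return acc_; }

private:
    double w_, dt_;
    bool initialized_ = false;
    Vec2 pos_{0, 0}, vel_{0, 0}, acc_{0, 0};
};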
The divided differences will be more sensitive to noise than the position. For example, if the object whose position you're monitoring stops, you will continue to get position measurements, and the velocity vectors computed from successive measurements will point in random directions.
Boost accumulator doesn't appear to do what you want.
My question here is: what data structure should I use to distribute the work to the threads and get the calculated values back from them? The first thing that comes to mind is to fill vector[0] .. vector[639999] (for 800x800 pixels) with a struct that holds x, y and iterate_value, pass that vector to each node -> then further divide the given vector among the cores (OS threads) -> then further divide it among the threads. Is there any other possible way to send and receive the values? Also, if I do it the vector way, should I pass the vector by value or by reference, and which would be better in this case?
Different points of the Mandelbrot set take varying amounts of time to compute (points near the edge are more expensive), so giving each worker an equal number of pixels will have some of them finishing faster than others.
Break the image into small rectangles (tiles). Create a work list using a multithreaded queue, and fill it with the tiles. Each worker thread loops, picking a tile off the work list and submitting the results, until the work list is empty.
Pixels are evenly spaced, so why send the coordinates for each one? Just tell each node the x and y coordinates of its lower left pixel, the spacing between pixels, and the number of pixels. This way, your work unit specification is a small constant size.
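A minimal sketch of such a work list and compact tile description (the field names and queue wrapper are illustrative, not a fixed API):

#include <mutex>
#include <optional>
#include <queue>

// A work unit holds everything a worker needs for one tile: the complex-plane
// coordinates of its lower-left pixel, the per-pixel spacing, its size in
// pixels, and where it goes in the final image. Small and constant-size.
struct Tile {
    double originX, originY;
    double spacing;
    int width, height;
    int pixelX, pixelY;
};

// A very small thread-safe work list; workers pop tiles until it is empty.
class TileQueue {
public:
    void push(const Tile& t) {
        std::lock_guard<std::mutex> lock(mutex_);
        tiles_.push(t);
    }
    std::optional<Tile> pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tiles_.empty())
            return std::nullopt;
        Tile t = tiles_.front();
        tiles_.pop();
        return t;
    }
private:
    std::mutex mutex_;
    std::queue<Tile> tiles_;
};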
As far as the larger design goes, there is no point in having more worker threads than physical cores to run on. The context switches of multiple threads per core only reduce performance.