I am performing particle simulations in C++ and I need to keep a list of contact information between particles. A contact is a data struct containing data related to that contact. Each particle is identified by a unique ID. Once a contact is lost, it is deleted from the list. The bottleneck of the simulation is computing the force (a routine inside the contacts), and I have found that the way the contact list is organised has a significant impact on overall performance.
Currently, I am using a C++ unordered_map (hash map), whose key is a single integer obtained from a pairing function applied to the two unique IDs of the particles, and whose value is the contact itself.
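For reference, this is roughly what the current scheme looks like (a simplified sketch; the Contact fields and the particular pairing function shown here are placeholders for my real ones):

#include <cstdint>
#include <unordered_map>
#include <utility>

struct Contact {
    // Placeholder fields; the real struct holds the actual contact data.
    double overlap = 0.0;
    double normalForce[3] = {0.0, 0.0, 0.0};
};

// Szudzik-style pairing of two particle IDs into one 64-bit key,
// order-independent so (id1, id2) and (id2, id1) address the same contact.
inline std::uint64_t pairKey(std::uint32_t a, std::uint32_t b) {
    if (a > b) std::swap(a, b);                    // ensure a <= b
    return static_cast<std::uint64_t>(b) * b + a;  // unique for every pair with a <= b
}

std::unordered_map<std::uint64_t, Contact> contacts;

// New contact:   contacts[pairKey(id1, id2)] = Contact{};
// Lost contact:  contacts.erase(pairKey(id1, id2));
// Force loop:    for (auto& [key, c] : contacts) { /* compute force from c */ }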
I would like to know if there is a better approach to this problem (organising the list of contacts efficiently while keeping track of which particles they relate to), since I chose my current approach only because I read that a hash map is fast for both insertion and deletion.
Thanks in advance.
I've been racking my brain over the past several days trying to find a solution to a storage/access problem.
I currently have 300 unique items with 10 attributes each (all attributes are currently stored as strings, but some of them are numerical). I am trying to programmatically store them and efficiently access their attributes based on item ID. I have attempted storing them in string arrays, vectors, maps, and multimaps with no success.
Goal: to be able to access an item and any of its attributes quickly and efficiently by its unique identifier.
The closest I have been able to get to being successful is:
string item1[] = {"attrib1","attrib2","attrib3","attrib4","attrib5","attrib6","attrib7","attrib8","attrib9","attrib10"};
I was then able to access an element on demand by calling item1[0], but this is VERY inefficient (particularly when trying to loop through 300 items) and very hard to work with.
Is there a better way to approach this?
If I understand your question correctly, it sounds like you should have some sort of class to hold the attributes, which you would put into a map that has the item ID as the key.
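Something along these lines, where the Item members are only illustrative stand-ins for your ten attributes:

#include <iostream>
#include <map>
#include <string>

// One object per item, with each attribute as a properly typed member.
struct Item {
    std::string name;       // e.g. your attrib1
    std::string category;   // e.g. your attrib2
    double weight;          // one of the numeric attributes
    // ...the remaining attributes as further members
};

int main() {
    std::map<int, Item> items;                    // key: the unique item ID
    items[17] = Item{"widget", "hardware", 2.5};  // store item 17

    const Item& it = items.at(17);                // look it up by its ID
    std::cout << it.name << " weighs " << it.weight << "\n";
    return 0;
}

std::unordered_map works just as well if you don't care about iteration order, and with only 300 items either will be far faster and easier to work with than scanning parallel string arrays.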
I have somewhat of an interesting problem, and I'm looking for data store solutions for efficient querying.
I have a large (1M+) number of business objects, and each object has a large number of attributes (on the order of 100). The attributes are relatively unstructured -- the system has thousands of possible attributes, their number grows over time, and each object has an arbitrary (i.e. sparse) subset of them.
I frequently have to perform the following operation: find all objects with some concrete set of attributes S and perform an aggregation over them. I never know S ahead of time, so on every request I have to perform an expensive sweep of the database, which doesn't scale.
What are some data store solutions for this kind of problem? One possible solution would be to have a data store that parallelizes the aggregations -- maybe Cassandra with Hive/Pig on top?
Thoughts?
At this point, Cassandra + Spark is a likely candidate.
In a pure Cassandra world, you could (in theory) create a manual mapping of all possible S attributes to data objects, and then load those in via the application and process them there (where the name of the S attribute is the partition key, the value of the S attribute is a clustering key, and the data object ID itself is another clustering key; that way you can quickly iterate over all objects that have a given S attribute set).
It's not incredibly sexy, but could be made to work.
I'm trying to implement DRUM (Disk Repository with Update Management) in Java, as per the IRLBot paper (the relevant pages start at page 4), but as a quick summary, it's essentially an efficient way of batch-updating (key, value) pairs against a persistent repository. In the linked paper it's used as the backbone behind the crawler's URLSeen test, RobotsTxt check and DNS cache.
There has helpfully been a C++ implementation done here, which lays out the architecture in a much more digestible way. For ease of reference, this is the architecture diagram from the C++ implementation:
The part which I'm struggling to understand is the reasoning behind keeping the (key, value) buckets and the auxiliary buckets separate. The article with the C++ implementation states the following:
During merge a key/value bucket is read into a separate buffer and sorted. Its content is synchronized with that of the persistent repository. Checks and updates happen at this moment. Afterwards, the buffer is re-sorted to its original order so that key/value pairs match again the corresponding auxiliary bucket. A dispatching mechanism then forwards the key, value and auxiliary for further processing along with the operation result. This process repeats for all buckets sequentially.
So if the order of the (key, value) buckets needs to be restored to that of the auxiliary buckets in order to re-link the (key, value) pairs with the auxiliary information, why not just keep the (key, value, aux) entries together in single buckets? What is the reasoning behind keeping them separate, and would it be more efficient to just keep them together (since you would no longer need to restore the original unsorted order of the bucket)?
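For concreteness, this is how I currently picture the merge step (a heavily simplified sketch with names of my own; the repository here is just an in-memory stand-in for the persistent store):

#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

struct KvEntry {
    std::uint64_t key;
    std::string   value;
    std::size_t   origPos;        // position in the bucket file, used to restore the order
    bool          isNew = false;  // result of the check against the repository
};

// Stand-in for the persistent repository (e.g. the URLSeen store).
std::unordered_set<std::uint64_t> repository;

void mergeBucket(std::vector<KvEntry>& kvBucket,
                 const std::vector<std::string>& auxBucket) {
    // 1. Sort the key/value bucket by key so the repository can be scanned once.
    std::sort(kvBucket.begin(), kvBucket.end(),
              [](const KvEntry& a, const KvEntry& b) { return a.key < b.key; });

    // 2. Check/update each entry against the persistent repository.
    for (KvEntry& e : kvBucket)
        e.isNew = repository.insert(e.key).second;

    // 3. Re-sort to the original file order so entry i lines up with auxBucket[i] again.
    std::sort(kvBucket.begin(), kvBucket.end(),
              [](const KvEntry& a, const KvEntry& b) { return a.origPos < b.origPos; });

    // 4. Dispatch key, value, aux and the operation result for further processing.
    for (const KvEntry& e : kvBucket) {
        const std::string& aux = auxBucket[e.origPos];
        // dispatch(e.key, e.value, aux, e.isNew);
        (void)aux;
    }
}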
At merge time, DRUM loads the content of the key/value disk file of the respective bucket and, depending on the operation used, checks, updates or check+updates every single entry of that file against the backing datastore.
The auxiliary disk file is therefore irrelevant at that point, and not loading the auxiliary data into memory simply reduces the memory footprint during sorting, which DRUM tries to minimize in order to process the uniqueness of more than 6 billion entries. In the case of the RobotsCache, for example, the auxiliary data can even be around 100 KB per entry. This is, however, only my own theory; if you really want to know why they separated these two buffers and disk files, you should probably ask Dmitri Loguinov.
I've also created a Java-based DRUM implementation (and a Java-based IRLbot implementation), though both might need a bit more love. There is also a further Java-based GitHub project called DRUMS, which extends DRUM with a select feature and was used to store genome codes.
I want to run a MapReduce job that scans multiple columns of a given file and assigns a unique ID (index number) to each distinct value in each column. The main challenge is sharing the same ID for the same value when it is encountered on different nodes or in different reducer instances.
Currently, I am using ZooKeeper to share the unique IDs, but that has a noticeable performance impact. I have even kept the information in local caches at the reducer level to avoid multiple trips to ZooKeeper for the same value. I want to explore whether there is a better mechanism for doing this.
I can suggest two possible solutions for your problem:
Create a unique ID based on the value itself. This could be a hash function with a low collision rate (see the sketch after this list).
Use faster storage than ZooKeeper. You could try a simple key-value store like Redis to hold the value-to-ID mapping.
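For the first option, the point is that the ID is derived deterministically from the value itself, so every node and reducer computes the same ID without any coordination. A small sketch of the idea (FNV-1a is just one example of a cheap 64-bit hash with a low collision rate; in a Hadoop job you would write the equivalent in Java):

#include <cstdint>
#include <string>

// FNV-1a, 64-bit: deterministic, so every reducer derives the same ID
// for the same value without a round trip to ZooKeeper.
std::uint64_t valueToId(const std::string& value) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : value) {
        hash ^= c;
        hash *= 1099511628211ULL;                  // FNV prime
    }
    return hash;
}

// valueToId("some column value") yields the same 64-bit ID on every node.

With 64-bit hashes, collisions stay very unlikely for realistic numbers of distinct values; but if the IDs must be dense, consecutive index numbers, this approach does not fit and a fast shared store (option 2) is the better route.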
I'm rewriting an application that handles a lot of data (about 100 GB) organised as a relational model.
The application is very complex; it is a conversion tool for OpenStreetMap data of huge size (the whole world) and converts it into a map file for our own route-planning software. The converter, for example, holds the OpenStreetMap nodes with their coordinates and all their tags (it holds a lot more than that, but this should serve as an example for this question).
Current situation:
Because this data is huge, I split it into several files: each file is a map from an ID to an atomic value (let's assume that the list of tags for a node is an atomic value; it is not, but the data storage can treat it as such). So for nodes, I have a file holding the nodes' coordinates, one holding their names and one holding their tags, where the nodes are identified by (non-contiguous) IDs.
The application was once split into several applications, each of which processes one step of the conversion. Therefore, each such application only needs to handle some of the data stored in the files. For example, not all applications need the nodes' tags, but a lot of them need the nodes' coordinates. This is why I split the relations into files, one file for each "column".
Each processing step can read a whole file at once into a data structure within RAM. This ensures that lookups can be very efficient (if the data structure is a hash map).
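To illustrate, a processing step currently does something along these lines (a simplified sketch; the binary record layout is invented for the example):

#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct Coords {
    double lat;
    double lon;
};

// Read one "column" file (fixed-size id/lat/lon records) entirely into a hash map.
std::unordered_map<std::uint64_t, Coords> loadNodeCoords(const char* path) {
    std::unordered_map<std::uint64_t, Coords> coords;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return coords;

    std::uint64_t id;
    Coords c;
    while (std::fread(&id, sizeof id, 1, f) == 1 &&
           std::fread(&c, sizeof c, 1, f) == 1) {
        coords[id] = c;   // later lookups during the step are near constant time
    }
    std::fclose(f);
    return coords;
}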
I'm currently rewriting the converter. It should now be one single application, and it should no longer use separate files for each "column". Instead, it should use some well-known architecture for holding external data in a relational manner, like a database, but much faster.
=> Which library can provide the following features?
Requirements:
It needs to be very fast at iterating over the existing data (while not modifying the set of rows, but possibly modifying some values in the current row).
It needs to provide constant or near-constant-time lookup, similar to hash maps (while not modifying the relation at all).
Most of the column types are of fixed size, but in general they are not.
It needs to be able to append new rows to a relation in constant or logarithmic time per row. Live-updating some kind of search index will not be required. Updating (rebuilding) the index can happen after a whole processing step is complete.
Some relations are key-value-based, while others are a (contiguously indexed) array. Both of them should provide fast lookups.
It should NOT be a separate process, as a DBMS like MySQL would be. The number of queries will be enormous (around 10 billion) and would completely dominate the performance. However, batching queries would be a possible workaround: iterating over a whole table can be done in a single query, while writing to a table (from which no data will be read in the same processing step) can happen as a batch query. But still, I expect that serializing, transmitting between processes and deserializing SQL queries would be the bottleneck.
Nice-to-have: ease of use. It would be very nice if the relations could be used in a similar way to the C++ standard and Qt container classes.
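To make the nice-to-have concrete, this hypothetical interface is roughly what I have in mind (all names are invented, and std::unordered_map is only a placeholder for whatever external, possibly disk-backed storage the library actually uses):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Toy stand-in for the kind of relation I would like; a real library would keep
// most of the data out of core instead of in a plain in-memory hash map.
template <typename Key, typename Row>
class Relation {
public:
    void append(const Key& key, Row row) { rows_[key] = std::move(row); }  // fast append
    const Row& at(const Key& key) const { return rows_.at(key); }          // near-constant lookup
    auto begin() { return rows_.begin(); }  // fast iteration; row values may be modified,
    auto end()   { return rows_.end(); }    // but the set of rows stays fixed
private:
    std::unordered_map<Key, Row> rows_;
};

struct NodeRow {
    double lat;
    double lon;
    std::string name;
};

// Container-style usage:
//   Relation<std::uint64_t, NodeRow> nodes;
//   nodes.append(42, NodeRow{48.1, 11.5, "Some node"});
//   double lat = nodes.at(42).lat;
//   for (auto& entry : nodes) { /* tweak entry.second in place */ }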
Non-requirements (Why I don't need a DBMS):
Synchronizing writing and reading from/to the same relation. The application is split into multiple processing steps; every step has a set of "input relations" it reads from and "output relations" it writes into. However, some steps require reading some columns of a relation while writing to other columns of the same relation.
Joining relations. There are a few cross-references between different relations; however, they can be resolved within my application if lookups are fast enough.
Persistent storage. Once the conversion is done, none of the data will be required anymore.
The key-value-based relations will never be re-keyed; the array-based relations will never be re-indexed.
I can think of several possible solutions depending on lots of factors that you have not quantified in your question.
If you want a simple store to look things up and you have sufficient disk, SQLite is pretty efficient as a database. Note that there is no SQLite server; the 'server' is linked into your application.
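A minimal sketch of what in-process use could look like from C++ (table, column and file names are invented for the example):

#include <sqlite3.h>
#include <cstdint>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("nodes.db", &db) != SQLITE_OK) return 1;

    // Speed-oriented settings for a throwaway conversion database.
    sqlite3_exec(db, "PRAGMA journal_mode = OFF; PRAGMA synchronous = OFF;",
                 nullptr, nullptr, nullptr);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS node (id INTEGER PRIMARY KEY, lat REAL, lon REAL);",
                 nullptr, nullptr, nullptr);

    // Batch inserts inside one transaction; the prepared statement avoids re-parsing SQL.
    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO node (id, lat, lon) VALUES (?, ?, ?);", -1, &ins, nullptr);
    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
    for (std::int64_t id = 1; id <= 1000; ++id) {
        sqlite3_bind_int64(ins, 1, id);
        sqlite3_bind_double(ins, 2, 48.0 + id * 1e-6);
        sqlite3_bind_double(ins, 3, 11.0 + id * 1e-6);
        sqlite3_step(ins);
        sqlite3_reset(ins);
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
    sqlite3_finalize(ins);

    // Keyed lookup by primary key, no inter-process round trip involved.
    sqlite3_stmt* sel = nullptr;
    sqlite3_prepare_v2(db, "SELECT lat, lon FROM node WHERE id = ?;", -1, &sel, nullptr);
    sqlite3_bind_int64(sel, 1, 42);
    if (sqlite3_step(sel) == SQLITE_ROW)
        std::printf("node 42: %f %f\n", sqlite3_column_double(sel, 0), sqlite3_column_double(sel, 1));
    sqlite3_finalize(sel);
    sqlite3_close(db);
    return 0;
}

Since the library runs inside your process, there is no serializing and transmitting of queries between processes; batching writes in a transaction and reusing prepared statements keeps the per-row overhead small.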
Personally, this job smacks of being embarrassingly parallel. I would think that a small Hadoop cluster would make quick work of the entire job. You could spin it up in AWS, process your data, and shut it down pretty inexpensively.