What type of data structure would be efficient for searching a process table?

I have to search a process table that is populated with the names of processes running on a given set of IP addresses.
Currently I am using a multimap in C++ with the process name as the key and the IP address as the value.
Is there a more efficient data structure for the same task?
Also, can I gain any parallelism by using pthreads? If so, can anyone point me in the right direction?

You do not need parallelism to access an in-RAM data structure of several thousand entries. You can just lock it (making sure only one process/thread accesses it at a time) and ensure the access itself is efficient enough. A multimap is okay; a hash map would be better, though.
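A minimal sketch of that suggestion, assuming C++11 is available: a hash-based multimap guarded by a single mutex. The class name ProcessTable and the methods add and hostsOf are hypothetical names chosen for illustration, and storing both the process name and the IP as std::string is an assumption about the data.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical wrapper: a process table guarded by one mutex, so only one
// thread touches the underlying hash table at a time.
class ProcessTable {
public:
    // Record that `process` is running on host `ip`.
    void add(const std::string& process, const std::string& ip) {
        std::lock_guard<std::mutex> lock(mutex_);
        table_.emplace(process, ip);
    }

    // Return every IP address on which `process` is running.
    std::vector<std::string> hostsOf(const std::string& process) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<std::string> hosts;
        auto range = table_.equal_range(process);  // all entries with this key
        for (auto it = range.first; it != range.second; ++it) {
            hosts.push_back(it->second);
        }
        return hosts;
    }

private:
    mutable std::mutex mutex_;  // mutable so const lookups can lock it
    std::unordered_multimap<std::string, std::string> table_;
};
```

Lookups cost O(1) on average plus the number of matching hosts, and the lock is held only for the duration of one call.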

What is a typical query to your table?
Try a hash map; it can be faster for big tables.
How do you store the names and IPs? UTF, string, char*? IP as uint32 or string?
For a read-only structure with a lot of read queries, you can benefit from several threads.
Update: use std::unordered_multimap from #include <tr1/unordered_map>

Depending on the size of the table, you may find a hash table more efficient than the multimap container (which is typically implemented with a balanced binary tree).
The hash_multimap container, a non-standard extension whose standardized counterpart is std::unordered_multimap, implements a hash table and could be of use to you.

Related

Is it efficient to read the contents of a file into an unordered_map if it has over 1000 entries

I'm making a hash table that's supposed to give pretty fast lookup times for some values I type in beforehand. I didn't know how to go about it, but my friend said I should make a text file and just have an unordered map that reads from the text file and puts the values in the code before I run it. Is this efficient? Is there a better way to do this?
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?

As said in comments, your idea is good enough unless these structures are really large, megabytes.
If you have reasons to worry about the performance of that, e.g. if you want to support millions of records or very large values, more complicated approaches can be more efficient.
When I only need 64-bit support, I sometimes make a single binary file designed to be memory-mapped in its entirety. Specifically: a fixed-size header, then a sorted array of (key, offset) tuples serving as a primary index (you can binary-search it; the OS fetches only the required pages from a mapped file and caches them in RAM quite aggressively), then the values at the offsets specified in the index.
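A rough sketch of such a layout, under stated assumptions: the header is reduced to a single 64-bit record count, each index entry carries a length field in addition to the (key, offset) pair, and an in-memory std::vector<char> stands in for the mapped file. The names IndexEntry, buildBlob and lookup are hypothetical, not the format described above.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Assumed layout: [uint64 count][count x IndexEntry sorted by key][values...]
struct IndexEntry {
    std::uint64_t key;
    std::uint64_t offset;  // byte offset of the value from the start of the blob
    std::uint64_t length;  // byte length of the value (an added assumption)
};

typedef std::pair<std::uint64_t, std::string> Record;

static bool recordKeyLess(const Record& a, const Record& b) {
    return a.first < b.first;
}

// Serialize records into one contiguous blob with a sorted primary index.
std::vector<char> buildBlob(std::vector<Record> records) {
    std::sort(records.begin(), records.end(), recordKeyLess);
    const std::uint64_t count = records.size();
    std::uint64_t offset = sizeof(count) + count * sizeof(IndexEntry);
    std::vector<char> blob(offset);
    std::memcpy(blob.data(), &count, sizeof(count));
    for (std::uint64_t i = 0; i < count; ++i) {
        const std::string& value = records[i].second;
        IndexEntry entry = {records[i].first, offset, value.size()};
        std::memcpy(blob.data() + sizeof(count) + i * sizeof(IndexEntry),
                    &entry, sizeof(entry));
        blob.insert(blob.end(), value.begin(), value.end());
        offset += value.size();
    }
    return blob;
}

// Binary-search the index; on success copy the value into `out`.
bool lookup(const std::vector<char>& blob, std::uint64_t key, std::string& out) {
    std::uint64_t count = 0;
    std::memcpy(&count, blob.data(), sizeof(count));
    const char* index = blob.data() + sizeof(count);
    std::uint64_t lo = 0, hi = count;
    while (lo < hi) {
        const std::uint64_t mid = lo + (hi - lo) / 2;
        IndexEntry entry;
        std::memcpy(&entry, index + mid * sizeof(IndexEntry), sizeof(entry));
        if (entry.key == key) {
            out.assign(blob.data() + entry.offset, entry.length);
            return true;
        }
        if (entry.key < key) lo = mid + 1; else hi = mid;
    }
    return false;
}
```

With a real mapped file, lookup would take the mapped pointer instead of a vector; each probe touches only the pages holding the index entries it reads.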

Use std::map when:
You need ordered data.
You have to print/access the data in sorted order.
You need the predecessor/successor of elements.
Use std::unordered_map when:
You need to keep a count of some data (for example, strings) and no ordering is required.
You need single-element access, i.e. no traversal.
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?
Definitely you can, but I hope you know that you cannot read a file with a map; fstream is there for that purpose.
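To make the last point concrete, here is a small sketch of loading structures from text into an unordered_map. The Item struct, its fields, and the one-record-per-line format "<key> <quantity> <price>" are all assumptions made up for illustration; the question does not say what the structures contain.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical record type; the fields are placeholders.
struct Item {
    int quantity;
    double price;
};

// Parse lines of the assumed form "<key> <quantity> <price>" from any input
// stream (for a file, pass a std::ifstream). Malformed lines are skipped.
std::unordered_map<std::string, Item> loadItems(std::istream& in) {
    std::unordered_map<std::string, Item> items;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key;
        Item item = {0, 0.0};
        if (fields >> key >> item.quantity >> item.price) {
            items[key] = item;  // a later duplicate key overwrites an earlier one
        }
    }
    return items;
}
```

The stream does the reading and the map only stores the result; for around a thousand entries the load time should be negligible.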

Can I reinterpret a memory mapped file of key-value pairs as a map in order to sort them?

I have a memory-mapped file that contains key-value pairs. Both the key and the value are uint32_t, and all the keys and values are stored in the file in binary, where each key immediately precedes its value. The file contains only these pairs, no delimiters.
I want to be able to sort all of these key-value pairs by increasing key.
The following just compiled in my code:
struct FileAsMap { map<uint32_t, uint32_t> keyValueMap; };
const FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
but I don't really know what to do from here, since by definition the map container keeps a strict weak ordering of the pairs by key. If I just reinterpret the mapped file as a map, how can I get the pairs to order?

This is not an answer, but the explanation doesn't fit within the comment length limit.
The keys in a map are usually unique (at least in std::map they are). But maps differ from one another in how they organize the stored keys. For example, std::map is based on a balanced binary tree, so retrieving a given key costs O(log n), where n is the number of elements in the map. std::unordered_map, by contrast, is a hash map internally with an average access time of O(1); that is, it finds a key in constant time on average regardless of the number of elements.
In any case, all these containers require a dedicated internal in-memory structure that practically never looks like a simple stream of key-value pairs. That's why I said in my first comment that it is almost impossible to reuse one of the standard maps as a convenient accessor for mmap-ed data without first reading and unpacking the data stream.
But you can create your own map-like class that iterates over the data in the mmap-ed area and checks in its operator[](size_t i) whether a stored key matches the requested one. I guess the simplest implementation would fit on a single screen of code.
But beware: a sequential scan is a relatively expensive operation, so if you have enough elements in the file, it could become unacceptably slow. In that case you'll need some optimized indexing; for example, all keys are read at the beginning of processing and an index array is built. But all these questions depend heavily on the details of the task, so it's better to stop the explanation here.
If you have any further questions, feel free to ask. Of course, a good question assumes that you have already studied the subject and have now encountered a particular problem that you can't solve yourself.

There are a lot of reasons why the answer is no. The two simplest are:
Maps are a structure that stores data in a form in which it's already sorted. Your data isn't already sorted, so it's simply not a map.
The map class has its own internal data structure that it uses to store maps. Unless your file replicates this internal structure perfectly (which it almost certainly can't since it likely includes pointers into memory) the map class will misunderstand the data in the file.
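Since the file cannot be viewed as a std::map, one alternative is to treat it as a flat array of 8-byte records and sort that array in place. The sketch below assumes the mapping is writable, suitably aligned, and that its bytes can be addressed as an array of the Pair struct defined here; Pair, sortPairs and findPair are hypothetical names, and how the Pair pointer is obtained from the mapping is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// One on-disk record: a key immediately followed by its value.
struct Pair {
    std::uint32_t key;
    std::uint32_t value;
};
static_assert(sizeof(Pair) == 8, "Pair must match the 8-byte on-disk record");

inline bool keyLess(const Pair& a, const Pair& b) { return a.key < b.key; }

// Sort `count` records in place by increasing key.
void sortPairs(Pair* pairs, std::size_t count) {
    std::sort(pairs, pairs + count, keyLess);
}

// Binary search over records already sorted by key; nullptr if absent.
const Pair* findPair(const Pair* pairs, std::size_t count, std::uint32_t key) {
    const Pair probe = {key, 0};
    const Pair* it = std::lower_bound(pairs, pairs + count, probe, keyLess);
    return (it != pairs + count && it->key == key) ? it : nullptr;
}
```

Once sorted, lookups are O(log n) without building any extra container; if the mapping is shared and writable, the sorted order is also written back to the file.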

How did you serialize the data to the file?
Assuming that you serialized a struct consisting of maps, you'd de-serialize as below:
FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
Gives access to entire structure (blob).
(*fileAsMap).keyValueMap gives access to map.

Vector or Map or Hash map for C++?

I have a large number of records, say around 4,000,000, that I want to access repeatedly, storing information in a class linked to each record. I am not sure what kind of data structure I should use: vectors, maps, or hash maps? I don't need to insert records, but I need to read a table that contains sets of these record numbers (or names), then grab some of the data linked to each record and do some processing on it. Is lookup in a map fast enough that I don't need a hash map for this example? Each record is structured as a class, and I have never used a map or hash map with a class as its value (if that is even possible).
Thanks in advance guys.
Edited:
I do not need to have all the records in memory at the same time for now. I need to give the data a structure first and then grab the data from some of the records. The total number of records is around 20 million. I want to read each of these raw records and, if its basic info doesn't already exist in the new map or vector that I want to create, add it, and put the rest of the data in there as a vector. Because I have 20 million records, I think it would be excruciating to go through 4 million records for every record to find out whether its basic info exists or not. I have around 4 million types of packages, and each package can have more than one type of service (roughly 5 (20/4) per package). I want to read each record and, if the package ID does not yet exist in the vector (or whatever I end up using), push the basic info into it, and then save the services related to that package in a vector inside the package class.

These three data structures each have a different purpose.
A vector is basically a dynamic array, which is good for indexed values.
A map is a sorted data structure with O(log(n)) retrieval and insertion time (implemented using a balanced binary tree, usually Red-Black). This is best if you can't find an efficient hash method.
A hash_map uses hashes to retrieve objects. If you have a well-defined hash function with a low collision rate, you will get constant retrieval and insertion time on average. hash_maps are usually faster than maps, but not always. It is highly dependent on the hash function.
For your example, I think that it is best to use a hash_map where the key would be the record number (assuming record numbers are unique).
If these record numbers are dense (meaning there are few or no gaps between the indexes, say: 1,2,4,5,8,9,10...), you can use a vector. If your records are coming from a database with an autoincrement primary key and not many deletions, this should usually be the case.
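Applied to the package/service case from the question, the hash map approach might look like the sketch below. It uses std::unordered_map as the standard counterpart of hash_map; the RawRecord and Package types, their fields, and the function addRecord are hypothetical, since the question does not show the actual record layout.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical shape of one raw input record (about 20 million of these).
struct RawRecord {
    std::uint64_t packageId;
    std::string packageInfo;
    std::string service;
};

// Hypothetical per-package class (about 4 million of these).
struct Package {
    std::string info;
    std::vector<std::string> services;
};

typedef std::unordered_map<std::uint64_t, Package> PackageTable;

// Insert the package's basic info the first time its ID is seen, then append
// the service. Each call is O(1) on average instead of a scan over every
// package seen so far.
void addRecord(PackageTable& table, const RawRecord& record) {
    Package fresh = {record.packageInfo, {}};
    // emplace leaves an existing entry untouched and returns an iterator to it.
    auto result = table.emplace(record.packageId, fresh);
    result.first->second.services.push_back(record.service);
}
```

If the package IDs turn out to be dense, the same function works with a std::vector<Package> indexed by ID instead of the hash map.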

Efficient data structure for Zobrist keys

Zobrist keys are 64-bit hashed values used in board games to univocally represent different positions found during a tree search. They are usually stored in arrays having a size of 1000K entries or more (each entry is about 10 bytes long). The table is usually accessed using hashKey % size as the index. What kind of STL container would you use to represent this sort of table? Consider that since the size of the table is limited, collisions might happen. With a "plain" array I would have to handle this case, so I thought of an unordered_map, but since the implementation is not specified, I am not sure how efficient it will be while the map is being populated.

Seems to me a standard hashmap would suit you well - very fast look up and it will handle the collisions for you reliably and invisibly.
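For comparison with the hash map suggestion, the fixed-size table described in the question (index = hashKey % size) can be sketched with a std::vector. The always-replace collision policy, the Entry fields, and the class name TranspositionTable are assumptions for illustration, not something the question specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entry {
    std::uint64_t key;   // full Zobrist key, kept to detect index collisions
    std::int16_t score;  // hypothetical payload
    bool used;
};

class TranspositionTable {
public:
    explicit TranspositionTable(std::size_t size) : slots_(size) {}

    // Always-replace policy (an assumption): a colliding position simply
    // overwrites whatever was in the slot.
    void store(std::uint64_t key, std::int16_t score) {
        Entry& slot = slots_[key % slots_.size()];
        slot.key = key;
        slot.score = score;
        slot.used = true;
    }

    // Returns the entry only if the slot holds exactly this key.
    const Entry* probe(std::uint64_t key) const {
        const Entry& slot = slots_[key % slots_.size()];
        return (slot.used && slot.key == key) ? &slot : nullptr;
    }

private:
    std::vector<Entry> slots_;  // value-initialized: every slot starts unused
};
```

Unlike unordered_map, this never allocates after construction and its memory use is bounded, at the price of losing entries on collision.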

If you wish to explore territory beyond the STL, take a look at Judy arrays: these should fit your problem.
If you are on Linux you can experiment with them very easily, just install from your repository...
This application note could help to solve your task.
EDIT
There is this STL interface: I'm going to experiment with it, then I'll report my results.

Hash table with two keys

I have a large amount of data that I want to be able to access in two different ways. I would like constant-time lookup based on either key, constant-time insertion with one key, and constant-time deletion with the other. Is there such a data structure, and can I construct one using the data structures in tr1 and maybe boost?

Use two parallel hash-tables. Make sure that the keys are stored inside the element value, because you'll need all the keys during deletion.

Have you looked at Bloom Filters? They aren't O(1), but I think they perform better than hash tables in terms of both time and space required to do lookups.

Hard to see why you need to do this, but as someone said, try using two different hash tables.
Just pseudocode here:
Hashtable inHash;
Hashtable outHash;
//Hello myObj example!!
myObj.inKey="one";
myObj.outKey=1;
myObj.data="blahblah...";
//adding stuff
inHash.store(myObj.inKey,myObj.outKey);
outHash.store(myObj.outKey,myObj);
//deleting stuff
inHash.del(myObj.inKey,myObj.outKey);
outHash.del(myObj.outKey,myObj);
//finding stuff
//straight
myObj=outHash.get(1);
//the other way; still constant time
key=inHash.get("one");
myObj=outHash.get(key);
Not sure if that's what you're looking for.
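One way the pseudocode above could look in real C++, assuming std::unordered_map is available and that both keys are unique. The Record struct and the TwoKeyTable class with its method names are hypothetical; the first table maps inKey to outKey and the second maps outKey to the record, as in the pseudocode.

```cpp
#include <string>
#include <unordered_map>

struct Record {
    std::string inKey;
    int outKey;
    std::string data;
};

class TwoKeyTable {
public:
    // Assumes neither key is already present; both tables are updated.
    void insert(const Record& r) {
        outToRecord_[r.outKey] = r;
        inToOut_[r.inKey] = r.outKey;
    }

    // Direct lookup by the second key.
    const Record* findByOutKey(int outKey) const {
        auto it = outToRecord_.find(outKey);
        return it == outToRecord_.end() ? nullptr : &it->second;
    }

    // Lookup by the first key: two hash probes, still constant time on average.
    const Record* findByInKey(const std::string& inKey) const {
        auto it = inToOut_.find(inKey);
        return it == inToOut_.end() ? nullptr : findByOutKey(it->second);
    }

    // Deletion needs both keys, which is why the record stores its inKey.
    bool eraseByOutKey(int outKey) {
        auto it = outToRecord_.find(outKey);
        if (it == outToRecord_.end()) return false;
        inToOut_.erase(it->second.inKey);
        outToRecord_.erase(it);
        return true;
    }

private:
    std::unordered_map<std::string, int> inToOut_;
    std::unordered_map<int, Record> outToRecord_;
};
```

Keeping the payload in only one table avoids storing it twice; the cost is one extra probe when looking up by inKey.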

This is one of the limits of the design of standard containers: a container in a sense "owns" the contained data and expects to be the only owner... containers are not merely "indexes".
For your case a simple, but not 100% effective, solution is to have two std::maps with "Node *" as value and storing both keys in the Node structure (so you have each key stored twice). With this approach you can update your data structure with reasonable overhead (you will do some extra map search but that should be fast enough).
A possibly "correct" solution however would IMO be something like
struct Node
{
    Key key1;
    Key key2;
    Payload data;
    Node *Collision1Prev, *Collision1Next;
    Node *Collision2Prev, *Collision2Next;
};
basically having each node in two different hash tables at the same time.
Standard containers cannot be combined this way. Other examples I have coded by hand in the past include a hash table where all nodes are also in a doubly-linked list, and a tree where all nodes are also in an array.
For very complex data structures (e.g. network of structures where each one is both the "owner" of several chains and part of several other chains simultaneously) I even resorted sometimes to code generation (i.e. scripts that generate correct pointer-handling code given a description of the data structure).
