Caching data from MySQL DB - technique and appropriate STL container? - c++

I am designing a data caching system that could have a very large amount of records held at a time, and I need to know what stl container to use and how to use it. The application is that I have an extremely large DB of records for users - when they log in to my system I want to pull their record and cache some data such as username and several important properties. As they interact with the system, I update and access their properties. Several properties are very volatile and I'm doing this to avoid "banging" on the DB with many transactions. Also, I rarely need to be using the database for sorting or anything - I'm using this just like a glorified binary save file (which is why I am happy to cache records to memory..); a more important goal for me is to be able to scale to huge numbers of users.
When the user logs out, server shuts down, or periodically in round-robin fashion (just in case..), I want to write their data back to the DB.
The server keeps its own:
vector <UserData *> loggedInUsers;
With UserData keeping things like username (string) and other properties from the DB, as well as other temporary data like network handles.
My first Q is, if I need to find a specific user in this vector, what's the fastest way to do that, and is there a different STL container that would do it faster? What I do now is create an iterator, start it at loggedInUsers.begin() and iterate to .end(), checking (*iter)->username == "foo" and returning when it's found. If the username is at the end of the vector, or if the vector has 5000 users, this is a significant delay.
My second Q is, how can I round-robin schedule this data to be written back to the DB? I can call a function every time I'm ready to write a few records to the DB. But I can't hold an iterator to the vector, because it will become invalid. What I'd like to do is have a rotating queue where I can access the head of the queue, persist it to the DB, then rotate it to be the end of the queue. That seems like a lot of overhead.. what type could I use to do this better?
My third Q is, I'm using MySQL server and libmysqlclient connector/C.. is there any kind of built in caching that could solve this problem "for free", or is there a different technique altogether? I'm open to suggestions

A1. You're better off with a map, which is a tree that does the lookup for you. Test with a map and, assuming your compiler provides one, a hash_map (standardized as std::unordered_map in C++11), which does the same job but with a different lookup mechanism. They have different performance characteristics for different workloads.
A2. A list would probably be better for you: push to the front, pull off the end. (A deque could also be used, but erasing from a deque can invalidate iterators you are holding, whereas list iterators to the remaining elements stay valid.) push_back and pop_front (or vice versa) will give you a rolling queue of cached data.
A3. You could try SQLite, which is a mini-database designed for simple application-level db storage needs. It can work entirely in-memory too.
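A1 and A2 can be combined. What follows is only a rough sketch, not a definitive design: it assumes C++11, keeps the records in a std::list in write-back order, and indexes them by username with a std::unordered_map of list iterators. UserCache, its member names, and the score field are made up for illustration.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical record type; the real UserData would carry more fields.
struct UserData {
    std::string username;
    int score;  // stand-in for one volatile property
};

class UserCache {
public:
    // Cache a user at the back of the write-back rotation.
    // Returns nullptr if the username is already cached.
    UserData* add(const UserData& user) {
        if (index_.count(user.username) != 0) return nullptr;
        order_.push_back(user);
        index_[user.username] = std::prev(order_.end());
        assert(index_.size() == order_.size());
        return &order_.back();
    }

    // Average O(1) lookup by username; nullptr if not cached.
    UserData* find(const std::string& name) {
        auto it = index_.find(name);
        return it == index_.end() ? nullptr : &*(it->second);
    }

    // Round-robin step: move the head to the tail and return it so the
    // caller can persist it. splice relinks the node without copying, and
    // the iterators stored in index_ stay valid.
    UserData* nextToPersist() {
        if (order_.empty()) return nullptr;
        order_.splice(order_.end(), order_, order_.begin());
        return &order_.back();
    }

    // Drop a user, for example on logout after a final write-back.
    bool remove(const std::string& name) {
        auto it = index_.find(name);
        if (it == index_.end()) return false;
        order_.erase(it->second);
        index_.erase(it);
        return true;
    }

private:
    std::list<UserData> order_;
    std::unordered_map<std::string, std::list<UserData>::iterator> index_;
};
```

Calling nextToPersist() a few times per timer tick gives the rotating queue from the second question without holding an iterator between calls. The actual MySQL write is left to the caller.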

You don't say what your system does or how it's accessed, but this kind of technique probably won't scale well (because eventually you'll run out of memory and whatever you use to find information won't be as efficient as a database) and won't necessarily handle concurrent users properly, unless you make sure that data can be shared properly between them.
That said.. you might be better off using a map (http://www.cplusplus.com/reference/stl/map/) with the username as the key.
In terms of writing it back to the database, why not store a separate structure (a queue) that you can clear every time you write it to the database? As long as you're storing pointers it won't use much more memory. Which brings me to.. rather than using pointers you should take a look at smart pointers (for example boost's shared_ptr) which let you pass them around without worrying about ownership.
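A minimal sketch of that idea, with assumptions stated up front: it uses std::shared_ptr from C++11 in place of boost's shared_ptr, and WriteBackCache, setScore, and the dirty flag are hypothetical names. The writer passed to flush stands in for whatever code issues the real UPDATE.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <queue>
#include <string>

struct UserData {
    std::string username;
    int score;   // stand-in for a volatile property
    bool dirty;  // true while the record is waiting in the write queue
};

typedef std::shared_ptr<UserData> UserPtr;

class WriteBackCache {
public:
    // Return the cached record for this user, creating it on first login.
    UserPtr login(const std::string& name) {
        UserPtr& slot = users_[name];
        if (!slot) slot = std::make_shared<UserData>(UserData{name, 0, false});
        return slot;
    }

    // Change a property and queue the record once for write-back.
    void setScore(const std::string& name, int score) {
        std::map<std::string, UserPtr>::iterator it = users_.find(name);
        if (it == users_.end()) return;
        UserPtr user = it->second;
        user->score = score;
        if (!user->dirty) {
            user->dirty = true;
            pending_.push(user);
        }
    }

    // The queue still shares ownership, so a pending write survives logout.
    void logout(const std::string& name) { users_.erase(name); }

    // Hand every queued record to the caller-supplied writer, then clear.
    template <typename Writer>
    std::size_t flush(Writer write) {
        std::size_t written = 0;
        while (!pending_.empty()) {
            UserPtr user = pending_.front();
            pending_.pop();
            assert(user != nullptr);
            write(*user);
            user->dirty = false;
            ++written;
        }
        return written;
    }

private:
    std::map<std::string, UserPtr> users_;  // keyed by username
    std::queue<UserPtr> pending_;           // records changed since last flush
};
```

Because the queue holds shared_ptr copies, a user who logs out before the next flush is still written once and then freed when the last reference goes away.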

Related

How to search the value from a std::map when I use cuda?

I have something stored in std::map, which maps string to vector. Its keys and values looks like
key value
"a"-----[1,2,3]
"b"-----[8,100]
"cde"----[7,10]
For each thread, it needs to process one query. The query looks like
["a", "b"]
or
["cde", "a"]
So I need to get the value from the map and then do some other jobs like combine them. So as for the first query, the result will be
[1,2,3,8,100]
The problem is, how can threads access the map and find the value by a key?
At first, I tried to store it in global memory. However, it looks like only arrays can be passed from host to device.
Then I tried to use thrust, but I can only use it to store vector.
Is there any other way I can use? Or maybe I ignored some methods in thrust? Thanks!
P.S. I do not need to modify the map, I just need to read data from it.
I believe it's unlikely you will benefit from doing any of this on the GPU, unless you have a huge number of queries which are all available to you at once, or at least in batches.
If you do not have that many queries, then just transferring the data (regardless of its exact format/structure) will likely be a waste.
If you do have that many queries, the benefit is still entirely unclear, and depends on a lot of parameters. The fact that you've been trying to use std::map for anything suggests (see below for the reason) that you haven't been seriously concerned with performance so far. If that's indeed the case, just don't make your life difficult by using a GPU.
So what's wrong with std::map? Nothing, but it's extremely slow even on the CPU, and this is even worse on the GPU.
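For reference, here is roughly what staying on the CPU looks like for the queries in the question. This is only a sketch: Table and runQuery are made-up names, and it keeps the std::map from the question rather than a faster container.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::vector<int> > Table;

// Concatenate the values of every key in the query, in query order.
// Keys that are not in the table are skipped.
std::vector<int> runQuery(const Table& table,
                          const std::vector<std::string>& keys) {
    std::vector<int> out;
    for (const std::string& key : keys) {
        Table::const_iterator it = table.find(key);
        if (it != table.end()) {
            out.insert(out.end(), it->second.begin(), it->second.end());
        }
    }
    return out;
}

// Usage sketch with the data from the question.
void example() {
    Table table;
    table["a"] = {1, 2, 3};
    table["b"] = {8, 100};
    table["cde"] = {7, 10};
    std::vector<std::string> query = {"a", "b"};
    std::vector<int> expected = {1, 2, 3, 8, 100};
    assert(runQuery(table, query) == expected);
}
```

Since the map is only read, several CPU threads could call runQuery on the same table at once without locking.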

Storing named data, where the 'name' is larger than the 'data'?

I'm writing the logic portion of a game, and want to create, retrieve, and store values (integers) to keep track of progress. For instance, a door would create the pair ("location.room.doorlock", 0) in an std::map, and unlocking this door would set that value to 1. Anytime the player wants to go through this door, it would retrieve the value by that keyname to see if it's passable. (Just an example, but it's important that this information exist outside of the "door" object itself, as characters or other events might retrieve this data and act on it.)
The problem though is that the name (or map key) itself is far larger than the data it's referring to, which seems wasteful, and feels 'wrong' as a result.
Is there a commonly used or best approach for storing this type of data, one where the key isn't so much larger than the data itself?
It is possible to know how much space to allocate at compile time for the progress data itself, if it's important. It need not use std::map either, so long as I don't have to use raw array indices to get or store data.
It seems like you have two options, if you really want to diminish the size of the string (although the string length does not seem to be that bad at all).
You can either change your naming conventions or implement hashing. Hashing can be done with a hashmap (also known as an unordered map) or by hand (write a small function that hashes your names to an int, then use that int as the key). Hashmaps/unordered maps are probably your best bet, as there is a lot of support code out there for them and you don't run the risk of having to deal with bugs in your own code.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
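A small sketch of the unordered_map route, assuming C++11. ProgressStore and its get/set members are hypothetical names, not an existing API.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Named integer flags, looked up by hashing the name.
class ProgressStore {
public:
    void set(const std::string& key, int value) { values_[key] = value; }

    // Returns the stored value, or fallback if the key was never set.
    int get(const std::string& key, int fallback = 0) const {
        auto it = values_.find(key);
        return it == values_.end() ? fallback : it->second;
    }

private:
    std::unordered_map<std::string, int> values_;
};

// Usage sketch based on the door example from the question.
void example() {
    ProgressStore progress;
    progress.set("location.room.doorlock", 0);  // door created, locked
    progress.set("location.room.doorlock", 1);  // door unlocked
    assert(progress.get("location.room.doorlock") == 1);
    assert(progress.get("location.room.window", -1) == -1);  // never set
}
```

Note that the container still stores each key string once, so this changes lookup cost rather than memory use. Shrinking the keys themselves would take the hash-to-int approach.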

Iterating through a list made up of a custom Class. How do I do it? C++

I am working on an assignment for my Operating Systems class. I am to simulate how a scheduler works with Processes. I have a Process class which holds all the information about the processes. I also have a class called scheduler which holds two Process Lists, interactive and Real-Time.
Using a test text file, I am able to read through the file and place Processes into two lists. One for Interactive Processes and one for Real-time processes.
My issue is this. My professor did not let us know if he will put the processes in order of FCFS, as he said that they must be executed in that order. So, what I must now do is iterate through the lists and sort the Processes based on their arrival times. How do I iterate through the list?
I've tried using
list<Process>::iterator it;
for (it=super.interactive.begin() ; it !=super.interactive.end(); it++)
Where super is the name of the Scheduler object I'm using and interactive is the Interactive Process List.
But the issue with this is that since it's a list made out of Processes, I can't access the int starttime that tells me when the processes start because I don't know how to access individual Processes in these lists.
Any help would be much appreciated or suggestions on any other container I could use for this task would be greatly appreciated.
I first had it set up to use queues, but when it came time to iterate through them, I was told I couldn't. That is why I've switched to lists, but I'm not too familiar with those.
My only other idea is to use plain dynamic arrays, but it would be nice to be able to use lists because of the push_back() function. I wouldn't have to worry about increasing the array's capacity, since with a list or a queue you can just add to the back.
One quality of iterators is that they act like pointers to the data you are iterating over, so if you want starttime (and it is public), you can do it->starttime inside of your loop.
But first, you probably don't want std::list. Use std::vector instead, which behaves like a "dynamic array" but handles all of the memory allocation internally. Random access is going to be helpful for keeping it sorted: std::sort needs random-access iterators, which std::vector provides and std::list does not (std::list has its own sort() member instead).
Next, you need a way to sort. Luckily, the standard library has std::sort. You will need to either overload operator< for Process or pass a comparison function as the third argument.
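Put together, that might look like the sketch below. The Process struct here is a stand-in with just a pid and the starttime field from the question, and the function names are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for the asker's Process class; only starttime matters here.
struct Process {
    int pid;
    int starttime;
};

// Comparison function for std::sort: earlier arrivals come first.
bool arrivesBefore(const Process& a, const Process& b) {
    return a.starttime < b.starttime;
}

void sortByArrival(std::vector<Process>& procs) {
    std::sort(procs.begin(), procs.end(), arrivesBefore);
    assert(std::is_sorted(procs.begin(), procs.end(), arrivesBefore));
}

// Walk the container with an iterator; it->starttime reads the member of
// the current Process. Assumes start times are non-negative.
int latestStart(const std::vector<Process>& procs) {
    int latest = 0;
    for (std::vector<Process>::const_iterator it = procs.begin();
         it != procs.end(); ++it) {
        if (it->starttime > latest) latest = it->starttime;
    }
    return latest;
}
```

std::sort does not promise to keep the input order of processes that share a start time. If that order matters for FCFS, std::stable_sort takes the same arguments and does keep it.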
"But the issue with this is that since it's a list made out of Processes, I can't access the int starttime that tells me when the processes start because I don't know how to access individual Processes in these lists."
For that, you can do (*it).something(), or the better-looking it->something().
The only reason I asked is because I needed to sort a bunch of my Process classes. Turns out, the rest of my class is assuming the professor will format the input text as first come first serve so I needn't worry about it. Thanks for the help you two. It did help figuring out other bits of the assignment :D

Selecting an appropriate STL container for logging data

I need a logging and filtering mechanism in my client-server application, where a client may request log data based on certain parameters.
Each log entry has MAC ID, date and time, command type, and direction as fields.
The server can filter the log data on these parameters as well.
The log is capped at 10 MB; after that, new messages overwrite the log from the beginning.
My approach is to log data to a file and also to an STL container in memory, so that when a client requests data the server can filter the log on any criterion.
So the process is: the server first sorts the vector<> on the requested criterion and then filters it using binary search.
I am planning to use a vector as the STL container for the in-memory log data.
I am a bit unsure whether a vector is appropriate in this situation,
since the data in the vector can grow to as much as 10 MB.
My question is whether a vector is good enough for this case.
I'd go with a deque, double ended queue. It's like a vector but you can add/remove elements from both ends.
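A quick sketch of how a deque fits the "overwrite from the beginning" requirement. It caps the log by entry count rather than by bytes, which is a simplification, and BoundedLog and the LogEntry layout are assumed names based on the fields in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical entry layout based on the fields listed in the question.
struct LogEntry {
    std::string macId;
    std::string timestamp;
    std::string command;
    std::string direction;
};

class BoundedLog {
public:
    explicit BoundedLog(std::size_t maxEntries) : maxEntries_(maxEntries) {}

    // Append at the back; once full, drop the oldest entry from the front.
    void append(const LogEntry& entry) {
        if (maxEntries_ == 0) return;
        if (entries_.size() == maxEntries_) entries_.pop_front();
        entries_.push_back(entry);
        assert(entries_.size() <= maxEntries_);
    }

    const std::deque<LogEntry>& entries() const { return entries_; }

private:
    std::size_t maxEntries_;
    std::deque<LogEntry> entries_;
};
```

Both pop_front and push_back are constant time on a deque, whereas removing the first element of a vector shifts everything after it.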
I would first state that I would use a logging library, since there are many and I assure you they will do a better job (log4cxx, for example). If you insist on doing this yourself, a vector is an appropriate mechanism, but you will have to manually sort the data based upon user requests. One other idea is to use SQLite and let it manage storing, sorting, and filtering your data.
The actual response will depend a lot on the usage pattern and interface. If you are using a graphical UI, then chances are that there is already a widget that implements that feature to some extent (ability to sort by different columns, and even filter). If you really want to implement this outside of the UI, then it will depend on the usage pattern, will the user want a particular view more than others? does she need only filtering, or also sorting?
If there is one view of the data that will be used in most cases, and you only need to show a different order a few times, I would keep a std::vector or std::deque of the elements, and filter with remove_copy_if when needed. If a different sort is required, I would copy and sort the copy, to avoid having to re-sort back to time order to continue adding elements to the log. Beware that if the application keeps pushing data, you will need to update the copy with the new elements in place (or provide a fixed view and rerun the operation periodically).
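The filter-and-copy approach could be sketched as follows, assuming C++11 lambdas. LogEntry, filterByMac, and sortedByCommand are illustrative names only.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical entry layout based on the fields listed in the question.
struct LogEntry {
    std::string macId;
    std::string timestamp;
    std::string command;
    std::string direction;
};

// Copy out only the entries for one MAC ID. remove_copy_if copies every
// element for which the predicate is false, so the predicate is negated.
std::vector<LogEntry> filterByMac(const std::vector<LogEntry>& log,
                                  const std::string& macId) {
    std::vector<LogEntry> out;
    std::remove_copy_if(log.begin(), log.end(), std::back_inserter(out),
                        [&macId](const LogEntry& e) { return e.macId != macId; });
    return out;
}

// Takes the log by value, so the sort runs on a copy and the original
// stays in time order for further appends.
std::vector<LogEntry> sortedByCommand(std::vector<LogEntry> copy) {
    std::stable_sort(copy.begin(), copy.end(),
                     [](const LogEntry& a, const LogEntry& b) {
                         return a.command < b.command;
                     });
    return copy;
}

// Usage sketch with made-up entries.
void example() {
    std::vector<LogEntry> log;
    log.push_back(LogEntry{"AA", "t1", "write", "in"});
    log.push_back(LogEntry{"BB", "t2", "read", "out"});
    log.push_back(LogEntry{"AA", "t3", "read", "in"});
    assert(filterByMac(log, "AA").size() == 2);
    assert(sortedByCommand(log).front().command == "read");
    assert(log.front().timestamp == "t1");  // original order untouched
}
```

stable_sort is used so that entries with the same command keep their time order inside the sorted copy.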
If there is no particular view that occurs much more often than the rest, or if you don't want to go through the pain of implementing the above, take a look at Boost multi-index containers. They keep synchronized views of the same data with different criteria. That will probably be the most efficient in this last case, and even if it might be less efficient in the general case of a dominating view, it might make things simpler, so it could still be worth it.

what type of data structure would be efficient for searching a process table

I have to search a process table which is populated with the names of processes running on a given set of IP addresses.
Currently I am using a multimap in C++ with the process name as the key and the IP address as the value.
Is there any other efficient data structure which can do the same task?
Also, can I gain any sort of parallelism by using pthreads? If so, can anyone point me in the right direction?
You do not need parallelism to access a data structure in RAM of several thousand entries. You can just put a lock around it (making sure only one process/thread accesses it at a time), and check that access is fast enough. Multimap is okay. A hashmap would be better though.
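A minimal sketch of "just lock over it", assuming C++11's std::mutex instead of raw pthreads. ProcessTable and its member names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

class ProcessTable {
public:
    void add(const std::string& process, const std::string& ip) {
        std::lock_guard<std::mutex> lock(mutex_);
        table_.insert(std::make_pair(process, ip));
    }

    // Copy out all IPs for a process name while holding the lock, so the
    // caller never touches the shared container directly.
    std::vector<std::string> hostsRunning(const std::string& process) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto range = table_.equal_range(process);
        std::vector<std::string> ips;
        for (auto it = range.first; it != range.second; ++it) {
            ips.push_back(it->second);
        }
        return ips;
    }

private:
    mutable std::mutex mutex_;  // mutable so const lookups can lock it
    std::multimap<std::string, std::string> table_;
};

// Usage sketch with made-up process names and addresses.
void example() {
    ProcessTable table;
    table.add("sshd", "10.0.0.1");
    table.add("nginx", "10.0.0.2");
    table.add("sshd", "10.0.0.3");
    assert(table.hostsRunning("sshd").size() == 2);
    assert(table.hostsRunning("cron").empty());
}
```

With only a few thousand entries, each lookup holds the lock very briefly, so contention is unlikely to be the bottleneck.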
What is a typical query to your table?
Try a hashmap; it can be faster for big tables.
How do you store names and IPs? UTF, string, char*? IP as uint32 or string?
For a read-only structure with a lot of read queries you can benefit from several threads.
upd: use std::tr1::unordered_multimap from #include <tr1/unordered_map> (or std::unordered_multimap from <unordered_map> with C++11)
Depending on the size of the table, you may find a hash table more efficient than the multimap container (which is implemented with a balanced binary tree).
The hash_multimap container (a non-standard extension shipped with some STL implementations) implements a hash table, and could be of use to you.
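With a C++11 compiler the standard equivalent is std::unordered_multimap. A rough sketch of the same lookup on it follows; ProcessIndex and hostsRunning are made-up names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Process name -> IP address; one name can map to many addresses.
typedef std::unordered_multimap<std::string, std::string> ProcessIndex;

std::vector<std::string> hostsRunning(const ProcessIndex& index,
                                      const std::string& process) {
    // equal_range returns the span of entries whose key equals process.
    auto range = index.equal_range(process);
    std::vector<std::string> ips;
    for (auto it = range.first; it != range.second; ++it) {
        ips.push_back(it->second);
    }
    // The order of equal keys is unspecified, so sort for stable output.
    std::sort(ips.begin(), ips.end());
    return ips;
}

// Usage sketch with made-up process names and addresses.
void example() {
    ProcessIndex index;
    index.emplace("sshd", "10.0.0.3");
    index.emplace("nginx", "10.0.0.2");
    index.emplace("sshd", "10.0.0.1");
    assert(hostsRunning(index, "sshd").size() == 2);
    assert(hostsRunning(index, "cron").empty());
}
```

Lookup is average constant time per key instead of logarithmic, at the cost of losing the sorted key order a multimap gives you.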