Analysing CDR using OpenCL parallel programming - MapReduce

I am analysing a large data set of Call Detail Records (CDRs). I get the details from a MySQL database, extract the user and call-duration fields, and aggregate the total call duration for every user.
I allocated an output buffer of size 10,000,000 to store the results and invoked the kernel with all the data retrieved from MySQL. Within the kernel I use atomic addition to sum up the durations for each user.
In the kernel code:
atomic_add(&outputbuffer[userid], duration);
It works perfectly.
But I am concerned about the large output buffer allocation. Even to get the results for a data set of 100,000 records, we have to scan the whole output buffer.
Can't we do a "hash map" kind of thing in the kernel? How can we apply "MapReduce" to this kind of problem?
Whenever I tried such methods, I couldn't avoid collisions between the parallel work-items.
I went through many tutorials and questions on this site related to my problem, but unfortunately I couldn't find any helpful guidance.
If anyone can suggest an idea to solve this problem, it would be helpful.
Thanks in advance.

Whether or not the amount of global memory you allocate will be a problem depends on the number of distinct users (account numbers, phone numbers ... depending on how you want to summarize the call durations) that your list of CDRs contains, and on the amount of global memory that your GPU provides.
For example, on a GPU card with 1 GB of memory you might be able to allocate roughly 250 million 32-bit counters - probably a bit less. This may or may not be enough to generate duration summaries for your complete batch of call records; if it is not, you will have to split your batch into smaller batches. The available amount of global memory can be queried by requesting the CL_DEVICE_GLOBAL_MEM_SIZE property with the clGetDeviceInfo OpenCL API call.
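For reference, a minimal host-side sketch of that query (assuming a single GPU device and omitting error handling):

    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_ulong globalMemSize = 0;
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(globalMemSize), &globalMemSize, NULL);

        /* Each duration counter is a 32-bit integer, so this is an upper
           bound on how many user slots a single buffer could hold. */
        std::printf("global memory: %llu bytes (at most %llu 32-bit counters)\n",
                    (unsigned long long)globalMemSize,
                    (unsigned long long)(globalMemSize / sizeof(cl_uint)));
        return 0;
    }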
The problem with using a more complex data structure such as a hash map from the kernel is that pointers created on the host side are meaningless on the device side (OpenCL 2.0 will offer a solution here). So whatever hash map implementation you envision should only use integer offsets from its base rather than pointers. There is still the synchronization issue of simultaneous updates to the same value, but for that you can use atomic_add, as you do in your original implementation.
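To make the offset-based idea concrete, here is one common approach sketched as an open-addressing hash table inside the kernel; the EMPTY sentinel, the table size and the linear probing are illustrative choices, and keys[] must be pre-initialised to EMPTY (which must not be a valid user id) with tableSize comfortably larger than the number of distinct users:

    #define EMPTY 0xFFFFFFFFu

    __kernel void aggregate(__global const uint *userId,
                            __global const uint *duration,
                            __global uint *keys,   /* pre-filled with EMPTY */
                            __global uint *sums,   /* pre-filled with 0 */
                            const uint tableSize)
    {
        uint gid  = get_global_id(0);
        uint key  = userId[gid];
        uint slot = key % tableSize;

        for (;;) {
            /* Claim the slot if it is empty; returns the previous value. */
            uint prev = atomic_cmpxchg(&keys[slot], EMPTY, key);
            if (prev == EMPTY || prev == key) {
                atomic_add(&sums[slot], duration[gid]);
                break;
            }
            slot = (slot + 1) % tableSize;  /* linear probing on collision */
        }
    }

Afterwards the host only has to read back tableSize slots instead of one slot per possible user id, at the cost of a second lookup array for the keys.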
Depending on the characteristics of your batch of call detail records, you may want to retrieve them sorted by account, source phone number or some other key, and if there are enough CDRs per key (as would be the case for a commercial customer like a large bank) you could apply a reduction technique such as the one discussed here to the list of CDRs for a given key.
Hopefully this will give you some ideas.

Related

RocksDB compaction: how to reduce data size and use more than 1 CPU core?

I'm trying to use RocksDB to store billions of records, so the resulting databases are fairly large - hundreds of gigabytes, several terabytes in some cases. The data is initially imported from a different service snapshot and updated from Kafka afterwards, but that's beside the point.
There are two parts of the problem:
Part 1) The initial data import takes hours with autocompactions disabled (it takes days if I enable them). After that I reopen the database with autocompactions enabled, but they aren't triggered automatically when the DB is opened, so I have to trigger compaction manually with CompactRange(Range{nil, nil}) in Go.
Manual compaction takes a similar amount of time, with only one CPU core busy. During compaction the overall size of the DB grows 2x-3x, but it then ends up at around 0.5x of the original size.
Question 1: Is there a way to avoid 2x-3x data size growth during compaction? It becomes a problem when the data size reaches terabytes. I use the default Level Compaction, which according to the docs "optimizes disk footprint vs. logical database size (space amplification) by minimizing the files involved in each compaction step".
Question 2: Is it possible to engage more CPU cores for manual compaction? It looks like only one is used at the moment (even though MaxBackgroundCompactions = 32). It would speed up the process a lot, as there are no writes during the initial manual compaction; I just need to prepare the DB without waiting for days.
Would it work with several routines working on different sets of keys instead of just one routine working on all keys? If yes, what's the best way to divide the keys into these sets?
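(For reference, an illustrative C++ sketch of the same full-range manual compaction and the compaction-related options mentioned above; the path and values are placeholders, and whether max_subcompactions parallelizes a full-range manual compaction depends on the RocksDB version:)

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
        rocksdb::Options options;
        options.create_if_missing = false;
        options.max_background_compactions = 32; // mirrors MaxBackgroundCompactions above
        options.max_subcompactions = 8;          // lets a single compaction job be split
                                                 // across threads (version-dependent for
                                                 // manual full-range compactions)

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/path/to/db", &db);
        if (!s.ok()) return 1;

        // Full-range manual compaction, i.e. CompactRange(Range{nil, nil}) in Go.
        rocksdb::CompactRangeOptions cro;
        s = db->CompactRange(cro, nullptr, nullptr);

        delete db;
        return s.ok() ? 0 : 1;
    }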
Part 2) Even after this manual compaction, RocksDB seems to perform autocompaction later, once I start adding/updating data, and after it finishes the DB size gets even smaller - around 0.4x compared to the size before the manual compaction.
Question 3: What's the difference between manual compaction and autocompaction, and why does autocompaction seem to be more effective in terms of resulting data size?
My project is in Go, but I'm more or less familiar with RocksDB C++ code and I couldn't find any answers to these questions in the docs or in the source code.

Fast and frequent file access while executing C++ code

I am looking for suggestions on how best to implement my code for the following requirements. During execution of my C++ code, I frequently need to access data stored in a dictionary, which itself is stored in a text file. The dictionary contains 100 million entries, and at any point in time my code may query the data corresponding to any particular entry among those 100 million. The queries follow no particular pattern, and over the lifetime of the program not all entries are queried. Also, the dictionary remains unchanged during the program's lifetime. The data corresponding to each entry is not all of the same length. The file size of my dictionary is ~24 GB, and I have only 16 GB of RAM. I need my application to be very fast, so I would like to know how best to implement such a system so that read access times are minimized.
I am also the one creating the dictionary, so I have the flexibility to break it down into several smaller volumes. While thinking about what I can do, I came up with the following, but I am not sure whether either is good.
If I store the line offset for each entry in my dictionary from the beginning of the file, then to read the data for the corresponding entry I can jump directly to the corresponding offset. Is there a way to do this using, say, ifstream without looping through all lines until the offset line? A quick search on the web seems to suggest this is not possible, at least with ifstream; are there other ways this can be done?
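(For illustration: if the index stored byte offsets, e.g. recorded with tellg() while the file is written, rather than line numbers, std::ifstream can jump directly with seekg(). A minimal sketch, with the file name and offset as placeholders:)

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream dict("dictionary.txt", std::ios::binary); // hypothetical file name

        std::streamoff offset = 123456; // byte offset of the wanted entry, taken from a prebuilt index
        dict.seekg(offset);             // jump straight to the entry, no line-by-line scan

        std::string entry;
        std::getline(dict, entry);      // read that one record
        std::cout << entry << '\n';
        return 0;
    }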
The other extreme thought was to create a single file for each entry in the dictionary, so I would have 100 million files. This approach has the obvious drawback of the overhead of opening and closing file streams.
In general I am not convinced that either of the approaches I have in mind is good, so I would like some suggestions.
Well, if you only need key-value access and the data is larger than what can fit in memory, the answer is a NoSQL database - that means a hash-type index for the key and arbitrary values. If you have no other constraints, like concurrent access from many clients or extended scalability, you can roll your own. The most important question for a custom NoSQL database is the expected number of keys, because that determines the size of the index file. You can find rather good hashing algorithms around, and you will have to make a trade-off between a larger index file and a higher risk of collisions. In any case, unless you want to use a terabyte-sized index file, your code must be prepared for collisions.
A detailed explanation with examples is far beyond what I can write in an SO answer, but it should give you a starting point.
The next optimization is deciding what should be cached in memory. That depends on how you expect the queries to arrive. If the same key is unlikely to be queried more than once, you can probably just rely on the OS and filesystem cache - a slight improvement would be memory-mapped files; otherwise caching (of the index and/or the values) makes sense. Here again you can choose and implement a caching algorithm.
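As an illustration of the memory-mapped option (POSIX mmap; assumes the byte offset and length of a value are already known from an index, and the file name is a placeholder):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <string>

    int main() {
        int fd = open("dictionary.dat", O_RDONLY);   // hypothetical data file
        if (fd < 0) return 1;

        struct stat st;
        fstat(fd, &st);

        // Map the whole file read-only; pages are faulted in on demand,
        // so only the parts actually touched consume physical memory.
        void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapped == MAP_FAILED) { close(fd); return 1; }
        const char* base = static_cast<const char*>(mapped);

        size_t offset = 123456, length = 64;         // would come from the index
        std::string value(base + offset, length);
        std::printf("%s\n", value.c_str());

        munmap(mapped, st.st_size);
        close(fd);
        return 0;
    }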
Or, if you think this is too complex for too little gain, you can check whether one of the free NoSQL databases meets your requirements...
Once you decide on an on-disk data structure, it becomes less of a C++ question and more of a system-design question. You want to implement a disk-based dictionary.
The factors you should consider from now on are: What are your disk parameters - is it an SSD or an HDD? What is your average lookup rate per second? Are you fine with 20 µs - 10 ms latencies for your Lookup() method?
On-disk dictionaries require random disk seeks. Such seeks have a latency of dozens of microseconds on an SSD and 3-10 ms on an HDD. Also, there is a limit on how many such seeks you can make per second; you can read this article, for example. The CPU stops being the bottleneck and IO becomes important.
If you want to pursue this direction, there are state-of-the-art C++ libraries that give you an on-disk key-value store (no need for an out-of-process database), or you can do something simple yourself.
If your application is a batch process rather than a server/UI program, i.e. you have another finite stream of items that you want to join with your dictionary, then I recommend reading about external algorithms like hash join or MapReduce. In these cases it is possible to organize your data so that instead of one huge 24 GB dictionary you have 10 dictionaries of 2.4 GB each, and you sequentially load and join each one of them. But for that, I need to understand what kind of problem you are trying to solve.
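A rough sketch of that partitioning step (the "key<TAB>value" line format, the hash function and the partition count of 10 are all illustrative assumptions):

    #include <array>
    #include <fstream>
    #include <functional>
    #include <string>

    int main() {
        const size_t kParts = 10;   // 24 GB / 10 ~= 2.4 GB per partition

        std::ifstream in("dictionary.txt");   // hypothetical "key<TAB>value" lines
        std::array<std::ofstream, kParts> out;
        for (size_t i = 0; i < kParts; ++i)
            out[i].open("dict_part_" + std::to_string(i) + ".txt");

        std::string line;
        while (std::getline(in, line)) {
            std::string key = line.substr(0, line.find('\t'));
            size_t part = std::hash<std::string>{}(key) % kParts;
            out[part] << line << '\n';        // the same key always lands in the same partition
        }
        // A batch of queries can be split the same way, so each partition is
        // loaded into memory only once and joined against its own queries.
        return 0;
    }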
To summarize, you need to design your system before coding the solution. Using mmap or tries or the other tricks mentioned in the comments are local optimizations (if that), and they are unlikely to be game-changers. I would not rush into exploring them before doing back-of-the-envelope computations to understand the main direction.

Quickly loading large data structures from a file

I have a large dictionary of English words (around 70k of them) that I load into memory at the beginning of the program. They are loaded into a radix trie data structure, and each trie node often has links to many other nodes (for example antonym links: "dead" -> "alive", "well"). Each node also has a std::vector<MetaData> in it, which contains various miscellaneous metadata for my program.
Now, the problem is the loading time of this file. Reading the file from disk, deserializing it and allocating the data structure in general takes a lot of time (4-5 seconds).
Currently I'm working on making the load asynchronous (or bit by bit, a fraction of the nodes per frame), but due to the nature of the application (it's a mobile keyboard) there are plenty of times when it simply has to load quickly.
What can be done to speed up loading? Memory-pool everything? I am benchmarking different parts to see what can be optimized, but so far it looks like it's just little things that add up.
If the trie is static (i.e. it doesn't change while the program is running), then build an optimized version in an array, using array indexes in place of pointers. You can then save that as your data file. Startup then amounts to just loading that block of data into memory.
Doing it that way makes some things less convenient (you'll have to use arrays rather than std::vector, for example), and you might have to do a bit of casting, but with a little thought you end up with a very compact and very fast data structure that doesn't suffer from the allocation overhead associated with creating an object for each node. Instead, it's essentially an array of varying-length structures.
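A sketch of what such a pointer-free node layout might look like (field names are illustrative, and it assumes the file is read back on the same platform it was written on, so endianness and struct padding are not a concern):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Plain-old-data node: links are indexes into the same array, not pointers,
    // so the block is position-independent and can be dumped/loaded as-is.
    struct TrieNode {
        char     letter;
        uint8_t  flags;          // e.g. end-of-word marker
        int32_t  firstChild;     // index of first child, -1 if none
        int32_t  nextSibling;    // index of next sibling, -1 if none
        int32_t  metaOffset;     // offset into a separate metadata blob, -1 if none
    };

    void save(const std::vector<TrieNode>& nodes, const char* path) {
        FILE* f = std::fopen(path, "wb");
        if (!f) return;
        uint64_t count = nodes.size();
        std::fwrite(&count, sizeof(count), 1, f);
        std::fwrite(nodes.data(), sizeof(TrieNode), nodes.size(), f);
        std::fclose(f);
    }

    std::vector<TrieNode> load(const char* path) {
        FILE* f = std::fopen(path, "rb");
        if (!f) return {};
        uint64_t count = 0;
        std::fread(&count, sizeof(count), 1, f);
        std::vector<TrieNode> nodes(count);
        std::fread(nodes.data(), sizeof(TrieNode), count, f); // one bulk read, no per-node allocation
        std::fclose(f);
        return nodes;
    }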
I did this for an application that used a directed acyclic word graph (DAWG). Rather than rebuild the DAWG every time the program was loaded (a time-consuming process), I had a utility program that created the DAWG and shipped that as the data file in place of the word list.
Not knowing the details, only a vague idea:
Loading the bulk data (the entries) will give you the basic dictionary.
For all the cross references like synonyms, antonyms and whatever else, load and process the data in the background, after you've shown "ready". Chances are that by the time the user has typed in the first query, you are ship-shape.
Later
If the file is rather big, reading a compressed version may be a win.
Also, a BufferedReader with a suitably increased buffer size may help.
You should review the structure of the data to make the data faster to load.
Also, splitting into multiple tables may speed things up.
For example, have one table for the words, another table for synonyms and additional tables for other relationships.
The first table should impose an ordering on the words, so each word gets an index. This allows the synonym table to be represented as pairs of word indexes, which should load fast.
You can then build your internal containers from the data that was loaded. The reason for having different data structures for stored data vs. internal data is optimization: the structures used for data storage (and loading) are optimized for loading, while the structures for internal data are optimized for searching.
Another idea, based on the fact that it is a mobile keyboard application:
Some words are used more often than others, so maybe you could organize things so that the frequently used words are loaded first, and leave the infrequently used ones to be loaded as they are needed (or as you have time).

Which is costlier - DB call or new call?

I am debugging a process core dump and I would like to do a design change.
The C++ process uses eSQL/C to connect to the informix database.
Presently, the application uses a query that fetches more than 200,000 (2 lacs) rows from the database. For each row, it allocates dynamic memory using new and processes the result. This results in out-of-memory errors at times, maybe because of inherent memory leaks.
I am thinking of an option by which I would query only 500 rows from the database at a time, allocate the dynamic memory and process them. Once that memory is de-allocated, I would load the next 500, and so on. This would increase the number of DB queries, even though the dynamic memory required at any one time is reduced.
So my question is whether this option is a scalable solution.
Will more DB calls make the application less scalable?
Depends on the query.
Your single call at the moment takes a certain amount of time to return all 200k rows. Let's say that time is proportional to the number of rows in the DB, call it n.
If it turns out that your new, smaller call still takes time proportional to the number of rows in the DB, then your overall operation will take time proportional to n^2 (because you have to make n / 500 calls at cost n each). This might not be scalable.
So you need to make sure you have the right indexes in place in the database (or, more likely, make sure that you divide the rows into groups of 500 according to the order of some field that is already indexed), so that the smaller calls take time roughly proportional to the number of rows returned rather than to the number of rows in the DB. Then it might be scalable.
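A sketch of that batching pattern (the database access is stubbed out with an in-memory vector so the loop is self-contained; in the real application the query shown in the comment would be an eSQL/C statement against an indexed column, and the exact Informix syntax for limiting the batch to 500 rows may differ):

    #include <cstdio>
    #include <vector>

    struct Row { long id; int payload; };   // made-up row layout

    // Stand-in for the database: in reality this would run something like
    //   SELECT id, payload FROM calls WHERE id > :last_id ORDER BY id   (first 500 rows)
    // so each call costs roughly O(rows returned), not O(rows in the table).
    std::vector<Row> fetch_batch(const std::vector<Row>& table, long last_id, size_t batch) {
        std::vector<Row> out;
        for (const Row& r : table) {
            if (r.id > last_id) out.push_back(r);
            if (out.size() == batch) break;
        }
        return out;
    }

    int main() {
        std::vector<Row> table;                        // fake data in place of the DB
        for (long i = 1; i <= 2000; ++i) table.push_back({i, int(i % 7)});

        long last_id = 0;
        for (;;) {
            std::vector<Row> batch = fetch_batch(table, last_id, 500);
            if (batch.empty()) break;
            long sum = 0;
            for (const Row& r : batch) sum += r.payload;   // process, then let the batch be freed
            std::printf("processed %zu rows, sum=%ld\n", batch.size(), sum);
            last_id = batch.back().id;                     // resume after the last key seen
        }
        return 0;
    }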
Anyway, if you do have memory leaks then they are bugs, they're not "inherent" and they should be removed!
DB calls surely cost more than dynamic memory allocation (though both are expensive). If you can't fix the memory leaks, you should try this solution and tune the number of rows fetched at a time for maximum efficiency.
In any case, memory leaks are a huge problem and your solution would only be a temporary fix. You should give smart pointers a try.
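A tiny illustration of the smart-pointer point: with std::unique_ptr the per-row object is released automatically at the end of each iteration, even on early returns or exceptions (Row and process_row are made-up stand-ins for the application's types):

    #include <cstdio>
    #include <memory>

    struct Row { int duration; };   // stand-in for the real per-row data

    void process_row(const Row& r) { std::printf("%d\n", r.duration); }

    int main() {
        for (int i = 0; i < 500; ++i) {
            // Replaces: Row* r = new Row{...}; ...; delete r;  (the delete that can be forgotten)
            std::unique_ptr<Row> r = std::make_unique<Row>(Row{i});
            process_row(*r);
        }   // r is destroyed here on every iteration -- no leak even if process_row throws
        return 0;
    }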
Holding all the records in memory while processing is not very scalable unless you are processing a small number of records. Given that the current solution already fails, paging will definitely result in better scalability. While the multiple round trips will add delay due to network latency, paging will allow you to work with a much larger number of records.
That said, you should definitely fix the memory leaks, because otherwise you will still end up with out-of-memory errors - it will simply take longer for the leaks to accumulate to the point where they occur.
Additionally, you should ensure that you do not keep any cursors open while paging, otherwise you may cause blocking problems for others. Create a SQL statement that returns only one page of data at a time.
Firstly, identify whether you have memory leaks and fix them if you do.
Memory leaks do not scale well.
Secondly, allocating dynamic memory is usually much faster than DB access - except when you are allocating a lot of memory and forcing the heap to grow.
If you are fetching a lot of rows (100k upwards) just to process them, first ask yourself why it is necessary to fetch all of them - can the SQL be modified to perform the processing based on criteria? If you clarify what the processing is, we can provide better advice about how to do this.
Fetching and processing large amounts of data needs proper thought to ensure that it scales well.

Performance of table access

We have an application which is written entirely in C. For table access inside the code, such as fetching some values from a table, we use Pro*C. To increase the performance of the application we also preload some tables for fetching the data. In general, we take some input fields and fetch the output fields from the table.
We usually have around 30,000 entries in the table, and at most it sometimes reaches 0.1 million (100,000).
But if the table entries increase to around 10 million, I think it would seriously affect the performance of the application.
Am I wrong somewhere? If it really affects the performance, is there any way to keep the performance of the application stable?
What is a possible workaround if the number of rows in the table increases to 10 million, considering the way the application works with tables?
If you are not sorting the table, you'll get a proportional increase in search time... assuming you don't code anything wrong, in your example (30K vs. 10 million rows) you'd get roughly 330x greater search times. I'm assuming you're iterating through the table incrementally (i++ style).
However, if it is somehow possible to sort the table, then you can greatly reduce search times. That is possible because an algorithm that searches sorted information does not have to scan every element until it reaches the sought one: it uses auxiliary structures (trees, hashes, etc.), usually much faster to search, and then it pinpoints the sought element, or at least gets a much closer estimate of where it is in the master table.
Of course, this comes at the expense of having to sort the table, either when you insert or remove elements, or when you perform a search.
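For example, keeping the preloaded table sorted by its key lets each lookup use binary search instead of a full scan (roughly two dozen comparisons for 10 million entries versus millions for a linear pass); a minimal sketch with made-up field names:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Entry { long key; int value; };   // illustrative row layout

    int main() {
        std::vector<Entry> table;
        for (long k = 0; k < 10000000; k += 2) table.push_back({k, int(k % 100)});
        // table must be kept sorted by key (sort once after loading, or insert in order)

        long wanted = 123456;
        auto it = std::lower_bound(table.begin(), table.end(), wanted,
                                   [](const Entry& e, long k) { return e.key < k; });
        if (it != table.end() && it->key == wanted)
            std::printf("found: %d\n", it->value);
        else
            std::printf("not found\n");
        return 0;
    }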
Maybe you can search for 'google hash' and take a look at their implementation, although it is in C++.
It might be that you get too many cache misses once the table grows beyond 1 MB, or whatever your cache size is.
If you iterate over the table multiple times, or access its elements randomly, you can also hit a lot of cache misses.
http://en.wikipedia.org/wiki/CPU_cache#Cache_Misses
Well, it really depends on what you are doing with the data. If you have to load the whole kit and caboodle into memory, then a reasonable approach would be to use a large bulk fetch size, so that the number of Oracle round trips stays small.
If you don't have the memory resources to load the whole result set into memory, a large bulk size will still help with the Oracle overhead: get a reasonably sized chunk of records into memory, process them, then get the next chunk.
Without more information about your actual runtime environment and business goals, that is about as specific as anyone can get.
Can you tell us more about the issue?