Performance of table access - C++

We have an application written entirely in C. For table access inside the code, such as fetching values from a table, we use Pro*C. To improve performance we also preload some tables into memory. In general we take some input fields and fetch the corresponding output fields from the table.
A table usually has around 30,000 entries, and at most it sometimes reaches 0.1 million (100,000).
But if the table grows to around 10 million entries, I suspect this will seriously hurt the application's performance.
Am I wrong about this? If it really does affect performance, is there any way to keep the application's performance stable?
What would be a workable approach if the number of rows in the table increases to 10 million, given the way the application works with tables?

If the table is unsorted you'll get a proportional increase in search time... assuming nothing else is coded wrong, in your example (30K vs. 10M rows) you'll see roughly 333x longer search times. I'm assuming you're scanning the table incrementally (i++ style).
However, if it's somehow possible to keep the table sorted, then you can greatly reduce search times. That is because a search over sorted data does not have to examine every element until it reaches the one sought: it can use auxiliary structures (trees, hashes, etc.), which are usually much faster to search, to pinpoint the sought element, or at least get a much closer estimate of where it sits in the master table.
Of course, that comes at the expense of keeping the table sorted, either as you insert and remove elements, or by sorting before you search.
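As an illustration (a minimal sketch with an invented row layout, not code from the question), keeping the table sorted lets std::lower_bound replace the linear scan:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical table row: look up by key, return the payload.
struct Row {
    std::uint64_t key;
    double value;
};

// O(n) linear scan: cost grows in proportion to the table size.
const Row* find_linear(const std::vector<Row>& table, std::uint64_t key) {
    for (const Row& r : table)
        if (r.key == key) return &r;
    return nullptr;
}

// O(log n) binary search: requires `table` to be sorted by key.
const Row* find_sorted(const std::vector<Row>& table, std::uint64_t key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
        [](const Row& r, std::uint64_t k) { return r.key < k; });
    return (it != table.end() && it->key == key) ? &*it : nullptr;
}
```

At 10 million rows, that is the difference between up to 10,000,000 comparisons per lookup and about 24.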

Maybe you can look up "google hash" and take a look at their implementation (the google-sparsehash library)? Although it is in C++.

It might be that you start getting too many cache misses once the table grows beyond 1 MB, or whatever your CPU cache size is.
If you iterate over the table multiple times, or access its elements randomly, you can also incur a lot of cache misses.
http://en.wikipedia.org/wiki/CPU_cache#Cache_Misses

Well, it really depends on what you are doing with the data. If you have to load the whole kit and caboodle into memory, then a reasonable approach is to use a large bulk size, so that the number of Oracle round trips that need to occur is small.
If you don't really have the memory resources to hold the whole result set, a large bulk size will still help with the Oracle overhead: get a reasonably sized chunk of records into memory, process it, then fetch the next chunk.
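In outline, the chunked loop looks like this (a minimal sketch; fetch_chunk is a hypothetical stand-in for a Pro*C array fetch using host arrays):

```cpp
#include <cstddef>
#include <vector>

struct Record {
    // columns of one fetched row would go here
};

// Hypothetical stand-in for a Pro*C array fetch: fills `out` with up to
// `bulk_size` rows and returns false once the cursor is exhausted.
bool fetch_chunk(std::vector<Record>& out, std::size_t bulk_size) {
    (void)out; (void)bulk_size;
    return false;  // replace with EXEC SQL FETCH into host arrays
}

void process(const Record& r) { (void)r; }

void process_all(std::size_t bulk_size = 1000) {
    std::vector<Record> chunk;
    chunk.reserve(bulk_size);
    // One database round trip per `bulk_size` rows instead of one per row.
    while (fetch_chunk(chunk, bulk_size)) {
        for (const Record& r : chunk)
            process(r);
        chunk.clear();  // memory for the next chunk is reused, not grown
    }
}
```

The right bulk size is a trade-off between round-trip count and memory footprint, which is why it is worth making configurable.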
Without more information about your actual runtime environment and business goals, that is about as specific as anyone can get.
Can you tell us more about the issue?

Related

Fast and frequent file access while executing C++ code

I am looking for suggestions on how best to implement my code for the following requirements. During execution of my C++ code, I frequently need to access data stored in a dictionary, which is itself stored in a text file. The dictionary contains 100 million entries, and at any point in time my code may query the data corresponding to any particular entry among those 100 million. There is no particular pattern to the queries, and not all entries are queried during the lifetime of the program. Also, the dictionary remains unchanged during the program's lifetime. The data for each entry varies in length. My dictionary's file size is ~24 GB, and I have only 16 GB of RAM. I need my application to be very fast, so I would like to know how best to implement such a system so that read access times are minimized.
I am also the one creating the dictionary, so I have the flexibility to break it down into several smaller volumes. While thinking about what I can do, I came up with the following two ideas, but I am not sure whether either is good.
If I store each entry's line offset from the beginning of the file, then to read an entry's data I can jump directly to the corresponding offset. Is there a way to do this with, say, ifstream, without looping through all the lines before the offset? A quick search on the web seems to suggest this is not possible, at least with ifstream; are there other ways it can be done?
The other extreme thought was to create one file per dictionary entry, so I would have 100 million files. This approach has the obvious drawback of the overhead of opening and closing file streams.
In general, I am not convinced that either of the approaches I have in mind is good, so I would like some suggestions.
Well, if you only need key-value access, and if the data is larger than what can fit in memory, the answer is a NoSQL-style database: a hash-type index for the key, and arbitrary values. If you have no other constraints, like concurrent access from many clients or extended scalability, you can roll your own. The most important question for a custom NoSQL store is the expected number of keys, because that determines the size of the index file. You can find rather good hashing algorithms around, and you will have to make a decision between a larger index file and a higher risk of collisions. Anyway, unless you want a terabyte-scale index file, your code must be prepared for possible collisions.
A detailed explanation with examples is far beyond what I can write in an SO answer, but it should give you a starting point.
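As a rough illustration of such a custom store (the slot layout, probing scheme, and sizing below are all invented; both files are assumed open in binary mode, and, as noted above, a real index should store and verify the actual key, since matching on the hash alone admits collisions):

```cpp
#include <cstdint>
#include <fstream>
#include <functional>
#include <optional>
#include <string>

// Fixed-size slot in the index file: key hash -> offset/length in the data file.
struct IndexSlot {
    std::uint64_t key_hash;  // 0 marks an empty slot
    std::uint64_t offset;    // byte offset of the value in the data file
    std::uint32_t length;    // value length in bytes
};

constexpr std::uint64_t kNumSlots = 1ull << 28;  // sized from the expected key count

std::optional<std::string> lookup(std::ifstream& index_file,
                                  std::ifstream& data_file,
                                  const std::string& key) {
    const std::uint64_t h = std::hash<std::string>{}(key);
    // Linear probing over the slot table.
    for (std::uint64_t i = 0; i < kNumSlots; ++i) {
        const std::uint64_t slot = (h + i) % kNumSlots;
        IndexSlot s;
        index_file.seekg(static_cast<std::streamoff>(slot * sizeof(IndexSlot)));
        index_file.read(reinterpret_cast<char*>(&s), sizeof(s));
        if (s.key_hash == 0) return std::nullopt;  // empty slot: key absent
        if (s.key_hash == h) {
            // One seek straight to the stored byte offset; no scanning of
            // preceding lines. A real store must also compare the full key
            // here, because two keys can share a hash.
            std::string value(s.length, '\0');
            data_file.seekg(static_cast<std::streamoff>(s.offset));
            data_file.read(value.data(), s.length);
            return value;
        }
    }
    return std::nullopt;
}
```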
The next optimization is deciding what should be cached in memory. That depends on how you expect the queries to arrive. If the same key is unlikely to be queried more than once, you can probably just rely on the OS and filesystem cache, and a slight improvement would be memory-mapped files; otherwise caching (of the index and/or the values) makes sense. Here again you can choose and implement a caching algorithm.
Or, if you think that is too complex for too little gain, you can check whether one of the free NoSQL databases meets your requirements...
Once you decide on an on-disk data structure, it becomes less a C++ question and more a system-design question. You want to implement a disk-based dictionary.
The factors you should consider from now on are: What are your disk's parameters? Is it an SSD or an HDD? What is your average lookup rate per second? Are you fine with latencies of 20 µs to 10 ms for your Lookup() method?
On-disk dictionaries require random disk seeks. Such seeks have a latency of dozens of microseconds on an SSD and 3-10 ms on an HDD. Also, there is a limit on how many such seeks you can make per second; you can read this article, for example. The CPU stops being the bottleneck and IO becomes the important factor.
If you want to pursue this direction, there are state-of-the-art C++ libraries that give you an on-disk key-value store (no need for an out-of-process database), or you can build something simple yourself.
If your application is a batch process rather than a server/UI program, i.e. you have another finite stream of items that you want to join against your dictionary, then I recommend reading about external algorithms like hash join or MapReduce. In those cases it is possible to organize your data so that, instead of one huge 24 GB dictionary, you have 10 dictionaries of 2.4 GB each, and you sequentially load and join each one. But for that I would need to understand what kind of problem you are trying to solve.
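A minimal sketch of that partitioning step, assuming line-oriented key<TAB>value records (the file naming and format are invented; the stream of items to be joined must be partitioned with the same hash):

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Split one big key<TAB>value file into kParts smaller dictionaries by key
// hash, so each part can later be loaded and joined on its own.
constexpr std::size_t kParts = 10;

void partition_dictionary(const std::string& input_path) {
    std::vector<std::ofstream> parts;
    for (std::size_t i = 0; i < kParts; ++i)
        parts.emplace_back(input_path + ".part" + std::to_string(i));

    std::ifstream in(input_path);
    std::string line;
    while (std::getline(in, line)) {
        const std::string key = line.substr(0, line.find('\t'));
        parts[std::hash<std::string>{}(key) % kParts] << line << '\n';
    }
}
```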
To summarize: design your system first, before coding the solution. mmap, tries, and the other tricks mentioned in the comments are local optimizations (if that); they are unlikely to be game-changers. I would not rush into exploring them before doing back-of-the-envelope computations to understand the main direction.

Why are Get and MultiGet significantly slower for large key sets compared to using an Iterator?

I'm currently playing around with RocksDB (C++) and was curious about some performance metrics I've experienced.
For testing purposes, my database keys are file paths and the values are filenames. My database has around 2M entries in it. I'm running RocksDB locally on a MacBook Pro 2016 (SSD).
My use case is dominated by reads. Full key scans are quite common, as are key scans that include a "significant" number of keys (50%+).
I'm curious about the following observations:
1. An Iterator is dramatically faster than calling Get when performing full key scans.
When I want to look at all of the keys in the database, I'm seeing a 4-8x performance improvement when using an Iterator instead of calling Get for each key. The use of MultiGet makes no difference.
In the case of calling Get roughly 2M times, the keys have been previously fetched into a vector and sorted lexicographically. Why is calling Get repeatedly so much slower than using an Iterator? Is there a way to narrow the performance gap between the two APIs?
2. When fetching around half the keys, the performance between using an Iterator and Get starts to become negligible.
As the number of keys to fetch is reduced, making multiple calls to Get starts to take about as long as using an Iterator, since the iterator pays the price of scanning over keys that aren't in the desired key set.
Is there some "magic" ratio where this becomes true for most databases? For example, if I need to scan over 25% of the keys, then calling Get is faster, but if it's 75% of the keys, then an Iterator is faster. But those numbers are just "made up" by rough testing.
3. Fetching keys in sorted order does not appear to improve performance.
If I pre-sort the keys I want to fetch into the same order that an Iterator would return them in, that does not appear to make calling Get multiple times any faster. Why is that? It's mentioned in the documentation that it's recommended to sort keys before doing a batch insert. Does Get not benefit from the same look-ahead caching that an Iterator benefits from?
4. What settings are recommended for a read-heavy use case?
Finally, are there any specific settings recommended for a read-heavy use case that might involve scanning a significant number of keys at once?
macOS 10.14.3, MacBook Pro 2016 SSD, RocksDB 5.18.3, Xcode 10.1
RocksDB internally represents its data as a log-structured merge-tree (LSM tree), which has several sorted levels by default (this can be changed with plugins/config). The intuition from Paul's answer holds, except there is no classical index; the data is actually sorted on disk, with pointers to the next files. A lookup has logarithmic complexity on average, but advancing an iterator within a sorted range is constant-time. So for dense sequential reads, iterating is much faster.
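For concreteness, here is a minimal sketch of the two access patterns using the RocksDB C++ API (error handling trimmed; this is not the question's benchmark code):

```cpp
#include <cassert>
#include <string>
#include <vector>
#include "rocksdb/db.h"

// Full scan with an iterator: each Next() advances within the sorted data,
// which is roughly constant-time.
void scan_with_iterator(rocksdb::DB* db) {
    rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        // use it->key() and it->value() here
    }
    assert(it->status().ok());
    delete it;
}

// The same key set via point lookups: every Get() restarts the LSM search
// (memtable, then each sorted level) from the top.
void scan_with_get(rocksdb::DB* db, const std::vector<std::string>& keys) {
    std::string value;
    for (const std::string& k : keys) {
        rocksdb::Status s = db->Get(rocksdb::ReadOptions(), k, &value);
        // s.IsNotFound() is possible; other statuses should be handled
    }
}
```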
The point where the costs balance out is determined not only by the number of keys you read, but also by the size of the database: as the database grows, lookups get slower, while Next() stays constant-time. Very recent inserts are likely to be read very quickly, since they may still be in memory (in the memtables).
Sorting the keys mainly improves your cache hit rate. Depending on your disk, the difference may be very small; e.g., on an NVMe SSD the difference in access time is just not as drastic any more as it was between RAM and HDD. If you have to do several operations over the same or even different key sets, doing them in key order (f(a-c) g(a-c) f(d-g)...) instead of interleaved should improve your performance, since you will get more cache hits and also benefit from the RocksDB block cache.
The tuning guide is a good starting point, especially the video on database solutions, but if RocksDB is too slow for you, also consider using a DB based on a different storage algorithm. LSM is typically better for write-heavy workloads, and while RocksDB lets you control read vs. write vs. space amplification very well, a B-tree or ISAM based solution may just be much faster for range reads / repeated reads.
I don't know anything about RocksDB per se, but I can answer a lot of this from first principles.
An Iterator is dramatically faster than calling Get when performing full key scans.
This is likely to be because Get has to do a full lookup in the underlying index (starting from the top) whereas advancing an iterator can be achieved by just moving from the current node to the next. Assuming the index is implemented as a red-black tree or similar, there's a lot less work in the second method than the first.
When fetching around half the keys, the performance between using an Iterator and Get starts to become negligible.
So you are skipping entries by calling iterator->Next() multiple times? If so, then there will come a point where it's cheaper to call Get for each key instead, yes. Exactly when that happens will depend on the number of entries in the index (since that determines the number of levels in the tree).
Fetching keys in sorted order does not appear to improve performance.
No, I would not expect it to. Get is (presumably) stateless.
What settings are recommended for a read-heavy use case?
That I don't know, sorry, but you might read:
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

Which is costlier - DB call or new call?

I am debugging a process core dump and would like to make a design change.
The C++ process uses eSQL/C to connect to an Informix database.
Presently, the application runs a query that fetches more than 2 lakh (200,000) rows from the database. For each row it allocates dynamic memory with new and processes the result. This results in out-of-memory errors at times, possibly because of memory leaks.
I am considering an option in which I query only 500 rows from the database at a time, allocate dynamic memory for them, and process them. Once that memory is de-allocated, I load the next 500, and so on. This reduces the dynamic memory required at any one time but increases the number of DB queries.
So my question is whether this option is a scalable solution.
Will more DB calls make the application less scalable?
Depends on the query.
Your single call at the moment takes a certain amount of time to return all 200k rows. Let's say that time is proportional to the number of rows in the DB, call it n.
If it turns out that your new, smaller call still takes time proportional to the number of rows in the DB, then your overall operation will take time proportional to n^2 (because you have to make n / 500 calls at cost n each). This might not be scalable.
So, you need to make sure you have the right indexes in place in the database (or more likely, make sure that you divide up the rows into groups of 500 according to the order of some field that is indexed already) so that the smaller calls take time roughly proportional to the number of rows returned, rather than the number of rows in the DB. Then it might be scalable.
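A sketch of that keyset-style paging (run_query is an invented helper; the real code would go through eSQL/C, and Informix would phrase the row limit as SELECT FIRST n):

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Row {
    long id;           // an already-indexed, strictly increasing key
    std::string data;
};

// Invented helper standing in for the real eSQL/C query execution.
std::vector<Row> run_query(const std::string& sql) {
    (void)sql;
    return {};
}

void process(const Row& r) { (void)r; }

void process_in_pages(std::size_t page_size = 500) {
    long last_id = -1;
    for (;;) {
        // The indexed WHERE clause keeps each page's cost proportional to
        // page_size, not to the table size, so n/500 pages cost about O(n).
        std::vector<Row> page = run_query(
            "SELECT id, data FROM t WHERE id > " + std::to_string(last_id) +
            " ORDER BY id LIMIT " + std::to_string(page_size));
        if (page.empty()) break;
        for (const Row& r : page) process(r);
        last_id = page.back().id;
    }
}
```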
Anyway, if you do have memory leaks then they are bugs, they're not "inherent" and they should be removed!
DB calls surely cost more than dynamic memory allocation (though both are expensive). If you can't fix the memory leaks, you should try this solution and tune the number of rows fetched per call for maximum efficiency.
In any case, memory leaks are a huge problem, and this solution would only be a temporary fix. You should give smart pointers a try.
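As a minimal illustration of the smart-pointer suggestion (types invented, not the original code): holding each row's allocation in a std::unique_ptr makes the leak structurally impossible.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct RowData {
    // fields built from one fetched row
};

void process_rows(std::size_t n) {
    std::vector<std::unique_ptr<RowData>> rows;
    for (std::size_t i = 0; i < n; ++i)
        rows.push_back(std::make_unique<RowData>());  // instead of raw new
    // ... process rows ...
}   // every RowData is freed here, even if an exception was thrown
```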
Holding all the records in memory while processing is not very scalable unless you are processing a small number of records. Given that the current solution already fails, paging will definitely result in better scalability. While multiple round trips will add delay due to network latency, paging will allow you to work with a much larger number of records.
That said, you should definitely fix the memory leaks, because otherwise you will still end up with out-of-memory exceptions; it will simply take longer for the leaks to accumulate to the point where the exception occurs.
Additionally, you should ensure that you do not keep any cursors open while paging, otherwise you may cause blocking problems for others. You should create a SQL statement that returns only one page of data at a time.
Firstly, identify whether you have memory leaks, and fix them if you do.
Memory leaks do not scale well.
Secondly, allocating dynamic memory is usually much faster than DB access, except when you are allocating a lot of memory and forcing the heap to grow.
If you are requesting a lot of rows (100k and upwards) for processing, first ask yourself why it is necessary to fetch all of them: can the SQL be modified to perform the processing based on criteria? If you clarify the processing, we can provide better advice on how to do this.
Fetching and processing large amounts of data needs proper thought to ensure it scales well.

Best of breed indexing data structures for Extremely Large time-series

I'd like to ask fellow SO'ers for their opinions on best-of-breed data structures for indexing time series (a.k.a. column-wise data, a.k.a. flat linear data).
Two basic types of time series exist, based on their sampling/discretisation characteristic:
Regular discretisation (every sample is taken at a common frequency)
Irregular discretisation (samples are taken at arbitrary time points)
Queries that will be required:
All values in the time range [t0,t1]
All values in the time range [t0,t1] that are greater/less than v0
All values in the time range [t0,t1] that are in the value range [v0,v1]
The data sets consist of summarised time series (which sort of gets around the irregular discretisation) and multivariate time series. The data sets in question are about 15-20 TB in size, hence processing is performed in a distributed manner, because some of the queries described above will produce result sets larger than the physical memory available on any one system.
Distributed processing in this context also means dispatching the required data-specific computation along with the time-series query, so that the computation can occur as close to the data as possible, reducing node-to-node communication (somewhat similar to the map/reduce paradigm). In short, proximity of computation and data is critical.
Another issue the index should be able to cope with is that the overwhelming majority of the data is static/historic (99.999...%), yet new data is added daily; think "in-the-field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCH estimates, etc.) with as low a latency as possible; some of these running calculations require historical data, more than can reasonably be cached.
I've already considered HDF5; it works well/efficiently for smaller datasets but starts to drag as the datasets become larger, and there are no native parallel-processing capabilities from the front end.
Looking for suggestions, links, further reading etc. (C or C++ solutions, libraries)
You would probably want to use some type of large, balanced tree. As Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about fast insertions and updates, there is a lot of new work being done at places like MIT and CMU on "cache-oblivious B-trees". For some discussion of the implementation of these things, look up Tokutek DB; they have a number of good presentations, like the following:
http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
Questions 2 and 3 are in general a lot harder, since they involve higher-dimensional range searching. The standard data structure for this is the range tree (which gives O(log^{d-1}(n)) query time at the cost of O(n log^d(n)) storage). You generally would not want to use a k-d tree for something like this: while it is true that k-d trees have optimal O(n) storage cost, you can't evaluate range queries on them any faster than O(n^{(d-1)/d}) using only O(n) storage. For d = 2, that is O(sqrt(n)) time complexity, and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?)
Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. Hypothetically, what you could do is just use a range query to pull some interval of points, then as a post-process go through and apply the v constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works, you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0,t1] x [-infty,+infty] is orders of magnitude larger than the number of points in [t0,t1] x [v0,v1].
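A sketch of that pull-then-filter approach (types invented; range_query is a stub standing in for whatever indexed store ends up being used):

```cpp
#include <vector>

struct Sample {
    double t;  // time coordinate
    double v;  // value
};

// Stub standing in for the real indexed range query over [t0, t1].
std::vector<Sample> range_query(double t0, double t1) {
    (void)t0; (void)t1;
    return {};
}

// Pull the whole time interval from the index, then apply the value
// constraints pointwise as a post-process.
std::vector<Sample> query(double t0, double t1, double v0, double v1) {
    std::vector<Sample> out;
    for (const Sample& s : range_query(t0, t1))
        if (s.v >= v0 && s.v <= v1)
            out.push_back(s);
    return out;
}
```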
General ideas:
Problem 1 is fairly common: create an index that fits into RAM, with links to the data on secondary storage (data structure: the B-tree family).
Problems 2 and 3 are quite complicated, since your data is so large. You could partition the data into time ranges and precompute the min/max for each range. Using that information you can filter out whole ranges (e.g., if the max value for a range is 50 and you search for v0 > 60, that interval is out). The rest has to be searched by going through the data. The effectiveness greatly depends on how quickly the data changes.
You can also build multiple index levels, combining the time ranges of lower levels, to filter faster.
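A minimal in-memory sketch of that min/max filtering idea (names invented; the real blocks would live alongside the data on secondary storage):

```cpp
#include <vector>

// Precomputed summary of one time-range partition.
struct Block {
    double t0, t1;        // time range covered by this block
    double min_v, max_v;  // value bounds within the block
    // ... handle to the block's data on secondary storage ...
};

// Return the blocks that could contain values > v0 within [t0, t1];
// only these need to be scanned in full.
std::vector<const Block*> candidates(const std::vector<Block>& index,
                                     double t0, double t1, double v0) {
    std::vector<const Block*> out;
    for (const Block& b : index) {
        if (b.t1 < t0 || b.t0 > t1) continue;  // outside the time range
        if (b.max_v <= v0) continue;           // cannot contain v > v0
        out.push_back(&b);
    }
    return out;
}
```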
It is going to be really time-consuming and complicated to implement this yourself. I recommend you use Cassandra.
Cassandra can give you horizontal scalability and redundancy, and will allow you to run complex MapReduce jobs in the future.
To learn how to store time series in Cassandra, take a look at:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
and http://www.youtube.com/watch?v=OzBJrQZjge0.

Correct data structure to use for (this specific) expiring cache?

I need to read from a dataset which is very large and highly interlinked; the data is fairly localized, and reads are fairly expensive. Specifically:
The data sets are 2 GB to 30 GB in size, so I have to map sections of the file into memory to read them. This is very expensive compared to the rest of the work the algorithm does. Profiling shows roughly 60% of the time is spent reading memory, so this is the right place to start optimizing.
When operating on a piece of this dataset, I have to follow links inside it (imagine something similar to a linked list), and while those reads aren't guaranteed to be anywhere near sequential, they are fairly localized. This means:
Let's say, for example, we operate on 2 megs of memory at a time. If you read 2 megs of data into memory, roughly 40% of the reads I will have to subsequently do will be in that same 2 megs of memory. Roughly 20% of the reads will be purely random access in the rest of the data, and the other 40% very likely links back into the 2meg segment which pointed to this one.
From knowledge of the problem and from profiling, I believe that introducing a cache to the program will help greatly. What I want to do is create a cache which holds N chunks of X megs of memory (N and X configurable so I can tune it) which I can check first, before having to map another section of memory. Additionally, the longer something has been in the cache, the less likely it is that we will request that memory in the short term, and so the oldest data will need to be expired.
After all that, my question is very simple: What data structure would be best to implement a cache of this nature?
I need to have very fast lookups to see if a given address is in the cache. With every "miss" of the cache, I'll want to expire the oldest member of it, and add a new member. However, I plan to try to tune it (by changing the amount that's cached) such that 70% or more of reads are hits.
My current thinking is that an AVL tree (O(log2 n) search/insert/delete) would be the safest choice (no degenerate cases). My other option is a sparse hash table, where lookups would be O(1) in the best case. In theory that could degenerate to O(n), but in practice I could keep collisions low. The concern there is how long it takes to find and remove the oldest entry in the hash table.
Does anyone have any thoughts or suggestions on what data structure would be best here, and why?
Put the cache into two sorted trees (AVL or any other reasonably balanced tree implementation is fine; you're better off using one from a library than writing your own).
One tree should sort by position in the file. This lets you do log(n) lookups to see if your cache is there.
The other tree should sort by time used (which can be represented by a number that increments by one on each use). When you use a cached block, you remove it, update the time, and insert it again. This will take log(n) also. When you miss, remove the smallest element of the tree, and add the new block as the largest. (Don't forget to also remove/add that block to the by-position-in-file tree.)
If your cache doesn't have very many items in it, you'll be better off still by just storing everything in a sorted array (using insertion sort to add new elements). Moving 16 items down one spot in an array is incredibly fast.
Seems like you are looking for an LRU (Least Recently Used) cache: LRU cache design
If 60% of your algorithm's time is I/O, I suggest that the exact cache design doesn't really matter that much; any sort of cache could be a substantial speed-up.
However, the design depends a lot on what data you use to address your chunks: a string, an int, etc. If the key is an int, you could use a hash map pointing into a linked list: on a cache miss, evict the back of the list; on a cache hit, unlink the entry and push it to the front.
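A minimal sketch of that design with std::unordered_map plus std::list (key type and chunk payload simplified for illustration; capacity is assumed to be at least 1):

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // O(1) average lookup; a hit moves the chunk to the front (most recent).
    std::vector<char>* get(std::uint64_t addr) {
        auto it = index_.find(addr);
        if (it == index_.end()) return nullptr;            // miss
        order_.splice(order_.begin(), order_, it->second); // refresh recency
        return &it->second->data;
    }

    // Insert a chunk; when full, evict the least recently used (list back).
    void put(std::uint64_t addr, std::vector<char> chunk) {
        if (std::vector<char>* hit = get(addr)) {
            *hit = std::move(chunk);
            return;
        }
        if (order_.size() == capacity_) {
            index_.erase(order_.back().addr);
            order_.pop_back();
        }
        order_.push_front(Entry{addr, std::move(chunk)});
        index_[addr] = order_.begin();
    }

private:
    struct Entry {
        std::uint64_t addr;       // address of the mapped section
        std::vector<char> data;   // the cached chunk
    };
    std::size_t capacity_;
    std::list<Entry> order_;      // front = most recently used
    std::unordered_map<std::uint64_t, std::list<Entry>::iterator> index_;
};
```

Both the hit path and the eviction are O(1), which addresses the question's concern about how long it takes to find and remove the oldest entry.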
Hash maps are provided under varying names (most commonly, unordered map) in many implementations. Boost has one, there's one in TR1, etc. A big advantage of a hash_map is less performance loss as the number of elements grows, and more flexibility in key types.