Which is costlier - a DB call or a new call? (C++)

I am debugging a process core dump and I would like to do a design change.
The C++ process uses eSQL/C to connect to the informix database.
Presently, the application uses a query that fetches more than 200,000 (2 lakh) rows from the database. For each row, it allocates dynamic memory with new and processes the result. This leads to out-of-memory errors at times, possibly because of underlying memory leaks.
I am considering an approach where I query only 500 rows from the database at a time, allocate the dynamic memory, and process them. Once that memory is de-allocated, I load the next 500, and so on. This would increase the number of DB queries, even though the dynamic memory required at any one time is reduced.
So my question is whether this option is a scalable solution.
Will more DB calls make the application less scalable?

Depends on the query.
Your single call at the moment takes a certain amount of time to return all 200k rows. Let's say that time is proportional to the number of rows in the DB, call it n.
If it turns out that your new, smaller call still takes time proportional to the number of rows in the DB, then your overall operation will take time proportional to n^2 (because you have to make n / 500 calls at cost n each). This might not be scalable.
So you need to make sure you have the right indexes in place in the database (or, more likely, make sure that you divide the rows into groups of 500 according to the order of some field that is already indexed), so that each smaller call takes time roughly proportional to the number of rows returned rather than the number of rows in the DB. Then it might be scalable.
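As a concrete illustration of "groups of 500 according to an indexed field", here is a minimal C++ sketch of the paging loop. The Row type, the table/column names, and the fetchPage() wrapper (which would hide the actual ESQL/C cursor code) are all hypothetical:

```cpp
// Hypothetical sketch of keyset-based paging: each batch is bounded by the last
// key already processed, so with an index on `id` the cost of a batch tracks the
// rows returned rather than the table size. The real query would live in the
// ESQL/C layer, e.g. something along the lines of
//   SELECT FIRST 500 id, payload FROM big_table WHERE id > ? ORDER BY id
#include <string>
#include <vector>

struct Row {
    long id;
    std::string payload;
};

// Assumed wrapper around the ESQL/C cursor code (not shown): returns up to
// 500 rows whose id is greater than lastId, in ascending id order.
std::vector<Row> fetchPage(long lastId);

void processRow(const Row& row);

void processAll() {
    long lastId = 0;                            // start below the smallest key
    for (;;) {
        std::vector<Row> page = fetchPage(lastId);
        if (page.empty())
            break;                              // no rows left
        for (const Row& row : page) {
            processRow(row);
            lastId = row.id;                    // remember where this page ended
        }
        // `page` is destroyed here, so memory use stays bounded by one batch.
    }
}
```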
Anyway, if you do have memory leaks then they are bugs, they're not "inherent" and they should be removed!

DB calls surely cost more than dynamic memory allocation (though both are expensive). If you can't fix the memory leaks, you should try this solution and tune the number of rows fetched per batch for maximum efficiency.
In any case, memory leaks are a huge problem and this solution will only be a temporary fix. You should give smart pointers a try.
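For example, a minimal sketch of what "give smart pointers a try" could look like for the per-row allocations (assuming a C++14 compiler; Record is a placeholder for whatever a row decodes into):

```cpp
// Minimal sketch of the smart-pointer suggestion: let RAII own the per-row
// allocations so nothing ever needs an explicit delete.
#include <cstddef>
#include <memory>
#include <vector>

struct Record {
    // ... fields decoded from one DB row ...
};

void processBatch(std::size_t rowCount) {
    std::vector<std::unique_ptr<Record>> records;
    records.reserve(rowCount);
    for (std::size_t i = 0; i < rowCount; ++i) {
        records.push_back(std::make_unique<Record>());  // no matching delete anywhere
        // ... populate and process records.back() ...
    }
}   // every Record is released here, even if an exception was thrown above
```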

Holding all the records in memory while processing is not very scalable unless you are processing a small number of records. Given that the current solution already fails, paging will definitely result in better scalability. While multiple round trips will add delay due to network latency, paging allows you to work with a much larger number of records.
That said, you should definitely fix the memory leaks, because you will still end up with out-of-memory exceptions; it will simply take longer for the leaks to accumulate to the point where the exception occurs.
Additionally, you should ensure that you do not keep any cursors open while paging; otherwise you may cause blocking problems for others. Create a SQL statement that returns only one page of data at a time.

Firstly, identify whether you have memory leaks, and fix them if you do.
Memory leaks do not scale well.
Secondly, allocating dynamic memory is usually much faster than DB access, except when you are allocating a lot of memory and forcing the heap to grow.
If you are fetching a lot of rows (100k upwards) for processing, first ask yourself why it is necessary to fetch all of them. Can the SQL be modified to perform the processing based on criteria? If you clarify what the processing does, we can give better advice on how to approach it.
Fetching and processing large amounts of data needs proper thought to ensure that it scales well.

Related

RocksDB compaction: how to reduce data size and use more than 1 CPU core?

I'm trying to use RocksDB to store billions of records, so the resulting databases are fairly large - hundreds of gigabytes, several terabytes in some cases. The data is initially imported from a different service snapshot and updated from Kafka afterwards, but that's beside the point.
There are two parts of the problem:
Part 1) The initial data import takes hours with autocompactions disabled (it takes days if I enable them). After that I reopen the database with autocompactions enabled, but they aren't triggered automatically when the DB is opened, so I have to trigger compaction manually with CompactRange(Range{nil, nil}) in Go.
Manual compaction takes almost as long, with only one CPU core busy, and during compaction the overall size of the DB grows 2x-3x, but then ends up at around 0.5x.
Question 1: Is there a way to avoid 2x-3x data size growth during compaction? It becomes a problem when the data size reaches terabytes. I use the default Level Compaction, which according to the docs "optimizes disk footprint vs. logical database size (space amplification) by minimizing the files involved in each compaction step".
Question 2: Is it possible to engage more CPU cores for manual compaction? It looks like only one is used at the moment (even though MaxBackgroundCompactions = 32). That would speed up the process A LOT, as there are no writes during the initial manual compaction; I just need to prepare the DB without waiting for days.
Would it work with several routines working on different sets of keys instead of just one routine working on all keys? If yes, what's the best way to divide the keys into these sets?
Part 2) Even after this manual compaction, RocksDB seems to perform autocompaction later, after I start adding/updating the data, and once that is done the DB size gets even smaller - around 0.4x compared to the size before the manual compaction.
Question 3: What's the difference between manual and autocompaction, and why does autocompaction seem to be more effective in terms of resulting data size?
My project is in Go, but I'm more or less familiar with RocksDB C++ code and I couldn't find any answers to these questions in the docs or in the source code.

How can one analyse and/or eliminate performance variations due to memory allocation?

I have a real-time application that generally deals with each chunk of incoming data in 2-5 milliseconds, but sometimes it spikes to several tens of milliseconds. I can generate and repeat the sequence of incoming data as often as I like, and can prove that the spikes are not related to particular chunks of data.
My guess is that because the C++/Win32/MFC code also uses variable-length std::vector and std::list containers, it regularly needs to get memory from the OS and periodically has to wait for the OS to do some garbage collection or something. How could I test this conjecture? Is there any way to tune the memory allocation so that the OS has less of an impact?
Context: think of the application as a network protocol analyser which gathers data in real-time and makes it available for inspection. The data "capture" always runs in the highest priority thread.
The easy way to test is to not put your data into any structure at all, i.e. eliminate whatever you suspect may be the problem. You might also consider that the delays may be the OS switching your process out of context in order to give time to other processes.
If you are pushing lots of data onto a vector, such that it is constantly growing, then you will experience periodic delays as the vector is resized. In this case, the delays are likely to get longer and less frequent. One way to mitigate this is to use a deque which allocates data in chunks but relaxes the requirement that all data be in contiguous memory.
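A small illustration of the deque suggestion, plus one option the answer does not mention: reserving vector capacity up front, which avoids the same resizing spikes when an upper bound is known. The capacity figure is arbitrary.

```cpp
#include <deque>
#include <vector>

struct Sample {
    // ... one captured chunk of protocol data ...
};

void captureWithVector(std::vector<Sample>& samples) {
    samples.reserve(1000000);        // one big allocation up front, no regrowth later
    samples.push_back(Sample{});     // push_back never triggers a resize now
}

void captureWithDeque(std::deque<Sample>& samples) {
    samples.push_back(Sample{});     // grows in fixed-size blocks and never copies
}                                    // the elements already stored
```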
Another way around it is to create a background thread that handles the allocation, provided you know that it can allocate memory faster than the processing consumes it. You can't directly use standard containers for this. However, you can implement something similar to a deque by allocating constant-size vector chunks or simply using traditional dynamic arrays. The idea here is that as soon as you begin using a new chunk, you signal your background thread to allocate another one.
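A rough sketch of that background-allocation idea, assuming a single real-time consumer thread; the class name and constants are invented for illustration:

```cpp
// A background thread keeps a small stock of pre-allocated chunks so the
// high-priority capture thread never has to call new itself.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ChunkFeeder {
public:
    ChunkFeeder() : worker_(&ChunkFeeder::refill, this) {}

    ~ChunkFeeder() {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }

    // Called from the real-time thread: hand over an already-allocated chunk.
    std::vector<char> takeChunk() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !stock_.empty(); });
        std::vector<char> chunk = std::move(stock_.front());
        stock_.pop();
        cv_.notify_all();                   // tell the worker a slot is free
        return chunk;
    }

private:
    static constexpr std::size_t kChunkBytes = 1 << 20;  // arbitrary chunk size
    static constexpr std::size_t kTargetStock = 4;       // chunks kept in reserve

    void refill() {
        std::unique_lock<std::mutex> lock(m_);
        while (!stop_) {
            if (stock_.size() < kTargetStock) {
                lock.unlock();
                std::vector<char> chunk(kChunkBytes);     // slow work off the hot path
                lock.lock();
                stock_.push(std::move(chunk));
                cv_.notify_all();
            } else {
                cv_.wait(lock);             // sleep until a chunk is consumed
            }
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::vector<char>> stock_;
    bool stop_ = false;
    std::thread worker_;                    // declared last, so it starts last
};
```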
All of the above is based on the assumption that you need to store all your incoming data. If you don't need to do that, don't. In that case, your symptoms would suggest the OS is simply switching you out; you could investigate altering the priority of your thread.

How does a memory leak improve performance?

I'm building a large RTree (spatial index) full of nodes. It needs to be able to handle many queries AND updates. Objects are continuously being created and destroyed. The basic test I'm running is to see the performance of the tree as the number of objects in it increases. I insert from 100 to 20000 uniformly sized, randomly located objects in increments of 100. Searching and updating are irrelevant to the issue I am currently facing.
Now, when there is NO memory leak, the "insert into tree" performance is all over the place. It goes anywhere from 10.5 seconds with ~15000 objects to 1.5 with ~18000. There is no pattern whatsoever.
When I deliberately add in a leak, as simple as putting a bare "new int;" on a line by itself (I don't assign it to anything), the performance instantly falls onto a nice gentle curve, sloping from roughly 0 seconds for 100 objects to 1.5 for the full 20k.
Very, very lost at this point. If you want source code I can include it, but it's huge, and literally the only line that makes a difference is "new int;"
Thanks in advance!
-nick
I'm not sure how you came up with this new int test, but it's not a very good way to fix things :) Run your code through a profiler and find out where the real delays are. Then concentrate on fixing the hot spots.
g++ has it built in - just compile with -pg
Without more information it's impossible to be sure.
However, I wonder if this has to do with heap fragmentation. By creating and freeing many blocks of memory you'll likely be creating a whole load of small fragments of memory linked together. The memory manager needs to keep track of them all so it can allocate them again if needed.
Some memory managers, when you free a block, try to merge it with surrounding blocks of memory, and on a highly fragmented heap this can be very slow as they try to find the surrounding blocks. Not only that, but if you have limited physical memory, they can touch many physical pages of memory as they follow the chain of memory blocks, which can cause a whole load of extremely slow page faults that vary greatly in speed depending on exactly how much physical memory the OS decides to give that process.
By leaving some memory un-freed you change this pattern of access, which might make a large difference to the speed. You might, for example, be forcing the runtime library to allocate a new block of memory each time rather than having to track down a suitably sized existing block to reuse.
I have no evidence that this is the case in your program, but I do know that memory fragmentation is often the cause of slow programs when a lot of new and free is performed.
A possible explanation of what is happening (a theory):
1. The compiler did not remove the empty new int.
2. The new int is in one of the inner loops, or somewhere in your recursive traversal, where it gets executed a great many times.
3. The overall RSS of the process increases, and with it the total memory used by the process.
4. Page faults start happening because of this.
5. Because of the page faults, the process becomes I/O bound instead of CPU bound.
End result: you see a drop in throughput. It would help if you could mention the compiler being used and the options you build the code with.
I am taking a stab in the dark here, but the problem could be the way the heap gets fragmented. You said that you are creating and destroying large numbers of objects. I will assume that the objects are all of different sizes.
When one allocates memory on the heap, a cell of the size needed is broken off from the heap. When the memory is freed, the cell is added to a free list. When one does a new allocation, the allocator walks the free list until a cell that is big enough is found. When doing large numbers of allocations, the free list can get rather long, and walking it can take a non-trivial amount of time.
Now, an int is rather small. So when you do your new int, it may well eat up all the small heap cells on the free list and thus dramatically speed up larger allocations.
The chances are, however, that you are allocating and freeing similarly sized objects. If you use your own free lists, you will save yourself many heap walks and may dramatically improve performance. This is exactly what the STL allocators do to improve performance.
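A toy sketch of such a per-type free list; the node layout and class name are invented, and it assumes the node is at least as large and as aligned as a pointer:

```cpp
// Freed nodes are kept on an intrusive free list, so the general-purpose heap
// (and its free-list walking) is only involved when the pool runs dry.
#include <new>

struct RTreeNode {
    // ... node payload (bounding box, child pointers, etc.) ...
    void* padding[4];                          // stand-in for the real members
};

class NodePool {
public:
    RTreeNode* allocate() {
        if (freeList_ != nullptr) {            // reuse a previously freed slot
            void* slot = freeList_;
            freeList_ = freeList_->next;
            return new (slot) RTreeNode();     // re-construct in place
        }
        return new RTreeNode();                // pool empty: fall back to the heap
    }

    void release(RTreeNode* node) {
        node->~RTreeNode();                          // destroy, but keep the raw storage
        freeList_ = new (node) FreeSlot{freeList_};  // reuse the bytes as a list link
    }

private:
    struct FreeSlot { FreeSlot* next; };       // link stored in the dead node's bytes
    static_assert(sizeof(RTreeNode) >= sizeof(FreeSlot),
                  "node must be able to hold a free-list link");
    FreeSlot* freeList_ = nullptr;
};
```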
Solution: do not run it from Visual Studio; actually run the .exe file directly. I figured this out because that's what the profilers were doing, and the numbers were magically dropping. I checked memory usage, and the version run this way (which gave me EXCEPTIONAL times) was not blowing up to excessively huge sizes.
Solution to why the hell Visual Studio does ridiculous crap like this: No clue.

Performance of table access

We have an application which is completely written in C. For table access inside the code, such as fetching some values from a table, we use Pro*C. To increase the performance of the application, we also preload some tables for fetching the data. In general, we take some input fields and fetch the output fields from the table.
We usually have around 30,000 entries in the table, and at times it reaches a maximum of 0.1 million (100,000).
But if the table entries increase to around 10 million, I think it would seriously affect the performance of the application.
Am I wrong somewhere? If it really affects the performance, is there any way to keep the performance of the application stable?
What is the possible workaround if the number of rows in the table increases to 10 million considering the way the application works with tables?
If you are not sorting the table, you'll get a proportional increase in search time: assuming nothing is coded wrong, in your example (30K vs 10M rows) you'd get roughly 333x longer search times. I'm assuming you're iterating the table linearly (i++ style).
However, if it's somehow possible to sort the table, then you can greatly reduce search times. That is possible because an indexed search over sorted information does not have to scan every element until it reaches the sought one: it uses auxiliary structures (trees, hashes, etc.), usually much faster to search, and then pinpoints the sought element, or at least gets a much closer estimate of where it is in the master table.
Of course, that comes at the expense of having to keep the table sorted, either when you insert or remove elements, or when you perform a search.
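To make that concrete in C++ terms (the same idea works with qsort()/bsearch() in plain C), a small sketch of a sorted, binary-searched in-memory table; the Entry fields are placeholders:

```cpp
// Keep the preloaded table sorted by its key and binary-search it, so a lookup
// costs O(log n) instead of a linear walk over the whole table.
#include <algorithm>
#include <string>
#include <vector>

struct Entry {
    long key;
    std::string value;
};

const Entry* find(const std::vector<Entry>& table, long key) {
    // `table` must already be sorted by `key` (sort once after preloading).
    auto it = std::lower_bound(table.begin(), table.end(), key,
                               [](const Entry& e, long k) { return e.key < k; });
    return (it != table.end() && it->key == key) ? &*it : nullptr;
}
```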
Maybe you can search for "Google hash" and take a look at their implementation, although it is in C++.
It might be that you get too many cache misses once the table grows beyond 1 MB, or whatever your cache size is.
If you iterate over the table multiple times, or you access elements randomly, you can also hit a lot of cache misses.
http://en.wikipedia.org/wiki/CPU_cache#Cache_Misses
Well, it really depends on what you are doing with the data. If you have to load the whole kit and caboodle into memory, then a reasonable approach would be to use a large bulk size, so that relatively few Oracle round trips need to occur.
If you don't really have the memory resources to allow the whole result set to be loaded into memory, then a large bulk size will still help with the Oracle overhead. Get a reasonably sized chunk of records into memory, process them, then get the next chunk.
Without more information about your actual run time environment, and business goals, that is about as specific as anyone can get.
Can you tell us more about the issue?

Memory management while loading huge XML files

We have an application which imports objects from an XML file. The XML is around 15 GB. The application invariably starts running out of memory. We tried to free memory between operations, but this has led to degraded performance, i.e. it takes more time to complete the import operation, and CPU utilization reaches 100%.
The application is written in C++.
Do the frequent calls to free() lead to performance issues?
Promoted from a comment by the OP: the parser being used is expat, which is a SAX parser with a very small footprint and customisable memory management.
Use a SAX parser instead of a DOM parser.
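Since the OP is already using expat (a SAX-style parser), a minimal streaming skeleton might look like the following; the buffer size and handler bodies are placeholders:

```cpp
// The 15 GB file is streamed through a small buffer and only the handlers
// decide how little state to keep per imported object.
#include <expat.h>
#include <cstddef>
#include <cstdio>

static void XMLCALL onStartElement(void* userData, const XML_Char* name,
                                   const XML_Char** attrs) {
    // accumulate only the object currently being imported
    (void)userData; (void)name; (void)attrs;
}

static void XMLCALL onEndElement(void* userData, const XML_Char* name) {
    // finish the current object, hand it off, then drop any scratch state
    (void)userData; (void)name;
}

int importFile(const char* path) {
    std::FILE* file = std::fopen(path, "rb");
    if (!file) return -1;

    XML_Parser parser = XML_ParserCreate(nullptr);
    XML_SetElementHandler(parser, onStartElement, onEndElement);

    char buffer[64 * 1024];
    bool ok = true, done = false;
    while (ok && !done) {
        std::size_t n = std::fread(buffer, 1, sizeof buffer, file);
        done = (std::feof(file) != 0);
        // the last argument (isFinal) is true only for the last chunk
        ok = (XML_Parse(parser, buffer, static_cast<int>(n), done) != XML_STATUS_ERROR);
    }

    XML_ParserFree(parser);
    std::fclose(file);
    return ok ? 0 : -1;
}
```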
Have you tried reusing the memory and your objects, as opposed to freeing and reallocating them? Constant allocation/deallocation cycles, especially when coupled with small data fragments (less than 4096 bytes), can lead to serious performance problems and memory address space fragmentation.
Profile the application during one of these bothersome loads, to see where it is spending most of its time.
I believe that free() can sometimes be costly, but that of course depends very much on the platform's implementation.
Also, you don't say a lot about the lifetime of the loaded objects; if the XML is 15 GB, how much of that is kept around for each "object" once the markup is parsed and thrown away?
It sounds sensible to process an input document of this size in a streaming fashion, i.e. not trying a DOM-approach which loads and builds the entire XML parse tree at once.
If you want to minimise your memory usage, took a look at How to read the XML data from a file by using Visual C++.
One thing that often helps is to use a lightweight low-overhead memory pool. If you combine this with "frame" allocation methods (ignoring any delete/free until you're all done with the data), you can get something that's ridiculously fast.
We did this for an embedded system recently, mostly for performance reasons, but it saved a lot of memory as well.
The trick was basically to allocate a big block, slightly bigger than we'd need (you could allocate a chain of blocks if you like), and just keep returning a "current" pointer, bumping it up each time by allocSize rounded up to the maximum alignment requirement (4 in our case). This cut our overhead per alloc from on the order of 52-60 bytes down to <= 3 bytes. We also ignored "free" calls until we were all done parsing, and then freed the whole block.
If you're clever enough with your frame allocation you can save a lot of space and time. It might not get you all the way to your 15GiB, but it would be worth looking at how much space overhead you really have... My experience with DOM-based systems is that they use tons of small allocs, each with a relatively high overhead.
(If you have virtual memory, a large "block" might not even hurt that much, if your access at any given time is local to a page or three anyway...)
Obviously you have to keep the memory you actually need in the long run, but the parser's "scratch memory" becomes a lot more efficient this way.
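A rough sketch of that kind of frame/bump allocator, using the 4-byte alignment from the description above; the Arena name and single-block strategy are illustrative, and error handling plus chaining of extra blocks are left out:

```cpp
// Per-allocation overhead is just the rounding; there is no header or free list.
// Nothing is freed individually: the whole block goes away once parsing is done.
#include <cstddef>
#include <cstdlib>

class Arena {
public:
    explicit Arena(std::size_t bytes)
        : base_(static_cast<char*>(std::malloc(bytes))),
          cur_(base_),
          end_(base_ ? base_ + bytes : nullptr) {}

    ~Arena() { std::free(base_); }          // one free() for the whole parse

    void* allocate(std::size_t size) {
        std::size_t rounded = (size + 3) & ~static_cast<std::size_t>(3);  // 4-byte alignment
        if (!cur_ || end_ - cur_ < static_cast<std::ptrdiff_t>(rounded))
            return nullptr;                 // a real version would chain another block
        void* p = cur_;
        cur_ += rounded;                    // "bump" the current pointer
        return p;
    }

private:
    char* base_;
    char* cur_;
    char* end_;
};

// Usage: parser objects are placement-new'ed into the arena and never freed
// individually, e.g.  void* p = arena.allocate(sizeof(Node)); Node* n = new (p) Node();
```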
We tried to free memory between operations, but this has led to degraded performance.
Do the frequent calls to free() lead to performance issues?
Based on the evidence supplied, yes.
Since you're already using expat, a SAX parser, what exactly are you freeing? And if you can free it, why are you mallocing it in a loop in the first place?
Maybe; a profiler should tell you.
Also, don't forget that heap operations are effectively single-threaded: if both of your threads allocate/free memory at the same time, one of them will have to wait until the other is done.
If you are allocating and freeing memory for the same kinds of objects, you could create a pool of those objects and do the allocation/deallocation only once.
Try and find a way to profile your code.
If you have no profiler, try to organize your work so that you only call free() a few times (instead of the many calls you describe).
It is very common to have to find the right balance between memory consumption and time efficiency.
I have not tried it myself, but have you heard of XmlLite? There's an MSDN article introducing it. It's used internally by MS Office.