real cleanup in Doctrine2 iteration - doctrine-orm

I've read the manual, but it doesn't seem to work that way. I'm iterating over a large dataset, processing statistics. My memory usage explodes and nothing I do stops it.
I've tried detaching, clearing, unsetting, calling gc_collect_cycles and dancing naked under the full moon - nothing even makes a difference.
I have a complex dataset with many relations, many of which get pulled into memory during processing, so I'm pretty sure that's the culprit; there's probably a cyclic reference or ten somewhere.
What I need is something that tells Doctrine to throw away everything it has in memory right now, stop trying to be smart, and start from scratch in the next iteration with nothing except the iteration result itself. NOTHING.
$em->clear() doesn't do that. In fact, some test runs reveal that clearing and detaching really make no difference in my case.
Is there a way to tell Doctrine to stop being greedy and holding on to twenty million objects nobody cares about anymore?

Related

Performance bottleneck at _L_unlock_16

I am trying to use the Google perf tools CPU profiler to debug performance issues in a multi-threaded program. With a single thread it takes 250 ms, while 4 threads take around 900 ms.
My program has an mmap'ed file which is shared across threads, and all operations are read-only. My program also creates a large number of objects which are not shared across threads. (Specifically, my program uses the CRF++ library to do some querying.) I am trying to figure out how to make my program perform better with multiple threads. The call graph produced by the gperftools CPU profiler shows that my program spends a lot of time (around 50%) in _L_unlock_16.
Searching the web for _L_unlock_16 pointed to some Canonical bug reports suggesting that it's associated with libpthread. But other than that, I was not able to find any useful information for debugging.
A brief description of what my program does: I have a few words in a file (4). In my program I have a processWord() which processes a single word using CRF++. This processWord() is what each thread executes. My main() reads words from the file, and each thread runs processWord() in parallel. If I process a single word (hence only 1 thread) it takes 250 ms, so if I process all 4 words (and hence 4 threads) I expected it to finish in the same 250 ms; however, as I mentioned above, it's taking around 900 ms.
This is the callgraph of the execution - https://www.dropbox.com/s/o1mkh477i7e9s4m/cgout_n2.png
I want to understand why my program is spending a lot of time at _L_unlock_16 and what I can do to mitigate it.
Again, _L_unlock_16 is not a function of your code. Have you looked at the stack traces above that function? What are its callers when the program waits? You've said that the program wastes 50% of its time waiting inside it. But which part of the program ordered that operation? Is it again memory alloc/dealloc ops?
The function seems to come from libpthread. Does CRF++ handle threads/libpthread in any way? If yes, then maybe the library is misconfigured? Or maybe it implements some 'basic thread safety' by adding locks everywhere and simply is not built well for multithreading? What do the docs say about that?
Personally, I'd guess that it ignores threads and that you have added all the threading. I may be wrong, but if that's true, then CRF++ probably won't call that 'unlock' function at all, and the 'unlock' is somehow called from your code that orchestrates the threads/locks/queues/messages etc. Halt the program a few times and look at who called the unlock. If it really spends 50% of its time sitting in the unlock, you will very quickly know who causes the lock to be used, and you will be able to either eliminate it or at least do some more refined research.
EDIT #1:
Eh... when I said "stack trace" I meant stack trace, not call graph. A call graph may look nice in trivial cases, but in more complex ones it will be mangled and unreadable and will hide the precious details in "compacted" form. But fortunately, here the case looks simple enough.
Please observe the beginning: "Process word, 99x". I assume that the "99x" is the call count. Then, look at "tagger-parse": 97x. From that:
61x goes into rebuildFeatures, from which 41x goes right into unlock and 20 (13) goes indirectly into it
23x goes into buildLattice, from which 21x goes into unlock
I'd guess that CRF++ uses locking quite heavily. To me, it seems that you are simply observing the effects of CRF++'s internal locking. It certainly is not lockless internally.
It seems to lock at least once per "processWord". It's hard to say without looking at the code (is it open source? I haven't checked), and from stack traces it would be more obvious, but IF it really locks once per "processWord" then it could even be a sort of "global lock" that protects "everything" from "all threads" and causes all jobs to serialize. Whatever. Anyway, clearly, it's CRF++'s internals that lock and wait.
If your CRF++ objects are really (really) not shared across threads, then remove the threading configuration flags from CRF++, pray that its authors were sane enough not to use any static variables or global objects, add some locking of your own (if needed) at the topmost job/result level, and retry. It should now be much faster.
If the CRF++ objects are shared, unshare them and see above.
But if they are shared behind the scenes, then there's little you can do: change your library to one with better threading support, OR fix the library, OR ignore it and use it with the current performance.
The last advice may sound strange (it works slowly, right? so why ignore it?), but in fact it is the most important one, and you should try it first. If the parallel tasks have a similar "data profile", then it is very probable that they will try to hit the same locks at approximately the same moment. Imagine a medium-sized cache that holds words sorted by their first letter. At the top level there's an array of, say, 26 entries. Each entry has a lock and a list of words inside. If you run 100 threads that will each first check "mom", then "dad", then "son", then all 100 threads will first hit and wait for each other at "M", then at "D", then at "S". Well, approximately/probably, of course. But you get the idea. If the data profile were more random, they'd block each other far less. Mind that processing ONE word is a small task and that you are trying to process the same word(s). Even if CRF++'s internal locking is smart, it is simply bound to hit the same areas. Try again with more dispersed data.
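To make that letter-bucket picture concrete, here is a minimal sketch of the effect (the Cache type, the bucket layout and the word list are invented for illustration; this is not CRF++'s actual internal structure):

```cpp
// Minimal sketch of the letter-bucket contention described above.
// "Cache" and its layout are hypothetical, purely for illustration.
#include <array>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct Cache {
    // One lock per bucket, keyed by the word's first character.
    std::array<std::mutex, 26> bucketLocks;

    void touch(const std::string& word) {
        std::size_t bucket = static_cast<unsigned char>(word[0]) % 26;
        std::lock_guard<std::mutex> guard(bucketLocks[bucket]);
        // ... look the word up / insert it while holding the bucket lock ...
    }
};

int main() {
    Cache cache;
    // Every thread processes the *same* words, so they all queue up on the
    // same few buckets ('m', 'd', 's') instead of spreading out.
    const std::vector<std::string> words = {"mom", "dad", "son"};
    std::vector<std::thread> threads;
    for (int t = 0; t < 100; ++t) {
        threads.emplace_back([&] {
            for (const auto& w : words) cache.touch(w);
        });
    }
    for (auto& th : threads) th.join();
}
```

With identical inputs, all 100 threads pile up behind the 'm' bucket, then the 'd' bucket, then the 's' bucket; with more varied words, the waiting spreads out and mostly disappears.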
Add to that the fact that threading itself costs something. If something is guarded against races with locks, then every lock/unlock costs, because at the very least the thread has to "halt and check if the lock is open" (sorry for the very imprecise wording). If the data to process is small in relation to the amount of lock checks, then adding more threads will not help and will just waste time. For checking one word, it may even happen that handling a single lock takes longer than processing the word! But if the amount of data to be processed were larger, then the cost of flipping a lock compared to processing the data might start to be negligible.
Prepare a set of 100 or more words. Run and measure it on one thread. Then partition the words at random and run it on 2 and 4 threads, and measure. If it's not better, try 1,000 and 10,000 words. The more the better, of course, keeping in mind that the test should not last until your next birthday ;)
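A rough sketch of that kind of scaling test (the processWord() body below is a dummy stand-in for your real per-word work, and the interleaved partition is just a crude way of splitting the words):

```cpp
// Sketch of the suggested scaling test: time the same word list on 1, 2
// and 4 threads. processWord() is a stand-in for the real per-word work.
#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

void processWord(const std::string& word) {
    // Dummy work so the sketch links and runs; replace with the real call.
    volatile std::size_t sink = 0;
    for (char c : word) sink = sink + static_cast<unsigned char>(c);
}

double runWithThreads(const std::vector<std::string>& words, unsigned nThreads) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nThreads; ++t) {
        // Each thread takes every nThreads-th word.
        threads.emplace_back([&, t] {
            for (std::size_t i = t; i < words.size(); i += nThreads)
                processWord(words[i]);
        });
    }
    for (auto& th : threads) th.join();
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    std::vector<std::string> words(10000, "example");  // use real, varied words here
    for (unsigned n : {1u, 2u, 4u})
        std::cout << n << " thread(s): " << runWithThreads(words, n) << " ms\n";
}
```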
If you notice that 10k words split over 4 threads (2,500 words per thread) runs in about 40%, 30%, or even just 25% of the single-thread time - there you go! You simply gave it too small a job. It was tailored and optimized for bigger ones!
But on the other hand, it may happen that 10k words split over 4 threads does not run faster, or worse, runs slower - that would indicate that the library handles multithreading very badly. Then try the other things, like stripping the threading from it or repairing it.

c++ dynamically change addresses in another process?

Is it possible to change addresses in an application such that the app still works fine, but hacks (based on memory read/write) to this app don't? Maybe move the stack or something?
Update:
I'm not looking for base-address randomization. I'm looking for a way to change addresses in a running app such that the app still works, but "hacks and bots" can't read that part of memory. ASLR isn't what I'm looking for (it's too easy to bypass).
Moving the stack will be very hard unless you can return all the way to main before you do it. Any variable passed as a reference or pointer to a variable on the stack will not be allowed to move. And before you say "well, then I'll just allocate everything dynamically", now you have exactly the same problem - your HEAP is located in one place that is predictable (at least somewhat predictable), and thus can be modified. And of course, even if the heap isn't predictably placed, you can't just move it around at random during execution, since your code relies on pointers and references to other data in the heap - if you move that, you end up having to rewrite all those references. Finally, you will still have some register or memory location that is known, or possible to calculate from some other value (e.g. the stack, some global data value, or something), that can be used to figure out where your data is.
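As a tiny illustration of that pointer problem (the names are made up), any code holding a raw address into the stack would silently break the moment the stack moved:

```cpp
// Why the stack can't simply be relocated mid-run: live pointers into it
// would go stale. "health" is just a made-up example variable.
#include <iostream>

void readHealth(const int* health) {
    // 'health' points into the caller's stack frame. If the stack were moved
    // between the call and this read, this address would be junk.
    std::cout << "health = " << *health << '\n';
}

int main() {
    int health = 100;        // lives in main's stack frame
    readHealth(&health);     // hands out a raw address into that frame
}
```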
My best suggestion would be to generate code in the heap, and use that to calculate, in parallel, the results that are used in your game.
Also, one way to avoid persistent locations is to run code in threads that are dynamically created and destroyed - that way, the stack is only in one place for a short period of time. But of course, it doesn't really stop someone who is skilled and determined to find a way around your protection. And with millions and millions of people in the world who have access to computers and are able to "break into things", you can't really rely on "security through obscurity" [making things complicated is not security].
The properly secure way is to perform all essential calculations on a server which holds the code, so the code isn't available to the public. However, for an FPS game, that's probably not realistic. For a poker game, it's very realistic, especially if you can win money by playing well!

profiler for c++ code, very sleepy

I'm a newbie at profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 2008 Express and thus had to download a profiler; for me it's Very Sleepy. I did some searching but found no decent tutorial on Sleepy, so here is my question:
How do I use it properly? I've grasped the general idea of profiling, so I sorted by %exclusive to find my bottlenecks. At the top of this list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators... and only after they take up something like 60% does my first function appear, the first entry with child calls. Can someone explain why these entries come out on top, what they mean, and how I can optimize my code if I have no access to this critical 60%? (Their "source file" is listed as unknown...)
Also, for my own function I'd expect to get a time for each line, but that's not the case; e.g. some arithmetic and some function calls have no timing at all (and they're not nested inside unused "if" clauses).
And one last thing: how do I find out that some line executes super fast, but is called thousands of times, making it the actual bottleneck?
Finally, is Sleepy any good? Or is there a better free alternative for my platform?
Help very appreciated!
cheers!
UPDATE - - - - -
I have found another version of the profiler, called plain Sleepy. It shows how many times a snippet was called, plus the line number (I guess it points to the critical one). So in my case... KiFastSystemCallRet takes 50%! That means it waits for some data, right? How can I improve that? Is there a decent approach to tracing what causes these repeated calls so I can eventually remove or change it?
I'd like to optimize my code to satisfy timing constraints
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. On average you will see the problem on 3 of the pauses.
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
You'll see things this way that no profiler can find, because profilers are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did. (For example, if you remove a 30% bottleneck, the program now takes 70% of its original time, so a bottleneck that used to be 20% is now 20/70 ≈ 29%.)
So they will be easier to find when you repeat the whole process.
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.
I've used GlowCode (a commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data produced by the tool. The instrumenting step injects a little trace function at every method's entry and exit points, and simply measures how much time each function takes to run to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest-percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling internal routines thousands of times per invocation. We found that someone had made a coding error and was walking a large linked list repeatedly, once for each item in the list, when they really only needed to walk it once.
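Roughly, the mistake looked like the pattern below (a reconstruction for illustration, not the actual code): the whole list gets walked again for every element, turning a linear job into a quadratic one.

```cpp
#include <list>
#include <unordered_map>

struct Item { int value; bool flagged; };

// O(n^2): the entire list is re-walked once for every element.
void flagDuplicatesSlow(std::list<Item>& items) {
    for (auto& a : items)
        for (const auto& b : items)
            if (&a != &b && a.value == b.value) a.flagged = true;
}

// O(n): two linear passes and a side table do the same job.
void flagDuplicatesFast(std::list<Item>& items) {
    std::unordered_map<int, int> counts;
    for (const auto& item : items) ++counts[item.value];
    for (auto& item : items)
        if (counts[item.value] > 1) item.flagged = true;
}
```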
If you sort from "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.). Something to check is that the functions called most often are really lean. Saving a millisecond in a routine called 500 times per screen paint will speed that screen up by half a second. This helps you decide which places are the most important to spend your effort on.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast. Even elite racing drivers and twitch gamers have reaction times well above 100 milliseconds, and ordinary users like bank tellers are in the 200-300 millisecond range. A one-millisecond gain would be imperceptible.
Making twenty 1-millisecond improvements, or one 20-millisecond improvement, will make the system noticeably more responsive. It's cheaper and better if you can make the single big improvement instead of the many small ones.
Similarly, shaving one millisecond off a service that handles 100 users per second makes roughly a 10% improvement: 100 per second means 10 ms per user, so dropping to 9 ms lets the service handle about 110 users per second.
The reason for concern is that coding changes made strictly to improve performance often hurt your code's structure by adding complexity. Let's say you decide to speed up a database call by caching results: how do you know when the cache becomes invalid? Do you add a cache invalidation mechanism? Or consider a financial transaction where looping through all the line items to produce a total is slow, so you decide to keep a runningTotal accumulator to answer faster. Now you have to update runningTotal in all kinds of situations: line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.
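Here's a hypothetical sketch of that runningTotal trade-off (the Invoice and LineItem types are invented): the cached total answers instantly, but every mutating path now has to remember to keep it in sync.

```cpp
#include <cstddef>
#include <vector>

struct LineItem { double amount; bool voided; };

class Invoice {
public:
    void addItem(double amount) {
        items_.push_back({amount, false});
        runningTotal_ += amount;                            // must update here...
    }
    void voidItem(std::size_t index) {
        if (index >= items_.size() || items_[index].voided) return;
        items_[index].voided = true;
        runningTotal_ -= items_[index].amount;              // ...and here...
    }
    void changeAmount(std::size_t index, double newAmount) {
        if (index >= items_.size() || items_[index].voided) return;
        runningTotal_ += newAmount - items_[index].amount;  // ...and here.
        items_[index].amount = newAmount;
    }
    // Fast, but only as correct as every mutating path above.
    double total() const { return runningTotal_; }

private:
    std::vector<LineItem> items_;
    double runningTotal_ = 0.0;
};
```

Every new operation (reversals, deletions, quantity changes) adds another place where forgetting one line silently corrupts the total.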

Error handling design problem on collection of items

I have a collection of items and an operation on them. This operation is part of a remote call between client and server, and it should run on all items at once. On the server side it runs on each item in turn and may fail or succeed. I need to know which items succeeded and which failed. I guess this is a rather common case and there are good solutions to it. How should I design it?
it should run on all items at once
You will hate your life if you don't treat this as a design requirement. All-or-nothing is the right way to handle it. It will simplify everything you do.
If that isn't an option, just do the dumbest thing possible: wrap each call in a try/catch and produce a report. Chances are no one will be able to consume the report, which is another reason all-or-nothing is the right thing to do.
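Since the question doesn't name a language, here's a minimal C++ sketch of that "dumbest thing possible" approach; Item, ItemResult and processItem are placeholders for whatever the real call looks like.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

struct Item { std::string id; };

struct ItemResult {
    std::string id;
    bool succeeded;
    std::string error;   // empty on success
};

// Placeholder for the real server-side operation; throws on failure.
void processItem(const Item& item) {
    if (item.id.empty()) throw std::runtime_error("missing id");
}

std::vector<ItemResult> processAll(const std::vector<Item>& items) {
    std::vector<ItemResult> report;
    report.reserve(items.size());
    for (const auto& item : items) {
        try {
            processItem(item);
            report.push_back({item.id, true, ""});
        } catch (const std::exception& e) {
            report.push_back({item.id, false, e.what()});
        }
    }
    return report;   // the caller decides what, if anything, to do with this
}
```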
edit:
To elaborate: when batching, writing simple logic to report errors is fine, but writing logic to recover from errors is very complicated. I've never seen a system really handle recovery well in batching. I'm sure there are some corner cases where each item is completely independent, at which point it doesn't matter that one or another failed, but that is usually not the case.
Generally, I expect any errors that happen during a batching operation to not be critical. By that I mean the system should be able to ignore errors and continue operating as if the message that caused the error never existed.
If it's really vital that these messages get processed, then I would definitely try for all-or-nothing.

Creating my object takes too long. Is it bad practice to create a ton of instances at startup to speed things up later?

I have a wizard class that gets used a lot in my program. Unfortunately, the wizard takes a while to load, mostly because the GUI framework is very slow. I tried to redesign the wizard class multiple times (like making the object reusable so it only gets created once), but I always hit a brick wall somewhere. So, at this point, is it a huge ugly hack to just load 50 instances of this beast into a vector and pop them off as I use them? That way the delay is only noticed at startup, and everything runs fine thereafter. Too much of a hack? Is such a construct common?
In games, we often allocate and construct everything needed for a game session up front. Then we recycle the short-lived objects, trying to get to zero allocations/deallocations while the game session is running.
So no, it's not really a hack; it's just good sense to make the computer do less work so it runs faster. One strategy is "caching": in general, first compute your invariant data, then run with the dynamic data. Memory allocations, object constructions, etc. should be prepared before use, where possible and necessary.
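A minimal sketch of that pre-allocate-and-recycle idea (Wizard here is just a stand-in for the expensive GUI object, and the pool is single-threaded):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Wizard {
    Wizard() { /* the expensive GUI setup happens here, once */ }
    void reset() { /* restore to a just-constructed state before reuse */ }
};

class WizardPool {
public:
    explicit WizardPool(std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(std::make_unique<Wizard>());   // pay the cost up front
    }
    std::unique_ptr<Wizard> acquire() {
        if (free_.empty())
            return std::make_unique<Wizard>();  // pool exhausted: fall back to a slow build
        auto wizard = std::move(free_.back());
        free_.pop_back();
        return wizard;
    }
    void release(std::unique_ptr<Wizard> wizard) {
        wizard->reset();
        free_.push_back(std::move(wizard));     // recycle instead of destroying
    }
private:
    std::vector<std::unique_ptr<Wizard>> free_;
};

int main() {
    WizardPool pool(50);         // the one-time startup cost the question describes
    auto w = pool.acquire();     // effectively instant: no construction at use time
    // ... run the wizard ...
    pool.release(std::move(w));
}
```

Whether acquire() should fall back to a slow construction or block when the pool runs dry is a design choice; the fallback keeps this sketch simple.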
Unfortunately, the wizard takes a while to load mostly because the GUI framework is very slow.
Isn't a wizard just a form-based template? Shouldn't that carry essentially no overhead? Find what's slowing the framework down (uncompressed background image?) and fix the root cause.
As a stopgap, you could create the windows in the background and not display them until the user asks. But that's obviously just moving the problem somewhere else. Even if you create them in a background thread at startup, the user's first command might ask for the last wizard and then they have to wait 50x as long… which they'll probably interpret as a crash. At the very least, anticipate and test such corner cases. Also test on a low-RAM setup.
Yes, it is bad practice; it breaks the RFC 2549 standard.
OK ok, I was just kidding. Do whatever is best for your application.
It isn't a matter of "hacks" or "standards".
Just make sure you have proper documentation about what isn't as straightforward as it should be (such as hacks).
Trust me, if a 5k investment produced a product with lots of hacks (such as Windows), then they [the hacks] must really help at some point.