Say I'm debugging code where one or more of the functions involved is defined with the help of memoize. I'll edit some code, reload the file in the REPL, and try out the new code. But if the bug is still there I always question whether it's because I haven't fixed the bug or because memoize has cached buggy results.
So, is there some way short of restarting the REPL that I can use to make absolutely sure that memoize has lost its memory?
(Note that eliminating calls to memoize during REPL sessions is both tedious and sometimes even impractical, because the performance of the function might rely heavily on memoization.)
memoize never, under any circumstances, empties its cache. Its storage is permanent. If you have a new function you wish to use, you must replace your memoized function by re-memoizing the underlying function, and use only the new version of the function, not the old one. This way, your function calls will be passed through to the new underlying function, and the memory used to cache results of the old function will become eligible for garbage collection because nothing points at it.
You may say, "Well gosh, this is a pain, why is memoize so inflexible?" The answer is that memoize is a very blunt instrument, not well suited to most production use cases. Any time you memoize a function whose set of possible inputs is not limited, you introduce a memory leak. If your function depends on a cache for its performance, you should consider a more flexible caching policy than "cache everything forever", and use a library designed for such use cases.
Related
I am new to Clojure and I have just learned about and experimented with the memoize function.
It seems to me the existence of this function is strange.
Firstly, functions with side effects end with !
Secondly, using memoize is very simple
Why doesn't Clojure just do this for me? There is a balance between memory use and performance, but you could easily have the Clojure runtime set aside a chunk of RAM for function results. If a function is called with the same arguments multiple times, use the cached result; if memory runs out, clear out the cache, and keep track of cache hits so that frequently recalled functions are less likely to be removed from the cache.
If I designed this I would even set a minimum performance level for functions so that if a function call is quicker than cache retrieval it is not cached. (Or make this a property of how all function calls work.)
Can anyone explain why Clojure doesn't do this?
Thanks
There is an old joke in computer science that there are only two hard problems:
cache invalidation
naming things
The built-in memoize function is a great start at what it does, and it's useful and sufficient in a few cases. It does, however, fit that joke above nicely. It has a somewhat awkward name (opinion here), and it fails very badly at cache invalidation. It assumes that if a function has ever been called, the result will always be relevant, that the function is pure and never encounters errors, and that all calls are equally relevant for all time. The real world is full of nuances that turn out to be really important for caching:
many return values are not useful forever
many functions are unbounded in their input domain (math, for instance)
many functions are faster than the cache (math again)
some functions have arguments that are equivalent for caching purposes; these should map to the same cache entry.
memoize doesn't actually guarantee to run a function only once.
many functions are never going to be pure (like accepting a network connection)
some arguments don't affect the return value.
All these things come into play when designing caching for web apps, for instance. #5 is an interesting case: consider what happens if you memoize a very slow function, and there are two calls to it one second apart. Which return value becomes the memoized result? In what cases is this important? These details can really matter, especially if one of them encounters an unusual circumstance.
In the last eight years doing Clojure professionally I have seen memoize used in production many times, and it's always been replaced in short order by a call to one of the functions in clojure.core.cache once the inevitable problems arise.
If you find yourself wanting memoize there's a good chance you will be happier with core.cache. It offers many more nuanced options to fit more of the real world cases.
Suppose you have a very large graph with lots of processing upon its nodes (like tens of millions of operations per node). The core routine is the same for each node, but there are some additional operations based on internal conditions. There can be 2 such conditions, which produces 4 cases: (0,0), (1,0), (0,1), (1,1). E.g. (1,1) means that both conditions hold. Conditions are established once (one set for each node independently) in a program run and, from then on, never change. Unfortunately, they are determined at runtime and in a fully unpredictable way (based on data received via HTTP from an external server).
What is the fastest approach in such a scenario (taking into account modern compiler optimizations, which I have no idea about)?
simply using "IFs": if (condition X) perform additional operation X.
using inheritance to derive four classes from a base class (exposing the method OPERATION) to get the proper operation and save tens of millions of "ifs" (but I am not sure if this is a real saving; inheritance must have its overhead too).
using a pointer to function to assign the right function, based on the conditions, once.
It would take me long to get to a point where I can test this myself (I don't have such big data yet, and this will be integrated into a bigger project, so it would not be easy to test all versions).
Reading the answers: I know that I probably have to experiment with it. But apart from everything, this is a question of what is faster:
tens of millions of IF statements plus normal, statically known function calls, VS function pointer calls, VS inheritance (which I think is not the best idea in this case, and I am thinking of eliminating it from further inspection). Thanks for any constructive answers (not saying that I shouldn't care about such minor things ;)
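For concreteness, here is a rough sketch of the three variants being compared; all names below (Node, core_work, extra_a, ...) are invented for illustration and stripped down to the bare minimum:

    // Illustrative only: minimal versions of the three candidate designs.
    struct Node { bool condA, condB; int payload; };

    void core_work(Node&) { /* the tens of millions of operations per node */ }
    void extra_a(Node&)   { /* additional operation for condition A */ }
    void extra_b(Node&)   { /* additional operation for condition B */ }

    // 1) Plain ifs inside the hot loop.
    void process_if(Node& n) {
        core_work(n);
        if (n.condA) extra_a(n);
        if (n.condB) extra_b(n);
    }

    // 2) Inheritance: one subclass per (condA, condB) case, one virtual call per node.
    struct BaseNode {
        virtual void process() = 0;
        virtual ~BaseNode() = default;
        Node data;
    };
    struct NodeAB : BaseNode {
        void process() override { core_work(data); extra_a(data); extra_b(data); }
    };
    // ...NodeA, NodeB and NodeNone defined the same way...

    // 3) A function pointer chosen once, when the conditions arrive from the server.
    using ProcessFn = void (*)(Node&);
    ProcessFn pick(bool a, bool b) {
        if (a && b) return [](Node& n) { core_work(n); extra_a(n); extra_b(n); };
        if (a)      return [](Node& n) { core_work(n); extra_a(n); };
        if (b)      return [](Node& n) { core_work(n); extra_b(n); };
        return [](Node& n) { core_work(n); };
    }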
There is no real answer except to measure the actual code on the real data. At times, in the past, I've had to deal with such problems, and in the cases I've actually measured, virtual functions were faster than ifs. But that doesn't mean much, since the cases I measured were in a different program (and thus a different context) than yours. For example, a virtual function call will generally prevent inlining, whereas an if is inline by nature, and inlining may open up additional optimization possibilities.
Also, the machines I measured on handled virtual functions pretty well; I've heard that some other machines (HP's PA, for example) are very ineffective in their implementation of indirect jumps (including not only virtual function calls, but also the return from the function, where, again, the lost opportunity to inline costs).
If you absolutely have to have the fastest way, and the process order of the nodes is not relevant, make four different types, one for each case, and define a process function for each. Then in a container class, have four vectors, one for each node type. Upon creation of a new node get all the data you need to create the node, including the conditions, and create a node of the correct type and push it into the correct vector. Then, when you need to process all your nodes, process them type by type, first processing all nodes in the first vector, then the second, etc.
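A minimal sketch of that layout, with made-up type names and empty bodies standing in for the real per-node work:

    #include <vector>

    struct Node00 { void process() { /* core only          */ } };
    struct Node10 { void process() { /* core + operation X  */ } };
    struct Node01 { void process() { /* core + operation Y  */ } };
    struct Node11 { void process() { /* core + X + Y        */ } };

    struct Graph {
        // Objects stored by value, one contiguous vector per case.
        std::vector<Node00> n00;
        std::vector<Node10> n10;
        std::vector<Node01> n01;
        std::vector<Node11> n11;

        void add(bool condX, bool condY /*, node data */) {
            if      ( condX &&  condY) n11.push_back({});
            else if ( condX && !condY) n10.push_back({});
            else if (!condX &&  condY) n01.push_back({});
            else                       n00.push_back({});
        }

        void process_all() {            // no per-node branching, no indirection
            for (auto& n : n00) n.process();
            for (auto& n : n10) n.process();
            for (auto& n : n01) n.process();
            for (auto& n : n11) n.process();
        }
    };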
Why would you want to do this:
No ifs for the state switching
No vtables
No function indirection
But much more importantly:
No instruction cache thrashing (you're not jumping to a different part of your code for every next node)
No branch prediction misses for state switching ifs (since there are none)
Even if you did have inheritance with virtual functions, and thus function indirection through vtables, simply sorting the nodes by their type in your vector may already make a world of difference in performance: any possible instruction cache thrashing would essentially be gone, and depending on the branch prediction method, the number of branch mispredictions could also be reduced.
Also, don't make a vector of pointers; make a vector of objects. If they are pointers you have an extra addressing indirection, which in itself is not that worrisome, but the problem is that it may lead to data cache thrashing if the objects are pretty much randomly spread throughout your memory. If, on the other hand, your objects are put directly into the vector, the processing will basically go through memory linearly, the cache will hit nearly every time, and cache prefetching might actually be able to do a good job.
Note, though, that you would pay heavily in data structure creation if you don't do it correctly: if at all possible, reserve enough capacity in the vector for all your nodes immediately, since reallocating and moving every time the vector runs out of space can become expensive.
Oh, and yes, as James mentioned, always, always measure! What you think may be the fastest way may not be, sometimes things are very counter intuitive, depending on all kinds of factors like optimizations, pipelining, branch prediction, cache hits/misses, data structure layout, etc. What I wrote above is a pretty general approach, but it is not guaranteed to be the fastest and there are definitely ways to do it wrong. Measure, Measure, Measure.
P.S. Inheritance with virtual functions is roughly equivalent to using function pointers. Virtual functions are usually implemented via a pointer at the head of each object to a vtable, which is basically just a table of function pointers to the implementations of the virtuals for the actual type of the object. Whether ifs are faster than virtuals or the other way around is a very, very difficult question to answer and depends completely on the implementation, compiler and platform used.
I'm actually quite impressed with how effective branch prediction can be, and only the if solution allows inlining, which can also be dramatic. Virtual functions and pointers to functions also involve a load from memory and could cause cache misses.
But you have four conditions, so branch misses can be expensive.
Without the ability to test and verify, the question really can't be answered. Especially since it's not even clear that this would be a performance bottleneck sufficient enough to warrant optimization efforts.
In cases like this, I would err on the side of readability and ease of debugging and go with if.
Many programmers have taken classes and read books that go on about certain favorite subjects: pipelining, caching, branch prediction, virtual functions, compiler optimizations, big-O algorithms, etc., and the performance of those.
If I could make an analogy to boating, these are things like trimming weight, tuning power, adjusting balance and streamlining, assuming you are starting from some speedboat that's already close to optimal.
Never mind you may actually be starting from the Queen Mary, and you're assuming it's a speedboat.
It may very well be that there are ways to speed up the code by large factors, just by cutting away fat (masquerading as good design), if only you knew where it was.
Well, you don't know where it is, and guessing where is a waste of time, unless you value being wrong.
When people say "measure and use a profiler" they are pointing in the right direction, but not far enough.
Here's an example of how I do it, and I made a crude video of it, FWIW.
Unless there's a clear pattern to these attributes, no branch predictor exists that can effectively predict this data-dependent condition for you. Under such circumstances, you may be better off avoiding control speculation (and paying the penalty of a branch misprediction), and just waiting for the actual data to arrive and resolve the control flow (more likely to occur when using virtual functions). You'll have to benchmark, of course, to verify that, as it depends on the actual pattern (e.g. whether you have even small groups of similarly "tagged" elements).
The sorting suggested above is nice and all, but note that it converts a problem that's just plain O(n) into an O(n log n) one, so for large sizes you'll lose unless you can sort once and traverse many times, or otherwise cheaply maintain the sorted order.
Note that some predictors may also attempt to predict the address of the function call, so you might be facing the same problem there.
However, I must agree with the comments regarding premature optimization: do you know for sure that the control flow is your bottleneck? What if fetching the actual data from memory takes longer? In general, it would seem that your elements can be processed in parallel, so even if you run this on a single thread (and much more so if you use multiple cores), you should be bandwidth-bound and not latency-bound.
In my question "profiling: deque is 23% of my runtime" I have a problem with 'new' being a large percentage of my runtime. The problems are:
I have to use the new keyword a lot, on many different classes/structs (I have >200 of them, and it's by design). I use lots of STL objects, iterators and strings. I use strdup and other allocation (or free) functions.
I have one function that is called >2 million times. All it did was create STL iterators, and it took up >20% of the time (though, from what I remember, STL is optimized pretty nicely and a debug build makes it magnitudes slower).
But keep in mind I need to allocate and free these iterators >2 million times, along with other functions that are called often. How do I optimize the new keyword and the malloc function, especially for all the classes/structs I didn't write (STL and others)?
Profiling says I (and the STL?) use the new keyword more than anything else.
Look for opportunities to avoid the allocation/freeing, either by adding your own management layer to recycle memory and objects that have already been allocated, or modifying their allocators. There are plenty of articles on STL Allocators:
http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c4079
http://bmagic.sourceforge.net/memalloc.html
http://www.codeproject.com/KB/stl/blockallocator.aspx
I have seen large multimap code go from unusably slow to very fast simply by replacing the default allocator.
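As an illustration of the "management layer that recycles memory and objects" idea mentioned above, here is a hypothetical free-list pool, kept deliberately minimal; the linked articles go further, e.g. replacing a container's allocator outright:

    #include <new>
    #include <utility>
    #include <vector>

    // Tiny recycling pool for one object type: released objects keep their
    // storage and are handed back out on the next acquire(), so the hot path
    // stops hitting operator new / operator delete.
    template <typename T>
    class ObjectPool {
    public:
        template <typename... Args>
        T* acquire(Args&&... args) {
            void* slot;
            if (free_.empty()) {
                slot = ::operator new(sizeof(T));     // fresh storage
            } else {
                slot = free_.back();                  // recycled storage
                free_.pop_back();
            }
            return new (slot) T(std::forward<Args>(args)...);
        }

        void release(T* p) {
            p->~T();                                  // destroy, keep the storage
            free_.push_back(p);
        }

        ~ObjectPool() {
            for (void* slot : free_) ::operator delete(slot);
        }

    private:
        std::vector<void*> free_;
    };

A pool like this only helps for objects you allocate yourself; for allocations made inside STL containers you would go through a custom allocator, as in the articles linked above.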
You can't make malloc faster. You might be able to make new faster, but I bet you can find ways not to call them so much.
One way to find excess calls to anything is to peruse the code looking for them, but that's slow and error-prone, and they're not always visible.
A simple and foolproof way to find them is to pause the program a few times and look at the stack.
Notice you don't really need to measure anything. If something is taking a large fraction of the time, that fraction is roughly the probability you will see it on each pause, and the goal is simply to find it.
You do get a rough measurement, but that's only a by-product of finding the problem.
Here's an example where this was done in a series of stages, resulting in large speedup factors.
I have an application that, in order to reload plugins, requires them with the :reload option whenever they are to be reloaded. I've noticed that memory grows by about 2-3 MB each time I do this. I'm curious as to what could cause that to happen. Is data from previous reloads being kept in memory? Is there a way to totally reload a namespace?
EDIT: It's also relevant to mention that each of these plugins that gets reloaded makes new defmethods for a multimethod in another namespace (that never gets reloaded). Maybe the methods are being kept in memory when it's reloaded?
Clojure defers memory management to the JVM. While I don't know Clojure's codebase deeply, reloading probably just rebinds the vars to the reloaded code, which leaves the old objects around until the JVM runs the garbage collector.
You can hint to the JVM that you want the GC to run using (System/gc), but using it is generally not recommended.
Alternatively, if you know the constraints of your system, you can tinker with the JVM memory flags to encourage the GC to run more frequently (i.e., use a lower heap size).
But if you have a system that's not really memory constrained, saving a few MB doesn't matter much.
As it turns out, I didn't test it long enough. The memory will only grow to a certain level, and then it'll stop and eventually go back down quite a bit.
Boys and girls: test your code before whining about bugs.
I'm working on a project that is supposed to be used from the command line with the following syntax:
program-name input-file
The program is supposed to process the input, compute some stuff and spit out results on stdout.
My language of choice is C++ for several reasons I'm not willing to debate. The computation phase will be highly symbolic (think compiler) and will use pretty complex dynamically allocated data structures. In particular, it's not amenable to RAII style programming.
I'm wondering if it is acceptable to forget about freeing memory, given that I expect the entire computation to consume less than the available memory, and that the OS is free to reclaim all the memory in one step after the program finishes (assume the program terminates in seconds). What are your feelings about this?
As a backup plan, if ever my project will require to run as a server or interactively, I figured that I can always refit a garbage collector into the source code. Does anyone have experience using garbage collectors for C++? Do they work well?
It shouldn't cause any problems in the specific situation described the question.
However, it's not exactly normal. Static analysis tools will complain about it. Most importantly, it builds bad habits.
Sometimes not deallocating memory is the right thing to do.
I used to write compilers. After building the parse tree and traversing it to write the intermediate code, we would simply exit. Deallocating the tree would have:
added a bit of slowness to the compiler, which we of course wanted to be as fast as possible;
taken up code space;
taken time to code and test the deallocators;
violated the "no code executes better than 'no code'" dictum.
HTH! FWIW, this was "back in the day" when memory was non-virtual and minimal, the boxes were much slower, and the first two were non-trivial considerations.
My feeling would be something like "WTF!!!"
Look at it this way:
You choose a programming language that does not include a garbage collector, we are not allowed to ask why.
You are basically stating that you are too lazy to care about freeing the memory.
Well, WTF again. Laziness isn't a good reason for anything, least of all for playing around with memory without freeing it.
Just free the memory. Not freeing it is bad practice, the scenario may change, and then there can be a million reasons you need that memory freed, while the only reason for not doing it is laziness. Don't pick up bad habits; get used to doing things right, and that way you'll tend to do them right in the future!!
Not deallocating memory should not be a problem, but it is bad practice.
Joel Coehoorn is right:
It shouldn't cause any problems. However, it's not exactly normal. Static analysis tools will complain about it. Most importantly, it builds bad habits.
I'd also like to add that thinking about deallocation as you write the code is probably a lot easier than trying to retrofit it afterwards. So I would probably make it deallocate memory; you don't know how your program might be used in future.
If you want a really simple way to free memory, look at the "pools" concept that Apache uses.
Well, I think that it's not acceptable. You've already alluded to potential future problems yourself. Don't think they're necessarily easy to solve.
Things like "… given that I expect the entire computation to consume less …" are famous last words. Similarly, retrofitting code with some feature is one of those things everyone talks about and never does.
Not deallocating memory might sound good in the short run but can potentially create a huge load of problems in the long run. Personally, I just don't think that's worth it.
There are two strategies. Either you build in the GC design from the very beginning; it's more work, but it will pay off. For a lot of small objects it might pay to use a pool allocator and just keep track of the memory pool. That way, you can keep track of the memory consumption and simply avoid a lot of problems that similar code without an allocation pool would create.
Or you use smart pointers throughout the program from the beginning. I actually prefer this method even though it clutters the code. One solution is to rely heavily on templates, which takes out a lot of redundancy when referring to types.
Take a look at projects such as WebKit. Their computation phase resembles yours since they build parse trees for HTML. They use smart pointers throughout their program.
Finally: “It’s a question of style … Sloppy work tends to be habit-forming.”
– Silk in Castle of Wizardry by David Eddings.
will use pretty complex dynamically allocated data structures. In particular, it's not amenable to RAII style programming.
I'm almost sure that's an excuse for lazy programming. Why can't you use RAII? Is it because you don't want to keep track of your allocations, so there's no pointer to them that you keep? If so, how do you expect to use the allocated memory? There's always a pointer to it that contains some data.
Is it because you don't know when it should be released? Leave the memory in RAII objects, each one referenced by something, and they'll all trickle down and free each other when the containing object gets freed. This is particularly important if you want to run it as a server one day: each iteration of the server effectively runs a 'master' object that holds all the others, so you can just delete it and all the memory disappears. It also saves you from retrofitting a GC later.
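A sketch of that "master object owns everything" shape, with invented names; what matters is only that every allocation's ownership chains back to one root object:

    #include <memory>
    #include <string>
    #include <vector>

    struct Node {                                     // whatever the computation builds
        int value = 0;
        std::vector<std::unique_ptr<Node>> children;  // each node owns its children
    };

    struct Run {                                      // the 'master' object for one job
        std::unique_ptr<Node> tree;                   // root of everything allocated
        std::string output;
    };                                                // destroying a Run frees it all

    void process(Run& run, const std::string& input); // hypothetical entry point

    int main() {
        Run run;                                      // one per command-line invocation,
        // process(run, input_file);                  // or one per request in a server
        return 0;
    }                                                 // run goes out of scope: no leaks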
Is it because all your memory is allocated and kept in-use all the time, and only freed at the end? If so see above.
If you really, really cannot come up with a design that doesn't leak memory, at least have the decency to use a private heap. Destroy that heap before you quit and you'll have a better design already, if a little 'hacky'.
There are instances where memory leaks are ok - static variables, globally initialised data, things like that. These aren't generally large though.
Reference-counting smart pointers like shared_ptr in Boost and TR1 could also help you manage your memory in a simple manner.
The drawback is that you have to wrap every pointer to these objects.
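For example, with std::shared_ptr (the standardized descendant of the Boost/TR1 versions) and a made-up node type:

    #include <memory>
    #include <vector>

    struct Expr {
        int op = 0;
        std::vector<std::shared_ptr<Expr>> operands;  // shared ownership of children
    };

    int main() {
        auto root = std::make_shared<Expr>();
        root->operands.push_back(std::make_shared<Expr>());
        // ... build and walk the structure freely; no manual delete anywhere ...
        return 0;
    }   // last shared_ptr to each node goes away here, and the memory is released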
I've done this before, only to find that, much later, I needed the program to be able to process several inputs without separate commands, or that the guts of the program were so useful that they needed to be turned into a library routine that could be called many times from within another program that was not expected to terminate. It was much harder to go back later and re-engineer the program than it would have been to make it leak-free from the start.
So, while it's technically safe as you've described the requirements, I advise against the practice since it's likely that your requirements may someday change.
If the run time of your program is very short, it should not be a problem. However, being too lazy to free what you allocate and losing track of what you allocate are two entirely different things. If you have simply lost track, it's time to ask yourself whether you actually know what your code is doing to the computer.
If you are just in a hurry or lazy, and the lifetime of your program is small in relation to what it actually allocates (i.e. allocating 10 MB per second is not small if it runs for 30 seconds), then you should be OK.
The only 'noble' argument regarding freeing allocated memory comes up when a program exits: should one free everything to keep valgrind from complaining about leaks, or just let the OS do it? That entirely depends on the OS, and on whether your code might become a library rather than a short-running executable.
Leaks during run time are generally bad, unless you know your program will run for a short amount of time and won't cause other programs, far more important than yours as far as the OS is concerned, to skid into dirty paging.
What are your feelings about this?
Some O/Ses might not reclaim the memory, but I guess you're not intending to run on those O/Ses.
As a backup plan, if ever my project will require to run as a server or interactively, I figured that I can always refit a garbage collector into the source code.
Instead, I figure you can spawn a child process to do the dirty work, grab the output from the child process, let the child process die as soon as possible after that and then expect the O/S to do the garbage collection.
I have not personally used this, but since you are starting from scratch you may wish to consider the Boehm-Demers-Weiser conservative garbage collector.
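For what it's worth, a minimal sketch of what using it might look like, assuming the collector is installed, the header is available as gc.h (sometimes gc/gc.h), and you link with -lgc:

    #include <gc.h>      // Boehm-Demers-Weiser collector
    #include <cstdio>

    int main() {
        GC_INIT();                                             // initialize the collector
        for (int i = 0; i < 1000000; ++i) {
            int* p = static_cast<int*>(GC_MALLOC(100 * sizeof(int)));
            p[0] = i;                                          // use it and just drop it;
        }                                                      // unreachable blocks get collected
        std::printf("done\n");
        return 0;
    }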
The answer really depends on how large your program will be and what performance characteristics it needs to exhibit. If you never deallocate memory, your process's memory footprint will be much larger than it would otherwise be. Depending on the system, this could cause a lot of paging and slow down performance for you or for other applications on the system.
Beyond that, what everyone above says is correct. It probably won't cause harm in the short term, but it's a bad practice that you should avoid. You'll never be able to reuse the code. Trying to retrofit a GC afterwards will be a nightmare; just think about going to each place you allocate memory and trying to retrofit it without breaking anything.
One more reason to avoid doing this: reputation. If you fail to deallocate, everyone who maintains the code will curse your name and your rep in the company will take a hit. "Can you believe how dumb he was? Look at this code."
If it is non-trivial for you to determine where to deallocate the memory, I would be concerned that other aspects of the data structure manipulation may not be fully understood either.
Apart from the fact that the OS (kernel and/or C/C++ library) can choose not to free the memory when the execution ends, your application should always provide proper freeing of allocated memory as a good practice. Why? Suppose you decide to extend that application or reuse the code; you'll quickly get in trouble if the code you had previously written hogs up the memory unnecessarily, after finishing its job. It's a recipe for memory leaks.
In general, I agree it's a bad practice.
For a one-shot program it can be OK, but it kind of looks like you don't know what you are doing.
There is one solution to your problem, though: use a custom allocator that preallocates larger blocks from malloc, and then, after the computation phase, instead of freeing all the little blocks from your custom allocator, just release the larger preallocated blocks of memory. Then you don't need to keep track of all the objects you need to deallocate and when. One guy who also wrote a compiler explained this approach to me many years ago, so if it worked for him, it will probably work for you as well.
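A rough sketch of that approach, with invented names and simplified alignment handling: allocation is just a pointer bump inside big malloc'd blocks, and the only real cleanup is releasing those blocks at the end.

    #include <algorithm>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    class Arena {
    public:
        explicit Arena(std::size_t block_size = 1 << 20) : block_size_(block_size) {}

        void* allocate(std::size_t n) {
            // round up so every returned pointer is suitably aligned
            const std::size_t a = alignof(std::max_align_t);
            n = (n + a - 1) & ~(a - 1);
            if (blocks_.empty() || used_ + n > block_size_) {
                blocks_.push_back(static_cast<char*>(std::malloc(std::max(n, block_size_))));
                used_ = 0;
            }
            void* p = blocks_.back() + used_;
            used_ += n;
            return p;
        }

        ~Arena() {                                   // instead of freeing every little object,
            for (char* b : blocks_) std::free(b);    // release the big blocks wholesale
        }

    private:
        std::size_t block_size_;
        std::size_t used_ = 0;
        std::vector<char*> blocks_;
    };

    // Usage sketch:  Arena pass;
    //                auto* n = new (pass.allocate(sizeof(Node))) Node{...};
    //                ...compute...   // nothing freed per object; ~Arena() cleans up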
Try to use automatic variables in methods so that they will be freed automatically from the stack.
The only useful reason not to free heap memory is to save the tiny amount of computational power spent in the free() function. You might lose that advantage if page faults become an issue due to large virtual memory needs with small physical memory resources. Some factors to consider are:
whether you are allocating a few huge chunks of memory or many small chunks;
whether the memory will need to be locked into physical memory;
whether you are absolutely positive the code and the memory needed will fit into 2 GB on a Win32 system, including memory holes and padding.
That's generally a bad idea. You might encounter cases where the program tries to consume more memory than is available. Plus, you risk being unable to start several copies of the program.
You can still do this if you don't care about the mentioned issues.
When you exit a program, the memory it allocated is automatically returned to the system, so you can get away with not deallocating the memory you allocated.
But deallocation becomes necessary for bigger programs, such as an OS or embedded systems, where the program is meant to run forever, and hence even a small memory leak can be disastrous.
Hence it is always recommended to deallocate the memory you have allocated.