Why do we have the Clojure memoize function?

I am new to Clojure and I have just learned about and experimented with the memoize function.
It seems to me that the existence of this function is strange.
Firstly, functions with side effects end with !.
Secondly, using memoize is very simple.
Why doesn't Clojure just do this for me? There is a balance between memory use and performance, but you could easily have the Clojure runtime allocate a chunk of RAM to function results: if a function is called with the same arguments multiple times, use the cached result; if memory runs out, clear out the cache, and keep track of cache hits so that frequently recalled functions are less likely to be removed from the cache.
If I designed this I would even set a minimum performance level for functions, so that if a function call is quicker than cache retrieval it is not cached. (Or make this a property of how all function calls work.)
Can anyone explain why Clojure doesn't do this?
Thanks

There is an old joke in computer science that there are only two hard problems:
cache invalidation
naming things
The built-in memoize function is a great start at what it does, and it's useful and sufficient in a few cases. It does, however, fit that joke above nicely. It's a somewhat awkward name (opinion here) and it fails very badly at cache invalidation. It assumes that if a function has ever been called, the result will always be relevant; that the function is pure; that it never encounters errors; and that all calls are equally relevant for all time. The real world is full of nuances that turn out to be really important for caching:
1. many return values are not useful forever
2. many functions are unbounded in scope (math, for instance)
3. many functions are faster than the cache (math again)
4. some functions have equivalent arguments that should share a cache entry, yet memoize treats them as distinct
5. memoize doesn't actually guarantee to run a function only once
6. many functions are never going to be pure (like accepting a network connection)
7. some arguments don't affect the return value
All these things come into play when designing caching for web apps, for instance. Number 5 is an interesting case: consider what happens if you memoize a very slow function, and there are two calls to it one second apart. Which return value becomes the memoized result? In what cases is this important? These details can really matter, especially if one of them encounters an unusual circumstance.
In the last eight years doing Clojure professionally I have seen memoize used in production many times, and it's always been replaced in short order by a call to one of the functions in clojure.core.cache once the inevitable problems arise.
If you find yourself wanting memoize, there's a good chance you will be happier with core.cache. It offers many more nuanced options to fit more of the real-world cases.

Related

When does Clojure's memoize clear its cache?

Say I'm debugging code where one or more of the functions involved is defined with the help of memoize. I'll edit some code, reload the file in the REPL, and try out the new code. But if the bug is still there I always question whether it's because I haven't fixed the bug or because memoize has cached buggy results.
So, is there some way short of restarting the REPL that I can use to make absolutely sure that memoize has lost its memory?
(Note that eliminating calls to memoize during REPL sessions is both tedious and sometimes even impractical, because the performance of the function might rely heavily on memoization.)
memoize never, under any circumstances, empties its cache. Its storage is permanent. If you have a new function you wish to use, you must replace your memoized function by re-memoizing the underlying function, and use only the new version of the function, not the old one. This way, your function calls will be passed through to the new underlying function, and the memory used to cache results of the old function will become eligible for garbage collection because nothing points at it.
You may say, "Well gosh, this is a pain, why is memoize so inflexible!" The answer is that memoize is a very blunt instrument, not well suited to almost any production usage. Anytime you memoize a function whose set of possible inputs is not limited, you introduce a memory leak. If your function depends on a cache for its performance, you should think about a more flexible caching policy than "cache everything forever", and use a library designed for such use cases.
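The mechanics of a "more flexible caching policy" are language-neutral, so here is a minimal sketch in C++ of a memoizer with a bounded LRU cache, in contrast to the cache-everything-forever behavior described above. This is purely illustrative: the LruMemo name and design are invented for this answer, and in Clojure you would reach for clojure.core.cache rather than writing it yourself.

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <list>
    #include <unordered_map>
    #include <utility>

    // Illustrative only: a memoizer with a bounded LRU cache, in contrast to
    // "cache everything forever". LruMemo is an invented name, not a real API.
    template <typename Arg, typename Result>
    class LruMemo {
    public:
        LruMemo(std::function<Result(Arg)> f, std::size_t capacity)
            : f_(std::move(f)), capacity_(capacity) {}

        Result operator()(const Arg& a) {
            auto hit = index_.find(a);
            if (hit != index_.end()) {
                // Cache hit: move the entry to the front (most recently used).
                entries_.splice(entries_.begin(), entries_, hit->second);
                return hit->second->second;
            }
            Result r = f_(a);                      // miss: actually compute
            entries_.emplace_front(a, r);
            index_[a] = entries_.begin();
            if (index_.size() > capacity_) {       // evict least recently used
                index_.erase(entries_.back().first);
                entries_.pop_back();
            }
            return r;
        }

    private:
        using Entry = std::pair<Arg, Result>;
        std::function<Result(Arg)> f_;
        std::size_t capacity_;
        std::list<Entry> entries_;                              // MRU at front
        std::unordered_map<Arg, typename std::list<Entry>::iterator> index_;
    };

    int main() {
        LruMemo<int, long> square([](int x) {
            std::cout << "computing " << x << "\n";
            return static_cast<long>(x) * x;
        }, 2);                 // keep at most 2 results, unlike memoize
        square(1); square(2);  // both computed
        square(1);             // hit: no recomputation, 1 becomes most recent
        square(3);             // computed; evicts 2, the least recently used
        square(2);             // computed again: it had been evicted
    }

Clojure's memoize corresponds to making capacity_ unbounded, which is exactly the memory leak described above.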

High-performance code in C++ (inheritance, pointers to functions, if)

Suppose you have a very large graph with lots of processing upon its nodes (like tens of millions of operations per node). The core routine is the same for each node, but there are some additional operations based on internal conditions. There can be two such conditions, which produce four cases: (0,0), (1,0), (0,1), (1,1). E.g. (1,1) means that both conditions hold. The conditions are established once (one set for each node independently) in a program and, from then on, never change. Unfortunately, they are determined at runtime and in a fully unpredictable way (based on data received via HTTP from an external server).
What is the fastest in such a scenario (taking into account modern compiler optimizations, about which I have no idea)?
simply using "IFs": if (condition X) perform additional operation X.
using inheritance to derive four classes from base class (exposing method OPERATION) to have a proper operation and save tens milions of "ifs". [but I am not sure if this is a real saving, inheritance must have its overhead too)
use pointer to function to assign the function based on conditions once.
It would take me long to get to a point where I can test this myself (I don't have such big data yet, and this will be integrated into a bigger project, so it would not be easy to test all versions).
Reading the answers: I know that I probably have to experiment with it. But apart from everything, this is sort of a question of what is faster:
tens of millions of IF statements and normal, statically known function calls, vs. function pointer calls, vs. inheritance (which I think is not the best idea in this case, and I am thinking of eliminating it from further inspection). Thanks for any constructive answers (not saying that I shouldn't care about such minor things ;)
There is no real answer except to measure the actual code on the real data. At times, in the past, I've had to deal with such problems, and in the cases I've actually measured, virtual functions were faster than ifs. But that doesn't mean much, since the cases I measured were in a different program (and thus a different context) than yours. For example, a virtual function call will generally prevent inlining, whereas an if is inline by nature, and inlining may open up additional optimization possibilities.
Also, the machines I measured on handled virtual functions pretty well; I've heard that some other machines (HP's PA, for example) are very ineffective in their implementation of indirect jumps (including not only virtual function calls, but also the return from a function; again, the lost opportunity to inline costs).
If you absolutely have to have the fastest way, and the process order of the nodes is not relevant, make four different types, one for each case, and define a process function for each. Then in a container class, have four vectors, one for each node type. Upon creation of a new node get all the data you need to create the node, including the conditions, and create a node of the correct type and push it into the correct vector. Then, when you need to process all your nodes, process them type by type, first processing all nodes in the first vector, then the second, etc.
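As a hedged sketch of that layout (the type names, the int payload, and the empty process bodies are invented placeholders; real nodes would carry the actual data and per-case operations):

    #include <vector>

    // One concrete type per (condition1, condition2) case: no virtuals needed,
    // because the case is fixed at node creation time and never changes.
    struct Node00 { int data; void process() { /* core routine only  */ } };
    struct Node10 { int data; void process() { /* core + operation 1 */ } };
    struct Node01 { int data; void process() { /* core + operation 2 */ } };
    struct Node11 { int data; void process() { /* core + both extras */ } };

    class Graph {
    public:
        // The conditions arrive (via HTTP) before the node is built,
        // so we can pick the right vector exactly once, here.
        void addNode(int data, bool c1, bool c2) {
            if (c1 && c2)  n11_.push_back({data});
            else if (c1)   n10_.push_back({data});
            else if (c2)   n01_.push_back({data});
            else           n00_.push_back({data});
        }

        // Process type by type: no per-node branch, no indirection, and the
        // nodes are stored by value, so memory is walked linearly.
        void processAll() {
            for (auto& n : n00_) n.process();
            for (auto& n : n10_) n.process();
            for (auto& n : n01_) n.process();
            for (auto& n : n11_) n.process();
        }

    private:
        std::vector<Node00> n00_;   // vectors of objects, not pointers
        std::vector<Node10> n10_;
        std::vector<Node01> n01_;
        std::vector<Node11> n11_;
    };

    int main() {
        Graph g;
        g.addNode(7, true, false);  // conditions fixed once, at creation
        g.addNode(9, true, true);
        g.processAll();
    }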
Why would you want to do this:
No ifs for the state switching
No vtables
No function indirection
But much more importantly:
No instruction cache thrashing (you're not jumping to a different part of your code for every next node)
No branch prediction misses for state switching ifs (since there are none)
Even if you'd have inheritance with virtual functions and thus function indirection through vtables, simply sorting the nodes by their type in your vector may already make a world of difference in performance as any possible instruction cache thrashing would essentially be gone and depending on the methods of branch prediction the branch prediction misses could also be reduced.
Also, don't make a vector of pointers, but make a vector of objects. If they are pointers you have an extra addressing indirection, which in itself is not that worrisome, but the problem is that it may lead to data cache thrashing if the objects are pretty much randomly spread throughout your memory. If, on the other hand, your objects are put directly into the vector, the processing will basically go through memory linearly, the cache will hit nearly every time, and cache prefetching might actually be able to do a good job.
Note, though, that you would pay heavily in data structure creation if you don't do it correctly: if at all possible, when making the vector, immediately reserve enough capacity in it for all your nodes, because reallocating and moving every time your vector runs out of space can become expensive.
Oh, and yes, as James mentioned, always, always measure! What you think may be the fastest way may not be, sometimes things are very counter intuitive, depending on all kinds of factors like optimizations, pipelining, branch prediction, cache hits/misses, data structure layout, etc. What I wrote above is a pretty general approach, but it is not guaranteed to be the fastest and there are definitely ways to do it wrong. Measure, Measure, Measure.
P.S. Inheritance with virtual functions is roughly equivalent to using function pointers. Virtual functions are usually implemented by a vtable at the head of the class, which is basically just a table of function pointers to the implementations of the given virtuals for the actual type of the object. Whether ifs are faster than virtuals or the other way around is a very, very difficult question to answer, and depends completely on the implementation, compiler and platform used.
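To make that P.S. concrete, here is a hedged miniature of options 1 and 3 from the question side by side; the operations and names are invented for illustration:

    #include <cstdio>

    void coreOnly(int& n) { n += 1; }           // case (0,0): core routine
    void coreAndX(int& n) { n += 1; n *= 2; }   // case (1,0): core + extra op X

    struct Node {
        int value = 0;
        bool condX = false;
        void (*op)(int&) = nullptr;             // option 3: bound once per node
    };

    // Option 1: branch on the stored condition at every visit.
    void processIf(Node& n) {
        n.value += 1;                           // core routine
        if (n.condX) n.value *= 2;              // conditional extra op
    }

    // Option 3: the right function was chosen at creation; call it indirectly.
    void processPtr(Node& n) { n.op(n.value); }

    int main() {
        Node a;
        a.condX = true;
        a.op = coreAndX;                        // decided once, when data arrives
        processIf(a);                           // value: 0 -> 2
        processPtr(a);                          // value: 2 -> 6
        Node b;                                 // a (0,0) node keeps coreOnly
        b.op = coreOnly;
        processPtr(b);                          // value: 0 -> 1
        std::printf("%d %d\n", a.value, b.value);
    }

Option 2 (virtual functions) compiles down to much the same thing as processPtr: a vtable is essentially a per-type table of function pointers.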
I'm actually quite impressed with how effective branch prediction can be, and only the if solution allows inlining, which can also be dramatic. Virtual functions and pointers to functions also involve loading from memory and could possibly cause cache misses.
But you have four conditions, so branch misses can be expensive.
Without the ability to test and verify, the question really can't be answered. Especially since it's not even clear that this would be a performance bottleneck sufficient to warrant optimization efforts.
In cases like this, I would err on the side of readability and ease of debugging, and go with if.
Many programmers have taken classes and read books that go on about certain favorite subjects: pipelining, caching, branch prediction, virtual functions, compiler optimizations, big-O algorithms, etc., and the performance of those.
If I could make an analogy to boating, these are things like trimming weight, tuning power, adjusting balance and streamlining, assuming you are starting from some speedboat that's already close to optimal.
Never mind you may actually be starting from the Queen Mary, and you're assuming it's a speedboat.
It may very well be that there are ways to speed up the code by large factors, just by cutting away fat (masquerading as good design), if only you knew where it was.
Well, you don't know where it is, and guessing where is a waste of time, unless you value being wrong.
When people say "measure and use a profiler" they are pointing in the right direction, but not far enough.
Here's an example of how I do it, and I made a crude video of it, FWIW.
Unless there's a clear pattern to these attributes, no branch predictor exists that can effectively predict this data-dependent condition for you. Under such circumstances, you may be better off avoiding control speculation (and paying the penalty of a branch misprediction), and just waiting for the actual data to arrive and resolve the control flow (more likely to occur using virtual functions). You'll have to benchmark, of course, to verify that, as it depends on the actual pattern (e.g., whether you have even small groups of similarly "tagged" elements).
The sorting suggested above is nice and all, but note that it converts a problem that's just plain O(n) into an O(n log n) one, so for large sizes you'll lose unless you can sort once and traverse many times, or otherwise cheaply maintain the sort order.
Note that some predictors may also attempt to predict the address of the function call, so you might be facing the same problem there.
However, I must agree with the comments regarding early optimization: do you know for sure that the control flow is your bottleneck? What if fetching the actual data from memory takes longer? In general, it would seem that your elements can be processed in parallel, so even if you run this on a single thread (and much more so if you use multiple cores), you should be bandwidth-bound and not latency-bound.

Function size vs execution speed

I remember hearing somewhere that "large functions might have higher execution times" because of code size and CPU cache, or something like that.
How can I tell if function size is imposing a performance hit for my application? How can I optimize against this? I have a CPU-intensive computation that I have split into as many threads as there are CPU cores. The main thread waits until all of the worker threads are finished before continuing.
I happen to be using C++ on Visual Studio 2010, but I'm not sure that's really important.
Edit:
I'm running a ray tracer that shoots about 5,000 rays per pixel. I create (cores-1) threads (1 per extra core), split the screen into rows, and give each row to a CPU thread. I run the trace function on each thread about 5,000 times per pixel.
I'm actually looking for ways to speed this up. It is possible for me to reduce the size of the main tracing function by refactoring, and I want to know if I should expect to see a performance gain.
A lot of people seem to be answering the wrong question here. I'm looking for an answer to this specific question: even if you think I can probably do better by optimizing the contents of the function, I want to know whether there is a function size/performance relationship.
It's not really the size of the function, it's the total size of the code that gets cached when it runs. You aren't going to speed things up by splitting code into a greater number of smaller functions, unless some of those functions aren't called at all in your critical code path, and hence don't need to occupy any cache. Besides, any attempt you make to split code into multiple functions might get reversed by the compiler, if it decides to inline them.
So it's not really possible to say whether your current code is "imposing a performance hit". A hit compared with which of the many, many ways that you could have structured your code differently? And you can't reasonably expect changes of that kind to make any particular difference to performance.
I suppose that what you're looking for is instructions that are rarely executed (your profiler will tell you which they are), but are located in the close vicinity of instructions that are executed a lot (and hence will need to be in cache a lot, and will pull in the cache line around them). If you can cluster the commonly-executed code together, you'll get more out of your instruction cache.
Practically speaking though, this is not a very fruitful line of optimization. It's unlikely you'll make much difference. If nothing else, your commonly-executed code is probably quite small and adjacent already, it'll be some small number of tight loops somewhere (your profiler will tell you where). And cache lines at the lowest levels are typically small (of the order of 32 or 64 bytes), so you'd need some very fine re-arrangement of code. C++ puts a lot between you and the object code, that obstructs careful placement of instructions in memory.
Tools like perf can give you information on cache misses - most of those won't be for executable code, but on most systems it really doesn't matter which cache misses you're avoiding: if you can avoid some then you'll speed your code up. Perhaps not by a lot, unless it's a lot of misses, but some.
Anyway, in what context did you hear this? The most common one I've heard it come up in is the idea that function inlining is sometimes counter-productive, because sometimes the overhead of the code bloat is greater than the function call overhead avoided. I'm not sure, but profile-guided optimization might help with that, if your compiler supports it. A fairly plausible profile-guided optimization is to preferentially inline at call sites that are executed a larger number of times, leaving colder code smaller, with less overhead to load and fix up in the first place, and (hopefully) less disruptive to the instruction cache when it is pulled in. Somebody with far more knowledge of compilers than me will have thought hard about whether that's a good profile-guided optimization, and therefore decided whether or not to implement it.
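As a hedged illustration of the "keep rarely-executed code away from the hot code" idea: the attributes below are GCC/Clang extensions (MSVC spells the no-inline part __declspec(noinline)), and the ray-tracing stand-ins are invented:

    #include <cstdio>

    // Rarely-executed code: keep it out of line and in a "cold" section so it
    // doesn't share cache lines with the hot loop (GCC/Clang-specific spelling).
    __attribute__((cold, noinline))
    void reportBadSample(int i) {
        std::fprintf(stderr, "bad sample at %d\n", i);
    }

    double shade(int i) { return i * 0.5; }     // stand-in for the real hot work

    double tracePixel(int samples) {
        double acc = 0.0;
        for (int i = 0; i < samples; ++i) {
            if (i == -1) {                      // placeholder error condition
                reportBadSample(i);             // cold path, never taken here
                continue;
            }
            acc += shade(i);                    // the hot path stays compact
        }
        return acc;
    }

    int main() {
        std::printf("%f\n", tracePixel(5000));
    }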
Unless you're going to hand-tune to the assembly level, to include locking specific lines of code in cache, you're not going to see a significant execution difference between one large function and multiple small functions. In both cases, you still have the same amount of work to perform and that's going to be your bottleneck.
Breaking things up into multiple smaller functions will, however, be easier to maintain and easier to read -- especially 6 months later when you've forgotten what you did in the first place.
Function size is unlikely to be a bottleneck in your application. What you do in the function is much more important than its physical size. There are some things your compiler can do with a small function that it cannot do with large functions (namely inlining), but usually this isn't a huge difference anyway.
You can profile the code to see where the real bottleneck is. I suspect calling a large function is not the problem.
You should, however, break up the function into smaller functions for code readability reasons.
It's not really about function size, but about what you do in it. Depending on what you do, there is possibly some way to optimize it.

Using virtual functions instead of IF statements is faster?

I remember reading online somewhere that in EXTREMELY low-latency situations it's better to use virtual functions as a substitute for IF statements.
Is this true? Are they basically saying dynamic polymorphism is better for speed situations?
Do any users have any other C++ low latency "quirks" they could share?
I very much doubt that a single if/else statement would be slower than using a virtual function: the virtual function typically enforces a pipeline stall and limits the optimization opportunities. An if statement may stall the pipeline, but if it is often executed the prediction may go the right way. However, if your alternative is between a cascade of a few if/else statements and just one virtual function call, then the latter may be faster. Also, if the total code executed via virtual functions ends up substantially smaller than with branches, it may cause fewer cache misses in the instruction cache. That is, it depends on the situation. The best way is to measure. Note that measuring artificial code, which just attempts to investigate the difference between two approaches but doesn't really do any processing, yields misleading results. However, when you need to produce very low-latency code you typically can spend more time coming up with it, i.e. experimenting with multiple different approaches may be viable.
Although my colleagues tend to frown upon my template approaches for avoiding run-time branching, the code I end up with is often very slow to compile but very fast to run. Of course, this depends on the functions or branches being known at compile time. In the areas where I have used this, e.g. for message processing, it is often sufficient to have one dynamic decision, e.g. one for each message (i.e., one virtual function call), followed by processing which doesn't involve any dynamic types (there are still conditionals, e.g. for the number of values in a table).
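A hedged sketch of that shape, with invented message types: one virtual call at the boundary resolves the dynamic type, and everything past that point is template code with no run-time type decisions (std::make_unique requires C++14):

    #include <cstdio>
    #include <memory>
    #include <vector>

    // The inner processing is a template: the concrete type is known here, so
    // the compiler can inline and there is no further branching on type.
    template <typename Msg>
    void processBody(const Msg& m) {
        std::printf("payload %d\n", m.payload);
    }

    struct Message {                            // the single dynamic decision
        virtual ~Message() = default;
        virtual void process() const = 0;
    };

    struct Quote : Message {
        int payload = 1;
        void process() const override { processBody(*this); }
    };

    struct Trade : Message {
        int payload = 2;
        void process() const override { processBody(*this); }
    };

    int main() {
        std::vector<std::unique_ptr<Message>> inbox;
        inbox.push_back(std::make_unique<Quote>());
        inbox.push_back(std::make_unique<Trade>());
        for (const auto& m : inbox)
            m->process();                       // one virtual call per message
    }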

How many registers in custom VM?

I'm designing a custom VM and am curious about how many registers I should use. Initially I had 255, but I'm a little concerned about backing up 255 pointers (a whole KB) onto the stack or heap every time I call a function, when most of them won't even be used. How many registers should I use?
You might want to look into register windows, which are a way of reducing the number of "active" registers available at any one time, while still keeping a large number of registers in core.
Having said that, you may find that using a stack-based architecture is more convenient. Some major virtual machines intended to be implemented in software (JVM, CLR, Python, etc) use a stack architecture. It's certainly easier to write a compiler for a stack rather than an artificially restricted set of registers.
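For flavor, a hedged miniature of a stack-based dispatch loop; the opcodes are invented and all error checking is omitted:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    enum Op : std::uint8_t { PUSH, ADD, MUL, PRINT, HALT };

    // No register file at all: every instruction reads from and writes to
    // the top of a single operand stack.
    void run(const std::vector<std::uint8_t>& code) {
        std::vector<std::int64_t> stack;
        for (std::size_t pc = 0; pc < code.size(); ++pc) {
            switch (code[pc]) {
            case PUSH: stack.push_back(code[++pc]); break;  // next byte = value
            case ADD: { auto b = stack.back(); stack.pop_back();
                        stack.back() += b; break; }
            case MUL: { auto b = stack.back(); stack.pop_back();
                        stack.back() *= b; break; }
            case PRINT: std::printf("%lld\n", (long long)stack.back()); break;
            case HALT: return;
            }
        }
    }

    int main() {
        // (2 + 3) * 4, without naming a single register.
        run({PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT});
    }

Note how the operand stack replaces the register file entirely; the usual trade-off is executing more, simpler instructions for the same expression.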
This generally depends on how many you think you'll need. I question 255 registers' usefulness in practical applications.
The last register machine I built was aimed at supporting a small programming language, and when mapping things out, I looked at the types of applications, the design methodologies I wanted to guide people to use, balancing all that with performance concerns when designing the register file.
It's not something that can easily be answered without more details, but if you stop and think about what it is you're trying to do, and balance it all out with whatever aspects you find important, you'll come to a conclusion you can live with, and that probably makes sense.
Whatever number of registers you choose, you are probably going to have way too many for most subroutines and way too few for a few subroutines. (This is just a guess. However, considering how many things in programming follow a Power Law Distribution – incoming references to objects, modules, classes, outgoing references from objects, modules, classes, cyclomatic complexity of subroutines, NPath complexity of subroutines, SLOC length of subroutines, lifetime of objects, size of objects – it is only reasonable to assume that the same is true for the number of registers for a subroutine, especially if you consider that there is probably a correlation between complexity/length and number of registers.)
The Parrot VM has found quite a simple way out of this conundrum: they have an infinite number of registers. Obviously, those registers aren't stored in an infinite array, rather they lazily materialize just enough registers for any single subroutine. That way, they never run out of registers, and they never waste any space.
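That "exactly as many registers as the subroutine needs" idea can be as simple as sizing each call frame from per-function metadata emitted by your compiler; a hedged sketch with invented names:

    #include <cstdint>
    #include <vector>

    struct Function {
        std::uint32_t numRegisters;  // computed by the compiler per subroutine
        // ... bytecode, constants, etc.
    };

    struct Frame {
        std::vector<std::int64_t> regs;
        // Each call materializes exactly the registers this subroutine needs.
        explicit Frame(const Function& f) : regs(f.numRegisters, 0) {}
    };

    int main() {
        Function tiny{2};      // a leaf routine that only ever touches r0, r1
        Function big{40};      // a monster subroutine gets 40, not a fixed 255
        Frame a(tiny), b(big);
        a.regs[1] = 42;
        b.regs[39] = a.regs[1];
        return static_cast<int>(b.regs[39] - 42);  // 0 on success
    }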
Sorry guys. I made a stupid on this one. Turns out that I already had a vector of registers to optimize access to the stack, which I totally forgot about. Instead of duping them, I just set the registers in the state to be a reference to the stack's registers. Now all I need to do is specialize pushing to push straight to a register, and problem solved in a nice efficient fashion. These registers will also never need backing, since there's nothing function-dependent about them, and they'll grow in perfect accordance with my stack. It had just never occurred to me that I could push values into them without pushing an equivalent value into the stack.
The absolutely hideous template mess this is turning into for simple design concepts though is making me extremely unhappy. Want to buy: static if and variadic templates.