What is a 'thunk'? - c++

I've seen it used in programming (specifically in the C++ domain) and have no idea what it is. Presumably it is a design pattern, but I could be wrong. Can anyone give a good example of a thunk?

A thunk usually refers to a small piece of code that is called as a function, does some small thing, and then JUMPs to another location (usually a function) instead of returning to its caller. Assuming the JUMP target is a normal function, when it returns, it will return to the thunk's caller.
Thunks can be used to implement lots of useful things efficiently:
protocol translation -- when calling from code that uses one calling convention to code that uses a different calling convention, a thunk can be used to translate the arguments appropriately. This only works if the return conventions are compatible, but that is often the case.
virtual function handling -- when calling a virtual function of a multiply-inherited base class in C++, there needs to be a fix-up of the this pointer to get it to point to the right place. A thunk can do this (see the sketch after this list).
dynamic closures -- when you build a dynamic closure, the closure function needs to be able to get at the context where it was created. A small thunk can be built (usually on the stack) which sets up the context info in some register(s) and then jumps to a static piece of code that implements the closure's function. The thunk here is effectively supplying one or more hidden extra arguments to the function that are not provided by the call site.
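For example, here is a minimal C++ sketch of the multiple-inheritance case (my own illustration, not from the answer): calling D::g through a B* requires the implementation to adjust the this pointer, and typical compilers emit a small thunk to do exactly that.

    #include <iostream>

    struct A { virtual void f() { std::cout << "A::f\n"; } virtual ~A() = default; };
    struct B { virtual void g() { std::cout << "B::g\n"; } virtual ~B() = default; };

    // D contains an A subobject and a B subobject at different offsets.
    struct D : A, B {
        void f() override { std::cout << "D::f\n"; }
        void g() override { std::cout << "D::g\n"; }
    };

    int main() {
        D d;
        B* pb = &d;   // pb points at the B subobject inside d, not at the start of d
        pb->g();      // on typical implementations this dispatches through a
                      // compiler-generated thunk that adjusts `this` back to the
                      // start of the D object before jumping to D::g
    }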

The word thunk has at least three related meanings in computer science. A "thunk" may be:
a piece of code to perform a delayed computation (similar to a closure)
a feature of some virtual function table implementations (similar to a wrapper function)
a mapping of machine data from one system-specific form to another, usually for compatibility reasons
I have usually seen it used in the third context.
http://en.wikipedia.org/wiki/Thunk

The term thunk originally referred to the mechanism used by the Royal Radar Establishment implementation of pass-by-name in their Algol60 compiler. In general it refers to any way to induce dynamic behavior when referencing an apparently static object. The term was invented by Brian Wichmann, who when asked to explain pass-by-name said "Well you go out to load the value from memory and then suddenly - thunk - there you are evaluating an expression."
Thunks have been put in hardware (cf. KDF9, Burroughs mainframes). There are several ways to implement them in software, all very machine, language and compiler specific.
The term has come to be generalized beyond pass-by-name, to include any situation in which an apparently or nominally static data reference induces dynamic behavior. Related terms include "trampoline" and "future".

Some compilers for object-oriented languages such as C++ generate functions called "thunks" as an optimization of virtual function calls in the presence of multiple or virtual inheritance.
Taken from: http://en.wikipedia.org/wiki/Thunk#Thunks_in_object-oriented_programming

This question has already been asked on SO, see:
What is a 'thunk', as used in Scheme or in general?
From what I can tell, it's akin to a lambda expression whose value you don't want to compute until you actually need it. It can also be compared to a property getter: by design it executes some code in order to return a value, while presenting an interface that looks more like a variable. It can also have polymorphic behavior that is swapped out, whether through inheritance or by replacing the function pointer that evaluates and returns the value at runtime, based on compile-time or environmental characteristics.
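A minimal C++ sketch of that delayed-evaluation reading (the names are mine and purely illustrative):

    #include <functional>
    #include <iostream>

    // Hypothetical stand-in for some costly or effectful computation.
    int expensive_computation() {
        std::cout << "evaluating...\n";
        return 42;
    }

    int main() {
        // Wrapping the call in a zero-argument lambda packages the computation
        // without running it: a thunk in the delayed-evaluation sense.
        std::function<int()> thunk = [] { return expensive_computation(); };

        std::cout << "thunk created, nothing evaluated yet\n";
        std::cout << thunk() << '\n';   // evaluation happens only when the thunk is forced
    }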

There's considerable variation in use. Almost universally, a thunk is a function that's (at least conceptually) unusually small and simple. It's usually some sort of adapter that gives you the correct interface to something or other (some data, another function, etc.) while doing little else.
It's almost like a form of syntactic sugar, except that syntactic sugar (at least as usually used) is supposed to make things look the way the human reader wants to see them, whereas a thunk makes something look the way the compiler wants to see it.

I was distressed to find no general computer-science definition of this term matching its de-facto usage as I knew it historically. The first real-life encounter I can recall where it was actually called that was in the OS/2 days and the 16-to-32-bit transition. Today "thunking" seems to be applied about as loosely as "irony".
My rough general understanding is that a thunk is a stub routine that does little or nothing itself, or that routes a call across some fundamental boundary between systems, as in the historical cases mentioned.
So the sense is onomatopoeic: being dropped from the one environment into the other makes (metaphorically speaking) a "thunk" sound.

I'm going to look this up, but I thought thunking was the process employed by a 32-bit processor to run legacy 16-bit code.
I used to use it as an analogy for how you have to restrict how fast you talk and what words you use when talking to dumb people.
Yeah, it's in the Wikipedia link (the part about 32-bit, not my nerdalogy).
https://en.wikipedia.org/wiki/Thunk
Much of the literature on interoperability thunks relates to various Wintel platforms, including MS-DOS, OS/2, Windows and .NET, and to the transition from 16-bit to 32-bit memory addressing. As customers have migrated from one platform to another, thunks have been essential to support legacy software written for the older platforms.
(emphasis added by me)

The earliest use of "thunk" I know of is from the late '50s, in reference to Algol60 pass-by-name argument evaluation in function calls. Algol was originally a specification language, not a programming language, and there was some question about how pass-by-name could be implemented on a computer.
The solution was to pass the entry point of what was essentially a lambda. When the callee evaluated the parameter, control fell through - thunk! - into the caller's context, where the lambda was evaluated and its result became the value of the parameter in the callee.
On tagged hardware, such as the Burroughs machines, the evaluation was implicit: an argument could be passed as a data value, as in ordinary pass-by-value, or as a thunk for pass-by-name, with different tags in the argument metadata. On a load operation the hardware checked the tag and either returned the simple value or automatically invoked the lambda thunk.
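To make the mechanism concrete, here is a rough C++ imitation of pass-by-name (my own sketch, obviously not how Algol implementations actually worked): the argument expression becomes a small callable that is re-evaluated in the caller's context on every use, which is Jensen's device.

    #include <functional>
    #include <iostream>

    // The "name" parameter `term` is passed as a zero-argument callable (a thunk)
    // that re-evaluates the caller's expression each time the callee uses it.
    double sum(int& i, int lo, int hi, const std::function<double()>& term) {
        double s = 0.0;
        for (i = lo; i <= hi; ++i)
            s += term();           // each use forces the thunk in the caller's context
        return s;
    }

    int main() {
        int i = 0;
        // Computes 1/1 + 1/2 + ... + 1/10 by passing the expression 1.0/i as a thunk.
        std::cout << sum(i, 1, 10, [&i] { return 1.0 / i; }) << '\n';
    }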

Per Kyle Simpson's definition, a thunk is a way to abstract the component of time out of asynchronous code.

An earlier version of The New Hacker's Dictionary claimed a thunk is a function that takes no arguments, and that it was a simple late-night solution to a particularly tricky problem, with "thunk" being the supposed past tense of "think", because they ought to have thunk of it a long time ago.

In OCaml, it’s a function which takes unit “()” as a param (takes no params, often used for side effects)

Related

Why is function with useless isolated `static` considered impure?

In the Wikipedia article on Pure function, there is an example of an impure function like this:
    void f() {
        static int x = 0;
        ++x;
    }
With the remark of "because of mutation of a local static variable".
I wonder why it is impure. It maps unit type to unit type, so it always returns the same result for the same input. And it has no side effects: even though it has a static int variable, that variable is unobservable by any function other than f() itself, so there is no observable mutation of global state that other functions might use.
If one argues that any global mutation is disallowed, regardless of whether it is observable or not, then no real-life function could ever be considered pure, because every function allocates memory on the stack, and allocation is impure: it involves talking to the MMU via the OS, the allocated page might reside in a different physical page, and so on.
So, why does this useless isolated static int make the function impure?
The result of a pure function is fully defined by its input arguments. Here, the result means not only the returned value, but also the effect in terms of the virtual machine defined by the C/C++ standard. In other words, if the function occasionally exhibits undefined behavior with the same input arguments, it cannot be considered pure (because the behavior is different from one call to another with the same input).
In the particular case with the static local variable, that variable may become the source of a data race if f is called concurrently in multiple threads. Data race means undefined behavior. Another possible source of UB is signed integer overflow, which may eventually happen.
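A minimal sketch of the data-race point, assuming the f from the question:

    #include <thread>

    void f() {
        static int x = 0;   // one shared variable for every call, on every thread
        ++x;                // unsynchronized read-modify-write
    }

    int main() {
        // If f() is called concurrently, the increments race on x; a data race is
        // undefined behavior, so the function's behavior is not determined by its
        // (empty) argument list alone.
        std::thread t1(f), t2(f);
        t1.join();
        t2.join();
    }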
The concept of pure functions seems to only matter in... functional languages? Correct me if I'm wrong. The wikipedia link you provide provides two references near the top, one of which is Professor Frisby's Mostly Adequate Guide to Functional Programming. Where there are several different qualifications for a pure function, including:
does not have any observable side effect
This matters because one of the things we can do to a pure function (as opposed to an impure function) is memoization (from the link above), or input/output caching. Pure functions are also testable, reasonable, and self documenting.
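To illustrate the memoization point, here is a sketch of my own (not from the guide): because the function's result depends only on its argument, its results can be cached without changing observable behavior.

    #include <cstdint>
    #include <iostream>
    #include <map>

    // A hypothetical pure function: result depends only on its argument, no side effects.
    std::uint64_t fib(std::uint64_t n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    // Memoized version: safe precisely because fib is pure, so equal inputs
    // always produce equal outputs and skipping a call is unobservable.
    std::uint64_t fib_memo(std::uint64_t n) {
        static std::map<std::uint64_t, std::uint64_t> cache;
        auto it = cache.find(n);
        if (it != cache.end()) return it->second;
        std::uint64_t r = n < 2 ? n : fib_memo(n - 1) + fib_memo(n - 2);
        cache.emplace(n, r);
        return r;
    }

    int main() {
        std::cout << fib(20) << ' ' << fib_memo(20) << '\n';   // both print 6765
    }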
I guess memoization matters for the compiler, so asking if a function is "pure" can be considered equivalent to asking if the compiler can memoize the function. It seems like the concept of a static local variable that no other piece of code touches is just bad code, and the compiler should issue a warning about it. But should the compiler optimize it away? And should the compiler try to figure out if any given static local variable actually has no side effects?
It seems like it's just easier to design the compiler to always flag a function is impure if it has a static local, instead of writing logic to hem and haw over whether the function is memoizable or not. See a local static? Boom: no longer pure.
So from a compiler's point of view, it's impure.
What about the other properties of a pure function?
testable, reasonable, and self documenting
Tests are written by a person, usually, so I'd argue this function is testable. Although some automated test-writing software might again see that it's not memoizable, and just choose to ignore writing tests for it entirely. This hypothetical software might just skip anything with local statics. Again, hypothetically.
Is the code reasonable? Certainly not. Although I'm not sure how much this matters. It doesn't do anything. It makes it hard to understand. ("Why did Bob write the function this way? Is this a Magic/More Magic situation?").
Is the code self-documenting? Again, I'd say not. But again, this is degenerate example code.
I think the biggest argument against this being considered a pure function is that a functional language compiler would be perfectly reasonable if it just assumed it wasn't pure.
I think the biggest argument for this being considered a pure function is that we can look at it with our own eyeballs and see that there's obviously no outside behavior. Ignore the fact that signed overflow is undefined. Replace this with a datatype that has defined overflow, and is atomic. Well now there's no undefined behavior, but it still looks weird.
"In conclusion, I don't care whether it's pure or not."
Let me rephrase from my previous (above) conclusion.
I'm inclined to just scan a function for any mutation of static variables and call it a day. Boom, no longer pure.
Can the function be considered pure if we really think about it? Sure. But what's the point? If the definition of a pure function needs to be changed, argue for it to be changed. Seems like you think this is a pure function. That's fine, I see the merits in that. I also see the merits in considering it an impure function.
As much as this is a non-answer, it really depends on what you're using the definition of pure for. If it's writing a compiler? Probably want to use the more conservative definition of pure that allows false positives and excludes this function. If it's to impress a bunch of sophomore CS students while listening to Zep? Go for the definition that recognizes this has no side effects and call it a day.

forcing a function to be pure

In C++ it is possible to declare that a function is const, which means, as far as I understand, that the compiler ensures the function does not modify the object. Is there something analogous in C++ where I can require that a function is pure? If not in C++, is there a language where one can make this requirement?
If this is not possible, why is it possible to require functions to be const but not require them to be pure? What makes these requirements different?
For clarity, by pure I want there to be no side effects and no use of variables other than those passed into the function. As a result there should be no file reading or system calls etc.
Here is a clearer definition of side effects:
No modification to files on the computer that the program is run on and no modification to variables with scope outside the function. No information is used to compute the function other than variables passed into it. Running the function should return the same thing every time it is run.
NOTE: I did some more research and encountered PureScript
(thanks to jarod42's comment).
Based on a quick read of the Wikipedia article, I am under the impression that you can require functions to be pure in PureScript, although I am not completely sure.
Short answer: No. There is no equivalent keyword called pure that constrains a function like const does.
However, if you have a specific global variable you'd like to remain untouched, you do have the option of declaring it static at file scope (static type myVar;). That gives it internal linkage, so only functions in that file will be able to use it, and nothing outside of that file. That means any function outside that file is constrained to leave it alone.
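For instance (the file name and variable are mine, purely illustrative):

    // counter.cpp -- a sketch of the file-scope static idea
    static int counter = 0;        // internal linkage: visible only in this translation unit

    void bump() { ++counter; }     // only functions defined in this file can touch counter

    // In any other .cpp file, `extern int counter;` would fail to link against this
    // variable, so code outside this file is forced to leave it alone.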
As to "side effects", I will break each of them down so you know what options you have:
No modification to files on the computer that the program is run on.
You can't constrain a function to do this, as far as I'm aware. C++ just doesn't offer a way to constrain a function like this. You can, however, design a function to not modify any files, if you like.
No modification to variables with scope outside the function.
Globals are the only variables you can modify outside a function's scope that I'm aware of, besides anything passed by pointer or reference as a parameter. Globals have the option of being constant or static, which will keep you from modifying them, but, beyond that, there's really nothing you can do that I'm aware of.
No information is used to compute the function other than variables passed into it.
Again, you can't constrain it to do so, as far as I'm aware. However, you can design the function to work like this if you want.
Running the function should return the same thing every time it is run.
I'm not sure I understand why you want to constrain a function like this, but no, not that I'm aware of. Again, you can design it like this if you like, though.
As to why C++ doesn't offer an option like this? I'm guessing reusability. It appears that you have a specific list of things you don't want your function to do. However, the likelihood that a lot of other C++ users as a whole will need this particular set of constraints often is very small. Maybe they need one or two at a time, but not all at once. It doesn't seem like it would be worth the trouble to add it.
The same, however, cannot be said about const. const is used all the time, especially in parameter lists. This is to keep data from getting modified when it's passed by reference. The compiler needs to know which functions can modify the object, and it uses const in the function declaration to keep track of this; otherwise, it would have no way of knowing. With const, it's quite simple: the compiler can constrain a const object to use only the member functions that guarantee the object remains unchanged, i.e. those declared with the const keyword.
Thus, const gets a lot of reuse.
Currently, C++ does not have a mechanism to ensure that a function has "no side effects and no use of variables other than those passed into the function." You can only force yourself to write pure functions, as mentioned by Jack Bashford. The compiler can't check this for you.
There is a proposal (N3744 Proposing [[pure]]). Here you can see that GCC and Clang already support __attribute__((pure)). Maybe it will be standardized in some form in the future revisions of C++.
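Here is a small sketch of that GCC/Clang extension (not standard C++, and the names are my own):

    // The attribute asserts that the function's result depends only on its
    // arguments and global memory and that it has no side effects; the compiler
    // may then merge or hoist calls. Note that the compiler largely trusts the
    // annotation rather than verifying it.
    __attribute__((pure)) int square(int x) {
        return x * x;
    }

    int use(int a) {
        // With the annotation, the two calls below may be folded into one.
        return square(a) + square(a);
    }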
In C++ it is possible to declare that a function is const, which means, as far as I understand, that the compiler ensures the function does not modify the object.
Not quite. The compiler will allow the object to be modified by (potentially ill-advised) use of const_cast. So the compiler only ensures that the function does not accidentally modify the object.
What makes these requirements [constant and pure] different?
They are different because one affects correct functionality while the other does not.
Suppose C is a container and you are iterating over its contents. At some point within the loop, perhaps you need to call a function that takes C as a parameter. If that function were to clear() the container, your loop would likely crash. Sure, you could build a loop that can handle that, but the point is that there are times when a caller needs assurance that the rug will not be pulled out from under it. Hence the ability to mark things const. If you pass C as a constant reference to a function, that function is promising not to modify C. This promise provides the needed assurance (even though, as I mentioned above, the promise can be broken).
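A minimal sketch of that point (names are mine, just illustrative):

    #include <iostream>
    #include <vector>

    // Taking the container by const reference promises that the callee will not
    // modify it, so the caller's loop cannot be invalidated.
    void inspect(const std::vector<int>& c) {
        std::cout << "size: " << c.size() << '\n';
        // c.clear();   // would not compile: clear() is a non-const member function
    }

    void caller(std::vector<int>& c) {
        for (int x : c) {
            inspect(c);   // safe: inspect() cannot pull the rug out from under the loop
            std::cout << x << '\n';
        }
    }

    int main() {
        std::vector<int> v{1, 2, 3};
        caller(v);
    }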
I am not aware of a case where use of a non-pure function could similarly cause a program to crash. If there is no use for something, why complicate the language with it? If you can come up with a good use-case, maybe it is something to consider for a future revision of the language.
(Knowing that a function is pure could help a compiler optimize code. As far as I know, it's been left up to each compiler to define how to flag that, as it does not affect functionality.)

C function to exercise as many registers as possible

I'm writing a test suite for an assembler stub on multiple CPU architectures. The stub is meant to pass arguments to a C callback.
I have written tests to cover multiple scenarios (eg passing struct by value, mixing arguments of different native size, mixing float args with ints etc) and I would now like the test callback to do something that will use up lots of registers/stack slots etc.
The idea is to try and flush out implementations that are only working by fluke (eg value not correctly put on stack, but it happens to still be in a certain register so you get away with it etc).
Can anyone recommend a good piece of C/C++ I can use as the test here? I realise register use varies wildly across architectures and there's no guarantee at all that it will get complete coverage, but it would be nice to have something that gave a reasonable amount of confidence. Thanks
There is nothing in the C/C++ standards to help you here. Ultimately the only reliable way is exactly the way that compiler writers do it. They study the code their compiler generates and think up ways to break it.
Having said that, I can think of some strategies that might flush out some common problems.
Make calls with lots of different combinations of argument types and numbers (a sketch follows this list). For example, a function with a single argument or return value that is a char/short/int/long/float/double/pointer etc. will exercise a certain range of compiler code generation (and the argument will likely be passed in a register if possible). The same function with lots of arguments will use a different strategy, and (in most cases) not have enough registers.
Insert preamble code into the called function, so that passed in arguments are not used immediately but the registers get filled with other values.
Use variadic arguments. The calling conventions for variadic arguments (especially with separate compile and link) virtually guarantee arguments on the stack and not in registers.
Exercise lots of different call types: not just simple scalars but struct by value, pointer to function, pointer to member, etc.
Cheat. Call with one prototype but cast the function pointer so that the callee has a different prototype. For example, call with double on the stack but the callee function has two longs. Requires some inside knowledge of compiler workings.
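As a starting point for the first strategy, here is a sketch of my own construction (not from the answer) of a test callback that mixes argument kinds so that integer registers, floating-point registers and stack slots are all likely to be needed on typical ABIs:

    #include <iostream>

    struct Big { long a, b, c, d, e, f; };   // usually too large to pass in registers

    double callback(int i1, double d1, Big s, long l1, float f1,
                    long l2, long l3, long l4, long l5, long l6, long l7,
                    double d2, double d3, double d4, double d5,
                    double d6, double d7, double d8, double d9, double d10) {
        // Combine every argument so a stub that drops or misplaces one changes the result.
        return i1 + d1 + s.a + s.f + l1 + f1 + l2 + l3 + l4 + l5 + l6 + l7
             + d2 + d3 + d4 + d5 + d6 + d7 + d8 + d9 + d10;
    }

    int main() {
        Big s{1, 2, 3, 4, 5, 6};
        std::cout << callback(1, 1.0, s, 1, 1.0f, 2, 3, 4, 5, 6, 7,
                              1, 2, 3, 4, 5, 6, 7, 8, 9) << '\n';
    }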
The only pre-written code you will find to do this stuff is the compiler compliance suite for your chosen compiler.
Breaking compilers is fun. Hopefully something here will help you do it.

MSVC Compiler Error C2688: Microsoft C++ ABI corner case issue?

A very specific corner case that MSVC disallows via Compiler Error 2688 is admitted by Microsoft to be non-standard behavior. Does anyone know why MSVC++ has this specific limitation?
The fact that it involves simultaneous usage of three language features ("virtual base classes", "covariant return types", and "variable number of arguments", according to the description in the second linked page) that are semantically orthogonal and fully supported separately seems to imply that this is not a parsing or semantic issue, but a corner case in the Microsoft C++ ABI. In particular, the fact that a "variable number of arguments" is involved seems to (?) suggest that the C++ ABI is using an implicit trailing parameter to implement the combination of the other two features, but can't because there's no fixed place to put that parameter when the function is var arg.
Does anyone have enough knowledge of the Microsoft C++ ABI to confirm whether this is the case, and explain what this implicit trailing argument is used for (or what else is going on, if my guess is incorrect)? The C++ ABI is not documented by Microsoft but I know that some people outside of Microsoft have done work to match the ABI for various reasons so I'm hoping someone can explain what is going on.
Also, Microsoft's documentation is a bit inconsistent; the second page linked says:
Virtual base classes are not supported as covariant return types when the virtual function has a variable number of arguments.
but the first page more broadly states:
covariant returns with multiple or virtual inheritance not supported for varargs functions
Does anyone know what the real story is? I can do some experimentation to find out, but I'm guessing that the actual corner case is neither of these, exactly, but has to do with the specifics of the class hierarchy in a way that the documenters decided to gloss over. My guess is that it has to do with the need for a pointer adjustment in the virtual thunk, but I'm hoping someone with deeper knowledge of the situation than me can explain what's going on under the hood.
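For reference, here is my best-guess reconstruction of the kind of declaration the documentation is describing (I have not verified that this exact snippet produces C2688):

    // Declarations only; a sketch of the combination of features in question.
    struct Ret {};
    struct CovariantRet : virtual Ret {};    // covariant return type with a virtual base

    struct Base {
        virtual Ret* make(const char* fmt, ...);            // varargs virtual function
    };

    struct Derived : Base {
        CovariantRet* make(const char* fmt, ...) override;  // covariant return + varargs:
                                                             // the case MSVC rejects
    };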
I can tell you with authority that MSVC's C++ ABI uses implicit extra parameters to do things that other ABIs (namely Itanium) handle by emitting multiple separate functions, so it's not hard to imagine that one is being used here (or would be, if the case were supported).
I don't know for sure what's happening in this case, but it seems plausible that an implicit extra parameter is being passed to tell the thunk implementing the virtual function whether a downcast to the covariant return type class is required (or, more likely, whether an upcast back to the base class is required, since the actual implementing function probably returns the derived class), and that this extra parameter goes last so that it can be ignored by the base classes (which wouldn't know anything about the covariant return).
This implies that the unsupported corner case always occurs when a virtual base class is the original return type (since a thunk will always be required to the derived class), which is what is described in the first quote; it would also happen in some, but not all, cases involving multiple inheritance (which may be why it's included in the second quote, but not the first).

static binding of default parameter

In Effective C++, the book gives just one sentence on why default parameters are statically bound:
If default parameter values were dynamically bound, compilers would have to come up with a way to determine the appropriate default values for parameters of virtual functions at runtime, which would be slower and more complicated than the current mechanism of determining them during compilation.
Can anybody elaborate on this a bit more? Why would it be complicated and inefficient?
Thanks so much!
Whenever a class has virtual functions, the compiler generates a so-called v-table to calculate the proper addresses that are needed at runtime to support dynamic binding and polymorphic behavior. Lots of class optimizers work toward removing virtual functions for exactly this reason: less overhead, and smaller code. If default parameters were also factored into the equation, it would make the whole virtual function mechanism all the more cumbersome and bloated.
Because, for a virtual call, the actual function would need to be looked up through the vtable associated with the object instance, and from that the default value would need to be inferred in some manner. That means the vtable would need to be extended, or there would need to be extra bookkeeping to link the default to a vtable entry.
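The standard Effective C++ illustration of the consequence (my rendering of it, simplified):

    #include <iostream>

    // The function is chosen dynamically (via the vtable), but the default
    // argument is taken from the static type of the pointer.
    struct Shape {
        virtual void draw(int color = 1) const { std::cout << "Shape " << color << '\n'; }
        virtual ~Shape() = default;
    };

    struct Circle : Shape {
        void draw(int color = 2) const override { std::cout << "Circle " << color << '\n'; }
    };

    int main() {
        const Shape* p = new Circle;
        p->draw();      // prints "Circle 1": Circle::draw runs, but the default value 1
                        // comes from Shape, because defaults are bound at compile time
        delete p;
    }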