Direct execution of data - C++

What is the best approach to make my program execute data? Say I wrote a (so-called) compiler for an x86_64 machine:
#include <iostream>
#include <utility>
#include <vector>
#include <cstdlib>
#include <cstdint>

struct compiler
{
    void op() const { return; }

    template< typename ...ARGS >
    void op(std::uint8_t const _opcode, ARGS && ..._tail)
    {
        code_.push_back(_opcode);
        return op(std::forward< ARGS >(_tail)...);
    }

    void clear() { code_.clear(); }

    long double operator () () const
    {
        // ?
    }

private :
    std::vector< std::uint8_t > code_;
};
int main()
{
    compiler compiler_; // long double (*)();
    compiler_.op(0xD9, 0xEE); // FLDZ
    compiler_.op(0xC3); // RET
    std::cout << compiler_() << std::endl;
    return EXIT_SUCCESS;
}
But I don't know how to implement operator () correctly. I suspect that I must put all the contents of code_ into a contiguous memory chunk, cast it to long double (*)(), and call that. But there are some difficulties:
Should I use VirtualProtect(Ex) (+ FlushInstructionCache) on Windows? And something similar on Linux?
What container reliably places the bytes in memory in the proper manner (i.e. one after another), and also allows getting a pointer to the memory chunk?

First, you will need to allocate the code as executable: use VirtualAlloc with an "executable" protection flag on Windows, and mmap with PROT_EXEC among the protection flags on Linux (the MAP_EXECUTABLE flag is historical and ignored by modern kernels). It's probably a lot easier to allocate one large region of this kind of memory and then have an "allocation function" that hands out pieces for your content. You could possibly use VirtualProtect and the corresponding Linux function, mprotect, but I'd say that allocating as executable in the first place is the better choice. I don't believe you need to flush the instruction cache if the memory is already allocated as executable - certainly not on x86 at least - and since your instructions are x86 instructions, I guess that's a fair limitation.
Second, you'll need to make something like a function pointer to your code. Something like this should do it:
typedef void (*funcptr)(void);
funcptr f = reinterpret_cast<funcptr>(&code_[0]);
should do the trick.
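Putting the pieces together, here is a minimal POSIX-only sketch of the asker's operator() (the Windows variant would use VirtualAlloc with PAGE_EXECUTE_READWRITE instead of mmap); error handling is kept to a bare minimum:

#include <cstring>
#include <sys/mman.h>

long double operator () () const
{
    // Map a readable/writable/executable region and copy the code into it.
    void *mem = mmap(nullptr, code_.size(), PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 0.0L; // real code should report the error
    std::memcpy(mem, code_.data(), code_.size());
    auto f = reinterpret_cast<long double (*)()>(mem);
    long double result = f();
    munmap(mem, code_.size());
    return result;
}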

Related

Can I be sure that the binary code of the functions will be copied sequentially?

Sorry if this question already exists; I suspect this approach is already in use, but I just don't know what it is called. My goal is to execute a sequence of functions straight from memory; for this I computed the size from the addresses of the first and last functions.
This is my first try:
source.cpp
void func1(int var1, int var2)
{
    func2();
    func3();
    //etc.
}
void func2(...){...}
void func3(...){...}
int funcn(){return 123;} // last func, used only as a border marker
//////////////////////////////////////////////////
main.cpp
#include"source.cpp"
long long size = (long long)funcn - (long long)func1; // the size of the binary code of these funcs
// and then I can memcpy it to a file or somewhere else and execute it via the address of the first
At first this worked correctly, but after I updated my functions it crashed: the size had become negative.
Then I tried to pin the functions to memory more firmly:
source.cpp
extern void (*pfunc1)(int, int);
extern int (*pfuncn)();
void (*pfunc1)(int, int) = &func1;
int (*pfuncn)() = &funcn;
static void __declspec(noinline) func1(int var1, int var2)
{
    //the same impl
}
static void __declspec(noinline) func2(...){...}
static void __declspec(noinline) func3(...){...}
static int __declspec(noinline) funcn(){ return 123; }
//////////////////////////////////
main.cpp
#include"source.cpp"
long long size= (long long) pfuncn - (long long) pfunc1;
//same impl
This worked after my first update, but then I had to update it again, and now it gives me the wrong size. The size was around 900+ bytes; after I changed a few funcs the size became 350+ bytes, even though I hadn't changed that much.
I disabled optimizations and inline optimizations.
So my question is: how can I be sure that func1 will have a lower address than the last funcn, and what could change their locations in memory? Thank you for your attention.
// and then I can memcpy it to a file or somewhere else and execute it via the address of the first
That is: copy it into allocated memory and then call it there, via the address of the allocation.
This needs to be stated:
You cannot copy code from one location to another and hope for it to work.
- There is no guarantee that all the code required to call a function is located in a contiguous block.
- There is no guarantee that the function pointer actually points to the beginning of the needed code.
- There is no guarantee that you can effectively write to executable memory; to the OS, you'd look a lot like a virus.
- There is no guarantee that the code is relocatable (able to work after being moved to a different location); for that, it must use only relative addresses.
In short: unless you have supporting tools that go beyond the scope of standard C++, don't even think about it.
GCC family only!
You can force the compiler to put a whole function into a separate section. Then you know the memory area where the function resides.
int __attribute__((section(".foosection"))) foo()
{
/* some code here */
}
in the linker script, inside .text, you need to add:
.text :
{
    /* ... */
    __foosection_start = .;
    *(.foosection*)
    __foosection_end = .;
    /* ... */
}
and in the place where you want to know or use it
extern unsigned char __foosection_start[];
extern unsigned char __foosection_end[];
void printfoo()
{
printf("foosection start: %p, foosection end: %p\n ", (void *)__foosection_start, (void *)__foosection_end);
}
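If you then want to execute a copy of the section, here is a heavily hedged POSIX-only sketch; all the caveats above about relocatable code still apply, so this only works if the code uses relative addressing exclusively:

#include <cstddef>
#include <cstring>
#include <sys/mman.h>

extern unsigned char __foosection_start[];
extern unsigned char __foosection_end[];

void *copy_foosection()
{
    std::size_t size = (std::size_t)(__foosection_end - __foosection_start);
    // Map a writable and executable region (hardened systems may forbid W+X).
    void *buf = mmap(nullptr, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return nullptr;
    std::memcpy(buf, __foosection_start, size);
    return buf; // only callable if the copied code is position-independent
}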
This is probably not possible because of a requirement you did not mention, but why not use an array of function pointers?
std::function<void()> funcs[] = {
func2,
func3,
[](){ /* and an inline lambda, because why not */ },
};
// Call them in sequence like so:
for (auto& func: funcs) {
func();
}

Can I allocate memory on a CUDA device for objects containing arrays of floats?

I am working on solving identical ordinary differential equations with different initial conditions in parallel. I have solved this problem with OpenMP and now I want to implement similar code on the GPU. Specifically, I want to allocate device memory for floats in the class constructor and then deallocate it in the destructor. It doesn't work for me: my executable is "terminated by signal SIGSEGV (Address boundary error)". Is it possible to use classes, constructors and destructors in CUDA?
By the way, I am a newbie in CUDA and not very experienced in C++ either.
I attach the code in case I have described my problem poorly.
#include <cmath>
#include <iostream>
#include <fstream>
#include <iomanip>
#include <random>
#include <string>
#include <chrono>
#include <ctime>
using namespace std;
template<class ode_sys>
class solver: public ode_sys
{
public:
int *nn;
float *t,*tt,*dt,*x,*xx,*m0,*m1,*m2,*m3;
using ode_sys::rhs_sys;
__host__ solver(int n): ode_sys(n)
{ //here I try to allocate memory. It works malloc() and doesn't with cudaMalloc()
size_t size=sizeof(float)*n;
cudaMalloc((void**)&nn,sizeof(int));
*nn=n;
cudaMalloc((void**)&t,sizeof(float));
cudaMalloc((void**)&tt,sizeof(float));
cudaMalloc((void**)&dt,sizeof(float));
cudaMalloc((void**)&x,size);
cudaMalloc((void**)&xx,size);
cudaMalloc((void**)&m0,size);
cudaMalloc((void**)&m1,size);
cudaMalloc((void**)&m2,size);
cudaMalloc((void**)&m3,size);
}
__host__ ~solver()
{
cudaFree(nn);
cudaFree(t);
cudaFree(tt);
cudaFree(dt);
cudaFree(x);
cudaFree(xx);
cudaFree(m0);
cudaFree(m1);
cudaFree(m2);
cudaFree(m3);
}
__host__ __device__ void rk4()
{//this part is not important now.
}
};
class ode
{
private:
int *nn;
public:
float *eps,*d;
__host__ ode(int n)
{
cudaMalloc((void**)&nn,sizeof(int));
*nn=n;
cudaMalloc((void**)&eps,sizeof(float));
size_t size=sizeof(float)*n;
cudaMalloc((void**)&d,size);
}
__host__ ~ode()
{
cudaFree(nn);
cudaFree(eps);
cudaFree(d);
}
__host__ __device__ float f(float x_,float y_,float z_,float d_)
{
return d_+*eps*(sinf(x_)+sinf(z_)-2*sinf(y_));
}
__host__ __device__ void rhs_sys(float *t,float *dt,float *x,float *dx)
{
}
};
//const float pi=3.14159265358979f;
__global__ void solver_kernel(int m,int n,solver<ode> *sys_d)
{
int index = threadIdx.x;
int stride = blockDim.x;
//actually ode numerical evaluation should be here
for (int l=index;l<m;l+=stride)
{//this is just to check that i can run kernel
printf("%d Hello \n", l);
}
}
int main ()
{
auto start = std::chrono::system_clock::now();
std::time_t start_time = std::chrono::system_clock::to_time_t(start);
cout << "started computation at " << std::ctime(&start_time);
int m=128,n=4,l;// i want to run 128 threads, n is dimension of ode
size_t size=sizeof(solver<ode>(n));
solver<ode> *sys_d; //an array of objects
cudaMalloc(&sys_d,size*m); //nvprof shows that this array is allocated
for (l=0;l<m;l++)
{
new (sys_d+l) solver<ode>(n); //it doesn't work as it meant to
}
solver_kernel<<<1,m>>>(m,n,sys_d);
for (l=0;l<m;l++)
{
(sys_d+l)->~solver<ode>(); //it doesn't work as it meant to
}
cudaFree(sys_d); //it works
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::time_t end_time = std::chrono::system_clock::to_time_t(end);
std::cout << "finished computation at " << std::ctime(&end_time) << "elapsed time: " << elapsed_seconds.count() << "s\n";
return 0;
}
//end of file
Distinguish host-side and device-side memory
As other answers also state:
- GPU (global) memory you allocate with cudaMalloc() is not accessible by code running on the CPU; and
- system memory (a.k.a. host memory) you allocate in plain C++ (with std::vector, with std::make_unique, with new, etc.) is not accessible by code running on the GPU.
So, you need to allocate both host-side and device-side memory. For a simple example of working with both device-side and host-side memory see the CUDA vectorAdd sample program.
(Actually, you can also make a special kind of allocation which is accessible from both the device and the host; this is Unified Memory. But let's ignore that for now since we're dealing with the basics.)
Don't live in the kingdom of nouns
Specifically, I want to allocate memory on device for floats in class constructor and then deallocate it in destructor.
I'm not sure you really want to do that. You seem to be taking a more Java-esque approach, in which everything you do is noun-centric, i.e. classes are used for everything: You don't solve equations, you have an "equation solver". You don't "do X", you have an "XDoer" class etc. Why not just have a (templated) function which solves an ODE system, returning the solution? Are you using your "solver" in any other way?
(this point is inspired by Steve Yegge's blog post, Execution in the Kingdom of Nouns.)
Try to avoid allocating and de-allocating yourself
In well-written modern C++, we try to avoid direct, manual allocation of memory (that's a link to the C++ Core Programming Guidelines by the way). Now, it's true that you free your memory with the destructor, so it's not all that bad, but I'd really consider using std::unique_ptr on the host and something equivalent on the device (like cuda::memory::unique_ptr from my Modern-C++ CUDA API wrapper cuda-api-wrappers library); or a GPU-oriented container class like thrust's device vector.
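To illustrate the idea, here is a hypothetical minimal wrapper (a sketch, not any library's actual API):

#include <cstddef>
#include <cuda_runtime.h>

// Minimal RAII owner for a device-side float buffer (illustrative only).
struct device_buffer {
    float *ptr = nullptr;
    explicit device_buffer(std::size_t n) { cudaMalloc((void**)&ptr, n * sizeof(float)); }
    ~device_buffer() { cudaFree(ptr); }
    device_buffer(const device_buffer&) = delete;
    device_buffer& operator=(const device_buffer&) = delete;
};

With members like this, the solver class would not need a hand-written destructor at all.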
Check for errors
You really must check for errors after you call CUDA API functions, and this is doubly necessary after you launch a kernel. When you call C++ standard library code, it throws an exception on error; CUDA's runtime API is C-like and doesn't know about exceptions. It will just fail and set some error variable you need to check.
So, either you write error checks, like in the vectorAdd() sample I linked to above, or you get some library to exhibit more standard-library-like behavior. cuda-api-wrappers and thrust will both do that - on different levels of abstraction; and so will other libraries/frameworks.
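A minimal version of such a check (a sketch; the linked sample and the libraries above do this more thoroughly):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every runtime API call; print the error and abort on failure.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                         cudaGetErrorString(err_));                 \
            std::exit(EXIT_FAILURE);                                \
        }                                                           \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&d, size));
// After a kernel launch: CUDA_CHECK(cudaGetLastError());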
You need an array on the host side and one on the device side.
Initialize the host array, then copy it to the device array with cudaMemcpy. The destruction has to be done on the host side again.
An alternative would be to initialize the array from the device, you would need to put __device__ in front of your constructor, then just use malloc.
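A minimal sketch of that host-init-then-copy flow (names are illustrative):

#include <vector>
#include <cuda_runtime.h>

void host_to_device_example(int n)
{
    std::vector<float> host(n, 1.0f);                 // initialize on the host
    float *dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch kernels that read/write dev ...
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);                                     // freed on the host side again
}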
You cannot dereference a pointer to device memory in host code:
__host__ ode(int n)
{
cudaMalloc((void**)&nn,sizeof(int));
*nn=n; // !!! ERROR
cudaMalloc((void**)&eps,sizeof(float));
size_t size=sizeof(float)*n;
cudaMalloc((void**)&d,size);
}
You will have to copy the values with cudaMemcpy.
(Or use the parameters of a __global__ function.)
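For the constructor above, that would look something like this (a sketch with error checks elided):

__host__ ode(int n)
{
    cudaMalloc((void**)&nn, sizeof(int));
    // nn points to device memory: write to it with cudaMemcpy,
    // not by dereferencing it on the host.
    cudaMemcpy(nn, &n, sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&eps, sizeof(float));
    size_t size = sizeof(float) * n;
    cudaMalloc((void**)&d, size);
}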

Avoid memory allocation with std::function and member function

This code is just for illustrating the question.
#include <functional>
struct MyCallBack {
void Fire() {
}
};
int main()
{
MyCallBack cb;
std::function<void(void)> func = std::bind(&MyCallBack::Fire, &cb);
}
Experiments with valgrind show that the line assigning to func dynamically allocates about 24 bytes with gcc 7.1.1 on Linux.
In the real code, I have a few handfuls of different structs, all with a void(void) member function, which get stored in ~10 million std::function<void(void)> objects.
Is there any way I can avoid memory being dynamically allocated when doing std::function<void(void)> func = std::bind(&MyCallBack::Fire, &cb); ? (Or otherwise assigning these member function to a std::function)
Unfortunately, allocators for std::function have been dropped in C++17.
Now the accepted solution to avoid dynamic allocations inside std::function is to use lambdas instead of std::bind. That does work, at least in GCC - it has enough static space to store the lambda in your case, but not enough space to store the binder object.
std::function<void()> func = [&cb]{ cb.Fire(); };
// sizeof lambda is sizeof(MyCallBack*), which is small enough
As a general rule, with most implementations, a lambda which captures only a single pointer (or a reference) will avoid dynamic allocations inside std::function with this technique (it is also generally a better approach, as the other answer suggests).
Keep in mind that for this to work you need a guarantee that the lambda will outlive the std::function. Obviously, that is not always possible, and sometimes you have to capture state by (large) copy. If that happens, there is currently no way to eliminate dynamic allocations in std::function, other than tinkering with the STL yourself (obviously not recommended in the general case, but it could be done in some specific cases).
As an addendum to the already existent and correct answer, consider the following:
MyCallBack cb;
std::cerr << sizeof(std::bind(&MyCallBack::Fire, &cb)) << "\n";
auto a = [&] { cb.Fire(); };
std::cerr << sizeof(a);
This program prints 24 and 8 for me, with both gcc and clang. I don't know exactly what bind is doing here (my understanding is that it's a fantastically complicated beast), but as you can see, it's almost absurdly inefficient here compared to a lambda.
As it happens, std::function is guaranteed to not allocate if constructed from a function pointer, which is also one word in size. So constructing a std::function from this kind of lambda, which only needs to capture a pointer to an object and should also be one word, should in practice never allocate.
Run this little hack and it will probably print the number of bytes you can capture without allocating memory:
#include <iostream>
#include <functional>
#include <cstring>
void h(std::function<void(void*)>&& f, void* g)
{
f(g);
}
template<size_t number_of_size_t>
void do_test()
{
size_t a[number_of_size_t];
std::memset(a, 0, sizeof(a));
a[0] = sizeof(a);
std::function<void(void*)> g = [a](void* ptr) {
if (&a != ptr)
std::cout << "malloc was called when capturing " << a[0] << " bytes." << std::endl;
else
std::cout << "No allocation took place when capturing " << a[0] << " bytes." << std::endl;
};
h(std::move(g), &g);
}
int main()
{
do_test<1>();
do_test<2>();
do_test<3>();
do_test<4>();
}
With gcc version 8.3.0 this prints
No allocation took place when capturing 8 bytes.
No allocation took place when capturing 16 bytes.
malloc was called when capturing 24 bytes.
malloc was called when capturing 32 bytes.
Many std::function implementations will avoid allocations and use space inside the function object itself, rather than allocating, if the callback it wraps is "small enough" and has trivial copying. However, the standard does not require this; it only suggests it.
On g++, a non-trivial copy constructor on a function object, or data exceeding 16 bytes, is enough to cause it to allocate. But if your function object has no data and uses the built-in copy constructor, then std::function won't allocate.
Also, if you use a function pointer or a member function pointer, it won't allocate.
While not directly part of your question, it is part of your example.
Do not use std::bind. In virtually every case, a lambda is better: smaller, better inlining, can avoid allocations, better error messages, faster compiles, the list goes on. If you want to avoid allocations, you must also avoid bind.
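To illustrate the two allocation-free cases mentioned above (a sketch; on_fire is a hypothetical free function):

#include <functional>

void on_fire() {}

void demo()
{
    // A plain function pointer: construction is guaranteed not to throw,
    // and the pointer is stored inline in practice.
    std::function<void()> f1 = &on_fire;
    // A captureless lambda is an empty, trivially copyable object,
    // so it fits the small-object buffer as well.
    std::function<void()> f2 = [] { on_fire(); };
}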
I propose a custom class for your specific usage.
While it's true that you shouldn't try to re-implement existing library functionality, because the library version will be much better tested and optimized, that advice applies to the general case. If you have a particular situation like the one in your example and the standard implementation doesn't suit your needs, you can explore implementing a version tailored to your specific use case, which you can measure and tweak as necessary.
So I have created a class akin to std::function<void (void)> that works only for methods and keeps all its storage in place (no dynamic allocations).
I have lovingly called it Trigger (inspired by your Fire method name). Please do give it a more suited name if you want to.
// helper alias for method
// can be used in user code
template <class T>
using Trigger_method = auto (T::*)() -> void;
namespace detail
{
// Polymorphic classes needed for type erasure
struct Trigger_base
{
virtual ~Trigger_base() noexcept = default;
virtual auto placement_clone(void* buffer) const noexcept -> Trigger_base* = 0;
virtual auto call() -> void = 0;
};
template <class T>
struct Trigger_actual : Trigger_base
{
T& obj;
Trigger_method<T> method;
Trigger_actual(T& obj, Trigger_method<T> method) noexcept : obj{obj}, method{method}
{
}
auto placement_clone(void* buffer) const noexcept -> Trigger_base* override
{
return new (buffer) Trigger_actual{obj, method};
}
auto call() -> void override
{
return (obj.*method)();
}
};
// in Trigger (below) we need to allocate enough storage
// for any Trigger_actual template instantiation;
// since all instantiations basically contain 2 pointers,
// we assume (and test with static_asserts)
// that all of them have the same size;
// we use Trigger_actual<Trigger_test_size>
// to determine the size of all Trigger_actual templates
struct Trigger_test_size {};
}
struct Trigger
{
std::aligned_storage_t<sizeof(detail::Trigger_actual<detail::Trigger_test_size>)>
trigger_actual_storage_;
// vital. We cannot just cast `&trigger_actual_storage_` to `Trigger_base*`
// because there is no guarantee by the standard that
// the base pointer will point to the start of the derived object
// so we need to store separately the base pointer
detail::Trigger_base* base_ptr = nullptr;
template <class X>
Trigger(X& x, Trigger_method<X> method) noexcept
{
static_assert(sizeof(trigger_actual_storage_) >=
sizeof(detail::Trigger_actual<X>));
static_assert(alignof(decltype(trigger_actual_storage_)) %
alignof(detail::Trigger_actual<X>) == 0);
base_ptr = new (&trigger_actual_storage_) detail::Trigger_actual<X>{x, method};
}
Trigger(const Trigger& other) noexcept
{
if (other.base_ptr)
{
base_ptr = other.base_ptr->placement_clone(&trigger_actual_storage_);
}
}
auto operator=(const Trigger& other) noexcept -> Trigger&
{
destroy_actual();
if (other.base_ptr)
{
base_ptr = other.base_ptr->placement_clone(&trigger_actual_storage_);
}
return *this;
}
~Trigger() noexcept
{
destroy_actual();
}
auto destroy_actual() noexcept -> void
{
if (base_ptr)
{
base_ptr->~Trigger_base();
base_ptr = nullptr;
}
}
auto operator()() const
{
if (!base_ptr)
{
// deal with this situation (error or just ignore and return)
}
base_ptr->call();
}
};
Usage:
struct X
{
auto foo() -> void;
};
auto test()
{
X x;
Trigger f{x, &X::foo};
f();
}
Warning: only tested for compilation errors.
You need to thoroughly test it for correctness.
You need to profile it and see if it performs better than the alternatives. The advantage here is that, because it's cooked in-house, you can tweak the implementation to increase performance in your specific scenarios.
As @Quuxplusone mentioned in their answer-as-a-comment, you can use inplace_function here. Include the header in your project, and then use it like this:
#include "inplace_function.h"
struct big { char foo[20]; };
static stdext::inplace_function<void(), 8> inplacefunc;
static std::function<void()> stdfunc;
int main() {
static_assert(sizeof(inplacefunc) == 16);
static_assert(sizeof(stdfunc) == 32);
inplacefunc = []() {};
// fine
struct big a;
inplacefunc = [a]() {};
// test.cpp:15:24: required from here
// inplace_function.h:237:33: error: static assertion failed: inplace_function cannot be constructed from object with this (large) size
// 237 | static_assert(sizeof(C) <= Capacity,
// | ~~~~~~~~~~^~~~~~~~~~~
// inplace_function.h:237:33: note: the comparison reduces to ‘(20 <= 8)’
}

Build a function at runtime in C++ from a number of functions built at compile time

I am creating a scripting language that first parses the code and then copies functions (to execute the code) into one buffer/memory region along with the parsed code.
Is there a way to copy a function's binary code into a buffer and then execute the whole buffer?
I need to execute all the functions at once to get better performance.
To illustrate my question as clearly as possible, I want to do something like this:
#include <vector>
using namespace std;
class RuntimeFunction; //The buffer for my runtime function

enum ByteCodeType {
    Return,
    None
};
class ByteCode {
    ByteCodeType type;
};
void ReturnRuntime() {
    return;
}
RuntimeFunction GetExecutableData(vector<ByteCode> function) {
    RuntimeFunction runtimeFunction = RuntimeFunction(sizeof(int)); //Returns int
    for (int i = 0; i < function.size(); i++) {
#define CurrentByteCode function[i]
        if (CurrentByteCode.type == Return) {
            runtimeFunction.Append(&ReturnRuntime);
        } //etc.
#undef CurrentByteCode
    }
    return runtimeFunction;
}
void* CallFunc(RuntimeFunction runtimeFunction, vector<void*> custom_parameters) {
    for (int i = custom_parameters.size() - 1; i >= 0; --i) { //Push parameters in reverse order
        __asm {
            push custom_parameters[i]
        }
    }
    __asm {
        call runtimeFunction.pHandle
    }
}
There are a number of ways of doing this, depending on how deep you want to get into generating code at runtime, but one relatively simple way of doing it is with threaded code and a threaded code interpreter.
Basically, threaded code consists of an array of function pointers, and the interpreter goes through the array calling each pointed-at function. The tricky part is that you generally have each function return the address of the array element containing a pointer to the next function to call, which lets you implement things like branches and calls without any effort in the interpreter.
Usually you use something like:
typedef void *(*tc_func_t)(void *, runtime_state_t *);

void *interp(tc_func_t **entry, runtime_state_t *state) {
    tc_func_t *pc = *entry;
    while (pc) pc = (tc_func_t *)(*pc)(pc + 1, state); // cast needed in C++
    return entry + 1;
}
That's the entire interpreter. runtime_state_t is some kind of data structure containing runtime state (usually one or more stacks). You call it by creating an array of tc_func_t function pointers, filling it in with function pointers (and possibly data), ending with a null pointer, and then calling interp with the address of a variable containing the start of the array. So you might have something like:
void *add(tc_func_t *pc, runtime_state_t *state) {
    int v1 = state->data.pop();
    int v2 = state->data.pop();
    state->data.push(v1 + v2);
    return pc;
}
void *push_int(tc_func_t *pc, runtime_state_t *state) {
    state->data.push((int)(intptr_t)*pc); // the operand lives in the next slot
    return pc + 1;
}
void *print(tc_func_t *pc, runtime_state_t *state) {
    cout << state->data.pop();
    return pc;
}
tc_func_t program[] = {
    (tc_func_t)push_int,
    (tc_func_t)2,
    (tc_func_t)push_int,
    (tc_func_t)2,
    (tc_func_t)add,
    (tc_func_t)print,
    0
};
void run_program() {
    runtime_state_t state;
    tc_func_t *entry = program;
    interp(&entry, &state);
}
Calling run_program runs the little program that adds 2+2 and prints the result.
Now you may be confused by the slightly odd calling setup for interp, with an extra level of indirection on the entry argument. That's so that you can use interp itself as a function in a threaded code array, followed by a pointer to another array, and it will do a threaded code call.
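For example, a threaded-code call built from the functions above might look like this (a sketch):

// A subroutine: push 2, print it, then stop (null terminator).
tc_func_t subroutine[] = {
    (tc_func_t)push_int,
    (tc_func_t)2,
    (tc_func_t)print,
    0
};

// The caller embeds interp itself; the next slot holds the start of the
// sub-array. interp runs the subroutine and returns the address of the
// slot after its operand, so the outer loop resumes right after the call.
tc_func_t caller[] = {
    (tc_func_t)interp,
    (tc_func_t)subroutine,
    0
};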
Edit: The biggest problem with threaded code like this is performance: the threaded-code interpreter is extremely unfriendly to branch predictors, so performance is pretty much capped at one threaded instruction call per branch-misprediction recovery time.
If you want more performance, you pretty much have to go to full-on runtime code generation. LLVM provides a good, machine independent interface to doing that, along with pretty good optimizers for common platforms that will produce pretty good code at runtime.

Does getenv cache the result?

I have several calls to getenv in my code (called many times), so I see the potential for an optimization. My question is: does getenv somehow cache the result internally, or does it query the environment variables on each call?
I have profiled the code, getenv is not a bottleneck, but I'd still like to change it if it's more efficient.
As a side question, can an environment variable be changed for a program while it is running? I'm not doing that, so in my case caching the result would be safe, it's just curiosity.
Environment variables usually live in the memory of the given process, so there is nothing to cache; they are readily available.
As for updates, any component of a running process can call putenv to update the environment, so you should not cache values for prolonged periods if you expect that to happen.
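For example (POSIX; a small sketch demonstrating that an update is immediately visible):

#include <cstdio>
#include <cstdlib>

int main()
{
    const char *before = std::getenv("MY_VAR");
    std::printf("before: %s\n", before ? before : "(unset)");
    setenv("MY_VAR", "42", 1);   // POSIX; putenv() can be used similarly
    std::printf("after: %s\n", std::getenv("MY_VAR")); // re-read, no stale cache
    return 0;
}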
I doubt it caches the results, since environment variables can change from call to call. But you can implement such a cache yourself:
#include <map>
#include <iostream>
#include <string>
#include <stdexcept>
#include <cstdlib>
class EnvCache {
public:
const std::string &get_env(const std::string &key) {
auto it = cache_entries.find(key);
if(it == cache_entries.end()) {
const char *ptr = getenv(key.c_str());
if(!ptr)
throw std::runtime_error("Env var not found");
it = cache_entries.insert({key, ptr}).first;
}
return it->second;
}
void clear() {
cache_entries.clear();
}
private:
std::map<std::string, std::string> cache_entries;
};
int main() {
EnvCache cache;
std::cout << cache.get_env("PATH") << std::endl;
}
You could invalidate cache entries in case you modify environment variables. You could also map directly to const char*, but that's up to you.
A process inherits the environment from the process that created it. This is held in memory.
Indeed, in C and C++ you can define main with an extra parameter that contains the environment - see http://www.gnu.org/software/libc/manual/html_node/Program-Arguments.html#Program-Arguments
Additionally, you can use extern char **environ; to access the array containing the environment (this array is null-terminated).
Therefore you do not need a cache. The environment variables are held in memory as an array.
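For instance, the whole environment can be walked directly (POSIX/GNU; a sketch):

#include <cstdio>

extern char **environ; // null-terminated array of "NAME=value" strings

int main()
{
    for (char **e = environ; *e != nullptr; ++e)
        std::printf("%s\n", *e);
    return 0;
}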