Accidental Complexity in OpenSSL HMAC functions - c++

OpenSSL Documentation Analysis
This question pertains to the usage of the HMAC routines in OpenSSL.
Since the OpenSSL documentation is a tad on the weak side in certain areas, profiling has revealed that using:
unsigned char *HMAC(const EVP_MD *evp_md, const void *key,
int key_len, const unsigned char *d, int n,
unsigned char *md, unsigned int *md_len);
from here, accounts for 40% of my library's runtime, all of it devoted to creating and tearing down HMAC_CTXs behind the scenes.
There are also two additional functions to create and destroy an HMAC_CTX explicitly:
HMAC_CTX_init() initialises a HMAC_CTX
before first use. It must be called.
HMAC_CTX_cleanup() erases the key and
other data from the HMAC_CTX and
releases any associated resources. It
must be called when an HMAC_CTX is no
longer required.
These two function calls are prefixed with:
The following functions may be used if
the message is not completely stored
in memory
My data fits entirely in memory, so I choose the HMAC function -- the one whose signature is shown above.
The context, as described by the man page, is used via the following two functions:
HMAC_Update() can be called repeatedly
with chunks of the message to be
authenticated (len bytes at data).
HMAC_Final() places the message
authentication code in md, which must
have space for the hash function
output.
The Scope of the Application
My application generates an authenticated (via HMAC, which is also used as a nonce), CBC-BF (Blowfish in CBC mode) encrypted protocol buffer string. The code will be interfaced with various web servers and frameworks: Windows / Linux as the OS; nginx, Apache and IIS as web servers; and Python / .NET and C++ web-server filters.
The description above should clarify that the library needs to be thread-safe, and potentially needs resumable processing state -- i.e., lightweight threads sharing an OS thread (which might leave thread-local memory out of the picture).
The Question
How do I get rid of the 40% overhead on each invocation in a (1) thread-safe / (2) resumable-state way? (2) is optional, since I have all of the source data present in one go and can make sure a digest is created in place without relinquishing control of the thread mid-digest-creation. So:
(1) can probably be done using thread-local memory -- but how do I reuse the CTXs? Does the HMAC_Final() call make the CTX reusable?
(2) optional: in this case I would have to create a pool of CTXs.
(3) how does the HMAC function do this? Does it create a CTX in the scope of the function call and destroy it?
Pseudocode and commentary will be useful.

The documentation for the HMAC_Init_ex() function in OpenSSL 0.9.8g says:
HMAC_Init_ex() initializes or reuses a
HMAC_CTX structure to use the function
evp_md and key key. Either can be
NULL, in which case the existing one
will be reused.
(Emphasis mine.) So this means that you can initialise an HMAC_CTX with HMAC_CTX_init() once, then keep it around to create multiple HMACs, as long as you don't call HMAC_CTX_cleanup() on it and you start off each HMAC with HMAC_Init_ex().
So yes, you should be able to do what you want with an HMAC_CTX in thread-local memory.
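To make that concrete, here is a minimal sketch of the pattern, assuming the legacy (pre-1.1.0) HMAC_CTX API quoted above and GCC's __thread extension for the thread-local context; the digest (EVP_sha1) and the function name are just placeholders:

#include <openssl/hmac.h>

// One context per thread, initialised lazily and never torn down per message.
static __thread HMAC_CTX tls_ctx;
static __thread bool tls_ctx_ready = false;

void hmac_digest(const void *key, int key_len,
                 const unsigned char *data, size_t data_len,
                 unsigned char out[EVP_MAX_MD_SIZE], unsigned int *out_len)
{
    if (!tls_ctx_ready) {
        HMAC_CTX_init(&tls_ctx);    // once per thread, as the man page requires
        tls_ctx_ready = true;
    }
    // Re-keying with HMAC_Init_ex() resets the context for a new message,
    // so no HMAC_CTX_cleanup()/HMAC_CTX_init() pair is needed on every call.
    HMAC_Init_ex(&tls_ctx, key, key_len, EVP_sha1(), NULL);
    HMAC_Update(&tls_ctx, data, data_len);
    HMAC_Final(&tls_ctx, out, out_len);
}

Each thread pays the HMAC_CTX_init() cost once instead of on every digest, which should remove the setup/teardown overhead the profiler was pointing at.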

If you aren't trying to restrict your dependencies, you could choose an HMAC implementation that is self-contained and requires the user to explicitly control all the aspects that OpenSSL is, in its documentation, vague about. Many such simple C/C++ alternatives exist, but it is up to you to choose and evaluate one.

Related

Returning thread-local data from a shared library C-api

Question 1: Is it safe and portable to return a pointer to thread_local data from a shared library providing a traditional C API?
The lib itself is naturally implemented in C++11. Safety refers to memory leaks and race conditions; portability covers the main desktop OSs: Windows, Linux and OSX. The calling application might be, for example, native, Java, C#, etc.
The use case is to implement a caller-friendly and thread-safe routine which returns data from the shared library. In this case subsequent calls will overwrite the thread-local buffer, and this drawback is preferred over requiring the caller to explicitly free the returned data using a library-provided "free_data()" function.
// For example as a return value:
const char* text = MYLIB_get_foo_info();
The shared lib guarantees that the returned data is valid until the next call of the specific API function by the same thread that originally received the data from the API; termination of the thread will invalidate (deallocate) it. Usage of the data is therefore in practice limited to that single thread, and should the API user want to use the data from other threads, or store it for later use, it must take a copy of the value in the thread that called the API function.
By definition, in this particular case one can safely assume that nothing will invalidate the data while the caller reads it. This is an exceptionally strong assumption indeed, based on the particular nature of the API and its very limited intended use. It is an option to later add another version of the API that does not require this assumption, should some need for it arise.
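For illustration, a minimal sketch of the library side under these assumptions; MYLIB_get_foo_info matches the call shown above, while build_foo_info() is a hypothetical internal helper:

#include <string>

// Hypothetical helper that produces the data to hand out.
static std::string build_foo_info() { return "foo info"; }

extern "C" const char* MYLIB_get_foo_info()
{
    // One buffer per calling thread: constructed on first use, destroyed at
    // thread exit. The next call from the same thread overwrites it.
    thread_local std::string buffer;
    buffer = build_foo_info();
    return buffer.c_str();
}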
Question 2: Is it guaranteed that when a library user (application) thread terminates, the TLS memory is deallocated at that moment?
For example if the returned string is static thread_local std::string in the library.
I have failed to find a clear and direct answer for this specific case (using TLS from a shared library).
I found two good articles about library API design, but they do not give any hints regarding TLS:
http://lucumr.pocoo.org/2013/8/18/beautiful-native-libraries/
https://anteru.net/blog/2016/05/01/3249/index.html
Before C++11, for example on Windows, one could use TlsAlloc(), but one had to specifically check for library-caller threads created before the lib was loaded.
Question 3: Am I correct that with C++11 thread_local one does not have this kind of issue anymore?
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686997(v=vs.85).aspx

Implementing Thread Local Storage in Software

We are porting an embedded application from Windows CE to a different system. The current processor is an STM32F4. Our current codebase heavily uses TLS. The new prototype is running KEIL CMSIS RTOS which has very reduced functionality.
On http://www.keil.com/support/man/docs/armcc/armcc_chr1359124216560.htm it says that thread local storage is supported since 5.04. Right now we are using 5.04. The problem is that when linking our program with a variable definition of __thread int a; the linker cannot find __aeabi_read_tp which makes sense to me.
My question is: Is it possible to implement __aeabi_read_tp and it will work or is there more to it?
If it simply is not possible for us: Is there a way to implement TLS only in software? Let's not talk about performance there for now.
EDIT
I tried implementing __aeabi_read_tp by looking at old FreeBSD source and other sources. While the function is mostly implemented in assembly, I found a version in C which boils down to this:
extern "C"
{
extern osThreadId svcThreadGetId(void);
void *__aeabi_read_tp()
{
return (void*)svcThreadGetId();
}
}
What this basically does is give me the ID (void*) of my currently executing thread. If I understand correctly that is what we want. Can this possibly work?
Not considering performance, and not going into CMSIS RTOS specifics (which are unknown to me): you can allocate the space needed for your variables -- either on the heap, or as a static or global variable -- and I would suggest an array of structures. Then, when you create a thread, pass a pointer to the next unused structure to your thread function.
In the case of a static or global variable, it helps to know how many threads will be working in parallel, so you can limit the size of the preallocated memory.
EDIT: Added sample of TLS implementation based on pthreads:
#include <pthread.h>

#define MAX_PARALLEL_THREADS 10

struct tls_data { /* Per-thread fields - code omitted. */ };

static pthread_t threads[MAX_PARALLEL_THREADS];
static struct tls_data tls_data[MAX_PARALLEL_THREADS];
static int tls_data_free_index = 0;

static void *worker_thread(void *arg) {
    struct tls_data *data = (struct tls_data *) arg; /* not static, or all threads would share it */
    /* Code omitted. */
    return NULL;
}

static int spawn_thread() {
    if (tls_data_free_index >= MAX_PARALLEL_THREADS) {
        // Consider increasing MAX_PARALLEL_THREADS
        return -1;
    }
    /* Prepare thread data - code omitted. */
    pthread_create(&threads[tls_data_free_index], NULL,
                   worker_thread, &tls_data[tls_data_free_index]);
    return tls_data_free_index++;
}
The not-so-impressive solution is a std::map<threadID, T>. It needs to be wrapped with a mutex so that new threads can be added safely.
For something more convoluted, see this idea
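As a rough sketch of that map-based fallback (assuming std::map and std::mutex are usable on your toolchain; the thread-ID type is whatever your RTOS gives you, e.g. the osThreadId from the question):

#include <map>
#include <mutex>

template <typename ThreadId, typename T>
class SoftwareTls
{
public:
    // Returns the calling thread's slot, default-constructing it on first use.
    // The caller passes its own ID, e.g. from svcThreadGetId() on CMSIS RTOS.
    T& get(ThreadId id)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_[id];
    }

private:
    std::mutex mutex_;
    std::map<ThreadId, T> slots_;
};

Every access takes the lock here; a real implementation would probably cache the returned pointer per thread after the first lookup.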
I believe this is possible, but probably tricky.
Here's a paper describing how __thread or thread_local behaves in ELF images (though it doesn't talk about ARM architecture for AEABI):
https://www.akkadia.org/drepper/tls.pdf
The executive summary is:
The linker creates .tbss and/or .tdata sections in the resulting executable to provide a prototype image of the thread local data needed for each thread.
At runtime, each thread control block (TCB) has a pointer to a dynamic thread-local vector table (dtv in the paper) that contains the thread-local storage for that thread. It is lazily allocated and initialized the first time a thread attempts to access a thread-local variable. (presumably by __aeabi_read_tp())
Initialization copies the prototype .tdata image and memsets the .tbss image into the allocated storage.
When source code accesses a thread-local variable, the compiler generates code to read the thread pointer via __aeabi_read_tp() and to do all the appropriate indirection to get at the storage for that variable.
The compiler and linker are doing all the work you'd expect them to, but you need to initialize and return a "thread pointer" that is structured and filled out the way the compiler expects, because the generated instructions follow the hops directly.
There are a few ways that TLS variables are accessed, as mentioned in this paper, which, again, may or may not totally apply to your compiler and architecture:
http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
But, the problems are roughly the same. When you have runtime-loaded libraries that may bring their own .tbss and .tdata sections, it gets more complicated. You have to expand the thread-local storage for any thread that suddenly tries to access a variable introduced by a library loaded after the storage for that thread was initialized. The compiler has to generate different access code depending on where the TLS variable is declared. You'd need to handle and test all the cases you would want to support.
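For orientation, the source-level trigger for all of this machinery is just a declaration like the one below; on ARM EABI the compiler turns each access into a call to __aeabi_read_tp() plus the offset arithmetic described above (a sketch, not tied to any particular toolchain):

// Initialised thread-locals go into .tdata; zero-initialised ones into .tbss.
__thread int per_thread_counter = 42;

int bump()
{
    // Compiles into: fetch the thread pointer, add this variable's offset,
    // then load, increment and store.
    return ++per_thread_counter;
}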
It's years later, so you probably already solved or didn't solve your problem. In this case, it is (was) probably easiest to use your OS's TLS API directly.

Check pointer handle is valid

I want to implement a Microsoft Cryptographic Service Provider (CSP) library, and I am currently thinking about the best way to deal with the context handles I create.
My question is specific to this case but the design approach can be used in other situations.
I come from a managed-code background and I am not 100% sure about multithreaded pointer handling in C/C++.
In general there are two functions responsible for handle creation and destruction (CryptAcquireContext, CryptReleaseContext), and all subsequent CSP functions use the handle which is returned by the creator function.
I didn't find any concrete information or specification from Microsoft which gives a design approach or rules for how to do it. But I did research other CSP providers created by Microsoft to find out the design rules, which are:
The functions must be thread safe
The context handle will not be shared between threads
If a context handle is not valid, return with an error
Other MS CSP providers return a valid pointer as the handle, or NULL if not.
I don't think that the calling application will pass complete garbage, but it could happen that it passes a handle which has already been released, and my library should return with an error.
This brought me to three ideas how to implement that:
Just allocate memory for my context struct with malloc or new and return the raw pointer as the handle.
I can expect that the applications which call my library will pass a valid handle. But if not, my library will run into undefined behaviour. So I need a better solution.
Add every pointer I create to a list (std::list, std::map), so I can iterate the list to check whether a pointer exists. Access to the list is guarded with a mutex.
This should be safe, and regular API usage shouldn't be a performance issue. But in a Terminal Server scenario it could be: in that case the Windows process lsass.exe creates a CSP context in a separate thread for every user who wants to log in and makes around 10 API calls per context.
The design goal is that my library should be able to handle 300 clients in parallel. I don't know how many threads are created by Windows in this case.
So if possible I would prefer a lock-free implementation.
Allocate a basic struct which holds a check value and a pointer to the actual data, and use the pointer to this struct as the context handle.
typedef struct CSPHandle
{
    int Type;                   // e.g. magic number CSPContext = 0xA1B2C3D4
    CSPContextPtr pCSPContext;
} CSPHandle;
So I could read the first field of the passed pointer and check whether it equals my defined type. And I have full control over the actual data pointer, which is set to NULL once the context is released. Is this a good or a bad idea?
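A sketch of that check, reusing the struct above; the magic constant, the helper name and the error handling are placeholders:

#define CSP_CONTEXT_MAGIC 0xA1B2C3D4

// Returns the inner context if the handle looks valid, NULL otherwise.
static CSPContextPtr resolve_handle(void *handle)
{
    CSPHandle *h = (CSPHandle *) handle;
    if (h == NULL || h->Type != (int) CSP_CONTEXT_MAGIC || h->pCSPContext == NULL)
        return NULL;    // caller maps this to an appropriate CSP error code
    return h->pCSPContext;
}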
What are your thoughts about this case? Should I go with one of these approaches or is there a other solution?
Thanks
I found a solution and will answer my own question.
I overlooked a small but important detail:
In CSP there are no direct API calls to the DLL (load library, get function pointer, call function), because the function calls are forwarded by the Microsoft CSP, which loads the CSP library by name.
So the Microsoft CSP needs to know and to check the passed context in order to get a correct mapping to the specific library.
Example:
1. client->cryptacquirecontext(in cspname, out ctx)
2. MS CSP->loads the library from the cspname
3. MS CSP->calls the function pointer of the loaded library
4. CSP LIB->cryptacquirecontext creates new context
5. MS CSP->receives the returned csp handle and saves it to the dll mapping
6. MS CSP->returns the result to the calling application
7. client->cryptsetprovparam(ctx) // which was created before
8. MS CSP->checks if the context exists and which library is responsible
9. MS CSP->if the given context can not be mapped to a csp dll an error will be returned, because the MS CSP doesn’t know which function pointer should be called.
So in this case it should be sufficient just to allocate memory. If the client application passes an invalid context handle, it will never hit the CSP library.
I think that the MS CSP uses a list with a mutex guard to store the context mappings, because the context can be anything from a random number to a valid pointer.

Why do thread creation methods take an argument?

All thread creation methods, like pthread_create() or CreateThread() on Windows, expect the caller to provide a pointer to the arg for the thread. Isn't this inherently unsafe?
This can work 'safely' only if the arg is in the heap, and then again creating a heap variable
adds to the overhead of cleaning the allocated memory up. If a stack variable is provided as the arg then the result is at best unpredictable.
This looks like a half-cooked solution to me, or am I missing some subtle aspect of the APIs?
Context.
Many C APIs provide an extra void * argument so that you can pass context through third-party APIs. Typically you might pack some information into a struct and point this argument at the struct, so that when the thread initializes and begins executing it has more information than just the particular function it was started with. There's no necessity to keep this information at the location given. For instance, you might have several fields that tell the newly created thread what it will be working on and where it can find the data it will need. Furthermore, there's no requirement that the void * actually be used as a pointer; it's a typeless argument with the most appropriate width on a given architecture (pointer width), so anything can be made available to the new thread. For instance, you might pass an int directly if sizeof(int) <= sizeof(void *): (void *)3.
As a related example of this style: a FUSE filesystem I'm currently working on starts by opening a filesystem instance, say struct MyFS. When running FUSE in multithreaded mode, threads arrive at a series of FUSE-defined calls for handling open, read, stat, etc. Naturally these can have no advance knowledge of the actual specifics of my filesystem, so this is passed in the fuse_main void * argument intended for this purpose: struct MyFS *blah = myfs_init(); fuse_main(..., blah);. Now when the threads arrive at the FUSE calls mentioned above, the void * received is converted back into a struct MyFS * so that the call can be handled within the context of the intended MyFS instance.
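For completeness, here is a minimal sketch of the generic pattern outside FUSE, with a hypothetical WorkItem struct as the context:

#include <pthread.h>
#include <cstdio>

struct WorkItem
{
    const char *name;
    int count;
};

// Thread entry point: recover the typed context from the void* argument.
static void *worker(void *arg)
{
    WorkItem *item = static_cast<WorkItem *>(arg);
    std::printf("%s: %d\n", item->name, item->count);
    return NULL;
}

int main()
{
    WorkItem item = { "job", 42 };   // must stay alive while the thread uses it
    pthread_t tid;
    pthread_create(&tid, NULL, worker, &item);
    pthread_join(tid, NULL);         // joining keeps 'item' in scope long enough
    return 0;
}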
Isn't this inherently unsafe?
No. It is a pointer. Since you (as the developer) have created both the function that will be executed by the thread and the argument that will be passed to the thread you are in full control. Remember this is a C API (not a C++ one) so it is as safe as you can get.
This can work 'safely' only if the arg is in the heap,
No. It is safe as long as its lifetime in the parent thread covers the whole period during which the child thread may use it. There are many ways to make sure that it lives long enough.
and then again creating a heap variable adds to the overhead of cleaning the allocated memory up.
Seriously, that's an argument? This is basically how it is done for all threads, unless you are passing something much simpler like an integer (see below).
If a stack variable is provided as the arg then the result is at best unpredictable.
It's as predictable as you (the developer) make it. You created both the thread and the argument. It is your responsibility to make sure that the lifetime of the argument is appropriate. Nobody said it would be easy.
This looks like a half-cooked solution to me, or am I missing some subtle aspects of the APIs?
You are missing that this is the most basic of threading APIs. It is designed to be as flexible as possible so that safer systems can be built on top of it with as few strings attached as possible. So we now have boost::thread, which I would guess is built on top of these basic threading facilities but provides a much safer and easier-to-use infrastructure (at some extra cost).
If you want raw, unfettered speed and flexibility, use the C API (with some danger).
If you want something slightly safer, use a higher-level API like boost::thread (slightly more costly).
Thread specific storage with no dynamic allocation (Example)
#include <pthread.h>
#include <iostream>

struct ThreadData
{
    // Stuff for my thread.
};

ThreadData threadData[5];

extern "C" void* threadStart(void* data);

void* threadStart(void* data)
{
    intptr_t id = reinterpret_cast<intptr_t>(data);
    ThreadData& tData = threadData[id];
    // Do Stuff
    return NULL;
}

int main()
{
    for (intptr_t loop = 0; loop < 5; ++loop)
    {
        pthread_t threadInfo; // Not good; just makes the example quick to write.
        pthread_create(&threadInfo, NULL, threadStart, reinterpret_cast<void*>(loop));
    }
    // You should wait here for threads to finish before exiting.
}
Allocation on the heap does not add a lot of overhead.
Besides the heap and the stack, global variable space is another option. Also, it's possible to use a stack frame that will last as long as the child thread. Consider, for example, local variables of main.
I favor putting the arguments to the thread in the same structure as the pthread_t object itself. So wherever you put the pthread record, put its arguments as well. Problem solved :v) .
This is a common idiom in all C programs that use function pointers, not just for creating threads.
Think about it. Suppose your function void f(void (*fn)()) simply calls into another function. There's very little you can actually do with that. Typically a function pointer has to operate on some data. Passing in that data as a parameter is a clean way to accomplish this, without, say, the use of global variables. Since the function f() doesn't know what the purpose of that data might be, it uses the ever-generic void * parameter, and relies on you the programmer to make sense of it.
If you're more comfortable with thinking in terms of object-oriented programming, you can also think of it like calling a method on a class. In this analogy, the function pointer is the method and the extra void * parameter is the equivalent of what C++ would call the this pointer: it provides you some instance variables to operate on.
The pointer is a pointer to the data that you intend to use in the function. Windows style APIs require that you give them a static or global function.
Often this is a pointer to the class instance you intend to use, a pointer to this (or pThis, if you will), and the intention is that you delete pThis after the thread ends.
It's a very procedural approach; however, it has a very big advantage which is often overlooked: the CreateThread C-style API is binary compatible, so when you wrap this API with a C++ class (or almost any other language) you can actually do so. If the parameter were typed, you wouldn't be able to access it from another language as easily.
So yes, this is unsafe but there's a good reason for it.

pthread_key_t and pthread_once_t?

Starting out with pthreads, I cannot understand what the business is with pthread_key_t and pthread_once_t.
Would someone explain in simple terms with examples, if possible?
thanks
pthread_key_t is for creating thread-local storage: each thread gets its own copy of a data variable, instead of all threads sharing a global (or function-static, class-static) variable. The TLS is indexed by a key. See pthread_getspecific et al. for more details.
pthread_once_t is a control for executing a function only once with pthread_once. Suppose you have to call an initialization routine, but you must only call that routine once. Furthermore, the point at which you must call it is after you've already started up multiple threads. One way to do this would be to use pthread_once(), which guarantees that your routine will only be called once, no matter how many threads try to call it at once, so long as you use the same control variable in each call. It's often easier to use pthread_once() than it is to use other alternatives.
No, it can't be explained in layman terms. Laymen cannot successfully program with pthreads in C++. It takes a specialist known as a "computer programmer" :-)
pthread_once_t is a little bit of storage which pthread_once must access in order to ensure that it does what it says on the tin. Each once control will allow an init routine to be called once, and once only, no matter how many times it is called from how many threads, possibly concurrently. Normally you use a different once control for each object you're planning to initialise on demand in a thread-safe way. You can think of it in effect as an integer which is accessed atomically as a flag whether a thread has been selected to do the init. But since pthread_once is blocking, I guess there's allowed to be a bit more to it than that if the implementation can cram in a synchronisation primitive too (the only time I ever implemented pthread_once, I couldn't, so the once control took any of 3 states (start, initialising, finished). But then I couldn't change the kernel. Unusual situation).
pthread_key_t is like an index for accessing thread-local storage. You can think of each thread as having a map from keys to values. When you add a new entry to TLS, pthread_key_create chooses a key for it and writes that key into the location you specify. You then use that key from any thread, whenever you want to set or retrieve the value of that TLS item for the current thread. The reason TLS gives you a key instead of letting you choose one, is so that unrelated libraries can use TLS, without having to co-operate to avoid both using the same value and trashing each others' TLS data. The pthread library might for example keep a global counter, and assign key 0 for the first time pthread_key_create is called, 1 for the second, and so on.
Wow, the other answers here are way too verbose.
pthread_once_t stores state for pthread_once(). Calling pthread_once(&s, fn) calls fn and sets the value pointed to by s to record the fact that it has been executed. All subsequent calls to pthread_once() with the same control are no-ops. The name should become obvious now.
pthread_once_t should be initialized to PTHREAD_ONCE_INIT.
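A small sketch tying the two together: pthread_once() creates the key exactly once, and each thread then stores its own value under it (the int payload and the thread count are arbitrary):

#include <pthread.h>
#include <cstdint>
#include <cstdio>

static pthread_key_t key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void destroy_slot(void *p) { delete static_cast<int *>(p); }

// Runs exactly once, no matter how many threads race into pthread_once().
static void make_key() { pthread_key_create(&key, destroy_slot); }

static void *thread_fn(void *arg)
{
    pthread_once(&key_once, make_key);   // safe to call from every thread
    int *slot = new int(static_cast<int>(reinterpret_cast<std::intptr_t>(arg)));
    pthread_setspecific(key, slot);      // this thread's private value
    std::printf("my value: %d\n", *static_cast<int *>(pthread_getspecific(key)));
    return NULL;
}

int main()
{
    pthread_t t[2];
    for (std::intptr_t i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, thread_fn, reinterpret_cast<void *>(i));
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}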