We are undergoing PCI PA-DSS certification, and one of its requirements is to avoid writing the clear PAN (card number) to disk. The application does not write such information to disk, but if the operating system (Windows, in this case) needs to swap, the memory contents are written to the page file. Therefore the application must wipe its memory to prevent RAM-capture tools from reading sensitive data.
There are three situations to handle:
heap allocation (malloc): the area can be wiped with memset before the memory is freed
static or global data: the area can be wiped with memset after it is used
local data (function variables): the data lives on the stack and is no longer accessible after the function returns
For example:
#include <string.h>

void test()
{
    char card_number[17];
    strcpy(card_number, "4000000000000000");
}
After test executes, the memory still contains the card_number data.
One instruction can zero the variable card_number at the end of test, but this would have to be done in every function in the program:
memset(card_number, 0, sizeof(card_number));
Is there a way to clean up the stack at some point, like right before the program finishes?
Cleaning the stack right when the program finishes might be too late: the data could already have been swapped out at any point during the program's runtime. You should keep your sensitive data only in memory locked with VirtualLock so it does not get swapped out. This has to happen before the sensitive data is read into that memory.
There is a small limit on how much memory you can lock this way, so you probably cannot lock the whole stack and should avoid storing sensitive data on the stack at all.
I assume you want to get rid of this situation below:
#include <iostream>
#include <cstring>

using namespace std;

void test()
{
    char card_number[17];
    strcpy(card_number, "1234567890123456");
    cout << "test() -> " << card_number << endl;
}

void test_trash()
{
    // don't initialize, so we get the trash left over from the previous call to test()
    char card_number[17];
    cout << "trash from previous function -> " << card_number << endl;
}

int main(int argc, const char * argv[])
{
    test();
    test_trash();
    return 0;
}
Output:
test() -> 1234567890123456
trash from previous function -> 1234567890123456
You CAN do something like this:
#include <iostream>
#include <cstring>

using namespace std;

class CardNumber
{
    char card_number[17];
public:
    CardNumber(const char * value)
    {
        strncpy(card_number, value, sizeof(card_number));
    }

    virtual ~CardNumber()
    {
        // as suggested by @piedar, memset_s(), so the compiler
        // doesn't optimize it away. (memset_s requires C11 Annex K
        // support; explicit_bzero() or a volatile loop are alternatives
        // where it is unavailable.)
        memset_s(card_number, sizeof(card_number), 0, sizeof(card_number));
    }

    const char * operator()()
    {
        return card_number;
    }
};

void test()
{
    CardNumber cardNumber("1234567890123456");
    cout << "test() -> " << cardNumber() << endl;
}

void test_trash()
{
    // don't initialize, so we get the trash left over from the previous call to test()
    char card_number[17];
    cout << "trash from previous function -> " << card_number << endl;
}

int main(int argc, const char * argv[])
{
    test();
    test_trash();
    return 0;
}
Output:
test() -> 1234567890123456
trash from previous function ->
You can do something similar to clean up memory on the heap or static variables.
Obviously, we assume the card number will come from a dynamic source instead of being hard-coded...
AND YES, to explicitly answer the title of your question: the stack will not be cleaned automatically... you have to clean it yourself.
I believe it is necessary, but this is only half of the problem.
There are two issues here:
In principle, nothing prevents the OS from swapping your data out while you are still using it. As pointed out in the other answer, you want VirtualLock on Windows and mlock on Linux.
You need to prevent the optimizer from optimizing out the memset. This also applies to global and dynamically allocated memory. I strongly suggest taking a look at cryptopp SecureWipeBuffer.
In general, you should avoid doing it manually, as it is an error-prone procedure. Instead, consider using a custom allocator or a custom class template for secure data that is wiped in the destructor.
The stack is cleaned up by moving the stack pointer, not by actually popping values from it; the only real mechanics are popping the return address into the appropriate register. You must do all the wiping manually. Also, volatile can help you avoid optimizations on a per-variable basis. You could pop the stack clean manually, but you would need assembler to do that, and it is not so simple to start manipulating the stack: it is not actually your resource; as far as you are concerned, the compiler owns it.
Related
In C++ we can ensure foo is called when we exit a scope by putting foo() in the destructor of a local object. That's what I think of when I hear "scope guard." There are plenty of generic implementations.
I'm wondering—just for fun—if it's possible to achieve the behavior of a scope guard with zero overhead compared to just writing foo() at every exit point.
Zero overhead, I think:
{
try {
do_something();
} catch (...) {
foo();
throw;
}
foo();
}
Overhead of at least 1 byte to give the scope guard an address:
{
scope_guard<foo> sg;
do_something();
}
Do compilers optimize away giving sg an address?
A slightly more complicated case:
{
Bar bar;
try {
do_something();
} catch (...) {
foo(bar);
throw;
}
foo(bar);
}
versus
{
Bar bar;
scope_guard<[&]{foo(bar);}> sg;
do_something();
}
The lifetime of bar entirely contains the lifetime of sg and its held lambda (destructors are called in reverse order) but the lambda held by sg still has to hold a reference to bar. I mean for example int x; auto l = [&]{return x;}; gives sizeof(l) == 8 on my 64-bit system.
Is there maybe some template metaprogramming magic that achieves the scope_guard sugar without any overhead?
If by overhead you mean how much space the scope-guard variable occupies, then zero overhead is possible if the functional object is a compile-time value. I've coded a small snippet to illustrate this:
Try it online!
#include <iostream>

// Note: "template <auto F>" requires C++17; passing a lambda as the
// non-type template parameter in the second block requires C++20.
template <auto F>
class ScopeGuard {
public:
    ~ScopeGuard() { F(); }
};

void Cleanup() {
    std::cout << "Cleanup func..." << std::endl;
}

int main() {
    {
        char a = 0;
        ScopeGuard<&Cleanup> sg;
        char b = 0;
        std::cout << "Stack difference "
            << int(&a - &b - sizeof(char)) << std::endl;
    }
    {
        auto constexpr f = []{
            std::cout << "Cleanup lambda..." << std::endl; };
        char a = 0;
        ScopeGuard<f> sg;
        char b = 0;
        std::cout << "Stack difference "
            << int(&a - &b - sizeof(char)) << std::endl;
    }
}
Output:
Stack difference 0
Cleanup func...
Stack difference 0
Cleanup lambda...
The code above doesn't allocate even a single byte on the stack, because a class variable with no fields occupies 0 bytes of stack; this is one of the obvious optimizations done by any compiler. Of course, if you take a pointer to such an object, the compiler is obliged to give it a 1-byte memory location; but in your case you don't take the address of the scope guard.
You can see that not a single byte is occupied by following the Try it online! link above the code, which shows the assembler output of Clang.
To have no fields at all, the scope guard class should only use a compile-time function object, such as a global function pointer or a capture-less lambda. These two kinds of objects are used in my code above.
In the code above you can even see that I printed the stack difference of char variables placed before and after the scope guard variable, to show that the scope guard actually occupies 0 bytes.
Let's go a bit further and make it possible to use non-compile-time functional objects.
For this we again create a class with no fields, but now store all functional objects inside one shared vector with thread-local storage.
Again, as the class has no fields and we never take a pointer to the scope guard object, the compiler doesn't allocate a single byte for it on the stack.
Instead, a single shared vector is allocated on the heap. This way you can trade stack storage for heap storage if you're short of stack memory.
Having a shared vector also lets us use as little memory as possible, because the vector only holds as many elements as there are nested blocks currently using a scope guard. If all scope guards are located sequentially in different blocks, the vector will hold just one element at a time, using just a few bytes of memory for all the scope guards that were used.
Why is the heap memory of a shared vector more economical than stack storage for the guards? Because with stack storage, if you have several sequential blocks of guards:
void test() {
    {
        ScopeGuard sg(f0);
    }
    {
        ScopeGuard sg(f1);
    }
    {
        ScopeGuard sg(f2);
    }
}
then all 3 guards together occupy a triple amount of stack memory, because for each function like test() above the compiler allocates stack space for all of the function's variables at once; so for 3 guards it allocates a triple amount.
With the shared vector, the test() function above uses at most 1 vector element at a time, so the vector has a size of 1 at most and stores only a single functional object's worth of memory.
Hence if you have many non-nested scope guards inside one function, the shared vector is much more economical.
Below I present a code snippet for the shared-vector approach, with zero fields and zero stack-memory overhead. To recap, this approach allows non-compile-time functional objects, unlike the solution in part one of my answer.
Try it online!
#include <iostream>
#include <vector>
#include <functional>

class ScopeGuard2 {
public:
    static auto & Funcs() {
        thread_local std::vector<std::function<void()>> funcs_;
        return funcs_;
    }
    ScopeGuard2(std::function<void()> f) {
        Funcs().emplace_back(std::move(f));
    }
    ~ScopeGuard2() {
        Funcs().back()();
        Funcs().pop_back();
    }
};

void Cleanup() {
    std::cout << "Cleanup func..." << std::endl;
}

int main() {
    {
        ScopeGuard2 sg(&Cleanup);
    }
    {
        auto volatile x = 123;
        auto const f = [&]{
            std::cout << "Cleanup lambda... x = "
                << x << std::endl;
        };
        ScopeGuard2 sg(f);
    }
}
Output:
Cleanup func...
Cleanup lambda... x = 123
It's not exactly clear what you mean by 'zero overhead' here.
Do compilers optimize away giving sg an address?
Most likely, modern mainstream compilers will do it when run in optimizing modes. Unfortunately, that is as definite as it gets. It depends on the environment and has to be tested to be relied upon.
If the question is whether there is a guaranteed way to avoid <anything> in the resulting assembly, the answer is negative. As @Peter said in the comments, the compiler is allowed to do anything that produces an equivalent result. It may never call foo() at all, even if you write it there verbatim, when it can prove that nothing in the observable program behavior would change.
I currently have a memory issue when using the Botan library (version 2.15) for cryptography functions within a C++ project. My development environment is Solus Linux 4.1 (kernel-current), but I could observe this issue on Debian Buster too.
I observed that some memory allocated internally by Botan for its calculations is not deallocated when going out of scope. When I called Botan::HashFunction, Botan::StreamCipher and Botan::scrypt multiple times, always going out of scope in between, the memory footprint increased steadily.
For example, consider this code:
#include <iostream>
#include <string>
#include <vector>
#include "botan/scrypt.h"

void pause() {
    char ch;
    std::cout << "Insert any key to proceed... ";
    std::cin >> ch;
}

std::vector<uint8_t> get_scrypt_passhash(std::string const& password, std::string const& salt) {
    std::vector<uint8_t> key (32);
    Botan::scrypt(key.data(), key.size(), password.c_str(), password.length(), salt.c_str(), salt.length(), 65536, 32, 1);
    std::cout << "From function: before closing.\n";
    pause();
    return key;
}

int main(int argc, char *argv[]) {
    std::cout << "Beginning test.\n";
    pause();
    auto pwhashed = get_scrypt_passhash(argv[1], argv[2]);
    std::cout << "Test ended.\n";
    pause();
}
I used the pause() function to observe the memory consumption (I ran top/pmap and watched KSysGuard during the pause). When pause() is called from within get_scrypt_passhash before it returns, the used memory (reported both by top/pmap and KSysGuard) is about 2 MB more than at the beginning, and it stays that way after the function returns.
I tried to dive into the Botan source code, but I cannot find memory leaks or the like. Valgrind also reported that all allocated bytes were freed, so an actual memory leak seems unlikely.
Just for information, I tried the same functionality with Crypto++ without observing this behavior.
Has anyone experienced the same issue? Is there a way to fix it?
I am trying to write object-oriented C++ code that is parallelized with OpenACC.
I was able to find some Stack Overflow questions and GTC talks on OpenACC, but I could not find any real-world examples of object-oriented code.
In this question an example of an OpenACCArray was shown that does some memory management in the background (code available at http://www.pgroup.com/lit/samples/gtc15_S5233.tar).
However, I am wondering if it is possible to create a class that manages the arrays on a higher level. E.g.
struct Data
{
    // OpenACCArray<float> a;
    OpenACCArray<Vector3<float>> a3;

    Data(size_t len) {
        #pragma acc enter data copyin(this)
        // a.resize(len);
        a3.resize(len);
    }
    ~Data() {
        #pragma acc exit data delete(this)
    }
    void update_device() {
        // a.update_device();
        a3.update_device();
    }
    void update_host() {
        // a.update_host();
        a3.update_host();
    }
};

int main(int argc, char *argv[])
{
    const size_t len = 32*128;
    Data d(len);
    d.update_device();
    #pragma acc kernels loop independent present(d)
    for (int i=0; i < len; ++i) {
        float val = (float)i/(float)len;
        d.a3[i].x = val;
        d.a3[i].y = i;
        d.a3[i].z = d.a3[i].x / d.a3[i].y;
    }
    d.update_host();
    for (int i=0; i < len/128; ++i) {
        cout << i << ": " << d.a3[i].x << "," << d.a3[i].y << "," << d.a3[i].z << endl;
    }
    cout << endl;
    return 0;
}
Interestingly this program works, but as soon as I uncomment OpenACCArray<float> a;, i.e. add another member to that Data struct, I get memory errors.
FATAL ERROR: variable in data clause is partially present on the device.
Since the OpenACCArray struct is a flat structure that handles the pointer indirections on its own, shouldn't it work to copy it as a member?
Or does it need to be a pointer to the struct, with the pointers hardwired with directives?
Then I fear I'd have the problem of needing alias pointers, as suggested by Jeff Larkin in the above-mentioned question.
I don't mind doing the work to get this running, but I cannot find any reference on how to do that.
Using the compiler flags keepgpu,keepptx helps a bit to understand what the compiler is doing, but I would prefer an alternative to reverse engineering the generated PTX code.
Any pointers to helpful reference project or documents are highly appreciated.
In the OpenACCArray1.h header, remove the two "#pragma acc enter data create(this)" pragmas. What's happening is that the "Data" constructor already creates the "a" and "a3" objects on the device; hence, when the second enter data region is encountered in the OpenACCArray constructor, the device this pointer is already there.
It works when there is only one data member, since "a3" and "Data" share the same address for the this pointer. When the second enter data pragma is encountered, the present check sees that it is already on the device, so it doesn't create it again. When "a" is added, the size of "Data" is twice that of "a", so the present check sees that the this pointer is already there but with a different size than before. That's what the "partially present" error means: the data is there, but with a different size than expected.
Only the parent class/struct should create the this pointer on the device.
Hope this helps,
Mat
Is it possible to modify the call stack in C++? (I realize this is a horrible idea and am really just wondering; I don't plan on actually doing this.)
For example:
void other();

void foo(){
    other();
    cout << "You never see this" << endl; // The other() function modifies the stack to
                                          // point to whatever called this function...
                                          // so this is not displayed
}

void other(){
    // modify the stack pointer here somehow to go down 2 levels
}

// Elsewhere
foo();
When a function calls another one in typical C implementations, the processor stack and the call opcode are used. The call has the effect of pushing the address of the next instruction to execute onto the processor stack. Usually, besides the return address, the value of the stack frame pointer is also pushed.
So the stack contains:
...free_space... [local_variables] [framePtr] [returnAddr] PREVIOUS_STACK.
So in order to change the return address (you need to know its size; if you compile e.g. with -m64, it will be 64 bits), you may take the address of a local variable and add an offset to it in order to arrive at the address of the return pointer, and change it.
The code below was compiled with g++ in -m64 mode.
If by chance it also works for you, you will see the effect.
#include <stdio.h>

void changeRetAddr(long* p){
    p -= 2;
    *p += 0x11;
}

void myTest(){
    long a = 0x1122334455667788;
    changeRetAddr(&a);
    printf("hi my friend\n");
    printf("I didn't show the salutation\n");
}

int main(int argc, char **argv)
{
    myTest();
    return 0;
}
I'm writing an error handler for some code I'm working on, in C++. I would like to be able to make some sort of reference to whatever is on the stack, without it being explicitly passed to me. Specifically, let's say I want to print the names of the functions on the call stack, in order. This is trivial in managed runtime environments like the JVM, but probably not so trivial with 'simple' compiled code. Can I do this?
Notes:
Assume for simplicity that I compile my code with debugging information and no optimization.
I want to write something that is either platform-independent or multi-platform. Much prefer the former.
If you think I'm trying to reinvent the wheel, just link to the source of the relevant wheel and I'll look there.
Update:
I can't believe how much you need to bend over backwards to do this... almost makes me pine for another language which shall not be mentioned.
There is a way to get a backtrace in C++, though it is not portable. I cannot speak for Windows, but on Unix-like systems there is a backtrace API that consists primarily of the following functions:
int backtrace(void** array, int size);
char** backtrace_symbols(void* const* array, int size);
void backtrace_symbols_fd(void* const* array, int size, int fd);
You can find up to date documentation and examples on GNU website here. There are other sources, like this manual page for OS X, etc.
Keep in mind that there are a few problems with getting a backtrace using this API. Firstly, there are no file names and no line numbers. Secondly, you cannot even get a backtrace in certain situations, for example if the frame pointer is omitted entirely (the default behavior of recent GCC compilers on x86_64 platforms). Or the binary may not have any debug symbols whatsoever. On some systems, you also have to specify the -rdynamic flag when compiling your binary (which has other, possibly undesirable, effects).
Unfortunately, there is no built-in way of doing this with the standard C++. You can construct a system of classes to help you build a stack tracer utility, but you would need to put a special macro in each of the methods that you would like to trace.
I've seen it done (and even implemented parts of it) using the strategy outlined below:
Define your own class that stores the information about a stack frame. At the minimum, each node should contain the name of the function being called, file name / line number info being close second.
Stack frame nodes are stored in a linked list, which is reused if it exists, or created if it does not exist
A stack frame is created and added to the list by instantiating a special object. Object's constructor adds the frame node to the list; object's destructor deletes the node from the list.
The same constructor/destructor pair are responsible for creating the list of frames in thread local storage, and deleting the list that it creates
The construction of the special object is handled by a macro. The macro uses special preprocessor tokens to pass function identification and location information to the frame creator object.
Here is a rather skeletal proof-of-concept implementation of this approach:
#include <iostream>
#include <list>

using namespace std;

struct stack_frame {
    const char *funName;
    const char *fileName;
    int line;

    stack_frame(const char* func, const char* file, int ln)
        : funName(func), fileName(file), line(ln) {}
};

thread_local list<stack_frame> *frames = 0;

struct entry_exit {
    bool delFrames;

    entry_exit(const char* func, const char* file, int ln) {
        if (!frames) {
            frames = new list<stack_frame>();
            delFrames = true;
        } else {
            delFrames = false;
        }
        frames->push_back(stack_frame(func, file, ln));
    }
    ~entry_exit() {
        frames->pop_back();
        if (delFrames) {
            delete frames;
            frames = 0;
        }
    }
};

void show_stack() {
    for (list<stack_frame>::const_iterator i = frames->begin() ; i != frames->end() ; ++i) {
        cerr << i->funName << " - " << i->fileName << " (" << i->line << ")" << endl;
    }
}

#define FUNCTION_ENTRY entry_exit _entry_exit_(__func__, __FILE__, __LINE__);

void foo() {
    FUNCTION_ENTRY;
    show_stack();
}

void bar() {
    FUNCTION_ENTRY;
    foo();
}

void baz() {
    FUNCTION_ENTRY;
    bar();
}

int main() {
    baz();
    return 0;
}
The above code compiles with C++11 and prints this:
baz - prog.cpp (52)
bar - prog.cpp (48)
foo - prog.cpp (44)
Functions that do not have that macro would be invisible on the stack. Performance-critical functions should not have such macros.
Here is a demo on ideone.
It is not easy. The exact solution depends very much on the OS and execution environment.
Printing the stack is usually not that difficult, but finding symbols can be quite tricky, since it usually means reading debug symbols.
An alternative is to use an intrusive approach and add some "where am I" type code to each function (presumably for debug builds only):
#ifdef DEBUG

#include <vector>
#include <iostream>

struct StackEntry
{
    const char *file;
    const char *func;
    int line;
    StackEntry(const char *f, const char *fn, int ln) : file(f), func(fn), line(ln) {}
};

// std::stack cannot be iterated, so use a std::vector as the call stack.
std::vector<StackEntry> call_stack;

class FuncEntry
{
public:
    FuncEntry(const char *file, const char *func, int line)
    {
        call_stack.push_back(StackEntry(file, func, line));
    }
    ~FuncEntry()
    {
        call_stack.pop_back();
    }
    static void DumpStack()
    {
        for (const StackEntry &sp : call_stack)
        {
            std::cout << sp.file << ":" << sp.line << ": " << sp.func << "\n";
        }
    }
};

// The guard object must be named so it lives until the end of the
// enclosing scope; an unnamed temporary would be destroyed immediately.
#define FUNC() FuncEntry func_entry_(__FILE__, __func__, __LINE__);
#else
#define FUNC()
#endif

void somefunction()
{
    FUNC();
    // ... more code here.
}
I have used this technique in the past, but I just typed this code in; it may not compile, but I think it's clear enough. One major benefit is that you don't HAVE to put it in every function, just the "important" ones. [You could even have different types of FUNC macros that are enabled or disabled based on different levels of debugging.]