Important: Scroll down to the "final update" before you invest too much time here. Turns out the main lesson is to beware of the side effects of other tests in your unittest suite, and to always reproduce things in isolation before jumping to conclusions!
On the face of it, the following 64-bit code allocates (and accesses) one-mega 4k pages using VirtualAlloc (a total of 4GByte):
const size_t N=4; // Tests with this many Gigabytes
const size_t pagesize4k=4096;
const size_t npages=(N<<30)/pagesize4k;
BOOST_AUTO_TEST_CASE(test_VirtualAlloc) {
std::vector<void*> pages(npages,0);
for (size_t i=0;i<pages.size();++i) {
pages[i]=VirtualAlloc(0,pagesize4k,MEM_RESERVE|MEM_COMMIT,PAGE_READWRITE);
*reinterpret_cast<char*>(pages[i])=1;
}
// Check all allocs succeeded
BOOST_CHECK(std::find(pages.begin(),pages.end(),nullptr)==pages.end());
// Free what we allocated
bool trouble=false;
for (size_t i=0;i<pages.size();++i) {
const BOOL err=VirtualFree(pages[i],0,MEM_RELEASE);
if (err==0) trouble=true;
}
BOOST_CHECK(!trouble);
}
However, while executing it grows the "Working Set" reported in Windows Task Manager (and confirmed by the value "sticking" in the "Peak Working Set" column) from a baseline ~200,000K (~200MByte) to over 6,000,000 or 7,000,000K (tested on 64bit Windows7, and also on ESX-virtualized 64bit Server 2003 and Server 2008; unfortunately I didn't take note of which systems the various numbers observed occurred on).
Another very similar test case in the same unittest executable tests one-mega 4k mallocs (followed by frees) and that only expands by around the expected 4GByte when running.
I don't get it: does VirtualAlloc have some quite high per-alloc overhead? It's clearly a significant fraction of the page size if so; why is so much extra needed and what's it for? Or am I misunderstanding what the "Working Set" reported actually means? What's going on here?
Update: With reference to Hans' answer, I note this fails with an access violation in the second page access, so whatever is going on isn't as simple as the allocation being rounded up to the 64K "granularity".
char*const ptr = reinterpret_cast<char*>(
VirtualAlloc(0, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE)
);
ptr[0] = 1;
ptr[4096] = 1;
Update: Now on an AWS/EC2 Windows2008 R2 instance, with VisualStudioExpress2013 installed, I can't reproduce the problem with this minimal code (compiled 64bit), which tops out with an apparently overhead-free peak working set of 4,335,816K, which is the sort of number I'd expected to see originally. So either there is something different about the other machines I'm running on, or the boost-test based exe used in the previous testing. Bizzaro, to be continued...
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <vector>
int main(int, char**) {
const size_t N = 4;
const size_t pagesize4k = 4096;
const size_t npages = (N << 30) / pagesize4k;
std::vector<void*> pages(npages, 0);
for (size_t i = 0; i < pages.size(); ++i) {
pages[i] = VirtualAlloc(0, pagesize4k, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
*reinterpret_cast<char*>(pages[i]) = 1;
}
Sleep(5000);
for (size_t i = 0; i < pages.size(); ++i) {
VirtualFree(pages[i], 0, MEM_RELEASE);
}
return 0;
}
Final update: Apologies! I'd delete this question if I could because it turns out the observed problems were entirely due to an immediately preceeding unittest in the test suite which used TBB's "scalable allocator" to allocate/deallocate a couple of GByte of stuff. It seems scalable allocator actually retains such allocations in it's own pool rather than returning them to the system (see e.g here or here). Became obvious once I ran tests individually with enough of a Sleep after them to observe their on-completion working set in task manager (whether anything can be done about the TBB behaviour might be an interesting question, but as-is the question here is a red-herring).
pages[i]=VirtualAlloc(0,pagesize4k,MEM_RESERVE|MEM_COMMIT,PAGE_READWRITE);
You won't get 4096 bytes, it will be rounded up to the smallest permitted allocation. Which is SYSTEM_INFO.dwAllocationGranularity, it has been 64KB for a long time. It is a very basic address space fragmentation counter-measure.
So you are allocating way more than you think.
It turns out the observed problems were entirely due to an immediately preceding unittest in the test suite which used TBB's "scalable allocator" to allocate/deallocate a couple of GByte of stuff. It seems scalable allocator actually retains such allocations in it's own pool rather than returning them to the system (see e.g here or here). Became obvious once I ran tests individually with enough of a Sleep after them to observe their on-completion working set in task manager.
Related
I have some extremely simple C++ code that I was certain would run 3x faster with multithreading but somehow only runs 3% faster (or less) on both GCC and MSVC on Windows 10.
There are no mutex locks and no shared resources. And I can't see how false sharing or cache thrashing could be at play since each thread only modifies a distinct segment of the array, which has over a billion int values. I realize there are many questions on SO like this but I haven't found any that seem to solve this particular mystery.
One hint might be that moving the array initialization into the loop of the add() function does make the function 3x faster when multithreaded vs single-threaded (~885ms vs ~2650ms).
Note that only the add() function is being timed and takes ~600ms on my machine. My machine has 4 hyperthreaded cores, so I'm running the code with threadCount set to 8 and then to 1.
Any idea what might be going on? Is there any way to turn off (when appropriate) the features in processors that cause things like false sharing (and possibly like what we're seeing here) to happen?
#include <chrono>
#include <iostream>
#include <thread>
void startTimer();
void stopTimer();
void add(int* x, int* y, int threadIdx);
namespace ch = std::chrono;
auto start = ch::steady_clock::now();
const int threadCount = 8;
int itemCount = 1u << 30u; // ~1B items
int itemsPerThread = itemCount / threadCount;
int main() {
int* x = new int[itemCount];
int* y = new int[itemCount];
// Initialize arrays
for (int i = 0; i < itemCount; i++) {
x[i] = 1;
y[i] = 2;
}
// Call add() on multiple threads
std::thread threads[threadCount];
startTimer();
for (int i = 0; i < threadCount; ++i) {
threads[i] = std::thread(add, x, y, i);
}
for (auto& thread : threads) {
thread.join();
}
stopTimer();
// Verify results
for (int i = 0; i < itemCount; ++i) {
if (y[i] != 3) {
std::cout << "Error!";
}
}
delete[] x;
delete[] y;
}
void add(int* x, int* y, int threadIdx) {
int firstIdx = threadIdx * itemsPerThread;
int lastIdx = firstIdx + itemsPerThread - 1;
for (int i = firstIdx; i <= lastIdx; ++i) {
y[i] = x[i] + y[i];
}
}
void startTimer() {
start = ch::steady_clock::now();
}
void stopTimer() {
auto end = ch::steady_clock::now();
auto duration = ch::duration_cast<ch::milliseconds>(end - start).count();
std::cout << duration << " ms\n";
}
You may be simply hitting the memory transfer rate of your machine, you are doing 8GB of reads and 4GB of writes.
On my machine your test completes in about 500ms which is 24GB/s (which is similar to the results given by a memory bandwidth tester).
As you hit each memory address with a single read and a single write the caches aren't much use as you aren't reusing memory.
Your problem is not the processor. You ran against the RAM read and write latency. As your cache is able to hold some megabytes of data and you exceed this storage by far. Multi-threading is so long useful, as long as you can shovel data into your processor. The cache in your processor is incredibly fast, compared to your RAM. As you exceed your cache storage, this results in a RAM latency test.
If you want to see the advantages of multi-threading, you have to choose data sizes in range of your cache size.
EDIT
Another thing to do, would be to create a higher workload for the cores, so the storage latency goes unrecognized.
sidenote: keep in mind, your core has several execution units. one or more for each type of operation - integer, float, shift and so on. That means, one core can execute more then one command per step. In particular one operation per execution unit. You can keep the data size of the test data and do more stuff with it - be creative =) Filling the queue with integer operations only, will give you an advantage in multi-threading. If you can variate in your code, when and where you do different operations, do it, this also will show impact on the speedup. Or avoid it, if you want to see a nice speedup on multi-threading.
to avoid any kind of optimization, you should use randomized test data. so neither the compiler nor the processor itself can predict what the outcome of your operation is.
Also avoid doing branches like if and while. Each decision the processor has to predict and execute, will slow you down and alter the result. With branch-prediction, you will never get a deterministic result. Later in a "real" program, be my guest and do what you want. But when you want to explore the multi-threading world, this could lead you to wrong conclusions.
BTW
Please use a delete for every new you use, to avoid memory leaks. AND even better, avoid plain pointers, new and delete. You should use RAII. I advice to use std::array or std::vector, simple a STL-container. This will save you tons of debugging time and headaches.
Speedup from parallelization is limited by the portion of the task that remains serial. This is called Amdahl's law. In your case, a decent amount of that serial time is spent initializing the array.
Are you compiling the code with -O3? If so, the compiler might be able to unroll and/or vectorize some of the loops. The loop strides are predictable, so hardware prefetching might help as well.
You might want to also explore if using all 8 hyperthreads are useful or if it's better to run 1 thread per core (I am going to guess that since the problem is memory-bound, you'll likely benefit from all 8 hyperthreads).
Nevertheless, you'll still be limited by memory bandwidth. Take a look at the roofline model. It'll help you reason about the performance and what speedup you can theoretically expect. In your case, you're hitting the memory bandwidth wall that effectively limits the ops/sec achievable by your hardware.
I have decided to compare the times of passing by value and by reference in C++ (g++ 5.4.0) with the following code:
#include <iostream>
#include <sys/time.h>
using namespace std;
int fooVal(int a) {
for (size_t i = 0; i < 1000; ++i) {
++a;
--a;
}
return a;
}
int fooRef(int & a) {
for (size_t i = 0; i < 1000; ++i) {
++a;
--a;
}
return a;
}
int main() {
int a = 0;
struct timeval stop, start;
gettimeofday(&start, NULL);
for (size_t i = 0; i < 10000; ++i) {
fooVal(a);
}
gettimeofday(&stop, NULL);
printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);
gettimeofday(&start, NULL);
for (size_t i = 0; i < 10000; ++i) {
fooRef(a);
}
gettimeofday(&stop, NULL);
printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);
return 0;
}
It was expected that the fooRef execution would take much more time in comparison with fooVal case because of "looking up" referenced value in memory while performing operations inside fooRef. But the result proved to be unexpected for me:
The loop has taken 18446744073708648210 microseconds
The loop has taken 99967 microseconds
And the next time I run the code it can produce something like
The loop has taken 97275 microseconds
The loop has taken 99873 microseconds
Most of the time produced values are close to each other (with fooRef being just a little bit slower), but sometimes outbursts like in the output from the first run can happen (both for fooRef and fooVal loops).
Could you please explain this strange result?
UPD: Optimizations were turned off, O0 level.
If gettimeofday() function relies on operating system clock, this clock is not really designed for dealing with microseconds in an accurate manner. The clock is typically updated periodically and only frequently enough to give the appearance of showing seconds accurately for the purpose of working with date/time values. Sampling at the microsecond level may be unreliable for a benchmark such as the one you are performing.
You should be able to work around this limitation by making your test time much longer; for example, several seconds.
Again, as mentioned in other answers and comments, the effects of which type of memory is accessed (register, cache, main, etc.) and whether or not various optimizations are applied, could substantially impact results.
As with working around the time sampling limitation, you might be able to somewhat work around the memory type and optimization issues by making your test data set much larger such that memory optimizations aimed at smaller blocks of memory are effectively bypassed.
Firstly, you should look at the assembly language to see if there are any differences between passing by reference and passing by value.
Secondly, make the functions equivalent by passing by constant reference. Passing by value says that the original variable won't be changed. Passing by constant reference keeps the same principle.
My belief is that the two techniques should be equivalent in both assembly language and performance.
I'm no expert in this area, but I would tend to think that the reason why the two times are somewhat equivalent is due to cache memory.
When you need to access a memory location (Say, address 0xaabbc125 on an IA-32 architecure), the CPU copies the memory block (addresses 0xaabbc000 to 0xaabbcfff) to your cache memory. Reading from and writing to the memory is very slow, but once it's been copied into you cache, you can access values very quickly. This is useful because programs usually require the same range of addresses over and over.
Since you execute the same code over and over and that your code doesn't require a lot of memory, the first time the function is executed, the memory block(s) is (are) copied to your cache once, which probably takes most of the 97000 time units. Any subsequent calls to your fooVal and fooRef functions will require addresses that are already in your cache, so they will require only a few nanoseconds (I'd figure roughly between 10ns and 1µs). Thus, dereferencing the pointer (since a reference is implemented as a pointer) is about double the time compared to just accessing a value, but it's double of not much anyway.
Someone who is more of an expert may have a better or more complete explanation than mine, but I think this could help you understand what's going on here.
A little idea : try to run the fooVal and fooRef functions a few times (say, 10 times) before setting start and beginning the loop. That way, (if my explanation was correct!) the memory block will (should) be already into cache when you begin looping them, which means you won't be taking caching in your times.
About the super-high value you got, I can't explain that. But the value is obviously wrong.
It's not a bug, it's a feature! =)
This is the code I am dealing with:
#include <iostream>
#include <map>
using namespace std;
static unsigned long collatzLength(unsigned long n) {
static std::map<unsigned long,int> collatzMap;
int mapResult = collatzMap[n];
if (mapResult != 0) return mapResult;
if (n == 1) {
return 1;
} else {
collatzMap[n] = 1 + collatzLength(n%2==0?n/2:3*n+1);
return collatzMap[n];
}
}
int main() {
int maxIndex = 1;
unsigned int max = 1;
for (int i=1; i<1000000; i++) {
//cout<<i<<endl;
unsigned long count = collatzLength(i);
if (count > max) {
maxIndex = i;
max = count;
}
}
cout<<maxIndex<<endl;
getchar();
cout<<"Returning..."<<endl;
return maxIndex;
}
When I compile and run this program (using Visual Studio 2012 and Release build settings) it takes like 10 seconds (on my computer) to close after the program prints "Returning..."
Why is that?
Note: I am aware that this program is bad written and that i probably shouldn't be using 'static' on the collatzLength nor using a cache for that function, but I am not interesting on how to make this code better, I am just interesting about why does it take so much to close.
Go to project settings on your start up project, Debugging section. Enter _NO_DEBUG_HEAP=1 into the Environment section. You need to do this even in Release mode.
It takes so long to close because collatzMap is static, thus it will only get freed when the program exits, and it contains a lot of data, so it will take long to get freed (the size is just over 2 million, and, because of how map works, for each of these there's at least one pointer that needs to be freed).
That said, on Dev-C++ it takes less than a second to exit. I guess Visual Studio isn't doing a good job.
Destroying a std::map is very slow on Visual Studio, especially for Debug builds. The slowdown should already disappear by using an unordered_map instead.
VS's implementation of map builds a red-black tree for storing the data, which means you will have to allocate a lot of tree nodes to store all your data. The limiting factor during destruction is the time required for traversing the tree and de-allocating all the nodes. With an unordered_map the traversal is usually a lot easier as the allocated buckets are usually larger and not as scattered as the red-black tree nodes (your mileage may vary though).
I just tried on my machine with VS 2012 release build. It doesn't take more than 2 seconds to close the program after "Returning"
When running VS in Debug mode, there are a lot of error checking flags set (e.g. bounds checking, memory watching, etc.) Since you are creating a lot of data recursively in a static map, it is scanning the memory before it is released to make sure nothing was corrupted. When you switch to a Release build, it should be almost instant.
Upon further inspection, you are basically allocating almost 1 million pairs of (unsigned long, int) in the static portion of memory. This effectively means that ~8MB of data must be freed before the application can finish closing (and since map isn't required to be contiguous, it must iterate through all 1 million pairs to delete each one). Other implementations may optimize memory usage a bit by reserving chunks of space (if it reserved enough space for say, 100 pairs, it would decrease the deallocation process by 2 orders of magnitude).
All that said, asking why bad code behaves badly is like asking why you get wet when you jump in a pool.
Is it possible to get the remaining available memory on a system (x86, x64, PowerPC / Windows, Linux or MacOS) in standard C++11 without crashing ?
A naive way would be to try allocating very large arrays starting by too large size, catch exceptions everytime it fails and decrease the size until no exception is thrown. But maybe there is a more efficient/clever method...
EDIT 1: In fact I do not need the exact amount of memory. I would like to know approximately (error bar of 100MB) how much my code could use when I start it.
EDIT 2 :
What do you think of this code ? Is it secure to run it at the start of my program or it could corrupt the memory ?
#include <iostream>
#include <array>
#include <list>
#include <initializer_list>
#include <stdexcept>
int main(int argc, char* argv[])
{
static const long long int megabyte = 1024*1024;
std::array<char, megabyte> content({{'a'}});
std::list<decltype(content)> list1;
std::list<decltype(content)> list2;
const long long int n1 = list1.max_size();
const long long int n2 = list2.max_size();
long long int i1 = 0;
long long int i2 = 0;
long long int result = 0;
for (i1 = 0; i1 < n1; ++i1) {
try {
list1.push_back(content);
}
catch (const std::exception&) {
break;
}
}
for (i2 = 0; i2 < n2; ++i2) {
try {
list2.push_back(content);
}
catch (const std::exception&) {
break;
}
}
list1.clear();
list2.clear();
result = (i1+i2)*sizeof(content);
std::cout<<"Memory available for program execution = "<<result/megabyte<<" MB"<<std::endl;
return 0;
}
This is highly dependent on the OS/platform. The approach that you suggest need not even work in real life. In some platforms the OS will grant you all your memory requests, but not really give you the memory until you use it, at which point you get a SEGFAULT...
The standard does not have anything related to this.
It seems to me that the answer is no, you cannot do it in standard C++.
What you could do instead is discussed under How to get available memory C++/g++? and the contents linked there. Those are all platform specific stuff. It's not standard but it least it helps you to solve the problem you are dealing with.
As others have mentioned, the problem is hard to precisely define, much less solve. Does virtual memory on the hard disk count as "available"? What about if the system implements a prompt to delete files to obtain more hard disk space, meanwhile suspending your program? (This is exactly what happens on OS X.)
The system probably implements a memory hierarchy which gets slower as you use more. You might try detecting the performance cliff between RAM and disk by allocating and initializing chunks of memory while using the C alarm interrupt facility or clock or localtime/mktime, or the C++11 clock facilities. Wall-clock time should appear to pass quicker as the machine slows down under the stress of obtaining memory from less efficient resources. (But this makes the assumption that it's not stressed by anything else such as another process.) You would want to tell the user what the program is attempting, and save the results to an editable configuration file.
I would advise using a configurable maximum amount of memory instead. Since some platforms overcommit memory, it's not easy to tell how much memory you will actually have access to. It's also not polite to assume that you have exclusive access to 100% of the memory available, many systems will have other programs running.
Inspired by this question.
Apparently in the following code:
#include <Windows.h>
int _tmain(int argc, _TCHAR* argv[])
{
if( GetTickCount() > 1 ) {
char buffer[500 * 1024];
SecureZeroMemory( buffer, sizeof( buffer ) );
} else {
char buffer[700 * 1024];
SecureZeroMemory( buffer, sizeof( buffer ) );
}
return 0;
}
compiled with default stack size (1 megabyte) with Visual C++ 10 with optimizations on (/O2) a stack overflow occurs because the program tries to allocate 1200 kilobytes on stack.
The code above is of course slightly exaggerated to show the problem - uses lots of stack in a rather dumb way. Yet in real scenarios stack size can be smaller (like 256 kilobytes) and there could be more branches with smaller objects that would induce a total allocation size enough to overflow the stack.
That makes no sense. The worst case would be 700 kilobytes - it would be the codepath that constructs the set of local variables with the largest total size along the way. Detecting that path during compilation should not be a problem.
So the compiler produces a program that tries to allocate even more memory than the worst case. According to this answer LLVM does the same.
That could be a deficiency in the compiler or there could be some real reason for doing it this way. I mean maybe I just don't understand something in compilers design that would explain why doing allocation this way is necessary.
Why would the compiler want a program allocate more memory than the code needs in the worst case?
I can only speculate that this optimization was deemed too unimportant by the compiler designers. Or perhaps, there is some subtle security reason.
BTW, on Windows, stack is reserved in its entirety when the thread starts execution, but is committed on as-needed basis, so you are not really spending much "real" memory even if you reserved a large stack.
Reserving a large stack can be a problem on 32-bit system, where having large number of threads can eat the available address space without really committing much memory. On 64-bit, you are golden.
It could be down to your use of SecureZeroMemory. Try replacing it with regular ZeroMemory and see what happens- the MSDN page essentially indicates that SZM has some additional semantics beyond what it's signature implies, and they could be the cause of the bug.
The following code when compiled using GCC 4.5.1 on ideone places the two arrays at the same address:
#include <iostream>
int main()
{
int x;
std::cin >> x;
if (x % 2 == 0)
{
char buffer[500 * 1024];
std::cout << static_cast<void*>(buffer) << std::endl;
}
if (x % 3 == 0)
{
char buffer[700 * 1024];
std::cout << static_cast<void*>(buffer) << std::endl;
}
}
input: 6
output:
0xbf8e9b1c
0xbf8e9b1c
The answer is probably "use another compiler" if you want this optimization.
OS Pageing and byte alignment could be a factor. Also housekeeping may use extra stack along with space required for calling other functions within that function.