I couldn't find any clock drift RNG code for Windows anywhere so I attempted to implement it myself. I haven't run the numbers through ent or DIEHARD yet, and I'm just wondering if this is even remotely correct...
void QueryRDTSC(__int64* tick) {
    __asm {
        xor eax, eax
        cpuid
        rdtsc
        mov edi, dword ptr tick
        mov dword ptr [edi], eax
        mov dword ptr [edi+4], edx
    }
}

__int64 clockDriftRNG() {
    __int64 CPU_start, CPU_end, OS_start, OS_end;
    // get CPU ticks -- uses RDTSC on the Processor
    QueryRDTSC(&CPU_start);
    Sleep(1);
    QueryRDTSC(&CPU_end);
    // get OS ticks -- uses the Motherboard clock
    QueryPerformanceCounter((LARGE_INTEGER*)&OS_start);
    Sleep(1);
    QueryPerformanceCounter((LARGE_INTEGER*)&OS_end);
    // CPU clock is ~1000x faster than mobo clock
    // return raw
    return ((CPU_end - CPU_start)/(OS_end - OS_start));
    // or
    // return a random number from 0 to 9
    // return ((CPU_end - CPU_start)/(OS_end - OS_start)%10);
}
If you're wondering why I Sleep(1), it's because if I don't, OS_end - OS_start returns 0 consistently (because of the bad timer resolution, I presume).
Basically, (CPU_end - CPU_start)/(OS_end - OS_start) always returns around 1000 with a slight variation based on the entropy of CPU load, maybe temperature, quartz crystal vibration imperfections, etc.
Anyway, the numbers have a pretty decent distribution, but this could be totally wrong. I have no idea.
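As a cheap sanity check before ent or DIEHARD, a frequency count over the mod-10 output can at least catch gross bias. A minimal sketch (mine, not a substitute for the real test suites; it is slow because every call Sleeps twice):

#include <stdio.h>

// Rough sanity check: bucket the mod-10 output of the clockDriftRNG above
// and eyeball the spread against a uniform distribution.
void quickFrequencyTest() {
    int counts[10] = {0};
    const int samples = 1000;
    for (int i = 0; i < samples; i++)
        counts[(int)(clockDriftRNG() % 10)]++;
    for (int d = 0; d < 10; d++)
        printf("%d: %d (expected ~%d)\n", d, counts[d], samples / 10);
}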
Edit: According to Stephen Nutt, Sleep(1) may not be doing what I'm expecting, so instead of Sleep(1), I'm trying to use:
void loop() {
    __asm {
        mov ecx, 1000
    cycles:
        nop
        loop cycles
    }
}
The Sleep function is limited by the resolution of the system clock, so Sleep(1) may not be doing what you want.
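If you do keep Sleep(1), one option (my suggestion, not in the original code) is to raise the system timer resolution with timeBeginPeriod from winmm, which makes Sleep(1) behave much closer to an actual 1 ms wait:

#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

__int64 sampleWithBetterTimer() {
    // Raise the timer resolution to 1 ms while sampling, then restore it.
    timeBeginPeriod(1);
    __int64 r = clockDriftRNG();
    timeEndPeriod(1); // must always be paired with timeBeginPeriod
    return r;
}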
Consider some multiplication to increase the range. Then use the result to seed a PRNG.
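For example, a minimal sketch of that suggestion (the names and the mixing constant are mine): fold several drift samples into one seed, then let an ordinary PRNG stretch the entropy instead of using the raw quotient directly.

#include <stdlib.h>

// Combine several drift samples into one seed for a standard PRNG.
// 2654435761u is Knuth's multiplicative hash constant; any good mixer works.
unsigned int driftSeed() {
    unsigned int seed = 0;
    for (int i = 0; i < 4; i++)
        seed = seed * 2654435761u + (unsigned int)clockDriftRNG();
    return seed;
}

// Usage: srand(driftSeed()); then call rand() as usual.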
I have code like this:
const uint64_t tsc = __rdtsc();
const __m128 res = computeSomethingExpensive();
const uint64_t elapsed = __rdtsc() - tsc;
printf( "%" PRIu64 " cycles", elapsed );
In release builds, this prints garbage like “38 cycles” because the VC++ compiler reordered my code:
const uint64_t tsc = __rdtsc();
00007FFF3D398D00 rdtsc
00007FFF3D398D02 shl rdx,20h
00007FFF3D398D06 or rax,rdx
00007FFF3D398D09 mov r9,rax
const uint64_t elapsed = __rdtsc() - tsc;
00007FFF3D398D0C rdtsc
00007FFF3D398D0E shl rdx,20h
00007FFF3D398D12 or rax,rdx
00007FFF3D398D15 mov rbx,rax
00007FFF3D398D18 sub rbx,r9
const __m128 res = …
00007FFF3D398D1B lea rdx,[rcx+98h]
00007FFF3D398D22 mov rcx,r10
00007FFF3D398D25 call computeSomethingExpensive (07FFF3D393E50h)
What's the best way to fix this?
P.S. I'm aware rdtsc doesn't count cycles; it measures time based on the CPU's base frequency. I'm OK with that, I still want to measure that number.
Update: godbolt link
Adding a fake store
static bool save = false;
if (save)
{
    static float res1[4];
    _mm_store_ps(res1, res);
}
before the second __rdtsc seems to be enough to fool the compiler.
(Not adding a real store to avoid contention if this function is called from multiple threads, though one could use TLS to avoid that.)
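An alternative to the fake store (my sketch, not from the answer above) is to fence the measurement: __rdtscp waits for earlier instructions to retire before reading the TSC, and MSVC's fence intrinsics also act as compiler barriers, so the reads can't be hoisted past the work. Worth verifying in the disassembly that the call stays put:

#include <intrin.h>
#include <stdint.h>

extern __m128 computeSomethingExpensive(); // from the question

uint64_t measureFenced()
{
    unsigned int aux;
    _mm_lfence();                              // keep earlier work out of the window
    const uint64_t tsc = __rdtsc();
    _mm_lfence();                              // keep the measured work after the read
    const __m128 res = computeSomethingExpensive();
    const uint64_t elapsed = __rdtscp(&aux) - tsc; // waits for prior instructions to retire
    _mm_lfence();
    (void)res;
    return elapsed;
}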
I have a program in which a simple function is called a large number of times. I have added some simple logging code and find that this significantly affects performance, even when the logging code is not actually called. A complete (but simplified) test case is shown below:
#include <chrono>
#include <iostream>
#include <random>
#include <sstream>

using namespace std::chrono;

std::mt19937 rng;

uint32_t getValue()
{
    // Just some pointless work, helps stop this function from getting inlined.
    for (int x = 0; x < 100; x++)
    {
        rng();
    }
    // Get a value, which happens never to be zero
    uint32_t value = rng();
    // This (by chance) is never true
    if (value == 0)
    {
        value++; // This if statement won't get optimized away when the printing below is commented out.
        std::stringstream ss;
        ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
        std::cout << ss.str();
    }
    return value;
}

int main(int argc, char* argv[])
{
    // Just for timing
    high_resolution_clock::time_point start = high_resolution_clock::now();
    uint32_t sum = 0;
    for (uint32_t i = 0; i < 10000000; i++)
    {
        sum += getValue();
    }
    milliseconds elapsed = duration_cast<milliseconds>(high_resolution_clock::now() - start);
    // Use (print) the sum to make sure it doesn't get optimized away.
    std::cout << "Sum = " << sum << ", Elapsed = " << elapsed.count() << "ms" << std::endl;
    return 0;
}
Note that the code contains stringstream and cout but these are never actually called. However, the presence of these three lines of code increases the run time from 2.9 to 3.3 seconds. This is in release mode on VS2013. Curiously, if I build with GCC using the '-O3' flag, the extra three lines of code actually decrease the runtime by half a second or so.
I understand that the extra code could impact the resulting executable in a number of ways, such as by preventing inlining or causing more cache misses. The real question is whether there is anything I can do to improve this situation. Switching to sprintf()/printf() doesn't seem to make a difference. Do I need to simply accept that adding such logging code to small functions will affect performance even if not called?
Note: For completeness, my real/full scenario is that I use a wrapper macro to throw exceptions and I like to log when such an exception is thrown. So when I call THROW_EXCEPT(...) it inserts code similar to that shown above and then throws. This then hurts when I throw exceptions from inside a small function. Any better alternatives here?
Edit: Here is a VS2013 solution for quick testing, and so compiler settings can be checked: https://drive.google.com/file/d/0B7b4UnjhhIiEamFyS0hjSnVzbGM/view?usp=sharing
So I initially thought that this was due to branch prediction and the optimising out of branches, so I took a look at the annotated assembly for when the code is commented out:
if (value == 0)
00E21371 mov ecx,1
00E21376 cmove eax,ecx
{
value++;
Here we see that the compiler has helpfully optimised out our branch, so what if we put in a more complex statement to prevent it from doing so:
if (value == 0)
00AE1371 jne getValue+99h (0AE1379h)
{
value /= value;
00AE1373 xor edx,edx
00AE1375 xor ecx,ecx
00AE1377 div eax,ecx
Here the branch is left in, but when running this it runs about as fast as the previous example with the logging lines commented out. So let's have a look at the assembly with those lines left in:
if (value == 0)
008F13A0 jne getValue+20Bh (08F14EBh)
{
value++;
std::stringstream ss;
008F13A6 lea ecx,[ebp-58h]
008F13A9 mov dword ptr [ss],8F32B4h
008F13B3 mov dword ptr [ebp-0B0h],8F32F4h
008F13BD call dword ptr ds:[8F30A4h]
008F13C3 push 0
008F13C5 lea eax,[ebp-0A8h]
008F13CB mov dword ptr [ebp-4],0
008F13D2 push eax
008F13D3 lea ecx,[ss]
008F13D9 mov dword ptr [ebp-10h],1
008F13E0 call dword ptr ds:[8F30A0h]
008F13E6 mov dword ptr [ebp-4],1
008F13ED mov eax,dword ptr [ss]
008F13F3 mov eax,dword ptr [eax+4]
008F13F6 mov dword ptr ss[eax],8F32B0h
008F1401 mov eax,dword ptr [ss]
008F1407 mov ecx,dword ptr [eax+4]
008F140A lea eax,[ecx-68h]
008F140D mov dword ptr [ebp+ecx-0C4h],eax
008F1414 lea ecx,[ebp-0A8h]
008F141A call dword ptr ds:[8F30B0h]
008F1420 mov dword ptr [ebp-4],0FFFFFFFFh
That's a lot of instructions if that branch is ever hit. So what if we try something else?
if (value == 0)
011F1371 jne getValue+0A6h (011F1386h)
{
value++;
printf("This never gets printed, but commenting out these three lines improves performance.");
011F1373 push 11F31D0h
011F1378 call dword ptr ds:[11F30ECh]
011F137E add esp,4
Here we have far fewer instructions and once again it runs as quickly as with all lines commented out.
So I'm not sure I can say for certain exactly what is happening here but I feel at the moment it is a combination of branch prediction and CPU instruction cache misses.
In order to solve this problem you could move the logging into a function like so:
void log()
{
    std::stringstream ss;
    ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
    std::cout << ss.str();
}
and
if (value == 0)
{
    value++;
    log();
}
Then it runs as fast as before, with all those instructions replaced by a single call log (011C12E0h) instruction.
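Applied to the THROW_EXCEPT scenario from the question, the same idea might look like this (a sketch: the macro name comes from the question, while the helper function and __declspec(noinline) are my additions). Only the call remains on the hot path; the stringstream machinery is forced out of line:

#include <iostream>
#include <sstream>
#include <stdexcept>

// Cold path: marked noinline so its logging machinery never bloats the
// function containing the throw site.
__declspec(noinline) void throwWithLog(const char* file, int line, const char* msg)
{
    std::stringstream ss;
    ss << file << "(" << line << "): " << msg << std::endl;
    std::cout << ss.str();
    throw std::runtime_error(msg);
}

#define THROW_EXCEPT(msg) throwWithLog(__FILE__, __LINE__, (msg))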
Consider the following code segment:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define ARRAYSIZE(arr) (sizeof(arr)/sizeof(arr[0]))

inline void
clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

inline uint64_t
rdtsc()
{
    unsigned long a, d;
    asm volatile ("cpuid; rdtsc" : "=a" (a), "=d" (d) : : "ebx", "ecx");
    return a | ((uint64_t)d << 32);
}

inline int func() { return 5; }

inline void test()
{
    uint64_t start, end;
    char c;
    start = rdtsc();
    func();
    end = rdtsc();
    printf("%ld ticks\n", end - start);
}

void flushFuncCache()
{
    // Assuming function to be not greater than 320 bytes.
    char* fPtr = (char*)func;
    clflush(fPtr);
    clflush(fPtr+64);
    clflush(fPtr+128);
    clflush(fPtr+192);
    clflush(fPtr+256);
}

int main(int ac, char **av)
{
    test();
    printf("Function must be cached by now!\n");
    test();
    flushFuncCache();
    printf("Function flushed from cache.\n");
    test();
    printf("Function must be cached again by now!\n");
    test();
    return 0;
}
Here, I am trying to flush the instruction cache to evict the code for 'func', and then expecting a performance overhead on the next call to func, but my results don't agree with my expectations:
858 ticks
Function must be cached by now!
788 ticks
Function flushed from cache.
728 ticks
Function must be cached again by now!
710 ticks
I was expecting CLFLUSH to also flush the instruction cache, but apparently it is not doing so. Can someone explain this behavior or suggest how to achieve the desired behavior?
Your code does almost nothing in func, and the little you do gets inlined into test, and probably optimized out since you never use the return value.
gcc -O3 gives me -
0000000000400620 <test>:
400620: 53 push %rbx
400621: 0f a2 cpuid
400623: 0f 31 rdtsc
400625: 48 89 d7 mov %rdx,%rdi
400628: 48 89 c6 mov %rax,%rsi
40062b: 0f a2 cpuid
40062d: 0f 31 rdtsc
40062f: 5b pop %rbx
...
So you're measuring the time for two moves that are very cheap HW-wise; your measurement is probably showing the latency of cpuid, which is relatively expensive.
Worse, your clflush would actually flush test as well, which means you pay the re-fetch penalty when you next access it; but that access is outside the rdtsc pair, so it's not measured. The measured code, on the other hand, follows sequentially, so fetching test would probably also fetch the flushed code you measure, meaning it could actually be cached by the time you measure it.
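To measure what was intended, the call first has to survive optimization. A sketch along these lines (the noinline attribute and the volatile sink are my additions; func2/test2 named to avoid clashing with the question's code):

#include <stdint.h>
#include <stdio.h>

// Force a real, out-of-line call so there is actual code at the function's
// address to flush, and sink the result so the call cannot be discarded.
__attribute__((noinline)) int func2(void) { return 5; }

volatile int sink;

void test2(void)
{
    uint64_t start = rdtsc(); // rdtsc() as defined in the question
    sink = func2();
    uint64_t end = rdtsc();
    printf("%lu ticks\n", (unsigned long)(end - start));
}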
It works well on my computer.
264 ticks
Function must be cached by now!
258 ticks
Function flushed from cache.
519 ticks
Function must be cached again by now!
240 ticks
Using VS2012, I noticed that a switch that's been working for several years now seems to be broken in Release builds but works correctly (or at least as it used to) in Debug builds. I can't see anything at all wrong with the code so would appreciate some feedback on the correctness of using return statements from within a switch block.
The following code compiles ok but gives the wrong output in a Release build on Win7 32-bit...
#include <stdio.h>
#include <tchar.h>

class CSomeClass
{
public:
    float GetFloat(int nInt)
    {
        printf("GetFloat() - entered\n");
        switch (nInt)
        {
        case 1 :
            printf("GetFloat() - case 1 entered\n");
            return 0.5F;
        case 0 :
            printf("GetFloat() - case 0 entered\n");
            return 1.0F;
        case 2 :
            printf("GetFloat() - case 2 entered\n");
            return 2.0F;
        case 3 :
            printf("GetFloat() - case 3 entered\n");
            return 3.0F;
        case 4 :
            printf("GetFloat() - case 4 entered\n");
            return 4.0F;
        }
        printf("GetFloat() - exit\n");
        return 1.0F;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    CSomeClass pClass;
    float fValue = pClass.GetFloat(3);
    printf("fValue = %f\n", fValue);
    return 0;
}
If you can repeat the problem, and have an MS Connect login, maybe you can vote it up here too?
Actual results
Release build gives the following incorrect result:
GetFloat() - entered
GetFloat() - case 3 entered
fValue = 0.000000
Expected results
Debug build gives the following correct result:
GetFloat() - entered
GetFloat() - case 3 entered
fValue = 3.000000
MS Connect bug report
It sounds like it might be this bug, where there is a problem in returning floating-point values generated in a similar way?
Definitely a compiler error. Here is the stripped-down asm code being executed (jumps etc. removed). The compiler removes some code it assumes to be unnecessary, even though it is not.
Release build:
// inside GetFloat
00E0104D fld dword ptr ds:[0E021D8h] // load constant float onto FPU stack
00E01068 add esp,4
00E0106B ret
// back in main
00E01098 cvtss2sd xmm0,xmm0 // convert float to double. assumes the returned value to be in xmm0
00E0109C sub esp,8
00E0109F movsd mmword ptr [esp],xmm0 // push double being printed (xmm0 is with high likelihood = 0)
00E010A4 push 0E021B8h // push output string
00E010A9 call dword ptr ds:[0E02090h] // call printf
Debug build:
003314B0 fld dword ptr ds:[335964h] // load const float onto FPU stack
[...]
00331500 mov esp,ebp
00331502 pop ebp
00331503 ret 4
// back in main
00331598 fstp dword ptr [fValue] // copies topmost element of the FPU stack to [fValue]
0033159B cvtss2sd xmm0,dword ptr [fValue] // correctly takes the returned value (now inside [fValue]) for conversion to double
003315A0 mov esi,esp
003315A2 sub esp,8
003315A5 movsd mmword ptr [esp],xmm0 // push double being printed
003315AA push 335940h // push output string
003315AF call dword ptr ds:[3392C8h] // call printf
Gathering data from all the results, it's most probably a compiler bug.
x64 works correctly
/O1 works correctly
well, Debug works correctly
it gives the same kind of error on both cout and printf
And, probably most important: it's 100% standards-compliant, so it should work.
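If it really is this codegen bug, one pragmatic mitigation until a fix ships (my suggestion, not from the thread) is MSVC's #pragma optimize, which turns optimization off for just the affected function while the rest of the build stays at /O2:

// Workaround sketch: compile only the miscompiled function without
// optimization.
#pragma optimize("", off)
float GetFloatSafe(int nInt)
{
    switch (nInt)
    {
    case 3: return 3.0F;
    // ... remaining cases as in the original ...
    }
    return 1.0F;
}
#pragma optimize("", on)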
I have a C++ app that uses large arrays of data, and have noticed while testing that it is running out of memory, while there is still plenty of memory available. I have reduced the code to a sample test case as follows;
void MemTest()
{
    size_t Size = 500*1024*1024; // 500MB
    if (Size > _HEAP_MAXREQ)
        TRACE("Invalid Size");
    void * mem = malloc(Size);
    if (mem == NULL)
        TRACE("allocation failed");
}
If I create a new MFC project, include this function, and run it from InitInstance, it works fine in debug mode (memory allocated as expected), yet fails in release mode (malloc returns NULL). Single stepping through the release build into the C runtime, my function gets inlined and I get the following:
// malloc.c
void * __cdecl _malloc_base (size_t size)
{
    void *res = _nh_malloc_base(size, _newmode);
    RTCCALLBACK(_RTC_Allocate_hook, (res, size, 0));
    return res;
}
Calling _nh_malloc_base
void * __cdecl _nh_malloc_base (size_t size, int nhFlag)
{
    void * pvReturn;
    // validate size
    if (size > _HEAP_MAXREQ)
        return NULL;
    ...
And (size > _HEAP_MAXREQ) returns true, hence my memory doesn't get allocated. Putting a watch on size comes back with the expected 500MB, which suggests the program is linking into a different run-time library with a much smaller _HEAP_MAXREQ. Grepping the VC++ folders for _HEAP_MAXREQ shows the expected 0xFFFFFFE0, so I can't figure out what is happening here. Anyone know of any CRT changes or versions that would cause this problem, or am I missing something way more obvious?
Edit: As suggested by Andreas, looking at this in the assembly view shows the following:
--- f:\vs70builds\3077\vc\crtbld\crt\src\malloc.c ------------------------------
_heap_alloc:
0040B0E5 push 0Ch
0040B0E7 push 4280B0h
0040B0EC call __SEH_prolog (40CFF8h)
0040B0F1 mov esi,dword ptr [size]
0040B0F4 cmp dword ptr [___active_heap (434660h)],3
0040B0FB jne $L19917+7 (40B12Bh)
0040B0FD cmp esi,dword ptr [___sbh_threshold (43464Ch)]
0040B103 ja $L19917+7 (40B12Bh)
0040B105 push 4
0040B107 call _lock (40DE73h)
0040B10C pop ecx
0040B10D and dword ptr [ebp-4],0
0040B111 push esi
0040B112 call __sbh_alloc_block (40E736h)
0040B117 pop ecx
0040B118 mov dword ptr [pvReturn],eax
0040B11B or dword ptr [ebp-4],0FFFFFFFFh
0040B11F call $L19916 (40B157h)
$L19917:
0040B124 mov eax,dword ptr [pvReturn]
0040B127 test eax,eax
0040B129 jne $L19917+2Ah (40B14Eh)
0040B12B test esi,esi
0040B12D jne $L19917+0Ch (40B130h)
0040B12F inc esi
0040B130 cmp dword ptr [___active_heap (434660h)],1
0040B137 je $L19917+1Bh (40B13Fh)
0040B139 add esi,0Fh
0040B13C and esi,0FFFFFFF0h
0040B13F push esi
0040B140 push 0
0040B142 push dword ptr [__crtheap (43465Ch)]
0040B148 call dword ptr [__imp__HeapAlloc#12 (425144h)]
0040B14E call __SEH_epilog (40D033h)
0040B153 ret
$L19914:
0040B154 mov esi,dword ptr [ebp+8]
$L19916:
0040B157 push 4
0040B159 call _unlock (40DDBEh)
0040B15E pop ecx
$L19929:
0040B15F ret
_nh_malloc:
0040B160 cmp dword ptr [esp+4],0FFFFFFE0h
0040B165 ja _nh_malloc+29h (40B189h)
With the registers as follows:
EAX = 009C8AF0 EBX = FFFFFFFF ECX = 009C8A88 EDX = 00747365 ESI = 00430F80
EDI = 00430F80 EIP = 0040B160 ESP = 0013FDF4 EBP = 0013FFC0 EFL = 00000206
So the compare does appear to be against the correct constant, i.e. 0040B160 cmp dword ptr [esp+4],0FFFFFFE0h, and [esp+4] (at 0013FDF8) holds 1F400000h (my 500MB).
Second edit: The problem was actually in HeapAlloc, as per Andreas' post. Changing to a new separate heap for large objects, using HeapCreate & HeapAlloc, did not help alleviate the problem, nor did an attempt to use VirtualAlloc with various parameters. Some further experimentation has shown that where allocating one large section of contiguous memory fails, two smaller blocks yielding the same total memory are OK. For example, where a 300MB malloc fails, 2 x 150MB mallocs work fine. So it looks like I'll need a new array class that can live in a number of biggish memory fragments rather than a single contiguous block. Not a major problem, but I would have expected a bit more out of Win32 in this day and age.
Last edit: The following yielded 1.875GB of space, albeit non-contiguous
#define TenMB (1024*1024*10)

void SmallerAllocs()
{
    size_t Total = 0;
    LPVOID p[200];
    for (int i = 0; i < 200; i++)
    {
        p[i] = malloc(TenMB);
        if (p[i])
            Total += TenMB;
        else
            break;
    }
    CString Msg;
    Msg.Format("Allocated %0.3lfGB", Total/(1024.0*1024.0*1024.0));
    AfxMessageBox(Msg, MB_OK);
}
Could it be the case that the debugger is playing a trick on you in release mode? Neither single stepping nor the values of variables are reliable in release mode.
I tried your example in VS2003 in release mode, and when single stepping it does at first look like the code lands on the return NULL line, but when I continue stepping it eventually continues into HeapAlloc. I would guess that it's this function that's failing. Looking at the disassembly, if (size > _HEAP_MAXREQ) reveals the following:
00401078 cmp dword ptr [esp+4],0FFFFFFE0h
so I don't think it's a problem with _HEAP_MAXREQ.
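Given the later edits showing that two smaller blocks succeed where one large one fails, the limit is probably address-space fragmentation in the 32-bit process rather than _HEAP_MAXREQ. A sketch (my addition) that walks the address space with VirtualQuery and reports the largest contiguous free region; if this is well under 500MB, fragmentation is the culprit:

#include <windows.h>
#include <stdio.h>

// Walk the process address space and report the largest MEM_FREE region.
void LargestFreeBlock()
{
    MEMORY_BASIC_INFORMATION mbi;
    SIZE_T largest = 0;
    char* p = NULL;
    while (VirtualQuery(p, &mbi, sizeof(mbi)) == sizeof(mbi))
    {
        if (mbi.State == MEM_FREE && mbi.RegionSize > largest)
            largest = mbi.RegionSize;
        p = (char*)mbi.BaseAddress + mbi.RegionSize;
    }
    printf("Largest free region: %.1f MB\n", largest / (1024.0 * 1024.0));
}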