I read the answer to this question:
How to calculate number of times each instruction could be called at run-time?
I found in the InstrProfiling.cpp file a function called castToIncrementInst, but this function only calculates the total number of instructions. I need the number of executions for each instruction or basic block.
I'm using the random_number subroutine from Fortran, but in different runs of the program the numbers being produced don't change. What should I include in my code so that every time I compile and run the program the numbers change?
The random number generator produces pseudo-random numbers. To get different numbers each run, you need to initialise the random seed at the start of your program. This picks a different starting position in the pseudo-random stream.
The sequence of pseudorandom numbers coming from call(s) to random_number depends on the algorithm used by the processor and the value of the seed.
The initial value of the seed is processor dependent. For some processors this seed value will be the same each time the program runs, and for some it will be different. The first case gives a repeatable pseudorandom sequence and the second a non-repeatable sequence.
gfortran (before version 7) falls into the first category. You will need to explicitly change the random seed if you wish to get non-repeatable sequences.
As stated in another answer the intrinsic random_seed can be used to set the value of the seed and restart the pseudorandom generator. Again, it is processor dependent what happens when the call is call random_seed() (that is, without a put= argument). Some processors will restart the generator with a repeatable sequence, some won't. gfortran (again, before version 7) is in the first category.
For processors where call random_seed() gives rise to a repeatable sequence an explicit run-time varying seed will be required to generate distinct sequences. An example for those older gfortran versions can be found in the documentation.
It should be noted that choosing a seed can be a complicated thing. Not only will there be portability issues, but care may be required in ensuring that the generator is not restarted in a low entropy region. For multi-image programs the user will have to work to have varying sequences across these images.
On a final note, Fortran 2018 introduced the standard intrinsic procedure random_init. This handles both cases of selecting repeatability across invocations and distinctness over (coarray) images.
I have a member function of a class that is supposed to generate a random number in a range. To do so, I am using the rand() function. The function generates a random number like this:
unsigned seed;
seed = time(0);
srand(seed);
std::cout << "Random Number: " << rand() << std::endl;
The function is called on two different objects. The result is:
Random Number: 1321638448
Random Number: 1321638448
This is consistent every time I call it. What am I doing wrong?
(Converting my comment to an answer).
For most applications, you'll only really want to seed rand once in the course of running a program. Seeding it multiple times requires you to get different random seeds, and it's easy to mess that up.
In your case, the time function usually returns something with resolution on the level of seconds (though this isn't actually required by the standard). As a result, if you call time twice within the same second, you might get back the same value. That would explain why you're getting duplicate values: you're seeding the randomizer with the same value twice and then immediately querying it for a random number.
The best solution to this is to just seed the randomizer once. Typically, you'd do that in main.
If you really do want to seed the randomizer multiple times, make sure that you're doing so using a seed that is going to be pretty much random. Otherwise, you risk something like this happening.
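A minimal sketch of the seed-once approach (a standalone program for illustration, not the asker's actual class):
#include <cstdlib>   // srand, rand
#include <ctime>     // time
#include <iostream>

int main() {
    std::srand(static_cast<unsigned>(std::time(0)));   // seed once, up front

    // every subsequent call advances the same pseudo-random sequence
    for (int i = 0; i < 3; ++i)
        std::cout << "Random Number: " << std::rand() << std::endl;
}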
Pseudorandom number generators basically have to pass a set of statistical tests to make sure the numbers they produce are "random enough". But of course, the output isn't actually random. Calling srand(seed) with some seed effectively selects one fixed sequence of numbers which, when run through those tests, will look "random enough".
By calling srand(seed) with the same seed multiple times, you're effectively generating the same set over and over again and getting the first value in it.
You call srand(seed) ONCE, and then you call rand() to get the next values in the random number set. Or you need to call srand(seed) with a different (random) seed each time.
If you're on Linux, you can also use /dev/urandom to get a random number: the kernel gathers noise from the environment to generate entropy for it, supposedly making it even better than an algorithmic pseudorandom number generator.
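For example (a sketch; only relevant on Linux and other systems that expose /dev/urandom):
#include <fstream>
#include <iostream>

int main() {
    // read raw bytes from the kernel's entropy-backed device
    std::ifstream urandom("/dev/urandom", std::ios::binary);
    unsigned int value = 0;
    urandom.read(reinterpret_cast<char*>(&value), sizeof(value));
    std::cout << "Random Number: " << value << std::endl;
}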
The srand function should be called only once in a program (in most cases, not all). If you want to reseed, you should use a different seed value, because rand() is a pseudo-random number generator; in other words, rand() gives you a calculated number.
You can use the much more powerful random number generation library available since C++11. See: http://en.cppreference.com/w/cpp/numeric/random
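A minimal sketch of those C++11 facilities (the 1-6 range is just an example choice):
#include <random>
#include <iostream>

int main() {
    std::random_device rd;                          // non-deterministic seed source
    std::mt19937 gen(rd());                         // Mersenne Twister engine
    std::uniform_int_distribution<int> dist(1, 6);  // unbiased numbers in [1, 6]

    for (int i = 0; i < 5; ++i)
        std::cout << dist(gen) << ' ';
    std::cout << std::endl;
}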
I'm currently working with llvm and I'm concentrating on loops: as I wanted to unroll loops, I found out that llvm already provides a function that unrolls loops. That function expects the following arguments:
a count that determines the unroll count (how many times the loop body will exist after unrolling)
a tripcount which determines how many times the loop will execute (before being unrolled). Or to be exact (taken from the documentation of the getSmallConstantTripCount function in the ScalarEvolution class):
[...] it is the number of times that control may reach ExitingBlock before taking the branch. For loops with multiple exits, it may not be the number of times that the loop header executes if the loop exits prematurely via another branch
a tripmultiple which - according to the documentation of the getSmallConstantTripMultiple function in the ScalarEvolution class - is
[...] the largest constant divisor of the trip count of this loop [...]
some other arguments that do not matter for this question.
The tripcount and tripmultiple values can be obtained from the ScalarEvolution class using the already mentioned functions. My pass currently uses those values, but when I started testing the pass on several loops it seemed like both values were always equal (when I started using breaks to create early exits in the CFG, llvm could not determine either of those values and always returned "default" values).
My questions are now: what exactly is the difference between those two values? Under which conditions are these values different (some code example would be very useful)? And could it happen that the llvm ScalarEvolution pass cannot compute the tripcount but can determine the tripmultiple? If so, some code would be very helpful, as I currently cannot imagine such a situation.
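(For illustration of the definitions quoted above, here is a sketch of one shape of loop where, as far as I understand it, the two could plausibly differ: the bound is a run-time value, so the trip count is not a compile-time constant, but it is always a multiple of 4.)
// Hypothetical example: the trip count is 4 * n, which is not a compile-time
// constant, so getSmallConstantTripCount() cannot return it. The count is
// always divisible by 4, though, so getSmallConstantTripMultiple() may still
// report 4 (subject to ScalarEvolution's overflow/nsw analysis).
void scale(int *a, int n) {
    for (int i = 0; i < 4 * n; ++i)
        a[i] *= 2;
}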
I'm a beginner at C++ and I am trying to find the frequency of numbers (1-6) from a random number generator of 100 numbers. The only commands that I can use are rand, srand, cin, cout, loops, and if/else. Is it possible to create a program that shows the frequency using only these commands? Thank you.
You can use a std::map<int, int> where first would be the random number and second would be the count.
Therefore the frequency would be the count / total.
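A minimal sketch of that idea (note that std::map isn't actually among the commands the question allows; this is just to show the counting pattern):
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <map>

int main() {
    std::srand(static_cast<unsigned>(std::time(0)));
    std::map<int, int> counts;              // number -> how many times it appeared
    const int total = 100;

    for (int i = 0; i < total; ++i)
        ++counts[std::rand() % 6 + 1];      // tally one draw in the range 1-6

    for (int n = 1; n <= 6; ++n)
        std::cout << n << ": " << counts[n] << " / " << total << std::endl;
}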
Consider the function of each of the commands given to you:
srand(int): seeds your random number generator; you can give this a simple number (say 0, 1, or 7) so that every time you run your code, your random number generator will have the same output (good for checking your progress. It's difficult to tell if you're going in the right direction if you're getting different output every time)
rand(): generates a random number for you between 0 and RAND_MAX. For generating numbers within a certain range, look at this thread (How does modulus and rand() work?) and the short snippet after this list.
cin: reads input from the user.
cout: outputs to the console (e.g. cout << "string"; will output the word "string" to the console)
loops: your two main loops are while and for. A while loop keeps running only as long as a certain condition is met; a for loop lets you create a temporary variable, check a condition, and increment a variable. Hint: for loops are better when you know exactly how many iterations you want to run through.
if/else: allows you to check a given condition and either execute one block of code if that condition evaluates to true, or go to another block of code (the else) if false. Note that you can also chain else if's (i.e., if (condition) { /* code */ } else if (condition) { /* code */ }).
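For instance, the range trick referenced above for getting a value between 1 and 6 looks like this (just the mapping, nothing more):
int roll = rand() % 6 + 1;   // rand() % 6 gives 0-5, so adding 1 gives 1-6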
Now looking back at the question, what do we need to solve this?
1. A way to generate random numbers (rand)
2. A way to control how many numbers we are generating (loop)
3. A way to check if the number we generated is a certain number (if/else)
4. A way to store the frequency count for each number (variables)
5. A way to output our findings to the screen (cout)
I hope I didn't give too much away - from here it is up to you to form the exact specification of your logic throughout the program (i.e., what variables to define, how to execute your loop, which if/else checks to use, and how and what to output).
I used _rdtsc() to time atoi() and atof() and I noticed they were taking pretty long. I therefore wrote my own versions of these functions which were much quicker from the first call.
I am using Windows 7, the VS2012 IDE, but with the Intel C/C++ compiler v13. I have /O3 enabled and also /Ot ("favour fast code"). My CPU is an Ivy Bridge (mobile).
Upon further investigation, it seemed that the more times atoi() and atof() were called, the quicker they executed. I am talking orders of magnitude faster:
When I call atoi() from outside my loop, just the once, it takes 5,892 CPU cycles but after thousands of iterations this reduced to 300 - 600 CPU cycles (quite a large execution time range).
atof() initially takes 20,000 to 30,000 CPU cycles, and then later on, after a few thousand iterations, it was taking 18 - 28 CPU cycles (which is what my custom function takes the first time it is called).
Could someone please explain this effect?
EDIT: I forgot to say: the basic setup of my program is a loop parsing bytes from a file. Inside the loop I use atof and atoi, which is where I noticed the above. However, what I also noticed is that when I did my investigation before the loop, just calling atoi and atof twice, along with my user-written equivalent functions twice, it seemed to make the loop execute faster. The loop processes 150,000 lines of data, each line requiring 3x atof() or atoi()s. Once again, I cannot understand why calling these functions before my main loop affected the speed of a program calling these functions 500,000 times?!
#include <ia32intrin.h>   // _rdtsc() with the Intel compiler
#include <cstdlib>        // atoi, atof
#include <iostream>
using namespace std;

int main(){
    // Placeholder declarations: in the real program these are filled in while
    // reading the file; they are only sketched here so the excerpt compiles.
    char* bytes = 0;                 // pointer to the file contents
    long long i = 0, file_size = 0;
    char message_type = 0;
    short offset_to_price = 0, offset_to_qty = 0;
    double price = 0.0;

    //call myatoi() and time it
    //call atoi() and time it
    //call myatoi() and time it
    //call atoi() and time it
    const char* bytes2 = "45632";
    __int64 start2 = _rdtsc();
    unsigned int a2 = atoi(bytes2);
    __int64 finish2 = _rdtsc();
    cout << (finish2 - start2) << " CPU cycles for atoi()" << endl;

    //call myatof() and time it
    //call atof() and time it
    //call myatof() and time it
    //call atof() and time it

    //Iterate through 150,000 lines, each line about 25 characters.
    //The below executes slower if the above debugging is NOT done.
    while(i < file_size){
        //Loop through my data, call atoi() or atof() 1 or 2 times per line
        switch(bytes[i]){
        case ' ':
            //I have an array of shorts which records the distance from the beginning
            //of the line to each of the tokens in the line. In the below switch
            //statement offset_to_price and offset_to_qty refer to this array.
            //(fall through to the token handling below)
        case '\n':
            switch(message_type){
            case 'A': {
                char* temp = bytes + offset_to_price;
                __int64 start = _rdtsc();
                price = atof(temp);
                __int64 finish = _rdtsc();
                cout << (finish - start) << " CPU cycles" << endl;
                //Other processing with the tokens
                break;
            }
            case 'R': {
                //Get the 4th line token using atoi() as above
                char* temp = bytes + offset_to_qty;
                __int64 start = _rdtsc();
                price = atoi(temp);
                __int64 finish = _rdtsc();
                cout << (finish - start) << " CPU cycles" << endl;
                //Other processing with the tokens
                break;
            }
            }
            break;
        }
        ++i;   // advance through the buffer (the real token scanning is elided)
    }
}
The lines in the file are like this (with no blank lines in between):
34605792 R dacb 100
34605794 A racb S 44.17 100
34605797 R kacb 100
34605799 A sacb S 44.18 100
34605800 R nacb 100
34605800 A tacb B 44.16 100
34605801 R gacb 100
I am using atoi() on the 4th element in the 'R' messages and 5th element in 'A' messages and using atof() on the 4th element in the 'A' messages.
I'm guessing the reason why you see such a drastic improvement for atoi and atof, but not for your own, simpler function, is that the former have a large number of branches in order to handle all the edge cases. The first few times, this leads to a large number of incorrect branch predictions, which are costly. But after a few times, the predictions get more accurate. A correctly predicted branch is almost free, which would then make them competitive with your simpler version which doesn't include the branches to begin with.
Caching is surely also important, but I don't think that explains why your own function was fast from the beginning and did not see any relevant improvement after repeated execution (if I understand you correctly).
Using RDTSC for profiling is dangerous. From the Intel processor manual:
The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. If software requires RDTSC to be executed only after all previous instructions have completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence LFENCE;RDTSC.
With the inevitable Heisenberg effect that causes, you'll now measure the cost of RDTSCP or LFENCE. Consider measuring a loop instead.
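A rough sketch of what measuring a loop could look like (the iteration count and input string are made up for illustration; it uses the LFENCE;RDTSC sequence quoted above so the fence cost is amortised over many calls):
#include <emmintrin.h>   // _mm_lfence
#include <ia32intrin.h>  // _rdtsc with the Intel compiler (__rdtsc in <intrin.h> on MSVC)
#include <cstdlib>
#include <iostream>
using namespace std;

int main(){
    const char* sample = "44.17";        // made-up token standing in for real data
    const int iterations = 100000;
    volatile double sink = 0.0;          // stops the compiler discarding the calls

    _mm_lfence();                        // ensure earlier work has finished
    __int64 start = _rdtsc();
    for (int i = 0; i < iterations; ++i)
        sink += atof(sample);
    _mm_lfence();
    __int64 finish = _rdtsc();

    cout << (finish - start) / iterations << " cycles per atof() call (averaged)" << endl;
}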
Measuring performance for a single call like this isn't advisable. You'd get too much variance due to power throttling, interrupts and other OS/system interference, measurement overhead, and, as said above, cold/warm variance. On top of that, rdtsc is no longer considered a reliable measurement since your CPU may throttle its own frequency, but for the sake of this simple check we can say it's good enough.
You should run your code at least several thousand times, discard some portion at the beginning, and then divide to get the average - that would give you the "warm" performance, which would include (as mentioned in the comments above) hits in the close caches for both code and data (and also the TLBs), good branch prediction, and might also negate some of the external impacts (such as having only recently woken your CPU up from a power-down state).
Of course, you may argue that this performance is too optimistic because in real scenarios you won't always hit the L1 cache, etc. It may still be fine for comparing two different methods (such as competing with the library ato* functions), just don't count on the results reflecting real life. You can also make the test slightly harder and call the function with a more elaborate pattern of inputs that would stress the caches a bit better.
As for your question regarding the 20k-30k cycles: that's exactly the reason why you should discard the first few iterations. This isn't just cache-miss latency: you're actually waiting for the first instructions to do a code fetch, which may also wait for the code page translation to do a page walk (a long process that may involve multiple memory accesses), and if you're really unlucky, also swapping a page in from disk, which requires OS assistance and lots of IO latency. And this is all before you have started executing the first instruction.
The most likely explanation is that because you are calling atoi/atof so often, it is being identified as a hot spot and thus being kept in the Level 1 or Level 2 processor code cache. The CPU's replacement policy (the logic that determines which cache lines can be evicted when a cache miss occurs) would tend to keep such a hot spot in cache. There's a decent write-up of CPU caching technologies on Wikipedia, if you are interested.
Your initial timings were slow because your code wasn't yet in the CPU's fastest cache, but once it had been invoked some number of times, it was.