I am new to C++ programming and computer architecture.
I am trying to learn branch prediction using the ChampSim simulator (https://github.com/ChampSim/ChampSim).
However, I have no idea how to change the parameters in the program to run some simple simulations.
For example, for the bimodal predictor in ChampSim, how can I change the size of the prediction table? How can I change the branch history to 1 bit and run the simulation?
I also don't know how to change parameters like fetch width (how many instructions are fetched per cycle), decode width, execute width, commit width, and ROB size.
If anybody is familiar with the ChampSim simulator and C++, please help me.
For example, this is the code for the bimodal predictor:
#include "ooo_cpu.h"

#define BIMODAL_TABLE_SIZE 16384
#define BIMODAL_PRIME 16381
#define MAX_COUNTER 3
int bimodal_table[NUM_CPUS][BIMODAL_TABLE_SIZE];

void O3_CPU::initialize_branch_predictor()
{
    cout << "CPU " << cpu << " Bimodal branch predictor" << endl;
    for(int i = 0; i < BIMODAL_TABLE_SIZE; i++)
        bimodal_table[cpu][i] = 0;
}

uint8_t O3_CPU::predict_branch(uint64_t ip)
{
    uint32_t hash = ip % BIMODAL_PRIME;
    uint8_t prediction = (bimodal_table[cpu][hash] >= ((MAX_COUNTER + 1)/2)) ? 1 : 0;
    return prediction;
}

void O3_CPU::last_branch_result(uint64_t ip, uint8_t taken)
{
    uint32_t hash = ip % BIMODAL_PRIME;
    // saturating counter: count up on taken, down on not taken
    if (taken && (bimodal_table[cpu][hash] < MAX_COUNTER))
        bimodal_table[cpu][hash]++;
    else if ((taken == 0) && (bimodal_table[cpu][hash] > 0))
        bimodal_table[cpu][hash]--;
}
What should I do if I want to change the branch history to 1 bit and have the bimodal predictor use a prediction table of 128 entries?
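To make the two knobs concrete, here is a minimal standalone sketch of the same saturating-counter scheme with the table size and counter width as constructor arguments. It is not ChampSim code: the class name BimodalPredictor and its methods are hypothetical, and it assumes that "1-bit branch history" means a 1-bit counter per entry. In terms of the file shown above, the equivalent change would presumably be the three #defines, e.g. BIMODAL_TABLE_SIZE 128, BIMODAL_PRIME 127 (a prime below the table size, so the hash stays in range) and MAX_COUNTER 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical standalone bimodal predictor (not part of ChampSim).
// table_size   : number of counters in the prediction table
// counter_bits : width of each saturating counter (1 => last outcome only)
class BimodalPredictor {
public:
    BimodalPredictor(std::size_t table_size, unsigned counter_bits)
        : table_(table_size, 0), max_counter_((1u << counter_bits) - 1) {
        assert(table_size > 0);
        assert(counter_bits >= 1 && counter_bits <= 8);
    }

    // Predict taken when the counter is in the upper half of its range.
    bool predict(std::uint64_t ip) const {
        return table_[index(ip)] >= (max_counter_ + 1) / 2;
    }

    // Saturating update with the resolved branch outcome.
    void update(std::uint64_t ip, bool taken) {
        std::uint8_t& counter = table_[index(ip)];
        if (taken && counter < max_counter_)
            ++counter;
        else if (!taken && counter > 0)
            --counter;
    }

private:
    // Simple modulo hash; the code above uses a prime modulus instead.
    std::size_t index(std::uint64_t ip) const { return ip % table_.size(); }

    std::vector<std::uint8_t> table_;
    unsigned max_counter_;
};
```

With BimodalPredictor p(128, 1), every entry simply remembers the last outcome of the branches that map to it; with (128, 2) it behaves like the 2-bit counters in the original code, only with a smaller table.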
I was not satisfied with the performance of the thrust::reduce_by_key below, so I rewrote it in a variety of ways with little benefit (including removing the permutation iterator). However, it wasn't until I replaced it with a thrust::for_each() (see below) that capitalizes on atomicAdd() that I gained almost a 75x speedup! The two versions produce exactly the same results. What could be the biggest cause of the dramatic performance difference?
Complete code for comparison between the two approaches:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <ctime>
#include <iostream>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/sort.h>
constexpr int NumberOfOscillators = 100;
int SeedRange = 500;
struct GetProduct
template<typename Tuple>
__host__ __device__
int operator()(const Tuple & t)
return thrust::get<0>(t) * thrust::get<1>(t);
int main()
using namespace std;
using namespace thrust::placeholders;
thrust::device_vector<int> dv_OscillatorsVelocity(NumberOfOscillators);
thrust::device_vector<int> dv_outputCompare(NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Strength((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Active((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_TerminalOscillatorID_Map(0);
thrust::device_vector<int> dv_Permutation_Connections_To_TerminalOscillators((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connection_Keys((NumberOfOscillators - 1) * NumberOfOscillators);
srand((unsigned int)time(NULL));
thrust::fill(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), 0);
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
dv_Connections_Strength[c] = (rand() % SeedRange) - (SeedRange / 2);
dv_Connections_Active[c] = 0;
int curOscillatorIndx = -1;
for (int c = 0; c < NumberOfOscillators * NumberOfOscillators; c++)
if (c % NumberOfOscillators == 0)
if (c % NumberOfOscillators != curOscillatorIndx)
dv_Connections_TerminalOscillatorID_Map.push_back(c % NumberOfOscillators);
for (int n = 0; n < NumberOfOscillators; n++)
for (int p = 0; p < NumberOfOscillators - 1; p++)
thrust::make_counting_iterator<int>(dv_Connections_TerminalOscillatorID_Map.size()), // indices from 0 to N
dv_Connections_TerminalOscillatorID_Map.begin(), // array data
dv_Permutation_Connections_To_TerminalOscillators.begin() + (n * (NumberOfOscillators - 1)), // result will be written here
_1 == n);
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
dv_Connection_Keys[c] = c / (NumberOfOscillators - 1);
auto t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
//dv_Connection_Keys = 0,0,0,...1,1,1,...2,2,2,...3,3,3...
dv_Connection_Keys.begin(), //keys_first The beginning of the input key range.
dv_Connection_Keys.end(), //keys_last The end of the input key range.
), //values_first The beginning of the input value range.
thrust::make_discard_iterator(), //keys_output The beginning of the output key range.
dv_OscillatorsVelocity.begin() //values_output The beginning of the output value range.
std::cout << "iterations time for original: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
thrust::copy(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), dv_outputCompare.begin());
t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
thrust::make_counting_iterator(0) + dv_Connections_Active.size(),
s = dv_OscillatorsVelocity.size() - 1,
dv_b = thrust::raw_pointer_cast(dv_OscillatorsVelocity.data()),
dv_c = thrust::raw_pointer_cast(dv_Permutation_Connections_To_TerminalOscillators.data()), //3,6,9,0,7,10,1,4,11,2,5,8
dv_ppa = thrust::raw_pointer_cast(dv_Connections_Active.data()),
dv_pps = thrust::raw_pointer_cast(dv_Connections_Strength.data())
] __device__(int i) {
const int readIndex = i / s;
dv_b + readIndex,
(dv_ppa[dv_c[i]] * dv_pps[dv_c[i]])
std::cout << "iterations time for new: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
std::cout << "***" << (dv_OscillatorsVelocity == dv_outputCompare ? "success" : "fail") << "***\n";
return 0;
Extra info.:
My results are using a single GTX 980 TI.
There are 100 * (100 - 1) = 9,900 elements in all of the "Connection" vectors.
Each of the 100 unique keys found in dv_Connection_Keys has 99 elements each.
Use this compiler option: --expt-extended-lambda
What could be the biggest cause for the dramatic performance differences?
You are evidently building a debug project, that is your compilation settings include the -G switch. Although you were asked for your compilation settings in the comments, you didn't mention this.
It's important.
CUDA device code can have dramatically different performance characteristics when compiled with -G.
Don't evaluate performance of a debug project, or code compiled with -G.
When I compile and run your code without -G, I get:
iterations time for original: 210ms
iterations time for new: 70ms
When I compile your code with the debug switch -G, and run, I get:
iterations time for original: 12330ms
iterations time for new: 320ms
Returning to your question: that accounts for the biggest part of the difference.
The following answer tries to explain or at least motivate the remaining difference in performance after going from a debug build to a release build as explained in Robert Crovella's answer.
As the accesses in both kernels are not coalesced due to the permutation_iterator/indirection through dv_c, going by the plain number of memory accesses will overestimate the performance in this case. thrust::reduce_by_key (like pretty much all Thrust algorithms) is not and cannot be optimized for general permutations of the input, as the performance of these bandwidth-bound kernels depends strongly on coalesced memory access. Naturally, the algorithms are written such that accesses are coalesced for normal contiguous input. So if you need to access the data in permuted order more than once (which might happen inside a single reduction algorithm), it could be faster to actually permute the data in memory once using thrust::gather or thrust::scatter, so that at least all subsequent accesses are efficient. I would not expect the for_each solution to beat reduce_by_key without that permutation.
Newer versions of nvcc will try to automatically use warp-aggregated atomics to reduce the number of actual atomic instructions on the same address. As neighboring threads (same warp) tend to atomically write to the same address, this optimization is crucial for the performance of your custom reduction. Another important detail is that s = NumberOfOscillators - 1 is relatively small (99) in your code compared to typical thread-block sizes (256, 512, 1024; this matters for the locality of atomic writes) and to the amount of parallelism in the for_each (~NumberOfOscillators^2). So for smaller NumberOfOscillators I expect your custom reduction to get worse than reduce_by_key due to the vanishing amount of parallelism, while for bigger NumberOfOscillators you get both much more parallelism and more thread blocks/warps writing to the same location, so it is not clear which one will win without benchmarking on the given hardware and compiler.
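The "permute once, then reduce contiguous data" idea can be illustrated with a small host-side C++ analogue. This is only a sketch of the data flow, not Thrust or CUDA code and not a benchmark; the function names gather and segment_sums are made up for the example, and it assumes equal-length segments like the 99 connections per oscillator in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather step: out[i] = values[perm[i]]. After this, the data sits in memory
// in the order in which the reduction will read it.
std::vector<int> gather(const std::vector<int>& values,
                        const std::vector<std::size_t>& perm) {
    std::vector<int> out(perm.size());
    for (std::size_t i = 0; i < perm.size(); ++i) {
        assert(perm[i] < values.size());
        out[i] = values[perm[i]];
    }
    return out;
}

// Reduction step over contiguous, equal-length segments
// (key k owns elements [k * segment_len, (k + 1) * segment_len)).
std::vector<int> segment_sums(const std::vector<int>& contiguous,
                              std::size_t segment_len) {
    assert(segment_len > 0 && contiguous.size() % segment_len == 0);
    std::vector<int> sums(contiguous.size() / segment_len, 0);
    for (std::size_t i = 0; i < contiguous.size(); ++i)
        sums[i / segment_len] += contiguous[i];
    return sums;
}
```

If the permutation stays fixed across the 5000 iterations, the gather would only need to be repeated when the underlying values change, while every reduction afterwards reads its input linearly.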
I'm facing an issue with WebSocket that I can't understand.
Please use the code below as a reference:
int write_buffer_size = 8000 +
unsigned char *write_buffer = new unsigned char[write_buffer_size];
/* ... other code
write_buffer is filled in some way that is not important for the question
n = libwebsocket_write(wsi, &write_buffer[LWS_SEND_BUFFER_PRE_PADDING], write_len,
if (n < 0) {
cerr << "ERROR " << n << " writing to socket, hanging up" << endl;
if (utils) {
log = "wsmanager::error: hanging up writing to websocket";
return -1;
if (n < write_len) {
cerr << "Partial write: " << n << " < " << write_len << endl;
if (utils) {
log = "wsmanager-error: websocket partial write";
return -1;
When I try to send more than 7160 bytes of data, I always receive the same error, e.g. Partial write: 7160 < 8000.
Do you have any kind of explanation for that behavior?
I have allocated a buffer with 8000 bytes reserved for the payload so I was expecting to be able to send a maximum amount of data of 8K, but 7160 (bytes) seems to be the maximum amount of data I can send.
Any help is appreciated, thanks!
I have encountered a similar problem with an older version of libwebsockets. Although I didn't measure the exact limit, it was pretty much the same thing: n < write_len. I think my limit was much lower, below 2048 bytes, and I knew that the same code worked fine with a newer version of libwebsockets (on a different machine).
Since Debian Jessie doesn't have lws v1.6 in its repositories, I built it from the GitHub sources. Consider upgrading; it may help solve your problem. Beware, they have changed the API. It was mostly a renaming of functions from libwebsocket_* to lws_*, but some arguments changed as well. Check this pull request which migrates a boilerplate libwebsockets server to version 1.6. Most of these changes will affect your code.
We solved the issue by updating libwebsockets to version 1.7.3.
We also optimized the code by using a custom callback that is called when the channel is writable:
WSManager::onWritable() {
int ret, n;
struct fragment *frg;
if (!send_queue.empty() && !lws_partial_buffered(wsi)) {
frg = send_queue.front();
n = lws_write(wsi, frg->content + LWS_PRE, frg->len, (lws_write_protocol)frg->mode);
ret = checkWsWrite(n, frg->len);
if (ret >= 0 && !lws_partial_buffered(wsi)) {
if (frg->mode == WS_SINGLE_FRAGMENT || frg->mode == WS_LAST_FRAGMENT)
// pop fragment and free memory only if lws_write was successful
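As a library-independent illustration of the "queue fragments, send one per writable event" pattern, here is a small sketch. It does not use the libwebsockets API: SendQueue, push_fragments, on_writable and the WriteFn callback are hypothetical names, and the writer is assumed to return the number of bytes it accepted (or a negative value on error), like a plain socket write. Note that the lws_partial_buffered() check above suggests libwebsockets keeps the unsent remainder itself, in which case resending it as done here would be wrong, so treat this purely as the generic pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>

// Hypothetical writer: returns the number of bytes accepted, or < 0 on error.
using WriteFn = std::function<long(const char*, std::size_t)>;

class SendQueue {
public:
    // Split a payload into fragments of at most max_fragment bytes.
    void push_fragments(const std::string& payload, std::size_t max_fragment) {
        assert(max_fragment > 0);
        for (std::size_t pos = 0; pos < payload.size(); pos += max_fragment)
            queue_.push_back(payload.substr(pos, max_fragment));
    }

    // Call once per "channel is writable" event. Sends (the rest of) the
    // front fragment and pops it only after it has been written completely.
    // Returns false if the writer reported an error.
    bool on_writable(const WriteFn& write) {
        if (queue_.empty())
            return true;
        const std::string& front = queue_.front();
        const std::size_t remaining = front.size() - offset_;
        const long n = write(front.data() + offset_, remaining);
        if (n < 0)
            return false;               // keep the fragment for a later retry
        offset_ += static_cast<std::size_t>(n);
        if (offset_ >= front.size()) {  // fragment fully written
            queue_.pop_front();
            offset_ = 0;
        }
        return true;
    }

    std::size_t size() const { return queue_.size(); }
    bool empty() const { return queue_.empty(); }

private:
    std::deque<std::string> queue_;
    std::size_t offset_ = 0;  // bytes of the front fragment already written
};
```

The point of the design is that a short write is never treated as fatal: the fragment stays at the front of the queue until a later writable event finishes it.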
I am currently working on a P300 detection system (basically, there is a detectable increase in a brain wave when a user sees something they are interested in) in C++ using the Emotiv EPOC. The system works, but to improve accuracy I'm attempting to use Wekinator for machine learning, using a support vector machine (SVM).
So for my P300 system I have three stimuli (left, right and forward arrows). My program keeps track of the stimulus index, performs some filtering on the incoming "brain wave", and then calculates which index has the highest average area under the curve to determine which stimulus the user is looking at.
For my integration with Wekinator: I have set up Wekinator to receive a custom OSC message with 64 features (the length of the brain wave related to the P300) and set up three parameters with discrete values of 1 or 0. For training, I have been sending the "brain wave" for each stimulus index in a trial and setting the relevant parameters to 0 or 1, then training it and running it. The issue is that when the program receives the OSC message from Wekinator, it gets 4 messages back rather than just the single most likely one.
Here is the code for the training (and input to Wekinator during run time):
for(int s=0; s < stimCount; s++){
for(int i=0; i < stimIndexes[s].size(); i++) {
int eegIdx = stimIndexes[s][i];
ofxOscMessage wek;
if (eegIdx + winStart + winLen < sig.size()) {
int winIdx = 0;
for(int e=eegIdx + winStart; e < eegIdx + winStart + winLen; e++) {
//stimAvgWins[s][winIdx++] += sig[e];
std::cout << "Num args: " << wek.getNumArgs() << std::endl;
Here is the receipt of messages from Wekinator:
ofxOscMessage msg;
while(receiver.getNextMessage(&msg)) {
std::cout << "Wek Args: " << msg.getNumArgs() << std::endl;
if (msg.getAddress() == "/OSCSynth/params"){
resultReceived = true;
if(msg.getArgAsFloat(0) == 1){
result = 0;
} else if(msg.getArgAsFloat(1) == 1){
result = 1;
} else if(msg.getArgAsFloat(2) == 1){
result = 2;
std::cout << "Wek Result: " << result << std::endl;
Full code for both is at the following Gist:
My main question is basically whether something is wrong with the code: Should I send the full "brain wave" for a trial to Wekinator? Or should I train Wekinator on different features? Does the code look right, or should it be amended? Is there a way to receive only one OSC message back from Wekinator based on a smaller feature size, i.e. 64 rather than 4 x 64 per stimulus or 9 x 64 per stimulus index?
I want to write a program in C++ that is able to stress a Windows 7 system. My intention is for this program to bring the CPU usage to 100% while using all of the installed RAM.
I have tried a big for loop that performs a simple multiplication at every step: the CPU usage increases, but the RAM usage remains low.
What is the best approach to reach my target?
In an OS agnostic way, you could allocate and fill heap memory obtained with malloc(3) (so do some computation with that zone), or in C++ with operator new. Be sure to test against failure of malloc. And increase the size of your zone progressively.
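A minimal sketch of that suggestion, assuming a made-up helper class MemoryHog: it allocates fixed-size blocks with the non-throwing form of operator new, checks each allocation for failure, and writes to every byte so the memory is actually used rather than just reserved. The limit_bytes cap is an addition for safety so that a test run cannot exhaust the machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical helper: grows its memory footprint block by block.
class MemoryHog {
public:
    // Allocate blocks of block_bytes until limit_bytes would be exceeded or
    // an allocation fails. Returns the total number of bytes currently held.
    std::size_t grow(std::size_t block_bytes, std::size_t limit_bytes) {
        assert(block_bytes > 0);
        while (held_ + block_bytes <= limit_bytes) {
            std::unique_ptr<unsigned char[]> block(
                new (std::nothrow) unsigned char[block_bytes]);
            if (!block)
                break;  // allocation failed: stop growing
            // Touch every byte so the pages are really backed by memory.
            std::memset(block.get(), 0xA5, block_bytes);
            blocks_.push_back(std::move(block));
            held_ += block_bytes;
        }
        return held_;
    }

    std::size_t held() const { return held_; }

private:
    std::vector<std::unique_ptr<unsigned char[]>> blocks_;
    std::size_t held_ = 0;
};
```

Calling grow() repeatedly with an increasing limit_bytes gives the "increase the size progressively" behaviour; the memory is released when the MemoryHog object is destroyed.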
If your goal is ability to stress CPU and RAM utilization (and not necessarily writing a program), try HeavyLoad which is free and does just that http://www.jam-software.com/heavyload/
I wrote a test program for this, using multithreading and Mersenne primes as the algorithm for stress testing.
The code first determines the number of cores on the machine (Windows only, sorry!):
NtQuerySystemInformation(SystemBasicInformation, &BasicInformation, sizeof(SYSTEM_BASIC_INFORMATION), NULL);
It then runs a test method to spawn a thread for each core
// make sure we've got at least as many threads as cores
for (int i = 0; i < this->BasicInformation.NumberOfProcessors; i++)
threads.push_back(thread(&CpuBenchmark::MersennePrimes, CpuBenchmark()));
Each thread runs our load:
BOOL CpuBenchmark::IsPrime(ULONG64 n) // determines whether the number n is prime
{
    // i * i <= n, so that perfect squares such as 4 or 9 are not reported as prime
    for (ULONG64 i = 2; i * i <= n; i++)
        if (n % i == 0)
            return false;
    return true;
}

ULONG64 CpuBenchmark::Power2(ULONG64 n) // returns 2 raised to the power of n
{
    ULONG64 square = 1;
    for (ULONG64 i = 0; i < n; i++)
        square *= 2;
    return square;
}

VOID CpuBenchmark::MersennePrimes()
{
    ULONG64 i;
    ULONG64 n;
    for (n = 2; n <= 61; n++)
    {
        if (IsPrime(n) == 1)
        {
            i = Power2(n) - 1;
            if (IsPrime(i) == 1) {
                // dont care just want to stress the CPU
            }
        }
    }
}
The program then waits for all threads to complete:
// wait for them all to finish
for (auto& th : threads)
    th.join();
Full article and source code on my blog site
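For readers who are not on Windows, the same structure can be sketched with only the standard library. This is not the code from the article: the function names are invented for the example, std::thread::hardware_concurrency() stands in for the NtQuerySystemInformation call, and counting primes below a limit stands in for the Mersenne prime search.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Trial division; fine for the small limits used in this sketch.
bool is_prime(std::uint64_t n) {
    if (n < 2)
        return false;
    for (std::uint64_t i = 2; i * i <= n; ++i)
        if (n % i == 0)
            return false;
    return true;
}

// The CPU-bound work each thread performs.
std::uint64_t count_primes_below(std::uint64_t limit) {
    std::uint64_t count = 0;
    for (std::uint64_t n = 2; n < limit; ++n)
        if (is_prime(n))
            ++count;
    return count;
}

// hardware_concurrency() may return 0 when the value is unknown.
unsigned worker_count() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Spawn thread_count workers, wait for all of them, and sum their results.
std::uint64_t run_cpu_load(unsigned thread_count, std::uint64_t limit) {
    assert(thread_count > 0);
    std::vector<std::uint64_t> results(thread_count, 0);
    std::vector<std::thread> threads;
    threads.reserve(thread_count);
    for (unsigned t = 0; t < thread_count; ++t)
        threads.emplace_back([&results, t, limit] {
            results[t] = count_primes_below(limit);  // each thread owns one slot
        });
    for (auto& th : threads)
        th.join();
    std::uint64_t total = 0;
    for (std::uint64_t r : results)
        total += r;
    return total;
}
```

A call such as run_cpu_load(worker_count(), limit) with a large limit keeps one thread busy per logical core; depending on the toolchain, linking may require a threading flag such as -pthread.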
There are 2 possible techniques shown below which do the same task.
I would like to know if there will be any performance difference between the two.
I think the first technique will suffer due to branch prediction as contents of A are random.
Technique 1:
#include <iostream>
#include <cstdlib>
#include <ctime>
#define SIZE 1000000
using namespace std;
class MyClass
bool flag;
void setFlag(bool f) {flag = f;}
int main()
MyClass obj;
int *A = new int[SIZE];
for(int i = 0; i < SIZE; i++)
A[i] = (unsigned int)rand();
time_t mytime1;
time_t mytime2;
for(int test = 0; test < 5000; test++)
for(int i = 0; i < SIZE; i++)
if(A[i] > 100)
cout << asctime(localtime(&mytime1)) << endl;
cout << asctime(localtime(&mytime2)) << endl;
Sat May 03 20:08:07 2014
Sat May 03 20:08:32 2014
i.e. Time taken = 25sec
Technique 2:
#include <iostream>
#include <cstdlib>
#include <ctime>
#define SIZE 1000000
using namespace std;
class MyClass
bool flag;
void setFlag(bool f) {flag = f;}
int main()
MyClass obj;
int *A = new int[SIZE];
for(int i = 0; i < SIZE; i++)
A[i] = (unsigned int)rand();
time_t mytime1;
time_t mytime2;
for(int test = 0; test < 5000; test++)
for(int i = 0; i < SIZE; i++)
obj.setFlag(A[i] > 100);
cout << asctime(localtime(&mytime1)) << endl;
cout << asctime(localtime(&mytime2)) << endl;
Sat May 03 20:08:42 2014
Sat May 03 20:09:10 2014
i.e. Time taken = 28sec
The code was compiled with the MinGW 64-bit compiler with no flags.
From the results it looks like the opposite is happening.
After making the check for RAND_MAX / 2 instead of 100, I am getting the following results:
Technique 1: 70sec
Technique 2: 28sec
So it is now clear that Technique 2 is better than Technique 1 in this case, which can be explained by branch misprediction.
With optimisations enabled the binaries are exactly the same, in GCC 4.8 at least: demo.
They're different with optimisations disabled, though: demo.
This very poor attempt at a measurement suggests that the second is actually slower, though both programs run in the same duration in real terms: demo
real 0m0.052s
user 0m0.036s
sys 0m0.012s
real 0m0.052s
user 0m0.044s
sys 0m0.004s
To find out how they really differ in performance with optimisations disabled, you can benchmark properly with more runs.
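A rough sketch of what benchmarking "with more runs" could look like, using std::chrono::steady_clock instead of time()/asctime. All names are made up for the example. Technique 1 is assumed to be an if/else that calls setFlag(true) or setFlag(false), since those lines are missing from the listing above. The comparison threshold is a parameter, so both the original 100 and the RAND_MAX / 2 variant from the question can be measured. The returned count only exists so that an optimizing compiler cannot discard the loops.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct MyClass {
    bool flag = false;
    void setFlag(bool f) { flag = f; }
};

// Technique 1 (assumed form): explicit branch on the comparison.
std::size_t run_branchy(const std::vector<int>& a, int threshold, MyClass& obj) {
    std::size_t set_count = 0;
    for (int v : a) {
        if (v > threshold)
            obj.setFlag(true);
        else
            obj.setFlag(false);
        set_count += obj.flag ? 1 : 0;
    }
    return set_count;
}

// Technique 2: store the comparison result directly.
std::size_t run_branchless(const std::vector<int>& a, int threshold, MyClass& obj) {
    std::size_t set_count = 0;
    for (int v : a) {
        obj.setFlag(v > threshold);
        set_count += obj.flag ? 1 : 0;
    }
    return set_count;
}

// Fill a vector with rand() values, as in the question.
std::vector<int> make_input(std::size_t size, unsigned seed) {
    std::srand(seed);
    std::vector<int> a(size);
    for (int& v : a)
        v = std::rand();
    return a;
}

// Run body() several times and report the fastest run in milliseconds.
template <typename F>
double best_of_ms(int runs, F&& body) {
    assert(runs > 0);
    double best = -1.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        body();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}
```

Usage would be along the lines of best_of_ms(20, [&] { sink += run_branchy(a, RAND_MAX / 2, obj); }) for each technique, printing both results; taking the minimum over many runs reduces the influence of scheduling noise.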
Frankly, though, since it's irrelevant for your production code I wouldn't even bother.
I agree that this isn't very interesting for practical code (especially when the difference disappears with -O3), but for the sake of academic interest: in some conditions it may be better to rely on the branch predictor.
On one hand, in this particular case the branch is almost always going to be not-taken (as RAND_MAX >> 100), which is easy to predict both in terms of branch resolution and in terms of the next IP address. Try changing the condition to a 50% chance and then benchmark this.
On the other hand, the second technique makes the stores to the obj flag data-dependent on the loads from A[i]. These loads are going to be slow, as your dataset is 1000000*sizeof(int) bytes (almost 4MB), meaning that it could sit either in the L3 cache or in memory; either way that's quite a few cycles for each new cache line (once every few accesses). When the writes to the flag were independent, they could queue up in parallel; now you have to stall them until you get the data. In theory, the CPU should be able to "pipeline" this, since stores are performed much later than loads along the pipeline on most CPUs, but in practice you're limited by the size of the execution window (in most machines that would be ~100, I believe), so if the store of the current iteration is stalled, you won't be able to launch the loads required for future iterations very far ahead.
In other words, you may be losing due to the fact that CPUs have fairly decent branch prediction, but no (or hardly any) data prediction.