Branch Prediction - Global Share Implementation Explanation [closed] - c++

I'm working on an assignment in my Computer Architecture class where we have to implement a branch prediction algorithm in C++ (for the Alpha 21264 microprocessor architecture).
There is a solution provided as an example. This solution is an implementation of a Global Share Predictor.
I am simply trying to understand the given solution, specifically what is going on in:
*predict (branch_info &b) {...}
specifically,
if (b.br_flags & BR_CONDITIONAL) {...}
Can anyone provide me with an explanation? Thank you.

I think the following paper by Scott McFarling provides the detailed answer:
http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-36.pdf
Let me use your code to explain.
unsigned char tab[1<<TABLE_BITS];
is the Pattern History Table. Each entry in the tab keeps a 2-bit saturating counter. The direction of the conditional branch is finally determined by the MSB of the counter:
u.direction_prediction (tab[u.index] >> 1);
The reason we use a counter of two or more bits instead of a single bit is to make the prediction less sensitive to a single unusual outcome, which reduces mispredictions. For example,
for( int i = 0; i < m; i++ )
{
    for( int j = 0; j < n; j++ )
    {
        ...
    }
}
when the inner loop is entered again on the next pass of the outer loop, a one-bit counter (which was just trained by the loop exit on the previous pass) will mispredict the first branch, while a 2-bit counter avoids that misprediction.
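For concreteness, here is a minimal sketch (my own, not taken from the provided solution) of how a single 2-bit saturating counter is read and updated; the tab array in your code simply holds one such counter per entry:

unsigned char counter = 0;                      // 0,1 = predict not taken; 2,3 = predict taken

bool predict_taken() { return counter >> 1; }   // the MSB gives the predicted direction

void update(bool taken)
{
    if (taken)  { if (counter < 3) ++counter; } // saturate at 3 (strongly taken)
    else        { if (counter > 0) --counter; } // saturate at 0 (strongly not taken)
}

Because a single not-taken outcome only moves a "strongly taken" counter to "weakly taken", the prediction for the inner-loop branch stays "taken" when the loop is re-entered.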
The next is how to find the correct pattern in the Pattern History Table.
The naive way is to use branch address as index. But it ignores the correlation between different branches. That is why Global Branch History is introduced (For more details, please refer to http://www.eecg.utoronto.ca/~moshovos/ACA06/readings/two-level-bpred.pdf).
In your code,
unsigned int history;
is the Branch History Register which stores the Global Branch History.
It was later found that combining the Global Branch History and the Branch Address as the index leads to more accurate prediction than using either one alone, because both of them affect the branch pattern.
If one of them is ignored, different branch patterns may be hashed to the same position in the Pattern History Table, causing collisions.
Before Gshare was proposed, there was a scheme called Gselect, which uses the concatenation of the Global Branch History and the Branch Address as the index into the Pattern History Table.
The solution provided by Gshare is the hash function of
index = branch_addr XOR branch_history
This is exactly what the following code does:
u.index = (history << (TABLE_BITS - HISTORY_LENGTH)) ^ (b.address & ((1<<TABLE_BITS)-1));
Scott McFarling's paper provides a good example to show how Gshare works better than Gselect:
Branch Address=1111_1111 Global Branch History=0000_0000
Branch Address=1111_1111 Global Branch History=1000_0000
Assume that Gselect forms the index like this:
index = { {branch_addr[7:4]}, {branch_history[3:0]} }
Then Gselect will produce 1111_0000 for both cases while Gshare can distinguish the different patterns.
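To make the comparison concrete, here is a small sketch (mine, not part of the provided solution) that computes both indices for the two cases above, assuming an 8-bit index:

#include <cstdio>

int main()
{
    unsigned addr    = 0xFF;                                 // Branch Address   = 1111_1111
    unsigned hist[2] = { 0x00, 0x80 };                       // Global Histories = 0000_0000, 1000_0000

    for (unsigned h : hist) {
        unsigned gselect = ((addr >> 4) << 4) | (h & 0x0F);  // { addr[7:4], history[3:0] }
        unsigned gshare  = (addr ^ h) & 0xFF;                // addr XOR history
        std::printf("history=%02X  gselect=%02X  gshare=%02X\n", h, gselect, gshare);
    }
}

Gselect yields 0xF0 (1111_0000) for both histories, so the two patterns collide in the same table entry, while Gshare yields 0xFF and 0x7F and keeps them apart.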
As far as I know, Gshare has turned out to be one of the most effective simple ways to reduce these collisions.


Shouldn't this be using a backtracking algorithm?

I am solving some questions on LeetCode. One of the questions is:
Given an m x n grid filled with non-negative numbers, find a path from top left to bottom right which minimizes the sum of all numbers along its path. You can only move either down or right at any point in time.
The editorial as well as the solutions posted all use dynamic programming. One of the most upvoted solutions is as follows:
class Solution {
public:
    int minPathSum(vector<vector<int>>& grid) {
        int m = grid.size();
        int n = grid[0].size();
        vector<vector<int> > sum(m, vector<int>(n, grid[0][0]));
        for (int i = 1; i < m; i++)
            sum[i][0] = sum[i - 1][0] + grid[i][0];
        for (int j = 1; j < n; j++)
            sum[0][j] = sum[0][j - 1] + grid[0][j];
        for (int i = 1; i < m; i++)
            for (int j = 1; j < n; j++)
                sum[i][j] = min(sum[i - 1][j], sum[i][j - 1]) + grid[i][j];
        return sum[m - 1][n - 1];
    }
};
My question is simple: shouldn't this be solved using backtracking? Suppose the input matrix is something like:
[
[1,2,500]
[100,500,500]
[1,3,4]
]
My doubt is because in DP, the solutions to subproblems are a part of the global solution (optimal substructure). However, as can be seen above, when we make a local choice of choosing 2 out of (2,100), we might be wrong, since the future paths might be too expensive (all numbers surrounding 2 are 500s). So, how is using dynamic programming justified in this case?
To summarize:
Shouldn't we use backtracking since we might have to retract our path if we have made an incorrect choice previously (looking at local maxima)?
How is this a dynamic programming question?
P.S.: The above solution definitely runs.
The example you illustrated above shows that a greedy solution to the problem will not necessarily produce an optimal solution, and you're absolutely right about that.
However, the DP solution to this problem doesn't quite use this strategy. The idea behind the DP solution is to compute, for each location, the cost of the shortest path ending at that location. In the course of solving the overall problem, the DP algorithm will end up computing the length of some shortest paths that pass through the 2 in your grid, but it won't necessarily use those intermediate shortest paths when determining the overall shortest path to return. Try tracing through the above code on your example - do you see how it computes and then doesn't end up using those other path options?
Shouldn't we use backtracking since we might have to retract our path if we have made an incorrect choice previously (looking at local maxima)?
In a real-world scenario, there will be quite a few factors that will determine which algorithm will be better suited to solve this problem.
This DP solution is alright in the sense that it will give you the best performance/memory usage when handling worst-case scenarios.
Any backtracking/Dijkstra/A* algorithm will need to maintain a full matrix as well as a list of open nodes. This DP solution just assumes every node will end up being visited, so it can ditch the open-node list and just maintain the costs buffer.
By assuming every node will be visited, it also gets rid of the "which node do I open next" part of the algorithm.
So if optimal worst-case performance is what we are looking for, then this algorithm is actually going to be very hard to beat. But whether that's what we want or not is a different matter.
How is this a dynamic programming question?
This is only a dynamic programming question in the sense that there exists a dynamic programming solution for it. But by no means is DP the only way to tackle it.
Edit: Before I get dunked on, yes there are more memory-efficient solutions, but at very high CPU costs in the worst-case scenarios.
For your input
[
[ 1, 2, 500]
[100, 500, 500]
[ 1, 3, 4]
]
the sum array works out to
[
[ 1, 3, 503]
[101, 503, 1003]
[102, 105, 109]
]
And we can even retrace shortest path:
109, 105, 102, 101, 1
The algorithm doesn't check each path; it uses the property that the optimal cost of reaching a cell can be computed from the optimal costs of its two possible predecessors:
sum[i][j] = min(sum[i - 1][j],    // better of the path arriving from the cell above
                sum[i][j - 1])    // and the path arriving from the cell to the left
            + grid[i][j];         // plus the cost of the current cell
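For completeness, here is a minimal sketch of the retrace step (my own names, not from the posted solution): start at the bottom-right cell of sum and repeatedly step to the cheaper of the two possible predecessors until the top-left cell is reached.

#include <vector>
using std::vector;

vector<int> retracePath(const vector<vector<int>>& sum)
{
    int i = sum.size() - 1, j = sum[0].size() - 1;
    vector<int> path{ sum[i][j] };
    while (i > 0 || j > 0) {
        if (i == 0)                              --j;  // only the left neighbour exists
        else if (j == 0)                         --i;  // only the upper neighbour exists
        else if (sum[i - 1][j] < sum[i][j - 1])  --i;  // the upper neighbour is cheaper
        else                                     --j;  // the left neighbour is cheaper
        path.push_back(sum[i][j]);
    }
    return path;  // 109, 105, 102, 101, 1 for the example above
}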
Backtracking, in itself, doesn't fit this problem particularly well.
Backtracking works well for problems like eight queens, where a proposed solution either works, or it doesn't. We try a possible route to a solution, and if it fails, we backtrack and try another possible route, until we find one that works.
In this case, however, every possible route gets us from the beginning to the end. We can't just try different possibilities until we find one that works. Instead, we basically have to try every route from beginning to end until we find the one that works best (the lowest weight, in this case).
Now, it's certainly true that with backtracking and pruning, we could (perhaps) improve on this to at least some degree. In particular, let's assume you did a search that started by looking downward (if possible) and then to the side. In this case, with the input you gave, its first attempt would end up being the optimal route.
The question is whether it can recognize that, and prune some branches of the tree without traversing them entirely. The answer is that yes, it can. To do that, it keeps track of the best route it's found so far, and based upon that, it can reject entire sub-trees. In this case its first route gives a total weight of 109. Then it tries to the right of the first node, which is a 2, for a total weight of 3 so far. That's smaller than 109, so it proceeds. From there, it looks downward and gets to the 500. That gives a weight of 503, so without doing any further looking, it knows no route from there can be suitable, so it stops and prunes off all the branches that start from that 500. Then it tries rightward from the 2 and finds another 500. This lets it prune that entire branch as well. So, in these cases, it never looks at the third 500, or the 3 and 4 at all--just by looking at the 500 nodes, we can determine that those can't possibly yield an optimal solution.
Whether that's really an improvement on the DP strategy largely comes down to a question of what operations cost how much. For the task at hand, it probably doesn't make much difference either way. If, however, your input matrix was a lot larger, it might. For example, we might have a large input stored in tiles. With a DP solution, we evaluate all the possibilities, so we always load all the tiles. With a tree-trimming approach, we might be able to completely avoid loading some tiles at all, because the routes including those tiles have already been eliminated.
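As a rough illustration of the pruning described above (a sketch with my own names, not an optimized implementation), a depth-first search that carries the best total found so far and abandons any partial route that already matches or exceeds it might look like this:

#include <vector>
#include <climits>
using std::vector;

void explore(const vector<vector<int>>& grid, int i, int j,
             int costSoFar, int& bestSoFar)
{
    int m = grid.size(), n = grid[0].size();
    costSoFar += grid[i][j];
    if (costSoFar >= bestSoFar) return;          // prune: this route can no longer win
    if (i == m - 1 && j == n - 1) {              // reached the bottom-right corner
        bestSoFar = costSoFar;
        return;
    }
    if (i + 1 < m) explore(grid, i + 1, j, costSoFar, bestSoFar);  // look downward first
    if (j + 1 < n) explore(grid, i, j + 1, costSoFar, bestSoFar);  // then to the right
}

int minPathSumPruned(const vector<vector<int>>& grid)
{
    int best = INT_MAX;
    explore(grid, 0, 0, 0, best);
    return best;
}

On the example grid, the first route tried (down, down, right, right) sets the best total to 109, after which every route passing through a 500 is cut off as soon as that 500 is added. Note that the pruning is only valid because all weights are non-negative.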

Is adding 1 to a number repeatedly slower than adding everything at the same time in C++? [closed]

If I have a number a, would it be slower to add 1 to it b times rather than simply adding a + b?
a += b;
or
for (int i = 0; i < b; i++) {
    a += 1;
}
I realize that the second example seems kind of silly, but I have a situation where coding would actually be easier that way, and I am wondering if that would impact performance.
EDIT: Thank you for all your answers. It looks like some posters would like to know what situation I have. I am trying to write a function to shift an inputted character a certain number of characters over (i.e. a cipher) if it is a letter. So, I want to say that one char += the number of shifts, but I also need to account for the jumps between the lowercase and uppercase characters in the ASCII table, and also for wrapping from z back to A. So, while it is doable in another way, I thought it would be easiest to keep adding one until I get to the end of a block of letter characters, then jump to the next one and keep going.
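For illustration, the shift described above can also be done with a single addition per character. Here is a minimal sketch that assumes ASCII and a non-negative shift; the function name and the wrapping rule (wrapping within each case's own block of letters) are my own choices, not the OP's code:

char shiftChar(char c, int shift)
{
    if (c >= 'a' && c <= 'z')
        return char('a' + (c - 'a' + shift) % 26);   // wrap within the lowercase block
    if (c >= 'A' && c <= 'Z')
        return char('A' + (c - 'A' + shift) % 26);   // wrap within the uppercase block
    return c;                                        // leave non-letters untouched
}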
If your loop is really that simple, I don't see any reason why a compiler couldn't optimize it. I have no idea if any actually would, though. If your compiler doesn't, the single addition will be much faster than the loop.
The language C++ does not describe how long either of those operations take. Compilers are free to turn your first statement into the second, and that is a legal way to compile it.
In practice, many compilers would treat those two subexpressions as the same expression, assuming everything is of type int. The second, however, would be fragile in that seemingly innocuous changes would cause massive performance degradation. Small changes in type that 'should not matter', extra statements nearby, etc.
It would be extremely rare for the first to be slower than the second, but if the type of a were such that += b was a much slower operation than calling += 1 a bunch of times, it could be. For example:

#include <vector>

struct A {
    std::vector<int> v;
    void operator+=( int x ) {
        // fast path for the common case of adding exactly 1:
        // double the capacity when the buffer is full
        if (x==1 && v.size()==v.capacity()) v.reserve( v.size()*2 );
        // otherwise grow the buffer one element at a time,
        // forcing a reallocation on (nearly) every step:
        for (int i = 0; i < x; ++i) {
            v.reserve( v.size()+1 );
            v.resize( v.size()+1 );
        }
    }
};
then A a; int b = 100000; a+=b; would take much longer than the loop construct.
But I had to work at it.
The overhead (CPU instructions) of having a variable incremented in a loop is likely to be insignificant compared to the total number of instructions in that loop (unless the only thing you are doing in the loop is incrementing). Loop variables are likely to remain in the low levels of the CPU cache (if not in CPU registers) and are very fast to increment, as they don't need to be read from RAM via the FSB. Anyway, if in doubt just make a quick profile and you'll know if it makes sense to sacrifice code readability for speed.
Yes, absolutely slower. The second example is beyond silly. I highly doubt you have a situation where it would make sense to do it that way.
Let's say 'b' is 500,000... most computers can add that in a single operation, so why do 500,000 operations (not including the loop overhead)?
If the processor has an increment instruction, the compiler will usually translate the "add one" operation into an increment instruction.
Some processors may have optimized increment instructions to help speed up things like loops. Other processors can combine an increment operation with a load or store instruction.
There is a possibility that a small loop containing only an increment instruction could be replaced by a multiply and add. The compiler is allowed to do so, if and only if the functionality is the same.
In general, this kind of optimization produces negligible gains. However, for large data sets and performance-critical applications, it may be necessary, and the time gained can be significant.
Edit 1:
For adding values other than 1, the compiler would emit processor instructions to use the best addition operations.
The add operation is optimized in hardware as a different animal than incrementing. Arithmetic Logic Units (ALU) have been around for a long time. The basic addition operation is very optimized and a lot faster than incrementing in a loop.

all solutions to change making with dynamic programming

I was reviewing my handouts for our algorithm class and I started to think about this question:
Given different types of coins with different values, find all coin configurations to add up to a certain sum without duplication.
During class, we solved the problem to find the number of all possible ways for a sum and the least number of coins for a sum. However, we never tried to actually find the solutions.
I was thinking about solving this problem with dynamic programming.
I came up with the recursive version (for simplicity I only gather the solutions as strings):
void solve(vector<string>& result, string& currSoln, int index, int target, vector<int>& coins)
{
    if(target < 0)
    {
        return;
    }
    if(target == 0)
    {
        result.push_back(currSoln);
    }
    for(int i = index; i < coins.size(); ++i)
    {
        stringstream ss;
        ss << coins[i];
        string newCurrSoln = currSoln + ss.str() + " ";
        solve(result, newCurrSoln, i, target - coins[i], coins);
    }
}
However, I got stuck when trying to use DP to solve the problem.
I have 2 major obstacles:
I don't know what data structure I should use to store previous answers
I don't know what my bottom-up procedure (using loops to replace the recursion) should look like.
Any help is welcomed and some codes would be appreciated!
Thank you for your time.
In a DP solution you generate a set of intermediate states and count how many ways there are to reach each one. Then your answer is the count that ends up in the success state.
So, for change counting, the states are that you got to a specific amount of change. The counts are the number of ways of making change. And the success state is that you made the correct amount of change.
To go from counting solutions to enumerating them you need to keep those intermediate states, and also keep a record in each state of all of the states that transitioned to that one - and information about how. (In the case of change counting, the how would be which coin you added.)
Now with that information you can start from the success state and recursively go backwards through the dp data structures to actually find the solutions rather than the count. The good news is that all of your recursive work is efficient - you're always only looking at paths that succeed so waste no time on things that won't work. But if there are a billion solutions, then there is no royal shortcut that makes printing out a billion solutions fast.
If you wish to be a little clever, though, you can turn this into a usable enumeration. You can, for instance, say "I know there are 4323431 solutions, what is the 432134'th one?" And finding that solution will be quick.
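To tie that back to the two obstacles in the question, here is a bottom-up sketch (my own names, not taken from any handout): the table can[i][t] records whether amount t is reachable using coins[i..], and the enumeration step then walks only through states that are known to lead to a solution, so no work is wasted on dead ends.

#include <string>
#include <vector>
using namespace std;

// emit every combination of coins[i..] that sums to t, guided by the "can" table
void emit(const vector<int>& coins, const vector<vector<char>>& can,
          int i, int t, string soln, vector<string>& result)
{
    if (t == 0) { result.push_back(soln); return; }
    if (i == (int)coins.size()) return;
    if (t >= coins[i] && can[i][t - coins[i]])       // use coins[i] (again) and stay at i
        emit(coins, can, i, t - coins[i], soln + to_string(coins[i]) + " ", result);
    if (can[i + 1][t])                               // skip coins[i] and move on to the next coin
        emit(coins, can, i + 1, t, soln, result);
}

vector<string> allCombinations(int target, const vector<int>& coins)
{
    int n = coins.size();
    // can[i][t]: can amount t be made using only coins[i..n-1]?  Filled bottom-up.
    vector<vector<char>> can(n + 1, vector<char>(target + 1, 0));
    can[n][0] = 1;
    for (int i = n - 1; i >= 0; --i)
        for (int t = 0; t <= target; ++t)
            can[i][t] = can[i + 1][t] || (t >= coins[i] && can[i][t - coins[i]]);
    vector<string> result;
    emit(coins, can, 0, target, "", result);
    return result;
}

This addresses both obstacles: the data structure for the previous answers is the 2-D boolean table, the bottom-up procedure is the pair of nested loops that fills it, and only the final walk that produces the solutions remains recursive (its cost is proportional to the amount of output, as noted above).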
It is immediately obvious that you can take a dynamic programming approach. What isn't obvious is that in most cases (depending on the denominations of the coins) you can use the greedy algorithm, which is likely to be more efficient. See Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms, 2nd ed., problem 16-1.
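For reference, the greedy approach mentioned there looks roughly like this sketch (assuming a canonical denomination system such as 1, 5, 10, 25; it finds one minimum-coin solution, not all configurations, and it is not correct for arbitrary denominations):

#include <algorithm>
#include <vector>
using std::vector;

vector<int> greedyChange(int target, vector<int> coins)
{
    std::sort(coins.rbegin(), coins.rend());   // try the largest denomination first
    vector<int> used;
    for (int c : coins)
        while (target >= c) {
            target -= c;
            used.push_back(c);
        }
    return used;  // if target is still non-zero here, greedy failed for this coin set
}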

Compiler optimization for loops [closed]

I have frequently noticed the following pattern:
for (int i = 0; i < strlen(str); ++i) {
    // do some operations on string
}
The complexity of the above loop would be O(N²), because strlen is O(N) and the comparison is made on every iteration.
However, if we calculate strlen before the loop and use that constant, the complexity of the loop is reduced to O(N).
I am sure there are many other such optimizations.
Does the compiler carry out such optimizations or do programmers have to take precautions to prevent it?
While I don't have any solid evidence whatsoever, my guess would be this:
The compiler performs data-flow analysis on the variable str. If it is potentially modified inside the loop or marked volatile, there is no guarantee that strlen(str) will remain constant between iterations, and therefore its result cannot be cached. Otherwise, it should be safe to cache and will be optimized.
Yes, good optimizers are able to do this kind of transform if they can establish that the string remains unmodified in the loop body. They can pull out of loops expressions that remain constant.
Anyway, in a case where you would, say, turn all characters to uppercase, it would be hard for a compiler to infer that the string length won't change.
I personally favor a "defensive" approach, not relying on advanced compiler skills, and do the obvious optimizations myself. In case the code would be ported to a different environment, with a poorer compiler, or just in case of doubt.
Also think of the cases where optimization is off.
Try
for (int i = 0; str[i]; ++i) {
    // do some operations on string
}
since checking for the terminating null character is essentially what strlen does anyway.
The first step towards understanding what compilers do, can do, and cannot do is to write your intentions into the code and see what happens:
const int len = strlen(str);
for (int i = 0; i < len; ++i)
{
    // do some operations which DO NOT CHANGE the length of str
}
Of course, it is your responsibility not to change the length of str inside the loop...you can lowercase, uppercase, swap or replace characters...something you may 'assert()' if you really care (in a debug version). In this case, you communicated your intentions to the compiler and if you are lucky and you are using a good compiler you are likely to get what you are after.
I really doubt that this optimisation would make any difference in your code: if you were doing heavy string manipulation, you would (1) know whether you are working with long strings or short ones, and (2) be using a library which (a) keeps explicit track of the length of strings and (b) makes (especially repeated) string concatenation cheaper than it is in C.

Fastest "for" loop [closed]

Let's suppose we have this struct
struct structure
{
    type element;
    int size;
};
and we're in main and we want to iterate over something.
Is it faster
for ( int i = 0; i < structure.size; ++i )
or
int size = structure.size;
for ( int i = 0; i < size; ++i )
?
Which weighs more: the repeated access to structure.size in the first method, or the extra memory and the time spent creating the variable in the first line of method 2?
I can't see any other difference between the two of them, so if you do, please share!
EDIT: I edited the question so that it is now concise, simple and easily answerable.
Please reconsider the vote you would give to it. Thank you.
There might be a good reason to choose one over the other. If the contents of your loop in the first example change the value of structure.size, i will be continuously checked against the current value. However, in your second choice, size will not change as structure.size does. Which one you want depends on the problem. I would perhaps change size to be called initialSize instead, however.
If that is not the case, you should stop thinking about such minor "optimizations" and instead think about what is most readable. I'd prefer the first choice because it doesn't introduce an unnecessary variable name. When you have two bits of code that do the same thing, trust the compiler to work out the optimal way of doing it. It's your job to tell the compiler what you want your program to do. It's the compiler's job to do it in the best way it can.
If and only if you determine through measurement that this is a necessary optimization (I can't imagine that it ever will be) should you then choose the one that measures fastest.
Very unlikely that there will be any actual difference in the compiled code from this, unless it's REALLY an ancient compiler with really rubbish optimisation. Anything like gcc, clang, MSVC or Intel's C++ compilers would produce exactly the same code for these scenarios.
Of course, if you start calling a function inside the condition of the loop, and the data processed by the function is modified by the loop, e.g.
std::string str;
cin >> str;
for (int i = 0; i < str.size(); i++)
{
    if (str[i] > 'a')
        str += 'B';
}
then we have a different story...
You should allow the compiler to do micro-optimizations like this. Write readable code, make it work, and then, if it runs slowly, profile it and optimize where it is really necessary.
However, if inside the loop you call a function that could modify the structure and whose implementation the compiler cannot see, the second variant may help, since it tells the compiler that it does not need to reload structure.size from memory on every iteration. I would recommend using const:
const int size = structure.size;
for ( int i = 0; i < size; ++i ) {
    somefunc( &structure );
}
I do not know how much you know about compilation, but among the various phases of a compiler, there is a phase called code-optimization, which attempts to improve the intermediate code (by performing various optimization techniques like dead-code elimination, loop transformations, etc.), so that faster-running machine code can be produced.
So, actually your compiler takes care of your headache and I doubt that you would notice any performance issues.
In your first method, if structure is a reference or a member variable, its size cannot safely be kept in a register, as there is no way to tell whether it is changed outside this block, so it must be re-read on each iteration.
In your second method, as size is a variable local to the current code block, it can be kept in a register without being re-read.
Thus, the second method should be faster, despite creating a new variable.
See Load-Hit-Store for a more complete explanation.