Associative Set Cache Underestimating Hit Rate

I'm trying to implement a set-associative cache with least-recently-used (LRU) replacement. So far, my code underestimates the number of cache hits, and I'm not sure why. Below is my function, setAssoc, which takes an int giving the associativity of the cache and a vector of pairs representing a series of data accesses.
The function uses two 2D arrays: one to store the cache blocks and one to store the "age" of each block in the cache.
For this implementation there is no need to track tag bits: the address divided by the block size gives the block number, and the block number modulo the number of sets gives the set number.
Any insight into why I'm not counting the right number of cache hits is appreciated!
int setAssoc(int associativity, vector<pair<unsigned long long, int>>& memAccess){
int blockNum, setNum;
int hitRate = 0;
int numOfSets = 16384 / (associativity * 32);
int cache [numOfSets][associativity];//used to store blocks
int age [numOfSets][associativity];//used to store ages
int maxAge = 0;
int hit;//use this to signal a hit in the cache
//set up cache here
for(int i = 0; i < numOfSets; i++){
for(int j = 0; j < associativity; j++){
cache[i][j] = -1;//initialize all blocks to -1
age[i][j] = 0;//initialize all ages to 0
}//end for int j
}//end for int i
for(int i = 0; i < memAccess.size(); i++){
blockNum = int ((memAccess[i].first) / 32);
setNum = blockNum % numOfSets;
hit = 0;
for(int j = 0; j < associativity; j++){
age[setNum][j]++;//age each entry in the cache
if(cache[setNum][j] == blockNum){
hitRate++;//increment hitRate if block is in cache
age[setNum][j] = 0;//reset age of block since it was just accessed
hit = 1;
}//end if
}//end for int j
if(!hit){
for(int j = 0; j < associativity; j++){
//loop to find the least recently used block
if(age[setNum][j] > maxAge){
maxAge = j;
}//end if
}//end for int j
cache[setNum][maxAge] = blockNum;
age[setNum][maxAge] = 0;
}
}//end for int i
return hitRate;
}//end setAssoc function

I'm not sure this is the only problem in the code, but you seem to be confusing the age with the way number. By assigning maxAge = j, you store a way index in the variable that is supposed to hold an age, which breaks the comparison that searches for the LRU way. You then use that same variable as a way index.
I would suggest splitting this into two variables:
if(!hit){
    int maxAge = -1;   // reset on every miss so the search starts fresh
    int maxAgeWay = 0; // way holding the oldest block seen so far
    for(int j = 0; j < associativity; j++){
        //loop to find the least recently used block
        if(age[setNum][j] > maxAge){
            maxAgeWay = j;
            maxAge = age[setNum][j];
        }//end if
    }//end for int j
    cache[setNum][maxAgeWay] = blockNum;
    age[setNum][maxAgeWay] = 0;
}
(Both variables must be initialized before each search, as above; the function-level maxAge in the question is never reset between accesses. Bounds checking is still up to you.)
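To make the fix concrete, here is a rough, self-contained sketch of the whole function with the age and the way index kept apart. The name setAssocLru and the use of std::vector instead of variable-length arrays are my own choices, not part of the original post; the 16 KB cache size and 32-byte block size are taken from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical rewrite of setAssoc (name and container choice are assumptions).
// Cache size (16384 bytes) and block size (32 bytes) come from the question.
int setAssocLru(int associativity,
                const std::vector<std::pair<unsigned long long, int>>& memAccess) {
    assert(associativity > 0 && associativity * 32 <= 16384);
    const unsigned long long blockSize = 32;
    const std::size_t ways = static_cast<std::size_t>(associativity);
    const std::size_t numOfSets = 16384 / (ways * 32);

    // cache[set][way] holds a block number, or -1 while the way is empty.
    std::vector<std::vector<long long>> cache(numOfSets, std::vector<long long>(ways, -1));
    // age[set][way] counts accesses to the set since the way was last used.
    std::vector<std::vector<long long>> age(numOfSets, std::vector<long long>(ways, 0));

    int hits = 0;
    for (const auto& access : memAccess) {
        const long long blockNum = static_cast<long long>(access.first / blockSize);
        const std::size_t setNum = static_cast<std::size_t>(blockNum) % numOfSets;

        bool hit = false;
        for (std::size_t way = 0; way < ways; ++way) {
            ++age[setNum][way];                 // every way in the set gets older
            if (cache[setNum][way] == blockNum) {
                age[setNum][way] = 0;           // just used, so youngest
                hit = true;
            }
        }
        if (hit) {
            ++hits;
            continue;
        }

        // Miss: evict the way with the largest age. Empty ways are never
        // reset, so they are always older than any occupied way.
        std::size_t victim = 0;
        for (std::size_t way = 1; way < ways; ++way) {
            if (age[setNum][way] > age[setNum][victim]) victim = way;
        }
        cache[setNum][victim] = blockNum;
        age[setNum][victim] = 0;
    }
    return hits;
}
```

The victim search compares ages and records the index separately, which is the distinction the answer above points out. The second member of each pair is ignored here because the question does not say what it means.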

Related

How to Compare multiple variables at the same time in the C++?

I'm writing a Sudoku validator that checks whether a solved Sudoku is correct. In that program I need to compare multiple variables at once to check whether any of them are equal.
Below is a snippet of what I have tried, which is meant to check that every su[][] entry in a row has a different value. I'm not getting the expected result.
I want to make sure that all the values in su[][] are unequal.
How can I achieve that, and what are the mistakes in my snippet?
Thanks.
for(int i=0 ; i<9 ;++i){ //for checking a entire row
if(!(su[i][0]!=su[i][1]!=su[i][2]!=su[i][3]!=su[i][4]!=su[i][5]!=su[i][6]!=su[i][7]!=su[i][8])){
system("cls");
cout<<"SUDOKU'S SOLUTION IS INCORRECT!!";
exit(0);
}
}

To check each row for uniqueness like that, you have to compare each element with every other element in the row.
e.g.:
for (int i = 0; i < 9; ++i) {
for (int j = 0; j < 9; ++j) {
for (int k = j + 1; k < 9; ++k) {
if (su[i][j] == su[i][k]) {
system("cls");
cout << "SUDOKU'S SOLUTION IS INCORRECT!!\n";
exit(0);
}
}
}
}
Since there are only 9 elements per row, this cubic solution shouldn't give you much overhead.
If you had a larger number N of elements, you could initialize an array of size N with 0 and traverse the row. For each element in the row, add 1 at that element's position in the array. Then traverse the array: if there is a position whose value is not 1, the row contains a duplicated value.
e.g.:
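for (int i = 0; i < N; ++i) {
    int count[N] = {0}; // N must be a compile-time constant; count[v - 1] = occurrences of v in row i
    for (int j = 0; j < N; ++j)
        ++count[su[i][j] - 1]; // assumes every value is in 1..N
    for (int v = 0; v < N; ++v) { // separate name so the row index i is not shadowed
        if (count[v] != 1) {
            system("cls");
            cout << "SUDOKU'S SOLUTION IS INCORRECT!!\n";
            exit(0);
        }
    }
}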
This approach is much faster than the first one for large values of N.
The code above checks uniqueness within each row; you would still have to check each column (and, for a full Sudoku check, each 3x3 box).
PS: I have not tested this code, so it may have a bug, but I hope you get the idea.
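One point worth spelling out is why the condition in the question fails: in C++, a != b != c parses as (a != b) != c, so the bool result of the first comparison is compared with the next value, and the values are never compared pairwise. As a rough sketch of the counting approach applied to both rows and columns (the names isPermutation, hasUniqueRows and hasUniqueColumns are made up for illustration, and the grid is assumed to be square with values in 1..N):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<int>>;

// True when `cells` contains each value 1..N exactly once, N = cells.size().
bool isPermutation(const std::vector<int>& cells) {
    const int n = static_cast<int>(cells.size());
    std::vector<bool> seen(cells.size(), false);
    for (int v : cells) {
        if (v < 1 || v > n) return false;                // value out of range
        const std::size_t slot = static_cast<std::size_t>(v - 1);
        if (seen[slot]) return false;                    // duplicate value
        seen[slot] = true;
    }
    return true;
}

bool hasUniqueRows(const Grid& su) {
    for (const auto& row : su) {
        assert(row.size() == su.size());                 // sketch assumes a square grid
        if (!isPermutation(row)) return false;
    }
    return true;
}

bool hasUniqueColumns(const Grid& su) {
    for (std::size_t c = 0; c < su.size(); ++c) {
        std::vector<int> column;
        for (const auto& row : su) {
            assert(row.size() == su.size());
            column.push_back(row[c]);
        }
        if (!isPermutation(column)) return false;
    }
    return true;
}
```

Returning a bool instead of calling exit(0) inside the loop keeps the check reusable; the caller can decide what to print.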

C++ Part of brute-force knapsack

Hello, reader.
I think I've tied my brain in a knot.
I'm implementing knapsack, and I realized I have only implemented a brute-force algorithm once or twice, so I decided to write another one.
Here is where I got stuck.
Let W be the maximum weight and w(min) the weight of the lightest element; then we can put at most k = W/w(min) items in the knapsack. I explain this so that you, reader, can see why I need to ask my question.
Now, imagine we have 3 types of items and a knapsack that holds 15 units of mass, and let each type's weight equal its number. Then we can put in 15 items of the 1st type, or 7 items of the 2nd type and 1 item of the 1st type. But combinations like 22222221[7ed] and 12222222[7ed] mean the same thing to us, and counting both wastes resources. (It's a joke, because brute force is a waste anyway when a cheaper algorithm exists, but I'm very interested.)
As far as I can tell, the kind of selection we need to enumerate is called "combinations with repetitions". Their number is C'(n,k) = (n+k-1)!/((n-1)!k!).
(While typing this I spotted a hole in my theory: we will probably need to add an empty, zero-weight, zero-price item to fill the free space, which just increases n by 1.)
So, what's the problem?
https://rosettacode.org/wiki/Combinations_with_repetitions
The problem is well described at the link above, but I don't want to use the stack that way. I want to generate the combinations in a single loop that runs from i=0 to i<C'(n,k).
So, if that is possible, how does it work?
We have:
int prices[n];  // given as input
int weights[n]; // given as input; the (0,0) filler item goes into both arrays
int W, k;       // W is given as input
k = W / min(weights); // min over the non-zero weights (the filler item weighs 0)
int road[k], finalroad[k]; // all 0
int curP = 0, curW = 0, maxP = 0, maxW = 0;
for (int i = 0; i < rCombNumber(n, k); i++) {
    /* Please help me generate this mask: it has k entries, each an index from 0 to n (the item chosen for that slot). */
    curW = 0;
    for (int j = 0; j < k; j++)
        curW += weights[road[j]];
    if (curW <= W) { // a knapsack filled exactly to capacity is still valid
        curP = 0;
        for (int l = 0; l < k; l++)
            curP += prices[road[l]];
        if (curP > maxP) {
            maxP = curP;
            maxW = curW;
            std::copy(road, road + k, finalroad); // built-in arrays cannot be assigned with =
        }
    }
}
The mask, road, is an array of indices, each ranging from 0 to n. It has to be generated as C'(n,k) (see the link above) from { 0, 1, 2, ... , n }, taking k elements per selection (a combination with repetitions, where order is unimportant).
That's it. Prove me wrong or help me. Many thanks in advance.
And yes, of course the algorithm will take a huge amount of time, but it looks like it should work, and I'm very interested in it.
UPDATE:
What am I missing?
http://pastexen.com/code.php?file=EMcn3F9ceC.txt

The answer was provided by Minoru here: https://gist.github.com/Minoru/745a7c19c7fa77702332cf4bd3f80f9e
It is enough to increment only the last element and then propagate the carries leftward, remembering the leftmost position that overflowed. Every position from there to the end is then reset to the maximum of the elements to its left, which keeps the sequence non-decreasing so that no combination is generated twice.
Here's my code:
#include <iostream>
using namespace std;
static long FactNaive(int n)
{
long r = 1;
for (int i = 2; i <= n; ++i)
r *= i;
return r;
}
static long long CrNK (long n, long k)
{
long long u, l;
u = FactNaive(n+k-1);
l = FactNaive(k)*FactNaive(n-1);
return u/l;
}
int main()
{
const int numberOFchoices=7,kountOfElementsInCombination=4; // const, so the array below has a compile-time size
int arrayOfSingleCombination[kountOfElementsInCombination] = {0,0,0,0};
int leftmostResetPos = kountOfElementsInCombination;
int resetValue=1;
for (long long iterationCounter = 0; iterationCounter<CrNK(numberOFchoices,kountOfElementsInCombination); iterationCounter++)
{
leftmostResetPos = kountOfElementsInCombination;
if (iterationCounter!=0)
{
arrayOfSingleCombination[kountOfElementsInCombination-1]++;
for (int anotherIterationCounter=kountOfElementsInCombination-1; anotherIterationCounter>0; anotherIterationCounter--)
{
if(arrayOfSingleCombination[anotherIterationCounter]==numberOFchoices)
{
leftmostResetPos = anotherIterationCounter;
arrayOfSingleCombination[anotherIterationCounter-1]++;
}
}
}
if (leftmostResetPos != kountOfElementsInCombination)
{
resetValue = 1;
for (int j = 0; j < leftmostResetPos; j++)
{
if (arrayOfSingleCombination[j] > resetValue)
{
resetValue = arrayOfSingleCombination[j];
}
}
for (int j = leftmostResetPos; j != kountOfElementsInCombination; j++)
{
arrayOfSingleCombination[j] = resetValue;
}
}
for (int j = 0; j<kountOfElementsInCombination; j++)
{
cout<<arrayOfSingleCombination[j]<<" ";
}
cout<<"\n";
}
return 0;
}
thanks a lot, Minoru
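For comparison, here is a shorter sketch of the same idea written as a "next combination" step. It is a simplified variant, not Minoru's exact code: instead of counting carries, it finds the rightmost element that can still grow, increments it, and copies the new value to everything on its right. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Advances `combo` (non-decreasing, every value in [0, n)) to the next
// combination with repetition in lexicographic order.
// Returns false when `combo` was already the last one.
bool nextCombinationWithRepetition(std::vector<int>& combo, int n) {
    assert(n > 0);
    // Find the rightmost position that can still be incremented.
    std::size_t pos = combo.size();
    while (pos > 0 && combo[pos - 1] == n - 1) --pos;
    if (pos == 0) return false;
    // Increment it and copy the new value to everything on its right,
    // which keeps the sequence non-decreasing.
    const int value = combo[pos - 1] + 1;
    for (std::size_t j = pos - 1; j < combo.size(); ++j) combo[j] = value;
    return true;
}

// Counts the combinations by enumerating them; the result should match
// (n+k-1)!/((n-1)!k!) from the question.
long long countCombinationsWithRepetition(int n, int k) {
    assert(k >= 0);
    std::vector<int> combo(static_cast<std::size_t>(k), 0);
    long long count = 1;  // the all-zero combination
    while (nextCombinationWithRepetition(combo, n)) ++count;
    return count;
}
```

Looping with while (nextCombinationWithRepetition(...)) also removes the need to compute the factorial-based count up front, which overflows quickly for larger n and k.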

Spatial Locality for 3D array's?

Why is A[k][i][j] better for spatial locality in a 3D array (where i, j, k are row, col, depth)? This is from a CMU lecture, at 55 min.

I think that OP's question,
"why is A[k][i][j] better for spatial locality in a 3D array? ( where i,j,k are row, col, depth)",
comes from a misunderstanding of the exercise given as an example of spatial locality, where the reader is asked to
"permute the loops so that the function ... has good spatial locality"
and this code is given:
int sum_array_3d(int a[M][N][N])
{
int i, j, k, sum = 0;
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
for (k = 0; k < N; k++)
sum += a[k][i][j];
return sum;
}
My interpretation of this task is that the students are asked to either rewrite the inner statement as sum += a[i][j][k]; or change the order of the loops:
int sum_array_3d(int a[M][N][N])
{
int i, j, k, sum = 0;
for (k = 0; k < M; k++) // <-- those are reordered
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
sum += a[k][i][j]; // <-- this is maintained, verbatim
return sum;
}
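The reason the reordered loops help is that C and C++ store multidimensional arrays in row-major order, so the last index moves through adjacent memory. The sketch below contrasts the two loop orders on a small array; the sizes M and N and the function names are arbitrary assumptions made for illustration.

```cpp
constexpr int M = 4;  // size of the first dimension (arbitrary for this sketch)
constexpr int N = 3;  // size of the second and third dimensions

// Innermost loop walks the last index, so consecutive iterations touch
// adjacent ints in memory (stride 1).
int sumGoodLocality(const int (&a)[M][N][N]) {
    int sum = 0;
    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}

// Same result, but the innermost loop walks the first index, so consecutive
// iterations are N*N ints apart.
int sumPoorLocality(const int (&a)[M][N][N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}
```

Both functions return the same sum; only the order in which memory is visited differs, and on arrays much larger than the cache that order is what decides the miss rate.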

Actually, that example is completely wrong. While rank 0 goes from 0..M-1, that loop is iterating 0..N-1. Unless M==N, you'll be reading the wrong element.
The goal is to have your loop iteratively access physically-adjacent locations in memory by manipulating the order of the loops.
Whenever your program reads a value, the CPU requests it from the cache controller. If it's not in cache, that value - and those near it - are retrieved from memory and stored in the cache.
If you then read the next element, it should (usually) already be in the cache, so there's no slow round-trip out to the next cache or host RAM.
If your loop is walking all over the place rather than taking advantage of spatial locality, then you run the risk of suffering far more cache misses, which makes things slow.
In short: getting stuff from the cache is fast, getting it from RAM is slow, and ordering your loops so that they touch adjacent locations helps keep the cache happy.
In graphics, we typically do this:
int a[M*N*N];
for(int offset=0; offset < M*N*N; ++offset)
{
    //int y = offset / cols;
    //int x = offset % cols;
    sum += a[offset];
}
If you need an element by its X,Y, just use
offset = Y * cols + X;
int val = a[offset];
or, for 3D,
offset = Z*N*N + Y*N + X;
or
offset = Z * rows * cols + Y * cols + X;
... and skip all the multidimensional array silliness.
Personally, I'd just do this:
int *p = &a[0][0][0]; // a itself decays to int (*)[N][N], so int* p = a would not compile
//... array gets populated somehow
for(int i=0;i<M*N*N;++i)
{
    sum += p[i];
}
... but that assumes the array is a true contiguous multidimensional array, not an array of pointers or an array of arrays of pointers.
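A minimal sketch of the flat-storage approach, assuming a hypothetical Volume type that wraps one contiguous std::vector and applies the offset formula above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a depth x rows x cols volume in one contiguous buffer.
struct Volume {
    std::size_t depth, rows, cols;
    std::vector<int> data;

    Volume(std::size_t d, std::size_t r, std::size_t c)
        : depth(d), rows(r), cols(c), data(d * r * c, 0) {}

    // offset = Z * rows * cols + Y * cols + X
    std::size_t offset(std::size_t z, std::size_t y, std::size_t x) const {
        assert(z < depth && y < rows && x < cols);
        return z * rows * cols + y * cols + x;
    }

    int& at(std::size_t z, std::size_t y, std::size_t x) {
        return data[offset(z, y, x)];
    }
};

// Summing needs no index arithmetic at all: one linear pass over memory.
long long sumAll(const Volume& v) {
    long long sum = 0;
    for (int value : v.data) sum += value;
    return sum;
}
```

Random access goes through at(z, y, x), while whole-array operations iterate over data directly, so the traversal order can never be accidentally cache-unfriendly.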

Almost same code running much slower

I am trying to solve this problem:
Given a string array words, find the maximum value of length(word[i]) * length(word[j]) where the two words do not share common letters. You may assume that each word will contain only lower case letters. If no such two words exist, return 0.
https://leetcode.com/problems/maximum-product-of-word-lengths/
You can create a bitmap of the chars in each word to check whether two words share any chars, and then calculate the max product.
I have two almost identical methods, but the first passes the checks while the second is too slow. Can you see why?
class Solution {
public:
int maxProduct2(vector<string>& words) {
int len = words.size();
int *num = new int[len];
// compute the bit O(n)
for (int i = 0; i < len; i ++) {
int k = 0;
for (int j = 0; j < words[i].length(); j ++) {
k = k | (1 << (words[i].at(j) - 'a')); // bit 0 = 'a' ... bit 25 = 'z'; shifting by the raw char code (97+) overflows an int
}
num[i] = k;
}
int c = 0;
// O(n^2)
for (int i = 0; i < len - 1; i ++) {
for (int j = i + 1; j < len; j ++) {
if ((num[i] & num[j]) == 0) { // if no common letters
int x = words[i].length() * words[j].length();
if (x > c) {
c = x;
}
}
}
}
delete []num;
return c;
}
int maxProduct(vector<string>& words) {
vector<int> bitmap(words.size());
for(int i=0;i<words.size();++i) {
int k = 0;
for(int j=0;j<words[i].length();++j) {
k |= 1 << (words[i][j] - 'a'); // bit 0 = 'a' ... bit 25 = 'z'; shifting by the raw char code (97+) overflows an int
}
bitmap[i] = k;
}
int maxProd = 0;
for(int i=0;i<words.size()-1;++i) {
for(int j=i+1;j<words.size();++j) {
if ( !(bitmap[i] & bitmap[j])) {
int x = words[i].length() * words[j].length();
if ( x > maxProd )
maxProd = x;
}
}
}
return maxProd;
}
};
Why is the second function (maxProduct) too slow for LeetCode?
Solution
The second method calls words.size() repeatedly. If you save the result in a variable, it works fine.

Since my comment turned out to be correct I'll turn my comment into an answer and try to explain what I think is happening.
I wrote some simple code to benchmark on my own machine with two solutions of two loops each. The only difference is the call to words.size() is inside the loop versus outside the loop. The first solution is approximately 13.87 seconds versus 16.65 seconds for the second solution. This isn't huge, but it's about 20% slower.
Even though vector.size() is a constant time operation that doesn't mean it's as fast as just checking against a variable that's already in a register. Constant time can still have large variances. When inside nested loops that adds up.
The other thing that could be happening (someone much smarter than me will probably chime in and let us know) is that you're hurting CPU optimizations such as branch prediction and pipelining. Every time it gets to the end of the loop it has to stop, wait for the call to size() to return, and then check the loop variable against that return value. If the CPU can look ahead and guess that j is still going to be less than len because it hasn't seen len change (len isn't even inside the loop!), it can make a good branch prediction each time and not have to wait.
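As a rough illustration of the fix, the sketch below hoists the size into a local variable once. The function name maxProductHoisted is hypothetical, and the input is assumed to contain only lowercase letters, as the problem statement says. Two further details differ from the question's code: the shift uses c - 'a' so the bit index stays within 0..25, and the outer loop condition is written as i + 1 < n so that an empty input does not underflow the unsigned size.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant of maxProduct with the vector size read only once.
int maxProductHoisted(const std::vector<std::string>& words) {
    const std::size_t n = words.size();  // hoisted out of every loop condition

    // mask[i] has bit (c - 'a') set for every letter c in words[i].
    std::vector<unsigned> mask(n, 0u);
    for (std::size_t i = 0; i < n; ++i) {
        for (char c : words[i]) {
            assert(c >= 'a' && c <= 'z');  // assumption from the problem statement
            mask[i] |= 1u << (c - 'a');
        }
    }

    int best = 0;
    for (std::size_t i = 0; i + 1 < n; ++i) {      // safe even when n == 0
        for (std::size_t j = i + 1; j < n; ++j) {
            if ((mask[i] & mask[j]) == 0) {        // no common letters
                const int product =
                    static_cast<int>(words[i].size() * words[j].size());
                if (product > best) best = product;
            }
        }
    }
    return best;
}
```

Whether hoisting the size matters in practice depends on the compiler and optimization level; with optimizations enabled, many compilers can do it themselves when they can prove the vector is not modified inside the loop.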

Add OpenMP to program to calculate the determinant of an n x n matrix

Here is code to find the determinant of an n x n matrix.
#include <iostream>
using namespace std;
int determinant(int *matrix[], int size);
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column);
int main()
{
int size;
cout << "What is the size of the matrix for which you want to find the determinant?:\t";
cin >> size;
int **matrix;
matrix = new int*[size];
for (int i = 0 ; i < size ; i++)
matrix[i] = new int[size];
cout << "\nEnter the values of the matrix seperated by spaces:\n\n";
for(int i = 0; i < size; i++)
for(int j = 0; j < size; j++)
cin >> matrix[i][j];
cout << "\nThe determinant of the matrix is:\t" << determinant(matrix, size) << endl;
return 0;
}
int determinant(int *matrix[], int size){
if(size==1)return matrix[0][0];
else{
int result=0, sign=-1;
for(int j = 0; j < size; j++){
int **minorMatrix;
minorMatrix = new int*[size-1];
for (int k = 0 ; k < size-1 ; k++)
minorMatrix[k] = new int[size-1];
ijMinor(matrix, minorMatrix, size, 0, j);
sign*=-1;
result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);
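for(int i = 0; i < size-1; i++){
delete[] minorMatrix[i]; // allocated with new[], so release with delete[]
}
delete[] minorMatrix; // also release the array of row pointers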
}
return result;
}
}
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column){
for(int i = 0; i < size; i++){
for(int j = 0; j < size; j++){
if(i < row){
if(j < column)minorMatrix[i][j] = matrix[i][j];
else if(j == column)continue;
else minorMatrix[i][j-1] = matrix[i][j];
}
else if(i == row)continue;
else{
if(j < column)minorMatrix[i-1][j] = matrix[i][j];
else if(j == column)continue;
else minorMatrix[i-1][j-1] = matrix[i][j];
}
}
}
}
After adding OpenMP pragmas, I've changed the determinant function and now it looks like this:
int determinant(int *matrix[], int size){
if(size==1)return matrix[0][0];
else{
int result=0, sign=-1;
#pragma omp parallel for default(none) shared(size,matrix,sign) private(j,k) reduction(+ : result)
for(int j = 0; j < size; j++){
int **minorMatrix;
minorMatrix = new int*[size-1];
for (int k = 0 ; k < size-1 ; k++)
minorMatrix[k] = new int[size-1];
ijMinor(matrix, minorMatrix, size, 0, j);
sign*=-1;
result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);
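for(int i = 0; i < size-1; i++){
delete[] minorMatrix[i]; // allocated with new[], so release with delete[]
}
delete[] minorMatrix; // also release the array of row pointers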
}
return result;
delete [] matrix;
}
}
My problem is that the result is different every time. Sometimes it gives the correct value, but most often it is wrong. I think it's because of the sign variable. I am following the formula for cofactor expansion along the first row, which is what the code computes:
det(A) = sum over j = 1..n of (-1)^(1+j) * a_1j * det(M_1j), where M_1j is A with row 1 and column j removed.
As you can see, every iteration of my for loop should use a different sign, but when I use OpenMP something goes wrong. How can I make this program run correctly with OpenMP?
Finally, my second issue is that using OpenMP does not make the program run any faster than without it. I also tried a 100,000 x 100,000 matrix, but my program reports a memory allocation error. How can I run this program with very large matrices?

Your issues, as I see them, are as follows:
1) As noted by Hristo, your threads are stomping over each other's data with respect to the sign variable. It should be private to each thread so that they have full read/write access to it without having to worry about race conditions. Then, you simply need an algorithm to compute whether sign is plus or minus 1 depending on the iteration j independently from the other iterations. With a little thinking, you'll see that Hristo's suggestion is correct: sign = (j % 2) ? -1 : 1; should do the trick.
2) Your determinant() function is recursive. As written, every iteration of the loop forms a minor and then calls the function again on that minor. A single thread therefore performs its iteration, enters the recursive call, and then tries to split itself into nthreads more threads. You can see how this oversubscribes your system by launching many more threads than you physically have cores. Three easy solutions:
a) Call your original serial function from within the omp parallel code. This is the fastest way to do it because it avoids any OpenMP startup overhead in the recursive calls.
b) Turn off nested parallelism by calling omp_set_nested(0); before your first call to determinant().
c) Add an if clause to your parallel for directive so that a new team is created only at the top level: if(!omp_in_parallel())
3) Your memory issues are because every iteration of your recursion, you are allocating more memory. If you fix problem #2, then you should be using comparable amounts of memory in the serial case as the parallel case. That being said, it would be much better to allocate all the memory you want before entering your algorithm. Allocating large chunks of memory (and then freeing it!), especially in parallel, is a terrible bottleneck in your code.
Compute the amount of memory you would need (on paper) before entering the first loop and allocate it all at once. I would also strongly suggest you consider allocating your memory contiguously (aka in 1D) to take better advantage of caching as well. Remember that each thread should have its own separate area to work with. Then, change your function to:
int determinant(int *matrix, int *startOfMyWorkspace, int size).
Instead of allocating a new (size-1)x(size-1) matrix inside of your loop, you would simply utilize the next (size-1)*(size-1) integers of your workspace, update what startOfMyWorkspace would be for the next recursive call, and continue along.
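Points 1 and 2 can be sketched together as follows. This is only an illustration under my own assumptions: the Matrix alias, the helper names, and the use of std::vector instead of raw new/delete are not from the original code, and the preallocated workspace from point 3 is left out to keep the sketch short. The sign is computed from j inside the loop, so nothing is shared between iterations, and only the top-level loop is parallel while the recursion stays serial.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Copy of `m` without row `row` and column `col`.
Matrix minorOf(const Matrix& m, std::size_t row, std::size_t col) {
    Matrix result;
    for (std::size_t i = 0; i < m.size(); ++i) {
        if (i == row) continue;
        std::vector<long long> line;
        for (std::size_t j = 0; j < m.size(); ++j)
            if (j != col) line.push_back(m[i][j]);
        result.push_back(line);
    }
    return result;
}

// Plain recursive cofactor expansion along row 0; no OpenMP inside.
long long determinantSerial(const Matrix& m) {
    assert(!m.empty());
    if (m.size() == 1) return m[0][0];
    long long result = 0;
    for (std::size_t j = 0; j < m.size(); ++j) {
        const long long sign = (j % 2) ? -1 : 1;
        result += sign * m[0][j] * determinantSerial(minorOf(m, 0, j));
    }
    return result;
}

// Only the top-level loop is parallel; each iteration calls the serial version.
long long determinantParallel(const Matrix& m) {
    assert(!m.empty());
    if (m.size() == 1) return m[0][0];
    const int n = static_cast<int>(m.size());
    long long result = 0;
    #pragma omp parallel for reduction(+ : result)
    for (int j = 0; j < n; ++j) {
        const long long sign = (j % 2) ? -1 : 1;  // depends only on j, so no race
        const std::size_t col = static_cast<std::size_t>(j);
        result += sign * m[0][col] * determinantSerial(minorOf(m, 0, col));
    }
    return result;
}
```

The code also compiles without OpenMP enabled, in which case the pragma is ignored and the loop runs serially. Note that cofactor expansion does on the order of n! work, so it is only practical for small matrices no matter how many threads are used; a 100,000 x 100,000 matrix of 4-byte ints would also need about 40 GB for the input alone.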
