Bit Shifting in Cache Simulation - c++

What is the formula for calculating the index and tag bits in
Direct Mapped Cache
Associative Cache
Set Associative Cache
I am currently using this formula for Direct Mapped:
#define BLOCK_SHIFT 5
#define CACHE_SIZE 4096
int index = (address >> BLOCK_SHIFT) & (CACHE_SIZE-1);
/* in the line above we want the "middle bits" that say where the block goes */
long tag = address >> BLOCK_SHIFT; /* the high order bits are the tag */
Please tell me how many bits are shifted in Associative and Set Associative Cache..

So, I think the concrete answer to your question is "zero", but that's simply because you are asking the wrong question.
Right, so a direct-mapped cache of a given size X simply uses the lower part (or some other part) of the address to form the index into the cache. So the index is a value between 0 and (cache-size - 1); in other words, "address modulo size". Since cache sizes are nearly always 2^n, the modulo can be performed with a simple bitwise "and" with (size-1) instead of a divide.
In your code, each cache entry (cache line) holds a "BLOCK" of 32 bytes, so the address should be divided (shifted) down by the block size: 2^5 = 32. This shift stays constant for a constant cache-line size. Since there is no other shift in your example code, I presume you are misunderstanding what you should do.
In a set-associative cache, there are multiple cache lines (ways) that can be used for the same index. So instead of simply taking the lower part of the address as an index, we take a SMALLER part of the lower address. So index = address_of_block & (CACHE_SIZE-1) should become address_of_block & ((CACHE_SIZE-1) / ways). Since we are dealing with a 2^n number again, we can use the old "shift instead of divide" trick: x / y where y is 2^n can be done by x >> n.
So, now you just have to figure out what n is for your number of ways.
And of course, figure out how you determine which of the ways to use when replacing something in the cache, but that is certainly a completely different question.
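As a rough sketch of the point above, the block shift stays the same for every organisation and only the number of sets changes. The names kNumLines, CacheSlot, Log2 and Split are made up for this illustration, kNumLines plays the role of CACHE_SIZE from the question (counted in lines), and the tag here drops the index bits, which is the conventional split rather than what the question's code does:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameters for illustration; both must be powers of two.
constexpr std::uint32_t kBlockShift = 5;   // 2^5 = 32-byte cache lines
constexpr std::uint32_t kNumLines = 4096;  // total lines (CACHE_SIZE in the question)

struct CacheSlot {
    std::uint32_t set_index;  // which set the block maps to
    std::uint64_t tag;        // remaining high-order bits
};

// log2 of a power of two, i.e. the "n" in 2^n.
constexpr std::uint32_t Log2(std::uint32_t x) {
    std::uint32_t n = 0;
    while (x > 1) {
        x >>= 1;
        ++n;
    }
    return n;
}

// ways = 1 is direct mapped, ways = kNumLines is fully associative.
inline CacheSlot Split(std::uint64_t address, std::uint32_t ways) {
    assert(ways != 0 && (ways & (ways - 1)) == 0 && ways <= kNumLines);
    const std::uint64_t block = address >> kBlockShift;  // same shift in all cases
    const std::uint32_t sets = kNumLines >> Log2(ways);  // kNumLines / ways
    const auto index = static_cast<std::uint32_t>(block & (sets - 1));
    return {index, block >> Log2(sets)};
}
```

With ways equal to kNumLines there is a single set, so the index is always 0 and the whole block address becomes the tag.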

Related

Using part of a variable as bool

Let's say memory is precious, and I have a class with a uint32_t member variable ui and I know that the values will stay below 1 million. The class also has some bool members.
Does it make sense to use the highest (highest 2,3,..) bit(s) of ui in order to save memory, since bool is 1 byte?
If it does make sense, what is the most efficient way to get the highest (leftmost?) bit (or 2nd)? I read a few old threads and there seems to be disagreement about using inline ASM or some sort of shift.
It's a bit dangerous to use some of the bits as a bool. Because of the way numbers are stored in binary, it is hard to keep such a scheme correct.
Negative numbers are stored in two's complement form. If the variable is a signed int, you may assign it the value 10, then flip the bool bit in the top position from false to true, and the number may turn into a huge negative number as a result.
To test whether the n-th bit is 0 or 1 you can use this, where the 0-th bit is the rightmost:
int nth_bit(int a, int n){
    return (a >> n) & 1;
}
It returns 0 or 1, the value of the n-th bit.
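For the unsigned case in the question, a minimal sketch of the packing could look like the following. The class name PackedValue and its members are invented for this example, and it assumes the value fits in 20 bits (values below 1,048,576), leaving the top 12 bits of the uint32_t for flags:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper: low 20 bits hold the value, high 12 bits hold flags.
class PackedValue {
public:
    static constexpr unsigned kValueBits = 20;
    static constexpr unsigned kFlagCount = 32 - kValueBits;
    static constexpr std::uint32_t kValueMask = (1u << kValueBits) - 1u;

    std::uint32_t value() const { return bits_ & kValueMask; }

    void set_value(std::uint32_t v) {
        assert(v <= kValueMask);            // must fit in 20 bits
        bits_ = (bits_ & ~kValueMask) | v;  // keep the flags, replace the value
    }

    bool flag(unsigned i) const {
        assert(i < kFlagCount);
        return (bits_ >> (kValueBits + i)) & 1u;
    }

    void set_flag(unsigned i, bool on) {
        assert(i < kFlagCount);
        const std::uint32_t bit = 1u << (kValueBits + i);
        bits_ = on ? (bits_ | bit) : (bits_ & ~bit);
    }

private:
    std::uint32_t bits_ = 0;
};
```

Because everything is unsigned, setting a flag never disturbs the value; plain shifts and masks like these compile to a couple of instructions, so inline assembly should not be needed.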
Well, if the memory is in fact precious, you should look deeper.
1,000,000 uses only 20 bits. This is less than 3 bytes. So you can allocate 3 bytes to keep your value and up to four booleans. Obviously, access will be a bit more complicated, but you save 25% of memory!
If you know that the values are below 524,287, for example, you can save another 15% by packing it (with bool) into 20 bits :)
Also, keeping bool in a separate array (as you said in a comment) would kill performance if you need to access the value and a corresponding bool simultaneously because they are far apart and will likely never be in a cache.

SSE optimisation for a loop that finds zeros in an array and toggles a flag + updates another array

A piece of C++ code determines the occurrences of zero and keeps a binary flag variable for each number that is checked. The value of the flag toggles between 0 and 1 each time a zero is encountered in a 1 dimensional array.
I am attempting to use SSE to speed it up, but I am unsure of how to go about this. Evaluating the individual fields of __m128i is inefficient, I've read.
The code in C++ is:
int flag = 0;
int var_num2[1000];
for(int i = 0; i<1000; i++)
{
if (var[i] == 0)
{
var_num2[i] = flag;
flag = !flag; //toggle value upon encountering a 0
}
}
How should I go about this using SSE intrinsics?
You'd have to recognize the problem, but this is a variation of a well-known problem. I'll first give a theoretical description:
Introduce a temporary array not_var[] which contains 1 if var contains 0 and 0 otherwise.
Introduce a temporary array not_var_sum[] which holds the partial sum of not_var.
var_num2 is now the LSB of not_var_sum[]
The first and third operation are trivially parallelizable. Parallelizing a partial sum is only a bit harder.
In a practical implementation, you wouldn't construct not_var[], and you'd write the LSB directly to var_num2 in all iterations of step 2. This is valid because you can discard the higher bits. Keeping just the LSB is equivalent to taking the result modulo 2, and (a+b)%2 == ((a%2) + (b%2))%2.
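The three steps can be written out as a plain scalar sketch to check them against the original loop. The function name ToggleFlagsByPrefixSum is made up here, and two details are my reading rather than something the description spells out: the partial sum is taken as exclusive (the count of zeros before position i), and the store stays conditional on var[i] being zero, as in the question's code:

```cpp
#include <cstddef>
#include <vector>

// Scalar reference for the prefix-sum formulation; var_num2 must be at
// least as long as var. Elements where var[i] != 0 are left untouched.
void ToggleFlagsByPrefixSum(const std::vector<int>& var, std::vector<int>& var_num2) {
    const std::size_t n = var.size();
    std::vector<int> not_var(n), not_var_sum(n);

    // Step 1: not_var[i] is 1 where var[i] is 0, and 0 otherwise.
    for (std::size_t i = 0; i < n; ++i) {
        not_var[i] = (var[i] == 0) ? 1 : 0;
    }

    // Step 2: exclusive partial sum, i.e. the number of zeros before index i.
    int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        not_var_sum[i] = running;
        running += not_var[i];
    }

    // Step 3: the flag is the LSB of the partial sum, stored only at zeros.
    for (std::size_t i = 0; i < n; ++i) {
        if (not_var[i]) {
            var_num2[i] = not_var_sum[i] & 1;
        }
    }
}
```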
What type are the elements of var[]? int? Or char? Are zeroes frequent?
A SIMD prefix sum (aka partial sum) is possible (with log2(vector_width) work per element, e.g. 2 shuffles and 2 adds for a vector of 4 float), but the conditional store based on the result is the other major problem. (Your array of 1000 elements is probably too small for multi-threading to be profitable.)
An integer prefix sum is easier to do efficiently, and the lower latency of integer ops helps. Since only the low bit matters, the sum is just addition without carry, i.e. XOR, so use _mm_xor_si128 instead of _mm_add_ps. (You'd be using this on the integer all-zero/all-one compare result vector from _mm_cmpeq_epi32, or epi8 or whatever, depending on the element size of var[]. You didn't specify, but different choices of strategy are probably optimal for different sizes.)
But, just having a SIMD prefix sum actually barely helps: you'd still have to loop through and figure out where to store and where to leave unmodified.
I think your best bet is to generate a list of indices where you need to store, and then
for (size_t j = 0 ; j < scatter_count ; j+=2) {
var_num2[ scatter_element[j+0] ] = 0;
var_num2[ scatter_element[j+1] ] = 1;
}
You could generate the whole list of indices up-front, or you could work in small batches to overlap the search work with the store work.
The prefix-sum part of the problem is handled by alternately storing 0 and 1 in an unrolled loop. The real trick is avoiding branch mispredicts, and generating the indices efficiently.
To generate scatter_element[], you've transformed the problem into left-packing (filtering) an (implicit) array of indices based on the corresponding _mm_cmpeq_epi32( var[i..i+3], _mm_setzero_si128() ). To generate the indices you're filtering, start with a vector of [0,1,2,3] and add [4,4,4,4] to it each iteration (_mm_add_epi32). I'm assuming the element size of var[] is 32 bits. If you have smaller elements, this requires unpacking.
BTW, AVX512 has scatter instructions which you could use here, otherwise doing the store part with scalar code is your best bet. (But beware of the effect described in "Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake" when just storing without loading.)
To overlap the left-packing with the storing, I think you want to left-pack until you have maybe 64 indices in a buffer. Then leave that loop and run another loop that left-packs indices and consumes indices, only stopping if your circular buffer is full (then just store) or empty (then just left-pack). This lets you overlap the vector compare / lookup-table work with the scatter-store work, but without too much unpredictable branching.
If zeros are very frequent, and var_num2[] elements are 32 or 64 bits, and you have AVX or AVX2 available, you could consider doing a standard prefix sum and using AVX masked stores, e.g. vpmaskmovd. Don't use SSE maskmovdqu, though: it has an NT hint, so it bypasses and evicts data from cache, and is quite slow.
Also, because your prefix sum is mod 2, i.e. boolean, you could use a lookup table based on the packed-compare result mask. Instead of horizontal ops with shuffles, use the 4-bit movmskps result of a compare + a 5th bit for the initial state as an index to a lookup table of 32 vectors (assuming 32-bit element size for var[]).
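A simplified sketch of the index-list approach, assuming var[] holds 32-bit ints. ZeroIndices and ScatterToggle are names invented for this example, and the left-packing of each 4-bit compare mask is done with a scalar lane loop rather than the shuffle lookup table a tuned version would use, so treat it as a starting point only. The SSE2 path is guarded so the code falls back to scalar where SSE2 is unavailable:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Phase 1: collect the indices i where var[i] == 0.
std::vector<std::uint32_t> ZeroIndices(const std::int32_t* var, std::size_t n) {
    std::vector<std::uint32_t> scatter_element;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(var + i));
        // One mask bit per 32-bit lane; bit k is set when var[i + k] == 0.
        const int mask = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(v, zero)));
        for (int lane = 0; lane < 4; ++lane) {
            if (mask & (1 << lane)) {
                scatter_element.push_back(static_cast<std::uint32_t>(i + lane));
            }
        }
    }
#endif
    for (; i < n; ++i) {  // scalar tail (or the whole array without SSE2)
        if (var[i] == 0) {
            scatter_element.push_back(static_cast<std::uint32_t>(i));
        }
    }
    return scatter_element;
}

// Phase 2: alternately store 0 and 1 at the collected indices.
void ScatterToggle(const std::vector<std::uint32_t>& scatter_element,
                   std::int32_t* var_num2) {
    for (std::size_t j = 0; j < scatter_element.size(); ++j) {
        var_num2[scatter_element[j]] = static_cast<std::int32_t>(j & 1);
    }
}
```

Indexing the store value by j & 1 also covers an odd number of zeros, which a loop unrolled by two would need a separate tail for.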

Using CRCs as a digest to detect duplicates among files

The primary use of CRCs and similar computations (such as Fletcher and Adler) seems to be for the detection of transmission errors. As such, most studies I have seen seem to address the issue of the probability of detecting small-scale differences between two data sets. My needs are slightly different.
What follows is a very approximate description of the problem. Details are much more complicated than this, but the description below illustrates the functionality I am looking for. This little disclaimer is intended to ward off answers such as "Why are you solving your problem this way when you can more easily solve it this other way I propose?" - I need to solve my problem this way for a myriad of reasons that are not germane to this question or post, so please don't post such answers.
I am dealing with collections of data sets (size ~1MB) on a distributed network. Computations are performed on these data sets, and speed/performance is critical. I want a mechanism to allow me to avoid re-transmitting data sets. That is, I need some way to generate a unique identifier (UID) for each data set of a given size. (Then, I transmit data set size and UID from one machine to another, and the receiving machine only needs to request transmission of the data if it does not already have it locally, based on the UID.)
This is similar to the difference between using CRC to check changes to a file, and using a CRC as a digest to detect duplicates among files. I have not seen any discussions of the latter use.
I am not concerned with issues of tampering, i.e. I do not need cryptographic strength hashing.
I am currently using a simple 32-bit CRC of the serialized data, and that has so far served me well. However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?
The other question I have is a bit more subtle. In my current implementation, I ignore the structure of my data set, and effectively just CRC the serialized string representing my data. However, for various reasons, I want to change my CRC methodology as follows. Suppose my top-level data set is a collection of some raw data and a few subordinate data sets. My current scheme essentially concatenates the raw data and all the subordinate data sets and then CRC's the result. However, most of the time I already have the CRC's of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRC's of the subordinate data sets, and then CRC this construction. The question is, how does using this methodology affect the probability of collisions?
To put it in a language that will allow me to discuss my thoughts, I'll define a bit of notation. Call my top-level data set T, and suppose it consists of raw data set R and subordinate data sets Si, i=1..n. I can write this as T = (R, S1, S2, ..., Sn). If & represents concatenation of data sets, my original scheme can be thought of as:
UID_1(T) = CRC(R & S1 & S2 & ... & Sn)
and my new scheme can be thought of as
UID_2(T) = CRC(R & CRC(S1) & CRC(S2) & ... & CRC(Sn))
Then my questions are: (1) if T and T' are very different, what CRC algorithm minimizes prob( UID_1(T)=UID_1(T') ), and what CRC algorithm minimizes prob( UID_2(T)=UID_2(T') ), and how do these two probabilities compare?
My (naive and uninformed) thoughts on the matter are this. Suppose the differences between T and T' are in only one subordinate data set, WLOG say S1!=S1'. If it happens that CRC(S1)=CRC(S1'), then clearly we will have UID_2(T)=UID_2(T'). On the other hand, if CRC(S1)!=CRC(S1'), then the difference between R & CRC(S1) & CRC(S2) & ... & CRC(Sn) and R & CRC(S1') & CRC(S2) & ... & CRC(Sn) is a small difference on 4 bytes only, so the ability of UID_2 to detect differences is effectively the same as a CRC's ability to detect transmission errors, i.e. its ability to detect errors in only a few bits that are not widely separated. Since this is what CRC's are designed to do, I would think that UID_2 is pretty safe, so long as the CRC I am using is good at detecting transmission errors. To put it in terms of our notation,
prob( UID_2(T)=UID_2(T') ) = prob(CRC(S1)=CRC(S1')) + (1-prob(CRC(S1)=CRC(S1'))) * probability of CRC not detecting an error of a few bits.
Let us call the probability of a CRC not detecting an error of a few bits P, and the probability of it not detecting large differences on a large data set Q. The above can be written approximately as
prob( UID_2(T)=UID_2(T') ) ~ Q + (1-Q)*P
Now I will change my UID a bit more as follows. For a "fundamental" piece of data, i.e. a data set T=(R) where R is just a double, integer, char, bool, etc., define UID_3(T)=(R). Then for a data set T consisting of a vector of subordinate data sets T = (S1, S2, ..., Sn), define
UID_3(T) = CRC(UID_3(S1) & UID_3(S2) & ... & UID_3(Sn))
Suppose a particular data set T has subordinate data sets nested m-levels deep, then, in some vague sense, I would think that
prob( UID_3(T)=UID_3(T') ) ~ 1 - (1-Q)(1-P)^m
Given these probabilities are small in any case, this can be approximated as
1 - (1-Q)(1-P)^m = Q + (1-Q)*P*m - (1-Q)*P*P*m*(m-1)/2 + ... ~ Q + m*P
So if I know my maximum nesting level m, and I know P and Q for various CRCs, what I want is to pick the CRC that gives me the minimum value for Q + m*P. If, as I suspect might be the case, P~Q, the above simplifies to this. My probability of error for UID_1 is P. My probability of error for UID_3 is (m+1)P, where m is my maximum nesting (recursion) level.
Does all this seem reasonable?
"I want a mechanism to allow me to avoid re-transmitting data sets."
rsync has already solved this problem, using generally the approach you outline.
"However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?"
You won't see much difference among well-selected CRC polynomials. Speed may be more important to you, in which case you may want to use a hardware CRC, e.g. the crc32 instruction on modern Intel processors. That one uses the CRC-32C (Castagnoli) polynomial. You can make that really fast by using all three arithmetic units on a single core in parallel by computing the CRC on three buffers in the same loop, and then combining them. See below how to combine CRCs.
"However, most of the time I already have the CRC's of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRC's of the subordinate data sets, and then CRC this construction."
Or you could quickly compute the CRC of the entire set as if you had done a CRC on the whole thing, but using the already calculated CRCs of the pieces. Look at crc32_combine() in zlib. That would be better than taking the CRC of a bunch of CRCs. By combining, you retain all the mathematical goodness of the CRC algorithm.
Mark Adler's answer was bang on. If I'd taken my programmer's hat off and put on my mathematician's hat, some of it should have been obvious. He didn't have the time to explain the mathematics, so I will here for those who are interested.
The process of calculating a CRC is essentially the process of doing a polynomial division. The polynomials have coefficients mod 2, i.e. the coefficient of each term is either 0 or 1, hence a polynomial of degree less than N can be represented by an N-bit number, each bit being the coefficient of a term (and the process of doing a polynomial division amounts to doing a whole bunch of XOR and shift operations). When CRC'ing a data block, we view the "data" as one big polynomial, i.e. a long string of bits, each bit representing the coefficient of a term in the polynomial. We'll call our data-block polynomial A. For each CRC "version", a polynomial has been chosen for the CRC, which we'll call P. For 32-bit CRCs, P is a polynomial of degree 32, so it has up to 33 terms and 33 coefficients. Because the top coefficient is always 1, it is implicit and we can represent the 32nd-degree polynomial with a 32-bit integer. (Computationally, this is quite convenient actually.) The process of calculating the CRC for a data block A is the process of finding the remainder when A is divided by P. That is, A can always be written
A = Q * P + R
where R is a polynomial of degree less than degree of P, i.e. R has degree 31 or less, so it can be represented by a 32-bit integer. R is essentially the CRC. (Small note: typically one prepends 0xFFFFFFFF to A, but that is unimportant here.) Now, if we concatenate two data blocks A and B, the "polynomial" corresponding to the concatenation of the two blocks is the polynomial for A, "shifted to the left" by the number of bits in B, plus B. Put another way, the polynomial for A&B is A*S+B, where S is the polynomial corresponding to a 1 followed by N zeros, where N is the number of bits in B. (i.e. S = x**N ). Then, what can we say about the CRC for A&B? Suppose we know A=Q*P+R and B=Q'*P+R', i.e. R is the CRC for A and R' is the CRC for B. Suppose we also know S=q*P+r. Then
A * S + B = (Q*P+R)*(q*P+r) + (Q'*P+R')
= Q*(q*P+r)*P + R*q*P + R*r + Q'*P + R'
= (Q*S + R*q + Q') * P + R*r + R'
So to find the remainder when A*S+B is divided by P, we need only find the remainder when R*r+R' is divided by P. Thus, to calculate the CRC of the concatenation of two data streams A and B, we need only know the separate CRCs of the data streams, i.e. R and R', and the length N of the trailing data stream B (so we can compute r). This is also the content of one of Mark's other comments: if the lengths of the trailing data streams B are constrained to a few values, we can pre-compute r for each of these lengths, making combination of two CRCs quite trivial. (For an arbitrary length N, computing r is not trivial, but at O(log_2 N) steps it is much faster than re-doing the division over the entire B.)
Note: the above is not a precise exposition of CRC. There is some shifting that goes on. To be precise, if L is the polynomial represented by 0xFFFFFFFF, i.e. L=x**31+x**30+...+x+1, and S_n is the "shift left by n bits" polynomial, i.e. S_n = x**n, then the CRC of a data block with polynomial A of N bits is the remainder when ( L * S_N + A ) * S_32 is divided by P, i.e. when (L&A)*S_32 is divided by P, where & is the "concatenation" operator.
Also, I think I disagree with one of Mark's comments, but he can correct me if I'm wrong. If we already know R and R', the time to compute the CRC of A&B using the above methodology, compared with computing it the straightforward way, does not depend on the ratio of len(A) to len(B): to compute it the "straightforward" way, one really does not have to re-compute the CRC on the entire concatenated data set. Using our notation above, one only needs to compute the CRC of R*S+B. That is, instead of prepending 0xFFFFFFFF to B and computing its CRC, we prepend R to B and compute its CRC. So it's a comparison of the time to compute B's CRC over again with the time to compute r (followed by dividing R*r+R' by P, which is trivial and likely inconsequential in time).
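The identity above can be checked numerically with a deliberately simplified CRC: MSB-first, zero initial value, no final XOR, so it is not bit-compatible with zlib's crc32 (zlib's crc32_combine() handles the real variant). The names PlainCrc, MulMod, XPowMod and Combine are invented for this sketch. PlainCrc computes the remainder of data * x^32, and the same algebra carries through with that extra factor:

```cpp
#include <cstdint>
#include <string>

constexpr std::uint32_t kPoly = 0x04C11DB7u;  // P without its implicit x^32 term

// Remainder of (data * x^32) mod P, optionally continuing from an earlier
// remainder `crc` (this is the "prepend R to B" computation).
std::uint32_t PlainCrc(const std::string& data, std::uint32_t crc = 0) {
    for (unsigned char byte : data) {
        crc ^= static_cast<std::uint32_t>(byte) << 24;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x80000000u) ? (crc << 1) ^ kPoly : (crc << 1);
        }
    }
    return crc;
}

// (a * b) mod P over GF(2); bit i of each operand is the coefficient of x^i.
std::uint32_t MulMod(std::uint32_t a, std::uint32_t b) {
    std::uint32_t product = 0;
    while (b != 0) {
        if (b & 1u) product ^= a;
        b >>= 1;
        a = (a & 0x80000000u) ? (a << 1) ^ kPoly : (a << 1);  // a = a * x mod P
    }
    return product;
}

// r = x^n mod P by square-and-multiply: O(log n) calls to MulMod.
std::uint32_t XPowMod(std::uint64_t n) {
    std::uint32_t result = 1u;  // the polynomial 1
    std::uint32_t base = 2u;    // the polynomial x
    for (; n != 0; n >>= 1) {
        if (n & 1u) result = MulMod(result, base);
        base = MulMod(base, base);
    }
    return result;
}

// CRC of A&B from R = CRC(A), R' = CRC(B) and the bit length of B: R*r + R'.
std::uint32_t Combine(std::uint32_t crc_a, std::uint32_t crc_b, std::uint64_t bits_in_b) {
    return MulMod(crc_a, XPowMod(bits_in_b)) ^ crc_b;
}
```

For example, Combine(PlainCrc(a), PlainCrc(b), 8 * b.size()) should equal both PlainCrc(a + b) and PlainCrc(b, PlainCrc(a)) for any strings a and b.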
Mark Adler's answer addresses the technical question so that's not what I'll do here. Here I'm going to point out a major potential flaw in the synchronization algorithm proposed in the OP's question and suggest a small improvement.
Checksums and hashes provide a single signature value for some data. However, being of finite length, the number of possible unique values of a checksum/hash is always smaller than the possible combinations of the raw data if the data is longer. For instance, a 4 byte CRC can only ever take on 4 294 967 296 unique values whilst even a 5 byte value which might be the data can take on 256 times as many values. This means for any data longer than the checksum itself, there always exists one or more byte combinations with exactly the same signature.
When used to check integrity, the assumption is that the likelihood of a slightly different stream of data resulting in the same signature is small so that we can assume the data is the same if the signature is the same. It is important to note that we start with some data d and verify that given a checksum, c, calculated using a checksum function, f that f(d) == c.
In the OP's algorithm, however, the different use introduces a subtle, detrimental degradation of confidence. In the OP's algorithm, server A would start with the raw data [d1A,d2A,d3A,d4A] and generate a set of checksums [c1,c2,c3,c4] (where dnA is the n-th data item on server A). Server B would then receive this list of checksums and check its own list of checksums to determine if any are missing. Say Server B has the list [c1,c2,c3,c5]. What should then happen is that it requests d4 from Server A and the synchronization has worked properly in the ideal case.
If we recall the possibility of collisions, and that it doesn't always take that much data to produce one (e.g. CRC("plumless") == CRC("buckeroo")), then we'll quickly realize that the best guarantee our scheme provides is that server B definitely doesn't have d4A but it cannot guarantee that it has [d1A,d2A,d3A]. This is because it is possible that f(d1A) = c1 and f(d1B) = c1 even though d1A and d1B are distinct and we would like both servers to have both. In this scheme, neither server can ever know about the existence of both d1A and d1B. We can use more and more collision resistant checksums and hashes but this scheme can never guarantee complete synchronization. This becomes more important, the greater the number of files the network must keep track of. I would recommend using a cryptographic hash instead. (SHA1 was the usual suggestion while no collisions were known for it, but a practical collision has since been demonstrated, so a SHA-2 function is the safer pick.)
A possible mitigation of this risk is to introduce redundant hashes. One way of doing this is to use a completely different algorithm, since whilst it is possible that crc32(d1) == crc32(d2), it is less likely that adler32(d1) == adler32(d2) simultaneously. This paper suggests you don't gain all that much this way though. To use the OP's notation, it is also less likely that crc32('a' & d1) == crc32('a' & d2) and crc32('b' & d1) == crc32('b' & d2) are simultaneously true, so you can "salt" to get less collision prone combinations. However, I think you may just as well use a collision resistant hash function like SHA512, which in practice likely won't have that great an impact on your performance.
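To put a rough number on how the risk grows with the number of data sets, the standard birthday approximation can be used. It assumes the UIDs behave like independent, uniformly random values, which a CRC over structured data only approximates; the function name CollisionProbability is made up for this sketch:

```cpp
#include <cmath>

// Approximate probability that at least two of n data sets share a UID of
// uid_bits bits: 1 - exp(-n(n-1)/2 / 2^uid_bits).
double CollisionProbability(double n, int uid_bits) {
    const double pairs = n * (n - 1.0) / 2.0;
    // expm1 keeps precision when the exponent is tiny.
    return -std::expm1(-pairs / std::ldexp(1.0, uid_bits));
}
```

Under that assumption, a 32-bit UID reaches roughly a 1% chance of some collision at around 10,000 data sets and about a 50% chance near 77,000, while a 160-bit or longer digest stays negligible at any realistic count.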

Mutually exclusive contiguous ranges from multiple bitfields

(This is not a CS class homework, even if it looks like one)
I'm using bitfields to represent ranges between 0 and 22. As an input, I have several different ranges, for example (order doesn't matter). I used . for 0 and X for 1 for better readability.
.....XXXXX..............
..XXXX..................
.....XXXXXXXXXXXXXXX....
........XXXXXXX.........
XXXXXXXXXXXXXXXXXXXXXXXX
The number of bitfield ranges is typically below 10, but can potentially become as high as 100. From that input, I want to calculate the mutually exclusive, contiguous ranges, like this:
XX......................
..XXX...................
.....X..................
......XX................
........XX..............
..........XXXXX.........
...............XXXXX....
....................XXXX
(again, the output order doesn't matter, they just need to be mutually exclusive and contiguous, i.e. they can't have holes in them. .....XXX.......XXXXX.... must be split up in two individual ranges).
I tried a couple of algorithms, but all of them ended up being rather complex and inelegant. What would help me immensely is a way to detect that .....XXX.......XXXXX.... has a hole and a way to determine the index of one of the bits in the hole.
Edit: The bitfield ranges represent zoom levels on a map. They are intended to be used for outputting XML stylesheets for Mapnik (the tile rendering system that is, among others, used by OpenStreetMap).
I'm assuming the solution you're mentioning in the comment is something like this:
Start at the left or right (so index = 0), and scan which bits are set (up to 100 operations). Name that set x. Also set a variable block=0.
At index=1, repeat and store to set y. If x XOR y = 0, both are identical sets, so move on to index=2. If x XOR y != 0, then the range [block, index) is contiguous and complete. Now set x = y, block = index, and continue.
If you have 100 bit-arrays of length 22 each, this takes something on the order of 2200 operations.
This is an optimum solution because the operation cannot be reduced further -- at each stage, your range is broken if another set doesn't match your set, so to check if the range is broken you must check all 100 bits.
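A sketch of that column scan, under some assumptions of mine: each input range is a 32-bit mask whose bit k corresponds to the k-th character from the left in the question's notation, the width is 24 as in the examples, and columns covered by no input produce no output range. MakeMask, SplitRanges and the constants are names invented for this example:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr int kWidth = 24;               // columns, as in the question's examples
constexpr std::size_t kMaxRanges = 128;  // room for the "up to 100" inputs

// Mask with bits lo..hi (inclusive) set; requires 0 <= lo <= hi < 31.
constexpr std::uint32_t MakeMask(int lo, int hi) {
    return ((1u << (hi + 1)) - 1u) & ~((1u << lo) - 1u);
}

// Returns half-open column ranges [start, end) over which the set of
// covering inputs stays the same.
std::vector<std::pair<int, int>> SplitRanges(const std::vector<std::uint32_t>& masks) {
    assert(masks.size() <= kMaxRanges);
    std::vector<std::pair<int, int>> out;
    std::bitset<kMaxRanges> prev;  // set x: inputs covering the current block
    int start = 0;                 // "block" in the description above
    for (int col = 0; col <= kWidth; ++col) {
        std::bitset<kMaxRanges> cur;  // set y: inputs covering this column
        if (col < kWidth) {
            for (std::size_t r = 0; r < masks.size(); ++r) {
                if ((masks[r] >> col) & 1u) cur.set(r);
            }
        }
        if (col == kWidth || cur != prev) {  // the covering set changed here
            if (col > start && prev.any()) out.emplace_back(start, col);
            start = col;
            prev = cur;
        }
    }
    return out;
}
```

Fed the five inputs from the question, this should yield the eight ranges listed there, expressed as [0,2), [2,5), [5,6), [6,8), [8,10), [10,15), [15,20) and [20,24).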
I'll take a shot at your sub-problem, at least..
"What would help me immensely is a way to detect that
.....XXX.......XXXXX.... has a hole and a way to determine the index
of one of the bits in the hole."
Finding the lowest and highest set ("1") bits in a bitmask is a pretty
solved problem; see, for example, ffs(3) in glibc, or
http://en.wikipedia.org/wiki/Bit_array#Find_first_one
Given the indexes of the lowest and highest set bits of a bitmap, call
them i and j, you can compute the bitmap that has all bits between i
and j (inclusive) set using M = ((1 << (j + 1)) - 1) & ~((1 << i) - 1).
You can then test if the original bitmap has a hole by comparing it to
M. If it doesn't match, you can take the input xor M to find the
holes and repeat.
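A small sketch of that hole test for 32-bit masks. The function name Holes is invented, and plain loops stand in for ffs() or a compiler intrinsic so the example stays portable:

```cpp
#include <cstdint>

// Returns a mask of the "hole" bits: the zero bits that lie between the
// lowest and the highest set bit of x. The result is 0 when x is contiguous.
std::uint32_t Holes(std::uint32_t x) {
    if (x == 0) return 0;
    int lo = 0;
    while (((x >> lo) & 1u) == 0) ++lo;  // index of the lowest set bit
    int hi = 31;
    while (((x >> hi) & 1u) == 0) --hi;  // index of the highest set bit
    // All bits lo..hi set; shifting by 32 is undefined, so special-case hi == 31.
    const std::uint32_t upto_hi = (hi == 31) ? ~0u : ((1u << (hi + 1)) - 1u);
    const std::uint32_t m = upto_hi & ~((1u << lo) - 1u);
    return x ^ m;  // bits set in M but missing from x
}
```

Any set bit of the result is the index of a bit inside a hole, so the input can be split there and each piece tested again.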

how to efficiently access 3^20 vectors in a 2^30 bits of memory

I want to store a 20-dimensional array where each coordinate can have 3 values,
in a minimal amount of memory (2^30 or 1 Gigabyte).
It is not a sparse array, I really need every value.
Furthermore I want the values to be integers of arbitrary but fixed precision,
say 256 bits or 8 words.
Example:
set_big_array(1,0,0,0,1,2,2,0,0,2,1,1,2,0,0,0,1,1,1,2, some_256_bit_value);
and
get_big_array(1,0,0,0,1,2,2,0,0,2,1,1,2,0,0,0,1,1,1,2, &some_256_bit_value);
Because 3 is relatively prime to 2, it's difficult to implement this using
efficient bitwise shift, and, and or operators.
I want this to be as fast as possible.
Any thoughts?
Seems tricky to me without some compression:
3^20 = 3486784401 values to store
256bits / 8bitsPerByte = 32 bytes per value
3486784401 * 32 = 111577100832 size for values in bytes
111577100832 / (1024^3) = 104 GB
You're trying to fit 104 GB in 1 GB. There'd need to be some pattern to the data that could be used to compress it.
Sorry, I know this isn't much help, but maybe you can rethink your strategy.
There are 3.48e9 variants of 20-tuple of indexes that are 0,1,2. If you wish to store a 256 bit value at each index, that means you're talking about 8.92e11 bits - about a terabit, or about 100GB.
I'm not sure what you're trying to do, but that sounds computationally expensive. It may be reasonably feasible as a memory-mapped file, and may be reasonably fast as a memory-mapped file on an SSD.
What are you trying to do?
So, a practical solution would be to use a 64-bit OS and a large memory-mapped file (preferably on an SSD) and simply compute the address for a given element in the typical way for arrays, i.e. as sum-of(forall-i(i-th-index * 3^i)) * 32 bytes in pseudo-math. Or, use a very very expensive machine with that much memory, or another algorithm that doesn't require this array in the first place.
A few notes on platforms: Windows 7 supports just 192GB of memory, so using physical memory for a structure like this is possible but really pushing it (more expensive editions support more), if you can find such a machine at all, that is. According to Microsoft's page on the matter the user-mode virtual address space is 7-8TB, so mmap/virtual memory should be doable. Alex Ionescu explains why there's such a low limit on virtual memory despite an apparently 64-bit architecture. Wikipedia puts Linux's addressable limits at 128TB, though probably that's before the kernel/usermode split.
Assuming you want to address such a multidimensional array, you must process each index at least once: that means any algorithm will be O(N) where N is the number of indexes. As mentioned before, you don't need to convert to base-2 addressing or anything else; the only thing that matters is that you can compute the integer offset, and which base the maths happens in is irrelevant. You should use the most compact representation possible and ignore the fact that the size of each dimension is not a power of 2.
So, for a 16-dimensional array, that address computation function could be:
int offset = 0;
for(int ii=0;ii<16;ii++)
offset = offset*3 + indexes[ii];
return &the_array[offset];
As previously said, this is just the common array indexing formula, nothing special about it. Note that even for "just" 16 dimensions, if each item is 32 bytes, you're dealing with a little more than a gigabyte of data.
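For the full 20 dimensions the same formula needs a 64-bit offset, because 3^20 = 3486784401 no longer fits in a 32-bit int and the byte offset goes beyond 100 GB. A sketch with invented names (ElementIndex, ByteOffset, and the constants), assuming 32-byte values as in the question:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kDims = 20;
constexpr std::uint64_t kValueBytes = 32;           // one 256-bit value
constexpr std::uint64_t kElements = 3486784401ULL;  // 3^20
constexpr std::uint64_t kTotalBytes = kElements * kValueBytes;

// Base-3 positional index of the element; every coordinate must be 0, 1 or 2.
std::uint64_t ElementIndex(const std::array<std::uint8_t, kDims>& indexes) {
    std::uint64_t offset = 0;
    for (std::uint8_t idx : indexes) {
        assert(idx < 3);
        offset = offset * 3 + idx;
    }
    return offset;
}

// Byte offset of the element inside one flat buffer or memory-mapped file.
std::uint64_t ByteOffset(const std::array<std::uint8_t, kDims>& indexes) {
    return ElementIndex(indexes) * kValueBytes;
}
```

The largest byte offset, with all coordinates equal to 2, is 111577100800, which matches the roughly 104 GB total worked out earlier in the thread.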
Maybe I misunderstand your question, but can't you just use a normal array?
INT256 bigArray[3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3];
OR
INT256 ********************bigArray = malloc(3^20 * 8);
bigArray[1][0][0][1][2][0][1][1][0][0][0][0][1][1][2][1][1][1][1][1] = some_256_bit_value;
etc.
Edit:
Will not work because you would need 3^20 * 8Byte = ca. 25GByte.
The malloc variant is wrong.
I'll start by doing a direct calculation of the address, then see if I can optimize it
address = 0;
for(i=15; i>=0; i--)
{
address = 3*address + array[i];
}
address = address * number_of_bytes_needed_for_array_value;
2^30 bits is 2^27 bytes so not actually a gigabyte, it's an eighth of a gigabyte.
It appears impossible because of the mathematics, although of course you can create the data at a bigger size and then compress it, which may get you down to the required size but cannot be guaranteed to. (It must fail some of the time, as the compression is lossless.)
If you do not require immediate "random" access your solution may be a "variable sized" two-bit word so your most commonly stored value takes only 1 bit and the other two take 2 bits.
If 0 is your most common value then:
0 = 0
10 = 1
11 = 2
or something like that.
In that case you will be able to store your bits in sequence this way.
It could take up to 2^40 bits this way but probably will not.
You could pre-run through your data and see which is the commonly occurring value and use that to indicate your single-bit word.
You can also compress your data after you have serialized it in up to 2^40 bits.
My assumption here is that you will be using disk possibly with memory mapping as you are unlikely to have that much memory available.
My assumption is that space is everything and not time.
You might want to take a look at something like STXXL, an implementation of the STL designed for handling very large volumes of data.
You can actually use a pointer-to-array20 to have your compiler implement the index calculations for you:
/* Note: there are 19 of the [3]'s below */
my_256bit_type (*foo)[3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3];
foo = allocate_giant_array();
foo[0][1][1][0][2][1][2][2][0][2][1][0][2][1][0][0][2][1][0][0] = some_256bit_value;