I'm trying to understand hardware caches. I have a rough idea, but I would like to ask here whether my understanding is correct or not.
So I understand that there are three types of cache mapping: direct, fully associative, and set associative.
I would like to know: is the type of mapping implemented with logic gates in hardware, specific to a given computer system, so that changing the mapping would require changing the electrical connections?
My current understanding is that in RAM, there exists a memory address to refer to each block of memory. A block contains words, and each word contains a number of bytes. We can represent the number of options with a number of bits.
So for example, with 4096 memory locations where each memory location contains 16 bytes: if we want to refer to each individual byte, then 2^12 * 2^4 = 2^16, so a 16-bit memory address is required to refer to each byte.
The cache also has a memory address, a valid bit, a tag, and some data capable of storing a block of main memory of n words and thus m bytes, where m = n * i (with i bytes per word).
Take direct mapping as an example:
One block of main memory can only be placed at one particular location in the cache. When the CPU requests data using a 16-bit RAM address, it checks the cache first.
How does it know that this particular 16-bit memory address can only be in a few places?
My thought is that there could be some electrical connection between every RAM address and a cache address. The 16-bit address could then be split into parts: for example, compare only the left 8 bits with every cache memory address, then if they match compare the byte bits, then the tag bits, then the valid bit.
Is my understanding correct? Thank you!
I really appreciate anyone who reads this long post.
You may want to read 3.3.1 Associativity in What Every Programmer Should Know About Memory by Ulrich Drepper.
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf#subsubsection.3.3.1
The title is a little bit catchy, but it explains everything you ask in detail.
In short:
The problem with caches is the number of comparisons. If your cache holds 100 blocks, you need to perform 100 comparisons in one cycle. You can reduce this number by introducing sets: if a specific memory region can only be placed in slots 1-10, you reduce the number of comparisons to 10.
The sets are addressed by an additional bit field inside the memory address, called the index.
So for instance your 16 bits (from your example) could be split into:
[15:6] block address; stored in the `cache` as the `tag` to identify the block
[5:4] index bits; 2 bits --> 4 sets
[3:0] block offset; byte position inside the block
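For illustration, here is a minimal sketch of that split (hypothetical C++; the masks and shifts follow directly from the bit ranges above):

#include <cstdint>

uint16_t addr   = 0xABCD;             // example 16-bit address from the CPU
uint16_t offset = addr & 0xF;         // bits [3:0]: byte position inside the block
uint16_t index  = (addr >> 4) & 0x3;  // bits [5:4]: selects one of the 4 sets
uint16_t tag    = addr >> 6;          // bits [15:6]: compared against the stored tags

The cache then compares tag only against the tags stored in set index, instead of against every line in the cache.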
So the choice of method depends on the available hardware resources and the access time you want to achieve. It's pretty much hardwired, since you want to reduce the comparison logic.
There are a few mapping functions used to map cache lines to main memory:
Direct Mapping
Associative Mapping
Set-Associative Mapping
You should have at least a basic idea of these three mapping functions.
Related
I just learned about 4- or 8-byte memory alignment and came across this question.
Does memory alignment happen in the virtual address space or at absolute (physical) addresses?
I guess the answer is the virtual address space, and the OS will load the process at a position where the absolute address ends with '0x00' or '0x0'.
If not, please show me why. Thanks a lot. XD
Both virtual and actual addresses will be word-aligned to the CPU's native word-size where appropriate(*). (The reason for that is that the virtual-to-physical mapping is done on a per-page basis, and the size of a memory page is always an even multiple of the CPU's native word-size).
(*) The exception would be for items that are smaller than a word and are packed together consecutively to save memory; e.g. many of the individual elements inside char and uint8_t arrays will necessarily not be word-aligned.
Why is the value of an address in C and C++ always even?
For example: I declare a variable int x and x has the memory address 0x6ffe1c (in hexadecimal). No matter what, that value is never an odd number; it is always an even number. Why is that so?
Computer memory is composed of bits. The bits are organized into groups. A computer may have a gigabyte of memory, which is over 1,000,000,000 bytes or 8,000,000,000 bits, but the physical connections to memory cannot simply get any one particular bit from that memory.
When the processor wants data from memory, it puts a signal on a bus that asks for a particular word of memory. A bus is largely a set of wires that connects different parts of the computer. A word of memory is a group of bits of some size particular to that hardware, perhaps 32 bits. When the memory device sees a request for a word, it gets those bits and puts them on the bus, all at once. (The bus for that will have 32 or more wires, so it can carry all the data for one word at one time.)
Let’s continue with the example of 32-bit words. Since memory is grouped into words of 32 bits, or four bytes, each word starts at a memory address that is a multiple of four. And every address that is a multiple of four (0, 4, 8, 12, 16, … 4096, 4100, 4104, …) is the address of a word. The processor always reads or writes memory in units of words; that is the only interaction the hardware can do, and the processor cannot read individual bytes from memory. If your int is in a single word, then the processor can get it from memory by asking for that word.
On the other hand, suppose your int starts at address 99. Then one byte of it is in the word that starts at address 96 (addresses 96 to 99), and three bytes of it are in the word that starts at address 100 (addresses 100 to 103). In order to get your int, the processor has to read two words and then stitch together bytes from them to make one int.
First, that is a waste of time. Doing two reads from memory takes longer than doing one read. Second, if the processor has to have extra wires and circuits for doing that, it makes the processor more expensive and use more energy, and it takes resources away from other things the processor could be doing, like adding or multiplying.
So processors are designed to prefer aligned data. They may have components for handling unaligned data, but using those components may take extra time or resources. So compilers are designed to align objects in ways that are preferable for the target architecture.
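As a small illustration of that last point, a compiler will typically insert padding so members land on their preferred boundaries (a sketch; the exact offsets are implementation-defined):

#include <cstddef>
#include <cstdio>

struct Example {
    char c;  // 1 byte
    int  x;  // typically placed at offset 4, not offset 1
};

int main() {
    printf("offset of x: %zu\n", offsetof(Example, x));  // typically 4
    printf("sizeof(Example): %zu\n", sizeof(Example));   // typically 8
}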
Why is the value of an address in C and C++ always even?
This is not true in general. For example, if you have an array char[2], you'll find that exactly one of those elements has an even address, and the other must have an odd address.
I declare a variable int x and x has a memory address 0x6ffe1c
Every type in C and C++ has some alignment requirement. That is, objects of the type must be stored at an address that is divisible by that alignment. The alignment requirement is an integer that is always a power of two.
There is exactly one power of two that is not even: 2^0 == 1. Objects with an alignment requirement of 1 can be stored at odd addresses. char always has a size and alignment of 1 byte. int typically has a higher alignment requirement, in which case it will be stored at an even address.
The reason alignment is important is that some CPU instruction sets only allow reading and writing at memory addresses that are aligned to the width of the CPU word. Other CPU instruction sets may support operations on addresses aligned to some fraction of the word size. Further, some CPU instruction sets, such as x86, support operating on entirely unaligned addresses, but (at least on older models) such operations may be much slower.
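A small sketch showing both cases (the printed values are platform-dependent):

#include <cstdio>

int main() {
    char pair[2];
    // char has an alignment of 1, so one of these two addresses must be odd:
    printf("%p %p\n", (void*)&pair[0], (void*)&pair[1]);
    // Alignment requirements are always powers of two:
    printf("alignof(char)=%zu alignof(int)=%zu alignof(double)=%zu\n",
           alignof(char), alignof(int), alignof(double));
}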
I'm hoping for some high-level advice on how to approach a design I'm about to undertake.
The straightforward approach to my problem will result in millions and millions of pointers. On a 64-bit system these will presumably be 64-bit pointers. But as far as my application is concerned, I don't think I need more than a 32-bit address space. I would still like for the system to take advantage of 64-bit processor arithmetic, however (assuming that is what I get by running on a 64-bit system).
Further background
I'm implementing a tree-like data structure where each "node" contains an 8-byte payload, but also needs pointers to four neighboring nodes (parent, left-child, middle-child, right-child). On a 64-bit system using 64-bit pointers, this amounts to 32 bytes just for linking an 8-byte payload into the tree -- a "linking overhead" of 400%.
The data structure will contain millions of these nodes, but my application will not need much memory beyond that, so all these 64-bit pointers seem wasteful. What to do? Is there a way to use 32-bit pointers on a 64-bit system?
I've considered
Storing the payloads in an array in a way such that an index implies (and is implied by) a "tree address" and neighbors of a given index can be calculated with simple arithmetic on that index. Unfortunately this requires me to size the array according to the maximum depth of the tree, which I don't know beforehand, and it would probably incur even greater memory overhead due to empty node elements in the lower levels because not all branches of the tree go to the same depth.
Storing nodes in an array large enough to hold them all, and then using indices instead of pointers to link neighbors. AFAIK the main disadvantage here would be that each node would need the array's base address in order to find its neighbors. So they either need to store it (a million times over) or it needs to be passed around with every function call. I don't like this.
Assuming that the most-significant 32 bits of all these pointers are zero, throwing an exception if they aren't, and storing only the least-significant 32 bits. So the required pointer can be reconstructed on demand. The system is likely to use more than 4GB, but the process will never. I'm just assuming that pointers are offset from a process base-address and have no idea how safe (if at all) this would be across the common platforms (Windows, Linux, OSX).
Storing the difference between the 64-bit this and the 64-bit pointer to the neighbor, assuming that this difference will be within the range of int32_t (and throwing if it isn't). Then any node can find its neighbors by adding that offset to this.
Any advice? Regarding that last idea (which I currently feel is my best candidate) can I assume that in a process that uses less than 2GB, dynamically allocated objects will be within 2 GB of each other? Or not at all necessarily?
Combining ideas 2 and 4 from the question, put all the nodes into a big array, and store e.g. int32_t neighborOffset = neighborIndex - thisIndex. Then you can get the neighbor from *(this+neighborOffset). This gets rid of the disadvantages/assumptions of both 2 and 4.
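A minimal sketch of what such a node could look like (the names are illustrative, not from the question; it assumes all nodes live in one contiguous array, so the pointer arithmetic stays within the array):

#include <cstdint>

struct Node {
    int64_t payload;       // the 8-byte payload
    int32_t parentOffset;  // neighborIndex - thisIndex, in array elements
    int32_t leftOffset;
    int32_t middleOffset;
    int32_t rightOffset;

    Node* parent() { return this + parentOffset; }
    Node* left()   { return this + leftOffset; }
    Node* middle() { return this + middleOffset; }
    Node* right()  { return this + rightOffset; }
};

sizeof(Node) comes to 24 bytes instead of 40 with four 64-bit pointers, cutting the linking overhead from 400% to 200%. An offset of 0 can double as the "null" value, since a node is never its own neighbor.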
If on Linux, you might consider using (and compiling for) the x32 ABI. IMHO, this is the preferred solution for your issues.
Alternatively, don't use pointers, but indexes into a huge array (or an std::vector in C++) which could be a global or static variable. Manage a single huge heap-allocated array of nodes, and use indexes of nodes instead of pointers to nodes. So like your §2, but since the array is global or static data you won't need to pass it everywhere.
(I guess that an optimizing compiler would be able to generate clever code, which could be nearly as efficient as using pointers)
You can remove the disadvantage of (2) by exploiting the alignment of memory regions to find the base address of the array "automatically". For example, if you want to support up to 4 GB of nodes, ensure your node array starts at a 4 GB boundary.
Then within a node with address addr, you can determine the address of another node at index as (addr & -(1UL << 32)) + index, where index is a byte offset into the array (note the parentheses: & binds more loosely than +).
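As a helper function, that lookup could look like this (assuming a 64-bit build, with index as a byte offset into the array):

#include <cstdint>

inline void* nodeAt(void* anyNodeAddr, uint64_t index) {
    // Clearing the low 32 bits recovers the 4 GB-aligned array base.
    uintptr_t base = (uintptr_t)anyNodeAddr & -(uintptr_t(1) << 32);
    return (void*)(base + index);
}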
This is kind of the "absolute" variant of the accepted solution which is "relative". One advantage of this solution is that an index always has the same meaning within a tree, whereas in the relative solution you really need the (node_address, index) pair to interpret an index (of course, you can also use the absolute indexes in the relative scenarios where it is useful). It means that when you duplicate a node, you don't need to adjust any index values it contains.
The "relative" solution also loses 1 effective index bit relative to this solution in its index since it needs to store a signed offset, so with a 32-bit index, you could only support 2^31 nodes (assuming full compression of trailing zero bits, otherwise it is only 2^31 bytes of nodes).
You can also store the base tree structure (e.g., the pointer to the root and whatever bookkeeping you have outside of the nodes themselves) right at the 4 GB address, which means that any node can jump to the associated base structure without traversing all the parent pointers or whatever.
Finally, you can also exploit this alignment idea within the tree itself to "implicitly" store other pointers. For example, perhaps the parent node is stored at an N-byte aligned boundary, and then all children are stored in the same N-byte block so they know their parent "implicitly". How feasible that is depends on how dynamic your tree is, how much the fan-out varies, etc.
You can accomplish this kind of thing by writing your own allocator that uses mmap to allocate suitably aligned blocks (usually just reserve a huge amount of virtual address space and then allocate blocks of it as needed), either via the hint parameter or just by reserving a big enough region that you are guaranteed to get the alignment you want somewhere in the region. The need to mess around with allocators is the primary disadvantage compared to the accepted solution, but if this is the main data structure in your program it might be worth it. When you control the allocator you have other advantages too: if you know all your nodes are allocated on a 2^N-byte boundary you can "compress" your indexes further, since you know the low N bits will always be zero, so with a 32-bit index you could actually address 2^(32+5) = 2^37 bytes of nodes if you knew they were 32-byte aligned.
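On Linux, the reservation part of such an allocator could be sketched like this (over-reserve address space with mmap, then carve out an aligned base; error handling trimmed):

#include <cstdint>
#include <sys/mman.h>

void* reserve4GAligned() {
    const uintptr_t align = uintptr_t(1) << 32;
    // Reserve twice the alignment as inaccessible address space (PROT_NONE),
    // so a 4 GB-aligned base is guaranteed to exist inside the region.
    void* p = mmap(nullptr, 2 * align, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    uintptr_t base = ((uintptr_t)p + align - 1) & ~(align - 1);
    // Commit pages within [base, base + align) via mprotect() as nodes are allocated.
    return (void*)base;
}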
These kinds of tricks are really only feasible in 64-bit programs, with the huge amount of virtual address space available, so in a way 64-bit giveth and also taketh away.
Your assertion that a 64-bit system necessarily has to have 64-bit pointers is not correct. The C++ standard makes no such assertion.
In fact, different pointer types can be different sizes: sizeof(double*) might not be the same as sizeof(int*).
Short answer: don't make any assumptions about the sizes of any C++ pointer.
It sounds to me like you want to build your own memory management framework.
I am implementing an emulator for an old game console, mostly for learning purposes.
This console maps roms, and a lot of other things, to regions within its address space. Certain locations are also mirrored so that multiple addresses can correspond to the same physical location. I would like to emulate this, but I am not sure what would be a good approach to do so (and also have no idea what this process is called, hence this somewhat generic question).
One thing that does work is a simple unordered map. Have it contain absolute addresses and the corresponding pointers to my data structures. This way, I can easily map everything I need into the system's address space. The problem with this approach is that it's obviously a memory hog. Even with small roms, I end up with close to ten million entries, thanks to the aforementioned mirroring. Surely, this can't be the right thing to do?
Any help is much appreciated.
Edit:
To provide some details as to how exactly I am doing this:
The system in question is, of course, the SNES. Using this wiki as my primary resource, I implemented what I mentioned above as follows:
Create a std::unordered_map<uint32_t, uint8_t*> mMemoryMap;
Check whether the rom is LoRom or HiRom
For each byte in the rom
Calculate the address where it should be mapped and emplace both the address and a pointer to said byte in the map
If the section needs to be mirrored somewhere else, repeat the above
This will be applied to anything else I need to make available, such as video- or system-memory
If I now want to access anything within the address space, I can simply use the address the system would use internally.
I'm assuming that for contiguous addresses the physical locations are also contiguous, within certain blocks of memory or "chunks". That is, if the address 0x0000 maps to 0xFF00, then 0x0004 maps to 0xFF04.
If they work like that, then you can make a list that contains the information of those chunks. Say:
struct Chunk
{
    int addressStart, memoryStart, size;
};
The chunks may be ordered by addressStart, so you can find the correct chunk for any address, as in the sketch below. This requires you to iterate over the list, but if you have only a few chunks this may be acceptable.
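A sketch of that lookup using the Chunk struct above (a linear scan, which is fine for a handful of chunks):

#include <vector>

// Translate an emulated address into an offset in backing memory.
// Returns true and fills `out` on a hit, false if the address is unmapped.
bool translate(const std::vector<Chunk>& chunks, int address, int& out)
{
    for (const Chunk& c : chunks) {
        if (address >= c.addressStart && address < c.addressStart + c.size) {
            out = c.memoryStart + (address - c.addressStart);
            return true;
        }
    }
    return false;
}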
Rather than use simple maps (which even with ranges can grow to large sizes) you can instead use a more intelligent map.
For instance if the console maps 0x10XXXX through 0x1FXXXX all to the same 0x20XXXX you can design a structure which holds that repetition (start 0x100000 end 0x1FFFFF repeat 0x010000 although you may want to use a bitmask rather than repeat).
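For example, a single record could describe a whole mirrored run (a hypothetical layout, not from any particular emulator):

#include <cstdint>

struct MirrorRange {
    uint32_t start;   // first mirrored address, e.g. 0x100000
    uint32_t end;     // last mirrored address,  e.g. 0x1FFFFF
    uint32_t target;  // block they all map to,  e.g. 0x200000
    uint32_t mask;    // selects the offset within the block, e.g. 0x00FFFF
};

// An address a in [start, end] resolves to target | (a & mask),
// so one record replaces tens of thousands of individual map entries.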
I'm currently in the same boat as you and doing an NES emulator for learning purposes. The complete memory map is declared as an array of pointers to bytes. Each pointer can point into another array, which allows me to have pointers to the same data in the case of mirroring.
byte *memoryMap[cpuMemSize];
I iterate over the addresses that repeat and map them to point to the bytes in the following array. The ram array is the memory that gets mapped 4 times across the CPU's memory map.
byte ram[ramSize];
The following code goes through the RAM array and maps it across the CPU memory map 4 times.
// map RAM 4 times to emulate mirroring
// (mem_address here is a small wrapper struct with an integer .value member)
mem_address memAddr = 0;
for (int x = 0; x < 4; ++x)
{
    for (mem_address ramAddr = 0; ramAddr < ramSize; ++ramAddr.value)
    {
        memoryMap[memAddr.value++] = &ram[ramAddr.value];
    }
}
You can then write a value to the 256th byte using something like this, which would of course be propagated to the other parts of the memory map, since they point to the same byte in memory:
*memoryMap[0x00ff] = 10;
I haven't really tested this and want to do some more testing with regard to CPU cache use and performance. I was clearly out searching for other ways of doing this when I stumbled on your question and figured I'd put in my (unverified) two cents. Hope this makes sense.
I have a very large vector (millions of entries, 1024 bytes each). I am exceeding the maximum size of the vector (getting a bad alloc exception). I am doing a recursive operation over the vector of items which requires accessing other elements in the vector. The operations need to be done quickly. I am trying to avoid writing to disk for speed reasons. Is there any other way to store this data that would not require writing to disk? If I have to write the data to disk, what would be the most ideal way to do it?
Edit, for a few more details:
The operation that I am performing on the data set is generating a string recursively based on other data points in the vector. The data is sorted when it is read in. Data sets range from 50,000 to 50,000,000 entries.
The easiest way to solve this problem is to use STXXL. It's a reimplementation of the STL for large structures that transparently writes to disk when the data won't fit in memory.
Your problem cannot be solved as stated and clarified in the comments.
You have requested a way to have a contiguous in-memory buffer of 50,000,000 entries of size 1024 on a 32 bit system.
A 32 bit system has only 4294967296 bytes of addressable memory. You are asking for 51200000000 bytes of addressable memory, or 11.9 times the amount of memory address space on your system.
If you don't require that your data be contiguous and memory-addressable, if you don't require that your data all be in memory at once, or if you relax other requirements, there may be an answer to your problem. E.g., some OSes expose access to a non-memory space of values that corresponds to RAM (there were ways on Windows systems with 8 GB of RAM to use more than 4 GB of total RAM) through some hacky interface or other.
But as stated, the answer is "no, you cannot do that".
Because your data must be contiguous, and you know how many elements you need to store, just create a std::vector and use the reserve() function to attempt to gain a contiguous block of memory of the required size.
There is very little overhead in storing a vector (just a few pointers to manage the beginning and end). This is as good as you'll be able to do.
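A minimal sketch of that attempt (assuming fixed 1024-byte entries; reserve() fails up front, with std::length_error or std::bad_alloc, if the contiguous block cannot be obtained):

#include <array>
#include <vector>

using Entry = std::array<char, 1024>;  // one fixed-size 1024-byte element

int main() {
    std::vector<Entry> entries;
    // Ask for the whole contiguous block immediately; on a 32-bit build
    // this fails long before 50 million entries fit.
    entries.reserve(50000000);
}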
If that fails:
add more memory to your machine (may not actually help, if you've run up against addressing or implementation constraints)
switch to a raw array
find a way to reduce the size of your elements
try to find a solution that can tackle the problem in small blocks
That is 1 GB of memory (1024 B * 10^6 = 1024 MB ≈ 1 GB). Ideally, a 32-bit machine can perform memory operations on up to 4 GB.
To answer your question, first try a plain malloc() call to allocate 1 GB of memory. This should succeed without any error.
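For example (a quick sanity check, not a fix):

#include <cstdio>
#include <cstdlib>

int main() {
    // Can the process get 1 GB in a single contiguous allocation?
    void* p = std::malloc(1024ULL * 1024 * 1024);
    std::printf(p ? "1 GB allocation succeeded\n" : "1 GB allocation failed\n");
    std::free(p);
}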
Also, please paste the exact error message that you get while using the vector.