Shared memory multi-threading and data access? - C++

Regarding performance: suppose we get a block of data that will be frequently accessed by each thread, and these data are read-only, meaning the threads won't do anything besides reading them.
Is it then beneficial to create one copy of these data (again, assuming they are read-only) for each thread, or not?
If the frequently accessed data are shared by all threads (instead of one copy per thread), wouldn't this increase the chance of the data being properly cached?

One copy of the read-only data per thread will not help you with caching; quite the opposite, it can hurt when the threads execute on the same multi-core (and possibly hyper-threaded) CPU and therefore share its cache, because the per-thread copies then compete for the limited cache space.
However, on a multi-CPU system - and virtually all of those are NUMA nowadays, typically having per-CPU memory banks whose access cost differs between "local" and "remote" memory - it can be beneficial to have a per-CPU copy of the read-only data, placed in that CPU's local memory bank.
The memory mapping is controlled by the OS, so if you take this road it makes sense to study the NUMA-related behavior of your OS. For example, Linux uses a first-touch memory allocation policy: the mapping happens not at malloc time but when the program accesses a memory page for the first time, and the OS tries to allocate the physical page from the local bank.
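A minimal sketch of exploiting first-touch (assuming Linux's default policy; the worker function and sizes are made up for illustration) is to let each thread allocate and initialize its own copy, so the pages are faulted in by the thread that will later read them:

    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // Hypothetical worker: it allocates and touches its own copy, so under a
    // first-touch policy the physical pages come from the memory bank local
    // to the CPU this thread runs on.
    void worker(const std::vector<double>& source)
    {
        std::vector<double> local(source.size());
        for (std::size_t i = 0; i < source.size(); ++i)
            local[i] = source[i];   // first write = first touch

        // ... read-heavy work on `local` ...
    }

    int main()
    {
        const std::vector<double> source(1 << 20, 1.0); // shared read-only input
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back(worker, std::cref(source));
        for (auto& th : threads)
            th.join();
    }

This only pays off if each thread keeps running near the node where it touched its copy; libnuma (e.g. numa_alloc_local) gives more explicit control over placement.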
And the usual performance motto applies: measure, don't guess.

Related

Concurrent reading of shared memory in multi-core environment

Given two threads running on different cores each of which has a copy of an identical pointer to a shared variable, does that raise any issue if both threads are guaranteed to only read this variable? I'm not using any kind of mutex...
If both threads are guaranteed to only read it, and no one else writes to it, then you're good.
The problem comes when any thread is reading from a variable while some other thread is writing to it, a situation that occurs surprisingly often in practice.
For some reason cache coherency has popped up as a big topic. Regardless, if a thread has a pointer to a variable, your compiled program will dereference that pointer to get the memory address of the variable and will read from it. Cache coherency doesn't stop the reads from working, even if two threads access the variable.
Performance may suffer, depending on how the CPU manages its cache: the CPU will still have to read the cache line containing the variable and will probably cache it, and there is no difference here from reading a global variable or one allocated on the heap. Your C++ program doesn't know about cache lines, and the compiled machine instructions don't know about C++ variables. From what I understand, a cache line on x86 is 64 bytes, and while it's true that writing to a memory address next to your shared variable will cause the CPU to update its cache (i.e. re-read the variable into the CPU cache), it's still no worse than using any other global variable.
If you are reading this variable continually and are worried about performance, it would be better to take a local copy in each thread. If no one is ever going to update the surrounding 64 bytes of memory, then there's no point. You will want to measure any performance impact if you are worried.
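To illustrate (a minimal sketch; the struct and field names are made up): sharing a pointer to data that nobody writes is fine without any synchronization, and if a frequently written neighbour worries you, it can be padded onto its own cache line:

    #include <cstdint>
    #include <iostream>
    #include <thread>

    struct Shared {
        // Read-only after construction: safe to read from any number of
        // threads without a mutex.
        std::int64_t value = 42;

        // A frequently written field, padded onto its own 64-byte cache line
        // so writes to it don't keep invalidating the line holding `value`
        // (C++17 also offers std::hardware_destructive_interference_size).
        alignas(64) std::int64_t counter = 0;
    };

    int main()
    {
        Shared shared;
        const Shared* p = &shared;          // both threads use the same pointer

        std::int64_t sum_a = 0, sum_b = 0;  // one result slot per thread
        std::thread a([p, &sum_a] { for (int i = 0; i < 1000; ++i) sum_a += p->value; });
        std::thread b([p, &sum_b] { for (int i = 0; i < 1000; ++i) sum_b += p->value; });
        a.join();
        b.join();
        std::cout << sum_a + sum_b << '\n';
    }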

File Based Memory Pool - Is it Possible?

Whenever new / malloc is used, the OS creates a new (or reuses an existing) heap memory segment, aligned to the page size, and returns it to the calling process. All these allocations make up the process's virtual memory. In 32-bit computing, any process can scale only up to 4 GB. The more heap is allocated, the faster the process's memory footprint grows. Though there are lots of memory-management utilities / memory pools available, all of them end up creating a heap and reusing it efficiently.
mmap (memory mapping), on the other hand, provides the ability to view a file as a memory stream and enables the program to use pointer manipulation directly on the file. But here again, mmap actually allocates that range of addresses in the process space. So if we mmap a 3 GB file and take a pmap of the process, we see that the total memory consumed by the process is >= 3 GB.
My question is: is it possible to have a file-based memory pool [just like mmapping a file] that does not consume the process's memory space? I visualize something like an in-memory DB backed by a file, which is fast for read/write, supports pointer manipulation [i.e. get a pointer to a record and store anything, just as we do with new / malloc], and can grow on disk without touching the process's 4 GB virtual limit.
Is it possible? If so, what are some pointers for me to start working?
I am not asking for a ready-made solution / links, but to conceptually understand how it can be achieved.
It is generally possible but very complicated. You would have to re-map whenever you wanted to access a different 3 GB segment of your file, which would probably kill performance in the case of scattered access. Pointers would also become much more difficult to work with, as remapping changes the data behind them while leaving the addresses the same.
I have seen the STXXL project, which might be interesting to you; or it might not. I have never used it, so I cannot give you any other advice about it.
What you are looking for is, in principle, a memory-backed file cache. There are many such things in, for example, database implementations (where the whole database is way larger than the memory of the machine, and the application developer probably wants to keep a bit of memory left for application stuff). This will involve some sort of indirection - an index, hash or some such - to indicate which area of the file you want to access, and using that indirection to determine whether the data is in memory or on disk. You would essentially replicate what the virtual memory handling of the OS and the processor does, by keeping tables that indicate where in physical memory your "virtual heap" is and, if it's not present in physical memory, reading it in (and if the cache is full, getting rid of some of it - and if it has been written, writing it back again).
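A very rough sketch of that indirection, assuming fixed-size pages, read-only access and a deliberately naive eviction policy (all names here are illustrative, not a real library):

    #include <cstddef>
    #include <fstream>
    #include <unordered_map>
    #include <vector>

    // Hypothetical page cache over a file: callers ask for a byte offset and
    // get a pointer into a cached page, never a pointer into the whole file.
    class FilePageCache {
    public:
        static constexpr std::size_t kPageSize = 4096;

        FilePageCache(const char* path, std::size_t max_pages)
            : file_(path, std::ios::binary), max_pages_(max_pages) {}

        // Return a pointer to the byte at `offset`, loading the containing
        // page from disk if needed and evicting an arbitrary page when full.
        const char* at(std::size_t offset)
        {
            const std::size_t page_no = offset / kPageSize;
            auto it = pages_.find(page_no);
            if (it == pages_.end()) {
                if (pages_.size() >= max_pages_)
                    pages_.erase(pages_.begin());            // naive eviction
                std::vector<char> buf(kPageSize);
                file_.seekg(static_cast<std::streamoff>(page_no * kPageSize));
                file_.read(buf.data(), kPageSize);
                it = pages_.emplace(page_no, std::move(buf)).first;
            }
            return it->second.data() + offset % kPageSize;
        }

    private:
        std::ifstream file_;
        std::size_t max_pages_;
        std::unordered_map<std::size_t, std::vector<char>> pages_;
    };

A real implementation would also track dirty pages and write them back, as described above, and would hand out stable handles rather than raw pointers that become invalid when a page is evicted.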
However, it's most likely that in today's world you have a machine capable of 64-bit addressing, and thus it would be much easier to recompile the application as a 64-bit application and use mmap or similar to access the large file. In this case, even if RAM isn't sufficient, you can access the contents of the file via the virtual memory system, which takes care of all the mapping back and forth between disk and RAM (physical memory).
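On a 64-bit POSIX system that can look as simple as the following sketch (error handling kept minimal; the file name is hypothetical):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main()
    {
        const char* path = "big_data.bin";           // hypothetical multi-GB file
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        // Map the whole file: this reserves address space, not RAM. Pages are
        // faulted in on access and can be evicted again under memory pressure.
        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        const char* bytes = static_cast<const char*>(base);
        std::printf("first byte: %d\n", bytes[0]);   // plain pointer access

        munmap(base, st.st_size);
        close(fd);
    }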

memory access vs. memory copy

I am writing an application in C++ that needs to read (read-only) from the same memory many times from many threads.
My question is: from a performance point of view, will it be better to copy the memory for each thread, or to give all the threads the same pointer and have all of them access the same memory?
Thanks
There is no definitive answer from the little information you have given about your target system and so on, but on a normal PC the fastest option is most likely not to copy.
One reason copying could be slow is that it might result in cache misses if the data area is large. A normal PC caches read-only access to the same data area very efficiently between threads, even if those threads happen to run on different cores.
One of the benefits explicitly listed by Intel for their approach to caching is that it "Allows more data-sharing opportunities for threads running on separate cores that are sharing cache". That is, they encourage a practice where you don't have to program the threads to explicitly cache data; the CPU will do it for you.
Since you specifically mention many threads, I assume you have at least a multi-socket system. Typically, memory banks are associated with processor sockets; that is, one processor is "nearest" to its own memory banks and needs to communicate with the other processors' memory controllers to access data on other banks. (Processor here means the physical thing in the socket.)
When you allocate data, typically a first-touch policy is used to determine on which memory bank your data will be allocated, which means the allocating processor can access it faster than the other processors can.
So, at least for multiple processors (not just multiple cores), there should be a performance improvement from allocating a copy for every processor. Be sure to allocate/copy the data in every processor/thread, not from a master thread (to exploit the first-touch policy). You also need to make sure that threads do not migrate between processors, because then you are likely to lose the close connection to your memory.
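On Linux that could look like the following sketch (pthread_setaffinity_np is GNU-specific; the CPU numbers, sizes and the worker body are illustrative):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    // Pin the calling thread to one CPU, then allocate and first-touch the
    // per-thread copy so it lands on that CPU's local memory bank.
    void worker(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        std::vector<double> local(1 << 20, 0.0);  // allocated and touched here
        // ... read-heavy work on `local` ...
    }

    int main()
    {
        std::vector<std::thread> threads;
        for (int cpu = 0; cpu < 4; ++cpu)
            threads.emplace_back(worker, cpu);
        for (auto& t : threads)
            t.join();
    }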
I am not sure how copying the data for every thread on a single processor would affect performance, but I guess not copying could improve the ability to share the contents of the higher-level caches that are shared between cores.
In any case, benchmark and decide based on actual measurements.

Accessing a certain address in memory

Why can we access a certain place in memory in O(1)?
Quick answer: You can't!
The system's main memory - the chips on the board - can, however, be addressed with direct access. Just give the correct address and the bus will return the memory at that location (likely in a block).
Once you get into the CPU, however, memory access is very different. There are several caches, several cores with caches, and possibly other CPUs with caches. Though accessing main memory can be done directly, it is slow; that is why we have all these caches. But this now means that inside the CPU the memory isn't directly accessible.
When the CPU needs to access memory it thus goes into a lookup mode. It also has a locking system to share memory between the caches correctly. Different addresses will actually take different amounts of time to access, depending on whether you are reading or writing, and on where the most recent cached copy of that memory resides. This is something known as NUMA (non-uniform memory access). While the time complexity here is probably bounded by a constant (so possibly/technically O(1)), it probably isn't what most people think of as constant time.
It gets more complicated than this. The CPU provides page tables for memory so that the OS can provide virtual memory to applications (that is, it can partition the address spaces) and load memory on demand. These tables are map-like structures. When you access memory, the CPU decides whether the address you want is loaded or whether the OS has to retrieve it first. These maps are a function of the total memory size, so lookups are not linear time, though very likely amortized constant time. (If you're running a virtual machine you can add another layer of tables on top of this - one reason why VMs run slightly slower.)
This is just a brief overview - hopefully enough to give you the impression that memory access isn't really constant time and depends on many things. Keep in mind, however, that so much optimization is employed at these levels that a high-level C program will likely appear to have constant-time access.
Memory in modern computer systems is random access, so as long as you know the address of memory you need to access, the computer can go directly to that memory location and read/write to that location.
This is opposed to some [older] systems, such as tape memory, where the tape had to be physically spooled to access certain areas, so farther locations take longer to access.
Not sure what you mean by allocating in O(1), as allocating memory is typically not O(1) when dealing with a typical heap on everyday computers.
That depends on the computing model you use. In the Turing machine model, neither operation is O(1); in the random-access model, access is O(1), and since that is the case for most modern hardware using RAM, the model is useful. I assume you are using a model that, for the sake of simplicity, also allows O(1) allocation as a close approximation to most modern implementations on a machine under light memory load.
Why can you access in O(1)? Because memory access is by address. If you know the address you want to access, then the hardware can go directly to it and fetch whatever is there in a single operation.
As for allocations being O(1), I'm not sure that is always the case. It is up to the OS to allocate a new block of memory, and the algorithm that it uses to do that may not necessarily be O(1) in all cases. For instance, if you request a large block of memory and there is no contiguous block large enough to satisfy the request, the OS may do things like page out other data or relocate information from other processes so as to create a large enough contiguous block to satisfy the request.
Though if you want to take an extremely simplified view of allocation as "returning the address of the first byte of available memory", then it's easy to see why that could be an O(1) operation. The system just needs to return the address of the last allocated byte + 1, so as long as it tracks the last allocated byte after every allocation, and as long as you assume an unlimited memory space, computing the next free address is always O(1).
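A toy bump allocator makes that concrete (purely illustrative; a fixed arena stands in for the assumed unlimited memory, and nothing is ever freed):

    #include <cstddef>

    // Toy "bump" allocator: each allocation just returns the current offset
    // and advances it, so every request is O(1).
    class BumpAllocator {
    public:
        BumpAllocator(char* buffer, std::size_t size)
            : buffer_(buffer), size_(size) {}

        void* allocate(std::size_t n)
        {
            if (used_ + n > size_) return nullptr;  // out of (finite) memory
            void* p = buffer_ + used_;              // first free byte
            used_ += n;                             // new "last allocated byte + 1"
            return p;
        }

    private:
        char* buffer_;
        std::size_t size_;
        std::size_t used_ = 0;
    };

    int main()
    {
        static char arena[1024];
        BumpAllocator a(arena, sizeof(arena));
        void* first = a.allocate(16);   // O(1)
        void* second = a.allocate(32);  // O(1)
        (void)first; (void)second;
    }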

Does malloc/new return memory blocks from Cache or RAM?

I wanted to know whether malloc/new returns memory blocks from Cache or RAM.
Thanks in advance.
You are abstracted from all that: living as a process in the OS, you only get memory.
You shouldn't ever worry about this; the OS will manage it all for you and the memory unit will move things from one level to another. You still see a single virtual memory layout.
From virtual memory. The OS will take care of bringing the required pages into RAM whenever the process needs them.
malloc and operator new will give you a chunk of address space.
The operating system will back this chunk of address space with some physical storage. The storage could be system memory or a chunk of a page file, and the actual storage location can be moved between the various physical storage devices; this is handled transparently from the application's point of view. In addition, the CPU and memory controller (on-board or otherwise) may cache system memory, but this is usually (largely) transparent to the operating system.
As already said, you can't know. The cache / RAM / hard disk is abstracted as virtual memory. But if you can measure the access time, you may get an idea of whether RAM or the cache is being accessed. After the first access to RAM the memory block will be copied to the cache, and subsequent accesses will be served from the cache.
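A rough way to see that effect (a sketch; the exact numbers depend heavily on the machine, buffer sizes and optimization level) is to warm up a small buffer, push it out of the cache by walking a much larger one, and then time a cold pass and a warm pass over the small buffer:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Time one summing pass over the buffer, in microseconds.
    static long long timed_pass(const std::vector<char>& buf, const char* label)
    {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (char c : buf) sum += c;
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%-6s %8lld us (sum=%lld)\n", label, us, sum);
        return us;
    }

    int main()
    {
        std::vector<char> data(256 * 1024, 1);        // small enough to stay cached
        std::vector<char> evict(64 * 1024 * 1024, 1); // walking this evicts `data`

        timed_pass(evict, "evict");
        timed_pass(data, "cold");   // likely served mostly from RAM
        timed_pass(data, "warm");   // likely served from the cache
    }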
It very much depends. At the start of your program, the memory that the OS gives you will probably not be paged in (at least not on Linux). However, if you free some things and then get a new allocation, the memory could possibly be in the cache.
If there is a constructor which touches the memory, then it will certainly be in the cache after it's constructed.
If you're programming in Java, then you'll likely have a really cool memory allocator, and you're much more likely to be given memory that's in the cache.
(When I say cache, I mean L2. L1 is unlikely, but you might get lucky, especially with small programs.)
You cannot address the processor cache directly; the processor manages it (almost) transparently. At most you can invalidate or prefetch a cache line. But you access memory addresses (usually virtual, unless you're in real mode), and the processor will always feed itself data and instructions from its internal memory (if the data isn't already present, it needs to be fetched first).
Read this article for further info: http://en.wikipedia.org/wiki/Memory_hierarchy
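For completeness, the little control you do have looks like this on GCC/Clang (a sketch; __builtin_prefetch is only a hint to the CPU, and the look-ahead distance of 16 elements is an arbitrary illustrative choice):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Sum an array while hinting the CPU to start loading data a bit ahead.
    long long sum_with_prefetch(const std::vector<long long>& v)
    {
        long long sum = 0;
        for (std::size_t i = 0; i < v.size(); ++i) {
            if (i + 16 < v.size())
                __builtin_prefetch(&v[i + 16], 0 /*read*/, 3 /*keep cached*/);
            sum += v[i];
        }
        return sum;
    }

    int main()
    {
        std::vector<long long> v(1 << 20, 1);
        std::printf("%lld\n", sum_with_prefetch(v));
    }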
First of all, the memory allocated to an application is virtual memory, whose addresses are located in the virtual address space. Secondly, caches such as L1 and L2 will not be allocated to you; they are managed by the system. In fact, if caches were allocated to applications, it would be hard for the system to dispatch tasks.