Cocos2d-x v3: Does preloadEffect in SimpleAudioEngine cache duplicates? - cocos2d-iphone

Will calling
SimpleAudioEngine::getInstance()->preloadEffect("sound.mp3");
multiple times lead to duplicate entries in the cache, and therefore take up more memory?
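
For reference, here is a minimal sketch of how one could guard against repeated preload calls at the application level. The PreloadGuard wrapper below is an invented helper, not part of cocos2d-x, and it says nothing about whether the engine itself deduplicates internally; that would have to be checked in the CocosDenshion source.
#include <string>
#include <unordered_set>
#include "SimpleAudioEngine.h" // cocos2d-x v3 (CocosDenshion); the exact include path may differ per project

// Hypothetical application-level guard: remember which effects we have already
// asked the engine to preload, so preloadEffect() is called at most once per file.
class PreloadGuard
{
public:
    void preloadOnce(const std::string& path)
    {
        // insert() returns {iterator, bool}; the bool is true only on the first insertion
        if (_preloaded.insert(path).second)
        {
            CocosDenshion::SimpleAudioEngine::getInstance()->preloadEffect(path.c_str());
        }
    }
private:
    std::unordered_set<std::string> _preloaded;
};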

Related

How to abandon (invalidate without saving) a cache line on x86_64?

As I understand it, _mm_clflush() / _mm_clflushopt() invalidates a cache line, writing it back to memory first if it has been modified. Is there a way to simply abandon a cache line, without writing any changes made to it back to memory?
A use case is before freeing memory: I don't need cache lines or their values anymore.
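
For context, here is a minimal sketch of the flush-with-writeback path the question refers to, using the _mm_clflush intrinsic; the helper function, the 64-byte line size, and the idea of flushing right before freeing are assumptions for illustration, and this is not a "discard without writeback" primitive, which is exactly what the question asks about.
#include <immintrin.h> // _mm_clflush, _mm_sfence
#include <cstdint>
#include <cstdlib>

// Write back and invalidate every 64-byte cache line covering buf[0, size),
// then release the memory. 64 bytes is the usual x86-64 line size.
void flush_and_free(void* buf, std::size_t size)
{
    const std::uintptr_t line = 64;
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(buf) & ~(line - 1);
    const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(buf) + size;
    for (std::uintptr_t addr = start; addr < end; addr += line)
    {
        _mm_clflush(reinterpret_cast<const void*>(addr));
    }
    _mm_sfence(); // order the flushes before later stores (required for clflushopt, harmless here)
    std::free(buf);
}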

Linked list optimisation

Maybe a silly question, but what would be faster:
- Deleting an item from a linked list every time an item is gone (whatever that item might be), or
- Just marking the record as dead and overwriting it after some amount of time or under certain conditions?
Would I not use less CPU time by avoiding all the removing and inserting, and just overwriting instead?
The only way to find out is to use a profiler.
Keep in mind that there are other very important factors which you didn't specify in your question. For example, it makes a big difference whether you know which element to delete, i.e. you already have a pointer to it, or whether you have to iterate over the linked list to find it. In the latter case, a vector could actually be several times faster than any linked list (depending on the number of elements) due to caching.
Overwriting is going to be faster. If deleting involves freeing allocated memory back to the heap, then overwriting will definitely be faster; memory allocation is relatively slow. Plus, over time your list memory would get fragmented, which might cause cache misses. Even freeing memory that goes back into a memory manager of some sort would be slower, since you have the overhead of managing the nodes of the list.
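
To make the two options concrete, here is a rough sketch; the Node layout and function names are made up for illustration, and only profiling the real workload can say which approach wins.
// Option A: unlink the node and free it immediately (pays the deallocation cost now).
// Option B: mark it dead and let a later sweep or reuse-on-insert reclaim it (a "tombstone").
struct Node
{
    int value = 0;
    bool dead = false; // used only by option B
    Node* next = nullptr;
};

// Option A: remove the node following 'prev' and free it right away.
void erase_after(Node* prev)
{
    Node* victim = prev->next;
    prev->next = victim->next;
    delete victim; // a heap free on every removal
}

// Option B: only flag the node; traversals skip dead nodes, and a periodic
// sweep (or reuse when inserting) reclaims them in bulk later.
void mark_dead(Node* node)
{
    node->dead = true; // no unlinking, no free
}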

How many cache lines are cached?

OK, so I can't find much in the way of answers to this; it's a simple question about memory management. I know that when a computer pulls from memory it caches 32-64 bytes of memory in a cache line, depending on your processor. My question is: does it only store one cache line's worth of memory, or multiple, and if multiple, how many?
For instance, say we're using C++, and I iterate over a vector<int> in a for loop, then use those integers to pull information out of another vector that is most likely nowhere near it in memory. Would that qualify as two pulls, after which everything is cached, or is that just going to continuously pull from memory? Basically, would it pull the vector<int> and store it in the cache, then pull the other vector and store it as well in a different cache line, thus only pulling twice and reading from cached memory from then on? Assume that each vector is the size of one cache line's worth of data.
EDIT: OK, so along the same lines... I have a second question: let's assume for a moment that my initial vector<int> is iterated over in a for loop which then references multiple other vectors. When it caches those vectors, it obviously has a limited capacity, so it will start overwriting previously cached lines, right? In that case, in what order would it overwrite the previous cache lines: FIFO, FILO, some other way?
There are different types of cache. Generally, the amount of cache depends on the processor. A modern processor has 3 levels of cache, where the fastest (and smallest) is called L1 and usually ranges between 128 KB and 512 KB, and the slowest (and largest) is 1 MB to 4 MB.
Each memory access is 64 bits wide, regardless of the processor architecture. Therefore, accessing memory with 64-bit operands is most efficient.
The cache may store memory from different pages too.
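
To illustrate the access pattern in the question, here is a rough sketch with invented names. Assuming 64-byte lines and 4-byte ints, 16 ints fit per line, so the sequential pass over the first vector touches a new line only every 16 elements, while each dependent lookup into the second vector lands wherever the index points and may touch a different line every time.
#include <cstddef>
#include <vector>

// 'indices' is walked sequentially: roughly one cache-line fill per 16 ints.
// Each lookup into 'table' depends on the value just read, so those accesses
// may be scattered across many different cache lines.
int sum_indirect(const std::vector<int>& indices, const std::vector<int>& table)
{
    int sum = 0;
    for (std::size_t i = 0; i < indices.size(); ++i)
    {
        sum += table[indices[i]]; // potentially a different cache line on every iteration
    }
    return sum;
}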

What is the meaning of locality of a data structure?

I was reading the following article,
What Every Programmer Should Know About Compiler Optimizations
There are other important optimizations that are currently beyond the capabilities of any compiler—for example, replacing an inefficient algorithm with an efficient one, or changing the layout of a data structure to improve its locality.
Does that mean that if I change the sequence (layout) of data members in a class, it can affect performance?
So,
class One
{
    int data0;
    abstract-data-type data1;
};
Differs in performance from
class One
{
    abstract-data-type data0;
    int data1;
};
If this is true, what is the rule of thumb when defining classes or data structures?
Locality in this sense is speaking mostly to cache locality. Writing data structures and algorithms to operate mostly out of cache makes the algorithm run as fast as it possibly can. Cache locality is one of the reasons quick sort is quick.
For a data structure, you want to keep the parts of your data structure that refer to each other relatively close to each other, to avoid flushing out useful cache lines.
Also, you can rearrange your data structure so that the compiler will use the minimum amount of memory required to hold all the members and still efficiently access them. This helps make sure your data structure consumes the minimum number of cache lines.
A single cache line on a current x86-64 architecture (core i7) is 64 bytes.
I am not an expert on data structure locality, but it has to do with how you organize your data so as to avoid the CPU caching bits of memory from all over the address space, which slows down your program by constantly waiting for memory fetches.
For example, a linked list can be scattered all over your memory. However, if you changed this into an array of "elements", they would all be in contiguous memory; this would save memory access time if you needed to traverse the array all at once (it's just one example).
Additionally:
Also be careful with some of the STL containers; again, I am not 100% sure which are the best, but some of them (e.g. list) are quite bad in terms of locality.
Another, perhaps more common, example is an array of pointers, where the pointed-to elements can be scattered around memory.
Of course, you cannot always avoid this easily, because you sometimes need to be able to dynamically add/move/insert/delete elements...
Summary:
It basically means taking care of how you lay out your data with regard to memory access.
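
As a concrete illustration of the contiguous-versus-scattered point above (a sketch with made-up names): summing elements stored by value in a std::vector walks memory linearly, while summing through a vector of individually heap-allocated elements chases whatever addresses the allocator happened to hand out.
#include <memory>
#include <vector>

struct Element
{
    double value;
};

// Contiguous: the elements sit back to back, so cache lines and the hardware
// prefetcher are used efficiently.
double sum_values(const std::vector<Element>& v)
{
    double s = 0.0;
    for (const Element& e : v) s += e.value;
    return s;
}

// Scattered: each element lives in its own heap allocation, so consecutive
// iterations may touch cache lines spread all over memory.
double sum_pointers(const std::vector<std::unique_ptr<Element>>& v)
{
    double s = 0.0;
    for (const auto& p : v) s += p->value;
    return s;
}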
Sort class members by how frequently you will be accessing them. This maximizes the "hotness" of the cache line that contains the head of your class, increasing the likelihood of it remaining cached. Another factor that you care about is packing - due to alignment, rearranging the order in which members are declared could lead to a reduction in the size of your class which would in turn reduce cache pressure.
(None of them are definitive, of course. These rules of thumb aren't a substitute for profiling.)
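
To make the packing point concrete, here is a small sketch (the member types are chosen arbitrarily) showing how declaration order changes the size the compiler needs; this is easy to verify with sizeof.
#include <cstdint>

// On a typical 64-bit platform with 8-byte pointers:
struct Padded // 24 bytes: 1 + 7 padding + 8 + 4 + 4 trailing padding
{
    char flag;
    void* ptr;
    std::int32_t count;
};

struct Packed // 16 bytes: 8 + 4 + 1 + 3 trailing padding
{
    void* ptr;
    std::int32_t count;
    char flag;
};

static_assert(sizeof(Packed) <= sizeof(Padded), "Packed should not be larger than Padded");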

Cache performance degradation due to physical layout of data

Each memory address "maps" to its own cache set in the CPU cache(s), based on a modulo operation on the address.
Is there a way in which accessing two identically-sized arrays, like so:
int* array1; //How does the alignment affect the possibility of cache collisions?
int* array2;
for(int i=0; i<array1.size(); i++){
x = array1[i] * array2[i]; //Can these ever not be loaded in cache at same time?
}
can cause a performance decrease because the elements at array1[i] and array2[i] give the same cache-line modulo result? Or would this actually be a performance increase, because only one cache line would have to be loaded to obtain two data locations?
Would somebody be able to give an example of the above showing performance changes due to cache mappings, including how the alignment of the arrays would affect this?
(The reason for my question is that I am trying to understand when a performance problem occurs due to data alignment/address mappings to the same cache line, which causes one of the pieces of data to not be stored in the cache)
NB: I may have mixed up the terms cache "line" and "set" - please feel free to correct me.
Right now your code doesn't make much sense, as you didn't allocate any memory for the arrays. The pointers are just two uninitialized variables sitting on the stack and pointing at nothing. Also, an int* pointer doesn't have a size() function.
Assuming you fix all that, if you do allocate, you can decide whether to allocate the data contiguously or not. You could allocate 2*N integers for one pointer, and have the other point to the middle of that region.
The main consideration here is this: if the arrays are small enough not to wrap around your desired cache level, having them mapped contiguously will avoid having them share the same cache sets. This may improve performance, since simultaneous accesses to the same sets are often non-optimal due to hardware considerations.
The thrashing consideration (will the two arrays throw each other's lines out of the cache?) is not really a problem, as most caches today enjoy some level of associativity: the arrays can map to the same sets but live in different cache ways. If the arrays together are too big and exceed the total number of ways, then their address range wraps around the cache set mapping several times, in which case it doesn't really matter how they are aligned; you're still going to collide with some lines of the other array.
For example, if you had 4 sets and 2 ways in the cache and tried mapping 2 arrays of 64 ints with an alignment offset, you'd still fill out your entire cache:
        way0          way1
set 0   array1[0]     array2[32]
set 1   array1[16]    array2[48]
set 2   array1[32]    array2[0]
set 3   array1[48]    array2[16]
But as mentioned above, accesses within the same iteration would go to different sets, which may have some benefit.
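
As a sketch of the contiguous-allocation idea mentioned above (N and the names are placeholders): one allocation of 2*N ints, with the second "array" starting at the midpoint, so that as long as the whole block fits in the cache level of interest, the two halves occupy disjoint sets instead of competing for the same ones.
#include <cstddef>
#include <vector>

// One contiguous block of 2*N ints; array1 is the first half, array2 the second.
long long multiply_halves(std::size_t N)
{
    std::vector<int> block(2 * N, 1);
    int* array1 = block.data();
    int* array2 = block.data() + N;

    long long x = 0;
    for (std::size_t i = 0; i < N; ++i)
    {
        x += static_cast<long long>(array1[i]) * array2[i];
    }
    return x;
}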