What is the next step to improve a malloc() algorithm? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I'm writing my own simple malloc() function and I would like to make it faster and more efficient. So far I have written a version that uses a linear search and allocates blocks sequentially and contiguously in memory.
What is the next step to improve this algorithm? What are the main shortcomings of my current version? I would be very grateful for any feedback and recommendation.
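#include <stdbool.h>
#include <stddef.h>
typedef struct heap_block
{
    struct heap_block* next;
    size_t size;
    bool isfree;
} header;
#define Heap_Capacity 100000
static char heap[Heap_Capacity];
size_t heap_size; /* total payload bytes handed out so far */
/* Declared before use so that malloc() compiles. */
void set_data_to_block(header* block, size_t sz);
header* to_end_data(header* block);
void* malloc(size_t sz)
{
    if (sz == 0 || sz > Heap_Capacity) { return NULL; }
    header* block = (header*)heap;
    if (heap_size == 0)
    {
        set_data_to_block(block, sz);
        return (void*)(block + 1);
    }
    /* Linear walk to the last block: O(n) per allocation. */
    while (block->next != NULL) { block = block->next; }
    /* Note: nothing here checks that the new block still fits in heap[]. */
    block->next = (header*)((char*)to_end_data(block) + 8);
    header* new_block = block->next;
    set_data_to_block(new_block, sz);
    return (void*)(new_block + 1);
}
void set_data_to_block(header* block, size_t sz)
{
    block->size = sz;
    block->isfree = false;
    block->next = NULL;
    heap_size += sz;
}
/* Returns the address just past the payload of block. */
header* to_end_data(header* block)
{
    return (header*)((char*)(block + 1) + block->size);
}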

Notice that malloc is often built above lower level memory related syscalls (e.g. mmap(2) on Linux). See this answer which mentions GNU glibc and musl-libc. Look also inside tcmalloc, so study the source code of several free software malloc implementations.
Some general ideas for your malloc:
retrieve memory from the OS using mmap (and eventually release it back to the OS kernel with munmap). You certainly should not allocate a fixed-size heap (since on a 64-bit computer with 128 GB of RAM you would want a malloc of a 10-billion-byte zone to succeed).
segregate small allocations from big ones, so handle differently malloc of 16 bytes from malloc of a megabyte. The typical threshold between small and large allocation is generally a small multiple of the page size (which is often 4Kbytes). Small allocations happen inside pages. Large allocations are rounded to pages. You might even handle very specially malloc of two words (like in many linked lists).
round up the requested size to some fancy number (e.g. powers of 2, or 3 times a power of 2).
manage together memory zones of similar sizes, that is of same "fancy" size.
for small memory zones, avoid reclaiming too early the memory zone, so keep previously free-d zones of same (small) size to reuse them in future calls to malloc.
you might use some tricks on the address (but your system might have ASLR), or keep near each memory zone a word of meta-data describing the chunk of which it is a member.
a significant issue is, given some address previously returned by malloc and argument of free, to find out the allocated size of that memory zone. You could manipulate the address bits, you could store that size in the word before, you could use some hash table, etc. Details are tricky.
Notice that details are tricky, and it might be quite hard to write a malloc implementation better than your system one. In practice, writing a good malloc is not a simple task. You should find many academic papers on the subject.
Look also into garbage collection techniques. Consider perhaps Boehm's conservative GC: you will replace malloc by GC_MALLOC and you won't bother about free... Learn about memory pools.
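To make the "fancy number" rounding above concrete, here is a minimal sketch. The function name round_up_to_size_class and the 16-byte minimum class are my own assumptions for illustration; real allocators pick their own class tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds a request up to the next size class, where the classes are
// powers of two and 3 times a power of two: 16, 24, 32, 48, 64, 96, ...
// The 16-byte minimum is an assumption made for this sketch.
// Returns 0 if the request is too large to round without overflow.
std::size_t round_up_to_size_class(std::size_t n) {
    std::size_t power = 16;
    for (;;) {
        if (n <= power) return power;
        const std::size_t mid = power + power / 2;  // 3 * (power / 2)
        if (n <= mid) return mid;
        if (power > SIZE_MAX / 2) return 0;         // doubling would overflow
        power *= 2;
    }
}
```

With only two classes per power of two, the rounding waste stays below one third of the returned block once requests exceed the minimum class, and zones of the same class can be recycled for each other.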

There are 3 ways to improve:
make it more robust
optimise the way memory is allocated
optimise the code that allocates the memory
Make It More Robust
There are many common programmer mistakes that could be easily detected (e.g. modifying data beyond the end of the allocated block). For a very simple (and relatively fast) example, it's possible to insert "canaries" before and after the allocated block, and detect programmer errors during free and realloc by checking if the canaries are correct (if they aren't then the programmer has trashed something by mistake). This only works "sometimes maybe". The problem is that (for a simple implementation of malloc) the meta-data is mixed in with the allocated blocks; so there's a chance that the meta-data has been corrupted even if the canaries haven't. To fix that it'd be a good idea to separate the meta-data from the allocated blocks. Also, merely reporting "something was corrupted" doesn't help as much as you'd hope. Ideally you'd want to have some sort of identifier for each block (e.g. the name of the function that allocated it) so that if/when problems occur you can report what was corrupted. Of course this could/should maybe be done via a macro, so that those identifiers can be omitted when not debugging.
The main problem here is that the interface provided by malloc is lame and broken - there's simply no way to return acceptable error conditions ("failed to allocate" is the only error it can return) and no way to pass additional information. You'd want something more like int malloc(void **outPointer, size_t size, char *identifier) (with similar alterations to free and realloc to enable them to return a status code and identifier).
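A minimal sketch of the canary idea, layered on top of the system malloc purely for illustration. The names guarded_malloc, guarded_is_intact and guarded_free, the canary value, and the layout are all assumptions of this sketch. Note that it keeps the size next to the block, which is exactly the mixed-in meta-data weakness described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Arbitrary bit pattern chosen for this sketch.
constexpr std::uint64_t kCanary = 0xC0FFEE1234ABCDEFull;

// Layout of one allocation: [GuardHeader][user bytes][back canary].
// alignas keeps the user pointer as aligned as the underlying malloc result.
struct alignas(std::max_align_t) GuardHeader {
    std::size_t size;     // number of user bytes
    std::uint64_t front;  // canary placed before the user bytes
};

void* guarded_malloc(std::size_t size) {
    const std::size_t extra = sizeof(GuardHeader) + sizeof(kCanary);
    if (size > SIZE_MAX - extra) return nullptr;
    auto* raw = static_cast<unsigned char*>(std::malloc(size + extra));
    if (raw == nullptr) return nullptr;
    const GuardHeader header{size, kCanary};
    std::memcpy(raw, &header, sizeof(header));
    std::memcpy(raw + sizeof(header) + size, &kCanary, sizeof(kCanary));
    return raw + sizeof(header);
}

// Returns true if both canaries around the block are still untouched.
bool guarded_is_intact(const void* user) {
    const auto* raw = static_cast<const unsigned char*>(user) - sizeof(GuardHeader);
    GuardHeader header;
    std::memcpy(&header, raw, sizeof(header));
    if (header.front != kCanary) return false;
    std::uint64_t back;
    std::memcpy(&back, raw + sizeof(header) + header.size, sizeof(back));
    return back == kCanary;
}

// Frees the block and reports whether it was intact (a status code,
// which plain free() cannot return).
bool guarded_free(void* user) {
    if (user == nullptr) return true;
    const bool intact = guarded_is_intact(user);
    std::free(static_cast<unsigned char*>(user) - sizeof(GuardHeader));
    return intact;
}
```

A one-byte overrun past the end of the block lands in the back canary, so guarded_free reports it. A write that skips over the canary goes unnoticed, which is the "sometimes maybe" caveat.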
Optimise the Way Memory is Allocated
It's naive to assume that all memory is the same. It's not. Cache locality (including TLB locality) and other cache effects, and things like NUMA optimisation, all matter. For a simple example, imagine you're writing an application that has a structure describing a person (including a hash of their name) and a pointer to a person's name string; and both the structure and the name string are allocated via malloc. The normal end result is that those structures and strings end up mixed together in the heap; so when you're searching through these structures (e.g. trying to find the structure that contains the correct hash) you end up pounding caches and TLBs. To optimise this properly you'd want to ensure that all the structures are close together in the heap. For that to work malloc needs to know the difference between allocating 32 bytes for the structure and allocating 32 bytes for the name string. You need to introduce the concept of "memory pools" (e.g. where everything in "memory pool number 1" is kept close in the heap).
Other important optimisations include "cache colouring" (see http://en.wikipedia.org/wiki/Cache_coloring ). For NUMA systems, it can be important to know the difference between something where max. bandwidth is needed (where using memory from multiple NUMA domains increases bandwidth).
Finally, it'd be nice (to manage heap fragmentation, etc) to use different strategies for "temporary, likely to be freed soon" allocations and longer term allocations (e.g. where it's worth doing a little extra work to minimise fragmentation and wasted space/RAM).
Note: I'd estimate that getting all of this right can mean software running up to 20% faster in specific cases, due to far fewer cache misses, more bandwidth where it's needed, etc.
The main problem here is that the interface provided by malloc is lame and broken - there's simply no way to pass the additional information to malloc in the first place. You'd want something more like int malloc(void **outPointer, size_t size, char *identifier, int pool, int optimisationFlags) (with similar alterations to realloc).
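As a rough illustration of the "memory pool" idea, here is a sketch of a bump-pointer arena: everything allocated from one arena ends up contiguous, so the person structures can live in one arena and the name strings in another. The class name Arena and its interface are assumptions of this sketch, not an existing API, and it deliberately has no per-object free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// One arena = one "memory pool": allocations are handed out back to back.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : base_(static_cast<unsigned char*>(std::malloc(capacity))),
          capacity_(base_ != nullptr ? capacity : 0),
          used_(0) {}
    ~Arena() { std::free(base_); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // align must be a power of two no larger than alignof(std::max_align_t),
    // because offsets are aligned relative to the malloc-ed base.
    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        const std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > capacity_ || size > capacity_ - start) return nullptr;
        used_ = start + size;
        return base_ + start;
    }

private:
    unsigned char* base_;
    std::size_t capacity_;
    std::size_t used_;
};
```

For example, with one Arena for structures and one for names, two consecutive 32-byte structure allocations sit exactly 32 bytes apart no matter how many names are allocated in between.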
Optimise the Code That Allocates the Memory
Given that you can assume memory is used more frequently than it's allocated, this is the least important (e.g. less important than getting things like cache locality for the allocated blocks right).
Quite frankly, anyone that actually wants decent performance or decent debugging shouldn't be using malloc to begin with - generic solutions to specific problems are never ideal. With this in mind (and not forgetting that the interface to malloc is lame and broken and prevents everything that's important anyway) I'd recommend simply not bothering with malloc and creating something that's actually good (but non-standard) instead. For this you can adapt the algorithms used by existing implementations of malloc.

Related

Avoid memory fragmentation when memory pools are a bad idea

I am developing a C++ application that runs endlessly, allocating and freeing millions of strings (char*) over time. RAM usage is a serious consideration in the program, and it gets higher and higher over time. I think the problem is heap fragmentation, and I really need to find a solution.
You can see in the image, after millions of allocation and freeing in the program, the usage is just increasing. And the way I am testing it, I know for a fact that the data it stores is not increasing. I can guess that you will ask, "How are you sure of that?", "How are you sure it's not just a memory leak?", Well.
This test ran much longer. I call malloc_trim(0) whenever possible in my program. It seems the application can finally return the unused memory to the OS, and usage drops almost to zero (the actual data size my program currently holds). This implies the problem is not a memory leak. But I can't rely on this behavior: the allocation and freeing pattern of my program is random, so what if it never releases the memory?
I said memory pools are a bad idea for this project in the title. Of course I don't have absolute knowledge. But the strings I am allocating can be anything between 30-4000 bytes. Which makes many optimizations and clever ideas much harder. Memory pools are one of them.
I am using GCC 11 / G++ 11 as a compiler, so if some old versions have bad allocators, I shouldn't have that problem.
How am I getting memory usage ? Python psutil module. proc.memory_full_info()[0], which gives me RSS.
Of course, you don't know the details of my program. It is still a valid question whether this is indeed caused by heap fragmentation. What I can say is that I keep an up-to-date count of how many allocations and frees have taken place, and I know the element counts of every container in my program. But if you still have some ideas about the causes of the problem, I am open to suggestions.
I can't just allocate, say, 4096 bytes for all the strings so that it would become easier to optimize. That's the opposite of what I am trying to do.
So my question is, what do programmers do(what should I do), in an application where millions of alloc's and free's take place over time, and they are of different sizes so memory pools are hard to use efficiently. I can't change what the program does, I can only change implementation details.
Bounty Edit: When trying to utilize memory pools, isn't it possible to make multiple of them, to the extent that there is a pool for every possible byte count? For example, my strings can be anywhere between 30 and 4000 bytes. So couldn't somebody make 4000 - 30 + 1 = 3971 memory pools, one for each possible allocation size of the program? Isn't this applicable? All pools could start small (so as not to lose much memory), then enlarge, in a balance between performance and memory. I am not trying to make use of a memory pool's ability to reserve big spaces beforehand. I am just trying to effectively reuse freed space, because of frequent allocs and frees.
Last edit: It turns out that the memory growth appearing in the graphs was actually from an HTTP request queue in my program. I failed to see that the hundreds of thousands of tests that I did bloated this queue (something like a webhook). And the reasonable explanation of figure 2 is that I finally got DDoS-banned from the server (or couldn't open a connection anymore for some reason), the queue emptied, and the RAM issue resolved. So anyone reading this question in the future, consider every possibility. It would never have crossed my mind that it was something like this. Not a memory leak, but an implementation detail. Still, I think @Hajo Kirchhoff deserves the bounty; his answer was really enlightening.

If everything really is/works as you say it does and there is no bug you have not yet found, then try this:
malloc and other memory allocation usually uses chunks of 16 bytes anyway, even if the actual requested size is smaller than 16 bytes. So you only need 4000/16 - 30/16 ~ 250 different memory pools.
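const int chunk_size = 16;
// memory_pools is a pool class assumed to be defined elsewhere.
memory_pools pool[250]; // 250 memory pools; pool[idx] manages chunks of '(idx+1)*chunk_size' bytes
char* reserve_mem(size_t sz)
{
    // Round up to the next multiple of chunk_size; sz is assumed to be in 1..4000.
    size_t pool_idx_to_use = (sz + chunk_size - 1) / chunk_size - 1;
    char* rc = pool[pool_idx_to_use].allocate();
    return rc;
}
IOW, you have 250 memory pools. pool[0] allocates and manages chunks with a length of 16 bytes. pool[99] manages chunks of 1600 bytes, etc.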
If you know the length distribution of your strings in advance, you can reserve initial memory for the pools based on this knowledge. Otherwise I'd just probably reserve memory for the pools in 4096 bytes increment.
Because while the malloc C heap usually allocates memory in multiples of 16 bytes, it will (at least under Windows, but I am guessing Linux is similar here) ask the OS for memory, which usually works with 4K pages. IOW, the "outer" memory heap managed by the operating system reserves and frees 4096 bytes.
So increasing your own internal memory pool in 4096 bytes means no fragmentation in the OS app heap. This 4096 page size (or multiple of...) comes from the processor architecture. Intel processors have a builtin page size of 4K (or multiple of). Don't know about other processors, but I suspect similar architectures there.
So, to sum it up:
Use chunks of multiple of 16 bytes for your strings per memory pool.
Use chunks of multiple of 4K bytes to increase your memory pool.
That will align the memory use of your application with the memory management of the OS and avoid fragmentation as much as possible.
From the OS point of view, your application will only increment memory in 4K chunks. That's very easy to allocate and release. And there is no fragmentation.
From the internal (lib) C heap management point of view, your application will use memory pools and waste at most 15 bytes per string. Also all similar length allocations will be heaped together, so also no fragmentation.
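A minimal sketch of one such pool, assuming a free list threaded through the unused chunks and growth in slabs that are a multiple of 4096 bytes. The names ChunkPool and pool_index are made up for this sketch, and it uses malloc for the slabs only to stay portable; a real version would probably ask the OS for pages directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

constexpr std::size_t kChunk = 16;   // chunk granularity from the answer
constexpr std::size_t kPage = 4096;  // growth increment from the answer

// Index of the pool that serves a request of sz bytes:
// pool 0 serves 1..16 bytes, pool 1 serves 17..32 bytes, and so on.
constexpr std::size_t pool_index(std::size_t sz) {
    return sz == 0 ? 0 : (sz + kChunk - 1) / kChunk - 1;
}

// Pool of equally sized chunks; freed chunks are reused before growing.
class ChunkPool {
public:
    explicit ChunkPool(std::size_t chunk_size)
        : chunk_size_((chunk_size + kChunk - 1) / kChunk * kChunk) {
        if (chunk_size_ == 0) chunk_size_ = kChunk;
    }
    ~ChunkPool() { for (void* slab : slabs_) std::free(slab); }
    ChunkPool(const ChunkPool&) = delete;
    ChunkPool& operator=(const ChunkPool&) = delete;

    void* allocate() {
        if (free_list_ == nullptr && !grow()) return nullptr;
        Node* node = free_list_;
        free_list_ = node->next;
        return node;
    }

    // p must have come from allocate() on this same pool.
    void release(void* p) { free_list_ = new (p) Node{free_list_}; }

private:
    struct Node { Node* next; };

    // Adds one slab (a multiple of kPage bytes) and carves it into chunks.
    bool grow() {
        const std::size_t slab_bytes = (chunk_size_ + kPage - 1) / kPage * kPage;
        auto* slab = static_cast<unsigned char*>(std::malloc(slab_bytes));
        if (slab == nullptr) return false;
        slabs_.push_back(slab);
        for (std::size_t off = 0; off + chunk_size_ <= slab_bytes; off += chunk_size_)
            release(slab + off);
        return true;
    }

    std::size_t chunk_size_;
    Node* free_list_ = nullptr;
    std::vector<void*> slabs_;
};
```

Because a released chunk goes to the head of the free list, the next allocation of the same size class reuses it immediately, which is the reuse behavior the question is after.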

Here is a solution which might help if your application has the following traits:
Runs on Windows
Has episodes, where the working set is large, but all that data is released at around the same point in time entirely. (If the data is processed in some kind of batch mode and the work is done when its done and then the related data is freed).
This approach uses a (unique?) Windows feature, called custom heaps. Possibly, you can create yourself something similar on other OS.
The functions you need are in a header called <heapapi.h>.
Here is how it would look like:
At the start of a memory intensive phase of your program, use HeapCreate() to have a heap, where all your data will go.
Perform the memory intensive tasks.
At the end of the memory intensive phase, Free your data and call HeapDestroy().
Depending on the detailed behavior of your application (e.g. whether or not this memory intensive computation runs in 1 or multiple threads), you can configure the heap accordingly and possibly even gain a little speed (If only 1 thread uses the data, you can give HeapCreate() the HEAP_NO_SERIALIZE flag, so it would not take a lock.) You can also give an upper bound of what the heap will allow to store.
And once, your computation is complete, destroying the heap also prevents long term fragmentation, because when the time comes for the next computation phase, you start with a fresh heap.
Here is the documentation of the heap api.
In fact, I found this feature so useful, that I replicated it on FreeBSD and Linux for an application I ported, using virtual memory functions and my own heap implementation. Was a few years back and I do not have the rights for that piece of code, so I cannot show or share.
You could also combine this approach with fixed element size pools, using one heap for one dedicated block size and then expect less/no fragmentation within each of those heaps (because all blocks have same size).
struct EcoString {
    size_t heapIndex;
    size_t capacity;
    char* data;
    char* get() { return data; }
};
struct FixedSizeHeap {
    const size_t SIZE;
    HANDLE m_heap;
    explicit FixedSizeHeap(size_t size)
        : SIZE(size)
        , m_heap(HeapCreate(0, 0, 0))
    {
    }
    // Note: copying a FixedSizeHeap would destroy the same HANDLE twice.
    ~FixedSizeHeap() {
        HeapDestroy(m_heap);
    }
    bool allocString(size_t capacity, EcoString& str) {
        assert(capacity <= SIZE);
        str.capacity = SIZE; // we alloc SIZE bytes anyway...
        str.data = (char*)HeapAlloc(m_heap, 0, SIZE);
        if (nullptr != str.data)
            return true;
        str.capacity = 0;
        return false;
    }
    void freeString(EcoString& str) {
        HeapFree(m_heap, 0, str.data);
        str.data = nullptr;
        str.capacity = 0;
    }
};
struct BucketHeap {
    using Buckets = std::vector<FixedSizeHeap>; // or std::array
    Buckets buckets;
    /*
    (loop for i from 0
          for size = 40 then (+ size (ash size -1))
          while (< size 80000)
          collecting (cons i size))
    ((0 . 40) (1 . 60) (2 . 90) (3 . 135) (4 . 202) (5 . 303) (6 . 454) (7 . 681)
     (8 . 1021) (9 . 1531) (10 . 2296) (11 . 3444) (12 . 5166) (13 . 7749)
     (14 . 11623) (15 . 17434) (16 . 26151) (17 . 39226) (18 . 58839))
    Init Buckets with index (first item) with SIZE (rest item)
    in the above item list.
    */
    // allocate(nbytes) looks for the correct bucket (linear or binary
    // search) and calls allocString() on that bucket. And stores the
    // buckets index in the EcoString.heapIndex field (so free has an
    // easier time).
    // free(EcoString& str) uses heapIndex to find the right bucket and
    // then calls free on that bucket.
};
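The bucket table in the Lisp comment above and the bucket lookup can be sketched in portable C++ as follows. The function names make_bucket_sizes and find_bucket are assumptions of this sketch; the growth rule (each size is the previous one plus half of it, starting at 40 and stopping below 80000) is taken from the comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Builds the bucket sizes 40, 60, 90, 135, ... (each step adds size / 2,
// rounded down), stopping before the limit is reached.
std::vector<std::size_t> make_bucket_sizes(std::size_t first = 40,
                                           std::size_t limit = 80000) {
    std::vector<std::size_t> sizes;
    if (first < 2) return sizes;  // size / 2 would be 0 and never grow
    for (std::size_t size = first; size < limit; size += size >> 1)
        sizes.push_back(size);
    return sizes;
}

// Binary search for the first bucket whose SIZE can hold nbytes.
// Returns sizes.size() when the request is larger than every bucket.
std::size_t find_bucket(const std::vector<std::size_t>& sizes, std::size_t nbytes) {
    const auto it = std::lower_bound(sizes.begin(), sizes.end(), nbytes);
    return static_cast<std::size_t>(it - sizes.begin());
}
```

The index returned by find_bucket is what would be stored in EcoString.heapIndex, so that freeing does not need to search again.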

What you are trying to do is called a slab allocator and it is a very well studied problem with lots of research papers.
You don't need every possible size. Usually slabs come in power of 2 sizes.
Don't reinvent the wheel, there are plenty of opensource implementations of a slab allocator. The Linux kernel uses one.
Start by reading the Wikipedia page on Slab Allocation.

I don't know the details of your program. There are many patterns for working with memory, each of which leads to different results and problems. However, you can take a look at the .NET garbage collector. It uses several generations of objects (similar to pools of memory). Each object in a generation is movable (its address in memory changes when the generation is compacted). This trick makes it possible to compact memory and reduce fragmentation. You can implement a memory manager with similar semantics. It's not necessary to implement a garbage collector at all.

How much overhead is there per single object memory allocation? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by system (or std library?) to support memory management infrastructure? I believe there should be some. Otherwise, how the system would know how many bytes to dispose of when I call free(ptr).
UPDATE 1: It may sound like a 'too broad question' and, obviously, a C/C++ library specific, but what I am interested in, is a minimum extra memory needed to support a single allocation. Even not a system or implementation specific. For example, for binary tree, there is a must of 2 pointers - left and right children, and no way you can squeeze it.
UPDATE 2:
I decided to check it for myself on Windows 64.
#include <stdio.h>
#include <stdlib.h>   // atoi, malloc
#include <conio.h>
#include <windows.h>
#include <psapi.h>
int main(int argc, char *argv[])
{
    int m = (argc > 1) ? atoi(argv[1]) : 1;   // bytes per allocation
    int n = (argc > 2) ? atoi(argv[2]) : 0;   // number of allocations
    for (int i = 0; i < n; i++)
        malloc(m);   // deliberately leaked: only the footprint matters here
    size_t peakKb = 0;
    PROCESS_MEMORY_COUNTERS pmc;
    if ( GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)) )
        peakKb = pmc.PeakWorkingSetSize >> 10;
    printf("requested : %d kb, total: %zu kb\n", (m*n) >> 10, peakKb);
    _getch();
    return 0;
}
requested : 0 kb, total: 2080 kb
1 byte:
requested : 976 kb, total: 17788 kb
extra: 17788 - 2080 - 976 = 14732 (+1410%)
2 bytes:
requested : 1953 kb, total: 17784 kb
extra: 17784 - 2080 - 1953 = 13751 (+605%)
4 bytes:
requested : 3906 kb, total: 17796 kb
extra: 17796 - 2080 - 3906 = 10810 (+177%)
8 bytes:
requested : 7812 kb, total: 17784 kb
extra: 17784 - 2080 - 7812 = 7892 (0%)
UPDATE 3: THIS IS THE ANSWER TO MY QUESTION I’VE BEEN LOOKING FOR: In addition to being slow, the genericity of the default C++ allocator makes it very space inefficient for small objects. The default allocator manages a pool of memory, and such management often requires some extra memory. Usually, the bookkeeping memory amounts to a few extra bytes (4 to 32) for each block allocated with new. If you allocate 1024-byte blocks, the per-block space overhead is insignificant (0.4% to 3%). If you allocate 8-byte objects, the per-object overhead becomes 50% to 400%, a figure big enough to make you worry if you allocate many such small objects.

For allocated objects, no additional metadata is theoretically required. A conforming implementation of malloc could round up all allocation requests to a fixed maximum object size, for example. So for malloc (25), you would actually receive a 256-byte buffer, and malloc (257) would fail and return a null pointer.
More realistically, some malloc implementations encode the allocation size in the pointer itself, either directly using bit patterns corresponding to specific fixed sized classes, or indirectly using a hash table or a multi-level trie. If I recall correctly, the internal malloc for Address Sanitizer is of this type. For such mallocs, at least part of the immediate allocation overhead does not come from the addition of metadata for heap management, but from rounding the allocation size up to a supported size class.
Other mallocs have a per-allocation header of a single word (dlmalloc and its derivatives are popular examples). The actual per-allocation overhead is usually slightly larger because, due to the header word, you get weird supported allocation sizes (such as 24, 40, 56, … bytes with 16-byte alignment on a 64-bit system).
One thing to keep in mind is that many malloc implementations put a lot of data into deallocated objects (which have not yet been returned to the operating system kernel), so that malloc (the function) can quickly find an unused memory region of the appropriate size. Particularly for dlmalloc-style allocators, this also provides a constraint on minimum object size. The use of deallocated objects for heap management contributes to malloc overhead, too, but its impact on individual allocations is difficult to quantify.
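To see where sizes like 24, 40, 56 come from, here is a simplified model of a dlmalloc-style layout. The constants (an 8-byte header word, 16-byte chunk alignment, a 32-byte minimum chunk) are assumptions for a typical 64-bit system, and the function names are made up; a real allocator may differ in detail.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kHeaderBytes = 8;    // assumed: one header word
constexpr std::size_t kChunkAlign = 16;    // assumed: chunk alignment
constexpr std::size_t kMinChunkBytes = 32; // assumed: room for free-list links

// Total bytes a request occupies in this model: the request plus the
// header, rounded up to the alignment, but never below the minimum chunk.
constexpr std::size_t chunk_bytes_for(std::size_t request) {
    const std::size_t rounded =
        (request + kHeaderBytes + kChunkAlign - 1) / kChunkAlign * kChunkAlign;
    return rounded < kMinChunkBytes ? kMinChunkBytes : rounded;
}

// Bytes the caller can actually use in that chunk.
constexpr std::size_t usable_bytes_for(std::size_t request) {
    return chunk_bytes_for(request) - kHeaderBytes;
}
```

In this model malloc(4) occupies 32 bytes, so 28 of them are overhead in the sense of the question: 8 for the header and 20 from rounding up to the minimum chunk.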

Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by system (or std library?) to support memory management infrastructure? I believe there should be some. Otherwise, how the system would know how many bytes to dispose of when I call free(ptr).
This is entirely library specific. The answer could be anything from zero to whatever. Your library could add data to the front of the block. Some add data to the front and back of the block to track overwrites. The amount of overhead added varies among libraries.
The length could be tracked within the library itself with a table. In that case, there may be no hidden field added to the allocated memory.
The library might only allocate blocks in fixed sizes. The amount you ask for gets rounded up to the next block size.

The pointer itself is essentially overhead and can be a dominant driver of memory use in some programs.
The theoretical minimum overhead, might be sizeof(void*) for some theoretical system and use, but that combination of CPU, Memory and usage pattern is so unlikely to exist as to be absolutely worthless for consideration. The standard requires that memory returned by malloc be suitably aligned for any data type, therefore there will always be some overhead; in the form of unused memory between the end of one allocated block, and the beginning of the next allocated block, except in the rare cases where all memory usage is sized to a multiple of the block size.
The minimum implementation of malloc/free/realloc assumes the heap manager has one contiguous block of memory at its disposal, located somewhere in system memory. The pointer that said heap manager uses to reference that original block is overhead (again sizeof(void*)). One could imagine a highly contrived application that requested that entire block of memory, thus avoiding the need for additional tracking data. At this point we have 2 * sizeof(void*) worth of overhead: one internal to the heap manager, plus the returned pointer to the one allocated block (the theoretical minimum). Such a conforming heap manager is unlikely to exist, as it must also have some means of allocating more than one block from its pool, and that implies, at a minimum, tracking which blocks within its pool are in use.
One scheme to avoid overhead involves using pointer sizes that are larger than the physical or logical memory available to the application. One can store some information in those unused bits, but they would also count as overhead if their number exceeds a processor word size. Generally, only a handful of bits are used, and those identify which of the memory manager's internal pools the memory comes from. The latter implies additional overhead of pointers to pools. This brings us to real world systems, where the heap manager implementation is tuned to the OS, hardware architecture and typical usage patterns.
Most hosted implementations (hosted == runs on OS) request one or more memory blocks from the operating system in the c-runtime initialization phase. OS memory management calls are expensive in time and space. The OS has its own memory manager, with its own overhead set, driven by its own design criteria and usage patterns. So c-runtime heap managers attempt to limit the number of calls into the OS memory manager, in order to reduce the latency of the average call to malloc() and free(). Most request the first block from the OS when malloc is first called, but this usually happens at some point in the c-runtime initialization code. This first block is usually a low multiple of the system page size, which can be one or more orders of magnitude larger than the size requested in the initial malloc() call.
At this point it is obvious that heap manager overhead is extremely fluid and difficult to quantify. On a typical modern system, the heap manager must track multiple blocks of memory allocated from the OS, how many bytes are currently allocated to the application in each of those blocks and potentially, how much time has passed since a block went to zero. Then there's the overhead of tracking allocations from within each of those blocks.

Generally malloc rounds up to a minimum alignment boundary, and often this is not special cased for small allocations, as applications are expected to aggregate many of these into a single allocation. The minimum alignment is often based on the largest required alignment for a load instruction in the architecture the code is running on. So with 128-bit SIMD (e.g. SSE or NEON) the minimum is 16 bytes. In practice there is a header as well, which causes the minimum cost in size to be doubled. As SIMD register widths have increased, malloc hasn't increased its guaranteed alignment.
As was pointed out, the minimum possible overhead is 0. Though the pointer itself should probably be counted in any reasonable analysis. In a garbage collector design, at least one pointer to the data has to be present. In a straight non-GC design, one has to have a pointer to call free, but there's not an ironclad requirement that it has to be called. One could theoretically compress a bunch of pointers together into less space as well, but now we're into an analysis of the entropy of the bits in the pointers. Point being, you likely need to specify some more constraints to get a really solid answer.
By way of illustration, if one needs arbitrary allocation and deallocation of just int size, one can allocate a large block and create a linked list of indexes using each int to hold the index of the next. Allocation pulls an item off the list and deallocation adds one back. There is a constraint that each allocation is exactly an int. (And that the block is small enough that the maximum index fits in an int.) Multiple sizes can be handled by having different blocks and searching for which block the pointer is in when deallocation happens. Some malloc implementations do something like this for small fixed sizes such as 4, 8, and 16 bytes.
This approach doesn't hit zero overhead as one needs to maintain some data structure to keep track of the blocks. This is illustrated by considering the case of one-byte allocations. A block can at most hold 256 allocations as that is the maximum index that can fit in the block. If we want to allow more allocations than this, we will need at least one pointer per block, which is e.g. 4 or 8 bytes overhead per 256 bytes.
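The index-linked free list for int-sized allocations described above can be sketched like this. The class name IntPool and its interface are assumptions of this sketch; the only per-block bookkeeping is the head index, and every free slot stores the index of the next free slot in the same int that a live allocation would use for data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pool of int-sized slots with no per-allocation header.
class IntPool {
public:
    explicit IntPool(int capacity)
        : slots_(static_cast<std::size_t>(capacity > 0 ? capacity : 0)),
          free_head_(capacity > 0 ? 0 : -1) {
        // Chain every slot to the next one; -1 marks the end of the list.
        const int n = static_cast<int>(slots_.size());
        for (int i = 0; i < n; ++i) slots_[i] = (i + 1 < n) ? i + 1 : -1;
    }

    // Pops the first free slot, or returns nullptr when the pool is full.
    int* allocate() {
        if (free_head_ < 0) return nullptr;
        const int index = free_head_;
        free_head_ = slots_[index];
        return &slots_[index];
    }

    // Pushes the slot back; p must have come from allocate() on this pool.
    void release(int* p) {
        const int index = static_cast<int>(p - slots_.data());
        *p = free_head_;
        free_head_ = index;
    }

private:
    std::vector<int> slots_;
    int free_head_;  // index of the first free slot, or -1
};
```

Both operations are O(1), and the overhead is one int for the whole block rather than a header per allocation.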
One can also use bitmaps, which amortize to one bit per some granularity plus the quantization of that granularity. Whether this is low overhead or not depends on the specifics. E.g. one bit per byte has no quantization but eats one eighth the allocation size in the free map. Generally this will all require storing the size of the allocation.
In practice allocator design is hard because the trade-off space between size overhead, runtime cost, and fragmentation overhead is complicated, often has large cost differences, and is allocation pattern dependent.

C++ new is 64-byte aligned and equal to cache line size [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Is there any guarantee of alignment of address return by C++'s new operation?
In this program, I am printing each address returned by new for unsigned chars, then deleting them backwards at the end.
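#include "stdafx.h"
#include <stdio.h>    // printf, getchar
#include <stdlib.h>
void func();
int main()
{
    int i = 10;
    // Allocate ten single bytes and print each address (%p is the pointer format).
    while (i-- > 0) printf("loaded %p \n", (void*)(new unsigned char));
    getchar();
    unsigned char *p = new unsigned char; printf("last pointer loaded %p \n", (void*)p);
    i = 10;
    // Relies on the observed 64-byte spacing between allocations; not portable.
    while (i-- > 0) delete (p -= 64);
    getchar();
    p += 640;
    delete p; //nearly forgot to delete this ^^
    return 0;
}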
output:
As you can see, each new returns 64-byte aligned data.
Question: Is this 64-Byte being equal to cache-line size or just a compiler thing?
Question: Should I keep my structures at most 64 bytes long?
Question: Will this be different when I change my CPU, RAM, OS or compiler?
Pentium-m, VC++ 2010 express, windows-xp
Thanks.

The implementation choices for a heap manager make a lot more sense when you consider what happens after a large number of allocations and deallocations.
A call to malloc() needs to locate an unused block of sufficient size to allocate.
The block could be bigger than requested, in which case the allocator can either create a new free block from the difference or waste it. A naive strategy of choosing the block closest in size is called best fit. If it goes on to create new free blocks from the remainder, you could alternatively call it worst leave.
After some use, the best-fit approach results in a large amount of fragmentation, caused by small blocks that are unlikely ever to be allocated again, and the cost of searching the free blocks becomes high.
Consequently, high-performance heap managers don't work like this. Instead they operate as pool allocators for various fixed block sizes. Schemes in which the block sizes are powers of 2 (e.g. 64, 128, 256, 512...) are the norm, although throwing in some intermediates is probably worthwhile too (e.g. 48, 96, 192...). In this scheme, malloc() and free() are both O(1) operations, and the critical sections in allocation are minimal, potentially per pool, which becomes important in a multi-threaded environment.
The wasting of memory in small allocations is a much lesser evil than fragmentation, O(n) alloc/dealloc complexity and poor multi-threaded performance.
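To make the size-class idea concrete, here is a minimal sketch of a pool allocator with one free list per power-of-2 block size. The names (SizeClassPool, class_index, kMinClass) and the choice of four classes are my own assumptions for illustration, not what any particular runtime does. It is single-threaded and takes fresh blocks from std::malloc when a list is empty.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical size classes: powers of two from 64 to 512 bytes.
constexpr std::size_t kMinClass = 64;
constexpr std::size_t kNumClasses = 4;  // 64, 128, 256, 512

// A free block stores the link to the next free block in its own bytes.
struct FreeNode { FreeNode* next; };

class SizeClassPool {
public:
    // Return cached blocks to the system when the pool goes away.
    ~SizeClassPool() {
        for (FreeNode*& head : free_lists_)
            while (head) { FreeNode* next = head->next; std::free(head); head = next; }
    }

    // Map a request to its class index, or kNumClasses if it is too large.
    static std::size_t class_index(std::size_t size) {
        std::size_t block = kMinClass;
        for (std::size_t i = 0; i < kNumClasses; ++i, block <<= 1)
            if (size <= block) return i;
        return kNumClasses;
    }

    static std::size_t class_size(std::size_t index) { return kMinClass << index; }

    void* allocate(std::size_t size) {
        const std::size_t idx = class_index(size);
        if (idx == kNumClasses) return nullptr;   // too large for this sketch
        if (FreeNode* node = free_lists_[idx]) {  // O(1): pop the head of the free list
            free_lists_[idx] = node->next;
            return node;
        }
        return std::malloc(class_size(idx));      // list empty: fetch a fresh block
    }

    // The caller passes the size back; a real allocator would recover it from metadata.
    void deallocate(void* p, std::size_t size) {
        const std::size_t idx = class_index(size);
        assert(p != nullptr && idx < kNumClasses);
        auto* node = static_cast<FreeNode*>(p);   // O(1): push onto the free list
        node->next = free_lists_[idx];
        free_lists_[idx] = node;
    }

private:
    FreeNode* free_lists_[kNumClasses] = {};
};
```

With this sketch a 100-byte request is served from the 128-byte class, so 28 bytes are wasted, but both allocation and deallocation touch only the head of one list. A multi-threaded version would need a lock per list, or per-thread pools.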
The minimum block size with respect to the cache-line size is one of those classic engineering trade-offs, and it's a safe bet that Microsoft did quite a bit of experimentation to arrive at 64 as their minimum. FWIW, I'm pretty sure you'll find the cache-line size of modern CPUs is bigger than that.

how to manage large arrays

I have a C++ program that uses several very large arrays of doubles, and I want to reduce the memory footprint of this particular part of the program. Currently, I'm allocating 100 of them and they can be 100 MB each.
Now, I do have the advantage that parts of these arrays eventually become obsolete during later parts of the program's execution, and there is little need to ever have the whole of any one of them in memory at any one time.
My question is this:
Is there any way of telling the OS, after I have created the array with new or malloc, that a part of it is no longer necessary?
I'm coming to the conclusion that the only way to achieve this is going to be to declare an array of pointers, each of which may point to a chunk of, say, 1 MB of the desired array, so that old chunks that are not needed any more can be reused for new bits of the array. This seems to me like writing a custom memory manager, which does seem like a bit of a sledgehammer and is going to create a bit of a performance hit as well.
I can't move the data in the array because that would cause too many thread contention issues. The arrays may be accessed by any one of a large number of threads at any time, though only one thread ever writes to any given array.

It depends on the operating system. Many POSIX-like systems, including Linux, provide the system call madvise to improve memory performance. From the man page:
The madvise() system call advises the kernel about how to handle paging input/output in the address range beginning at address addr and with size length bytes. It allows an application to tell the kernel how it expects to use some mapped or shared memory areas, so that the kernel can choose appropriate read-ahead and caching techniques. This call does not influence the semantics of the application (except in the case of MADV_DONTNEED), but may influence its performance. The kernel is free to ignore the advice.
See the man page of madvise for more information.
Edit: Apparently, the above description was not clear enough. So, here are some more details, and some of them are specific to Linux.
You can use mmap to allocate a block of memory (directly from the OS instead of the libc), that is not backed by any file. For large chunks of memory, malloc is doing exactly the same thing. You have to use munmap to release the memory - regardless of the usage of madvise:
void* data = ::mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ...
::munmap(data, size);
If you want to get rid of some parts of this chunk, you can use madvise to tell the kernel to do so:
madvise(static_cast<unsigned char*>(data) + 7 * page_size,
        3 * page_size, MADV_DONTNEED);
The address range is still valid, but it is no longer backed, neither by physical RAM nor by storage. If you access the pages later, the kernel will allocate some new pages on the fly and re-initialize them to zero. Be aware that the pages released with MADV_DONTNEED still count towards the virtual memory size of the process. It might be necessary to adjust the virtual memory configuration, e.g. by enabling overcommit.
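Putting the pieces together, here is a small sketch that assumes Linux and a private anonymous mapping. The helper names (page_size, map_pages, release_pages, unmap_pages) are made up for this example; only mmap, madvise, munmap and sysconf are real calls.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Page size reported by the system (commonly 4096, but not guaranteed).
std::size_t page_size() {
    return static_cast<std::size_t>(::sysconf(_SC_PAGESIZE));
}

// Map `pages` pages of zero-filled anonymous memory; returns nullptr on failure.
unsigned char* map_pages(std::size_t pages) {
    void* p = ::mmap(nullptr, pages * page_size(), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : static_cast<unsigned char*>(p);
}

// Tell the kernel that pages [first, first + count) are no longer needed.
// The address range stays valid; on Linux the next access sees zero-filled pages.
bool release_pages(unsigned char* base, std::size_t first, std::size_t count) {
    assert(base != nullptr);
    return ::madvise(base + first * page_size(), count * page_size(),
                     MADV_DONTNEED) == 0;
}

// Give the whole mapping back to the OS.
void unmap_pages(unsigned char* base, std::size_t pages) {
    ::munmap(base, pages * page_size());
}
```

For the question's case, one such mapping per large array would let the writer thread call release_pages on the obsolete part without moving any data, so readers keep using the same addresses.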

It would be easier to answer if we had more details.
1°) The answer to the question "Is there any way of telling the OS after I have created the array with new or malloc that a part of it is unnecessary any more ?" is "not really". That's the point of C and C++, and of any language that lets you handle memory manually.
2°) If you're using C++ and not C, you should not be using malloc.
3°) Nor arrays, unless for a very specific reason. Use a std::vector.
4°) Preferably, if you often need to change the content of the array and reduce the memory footprint, use a linked list (std::list), though it will be more expensive to access individual elements of the list (but almost as fast if you only iterate through it).

A std::deque with pointers to std::array<double,LARGE_NUMBER> may do the job, but you had better build a dedicated container around the deque, so you can remap the indexes and, most importantly, define when entries are no longer used.
The dedicated container can also contain a read/write lock, so it can be used in a thread-safe way.
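A rough sketch of such a dedicated container could look like the following. ChunkedArray and its member names are hypothetical, the chunk size is a template parameter chosen by the caller, and the read/write lock is left out to keep it short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>

// Hypothetical container: a logical array of doubles stored in fixed-size chunks.
// Chunks are allocated on first access and can be released individually.
// No locking is done here; a real version would add the read/write lock.
template <std::size_t ChunkSize>
class ChunkedArray {
    using Chunk = std::array<double, ChunkSize>;

public:
    explicit ChunkedArray(std::size_t size)
        : size_(size), chunks_((size + ChunkSize - 1) / ChunkSize) {}

    // Access element i, allocating its chunk (zero-filled) if it is not present.
    double& at(std::size_t i) {
        assert(i < size_);
        auto& chunk = chunks_[i / ChunkSize];
        if (!chunk) chunk = std::make_unique<Chunk>();  // value-initialized to zeros
        return (*chunk)[i % ChunkSize];
    }

    // Free the chunk holding index i once that part of the array is obsolete.
    void release_chunk_containing(std::size_t i) {
        assert(i < size_);
        chunks_[i / ChunkSize].reset();
    }

    // Number of chunks currently holding memory.
    std::size_t live_chunks() const {
        std::size_t n = 0;
        for (const auto& chunk : chunks_)
            if (chunk) ++n;
        return n;
    }

private:
    std::size_t size_;
    std::deque<std::unique_ptr<Chunk>> chunks_;
};
```

For 1 MB chunks of doubles, ChunkSize would be 131072. Elements never move once their chunk exists, which matters when other threads hold references into the array.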

You could try using lists instead of arrays. Of course a list is 'heavier' than an array, but on the other hand it is easy to restructure a list so that you can throw away a part of it when it becomes obsolete. You could also use a wrapper which would only contain indexes saying which part of the list is up-to-date and which part may be reused.
This will help you improve performance, but will require a little bit more (reusable) memory.

Allocating by chunk and delete[]-ing and new[]-ing along the way seems like a good solution. Aim to do as little memory management yourself as possible: do not reuse chunks yourself, simply deallocate old ones and allocate new chunks when needed.

Extreme memory usage for individual dynamic allocation

Here's a simple test I did on MSVC++ 2010 under Windows 7:
// A struct with sizeof(s) == 4, i.e. 4 bytes
struct s
{
    int x;
};
// Allocate 1 million structs
s* test1 = new s[1000000];
// Memory usage shows that the increase in memory is roughly 4 bytes * 1000000, as expected
// NOW! If I run this:
for (int i = 0; i < 1000000; i++)
    new s();
// The memory usage is disproportionately large. Divided by 1000000, it indicates 64 bytes per s!!!
Is this common knowledge or am I missing something? Before, I always used to create objects on the fly when needed. For example new Triangle() for every triangle in a mesh, etc.
Is there indeed an order-of-magnitude overhead for dynamic memory allocation of individual instances?
Cheers
EDIT:
Just compiled and ran the same program at work on Windows XP using g++:
Now the overhead is 16 bytes, not 64 as observed before! Very interesting.

Not necessarily, but the operating system will usually reserve memory on your behalf in whatever sized chunks it finds convenient; on your system, I'd guess it gives you multiples of 64 bytes per request.
There is an overhead associated with keeping track of the memory allocations, after all, and reserving very small amounts isn't worthwhile.

Is that for a debug build? Because in a debug build MSVC will allocate "guards" around objects to detect writes past your object's boundary.

There is usually overhead with any single memory allocation. Now this is from my knowledge of malloc rather than new but I suspect it's the same.
A section of the memory arena, when carved out for an allocation of (say) 30 bytes, will typically have a header (e.g. 16 bytes; all such figures below are examples only and may differ) and may be padded to a multiple of 16 bytes for easier arena management.
The header is usually important to allow the section to be re-integrated into the free memory pool when you're finished with it.
It contains information about the size of the block at a bare minimum and may have memory guards as well (to detect corruption of the arena).
So, when you allocate your one million structure array, you'll find that it uses an extra 16 bytes for the header (four million and sixteen bytes). When you try to allocate one million individual structures, each and every one of them will have that overhead.
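The arithmetic can be written out as a small sketch. It uses the example figures from above (16-byte header, rounding to a multiple of 16) and assumes the padding applies to header plus payload together; the names kHeader, kAlign and footprint are made up for the illustration, and real allocators will use different numbers.

```cpp
#include <cstddef>

// Example figures only; they are not taken from any real runtime.
constexpr std::size_t kHeader = 16;  // bookkeeping bytes per allocation
constexpr std::size_t kAlign = 16;   // each block is padded to a multiple of this

// Bytes consumed by one allocation: header plus payload, rounded up to kAlign.
constexpr std::size_t footprint(std::size_t payload) {
    return (kHeader + payload + kAlign - 1) / kAlign * kAlign;
}

constexpr std::size_t kCount = 1000000;  // one million structs
constexpr std::size_t kStructSize = 4;   // sizeof(s) in the question

// One array of a million structs pays for the header once.
constexpr std::size_t one_array = footprint(kCount * kStructSize);

// A million separate allocations pay for header and padding every time.
constexpr std::size_t many_small = kCount * footprint(kStructSize);

static_assert(one_array == 4000016, "single array: payload plus one header");
static_assert(many_small == 32000000, "individual allocations: 32 bytes each");
```

Under these assumptions each 4-byte struct costs 32 bytes when allocated on its own, eight times its payload. The 64 and 16 bytes measured in the question would correspond to different header and padding sizes.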
I answered a related question here with more details. I suspect there will be more required header information for C++ since it will probably have to store the number of items over and above the section size (for proper destructor calls) but that's just supposition on my part. It doesn't affect the fact that accounting information of some sort is needed per allocated item.
If you really want to see what the space is being used for, you'll need to dig through the MSVC runtime source code.

You should check the malloc implementation. Probably this will clear things up.
Not sure though if MSVC++'s malloc can be viewed somewhere. If not, look at some other implementation, they are probably similar to some degree.
Don't expect the malloc implementation to be easy. It needs to search for some free space in the allocated virtual pages or allocate a new virtual page. And it must do this fast. As fast as possible. And it must be thread-safe. Maybe your malloc implementation has some sort of bit vector in which it records which 64-byte chunks are free in some page, and it just takes the next free chunk.
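Purely as a guess at what such a bit-vector scheme might look like (this is not MSVC's actual implementation), here is a sketch of one page split into 64-byte chunks with a bit per chunk. BitmapPage, the 4096-byte page size and the other names are assumptions, and no locking is shown.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical layout: one 4096-byte page split into 64-byte chunks.
constexpr std::size_t kPageSize = 4096;
constexpr std::size_t kChunkSize = 64;
constexpr std::size_t kChunks = kPageSize / kChunkSize;  // 64 chunks per page

class BitmapPage {
public:
    // Hand out the first chunk whose bit is clear, or nullptr if the page is full.
    void* allocate() {
        for (std::size_t i = 0; i < kChunks; ++i) {
            if (!used_[i]) {
                used_[i] = true;
                return storage_ + i * kChunkSize;
            }
        }
        return nullptr;
    }

    // Clear the bit of the chunk that p points into; p must come from allocate().
    void deallocate(void* p) {
        const auto offset =
            static_cast<std::size_t>(static_cast<unsigned char*>(p) - storage_);
        used_[offset / kChunkSize] = false;
    }

    std::size_t free_chunks() const { return kChunks - used_.count(); }

private:
    alignas(kChunkSize) unsigned char storage_[kPageSize];
    std::bitset<kChunks> used_;  // bit i set means chunk i is in use
};
```

In a layout like this every allocation, however small, occupies a whole 64-byte chunk, which would match the 64 bytes per object seen in the question.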
