C++ new memory allocation fragmentation - c++

I was trying to look at the behavior of the new allocator and why it doesn't place data contiguously.
My code:
struct ci {
char c;
int i;
};
template <typename T>
void memTest()
{
T * pLast = new T();
for(int i = 0; i < 20; ++i) {
T * pNew = new T();
cout << (pNew - pLast) << " ";
pLast = pNew;
}
}
So I ran this with char, int, ci. Most allocations were a fixed length from the last, sometimes there were odd jumps from one available block to another.
sizeof(char) : 1
Average Jump: 64 bytes
sizeof(int): 4
Average Jump: 16
sizeof(ci): 8 (int has to be placed on a 4 byte align)
Average Jump: 9
Can anyone explain why the allocator is fragmenting memory like this? Also, why is the jump for char so much larger than for ints and for a structure that contains both an int and a char?

There are two issues:
most allocators store some additional data prior to the start of the block (typically block size and a couple of pointers)
there are usually alignment requirements - modern operating systems typically allocate to at least an 8 byte boundary.
So you'll nearly always get some kind of gap between successive allocations.
Of course you should never rely on any specific behaviour for something like this, where the implementation is free to do as it pleases.

Your code contains a bug: to get the distance between the pointers in bytes, cast them to (char *) before subtracting; otherwise the deltas are in units of sizeof(T).
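A minimal sketch of that fix, assuming a flat address space. The helper name allocationGaps is made up for illustration, and the addresses are compared as integers instead of being subtracted as pointers:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the signed distance in bytes between successive `new T()` results.
template <typename T>
std::vector<std::ptrdiff_t> allocationGaps(int count) {
    assert(count >= 0);
    std::vector<std::ptrdiff_t> gaps;
    std::vector<T*> blocks;              // kept so everything can be freed afterwards
    T* last = new T();
    blocks.push_back(last);
    for (int i = 0; i < count; ++i) {
        T* next = new T();
        blocks.push_back(next);
        // Compare integer addresses, so the result is in bytes, not in sizeof(T) units.
        std::uintptr_t a = reinterpret_cast<std::uintptr_t>(next);
        std::uintptr_t b = reinterpret_cast<std::uintptr_t>(last);
        gaps.push_back(static_cast<std::ptrdiff_t>(a - b));
        last = next;
    }
    for (T* p : blocks) delete p;
    return gaps;
}
```

Calling allocationGaps<char>(20), allocationGaps<int>(20) and so on gives numbers that can be compared directly across types. The values themselves depend entirely on the allocator in use.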

This isn't fragmentation, it's just rounding up the size of your allocation to a round block size.
In general programming you should not pay attention to the pattern of memory addresses returned by general purpose allocators like new. When you do care about allocation behaviour you should always use a special purpose allocator (boost::pool, something you write yourself, etc.)
The exception is if you are studying allocators, in which case you could do worse than pick up your copy of K&R for a simple allocator which might help you understand how new gets its memory.

In general, you cannot depend on particular memory placement. The memory allocator's internal bookkeeping data and alignment requirements can both affect the placement of blocks. There is no requirement for blocks to be allocated contiguously.
Further, some systems will give you even "stranger" behavior. Many modern Linux systems have heap randomization enabled, where newly-allocated virtual memory pages are placed at random addresses to make certain types of security vulnerability exploits more difficult. With virtual memory, disparate allocated block addresses do not necessarily mean that the physical memory is fragmented, as there is no requirement for the virtual address space to be dense.

For small allocation, boost has a very simple allocator I've used called boost::simple_segregated_storage
It creates a copy of slists of free and used blocks, all the same size. As long as you only allocate to its set block size, you get no external fragmentation (though you can get some internal fragmentation if your block size is bigger than the requested size.) It also runs O(1) if you use it in this manner. Great for small allocation the likes of which are common with template programming.
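To illustrate the idea (this is not Boost's implementation, just a sketch of a fixed-size block pool with a free list threaded through the blocks; FixedBlockPool and its member names are invented for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size block pool: all blocks have the same size and the
// free blocks form a singly linked list stored inside the blocks themselves.
class FixedBlockPool {
public:
    FixedBlockPool(std::size_t blockSize, std::size_t blockCount)
        : unitsPerBlock_((blockSize + sizeof(Unit) - 1) / sizeof(Unit)),
          storage_(unitsPerBlock_ * blockCount) {
        assert(blockSize > 0);
        // Put every block on the free list.
        for (std::size_t i = 0; i < blockCount; ++i) {
            release(&storage_[i * unitsPerBlock_]);
        }
    }

    // O(1): pop the head of the free list; returns nullptr when the pool is exhausted.
    void* acquire() {
        if (freeList_ == nullptr) return nullptr;
        Node* node = freeList_;
        freeList_ = node->next;
        return node;
    }

    // O(1): push the block back onto the free list.
    void release(void* block) {
        assert(block != nullptr);
        freeList_ = new (block) Node{freeList_};
    }

private:
    using Unit = std::max_align_t;       // keeps every block suitably aligned
    struct Node { Node* next; };

    std::size_t unitsPerBlock_;
    std::vector<Unit> storage_;
    Node* freeList_ = nullptr;
};
```

Because every block is the same size, any freed block can satisfy any later request, which is why there is no external fragmentation. The rounding up to a whole number of Unit elements is where the internal fragmentation comes from.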

I don't understand about memory issue of appending string

Runtime error: pointer index expression with base 0x000000000000 overflowed to 0xffffffffffffffff for frequency sort
In first answer of that link, it says that appending char to string can cause memory issue.
string s = "";
char c = 'a';
int max = INT_MAX;
for(int j=0;j<max;j++)
s = s + c;
The answer explains [s=s+c in above code copies the same string again and again so it will cause memory issue.] But I don't understand why that code copies the same string again and again.
Could someone help me understand that part? :)
"I don't understand why that code copies the same string again and again."
Okay, let's look at what happens each time the loop is iterated:
s = s + c;
There are three things the program has to do in order to execute that line of code:
Compute the temporary value s + c -- to do that, the program has to create a temporary, anonymous std::string object, and allocate for it (from the heap) an internal byte-buffer that is at least one byte larger than the number of chars currently in s (so that it can hold all of s's old contents, plus the additional char provided by c)
Set s equal to the temporary-string. In C++03 and earlier, this would be done by reallocating s's internal byte-buffer to be larger, then copying all of the bytes from the temporary-string into s's new/larger buffer. C++11 optimizes this a bit via the new move-assignment operator, so that all the bytes don't have to be copied; rather, s can simply take ownership of the temporary-string's byte-buffer.
Free the temporary string's resources, now that we're done using it. In practice, this takes the form of the std::string class's destructor calling delete[] on the old (no-longer-large-enough) byte-buffer.
Given that the above is going to be performed at least 2 billion times in a loop, it's already quite inefficient.
However, what I think the answer you referred to was particularly concerned about was heap fragmentation. Keep in mind that heap allocation doesn't work by magic; when you (or the std::string class, or anyone) asks to allocate N bytes of memory from the heap, the heap implementation's job is to find N bytes of contiguous memory and return it. And since there is no provision in C++ for moving blocks of memory around (as doing so would invalidate any pointers that the program might have pointing into those blocks of memory), the heap can't create an N-byte contiguous-memory-chunk out of smaller chunks; instead, there has to be a range of contiguous-memory-space already available. For example, it does the heap no good to have a total of 1GB of memory available, if that 1GB of memory is made up of thousands of nonconsecutive 1KB chunks and the caller is asking for a 2KB allocation.
Therefore, the heap's job is to efficiently allocate chunks of memory of the sizes the program requests, and when they are freed again, it will try to glue them back together into larger chunks again if it can, but it may not always be able to. Certain patterns of allocating and freeing memory may result in heap fragmentation, which is simply a large number of discontinuous memory allocations that render the small regions of free memory between them unusable for large allocations.
Whether or not this particular allocate/free pattern would cause that, I'm not sure; given that only one or two buffers are being allocated at a time, the heap may be able to reabsorb them back into adjacent free-memory chunks as they get freed again -- it probably depends on the particular heap algorithm the system is using, as well as on whether any other threads are allocating/freeing heap memory while this is going on. But I wouldn't be too surprised if there are systems out there where it would cause problems (particularly on 16-bit or 32-bit systems where virtual address space is limited, or embedded systems that don't use virtual memory)
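For comparison, a sketch of the same loop written so that no temporary string is created per iteration. The function name repeatChar is just for illustration, and the reserve call is optional:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds a string of `count` copies of `c` without creating a temporary
// std::string on every iteration.
std::string repeatChar(char c, std::size_t count) {
    std::string s;
    s.reserve(count);                    // one allocation up front
    for (std::size_t j = 0; j < count; ++j) {
        s += c;                          // appends in place; no `s + c` temporary
    }
    assert(s.size() == count);
    return s;
}
```

Even without the reserve call, s += c only reallocates when the capacity runs out, and implementations typically grow the capacity geometrically, so far fewer buffers are allocated and freed than with s = s + c.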

C++ allocates more bytes than asked?

int main(int argc, const char* argv[])
{
for(unsigned int i = 0; i < 10000000; i++)
char *c = new char;
cin.get();
}
In the above code, why does my program use 471MB memory instead of 10MB as one would expect?
Allocation of RAM comes from a combined effort of the runtime library and the operating system. In order to identify the one byte your example code requests, there is some structure which identifies this memory to the runtime. It is sometimes a doubly linked list, but it's defined by the operating system and runtime implementation.
You can analogize it this way: If you have a linked list container, what you're interested in is simply what you've placed inside each link, but the container must have pointers to the other links in the containers in order to maintain the linked list.
If you use a debugger, or some other debugging tool to track memory, these structures can be even larger, making each allocation more costly.
RAM isn't typically allocated out of an array, but it is possible to overload the new operator to change allocation behavior. It would be possible to allocate specifically from an array (a large one in your example) so that allocations behave as you seem to have expected, and in some applications this is a specific strategy to control memory and improve performance (though the details are usually more complex than that simple illustration).
The allocation not only contains the allocated memory itself, but at least one word telling delete how much memory it has to release; moreover that is a number that has to be correctly aligned, so there will be a certain padding after the allocated char to ensure that the next block is correctly aligned. On a 64 bit machine, that means at least 16 bytes per allocation (8 bytes to hold the size, 1 byte to hold the character, and 7 bytes padding to ensure correct alignment).
However most probably that's not the only data stored; to help the memory allocator to find free memory, additional data is likely stored; if one assumes that data to consist of three pointers, one gets to a total 40 bytes per allocation, which matches your data quite well.
Note also that the allocator will also request a bit more memory from the operating system than needed for the actual allocation, so that it won't need to do an expensive OS call for every little allocation. That is, the run time library allocates larger chunks of memory from the operating system, and then cuts those in smaller pieces for your program's allocations. Thus generally there will be some memory allocated from the operating system (and thus showing up in the task manager), but not yet allocated to a certain object in your program.
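The arithmetic in the estimate above can be written out as follows. The header sizes and the 8-byte alignment are the assumptions made in the answer, not facts about any particular allocator, and the function names are invented:

```cpp
#include <cassert>
#include <cstddef>

// Rounds `n` up to the next multiple of `align`.
std::size_t roundUp(std::size_t n, std::size_t align) {
    assert(align > 0);
    return (n + align - 1) / align * align;
}

// Estimated footprint of one block: bookkeeping header plus payload, padded
// so that the next block starts on an aligned address.
std::size_t estimatedBlockBytes(std::size_t payload, std::size_t header, std::size_t align) {
    return roundUp(header + payload, align);
}
```

With an 8-byte size field only, estimatedBlockBytes(1, 8, 8) gives 16. Adding three 8-byte pointers to the header, estimatedBlockBytes(1, 8 + 24, 8) gives 40, so ten million one-byte allocations come to roughly 400 MB.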

`std::string` allocations are my current bottleneck - how can I optimize with a custom allocator?

I'm writing a C++14 JSON library as an exercise and to use it in my personal projects.
By using callgrind I've discovered that the current bottleneck during a continuous value creation from string stress test is an std::string dynamic memory allocation. Precisely, the bottleneck is the call to malloc(...) made from std::string::reserve.
I've read that many existing JSON libraries such as rapidjson use custom allocators to avoid malloc(...) calls during string memory allocations.
I tried to analyze rapidjson's source code but the large amount of additional code and comments, plus the fact that I'm not really sure what I'm looking for, didn't help me much.
How do custom allocators help in this situation?
Is a memory buffer preallocated somewhere (where? statically?) and std::strings take available memory from it?
Are strings using custom allocators "compatible" with normal strings?
They have different types. Do they have to be "converted"? (And does that result in a performance hit?)
Code notes:
Str is an alias for std::string.
By default, std::string allocates memory as needed from the same heap as anything that you allocate with malloc or new. To get a performance gain from providing your own custom allocator, you will need to be managing your own "chunk" of memory in such a way that your allocator can deal out the amounts of memory that your strings ask for faster than malloc does. Your memory manager will make relatively few calls to malloc, (or new, depending on your approach) under the hood, requesting "large" amounts of memory at once, then deal out sections of this (these) memory block(s) through the custom allocator. To actually achieve better performance than malloc, your memory manager will usually have to be tuned based on known allocation patterns of your use cases.
This kind of thing often comes down to the age-old trade off of memory use versus execution speed. For example: if you have a known upper bound on your string sizes in practice, you can pull tricks with over-allocating to always accommodate the largest case. While this is wasteful of your memory resources, it can alleviate the performance overhead that more generalized allocation runs into with memory fragmentation, and it makes any calls to realloc essentially constant time for your purposes.
@sehe is exactly right. There are many ways.
EDIT:
To finally address your second question, strings using different allocators can play nicely together, and usage should be transparent.
For example:
#include <iostream>
#include <memory>
#include <string>
class myalloc : public std::allocator<char>{};
myalloc customAllocator;
int main(void)
{
    std::string mystring(customAllocator);
    std::string regularString = "test string";
    mystring = regularString;
    std::cout << mystring;
    return 0;
}
This is a fairly silly example and, of course, uses the same workhorse code under the hood. It shows a string being constructed from an allocator object of a derived class, but note why it compiles: myalloc derives from std::allocator<char>, so the constructor argument is converted to the string's own allocator type, and both objects are plain std::string. Implementing a useful allocator that supplies the full interface required by the STL without just disguising the default std::allocator is not as trivial. This seems to be a decent write up covering the concepts involved. In general, the allocator is a template parameter of std::basic_string, so a string declared with a genuinely different allocator type is a different type from std::string and cannot be assigned to one directly; its characters have to be copied, for example by constructing one string from the other's data() and size(). The STL still does fun things with templates (such as rebind and traits) to homogenize allocator interfaces and tracking.
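For a string whose allocator really is a different type, a minimal sketch could look like this. Arena, ArenaAllocator, ArenaString and toStdString are all hypothetical names; the arena is a fixed 64 KB buffer that never frees individual blocks, which is an assumption made to keep the example short:

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <type_traits>

// Hypothetical bump arena: memory is handed out by advancing an offset and
// reclaimed only all at once via reset().
struct Arena {
    alignas(std::max_align_t) char buffer[64 * 1024];
    std::size_t used = 0;

    void* allocate(std::size_t bytes, std::size_t align) {
        assert(align > 0);
        std::size_t start = (used + align - 1) / align * align;
        if (start + bytes > sizeof(buffer)) throw std::bad_alloc();
        used = start + bytes;
        return buffer + start;
    }
    void reset() { used = 0; }
};

// Minimal C++11 allocator that takes its memory from an Arena.
template <typename T>
struct ArenaAllocator {
    using value_type = T;
    Arena* arena;

    explicit ArenaAllocator(Arena& a) noexcept : arena(&a) {}
    template <typename U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}   // no-op: the arena frees in bulk
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena == b.arena;
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}

using ArenaString = std::basic_string<char, std::char_traits<char>, ArenaAllocator<char>>;
static_assert(!std::is_same<ArenaString, std::string>::value,
              "a different allocator type gives a different string type");

// Conversion to a plain std::string copies the characters.
std::string toStdString(const ArenaString& s) {
    return std::string(s.data(), s.size());
}
```

An ArenaString is created by passing the allocator to the constructor, e.g. ArenaAllocator<char> alloc(arena); ArenaString s(alloc);. The copy in toStdString is the "conversion" cost asked about in the question.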
What often helps is the creation of a GlobalStringTable.
See if you can find portions of the old NiMain library from the now defunct NetImmerse software stack. It contains an example implementation.
Lifetime
What is important to note is that this string table needs to be accessible between different DLL spaces, and that it is not a static object. R. Martinho Fernandes already warned that the object needs to be created when the application or DLL thread is created / attached, and disposed of when the thread is destroyed or the DLL is detached, and preferably before any string object is actually used. This sounds easier than it actually is.
Memory allocation
Once you have a single point of access that exports correctly, you can have it allocate a memory buffer up-front. If the memory is not enough, you have to resize it and move the existing strings over. Strings essentially become handles to regions of memory in this buffer.
Placement new
Something that often works well is the placement new operator, where you can actually specify where in memory your new string object needs to be constructed. Instead of allocating, the operator can simply grab the memory location that is passed in as an argument, zero the memory at that location, and return it. You can also keep track of the allocation, the actual size of the string, etc. in the GlobalStringTable object.
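A small sketch of standard placement new, using a made-up Record type instead of a string to keep it short (constructInPlace and destroyInPlace are illustrative names):

```cpp
#include <cassert>
#include <cstddef>
#include <new>

struct Record {                          // hypothetical payload type
    int id;
    int length;
};

// Constructs a Record inside caller-provided storage instead of asking the heap.
// The caller is responsible for the storage being suitably aligned.
Record* constructInPlace(void* storage, std::size_t storageSize, int id, int length) {
    assert(storage != nullptr && storageSize >= sizeof(Record));
    return new (storage) Record{id, length};   // placement new: no allocation happens
}

// Objects created with placement new are destroyed explicitly; the storage
// itself stays owned by whoever provided it.
void destroyInPlace(Record* r) {
    r->~Record();
}
```

The storage would typically be a slice of the table's preallocated buffer, for instance an alignas(Record) unsigned char array of at least sizeof(Record) bytes.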
SOA
Handling the actual memory scheduling is something that is up to you, but there are many possible ways to approach this. Often, the allocated space is partitioned in several regions so that you have several blocks per possible string size. A block for strings <= 4 bytes, one for <= 8 bytes, and so on. This is called a Small Object Allocator, and can be implemented for any type and buffer.
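The bucket selection could be sketched like this, assuming power-of-two buckets starting at 4 bytes and an arbitrary upper limit of 4096 bytes above which requests would go to the general heap (both the limit and the name sizeClass are assumptions for the example):

```cpp
#include <cassert>
#include <cstddef>

// Maps a requested size to its bucket size: 4, 8, 16, 32, ...
std::size_t sizeClass(std::size_t requested) {
    const std::size_t kMaxSmall = 4096;  // assumed limit for "small" objects
    assert(requested > 0 && requested <= kMaxSmall);
    std::size_t bucket = 4;              // smallest bucket, as in the text
    while (bucket < requested) bucket *= 2;
    return bucket;
}
```

Each bucket would then be served by its own pool of equally sized blocks, so a 5-byte string and an 8-byte string share the 8-byte pool.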
If you expect many string operations where small strings are incremented repeatedly, you may change your strategy and allocate larger buffers from the start, so that the number of memmove operations is reduced. Or you can opt for a different approach and use string streams for those.
String operations
It is not a bad idea to derive from std::basic_string, so that most of the operations still work but the internal storage is actually in the GlobalStringTable, so that you can keep using the same STL conventions. This way, you also make sure that all the allocations are within a single DLL, so that there can be no heap corruption by linking different kinds of strings between different libraries, since all the allocation operations are essentially in your DLL (and are rerouted to the GlobalStringTable object).
Custom allocators can help because most malloc()/new implementations are designed for maximum flexibility, thread-safety and bullet-proof workings. For instance, they must gracefully handle the case that one thread keeps allocating memory, sending the pointers to another thread that deallocates them. Things like these are difficult to handle in a performant way and drive the cost of malloc() calls.
However, if you know that some things cannot happen in your application (like one thread deallocating stuff another thread allocated, etc.), you can optimize your allocator further than the standard implementation. This can yield significant results, especially when you don't need thread safety.
Also, the standard implementation is not necessarily well optimized: implementing void* operator new(size_t size) and void operator delete(void* pointer) by simply calling through to malloc() and free() gives an average performance gain of 100 CPU cycles on my machine, which suggests that the default implementation is suboptimal.
I think you'd be best served by reading up on the EASTL
It has a section on allocators and you might find fixed_string useful.
The best way to avoid a memory allocation is don't do it!
BUT if I remember JSON correctly, all the readStr values get used either as keys or as identifiers, so you will have to allocate them eventually. std::string's move semantics should ensure that the allocated arrays are not copied around but reused until their final use. The default NRVO/RVO/move should reduce any copying of the data, if not of the string header itself.
Method 1:
Pass result as a reference from the caller, which has reserved SomeReasonableLargeValue chars, then clear it at the start of readStr. This is only usable if the caller can actually reuse the string.
Method 2:
Use the stack.
// Reserve memory for the string (BOTTLENECK)
if (end - idx < SomeReasonableValue) { // 32?
char result[SomeReasonableValue] = {0}; // feel free to use std::array if you want bounds checking, but the preceding "if" should ensure it's not a problem.
int ridx = 0;
for(; idx < end; ++idx) {
// Not an escape sequence
if(!isC('\\')) { result[ridx++] = getC(); continue; }
// Escape sequence: skip '\'
++idx;
// Convert escape sequence
result[ridx++] = getEscapeSequence(getC());
}
// Skip closing '"'
++idx;
result[ridx] = 0; // 0-terminated.
// optional assert here to ensure nothing went wrong.
return result; // the bottleneck might now move here as the data is copied to the receiving string.
}
// fallback code only if the string is long.
// Your original code here
Method 3:
If your string by default can allocate some size to fill its 32/64 byte boundary, you might want to try to use that, construct result like this instead in case the constructor can optimize it.
Str result(end - idx, 0);
Method 4:
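Most systems already have some optimized allocator that likes specific block sizes: 16, 32, 64, etc.
auto siz = ((end - idx) & ~0xf) + 16; // next multiple of 16, if the allocator already has chunks of 16 bytes.
Str result(siz, 0); // std::string has no size-only constructor, so pass a fill character as in Method 3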
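Written as a function, the rounding expression from Method 4 behaves as follows (roundToChunk is an illustrative name; the 16-byte chunk size is the answer's assumption):

```cpp
#include <cassert>
#include <cstddef>

// Clear the low four bits, then add 16. The result is always a multiple of 16
// and strictly greater than `length`, so there is room for a terminator.
std::size_t roundToChunk(std::size_t length) {
    std::size_t rounded = (length & ~static_cast<std::size_t>(0xf)) + 16;
    assert(rounded % 16 == 0 && rounded > length);
    return rounded;
}
```

Note that a length that is already a multiple of 16 is bumped to the next chunk: 16 becomes 32, not 16.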
Method 5:
Use an allocator made by Google or Facebook as a global new/delete replacement.
To understand how a custom allocator can help you, you need to understand what malloc and the heap do and why they are quite slow in comparison to the stack.
The Stack
The stack is a large block of memory allocated for your current scope. You can think of it as this
([] means a byte of memory)
[P][][][][][][][][][][][][][][][]
(P is a pointer that points to a specific byte of memory; in this case it's pointing at the first byte)
So the stack is a block with only 1 pointer. When you allocate memory, it simply performs pointer arithmetic on P, which takes constant time.
So declaring int i = 0; would mean advancing P by sizeof(int):
[i][i][i][i][P][][][][][][][][][][][]
(i in [] is a block of memory occupied by an integer)
This is blazing fast and as soon as you go out of scope, the entire chunk of memory is emptied simply by moving P back to the first position.
The Heap
The heap hands out memory from a pool of bytes that the C++ runtime obtains from the operating system. When you call malloc, the heap finds a length of contiguous memory that fits your request, marks it as used so nothing else can use it, and returns it to you as a void*.
So a theoretical heap with little optimization, asked for sizeof(int) bytes, would do this.
Heap chunk
At first : [][][][][][][][][][][][][][][][][][][][][][][][][]
Allocate 4 bytes (sizeof(int)):
A pointer goes through every byte of memory, finds a run that is of the correct length, and returns to you a pointer.
After : [i][i][i][i][][][][][][][][][][][][][][][][][][][][][]
This is not an accurate representation of the heap, but from this you can already see numerous reasons for being slow relative to the stack.
The heap is required to keep track of all already allocated memory and their respective lengths. In our test case above, the heap was already empty and did not require much, but in worst case scenarios, the heap will be populated with multiple objects with gaps in between (heap fragmentation), and this will be much slower.
The heap is required to cycle through all the bytes to find a run that fits your length.
The heap can suffer from fragmentation since it will never completely clean itself unless you specify it. So if you allocated an int, a char, and another int, your heap would look like this
[i][i][i][i][c][i2][i2][i2][i2]
(i stands for bytes occupied by an int and c stands for bytes occupied by a char.) When you de-allocate the char, it will look like this:
[i][i][i][i][empty][i2][i2][i2][i2]
So when you want to allocate another object into the heap,
[i][i][i][i][empty][i2][i2][i2][i2][i3][i3][i3][i3]
unless an object is the size of 1 char, the overall heap size for that allocation is reduced by 1 byte. In more complex programs with millions of allocations and deallocations, the fragmentation issue becomes severe and the program will become unstable.
The heap also has to worry about cases like thread safety (someone else said this already).
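The fragmentation point can be made concrete with a toy model in which the heap is just a map of used and free bytes. This is a deliberate simplification, and the function names are invented:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// `used[i]` is true when byte i of the toy heap is allocated.
std::size_t totalFree(const std::vector<bool>& used) {
    std::size_t count = 0;
    for (bool u : used) {
        if (!u) ++count;
    }
    return count;
}

// Length of the longest run of consecutive free bytes.
std::size_t largestFreeRun(const std::vector<bool>& used) {
    std::size_t best = 0, run = 0;
    for (bool u : used) {
        run = u ? 0 : run + 1;
        if (run > best) best = run;
    }
    return best;
}

// A request can only be satisfied by one contiguous run, however much
// memory is free in total.
bool canAllocate(const std::vector<bool>& used, std::size_t bytes) {
    assert(bytes > 0);
    return largestFreeRun(used) >= bytes;
}
```

For a heap where every other byte is free, totalFree reports half the heap, yet canAllocate fails for anything larger than one byte.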
Custom Heap/Allocator
So, a custom allocator usually needs to address these problems while providing the benefits of the heap, such as personalized memory management and object permanence.
These are usually accomplished with specialized allocators. If you know you don't need to worry about thread safety, or you know exactly how long your strings will be, or you have a predictable usage pattern, you can make your allocator faster than malloc and new by quite a lot.
For example, if your program requires a lot of allocations as fast as possible without lots of deallocations, you could implement a stack allocator, in which you allocate a huge chunk of memory with malloc at startup,
e.g
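#include <cstddef>
typedef char* buffer;
// Super simple example: no alignment handling and no bounds checking.
struct StackAllocator {
    buffer stack;   // start of the chunk obtained from the heap
    char* pointer;  // next free byte
    explicit StackAllocator(std::size_t expectedSize) : stack(new char[expectedSize]), pointer(stack) {}
    ~StackAllocator() { delete[] stack; }
    StackAllocator(const StackAllocator&) = delete;
    StackAllocator& operator=(const StackAllocator&) = delete;
    char* allocate(std::size_t size) { char* returnedPointer = pointer; pointer += size; return returnedPointer; }
    void empty() { pointer = stack; } // releases everything at once
};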
Get expected size, get a chunk of memory from the heap.
Assign a pointer to the beginning.
[P][][][][][][][][][] ..... [].
then have one pointer that moves for each allocation. When you no longer need the memory, you simply move the pointer to the beginning of your buffer. This gives you the advantage of O(1) allocations and deallocations as well as object permanence, at the cost of flexible deallocation and a large initial memory requirement.
For strings, you could try a chunk allocator. For every allocation, the allocator gives a set chunk of memory.
Compatibility
Compatibility with other strings is almost guaranteed. As long as you are allocating a contiguous chunk of memory and preventing anything else from using that block of memory, it will work.

Memory Demands: Heap vs Stack in C++

So I had a strange experience this evening.
I was working on a program in C++ that required some way of reading a long list of simple data objects from file and storing them in the main memory, approximately 400,000 entries. The object itself is something like:
class Entry
{
public:
Entry(int x, int y, int type);
Entry(); ~Entry();
// some other basic functions
private:
int m_X, m_Y;
int m_Type;
};
Simple, right? Well, since I needed to read them from file, I had some loop like
Entry** globalEntries;
globalEntries = new Entry*[totalEntries];
entries = new Entry[totalEntries];// totalEntries read from file, about 400,000
for (int i=0;i<totalEntries;i++)
{
globalEntries[i] = new Entry(.......);
}
That addition to the program added about 25 to 35 megabytes to the program when I tracked it on the task manager. A simple change to stack allocation:
Entry* globalEntries;
globalEntries = new Entry[totalEntries];
for (int i=0;i<totalEntries;i++)
{
globalEntries[i] = Entry(.......);
}
and suddenly it only required 3 megabytes. Why is that happening? I know pointer objects have a little bit of extra overhead to them (4 bytes for the pointer address), but it shouldn't be enough to make THAT much of a difference. Could it be because the program is allocating memory inefficiently, and ending up with chunks of unallocated memory in between allocated memory?
Your code is wrong, or I don't see how this worked. With new Entry [count] you create a new array of Entry (type is Entry*), yet you assign it to Entry**, so I presume you used new Entry*[count].
What you did next was to create another new Entry object on the heap and store it in the globalEntries array. So you need memory for 400,000 pointers + 400,000 elements. 400,000 pointers take about 3 MiB of memory on a 64-bit machine. Additionally, you have 400,000 single Entry allocations, which will all require sizeof (Entry) plus potentially some more memory (for the memory manager: it might have to store the size of the allocation, the associated pool, alignment/padding, etc.). This additional bookkeeping memory can quickly add up.
If you change your second example to:
Entry* globalEntries;
globalEntries = new Entry[count];
for (...) {
globalEntries [i] = Entry (...);
}
memory usage should be equal to the stack approach.
Of course, ideally you'll use a std::vector<Entry>.
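A sketch of the std::vector<Entry> version. The Entry here is a simplified stand-in for the class in the question, and the values passed to emplace_back are placeholders for whatever is read from the file:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Entry {                           // simplified stand-in for the question's class
    int x, y, type;
    Entry(int x_, int y_, int type_) : x(x_), y(y_), type(type_) {}
};

// Stores all entries in one contiguous heap block owned by the vector.
std::vector<Entry> loadEntries(std::size_t totalEntries) {
    std::vector<Entry> entries;
    entries.reserve(totalEntries);       // a single allocation for all elements
    for (std::size_t i = 0; i < totalEntries; ++i) {
        // Placeholder values; the real program would read these from the file.
        entries.emplace_back(static_cast<int>(i), 0, 0);
    }
    assert(entries.size() == totalEntries);
    return entries;
}
```

This gives the same memory layout as new Entry[count], but the vector releases the block automatically.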
First of all, without specifying which column exactly you were watching, the number in task manager means nothing. On a modern operating system it's difficult even to define what you mean by "used memory": are we talking about private pages? The working set? Only the stuff that stays in RAM? Does reserved but not committed memory count? Who pays for memory shared between processes? Are memory mapped files included?
If you are watching some meaningful metric, it's impossible to see 3 MB of memory used - your object is at least 12 bytes (assuming 32 bit integers and no padding), so 400000 elements will need about 4.58 MB. Also, I'd be surprised if it worked with stack allocation - the default stack size in VC++ is 1 MB, you should already have had a stack overflow.
Anyhow, it is reasonable to expect a different memory usage:
the stack is (mostly) allocated right from the beginning, so that's memory you nominally consume even without really using it for anything (actually virtual memory and automatic stack expansion makes this a bit more complicated, but it's "true enough");
the CRT heap is opaque to the task manager: all it sees is the memory given by the operating system to the process, not what the C heap has "really" in use; the heap grows (requesting memory to the OS) more than strictly necessary to be ready for further memory requests - so what you see is how much memory it is ready to give away without further syscalls;
your "separate allocations" method has a significant overhead. The all-contiguous array you'd get with new Entry[size] costs size*sizeof(Entry) bytes, plus the heap bookkeeping data (typically a few integer-sized fields); the separated allocations method costs at least size*sizeof(Entry) (size of all the "bare elements") plus size*sizeof(Entry *) (size of the pointer array) plus size+1 multiplied by the cost of each allocation. If we assume a 32 bit architecture with a cost of 2 ints per allocation, you quickly see that this costs size*24+8 bytes of memory, instead of size*12+8 for the contiguous array in the heap;
the heap normally gives away blocks that aren't really the size you asked for, because it manages blocks of fixed size; so, if you allocate single objects like that you are probably also paying for some extra padding. Supposing it has 16-byte blocks, you are paying 4 bytes extra per element by allocating them separately; this moves our memory estimate to size*28+8, i.e. an overhead of 16 bytes for each 12-byte element.
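The two estimates can be checked with a few lines of arithmetic. All the constants are the assumptions stated in this answer (32-bit target, 8 bytes of bookkeeping per allocation, 16-byte heap blocks), not measurements:

```cpp
#include <cassert>
#include <cstddef>

// Assumptions taken from the estimate above.
const std::size_t kEntryBytes = 12;      // three 32-bit ints
const std::size_t kPointerBytes = 4;     // 32-bit pointers
const std::size_t kHeaderBytes = 8;      // bookkeeping per allocation (2 ints)
const std::size_t kBlockBytes = 16;      // heap block granularity

// One contiguous array: the elements plus a single allocation header.
std::size_t contiguousCost(std::size_t count) {
    return count * kEntryBytes + kHeaderBytes;
}

// One allocation per element plus an array of pointers to them.
std::size_t separateCost(std::size_t count) {
    assert(kBlockBytes >= kEntryBytes);
    std::size_t elements = count * kBlockBytes;        // each 12-byte entry padded to 16
    std::size_t pointers = count * kPointerBytes;      // the Entry* array
    std::size_t headers = (count + 1) * kHeaderBytes;  // count entries + the pointer array
    return elements + pointers + headers;
}
```

For 400,000 entries this gives 4,800,008 bytes for the contiguous array versus 11,200,008 bytes for separate allocations, which is the size*12+8 versus size*28+8 comparison above.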

Is there a memory overhead associated with heap memory allocations (eg markers in the heap)?

Thinking in particular of C++ on Windows using a recent Visual Studio C++ compiler, I am wondering about the heap implementation:
Assuming that I'm using the release compiler, and I'm not concerned with memory fragmentation/packing issues, is there a memory overhead associated with allocating memory on the heap? If so, roughly how many bytes per allocation might this be?
Would it be larger in 64-bit code than 32-bit?
I don't really know a lot about modern heap implementations, but am wondering whether there are markers written into the heap with each allocation, or whether some kind of table is maintained (like a file allocation table).
On a related point (because I'm primarily thinking about standard-library features like 'map'), does the Microsoft standard-library implementation ever use its own allocator (for things like tree nodes) in order to optimize heap usage?
Yes, absolutely.
Every block of memory allocated will have a constant overhead of a "header", as well as a small variable part (typically at the end). Exactly how much that is depends on the exact C runtime library used. In the past, I've experimentally found it to be around 32-64 bytes per allocation. The variable part is to cope with alignment - each block of memory will be aligned to some nice even 2^n base-address - typically 8 or 16 bytes.
I'm not familiar with how the internal design of std::map or similar works, but I very much doubt they have special optimisations there.
You can quite easily test the overhead by:
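char *a, *b;
a = new char;
b = new char;
ptrdiff_t diff = a - b;
// Cast to void* so that the addresses are printed rather than the (uninitialized) characters.
cout << "a=" << static_cast<void*>(a) << " b=" << static_cast<void*>(b) << " diff=" << diff;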
[Note to the pedants, which is probably most of the regulars here: the above a-b expression invokes undefined behaviour, since subtracting the address of one piece of allocated memory from the address of another is undefined behaviour. This is to cope with machines that don't have linear memory addresses, e.g. segmented memory or "different types of data are stored in locations based on their type". The above should definitely work on any x86-based OS that doesn't use a segmented memory model with multiple data segments for the heap, which means it works for Windows and Linux in 32- and 64-bit mode for sure].
You may want to run it with varying types. Just bear in mind that the diff is in "number of the type", so if you make it int *a, *b the diff will be in "four-byte units". You could use reinterpret_cast<char*>(a) - reinterpret_cast<char *>(b); to get the distance in bytes.
[diff may be negative, and if you run this in a loop (without deleting a and b), you may find sudden jumps where one large section of memory is exhausted, and the runtime library allocated another large block]
Visual C++ embeds control information (links/sizes and possibly some checksums) near the boundaries of allocated buffers. That also helps to catch some buffer overflows during memory allocation and deallocation.
On top of that you should remember that malloc() needs to return pointers suitably aligned for all fundamental types (char, int, long long, double, void*, void(*)()) and that alignment is typically of the size of the largest type, so it could be 8 or even 16 bytes. If you allocate a single byte, 7 to 15 bytes can be lost to alignment only. I'm not sure if operator new has the same behavior, but it may very well be the case.
This should give you an idea. The precise memory waste can only be determined from the documentation (if any) or testing. The language standard does not define it in any terms.
Yes. All practical dynamic memory allocators have a minimal granularity [1]. For example, if the granularity is 16 bytes and you request only 1 byte, the whole 16 bytes is allocated nonetheless. If you ask for 17 bytes, a block whose size is 32 bytes is allocated, etc.
There is also a (related) issue of alignment [2].
Quite a few allocators seem to be a combination of a size map and free lists - they split potential allocation sizes to "buckets" and keep a separate free list for each of them. Take a look at Doug Lea's malloc. There are many other allocation techniques with various tradeoffs but that goes beyond the scope here...
[1] Typically 8 or 16 bytes. If the allocator uses a free list then it must encode two pointers inside every free slot, so a free slot cannot be smaller than 8 bytes (on 32-bit) or 16 bytes (on 64-bit). For example, if the allocator tried to split an 8-byte slot to satisfy a 4-byte request, the remaining 4 bytes would not have enough room to encode the free list pointers.
[2] For example, if the long long on your platform is 8 bytes, then even if the allocator's internal data structures can handle blocks smaller than that, actually allocating the smaller block might push the next 8-byte allocation to an unaligned memory address.