Weird C stack memory overrides - c++

I am implementing a version of malloc and free for practice. So, I have a static char array of a fixed length (10000). Then, I implemented a struct memblock that holds information like size of the block, if it is free...
The way I am implementing malloc is such that I put small blocks (< 8 bytes) to the front of the char array and larger ones to the other end. So, I am basically using two linked lists to link the blocks in front and blocks in back. However, I am having weird problems with initializing the lists (on first call of malloc).
This is my code:
#define MEMSIZE 10000 // this is the maximum size of the char * array
#define BLOCKSIZE sizeof(memblock) // size of the memblock struct
static char memory[MEMSIZE]; // array to store all the memory
static int init; // checks if memory is initialized
static memblock root; // general ptr that deals with both smallroot and bigroot
static memblock smallroot, bigroot; // pointers to storage of small memory blocks and bigger blocks
void initRoots(size_t size, char* fileName, int lineNum)
{
smallroot = (memblock)memory;
smallroot->prev = smallroot->next = 0;
smallroot->size = MEMSIZE - 2 * BLOCKSIZE;
smallroot->isFree = 1;
smallroot->file = fileName;
smallroot->lineNum = lineNum;
bigroot = (memblock)(((char *)memory) + MEMSIZE - BLOCKSIZE - 1);
bigroot->prev = bigroot->next = 0;
bigroot->size = MEMSIZE - 2 * BLOCKSIZE;
bigroot->isFree = 1;
bigroot->file = fileName;
bigroot->lineNum = lineNum;
init = 1;
}
I used GDB to see where I am getting a seg fault. It happens when bigroot->next = 0; is executed. This somehow sets smallroot to 0. Even weirder: if I set bigroot->next = 0x123, then smallroot becomes 0x1. If I set it to 0x1234, smallroot becomes 0x12. It is setting smallroot to the value of bigroot->next with its last two hex digits (the lowest byte) dropped. I really don't understand how this is happening!
This is the definition of memblock:
typedef struct memblock_* memblock;
struct memblock_ {
struct memblock_ *prev, *next; // pointers to next and previous blocks
/* size: size of allocated memory
isFree: 0 if not free, 1 if free
lineNum: line number of user's file where malloc was invoked
*/
size_t size, isFree, lineNum;
char* file; // user's file name where the block was malloced
};

#define BLOCKSIZE sizeof(memblock) // size of the memblock struct
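Since memblock is a typedef for a pointer, that is the size of a pointer, not of the struct. You want:
#define BLOCKSIZE sizeof(struct memblock_) // size of the memblock_ struct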
Also the -1 here is bogus (creates mis-aligned pointer):
bigroot = (memblock)(((char *)memory) + MEMSIZE - BLOCKSIZE - 1);
Actually, I am storing the pointer to the memblock in the memory array. The values of memblock are stored in stack.
No, they are not. The smallroot and bigroot clearly point into the array itself.
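Putting both fixes together, a corrected initialization could look roughly like the sketch below. It is only an illustration: lastHeaderOffset is a hypothetical helper that is not in the question, the fields are filled with placement new instead of the original member-by-member assignment, and the array is declared with alignas so that a header placed at offset 0 is properly aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

struct memblock_ {
    memblock_ *prev, *next;             // neighbouring blocks
    std::size_t size, isFree, lineNum;  // payload size, free flag, caller's line
    const char* file;                   // caller's file name
};
typedef memblock_* memblock;

constexpr std::size_t MEMSIZE = 10000;
constexpr std::size_t BLOCKSIZE = sizeof(memblock_);  // the struct, not the pointer typedef

alignas(memblock_) static char memory[MEMSIZE];
static memblock smallroot, bigroot;

// Hypothetical helper: the largest offset that is a multiple of the struct's
// alignment and still leaves room for one whole header before the array ends.
constexpr std::size_t lastHeaderOffset() {
    return (MEMSIZE - BLOCKSIZE) / alignof(memblock_) * alignof(memblock_);
}

void initRoots(const char* fileName, std::size_t lineNum) {
    assert(lastHeaderOffset() + BLOCKSIZE <= MEMSIZE);  // header fits inside the array
    const std::size_t payload = MEMSIZE - 2 * BLOCKSIZE;
    // Placement new constructs each header inside the char array.
    smallroot = new (memory) memblock_{nullptr, nullptr, payload, 1, lineNum, fileName};
    bigroot = new (memory + lastHeaderOffset())
        memblock_{nullptr, nullptr, payload, 1, lineNum, fileName};
}
```

With the original definitions, the second header started only sizeof(pointer) + 1 bytes before the end of the array, so writing bigroot->next ran past memory and into whatever the linker placed next to it, which in the question happened to be smallroot.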

Related

Create a bitmap at the first block of a char array in c++

Say, I have a char array in c++ with 64 blocks, each block has 64 bytes memory allocated:
char **disk = new char*[64];
for (int i = 0; i < 64; i++) {
disk[i] = new char[64];
}
And I want to set a bitmap in the first block of the char array.
The bitmap only contains integers.
So disk[0] should be the bitmap. The bitmap contains either 1 for occupied or 0 for free, specifying whether each of the remaining blocks in the array is occupied or free, with 1 bit per block.
But I don't know how to implement a bitmap of the specific size that I need, because the bitmap should also be 64 bytes and it has to hold the bits for 64 blocks. How can I achieve this?
This is the project requirement, so...I cannot define a bitmap outside of the array.
Rather than hijacking what part of an array means, you should just make your own type:
struct MyType
{
bitmap_type bitmap; // is this a uint64_t[8]? Or a std::bitset? Or... ?
char data[63][64]; // or whatever dimensions
};
MyType* data = new MyType;
This way all the users of your type know that data->bitmap is the bitmap and data->data is the actual data, rather than having to remember that data[0] is special but data[x] for x>0 is the actual data.
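If the bitmap really has to live inside the array, as the question requires, note that 64 blocks need only 64 bits, i.e. the first 8 bytes of disk[0]. The sketch below shows one way to do that; the names kBlocks, kBlockSize, makeDisk, destroyDisk, setBlockUsed and isBlockUsed are made up for this illustration, and marking block 0 as occupied (because it holds the bitmap) is an assumption about the project rules.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBlocks = 64;     // number of blocks on the "disk"
constexpr std::size_t kBlockSize = 64;  // bytes per block

// Sets or clears the bit for block `index` in the bitmap stored in disk[0].
inline void setBlockUsed(char** disk, std::size_t index, bool used) {
    assert(index < kBlocks);
    unsigned char* bitmap = reinterpret_cast<unsigned char*>(disk[0]);
    const unsigned char mask = static_cast<unsigned char>(1u << (index % 8));
    if (used) {
        bitmap[index / 8] = static_cast<unsigned char>(bitmap[index / 8] | mask);
    } else {
        bitmap[index / 8] = static_cast<unsigned char>(bitmap[index / 8] & ~mask);
    }
}

// Returns true if the bit for block `index` is 1 (occupied).
inline bool isBlockUsed(char* const* disk, std::size_t index) {
    assert(index < kBlocks);
    const unsigned char* bitmap = reinterpret_cast<const unsigned char*>(disk[0]);
    return ((bitmap[index / 8] >> (index % 8)) & 1u) != 0;
}

// Allocates the disk as in the question, zero-filled, and marks block 0
// (the bitmap itself) as occupied.
inline char** makeDisk() {
    char** disk = new char*[kBlocks];
    for (std::size_t i = 0; i < kBlocks; i++) {
        disk[i] = new char[kBlockSize]();  // () zero-initializes the block
    }
    setBlockUsed(disk, 0, true);
    return disk;
}

inline void destroyDisk(char** disk) {
    for (std::size_t i = 0; i < kBlocks; i++) {
        delete[] disk[i];
    }
    delete[] disk;
}
```

The remaining 56 bytes of disk[0] stay unused in this layout.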

c++ insane memory consumption on large file

I am loading a 10GB file into memory and I find that even if I strip away any extra overhead and store the data in nothing but an array it still takes up 53 GB of ram. This seems crazy to me since I am converting some of the text data to longs which take up less room and convert the rest to char * which should take up the same amount of room as a text file. I have about 150M rows of data in the file I am trying to load. Is there any reason why this should take up so much ram when I load it the way I do below?
There are three files here a fileLoader class and its header file and a main that simply runs them.
To answer some questions:
OS is UBUNTU 12.04 64bit
This is on a machine with 64GB of RAM and an SSD that provides 64GB of swap space
I am loading all of the data at once because of the need for speed. It is critical for the application. All sorting, indexing, and lots of the data-intensive work runs on the GPU.
The other reason is that loading all of the data at once made it much simpler for me to write the code. I don't have to worry about indexed files and mappings to locations in another file, for example.
Here is the header file:
#ifndef FILELOADER_H_
#define FILELOADER_H_
#include <iostream>
#include <fstream>
#include <fcntl.h>
#include <unistd.h> // read()
#include <stdlib.h>
#include <string.h>
#include <string>
class fileLoader {
public:
fileLoader();
virtual ~fileLoader();
void loadFile();
private:
long long ** longs;
char *** chars;
long count;
long countLines(std::string inFile);
};
#endif /* FILELOADER_H_ */
Here is the CPP file
#include "fileLoader.h"
fileLoader::fileLoader() {
// TODO Auto-generated constructor stub
this->longs = NULL;
this->chars = NULL;
}
char ** split(char * line,const char * delim,int size){
char ** val = new char * [size];
int i = 0;
bool parse = true;
char * curVal = strsep(&line,delim);
while(parse){
if(curVal != NULL){
val[i] = curVal;
i++;
curVal = strsep(&line,delim);
}else{
parse = false;
}
}
return val;
}
void fileLoader::loadFile(){
const char * fileName = "/blazing/final/tasteslikevictory";
std::string fileString(fileName);
// -1 since there's a header row and we are skipping it
this->count = countLines(fileString) -1;
this->longs = new long long*[this->count];
this->chars = new char **[this->count];
std::ifstream inFile;
inFile.open(fileName);
if(inFile.is_open()){
std::string line;
int i =0;
getline(inFile,line);
while(getline(inFile,line)){
this->longs[i] = new long long[6];
this->chars[i] = new char *[7];
char * copy = strdup(line.c_str());
char ** splitValues = split(copy,"|",13);
this->longs[i][0] = atoll(splitValues[4]);
this->longs[i][1] = atoll(splitValues[5]);
this->longs[i][2] = atoll(splitValues[6]);
this->longs[i][3] = atoll(splitValues[7]);
this->longs[i][4] = atoll(splitValues[11]);
this->longs[i][5] = atoll(splitValues[12]);
this->chars[i][0] = strdup(splitValues[0]);
this->chars[i][1] = strdup(splitValues[1]);
this->chars[i][2] = strdup(splitValues[2]);
this->chars[i][3] = strdup(splitValues[3]);
this->chars[i][4] = strdup(splitValues[8]);
this->chars[i][5] = strdup(splitValues[9]);
this->chars[i][6] = strdup(splitValues[10]);
i++;
delete[] splitValues;
free(copy);
}
}
}
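fileLoader::~fileLoader() {
if(this->longs != NULL){
for(long i = 0; i < this->count; i++){
delete[] this->longs[i]; // each row was allocated with new long long[6]
}
delete[] this->longs;
}
if(this->chars != NULL){
for(long i = 0; i < this->count; i++){
for(int j = 0; j < 7; j++){
free(this->chars[i][j]); // each field came from strdup
}
delete[] this->chars[i]; // each row was allocated with new char *[7]
}
delete[] this->chars;
}
}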
long fileLoader::countLines(std::string inFile){
int BUFFER_SIZE = 16*1024;
int fd = open(inFile.c_str(), O_RDONLY);
if(fd == -1)
return 0;
/* Advise the kernel of our access pattern. */
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); // named constant from <fcntl.h>; on Linux the literal 1 means POSIX_FADV_RANDOM
char buf[BUFFER_SIZE + 1];
long lines = 0;
while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
{
if(bytes_read == (size_t)-1)
return 0;
if (!bytes_read)
break;
for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
++lines;
}
return lines;
}
Here is the file with my main function:
#include "fileLoader.h"
int main()
{
fileLoader loader;
loader.loadFile();
return 0;
}
Here is an example of the data that I am loading:
13|0|1|1997|113|1|4|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
14|0|1|1997|113|1|5|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
15|0|1|1997|113|1|6|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
16|0|1|1997|113|1|7|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
17|0|1|1997|113|1|8|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
18|0|1|1997|113|1|9|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
19|0|1|1997|113|1|10|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
20|0|1|1997|113|1|11|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
21|0|1|1997|113|1|12|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
9|0|1|1997|113|1|13|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
27|0|1|1992|125|1|1|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
28|0|1|1992|125|1|2|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
29|0|1|1992|125|1|3|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
30|0|1|1992|125|1|4|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
31|0|1|1992|125|1|5|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
32|0|1|1992|125|1|6|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
33|0|1|1992|125|1|7|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
34|0|1|1992|125|1|8|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
35|0|1|1992|125|1|9|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
36|0|1|1992|125|1|10|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
37|0|1|1992|125|1|11|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
38|0|1|1992|125|1|12|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
39|0|1|1992|125|1|13|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
40|0|1|1992|125|1|14|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
41|0|1|1992|125|1|15|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
10|0|1|1996|126|1|1||||||
You are allocating nine chunks of memory for each line, so you are allocating a total of 1350 million pieces of memory. These allocations have a certain overhead, usually at least twice the size of a pointer, possibly even more. On a 64 bit machine, that is already 16 bytes, so you get 21.6 GB of overhead.
In addition to that, you get the overhead of heap fragmentation and alignment: Even if you only ever store a string in it, the allocator has to align the memory allocations so that you can store the largest possible values in it without triggering misalignment. Alignment may depend on the vector unit of your CPU, which can require very significant alignments, 16 byte alignment not being uncommon.
Doing the calculation with 16 bytes allocation overhead and 16 bytes alignment, we get allocations of 43.2 GB without the original data. With the original data this calculation is already very close to your measurement.
Each of those objects and strings you create has individual memory management overhead. When you load the string "0" from column 2, it probably takes between two and four full words (possibly more), depending on your memory manager. Call it 16 to 32 bytes of storage to hold a one-byte string. Then you load the "1" from column 3, and so on.
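The arithmetic in the answers above can be written down as a small sketch. The function names are made up for this illustration, and the 16 bytes of bookkeeping and 16 bytes of alignment padding per allocation are the answers' assumptions, not measured values.

```cpp
#include <cassert>
#include <cstdint>

// Total number of separate heap allocations for the whole file.
inline std::uint64_t allocationCount(std::uint64_t rows, std::uint64_t allocsPerRow) {
    return rows * allocsPerRow;
}

// Bytes spent on per-allocation cost alone, excluding the data itself.
inline std::uint64_t overheadBytes(std::uint64_t allocations,
                                   std::uint64_t bytesPerAllocation) {
    // Guard against overflow of the multiplication.
    assert(bytesPerAllocation == 0 || allocations <= UINT64_MAX / bytesPerAllocation);
    return allocations * bytesPerAllocation;
}
```

With 150 million rows and nine allocations per row, allocationCount gives 1,350,000,000 allocations. At 16 bytes each that is 21.6 GB of overhead, and at 32 bytes each (bookkeeping plus padding) it is 43.2 GB, before counting the 10 GB of actual data.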

Error when trying to deallocate pointer of char array: _BLOCK_TYPE_IS_VALID(pHead->nBlockUse)

I'm writing a C++ program that sends and receives images using Boost.Asio.
When compiling I don't get errors, but when executing and having sent an image the program that receives the image crashes giving the following error message (in Visual Studio 2012, Windows 7 32bit):
Debug Assertion Failed:
Program: […]\DataSender.exe
File: f:\dd\vctools\crt_bld\self_x86\crt\src\dbgdel.cpp
Line: 52
Expression: _BLOCK_TYPE_IS_VALID(pHead->nBlockUse)
I read packets of 4096 bytes into a pointer to a char array while there are still incoming bytes to read. In the final iteration, if there are fewer than 4096 bytes left to read, I delete the pointer and create a pointer to a char array the size of the remaining bytes. Up to here it still works.
But when I try to delete the char pointer array again at the end of the loop (in order to create a new char pointer array with the standard size of 4096 for the next incoming images), the program crashes.
Here is my code's excerpt in question:
char* buffer = new char[4096];
[…]
int remainingBytes = imageSize;
[…]
// read data
while( remainingBytes > 0 )
{
boost::system::error_code error;
// use smaller buffer if remaining bytes don't fill the tcp package
// fully
if( remainingBytes < 4096 )
{
delete[] buffer; // this one doesn't give an error
bufferSize = remainingBytes;
char* buffer = new char[bufferSize];
}
// read from socket into buffer
size_t receivedBytes = socket.read_some(
boost::asio::buffer(buffer, bufferSize), error);
remainingBytes -= receivedBytes;
// count total length
totalReceivedBytes += receivedBytes;
// add current buffer to totalBuffer
for( int i = 0; i < bufferSize; i++)
{
totalBuffer.push_back(buffer[i]);
}
// if smaller buffer has been used delete it and
// create usual tcp buffer again
if( receivedBytes < 4096 )
{
delete[] buffer; // here the error occurs
bufferSize = 4096;
char* buffer = new char[bufferSize];
}
}
I ran the same code also on a Debian GNU/Linux 7.2 64bit machine, which returned the following error, at the same position in code:
*** glibc detected *** ./datasender: double free or corruption (!prev): 0x0000000002503970 ***
I assume I'm doing something wrong when deallocating the char pointer array but I haven't figured it out yet.
Can someone point me in the right direction?
You're actually deleting the buffer twice when remainingBytes and receivedBytes are both less than 4096.
In the first if block you delete the outer buffer once and then allocate memory into a new, local buffer, not the outer one.
Then, when you delete buffer in the second if block, you're deleting the same outer buffer a second time. The allocations made inside the if scopes are memory leaks, because those local variables are not the outer buffer.
When you do
char* buffer = new char[bufferSize];
in your if scopes, you're declaring a new variable that shadows the outer one, not assigning to the outer buffer variable. Thus the new allocation leaks, and the outer buffer still points at the memory you just deleted.
Without looking further, you should remove the char* in front of buffer in both if blocks and then continue debugging.
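The difference between declaring and assigning can be seen in isolation in the sketch below. Both functions are hypothetical and only mimic the shape of the if blocks in the question; the shadowed version deliberately does not delete the outer buffer, so that the sketch itself stays free of double deletes.

```cpp
#include <cassert>
#include <cstddef>

// Shape of the bug: `char* buffer = ...` inside the block declares a NEW
// variable that hides the parameter, so the caller's pointer never changes.
inline char* reallocShadowed(char* buffer, std::size_t newSize) {
    assert(newSize > 0);
    if (newSize < 4096) {
        char* buffer = new char[newSize];  // inner variable, shadows the parameter
        delete[] buffer;                   // freed here only so the sketch does not leak
    }
    return buffer;  // still the original pointer
}

// Shape of the fix: no type in front of the name, so this assigns to the
// existing variable after releasing the old allocation.
inline char* reallocAssigned(char* buffer, std::size_t newSize) {
    assert(newSize > 0);
    delete[] buffer;
    buffer = new char[newSize];
    return buffer;
}
```

Compiling with -Wshadow (GCC and Clang) reports the inner declaration in the first function.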
I would use std::vector instead:
#include <vector>
//...
std::vector<char> buffer(remainingBytes);
bufferSize = remainingBytes;
//...
while( remainingBytes > 0 )
{
boost::system::error_code error;
// use smaller buffer if remaining bytes don't fill the tcp package
// fully
if( remainingBytes < 4096 )
{
buffer.resize(remainingBytes);
bufferSize = remainingBytes;
}
// read from socket into buffer
size_t receivedBytes = socket.read_some(
boost::asio::buffer(&buffer[0], bufferSize), error);
remainingBytes -= receivedBytes;
// count total length
totalReceivedBytes += receivedBytes;
// add current buffer to totalBuffer
totalBuffer.insert(totalBuffer.end(), buffer.begin(),
buffer.begin() + receivedBytes);
// if smaller buffer has been used delete it and
// create usual tcp buffer again
if( receivedBytes < 4096 )
{
buffer.resize(4096);
bufferSize = 4096;
}
}
There will be no memory leaks.
Also, I think your code has a bug in that you are supposed to copy only the number of received bytes (the return value of the read_some() function). Instead you assumed that bufferSize characters were returned.

u_int64_t array

I'm trying to do this:
int main(void){
u_int64_t NNUM = 2<<19;
u_int64_t list[NNUM], i;
for(i = 0; i < 4; i++){
list[i] = 999;
}
}
Why am I getting segfault at my Ubuntu 10.10 64 bits (gcc 4.6.1)?
You try to create a very large array on the stack. This leads to a stack overflow.
Try allocating the array on the heap instead. For example:
// Allocate memory
u_int64_t *list = malloc(NNUM * sizeof(u_int64_t));
// work with `list`
// ...
// Free memory again
free(list);
You declare NNUM = 2<<19 == 2 * 2^19 == 2^20 == 1048576,
and try to allocate 64 bits * 1048576 (bits per cell * number of cells) on the stack.
That is 8 MiB (8388608 bytes, about 8.4 MB), which is too much to allocate on the stack. You can allocate it on the heap instead and check the return value of malloc to see whether the allocation succeeded.
heap VS. stack
Your program requires a stack of more than 8388608 bytes (8192 KiB) for the array alone.
If you check with 'ulimit -s' (which reports KiB), the limit is most likely no larger than that.
You can try 'ulimit -s 16384' and then run the program again.
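If the program can be compiled as C++, the heap allocation suggested above can also be expressed with std::vector, which releases the memory automatically. This is only a sketch; kCount and makeList are names invented for the illustration, and uint64_t from <cstdint> stands in for the question's u_int64_t.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kCount = 2ULL << 19;  // 2^20 = 1048576 elements

// Builds the list on the heap; only the small vector object lives on the stack.
inline std::vector<std::uint64_t> makeList() {
    std::vector<std::uint64_t> list(kCount);  // zero-initialized, 8 MiB of heap
    assert(list.size() == kCount);
    for (std::uint64_t i = 0; i < 4; i++) {
        list[i] = 999;
    }
    return list;
}
```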

How to align pointer

How do I align a pointer to a 16 byte boundary?
I found this code, but I'm not sure if it's correct:
char* p= malloc(1024);
if ((((unsigned long) p) % 16) != 0)
{
unsigned char *chpoint = (unsigned char *)p;
chpoint += 16 - (((unsigned long) p) % 16);
p = (char *)chpoint;
}
Would this work?
thanks
C++11 provides std::align (in <memory>), which does just that.
// get some memory
T* const p = ...;
std::size_t const size = ...;
void* start = p;
std::size_t space = size;
void* aligned = std::align(16, 1024, start, space);
if(aligned == nullptr) {
// failed to align
} else {
// here, start is aligned to 16 and points to at least 1024 bytes of memory
// also start == aligned
// size - space is the amount of bytes used for alignment
}
which seems very low-level. I think
// also available in Boost flavour
using storage = std::aligned_storage_t<1024, 16>;
auto p = new storage;
also works. You can easily run afoul of aliasing rules though if you're not careful. If you had a precise scenario in mind (fit N objects of type T at a 16 byte boundary?) I think I could recommend something nicer.
Try this:
It returns aligned memory and frees the memory, with virtually no extra memory management overhead.
#include <stdlib.h> // malloc, free
#include <assert.h>
size_t roundUp(size_t a, size_t b) { return (1 + (a - 1) / b) * b; }
// we assume here that size_t and void* can be converted to each other
void *malloc_aligned(size_t size, size_t align = sizeof(void*))
{
assert(align % sizeof(size_t) == 0);
assert(sizeof(void*) == sizeof(size_t)); // not sure if needed, but whatever
void *p = malloc(size + 2 * align); // allocate with enough room to store the size
if (p != NULL)
{
size_t base = (size_t)p;
p = (char*)roundUp(base, align) + align; // align & make room for storing the size
((size_t*)p)[-1] = (size_t)p - base; // store the size before the block
}
return p;
}
void free_aligned(void *p) { free(p != NULL ? (char*)p - ((size_t*)p)[-1] : p); }
Warning:
I'm pretty sure I'm stepping on parts of the C standard here, but who cares. :P
In glibc, malloc and realloc always return memory aligned to at least 8 bytes (16 bytes on 64-bit systems). If you want to allocate memory with a larger alignment that is a power of 2, you can use memalign or posix_memalign. Read http://www.gnu.org/s/hello/manual/libc/Aligned-Memory-Blocks.html
posix_memalign is one way: http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_memalign.html as long as your alignment is a power of two and a multiple of sizeof(void *).
The problem with the solution you provide is that you run the risk of writing off the end of your allocated memory. An alternative solution is to alloc the size you want + 16 and to use a similar trick to the one you're doing to get a pointer that is aligned, but still falls within your allocated region. That said, I'd use posix_memalign as a first solution.
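A minimal sketch of the posix_memalign route, assuming a POSIX system; allocAligned16 is a name invented for this illustration. The returned pointer is released with plain free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Returns `size` bytes aligned to a 16-byte boundary, or nullptr on failure.
inline void* allocAligned16(std::size_t size) {
    assert(size > 0);
    void* p = nullptr;
    // posix_memalign returns 0 on success and an error number otherwise.
    if (posix_memalign(&p, 16, size) != 0) {
        return nullptr;
    }
    return p;
}
```

Because the allocator itself does the alignment, there is no need to over-allocate or to remember a second pointer.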
Updated: New Faster Algorithm
Don't use modulo with a divisor the compiler cannot see is a power of two, because integer division is among the slowest integer instructions on x86 (tens of clock cycles) and can be slower still on other systems. I came up with a version of std::align that is faster than the GCC and Visual C++ ones. Visual C++ has the slowest implementation, which actually uses an amateurish conditional statement. GCC's is very similar to my algorithm, except that I did the opposite of what they did; my algorithm is 13.3% faster because it has 13 as opposed to 15 single-cycle instructions. See the research paper with disassembly. The algorithm is actually one instruction faster if you use the mask instead of the pow_2.
/* Quickly aligns the given pointer to a power-of-two boundary.
#return An aligned pointer of typename T.
#desc Algorithm is a 2's complement trick that works by masking off
the desired number in 2's complement and adding them to the
pointer. Please note how I took the horizontal comment whitespace back.
#param pointer The pointer to align.
#param mask Mask for the lower LSb, which is one less than the power of
2 you wish to align to. */
template <typename T = char>
inline T* AlignUp(void* pointer, uintptr_t mask) {
intptr_t value = reinterpret_cast<intptr_t>(pointer);
value += (-value) & mask;
return reinterpret_cast<T*>(value);
}
Here is how you call it:
enum { kSize = 256 };
char buffer[kSize + 64]; // extra room: aligning to 64 can skip up to 63 bytes
char* aligned_to_16_byte_boundary = AlignUp<> (buffer, 15); //< 16 - 1 = 15
char16_t* aligned_to_64_byte_boundary = AlignUp<char16_t> (buffer, 63);
Here is the quick bit-wise proof for 3 bits (all values in binary); it works the same for all bit counts:
~000 = 111 => 000 + 111 + 1 = 1000
~001 = 110 => 001 + 110 + 1 = 1000
~010 = 101 => 010 + 101 + 1 = 1000
~011 = 100 => 011 + 100 + 1 = 1000
~100 = 011 => 100 + 011 + 1 = 1000
~101 = 010 => 101 + 010 + 1 = 1000
~110 = 001 => 110 + 001 + 1 = 1000
~111 = 000 => 111 + 000 + 1 = 1000
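The table says that ~x + 1, which is -x in two's complement, is the distance from x up to the next multiple of the power of two once it is masked to the low bits. A small sketch can check this against the modulo formulation; the helper names are invented for this illustration and unsigned arithmetic is used so that the negation is well defined.

```cpp
#include <cassert>
#include <cstdint>

inline bool isPowerOfTwo(std::uintptr_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Padding needed to round `address` up to a multiple of pow2, using the mask trick.
inline std::uintptr_t paddingByMask(std::uintptr_t address, std::uintptr_t pow2) {
    assert(isPowerOfTwo(pow2));
    return (0 - address) & (pow2 - 1);
}

// The same quantity computed with modulo, for comparison.
inline std::uintptr_t paddingByModulo(std::uintptr_t address, std::uintptr_t pow2) {
    assert(pow2 != 0);
    return (pow2 - address % pow2) % pow2;
}

// Returns true if both formulas agree for every address in [0, limit).
inline bool methodsAgree(std::uintptr_t pow2, std::uintptr_t limit) {
    for (std::uintptr_t a = 0; a < limit; a++) {
        if (paddingByMask(a, pow2) != paddingByModulo(a, pow2)) {
            return false;
        }
    }
    return true;
}
```

For example, an address ending in binary 101 needs a padding of 011 to reach the next multiple of 8, which is exactly the complement plus one from the table.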
Just in case you're here to learn how to align to a cache line an object in C++11, use the in-place constructor:
struct Foo { Foo () {} };
Foo* foo = new (AlignUp<Foo> (buffer, 63)) Foo ();
Here is the std::align implementation. It uses 24 instructions where the GCC implementation uses 31 instructions. It can be tweaked to eliminate a decrement instruction by turning (--align) into the mask for the least significant bits, but then it would not be functionally identical to std::align.
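// Caveats: unlike std::align, this does not update ptr, and if offset > space
// the unsigned subtraction below wraps around, so the caller must ensure
// that space >= offset.
inline void* align(size_t align, size_t size, void*& ptr,
size_t& space) noexcept {
intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
offset = (-int_ptr) & (--align);
if ((space -= offset) < size) {
space += offset;
return nullptr;
}
return reinterpret_cast<void*>(int_ptr + offset);
}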
Faster to use mask rather than pow_2
Here is the code for aligning using a mask rather than the pow_2 (the power-of-2 alignment). This is 20% faster than the GCC algorithm but requires you to store the mask rather than the pow_2, so it's not interchangeable.
inline void* AlignMask(size_t mask, size_t size, void*& ptr,
size_t& space) noexcept {
intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
offset = (-int_ptr) & mask;
if ((space -= offset) < size) {
space += offset;
return nullptr;
}
return reinterpret_cast<void*>(int_ptr + offset);
}
A few things:
don't change the pointer returned by malloc/new: you'll need it later to free the memory;
make sure your buffer is big enough after adjusting the alignment;
use uintptr_t (from <stdint.h>) instead of unsigned long for the pointer arithmetic: it is the integer type intended to hold a pointer value, whereas unsigned long is only 32 bits on 64-bit Windows, and even size_t is not guaranteed to match the pointer size.
Here's the code:
size_t size = 1024; // this is how many bytes you need in the aligned buffer
size_t align = 16; // this is the alignment boundary
char *p = (char*)malloc(size + align); // see second point above
// always advances by 1..align bytes, so aligned_p + size stays inside the allocation
char *aligned_p = (char*)((uintptr_t)p + (align - (uintptr_t)p % align));
// use the aligned_p here
// ...
// when you're done, call:
free(p); // see first point above
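In C++, the rule about keeping the original pointer can be enforced by bundling both pointers in one object. The sketch below is only an illustration: AlignedBuffer is a hypothetical type, and unlike the snippet above it adds no padding when malloc already returned an aligned address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Owns a malloc'ed block and exposes an aligned pointer into it.
struct AlignedBuffer {
    void* raw;      // what malloc returned; the only pointer passed to free
    char* aligned;  // first address inside the block that meets the alignment

    AlignedBuffer(std::size_t size, std::size_t align)
        : raw(std::malloc(size + align)), aligned(nullptr) {
        assert(align != 0);
        if (raw != nullptr) {
            const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw);
            const std::uintptr_t pad = (align - addr % align) % align;  // 0..align-1
            aligned = static_cast<char*>(raw) + pad;
        }
    }
    ~AlignedBuffer() { std::free(raw); }

    // Copying would lead to a double free, so it is disabled.
    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;
};
```

The destructor frees raw, never aligned, so the adjusted pointer can be handed around freely while the object is alive.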