Fast memcpy for small unaligned data

I need to read a binary file which is made of many basic types such as int, double, UTF8 strings, etc. For instance, think about one file containing n pairs of (int, double) one after the other, without any alignment with n being in the order of tens of millions. I need to get very fast access to that file. I read the file using fread calls and my own buffer which is about 16 kB long.
A profiler shows that my main bottleneck happens to be copying from the memory buffer to its final destination. The most obvious way to write a function that copies from the buffer to a double would be:
// x: a pointer to the final destination of the data
// p: a pointer to the buffer used to read the file
//
void f0(double* x, const unsigned char* p) {
unsigned char* q = reinterpret_cast<unsigned char*>(x);
for (int i = 0; i < 8; ++i) {
q[i] = p[i];
}
}
If I use the following code, I get a huge speedup on x86-64:
void f1(double* x, const unsigned char* p) {
const double* r = reinterpret_cast<const double*>(p);
*x = *r;
}
But, as I understand, the program would crash on ARM if p is not 8-byte aligned.
Here are my questions:
Is the second program guaranteed to work on both x86 and x86-64?
How would you write such a function on ARM if you need it as fast as you can?
Here is a small benchmark to test on your machine
#include <chrono>
#include <cstddef>
#include <iostream>
void copy_int_0(int* x, const unsigned char* p) {
unsigned char* q = reinterpret_cast<unsigned char*>(x);
for (std::size_t i = 0; i < 4; ++i) {
q[i] = p[i];
}
}
void copy_double_0(double* x, const unsigned char* p) {
unsigned char* q = reinterpret_cast<unsigned char*>(x);
for (std::size_t i = 0; i < 8; ++i) {
q[i] = p[i];
}
}
void copy_int_1(int* x, const unsigned char* p) {
*x = *reinterpret_cast<const int*>(p);
}
void copy_double_1(double* x, const unsigned char* p) {
*x = *reinterpret_cast<const double*>(p);
}
int main() {
const std::size_t n = 10000000;
const std::size_t nb_times = 200;
unsigned char* p = new unsigned char[12 * n];
for (std::size_t i = 0; i < 12 * n; ++i) {
p[i] = 0;
}
int* q0 = new int[n];
for (std::size_t i = 0; i < n; ++i) {
q0[i] = 0;
}
double* q1 = new double[n];
for (std::size_t i = 0; i < n; ++i) {
q1[i] = 0.0;
}
const auto begin_0 = std::chrono::high_resolution_clock::now();
for (std::size_t k = 0; k < nb_times; ++k) {
for (std::size_t i = 0; i < n; ++i) {
copy_int_0(q0 + i, p + 12 * i);
copy_double_0(q1 + i, p + 4 + 12 * i);
}
}
const auto end_0 = std::chrono::high_resolution_clock::now();
const double time_0 =
1.0e-9 *
std::chrono::duration_cast<std::chrono::nanoseconds>(end_0 - begin_0)
.count();
std::cout << "Time 0: " << time_0 << " s" << std::endl;
const auto begin_1 = std::chrono::high_resolution_clock::now();
for (std::size_t k = 0; k < nb_times; ++k) {
for (std::size_t i = 0; i < n; ++i) {
copy_int_1(q0 + i, p + 12 * i);
copy_double_1(q1 + i, p + 4 + 12 * i);
}
}
const auto end_1 = std::chrono::high_resolution_clock::now();
const double time_1 =
1.0e-9 *
std::chrono::duration_cast<std::chrono::nanoseconds>(end_1 - begin_1)
.count();
std::cout << "Time 1: " << time_1 << " s" << std::endl;
std::cout << "Prevent optimization: " << q0[0] << " " << q1[0] << std::endl;
delete[] q1;
delete[] q0;
delete[] p;
return 0;
}
The results I get are
clang++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.49403 s
Time 1: 4.01617 s
g++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.65762 s
Time 1: 3.89979 s
icpc -std=c++11 -O3 -xHost copy.cpp -o copy
./copy
Time 0: 8.46155 s
Time 1: 0.0278496 s
I did not check the assembly yet but I guess that the Intel compiler is fooling my benchmark here.

Is the second program guaranteed to work on both x86 and x86-64?
No.
When you dereference a double* the compiler is free to assume that the memory location actually contains a double, which means that it must be aligned to alignof(double).
A lot of x86 instructions are safe to use for unaligned data, but not all of them. Specifically, there are SIMD instructions which require proper alignment which your compiler is free to use.
This isn't just theoretical; LZ4 used to use something very similar to what you posted (it's C, not C++, so it was a C-style cast not reinterpret_cast, but that doesn't really matter), and everything worked as expected. Then GCC 5 was released, and it auto-vectorized the code in question at -O3 using vmovdqa, which requires proper alignment. The end result is that code which worked fine in GCC ≤ 4.9 started crashing at runtime when compiled with GCC ≥ 5.
In other words, even if your program happens to work today, if you depend on unaligned access (or other undefined behavior), it can easily stop working tomorrow. Don't do it.
How would you write such a function on ARM if you need it as fast as you can?
The answer isn't really ARM-specific. After the LZ4 incident Yann Collet (the author of LZ4) did a lot of research to answer this question. There isn't one option which will generate optimal code with every compiler on every architecture.
Using memcpy() is the safest option. If the size is known at compile time the compiler will generally optimize the memcpy() call away… for larger buffers, you can take advantage of that by calling memcpy() in a loop; you'll generally get a loop of fast instructions without the additional overhead of calling memcpy().
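As a minimal sketch of the memcpy() approach (the helper name load_unaligned is hypothetical, not something from LZ4 or from the question), a fixed-size copy can be wrapped in a small template. Because sizeof(T) is a compile-time constant, optimizing compilers typically replace the call with a plain load, though that is not guaranteed:

```cpp
#include <cstring>

// Hypothetical helper: reads a T from a buffer position that may be unaligned.
// std::memcpy has no alignment requirement, so this is well defined for any p
// that points to at least sizeof(T) readable bytes.
template <typename T>
T load_unaligned(const unsigned char* p) {
    T value;
    std::memcpy(&value, p, sizeof(T));
    return value;
}

// The question's copy functions, rewritten on top of the helper.
void copy_int_2(int* x, const unsigned char* p) {
    *x = load_unaligned<int>(p);
}

void copy_double_2(double* x, const unsigned char* p) {
    *x = load_unaligned<double>(p);
}
```

In the question's benchmark these would be called as copy_int_2(q0 + i, p + 12 * i) and copy_double_2(q1 + i, p + 4 + 12 * i).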
If you're feeling more adventurous you can use a packed union to "cast" instead of reinterpret_cast. This is compiler-specific, but when supported it should be safe, and it may be faster than memcpy().
FWIW, I have some code which attempts to find the optimal way to do this depending on various factors (compiler, compiler version, architecture, etc.). It is a bit conservative about platforms I haven't tested, but it should achieve good results on the vast majority of platforms people actually use.

Related

compiler ignores operator new allocation

I am programming a 512-bit integer in C++.
For the integer, I allocate memory from the heap using the new keyword, but the compiler (g++ version 8.1 on MINGW) seems to wrongfully optimize that out.
The compiler commands are:
g++ -Wall -fexceptions -Og -g -fopenmp -std=c++14 -c main.cpp -o main.o
g++ -o bin\Debug\cs.exe obj\Debug\main.o -O0 -lgomp
Code:
#include <iostream>
#include <cstdint>
#include <omp.h>
constexpr unsigned char arr_size = 16;
constexpr unsigned char arr_size_half = 8;
void exit(int);
struct uint512_t{
uint32_t * bytes;
uint512_t(uint32_t num){
//The line below is either (wrongfully) ignored or (wrongfully) optimized out
bytes = new(std::nothrow) uint32_t[arr_size];
if(!bytes){
std::cerr << "Error - not enough memory available.";
exit(-1);
}
*bytes = num;
for(uint32_t * ptr = bytes+1; ptr < ptr+16; ++ptr){
//OS throws error 0xC0000005 (accessing unallocated memory) here
*ptr = 0;
}
}
uint512_t inline operator &(uint512_t &b){
uint32_t* itera = bytes;
uint32_t* iterb = b.bytes;
uint512_t ret(0);
uint32_t* iterret = ret.bytes;
for(char i = 0; i < arr_size; ++i){
*(iterret++) = *(itera++) & *(iterb++);
}
return ret;
}
uint512_t inline operator =(uint512_t &b){
uint32_t * itera=bytes, *iterb=b.bytes;
for(char i = 0; i < arr_size; ++i){
*(itera++) = *(iterb++);
}
return *this;
}
uint512_t inline operator + (uint512_t &b){
uint32_t * itera = bytes;
uint32_t * iterb = b.bytes;
uint64_t res = 0;
uint512_t ret(0);
uint32_t *p2ret = ret.bytes;
uint32_t *p2res = 1+(uint32_t*)&res;
//#pragma omp parallel for shared(p2ret, res, p2res, itera, iterb, ret) private(i, arr_size) schedule(auto)
for(char i = 0; i < arr_size;++i){
res = *p2res;
res += *(itera++);
res += *(iterb++);
*(p2ret++) = (i<15) ? res+*(p2res) : res;
}
return ret;
}
uint512_t inline operator += (uint512_t &b){
uint32_t * itera = bytes;
uint32_t * iterb = b.bytes;
uint64_t res = 0;
uint512_t ret(0);
uint32_t *p2ret = ret.bytes;
uint32_t *p2res = 1+(uint32_t*)&res;
//#pragma omp parallel for shared(p2ret, res, p2res, itera, iterb, ret) private(i, arr_size) schedule(auto)
for(char i = 0; i < arr_size;++i){
res = *p2res;
res += *(itera++);
res += *(iterb++);
*(p2ret++) = (i<15) ? res+(*p2res) : res;
}
(*this) = ret;
return *this;
}
//uint512_t inline operator * (uint512_t &b){
//}
~uint512_t(){
delete[] bytes;
}
};
int main(void){
uint512_t a(3);
}

ptr < ptr+16 is always true. The loop is infinite, and eventually overflows the buffer that it writes to.
Simple solution: Value initialise the array so that you don't need the loop:
bytes = new(std::nothrow) uint32_t[arr_size]();
// ^^
PS. If you copy an instance, the behaviour will be undefined, since the copy would point to same allocation and both instances would attempt to delete it in the destructor.
Simple solution: Don't use bare owning pointers. Use a RAII container such as std::vector if you need to allocate an array dynamically.
PPS. Carefully consider whether you need dynamic allocation (and the associated overhead) in the first place. 512 bits is in many cases a fairly safe size to have in-place.
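To illustrate the last two points, here is a rough sketch of an in-place layout using std::array. The type name uint512 and the member name limbs are made up for this example, and only operator& is shown. With the storage held by value there is nothing to allocate, zero manually, or delete, and copies are safe by default:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical in-place 512-bit integer: 16 limbs of 32 bits each.
struct uint512 {
    std::array<std::uint32_t, 16> limbs{};  // value-initialised, so all limbs start at 0

    explicit uint512(std::uint32_t num) { limbs[0] = num; }

    // Bitwise AND, limb by limb.
    uint512 operator&(const uint512& b) const {
        uint512 ret(0);
        for (std::size_t i = 0; i < limbs.size(); ++i) {
            ret.limbs[i] = limbs[i] & b.limbs[i];
        }
        return ret;
    }
};
```

The implicitly generated copy constructor, copy assignment and destructor do the right thing here, which is the main reason to avoid the bare owning pointer.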

The error is at this line and has nothing to do with new being optimized away:
for(uint32_t * ptr = bytes+1; ptr < ptr+16; ++ptr){
*ptr = 0;
}
The condition for the for is wrong. ptr < ptr+16 will never be false. The loop will go on forever and eventually you will dereference an invalid memory location because ptr gets incremented ad-infinitum.
By the way, the compiler is allowed to perform optimizations but it is not allowed to change the apparent behavior of the program. If your code performs a new, the compiler can optimize it away if it can ensure that the side effects of new are there when you need them (in this case at the moment you access the array).

You are accessing the array out of bounds. The smallest reproducible example would be:
#include <cstdint>
int main() {
uint32_t bytes[16];
for(uint32_t * ptr = bytes + 1; ptr < ptr + 16; ++ptr){
//OS throws error 0xC0000005 (accessing unallocated memory) here
*ptr = 0;
}
}
The ptr < ptr + 16 is always true (maybe except for overflow).

P.S. I tried your solution, and it worked fine:
bytes = new(std::nothrow) uint32_t[arr_size];
if(!bytes){
std::cerr << "Error - not enough memory available.";
exit(-1);
}
*bytes = num;
auto ptrp16 = bytes+16;
for(uint32_t * ptr = bytes+1;ptr < ptrp16 ; ++ptr){
*ptr = 0;
}

Memory Alignment Issues with GCC Vector Extension

I'm trying to use GCC vector extension (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html) to speed up matrix multiplication. The idea is to use SIMD instructions to multiply and add four float numbers at once. A minimal working example is listed below. The example works fine when multiplying a (M=10,K=12) matrix to a (K=12,N=12) matrix. When I change the parameters (say N=9), however, I get a segmentation fault.
I suspect this is due to memory alignment issues. In my understanding, when using SIMD for a vector which is 16 bytes (in this case float4), the target memory address should be a multiple of 16. There are already discussions on memory alignment issues with SIMD instructions (e.g. Relationship between SSE vectorization and Memory alignment). In the example below, when &b(0,0) is 0x810e10, &b(1,0) is 0x810e34, which is not a multiple of 16.
My questions are,
Is it true that I'm getting the segfault for the memory alignment issues?
Can anyone tell me how to fix the problem easily? I've thought of using a two-dimensional array instead of one array, but I don't want to do this so as not to change the rest of the codes.
Minimal Working Example
#include <iostream>
#include <cstdlib>
#include <stdio.h>
#include <cstring>
#include <assert.h>
#include <algorithm>
using namespace std;
typedef float float4 __attribute__((vector_size (16)));
static inline void * alloc64(size_t sz) {
void * a = 0;
if (posix_memalign(&a, 64, sz) != 0) {
perror("posix_memalign");
exit(1);
}
return a;
}
struct Mat {
size_t m,n;
float * a;
Mat(size_t m_, size_t n_, float f) {
m = m_;
n = n_;
a = (float*) malloc(sizeof(float) * m * n);
fill(a,a + m * n,f);
}
/* a(i,j) */
float& operator()(long i, long j) {
return a[i * n + j];
}
};
Mat operator* (Mat a, Mat b) {
Mat c(a.m, b.n,0);
assert(a.n == b.m);
for (long i = 0; i < a.m; i++) {
for(long k = 0; k < a.n; k++){
float aa = a(i,k);
float4 a4 = {aa,aa,aa,aa};
long j;
for (j = 0; j <= b.n-4; j+=4) {
*((float4 *)&c(i,j)) = *((float4 *)&c(i,j)) + a4 * (*(float4 *)&b(k,j));
}
while(j < b.n){
c(i,j) += aa * b(k,j);
j++;
}
}
}
return c;
}
const int M = 10;
const int K = 12;
const int N = 12;
int main(){
Mat a(M,K,1);
Mat b(K,N,1);
Mat c = a * b;
for(int i = 0; i < M; i++){
for(int j = 0; j < N; j++)
cout << c(i,j) << " ";
cout << endl;
}
cout << endl;
}

In my understanding, when using SIMD for a vector which is 16 bytes (in
this case float4), the target memory address should be a multiple of
16.
That is incorrect on x64 processors. There are instructions that require alignment, but you can perfectly well write and read SIMD registers from unaligned memory locations without penalty and with absolute safety using the right instructions.
Is it true that I'm getting the segfault for the memory alignment
issues?
Yes.
But it is not related to SIMD instructions. In C/C++, it is undefined behavior to write *((float4 *)&c) = ... the way you do, and can certainly crash, but you can reproduce the problem without vectorization... Given the right circumstances, the following basic code will crash...
char * c = ...
*(int *) c = 1;
Can anyone tell me how to fix the problem easily? I've thought of
using a two-dimensional array instead of one array, but I don't want
to do this so as not to change the rest of the codes.
The typical workaround is to use memcpy. Let us look at a code example...
#include <string.h>
typedef float float4 __attribute__((vector_size (16)));
void writeover(float * x, float4 y) {
*(float4 * ) x = y;
}
void writeover2(float * x, float4 y) {
memcpy(x,&y,sizeof(y));
}
With, say, clang++, these two functions get compiled to vmovaps and vmovups respectively. The two instructions do the same job, but the first one will crash if your pointer is not aligned on sizeof(float4). Both are very fast on recent hardware.
The point is that you can often rely on memcpy to generate code that is nearly optimally fast. Of course, the amount of overhead you get (if any) will depend on the compiler you are using.
If you do get performance problems, then you can use Intel intrinsics or assembly instead... but chances are good that memcpy will serve you well.
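Applied to the inner loop of the question's matrix product, the memcpy workaround might look roughly like the sketch below. The function name axpy_row is hypothetical, and the code assumes a compiler that supports the GCC vector extension used in the question (GCC or Clang). It loads and stores each float4 through memcpy, so neither row pointer needs to be 16-byte aligned:

```cpp
#include <cstddef>
#include <cstring>

typedef float float4 __attribute__((vector_size(16)));

// Hypothetical helper: c[0..n) += aa * b[0..n), with no alignment
// requirement on c or b.
void axpy_row(float* c, const float* b, float aa, std::size_t n) {
    const float4 a4 = {aa, aa, aa, aa};
    std::size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        float4 cv, bv;
        std::memcpy(&cv, c + j, sizeof(cv));  // unaligned load
        std::memcpy(&bv, b + j, sizeof(bv));  // unaligned load
        cv = cv + a4 * bv;
        std::memcpy(c + j, &cv, sizeof(cv));  // unaligned store
    }
    for (; j < n; ++j) {                      // scalar tail for n % 4 leftovers
        c[j] += aa * b[j];
    }
}
```

In the question's operator*, the body of the k loop would then reduce to a call such as axpy_row(&c(i,0), &b(k,0), a(i,k), b.n). Writing the loop bound as j + 4 <= n also avoids the unsigned wrap-around of b.n-4 when a matrix has fewer than four columns.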
A different fix is to only work in terms of float4 * pointers. This forces all your matrices to have dimensions that are divisible by four, but if you pad the leftover with zeroes you will probably get simple and really fast code.

Efficient way to loop through pixels of 16-bit Mat in OpenCV

I'm trying to make very simple (LUT-like) operations on a 16-bit gray-scale OpenCV Mat, which is efficient and doesn't slow down the debugger.
While there is a very detailed page in the documentation addressing exactly this issue, it fails to point out that most of those methods are only available on 8-bit images (including the perfect, optimized LUT function).
I tried the following methods:
uchar* p = mat_depth.data;
for (unsigned int i = 0; i < depth_width * depth_height * sizeof(unsigned short); ++i)
{
*p = ...;
*p++;
}
Really fast, unfortunately only supporting uchar (just like LUT).
int i = 0;
for (int row = 0; row < depth_height; row++)
{
for (int col = 0; col < depth_width; col++)
{
i = mat_depth.at<short>(row, col);
i = ..
mat_depth.at<short>(row, col) = i;
}
}
Adapted from this answer: https://stackoverflow.com/a/27225293/518169. Didn't work for me, and it was very slow.
cv::MatIterator_<ushort> it, end;
for (it = mat_depth.begin<ushort>(), end = mat_depth.end<ushort>(); it != end; ++it)
{
*it = ...;
}
Works well, however it uses a lot of CPU and makes the debugger super slow.
This answer https://stackoverflow.com/a/27099697/518169 points to the source code of the built-in LUT function, however it only mentions advanced optimization techniques, like IPP and OpenCL.
What I'm looking for is a very simple loop like the first code, but for ushorts.
What method do you recommend for solving this problem? I'm not looking for extreme optimization, just something on par with the performance of the single-for-loop on .data.

I implemented Michael's and Kornel's suggestion and benchmarked them both in release and debug modes.
code:
cv::Mat LUT_16(cv::Mat &mat, ushort table[])
{
int limit = mat.rows * mat.cols;
ushort* p = mat.ptr<ushort>(0);
for (int i = 0; i < limit; ++i)
{
p[i] = table[p[i]];
}
return mat;
}
cv::Mat LUT_16_reinterpret_cast(cv::Mat &mat, ushort table[])
{
int limit = mat.rows * mat.cols;
ushort* ptr = reinterpret_cast<ushort*>(mat.data);
for (int i = 0; i < limit; i++, ptr++)
{
*ptr = table[*ptr];
}
return mat;
}
cv::Mat LUT_16_if(cv::Mat &mat)
{
int limit = mat.rows * mat.cols;
ushort* ptr = reinterpret_cast<ushort*>(mat.data);
for (int i = 0; i < limit; i++, ptr++)
{
if (*ptr == 0){
*ptr = 65535;
}
else{
*ptr *= 100;
}
}
return mat;
}
ushort* tablegen_zero()
{
static ushort table[65536];
for (int i = 0; i < 65536; ++i)
{
if (i == 0)
{
table[i] = 65535;
}
else
{
table[i] = i;
}
}
return table;
}
The results are the following (release/debug):
LUT_16: 0.202 ms / 0.773 ms
LUT_16_reinterpret_cast: 0.184 ms / 0.801 ms
LUT_16_if: 0.249 ms / 0.860 ms
So the conclusion is that the reinterpret_cast version is faster by 9% in release mode, while the ptr version is faster by 4% in debug mode.
It's also interesting to see that directly calling the if function instead of applying a LUT only makes it slower by 0.065 ms.
Specs: streaming 640x480x16-bit grayscale image, Visual Studio 2013, i7 4750HQ.

OpenCV implementation is based on polymorphism and runtime dispatching over templates. In OpenCV version the use of templates is limited to a fixed set of primitive data types. That is, array elements should have one of the following types:
8-bit unsigned integer (uchar)
8-bit signed integer (schar)
16-bit unsigned integer (ushort)
16-bit signed integer (short)
32-bit signed integer (int)
32-bit floating-point number (float)
64-bit floating-point number (double)
a tuple of several elements where all elements have the same type (one of the above).
If your cv::Mat is continuous, you can use pointer arithmetic to walk the whole data buffer; just use the pointer type that matches the element type of your cv::Mat.
Furthermore, keep in mind that cv::Mats are not always continuous (a Mat can be a ROI, padded, or created from a pixel pointer), and iterating over a non-continuous Mat with a single flat pointer will read or write the wrong memory and may crash.
An example loop:
cv::Mat cvmat16sc1 = cv::Mat::eye(10, 10, CV_16SC1);
if (cvmat16sc1.data)
{
if (!cvmat16sc1.isContinuous())
{
cvmat16sc1 = cvmat16sc1.clone();
}
short* ptr = reinterpret_cast<short*>(cvmat16sc1.data);
for (int i = 0; i < cvmat16sc1.cols * cvmat16sc1.rows; i++, ptr++)
{
if (*ptr == 1)
std::cout << i << ": " << *ptr << std::endl;
}
}
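The example above handles a non-continuous matrix by cloning it. An alternative is to walk the image row by row, using the distance between row starts instead of assuming that rows are packed. The sketch below is library-free on purpose: the function lut16_strided and its raw-buffer parameters are hypothetical stand-ins for a cv::Mat, not OpenCV API, and the stride is counted in elements rather than bytes:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: applies a 65536-entry lookup table in place to a
// rows x cols image of 16-bit pixels. `stride` is the number of elements
// between the starts of consecutive rows (stride >= cols); any padding
// between rows is left untouched.
void lut16_strided(std::uint16_t* data, std::size_t rows, std::size_t cols,
                   std::size_t stride, const std::uint16_t* table) {
    for (std::size_t r = 0; r < rows; ++r) {
        std::uint16_t* row = data + r * stride;  // start of row r
        for (std::size_t c = 0; c < cols; ++c) {
            row[c] = table[row[c]];
        }
    }
}
```

For a continuous image, stride equals cols and this degenerates to the single flat loop.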

Best solution for your problem is already written in the tutorial that you mentioned, in the chapter named "The efficient way". All you need is to replace every instance of uchar with ushort. No other changes are needed.

pointer arithmetic in C++ using char*

I'm having trouble understanding what the difference between these two code snippets is:
// out is of type char* of size N*D
// N, D are of type int
for (int i=0; i!=N; i++){
if (i % 1000 == 0){
std::cout << "i=" << i << std::endl;
}
for (int j=0; j!=D; j++) {
out[i*D + j] = 5;
}
}
This code runs fine, even for very big data sets (N=100000, D=30000). From what I understand about pointer arithmetic, this should give the same result:
for (int i=0; i!=N; i++){
if (i % 1000 == 0){
std::cout << "i=" << i << std::endl;
}
char* out2 = &out[i*D];
for (int j=0; j!=D; j++) {
out2[j] = 5;
}
}
However, the latter does not work (it freezes at index 143886 - I think it segfaults, but I'm not 100% sure as I'm not used to developing on windows) for a very big data set and I'm afraid I'm missing something obvious about how pointer arithmetic works. Could it be related to advancing char*?
EDIT: We have now established that the problem was an overflow of the index (i.e. (i*D + j) >= 2^32), so using uint64_t instead of int32_t fixed the problem. What's still unclear to me is why the first above case would run through, while the other one segfaults.

N * D is 3e9; that doesn't fit in a 32 bit int.
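A small sketch of the fix: do the index arithmetic in a 64-bit type so that i*D + j cannot wrap. The helper names flat_index and product_fits_in_int32 are hypothetical and only serve to show the arithmetic:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: flat offset of element (i, j) in a row-major
// array with D columns, computed entirely in 64 bits.
std::uint64_t flat_index(std::uint64_t i, std::uint64_t j, std::uint64_t D) {
    return i * D + j;
}

// Hypothetical helper: true if i * D is still representable as a 32-bit
// signed int. The product is formed in 64 bits, so the check itself
// cannot overflow for values in the int32 range.
bool product_fits_in_int32(std::int64_t i, std::int64_t D) {
    return i * D <= std::numeric_limits<std::int32_t>::max();
}
```

With D = 30000, the product i*D stops fitting in a 32-bit int at i = 71583, long before the loop reaches N = 100000.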

When using N as the size of an array, why use int?
Does a negative array size have any logical meaning?
What do you mean by "doesn't work"?
Just think of pointers as addresses in memory and not as 'objects'.
char*
void*
int*
are all pointers to memory addresses, and so hold exactly the same kind of value when they are defined or passed into a function.
char * a;
int* b = (int*)a;
void* c = (void*)b;
// a, b and c all hold the same address
The difference is that when accessing a[i], the value retrieved is the sizeof(*a) bytes starting at address a + i * sizeof(*a).
And when using ++ to advance a pointer, the address is advanced by
sizeof(*pointer) bytes.
Example:
char* a = (char*)1;
a++;
a is now 2.
a = (char*)((int*)a + 1);
a is now 6 (assuming a 4-byte int).
Another thing:
char* a = (char*)10;
char* b = a + 10;
&(a[10]) == b
because in the end
a[10] == *((char*)(a + 10))
so there should not be a problem with array sizes in your example, because the two examples are the same.
EDIT
Now note that a negative index is not converted to a positive one: data[a] means *(data + a), so a negative a addresses memory before data.
int a = -5;
char* data;
data[a] == *(data - 5)
For that reason it might be that (when a signed index overflows and turns negative!) your two examples will actually not get the same result.

Version 1
for (int i=0; i!=N; i++) // i starts at 0 and increments until N. Note: if i ever skips past N, the loop runs forever; i < N is the safer condition
{
if (i % 1000 == 0) // if i is a multiple of 1000
{
std::cout << "i=" << i << std::endl; // print i
}
for (int j=0; j!=D; j++) // same as with i, only j goes up to D (same caveat: j < D is safer)
{
out[i*D + j] = 5; // this is a way of faking a 2D array by making a large 1D array and doing the math yourself to offset the placement
}
}
Version 2
for (int i=0; i!=N; i++) // same as before
{
if (i % 1000 == 0) // same as before
{
std::cout << "i=" << i << std::endl; // same as before
}
char* out2 = &out[i*D]; // store the location of out[i*D]
for (int j=0; j!=D; j++)
{
out2[j] = 5; // set out[i*D+j] = 5;
}
}
They are doing the same thing, but if out is not large enough, they will both behave in an undefined manner (and likely crash).

Flaws in algorithm and algorithm performance

char *stringmult(int n)
{
char *x = "hello ";
for (int i=0; i<n; ++i)
{
char *y = new char[strlen(x) * 2];
strcpy(y,x);
strcat(y,x);
delete[] x;
x=y;
}
return x;
}
I'm trying to figure out what the flaws of this segment are. For one, it deletes x and then tries to copy its values over to y. Another is that y is twice the size of x and that y never gets deleted. Is there anything that I'm missing? And also, I need to figure out how to get algorithm performance. If you've got a quick link where you learned how, I'd appreciate it.

y needs one more byte than strlen(x) * 2 to make space for the terminating nul character -- just for starters.
Anyway, as you're returning a newed memory area, it's up to the caller to delete it (eek).
What you're missing, it seems to me, is std::string...!-)
As for performance, copying N characters with strcpy is O(N); concatenating N1 characters to a char array with a previous strlen of N2 is O(N1+N2) (std::string is faster as it keeps the length of the string in an O(1)-accessible attribute!-). So just sum N+N**2 for N up to whatever your limit of interest is (you can ignore the N+ part if all you want is a big-O estimate since it's clearly going to drop away for larger and larger values of N!-).
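To make the std::string remark concrete, here is one possible sketch. The name string_mult is hypothetical, and the code assumes 0 <= n and that n is small enough for 2^n copies to fit in memory. The string owns its buffer and tracks its own length, so there is no terminator to budget for and nothing for the caller to delete:

```cpp
#include <cstddef>
#include <string>

// Hypothetical rewrite: returns "hello " repeated 2^n times, matching the
// doubling loop in the question. Assumes 0 <= n and a result that fits in memory.
std::string string_mult(int n) {
    const std::string unit = "hello ";
    const std::size_t copies = static_cast<std::size_t>(1) << n;

    std::string result;
    result.reserve(unit.size() * copies);  // one allocation up front
    for (std::size_t i = 0; i < copies; ++i) {
        result += unit;                    // appends at the known end, no rescan
    }
    return result;
}
```

Each append is proportional to the length of the appended piece, so the total work is linear in the size of the result.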

For starters delete[] x; operates for the first time round the loop on some static memory. Not good.
It looks like an attempt to return a buffer containing 2^n copies of the string "hello ". So the fastest way to do that would be to figure out the number of copies, then allocate a big enough buffer for the whole result, then fill it with the content and return it.
#include <string>
#include <vector>
void repeat_string(const std::string &str, int count, std::vector<char> &result)
{
result.resize(str.size() * count);
for (int n = 0; n < count; n++)
str.copy(&result[n * str.size()], str.size());
}
void foo(int power, std::vector<char> &result)
{
// the original loop doubles the string `power` times, giving 2^power copies
repeat_string("hello ", 1 << power, result);
}

no need to call strlen() in a loop - only call it once;
when new is called no space is requested for the null-character - will cause undefined behaviour;
should use strcpy instead of strcat - you already know where to copy the second string and finding the end of string by strcat requires extra computation;
delete[] is used on a statically allocated string literal - will cause undefined behaviour;
memory is constantly reallocated although you know the result length well in advance - memory reallocation is quite expensive
You should instead compute the result length at once and allocate memory at once and pass the char* as an in-parameter:
char* stringMult(const char* what, int n)
{
const size_t sourceLen = strlen( what );
int i;
size_t resultLen = sourceLen;
// this computation can be done more cleverly and faster
for( i = 0; i < n; i++ ) {
resultLen *= 2;
}
const int numberOfCopies = resultLen / sourceLen;
char* result = new char[resultLen + 1];
char* whereToWrite = result;
for( i = 0; i < numberOfCopies; i++ ) {
strcpy( whereToWrite, what );
whereToWrite += sourceLen;
}
return result;
}
Certain parts of my implementation can be optimized but still it is much better and (I hope) contains no undefined-behaviour class errors.

You have to add one when allocating space for y, to make room for the terminating NUL character.
Check the code at below location http://codepad.org/tkGhuUDn

char * stringmult (int n)
{
int i;
size_t m;
for (i = 0, m = 1; i < n; ++i, m *= 2);
const char * source = "hello ";
int source_len = strlen(source);
char * target = malloc((source_len * m + 1) * sizeof(char));
char * tmp = target;
for (i = 0; i < m; ++i) {
strcpy(tmp, source);
tmp += source_len;
}
*tmp = '\0';
return target;
}
Here is a better version in plain C. Most of the drawbacks of your code have been eliminated, i.e. deleting a non-allocated pointer, too many uses of strlen and new.
Nonetheless, my version may imply the same memory leak as your version, as the caller is responsible for freeing the string afterwards.
Edit: corrected my code, thanks to sharptooth.

char* string_mult(int n)
{
const char* x = "hello ";
char* y = NULL;
int i;
for (i = 0; i < n; i++)
{
if ( i == 0)
{
y = (char*) malloc(strlen(x) + 1); /* +1 for the terminating '\0' */
strcpy(y, x);
}
else
{
y = (char*)realloc(y, strlen(x)*(i+1) + 1); /* +1 for the terminating '\0' */
strcat(y, x);
}
}
return y;
}

Nobody is going to point out that "y" is in fact being deleted?
Not even one reference to Shlemiel the Painter?
But the first thing I'd do with this algorithm is:
int l = strlen(x);
int log2l = 0;
int log2n = 0;
int ncopy = n;
while (log2l++, l >>= 1);
while (log2n++, n >>= 1);
if (log2l+log2n >= 8*(sizeof(void*)-1)) {
cout << "don't even bother trying, you'll run out of virtual memory first";
}
