I have a question about duplicating a 0-terminated string:
const char * str = "Hello World !";
size_t getSize = strlen(str);
char * temp = new char[getSize + 1];
... I know I can use this function:
memcpy(temp, str, getSize);
but I want to use my own copy function, which works like this:
int Count = 0;
while (str[Count] != '\0') {
    temp[Count] = str[Count];
    Count++;
}
Both ways work correctly. Now I want to benchmark them over 10 million iterations. For memcpy I do this:
const char * str = "Hello World !";
size_t getSize = strlen(str);
for (size_t i = 0; i < 10000000; i++) {
    char * temp = new char[getSize + 1];
    memcpy(temp, str, getSize);
}
and this is my own way:
const char * str = "Hello World !";
size_t getSize = strlen(str);
for (size_t i = 0; i < 10000000; i++) {
    char * temp = new char[getSize + 1];
    int Count = 0;
    while (str[Count] != '\0') {
        temp[Count] = str[Count];
        Count++;
    }
}
The first loop finishes in 420 milliseconds and the second in 650 milliseconds.
... Why? Both ways do the same thing! I want to use my own function, not memcpy. Is there any way to make my own version faster (as fast as memcpy, or maybe faster)? How can I change my while loop so that it is at least as fast as memcpy?
Full source:
int main() {
    const char * str = "Hello world !";
    size_t getSize = strlen(str);
    auto start_t = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 10000000; i++) {
        char * temp = new char[getSize + 1];
        memcpy(temp, str, getSize);
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - start_t).count() << " milliseconds\n";
    start_t = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 10000000; i++) {
        char * temp = new char[getSize + 1];
        int done = 0;
        while (str[done] != '\0') {
            temp[done] = str[done];
            done++;
        }
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - start_t).count() << " milliseconds\n";
    return 0;
}
results:
482 milliseconds
654 milliseconds
Replacing library functions with your own often leads to inferior performance.
memcpy represents a very fundamental memory operation. Because of that, it is highly optimized by its authors. Unlike a "naïve" implementation, the library version moves more than a single byte at a time whenever possible, and uses hardware assistance on platforms where it is available.
Moreover, the compiler itself "knows" about the inner workings of memcpy and other library functions, and it can optimize them out completely in cases where the length is known at compile time.
Note: your implementation has the semantics of strcpy, not memcpy.
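As an aside, a version of the loop that really does have strcpy semantics (copying the terminating '\0' too, which the loop in the question never does) could be sketched like this, reusing str and temp from the question:
int i = 0;
do {
    temp[i] = str[i];          // the '\0' is copied on the final pass
} while (str[i++] != '\0');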
... both ways do the same thing!
No, they aren't:
memcpy() doesn't check each character for '\0'.
The implementers may apply more optimizations than your naive approach does.
It's unlikely that your approach can be made faster than memcpy().
Seeing that you didn't use pointers, and that you compared what you are doing (strcpy) with memcpy, clearly shows that you are a beginner, and as everyone else has already stated, it is difficult to outsmart the experienced programmers who wrote your library.
But I'm going to give you some hints to optimize your code.
I took a quick look at Microsoft's C Standard Library implementation (dubbed the C Runtime Library), and they do it in assembly, which is faster than doing it in C. So that is one point for speed.
On most 32-bit architectures with 32-bit buses, the CPU can fetch 32 bits of information from memory in one request (assuming the data is properly aligned), but even if you need only 16 or 8 bits, it still needs to make that one request. So working with your machine's word size probably gives you some speedup; see the sketch below.
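For illustration only, here is a hedged sketch of word-at-a-time copying. It assumes both buffers are suitably aligned and the length is an exact multiple of the word size; a real memcpy also handles the unaligned head and tail, and copying arbitrary buffers through a wider type technically violates C++ aliasing rules (one more reason the real thing lives in the library):
#include <cstdint>
#include <cstddef>

// Illustrative sketch: copy len bytes one machine word at a time.
// Assumes len % sizeof(std::uintptr_t) == 0 and aligned pointers.
void word_copy(void* dst, const void* src, std::size_t len) {
    auto* d = static_cast<std::uintptr_t*>(dst);
    auto* s = static_cast<const std::uintptr_t*>(src);
    for (std::size_t i = 0; i < len / sizeof(std::uintptr_t); ++i)
        d[i] = s[i];  // one load/store per word instead of per byte
}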
Lastly, I want to direct your attention to SIMD. If your CPU provides it, you can use it and gain that extra speed. Again, the MS CRT has some SSE2 optimization options.
From time to time in the past, I had to write code that outperformed my library implementation, because I had a specific need or a specific type of data I could optimize for. While re-implementing library functions might have some educational value, unless it is specifically needed, your time is better spent on your actual code.
Related
I need to read a binary file which is made of many basic types such as int, double, UTF-8 strings, etc. For instance, think of one file containing n pairs of (int, double) one after the other, without any alignment, with n on the order of tens of millions. I need very fast access to that file. I read the file using fread calls and my own buffer, which is about 16 kB long.
A profiler shows that my main bottleneck is copying from the memory buffer to its final destination. The most obvious way to write a function that copies from the buffer to a double would be:
// x: a pointer to the final destination of the data
// p: a pointer to the buffer used to read the file
//
void f0(double* x, const unsigned char* p) {
    unsigned char* q = reinterpret_cast<unsigned char*>(x);
    for (int i = 0; i < 8; ++i) {
        q[i] = p[i];
    }
}
If I use the following code, I get a huge speedup on x86-64:
void f1(double* x, const unsigned char* p) {
    const double* r = reinterpret_cast<const double*>(p);
    *x = *r;
}
But, as I understand it, the program would crash on ARM if p is not 8-byte aligned.
Here are my questions:
Is the second program guaranteed to work on both x86 and x86-64?
How would you write such a function on ARM if you need it as fast as you can?
Here is a small benchmark to test on your machine
#include <chrono>
#include <cstddef>
#include <iostream>

void copy_int_0(int* x, const unsigned char* p) {
    unsigned char* q = reinterpret_cast<unsigned char*>(x);
    for (std::size_t i = 0; i < 4; ++i) {
        q[i] = p[i];
    }
}

void copy_double_0(double* x, const unsigned char* p) {
    unsigned char* q = reinterpret_cast<unsigned char*>(x);
    for (std::size_t i = 0; i < 8; ++i) {
        q[i] = p[i];
    }
}

void copy_int_1(int* x, const unsigned char* p) {
    *x = *reinterpret_cast<const int*>(p);
}

void copy_double_1(double* x, const unsigned char* p) {
    *x = *reinterpret_cast<const double*>(p);
}

int main() {
    const std::size_t n = 10000000;
    const std::size_t nb_times = 200;
    unsigned char* p = new unsigned char[12 * n];
    for (std::size_t i = 0; i < 12 * n; ++i) {
        p[i] = 0;
    }
    int* q0 = new int[n];
    for (std::size_t i = 0; i < n; ++i) {
        q0[i] = 0;
    }
    double* q1 = new double[n];
    for (std::size_t i = 0; i < n; ++i) {
        q1[i] = 0.0;
    }
    const auto begin_0 = std::chrono::high_resolution_clock::now();
    for (std::size_t k = 0; k < nb_times; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            copy_int_0(q0 + i, p + 12 * i);
            copy_double_0(q1 + i, p + 4 + 12 * i);
        }
    }
    const auto end_0 = std::chrono::high_resolution_clock::now();
    const double time_0 =
        1.0e-9 *
        std::chrono::duration_cast<std::chrono::nanoseconds>(end_0 - begin_0)
            .count();
    std::cout << "Time 0: " << time_0 << " s" << std::endl;
    const auto begin_1 = std::chrono::high_resolution_clock::now();
    for (std::size_t k = 0; k < nb_times; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            copy_int_1(q0 + i, p + 12 * i);
            copy_double_1(q1 + i, p + 4 + 12 * i);
        }
    }
    const auto end_1 = std::chrono::high_resolution_clock::now();
    const double time_1 =
        1.0e-9 *
        std::chrono::duration_cast<std::chrono::nanoseconds>(end_1 - begin_1)
            .count();
    std::cout << "Time 1: " << time_1 << " s" << std::endl;
    std::cout << "Prevent optimization: " << q0[0] << " " << q1[0] << std::endl;
    delete[] q1;
    delete[] q0;
    delete[] p;
    return 0;
}
The results I get are
clang++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.49403 s
Time 1: 4.01617 s
g++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.65762 s
Time 1: 3.89979 s
icpc -std=c++11 -O3 -xHost copy.cpp -o copy
./copy
Time 0: 8.46155 s
Time 1: 0.0278496 s
I did not check the assembly yet but I guess that the Intel compiler is fooling my benchmark here.
Is the second program guaranteed to work on both x86 and x86-64?
No.
When you dereference a double* the compiler is free to assume that the memory location actually contains a double, which means that it must be aligned to alignof(double).
A lot of x86 instructions are safe to use for unaligned data, but not all of them. Specifically, there are SIMD instructions which require proper alignment which your compiler is free to use.
This isn't just theoretical; LZ4 used to use something very similar to what you posted (it's C, not C++, so it was a C-style cast not reinterpret_cast, but that doesn't really matter), and everything worked as expected. Then GCC 5 was released, and it auto-vectorized the code in question at -O3 using vmovdqa, which requires proper alignment. The end result is that code which worked fine in GCC ≤ 4.9 started crashing at runtime when compiled with GCC ≥ 5.
In other words, even if your program happens to work today, if you depend on unaligned access (or other undefined behavior), it can easily stop working tomorrow. Don't do it.
How would you write such a function on ARM if you need it as fast as you can?
The answer isn't really ARM-specific. After the LZ4 incident, Yann Collet (the author of LZ4) did a lot of research to answer this question. There isn't one option which will generate optimal code with every compiler on every architecture.
Using memcpy() is the safest option. If the size is known at compile time, the compiler will generally optimize the memcpy() call away… for larger buffers, you can take advantage of that by calling memcpy() in a loop; you'll generally get a loop of fast instructions without the additional overhead of actually calling memcpy().
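For example, an unaligned load written with memcpy() might look like this sketch; with optimizations on, mainstream compilers typically reduce it to a single (unaligned) load instruction rather than a function call:
#include <cstring>

// Well-defined even if p is not aligned for double.
double load_double(const unsigned char* p) {
    double x;
    std::memcpy(&x, p, sizeof x);
    return x;
}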
If you're feeling more adventurous you can use a packed union to "cast" instead of reinterpret_cast. This is compiler-specific, but when supported it should be safe, and it may be faster than memcpy().
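As a compiler-specific sketch of that idea (GCC/Clang attribute syntax, using a packed struct rather than a union; the same principle, and not portable C++):
// packed tells GCC/Clang the field may be unaligned, so the compiler
// emits an unaligned load; may_alias opts out of strict-aliasing
// assumptions for this type.
struct unaligned_double {
    double value;
} __attribute__((packed, may_alias));

double load_double_packed(const unsigned char* p) {
    return reinterpret_cast<const unaligned_double*>(p)->value;
}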
FWIW, I have some code which attempts to find the optimal way to do this depending on various factors (compiler, compiler version, architecture, etc.). It is a bit conservative about platforms I haven't tested, but it should achieve good results on the vast majority of platforms people actually use.
This is an empirical assumption (that allocating is faster than deallocating).
This is also, I guess, one of the reasons why heap-based storage (like the STL containers) chooses not to return currently unused memory to the system (which is why the shrink-to-fit idiom was born).
And we shouldn't confuse, of course, 'heap' memory with 'heap'-like data structures.
So why is deallocation slower?
Is it Windows-specific (I see it on Windows 8.1), or is it OS-independent?
Is there some C++-specific memory manager automatically involved when using new/delete, or does memory management rely completely on the OS? (I know C++11 introduced some garbage-collection support, which I have never really used, preferring the old stack and static durations, self-managed containers, and RAII.)
Also, in the code of the Folly string I saw the old C heap allocation/deallocation being used; is it faster than C++ new/delete?
P.S. Please note that the question is not about virtual-memory mechanics; I understand that user-space programs don't use real memory addressing.
The assertion that allocating memory is faster than deallocating it seemed a bit odd to me, so I tested it. I ran a test where I allocated 64MB of memory in 32-byte chunks (so 2M calls to new), and I tried deleting that memory in the same order it was allocated, and in a random order. I found that linear-order deallocation was about 3% faster than allocation, and that random deallocation was about 10% slower than linear allocation.
I then ran a test where I started with 64MB of allocated memory, and then 2M times either allocated new memory or deleted existing memory (at random). Here, I found that deallocation was about 4.3% slower than allocation.
So, it turns out you were correct - deallocation is slower than allocation (though I wouldn't call it "much" slower). I suspect this has simply to do with more random accesses, but I have no evidence for this other than that the linear deallocation was faster.
To answer some of your questions:
Is there some C++ specific memory manager automatically involved on using 'new' / 'delete'?
Yes. The OS has system calls which allocate pages of memory (typically 4KB chunks) to processes. It's the process' job to divide up those pages into objects. Try looking up the "GNU Memory Allocator."
I saw using old C heap allocation / deallocation, is it faster then C++ 'new' / 'delete'?
Most C++ new/delete implementations just call malloc and free under the hood. This is not required by the standard, however, so it's a good idea to always use matching allocation and deallocation functions on any particular object (free what you malloc, delete what you new).
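To illustrate the point (a sketch of the idea, not how any particular library is actually written), the standard even explicitly permits replacing the global operator new/delete, and a minimal replacement forwarding to the C heap looks like this:
#include <cstdlib>
#include <new>

// Sketch: global operator new/delete forwarding to malloc/free,
// essentially what many implementations do internally.
void* operator new(std::size_t size) {
    if (size == 0) size = 1;              // new must return a unique pointer
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept {
    std::free(p);
}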
I ran my tests with the native testing framework provided in Visual Studio 2015, on a Windows 10 64-bit machine (The tests were also 64-bit). Here's the code:
#include "stdafx.h"
#include "CppUnitTest.h"
using namespace Microsoft::VisualStudio::CppUnitTestFramework;
namespace AllocationSpeedTest
{
class Obj32 {
uint64_t a;
uint64_t b;
uint64_t c;
uint64_t d;
};
constexpr int len = 1024 * 1024 * 2;
Obj32* ptrs[len];
TEST_CLASS(UnitTest1)
{
public:
TEST_METHOD(Linear32Alloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
}
TEST_METHOD(Linear32AllocDealloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
for (int i = 0; i < len; ++i) {
delete ptrs[i];
}
}
TEST_METHOD(Random32AllocShuffle)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
int pos = (rand() % (len - i)) + i;
Obj32* temp = ptrs[i];
ptrs[i] = ptrs[pos];
ptrs[pos] = temp;
}
}
TEST_METHOD(Random32AllocShuffleDealloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
int pos = (rand() % (len - i)) + i;
Obj32* temp = ptrs[i];
ptrs[i] = ptrs[pos];
ptrs[pos] = temp;
}
for (int i = 0; i < len; ++i) {
delete ptrs[i];
}
}
TEST_METHOD(Mixed32Both)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
ptrs[i] = new Obj32();
}
else {
delete ptrs[i];
}
}
}
TEST_METHOD(Mixed32Alloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
ptrs[i] = new Obj32();
}
else {
//delete ptrs[i];
}
}
}
TEST_METHOD(Mixed32Dealloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
//ptrs[i] = new Obj32();
}
else {
delete ptrs[i];
}
}
}
TEST_METHOD(Mixed32Neither)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
//ptrs[i] = new Obj32();
}
else {
//delete ptrs[i];
}
}
}
};
}
And here are the raw results over several runs. All numbers are in milliseconds.
I had much the same idea as Basile: I wondered whether your base assumption was actually (even close to) correct. Since you tagged the question C++, I wrote a quick benchmark in C++ instead.
#include <vector>
#include <iostream>
#include <numeric>
#include <chrono>
#include <iomanip>
#include <locale>

int main() {
    std::cout.imbue(std::locale(""));
    using namespace std::chrono;
    using factor = microseconds;
    auto const size = 2000;
    std::vector<int *> allocs(size);
    auto start = high_resolution_clock::now();
    for (int i = 0; i < size; i++)
        allocs[i] = new int[size];
    auto stop = high_resolution_clock::now();
    auto alloc_time = duration_cast<factor>(stop - start).count();
    start = high_resolution_clock::now();
    for (int i = 0; i < size; i++)
        delete[] allocs[i];
    stop = high_resolution_clock::now();
    auto del_time = duration_cast<factor>(stop - start).count();
    std::cout << std::left << std::setw(20) << "alloc time: " << alloc_time << " uS\n";
    std::cout << std::left << std::setw(20) << "del time: " << del_time << " uS\n";
}
I also used VC++ on Windows instead of gcc on Linux. The result wasn't much different though: freeing the memory took substantially less time than allocating it did. Here are the results from three successive runs.
alloc time: 2,381 uS
del time: 1,429 uS
alloc time: 2,764 uS
del time: 1,592 uS
alloc time: 2,492 uS
del time: 1,442 uS
I'd note, however, that allocation and freeing are handled (primarily) by the standard library, so this could differ between one standard library and another (even when using the same compiler). I'd also note that it wouldn't surprise me if the results changed somewhat in multithreaded code. Although it's not actually correct, a few authors appear to be under the misapprehension that freeing in a multithreaded environment requires locking the heap for exclusive access. This can be avoided, but the means of doing so isn't necessarily immediately obvious.
I am not sure about your observation. I wrote the following program (on Linux; hopefully you can port it to your system).
// public domain code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <errno.h>
#include <string.h>
#include <assert.h>
const unsigned possible_word_sizes[] = {
    1, 2, 3, 4, 5,
    8, 12, 16, 24,
    32, 48, 64, 128,
    256, 384, 2048
};

long long totalsize;

// return a calloc-ed array of nbchunks malloc-ed zones of
// somewhat random size
void **
malloc_chunks (int nbchunks)
{
    const int nbsizes =
        (int) (sizeof (possible_word_sizes)
               / sizeof (possible_word_sizes[0]));
    void **ad = calloc (nbchunks, sizeof (void *));
    if (!ad)
    {
        perror ("calloc chunks");
        exit (EXIT_FAILURE);
    };
    for (int ix = 0; ix < nbchunks; ix++)
    {
        unsigned sizindex = random () % nbsizes;
        unsigned size = possible_word_sizes[sizindex];
        void *zon = malloc (size * sizeof (void *));
        if (!zon)
        {
            fprintf (stderr,
                     "malloc#%d (%d words) failed (total %lld) %s\n",
                     ix, size, totalsize, strerror (errno));
            exit (EXIT_FAILURE);
        }
        ((int *) zon)[0] = ix;
        totalsize += size;
        ad[ix] = zon;
    }
    return ad;
}

void
free_chunks (void **chks, int nbchunks)
{
    // first, free two thirds of the chunks in random order
    for (int i = 0; 3 * i < 2 * nbchunks; i++)
    {
        int pix = random () % nbchunks;
        if (chks[pix])
        {
            free (chks[pix]);
            chks[pix] = NULL;
        }
    }
    // then, free the rest in reverse order
    for (int i = nbchunks - 1; i >= 0; i--)
        if (chks[i])
        {
            free (chks[i]);
            chks[i] = NULL;
        }
}

int
main (int argc, char **argv)
{
    assert (sizeof (int) <= sizeof (void *));
    int nbchunks = (argc > 1) ? atoi (argv[1]) : 32768;
    if (nbchunks < 128)
        nbchunks = 128;
    srandom (time (NULL));
    printf ("nbchunks=%d\n", nbchunks);
    void **chks = malloc_chunks (nbchunks);
    clock_t clomall = clock ();
    printf ("clomall=%ld totalsize=%lld words\n",
            (long) clomall, totalsize);
    free_chunks (chks, nbchunks);
    clock_t clofree = clock ();
    printf ("clofree=%ld\n", (long) clofree);
    return 0;
}
I compiled it with gcc -O2 -Wall mf.c -o mf on my Debian/Sid/x86-64 (i3770k, 16 GB). I ran time ./mf 100000 and got:
nbchunks=100000
clomall=54162 totalsize=19115681 words
clofree=83895
./mf 100000 0.02s user 0.06s system 95% cpu 0.089 total
On my system, clock gives CPU microseconds. If the call to random is negligible (and I don't know whether it is) w.r.t. the malloc & free time, I tend to disagree with your observations: free seems to be twice as fast as malloc. My gcc is 6.1; my libc is glibc 2.22.
Please take time to compile the above benchmark on your system and report the timings.
FWIW, I took Jerry's code and
g++ -O3 -march=native jerry.cc -o jerry
time ./jerry; time ./jerry; time ./jerry
gives
alloc time: 1940516
del time: 602203
./jerry 0.00s user 0.01s system 68% cpu 0.016 total
alloc time: 1893057
del time: 558399
./jerry 0.00s user 0.01s system 68% cpu 0.014 total
alloc time: 1818884
del time: 527618
./jerry 0.00s user 0.01s system 70% cpu 0.014 total
When you allocate small memory blocks, the block size you specify maps directly to a suballocator for that size, commonly represented as a "slab" of memory containing same-size records, to avoid memory fragmentation. Allocating can then be very fast, similar to an array access. But freeing such blocks is not so straightforward, because you are passing a pointer to memory of unknown size, which requires additional work to determine which slab it belongs to before the block can be returned to its proper place.
When you allocate large blocks of virtual memory, a memory page range is set up in your process space without actually mapping any physical memory to it, and that requires very little work to accomplish. But freeing such large blocks can require much more work, because the pointer being freed must first be matched to the page tables for that range, followed by walking through all of the page entries for the memory range it spans, and releasing all of the physical memory pages assigned to that range by the intervening page faults.
Of course, the details of this vary depending on the implementation being used, but the principles remain much the same: memory allocation of a known block size requires less effort than releasing a pointer to a memory block of unknown size. My knowledge of this comes directly from my experience developing high-performance commercial-grade RAII memory allocators.
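To make the asymmetry concrete, here is a toy fixed-size slab, a sketch only: allocation is a free-list pop, while the expensive part of a real free() (mapping an arbitrary pointer back to its owning slab and size class) simply doesn't appear here because this toy has a single slab:
#include <cstddef>
#include <vector>

// Toy fixed-size slab allocator: same-size records on a free list.
class Slab {
    union Node { Node* next; unsigned char storage[32]; };
    std::vector<Node> nodes_;
    Node* free_ = nullptr;
public:
    explicit Slab(std::size_t count) : nodes_(count) {
        for (Node& n : nodes_) { n.next = free_; free_ = &n; }
    }
    void* allocate() {                 // O(1): pop the free list
        if (!free_) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n;
    }
    // O(1) here only because there is one slab; a real allocator must
    // first discover which slab the pointer belongs to.
    void deallocate(void* p) {
        Node* n = static_cast<Node*>(p);
        n->next = free_;
        free_ = n;
    }
};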
I should also point out that since every heap allocation has a matching and corresponding release, this pair of operations represents a single allocation cycle, i.e. the two sides of one coin. Together, their execution time can be measured accurately, but separately such a measurement is difficult to pin down, as it varies widely depending on block size, previous activity across similar sizes, caching, and other operational considerations. But in the end, allocate/free differences may not matter much, since you don't do one without the other.
The problem here is heap fragmentation. Programs written in languages with explicit pointer arithmetic have no realistic way of defragmenting the heap.
If your heap is fragmented, you can't return memory to the OS. The OS, barring virtual memory, depends on a brk(2)-like mechanism - i.e., you set an upper bound for all memory addresses you'll refer to. But when you have even one buffer allocated and still in use near the existing boundary, you can't explicitly return memory to the OS, no matter whether 99% of all the memory in your program is freed.
Deallocation doesn't have to be slower than allocation. But the fact that you have manual deallocation with heap fragmentation makes allocation slower and more complex.
GCs fight this by compacting the heap. That way, allocation is just incrementing a pointer for them, and deallocation is not needed for the bulk of objects.
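A bump allocator, sketched below, shows why allocation out of a compacted heap is so cheap (alignment handling omitted for brevity):
#include <cstddef>

// Sketch of pointer-bump allocation, as in a compacting GC's nursery:
// allocating is one comparison and one addition; objects are never
// freed individually - the whole arena is reset (or evacuated) at once.
class BumpArena {
    unsigned char* base_;
    std::size_t size_;
    std::size_t used_ = 0;
public:
    BumpArena(unsigned char* buf, std::size_t n) : base_(buf), size_(n) {}
    void* allocate(std::size_t n) {
        if (used_ + n > size_) return nullptr;   // arena exhausted
        void* p = base_ + used_;
        used_ += n;                              // "increment the pointer"
        return p;
    }
    void reset() { used_ = 0; }                  // bulk deallocation
};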
As part of a homework assignment I need to concatenate certain values in an array in C++. So, for example if I have:
int v[] = {0,1,2,3,4};
I may need, at some point, to concatenate v[1] -> v[4] so that I get an int with the value 1234.
I got it working using stringstream, by appending the values onto the stringstream and then converting back to an integer. However, over the course of the program about 3 million different permutations of v[] will eventually be passed to my toInt() function, and the stringstream seems rather expensive (at least when dealing with that many values). It works, but it is very slow, and I'm trying to do whatever I can to optimize it.
Is there a more optimal way to concatenate the ints in an array in C++? I've done some searching, and nearly everywhere just suggests using stringstream (which works, but is slowing my program down a lot).
EDIT: Just clarifying, I do need the result to be an int.
Pseudo code for a simple solution:
int result = 0;
for (int i = 0; i < len(v); i++)
{
    result = result * 10 + v[i];
}
Large arrays will bomb out due to int overflow; a guarded version is sketched below.
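Here is a concrete version of that pseudo code with an overflow guard added (the function name and signature are mine, for illustration):
#include <climits>

// Concatenates the single-digit values v[first..last] into result.
// Returns false instead of overflowing int. Assumes each element is 0-9.
bool concat_digits(const int* v, int first, int last, int& result) {
    result = 0;
    for (int i = first; i <= last; ++i) {
        if (result > (INT_MAX - v[i]) / 10)
            return false;            // result * 10 + v[i] would overflow
        result = result * 10 + v[i];
    }
    return true;
}
For the example in the question, concat_digits(v, 1, 4, out) would yield 1234.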
How about:
int result = (((v[1])*10+v[2])*10+v[3])*10+v[4];
If the number of elements is variable rather than a fixed number, I'm sure you can spot a pattern here that can be applied in a loop.
Remember ASCII codes?
char concat[vSize + 1];   // vSize must be a compile-time constant in standard C++
concat[vSize] = '\0';
for (int i = 0; i < vSize; i++) {
    concat[i] = (v[i] % 10) | 0x30;  // 0x30 is '0'; OR-ing yields the ASCII digit
}
They are all integers. Shouldn't you do the following?
// if you want to concatenate v[1] and v[4]
int concatenated;
concatenated = v[1] * 10 + v[4];
// if you want to concatenate all of them
concatenated = 0;
for (int i = 1; i <= 4; i++)
    concatenated = concatenated * 10 + v[i];
The output would be an integer (not a string).
Things you can do:
Make sure that you compile with -O3 (or your compiler's equivalent optimization level).
Do you generate the values in the vector yourself? If so, try changing the toInt() function to accept a simple pointer type.
Write the conversion yourself (browser code; you get the idea though):
char* toInt(const int* values, std::size_t length)
{
    const int* cur = values;
    const int* end = values + length;  // pointer arithmetic already scales by sizeof(int)
    char* buf = new char[length + 1];
    char* out = buf;
    for (; cur < end; ++cur, ++out)
    {
        *out = static_cast<char>(*cur + '0');  // assumes each value is a digit 0-9
    }
    *out = '\0';
    return buf;  // caller must delete[] the returned buffer
}
Given an arbitrary floating-point number, say -0.13, suppose we have an algorithm which calculates a binary string of known length L for this number, one character at a time, from left to right.
(I need this computation to calculate the Morton key ordering for particles (coordinates given), which in turn is used in building octrees. I am creating such binary strings for each of the x, y, z dimensions.)
Is it better/efficient to first create a character array of length L, and then convert this array into a string? i.e.
char ch[L];
for (i = 0; i < L; ++i)
{
    // calculation of ch[i]
}
// convert array into string
Or is it better/more efficient to start off with an empty string and concatenate each newly calculated character onto it on the fly? i.e.
string s = "";
for(i = 0; i < L; ++i)
{
// calculation of ch[i]
s = s + string(ch);
}
Why not do both?
std::string myStr(L, '\0');
for (i = 0; i < L; ++i)
{
    // calculation of ch[i]
    myStr[i] = ch;
}
This creates a std::string with a given size. You then just set each character. This will only work if you can know the size beforehand exactly.
Alternatively, if you want something that is safe, even if you have to add more than L characters:
std::string myStr;
myStr.reserve(L);
for (i = 0; i < L; ++i)
{
    // calculation of ch[i]
    myStr.push_back(ch);
}
std::string::reserve preallocates the storage, but push_back will allocate more if need be. If you don't go past L characters, you will only get the one initial allocation.
Can't you just use a string with a pre-allocated length?
string s(L, '\0');
for (i = 0; i < L; ++i)
{
    // calculation of ch[i]
}
I'm not sure I fully understand the conversion that's happening, but we have objects for a reason. If you use std::string::reserve() first, the performance cost should be minuscule, and it's obvious what the intent is.
string s;
s.reserve(L);
for (i = 0; i < L; ++i)
{
    // calculation of ch[i]
    s.push_back(ch);
}
If speed is absolutely necessary, you can instead initialize the string as length L, and bypass length checks:
string s(L, '\0');
for (i = 0; i < L; ++i)
{
    // calculation of ch[i]
    s[i] = ch;
}
Personally, I am probably out of date, but I use
sprintf(char* str, const char* format, ...);
to create strings from numbers:
sprintf(outString, "%f", floatingPointNumber);
Use the latter, but also call s.reserve(L) before entering the loop. This is almost as efficient as direct array assignment, but still easier to grok.
EDIT: Other answers suggest using push_back(). Vote for them.
Sidebar: I'm not sure what you are computing, but if you just want to generate a string representation of the number, I'd suggest you simply call sprintf(), or insert the number into a std::stringstream.
If you want the C++ way, use ostringstream. This is generally cleaner code, less error-prone, and easier to read:
float f = ... // whatever you set it to
std::ostringstream s;
s << f;
std::string stringifiedfloat = s.str();
// now you have the float in a string.
Alternately, you can use the C way, sprintf. This is generally more error-prone and harder to read, but it has faster performance:
float f = ... // whatever you set it to
char* buffer = new char[L];
sprintf(buffer, "%f", f);
// now you have the float in a string.
Or, you could even use boost's lexical_cast. This has better performance than ostringstream, and better readability than sprintf, but it gives you a dependency on boost:
float f = ... // whatever you set it to
std::string stringified = boost::lexical_cast<std::string>(f);
// now you have the float in a string.
char *stringmult(int n)
{
    char *x = "hello ";
    for (int i = 0; i < n; ++i)
    {
        char *y = new char[strlen(x) * 2];
        strcpy(y, x);
        strcat(y, x);
        delete[] x;
        x = y;
    }
    return x;
}
I'm trying to figure out what the flaws of this segment are. For one, it deletes x and then tries to copy its values over to y. Another is that y is twice the size of x, and y never gets deleted. Is there anything I'm missing? I also need to figure out how to analyze the algorithm's performance. If you have a quick link to where you learned how, I'd appreciate it.
y needs one more byte than strlen(x) * 2 to make space for the terminating nul character -- just for starters.
Anyway, as you're returning a newed memory area, it's up to the caller to delete it (eek).
What you're missing, it seems to me, is std::string...!-)
As for performance: copying N characters with strcpy is O(N); concatenating N1 characters to a char array with a previous strlen of N2 is O(N1+N2) (std::string is faster, as it keeps the length of the string in an O(1)-accessible attribute!-). So just sum N + N**2 for N up to whatever your limit of interest is (you can ignore the N part if all you want is a big-O estimate, since it clearly drops away for larger and larger values of N!-).
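For comparison, a std::string version of the same doubling function, a sketch, sidesteps every flaw listed so far (no manual delete, no repeated strlen scans, and the terminator is managed for you):
#include <string>

// Doubles "hello " n times. reserve() avoids reallocation in the loop;
// the final length is strlen("hello ") * 2^n, assumed to fit in memory.
std::string stringmult(int n) {
    std::string x = "hello ";
    x.reserve(x.size() << n);
    for (int i = 0; i < n; ++i)
        x += x;                // append the string to itself, doubling it
    return x;
}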
For starters, delete[] x; operates, the first time round the loop, on a string literal (static memory). Not good.
It looks like an attempt to return a buffer containing 2^n copies of the string "hello ". So the fastest way to do that would be to figure out the number of copies, then allocate a big enough buffer for the whole result, then fill it with the content and return it.
void repeat_string(const std::string &str, int count, std::vector<char> &result)
{
    result.resize(str.size() * count);
    for (int n = 0; n < count; n++)
        str.copy(&result[n * str.size()], str.size());
}

void foo(int power, std::vector<char> &result)
{
    repeat_string("hello ", 1 << (power + 1), result);
}
no need to call strlen() in a loop - only call it once;
when new is called, no space is requested for the null character - this will cause undefined behaviour;
you should use strcpy instead of strcat - you already know where to copy the second string, and finding the end of the string with strcat requires extra computation;
delete[] is used on a statically allocated string literal - this will cause undefined behaviour;
memory is constantly reallocated, although you know the result length well in advance - memory reallocation is quite expensive.
You should instead compute the result length up front, allocate the memory once, and pass the source string in as a parameter:
char* stringMult(const char* what, int n)
{
    const size_t sourceLen = strlen(what);
    int i;
    size_t resultLen = sourceLen;
    // this computation can be done more cleverly and faster
    for (i = 0; i < n; i++) {
        resultLen *= 2;
    }
    const int numberOfCopies = resultLen / sourceLen;
    char* result = new char[resultLen + 1];
    char* whereToWrite = result;
    for (i = 0; i < numberOfCopies; i++) {
        strcpy(whereToWrite, what);
        whereToWrite += sourceLen;
    }
    return result;
}
Certain parts of my implementation can be optimized further, but it is still much better and (I hope) contains no undefined-behaviour-class errors.
You have to add one when allocating space for y, to hold the NUL terminator.
Check the code at the location below: http://codepad.org/tkGhuUDn
char * stringmult (int n)
{
    int i;
    size_t m;
    for (i = 0, m = 1; i < n; ++i, m *= 2);  /* m = 2^n */
    char * source = "hello ";
    size_t source_len = strlen(source);
    char * target = malloc((source_len * m + 1) * sizeof(char));
    char * tmp = target;
    for (i = 0; i < m; ++i) {
        strcpy(tmp, source);
        tmp += source_len;
    }
    *tmp = '\0';
    return target;
}
Here is a better version in plain C. Most of the drawbacks of your code have been eliminated, i.e. deleting a non-allocated pointer, and too many uses of strlen and new.
Nonetheless, my version may imply the same memory leak as yours, as the caller is responsible for freeing the string afterwards.
Edit: corrected my code, thanks to sharptooth.
char* string_mult(int n)
{
    const char* x = "hello ";
    char* y = NULL;
    int i;
    for (i = 0; i < n; i++)
    {
        if (i == 0)
        {
            y = (char*) malloc(strlen(x) + 1);            /* +1 for the '\0' */
            strcpy(y, x);
        }
        else
        {
            y = (char*) realloc(y, strlen(x) * (i + 1) + 1);
            strcat(y, x);
        }
    }
    return y;
}
Is nobody going to point out that y is in fact being deleted (on the next iteration, via delete[] x after x = y)?
Not even one reference to Shlemiel the Painter?
But the first thing I'd do with this algorithm is:
std::size_t l = strlen(x);
int log2l = 0;
int ncopy = n;                     // save n; the loop below consumes it
while (log2l++, l >>= 1);          // log2l ~ number of bits in strlen(x)
// after n doublings the result is strlen(x) * 2^n bytes long,
// so it needs roughly log2l + n bits of address space
if (log2l + ncopy >= 8 * (int) sizeof(void*)) {
    std::cout << "don't even bother trying, you'll run out of virtual memory first";
}