hashtable needs lots of memory

hashtable needs lots of memory - c++

I've declared and defined the following HashTable class. Note that I needed a hashtable of hashtables so my HashEntry struct contains a HashTable pointer. The public part is not a big deal it has the traditional hash table functions so I removed them for simplicity.
enum Status{ACTIVE, DELETED, EMPTY};
enum Type{DNS_ENTRY, URL_ENTRY};
class HashTable{
private:
struct HashEntry{
std::string key;
Status current_status;
std::string ip;
int access_count;
Type entry_type;
HashTable *table;
HashEntry(
const std::string &k = std::string(),
Status s = EMPTY,
const std::string &u = std::string(),
const int &a = int(),
Type e = DNS_ENTRY,
HashTable *t = NULL
): key(k), current_status(s), ip(u), access_count(a), entry_type(e), table(t){}
};
std::vector<HashEntry> array;
int currentSize;
public:
HashTable(int size = 1181, int csz = 0): array(size), currentSize(csz){}
};
I am using quadratic probing and I double the size of the vector in my rehash function when I hit array.size()/2. The following list is used when a larger table size is needed.
int a[16] = {49663, 99907, 181031, 360461,...}
My problem is that this class consumes so much memory. I've just profiled it with massif and found out that it needs 33MB(33 million bytes!) of memory for 125000 insertion. To be clear, actually
1 insertion -> 47352 Bytes
8 insertion -> 48376 Bytes
512 insertion -> 76.27KB
1000 insertion 2MB (array size increased to 49663 here)
27000 insertion-> 8MB (array size increased to 99907 here)
64000 insertion -> 16MB (array size increased to 181031 here)
125000 insertion-> 33MB (array size increased to 360461 here)
These may be unnecessary but I just wanted to show you how memory usage changes with the input. As you can see, when rehashing is done, memory usage doubles. For example, our initial array size was 1181. And we have just seen that 125000 elements -> 33MB.
To debug the problem, I changed the initial size to 360461. Now 127000 insertion does not need rehashing. And I see that 20MB of memory is used with this initial value. That is still huge, but I think it suggests there is a problem with rehashing. The following is my rehash function.
void HashTable::rehash(){
std::vector<HashEntry> oldArray = array;
array.resize(nextprime(array.size()));
for(int j = 0; j < array.size(); j++){
array[j].current_status = EMPTY;
}
for(int i = 0; i < oldArray.size(); i++){
if(oldArray[i].current_status == ACTIVE){
insert(oldArray[i].key);
int pos = findPos(oldArray[i].key);
array[pos] = oldArray[i];
}
}
}
int nextprime(int arraysize){
int a[16] = {49663, 99907, 181031, 360461, 720703, 1400863, 2800519, 5600533, 11200031, 22000787, 44000027};
int i = 0;
while(arraysize >= a[i]){i++;}
return a[i];
}
This is the insert function used in rehashing and everywhere else.
bool HashTable::insert(const std::string &k){
int currentPos = findPos(k);
if(isActive(currentPos)){
return false;
}
array[currentPos] = HashEntry(k, ACTIVE);
if(++currentSize > array.size() / 2){
rehash();
}
return true;
}
What am I doing wrong here? Even if it's caused by rehashing, when no rehashing is done it is still 20MB and I believe 20MB is way too much for 100k items. This hashtable is supposed to contain like 8 million elements.

The fact that 360,461 HashEntry's take 20 MB is hardly surprising. Did you try looking at sizeof(HashEntry)?
Each HashEntry includes two std::strings, a pointer, and three int's. As the old joke has it, it's not easy to answer the question "How long is a string?", in this case because there are a large variety of string implementations and optimizations, so you might find that sizeof(std::string) is anywhere between 4 and 32 bytes. (It would only be 4 bytes on a 32-bit architecture.) In practice, a string requires three pointers and the string itself unless it happens to be empty. If sizeof(std::string) is the same as sizeof(void*), then you've probably got a not-too-recent GNU standard library, in which the std::string is an opaque pointer to a block containing two pointers, a reference count, and the string itself. If sizeof(std::string) is 32 bytes, then you might have a recent GNU standard library implementation in which there is a bit of extra space in the string structure for the short-string optimization. See the answer to Why does libc++'s implementation of std::string take up 3x memory as libstdc++? for some measurements. Let's just say 32 bytes per string, and ignore the details; it won't be off by much.
So two strings (32 bytes each) plus a pointer (8 bytes) plus three ints (another 12 bytes) and four bytes of padding because one of the ints is between two 8-byte aligned objects, and that's a total of 88 bytes per HashEntry. And if you have 360,461 hash entries, that would be 31,720,568 bytes, about 30 MB. The fact that you're "only" using 20MB is probably because you're using the old GNU standard library, which optimizes empty strings to a single pointer, and the majority of your strings are empty strings because half the slots have never been used.
Now, let's take a look at rehash. Reduced to its essentials:
void rehash() {
std::vector<HashEntry> tmp = array; /* Copy the entire array */
array.resize(new_size()); /* Internally does another copy */
for (auto const& entry : tmp)
if (entry.used()) array.insert(entry); /* Yet another copy */
}
At peak, we had two copies of the smaller array as well as the new big array. Even if the new array is only 20 MB, it's not surprising that peak memory usage was almost twice that. (Indeed, this is again surprisingly small, not surprisingly big. Possibly it was not actually necessary to change the address of the new vector because it was at the end of the current allocated memory space, which could just be extended.)
Note that we did two copies of all that data, and array.resize() potentially did another one. Let's see if we can do better:
void rehash() {
std::vector<HashEntry> tmp(new_size()); /* Make an array of default objects */
for (auto const& entry: array)
if (entry.used()) tmp.insert(entry); /* Copy into the new array */
std::swap(tmp, array); /* Not a copy, just swap three pointers */
}
This way, we only do one copy. Instead of a (possible) internal copy by resize, we do a bulk construction of the new elements, which should be similar. (It's just zeroing out the memory.)
Also, in the new version we only copy the actual strings once each, instead of twice each, which is the fiddliest part of the copy and thus probably quite a large saving.
Proper string management could reduce that overhead further. rehash doesn't actually need to copy the strings, since they are not changed. So we could keep the strings elsewhere, say in a vector of strings, and just use the index into the vector in the HashEntry. Since you are not expecting to hold billions of strings, only millions, the index could a four-byte int. By also shuffling the HashEntry fields around and reducing the enums to a byte instead of four bytes (in C++11, you can specify the underlying integer type of an enum), the HashEntry could be reduced to 24 bytes, and there wouldn't be a need to leave space for as many string descriptors.

Since you are using open addressing, half your hash slots have to be empty. Since HashEntry is quite large, storing a full HashEntry in each empty slot is terribly wasteful.
You should store your HashEntry structs somewhere else and put HashEntry* in your hash table, or switch to chaining with a much denser load factor. Either one will reduce this waste.
Also, if you're going to move HashEntry objects around, swap instead of copying, or use move semantics so you don't have to copy so many strings. Be sure to clear out the strings in any entries you're no longer using.
Also, even though you say you need HashTables of HashTables, you don't really explain why. It's usually more efficient to use one hash table with efficiently represented compound keys if small hash tables are not memory-efficient.

I have changed my structure a little bit just as you all suggested, but there is this one thing that nobody has noticed.
When rehashing/resizing is being done, my rehashfunction calls insert. In this insert function I am incrementing the currentSize, which holds how many elements a hashtable has. So each time a resizing is needed, currentSize doubles itself while it should have stayed the same. I removed that line and wrote the proper code for rehashing and now I think I'm okay.
I am using two different structs now, and the program consumes 1.6GB memory for 8 million elements, which is what expected due to multibyte strings, and integers. That number was like 7-8GB before.

Related

Which container should I use for random access, cheap addition and removal (without de/allocation), with a known maximum size?

I need the lighter container that must store till 128 unsigned int.
It must add, edit and remove each element accessing it quickly, without allocating new memory every time (I already know it will be max 128).
Such as:
add int 40 at index 4 (1/128 item used)
add int 36 at index 90 (2/128 item used)
edit to value 42 the element at index 4
add int 36 at index 54 (3/128 item used)
remove element with index 90 (2/128 item used)
remove element with index 4 (1/128 item used)
... and so on. So every time I can iterate trought only the real number of elements added to the container, not all and check if NULL or not.
During this process, as I said, it must not allocating/reallocating new memory, since I'm using an app that manage "audio" data and this means a glitch every time I touch the memory.
Which container would be the right candidate?
It sounds like a "indexes" queue.

As I understand the question, you have two operations
Insert/replace element value at cell index
Delete element at cell index
and one predicate
Is cell index currently occupied?
This is an array and a bitmap. When you insert/replace, you stick the value in the array cell and set the bitmap bit. When you delete, you clear the bitmap bit. When you ask, you query the bitmap bit.

You can just use std::vector<int> and do vector.reserve(128); to keep the vector from allocating memory. This doesn't allow you to keep track of particular indices though.
If you need to keep track of an 'index' you could use std::vector<std::pair<int, int>>. This doesn't allow random access though.

If you only need cheap setting and erasing values, just use an array. You
can keep track of what cells are used by marking them in another array (or bitmap). Or by just defining one value (e.g. 0 or -1) as an "unused" value.
Of course, if you need to iterate over all used cells, you need to scan the whole array. But that's a tradeoff you need to make: either do more work during adding and erasing, or do more work during a search. (Note that an .insert() in the middle of a vector<> will move data around.)
In any case, 128 elements is so few, that a scan through the whole array will be negligible work. And frankly, I think anything more complex than a vector will be total overkill. :)
Roughly:
unsigned data[128] = {0}; // initialize
unsigned used[128] = {0};
data[index] = newvalue; used[index] = 1; // set value
data[index] = used[index] = 0; // unset value
// check if a cell is used and do something
if (used[index]) { do something } else { do something else }

I'd suggest a tandem of vectors, one to hold the active indices, the other to hold the data:
class Container
{
std::vector<size_t> indices;
std::vector<int> data;
size_t index_worldToData(size_t worldIndex) const
{
auto it = std::lower_bound(begin(indices), end(indices), worldIndex);
return it - begin(indices);
}
public:
Container()
{
indices.reserve(128);
data.reserve(128);
}
int& operator[] (size_t worldIndex)
{
return data[index_worldToData(worldIndex)];
}
void addElement(size_t worldIndex, int element)
{
auto dataIndex = index_worldToData(worldIndex);
indices.insert(it, worldIndex);
data.insert(begin(data) + dataIndex, element);
}
void removeElement(size_t worldIndex)
{
auto dataIndex = index_worldToData(worldIndex);
indices.erase(begin(indices) + dataIndex);
data.erase(begin(indices) + dataIndex);
}
class iterator
{
Container *cnt;
size_t dataIndex;
public:
int& operator* () const { return cnt.data[dataIndex]; }
iterator& operator++ () { ++dataIndex; }
};
iterator begin() { return iterator{ this, 0 }; }
iterator end() { return iterator{ this, indices.size() }; }
};
(Disclaimer: code not touched by compiler, preconditions checks omitted)
This one has logarithmic time element access, linear time insertion and removal, and allows iterating over non-empty elements.

You could use a doubly-linked list and an array of node pointers.
Preallocate 128 list nodes and keep them on freelist.
Create a empty itemlist.
Allocate an array of 128 node pointers called items
To insert at i: Pop the head node from freelist, add it to
itemlist, set items[i] to point at it.
To access/change a value, use items[i]->value
To delete at i, remove the node pointed to by items[i], reinsert it in 'freelist'
To iterate, just walk itemlist
Everything is O(1) except iteration, which is O(Nactive_items). Only caveat is that iteration is not in index order.
Freelist can be singly-linked, or even an array of nodes, as all you need is pop and push.

class Container {
private:
set<size_t> indices;
unsigned int buffer[128];
public:
void set_elem(const size_t index, const unsigned int element) {
buffer[index] = element;
indices.insert(index);
}
// and so on -- iterate over the indices if necessary
};

There are multiple approaches that you can use, I will cite them in order of effort expended.
The most affordable solution is to use the Boost non-standard containers, of particular interest is flat_map. Essentially, a flat_map offers the interface of a map over the storage provided by a dynamic array.
You can call its reserve member at the start to avoid memory allocation afterward.
A slightly more involved solution is to code your own memory allocator.
The interface of an allocator is relatively easy to deal with, so that coding an allocator is quite simple. Create a pool-allocator which will never release any element, warm it up (allocate 128 elements) and you are ready to go: it can be plugged in any collection to make it memory-allocation-free.
Of particular interest, here, is of course std::map.
Finally, there is the do-it-yourself road. Much more involved, quite obviously: the number of operations supported by standard containers is just... huge.
Still, if you have the time or can live with only a subset of those operations, then this road has one undeniable advantage: you can tailor the container specifically to your needs.
Of particular interest here is the idea of having a std::vector<boost::optional<int>> of 128 elements... except that since this representation is quite space inefficient, we use the Data-Oriented Design to instead make it two vectors: std::vector<int> and std::vector<bool>, which is much more compact, or even...
struct Container {
size_t const Size = 128;
int array[Size];
std::bitset<Size> marker;
}
which is both compact and allocation-free.
Now, iterating requires iterating the bitset for present elements, which might seem wasteful at first, but said bitset is only 16 bytes long so it's a breeze! (because at such scale memory locality trumps big-O complexity)

Why not use std::map<int, int>, it provides random access and is sparse.

If a vector (pre-reserved) is not handy enough, look into Boost.Container for various “flat” varieties of indexed collections. This will store everything in a vector and not need memory manipulation, but adds a layer on top to make it a set or map, indexable by which elements are present and able to tell which are not.

set the size of a string object [duplicate]

I wan to use std::string simply to create a dynamic buffer and than iterate through it using an index. Is resize() the only function to actually allocate the buffer?
I tried reserve() but when I try to access the string via index it asserts. Also when the string's default capacity seems to be 15 bytes (in my case) but if I still can't access it as my_string[1].
So the capacity of the string is not the actual buffer? Also reserve() also does't allocate the actual buffer?
string my_string;
// I want my string to have 20 bytes long buffer
my_string.reserve( 20 );
int i = 0;
for ( parsing_something_else_loop )
{
char ch = <business_logic>;
// store the character in
my_string[i++] = ch; // this crashes
}
If I do resize() instead of reserve() than it works fine. How is it that the string has the capacity but can't really access it with []? Isn't that the point to reserve() size so you can access it?
Add-on
In response to the answers, I would like to ask stl folks, Why would anybody use reserve() when resize() does exactly the same and it also initialize the string? I have to say I don't appreciate the performance argument in this case that much. All that resize() does additional to what reserve() does is that it merely initialize the buffer which we know is always nice to do anyways. Can we vote reserve() off the island?

Isn't that the point to reserve() size so you can access it?
No, that's the point of resize().
reserve() only gives to enough room so that future call that leads to increase of the size (e.g. calling push_back()) will be more efficient.
From your use case it looks like you should use .push_back() instead.
my_string.reserve( 20 );
for ( parsing_something_else_loop )
{
char ch = <business_logic>;
my_string.push_back(ch);
}
How is it that the string has the capacity but can't really access it with []?
Calling .reserve() is like blowing up mountains to give you some free land. The amount of free land is the .capacity(). The land is there but that doesn't mean you can live there. You have to build houses in order to move in. The number of houses is the .size() (= .length()).
Suppose you are building a city, but after building the 50th you found that there is not enough land, so you need to found another place large enough to fit the 51st house, and then migrate the whole population there. This is extremely inefficient. If you knew you need to build 1000 houses up-front, then you can call
my_string.reserve(1000);
to get enough land to build 1000 houses, and then you call
my_string.push_back(ch);
to construct the house with the assignment of ch to this location. The capacity is 1000, but the size is still 1. You may not say
my_string[16] = 'c';
because the house #16 does not exist yet. You may call
my_string.resize(20);
to get houses #0 ~ #19 built in one go, which is why
my_string[i++] = ch;
works fine (as long as 0 ≤ i ≤ 19).
See also http://en.wikipedia.org/wiki/Dynamic_array.
For your add-on question,
.resize() cannot completely replace .reserve(), because (1) you don't always need to use up all allocated spaces, and (2) default construction + copy assignment is a two-step process, which could take more time than constructing directly (esp. for large objects), i.e.
#include <vector>
#include <unistd.h>
struct SlowObject
{
SlowObject() { sleep(1); }
SlowObject(const SlowObject& other) { sleep(1); }
SlowObject& operator=(const SlowObject& other) { sleep(1); return *this; }
};
int main()
{
std::vector<SlowObject> my_vector;
my_vector.resize(3);
for (int i = 0; i < 3; ++ i)
my_vector[i] = SlowObject();
return 0;
}
Will waste you at least 9 seconds to run, while
int main()
{
std::vector<SlowObject> my_vector;
my_vector.reserve(3);
for (int i = 0; i < 3; ++ i)
my_vector.push_back(SlowObject());
return 0;
}
wastes only 6 seconds.
std::string only copies std::vector's interface here.

No -- the point of reserve is to prevent re-allocation. resize sets the usable size, reserve does not -- it just sets an amount of space that's reserved, but not yet directly usable.
Here's one example -- we're going to create a 1000-character random string:
static const int size = 1000;
std::string x;
x.reserve(size);
for (int i=0; i<size; i++)
x.push_back((char)rand());
reserve is primarily an optimization tool though -- most code that works with reserve should also work (just, possibly, a little more slowly) without calling reserve. The one exception to that is that reserve can ensure that iterators remain valid, when they wouldn't without the call to reserve.

The capacity is the length of the actual buffer, but that buffer is private to the string; in other words, it is not yours to access. The std::string of the standard library may allocate more memory than is required to storing the actual characters of the string. The capacity is the total allocated length. However, accessing characters outside s.begin() and s.end() is still illegal.
You call reserve in cases when you anticipate resizing of the string to avoid unnecessary re-allocations. For example, if you are planning to concatenate ten 20-character strings in a loop, it may make sense to reserve 201 characters (an extra one is for the zero terminator) for your string, rather than expanding it several times from its default size.

reserve(n) indeed allocates enough storage to hold at least n elements, but it doesn't actually fill the container with any elements. The string is still empty (has size 0), but you are guaranteed, that you can add (e.g. through push_back or insert) at least n elements before the string's internal buffer needs to be reallocated, whereas resize(n) really resizes the string to contain n elements (and deletes or adds new elements if neccessary).
So reserve is actually a mere optimization facility, when you know you are adding a bunch of elements to the container (e.g. in a push_back loop) and don't want it to reallocate the storage too often, which incurs memory allocation and copying costs. But it doesn't change the outside/client view of the string. It still stays empty (or keeps its current element count).
Likewise capacity returns the number of elements the string can hold until it needs to reallocate its internal storage, whereas size (and for string also length) returns the actual number of elements in the string.

Just because reserve allocates additional space does not mean it is legitimate for you to access it.
In your example, either use resize, or rewrite it to something like this:
string my_string;
// I want my string to have 20 bytes long buffer
my_string.reserve( 20 );
int i = 0;
for ( parsing_something_else_loop )
{
char ch = <business_logic>;
// store the character in
my_string += ch;
}

std::vector instead of std::string might also be a solution - if there are no requirements against it.
vector<char> v; // empty vector
vector<char> v(10); // vector with space for 10 elements, here char's
Your example:
vector<char> my_string(20);
int i=0;
for ( parsing_something_else_loop )
{
char ch = <business_logic>;
my_string[i++] = ch;
}

Performance comparison of STL sort on vector of strings vs. vector of string pointers

I tried to compare the performance of STL sort on vector of strings and vector of pointers to strings.
I expected the pointers version to outperform, but the actual results for 5 million randomly generated strings are
vector of strings : 12.06 seconds
vector of pointers to strings : 16.75 seconds
What explains this behavior? I expected swapping pointers to strings should be faster than swapping string objects.
The 5 million strings were generated by converting random integers.
Compiled with (gcc 4.9.3): g++ -std=c++11 -Wall
CPU: Xeon X5650
// sort vector of strings
int main(int argc, char *argv[])
{
const int numElements=5000000;
srand(time(NULL));
vector<string> vec(numElements);
for (int i = 0; i < numElements; i++)
vec[i] = std::to_string(rand() % numElements);
unsigned before = clock();
sort(vec.begin(), vec.end());
cout<< "Time to sort: " << clock() - before << endl;
for (int i = 0; i < numElements; i++)
cout << vec[i] << endl;
return 0;
}
// sort vector of pointers to strings
bool comparePtrToString (string *s1, string *s2)
{
return (*s1 < *s2);
}
int main(int argc, char *argv[])
{
const int numElements=5000000;
srand(time(NULL));
vector<string *> vec(numElements);
for (int i = 0; i < numElements; i++)
vec[i] = new string( to_string(rand() % numElements));
unsigned before = clock();
sort(vec.begin(), vec.end(), comparePtrToString);
cout<< "Time to sort: " << clock() - before << endl;
for (int i = 0; i < numElements; i++)
cout << *vec[i] << endl;
return 0;
}

This is because all the operations that sort performs on strings is moves and swaps. Both move and swap for an std::string are constant time operations, meaning that they only involve changing some pointers.
Therefore, for both sorts moving of the data has the same performance overhead. However, in case of pointers to strings you pay some extra cost to dereference the pointers on each comparison, which causes it to be noticeably slower.

In the first case the internal pointers to representations of the strings are swapped and not the complete data copied.
You should not expect any benefit from the implementation with pointers, which in fact is slower, since the pointers have to be dereferenced additionally, to perform the comparison.

What explains this behavior? I expected swapping pointers to strings
should be faster than swapping string objects.
There's various things going on here which could impact performance.
Swapping is relatively cheap both ways. Swapping strings tends to always be a shallow operation (just swapping PODs like pointers and integrals) for large strings and possibly deep for small strings (but still quite cheap -- implementation-dependent). So swapping strings tends to be pretty cheap overall, and typically not much more expensive than simply swapping pointers to them*.
[sizeof(string) is certanly bigger than sizeof(string*), but it's not an astronomical difference basically as the operation still
occurs in constant-time, and quite a bit cheaper in this context
when the string fields already have to be fetched into a faster form
of memory for the comparator, giving us temporal locality with
respect to its fields.]
String contents must be accessed anyway both ways. Even the pointer version of your comparator has to examine the string contents (including the fields designating size and capacity). As a result, we end up paying the memory cost of fetching the data for the string contents regardless. Naturally if you just sorted the strings by pointer address (ex: without using a comparator) instead of a lexicographical comparison of the string contents, the performance edge should shift towards the pointer version since that would reduce the amount of data accessed considerably while improving spatial locality (more pointers can fit in a cache line than strings, e.g.).
The pointer version is scattering (or at least increasing the stride of) the string fields in memory. For the pointer version, you're allocating each string on the free store (in addition to the string contents which may or may not be allocated on the free store). That can disperse the memory and reduce locality of reference, so you're potentially incurring a greater cost in the comparator that way with increased cache misses. Even if a sequential allocation of this sort results in a very contiguous set of pages being allocated (ideal scenario), the stride to get from one string's fields to the next would tend to get at least a little larger because of the allocation metadata/alignment overhead (not all allocators require metadata to be stored directly in a chunk, but typically they will at least add some small overhead to the chunk size).
It might be simpler to attribute this to the cost of dereferencing the pointer but it's not so much the cost of the mov/load instruction doing the memory addressing that's expensive (in this relative context) as loading from slower/bigger forms of memory that aren't already cached/paged to faster, smaller memory. Allocating each string individually on the free store will typically increase this cost whether it's due to a loss of contiguity or a larger constant stride between each string entry (in an ideal case).
Even at a basic level without trying too hard to diagnose what's happening at the memory level, this increases the total size of the data that the machine has to look at (string contents/fields + pointer address) in addition to reduced locality/larger or variable strides (typically if you increase the amount of data accessed, it has to at least have improved locality to have a good chance of being beneficial). You might start to see more comparable times if you just sorted pointers to strings that were allocated contiguously (not in terms of the string contents which we have no control over, but just contiguous in terms of the adjacent string objects themselves -- effectively pointers to strings stored in an array). Then you'd get back the spatial locality at least for the string fields in addition to packing the data associated more tightly within a contiguous space.
Swapping smaller data types like indices or pointers can sometimes offer a benefit but they typically need to avoid examining the original contents of the data they refer to or provide a significantly cheaper swap/move behavior (in this case string is already cheap and becomes cheaper in this context considering temporal locality) or both.

Well, a std::string is typically about 3-4 times as big as a std::string*.
So just straight-up swapping two of the former shuffles that much more memory around.
But that is dwarfed by the following effects:
Locality of reference. You need to follow one more pointer to a random position to read the string.
More memory-usage: A pointer plus bookkeeping per allocation of each std::string.
Both put extra demand on caching, and the former cannot even be prefetched.

Swaping containers change just container's content, in string case is the pointer to first character of string, not whole string.
In case vectors of pointers of strings you performed one additional step - casting pointers

A better array shifting algorithm?

I have an assignment that requires me to sort a heap based C style array of names as they're being read rather than reading them all and then sorting. This involves a lot of shifting the contents of the array by one to allow new names to be inserted. I'm using the code below but it's extremely slow. Is there anything else I could be doing to optimize it without changing the type of storage?
//the data member
string *_storedNames = new string[4000];
//together boundary and index define the range of elements to the right by one
for(int k = p_boundary - 1;k > index;k--)
_storedNames[k]=_storedNames[k - 1];
EDIT2:
As suggested by Cartroo I'm attempting to use memmove with the dynamic data that uses malloc. Currently this shifts the data correctly but once again fails in the deallocation process. Am I missing something?
int numberOfStrings = 10, MAX_STRING_SIZE = 32;
char **array = (char **)malloc(numberOfStrings);
for(int i = 0; i < numberOfStrings; i++)
array[i] = (char *)malloc(MAX_STRING_SIZE);
array[0] = "hello world",array[2] = "sample";
//the range of data to move
int index = 1, boundary = 4;
int sizeToMove = (boundary - index) * sizeof(MAX_STRING_SIZE);
memcpy(&array[index + 1], &array[index], sizeToMove);
free(array);

If you're after minimal changes to your approach, you could use the memmove() function, which is potentially faster than your own manual version. You can't use memcpy() as advised by one commenter as the areas of memory aren't permitted to overlap (the behaviour is undefined if they do).
There isn't a lot else you can do without changing the type of your storage or your algorithm. However, if you change to using a linked list then the operation becomes significantly more efficient, although you will be doing more memory allocation. If the allocation is really a problem (and unless you're on a limited embedded system it probably isn't) then pool allocators or similar approaches could help.
EDIT: Re-reading your question, I'm guessing that you're not actually using Heapsort, you just mean that your array was allocated on the heap (i.e. using malloc()) and you're doing a simple insertion sort. In which case, the information below isn't much use to you directly, although you should be aware that insertion sort is quite inefficient compared to a bulk insert followed by a better sorting algorithm (e.g. Quicksort which you can implement using the standard library qsort() function). If you only ever need the lowest (or highest) item instead of full sorted order then Heapsort is still useful reading.
If you're using a standard Heapsort then you shouldn't need this operation at all - items are appended at the end of the array, and then the "heapify" operation is used to swap them into the correct position in the heap. Each swap just requires a single temporary variable to swap two items - it doesn't require anything to be shuffled down as in your code snippet. It does require everything in the array to be the same size (either a fixed size in-place string or, more likely, a pointer), but your code already seems to assume that anyway (and using variable length strings in a standard char array would be a pretty strange thing to be doing).
Note that strictly speaking Heapsort operates on a binary tree. Since you're dealing with an array I assume you're using the implementation where a contiguous array is used where the children of node at index n are stored at indicies 2n and 2n+1 respectively. If this isn't the case, or you're not using a Heapsort at all, you should explain in more detail what you're trying to do to get a more helpful answer.
EDIT: The following is in response to you updated code above.
The main reason you see a problem during deallocation is if you trampled some memory - in other words, you're copying something outside the size of the area you allocated. This is a really bad thing to do as you overwrite the values that the system is using to track your allocations and cause all sorts of problems which usually result in your program crashing.
You seem to have some slight confusion as to the nature of memory allocation and deallocation, first of all. You allocate an array of char*, which on its own is fine. You then allocate arrays of char for each string, which is also fine. However, you then just call free() for the initial array - this isn't enough. There needs to be a call to free() to match each call to malloc(), so you need to free each string that you allocate and then free the initial array.
Secondly, you set sizeToMove to a multiple of sizeof(MAX_STRING_SIZE), which almost certainly isn't what you want. This is the size of the variable used to store the MAX_STRING_SIZE constant. Instead you want sizeof(char*). On some platforms these may be the same, in which case things will still work, but there's no guarantee of that. For example, I'd expect it to work on a 32-bit platform (where int and char* are the same size) but not on a 64-bit platform (where they're not).
Thirdly, you can't just assign a string constant (e.g. "hello world") to an allocated block - what you're doing here is replacing the pointer. You need to use something like strncpy() or memcpy() to copy the string into the allocated block. I suggest snprintf() for convenience because strncpy() has the problem that it doesn't guarantee to nul-terminate the result, but it's up to you.
Fourthly, you're still using memcpy() and not memmove() to shuffle items around.
Finally, I've just seen your comment that you have to use new and delete. There is no equivalent of realloc() for these, but that's OK if everything is known in advance. It looks like what you're trying to do is something like this:
bool addItem(const char *item, char *list[], size_t listSize, size_t listMaxSize)
{
// Check if list is full.
if (listSize >= listMaxSize) {
return false;
}
// Insert item inside list.
for (unsigned int i = 0; i < listSize; ++i) {
if (strcmp(list[i], item) > 0) {
memmove(list + i + 1, list + i, sizeof(char*) * (listSize - i));
list[i] = item;
return true;
}
}
// Append item to list.
list[listSize] = item;
return true;
}
I haven't compiled and checked that, so watch out for off-by-one errors and the like, but hopefully you get the idea. This function should work whether you use malloc() and free() or new and delete, but it assumes that you've already copied the string item into an allocated buffer that you will keep around, because of course it just stores a pointer.
Remember that of course you need to update listSize yourself outside this function - this just inserts an item into the correct point in the array for you. If the function returns true then increment your copy of listSize by 1 - if it returns false then you didn't allocate enough memory so your item wasn't added.
Also note that in C and C++, for the array list the syntax &list[i] and list + i are totally equivalent - use the first one instead within the memmove() call if you find it easier to understand.

I think what you're looking for is heapsort: http://en.wikipedia.org/wiki/Heapsort#Pseudocode
An array is a common way to implement a binary search tree (i.e. a tree in which left children are smaller than the current node and right children are larger than the current node).
Heapsort sorts an array of a specified length. In your case, since the size of the array is going to increase "online", all you need to do is to call change the input size that you pass to heapsort (i.e. increase the number of elements being considered by 1).

Since your array is sorted and you can't use any other data structure your best bet is likely to perform a binary search, then to shift the array up one to free space at the position for insertion and then insert the new element at that position.

To minimise the cost of shifting the array, you could make it an array of pointers to string:
string **_storedNames = new string*[4000];
Now you can use memmove (although you might find now that an element-by-element copy is fast enough). But you will have to manage the allocation and deletion of the individual strings yourself, and this is somewhat error-prone.
Other posters who recommend using memmove on your original array don't seem to have noticed that each array element is a string (not a string* !). You can't use memmove or memcpy on a class like this.

std::strings's capacity(), reserve() & resize() functions

I wan to use std::string simply to create a dynamic buffer and than iterate through it using an index. Is resize() the only function to actually allocate the buffer?
I tried reserve() but when I try to access the string via index it asserts. Also when the string's default capacity seems to be 15 bytes (in my case) but if I still can't access it as my_string[1].
So the capacity of the string is not the actual buffer? Also reserve() also does't allocate the actual buffer?
string my_string;
// I want my string to have 20 bytes long buffer
my_string.reserve( 20 );
int i = 0;
for ( parsing_something_else_loop )
{
char ch = <business_logic>;
// store the character in
my_string[i++] = ch; // this crashes
}
If I do resize() instead of reserve() than it works fine. How is it that the string has the capacity but can't really access it with []? Isn't that the point to reserve() size so you can access it?
Add-on
In response to the answers, I would like to ask stl folks, Why would anybody use reserve() when resize() does exactly the same and it also initialize the string? I have to say I don't appreciate the performance argument in this case that much. All that resize() does additional to what reserve() does is that it merely initialize the buffer which we know is always nice to do anyways. Can we vote reserve() off the island?

Isn't that the point to reserve() size so you can access it?
No, that's the point of resize().
reserve() only gives to enough room so that future call that leads to increase of the size (e.g. calling push_back()) will be more efficient.
From your use case it looks like you should use .push_back() instead.
my_string.reserve( 20 );
for ( parsing_something_else_loop )
{
char ch = <business_logic>;
my_string.push_back(ch);
}
How is it that the string has the capacity but can't really access it with []?
Calling .reserve() is like blowing up mountains to give you some free land. The amount of free land is the .capacity(). The land is there but that doesn't mean you can live there. You have to build houses in order to move in. The number of houses is the .size() (= .length()).
Suppose you are building a city, but after building the 50th you found that there is not enough land, so you need to found another place large enough to fit the 51st house, and then migrate the whole population there. This is extremely inefficient. If you knew you need to build 1000 houses up-front, then you can call
my_string.reserve(1000);
to get enough land to build 1000 houses, and then you call
my_string.push_back(ch);
to construct the house with the assignment of ch to this location. The capacity is 1000, but the size is still 1. You may not say
my_string[16] = 'c';
because the house #16 does not exist yet. You may call
my_string.resize(20);
to get houses #0 ~ #19 built in one go, which is why
my_string[i++] = ch;
works fine (as long as 0 ≤ i ≤ 19).
See also http://en.wikipedia.org/wiki/Dynamic_array.
For your add-on question,
.resize() cannot completely replace .reserve(), because (1) you don't always need to use up all allocated spaces, and (2) default construction + copy assignment is a two-step process, which could take more time than constructing directly (esp. for large objects), i.e.
#include <vector>
#include <unistd.h>
struct SlowObject
{
SlowObject() { sleep(1); }
SlowObject(const SlowObject& other) { sleep(1); }
SlowObject& operator=(const SlowObject& other) { sleep(1); return *this; }
};
int main()
{
std::vector<SlowObject> my_vector;
my_vector.resize(3);
for (int i = 0; i < 3; ++ i)
my_vector[i] = SlowObject();
return 0;
}
Will waste you at least 9 seconds to run, while
int main()
{
std::vector<SlowObject> my_vector;
my_vector.reserve(3);
for (int i = 0; i < 3; ++ i)
my_vector.push_back(SlowObject());
return 0;
}
wastes only 6 seconds.
std::string only copies std::vector's interface here.

No -- the point of reserve is to prevent re-allocation. resize sets the usable size, reserve does not -- it just sets an amount of space that's reserved, but not yet directly usable.
Here's one example -- we're going to create a 1000-character random string:
static const int size = 1000;
std::string x;
x.reserve(size);
for (int i=0; i<size; i++)
x.push_back((char)rand());
reserve is primarily an optimization tool though -- most code that works with reserve should also work (just, possibly, a little more slowly) without calling reserve. The one exception to that is that reserve can ensure that iterators remain valid, when they wouldn't without the call to reserve.

The capacity is the length of the actual buffer, but that buffer is private to the string; in other words, it is not yours to access. The std::string of the standard library may allocate more memory than is required to storing the actual characters of the string. The capacity is the total allocated length. However, accessing characters outside s.begin() and s.end() is still illegal.
You call reserve in cases when you anticipate resizing of the string to avoid unnecessary re-allocations. For example, if you are planning to concatenate ten 20-character strings in a loop, it may make sense to reserve 201 characters (an extra one is for the zero terminator) for your string, rather than expanding it several times from its default size.

reserve(n) indeed allocates enough storage to hold at least n elements, but it doesn't actually fill the container with any elements. The string is still empty (has size 0), but you are guaranteed, that you can add (e.g. through push_back or insert) at least n elements before the string's internal buffer needs to be reallocated, whereas resize(n) really resizes the string to contain n elements (and deletes or adds new elements if neccessary).
So reserve is actually a mere optimization facility, when you know you are adding a bunch of elements to the container (e.g. in a push_back loop) and don't want it to reallocate the storage too often, which incurs memory allocation and copying costs. But it doesn't change the outside/client view of the string. It still stays empty (or keeps its current element count).
Likewise capacity returns the number of elements the string can hold until it needs to reallocate its internal storage, whereas size (and for string also length) returns the actual number of elements in the string.

Just because reserve allocates additional space does not mean it is legitimate for you to access it.
In your example, either use resize, or rewrite it to something like this:
string my_string;
// I want my string to have 20 bytes long buffer
my_string.reserve( 20 );
int i = 0;
for ( parsing_something_else_loop )
{
char ch = <business_logic>;
// store the character in
my_string += ch;
}

std::vector instead of std::string might also be a solution - if there are no requirements against it.
vector<char> v; // empty vector
vector<char> v(10); // vector with space for 10 elements, here char's
Your example:
vector<char> my_string(20);
int i=0;
for ( parsing_something_else_loop )
{
char ch = <business_logic>;
my_string[i++] = ch;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

hashtable needs lots of memory - c++

Related

Which container should I use for random access, cheap addition and removal (without de/allocation), with a known maximum size?

set the size of a string object [duplicate]

Performance comparison of STL sort on vector of strings vs. vector of string pointers

A better array shifting algorithm?

std::strings's capacity(), reserve() & resize() functions

Categories

Resources