Pointer comparison issue - C++

I'm having a problem with a pointer and can't get around it.
In a HashTable implementation, I have a list of ordered nodes in each bucket. The problem is in the insert function, in the comparison that checks whether the next node is greater than the current one (in order to insert in that position and keep the order).
You might find this hash implementation strange, but I need to be able to do tons of lookups (though sometimes also very few) and count the number of repetitions if a value is already inserted (so I need fast lookups, hence the hash; I've thought about self-balancing trees such as AVL or red-black trees, but I don't know them, so I went with the solution I knew how to implement... are they faster for this type of problem?). But I also need to retrieve the values in order when I've finished.
Before, I had a simple list; I'd retrieve the array and then do a QuickSort, but I think I might be able to improve things by keeping the lists ordered.
What I have to map is a 27-bit unsigned int (more exactly, three 9-bit numbers, which I convert to a 27-bit number by doing (Sr << 18 | Sg << 9 | Sb)), which at the same time becomes the hash value. If you know a good function to map that 27-bit int to a 12-, 13- or 14-bit table, let me know; I currently just do the typical mod-prime solution.
This is my hash_node struct:
class hash_node {
public:
    unsigned int hash_value;
    int repetitions;
    hash_node *next;

    hash_node(unsigned int hash_val, hash_node *nxt);
    ~hash_node();
};
And this is the source of the problem
void hash_table::insert(unsigned int hash_value) {
    unsigned int p = hash_value % tableSize;
    if (table[p] != 0) { // The bucket already has some elements
        hash_node *pred; // node to keep the last valid position on the list
        for (hash_node *aux = table[p]; aux != 0; aux = aux->next) {
            pred = aux; // last valid position
            if (aux->hash_value == hash_value) {
                // It's already inserted, so we increment its repetition counter
                aux->repetitions++;
            } else if (hash_value < (aux->next->hash_value)) { // The problem
                // If the next one is greater than the one to insert, we
                // create a node between the two.
                aux->next = new hash_node(hash_value, aux->next);
                colisions++;
                numElem++;
            }
        }
        // We have arrived at the end of the list without luck, so we insert it
        // after the last valid position
        pred->next = new hash_node(hash_value, 0);
        colisions++;
        numElem++;
    } else { // The bucket is empty, insert right away.
        table[p] = new hash_node(hash_value, 0);
        numElem++;
    }
}
This is what gdb shows:
Program received signal SIGSEGV, Segmentation fault.
0x08050b4b in hash_table::insert (this=0x806a310, hash_value=3163181) at ht.cc:132
132 } else if (hash_value < (aux->next->hash_value) ) {
Which effectively indicates I'm comparing a memory address with a value, right?
Hope it was clear. Thanks again!

aux->next->hash_value
There's no check whether "next" is NULL.

aux->next might be NULL at that point? I can't see where you have checked whether aux->next is NULL.
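Something like the following would keep each bucket ordered without ever dereferencing a null next pointer. This is only a sketch reusing the names from the question (tableSize, colisions, numElem), and it assumes the hash_node constructor initializes repetitions:

void hash_table::insert(unsigned int hash_value) {
    unsigned int p = hash_value % tableSize;
    // Case 1: empty bucket, or the new value belongs strictly before the head.
    if (table[p] == 0 || hash_value < table[p]->hash_value) {
        table[p] = new hash_node(hash_value, table[p]);
        if (table[p]->next != 0) colisions++;
        numElem++;
        return;
    }
    // Case 2: the head itself is a duplicate.
    if (table[p]->hash_value == hash_value) {
        table[p]->repetitions++;
        return;
    }
    // Case 3: walk the bucket; advance while the *next* node exists and is smaller.
    hash_node *aux = table[p];
    while (aux->next != 0 && aux->next->hash_value < hash_value)
        aux = aux->next;
    if (aux->next != 0 && aux->next->hash_value == hash_value) {
        aux->next->repetitions++;          // already present: just count it
    } else {
        // aux->next is either 0 (end of list) or the first node greater than
        // hash_value, so inserting here keeps the order and never touches a null node.
        aux->next = new hash_node(hash_value, aux->next);
        colisions++;
        numElem++;
    }
}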

Related

Multiple Hash Tables for the Word Count Project

I already wrote a working project, but my problem is that it is way slower than what I aimed for in the first place, so I have some ideas about how to improve it, but I don't know how to implement these ideas or whether I should actually implement them in the first place.
The topic of my project is reading a CSV (Excel) file full of tweets and counting every single word in it, then displaying the most used words.
(Every row of the file holds information about the tweet plus the tweet itself; I should only care about the tweet.)
Instead of sharing the whole code I will simply write what I did so far and only ask about the part I am struggling with.
First of all, I want to apologize because it will be a long question.
Important note: the only thing I should focus on is speed; storage or size is not a problem.
All the steps:
Read a new line from the CSV file.
Find the "tweet" part of the line and store it as a string.
Split the tweet string into words and store them in an array.
For every word stored in the array, calculate the ASCII value of the word.
(For finding the ASCII value of a word I simply sum the ASCII values of its letters.)
Put the word into the hash table with the ASCII value as the key.
(Example: the word "hello" has the ASCII value 104+101+108+108+111 = 532, so it is stored with key 532 in the hash table.)
In the hash table only the word (as a string) and the key value (as an int) are stored; the count of the words (how often the same word is used) is stored in a separate array.
I will share the Insert function (for inserting something into the hash table) because I believe this part would be confusing if I tried to explain it without code.
void Insert(int key, string value) //Key (where we want to add), Value (what we want to add)
{
    if (key < 0) key = 0; //If key is somehow less than 0, clamp it to 0 to avoid errors.
    if (table[key] != NULL) //If there is already something in the hash table
    {
        if (table[key]->value == value) //If the existing value is the same as the value we want to add
        {
            countArray[key][0]++;
        }
        else //If the value is different,
        {
            Insert(key + 100, value); //Call this function again with a key 100 larger than before.
        }
    }
    else //There is nothing saved in this place, so save this value
    {
        table[key] = new HashEntry(key, value);
        countArray[key][1] = key;
        countArray[key][0]++;
    }
}
So "Insert" function has three-part.
Add the value to hash table if hast table with the given key is empty.
If hast table with the given key is not empty that means we already put a word with this ascii value.
Because different words can have exact same ascii value.
The program first checks if this is the same word.
If it is, count increase (In the count array).
If not, Insert function is again called with the key value of (same key value + 100) until empty space or same value is found.
After whole lines are read and every word is stored in Hashtable ->
Sort the Count array
Print the first 10 element
This is the end of the program, so what is the problem?
Now my biggest problem is I am reading a very huge CSV file with thousands of rows, so every unnecessary thing increases the time noticeably.
My second problem is there is a lot of values with the same ASCII value, my method of checking hundred more than normal ascii value methods work, but ? for finding the empty space or the same word, Insert function call itself hundred times per word.
(Which caused the most problem).
So I thought about using multiple hash tables.
For example, I can check the first letter of the word and if it is
between A and E, store it in the first hash table,
between F and J, store it in the second hash table,
...
between V and Z, store it in the last hash table.
Important note again: the only thing I should focus on is speed; storage or size is not a problem.
This way conflicts should mostly be minimized.
I can even create an absurd number of hash tables and use a different hash table for every different letter.
But I am not sure if that is the logical thing to do, or maybe there are much simpler methods I can use for this.
If it is okay to use multiple hash tables, instead of creating the hash tables one by one, is it possible to create an array which stores a hash table in every location?
(Same as an array of arrays, but this time the array stores hash tables.)
If it is possible and logical, can someone show how to do it?
This is the hash table I have:
class HashEntry
{
public:
    int key;
    string value;
    HashEntry(int key, string value)
    {
        this->key = key;
        this->value = value;
    }
};

class HashMap
{
private:
    HashEntry **table;
public:
    HashMap()
    {
        table = new HashEntry *[TABLE_SIZE];
        for (int i = 0; i < TABLE_SIZE; i++)
        {
            table[i] = NULL;
        }
    }
    //Functions
};
I am very sorry for such a long question and I am again very sorry if I couldn't explain every part clearly enough; English is not my mother language.
Also, one last note: I am doing this for a school project, so I shouldn't use any third-party software or include any different libraries because that is not allowed.
You are using a very bad hash function (adding all the characters); that's why you get so many collisions and why your Insert method calls itself so many times as a result.
For a detailed overview of different hash functions see the answer to this question. I suggest you try DJB2 or FNV-1a (which is used in some implementations of std::unordered_map).
You should also use more localized "probes" for the empty place to improve cache-locality and use a loop instead of recursion in your Insert method.
But first I suggest you tweak your HashEntry a little:
class HashEntry
{
public:
    string key;   // the word is actually a key, no need to store hash value
    size_t value; // the word count is the value.
    HashEntry(string key)
        : key(std::move(key)), value(1) // move the string to avoid unnecessary copying
    { }
};
Then let's try to use a better hash function:
// DJB2 hash-function
size_t Hash(const string &key)
{
    size_t hash = 5381;
    for (auto &&c : key)
        hash = ((hash << 5) + hash) + c;
    return hash;
}
Then rewrite the Insert function:
void Insert(string key)
{
    size_t index = Hash(key) % TABLE_SIZE;
    while (table[index] != nullptr) {
        if (table[index]->key == key) {
            ++table[index]->value;
            return;
        }
        ++index;
        if (index == TABLE_SIZE) // "wrap around" if we've reached the end of the hash table
            index = 0;
    }
    table[index] = new HashEntry(std::move(key));
}
To find the hash table entry by key you can use a similar approach:
HashEntry *Find(const string &key)
{
    size_t index = Hash(key) % TABLE_SIZE;
    while (table[index] != nullptr) {
        if (table[index]->key == key) {
            return table[index];
        }
        ++index;
        if (index == TABLE_SIZE)
            index = 0;
    }
    return nullptr;
}
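The FNV-1a variant mentioned above is just as easy to drop in place of the DJB2 Hash shown here; a sketch of the common 32-bit parameters (Hash(key) % TABLE_SIZE works the same way with it):

#include <cstdint>
#include <string>

// FNV-1a, 32-bit variant: XOR each byte into the hash, then multiply by the FNV prime.
uint32_t Hash(const std::string &key)
{
    uint32_t hash = 2166136261u;      // FNV offset basis
    for (unsigned char c : key)
    {
        hash ^= c;
        hash *= 16777619u;            // FNV prime
    }
    return hash;
}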

Low Memory Shortest Path Algorithm

I have a global unique path table which can be thought of as a directed un-weighted graph. Each node represents either a piece of physical hardware which is being controlled, or a unique location in the system. The table contains the following for each node:
A unique path ID (int)
Type of component (char - 'A' or 'L')
String which contains a comma separated list of path ID's which that node is connected to (char[])
I need to create a function which, given a starting and an ending node, finds the shortest path between the two nodes. Normally this is a pretty simple problem, but here is the issue I am having: I have a very limited amount of memory/resources, so I cannot use any dynamic memory allocation (i.e. a queue/linked list). It would also be nice if it weren't recursive (but it wouldn't be too big of an issue if it were, as the table/graph itself is really small. Currently it has 26 nodes, 8 of which will never be hit. At worst there would be about 40 nodes total).
I started putting something together, but it doesn't always find the shortest path. The pseudocode is below:
bool shortestPath(int start, int end)
if start == end
if pathTable[start].nodeType == 'A'
Turn on part
end if
return true
else
mark the current node
bool val
for each node in connectedNodes
if node is not marked
val = shortestPath(node.PathID, end)
end if
end for
if val == true
if pathTable[start].nodeType == 'A'
turn on part
end if
return true
end if
end if
return false
end function
Anyone have any ideas how to either fix this code, or know something else that I could use to make it work?
----------------- EDIT -----------------
Taking Aasmund's advice, I looked into implementing a breadth-first search. Below is some C# code which I quickly threw together using some pseudocode I found online.
Pseudocode found online:
Input: A graph G and a root v of G
procedure BFS(G,v):
create a queue Q
enqueue v onto Q
mark v
while Q is not empty:
t ← Q.dequeue()
if t is what we are looking for:
return t
for all edges e in G.adjacentEdges(t) do
u ← G.adjacentVertex(t,e)
if u is not marked:
mark u
enqueue u onto Q
return none
C# code which I wrote using this code:
public static bool newCheckPath(int source, int dest)
{
    Queue<PathRecord> Q = new Queue<PathRecord>();
    Q.Enqueue(pathTable[source]);
    pathTable[source].markVisited();
    while (Q.Count != 0)
    {
        PathRecord t = Q.Dequeue();
        if (t.pathID == pathTable[dest].pathID)
        {
            return true;
        }
        else
        {
            string connectedPaths = pathTable[t.pathID].connectedPathID;
            for (int x = 0; x < connectedPaths.Length && connectedPaths != "00"; x = x + 3)
            {
                int nextNode = Convert.ToInt32(connectedPaths.Substring(x, 2));
                PathRecord u = pathTable[nextNode];
                if (!u.wasVisited())
                {
                    u.markVisited();
                    Q.Enqueue(u);
                }
            }
        }
    }
    return false;
}
This code runs just fine; however, it only tells me whether a path exists. That doesn't really work for me. Ideally, in the block "if (t.pathID == pathTable[dest].pathID)" I would like to have either a list or some other way to see which nodes I had to pass through to get from the source to the destination, so that I can process those nodes there rather than returning a list to process elsewhere. Any ideas on how I could make that change?
The most effective solution, if you're willing to use static memory allocation (or automatic, as I seem to recall that the C++ term is), is to declare a fixed-size int array (of size 41, if you're absolutely certain that the number of nodes will never exceed 40). By using two indices to indicate the start and end of the queue, you can use this array as a ring buffer, which can act as the queue in a breadth-first search.
Alternatively: Since the number of nodes is so small, Bellman-Ford may be fast enough. The algorithm is simple to implement, does not use recursion, and the required extra memory is only a distance (int, or even byte in your case) and a predecessor id (int) per node. The running time is O(VE), alternatively O(V^3), where V is the number of nodes and E is the number of edges.
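A minimal C++ sketch of the fixed-array approach described above, with a predecessor array added so the path can be walked back once the destination is reached. The adjacency arrays adj/degree and the MAX_NODES constant are illustrative stand-ins for the real path table, which would be parsed out of the connected-ID strings:

#include <cstdio>

const int MAX_NODES = 41;   // assumption: the graph never exceeds 40 nodes

bool shortestPath(const int adj[][MAX_NODES], const int degree[],
                  int numNodes, int start, int end)
{
    int queue[MAX_NODES];   // fixed-size ring buffer used as the BFS queue
    int head = 0, tail = 0;
    int pred[MAX_NODES];    // pred[n] == -1 means "not visited yet"
    for (int i = 0; i < numNodes; ++i) pred[i] = -1;

    pred[start] = start;    // mark the start node as visited
    queue[tail] = start;
    tail = (tail + 1) % MAX_NODES;

    while (head != tail)
    {
        int t = queue[head];
        head = (head + 1) % MAX_NODES;

        if (t == end)
        {
            // Walk the predecessor chain backwards to visit the path found.
            for (int n = end; n != start; n = pred[n])
                printf("node %d\n", n);   // e.g. turn the part on here
            printf("node %d\n", start);
            return true;
        }
        for (int i = 0; i < degree[t]; ++i)
        {
            int u = adj[t][i];
            if (pred[u] == -1)            // not visited yet
            {
                pred[u] = t;              // remember how we reached u
                queue[tail] = u;
                tail = (tail + 1) % MAX_NODES;
            }
        }
    }
    return false;           // no path from start to end
}

Because each node is enqueued at most once and the buffer has one more slot than the maximum node count, the ring buffer can never overflow and head != tail is a safe emptiness test.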

C++ inserting (and shifting) data into an array

I am trying to insert data into a leaf node (an array) of a B-Tree. Here is the code I have so far:
void LeafNode::insertCorrectPosLeaf(int num)
{
    for (int pos = count; pos >= 0; pos--)  // goes through values in leaf node
    {
        if (num < values[pos-1])   // if inserting num < previous value in leaf node
        { continue; }              // continue searching for correct place
        else                       // if inserting num >= previous value in leaf node
        {
            values[pos] = num;     // inserts in position
            break;
        }
    }
    count++;
} // insertCorrectPos()
Before the line values[pos] = num, I think I need to write some code that shifts the existing data instead of overwriting it. I am trying to use memmove but have a question about it. Its third parameter is the number of bytes to copy. If I am moving a single int on a 64-bit machine, does this mean I would put a "4" here? If I am going about this completely wrong, any help would be greatly appreciated. Thanks
The easiest way (and probably the most efficient) would be to use one of the standard library's predefined containers to implement "values". I suggest either list or vector, because both have an insert function that does this for you. I suggest the vector class specifically because it has the same kind of interface that an array has. However, if you want to optimize the speed of this operation specifically, then I would suggest the list class because of the way it is implemented.
If you would rather do it the hard way, then here goes...
First, you need to make sure that you have the space to work in. You can either allocate dynamically:
int *values = new int[size];
or statically
int values[MAX_SIZE];
If you allocate statically, then you need to make sure that MAX_SIZE is some gigantic value that you will never ever exceed. Furthermore, you need to check the actual size of the array against the amount of allocated space every time you add an element.
if (size < MAX_SIZE-1)
{
// add an element
size++;
}
If you allocate dynamically, then you need to reallocate the whole array every time you add an element.
int *temp = new int[size+1];
for (int i = 0; i < size; i++)
temp[i] = values[i];
delete [] values;
values = temp;
temp = NULL;
// add the element
size++;
When you insert a new value, you need to shift every value over.
int temp = 0;
for (int i = 0; i < size+1; i++)
{
    if (values[i] > num || i == size)
    {
        temp = values[i];
        values[i] = num;
        num = temp;
    }
}
Keep in mind that this is not at all optimized. A truly magical implementation would combine the two allocation strategies by dynamically allocating more space than you need, then growing the array by blocks when you run out of space. This is exactly what the vector implementation does.
The list implementation uses a linked list, which has O(1) time for inserting a value because of its structure. However, it is much less space efficient and has O(n) time for accessing the element at location n.
Also, this code was written on the fly... be careful when using it. There might be a weird edge case that I am missing in the last code segment.
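Regarding the memmove question in the original post: its third argument is a byte count, so rather than hard-coding 4 you would multiply the number of elements to move by sizeof(int). A minimal sketch, assuming the array has room for at least count + 1 ints (insertSorted is just an illustrative name):

#include <cstring>  // memmove

// Insert num into the sorted array values[0..count-1], keeping it sorted.
void insertSorted(int values[], int &count, int num)
{
    int pos = 0;
    while (pos < count && values[pos] < num)   // find the first element >= num
        ++pos;

    // Shift everything from pos onwards one slot to the right.
    // memmove handles the overlapping ranges; the size is given in bytes.
    memmove(&values[pos + 1], &values[pos], (count - pos) * sizeof(int));

    values[pos] = num;
    ++count;
}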
Cheers!
Ned

priority queue with limited space: looking for a good algorithm

This is not homework.
I'm using a small "priority queue" (implemented as an array at the moment) for storing the last N items with the smallest values. This is a bit slow - O(N) item insertion time. The current implementation keeps track of the largest item in the array and discards any items that wouldn't fit, but I would still like to reduce the number of operations further.
I'm looking for a priority queue algorithm that matches the following requirements:
The queue can be implemented as an array, which has fixed size and _cannot_ grow. Dynamic memory allocation during any queue operation is strictly forbidden.
Anything that doesn't fit into the array is discarded, but the queue keeps all the smallest elements ever encountered.
O(log(N)) insertion time (i.e. adding an element to the queue should take at most O(log(N))).
(Optional) O(1) access to the *largest* item in the queue (the queue stores the *smallest* items, so the largest item will be discarded first and I'll need it to reduce the number of operations).
Easy to implement/understand. Ideally something similar to binary search - once you understand it, you remember it forever.
The elements need not be sorted in any way. I just need to keep the N smallest values ever encountered. When I need them, I'll access all of them at once. So technically it doesn't have to be a queue; I just need the N smallest values to be stored.
I initially thought about using binary heaps (they can easily be implemented via arrays), but apparently they don't behave well when the array can't grow any more. Linked lists and arrays would require extra time for moving things around. The STL priority queue grows and uses dynamic allocation (I may be wrong about that, though).
So, any other ideas?
--EDIT--
I'm not interested in the STL implementation. The STL implementation (suggested by a few people) works a bit slower than the currently used linear array due to the high number of function calls.
I'm interested in priority queue algorithms, not implementations.
Array based heaps seem ideal for your purpose. I am not sure why you rejected them.
You use a max-heap.
Say you have an N element heap (implemented as an array) which contains the N smallest elements seen so far.
When an element comes in, you check it against the max (O(1) time) and reject it if it is greater.
If the value coming in is lower, you modify the root to be the new value and sift down this changed value - worst case O(log N) time.
The sift-down process is simple: starting at the root, at each step you exchange the value with its larger child until the max-heap property is restored.
So you will not have to do any deletes, which you probably would have to do if you used std::priority_queue. Depending on the implementation of std::priority_queue, that could cause memory allocation/deallocation.
So you can have the code as follows:
Allocate an array of size N.
Fill it up with the first N elements you see.
Heapify (you should find this in standard textbooks; it uses sift-down). This is O(N).
Now for any new element you get, you either reject it in O(1) time or insert it by sifting down in worst case O(log N) time.
On average, though, you probably will not have to sift the new value all the way down and might get better than O(log N) average insert time (though I haven't tried proving it).
Check out the wiki page which has pseudo code for heapify and sift-down: http://en.wikipedia.org/wiki/Heapsort
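A minimal sketch of that scheme over plain ints, using a caller-supplied fixed-size array so nothing is allocated after start-up (the names siftDown, offer and capacity are only for illustration):

#include <utility>  // std::swap

// Restore the max-heap property starting from index i (0-based array heap).
void siftDown(int heap[], int size, int i)
{
    while (true)
    {
        int largest = i;
        int left = 2 * i + 1, right = 2 * i + 2;
        if (left  < size && heap[left]  > heap[largest]) largest = left;
        if (right < size && heap[right] > heap[largest]) largest = right;
        if (largest == i) break;            // heap property holds
        std::swap(heap[i], heap[largest]);
        i = largest;
    }
}

// Keep the `capacity` smallest values seen so far in heap[0..size-1].
void offer(int heap[], int &size, int capacity, int value)
{
    if (size < capacity)                    // still filling up: ordinary sift-up insert
    {
        int i = size++;
        heap[i] = value;
        while (i > 0 && heap[(i - 1) / 2] < heap[i])
        {
            std::swap(heap[(i - 1) / 2], heap[i]);
            i = (i - 1) / 2;
        }
    }
    else if (value < heap[0])               // smaller than the current maximum: keep it
    {
        heap[0] = value;                    // replace the root...
        siftDown(heap, size, 0);            // ...and restore the heap in O(log N)
    }                                       // otherwise reject in O(1)
}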
Use std::priority_queue with the largest item at the head. For each new item, discard it if it is >= the head item, otherwise pop the head item and insert the new item.
Side note: Standard containers will only grow if you make them grow. As long as you remove one item before inserting a new item (after it reaches its maximum size, of course), this won't happen.
Most priority queues I work with are based on linked lists. If you have a predetermined number of priority levels, you can easily create a priority queue with O(1) insertion by having an array of linked lists - one linked list per priority level. Items of the same priority will of course degenerate into a FIFO, but that can be considered acceptable.
Adding and removal then become something like (your API may vary) ...
listItemAdd (&list[priLevel], &item); /* Add to tail */
pItem = listItemRemove (&list[priLevel]); /* Remove from head */
Getting the first item in the queue then becomes a problem of finding the non-empty linked-list with the highest priority. That may be O(N), but there are several tricks you can use to speed it up.
In your priority queue structure, keep a pointer or index or something to the linked list with the current highest priority. This would need to be updated each time an item is added or removed from the priority queue.
Use a bitmap to indicate which linked lists are not empty. Combined with a find most significant bit, or find least significant bit algorithm you can usually test up to 32 lists at once. Again, this would need to be updated on each add / remove.
Hope this helps.
If the number of priorities is small and fixed, then you can use a ring buffer for each priority. That will lead to wasted space if the objects are big, but if their size is comparable to a pointer/index, then the variants that store additional pointers inside the objects may increase the size of the array in the same way.
Or you can use a simple singly-linked list inside the array and store 2*M+1 pointers/indexes: one pointing to the first free node, and the other pairs pointing to the head and tail of each priority. In that case you'll have to compare on average O(M) priorities before taking out the next node in O(1), and insertion will take O(1).
If you construct an STL priority queue at the maximum size (perhaps from a vector initialized with placeholders), and then check the size before inserting (removing an item if necessary beforehand) you'll never have dynamic allocation during insert operations. The STL implementation is quite efficient.
Matters Computational, see page 158. The implementation itself is quite good, and you can even tweak it a little without making it less readable. For example, with 1-based indexing you compute the left child like:
int left = 2 * i;
and you can then compute the right child like so:
int right = left + 1;
Found a solution ("difference" means "priority" in the code, and maxRememberedResults is 255, but it could be any 2^n - 1):
template <typename T> inline void swap(T& a, T& b){
    T c = a;
    a = b;
    b = c;
}

struct MinDifferenceArray{
    enum{maxSize = maxRememberedResults};
    int size;
    DifferenceData data[maxSize];

    void add(const DifferenceData& val){
        if (size >= maxSize){
            if(data[0].difference <= val.difference)
                return;
            data[0] = val;
            for (int i = 0; (2*i+1) < maxSize; ){
                int next = 2*i + 1;
                if (data[next].difference < data[next+1].difference)
                    next++;
                if (data[i].difference < data[next].difference)
                    swap(data[i], data[next]);
                else
                    break;
                i = next;
            }
        }
        else{
            data[size++] = val;
            for (int i = size - 1; i > 0;){
                int parent = (i-1)/2;
                if (data[parent].difference < data[i].difference){
                    swap(data[parent], data[i]);
                    i = parent;
                }
                else
                    break;
            }
        }
    }
    void clear(){
        size = 0;
    }
    MinDifferenceArray()
        :size(0){
    }
};
Build a max-based heap (root is largest).
Until it is full, fill it up normally.
When it is full, for every new element:
Check if the new element is smaller than the root.
If it is larger than or equal to the root, reject it.
Otherwise, replace the root with the new element and perform the normal heap "sift-down".
And we get O(log(N)) insert as the worst-case scenario.
It is the same solution as the one provided by the user with the nickname "Moron".
Thanks to everyone for replies.
P.S. Apparently programming without sleeping enough wasn't a good idea.
It's better to implement your own class using std::array and heap algorithms.
#include <array>
#include <algorithm>
#include <utility>

template<class T, int fixed_size = 5>
class fixed_size_arr_pqueue_v2
{
    std::array<T, fixed_size> _data;
    int _size = 0;

    int parent(int i)
    {
        return (i - 1)/2;
    }

    void heapify(int i, bool downward = false)
    {
        int l = 2*i + 1;
        int r = 2*i + 2;
        int largest = 0;
        if (l < size() && _data[l] > _data[i])
            largest = l;
        else
            largest = i;
        if (r < size() && _data[r] > _data[largest])
            largest = r;
        if (largest != i)
        {
            std::swap(_data[largest], _data[i]);
            if (!downward)
                heapify(parent(i));
            else
                heapify(largest, true);
        }
    }

public:
    void push(T &d)
    {
        if (_size == fixed_size)
        {
            //min elements in a max heap lie at the leaves only.
            auto minItr = std::min_element(begin(_data) + _size/2, end(_data));
            auto minPos {minItr - _data.begin()};
            auto min { *minItr};
            if (d > min)
            {
                _data.at(minPos) = d;
                if (_data[parent(minPos)] > d)
                {
                    //this is unlikely to happen in our case, as this position is a leaf
                    heapify(minPos, true);
                }
                else
                    heapify(parent(minPos));
            }
            return;
        }
        _data.at(_size++) = d;
        std::push_heap(_data.begin(), _data.begin() + _size);
    }

    T pop()
    {
        T d = _data.front();
        std::pop_heap(_data.begin(), _data.begin() + _size);
        _size--;
        return d;
    }

    T top()
    {
        return _data.front();
    }

    int size() const
    {
        return _size;
    }
};

c++ stl priority queue insert bad_alloc exception

I am working on a query processor that reads in long lists of document IDs from memory and looks for matching IDs. When it finds one, it creates a DOC struct containing the docid (an int) and the document's rank (a double) and pushes it onto a priority queue. My problem is that when the word(s) searched for have a long list, I get the following exception when I try to push the DOC onto the queue:
Unhandled exception at 0x7c812afb in QueryProcessor.exe: Microsoft C++ exception: std::bad_alloc at memory location 0x0012ee88..
When the word has a short list, it works fine. I tried pushing DOCs onto the queue in several places in my code, and they all work until a certain line; after that, I get the above error. I am completely at a loss as to what is wrong, because the longest list read in is less than 1 MB and I free all the memory that I allocate. Why should there suddenly be a bad_alloc exception when I try to push a DOC onto a queue that has the capacity to hold it (I used a vector with enough space reserved as the underlying data structure for the priority queue)?
I know that questions like this are almost impossible to answer without seeing all the code, but it's too long to post here. I'm posting as much as I can and am anxiously hoping that someone can give me an answer, because I am at my wits' end.
The NextGEQ function reads a list of compressed blocks of docids block by block. That is, if it sees that the last docid in the block (kept in a separate list) is larger than the docid passed in, it decompresses the block and searches until it finds the right one. Each list starts with metadata about the list, containing the lengths of each compressed chunk and the last docid in each chunk. data.iquery points to the beginning of the metadata; data.metapointer points to wherever in the metadata the function currently is; and data.blockpointer points to the beginning of the block of uncompressed docids, if there is one. If the function sees that the block was already decompressed, it just searches. Below, when I call the function the first time, it decompresses a block and finds the docid; the push onto the queue after that works. The second time, it doesn't even need to decompress; that is, no new memory is allocated, but after that call, pushing onto the queue gives the bad_alloc error.
Edit: I cleaned up my code some more so that it should compile. I also added the OpenList() and NextGEQ() functions, although the latter is long, because I think the problem is caused by a heap corruption somewhere in it. Thanks a lot!
struct DOC{
    long int docid;
    long double rank;
public:
    DOC()
    {
        docid = 0;
        rank = 0.0;
    }
    DOC(int num, double ranking)
    {
        docid = num;
        rank = ranking;
    }
    bool operator>( const DOC & d ) const {
        return rank > d.rank;
    }
    bool operator<( const DOC & d ) const {
        return rank < d.rank;
    }
};

struct listnode{
    int* metapointer;
    int* blockpointer;
    int docposition;
    int frequency;
    int numberdocs;
    int* iquery;
    listnode* nextnode;
};
void QUERYMANAGER::SubmitQuery(char *query){
    listnode* startlist;
    vector<DOC> docvec;
    docvec.reserve(20);
    DOC doct;
    //create a priority queue to use as a min-heap to store the documents and rankings
    priority_queue<DOC, vector<DOC>, std::greater<DOC>> q(docvec.begin(), docvec.end());
    q.push(doct);
    //do some processing here; startlist is a pointer to a listnode struct that starts the linked list
    //point the linked list start pointer to the node returned by the OpenList method
    startlist = &OpenList(value);
    listnode* minpointer;
    q.push(doct);
    //start by finding the first docid in the shortest list
    int i = 0;
    q.push(doct);
    num = NextGEQ(0, *startlist);
    q.push(doct);
    while(num != -1)
    {
        q.push(doct);
        //this is where the problem starts - every previous q.push(doct) works; the one after
        //NextGEQ(num + 1, *startlist) gives the bad_alloc error
        num = NextGEQ(num + 1, *startlist);
        //this is where the exception is thrown
        q.push(doct);
    }
}
//takes a word and returns a listnode struct with a pointer to the beginning of the list
//and metadata about the list
listnode QUERYMANAGER::OpenList(char* word)
{
    long int numdocs;
    //create a new node in the linked list and initialize its variables
    listnode n;
    n.iquery = cache -> GetiList(word, &numdocs);
    n.docposition = 0;
    n.frequency = 0;
    n.numberdocs = numdocs;
    //an int pointer to point to where in the metadata you are
    n.metapointer = n.iquery;
    n.nextnode = NULL;
    //an int pointer to point to the uncompressed block of data, if there is one
    n.blockpointer = NULL;
    return n;
}
int QUERYMANAGER::NextGEQ(int value, listnode& data)
{
int lengthdocids;
int lengthfreqs;
int lengthpos;
int* temp;
int lastdocid;
lastdocid = *(data.metapointer + 2);
while(true)
{
//if it's not the first chunk in the list, the blockpointer will be pointing to the
//most recently opened block and docpos to the current position in the block
if( data.blockpointer && lastdocid >= value)
{
//if the last docid in the chunk is >= the docid we're looking for,
//go through the chunk to look for a match
//the last docid in the block is in lastdocid; keep going until you hit it
while(*(data.blockpointer + data.docposition) <= lastdocid)
{
//compare each docid with the docid passed in; if it's greater than or equal to it, return a pointer to the docid
if(*(data.blockpointer + data.docposition ) >= value)
{
//return the next greater than or equal docid
return *(data.blockpointer + data.docposition);
}
else
{
++data.docposition;
}
}
//read through the whole block; couldn't find matching docid; increment metapointer to the next block;
//free the block's memory
data.metapointer += 3;
lastdocid = *(data.metapointer + 3);
free(data.blockpointer);
data.blockpointer = NULL;
}
//reached the end of a block; check the metadata to find where the next block begins and ends and whether
//the last docid in the block is smaller or larger than the value being searched for
//first make sure that you haven't reached the end of the list
//if the last docid in the chunk is still smaller than the value passed in, move the metadata pointer
//to the beginning of the next chunk's metadata; read in the new metadata
while(true)
// while(*(metapointers[index]) != 0 )
{
if(lastdocid < value && *(data.metapointer) !=0)
{
data.metapointer += 3;
lastdocid = *(data.metapointer + 2);
}
else if(*(data.metapointer) == 0)
{
return -1;
}
else
//we must have hit a chunk whose lastdocid is >= value; read it in
{
//read in the metadata
//the length of the chunk of docid's is cumulative, so subtract the end of the last chunk
//from the end of this chunk to get the length
//find the end of the metadata
temp = data.metapointer;
while(*temp != 0)
{
temp += 3;
}
temp += 2;
//temp is now pointing to the beginning of the list of compressed data; use the location of metapointer
//to calculate where to start reading and how much to read
//if it's the first chunk in the list,the corresponding metapointer is pointing to the beginning of the query
//so the number of bytes of docid's is just the first integer in the metadata
if( data.metapointer == data.iquery)
{
lengthdocids = *data.metapointer;
}
else
{
//start reading from the offset of the end of the last chunk (saved in metapointers[index] - 3)
//plus 1 = the beginning of this chunk
lengthdocids = *(data.metapointer) - (*(data.metapointer - 3));
temp += (*(data.metapointer - 3)) / sizeof(int);
}
//allocate memory for an array of integers - the block of docid's uncompressed
int* docblock = (int*)malloc(lengthdocids * 5 );
//decompress docid's into the block of memory allocated
s9decompress((int*)temp, lengthdocids /4, (int*) docblock, true);
//set the blockpointer to point to the beginning of the block
//and docpositions[index] to 0
data.blockpointer = docblock;
data.docposition = 0;
break;
}
}
}
}
Thank you very much, bsg.
QUERYMANAGER::OpenList returns a listnode by value. In startlist = &OpenList(value); you then proceed to take the address of the temporary object that's returned. When the temporary goes away, you may be able to access the data for a time and then it's overwritten. Could you just declare a non-pointer listnode startlist on the stack and assign it the return value directly? Then remove the * in front of other uses and see if that fixes the problem.
Another thing you can try is replacing all pointers with smart pointers, specifically something like boost::shared_ptr<>, depending on how much code this really is and how much you're comfortable automating the task. Smart pointers aren't the answer to everything, but they're at least safer than raw pointers.
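For the first suggestion, the change would look something like this (a sketch only, using the same names as in SubmitQuery and assuming nothing else keeps a pointer to the node):

// Copy the returned listnode into a local object instead of taking
// the address of a temporary that is destroyed at the end of the statement.
listnode startlist = OpenList(value);
// ...
num = NextGEQ(0, startlist);   // pass the object itself, no * needed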
Assuming you have heap corruption and are not in fact exhausting memory, the commonest way a heap can get corrupted is by deleting (or freeing) the same pointer twice. You can quite easily find out if this is the issue by simply commenting out all your calls to delete (or free). This will cause your program to leak like a sieve, but if it doesn't actually crash you have probably identified the problem.
The other common cause of a corrupt heap is deleting (or freeing) a pointer that was never allocated on the heap. Differentiating between the two causes of corruption is not always easy, but your first priority should be to find out whether corruption is actually the problem.
Note this approach won't work too well if the things you are deleting have destructors which if not called break the semantics of your program.
Thanks for all your help. You were right, Neil - I must have managed to corrupt my heap. I'm still not sure what was causing it, but when I changed the malloc(numdocids * 5) to malloc(256) it magically stopped crashing. I suppose I should have checked whether or not my mallocs were actually succeeding! Thanks again!
Bsg