Determining if a position is free in closed hashing - C++

How would you go about determining whether a position is already occupied or not? When the memory is allocated, all that there is in it is garbage (in C++, which is what I'm using atm). I was thinking of using an auxiliary array of bools to know whether the position is occupied, but that would demand quite a lot of additional memory.
I could also set a value for every position, but then I wouldn't be able to use said value. In both cases, I would also lose some performance initializing the values (the bools to false, the other values to 0 to indicate the position is free, for example).
Any other solutions?

Usually, you use a special placeholder element to indicate empty values. In the simplest case, this could be a null pointer but that would of course mean that you introduce an indirection; you can't store your values directly. In all other cases you would have to make allowance for the type actually stored. For example, if you stored 32 bit integers, you would have to reserve at least one predefined value (e.g. 0) as a sentinel element, thus reducing the range of values that may be stored in your hash table.
An additional array with flags is quite a good solution. Consider that this array could be reduced by a factor of at least 8 by storing bit flags instead of whole-byte variables (such as bools, which occupy a full byte each on most implementations).
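As a rough illustration of the bit-flag idea, here is a minimal sketch of a closed-hashing set of ints that keeps one occupancy bit per slot. The class name IntHashSet, the linear probing, and the modulo hash are assumptions made for the example, not something the answers prescribe, and deletion is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical closed-hashing (open addressing) set of ints.
// Occupancy lives in a separate bit array: one bit per slot.
class IntHashSet {
public:
    explicit IntHashSet(std::size_t capacity)
        : slots_(capacity),               // slot contents are ignored while the bit is clear
          used_((capacity + 7) / 8, 0) {  // one bit per slot, all cleared
        assert(capacity > 0);
    }

    // Returns false if the key is already present or the table is full.
    bool insert(int key) {
        std::size_t i = home(key);
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!occupied(i)) {
                slots_[i] = key;
                used_[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
                return true;
            }
            if (slots_[i] == key) return false;
            i = (i + 1) % slots_.size();  // linear probing
        }
        return false;
    }

    bool contains(int key) const {
        std::size_t i = home(key);
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!occupied(i)) return false;  // reached a free slot: key is absent
            if (slots_[i] == key) return true;
            i = (i + 1) % slots_.size();
        }
        return false;
    }

private:
    bool occupied(std::size_t i) const {
        return ((used_[i / 8] >> (i % 8)) & 1u) != 0;
    }
    std::size_t home(int key) const {
        return static_cast<std::size_t>(static_cast<unsigned>(key)) % slots_.size();
    }

    std::vector<int> slots_;
    std::vector<std::uint8_t> used_;
};
```

Because emptiness is tracked outside the slots, every int (including 0) remains a storable key. The flag array costs capacity / 8 bytes, and only that array has to be cleared up front if the slots come from a raw uninitialized buffer instead of a std::vector.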

You could use boost::optional for this, instead of a raw value. That's the reason it was created, to add a not-initialized value to an item. It has a performance hit similar to initializing the values in the first place, but requires only a small amount of extra memory per item.

Related

What values should I fill my arrays with to indicate an empty space?

This is more of an etiquette question than anything else, but when creating new arrays what value, other than zero, should I use to indicate an empty space in the array? For example:
int* arr;
arr = new int[10];
When I create a new array like in the code above, the array will be filled with ten zeroes. The issue I'm having is that I want to use underscores when printing the array to indicate empty spaces, however, I also have zeroes as part of my data set in the array. So, should I just fill the empty array with some arbitrary value that is unlikely to show up in my data set (like -32000 for example), and use that as the indicator for empty space, or is there some sort of null value that I could use instead so that I can know for a fact that the value at that specific index is definitely an empty space?
What you appear to be asking about is called a sentinel -- some data value that has special meaning.
Regarding choice of sentinel, use something you know is not going to appear and make it a named constant. For example, you might use:
constexpr int NoValue = std::numeric_limits<int>::min();
If you absolutely need the entire integer range or if you cannot reliably sanitize your input to ensure that data is never accepted as a non-empty value, consider using a larger data type that can represent that range and a sentinel, or use std::optional as suggested in another answer.
Alternatively, maintain a separate array to hold that information. Such an array only requires one bit per element to represent whether the value is empty or not and so it only means a fractional increase in storage as opposed to expanding your data type beyond int. This approach trades off memory usage against memory locality, since the data about "emptiness" would not be stored adjacent to a value in your array and that may have implications for caching.
Regarding the actual initialization question: your array is uninitialized and will require setting values with std::fill or similar. Otherwise your program's behavior is undefined if you attempt to use an uninitialized value. Note that there's a special case: new int[100]() which will zero-initialize the memory. But you can't use that construction to initialize with any other value.
Consider using std::vector to avoid memory management issues, and to provide initialization with non-zero values without adding code clutter:
std::vector<int> arr(10, NoValue);
As you can see, there are choices to be made which depend on your program's requirements and its input specification. I hope this helps you make a more informed decision.
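To make the sentinel approach concrete, here is a small sketch built on the NoValue constant from this answer. The helper names render and store are made up for the example; store shows the input sanitizing mentioned above by refusing a value that collides with the sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Named sentinel: a value assumed never to occur in valid data.
constexpr int NoValue = std::numeric_limits<int>::min();

// Hypothetical helper: renders the array, showing '_' for empty slots.
std::string render(const std::vector<int>& arr) {
    std::string out;
    for (std::size_t i = 0; i < arr.size(); ++i) {
        if (i != 0) out += ' ';
        out += (arr[i] == NoValue) ? std::string("_") : std::to_string(arr[i]);
    }
    return out;
}

// Hypothetical helper: stores a value, rejecting input that equals the sentinel.
bool store(std::vector<int>& arr, std::size_t index, int value) {
    assert(index < arr.size());
    if (value == NoValue) return false;
    arr[index] = value;
    return true;
}
```

With std::vector<int> arr(4, NoValue), storing 0 at index 1 and -7 at index 3 makes render(arr) produce "_ 0 _ -7", so zero stays usable as ordinary data.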
should I just fill the empty array with some arbitrary value that is unlikely to show up in my data set
Well, unlikely is not the same as a value that you know for certain will not appear in the data, and the error that you get from being wrong about this will be a nasty error. Generally speaking, however, you usually do have some idea of the range of valid values and it can indeed be easier to use a sentinel value outside that range to indicate nullity. (And if you do it this way I'd recommend being very fastidious about sanitizing the input data coming into your program i.e. explicitly test for your sentinel value unexpectedly coming in from an external source.)
However, in cases where there is no such value, or just to unambiguously declare your intent, the canonical way to handle this situation in modern C++ is to use std::optional<int>. The standard library's optional is a way of turning any type into a nullable type.
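A minimal sketch of the std::optional<int> variant follows, assuming a C++17 compiler. The helper names renderOptional and clearSlot are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: renders the array, showing '_' for slots that hold no value.
std::string renderOptional(const std::vector<std::optional<int>>& arr) {
    std::string out;
    for (std::size_t i = 0; i < arr.size(); ++i) {
        if (i != 0) out += ' ';
        out += arr[i].has_value() ? std::to_string(*arr[i]) : std::string("_");
    }
    return out;
}

// Hypothetical helper: marks a slot as empty again.
void clearSlot(std::vector<std::optional<int>>& arr, std::size_t index) {
    assert(index < arr.size());
    arr[index].reset();
}
```

A default-constructed std::optional<int> is empty, so std::vector<std::optional<int>> arr(3) starts out as "_ _ _" without any explicit fill, and no int value has to be reserved. The price is the extra flag stored next to each element, which with padding usually doubles the per-element size for int.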
Please note that questions of "taste" are generally frowned upon on Stack Overflow.
With that said, here's my preference:
Something that cannot masquerade as a valid value, like NaN, makes a good placeholder. If that isn't an option, then as you said, a value that will not/is not allowed to appear in the data set works.

Why does std::vector<bool> have no .data()?

The specialisation of std::vector<bool>, as specified in C++11 23.3.7/1, doesn't declare a data() member (e.g. mentioned here and here).
The question is: Why does a std::vector<bool> have no .data()? This is the very same question as why is a vector of bools not stored contiguously in memory. What are the benefits in not doing so?
Why can a pointer to an array of bools not be returned?
Why does a std::vector<bool> have no .data()?
Because std::vector<bool> stores multiple values in 1 byte.
Think about it like a compressed storage system, where every boolean value needs 1 bit. So, instead of having one element per memory block (one element per array cell), several elements are packed into each byte.
Assuming that you want to index a block to get a value, how would you use operator []? It can't return bool& (a bool& must refer to a whole addressable object, while each element here is a single bit inside a byte that holds several bools), so you can't obtain a bool* to an element either. In other words bool *bool_ptr =&v[0]; is not valid code, and would result in a compilation error.
Moreover, the standard allows but does not require the memory optimization (compression), so an implementation may store the elements differently. data() would then have to copy to the expected return type depending on the implementation (or the standard would have to mandate the optimization instead of just allowing it).
Why can a pointer to an array of bools not be returned?
Because std::vector<bool> is not stored as an array of bools, thus no pointer can be returned in a straightforward way. It could do that by copying the data to an array and return that array, but it's a design choice not to do that (if they did, I would think that this works as the data() for all containers, which would be misleading).
What are the benefits in not doing so?
Memory optimization.
Usually 8 times less memory usage, since it stores multiple bits in a single byte. To be exact, CHAR_BIT times less.
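The points above can be checked with a short sketch. The static_asserts confirm that operator[] yields a proxy object and not bool&; setBit and toBoolArray are hypothetical helpers showing how to write through the proxy and how to copy the bits out when a real bool* is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// For ordinary element types, operator[] yields a real reference.
static_assert(std::is_same<decltype(std::declval<std::vector<int>&>()[0]), int&>::value,
              "vector<int>::operator[] returns int&");

// For vector<bool>, operator[] yields a proxy object, not bool&.
static_assert(!std::is_same<decltype(std::declval<std::vector<bool>&>()[0]), bool&>::value,
              "vector<bool>::operator[] does not return bool&");

// Hypothetical helper: writes one element through the proxy type.
void setBit(std::vector<bool>& bits, std::size_t i, bool value) {
    assert(i < bits.size());
    std::vector<bool>::reference bit = bits[i];  // proxy referring to one stored element
    bit = value;                                 // assignment writes through the proxy
}

// Hypothetical helper: copies the elements into a contiguous array of bool,
// which is what a data()-like pointer would need.
std::unique_ptr<bool[]> toBoolArray(const std::vector<bool>& bits) {
    std::unique_ptr<bool[]> out(new bool[bits.size()]);
    for (std::size_t i = 0; i < bits.size(); ++i) out[i] = bits[i];
    return out;
}
```

The copy in toBoolArray is exactly the cost that a data() member would have to hide, which is one way to read the design choice described above.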

How to avoid wasting memory on 64-bit pointers

I'm hoping for some high-level advice on how to approach a design I'm about to undertake.
The straightforward approach to my problem will result in millions and millions of pointers. On a 64-bit system these will presumably be 64-bit pointers. But as far as my application is concerned, I don't think I need more than a 32-bit address space. I would still like for the system to take advantage of 64-bit processor arithmetic, however (assuming that is what I get by running on a 64-bit system).
Further background
I'm implementing a tree-like data structure where each "node" contains an 8-byte payload, but also needs pointers to four neighboring nodes (parent, left-child, middle-child, right-child). On a 64-bit system using 64-bit pointers, this amounts to 32 bytes just for linking an 8-byte payload into the tree -- a "linking overhead" of 400%.
The data structure will contain millions of these nodes, but my application will not need much memory beyond that, so all these 64-bit pointers seem wasteful. What to do? Is there a way to use 32-bit pointers on a 64-bit system?
I've considered
Storing the payloads in an array in a way such that an index implies (and is implied by) a "tree address" and neighbors of a given index can be calculated with simple arithmetic on that index. Unfortunately this requires me to size the array according to the maximum depth of the tree, which I don't know beforehand, and it would probably incur even greater memory overhead due to empty node elements in the lower levels because not all branches of the tree go to the same depth.
Storing nodes in an array large enough to hold them all, and then using indices instead of pointers to link neighbors. AFAIK the main disadvantage here would be that each node would need the array's base address in order to find its neighbors. So they either need to store it (a million times over) or it needs to be passed around with every function call. I don't like this.
Assuming that the most-significant 32 bits of all these pointers are zero, throwing an exception if they aren't, and storing only the least-significant 32 bits. So the required pointer can be reconstructed on demand. The system is likely to use more than 4GB, but the process will never. I'm just assuming that pointers are offset from a process base-address and have no idea how safe (if at all) this would be across the common platforms (Windows, Linux, OSX).
Storing the difference between 64-bit this and the 64-bit pointer to the neighbor, assuming that this difference will be within the range of int32_t (and throwing if it isn't). Then any node can find its neighbors by adding that offset to this.
Any advice? Regarding that last idea (which I currently feel is my best candidate) can I assume that in a process that uses less than 2GB, dynamically allocated objects will be within 2 GB of each other? Or not at all necessarily?
Combining ideas 2 and 4 from the question, put all the nodes into a big array, and store e.g. int32_t neighborOffset = neighborIndex - thisIndex. Then you can get the neighbor from *(this+neighborOffset). This gets rid of the disadvantages/assumptions of both 2 and 4.
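Here is a minimal sketch of that combination, assuming all nodes live in one std::vector. The Node layout, the use of offset 0 as "no neighbor", and the link helper are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical node: 8-byte payload plus four 32-bit relative links.
// An offset of 0 means "no neighbor", since a node is never its own neighbor.
struct Node {
    std::uint64_t payload;
    std::int32_t parent, left, middle, right;

    // Resolves a stored offset relative to this node's own position in the array.
    Node* neighbor(std::int32_t offset) { return offset != 0 ? this + offset : nullptr; }
};

// Links the node at childIdx under the node at parentIdx, using the given child slot
// (for example &Node::left). Both indices refer to the same array.
void link(std::vector<Node>& nodes, std::size_t parentIdx, std::size_t childIdx,
          std::int32_t Node::*slot) {
    assert(parentIdx < nodes.size() && childIdx < nodes.size() && parentIdx != childIdx);
    const std::int64_t diff =
        static_cast<std::int64_t>(childIdx) - static_cast<std::int64_t>(parentIdx);
    // Both diff and -diff must fit into int32_t.
    assert(diff > std::numeric_limits<std::int32_t>::min() &&
           diff <= std::numeric_limits<std::int32_t>::max());
    nodes[parentIdx].*slot = static_cast<std::int32_t>(diff);
    nodes[childIdx].parent = static_cast<std::int32_t>(-diff);
}
```

On typical 64-bit platforms this Node takes 24 bytes instead of the 40 needed with four 64-bit pointers. Since the links are relative, they stay valid when the vector reallocates, although any Node* obtained from neighbor() does not.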
If on Linux, you might consider using (and compiling for) the x32 ABI. IMHO, this is the preferred solution for your issues.
Alternatively, don't use pointers, but indexes into a huge array (or an std::vector in C++) which could be a global or static variable. Manage a single huge heap-allocated array of nodes, and use indexes of nodes instead of pointers to nodes. So like your §2, but since the array is a global or static data you won't need to pass it everywhere.
(I guess that an optimizing compiler would be able to generate clever code, which could be nearly as efficient as using pointers)
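A minimal sketch of the index-based variant, with the pool held in a function-local static so that no base address has to be passed around. The names NodeIndex, kNoNode, pool, makeNode, and at are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NodeIndex = std::uint32_t;
constexpr NodeIndex kNoNode = 0xFFFFFFFFu;  // hypothetical "null" index

struct Node {
    std::uint64_t payload;
    NodeIndex parent, left, middle, right;
};

// Single pool shared by the whole program, so nodes need not store a base address.
std::vector<Node>& pool() {
    static std::vector<Node> nodes;
    return nodes;
}

// Appends a node and returns its index.
NodeIndex makeNode(std::uint64_t payload, NodeIndex parent = kNoNode) {
    assert(pool().size() < kNoNode);
    pool().push_back(Node{payload, parent, kNoNode, kNoNode, kNoNode});
    return static_cast<NodeIndex>(pool().size() - 1);
}

// Looks a node up by index. The reference is invalidated by the next makeNode call.
Node& at(NodeIndex i) {
    assert(i < pool().size());
    return pool()[i];
}
```

Typical use is NodeIndex root = makeNode(1); NodeIndex child = makeNode(2, root); at(root).left = child;. Keeping indexes and not references across calls to makeNode avoids dangling references when the vector grows.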
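You can remove the disadvantage of (2) by exploiting the alignment of memory regions to find the base address of the array "automatically". For example, if you want to support up to 4 GB of nodes, ensure your node array starts at a 4GB boundary.
Then within a node with address addr, you can determine the address of another at index as (addr & -(1ULL << 32)) + index, where index is a byte offset from the base (multiply by the node size if it counts nodes). The parentheses are needed because + binds more tightly than &, and 1ULL keeps the shift 64 bits wide on platforms where unsigned long is 32 bits.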
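The address arithmetic can be sketched on plain 64-bit integers, without allocating anything. The names kRegionSize, regionBase, and nodeAddress, and the nodeSize parameter, are assumptions made for the example; the actual aligned allocation is not shown.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kRegionSize = 1ULL << 32;  // 4 GiB region, as in the text

// Base of the 4 GiB-aligned region containing addr: clear the low 32 bits.
constexpr std::uint64_t regionBase(std::uint64_t addr) {
    return addr & ~(kRegionSize - 1);
}

// Address of the node at `index`, given the address of any node in the same region.
// nodeSize is the size in bytes of one node (hypothetical parameter).
inline std::uint64_t nodeAddress(std::uint64_t addr, std::uint32_t index,
                                 std::uint64_t nodeSize) {
    const std::uint64_t offset = index * nodeSize;  // computed in 64 bits
    assert(offset < kRegionSize);                   // must stay inside the region
    return regionBase(addr) + offset;
}
```

For instance, for a node at address 0x00007f3212345678 the base is 0x00007f3200000000, and the node at index 3 with 24-byte nodes sits 72 bytes (0x48) past that base.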
This is kind of the "absolute" variant of the accepted solution which is "relative". One advantage of this solution is that an index always has the same meaning within a tree, whereas in the relative solution you really need the (node_address, index) pair to interpret an index (of course, you can also use the absolute indexes in the relative scenarios where it is useful). It means that when you duplicate a node, you don't need to adjust any index values it contains.
The "relative" solution also loses 1 effective index bit relative to this solution in its index since it needs to store a signed offset, so with a 32-bit index, you could only support 2^31 nodes (assuming full compression of trailing zero bits, otherwise it is only 2^31 bytes of nodes).
You can also store the base tree structure (e.g., the pointer to the root and whatever bookkeeping you have outside of the nodes themselves) right at the 4GB address which means that any node can jump to the associated base structure without traversing all the parent pointers or whatever.
Finally, you can also exploit this alignment idea within the tree itself to "implicitly" store other pointers. For example, perhaps the parent node is stored at an N-byte aligned boundary, and then all children are stored in the same N-byte block so they know their parent "implicitly". How feasible that is depends on how dynamic your tree is, how much the fan-out varies, etc.
You can accomplish this kind of thing by writing your own allocator that uses mmap to allocate suitably aligned blocks (usually just reserve a huge amount of virtual address space and then allocate blocks of it as needed) - either via the hint parameter or just by reserving a big enough region that you are guaranteed to get the alignment you want somewhere in the region. The need to mess around with allocators is the primary disadvantage compared to the accepted solution, but if this is the main data structure in your program it might be worth it. When you control the allocator you have other advantages too: if you know all your nodes are allocated on a 2^N-byte boundary you can "compress" your indexes further since you know the low N bits will always be zero, so with a 32-bit index you could actually store 2^(32+5) = 2^37 bytes of nodes if you knew they were 32-byte aligned.
These kinds of tricks are really only feasible in 64-bit programs, with the huge amount of virtual address space available, so in a way 64-bit giveth and also taketh away.
Your assertion that a 64 bit system necessarily has to have 64 bit pointers is not correct. The C++ standard makes no such assertion.
In fact, different pointer types can be different sizes: sizeof(double*) might not be the same as sizeof(int*).
Short answer: don't make any assumptions about the sizes of any C++ pointer.
It sounds to me like you want to build your own memory management framework.

Why are empty classes 8 bytes and larger classes always > 8 bytes?

class foo { }
writeln(foo.classinfo.init.length); // = 8 bytes
class foo { char d; }
writeln(foo.classinfo.init.length); // = 9 bytes
Is D actually storing anything in those 8 bytes, and if so, what? It seems like a huge waste. If I'm just wrapping a few value types, then the class significantly bloats the program, especially if I am using a lot of them. A char becomes 8 times larger while an int becomes 3 times as large.
A struct's minimum size is 1 byte.
In D, objects have a header containing 2 pointers (so it may be 8 bytes or 16 depending on your architecture).
The first pointer is to the virtual method table. This is an array generated by the compiler and filled with function pointers, so that virtual dispatch is possible. All instances of the same class share the same virtual method table.
The second pointer is the monitor. It is used for synchronization. It is not certain that this field will stay forever, because D emphasizes thread-local storage and immutability, which make synchronization on many objects unnecessary. As this field is older than these features, it is still here and can be used. However, it may disappear in the future.
Such a header on objects is very common; you'll find the same in Java or C#, for instance. You can look here for more information: http://dlang.org/abi.html
D uses two machine words in each class instance for:
A pointer to the virtual function table. This contains the addresses of virtual methods. The first entry points towards the class's classinfo, which is also used by dynamic casts.
The monitor, which allows the synchronized(obj) syntax, documented here.
These fields are described in the D documentation here (scroll down to "Class Properties") and here (scroll down to "Classes").
I don't know the particulars of D, but in both Java and .net, every class object contains information about its type, and also holds information about whether it's the target of any monitor locks, whether it's eligible for finalization cleanup, and various other things. Having a standard means by which all objects store such information can make many things more convenient for both users and implementers of the language and/or framework. Incidentally, in 32-bit versions of .net, the overhead for each object is 8 bytes except that there is a 12-byte minimum object size. This minimum stems from the fact that when the garbage-collector moves objects around, it needs to temporarily store in the old location a reference to the new one as well as some sort of linked data structure that will permit it to examine arbitrarily-deep nested references without needing an arbitrarily-large stack.
Edit
If you want to use a class because you need to be able to persist references to data items, space is at a premium, and your usage patterns are such that you'll know when data items are still useful and when they become obsolete, you may be able to define an array of structures, and then pass around indices to the array elements. It's possible to write code to handle this very efficiently with essentially zero overhead, provided that the structure of your program allows you to ensure that every item that gets allocated is released exactly once and things are not used once they are released.
If you would not be able to readily determine when the last reference to an object is going to go out of scope, eight bytes would be a very reasonable level of overhead. I would expect that most frameworks would force objects to be aligned on 32-bit boundaries (so I'm surprised that adding a byte would push the size to nine rather than twelve). If a system is going to have a garbage collector that works better than a Commodore 64(*), it would need to have an absolute minimum of a bit of overhead per object to indicate which things are used and which aren't. Further, unless one wants to have separate heaps for objects which can contain supplemental information and those which can't, one will need every object to either include space for a supplemental-information pointer, or include space for all the supplemental information (locking, abandonment notification requests, etc.). While it might be beneficial in some cases to have separate heaps for the two categories of objects, I doubt the benefits would very often justify the added complexity.
(*) The Commodore 64 garbage collector worked by allocating strings from the top of memory downward, while variables (which are not GC'ed) were allocated bottom-up. When memory got full, the system would scan all variables to find the reference to the string that was stored at the highest address. That string would then be moved to the very top of memory and all references to it would be updated. The system would then scan all variables to find the reference to the string at the highest address below the one it just moved and update all references to that. The process would repeat until it didn't find any more strings to move. This algorithm didn't require any extra data to be stored with strings in memory, but it was of course dog slow. The Commodore 128 garbage collector stored with each string in GC space a pointer to the variable that holds a reference and a length byte that could be used to find the next lower string in GC space; it could thus check each string in order to find out whether it was still used, relocating it to the top of memory if so. Much faster, but at the cost of three bytes' overhead per string.
You should look into the storage requirements for various types. Every instruction, storage allocation (ie:variable/object, etc) uses up a specific amount of space. In c# an Int32 type integer object should store integer information to the tune of 4 bytes (32bit). It might have other information, too, because it is an object, but your character data type probably only requires 1 byte of information. If you have constructs like for or while in your class, those things will take up space, too, because each of those things is telling your class to do something. The class itself requires a number of instructions to be created in memory, which would account for the 8 initial bytes.
Take an assembler language course. You'll learn all you ever wanted to know and then some about why your programs use however much memory or take up however much storage when compiled.

Tagging/Encoding Pointers

I need a way to tag a pointer as being either part of set x or part of set y (i.e. the tag has only 2 'states'), which means one can assume untagged = x and tagged = y.
Currently I'm looking at using bitwise xor to do this:
ptr ^ magic = encoded_ptr
encoded_ptr ^ magic = ptr
but I'm stumped at how to determine if the pointer is tagged in the first place.
I'm using this to mark what pools nodes in a linked list come from, so that when they are delinked, they can go back to the correct parents.
Update
Just to make it clear to all those people suggesting to store the flag in extra data members, I'm limited to sizeof(void*), so I can't add new members, else I would have. Also the pools aren't contiguous, they consist of many pages, tracking the ranges would add too much overhead (I'm after a fast & simple solution, if one can call it that).
Most solutions will be platform specific. Here are a few of them:
1) A pointer returned by malloc or new will be aligned (4, 8, 16, 32 bytes, you name it). So, on most architectures, several LSB bits of the address will be always 0.
2) And a Win32 specific way: unless your program uses the /3GB switch, values of all usermode pointers are less than 0x80000000, so the highest bit can be used as a flag. As a bonus, it will also crash when the flagged pointer is dereferenced without being repaired.
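Option 1 can be sketched as follows, assuming every pointer passed in is at least 2-byte aligned. The function names tag, isTagged, and untag are made up for the example. Unlike XOR with a magic value, a bit that is known to be 0 in every untagged pointer lets you test for the tag directly.

```cpp
#include <cassert>
#include <cstdint>

// Sets bit 0 of the address. Assumes the pointer is at least 2-byte aligned,
// so bit 0 of an untagged pointer is always 0.
inline void* tag(void* p) {
    const std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
    assert((bits & 1u) == 0);
    return reinterpret_cast<void*>(bits | 1u);
}

// True if bit 0 is set, i.e. the pointer belongs to the "tagged" set.
inline bool isTagged(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 1u) != 0;
}

// Clears bit 0, giving back a pointer that is safe to dereference.
inline void* untag(void* p) {
    return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(p) &
                                   ~static_cast<std::uintptr_t>(1));
}
```

The result still fits in sizeof(void*), which matches the constraint in the update. A tagged pointer must always go through untag before it is dereferenced, and round-tripping a modified integer through a pointer is implementation-defined, so this remains a platform-specific technique.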
There is no safe and portable way to make that sort of thing work. You might be able to find some system-specific bits that are always a known value (say, the most significant n bits), but that's an extremely fragile and dangerous thing to rely on. You can't tell whether a pointer is "marked" or not unless some of the bits in the pointer have known values in the first place.
A much better way to do this is to store an identifier in the structure the pointer points to.
Surely if you only have two pools, when you allocate memory for each pool you know the possible address range - so why not check whether your given pointer occurs in one or the other address range with simple pointer arithmetic?
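For the simple case where each pool is one contiguous block, the range check might look like the sketch below. PoolRange, makeRange, and owns are hypothetical names. std::less is used because comparing unrelated pointers with a plain < is not guaranteed to give a consistent order, while std::less is.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical description of one contiguous pool: [begin, end).
struct PoolRange {
    const char* begin;
    const char* end;
};

inline PoolRange makeRange(const void* start, std::size_t bytes) {
    assert(start != nullptr);
    const char* b = static_cast<const char*>(start);
    return PoolRange{b, b + bytes};
}

// True if p points into the pool's block.
inline bool owns(const PoolRange& pool, const void* p) {
    const char* c = static_cast<const char*>(p);
    std::less<const char*> before;  // total order over pointers
    return !before(c, pool.begin) && before(c, pool.end);
}
```

With pools made of many separate pages, as described in the update to the question, this would need one range per page and a lookup over them, which is the overhead the asker wants to avoid.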
If performance is not a big issue, two std::set's can be used.
If it's important to get this information quickly, and it's acceptable to use only 2-byte aligned pointers, the lowest bit can be used to store this information. But having "hacked" pointers may appear to be quite error-prone...
You might have ptr1 ^ magic = ptr2 with ptr1 in set X and ptr2 in set Y (unless you prove otherwise). Since (I guess) you don't have control over the pointer addresses you are given, your technique seems to be inadequate.
An alternative to Vinay's solution is to store the tags as bits of a pre-allocated buffer (especially easy if the size of the list is bounded since you don't have to grow or shrink the buffer). This is a very compact and efficient solution that does not require modifying the pointed-to data structure.
Cheers,
-stan