What does it mean for a data structure to be "intrusive"? - c++

I've seen the term intrusive used to describe data structures like lists and stacks, but what does it mean?
Can you give a code example of an intrusive data structure, and how it differs from a non-intrusive one?
Also, why make it intrusive (or, non-intrusive)? What are the benefits? What are the disadvantages?

An intrusive data structure is one that requires help from the elements it intends to store in order to store them.
Let me reword that. When you put something into that data structure, that "something" becomes aware of the fact that it is in that data structure, in some way. Adding the element to the data structure changes the element.
For instance, you can build a non-intrusive binary tree, where each node has a reference to the left and right sub-trees, and a reference to the element value of that node.
Or, you can build an intrusive one where the references to those sub-trees are embedded into the value itself.
An example of an intrusive data structure would be an ordered list of elements that are mutable. If an element changes, the list needs to be reordered, so the list object has to intrude on the privacy of the elements in order to get their cooperation, i.e. the element has to know about the list it is in, and inform it of changes.
ORM-systems usually revolve around intrusive data structures, to minimize iteration over large lists of objects. For instance, if you retrieve a list of all the employees in the database, then change the name of one of them, and want to save it back to the database, the intrusive list of employees would be told when the employee object changed because that object knows which list it is in.
A non-intrusive list would not be told, and would have to figure out what changed and how it changed by itself.
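To make the "element knows which list it is in" idea concrete, here is a minimal sketch. The names Employee, EmployeeList, notify_changed and dirty are made up for illustration and are not taken from any real ORM; a real system would track far more than a list of changed pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Employee;

// Hypothetical container that gets told when one of its elements changes.
struct EmployeeList {
    std::vector<Employee*> items;
    std::vector<Employee*> dirty;   // elements that reported a change
    void add(Employee& e);
    void notify_changed(Employee& e);
};

struct Employee {
    explicit Employee(std::string n) : name_(std::move(n)) {}
    void set_name(std::string n) {
        name_ = std::move(n);
        if (owner_) owner_->notify_changed(*this);   // the intrusive part
    }
    const std::string& name() const { return name_; }
private:
    friend struct EmployeeList;
    std::string name_;
    EmployeeList* owner_ = nullptr;   // set when the element is added to a list
};

inline void EmployeeList::add(Employee& e) {
    e.owner_ = this;                  // adding the element changes the element
    items.push_back(&e);
}

inline void EmployeeList::notify_changed(Employee& e) {
    dirty.push_back(&e);
}

// Usage: only the renamed employee ends up in the dirty set.
inline std::size_t demo_dirty_tracking() {
    Employee ann("Ann"), bob("Bob");
    EmployeeList staff;
    staff.add(ann);
    staff.add(bob);
    bob.set_name("Robert");
    return staff.dirty.size();
}
```

When saving, the list only has to look at dirty instead of comparing every element against the database.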

In an intrusive container the data itself is responsible for storing the information the container needs. On the one hand, that means the data type has to be specialized for the way it will be stored; on the other hand, the data "knows" how it is stored, so the result can be optimized slightly better.
Non-intrusive:
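template<typename T>
class LinkedList
{
    struct ListItem
    {
        T Value;
        ListItem* Prev;
        ListItem* Next;
    };
    ListItem* FirstItem = nullptr;
    ListItem* LastItem = nullptr;
    [...]
    ListItem* append(T&& val)
    {
        // The list allocates its own node and moves the value into it.
        ListItem* item = new ListItem{std::move(val), LastItem, nullptr};
        if (LastItem) LastItem->Next = item; // link after the old tail
        else FirstItem = item;               // first element of an empty list
        LastItem = item;
        return item;
    }
};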
LinkedList<int> IntList;
Intrusive:
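template<typename T>
class LinkedList
{
    T* FirstItem = nullptr;
    T* LastItem = nullptr;
    [...]
    T* append(T&& val)
    {
        // T itself carries the Prev/Next links, so no wrapper node is needed.
        T* newValue = new T(std::move(val));
        newValue->Next = nullptr;
        newValue->Prev = LastItem;
        if (LastItem) LastItem->Next = newValue; // link after the old tail
        else FirstItem = newValue;               // first element of an empty list
        LastItem = newValue;
        return newValue;
    }
};
struct IntListItem
{
    int Value;
    IntListItem* Prev;
    IntListItem* Next;
};
LinkedList<IntListItem> IntList;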
Personally I prefer intrusive design for its transparency.
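Both versions above still allocate inside append. A common variant of the intrusive approach goes one step further and links objects that already exist, so the list never allocates or owns anything. The following is only a sketch of that idea; ListHook, IntrusiveList and Task are hypothetical names, and the element type is assumed to derive publicly from ListHook.

```cpp
#include <cassert>
#include <cstddef>

// The links live in a base class that every element inherits.
struct ListHook {
    ListHook* prev = nullptr;
    ListHook* next = nullptr;
};

template <typename T>   // assumption: T derives publicly from ListHook
class IntrusiveList {
    ListHook* first = nullptr;
    ListHook* last = nullptr;
public:
    // Links an existing object at the end; no allocation, no copy.
    void push_back(T& item) {
        item.prev = last;
        item.next = nullptr;
        if (last) last->next = &item;
        else first = &item;
        last = &item;
    }
    // Unlinks an object in constant time, given only the object itself.
    void remove(T& item) {
        if (item.prev) item.prev->next = item.next;
        else first = item.next;
        if (item.next) item.next->prev = item.prev;
        else last = item.prev;
        item.prev = item.next = nullptr;
    }
    T* front() const { return static_cast<T*>(first); }
    std::size_t size() const {
        std::size_t n = 0;
        for (const ListHook* p = first; p; p = p->next) ++n;
        return n;
    }
};

struct Task : ListHook {
    int id;
    explicit Task(int i) : id(i) {}
};
```

The trade-off is visible in remove: the caller must keep each Task alive for as long as it is linked, and must unlink it before destroying it.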

Related

Intrinsic benefit of using a LinkedList class to point to head Node vs. just using Node*

We can define a LinkedListNode as below:
template <typename T>
struct LinkedListNode {
T val;
LinkedListNode* next;
LinkedListNode() : val{}, next(nullptr) {}
LinkedListNode(T x) : val{ x }, next(nullptr) {}
LinkedListNode(T x, LinkedListNode* next) : val{ x }, next(next) {}
};
If we want to define a function that takes a "Linked List", we have two options. First, we could pass a LinkedListNode* to the function.
template <typename T>
int func(LinkedListNode<T>* node);
Second, we could define a LinkedList class that holds a pointer to the "head" node. Then we could define a function that takes a LinkedList.
template <typename T>
struct LinkedList {
LinkedListNode<T>* head;
// other member functions
};
template <typename T>
int func(LinkedList<T>& llist);
The second appears preferable because it might allow better encapsulation of functions that modify a "Linked List". For example, a FindMax that takes a LinkedListNode* might fit better as a member function of LinkedList than as a member function of LinkedListNode.
What concrete reasons are there to prefer one over the other? I'm especially interested in reasons you might prefer to just use LinkedListNode*s.
I think before you even choose to use a singly linked list, you should have some reason to use it over plain std::vector. You need actual benchmarks that show that a singly linked list would improve performance in the particular application you have in mind; you'd be surprised how often it makes things worse, not better. Hint: theoretical computational complexity is orthogonal to memory access patterns, and on modern CPUs the memory access patterns often dominate performance. Most computation is essentially free, in that it takes no extra time: it gets hidden under all the cache misses.
Then you should have a reason not to use std::forward_list. But maybe you need intrusive linked lists: then make a case for not using boost::intrusive::slist<T> or a similar existing and well tested library type.
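For reference, this is roughly what you get for free from std::forward_list together with the standard algorithms. It's a small sketch; largest and make_sample are helper names I made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <forward_list>
#include <iterator>

// Returns the largest element; the list must not be empty.
inline int largest(const std::forward_list<int>& xs) {
    return *std::max_element(xs.begin(), xs.end());
}

inline std::forward_list<int> make_sample() {
    std::forward_list<int> xs{1, 3, 2};
    xs.push_front(0);                 // 0 1 3 2
    xs.insert_after(xs.begin(), 7);   // 0 7 1 3 2
    xs.remove(3);                     // 0 7 1 2
    return xs;
}
```

No node type and no manual memory management appear anywhere in the calling code.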
If you're still going forward with your own implementation, then the very first step would be to use std::unique_ptr as the owning pointer for child nodes, instead of manual memory management - that way it'll be very easy to show that no memory is being leaked - the code becomes correct by construction and memory leaks require extra effort vs. happening by omission.
In other words: don't reinvent the wheel unless you have a well articulated reason for that. Of course, you can implement linked lists all you want as an exercise, but be aware that you're most likely implementing a container that you'll make the least use of - so I'd argue that you'd learn a lot more about how C++ works by implementing e.g. a vector/array container.
If you do use std::unique_ptr, or even manual memory management, you're likely to run into the destructor stack explosion pitfall. Consider
template <typename T>
struct LinkedListNode1 {
T val;
std::unique_ptr<LinkedListNode1> next;
};
template <typename T>
struct LinkedListNode2 {
T val;
LinkedListNode2* next = nullptr;
~LinkedListNode2() { delete next; }
};
In both cases, the destructor gets invoked recursively, and if the list is sufficiently long, you'll run out of stack. Recursion is also usually less efficient than a loop. To prevent that, you must never deallocate a node whose next is non-null.
template <typename T>
struct LinkedListNode1 {
T val;
std::unique_ptr<LinkedListNode1> next;
~LinkedListNode1() {
auto node = std::move(next);
while (node)
node = std::move(node->next);
assert(!next);
}
};
template <typename T>
struct LinkedListNode2 {
T val;
LinkedListNode2* next = {};
~LinkedListNode2() {
using std::swap;
LinkedListNode2* node = {};
swap(node, next);
while (node) {
LinkedListNode2* tmp = {};
swap(tmp, node);
assert(!node);
swap(node, tmp->next);
assert(!tmp->next);
delete tmp;
}
assert(!next);
}
};
Smart pointers make the code much simpler. I wrote the raw pointer version with swaps to make it easier to show that no memory is leaking: a swap used correctly never "loses" a value.
For example, a FindMax that takes a LinkedListNode*
That's again reinventing the wheel. In C++, the idiom for "finding a maximum element" is std::max_element from #include <algorithm>. You should leverage the algorithms that the standard library provides (and any others you may need, e.g. from Boost or header-only libraries).
To do that, you need an iterator for the list. It will be, by necessity, a LegacyForwardIterator. Here, "will be a" has a strict technical meaning: it's a concise way of saying "your iterator will fulfill the concept of, and abide by the contract of, LegacyForwardIterator".
Such an iterator would look very roughly as follows:
template <typename T>
class LinkedListNode1 {
std::unique_ptr<LinkedListNode1> next;
template <typename V> class iterator_impl {
LinkedListNode1 *node = {};
using const_value_type = std::add_const_t<V>;
using non_const_value_type = std::remove_const_t<V>;
public:
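using iterator_category = std::forward_iterator_tag; // needs <iterator>
using difference_type = std::ptrdiff_t; // needs <cstddef>
using value_type = V;
using reference = V&;
using pointer = V*;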
iterator_impl() = default;
template <typename VO>
iterator_impl(const iterator_impl<VO> &o) : node(o.operator->()) {}
explicit iterator_impl(LinkedListNode1 *node) : node(node) {}
auto *operator->() const { return node; }
pointer operator&() const { return &(node->val); }
reference operator*() const { return node->val; }
iterator_impl &operator++() { node = node->next.get(); return *this; }
iterator_impl operator++(int) {
auto retval = *this;
this->operator++();
return retval;
}
bool operator==(const iterator_impl &o) const { return node == o.node; }
bool operator!=(const iterator_impl &o) const { return node != o.node; }
};
public:
T val;
using iterator = iterator_impl<T>;
using const_iterator = iterator_impl<const T>;
The next pointer can be made private. Then, the basic functionality would include:
LinkedListNode1() = default;
LinkedListNode1(const T &val) : val(val) {}
~LinkedListNode1() {
auto node = std::move(next);
while (node)
node = std::move(node->next);
}
iterator begin() { return iterator(this); }
iterator end() { return {}; }
const_iterator begin() const { return const_iterator(this); }
const_iterator end() const { return {}; }
const_iterator cbegin() const { return const_iterator(this); }
const_iterator cend() const { return {}; }
iterator insert_after(const_iterator pos, const T& value) {
auto node = std::make_unique<LinkedListNode1>();
node->val = value;
auto retval = iterator(node.get());
node->next = std::move(pos->next); // keep the nodes that followed pos
pos->next = std::move(node);
return retval;
}
One would use insert_after to extend the list. Other such methods would need to be added, of course.
Then, we'd probably also want to support initializer lists:
LinkedListNode1(std::initializer_list<T> init) {
auto src = init.begin();
if (src == init.end()) return;
val = *src++;
for (auto dst = iterator(this); src != init.end(); ++src)
dst = insert_after(dst, *src);
}
};
Now you can pre-populate the list with an initializer list, iterate it using range-for, and use it with standard algorithms:
#include <algorithm>
#include <cassert>
#include <iostream>
int main() {
LinkedListNode1<int> list{1, 3, 2};
for (auto const &val : list)
std::cout << val << '\n';
assert(*std::max_element(list.begin(), list.end()) == 3);
}
But now we come to the most important question:
What concrete reasons are there to prefer one over the other
The default - the starting point - is to provide a container, since that's the abstraction we deal with: the "thing" that you think of is a linked list, not a list node. The data structure you learn of is, again, a linked list. And for a good reason: The node type is an implementation detail, so you'd need to come up with application-specific reasons for exposing the node type, and any argument made must stand up to the scrutiny when faced with iterators. Do you really need to expose those nodes, or is what you actually want just a convenient way to iterate over the items stored in the collection, perhaps split the list, etc? Node access is not necessary for any of it. It's all a solved problem, as you'll learn by reading the documentation of std::forward_list.
You'd also want to consider allocator support. I'd not worry about the C++98 allocators, but the polymorphic allocators are (finally!) actually usable, so you'd want to implement those (cf. std::pmr::polymorphic_allocator and the std::pmr namespace in general).
For full functionality, you'd pretty much need to add most of std::forward_list's methods and constructors. So it's a bit of work, and there are lots of details to make it work well no matter the value type. And thus we come full circle: real containers that are meant to be useful without worrying about low-level details are lots of work, but they are a joy to use - and they look nothing like most textbook "teaching" code.
A linked list is often used when teaching data structures - true. Yet most C++ books used in teaching are woefully inadequate in demonstrating what a modern, fully functional data structure/container entails - they can't even get that right for something as "simple" as a singly linked list.
The gap between a C-like singly linked list - exactly what you started with in the question - and a singly linked list C++ container is on the order of a couple thousand lines of code and tests. That's what they don't usually teach, and that's where the most important bits really are: they are the difference between toy code, and production code.
Even without tests, a fully functional singly linked list container is ~500 lines without polymorphic allocator support, and probably at least double that with such support, and tests would double the code size several times - although if you were clever about it, you could reuse a lot of the tests used by various STL implementations :)
And, by the way: a decent implementation of a linked list in C won't force you to manually deal with nodes either. The list itself - the container - will be an abstract data type with a bunch of functions that provide the functionality, and with some abstraction for iterators as well (even though they'll be just pointers in some type-safe disguise). This is again the difference between teaching code and easy-to-use-correctly, hard-to-use-incorrectly production code. One example I can think of right now is the stretchy buffers implemented in the Bitwise ion project. There is a video where those are implemented live, and they serve as a decent example of how abstractions work in C (and also of how you definitely shouldn't be writing this in C++ - C and C++ are different languages!).
Defining an actual LinkedList type allows you to directly support operations that would be relatively difficult to support by just passing around a pointer to a node.
One comment has already mentioned storing the size of the linked list as a member, so you can have a function return the size of the linked list in constant time. That is a useful thing to do, but I think it only hints at the real point, which (in my opinion) is having things that apply to the linked list as a whole, rather than just operations on individual nodes.
In C++, one obvious possibility here is having a destructor that properly destroys a complete linked list when it goes out of scope.
void foo() {
LinkedList a;
// code that uses `a`
} // <-- here `a` goes out of scope, and should be destroyed
One of the big features of C++ as a whole is deterministic destruction, and its support for that is based on destructors that run when objects go out of scope.
With a linked list, you'd (at least normally) plan on all the nodes in the linked list being allocated dynamically. If you just use a pointer to node, it'll be up to you to keep track of when you no longer need/want a particular linked list, and manually destroy all the nodes in the linked list when it's no longer needed.
By creating a linked-list class, you get the benefit of deterministic destruction, so you no longer need to keep track of when a list is no longer needed--the compiler tracks that for you, and when it goes out of scope, it gets destroyed automatically.
I'd also expect a linked list to support copy construction, move construction, copy assignment, and move assignment--and probably also a few things like comparison (at least for in/equality, and possibly ordering). And all of these require a fair amount of manual intervention if you decide to implement your linked list as a pointer to a node, instead of having an actual linked list class.
As such, I'd say if you really want to use C++ (even close to how it's intended to work) creating a class to encapsulate your linked list is an absolute necessity. As long as you're just passing around pointers to nodes, what you're writing is fundamentally C (even if it may use some features specific to C++ so a C compiler won't accept it).
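A minimal sketch of what such a class could look like follows. It's deliberately incomplete (no iterators, no comparison operators), and member names such as push_front, clear and count are my own choices rather than anything prescribed above. The nodes are owned through std::unique_ptr, and clear tears the list down in a loop so that destroying a long list does not recurse.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

template <typename T>
class LinkedList {
    struct Node {
        T val;
        std::unique_ptr<Node> next;
    };
    std::unique_ptr<Node> head;
    std::size_t count = 0;   // lets size() run in constant time
public:
    LinkedList() = default;
    LinkedList(const LinkedList& other) {
        std::unique_ptr<Node>* tail = &head;   // slot where the next copy goes
        for (const Node* n = other.head.get(); n; n = n->next.get()) {
            *tail = std::make_unique<Node>(Node{n->val, nullptr});
            tail = &(*tail)->next;
        }
        count = other.count;
    }
    LinkedList(LinkedList&& other) noexcept
        : head(std::move(other.head)), count(std::exchange(other.count, 0)) {}
    // Copy-and-swap covers both copy assignment and move assignment.
    LinkedList& operator=(LinkedList other) noexcept {
        swap(other);
        return *this;
    }
    ~LinkedList() { clear(); }

    void swap(LinkedList& o) noexcept {
        head.swap(o.head);
        std::swap(count, o.count);
    }
    // Iterative teardown: each node is destroyed only after its next is detached.
    void clear() noexcept {
        while (head) head = std::move(head->next);
        count = 0;
    }
    void push_front(T v) {
        head = std::make_unique<Node>(Node{std::move(v), std::move(head)});
        ++count;
    }
    std::size_t size() const { return count; }
    const T& front() const { return head->val; }   // list must not be empty
};
```

With this in place, a LinkedList<int> declared as a local variable releases all of its nodes when it goes out of scope, and copying or moving it needs no code at the call site.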

Why is the first node of a linked list declared as a pointer?

Now I know why pointers are used in defining linked lists: a structure cannot have a recursive definition, and without pointers the compiler would not be able to calculate the size of the node structure.
struct list{
int data;
struct list* next; // this is fine
};
But confusion creeps up when I declare the first node of the linked list as:
struct list* head;
Why does this have to be a pointer? Can't it be simply declared as
struct list head;
and the address of this used for further uses? Please clarify my doubt.
There's no definitive answer to this question. You can do it either way. The answer to this question depends on how you want to organize your linked list and how you want to represent an empty list.
You have two choices:
A list without a "dummy" head element. In this case the empty list is represented by null in head pointer
struct list* head = NULL;
So this is the answer to your question: we declare it as a pointer to be able to represent an empty list by setting head pointer to null.
A list with a "dummy" head element. In this case the first element of the list is not used to store actual user data: it simply serves as a starting "dummy" element of the list. It is declared as
struct list head = { 0 };
The above represents an empty list, since head.next is null and head object itself "does not count".
I.e. you can declare it that way, if you so desire. Just keep in mind that head is not really a list element. The actual elements begin after head.
And, as always, keep in mind that when you use non-dynamically-allocated objects, the lifetime of those objects is governed by scoping rules. If you want to override these rules and control the objects' lifetimes manually, then you have no other choice but to allocate them dynamically and, therefore, use pointers.
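Here is a small sketch of how the two choices affect the code that inserts at the front. It is written so that it compiles as C++; the function names push_front, push_front_dummy and length are made up for the example. In plain C the first function would take struct list** instead of a reference.

```cpp
#include <cassert>
#include <cstddef>

struct list {
    int data;
    struct list* next;
};

// Choice 1: head is a pointer. The function must be able to modify the
// caller's head, hence the reference to a pointer.
inline void push_front(list*& head, list* node) {
    node->next = head;
    head = node;
}

// Choice 2: head is a dummy node. Insertion only ever touches head.next,
// so there is no special case for an empty list.
inline void push_front_dummy(list& head, list* node) {
    node->next = head.next;
    head.next = node;
}

// Counts the real elements, starting from the first real node.
inline int length(const list* first) {
    int n = 0;
    for (; first != NULL; first = first->next) ++n;
    return n;
}
```

With a pointer head you call length(head); with a dummy head you call length(head.next), since the dummy itself does not count.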
You can declare a list this way:
struct list head = {};
But this makes the functions that access the list harder to write. They have to take into account that the first node is not used like the other nodes of the list, and that its data member is not used either.
Usually the list is declared the following way
struct List
{
    // some other stuff as for example constructors and member functions
    struct node
    {
        int data;
        struct node* next; // this is fine
    } *head; // pointer to the first node, null for an empty list
};
and
List list = {};
Or in C++ you could write simply
struct List
{
    // some other stuff as for example constructors and member functions
    struct node
    {
        int data;
        struct node* next; // this is fine
    } *head = nullptr; // pointer to the first node, null for an empty list
};
List list;
Of course you could define the default constructor of the List yourself.
In this case for example to check whether the list is empty it is enough to define the following member function
struct List
{
    bool empty() const { return head == nullptr; }
    // some other stuff as for example constructors and member functions
    struct node
    {
        int data;
        struct node* next; // this is fine
    } *head = nullptr; // pointer to the first node, null for an empty list
};
In simple terms, if head marks the start of the linked list, it only needs to hold the address of the first node, where the list begins. Since head holds only an address, it is declared as a pointer, which also avoids confusing other programmers. The way you want to declare it is also fine, as long as you code accordingly. Tip: if you later want to insert or delete at the beginning of the list, a non-pointer head causes problems, because you will need another pointer variable anyway. So it's better to declare the first node as a pointer.

std::forward_list -- erasing with a stored iterator

I'm trying to keep a global list of a particular (base) class's instances so that I can track them down by iterating through this global list at any time.
I believe the most proper way to address this is with an intrusive list. I have heard that one can encounter these creatures by digging into the Linux kernel, for example.
In the situation where I'm in, I don't really need such guarantees of performance, and using intrusive lists will complicate matters somewhat for me.
Here's what I've got so far to implement this concept of a class that knows about all of its instances.
class A {
static std::forward_list<A*> globallist;
std::forward_list<A*>::iterator listhandle;
public:
A() {
globallist.push_front(this);
listhandle = globallist.begin();
}
virtual ~A() {
globallist.erase_after(...); // problem
}
};
The problem is that there is no forward_list::erase(), and it really does not appear like saving globallist.before_begin() in the ctor would do me much good. I'm never supposed to dereference before_begin()'s iterator. Will it actually hold on to the position? If I save out before_begin's iterator, and then push_front() a new item, that iterator is probably still not capable of being dereferenced, but will it be serviceable for sending to erase_after()?
forward_list is a singly linked list. To remove a node in the middle of that, you must have a pointer to previous node, somehow. For example, you could do something like this:
class A {
static std::forward_list<A*> globallist;
std::forward_list<A*>::iterator prev_node;
public:
A() {
A* old_head = globallist.front();
globallist.push_front(this);
prev_node = globallist.before_begin();
old_head->prev_node = globallist.begin();
}
};
The case of pushing the first element into an empty list, as well as the removal logic, are left as an exercise for the reader (when removing, copy your prev_node to the next node's prev_node).
Or, just use std::list and avoid all this trouble.
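For completeness, here is one way the exercise could be filled in. Treat it as a sketch rather than the definitive solution: the class is renamed Tracked, the registry lives in a function-local static to sidestep static initialization order, and count() is a helper added only so the behaviour can be checked. It assumes, as the code above does, that the iterator returned by before_begin() keeps referring to the position before the first element after later push_front calls.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <iterator>

class Tracked {
    using List = std::forward_list<Tracked*>;
    static List& registry() {
        static List r;   // created on first use
        return r;
    }
    List::iterator prev_node;   // iterator to the node *before* this object's node
public:
    Tracked() {
        List& reg = registry();
        Tracked* old_head = reg.empty() ? nullptr : reg.front();
        reg.push_front(this);
        prev_node = reg.before_begin();
        // Our node now sits in front of the old head.
        if (old_head) old_head->prev_node = reg.begin();
    }
    Tracked(const Tracked&) = delete;
    Tracked& operator=(const Tracked&) = delete;
    virtual ~Tracked() {
        List& reg = registry();
        auto self = std::next(prev_node);   // the node holding `this`
        auto after = std::next(self);
        // Hand our predecessor over to the node that follows us.
        if (after != reg.end()) (*after)->prev_node = prev_node;
        reg.erase_after(prev_node);
    }
    static std::size_t count() {
        List& reg = registry();
        return static_cast<std::size_t>(std::distance(reg.begin(), reg.end()));
    }
};
```

Copying is disabled because a copied prev_node would point at the wrong position. All of this bookkeeping is what std::list, or a real intrusive list, would spare you.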

Representing Rooted Tree in terms of class

I am having difficulty understanding the following paragraph taken from Representing Rooted Trees. It shows two methods for representing trees. The G&T one is fairly clear to me, but the other one, which shows a class definition, is not.
G&T Option: Each node has 3-references: item, parent, children. The single reference for children has to refer to a list (so the node can have as many children as necessary).
Another option is to have siblings linked directly. E.g.
class SibTreeNode {
Object item;
SibTreeNode parent;
SibTreeNode firstChild; // Left-most child.
SibTreeNode nextSibling;
}
public class SibTree {
SibTreeNode root;
int size; // Number of nodes in the tree.
}
The author in the video also claims (at around 18 minutes) that second method will require less memory. Can someone help me understand the class definitions and how this will require less memory as compared to first method?
The second option is simply an intrusive singly-linked list. Intrusive singly-linked lists take up less memory because you don't need a pointer from the list node to the node's contents.
Look at the layout for the G&T option with an external list:
class ListNode {
SibTreeNode* object;
ListNode* nextElement;
};
class SibTreeNode {
Object* item;
SibTreeNode* parent;
ListNode* childList;
};
with 5 pointers per SibTreeNode element instead of 4.
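The count can be checked with sizeof. This is a sketch with renamed types (GTNode, ChildListNode and SibNode are my own names), and it assumes a typical platform where all object pointers have the same size and structs made only of pointers have no padding.

```cpp
#include <cassert>
#include <cstddef>

struct Object;   // opaque payload, only ever used through a pointer

// G&T layout: the children live in a separate, non-intrusive list.
struct GTNode;
struct ChildListNode {
    GTNode* child;
    ChildListNode* next;
};
struct GTNode {
    Object* item;
    GTNode* parent;
    ChildListNode* children;
};

// Sibling layout: the "list of children" is threaded through the nodes.
struct SibNode {
    Object* item;
    SibNode* parent;
    SibNode* firstChild;
    SibNode* nextSibling;
};

constexpr std::size_t ptr_size = sizeof(void*);
// Every non-root node costs one GTNode plus the ChildListNode that refers to it.
constexpr std::size_t gt_pointers_per_node =
    (sizeof(GTNode) + sizeof(ChildListNode)) / ptr_size;
constexpr std::size_t sib_pointers_per_node = sizeof(SibNode) / ptr_size;
```

On top of the extra pointer, the G&T layout needs a second allocation per node for the ChildListNode, which usually carries allocator overhead of its own.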

Implementing a templated doubly linked list of pointers to objects

I'm a little confused about implementing a doubly linked list where the data in the list are pointers.
The private part of my linked list class looks like:
private:
struct node {
node* next;
node* prev;
T* o;
};
node* first; // The pointer to the first node (NULL if none)
node* last; // The pointer to the last node (NULL if none)
unsigned int size_;
As you can see, the list is full of pointers to objects rather than just plain old objects, which makes it a little more confusing to me.
The following is the description in the spec:
Note that while this list is templated across the contained type, T, it inserts and removes only pointers to T, not instances of T. This ensures that the Dlist implementation knows that it owns inserted objects, it is responsible for copying them if the list is copied, and it must destroy them if the list is destroyed.
Here is my current implementation of insertFront(T* o):
void Dlist::insertFront(T* o) {
node* insert = new node();
insert->o = new T(*o);
insert->next = first;
insert->prev = last;
first = insert;
}
This seems wrong though. What if T doesn't have a copy constructor? And how does this ensure sole ownership of the object in the list?
Could I just do:
insert->o = o;
It seems like this is not safe, because if you had:
Object* item = new Object();
dlist.insertFront(item);
delete item;
Then the item would also be destroyed for the list. Is this correct? Is my understanding off anywhere?
Thanks for reading.
Note: While this looks like homework, it is not. I am actually a java dev just brushing up my pointer skills by doing an old school project.
When you have a container of pointers, you have one of the two following usage scenarios:
A pointer is given to the container and the container takes responsibility for deleting the pointer when the containing structure is deleted.
A pointer is given to the container but owned by the caller. The caller takes responsibility for deleting the pointer when it is no longer needed.
Number 1 above is quite straight-forward.
In the case of number 2, it is expected that the owner of the container (presumably also the caller) will remove the item from the container prior to deleting the item.
I have purposely left out a third option, which is actually the option you took in your first code example. That is to allocate a new item and copy it. The reason I left it out is because the caller can do that.
The other reason for leaving it out is that you may want a container that can take non-pointer types. Requiring it to be a pointer by always using T* instead of T may not be as flexible as you want. There are times when you should force it to be a pointer, but I can't think of any use (off the top of my head) for doing this for a container.
If you allow the user to declare Dlist<MyClass*> instead of Dlist<MyClass> then the owner of that list is implicitly aware that it is using pointers and this forces them to assume scenario Number 2 from above.
Anyway, here are your examples with some commentary:
1. Do not allocate a new T item unless you have a very good reason. That reason may simply be encapsulation. Although I mentioned above that you shouldn't do this, there are times when you may want to. If there is no copy constructor, then your class is probably plain-old-data. If copying is non-trivial, you should follow the Rule of Three.
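template <typename T>
void Dlist<T>::insertFront(T* o) {
    node* insert = new node();
    insert->o = new T(*o); //<-- Follow rule of three
    insert->next = first;
    insert->prev = NULL;             // a new front node has no predecessor
    if (first) first->prev = insert; // old front now points back at the new node
    else last = insert;              // list was empty
    first = insert;
    ++size_;
}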
2. This is what you would normally do
insert->o = o;
3. You must not delete your item after inserting. Either pass ownership to your container, or delete the item when neither you nor the container requires it anymore.
Object* item = new Object();
dlist.insertFront(item);
delete item; //<-- The item in the list is now invalid
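If you are free to change the interface from the spec, scenario 1 (the container takes ownership) can be made explicit in the type system with std::unique_ptr. The following is a sketch under that assumption, not the interface your spec describes: insertFront takes a std::unique_ptr<T> and removeFront hands ownership back. Copying the list is simply disabled here to keep the example short.

```cpp
#include <cassert>
#include <memory>
#include <utility>

template <typename T>
class Dlist {
    struct node {
        node* next;
        node* prev;
        std::unique_ptr<T> o;   // the node owns its object
    };
    node* first = nullptr;
    node* last = nullptr;
    unsigned int size_ = 0;
public:
    Dlist() = default;
    Dlist(const Dlist&) = delete;
    Dlist& operator=(const Dlist&) = delete;
    ~Dlist() { while (first) removeFront(); }

    // The caller gives up ownership by moving the pointer in.
    void insertFront(std::unique_ptr<T> o) {
        node* n = new node{first, nullptr, std::move(o)};
        if (first) first->prev = n;
        else last = n;
        first = n;
        ++size_;
    }
    // Ownership of the object goes back to the caller; null if the list is empty.
    std::unique_ptr<T> removeFront() {
        if (!first) return nullptr;
        node* n = first;
        first = n->next;
        if (first) first->prev = nullptr;
        else last = nullptr;
        std::unique_ptr<T> o = std::move(n->o);
        delete n;
        --size_;
        return o;
    }
    unsigned int size() const { return size_; }
};
```

A call then looks like dlist.insertFront(std::make_unique<Object>()), and the "delete item after inserting" mistake no longer compiles, because the caller has no owning pointer left to delete.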