The http://clojure.org/data_structures page describes all Clojure collections as "immutable and persistent". What exactly does "persistent" mean in this context? Does anybody have a clear definition or explanation?
It is the same sense of "persistent" as in the Wikipedia article on persistent data structures. In summary:
In computing, a persistent data structure is a data structure that
always preserves the previous version of itself when it is modified.
Such data structures are effectively immutable, as their operations do
not (visibly) update the structure in-place, but instead always yield
a new updated structure. (A persistent data structure is not a data
structure committed to persistent storage, such as a disk; this is a
different and unrelated sense of the word "persistent.")
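To make the definition concrete, here is a minimal sketch of a persistent singly linked list. It is written in C++ rather than Clojure, and the names ListNode, List, cons and length are made up for this illustration; Clojure's real collections use more elaborate tree structures. The point it shows is that "modifying" the list yields a new version while the old one stays valid and shares structure with it.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// A minimal persistent (immutable) singly linked list of ints.
struct ListNode {
    int value;
    std::shared_ptr<const ListNode> next;
};

using List = std::shared_ptr<const ListNode>;

// "Adding" an element builds one new node that points at the old list.
// The old list is neither copied nor modified.
List cons(int value, List tail) {
    return std::make_shared<ListNode>(ListNode{value, std::move(tail)});
}

int length(const List& list) {
    int n = 0;
    for (const ListNode* p = list.get(); p != nullptr; p = p->next.get()) ++n;
    return n;
}

// Hypothetical usage: both versions stay valid and share their tail.
bool persistent_list_demo() {
    List v1 = cons(2, cons(3, nullptr));
    List v2 = cons(1, v1);        // new version
    return length(v1) == 2        // old version is unchanged
        && length(v2) == 3
        && v2->next == v1;        // structure is shared, not copied
}
```

Because nodes are never modified after construction, any number of versions can share a tail safely.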
In one of my projects I have to cache positional information about certain data chunks found in large files. I've already implemented a small API built around std::basic_istream<char>::pos_type placed in maps.
Now I need to serialize these descriptors into a bytestream and write them to disk for later use (on other *nix machines as well). I have read that this type is platform-dependent, though it is still supposedly a POD type. So my questions are:
Would it be better to save something besides the plain offsets, e.g. a std::fpos<std::mbstate_t> that keeps the state of the reading structure?
How can I safely obtain and restore the offset data from std::basic_istream<char>::pos_type (and other information, if it is needed)?
Thank you in advance.
The structure of std::fpos<mbstate_t> is unspecified and there may be non-trivial state in the mbstate_t. You certainly can't portably serialize these objects. You can, however, obtain a value of the offset type (std::streamoff), which is an integer type whose value can be serialized.
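A minimal sketch of that round trip, assuming the files are read without any multibyte conversion state worth keeping. The helper names to_serializable and from_serializable are made up for this example, and std::int64_t is an assumed on-disk width; the byte order of the stored integer still has to be fixed separately if the files move between machines.

```cpp
#include <cassert>
#include <cstdint>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Convert a stream position to a fixed-width integer suitable for writing to disk.
// Any multibyte conversion state held in the position is dropped here.
std::int64_t to_serializable(std::istream::pos_type pos) {
    return static_cast<std::int64_t>(static_cast<std::streamoff>(pos));
}

// Rebuild a position from a stored offset (conversion state is default-initialized).
std::istream::pos_type from_serializable(std::int64_t stored) {
    return std::istream::pos_type(static_cast<std::streamoff>(stored));
}

// Hypothetical round trip on an in-memory stream.
bool offset_round_trip_demo() {
    std::istringstream in("header|chunk");
    in.seekg(7);                                        // start of "chunk"
    std::int64_t stored = to_serializable(in.tellg());  // value to write to disk
    in.seekg(0);
    in.seekg(from_serializable(stored));                // restore the position
    std::string rest;
    in >> rest;
    return stored == 7 && rest == "chunk";
}
```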
I am developing a set of vector classes that all derive from an abstract vector. I am doing this so that, in our software that makes use of these vectors, we can quickly switch between the vectors without any code breaking (or at least minimize failures, but my goal is full compatibility). All of the vectors expose matching interfaces.
I am working on a Disk Based Vector that mostly conforms to the STL vector interface. I am doing this because we need to handle files that are too large to fit in memory and that contain various formats of data. The Disk Vector handles data read/write to disk by using template specialization/polymorphism of serialization and deserialization classes. The data serialization and deserialization has been tested, and it works (up to now). My problem occurs when dealing with references to the data.
For example,
Given a DiskVector dv, a call to dv[10] would get a pointer to a spot on disk, seek there, and read out the char stream. This stream gets passed to a deserializer, which converts the byte stream into the appropriate data type. Once I have the value, I can return it.
This is where I run into a problem. The STL returns the element as a reference, so in order to match its style, I need to return a reference. What I do is store the value in an unordered_map under the given index (in this example, 10). Then I return a reference to the value in the unordered_map.
If this continues without cleanup, then the purpose of the DiskVector is lost because all the data just gets loaded into memory, which is bad due to data size. So I clean up this map by deleting the indexes later on when other calls are made. Unfortunately, if a user decided to store this reference for a long time, and then it gets deleted in the DiskVector, we have a problem.
So my questions
Is there a way to see if any other references to a certain instance are in use?
Is there a better way to solve this while still maintaining the polymorphic style for reasons described at the beginning?
Is it possible to construct a special class that would behave as a reference, but handle the disk IO dynamically so I could just return that instead?
Any other ideas?
A better solution for what I was trying to do is to use SQLite as the backend for the database, with BLOBs as the column types for the key and value columns. This is the approach I am taking now. That said, in order to get it to work well, I need to use what cdhowie posted in the comments to my question.
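The idea raised in the question, a class that behaves like a reference but performs the disk I/O itself, could be sketched roughly as follows. Everything here is hypothetical: DiskVectorSketch and its nested Ref are made-up names, the element type is fixed to int, and a std::stringstream stands in for the file (a std::fstream would be used in practice). It is not the code from the comments mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <sstream>

// Hypothetical stand-in for the DiskVector in the question: fixed-size int
// records stored in a stream instead of in memory.
class DiskVectorSketch {
public:
    // Proxy returned by operator[]; every read and write goes to the stream.
    class Ref {
    public:
        Ref(DiskVectorSketch& owner, std::size_t index) : owner_(owner), index_(index) {}
        operator int() const { return owner_.read(index_); }
        Ref& operator=(int value) { owner_.write(index_, value); return *this; }
        Ref& operator=(const Ref& other) { return *this = static_cast<int>(other); }
    private:
        DiskVectorSketch& owner_;
        std::size_t index_;
    };

    // Fill the backing stream with 'count' zero-valued records.
    explicit DiskVectorSketch(std::size_t count) {
        const int zero = 0;
        for (std::size_t i = 0; i < count; ++i)
            storage_.write(reinterpret_cast<const char*>(&zero), sizeof zero);
    }

    Ref operator[](std::size_t index) { return Ref(*this, index); }

private:
    int read(std::size_t index) {
        int value = 0;
        storage_.seekg(static_cast<std::streamoff>(index * sizeof value));
        storage_.read(reinterpret_cast<char*>(&value), sizeof value);
        return value;
    }
    void write(std::size_t index, int value) {
        storage_.seekp(static_cast<std::streamoff>(index * sizeof value));
        storage_.write(reinterpret_cast<const char*>(&value), sizeof value);
    }
    std::stringstream storage_;
};

// Hypothetical usage.
bool proxy_demo() {
    DiskVectorSketch dv(16);
    dv[10] = 42;         // writes through to the backing stream
    int copy = dv[10];   // reads back from the backing stream
    dv[3] = dv[10];      // proxy-to-proxy assignment copies the value
    return copy == 42 && dv[3] == 42 && dv[0] == 0;
}
```

Since the proxy holds only an index, nothing has to be cached or cleaned up, so a stored proxy cannot dangle the way a cached reference can. The cost is the same one std::vector<bool> pays: operator[] no longer returns a real T&, so code that binds the result to a T& will not compile.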
Stash library in "Thinking in C++" by Bruce Eckel:
Basically he seems to be setting up an array-index-addressable interface (via fetch) to a set of entities that are actually stored at random memory locations, without actually copying those data entities, in order to simulate for the user the existence of an actual contiguous-memory data block. In short, a contiguous, indexed address map. Do I have this right? Also, his mapping is on a byte-by-byte basis; if it were not for this requirement (and I am unsure of its importance), I believe that there may be simpler ways to generate such a data structure in C++. I looked into memcpy, but do not see how to actually copy data on a byte-by-byte basis to create such an indexed structure.
Prior posting:
This library appears to create a pointer assemblage, not a data-storage assemblage.
Is this true? (Applies to both the C and C++ versions.) Thus the name "stash" might be a little misleading, as nothing but pointers to data stashed elsewhere is put into a "stash," and Eckel states that "the data is copied."
Background: Looking at “add” and “inflate,” the so-called “copying” is equating pointers to other pointers (“storage” to “e” in “add” and “b” to “storage” in “inflate”). The use of “new” in this case is strange to me, because storage for data is indeed allocated but “b” is set to the address of the data, and no data assignments seem to take place in the entire library. So I am not sure what the point of the “allocation” by “new” is when the allocated space is apparently never written into or read from in the library. The “element” to be added exists elsewhere in memory already, and seemingly all we are doing is creating a sequential pointer structure to each byte of every “element” desired to be reference-able through CStash. Do I understand this library correctly?
Similarly, it looks as though “stack” in the section “Nested structures” appears actually to work only with addresses of data, not with data. I wrote my own linked-list stack successfully, which actually stores data in the stack nodes.
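For comparison, here is a minimal sketch of a container that really does copy element bytes into storage it owns, as opposed to one that only records pointers to data living elsewhere. This is a simplified illustration and not the code from the book; ByteStash and its members are made-up names, and the element size is assumed to be non-zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Simplified illustration only; this is not the code from the book.
class ByteStash {
public:
    explicit ByteStash(std::size_t element_size) : size_(element_size) {}

    // Copies size_ bytes from 'element' into the stash's own storage
    // and returns the index of the stored copy.
    std::size_t add(const void* element) {
        const std::size_t offset = bytes_.size();
        bytes_.resize(offset + size_);   // grow the owned storage
        std::memcpy(bytes_.data() + offset, element, size_);
        return offset / size_;
    }

    // Returns the address of the stored copy, or nullptr if out of range.
    const void* fetch(std::size_t index) const {
        if (index >= bytes_.size() / size_) return nullptr;
        return bytes_.data() + index * size_;
    }

private:
    std::size_t size_;                  // size of one element in bytes
    std::vector<unsigned char> bytes_;  // contiguous storage owned by the stash
};

// Hypothetical usage.
bool byte_stash_demo() {
    ByteStash stash(sizeof(int));
    int x = 7;
    std::size_t i = stash.add(&x);
    x = 99;   // changing the original does not affect the stored copy
    int stored = 0;
    std::memcpy(&stored, stash.fetch(i), sizeof stored);
    return stored == 7 && stash.fetch(i) != &x && stash.fetch(1) == nullptr;
}
```

A quick way to tell the two designs apart is the test in the usage function: if changing the original after add also changes what fetch returns, only a pointer was stored.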
I have a tree structure with a lot of pointers, basically a node of the tree is like this
class Node
{
    Node *my_father;              // parent node
    QVector<Node*> my_children;   // child nodes
    // ... a lot of data
};
I need all these pointers to make my job easier while the tree is in RAM. But now I need to save the whole tree structure to disk. I was thinking about using QDataStream serialization (http://www.developer.nokia.com/Community/Wiki/Qt_Object_Serialization), but I don't think this is going to work with pointers, right?
What would you suggest for saving this big structure to disk and reading it back into RAM with working pointers?
Why don't you use the XML format? By design it is very easy to use with structured data and nested objects, such as the tree structure you have. But you don't want to store pointers in it, just the actual data. (The tree structure that your pointers describe becomes the XML structure itself, so you don't need to store them.)
Then you'll need to recreate the pointers while reading the file, as you allocate new children for each node.
BTW sorry for making this answer and not comment but I can't write question comments yet ;].
Clearly, there is no guarantee that pointers read from disk would ever be valid as such. But you could still use them as 'integer IDs', as follows. To write, save the pointers to disk along with the rest of the data. In addition, for each class instance, save its own address to disk. This will be that object's 'integer ID'. To read,
1) Use the saved integer ID information to associate each object with its children and father. Initially, you'll probably have to read all of your Nodes into a single big list.
2) Then, once the children and father are in memory, write their actual addresses into my_children and my_father respectively.
Feels a bit hacky to me but I can't think of a more direct way to get at this.
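The ID approach can be sketched as below, with a few assumptions that are not in the question: std::vector and plain text streams are used instead of QVector and QDataStream to keep the example self-contained, the IDs are indices in pre-order rather than raw addresses, and TreeNode with a single int payload stands in for the real Node class. Only the father ID is stored, because the children lists can be rebuilt from it.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <memory>
#include <ostream>
#include <sstream>
#include <unordered_map>
#include <vector>

struct TreeNode {
    TreeNode* father = nullptr;
    std::vector<TreeNode*> children;
    int payload = 0;   // stands in for "a lot of data"
};

// Collect the nodes in pre-order; a node's position in 'order' is its ID.
void collect(const TreeNode* node, std::vector<const TreeNode*>& order) {
    order.push_back(node);
    for (const TreeNode* child : node->children) collect(child, order);
}

// Write the node count, then one "payload father_id" line per node (-1 = no father).
void save_tree(const TreeNode* root, std::ostream& out) {
    std::vector<const TreeNode*> order;
    collect(root, order);
    std::unordered_map<const TreeNode*, long> id;
    for (std::size_t i = 0; i < order.size(); ++i) id[order[i]] = static_cast<long>(i);
    out << order.size() << '\n';
    for (const TreeNode* n : order)
        out << n->payload << ' ' << (n->father ? id[n->father] : -1L) << '\n';
}

// Read all nodes into one flat list, then turn father IDs back into pointers.
// Returns the owning storage; element 0 is the root.
std::vector<std::unique_ptr<TreeNode>> load_tree(std::istream& in) {
    std::size_t count = 0;
    in >> count;
    std::vector<std::unique_ptr<TreeNode>> nodes;
    std::vector<long> father_id(count);
    for (std::size_t i = 0; i < count; ++i) {
        nodes.push_back(std::unique_ptr<TreeNode>(new TreeNode()));
        in >> nodes[i]->payload >> father_id[i];
    }
    for (std::size_t i = 0; i < count; ++i) {
        if (father_id[i] < 0) continue;   // the root has no father
        TreeNode* father = nodes[static_cast<std::size_t>(father_id[i])].get();
        nodes[i]->father = father;
        father->children.push_back(nodes[i].get());
    }
    return nodes;
}

// Hypothetical round trip through an in-memory "file".
bool tree_round_trip_demo() {
    TreeNode root, a, b, c;
    root.payload = 1; a.payload = 2; b.payload = 3; c.payload = 4;
    root.children = {&a, &b};  a.father = &root;  b.father = &root;
    a.children = {&c};         c.father = &a;

    std::stringstream file;
    save_tree(&root, file);
    auto nodes = load_tree(file);
    const TreeNode* r = nodes[0].get();
    return nodes.size() == 4 && r->father == nullptr && r->children.size() == 2
        && r->children[0]->payload == 2 && r->children[1]->payload == 3
        && r->children[0]->children[0]->payload == 4
        && r->children[0]->children[0]->father == r->children[0];
}
```

Using indices instead of saved addresses avoids writing pointer values to disk at all, but the two-pass load is the same as in the steps above.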
One of the selling points of immutable data structures is that they are automatically parallelizable. If no mutation is going on, then references to a functional data structure can be passed around between threads without any locking.
I got to thinking about how functional data structures would be implemented in C++. Suppose that we have a reference count on each node of our data structure. (Functional data structures share structure between the old and updated versions of a data structure, so nodes would not belong uniquely to one particular data structure.)
The problem is that if reference counts are being updated in different threads, then our data structure is no longer thread safe. And attaching a mutex to every single node is both expensive and defeats the purpose of using immutable data structures for concurrency.
Is there a way to make concurrent immutable data structures work in C++ (and other non-garbage-collected environments)?
There are lock-free reference counting algorithms, see, e.g. Lock-Free Reference Counting, Atomic Reference Counting Pointers.
Also note that C++0x (which will probably become C++11 soon) contains atomic integer types especially for purposes like this one.
Well, garbage collected languages also have the issue of multi-threaded environments (and it is not easy for mutable structures).
You have forgotten that, unlike arbitrary data, counters can be incremented and decremented atomically, so a mutex is unnecessary. Cache coherence between processors still has to be maintained, though, which can be costly if a single node's count keeps being updated.
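A minimal sketch of such an atomically updated count, using std::atomic from C++11. SharedNode, retain and release are made-up names for this illustration, and the memory orderings shown are one common choice rather than the only correct one. std::shared_ptr already maintains its reference count atomically, so in practice it may be enough on its own.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Immutable node with an intrusive, atomically updated reference count.
struct SharedNode {
    explicit SharedNode(int v) : value(v) {}
    const int value;
    mutable std::atomic<int> refs{0};
};

inline void retain(const SharedNode* n) {
    n->refs.fetch_add(1, std::memory_order_relaxed);
}

// Returns true if this call dropped the last reference and deleted the node.
inline bool release(const SharedNode* n) {
    if (n->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete n;
        return true;
    }
    return false;
}

// Hypothetical usage: four threads share one node without any mutex.
bool atomic_refcount_demo() {
    const SharedNode* node = new SharedNode(42);
    retain(node);   // reference held by this function

    std::atomic<int> deletions{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        retain(node);   // one reference handed to each worker before it starts
        workers.emplace_back([node, &deletions] {
            for (int i = 0; i < 10000; ++i) {
                retain(node);                     // temporary extra reference
                if (release(node)) ++deletions;   // never the last one here
            }
            if (release(node)) ++deletions;       // drop the worker's reference
        });
    }
    for (auto& w : workers) w.join();

    const bool alive = node->refs.load() == 1 && node->value == 42 && deletions == 0;
    const bool freed = release(node);   // last reference: the node is deleted once
    return alive && freed;
}
```

Every retain and release on the same node touches the same cache line, which is the contention cost described above.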