Python lists with STL like interface - c++

I have to port a C++ STL application to Python. I am a Python newbie, but have been programming for over a decade. I have a great deal of experience with the STL, and find that it keeps me hooked to using C++. I have been searching for the following items on Google these past several days:
Python STL (in hope of leveraging my years of STL experience)
Python linked lists
Python advanced list usage
Python list optimization
Python ordered sets
And have found posts about the above topic, tutorials on Python lists that are decidedly NOT advanced, or dead ends. I am really surprised at my lack of success, I think I am just burned out from overworking and entering bad search terms!
(MY QUESTION) Can I get a Python STL wrapper, or an interface to Python lists that works like the STL? If not can someone point me to a truly advanced tutorial or paper on managing very large sorted collections of non trivial objects?
P.S. I can easily implement workarounds for one or two uses, but if management wants to port more code, I want to be ready to replace any STL code I find with equivalent Python code immediately. And YES I HAVE MEASURED AND DO NEED TO HAVE TOTALLY OPTIMAL CODE! I CANT JUST DO REDUNDANT SORTS AND SEARCHES!
(ADDENDUM) Thanks for the replies, I have checked out some of the references and am pleased. In response to some of the comments here:
1 - It is being ported to Python because management says so. I would just as soon leave it alone: if it ain't broke, why fix it?
2 - Advanced list usage with non trivial objects, what I mean by that is: Many different ways to order and compare objects, not by one cmp method. I want to splice, sort, merge, search, insert, erase, and combine the lists extensively. I want lists of list iterators, I want to avoid copying.
3 - I now know that built in lists are actually arrays, and I should be looking for a different python class. I think this was the root of my confusion.
4 - Of course I am learning to do things in the Python way, but I also have deadlines. The STL code I am porting is working right, I would like to change it as little as possible, because that would introduce bugs.
Thanks to everyone for their input, I really appreciate it.

Python's "lists" are not linked lists -- they're like Java ArrayLists or C++'s std::vectors, i.e., in lower-level terms, a resizable compact array of pointers.
A good "advanced tutorial" on such subjects is Hettinger's Core Python containers: under the hood presentation (the video at the URL is of the presentation at an Italian conference, but it's in English; another, shorter presentation of essentially the same talk is here).
So the performance characteristics of Python lists are essentially those of C++'s std::vector: Python's .append, like C++'s push_back, is amortized O(1), but insertion or removal "in the middle" is O(N). Consequently, keeping a list sorted (as can be easily done with the help of functions in Python's standard library module bisect) is costly (if items arrive and/or depart randomly, each insertion and removal is O(N), just as similarly maintaining order in a std::vector would be). For some purposes, such as priority queues, you may get away with a "heap queue", also easy to maintain with the help of functions in Python's standard library module heapq -- but of course that doesn't afford the same range of uses as a completely sorted list (or vector) would.
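As a minimal sketch of the two standard-library helpers mentioned above (the data and the contains helper are made up for illustration, they are not from the original application):

```python
import bisect
import heapq

# Keep a plain list sorted. Finding the slot is O(log N), but the
# insertion itself still shifts elements, so the whole step is O(N).
scores = [10, 20, 40]
bisect.insort(scores, 30)

def contains(sorted_items, value):
    """Binary-search membership test (hypothetical helper),
    similar in spirit to std::binary_search."""
    i = bisect.bisect_left(sorted_items, value)
    return i < len(sorted_items) and sorted_items[i] == value

# A heap only guarantees that the smallest item sits at index 0;
# the rest of the list is not fully sorted.
tasks = [5, 1, 4]
heapq.heapify(tasks)
heapq.heappush(tasks, 2)
smallest = heapq.heappop(tasks)
```

The list stays a compact array throughout, which is why lookups are fast and mid-list insertions are not.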
So for purposes for which in C++ you'd use a std::set (and rely on its being ordered, i.e., a hashset wouldn't do -- Python's sets are hash-based, not ordered) you may be better off avoiding Python builtin containers in favor of something like this module (if you need to keep things pure-Python), or this one (which offers AVL trees, not RB ones, but is coded as a C-implemented Python extension and so may offer better performance) if C-coded extensions are OK.
If you do end up using your own module (be it pure Python, or C-coded), you may, if you wish, give it an STL-like veneer/interface (with .begin, .end, iterator objects that are advanced by incrementing rather than, as per normal Python behavior, by calling their next methods, ...), although it will never perform as well as "going with the grain" of the language would (the for statement is optimized to use normal Python iterators, i.e., one with next methods, and it will be faster than wrapping a somewhat awkward while around non-Python-standard, STL-like iterators).
To give an STL-like veneer to any Python built-in container, you'll incur substantial wrapping overhead, so the performance hit may be considerable. If you, as you say, "DO NEED TO HAVE TOTALLY OPTIMAL CODE", using such a veneer just for "syntax convenience" purposes would therefore seem to be a very bad choice.
Boost.Python, a C++ library that lets you expose C++ classes and functions (including STL-based code) to Python, might perhaps serve your purposes best.

If I were you I would take the time to learn how to properly use the various data structures available in Python instead of looking for things that are similar to what you know from C++.
It's not like you're looking for something fancy, just working with some data structures. In that case I would refer you to Python's documentation on the subject.
Doing this the 'Python' way would help you and more importantly future maintainers who will wonder why you try to program C++ in Python.
Just to whet your appetite, there's also no reason to prefer STL's style to Python's (and for the record, I'm also a C++ programmer who knows the STL thoroughly). Consider the most trivial example of constructing a list and traversing it:
The Pythonic way:
mylist = [1, 2, 3, 4]
for value in mylist:
    pass  # play around with value
The STL way (I made this up, to resemble STL) in Python:
mylist = [1, 2, 3, 4]
mylistiter = mylist.begin()
while mylistiter != mylist.end():
    value = mylistiter.item()
    mylistiter.next()

For linked-list-like operations people usually use collections.deque.
What operations do you need to perform fast? Bisection? Insertion?
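A small sketch of what collections.deque offers, with made-up values. Appends and pops at either end are O(1), but indexing or inserting in the middle is not, so it covers only part of what std::list does (there is no splice, for instance):

```python
from collections import deque

d = deque([2, 3, 4])
d.appendleft(1)        # O(1) at the left end
d.append(5)            # O(1) at the right end
first = d.popleft()    # removes and returns the leftmost item
d.rotate(-1)           # rotate one step to the left
```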

I would say that your issues go beyond just STL porting. Since the list, dict, and set data structures, which are bolted on to C++ via the STL, are native to core Python, then their usage is incorporated into common Python code idioms. If you want to give Google another shot, try looking for references for "Python for C++ Programmers". One of your hits will be this presentation by Alex Martelli. It's a little dated, from way back in ought-three, but there is a side-by-side comparison of some basic Python code that reads through a text file, and how it would look using STL.
From there, I would recommend that you read up on these Python features:
iterators
generators
list and generator comprehensions
And these builtin functions:
zip
map
Once you are familiar with these, then you will be able to construct your own translation/mapping between STL usage and Python builtin data structures.
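To make that list concrete, here is a rough sketch of those features side by side. The data and the evens function are invented for illustration, and the STL names in the comments are only loose analogies:

```python
def evens(limit):
    """Generator: yields values lazily, one per iteration."""
    for n in range(0, limit, 2):
        yield n

names = ["a", "b", "c"]
sizes = [3, 1, 2]

# zip walks several sequences in lockstep, a bit like advancing
# two STL iterators together.
pairs = list(zip(names, sizes))

# map applies a function to each element, roughly std::transform.
doubled = list(map(lambda s: s * 2, sizes))

# List comprehension with a filter, roughly std::copy_if plus
# std::transform in one expression.
big = [name.upper() for name, size in pairs if size >= 2]

# Generator expression: no intermediate list is built.
total = sum(n * n for n in evens(10))
```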
As others have said, if you are looking for a "plug-and-chug" formula to convert STL C++ code to Python, you will just end up with bad Python. Such a brute force approach will never result in the power, elegance, and brevity of a single-line list comprehension. I had this very experience when introducing Python to one of our managers, who was familiar with Java and C++ iterators. When I showed him this code:
numParams = 1000
paramRequests = [ ("EqptEmulator/ProcChamberI/Sensors",
                   "ChamberIData%d"%(i%250)) for i in range(numParams) ]
record.internalArray = [ParameterRequest(*pr) for pr in paramRequests]
and I explained that these replaced this code (or something like it, this might be a mishmash of C++ and Java APIs, sorry):
std::vector<ParameterRequest> prs = new std::vector<ParameterRequest>();
for (int i = 0; i<1000; ++i) {
    string idstr;
    strstream sstr(idstr);
    sstr << "ChamberIData" << (i%250);
    prs.add(new ParameterRequest("EqptEmulator/ProcChamberI/Sensors", idstr));
}
record.internalArray = new ParameterRequest[prs.size];
prs.toArray(record.internalArray);
One of your instincts from working with C++ will be a reluctance to create new lists from old, but rather to update or filter a list in place. We even see this on many forums from Python developers asking about how to modify a list while iterating over it. In Python, you are much better off building a new list from the old with a list comprehension.
allItems = [... some list of items, perhaps from a database query ...]
validItems = [it for it in allItems if it.isValid()]
As opposed to:
validItems = []
for it in allItems:
    if it.isValid():
        validItems.append(it)
or worse:
# get list of indexes of items to be removed
removeIndexes = []
for i in range(len(allItems)):
    if not allItems[i].isValid():
        removeIndexes.append(i)
# don't forget to remove items in descending order, or later indexes
# will be invalidated by earlier removals
removeIndexes.sort(reverse=True)
# copy list
validItems = allItems[:]
# now remove the invalid items from the copy
for idx in removeIndexes:
    del validItems[idx]

Python STL (in hope of leveraging my years of STL experience) - Start with the collections ABC's to learn what Python has. http://docs.python.org/library/collections.html
Python linked lists. Python lists have all the features you would want from a linked list.
Python advanced list usage. What does this mean?
Python list optimization. What does this mean?
Python ordered sets. You have several choices here: you could invent your own "ordered set" as a list that discards duplicates, or you could wrap the heapq functions in a class of your own that discards duplicates (heapq is a module of functions that operate on plain lists, so there is no class to subclass): http://docs.python.org/library/heapq.html.
In many cases, however, the cost of maintaining an ordered set is actually excessive because the data only needs to be ordered once, at the end of the algorithm. In other cases, the "ordered set" really is a heapq -- you never needed the set-like features and only needed the ordering.
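Both cases can be sketched with the standard library alone. The readings list is invented sample data:

```python
import heapq

readings = [7, 3, 7, 1, 3, 9]

# Case 1: collect with a hash-based set (which drops duplicates),
# then order once at the end.
unique_sorted = sorted(set(readings))

# Case 2: only "give me the next smallest" is needed, so a heap
# is enough and no full ordering is ever maintained.
heap = list(readings)
heapq.heapify(heap)
two_smallest = [heapq.heappop(heap) for _ in range(2)]
```

Note that the heap does not discard duplicates on its own; that would need the extra wrapper logic mentioned above.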
Non-Trivial.
(I'm guessing at what you meant by "non-trivial"). All Python objects are equivalent. There's no "trivial" vs. "non-trivial" objects. They're all first-class objects and can all have "non-trivial" complexity without any real work. This is not C++ where there are primitive (non-object) values floating around. Everything's an object in Python.
Management Expectations.
For the most part the C++ brain-cramping doesn't exist in Python. Use the obvious Python classes the obvious way and you'll have much less code. The reduction in code volume is the big win. Often, the management reason for converting C++ to Python is to get rid of the C++ complexity.
Python code will be much simpler, making it much more reliable and much easier to maintain.
While it's generally true that Python is slower than C++, it's also true that picking the right algorithm and data structure can have dramatic improvements on performance. In one benchmark, someone found that Python was actually faster than C because the C program had such a poorly chosen data structure.
It's possible that your C++ has a really poor algorithm and you will see comparable performance from Python.
It's also possible that your C++ program is I/O bound, or has other limitations that will leave the Python running at a comparable speed.

The design of Python is quite intentionally "you can use just a few data structures (arrays and hash tables) for whatever you want to do, and if that isn't fast enough there's always C".
Python's standard library doesn't have a sorted-list data structure like std::set. You can download a red/black tree implementation or roll your own. (For small data sets, just using a list and periodically sorting it is a totally normal thing to do in Python.)
Rolling your own linked list is very easy.
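For what it's worth, a singly linked list can be sketched in a couple of dozen lines. The class and method names here (Node, LinkedList, push_front, insert_after) are my own choices, loosely echoing std::forward_list, not an existing Python API:

```python
class Node:
    """One cell of a singly linked list (hypothetical helper class)."""
    __slots__ = ("value", "next")

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        """O(1) insertion at the head; returns the new node."""
        self.head = Node(value, self.head)
        return self.head

    def insert_after(self, node, value):
        """O(1) insertion after a node you already hold."""
        node.next = Node(value, node.next)
        return node.next

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next


items = LinkedList()
first = items.push_front(1)
items.insert_after(first, 3)
items.insert_after(first, 2)
```

Holding on to a Node plays the role of an STL list iterator: it stays valid while other nodes are inserted around it.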

Related

How can I implement Python sets in another language (maybe C++)?

I want to translate some Python code that I have already written to C++ or another fast language because Python isn't quite fast enough to do what I want to do. However the code in question abuses some of the impressive features of Python sets, specifically the average O(1) membership testing which I spam within performance critical loops, and I am unsure of how to implement Python sets in another language.
In Python's Time Complexity Wiki Page, it states that sets have O(1) membership testing on average and in worst-case O(n). I tested this personally using timeit and was astonished by how blazingly fast Python sets do membership testing, even with large N. I looked at this Stack Overflow answer to see how C++ sets compare when using find operations to see if an element is a member of a given set and it said that it is O(log(n)).
I hypothesize the time complexity for find is logarithmic in that C++ std library sets are implemented with some sort of binary tree. I think that because Python sets have average O(1) membership testing and worst case O(n), they are probably implemented with some sort of associative array with buckets which can just look up an element with ease and test it for some dummy value which indicates that the element is not part of the set.
The thing is, I don't want to slow down any part of my code by switching to another language (since that is the problem I'm trying to fix in the first place), so how could I implement my own version of Python sets (specifically just the fast membership testing) in another language? Does anybody know anything about how Python sets are implemented, and if not, could anyone give me any general hints to point me in the right direction?
I'm not looking for source code, just general ideas and links that will help me get started.
I have done a bit of research on Associative Arrays and I think I understand the basic idea behind their implementation but I'm unsure of their memory usage. If Python sets are indeed just really associative arrays, how can I implement them with a minimal use of memory?
Additional note: The sets in question that I want to use will have up to 50,000 elements and each element of the set will be in a large range (say [-999999999, 999999999]).
The theoretical difference between O(1) and O(log n) means very little in practice, especially when comparing two different languages. log n is small for most practical values of n. Constant factors of each implementation are easily more significant.
C++11 has unordered_set and unordered_map now. Even if you cannot use C++11, there are always the Boost version and the tr1 version (std::tr1::unordered_*); many compilers also shipped an older, non-standard hash_* extension.
Several points: you have, as has been pointed out, std::set and std::unordered_set (the latter only in C++11, but most compilers have offered something similar as an extension for many years now). The first is implemented by some sort of balanced tree (usually a red-black tree), the second as a hash table. Which one is faster depends on the data type: the first requires some sort of ordering relationship (e.g. < if it is defined on the type, but you can define your own); the second an equivalence relationship (==, for example) and a hash function compatible with this equivalence relationship. The first is O(lg n), the second O(1), if you have a good hash function. Thus:
If comparison for order is significantly faster than hashing, std::set may actually be faster, at least for "smaller" data sets, where "smaller" depends on how large the difference is; for strings, for example, the comparison will often resolve after the first couple of characters, whereas the hash code will look at every character. In one experiment I did (many years back), with strings of 30-50 characters, I found the break even point to be about 100000 elements.
For some data types, simply finding a good hash function which is compatible with the type may be difficult. Python uses a hash table for its set, and if you define a type with a function __hash__ that always returns 1, it will be very, very slow. Writing a good hash function isn't always obvious.
Finally, both are node based containers, which means they use a lot more memory than e.g. std::vector, with very poor locality. If lookup is the predominant operation, you might want to consider std::vector, keeping it sorted and using std::lower_bound for the lookup. Depending on the type, this can result in a significant speed-up, and much less memory use.
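The point about a __hash__ that always returns 1 can be seen without a stopwatch by counting equality comparisons. This is only a sketch: Key and comparisons_to_build are names I made up, and the exact counts depend on the interpreter, so none are quoted here.

```python
class Key:
    """Hypothetical key type that counts equality comparisons."""
    comparisons = 0

    def __init__(self, value, constant_hash):
        self.value = value
        self.constant_hash = constant_hash

    def __hash__(self):
        # Degenerate case: every key lands in the same probe sequence.
        return 1 if self.constant_hash else hash(self.value)

    def __eq__(self, other):
        Key.comparisons += 1
        return self.value == other.value


def comparisons_to_build(n, constant_hash):
    """Build a set of n distinct keys and report how many == calls it took."""
    Key.comparisons = 0
    built = {Key(i, constant_hash) for i in range(n)}
    assert len(built) == n
    return Key.comparisons


bad = comparisons_to_build(200, constant_hash=True)
good = comparisons_to_build(200, constant_hash=False)
```

With the constant hash every insertion has to be compared against the keys already stored, so the work grows roughly quadratically; with a well-spread hash almost no comparisons are needed.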

Coming from C++ to AS3 : what are fundamental AS3 data structures classes?

We are porting our game from C++ to the web; the game makes extensive use of the STL.
Can you provide a short comparison chart (and, if possible, a few code samples for basic operations like insertion/deletion/searching and, where applicable, equal_range/binary_search) for the classes that are equivalent to the following STL containers:
std::vector
std::set
std::map
std::list
stdext::hash_map
?
Thanks a lot for your time!
UPD:
Wow, it seems we do not have everything we need here :(
Can anyone point to some industry standard algorithms library for AS3 programs (like boost in C++)?
I can not believe people can write non-trivial software without balanced binary search trees (std::set std::map)!
The choices of data structures are significantly more limited in as3. You have:
Array or Vector.<*> which stores a list of values and can be added to after construction
Dictionary (hash_map) which stores key/value pairs
Maps and sets aren't really supported, as there's no way to override object equality. As for binary search, most search operations take a predicate function that lets you override equality for that search.
Edit: As far as common algorithm and utility libraries, I'd take a look at as3commons
Maybe this library will fit your needs.

When should the STL algorithms be used instead of using your own?

I frequently use the STL containers but have never used the STL algorithms that are to be used with the STL containers.
One benefit of using the STL algorithms is that they provide a method for removing loops so that code logic complexity is reduced. There are other benefits that I won't list here.
I have never seen C++ code that uses the STL algorithms. From sample code within web page articles to open source projects, I haven't seen their use.
Are they used more frequently than it seems?
Short answer: Always.
Long answer: Always. That's what they are there for. They're optimized for use with STL containers, and they're faster, clearer, and more idiomatic than anything you can write yourself. The only situation you should consider rolling your own is if you can articulate a very specific, mission-critical need that the STL algorithms don't satisfy.
Edited to add: (Okay, so not really really always, but if you have to ask whether you should use STL, the answer is "yes".)
You've gotten a number of answers already, but I can't really agree with any of them. A few come fairly close to the mark, but fail to mention the crucial point (IMO, of course).
At least to me, the crucial point is quite simple: you should use the standard algorithms when they help clarify the code you're writing.
It's really that simple. In some cases, what you're doing would require an arcane invocation using std::bind1st and std::mem_fun_ref (or something on that order) that's extremely dense and opaque, where a for loop would be almost trivially simple and straightforward. In such a case, go ahead and use the for loop.
If there is no standard algorithm that does what you want, take some care and look again -- you'll often have missed something that really will do what you want (one place that's often missed: the algorithms in <numeric> are often useful for non-numeric uses). Having looked a couple of times, and confirmed that there's really not a standard algorithm to do what you want, instead of writing that for loop (or whatever) inline, consider writing a generic algorithm to do what you need done. If you're using it in one place, there's a pretty good chance you can use it in two or three more, at which point it can be a big win in clarity.
Writing generic algorithms isn't all that hard -- in fact, it's often almost no extra work compared to writing a loop inline, so even if you can only use it twice you're already saving a little bit of work, even if you ignore the improvement in the code's readability and clarity.
STL algorithms should be used whenever they fit what you need to do. Which is almost all the time.
When should the STL algorithms be used instead of using your own?
When you value your time and sanity and have more fun things to do than reinventing the wheel again and again.
You need to use your own algorithms when project demands it, and there are no acceptable alternatives to writing stuff yourself, or if you identified STL algorithm as a bottleneck (using profiler, of course), or have some kind of restrictions STL doesn't conform to, or adapting STL for the task will take longer than writing algorithm from scratch (I had to use twisted version of binary search few times...). STL is not perfect and isn't fit for everything, but when you can, you should use it. When someone already did all the work for you, there is frequently no reason to do the same thing again.
I write performance critical applications. These are the kinds of things that need to process millions of pieces of information in as fast a time as possible. I wouldn't be able to do some of the things that I do now if it weren't for STL. Use them always.
There are many good algorithms besides stuff like std::for_each.
There are lots of non-trivial and very useful algorithms:
Sorting: std::sort, std::upper_bound, std::lower_bound, std::binary_search
Min/Max std::max, std::min, std::partition, std::min_element, std::max_element
Search like std::find, std::find_first_of etc.
And many others.
Algorithms like std::transform are much more useful with C++0x lambda expressions or stuff like boost::lambda or boost::bind.
If I had to write something due this afternoon, and I knew how to do it using hand-made loops and would need to figure out how to do it in STL algorithms, I would write it using hand-made loops.
Having said that, I would work to make the STL algorithms a reliable part of my toolkit, for reasons articulated in the other answers.
--
Reasons you might not see it is in code is that it is either legacy code or written by legacy programmers. We had about 20 years of C++ programming before the STL came out, and at that point we had a community of programmers who knew how to do things the old way and had not yet learned the STL way. This will likely remain for a generation.
Bear in mind that the STL algorithms cover a lot of bases, but most C++ developers will probably end up coding something that does something equivalent to std::find(), std::find_if() and std::max() almost every day of their working lives (if they're not using the STL versions already). By using the STL versions you separate the algorithm from both the logical flow of your code and from the data representation.
For other less commonly used STL algorithms such as std::merge() or std::lower_bound() these are hugely useful routines (the first for merging two sorted containers, the second for working out where to insert an item in a container to keep it ordered). If you were to try to implement them yourself then it would probably take a few attempts (the algorithms aren't complicated, but you'd probably get off-by-one errors or the like).
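Since the first thread here is about porting STL code to Python, here is a hedged sketch of hand-written versions of those two routines in Python, checked against the standard-library counterparts bisect.bisect_left and heapq.merge. The function names lower_bound and merge_sorted are mine; the half-open range [lo, hi) is exactly where the off-by-one mistakes tend to creep in.

```python
import bisect
import heapq


def lower_bound(items, target):
    """First index i with items[i] >= target; items must be sorted."""
    lo, hi = 0, len(items)          # search the half-open range [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if items[mid] < target:
            lo = mid + 1            # answer lies strictly right of mid
        else:
            hi = mid                # mid itself may be the answer
    return lo


def merge_sorted(a, b):
    """Merge two sorted lists into a new sorted list (stable)."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    out.extend(a[i:])               # at most one of these is non-empty
    out.extend(b[j:])
    return out


left, right = [1, 4, 9], [2, 3, 10]
merged = merge_sorted(left, right)
slot = lower_bound(merged, 5)
```

In real code the library versions are the ones to reach for; the hand-written ones are here only to show how little room for error there is.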
I myself use them every day of my professional career. Some legacy codebases that predate a stable STL may not use it as extensively, but if there's a newer project that is intentionally avoiding it I would be inclined to think it was by a part-time hacker who was still labouring under the mid-90's assumption that templates are slow and therefore to be avoided.
The only time I don't use STL algorithms is when the cross-platform implementation differences affect the outcome of my program. This has only happened in one or two rare cases (on the PlayStation 3). Although the interface of the STL is standardized across platforms, the implementation is not.
Also, in certain extremely high performance applications (think: video games, video game servers) we replaced some STL structures with our own to eke out a bit more efficiency.
However, the vast majority of the time using STL is the way to go. And in my other (non-video game) jobs, I used the STL exclusively.
The main problem with STL algorithms until now was that, even though the algorithm call itself is clearer, defining the functors that you'd need to pass to them would make your code longer and more complex, due to the way the language forced you to do it. C++0x is expected to change that, with its support for lambda expressions.
I've been using the STL heavily for the past six years, and although I tried to use STL algorithms anywhere I could, in most instances they would make my code more obscure, so I went back to a simple loop. Now with C++0x it is the opposite: the code seems to always look simpler with them.
The problem is that by now C++0x support is still limited to a few compilers, even because the standard is not completely finished yet. So probably we will have to wait a few years to really see widespread use of STL algorithms in production code.
I would not use STL in two cases:
When STL is not designed for your task. STL is nearly the best for general purposes. However, for specific applications STL may not always be the best. For example, in one of my programs I needed a huge hash table, and the STL/tr1 hash map equivalent took too much memory.
When you are learning algorithms. I am one of the few who enjoy reinventing the wheels and learn a lot in this process. For that program, I reimplemented a hash table. It really took me a lot of time, but in the end all the efforts paid off. I have learned many things that greatly benefit my future career as a programmer.
When you think you can code it better than a really clever coder who spent weeks researching and testing and trying to cope with every conceivable set of inputs.
For most Earthlings the answer is never!
I want to answer the "when not to use STL" case, with a clear example.
(as a challenge to all of you, show me that you can solve this with any STL algorithm until C++17)
Convert a vector of integers std::vector<int> into a vector of std::pair<int,int>, i.e.:
Convert
std::vector<int> values = {1,2,3,4,5,6,7,8,9,10};
to
std::vector<std::pair<int,int>> values = { {1,2}, {3,4} , {5,6}, {7,8} ,{9,10} };
And guess what? It's impossible with any STL algorithm until C++17.
See the complete discussion on solving this problem here:
How can I convert std::vector<T> to a vector of pairs std::vector<std::pair<T,T>> using an STL algorithm?
So to answer your question: use an STL algorithm only when it fits your problem perfectly. Don't hack an STL algorithm to make it fit your problem.
Are they used more frequently than it seems?
I've never seen them used, except in books. Maybe they're used in the implementation of the STL itself. Maybe they'll become more widely used as they get easier to use (see for example lambda functions and expressions), or even become obsolete (see for example the range-based for loop), in the next version of C++.

Looking for production quality Hash table/ unordered map implementation to learn?

Looking for good source code either in C or C++ or Python to understand how a hash function is implemented and also how a hash table is implemented using it.
Very good material on how hash fn and hash table implementation works.
Thanks in advance.
Hashtables are central to Python, both as the 'dict' type and for the implementation of classes and namespaces, so the implementation has been refined and optimised over the years. You can see the C source for the dict object here.
Each Python type implements its own hash function - browse the source for the other objects to see their implementations.
If you want to learn, I suggest you look at the Java implementation of java.util.HashMap. It's clear code, well-documented and comparably short. Admittedly, it's neither C, nor C++, nor Python, but you probably don't want to read the GNU libc++'s upcoming implementation of a hashtable, which above all consists of the complexity of the C++ standard template library.
To begin with, you should read the definition of the java.util.Map interface. Then you can jump directly into the details of the java.util.HashMap. And everything that's missing you will find in java.util.AbstractMap.
The implementation of a good hash function is independent of the programming language. The basic task of it is to map an arbitrarily large value set onto a small value set (usually some kind of integer type), so that the resulting values are evenly distributed.
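As one small, concrete instance of that idea, here is a sketch of the 32-bit FNV-1a hash over bytes, followed by the usual reduction to a bucket index. FNV-1a is my choice of example, not something the answer above names, and bucket_index is a made-up helper:

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF   # keep the value in 32 bits
    return h


def bucket_index(key: str, bucket_count: int) -> int:
    """Map an arbitrary string onto one of bucket_count slots
    (hypothetical helper for illustration)."""
    return fnv1a_32(key.encode("utf-8")) % bucket_count
```

The same few lines translate almost mechanically to C or C++, which is the point being made about language independence.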
There is a problem with your question: there are as many types of hash map as there are uses.
There are many strategies to deal with hash collision and reallocation, depending on the constraints you have. You may find an average solution, of course, that will mostly fit, but if I were you I would look at wikipedia (like Dennis suggested) to have an idea of the various implementations subtleties.
As I said, you can mostly think of the strategies in two ways:
Handling Hash Collision: Bucket, which kind ? Open Addressing ? Double Hash ? ...
Reallocation: freeze the map or amortized linear ?
Also, do you want baked in multi-threading support ? Using atomic operations it's possible to get lock-free multithreaded hashmaps as has been proven in Java by Cliff Click (Google Tech Talk)
As you can see, there is no one size fits them all. I would consider learning the principles first, then going down to the implementation details.
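To make two of those choices tangible, here is a toy hash map that picks separate chaining for collisions and a stop-and-rehash-everything step for growth. It is a learning sketch under those assumptions (the class name ChainedHashMap and its methods are invented), not how any production table is written:

```python
class ChainedHashMap:
    """Toy hash map: separate chaining, doubles when load factor exceeds 1."""

    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _bucket(self, key):
        # Python's % yields a non-negative index even for negative hashes.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)      # overwrite in place
                return
        bucket.append((key, value))
        self._size += 1
        if self._size > len(self._buckets):
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        return default

    def _resize(self, capacity):
        # Rehash every entry at once; doubling keeps the cost
        # amortized O(1) per insertion.
        old = self._buckets
        self._buckets = [[] for _ in range(capacity)]
        for bucket in old:
            for key, value in bucket:
                self._bucket(key).append((key, value))

    def __len__(self):
        return self._size
```

Swapping the chaining for open addressing, or the one-shot rehash for an incremental one, changes only _bucket, put, and _resize, which is a decent way to compare the strategies.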
C++ std::unordered_map uses linked-list buckets and the freeze-the-map strategy; as usual with the STL, no concern is given to synchronization.
Python's dict is at the base of the language; I don't know which strategies were chosen for it.

Help me understand how the conflict between immutability and running time is handled in Clojure

Clojure truly piqued my interest, and I started going through a tutorial on it:
http://java.ociweb.com/mark/clojure/article.html
Consider these two lines mentioned under "Set":
(def stooges (hash-set "Moe" "Larry" "Curly")) ; not sorted
(def more-stooges (conj stooges "Shemp")) ; -> #{"Moe" "Larry" "Curly" "Shemp"}
My first thought was that the second operation should take constant time to complete; otherwise functional language might have little benefit over an object-oriented one. One can easily imagine a need to start with [nearly] empty set, and populate it and shrink it as we go along. So, instead of assigning the new result to more-stooges, we could re-assign it to itself.
Now, by the marvelous promise of functional languages, side effects are not to be concerned with. So, sets stooges and more-stooges should not work on top of each other ever. So, either the creation of more-stooges is a linear operation, or they share a common buffer (like Java's StringBuffer) which would seem like a very bad idea and conflict with immutability (subsequently stooges can drop an element one-by-one).
I am probably reinventing a wheel here. It seems like the hash-set would be more performant in Clojure when you start with the maximum number of elements and then remove them one at a time until the set is empty, as opposed to starting with an empty set and growing it one element at a time.
The examples above might not seem terribly practical, or have workarounds, but the object-oriented language like Java/C#/Python/etc. has no problem with either growing or shrinking a set one or few elements at a time while also doing it fast.
A [functional] language which guarantees(or just promises?) immutability would not be able to grow a set as fast. Is there another idiom that one can use which somehow can help avoiding doing that?
For someone familiar with Python, I would mention set comprehension versus an equivalent loop approach. The running time of the two is a tiny bit different, but that has to do with the relative speeds of C, Python, and the interpreter, and is not rooted in complexity. The problem I see is that set comprehension is often a better approach, but NOT ALWAYS the best approach, for the readability might suffer a great deal.
Let me know if the question is not clear.
The core immutable data structures are one of the most fascinating parts of the language for me also. There is a lot to answering this question, and Rich does a really great job of it in this video:
http://blip.tv/file/707974
The core data structures:
are actually fully immutable
the old copies are also immutable
performance does not degrade for the old copies
access is constant (actually bounded <= a constant)
all support efficient appending, concatenating (except lists and seqs) and chopping
How do they do this???
the secret: it's pretty much all trees under the hood (actually a trie).
But what if I really want to edit something in place?
You can use Clojure's transients to edit a structure in place and then produce an immutable version (in constant time) when you are ready to share it.
As a little background: a trie is a tree where all the common elements of the key are hoisted up to the top of the tree. The sets and maps in Clojure use a trie where the indexes are a hash of the key you are looking for. It breaks the hash up into small chunks and uses each chunk as the key to one level of the hash trie. This allows the common parts of both the new and old maps to be shared, and the access time is bounded because there can only be a fixed number of levels, since the hash used as input has a fixed size.
Using these hash tries also helps prevent big slowdowns during the re-balancing used by a lot of other persistent data structures, so you will actually get fairly constant wall-clock access time.
I really recommend the (relatively short) book Purely Functional Data Structures.
In it the author covers a lot of really interesting structures and concepts, like "removing amortization" to allow true constant time access for queues, and things like lazy persistent queues. The author even offers a free copy in PDF here
Clojure's data structures are persistent, which means that they are immutable but use structural sharing to support efficient "modifications". See the section on immutable data structures in the Clojure docs for a more thorough explanation. In particular, it states
Specifically, this means that the new version can't be created using a full copy, since that would require linear time. Inevitably, persistent collections are implemented using linked data structures, so that the new versions can share structure with the prior version.
These posts, as well as some of Rich Hickey's talks, give a good overview of the implementation of persistent data structures.