dawgdic is a great DAWG library, but it has a significant drawback because it is static (not updateable) and has to be constructed form strings sorted in alphabetical order. If the raw data from which the DAWG is constructed is big (several gigabytes), the initial construction of DAWG involving sorting a huge array of strings can demand too much resources.
Is there a library that provides a memory-efficient structure as dawgdic which allows construction from non-sorted dictionary?
Currently, I don't think there's any library that allows construction of DAWGs from a non-sorted dictionaries.
But, after a lot of searching, I've found this paper, "Incremental Construction of Minimal Acyclic Finite-State Automata"
, which I think has exactly what you want. Maybe you could make your own library after reading this, and share it with everyone!
EDIT: Have you looked at this question?
I've found some great libraries that allow online construction from non-sorted data, altough they are not based on DAWG:
cedar - a very fast double-array trie
marisa-trie - а very space efficient string matching library
I currently know of no C++ implementations of a DAWG which support the construction from non-sorted data, but if you're open to creating your own solution which has such a feature, Incremental Construction of Minimal Acyclic Finite-State Automata (2000) is a paper which basically lays out the algorithm behind it.
Alternatively, if you're open to porting solutions from other languages, it may be worth your while to check out MDAG, a Java implementation of the data structure. It supports both on-the-fly string addition and string removal, which is exactly what you are looking for. The code is also easy to follow, and extensively commented, so porting it should be a fairly simple task.
Disclaimer: I am the author of MDAG :) .
I want to use vocabulary trees (which is not necessarily binary) in my program and I already have a general idea on how to create a tree class but I was wondering if there are any c++ libraries that are useful for that purpose. If not I would like to know about the methods I can use to manage my tree faster( add/remove/access nodes), like storing them in consecutive memory locations.
thank you
You can use The Boost Graph Library to model all kinds of trees.
Though std::map and std::set get mentioned in oleskii's link they are binary trees. Any n-ary tree can be rearranged to a binary tree, but that may not help you, since the re-organisation will take time. The boost graph libraries are more general purpose.
A quick google for n-ary trees C++" just turned up treetree
"Treetree is a header-only C++ library that implements a generic
tree-structured container class according to STL conventions"
If you want to make you current tree implementation faster, you should measure where it is currently slow.
Check simple things, e.g. make sure you pass by reference rather than by copy.
For my C++ application, there is a requirement to check if a word is a valid English dictionary word or not. What is the best way to do it. Is there freely available dictionary that I can make use of. I just need a collection of all possible words. How to make this lookup least expensive. Do I need to hash it.
Use either a std::set<std::string> or a std::unordered_set<std::string>. The latter is new in C++0x and may or may not be supported by your C++ Standard Library implementation; if it does not support it, it may include a hash_set of some kind: consult your documentation to find out.
Which of these (set, which uses a binary search tree, and unordered_set, which uses a hashtable) is more efficient depends on the number of elements you are storing in the container and how your Standard Library implementation implements them. Your best bet is to try both and see which performs better for your specific scenario.
Alternatively, if the list of words is fixed, you might consider using a sorted std::vector and using std::binary_search to find words in it.
With regards to the presence of a word list, it depends on the platform.
Under Linux, /usr/share/dict/words contains a list of English words
that might meet your needs. Otherwise, there are doubtlessly such lists
available on the network.
Given the size of such lists, the most rapid access will be to load it
into a hash table. std::unsorted_set, if you have it; otherwise, many
C++ compilers come with a hash_set, although different compilers have
a slightly different interface for it, and put it in different
namespaces. If that still has performance problems, it's possible to do
better if you know the number of entries in advance (so the table never
has to grow), and implement the hash table in an std::vector (or even a
C style array); handling collisions will be a bit more complicated,
however.
Another possibility would be a trie. This will almost certainly result
in the least number of basic operations in the lookup, and is fairly
simple to implement. Typical implementations will have very poor
locality, however, which could make it slower than some of the other
solutions in actual practice (or not—the only way to know is to
implement both and measure).
I actually did this a few months ago, or something close to this. You can probably find one online for free.
Like on this website: http://wordlist.sourceforge.net/
Just put it in a text file, and compare words with what is on the list. It should be order n with n being the number of words in the list. Do you need the time complexity faster?
Hope this helps.
Looking for good source code either in C or C++ or Python to understand how a hash function is implemented and also how a hash table is implemented using it.
Very good material on how hash fn and hash table implementation works.
Thanks in advance.
Hashtables are central to Python, both as the 'dict' type and for the implementation of classes and namespaces, so the implementation has been refined and optimised over the years. You can see the C source for the dict object here.
Each Python type implements its own hash function - browse the source for the other objects to see their implementations.
When you want to learn, I suggest you look at the Java implementation of java.util.HashMap. It's clear code, well-documented and comparably short. Admitted, it's neither C, nor C++, nor Python, but you probably don't want to read the GNU libc++'s upcoming implementation of a hashtable, which above all consists of the complexity of the C++ standard template library.
To begin with, you should read the definition of the java.util.Map interface. Then you can jump directly into the details of the java.util.HashMap. And everything that's missing you will find in java.util.AbstractMap.
The implementation of a good hash function is independent of the programming language. The basic task of it is to map an arbitrarily large value set onto a small value set (usually some kind of integer type), so that the resulting values are evenly distributed.
There is a problem with your question: there are as many types of hash map as there are uses.
There are many strategies to deal with hash collision and reallocation, depending on the constraints you have. You may find an average solution, of course, that will mostly fit, but if I were you I would look at wikipedia (like Dennis suggested) to have an idea of the various implementations subtleties.
As I said, you can mostly think of the strategies in two ways:
Handling Hash Collision: Bucket, which kind ? Open Addressing ? Double Hash ? ...
Reallocation: freeze the map or amortized linear ?
Also, do you want baked in multi-threading support ? Using atomic operations it's possible to get lock-free multithreaded hashmaps as has been proven in Java by Cliff Click (Google Tech Talk)
As you can see, there is no one size fits them all. I would consider learning the principles first, then going down to the implementation details.
C++ std::unordered_map use a linked-list bucket and freeze the map strategies, no concern is given to proper synchronization as usual with the STL.
Python dict is the base of the language, I don't know of the strategies they elected
I have to port a C++ STL application to Python. I am a Python newbie, but have been programming for over a decade. I have a great deal of experience with the STL, and find that it keeps me hooked to using C++. I have been searching for the following items on Google these past several days:
Python STL (in hope of leveraging my years of STL experience)
Python linked lists
Python advanced list usage
Python list optimization
Python ordered sets
And have found posts about the above topic, tutorials on Python lists that are decidedly NOT advanced, or dead ends. I am really surprised at my lack of success, I think I am just burned out from overworking and entering bad search terms!
(MY QUESTION) Can I get a Python STL wrapper, or an interface to Python lists that works like the STL? If not can someone point me to a truly advanced tutorial or paper on managing very large sorted collections of non trivial objects?
P.S. I can easily implement workarounds for one or two uses, but if management wants to port more code, I want to be ready to replace any STL code I find with equivalent Python code immediately. And YES I HAVE MEASURED AND DO NEED TO HAVE TOTALLY OPTIMAL CODE! I CANT JUST DO REDUNDANT SORTS AND SEARCHES!
(ADDENDUM) Thanks for the replies, I have checked out some of the references and am pleased. In response to some of the comments here:
1 - It is being ported to python because managements says so, I would just as soon leave it alone - if it aint broke, why fix it?
2 - Advanced list usage with non trivial objects, what I mean by that is: Many different ways to order and compare objects, not by one cmp method. I want to splice, sort, merge, search, insert, erase, and combine the lists extensively. I want lists of list iterators, I want to avoid copying.
3 - I now know that built in lists are actually arrays, and I should be looking for a different python class. I think this was the root of my confusion.
4 - Of course I am learning to do things in the Python way, but I also have deadlines. The STL code I am porting is working right, I would like to change it as little as possible, because that would introduce bugs.
Thanks to everyone for their input, I really appreciate it.
Python's "lists" are not linked lists -- they're like Java ArrayLists or C++'s std::vectors, i.e., in lower-level terms, a resizable compact array of pointers.
A good "advanced tutorial" on such subjects is Hettinger's Core Python containers: under the hood presentation (the video at the URL is of the presentation at an Italian conference, but it's in English; another, shorter presentation of essentially the same talk is here).
So the performance characteristics of Python lists are essentially those of C++'s std::vector: Python's .append, like C++'s push_back, is O(1), but insertion or removal "in the middle" is O(N). Consequently, keeping a list sorted (as can be easily done with the help of functions in Python's standard library module bisect) is costly (if items arrive and/or depart randomly, each insertion and removal is O(N), just like similarly maintaining order in an std::vector would be. For some purposes, such as priority queues, you may get away with a "heap queue", also easy to maintain with the help of functions in Python's standard library module heapq -- but of course that doesn't afford the same range of uses as a completely sorted list (or vector) would.
So for purposes for which in C++ you'd use a std::set (and rely on its being ordered, i.e., a hashset wouldn't do -- Python's sets are hash-based, not ordered) you may be better off avoiding Python builtin containers in favor of something like this module (if you need to keep things pure-Python), or this one (which offers AVL trees, not RB ones, but is coded as a C-implemented Python extension and so may offer better performance) if C-coded extensions are OK.
If you do end up using your own module (be it pure Python, or C-coded), you may, if you wish, give it an STL-like veneer/interface (with .begin, .end, iterator objects that are advanced by incrementing rather than, as per normal Python behavior, by calling their next methods, ...), although it will never perform as well as "going with the grain" of the language would (the for statement is optimized to use normal Python iterators, i.e., one with next methods, and it will be faster than wrapping a somewhat awkward while around non-Python-standard, STL-like iterators).
To give an STL-like veneer to any Python built-in container, you'll incur substantial wrapping overhead, so the performance hit may be considerable. If you, as you say, "DO NEED TO HAVE TOTALLY OPTIMAL CODE", using such a veneer just for "syntax convenience" purposes would therefore seem to be a very bad choice.
Boost Python, the Python extension package that wraps the powerful C++ Boost library, might perhaps serve your purposes best.
If I were you I would take the time to learn how to properly use the various data structures available in Python instead of looking for things that are similar to what you know from C++.
It's not like you're looking for something fancy, just working with some data structures. In that case I would refer you to Python's documentation on the subject.
Doing this the 'Python' way would help you and more importantly future maintainers who will wonder why you try to program C++ in Python.
Just to whet your appetite, there's also no reason to prefer STL's style to Python (and for the record, I'm also a C++ programmer who knows STL throughly), consider the most trivial example of constructing a list and traversing it:
The Pythonic way:
mylist = [1, 2, 3, 4]
for value in mylist:
# playaround with value
The STL way (I made this up, to resemble STL) in Python:
mylist = [1, 2, 3, 4]
mylistiter = mylist.begin()
while mylistiter != mylist.end():
value = mylistiter.item()
mylistiter.next()
For linked-list-like operations people usually use collections.deque.
What operations do you need to perform fast? Bisection? Insertion?
I would say that your issues go beyond just STL porting. Since the list, dict, and set data structures, which are bolted on to C++ via the STL, are native to core Python, then their usage is incorporated into common Python code idioms. If you want to give Google another shot, try looking for references for "Python for C++ Programmers". One of your hits will be this presentation by Alex Martelli. It's a little dated, from way back in ought-three, but there is a side-by-side comparison of some basic Python code that reads through a text file, and how it would look using STL.
From there, I would recommend that you read up on these Python features:
iterators
generators
list and generator comprehensions
And these builtin functions:
zip
map
Once you are familiar with these, then you will be able to construct your own translation/mapping between STL usage and Python builtin data structures.
As others have said, if you are looking for a "plug-and-chug" formula to convert STL C++ code to Python, you will just end up with bad Python. Such a brute force approach will never result in the power, elegance, and brevity of a single-line list comprehension. (I had this very experience when introducing Python to one of our managers, who was familiar with Java and C++ iterators. When I showed him this code:
numParams = 1000
paramRequests = [ ("EqptEmulator/ProcChamberI/Sensors",
"ChamberIData%d"%(i%250)) for i in range(numParams) ]
record.internalArray = [ParameterRequest(*pr) for pr in paramRequests]
and I explained that these replaced this code (or something like it, this might be a mishmash of C++ and Java APIs, sorry):
std::vector<ParameterRequest> prs = new std::vector<ParameterRequest>();
for (int i = 0; i<1000; ++i) {
string idstr;
strstream sstr(idstr);
sstr << "ChamberIData" << (i%250);
prs.add(new ParameterRequest("EqptEmulator/ProcChamberI/Sensors", idstr));
}
record.internalArray = new ParameterRequest[prs.size];
prs.toArray(record.internalArray);
One of your instincts from working with C++ will be a reluctance to create new lists from old, but rather to update or filter a list in place. We even see this on many forums from Python developers asking about how to modify a list while iterating over it. In Python, you are much better off building a new list from the old with a list comprehension.
allItems = [... some list of items, perhaps from a database query ...]
validItems = [it for it in allItems if it.isValid()]
As opposed to:
validItems = []
for it in allItems:
if it.isValid():
validItems.add(it)
or worse:
# get list of indexes of items to be removed
removeIndexes = []
for i in range(len(allItems)):
if not allItems[i].isValid():
removeIndexes.add(i)
# don't forget to remove items in descending order, or later indexes
# will be invalidated by earlier removals
sort(removeIndexes,reverse=True)
# copy list
validItems = allItems[:]
# now remove the items from allItems
for idx in removeIndexes:
del validItems[i]
Python STL (in hope of leveraging my years of STL experience) - Start with the collections ABC's to learn what Python has. http://docs.python.org/library/collections.html
Python linked lists. Python lists have all the features you would want from a linked list.
Python advanced list usage. What does this mean?
Python list optimization. What does this mean?
Python ordered sets. You have several choices here; you could invent your own "ordered set" as a list that discards duplicates. You can subclass the heapq and add methods that discard duplicates: http://docs.python.org/library/heapq.html.
In many cases, however, the cost of maintaing an ordered set is actually excessive because it must only be ordered once at the end of the algorithm. In other cases, the "ordered set" really is a heapq -- you never needed the set-like features and only needed the ordering.
Non-Trivial.
(I'm guessing at what you meant by "non-trivial"). All Python objects are equivalent. There's no "trivial" vs. "non-trivial" objects. They're all first-class objects and can all have "non-trivial" complexity without any real work. This is not C++ where there are primitive (non-object) values floating around. Everything's an object in Python.
Management Expectations.
For the most part the C++ brain-cramping doesn't exist in Python. Use the obvious Python classes the obvious way and you'll have much less code. The reduction in code volume is the big win. Often, the management reason for converting C++ to Python is to get rid of the C++ complexity.
Python code will be much simpler, making it much more reliable and much easier to maintain.
While it's generally true that Python is slower than C++, it's also true that picking the right algorithm and data structure can have dramatic improvements on performance. In one benchmark, someone found that Python was actually faster than C because the C program had such a poorly chosen data structure.
It's possible that your C++ has a really poor algorithm and you will see comparable performance from Python.
It's also possible that your C++ program is I/O bound, or has other limitations that will leave the Python running at a comparable speed.
The design of Python is quite intentionally "you can use just a few data structures (arrays and hash tables) for whatever you want to do, and if that isn't fast enough there's always C".
Python's standard library doesn't have a sorted-list data structure like std::set. You can download a red/black tree implementation or roll your own. (For small data sets, just using a list and periodically sorting it is a totally normal thing to do in Python.)
Rolling your own linked list is very easy.