I just finished a course in data structures and algorithms (C++) at my school and I am interested in databases in the real world... so specifically SQL.
So my question is: what is the difference between SQL and, for example, the C++ STL std::multimap? Is SQL faster? Or can I make an equally fast (time-complexity-wise) homemade SQL using the C++ STL?
thanks!
(sorry I'm new to programming outside the boundaries of my classes)
The obvious difference is that SQL is a query language for interacting with a database, while the STL is a library (conventionally, "STL" is also used to refer to a certain subset of the standard C++ library). As such, these are apples and oranges.
SQL actually entails a suite of standards specifying various parts of a database system. For a database system to be useful, it is desirable that certain characteristics are met (ACID: atomicity, consistency, isolation, durability). Even just looking at these, there is no requirement that they are met by STL containers. I think only consistency would even be desirable for STL containers:
STL container mutations are not required to be atomic: when an exception is thrown from within one of the mutating functions, the container may become unusable, i.e., STL containers are only required to meet the basic exception guarantee.
As mentioned, [successful] mutations yield a consistent state.
STL containers can't be concurrently mutated and read, i.e., there is no concept of isolation. If you want to access an STL container in a concurrent environment, you need to make sure that there is no other accessor while the container is being mutated (you can have as many concurrent readers as you like while there is no mutator, though; a minimal locking sketch follows this list).
There is no concept of durability for STL containers, while being durable may be considered the core feature of databases (well, all ACID features can be considered core database features).
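For the isolation point, here is a minimal sketch of the usual remedy: external locking with a readers/writer lock (GuardedMap is a made-up name, not a standard facility):

#include <cstddef>
#include <map>
#include <shared_mutex>
#include <string>

// External locking around an STL container: many concurrent readers,
// but a writer gets exclusive access (the container itself is not thread-safe).
class GuardedMap {
    std::multimap<std::string, int> data_;
    mutable std::shared_mutex mutex_;
public:
    void insert(const std::string& key, int value) {
        std::unique_lock lock(mutex_);   // exclusive: no readers while mutating
        data_.insert({key, value});
    }
    std::size_t count(const std::string& key) const {
        std::shared_lock lock(mutex_);   // shared: many readers may run at once
        return data_.count(key);
    }
};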
Databases certainly use some data structures and algorithms internally to provide the ACID features. This is an area where the STL may come in, although primarily with its key strength, i.e., algorithms, which aren't really "algorithms" but rather "solvers for specific problems": the STL is the foundation of an efficient algorithm library which is applicable to arbitrary data structures (well, that's the objective - I don't think it is achieved yet). Sadly, important areas of data structures are not appropriately covered, though. In particular with respect to databases, algorithms on trees, especially B-trees, tend to be important but are not covered by the STL at all.
The STL container std::multimap<...> does contain a tree (typically a red/black tree, but that's not mandated), but it is tied to this particular in-memory representation. There is no way to apply the algorithms used to implement this particular data structure to some suitable persistent representation. Also, a std::multimap<...> still uses just one key (the multi refers to allowing multiple elements with the same key, not to having multiple keys), while databases typically require multiple look-up mechanisms (indices), which are utilized when executing queries based on a query plan for each query.
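To make the single-key point concrete, here is a minimal example of a std::multimap holding several elements under one key and looking them all up with equal_range:

#include <iostream>
#include <map>
#include <string>

int main() {
    // One key can map to several elements; the key itself is still single.
    std::multimap<std::string, int> ids;
    ids.insert({"smith", 1});
    ids.insert({"smith", 7});
    ids.insert({"jones", 3});

    // equal_range yields the half-open range of all elements with the given key.
    auto [first, last] = ids.equal_range("smith");
    for (auto it = first; it != last; ++it)
        std::cout << it->first << " -> " << it->second << '\n';
}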
You got multiple questions and the interesting one (in my opinion) is this: "... or can I make an equally as fast (time complexity wise) homemade SQL using a c++ STL?"
In a perfect world where STL covers all algorithms, yes, you could create a query evaluator for a database based on the STL algorithms. You could even use some of the STL containers as auxiliary data structures although the primary data structures in a database are properly represented in persistent storage. To create an actual database you'd also need something which translates the query into a good query plan which can then be executed.
Of course, if all you really need are some look-ups by a key in a data structure which is read at some point by the program, you wouldn't need a full-blown database, and look-ups are probably faster using suitable STL containers.
Note that time complexity tends to be useful for guiding a quick evaluation of different approaches. However, in practice the constant factors tend to matter, and often the algorithm with the inferior time complexity behaves better. The canonical example is quicksort, which outperforms the "superior" algorithms (e.g., heapsort or mergesort) for typical inputs (although in practice introsort is actually used, a hybrid of quicksort, heapsort, and insertion sort which combines the strengths of these respective algorithms to behave well on all inputs). BTW, to get an illustration of the algorithms you may want to watch the Hungarian Sort Dancers.
Related
Every time we do cons and destructuring and similar operations on lists, we create copies of the original lists. This can become a performance issue with very large collections.
It's my understanding that to improve this situation, some programming languages implement lists as data structures that can be copied much more efficiently. Is there something like this in SML? Perhaps in the language definition, or as a feature that is implementation dependent?
If there's no such data structure, is using arrays and mutability a pattern that optimizes operations on large lists? As long as the state is local to the function, can the function still be considered pure?
SML is multi-paradigm, but idiomatic SML is also functional-first, so both "lists with efficient copying" and "mutable arrays" approaches should make sense, depending on what the core language offers.
Is there an immutable data structure that is more efficient than the normal singly linked list for very large collections? If not, is there a native purely functional data structure that can optimize this scenario? Or should mutability and arrays be used internally?
Is there something like this in SML? Perhaps in the language definition, or as a feature that is implementation dependent?
You have a couple options here, depending on what you're willing to rely on.
The "initial basis" provided by the definition doesn't provide anything like this (I suppose an implementation could optimise lists by giving them some special compiler treatment and implementing them as copy-on-write arrays or some such, but I'm not aware of an implementation which does).
The widely implemented (SML/NJ, MLton, MoscowML, SML#, Poly/ML) SML Basis Library provides both functional and mutable options. The ones which come to mind are 'a Vector.vector and 'a Array.array (immutable and mutable, resp.). SML/NJ has an extension for vector literals, e.g. as #[1, 2, 3], and I believe MLton supports this on an opt-in basis.
There are some other libraries, such as CMLib, which provide other immutable data structures (e.g., sequences).
@molbdnilo commented above about Okasaki's Purely Functional Data Structures. I've only read his thesis, not the book version (which I believe has additional material, although I don't know to what extent; it seems this has come up on Software Engineering Stack Exchange). I'm not aware of any pre-packaged version of the data structures he presents there.
As long as the state is local to the function, can the function still be considered pure?
This obviously depends somewhat on your definition of what it means for a function to be pure. I've heard "observationally pure" for functions which make use of private state, which seems to be widespread enough that it's used by at least some relevant papers (e.g., https://doi.org/10.1016/j.tcs.2007.02.004)
Is there an immutable data structure that is more efficient than the normal singly linked list for very large collections? Or should mutability and arrays be used internally?
I think the above-mentioned vectors are what you'd want (I imagine it depends on how large "very large" is), but better options may exist depending on the use. For instance, if insertion and deletion are more important than random access, then there are likely better options.
Mutability might also make more sense depending on your workload, e.g., if random access is important, you're doing many updates, and desire good memory locality.
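The structural-sharing idea behind cheap "copies" of immutable lists is language-independent. As a rough illustration (in C++ rather than SML, with made-up names), a persistent cons cell shares its tail instead of copying it:

#include <iostream>
#include <memory>

// A persistent singly linked list: cons is O(1) and never copies
// existing cells; the new list shares its tail with the old one.
struct Cell {
    int head;
    std::shared_ptr<const Cell> tail;
};
using List = std::shared_ptr<const Cell>;

List cons(int head, List tail) {
    return std::make_shared<const Cell>(Cell{head, std::move(tail)});
}

int main() {
    List xs = cons(2, cons(3, nullptr)); // [2, 3]
    List ys = cons(1, xs);               // [1, 2, 3], shares [2, 3] with xs
    for (List p = ys; p; p = p->tail)
        std::cout << p->head << ' ';
}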
What's the need to go for defining and implementing data structures (e.g. stack) ourselves if they are already available in C++ STL?
What are the differences between the two implementations?
First, implementing an existing data structure on your own is a useful exercise. You understand better what it does (so you can understand better what the standard containers do). In particular, you understand better why time complexity is so important.
Then, there is a quality of implementation issue. The standard implementation might not be suitable for you.
Let me give an example. Indeed, std::stack implements a stack. It is a general-purpose implementation. Have you measured sizeof(std::stack<char>)? Have you benchmarked it in the case of a million stacks of 3.2 elements on average with a Poisson distribution?
Perhaps in your case, you happen to know that you have millions of stacks of chars (never NUL), and that 99% of them have fewer than 4 elements. With that additional knowledge, you probably should be able to implement something "better" than what the standard C++ stack provides. So std::stack<char> would work, but given that extra knowledge you'll be able to implement it differently. You still (for readability and maintenance) would use the same method names as in std::stack<char> - so your WeirdSmallStackOfChar would have a push method, etc. If (later during the project) you realize that bigger stacks might be useful (e.g. in 1% of cases), you'll reimplement your stack differently (e.g. if your code base grows to a million lines of C++ and you realize that you quite often have bigger stacks, you might "remove" your WeirdSmallStackOfChar class and add typedef std::stack<char> WeirdSmallStackOfChar; ....)
If you happen to know that all your stacks have fewer than 4 chars and that \0 is not valid in them, representing such "stacks" as a char w[4] field is probably the wisest approach. It is fast and easy to code.
So, if performance and memory space matter, you might perhaps code something as weird as
#include <stack>

class MyWeirdStackOfChars {
    bool small;                     // true: the in-place smallstack is active
    union {
        std::stack<char>* bigstack; // heap-allocated stack for the rare big case
        char smallstack[4];         // in-place storage for up to 4 chars
    };
};
Of course, that is very incomplete. When small is true, your implementation uses smallstack. For the 1% of cases where it is false, your implementation uses bigstack. The rest of MyWeirdStackOfChars is left as an exercise (not that easy) to the reader. Don't forget to follow the rule of five.
Ok, maybe the above example is not convincing. But what about std::map<int,double>? You might have millions of them, and you might know that 99.5% of them have fewer than 5 entries. You obviously could optimize for that case. It is highly probable that representing small maps by an array of pairs of int and double is more efficient both in terms of memory and in terms of CPU time.
Sometimes, you even know that all your maps have fewer than 16 entries (and std::map<int,double> doesn't know that) and that the key is never 0. Then you might represent them differently. In that case, I guess that I am able to implement something much more efficient than what std::map<int,double> provides (probably, because of cache effects, an array of 16 entries with an int and a double is the fastest). A sketch of that idea follows.
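Here is a rough sketch of that idea (SmallIntDoubleMap is hypothetical; it assumes at most 16 entries and that 0 is never a valid key):

// Hypothetical small map: at most 16 entries, key 0 marks an empty slot.
// Linear search over a contiguous array is cache-friendly for tiny maps.
class SmallIntDoubleMap {
    struct Entry { int key; double value; };
    Entry entries_[16] = {};   // zero-initialized: all slots start empty

public:
    double* find(int key) {
        for (auto& e : entries_)
            if (e.key == key) return &e.value;
        return nullptr;
    }
    bool insert(int key, double value) {
        if (double* v = find(key)) { *v = value; return true; }
        for (auto& e : entries_)
            if (e.key == 0) { e = {key, value}; return true; }
        return false;          // full: the rare big case needs a fallback
    }
};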
That is why any developer should know the classical algorithms (and have read some Introduction to Algorithms), even if in many cases he would use existing containers. Also, be aware of the as-if rule.
The STL implementation of data structures is not perfect for every possible use case.
I like the example of hash tables. I have been using STL implementation for a while, but I use it mainly for Competitive Programming contests.
Imagine that you are Google and you have billions of dollars in resources dedicated to storing and accessing hash tables. You would probably like to have the best possible implementation for the company's use cases, since it will save resources and make searches faster in general.
Oh, and I forgot to mention that you also have some of the best engineers on the planet working for you (:
(This video is by Kulukundis, talking about the new hash table made by his team at Google:)
https://www.youtube.com/watch?v=ncHmEUmJZf4
Some other reasons that justify implementing your own version of Data Structures:
Test your understanding of a specific structure.
Customize part of the structure to some peculiar use case.
Seek better performance than STL for a specific data structure.
Hating STL error messages.
Benchmarking STL against some simple implementation.
I used C++ vectors to implement stacks, queues, heaps, priority queues and directed weighted graphs. In the books and references, I have seen big classes for these data structures, all of which can be implemented concisely using vectors. (Maybe there is more flexibility in using pointers.)
Can we also implement even advanced data structures using vectors?
If yes, why do C++ books still explain concepts with the long classes using pointers?
Is it to keep the lower-level idea in mind, because it is more vivid that way, or to equip students with such usage of pointers?
It's true that many data structures can be implemented on top of a vector (an array, for the sake of this answer); essentially all of them can, since every computational task can be implemented to run on a Turing machine, which has far more basic data access capability (or, in the real world, you may say that any program you implement with pointers eventually runs on a CPU with a simple array-like virtual memory space, so you could just call that a huge array). However, it's not always clever. Two main reasons:
performance / time complexity - a vector simply can't provide all basic operations in O(1). There's a solution for fast initialization, but try to randomly insert values into a large vector and see how badly you perform - that's because you have to move all the elements over by one place, over and over. A list could do that in a single operation. Of course other structures have their own performance shortcomings, but that's the beauty of designing complicated data structures with these basic building blocks.
structural complexity - you can think of a list along the same lines as a vector, as an ordered container, and perhaps extend this into multidimensional matrices that can be implemented on top of them since they still retain some basic ordering, but there are more complicated structures. Take, for example, a tree: a simple full binary tree can be implemented with a vector very easily, since the parent-child relations can be easily converted to index arithmetic (see the sketch after the next paragraph), but what if the tree isn't full and has a varying number of children per node? Now, you may say it can still be done (any graph can be implemented with vectors, either through an adjacency matrix or an adjacency list, for example), but there's almost no sense in doing so when you can have a much simpler implementation using pointer links. Just think of doing an AVL roll with an array. :shudder:
Mind you that the second argument may very well boil down to performance ("hey, it's an awkward approach but I still managed to use a vector!"), but it's more than that - it would complicate your code, clutter your data structure design, and could make it far more prone to bugs.
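For the full-binary-tree case mentioned above, here is the index arithmetic (a sketch using 0-based indices, the same layout a binary heap uses):

#include <cstddef>
#include <vector>

// A complete binary tree laid out in a vector: the links are implicit
// in the indices, so no pointers are needed.
std::size_t parent(std::size_t i)      { return (i - 1) / 2; }
std::size_t left_child(std::size_t i)  { return 2 * i + 1; }
std::size_t right_child(std::size_t i) { return 2 * i + 2; }

int main() {
    std::vector<int> tree = {50, 30, 70, 20, 40}; // root at index 0
    int l = tree[left_child(0)];   // 30
    int r = tree[right_child(0)];  // 70
    (void)l; (void)r;              // silence unused-variable warnings
}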
Now, here comes the "but" - even though there's much sense in using all the possible tools the language provides you, it's very widely accepted to use vector-based structures for performance-critical tasks. See almost all scientific CPU benchmarks; most of them ultimately rely on vectors (uncited, but I can elaborate further if anyone is interested; suffice to say that even the well-known Graph500 does that).
The reason is not that it's best programming practice, but that it's better suited to the CPU's internal structure and gets more "juice" out of the HW. That's due to spatial locality - CPUs are very fond of it, as it allows the memory unit to parallelize accesses (in an array you always know where the next element is; in a list you have to wait until the current one is fetched), and also to issue stream/stride prefetches that reduce the latency of future requests.
I can't say this is always good practice; when you run through a graph, the accesses are still pretty irregular even if you use an array implementation, but it's still a very common practice.
To summarize, taking the question literally - most of them can, of sorts (for a given definition of "most", ok?), but if the intention was "why teach pointers", I believe you can see that in order to understand your limits and what you can and should use, you need to know a great deal more than just arrays and even pointers. A good programmer should know a bit about everything - OS design, CPU design, etc. You can't do anything decent unless you really understand the fabric you're running on, and that unfortunately (or not) includes lots of pointers.
You can implement a kind of allocator using an std::vector as the backing store. If you do that, all the standard data structures from elementary computer science can be implemented on top of vectors. It will hardly free you from using pointers, though: vectors are really just chunks of memory with a few useful additional operations, most notably the ability to expand.
More to the point: if you don't understand pointers, you won't understand how to use vector for advanced data structures either. vector is a useful abstraction, but it follows the C++ rule that "you don't get what you don't pay for", so it's also a very "thin" abstraction, and you do pay for the cost of abstraction in terms of the amount of code you have to write.
(Jonathan Wakely points out, in the comments, that you won't get the exact guarantees that the C++ standard library requires of allocators when you implement data structures on top of vector. But in principle, vectors are just a way of handling blocks of memory.)
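As a small illustration of that point, a pointer-style structure can be built on top of a vector by using indices as links (a sketch; -1 plays the role of a null pointer, and freed slots are never reused here):

#include <iostream>
#include <vector>

// A singly linked list whose nodes live in a std::vector.
struct VecList {
    struct Node { int value; int next; };
    std::vector<Node> pool;   // backing store for all nodes
    int head = -1;            // -1 acts as the null "pointer"

    void push_front(int value) {
        pool.push_back({value, head});
        head = static_cast<int>(pool.size()) - 1;
    }
    void print() const {
        for (int i = head; i != -1; i = pool[i].next)
            std::cout << pool[i].value << ' ';
    }
};

int main() {
    VecList list;
    list.push_front(3);
    list.push_front(2);
    list.push_front(1);
    list.print(); // prints: 1 2 3
}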
If you are learning C++, you need to be familiar with pointers and how to use them, even if there are higher-level concepts that do that job for you.
Yes, it is possible to implement most data structures with vectors or lists, and if you have just started learning programming, it's probably a good idea to learn how to write these data structures yourself.
With that being said, production code should always use the standard library unless there is a good reason not to do so.
According to Bjarne Stroustrup's slides from his Going Native 2012 keynote, insertion and deletion in a std::list are terribly inefficient on modern hardware:
Vector beats list massively for insertion and deletion
If this is indeed true, what use cases are left for std::list? Shouldn't it be deprecated then?
Vector and list solve different problems. List provides the guarantee that iterators never become invalidated as you insert and remove other elements. Vector doesn't make that guarantee.
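A minimal illustration of that difference (the vector half is undefined behaviour and is shown only to mark the hazard):

#include <list>
#include <vector>

int main() {
    std::list<int> l = {1, 2, 3};
    auto lit = l.begin();      // points at 1
    l.push_back(4);            // list iterators stay valid across insertion
    int ok = *lit;             // fine: still 1

    std::vector<int> v = {1, 2, 3};
    auto vit = v.begin();
    v.push_back(4);            // may reallocate: vit is now invalidated
    // int bad = *vit;         // undefined behaviour -- don't do this
    (void)ok; (void)vit;
}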
It's not all about performance. So the answer is no: list should not be deprecated.
Edit: Beyond this, C++ isn't designed to work solely on "modern hardware." It is intended to be useful across a much wider range of hardware than that. I'm a programmer in the financial industry and I use C++, but other domains such as embedded devices, programmable controllers, heart-lung machines and myriad others are just as important. The C++ language should not be designed solely with the needs of certain domains and the performance of certain classes of hardware in mind. Just because I might not use a list doesn't mean it should be deprecated from the language.
Whether a vector outperforms a list or not also depends on the type of the elements. For example, for int elements vector is indeed very fast as most of the data fits inside the CPU cache and SIMD instructions can be used for the data copying. So the O(n) complexity of vector doesn't have much impact.
But what about larger data types, where copying doesn't translate to a stream operation, and instead data must be fetched from all over the place? Also, what about hardware that doesn't have large CPU caches and possibly also lacks SIMD instructions? C++ is used on much more than just modern desktop and workstation machines. Deprecating std::list is out of the question.
What Stroustrup is saying in that presentation is that before you pick std::list for your data, you should make sure that it's the right choice for your particular situation. In other words, benchmark and profile. He definitely doesn't say you should always pick std::vector.
No, and especially not based on one particular graph. There are instances where list will perform better than vector. See: http://www.baptiste-wicht.com/2012/12/cpp-benchmark-vector-list-deque/
And that's ignoring the non-performance differences, as others have mentioned.
Bjarne's point in that talk wasn't that you shouldn't use list. It was that people make too many assumptions about list's performance that often turn out to be wrong. He was simply justifying the stance that vector should always be your default go-to container type unless you actually find a need for the performance or other semantic characteristics of lists.
std::list can serve as a deque (double-ended queue): it has push_front() and pop_front(). It still has a niche role as such, though it may not be the best choice for a deque.
std::list does not reallocate memory, while std::vector may. Sometimes you don't want an item to move in memory (e.g. a stackful coroutine).
Linked lists are related to tree data structures; both contain links. If we deprecate std::list, then what about tree-based containers?
Of course not. std::list is a different data structure. Comparing different data structures is a good way to see their properties, advantages and disadvantages. But each data structure has its advantages.
Looking for good source code, either in C, C++ or Python, to understand how a hash function is implemented and how a hash table is implemented using it.
Very good material on how hash functions and hash table implementations work would also be welcome.
Thanks in advance.
Hash tables are central to Python, both as the 'dict' type and for the implementation of classes and namespaces, so the implementation has been refined and optimised over the years. You can see the C source for the dict object here.
Each Python type implements its own hash function - browse the source for the other objects to see their implementations.
If you want to learn, I suggest you look at the Java implementation of java.util.HashMap. It's clear code, well-documented and comparably short. Admittedly, it's neither C, nor C++, nor Python, but you probably don't want to read GNU libstdc++'s upcoming implementation of a hash table, which above all carries the complexity of the C++ standard template library.
To begin with, you should read the definition of the java.util.Map interface. Then you can jump directly into the details of the java.util.HashMap. And everything that's missing you will find in java.util.AbstractMap.
The implementation of a good hash function is independent of the programming language. Its basic task is to map an arbitrarily large value set onto a small value set (usually some kind of integer type), so that the resulting values are evenly distributed.
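As a concrete example of such a mapping, here is the classic FNV-1a string hash, one of many reasonable choices:

#include <cstdint>
#include <string>

// FNV-1a: a simple, well-known 64-bit string hash. Each byte is folded
// into the state with XOR, then the state is multiplied by a large prime
// to spread the bits.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}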
There is a problem with your question: there are as many types of hash map as there are uses.
There are many strategies to deal with hash collisions and reallocation, depending on the constraints you have. You may find an average solution, of course, that will mostly fit, but if I were you I would look at Wikipedia (like Dennis suggested) to get an idea of the various implementation subtleties.
As I said, you can mostly think of the strategies in two ways:
Handling hash collisions: buckets (which kind?), open addressing, double hashing, ...?
Reallocation: freeze the map, or amortized linear?
Also, do you want baked-in multi-threading support? Using atomic operations, it's possible to get lock-free multithreaded hash maps, as has been proven in Java by Cliff Click (Google Tech Talk).
As you can see, there is no one size that fits them all. I would consider learning the principles first, then going down to the implementation details.
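To make the bucket-chaining option concrete, here is a minimal sketch (ChainedMap is a made-up name; it uses a fixed bucket count and no synchronization, whereas a real table would also rehash as it grows):

#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <vector>

// Separate chaining: each bucket is a list of key/value pairs.
class ChainedMap {
    struct Pair { std::string key; int value; };
    std::vector<std::list<Pair>> buckets_{64};

    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

public:
    void put(const std::string& key, int value) {
        for (Pair& p : buckets_[index(key)])
            if (p.key == key) { p.value = value; return; } // update in place
        buckets_[index(key)].push_back({key, value});      // else append
    }
    std::optional<int> get(const std::string& key) const {
        for (const Pair& p : buckets_[index(key)])
            if (p.key == key) return p.value;
        return std::nullopt;
    }
};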
C++ std::unordered_map uses linked-list buckets and the freeze-the-map strategy; as usual with the STL, no concern is given to synchronization.
Python's dict is at the base of the language; I don't know which strategies they elected.