Formal criterion for whether an algorithm can be implemented using the MapReduce paradigm

The computational model supported by MapReduce is expressive enough to compute almost any function on your data.
If I understood correctly, there are algorithms which cannot be implemented using the MapReduce paradigm. Is that correct?
Are there any criteria, formal or otherwise, to discern whether an algorithm can be implemented using MapReduce?
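For reference, a minimal in-memory sketch of the contract the paradigm imposes (a hypothetical word-count example in Python, not tied to any particular framework): an algorithm fits MapReduce if it can be phrased as a stateless per-record map followed by a per-key reduce over the grouped values.

```python
from collections import defaultdict

def map_phase(record):
    # Mapper: emit (key, value) pairs for one input record.
    for word in record.split():
        yield word, 1

def reduce_phase(key, values):
    # Reducer: combine all values that were grouped under one key.
    return key, sum(values)

def run_mapreduce(records):
    # Shuffle: group mapper output by key (done by the framework in practice).
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in grouped.items())

print(run_mapreduce(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Anything that cannot be decomposed into such independent per-key aggregations in one pass typically has to be expressed as several chained MapReduce rounds.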

Related

How to evaluate my own text classifier

I have written my own text classifier, based on some linguistic theory. The final output of the classifier is a tuple of an article title and a binary category.
I also used a Naive Bayes (NB) classifier on my gold-standard corpus and evaluated its performance with cross-validation (CV), using the scikit-learn library in Python. However, I am struggling to figure out how to evaluate the performance of my own classifier.
I would really appreciate your ideas, since I am not an experienced machine learning practitioner.
Thanks,
Guzdeh
To evaluate a classifier, the most common metric is accuracy, but there is no rule of thumb that covers all possible scenarios, so I would suggest that you read a bit about evaluation metrics for classifiers. Also read about evaluation methodology.
If you are short on time, stick to accuracy and cross-validation for now, but be sure to understand what a given metric means, what your methodology means, how to read a confusion matrix, the pros and cons of each metric and methodology, and especially their limitations.
Scikit Learn's Reference Page for its metrics: Link
Scikit Learn's User Guide for cross-validation: Link
You stated that you have your gold standard and your model. You then only need to choose a metric and an evaluation methodology.
Your model will predict a class/target given an input (a set of features). The prediction will then be compared to your ground truth/gold standard.
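As a minimal sketch (with dummy labels standing in for your gold standard and for the hypothetical output of your own classifier), computing accuracy and a confusion matrix with scikit-learn only requires the true and predicted labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Gold-standard labels and the predictions of your own classifier on the same
# articles (dummy values here; replace them with your real data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy:", accuracy_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```

If you later wrap your classifier in scikit-learn's estimator interface (fit/predict), the same cross-validation utilities you used for the NB baseline can be applied to it as well.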

SQL structure vs. C++ STL map

I just finished a course on data structures and algorithms (C++) at my school and I am interested in databases in the real world... so specifically SQL.
So my question is: what is the difference between SQL and, for example, the C++ STL std::multimap? Is SQL faster? Or could I make an equally fast (time-complexity-wise) homemade SQL using the C++ STL?
Thanks!
(Sorry, I'm new to programming outside the boundaries of my classes.)
The obvious difference is that SQL is a query language for interacting with a database, while the STL is a library (conventionally, STL is also used to refer to a certain subset of the standard C++ library). As such, these are apples and oranges.
SQL actually entails a suite of standards specifying various parts of a database system. For a database system to be useful, it is desirable that certain characteristics are met (ACID: atomicity, consistency, isolation, durability). Even just looking at these, there is no requirement that they are met by STL containers. I think only consistency would even be desirable for an STL container:
STL container mutations are not required to be atomic: when an exception is thrown from within one of the mutating functions, the container may become unusable, i.e., STL containers are only required to meet the basic exception guarantee.
As mentioned, [successful] mutations do yield a consistent state.
STL containers can't be concurrently mutated and read, i.e., there is no concept of isolation. If you want to access an STL container in a concurrent environment, you need to make sure that there is no other accessor while the container is being mutated (you can have as many concurrent readers as you like while there is no mutator, though).
There is no concept of durability for STL containers, while durability may be considered the core feature of databases (well, all ACID features can be considered core database features).
Databases certainly use data structures and algorithms internally to provide the ACID features. This is an area where the STL may come in, although primarily with its key strength, i.e., algorithms (which aren't really "algorithms" but rather "solvers for specific problems"): the STL is the foundation of an efficient algorithm library which is applicable to arbitrary data structures (well, that's the objective - I don't think it has been achieved yet). Sadly, important areas of data structures are not appropriately covered, though. In particular, with respect to databases, algorithms on trees, especially B-trees, tend to be important but are not covered by the STL at all.
The STL container std::multimap<...> does contain a tree (typically a red/black tree, but that's not mandated), but it is tied to this particular in-memory representation. There is no way to apply the algorithms used to implement this particular data structure to some suitable persistent representation. Also, a std::multimap<...> still uses just one key (the "multi" refers to allowing multiple elements with the same key, not to having multiple keys), while databases typically require multiple look-up mechanisms (indices), which are utilized when executing queries based on a query plan for each query.
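To illustrate the single-key point (a rough Python analogue, not the C++ container itself), a multimap behaves roughly like a mapping from one key to a list of values; anything other than that one key requires a full scan, which is exactly the gap a database index would fill:

```python
from collections import defaultdict

# A multimap-like structure: one key, possibly several values per key.
articles_by_author = defaultdict(list)
articles_by_author["knuth"].append("The Art of Computer Programming")
articles_by_author["knuth"].append("Literate Programming")

# Fast look-up by the single key...
print(articles_by_author["knuth"])

# ...but a query on anything else (e.g. the title) degenerates to a full scan,
# which is where a database would consult a second index instead.
hits = [a for vs in articles_by_author.values() for a in vs if "Literate" in a]
print(hits)
```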
You asked multiple questions, and the interesting one (in my opinion) is this: "... or could I make an equally fast (time-complexity-wise) homemade SQL using the C++ STL?"
In a perfect world where the STL covers all algorithms, yes, you could create a query evaluator for a database based on the STL algorithms. You could even use some of the STL containers as auxiliary data structures, although the primary data structures in a database are properly represented in persistent storage. To create an actual database you'd also need something which translates a query into a good query plan which can then be executed.
Of course, if all you really need are some look-ups by a key in a data structure which is read at some point by the program, you wouldn't need a full-blown database, and look-ups are probably faster using suitable STL containers.
Note that time complexity tends to be useful for guiding a quick evaluation of different approaches. However, in practice the constant factors tend to matter, and often the algorithm with the inferior time complexity behaves better. The canonical example is quicksort, which outperforms the "superior" algorithms (e.g. heapsort or mergesort) for typical inputs (although in practice introsort is actually used, a hybrid of quicksort, heapsort, and insertion sort which combines the strengths of these algorithms to behave well on all inputs). BTW, to get an illustration of the algorithms you may want to watch the Hungarian Sort Dancers.

Comparison between the MPI standard and the Map-Reduce programming model?

I have learned the basics of various parallel programming standards such as OpenMP, MPI, and OpenCL for writing parallel programs, but I don't have much knowledge about the Map-Reduce programming model.
It is well known that various popular companies follow the Map-Reduce programming model to solve their huge data-intensive tasks, whereas MPI was designed for high-performance computing on both massively parallel machines and workstation clusters.
So my first confusion is:
Can I use the Map-Reduce model instead of the MPI standard, or vice versa? Or does it depend on the application?
What is the exact difference between them?
Which one is better and when?
You can think of Map-Reduce as a subset of MPI's functionality, as it roughly resembles MPI's collective operations with user-defined functions. Thus you can use MPI instead of Map-Reduce, but not vice versa, since in MPI you can describe many more operations. The main advantage of Map-Reduce seems to be its concentration on this single parallel concept, thereby reducing the interface you need to learn in order to use it.
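To make the "collective operations" analogy concrete, here is a minimal sketch using mpi4py (assuming it is installed; run with, e.g., `mpiexec -n 4 python script.py`): each rank processes its own slice of the data (the "map" part) and a collective reduce combines the partial results (the "reduce" part).

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# "Map": each rank processes its own slice of the data.
data = range(1000)
local_chunk = [x for x in data if x % size == rank]
local_sum = sum(local_chunk)

# "Reduce": a collective operation combines the partial results on rank 0.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print("total:", total)  # 499500
```

In MPI you would also have point-to-point sends and receives, scatters, gathers, and so on at your disposal, which is what makes it the more general of the two models.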

Building an Intrusion Detection System using fuzzy logic

I want to develop an Intrusion Detection System (IDS) that might be used with one of the KDD datasets. In the present case, my dataset has 42 attributes and more than 4,000,000 rows of data.
I am trying to build my IDS using fuzzy association rules, hence my question: What is actually considered as the best tool for fuzzy logic in this context?
Fuzzy association rule algorithms are often extensions of normal association rule algorithms like Apriori and FP-growth that model uncertainty using probability ranges. I thus assume that your data consists of quite uncertain measurements and that you therefore want to group the measurements into more general ranges such as 'low'/'medium'/'high'. From there on you can use any normal association rule algorithm to find the rules for your IDS (I'd suggest FP-growth, as it has lower complexity than Apriori for large data sets).
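A minimal, self-contained sketch of that pipeline in plain Python (the attribute names, thresholds, and values below are hypothetical, and the simple pair counting at the end is only a stand-in for a real Apriori/FP-growth implementation):

```python
from collections import Counter
from itertools import combinations

def discretize(name, value, low, high):
    # Map a numeric measurement onto a coarse linguistic range.
    if value < low:
        return f"{name}=low"
    if value > high:
        return f"{name}=high"
    return f"{name}=medium"

# Hypothetical connection records (standing in for two of the 42 KDD attributes).
rows = [
    {"duration": 2,   "src_bytes": 150},
    {"duration": 300, "src_bytes": 9000},
    {"duration": 280, "src_bytes": 8500},
    {"duration": 5,   "src_bytes": 200},
]

transactions = [
    {discretize("duration", r["duration"], 10, 100),
     discretize("src_bytes", r["src_bytes"], 500, 5000)}
    for r in rows
]

# Count item pairs; a real system would run FP-growth over these transactions.
pair_counts = Counter(frozenset(p) for t in transactions
                      for p in combinations(sorted(t), 2))
min_support = 2
for pair, count in pair_counts.items():
    if count >= min_support:
        print(set(pair), count)
```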

Looking for a production-quality hash table / unordered map implementation to learn from?

I am looking for good source code, either in C, C++, or Python, to understand how a hash function is implemented and also how a hash table is implemented using it.
Any very good material on how hash functions and hash table implementations work would also be appreciated.
Thanks in advance.
Hashtables are central to Python, both as the 'dict' type and for the implementation of classes and namespaces, so the implementation has been refined and optimised over the years. You can see the C source for the dict object here.
Each Python type implements its own hash function - browse the source for the other objects to see their implementations.
If you want to learn, I suggest you look at the Java implementation of java.util.HashMap. It's clear code, well documented, and comparatively short. Admittedly, it's neither C, nor C++, nor Python, but you probably don't want to read GNU libstdc++'s upcoming implementation of a hash table, which above all carries the complexity of the C++ standard template library.
To begin with, you should read the definition of the java.util.Map interface. Then you can jump directly into the details of the java.util.HashMap. And everything that's missing you will find in java.util.AbstractMap.
The implementation of a good hash function is independent of the programming language. Its basic task is to map an arbitrarily large value set onto a small value set (usually some kind of integer type) so that the resulting values are evenly distributed.
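As an illustration, here is a minimal sketch of such a mapping, the 32-bit FNV-1a scheme, with the result folded onto a small table (the table size of 16 is an arbitrary choice for the example):

```python
def fnv1a_32(data: bytes) -> int:
    # FNV-1a: mix every byte into the state, keeping the result in 32 bits.
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

# Fold the 32-bit hash onto a small table: bucket index in [0, 16).
for word in ["apple", "banana", "cherry", "date"]:
    print(word, fnv1a_32(word.encode()) % 16)
```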
There is a problem with your question: there are as many types of hash map as there are uses.
There are many strategies for dealing with hash collisions and reallocation, depending on the constraints you have. You may find an average solution, of course, that will mostly fit, but if I were you I would look at Wikipedia (as Dennis suggested) to get an idea of the various implementation subtleties.
As I said, you can mostly think of the strategies in two ways:
Handling hash collisions: buckets (which kind?), open addressing, double hashing, ...?
Reallocation: freeze the map, or amortized linear?
Also, do you want baked-in multi-threading support? Using atomic operations it's possible to get lock-free multithreaded hash maps, as Cliff Click has demonstrated in Java (Google Tech Talk).
As you can see, there is no one-size-fits-all solution. I would consider learning the principles first, then going down to the implementation details.
C++'s std::unordered_map uses linked-list buckets and the freeze-the-map strategy; as usual with the STL, no concern is given to proper synchronization.
Python's dict is at the base of the language; I don't know which strategies were chosen for it.
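If you want to start from the principles, here is a minimal sketch of a separate-chaining hash map in Python (linked-list-style buckets as mentioned above for std::unordered_map, a simple grow-and-rehash policy of my own choosing, and, as with the STL, no synchronization):

```python
class ChainedHashMap:
    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _bucket(self, key):
        # Collision handling: each slot holds a chain of (key, value) pairs.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))
        self._size += 1
        if self._size > 0.75 * len(self._buckets):
            self._rehash()                 # amortized growth

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def _rehash(self):
        # Reallocation: double the table and redistribute all entries.
        old = [pair for bucket in self._buckets for pair in bucket]
        self._buckets = [[] for _ in range(2 * len(self._buckets))]
        for k, v in old:
            self._buckets[hash(k) % len(self._buckets)].append((k, v))

m = ChainedHashMap()
m.put("alice", 1)
m.put("bob", 2)
print(m.get("alice"), m.get("carol", "missing"))
```

Open addressing, double hashing, or a lock-free design would replace the bucket lists and the rehash policy here; the surrounding interface stays the same.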