Apriori algorithm Anti-monotonic vs monotonic - data-mining

According to Wikipedia, a monotonic function is a function that is either entirely increasing or entirely decreasing. If a function both increases and decreases, then it is not a monotonic function; or is that what is called an anti-monotonic function?
But the data mining book "Data Mining: Concepts and Techniques" describes the anti-monotonic property as: if a set is infrequent, then all of its supersets are also infrequent.
Doesn't this property look the same as monotonic according to Wikipedia? What is the difference between the two?

To begin with a quote:
Mathematics is the art of giving the same name to different things.
Ferdinand Verhulst
Indeed, according to Wikipedia's page on Monotonic functions, the use of "anti" (before "monotone" or "monotonic") for a function in the realm of order theory is different than its use in calculus and analysis.
In order theory, "a monotone function is also called isotone, or order-preserving. The dual notion is often called antitone, anti-monotone, or order-reversing". It only means that the order of the images of the function is inverted.
But generally speaking, we deal with calculus. There, your first definition is the right one: a function "is called monotonic if and only if it is either entirely non-increasing, or entirely non-decreasing." And if a function both increases and decreases, it is simply called non-monotonic.
In data mining, the function in question is the support of an itemset, sup(X) (its frequency in the transaction database), considered on itemsets ordered by inclusion. When "frequent" (i.e., sup(X) > supmin) is our criterion:
"if a set is frequent, then all of its subsets are frequent too", and equivalently, "if a set is infrequent, then all of its supersets are also infrequent."
Taken together, this order-reversing behaviour is what anti-monotonicity means in this context.
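In symbols (a minimal sketch of the same statement, writing sup_min for the frequency threshold):

$$X \subseteq Y \;\Rightarrow\; \operatorname{sup}(X) \ge \operatorname{sup}(Y), \qquad\text{hence}\qquad \operatorname{sup}(X) \le \operatorname{sup}_{\min} \;\Rightarrow\; \operatorname{sup}(Y) \le \operatorname{sup}_{\min} \;\text{ for every } Y \supseteq X.$$

Enlarging an itemset can only shrink its support, so infrequency propagates to every superset.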

Different people use different definitions.
For real valued functions and sets, even the same author might be using different definitions.

When stating that Apriori is anti-monotone, one is referring to the definition of anti-monotonicity under which "a constraint c is anti-monotone if, whenever an itemset S violates c, so does any of its supersets". Apriori pruning is pruning with an anti-monotone constraint.
Another way of looking at it: whenever an itemset violates an anti-monotone constraint, we do not need to mine any of its supersets. Indeed, this is exactly what Apriori pruning does, with the constraint "the itemset is frequent".
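As a minimal sketch of that pruning step (illustrative only, not taken from any particular Apriori implementation; the itemset representation and the name frequent_k are assumptions):

#include <cstddef>
#include <set>
#include <vector>

// Itemsets are modelled as sorted vectors of item ids; frequent_k is assumed to
// hold the frequent itemsets of size k found in the previous level.
using Itemset = std::vector<int>;

// Apriori pruning: a (k+1)-item candidate can only be frequent if every one of
// its k-item subsets is frequent; a single infrequent subset discards it.
bool survives_pruning(const Itemset& candidate, const std::set<Itemset>& frequent_k) {
    for (std::size_t i = 0; i < candidate.size(); ++i) {
        Itemset subset = candidate;
        subset.erase(subset.begin() + i);   // drop the i-th item -> one k-subset
        if (frequent_k.count(subset) == 0)  // that subset is infrequent...
            return false;                   // ...so the candidate cannot be frequent
    }
    return true;
}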

Related

Rules of thumb for function arguments ordering in Clojure

What (if any) are the rules for deciding the order of the parameters of functions in Clojure core?
Functions like map and filter expect a data structure as the last argument.
Functions like assoc and select-keys expect a data structure as the first argument.
Functions like map and filter expect a function as the first argument.
Functions like update-in expect a function as the last argument.
This can cause pain when using the threading macros (I know I can use as->), so what is the reasoning behind these decisions? It would also be nice to know so that my functions can conform as closely as possible to those written by the great man.
Functions that operate on collections (and so take and return data structures, e.g. conj, merge, assoc, get) take the collection first.
Functions that operate on sequences (and therefore take and return an abstraction over data structures, e.g. map, filter) take the sequence last.
Becoming more aware of the distinction [between collection functions and sequence functions] and when those transitions occur is one of the more subtle aspects of learning Clojure.
(Alex Miller, in this mailing list thread)
This is an important part of working intelligently with Clojure's sequence API. Notice, for instance, that collection functions and sequence functions occupy separate sections in the Clojure Cheatsheet. This is not a minor detail; it is central to how the functions are organized and how they should be used.
It may be useful to review this description of the mental model when distinguishing these two kinds of functions:
I am usually very aware in Clojure of when I am working with concrete collections or with sequences. In many cases I find the flow of data starts with collections, then moves into sequences (as a result of applying sequence functions), and then sometimes back to collections when it comes to rest (via into, vec, or set). Transducers have changed this a bit as they allow you to separate the target collection from the transformation and thus it's much easier to stay in collections all the time (if you want to) by applying into with a transducer.
When I am building up or working on collections, typically the code constructing it is "close" and the collection types are known and obvious. Generally sequential data is far more likely to be vectors and conj will suffice.
When I am thinking in "sequences", it's very rare for me to do an operation like "add last" - instead I am thinking in whole collection terms.
If I do need to do something like that, then I would probably convert back to collections (via into or vec) and use conj again.
Clojure's FAQ has a few good rules of thumb and visualization techniques for getting an intuition of collection/first-arg versus sequence/last-arg.
Rather than have this be a link-only answer, I'll paste a quote of Rich Hickey's response to the Usenet question "Argument order rules of thumb":
One way to think about sequences is that they are read from the left, and fed from the right:
<- [1 2 3 4]
Most of the sequence functions consume and produce sequences. So one way to visualize that is as a chain:
map<- filter<-[1 2 3 4]
and one way to think about many of the seq functions is that they are parameterized in some way:
(map f)<-(filter pred)<-[1 2 3 4]
So, sequence functions take their source(s) last, and any other parameters before them, and partial allows for direct parameterization as above. There is a tradition of this in functional languages and Lisps.
Note that this is not the same as taking the primary operand last. Some sequence functions have more than one source (concat, interleave). When sequence functions are variadic, it is usually in their sources.
I don't think variable arg lists should be a criterion for where the primary operand goes. Yes, they must come last, but as the evolution of assoc/dissoc shows, sometimes variable args are added later.
Ditto partial. Every library eventually ends up with a more order-independent partial binding method. For Clojure, it's #().
What then is the general rule?
Primary collection operands come first. That way one can write -> and its ilk, and their position is independent of whether or not they have variable arity parameters. There is a tradition of this in OO languages and CL (CL's slot-value, aref, elt - in fact the one that trips me up most often in CL is gethash, which is inconsistent with those).
So, in the end there are 2 rules, but it's not a free-for-all. Sequence functions take their sources last and collection functions take their primary operand (collection) first. Not that there aren't a few kinks here and there that I need to iron out (e.g. set/select).
I hope that helps make it seem less spurious,
Rich
Now, how one distinguishes between a "sequence function" and a "collection function" is not obvious to me. Perhaps others can explain this.

ELKI: Running LOF with varying k

Can I run LOF with varying k through ELKI so that it is easy to compare which k is the best?
Normally you choose a k, and then you can see the ROCAUC, for example. I want to find the best k for the data set, so I need to compare multiple runs. Can I do that in some easier way than manually changing the value of k and re-running? For example, I want to compare all k = 1..100.
Thanks
The greedy ensemble example shows how to run outlier detection methods for a whole range of k at once efficiently (by computing the nearest neighbors only once, it will be a lot faster!), using the ComputeKNNOutlierScores application included with ELKI.
The application EvaluatePrecomputedOutlierScores can be used to bulk-evaluate these results with multiple measures.
This is what we used for the publication
G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent and M. E. Houle
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study
Data Mining and Knowledge Discovery 30(4): 891-927, 2016, DOI: 10.1007/s10618-015-0444-8
On the supplementary material website, you can look up the best results for many standard data sets, as well as download the raw results.
But beware that outlier detection quality results tend to be inconclusive. On one data set, one method performs best, on another data set another method. There is no clear winner, because data sets are very diverse.

C++ comp(a,a)==false

I was using a lambda function in sort(). In my lambda function I return true if the two elements are equal. Then I got a segmentation fault.
After reviewing the C++ Compare requirements, it says
For all a, comp(a,a) == false
I don't understand why it must be false. Why can't I let comp(a,a)==true?
(Thanks in advance)
Think of Comp as some sort of "is smaller than" relationship; that is, it defines some kind of ordering on a set of data.
Now you probably want to do some stuff with this relationship, like sorting data in increasing order, binary search in sorted data, etc.
There are many algorithms that do stuff like this very fast, but they usually have the requirement that the ordering they deal with is "reasonable", which was formalized with the term Strict weak ordering. It is defined by the rules in the link you gave, and the first one basically means:
"No element shall be smaller than itself."
This is indeed reasonable to assume, and one of the things our algorithms require.
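As a minimal sketch of the difference (the broken comparator is left commented out, because running it is undefined behaviour and may crash exactly as you observed):

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v = {3, 1, 2, 2, 5};

    // Correct: a strict "less than"; comp(a, a) is false for every a.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });

    // Broken: "<=" returns true for equal elements, so comp(a, a) == true.
    // That violates the strict weak ordering requirement; std::sort may then
    // read past the ends of the range and crash (undefined behaviour).
    // std::sort(v.begin(), v.end(), [](int a, int b) { return a <= b; });
}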

Is there a constexpr ordering of types in C++?

Does C++ provide an ordering of the set of all types as a constant expression? It doesn't matter which particular order, any one will do. This could be in the form of a constexpr comparison function:
template <typename T1, typename T2>
constexpr bool TypeLesser ();
My use for this is for a compile time self-balancing binary search tree of types, as a replacement of (cons/nil) type lists, to speed up the compilation. For example, checking whether a type is contained in such a tree may be faster than checking if it is contained in a type list.
I will also accept compiler-specific intrinsics if standard C++ does not provide such a feature.
Note that if the only way to get an ordering is to define it manually by adding boilerplate all over the code base (which includes a lot of templates and anonymous structs), I would rather stay with type lists.
The standard’s only ordering is via type_info (provided by the typeid expression), which you can use more easily via std::type_index – the latter provides ordinary comparison functionality so that it can be used in collections.
I guess its ancestry is the class Andrei Alexandrescu had in “Modern C++ Design”.
It's not compile time.
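A small illustration of that runtime-only ordering (a minimal sketch using only the standard library):

#include <iostream>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>

int main() {
    // std::type_index is ordered via operator<, so it works as a map key,
    // but only at run time; it cannot appear in a constant expression.
    std::map<std::type_index, std::string> names;
    names.emplace(typeid(int), "int");
    names.emplace(typeid(double), "double");
    std::cout << names.size() << " types registered\n";
}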
To reduce compilation time you can define traits classes for the types in question, assigning each type some ordinal value. A 128-bit UUID would do nicely as a type id, to avoid the practical issue of guaranteeing unique ids. This of course assumes that you or the client code controls the set of possible types.
The idea of having to "register" relevant types has been used before, in early Boost machinery for determining function result types.
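A sketch of what such a manually registered ordering could look like, matching the TypeLesser signature from the question (the TypeOrdinal trait and its ordinal values are hypothetical):

// Each relevant type is registered with an ordinal; unregistered types fail to compile.
template <typename T> struct TypeOrdinal;   // primary template intentionally left undefined

template <> struct TypeOrdinal<int>    { static constexpr unsigned value = 1; };
template <> struct TypeOrdinal<double> { static constexpr unsigned value = 2; };

template <typename T1, typename T2>
constexpr bool TypeLesser() { return TypeOrdinal<T1>::value < TypeOrdinal<T2>::value; }

static_assert(TypeLesser<int, double>(), "int is ordered before double in this registry");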
I must anyway recommend seriously measuring compilation performance. The balancing operations that are fast at run time, involving only adjustment of a few pointers, may be slow at compile time, involving creating a huge descriptor of a whole new type. So even though checking for type-set membership may be faster, building the type set may be very much slower, e.g. O(n^2).
Disclaimer: I haven't tried that.
But anyway, I remember again that Andrei Alexandrescu discussed something of the sort in the already mentioned “Modern C++ Design”, or if you don't have access to that book, look in the Loki library (which is a library of things from that book).
You have two main problems: 1) you have no specific comparison criterion (hence the question, isn't it?), and 2) you don't have any standard way to sort at compile time.
For the first, use std::type_info as others suggested (it's currently used in maps via the std::type_index wrapper) or define your own metafunction to specify the ordering criterion for different types. For the second, you could try to write your own template-metaprogramming-based quicksort algorithm. That's what I did for my personal metaprogramming library and it works perfectly.
About the assumption "a self-balancing search tree should perform better than classic typelists": I really encourage you to do some profiling (try templight) before claiming that. Compile-time performance has nothing to do with classic runtime performance; it depends heavily on the exact implementation of the template instantiation system the compiler has.
For example, based on my own experience I'm pretty sure that my simple "O(n)" linear search could perform better than your self-balancing tree. Why? Memoization. Compile-time performance is not only instantiation depth. In fact memoization plays a crucial role here.
To give you a real example, consider the implementation of quicksort (pseudo meta-code):
List sort( List l )
{
    Int pivot = l[l.length/2];
    Tuple(List,List) lists = reorder( l , pivot , l.length/2 );
    return concat( sort( lists.left ) , sort( lists.right ) );
}
I hope that example is self-explanatory. Note the functional way it works: there are no side effects. I will be glad if some day metaprogramming in C++ has that syntax...
That's the recursive case of quicksort. Since we are using typelists (variadic typelists in my case), the first meta-instruction, which computes the value of the pivot, has O(n) complexity; specifically, it requires a template instantiation depth of N/2. The second step (reordering) can be done in O(n), and concatenation is O(1) (remember that these are C++11 variadic typelists).
Now consider an example of execution:
[1,2,3,4,5]
The first step calls the recursive case, so the trace is:
Int pivot = l[l.length/2]; traverses the list until 3. That means the instantiations needed to perform the traversals [1], [1,2], [1,2,3] are memoized.
During the reordering, more sub-traversals (and combinations of sub-traversals generated by element "swapping") are generated.
Recursive "calls" and concat.
Since the linear traversals performed to reach the middle of the list are memoized, they are instantiated only once over the whole sort execution. When I first encountered this using templight I was completely dumbfounded. Looking at the instantiation graph, only the first large traversals are instantiated; the little ones are just parts of the large ones, and since the large ones were memoized, the little ones are not instantiated again.
So wow, the compiler is able to memoize at least half of those slow linear traversals, right? But what is the cost of such an enormous memoization effort?
What I'm trying to say with this answer is: when doing template metaprogramming, forget everything about runtime performance, optimizations, costs, etc., and don't make assumptions. Measure. You are entering a completely different league. I'm not completely sure which implementation (your self-balancing tree vs. simple linear traversal) is faster, because that depends on the compiler. My example was only meant to show how a compiler can completely break your assumptions.
Side note: The first time I did that profiling I showed it to an algorithms teacher at my university, and he's still trying to figure out what's happening. In fact, he asked a question here about how to measure the complexity and performance of this monster: Best practices for measuring the run-time complexity of a piece of code

Selection of map or unordered_map based on key's type

A commonly asked question is whether we should use unordered_map or map for faster access.
The most common (rather age-old) answer to this question is:
If you want direct access to single elements, use unordered_map, but if you want to iterate over elements (most likely in a sorted way), use map.
Shouldn't we consider the data type of the key while making such a choice?
After all, the hash algorithm for one data type (say int) may be more collision-prone than for another (say string).
If that is the case (the hash algorithm is quite collision-prone), then I would probably use map even for direct access, since in that case the O(1) constant time (probably averaged over a large number of inputs) for unordered_map may be more than lg(N), even for a fairly large value of N.
You raise a good point... but you are focusing on the wrong part.
The problem is not the type of the key, per se, but the hash function that is used to derive a hash value for that key.
Lexicographical ordering is easy: if you tell me you want to order a structure according to its 3 fields (and they already support ordering themselves) then I'll just write:
bool operator<(Struct const& left, Struct const& right) {
    return boost::tie(left._1, left._2, left._3)
         < boost::tie(right._1, right._2, right._3);
}
And I am done!
However, writing a hash function is difficult. You need some knowledge about the distribution of your data (statistics), you might need to prevent specially crafted attacks, etc. Honestly, I do not expect many people to be able to craft a good hash function. But the worst part is, composition is difficult too! Given two independent fields, combining their hash values correctly is hard (hint: boost::hash_combine).
So, indeed, if you have no idea what you are doing and you are dealing with user-crafted data, just stick to a map. It may be slower (not sure), but it's safer.
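For completeness, a sketch of what that boost::hash_combine hint looks like in practice (assuming Boost is available; the Struct mirrors the three-field example above):

#include <boost/functional/hash.hpp>
#include <cstddef>

struct Struct { int _1; int _2; int _3; };

// Boost picks up this free function via ADL, so boost::hash<Struct> works; the
// same combined value can also back an unordered container with a small adapter.
std::size_t hash_value(Struct const& s) {
    std::size_t seed = 0;
    boost::hash_combine(seed, s._1);
    boost::hash_combine(seed, s._2);
    boost::hash_combine(seed, s._3);
    return seed;
}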
There isn't really such a thing as a collision-prone object, because collisions depend on the hash function you use. Assuming the objects are not identical, there is some feature that can be used to create an informative hash function.
Assuming you have some knowledge of your data, and you know it is likely to produce a lot of collisions for some hash function h1(), then you should find and use a different hash function h2() that is better suited to the task.
That said, there are other reasons to favor tree-based data structures over hash-based ones (such as latency and the size of the set); some are covered by my answer in this thread.
There's no point trying to be too clever about this. As always, profile, compare, optimise if useful. There are many factors involved - quite a few of which aren't specified in the Standard and will vary across compilers. Some things may profile better or worse on specific hardware. If you are interested in this stuff (or paid to pretend to be) you should learn about these things a bit more systematically. You might start with learning a bit about actual hash functions and their characteristics. It's extremely rare to be unable to find a hash function that has - for all practical purposes - no more collision proneness than a random but repeatable value - it's just that sometimes it's slower to approach that point than it is to handle a few extra collisions.
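In that spirit, here is a rough profiling sketch (synthetic int keys and one workload only; swap in your own key type and data before drawing any conclusion):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <map>
#include <random>
#include <unordered_map>
#include <vector>

// Times lookups of every key in `keys` against the container `m`; `hits` is an
// output parameter so the work cannot be optimised away.
template <typename Map>
long long time_lookups(const Map& m, const std::vector<int>& keys, std::size_t& hits) {
    auto start = std::chrono::steady_clock::now();
    hits = 0;
    for (int k : keys) hits += m.count(k);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

int main() {
    std::mt19937 rng(42);
    std::vector<int> keys;
    std::map<int, int> ordered;
    std::unordered_map<int, int> hashed;
    for (int i = 0; i < 100000; ++i) {
        int k = static_cast<int>(rng());
        keys.push_back(k);
        ordered[k] = i;
        hashed[k] = i;
    }
    std::size_t hits = 0;
    std::cout << "map:           " << time_lookups(ordered, keys, hits) << " us\n";
    std::cout << "unordered_map: " << time_lookups(hashed, keys, hits) << " us"
              << " (hits: " << hits << ")\n";
}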