Association rule mining - data-mining

I have a dataset with mostly integer values. I want to apply association rule mining on it. I have taken a look at the popular algorithms like Apriori, etc. but all of them work on data which have boolean values, i.e., either the item exists in the transaction or doesn't.
Is there an algorithm which lets us account for values of the attributes in addition to their counts? (I plan to normalize the data to have values between 0 and 1)

You can "hack" around this limitation if your nubers are integer (why normalize to 0 1?) and small:
apple banana apple
becomes
apple banana apple_2
which would allow to find association rules like
banana => apple, apple_2
but you need to mix in some clever filters to not get useless rules like
apple_2 => apple

Item-item collaborative filtering is quite similar to similarity-based data mining techniques like association rule mining. Moreover, collaborative filtering was built to handle continuous and ordinal values, such as star ratings or a Likert scale: this is usually preference information from users.
Content-based filtering is probably your best bet for the situation you describe. It allows for item attributes and weights (that do not change per user for that item), then takes in user preference for each item (that does change per user for that item).
If you want both preference (counts) and attributes to change for each user-item pair, I don't know of an algorithm that handles that. Usually algorithms are built for one input per user-item pair.

Yes. There are some variations of the itemset mining problem that will let you specify additional information. For example, high utility itemset mining algorithms let you specify a quantity for each item occuring in a transaction, as well as a weight for each item.

Related

Vector embeddings to mimic a ranking algorithm

Consider a search system where the user submits a query ‘query’ and retrieves products based on some ranking algorithm. Assume that these products are ordered according to their quality such that p_0, p_1, …, p_10 and so on.
I would like to generate vector embeddings that mimic this ranking algorithm. The closest product vector to a query vector should ideally be p_0, the next one should be p_1 and so on.
I have tried to building word2vec embeddings for products by feeding products that have appeared in the same search session as sentences. Then, I have calculated the weighted average of product vectors to find query vectors to make the query vector closer to the top result. Although the closest result is usually the best result for a given query, the subsequent results include some results that would never appear as a top result.
Is there a trick that the word2vec can learn the ranking algorithm or any other techniques that I can try? I have looked into multi-dimensional vector scaling with non-metric distances but it did not seem scalable to me for more than 100Ks of products.
There's no one trick – just iteratively improving your representations, & training set, & ranking methods to better meet your goals.
Word2vec-based representations can often help, but are still fairly simple & centered on individual words – whose senses may vary based on context & position in ways that a simple weighted-average-of-tokens fails to capture.
You may want to represent 'products' by more than just a string-of-word-tokens – to include other properties, as well. These could be scalar values like prices or various other kinds of ratings/properties, or extra synthetic labels, such as the result of other salient groupings (whether hand-edited or learned).
And even if just working with natural-language product descriptions – like product names, or descriptions, or reviews – there are other more-sophisticated text-representations that can be trained or used – such as sentence/document embeddings using deeper-networks than plain word2vec.
Most generically, if you have a bunch of quantitative representations of candidate results, and a query, and want to use some initial examples of "good" results to bootstrap more generalizable rules for scoring top results, you are attempting a "learning-to-rank" process:
https://en.wikipedia.org/wiki/Learning_to_rank
To suggest more specific steps would require a more specific description of inputs/outputs/goals, & what's been tried, and how what's been tried has failed.
For example, are your queries always just textual product names? In such a case, maybe plain keyword search is the central technology required – with things like word-vector-modelling just a tweak for handling some tough cases, like expanding the results, or adding more contrast to the rankings, when results are too few or to many.
Or, can you detect key gaps in the modeling related to exactly those cases where "results include some results that would [ideally] never appear as a top result"? If certain things like rare (poorly-modeled) words, or important qualities not yet captured in the model, seem to be to blame for such cases, that will guide the potential set of corrective changes.

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example data about users is coming in from different source systems but there is not a common primary key or user identifier
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if customer in source system 1 is the same as source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
Apache beam DAG is by definition acyclic but I could just republish the data through pubsub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously do a self join against all other elements but this seems it's probably an inefficient method
Immutability of a PCollection is a problem if I want to be constantly modifying things as it goes through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on subset of this data with a subset of the rules but swi-prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible in beam to do something similar
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare and update across all columns
Or maybe there's a more standard way to solving these class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do), few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
then the main question of how can you efficiently match a record in general will probably look different under different assumptions and requirements as well. For example I would think about:
can you fit all data in memory;
are your rules dynamic. Do they change at all, what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?

Preserve Order for Cross Validation in Weka

I am using the Weka GUI for classifying sensor data.
I have measures of 10 people, the data is sorted. So the first 10% correspond to participant 1, the second 10% to participant 2 etc.
I would like to use 10 fold cross validation to build a model on 9 participants and test it on the remaining participant. In my case I believe I could accomplish this by simply not randomizing the data splits.
How would I best go about doing this?
I don't know how to do this in the Explorer.
In the KnowledgeFlow GUI, there is a CrossValidationFoldMaker used to create cross-validation folds. This has an option to Preserve instances order, which says it preserves the order of instances rather than randomly shuffling.
There's a video describing the KnowledgeFlow interface here:
https://www.youtube.com/watch?v=sHSgoVX9z-8&t=7s

Sentiment analysis with association rule mining

I am trying to come up with an algorithm to find top-3 most frequently used adjectives for the product in the same sentence. I want to use association rule mining(Apriori algorithm).
For that I am planning of using the twitter data. I can more or less decompose twits in to sentences and then with filtering I can find product names and adjectives with it.
For instance, after filtering I have data like;
ipad mini, great
ipad mini, horrible
samsung galaxy s2, best
...
etc.
Product names and adjectives are previously defined. So I have a set of product names and set of adjectives that I am looking for.
I have read couple of papers about sentimental analysis and rule mining and they all say Apriori algorithm is used. But they don't say how they used it and they don't give details.
Therefore how can I reduce my problem to association rule mining problem?
What values should I use for minsup and minconf?
How can I modify Apriori algorithm to solve this problem?
What I' m thinking is;
I should find frequent adjectives separately for each product. Then by sorting I can get top-3 adjectives. But I do not know if it is correct.
Finding the top-3 most used adjectives for each product is not association rule mining.
For Apriori to yield good results, you must be interested in itemsets of length 4 and more. Apriori pruning starts at length 3, and begins to yield major gains at length 4. At length 2, it is mostly enumerating all pairs. And if you are only interested in pairs (product, adjective), then apriori is doing much more work than necessary.
Instead, use counting. Use hash tables. If you really have Exabytes of data, use approximate counting and heavy hitter algorithms. (But most likely, you don't have exabytes of data after extracting those pairs...)
Don't bother to investigate association rule mining if you only need to solve this much simpler problem.
Association rule mining is really only for finding patterns such as
pasta, tomato, onion -> basil
and more complex rules. The contribution of Apriori is to reduce the number of candidates when going from length n-1 -> n for length n > 2. And it gets more effective when n > 3.
Reducing your problem to Association Rule Mining (ARM)
Create a feature vector having all the topics and adjectives. If a feed contains topic then place 1 for it else 0 in tuple. For eg. Let us assume Topics are Samsung and Apple. And Adjectives are good and horrible. And feed contains Samsung good. Then corresponding tuple for it is :
Samsung Apple good horrible
1 0 1 0
Modification to Apriori Algorithm required
generate Association Rules of type 'topic' --> 'adjective' using constrained apriori algorithm. 'topic' --> 'adjective' is a constraint.
How to set MinSup and MinConf :
Read a paper entitled "Minin top-k association rules". Implement that with k=3 for 3 top adjectives.

How does data mining actually work?

Suppose I want to do some data mining on the database of a supermarket. What does that actually mean?
1) What will the output/results be like?
2) Will the output be different every day or change over time?
3) Before applying data mining, do I need to know what I want or will data mining give everything I want automatically?
Data Mining is a general category of techniques that can be applied to different kinds of datasets, just like programming is a general category of techniques that can be applied using different languages to do different things.
None of your questions make any sense.
A1: Data mining will give us an accurate reports about your queries of database of supermarket.
A2: Sure, because Data mining depend on analyzing during time, in this case it depend on your problems or goals that you want to reach it. if your database was very big also you built data warehouse in right way you will get the different output over time.
A3: yes you should determine what are the problems you have to mine then use tools of Data mining to get the results or indicators automatically.
To answer your first question: For the case of supermarket customer data, I could image the following questions:
how many products X are usually sold on Fridays ?
(helps you to determine how many X you should have in stock)
which customers bought product X often in the last month/year ?
Useful when when you introduce a new X-like product: send advertising material (which has a given cost) only to those customers.
given a customer buys product X (e.g. beer) what's the probability that he/she also buys product Y (e.g. chips) ?
useful for the following: make sure X and Y never are on promotional offer at the same time (X and Y are bought together often). Get the customers into the store by offering a rebate on X knowing they'll also by Y at the same time. Or: put a high price X-like product right next to Y, putting the cheaper X somewhere else.
which neighborhoods have the smallest number of customers ?
helps to find out which neighborhoods you could target with advertising to bring more customers into the store.
Often, by 'asking certain questions to the data' one discovers some features and comes up with new questions.
Data mining is a set of techniques. It refers to discovering interesting and unexpected patterns in data.
If you want to apply some data mining techniques, you need to know which one and you should know why. The answer to questions 1, 2 and 3 depends on the techniques that you choose.
For example, if i want to find associations between items sold in a supermarket, i may use association rule mining. If i want to find groups of similar customers, I might use a clustering algorithm. etc.
There is not just ONE technique in data mining.