How to classify a small and peculiar subset out of a large database? - data-mining

I have to perform a data mining task on a database containing information about insurance policies. Each tuple describes a single policy, along with the agency that issued it, the customer it refers to, and other fields. It is like a join of the hypothetical tables Policies, Customers and Agencies. The fields are the following:
Policy Type,ID Number,Policy Status ,Product Description,Product Combinations,Issue Date,Effective Date,Maturity Date,Policy Duration,Loan Duration ,Cancellation Date ,Reason for cancellation,Total Premium,Splitter Premium,ID Partners,ID Agency,Country Agency,ID Zone,Agency potential,Sex Contractor ,Birth Year Contractor,Job Contractor,Sex Insured,Job Insured,Birth Year Insured,Product Area,Legal Form,ID Claim,Year Claim,Status Claim,Provision Claim,Payments Claim
This is an academic task and our professor wants us to identify churn rates, cross-selling and up-selling. I am not familiar with the field, so I looked those terms up on Wikipedia. I started with churn rate, and it appears to me that in this case I have to characterize the properties of customers whose Policy Status is set to "canceled" and whose Reason for cancellation is "customer cancellation".
With RapidMiner, I tried decision trees and rule mining, but the subset of canceled policies is so small that the resulting model, despite good overall accuracy, is very poor at predicting canceled policies. I also tried the MetaCost operator with a cost matrix in which the cost of misclassifying canceled policies is outrageously high compared to the others (about a million times higher), but this did not change the result at all.
My best option now is to use the sequential covering algorithm for rule mining, but RapidMiner does not implement it and I would have to code it manually.
Do you have any suggestion on how to build a good model for that small subset of canceled policies, so that we could use it to identify customers that would potentially cancel their policy in the future?
N.B.: since it comes from a real source, albeit anonymized, I cannot disclose the database or any data contained within.

Did you try Naive Bayes? It works well with small data sets. You can also try a variant of it such as AODE. AODE is not available in RapidMiner itself; you need to install the Weka extension to access it.

You need to balance your dataset, so that the classes (cancelled / not cancelled) are the same size. This means (temporarily) discarding lots of data.
You can use the Sample operator with the Balance Labels checkbox to do this.
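Outside RapidMiner, the same balancing idea can be sketched in a few lines of plain Python. This is only an illustration of random undersampling, not what the Sample operator does internally; the function name `undersample`, the `status` field and the toy data are all made up for the example.

```python
import random
from collections import defaultdict


def undersample(rows, label_key, seed=0):
    """Return a class-balanced copy of rows by randomly dropping majority-class rows."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # Every class is cut down to the size of the smallest class.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 5 cancelled policies among 1000.
policies = [
    {"id": i, "status": "cancelled" if i < 5 else "active"}
    for i in range(1000)
]
balanced = undersample(policies, "status")
```

The balanced set is meant for training only. The model should still be evaluated on data with the original class proportions, otherwise the measured accuracy will not reflect how rare cancellations really are.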


Redshift Query Performance to reduce CPU utilisation

I want a general idea of how to optimise query performance in a Redshift database. I have huge queries with lots of joins. I understand that sort and distribution keys can help, but is there a method we can follow to get optimal results?
What should I look for in a table, and how should I approach query optimisation in Redshift?
What steps are necessary to arrive at a plan for optimisation?
Any guidance will help a lot.
Having improved many queries on Redshift, I can point you towards a few things. First let me list a few tools / techniques to make sure you have these in your toolbox.
Ability to read an EXPLAIN plan and find the expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk IO reports
So with those understood, let's look at where many queries go sideways on Redshift. I will try to list these in Pareto order, but any of these, or combinations of them, can create significant issues.
#1 - Fat-in-the-middle queries. A join can expand the number of rows being operated upon many fold. Cross joining is an obvious way this can happen, but it isn't how it usually happens. If the join ON conditions create a many-to-many join pattern, the number of rows expands. When the tables are very large, this "multiplication" can produce absurd data sizes. The EXPLAIN plan can show this but not always: DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big it is. You may also need to look at pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and not well set up for the creation of data, and this includes creation during the execution of a query.
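To make the "multiplication" concrete, here is a small Python sketch (not Redshift code) of how many rows an inner equi-join produces: for every join key, the number of matching rows on the left times the number on the right. The function name and the toy tables are invented for the illustration.

```python
from collections import Counter


def joined_row_count(left_keys, right_keys):
    """Rows produced by an inner equi-join: per key, left count times right count."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    # Counter returns 0 for keys missing on the right, so unmatched keys add nothing.
    return sum(n * right[key] for key, n in left.items())


# Hypothetical tables: 3 customers, each with 1000 orders and 1000 page views,
# joined on customer id only (a many-to-many pattern).
orders = [c for c in ("a", "b", "c") for _ in range(1000)]
views = [c for c in ("a", "b", "c") for _ in range(1000)]

rows = joined_row_count(orders, views)  # 3 * 1000 * 1000 = 3,000,000
```

Two inputs of 3,000 rows each turn into 3,000,000 intermediate rows. A later DISTINCT or GROUP BY may shrink the result again, but the cluster still had to build the fat middle.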
#2 - Distribution of large amounts of data. Redshift is a cluster: the nodes are connected by ethernet cables, and these connections are the slowest part of the cluster. The query optimizer does a lot of work to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the types of joins you are getting: is distribution needed? How much data is being distributed? Also, GROUP BY (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like the ones you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time, and the order in which they happen can matter. The query optimizer makes a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes are based on what it sees in the table metadata, so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just too many interactions to itemize here. The best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed during the query's execution. This may mean changing the join trees, or even decomposing the query into several queries with temp tables whose distribution is set so that data movement is minimized.
#3 - Excessive IO traffic. While not as slow as the network, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Is the metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (A redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that reduce the data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk, it can bring the disk IO performance down considerably. Use your metadata and WHERE clauses well.
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly, the factors I've listed are cluster-wide "resources" of Redshift. If one query takes up a lot of one of these, there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad), and when their pattern is costly on one axis, many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables that show aggregated IO or network traffic, which you can use to see these effects.
Good queries:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible; the trick is to not oversubscribe the cluster resources you have.

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example data about users is coming in from different source systems but there is not a common primary key or user identifier
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if customer in source system 1 is the same as source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
An Apache Beam DAG is by definition acyclic, but I could republish the data through Pub/Sub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously does a self join against all other elements, but this is probably an inefficient method.
Immutability of a PCollection is a problem if I want to be constantly modifying things as they go through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on subset of this data with a subset of the rules but swi-prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible in beam to do something similar
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare and update across all columns
Or maybe there's a more standard way to solve this class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
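The loop above can be sketched in plain Python (not Beam code). Everything here is a stand-in: the records are assumed to be already normalized to common field names, `block_key` is a hypothetical cheap elimination step, the single rule replaces the roughly 100 real ones, and the sketch does not re-apply rules after a master record changes.

```python
def block_key(record):
    """Cheap elimination key (hypothetical): the lower-cased last name."""
    return record["last_name"].lower()


def same_first_initial(a, b):
    """Stand-in matching rule: first names start with the same letter."""
    return a["first_name"][:1].lower() == b["first_name"][:1].lower()


RULES = [same_first_initial]  # placeholder for the real rule set


def merge(master, record):
    """Update the master: keep the longer first name, remember every source id."""
    if len(record["first_name"].rstrip(".")) > len(master["first_name"].rstrip(".")):
        master["first_name"] = record["first_name"]
    master["source_ids"].extend(record["source_ids"])


def unify(records):
    masters = {}  # block key -> list of master records seen so far
    for record in records:
        candidates = masters.setdefault(block_key(record), [])
        for master in candidates:
            if all(rule(master, record) for rule in RULES):
                merge(master, record)
                break
        else:
            # No match found: save a copy of the record for future matching.
            candidates.append({**record, "source_ids": list(record["source_ids"])})
    return [m for group in masters.values() for m in group]


records = [
    {"source_ids": ["sys1:123"], "first_name": "John", "last_name": "Smith"},
    {"source_ids": ["sys2:EFCBA-09DA0"], "first_name": "J.", "last_name": "Smith"},
    {"source_ids": ["sys1:456"], "first_name": "Ann", "last_name": "Jones"},
]
unified = unify(records)
```

Only records that share a block key are compared with the full rule set, which is what keeps this from being an all-against-all join. Whether such a key exists for your data is exactly one of the things to find out.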
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably approach this by first understanding it in more detail (maybe you already do). A few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
then the main question of how can you efficiently match a record in general will probably look different under different assumptions and requirements as well. For example I would think about:
can you fit all data in memory;
are your rules dynamic. Do they change at all, what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?

AWS Machine Learning Data

I'm using the AWS Machine Learning regression to predict the waiting time in a line of a restaurant, in a specific weekday/time.
Today I have around 800k records.
Example Data:
restaurantID (rowID) | weekDay (categorical) | time (categorical) | tablePeople (numeric) | waitingTime (numeric - target)
1 | sun | 21:29 | 2 | 23
2 | fri | 20:13 | 4 | 43
...
I have two questions:
1)
Should I use time as Categorical or Numeric?
Is it better to split it into two fields: minutes and seconds?
2)
I would like in the same model to get the predictions for all my restaurants.
Example:
I expected to send the rowID identifier and it returns different predictions, based on each restaurant data (ignoring others data).
I tried, but it's returning the same prediction for any rowID. Why?
Should I have a model for each restaurant?
There are several problems with the way you set up your model.
1) Time in the form you have it should never be categorical. Your model treats the times 12:29 and 12:30 as two completely independent attributes, so it will never use what it learned about 12:29 to predict what is going to happen at 12:30. In your case you should make time numeric. I am not sure whether Amazon ML can convert it for you automatically; if not, just multiply the hour by 60 and add the minutes. Another interesting thing to do is to bucketize your time into half-hour or wider intervals. You do that by dividing (h*60+m) by a number that depends on how wide you want the buckets to be and dropping the remainder; for example, divide by 120 to get 2-hour intervals. Generally, the more data you have, the smaller the intervals can be. The key is to have a lot of samples in each bucket.
2) You should really think about removing restaurantID from your input data. Having it there will cause the model to over-fit on it, so it will not be able to make predictions about the restaurant with id:5 based on what it learned from the restaurants with id:3 or id:9. Keeping the restaurant id might be okay if you have a lot of data about each restaurant and you don't care about extrapolating your predictions to restaurants that are not in the training set.
3) You never send restaurantID to predict data about it. The way it usually works is that you pick what you are trying to predict. In your case 'waitingTime' is probably the most useful attribute. So you send weekDay, time and number of people, and the model outputs the waiting time.
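The time conversion from point 1 is simple enough to do before uploading the data. A minimal Python sketch, assuming the time column is an "HH:MM" string as in the example rows (the function names are made up):

```python
def minutes_since_midnight(hhmm):
    """Convert an 'HH:MM' string to a single number: hour * 60 + minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def time_bucket(hhmm, bucket_minutes=120):
    """Index of the interval the time falls into (default: 2-hour buckets)."""
    return minutes_since_midnight(hhmm) // bucket_minutes


numeric_time = minutes_since_midnight("21:29")  # 21 * 60 + 29 = 1289
two_hour_bucket = time_bucket("21:29")          # 1289 // 120 = 10
half_hour_bucket = time_bucket("20:13", 30)     # 1213 // 30 = 40
```

The numeric value can be used as a numeric feature directly, while the bucket index can be used as a categorical feature with far fewer distinct values than the raw time.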
You should think what is relevant for the prediction to be accurate, and you should use your domain expertise to define the features/attributes you need to have in your data.
For example, time of day is not just a number. From my limited understanding of restaurants, I would drop the minutes and focus only on the hours.
I would certainly create a model for each restaurant, as the popularity of the restaurant or the type of food it serves has an impact on the wait time. With Amazon ML it is easy to create many models, as you can build the models using the SDK and even schedule retraining of the models using AWS Lambda (that is, automatically).
I'm not sure what the feature called tablePeople means, but a general recommendation is to have as many relevant features as possible to get better predictions. For example, month or season is probably important as well.
In contrast with some answers to this post, I think restaurantID helps and actually gives valuable information. If you have a significant amount of data for each restaurant, then you can train a model per restaurant and get good accuracy, but if you don't have enough data, then restaurantID is very informative.
1) Just imagine you had only two columns in your dataset: restaurantID and waitingTime. Wouldn't the restaurantID from the testing data help you find a rough waiting time? In the simplest implementation, your predicted waiting time for each restaurantID would be the average of its waitingTime values. So restaurantID is definitely valuable information. Now that you have more features in your dataset, you need to check whether restaurantID is as effective as the other features or not.
2) If you decide to keep restaurantID, then you must use it as a categorical string. It should be a non-parametric feature in your dataset, and maybe that's why you did not get a proper result.
On the issue with day and time I agree with the other answers, and considering that you are building your model for restaurants, hourly time may give a more accurate result.
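The "simplest implementation" from point 1, a per-restaurant average, can be sketched in plain Python as a baseline to compare any trained model against. The function name and the toy rows are invented for the illustration.

```python
from collections import defaultdict


def mean_wait_by_restaurant(rows):
    """rows: iterable of (restaurant_id, waiting_time) pairs -> {restaurant_id: mean wait}."""
    totals = defaultdict(lambda: [0.0, 0])  # restaurant_id -> [sum of waits, row count]
    for restaurant_id, waiting_time in rows:
        totals[restaurant_id][0] += waiting_time
        totals[restaurant_id][1] += 1
    return {rid: total / count for rid, (total, count) in totals.items()}


# Hypothetical toy rows.
rows = [(1, 23), (1, 27), (2, 43)]
baseline = mean_wait_by_restaurant(rows)  # {1: 25.0, 2: 43.0}
```

If a model that includes the other features cannot beat this baseline, that suggests restaurantID carries most of the signal.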

How does data mining actually work?

Suppose I want to do some data mining on the database of a supermarket. What does that actually mean?
1) What will the output/results be like?
2) Will the output be different every day or change over time?
3) Before applying data mining, do I need to know what I want or will data mining give everything I want automatically?
Data Mining is a general category of techniques that can be applied to different kinds of datasets, just like programming is a general category of techniques that can be applied using different languages to do different things.
None of your questions make any sense.
A1: Data mining will give you accurate reports that answer your queries about the supermarket's database.
A2: Yes, because data mining depends on analysis over time; how the output changes depends on the problems or goals you want to address. If your database is very large and you have built the data warehouse properly, you will get different output over time.
A3: Yes, you should first determine which problems you want to mine, then use data mining tools to get the results or indicators automatically.
To answer your first question: For the case of supermarket customer data, I could imagine the following questions:
how many products X are usually sold on Fridays ?
(helps you to determine how many X you should have in stock)
which customers bought product X often in the last month/year ?
Useful when you introduce a new X-like product: send advertising material (which has a given cost) only to those customers.
given a customer buys product X (e.g. beer) what's the probability that he/she also buys product Y (e.g. chips) ?
useful for the following: make sure X and Y are never on promotional offer at the same time (X and Y are bought together often). Get the customers into the store by offering a rebate on X knowing they'll also buy Y at the same time. Or: put a high-priced X-like product right next to Y, putting the cheaper X somewhere else.
which neighborhoods have the smallest number of customers ?
helps to find out which neighborhoods you could target with advertising to bring more customers into the store.
Often, by 'asking certain questions to the data' one discovers some features and comes up with new questions.
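As an illustration of the beer-and-chips question above, the probability that a customer who buys X also buys Y can be estimated directly from a list of baskets. This is a minimal Python sketch with made-up baskets; in association rule mining this quantity is called the confidence of the rule X -> Y.

```python
def confidence(baskets, x, y):
    """Estimate P(y in basket | x in basket) from a list of baskets (sets of items)."""
    with_x = [basket for basket in baskets if x in basket]
    if not with_x:
        return None  # x was never bought, so the estimate is undefined
    return sum(1 for basket in with_x if y in basket) / len(with_x)


# Hypothetical purchase data: each set is one customer's basket.
baskets = [
    {"beer", "chips"},
    {"beer", "chips", "bread"},
    {"beer"},
    {"chips"},
    {"bread"},
]
p_chips_given_beer = confidence(baskets, "beer", "chips")  # 2 of 3 beer baskets
```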
Data mining is a set of techniques. It refers to discovering interesting and unexpected patterns in data.
If you want to apply some data mining techniques, you need to know which one and you should know why. The answer to questions 1, 2 and 3 depends on the techniques that you choose.
For example, if I want to find associations between items sold in a supermarket, I may use association rule mining. If I want to find groups of similar customers, I might use a clustering algorithm, and so on.
There is not just ONE technique in data mining.

Collaborative Filtering: Ways to determine implicit scores for products for each user?

Having implemented an algorithm to recommend products with some success, I'm now looking at ways to calculate the initial input data for this algorithm.
My objective is to calculate a score for each product that a user has some sort of history with.
The data I am currently collecting:
User order history
Product pageview history for both anonymous and registered users
All of this data is timestamped.
What I'm looking for
There are a couple of things I'm looking for suggestions on, and ideally this question should be treated more for discussion rather than aiming for a single 'right' answer.
Any additional data I can collect for a user that can directly imply an interest in a product
Algorithms/equations for turning this data into scores for each product
What I'm NOT looking for
Just to avoid this question being derailed with the wrong kind of answers, here is what I'm doing once I have this data for each user:
Generating a number of user clusters (21 at the moment) using the k-means clustering algorithm, with the Pearson coefficient as the distance score
For each user (on demand) calculating a graph of similar users by looking for their most and least similar users within their cluster, and repeating for an arbitrary depth.
Calculating a score for each product based on the preferences of other users within the user's graph
Sorting the scores to return a list of recommendations
Basically, I'm not looking for ideas on what to do once I have the input data (I may need further help with that later, but it's not the point of this question), just for ideas on how to generate this input data in the first place
Here's a haymaker of a response:
time spent looking at a product
semantic interpretation of comments left about the product
make a discussion page about a product, brand, or product category and semantically interpret the comments
if they Shared a product page (email, del.icio.us, etc.)
browser (mobile might make them spend less time on the page vis-à-vis laptop while indicating great interest) and connection speed (affects amt. of time spent on the page)
facebook profile similarity
heatmap data (e.g. à la kissmetrics)
What kind of products are you selling? That might help us answer you better. (Since this is an old question, I am addressing both @Andrew Ingram and anyone else who has the same question and found this thread through search.)
You can allow users to explicitly state their preferences, the way Netflix allows users to assign stars.
You can assign a positive numeric value to all the stuff they bought, since you say you do have their purchase history. Assign zero to stuff they didn't buy.
You could use some sort of weighted value for stuff they bought, adjusted for what's popular. (If nearly everybody bought a product, it doesn't tell you much about a person that they also bought it.) See "term frequency–inverse document frequency".
You could also assign some lesser numeric value for items that users looked at but did not buy.
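The last three suggestions can be combined in a short Python sketch. The weights and the smoothed inverse-popularity formula below are assumptions picked for illustration, in the spirit of TF-IDF, not a standard recipe; timestamps are ignored.

```python
import math
from collections import Counter, defaultdict

VIEW_WEIGHT = 0.2  # assumed: a lesser score for items viewed but not bought


def implicit_scores(purchases, views):
    """purchases, views: iterables of (user, product) pairs -> {user: {product: score}}."""
    purchases = set(purchases)
    views = set(views)
    users = {user for user, _ in purchases | views}
    buyers = Counter(product for _, product in purchases)  # distinct buyers per product

    scores = defaultdict(dict)
    for user, product in views:
        scores[user][product] = VIEW_WEIGHT
    for user, product in purchases:
        # Popular products get a lower weight; the 1 + keeps the score positive.
        scores[user][product] = math.log(1 + len(users) / buyers[product])
    return dict(scores)


# Hypothetical history: everyone bought milk, only u1 bought truffle oil.
purchases = [("u1", "milk"), ("u2", "milk"), ("u3", "milk"), ("u1", "truffle oil")]
views = [("u2", "truffle oil")]
scores = implicit_scores(purchases, views)
```

Here u1's truffle oil purchase scores higher than the milk purchase, and u2's pageview of truffle oil scores lower than any purchase. Products a user never touched are simply absent, which the downstream algorithm can treat as zero.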