Suppose I have the data as mentioned below.
11AM user1 Brush
11:05AM user1 Prep Breakfast
11:10AM user1 eat Breakfast
11:15AM user1 Take bath
11:30AM user1 Leave for office
12PM user2 Brush
12:05PM user2 Prep Breakfast
12:10PM user2 eat Breakfast
12:15PM user2 Take bath
12:30PM user2 Leave for office
11AM user3 Take bath
11:05AM user3 Prep Breakfast
11:10AM user3 Brush
11:15AM user3 eat Breakfast
11:30AM user3 Leave for office
12PM user4 Take bath
12:05PM user4 Prep Breakfast
12:10PM user4 Brush
12:15PM user4 eat Breakfast
12:30PM user4 Leave for office
This data tells me about the daily routines of different people. From it, user1 and user2 seem to behave similarly (they perform the activities at different times, but they follow the same sequence). For the same reason, user3 and user4 behave similarly.
Now I have to group such users. In this example, group 1 would contain user1 and user2, and group 2 would contain user3 and user4.
How should I approach this kind of situation? I am learning data mining, and this is an example I came up with as a data mining problem. I believe this data has a pattern in it, but I cannot think of an approach that would reveal it.
Also, I have to apply this approach to my own dataset, which is pretty huge but similar to this :) The data consists of logs recording the occurrence of events at given times, and I want to find groups of users with similar sequences of events.
Any pointers would be appreciated.
It looks like clustering on top of association mining, more precisely the Apriori algorithm. Something like this:
Mine all possible associations between actions, i.e. sequences Brush -> Prep Breakfast, Prep Breakfast -> Eat Breakfast, ..., Brush -> Prep Breakfast -> Eat Breakfast, etc. In other words, every pair, triplet, quadruple, etc. you can find in your data.
Make a separate attribute from each such sequence. For better results, add a boost of 2 for pair attributes, 3 for triplets, and so on.
At this point you have a set of attributes with a corresponding boost for each. You can calculate a feature vector for each user: set 1 * boost at each position if that sequence exists in the user's actions, and 0 otherwise. You will get a vector representation of each user.
On these vectors, use whatever clustering algorithm fits your needs best. Each cluster found is one of the groups you are looking for.
Example:
Let's mark all actions as letters:
a - Brush
b - Prep Breakfast
c - Eat Breakfast
d - Take Bath
...
Your attributes will look like
a1: a->b
a2: a->c
a3: a->d
...
a10: b->a
a11: b->c
a12: b->d
...
a30: a->b->c->d
a31: a->b->d->c
...
User feature vectors in this case will be:
attributes = a1, a2, a3, a4, ..., a10, a11, a12, ..., a30, a31, ...
user1 = 1, 0, 0, 0, ..., 0, 1, 0, ..., 4, 0, ...
user2 = 1, 0, 0, 0, ..., 0, 1, 0, ..., 4, 0, ...
user3 = 0, 0, 0, 0, ..., 0, 0, 0, ..., 0, 0, ...
To compare 2 users, some similarity measure is needed. The simplest one is cosine similarity, which is just the cosine of the angle between the 2 feature vectors. If 2 users have exactly the same sequences of actions, their similarity will equal 1; if they have nothing in common, their similarity will be 0.
With this similarity measure, use a clustering algorithm (say, k-means) to group the users.
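For illustration, here is a minimal sketch of this pipeline in Python, assuming scikit-learn is available; the subsequences helper and the hard-coded action lists are my own reconstruction of the example, not part of any library:

from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# ordered actions per user, reconstructed from the example data
actions = {
    "user1": ["Brush", "Prep Breakfast", "Eat Breakfast", "Take Bath", "Leave for office"],
    "user2": ["Brush", "Prep Breakfast", "Eat Breakfast", "Take Bath", "Leave for office"],
    "user3": ["Take Bath", "Prep Breakfast", "Brush", "Eat Breakfast", "Leave for office"],
    "user4": ["Take Bath", "Prep Breakfast", "Brush", "Eat Breakfast", "Leave for office"],
}

def subsequences(seq, max_len=4):
    """All ordered subsequences (pairs, triplets, ...) of one user's actions."""
    subs = set()
    for n in range(2, max_len + 1):
        subs.update(combinations(seq, n))  # combinations preserve the original order
    return subs

per_user = {u: subsequences(seq) for u, seq in actions.items()}
attributes = sorted(set().union(*per_user.values()))  # the attribute space

# boosted feature vectors: value = length of the subsequence if present, else 0
vectors = [[len(a) if a in per_user[u] else 0 for a in attributes] for u in actions]

print(cosine_similarity(vectors))                   # pairwise user similarity
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(dict(zip(actions, labels)))                   # cluster label per user

On the toy data this puts user1/user2 in one cluster and user3/user4 in the other; on real logs the number of subsequences grows quickly, which is where proper frequent-pattern mining comes in.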
Using an itemset mining algorithm like Apriori, as proposed in the other answer, is not the best solution because Apriori does not consider time or sequential ordering. Thus, it requires an additional pre-processing step to take the ordering into account.
A better solution is to use a sequential pattern mining algorithm like PrefixSpan, SPADE, or CM-SPADE directly. A sequential pattern mining algorithm will directly find subsequences that appear often in a set of sequences.
Then you can still apply clustering on the sequential patterns found!
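As a small illustration (plain Python, not tied to any particular mining library), the main pre-processing you need is to turn the raw log into one ordered sequence of actions per user; those per-user sequences are the input that PrefixSpan/SPADE-style implementations (for instance in the SPMF library) expect:

from collections import defaultdict

# (time, user, action) tuples, as in the question's log; times are zero-padded
# so that sorting the strings also sorts by time
log = [
    ("11:00", "user1", "Brush"),
    ("11:05", "user1", "Prep Breakfast"),
    ("11:10", "user1", "Eat Breakfast"),
    ("12:00", "user2", "Brush"),
    ("12:05", "user2", "Prep Breakfast"),
    ("12:10", "user2", "Eat Breakfast"),
]

sequences = defaultdict(list)
for time, user, action in sorted(log):
    sequences[user].append(action)

# one ordered sequence per user, ready to feed to a sequential pattern miner
for user, seq in sequences.items():
    print(user, seq)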
Related
Let's say I have 5 orders and 3 drivers. I want to maximize the number of miles they have on the road. Each driver has times when they are available to drive, and each order has a time at which it can be picked up.
Ideally, I would like to plan subsequent orders in one go, rather than writing multiple models. My current iteration is to write multiple models whose outputs are taken as inputs by subsequent models. How can I write this as a single LP model?
O = {Order1, Order2, Order3, Order4, Order5}
D = {Driver1, Driver2, Driver3}
O_avail = {2pm, 3pm, 2:30pm, 8pm, 9pm, 12am}
D_avail = {2pm, 3pm, 2:30pm}
Time_to_depot = {7 hours, 5 hours, 2 hours, 5 hours, 3 hours, 4 hours}
Constraints:
d_avail <= o_avail
Objective function:
max sum D_i * time_to_depot_i
I laid it out so that driver 1 takes order 1, order 5, and order 6, and driver 2 takes orders 2 and 4.
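For what it's worth, here is a minimal sketch (my own, not from the question) of how the assignment part could be written as one model with the PuLP package; the availability and depot-time numbers below are placeholders, since the question's lists don't map one-to-one onto the five orders, and it does not yet handle chaining subsequent orders for the same driver:

from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum

orders = ["Order1", "Order2", "Order3", "Order4", "Order5"]
drivers = ["Driver1", "Driver2", "Driver3"]
o_avail = {"Order1": 14.0, "Order2": 15.0, "Order3": 14.5,
           "Order4": 20.0, "Order5": 21.0}            # pickup times (24h clock), placeholders
d_avail = {"Driver1": 14.0, "Driver2": 15.0, "Driver3": 14.5}
time_to_depot = {"Order1": 7, "Order2": 5, "Order3": 2,
                 "Order4": 5, "Order5": 3}            # hours on the road, placeholders

prob = LpProblem("max_road_time", LpMaximize)
x = {(d, o): LpVariable(f"x_{d}_{o}", cat=LpBinary)
     for d in drivers for o in orders}

# objective: maximize total driving time over all assignments
prob += lpSum(time_to_depot[o] * x[d, o] for d in drivers for o in orders)

for o in orders:
    # each order is taken by at most one driver
    prob += lpSum(x[d, o] for d in drivers) <= 1

for d in drivers:
    for o in orders:
        # d_avail <= o_avail: a driver can only take orders that become
        # available after the driver is available
        if d_avail[d] > o_avail[o]:
            prob += x[d, o] == 0

prob.solve()
print([(d, o) for (d, o), var in x.items() if var.value() == 1])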
I have a master sheet with the values I would sell items for. I want to create a formula or set of rules that subtracts commission based on the value of the cell. I want to be able to edit everything from the rules table only, so I don't have to mess around with hundreds of cell formulas when things change. I also don't want to just take commission as a flat percentage. I know how to link the cells. I want a formula that will look in the table, see that the value is between two thresholds, and subtract the corresponding amount of commission. I have attached a picture of an example of the rules table.
I've tried IF statements and ran into "too many arguments" issues.
I expect the formula to look in my table and take out the proper commission beside it.
=ARRAYFORMULA(Main!B2-VLOOKUP(Main!B2,
{REGEXEXTRACT(Comission!$A$3:$A$13, "\d+")*1, Comission!$B$3:$B$13}, 2))
you can do various things like:
=ARRAYFORMULA(IF(A9:A<>"", IF(COUNTIF(A9:A, A9:A)>1,
B9:B-(B9:B*IFERROR(VLOOKUP(B9:B,
{{REGEXEXTRACT(A3, "\d+")*1, -B3% };
{REGEXEXTRACT(A4, "\d+")*1, -B4%};
{REGEXEXTRACT(A5, "\d+")*1, -B5%};
{REGEXEXTRACT(A6, "\d+")*1, -B6%};
{400, 0}}, 2))),
B9:B-(B9:B*IFERROR(VLOOKUP(B9:B,
{{REGEXEXTRACT(C3, "\d+")*1, -D3% };
{REGEXEXTRACT(C4, "\d+")*1, -D4%};
{REGEXEXTRACT(C5, "\d+")*1, -D5%};
{REGEXEXTRACT(C6, "\d+")*1, -D6%};
{400, 0}}, 2)))), ))
assuming Ema is a reseller and Jane & Yuki are one-timers
alternatives: https://webapps.stackexchange.com/q/123729/186471
=ARRAYFORMULA(IF(A2:A<>"", IFERROR(VLOOKUP(A2:A, Main!A2:B, 2, 0))-
IFERROR(VLOOKUP(IFERROR(VLOOKUP(A2:A, Main!A2:B, 2, 0)),
{IFERROR(REGEXEXTRACT(Comission!A3:A, "\d+")*1), Comission!B3:B}, 2)), ))
I have a text file which contains different news articles about terrorist attacks. Each article starts with an HTML tag (<p>Advertisement), and I would like to extract a specific piece of information from each article: the number of people wounded in the attack.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re

text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as a CSV file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestions on how to make it more readable?
I rewrote your loop like this and merged it with the CSV writing since you requested it:
import csv

with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i + 1), nb_casualties]
        writer.writerow(row)
get the index of the article using enumerate
sum the number of victims (in case more than one group matches) using a generator expression to convert the matches to integers and pass them to sum, but only if something matched (the ternary expression checks that)
create the row
print it, or optionally write it as a row (one row per iteration) of a csv.writer object.
I have two Pandas data frames representing an inventory of items. Both data frames have four columns:
df1
id, item, colour, year
1, car, red, 2015
2, truck,, 2016
3, house, blue,
4, car, blue,
5, truck, red, 2015
df2
id, item, colour, year
1, house, blue, 2015
2, truck,, 2015
3, car, blue,
4, house,,
5, car, red, 2015
I know that these inventories are likely to describe the same objects, so I would like to link the two.
For instance,
df1[1] = df2[5] (3 identical variables)
df1[4] = df2[3] (2 identical variables)
df1[3] (house, blue,) is probably the same as df2[1] (house, blue, 2015).
I have 2 main issues: how to do it efficiently, and how to assign a reliability to each link.
I've thought of creating a common field which would be a combination of all the columns [item, colour, year] and merging on this. I would get the first two matches above, but they don't have the same reliability. I wonder if there is an easy way to 'score' this reliability (at the moment I'm thinking of doing two merges, depending on variable availability).
Then I would create another common field with only 2 variables (item, colour) and merge on this. That would give me the link between (house, blue,) and (house, blue, 2015). This would obviously be a weaker link.
Any idea how to do this without merging sequentially? My current plan is to merge on 3 attributes (when they are present), then on 2 attributes (there are 3 combinations) for what is left and has at least 2 attributes, and then on 1 only. I would give each link a reliability score based on the number of attributes used in the merge.
import pandas as pd

# count, for every (df1 row, df2 row) pair, how many cells are equal
df = pd.DataFrame(
    (df1.values[:, None] == df2.values).sum(2),
    df1.index, df2.index)

# keep only pairs that share at least 2 attributes
matches = df.mask(df.lt(2)).stack()

def f(df):
    i, j = df.name
    return pd.concat([df1.loc[i], df2.loc[j]], axis=1, keys=['df1', 'df2']).T

matches.groupby(level=[0, 1]).apply(f).stack().unstack([-2, -1])
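To see what the resulting match counts mean as a reliability score, here is a small self-contained illustration; the frames below are my reconstruction of the question's data, with NaN for missing values so that two blanks never count as a match:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "item":   ["car", "truck", "house", "car", "truck"],
    "colour": ["red", np.nan, "blue", "blue", "red"],
    "year":   [2015, 2016, np.nan, np.nan, 2015],
}, index=[1, 2, 3, 4, 5])

df2 = pd.DataFrame({
    "item":   ["house", "truck", "car", "house", "car"],
    "colour": ["blue", np.nan, "blue", np.nan, "red"],
    "year":   [2015, 2015, np.nan, np.nan, 2015],
}, index=[1, 2, 3, 4, 5])

# number of equal cells for every (df1 row, df2 row) pair
counts = pd.DataFrame((df1.values[:, None] == df2.values).sum(2),
                      index=df1.index, columns=df2.index)

# pairs sharing at least 2 attributes; the count doubles as a reliability score
print(counts.mask(counts.lt(2)).stack().sort_values(ascending=False))
# (1, 5) scores 3; (3, 1) and (4, 3) score 2, i.e. weaker links

The 3-attribute match comes out with the highest score and the 2-attribute matches with a lower one, which is exactly the reliability ordering described in the question.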
I am currently searching for a method in R which lets me match/merge two data frames. Alas, both of these data frames contain non-optimal data: they can have abbreviations or even typos in them. Therefore I would like to define a list of variants for each abbreviation; if the original entries don't match, R should check whether any of the other variants of the abbreviation matches. To illustrate: the name of a company could end with "Limited" but also with "Ltd." or "Ltd", etc.
EXAMPLE
Data
The Original "Address" file contains:
Company name Address
Deloitte Ltd. New York
Coca-Cola New York
Tesla ltd California
Microsoft Limited Washington
This would have to be merged with the "EnterpriseNrList":
Company name EnterpriseNumber
Deloitte Ltd. 221
Coca-Cola 334
Tesla ltd 725
Microsoft Limited 127
So the abbreviations should work in "both directions". That's why I said, if R recognises any of the abbreviations, R should try to match all of them.
All of the matches should be reported as the return.
Therefore I would make up a list "Abbreviations" with every possible abbreviation:
Limited.
limited
Ltd.
ltd.
Ltd
ltd
Questions
1) Would this be a good method, or would there be a more efficient way?
2) How can I check a list against a list of possible abbreviations (step 1, see below), sort of a "contains" check like in Excel?
3) How could I make up a list that, for the entries that do not match, replaces the abbreviation with all the other abbreviations (step 2, see below)?
Thoughts for solution
Step 1
As I am still very new to this kind of work, I was thinking the following: use a regex to check whether a string contains any of the abbreviation options, and create a list which will contain -1 if no match could be found and >0 if a match is found. The entries with no pattern match can already be matched against the "Address" list; with the other entries I continue to step 2.
In this step I don't really know how to check against a list of options ("Abbreviations" list).
Step 2
Next I would create a list with the matches from step 1 and rbind together all options. In this step I don't really know how I could create a list that combines, for example, Coca-Cola with all its possible abbreviations:
Coca-Cola Limited
Coca-Cola Ltd.
Coca-Cola Ltd
etc.
Step 3
Lastly I would match/merge this more complete list of companies again with the original "Data" list. With the introduction of step 2, I thought it might be a bit easier on the required computing power, as the original list is about 8,000 rows.
I would take a different approach and fix the tables before the merge.
To fix the abbreviations, I would use a case-insensitive regex with the final dot optional. I start with a list of 'normal word' = vector of abbreviations:
abbrevs <- list('Limited'=c('Limited','Ltd'),'Incorporated'=c('Incorporated','Inc'))
Then I build the corresponding regexes (alternations with an optional dot at the end; the case will be ignored by a parameter in gsub and agrep later):
regexes <- lapply(abbrevs,function(x) { paste0("(",paste0(x,collapse='|'),")[.]?") })
Which gives:
$Limited
[1] "(Limited|Ltd)[.]?"
$Incorporated
[1] "(Incorporated|Inc)[.]?"
Now we have to apply each regex to the company.name column of each df:
for (i in seq_along(regexes)) {
Address$Company.name <- gsub(regexes[[i]], names(regexes[i]), Address$Company.name, ignore.case=TRUE)
Enterprise$Company.name <- gsub(regexes[[i]], names(regexes[i]), Enterprise$Company.name, ignore.case=TRUE)
}
This does not take typos into account. For that you'll need to work with agrep or adist.
Result for Address example data set:
> Address
Company.name Address
1 Deloitte Limited New York
2 Coca-Cola New York
3 Tesla Limited California
4 Microsoft Limited Washington
Input data used:
Address <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), Address = c("New York", "New York",
"California", "Washington")), .Names = c("Company.name", "Address"
), class = "data.frame", row.names = c(NA, -4L))
Enterprise <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), EnterpriseNumber = c(221L,
334L, 725L, 127L)), .Names = c("Company.name", "EnterpriseNumber"
), class = "data.frame", row.names = c(NA, -4L))
I would say that the answer depends on whether you have a list of abbreviations or not.
If you have one, you could just check which elements of your list contain an abbreviation with the grep or grepl functions (grep returns all the indexes that have a matching pattern, whereas grepl returns a logical vector).
Also, use the ignore.case = TRUE parameter of these functions, so you don't have to try all the capitalized/lowercase possibilities.
If you don't have such a list, my first guess would be to extract the first "word" of each company name (I would guess that there is a single "Deloitte" company, and that it is "Deloitte Ltd"). You can do so with:
unlist(strsplit(CompanyNames,split = " "))
If you wanted to also correct for typos, this is more a question of string distance.
Hope that it helped!