How to prepare a transactional dataset for association rule mining in RapidMiner? - data-mining

I have a dataset like this:
abelia,fl,nc
esculentus,ct,dc,fl,il,ky,la,md,mi,ms,nc,sc,va,pr,vi
abelmoschus moschatus,hi,pr*
dataset link:
My dataset doesn't have any attribute declaration. I want to apply association rules to my dataset. I want it to look like this:
plant fl nc ct dc .....
abelia 1 1 0 0
.....

ELKI contains a parser that can read the input as-is. Maybe RapidMiner does too - or you could write a parser for this format! With the ELKI parameters
-dbc.in /tmp/plants.data
-dbc.parser SimpleTransactionParser -parser.colsep ,
-algorithm itemsetmining.associationrules.AssociationRuleGeneration
-itemsetmining.minsupp 0.10
-associationrules.interestingness Lift
-associationrules.minmeasure 7.0
-resulthandler ResultWriter -out /tmp/rules
we can find all association rules with support >= 10% and lift >= 7.0, and write them to the folder /tmp/rules (there is currently no visualization of association rules in ELKI).
For example, this finds the rules:
sc, va, ga: 3882 --> nc, al: 3529 : 7.065536626573297
va, nj: 4036 --> md, pa: 3528 : 7.206260507764794
So plants that occur in South Carolina, Virginia, and Georgia will also occur in North Carolina and Alabama. NC is not much of a surprise, given that it is in between SC and VA, but Alabama is interesting.
The second rule is that Virginia and New Jersey imply Maryland (in between the two) and Pennsylvania. Also a very plausible rule, supported by 3528 cases.
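Each rule line above reads as antecedent: support --> consequent: support : lift, where the counts appear to be the absolute supports of the antecedent and of the complete itemset, and the final number is the lift. As a quick reference, here is how lift falls out of the support counts (a minimal sketch; the total transaction count N is known to ELKI but not shown in the output above):
# Sketch: lift of a rule A -> B from absolute support counts.
def lift(supp_antecedent, supp_consequent, supp_rule, n):
    p_a = supp_antecedent / n    # relative support of A
    p_b = supp_consequent / n    # relative support of B
    p_ab = supp_rule / n         # relative support of A and B together
    return p_ab / (p_a * p_b)    # > 1 means A and B co-occur more than independence predicts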

I did this with the following Python script:
import csv

abbrs = ['states', 'ab', 'ak', 'ar', 'az', 'ca', 'co', 'ct',
         'de', 'dc', 'of', 'fl', 'ga', 'hi', 'id', 'il', 'in',
         'ia', 'ks', 'ky', 'la', 'me', 'md', 'ma', 'mi', 'mn',
         'ms', 'mo', 'mt', 'ne', 'nv', 'nh', 'nj', 'nm', 'ny',
         'nc', 'nd', 'oh', 'ok', 'or', 'pa', 'pr', 'ri', 'sc',
         'sd', 'tn', 'tx', 'ut', 'vt', 'va', 'vi', 'wa', 'wv',
         'wi', 'wy', 'al', 'bc', 'mb', 'nb', 'lb', 'nf', 'nt',
         'ns', 'nu', 'on', 'qc', 'sk', 'yt']

# "w" (not "a") so re-running the script doesn't duplicate the header row.
with open("plants.data.txt", encoding="ISO-8859-1") as f1, \
     open("plants.data.csv", "w", newline="") as f2:
    csv_f2 = csv.writer(f2, delimiter=',')
    csv_f2.writerow(abbrs)
    csv_f1 = csv.reader(f1)
    for row in csv_f1:
        new_row = [row[0]]          # plant name goes under the 'states' header
        for abbr in abbrs[1:]:      # skip 'states' itself so row width matches the header
            new_row.append(1 if abbr in row else 0)
        csv_f2.writerow(new_row)
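For comparison, the same one-hot encoding can be sketched with pandas (not part of the original answer; it assumes the same plants.data.txt file, one "name,state,state,..." record per line, and derives the column set from the data instead of hard-coding it):
import pandas as pd

rows = []
with open("plants.data.txt", encoding="ISO-8859-1") as f:
    for line in f:
        parts = line.strip().split(",")
        rows.append((parts[0], set(parts[1:])))   # (plant name, set of state codes)

# collect every state code seen in the data, in sorted order
states = sorted(set().union(*(s for _, s in rows)))
df = pd.DataFrame([[name] + [int(st in s) for st in states] for name, s in rows],
                  columns=["plant"] + states)
df.to_csv("plants.data.csv", index=False)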

If all of the values are single words, you can use the text mining extension in RapidMiner to transform them into variables and then run association rule mining methods on them.

Related

Can we create a regular expression that matches every founder in this list?

User #adventured posted this on Hacker News:
Paul Graham (31, Viaweb); Jan Koum (33, WhatsApp); Brian Acton (37, WhatsApp); Ev Williams (34, Twitter); Jack Dorsey (33, Square); Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); Garrett Camp (30, Uber); Travis Kalanick (32, Uber); Brian Chesky (27, Airbnb); Adam Neumann (31, WeWork); Reed Hastings (37, Netflix); Reid Hoffman (36, LinkedIn); Jack Ma (35, Alibaba); Jeff Bezos (30, Amazon); Jerry Sanders (33, AMD); Marc Benioff (35, Salesforce); Ross Perot (32, EDS); Peter Norton (39, Norton); Larry Ellison (33, Oracle); Mitch Kapor (32, Lotus); Leonard Bosack (32, Cisco); Sandy Lerner (29, Cisco); Gordon Moore (39, Intel); Mark Cuban (37, Broadcast.com); Scott Cook (31, Intuit); Nolan Bushnell (29, Atari); Paul Galvin (33, Motorola); Irwin Jacobs (52, Qualcomm); David Duffield (46, PeopleSoft | 64, Workday); Aneel Bhusri (39, Workday); Thomas Siebel (41, Siebel Systems); John McAfee (42, McAfee); Gary Hendrix (32, Symantec); Scott McNealy (28, Sun); Pierre Omidyar (28, eBay); Rich Barton (29, Expedia | 38, Zillow); Jim Clark (38, SGI | 49, Netscape); Charles Wang (32, CA); David Packard (27, HP); Craig Newmark (43, Craigslist); John Warnock (42, Adobe); Robert Noyce (30, Fairchild | 41, Intel); Rod Canion (37, Compaq); Jen-Hsun Huang (30, nVidia); James Goodnight (33, SAS); John Sall (28, SAS); Eli Harari (41, SanDisk); Sanjay Mehrotra (28, SanDisk); Al Shugart (48, Seagate); Finis Conner (34, Seagate); Henry Samueli (37, Broadcom); Henry Nicholas (32, Broadcom); Charles Brewer (36, Mindspring); William Shockley (45, Shockley); Ron Rivest (35, RSA); Adi Shamir (30, RSA); John Walker (32, Autodesk); Halsey Minor (30, CNet); David Filo (28, Yahoo); Jeremy Stoppelman (27, Yelp); Eric Lefkofsky (39, Groupon); Andrew Mason (29, Groupon); Markus Persson (30, Mojang); David Hitz (28, NetApp); Brian Lee (28, Legalzoom); Demis Hassabis (34, DeepMind); Tim Westergren (35, Pandora); Martin Lorentzon (37, Spotify); Ashar Aziz (44, FireEye); Kevin O'Connor (36, DoubleClick); Ben Silbermann (28, Pinterest); Evan Sharp (28, Pinterest); Steve Kirsch (38, Infoseek); Stephen Kaufer (36, TripAdvisor); Michael McNeilly (28, Applied Materials); Eugene McDermott (52, Texas Instruments); Richard Egan (43, EMC); Gary Kildall (32, Digital Research); Hasso Plattner (28, SAP); Robert Glaser (32, Real Networks); Patrick Byrne (37, Overstock.com); Marc Lore (33, Diapers.com); Ed Iacobucci (36, Citrix Systems); Ray Noorda (55, Novell); Tom Leighton (42, Akamai); Daniel Lewin (28, Akamai); Diane Greene (43, VMWare); Mendel Rosenblum (36, VMWare); Michael Mauldin (35, Lycos); Tom Anderson (33, MySpace); Chris DeWolfe (37, MySpace); Mark Pincus (41, Zynga); Caterina Fake (34, Flickr); Stewart Butterfield (31, Flickr | 36, Slack); Kevin Systrom (27, Instagram); Adi Tatarko (37, Houzz); Brian Armstrong (29, Coinbase); Pradeep Sindhu (43, Juniper); Peter Thiel (31, PayPal | 37, Palantir); Jay Walker (42, Priceline.com); Bill Coleman (48, BEA Systems); Evan Goldberg (35, NetSuite); Fred Luddy (48, ServiceNow); Michael Baum (41, Splunk); Nir Zuk (33, Palo Alto Networks); David Sacks (36, Yammer); Jack Smith (28, Hotmail); Sabeer Bhatia (28, Hotmail); Chad Hurley (28, YouTube); Andy Rubin (37, Danger | 41, Android); Rodney Brooks (36, iRobot); Jeff Hawkins (35, Palm); Tom Gosner (39, DocuSign); Niklas Zennström (37, Skype); Janus Friis (27, Skype); George Kurtz (40, CrowdStrike); Trip Hawkins (28, EA); Gabe Newell (33, Valve); David Bohnett (38, Geocities); Bill Gross (40, GoTo.com/Overture); Subrah Iyar (38, WebEx); Eric Yuan 
(41, Zoom); Min Zhu (47, WebEx); Bob Parsons (47, GoDaddy); Wilfred Corrigan (43, LSI); Joe Parkinson (33, Micron); Aart J. de Geus (32, Synopsys); Patrick Byrne (37, Overstock); Matthew Prince (34, Cloudflare); Ben Uretsky (28, DigitalOcean); Tom Preston-Werner (28, GitHub); Louis Borders (48, Webvan); John Moores (36, BMC Software); Vivek Ranadivé (40, Tibco); Pony Ma (27, Tencent); Robin Li (32, Baidu); Liu Qiangdong (29, JD.com); Lei Jun (40, Xiaomi); Ren Zhengfei (38, Huawei); Arkady Volozh (36, Yandex); Hiroshi Mikitani (34, Rakuten); Morris Chang (56, Taiwan Semi); Cheng Wei (29, Didi Chuxing); James Liang (29, Ctrip); Zhang Yiming (29, ByteDance);
I tried to write a regex that would have each "match group" correspond to these founders. I was able to get 136/144 of the entries, but I'm confused about how to capture the founders with the pipe entries (Elon Musk, David Duffield, Rich Barton, Robert Noyce, etc.). Here is an example:
Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal);
I know I can escape the pipes with \| but even wrapping the "paren part" with an * doesn't seem to do it.
Here's the regular expression I created:
([A-Za-zé'.\/\s+-]+{2})\s+\(([0-9]+),\s+([A-Za-z0-9\s+.-\|]+\s?)\);
(I removed the last semi-colon so that I could perform my matches after just running a split(";") on the file contents.)
I created a simple repro which is here: https://github.com/arthurcolle/founders
Here's the code inline, in case you don't want to just go to the very simple repro:
rgx = /([A-Za-zé'.\/\s+-]+{2})\s+\(([0-9]+),\s+([A-Za-z0-9\s+.-\|]+\s?)\)/
FOUNDERS_FILE = "/Users/stochastic-thread/founders/founders.txt"
file = File.read(FOUNDERS_FILE)
items = file.split(";")
items.each { |item|
  matched = rgx.match(item)
  if matched and matched.size == 4
    group = "#{matched[1]},#{matched[2]},#{matched[3]}\n"
    puts group
    File.open("founders.csv", mode: "a") do |f|
      f.write(group)
    end
  end
}
What is the regular expression that would match every "founder-company" group, taking into account the fact that every single founder might have multiple companies, with corresponding ages (in the specific format detailed above in the case of Elon Musk)? (The ö character is Unicode, so I don't think I'm able to actually match on it, because when I put it in the name match group, it said multi-byte characters don't work.)
I know that I can just find entries that don't match the regex, and use a different regex that only matches the parenthesis format, or even just split again on the pipes, but I'm trying to find a "perfect regex" for this.
The question only asks for the founders to be matched, so initially I have not included their enterprises. Later, however, I will discuss a possible way to organize all the information.
Use String#scan with the following regular expression, which I've defined in free-spacing mode to make it self-documenting.
r = /
    (?<=\A|;\s) # match the beginning of the string or a semi-colon
                # followed by a whitespace char in a positive lookbehind
    [\p{L} ]+   # match one or more Unicode letters or spaces
    (?=\s\()    # match a whitespace followed by "(" in a positive lookahead
/x              # free-spacing regex definition mode
str = "Paul Graham (31, Viaweb); Jan Koum (33, WhatsApp); Brian Acton (37, WhatsApp); " +
"Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); Garrett Camp (30, Uber); " +
"Travis Kalanick (32, Uber);"
str.scan(r)
#=> ["Paul Graham", "Jan Koum", "Brian Acton", "Elon Musk", "Garrett Camp",
# "Travis Kalanick"]
This regular expression is conventionally written as follows.
/(?<=\A|; )[\p{L} ]+(?= \()/
If additional information is needed it may be desirable to create a hash such as the following.
r = /
    (?<=\A|;\s) # match the beginning of the string or a semi-colon
                # followed by a whitespace char in a positive lookbehind
    [\p{L} ]+   # match one or more Unicode letters or spaces
    \([^)]+     # match a "(" followed by > 0 characters other than ")"
/x
h = str.scan(r).
    map { |s| s.split(/ \(/) }.
    each_with_object({}) do |(name, startups), h|
      h[name] = startups.split(/ *\| */).map do |s|
        age, co = s.split(/, +/)
        { age: age.to_i, co: co }
      end
    end
#=> {"Paul Graham"    =>[{:age=>31, :co=>"Viaweb"}],
#    "Jan Koum"       =>[{:age=>33, :co=>"WhatsApp"}],
#    "Brian Acton"    =>[{:age=>37, :co=>"WhatsApp"}],
#    "Elon Musk"      =>[{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
#                        {:age=>27, :co=>"PayPal"}],
#    "Garrett Camp"   =>[{:age=>30, :co=>"Uber"}],
#    "Travis Kalanick"=>[{:age=>32, :co=>"Uber"}]}
One could then easily compute, for example,
h.each_with_object(Hash.new { |h,k| h[k] = [] }) do |(name, cos), g|
  cos.each { |co| g[co[:co]] << name }
end
#=> {"Viaweb"=>["Paul Graham"],
# "WhatsApp"=>["Jan Koum", "Brian Acton"],
# "Tesla"=>["Elon Musk"],
# "SpaceX"=>["Elon Musk"],
# "PayPal"=>["Elon Musk"],
# "Uber"=>["Garrett Camp", "Travis Kalanick"]}
The regular expression used here is conventionally written:
/(?<=\A|; )[\p{L} ]+\([^\)]+/
The steps to compute h are as follows.
a = str.scan(r)
#=> ["Paul Graham (31, Viaweb", "Jan Koum (33, WhatsApp", "Brian Acton (37, WhatsApp",
# "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal", "Garrett Camp (30, Uber",
# "Travis Kalanick (32, Uber"]
b = a.map { |s| s.split(/ \(/) }
#=> [["Paul Graham", "31, Viaweb"], ["Jan Koum", "33, WhatsApp"],
# ["Brian Acton", "37, WhatsApp"],
# ["Elon Musk", "32, Tesla | 31, SpaceX | 27, PayPal"],
# ["Garrett Camp", "30, Uber"], ["Travis Kalanick", "32, Uber"]]
h = b.each_with_object({}) do |(name, startups), h|
  h[name] = startups.split(/ *\| */).map do |s|
    age, co = s.split(/, +/)
    { age: age.to_i, co: co }
  end
end
#=> <as above>
In computing h from b, when
name = "Elon Musk"
startups = "32, Tesla | 31, SpaceX | 27, PayPal"
h = {"Paul Graham" =>[{:age=>31, :co=>"Viaweb"}],
"Jan Koum" =>[{:age=>33, :co=>"WhatsApp"}],
"Brian Acton" =>[{:age=>37, :co=>"WhatsApp"}]}
the block calculation is as follows.
c = startups.split(/ *\| */)
#=> ["32, Tesla", "31, SpaceX", "27, PayPal"]
d = c.map do |s|
  age, co = s.split(/, +/)
  { age: age.to_i, co: co }
end
#=> [{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
# {:age=>27, :co=>"PayPal"}]
h[name] = d
#=> [{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
# {:age=>27, :co=>"PayPal"}]
and now
h #=> {"Paul Graham"=>[{:age=>31, :co=>"Viaweb"}],
# "Jan Koum" =>[{:age=>33, :co=>"WhatsApp"}],
# "Brian Acton"=>[{:age=>37, :co=>"WhatsApp"}],
# "Elon Musk" =>[{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
# {:age=>27, :co=>"PayPal"}]}
My guess is that this expression might simply work:
\s*(\d+)\s*,\s*([^)|]*)(?=\s*\||\s*\))|([^(\r\n]*)\(
Test
re = /\s*(\d+)\s*,\s*([^)|]*)(?=\s*\||\s*\))|([^(\r\n]*)\(/
str = 'Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal);
Garrett Camp (30, Uber);
'
str.scan(re) do |match|
  puts match.to_s
end
Output
[nil, nil, "Elon Musk "]
["32", "Tesla ", nil]
["31", "SpaceX ", nil]
["27", "PayPal", nil]
[nil, nil, "Garrett Camp "]
["30", "Uber", nil]
If you wish to explore, simplify, or modify the expression, it is explained in the top right panel of regex101.com, where you can also watch how it matches against some sample inputs.
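For what it's worth, this alternation pattern ports directly to Python's re module as well; a small sketch (not part of the original answers) that reproduces the same groups:
import re

# Either an "age, company" pair (groups 1-2) or a founder name before "(" (group 3).
rx = re.compile(r"\s*(\d+)\s*,\s*([^)|]*)(?=\s*\||\s*\))|([^(\r\n]*)\(")

s = "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal);\nGarrett Camp (30, Uber);\n"
for age, company, founder in rx.findall(s):
    if founder:
        print("founder:", founder.strip())
    else:
        print("  age:", age, "company:", company.strip())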

Regular expression and csv | Output more readable

I have a text file which contains different news articles about terrorist attacks. Each article starts with an HTML tag (<p>Advertisement), and I would like to extract from each article one specific piece of information: the number of people wounded in the attack.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re

text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as a CSV file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestions on how to make it more readable?
I rewrote your loop like this and merged it with the CSV writing, since you requested it:
import csv

with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i + 1), nb_casualties]
        writer.writerow(row)
- get the index of the article using enumerate
- sum the number of victims (in case more than one group matches) using a generator expression to convert each match to an integer and pass it to sum, but only if something matched (the ternary expression checks that)
- create the row
- print it, or optionally write it as a row (one per iteration) of a csv.writer object
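Put together, a self-contained version of the whole pipeline might look like this (a sketch; the file names come from the question, the utf-8 encoding is an assumption):
import csv
import re

pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"

with open("News_cleaned_definitive.csv", encoding="utf-8") as src:
    splitted = src.read().split("<p>")

with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        # each match is a 3-tuple with exactly one non-empty group
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        writer.writerow(["article_{}".format(i + 1), nb_casualties])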

Subtracting every two columns

Imagine I have a dataframe like this (or the names of all months)
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(20),3)))
mydata <- rbind(mydata,c(2,round(runif(20),3)))
mydata <- rbind(mydata,c(3,round(runif(20),3)))
colnames(mydata) <- c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2))
id Mary1 Mary2 Bob1 Bob2 Dylan1 Dylan2 Tom1 Tom2 Jane1 Jane2 Sam1 Sam2 Tony1 Tony2 Luke1 Luke2 John1 John2 Pam1 Pam2
1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777
2 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411
3 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407
Usually with many more columns.
And I want to add columns (it's up to you whether to add them to the right or to create a new dataframe with these new columns), subtracting every two. (*)
id, Mary1-Mary2, Bob1-Bob2, Dylan1-Dylan2, Tom1-Tom2, Jane1-Jane2, ...
This operation is quite common.
I'd like to do it by name, not by position, to prevent problems if the columns are not consecutive.
It could even happen that some columns don't have their "twin" column; just leave them as they are, or ignore this complication for now.
(*) The names of the columns have a prefix and a number.
Instead of just subtracting two columns I could have groups of 5, and I may want to do something such as adding all the numbers. A generic solution would be great.
I first tried to do it by converting to long format, operating with aggregate, and converting back to wide format, but maybe it's much easier to do it directly in wide format. I know the problem is mainly about using regular expressions efficiently.
R, data.table or dplyr, long format splitting colnames.
I don't care about speed; I want the simplest solution.
Any package is welcome.
PS: All your solutions fail if I add a lonely column.
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(21),3)))
mydata <- rbind(mydata,c(2,round(runif(21),3)))
mydata <- rbind(mydata,c(3,round(runif(21),3)))
colnames(mydata) <- c(c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2)),"Lola" )
I know I could filter it out manually, but it would be better if the result were the difference (*) of every pair, leaving the lonely column alone (in the case of differences over groups of size two).
The best option would be not to remove the lonely column manually but to split the columns automatically into singles and pairs.
How about using base R:
cn <- unique(gsub("\\d", "", colnames(mydata)))[-1]
sapply(cn, function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]] )
You can use this approach for any arbitrary number of groups. For example, this would return the row sums across the names with the suffix 1 or 2:
sapply(cn, function(x) rowSums(mydata[, paste0(x, 1:2)]))
This paste approach could be replaced by regular expressions for more general applications.
You can do something like,
sapply(unique(sub('\\d', '', names(mydata[,-1]))),
       function(i) Reduce('-', mydata[,-1][, grepl(i, sub('\\d', '', names(mydata[,-1])))]))
# Mary Bob Dylan Tom Jane Sam Tony Luke John Pam
#[1,] -0.106 -0.335 -0.696 0.284 0.567 0.029 0.303 0.272 -0.274 -0.397
#[2,] 0.723 0.526 -0.119 -0.369 0.530 -0.118 0.308 0.159 0.686 0.313
#[3,] 0.174 0.230 -0.259 -0.454 0.039 -0.383 0.193 -0.028 -0.203 0.255
As per your comment, we can easily sort the columns and then apply the formula above,
sorted.names <- names(mydata)[order(nchar(names(mydata)), names(mydata))]
mydata <- mydata[,sorted.names]
This solution handles an arbitrary number of twins.
## return data frame
twin.vars <- function(prefix, df) {
  df[grep(paste0(prefix, '[0-9]+$'), names(df))]
}
pfx <- unique(sub('[0-9]*$', '', names(mydata[-1])))
tmp <- lapply(pfx, function(x) Reduce(`-`, twin.vars(x, mydata)))
cbind(id=mydata$id, as.data.frame(setNames(tmp, pfx)))
OK, I've chosen #NBATrends' solution because it works well almost always and he was the first.
Anyway, I'll add my little contribution, just in case anybody is interested:
runs <- rle(sort(sub('\\d$', '', names(mydata))))
sapply(runs[[2]][runs[[1]]>1], function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]] )
The only "problem" is that it changes the final order, but you don't need to manually remove isolated columns, and works for disordered columns too.
I'm perplexed because nobody posted a solution with dplyr or data.table :)
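For anyone following along in Python rather than R, the same pairwise subtraction is straightforward with pandas (a sketch, not one of the posted R answers; the values are the Mary/Bob pairs from the data above):
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "Mary1": [0.266, 0.935, 0.821], "Mary2": [0.372, 0.212, 0.647],
                   "Bob1":  [0.573, 0.652, 0.783], "Bob2":  [0.908, 0.126, 0.553]})

# Strip trailing digits to find the group prefixes, then subtract within each
# group that actually has two members; lonely columns are simply skipped.
prefixes = df.columns[1:].str.replace(r"\d+$", "", regex=True).unique()
out = pd.DataFrame({"id": df["id"]})
for p in prefixes:
    cols = [c for c in df.columns[1:] if c.rstrip("0123456789") == p]
    if len(cols) == 2:
        out[p] = df[cols[0]] - df[cols[1]]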

R: Case-insensitive matching of a combination of first and last names (i.e. two columns) across two dataframes

In R, I should like to extract the people who completed both versions of a test I designed and subsequently administered in two phases (I asked participants for their first and last names).
The problem is that 1. people aren't consistent in using capitals; and 2. some people might share a first name or last name with other people. Thus, 1. I need a case-insensitive search; and 2. I should like to extract a new data frame that lists the first and last names of the first version, and the first and last names of the second version, in order to verify the match (also because someone might use "Tom" in one instance and "Thomas" in another):
df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))
df2 <- data.frame(firstName = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))
The above names should all provide a match for me to verify:
repeatDF <- data.frame(firstName.1 = c("John", "Josef", "Tom", "Huckleberry", "Huckleberry", "Johann", "Johann"),
                       lastName.1 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Bach"),
                       firstName.2 = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
                       lastName.2 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))
Of which I then (probably manually?) approve all but "Johann Pachelbel" and "Pap Finn", as they might match name-wise, but aren't the same person as the one they're matched to.
So far I have tried merge (see also match two data.frames based on multiple columns) and %in%, but both methods are case-sensitive and miss some matches. I somehow can't get an apply function to work using grep (I must admit I'm not very fluent with either of those functions), and I also don't know how to take both first and last name into account using grep. Am I looking in the right direction, or should I use an altogether different function?
Any help would be much appreciated!
PS. There seem to be many, many similar questions, but either for different programmes or not requiring both of my considerations – apologies though if there is indeed already an answer to my question!
This seems to work based on OP's comments and new dataset. I changed df2 slightly so the names are not in the same order in both data frames.
df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))
df2 <- data.frame(firstName = c("John", "josef", "Huck", "Pap", "Johann Sebastian", "Johann", "Thomas"),
                  lastName = c("Doe", "K", "Finn", "Finn", "Bach", "Pachelbel", "Sawyer"))
get.match <- function(A, B) {
  A <- as.list(tolower(A)); B <- as.list(tolower(B))
  match.last  <- grepl(A$lastName, B$lastName) | grepl(B$lastName, A$lastName)
  match.first <- grepl(A$firstName, B$firstName) | grepl(B$firstName, A$firstName)
  match.first | match.last
}
indx <- apply(df2,1,function(row) apply(df1,1,get.match,row))
indx
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# [4,] FALSE FALSE TRUE TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE TRUE TRUE FALSE
m.1 <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))
result
# firstName lastName firstName lastName
# 1 John Doe John Doe
# 2 Josef K josef K
# 3 Tom Sawyer Thomas Sawyer
# 4 Huckleberry Finn Huck Finn
# 4.1 Huckleberry Finn Pap Finn
# 5 Johann Bach Johann Sebastian Bach
# 5.1 Johann Bach Johann Pachelbel
So this uses an algorithm, implemented in get.match(...), which compares a row of df1 to a row of df2 and returns TRUE if the first name in either row is contained in the first name of the other row, or the last name in either row is contained in the last name of the other row. The line:
indx <- apply(df2,1,function(row) apply(df1,1,get.match,row))
then creates an indx matrix where the rows represent rows of df1 and the columns represent rows of df2, and an element is TRUE if the corresponding rows of df1 and df2 match. This allows for the possibility of more than one match in either df1 or df2. Finally, we convert this indx matrix to the result you want using:
m.1 <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))
This code extracts all the rows of df1 which have matches in df2, and then binds that to the corresponding rows from df2.
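A rough Python rendering of the same containment logic, in case it helps readers outside R (a sketch, not part of the original answer; plain substring tests stand in for grepl, and the names are taken from the question's data frames):
def rows_match(a, b):
    # True when either first names or last names contain one another, case-insensitively.
    fa, la = a[0].lower(), a[1].lower()
    fb, lb = b[0].lower(), b[1].lower()
    return (fa in fb or fb in fa) or (la in lb or lb in la)

df1 = [("John", "Doe"), ("Josef", "K"), ("Tom", "Sawyer"),
       ("Huckleberry", "Finn"), ("Johann", "Bach")]
df2 = [("John", "Doe"), ("josef", "K"), ("Huck", "Finn"), ("Pap", "Finn"),
       ("Johann Sebastian", "Bach"), ("Johann", "Pachelbel"), ("Thomas", "Sawyer")]

pairs = [(a, b) for a in df1 for b in df2 if rows_match(a, b)]
# yields the same seven candidate pairs as the indx matrix above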

Search a list with words in a string as parameter in Python

I could use some advice on how to search a list for genres, using the words in a string as the parameter.
So if I have created a list called genre, which contains a string like:
['crime, drama,action']
I want to use this list to search for movies containing all of the genres, or maybe just one of them.
I have created a big list which contains all the information about the movies. An example entry from the list:
('Saving Private Ryan (1998)', '8.5', "Tom Hanks, Matt Damon, Tom Sizemore',\n", 'action, drama, war,\n'),
So if I want to search for Saving Private Ryan, which is a drama + action movie, but not crime, how can I then use my genre list to search for it?
Is there a way to search by something in the string?
UPDATE:
So this is what I have done so far. I have tried to process my movie tuple and use a function.
Navn_rating = dict(zip(names1, ratings))
Actor_genre = dict(zip(actorlist, genre_list))

var = raw_input("Enter movie: ")
print "you entered ", var
for row in name_rating_actor_genre:
    if var in row:
        movie.append(row)
        print "Movie found", movie

def process_movie(movie):
    return {'title': names1, 'rating': ratings, 'actors': actorlist, 'genre': genre_list}
You can "search by something in the string" using in:
>>> genres = 'action, drama, war,\n'
>>> 'action' in genres
True
>>> 'drama' in genres
True
>>> 'romantic comedy' in genres
False
But note that this might not always give the result you want:
>>> 'war' in 'award-winning'
True
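If you do stick with the flat strings, a word-boundary regex avoids that false positive (a small sketch):
>>> import re
>>> bool(re.search(r'\bwar\b', 'action, drama, war,\n'))
True
>>> bool(re.search(r'\bwar\b', 'award-winning'))
False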
I think you should change your data structure. Consider making each movie a dictionary e.g.
{'title': 'Saving Private Ryan', 'year': 1998, 'rating': 8.5, 'actors': ['Tom Hanks', ...], 'genres': ['action', ...]}
then your query becomes
if 'drama' in movie['genres'] and 'action' in movie['genres']:
You can use indexing, split and slicing to process your tuple of strings to make the values of the dictionary, e.g.:
>>> movie = ('Saving Private Ryan (1998)', '8.5', "Tom Hanks, Matt Damon, Tom Sizemore',\n", 'action, drama, war,\n')
>>> int(movie[0][-5:-1])
1998
>>> float(movie[1])
8.5
>>> movie[0][:-7]
'Saving Private Ryan'
>>> movie[2].split(",")
['Tom Hanks', ' Matt Damon', " Tom Sizemore'", '\n']
As you can see, some tidying up may be needed. You could write a function that takes the tuple as an argument and returns the corresponding dictionary:
def process_movie(movie_tuple):
    # ... process the tuple here
    return {'title': title, 'rating': rating, ...}
and apply this to your list of movies using map:
movies = list(map(process_movie, name_rating_actor_genre))
Edit:
You will know your function works when the following line doesn't raise any errors:
assert process_movie(('Saving Private Ryan (1998)', '8.5', "Tom Hanks, Matt Damon, Tom Sizemore',\n", 'action, drama, war,\n')) == {"title": "Saving Private Ryan", "year": 1998, "rating": 8.5, "actors": ["Tom Hanks", "Matt Damon", "Tom Sizemore"], "genres": ["action", "drama", "war"]}
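For completeness, here is one possible implementation that passes the assert above (a sketch; the exact clean-up rules are my own choices):
def process_movie(movie_tuple):
    title_field, rating, actors, genres = movie_tuple
    return {
        "title": title_field[:-7],            # drop " (1998)" from the end
        "year": int(title_field[-5:-1]),      # the digits inside the parentheses
        "rating": float(rating),
        "actors": [a.strip(" '\n") for a in actors.split(",") if a.strip(" '\n")],
        "genres": [g.strip(" \n") for g in genres.split(",") if g.strip(" \n")],
    }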