R: Case-insensitive matching of a combination of first and last names (i.e. two columns) across two dataframes - regex

In R, I should like to extract the people who completed both versions of a test I designed and subsequently administered in two phases (I asked participants for their first and last names).
The problem is that 1. people aren't consistent in using capitals; and 2. some people might share a first name or last name with other people. Thus, 1. I need a case-insensitive search; and 2. I should like to extract a new data frame that lists the first and last names of the first version, and the first and last names of the second version, in order to verify the match (also because someone might use "Tom" in one instance and "Thomas" in another):
df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))
df2 <- data.frame(firstName = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))
The above names should all provide a match for me to verify:
repeatDF <- data.frame(firstName.1 = c("John", "Josef", "Tom", "Huckleberry", "Huckleberry", "Johann", "Johann"),
                       lastName.1 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Bach"),
                       firstName.2 = c("John", "josef", "Thomas", "Huck", "Pap", "Johann Sebastian", "Johann"),
                       lastName.2 = c("Doe", "K", "Sawyer", "Finn", "Finn", "Bach", "Pachelbel"))
Of which I then (probably manually?) approve all but "Johann Pachelbel" and "Pap Finn", as they might match name-wise, but aren't the same person as the one they're matched to.
So far I have tried merge (see also match two data.frames based on multiple columns) and %in%, but both methods are case-sensitive and miss some matches. I somehow can't get an apply function to work with grep (I must admit I'm not very fluent with either), and I also don't know how to take both first and last name into account with grep. Am I looking in the right direction, or should I use an altogether different function?
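For what it's worth, lower-casing the key columns lets merge find the exact matches case-insensitively, but it still misses partial matches such as Tom/Thomas. A sketch of that attempt:
df1L <- transform(df1, key = paste(tolower(firstName), tolower(lastName)))
df2L <- transform(df2, key = paste(tolower(firstName), tolower(lastName)))
merge(df1L, df2L, by = "key", suffixes = c(".1", ".2"))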
Any help would be much appreciated!
PS. There seem to be many, many similar questions, but either for different programmes or not requiring both of my considerations – apologies though if there is indeed already an answer to my question!

This seems to work based on OP's comments and new dataset. I changed df2 slightly so the names are not in the same order in both data frames.
df1 <- data.frame(firstName = c("John", "Josef", "Tom", "Huckleberry", "Johann"),
                  lastName = c("Doe", "K", "Sawyer", "Finn", "Bach"))
df2 <- data.frame(firstName = c("John", "josef", "Huck", "Pap", "Johann Sebastian", "Johann", "Thomas"),
                  lastName = c("Doe", "K", "Finn", "Finn", "Bach", "Pachelbel", "Sawyer"))
get.match <- function(A, B) {
  # lower-case both rows, then flag a match if either first name contains
  # the other, or either last name contains the other
  A <- as.list(tolower(A)); B <- as.list(tolower(B))
  match.last  <- grepl(A$lastName,  B$lastName)  | grepl(B$lastName,  A$lastName)
  match.first <- grepl(A$firstName, B$firstName) | grepl(B$firstName, A$firstName)
  match.first | match.last
}
indx <- apply(df2,1,function(row) apply(df1,1,get.match,row))
indx
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
# [1,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
# [4,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
m.1 <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))
result
#       firstName lastName        firstName  lastName
# 1          John      Doe             John       Doe
# 2         Josef        K            josef         K
# 3           Tom   Sawyer           Thomas    Sawyer
# 4   Huckleberry     Finn             Huck      Finn
# 4.1 Huckleberry     Finn              Pap      Finn
# 5        Johann     Bach Johann Sebastian      Bach
# 5.1      Johann     Bach           Johann Pachelbel
This uses the algorithm implemented in get.match(...), which compares a row of df1 to a row of df2 and returns TRUE if either first name is contained in the other row's first name, or either last name is contained in the other row's last name. The line:
indx <- apply(df2,1,function(row) apply(df1,1,get.match,row))
then creates an indx matrix in which rows correspond to rows of df1, columns correspond to rows of df2, and an element is TRUE if the corresponding rows match. This allows for the possibility of more than one match in either df1 or df2. Finally, we convert this indx matrix to the result you want using:
m.1 <- df1[rep(1:nrow(df1),apply(indx,1,sum)),]
result <- cbind(m.1,do.call(rbind,apply(indx,1,function(i)df2[i,])))
This code extracts all the rows of df1 which have matches in df2, and then binds that to the corresponding rows from df2.
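If the nested apply(...) calls feel awkward, here is an equivalent, hedged sketch using outer() over lower-cased columns (note: regex metacharacters in names would need escaping or grepl(..., fixed = TRUE)):
lo1 <- lapply(df1, tolower); lo2 <- lapply(df2, tolower)
contains <- function(a, b) mapply(grepl, a, b) | mapply(grepl, b, a)
indx2 <- outer(seq_len(nrow(df1)), seq_len(nrow(df2)),
               function(i, j) contains(lo1$firstName[i], lo2$firstName[j]) |
                              contains(lo1$lastName[i], lo2$lastName[j]))
This should reproduce the indx matrix above and can be fed into the same extraction step.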


how to prepare transactional dataset for association rule mining in RapidMiner?

I have a dataset like this:
abelia,fl,nc
esculentus,ct,dc,fl,il,ky,la,md,mi,ms,nc,sc,va,pr,vi
abelmoschus moschatus,hi,pr*
dataset link:
My dataset doesn't have any attribute declarations. I want to apply association rule mining to it, so I want it transformed to look like this:
plant    fl nc ct dc .....
abelia    1  1  0  0
.....
ELKI contains a parser that can read this input as-is. Maybe RapidMiner does, too - or you could write a parser for this format! With the ELKI parameters
-dbc.in /tmp/plants.data
-dbc.parser SimpleTransactionParser -parser.colsep ,
-algorithm itemsetmining.associationrules.AssociationRuleGeneration
-itemsetmining.minsupp 0.10
-associationrules.interestingness Lift
-associationrules.minmeasure 7.0
-resulthandler ResultWriter -out /tmp/rules
we can find all association rules with support >= 10%, Lift >= 7.0, and write them to the folder /tmp/rules (there is currently no visualization of association rules in ELKI):
For example, this finds the rules
sc, va, ga: 3882 --> nc, al: 3529 : 7.065536626573297
va, nj: 4036 --> md, pa: 3528 : 7.206260507764794
So plants that occur in South Carolina, Virginia, and Georgia will also occur in North Carolina and Alabama. NC is not much of a surprise, given that it lies between SC and VA, but Alabama is interesting.
The second rule is that Virginia and New Jersey imply Maryland (in between the two) and Pennsylvania. Also a very plausible rule, supported by 3528 cases.
I did my work with this Python script:
import csv

# state/province abbreviations that become the column headers
abbrs = ['states', 'ab', 'ak', 'ar', 'az', 'ca', 'co', 'ct',
         'de', 'dc', 'of', 'fl', 'ga', 'hi', 'id', 'il', 'in',
         'ia', 'ks', 'ky', 'la', 'me', 'md', 'ma', 'mi', 'mn',
         'ms', 'mo', 'mt', 'ne', 'nv', 'nh', 'nj', 'nm', 'ny',
         'nc', 'nd', 'oh', 'ok', 'or', 'pa', 'pr', 'ri', 'sc',
         'sd', 'tn', 'tx', 'ut', 'vt', 'va', 'vi', 'wa', 'wv',
         'wi', 'wy', 'al', 'bc', 'mb', 'nb', 'lb', 'nf', 'nt',
         'ns', 'nu', 'on', 'qc', 'sk', 'yt']

with open("plants.data.txt", encoding="ISO-8859-1") as f1, open("plants.data.csv", "a") as f2:
    csv_f2 = csv.writer(f2, delimiter=',')
    csv_f2.writerow(abbrs)
    csv_f1 = csv.reader(f1)
    for row in csv_f1:
        # keep the plant name, then write a 0/1 flag per abbreviation
        new_row = [row[0]]
        for abbr in abbrs:
            if abbr in row:
                new_row.append(1)
            else:
                new_row.append(0)
        csv_f2.writerow(new_row)
If all of the values are single words, you can use the Text Mining extension in RapidMiner to transform them into variables and then run association rule mining methods on them.
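For completeness, here is a minimal R sketch of the same one-hot transform (assuming, as in the Python script above, that plants.data.txt lists the plant name first, followed by comma-separated state codes):
lines <- strsplit(readLines("plants.data.txt"), ",")
plants <- sapply(lines, `[`, 1)
states <- sort(unique(unlist(lapply(lines, `[`, -1))))
mat <- t(sapply(lines, function(x) as.integer(states %in% x[-1])))
dimnames(mat) <- list(plants, states)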

Subtracting every two columns

Imagine I have a data frame like this (the name prefixes could just as well be the names of all months):
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(20),3)))
mydata <- rbind(mydata,c(2,round(runif(20),3)))
mydata <- rbind(mydata,c(3,round(runif(20),3)))
colnames(mydata) <- c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2))
id Mary1 Mary2 Bob1 Bob2 Dylan1 Dylan2 Tom1 Tom2 Jane1 Jane2 Sam1 Sam2 Tony1 Tony2 Luke1 Luke2 John1 John2 Pam1 Pam2
1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777
2 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411
3 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407
Usually with many more columns.
And I want to add columns (it's up to you whether to add them to the right or to create a new data frame with these new columns) subtracting every two (*):
id, Mary1-Mary2, Bob1-Bob2, Dylan1-Dylan2, Tom1-Tom2, Jane1-Jane2,...
This operation is quite common.
I'd like to do it by name, not by position, to prevent problems if they are not consecutive.
It could even happen that some columns don't have their "twin" column; just leave those as-is, or ignore this complication for now.
(*) The names of the columns have a prefix and a number.
Instead of just subtracting two columns I could have groups of 5, and I may want to do something such as adding all the numbers, so a generic solution would be great.
I first tried converting to long format, operating with aggregate, and converting back to wide format, but maybe it's much easier to do it directly in wide format. I know the problem is mainly about using regular expressions efficiently.
R, data.table or dplyr, long format splitting colnames
I don't mind about speed; I'd prefer the simplest solution.
Any package is welcome.
PS: All your solutions fail if I add a lone column.
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(21),3)))
mydata <- rbind(mydata,c(2,round(runif(21),3)))
mydata <- rbind(mydata,c(3,round(runif(21),3)))
colnames(mydata) <- c(c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2)),"Lola" )
I know I could filter it out manually, but it would be better if the result were the difference (*) of every pair, leaving the lone column alone (in the case of groups of size two).
The best option would be not to remove the first column manually, but to split all columns into single and paired ones.
How about using base R:
cn <- unique(gsub("\\d", "", colnames(mydata)))[-1]  # unique name prefixes, dropping "id"
sapply(cn, function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]])
You can use this approach for an arbitrary number of groups. For example, this would return the row sums across the names with the suffix 1 or 2:
sapply(cn, function(x) rowSums(mydata[, paste0(x, 1:2)]))
This paste approach could be replaced by regular expressions for more general applications.
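For instance, a hedged sketch of that regex variant, which groups columns by their alphabetic prefix, applies the operation per group, and skips lone columns (like the Lola column from the PS):
prefixes <- unique(sub("\\d+$", "", names(mydata)[-1]))
res <- lapply(prefixes, function(p) {
  cols <- grep(paste0("^", p, "\\d+$"), names(mydata), value = TRUE)
  if (length(cols) < 2) return(NULL)   # leave lone columns alone
  Reduce(`-`, mydata[cols])
})
names(res) <- prefixes
do.call(cbind, res[!sapply(res, is.null)])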
You can do something like,
sapply(unique(sub('\\d', '', names(mydata[,-1]))),
       function(i) Reduce('-', mydata[,-1][,grepl(i, sub('\\d', '', names(mydata[,-1])))]))
#        Mary    Bob  Dylan    Tom  Jane    Sam  Tony   Luke   John    Pam
# [1,] -0.106 -0.335 -0.696  0.284 0.567  0.029 0.303  0.272 -0.274 -0.397
# [2,]  0.723  0.526 -0.119 -0.369 0.530 -0.118 0.308  0.159  0.686  0.313
# [3,]  0.174  0.230 -0.259 -0.454 0.039 -0.383 0.193 -0.028 -0.203  0.255
As per your comment, we can easily sort the columns and then apply the formula above,
sorted.names <- names(mydata)[order(nchar(names(mydata)), names(mydata))]
mydata <- mydata[,sorted.names]
This solution handles an arbitrary number of twins.
## return data frame
twin.vars <- function(prefix, df) {
df[grep(paste0(prefix, '[0-9]+$'), names(df))]
}
pfx <- unique(sub('[0-9]*$', '', names(mydata[-1])))
tmp <- lapply(pfx, function(x) Reduce(`-`, twin.vars(x, mydata)))
cbind(id=mydata$id, as.data.frame(setNames(tmp, pfx)))
OK, I've chosen @NBATrends' solution because it works well almost always and he was the first to answer.
Anyway, I'll add my little contribution, just in case anybody is interested:
runs <- rle(sort(sub('\\d$', '', names(mydata))))
sapply(runs[[2]][runs[[1]]>1], function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]] )
The only "problem" is that it changes the final order, but you don't need to manually remove isolated columns, and works for disordered columns too.
I'm perplexed because nobody posted a solution with dplyr or data.table :)
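So, for completeness, a hedged tidyr/dplyr sketch of the long-format route mentioned in the question (tidyr >= 1.0; it assumes every measure column is an alphabetic prefix plus one digit, so a lone column like Lola would have to be set aside first):
library(dplyr)
library(tidyr)
mydata %>%
  pivot_longer(-id, names_to = c("name", "rep"), names_pattern = "([A-Za-z]+)(\\d)") %>%
  pivot_wider(names_from = rep, values_from = value) %>%
  mutate(diff = `1` - `2`) %>%
  select(id, name, diff) %>%
  pivot_wider(names_from = name, values_from = diff)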

Extract words that meet a length condition from string

I have a patent data set and when I import the IPC-class information to R I get a string containing whitespaces in a variable amount and a set of numbers I don't need. The following are the IPC codes corresponding to a patent file:
b <- "F24J 2/05 20060101AFI20150224BHEP F24J 2/46 20060101ALI20150224BHEP "
I would like to remove all whitespaces and that long alphanumeric string and just get the data I am interested in, obtaining a data frame like this, in this case:
m <- data.frame(matrix(c("F24J 2/05", "F24J 2/46"), byrow = TRUE, nrow = 1, ncol = 2))
m
I am trying with gsub, since I know that the long string will always be considerably longer than the data I am interested in:
x <- gsub("\\b[a-zA-Z0-9]{8,}\\b", "", b)
x
But I get stuck when I try to further clean this object in order to get the data frame I want. I am really stuck on this, and I would really appreciate if someone could help me.
Thank you very much in advance.
You can use str_extract_all from stringr package, provided you know the pattern you look for:
library(stringr)
str_extract_all(b, "[A-Z]\\d{2}[A-Z] *\\d/\\d{2}")[[1]]
#[1] "F24J 2/05" "F24J 2/46"
Option 1: select all the noise data and remove it using a substitution:
/\s+|\w{5,}/g
(spaces and 'long' words)
https://regex101.com/r/lG4dC4/1
Option 2: select all the short words (length max 4):
/\b\S{4}\b/g
https://regex101.com/r/fZ8mH5/1
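Translated to base R, option 2 could look like this (a sketch assuming the wanted tokens are exactly 4 characters long, as in the sample):
x <- regmatches(b, gregexpr("\\b\\S{4}\\b", b))[[1]]
paste(x[c(TRUE, FALSE)], x[c(FALSE, TRUE)])
# [1] "F24J 2/05" "F24J 2/46"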
or…
library(stringi)
library(readr)
read_fwf(paste0(stri_match_all_regex(b, "[[:alnum:][:punct:][:blank:]]{50}")[[1]][,1], collapse="\n"),
         fwf_widths(c(7, 12, 31)))[,1:2]
## X1 X2
## 1 F24J 2/05
## 2 F24J 2/46
(this makes the assumption - from only seeing 2 'records' - that each 'record' is 50 characters long)
Here's an approach to make the matrix using qdapRegex (I maintain this package) plus magrittr's pipeline:
library(qdapRegex); library(magrittr)
b %>%
  rm_white_multiple() %>%
  rm_default(pattern = "F[0-9A-Z]+\\s\\d{1,2}/\\d{1,2}", extract = TRUE) %>%
  unlist() %>%
  strsplit("\\s") %>%
  do.call(rbind, .)
## [,1] [,2]
## [1,] "F24J" "2/05"
## [2,] "F24J" "2/46"

How to use separate() properly?

I am having some difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a <- as.character(df$a)
df <- df[grep("_", df$a), ]
df <- separate(df, a, c("ID", "Name"), sep = "_")
df$ID <- as.numeric(df$ID)
However, this time there are too many underscores and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the pattern you want to capture. I'm assuming here that the ID always starts with a digit, so I capture everything from the first digit up to the next _, and then everything after it:
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
Try this (it assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
If you want to get the Name as well, a simple solution would be (assuming the Name is always everything after the second underscore):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"

How would I turn a multivalue string into a usable frequency table in R?

I have a field in a data frame called plugins_Apache_module. It contains strings like:
c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
"mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
"mod_ssl/2.2.9")
I need a frequency table on the modules, and also their versions.
What is the best way to do this in R? Being rather new to R, I've seen strsplit and gsub mentioned, and some chatrooms also suggested I use the qdap package.
Ideally I would want the string transformed into a data frame with a column for every mod; if the module is present, its version goes in that particular field. How would I accomplish such a transform?
What data frame format would be suggested if I want top-level frequencies - say, mod_ssl (all versions) - as well as relational options (mod_perl is very often used with mod_ssl)?
I'm not too sure how to handle such variable length data when pushing into a dataframe for processing. Any advice is welcome.
I consider the right answer to look like:
mod_perl  mod_python  mod_ssl  mod_auth_passthrough  mod_bwlimited
1.99_16   3.1.3       2.0.52
                      2.2.23   2.1                   1.4
                      2.2.9
So basically the part before the slash becomes a column, and the version(s) that follow become the row entries.
st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52", "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23", "mod_ssl/2.2.9")
scan(text=st, what="", sep=",")
Read 7 items
[1] "mod_perl/1.99_16" "mod_python/3.1.3" "mod_ssl/2.0.52"
[4] "mod_auth_passthrough/2.1" "mod_bwlimited/1.4" "mod_ssl/2.2.23"
[7] "mod_ssl/2.2.9"
strsplit( scan(text=st, what="", sep=","), "/")
Read 7 items
[[1]]
[1] "mod_perl" "1.99_16"
[[2]]
[1] "mod_python" "3.1.3"
[[3]]
[1] "mod_ssl" "2.0.52"
[[4]]
[1] "mod_auth_passthrough" "2.1"
[[5]]
[1] "mod_bwlimited" "1.4"
[[6]]
[1] "mod_ssl" "2.2.23"
[[7]]
[1] "mod_ssl" "2.2.9"
table( sapply(strsplit( scan(text=st, what="", sep=","), "/"), "[",1) )
#----------------
Read 7 items
mod_auth_passthrough        mod_bwlimited             mod_perl           mod_python
                   1                    1                    1                    1
             mod_ssl
                   3
table( scan(text=st, what="", sep=",") )
#-----------
Read 7 items
mod_auth_passthrough/2.1        mod_bwlimited/1.4         mod_perl/1.99_16
                       1                        1                        1
        mod_python/3.1.3           mod_ssl/2.0.52           mod_ssl/2.2.23
                       1                        1                        1
           mod_ssl/2.2.9
                       1
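Building on these pieces, a hedged base-R sketch of the requested wide layout (one column per module, version strings as the cell values, NA where a module is absent):
pieces <- strsplit(scan(text = st, what = "", sep = ","), "/")
mods <- sapply(pieces, `[`, 1)
vers <- sapply(pieces, `[`, 2)
rows <- rep(seq_along(st), lengths(strsplit(st, ",")))
out <- matrix(NA_character_, nrow = length(st), ncol = length(unique(mods)),
              dimnames = list(NULL, unique(mods)))
out[cbind(rows, match(mods, colnames(out)))] <- vers
out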
You're asking for at least two different things; adding the desired output helped greatly. I'm not sure if what you ask for is what you really want, but you asked, and it seemed like a fun problem. OK, here's how I would approach this using qdap (note that this requires qdap version 1.1.0):
## load qdap
library(qdap)
## your data
x <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
"mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
"mod_ssl/2.2.9")
## strsplit on commas and slashes
dat <- unlist(lapply(x, strsplit, ",|/"), recursive=FALSE)
## make just a list of mods per row
mods <- lapply(dat, "[", c(TRUE, FALSE))
## make a string of versions
ver <- unlist(lapply(dat, "[", c(FALSE, TRUE)))
## make a lookup key and split it into lists
key <- data.frame(mod = unlist(mods), ver, row = rep(seq_along(mods),
sapply(mods, length)))
key2 <- split(key[, 1:2], key$row)
## make it into freq. counts
freqs <- mtabulate(mods)
## rename assign freq table to vers in case you want freqs ans replace 0 with NA
vers <- freqs
vers[vers==0] <- NA
## loop through and fill the ones in each row using an env. lookup (%l%)
for(i in seq_len(nrow(vers))) {
x <- vers[i, !is.na(vers[i, ]), drop = FALSE]
vers[i, !is.na(vers[i, ])] <- colnames(x) %l% key2[[i]]
}
## Don't print the NAs
print(vers, na.print = "")
##   mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1                                      1.99_16      3.1.3  2.0.52
## 2                  2.1           1.4                       2.2.23
## 3                                                           2.2.9
## the frequency counts per mods
freqs
##   mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1                    0             0        1          1       1
## 2                    1             1        0          0       1
## 3                    0             0        0          0       1