Matching words from two files and extracting the matched ones - regex

I have the following data frame:
dataFrame <- data.frame(sent = c(1,1,2,2,3,3,3,4,5),
                        word = c("good printer", "wireless easy", "just right size",
                                 "size perfect weight", "worth price", "website great tablet",
                                 "pan nice tablet", "great price", "product easy install"),
                        val = c(1,2,3,4,5,6,7,8,9))
The data frame "dataFrame" looks like this:
sent  word                  val
1     good printer          1
1     wireless easy         2
2     just right size       3
2     size perfect weight   4
3     worth price           5
3     website great tablet  6
3     pan nice tablet       7
4     great price           8
5     product easy install  9
And then I have a vector of words:
nouns <- c("printer", "wireless", "weight", "price", "tablet")
I need to extract only these words (nouns) from dataFrame and add the extracted words to a new column (e.g. extract) in dataFrame.
I would really appreciate any help or advice. Thanks a lot in advance.
Desired output:
sent  word                  val  extract
1     good printer          1    printer
1     wireless easy         2    wireless
2     just right size       3    size
2     size perfect weight   4    weight
3     worth price           5    price
3     website great tablet  6    tablet
3     pan nice tablet       7    tablet
4     great price           8    price
5     product easy install  9    remove this row (no match)

Here's a simple solution using the stringi package (note that "size" isn't in your nouns list, by the way).
library(stringi)
transform(dataFrame,
          extract = stri_extract_all(word,
                                     regex = paste(nouns, collapse = "|"),
                                     simplify = TRUE))
#   sent                 word val  extract
# 1    1         good printer   1  printer
# 2    1        wireless easy   2 wireless
# 3    2      just right size   3     <NA>
# 4    2  size perfect weight   4   weight
# 5    3          worth price   5    price
# 6    3 website great tablet   6   tablet
# 7    3      pan nice tablet   7   tablet
# 8    4          great price   8    price
# 9    5 product easy install   9     <NA>
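If you also want to drop the rows with no match afterwards, as in the desired output, a minimal follow-up sketch (assuming the first match per row is enough) is:
dataFrame$extract <- stri_extract_first(dataFrame$word,
                                        regex = paste(nouns, collapse = "|"))
dataFrame <- dataFrame[!is.na(dataFrame$extract), ]  # drop rows with no match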

This is another solution. It is a bit more complicated, but it also deletes the rows that have no match between nouns and dataFrame$word.
require(stringr)  # loaded here, though only base R functions are used below
dataFrame <- data.frame("sent" = c(1,1,2,2,3,3,3,4,5),
                        "word" = c("good printer", "wireless easy", "just right size",
                                   "size perfect weight", "worth price", "website great tablet",
                                   "pan nice tablet", "great price", "product easy install"),
                        val = c(1,2,3,4,5,6,7,8,9))
nouns <- c("printer", "wireless", "weight", "price", "tablet")
test <- character()
df.del <- list()
for (i in 1:nrow(dataFrame)) {
  matched <- intersect(nouns, unlist(strsplit(as.character(dataFrame$word[i]), " ")))
  if (length(matched) == 0) {
    df.del <- rbind(df.del, i)    # remember rows with no matching noun
  } else {
    test <- rbind(test, matched)  # collect the matched noun
  }
}
dataFrame <- dataFrame[-c(unlist(df.del)), ]  # drop the unmatched rows
dataFrame <- cbind(dataFrame, test)
names(dataFrame)[4] <- "extract"
Output:
  sent                 word val  extract
1    1         good printer   1  printer
2    1        wireless easy   2 wireless
4    2  size perfect weight   4   weight
5    3          worth price   5    price
6    3 website great tablet   6   tablet
7    3      pan nice tablet   7   tablet
8    4          great price   8    price

Here is another solution using nested loops and an if statement.
word <- dataFrame$word
extract <- rep("remove", length(word))  # default label for rows with no match
n <- length(word)
m <- length(nouns)
for (i in 1:n) {
  g <- as.character(word[i])
  for (j in 1:m) {
    if (grepl(nouns[j], g)) extract[i] <- nouns[j]  # last matching noun wins
  }
}
dataFrame$extract <- extract
#   sent                 word val  extract
# 1    1         good printer   1  printer
# 2    1        wireless easy   2 wireless
# 3    2      just right size   3   remove
# 4    2  size perfect weight   4   weight
# 5    3          worth price   5    price
# 6    3 website great tablet   6   tablet
# 7    3      pan nice tablet   7   tablet
# 8    4          great price   8    price
# 9    5 product easy install   9   remove
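To arrive at the desired output, the rows labelled "remove" can then be dropped with a one-liner:
dataFrame <- dataFrame[dataFrame$extract != "remove", ]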


Plotting categorical variables using a bar diagram/bar chart

I am trying to plot a bar graph for both the Sept and Oct waves. As you can see in the image, the ids are the individuals who are surveyed across time. So on one graph I need to plot Sept in-house, Oct in-house, Sept out-house, and Oct out-house, and show only the proportion of people who said yes in each of those four categories. Not all of the categories have to be taken into account.
Also I have to show whiskers for 95% confidence intervals for each of the respective categories.
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id sept_outhouse sept_inhouse oct_outhouse oct_inhouse)
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 3 3 3
5 4 4 3 3
6 4 4 3 3
7 4 4 4 1
8 1 1 1 1
9 1 1 1 1
10 1 1 1 1
end
label values sept_outhouse codes
label values sept_inhouse codes
label values oct_outhouse codes
label values oct_inhouse codes
label def codes 1 "yes", modify
label def codes 2 "no", modify
label def codes 3 "don't know", modify
label def codes 4 "refused", modify
save tokenexample, replace
rename (*house) (house*)
reshape long house, i(id) j(which) string
replace which = subinstr(proper(which), "_", " ", .)
gen yes = house == 1
label def WHICH 1 "Sept Out" 2 "Sept In" 3 "Oct Out" 4 "Oct In"
encode which, gen(WHICH) label(WHICH)
statsby, by(WHICH) clear: ci proportion yes, jeffreys
set scheme s1color
twoway scatter mean WHICH ///
|| rspike ub lb WHICH, xla(1/4, noticks valuelabel) xsc(r(0.9 4.1)) ///
xtitle("") legend(off) subtitle(Proportion Yes with 95% confidence interval)
This has to be solved backwards.
The means and confidence intervals have to be plotted using twoway, as graph bar is a dead end here because it does not allow whiskers.
The confidence limits have to be put in variables before the graphics. Some graph commands, notably graph bar, will calculate means for you, but as said that is a dead end, so we need to calculate the means too.
To do that you need an indicator variable for Yes.
The best way I know to get the results then is to reshape to a different structure and then apply ci proportion under statsby.
As a detail, the option jeffreys is explicit as a signal that there are different methods for the confidence interval calculation. You should choose one knowingly.

Detect rows in a data frame that are highly similar but not necessarily exact duplicates

I would like to identify rows in a data frame that are highly similar to each other but not necessarily exact duplicates. I have considered merging all the data from each row into one string cell at the end and then using a partial matching function. It would be nice to be able to set/adjust the level of similarity required to qualify as a match (for example, return all rows that match 75% of the characters in another row).
Here is a simple working example.
df <- data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew"),
                 id = c(12334, 12344, 34345, 98974),
                 score = c(90, 90, 83, 95))
In this scenario, I would want row 2 to show up as a duplicate of row 1, but not row 4 (it is too dissimilar). Thanks for any suggestions.
You can use agrep. But first you need to concatenate all columns, so that the fuzzy search runs over all columns and not just the first one.
xx <- do.call(paste0, df)
df[agrep(xx[1], xx, max = 0.6 * nchar(xx[1])), ]
     name    id score
1  Andrew 12334    90
2  Andrem 12344    90
4 Pamdrew 98974    95
Note that with 0.7 you get all rows.
Once rows are matched you should extract them from the data.frame and repeat the same process for the remaining rows (row 3 here with the rest of the data)...
You could use agrep (or agrepl) for partial (fuzzy) pattern matching.
> df[agrep("Andrew", df$name), ]
    name    id score
1 Andrew 12334    90
2 Andrem 12344    90
So this shows that rows 1 and 2 are both found when matching "Andrew". Then you could remove the duplicates (only taking the first "Andrew" match) with
> a <- agrep("Andrew", df$name)
> df[c(a[1], rownames(df)[-a]), ]
     name    id score
1  Andrew 12334    90
3    Adam 34345    83
4 Pamdrew 98974    95
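And since agrepl was mentioned: it returns a logical vector rather than indices, so the same lookup can be written as (a minimal sketch):
df[agrepl("Andrew", df$name), ]  # TRUE/FALSE per element instead of positions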
You could use some approximate string distance metric for the names, such as:
adist(df$name)
     [,1] [,2] [,3] [,4]
[1,]    0    1    4    3
[2,]    1    0    3    4
[3,]    4    3    0    6
[4,]    3    4    6    0
or use a dissimilarity matrix calculation:
require(cluster)
daisy(df[, c("id", "score")])
Dissimilarities :
      1     2     3
2    10
3 22011 22001
4 86640 86630 64629
Extending the solution provided by agstudy (see above), I produced the following solution, which yields a data frame with each set of similar rows next to each other.
df <- data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew", "Adan"),
                 id = c(12334, 12344, 34345, 98974, 34344),
                 score = c(90, 90, 83, 95, 83))
xx <- do.call(paste0, df)  ## concatenate all columns
df3 <- df[0, ]             ## empty data frame for storing loop results
for (i in 1:nrow(df)) {    ## produce results for each row of the data frame
  ## set the level of similarity required (less than 30% dissimilarity in this case)
  df2 <- df[agrep(xx[i], xx, max = 0.3 * nchar(xx[i])), ]
  ## rows without matches return only themselves; this eliminates them
  if (nrow(df2) >= 2) df3 <- rbind(df3, df2)
  df3 <- df3[!duplicated(df3), ]  ## drop duplicates accumulated in df3
}
I am sure there are cleaner ways of producing these results, but this gets the job done.
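One such cleaner variant (a sketch, assuming base R's adist edit distance is an acceptable similarity measure) computes all pairwise distances at once instead of looping:
xx <- do.call(paste0, df)  # one concatenated string per row
d <- adist(xx)             # pairwise edit distances
diag(d) <- Inf             # ignore each row's distance to itself
df[apply(d, 1, min) <= 0.3 * nchar(xx), ]  # rows within 30% of their length of another row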

Split a dataframe column by regular expression on characters separated by a "."

In R, I have the following dataframe:
   Name Category
1 Beans   1.12.5
2 Pears    5.7.9
3  Eggs   10.6.5
What I would like to have is the following:
   Name Cat1 Cat2 Cat3
1 Beans    1   12    5
2 Pears    5    7    9
3  Eggs   10    6    5
Ideally, an expression built with plyr would be nice...
I will investigate on my side, but since searching for this might take me a lot of time, I was wondering if some of you have hints on how to perform this...
I've written a function concat.split (a "family" of functions, actually) as part of my splitstackshape package for dealing with these types of problems:
# install.packages("splitstackshape")
library(splitstackshape)
concat.split(mydf, "Category", ".", drop=TRUE)
#    Name Category_1 Category_2 Category_3
# 1 Beans          1         12          5
# 2 Pears          5          7          9
# 3  Eggs         10          6          5
It also works nicely on "unbalanced" data.
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"),
                  Category = c("1.12.5", "5.7.9.8", "10.6.5.7.7"))
concat.split(dat, "Category", ".", drop = TRUE)
#    Name Category_1 Category_2 Category_3 Category_4 Category_5
# 1 Beans          1         12          5         NA         NA
# 2 Pears          5          7          9          8         NA
# 3  Eggs         10          6          5          7          7
Because "long" or "molten" data are often required in these types of situations, the concat.split.multiple function has a "long" argument too:
concat.split.multiple(dat, "Category", ".", direction = "long")
#     Name time Category
# 1  Beans    1        1
# 2  Pears    1        5
# 3   Eggs    1       10
# 4  Beans    2       12
# 5  Pears    2        7
# 6   Eggs    2        6
# 7  Beans    3        5
# 8  Pears    3        9
# 9   Eggs    3        5
# 10 Beans    4       NA
# 11 Pears    4        8
# 12  Eggs    4        7
# 13 Beans    5       NA
# 14 Pears    5       NA
# 15  Eggs    5        7
The qdap package has colsplit2df for just these sorts of situations:
#recreate your data first:
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"),
                  Category = c("1.12.5", "5.7.9", "10.6.5"),
                  stringsAsFactors = FALSE)
library(qdap)
colsplit2df(dat, 2, paste0("cat", 1:3))
## > colsplit2df(dat, 2, paste0("cat", 1:3))
##    Name cat1 cat2 cat3
## 1 Beans    1   12    5
## 2 Pears    5    7    9
## 3  Eggs   10    6    5
If you have a consistent number of categories, then this will work:
#recreate your data first:
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"),
                  Category = c("1.12.5", "5.7.9", "10.6.5"),
                  stringsAsFactors = FALSE)
spl <- strsplit(dat$Category,"\\.")
len <- sapply(spl,length)
dat[paste0("cat",1:max(len))] <- t(sapply(spl,as.numeric))
Result:
dat
   Name Category cat1 cat2 cat3
1 Beans   1.12.5    1   12    5
2 Pears    5.7.9    5    7    9
3  Eggs   10.6.5   10    6    5
If you have differing numbers of separated values, then this should account for it:
#example unbalanced data
dat <- data.frame(Name = c("Beans", "Pears", "Eggs"),
                  Category = c("1.12.5", "5.7.9", "10.6.5"),
                  stringsAsFactors = FALSE)
dat$Category[2] <- "5.7"
spl <- strsplit(dat$Category,"\\.")
len <- sapply(spl,length)
spl <- Map(function(x,y) c(x,rep(NA,max(len)-y)), spl, len)
dat[paste0("cat",1:max(len))] <- t(sapply(spl,as.numeric))
Result:
   Name Category cat1 cat2 cat3
1 Beans   1.12.5    1   12    5
2 Pears      5.7    5    7   NA
3  Eggs   10.6.5   10    6    5
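For completeness, the same split can be done with tidyr::separate (a minimal sketch, assuming the tidyr package is an option); convert = TRUE turns the pieces numeric and fill = "right" pads short rows with NA:
library(tidyr)
separate(dat, Category, into = paste0("Cat", 1:3),
         sep = "\\.", convert = TRUE, fill = "right")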

Removing Percentages from a Data Frame

I have a data frame that originated from an Excel file. It has the usual headers above the columns, but some of the columns have % signs in them which I want to remove.
Searching Stack Overflow gives some nice code for removing percentages from matrices (Any way to edit values in a matrix in R?), which did not work when I tried to apply it to my data frame:
as.numeric(gsub("%", "", my.dataframe))
Instead it just returns a string of NAs with a warning message explaining that they were introduced by coercion. When I applied
gsub("%", "", my.dataframe)
I got the values in "c(...)" form, where the ... represents numbers separated by commas, repeated for every column that I had (gsub coerces the data frame to a character vector with one deparsed string per column). No % was in evidence; if I could just put this back together I'd be cooking.
Any help gratefully received, thanks.
Based on @Arun's comment, and imagining what your data.frame looks like:
> DF <- data.frame(X = paste0(1:5, '%'),
                   Y = paste0(2*(1:5), '%'),
                   Z = 3*(1:5), stringsAsFactors = FALSE)
> DF # this is how I imagine your data.frame looks
   X   Y  Z
1 1%  2%  3
2 2%  4%  6
3 3%  6%  9
4 4%  8% 12
5 5% 10% 15
> # Using @Arun's suggestion
> (DF2 <- data.frame(sapply(DF, function(x) as.numeric(gsub("%", "", x)))))
  X  Y  Z
1 1  2  3
2 2  4  6
3 3  6  9
4 4  8 12
5 5 10 15
I added as.numeric in the sapply call so that the resulting columns are numeric; without it, the result would be factor columns. Check it using sapply(DF2, class).
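A small variant (a sketch) that skips sapply's intermediate matrix and keeps the data.frame structure directly:
DF[] <- lapply(DF, function(x) as.numeric(gsub("%", "", x)))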

Working with a list of lists of dataframes with different dimensions

I am working with a list of lists of dataframes, like so:
results <- list()
for (i in 1:4) {
  runData <- data.frame(id = i, t = 1:10, value = runif(10))
  runResult <- data.frame(id = i, avgValue = mean(runData$value))
  results <- c(results, list(list(runResult, runData)))
}
The reason the data looks this way is that it's essentially how my actual data is generated from running simulations via clusterApply using the new parallel package in R 2.14.0, where each simulation returns a list of some summary results (runResult) and the raw data (runData).
I would like to combine the first dataframes of the second-level lists together (they have the same structure), and likewise the second dataframes of the second-level lists. This question seemed to be the answer; however, there all the dataframes have the same structure.
The best method I've found so far is using unlist to make it a list of dataframes, where odd indices and even indices represent dataframes that need to be combined:
results <- unlist(results,recursive=FALSE)
allRunResults <- do.call("rbind", results[seq(1,length(results),2)])
allRunData <- do.call("rbind", results[seq(2,length(results),2)])
I'm certain there's a better way to do this; I just don't see it yet. Can anyone supply one?
Shamelessly stealing a construct from Ben Bolker's excellent answer to this question...
Reduce(function(x, y) mapply("rbind", x, y), results)
[[1]]
  id  avgValue
1  1 0.3443166
2  2 0.6056410
3  3 0.6765076
4  4 0.4942554

[[2]]
   id  t      value
1   1  1 0.11891086
2   1  2 0.17757710
3   1  3 0.25789284
4   1  4 0.26766182
5   1  5 0.83790204
6   1  6 0.99916116
7   1  7 0.40794841
8   1  8 0.19490817
9   1  9 0.16238479
10  1 10 0.01881849
11  2  1 0.62178443
12  2  2 0.49214165
........
........
........
One option is to extract the given data frame from each piece of the list, then rbind them together:
runData <- do.call(rbind, lapply(results, '[[', 2))
runResult <- do.call(rbind, lapply(results, '[[', 1))
This example gives 2 data frames, but you can recombine them into a single list if you want.
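For example, a minimal way to recombine them into a single named list:
allRuns <- list(runResult = runResult, runData = runData)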