Remove defined strings from sentences in dataframe - regex

I need to remove defined strings from sentences in data frame:
sent1 = data.frame(Sentences=c("bad printer for the money wireless setup was surprisingly easy",
"love my samsung galaxy tabinch gb whitethis is the first"), user = c(1,2))
  Sentences                                                         user
1 bad printer for the money wireless setup was surprisingly easy      1
2 love my samsung galaxy tabinch gb whitethis is the first            2
Defined strings for excluding, e.g.:
stop_words <- c("bad", "money", "love", "is", "the")
I was wondering about something like this:
library(stringr)
words1 <- (str_split(unlist(sent1$Sentences)," "))
ddd = which(words1[[1]] %in% stop_words)
words1[[1]][-ddd]
But I need it for all items in the list. Then I need to have output table in the same structure as input table sent1, but without defined strings.
I would very much appreciate any help or advice.

You can combine the stop words into a single regex pattern. That way, you only need a single gsub command.
# create regex pattern
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b ?")
# [1] "\\b(?:bad|money|love|is|the)\\b ?"
# remove stop words
res <- gsub(pattern, "", sent1$Sentences)
# [1] "printer for wireless setup was surprisingly easy"
# [2] "my samsung galaxy tabinch gb whitethis first"
# store result in a data frame
data.frame(Sentences = res)
# Sentences
# 1 printer for wireless setup was surprisingly easy
# 2 my samsung galaxy tabinch gb whitethis first
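Since the desired output should keep the structure of sent1, you may also want to carry over the user column and trim any whitespace left behind when a stop word ends a sentence; a minimal addition:
data.frame(Sentences = trimws(res), user = sent1$user)
#                                           Sentences user
# 1 printer for wireless setup was surprisingly easy     1
# 2     my samsung galaxy tabinch gb whitethis first     2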

How to Write Regular Expression Syntax in R

I have a file of log entries that I want to parse through. All the lines look like this:
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)
Each part has a specific designation which I'll put below.
F -- identifier of the line
20160525 -- date (yyyymmdd)
17:52:38.791 -- timestamp (HH:MM:SS.sss)
F798259D -- transfer identifier
156.145.15.85:46634 -- IP address and related port
xqixh8sl -- username
AES -- encryption level (could be - (dash))
"/pcgc...fastq.gz" -- transferred file (in ")
"" -- additional string (should be empty "")
2951144113 -- transferred bytes
(0,0) -- error
"2289.47 seconds (10.3 megabits/sec)" -- data about the transfer
I have imported the data file and am using the read.pattern() function to parse and separate it into its fields. I only want the pieces of information that correspond to fields 2, 3, 4, 5, 6, 7, 8, 10, and 12. However, I cannot get the pattern right. Before, I managed to get two of the fields I needed by using this pattern:
pattern <- "^F ([0-9]+) [^ ]* .* \\(0,0\\) (.*)$"
This gave me a data frame that looked like this:
date speed of data transfer
1 20160525 "1.62 seconds (1.30 kilobits/sec)"
2 20160525 "0.29 seconds (1.93 kilobits/sec)"
3 20160525 "0.01 seconds (34.0 kilobits/sec)"
4 20160525 "0.01 seconds (102 kilobits/sec)"
5 20160525 "38.05 seconds (214 megabits/sec)"
These are only two of the fields I need, but whenever I try to add more that's where I mess up the syntax. For example:
pattern <- "^F\\s([0-9]+)\\s[0-9:.]+\\s([:alnum:])\\s[A-Z]\\s([0-9.:]+)\\s([:alnum:])\\s([•])\\s[:punct:][A-z][:punct:]\\s[:punct:]\\s.* \\(0,0\\) (.*)$"
This did not work. Could someone please help on writing this? It's been driving me crazy. Thanks!
Here's my solution:
library(stringr)
con <- readLines("dataSet.txt")
pattern <- "^F (\\d+) ([:graph:]+) ([:graph:]+) [A-Z]+ ([:graph:]+) ([:graph:]+) ([:graph:]+) ([:graph:]+) [:graph:]+ (\\d+) [:graph:]+ (.+)$"
matches <- str_match(con,pattern)
df <- data.frame(na.omit(matches[,-1]))
colnames(df) <- c("date", "timestamp", "transfer ID", "IP address", "username", "encryption level", "transferred file", "transferred bytes", "speed of data transfer")
This was the result:
1 20160525 08:22:06.838 F798256B 10.199.194.38:57708 wei2dt - "" 264 "1.62 seconds (1.30 kilobits/sec)"
2 20160525 08:28:26.920 F798256C 10.19.105.15:57708 wei2dt - "isi_audit_log.dmp-sv.tmp" 69 "0.29 seconds (1.93 kilobits/sec)"
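Note that str_match() returns a row of NAs for lines that do not match, so the na.omit() call silently drops any malformed log lines. As an aside, one likely reason the original attempt failed is that in base R's regex engine POSIX classes such as [:alnum:] are only valid inside a bracket expression, i.e. [[:alnum:]], and need a quantifier like + to match more than one character; stringr's ICU engine accepts the bare [:graph:] set notation used here.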
If all of your lines follow a similar structure, you may be able to get away with simply splitting each row on the spaces.
x <- "F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz\" \"\" 3322771022 (0,0) \"1499.61 seconds (17.7 megabits/sec)"
library(dplyr)
library(magrittr)
strsplit(x, " ") %>%
  unlist() %>%
  t() %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  setNames(c("id", "date", "timestamp", "transfer_id",
             "curl_method", "ip_address", "username", "encryption",
             "transferred_file", "additional_string",
             "transferred_bytes", "error",
             "rate1", "rate2", "rate3", "rate4")) %>%
  mutate(rate = paste(rate1, rate2, rate3, rate4)) %>%
  select(-rate1:-rate4)
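If the whole file follows this structure, the same split can be applied to every line at once. A sketch under that assumption (reusing the dataSet.txt file name from the answer above and keeping only lines that split into exactly 16 fields):
library(dplyr)
con <- readLines("dataSet.txt")
parts <- strsplit(con, " ")
parts <- parts[lengths(parts) == 16]  # drop lines that don't split cleanly
do.call(rbind, parts) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  setNames(c("id", "date", "timestamp", "transfer_id",
             "curl_method", "ip_address", "username", "encryption",
             "transferred_file", "additional_string",
             "transferred_bytes", "error",
             "rate1", "rate2", "rate3", "rate4")) %>%
  mutate(rate = paste(rate1, rate2, rate3, rate4)) %>%
  select(-rate1:-rate4)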

Search and replace multiple strings in list of strings: improve R code

I am looking for a simplified solution to the following problem in R. I have a list of names that are separated by commas; however, some of the names themselves contain commas. In order to separate the names, I would like to first replace all names containing commas (with comma-free versions) and then split on commas. My problem is that I have around 26,000 strings with several names in each, and a list of around 130 names with commas. I have written a nested foreach loop (in order to use multiple cores to speed things up) and it works, but it's horribly slow. Is there a quicker way to search the strings and replace the relevant names? Here is my sample code:
List_of_names<-as.data.frame(c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike","Digital, Mike, John, Sr","Svenja, Sven"))
Comma_names<-as.data.frame(c("Franz, Jr.","Nice, LLC","John, Sr"))
colnames(Comma_names)<-"name"
Comma_names$replace_names<-gsub(",", "",Comma_names[,"name"])
library(doParallel)
library(foreach)
cl<-makeCluster(4) # Create cluster with desired number of cores
registerDoParallel(cl) # Register cluster
names_new <- foreach (i = 1:nrow(List_of_names), .errorhandling = "pass", .packages = c("foreach")) %dopar% {
  name_2 <- List_of_names[i, ]
  foreach (j = 1:nrow(Comma_names), .combine = rbind, .errorhandling = "pass") %do% {
    if (length(grep(Comma_names[j, 1], name_2)) > 0) {
      name_2 <- gsub(Comma_names[j, 1], Comma_names[j, 2], name_2)
    }
  }
  name_2
}
In addition, the result of the foreach loop is a list, but if I try to save the list or replace the column in my original data frame it takes forever. How can I change my code to make it faster?
Thank you to everyone who reads this and is able to help!
Principle
You can use a combination of Reduce and stri_replace_all from the package stringi.
Code
library(stringi)
Comma_names <- structure(list(name = c("Franz, Jr.", "Nice, LLC", "John, Sr"),
replace_names = c("Franz Jr.", "Nice LLC", "John Sr")),
.Names = c("name", "replace_names"),
row.names = c(NA, -3L), class = "data.frame")
List_of_names <- structure(list(name = c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
"Digital, Mike, John, Sr", "Svenja, Sven")),
.Names = "name",
row.names = c(NA, -3L), class = "data.frame")
wrapper <- function(str, ind) {
  stri_replace_all(str, Comma_names$replace_names[ind],
                   fixed = Comma_names$name[ind])
}
ind <- 1:NROW(Comma_names)
Reduce(wrapper, ind, init = List_of_names$name)
# [1] "Fred, Heiko, Franz Jr., Nice LLC, Meike"
# [2] "Digital, Mike, John Sr"
# [3] "Svenja, Sven"
Explanation
stri_replace_all is a fast function which replaces all occurrences in a string. With Reduce you apply a function to the result of the previous function call. So we apply wrapper to the column with all the names and replace the strings from the first row of Comma_names; this result we again feed to wrapper, now with the aim of replacing all occurrences of the second row, and so on. This code should run reasonably fast, and you do not need to parallelize. I would be curious to hear your feedback on the execution time.
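To make the folding concrete: with the three rows of Comma_names, the Reduce call expands to nested applications of wrapper, each one operating on the output of the previous:
# equivalent to Reduce(wrapper, 1:3, init = List_of_names$name)
wrapper(wrapper(wrapper(List_of_names$name, 1), 2), 3)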
Benchmark
Just a little benchmark with 3 million lines:
List_of_names <- List_of_names[rep(1:NROW(List_of_names), 1e6), , drop = FALSE]
system.time(invisible(Reduce(wrapper, ind, init = List_of_names$name)))
# user system elapsed
# 1.95 0.00 1.96
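As an alternative sketch, the stri_replace_all_* functions can chain all pattern/replacement pairs themselves: with vectorize_all = FALSE, each pair is applied in turn to every string, so the Reduce is no longer needed:
library(stringi)
stri_replace_all_fixed(List_of_names$name,
                       pattern = Comma_names$name,
                       replacement = Comma_names$replace_names,
                       vectorize_all = FALSE)
# [1] "Fred, Heiko, Franz Jr., Nice LLC, Meike"
# [2] "Digital, Mike, John Sr"
# [3] "Svenja, Sven"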

How to know if a variation (e.g. an abbreviation) of a string in a list matches against another list when the original does not?

I am currently searching for a method in R that lets me match/merge two data frames. Alas, both of these data frames contain imperfect data: they can have certain abbreviations or even typos in them. Therefore I would like to define a list of variants for each abbreviation and check whether a string contains one of those elements; if the original entries don't match, R should check whether any of the other variants of the abbreviation match. To illustrate: the name of a company could end with "Limited" but also with "Ltd." or "Ltd", etc.
EXAMPLE
Data
The original "Address" file contains:
Company name Address
Deloitte Ltd. New York
Coca-Cola New York
Tesla ltd California
Microsoft Limited Washington
This would have to be merged with the "EnterpriseNrList":
Company name EnterpriseNumber
Deloitte Ltd. 221
Coca-Cola 334
Tesla ltd 725
Microsoft Limited 127
So the abbreviations should work in "both directions". That's why I said, if R recognises any of the abbreviations, R should try to match all of them.
All of the matches should be reported in the result.
Therefore I would make up a list "Abbreviations" with every possible abbreviation:
Limited.
limited
Ltd.
ltd.
Ltd
ltd
Questions
1) Would this be a good method, or would there be a more efficient way?
2) How can I check a list against a list of possible abbreviations (step 1, see below), sort of a "contains" check as in Excel?
3) For the entries that do not match, how could I make up a list that replaces the abbreviation with all the other abbreviations (step 2, see below)?
Thoughts for solution
Step 1
As I am still very new to this kind of work, I was thinking the following: use a regular expression to check whether a string contains any of the abbreviation options, and create a list which will then contain -1 if no match could be found and >0 if a match is found. The entries with no pattern match can already be matched against the "Address" list. With the other entries I continue to step 2.
In this step I don't really know how to check against a list of options ("Abbreviations" list).
Step 2
Next I would create a list with the matches from step 1 and rbind together all options. In this step I don't really know how I could create a list that combines e.g. Coca-Cola with all its possible abbreviations:
Coca-Cola Limited
Coca-Cola Ltd.
Coca-Cola Ltd
etc.
Step 3
Lastly I would match/merge this more complete list of companies again with the original "Data" list. With the introduction of step 2, I thought it might be a bit easier on the required computing power, as the original list is about 8000 rows.
I would take a different approach, fixing the tables first, before the merge.
To fix the abbreviations I would use a case-insensitive regex with the final dot being optional. I start with a list of 'normal word' = vector of abbreviations:
abbrevs <- list('Limited'=c('Limited','Ltd'),'Incorporated'=c('Incorporated','Inc'))
Then I build the corresponding regexes (alternations with an optional dot at the end; the case will be ignored via a parameter to gsub and agrep later):
regexes <- lapply(abbrevs,function(x) { paste0("(",paste0(x,collapse='|'),")[.]?") })
Which gives:
$Limited
[1] "(Limited|Ltd)[.]?"
$Incorporated
[1] "(Incorporated|Inc)[.]?"
Now we have to apply each regex to the Company.name column of each data frame:
for (i in seq_along(regexes)) {
  Address$Company.name <- gsub(regexes[[i]], names(regexes[i]), Address$Company.name, ignore.case=TRUE)
  Enterprise$Company.name <- gsub(regexes[[i]], names(regexes[i]), Enterprise$Company.name, ignore.case=TRUE)
}
This does not take typos into account. For those you'll need to work with agrep or adist.
Result for Address example data set:
> Address
Company.name Address
1 Deloitte Limited New York
2 Coca-Cola New York
3 Tesla Limited California
4 Microsoft Limited Washington
Input data used:
Address <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), Address = c("New York", "New York",
"California", "Washington")), .Names = c("Company.name", "Address"
), class = "data.frame", row.names = c(NA, -4L))
Enterprise <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), EnterpriseNumber = c(221L,
334L, 725L, 127L)), .Names = c("Company.name", "EnterpriseNumber"
), class = "data.frame", row.names = c(NA, -4L))
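Once both Company.name columns have been normalized this way, a plain merge should line the two tables up; a minimal sketch with the data above:
merge(Address, Enterprise, by = "Company.name")
#        Company.name    Address EnterpriseNumber
# 1         Coca-Cola   New York              334
# 2  Deloitte Limited   New York              221
# 3 Microsoft Limited Washington              127
# 4     Tesla Limited California              725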
I would say that the answer depends on whether you have a list of abbreviations or not.
If you have one, you could just look at which elements of your list contain an abbreviation with the grep or grepl functions (grep returns all indexes that have a matching pattern, whereas grepl returns a logical vector).
Also, use the ignore.case = TRUE parameter of these functions, so you don't have to try all capitalized/lowercase possibilities.
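For instance, a quick check (the pattern here is just an illustration) of which names in the Address data from the answer above contain an "Ltd" variant:
grepl("\\bltd\\b", Address$Company.name, ignore.case = TRUE)
# [1]  TRUE FALSE  TRUE FALSE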
If you don't have such a list, my first guess would be to extract the first "word" of each company name (I would guess that there is a single "Deloitte" company, and that it is "Deloitte Ltd"). You can do so with:
unlist(strsplit(CompanyNames,split = " "))
If you wanted to also correct for typos, this is more a question of string distance.
Hope that helps!

How to use separate() properly?

I am having some difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores... and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a number, so I'm capturing everything from the first digit up to the next _, and then everything after that underscore.
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
if you want to get the Name as well, a simple solution would be (which assumes that the Name is always after the second underscore):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
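As another option in base R (again assuming the ID is always the part between the first and second underscore), a single sub call does the same job:
sub("^[^_]*_([^_]+)_.*$", "\\1", a)
# returns the same vector of IDs as above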

Extracting Category IDs and Category Labels from URLs Using R

Suppose I have the following links:
<a class=\"MainCategory\"href=\"/cp/3951?povid=cat1070145-env172199-moduleA080112-lLinkGNAV_Electronics_Computers\">Computers</a>
Desktops
Laptops
Is there an automated way to extract the following IDs: 3951, 132982 & 1089430, and their corresponding labels: Computers, Desktops, and Laptops?
If your URLs are in a vector like the following
vec <- c("<a class=\"MainCategory\"href=\"/cp/3951?povid=cat1070145-env172199-moduleA080112-lLinkGNAV_Electronics_Computers\">Computers</a>",
"Desktops",
"Laptops")
you can use regular expressions to extract the information:
data.frame(ID = sub(".*[0-9]+_[0-9]+_([0-9]+).*", "\\1",
                    sub(".*[^0-9]([0-9]+)\\?povid.*", "\\1", vec)),
           Label = sub(".*>(.*)</a>$", "\\1", vec))
# ID Label
# 1 3951 Computers
# 2 132982 Desktops
# 3 1089430 Laptops
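A note on how the nested sub calls cooperate: the inner sub extracts an ID that directly precedes "?povid" (the /cp/3951?povid=... shape), while the outer sub picks the last number out of an underscore-separated chain of IDs. Because sub returns its input unchanged when the pattern does not match, each element keeps whichever extraction applies, and the Label sub likewise passes plain entries such as "Desktops" through untouched.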