Removing regular expressions from text string in a data-frame in R - regex

I have a data-set with 1000 rows with text containing the order description of lamps. The data is full of inconsistent regex patterns and after referring to the few solutions, I got some help, but its not solving the issue.
R remove multiple text strings in data frame
remove multiple patterns from text vector r
I want to remove all delimiters and also keep only the words present in the wordstoreplace vector.
I have tried removing the delimiters using lapply and post that I have created 2 vectors- "wordstoremove" and "wordstoreplace"
I am trying to apply "str_remove_all()" and the "str_replace_all()". The the first function worked but the second did not.
Initially I had tried using a very naive approach but it was too clumsy.
mydata_sample=data.frame(x=c("LAMP, FLUORESCENT;TYPE TUBE LIGHT, POWER 8 W, POTENTIAL 230 V, COLORWHITE, BASE G5, LENGTH 302.5 MM; P/N: 37755,Mnfr:SuryaREF: MODEL: FW/T5/33 GE 1/25,",
"LAMP, INCANDESCENT;TYPE HALOGEN, POWER 1 KW, POTENTIAL 230 V, COLORWHITE, BASE R7S; Make: Surya",
"BALLAST, LAMP; TYPE: ELECTROMAGNETIC, LAMP TYPE: TUBELIGHT/FLUORESCENT, POWER: 36/40 W, POTENTIAL: 240VAC 50HZ; LEGACY NO:22038 Make :Havells , Cat Ref No : LHB7904025",
"SWITCH,ELECTRICAL,TYPE:1 WCR WAY,VOLTAGE:230V,CURRENT RATED:10A,NUMBEROFPOLES:1P,ADDITIONAL INFORMATION:FOR SNAPMODULESWITCH",
"Brief Desc:HIGH PRES. SODIUM VAPOUR LAMP 250W/400WDetailed Desc:Purchase order text :Short Description :HIGH PRES. SODIUM VAPOURLAMP 250W/400W===============================Part No :SON-T 250W/400W===============================Additional Specification :HIGH PRESSURE SODIUM VAPOUR LAMPSON-T 250W/400W USED IN SURFACE INS SYSTEM TOP LIGHT"))
delimiters1=c('"',"\r\n",'-','=',';')
delimiters2=c('*',',',':')
library(dplyr)
library(stringr)
dat <- mydata_sample %>%
mutate(x1 = str_remove_all(x1, regex(str_c("\\b",delimiters1, "\\b", collapse = '|'), ignore_case = T)))
dat <- mydata_sample %>%
mutate(x1 = str_remove_all(x1, regex(str_c("\\b",delimiters2, "\\b", collapse = '|'), ignore_case = T)))
####Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
wordstoremove=c('Mnfr','MNFR',"VAPOURTYPEHIGH",'LHZZ07133099MNFR',"BJHF","BJOS",
"BGEMF","BJIR","LIGHTING","FFT","FOR","ACCOMMODATIONQUANTITY","Cat",
"Ref","No","Type","TYPE","QUANTITY","P/N")
wordstoreplace=c('HAVELLS','Havells','Bajaj','BAJAJGrade A','PHILIPS',
'Philips',"MAKEBAJAJ/CG","philips","Philips/Grade A/Grade A/CG/GEPurchase","CG","Bajaj",
"BAJAJ")
dat1 <- dat%>%
mutate(x1 = str_remove_all(x1, regex(str_c("\\b",wordstoremove, "\\b", collapse = '|'), ignore_case = T)))
dat1=dat1 %>%
mutate(x1=str_replace_all(x1, wordstoreplace, 'Grade A'),ignore_case = T)
###Warning message:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
longer object length is not a multiple of shorter object length

The regex is failing because you need to escape all special characters. See the differences here:
# orig delimiters1=c('"', "\r\n", '-', '=', ';')
delimiters1=c('\\"', "\r\n", '-', '\\=', ';')
# orig delimiters2=c('*', ',', ':')
delimiters2=c('\\*', ',', '\\:')
For the str_replace_all() you need the words to be a single string separated by a | rather than a vector of 12
wordstoreplace <-
c('HAVELLS','Havells','Bajaj','BAJAJGrade A','PHILIPS',
'Philips',"MAKEBAJAJ/CG","philips","Philips/Grade A/Grade A/CG/GEPurchase","CG","Bajaj",
"BAJAJ") %>%
paste0(collapse = "|")
# "HAVELLS|Havells|Bajaj|BAJAJGrade A|PHILIPS|Philips|MAKEBAJAJ/CG|philips|Philips/Grade A/Grade A/CG/GEPurchase|CG|Bajaj|BAJAJ"
This then runs without throwing an error
dat1 <-
dat %>%
mutate(
x1 =
str_remove_all(x1, regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"), ignore_case = T)),
x1 = str_replace_all(x1, wordstoreplace, "Grade A")
)

Related

Removing rows containing special characters

I am working on filtering out a massive dataset that reads in as a list. I need to filter out special markings and am getting stuck on some of them. Here is what I currently have:
library(R.utils)
library(stringr)
gunzip("movies.list.gz") #open file
movies <- readLines("movies.list") #read lines in
movies <- gsub("[\t]", '', movies) #remove tabs (\t)
#movies <- gsub(, '', movies)
a <- movies[!grepl("\\{", movies)] # removed any line that contained special character {
b <- a[!grepl("\\(V)", a)] #remove porn?
c <- b[!grepl("\\(TV)", b)] #remove tv
d <- c[!grepl("\\(VG)", c)] #remove video games
e <- d[!grepl("\\(\\?\\?\\?\\?\\)", d)] #remove anyhting with unknown date ex (????)
f <- e[!grepl("\\#)", e)]
g <- e[!grepl("\\!)", f)]
i <- data.frame(g)
i <- i[-c(1:15),]
i <- data.frame(i)
i$Date <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 2)
i$Title <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 1)
I still need to clean it up a bit, and remove the original column (i) but from the output you can see that it is not removing the special characters ! or #
> head(i)
i Date Title
1 "!Next?" (1994)1994-1995 1994-1995 "!Next?"
2 "#1 Single" (2006)2006-???? 2006-???? "#1 Single"
3 "#1MinuteNightmare" (2014)2014-???? 2014-???? "#1MinuteNightmare"
4 "#30Nods" (2014)2014-2015 2014-2015 "#30Nods"
5 "#7DaysLater" (2013)2013-???? 2013-???? "#7DaysLater"
6 "#ATown" (2014)2014-???? 2014-???? "#ATown"
What I actually want to do is remove the entire rows containing those special characters. Everything I have tried has thrown errors. Any suggestions?
You could sub anything that is not alphanumeric or a "-" or "()" like this:
gsub("[^A-Za-z()-]", "", row)
In order to remove the rows you can try something like the one below:
data[!grepl(pattern = "[#!]", x = data)]
In case you want to remove all the rows with special characters you can use the code suggested by #luke1018 using grepl:
data[!grepl(pattern = "[^A-Za-z0-9-()]", x = data)]

Difficulty Cleaning a Data Frame Using Regexp in R

I am relatively new to R and having difficulty cleaning up a data frame using regex.
One of the columns of that data frame has strings such as:
NUMERO_APPEL
1 NNA
2 VQ-40989
3 41993
4 41993
5 42597
6 VQ-42597
7 DER8
8 40001-2010
I would like to extract the 5 consecutive digits of the strings that have the following format and only the following format, all other strings will be replaced by NAs.
AO-11111
VQ-11111
11111
** Even if Case 8 contains 5 consecutive numbers, it will be replaced by NA as well... Furthermore, a more than or less than 5 digits long number would also be replaced by NA.
Note that the 5 consecutive digits could be any number [0-9], but the characters 'AO-' and 'VQ-' are fixed (i.e. 'AO ' or 'VE-' would be replaced to NA as well.)
This is the code that I currently have:
# Declare a Function that Extracts the 1st 'n' Characters Starting from the Right!
RightSubstring <- function(String, n) {
substr(String, nchar(String)-n+1, nchar(String))
}
# Declare Function to Remove NAs in Specific Columns!
ColRemNAs <- function(DataFrame, Column) {
CompleteVector <- complete.cases(DataFrame, Column)
return(DataFrame[CompleteVector, ])
Contrat$NUMERO_APPEL <- RightSubstring(as.character(Contrat$NUMERO_APPEL), 5)
Contrat$NUMERO_APPEL <- gsub("[^0-9]", NA, Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL <- as.numeric(Contrat$NUMERO_APPEL)
# Efface les Lignes avec des éléments NAs.
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_COMMANDE)
Contrat <- ColRemNAs(Contrat, Contrat$NO_FOURNISSEUR)
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_APPEL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_INITIAL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_ACTUEL)
}
Thanks in advance. Hope my explanations were clear!
Here is a base R solution which will match 5 digits occurring only in the following three forms:
AO-11111
VQ-11111
11111
I use this regular expression to match the five digits:
^((AQ|VQ)-)?(\\d{5})$
Strings which match begin with an optional AQ- or VQ-, and then are followed by 5 consecutive digits, after which the string must terminate.
The following code substitutes all matching patterns with the 5 digits found, and stores NA into all non-matching patterns.
ind <- grep("^((AQ|VQ)-)?(\\d{5})$", Contrat$NUMERO_APPEL, value = FALSE)
Contrat$NUMERO_APPEL <- gsub("^(((AQ|VQ)-)?(\\d{5}))$", "\\4", Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL[-ind] <- NA
For more reading see this SO post.
library(dplyr)
library(stringi)
df %>%
mutate(NUMERO_APPEL.fix =
NUMERO_APPEL %>%
stri_extract_first_regex("[0-9]{5}") %>%
as.numeric)

Truncate words within each element of a character vector in R

I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate words in each element so that maximum word length is 5 characters.
For example:
a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
"Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)
head(df)
file text
1 1 Words longer than five characters should be truncated
2 2 Words shorter than five characters should not be modified
And this is what I'm trying to get:
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
I've tried using strsplit() and strtrim() to modify each word (based in part on split vectors of words by every n words (vectors are in a list)):
x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than" "five" "chara" "shoul" "be" "trunc" "Words" "short" "than"
[12] "five" "chara" "shoul" "not" "be" "modif"
But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.
Is there a way to do this using gsub and regex?
If you're looking to utilize gsub to perform this task:
> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=T)
> df
# file text
# 1 1 Words longe than five chara shoul be trunc
# 2 2 Words short than five chara shoul not be modif
You were on the right track. In order for your idea to work, however, you have to do the split/trim/combine for each row separated. Here's a way to do it. I was very verbose on purpose, to make it clear, but you can obviously use less lines.
df$text <- sapply(df$text, function(str) {
str <- unlist(strsplit(str, " "))
str <- strtrim(str, 5)
str <- paste(str, collapse = " ")
str
})
And the output:
> df
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
The short version is
df$text <- sapply(df$text, function(str) {
paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")
})
Edit:
I just realized you asked if it is possible to do this using gsub and regex. Even though you don't need those for this, it's still possible, but harder to read:
df$text <- sapply(df$text, function(str) {
str <- unlist(strsplit(str, " "))
str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
str <- paste(str, collapse = " ")
str
})
The regex matches anything that appears after 5 characters and replaces those with nothing. perl = TRUE is necessary to enable the regex lookbehind ((?<=.{5})).

Using regular expressions in R to extract information from string

I searched the stack overflow a little and all I found was, that regex in R are a bit tricky and not convenient compared to Perl or Python.
My problem is the following. I have long file names with informations inside. The look like the following:
20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw
20150416_QEP1_EXT_GR_1237_hs_IP_NON_060
I want to extract the parts from the filename and convert them conveniently into values, for example the first part is a date, the second is machine abbreviation, the next an institute abbreviation, group abbreviation, sample number(s) etc...
What I do at the moment is constructing a regex, to make (almost) sure, I grep the correct part of the string:
regex <- '([:digit:]{8})_([:alnum:]{1,4})_([:upper:]+)_ etc'
Then I use sub to save each snipped into a variable:
date <- sub(regex, '\\1', filename)
machine <- sub(regex, '\\2', filename)
etc
This works, if the filename has the correct convention. It is overall very hard to read and I am search for a more convenient way of doing the work. I thought about splitting the filename by _ and accessing the string by index might be a good solution. But sometimes, since the filenames often get created by hand, there are terms missing or additional information in the names and I am looking for a better solution to this.
Can anyone suggest a better way of doing so?
EDIT
What I want to create is an object, which has all the information of the filenames extracted and accessible... such as my_object$machine or so....
The help page for ?regex actually gives an example that is exactly equivalent to Python's re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds") (as per your comment):
## named capture
notables <- c(" Ben Franklin and Jefferson Davis",
"\tMillard Fillmore")
#name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
gregexpr(name.rex, notables, perl = TRUE)[[2]]
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
parse.one(notables, parsed)
The normal way (i.e. the R way) to extract from a string is the following:
text <- "Malcolm Reynolds"
x <- gregexpr("\\w+", text) #Don't forget to escape the backslash
regmatches(text, x)
[[1]]
[1] "Malcolm" "Reynolds"
You can use however Perl-style group naming by using argument perl=TRUE:
regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
However regmatches does not support it, hence the need to create your own function to handle that, which is given in the help page:
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
Applied to your example:
text <- "Malcolm Reynolds"
x <- regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
parse.one(text, x)
first_name last_name
[1,] "Malcolm" "Reynolds"
To go back to your initial problem:
filenames <- c("20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw", "20150416_QEP1_EXT_GR_1237_hs_IP_NON_060")
regex <- '(?P<date>[[:digit:]]{8})_(?P<machine>[[:alnum:]]{1,4})_(?P<whatev>[[:upper:]]+)'
x <- regexpr(regex,filenames,perl=TRUE)
parse.one(filenames,x)
date machine whatev
[1,] "20150416" "QEP1" "EXT"
[2,] "20150416" "QEP1" "EXT"
[3,] "20150416" "QEP1" "EXT"
[4,] "20150416" "QEP1" "EXT"

Index of character position in string -- str_locate() and gregexpr() not working

I'm trying to find the string position for a "(" parenthesis character in order to make a new variable, but my usual methods aren't working. Both str_locate() and grepexpr() return an error about invalid regex syntax: invalid regular expression '(', reason 'Missing ')''
This could have been a solution: which(strsplit(df$var, "")[[1]] == "(", but it doesn't work for making new variables.
Here are some reproducible examples:
txt <- "some text (in parentheses)"
df <- structure(list(number = 1:3, txt = c("text (in parentheses)",
"some text (in parentheses)", "some more text (in parentheses)"
)), .Names = c("number", "txt"), row.names = c(NA, 3L), class = "data.frame")
df
## find position of 'm' -- these all work
stringr::str_locate(txt, "m")[[1]]
gregexpr("m", txt)[[1]][1]
## find position of '(' -- now they don't work
stringr::str_locate(txt, "(")[[1]]
gregexpr("(", txt)[[1]][1]
## find position of '(' -- this works for a character string...
which(strsplit(txt, "")[[1]] == "(")
## ...but not for making a new variable
df$newVar <- which(strsplit(df$txt, "")[[1]] == "(")
df #wrong
As pointed out in the comments, these both will work
df$newVar <- str_locate(df$txt, "\\(")[, 1]
Or
df$newVar <- sapply(gregexpr("\\(", df$txt), '[', 1)