Difficulty Cleaning a Data Frame Using Regexp in R - regex

I am relatively new to R and having difficulty cleaning up a data frame using regex.
One of the columns of that data frame has strings such as:
NUMERO_APPEL
1 NNA
2 VQ-40989
3 41993
4 41993
5 42597
6 VQ-42597
7 DER8
8 40001-2010
I would like to extract the 5 consecutive digits of the strings that have the following format and only the following format, all other strings will be replaced by NAs.
AO-11111
VQ-11111
11111
** Even though case 8 contains 5 consecutive digits, it should be replaced by NA as well. Likewise, a number with more or fewer than 5 digits should also be replaced by NA.
Note that the 5 consecutive digits can be any digits [0-9], but the prefixes 'AO-' and 'VQ-' are fixed (i.e. 'AO ' or 'VE-' should be replaced by NA as well).
This is the code that I currently have:
# Declare a Function that Extracts the 1st 'n' Characters Starting from the Right!
RightSubstring <- function(String, n) {
  substr(String, nchar(String) - n + 1, nchar(String))
}
# Declare Function to Remove NAs in Specific Columns!
ColRemNAs <- function(DataFrame, Column) {
  CompleteVector <- complete.cases(DataFrame, Column)
  return(DataFrame[CompleteVector, ])
}
Contrat$NUMERO_APPEL <- RightSubstring(as.character(Contrat$NUMERO_APPEL), 5)
Contrat$NUMERO_APPEL <- gsub("[^0-9]", NA, Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL <- as.numeric(Contrat$NUMERO_APPEL)
# Remove the rows that contain NA elements.
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_COMMANDE)
Contrat <- ColRemNAs(Contrat, Contrat$NO_FOURNISSEUR)
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_APPEL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_INITIAL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_ACTUEL)
Thanks in advance. Hope my explanations were clear!

Here is a base R solution which will match 5 digits occurring only in the following three forms:
AO-11111
VQ-11111
11111
I use this regular expression to match the five digits:
^((AO|VQ)-)?(\\d{5})$
Strings which match begin with an optional AO- or VQ-, followed by exactly 5 consecutive digits, after which the string must end.
The following code replaces each matching string with the 5 digits found, and stores NA for every non-matching string.
# grepl() gives a logical index, which stays safe even when nothing matches
ind <- grepl("^((AO|VQ)-)?(\\d{5})$", Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL <- gsub("^(((AO|VQ)-)?(\\d{5}))$", "\\4", Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL[!ind] <- NA
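As a quick check, the same idea can be run on the sample values from the question. This is only a sketch: the vector appel is typed in by hand here, and it assumes the fixed prefixes are AO- and VQ-, as the question describes.

```r
# Sample values copied by hand from the question's NUMERO_APPEL column
appel <- c("NNA", "VQ-40989", "41993", "41993", "42597",
           "VQ-42597", "DER8", "40001-2010")

# Optional AO- or VQ- prefix, then exactly five digits, nothing else
pattern <- "^((AO|VQ)-)?(\\d{5})$"

hit <- grepl(pattern, appel)
# Group 3 holds the five digits; non-matching strings become NA
clean <- ifelse(hit, sub(pattern, "\\3", appel), NA_character_)
clean <- as.numeric(clean)
print(clean)
```

Indexing with the logical vector from grepl, rather than with negative integer positions, keeps the NA step well defined even when no string matches at all.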

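library(dplyr)
library(stringi)
# The anchored pattern keeps only AO-ddddd, VQ-ddddd or ddddd;
# an unanchored "[0-9]{5}" would also pull 40001 out of "40001-2010".
df %>%
  mutate(NUMERO_APPEL.fix =
           stri_match_first_regex(NUMERO_APPEL, "^(?:AO-|VQ-)?([0-9]{5})$")[, 2] %>%
           as.numeric)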

Related

Removing rows containing special characters

I am working on filtering out a massive dataset that reads in as a list. I need to filter out special markings and am getting stuck on some of them. Here is what I currently have:
library(R.utils)
library(stringr)
gunzip("movies.list.gz") #open file
movies <- readLines("movies.list") #read lines in
movies <- gsub("[\t]", '', movies) #remove tabs (\t)
#movies <- gsub(, '', movies)
a <- movies[!grepl("\\{", movies)] # removed any line that contained special character {
b <- a[!grepl("\\(V)", a)] #remove porn?
c <- b[!grepl("\\(TV)", b)] #remove tv
d <- c[!grepl("\\(VG)", c)] #remove video games
e <- d[!grepl("\\(\\?\\?\\?\\?\\)", d)] #remove anything with unknown date ex (????)
f <- e[!grepl("\\#)", e)]
g <- e[!grepl("\\!)", f)]
i <- data.frame(g)
i <- i[-c(1:15),]
i <- data.frame(i)
i$Date <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 2)
i$Title <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 1)
I still need to clean it up a bit, and remove the original column (i) but from the output you can see that it is not removing the special characters ! or #
> head(i)
i Date Title
1 "!Next?" (1994)1994-1995 1994-1995 "!Next?"
2 "#1 Single" (2006)2006-???? 2006-???? "#1 Single"
3 "#1MinuteNightmare" (2014)2014-???? 2014-???? "#1MinuteNightmare"
4 "#30Nods" (2014)2014-2015 2014-2015 "#30Nods"
5 "#7DaysLater" (2013)2013-???? 2013-???? "#7DaysLater"
6 "#ATown" (2014)2014-???? 2014-???? "#ATown"
What I actually want to do is remove the entire rows containing those special characters. Everything I have tried has thrown errors. Any suggestions?
You could sub anything that is not alphanumeric or a "-" or "()" like this:
gsub("[^A-Za-z()-]", "", row)
In order to remove the rows you can try something like the one below:
data[!grepl(pattern = "[#!]", x = data)]
In case you want to remove all the rows with special characters you can use the code suggested by @luke1018 using grepl (the hyphen is placed last in the class so that it is read as a literal character):
data[!grepl(pattern = "[^A-Za-z0-9()-]", x = data)]
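A minimal sketch of the row-removal idea on a few lines shaped like the movie list. The vector movies below is made up for illustration (the last two titles are not taken from the question), and the names has_special and kept are hypothetical.

```r
# Made-up sample lines in the same shape as the movie list
movies <- c('"!Next?" (1994)', '"#1 Single" (2006)',
            "Casablanca (1942)", "Alien (1979)")

# TRUE for every line that contains a '#' or a '!'
has_special <- grepl("[#!]", movies)

# Keep only the lines without those characters
kept <- movies[!has_special]
print(kept)
```

Note that a whitelist such as [^A-Za-z0-9()-] also flags spaces and quotation marks, so on lines like these it would drop everything; the narrower [#!] class is probably closer to what the question asks for.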

How to find the longest string in a text using regex in R

Given a string x, I can count the number of words in it using gregexpr("[A-Za-z]\\w+", x).
> x<-"\n\n\n\n\n\nMasters Publics\n\n\n\n\n\n\n\n\n\n\n\n\nMasters Universitaires et Prives au Maroc\n\n\n\n\n\n\n\n\\n\n\n\n\nMasters Par Ville\n\n\n\n\n\n\n\n\n\n\n\n\n"
> sapply(gregexpr("[A-Za-z]\\w+", x), function(x) sum(x > 0))
[1] 11
However, how can I retrieve the number of words in the longest line (words separated by spaces, not by \n) using regex in R?
In this example it would be "Masters Universitaires et Prives au Maroc", which has 6 words.
Thanks in advance.
I would solve it with
x <- "\n\n\n\n\n\nMasters Publics\n\n\n\n\n\n\n\n\n\n\n\n\nMasters Universitaires et Prives au Maroc\n\n\n\n\n\n\n\n\\n\n\n\n\nMasters Par Ville\n\n\n\n\n\n\n\n\n\n\n\n\n"
max(nchar(gsub("[^ ]+", "", unlist(strsplit(trimws(x), "\n+"))))) + 1
Split a trimmed string into lines, unlist the result, remove all characters other than a space, get the longest item and add one. The [^ ]+ is a regex that matches one or more (due to the + quantifier) characters other than (as [^...] is a negated character class) a space.
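The same steps can also be sketched by counting words directly instead of counting spaces, which additionally lets you recover the longest line itself. The input here is a shortened version of the string from the question, and the names lines, words_per_line and longest are only illustrative.

```r
# Shortened version of the input from the question
x <- "\n\nMasters Publics\n\n\nMasters Universitaires et Prives au Maroc\n\nMasters Par Ville\n\n"

# Drop leading/trailing whitespace, then split on runs of newlines
lines <- unlist(strsplit(trimws(x), "\n+"))

# Split each line on runs of spaces and count the pieces
words_per_line <- lengths(strsplit(lines, " +"))

longest <- lines[which.max(words_per_line)]
print(max(words_per_line))
print(longest)
```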
Load the package
library(stringr)
Create a new dataset, extracting and splitting the phrases
data <- unlist(str_split(x, pattern="\n", n = Inf))
# TRUE for the non-empty phrases
index <- nchar(data) != 0
# extract the maximum length of the phrase
max(sapply(gregexpr("\\W+", data[index]), length) + 1)
[1] 6
# just checking
data[index]
[1] "Masters Publics"
[2] "Masters Universitaires et Prives au Maroc"
[3] "\\n"
[4] "Masters Par Ville"

Truncate words within each element of a character vector in R

I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate words in each element so that maximum word length is 5 characters.
For example:
a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
"Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)
head(df)
file text
1 1 Words longer than five characters should be truncated
2 2 Words shorter than five characters should not be modified
And this is what I'm trying to get:
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
I've tried using strsplit() and strtrim() to modify each word (based in part on split vectors of words by every n words (vectors are in a list)):
x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than" "five" "chara" "shoul" "be" "trunc" "Words" "short" "than"
[12] "five" "chara" "shoul" "not" "be" "modif"
But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.
Is there a way to do this using gsub and regex?
If you're looking to utilize gsub to perform this task:
> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=T)
> df
# file text
# 1 1 Words longe than five chara shoul be trunc
# 2 2 Words short than five chara shoul not be modif
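For readers who find the lookahead and \K hard to follow, roughly the same result can be sketched with a capture group. This is an alternative, not the pattern above: it uses \w, so digits and underscores count as word characters too.

```r
text <- c("Words longer than five characters should be truncated",
          "Words shorter than five characters should not be modified")

# \b(\w{5}) captures the first five characters of a word;
# \w+ then consumes whatever is left of that word and is dropped
short <- gsub("\\b(\\w{5})\\w+", "\\1", text, perl = TRUE)
print(short)
```

Words of five characters or fewer are left alone because \w+ needs at least one extra character to match.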
You were on the right track. In order for your idea to work, however, you have to do the split/trim/combine for each row separately. Here's a way to do it. I was very verbose on purpose, to make it clear, but you can obviously use fewer lines.
df$text <- sapply(df$text, function(str) {
str <- unlist(strsplit(str, " "))
str <- strtrim(str, 5)
str <- paste(str, collapse = " ")
str
})
And the output:
> df
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
The short version is
df$text <- sapply(df$text, function(str) {
paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")
})
Edit:
I just realized you asked if it is possible to do this using gsub and regex. Even though you don't need those for this, it's still possible, but harder to read:
df$text <- sapply(df$text, function(str) {
str <- unlist(strsplit(str, " "))
str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
str <- paste(str, collapse = " ")
str
})
The regex matches anything that appears after 5 characters and replaces those with nothing. perl = TRUE is necessary to enable the regex lookbehind ((?<=.{5})).
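The split/trim/paste steps can also be wrapped in a small helper. This is a sketch only: truncate_words is a hypothetical name, and vapply is swapped in for sapply so the result is always an unnamed character vector.

```r
# Hypothetical helper: truncate every word of every string to n characters
truncate_words <- function(s, n = 5) {
  vapply(strsplit(s, "\\s+"),
         function(w) paste(strtrim(w, n), collapse = " "),
         character(1))
}

texts <- c("Words longer than five characters should be truncated",
           "Words shorter than five characters should not be modified")
result <- truncate_words(texts)
print(result)
```

With a data frame like the one in the question, this could be used as df$text <- truncate_words(df$text).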

regular expression string between two [ ] in R

I am stuck on regular expressions yet again but this time in R.
The problem I am facing is that I have a vector and would like to extract the string between two [] for each row. However, sometimes there is more than one [ ] pair in a row, so I end up recovering every bracketed string in that row. In all cases I need only the first bracketed string, not the second or later ones. The example data frame I have is:
comp541_c0_seq1 gi|356502740|ref|XP_003520174.1| PREDICTED: uncharacterized protein LOC100809655 [Glycine max]
comp5041_c0_seq1 gi|460370622|ref|XP_004231150.1| [Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]
The code I have been using to recover the strings and store them in a new column of the data frame is:
pattern <- "\\[\\w*\\s\\w*]"
match<- gregexpr(pattern, data$Description)
data$Species <- regmatches(data$Description, match)
The structure of the data frame that I am using is:
'data.frame': 67911 obs. of 6 variables:
$ Column1 : Factor w/ 67911 levels "comp100012_c0_seq1 ",..: 3344 8565 17875 18974 19059 19220 21429 29791 40214 48529 ...
$ Description : Factor w/ 26038 levels "0.0","1.13142e-173",..: NA NA NA NA NA NA NA NA 7970 NA ...
So the problem with my pattern match is that it returns a vector (Species) where some of the rows have:
[Glycine max] # this is good
c("[Solanum lycopersicum]", "[Solanum lycopersicum]") # I only need one set returned
What I would like is:
[Glycine max]
[Solanum lycopersicum]
I have been trying every way I can with the regular expression. Would anyone know how to improve what I have to just extract the first instance of the string within [ ]?
Thanks in advance.
I think this example should be illuminating to your problems:
txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here")
pattern <- "\\[\\w*\\s\\w*]"
mat <- regexpr(pattern,txt)
#[1] 1 1 -1
#attr(,"match.length")
#[1] 14 15 -1
txt[mat != -1] <- regmatches(txt, mat)
txt
#[1] "[Bracket text]" "[Bracket text1]" "No brackets in here"
Or if you want to do it all in one go and return NA values for non-matches, try:
ifelse(mat != -1, regmatches(txt,mat), NA)
#[1] "[Bracket text]" "[Bracket text1]" NA
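Applied to descriptions shaped like the ones in the question, the regexpr approach might look like the sketch below. The vector desc is typed in by hand (with an NA added to mimic the missing values in the Description column), and species and found are illustrative names. If the real column is a factor, it would need as.character() first.

```r
desc <- c(
  "gi|356502740|ref|XP_003520174.1| PREDICTED: uncharacterized protein LOC100809655 [Glycine max]",
  "gi|460370622|ref|XP_004231150.1| [Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]",
  NA
)

pattern <- "\\[\\w*\\s\\w*]"
m <- regexpr(pattern, desc)   # first match only, one position per element

# Start with all NA, then fill in the elements that actually matched
species <- rep(NA_character_, length(desc))
found <- !is.na(m) & as.integer(m) != -1L
species[found] <- regmatches(desc, m)
print(species)
```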
Using the base-R facilities for string manipulation is just making life hard for yourself. Use rebus to create the regular expression, and stringi (or stringr) to get the matches.
library(rebus)
library(stringi)
txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here") # thanks, thelatemail
pattern <- OPEN_BRACKET %R%
alnum(1, Inf) %R%
space(1, Inf) %R%
alnum(1, Inf) %R%
"]"
stri_extract_first_regex(txt, pattern)
## [1] "[Bracket text]" "[Bracket text1]" NA
I suspect that you probably don't want to keep those square brackets. Try this variant:
pattern <- OPEN_BRACKET %R%
capture(
alnum(1, Inf) %R%
space(1, Inf) %R%
alnum(1, Inf)
) %R%
"]"
stri_match_first_regex(txt, pattern)[, 2]
## [1] "Bracket text" "Bracket text1" NA

R Conditional Replace/Trim with Fill (regex,gsub,gregexpr,regmatches)

I have a question involving conditional replace.
I essentially want to find every string of numbers and, for every consecutive digit after 4, replace it with a space.
I need the solution to be vectorized and speed is essential.
Here is a working (but inefficient solution):
data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))),
stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456 098765 1111 ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456 098765 111111 ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12 098 111 ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c(" 12345 67890 ",NA)
x2 <- data[,"input"]
x2
p1 <- "([0-9]+)"
m1 <- gregexpr(p1, x2,perl = TRUE)
nchar1 <- lapply(regmatches(x2, m1), function(x){
if (length(x)==0){ x <- NA } else ( x <- nchar(x))
return(x) })
x3 <- mapply(function(match,length,text,cutoff) {
temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)
for(i in which(temp_comb[,"length"] > cutoff))
{
before <- substr(text, 1, (temp_comb[i,"match"]-1))
middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
middle_space <- paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
after <- substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
text <- paste(before,middle_4,middle_space,after,sep="")
}
return(text)
},match=m1,length=nchar1,text=x2,cutoff=4)
data[,"output"] <- x3
Is there a better way?
I was looking at the help section for regmatches and there was a similar type question, but it was full replacement with blanks and not conditional.
I would have written some alternatives and benchmarked them but honestly I couldn't think of other ways to do this.
Thanks ahead of time for the help!
UPDATE
Fleck,
Using your way but making cutoff an input, I am getting an error for the NA case:
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {
# x <- regmatches(data$input, m)[[4]]
# cutoff <- 4
mapply(function(x, n, cutoff){
formatC(substr(x,1,cutoff), width=-n)
}, x=x, n=nchar(x),cutoff=cutoff)
},cutoff=4)
Here's a fast approach with just one gsub command:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234"
# [6] " 1234 6789 "
The string (?<!\\d) is a negative lookbehind: it matches a position that is not preceded by a digit. The string (\\d{4}) means 4 consecutive digits. Finally, \\d* represents any number of further digits. The part of the string that matches this regex is replaced by the first group (the first 4 digits), so the resulting strings get shorter.
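A small sketch of this gsub call on a hand-made vector, to make the effect on string length visible. The vector x and the name trimmed are illustrative and are not the data frame from the question.

```r
x <- c("FIX: 123456 098765 1111", "12 098 111", NA, "1234567890")

# Keep the first four digits of every digit run, drop the rest of the run
trimmed <- gsub("(?<!\\d)(\\d{4})\\d*", "\\1", x, perl = TRUE)
print(trimmed)

# The first string loses four characters (two per truncated number)
print(nchar(x[1]) - nchar(trimmed[1]))
```

Runs of four digits or fewer are left untouched, and NA stays NA.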
An approach that does not change string length:
matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
  # m[1] is NA for an NA input and -1 when nothing matched
  if (!is.na(m[1]) && m[1] != -1L) {
    len <- attr(m, "match.length")
    for (i in seq_along(m)) {
      # overwrite the i-th match with the same number of spaces
      substr(d, m[i], m[i] + len[i] - 1L) <- strrep(" ", len[i])
    }
  }
  return(d)
}, matches, data$input)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234 "
# [6] " 1234 6789 "
You can do the same in one line (and one space for one digit) with:
gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
details:
(?: # non-capturing group: the two possible entry points
\G # either the position after the last match or the start of the string
(?!\A) # exclude the start of the string position
| # OR
\d{4} # four digits
) # close the non-capturing group
\K # removes all on the left from the match result
\d # a single digit
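A short sketch of the \G pattern on a hand-made vector, checking that every string keeps its original length. The vector x and the name blanked are illustrative only.

```r
x <- c("FIX: 123456 098765 1111", "12 098 111", "1234567890")

# Each digit that follows four digits (or follows the previous match)
# is replaced by exactly one space
blanked <- gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", x, perl = TRUE)
print(blanked)

# One space per removed digit, so the lengths do not change
print(all(nchar(blanked) == nchar(x)))
```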
Here's a way with gregexpr and regmatches
#find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x) {
mapply(function(x, n) formatC(substr(x,1,4), width=-n), x, nchar(x))
})
#combine with original values
data$output2 <- unlist(Map(function(a,b) paste0(a,c(b,""), collapse=""),
regmatches(data$input, m, invert=T), zz))
The difference here is that it turns the NA value into "". We could add in other checks to prevent that or just turn all zero length strings into missing values at the end. I just didn't want to over-complicate the code with safety checks.
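One way to keep the missing values, sketched under the assumption that it is acceptable to run the regex step only on the non-NA entries. The vector input is made up, the names ok, padded, rest and output are illustrative, and paste0 with strrep is swapped in for formatC to do the padding.

```r
input <- c("A 123456 B", NA, "12 345")

ok <- !is.na(input)                 # only non-missing entries are processed
m <- gregexpr("\\d{5,}", input[ok])

# For every long number: first 4 digits, then spaces up to the old width
padded <- lapply(regmatches(input[ok], m), function(hit) {
  paste0(substr(hit, 1, 4), strrep(" ", nchar(hit) - 4))
})

# Text between the matches, to be interleaved with the padded numbers
rest <- regmatches(input[ok], m, invert = TRUE)

output <- input                     # NA entries are carried over unchanged
output[ok] <- unlist(Map(function(a, b) paste0(a, c(b, ""), collapse = ""),
                         rest, padded))
print(output)
```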