R regex matching for tweet pattern - regex

I am trying to use the regex feature in R to parse some tweet text into its key words. I have the following code.
sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub("\\d+", "", sentence)
sentence = tolower(sentence)
However, one of my sentences has the sequence "\ud83d\udc4b". THe parsing fails for this sequence (the error is "invalid input in utf8towcs"). I would like to replace such sequences with "". I tried substituting the regex "\u+", but that did not match. What is the regex I should use to match this sequence? Thanks.

I think you want something like this,
> s <- "\ud83d\udc4b Delta"
> Encoding(s)
[1] "UTF-8"
> iconv(s, "ASCII", sub="")
[1] " Delta"
> f <- iconv(s, "ASCII", sub="")
> sentence = tolower(f)
> sentence
[1] " delta"

> sentence = RemoveNotASCII(sentence)
A function to remove not ASCII characters below.
RemoveNotASCII <- function#Remove all non ASCII characters
### remove column by columns non ASCII characters from a dataframe
(
x ##<< dataframe
){
n <- ncol(x)
z <- list()
for (j in 1:n) {
y = as.character(x[,j])
if (class(y)=="character") {
Encoding(y) <- "latin1"
y <- iconv(y, "latin1", "ASCII", sub="")
}
z[[j]] <- y
}
z = do.call("cbind.data.frame", z)
names(z) <- names(x)
return(z)
### Dataframe with non ASCII characters removed
}

The qdapRegex package has the rm_non_ascii function to handle this:
library(qdapRegex)
tolower(rm_non_ascii(s))
## [1] "delta"

Related

Regex for extracting all words between word and character

i know basic of regex performing with R. But here i have a file like :
**[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981
[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767**
I wanted to extract timestamp alongwith all the SERVICE_ID in that line.
So, my expected output is:
[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981
[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134
The code which I tried was only extracting one SERVICE_ID.
library(qdapRegex)
a <- readLines("C:\\MY_FOLDER\\vinita\\sample.txt")
testi <- rm_between(a,"SERVICE_ID",",",extract = T)
We replace the 2 or more , with " " to get 'str2', then using regex lookarounds, we match one or more space (\\s+) that follows the ]) followed by characters (.*) till the end of the string, replace it with "" so that we can extract the [2016-04..,03] part. From the 'str2', we extract the substrings "SERVICE_ID=" followed by numbers (\\d+) into a list, paste them together and finally paste it with the 'str3'.
library(stringr)
str2 <- gsub(",{2,}", " ", str1)
str3 <- sub("(?<=\\])\\s+.*", "", str2, perl = TRUE)
paste(str3, sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse=" "))
#[1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
#[2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134"
data
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str2 <- gsub(",{2,}", " ", str1)
str4 <- sub("\\].*","",str2,perl = TRUE)
str5 <- sub("\\[","",str4,perl = T)
service_ids <- sapply(str_extract_all(str2,"SERVICE_ID=\\d+"), function(x){paste(x,collapse = " ")})
net <- cbind(str5,service_ids)
Output:

Grep in R matching getting non-digits

I need to get the non-digit part of a character. I have problem with this regex in R (which according to regexpal should work):
grep("[\\D]+", "PC 17610", value = TRUE, perl = F)
It should return "PC " while it returns character(0)
Other test cases:
grep("[\\D]+", "STON/O2. 3101282 ", value = TRUE, perl = F)
# should return "STON/O2."
grep("[\\D]+", "S.C./A.4. 23567", value = TRUE, perl = F)
# should return "S.C./A.4."
grep("[\\D]+", "C.A. 31026", value = TRUE, perl = F)
# should return "C.A."
Update:
The job is to divide column "Ticket" (from the Titanic disaster database) into "TicketNumber" and "TicketSeries" columns. As for now, Ticket holds below e.g. values: "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803". So the ticket number column is for the first record 21171 and ticket series column "A/5", and so on for next records.
For the record "113803", TicketNumber should be "113803" and TicketSeries NA.
Help appreciated,
Thanks!
Use sub instead, utilizing the \S regex token to match any non-whitespace characters.
x <- c('PC 17610', 'STON/O2. 3101282 ', 'S.C./A.4. 23567', 'C.A. 31026')
sub('(\\S+).*', '\\1', x)
# [1] "PC" "STON/O2." "S.C./A.4." "C.A."
EDIT
Otherwise, if you want to return NA for invalid or empty matches, I suppose you could do ...
x <- c('PC 17610', 'STON/O2. 3101282 ', 'S.C./A.4. 23567', 'C.A. 31026', '31026')
r <- regmatches(x, gregexpr('^\\S+(?=\\s+)', x, perl=T))
unlist({r[sapply(r, length)==0] <- NA; r})
# [1] "PC" "STON/O2." "S.C./A.4." "C.A." NA
You can use str_extract
library(stringr)
str_extract(x, '\\S+(?=\\s+)')
#[1] "PC" "STON/O2." "S.C./A.4." "C.A." NA
data
x <- c('PC 17610', 'STON/O2. 3101282 ', 'S.C./A.4. 23567',
'C.A. 31026', '31026')

strsplit inconsistent with gregexpr

A comment on my answer to this question which should give the desired result using strsplit does not, even though it seems to correctly match the first and last commas in a character vector. This can be proved using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
The theory of #Aprillion is exact, from R documentation:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
In other words, at each iteration ^ will match the begining of a new string (without the precedent items.)
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (Thanks to #JoshO'Brien for the link.)

Extract string between parenthesis in R

I have to extract values between a very peculiar feature in R. For eg.
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
This is my example string and I wish to extract text between {[0-9]: and } such that my output for the above string looks like
## output should be
"0987617820" "q312132498s7yd09f8sydf987s6df8797yds9f87098", "{112:123123214321}" "20:asdasd3214213"
This is a horrible hack and probably breaks on your real data. Ideally you could just use a parser but if you're stuck with regex... well... it's not pretty
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
# split based on }{ allowing for newlines and spaces
out <- strsplit(a, "\\}[[:space:]]*\\{")
# Make a single vector
out <- unlist(out)
# Have an excess open bracket in first
out[1] <- substring(out[1], 2)
# Have an excess closing bracket in last
n <- length(out)
out[length(out)] <- substring(out[n], 1, nchar(out[n])-1)
# Remove the number colon at the beginning of the string
answer <- gsub("^[0-9]*\\:", "", out)
which gives
> answer
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
You could wrap something like that in a function if you need to do this for multiple strings.
Using PERL. This way is a bit more robust.
a = "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}{3:{112:123123214321}}{4:20:asdasd3214213}"
foohacky = function(str){
#remove opening bracket
pt1 = gsub('\\{+[0-9]:', '##',str)
#remove a closing bracket that is preceded by any alphanumeric character
pt2 = gsub('([0-9a-zA-Z])(\\})', '\\1',pt1, perl=TRUE)
#split up and hack together the result
pt3 = strsplit(pt2, "##")[[1]][-1]
pt3
}
For example
> foohacky(a)
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
It also works with nesting
> a = "{1:0987617820}{{3:{112:123123214321}}{4:{20:asdasd3214213}}"
> foohacky(a)
[1] "0987617820" "{112:123123214321}" "{20:asdasd3214213}"
Here's a more general way, which returns any pattern between {[0-9]: and } allowing for a single nest of {} inside the match.
regPattern <- gregexpr("(?<=\\{[0-9]\\:)(\\{.*\\}|.*?)(?=\\})", a, perl=TRUE)
a_parse <- regmatches(a, regPattern)
a <- unlist(a_parse)

Extract 2nd to last word in string

I know how to do it in Python, but can't get it to work in R
> string <- "this is a sentence"
> pattern <- "\b([\w]+)[\s]+([\w]+)[\W]*?$"
Error: '\w' is an unrecognized escape in character string starting "\b([\w"
> match <- regexec(pattern, string)
> words <- regmatches(string, match)
> words
[[1]]
character(0)
sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', string)
#[1] "a"
which reads - be non-greedy and look for anything until you get to the sequence - some word characters + some non-word characters + some word characters + optional non-word characters + end of string, then extract the first collection of word characters in that sequence
Non-regex solution:
string <- "this is a sentence"
split <- strsplit(string, " ")[[1]]
split[length(split)-1]
Python non regex version
spl = t.split(" ")
if len(spl) > 0:
s = spl[len(spl)-2]