Multiple regexpr in one string in R - regex

So I have a really long string and I want to work with multiple matches. I can only seem to get the first position of the first match using regexpr. How can I get multiple positions (more matches) back within the same string?
I am looking for a specific string in HTML source code: the title of an auction (which sits between HTML tags). It proves kind of difficult to find.
So far I use this:
locationstart <- gregexpr("<span class=\"location-name\">", URL)[[1]]+28
locationend <- regexpr("<", substring(URL, locationstart[1], locationstart[1] + 100))
substring(URL, locationstart[1], locationstart[1] + locationend - 2)
That is, I look for a part that comes before a title, capture that position, and from there on look for a "<" indicating that the title has ended. I'm open to more specific suggestions.

Using gregexpr allows for multiple matches.
> x <- c("only one match", "match1 and match2", "none here")
> m <- gregexpr("match[0-9]*", x)
> m
[[1]]
[1] 10
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 1 12
attr(,"match.length")
[1] 6 6
attr(,"useBytes")
[1] TRUE
[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
and if you're looking to extract the match you can use regmatches to do that for you.
> regmatches(x, m)
[[1]]
[1] "match"
[[2]]
[1] "match1" "match2"
[[3]]
character(0)
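To tie this back to the question, here is a minimal sketch; it assumes URL holds the page source as a single string and that each title runs from the end of the span tag up to the next "<" (the example URL below is made up):
URL <- '<span class="location-name">First title</span> ... <span class="location-name">Second title</span>'
m <- gregexpr('<span class="location-name">[^<]+', URL)   # match the tag plus the title text
titles <- regmatches(URL, m)[[1]]
sub('<span class="location-name">', "", titles, fixed = TRUE)  # drop the leading tag
# [1] "First title"  "Second title"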

gregexpr and regmatches, as suggested in Dason's answer, allow extracting multiple instances of a regex pattern from a string. Furthermore, this solution has the advantage of relying exclusively on R's {base} package rather than requiring an additional package.
Nevertheless, I'd like to suggest an alternative solution based on the stringr package. In general, this package makes it easier to work with character strings: it provides most of the functionality of base R's various string-handling functions (not just the regex-related ones) through a set of intuitively named functions with a consistent API. Indeed, stringr functions do not merely replace base R functions; in many cases they add features. For example, the regex-related functions of stringr are vectorized over both the string and the pattern.
Specifically, for the question of extracting multiple matches from a long string, either str_extract_all or str_match_all can be used, as shown below. Depending on whether the input is a single string or a vector of strings, the logic can be adapted using list/matrix subscripts, unlist, or approaches such as lapply and sapply. The point is that the stringr functions return structures from which we can access exactly the pieces we want.
# simulate html input (using bogus html tags to mark the target texts; the demo works
# the same for actual html patterns, the regular expression is just a bit more complex)
htmlInput <- paste("Lorem ipsum dolor<blah>MATCH_ONE<blah> sit amet, purus",
"sollicitudin<blah>MATCH2<blah>mauris, <blah>MATCH Nr 3<blah>vitae donec",
"risus ipsum, aenean quis, sapien",
"in lorem, condimentum ornare viverra",
"suscipit <blah>LAST MATCH<blah> ipsum eget ac. Non senectus",
"dolor mauris tellus, dui leo purus varius")
# str_extract_all() may need a bit of extra work to remove the leading and trailing parts
str_extract_all(htmlInput, "(<blah>)([^<]+)<")
# [[1]]
# [1] "<blah>MATCH_ONE<" "<blah>MATCH2<" "<blah>MATCH Nr 3<" "<blah>LAST MATCH<"
str_match_all(htmlInput, "<blah>([^<]+)<")[[1]][, 2]
# [1] "MATCH_ONE" "MATCH2" "MATCH Nr 3" "LAST MATCH"

Related

Extracting capturing groups from a regex

This regex: (.*?)(?:I[0-9]-)*I3(?:-I[0-9])* matches an expression using multiple groups. The point of the regex is that it captures patterns in pairs, where the first part of the regex has to be followed by the second part.
How can I extract each of these two groups?
library(stringr)
data <- c("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7")
str_extract_all(data, "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*")
Gives me:
[[1]]
[1] "A-B-C-I1-I2-D-E-F-I1-I3" "-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7"
However, I would want something along the lines of:
[[1]]
[1] "A-B-C-I1-I2-D-E-F" [2] "I1-I3"
[[2]]
[1] "D-D-D-D" [2] "I1-I1-I2-I1-I1-I3-I3-I7"
The key here is that the regex matches twice, each time containing 2 groups. I want every match to have a list of its own, and that list to contain 2 elements, one for each group.
You need to wrap a capturing group around the second part of your expression, and if you're using stringr for this task, I would use str_match_all instead to return the captured matches ...
library(stringr)
data <- c('A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7')
mat <- str_match_all(data, '-?(.*?)-((?:I[0-9]-)*I3(?:-I[0-9])*)')[[1]][,2:3]
colnames(mat) <- c('Group 1', 'Group 2')
# Group 1 Group 2
# [1,] "A-B-C-I1-I2-D-E-F" "I1-I3"
# [2,] "D-D-D-D" "I1-I1-I2-I1-I1-I3-I3-I7"

How to allow for arbitrary number of wildcards in regexes?

I have a list of character strings:
> head(g_patterns_clean_strings)
[[1]]
[1] "1FAFA"
[[2]]
[1] "FA,TRFA"
[[3]]
[1] "FAEX"
I am trying to identify specific patterns in these character strings, as such:
library(devtools)
g_patterns_clean <- source_gist("164f798524fd6904236a")[[1]]
g_patterns_clean_strings <- source_gist("af70a76691aacf05c1bb")[[1]]
FA_EX_logic_vector <- grepl(g_patterns_clean_strings, pattern = "(FAEX|EXFA)+")
FA_EX_cluster <- subset(g_patterns_clean, FA_EX_logic_vector)
Let's now say that I want to allow for an arbitrary number of other characters in between FA and EX (or EX and FA), how can I specify that in the regex above?
This is a flexible generalization of @eipi10's answer:
(FA.{0,2}EX|EX.{0,2}FA)
The . matches any character, and the {0,2} quantifier allows between 0 and 2 occurrences of it.
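For instance, with the sample strings shown above, the widened pattern could be used like this (a sketch; only the third string contains both FA and EX):
strings <- c("1FAFA", "FA,TRFA", "FAEX")
grepl("(FA.{0,2}EX|EX.{0,2}FA)", strings)
# [1] FALSE FALSE  TRUE
To allow a truly arbitrary number of characters in between, as literally asked, .{0,2} can simply be replaced by .* .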

Extract location data using regex in R

R newbie here
I have data that looks something like this:
{'id': 19847005, 'profile_sidebar_fill_color': u'http://pbs.foo.com/profile_background', 'profile_text_color': u'333333', 'followers_count': 1105, 'location': u'San Diego, CA', 'profile_background_color': u'9AE4E8', 'listed_count': 43, '009', 'time_zone': u'Pacific Time (US & Canada)', 'protected': False}
I want to extract the location data from this text: San Diego, CA.
I have been trying to use this stringr package to accomplish this, but can't quite get the regex right to capture the city and state. Sometimes state will be present, other times not present.
location_pattern <- "'location':\su'(\w+)'"
rawdata$location <- str_extract(rawdata$user, location_pattern)
You could try
str_extract_all(str1, perl("(?<=location.: u.)[^']+(?=')"))[[1]]
#[1] "San Diego, CA"
Others have given possible solutions, but not explained what likely went wrong with your attempt.
The str_extract function uses POSIX extended regular expressions, which do not understand \w and \s; those are specific to Perl-style regular expressions. You can use the perl function in the stringr package instead and it will then recognize those shortcuts, or you can use [[:space:]] in place of \s and [[:alnum:]_] in place of \w, though more likely you will want something like [[:alpha:], ] or [^'].
Also, R's string parser gets a shot at the string before it is passed to the matching function, so you will need \\s and \\w if you use the perl function (or any other regular-expression function in R). The first \ escapes the second so that a single \ remains in the string to be interpreted as part of the regular expression.
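Putting those two points together, a corrected version of the original attempt might look like this (a sketch; str1 stands in for one element of rawdata$user, and [^']+ is used because the location value contains a comma and a space that \w would not match):
library(stringr)
str1 <- "{'id': 19847005, ..., 'location': u'San Diego, CA', ...}"  # abbreviated sample record
str_match(str1, "'location': u'([^']+)'")[, 2]
# [1] "San Diego, CA"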
It looks like a json string, but if you're not too concerned about that, then perhaps this would help.
library(stringi)
ss <- stri_split_regex(x, "[{}]|u?'|(, '(009')?)|: ", omit=TRUE)[[1]]
(m <- matrix(ss, ncol = 2, byrow = TRUE))
# [,1] [,2]
# [1,] "id" "19847005"
# [2,] "profile_sidebar_fill_color" "http://pbs.foo.com/profile_background"
# [3,] "profile_text_color" "333333"
# [4,] "followers_count" "1105"
# [5,] "location" "San Diego, CA"
# [6,] "profile_background_color" "9AE4E8"
# [7,] "listed_count" "43"
# [8,] "time_zone" "Pacific Time (US & Canada)"
# [9,] "protected" "False"
So now you have the ID names in the left column and the values on the right. It would probably be simple to reassemble the json from this point if need be.
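And the location itself can then be pulled straight from that matrix:
m[m[, 1] == "location", 2]
# [1] "San Diego, CA"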
Also, regarding the json-ness, we can coerce m to a data.frame (or leave it as a matrix), and then use jsonlite::toJSON
library(jsonlite)
json <- toJSON(setNames(as.data.frame(m), c("ID", "Value")))
fromJSON(json)
# ID Value
# 1 id 19847005
# 2 profile_sidebar_fill_color http://pbs.foo.com/profile_background
# 3 profile_text_color 333333
# 4 followers_count 1105
# 5 location San Diego, CA
# 6 profile_background_color 9AE4E8
# 7 listed_count 43
# 8 time_zone Pacific Time (US & Canada)
# 9 protected False

php preg_match_all equivalent

I am looking for an R equivalent to PHP's preg_match_all function.
Objective:
Search a single string (not a vector of several strings) for a regexp pattern
Return a matrix of matches
Example:
Assume the following flat string without delimitation.
"This is a sample string written like a paragraph. In this string two sets of information exist. Each set contains two variables. We want to extract the sets and variables within those sets. Each information set is formatted the same way. The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith."
Using a regular expression pattern similar to
"Title: ([^;]*?); Last Name: ([^;.]*?)"
I would like to produce the following matrix from the above string:
     [,1]  [,2]
[1,] Sir   John
[2,] Mr.   Smith
I have successfully accomplished this in PHP on a remote server using the preg_match_all function; however, the text files I am accessing are relatively large (not huge but slow to upload anyways). Building this in R will save a significant amount of time.
I have read up on use of grep, etc. in R but every example I have found searches for patterns in a vector and I have been unable to generate the matrix as described above.
I have also played with the stringr package but again I have not been successful generating a matrix.
This seems like a common task to me so I am sure someone smarter than me has found a solution before.
Consider the following option using regmatches:
x <- 'This is a sample string written like a paragraph. In this string two sets of information exist. Each set contains two variables. We want to extract the sets and variables within those sets. Each information set is formatted the same way. The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith.'
m <- regmatches(x, gregexpr('(?i)Title: \\K[^;]+|Last Name: \\K[^;.]+', x, perl=T))
matrix(unlist(m), ncol=2, byrow=T)
Output:
[,1] [,2]
[1,] "Sir" "John"
[2,] "Mr." "Smith"
For some reason there doesn't seem to be an easy way to extract captured groups in base R (I wish regmatches also returned the captured groups for a gregexpr result, but it does not). I ended up writing my own; you can find it at regcapturedmatches.R. It will work with:
a <- "The first set is Title: Sir and Last Name: John; and the second set is Title: Mr. and Last Name: Smith."
m<-gregexpr("Title: ([^;]*) and Last Name: ([^;.]*)", a, perl=T, ignore.case=T)
regcapturedmatches(a,m)[[1]]
This will return
[,1] [,2]
[1,] "Sir" "John"
[2,] "Mr." "Smith"
(I added the [[1]] because you said you would only operate on one string at a time. The function can operate on a vector and will return results in a list. Really, in R everything is a vector, so there is no such thing as a "single" string; you just have a character vector of length 1.)
Of course this method is only as good as your regular expression. I had to modify your sample data a bit so your expression would match more than one Title/Name.
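For reference, here is a rough sketch of what such a helper can do in base R, using the capture.start and capture.length attributes that gregexpr() attaches when perl = TRUE (this assumes a and m from the snippet above):
st  <- attr(m[[1]], "capture.start")   # one row per match, one column per group
len <- attr(m[[1]], "capture.length")
matrix(substring(a, st, st + len - 1), nrow = nrow(st))
#      [,1]  [,2]
# [1,] "Sir" "John"
# [2,] "Mr." "Smith"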
Here is a stringr version (this assumes x holds the edited text and pattern holds the Title/Last Name regex with its two capture groups):
library(stringr)
str_match_all(x, pattern)
Produces:
[[1]]
[,1] [,2] [,3]
[1,] "Title: Sir and Last Name: John" "Sir" "John"
[2,] "Title: Mr. and Last Name: Smith" "Mr." "Smith"
Note that I had to edit your text so that the second one is also of form "and Last Name:". To get your matrix you can just do:
result[[1]][, -1] # assumes the above is in `result`; drops the full-match column
One limitation of this is it uses regexec, which doesn't support perl regular expressions.

R gsub and perl

Hi, I am trying to perform a specific pattern match.
I want to standardize street names.
y <- c("Straße des 18 JAN.")
gsub("(.*)([1-3]?[0-9]\\.?)(JAN\\.?U?A?R?)(.*)","\\1 \\2 JANUAR \\4",y, perl=T)
What I want is to keep everything but rewrite capture group 3 to JANUAR; so far I have not managed that.
Thanks in advance.
The regular expression has to be
gsub("(.*)([1-3]?[0-9]\\.?) (JAN\\.?U?A?R?)(.*)","\\1\\2 JANUAR\\4",y, perl=TRUE)
# [1] "Straße des 18 JANUAR"
I added a whitespace (" ") before the group beginning with (JAN. Furthermore, I removed the whitespace between \\1 and \\2 and between JANUAR and \\4 in the replacement.
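A quick check on a few hypothetical variants (made-up street names, just to illustrate that the abbreviated and partially written forms all normalize to JANUAR):
y2 <- c("Straße des 18 JAN.", "Platz des 3 JANUAR", "Weg des 20 JANU")
gsub("(.*)([1-3]?[0-9]\\.?) (JAN\\.?U?A?R?)(.*)", "\\1\\2 JANUAR\\4", y2, perl=TRUE)
# [1] "Straße des 18 JANUAR" "Platz des 3 JANUAR"   "Weg des 20 JANUAR"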