I am looking for an R equivalent to PHP's preg_match_all function.
Objective:
Search a single string (not a vector of several strings) for a regexp pattern
Return a matrix of matches
Example:
Assume the following flat string without delimitation.
"This is a sample string written like a paragraph. In this string two sets of information exist. Each set contains two variables. We want to extract the sets and variables within those sets. Each information set is formatted the same way. The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith."
Using a regular expression pattern similar to
"Title: ([^;]*?); Last Name: ([^;.]*?)"
I would like to produce the following matrix from the above string:
     [,1]  [,2]
[1,] "Sir" "John"
[2,] "Mr." "Smith"
I have successfully accomplished this in PHP on a remote server using the preg_match_all function; however, the text files I am accessing are relatively large (not huge, but slow to upload anyway). Building this in R will save a significant amount of time.
I have read up on the use of grep, etc. in R, but every example I have found searches for patterns in a vector, and I have been unable to generate the matrix described above.
I have also played with the stringr package but again I have not been successful generating a matrix.
This seems like a common task to me so I am sure someone smarter than me has found a solution before.
Consider the following option using regmatches. The \K in the pattern resets the start of the reported match, so the label text matched before it ("Title: " or "Last Name: ") is discarded:
x <- 'This is a sample string written like a paragraph. In this string two sets of information exist. Each set contains two variables. We want to extract the sets and variables within those sets. Each information set is formatted the same way. The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith.'
m <- regmatches(x, gregexpr('(?i)Title: \\K[^;]+|Last Name: \\K[^;.]+', x, perl=T))
matrix(unlist(m), ncol=2, byrow=T)
Output:
[,1] [,2]
[1,] "Sir" "John"
[2,] "Mr." "Smith"
For some reason there doesn't seem to be an easy way to extract captured matches in base R (I wish regmatches also worked with captured groups, but it does not). I ended up writing my own function; you can find it at regcapturedmatches.R. It will work with
a <- "The first set is Title: Sir and Last Name: John; and the second set is Title: Mr. and Last Name: Smith."
m <- gregexpr("Title: ([^;]*) and Last Name: ([^;.]*)", a, perl=T, ignore.case=T)
regcapturedmatches(a, m)[[1]]
This will return
[,1] [,2]
[1,] "Sir" "John"
[2,] "Mr." "Smith"
(I added the [[1]] because you said you would only operate on one string at a time. The function can operate on a vector and will return results in a list. Really, in R, everything is a vector, so there is no such thing as a "single" string; you just have a character vector of length 1.)
Of course this method is only as good as your regular expression. I had to modify your sample data a bit so your expression would match more than one Title/Name.
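For reference, here is a minimal sketch of what such a function does under the hood (regcap is my own name for it, not part of base R): with perl=TRUE, gregexpr attaches capture.start and capture.length attributes to each element of its result, and the captured groups can be cut out of the string with substring:
regcap <- function(x, m) {
  cs <- attr(m[[1]], "capture.start")   # n_matches x n_groups matrix of group start positions
  cl <- attr(m[[1]], "capture.length")  # matching matrix of group lengths
  matrix(substring(x, cs, cs + cl - 1), ncol = ncol(cs))
}
regcap(a, m)
#      [,1]  [,2]
# [1,] "Sir" "John"
# [2,] "Mr." "Smith"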
Here is a stringr version, using the same edited text and pattern as the previous answer:
library(stringr)
x <- "The first set is Title: Sir and Last Name: John; and the second set is Title: Mr. and Last Name: Smith."
pattern <- "Title: ([^;]*) and Last Name: ([^;.]*)"
str_match_all(x, pattern)
Produces:
[[1]]
[,1] [,2] [,3]
[1,] "Title: Sir and Last Name: John" "Sir" "John"
[2,] "Title: Mr. and Last Name: Smith" "Mr." "Smith"
Note that I had to edit your text so that the second set is also of the form "and Last Name:". To get your matrix you can just do:
result[[1]][, -1] # assumes the above is in `result`; drops the full-match column
One limitation of this is that it uses regexec, which doesn't support perl regular expressions.
Related
I'm trying to use Flodel's answer here (extra commas in csv causing problems) in order to import some messy CSV data, but I'm having trouble implementing the solution.
When I have more than three columns, I don't know how to get the text and extra comma into my desired column. I'm pretty sure the problem is in my pattern; I just don't know how to fix it.
file <- textConnection("123, hi, NAME1, EMAIL1#ADDRESS.COM
111, hi, NAME2, EMAIL2#ADRESS.ME
699, hi, FIRST M. LAST, Jr., EMAIL4#ADDRESS.GOV")
lines <- readLines(file)
pattern <- "^(\\d+), (.*), (.*), \\b(.*)$"
matches <- regexec(pattern, lines)
bad.rows <- which(sapply(matches, length) == 1L)
if (length(bad.rows) > 0L) stop(paste("bad row: ", lines[bad.rows]))
data <- regmatches(lines, matches)
as.data.frame(matrix(unlist(data), ncol = 5L, byrow = TRUE)[, -1L])
which gives me
V1 V2 V3 V4
123 hi NAME1 EMAIL1#ADDRESS.COM
111 hi NAME2 EMAIL2#ADRESS.ME
699 hi, FIRST M. LAST Jr. EMAIL4#ADDRESS.GOV
I'd like to see:
V1 V2 V3 V4
123 hi NAME1 EMAIL1#ADDRESS.COM
111 hi NAME2 EMAIL2#ADRESS.ME
699 hi FIRST M. LAST, Jr. EMAIL4#ADDRESS.GOV
If you're more explicit about what you want to match, you might get better results. If column two will always be a single string that does not include a comma, you can use:
pattern <- "^(\\d+), ([^,]+), (.*), \\b(.*)$"
In my experience, the best approach is to make your regular expression as explicit as you can at first, and to generalize only when that stops working. For example, if the second string is always "hi", include that in your regex:
pattern <- "^(\\d+), (hi), (.*), \\b(.*)$"
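For instance, rerunning the pipeline from the question (reusing its lines) with the stricter [^,]+ pattern produces the desired result:
pattern <- "^(\\d+), ([^,]+), (.*), \\b(.*)$"
matches <- regexec(pattern, lines)
data <- regmatches(lines, matches)
as.data.frame(matrix(unlist(data), ncol = 5L, byrow = TRUE)[, -1L])
#    V1 V2                 V3                 V4
# 1 123 hi              NAME1 EMAIL1#ADDRESS.COM
# 2 111 hi              NAME2   EMAIL2#ADRESS.ME
# 3 699 hi FIRST M. LAST, Jr. EMAIL4#ADDRESS.GOV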
R newbie here
I have data that looks something like this:
{'id': 19847005, 'profile_sidebar_fill_color': u'http://pbs.foo.com/profile_background', 'profile_text_color': u'333333', 'followers_count': 1105, 'location': u'San Diego, CA', 'profile_background_color': u'9AE4E8', 'listed_count': 43, '009', 'time_zone': u'Pacific Time (US & Canada)', 'protected': False}
I want to extract the location data from this text: San Diego, CA.
I have been trying to use the stringr package to accomplish this, but can't quite get the regex right to capture the city and state. Sometimes the state will be present, other times not.
location_pattern <- "'location':\su'(\w+)'"
rawdata$location <- str_extract(rawdata$user, location_pattern)
You could try (with str1 being the record shown in the question):
str_extract_all(str1, perl("(?<=location.: u.)[^']+(?=')"))[[1]]
#[1] "San Diego, CA"
Others have given possible solutions, but not explained what likely went wrong with your attempt.
The str_extract function uses POSIX extended regular expressions, which do not understand \w and \s; those are specific to Perl regular expressions. You can use the perl function in the stringr package instead and it will then recognize the shortcuts, or you can use [[:space:]] in place of \s and [[:alnum:]_] in place of \w, though more likely you will want something like [[:alpha:], ] or [^'].
Also, R's string parser gets a shot at the string before it is passed to the matching function, so you will need \\s and \\w if you use the perl function (or any other regular expression function in R). The first \ escapes the second so that a single \ remains in the string to be interpreted as part of the regular expression.
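For example, one way to put this together is to sidestep the shortcut classes entirely with [^']+, which also lets the match span the space and comma in the city/state (a sketch; txt stands in for one element of rawdata$user):
library(stringr)
# a shortened copy of the sample record from the question
txt <- "'followers_count': 1105, 'location': u'San Diego, CA', 'listed_count': 43"
str_match(txt, "'location': u'([^']+)'")[, 2]
# [1] "San Diego, CA"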
It looks like a JSON string, but if you're not too concerned about that, then perhaps this will help.
library(stringi)
# x holds the record from the question as a single character string
ss <- stri_split_regex(x, "[{}]|u?'|(, '(009')?)|: ", omit_empty=TRUE)[[1]]
(m <- matrix(ss, ncol = 2, byrow = TRUE))
# [,1] [,2]
# [1,] "id" "19847005"
# [2,] "profile_sidebar_fill_color" "http://pbs.foo.com/profile_background"
# [3,] "profile_text_color" "333333"
# [4,] "followers_count" "1105"
# [5,] "location" "San Diego, CA"
# [6,] "profile_background_color" "9AE4E8"
# [7,] "listed_count" "43"
# [8,] "time_zone" "Pacific Time (US & Canada)"
# [9,] "protected" "False"
So now you have the ID names in the left column and the values on the right. It would probably be simple to reassemble the json from this point if need be.
Also, regarding the json-ness, we can coerce m to a data.frame (or leave it as a matrix), and then use jsonlite::toJSON
library(jsonlite)
json <- toJSON(setNames(as.data.frame(m), c("ID", "Value")))
fromJSON(json)
# ID Value
# 1 id 19847005
# 2 profile_sidebar_fill_color http://pbs.foo.com/profile_background
# 3 profile_text_color 333333
# 4 followers_count 1105
# 5 location San Diego, CA
# 6 profile_background_color 9AE4E8
# 7 listed_count 43
# 8 time_zone Pacific Time (US & Canada)
# 9 protected False
I'm using (or I'd like to use) R to extract some information. I have the following sentence and I'd like to split it; in the end, I'd like to extract only the number 24.
Here's what I have:
doc <- "Hits 1 - 10 from 24"
And I want to extract the number "24". I know how to extract the number once I can reduce the sentence to "Hits 1 - 10 from" and "24". I tried using this:
n_docs <- unlist(str_split(key_n_docs, ".\\from"))[1]
But this leaves me with: "Hits 1 - 10"
Obviously the split works somehow, but I'm interested in the part after "from", not the one before. All help is appreciated!
If you want to extract from a single character string:
strsplit(key_n_docs, "from")[[1]][2]
or the equivalent expression used by @BastiM (sorry, I saw your answer after I submitted mine):
unlist(strsplit(key_n_docs, "from"))[2]
If you want to extract from a vector of character strings:
sapply(strsplit(key_n_docs, "from"),`[`, 2)
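Note that splitting on "from" leaves a leading space on the piece you want; trimws() (or as.numeric(), if a number is needed) cleans that up:
trimws(strsplit("Hits 1 - 10 from 24", "from")[[1]][2])
# [1] "24"
as.numeric(strsplit("Hits 1 - 10 from 24", "from")[[1]][2])
# [1] 24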
The result of str_split is a list holding the pieces; after unlist, the part before "from" sits at index 1 and the part you're after at index 2, so you have to increment the index by one. Using
unlist(strsplit("Hits 1 - 10 from 24", "from"))[2]
works like a charm for me.
You can use str_extract from stringr:
library(stringr)
numbers <- str_extract(doc, "[0-9]+$")
This will give only the number at the end of the sentence:
numbers
# [1] "24"
You can use sub to extract the number, by matching the whole string and replacing it with the captured group:
sub(".*from *(\\d+).*", "\\1", doc)
# [1] "24"
So I have a really long string and I want to work with multiple matches. I can only seem to get the first position of the first match using regexpr. How can I get multiple positions (more matches) back within the same string?
I am looking for a specific string in html source code: the title of an auction (which is between html tags). It proves kind of difficult to find.
So far I use this:
locationstart <- gregexpr("<span class=\"location-name\">", URL)[[1]] + 28
locationend <- regexpr("<", substring(URL, locationstart[1], locationstart[1] + 100))
substring(URL, locationstart[1], locationstart[1] + locationend - 2)
That is, I look for the part that comes before a title, note that position, and from there on look for a "<" indicating that the title has ended. I'm open to more specific suggestions.
Using gregexpr allows for multiple matches.
> x <- c("only one match", "match1 and match2", "none here")
> m <- gregexpr("match[0-9]*", x)
> m
[[1]]
[1] 10
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 1 12
attr(,"match.length")
[1] 6 6
attr(,"useBytes")
[1] TRUE
[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
and if you're looking to extract the match you can use regmatches to do that for you.
> regmatches(x, m)
[[1]]
[1] "match"
[[2]]
[1] "match1" "match2"
[[3]]
character(0)
gregexpr and regmatches, as suggested in Dason's answer, allow extracting multiple instances of a regex pattern from a string. Furthermore, this solution has the advantage of relying exclusively on the {base} package of R rather than requiring an additional package.
Nevertheless, I'd like to suggest an alternative solution based on the stringr package. In general, this package makes it easier to work with character strings by providing most of the functionality of the various string-support functions of base R (not just the regex-related functions), with a set of intuitively named functions offering a consistent API. Indeed, stringr functions do not merely replace base R functions; in many cases they introduce additional features. For example, the regex-related functions of stringr are vectorized for both the string and the pattern.
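For instance, pattern vectorization lets each string be paired with its own regular expression:
library(stringr)
# element-wise pairing: first string with first pattern, second with second
str_extract(c("apple 1", "banana 22"), c("[0-9]+", "[a-z]+"))
# [1] "1"      "banana"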
Specifically for the question of extracting multiple patterns from a long string, either str_extract_all or str_match_all can be used, as shown below. Depending on whether the input is a single string or a vector of strings, the logic can be adapted, using list/matrix subscripts, unlist, or approaches like lapply, sapply, etc. The point is that the stringr functions return structures that make it easy to access just what we want.
# simulate html input. (Using bogus html tags to mark the target texts; the demo works
# the same for actual html patterns, the regular expression is just a bit more complex.)
htmlInput <- paste("Lorem ipsum dolor<blah>MATCH_ONE<blah> sit amet, purus",
"sollicitudin<blah>MATCH2<blah>mauris, <blah>MATCH Nr 3<blah>vitae donec",
"risus ipsum, aenean quis, sapien",
"in lorem, condimentum ornare viverra",
"suscipit <blah>LAST MATCH<blah> ipsum eget ac. Non senectus",
"dolor mauris tellus, dui leo purus varius")
# str_extract() may need a bit of extra work to remove the leading and trailing parts
str_extract_all(htmlInput, "(<blah>)([^<]+)<")
# [[1]]
# [1] "<blah>MATCH_ONE<" "<blah>MATCH2<" "<blah>MATCH Nr 3<" "<blah>LAST MATCH<"
str_match_all(htmlInput, "<blah>([^<]+)<")[[1]][, 2]
# [1] "MATCH_ONE" "MATCH2" "MATCH Nr 3" "LAST MATCH"
I have a bunch of names, and I want to obtain the unique ones. However, due to spelling errors and inconsistencies in the data, the names might be written down wrong. I am looking for a way to check, in a vector of strings, whether two of them are similar.
For example:
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")
I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?
This can be done based on, e.g., the Levenshtein distance. There are multiple implementations of this in different packages. Some solutions and packages can be found in the answers to these questions:
agrep: only return best match(es)
In R, how do I replace a string that contains a certain pattern with another string?
Fast Levenshtein distance in R?
But most often agrep will do what you want:
> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3
$`Bush, G.W.`
[1] 2
$`Obama, B.H.`
[1] 1 3
$`Clinton, W.J.`
[1] 4
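To boil that output down to just the names that approximately match something other than themselves (a small follow-up sketch):
sim <- sapply(pres, agrep, pres)
pres[lengths(sim) > 1]   # names whose match list has more than one entry
# [1] " Obama, B."  "Obama, B.H."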
Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.
lapply(pres, agrep, pres, value = TRUE)
[[1]]
[1] " Obama, B." "Obama, B.H."
[[2]]
[1] "Bush, G.W."
[[3]]
[1] " Obama, B." "Obama, B.H."
[[4]]
[1] "Clinton, W.J."
Let's add another near-duplicate to show this works with more than one duplicate.
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.", "Bush, G.")
adist computes the string distance between two character vectors:
adist(" Obama, B.", pres)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 9 3 10 7
For example, to select the string closest to " Obama, B.", you can take the one with the minimal distance. To avoid matching the identical string, I kept only distances greater than zero:
d <- adist(" Obama, B.", pres)
pres[which(d == min(d[d > 0]))]
# [1] "Obama, B.H."
To obtain unique names while taking spelling errors and inconsistencies into account, you can compare each string to all previous ones and drop it if a similar one already exists. The keepunique() function below performs this check; it is then applied to all elements of the vector successively with Reduce().
keepunique <- function(previousones, x){
  if(any(adist(x, previousones) < 5)){
    x <- NULL
  }
  return(c(previousones, x))
}
Reduce(keepunique, pres)
# [1] " Obama, B." "Bush, G.W." "Clinton, W.J."