Extract location data using regex in R

R newbie here
I have data that looks something like this:
{'id': 19847005, 'profile_sidebar_fill_color': u'http://pbs.foo.com/profile_background', 'profile_text_color': u'333333', 'followers_count': 1105, 'location': u'San Diego, CA', 'profile_background_color': u'9AE4E8', 'listed_count': 43, '009', 'time_zone': u'Pacific Time (US & Canada)', 'protected': False}
I want to extract the location data from this text: San Diego, CA.
I have been trying to use the stringr package to accomplish this, but can't quite get the regex right to capture the city and state. Sometimes the state will be present, other times not.
location_pattern <- "'location':\su'(\w+)'"
rawdata$location <- str_extract(rawdata$user, location_pattern)

You could try
str_extract_all(str1, perl("(?<=location.: u.)[^']+(?=')"))[[1]]
#[1] "San Diego, CA"

Others have given possible solutions, but not explained what likely went wrong with your attempt.
The str_extract function uses POSIX extended regular expressions that do not understand \w and \s; those are specific to Perl regular expressions. You can use the perl function in the stringr package instead and it will then recognize the shortcuts, or you can use [[:space:]] in place of \s and [[:alnum:]_] in place of \w, though more likely you will want something like [[:alpha:], ] or [^'].
Also, R's string parser gets a shot at the string before it is passed to the matching function, so you will need \\s and \\w if you use the perl function (or any other regular expression function in R). The first \ escapes the second so that a single \ remains in the string to be interpreted as part of the regular expression.
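To make the escaping point concrete, here is a minimal base R sketch (not the asker's stringr code). The two sample rows are shortened, made-up variants of the data in the question, and perl = TRUE in regexec assumes R 3.3.0 or later. The pattern captures everything between the quotes, so it works whether or not a state is present.

```r
# Two shortened sample rows (illustrative only); the second has no state.
rows <- c(
  "{'id': 19847005, 'followers_count': 1105, 'location': u'San Diego, CA', 'protected': False}",
  "{'id': 2, 'location': u'Berlin', 'protected': False}"
)

# "\\s" in R source is the two-character regex \s once the string is parsed.
pattern <- "'location':\\s*u'([^']*)'"

# regexec() returns the full match plus each capture group.
matches <- regmatches(rows, regexec(pattern, rows, perl = TRUE))

# Keep the first capture group, or NA when a row has no location field.
location <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)
print(location)
```

Writing "\s" with a single backslash would fail before any matching happens, because R rejects it as an unrecognized escape when parsing the string.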

It looks like a json string, but if you're not too concerned about that, then perhaps this would help.
library(stringi)
ss <- stri_split_regex(x, "[{}]|u?'|(, '(009')?)|: ", omit=TRUE)[[1]]
(m <- matrix(ss, ncol = 2, byrow = TRUE))
# [,1] [,2]
# [1,] "id" "19847005"
# [2,] "profile_sidebar_fill_color" "http://pbs.foo.com/profile_background"
# [3,] "profile_text_color" "333333"
# [4,] "followers_count" "1105"
# [5,] "location" "San Diego, CA"
# [6,] "profile_background_color" "9AE4E8"
# [7,] "listed_count" "43"
# [8,] "time_zone" "Pacific Time (US & Canada)"
# [9,] "protected" "False"
So now you have the ID names in the left column and the values on the right. It would probably be simple to reassemble the json from this point if need be.
Also, regarding the json-ness, we can coerce m to a data.frame (or leave it as a matrix), and then use jsonlite::toJSON
library(jsonlite)
json <- toJSON(setNames(as.data.frame(m), c("ID", "Value")))
fromJSON(json)
# ID Value
# 1 id 19847005
# 2 profile_sidebar_fill_color http://pbs.foo.com/profile_background
# 3 profile_text_color 333333
# 4 followers_count 1105
# 5 location San Diego, CA
# 6 profile_background_color 9AE4E8
# 7 listed_count 43
# 8 time_zone Pacific Time (US & Canada)
# 9 protected False

Related

Split string using regular expressions and store it into data frame

I have a string like this:
Received # 10/10/2014 02:29:55 a.m.  Changed status: 'processing' # 10/10/2014 02:40:20 a.m.  Changed status: 'processed' # 10/10/2014 02:40:24 a.m.
I need to "parse" this string using certain rules:
The first block is the Received date and time
Each block after the first one starts with Changed status: and ends with a date and time
There can be any number of Changed status: blocks (at least 1) and the status can vary
What I need to do is to:
Split the string and put each block into an array.
Example:
[Received # 10/10/2014 02:29:55 a.m.], [Changed status: 'processing' # 10/10/2014 02:40:20 a.m.], [Changed status: 'processed' # 10/10/2014 02:40:24 a.m.]
After each block is split, I need to split each entry in three fields
For the above example, what I need is something like this:
Received | NULL | 10/10/2014 02:29:55 am
Changed status | processing | 10/10/2014 02:40:20 am
Changed status | processed | 10/10/2014 02:40:24 am
I think step two is quite easy (each block can be split using # and : as separators), but step one is making me pull my hair out. Is there a way to do this kind of thing with regular expressions?
I've tried some approaches (like Received|Changed.*[ap].m.), but it doesn't work (the evaluation of the regular expression always returns the full string).
I want to do this in R:
Read the full data table (which has more fields, and the text above is the last one) into a data frame
"Parse" this string and store it into a second data frame
R has built-in support for regular expressions, so that's my first thought on approaching the solution.
Any help will be appreciated. Honestly, I'm lost here (but I'll keep on trying... I'll edit my post if I find steps that bring me closer to the solution)
Here's a possibility that you could put into a function. In the string you posted, the important information seems to be separated by two spaces, which is nice. Basically what I did was try to get all the relevant lines to split evenly into the right length.
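# The blocks are separated by two spaces, which is what the split below relies on
x <- "Received # 10/10/2014 02:29:55 a.m.  Changed status: 'processing' # 10/10/2014 02:40:20 a.m.  Changed status: 'processed' # 10/10/2014 02:40:24 a.m."
s <- strsplit(gsub("['.]", "", x), "  ")[[1]]
# Give the "Received" block an empty middle field so every block splits into three
s[g] <- sub("(\\D) ", "\\1:  ", s[g <- grep("Received", s)])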
do.call(rbind, strsplit(s, " # |: "))
# [,1] [,2] [,3]
# [1,] "Received" "" "10/10/2014 02:29:55 am"
# [2,] "Changed status" "processing" "10/10/2014 02:40:20 am"
# [3,] "Changed status" "processed" "10/10/2014 02:40:24 am"
I went without "NULL" because I presume you meant you wanted an empty character there. NULL would not show up in a data frame anyway.
Here is a short solution based on strapplyc. strapplyc matches the regular expression to the input string s extracting the matches to the parenthesized portions of the regular expression except the (?:...) which is non-capturing.
There are 3 capturing pairs of parentheses in pat. The first one matches Received or Changed status. Then we optionally match a colon, space, single quote, zero or more non-single-quote characters and another quote. The portion between the two quotes is the second captured string. Then we match space, #, space and the date/time string. The date/time string is captured.
Finally matrix is used to reshape it into 3 columns:
library(gsubfn)
pat <- "(Received|Changed status)(?:: '([^']*)')? # (../../.... ..:..:.. ....)"
matrix(strapplyc(s, pat, simplify = TRUE), nc = 3, byrow = TRUE)
giving:
[,1] [,2] [,3]
[1,] "Received" "" "10/10/2014 02:29:55 a.m."
[2,] "Changed status" "processing" "10/10/2014 02:40:20 a.m."
[3,] "Changed status" "processed" "10/10/2014 02:40:24 a.m."
Update: Simplification. Also modified output to be as in question.
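For readers who would rather not depend on gsubfn, the same two-step idea (find each block, then pull the three fields out of it) can be sketched in base R. This is only an illustration under the assumption that every block looks like the ones in the question; the variable names are made up for the example.

```r
x <- "Received # 10/10/2014 02:29:55 a.m. Changed status: 'processing' # 10/10/2014 02:40:20 a.m. Changed status: 'processed' # 10/10/2014 02:40:24 a.m."

# Step 1: one match per block (label, optional quoted status, '#', date, time, a.m./p.m.)
block_pat <- "(Received|Changed status: '[^']*') # [0-9/]+ [0-9:]+ [ap]\\.m\\."
blocks <- regmatches(x, gregexpr(block_pat, x))[[1]]

# Step 2: split each block into its three fields
label  <- sub(" ?[:#].*", "", blocks)                  # text before ':' or ' #'
status <- ifelse(grepl("'", blocks),
                 sub(".*'([^']*)'.*", "\\1", blocks),  # text between the quotes
                 "")                                   # "Received" has no status
stamp  <- sub(".*# ", "", blocks)                      # text after '# '

res <- cbind(label, status, stamp)
print(res)
```

Because the blocks are found by matching rather than by splitting on whitespace, this does not depend on how many spaces separate them.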
tmp <- "Received # 10/10/2014 02:29:55 a.m.  Changed status: 'processing' # 10/10/2014 02:40:20 a.m.  Changed status: 'processed' # 10/10/2014 02:40:24 a.m."
tmp1 <- strsplit(gsub('Received', 'Received:', tmp), '\\s{2}', perl = TRUE)
do.call(rbind, strsplit(tmp1[[1]], '# |: '))
# [,1] [,2] [,3]
# [1,] "Received" "" "10/10/2014 02:29:55 a.m."
# [2,] "Changed status" "'processing' " "10/10/2014 02:40:20 a.m."
# [3,] "Changed status" "'processed' " "10/10/2014 02:40:24 a.m."
I'm assuming that you've got your data in a data.frame and that you want to do this on many rows in your data frame. I'm calling that data.frame "Data", and here's what I would do, although perhaps someone else could make this more elegant:
library(stringr)   # str_split(), str_sub()
library(lubridate) # mdy_hms()
Split <- str_split(Data$String, "#") # Make a list with your string split by "#"
Data$Received <- NA
Data$Processing <- NA
Data$Processed <- NA
for (i in seq_len(nrow(Data))) {
  Data$Received[i] <- str_sub(Split[[i]][2], 2, 24) # Extract the date received, etc.
  Data$Processing[i] <- str_sub(Split[[i]][3], 2, 24)
  Data$Processed[i] <- str_sub(Split[[i]][4], 2, 24)
}
Data$Received <- mdy_hms(Data$Received) # Use lubridate to convert it to POSIX format
Data$Processing <- mdy_hms(Data$Processing)
Data$Processed <- mdy_hms(Data$Processed)
That gives you three columns for the dates and times you want.

how to use regexp to return small part of character based on pattern

This should be easy for anyone who understands regular expressions as I'm struggling to do.
I have a vector of strings that looks like
strings<-c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
"PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
"08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")
I want to convert this vector into a dataframe where each row is from an element of the original vector with 3 columns where each column comes from the part in parenthesis. So the result here would be
col1 col2 col3
"LN" "WEC" "WPS"
"LN" "MP" "MP"
"XF" "CIN" "*"
If there are more than one instance of the pattern in a string then it should take the first instance.
I think my main problem is that ( is a special character and I'm trying to escape it \( but I get an error that \( is an unrecognized escape character so I'm just a little lost.
Sounds like you're forgetting to escape the \ in \(, i.e. \\(:
do.call(rbind, strsplit(sub('.*?\\((.*?)\\).*', '\\1', strings), split = "/"))
[,1] [,2] [,3]
[1,] "LN" "WEC" "WPS"
[2,] "LN" "MP" "MP"
[3,] "XF" "CIN" "*"
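If the goal is the three-column data frame described in the question, the escaped-parenthesis idea can be spelled out step by step. This is a sketch of one way to do it in base R; the column names col1 to col3 come from the question, everything else is illustrative.

```r
strings <- c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
             "PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
             "08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")

# "\\(" in R source is the regex \( , i.e. a literal opening parenthesis.
# [^(]* skips everything before the first "(", so only the first group is kept.
inner <- sub("^[^(]*\\(([^)]*)\\).*$", "\\1", strings)

# Split on the literal slash and bind the pieces into rows
parts <- do.call(rbind, strsplit(inner, "/", fixed = TRUE))

df <- data.frame(col1 = parts[, 1], col2 = parts[, 2], col3 = parts[, 3],
                 stringsAsFactors = FALSE)
print(df)
```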
1) We define a pattern that matches
left-paren non-slashes slash non-slashes slash non-right-parens remainder
which correspond to the following respectively:
\\( ([^/]+) / ([^/]+) / ([^)]+) .*
Now extract the parenthesized portions using strapplyc and simplify into a matrix. The code is:
library(gsubfn)
pat <- "\\(([^/]+)/([^/]+)/([^)]+).*"
strapplyc(strings, pat, simplify = rbind)
giving:
[,1] [,2] [,3]
[1,] "LN" "WEC" "WPS"
[2,] "LN" "MP" "MP"
[3,] "XF" "CIN" "*"
2) This alternative uses strapplyc nested in strapply. The regular expressions are slightly simpler and it's still basically one line of code, but that line is longer. The first regex picks out everything between the first set of parens and the second extracts the slash-separated fields:
strapply(strings, "\\(([^)]+).*", ~ strapplyc(x, "[^/]+")[[1]], simplify = rbind)
REVISED Some improvements to first solution plus a variation as second solution.

Split column label by number of letters/characters in R

I have a large dataset where all column headers are individual IDS, each 8 characters in length. I would like to split those individual IDs into 2 rows, where the first row of IDs contains the first 7 characters, and the second row contains just the last character.
Current dataset:
ID1: Indiv01A Indiv01B Indiv02A Indiv02B Speci03A Speci03B
Intended dataset:
ID1: Indiv01 Indiv01 Indiv02 Indiv02 Speci03 Speci03
ID2: A B A B A B
I've looked through other posts on splitting data, but they all seem to have a unique way to separate the column name (ie: there's a comma separating the 2 components, or a period).
This is the code I'm thinking would work best, but I just can't figure out how to code for "7 characters" as the split point, rather than a comma:
sapply(strsplit(as.character(d$ID), ",")
Any help would be appreciated.
Here's a regular expression for a solution with strsplit. It splits the string between the 7th and the 8th character:
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
res <- strsplit(ID1, "(?<=.{7})", perl = TRUE)
# [[1]]
# [1] "Indiv01" "A"
#
# [[2]]
# [1] "Indiv01" "B"
#
# [[3]]
# [1] "Indiv02" "A"
#
# [[4]]
# [1] "Indiv02" "B"
#
# [[5]]
# [1] "Speci03" "A"
#
# [[6]]
# [1] "Speci03" "B"
Now, you can use rbind to create two columns:
do.call(rbind, res)
# [,1] [,2]
# [1,] "Indiv01" "A"
# [2,] "Indiv01" "B"
# [3,] "Indiv02" "A"
# [4,] "Indiv02" "B"
# [5,] "Speci03" "A"
# [6,] "Speci03" "B"
Explanation of the regex pattern:
(?<=.{7})
The (?<=) is a (positive) lookbehind. It matches any position that is preceded by the specified pattern. Here, the pattern is .{7}. The dot (.) matches any character. {7} means 7 times. Hence, the regex matches a position that is preceded by at least 7 characters; in an 8-character ID the first such position lies between the 7th and the 8th character, which is where strsplit splits.
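If the lookbehind feels opaque, the same fixed-width split can be sketched with plain anchored patterns in sub. This is an alternative illustration, not the method above, and it assumes every ID has at least 7 characters.

```r
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")

# Keep the first 7 characters: capture them and drop the rest
first7 <- sub("^(.{7}).*$", "\\1", ID1)

# Keep whatever follows the first 7 characters
rest <- sub("^.{7}", "", ID1)

# Two rows, as in the intended dataset from the question
res <- rbind(ID1 = first7, ID2 = rest)
print(res)
```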
Here is a gsubfn solution:
library(gsubfn)
strapplyc(ID1, "(.*)(.)", simplify = cbind)
which gives this matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "Indiv01" "Indiv01" "Indiv02" "Indiv02" "Speci03" "Speci03"
[2,] "A" "B" "A" "B" "A" "B"
or use rbind in place of cbind if you want two columns (rather than two rows).
There are a couple of ways you could go about this.
To extract the final character
First, with substr:
new.vec <- sapply(old.vec, function(x) substr(x, nchar(x), nchar(x)))
or, with sub:
new.vec <- sub('.*(.)', '\\1', old.vec)
where old.vec is the vector of strings that you want to split.
For interest, the latter option uses a regular expression that translates to: "capture (indicated by surrounding with parentheses) the single character (.) that follows zero or more other characters (.*), and replace matches with the captured content (\\1)". For more info, see ?gsub.
The above options allow for varying string lengths. However, if you do always want to split after 7 characters, and the second part of the string always has just a single character, then the following should work:
new.vec <- substr(old.vec, 8, 8)
(Edited to include method for extracting the first part of the string.)
To extract all but the final character
The process is similar.
new.vec <- sapply(old.vec, function(x) substr(x, 1, nchar(x) - 1))
new.vec <- sub('(.*).', '\\1', old.vec)
new.vec <- substr(old.vec, 1, 7)

Regex matching everything that's not a 4 digit number

I match and replace 4-digit numbers preceded and followed by white space with:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
sub("\\s\\d{4}\\s", "", str12)
[1] "coihr&/()= jngm 34 ljd"
but every attempt to invert this and extract the number instead fails.
I want:
[1] 1234
Does anyone have a clue?
ps: I know how to do it with {stringr} but am wondering if it's possible with {base} only..
require(stringr)
gsub("\\s", "", str_extract(str12, "\\s\\d{4}\\s"))
[1] "1234"
regmatches(), only available since R-2.14.0, allows you to "extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec"
Here are examples of how you could use regmatches() to extract either the first whitespace-cushioned 4-digit substring in your input character string, or all such substrings.
## Example strings and pattern
x <- "coihr 1234 &/()= jngm 34 ljd" # string with 1 matching substring
xx <- "coihr 1234 &/()= jngm 3444 6789 ljd" # string with >1 matching substring
pat <- "(?<=\\s)(\\d{4})(?=\\s)"
## Use regexpr() to extract *1st* matching substring
as.numeric(regmatches(x, regexpr(pat, x, perl=TRUE)))
# [1] 1234
as.numeric(regmatches(xx, regexpr(pat, xx, perl=TRUE)))
# [1] 1234
## Use gregexpr() to extract *all* matching substrings
as.numeric(regmatches(xx, gregexpr(pat, xx, perl=TRUE))[[1]])
# [1] 1234 3444 6789
(Note that this will return numeric(0) for character strings not containing a substring matching your criteria).
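When this is applied to a whole vector it is often more convenient to get NA for strings without a match, so that the result has the same length as the input. A minimal sketch, where extract_first_4digit is a hypothetical helper name and not part of base R:

```r
# Hypothetical helper: first whitespace-cushioned 4-digit number per string, NA if none
extract_first_4digit <- function(x) {
  m <- regexpr("(?<=\\s)\\d{4}(?=\\s)", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() returns matches only for the elements that matched
  out[m != -1] <- regmatches(x, m)
  as.numeric(out)
}

res <- extract_first_4digit(c("coihr 1234 &/()= jngm 34 ljd", "no digits here"))
print(res)
```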
It's possible to capture a group in a regex using (). Taking the same example:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
gsub(".*\\s(\\d{4})\\s.*", "\\1", str12)
[1] "1234"
I'm pretty naive about regex in general, but here's an ugly way to do it in base:
# if it's always in the same spot as in your example
unlist(strsplit(str12, split = " "))[2]
# or if it can occur in various places
str13 <- unlist(strsplit(str12, split = " "))
str13[!is.na(as.integer(str13)) & nchar(str13) == 4] # issues warning

R grep: Match one string against multiple patterns

In R, grep usually matches a vector of multiple strings against one regexp.
Q: Is there a possibility to match a single string against multiple regexps? (without looping through each single regexp pattern)?
Some background:
I have 7000+ keywords as indicators for several categories. I cannot change that keyword dictionary. The dictionary has following structure (keywords in col 1, numbers indicate categories where these keywords belong to):
ab 10 37 41
abbrach* 38
abbreche 39
abbrich* 39
abend* 37
abendessen* 60 63
aber 20 23 45
abermals 37
Concatenating so many keywords with "|" is not a feasible way (and I wouldn't know which of the keywords generated the hit).
Also, just reversing "patterns" and "strings" does not work, as the patterns have truncations, which wouldn't work the other way round.
[related question, other programming language]
What about applying the regexpr function over a vector of keywords?
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate by bird.", "Let's get icecream!")
sapply(keywords, regexpr, strings, ignore.case=TRUE)
dog cat bird
[1,] 15 -1 -1
[2,] -1 4 15
[3,] -1 -1 -1
sapply(keywords, regexpr, strings[1], ignore.case=TRUE)
dog cat bird
15 -1 -1
Values returned are the position of the first character in the match, with -1 meaning no match.
If the position of the match is irrelevant, use grepl instead:
sapply(keywords, grepl, strings, ignore.case=TRUE)
dog cat bird
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE TRUE
[3,] FALSE FALSE FALSE
Update: This runs relatively quick on my system, even with a large number of keywords:
# Available on most *nix systems
words <- scan("/usr/share/dict/words", what="")
length(words)
[1] 234936
system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))
user system elapsed
7.495 0.155 7.596
dim(matches)
[1] 3 234936
The re2r package can match multiple patterns (in parallel). Minimal example:
# compile patterns
re <- re2r::re2(keywords)
# match strings
re2r::re2_detect(strings, re, parallel = TRUE)
To expand on the other answer, to transform the sapply() output into a useful logical vector you need to further use an apply() step.
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate by bird.", "Let's get icecream!")
(matches <- sapply(keywords, grepl, strings, ignore.case=TRUE))
# dog cat bird
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE TRUE
# [3,] FALSE FALSE FALSE
To know which strings contain any of the keywords (patterns):
apply(matches, 1, any)
# [1] TRUE TRUE FALSE
To know which keywords (patterns) were matched in the supplied strings:
apply(matches, 2, any)
# dog cat bird
# TRUE TRUE TRUE
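The dictionary in the question uses a trailing * for truncation, which is not regex syntax, so the keywords need converting before they can be passed to grepl. The sketch below shows one way to do that and to report which keywords hit a single string. to_regex is a hypothetical helper, the sample sentence is made up, and the code assumes the keyword stems contain no regex metacharacters.

```r
# A few keywords from the dictionary in the question
dict <- c("ab", "abbrach*", "abend*", "aber", "abermals")

# Hypothetical helper: whole-word match, trailing "*" becomes "any further word characters"
to_regex <- function(k) {
  stem <- sub("\\*$", "", k)
  tail <- ifelse(grepl("\\*$", k), "\\w*", "")
  paste0("\\b", stem, tail, "\\b")
}
patterns <- to_regex(dict)

text <- "Aber am Abend gab es Abendessen"

# One grepl call per pattern against the single string
hits <- vapply(patterns, grepl, logical(1), x = text,
               ignore.case = TRUE, perl = TRUE)

matched <- dict[hits]
print(matched)
```

Indexing dict with hits keeps the original keywords, so the category numbers can then be looked up from the dictionary for each matched entry.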