I have a string like this:
Received # 10/10/2014 02:29:55 a.m.  Changed status: 'processing' # 10/10/2014 02:40:20 a.m.  Changed status: 'processed' # 10/10/2014 02:40:24 a.m.
I need to "parse" this string using certain rules:
The first block is the Received date and time
Each block after the first one starts with Changed status: and ends with a date and time
There can be any number of Changed status: blocks (at least 1) and the status can vary
What I need to do is to:
Split the string and put each block into an array.
Example:
[Received # 10/10/2014 02:29:55 a.m.], [Changed status: 'processing' # 10/10/2014 02:40:20 a.m.], [Changed status: 'processed' # 10/10/2014 02:40:24 a.m.]
After each block is split, I need to split each entry into three fields
For the above example, what I need is something like this:
Received | NULL | 10/10/2014 02:29:55 am
Changed status | processing | 10/10/2014 02:40:20 am
Changed status | processed | 10/10/2014 02:40:24 am
I think step two is quite easy (each block can be split using # and : as separators), but step one is making me pull my hair out. Is there a way to do this kind of thing with regular expressions?
I've tried some approaches (like Received|Changed.*[ap].m.), but it doesn't work (the evaluation of the regular expression always returns the full string).
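For step two, something like this already seems to work on a single block (a rough sketch):
block <- "Changed status: 'processing' # 10/10/2014 02:40:20 a.m."
strsplit(block, " # |: ")[[1]]
# [1] "Changed status"           "'processing'"             "10/10/2014 02:40:20 a.m."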
I want to do this in R:
Read the full data table (which has more fields, and the text above is the last one) into a data frame
"Parse" this string and store it into a second data frame
R has built-in support for regular expressions, so that's my first thought on approaching the solution.
Any help will be appreciated. Honestly, I'm lost here (but I'll keep on trying... I'll edit my post if I find steps that bring me closer to the solution)
Here's a possibility that you could put into a function. In the string you posted, the important information seems to be separated by two spaces, which is nice. Basically what I did was try to get all the relevant lines to split evenly into the right length.
x <- "Received # 10/10/2014 02:29:55 a.m.  Changed status: 'processing' # 10/10/2014 02:40:20 a.m.  Changed status: 'processed' # 10/10/2014 02:40:24 a.m."
s <- strsplit(gsub("['.]", "", x), "  ")[[1]]  # drop quotes and periods, then split on the double spaces
g <- grep("Received", s)
s[g] <- sub("Received", "Received: ", s[g])  # insert an empty status field after "Received"
do.call(rbind, strsplit(s, " # |: "))
# [,1] [,2] [,3]
# [1,] "Received" "" "10/10/2014 02:29:55 am"
# [2,] "Changed status" "processing" "10/10/2014 02:40:20 am"
# [3,] "Changed status" "processed" "10/10/2014 02:40:24 am"
I went without "NULL" because I presume you meant you wanted an empty character there. NULL would not show up in a data frame anyway.
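If you then want the result as a data frame (as in the question), a minimal sketch along these lines should do; the column names are just placeholders I made up:
m <- do.call(rbind, strsplit(s, " # |: "))
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("event", "status", "timestamp")  # hypothetical names
df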
Here is a short solution based on strapplyc. strapplyc matches the regular expression to the input string s, extracting the matches to the parenthesized portions of the regular expression, except the (?:...) which is non-capturing.
There are 3 capturing pairs of parentheses in pat. The first one matches Received or Changed status. Then we optionally match a colon, space, single quote, zero or more non-single-quote characters and a closing quote; the portion between the two quotes is the second captured string. Then we match space, #, space and the date/time string. The date/time string is captured.
Finally matrix is used to reshape it into 3 columns:
library(gsubfn)
s <- "Received # 10/10/2014 02:29:55 a.m.  Changed status: 'processing' # 10/10/2014 02:40:20 a.m.  Changed status: 'processed' # 10/10/2014 02:40:24 a.m."
pat <- "(Received|Changed status)(?:: '([^']*)')? # (../../.... ..:..:.. ....)"
matrix(strapplyc(s, pat, simplify = TRUE), ncol = 3, byrow = TRUE)
giving:
[,1] [,2] [,3]
[1,] "Received" "" "10/10/2014 02:29:55 a.m."
[2,] "Changed status" "processing" "10/10/2014 02:40:20 a.m."
[3,] "Changed status" "processed" "10/10/2014 02:40:24 a.m."
Update: Simplification. Also modified output to be as in question.
tmp <- "Received # 10/10/2014 02:29:55 a.m.  Changed status: 'processing' # 10/10/2014 02:40:20 a.m.  Changed status: 'processed' # 10/10/2014 02:40:24 a.m."
tmp1 <- strsplit(gsub('Received', 'Received:', tmp), '\\s{2}', perl = TRUE)
do.call(rbind, strsplit(tmp1[[1]], '# |: '))
# [,1] [,2] [,3]
# [1,] "Received" "" "10/10/2014 02:29:55 a.m."
# [2,] "Changed status" "'processing' " "10/10/2014 02:40:20 a.m."
# [3,] "Changed status" "'processed' " "10/10/2014 02:40:24 a.m."
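If the leftover quotes and trailing spaces in column 2 bother you, they could be stripped afterwards, e.g. (a small sketch):
res <- do.call(rbind, strsplit(tmp1[[1]], '# |: '))
res[, 2] <- gsub("'|\\s+$", "", res[, 2])  # drop quotes and trailing whitespace
res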
I'm assuming that you've got your data in a data.frame and that you want to do this on many rows in your data frame. I'm calling that data.frame "Data", and here's what I would do, although perhaps someone else could make this more elegant:
library(stringr)
library(lubridate)
Split <- str_split(Data$String, "#") # Make a list with your string split by "#"
Data$Received <- NA
Data$Processing <- NA
Data$Processed <- NA
for (i in 1:nrow(Data)) {
  Data$Received[i]   <- str_sub(Split[[i]][2], 2, 24) # Extract the date received, etc.
  Data$Processing[i] <- str_sub(Split[[i]][3], 2, 24)
  Data$Processed[i]  <- str_sub(Split[[i]][4], 2, 24)
}
Data$Received <- mdy_hms(Data$Received) # Use lubridate to convert it to POSIX format
Data$Processing <- mdy_hms(Data$Processing)
Data$Processed <- mdy_hms(Data$Processed)
That gives you three columns for the dates and times you want.
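The loop could also be avoided with a vectorized sketch along these lines (assuming every row really has all three blocks, in the same order):
# take pieces 2:4 of each split string, then transpose so rows correspond to records
dates <- t(sapply(Split, function(p) str_sub(p[2:4], 2, 24)))
Data$Received   <- mdy_hms(dates[, 1])
Data$Processing <- mdy_hms(dates[, 2])
Data$Processed  <- mdy_hms(dates[, 3])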
I have a database with a non-validated year field. Most of the entries are 4-digit years, but about 10% of the entries are "whatever." This has led me down the rabbit hole of regular expressions, to little avail. Getting better results than what I have is progress, even if I don't extract 100%.
#what a mess
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
#does a good job with any string containing a 4-digit year
as.numeric(sub('\\D*(\\d{4}).*', '\\1', yearEntries))
#does a good job with any string containing a 2-digit year, nought else
as.numeric(sub('\\D*(\\d{2}).*', '\\1', yearEntries))
The desired output is to grab the first readable year, so 1992-1993 would be 1992 and "the 70s" would be 1970.
How can I improve my parsing accuracy? Thanks!
EDIT: Pursuant to garyh's answer this gets me much closer:
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\d)|\\d{4}).*","\\1",yearEntries,perl=TRUE)
# [1] "79" "07-2608" "07-262008" "96" "70" "93" "70" "15" "60" "70" NA "2013" "1992"
but note that, while the dates with dashes in them work with garyh's regex101.com demo, they don't work with R, keeping the month and day values, and the first dash.
Also, I realize I didn't include an example date with slashes rather than dashes. Another term in the regex should handle that, but again, with R it does not produce the same (correct) result that regex101.com does.
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\/|\\d)|\\d{4}).*","\\1","07/09/13",perl=TRUE)
# [1] "07/0913"
These negative lookbehinds and lookaheads are very powerful but stretch my feeble brain.
Not sure what flavour of regex R uses, but this seems to get all the years in the string
/((?<!\d)\d{2}(?!\-|\d)|\d{4})/g
This is matching any 4 digits or any 2 digits provided they're not followed by a dash - or digit, or preceded by another digit
see demo here
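For what it's worth, the same pattern seems to behave in R via gregexpr/regmatches with perl = TRUE (a quick sketch):
x <- "late 70's and 1992-1993"
regmatches(x, gregexpr("(?<!\\d)\\d{2}(?!-|\\d)|\\d{4}", x, perl = TRUE))[[1]]
# [1] "70"   "1992" "1993"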
You're going to need some elbow grease and do something like:
library(lubridate)
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
x <- yearEntries
x <- gsub("(late|early)", "", x, ignore.case=TRUE)
x <- gsub("[']*[s]*", "", x)
x <- gsub(",.*$", "", x)
x <- gsub(" ", "", x)
x <- ifelse(nchar(x)==9 | nchar(x)<8, gsub("[-/]+[[:digit:]]+$", "", x), x)
x <- ifelse(nchar(x)==4, gsub("^[[:digit:]]{2}", "", x), x)
y <- format(parse_date_time(x, "%m-%d-%y!"), "%y")
yearEntries <- ifelse(!is.na(y), y, x)
yearEntries
## [1] "79" "08" "08" "96" "70" "93" "70" "15" "60" "70" NA "13" "92"
We have no idea which year you need from ranged entries, but this should get you started.
I found a very simple way to get a good result (though I would not claim it is bulletproof). It grabs the last readable year, which is okay too.
yearEntries <- c("79, 80, 99","07/26/08","07-26-2008","'96 ","Early 70's","93/95","15",NA,"2013","1992-1993","ongoing")
# assume the last two digits present in any string represent a 2-digit year
a <- sub(".*(\\d{2}).*$", "\\1", yearEntries)
# [1] "99" "08" "08" "96" "70" "95" "15" NA "13" "93" "ongoing"
# change to numeric, strip NAs and add 2000
b <- na.omit(as.numeric(a)) + 2000
# [1] 2099 2008 2008 2096 2070 2095 2015 2013 2093
# assume any greater than present is last century
b[b > 2015] <- b[b > 2015] - 100
# [1] 1999 2008 2008 1996 1970 1995 2015 2013 1993
...and Bob's your uncle!
@garyh's regex works well actually if you use the regmatches/gregexpr combo to extract the pattern instead of sub:
regmatches(yearEntries,
gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE))
[[1]]
[1] "79" "80" "99"
[[2]]
[1] "08"
[[3]]
[1] "2008"
[[4]]
[1] "96"
[[5]]
[1] "70"
[[6]]
[1] "95"
[[7]]
[1] "70"
[[8]]
[1] "15"
[[9]]
[1] "60"
[[10]]
[1] "70"
[[11]]
character(0)
[[12]]
[1] "2013"
[[13]]
[1] "1992" "1993"
To only keep the first matching pattern:
sapply(regmatches(yearEntries,
                  gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}", yearEntries, perl = TRUE)),
       `[`, 1)
[1] "79" "08" "2008" "96" "70" "95" "70" "15" "60" "70" NA "2013" "1992"
regmatches("07/09/13",gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}","07/09/13",perl=TRUE))
[[1]]
[1] "13"
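To then turn those first matches into four-digit years, a sketch along the lines of the other answer (assuming, as there, that nothing is later than 2015):
yr <- as.numeric(sapply(regmatches(yearEntries,
                 gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}", yearEntries, perl = TRUE)), `[`, 1))
yr <- ifelse(yr < 100, ifelse(yr + 2000 > 2015, yr + 1900, yr + 2000), yr)
yr
# [1] 1979 2008 2008 1996 1970 1995 1970 2015 1960 1970   NA 2013 1992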
R newbie here
I have data that looks something like this:
{'id': 19847005, 'profile_sidebar_fill_color': u'http://pbs.foo.com/profile_background', 'profile_text_color': u'333333', 'followers_count': 1105, 'location': u'San Diego, CA', 'profile_background_color': u'9AE4E8', 'listed_count': 43, '009', 'time_zone': u'Pacific Time (US & Canada)', 'protected': False}
I want to extract the location data from this text: San Diego, CA.
I have been trying to use the stringr package to accomplish this, but can't quite get the regex right to capture the city and state. Sometimes the state will be present, other times not.
location_pattern <- "'location':\su'(\w+)'"
rawdata$location <- str_extract(rawdata$user, location_pattern)
You could try
library(stringr)
str_extract_all(str1, perl("(?<=location.: u.)[^']+(?=')"))[[1]]  # str1 holds the raw record from the question
#[1] "San Diego, CA"
Others have given possible solutions, but not explained what likely went wrong with your attempt.
The str_extract function uses POSIX extended regular expressions that do not understand \w and \s; those are specific to Perl regular expressions. You can use the perl function in the stringr package instead, and it will then recognize the shortcuts, or you can use [[:space:]] in place of \s and [[:alnum:]_] in place of \w, though more likely you will want something like [[:alpha:], ] or [^'].
Also, R's string parser gets a shot at the string before it is passed to the matching function, therefore you will need \\s and \\w if you use the perl function (or any other regular expression function in R). The first \ escapes the second so that a single \ remains in the string to be interpreted as part of the regular expression.
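A quick base-R illustration of the doubled-backslash point (a sketch on a shortened record):
x <- "'location': u'San Diego, CA', 'protected': False"
# "\\s" in the R source becomes the single regex token \s once the string is parsed
regmatches(x, regexpr("'location':\\s*u'[^']+'", x, perl = TRUE))
# [1] "'location': u'San Diego, CA'"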
It looks like a JSON string, but if you're not too concerned about that, then perhaps this would help.
library(stringi)
ss <- stri_split_regex(x, "[{}]|u?'|(, '(009')?)|: ", omit_empty = TRUE)[[1]]  # x is the raw record from the question
(m <- matrix(ss, ncol = 2, byrow = TRUE))
# [,1] [,2]
# [1,] "id" "19847005"
# [2,] "profile_sidebar_fill_color" "http://pbs.foo.com/profile_background"
# [3,] "profile_text_color" "333333"
# [4,] "followers_count" "1105"
# [5,] "location" "San Diego, CA"
# [6,] "profile_background_color" "9AE4E8"
# [7,] "listed_count" "43"
# [8,] "time_zone" "Pacific Time (US & Canada)"
# [9,] "protected" "False"
So now you have the ID names in the left column and the values on the right. It would probably be simple to reassemble the JSON from this point if need be.
Also, regarding the JSON-ness, we can coerce m to a data.frame (or leave it as a matrix) and then use jsonlite::toJSON:
library(jsonlite)
json <- toJSON(setNames(as.data.frame(m), c("ID", "Value")))
fromJSON(json)
# ID Value
# 1 id 19847005
# 2 profile_sidebar_fill_color http://pbs.foo.com/profile_background
# 3 profile_text_color 333333
# 4 followers_count 1105
# 5 location San Diego, CA
# 6 profile_background_color 9AE4E8
# 7 listed_count 43
# 8 time_zone Pacific Time (US & Canada)
# 9 protected False
colsplit in package reshape2 can be used to split character data:
colsplit(c("A_1", "A_2", "A_3"), pattern="_", c("Letter", "Number"))
Letter Number
1 A 1
2 A 2
3 A 3
In his paper "Reshaping Data with the reshape Package", Hadley Wickham gives an example of using colsplit to split data into individual characters. His example should produce the above from the data c("A1", "A2", "A3"), which he does by omitting the pattern argument. But this throws an error.
The documentation for str_split_fixed, which colsplit calls, says that setting pattern="" will split into individual characters, but this does not work.
Is there any way to use colsplit so that it splits into individual characters?
This is R 3.1.1 and packages are up to date.
The problem is that you are referring to an article about "reshape" but are using "reshape2". The two are not the same and they don't work the same:
library(reshape)
library(reshape2)
reshape:::colsplit(c("A1", "A2", "A3"), "", c("V1", "V2"))
# V1 V2
# 1 A 1
# 2 A 2
# 3 A 3
reshape2:::colsplit(c("A1", "A2", "A3"), "", c("V1", "V2"))
# V1 V2
# 1 NA A1
# 2 NA A2
# 3 NA A3
If you don't have to go the colsplit way, there are other options:
do.call(rbind, strsplit(c("A1", "A2", "A3"), "", fixed = TRUE))
# [,1] [,2]
# [1,] "A" "1"
# [2,] "A" "2"
# [3,] "A" "3"
Or, a more general approach (for example characters followed by numbers, not necessarily one character each):
do.call(rbind, strsplit(c("A1", "A2", "A3"),
split = "(?<=[a-zA-Z])(?=[0-9])",
perl = TRUE))
# [,1] [,2]
# [1,] "A" "1"
# [2,] "A" "2"
# [3,] "A" "3"
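The same lookbehind/lookahead split also handles prefixes and suffixes of varying length, e.g. (a quick sketch):
do.call(rbind, strsplit(c("AB12", "XYZ3"),
                        split = "(?<=[a-zA-Z])(?=[0-9])",
                        perl = TRUE))
#      [,1]  [,2]
# [1,] "AB"  "12"
# [2,] "XYZ" "3"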
Using qdap:
library(qdap)
colSplit(c("A1", "A2", "A3"), "")
## X1 X2
## 1 A 1
## 2 A 2
## 3 A 3
I have a large dataset where all column headers are individual IDs, each 8 characters in length. I would like to split those IDs into 2 rows, where the first row of IDs contains the first 7 characters and the second row contains just the last character.
Current dataset:
ID1: Indiv01A Indiv01B Indiv02A Indiv02B Speci03A Speci03B
Intended dataset:
ID1: Indiv01 Indiv01 Indiv02 Indiv02 Speci03 Speci03
ID2: A B A B A B
I've looked through other posts on splitting data, but they all seem to have a unique character separating the 2 components (i.e., a comma or a period).
This is the code I'm thinking would work best, but I just can't figure out how to code for "7 characters" as the split point, rather than a comma:
sapply(strsplit(as.character(d$ID), ",")
Any help would be appreciated.
Here's a regular expression for a solution with strsplit. It splits the string between the 7th and the 8th character:
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
(res <- strsplit(ID1, "(?<=.{7})", perl = TRUE))
# [[1]]
# [1] "Indiv01" "A"
#
# [[2]]
# [1] "Indiv01" "B"
#
# [[3]]
# [1] "Indiv02" "A"
#
# [[4]]
# [1] "Indiv02" "B"
#
# [[5]]
# [1] "Speci03" "A"
#
# [[6]]
# [1] "Speci03" "B"
Now, you can use rbind to create two columns:
do.call(rbind, res)
# [,1] [,2]
# [1,] "Indiv01" "A"
# [2,] "Indiv01" "B"
# [3,] "Indiv02" "A"
# [4,] "Indiv02" "B"
# [5,] "Speci03" "A"
# [6,] "Speci03" "B"
Explanation of the regex pattern:
(?<=.{7})
The (?<=) is a (positive) lookbehind. It matches any position that is preceded by the specified pattern. Here, the pattern is .{7}. The dot (.) matches any character. {7} means 7 times. Hence, the regex matches the position that is preceded by exactly 7 characters.
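And if you want the two rows ID1/ID2 from the question rather than two columns, you can just take the matrix columns (a sketch; the names are mine):
m <- do.call(rbind, res)
ID1.new <- m[, 1]  # "Indiv01" "Indiv01" "Indiv02" "Indiv02" "Speci03" "Speci03"
ID2     <- m[, 2]  # "A" "B" "A" "B" "A" "B"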
Here is a gsubfn solution:
library(gsubfn)
strapplyc(ID1, "(.*)(.)", simplify = cbind)
which gives this matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "Indiv01" "Indiv01" "Indiv02" "Indiv02" "Speci03" "Speci03"
[2,] "A" "B" "A" "B" "A" "B"
or use rbind in place of cbind if you want two columns (rather than two rows).
There are a couple of ways you could go about this.
To extract the final character
First, with substr:
new.vec <- sapply(old.vec, function(x) substr(x, nchar(x), nchar(x)))
or, with sub:
new.vec <- sub('.*(.)', '\\1', old.vec)
where old.vec is the vector of strings that you want to split.
For interest, the latter option uses a regular expression that translates to: "capture (indicated by the surrounding parentheses) the single character (.) that follows zero or more other characters (.*), and replace each match with the captured content (\\1)". For more info, see ?gsub, and here.
The above options allow for varying string lengths. However, if you do always want to split after 7 characters, and the second part of the string always has just a single character, then the following should work:
new.vec <- substr(old.vec, 8, 8)
(Edited to include method for extracting the first part of the string.)
To extract all but the final character
The process is similar.
new.vec <- sapply(old.vec, function(x) substr(x, 1, nchar(x) - 1))
new.vec <- sub('(.*).', '\\1', old.vec)
new.vec <- substr(old.vec, 1, 7)
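Putting the two parts together on the question's IDs gives something like this (a sketch; the data.frame names are mine):
old.vec <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
data.frame(ID1 = substr(old.vec, 1, 7),
           ID2 = substr(old.vec, 8, 8))
#       ID1 ID2
# 1 Indiv01   A
# 2 Indiv01   B
# 3 Indiv02   A
# 4 Indiv02   B
# 5 Speci03   A
# 6 Speci03   B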
In R, grep usually matches a vector of multiple strings against one regexp.
Q: Is there a possibility to match a single string against multiple regexps, without looping through each single regexp pattern?
Some background:
I have 7000+ keywords as indicators for several categories. I cannot change that keyword dictionary. The dictionary has the following structure (keywords in col 1; the numbers indicate the categories these keywords belong to):
ab 10 37 41
abbrach* 38
abbreche 39
abbrich* 39
abend* 37
abendessen* 60 63
aber 20 23 45
abermals 37
Concatenating so many keywords with "|" is not feasible (and I wouldn't know which of the keywords generated the hit).
Also, just reversing "patterns" and "strings" does not work, as the patterns have truncations, which wouldn't work the other way round.
[related question, other programming language]
What about applying the regexpr function over a vector of keywords?
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
sapply(keywords, regexpr, strings, ignore.case=TRUE)
dog cat bird
[1,] 15 -1 -1
[2,] -1 4 15
[3,] -1 -1 -1
sapply(keywords, regexpr, strings[1], ignore.case=TRUE)
dog cat bird
15 -1 -1
Values returned are the position of the first character in the match, with -1 meaning no match.
If the position of the match is irrelevant, use grepl instead:
sapply(keywords, grepl, strings, ignore.case=TRUE)
dog cat bird
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE TRUE
[3,] FALSE FALSE FALSE
Update: This runs relatively quickly on my system, even with a large number of keywords:
# Available on most *nix systems
words <- scan("/usr/share/dict/words", what="")
length(words)
[1] 234936
system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))
user system elapsed
7.495 0.155 7.596
dim(matches)
[1] 3 234936
The re2r package can match multiple patterns (in parallel). Minimal example:
# compile patterns
re <- re2r::re2(keywords)
# match strings
re2r::re2_detect(strings, re, parallel = TRUE)
To expand on the other answer: to transform the sapply() output into useful logical vectors, you need a further apply() step.
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
(matches <- sapply(keywords, grepl, strings, ignore.case=TRUE))
# dog cat bird
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE TRUE
# [3,] FALSE FALSE FALSE
To know which strings contain any of the keywords (patterns):
apply(matches, 1, any)
# [1] TRUE TRUE FALSE
To know which keywords (patterns) were matched in the supplied strings:
apply(matches, 2, any)
# dog cat bird
# TRUE TRUE TRUE
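And since the dictionary maps keywords to categories, the keywords that hit a given string can be pulled out by name (a sketch with the toy data above):
names(which(matches[1, ]))
# [1] "dog"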