Parse Data of a String in R

I need help solving what seems like a very easy problem. I have a string, 70 - 3/31/2014 - 60#1.66. I would like to parse out only the information between the second "-" and the "#", i.e. "60". Is there a function or combination of functions in R that can extract the text between two specified characters?
Thanks!

1) sub: This matches the entire string and replaces it with the capture group, i.e. the part of the string matched by the parenthesized portion of the regular expression:
x <- "70 - 3/31/2014 - 60#1.66"
sub(".*- (.*)#.*", "\\1", x)
## [1] "60"
2) gsub: This replaces both the portion before the wanted substring and the portion after it with empty strings:
gsub(".*- |#.*", "", x)
# [1] "60"
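For comparison, here is a minimal Python sketch of the same two approaches. It assumes Python's standard re module treats these simple patterns (greedy .* and a numbered capture group) the same way R does; the variable names are illustrative and not part of the answer above.

```python
import re

x = "70 - 3/31/2014 - 60#1.66"

# Approach 1: match the whole string and keep only capture group 1.
# The greedy leading .* pushes the "- " as far right as possible,
# so the group starts after the last "- " and stops before the "#".
via_capture = re.sub(r".*- (.*)#.*", r"\1", x)

# Approach 2: delete everything up to the last "- " and everything
# from the "#" onward; what remains is the wanted field.
via_strip = re.sub(r".*- |#.*", "", x)

print(via_capture, via_strip)
```

Both variants rely on the wanted field being the last one introduced by "- ". If more " - " separated fields could follow, counting hyphens from the start of the string is the safer choice.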

Through sub,
> x <- "70 - 3/31/2014 - 60#1.66"
> sub("^[^-]*-[^-]*-\\s*([^#]*)#.*", "\\1", x)
[1] "60"
> sub("^[^-]*-[^-]*-([^#]*)#.*", "\\1", x)
[1] " 60"
> sub("^(?:[^-]*-){2}\\s*([^#]*)#.*", "\\1", x)
[1] "60"
^ - Asserts that we are at the start of the string.
[^-]*- - Matches zero or more characters other than -, followed by a hyphen.
(?:[^-]*-){2} - Repeats the above pattern exactly two times, so the match ends just after the second hyphen.
\\s* - Matches zero or more whitespace characters. The second call above omits it, which is why its result keeps the leading space.
([^#]*) - Captures zero or more characters other than #.
#.* - Matches the # and all the remaining characters.
Replacing the whole match with the contents of capture group 1 gives the desired output.
OR
> x <- "70 - 3/31/2014 - 60#1.66"
> m <- regexpr("^(?:[^-]*-){2}\\s*\\K[^#]*(?=#)", x, perl=TRUE)
> regmatches(x, m)
[1] "60"
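\K drops the text matched so far from the overall regex match, and the lookahead (?=#) requires a # right after the match without consuming it, so regmatches returns only "60".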
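A rough Python equivalent of the hyphen-counting pattern, as a sketch only. Python's standard re module does not support \K, so this version reads the field from a capture group instead; the function name and its parameter are made up for illustration.

```python
import re

# Skip two "anything up to and including a hyphen" chunks, skip optional
# whitespace, then capture everything up to (not including) the "#".
FIELD_AFTER_SECOND_HYPHEN = re.compile(r"^(?:[^-]*-){2}\s*([^#]*)#")

def field_after_second_hyphen(text):
    """Return the text between the second '-' and the next '#', or None."""
    m = FIELD_AFTER_SECOND_HYPHEN.search(text)
    return m.group(1) if m else None

x = "70 - 3/31/2014 - 60#1.66"
field = field_after_second_hyphen(x)
print(field)
```

Returning None for strings that lack two hyphens or a "#" makes non-matching input explicit, whereas sub in R silently returns the input unchanged when nothing matches.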

Related

Matching until first occurrence of a term followed by a character

I would like to match only the strings that have a single search term (in my example, that is only the first string). Multiple search terms within a string are separated by a + sign:
jobs?search=term1&location=&distance=10+page=2
jobs?search=term1+term2&location=ca&distance=30
jobs?search=term1+term2+term3&location=nyc&distance=25
My idea was to match any word (preceded by search=) that is not followed by + but is followed by &:
search=.*?[^+]&
But it doesn't quite work and captures strings with multiple terms.

You need to use
[&?]search=([^&+]+)(?=&|$)
It will match:
[&?] - a ? or & (to make sure search is the whole key name)
search= - a literal substring
([^&+]+) - Group 1 capturing 1+ symbols other than + and &
(?=&|$) - a lookahead requiring a & or end of string to appear immediately after the last symbol captured with the preceding subpattern (note it can be replaced with a non-capturing group, (?:&|$), the value will be still in Group 1).
Python demo:
import re
ss = ['jobs?search=term1&location=&distance=10+page=2','jobs?search=term1+term2&location=ca&distance=30','jobs?search=term1+term2+term3&location=nyc&distance=25']
rx = re.compile(r'[&?]search=([^&+]+)(?=&|$)')
for s in ss:
    m = rx.search(s)
    if m:
        print("{}: {}".format(s, m.group(1)))
Base R:
ss <- c('jobs?search=term1&location=&distance=10+page=2','jobs?search=term1+term2&location=ca&distance=30','jobs?search=term1+term2+term3&location=nyc&distance=25')
results <- regmatches(ss, regexec("[&?]search=([^&+]+)(?:&|$)",ss))
unlist(results)[2]
... or with R stringr:
> library(stringr)
> ss <- c('jobs?search=term1&location=&distance=10+page=2','jobs?search=term1+term2&location=ca&distance=30','jobs?search=term1+term2+term3&location=nyc&distance=25')
> results <- str_match(ss, "[&?]search=([^&+]+)(?:&|$)")
> results[,2]
[1] "term1" NA NA

If you want to only capture the term and not the preceding search=:
(?<=search=)[^+]*?(?=&|$)
(?<=search=) - Positive Lookbehind to ensure the search= precedes the term
[^+]*? - Matches the term, which cannot contain a +. The match is non-greedy (*?), so it stops at the first & that follows
(?=&|$) - Positive Lookahead to ensure the term is followed by either a & or end of string ($)
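The lookbehind variant can be tried in Python as well, since the re module supports fixed-width lookbehinds such as (?<=search=). This is a small sketch using the three sample strings from the question; the helper name is hypothetical.

```python
import re

# Lookbehind: the term must be directly preceded by "search=".
# [^+]*? : the term itself, which may not contain "+".
# Lookahead: the term must be followed by "&" or the end of the string.
TERM_ONLY = re.compile(r"(?<=search=)[^+]*?(?=&|$)")

def single_search_term(url):
    """Return the lone search term, or None if there are several terms."""
    m = TERM_ONLY.search(url)
    return m.group(0) if m else None

urls = [
    "jobs?search=term1&location=&distance=10+page=2",
    "jobs?search=term1+term2&location=ca&distance=30",
    "jobs?search=term1+term2+term3&location=nyc&distance=25",
]
terms = [single_search_term(u) for u in urls]
print(terms)
```

One caveat with this variant: the lookbehind only checks the characters "search=", so a differently named key that merely ends in "search" would also be accepted. The [&?]search= form shown earlier avoids that.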

capture repetition of letters in a word with regex

I'm trying to detect words in which letters are repeated, and I would like to replace each such run with a single copy of the repeated letter. The text is in Hebrew. For instance, שללללוווווםםםם should just become שלום.
Basically, when a letter repeats itself 3 times or more, it should be detected and replaced.
I want to use the regular expression with R's gsub.
df$text <- gsub("?", "?", df$text)

You can use
> x = "שללללוווווםםםם"
> gsub("(.)\\1{2,}", "\\1", x)
#[1] "שלום"
NOTE: This collapses any character (not just Hebrew) that is repeated three or more times in a row.
Or use the following to collapse only word characters (letters, digits and underscore) from any language:
> gsub("(\\w)\\1{2,}", "\\1", x)

If you plan to only remove repeating characters from the Hebrew script (keeping others), I'd suggest:
s <- "שללללוווווםםםם ......... שללללוווווםםםם"
gsub("(\\p{Hebrew})\\1{2,}", "\\1", s, perl=TRUE)
Details:
(\\p{Hebrew}) - Group 1 capturing a character from Hebrew script (as \p{Hebrew} is a Unicode property/category class)
\\1{2,} - 2 or more (due to the {2,} limiting quantifier) further occurrences of the character stored in Group 1 (\\1 is a backreference to the Group 1 contents), so the whole pattern matches runs of 3 or more identical characters.
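A hedged Python sketch of the same backreference technique. Python's standard re module has no \p{Hebrew} class, so the Hebrew-only variant below assumes the Unicode Hebrew block U+0590 to U+05FF is a good enough stand-in; that range and the function name are assumptions, not something taken from the answers above.

```python
import re

# Any character followed by 2+ copies of itself (a run of 3 or more).
ANY_RUN = re.compile(r"(.)\1{2,}")
# Same, restricted to the Unicode Hebrew block (assumed substitute
# for \p{Hebrew}, which the re module does not provide).
HEBREW_RUN = re.compile(r"([\u0590-\u05FF])\1{2,}")

def collapse_runs(text, hebrew_only=False):
    """Replace each run of 3+ identical characters with one copy."""
    pattern = HEBREW_RUN if hebrew_only else ANY_RUN
    return pattern.sub(r"\1", text)

# Build the stretched word programmatically to keep the counts obvious.
stretched = "ש" + "ל" * 4 + "ו" * 5 + "ם" * 4
sample = stretched + " " + "." * 9 + " " + stretched

print(collapse_runs(stretched))
print(collapse_runs(sample, hebrew_only=True))
```

With hebrew_only=True the run of dots in the sample is left alone, while the default pattern collapses it to a single dot as well.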

How to match a string and white space in R

I have a dataframe with columns having values like:
"Average 18.24" "Error 23.34". My objective is to remove the text and the space that follows it from these values in R. Can anybody help me with a regex pattern to do this?
I am able to match the text using [A-Z], but I am not able to combine it with the white space. I tried [A-Z][[:space:]] with no luck.
Your help is appreciated.

We can use sub. The pattern \\D+ matches the first run of non-digit characters, and using '' as the replacement removes it.
sub("\\D+", '', v2)
#[1] "18.24" "23.34"
Or match one or more word characters followed by one or more whitespace characters and replace them with ''.
sub("\\w+\\s+", "", v2)
#[1] "18.24" "23.34"
Or if we are using stringr
library(stringr)
word(v2, 2)
#[1] "18.24" "23.34"
data
v2 <- c("Average 18.24" ,"Error 23.34")

You can add a + quantifier, extend the character class with a-z, and anchor the pattern with ^. Either of these patterns works:
"^\\S+\\s+"
"^[a-zA-Z]+[[:space:]]+"
R demo:
> b <- c("Average 18.24", "Error 23.34")
> sub("^[A-Za-z]+[[:space:]]+", "", b)
> ## or sub("^\\S+\\s+", "", b)
[1] "18.24" "23.34"
Details:
^ - start of string
[A-Za-z]+ - one or more letters (replace with \\S+ to match 1 or more non-whitespaces)
[[:space:]]+ - 1+ whitespaces (or \\s+ will match 1 or more whitespaces)
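The anchored patterns carry over to Python almost unchanged. This sketch assumes \s is an acceptable replacement for [[:space:]], which Python's re module does not support; the variable names are illustrative.

```python
import re

values = ["Average 18.24", "Error 23.34"]

# ^\S+\s+ : a leading run of non-whitespace plus the whitespace after it.
numbers = [re.sub(r"^\S+\s+", "", v) for v in values]

# Stricter variant: the leading label must consist of ASCII letters only.
numbers_letters_only = [re.sub(r"^[A-Za-z]+\s+", "", v) for v in values]

print(numbers, numbers_letters_only)
```

The stricter variant leaves a value untouched when its first token is not purely alphabetic, which can be useful if some entries are already plain numbers.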

Extracting part of string using regular expressions

I'm struggling to get a regular expression to work. I have a long list of strings from which I need to extract a part. I only need the strings that start with "WER", and from those I only need the last part of the string, starting from (and including) the letter.
test <- c("abc00012Z345678","WER0004H987654","WER12400G789456","WERF12","0-0Y123")
Here is the line of code which is working but only for one letter. However in my list of strings it can have any letter.
ifelse(substr(test,1,3)=="WER",gsub("^.*H.*?","H",test),"")
What I’m hoping to achieve is the following:
H987654
G789456
F12

You can use the following pattern with gsub:
> gsub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test)
[1] "" "H987654" "G789456" "F12" ""
This pattern matches:
^ - start of a string
(?: - start of an alternation group with 2 alternatives:
WER.*([a-zA-Z]\\d*) - WER char sequence followed with 0+ any characters (.*) as many as possible up to the last letter ([a-zA-Z]) followed by 0+ digits (\\d*) (replace with \\d+ to match 1+ digits, to require at least 1 digit)
| - or
.* - any 0+ characters
)$ - closing the alternation group and match the end of string with $.
With str_match from stringr, it is even tidier:
> library(stringr)
> res <- str_match(test, "^WER.*([a-zA-Z]\\d*)$")
> res[,2]
[1] NA "H987654" "G789456" "F12" NA
If there are newlines in the input, add (?s) at the beginning of the pattern: res <- str_match(test, "(?s)^WER.*([a-zA-Z]\\d*)$").

If you don't want empty strings or NA for strings that don't start with "WER", you could try the following approach:
sub(".*([A-Z].*)$", "\\1", test[grepl("^WER", test)])
#[1] "H987654" "G789456" "F12"
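The same extraction, sketched in Python for comparison. It reuses the str_match pattern from above and assumes None is an acceptable stand-in for R's NA; the helper name is made up for this example.

```python
import re

test = ["abc00012Z345678", "WER0004H987654", "WER12400G789456", "WERF12", "0-0Y123"]

# WER at the start, then a greedy .* that backtracks until the rest of the
# string is a single letter followed only by digits.
WER_SUFFIX = re.compile(r"^WER.*([a-zA-Z]\d*)$")

def wer_suffix(s):
    """Return the trailing letter-plus-digits part of a WER string, else None."""
    m = WER_SUFFIX.match(s)
    return m.group(1) if m else None

all_results = [wer_suffix(s) for s in test]                # keeps positions
only_matches = [r for r in all_results if r is not None]   # drops non-WER strings

print(all_results)
print(only_matches)
```

Keeping all_results preserves alignment with the input vector, while only_matches corresponds to filtering first, as the grepl approach does.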

Regex not match pattern followed by horizontal ellipsis in string

I am trying to extract Twitter hashtags from text using regex in R, using str_match_all from the "stringr" package.
The problem is that sometimes the hashtag gets truncated, with a horizontal ellipsis character appended to the end of the text string, as shown in this example:
str_match_all("hello #goodbye #au…","#[[:alnum:]_+]*[^…]")[[1]]
I can successfully extract a list of hashtags, using the above code, but I want to exclude hashtags that are truncated (i.e. that have a horizontal ellipsis character).
This is frustrating as I have looked everywhere for a solution, and the above code is the best I can come up with, but clearly does not work.
Any help is deeply appreciated.

I suggest using regmatches with gregexpr and the #[^#\\s]+(?!…)\\b Perl-style regex:
x <- "#hashtag1 notHashtag #hashtag2 notHashtag #has…"
m <- gregexpr('#[^#\\s]+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\w+(?!…)\\b', x, perl=TRUE)
# or m <- gregexpr('#\\S+(?!…)\\b', x, perl=TRUE)
regmatches(x, m)
The regex means:
# - Literal #
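[^#\\s]+ - 1 or more characters other than # and whitespace (or \\w+ to match letters, digits and underscore only, or \\S+ to match any non-whitespace characters)
(?!…)\\b - A word boundary at a position that is not immediately followed by a …, which rejects hashtags truncated with an ellipsis
With x as defined above, the truncated #has… is skipped and only #hashtag1 and #hashtag2 are returned. On the string from the question, "hello #goodbye #au…", the result is [1] "#goodbye"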
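To see the lookahead and word boundary working together outside R, here is a small Python sketch using the \w+ variant of the pattern. The constant and function names are illustrative only.

```python
import re

ELLIPSIS = "\u2026"  # the horizontal ellipsis character

# "#" plus word characters, ending at a word boundary that is not
# immediately followed by the ellipsis. Backtracking into a truncated tag
# cannot help, because no shorter prefix of it ends at a word boundary.
COMPLETE_TAG = re.compile(r"#\w+(?!" + ELLIPSIS + r")\b")

def complete_hashtags(text):
    """Return the hashtags in text that are not cut off by an ellipsis."""
    return COMPLETE_TAG.findall(text)

question_tags = complete_hashtags("hello #goodbye #au" + ELLIPSIS)
answer_tags = complete_hashtags(
    "#hashtag1 notHashtag #hashtag2 notHashtag #has" + ELLIPSIS
)

print(question_tags)
print(answer_tags)
```

The \b is what stops the regex from returning a shortened tag such as "#a": without it, the engine could backtrack one character so that the lookahead no longer sees the ellipsis.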
