How to match a string and white space in R - regex

I have a data frame with columns having values like:
"Average 18.24" "Error 23.34". My objective is to remove the text and the space that follows it from these values in R. Can anybody help me with a regex pattern to do this?
I am able to remove the text using [A-Z], but I am not able to combine it with the white space. I tried [A-Z][[:space:]], with no luck.
Your help is appreciated.

We can use sub. The pattern \\D+ matches one or more non-digit characters, and sub replaces only the first match, so replacing it with '' removes the leading text together with the space.
sub("\\D+", '', v2)
#[1] "18.24" "23.34"
Or match one or more word characters followed by one or more whitespace characters and replace them with ''.
sub("\\w+\\s+", "", v2)
#[1] "18.24" "23.34"
Or if we are using stringr
library(stringr)
word(v2, 2)
#[1] "18.24" "23.34"
data
v2 <- c("Average 18.24" ,"Error 23.34")
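Since the question mentions data frame columns, here is a minimal sketch of applying the same idea to a column and converting the result to numbers. The data frame df and the column names stat and value are made up for illustration; they do not come from the question.

```r
# Hypothetical data frame; the column names are assumptions
df <- data.frame(stat = c("Average 18.24", "Error 23.34"),
                 stringsAsFactors = FALSE)

# Drop the leading run of non-digit characters, then convert the rest to numeric
df$value <- as.numeric(sub("^\\D+", "", df$stat))
df$value
```

The ^ anchor is optional here, because sub only replaces the first match anyway; it just makes the intent explicit.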

You can use a quantifier and add a-z to the pattern (together with the ^ anchor). Either of the following patterns works:
"^\\S+\\s+"
"^[a-zA-Z]+[[:space:]]+"
See regex demo
R demo:
> b <- c("Average 18.24", "Error 23.34")
> sub("^[A-Za-z]+[[:space:]]+", "", b)
> ## or sub("^\\S+\\s+", "", b)
[1] "18.24" "23.34"
Details:
^ - start of string
[A-Za-z]+ - one or more letters (replace with \\S+ to match 1 or more non-whitespaces)
[[:space:]]+ - 1+ whitespaces (or \\s+ will match 1 or more whitespaces)
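The anchored patterns above remove exactly one leading word, whereas \\D+ removes everything up to the first digit. The difference only shows when the label itself contains a space. A small sketch, assuming a made-up value "Std Error 23.34" that is not in the question:

```r
b2 <- "Std Error 23.34"                  # hypothetical input with a two-word label

one_word  <- sub("^\\S+\\s+", "", b2)    # removes only the first word and the space after it
all_label <- sub("^\\D+", "", b2)        # removes everything before the first digit

one_word    # "Error 23.34"
all_label   # "23.34"
```

Which behaviour is wanted depends on the data, so it may be worth checking a few real values before picking a pattern.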

Related

R regex for everything between LAST backslash and last dot

full.path = 'C:\Users\me\Desktop\Data\my_file.csv'
I can't figure out the right regex to be left with only
essential.name = 'my_file'
I'm afraid I keep failing to encode the last backslash correctly.
A platform-independent regex solution can also look like
> full.path = 'C:\\Users\\me\\Desktop\\Data\\my_file.csv'
> sub(".*\\\\([^.]*).*", "\\1", full.path)
[1] "my_file"
See online R demo.
Details:
.* - any 0+ characters as many as possible up to the last...
\\\\ - a literal \ symbol
([^.]*) - Group 1 capturing 0+ characters other than a dot
.* - and the rest of the characters up to its end.
The \\1 just inserts the contents of the Group 1 into the result.
We can use basename and file_path_sans_ext (from tools) to extract the file name. Note that basename treats \ as a path separator only on Windows.
tools::file_path_sans_ext(basename(full.path))
#[1] "my_file"
Or if we need regex, use gsub
gsub(".*\\\\|\\..*$", "", full.path)
#[1] "my_file"
data
full.path = 'C:\\Users\\me\\Desktop\\Data\\my_file.csv'
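The regex answers cut the file name at its first dot, while file_path_sans_ext removes only the last extension. A sketch with a hypothetical multi-dot file name (my_file.tar.gz is made up) shows where this matters. The second call is an illustrative variant, not something taken from the answers.

```r
p <- "C:\\Users\\me\\Desktop\\Data\\my_file.tar.gz"   # hypothetical path

# Pattern from the answer: text between the last backslash and the first dot after it
first_dot <- sub(".*\\\\([^.]*).*", "\\1", p)

# Variant: drop the directory part first, then only the last extension
last_dot <- sub("\\.[^.]*$", "", sub(".*\\\\", "", p))

first_dot   # "my_file"
last_dot    # "my_file.tar"
```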

capture repetition of letters in a word with regex

I'm trying to detect words in which letters are repeated, and I would like to replace each such run with a single instance of the repeated letter. The text is in Hebrew. For instance, שללללוווווםםםם should just become שלום.
Basically, when a letter repeats itself 3 times or more, it should be detected and replaced.
I want to use the regular expression with R's gsub.
df$text <- gsub("?", "?", df$text)
You can use
> x = "שללללוווווםםםם"
> gsub("(.)\\1{2,}", "\\1", x)
#[1] "שלום"
NOTE: this collapses any character (not just Hebrew) that is repeated three or more times in a row.
Or use the following to collapse only word characters (letters, digits and underscore) from any language:
> gsub("(\\w)\\1{2,}", "\\1", x)
If you plan to only remove repeating characters from the Hebrew script (keeping others), I'd suggest:
s <- "שללללוווווםםםם ......... שללללוווווםםםם"
gsub("(\\p{Hebrew})\\1{2,}", "\\1", s, perl=TRUE)
See the regex demo in R
Details:
(\\p{Hebrew}) - Group 1 capturing a character from Hebrew script (as \p{Hebrew} is a Unicode property/category class)
\\1{2,} - 2 or more (due to {2,} limiting quantifier) same characters stored in Group 1 buffer (as \\1 is a backreference to Group 1 contents).
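To make the threshold concrete, here is a small sketch on a made-up ASCII string (not from the question). The group matches the first character and \\1{2,} needs at least two more copies, so only runs of three or more are collapsed. Changing the quantifier to + lowers the threshold to two.

```r
y <- "aab cccd eeee"                               # hypothetical input

three_plus <- gsub("(.)\\1{2,}", "\\1", y)         # "aa" is left alone
two_plus   <- gsub("(.)\\1+", "\\1", y)            # "aa" is collapsed as well

three_plus   # "aab cd e"
two_plus     # "ab cd e"
```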

Extracting part of string using regular expressions

I’m struggling to get a bit of regular expression code to work. I have a long list of strings that I need to partially extract. I need only the strings that start with “WER”, and from those I only need the last part of the string, starting at (and including) the letter.
test <- c("abc00012Z345678","WER0004H987654","WER12400G789456","WERF12","0-0Y123")
Here is the line of code which is working but only for one letter. However in my list of strings it can have any letter.
ifelse(substr(test,1,3)=="WER",gsub("^.*H.*?","H",test),"")
What I’m hoping to achieve is the following:
H987654
G789456
F12
You can use the following pattern with gsub:
> gsub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test)
[1] "" "H987654" "G789456" "F12" ""
See the regex demo
This pattern matches:
^ - start of a string
(?: - start of an alternation group with 2 alternatives:
WER.*([a-zA-Z]\\d*) - WER char sequence followed with 0+ any characters (.*) as many as possible up to the last letter ([a-zA-Z]) followed by 0+ digits (\\d*) (replace with \\d+ to match 1+ digits, to require at least 1 digit)
| - or
.* - any 0+ characters
)$ - closing the alternation group and match the end of string with $.
With str_match from stringr, it is even tidier:
> library(stringr)
> res <- str_match(test, "^WER.*([a-zA-Z]\\d*)$")
> res[,2]
[1] NA "H987654" "G789456" "F12" NA
See another regex demo
If there are newlines in the input, add (?s) at the beginning of the pattern: res <- str_match(test, "(?s)^WER.*([a-zA-Z]\\d*)$").
If you don't want empty strings or NA for strings that don't start with "WER", you could try the following approach:
sub(".*([A-Z].*)$", "\\1", test[grepl("^WER", test)])
#[1] "H987654" "G789456" "F12"
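If adding stringr is not an option, a similar result can be sketched in base R with regexec and regmatches. This is an alternative under that assumption, not part of the answers above. The variable names m and res are arbitrary.

```r
test <- c("abc00012Z345678", "WER0004H987654", "WER12400G789456", "WERF12", "0-0Y123")

# regexec records the full match and each capture group for every string;
# regmatches returns character(0) for strings that do not match
m <- regmatches(test, regexec("^WER.*([a-zA-Z]\\d*)$", test))

# Take capture group 1 where there was a match, NA otherwise
res <- vapply(m, function(x) if (length(x)) x[2] else NA_character_, character(1))
res
```

Dropping the NA entries with res[!is.na(res)] leaves only the three wanted values.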

extract string after first occurrence of pattern AND before another pattern

I have the following string:
strings <- c("David, FC; Haramey, S; Devan, IA",
"Colin, Matthew J.; Haramey, S",
"Colin, Matthew")
If I want the last initials/given name for all strings, I can use the following:
sub(".*, ", "", strings)
[1] "IA" "S" "Matthew"
This removes everything before the last ", "
However, I am stuck on how to get the first initials/given name. I know I have to remove everything before the first ", " but then I also have to remove everything after any spaces or semicolons, if present.
To be clear the output I want is:
c("FC", "Matthew", "Matthew")
Any pointers would be great.
By fiddling, I can get the first surnames: gsub( " .*$", "", strings )
You can use
> gsub( "^[^\\s,]+,\\s+([^;.\\s]+).*", "\\1", strings, perl=TRUE)
[1] "FC" "Matthew" "Matthew"
See the regex demo
Explanation:
^ - start of string
[^\\s,]+ - 1 or more characters other than whitespace or ,
, - a literal comma
\\s+ - 1 or more whitespace
([^;.\\s]+) - Group 1 matching 1 or more characters other than ;, . or whitespace
.* - zero or more characters other than a newline
If you want to use a POSIX-like expression, replace \\s inside the character classes (inside [...]) with [:blank:] (or [:space:]):
gsub( "^[^[:blank:],]+,\\s+([^;.[:blank:]]+).*", "\\1", strings)
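The same result can also be reached in smaller steps, which may be easier to debug than a single pattern. The following is only a sketch, and the intermediate names first_author, given and first_given are made up for illustration.

```r
strings <- c("David, FC; Haramey, S; Devan, IA",
             "Colin, Matthew J.; Haramey, S",
             "Colin, Matthew")

first_author <- sub(";.*", "", strings)            # keep only the first author
given        <- sub("^[^,]*,\\s*", "", first_author)  # drop the surname and the comma
first_given  <- sub("[ .].*", "", given)           # cut at the first space or period
first_given
```

Each step can be printed on its own, so it is easy to see where an unexpected input goes wrong.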

Parse Data of a String in R

I need help in solving what seems like a very easy problem. I have a string, 70 - 3/31/2014 - 60#1.66. I would like to parse out only the information between the second "-" and the "#", i.e. "60". Is there any formula or nested formula in R that can parse out string data between two specified characters?
Thanks!
1) sub This matches the entire string and then replaces it with the capture group, i.e. the portion matched to the part of the regular expression in parentheses:
x <- "70 - 3/31/2014 - 60#1.66"
sub(".*- (.*)#.*", "\\1", x)
## [1] "60"
2) gsub This replaces the portion before the wanted substring and the portion after the wanted substring with empty strings:
gsub(".*- |#.*", "", x)
# [1] "60"
Through sub,
> x <- "70 - 3/31/2014 - 60#1.66"
> sub("^[^-]*-[^-]*-\\s*([^#]*)#.*", "\\1", x)
[1] "60"
> sub("^[^-]*-[^-]*-([^#]*)#.*", "\\1", x)
[1] " 60"
> sub("^(?:[^-]*-){2}\\s*([^#]*)#.*", "\\1", x)
[1] "60"
^ - Asserts that we are at the start.
[^-]*- - Matches zero or more characters other than -, followed by a hyphen.
(?:[^-]*-){2} - The above pattern repeated exactly two times, so we end up just after the second hyphen.
\\s* - Matches zero or more whitespace characters.
([^#]*) - Captures zero or more characters other than #.
.* - Matches all the remaining characters.
So replacing all the matched characters with the contents of group 1 gives you the desired output.
OR
> x <- "70 - 3/31/2014 - 60#1.66"
> m <- regexpr("^(?:[^-]*-){2}\\s*\\K[^#]*(?=#)", x, perl=TRUE)
> regmatches(x, m)
[1] "60"
\K keeps the text matched so far out of the overall regex match.
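When the separators are fixed strings, the value can also be picked out without a long regular expression. This sketch assumes the fields are always separated by exactly " - " (space, hyphen, space). The names parts and value are arbitrary.

```r
x <- "70 - 3/31/2014 - 60#1.66"

# Split on the literal separator; fixed = TRUE turns off regex interpretation
parts <- strsplit(x, " - ", fixed = TRUE)[[1]]

# The third field is "60#1.66"; drop the "#" and everything after it
value <- sub("#.*", "", parts[3])
value
```

If the spacing around the hyphens can vary, the regex versions above are the safer choice.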