I am trying to extract Twitter hashtags from text using regex in R, using str_match_all from the "stringr" package.
The problem is that sometimes the hashtag gets truncated, with a horizontal ellipsis character appended to the end of the text string, as shown in this example:
str_match_all("hello #goodbye #au…","#[[:alnum:]_+]*[^…]")[[1]]
I can successfully extract a list of hashtags, using the above code, but I want to exclude hashtags that are truncated (i.e. that have a horizontal ellipsis character).
This is frustrating as I have looked everywhere for a solution, and the above code is the best I can come up with, but clearly does not work.
Any help is deeply appreciated.
I suggest using regmatches with regexpr and the #[^#]+(?!…)\\b Perl-style regex:
x <- "#hashtag1 notHashtag #hashtag2 notHashtag #has…"
m <- gregexpr('#[^#\\s]+(?!…)\\b', x, perl=T)
// or m <- gregexpr('#\\w+(?!…)\\b', x, perl=T)
// or m <- gregexpr('#\\S+(?!…)\\b', x, perl=T)
regmatches(x, m)
See demo on CodingGround
The regex means:
# - Literal #
[^#]+ - 1 or more characters other then # (or \\w+ to match alphanumerics and underscore only, or \\S+ that will match any number of non-whitespace characters)
(?!…)\\b - Match a word boundary that is not preceded by a …
Result of the above code execution: [1] "#goodbye"
Related
I'm trying to detect conditions where words have repetition of letters, and i would like to replace such matched conditions with the repeated letter. The text is in Hebrew. For instance, שללללוווווםםםם should just become שלום.
Basically,when a letter repeats itself 3 times or more - it should be detected and replaced.
I want to use the regex expression for r gsub.
df$text <- gsub("?", "?", df$text)
You can use
> x = "שללללוווווםםםם"
> gsub("(.)\\1{2,}", "\\1", x)
#[1] "שלום"
NOTE :- It will replace any character (not just hebrew) which is repeated more than three times.
or following for only letter/digit from any language
> gsub("(\\w)\\1{2,}", "\\1", x)
If you plan to only remove repeating characters from the Hebrew script (keeping others), I'd suggest:
s <- "שללללוווווםםםם ......... שללללוווווםםםם"
gsub("(\\p{Hebrew})\\1{2,}", "\\1", s, perl=TRUE)
See the regex demo in R
Details:
(\\p{Hebrew}) - Group 1 capturing a character from Hebrew script (as \p{Hebrew} is a Unicode property/category class)
\\1{2,} - 2 or more (due to {2,} limiting quantifier) same characters stored in Group 1 buffer (as \\1 is a backreference to Group 1 contents).
I have a dataframe with columns having values like:
"Average 18.24" "Error 23.34". My objective is to replace the text and following space from these. in R. Can any body help me with a regex pattern to do this?
I am able to successfully do this using the [A-Z]. But i am not able to combine the white space. [A-Z][[:space:]] no luck.
Your help is appreciated.
We can use sub. Use the pattern \\D+ to match all non-numeric characters and then use '' in the replacement to remove those.
sub("\\D+", '', v2)
#[1] "18.24" "23.34"
Or match one or more word characters followed by one or more space and replace with ''.
sub("\\w+\\s+", "", v2)
#[1] "18.24" "23.34"
Or if we are using stringr
library(stringr)
word(v2, 2)
#[1] "18.24" "23.34"
data
v2 <- c("Average 18.24" ,"Error 23.34")
You can use a quantifier and add a-z to the pattern (and the ^ anchor)
You can use
"^\\S+\\s+"
"^[a-zA-Z]+[[:space:]]+"
See regex demo
R demo:
> b <- c("Average 18.24", "Error 23.34")
> sub("^[A-Za-z]+[[:space:]]+", "", b)
> ## or sub("^\\S+\\s+", "", b)
[1] "18.24" "23.34"
Details:
^ - start of string
[A-Za-z]+ - one or more letters (replace with \\S+ to match 1 or more non-whitespaces)
[[:space:]]+ - 1+ whitespaces (or \\s+ will match 1 or more whitespaces)
I’m struggling to get a bit of regular expressions code to work. I have a long list of strings that I need to partially extract. I need only strings that starting with “WER” and I only need the last part of the string commencing (including) on the letter.
test <- c("abc00012Z345678","WER0004H987654","WER12400G789456","WERF12","0-0Y123")
Here is the line of code which is working but only for one letter. However in my list of strings it can have any letter.
ifelse(substr(test,1,3)=="WER",gsub("^.*H.*?","H",test),"")
What I’m hoping to achieve is the following:
H987654
G789456
F12
You can use the following pattern with gsub:
> gsub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test)
[1] "" "H987654" "G789456" "F12" ""
See the regex demo
This pattern matches:
^ - start of a string
(?: - start of an alternation group with 2 alternatives:
WER.*([a-zA-Z]\\d*) - WER char sequence followed with 0+ any characters (.*) as many as possible up to the last letter ([a-zA-Z]) followed by 0+ digits (\\d*) (replace with \\d+ to match 1+ digits, to require at least 1 digit)
| - or
`.* - any 0+ characters
)$ - closing the alternation group and match the end of string with $.
With str_match from stringr, it is even tidier:
> library(stringr)
> res <- str_match(test, "^WER.*([a-zA-Z]\\d*)$")
> res[,2]
[1] NA "H987654" "G789456" "F12" NA
>
See another regex demo
If there are newlines in the input, add (?s) at the beginning of the pattern: res <- str_match(test, "(?s)^WER.*([a-zA-Z]\\d*)$").
If you don't want empty strings or NA for strings that don't start with "WER", you could try the following approach:
sub(".*([A-Z].*)$", "\\1", test[grepl("^WER", test)])
#[1] "H987654" "G789456" "F12"
I'm looking for a regular expression to catch all digits in the first 7 characters in a string.
This string has 12 characters:
A12B345CD678
I would like to remove A and B only since they are within the first 7 chars (A12B345) and get
12345CD678
So, the CD678 should not be touched. My current solution in R:
paste(paste(str_extract_all(substr("A12B345CD678",1,7), "[0-9]+")[[1]],collapse=""),substr("A12B345CD678",8,nchar("A12B345CD678")),sep="")
It seems too complicated. I split the string at 7 as described, match any digits in the first 7 characters and bind it with the rest of the string.
Looking for a general answer, my current solution is to split the first 7 characters and just match all digits in this sub string.
Any help appreciated.
You can use the known SKIP-FAIL regex trick to match all the rest of the string beginning with the 8th character, and only match non-digit characters within the first 7 with a lookbehind:
s <- "A12B345CD678"
gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl=T)
## => [1] "12345CD678"
See IDEONE demo
The perl=T is required for this regex to work. The regex breakdown:
(?<=.{7}).*$(*SKIP)(*F) - matches any character but a newline (add (?s) at the beginning if you have newline symbols in the input), as many as possible (.*) up to the end ($, also \\z might be required to remove final newlines), but only if preceded with 7 characters (this is set by the lookbehind (?<=.{7})). The (*SKIP)(*F) verbs make the engine omit the whole matched text and advance the regex index to the position at the end of that text.
| - or...
\\D - a non-digit character.
See the regex demo.
The regex solution is cool, but I'd use something easier to read for maintainability. E.g.
library(stringr)
str_sub(s, 1, 7) = gsub('[A-Z]', '', str_sub(s, 1, 7))
You can also use a simple negative lookbehind:
s <- "A12B345CD678"
gsub("(?<!.{7})\\D", "", s, perl=T)
I'm working with a vector below:
vec <- c("http://statistics.gov.scot/id/statistical-geography/S02000002",
"http://statistics.gov.scot/id/statistical-geography/S02000003")
I would like to remove http://statistics.gov.scot/id/statistical-geography/ from the vector. My present regex syntax:
vec_cln <- gsub(replacement = "", x = vec, perl = TRUE, fixed = FALSE,
pattern = "([[:alnum:]]|[[:punct:]]|)(?<!S\\d{8})")
But this leaves only last digit from vector vec. I'm guessing that the problem is with \\d{8}, however, it's not clear to me how to work around it. I tried various solutions on regex101 but to no avail. Some examples:
(?<!S\d) - this leaves second digit
(?<!S[[:digit:]]) - same
What I'm trying to achieve can be simply summarised: *match everything until you find a capital letter S and 8 digits after.
Notes
I want to arrive at the solution via gsub and regex I don't want to use:
gsubfn and proto objects
I'm not interested in using substr as I may have to work with strings of variable lengths
You can obtain the result using
sub(".*(S\\d{8})", "\\1", vec)
See demo
With .*, we match any amount of (* - 0 or more) any characters but a newline up to the S followed by 8 digits (S\\d{8}). Since (S\\d{8}) is inside unescaped parentheses, the substring matched by this subpattern is placed into a capture group #1. With \\1 backreference, we restore the captured text in the result.
See more about backreferences and capturing groups at regular-expressions.info.
NOTE: if you have more text after S+8 digits, you can use
sub("^.*(S\\d{8}).*$", "\\1", vec)
Here it is with slightly prettier syntax:
library(rex)
library(stringi)
library(magrittr)
regex_1 = rex("S", digits)
vec <- c("http://statistics.gov.scot/id/statistical-geography/S02000002",
"http://statistics.gov.scot/id/statistical-geography/S02000003")
vec %>% stri_extract_last_regex(regex_1)