How to gsub on the text between two words in R? - regex

EDIT:
I would like to place a \n before a specific unknown word in my text. I know that the first time the unknown word appears in my text will be between "Tree" and "Lake"
Ex. of text:
text
[1] "TreeRULakeSunWater"
[2] "A B C D"
EDIT:
"Tree" and "Lake" will never change, but the word in between them is always changing so I do not look for "RU" in my regex
What I am currently doing:
if (grepl(".*Tree\\s*|Lake.*", text)) { text <- gsub(".*Tree\\s*|Lake.*", "\n\\1", text)}
The problem with what I am doing above is that the gsub will sub all of text and leave just \nRU.
text
[1] "\nRU"
I have also tried:
if (grepl(".*Tree *(.*?) *Lake.*", text)) { text <- gsub(".*Tree *(.*?) *Lake.*", "\n\\1", text)}
What I would like text to look like after gsub:
text
[1] "Tree \nRU LakeSunWater"
[2] "A B C D"
EDIT:
From Wiktor Stribizew's comment I am able to do a successful gsub
gsub("Tree(\\w+)Lake", "Tree \n\\1 Lake", text)
But this will only do a gsub on occurrences where "RU" is between "Tree" and "Lake", which is the first occurrence of the unknown word. The unknown word, in this case "RU", will show up many times in the text, and I would like to place \n in front of every occurrence of "RU" when "RU" is a whole word.
New Ex. of text.
text
[1] "TreeRULakeSunWater"
[2] "A B C RU D"
New Ex. of what I would like:
text
[1] "Tree \nRU LakeSunWater"
[2] "A B C \nRU D"
Any help will be appreciated. Please let me know if further information is needed.

You need to find the unknown word between "Tree" and "Lake" first. You can use
unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)
The pattern matches any characters up to the last Tree in a string, then captures the unknown word (\w+ = one or more word characters) up to Lake, and then matches the rest of the string, so the whole string is replaced by the captured word. gsub is vectorized, so this is applied to every string in the vector (an element without a match is returned unchanged). You can access the first result with the [[1]] index.
Then, when you know the word, replace it with
gsub(paste0("[[:space:]]*(", unknown_word[[1]], ")[[:space:]]*"), " \n\\1 ", text)
See IDEONE demo.
Here, you have [[:space:]]*( + unknown_word[1] + )[[:space:]]* pattern. It matches zero or more whitespaces on both ends of the unknown word, and the unknown word itself (captured into Group 1). In the replacement, the spaces are shrunk into 1 (or added if there were none) and then \\1 restores the unknown word. You may replace [[:space:]] with \\s.
UPDATE
If you only need to add a newline before occurrences of RU that are whole words, use the \b word boundary:
> gsub(paste0("[[:space:]]*\\b(", unknown_word[[1]], ")\\b[[:space:]]*"), " \n\\1 ", text)
[1] "TreeRULakeSunWater" "A B C \nRU D"
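Putting the two steps together, here is a minimal sketch. It assumes the example vector from the question and that at least one element contains Tree...Lake; the helper name first_hit and the perl = TRUE argument are my additions, not part of the original answer.

```r
text <- c("TreeRULakeSunWater", "A B C RU D")

# Step 1: take the first element that contains Tree<word>Lake and extract <word>
first_hit <- text[grepl("Tree\\w+Lake", text)][1]
unknown_word <- sub(".*Tree(\\w+)Lake.*", "\\1", first_hit)

# Step 2a: newline before every occurrence, even inside a longer word
res_all <- gsub(paste0("[[:space:]]*(", unknown_word, ")[[:space:]]*"),
                " \n\\1 ", text)

# Step 2b: newline only before whole-word occurrences
res_whole <- gsub(paste0("[[:space:]]*\\b(", unknown_word, ")\\b[[:space:]]*"),
                  " \n\\1 ", text, perl = TRUE)
```

If the unknown word can contain regex metacharacters, it would need to be escaped before being pasted into the pattern.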

Related

gsub with exception in R

I'm removing English characters from Hebrew text but would like to keep a short list of English words, e.g. words2keep <- c("ok", "hello", "yes*").
So my current regex is text <- gsub("[A-Z,a-z]", "", text) , but the question is how to add the exception so it will not remove all English words.
Reproducible example:
text = "ok אני מסכים איתך Yossi Cohen"
after gsub with exception
text = "ok אני מסכים איתך"
Thank you for all suggestions
This is a tricky one. I think we can do it by matching against whole words by making use of the \b word boundary assertion, and at the same time include a negative lookahead assertion just prior to the match which rejects the words (again, whole words) that you want to blacklist for removal (or equivalently whitelist for preservation). This appears to be working:
gsub(perl=T,paste0('(?!\\b',paste(collapse='\\b|\\b',words2keep),'\\b)\\b[A-Za-z]+\\b'),'',text);
[1] "ok אני מסכים איתך "
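The same idea written out step by step, as a sketch: it assumes the entries of words2keep are meant to be used as regex fragments (so "yes*" means "ye" followed by any number of "s"), and the final whitespace clean-up is my addition to reach the output shown in the question.

```r
words2keep <- c("ok", "hello", "yes*")
text <- "ok אני מסכים איתך Yossi Cohen"

# Alternation of whole-word patterns: \bok\b|\bhello\b|\byes*\b
keep <- paste0("\\b", words2keep, "\\b", collapse = "|")

# Match a whole English word unless a kept word starts at the same position
pattern <- paste0("(?!", keep, ")\\b[A-Za-z]+\\b")
cleaned <- gsub(pattern, "", text, perl = TRUE)

# Collapse the spaces left behind by the removed words and trim the ends
cleaned <- trimws(gsub(" +", " ", cleaned))
```

The negative lookahead only blocks a match that starts exactly where a kept word starts, which is why the \b anchors are needed on both sides of each kept word.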
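Use gsub with the pattern [A-Z].*: [A-Z] matches the first uppercase English letter and .* matches everything after it, so the text is removed from that letter to the end of the string. This works for the example because the unwanted words are capitalized and come last; it does not consult words2keep.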
gsub("[A-Z].*","",text)
[1] "ok אני מסכים איתך "
#data
text = "ok אני מסכים איתך Yossi Cohen"

single letter regex operations in R

I'm trying to identify, in Hebrew text, instances where a word (of 2 or more letters) is followed by a single letter. I need to match these instances and then concatenate the single letter to its preceding word. A text might contain several such instances:
Example:
texts <- c("שלום חברי צה ל היקרים", "נכון לא נכון קשק ש בבטחון", "צה ל ינצח ")
I need to replace it to:
texts <- c("שלום חברי צהל היקרים", "נכון לא נכון קשקש בבטחון", "צהל ינצח ")
Thank you for the suggestions
The Hebrew letter Unicode range is U+05D0 to U+05F2, so you can put that range in a character class to match a single Hebrew letter. Using whitespace on each side as the word boundary, you can match a single-letter word and substitute the capture group to remove the space before the letter.
gsub("\\s([\u05D0-\u05F2]\\s)", "\\1", texts) # hebrew letter unicode range
# [1] "שלום חברי צהל היקרים" "נכון לא נכון קשקש בבטחון" "צהל ינצח "
The full Hebrew Unicode block runs from U+0590 to U+05FF; you can adjust the range based on what you need.
gsub("\\s([\u0590-\u05FF]\\s)", "\\1", texts)
# [1] "שלום חברי צהל היקרים" "נכון לא נכון קשקש בבטחון" "צהל ינצח "
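Note that the pattern above requires whitespace after the single letter, so a single letter at the very end of a string is left alone. A possible variant that also accepts the end of the string is sketched below; the (\\s|$) group and perl = TRUE are my assumptions, and the sample strings are written with \u escapes only so the snippet does not depend on the file encoding.

```r
# "צה ל ינצח" and "קשק ש" written with \u escapes
texts <- c("\u05E6\u05D4 \u05DC \u05D9\u05E0\u05E6\u05D7",
           "\u05E7\u05E9\u05E7 \u05E9")

# whitespace, one Hebrew letter, then whitespace or end of string;
# group 1 keeps the letter, group 2 keeps the trailing whitespace (if any)
fixed <- gsub("\\s([\u05D0-\u05F2])(\\s|$)", "\\1\\2", texts, perl = TRUE)
```

Here fixed should hold "צהל ינצח" and "קשקש".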

How to Extract a substring that matches a Particular Regular expression match from a String in R

I am trying to write a function that returns all the substrings of a string that match a regular expression. For example:
str <- "hello Brother How are you"
I want to extract all the substrings of str that match the regular expression "[A-z]+ [A-z]+",
which should give:
"hello Brother"
"Brother How"
"How are"
"are you"
Is there a library function that can do that?
You can do it with stringr library str_match_all function and the method Tim Pietzcker described in his answer (capturing inside an unanchored positive lookahead):
> library(stringr)
> str <- "hello Brother How are you"
> res <- str_match_all(str, "(?=\\b([[:alpha:]]+ [[:alpha:]]+))")
> l <- unlist(res)
> l[l != ""]
## [1] "hello Brother" "Brother How" "How are" "are you"
Or to only get unique values:
> unique(l[l != ""])
##[1] "hello Brother" "Brother How" "How are" "are you"
I advise using [[:alpha:]] instead of [A-z], since [A-z] matches more than just letters: the range also covers the characters [, \, ], ^, _ and ` that sit between Z and a in ASCII.
Regex matches "consume" the text they match, therefore (generally) the same bit of text can't match twice. But there are constructs called lookaround assertions which don't consume the text they match, and which may contain capturing groups.
That makes your endeavor possible (although you can't use [A-z], that doesn't do what you think it does):
(?=\b([A-Za-z]+ [A-Za-z]+))
will match as expected; you need to look at group 1 of the match result, not the matched text itself (which will always be empty).
The \b word boundary anchor is necessary to ensure that our matches always start at the beginning of a word (otherwise you'd also have the results "ello Brother", "llo Brother", "lo Brother", and "o Brother").
Test it live on regex101.com.
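For reference, the same lookahead trick can be sketched in base R without stringr. This assumes gregexpr with perl = TRUE, which attaches "capture.start" and "capture.length" attributes to its result; the variable names are mine.

```r
str <- "hello Brother How are you"

# Every match is empty; the interesting text is in capture group 1
m <- gregexpr("(?=\\b([[:alpha:]]+ [[:alpha:]]+))", str, perl = TRUE)[[1]]

# Start position and length of group 1 for each match
starts <- attr(m, "capture.start")[, 1]
lens <- attr(m, "capture.length")[, 1]

# Cut the captured word pairs out of the original string
pairs <- substring(str, starts, starts + lens - 1)
```

Because the lookahead consumes nothing, the engine can start a new match at the next word even though that word was already captured as the second half of the previous pair.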

Regex with stringr:: how to find first instance of pattern

Behind this question is an effort to extract all references created by knitr and latex. Not finding another way, my thought was to read into R the .Rnw script and use a regular expression to find references -- where the latex syntax is \ref{caption referenced to}. My script has 250+ references, and some are very close to each other.
The text.1 example below works, but not the text example. I think it has to do with R chugging along to the final closing brace. How do I stop at the first closing brace and extract what preceded it to the opening brace?
library(stringr)
text.1 <- c(" \\ref{test}", "abc", "\\ref{test2}", " \\section{test3}", "{test3")
# In the regular expression below, look behind for "ref{", then grab everything up to a closing } at the end of the string
# braces are special characters and must be escaped with double backslashes so the regex engine treats them as literal braces
# unlist converts the list returned by str_extract_all to a vector
unlist(str_extract_all(string = text.1, pattern = "(?<=ref\\{).*(?=\\}$)"))
[1] "test" "test2"
# a more complicated string, with more than one set of braces in an element
text <- c("text \ref{?bar labels precision} and more text \ref{?table column alignment}", "text \ref{?table space} }")
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{).*(?=\\}$)"))
character(0)
The problem with text is that the backslash in front of "ref" is not escaped, so R's parser reads \r as a carriage return; you're trying to match "ref" but the string really contains (CR + "ef") ...
Also * is greedy by default, meaning it will match as much as it can and still allow the remainder of the regular expression to match. Use *? or a negated character class to prevent greediness.
unlist(str_extract_all(text, '(?<=\ref\\{)[^}]*'))
# [1] "?bar labels precision" "?table column alignment" "?table space"
As you can see, you can use a character class to match either (\r or r + "ef") ...
x <- c(' \\ref{test}', 'abc', '\\ref{test2}', ' \\section{test3}', '{test3',
'text \ref{?bar labels precision} and more text \ref{?table column alignment}',
'text \ref{?table space} }')
unlist(str_extract_all(x, '(?<=[\rr]ef\\{)[^}]*'))
# [1] "test" "test2" "?bar labels precision"
# [4] "?table column alignment" "?table space"
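If the lines come from readLines() on the .Rnw file, the backslash is a literal character and there is no carriage return to work around. A minimal base R sketch for that case follows; the doubled backslashes in the sample strings stand in for what readLines() would return, and requiring the backslash in the lookbehind is my choice.

```r
text <- c("text \\ref{?bar labels precision} and more text \\ref{?table column alignment}",
          "text \\ref{?table space} }")

# Lookbehind for a literal backslash + "ref{", then everything up to the first }
m <- gregexpr("(?<=\\\\ref\\{)[^}]*", text, perl = TRUE)
refs <- unlist(regmatches(text, m))
```

The negated class [^}]* stops at the first closing brace, so two references on the same line are returned separately.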
EDITED
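The reason it didn't capture what comes before the closing brace } is that you added an end-of-line anchor $, which forces the match to run to the end of the string. Remove $ and use [^}]* instead of .* so the match stops at the first closing brace.
Therefore, your new code should look like this (it assumes the backslashes in text are escaped as \\ref, so that the string really contains "ref{"):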
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{)[^}]*(?=\\})"))
See DEMO

Extract substrings starting with specific character until next space

I want to extract the tags (twitter handles) from tweets.
tweet <- "#me bla bla bla bla #2_him some text #me_"
The following only extracts part of some substrings due to the punctuation in some tags
regmatches(tweet, gregexpr("#[[:alnum:]]*", tweet))[[1]]
[1] "#me" "#2" "#me"
I don't know what regular expression would return the entire string (#tag).
Thanks!
If you want to match all non-spaces, just use the corresponding regular expression
regmatches(tweet, gregexpr("#[^ ]*", tweet))[[1]]
# [1] "#me" "#2_him" "#me_"
You can use the following. \S will match any non-white space character. As well, you want to use the + quantifier instead of * otherwise you will end up matching the # character by itself if one did exist in the string.
> regmatches(tweet, gregexpr("#\\S+", tweet))[[1]]
# [1] "#me" "#2_him" "#me_"
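To see the difference between * and + here, a small sketch with a stray # added to the tweet (the string tweet2 is made up for illustration):

```r
tweet2 <- "#me bla # bla #2_him some text #me_"

# * allows zero characters after #, so the lone # is returned as a match
with_star <- regmatches(tweet2, gregexpr("#\\S*", tweet2))[[1]]

# + requires at least one non-space character after #
with_plus <- regmatches(tweet2, gregexpr("#\\S+", tweet2))[[1]]
```

with_star contains "#me", "#", "#2_him" and "#me_", while with_plus drops the lone "#".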
Instead of [[:alnum:]]* use \w*, because _ is not an alphanumeric character ([[:alnum:]] matches only alphanumeric characters, [A-Za-z0-9]) but it is a word character ([A-Za-z0-9_]).
> regmatches(tweet, gregexpr("#\\w*", tweet))[[1]]
[1] "#me" "#2_him" "#me_"
The qdapRegex package has a function specifically designed for this task rm_tag:
library(qdapRegex)
rm_tag(tweet, extract=TRUE)
## [[1]]
## [1] "#me" "#2_him" "#me_"