Single-letter regex operations in R

I'm trying to identify, in Hebrew text, cases where a word (of 2 or more letters) is followed by a single letter. I need to match these instances and then concatenate the single letter to its preceding word. A text may contain several such cases:
Example:
texts <- c("שלום חברי צה ל היקרים", "נכון לא נכון קשק ש בבטחון", "צה ל ינצח ")
I need to replace it to:
texts <- c("שלום חברי צהל היקרים", "נכון לא נכון קשקש בבטחון", "צהל ינצח ")
Thank you for the suggestions

Hebrew letters occupy the Unicode range U+05D0 to U+05EA, and the range 05D0-05F2 used here also takes in the Yiddish ligatures, so you can put that range in a character class to match a single Hebrew letter. With whitespace required on each side, the pattern matches a one-letter word; substituting the capture group (the letter plus the space after it) drops the space before the letter.
gsub("\\s([\u05D0-\u05F2]\\s)", "\\1", texts) # hebrew letter unicode range
# [1] "שלום חברי צהל היקרים" "נכון לא נכון קשקש בבטחון" "צהל ינצח "
The whole Hebrew Unicode block spans U+0590 to U+05FF; you can adjust the range to what you need.
gsub("\\s([\u0590-\u05FF]\\s)", "\\1", texts)
# [1] "שלום חברי צהל היקרים" "נכון לא נכון קשקש בבטחון" "צהל ינצח "
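For readers who want to check the same pattern outside R, here is a minimal sketch using Python's re module. The name join_single_letters is hypothetical, and the second pattern is an assumed variant, not part of the answer: it uses a lookahead so that a single letter at the very end of a string is also joined, a case the pattern above skips because it requires trailing whitespace.

```python
import re

# The answer's pattern: whitespace, one Hebrew letter, whitespace.
SINGLE_LETTER = re.compile(r"\s([\u05D0-\u05F2]\s)")

# Assumed variant: the letter may also be followed by the end of the string.
SINGLE_LETTER_OR_END = re.compile(r"\s([\u05D0-\u05F2])(?=\s|$)")


def join_single_letters(text, pattern=SINGLE_LETTER):
    """Attach each stand-alone Hebrew letter to the word before it."""
    return pattern.sub(r"\1", text)


texts = ["שלום חברי צה ל היקרים", "נכון לא נכון קשק ש בבטחון", "צה ל ינצח "]
joined = [join_single_letters(t) for t in texts]
```

Each element of joined is one character shorter than its input, because only the space before the single letter is removed.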

Related

How to extract words entirely written in uppercase with accents (Diacritics) with a Google Sheet REGEXEXTRACT formula?

It looks simple, but it gets messy when the words start or end with an accented letter.
I've looked on Stack Overflow and elsewhere and haven't really found a way to solve this problem.
Using a Google Sheets formula, I would like to extract from a cell only the words built entirely from the following characters:
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,À,Á,Â,Ã,Ä,Å,Æ,Ç,È,É,Ê,Ë,Ì,Í,Î,Ï,Ð,Ñ,Ò,Ó,Ô,Õ,Ö,Ø,Ù,Ú,Û,Ü,Ý
For example with "Éléonorä-Camilliâ ÀLËMMNIÖ DE SANTORINÕ" or "ÀLËMMNIÖ DE SANTORINÕ Éléonorä Camilliâ" the result has to be the same "ÀLËMMNIÖ DE SANTORINÕ"
This formula works when there are no accents at all:
=REGEXEXTRACT(A2;"\b[A-Z]+(?:\s+[A-Z]+)*\b")
These formulas work sometimes, when the names are easy:
=REGEXEXTRACT(A2;"\b[A-Ý]+(?:\s+[A-Ý]+)*\b")
=REGEXEXTRACT(A2;"\B[A-Ý]+(?:\S+[A-Ý]+)*\B")
Can anybody help me or give me some hint?
It seems your expected matches are simply between whitespace or start/end of string. If you add a space before and after the cell value, you may simply extract all the chunks of whitespace-separated uppercase letter words between whitespaces, and the formula will boil down to
=REGEXEXTRACT(" " & A2 & " "; "\s([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*)\s")
Regex details:
\s - a whitespace
([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*) - Group 1 (the actual value returned by REGEXEXTRACT): one or more uppercase letters from the specified ranges followed with zero or more repetitions of one or more whitespace and then one or more uppercase letters
\s - a whitespace.
You may use an ARRAYFORMULA, as well:
=ARRAYFORMULA(IFERROR(REGEXEXTRACT(" " & A:A & " ", "\s([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*)\s"),""))
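The padding trick is not specific to Google Sheets. Below is a minimal Python sketch of the same idea, with the hypothetical name extract_uppercase_run; the NFC normalisation step is my own assumption, added so that each accented letter is a single code point and the ranges À-Ö and Ø-Ý apply.

```python
import re
import unicodedata

# A-Z plus the ranges À-Ö and Ø-Ý, written as escapes to avoid encoding doubts.
UPPER = "A-Z\u00C0-\u00D6\u00D8-\u00DD"
PATTERN = re.compile(rf"\s([{UPPER}]+(?:\s+[{UPPER}]+)*)\s")


def extract_uppercase_run(cell):
    """Return the first run of all-uppercase words, or "" if there is none."""
    # Pad with spaces so a run at the start or end is still whitespace-delimited.
    padded = " " + unicodedata.normalize("NFC", cell) + " "
    match = PATTERN.search(padded)
    return match.group(1) if match else ""


first = extract_uppercase_run("Éléonorä-Camilliâ ÀLËMMNIÖ DE SANTORINÕ")
second = extract_uppercase_run("ÀLËMMNIÖ DE SANTORINÕ Éléonorä Camilliâ")
```

Both calls should yield the same uppercase run, since "Éléonorä" starts with a capital but is not followed by whitespace right after it.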
Supposing your sample name were in A2, this should work:
=TRIM(REGEXEXTRACT(A2&" ","([A-ZÀ-Ý ]+)\s"))
By appending a space to the end of the string first, we can then look for any number of characters from the [uppercase letter set or space], ending with a space. This rules out strings like "Éléonorä" and "Camilliâ" because those uppercase letters are not followed by a space.
Put a different way, the rule here says, "Grab as many uppercase letters or spaces in this set as possible, as long as you still have a space left over at the end." And since we appended a space to the end of the entire string, we can catch such groupings anywhere in the modified string.
Try this: backslash-escape the non-A-Z characters.
[A-Z\À\Á\Â\Ã\Ä\Å\Æ\Ç\È\É\Ê\Ë\Ì\Í\Î\Ï\Ð\Ñ\Ò\Ó\Ô\Õ\Ö\Ø\Ù\Ú\Û\Ü\Ý]
If that fails you can encode each one of those letters like below:
Look up for characters: https://www.w3schools.com/charsets/ref_utf_latin1_supplement.asp
[A-Z\u00C0\u00C1... and so on...]
use:
=ARRAYFORMULA(TRIM(TRANSPOSE(QUERY(TRANSPOSE(IF(""<>
IFERROR(REGEXEXTRACT(SPLIT(A1:A, " "), "["&TEXTJOIN("", 1,
UNIQUE(QUERY({UPPER(CHAR(ROW(65:1500))), LOWER(CHAR(ROW(65:1500)))},
"select Col2 where Col1<>Col2")))&"]+")),,IFERROR(SPLIT(A1:A, " ")))),,9^9))))
or 10 characters shorter:
=INDEX(TRIM(TRANSPOSE(QUERY(TRANSPOSE(IF(""<>
IFERROR(REGEXEXTRACT(SPLIT(A:A; " "); "["&JOIN(;
UNIQUE(LOWER(QUERY(CHAR(ROUNDUP(SEQUENCE(1500; 2; 65)/2));
"select Col1 where lower(Col1)<>upper(Col2)"))))&"]+"));;
IFERROR(SPLIT(A:A; " "))));;9^9))))
This works with all Europe-based alphabets and captures all diacritics out there. It can differentiate between:
LOWER
and
UPPER
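As far as I can tell, the formulas above build a character class of lowercase letters and blank out every space-separated word that contains one. A rough Python sketch of that reading follows; keep_uppercase_words is a hypothetical name, and str.islower() covers all of Unicode rather than only the code points 65 to 1500 that the formulas scan.

```python
def keep_uppercase_words(cell):
    """Keep the space-separated words that contain no lowercase letter."""
    words = cell.split(" ")
    # A word survives only if none of its characters is a lowercase letter.
    kept = [w for w in words if w and not any(ch.islower() for ch in w)]
    return " ".join(kept)


result = keep_uppercase_words("Éléonorä-Camilliâ ÀLËMMNIÖ DE SANTORINÕ")
```

Note that under this reading a word made only of digits or punctuation is kept as well, because it contains no lowercase letter.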

gsub with exception in R

I'm removing English characters from Hebrew text but would like to keep a short list of English words that i want, e.g. words2keep <- c("ok", "hello", "yes*").
So my current regex is text <- gsub("[A-Z,a-z]", "", text) , but the question is how to add the exception so it will not remove all English words.
Reproducible example:
text = "ok אני מסכים איתך Yossi Cohen"
After gsub with the exception:
text = "ok אני מסכים איתך"
Thank you for all suggestions
This is a tricky one. I think we can do it by matching whole words with the \b word boundary assertion, while placing a negative lookahead assertion just before the match that rejects the words (again, whole words) you want to keep. This appears to be working:
gsub(perl=T,paste0('(?!\\b',paste(collapse='\\b|\\b',words2keep),'\\b)\\b[A-Za-z]+\\b'),'',text);
[1] "ok אני מסכים איתך "
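The same negative-lookahead idea can be sketched in Python. The function name remove_english_except is hypothetical, and two details are assumptions rather than part of the answer: the kept words are escaped, so an entry such as "yes*" would be taken literally instead of as a regex, and a final step collapses the spaces left behind by removed words.

```python
import re


def remove_english_except(text, words_to_keep):
    """Delete whole ASCII-letter words unless they are in words_to_keep."""
    keep = "|".join(re.escape(w) for w in words_to_keep)
    # (?!...) vetoes a match that would start on one of the kept whole words.
    pattern = rf"\b(?!(?:{keep})\b)[A-Za-z]+\b"
    stripped = re.sub(pattern, "", text)
    # Assumed tidy-up: collapse the gaps left behind by the removed words.
    return re.sub(r"\s+", " ", stripped).strip()


text = "ok אני מסכים איתך Yossi Cohen"
cleaned = remove_english_except(text, ["ok", "hello", "yes"])
```

Because the lookahead ends in \b, a longer word such as "yesterday" is still removed even though it starts with a kept word.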
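Alternatively, use gsub with [A-Z].* to remove everything from the first uppercase letter to the end of the string. This works for the example only because the unwanted words come last and start with a capital letter; a bare [A-Z] would remove single uppercase characters instead.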
gsub("[A-Z].*","",text)
[1] "ok אני מסכים איתך "
#data
text = "ok אני מסכים איתך Yossi Cohen"

How to gsub on the text between two words in R?

EDIT:
I would like to place a \n before a specific unknown word in my text. I know that the first time the unknown word appears in my text will be between "Tree" and "Lake"
Ex. of text:
text
[1] "TreeRULakeSunWater"
[2] "A B C D"
EDIT:
"Tree" and "Lake" will never change, but the word in between them is always changing so I do not look for "RU" in my regex
What I am currently doing:
if (grepl(".*Tree\\s*|Lake.*", text)) { text <- gsub(".*Tree\\s*|Lake.*", "\n\\1", text)}
The problem with what I am doing above is that the gsub will sub all of text and leave just \nRU.
text
[1] "\nRU"
I have also tried:
if (grepl(".*Tree *(.*?) *Lake.*", text)) { text <- gsub(".*Tree *(.*?) *Lake.*", "\n\\1", text)}
What I would like text to look like after gsub:
text
[1] "Tree \nRU LakeSunWater"
[2] "A B C D"
EDIT:
From Wiktor Stribizew's comment I am able to do a successful gsub
gsub("Tree(\\w+)Lake", "Tree \n\\1 Lake", text)
But this will only do a gsub on occurrences where "RU" is between "Tree" and "Lake", which is the first occurrence of the unknown word. The unknown word, in this case "RU", will show up many times in the text, and I would like to place \n in front of every occurrence of "RU" when "RU" is a whole word.
New Ex. of text.
text
[1] "TreeRULakeSunWater"
[2] "A B C RU D"
New Ex. of what I would like:
text
[1] "Tree \nRU LakeSunWater"
[2] "A B C \nRU D"
Any help will be appreciated. Please let me know if further information is needed.
You need to find the unknown word between "Tree" and "Lake" first. You can use
unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)
The pattern matches any characters up to the last Tree in a string, then captures the unknown word (\w+ = one or more word characters) up to Lake, and then matches the rest of the string. gsub is applied to every string in the vector, and strings without a match are returned unchanged, so take the first element with the [[1]] index.
Then, when you know the word, replace it with
gsub(paste0("[[:space:]]*(", unknown_word[[1]], ")[[:space:]]*"), " \n\\1 ", text)
Here, you have [[:space:]]*( + unknown_word[1] + )[[:space:]]* pattern. It matches zero or more whitespaces on both ends of the unknown word, and the unknown word itself (captured into Group 1). In the replacement, the spaces are shrunk into 1 (or added if there were none) and then \\1 restores the unknown word. You may replace [[:space:]] with \\s.
UPDATE
If you need to add a newline only before occurrences of RU that are whole words, use the \b word boundary:
> gsub(paste0("[[:space:]]*\\b(", unknown_word[[1]], ")\\b[[:space:]]*"), " \n\\1 ", text)
[1] "TreeRULakeSunWater" "A B C \nRU D"
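The two steps (find the unknown word, then mark its whole-word occurrences) can be sketched in Python as follows. The function name mark_unknown_word is hypothetical, and returning the input unchanged when no line contains Tree...Lake is an assumption about the desired behaviour.

```python
import re


def mark_unknown_word(lines):
    """Find the word between Tree and Lake, then put a newline before
    each whole-word occurrence of it in every line."""
    unknown = None
    for line in lines:
        found = re.search(r"Tree(\w+)Lake", line)
        if found:
            unknown = found.group(1)
            break
    if unknown is None:
        return list(lines)  # assumption: nothing to mark
    whole_word = re.compile(rf"\s*\b({re.escape(unknown)})\b\s*")
    # Surrounding whitespace is normalised to one space on each side.
    return [whole_word.sub(lambda m: " \n" + m.group(1) + " ", line)
            for line in lines]


text = ["TreeRULakeSunWater", "A B C RU D"]
marked = mark_unknown_word(text)
```

The first string is left alone because "RU" inside "TreeRULake" is not bounded by \b on either side.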

Extract letters (some numbers) and under scores from a character string

I have a bunch of character strings of various lengths that contain numbers and letters. All character strings end with an _ followed by a number (e.g. _30, _100, _500, or _1000).
The String object below contains a few examples.
Strings <- c("DET37_30", "DET37_500", "Ele_100", "Ele_1000", "NDVI_MeanMax_100", "RadWint_30", "RadWint_500", "Slope_100")
For each column name, I want to select all the numbers, letters, and _ prior to the final _number
For example, DET37_30 and DET37_500 would result in DET37, and Ele_100 and Ele_1000 would result in Ele.
In other words, I want all values before the ending _30, _100, _500, or _1000.
You can try:
gsub("(.*)_[0-9]*","\\1",Strings)
It replaces the whole string with whatever comes before the last underscore, because the greedy .* consumes as much as it can.
sub("_\\d+$", "", Strings)
#[1] "DET37" "DET37" "Ele" "Ele" "NDVI_MeanMax" "RadWint"
#[7] "RadWint" "Slope"
This regex matches an underscore followed by one or more digits, and it uses the $ anchor to allow only matches at the end of the line.
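For comparison, a minimal Python sketch of the anchored version; strip_numeric_suffix is a hypothetical name, and the list repeats the question's sample data.

```python
import re

strings = ["DET37_30", "DET37_500", "Ele_100", "Ele_1000",
           "NDVI_MeanMax_100", "RadWint_30", "RadWint_500", "Slope_100"]


def strip_numeric_suffix(name):
    """Drop a trailing underscore-plus-digits suffix, if there is one."""
    # The $ anchor keeps inner underscores, as in NDVI_MeanMax, untouched.
    return re.sub(r"_\d+$", "", name)


prefixes = [strip_numeric_suffix(s) for s in strings]
```

A name without a numeric suffix passes through unchanged, which is the main practical difference from splitting on the last underscore.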

Replacing non alphabetical characters & numbers with other special characters

I am using the following code to take out anything other than alphabetical characters, numbers, question mark, exclamation point, periods, parenthesis, commas & hyphen:
MsgBox(System.Text.RegularExpressions.Regex.Replace("hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~", "[^A-Za-z0-9]", ""))
I come up with this: hellotoyousMy#is4425235584
It should read like so: hello to yous! My # is (442) 523-5584.?,
Simply add all characters to your negated character class (take note of the space character!):
MsgBox(System.Text.RegularExpressions.Regex.Replace("hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~", "[^A-Za-z0-9 ?!.(),#-]+", ""))
(I also added a repeating + to your regex, so it can replace consecutive disallowed characters in one go)
Add a space and other symbols in the regex:
MsgBox(System.Text.RegularExpressions.Regex.Replace("hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~", "[^A-Za-z0-9 \(\)\!\.,\-\?]", ""))
Regex.Replace("your text", "[^A-Za-z0-9 ?!.(),-]+", "")
The pattern [^A-Za-z0-9 ?!.(),-]+ matches each run of consecutive unwanted characters and replaces it with an empty string.
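To see how the choice of allowed characters changes the result, here is a small Python sketch that runs two of the negated character classes from the answers on the question's sample string. The constant names are hypothetical; the only difference between the two patterns is whether "#" is kept.

```python
import re

message = r"hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~"

# A hyphen placed last inside the class is a literal hyphen, not a range.
KEEP_WITH_HASH = re.compile(r"[^A-Za-z0-9 ?!.(),#-]+")
KEEP_WITHOUT_HASH = re.compile(r"[^A-Za-z0-9 ?!.(),-]+")

with_hash = KEEP_WITH_HASH.sub("", message)
without_hash = KEEP_WITHOUT_HASH.sub("", message)
```

Dropping "#" from the class leaves two adjacent spaces in "My  is", so a follow-up whitespace cleanup may be needed depending on the use case.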