Regex catch word at the start and end of a UITextView - regex

I'm trying to catch when a word is used in a UITextView. I've got it working for words in the interior of the view.
The problem is when the word is first or last in the view.
My code so far:
private func filteredTermFor(_ word: String) -> String {
    let punctuationFilter = "([\\A|\\W|\\d|\\z| ])"
    let wordInParens = "(\(word))"
    return punctuationFilter + wordInParens + punctuationFilter
}
I checked and found I should use ^ for the start of input and $ for the end of input. When I add either of these, for example:
"([^|\\A|\\W|\\d|\\z| ])"
they don't seem to have any effect when the word in question is the first or last in the view.
*For the sake of being verbose with my question, the return value from the function above is being used as searchTerm in this:
func highlightedTextInString(with searchTerm: String, targetString: String) -> NSAttributedString? {
    let attributedString = NSMutableAttributedString(string: targetString)
    do {
        let regex = try NSRegularExpression(pattern: searchTerm, options: .caseInsensitive)
        let range = NSRange(location: 0, length: targetString.utf16.count)
        for match in regex.matches(in: targetString, options: .withTransparentBounds, range: range) {
            let fontColor = UIColor.red
            attributedString.addAttribute(NSForegroundColorAttributeName, value: fontColor, range: match.range)
        }
        return attributedString
    } catch _ {
        print("Error creating regular expression")
        return nil
    }
}
** Edit **
This was marked as a duplicate, but the question it was reported as a duplicate of does not cover the edge cases where the word is typed next to a punctuation mark or digit without spaces.
For example:
.word , word9 , ?word?

Note that ([^|\\A|\\W|\\d|\\z| ]) is a capturing group ((...)) containing a character class, and a character class matches a single char from the set defined inside it. The ^ right after [ negates the class, so it matches any char except those in the set. Inside a character class, | is a literal pipe rather than an alternation operator, and \A and \z are no longer anchors (the \ in front is not considered, leaving the letters A and z). So, [^|\\A|\\W|\\d|\\z| ] matches a single char other than |, A, a non-word char, a digit, z and space. It effectively matches _ and any letters other than A and z.
You state that the words you need to match may occur within word boundaries or digits.
You may use
return "(?<![^\\W\\d])(\(word))(?![^\\W\\d])"
See the regex demo.
Here, "(?<![^\\W\\d])" is a negative lookbehind that matches a location that is NOT immediately preceded with a character other than a non-word and a digit char. This sounds cumbersome, but the main point here is that [^\W\d] matches the same texts as \w excluding digits (\w matches letters, digits, and _). So, "(?<![^\\W\\d])" makes sure there is a start of string or a non-letter and non-_ char right before the word. If you allow a word to match after _, just use "(?<!\\p{L})" (where \p{L} matches any Unicode letter).
The "(?![^\\W\\d])" is a negative lookahead that makes sure there is an end of string or a non-letter and non-_ (there can be punctuation, symbols and digits) immediately to the right of the word. Again, if you want to match a word if it is followed with _, you may replace this lookahead with "(?!\\p{L})" (just no letter after the word is allowed).
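As a rough cross-check of the two lookarounds, here is a minimal sketch in Python rather than Swift (Python's re module accepts the same lookbehind/lookahead syntax). The helper names whole_word_pattern and match_starts and the sample text are made up for illustration, and re.escape is an addition that the answer itself does not use.

```python
import re

def whole_word_pattern(word):
    # [^\W\d] means "a word char that is not a digit", i.e. a letter or "_".
    # The negative lookarounds reject the word only when such a char touches it,
    # so start/end of string, punctuation and digits are all acceptable neighbours.
    return re.compile(
        r"(?<![^\W\d])(" + re.escape(word) + r")(?![^\W\d])",
        re.IGNORECASE,
    )

def match_starts(word, text):
    # Start offsets of every accepted occurrence of `word` in `text`.
    return [m.start(1) for m in whole_word_pattern(word).finditer(text)]

if __name__ == "__main__":
    sample = "word .word word9 ?word? swordfish _word"
    print(match_starts("word", sample))
```

In this sample the occurrences inside "swordfish" and "_word" are skipped, while the ones at the start of the text and next to ".", "9" and "?" are kept.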

Related

How to match in a single/common Regex Group matching or based on a condition

I would like to extract two different test strings /i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
and
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8
with a single RegEx and in Group-1.
By using this RegEx ^.[i,na,fm,d]+\/(.+([,\/])?(\/|.+=.+,\/).+\/[,](live.([^,]).).+_)?.+(640).*$ I can get the second string to match the desired result int/2021/11/25/,live_20211125_215206_
but the first string does not match in Group-1. The expected extraction for the first string is int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45
Any pointers on this are appreciated.
Thanks!
If you want both values in group 1, you can use:
^/(?:[id]|na|fm)/([^/\s]*/\d{4}/\d{2}/\d{2}/\S*?)(?:/,|[^_]+_)640(?:\D|$)
The pattern matches:
^ Start of string
/ Match literally
(?:[id]|na|fm) Match one of i d na fm
/ Match literally
( Capture group 1
[^/\s]*/ Match optional chars other than / or a whitespace char, then match /
\d{4}/\d{2}/\d{2}/ Match a date like pattern
\S*? Match optional non whitespace chars, as few as possible
) Close group 1
(?:/,|[^_]+_) Match either /, or 1+ chars other than _ and then match _
640 Match literally
(?:\D|$) Match either a non-digit or assert the end of the string
See a regex demo and a go demo.
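The pattern above also runs unchanged in Python's re module, which makes for a quick way to inspect group 1 for both test strings. This is only a sketch: the names PATTERN, SAMPLES and extract_group1 are invented for the example.

```python
import re

PATTERN = re.compile(
    r"^/(?:[id]|na|fm)/([^/\s]*/\d{4}/\d{2}/\d{2}/\S*?)(?:/,|[^_]+_)640(?:\D|$)"
)

# The two test strings from the question.
SAMPLES = [
    "/i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive",
    "/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8",
]

def extract_group1(path):
    # Returns the text captured by group 1, or None when the path does not match.
    m = PATTERN.search(path)
    return m.group(1) if m else None

if __name__ == "__main__":
    for sample in SAMPLES:
        print(extract_group1(sample))
```

The lazy \S*? grows one character at a time until either /, or an underscore-terminated segment is directly followed by 640, which is why the two strings end up with differently shaped captures.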
We can't know all the rules of how the strings you are matching are constructed, but for just these two example strings provided:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	var re = regexp.MustCompile(`(?m)(\/i/int/\d{4}/\d{2}/\d{2}/.*)(?:\/,|_[\w_]+)640`)
	var str = `
/i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8`
	match := re.FindAllStringSubmatch(str, -1)
	for _, val := range match {
		fmt.Println(val[1])
	}
}

Why does the regex [a-zA-Z]{5} return true for non-matching string?

I defined a regular expression to check if the string only contains alphabetic characters and with length 5:
use regex::Regex;

fn main() {
    let re = Regex::new("[a-zA-Z]{5}").unwrap();
    println!("{}", re.is_match("this-shouldn't-return-true#"));
}
The text I use contains many illegal characters and is longer than 5 characters, so why does this return true?
You have to put it inside ^...$ to match the whole string and not just parts:
use regex::Regex;

fn main() {
    let re = Regex::new("^[a-zA-Z]{5}$").unwrap();
    println!("{}", re.is_match("this-shouldn't-return-true#"));
}
Playground.
As explained in the docs:
Notice the use of the ^ and $ anchors. In this crate, every expression is executed with an implicit .*? at the beginning and end, which allows it to match anywhere in the text. Anchors can be used to ensure that the full text matches an expression.
Your pattern returns true because it matches any run of 5 consecutive alpha chars anywhere in the text. In your case it finds such runs inside both 'shouldn't' (shoul) and 'return' (retur).
Change your regex to: ^[a-zA-Z]{5}$
^ start of string
[a-zA-Z]{5} matches 5 alpha chars
$ end of string
This will match a string only if the string has a length of 5 chars and all of the chars from start to end fall in range a-z and A-Z.
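The difference between "matches somewhere" and "matches the whole string" can be sketched in Python as well (not Rust, so the function names differ from the regex crate). Here re.search plays the role of the unanchored is_match, and re.fullmatch is Python's way of requiring the entire text to match; the helper names are made up for the example.

```python
import re

FIVE_LETTERS = re.compile(r"[a-zA-Z]{5}")

def contains_five_letters(text):
    # search() succeeds if the pattern matches anywhere in the text.
    return FIVE_LETTERS.search(text) is not None

def is_exactly_five_letters(text):
    # fullmatch() requires the whole text to match, like wrapping the pattern in ^...$.
    return FIVE_LETTERS.fullmatch(text) is not None

if __name__ == "__main__":
    text = "this-shouldn't-return-true#"
    print(contains_five_letters(text))    # unanchored check
    print(is_exactly_five_letters(text))  # anchored check
    print(is_exactly_five_letters("hello"))
```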

Dart Regex does not match whole word for Arabic text

This pattern works fine in Java and javascript but does not seem to work in Dart. Any help is appreciated.
void main() {
  String englishText = "The new nature will not find rest";
  String englishFind = "Nature";
  RegExp englishExp = new RegExp("\\b$englishFind\\b", unicode:true, caseSensitive:false);
  bool englishResult = englishExp.hasMatch(englishText);//matches
  print(englishResult); //true
  String arabicText = "لن تجد الطبيعة الجديدة راحتها";
  String arabicFind="الطبيعة";
  RegExp arabicExp = new RegExp("\\b$arabicFind\\b", unicode:true);
  bool arabicResult = arabicExp.hasMatch(arabicText);//does not match
  print(arabicResult);//false
}
The \b word boundary still matches only in ASCII contexts, even when you set unicode:true, whose main point is to make sure "UTF-16 surrogate pairs in the original string will be treated as a single code point and will not match separately".
You may "decompose" the word boundary and add Arabic letter and digit ranges to the class:
String arabicText = "لن تجد الطبيعة الجديدة راحتها";
String arabicFind="الطبيعة";
RegExp arabicExp = new RegExp("(?:^|[^a-zA-Z0-9_\\u06F0-\\u06F9\\u0622\\u0627\\u0628\\u067E\\u062A-\\u062C\\u0686\\u062D-\\u0632\\u0698\\u0633-\\u063A\\u0641\\u0642\\u06A9\\u06AF\\u0644-\\u0648\\u06CC\\u202C\\u064B\\u064C\\u064E-\\u0652])$arabicFind(?![a-zA-Z0-9_\\u06F0-\\u06F9\\u0622\\u0627\\u0628\\u067E\\u062A-\\u062C\\u0686\\u062D-\\u0632\\u0698\\u0633-\\u063A\\u0641\\u0642\\u06A9\\u06AF\\u0644-\\u0648\\u06CC\\u202C\\u064B\\u064C\\u064E-\\u0652])", unicode:true);
bool arabicResult = arabicExp.hasMatch(arabicText);//matches
print(arabicResult); // => true
The regex will match an $arabicFind word when it is
(?:^|[^a-zA-Z0-9_\u06F0-\u06F9\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC\u202C\u064B\u064C\u064E-\u0652]) - preceded with start of string (^) or (|) any char but ASCII letter, digit or _ and Farsi letters or digits
(?![a-zA-Z0-9_\u06F0-\u06F9\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC\u202C\u064B\u064C\u064E-\u0652]) - not followed with an ASCII letter, digit or _ and Farsi letters or digits.
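The same "decomposed word boundary" idea can be sketched in Python. Note the swap: instead of the explicit letter and digit ranges above, this sketch relies on Python's \w, which is Unicode-aware for str patterns and therefore covers Arabic letters. That is a property of Python's re module, not of Dart, and the helper name has_whole_word is hypothetical.

```python
import re

def has_whole_word(word, text, ignore_case=False):
    # (?<!\w) and (?!\w) replace \b: no word character may touch the word
    # on either side, which also holds at the start and end of the text.
    flags = re.IGNORECASE if ignore_case else 0
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text, flags) is not None

if __name__ == "__main__":
    arabic_text = "لن تجد الطبيعة الجديدة راحتها"
    print(has_whole_word("الطبيعة", arabic_text))  # the full word
    print(has_whole_word("طبيعة", arabic_text))    # only part of a longer word
```

Using lookarounds on both sides (rather than a consuming group on the left) keeps the match limited to the word itself, which matters if the match range is used for highlighting.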

How can I replace the last word using Regex?

I have a String extension:
func replaceLastWordWithUsername(_ username: String) -> String {
    let pattern = "#*[A-Za-z0-9]*$"
    do {
        Log.info("Replacing", self, username)
        let regex = try NSRegularExpression(pattern: pattern, options: NSRegularExpression.Options.caseInsensitive)
        let range = NSMakeRange(0, self.characters.count)
        return regex.stringByReplacingMatches(in: self, options: [], range: range, withTemplate: username )
    } catch {
        return self
    }
}
let oldString = "Hey jess"
let newString = oldString.replaceLastWordWithUsername("#jessica")
newString now equals Hey #jessica #jessica. The expected result should be Hey #jessica
I think it's because the * regex operator will
Match 0 or more times. Match as many times as possible.
This might be causing it to also match the 'no characters at the end' in addition to the word at the end, resulting in two replacements.
As mentioned by @Code Different, if you use let pattern = "\\w+$" instead, it will only match if there are characters, eliminating the 'no characters' match.
"Word1 Word2"
^some characters and then end
^0 characters and then end
Use this regex:
(?<=\s)\S+$
Sample: https://regex101.com/r/kGnQEM/1
/(?<=\s)\S+$/g
Positive Lookbehind (?<=\s)
Assert that the Regex below matches
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\S+ matches any non-whitespace character (equal to [^\r\n\t\f\v ])
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
Just change your pattern:
let pattern = "\\w+$"
\w matches any word character, i.e. [A-Za-z0-9_] for ASCII text (note that this includes the underscore)
+ means one or more
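The double replacement comes from the extra empty match at the very end of the string, which can be made visible with a small Python sketch (Python instead of Swift; the constant and function names are invented for the example). Listing the matches shows that the all-optional pattern finds the last word and then an empty string, while \w+$ finds only the word.

```python
import re

BUGGY = r"#*[A-Za-z0-9]*$"  # every part is optional, so it can also match ""
FIXED = r"\w+$"             # requires at least one word character

def replace_last_word(text, username, pattern=FIXED):
    # Replace whatever `pattern` matches at the end of `text` with `username`.
    return re.sub(pattern, username, text)

if __name__ == "__main__":
    print(re.findall(BUGGY, "Hey jess"))  # the word plus an empty match
    print(re.findall(FIXED, "Hey jess"))  # the word only
    print(replace_last_word("Hey jess", "#jessica"))
```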

Find a word preceding a symbol set

How can I find a word that precedes one of [¹²³⁴⁵⁶⁷⁸⁹⁰]? For example:
let myString = "Regular expressions¹ consist of constants, ² and operator symbols...³"
Please provide a pattern that selects the characters from the start of the target word through the superscript:
"expressions¹", "constants, ²", "symbols...³"
and a pattern that selects only the target word:
"expressions", "constants", "symbols"
This will match your examples.
Codepoints:
\b\w+\W*[\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+
From Wikipedia:
The most common superscript digits (1, 2, and 3) were in ISO-8859-1 and were therefore carried over into those positions in the Latin-1 range of Unicode. The rest were placed in a dedicated section of Unicode at U+2070 to U+209F.
Update:
To get separate blocks that start with words or non-words, you can just
exclude the superscript range from the non-word class.
The regex is longer and more redundant, but it works.
(?:\b\w+[^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]*|[^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+)[\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+
Formatted
(?:
     \b
     # Required - Words
     \w+
     # Optional - Not words, nor superscript
     [^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]*
  |                                        # or,
     # Required - Not words, nor superscript
     [^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+
)
# Required - Superscript
[\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+
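For the two outputs the question asks for (word plus superscript, and the word alone), a capturing group around the word part is enough. Below is a minimal Python sketch of that idea; it is not the Swift/ICU pattern above. The names SUP, PATTERN and find_footnoted_words are made up, and the superscripts are explicitly excluded from the word class because Python's \w may itself accept superscript digits.

```python
import re

# Superscript digits as a character-class body:
# U+2070, U+00B9, U+00B2, U+00B3 and the range U+2074..U+2079.
SUP = "\u2070\u00b9\u00b2\u00b3\u2074-\u2079"

PATTERN = re.compile(
    rf"([^\W{SUP}]+)"  # group 1: the word (word chars that are not superscripts)
    rf"[^\w{SUP}]*"    # optional punctuation or spaces between word and marker
    rf"[{SUP}]+"       # one or more superscript digits
)

def find_footnoted_words(text):
    # Returns (full match, word only) pairs for every word followed by a superscript.
    return [(m.group(0), m.group(1)) for m in PATTERN.finditer(text)]

if __name__ == "__main__":
    my_string = "Regular expressions¹ consist of constants, ² and operator symbols...³"
    for full, word in find_footnoted_words(my_string):
        print(repr(full), repr(word))
```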
Based on sin's or Caleb Kleveter's information:
let myString = " expressions¹ consist of 元機經中有關文字排版² and operator symbols³"
let noteIdx = "\u{2070}\u{00b9}\u{00b2}\u{00b3}\u{2074}\u{2075}\u{2076}\u{2077}\u{2078}\u{2079}"
let strs = myString.unicodeScalars.split { (s) -> Bool in
    noteIdx.unicodeScalars.contains{ $0 == s }
}
strs.forEach {
    print($0)
}
/* prints
 expressions
 consist of 元機經中有關文字排版
 and operator symbols
 */
This is just a skeleton; you can build on it if you want.