Find a word preceding a symbol set - regex

How can I find the word that precedes one of the characters [¹²³⁴⁵⁶⁷⁸⁹⁰]? For example:
let myString = "Regular expressions¹ consist of constants, ² and operator symbols...³"
Please provide a pattern that selects everything from the start of the target word through the superscript:
"expressions¹", "constants, ²", "symbols...³"
and a pattern that selects only the target word:
"expressions", "constants", "symbols"

This will match your examples.
Codepoints:
\b\w+\W*[\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+
From Wikipedia:
The most common superscript digits (1, 2, and 3) were in ISO-8859-1 and were therefore carried over into those positions in the Latin-1 range of Unicode. The rest were placed in a dedicated section of Unicode at U+2070 to U+209F.
Update:
To get separate blocks that start with words or non-words, you can just
exclude the superscript range from the non-word class.
The regex is longer and more redundant, but it works.
(?:\b\w+[^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]*|[^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+)[\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+
Formatted
(?:
\b
# Required - Words
\w+
# Optional - Not words, nor superscript
[^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]*
| # or,
# Required - Not words, nor superscript
[^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+
)
# Required - Superscript
[\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+
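For comparison, here is a minimal Python sketch of the same idea. Python is an assumption of this example (the question targets Swift), and the names SUP, WORD and find_notes are illustrative. Python's re module has no \x{...} syntax, and its \w may also match superscript digits, so the sketch spells out "a word character that is not a superscript" explicitly and uses a capture group for the bare word.

```python
import re

# Superscript digits: U+00B9, U+00B2, U+00B3, U+2070 and U+2074..U+2079.
SUP = "\u00b9\u00b2\u00b3\u2070\u2074-\u2079"
# A word character that is not itself a superscript digit.
WORD = rf"[^\W{SUP}]"
# Group 1 is the bare word; the whole match runs through the superscript.
# The middle class is "not a word character and not a superscript".
PATTERN = re.compile(rf"({WORD}+)[^\w{SUP}]*[{SUP}]+")


def find_notes(text):
    """Return (whole_match, word_only) pairs for each annotated word."""
    return [(m.group(0), m.group(1)) for m in PATTERN.finditer(text)]


text = "Regular expressions¹ consist of constants, ² and operator symbols...³"
for whole, word in find_notes(text):
    print(repr(whole), repr(word))
```

The first element of each pair corresponds to the "word through superscript" selection, the second to the "word only" selection.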

Based on sin's or Caleb Kleveter's information:
let myString = " expressions¹ consist of 元機經中有關文字排版² and operator symbols³"
let noteIdx = "\u{2070}\u{00b9}\u{00b2}\u{00b3}\u{2074}\u{2075}\u{2076}\u{2077}\u{2078}\u{2079}"
let strs = myString.unicodeScalars.split { (s) -> Bool in
noteIdx.unicodeScalars.contains{ $0 == s }
}
strs.forEach {
print($0)
}
/* prints
expressions
consist of 元機經中有關文字排版
and operator symbols
*/
This is just a skeleton; you can take it further if you want.

Related

Regex: Find a word that consists of certain characters

I have a list of dictionary words, I would like to find any word that consists of (some or all) certain characters of a source word in any order :
For Example:
Characters (source word) to look for : stainless
Found Words : stainless, stain, net, ten, less, sail, sale, tale, tales, ants, etc.
Also if a letter is found once in the source word it can't be repeated in the found word
Unacceptable words to find : tent (t is repeated), tall (l is repeated) , etc.
Acceptable words to find : less (s is already repeated in the source word), etc.
You could take this approach:
Match any sequence of characters that are in the search word, requiring that the match is a word (word-boundaries)
Prohibit that a certain character occurs more often than it is present in the search word, using a negative look-ahead. Do this for every character that is in the search word.
For the given example the regular expression would be:
(?!(\S*s){4}|(\S*t){2}|(\S*a){2}|(\S*i){2}|(\S*n){2}|(\S*l){2}|(\S*e){2})\b[stainless]+\b
The biggest part of the pattern deals with the negative look-ahead. For example:
(\S*s){4} matches when a single word contains four occurrences of 's'.
(?! | ) places these patterns as different options in a negative look-ahead so that none of them should match.
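As a quick sanity check, the pattern above can be run against a few of the words from the question. This is a minimal Python sketch; the helper name is_allowed and the candidate list are illustrative assumptions, and fullmatch is used because each dictionary entry is tested on its own.

```python
import re

# The pattern from the answer, for the search word "stainless".
PATTERN = re.compile(
    r"(?!(\S*s){4}|(\S*t){2}|(\S*a){2}|(\S*i){2}|(\S*n){2}|(\S*l){2}|(\S*e){2})"
    r"\b[stainless]+\b"
)


def is_allowed(word):
    # fullmatch requires the whole dictionary entry to match the pattern.
    return PATTERN.fullmatch(word) is not None


candidates = ["stainless", "stain", "net", "less", "tales", "tent", "tall", "dog"]
print([w for w in candidates if is_allowed(w)])
```

"tent" and "tall" are rejected by the look-ahead (too many t's or l's), while "dog" is rejected by the character class.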
Automation
It is clear that making such a regular expression for a given word needs some work, so that is where you could use some automation. Notepad++ cannot help with that, but in a programming environment it is possible. Here is a little snippet in JavaScript that will give you the regular expression that corresponds to a given search word:
function regClassEscape(s) {
// Escape "]", "\", "^" and "-", which are special inside a character class:
return s.replace(/[\]\\^-]/g, "\\$&");
}
function buildRegex(searchWord) {
// get frequency of each letter:
let freq = {};
for (let ch of searchWord) {
ch = regClassEscape(ch);
freq[ch] = (freq[ch] ?? 0) + 1;
}
// Produce negative options (too many occurrences)
const forbidden = Object.entries(freq).map(([ch, count]) =>
"(\\S*[" + ch + "]){" + (count + 1) + "}"
).join("|");
// Produce character set
const allowed = Object.keys(freq).join("");
return "(?!" + forbidden + ")\\b[" + allowed + "]+\\b";
}
// I/O management
const [input, output] = document.querySelectorAll("input,div");
input.addEventListener("input", refresh);
function refresh() {
if (/\s/.test(input.value)) {
output.textContent = "Input should have no white space!";
} else {
output.textContent = buildRegex(input.value);
}
}
refresh();
input { width: 100% }
Search word:<br>
<input value="stainless">
Regular expression:
<div></div>

Regex catch word at the start and end of a UITextView

I'm trying to catch when a word is used in a UITextView. I've got it working for words in the interior of the view.
The problem is when the word is first or last in the view.
My code so far:
private func filteredTermFor(_ word: String) -> String {
let punctuationFilter = "([\\A|\\W|\\d|\\z| ])"
let wordInParens = "(\(word))"
return punctuationFilter + wordInParens + punctuationFilter
}
I checked and found I should use ^ for the start of input and $ for the end of input. When I add either of these, for example:
"([^|\\A|\\W|\\d|\\z| ])"
they don't seem to have any effect when the word in question is the first or last in the view.
*For the sake of being verbose with my question, the return value from the function above is being used as searchTerm in this:
func highlightedTextInString(with searchTerm: String, targetString: String) -> NSAttributedString? {
let attributedString = NSMutableAttributedString(string: targetString)
do {
let regex = try NSRegularExpression(pattern: searchTerm, options: .caseInsensitive)
let range = NSRange(location: 0, length: targetString.utf16.count)
for match in regex.matches(in: targetString, options: .withTransparentBounds, range: range) {
let fontColor = UIColor.red
attributedString.addAttribute(NSForegroundColorAttributeName, value: fontColor, range: match.range)
}
return attributedString
} catch _ {
print("Error creating regular expression")
return nil
}
}
** Edit **
Since this was marked as a duplicate
The question this was reported a duplicate of does not cover edge cases when the word is typed next to a punctuation mark or digit without spaces.
For example:
.word , word9 , ?word?
Note that ([^|\\A|\\W|\\d|\\z| ]) is a capturing group ((...)) containing a character class, which matches a single char from the set defined inside it. The ^ right after [ negates the class, so it matches any char except those in the set. So, [^|\\A|\\W|\\d|\\z| ] matches a single char other than | (it is no longer an alternation operator inside a character class), A (the \ in front of it is ignored), a non-word char, a digit, z and space. It effectively matches _ and any letter other than A and z.
You state that the words you need to match may occur within word boundaries or digits.
You may use
return "(?<![^\\W\\d])(\(word))(?![^\\W\\d])"
See the regex demo.
Here, "(?<![^\\W\\d])" is a negative lookbehind that matches a location that is NOT immediately preceded by a character other than a non-word or digit char. This sounds cumbersome, but the main point is that [^\W\d] matches the same text as \w excluding digits (\w matches letters, digits, and _). So, "(?<![^\\W\\d])" makes sure there is a start of string or a non-letter and non-_ char right before the word. If you want to allow a word to match after _, just use (?<!\\p{L}) (where \p{L} matches any Unicode letter).
The "(?![^\\W\\d])" is a negative lookahead that makes sure there is an end of string or a non-letter and non-_ (there can be punctuation, symbols and digits) immediately to the right of the word. Again, if you want to match a word if it is followed with _, you may replace this lookahead with "(?!\\p{L})" (just no letter after the word is allowed).
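The effect of the two lookarounds can be checked outside of Swift as well. Below is a minimal Python sketch (Python's re supports the same (?<!...) and (?!...) syntax for single-character classes); the helper names word_pattern and contains_word are assumptions made for this illustration.

```python
import re


def word_pattern(word):
    # [^\W\d] is "a word character that is not a digit": a letter or "_".
    # The lookbehind/lookahead forbid such a character right next to the word.
    return re.compile(
        rf"(?<![^\W\d]){re.escape(word)}(?![^\W\d])", re.IGNORECASE
    )


def contains_word(word, text):
    return word_pattern(word).search(text) is not None


for sample in ["word", ".word", "word9", "?word?", "swords", "my_word"]:
    print(sample, contains_word("word", sample))
```

The word matches at the very start and end of the text, next to punctuation and next to digits, but not inside a longer word or next to an underscore.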

Regular Expressions - a string containing an even number of a character among other characters

I'm going through my homework and can't seem to figure out how to do this one.
Say the alphabet is {a,b,c}, and we want an expression that finds strings with an even number of c's.
Example strings that are included:
the empty string,
ccab
abcc
cabc
ababababcc
and so on.. just an even amount of c's.
You can use this regex to allow only an even number of c's in the input:
^(?=(([^c\n]*c){2})*[^\nc]*$)[abc]*$
RegEx Demo
The regex below matches strings that contain an even, nonzero number of c's (each repetition of the group consumes exactly two c's, and at least one repetition is required):
^(?:[^c]*c[^c]*c[^c\n]*)+?$
DEMO
OR
^(?:[ab]*c[ab]*c[ab]*)+?$
DEMO
Assuming it is the total number of c's that counts, not consecutive c's, there is a nice theoretical approach, based on the fact that the set of strings with an even number of c's can be recognized by a finite state automaton with two states.
The first state is the initial state, and it is also an accepting state. The second one is a rejecting state. Each c toggles us between the states. Other letters do nothing.
Now, you can convert this simple machine to regex using one of the methods described here.
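The two-state machine described above is short enough to simulate directly. The following Python sketch is illustrative only: the function name accepts is an assumption, and the regex EVEN_C is one expression that eliminating the rejecting state can yield, not something taken from the answer. The loop cross-checks the automaton against that regex on every string over {a, b, c} up to length 6.

```python
import re
from itertools import product


def accepts(s, symbol="c"):
    """Simulate the automaton: state 0 = even (accepting), state 1 = odd."""
    state = 0
    for ch in s:
        if ch == symbol:
            state ^= 1  # each 'c' toggles between the two states
        # any other letter leaves the state unchanged
    return state == 0


# Stay in the accepting state on a/b, or leave it with a 'c', loop on a/b in
# the rejecting state, and come back with a second 'c'.
EVEN_C = re.compile(r"(?:[ab]|c[ab]*c)*")

for n in range(7):
    for letters in product("abc", repeat=n):
        s = "".join(letters)
        assert accepts(s) == (EVEN_C.fullmatch(s) is not None)
```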
Something like
^([^c]*(c[^c]*c)+)*[^c]*$
ought to do it. We can break it out thus:
^ # - start-of-line, followed by
( # - a group, consisting of
[^c]* # - zero or more characters other than 'c', followed by
( # - a group, consisting of
c # - the literal character 'c', followed by
[^c]* # - zero or more characters other than 'c', followed by
c # - the literal character 'c'
)+ # repeated one or more times
)* # repeated zero or more times, followed by
[^c]* # - a final sequence of zero or more characters other than 'c', followed by
$ # - end-of-line
One might note that something like the following C# method will likely perform better and be easier to understand:
public static bool ContainsEvenNumberOfCharacters( this string s , char x ) // extension methods must be static
{
int cnt = 0 ;
foreach( char c in s )
{
cnt += ( c == x ? 1 : 0 ) ;
}
bool isEven = 0 == (cnt&1) ; // it's even if the low-order bit is off.
return isEven ;
}
Simply
/^[^c]*((c[^c]*){2})*$/
In English:
Any number of non-c's, followed by zero or more groups, each of which contains exactly two instances of c, where every c may be followed by any number of non-c's.
This solution has the advantage that it is easily extendable to the case of a string with a number of c's which is multiple of 3, etc., and makes no assumptions about the alphabet.
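The extension to other multiples can be sketched by making the repetition count a parameter. This is a minimal Python illustration; the function name count_is_multiple and its parameters are assumptions, and the pattern is written with a leading run of non-c's so that strings containing no c at all are accepted too.

```python
import re


def count_is_multiple(s, ch="c", k=2):
    """True if the number of occurrences of ch in s is a multiple of k."""
    esc = re.escape(ch)
    # Leading non-ch run, then zero or more groups of exactly k occurrences,
    # each occurrence followed by any number of non-ch characters.
    pattern = rf"[^{esc}]*(?:(?:{esc}[^{esc}]*){{{k}}})*"
    return re.fullmatch(pattern, s) is not None


print(count_is_multiple("ababababcc"))       # even number of c's
print(count_is_multiple("cabcac", k=3))      # multiple of three
```

Every group starts with the counted character, so the regex engine has no ambiguity about where one group ends and the next begins.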

Ignore String containing special words (Months)

I am trying to find alphanumeric strings by using the following regular expression:
^(?=.*\d)(?=.*[a-zA-Z]).{3,90}$
Alphanumeric string: any string that contains at least one digit and one letter, and that may also contain special characters such as # - _ [] () {} ç _ \ ù %
I want to add an extra constraint to ignore all alphanumerical strings containing the following month formats :
JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre
One solution is to actually match an alphanumerical string. Then check if this string contains one of these names by using the following function:
vector<string> findString(string s)
{
vector<string> vec;
boost::regex rgx("JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre");
boost::smatch match;
boost::sregex_iterator begin {s.begin(), s.end(), rgx},
end {};
for (boost::sregex_iterator& i = begin; i != end; ++i)
{
boost::smatch m = *i;
vec.push_back(m.str());
}
return vec;
}
Question: How can I add this constraint directly into the regular expression instead of using this function.
One solution is to use negative lookahead as mentioned in How to ignore words in string using Regular Expressions.
I used it as follows:
String : 2-hello-001
Regular expression : ^(?=.*\d)(?=.*[a-zA-Z]^(?!Jan|Feb|Mar)).{3,90}$
Result: no match
Test website: http://regexlib.com/
The edit provided by @Robin and @RyanCarlson : ^[][\w#_(){}ç\\ù%-]{3,90}$ works perfectly in detecting alphanumeric strings with special characters. It's just the negative lookahead part that isn't working.
You can use negative look ahead, the same way you're using positive lookahead:
^(?=.*\d)(?=.*[a-zA-Z])
(?!.*(?:JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre)).{3,90}$
Also, your regex is pretty unclear. If you want alphanumerical strings with a length between 3 and 90, you can just do:
/^(?!.*(?:JANVIER|F[Eé]VRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AO[Uû]T|SEPTEMBRE|OCTOBRE|NOVEMBRE|D[Eé]CEMBRE|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))
[][\w#_(){}ç\\ù%-]{3,90}$/i
The i flag makes the match case-insensitive (so you can shorten your forbidden list), and \w is a shortcut for [0-9a-zA-Z_]. Careful if you copy-paste: the line break between (?! ) and [ ] is there for readability only. Just add whatever special characters you want to match to the final [...].
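To see the negative lookahead at work, here is a minimal Python sketch of the first variant (digit and letter required, month names forbidden). Python stands in for Boost here as an assumption; MONTHS is a shortened list that relies on case-insensitive matching, and the name is_valid is illustrative.

```python
import re

# Shortened list: case-insensitive matching covers the upper/lower variants.
MONTHS = [
    "janvier", "f[eé]vrier", "mars", "avril", "mai", "juin", "juillet",
    "ao[uû]t", "septembre", "octobre", "novembre", "d[eé]cembre",
    "jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct",
    "nov", "dec",
]
MONTH_ALT = "|".join(MONTHS)

PATTERN = re.compile(
    r"^(?=.*\d)(?=.*[a-zA-Z])"   # at least one digit and one letter
    rf"(?!.*(?:{MONTH_ALT}))"    # no month name anywhere in the string
    r".{3,90}$",                 # total length between 3 and 90
    re.IGNORECASE,
)


def is_valid(s):
    return PATTERN.match(s) is not None


for sample in ["2-hello-001", "12-Jan-2020", "hello", "abc123"]:
    print(sample, is_valid(sample))
```

Note that the lookahead rejects any string containing a month name as a substring, so a value such as "email1" is refused because it contains "mai".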

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match before a trailing newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (it makes no difference whether you use that or ^ in this case, but \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
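The three pitfalls listed above (the literal | inside a character class, the missing anchors, and $ accepting a trailing newline) are not specific to Perl. The following Python sketch reproduces them for illustration; Python and the helper name is_dna are assumptions of this example, and re.fullmatch plays the role of \A ... \z.

```python
import re


def is_dna(s):
    # fullmatch anchors at both ends of the string, like \A ... \z in Perl.
    return re.fullmatch(r"[ACGT]+", s, re.IGNORECASE) is not None


# The original, unanchored class also accepts a lone "|".
print(re.search(r"[A|C|G|T]", "|") is not None)
# "$" still matches before a trailing newline; fullmatch does not.
print(re.search(r"^[ACGT]+$", "ACGT\n") is not None, is_dna("ACGT\n"))
print(is_dna("AACGGGTTA"), is_dna("AAYYGGTTA"))
```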