How can I allow or ignore apostrophes? - regex

I am looking for a regex expression that allows (or ignores) an apostrophe? I'm fairly new to regex and I looked at other similar questions but didn't find the help I need.
I am using a textbox to search an RTB and match all words with a specific or common ending (i.e. the search term inserted in the textbox). Then, I need to pass all matches to a second RTB.
I have tried many different expressions including: \b\w*[-']\w*\b but the program either separates the word at the apostrophe, finds only words with an apostrophe, or lists all words as matches?
My sample list of words to search is:
mi'iria, mi'i, piraria, makuptiaria, netap, hap, kuap, uimikuaptiaria, uhyt, set, uipu'aptiaria, mu'ap, atat, hat, haria, yat. (commas are not in the original list)!
As you can see, there are words that end in "ria" which contain an apostrophe and words that do not. I want to match all words that end with "ria," but I get results like: mi as one match, iria as another match and piraria, makuptiaria, uimikuaptiaria and haria aren't matched?
I need an expression that will allow (or ignore) the apostrophe so that all words that end in "ria" are matched, whether or not they contain an apostrophe. Also, words which contain an apostrophe (i.e. similar to mi'iria) should not be split at the apostrophe. Can anyone help with this? I am very grateful for any help! Thanks!
Okay, I spent some time tinkering on https://regex101.com/r/X4oL0y/1 and came up with the following expression which matches all words that end with "ria" including those with and those without an apostrophe:
\b\w+\'?\w+ria\w*\b
However, the w+ria part of this regex represents literal characters. This limits the functionality to words that end with "ria." Is there a way to generically declare the search term the user enters in the textbox as the character(s) to match so that all whole words that end with the search term are matched?
This is my code so far:
'Set index:
Dim index As Integer = 0
'Find and highlight all search term occurrences:
While index < RichTextBox1.Text.LastIndexOf(TextBox1.Text)
RichTextBox1.Find(TextBox1.Text, index, RichTextBox1.TextLength, RichTextBoxFinds.None)
RichTextBox1.SelectionBackColor = ColorTranslator.FromOle(RGB(255, 255, 192))
index = RichTextBox1.Text.IndexOf(TextBox1.Text, index) + 1
End While
' Input string.
Dim value As String = RichTextBox1.Text
' Call Regex.Matches method.
Dim matches As MatchCollection = Regex.Matches(value, "\b\w+\'?\w+ria\w*\b")
' Loop over matches.
For Each m As Match In matches
' Loop over captures.
For Each c As Capture In m.Captures
' Display.
RichTextBox2.Text += String.Format("Index={0}, Value={1}" & Chr(13), c.Index, c.Value)
Next
Next

If you want the whole word to be matched, you could make the character class optional with [-']? and put ria at the end, right before the word boundary:
\b\w*[-']?\w*ria\b
See a .NET regex demo
As per the comment from @ctwheels, using an optional non-capturing group is more efficient:
\b\w*(?:[-']\w*)?ria\b
\b Word boundary
\w* Match 0+ word chars
(?: Non capturing group
[-']\w* Match either - or ' and 0+ word chars
)? Close group and make it optional
ria Match literally
\b Word boundary
See another .NET regex demo
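On the follow-up about not hard-coding ria: one option is to escape whatever the user typed and splice it into the pattern above. The sketch below is in Python purely for illustration (the question uses VB.NET, where Regex.Escape on TextBox1.Text would play the role of re.escape). The function name words_ending_with and the sample string are assumptions made for this example.

```python
import re

def words_ending_with(text, term):
    """Return whole words in text that end with term, keeping - and ' inside words."""
    # re.escape turns any regex metacharacters in the user's input into literals.
    pattern = r"\b\w*(?:[-']\w*)?" + re.escape(term) + r"\b"
    return re.findall(pattern, text)

# The word list from the question, space separated.
sample = ("mi'iria mi'i piraria makuptiaria netap hap kuap uimikuaptiaria "
          "uhyt set uipu'aptiaria mu'ap atat hat haria yat")

print(words_ending_with(sample, "ria"))
print(words_ending_with(sample, "ap"))
```

Because the pattern ends in \b, this only behaves as intended when the search term ends in a word character (a letter, digit or underscore).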

Assuming the list is a file like this:
mi'iria
mi'i
piraria
makuptiaria
netap
hap
kuap
uimikuaptiaria
uhyt
set
uipu'aptiaria
mu'ap
atat
hat
haria
yat
Try this one
\b[a'-z]*.ria\b

Related

Regex To Match String With All Words Contains Certain Format

I want to validate a field of string so that it only accept string that contains words with certain format.
Example accepted string:
#key;
#key1; #key2;#key3;
Example rejected string:
key;
%key1X #key2X$key3X
My regex:
\B(\#[a-zA-Z0-9_; ]+\b)(\;)
It seems my regex still accepts a string as long as it contains one word with a valid format, while I only want it to be accepted if all of its words are in the correct format.
Current example:
%key1; %key2 #keysz;#key3; #key4;
The current example above is still accepted because it contains #keysz; and #key3;, while I want it to be rejected because it also contains %key1;, %key2 and #key4;.
I've done some searching and the closest I could find is this question, but it returns a similar result to my current regex.
What did I do wrong in my regex? What is the right regex?
Sorry if this is a dumb question, but I'm a newbie at regex.
The main thing needed are start ^ and end $ anchors. The rest can be simplified too:
^( *#\w+;)+$
See live demo.
Breaking it down:
^ = start
" *" = 0-n spaces (a space followed by *)
# = a literal hash (these don't need escaping in regex)
\w+ = one or more word characters (letters, digits and the underscore)
; = a literal semicolon
$ = end
If underscore can be in the input and must not be, then use:
^( *#[A-Za-z0-9]+;)+$
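As a quick check of the anchored pattern against the strings from the question, here is a minimal sketch in Python (the question does not say which language is used, so this is only an illustration). The names VALID and is_valid are made up for this example; re.fullmatch is used instead of writing ^ and $ explicitly, which has the same effect of requiring the whole string to fit.

```python
import re

# Each token: optional spaces, '#', one or more word characters, then ';'.
VALID = re.compile(r"( *#\w+;)+")

def is_valid(field):
    # fullmatch only succeeds if the entire string is consumed by the pattern.
    return VALID.fullmatch(field) is not None

for s in ["#key;", "#key1; #key2;#key3;", "key;",
          "%key1X #key2X$key3X", "%key1; %key2 #keysz;#key3; #key4;"]:
    print(repr(s), is_valid(s))
```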
Your regex accepts the whole string because the pattern (\B(\#[a-zA-Z0-9_; ]+\b)(\;)) does not specify where the match must start and end, so the regex engine tries to match at every position of the string you run it against.
You specify where the match must start and end by adding anchors (^ for the beginning, $ for the end) to the pattern.
You can edit your pattern to look like this: /(?:\s|^)(#[a-zA-Z0-9_; ]+?);(?:\s|$)/gm
Explanation:
/(?:\s|^)
- (?: opens a non-capturing group, which means that whatever is matched between these () is not included in the result as a group. \s|^ means the match must start at whitespace or at the beginning of the string.
(#[a-zA-Z0-9_; ]+);
- () is a regular capture group, which means that things captured in this group are included in the result.
You don't need to insert a '\' before every symbol
(?:\s|$)/
- another non capture group, specifying to match a white space or end position of a string.
gm
- the global and multiline flags of JavaScript regex
Here is an example:
let regex_pattern = /(?:\s|^)(#[a-zA-Z0-9_; ]+);(?=\s|$)/gm
let input1 = " #key;" // string with just one word
let input2 = "#key1; #key2;#key3;" // string with one whole word and another word which will match your pattern
let input3 = "soemthing random #key;andjointstring" // a string with a word that will match the pattern but its not a whole word
console.log(input1.match(regex_pattern)) // it matches
console.log(input2.match(regex_pattern)) // it matches
console.log(input3.match(regex_pattern)) // it doesn't match

Looking for help to construct a Regex for pattern matching

I'm looking for help in making a regex to match and not match a series of name patterns if anyone can help with that.
Here's a list of cases I want to match/ not match :
// Should Match :
_class
c-class
_class-like
_class--variation
_class__children
_class__children--variation
c-custon-button-test
_class__lol--test
c-my-button-super-style
_class--variation-like
// Should not Match :
class
c--class
_class---variation
_class----variation
_class__test__test
_class--variation__children
_like
c-like
noMargin
no-Margin
_no-Margin
no-margin
_class-like__children
_class-like--variation
For now I came up with this regex :
^(c-|_)([a-z]+)(__|--|-)?([a-z]+)(-{0,2}[a-z]+)+(-?(([a-z]-?)+|(like))$)
Which almost works, but I still get a match on some cases which shouldn't match, and I'm struggling to find how to sort out the last cases.
(Here's a link to regex101 with unit test and match case: https://regex101.com/r/HNAUpd/1/)
edit : I forgot to mention that the word "like" is a keyword in my pattern: it can only be found at the end of the string and cannot be the sole word in the string.
edit 2 : As for the rules of matching they're as follow :
A string can start only with "_", "c-" or "js-".
The following word can be anything except the word "like" and must contain only lowercase letters in the range [a-z].
The word "like" can only be the last one of the string and must not be the only one in the string.
Words can be separated by "--" or "__".
If the string starts with "c-" the word can then be separated with "-" in addition to the previous separator.
The purpose of all this is for a CSS class/id matcher for a linter.
If anyone can help me with this it would be awesome :)
I think you're looking for something like this:
^(?!.*[\-_]like[\-_])(?:c-|js-|_)(?!like$)(?:[a-z]+(?:__|--?))?[a-z]+(?:--?[a-z]+)*$
Demo
Breakdown:
^ - Beginning of the string.
(?!.*[\-_]like[\-_]) - Doesn't contain the word "like" between two separators (only at the end of the string).
(?:c-|js-|_) - Either "c-", "js-", or "_" at the beginning of the string.
(?!like$) - Not immediately followed by the word "like".
(?:[a-z]+(?:__|--?))? - (optional) one or more a-z letters followed by two underscores or one or two hyphens.
[a-z]+ - One or more a-z letters.
(?:--?[a-z]+)* - Match one or two hyphens followed by one or more a-z letters, and repeat zero or more times.
$ - End of string.
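To see how the pattern above behaves on the lists from the question, here is a small check written in Python (chosen only for illustration, since the linter's language is not stated). The variable names are made up for this sketch; the regex itself is copied unchanged from the answer.

```python
import re

CLASS_NAME = re.compile(
    r"^(?!.*[\-_]like[\-_])(?:c-|js-|_)(?!like$)"
    r"(?:[a-z]+(?:__|--?))?[a-z]+(?:--?[a-z]+)*$"
)

should_match = [
    "_class", "c-class", "_class-like", "_class--variation", "_class__children",
    "_class__children--variation", "c-custon-button-test", "_class__lol--test",
    "c-my-button-super-style", "_class--variation-like",
]
should_not_match = [
    "class", "c--class", "_class---variation", "_class----variation",
    "_class__test__test", "_class--variation__children", "_like", "c-like",
    "noMargin", "no-Margin", "_no-Margin", "no-margin",
    "_class-like__children", "_class-like--variation",
]

# Collect any case where the regex disagrees with the expected outcome.
wrongly_rejected = [s for s in should_match if not CLASS_NAME.search(s)]
wrongly_accepted = [s for s in should_not_match if CLASS_NAME.search(s)]
print("wrongly rejected:", wrongly_rejected)
print("wrongly accepted:", wrongly_accepted)
```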

Regular expression to match n consecutive capitalized words

I am trying to capture n consecutive capitalized words. My current code is
n=5
a='This is a Five Gram With Five Caps and it also contains a Two Gram'
re.findall(' ([A-Z]+[a-z|A-Z]* ){n}',a)
Which returns the following:
['Caps ']
It's identifying the fifth consecutive capitalized word, but I would like it to return the entire string of capitalized words. In other words:
[' Five Gram With Five Caps ']
Note that | doesn't act as an OR inside a character class. It'll match | literally. The other issue here is that findall's behaviour is to return the match unless a group exists (although python's documentation doesn't really make this clear):
The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups
So this is why you're getting the result of the first capture group, which is the last uppercase-starting word of Caps.
The simple solution is to change your capturing group to a non-capturing group. I've also changed the space at the start to \b so as to not match an additional whitespace (which I presume you were planning on trimming anyway).
See code in use here
import re
r = re.compile(r"\b(?:[A-Z][a-zA-Z]* ){5}")
s = "This is a Five Gram With Five Caps and it also contains a Two Gram"
print(r.findall(s))
See regex in use here
\b(?:[A-Z][a-zA-Z]* ){5}
\b Assert position as a word boundary
(?:[A-Z][a-zA-Z]* ){5} Match the following exactly 5 times
[A-Z] Match an uppercase ASCII letter once
[a-zA-Z]* Match any ASCII letter any number of times
Match a space
Result: ['Five Gram With Five Caps ']
Additionally, you may use the regex \b[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){4}\b instead. This will allow matches at the start/end of the string as well as anywhere in the middle without grabbing extra whitespace. Another alternative may include (?:^|(?<= ))[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*){4}(?= |$)
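The question wanted n to be a variable, whereas the patterns above hard-code 5 (and a literal {n} inside a plain string is not substituted). A minimal sketch of one way to build the whitespace-free variant for any n follows; the helper name capitalized_runs is an assumption made for this example.

```python
import re

def capitalized_runs(text, n):
    """Find n consecutive space-separated words that each start with A-Z."""
    word = r"[A-Z][a-zA-Z]*"
    # One word followed by n-1 repetitions of " word"; the doubled braces
    # produce a literal {k} quantifier inside the f-string.
    pattern = rf"\b{word}(?: {word}){{{n - 1}}}\b"
    return re.findall(pattern, text)

a = "This is a Five Gram With Five Caps and it also contains a Two Gram"
print(capitalized_runs(a, 5))  # ['Five Gram With Five Caps']
print(capitalized_runs(a, 2))  # ['Five Gram', 'With Five', 'Two Gram']
```

Note that findall returns non-overlapping matches, so with n=2 the five-word run is cut into two pairs and the trailing Caps is left over.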
Wrap the whole pattern in a capturing group:
(([A-Z]+[a-z|A-Z]* ){5})
Demo

Match 3 and 4 delimiters and between them; not less not more

I have a command-line program whose first argument ( = argv[ 1 ] ) is a regex pattern.
./program 's/one-or-more/anything/gi/digit-digit'
So I need a regex to check whether the input entered by the user is correct. This would be easy to solve, but I use the C++ standard library and std::regex_match, and this function by default behaves as if begin and end assertions (^ and $) were put around the given pattern, so the non-greedy quantifier is effectively ignored.
Let me clarify. If I want to match /anything/ then I can use /.*?/, but std::regex_match treats this pattern as ^/.*?/$, and therefore if the user enters /anything/anything/anything/ then std::regex_match still returns true even though the input is not correct. std::regex_match only returns true or false, and the input expected from the user can only be text that follows the pattern. Since the input can vary, I cannot list all possibilities here, but here are some examples.
Should match
/.//
s/.//
/.//g
/.//i
/././gi
/one-or-more/anything/
/one-or-more/anything/g/3
/one-or-more/anything/i
/one-or-more/anything/gi/99
s/one-or-more/anything/g/4
s/one-or-more/anything/i
s/one-or-more/anything/gi/54
and anything look like this pattern
Rules:
delimiters are /|##
s letter at the beginning and g, i and 2 digits at the end are optional
std::regex_match function returns true if the entire target character sequence can be matched, otherwise it returns false
between first and second delimiter can be one-or-more +
between second and third delimiter can be zero-or-more *
between third and fourth can be g or i
At least 3 delimiters must be matched, as in /.//, not fewer, so /./ should not match
ECMAScript 262 is allowed for the pattern
NOTE
You may need to see my question about std::regex_match:
std::regex_match and lazy quantifier with strange behavior
I don't need any C++ code, I just need a pattern.
Do not try d?([/|##]).+?\1.*?\1[gi]?[gi]?\1?d?\d?\d?. It fails.
My attempt so far: ^(?!s?([/|##]).+?\1.*?\1.*?\1)s?([/|##]).+?\2.*?\2[gi]?[gi]?\d?\d?$
If you are willing to try, you should put ^ and $ around your pattern
If you need more details, please leave a comment and I will update the question.
Thanks.
You could use this regular expression:
^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$
See regex101.com
Note how this also rejects these cases:
///anything/
/./anything/gg
/./anything/ii
/./anything/i/12
How it works:
Some explanation of the parts that are different:
((?!\1).): this will match any character that is not the delimiter. This way you can keep track of the exact number of delimiters used. It also prevents the first character after the first delimiter from being that delimiter again, which should not be allowed.
(gi?|ig): matches any of the valid modifier combinations, except a sole i, which is treated separately. So this also excludes gg and ii as valid character sequences.
(\1\d\d?)?: optionally allows for an extra delimiter (after a g modifier -- see previous) to be added with one or two digits following it.
(...|i)?: for the case where there is no g modifier present, but just the i or none: then no digits are allowed to follow.
This is a tricky one, but I took the challenge - here is what I have ended up with:
^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$
Pattern breakdown:
^ matches start of string
s? matches an optional 's' character
([\/|##]) matches the delimiter characters and captures as group 1
(?:(?!\1).)+ matches anything other than the delimiter character one or more times (uses negative lookahead to make sure that the character isn't the delimiter matched in group 1)
\1 matches the delimiter character captured in group 1
(?:(?!\1).)* matches anything other than the delimiter character zero or more times
\1 matches the delimiter character captured in group 1
(?: starts a new group
i matches the i character
| or
(?:gi?|ig) matches either g, gi, or ig
(\1\d{1,2})? followed by an optional extra delimiter and 0-9 once or twice
)? closes group and makes it optional
$ matches end of string
I have used non-capturing groups wherever a capture is not needed - these are groups that start with ?:
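The pattern can be tried against the examples from the question with a short script. This sketch uses Python's re.fullmatch as a stand-in for std::regex_match, since both require the whole input to match; Python's regex dialect is not the ECMAScript grammar used by std::regex, but the constructs involved here (backreference, negative lookahead, bounded repetition) are, as far as I know, common to both. The names DELIM_PATTERN and is_valid are made up for this example.

```python
import re

DELIM_PATTERN = re.compile(
    r"^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$"
)

def is_valid(arg):
    # fullmatch mirrors regex_match: the entire argument must fit the pattern.
    return DELIM_PATTERN.fullmatch(arg) is not None

accepted = ["/.//", "s/.//", "/.//g", "/.//i", "/././gi",
            "/one-or-more/anything/", "/one-or-more/anything/g/3",
            "/one-or-more/anything/gi/99", "s/one-or-more/anything/gi/54"]
rejected = ["/./", "/anything/anything/anything/", "///anything/",
            "/./anything/gg", "/./anything/ii", "/./anything/i/12"]

for arg in accepted + rejected:
    print(arg, is_valid(arg))
```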

Only match unique string occurrences

We are doing some Data Loss Prevention for emails, but the issue is when people reply to emails multiple times sometimes the credit card number or account number will appear multiple times.
How can we get Java Regex to only match strings once each.
So for example, we are using the following regex to catch account numbers that consist of 2 letters followed by 5 or 6 digits. It also omits numbers starting with CR (upper or lower case).
\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\b
How can we have it find:
CX12345
CX14584
JB145888
JD748452
CX12345 (Ignore as its already found it above)
LM45855
Unique string occurrence can be matched with
<STRING_PATTERN>(?!.*<STRING_PATTERN>) // Find the last occurrence
(?<!<STRING_PATTERN>.*)<STRING_PATTERN> // Find the first occurrence, only works in regex
// that supports infinite-width lookbehind patterns
where <STRING_PATTERN> is the pattern the unique occurrence of which one searches for. Note that both will work with the .NET regex library, but the second one is not usually supported by the majority of other libraries (only the PyPI Python regex library and JavaScript ECMAScript 2018 regex support it). Note that . does not match line break chars by default, so you need to pass a modifier like DOTALL: in most libraries you may add the (?s) modifier inside the pattern (only in Ruby does (?m) do the same), or use specific flags that you pass to the regex compile method. See more about this in How do I match any character across multiple lines in a regular expression?
You seem to need a regex like this:
/\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)/
The regex demo is available here
Details:
\b - a leading word boundary
((?!CR|cr)[A-Za-z]{2}\d{5,6}) - Group 1 capturing
(?!CR|cr) - the next two characters cannot be CR or cr, the negative lookahead check
[A-Za-z]{2} - 2 ASCII letters
\d{5,6} - 5 to 6 digits
\b - trailing word boundary
(?![\s\S]*\b\1\b) - a negative lookahead that fails the match if there are any 0+ chars ([\s\S]*) followed with a word boundary (\b), same value captured into Group 1 (with the \1 backreference), and a trailing word boundary.
I would use a Map of some sort here, to keep tally of the strings which you encounter. For example:
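import java.util.HashMap;
import java.util.Map;
// The keys of the map act as the set of unique numbers seen so far.
String ccNumber = "CX12345";
Map<String, Boolean> ccMap = new HashMap<>();
// matches() tests the whole string, so the ^ and $ anchors are redundant but harmless.
if (ccNumber.matches("^(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}$")) {
ccMap.put(ccNumber, Boolean.TRUE); // putting an existing key again keeps a single entry
}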
Then just iterate over the keyset of the map to get unique credit card numbers which matched the pattern in your regex:
for (String key : ccMap.keySet()) {
System.out.println("Found a matching credit card: " + key);
}
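The two approaches above differ in which occurrence they keep. The sketch below contrasts them in Python (used here only for illustration; the question is about Java, whose regex engine also supports a backreference inside a lookahead). The sample text and variable names are assumptions made for this example.

```python
import re

ACCOUNT = r"(?!CR|cr)[A-Za-z]{2}\d{5,6}"
text = "CX12345 CX14584\nJB145888 JD748452\nCX12345 LM45855 CR12345"

# Approach 1: the negative lookahead rejects a match if the same value
# appears again later, so only the LAST occurrence of each value survives.
last_only = re.findall(rf"\b({ACCOUNT})\b(?![\s\S]*\b\1\b)", text)

# Approach 2: match everything, then de-duplicate in code. dict.fromkeys
# keeps keys in first-seen order, similar in spirit to the Map idea above.
first_seen = list(dict.fromkeys(re.findall(rf"\b{ACCOUNT}\b", text)))

print(last_only)   # ['CX14584', 'JB145888', 'JD748452', 'CX12345', 'LM45855']
print(first_seen)  # ['CX12345', 'CX14584', 'JB145888', 'JD748452', 'LM45855']
```

Both lists contain each account number once; the lookahead version rescans the rest of the text for every candidate, so de-duplicating in code is likely to scale better on long email threads.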