Regex To Match String With All Words Contains Certain Format


I want to validate a string field so that it only accepts strings in which every word has a certain format.
Example accepted string:
#key;
#key1; #key2;#key3;
Example rejected string:
key;
%key1X #key2X$key3X
My regex:
\B(\#[a-zA-Z0-9_; ]+\b)(\;)
It seems my regex still accepts a string as long as it contains one word in a valid format, while I want it to be accepted only if every word is in the correct format.
Current example:
%key1; %key2 #keysz;#key3; #key4;
The current example above is still accepted because it contains #keysz; and #key3;, while I want it to be rejected because it also contains %key1; %key2 and #key4;.
I've done some searching and the closest I could find is this question, but it gives a similar result to my current regex.
What did I do wrong in my regex? What is the right regex?
Sorry if this is a dumb question, but I'm a newbie in regex.

The main thing needed are start ^ and end $ anchors. The rest can be simplified too:
^( *#\w+;)+$
See live demo.
Breaking it down:
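^ = start of input
 * = zero or more spaces (a space followed by *)
# = a literal hash (these don't need escaping in regex)
\w+ = one or more word characters (letters, digits and the underscore)
; = a literal semicolon
(...)+ = one or more repetitions of the whole group
$ = end of input
If underscores must not be accepted, use: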
^( *#[A-Za-z0-9]+;)+$
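As a quick check of the anchored pattern, here is a minimal JavaScript sketch that runs it against the sample strings from the question. The helper name isValidKeyList is made up for illustration.

```javascript
// Hypothetical helper: true only if the whole string consists of "#word;" items,
// each optionally preceded by spaces.
function isValidKeyList(input) {
  return /^( *#\w+;)+$/.test(input);
}

console.log(isValidKeyList("#key;"));                              // true
console.log(isValidKeyList("#key1; #key2;#key3;"));                // true
console.log(isValidKeyList("key;"));                               // false
console.log(isValidKeyList("%key1X #key2X$key3X"));                // false
console.log(isValidKeyList("%key1; %key2 #keysz;#key3; #key4;"));  // false
```

Because of the ^ and $ anchors, a single badly formatted word anywhere in the input is enough to reject the whole string.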

Your regex accepts the whole sentence because the pattern (\B(\#[a-zA-Z0-9_; ]+\b)(\;)) doesn't specify where matching must start and end, so the regex engine tries to match at every position of the string you run match on.
You specify where the regex should match by adding anchors (^ for the beginning and $ for the end) to the pattern.
You can edit your pattern to look like this: /(?:\s|^)(#[a-zA-Z0-9_; ]+);(?=\s|$)/gm
Explanation:
/(?:\s|^)
- (?: starts a non-capturing group, meaning whatever is matched inside these () is not stored as a separate group in the result. \s|^ means the match must begin at a whitespace character or at the beginning of the string.
(#[a-zA-Z0-9_; ]+);
- () is a regular capture group, which means that whatever it captures is included in the result.
You don't need to insert a '\' before every symbol.
(?=\s|$)/
- a lookahead that requires a whitespace character or the end of the string to follow, without consuming it.
gm
- the global and multiline flags of JavaScript regex
Here is an example:
let regex_pattern = /(?:\s|^)(#[a-zA-Z0-9_; ]+);(?=\s|$)/gm
let input1 = " #key;" // string with just one word
let input2 = "#key1; #key2;#key3;" // string with several words; the first whole word matches the pattern
let input3 = "soemthing random #key;andjointstring" // the pattern appears in the string, but not as a whole word
console.log(input1.match(regex_pattern)) // it matches
console.log(input2.match(regex_pattern)) // it matches
console.log(input3.match(regex_pattern)) // it doesn't match (null)

Related

Regexp regular/recursive find/replace in Notepad++

How to split some strings defined in a specific format:
[length namevalue field]name=value[length namevalue field]name=value[length namevalue field]name=value[length namevalue field]name=value
Is it possible, with a Find/Replace regex in Notepad++, to isolate the name=value pairs by replacing each [length namevalue field] with a white space?
The main problem is related to numeric values, where a simple \d{4} search doesn't work.
Eg.
INPUT:
0010name=mario0013surname=rossi0006age=180006phone=0014address=street
0013name=marianna0013surname=rossi0006age=210006phone=0015address=street1
0003name=pia0015surname=rossini0005age=30017phone=+39221122330020address=streetstreet
OUTPUT:
name=mario surname=rossi age=18 phone= address=street
name=marianna surname=rossi age=21 phone= address=street1
name=pia surname=rossini age=3 phone=+3922112233 address=streetstreet

You can use
\d{4}(?=[[:alpha:]]\w*=)
\d{4}(?=[^\W\d]\w*=)
See the regex demo.
The patterns match
\d{4} - four digits
(?=[[:alpha:]]\w*=) - that are immediately followed with a letter and then any zero or more word chars followed with a = char immediately to the right of the current position.
(?=[^\W\d]\w*=) - that are immediately followed with a letter or an underscore and then any zero or more word chars followed with a = char immediately to the right of the current position.
In Notepad++, if you want to remove the match at the start of the line and replace with space anywhere else, you can use
^(\d{4}(?=[[:alpha:]]\w*=))|(?1)
and replace with (?1: ). The above explained pattern, \d{4}(?=[[:alpha:]]\w*=), is matched and captured into Group 1 if it is at the start of a line (^), and just matched anywhere else ((?1) recurses the Group 1 pattern, so as not to repeat it). The (?1: ) replacement means we replace with empty string if Group 1 matched, else, we replace with a space.
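The same lookahead idea can be tried outside Notepad++. The following JavaScript sketch is only an approximation: JavaScript has neither [[:alpha:]] nor (?1) recursion, so it uses the \d{4}(?=[^\W\d]\w*=) variant together with a replacement callback. The function name stripLengthFields is an assumption for illustration.

```javascript
// Hypothetical helper: drops each 4-digit length field that precedes "name=",
// replacing it with nothing at the start of a line and a single space elsewhere.
function stripLengthFields(text) {
  const lengthField = /\d{4}(?=[^\W\d]\w*=)/g;
  return text
    .split("\n")
    .map(line => line.replace(lengthField, (match, offset) => (offset === 0 ? "" : " ")))
    .join("\n");
}

const sample = "0010name=mario0013surname=rossi0006age=180006phone=0014address=street";
console.log(stripLengthFields(sample));
// name=mario surname=rossi age=18 phone= address=street
```

The lookahead is what keeps values such as age=18 intact: a four-digit window only matches when a letter (or underscore) and an = follow it.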

Regular Expression to match first word with a character in each line

I am trying to write a regex that finds the first word in each line that contains the character a.
For a string like:
The cat ate the dog
and the mouse
The expression should find cat and
So far, I have:
/\b\w*a\w*\b/g
However this will return every match in each line, not just the first match (cat ate and).
What is the easiest way to only return the first occurrence?

Assuming you are only looking for words without numbers and underscores (\w would include those), I'd advise using:
(?i)^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)
And use whatever is in the 1st capture group. See an online demo. Or, if supported:
(?i)^.*?\K(?<!\S)[b-z]*a[a-z]*(?!\S)
See an online demo.
Please note that I used lookarounds to assert that the word is surrounded only by whitespace characters (or the start/end of the line). You may also use word boundaries if you please and swap those lookarounds for \b. Also, depending on your application, you can probably replace the inline case-insensitive switch with a flag. For example, if you happen to use JavaScript, /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi should probably be your option. See for example:
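var myString = "The cat ate the dog\nand the mouse";
// A regex literal is used here: inside a plain string, "\S" would lose its backslash
var myRegexp = /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi;
var m = myRegexp.exec(myString);
while (m != null) {
  console.log(m[1]); // the first word containing "a" on each line
  m = myRegexp.exec(myString);
}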

If you want to match a word using \w you might also use a negated character class matching any character except a or a newline.
Then match a word that consists of at least an a char with word boundaries \b
^[^a\n\r]*\b([^\Wa]*a\w*)
The pattern matches:
^ Start of string
[^a\n\r]*\b Optionally match any character except a or a newline
( Capture group 1
[^\Wa]*a\w* Optionally match a word character without a, then match a and optional word characters
) Close group 1
Regex demo
Using whitespace boundaries on the left and right:
^[^a\n\r]*(?<!\S)([^\Wa]*a\w*)(?!\S)
Regex demo
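To see the negated-character-class pattern in action, here is a small JavaScript sketch. It assumes the g and m flags so that ^ matches at the start of every line; the variable names are illustrative only.

```javascript
const text = "The cat ate the dog\nand the mouse";
// g = find the match on every line, m = let ^ match at the start of each line
const firstWordWithA = /^[^a\n\r]*\b([^\Wa]*a\w*)/gm;
const words = [...text.matchAll(firstWordWithA)].map(match => match[1]);
console.log(words); // [ 'cat', 'and' ]
```

Since [^a\n\r]* cannot step over an "a", the first "a" it stops at always belongs to the first matching word of that line.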

The text could be matched with the regular expression
(?=(\b[a-z]*a[a-z]*\b)).*\r?\n
with the multiline and case-insensitive flags set. For each match, capture group 1 contains the first word (comprised only of letters) in a line that contains an "a". There are no matches in lines that do not contain an "a".
Demo
The expression can be broken down as follows.
(?=                    # begin a positive lookahead
  (\b[a-z]*a[a-z]*\b)  # match a whole word containing an "a" and save it to capture group 1
)                      # end the positive lookahead
.*\r?\n                # match the remainder of the line including the line terminator

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g
IN: OK THT PHP This is it 06222021
OUT: This is it
IN: NO MTM PYT Get this content 111111
OUT: Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space character
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY

You don't need 4 groups, you can use a single group 1 to be in the replacement and match 6-8 digits for the last part instead of only 6.
Note that this \w{0,2} will also match an empty string, you can use \w{1,2} if there has to be at least a single word char.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match 3 times word characters with a quantifier and a whitespace in between
(.*?) Capture group 1, match any char, as few as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Regex demo
Example code
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)
Output
This is it

My approach works without groups and does not use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this works also if we have more than 6 digits because regex stops searching after it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 occurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(input, @"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value; // input is the line to parse
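The final look-around pattern can also be tried in JavaScript, whose modern engines support variable-length lookbehind. This is only a sketch under that assumption; the function name extractMiddle is made up for the example.

```javascript
// Lookbehind/lookahead version: the match itself is the wanted text, no groups needed.
const middlePart = /(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)/;

// Hypothetical helper: returns the text between the three leading words and the trailing digits.
function extractMiddle(line) {
  const match = line.match(middlePart);
  return match ? match[0] : null; // null when the line has a different shape
}

console.log(extractMiddle("OK THT PHP This is it 06222021"));     // This is it
console.log(extractMiddle("NO MTM PYT Get this content 111111")); // Get this content
```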

How can I allow or ignore apostrophes?

I am looking for a regex expression that allows (or ignores) an apostrophe? I'm fairly new to regex and I looked at other similar questions but didn't find the help I need.
I am using a textbox to search an RTB and match all words with a specific or common ending (i.e. the search term inserted in the textbox). Then, I need to pass all matches to a second RTB.
I have tried many different expressions including: \b\w*[-']\w*\b but the program either separates the word at the apostrophe, finds only words with an apostrophe, or lists all words as matches?
My sample list of words to search is:
mi'iria, mi'i, piraria, makuptiaria, netap, hap, kuap, uimikuaptiaria, uhyt, set, uipu'aptiaria, mu'ap, atat, hat, haria, yat. (commas are not in the original list)!
As you can see, there are words that end in "ria" which contain an apostrophe and words that do not. I want to match all words that end with "ria," but I get results like: mi as one match, iria as another match and piraria, makuptiaria, uimikuaptiaria and haria aren't matched?
I need an expression that will allow (or ignore) the apostrophe so that all words that end in "ria" are matched regardless of whether they contain an apostrophe. Also, words which contain an apostrophe (e.g. mi'iria) should not be split at the apostrophe. Can anyone help with this? I am very grateful for any help! Thanks!
Okay, I spent some time tinkering on https://regex101.com/r/X4oL0y/1 and came up with the following expression which matches all words that end with "ria" including those with and those without an apostrophe:
\b\w+\'?\w+ria\w*\b
However, the w+ria part of this regex represents literal characters. This limits the functionality to words that end with "ria." Is there a way to generically declare the search term the user enters in the textbox as the character(s) to match so that all whole words that end with the search term are matched?
This is my code so far:
'Set index:
Dim index As Integer = 0
'Find and highlight all search term occurrences:
While index < RichTextBox1.Text.LastIndexOf(TextBox1.Text)
RichTextBox1.Find(TextBox1.Text, index, RichTextBox1.TextLength, RichTextBoxFinds.None)
RichTextBox1.SelectionBackColor = ColorTranslator.FromOle(RGB(255, 255, 192))
index = RichTextBox1.Text.IndexOf(TextBox1.Text, index) + 1
End While
' Input string.
Dim value As String = RichTextBox1.Text
' Call Regex.Matches method.
Dim matches As MatchCollection = Regex.Matches(value, "\b\w+\'?\w+ria\w*\b")
' Loop over matches.
For Each m As Match In matches
' Loop over captures.
For Each c As Capture In m.Captures
' Display.
RichTextBox2.Text += String.Format("Index={0}, Value={1}" & Chr(13), c.Index, c.Value)
Next
Next

If you want the whole word to be matched, you could make the character class optional [-']? and add ria to the end right before the word boundary
\b\w*[-']?\w*ria\b
See a .NET regex demo
As per the comment of @ctwheels, using an optional non-capturing group is more efficient.
\b\w*(?:[-']\w*)?ria\b
\b Word boundary
\w* Match 0+ word chars
(?: Non capturing group
[-']\w* Match either - or ' and 0+ word chars
)? Close group and make it optional
ria Match literally
\b Word boundary
See another .NET regex demo
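The question also asks how to use whatever the user types as the ending instead of a hard-coded "ria". One possible approach, sketched here in JavaScript, is to escape the search term and splice it into the pattern above. The names escapeRegExp and wordsEndingWith are assumptions for illustration; in .NET, Regex.Escape plays the role of the escaping helper.

```javascript
// Escape regex metacharacters so the user's text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: all whole words (optionally containing one - or ') that end with `suffix`.
// The trailing \b assumes the suffix ends with a word character.
function wordsEndingWith(text, suffix) {
  const pattern = new RegExp("\\b\\w*(?:[-']\\w*)?" + escapeRegExp(suffix) + "\\b", "g");
  return text.match(pattern) || [];
}

const list = "mi'iria mi'i piraria makuptiaria netap hap kuap uimikuaptiaria " +
             "uhyt set uipu'aptiaria mu'ap atat hat haria yat";
console.log(wordsEndingWith(list, "ria"));
// [ "mi'iria", 'piraria', 'makuptiaria', 'uimikuaptiaria', "uipu'aptiaria", 'haria' ]
```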

Assuming that the list is a file like this:
mi'iria
mi'i
piraria
makuptiaria
netap
hap
kuap
uimikuaptiaria
uhyt
set
uipu'aptiaria
mu'ap
atat
hat
haria
yat
Try this one
\b[a'-z]*.ria\b

Regular expression for duplicate words

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?

Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
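A minimal JavaScript sketch of this pattern, run against the three sentences from the question (the variable names are illustrative):

```javascript
const duplicate = /\b(\w+)\s+\1\b/g;
const samples = [
  "Paris in the the spring.",
  "Not that that is related.",
  "Why are you laughing? Are my my regular expressions THAT bad??",
];
for (const sentence of samples) {
  console.log(sentence.match(duplicate));
}
// [ 'the the' ]
// [ 'that that' ]
// [ 'my my' ]
```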

I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html

The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : A word boundary at the start of the word
(\w+) : One or more word characters, captured in group 1
(\s+\1\b)+ : One or more whitespace characters followed by the same word as group 1, ending at a word boundary. The + on the group lets it match more than one repetition.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.

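Try the regex below:
\b word boundary at the start of a word
(\w+) one or more word characters (the word, captured in group 1)
\W+ one or more non-word characters
\1 the same word that was already matched
\b word boundary at the end of the word
()* the repetition group, repeated zero or more times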
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
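To illustrate what this pattern does, here is a small JavaScript sketch. Note that it removes every word that occurs again later, so the last occurrence is the one that survives. The function name and the whitespace clean-up step are additions for the example.

```javascript
// Removes every word that occurs again later in the string (case-insensitively),
// so only the last occurrence of each word is kept.
function dropEarlierDuplicates(text) {
  return text
    .replace(/\b(\w+)\b(?=.*?\b\1\b)/gi, "")
    .replace(/\s+/g, " ") // tidy up the gaps left by removed words
    .trim();
}

console.log(dropEarlierDuplicates("one two one three two")); // one three two
```

Because . does not match line breaks by default, duplicates are only detected within the same line.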

The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1

Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+

No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.

This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any run of non-whitespace characters, followed by optional whitespace.
\1{2,} then requires at least 2 further repetitions of that phrase. So if there are 3 or more identical phrases in a row, it matches.

Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
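A short JavaScript sketch of this pattern and replacement follows. The i flag is an addition so that differently cased repeats collapse too, and the function name is made up for the example.

```javascript
// Collapses runs of the same whitespace-separated token into its first occurrence.
function collapseRepeats(text) {
  return text.replace(/(\b\S+)(?:\s+\1\b)+/gi, "$1");
}

console.log(collapseRepeats("Paris in the the the spring")); // Paris in the spring
console.log(collapseRepeats("Goodbye goodbye GooDbYe"));     // Goodbye
```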

The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.

This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result

Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b

I think another solution would be to use named capture groups and backreferences like this:
/.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
/.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR, with matchAll (separate names so both variants can live in one script)
const regexAll = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const results = [...input.matchAll(regexAll)];
console.log(results[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.

You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}

As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
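For what it is worth, \s also matches line breaks, so when the whole text is held in one string (rather than read line by line) the \b(\w+)\s+\1\b pattern does find a duplicate split across two lines. A small JavaScript sketch, assuming the text is in a single string:

```javascript
const text = "London in the\nthe winter";
const match = text.match(/\b(\w+)\s+\1\b/);
console.log(match !== null);                          // true
console.log(JSON.stringify(match[0]));                // "the\nthe"
console.log(text.replace(/\b(\w+)\s+\1\b/g, "$1"));   // London in the winter
```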

To find duplicate words that are not directly preceded or followed by any non-whitespace character, you can use whitespace boundaries on the left and on the right by means of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
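The effect of the whitespace boundaries can be checked with a short JavaScript sketch (engines with lookbehind support are assumed; the variable name is illustrative):

```javascript
const strictDuplicate = /(?<!\S)(\w+)(?:\s+\1)+(?!\S)/;

console.log(strictDuplicate.test("Paris in the the spring.")); // true
console.log(strictDuplicate.test("This is $word word"));       // false, "$word" is not a bare word
console.log(strictDuplicate.test("not that that."));           // false, the "." touches the last word
```

The last case shows the trade-off: a repeated word directly followed by punctuation is not reported with these boundaries.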
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.

Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b
