Negating duplicate words pattern - regex

I am new to regex and have the following pattern that detects duplicate words separated with dashes
\b(\w+)-+\1\b
// matches: hey-hey
// does not match: hey-hei
What I really need is a negated version of this pattern.
I've tried a negative lookahead, but it does not work:
(?!\b(\w+)-+\1\b)

You can use
\b(\w+)-+(?!\1\b)\w+
See the regex demo. Details:
\b - a word boundary
(\w+) - Group 1: one or more word chars
-+ - one or more hyphens
(?!\1\b)\w+ - one or more word chars that are not equal to the first capturing group value.
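As a quick check, the pattern can be run against the sample strings from the question. This is only a minimal sketch using Python's re module; the helper name is_non_duplicate_pair and the extra sample hey-heyy are assumptions added for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: a word, one or more hyphens, then a word
# that is not identical to the first one.
NON_DUPLICATE = re.compile(r"\b(\w+)-+(?!\1\b)\w+")


def is_non_duplicate_pair(text: str) -> bool:
    """Hypothetical helper: True if text contains a hyphenated pair of different words."""
    return NON_DUPLICATE.search(text) is not None


print(is_non_duplicate_pair("hey-hey"))   # duplicate words
print(is_non_duplicate_pair("hey-hei"))   # different words
print(is_non_duplicate_pair("hey-heyy"))  # second word merely starts with the first
```

The \b inside the lookahead is what lets hey-heyy through: the lookahead only rejects the position when the Group 1 text is followed by a word boundary.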

Related

Ungreedy with look behind

I have this kind of text:
other text opt1 opt2 opt3 I_want_only_this_text because_of_this
And am using this regex:
(?<=opt1|opt2|opt3).*?(?=because_of_this)
Which returns me:
opt2 opt3 I_want_only_this_text
However, I want to match only "I_want_only_this_text".
What is the best way to achieve this?
I don't know in what order the opt's will appear and they are only examples. Actual words will be different and there will be more of them.
Actual data:
regex
(?<=※|を|備考|町|品は|。).*(?=のお届けとなります|でお届けします|にてお届け致します|にてお届けいたします)
text
こだわり豚には通常の豚よりビタミンB1が2倍以上あります。私たちの育てた愛情たっぷりのこだわり豚をぜひ召し上がってください。商品説明名称えびの産こだわり豚切落し産地宮崎県えびの市内容量500g×8パック合計4kg賞味期限90日保存方法-15℃以下で保存すること提供者株式会社さつま屋産業備考・本お礼品は冷凍でのお届けとなります
what I want to get:
冷凍で
You can use
(?<=※|を|備考|町|品は|。)(?:(?!※|を|備考|町|品は|。).)*?(?=のお届けとなります|でお届けします|にてお届け致します|にてお届けいたします)
See the regex demo. The scheme is the same as in (?<=opt1|opt2|opt3)(?:(?!opt1|opt2|opt3).)*?(?=because_of_this) (see demo).
The tempered greedy token solution allows you to match multiple occurrences of the same pattern in a longer string.
Details
(?<=※|を|備考|町|品は|。) - a positive lookbehind that matches a location that is immediately preceded with one of the alternatives listed in the lookbehind
(?:(?!※|を|備考|町|品は|。).)*? - any char other than a line break char, zero or more but as few as possible occurrences, that is not a starting point of any of the alternative patterns in the negative lookahead
(?=のお届けとなります|でお届けします|にてお届け致します|にてお届けいたします) - a positive lookahead that requires one of the alternative patterns to appear immediately to the right of the current location.
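The tempered greedy token can be tried out on the simplified opt1/opt2/opt3 sample. The sketch below uses Python's re module, which is an assumption about the target environment: re only accepts fixed-width lookbehinds, so the opt version works as is (every alternative is four characters long), while the Japanese pattern with alternatives of different lengths would need an engine that supports variable-width lookbehinds.

```python
import re

text = "other text opt1 opt2 opt3 I_want_only_this_text because_of_this"

# (?:(?!opt1|opt2|opt3).)*? consumes one char at a time, but only while
# that char does not start another "opt" marker.
pattern = re.compile(
    r"(?<=opt1|opt2|opt3)(?:(?!opt1|opt2|opt3).)*?(?=because_of_this)"
)

match = pattern.search(text)
# The raw match keeps the surrounding spaces, so strip them for display.
result = match.group(0).strip() if match else None
print(result)
```

Attempts that start after opt1 or opt2 fail because the token refuses to step onto the next opt marker, so only the stretch after the last marker can reach because_of_this.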
You could add a negative lookahead (?!\s*opt\d) to assert that there is no opt and a digit to the right. You can use a character class to list the digits 1, 2 and 3 instead of using the alternation with |.
(?<=\bopt[123]\s(?!\s*opt\d)).*?(?=\s*\bbecause_of_this\b)
Regex demo
It might be a bit more efficient to use a match with a capture group:
\bopt[123]\s(?!\s*opt\d)(.*?)\s*\bbecause_of_this\b
Regex demo
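A short sketch of the capture-group variant, again assuming Python's re module; the variable names are illustrative only.

```python
import re

text = "other text opt1 opt2 opt3 I_want_only_this_text because_of_this"

# The wanted text is captured in Group 1 instead of being isolated
# with lookarounds.
capture = re.compile(r"\bopt[123]\s(?!\s*opt\d)(.*?)\s*\bbecause_of_this\b")

m = capture.search(text)
wanted = m.group(1) if m else None
print(wanted)
```

Because the surrounding whitespace is matched outside the group, Group 1 needs no trimming.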
What about:
.*\bopt[123]\b\s*(.*?)\s*because_of_this\b
See the online demo.
.* - A greedy match of any character other than a newline, up to the last occurrence of:
\bopt[123]\b - A word boundary followed by literally "opt" with a trailing number 1, 2 or 3 and another word boundary.
\s* - 0+ whitespace characters.
(.*?) - A 1st capture group with a lazy match of 0+ characters up to:
\s* - 0+ whitespace characters.
because_of_this\b - Literally "because_of_this" followed by a word-boundary.
If you need to have this written out in alternations:
.*\b(?:opt1|opt2|opt3)\b\s*(.*?)\s*because_of_this\b
See that demo.
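The greedy-prefix idea can be sketched as follows in Python (an assumed environment). The helper name text_after_last_opt and the second sample string are made up for illustration; they show that the order of the opt words does not matter.

```python
import re

# The leading .* is greedy, so the engine backtracks from the end of the
# line and settles on the LAST opt word before because_of_this.
LAST_OPT = re.compile(r".*\b(?:opt1|opt2|opt3)\b\s*(.*?)\s*because_of_this\b")


def text_after_last_opt(text: str):
    """Hypothetical helper: return Group 1, or None when nothing matches."""
    m = LAST_OPT.search(text)
    return m.group(1) if m else None


print(text_after_last_opt(
    "other text opt1 opt2 opt3 I_want_only_this_text because_of_this"
))
print(text_after_last_opt("x opt3 opt1 keep_me because_of_this"))
```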

Regular expression using positive lookbehind not working in Alteryx

I am trying to match the 2nd word after "Vores ref.:" using a positive lookbehind. It works in online testers like https://regexr.com/, but my tool, Alteryx, doesn't allow quantifiers like + in a lookbehind.
"ABC This is an example Vores ref.: 23244-2234 LW782837673 Test 2324324"
(?<=Vores\sref.:\s\d+-\d+\s+)\w+ correctly matches LW782837673 on regexr.com, but not in Alteryx.
How can I rewrite the lookbehind expression, still using quantifiers, and get what I want?
You can use a replacement approach here using
.*?\bVores\s+ref\.:\s+\d+-\d+\s+(\w+).*
Replace with $1.
See the regex demo.
Details:
.*? - any 0+ chars other than line break chars, as few as possible
\bVores - whole word Vores
\s+ - one or more whitespaces
ref\.: - ref.: substring
\s+ - one or more whitespaces
\d+-\d+ - one or more digits, - and one or more digits
\s+ - one or more whitespaces
(\w+) - Capturing group 1: one or more word chars.
.* - any 0+ chars other than line break chars, as many as possible.
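The replacement approach can be checked outside Alteryx. Here is a minimal sketch with Python's re.sub, which is an assumption about where you test it; note that Python writes the group reference as \1 where the answer uses $1.

```python
import re

text = "ABC This is an example Vores ref.: 23244-2234 LW782837673 Test 2324324"

# The pattern consumes the whole line; replacing the match with Group 1
# leaves only the wanted word.
pattern = r".*?\bVores\s+ref\.:\s+\d+-\d+\s+(\w+).*"

result = re.sub(pattern, r"\1", text)
print(result)
```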
You can use a capture group instead.
Note that the dot must be escaped as \. to match it literally.
\bVores\sref\.:\s\d+-\d+\s+(\w+)
The pattern matches:
\bVores\sref\.:\s\d+-\d+\s+ - your lookbehind pattern turned into a regular match
(\w+) - Capture group 1, matching 1+ word characters
Regex demo
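For comparison, Python's re module has a similar restriction on lookbehinds (it requires a fixed width), so it can illustrate both the failure and the capture-group fix. This is a sketch under that assumption; it says nothing about how Alteryx reports the error.

```python
import re

text = "ABC This is an example Vores ref.: 23244-2234 LW782837673 Test 2324324"

# The original lookbehind contains + quantifiers, which re rejects
# at compile time.
try:
    re.compile(r"(?<=Vores\sref.:\s\d+-\d+\s+)\w+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# The same prefix used as a plain match, with the word in Group 1.
m = re.search(r"\bVores\sref\.:\s\d+-\d+\s+(\w+)", text)
second_word = m.group(1) if m else None

print(lookbehind_ok)
print(second_word)
```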

Finding words in a string that start with number (Regex)

I need to find words in a string that start with a number (i.e. a digit).
In following string:
1st 2nd 3rd a56b 5th 6th ***7th
The words 1st 2nd 3rd 5th 6th should be returned.
I tried with the regex:
(\b[^ a-zA-Z ^ *]+(th|rd|st|nd))+
This regex returns the words that do not start with letters, but it can't handle the cases where a word starts with special characters.
For the current string, you may use a pattern like
(?<!\S)\d+(?:th|rd|st|nd)\b
See the regex demo
The pattern matches:
(?<!\S) - a location at the start of a string or after a whitespace
\d+ - 1 or more digits
(?:th|rd|st|nd) - one of the four alternatives
\b - a word boundary.
If you plan to match any 0+ non-whitespace chars after a digit that is preceded with a whitespace or is at the start of a string, use
(?<!\S)\d\S*
where \S* will match any 0+ non-whitespace chars.
See this regex demo.
NOTE: In case the lookbehind is not supported, replace (?<!\S) with (?:^|\s) and also wrap the rest of the pattern with a capturing group to access the latter later:
(?:^|\s)(\d\S*)
and the value will be in Group 1.
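The three patterns above can be compared side by side on the sample string. This is a minimal sketch assuming Python's re module; the variable names are illustrative.

```python
import re

text = "1st 2nd 3rd a56b 5th 6th ***7th"

# Ordinals that start a whitespace-delimited word.
ordinals = re.findall(r"(?<!\S)\d+(?:th|rd|st|nd)\b", text)

# Any whitespace-delimited word that starts with a digit.
digit_words = re.findall(r"(?<!\S)\d\S*", text)

# Lookbehind-free variant: with one capturing group in the pattern,
# findall returns the contents of Group 1.
digit_words_no_lookbehind = re.findall(r"(?:^|\s)(\d\S*)", text)

print(ordinals)
print(digit_words)
print(digit_words_no_lookbehind)
```

All three calls skip a56b and ***7th, because the digit in those words is preceded by a non-whitespace character.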
To get words that start with a digit and end with th/st/nd/rd, you can try this:
((?<!\S)(\d+)(th|rd|nd|st))
(?<!\S) detects the word's starting position
\d+ matches 1 or more digits
th|rd|st|nd matches one among those 4.
You can check it here

Matching Word Regex

Hello, I want to match this text with a regex:
(Parc Installé)
from this text:
31/1/2017 17:19:23,4245986,ct0001#Intotel.int,Parc Installé,100.100.30.100
I did this regex ',[A-Za-zA-zÀ-ú+ \/\w+0-9._%+-]+,'
But the result is: 4245986 and Parc Installé.
How can I match only Parc Installé?
You may try a regex based on a lookahead that will require a comma and digits/dots after it up to the end of string:
[^,]+(?=\s*,[\d.]+$)
See this regex demo
Details:
[^,]+ - 1 or more chars other than ,
(?=\s*,[\d.]+$) - a lookahead requiring
\s* - zero or more whitespaces
, - a comma
[\d.]+ - 1+ digits or dots up to...
$ - ... the end of string
To make it a bit more restrictive, you may replace the lookahead with (?=\s*,\d+(?:\.\d+){3}$) to require 4 sequences of dot-separated 1+ digits. See this regex demo.
If a lookahead is not supported (case with a RE2 engine), you might want to use a capturing group based solution:
([^,]+)\s*,[\d.]+$
Here, the part within (...) will be captured into Group 1 and will be accessible via a backreference or a function like =REGEXEXTRACT in Google Spreadsheets that only retrieves the contents of a capturing group if the latter is present in the pattern.
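The lookahead patterns and the capturing-group fallback can be verified against the sample line. The sketch below assumes Python's re module (which does support lookaheads); the variable names are illustrative.

```python
import re

line = "31/1/2017 17:19:23,4245986,ct0001#Intotel.int,Parc Installé,100.100.30.100"

# Lookahead version: the match itself is the wanted field.
loose = re.search(r"[^,]+(?=\s*,[\d.]+$)", line)

# Stricter lookahead: four dot-separated digit sequences must follow.
strict = re.search(r"[^,]+(?=\s*,\d+(?:\.\d+){3}$)", line)

# Lookahead-free version: the wanted field is in Group 1.
grouped = re.search(r"([^,]+)\s*,[\d.]+$", line)

loose_field = loose.group(0) if loose else None
strict_field = strict.group(0) if strict else None
grouped_field = grouped.group(1) if grouped else None

print(loose_field, strict_field, grouped_field, sep=" | ")
```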

Regex for matching groups but excluding a specific combination of groups

I'm trying to match two groups in an expression, each group represents a single letter in initials as part of a name, for example in George R. R. Martin the first group would match the first R and the second group would match the second R, I have something like this:
\b([a-zA-Z])[\.{0,1} {0,1}]{1,2}([a-zA-Z])\b
However, I'd like to exclude a specific combination of those groups, say when the first group matches the letter d and the second group matches the letter r.
Is that possible?
You may restrict matches with a negative lookahead:
\b(?![dD]\.? ?[rR]\b)([a-zA-Z])\.? ?([a-zA-Z])\b
  ^^^^^^^^^^^^^^^^^^^
See the regex demo
Note:
The (?![dD]\.? ?[rR]\b) lookahead is best placed right after the word boundary, so that the check is only triggered at a word boundary, not at every location in the string
The lookahead is negative, it fails the match if the pattern inside it matches the text
It matches: a d or D with [dD], then an optional literal dot with \.?, an optional space with a space followed by ?, an r or R with [rR] and a trailing word boundary \b.
The main pattern is a more generic pattern - \b([a-zA-Z])\.? ?([a-zA-Z])\b:
\b - leading word boundary
(?![dD]\.? ?[rR]\b) - the negative lookahead
([a-zA-Z]) - Group 1 capturing an ASCII letter
\.? - an optional dot
? preceded by a space - an optional space
([a-zA-Z]) - Group 2 capturing an ASCII letter
\b - a trailing word boundary
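To see the exclusion in action, the pattern can be run on a name with initials and on the d/r combination. This is a minimal sketch assuming Python's re module; the helper find_initials and the D. R. and Dr sample strings are hypothetical additions for illustration.

```python
import re

INITIALS = re.compile(r"\b(?![dD]\.? ?[rR]\b)([a-zA-Z])\.? ?([a-zA-Z])\b")


def find_initials(text: str):
    """Hypothetical helper: return (Group 1, Group 2) of the first match, or None."""
    m = INITIALS.search(text)
    return m.groups() if m else None


print(find_initials("George R. R. Martin"))  # initials are found
print(find_initials("D. R."))                # excluded d + r combination
print(find_initials("Dr"))                   # excluded as well
```

The lookahead inspects the same text that the two groups would consume, so the d/r pair is rejected before either group captures anything.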