I have a regex replace function:
reg_replace(Input_Column,'\b(?:(?!https|www|http)\w)+\b', 'x')
With the input www.google.com, the result is www.x.x, whereas it should be www.xxxxxx.xxx.
Please help me write a regex that replaces letter by letter rather than word by word.
Use
\w(?!\w*\b(?<=\bhttps|\bwww|\bhttp))
See proof
Explanation
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\w* word characters (a-z, A-Z, 0-9, _) (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
https 'https'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
www 'www'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
http 'http'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
) end of look-ahead
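As a rough check of the pattern explained above, here is a minimal Python sketch. Two things are assumptions rather than part of the original answer: the question's `reg_replace` function is stood in for by Python's `re.sub`, and the single lookbehind is split into three, because Python's `re` module rejects a lookbehind whose alternatives have different lengths. The helper name `mask_letters` is hypothetical.

```python
import re

# Same logic as the answer's pattern, with the variable-width lookbehind
# (?<=\bhttps|\bwww|\bhttp) split into three fixed-width lookbehinds,
# which is what Python's `re` module requires.
PATTERN = re.compile(r"\w(?!\w*\b(?:(?<=\bhttps)|(?<=\bwww)|(?<=\bhttp)))")


def mask_letters(text: str) -> str:
    """Replace each word character with 'x' unless its word is https, www, or http."""
    return PATTERN.sub("x", text)


print(mask_letters("www.google.com"))  # www.xxxxxx.xxx
```

Because each match is a single `\w`, the replacement is applied once per character, which is what turns `google` into `xxxxxx` instead of a single `x`.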
I have a requirement where the regex must allow only a certain set of characters.
For example, the requirement is that the string can start with
JIRA-<5 digit number> or PROJ-<5 digit number>
This means allowed values look like:
JIRA-12345
PROJ-98765
I tried regex as
(\JIRA-[0-9]+)|(\ PROJ-[0-9]+)
This does not seem to work. Please suggest how to proceed.
Thanks
Use
\b(?:JIRA|PROJ)-\d{5}\b
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
JIRA 'JIRA'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
PROJ 'PROJ'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
\d{5} digits (0-9) (5 times)
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
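A short Python sketch of how this pattern behaves on a few inputs. The sample strings other than the two from the question are made up for illustration, and `TICKET` is a hypothetical name.

```python
import re

# The pattern from the answer: JIRA or PROJ, a hyphen, exactly five digits.
TICKET = re.compile(r"\b(?:JIRA|PROJ)-\d{5}\b")

samples = ["JIRA-12345", "PROJ-98765", "JIRA-1234", "PROJ-123456", "TASK-12345"]

# match() anchors at the start of the string, mirroring "string can start with".
valid = [s for s in samples if TICKET.match(s)]
print(valid)  # ['JIRA-12345', 'PROJ-98765']
```

The trailing `\b` is what rejects `PROJ-123456`: after five digits the next character is another digit, so there is no word boundary.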
I need a regex to find numbers that occur 3 or more times in a text.
some text 577
some 123 text
577 some text
some 577 text
some text 512
I need the regex to match 577.
My last try was: (?:\d+){3,}
Use
\b([0-9]+)\b(?=(?:[\w\W]*?\b\1\b){2})
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[0-9]+ any character of: '0' to '9' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
(?: group, but do not capture (2 times):
--------------------------------------------------------------------------------
[\w\W]*? any character of: word characters (a-
z, A-Z, 0-9, _), non-word characters
(all but a-z, A-Z, 0-9, _) (0 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
){2} end of grouping
--------------------------------------------------------------------------------
) end of look-ahead
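The lookahead above can be tried in Python on the sample text from the question. This is only a sketch; the names `REPEATED` and `found` are hypothetical, and the `set` is an addition to collapse duplicates in case a number occurs more than three times.

```python
import re

# A number is matched only when the same number occurs at least
# two more times somewhere later in the text.
REPEATED = re.compile(r"\b([0-9]+)\b(?=(?:[\w\W]*?\b\1\b){2})")

text = """some text 577
some 123 text
577 some text
some 577 text
some text 512"""

# findall() returns the captured group; a number seen four or more times
# would be reported more than once, hence the set.
found = sorted(set(REPEATED.findall(text)))
print(found)  # ['577']
```

Only the first 577 matches: the second occurrence has one later repeat and the third has none, so the `{2}` in the lookahead fails for them.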
I have a regex for extracting French phone numbers from email text:
(?:(?:\+|00)33|0)\s*[1-9](?:[\s.-]*\d{2}){4}
It works pretty well, but if an email contains no phone number, it grabs part of the ID of a Facebook page, www.facebook.com/leboncoin-1565**0575204105**27, and then people try to ring that number :X
In case it's not clear: I don't want that match. I tried negative lookahead and lookbehind, but without any success.
See problem at regex101.
Note that the phone number can be anywhere, not necessarily at the beginning of a line.
Use
(?:(?:\+|\b00)33|\b0)\s*[1-9](?:[\s.-]*\d{2}){4}\b
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\+ '+'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
00 '00'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
33 '33'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
0 '0'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
[1-9] any character of: '1' to '9'
--------------------------------------------------------------------------------
(?: group, but do not capture (4 times):
--------------------------------------------------------------------------------
[\s.-]* any character of: whitespace (\n, \r,
\t, \f, and " "), '.', '-' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\d{2} digits (0-9) (2 times)
--------------------------------------------------------------------------------
){4} end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
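To see the effect of the added word boundaries, here is a small Python sketch. The Facebook URL is the one from the question (without the bold markers); the two phone numbers are invented examples, and `find_phones` is a hypothetical helper.

```python
import re

# The answer's pattern: \b before 00 and 0, and \b after the last digit pair,
# so a match cannot start or end in the middle of a longer digit run.
PHONE = re.compile(r"(?:(?:\+|\b00)33|\b0)\s*[1-9](?:[\s.-]*\d{2}){4}\b")


def find_phones(text: str) -> list[str]:
    """Return every substring that looks like a French phone number."""
    return PHONE.findall(text)


print(find_phones("www.facebook.com/leboncoin-1565057520410527"))  # []
print(find_phones("Call 01 23 45 67 89 today"))  # ['01 23 45 67 89']
print(find_phones("Tel: +33 1 23 45 67 89"))  # ['+33 1 23 45 67 89']
```

In the page ID every `0` is preceded by another digit, so `\b0` never succeeds there.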
I have a long xml text and I want to match each product that is available. The text is made of products that are structured like this:
<product>
...
<available>instock</available>
...
</product>
I can match all products with this regex
((?s)<product>.*?<\/product>)
Example: https://regex101.com/r/kz8cn1/1
However, I want to match only those products that have an 'instock' value in their tag.
My solution is this:
((?s)<product>(?=.*?\binstock\b).*?<\/product>)
Unfortunately, this works only partially: I believe the lookahead is not confined to the current match, so products with 'outofstock' values are matched as well.
Here is my example:
https://regex101.com/r/AHlC0K/1
How should I change my regex so that the lookaround works only in the context of the match?
Use an XML parser. If none is available, you can use
(?s)<product>(?=(?:(?!<\/?product>).)*?\binstock\b).*?<\/product>
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?s) set flags for this block (with . matching
\n) (case-sensitive) (with ^ and $
matching normally) (matching whitespace
and # normally)
--------------------------------------------------------------------------------
<product> '<product>'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/? '/' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
product> 'product>'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character
--------------------------------------------------------------------------------
)*? end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
instock 'instock'
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
product> 'product>'
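Both suggestions from this answer can be compared in a short Python sketch. The sample XML is made up for illustration, and it is wrapped in a hypothetical `<products>` root element so that it parses as a single document; the parser side uses the standard library's `xml.etree.ElementTree`, which is one possible choice and not something the answer names.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data with one out-of-stock and one in-stock product.
XML = (
    "<products>"
    "<product><name>A</name><available>outofstock</available></product>"
    "<product><name>B</name><available>instock</available></product>"
    "</products>"
)

# Regex from the answer: the tempered dot (?:(?!<\/?product>).)*? stops the
# lookahead from scanning past the current product's closing tag.
IN_STOCK = re.compile(
    r"(?s)<product>(?=(?:(?!<\/?product>).)*?\binstock\b).*?<\/product>"
)
regex_hits = IN_STOCK.findall(XML)

# Parser alternative: select products whose <available> text is 'instock'.
root = ET.fromstring(XML)
parser_hits = [
    p.findtext("name")
    for p in root.iter("product")
    if p.findtext("available") == "instock"
]

print(regex_hits)
print(parser_hits)  # ['B']
```

The parser version compares the element's text directly, so it does not depend on how the tags are laid out or on where the word happens to appear inside a product.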
Basically I want to match this:
So this. So that. [this should match]
Yes this. No that. [this shouldn't match]
I thought this would work:
(\b(\w+)\1\b.*){2,}
But right now, it's matching the second line too: https://regexr.com/5jhag
Why is this and how to fix it?
Match if the line has two or more of the same capitalized word
Since you want to match capitalized words only, \w alone is not right because it matches any of the [a-zA-Z0-9_] characters. Also, using \1 immediately after the capture group matches consecutive repeats only. Finally, \b is required around the matches.
You may use this regex:
\b([A-Z]\w*)\b.*\b\1\b
RegEx Demo
RegEx Details:
\b: Word boundary
([A-Z]\w*): Match a capitalized word that starts with an uppercase letter followed by 0 or more word characters
\b: Word boundary
.*: Match 0 or more of any characters
\b\1\b: Match the same word as captured in group #1, surrounded by word boundaries
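A minimal Python sketch of this regex applied to the two lines from the question. The helper name `repeated_capitalized` is hypothetical.

```python
import re
from typing import Optional

# Capture a capitalized word, then require the same word again later in the line.
REPEATED_CAP = re.compile(r"\b([A-Z]\w*)\b.*\b\1\b")


def repeated_capitalized(line: str) -> Optional[str]:
    """Return the capitalized word that occurs at least twice, or None."""
    m = REPEATED_CAP.search(line)
    return m.group(1) if m else None


print(repeated_capitalized("So this. So that."))   # So
print(repeated_capitalized("Yes this. No that."))  # None
```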
(\b(\w+)\1\b.*){2,} is a repeated capturing group. \1 is a backreference to the very group it is defined in, so it is assigned an empty string at each iteration. Note: if you test with the PCRE engine, there is no match at all (see proof), because there \1 is not empty but null, so the backreference fails.
Your regex matches Yes this. No that. because the expression is then effectively equal to (\b(\w+)\b.*){2,}, which matches any word, then any text, two or more times.
Use
.*\b([A-Z][a-zA-Z]+)\b.*\b\1\b.*
See proof.
Unicode version:
.*\b(\p{Lu}\p{L}+)\b.*\b\1\b.*
See another proof.
Explanation
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[A-Z] any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
[a-zA-Z]+ any character of: 'a' to 'z', 'A' to 'Z'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
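The ASCII version of the pattern can be used to filter whole lines, as in this Python sketch. Only the ASCII pattern is shown because the standard `re` module does not support `\p{Lu}` and `\p{L}`. The variable names are hypothetical.

```python
import re

# The leading and trailing .* make the pattern cover the entire line,
# so fullmatch() can be used to keep or drop each line as a whole.
LINE = re.compile(r".*\b([A-Z][a-zA-Z]+)\b.*\b\1\b.*")

text = "So this. So that.\nYes this. No that."

matching = [ln for ln in text.splitlines() if LINE.fullmatch(ln)]
print(matching)  # ['So this. So that.']
```

Note that `[A-Z][a-zA-Z]+` requires at least two letters, so a single capital letter such as `I` repeated on a line would not count under this version.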