Regular expression model - regex

Hey guys, I'm new to regular expressions, and I have found a regular expression like this:
preg_match("/^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$/",$number)
preg_match("/^
(1[-\s.])? # optional '1-', '1.' or '1 '
( \( )? # optional opening parenthesis
\d{3} # the area code
(?(2) \) ) # if there was opening parenthesis, close it
[-\s.]? # followed by '-' or '.' or space
\d{3} # first 3 digits
[-\s.]? # followed by '-' or '.' or space
\d{4} # last 4 digits
$/x",$number);
I found this explanation on a tutorial website. I just need to know why (?(2)\)) is used here: why is a question mark (the optional symbol) applied at the beginning, and what is the purpose of the (2) in that code?
Sorry if this question is of a low standard, since I'm a newbie. Any help would be appreciated. Thanks :)

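The (?(2)\)) is an if clause (a conditional) that checks whether the 2nd capture group matched. If it did, i.e. there was an opening parenthesis, the closing \) is required; otherwise nothing is matched at that point. The ? right after the opening ( is not the optional quantifier here; (? simply opens a special group construct, the same way it does in (?:...).
You should be able to see a breakdown of your regex at Regex101. It's pretty useful to see what the regex is doing at all points, and it's easy to tweak a regex from there.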
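To try the conditional outside PHP, here is a minimal sketch in Python, whose re module also accepts the (?(2)...) syntax. The helper name is_us_phone and the sample numbers are made up for illustration; they are not from the tutorial.

```python
import re

# Same pattern as in the question. (?(2)\)) demands ")" only when
# group 2 (the optional opening parenthesis) actually matched.
PHONE = re.compile(r"^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$")


def is_us_phone(number: str) -> bool:
    """Return True if the string matches the phone pattern."""
    return PHONE.match(number) is not None


if __name__ == "__main__":
    samples = ["(555) 123-4567", "555-123-4567", "(555-123-4567", "555) 123-4567"]
    for sample in samples:
        print(repr(sample), is_us_phone(sample))
```

The last two samples are rejected: one opens a parenthesis without closing it, the other closes one that was never opened.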

How to negate string pattern using re2 regex?

I'm using Google RE2 regex to query Prometheus on a Grafana dashboard. I'm trying to extract the value of key from the 3 possible types of input string below:
1. object{one="ab-vwxc",two="value1",key="abcd-eest-ed-xyz-bnn",four="obsoleteValues"}
2. object{one="ab-vwxc",two="value1",key="abcd-eest-xyz-bnn",four="obsoleteValues"}
3. object{one="ab-vwxc",two="value1",key="abcd-eest-xyz-bnn-ed",four="obsoleteValues"}
..with validation as listed below
should contain abcd-
shouldn't contain -ed
Somehow this regex
\bkey="(abcd(?:-\w+)*[^-][^e][^d]\w)"
..satisfies the first condition abcd- but couldn't satisfy the second condition (negating -ed).
The expected output would be abcd-eest-xyz-bnn from the 2nd input option. Any help would be really appreciated. Thanks a lot.
If I understand your requirements correctly, the following pattern should work:
\bkey="(abcd(?:-e|-(?:[^e\W]|e[^d\W])\w*)*)"
Breakdown for the important part:
(?: # Start a non-capturing group.
-e # Match '-e' literally.
| # Or the following...
- # Match '-' literally.
(?: # Start a second non-capturing group.
[^e\W] # Match any word character except 'e'.
| # Or...
e[^d\W] # Match 'e' followed by any word character except 'd'.
) # Close non-capturing group.
\w* # Match zero or more additional word characters.
) # Close non-capturing group.
Or in simple terms:
Match a hyphen followed by:
only the letter 'e'. Or..
a word* not starting with 'e'. Or..
a word starting with 'e' not followed by 'd'.
*A "word" here means a string of word characters as defined in regex.
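As a quick check of the pattern above, here is a small sketch that runs it against the three inputs from the question. It uses Python's re module rather than RE2, on the assumption that this particular pattern behaves the same in both since it needs no lookarounds; extract_key is a hypothetical helper name.

```python
import re

# Pattern from the answer, unchanged.
KEY = re.compile(r'\bkey="(abcd(?:-e|-(?:[^e\W]|e[^d\W])\w*)*)"')

INPUTS = [
    'object{one="ab-vwxc",two="value1",key="abcd-eest-ed-xyz-bnn",four="obsoleteValues"}',
    'object{one="ab-vwxc",two="value1",key="abcd-eest-xyz-bnn",four="obsoleteValues"}',
    'object{one="ab-vwxc",two="value1",key="abcd-eest-xyz-bnn-ed",four="obsoleteValues"}',
]


def extract_key(line: str):
    """Return the captured key value, or None when the key contains '-ed'."""
    match = KEY.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    for line in INPUTS:
        print(extract_key(line))
```

Only the second input yields a value (abcd-eest-xyz-bnn); the other two contain an element starting with -ed and are rejected.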
Maybe have a go with:
\bkey="((?:ktm-(?:(?:e-|[^e]\w*-|e[^d]\w*-)*)abcd(?:(?:-e|-[^e]\w*|-e[^d]\w*)*)|abcd(?:(?:-e|-[^e]\w*|-e[^d]\w*)*)))"
This would ensure that:
String starts with either ktm- or abcd.
If starts with ktm-, there should at least be an element called abcd.
If starts with abcd, there doesn't have to be another element.
Both options check that there must not be an element starting with -ed.
The struggle without lookarounds...

Remove the text outside the first brackets in R

I know that it was asked a lot of times, but I've tried to adapt the other answers to my need and I was not able to make it work using SKIP and FAIL (I'm a bit confused, I've to admit)
I'm using R actually.
The url I need to clean is:
url <- "posts.fields(id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0))"
and I need to retain only the content inside the first brackets that are always prefixed by the word "fields" (while "posts" may vary). In other words something like
id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0)
As you can see, there is some nesting inside. Alternatively, I could change my source code to accept this string too (removing every parenthesized part after each prefix):
id,from,message,comments,likes
I don't know how to remove the trailing parenthesis that balances the first one.
If it's good enough to just remove everything up to and including the first open parenthesis and also remove the last close parenthesis and thereafter then:
sub("^.*?\\((.*)\\)[^)]*$", "\\1", url)
Note:
If it's good enough to just remove the first open parenthesis and last close parenthesis then try this:
sub("\\((.*)\\)", "\\1", url)
Using lazy .* instead of greedy:
sub(".*?fields\\((.*)\\)", "\\1", url)
[1] "id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0)"
You need to use a recursive pattern:
sub("[^.]*+(?:\\.(?!fields\\()[^.]*)*+\\.fields\\(([^()]*+(?:\\((?1)\\)[^()]*)*+)\\)(?s:.*)", "\\1", url, perl=TRUE)
details:
# reach the dot before "fields("
[^.]*+ # all except a dot (possessive)
(?: # open a non-capturing group
\\. # a literal dot
(?!fields\\() # not followed by "fields("
[^.]* # all except a dot
)*+ # repeat the group zero or more times
\\.fields\\(
# match content between parentheses with any level of nesting
( # open capture group 1
[^()]*+ # 0 or more characters that are not parentheses (possessive)
(?: # open a non-capturing group
\\(
(?1) # recursion into group 1
\\) #
[^()]* # all that is not a parenthesis
)*+ # close the non-capturing group and repeat 0 or more times (possessive)
) # close capture group 1
\\)
(?s:.*) # the rest of the string (the s modifier lets the dot match newlines too)
Possessive quantifiers are used here to limit the backtracking when for any reason a part of the pattern fails.
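For comparison, the same extraction can be sketched without a recursive regex. Python's built-in re module has no (?1) recursion, so this swaps in a different technique: find the first ".fields(" and walk forward with a depth counter until the matching close parenthesis. The function name extract_fields and the marker default are illustrative assumptions.

```python
from typing import Optional


def extract_fields(url: str, marker: str = ".fields(") -> Optional[str]:
    """Return the text inside the parentheses that follow the first marker."""
    start = url.find(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are already inside the marker's opening parenthesis
    for index in range(begin, len(url)):
        char = url[index]
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:  # this parenthesis balances the first one
                return url[begin:index]
    return None  # unbalanced input


if __name__ == "__main__":
    url = ("posts.fields(id,from.fields(id,name),message,"
           "comments.summary(true).limit(0),likes.summary(true).limit(0))")
    print(extract_fields(url))
```

Unlike the greedy sub() variants, this stops at the balancing parenthesis even if more text follows it.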

Matching percentages

I've been trying to enhance some code which determines whether a string is a valid percentage.
I decided that it was time to finally have a hundred problems, and learned regex.
I've been using this web regex tester to build my pattern.
I'm trying to do this rather loosely, such that valid percentages may be integer or decimal, positive or negative, include commas or not, and have any amount of whitespace at the beginning and end, as well as around the optional negative sign and the required percentage sign.
So far, I have \s*-?\s*\d+(,\d+)*(?:\.\d*)?\s*%\s*, which matches almost all of my test cases correctly:
0
0
0
% 0
- 0 %
20948.924780%
315%
2,456,875 %
2,104.86%
89fqyf0gp948y1-%ghghpq98fy92,.?><
, , , ,,,, 0,0,000,00,00,,,0
, , , ,,,, 0,0,000,00,00,,,0%
000000000,00000000000 %
000000000,00000000000,00000000000 %
000000000,00000000000,00000000000,00000000000.00000000000 %
These are not in any particular order, some pass and some fail, but only one is incorrect. In , , , ,,,, 0,0,000,00,00,,,0%, the last 0%\n is a match, but the whole line should be invalid. Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
It may be something small, but as someone who only learned regex yesterday, it's far beyond my reach.
Thanks!
Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
Those anchors should be working. However, it depends on the regex engine and its options whether they match at the start/end of each line or only at the start/end of the whole input. On RegExr, you'd have to check the multiline option: http://regexr.com?380p9 - in programming, use the m flag.
It could be done like this.
Edit: So after realizing it's a line thing, this is the regex now.
Note(s) -
Uses multiline mode like Bergi's.
Also, you CANNOT just use the \s whitespace class in this.
It doesn't matter which mode is used, \s WILL match CR and LF if it can, which means
-
000,000000.22
%
will match because it satisfies all the conditions.
[^\S\r\n] means match any whitespace except the CR and LF characters. It could be replaced with
[^\S\n] in the real world. The initial input on that tester used \r\n line breaks.
Good Luck!!
# ^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))[^\S\r\n]*%[^\S\r\n]*$
^ # BOL
[^\S\r\n]*
-? # optional -
[^\S\r\n]*
(?: # group
(?: \. \d+ ) # .number
| # or
(?: # group
\d+ # number
(?: , \d+ )* # optional many ,number
(?: \. \d* )? # optional . optional number
) # end group
) # end group
[^\S\r\n]*
% # %
[^\S\r\n]*
$ # EOL
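The difference between \s and [^\S\r\n] can be checked with a short sketch. It uses Python's re module with the MULTILINE flag (an assumption; the thread used a web tester). LOOSE is the question's pattern with anchors added for comparison, and valid_lines is a hypothetical helper.

```python
import re

# Pattern from this answer: horizontal whitespace only, anchored per line.
PERCENT = re.compile(
    r"^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))"
    r"[^\S\r\n]*%[^\S\r\n]*$",
    re.MULTILINE,
)

# The question's pattern with ^ and $ added; \s may swallow line breaks.
LOOSE = re.compile(r"^\s*-?\s*\d+(,\d+)*(?:\.\d*)?\s*%\s*$", re.MULTILINE)


def valid_lines(text: str) -> list:
    """Return every line of text that is a valid percentage."""
    return [match.group(0) for match in PERCENT.finditer(text)]


if __name__ == "__main__":
    print(valid_lines("315%\n0\n2,104.86%\n, , 0,0%\n - 0 %\n"))
    spread_out = "-\n000,000000.22\n%"
    print(LOOSE.search(spread_out) is not None)    # \s spans the line breaks
    print(PERCENT.search(spread_out) is not None)  # stays within one line
```

The three-line input is accepted by the \s version because \s also matches the line breaks, while the [^\S\r\n] version rejects it.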

complex regular expression question on stop set [duplicate]

What regular expression can I use to search for a header that starts with a number, such as "1. Humility"?
Here's the sample data screen shot, http://www.knowledgenotebook.com/issue/sampleData.html
Thanks.
I don't know which regex flavor you're using, so I assume it's Perl compatible.
You should always post some example data in case your perceptions of regex are unclear.
Breaking down what your 'Stop signs' are:
## left out of regex, this could be anything up here
##
(?: # Start of non-capture group START sign
\d+\. # 1 or more digits followed by '.'
| # or
\(\d+\) # '(' followed by 1 or more digits followed by ')'
# note that \( could be start of capture group1 in bizarro world
) # End group
\s? # 0 or 1 whitespace (includes \n)
[^\n<]+ # 1 or more chars that are neither \n nor '<' (the STOP signs)
It seems you want all chars after the group up to but not to include the
very next \n OR the very next '<'. In that case you should get rid of the \s?
because \s includes newline, if it matches a newline here, it will continue to match
until [^\n<]+ is satisfied.
(?:\d+\.|\(\d+\))[^\n<]+
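The effect of dropping \s? can be seen in a small sketch. It uses Python's re module, assuming the pattern behaves the same as in a Perl-compatible engine; the sample text is made up because the original sample data is only available as a screenshot.

```python
import re

# Pattern suggested above, without the \s? part.
HEADER = re.compile(r"(?:\d+\.|\(\d+\))[^\n<]+")

# Original shape with \s?, kept only for comparison.
WITH_SPACE = re.compile(r"(?:\d+\.|\(\d+\))\s?[^\n<]+")

sample = "1. Humility\nSome body text\n(2) Patience<br>more text"
headers = HEADER.findall(sample)

# A number alone on its line: \s? eats the newline and runs into the next line.
split_header = "1.\nNext line"
leaky = WITH_SPACE.search(split_header)
strict = HEADER.search(split_header)

if __name__ == "__main__":
    print(headers)
    print(repr(leaky.group(0)) if leaky else None)
    print(strict)
```

With \s? the match for the second string spills over into "Next line"; without it there is no match at all, which is the behaviour described above.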
Edit - After viewing your sample, it appears that you are searching unrendered HTML
pasted into HTML content. In that case the header appears to be:
'1. Self-Knowledge&lt;br&gt;' which, when the entities are converted, would be
1. Self-Knowledge<br>
Self-Knowledge
Superior leadership ...
You can add the entity to the mix so that all your bases are covered (i.e. entity, \n, <):
((?:\d+\.|\(\d+\)))[^\S\n]+((?:(?!&lt;|[\n<]).)+)
Where;
Capture group1 = '1.'
Capture group2 = 'Self-Knowledge'
Other than that, I don't know what it could be.

BEGINNER: REGEX Match numeric sequence except where the word "CODE" exists on a line

I've been able to stumble my way through regular expressions for quite some time, but alas, I cannot help a friend in need.
My "friend" is trying to match all lines in a text file that match the following criteria:
Only a 7 to 10 digit number (0123456 or 0123456789)
Only a 7 to 10 digit number, then a dash, then another two digits (0123456-01 or 0123456789-01)
Match any of the above except where the words Code/code or Passcode/passcode is before the numbers to match (Such as "Access code: 16434629" or "Passcode 5253443-12")
EDIT: Only need the numbers that match, nothing else.
Here is the nastiest regex I have ever seen that "he" gave me:
^(?=.*?[^=/%:]\b\d{7,10}((\d?\d?)|(-\d\d))?\b)((?!Passcode|passcode|Code|code).)*$
...
Question: Is there a way to use a short regex to find all lines that meet the above criteria?
Assume PCRE. My friend thanks you in advance. ;-)
BTW - I have not been able to find any other questions listed in stackoverflow.com or superuser.com which can answer this question accurately.
EDIT: I'm using Kodos Python Regex Debugger to validate and test the regex.
(?<!(?:[Pp]asscode|[Cc]ode).*)[0-9]{7,10}(?:-[0-9]{2})?
Commented version:
(?<! # Begin zero-width negative lookbehind. (Makes sure the enclosed pattern can't match before this position)
(?: # Begin non-capturing group
[Pp]asscode # Either Passcode or passcode
| # OR
[Cc]ode # Either Code or code
) # End non-capturing group
.* # Any characters
) # End lookbehind
[0-9]{7,10} # 7 to 10 digits
(?: # Begin non-capturing group
-[0-9]{2} # dash followed by 2 digits
) # End non-capturing group
? # Make last group optional
Edit: final version after comment discussion -
/^(?!\D*(?:[Pp]asscode|[Cc]ode))\D*([0-9]{7,10}(?:-[0-9]{2})?)/
(result in first capture buffer)
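The final version can be tried line by line with a short sketch in Python, since the question mentions a Python regex debugger. Note that Python's re module rejects the earlier lookbehind version because of the .* inside (?<!...), so only the lookahead version is used here. The helper name find_numbers and the "Order ..." sample line are illustrative assumptions.

```python
import re

# Final pattern from the answer: reject lines where Passcode/Code appears
# before the first digit, then capture the number in group 1.
NUMBER = re.compile(
    r"^(?!\D*(?:[Pp]asscode|[Cc]ode))\D*([0-9]{7,10}(?:-[0-9]{2})?)"
)


def find_numbers(text: str) -> list:
    """Return the captured number from every line that qualifies."""
    found = []
    for line in text.splitlines():
        match = NUMBER.match(line)
        if match:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    text = "\n".join([
        "0123456",
        "Access code: 16434629",
        "Passcode 5253443-12",
        "Order 0123456789-01",
    ])
    print(find_numbers(text))
```

Applying the pattern to one line at a time keeps \D* from running across line breaks.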
You can get by with a nasty regex you have to get help with ...
... or you can use two simple regexes. One that matches what you want, and one that filters what you don't want. Simpler and more readable.
Which one would you like to read?
$foo =~ /(?<!(?:[Pp]asscode|[Cc]ode).*)[0-9]{7,10}(?:-[0-9]{2})?/
or
$foo =~ /\d{7,10}(-\d{2})?/ and $foo !~ /(access |pass)code/i;
Edit: case-insensitivity.
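The two-regex idea translates directly to other languages. Here is a minimal sketch in Python; number_in is a hypothetical helper, and the filter is kept exactly as written in the Perl line, so it rejects only "access code" and "passcode", not a bare "code".

```python
import re

# One regex for what we want...
WANTED = re.compile(r"\d{7,10}(?:-\d{2})?")
# ...and one for what we don't want (case-insensitive, as in the Perl version).
UNWANTED = re.compile(r"(?:access |pass)code", re.IGNORECASE)


def number_in(line: str):
    """Return the number found in line, or None if the line is filtered out."""
    if UNWANTED.search(line):
        return None
    match = WANTED.search(line)
    return match.group(0) if match else None


if __name__ == "__main__":
    for line in ["0123456-01", "Access code: 16434629", "Passcode 5253443-12"]:
        print(repr(line), number_in(line))
```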