Correctness of a regular expression


I'm trying to create a regex which will match any one of the following:
FVAL(A)
FVAL("A")
FVAL(A,B)
FVAL("A",B)
FVAL("A","B")
FVAL(A,"B")
FVAL(A,B,C)
FVAL("A",B,C)
FVAL("A","B",C)
FVAL("A","B","C")
FVAL("A",B,"C")
FVAL(A,"B","C")
Regex -
FVAL\s*\(\s*["*]\s*\w+\s*["*]\s*,*\s*["*]\s*\w+\s*["*]\s*,*\s*,*\s*["*]\s*\w+\s*["*]\s*\)
This regex is supposed to return all and any form of the function that is used.
For e.g. -
If match string were - FVAL(A,"B")+5 then match group should be FVAL(A,"B")
P.S. - I'm ignoring white spaces in match string, but they can be there.

Your expression is way too complicated.
FVAL\("?\w+"?(?:,"?\w+"?){0,2}\)
Breakdown:
FVAL # "FVAL"
\( # "("
"? # an optional double quote
\w+ # at least one word character
"? # an optional double quote
(?: # group
, # a comma
"?\w+"? # quote - word character - quote
){0,2} # end group, repeat 0-2 times
\) # ")"
Insert optional whitespace (\s*) into the expression wherever you need to allow it.
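As a quick sanity check, here is a minimal Ruby sketch that adds \s* around each token and runs the pattern against the twelve forms from the question. Ruby is an assumption (the question names no language), and the names FVAL_RE, samples, extracted and spaced are made up for illustration.

```ruby
# Pattern from the answer above, with optional whitespace (\s*) allowed
# around the parentheses, the arguments and the commas.
FVAL_RE = /FVAL\s*\(\s*"?\w+"?\s*(?:,\s*"?\w+"?\s*){0,2}\)/

samples = [
  'FVAL(A)', 'FVAL("A")', 'FVAL(A,B)', 'FVAL("A",B)', 'FVAL("A","B")',
  'FVAL(A,"B")', 'FVAL(A,B,C)', 'FVAL("A",B,C)', 'FVAL("A","B",C)',
  'FVAL("A","B","C")', 'FVAL("A",B,"C")', 'FVAL(A,"B","C")'
]

# Every sample should be matched in full.
all_match = samples.all? { |s| s[FVAL_RE] == s }
p all_match                      #=> true

# Only the function call is extracted from a larger expression.
extracted = 'FVAL(A,"B")+5'[FVAL_RE]
p extracted                      #=> "FVAL(A,\"B\")"

# Whitespace inside the call is tolerated.
spaced = 'FVAL ( "A" , B )*2'[FVAL_RE]
p spaced                         #=> "FVAL ( \"A\" , B )"
```

Note that each quote is optional on its own, so the pattern also accepts unbalanced input such as FVAL("A). If that matters, the quoted and unquoted forms would need separate alternatives.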

Related

Parenthesis content after a specific word

I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/

Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
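The answer targets PCRE, but Ruby's regex engine also supports \G, so the same pattern can be tried there. A minimal sketch, assuming Ruby and the sample input from the question (the variable names are illustrative only):

```ruby
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"

# \G anchors each follow-up match to the end of the previous one, so only
# parentheses that come after "groups=" are visited.
rgx = /(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)/

# scan returns one array per match when the pattern has a capture group,
# so flatten is used to get a plain list of names.
names = str.scan(rgx).flatten
p names  #=> ["kawsay", "sudo", "video", "gpio"]
```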

You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
See the regex demo. Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
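Because \K drops everything matched before it, this variant needs no capture group. A short Ruby sketch, assuming the sample string from the question (Ruby supports both \G and \K; the variable names are illustrative):

```ruby
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"

# With no capture groups, scan returns the matched text itself,
# which is only the part after \K.
names = str.scan(/(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+/)
p names  #=> ["kawsay", "sudo", "video", "gpio"]
```

The uid and gid entries are skipped because a chain of matches can only start at groups=.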

Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for the second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string with "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.*? # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by 'groups=',
# which is preceded by a word boundary
.* # match zero or more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\))/, where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.

Regex match text after last '-'

I am really stuck with the following regex problem:
I want to remove the last piece of a string, but only if '-' occurs more than once in the string.
Example:
BOL-83846-M/L -> Should match -M/L and remove it
B0L-026O1 -> Should not match
D&F-176954 -> Should not match
BOL-04134-58/60 -> Should match -58/60 and remove it
BOL-5068-4 - 6 jaar -> Should match -4 - 6 jaar and remove it (maybe in multiple search/replace steps)
It would be no problem if the regex needs two (or more) steps to remove it.
Now I have
[^-]*$
But in sublime it matches B0L-026O1 and D&F-176954
Need your help please

You can match the first - in a capture group, and then match the second - till the end of the string to remove it.
In the replacement use capture group 1.
^([^-\n]*-[^-\n]*)-.*$
^ Start of string
( Capture group 1
[^-\n]*-[^-\n]* Match the first - between chars other than - (or a newline if you don't want to cross lines)
) Close capture group 1
-.*$ Match the second - and the rest of the line
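To illustrate the replacement step, here is a minimal Ruby sketch (Ruby is an assumption, since the question only mentions Sublime Text; the variable names are illustrative). It replaces the whole match with capture group 1, which drops everything from the second hyphen onward:

```ruby
re = /^([^-\n]*-[^-\n]*)-.*$/

inputs = [
  "BOL-83846-M/L",
  "B0L-026O1",
  "D&F-176954",
  "BOL-04134-58/60",
  "BOL-5068-4 - 6 jaar"
]

# '\1' (single-quoted) refers to capture group 1 in the replacement string.
# Strings with fewer than two hyphens do not match and are left unchanged.
cleaned = inputs.map { |s| s.sub(re, '\1') }
p cleaned
#=> ["BOL-83846", "B0L-026O1", "D&F-176954", "BOL-04134", "BOL-5068"]
```

In an editor's search and replace, the equivalent replacement is usually written \1 or $1, depending on the tool.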

You can match the following regular expression.
^[^-\r\n]*(?:$|-[^-\r\n]*(?=-|$))
If the string contains two or more hyphens this returns the beginning of the string up to, but not including, the second hyphen; else it returns the entire string.
The regular expression can be broken down as follows.
^ # match the beginning of the string
[^-\r\n]* # match zero or more characters other than hyphens,
# carriage returns and linefeeds
(?: # begin a non-capture group
$ # match the end of the string
| # or
- # match a hyphen
[^-\r\n]* # match zero or more characters other than hyphens,
# carriage returns and linefeeds
(?= # begin a positive lookahead
- # match a hyphen
| # or
$ # match the end of the string
) # end positive lookahead
) # end non-capture group
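This pattern keeps the wanted part rather than removing the unwanted part, so no replacement string is needed. A small Ruby sketch of that usage, assuming the sample inputs from the question (the variable names are illustrative):

```ruby
rgx = /^[^-\r\n]*(?:$|-[^-\r\n]*(?=-|$))/

inputs = [
  "BOL-83846-M/L",
  "B0L-026O1",
  "D&F-176954",
  "BOL-04134-58/60",
  "BOL-5068-4 - 6 jaar",
  "NOHYPHEN"
]

# String#[] with a regex returns the matched portion of the string.
kept = inputs.map { |s| s[rgx] }
p kept
#=> ["BOL-83846", "B0L-026O1", "D&F-176954", "BOL-04134", "BOL-5068", "NOHYPHEN"]
```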

Regex Ruby How to group every word within parentheses

I'm trying to get all the words between the parentheses after a specific word and the end of the string.
For example, I have this case:
p " some other text in downcase LOREM (foo, bar)".scan(/ LOREM \((.*?)\)\z/m)
# [["foo, bar"]]
The regex is getting foo, bar which is between the parenthesis, it's okay, but I'd like to get them like two separate elements within a single array, meaning:
["foo", "bar"]
That's to say, the regex should group every words as a separate element.
My intention is to get everything between LOREM ( and the last closing parenthesis ).
I've tried adding (\b\w+\b), which groups every word in the string. But when adding it to the attempt to get the words from the parenthesis, it returns nothing.

You may use
.scan(/(?:\G(?!\A)\s*,\s*|\sLOREM\s+\()\K\w+(?=[^()]*\)\z)/)
See the Ruby demo and the Rubular regex demo. You may replace \w+ with [[:alnum:]]+, or \p{L}+ (to only match letters), or [^\s,()]+ (to match any 1+ chars other than whitespace, ,, ( and )), it all depends on what you want to match inside the parentheses.
Details
(?:\G(?!\A)\s*,\s*|\sLOREM\s+\() - either the end of the previous successful match and a , enclosed with 0+ whitespaces, or whitespace, LOREM, 1+ whitespaces and (
\K - omit the text matched so far
\w+ - consume 1+ word chars
(?=[^()]*\)\z) - immediately to the right, there must be 0 or more chars other than ( and ) and then ) at the end of the string.
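Put together as a complete call, the pattern can be tried like this. It is a minimal sketch using the string from the question plus one made-up counterexample (the variable names are illustrative):

```ruby
rgx = /(?:\G(?!\A)\s*,\s*|\sLOREM\s+\()\K\w+(?=[^()]*\)\z)/

str = " some other text in downcase LOREM (foo, bar)"
words = str.scan(rgx)
p words     #=> ["foo", "bar"]

# The lookahead requires the closing parenthesis to end the string,
# so trailing text after ")" prevents any match.
trailing = " some text LOREM (foo, bar) tail".scan(rgx)
p trailing  #=> []
```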

r = /
(?<= # begin a positive lookbehind
LOREM[ ] # match 'LOREM '
\( # match left paren
| # or
,[ ] # match a comma followed by a space
) # end positive lookbehind
(?: # begin a non-capture group
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
| # or
\" # match a double quote
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
\" # match a double quote
) # end non-capture group
(?= # begin a positive lookahead
.*\) # match any number of characters followed by a right paren
) # end positive lookahead
/x # free-spacing regex definition mode
Conventionally this is written
r = /(?<=LOREM \(|, )(?:[^, ")]+|\"[^, ")]+\")(?=.*\))/
Let's try it.
str = "some other text in downcase LOREM (foo, \"bar\", \"baz), daz"
str.scan(r)
#=> ["foo", "\"bar\""]
The first match, "foo", matches
str.scan /(?<=LOREM \()[^, ")]+/
#=> ["foo"]
That is, this matches one or more characters other than a comma, space, double quote or right parenthesis, immediately preceded by "LOREM " followed by a left parenthesis.
The next attempted match begins at the end of "foo". There is no match of "L" in "LOREM" so an attempt is made to match ", ", which is met with success. [^, ")]+ does not match "bar", so an attempt is made to match \"[^, ")]+\", which is successful. As ", " is matched within the lookbehind it is not part of the match returned. This matches '"bar"'.
\"baz is not matched because it has no closing double quote.

REGEX input validation

I am trying to put together REGEX expression to validate the following format:
"XXX/XXX","XXX/XXX","XXX/XXX"
where X could be a letter, a number, a dash or an underscore. What I have so far is
"(.*?)(\/)(.*?)"(?:,|$)/g
but it does not seem to work
Update: there could be any number of "XXX/XXX" strings, comma-separated, not just 3

You can try the following regex:
"([\w-]+)\/([\w-]+)"
Edit: regex explained:
1. ([\w-]+) Inside the square brackets, \w matches any word character (equal to [a-zA-Z0-9_]). After it comes "-", which adds the literal symbol "-" to the set of matching symbols.
2. "+" says we want one or more symbols from the previous block: [\w-]
3. \/ matches the symbol "/" directly. It needs to be escaped where "/" delimits the regex, which is why it is preceded by "\".
4. ([\w-]+) exactly like point 1, matches the same thing since the two parts are identical.
5. () The parentheses mark a capturing group, which you can later use in your code to get the value it surrounds and matches.
Example:
Full match: 1X-/-XX
Group 1: 1X-
Group 2: -XX
Here is a demo with the matching cases - click. If this doesn't do the trick, let me know in the comments.
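To show how the two capture groups come back in code, here is a minimal Ruby sketch. Ruby is an assumption (the question names no language), and the input strings and variable names are made up for illustration:

```ruby
rgx = /"([\w-]+)\/([\w-]+)"/

input = '"ab-c/d_1","x/y"'

# With capture groups, scan returns one [group1, group2] array per match.
pairs = input.scan(rgx)
p pairs   #=> [["ab-c", "d_1"], ["x", "y"]]

# The pattern finds pairs but does not validate the whole string:
# text around a pair is simply skipped.
loose = 'junk"a/b"junk'.scan(rgx)
p loose   #=> [["a", "b"]]
```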

This will do the job:
"[-\w]+/[-\w]+"(?:,"[-\w]+/[-\w]+")*
Explanation:
" # quote
[-\w]+ # 1 or more hyphen or word character [a-zA-Z0-9_]
/ # a slash
[-\w]+ # 1 or more hyphen or word character [a-zA-Z0-9_]
" # quote
(?: # non capture group
, # a comma
" # quote
[-\w]+ # 1 or more hyphen or word character [a-zA-Z0-9_]
/ # a slash
[-\w]+ # 1 or more hyphen or word character [a-zA-Z0-9_]
" # quote
)* # end group, may appear 0 or more times
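For validating a whole input string, the pattern needs to be anchored at both ends. A minimal Ruby sketch of that idea follows. Ruby is an assumption, and PAIR, LIST_RE and valid_list? are hypothetical names, not part of the answer:

```ruby
# One quoted "XXX/XXX" item.
PAIR = %r{"[-\w]+/[-\w]+"}

# One item followed by any number of comma-separated items;
# \A and \z anchor the pattern to the whole string.
LIST_RE = /\A#{PAIR}(?:,#{PAIR})*\z/

def valid_list?(s)
  LIST_RE.match?(s)
end

p valid_list?('"XXX/XXX","XXX/XXX","XXX/XXX"')  #=> true
p valid_list?('"a-b/c_d"')                      #=> true
p valid_list?('"a/b",')                         #=> false (trailing comma)
p valid_list?('"a b/c"')                        #=> false (space not allowed)
```

Building the pattern from PAIR avoids writing the item sub-pattern twice, at the cost of a slightly less readable final regex.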

Here, we would be starting with a simple expression with quantifiers:
("[A-Za-z0-9_-]+\/[A-Za-z0-9_-]+")(,|$)
where we collect the allowed characters (letters, digits, underscore and dash) in a char class, followed by a slash, and at the end we require either a , or the end of the string.

Regular expression captures unwanted string

I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well except that :
it captures the term 'hello+', but without the '+'; this term should not be captured at all
it captures the last term 'hello^-1212121' as 2 groups, 'hello' and '-1212121'; both should be ignored
The strings to capture are as follows :
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture :
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression which effectively captures all these terms, it's not really optimized but it just works :
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)

Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
    ^        ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to keep the invalid terms from being captured at all, you'll have to use lookarounds. I don't know of any other way to check the delimiters without hindering the matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
And lastly, if capturing the words is sufficient, we can remove the lookarounds:
([-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)
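To see what the lookaround version captures on the sample input from the question, here is a minimal sketch. The question is about the .NET engine; Ruby is used here as an assumption, on the understanding that the constructs in this pattern behave the same way for this input (the variable names are illustrative):

```ruby
src = "hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121"

# Pattern from the EDIT above: a term must start right after a comma (or the
# start of the line) and be followed by optional whitespace and a comma or the end.
rgx = /(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))/

# scan yields one array per match because of the capture group.
terms = src.scan(rgx).flatten
p terms
#=> ["hello", "hello^.555", "hello^111", "-hello", "+hello", "hello^.25"]
```

'hello+' and 'hello^-1212121' are skipped entirely, which is what the question asks for.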

It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal caret
\.? # optional period
\d+ # one or more digits
| # OR
[-+] # literal plus or minus
[a-zA-Z]+ # one or more letters
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp captures exponents written as ordinary non-negative numbers (e.g. 0, 1, 1.2, 1.23) as well as those lacking a leading digit (e.g. .1, .12)
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+\.\d+, or \.\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
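So, in JavaScript, you'd write:
var src = "hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
    RE = /([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g,
    m;
// With the g flag, each exec() call resumes from RE.lastIndex
while ((m = RE.exec(src)) !== null) {
  console.log(m[1]); // capture group 1 holds the term
}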
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1
