Understanding regex criteria in pattern match - regex

I am trying to determine what the following pattern match criteria allows me to enter:
\s*([\w\.-]+)\s*=\s*('[^']*'|"[^"]*"|[^\s]+)
From my attempt to decipher (by looking at the regexes I do understand) it seems to say I can start with any character sequence then I must have a brace followed by alphanumerics, then another sequence followed by braces, one initial single quote, no backslashes closed by a brace ???
Sorry if I have got this completely muddled. Any help is appreciated.
Regards,
Pablo

The square brackets are character classes, and the parens are for grouping. I'm not sure what you mean by "braces".
This basically matches a name=value pair where the name consists of one or more "word", dot or hyphen characters, and the value is either a single-quoted string of characters, a double-quoted string of characters, or a bunch of non-whitespace characters. Single-quoted strings cannot contain a single quote, and double-quoted strings may not contain double quotes (both arguably minor flaws in whatever syntax this is from). There's also arguably some ambiguity since the last option ("a bunch of non-whitespace characters") could match something starting with a single or double quote.
Also, zero or more whitespace characters may appear around the equals sign or at the beginning (that's the \s* bits).
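The ambiguity mentioned above can be seen with a small Python sketch. It assumes Python's re module behaves like the original flavour for this pattern, and the names PAIR and parse_pair are made up for illustration.

```python
import re

# The pattern from the question, unchanged.
PAIR = re.compile(r"""\s*([\w\.-]+)\s*=\s*('[^']*'|"[^"]*"|[^\s]+)""")

def parse_pair(text):
    """Return (name, value) for the first pair at the start of text, or None."""
    m = PAIR.match(text)
    return m.groups() if m else None

# An unterminated quote falls through to the third alternative ([^\s]+),
# so the value starts with a quote and stops at the first whitespace.
print(parse_pair("name='unterminated value"))

# A quote inside a single-quoted value ends the value early.
print(parse_pair("name = 'it's'"))
```

In both cases the regex still matches, but the captured value is probably not what the writer of the input intended.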

It's looking for strings of text which are basically
<identifier> = <value>
identifier is made up of letters, digits, '_', '-' and '.'
value can be a single-quoted string, a double-quoted string, or any other sequence of characters (as long as it doesn't contain whitespace).
So it would match lines that look like this:
foo = 1234
bar-bar= "a double-quoted string"
bar.foo-bar ='a single quoted string'
.baz =stackoverflow.com this part is ignored
Some things to note:
There's no way to put a quote inside a quoted string (such as using \" inside "...").
Anything after the quoted string is ignored.
If a quoted string isn't used for value, then everything from the first space onwards is ignored.
Whitespace is optional
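As a quick check, here is a sketch that runs the four example lines above through the pattern using Python's re module (an assumption, since the question does not say which regex flavour is in use). The helper name parse is hypothetical.

```python
import re

PAIR = re.compile(r"""\s*([\w\.-]+)\s*=\s*('[^']*'|"[^"]*"|[^\s]+)""")

lines = [
    'foo = 1234',
    'bar-bar= "a double-quoted string"',
    "bar.foo-bar ='a single quoted string'",
    '.baz =stackoverflow.com this part is ignored',
]

def parse(line):
    """Return (identifier, value) or None if the line does not match."""
    m = PAIR.match(line)
    return (m.group(1), m.group(2)) if m else None

results = [parse(line) for line in lines]
for identifier, value in results:
    print(identifier, "->", value)
```

Note that the quotes are part of the captured value, so the caller has to strip them if only the text between them is wanted.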

RegexBuddy says:
\s*([\w\.-]+)\s*=\s*('[^']*'|"[^"]*"|[^\s]+)
Options: case insensitive
Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.) «\s*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the regular expression below and capture its match into backreference number 1 «([\w\.-]+)»
Match a single character present in the list below «[\w\.-]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A word character (letters, digits, etc.) «\w»
A . character «\.»
The character “-” «-»
Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.) «\s*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “=” literally «=»
Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.) «\s*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the regular expression below and capture its match into backreference number 2 «('[^']*'|"[^"]*"|[^\s]+)»
Match either the regular expression below (attempting the next alternative only if this one fails) «'[^']*'»
Match the character “'” literally «'»
Match any character that is NOT a “'” «[^']*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “'” literally «'»
Or match regular expression number 2 below (attempting the next alternative only if this one fails) «"[^"]*"»
Match the character “"” literally «"»
Match any character that is NOT a “"” «[^"]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “"” literally «"»
Or match regular expression number 3 below (the entire group fails if this one fails to match) «[^\s]+»
Match a single character that is a “non-whitespace character” «[^\s]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Created with RegexBuddy

Let us break \s*([\w\.-]+)\s*=\s*('[^']*'|\"[^\"]*\"|[^\s]+) apart:
\s*([\w\.-]+)\s*:
\s* means 0 or more whitespace characters
[\w.-]+ means 1 or more of the following characters: A-Za-z0-9_.-
('[^']*'|\"[^\"]*\"|[^\s]+):
Zero or more non-' characters enclosed in ' and '.
Zero or more non-" characters enclosed in " and ".
One or more non-whitespace characters.
So basically, you can mostly ignore the \s*'s in trying to understand the expression, they just handle removing spacing.

Yes, you have got it completely muddled. :P For one thing, there are no braces in that regex; that word usually refers to the curly brackets: {}. That regex only contains square brackets and parentheses (aka round brackets), and they're all regex metacharacters--they aren't meant to match those characters literally. The same goes for most of the other characters.
You might find this site useful. Very good tutorial and reference site for all things regex.

Related

How would I detect superscript for one word if there's no parentheses, but if there are parentheses, for all the contents of them?

I want to detect the two following circumstances, preferably with one regex:
This is a sentence ^that I wrote today.
And:
This is a sentence ^(that I wrote) today.
So basically, if there are parentheses after the caret, I want to match whatever is inside them. Otherwise, I just want to match just the next word.
I'm new to regex. Is this possible without making it too complicated?
\^(\w+|\([\w ]+\))
Options: case insensitive; ^ and $ match at line breaks
Match the character “^” literally «\^»
Match the regular expression below and capture its match into backreference number 1 «(\w+|\([\w ]+\))»
Match either the regular expression below (attempting the next alternative only if this one fails) «\w+»
Match a single character that is a “word character” (letters, digits, etc.) «\w+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «\([\w ]+\)»
Match the character “(” literally «\(»
Match a single character present in the list below «[\w ]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A word character (letters, digits, etc.) «\w»
The character “ ” « »
Match the character “)” literally «\)»
Created with RegexBuddy
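A minimal Python sketch of the same regex, assuming Python's re module is an acceptable stand-in for whatever flavour is being used. The function name superscript_text is hypothetical. Because the capture group includes the parentheses in the second case, the sketch strips them afterwards.

```python
import re

SUPERSCRIPT = re.compile(r"\^(\w+|\([\w ]+\))")

def superscript_text(sentence):
    """Return the superscripted text after a caret, or None if there is none."""
    m = SUPERSCRIPT.search(sentence)
    if m is None:
        return None
    # Group 1 is either a bare word or "(...)" including the parentheses.
    return m.group(1).strip("()")

print(superscript_text("This is a sentence ^that I wrote today."))
print(superscript_text("This is a sentence ^(that I wrote) today."))
```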

R regular expression repetition ignores upper bound

I try to make regular expression which helps me filter strings like
blah_blah_suffix
where suffix is any string that has length from 2 to 5 characters. So I want accept strings
blah_blah_aa
blah_blah_abcd
but discard
blah_blah_a
blah_aaa
blah_blah_aaaaaaa
I use grepl in the following way:
samples[grepl("blah_blah_.{2,5}", samples)]
but it ignores the upper bound for repetition (5). So it discards the strings blah_blah_a and blah_aaa, but accepts the string blah_blah_aaaaaaa.
I know there is a way to filter strings without usage of regular expression but I want to understand how to use grepl correctly.
You need to bound the expression to the start and end of the line:
^blah_blah_.{2,5}$
The ^ matches beginning of line and $ matches end of line. See a working example here: Regex101
If you want to bound the expression to the beginning and end of a string (not multi-line), use \A and \Z instead of ^ and $.
/^[\w]+_[\w]+_[\w]{2,5}$/
Options: dot matches newline; case insensitive; ^ and $ match at line breaks
Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match a single character that is a “word character” (letters, digits, and underscores) «[\w]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match a single character that is a “word character” (letters, digits, and underscores) «[\w]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match a single character that is a “word character” (letters, digits, and underscores) «[\w]{2,5}»
Between 2 and 5 times, as many times as possible, giving back as needed (greedy) «{2,5}»
Assert position at the end of a line (at the end of the string or before a line break character) «$»
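The effect of the anchors is not specific to R. Here is a sketch in Python, where re.search plays the role of grepl: it reports a match if the pattern is found anywhere in the string. The list samples reuses the strings from the question; the other names are made up.

```python
import re

samples = [
    "blah_blah_aa",
    "blah_blah_abcd",
    "blah_blah_a",
    "blah_aaa",
    "blah_blah_aaaaaaa",
]

# Without anchors, the pattern only has to match a substring, so the
# upper bound of {2,5} never rejects anything: 5 characters are matched
# and the remainder of the string is simply left over.
unanchored = [s for s in samples if re.search(r"blah_blah_.{2,5}", s)]

# With ^ and $ the whole string must fit the pattern.
anchored = [s for s in samples if re.search(r"^blah_blah_.{2,5}$", s)]

print(unanchored)
print(anchored)
```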

Regex for 2 items but with one exclusion

I am building a RegEx that needs to find lines that have either:
DateTime.Now
or
Date.Now
But cannot have the literal "SystemDateTime" on the same line.
I started with this (DateTime\.Now|Date\.Now) but now I am stuck with where to put the "SystemDateTime"
Use this, assuming you are not using the /s modifier (DOTALL), which lets the dot (.) match newline characters:
(?!.*SystemDateTime)(DateTime\.Now|Date\.Now)
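(?!.*SystemDateTime) is a negative lookahead: it asserts that SystemDateTime does not appear anywhere after the current position on the line. It does not look backwards, so a SystemDateTime that occurs earlier on the line is not excluded unless the lookahead is anchored at the start of the line, as in ^(?!.*SystemDateTime).*(DateTime\.Now|Date\.Now).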
You could use negative lookahead like this:
(?!.*SystemDateTime)\bDate(?:Time)?\.Now\b
/(?!.*SystemDateTime)Date(?:Time)?\.Now/
EXPLANATION:
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*SystemDateTime)»
Match any single character that is not a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the characters “SystemDateTime” literally «SystemDateTime»
Match the characters “Date” literally «Date»
Match the regular expression below «(?:Time)?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the characters “Time” literally «Time»
Match the character “.” literally «\.»
Match the characters “Now” literally «Now»
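One thing worth checking with any of these patterns is where the lookahead starts, since it only inspects text to the right of the current position. The sketch below uses Python's re module as a stand-in for the unspecified flavour. WORD_BOUNDED is the second pattern above, while ANCHORED is a variant introduced here that puts the lookahead at the start of the line; both names and the test lines are assumptions for illustration.

```python
import re

# Second pattern from the answers, unchanged.
WORD_BOUNDED = re.compile(r"(?!.*SystemDateTime)\bDate(?:Time)?\.Now\b")

# Variant (not from the answers): the lookahead runs once, at line start,
# so it sees SystemDateTime anywhere on the line.
ANCHORED = re.compile(r"^(?!.*SystemDateTime).*\bDate(?:Time)?\.Now\b")

lines = [
    "x = DateTime.Now",
    "y = Date.Now",
    "z = DateTime.Now; w = SystemDateTime.Now",  # excluded word after the match
    "w = SystemDateTime.Now; z = DateTime.Now",  # excluded word before the match
]

bounded_hits = [bool(WORD_BOUNDED.search(line)) for line in lines]
anchored_hits = [bool(ANCHORED.search(line)) for line in lines]

print(bounded_hits)
print(anchored_hits)
```

The last line is the interesting one: the unanchored lookahead accepts it because SystemDateTime sits to the left of the match, whereas the anchored variant rejects it.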

What does this regular expression mean?

/\ATo\:\s+(.*)/
Also, how do you work it out, what's the approach?
In multi-line regular expressions, \A matches the start of the string (and \Z is end of string, while ^/$ matches the start/end of the string or the start/end of a line). In single line variants, you just use ^ and $ for start and end of string/line since there is no distinction.
To is literal, \: is an escaped :.
\s means whitespace and the + means one or more of the preceding "characters" (white space in this case).
() is a capturing group, meaning everything in here will be stored in a "register" that you can use. Hence, this is the meat that will be extracted.
.* simply means any non newline character ., zero or more times *.
So, what this regex will do is process a string like:
To: paxdiablo
Re: you are so cool!
and return the text paxdiablo.
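For instance, a minimal sketch in Python (assuming its re module, since the flavour was not stated; the names HEADER and message are made up):

```python
import re

HEADER = re.compile(r"\ATo\:\s+(.*)")

message = "To: paxdiablo\nRe: you are so cool!"

m = HEADER.search(message)
recipient = m.group(1) if m else None
print(recipient)

# \A only matches at the very start of the string, so a To: header on a
# later line is not found.
later = HEADER.search("Re: hello\nTo: paxdiablo")
print(later)
```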
As to how to learn how to work this out yourself, the Perl regex tutorial(a) is a good start, and then practise, practise, practise :-)
(a) You haven't actually stated which regex implementation you're using but most modern ones are very similar to Perl. If you can find a specific tutorial for your particular flavour, that would obviously be better.
\A is a zero-width assertion and means "Match only at beginning of string".
The regex reads: On a line beginning with "To:" followed by one or more whitespaces (\s), capture the remainder of the line ((.*)).
First, you need to know what the different escape sequences and quantifiers are. Backslash-prefixed sequences include character classes such as \s and anchors such as \A from your regex. Quantifiers are, for instance, the +. There are several references on the internet, for instance this one.
Using that, we can see what happens by going left to right:
\A matches a beginning of the string.
To matches the text "To" literally
\: escapes the ":", so it is matched as a literal colon (a colon has no special meaning here, so the escape is harmless but not needed)
\s matches whitespace (space, tab, etc)
+ means to match the previous class one or more times, so \s+ means one or more whitespace characters
() is a capture group, anything matched within the parens is saved for later use
. means "any character" (except a newline, by default)
* is like the +, but zero or more times, so .* means any number of any characters
Taking that together, the regex will match a string beginning with "To:", then at least one whitespace character, and then anything, which it will save. So, with the string "To: JaneKealum", you'll be able to extract "JaneKealum".
You start from left and look for any escaped (ie \A) characters. The rest are normal characters. \A means the start of the input. So To: must be matched at the very beginning of the input. I think the : is escaped for nothing. \s is a character group for all spaces (tabs, spaces, possibly newlines) and the + that follows it means you must have one or more space characters. After that you capture all the rest of the line in a group (marked with ( )).
If the input was
To: progo#home
the capture group would contain "progo#home"
It matches To: at the beginning of the input, followed by at least one whitespace, followed by any number of characters as a group.
The initial and trailing / characters delimit the regular expression.
A \ inside the expression means to treat the following character specially or treat it as a literal if it normally has a special meaning.
The \A means match only at the beginning of a string.
To means match the literal "To"
\: means match a literal ':'. A colon is normally a literal and has no special meaning it can be given.
\s means match a whitespace character.
+ means match as many as possible but at least one of whatever it follows, so \s+ means match one or more whitespace characters.
The ( and ) define a group of characters that will be captured and returned by the expression evaluator.
And finally the . matches any character except a newline (by default) and the * means match as many as possible but can be zero. Therefore the (.*) will capture all characters to the end of the line.
So therefore the pattern will match a string that starts with "To:" followed by whitespace, and capture everything from the first non-whitespace character after that to the end of the line.
The only way to really understand these things is to go through them one bit at a time and check the meaning of each component.
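Going through it one bit at a time is easy to do interactively. Here is a small sketch, assuming Python's re module, that probes how \s+ and (.*) divide the input between them. The helper name capture is hypothetical.

```python
import re

HEADER = re.compile(r"\ATo\:\s+(.*)")

def capture(text):
    """Return what group 1 captured, or None if the regex does not match."""
    m = HEADER.match(text)
    return m.group(1) if m else None

# \s+ is greedy and swallows all leading whitespace; trailing spaces are
# kept because . matches them.
print(repr(capture("To:    Jane Kealum  ")))

# At least one whitespace character is required after the colon.
print(capture("To:JaneKealum"))

# . stops at a newline by default, so only the first line is captured.
print(capture("To: a\nb"))
```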

Replace Property Definitions in VB.Net Code

In VB 2010, you can use auto-implemented properties, as in C#, which turns this
Private _SONo As String
Public Property SONo() As String
Get
Return _SONo
End Get
Set(ByVal value As String)
_SONo = value
End Set
End Property
Into
Public Property SONo() As String
What I want to do is replace the old style with the new style in a few files. Since Visual Studio's find and replace tool allows you to use regular expressions, I assume there must be an expression I can use to do this conversion.
What would the regular expression be to do this conversion?
This could be dangerous as you might have logic in the property setters/getters, but if they don't have logic you could say:
Regular Expression:
Private\s_(\w+)\sAs\s(\w+).*?(^\w+).*?Property.*?End\sProperty
Replace:
${3} Property ${1} As ${2}
I've tested this with RegexBuddy targeting the .NET regex variant. Note, that this may or may not work in the Visual Studio Find/Replace prompt as that is yet another variant.
UPDATE: VS's variant (Dot can't match newlines so we need to add that functionality, also converted: \w = :a, \s = :b, {} for tags, and *? = #):
Private:b_{:a+}:bAs:b{:a+}(.|\n)#{:a+}(.|\n)#Property(.|\n)#End:bProperty
\3 Property \1 As \2
The Regex does the following:
Options: dot matches newline; case insensitive; ^ and $ match at line breaks
Match the characters “Private” literally «Private»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference number 1 «(\w+)»
Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the characters “As” literally «As»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the regular expression below and capture its match into backreference number 2 «(\w+)»
Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below and capture its match into backreference number 3 «(\w+)»
Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters “Property” literally «Property»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters “End” literally «End»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the characters “Property” literally «Property»
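For trying the conversion outside Visual Studio, here is a rough sketch of the same substitution using Python's re module. That is a swapped-in engine, not the .NET or Visual Studio variant tested above. The DOTALL and MULTILINE flags mirror the "dot matches newline" and "^ and $ match at line breaks" options listed above, and the name convert is made up. As already noted, this is only safe for properties whose getter and setter contain no extra logic.

```python
import re

# Same expression as above; Python writes backreferences as \1, \2, \3.
OLD_STYLE = re.compile(
    r"Private\s_(\w+)\sAs\s(\w+).*?(^\w+).*?Property.*?End\sProperty",
    re.DOTALL | re.MULTILINE,
)

def convert(source):
    """Replace old-style property blocks with one-line declarations."""
    return OLD_STYLE.sub(r"\3 Property \1 As \2", source)

old_code = "\n".join([
    "Private _SONo As String",
    "Public Property SONo() As String",
    "Get",
    "Return _SONo",
    "End Get",
    "Set(ByVal value As String)",
    "_SONo = value",
    "End Set",
    "End Property",
])

new_code = convert(old_code)
print(new_code)
```

Note that this replacement does not reproduce the parentheses after the property name.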