I have a regular expression of the form
def parse(self, format_string):
for m in re.finditer(
r"""(?: \$ \( ( [^)]+ ) \) ) # the field access specifier
| (
(?:
\n | . (?= \$ \( ) # any one single character before the '$('
)
| (?:
\n | . (?! \$ \( ) # any one single character, except the one before the '$('
)*
)""",
format_string,
re.VERBOSE):
...
I would like to replace all the repeating sequences (\$ \() with some custom shorthand "constant", which would look like this:
def parse(self, format_string):
re.<something>('\BEGIN = \$\(')
for m in re.finditer(
r"""(?: \BEGIN ( [^)]+ ) \) ) # the field access specifier
| (
(?:
\n | . (?= \BEGIN ) # any one single character before the '$('
)
| (?:
\n | . (?! \BEGIN ) # any one single character, except the one before the '$('
)*
)""",
format_string,
re.VERBOSE):
...
Is there a way to do this with regular expressions themselves (i.e. not using Python's string formatting to substitute \BEGIN with \$\()?
Clarification: the Python source is purely for context and illustration. I'm looking for an RE solution that would be available in some RE dialect (maybe not Python's), not a solution specifically for Python.
I don't think this is possible in the regex flavor of Python's built-in re module. You would need recursion (or rather pattern reuse), which PCRE supports (Perl has the same constructs). In fact, PCRE even mentions how defining shorthands works in its man page (search for "Defining subpatterns").
In PCRE, you can use the recursion syntax in a similar way to backreferences - except that the pattern is applied again, instead of trying to get the same literal text as from a backreference. Example:
/(\d\d)-(?1)-(?1)/
Matches something like a date (where (?1) is replaced with \d\d and evaluated again). This is really powerful, because if you use this construct within the referenced group itself, you get recursion, but we don't even need that here. The above also works with named groups:
/(?<my>\d\d)-(?&my)-(?&my)/
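To see the difference between a backreference and pattern reuse without a PCRE engine at hand, here is a small sketch using Python's built-in re module. Since re has no (?1), the reuse is only emulated by repeating the subpattern text; the name PAIR is purely illustrative.

```python
import re

# A backreference must match the same text that group 1 captured.
backref = re.compile(r"(\d\d)-\1-\1")

# Pattern reuse applies the subpattern again. Python's re has no (?1),
# so this sketch emulates it by repeating the subpattern text.
PAIR = r"\d\d"  # illustrative name, not part of any regex syntax
reuse = re.compile(rf"({PAIR})-(?:{PAIR})-(?:{PAIR})")

print(bool(backref.fullmatch("12-12-12")))  # True
print(bool(backref.fullmatch("12-34-56")))  # False
print(bool(reuse.fullmatch("12-34-56")))    # True
```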
Now we're already really close, but the definition is also the first use of the pattern, which somewhat clutters up the expression. The trick is to use the pattern first in a position that is never evaluated. The man pages suggest a conditional that is dependent on a (non-existent) group DEFINE:
/
(?(DEFINE)
(?<my>\d\d)
)
(?&my)-(?&my)-(?&my)
/x
The construct (?(group)true|false) applies pattern true if the group group was used before, and (the optional) pattern false otherwise. Since there is no group DEFINE, the condition will always be false, and the true pattern will be skipped. Hence, we can put all kinds of definitions there, without being afraid that they will ever be applied and mess up our results. This way we get them into the pattern, without really using them.
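The conditional construct itself can be tried in Python's built-in re module, which supports (?(group)true|false) but not the (?&name) reuse syntax. A minimal sketch, with a made-up pattern that requires a closing '>' only when an opening '<' was matched:

```python
import re

# (<)? is group 1; (?(1)>) requires a closing '>' only if group 1 took part
# in the match. No "false" branch is given, so it defaults to empty.
tag_or_word = re.compile(r"(<)?\w+(?(1)>)")

for text in ["<abc>", "abc", "<abc"]:
    print(text, bool(tag_or_word.fullmatch(text)))
```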
An alternative is a negative lookahead that never reaches the point where the expression is defined:
/
(?!
(?!) # fail - this makes the surrounding lookahead pass unconditionally
# the engine never gets here; now we can write down our definitions
(?<my>\d\d)
)
(?&my)-(?&my)-(?&my)
/x
However, you only really need this form if you don't have conditionals but do have named pattern reuse (and I don't think a flavor like this exists). The other variant has the advantage that the use of DEFINE makes it obvious what the group is for, while the lookahead approach is a bit obfuscated.
So back to your original example:
/
# Definitions
(?(DEFINE)
(?<BEGIN>[$][(])
)
# And now your pattern
(?: (?&BEGIN) ( [^)]+ ) \) ) # the field access specifier
|
(
(?: # any one single character before the '$('
\n | . (?= (?&BEGIN) )
)
|
(?: # any one single character, except the one before the '$('
\n | . (?! (?&BEGIN) )
)*
)
/x
There are two major caveats to this approach:
Recursive references are atomic. That is, once the reference has matched something it will never be backtracked into. For certain cases this can mean that you have to be a bit clever in crafting your expression, so that the first match will always be the one you want.
You cannot use capturing inside the defined patterns. If you use something like (?<myPattern>a(b)c) and reuse it, the b will never be captured - when reusing a pattern, all groups are non-capturing.
The most important advantage over any kind of interpolation or concatenation, however, is that you can never produce invalid patterns with this, and you cannot mess up your capturing group counts either.
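For comparison, where (?(DEFINE)...) is not available (as in Python's built-in re), the fallback is the interpolation the question wanted to avoid. The sketch below is one way to keep it reasonably safe, not a replacement for the PCRE approach: the BEGIN constant and the tokens helper are hypothetical names, and the shorthand is wrapped in a non-capturing group so the group numbering stays as in the original pattern.

```python
import re

# Hypothetical shorthand for the literal '$('. Wrapping it in (?:...) keeps
# it non-capturing, so the numbering of the real groups does not shift.
BEGIN = r"(?:\$\()"

FIELD_RE = re.compile(
    rf"""(?: {BEGIN} ( [^)]+ ) \) )     # group 1: the field access specifier
    | (
        (?:
            \n | . (?= {BEGIN} )        # any one single character before the '$('
        )
      | (?:
            \n | . (?! {BEGIN} )        # any character not directly before a '$('
        )*
    )""",
    re.VERBOSE,
)


def tokens(format_string):
    """Return ('field', name) and ('text', chunk) pairs, skipping empty matches."""
    result = []
    for m in FIELD_RE.finditer(format_string):
        if m.group(1) is not None:
            result.append(("field", m.group(1)))
        elif m.group(2):
            result.append(("text", m.group(2)))
    return result


print(tokens("ab$(x)c"))
```

Unlike the DEFINE form, nothing here stops a malformed BEGIN from producing an invalid pattern, which is exactly the trade-off described above.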
Hi everybody,
I need help with a regular expression pattern.
I need to search text for this:
( anything ) -
Check this example against every statement.
I need to detect whether this pattern exists in the statement that I feed to my function, and get the matched string.
Mind the spaces, the parentheses and the dash. "anything" means any content, Arabic or English, no matter what it is; the pattern just starts with ( and ends with -. If this pattern exists in the first statement, the function should say that it exists.
Thanks, everyone.
The task can be easier if it is described in a way "guiding to"
the proper solution. Let's rewrite your task the following way:
The text to match:
Should start with ( and a space.
Then there is a non-empty sequence of chars other than )
(the text between parentheses, which you wrote as anything).
The last part is a space, ), another space and -.
Given such a description, it is clear that the pattern should be:
\( [^)]+ \) -
where each fragment: \(, [^)]+ and \) - expresses each of the
above conditions.
Note: If spaces after ( and before ) are optional, then you can express it
with ? after each such space, and then the whole regex will change to
\( ?[^)]+ ?\) -.
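As an illustration only, here is how the two patterns above could be applied in Python; the function name find_marker is made up for this sketch. Since [^)]+ accepts any character other than ), Arabic and English content are treated alike.

```python
import re

# Strict form from the answer: mandatory spaces inside the parentheses.
STRICT = re.compile(r"\( [^)]+ \) -")
# Relaxed form: the spaces next to the parentheses are optional.
RELAXED = re.compile(r"\( ?[^)]+ ?\) -")


def find_marker(statement, pattern=STRICT):
    """Return the matched text, or None if the pattern does not occur."""
    m = pattern.search(statement)
    return m.group(0) if m else None


print(find_marker("( anything ) - rest of the statement"))
print(find_marker("(anything) - rest", RELAXED))
```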
A common thing I want to do, when doing a search-replace in an IDE (in this case: PyCharm), is to avoid cutting expressions or statements in half.
For example, suppose I want to fix the fact that my code is using python-2-style print statements. I might write:
Search: print (.+), replace: print($1)
But this will do the wrong thing for multi-line statements:
print 'one' \
'two'
In general, recognizing multi-line statements is complicated. You need to check for trailing \s and also do bracket-matching for multiple types of brackets. Is there built-in functionality for doing this? Some kind of end-of-statement / end-of-expression escape sequence?
You could probably do it this way.
Find print((?:.+?(?:\\\r?\n)?)+)
Replace print($1)
Expanded
print
( # (1 start)
(?:
.+?
(?: \\ \r? \n )? # Possible line-continuation
)+
) # (1 end)
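Outside the IDE, the same find/replace can be tried with Python's re module. This is a sketch under one assumption: a literal space is added after print (as in the question's own search string) so that the space does not end up inside group 1. It handles only backslash continuations, not the bracket matching the question also mentions.

```python
import re

# Pattern from the answer, with a literal space after "print" (an assumption
# made here so that the space is not captured into group 1).
PRINT_STMT = re.compile(r"print ((?:.+?(?:\\\r?\n)?)+)")


def to_print_call(source):
    """Wrap the arguments of Python 2 style print statements in parentheses."""
    return PRINT_STMT.sub(r"print(\1)", source)


old = "print 'one' \\\n      'two'\nprint 'three'\n"
print(to_print_call(old))
```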
Kindly help me with my regular expression; I am new to regex. I am checking for the presence of an email address in a text field and a textarea field. My regex works as long as the email string contains no spaces, unlike this example:
raza chohan # gm ail . co m
I want it to ignore whitespace and line breaks wherever they occur. This is my regex:
/^(.(?!([a-z0-9._-](\s|\r\n|\n|\r){0,}(at|#)(\s|\r\n|\n|\r){0,}[a-z0-9._-]{2,}(\s|\r\n|\n|\r){0,}[a-z0-9._-]{0,}(\s|\r\n|\n|\r){0,}(\.|dot)(\s|\r\n|\n|\r){0,}[a-z]{2,})))*$/im
Kindly update this regex to skip whitespace and line breaks. Thank you!
As collapsar mentioned, regular expressions are not the only tool you can use. It isn't entirely clear from the question how you want the whitespace matching to change, but consider stripping whitespace (or at least newlines) from your string before you check for the email addresses. Then your regular expression can be simplified.
Also, using \b (that matches a word boundary) may be easier than explicitly mentioning all types of whitespace.
There is good advice on email matching on regular-expressions.info.
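The strip-first idea could look roughly like this in Python. The pattern is a simplified guess at what the question is after: it keeps the question's "at"/"#" and "dot" spellings and adds a plain "@", and contains_email is a hypothetical helper name.

```python
import re

# Simplified pattern, applied only after all whitespace has been removed.
# It keeps the question's "at"/"#" and "dot" spellings and adds a plain "@".
EMAIL_LIKE = re.compile(
    r"[a-z0-9._-]+(?:@|#|at)[a-z0-9._-]{2,}(?:\.|dot)[a-z]{2,}",
    re.IGNORECASE,
)


def contains_email(text):
    """Return True if the text, with whitespace stripped, looks like it hides an email."""
    squeezed = re.sub(r"\s+", "", text)
    return EMAIL_LIKE.search(squeezed) is not None


print(contains_email("raza chohan # gm ail . co m"))
print(contains_email("hello world"))
```

Removing all whitespace also glues ordinary words together, so accepting "at" and "dot" as separators can flag harmless prose; whether that is acceptable depends on the form being validated.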
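Usually, the dot metacharacter does not match newlines by default.
Not sure about R, but you do have the //m flag set (multi-line mode).
In most flavors that flag only changes what ^ and $ match, and it is the s flag that makes the dot match newlines; in a few flavors (Ruby, for one) m itself does that.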
Either way, you can get the equivalent by moving all the \s|\r\n stuff into a separate alternation.
This is your original with the negative lookahead after the match char.
# /^(.(?!([a-z0-9._-](at|#)[a-z0-9._-]{2,}[a-z0-9._-]{0,}(\.|dot)[a-z]{2,})|\s))*$/im
^
(
.
(?!
(
[a-z0-9._-]
( at | # )
[a-z0-9._-]{2,}
[a-z0-9._-]{0,}
( \. | dot )
[a-z]{2,}
)
|
\s
)
)*
$
This is how it should probably be, with the negative lookahead before the match char:
# /^((?!([a-z0-9._-](at|#)[a-z0-9._-]{2,}[a-z0-9._-]{0,}(\.|dot)[a-z]{2,})|\s).)*$/im
^
(
(?!
(
[a-z0-9._-]
( at | # )
[a-z0-9._-]{2,}
[a-z0-9._-]{0,}
( \. | dot )
[a-z]{2,}
)
|
\s
)
.
)*
$
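Why the position of the lookahead matters can be seen with a much smaller pattern. In this Python sketch the literal "bad" merely stands in for the email sub-pattern; it is not part of the original regex.

```python
import re

# Lookahead placed after the dot: position 0 is never tested.
after_dot = re.compile(r"^(?:.(?!bad))*$")
# Lookahead placed before the dot: every position is tested.
before_dot = re.compile(r"^(?:(?!bad).)*$")

for text in ["an apple", "a bad apple", "bad apple"]:
    print(text, bool(after_dot.search(text)), bool(before_dot.search(text)))
```

With the lookahead after the dot, a line that begins with the forbidden text still passes, because the check only ever runs after a character has been consumed.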
New to regex and I need to pattern match on some dates to change the format.
I'm going from mm/dd/yy to yyyy-mm-dd where there are no entries prior to 2000.
What I'm unfamiliar with is how to group things to use their respective references of \1, \2, etc.
Would I first want to match on mm/dd/yy with something like ( \d{2} ) ( \/\d{2} ) ( \/\d{2} ) or is it as easy as \d\d/\d\d/\d\d ?
Assuming my first grouping is partially the right idea, I'm looking to do something like:
:%s/old/new/g
:%s/ ( \d{2} ) ( \/\d{2} ) ( \/\d{2} ) / ( 20+\3) - (\3) - (\1) /g
EDIT: Sorry, the replace is going to a yyyy-mm-dd format with hyphens, not the slash.
I was going to comment on another answer but it got complicated.
Mind the magic setting. If you want unescaped parens to do grouping, you need to include \v somewhere in your pattern. (See :help magic).
You can avoid escaping the slashes if you use something other than slashes in the :s command.
You are close. :) You don't want all of those spaces though as they'll require spaces in the same places to match.
My solution, where I use \v so I don't need to escape the parens, and exclamation points as the delimiter so I can use slashes in my pattern without escaping them (for mm/dd/yy input, \1 is the month, \2 the day and \3 the year):
:%s!\v(\d{2})/(\d{2})/(\d{2})!20\3-\1-\2!g
This will also match "inside" items that start or end with three or more digits, though. If you can give begin/end criteria, that would possibly be helpful. Assuming that simple "word boundary" conditions work, you can use < and >:
:%s!\v<(\d{2})/(\d{2})/(\d{2})>!20\3-\1-\2!g
To critique yours specifically (for learning!):
:%s/ ( \d{2} ) ( \/\d{2} ) ( \/\d{2} ) / ( 20+\3) - (\3) - (\1) /g
Get rid of the spaces since presumably you don't want them!
Your grouping needs either \( \) or \v to work
You also need \{2} unless you use \v
You are putting the slashes in groups two and three which means they'll show up in the replacement too
You don't want the parentheses in the output!
You're substituting text directly; you don't want the + after the 20 in the output
Try this:
:%s/\(\d\{2}\)\/\(\d\{2}\)\/\(\d\{2}\)/20\3-\1-\2/g
The bits you're interested in are: \(...\) - capture; \d - a digit; \{N} - N occurrences; and \/ - a literal forward slash.
So that's capturing two digits, skipping a slash, capturing two more, skipping another slash, and capturing two more, then replacing it with "20" + the third couplet + "-" + the first couplet + "-" + the second couplet. That should turn "mm/dd/yy" into "20yy-mm-dd".
OK, try this one:
:0,$s#\(\d\{1,2\}\)/\(\d\{1,2\}\)/\(\d\{1,2\}\)#20\3-\1-\2#g
I've removed a lot of the spaces, both in the matching section and the replacement section, and most of the parens, because the format you were asking for didn't have them.
Some things of note:
With vi you can change the '/' to another character, which helps when you're trying to match a string with slashes in it. I usually use '#', but it doesn't have to be that.
You've got to escape the parens and the curly braces.
I use :0,$ instead of %, but I think it has the same meaning: apply the following command to every row between row 0 and the end.
For the match: (\d{2})\/(\d{2})\/(\d{2})
For the replace: 20\3-\1-\2
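The same match/replace pair can be checked outside Vim. Below is a small Python sketch for the mm/dd/yy input described in the question, so group 1 is the month, group 2 the day and group 3 the year; the word boundaries and the to_iso helper are additions made for this example.

```python
import re

# mm/dd/yy -> 20yy-mm-dd; \b keeps the match from starting or ending
# inside a longer run of digits.
DATE = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")


def to_iso(text):
    """Rewrite every mm/dd/yy date in text as yyyy-mm-dd, assuming years 2000-2099."""
    return DATE.sub(r"20\3-\1-\2", text)


print(to_iso("due 12/31/05, shipped 01/02/06"))
```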
I have a regex like this
(?<!(\w/))$#Cannot end with a word and slash
I would like to extract the comment from the end. While the example does not show this case, there could be a regex that itself matches on hashes:
\##value must be a hash
What would the regex be to extract the comment safely when it is used against a regex that could contain #'s that are not comments?
Here's a .Net flavored Regex for partly parsing .Net flavor patterns, which should get pretty close:
\A
(?>
\\. # Capture an escaped character
| # OR
\[\^? # a character class
(?:\\.|[^\]])* # which may also contain escaped characters
\]
| # OR
\(\?(?# inline comment!)\#
(?<Comment>[^)]*)
\)
| # OR
\#(?<Comment>.*$) # a common comment!
| # OR
[^\[\\#] # capture any regular character - not # or [
)*
\z
Luckily, in .Net each capturing group remembers all of its captures, and not just the last, so we can find all captures of the Comment group in a single parse. The regex pretty much parses a regular expression, though far from fully: it parses just enough to find comments.
Here's how you use the result:
Match parsed = Regex.Match(pattern, pattern,
RegexOptions.IgnorePatternWhitespace |
RegexOptions.Multiline);
if (parsed.Success)
{
foreach (Capture capture in parsed.Groups["Comment"].Captures)
{
Console.WriteLine(capture.Value);
}
}
Working example: http://ideone.com/YP3yt
One last word of caution - this regex assumes the whole pattern is in IgnorePatternWhitespace mode. When it isn't set, all # are matched literally. Keep in mind the flag might change multiple times in a single pattern. In (?-x)#(?x)#comment, for example, regardless of IgnorePatternWhitespace, the first # is matched literally, (?x) turns the IgnorePatternWhitespace flag back on, and the second # is ignored.
If you want a robust solution you can use a regex-language parser.
You can probably adapt the .Net source code and extract a parser:
Reference Source - RegexParser.cs
GitHub - RegexParser.cs
Something like this should work (if you run it separately on each line of the regex). The comment itself (if it exists) will be in the third capturing group.
/^((\\.)|[^\\\#])*\#(.*)/
(\\.) matches an escaped character and [^\\\#] matches any character that is neither a backslash nor a hash; together with the * quantifier they match the entire line before the comment. Then the rest of the regex detects the comment marker and extracts the text.
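For a quick check, the same expression can be run from Python against the two lines from the question; extract_comment is a hypothetical helper name. Note that this simple form does not track character classes, so a # inside [...] would be taken as the start of a comment.

```python
import re

# Regex from the answer above; group 3 holds the comment text.
COMMENT = re.compile(r"^((\\.)|[^\\\#])*\#(.*)")


def extract_comment(line):
    """Return the comment after the first unescaped '#', or None if there is none."""
    m = COMMENT.match(line)
    return m.group(3) if m else None


print(extract_comment(r"(?<!(\w/))$#Cannot end with a word and slash"))
print(extract_comment(r"\##value must be a hash"))
```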
One of the overlooked options in regex parsing is the RightToLeft mode.
extract the comment from the end.
One can simplify the pattern if we work our way from the end of the line to the beginning, such as:
^
.+? # Workable regex
(?<Comment> # Comment group
(?<!\\) # Not a comment if escaped.
\# # Anchor for actual comment
[^#]+ # The actual commented text to stop at #
)? # We may not have a comment
$
Use the above pattern in C# with these options: RegexOptions.RightToLeft | RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline
there could be a regex with includes regex on hashes
The line (?<!\\) # Not a comment if escaped. handles that situation by saying that if there is a preceding \, we do not have a comment.