I have a difficult user-input validation question (or at least it's difficult for me). I'm trying to make sure users are inputting a pre-defined subset of allowed Angular expressions if they try to add angular to their input at all.
I'm currently using http://www.regexpal.com/ (the actual implementation is in an HTML webpage using javascript) to test my expression and the two following cases:
VALID
Any text, punctuation (except double-{), or numb3r5 {{model.variable|phone}} is valid
Any text, punctuation (except double-{), or numb3r5 {{model.variable}} is valid.
Stick with the format {{model.variable|zipcode}} and we remain valid.
INVALID
Any text, punctuation (except double-{), or numb3r5 {{model.variable|phone}} is valid
Any text, punctuation (except double-{), or numb3r5 {{model.variable}} is valid.
Any deviation from the format, e.g. {{model.variable|custom}} makes the entire input invalid.
I figured out the regex to identify the three angular blocks and un-match the "custom" one...
{{model\.[^}|]+(\|((ein)|(phone)|(zipcode)|(currency:'':0)){1})?}}
... but I can't get it to enforce that regex. I tried lots of variations on lookaheads, and this is what I think I need, but it doesn't match the valid input, so obviously I'm off.
^(((.(?!({{)|(}})))*({{model\.[^}|]+(\|((ein)|(phone)|(zipcode)|(currency:'':0)){1})?}}))?)+$
Does anyone out there know how I might validate this input?
Nicely composed question. You described the problem and what you have tried.
Using Lookaheads is one solution but you may end up consuming the text for other purposes, so normal groups work fine here.
I would suggest:^((?:^|[^\r\n\{]*)(?:\{(?:[^{]|$)|(?:\{{2}model\.variable(?:\|(?:(ein)|(phone)|(zipcode)|(currency:'':0)))?\}{2}|$)))+$ (demo)
Be aware that visibly empty strings can pass this regex. I would do a .trim().length check if that is an issue. I didn't think it was appropriate to add more bloat to this regex.
^ # Anchors to beginning of string or line,
# depending on multinline flag
( # Opens capturing group 1
(?: # Opens noncapturing group
^ # Anchors to the beginning of string or line
| # or
[^\r\n\{]* # Any character but carriage return, new line, {, one or more times
) # Closes noncapturing group
(?: # Opens noncapturing group
\{ # Literal {
(?: # Opens noncapturing group
[^{] # Any character but {
# to filter {{'ss
| # or
$ # End of string or line
) # Closes noncapturing group
| # or
(?: # Opens noncapturing group
\{{2} # {, twice
model\.variable # model.variable
(?: # Opens noncapturing group
\| # Literal |
(?: # Opens noncapturing group
(ein) # ein as capturing group 2
| # or
(phone) # phone as capturing group 3
| # or
(zipcode) # zipcode as capturing group 4
| # or
(currency:'':0) # currency as capturing group 5
) # closes non-capturing group
)? # closes non-capturing group, iternates 0 or 1 times
\}{2} # }, twice
| # or
$ # end of string or line, dependong on multiline
) #
) #
)+ #
$ #
Per: I'm going to run with this and see if I can get it to ignore the newlines/carriage returns when building the overall match for the entire input.
^((?:^|[^{]+)(?:\{(?:[^{]|$)|(?:\{{2}model\.variable(?:\|(?:(ein)|(phone)|(zipcode)|(currency:'':0)))?\}{2}|$)))+$ (demo)
I only needed to remove the single \r\n and remove the multiline flag.
Related
I have a Python script in which I'm trying to parse a string of the form:
one[two=three].four
Each word should be in its own capture group. The punctuation should not be captured.
Additionally, each part of the string is optional, and the part delimited by brackets can be repeated. So the above is the most complete example, but all of the following should also be valid matches:
one
.four
one[two=three][five=six]
[two=three]
[two].four
[two][five]
[]
In the case that one of the words is not present, instead of failing to capture, I'd like to capture a string of length 0.
The regex that I'm using is as follows:
pattern = re.compile(
r"""
^ # Assert start of string
(?P<cap1> # Start a new group for "one"
[a-z]* #
) #
(?: # Start a group for "two" and "three"
\[ # Match the "["
(?P<cap_2> # Start a group for "two"
[a-z]* #
) #
=? # Delimit two/three with "="
(?P<cap_3> # Start a group for "three"
[a-z]* #
) #
\] # Match the "]"
)* # End the two-three group, allowing repeats
\.? # Delimit three/four with "."
(?P<cap_4> # Begin a group for "four"
[a-z]* #
) #
$ # Assert end of string
""", re.IGNORECASE|re.VERBOSE)
What I've tried to do during that regex is, instead of allowing 0 or 1 of a group by appending ? to the entire group, I allowed any number of characters to be in the actual match itself by appending * to the character selection. Therefore, the match is forced to exist, but the string itself can have a length of 0.
The problem comes with the bracketed block. The package I'm using allows me to access all captures of a named group using match.captures(groupname). This way, I can access all matches for cap_2 using match.captures("cap_2"):
>>> pattern.match("one[two=three][five=six].four").captures("cap_2")
["two", "five"]
This works fine when the brackets are present. However, when they're not:
>>> pattern.match("one.four").captures("cap_2")
[]
Expected: [""]
I expect there to be at least an empty string present for cap_2 and cap_3. However, there's nothing.
This is because of the * I place after the two+three section of the regex, in order to allow multiple of those groups - this is allowing that part of the regex to be skipped altogether.
Changing that * to + breaks the regex, as now it won't match the above example at all because it's trying to match the brackets. Adding a ? after each bracket means that cap_1 and cap_2 are not delimited and includes what should be in cap_4 in cap_3.
What's the solution here? How can I allow a group containing two capturing groups to be executed multiple times, but match only empty strings when the brackets are not present?
You may solve the problem by replacing * after the (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* repeated group with + and adding an alternative with a second occurrence of groups cap_2 and cap_3 (note that PyPi regex module supports multiple identically named groups in the same regex):
import regex as re
s = 'one.four'
pattern = re.compile(
r"""
^ # Assert start of string
(?P<cap1> # Start a new group for "one"
[a-z]* #
) #
(?:
(?: # Start a group for "two" and "three"
\[ # Match the "["
(?P<cap_2> # Start a group for "two"
[a-z]* #
) #
=? # Delimit two/three with "="
(?P<cap_3> # Start a group for "three"
[a-z]* #
) #
\] # Match the "]"
)+ # End the two-three group, allowing repeats
|
(?P<cap_2>)(?P<cap_3>)
)
\.? # Delimit three/four with "."
(?P<cap_4> # Begin a group for "four"
[a-z]* #
) #
$ # Assert end of string
""", re.IGNORECASE|re.VERBOSE)
print ( pattern.match("one.four").captures("cap_2") )
# => ['']
See the Python demo
The thing is, the (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* part matches by all means since it can match an empty string, and if you just add the alternatives without changing the modifier, the expected results won't be achieved. So, if there is no [...]s, the second cap_2 and cap_3 groups with empty patterns willmatch by all means capturing an empty string.
if you want it to either match the empty string or something else, you need the OR operator: |
if you want your regexp to match an empty string, you need something that matches the empty string: e.g. () or (not empty|)
Combined and applied to your case, that would look like this (simplified):
((?:\[stuff inside the brackets\])+|)
The outermost group captures the whole bracket construct (e.g. [two][three]) if it's present or the empty string. Notice that the left part of the | operator now has to match at least once (+).
I know that it was asked a lot of times, but I've tried to adapt the other answers to my need and I was not able to make it work using SKIP and FAIL (I'm a bit confused, I've to admit)
I'm using R actually.
The url I need to clean is:
url <- "posts.fields(id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0))"
and I need to retain only the content inside the first brackets that are always prefixed by the word "fields" (while "posts" may vary). In other words something like
id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0)
As you may see there're some nesting inside. But I eventually could change my source code to accept this string too (removing every parhentesis by every prefix)
id,from,message,comments,likes
I don't know on how to remove the trailing parhentesis which balances the first.
If it's good enough to just remove everything up to and including the first open parenthesis and also remove the last close parenthesis and thereafter then:
sub("^.*?\\((.*)\\)[^)]*$", "\\1", url)
Note:
If it's good enough to just remove the first open parenthesis and last close parenthesis then try this:
sub("\\((.*)\\)", "\\1", url)
Using lazy .* instead of greedy:
sub(".*?fields\\((.*)\\)", "\\1", url)
[1] "id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0)"
You need to use a recursive pattern:
sub("[^.]*+(?:\\.(?!fields\\()[^.]*)*+\\.fields\\(([^()]*+(?:\\((?1)\\)[^()]*)*+)\\)(?s:.*)", "\\1", url, perl=T)
demo
details:
# reach the dot before "fields("
[^.]*+ # all except a dot (possessive)
(?: # open a non-capturing group
\\. # a literal dot
(?!fields\\() # not followed by "fields("
[^.]* # all except a dot
)*+ # repeat the group zero or more times
\\.fields\\(
# match a content between parenthesis with any level of nesting
( # open the capture group 1
[^()]*+ # 0 or more character that are not brackets (possessive)
(?: # open a non capturing group
\\(
(?1) # recursion in group 1
\\) #
[^()]* # all that is not a bracket
)*+ # close the non capturing group and repeat 0 or more time (possessive)
) # close the capture group 1
\\)
(?s:.*) # end of the string
Possessive quantifiers are used here to limit the backtracking when for any reason a part of the pattern fails.
I am attemping to use regex to parse chat logs (namely Skype messages). So the regex I am currently using matches Skype logs correctly....as long as they don't have new lines.
So I tried adding the s modifier to the end, but this now makes it match everything (because its now multiline). So I was wondering if there was a way to both allow multiline, but stop before the [ at the beginning of Skype messages.
My regex is here: https://regex101.com/r/nL0vO9/1
You can use a tempered greedy Token:
\[(?:(?!\n\[).)*
Note you also need to include g modifier so you don't stop on first match
See Demo
Like #sln pointed out, if you want to keep new lines use this instead:
\[(?:.(?<!\n\[))*
Set the flags to //mg multi-line and global. Don't use the s Dot all flag.
edit: I guess you could use something simpler if you don't care about validating/parsing out the time/name/message parts. Either #CrayonViolent or #RodrigoLópez should work for that.
# ^\[([^\r\n\]]*)\]([^:\r\n]*):((?:(?!^\[).*(?:\r?\n)*)*)
^ # BOL
\[
( [^\r\n\]]* ) # (1), Time
\]
( [^:\r\n]* ) # (2), Person
:
( # (3 start), Message
(?: # Cluster group
(?! ^ \[ ) # Assert, not BOL and [
.* # Get all to end of line
(?: \r? \n )* # Optional, Get 1 to many line-breaks
)* # End cluster, do 1 to many times
) # (3 end)
In a text editor, I want to replace a given word with the number of the line number on which this word is found. Is this is possible with Regex?
Recursion, Self-Referencing Group (Qtax trick), Reverse Qtax or Balancing Groups
Introduction
The idea of adding a list of integers to the bottom of the input is similar to a famous database hack (nothing to do with regex) where one joins to a table of integers. My original answer used the #Qtax trick. The current answers use either Recursion, the Qtax trick (straight or in a reversed variation), or Balancing Groups.
Yes, it is possible... With some caveats and regex trickery.
The solutions in this answer are meant as a vehicle to demonstrate some regex syntax more than practical answers to be implemented.
At the end of your file, we will paste a list of numbers preceded with a unique delimiter. For this experiment, the appended string is :1:2:3:4:5:6:7 This is a similar technique to a famous database hack that uses a table of integers.
For the first two solutions, we need an editor that uses a regex flavor that allows recursion (solution 1) or self-referencing capture groups (solutions 2 and 3). Two come to mind: Notepad++ and EditPad Pro. For the third solution, we need an editor that supports balancing groups. That probably limits us to EditPad Pro or Visual Studio 2013+.
Input file:
Let's say we are searching for pig and want to replace it with the line number.
We'll use this as input:
my cat
dog
my pig
my cow
my mouse
:1:2:3:4:5:6:7
First Solution: Recursion
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested).
The recursive structure lives in a lookahead, and is optional. Its job is to balance lines that don't contain pig, on the left, with numbers, on the right: think of it as balancing a nested construct like {{{ }}}... Except that on the left we have the no-match lines, and on the right we have the numbers. The point is that when we exit the lookahead, we know how many lines were skipped.
Search:
(?sm)(?=.*?pig)(?=((?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?:(?1)|[^:]+)(:\d+))?).*?\Kpig(?=.*?(?(2)\2):(\d+))
Free-Spacing Version with Comments:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # fail right away if pig isn't there
(?= # The Recursive Structure Lives In This Lookahead
( # Group 1
(?: # skip one line
^
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
(?:(?1)|[^:]+) # recurse Group 1 OR match all chars that are not a :
(:\d+) # match digits
)? # End Group
) # End lookahead.
.*?\Kpig # get to pig
(?=.*?(?(2)\2):(\d+)) # Lookahead: capture the next digits
Replace: \3
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Second Solution: Group that Refers to Itself ("Qtax Trick")
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested). The solution is easy to adapt to .NET by converting the \K to a lookahead and the possessive quantifier to an atomic group (see the .NET Version a few lines below.)
Search:
(?sm)(?=.*?pig)(?:(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*+.*?\Kpig(?=[^:]+(?(1)\1):(\d+))
.NET version: Back to the Future
.NET does not have \K. It its place, we use a "back to the future" lookbehind (a lookbehind that contains a lookahead that skips ahead of the match). Also, we need to use an atomic group instead of a possessive quantifier.
(?sm)(?<=(?=.*?pig)(?=(?>(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*).*)pig(?=[^:]+(?(1)\1):(\d+))
Free-Spacing Version with Comments (Perl / PCRE Version):
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# for each line skipped, let Group 1 match an ever increasing portion of the numbers string at the bottom
(?= # lookahead
[^:]+ # skip all chars that are not colons
( # start Group 1
(?(1)\1) # match Group 1 if set
:\d+ # match a colon and some digits
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop everything we've matched so far
pig # match pig (this is the match!)
(?=[^:]+(?(1)\1):(\d+)) # capture the next number to Group 2
Replace:
\2
Output:
my cat
dog
my 3
my cow
my mouse
:1:2:3:4:5:6:7
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Choice of Delimiter for Digits
In our example, the delimiter : for the string of digits is rather common, and could happen elsewhere. We can invent a UNIQUE_DELIMITER and tweak the expression slightly. But the following optimization is even more efficient and lets us keep the :
Optimization on Second Solution: Reverse String of Digits
Instead of pasting our digits in order, it may be to our benefit to use them in the reverse order: :7:6:5:4:3:2:1
In our lookaheads, this allows us to get down to the bottom of the input with a simple .*, and to start backtracking from there. Since we know we're at the end of the string, we don't have to worry about the :digits being part of another section of the string. Here's how to do it.
Input:
my cat pi g
dog p ig
my pig
my cow
my mouse
:7:6:5:4:3:2:1
Search:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line that doesn't have pig
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# Group 1 matches increasing portion of the numbers string at the bottom
(?= # lookahead
.* # get to the end of the input
( # start Group 1
:\d+ # match a colon and some digits
(?(1)\1) # match Group 1 if set
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop match so far
pig # match pig (this is the match!)
(?=.*(\d+)(?(1)\1)) # capture the next number to Group 2
Replace: \2
See the substitutions in the demo.
Third Solution: Balancing Groups
This solution is specific to .NET.
Search:
(?m)(?<=\A(?<c>^(?:(?!pig)[^\r\n])*(?:\r?\n))*.*?)pig(?=[^:]+(?(c)(?<-c>:\d+)*):(\d+))
Free-Spacing Version with Comments:
(?xm) # free-spacing, multi-line
(?<= # lookbehind
\A #
(?<c> # skip one line that doesn't have pig
# The length of Group c Captures will serve as a counter
^ # beginning of line
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
) # end skipper
* # repeat skipper
.*? # we're on the pig line: lazily match chars before pig
) # end lookbehind
pig # match pig: this is the match
(?= # lookahead
[^:]+ # get to the digits
(?(c) # if Group c has been set
(?<-c>:\d+) # decrement c while we match a group of digits
* # repeat: this will only repeat as long as the length of Group c captures > 0
) # end if Group c has been set
:(\d+) # Match the next digit group, capture the digits
) # end lokahead
Replace: $1
Reference
Qtax trick
On Which Line Number Was the Regex Match Found?
Because you didn't specify which text editor, in vim it would be:
:%s/searched_word/\=printf('%-4d', line('.'))/g (read more)
But as somebody mentioned it's not a question for SO but rather Super User ;)
I don't know of an editor that does that short of extending an editor that allows arbitrary extensions.
You could easily use perl to do the task, though.
perl -i.bak -e"s/word/$./eg" file
Or if you want to use wildcards,
perl -MFile::DosGlob=glob -i.bak -e"BEGIN { #ARGV = map glob($_), #ARGV } s/word/$./eg" *.txt
I'm using Visual Basic .NET and I'm trying to download a string of HTML, and I want to replace this
id="dynamicstring"
With
id="replacement"
The dynamicstring can be anything, that's why I'm having trouble replacing it.
You can match the content of the id attribute with this pattern:
(?<=<div\b(?>[^i]+|\Bi|i(?!d\s*=))*id\s*=\s*")[^"]+
details:
(?<= # open a look behind assertion (it's just a check
# nothing is matched inside it)
<div\b # div tag
(?> # atomic group (all the content until the id attribute
[^i]+ # all that is not a "i"
| # OR
\Bi # a "i" not preceded by a word boundary
| # OR
i(?!d\s*=) # a "i" (with an implicite word boundary)
# not followed by "d="
)* # close the atomic group and repeat as necessary
id\s*=\s*" # the id attribute until the first double quote
) # close the lookbehind
[^"]+ # content of the id attribute
# (all that is not a double quote)