Regex Match Multiline Chat Messages - regex

I am attempting to use regex to parse chat logs (namely Skype messages). The regex I am currently using matches Skype logs correctly, as long as they don't contain newlines.
So I tried adding the s modifier to the end, but this now makes it match everything (because it's now multiline). Is there a way to allow multiline messages but stop before the [ that begins the next Skype message?
My regex is here: https://regex101.com/r/nL0vO9/1

You can use a tempered greedy Token:
\[(?:(?!\n\[).)*
Note you also need to include the g modifier so you don't stop at the first match.
See Demo
Like #sln pointed out, if you want to keep new lines use this instead:
\[(?:.(?<!\n\[))*
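As a rough illustration of the first tempered greedy token above, here is a minimal Python sketch. The sample log text and the name split_messages are invented for the example; re.DOTALL plays the role of the s modifier and findall stands in for the g modifier.

```python
import re

# Tempered greedy token: consume any character (newlines included, thanks to
# DOTALL) as long as the current position is not a newline followed by "[".
MESSAGE = re.compile(r"\[(?:(?!\n\[).)*", re.DOTALL)


def split_messages(log):
    """Return one string per chat message, keeping embedded newlines."""
    return MESSAGE.findall(log)


# Hypothetical sample, not taken from a real Skype export.
sample = "[10:00] Alice: hello\nsecond line\n[10:01] Bob: hi"
messages = split_messages(sample)
print(messages)
```

Each returned string starts at the [ that opens a message and ends just before the newline that precedes the next one.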

Set the flags to //mg multi-line and global. Don't use the s Dot all flag.
edit: I guess you could use something simpler if you don't care about validating/parsing out the time/name/message parts. Either #CrayonViolent or #RodrigoLópez should work for that.
# ^\[([^\r\n\]]*)\]([^:\r\n]*):((?:(?!^\[).*(?:\r?\n)*)*)
^ # BOL
\[
( [^\r\n\]]* ) # (1), Time
\]
( [^:\r\n]* ) # (2), Person
:
( # (3 start), Message
(?: # Cluster group
(?! ^ \[ ) # Assert, not BOL and [
.* # Get all to end of line
(?: \r? \n )* # Optional, Get 0 to many line-breaks
)* # End cluster, do 0 to many times
) # (3 end)
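A small Python sketch of how the three capture groups might be consumed, assuming the log looks like the invented sample below. The names CHAT_LINE and parse_log are hypothetical; only the MULTILINE flag is set, matching the advice not to use the dot-all flag.

```python
import re

# Same pattern as above, compiled with MULTILINE only (no DOTALL), so "^"
# matches at every line start and "." stops at line breaks.
CHAT_LINE = re.compile(
    r"^\[([^\r\n\]]*)\]([^:\r\n]*):((?:(?!^\[).*(?:\r?\n)*)*)",
    re.MULTILINE,
)


def parse_log(log):
    """Yield (time, person, message) tuples with surrounding whitespace trimmed."""
    for m in CHAT_LINE.finditer(log):
        time, person, message = (part.strip() for part in m.groups())
        yield time, person, message


# Hypothetical sample, not taken from a real Skype export.
sample = "[10:00] Alice: hello\nsecond line\n[10:01] Bob: hi\n"
parsed = list(parse_log(sample))
print(parsed)
```

The message group keeps continuation lines because the negative lookahead only stops the loop when a line starts with [.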

How to comment SQL statements in Notepad++?

How can I "block comment" SQL statements in Notepad++?
For example:
CREATE TABLE gmr_virtuemart_calc_categories (
id int(1) UNSIGNED NOT NULL,
virtuemart_calc_id int(1) UNSIGNED NOT NULL DEFAULT '0',
virtuemart_category_id int(1) UNSIGNED NOT NULL DEFAULT '0'
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
It should be wrapped with /* at the start and */ at the end using regex in Notepad++ to produce:
/*CREATE TABLE ... (...) ENGINE=MyISAM DEFAULT CHARSET=utf8;*/
You only offer one sample input, so I am forced to build the pattern literally. If this pattern isn't suitable because there are alternative queries and/or other interfering text, then please update your question.
Tick the "Match case" box.
Find what: (CREATE[^;]+;) Replace with: /*$1*/
Otherwise, you can use this for sql query blocks that start with a capital and end in semicolon:
Find what: ([A-Z][^;]+;) Replace with: /*$1*/
To improve accuracy, you might include ^ start of line anchors or add \r\n after the semi-colon or match the CHARSET portion before the semi-colon. There are several adjustments that can be made. I cannot be confident of accuracy without knowing more about the larger body of text.
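For readers who want to try the first find/replace pair outside Notepad++, here is a minimal Python sketch of the same substitution. The function name block_comment and the shortened sample statement are assumptions made for the example; like the original pattern, it would stop early at a semicolon inside a quoted string.

```python
import re


def block_comment(sql):
    # "[^;]+" crosses line breaks on its own, so no dot-all flag is needed.
    # The search is case-sensitive by default, like ticking "Match case".
    return re.sub(r"(CREATE[^;]+;)", r"/*\1*/", sql)


# Hypothetical, shortened version of the statement in the question.
sample = (
    "CREATE TABLE t (\n"
    "  id int(1) UNSIGNED NOT NULL\n"
    ") ENGINE=MyISAM DEFAULT CHARSET=utf8;"
)
commented = block_comment(sample)
print(commented)
```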
You could use a recursive regex.
I think Notepad++ uses Boost or PCRE.
This works with both.
https://regex101.com/r/P75bXC/1
Find (?s)(CREATE\s+TABLE[^(]*(\((?:[^()']++|'.*?'|(?2))*\))(?:[^;']|'.*?')*;)
Replace /*$1*/
Explained
(?s) # Dot-all modifier
( # (1 start) The whole match
CREATE \s+ TABLE [^(]* # Create statement
( # (2 start), Recursion code group
\(
(?: # Cluster group
[^()']++ # Possessive, not parentheses or quotes
| # or,
' .*? ' # Quotes (can wrap in atomic group if need be)
| # or,
(?2) # Recurse to group 2
)* # End cluster, do 0 to many times
\)
) # (2 end)
# Trailer before the semicolon that ends the statement
(?: # Cluster group, can be atomic (?> ) if need be
[^;'] # Not quote or semicolon
| # or,
' .*? ' # Quotes
)* # End cluster, do 0 to many times
; # Semicolon at the end
) # (1 end)

Use regex to validate angular expressions in a paragraph input

I have a difficult user-input validation question (or at least it's difficult for me). I'm trying to make sure users are inputting a pre-defined subset of allowed Angular expressions if they try to add angular to their input at all.
I'm currently using http://www.regexpal.com/ (the actual implementation is in an HTML webpage using javascript) to test my expression and the two following cases:
VALID
Any text, punctuation (except double-{), or numb3r5 {{model.variable|phone}} is valid
Any text, punctuation (except double-{), or numb3r5 {{model.variable}} is valid.
Stick with the format {{model.variable|zipcode}} and we remain valid.
INVALID
Any text, punctuation (except double-{), or numb3r5 {{model.variable|phone}} is valid
Any text, punctuation (except double-{), or numb3r5 {{model.variable}} is valid.
Any deviation from the format, e.g. {{model.variable|custom}} makes the entire input invalid.
I figured out the regex to identify the three angular blocks and un-match the "custom" one...
{{model\.[^}|]+(\|((ein)|(phone)|(zipcode)|(currency:'':0)){1})?}}
... but I can't get it to enforce that regex. I tried lots of variations on lookaheads, and this is what I think I need, but it doesn't match the valid input, so obviously I'm off.
^(((.(?!({{)|(}})))*({{model\.[^}|]+(\|((ein)|(phone)|(zipcode)|(currency:'':0)){1})?}}))?)+$
Does anyone out there know how I might validate this input?
Nicely composed question. You described the problem and what you have tried.
Using Lookaheads is one solution but you may end up consuming the text for other purposes, so normal groups work fine here.
I would suggest: ^((?:^|[^\r\n\{]*)(?:\{(?:[^{]|$)|(?:\{{2}model\.variable(?:\|(?:(ein)|(phone)|(zipcode)|(currency:'':0)))?\}{2}|$)))+$ (demo)
Be aware that visibly empty strings can pass this regex. I would do a .trim().length check if that is an issue. I didn't think it was appropriate to add more bloat to this regex.
^ # Anchors to beginning of string or line,
# depending on multiline flag
( # Opens capturing group 1
(?: # Opens noncapturing group
^ # Anchors to the beginning of string or line
| # or
[^\r\n\{]* # Any character but carriage return, new line, {, zero or more times
) # Closes noncapturing group
(?: # Opens noncapturing group
\{ # Literal {
(?: # Opens noncapturing group
[^{] # Any character but {
# to filter out {{
| # or
$ # End of string or line
) # Closes noncapturing group
| # or
(?: # Opens noncapturing group
\{{2} # {, twice
model\.variable # model.variable
(?: # Opens noncapturing group
\| # Literal |
(?: # Opens noncapturing group
(ein) # ein as capturing group 2
| # or
(phone) # phone as capturing group 3
| # or
(zipcode) # zipcode as capturing group 4
| # or
(currency:'':0) # currency as capturing group 5
) # closes non-capturing group
)? # closes non-capturing group, iterates 0 or 1 times
\}{2} # }, twice
| # or
$ # end of string or line, depending on multiline
) #
) #
)+ #
$ #
Per: I'm going to run with this and see if I can get it to ignore the newlines/carriage returns when building the overall match for the entire input.
^((?:^|[^{]+)(?:\{(?:[^{]|$)|(?:\{{2}model\.variable(?:\|(?:(ein)|(phone)|(zipcode)|(currency:'':0)))?\}{2}|$)))+$ (demo)
I only needed to remove the single \r\n and remove the multiline flag.
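A different, two-step way to enforce the allow-list is sketched below in Python; it is not the single-regex approach above, just an alternative that may be easier to maintain. It blanks out every allowed expression (using the pattern from the question, with the braces escaped) and then rejects the input if any {{ or }} is left over. The names ALLOWED and is_valid are hypothetical, and the same idea should carry over to JavaScript.

```python
import re

# Allowed Angular expressions: {{model.<name>}} with an optional filter from
# the fixed list in the question.
ALLOWED = re.compile(
    r"\{\{model\.[^}|]+(?:\|(?:ein|phone|zipcode|currency:'':0))?\}\}"
)


def is_valid(text):
    """True when every {{...}} block in text is one of the allowed forms."""
    # Replace with a space rather than "" so that removing a block cannot
    # glue two single braces together into a new "{{".
    remainder = ALLOWED.sub(" ", text)
    return "{{" not in remainder and "}}" not in remainder


print(is_valid("Stick with {{model.variable|zipcode}} and {{model.variable}}."))
print(is_valid("Any deviation, e.g. {{model.variable|custom}}, is rejected."))
```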

Regex word can be optional but only if it matches the characters

Following pattern: (v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?((-schema)?(-dev)?)((-schema)?(-dev)?) from http://regexr.com/ is meant to be used in a shell script with grep and does match the following strings (working example):
Hello I am a text and this is my v1.12.33-32 version
Hello I am a text and this is my v1.12.33-dev version
Hello I am a text and this is my v1.12.33-dev-schema version
Hello I am a text and this is my v1.12.33-schema version
Hello I am a text and this is my v1.12.33-3-schema version
and so forth
So I made the words schema and dev optional. They can be omitted or used in an arbitrary order. What I don't want is this:
Hello I am a text and this is my v1.12.33-foo version
or Hello I am a text and this is my v1.12.33-asfs version
to match.
I want the option to be a bit more constrained. At the moment the Regex is still matching the stuff that...well actually matches.
This for example:
Hello I am a text and this is my v1.123.33
results in an empty string while this:
Hello I am a text and this is my v1.12.33-bla
still results in v1.12.33
Is this because of the grouping I made? So at least the fully matching groups will be taken for the returned match-string?
To match only the version string, disallow extra trailing tags, yet allow trailing unmatched text, you need a regex language that supports lookahead. Standard grep / egrep regexes do not support lookahead.
You have two options:
Since you seem to be relying on GNU grep anyway, you could use a Perl regex, such as
v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?((-schema(-dev)?)?|(-dev(-schema)?)?)?(?!\S)
The negative lookahead at the end allows the match to appear at the end of the line, but also requires that if it does not end the line then the next character following the match must be whitespace (which is not itself included in the match).
You could give up on completely isolating the target text via -o, and instead allow the pattern to match the trailing context, too:
v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?((-schema(-dev)?)?|(-dev(-schema)?)?)?(\s.*)?$
In this case, you could isolate the target text in a second step, by stripping off any tail beginning with whitespace.
Note that neither of these pays attention to text preceding the match. You have similar options for handling that portion as you do for handling the trailing portion.
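The lookahead variant can be checked outside grep with a short Python sketch. Python's re module is used here only as a convenient stand-in for a Perl-compatible engine; the names VERSION and find_version are hypothetical, and the test lines are the ones from the question.

```python
import re

# First option above: the trailing (?!\S) demands end of line or whitespace
# right after the version string, without including it in the match.
VERSION = re.compile(
    r"v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?"
    r"((-schema(-dev)?)?|(-dev(-schema)?)?)?(?!\S)"
)


def find_version(line):
    """Return the matched version string, or None when the line is rejected."""
    m = VERSION.search(line)
    return m.group(0) if m else None


prefix = "Hello I am a text and this is my "
for tail in ("v1.12.33-32 version", "v1.12.33-dev-schema version",
             "v1.12.33-foo version", "v1.123.33"):
    print(tail, "->", find_version(prefix + tail))
```

The last two lines yield None: the lookahead blocks the partial match that the original pattern returned for unknown tags.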
The problem seems to be all the optional expressions lurking at the edge (end).
You can solve that a few ways, but none are 100% reliable because you'd need more rules to control what matches.
It's not like you can say no - is allowed afterward; the engine will backtrack to one of the range digits {1,2} to make a match.
What seems to work for now is passing on a whitespace end edge or matching the dev/schema items.
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(?:(?!\S)|(-(schema|dev)(?:-(schema|dev))?))
Expanded
( # (1 start)
v [0-9]{1,2}
\. [0-9]{1,2}
\. [0-9]{1,2}
) # (1 end)
( - [0-9]{1,2} )? # (2)
(?:
(?! \S ) # Whitespace boundary
| # or,
( # (3 start)
-
( schema | dev ) # (4)
(?:
-
( schema | dev ) # (5)
)?
) # (3 end)
)
edit
If you want to avoid matching the same schema|dev word twice, just add
a negative assertion of group 4, before capture group 5 above.
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(?:(?!\S)|(-(schema|dev)(?:-(?!\4)(schema|dev))?))
Expanded
( # (1 start)
v [0-9]{1,2}
\. [0-9]{1,2}
\. [0-9]{1,2}
) # (1 end)
( - [0-9]{1,2} )? # (2)
(?:
(?! \S ) # Whitespace boundary
| # or,
( # (3 start)
-
( schema | dev ) # (4)
(?:
-
(?! \4 ) # Not same word twice
( schema | dev ) # (5)
)?
) # (3 end)
)
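To see what the (?!\4) assertion changes, here is a hedged Python sketch; the names TAGGED and match_text are made up for the example. Note that a repeated tag is not rejected outright: the match simply stops after the first tag, because nothing anchors the end of the pattern.

```python
import re

# Pattern from the edit above: group 4 holds the first tag, and (?!\4)
# refuses to capture the same word again as group 5.
TAGGED = re.compile(
    r"(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?"
    r"(?:(?!\S)|(-(schema|dev)(?:-(?!\4)(schema|dev))?))"
)


def match_text(s):
    """Return the matched text, or None when nothing matches."""
    m = TAGGED.search(s)
    return m.group(0) if m else None


for s in ("v1.12.33-dev-schema", "v1.12.33-dev-dev", "v1.12.33-foo"):
    print(s, "->", match_text(s))
```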
Since regular expressions are open-ended, you need to specify with $ where you want the match to end, so you don't let the regex engine silently ignore trailing junk.
With only two tags in the optional set, I would just enumerate the 4 possibilities:
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(-schema|-dev|-dev-schema|-schema-dev)?$
My version:
grep --perl-regexp \
'\bv(?:\d{1,2}\.){2}\d{1,2}(?:\-\d{1,2})?(?:\-(?:schema|dev))?(?:\s|$)' \
path/to/file
Where
the first \b is a word boundary (you might want to make it stricter);
(?: ... ) expressions are non-capturing groups;
\s|$ is either a space character, or the end of line
The rest is just refactored for simplicity.
The expression allows only schema, or dev at the "end".

Make sure the presence of a particular character in a string

I'm using the following regex to parse my application log file to search for particular string
\[\s*\b(?:[0-9A-Za-z][0-9A-Za-z-_.#]{0,200})(?:\.(?:[0-9*A-Za-z][0-9A-Za-z-_.#]{0,200}))*(\.?|\b)\s*]
This works fine, but now we need to make sure that the string must contain a "-" character in order to match. I'm not sure how to add this condition to the original regex.
Any pointers would be helpful.
Thanks and Regards,
Santhosh
The regex matches a string inside square brackets, [ and ], and may only consist of non-[ and non-] symbols.
You can easily add a positive lookahead right after the opening [ that checks whether a - appears somewhere before the next ] or [:
\[ # opening [
(?=[^\]\[]*-) # There must be a hyphen in [...]
\s* # 0+ whitespaces
\b(?:[0-9A-Za-z][0-9A-Za-z-_.#]{0,200}) # Part 1 (with obligatory subpattern)
(?:\. # Part 2, optional
(?:[0-9*A-Za-z][0-9A-Za-z-_.#]{0,200})
)*
(\.?|\b) # optional . or word boundary
\s* # 0+ whitespaces
] # closing ]
See the regex demo
And a one-liner:
\[(?=[^\]\[]*-)\s*\b(?:[0-9A-Za-z][0-9A-Za-z-_.#]{0,200})(?:\.(?:[0-9*A-Za-z][0-9A-Za-z-_.#]{0,200}))*(\.?|\b)\s*]
Tip: use the verbose /x modifier to split the pattern into separate multiline blocks for analysis, it will help you in the future when you need to modify the pattern again.
If you need to match only if - or # is present inside [...], modify the lookahead as (?=[^\]\[]*[-#]). For a more general case, use (?=[^\]\[]*(?:one|another|must-be-present)) alternatives inside an additional group inside the lookahead.
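A minimal Python sketch of the lookahead version follows. It assumes the hyphen can be moved to the end of each character class (which keeps it literal and avoids any ambiguity about ranges); the name BRACKETED and the sample strings are invented for the example.

```python
import re

# One-liner from above with the (?=[^\]\[]*-) lookahead; the hyphen is placed
# last in each character class so it is unambiguously a literal.
BRACKETED = re.compile(
    r"\[(?=[^\]\[]*-)\s*\b(?:[0-9A-Za-z][0-9A-Za-z_.#-]{0,200})"
    r"(?:\.(?:[0-9*A-Za-z][0-9A-Za-z_.#-]{0,200}))*(\.?|\b)\s*\]"
)

for s in ("[ foo-bar.baz ]", "[foo.bar]", "[foo] a-b [x]"):
    m = BRACKETED.search(s)
    print(repr(s), "->", m.group(0) if m else None)
```

The third sample shows that a hyphen outside the brackets does not satisfy the lookahead, since [^\]\[]* cannot cross a closing bracket.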
Updated answer - Assertion
In this case, the better way to do it is to use an assertion that checks only the positions expected to match the character in question.
I know it's simple, but using the outer pseudo-anchor text \[ ... \] as
a delimiter that cannot exist in the body is a rarity.
You should always try to avoid doing it like this.
Things change, your input could change.
The rule to follow in validation of known characters that are Mid-String is to use only them
when using an assertion validator.
This avoids the necessity of relying on what is not there at the moment ie, not a ],
but should rely on what is there.
Again, this pertains to mid-string matching.
BOL/EOL is a different thing entirely ^$, and is a more permanent construct
with which to leverage.
It's always better to code smarter.
\[\s*\b(?=[0-9A-Za-z][0-9A-Za-z_.#]{0,199}-|[0-9A-Za-z][0-9A-Za-z_.#]{0,200}(?:\.[0-9*A-Za-z][0-9A-Za-z_.#]{0,200})*\.[0-9*A-Za-z][0-9A-Za-z_.#]{0,199}-)(?:[0-9A-Za-z][0-9A-Za-z_.#-]{0,200})(?:\.(?:[0-9*A-Za-z][0-9A-Za-z_.#-]{0,200}))*(\.?|\b)\s*\]
Using Conditionals
If your engine supports conditionals, the easy way is to not rely on a fluke
of pseudo anchor text, ie. [..].
\[\s*\b[0-9A-Za-z](?:[0-9A-Za-z_.#]|(-)){0,200}(?:\.(?:[0-9*A-Za-z](?:[0-9A-Za-z_.#]|(-)){0,200}))*(\.?|\b)\s*\](?(1)|(?(2)|(?!)))
Expanded
\[ \s* \b
[0-9A-Za-z]
(?:
[0-9A-Za-z_.#]
| ( - ) # (1)
){0,200}
(?:
\.
(?:
[0-9*A-Za-z]
(?:
[0-9A-Za-z_.#]
| ( - ) # (2)
){0,200}
)
)*
( \.? | \b ) # (3)
\s* \]
(?(1) # Fail if no dash found
| (?(2)
| (?!)
)
)
This conditional would work if you just want to make sure that - occurs within your string before running your block of code.
if (myString.indexOf('-') >= 0) {
    // your code
}
If you have to have a single hyphen, you'll have to either repeat most of the pattern, or check for it in a second phase:
if re.match(pattern, line):
    if '-' not in line:
        raise MissingDash('No dash in line: {}'.format(line))
I'd suggest adding the second check, since adding the requirement to the regex would make it even more horrible to read.

Remove the text outside the first brackets in R

I know that it was asked a lot of times, but I've tried to adapt the other answers to my need and I was not able to make it work using SKIP and FAIL (I'm a bit confused, I've to admit)
I'm using R actually.
The url I need to clean is:
url <- "posts.fields(id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0))"
and I need to retain only the content inside the first brackets that are always prefixed by the word "fields" (while "posts" may vary). In other words something like
id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0)
As you may see, there is some nesting inside. But I could eventually change my source code to accept this string too (removing every parenthesis after every prefix)
id,from,message,comments,likes
I don't know how to remove the trailing parenthesis which balances the first.
If it's good enough to just remove everything up to and including the first open parenthesis and also remove the last close parenthesis and thereafter then:
sub("^.*?\\((.*)\\)[^)]*$", "\\1", url)
Note:
If it's good enough to just remove the first open parenthesis and last close parenthesis then try this:
sub("\\((.*)\\)", "\\1", url)
Using a lazy .*? instead of a greedy .* before "fields":
sub(".*?fields\\((.*)\\)", "\\1", url)
[1] "id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0)"
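For comparison, the same lazy-prefix substitution can be sketched in Python (the question itself is about R; this port is only an illustration). The greedy (.*) backtracks to the last closing parenthesis, which balances the first one only because the string ends with it.

```python
import re

url = ("posts.fields(id,from.fields(id,name),message,"
       "comments.summary(true).limit(0),likes.summary(true).limit(0))")

# Lazy prefix up to the first "fields(", then capture greedily up to the
# last ")" in the string.
inner = re.sub(r".*?fields\((.*)\)", r"\1", url)
print(inner)
```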
You need to use a recursive pattern:
sub("[^.]*+(?:\\.(?!fields\\()[^.]*)*+\\.fields\\(([^()]*+(?:\\((?1)\\)[^()]*)*+)\\)(?s:.*)", "\\1", url, perl=TRUE)
demo
details:
# reach the dot before "fields("
[^.]*+ # all except a dot (possessive)
(?: # open a non-capturing group
\\. # a literal dot
(?!fields\\() # not followed by "fields("
[^.]* # all except a dot
)*+ # repeat the group zero or more times
\\.fields\\(
# match content between parentheses with any level of nesting
( # open the capture group 1
[^()]*+ # 0 or more characters that are not brackets (possessive)
(?: # open a non capturing group
\\(
(?1) # recursion in group 1
\\) #
[^()]* # all that is not a bracket
)*+ # close the non capturing group and repeat 0 or more time (possessive)
) # close the capture group 1
\\)
(?s:.*) # end of the string
Possessive quantifiers are used here to limit the backtracking when for any reason a part of the pattern fails.
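Where a regex engine without recursion is all that is available, the balanced match can be approximated with an explicit depth counter. The Python sketch below swaps in that technique rather than porting the recursive pattern; the function name extract_fields and its marker parameter are hypothetical.

```python
def extract_fields(url, marker=".fields("):
    """Return the text inside the first balanced marker(...) pair, or None."""
    start = url.find(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # the "(" at the end of the marker is already open
    for i in range(begin, len(url)):
        ch = url[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # Found the ")" that balances the marker's "(".
                return url[begin:i]
    return None  # unbalanced input


url = ("posts.fields(id,from.fields(id,name),message,"
       "comments.summary(true).limit(0),likes.summary(true).limit(0))")
print(extract_fields(url))
```

Unlike the greedy patterns earlier, this stops at the parenthesis that actually balances the first one, even if more text follows it.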