Regex timing out - regex

I'm trying to match something like
foo: anything after the colon can be matched with (.*)+
foo.bar1.BAZ: balh5317{}({}(
This is the regex I'm using:
/^((?:(?:(?:[A-Za-z_]+)(?:[0-9]+)?)+[\.]?)+)(?:\s)?(?:\:)(?:\s)?((?:.*)+)$/
Excuse the non-capturing groups and extra parens; this is being compiled from a builder class.
This works on the examples. The problem arises when I try to put in a string like this:
foo.bar.baz.beef.stew.ect.and.forward
I need to be able to check strings like this, but the regex engine times out or runs indefinitely (as far as I can tell) once the string contains a certain number of foo. segments.
I'm sure this is a logical problem I could figure out, but unfortunately I've far from mastered regex and I hoped a more experienced user could shed some light on how I can make this more efficient.
Also, here is a more detailed description of what I need to match:
Property Name: can contain A-Z, a-z, numbers, and underscores, but can't start with a number
<Property Name>.<Property Name>.<Prop...:<Anything after the colon>
Thanks for your time!

Starting with your regex:
^((?:(?:(?:[A-Za-z_]+)(?:[0-9]+)?)+[\.]?)+)(?:\s)?(?:\:)(?:\s)?((?:.*)+)$
^ # Anchors to the beginning of the string.
( # Opens CG1
(?: # Opens NCG
(?: # Opens NCG
(?: # Opens NCG
[A-Za-z_]+ # Character class (any of the characters within)
) # Closes NCG
(?: # Opens NCG
[0-9]+ # Character class (any of the characters within)
)? # Closes NCG
)+ # Closes NCG
[\.]? # Character class (any of the characters within)
)+ # Closes NCG
) # Closes CG1
(?: # Opens NCG
\s # Token: \s (white space)
)? # Closes NCG
(?: # Opens NCG
\: # Literal :
) # Closes NCG
(?: # Opens NCG
\s # Token: \s (white space)
)? # Closes NCG
( # Opens CG2
(?: # Opens NCG
.* # . denotes any single character, except for newline
)+ # Closes NCG
) # Closes CG2
$ # Anchors to the end of the string.
I converted [0-9] to \d, simply for easier readability (both match the same thing). I also removed a lot of non-capturing groups because they weren't really being used.
^((?:(?:[A-Za-z_]+\d*)+\.?)+)\s?\:\s?((?:.*)+)$
I also merged the \s and .* into [\s\S]*, but seeing that it was followed by a + sign, I removed the group and just made it [\s\S]+.
^((?:(?:[A-Za-z_]+\d*)+\.?)+)\s?\:([\s\S]+)$
                      ^
Now I'm not sure what the + above the caret is supposed to do. We can remove it, and with it the non-capturing group it applies to.
^((?:[A-Za-z_]+\d*\.?)+)\s?\:([\s\S]+)$
Explanation:
^ # Anchors to the beginning of the string.
( # Opens CG1
(?: # Opens NCG
[A-Za-z_]+ # Character class (any of the characters within)
\d* # Token: \d (digit)
\.? # Literal .
)+ # Closes NCG
) # Closes CG1
\s? # Token: \s (white space)
\: # Literal :
( # Opens CG2
[\s\S]+ # Character class (any of the characters within)
) # Closes CG2
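$ # Anchors to the end of the string.
Note that [\s\S]+ also matches across newlines. If each entry sits on its own line of a multi-line input, you might want to change it back to .* instead. There are a few different options for that, but they depend on what language you're using.
Honestly, I did this in steps, but the largest problem was nested quantifiers such as (?:.*)+, which tells the engine to match zero or more characters, one or more times. The ((?:[A-Za-z_]+)(?:[0-9]+)?)+ part of the name has the same shape. When the overall match fails (for example, when the input has no colon), the engine tries every way of splitting the text between the inner and outer quantifier, which is catastrophic backtracking (as xufox linked to in the comments).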
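As a rough illustration of the backtracking described above, here is a minimal Python sketch. The NESTED pattern is a reduced stand-in for the name part of the original regex, not the exact expression, and FLAT is an equivalent pattern without the nested quantifier; both names are made up for this example. Timings depend on the engine and the machine, so none are quoted here.

```python
import re
import time

# Reduced stand-in for the nested quantifier in the original pattern:
# an inner "+" wrapped in an outer "+", followed by a required colon.
NESTED = re.compile(r"^(?:[A-Za-z_]+\d*)+:")
# Accepts the same names, but each character can only be consumed one way.
FLAT = re.compile(r"^[A-Za-z_][A-Za-z_\d]*:")


def time_failed_match(pattern, n):
    """Return (match, seconds) for an input of n letters with no colon."""
    text = "a" * n
    start = time.perf_counter()
    match = pattern.match(text)
    return match, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the nested pattern's work roughly doubles per extra letter.
    for n in (10, 14, 18):
        _, nested_time = time_failed_match(NESTED, n)
        _, flat_time = time_failed_match(FLAT, n)
        print(f"n={n:2d}  nested: {nested_time:.4f}s  flat: {flat_time:.6f}s")
```

Both patterns accept the same inputs when a colon is present; the difference only shows up when the match has to fail.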
The resulting regex, and your original too, permits names that end in a dot. I'd use something more like this, which your regex really wasn't that far from.
This will match names like foo.ba5r, if that's acceptable.
^([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)\s?\:([\s\S]+)$
Explanation:
^ # Anchors to the beginning of the string.
( # Opens CG1
[A-Za-z_] # Character class (any of the characters within)
\w* # Token: \w (a-z, A-Z, 0-9, _)
(?: # Opens NCG
\. # Literal .
[A-Za-z_] # Character class (any of the characters within)
\w* # Token: \w (a-z, A-Z, 0-9, _)
)* # Closes NCG
) # Closes CG1
\s? # Token: \s (white space)
\: # Literal :
( # Opens CG2
[\s\S]+ # Character class (any of the characters within)
) # Closes CG2
$ # Anchors to the end of the string.
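To try the pattern from the breakdown above, here is a small Python sketch. The question doesn't say which language is in use, so Python is an assumption, as are the names PROPERTY_LINE and parse. re.ASCII is set so that \w means only a-z, A-Z, 0-9 and _, as the breakdown describes.

```python
import re

# Pattern as broken down above: dot-separated names, optional space, colon, rest.
PROPERTY_LINE = re.compile(
    r"^([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)\s?:([\s\S]+)$",
    re.ASCII,
)


def parse(line):
    """Return (property_path, value) or None if the line does not match."""
    m = PROPERTY_LINE.match(line)
    return m.groups() if m else None


if __name__ == "__main__":
    samples = [
        "foo: anything after the colon",
        "foo.bar1.BAZ: balh5317{}({}(",
        "foo.bar.baz.beef.stew.ect.and.forward",  # no colon: fails quickly
        "foo.: trailing dot",
        "1foo: starts with a digit",
    ]
    for s in samples:
        print(repr(s), "->", parse(s))
```

Note that the value in group 2 keeps any space that follows the colon, because the second \s? was folded into [\s\S]+.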

Related

Match a pattern not preceded by a quotation mark

I have this pattern (?<!')(\w*)\((\d+|\w+|.*,*)\) that is meant to match strings like:
c(4)
hello(54, 41)
Following some answers on SO, I added a negative lookbehind so that if the input string is preceded by a ', the string shouldn't match at all. However, it still partially matches.
For example:
'c(4) returns (4) even though it shouldn't match anything because of the negative lookbehind.
How do I make it so if a string is preceded by ' NOTHING matches?
Since nobody came along, I'll throw this out to get you started.
This regex will match things like
aa(a , sd,,,f,)
aa( as , " ()asdf)) " ,, df, , )
asdf()
but not
'ab(s)
This will fix the basic problem: (?<!['\w])\w*
Here (?<!['\w]) stops the engine from skipping past a word character just to satisfy the "not preceded by a quote" condition.
The optional \w* then grabs the whole function name.
So if a quote comes before the name, as in 'aaa(, nothing matches.
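The difference between the two lookbehinds can be checked with a short Python sketch. The question doesn't name an engine, so Python is an assumption. The FIXED pattern below uses a simplified argument list, [^()]*, purely for the demonstration; it is not the full function-body pattern, and the names ORIGINAL, FIXED and first_match are made up here.

```python
import re

# Pattern from the question: the lookbehind only checks for a quote.
ORIGINAL = re.compile(r"(?<!')(\w*)\((\d+|\w+|.*,*)\)")
# Lookbehind also rejects a preceding word character, so the engine cannot
# start the match in the middle of (or just after) the function name.
# The argument part [^()]* is a simplification for this demo.
FIXED = re.compile(r"(?<!['\w])(\w*)\(([^()]*)\)")


def first_match(pattern, text):
    """Return the text of the first match, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for text in ("c(4)", "hello(54, 41)", "'c(4)"):
        print(repr(text), first_match(ORIGINAL, text), first_match(FIXED, text))
```

With the original pattern, the engine skips the quoted c and starts the match at the opening parenthesis, which is why (4) comes back.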
This regex here embellishes what I think you are trying to accomplish
in the function body part of your regex.
It might be a little overwhelming to understand at first.
(?s)(?<!['\w])(\w*)\(((?:,*(?&variable)(?:,+(?&variable))*[,\s]*)?)\)(?(DEFINE)(?<variable>(?:\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')\s*|[^()"',]+)))
Readable version (via: http://www.regexformat.com)
(?s) # Dot-all modifier
(?<! ['\w] ) # Not a quote, nor word behind
# <- This will force matching a complete function name
# if it exists, thereby blocking a preceding quote '
( \w* ) # (1), Function name (optional)
\(
( # (2 start), Function body
(?: # Parameters (optional)
,* # Comma (optional)
(?&variable) # Function call, get first variable (required)
(?: # More variables (optional)
,+ # Comma (required)
(?&variable) # Variable (required)
)*
[,\s]* # Whitespace or comma (optional)
)? # End parameters (optional)
) # (2 end)
\)
# Function definitions
(?(DEFINE)
(?<variable> # (3 start), Function for a single Variable
(?:
\s*
(?: # Double or single quoted string
"
[^"\\]*
(?: \\ . [^"\\]* )*
"
|
'
[^'\\]*
(?: \\ . [^'\\]* )*
'
)
\s*
| # or,
[^()"',]+ # Not quote, paren, comma (can be whitespace)
)
) # (3 end)
)

Regex pattern without one case

I would like to remove some strings from filename.
I want to remove every string in bracket but not if there is a string "remix" or "Remix" or "REMIX"
Now I have got
sed "s/\s*\(\s?[A-z0-9. ]*\)//g"
but how to exclude cases when there is remix in string?
You can use a capture group:
sed 's/\(\s*([^)]*remix[^)]*)\)\|\s*(\s\?[a-z0-9. ]*)/\1/gi'
When the "remix branch" doesn't match, the capture group is not defined and the matched part is replaced with an empty string.
When the "remix branch" succeeds, the matched part is replaced by the content of the capture group, so by itself.
Note: if it helps to avoid false positives, you can add word boundaries around "remix": \bremix\b
pattern details:
\( # open the capture group 1
\s* # zero or more white-spaces
( # a literal parenthesis
[^)]* # zero or more characters that are not a closing parenthesis
remix
[^)]*
)
\) # close the capture group 1
\| # OR
# something else between parentheses
\s* # note that it is essential that the two branches are able to
# start at the same position. If you remove \s* in the first
# branch, the second branch will always win when there's a space
# before the opening parenthesis.
(\s\?[a-z0-9. ]*)
\1 is the reference to the capture group 1
i makes the pattern case-insensitive
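The same "capture what you want to keep" trick can be sketched outside sed. The Python version below is only an illustration of the idea, with hypothetical names (PATTERN, strip_brackets); it uses a replacement function so that an unset group 1 becomes an empty string explicitly.

```python
import re

PATTERN = re.compile(
    r"(\s*\([^)]*remix[^)]*\))"  # branch 1: bracket containing "remix", captured
    r"|\s*\(\s?[a-z0-9. ]*\)",   # branch 2: any other simple bracket, not captured
    re.IGNORECASE,
)


def strip_brackets(name):
    """Remove bracketed parts, except those that contain "remix"."""
    # group(1) is None when branch 2 matched, so the match is replaced by "".
    return PATTERN.sub(lambda m: m.group(1) or "", name)


if __name__ == "__main__":
    for name in ("Song (Live) (Club Remix).mp3", "Track (2010 Edit).mp3"):
        print(name, "->", strip_brackets(name))
```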
[EDIT]
If you want to do it in a POSIX-compliant way, you must use a different approach, because several GNU features are not available: in particular the alternation \|, but also the i modifier, the \s character class, and the optional quantifier \?.
This other approach matches any characters that are not an opening parenthesis and any parenthesized substrings with "remix" inside, followed by optional white-space and an optional parenthesized substring.
As you can see, everything is optional and the pattern can match an empty string, but that isn't a problem.
Everything before the parenthesized part to remove is captured in group 1.
sed 's/\(\([^(]*([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)[^ \t(]*\([ \t]\{1,\}[^ \t(]\{1,\}\)*\)*\)\([ \t]*([^)]*)\)\{0,1\}/\1/g;'
pattern details:
\( # open the capture group 1
\(
[^(]* # all that is not an opening parenthesis
# substring enclosed between parentheses containing "remix"
( [^)]* [Rr][Ee][Mm][Ii][Xx] [^)]* )
# Let's reach the next parenthesis without matching the white-spaces
# before it (otherwise the leading white-spaces are not removed)
[^ \t(]* # all that is not a white-space or an opening parenthesis
# optional groups of white-spaces followed by characters that are
# neither white-spaces nor opening parentheses
\( [ \t]\{1,\} [^ \t(]\{1,\} \)*
\)*
\) # close the capture group 1
\(
[ \t]* # leading white-spaces
([^)]*) # parenthesis
\)\{0,1\} # makes this part optional (this avoids removing a "remix" part
# alone at the end of the string)
Word boundaries aren't available in this mode either, so the only way to emulate them is to list the four possibilities:
([Rr][Ee][Mm][Ii][Xx]) # poss1
([Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss2
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx]) # poss3
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss4
and to replace ([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*) with:
\(poss1\)\{0,\}\(poss2\)\{0,\}\(poss3\)\{0,\}\(poss4\)\{0,\}
Just skip the lines matching "remix":
sed '/([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)/! s/([^)]*)//g'
where brackets are (US): []
sed '/remix\|REMIX\|Remix/ !s/\[[^]]*]//g'
where brackets are (ROW): ()
sed '/remix\|REMIX\|Remix/ !s/([^)]*)//g'
assuming:
- there are no nested brackets
- other forms of remix (ReMix, ...) are not recognized, so brackets on those lines are removed
- remix can appear anywhere in the title (i love remix) [if needed, specify which to keep and which to remove]
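For comparison, the "skip the line if it mentions remix" idea looks like this as a Python sketch. It mirrors the first sed command of this answer (remix inside parentheses, any letter case); the function and constant names are hypothetical.

```python
import re

# A bracket that contains "remix" in any letter case.
HAS_REMIX_BRACKET = re.compile(r"\([^)]*remix[^)]*\)", re.IGNORECASE)
# Any bracket without nested brackets.
ANY_BRACKET = re.compile(r"\([^)]*\)")


def strip_unless_remix(name):
    """Leave the whole name alone if a remix bracket is present."""
    if HAS_REMIX_BRACKET.search(name):
        return name
    return ANY_BRACKET.sub("", name)


if __name__ == "__main__":
    for name in ("Song (Live).mp3", "Song (Live) (Remix).mp3"):
        print(name, "->", strip_unless_remix(name))
```

The trade-off is visible in the second sample: once a remix bracket is found, every other bracket on that line is kept too, and the space before a removed bracket is left behind.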

Exact string coldfusion regular expression

I am using a regular expression to replace all characters that are not equal to the exact word "NULL" and also keep all digits. I did a first step, by replacing all "NULL" words from my string with this :
<cfset data = ReReplaceNoCase("123NjyfjUghfLL|NULL|NULL|NULL","\bNULL\b","","ALL")>
It removes all instances of the exact "NULL" word, which means it does not remove the letters "N", "U" and "L" from the substring "123NjyfjUghfLL". And this is correct. But now I want to reverse that. I want to keep only the "NULL" word, meaning that it removes the stray "N", "U" and "L" letters. So I tried this:
<cfset data = ReReplaceNoCase("123NjyfjUghfLL|NULL|NULL|NULL","[^\bNULL\b]","","ALL")>
But now this keeps all "N", "U" and "L" letters, so it outputs "NULLNULLNULLNULL". There should be only 3 times "NULL".
Can someone help me with this please? And where to add the extra code to keep digits? Thank you.
You can do this
<cfset data = ReReplaceNoCase("123NjyfjUghfLL|NULL|NULL|NULL","(^|\|)(?!NULL(?:$|\|))([^|]*)(?=$|\|)","\1","ALL")>
(^|\|)(?!NULL(?:$|\|))([^|]*)(?=$|\|)
Explanation:
( # Opens Capture Group 1
^ # Anchors to the beginning of the string.
| # Alternation (CG1)
\| # Literal |
) # Closes CG1
(?! # Opens Negative Lookahead
NULL # Literal NULL
(?: # Opens Non-Capturing group
$ # Anchors to the end of the string.
| # Alternation (NCG)
\| # Literal |
) # Closes NCG
) # Closes NLA
( # Opens Capture Group 2
[^|]* # Negated Character class (excludes the characters within)
# None of: |
# * repeats zero or more times
) # Closes CG2
(?= # Opens LA
$ # Anchors to the end of the string.
| # Alternation (LA)
\| # Literal |
) # Closes LA
Regex101.com demo
Lastly, some insight about character classes (content between square brackets)
What [^\bNULL\b] means is
[^\bNULL\b] # Negated Character class (excludes the characters within)
# None of: \b,N,U,L
# When \b is inside a character class, it matches a backspace character.
# Outside of a character class, \b matches a word boundary as you use it in your first code.
Character classes are not designed for matching or ignoring words, they're designed for permitting or excluding characters or ranges of characters.
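The character-class behaviour is easy to reproduce. The sketch below uses Python rather than ColdFusion, which is an assumption made only so the example is runnable; in Python's re module, \b inside a character class is likewise a backspace character.

```python
import re

data = "123NjyfjUghfLL|NULL|NULL|NULL"


def strip_with_class(s):
    """[^\\bNULL\\b] removes every character that is not backspace, N, U or L."""
    return re.sub(r"[^\bNULL\b]", "", s, flags=re.IGNORECASE)


def strip_word(s):
    """Outside a class, \\b is a word boundary, so only whole NULL words go."""
    return re.sub(r"\bNULL\b", "", s, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(strip_with_class(data))  # keeps each N, U and L individually
    print(strip_word(data))        # removes the three NULL words only
```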
Edit:
OK, so it works well. But what if I would also like to keep the digits? I am kind of lost in this line of code and I cannot find where to put the extra code... I think the extra code would be [^0-9], right?
This regex (demo) works to also permit numbers of any length where the number is the entire value
(^|\|)(?!(?:NULL|[0-9]+)(?:$|\|))([^|]*)(?=$|\|)
You can also use this regex (demo) to permit numbers with a decimal value.
(^|\|)(?!(?:NULL|[0-9]+(?:\.[0-9]+)?)(?:$|\|))([^|]*)(?=$|\|)
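To see how the three patterns treat individual fields, here is a Python sketch of the same replacement. Python is an assumption made for runnability (the original code is ColdFusion), the constant and function names are hypothetical, and the second and third sample strings are invented for the demonstration.

```python
import re

KEEP_NULL = r"(^|\|)(?!NULL(?:$|\|))([^|]*)(?=$|\|)"
KEEP_NULL_OR_INT = r"(^|\|)(?!(?:NULL|[0-9]+)(?:$|\|))([^|]*)(?=$|\|)"
KEEP_NULL_OR_DECIMAL = (
    r"(^|\|)(?!(?:NULL|[0-9]+(?:\.[0-9]+)?)(?:$|\|))([^|]*)(?=$|\|)"
)


def blank_other_fields(pattern, data):
    """Empty every |-separated field the lookahead does not protect."""
    # \1 puts back the leading "|" (or nothing at the start of the string).
    return re.sub(pattern, r"\1", data, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(blank_other_fields(KEEP_NULL, "123NjyfjUghfLL|NULL|NULL|NULL"))
    print(blank_other_fields(KEEP_NULL_OR_INT, "123|abc|NULL|45x|7"))
    print(blank_other_fields(KEEP_NULL_OR_DECIMAL, "3.14|x|NULL"))
```

Fields are emptied rather than dropped, so the | separators stay in place and the field positions are preserved.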

RegEx to replace prefix and postfix

I would like to build a RegEx expression to replace the prefix and postfix of a string. The general string is built from:
a known prefix string
some letter a-z or A-Z
some unknown string with letters, hyphens, backslash, slash and numbers.
a hyphen
an integer number
the symbols #.
some string of letters
Examples:
KnownStringr/df-2e\d-3724#.Gkjsu
KnownStringEd\e4v-bn-824#.YKfg
KnownStringa-YK224E\yy-379924#.awws
I would like to replace the prefix and postfix of the NUMBER so that I get:
MyPrefix3724MyPostfix
MyPrefix824MyPostfix
MyPrefix379924MyPostfix
This regex should do the trick, but you always should specify the language/framework you're using, because not all regex engines support the same features.
The number that you want to capture would be in capture group #3 ((\d+)), which most languages reference as \3
(?:KnownString)([a-zA-Z])(.*?)-(\d+)\#\.[a-zA-Z]+
Explanation:
(?: # Opens NCG
KnownString # Literal KnownString
) # Closes NCG
( # Opens CG1
[a-zA-Z] # Character class (any of the characters within)
# Anything between a and z
# Anything between A and Z
) # Closes CG1
( # Opens CG2
.*? # . denotes any single character, except for newline
# * repeats zero or more times
# ? as few times as possible
)- # Closes CG2
# Literal -
( # Opens CG3
\d+ # Token: \d (digit)
# + repeats one or more times
) # Closes CG3
\# # Literal #
\. # Literal .
[a-zA-Z]+ # Character class (any of the characters within)
# Anything between a and z
# Anything between A and Z
# + repeats one or more times
You haven't specified what the known prefix is, so be careful to escape any special characters in it, especially period, plus sign, asterisk, question mark, and parentheses.
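Since the language wasn't specified, the following is only a sketch of how the replacement and the escaping advice could look in Python. The function name rewrite and its parameters are hypothetical; re.escape takes care of any special characters in the known prefix.

```python
import re


def rewrite(text, known_prefix, new_prefix="MyPrefix", new_postfix="MyPostfix"):
    """Replace everything around the number with the new prefix and postfix."""
    pattern = re.compile(
        "(?:" + re.escape(known_prefix) + ")"  # literal prefix, safely escaped
        r"([a-zA-Z])(.*?)-(\d+)\#\.[a-zA-Z]+"   # same groups as the regex above
    )
    # Group 3 holds the number; a function avoids escaping issues in the
    # replacement text.
    return pattern.sub(lambda m: new_prefix + m.group(3) + new_postfix, text)


if __name__ == "__main__":
    samples = [
        r"KnownStringr/df-2e\d-3724#.Gkjsu",
        r"KnownStringEd\e4v-bn-824#.YKfg",
        r"KnownStringa-YK224E\yy-379924#.awws",
    ]
    for s in samples:
        print(s, "->", rewrite(s, "KnownString"))
```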

how to interpret this regular expression

I am having a lot of trouble interpreting this expression and I am getting really lost trying to read it. Can someone help me?
^[^?](?:htaccess|access_log)(?:[.][^/?])?(?:[~])?(?:[?].*)?$
I know that ^ means to start at the beginning of the line, and [^?] means not matching a "?", I think, but I'm not sure what (?:) does or how to interpret the rest of the line. I'm thinking that htaccess|access_log is an or statement, so either htaccess or access_log. [.][^/?] is a . followed by not a "?", but then what would the earlier [^?] mean...
What would an example of something this matches?
There are plenty of explainers that will break down a regular expression for you.
To be concise, the caret at the start of a character class [^ ] is the negation operator, meaning match any character NOT in the class. The ?: placed right after an opening parenthesis makes a non-capturing group, which groups expressions without capturing what they match, and | is the alternation operator.
I would recommend taking a look at these sites for basic use of regular expressions.
Regular-Expressions.info
Rexegg (Regex Tutorial)
Regular Expression:
^ # the beginning of the string
[^?] # any character except: '?'
(?: # group, but do not capture:
htaccess # 'htaccess'
| # OR
access_log # 'access_log'
) # end of grouping
(?: # group, but do not capture (optional):
[.] # any character of: '.'
[^/?] # any character except: '/', '?'
)? # end of grouping
(?: # group, but do not capture (optional):
[~] # any character of: '~'
)? # end of grouping
(?: # group, but do not capture (optional):
[?] # any character of: '?'
.* # any character except \n (0 or more times)
)? # end of grouping
$ # before an optional \n, and the end of the string
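As for what the expression actually matches, a few sample strings can be checked with a short Python sketch. The samples are invented for illustration and the helper name matches is hypothetical. Note that the leading [^?] consumes exactly one character, so a bare "htaccess" does not match, while ".htaccess" does.

```python
import re

PATTERN = re.compile(
    r"^[^?](?:htaccess|access_log)(?:[.][^/?])?(?:[~])?(?:[?].*)?$"
)


def matches(s):
    """True if the whole string fits the pattern."""
    return PATTERN.match(s) is not None


if __name__ == "__main__":
    samples = [
        ".htaccess",      # one non-"?" character, then htaccess
        "_access_log",    # same, with access_log
        ".htaccess.b",    # optional dot plus exactly one character
        ".htaccess~",     # optional tilde
        ".htaccess?x=1",  # optional query string
        "htaccess",       # fails: no leading character left for [^?]
        ".htaccess.bak",  # fails: only one character allowed after the dot
        "?htaccess",      # fails: first character must not be "?"
    ]
    for s in samples:
        print(f"{s!r:18} {matches(s)}")
```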