Regex does not match the whole string

Following is the regular expression pattern I'm dealing with:
[\t\v\f ]*([^=\s]+)[\t\v\f ]*=[\t\v\f ]*([^=\s]+)[\t\v\f ]*(?:\r?\n\s*([^=\s]+)[\t\v\f ]*=[\t\v\f ]*([^=\s]+)\s*)*
It basically tries to match key value pairs in a single section of a .ini file. So, for instance, it should be able to match the whole string below:
"aa = 11\nbb = 22\ncc = 33"
I tested it on a regex matching website and on several others, and they all match only the first 2 lines (with the global flag disabled).
However when I try to force the regex to find all 3 lines as follows:
[\t\v\f ]*([^=\s]+)[\t\v\f ]*=[\t\v\f ]*([^=\s]+)[\t\v\f ]*(?:\r?\n\s*([^=\s]+)[\t\v\f ]*=[\t\v\f ]*([^=\s]+)\s*){2}
Then it seems to be able to match the whole string.
Can anyone give me a good reason as to why the entire string above does not match my regex? Also what regex should I use in order to match all key value pairs in a string like the one I wrote above?

Your problem is the \s* at the end of the non-capturing group. It is greedy, so it absorbs the newline at the end of the line containing bb = 22, which prevents the group from matching again on the line with cc = 33. Changing it to [\t\v\f ]* (or even \s*?) makes the regex match the entire string as desired. See demo on regex101. The reason it works when you add the {2} quantifier is that the group is then required to match twice, so the engine backtracks out of the \s* to a point where the non-capturing group can match again.
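The difference can be reproduced with a rough Python sketch. The names HS, PAIR, original, fixed and line_pattern are illustrative only and do not come from the question. The per-line findall at the end is an assumed alternative for collecting every pair, since a repeated capturing group keeps only its last iteration.

```python
import re

text = "aa = 11\nbb = 22\ncc = 33"

HS = r"[\t\v\f ]*"                      # horizontal whitespace only
PAIR = rf"([^=\s]+){HS}={HS}([^=\s]+)"  # key = value

# Original pattern: the group ends with a greedy \s*, which eats the newline
# that the next repetition of the group needs to start with.
original = re.compile(rf"{HS}{PAIR}{HS}(?:\r?\n\s*{PAIR}\s*)*")

# Suggested fix: end the group with horizontal whitespace only.
fixed = re.compile(rf"{HS}{PAIR}{HS}(?:\r?\n\s*{PAIR}{HS})*")

original_match = original.match(text).group(0)
fixed_match = fixed.match(text).group(0)

# A repeated capturing group keeps only its last iteration, so to collect
# every pair it may be simpler to match one line at a time.
line_pattern = re.compile(rf"^{HS}{PAIR}{HS}$", re.MULTILINE)
pairs = line_pattern.findall(text)

print(repr(original_match))  # stops after the second line
print(repr(fixed_match))     # covers all three lines
print(pairs)
```

With the fixed pattern, groups 3 and 4 hold only cc and 33, which is why the line-by-line variant may be the more practical way to read the pairs out.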

Related

RegEx - double condition to find some string

I'd like to find word RADU3_ or RADU3- in a sentence that begins with xlink:href= and ends with .svg
How to do this?
I've tried following, but does not give the result I'm expecting.
(?=\wxlink:href=|\wsvg\b)|\bRADU3_|\bRADU3-
Just last line in example is good result (RADU3_)
ProductionGraphics\GP1**RADU3-**11_HeatingFurnaceF1.svg
PB:ExpressionText id="RADU3_FUEL GAS _SUM_EX" PBD:LinkUses
xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"
Example...

Not sure exactly how you want to use it, but the pattern below finds the string. I put the RADU3 part in a group that matches RADU3 followed by any number of - or _ characters ([_-]*).
(xlink:href=.*)(RADU3[_-]*)(.*\.svg)
Edit: handle multiple occurrences
If a string might contain the pattern several times, add ? after the quantifiers to make them lazy, so that each match stops at the nearest .svg instead of the last one
(RADU3[_-]*?)(.*?\.svg?)
The first, three-group pattern could be used in a replace expression like
\1someotherword\3
Where \2 is the second group that is replaced

If you want to make sure that the string starts with xlink:href= and ends with .svg, you could use anchors to assert the start ^ and the end $ of the string.
Use 1 capturing group to make sure xlink:href= comes before RADU3 followed by an underscore or a hyphen. Then you could match it and, in the replacement, use that capturing group followed by your replacement.
You could use a positive lookahead to assert that the string ends with .svg
Put together, the pattern is:
^(xlink:href=.*)\bRADU3[_-](?=.*\.svg$)
^ Assert the start of the string
(xlink:href=.*) Capturing group, match up until the last occurrence of ..
\bRADU3[_-] Word boundary to prevent matching part of a larger word. Match RADU3 followed by an underscore or hyphen
(?=.*\.svg$) Positive lookahead to assert the string ends with .svg
See the regex demo
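As a quick check, the anchored pattern can be tried in Python against the sample lines from the question. This is only a sketch under two assumptions: the trailing quote of the last sample line is dropped so that the string really ends with .svg, and NEWNAME_ is a made-up replacement word.

```python
import re

# Pattern from the answer above, unchanged.
pattern = re.compile(r'^(xlink:href=.*)\bRADU3[_-](?=.*\.svg$)')

# Sample lines from the question; the trailing quote of the last line is
# dropped here (an assumption) so that the string really ends with ".svg".
lines = [
    r'ProductionGraphics\GP1RADU3-11_HeatingFurnaceF1.svg',
    r'PB:ExpressionText id="RADU3_FUEL GAS _SUM_EX" PBD:LinkUses',
    r'xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg',
]

# Only the line that starts with xlink:href= and ends with .svg matches.
matched = [pattern.search(line) is not None for line in lines]

# Replace the matched word while keeping the prefix held in group 1.
# "NEWNAME_" is a placeholder replacement, not something from the question.
replaced = pattern.sub(r'\g<1>NEWNAME_', lines[2])

print(matched)
print(replaced)
```

If the real input keeps the closing quote after .svg, the lookahead would need to allow for it, for example by ending with \.svg"?$ instead.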

It sounds like you only want the word (substring) if it is in a specific context?
In your case, you can restart the regex midway if you want to have starting and ending conditions (multiple conditions) for a string, but only want to use these conditions as "if-statements" and not as part of the result.
The following uses this method, with a restart (\K) so that only the substring you are looking for is extracted.
# The string has to start with "xlink:href="
xlink:href=
# Fetch everything up to our match, and then restart the regex
.*\K
# The strings we are looking for
(RADU3[-_])
# String has to end with ".svg"
(?=(.*\.svg))
If you want the entire string matching our rules you are looking for something like this:
#The string has to start with "xlink:href"
^(xlink:href=).*
# The strings we are looking for
(RADU3[-_])
# String has to end with ".svg"
(\w+\.svg)
#Get everything after .svg too
.*
If you only want the ending " after the .svg, you'd want to modify the last part where I just take everything after .svg
You can play around with what I have come up with at regex101 (no affiliation, just love their site): https://regex101.com/r/g0v07V/3/

Python2 re match repeating patterns doesn't behave as expected

I was trying to extract urls from messy text data using regular expression. I used to match [\w.]+[a-zA-Z]{2,4} which behaved as I expected: find consecutive alphanumerical and dots, then ends with 2~4 letters like com/net/gov. It wasn't perfect but sufficed for my use.
Now I want to improve the pattern a bit: I want to find alphanumeric characters FOLLOWED BY ONE dot, repeat that pattern multiple times, and then end with 2~4 letters. This would exclude things like "abc....com". However, this time the result really confused me:
test = 'www.1f23123.asda.com'
re.findall(r'(\w+\.){1,}[a-zA-Z]{2,4}', test)
and the result was ['asda.']
Could someone explain to me what goes wrong here?

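You are printing the captured group: when the pattern contains a capturing group, re.findall returns the group instead of the whole match. Try adding ?: to make it a non-capturing group so that the whole match is printed: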
test = 'www.1f23123.asda.com'
match = re.findall(r'(?:\w+\.){1,}[a-zA-Z]{2,4}', test)
print match

Your regex repeats a capturing group, whereas you need to capture a repeated group. A repeated capturing group keeps only its last iteration, so only the last match is captured in your regex. You will need:
((?:\w+\.){1,})[a-zA-Z]{2,4}
See example
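The three variants discussed above can be compared side by side. This sketch is written for Python 3 (print as a function), although the question uses Python 2; the variable names are illustrative.

```python
import re

test = 'www.1f23123.asda.com'

# One capturing group: findall returns that group, and a repeated group
# only remembers its last iteration.
last_iteration = re.findall(r'(\w+\.){1,}[a-zA-Z]{2,4}', test)

# No capturing group: findall returns the whole match.
whole_match = re.findall(r'(?:\w+\.){1,}[a-zA-Z]{2,4}', test)

# Capture the whole repetition by wrapping the non-capturing group.
repeated_part = re.findall(r'((?:\w+\.){1,})[a-zA-Z]{2,4}', test)

print(last_iteration)  # ['asda.']
print(whole_match)     # ['www.1f23123.asda.com']
print(repeated_part)   # ['www.1f23123.asda.']
```

For URL extraction, the non-capturing form is probably the one wanted, since it returns the full host name.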

Extracting part of a string using regex

I am trying to extract part of the strings below.
I tried (.*)(?:table)?, but it fails in the last case. How can I make the expression capture the entire string in the absence of the text "table"?
Text: "diningtable" Expected match: "dining"
Text: "cookingtable" Expected match: "cooking"
Text: "cooking" Expected match: "cooking"
Text: "table" Expected match: ""

Rather than try to match everything but table, you should do a replacement operation that removes the text table.
Depending on the language, this might not even need regex. For example, in Java you could use:
String output = input.replace("table", "");

If you want to use regex, you can use this one:
(^.*)(?=table)|(?!.*table.*)(^.+)
See demo here: regex101
The idea is: match everything from the beginning of the line ^ up to the word table or, if table does not occur in the string, match at least one symbol (to avoid matching empty lines). Thus, when the string is just the word table, it returns an empty string (because it matches from the beginning of the line up to the word table).

The (.*)(?:table)? pattern fails with table (it matches the whole word) because the first group (.*) is a greedy dot pattern that grabs the whole string into Group 1. The optional non-capturing group then has nothing left to consume, so it matches an empty string at the end of the string.
The regex trick is to match any text that does not start with table before the optional group:
^((?:(?!table).)+)(?:table)?$
See the regex demo
Now, Group 1 - ((?:(?!table).)+) - contains a tempered greedy token (?:(?!table).)+ that matches 1 or more chars other than a newline that do not start a table sequence. Thus, the first group will never match table.
The anchors make the regex match the whole line.
NOTE: Non-regex solutions might turn out more efficient though, as a tempered greedy token is rather resource consuming.
NOTE2: Unrolling the tempered greedy token usually improves performance noticeably:
^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$
See another demo
But usually it looks "cryptic", "unreadable", and "unmaintainable".
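Both patterns from this answer can be run against the four sample strings from the question. This is a small Python sketch; the helper extract and the variable names are made up for the illustration.

```python
import re

# Tempered greedy token and its unrolled form, copied from the answer.
tempered = re.compile(r'^((?:(?!table).)+)(?:table)?$')
unrolled = re.compile(r'^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$')

def extract(pattern, text):
    """Return Group 1 if the pattern matches, otherwise None."""
    m = pattern.match(text)
    return m.group(1) if m else None

samples = ["diningtable", "cookingtable", "cooking", "table"]
tempered_results = [extract(tempered, s) for s in samples]
unrolled_results = [extract(unrolled, s) for s in samples]

print(tempered_results)
print(unrolled_results)
```

One difference shows up for the bare word table: the tempered version requires at least one character in Group 1 and therefore does not match at all, while the unrolled version matches with an empty Group 1.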

Despite other great answers, you could also use alternation:
^(?|(.*)table$|(.*))$
This makes use of a branch reset, so your desired content is always stored in group 1. If your language/tool of choice doesn't support it, you would have to check which of groups 1 and 2 contains the string.
See Demo

Name validation - Adding a check to this regex to stop entering just identical characters

I'm trying to add another feature to a regex which is trying to validate names (first or last).
At the moment it looks like this:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$)([a-z][a-z'-]{1,})$/i
https://regex101.com/r/pQ1tP2/1
The idea is to do the following
Don't allow just adding a title like Mr, Mrs etc
Ensure the first character is a letter
Ensure subsequent characters are either letters, hyphens or apostrophes
Minimum of two characters
I have managed to get this far (shockingly I find regex so confusing lol).
It matches things like O'Brian or Anne-Marie etc and is doing a pretty good job.
My next additions I've struggled with though! trying to add additional features to the regex to not match on the following:
Just entering the same characters i.e. aaa bbbbb etc
Thanks :)

I'd add another negative lookahead alternative matching against ^(.)\1*$, that is, any character, repeated until the end of the string.
Included as is in your regex, that would give:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$|^(.)\1*$)([a-z][a-z'-]{1,})$/i
However, I would probably simplify your negative lookahead as follows:
/^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$/i
The modifications are as follows:
We're evaluating the lookahead at the start of the string, as indicated by the ^ preceding it: there is no need to repeat the start-of-string anchor in each of its clauses
Each alternative matches up to the end of the string. We can put the alternatives in a group, which is followed by a single end-of-string anchor
We have created a new group, which we have to take into account in our back-reference : to reference the same group, it now must address \2 rather than \1. An alternative in certain regex flavours would have been to use a non-capturing group (?:...)
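A minimal Python sketch of the simplified check, assuming the title list is meant to be the same as in the question (so mrs is included). NAME_RE and is_valid_name are hypothetical names used only for this example.

```python
import re

# Simplified pattern: reject bare titles and strings made of one repeated
# character, then require a letter followed by letters, ' or -.
NAME_RE = re.compile(
    r"^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$",
    re.IGNORECASE,
)

def is_valid_name(name):
    """Return True if the name passes the regex check."""
    return NAME_RE.match(name) is not None

candidates = ["O'Brian", "Anne-Marie", "Mr", "Mrs", "aaa", "bbbbb", "a"]
accepted = [name for name in candidates if is_valid_name(name)]

print(accepted)
```

Only O'Brian and Anne-Marie should get through: the titles and the repeated-character strings are caught by the lookahead, and a single letter fails both the lookahead and the two-character minimum.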

Regex: Find multiple matching strings in all lines

I'm trying to match multiple strings in a single line using regex in Sublime Text 3.
I want to match all values and replace them with null.
Part of the string that I'm matching against:
"userName":"MyName","hiScore":50,"stuntPoints":192,"coins":200,"specialUser":false
List of strings that it should match:
"MyName"
50
192
200
false
Result after replacing:
"userName":null,"hiScore":null,"stuntPoints":null,"coins":null,"specialUser":null
Is there a way to do this without using sed or any other substitution method, but just by matching the wanted pattern in regex?

You can use this find pattern:
:(.*?)(,|$)
And this replace pattern:
:null\2
The first group will match any symbol (dot) zero or more times (asterisk) with this last quantifier lazy (question mark), this last part means that it will match as little as possible. The second group will match either a comma or the end of the string. In the replace pattern, I substitute the first group with null (as desired) and I leave the symbol matched by the second group unchanged.

Here is an alternative to the answer by amaurs that does not need to put the comma back in after the substitution:
:\K(.*?)(?=,|$)
And this replacement pattern:
null
This works like the answer by amaurs but starts the match after the colon (using \K to reset the match starting point) and matches until a comma or the end of the line (using a positive lookahead).
I have tested this and it works in Sublime Text 2 (so it should work in Sublime Text 3).
Another slightly better alternative to this is:
(?<=:).+?(?=,|$)
which uses a positive lookbehind instead of resetting the regex starting point
Another good alternative (so far the most efficient here):
:\K[^,]*

This may help.
Find: (?<=:)[^,]*
Replace: null
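Outside Sublime Text, the same find and replace pairs can be checked with Python's re.sub. This is only a sketch: the \K variants are left out because, as far as I know, Python's built-in re module does not support \K, and the variable names are illustrative.

```python
import re

line = ('"userName":"MyName","hiScore":50,"stuntPoints":192,'
        '"coins":200,"specialUser":false')

# Find :(.*?)(,|$) and replace with :null\2 (group 2 restores the comma).
with_groups = re.sub(r':(.*?)(,|$)', r':null\2', line)

# Find (?<=:)[^,]* and replace with null (the colon is not consumed).
with_lookbehind = re.sub(r'(?<=:)[^,]*', 'null', line)

print(with_groups)
print(with_lookbehind)
```

Both calls should give the result shown in the question. Note that all of these patterns assume that the values themselves contain no commas or colons.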
