Extracting part of a string using regex - regex

I am trying to extract part of each of the strings below.
I tried (.*)(?:table)?, but it fails in the last case. How can I make the expression capture the entire string when the text "table" is absent?
Text: "diningtable" Expected match: "dining"
Text: "cookingtable" Expected match: "cooking"
Text: "cooking" Expected match: "cooking"
Text: "table" Expected match: ""

Rather than try to match everything but table, you should do a replacement operation that removes the text table.
Depending on the language, this might not even need regex. For example, in Java you could use:
String output = input.replace("table", "");

If you want to use regex, you can use this one:
(^.*)(?=table)|(?!.*table.*)(^.+)
See demo here: regex101
The idea is: match everything from the beginning of the line (^) up to the word table or, if table does not occur in the string, match at least one character (to avoid matching empty lines). Thus, when the string is just the word table, it returns an empty string (because it matches from the beginning of the line up to the word table).
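To check that alternation against the four sample strings, here is a minimal Python sketch. It assumes Python's re module rather than the regex101 demo, and extract is a hypothetical helper name, not something from the answer.

```python
import re

# Alternation from the answer above: the text before "table", or the whole
# non-empty line when "table" does not occur anywhere in it.
PATTERN = re.compile(r"(^.*)(?=table)|(?!.*table.*)(^.+)")

def extract(text):
    """Return the overall match, or None when nothing matches (hypothetical helper)."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

for sample in ["diningtable", "cookingtable", "cooking", "table"]:
    print(repr(sample), "->", repr(extract(sample)))
```

Reading the overall match (group 0) avoids having to check which of the two capturing groups took part.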

The (.*)(?:table)? fails with table (matches it) as the first group (.*) is a greedy dot matching pattern that grabs the whole string into Group 1. The regex engine backtracks and looks for table in the optional non-capturing group, and matches an empty string at the end of the string.
The regex trick is to match any text that does not start with table before the optional group:
^((?:(?!table).)+)(?:table)?$
See the regex demo
Now, Group 1 - ((?:(?!table).)+) - contains a tempered greedy token (?:(?!table).)+ that matches 1 or more chars other than a newline that do not start a table sequence. Thus, the first group will never match table.
The anchors make the regex match the whole line.
NOTE: Non-regex solutions might turn out more efficient though, as a tempered greedy token is rather resource consuming.
NOTE2: Unrolling the tempered greedy token usually improves performance noticeably:
^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$
See another demo
But usually it looks "cryptic", "unreadable", and "unmaintainable".
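The three patterns discussed in this answer can be compared side by side. This is a rough Python sketch, assuming the re module; group1 is a hypothetical helper that returns Group 1 or None.

```python
import re

ORIGINAL = re.compile(r"(.*)(?:table)?")                           # pattern from the question
TEMPERED = re.compile(r"^((?:(?!table).)+)(?:table)?$")            # tempered greedy token
UNROLLED = re.compile(r"^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$")   # unrolled variant

def group1(pattern, text):
    """Return Group 1 of a match anchored at the start, or None (hypothetical helper)."""
    m = pattern.match(text)
    return m.group(1) if m else None

for sample in ["diningtable", "cookingtable", "cooking", "table"]:
    print(sample, group1(ORIGINAL, sample), group1(TEMPERED, sample), group1(UNROLLED, sample))
```

One difference worth noting: for the input table, the tempered pattern does not match at all (its group needs at least one character), while the unrolled pattern matches with an empty Group 1.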

Despite other great answers, you could also use alternation:
^(?|(.*)table$|(.*))$
This makes use of a branch reset, so your desired content is always stored in group 1. If your language/tool of choice doesn't support it, you would have to check which of groups 1 and 2 contains the string.
See Demo
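Python's built-in re module does not support the branch reset group (?|...), so a Python version has to use the fallback described above and check which group was set. A minimal sketch, where strip_table_suffix is a hypothetical helper name:

```python
import re

# Same alternation without branch reset: the content lands in group 1
# when the string ends with "table", otherwise in group 2.
PATTERN = re.compile(r"^(?:(.*)table|(.*))$")

def strip_table_suffix(text):
    """Return the captured content from whichever group matched (hypothetical helper)."""
    m = PATTERN.match(text)
    if m is None:
        return None
    return m.group(1) if m.group(1) is not None else m.group(2)

for sample in ["diningtable", "cookingtable", "cooking", "table"]:
    print(repr(sample), "->", repr(strip_table_suffix(sample)))
```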

Related

Conditional Regex not working as expected

I'm trying to write a conditional Regex to achieve the following:
If the word "apple" or "orange" is present within a string:
there must be at least 2 occurrences of the word "HORSE" (upper-case)
else
there must be at least 1 occurrence of the word "HORSE" (upper-case)
What I wrote so far:
(?(?=((apple|orange).*))(HORSE.*){2}|(HORSE.*){1})
I was expecting this Regex to work as I'm following the pattern (?(?=regex)then|else).
However, it looks like (HORSE.*){1} is always evaluated instead. Why?
https://regex101.com/r/V5s8hV/1
The conditional is useful for checking a condition in one place and using the outcome in another.
^(?=(?:.*?\b(apple|orange)\b)?)(.*?\bHORSE\b)(?(1)(?2))
The condition is group one, inside an optional (?: non-capturing group ).
The second group matches the part up to HORSE, which is always needed.
(?(1)(?2)) is the conditional: if the first group succeeded, require the group two pattern again.
See this demo at regex101 (more explanation on the right side)
The way you planned it works as well, but needs refactoring, e.g. as in that regex101 demo.
^(?(?=.*?\b(?:apple|orange)\b)(?:.*?\bHORSE\b){2}|.*?\bHORSE\b)
Or another way, without a conditional and using a negative lookahead, like this demo at regex101.
^(?:(?!.*?\b(?:apple|orange)\b).*?\bHORSE\b|(?:.*?\bHORSE\b){2})
FYI: To get the full string in the output, just append .* at the end. Also note that {1} is redundant. I used a lazy quantifier (as few as possible) in the dot parts of all variants to improve efficiency.
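Of the three variants, only the one without a conditional carries over directly to Python's built-in re module, which supports neither lookahead conditionals like (?(?=...)...) nor subroutine calls like (?2). A small sketch of that variant, with satisfies_rule as a hypothetical helper name:

```python
import re

# Negative-lookahead variant: either no apple/orange and one HORSE,
# or two HORSE occurrences regardless of what else is present.
HORSE_RULE = re.compile(
    r"^(?:(?!.*?\b(?:apple|orange)\b).*?\bHORSE\b|(?:.*?\bHORSE\b){2})"
)

def satisfies_rule(text):
    """True when the text meets the HORSE requirement (hypothetical helper)."""
    return HORSE_RULE.search(text) is not None

for sample in ["apple HORSE HORSE", "apple HORSE", "HORSE", "pear horse"]:
    print(repr(sample), "->", satisfies_rule(sample))
```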
I would keep it simple and use lookaheads to assert the number of occurrences of the word HORSE:
^((?=.*\bHORSE\b.*\bHORSE\b).*\b(?:apple|orange)\b.*|(?=.*\bHORSE\b)(?!.*\b(?:apple|orange)\b).*)$
Demo
Explanation:
^ from the start of the string
( match either of
(?=.*\bHORSE\b.*\bHORSE\b) assert that HORSE appears at least twice
.* match any content
\b(?:apple|orange)\b match apple or orange
.* match any content
| OR
(?=.*\bHORSE\b) assert that HORSE appears at least once
(?!.*\b(?:apple|orange)\b) but apple and orange do not occur
.* match any content
) close alternation
$ end of the string
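The breakdown above maps naturally onto a commented pattern. This sketch assumes Python's re module with the re.VERBOSE flag, which ignores whitespace and allows comments inside the pattern; matches_rule is a hypothetical helper name.

```python
import re

HORSE_RULE = re.compile(r"""
    ^                                   # from the start of the string
    (                                   # match either of
        (?=.*\bHORSE\b.*\bHORSE\b)      #   assert that HORSE appears at least twice
        .*\b(?:apple|orange)\b.*        #   any content containing apple or orange
    |                                   # OR
        (?=.*\bHORSE\b)                 #   assert that HORSE appears at least once
        (?!.*\b(?:apple|orange)\b)      #   but apple and orange do not occur
        .*                              #   match any content
    )                                   # close alternation
    $                                   # end of the string
""", re.VERBOSE)

def matches_rule(text):
    """True when the whole string satisfies the rule (hypothetical helper)."""
    return HORSE_RULE.match(text) is not None

for sample in ["an apple, a HORSE and another HORSE", "orange HORSE", "just a HORSE"]:
    print(repr(sample), "->", matches_rule(sample))
```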

Regex does not match the whole string

Following is the regular expression pattern I'm dealing with:
[\t\v\f ]*([^=\s]+)[\t\v\f ]*=[\t\v\f ]*([^=\s]+)[\t\v\f ]*(?:\r?\n\s*([^=\s]+)[\t\v\f ]*=[\t\v\f ]*([^=\s]+)\s*)*
It basically tries to match key value pairs in a single section of a .ini file. So, for instance, it should be able to match the whole string below:
"aa = 11\nbb = 22\ncc = 33"
I tested it on this regex matching website and some others as well, and they all seem to match only the first 2 lines. Here is what the match looks like (global flag is disabled):
However when I try to force the regex to find all 3 lines as follows:
[\t\v\f ]*([^=\s]+)[\t\v\f ]*=[\t\v\f ]*([^=\s]+)[\t\v\f ]*(?:\r?\n\s*([^=\s]+)[\t\v\f ]*=[\t\v\f ]*([^=\s]+)\s*){2}
Then it seems to be able to match the whole string.
Can anyone give me a good reason as to why the entire string above does not match my regex? Also what regex should I use in order to match all key value pairs in a string like the one I wrote above?
Your problem is the \s* at the end of the non-capturing group: it is greedy, so it absorbs the vertical whitespace (the newline) at the end of the line containing bb = 22, which prevents the group from matching again on the line with cc = 33. Changing that to [\t\v\f ]* (or even \s*?) makes the regex match the entire string as desired. See demo on regex101. The reason it works when you add the {2} quantifier is that the requirement to match twice forces the engine to backtrack into the \s* to a point where it can match the non-capturing group again.
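The effect is easy to reproduce. Below is a minimal Python sketch, assuming the re module; BLANK, PAIR, ORIGINAL, and FIXED are illustrative names, and the two full patterns are assembled from pieces only to keep the lines short.

```python
import re

BLANK = r"[\t\v\f ]*"                                  # horizontal whitespace only
PAIR = r"([^=\s]+)" + BLANK + "=" + BLANK + r"([^=\s]+)"

# Pattern from the question: the repeated group ends in a greedy \s*.
ORIGINAL = BLANK + PAIR + BLANK + r"(?:\r?\n\s*" + PAIR + r"\s*)*"
# Fix from the answer: the group ends in [\t\v\f ]* instead.
FIXED = BLANK + PAIR + BLANK + r"(?:\r?\n\s*" + PAIR + BLANK + r")*"

text = "aa = 11\nbb = 22\ncc = 33"

print(repr(re.match(ORIGINAL, text).group(0)))   # stops after the second line
print(repr(re.match(FIXED, text).group(0)))      # covers all three lines

# A repeated capturing group keeps only its last repetition, so to collect
# every pair it may be simpler to scan for single pairs instead.
print(re.findall(PAIR, text))
```

The findall call is an addition for illustration, not part of the answer; it sidesteps the repeated group entirely.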

Parsing multiple groups from a regular expression

I am having a problem parsing some fields with the following regular expression, which I uploaded to rubular. The string that I am parsing is a special header from the banner of an FTP server. To process this banner, I need to parse the line
special:pTXT1TOCAPTURE^:mTXT2TOCAPTURE^:uTXT3TOCAPTURE^
I thought that: (?i)^special(:[pmu](.*?)\^)?* would do the trick, however unfortunately this only gives me the last match and I am not sure why as I am lazily trying to capture each group. Also note that I should be able to capture an empty string also, i.e. if for ex the match string contains :u^
Match result:
special:pTXT1TOMATCH^:mTXT2TOMATCH^:uTXT3TOMATCH^
Match groups:
:uTXT3TOMATCH^
TXT3TOMATCH
The idea is that the line must start with the text 'special', followed by up to 3 capture groups, each introduced by p, m or u and running lazily up to the next ^ symbol. I need to capture the text indicated above: basically, I need to find TXT1TOCAPTURE, TXT2TOCAPTURE, and TXT3TOCAPTURE. There should be at least one of these three capture groups.
Thanks in advance
You have two problems with your RegEx, one is syntactic and one is conceptual.
Syntactic:
PCRE has no ?* quantifier, but Ruby treats it the same as *, a greedy quantifier. When a quantifier is applied to a capturing group, the group keeps only the last repetition it matched.
Conceptual:
Using a lazy quantifier .*? does not give you continuous matches; it stops as soon as the engine is satisfied. And with the g modifier on, a second match never occurs, because there is no ^special at the position where the previous match ended.
The solution is to use the \G token, which starts matching at the end of the previous match:
(?:special|(?!\A)\G):([pmu][^^]*\^)
Live demo
You might want to use the \G anchor:
(?:(?:^special:)|\G(?!\A)\^:)[pmu]([^^]+)
See it working on rubular.com.
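Python's built-in re module has no \G anchor, so the sketch below swaps in a different, two-step technique: validate the whole line first, then pull out each field. It is an illustration under that assumption, not a translation of the patterns above; HEADER, FIELD, and parse_banner are hypothetical names.

```python
import re

# Step 1: the line must be "special" followed by one or more :x...^ fields.
HEADER = re.compile(r"^special((?::[pmu][^^]*\^)+)$", re.IGNORECASE)
# Step 2: each field is a colon, a p/m/u tag, any text without ^, then ^.
FIELD = re.compile(r":([pmu])([^^]*)\^", re.IGNORECASE)

def parse_banner(line):
    """Return (tag, text) pairs, or an empty list if the line does not match."""
    m = HEADER.match(line)
    if m is None:
        return []
    return FIELD.findall(m.group(1))

print(parse_banner("special:pTXT1TOCAPTURE^:mTXT2TOCAPTURE^:uTXT3TOCAPTURE^"))
print(parse_banner("special:u^"))
```

Because the field text is matched with [^^]* rather than [^^]+, an empty field such as :u^ is still returned, with an empty string as its text.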

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces ? So extract the text/number after the date until the next whitespace ? All I know is that the date will always have the same format and there will always be a space after the text and then a space after the number I want to extract.
Can someone help me out on this ?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
See another demo
NOTE: the \K and capture-group approaches allow 1 or more spaces after the date and are thus more flexible.
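The capture-group and lookbehind variants can be tried on both sample lines. This is a minimal Python sketch, assuming the re module (which has no \K, so that variant is left out); the constant names are illustrative.

```python
import re

# Capture-group variant: one or more spaces after the date, value in Group 1.
CAPTURE = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")
# Lookbehind variant: the value is the whole match, exactly one space allowed.
LOOKBEHIND = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\S+")

samples = [
    "SOMEDATA .test 01/45/12 2.50 THIS IS DATA",
    "SOMEDATA .test 01/45/12 2500 THIS IS DATA",
]

for line in samples:
    print(CAPTURE.search(line).group(1), LOOKBEHIND.search(line).group(0))
```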
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
So the .+ matches anything plus the first space
then the \d+/\d+/\d+ is the date parsing plus a space
the capturing group is the number, as you can see I made the last part optional, so both floating point values and normal values can be matched. Hope this helped!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

Regex: Find multiple matching strings in all lines

I'm trying to match multiple strings in a single line using regex in Sublime Text 3.
I want to match all values and replace them with null.
Part of the string that I'm matching against:
"userName":"MyName","hiScore":50,"stuntPoints":192,"coins":200,"specialUser":false
List of strings that it should match:
"MyName"
50
192
200
false
Result after replacing:
"userName":null,"hiScore":null,"stuntPoints":null,"coins":null,"specialUser":null
Is there a way to do this without using sed or any other substitution method, but just by matching the wanted pattern in regex?
You can use this find pattern:
:(.*?)(,|$)
And this replace pattern:
:null\2
The first group will match any symbol (dot) zero or more times (asterisk) with this last quantifier lazy (question mark), this last part means that it will match as little as possible. The second group will match either a comma or the end of the string. In the replace pattern, I substitute the first group with null (as desired) and I leave the symbol matched by the second group unchanged.
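The same find and replace pair can be tried outside the editor. This is a small Python sketch, assuming re.sub behaves like the Sublime Text replace for this input; the variable names are illustrative.

```python
import re

line = '"userName":"MyName","hiScore":50,"stuntPoints":192,"coins":200,"specialUser":false'

# Group 1 is the value (lazy), group 2 is the comma or end of string that
# follows it; the replacement keeps group 2 and swaps the value for null.
result = re.sub(r":(.*?)(,|$)", r":null\2", line)
print(result)
```

Note that this simple pattern assumes values contain no commas or colons of their own, which holds for the sample line.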
Here is an alternative to amaurs answer, where the replacement doesn't have to put the comma back in:
:\K(.*?)(?=,|$)
And this replacement pattern:
null
This works like amaurs but starts matching after the colon is found (using \K to reset the match starting point) and matches until a comma or new line (using a positive lookahead).
I have tested and this works in Sublime Text 2 (so should work in Sublime Text 3)
Another slightly better alternative to this is:
(?<=:).+?(?=,|$)
which uses a positive lookbehind instead of resetting the regex starting point
Another good alternative (so far the most efficient here):
:\K[^,]*
This may help.
Find: (?<=:)[^,]*
Replace: null
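The lookbehind variants also work in engines without \K. Here is a minimal Python sketch, assuming the re module (which lacks \K but supports fixed-width lookbehind); the variable names are illustrative.

```python
import re

line = '"userName":"MyName","hiScore":50,"stuntPoints":192,"coins":200,"specialUser":false'

# Everything after a colon up to (not including) the next comma.
by_negated_class = re.sub(r"(?<=:)[^,]*", "null", line)
# Lazy variant with a lookahead for the comma or end of string.
by_lookahead = re.sub(r"(?<=:).+?(?=,|$)", "null", line)

print(by_negated_class)
print(by_lookahead)
```

Both calls give the same result on the sample line. As with the other patterns here, values that themselves contain a comma or a colon would need a more careful pattern or a real JSON parser.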