Regex: matching between underscores

For example, I have a string 111352_01_2_SAMPLE_TEXT_SAMPLE. I need to match the first, second, and third numbers and the remaining text.
Currently I have this:
First number: ^[^_]+(?=_) (Everything until 1. underscore)
Second number: (?<=_)[^_]*(?=_) (Everything between 1. and 2. underscore)
Remaining text: (?:.*?_){3}(.*)\s* (Text after third occurrence of underscore)
Is there a more "readable" way of building the expressions, since the logic for the first three matches is quite similar?
And what's the best way of writing expression for matching everything

Since you tagged regex-group, I think a more straightforward way of retrieving these three substrings could be:
^(.*?)_(.*?)_.*?_(.*)$
See the demo
Maybe you are looking for a single regular expression that can be applied to whichever element of the string you want. In that case you could use:
^(?:.*?_){0}([^\n_]+)
This retrieves underscore-delimited elements by zero-based index: change the 0 in {0} to 1, 2, 3, etc. to select a later element. However, I do not see the benefit over a regular split() function.
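As a rough comparison, here is a minimal Python sketch (Python is an assumption; the question does not name a language) that applies the grouped pattern, the index-based pattern, and a plain split() to the sample string. The helper name nth_field is hypothetical.

```python
import re

s = "111352_01_2_SAMPLE_TEXT_SAMPLE"

# Grouped pattern from above: first field, second field, and the remainder
# after the third underscore (the third field is skipped, not captured).
m = re.match(r"^(.*?)_(.*?)_.*?_(.*)$", s)
first, second, rest = m.groups()

def nth_field(text, n):
    """Return the n-th (zero-based) underscore-delimited field, or None."""
    found = re.match(r"^(?:.*?_){%d}([^\n_]+)" % n, text)
    return found.group(1) if found else None

# Plain split with a limit of 3 keeps the remaining text in one piece.
fields = s.split("_", 3)
```

With split() the third number comes along for free, which is probably why it is hard to argue for the regex here.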

Just use
^(\d+)_(\d+)_(\d+)_(.+)
See a demo on regex101.com.
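A short sketch of how this pattern could be used from Python (an assumed language; the name PATTERN is only illustrative). Unlike the lazy .*? version, \d+ rejects input whose leading fields are not numeric.

```python
import re

PATTERN = re.compile(r"^(\d+)_(\d+)_(\d+)_(.+)")

m = PATTERN.match("111352_01_2_SAMPLE_TEXT_SAMPLE")
numbers = m.groups()[:3]   # the three numeric fields, still as strings
text = m.group(4)          # everything after the third underscore

# A non-numeric first field does not match at all.
no_match = PATTERN.match("abc_01_2_SAMPLE")
```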

Related

Trying to extract repeating pattern from string in php/javascript

The following is in PHP but the regex will also be used in javascript.
Trying to extract repeating patterns from a string
string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string and matches the second and third correctly. The problem is that the fourth and fifth only match the Dxxx and the last string. I need each of the strings between the '|' characters to be matched.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well, so I just want a generic regex solution.
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
See regex tester.
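Since the question mentions Python, here is a minimal sketch of building the repeated pattern programmatically instead of typing it out. The function name build_pattern and the upper limit of four parts are assumptions for illustration.

```python
import re

def build_pattern(max_parts):
    """Prefix Dxxx followed by up to max_parts pipe-delimited groups."""
    head = r"^D([0-9]{3})(?:$|\|)"
    part = r"(.*?)(?:$|\|)"
    return re.compile(head + part * max_parts)

pattern = build_pattern(4)

groups = pattern.match("D197|what.org|when.net").groups()
short = pattern.match("D123").groups()
missing = pattern.match("something arbitrary")
```

Unused trailing groups come back as empty strings, so the caller may want to filter them out.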
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
^(?|D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will then give you three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)
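Note that the (?| ... ) branch-reset group is a PCRE feature; Python's standard re module and JavaScript do not support it. A portable fallback is the two-step approach the question already mentions: match the Dxxx prefix, then split the rest. A minimal Python sketch follows; the names PREFIX and parse are hypothetical, and it assumes the three digits must be followed by a pipe or the end of the string.

```python
import re

PREFIX = re.compile(r"^D([0-9]{3})(?:\||$)")

def parse(s):
    """Return (code, parts) for a Dxxx string, or None if the prefix is absent."""
    m = PREFIX.match(s)
    if m is None:
        return None
    rest = s[m.end():]
    parts = rest.split("|") if rest else []
    return m.group(1), parts
```

For example, parse("D197|what.org|when.net") yields the code "197" and the list of both remaining parts.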

Using regex to match multiple comma separated words

I am trying to find the appropriate regex pattern that allows me to pick out whole words either starting with or ending with a comma, but leave out numbers. I've come up with ([\w]+,) which matches the first word followed by a comma, so in something like:
red,1,yellow,4
red, will match, but I am trying to find a solution that will also match in something like the following:
red, 1 ,yellow, 4
I haven't been able to find anything that can break strings up like this, but hopefully you'll be able to help!
This regex
,?[a-zA-Z][a-zA-Z0-9]*,?
This matches 'words' optionally enclosed by commas. No spaces are permitted between the commas and the 'word', and the word must start with a letter.
See here for a demo.
To ascertain that at least one comma is matched, use the alternation syntax:
(,[a-zA-Z][a-zA-Z0-9]*|[a-zA-Z][a-zA-Z0-9]*,)
Unfortunately, no regex engine that I am aware of supports cascaded matching. However, since you usually work with regexes in the context of a programming environment, you could repeatedly match against a regex and take the matched substring for further matches. This can be achieved by chaining or iterated function calls using special delimiter characters (which must be guaranteed not to occur in the test strings).
Example (Javascript):
"red, 1 ,yellow, 4, red1, 1yellow yellow"
.replace(/(,?[a-zA-Z][a-zA-Z0-9]*,?)/g, "<$1>")
.replace(/<[^,>]+>/g, "")
.replace(/>[^>]+(<|$)/g, "> $1")
.replace(/^[^<]+</g, "<")
In this example, the (simple) regex is tested for first. The call returns a sequence of preliminary matches delimited by angle brackets. Matches that do not contain the required substring (, in this case) are eliminated, as is all intervening material.
This technique might produce code that is easier to maintain than a complicated regex.
However, as a rule of thumb, if your regex gets too complicated to be easily maintained, a good guess is that it hasn't been the right tool in the first place (Many engines provide the x matching modifier that allows you to intersperse whitespace - namely line breaks and spaces - and comments at will).
The issue with your expression is that:
- \w resolves to this: [a-zA-Z0-9_]. This includes numeric data which you do not want.
- You have the comma at the end, this will match foo, but not ,foo.
To fix this, you can do something like so: (,\s*[a-z]+)|([a-z]+\s*,). An example is available here.
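A minimal Python sketch of that last pattern (Python is an assumption here; the question does not name a language), which collects the matched words and strips the surrounding commas and spaces. The helper name comma_words is hypothetical.

```python
import re

WORD_WITH_COMMA = re.compile(r"(,\s*[a-z]+)|([a-z]+\s*,)")

def comma_words(text):
    """Return lowercase words that are directly preceded or followed by a comma."""
    return [m.group(0).strip(", \t") for m in WORD_WITH_COMMA.finditer(text)]

spaced = comma_words("red, 1 ,yellow, 4")
compact = comma_words("red,1,yellow,4")
```

Both inputs give the same two words, and the numbers are skipped because [a-z] never matches a digit.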

RegEx failing for strings with less than 3 characters

I am using a RegEx to test if a string is valid. The string must start and end with a number ([0-9]), but can contain commas within.
I came up with this example, but it fails for strings shorter than 3 characters (for example 1 or 15 are as valid as 1,8). Presumably this is because I am specifically testing for a first and last character, but I don't know any other way of doing this.
How can I change this RegEx to match my requirements? Thanks.
^[0-9]+[0-9\,]+[0-9]$
Use this:
^[0-9]+(,[0-9])?$
The ,[0-9] part will be optional.
If you want to allow multiple comma-number groups, replace the ? with *.
If you want to allow groups of numbers after the comma (which didn't seem to be the case in your example), put a + after the [0-9] inside the group as well.
If both of the above are desired, your final regex could look like this:
^[0-9]+(,[0-9]+)*$
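A quick way to check that final pattern against the cases from the question, sketched in Python (an assumed language; is_valid is a hypothetical helper). re.fullmatch stands in for the ^ and $ anchors.

```python
import re

NUMBER_LIST = re.compile(r"[0-9]+(?:,[0-9]+)*")

def is_valid(s):
    """True if s is digits, optionally followed by comma-separated digit groups."""
    return NUMBER_LIST.fullmatch(s) is not None

results = {s: is_valid(s) for s in ["1", "15", "1,8", "12,345,6", "1,", ",1", "1,,2", ""]}
```

The short strings 1 and 15 now pass, while a leading, trailing, or doubled comma is rejected.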
^\d+(?:,\d+)*$
should work.
Always have one or more digits at the start, optionally followed by any number of comma-separated other groups of one or more digits.
If you allow commas next to each other, then the second + should be a *, I think.
I would say the regex
\d(,?\d)*
This should work for one or more digits where neighbouring digits are separated by at most one comma. Note that 1,,2 fails (when the whole string is required to match).

REGEX: Select everything NOT equal to a certain string

This should be straightforward. I need a regular expression that selects everything that does not specifically contain a certain word.
So if I have this sentence: "There is a word in the middle of this sentence."
And the regular expression gets everything but "middle", I should select everything in that sentence but "middle".
Is there any easy way to do this?
Thanks.
It is not possible for a single regex match operation to be discontinuous.
You could use two capturing groups:
(.*)middle(.*)
Then concatenate the contents of capturing groups 1 and 2 after the match.
You may wish to enable the "dot also matches newline" option in your parser.
See for example Java's DOTALL, .NET's Singleline, Perl's s, etc.
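In Python the equivalent option is re.DOTALL. A minimal sketch of the two-group approach, using the sentence from the question (the variable names are illustrative):

```python
import re

sentence = "There is a word in the middle of this sentence."

# DOTALL lets "." cross line breaks, in case the input spans several lines.
m = re.match(r"(.*)middle(.*)", sentence, re.DOTALL)
before, after = m.groups()
without_word = before + after
```

Because the first .* is greedy, this splits on the last occurrence of the word if it appears more than once.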
Positive lookaround is the way to go:
/^(.+)(?=middle)/ -- gets everything before middle, not including middle
and
/(?<=middle)(.+)$/ -- gets everything after middle, not including middle
Then you just merge the results of both

Combine Regexp?

After collecting user input for various conditions like
Starts with : /(^#)/
Ends with : /(#$)/
Contains : /#/
Doesn't contain
To make a single regex when the user enters multiple conditions,
I combine them with "|", so if 1 and 2 are given it becomes /(^#)|(#$)/
This method works so far, but
I'm not able to determine what the regex for the 4th condition should be. And does combining regexes this way work?
Update: The # (user input) won't be the same for any two conditions, and not all four conditions will always be present, though they can be. In the future I might need more conditions like "is exactly" and "is exactly not", so I'm more curious to know whether this approach will scale. There may also be issues with cleaning up user input so that the regex is escaped properly, but that is ignored for now.
Will the conditions be ORed or ANDed together?
Starts with: abc
Ends with: xyz
Contains: 123
Doesn't contain: 456
The OR version is fairly simple; as you said, it's mostly a matter of inserting pipes between individual conditions. The regex simply stops looking for a match as soon as one of the alternatives matches.
/^abc|xyz$|123|^(?:(?!456).)*$/
That fourth alternative may look bizarre, but that's how you express "doesn't contain" in a regex. By the way, the order of the alternatives doesn't matter; this is effectively the same regex:
/xyz$|^(?:(?!456).)*$|123|^abc/
The AND version is more complicated. After each individual regex matches, the match position has to be reset to zero so the next regex has access to the whole input. That means all of the conditions have to be expressed as lookaheads (technically, one of them doesn't have to be a lookahead, but I think it expresses the intent more clearly this way). A final .*$ consummates the match.
/^(?=^abc)(?=.*xyz$)(?=.*123)(?=^(?:(?!456).)*$).*$/
And then there's the possibility of combined AND and OR conditions--that's where the real fun starts. :D
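To make the OR and AND constructions concrete, here is a minimal Python sketch that assembles both from optional user inputs. The function names build_or and build_and and their parameters are hypothetical; re.escape handles the escaping concern raised in the question, and single-line input is assumed.

```python
import re

def build_or(starts=None, ends=None, contains=None, not_contains=None):
    """Regex that matches when ANY supplied condition holds (use .search)."""
    parts = []
    if starts is not None:
        parts.append("^" + re.escape(starts))
    if ends is not None:
        parts.append(re.escape(ends) + "$")
    if contains is not None:
        parts.append(re.escape(contains))
    if not_contains is not None:
        parts.append("^(?:(?!" + re.escape(not_contains) + ").)*$")
    return re.compile("|".join(parts))

def build_and(starts=None, ends=None, contains=None, not_contains=None):
    """Regex that matches only when ALL supplied conditions hold (use .match)."""
    checks = []
    if starts is not None:
        checks.append("(?=" + re.escape(starts) + ")")
    if ends is not None:
        checks.append("(?=.*" + re.escape(ends) + "$)")
    if contains is not None:
        checks.append("(?=.*" + re.escape(contains) + ")")
    if not_contains is not None:
        # Each lookahead starts at position 0, so every check sees the whole input.
        checks.append("(?=(?:(?!" + re.escape(not_contains) + ").)*$)")
    return re.compile("^" + "".join(checks) + ".*$")

conditions = dict(starts="abc", ends="xyz", contains="123", not_contains="456")
or_re = build_or(**conditions)
and_re = build_and(**conditions)
```

Adding a new condition type then means adding one more branch to each builder, which is one way this approach could scale.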
Doesn't contain #: /(^[^#]*$)/
Combining works if the intended result of combination is that any of them matching results in the whole regexp matching.
If a string must not contain #, every character must be another character than #:
/^[^#]*$/
This will match any string of any length that does not contain #.
Another possible solution would be to invert the boolean result of /#/.
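Both options can be sketched in Python as follows (an assumed language; the function names are hypothetical). They should agree on every input, so the choice is mostly about readability.

```python
import re

def lacks_hash_pattern(s):
    """'Doesn't contain #' expressed as a positive pattern over the whole string."""
    return re.fullmatch(r"[^#]*", s) is not None

def lacks_hash_inverted(s):
    """Same condition, expressed by negating a plain 'contains' search."""
    return re.search(r"#", s) is None

samples = ["my name is hal9000", "#start", "end#", "mid#dle", ""]
agree = all(lacks_hash_pattern(s) == lacks_hash_inverted(s) for s in samples)
```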
In my experience with regex you really need to focus on what EXACTLY you are trying to match, rather than what NOT to match.
for example
\d{2}
[1-9][0-9]
The first expression will match any two digits, and the second will match one digit from 1 to 9 followed by any digit. So if you type 07, the first expression will validate it, but the second one will not.
See this for advanced reference:
http://www.regular-expressions.info/refadv.html
EDITED:
^((?!my string).)*$ Is the regular expression for does not contain "my string".
1 + 2 + 4 conditions: starts|ends, but not in the middle
/^#[^#]*#?$|^#?[^#]*#$/
is almost the same as:
/^#?[^#]*#?$/
but this one also matches any string without #, for example 'my name is hal9000'
Combining the regex for the fourth option with any of the others doesn't work within one regex. 4 + 1 would mean either the string starts with # or doesn't contain # at all. You're going to need two separate comparisons to do that.