Regex matching problem - regex

I don't know how to write a regex for this, so I will start with an example.
My bad regex:
(\d*),?(\d*\.?\d*)-?(\d*\.?\d*),?([0-1]?),?([0-1]?),?([^\/]*)
Matches that are OK:
1,2-3,1,1,asdf
1,2-3,1,1
1,2-3,1
1,2-3
1,2
1
But unfortunately this will also be matched and I don't want it to be:
asdf
1,asdf
Ideally, I would like something like: match a group only if the previous groups were matched.
I know that a positive lookbehind could probably be used, but if I'm not wrong, it would have to go right in front of each group except the first, and the regex would be large and smelly after that. It would also probably be variable-length.
Is there any elegant way to do that?
EDIT
I want to match all the lines listed under "Matches that are OK".
I would like to match \d* in the first group. Then, if \d* matched and is followed by ,, I would like to match (\d*\.?\d*) in the second group. After that, if the first group matched followed by , and the second group matched followed by -, I would like to match another (\d*\.?\d*), and so on to the end of the regex.

You're not very clear in your question, but from the examples I think this is what you need:
^\d(,\d(-\d(,\d(,\d(,[a-z]+)?)?)?)?)?$
It matches:
1,2-3,1,1,asdf
1,2-3,1,1
1,2-3,1
1,2-3
1,2
1
Test link.
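The nesting idea can be checked with a short Python sketch. The pattern below is a variant written for illustration, not the answerer's exact regex: each later field sits inside the optional group of the field before it, the `-3` part is optional too so that `1,2` passes, and the field shapes (single digits, 0/1 flags, a lowercase tail) are assumptions read off the sample lines. The names `NESTED` and `is_valid` are hypothetical.

```python
import re

# Each later field is nested inside the optional group of the field before it,
# so it can only match when everything to its left matched as well.
# Field shapes (single digits, 0/1 flags, lowercase tail) are assumptions
# taken from the sample lines in the question.
NESTED = re.compile(r"^\d(,\d(-\d(,[01](,[01](,[a-z]+)?)?)?)?)?$")


def is_valid(line: str) -> bool:
    """Return True when the line follows the nested optional layout."""
    return NESTED.match(line) is not None


wanted = ["1,2-3,1,1,asdf", "1,2-3,1,1", "1,2-3,1", "1,2-3", "1,2", "1"]
unwanted = ["asdf", "1,asdf"]

for line in wanted + unwanted:
    print(f"{line!r}: {is_valid(line)}")
```

Because a skipped optional group also skips everything nested inside it, no lookbehind is needed to express "only if the previous group matched".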

Related

Select Northings from a 1 Line String

I have the following string;
Start: 738392E, 6726376N
I extracted 738392 OK using (?<=.art\:\s)([0-9A-Z]*). This gave me a one-group match, allowing me to extract it as a column value.
I want to extract 6726376 the same way. Have only one group appear because I am parsing that to a column value.
I'm not sure why (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) is giving me the entire line after S.
Helping me get it right, with an explanation, would go a long way.
Because you used positive lookaheads. Those just make some assertions, but don't "move the head along".
(?=(art\:\s\s*)) makes sure you're before "art: ...". The next thing is another positive lookahead that you quantify with a star to make it optional. Finally you match anything, so you get the rest of the line in your capture group.
I propose a simpler regex:
(?<=(art\:\s))(\d+)\D+(\d+)
Demo
First we use a positive lookbehind to make sure we're after "art: ", then we match two numbers separated by non-digits.
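The difference between the two kinds of assertion can be seen in a small Python sketch. It uses the sample line from the question; the patterns are simplified versions of the ones discussed (the capturing group inside the lookbehind is dropped), and the variable names are only illustrative.

```python
import re

line = "Start: 738392E, 6726376N"

# A lookahead only asserts. The match still begins at the position where the
# lookahead was tested, so the group starts at "art: " and .* runs to the end.
lookahead_only = re.search(r"(?=art:\s)(.*)", line)
print(lookahead_only.group(1))

# A lookbehind asserts what precedes the match, so the match starts at the
# first digit and the two numbers land in groups 1 and 2.
after_prefix = re.search(r"(?<=art:\s)(\d+)\D+(\d+)", line)
easting, northing = after_prefix.groups()
print(easting, northing)
```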
There is no need for you to make it this complicated. Just use something like
Start: (\d+)E, (\d+)N
or
\b\d+(?=[EN]\b)
if you need to match each bit separately.
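Both suggestions can be tried in Python as follows. This is a minimal sketch on the sample line from the question; it assumes the line always has the `Start: <digits>E, <digits>N` shape.

```python
import re

line = "Start: 738392E, 6726376N"

# Option 1: spell out the literal text and read both values from one match.
whole = re.search(r"Start: (\d+)E, (\d+)N", line)
easting, northing = whole.groups()
print(easting, northing)

# Option 2: find every run of digits that is directly followed by E or N.
numbers = re.findall(r"\b\d+(?=[EN]\b)", line)
print(numbers)
```

With the second form, the northing is simply the second element of the list, so no extra group is needed when it is parsed into a column value.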
Your expression (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) has several problems besides the ones already mentioned: 1) your first and second lookahead match at different locations, 2) your second lookahead is quantified, which, in 25 years, I have never seen someone do, so kudos. ;), 3) your capturing group matches about anything, including any line or the empty string.
You match the whole part after it because you use .* which will match until the end of the line.
Note that the [0-9]* at the end of the pattern never consumes anything: it is optional, and the preceding .* has already matched up to the end of the line.
You could get the match without any lookarounds:
(art:\s)(\d+)[^,]+,\s(\d+)
Regex demo
If you want the matches only, you could make use of the PyPI regex module, which supports variable-length lookbehind:
(?<=\bStart:(?:\s+\d+[A-Z],)* )\d+(?=[A-Z])
Regex demo (For example only, using a different engine) | Python demo

Match all words before a word if it exists or not

I would like to match all chars except those specific one at the end of a word if it exists or not, for instance:
fooBarsBar
fooBars
I would like it to match "fooBars" for both examples, that is, everything without the "Bar" at the end, whether or not that "Bar" exists.
I tried:
(.*)(?=Bar)|(.*)
And also this:
[^Bar]*
The first regex captures everything in a group, and the second one matches all characters except those in "Bar", but does not drop the "Bar" at the end...
Could you please help me, thanks
There are different ways but your first regex pattern looks good already.
Add the anchors ^ for the start and $ for the end (inside the lookahead), and use the + quantifier for one or more characters.
^.+(?=Bar$)|^.+
Here is a demo at regex101
Also be aware that [^Bar] is a negated character class: it matches a single character that is not B, a, or r. It does not match substrings other than "Bar", as you seem to have expected.
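Here is a short Python sketch of both points, using the two sample words from the question. The helper name `without_trailing_bar` is hypothetical.

```python
import re

# First alternative: everything up to a final "Bar".
# Second alternative: the whole string when there is no final "Bar".
STRIP_TRAILING_BAR = re.compile(r"^.+(?=Bar$)|^.+")


def without_trailing_bar(word: str) -> str:
    """Return the word with one trailing "Bar" left out of the match."""
    m = STRIP_TRAILING_BAR.match(word)
    return m.group() if m else word


print(without_trailing_bar("fooBarsBar"))
print(without_trailing_bar("fooBars"))

# The negated class stops at the first B, a, or r it meets.
class_match = re.match(r"[^Bar]*", "fooBarsBar").group()
print(class_match)
```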

regex match combined with something before a string if exists

I am trying to get sub-strings from a string, such as the following.
Test strings:
cat_zoo_New_York_US
dog_zoo_South_Carolina
dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
returned sub strings:
cat, New_York
dog, South_Carolina
dolphin, Montreal
pokemon, d
the Regex pattern I have tried is
([\w]+)(?:(_zoo_|_home_))(((?!(_US|_Canada|_K2-155))\w)+)
which I don't think is very concise and it returns other sub-strings besides what I need. Do you have any other suggestions?
Thanks!
Some updates
After The fourth bird's answer (03/15/2018):
First of all, I like the idea of using both ([^_]+) and (?:) for different parts of the sample strings. But let me extend the sample strings a little more.
cat_zoo_New_York_US
dog_zoo_South_Carolina
yellow_dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
pokemon_home_zoo_d_K2-155
I actually want to use the anchor strings such as 'zoo', 'home' or 'home_zoo' to separate the characters before and after, together with matching (and discarding) the last part with the country (or whatever place ID is specified), which makes this question a bit less general (I like the idea of using _, but let me make it trickier to learn better).
Two questions here:
What is the function of (?=) and .* in (?=(?:_US|_Canada|_K2-155|$)).*$? It seems that if I use (?:_US|_Canada|_K2-155|$), it is still OK...
Since I extended the anchor string a little to let it support _, I used:
(.*?)(?:_*)(?:home_zoo|zoo|home)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
It seems ok, but if I use:
(.*?)(?:_*)(?:home|zoo|home_zoo)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
it will first match home for the last sample string. Is there a greedy algorithm to catch this without specifying the order of the pattern strings?
Well again, I don't like making a long list of anchor strings, but I don't have other ideas to make it more general without doing so.
Thanks again!
You could try it like this:
^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$
This will capture 2 groups. You could for example use this in a replacement with group1, group2.
First capture the part before the first underscore in group 1, like cat. Then match the second part, delimited by underscores, like _zoo_ or _home_.
From that point, capture in a group until a lookahead (?= finds one of your values or the end of the string.
That would match:
^ Begin of the string
([^_]+) Match in a capturing group not an _ one or more times (group 1)
_[^_]+_ match _ then not an _ one or more times followed by _
(.*?) Capture in a group any character zero or more times, non-greedy (group 2)
(?= Positive lookahead that asserts what is on the right side is
(?: Non capturing group
_US|_Canada|_K2-155|$ your values or end of the string
) Close non-capturing group
) Close the lookahead
.*$ Match any character zero or more times till the end of the string
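The breakdown above can be run in Python on the four sample strings from the question. In a replacement, the lookahead stops group 2 without consuming the suffix, and the trailing `.*$` then consumes the rest of the line so that it is dropped. The names `PATTERN`, `samples` and `results` are only illustrative.

```python
import re

PATTERN = re.compile(r"^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$")

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]

# Replace each whole line with "group1, group2".
results = [PATTERN.sub(r"\1, \2", s) for s in samples]
for line in results:
    print(line)
```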
Edit: After the updated question, perhaps this will suit your requirements:
^(.*?)_(?:home_zoo|zoo|home)_(.*?)(?=(?:_US|_Canada|_K2-155|$))
This will match any character zero or more times non-greedily (.*?), then an underscore, a non-capturing group (?:home_zoo|zoo|home) and another underscore to separate the characters before and after. The alternatives are tried from left to right, so home_zoo has to come before home.
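On the question of not having to order the anchor strings by hand, one possible approach is to build the alternation in code and sort it longest first. The Python sketch below is an assumption-laden illustration, not part of the original answer: `build_pattern` is a hypothetical helper, and the pattern it produces includes an underscore after the anchor group so that group 2 starts at the place name.

```python
import re


def build_pattern(anchors, suffixes):
    """Build the regex from word lists (hypothetical helper for illustration)."""
    # Alternation is tried left to right, so put the longest anchors first;
    # otherwise "home" would win over "home_zoo".
    anchor_alt = "|".join(sorted(map(re.escape, anchors), key=len, reverse=True))
    suffix_alt = "|".join(re.escape("_" + s) for s in suffixes)
    return re.compile(rf"^(.*?)_(?:{anchor_alt})_(.*?)(?={suffix_alt}|$)")


pattern = build_pattern(["home", "zoo", "home_zoo"], ["US", "Canada", "K2-155"])

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "yellow_dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
    "pokemon_home_zoo_d_K2-155",
]
pairs = [pattern.match(s).groups() for s in samples]
for pair in pairs:
    print(pair)

# For comparison: the same regex with "home" listed before "home_zoo".
naive = re.compile(r"^(.*?)_(?:home|zoo|home_zoo)_(.*?)(?=_US|_Canada|_K2-155|$)")
naive_pair = naive.match("pokemon_home_zoo_d_K2-155").groups()
print(naive_pair)
```

Sorting by length keeps the list of anchors in one place and removes the need to remember the ordering rule, at the cost of still having to enumerate the anchors.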
Well, I tried a more straightforward approach. If your data is more complex than the sample that you gave above, this may fail. Otherwise, for the above text, it works fine.
Here is the expression that I used:
^([^_]*)_[^_]*_(.*)_.*$
 1      23    45   67
Basically what I did was:
1. Group the first run of characters that does not contain _, starting at the beginning of the line.
2. Then there is an _ following the above group.
3. Next comes an arbitrary-length string that has no _ in it.
4. Then comes an _.
5. Group the next arbitrary-length string.
6. Then comes an _.
7. The rest of the string.
Replace it with \1, \2 (first group, second group).
You can find a fiddle here
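The same replacement can be tried in Python with `re.sub`. This sketch uses the four original sample strings; the greedy `(.*)` gives up only the last underscore-separated token, which matters for the line that has no country suffix.

```python
import re

SIMPLE = re.compile(r"^([^_]*)_[^_]*_(.*)_.*$")

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]

# Replace each whole line with "group1, group2".
results = [SIMPLE.sub(r"\1, \2", s) for s in samples]
for line in results:
    print(line)
```

Note that the second line comes out as `dog, South`, because the last token is always treated as the part to discard.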
If you are using vim, you can also achieve the same thing in vim with the following command:
:%s/^\([^_]*\)_[^_]*_\(.*\)_.*$/\1, \2/g
UPDATE
^([^_]*)_[^_]*_(((?:South_)|(?:New_))*[^_]*)((?:_US)|(?:_Canada)|(?:_K2-155))*$
You can find the new fiddle here: https://regex101.com/r/qQ2dE4/273
What is the difference between this one and the previous one?
Now I cheat a little: I look for modifiers that are part of the state name, like South_ or New_. You can add more here, like East_, West_, Old_ or whatever, if such cases occur in your data.
There are cases where the country is omitted from the data. Also, the last token on the very last line does not seem to follow a pattern. So I explicitly listed those options in the expression, like US, Canada etc. You may need to add more exceptional cases here as well.
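For comparison, the updated expression can be run in Python on the same four sample strings. This is only a sketch of the behaviour on those inputs; other data would need more prefixes and suffixes added, as noted above.

```python
import re

UPDATED = re.compile(
    r"^([^_]*)_[^_]*_"                          # first token, then the anchor token
    r"(((?:South_)|(?:New_))*[^_]*)"            # optional known prefixes plus one token
    r"((?:_US)|(?:_Canada)|(?:_K2-155))*$"      # optional known suffixes
)

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]

results = [UPDATED.sub(r"\1, \2", s) for s in samples]
for line in results:
    print(line)
```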

Is there a way to match Regex based on previous capture group, not captured previously?

Okay, so the task is that there is a string that can either look like post, or post put or even get put post. All of these must be matched. Preferably, deviations like [space]post or get[space] should not be matched.
Currently I came up with this
^(post|put|delete|get)(( )(post|put|delete|get))*$
However I'm not satisfied with it, because I had to specify (post|put|delete|get) twice. It also matches duplications like post post.
I'd like to somehow use a backreference(?) to the first group so that I don't have to specify the same condition twice.
However, backreference \1 would help me only match post post, for example, and that's the opposite of what I want. I'd like to match a word in the first capture group that was NOT previously found in the string.
Is this even possible? I've been looking through SO questions, but my Google-fu is eluding me.
If you are using a PCRE-based regex engine, you may use subroutine calls like (?n) to recurse the subpatterns.
^(post|put|delete|get)( (?!\1)(?1))*$
                              ^^^^
See the regex demo
Expression details:
^ - start of string
(post|put|delete|get) - Group 1 matching one of the alternatives as literal substrings
( (?!\1)(?1))* - zero or more sequences of:
  " " - a space
  (?!\1) - a negative lookahead that fails the match if the text after the current location is identical to the one captured into Group 1 due to backreference \1
  (?1) - a subroutine call to the first capture group (i.e. it uses the same pattern used in Group 1)
$ - end of string
UPDATE
In order to avoid matching strings like get post post, you need to also add a negative lookahead into Group 1 so that the subroutine call is aware that we do not want to match the same value that was captured into Group 1.
^((post|put|delete|get)(?!.*\2))( (?1))*$
See the regex demo
The difference is that we capture the alternations into Group 2 and add the negative lookahead (?!.*\2) to disallow any occurrences of the word we captured further in the string. The ( (?1))* remains intact: now, the subroutine recurses the whole Capture Group 1 subpattern with the lookahead.
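Python's standard `re` module has no subroutine calls, so the sketch below swaps in a different technique to reach the same goal: the alternation is written once in a variable and interpolated twice, and a single negative lookahead at the start rejects any string in which a word occurs twice. The names `VERBS`, `UNIQUE_VERBS` and `is_valid` are hypothetical.

```python
import re

VERBS = ("post", "put", "delete", "get")
verb = "|".join(VERBS)

UNIQUE_VERBS = re.compile(
    rf"^(?!.*\b(\w+)\b.*\b\1\b)"        # fail if any whole word occurs twice
    rf"(?:{verb})(?: (?:{verb}))*$"     # one verb, then space-separated verbs
)


def is_valid(text: str) -> bool:
    """Return True for space-separated verbs with no repeats."""
    return UNIQUE_VERBS.match(text) is not None


for text in ["post", "post put", "get put post", " post", "get ", "post post", "get post post"]:
    print(f"{text!r}: {is_valid(text)}")
```

The verb list lives in one place, which addresses the duplication concern, although the mechanism differs from the PCRE `(?1)` solution described above.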

regex optional lookahead

I want a regular expression to match all of these:
startabcend
startdef
blahstartghiend
blahstartjklendsomething
and to return abc, def, ghi and jkl respectively.
I have the following, which works for cases 1 and 3, but I am having trouble making the lookahead optional.
(?<=start).*(?=end.*)
Edit:
Hmm. Bad example. In reality, the bit in the middle is not numeric, but is preceded by a certain set of characters and optionally succeeded by it. I have updated the inputs and outputs as requested and added a 4th example in response to someone's question.
If you're able to use lookahead,
(?<=start).*?(?=(?:end|$))
as suggested by stema below is probably the simplest way to get the entire pattern to match what you want.
Alternatively, if you're able to use capturing groups, you should just do that instead:
start(.*?)(?:end|$)
and then just get the value from the first capture group.
Maybe like this:
(?<=start).*?(?=(?:end|$))
This matches the text between "start" and "end", or between "start" and the end of the line. Additionally, the quantifier has to be non-greedy (.*?).
See it here on Regexr
Extended the example on Regexr to not only work with digits.
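A quick Python check of this pattern against the four inputs from the question might look like the following. The helper name `middle` is hypothetical.

```python
import re

BETWEEN = re.compile(r"(?<=start).*?(?=(?:end|$))")


def middle(text: str):
    """Return the text after "start", up to "end" or the end of the string."""
    m = BETWEEN.search(text)
    return m.group() if m else None


inputs = ["startabcend", "startdef", "blahstartghiend", "blahstartjklendsomething"]
outputs = [middle(text) for text in inputs]
print(outputs)
```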
An optional lookahead doesn't make sense:
If it's optional then it's ok if it matches, but it's also ok if it doesn't match. And since a lookahead does not extend the match it has absolutely no effect.
So the syntax for an optional lookahead is the empty string.
Lookahead alone won't do the job. Try this:
(?<=start)(?:(?!end).)*
The lookbehind positions you after the word "start", then the rest of it consumes everything until (but not including) the next occurrence of "end".
Here's a demo on Ideone.com
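The same four inputs can be run through this pattern in Python. In this sketch, each character is consumed only if "end" does not start at that position, so the match stops right before "end" or runs to the end of the string. The helper name `until_end` is hypothetical.

```python
import re

# (?!end). consumes one character only when "end" does not begin there.
UNTIL_END = re.compile(r"(?<=start)(?:(?!end).)*")


def until_end(text: str):
    """Return the text after "start" up to, but not including, "end"."""
    m = UNTIL_END.search(text)
    return m.group() if m else None


inputs = ["startabcend", "startdef", "blahstartghiend", "blahstartjklendsomething"]
outputs = [until_end(text) for text in inputs]
print(outputs)
```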
If "end" is always going to be present, then use (?<=start)(.*?)(?=end), as you put in the OP. Since you say "make the lookahead optional", just run up until there is either "end" or a newline: (?<=start)(.*?)(?=end|\n). If you don't care about capturing the "end" group, you can skip the lookahead and do (?:start)?(.*?)(?:end)?, which is meant to start after "start" if it's there and stop before "end" if it's there. Because every part of that pattern is optional and (.*?) is lazy, it can also succeed with an empty match, so it needs anchoring in practice. You can also use more of those piped "or" patterns: (?:start|^) and (?:end|\n).
Why do you need lookahead?
start(\d+)\w*
See it on rubular