regex match combined with something before a string if exists - regex

I am trying to extract sub-strings from strings such as these.
Test strings:
cat_zoo_New_York_US
dog_zoo_South_Carolina
dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
Sub-strings I want returned:
cat, New_York
dog, South_Carolina
dolphin, Montreal
pokemon, d
The regex pattern I have tried is
([\w]+)(?:(_zoo_|_home_))(((?!(_US|_Canada|_K2-155))\w)+)
which I don't think is very concise and it returns other sub-strings besides what I need. Do you have any other suggestions?
Thanks!
Some updates
after #The fourth bird's answer #03/15/2018.
First of all, I like the idea of using both ([^_]+) and (?:) for different parts of the sample strings. But let me extend the sample strings a little:
cat_zoo_New_York_US
dog_zoo_South_Carolina
yellow_dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
pokemon_home_zoo_d_K2-155
I actually want to use anchor strings such as 'zoo', 'home' or 'home_zoo' to separate the characters before and after them, while also matching (and discarding) the last part, the country (or whatever place ID is specified). This makes the question a bit less general (I like the idea of using _, but let me make it trickier so I can learn more).
Two questions here:
1. What is the function of (?=) and .* in
(?=(?:_US|_Canada|_K2-155|$)).*$? It seems that if I use
(?:_US|_Canada|_K2-155|$) instead, it still works.
2. Since I extended the anchor strings to support
_, I used:
(.*?)(?:_*)(?:home_zoo|zoo|home)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
It seems OK, but if I use:
(.*?)(?:_*)(?:home|zoo|home_zoo)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
it matches home first for the last sample string. Is there a
greedy way to catch this without specifying the order of the
alternatives?
Well again, I would rather not maintain a long list of anchor strings, but I have no other idea for making it more general.
Thanks again!

You could try it like this:
^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$
This will capture 2 groups. You could for example use this in a replacement with group1, group2.
First capture the first part ending on an underscore in group 1 like cat_. Then match the second part ending with an underscore like zoo_ or home_.
From that point, capture in a group until a lookahead (?=...) finds one of your values or the end of the string.
That would match:
^ Begin of the string
([^_]+) Match in a capturing group not an _ one or more times (group 1)
_[^_]+_ match _ then not an _ one or more times followed by _
(.*?) Capture in a group any character zero or more times non-greedy (group 2)
(?= Positive lookahead that asserts what is on the right side is
(?: Non capturing group
_US|_Canada|_K2-155|$ your values or end of the string
) Close group
) Close group
.*$ Match any character zero or more times till the end of the string
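As a rough check of the breakdown above, here is a minimal Python sketch that applies the pattern to the four test strings. Python's `re` module is an assumption on my part, since the question does not name a language. The second half illustrates what the lookahead and the trailing .*$ contribute: the lookahead only checks for the suffix without consuming it, so without .*$ the suffix is left outside the match.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r"^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$")

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]

# Replace each whole line with "group1, group2".
results = [PATTERN.sub(r"\1, \2", s) for s in samples]
for line in results:
    print(line)

# Same pattern without the trailing .*$ : group 2 still stops at the suffix,
# but the suffix is not part of the match, so a replacement leaves it behind.
no_tail = re.compile(r"^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$))")
leftover = no_tail.sub(r"\1, \2", "cat_zoo_New_York_US")
print(leftover)  # cat, New_York_US
```

If only the capture groups are read and no replacement is done, the trailing .*$ makes no difference to the groups themselves.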
Edit: After the updated question, perhaps this will suit your requirements:
^(.*?)_(?:home_zoo|zoo|home)(.*?)(?=(?:_US|_Canada|_K2-155|$))
This will match any character zero or more times non-greedy (.*?), then an underscore and a non-capturing group (?:home_zoo|zoo|home) to separate the characters before and after. Note that home_zoo is listed before home, so the longer anchor is tried first.
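On the question of avoiding a hand-ordered alternation: backtracking engines such as Python's `re` try alternatives left to right and keep the first one that lets the match succeed, not the longest one. One possible workaround, sketched below, is to build the alternation from a list sorted by length. The helper name `build_pattern` is hypothetical, and the pattern adds an underscore after the anchor group (my assumption) so that group 2 does not start with _.

```python
import re

def build_pattern(anchors, suffixes):
    """Hypothetical helper: build the regex from plain lists of strings."""
    # Longest anchors first, so "home_zoo" is tried before "home".
    anchor_alt = "|".join(
        re.escape(a) for a in sorted(anchors, key=len, reverse=True)
    )
    suffix_alt = "|".join("_" + re.escape(s) for s in suffixes)
    # Assumption: an underscore follows the anchor, keeping it out of group 2.
    return re.compile(rf"^(.*?)_(?:{anchor_alt})_(.*?)(?=(?:{suffix_alt}|$))")

# The anchor list is deliberately given in the "wrong" order.
pattern = build_pattern(["home", "zoo", "home_zoo"], ["US", "Canada", "K2-155"])

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "yellow_dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
    "pokemon_home_zoo_d_K2-155",
]

pairs = [pattern.match(s).groups() for s in samples]
for pair in pairs:
    print(pair)
```

Sorting by length only removes the need to order the list by hand; the list of anchors itself still has to be supplied.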

Well, I tried a more straightforward approach. If your data is more complex than the sample that you gave above, this may fail. Otherwise, for the above text, it works fine.
Here is the expression that I used:
^([^_]*)_[^_]*_(.*)_.*$
 1      23    45   67
Basically what I did was:
1. Group the first character run, which contains no _, starting at the beginning of the line.
2. Then an _ follows that group.
3. Next comes an arbitrary-length string with no _ in it.
4. Then comes an _.
5. Group the next arbitrary-length string.
6. An _ comes afterwards.
7. The rest of the string.
replace it with \1, \2 (first group, second group).
You can find a fiddle here
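A quick way to see where this simple expression holds up and where it does not is to run the replacement over the question's test strings. The sketch below uses Python's `re` (my choice of language, not the answer's). Because the greedy (.*) must always give up the last _-separated token, a line without a country loses part of the place name, which is the kind of failure the caveat above warns about.

```python
import re

# The simple expression from this answer, unchanged.
SIMPLE = re.compile(r"^([^_]*)_[^_]*_(.*)_.*$")

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]

# Replace each line with "group1, group2", like \1, \2 in the answer.
results = [SIMPLE.sub(r"\1, \2", s) for s in samples]
for line in results:
    print(line)
# The second line comes out as "dog, South": "Carolina" is treated as the
# trailing token and discarded.
```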
If you are using vim, you can achieve the same thing with the following command:
:%s/^\([^_]*\)_[^_]*_\(.*\)_.*$/\1, \2/g
UPDATE
^([^_]*)_[^_]*_(((?:South_)|(?:New_))*[^_]*)((?:_US)|(?:_Canada)|(?:_K2-155))*$
You can find the new fiddle here: https://regex101.com/r/qQ2dE4/273
What is the difference between this one and the previous one?
Now I cheat a little: I look for adjectives that modify the state name, like South_ or New_. You can add more here, like East_, West_, Old_ or whatever else occurs in your data.
In some cases the country is missing from the data. Also, the last token on the very last line does not seem to follow a pattern. So I listed those options explicitly in the expression (US, Canada, etc.). You may need to add more exceptional cases there as well.

Related

Select Northings from a 1 Line String

I have the following string;
Start: 738392E, 6726376N
I extracted 738392 successfully using (?<=.art\:\s)([0-9A-Z]*). This gave me a single group match, allowing me to extract it as a column value.
I want to extract 6726376 the same way, with only one group in the result, because I am parsing it into a column value.
I am not sure why (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) gives me the entire line after S.
Help getting it right, with an explanation, would go a long way.
Because you used positive lookaheads. Those just make some assertions, but don't "move the head along".
(?=(art\:\s\s*)) makes sure you're before "art: ...". The next thing is another positive lookahead that you quantify with a star to make it optional. Finally you match anything, so you get the rest of the line in your capture group.
I propose a simpler regex:
(?<=(art\:\s))(\d+)\D+(\d+)
Demo
First we make a positive lookbehind that makes sure we're after "art: ", then we match two numbers, separated by non-digits.
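The difference between the two kinds of assertion can be seen in a short Python sketch (Python's `re` is my assumption; the question does not say which engine is used). A lookahead checks the text to the right without consuming it, so whatever follows it starts matching at the same position; a lookbehind checks the text to the left of the current position.

```python
import re

line = "Start: 738392E, 6726376N"

# Lookahead: asserts "art: " is ahead but does not move past it,
# so (.*) starts at the "a" and takes the rest of the line.
ahead = re.search(r"(?=art:\s)(.*)", line).group(1)
print(ahead)  # art: 738392E, 6726376N

# Lookbehind pattern from this answer: matching starts after "art: ".
# Group 1 sits inside the lookbehind, so the two numbers are groups 2 and 3.
m = re.search(r"(?<=(art\:\s))(\d+)\D+(\d+)", line)
easting, northing = m.group(2), m.group(3)
print(easting, northing)  # 738392 6726376
```

Note that Python's `re` only accepts fixed-width lookbehinds; `art\:\s` qualifies because it always spans five characters.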
There is no need for you to make it this complicated. Just use something like
Start: (\d+)E, (\d+)N
or
\b\d+(?=[EN]\b)
if you need to match each bit separately.
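Both suggestions can be tried in a few lines of Python (the language is my assumption): the first pattern returns the two numbers as groups of a single match, the second returns each number as its own match.

```python
import re

line = "Start: 738392E, 6726376N"

# One match, two capture groups.
both = re.fullmatch(r"Start: (\d+)E, (\d+)N", line).groups()
print(both)  # ('738392', '6726376')

# One match per number: digits that are followed by E or N and a word boundary.
each = re.findall(r"\b\d+(?=[EN]\b)", line)
print(each)  # ['738392', '6726376']
```

The second form is the one that fits a "single value per match" workflow, since each match is exactly one number with no groups to pick from.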
Your expression (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) has several problems besides the ones already mentioned: 1) your first and second lookahead match at different locations, 2) your second lookahead is quantified, which, in 25 years, I have never seen someone do, so kudos. ;), 3) your capturing group matches about anything, including any line or the empty string.
You match the whole part after it because you use .* which will match until the end of the line.
Note that the [0-9]* at the end of the pattern never matches any digits: it is optional, and the preceding .* has already consumed everything up to the end of the string.
You could get the match without any lookarounds:
(art:\s)(\d+)[^,]+,\s(\d+)
Regex demo
If you want the matches only, you could make use of the PyPI regex module, which supports the variable-length lookbehind used here:
(?<=\bStart:(?:\s+\d+[A-Z],)* )\d+(?=[A-Z])
Regex demo (For example only, using a different engine) | Python demo

Name validation - Adding a check to this regex to stop entering just identical characters

I'm trying to add another feature to a regex which is trying to validate names (first or last).
At the moment it looks like this:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$)([a-z][a-z'-]{1,})$/i
https://regex101.com/r/pQ1tP2/1
The idea is to do the following
Don't allow just adding a title like Mr, Mrs etc
Ensure the first character is a letter
Ensure subsequent characters are either letters, hyphens or apostrophes
Minimum of two characters
I have managed to get this far (shockingly I find regex so confusing lol).
It matches things like O'Brian or Anne-Marie etc and is doing a pretty good job.
My next additions I've struggled with though! trying to add additional features to the regex to not match on the following:
Just entering the same characters i.e. aaa bbbbb etc
Thanks :)
I'd add another negative lookahead alternative matching against ^(.)\1*$, that is, any character, repeated until the end of the string.
Included as is in your regex, it would make that :
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$|^(.)\1*$)([a-z][a-z'-]{1,})$/i
However, I would probably simplify your negative lookahead as follows :
/^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$/i
The modifications are as follows:
The lookahead is evaluated at the start of the string, as indicated by the ^ preceding it, so there is no need to repeat the start-of-string anchor in each of its clauses.
Each alternative matches up to the end of the string, so we can put the alternatives in a group followed by a single end-of-string anchor.
We have created a new group, which we have to take into account in our back-reference: to reference the same group, it must now use \2 rather than \1. An alternative in certain regex flavours would be to use a non-capturing group (?:...).
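To sanity-check the simplified lookahead, here is a small Python sketch. The language and the helper name `is_valid_name` are my own assumptions; the question's pattern is written as a JavaScript-style literal, so the /i flag becomes `re.IGNORECASE`. The pattern below lists all six titles from the question's original regex.

```python
import re

NAME = re.compile(
    r"^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$",
    re.IGNORECASE,
)

def is_valid_name(name: str) -> bool:
    """Hypothetical helper: True if the name passes the regex."""
    return NAME.match(name) is not None

accepted = ["O'Brian", "Anne-Marie", "Lee"]
rejected = ["mr", "Mrs", "miss", "aaa", "bbbbb", "a", "1abc"]

for name in accepted + rejected:
    print(name, is_valid_name(name))
```

The (.)\2* alternative rejects any input made of one repeated character, and the {1,} quantifier together with the leading letter enforces the two-character minimum.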

Capture filename parts: Why doesn't this regexp work?

I'm fairly new to regexp and I am missing something about capturing groups.
Let's suppose I have a filepath like that
test.orange.john.edn
I want to capture two groups:
test.orange.john (which is the body)
edn (which is the extension)
I used this (and variants of it, taking the $ outside, etc.)
^([a-z]*.)*.([a-z]*$)
But it captures ed only
What did I miss? I do not understand why n is not captured, nor the body...
I found answers on the web to capture the extension but I do not understand the problem there.
Thanks
The ^([a-z]*.)*.([a-z]*$) regex is very inefficient as there are lots of unnecessary backtracking steps here.
The start of string is matched, and then [a-z]*. is matched 0+ times. That means, the engine matches as many [a-z] as possible (i.e. it matches test up to the first dot), and then . matches the dot (but only because . matches any character!). So, this ([a-z]*.)* matches test.orange.john.edn only capturing edn since repeating capturing groups only keep the last captured value.
You already have edn in Group 1 at this step. Now, .([a-z]*$) should allocate a substring for the . (any character) pattern. Backtracking goes back and finds n - now, Group 1 only contains ed.
For your task, you should escape the last . to match a literal dot and perhaps, the best expression is
^(.*)\.(.*)$
See demo
It will match the whole string up to the end with the first (.*), then backtrack to find the last . symbol (so Group 1 will hold all text from the beginning up to the last .), and then capture the rest of the string into Group 2.
If a dot does not have to be present (i.e. if a file name has no extension), add an optional group and make the first group lazy, so that the optional group gets a chance to match:
^(.*?)(?:\.([^.]*))?$
See another demo
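The greedy-then-backtrack behaviour can be checked with a short Python sketch (the language is my assumption). It also compares a greedy and a lazy first group for the optional-extension case: with a greedy first group the optional part is simply skipped, so the lazy form ^(.*?)(?:\.([^.]*))?$ is the variant this sketch assumes.

```python
import re

path = "test.orange.john.edn"

# Greedy (.*) takes everything, then backtracks to the last literal dot.
body, ext = re.match(r"^(.*)\.(.*)$", path).groups()
print(body, ext)  # test.orange.john edn

# Optional extension with a greedy first group: group 1 swallows the whole
# string and the optional group is skipped, so group 2 is None.
greedy = re.match(r"^(.*)(?:\.(.*))?$", path).groups()
print(greedy)  # ('test.orange.john.edn', None)

# Lazy first group with a dot-free extension group.
OPTIONAL_EXT = re.compile(r"^(.*?)(?:\.([^.]*))?$")
with_ext = OPTIONAL_EXT.match(path).groups()
without_ext = OPTIONAL_EXT.match("Makefile").groups()
print(with_ext)     # ('test.orange.john', 'edn')
print(without_ext)  # ('Makefile', None)
```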
You can try with:
^([a-z.]+)\.([a-z]+)$
online example

How to match everything up to the second occurrence of a character?

So my string looks like this:
Basic information, advanced information, super information, no information
I would like to capture everything up to second comma so I get:
Basic information, advanced information
What would be the regex for that?
I tried: (.*,.*), but I get
Basic information, advanced information, super information,
This will capture up to but not including the second comma:
[^,]*,[^,]*
English translation:
[^,]* = as many non-comma characters as possible
, = a comma
[^,]* = as many non-comma characters as possible
[...] is a character class. [abc] means "a or b or c", and [^abc] means anything but a or b or c.
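A minimal Python check of this pattern against the question's string (Python is my choice of language here):

```python
import re

text = "Basic information, advanced information, super information, no information"

# Non-commas, one comma, non-commas: stops right before the second comma.
first_two = re.match(r"[^,]*,[^,]*", text).group(0)
print(first_two)  # Basic information, advanced information
```

Because [^,] can never cross a comma, no backtracking is needed to find the stopping point.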
You could try ^(.*?,.*?),
The problem is that .* is greedy and matches maximum amount of characters. The ? behind * changes the behaviour to non-greedy.
You could also put parentheses around each .*? segment to capture the strings separately if you want.
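The effect of greedy versus non-greedy matching can be compared side by side in Python (an assumed language for this sketch):

```python
import re

text = "Basic information, advanced information, super information, no information"

# Greedy: each .* takes as much as it can, so the match runs to the last comma.
greedy = re.search(r"(.*,.*),", text).group(1)
print(greedy)  # Basic information, advanced information, super information

# Non-greedy: each .*? takes as little as it can, stopping at the second comma.
lazy = re.search(r"^(.*?,.*?),", text).group(1)
print(lazy)  # Basic information, advanced information

# Separate groups for each segment; the second keeps its leading space.
parts = re.search(r"^(.*?),(.*?),", text).groups()
print(parts)  # ('Basic information', ' advanced information')
```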
I would take a DRY approach, like this:
^([^,]*,){1}[^,]*
This way you can match everything up to the nth occurrence of a character without repeating yourself, except for the last pattern.
In the original poster's case the group and its repetition are not really needed, but I think this will help others who need to match the pattern more than twice.
Explanation:
^ From the start of the line
([^,]*,) Create a group matching everything except the comma character until it meets a comma.
{1} Repeat the above pattern n-1 times, where n is the occurrence you need. So if you need 2 put 1, if you need 20 put 19.
[^,]* Repeat the pattern one last time without the trailing comma.
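The n-1 rule above can be wrapped in a small function. This is only a sketch in Python (my choice of language); the helper name `up_to_nth` and its parameters are hypothetical, and it uses a non-capturing group since the group's content is not needed.

```python
import re
from typing import Optional

def up_to_nth(text: str, n: int, sep: str = ",") -> Optional[str]:
    """Return text up to, but not including, the n-th `sep`.

    Returns None when the text has fewer than n-1 separators.
    """
    if n < 1 or len(sep) != 1:
        raise ValueError("n must be >= 1 and sep a single character")
    s = re.escape(sep)
    # (n-1) repetitions of "non-separators + separator", then one last run.
    pattern = rf"^(?:[^{s}]*{s}){{{n - 1}}}[^{s}]*"
    m = re.match(pattern, text)
    return m.group(0) if m else None

text = "Basic information, advanced information, super information, no information"
print(up_to_nth(text, 2))  # Basic information, advanced information
```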
Try this approach:
(.*?,.*?),.*
Link to the solution

Regex matching problem

I don't know how to write such a regex. I will start with an example.
My bad regex:
(\d*),?(\d*\.?\d*)-?(\d*\.?\d*),?([0-1]?),?([0-1]?),?([^\/]*)
Matches that are OK:
1,2-3,1,1,asdf
1,2-3,1,1
1,2-3,1
1,2-3
1,2
1
But unfortunately this will also be matched and I don't want it to be:
asdf
1,asdf
Ideally, I would like something like: match only if the previous groups were matched.
I know that a positive lookbehind should probably be used, but if I'm not wrong, it would have to go right in front of each group except the first, and the regex would become large and smelly. Also, it would probably be variable length.
Is there any elegant way to do that?
EDIT
I want to match all lines given below Matches that are OK.
I would like to match \d* to first group. Then, if there was a match to \d* followed by ,, I would like to match (\d*\.?\d*) to second group. After that, if there was a match in first group followed by , and match in second group followed by - I would like to match another (\d*\.?\d*)... etc. to the end of Regex.
You're not very clear in your question, but from the examples I think this is what you need:
^\d(,\d(-\d(,\d(,\d(,[a-z]+)?)?)?)?)?$
It matches:
1,2-3,1,1,asdf
1,2-3,1,1
1,2-3,1
1,2-3
1,2
1
Test link.
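The idea of "match this part only if the previous part matched" maps onto nested optional groups: each later field sits inside the optional group of the field before it. The Python sketch below (language assumed) uses a variant in which -\d has its own nesting level, so that 1,2 is accepted too; like the answer, it only handles single digits rather than the decimals from the question.

```python
import re

# Each optional group can only match if everything before it has matched.
NESTED = re.compile(r"^\d(,\d(-\d(,\d(,\d(,[a-z]+)?)?)?)?)?$")

should_match = [
    "1,2-3,1,1,asdf",
    "1,2-3,1,1",
    "1,2-3,1",
    "1,2-3",
    "1,2",
    "1",
]
should_fail = ["asdf", "1,asdf"]

for line in should_match + should_fail:
    print(line, bool(NESTED.match(line)))
```

Supporting multi-digit or decimal fields would mean swapping each \d for a richer sub-pattern while keeping the same nesting.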