Find any 4 consecutive characters between two strings - regex

I'm trying to write a regex that detects whether any run of 4 consecutive non-whitespace characters occurs in both of two strings. They will always be separated by a comma. An example:
Labrador, Matador ---> this would match 'ador'.
Mississippi, Missing ---> This would match 'Miss' and 'issi'
Corporate, Corporation ---> This would match 'Corp' , 'orpo' , 'rpor' , 'pora' and 'orat'
It's been pretty hard to find something similar to this, and the closest I've found has said this is not possible in regex. It's definitely tricky, but I wanted to make sure that it was in fact not possible before looking for a different solution.
If it is impossible, would someone explain why?

For overlapping matches it is possible with a lookahead:
/(?=(\S{4}).*,.*\1)/
Note that there is one more issi possible in your second line example.
Test: https://regex101.com/r/rV3gN9/2
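For anyone who wants to try this outside regex101, here is a minimal sketch using Python's re module. Python is my assumption (the question does not name a language) and the helper name shared_runs is made up for illustration. Because the lookahead consumes no characters, findall tries every starting position, so overlapping runs and the repeated issi are all reported.

```python
import re

# Pattern from the answer above: capture 4 non-whitespace characters inside a
# lookahead, then require the same 4 characters again somewhere after a comma.
PATTERN = re.compile(r'(?=(\S{4}).*,.*\1)')

def shared_runs(line):
    """Return every 4-character run before the comma that recurs after it."""
    return PATTERN.findall(line)

for line in ("Labrador, Matador", "Mississippi, Missing", "Corporate, Corporation"):
    print(line, "->", shared_runs(line))
```

Repeats are kept on purpose; wrap the result in set() if only distinct runs matter.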

You can use this lookahead based regex:
(?=([a-zA-Z]{4})[a-zA-Z]*, *[a-zA-Z]*\1)
Though it will find issi twice, since Mississippi has 2 instances of issi.

This can be achieved with backreferences:
\w*([a-zA-Z]{4})\w*, \w*\1\w*
See example: https://regex101.com/r/eW8hB7/1
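One thing to keep in mind with the backreference form: it consumes the whole line, so it reports a single shared run per line rather than all of them. A small sketch in Python (my choice of language, with the letter class written as [a-zA-Z]; the helper name one_shared_run is hypothetical):

```python
import re

# Backreference pattern: some word characters, a captured run of 4 letters,
# more word characters, ", ", then the same run again in the second word.
PATTERN = re.compile(r'\w*([a-zA-Z]{4})\w*, \w*\1\w*')

def one_shared_run(line):
    """Return the single run the engine settles on, or None if there is none."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

for line in ("Labrador, Matador", "Mississippi, Missing", "Cat, Dog"):
    print(line, "->", one_shared_run(line))
```

Which run comes back depends on backtracking order: the leading \w* is greedy, so the engine tends to find the rightmost shared run of the first word.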

Related

Regex for *either or both* of last two characters are not digits?

I can't figure out the proper regular expression for this... Most of my data ends with digits as the last two characters. In a subset, one or both of the last two characters are non-digits. So xyz99 is normal and I'm able to find those records with "*[0-9][0-9]$". If I change that to "*[^0-9][^0-9]$" then I get records where both are non-digits.
I don't know regex well enough to match all of the following with a single regex: xy9z, xyz9, xyzw, but not matching xyz99.
I prefer a single regex, but (already know how to and) can work-around with multiple.
Thanks for any help.
[^\d]$|[^\d].$
should do the trick
https://regex101.com/r/PsZxLj/2
It matches anything that doesn't end in a digit OR anything where the 2nd to last character isn't a digit. Lots of ways to do this, but pick one that is easy for you to read and maintain. :) Good luck!
Something like this should do it:
(?:\d[^\d]|[^\d].)$
You could use a negative lookahead (?!...).
(?![0-9]{2}).{2}$
This will first make sure that [0-9]{2} (2 digits) does not match. Then proceeds to match the remaining regex, which matches any 2 characters .{2} followed by the end of the string $.
Thank you all for the quick and helpful responses. A couple of the references above that start with "(?" are beyond what I understand so far. But the "or" operator is what I was missing. Here is what I ended up using (and it worked): select * from mytable where regexp_like( myfield, '.*([^0-9]|[^0-9][0-9])$' );
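To compare the suggestions side by side, here is a rough sketch that runs each pattern from this thread against the sample values in the question. It uses Python's re engine, which is an assumption on my part; the last pattern was written for a database regexp_like call and may behave slightly differently there. The dictionary labels are only for display.

```python
import re

# Patterns quoted in this thread, keyed by a descriptive label.
PATTERNS = {
    "alternation": r'[^\d]$|[^\d].$',
    "grouped alternation": r'(?:\d[^\d]|[^\d].)$',
    "negative lookahead": r'(?![0-9]{2}).{2}$',
    "final query pattern": r'.*([^0-9]|[^0-9][0-9])$',
}

SHOULD_MATCH = ["xy9z", "xyz9", "xyzw"]
SHOULD_NOT_MATCH = ["xyz99"]

def matches(pattern, value):
    """True if the pattern matches anywhere in value (like regexp_like)."""
    return re.search(pattern, value) is not None

for label, pattern in PATTERNS.items():
    hits = [v for v in SHOULD_MATCH + SHOULD_NOT_MATCH if matches(pattern, v)]
    print(label, "->", hits)
```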

How can I solve this regex using two asserts?

I have these 3 consecutive words : Nocivic Voie and Quartier
I have something like this :
#Nocivic;Voie;Quartier#
Question :
I need to make a regex to extract the 3 words Nocivic, Voie and Quartier using a positive lookahead. The semicolons need to be included in my regex, but not the #.
I realized that this could work : \bNocivic(?=;Voie);\bVoie;Quartier
But why is this not working ?
\bNocivic(?=;Voie);\bVoie(?<=Voie;)\bQuartier
I am not too experienced with regex so if someone could tell me why or give me the correct answer if I really wanted to use another lookbehind would be greatly appreciated thanks.
The first one is equivalent to
\bNocivic;Voie;Quartier
(?=;Voie) just tests whether ;Voie follows Nocivic, so it is not useful here.
Extract from
https://www.regextutorial.org/positive-and-negative-lookahead-assertions.php
They only assert if in a given test string the match with certain conditions is possible or not Yes or No.
See the difference below
Nocivic;Voie Ok & returns Nocivic;Voie
Nocivic(?=;Voie) Ok & returns Nocivic
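Second one :
(?<=Voie;) is a lookbehind. The syntax is valid in most engines, but the assertion cannot succeed at the position where it is placed.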
The second one is not working because, after matching Voie, you assert with (?<=Voie;) that Voie; must sit directly to the left of the current position, but you have not matched the semicolon yet.
Note that the lookaround assertions are fruitless in the example, as you are asserting what you are also matching.
If you want to match exactly those 3 words, it does not make sense to use lookarounds.
You can use 3 capture groups:
#(Nocivic);(Voie);(Quartier)#
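To see the lookbehind problem concretely, here is a short sketch in Python (an assumption, since the question does not say which engine is used). The "repaired" pattern is only my illustration of where a lookbehind could succeed, not something proposed in the question.

```python
import re

text = "#Nocivic;Voie;Quartier#"

# Attempt from the question: the lookbehind runs right after "Voie", where the
# text to the left ends in ";Voie" and not "Voie;", so the assertion fails.
failing = re.search(r'\bNocivic(?=;Voie);\bVoie(?<=Voie;)\bQuartier', text)

# Hypothetical repair: consume the semicolon first, then look behind.
repaired = re.search(r'\bNocivic(?=;Voie);Voie;(?<=Voie;)Quartier', text)

# Capture-group version: no lookarounds needed.
groups = re.search(r'#(Nocivic);(Voie);(Quartier)#', text).groups()

print(failing)            # no match object
print(repaired.group(0))  # the three words with their semicolons, without '#'
print(groups)             # the three words on their own
```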

Regex construction assistance

I've become stuck constructing a regex, and was wondering if you guys could help me out.
Here's the full string:
/20271/Avtal%202013/Sammanst%c3%a4lld_produktlista_2013_v121220_l%c3%a5st_web.xls
I want to extract 2013_v121220 but the matching must follow a couple of rules:
Total length of version string must at least be 4 characters
It must allow for version strings that do not contain either "v" or "_", i.e. 2013121220 or 2013_121220
The version string must be the last occurrence before the end of the string (i.e. do not match /20271/ in this case).
I've tried with (\d+[_v]*\d+).*?_web(\.xlsx?)$ but I have no idea how to implement the length check here, i.e. (\d+[_v]*\d+) has to be at least 4 characters, {4,}, this to not match the 5 in _l%c3%a5st_web.xls.
I've come up with this regex: (\d[\d_v]{2,}\d).*?_web(\.xlsx?)$ but it only matches the first occurrence; I need the last one, closest to the end. I've tried prefixing .+ to the regex but it fails regardless.
(?<=produktlista_)(\d{1,}(?:_v)\d{1,}|\d{4,})
This will get you the version string that follows produktlista_.
Editing anubhava's regex:
\d{1,}(?:_v)?\d{1,}(?=.*?_web\.xlsx?$)
demo here : http://regex101.com/r/pW6mV8
You can use this regex:
(?=\d*\d.{2}\d\d*\D)\d+(?:_v)?\d+(?!.*?\d(?:_v)?\d)(?=.*?_web\.xlsx?$)
I managed to solve this myself. Thanks for your efforts though, I really appreciated your help.
This regex matches exactly according to my rules:
(\d[\d_v]{2,}\d)(?!.*\d[\d_v]{2,}\d).*?_web(\.xlsx?)$
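As a quick check of the difference the negative lookahead makes, here is a sketch in Python (my assumption; the thread does not name the language). The helper name version is made up. The lookahead rejects any candidate that is followed later in the string by another run of the same shape, which is what pushes the match to the last occurrence.

```python
import re

# Earlier attempt from the question: takes the first qualifying run.
FIRST_TRY = re.compile(r'(\d[\d_v]{2,}\d).*?_web(\.xlsx?)$')
# Final regex: the same run, but only if no similar run follows it.
FINAL = re.compile(r'(\d[\d_v]{2,}\d)(?!.*\d[\d_v]{2,}\d).*?_web(\.xlsx?)$')

path = "/20271/Avtal%202013/Sammanst%c3%a4lld_produktlista_2013_v121220_l%c3%a5st_web.xls"

def version(pattern, s):
    """Return the captured version string, or None when nothing matches."""
    m = pattern.search(s)
    return m.group(1) if m else None

print(version(FIRST_TRY, path))  # picks up the leading folder number
print(version(FINAL, path))      # picks up the version near the end

# The two alternative version formats listed in the question.
for alt in ("2013121220", "2013_121220"):
    print(version(FINAL, path.replace("2013_v121220", alt)))
```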

Regex href match a number

Well, here I am back at regex and my poor understanding of it. Spent more time learning it and this is what I came up with:
/(.*)
I basically want the number in this string:
510973
Is my regex almost right? My original was:
"/<a href=\"travis.php?theTaco(.*)\">(.*)<\/a>/";
But sometimes it returned me huge strings. So, I just want to get numbers only.
I searched through other posts but there is such a large amount of unrelated material, please give an example, resource, or a link directing to a very related question.
Thank you.
Try using a HTML parser provided by the language you are using.
Reason why your first regex fails:
[0-9999999] is not what you think. It is the same as [0-9], which matches one digit. To match a number you need [0-9]+. Also, .* is greedy and will try to match as much as it can. You can use .*? to make it non-greedy. Since you are trying to match a number again, use [0-9]+ again instead of .*. Also, if the two numbers you are capturing will be the same, you can just match the first and use a back reference \1 for the second one.
There are also a few regex metacharacters that you need to escape, such as . and ?.
Try:
<a href=\"travis\.php\?theTaco=([0-9]+)\">\1<\/a>
To capture a number, you don't use a range like [0-99999], you capture by digit. Something like [0-9]+ is more like what you want for that section. Also, escaping is important like codaddict said.
Others have already mentioned some issues regarding your regex, so I won't bother repeating them.
There are also issues regarding how you specified what it is you want. You can simply match via
/theTaco=(\d+)/
and take the first capturing group. You have not given us enough information to know whether this suits your needs.
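Here is a small sketch of the points made in these answers, in Python rather than the asker's language (which looks like PHP, but that is a guess). The sample markup is an assumption: the original HTML in the question did not survive, so the string below is pieced together from the regexes quoted in the thread.

```python
import re

# Assumed sample markup, reconstructed from the patterns in this thread.
html = '<a href="travis.php?theTaco=510973">510973</a>'

# [0-9999999] is a character class: the range 0-9 plus redundant 9s,
# so it matches exactly one digit, never a multi-digit number.
one_digit = re.fullmatch(r'[0-9999999]', '5') is not None
two_digits = re.fullmatch(r'[0-9999999]', '55') is not None

# Strict form: escaped metacharacters, with a backreference for the link text.
strict = re.search(r'<a href="travis\.php\?theTaco=([0-9]+)">\1</a>', html)

# Loose form: just grab the digits after theTaco=.
loose = re.search(r'theTaco=(\d+)', html)

print(one_digit, two_digits)
print(strict.group(1), loose.group(1))
```

The strict form only matches when the link text repeats the number in the href; the loose form does not care what surrounds the parameter.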

Need a regex to exclude certain strings

I'm trying to get a regex that will match:
somefile_1.txt
somefile_2.txt
somefile_{anything}.txt
but not match:
somefile_16.txt
I tried
somefile_[^(16)].txt
with no luck (it includes even the "16" record)
Some regex libraries allow lookahead:
somefile_(?!16\.txt$).*?\.txt
Otherwise, you can still use multiple character classes:
somefile_([^1].|1[^6]|.|.{3,})\.txt
or, to achieve maximum portability:
somefile_([^1].|1[^6]|.|....*)\.txt
[^(16)] means: Match any character but parentheses, 1, and 6.
The best solution has already been mentioned:
somefile_(?!16\.txt$).*\.txt
This works, and is greedy enough to take anything coming at it on the same line. If you know, however, that you want a valid file name, I'd suggest also limiting invalid characters:
somefile_(?!16)[^?%*:|"<>]*\.txt
If you're working with a regex engine that does not support lookahead, you'll have to consider how to make up that !16. You can split files into two groups, those that start with 1, and aren't followed by 6, and those that start with anything else:
somefile_(1[^6]|[^1]).*\.txt
If you want to allow somefile_16_stuff.txt but NOT somefile_16.txt, these regexes above are not enough. You'll need to set your limit differently:
somefile_(16.|1[^6]|[^1]).*\.txt
Combine this all, and you end up with two possibilities, one which blocks out the single instance (somefile_16.txt), and one which blocks out all families (somefile_16*.txt). I personally think you prefer the first one:
somefile_((16[^?%*:|"<>]|1[^6?%*:|"<>]|[^1?%*:|"<>])[^?%*:|"<>]*|1)\.txt
somefile_((1[^6?%*:|"<>]|[^1?%*:|"<>])[^?%*:|"<>]*|1)\.txt
The same two versions without the special-character exclusions, so they are easier to read:
somefile_((16.|1[^6]|[^1]).*|1)\.txt
somefile_((1[^6]|[^1]).*|1)\.txt
To adhere strictly to your specification and be picky, you should instead use:
^somefile_(?!16\.txt$).*\.txt$
so that somefile_1666.txt which is {anything} can be matched ;)
but sometimes it is just more readable to use...:
ls | grep -e 'somefile_.*\.txt' | grep -v -e 'somefile_16\.txt'
somefile_(?!16).*\.txt
(?!16) means: Assert that it is impossible to match the regex "16" starting at that position.
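The two lookahead variants in this thread behave differently once the part after somefile_ merely starts with 16. A sketch in Python (assumed; the question does not name an engine), with a made-up helper accepted and a few extra file names of my own added to the ones from the question:

```python
import re

names = [
    "somefile_1.txt",
    "somefile_2.txt",
    "somefile_abc.txt",
    "somefile_16.txt",
    "somefile_1666.txt",       # extra test name, not from the question
    "somefile_16_stuff.txt",   # extra test name, not from the question
]

# Rejects only the exact name somefile_16.txt.
SINGLE = re.compile(r'^somefile_(?!16\.txt$).*\.txt$')
# Rejects everything whose suffix starts with 16.
FAMILY = re.compile(r'somefile_(?!16).*\.txt')

def accepted(pattern, candidates):
    """Return the candidates the pattern matches, in their original order."""
    return [n for n in candidates if pattern.search(n)]

print(accepted(SINGLE, names))
print(accepted(FAMILY, names))
```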
Sometimes it's just easier to use two regular expressions. First look for everything you want, then ignore everything you don't. I do this all the time on the command line where I pipe a regex that gets a superset into another regex that ignores stuff I don't want.
If the goal is to get the job done rather than find the perfect regex, consider that approach. It's often much easier to write and understand than a regex that makes use of exotic features.
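The same two-step idea can be sketched in Python instead of a shell pipeline (the language and the helper name keep are my assumptions). One design difference from grep: fullmatch is used here so each pattern has to cover the whole name, whereas grep matches anywhere in the line.

```python
import re

WANTED = re.compile(r'somefile_.*\.txt')      # superset: everything of interest
UNWANTED = re.compile(r'somefile_16\.txt')    # the one name to drop

def keep(names):
    """First select the superset, then discard the unwanted name."""
    return [n for n in names if WANTED.fullmatch(n) and not UNWANTED.fullmatch(n)]

print(keep(["somefile_1.txt", "somefile_16.txt", "somefile_1666.txt", "notes.txt"]))
```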
Without using lookahead
somefile_(|.|[^1].+|10|11|12|13|14|15|17|18|19|.{3,})\.txt
Read it as: somefile_ followed by one of:
nothing.
one character.
any one character except 1, followed by one or more other characters.
one of 10 through 19, with 16 left out.
three or more characters.
and finally followed by .txt.
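For engines without lookahead, the alternation-based patterns from this thread can be checked the same way. This is a sketch in Python (assumed), using fullmatch so each pattern must cover the whole name; the enumerated pattern is written with the dot before txt escaped, and the last two file names are extra test cases of my own.

```python
import re

names = [
    "somefile_1.txt", "somefile_2.txt", "somefile_abc.txt",
    "somefile_16.txt", "somefile_1666.txt", "somefile_16_stuff.txt",
]

# Blocks only somefile_16.txt.
BLOCK_SINGLE = re.compile(r'somefile_((16.|1[^6]|[^1]).*|1)\.txt')
# Blocks every name whose suffix starts with 16.
BLOCK_FAMILY = re.compile(r'somefile_((1[^6]|[^1]).*|1)\.txt')
# Enumerates the allowed two-character suffixes that start with 1.
ENUMERATED = re.compile(r'somefile_(|.|[^1].+|10|11|12|13|14|15|17|18|19|.{3,})\.txt')

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [n for n in candidates if pattern.fullmatch(n)]

for label, pattern in (("single", BLOCK_SINGLE), ("family", BLOCK_FAMILY), ("enumerated", ENUMERATED)):
    print(label, "->", accepted(pattern, names))
```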