Regex construction assistance - regex

I've become stuck constructing a regex, and was wondering if you guys could help me out.
Here's the full string:
/20271/Avtal%202013/Sammanst%c3%a4lld_produktlista_2013_v121220_l%c3%a5st_web.xls
I want to extract 2013_v121220 but the matching must follow a couple of rules:
The total length of the version string must be at least 4 characters
It must allow for version strings that do not contain either "v" or "_", i.e. 2013121220 or 2013_121220
The version string must be the last occurrence before the end of the string (i.e. do not match /20271/ in this case).
I've tried with (\d+[_v]*\d+).*?_web(\.xlsx?)$ but I have no idea how to implement the length check here, i.e. (\d+[_v]*\d+) has to be at least 4 characters, {4,}, this to not match the 5 in _l%c3%a5st_web.xls.
I've come up with this regex: (\d[\d_v]{2,}\d).*?_web(\.xlsx?)$ but it only matches the first occurrence; I need the last one, closest to the end. I've tried prefixing .+ to the regex, but it fails regardless.

(?<=produktlista_)(\d{1,}(?:_v)\d{1,}|\d{4,})
This will get you the value that follows produktlista_.
Editing anubhava's regex:
\d{1,}(?:_v)?\d{1,}(?=.*?_web\.xlsx?$)
demo here : http://regex101.com/r/pW6mV8

You can use this regex:
(?=\d*\d.{2}\d\d*\D)\d+(?:_v)?\d+(?!.*?\d(?:_v)?\d)(?=.*?_web\.xlsx?$)

I managed to solve this myself. Thanks for your efforts though, I really appreciated your help.
This regex matches exactly according to my rules:
(\d[\d_v]{2,}\d)(?!.*\d[\d_v]{2,}\d).*?_web(\.xlsx?)$
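As a quick check, the final pattern can be exercised in Python against the original path and the two variants mentioned in the rules. This is only a sketch: the helper name extract_version and the two shortened variant paths are made up for illustration.

```python
import re

# Final pattern from the post: a digit, at least two of [digit, "_", "v"],
# then a digit, with a negative lookahead that rejects the candidate if
# another token of the same shape appears later in the string.
VERSION_RE = re.compile(r"(\d[\d_v]{2,}\d)(?!.*\d[\d_v]{2,}\d).*?_web(\.xlsx?)$")


def extract_version(path):
    """Return the last version-like token before _web.xls(x), or None."""
    match = VERSION_RE.search(path)
    return match.group(1) if match else None


original = "/20271/Avtal%202013/Sammanst%c3%a4lld_produktlista_2013_v121220_l%c3%a5st_web.xls"
print(extract_version(original))  # 2013_v121220

# Hypothetical variants without "v" and/or "_", as allowed by the rules.
print(extract_version("/20271/produktlista_2013121220_web.xlsx"))
print(extract_version("/20271/produktlista_2013_121220_web.xls"))
```

The negative lookahead is what forces the last occurrence: /20271/ has the right shape, but it is rejected because a later candidate still exists.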

Related

How to extract characters from a string with optional string afterwards using Regex?

I am in the process of learning Regex and have been stuck on this case. I have a url that can be in two states EXAMPLE 1:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA
OR EXAMPLE 2:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA
I need to extract the 1HYcYZCOpaLjg51qUg8ilA ID
So far I am using this: (?<=track\/)(.*)(?=\?)? which works well for Example 2 but it includes the ?si=Nf5w1q9MTKu3zG_CJ83RWA when matching with Example 1.
BUT if I remove the ? at the end of the expression then it works for Example 1 but not Example 2! Doesn't that mean that last group (?=\?) is optional and should match?
Where am I going wrong?
Thanks!
I searched a handful of "Questions that may already have your answer" suggestions from SO, and didn't find this case, so I hope asking this is okay!
The capturing group in your regular expression is trying to match anything (.) as much as possible due to the greediness of the quantifier (*).
When you use:
(?<=track\/)(.*)(?=\?)
only 1HYcYZCOpaLjg51qUg8ilA from the first example is captured, as there is no question mark in your second example.
When using:
(?<=track\/)(.*)(?=\??)
You are effectively making the positive lookahead optional, so the capturing group will try to match as much as possible (including the question mark), so that 1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA and 1HYcYZCOpaLjg51qUg8ilA are matched, which is not the desired output.
Rather than matching anything, it is perhaps more appropriate for you to match word characters \w (letters, digits, and the underscore) only.
(?<=track\/)(\w*)(?=\??)
Alternatively, if you are expecting other characters, let's say a hyphen - (the underscore _ is already covered by \w), you may list them explicitly in a character class.
(?<=track\/)([a-zA-Z0-9_-]*)(?=\??)
Or you might want to capture everything except a question mark ? with a negated character class.
(?<=track\/)([^?]*)(?=\??)
As pointed out by gaganso, a look-behind is not necessary in this situation (or indeed the lookahead), however it is indeed a good idea to start playing around with them. The look-around assertions do not actually consume the characters in the string. As you can see here, the full match for both matches only consists of what is captured by the capture group. You may find more information here.
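To make the difference concrete, here is a small Python sketch comparing the greedy .* version with the negated character class. The helper names greedy_capture and track_id are invented for this example, and the slash is left unescaped because Python does not need \/.

```python
import re

with_query = "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA"
without_query = "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA"

# Greedy version: .* runs to the end of the string, and the optional
# lookahead is then satisfied by matching nothing.
GREEDY_RE = re.compile(r"(?<=track/)(.*)(?=\??)")
# Negated class: the capture can never cross a question mark.
NEGATED_RE = re.compile(r"(?<=track/)([^?]*)(?=\??)")


def greedy_capture(url):
    return GREEDY_RE.search(url).group(1)


def track_id(url):
    return NEGATED_RE.search(url).group(1)


for url in (with_query, without_query):
    match = NEGATED_RE.search(url)
    # Look-arounds consume nothing, so the full match equals the group.
    print(greedy_capture(url), track_id(url), match.group(0) == match.group(1))
```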
This should work:
track\/(\w+)
Please see here.
Since track is part of both the strings, and the ID is formed from alphanumeric characters, the above regex which matches the string "track/" and captures the alphanumeric characters after that string, should provide the required ID.
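A minimal Python sketch of this simpler pattern, assuming the ID only ever contains \w characters. The function name capture_track_id is hypothetical. Note that without look-arounds the full match includes the track/ prefix, so the ID has to be read from the capture group.

```python
import re

# "track/" is matched literally; the ID is whatever \w+ captures after it.
ID_RE = re.compile(r"track/(\w+)")


def capture_track_id(url):
    """Return the captured ID, or None if the URL has no track segment."""
    match = ID_RE.search(url)
    return match.group(1) if match else None


url = "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA"
print(ID_RE.search(url).group(0))  # full match, prefix included
print(capture_track_id(url))       # capture group only
```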
Regex : (\w+(?=\?))|(\w+$)
See the demo for the regex, https://regexr.com/3s4gv .
This will first try to find a word that has '?' right after it, and if that's unsuccessful, it will fetch the last word.

Matching all strings without 3 occurrences of "/" or a final single character in RegEx

Trying to figure out the regex for the title,
i.e.,
foo
foo/bar/foo
foo/bar/foo/bar
foo/bar/d
I don't want it to match the 3rd or the 4th one but match the first two. In the 2nd option, the final foo can be anything but a single d.
You could use a regex but it will be more complicated than just counting the number of slashes and also checking the last character isn't a d. If you want to use a regex to check for the last part not being "/d" you could do something like check that it doesn't match ^.*/d$ but it may be clearer to just use code. (If counting slashes and checking string doesn't end in "/d" isn't exactly what you mean then it will help to have more examples)
Figured it out. See below if anyone is interested.
(^foo/?$)|(^foo/[^/]+/(([^d][^/]*)|(d[^/]+))/?$)
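For anyone who wants to try it, here is a rough Python check of the regex against the four sample strings, next to the plain-code approach suggested in the earlier answer. The function names are made up, and accepts_by_counting is only one reading of the rules (fewer than three slashes, not ending in "/d"), so the two approaches are not guaranteed to agree on inputs outside the samples.

```python
import re

PATH_RE = re.compile(r"(^foo/?$)|(^foo/[^/]+/(([^d][^/]*)|(d[^/]+))/?$)")


def accepts_by_regex(path):
    return PATH_RE.search(path) is not None


def accepts_by_counting(path):
    # Assumed interpretation: fewer than three slashes and no final "/d".
    return path.count("/") < 3 and not path.endswith("/d")


samples = ["foo", "foo/bar/foo", "foo/bar/foo/bar", "foo/bar/d"]
for sample in samples:
    print(sample, accepts_by_regex(sample), accepts_by_counting(sample))
```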

Find any 4 consecutive characters between two strings

I'm trying to write a regex that would detect if any combination of 4 non-whitespace characters existed between two strings. They will always be separated by a comma. An example:
Labrador, Matador ---> this would match 'ador'.
Mississippi, Missing ---> This would match 'Miss' and 'issi'
Corporate, Corporation ---> This would match 'Corp' , 'orpo' , 'rpor' , 'pora' and 'orat'
It's been pretty hard to find something similar to this, and the closest I've found has said this is not possible in regex. It's definitely tricky, but I wanted to make sure that it was in fact not possible before looking for a different solution.
If it is impossible, would someone explain why?
For overlapping matches it is possible with a lookahead:
/(?=(\S{4}).*,.*\1)/
Note that there is one more issi possible in your second line example.
Test: https://regex101.com/r/rV3gN9/2
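The same lookahead can be tried in Python, where re.findall returns the captured group for every position at which the zero-width match succeeds. This is a sketch; the function name shared_runs is invented here.

```python
import re

# Zero-width lookahead: capture 4 non-whitespace characters, then require
# a comma and a later repetition of the captured text.
SHARED_RE = re.compile(r"(?=(\S{4}).*,.*\1)")


def shared_runs(line):
    """Return every 4-character run from the first word found after the comma."""
    return SHARED_RE.findall(line)


for line in ("Labrador, Matador", "Mississippi, Missing", "Corporate, Corporation"):
    print(line, "->", shared_runs(line))
```

Because the match itself is empty, the scan advances one character at a time, which is what allows overlapping results such as 'Corp' and 'orpo'.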
You can use this lookahead based regex:
(?=([a-zA-Z]{4})[a-zA-Z]*, *[a-zA-Z]*\1)
Though it will find issi twice since Mississippi has 2 instances of issi.
This can be achieved with backreferences:
\w*([a-zA-Z]{4})\w*, \w*\1\w*
See example: https://regex101.com/r/eW8hB7/1

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top-level domain followed by five digits. My problem arises because my existing PCRE matches what I have described, but much later in the URL than where I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using a negative lookbehind, but that doesn't work because it only looks back a single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest that you simply prepend ^[^\/]*\. to your current regex.
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase letters, then a / and 5 digits).
Here is a permalink to a working example on regex101.
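For a local check, the two patterns can be compared in Python on the URLs from the question. This is a sketch with made-up variable names; the slash is not escaped because Python's re module does not require it.

```python
import re

UNANCHORED = re.compile(r"(\.[a-z]{2,4})/\d{5}")
# Anchored version: nothing but non-slash characters may precede the TLD.
ANCHORED = re.compile(r"^[^/]*\.([a-z]{2,4})/\d{5}")

late = "domain.net/stuff/stuff=www.google.com/12345"
early = "domain.net/12345/stuff=www.google.com/12345"

print(UNANCHORED.search(late).group(1))  # the unwanted later TLD
print(ANCHORED.search(late))             # no match at all
print(ANCHORED.search(early).group(1))   # first TLD only
```

Note that the capture group in the anchored pattern no longer includes the leading dot.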
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously, if I'm here, the invalid ones are accepted by the regex. The goal of this regex is to accept either 1 to 4 digits with an optional letter, a single letter, or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. Also you can include the second option into the first one (which immediately matches the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
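A short Python sketch showing the effect of the anchors on the sample inputs from the question. The names ORIGINAL, ANCHORED, and is_valid are chosen for this example only.

```python
import re

# Pattern from the question: only the empty-string alternative is anchored.
ORIGINAL = re.compile(r"([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)")
# Anchored pattern from this answer.
ANCHORED = re.compile(r"^[0-9]{0,4}[A-Z]?$")


def is_valid(value, pattern=ANCHORED):
    # A match object is always truthy, even for an empty match,
    # so compare against None explicitly.
    return pattern.search(value) is not None


for value in ["1234A", "Z", "", "A1234", "fhfdsahds527523832dvhsfdg"]:
    print(repr(value), is_valid(value, ORIGINAL), is_valid(value))
```

The unanchored pattern reports a match for every input because it is satisfied by any single letter or short run of digits inside the string.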
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This looks for a digit, zero to four times, at the beginning of the string
[A-Za-z] = This looks for a single letter (either case)
(?!.*[0-9]) = This will only allow the letter if there are no digits anywhere after it.
I haven't quite figured out how to validate against an empty string, but that might be easier done using tools from whatever language you are using. Something along this logic:
if String doesn't equal $null, then check the Regex
Something along those lines, just adjusted for however you would do it in your language.
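In Python, for instance, that logic might look like the sketch below. The function name is_valid_apartment is hypothetical, and the empty-string check stands in for the $null comparison. One thing to be aware of: as written, the pattern requires a letter, so a digits-only value such as "1234" does not pass.

```python
import re

# Pattern from this answer: up to four digits, one letter, and no digit
# anywhere after that letter.
LETTER_RE = re.compile(r"^[0-9]{0,4}[A-Za-z](?!.*[0-9])")


def is_valid_apartment(value):
    # Handle the optional (empty) case in code, then defer to the regex.
    if value == "":
        return True
    return LETTER_RE.search(value) is not None


for value in ["", "1234A", "Z", "A1234", "1234"]:
    print(repr(value), is_valid_apartment(value))
```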
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments