Regex to pick a value from url - regex

I am having difficulty building a regex that can extract a value from a URL. The condition is to get the value between the last "/" and ".html". Please help.
Sample URL1 - https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html - The value I want to extract is bingo
Sample URL2 - www.example.com/we/b345g.html - The value I want to extract is b345g
I tried to build a regex and was able to get "bingo.html" and "b345g.html" using [^\/]+$, but was not able to remove or skip ".html".

Here you are:
\/([^\/]+?)(?>\..+)?$
Explanation:
\/ - literal character '/'
([^\/]+?) - first group: at least one character that is not a '/', matched lazily (as few characters as possible)
[^\/] - any character that is not a '/'
+ - at least one occurrence
? - lazy modifier (makes the preceding + match as few characters as possible)
(?>\..+)? - second, optional group: '.' followed by one or more characters (like '.html' or '.exe' or '.png')
?> - atomic group: it does not capture, so its content stays out of the result, and the engine does not backtrack into it (it is not a lookahead)
\. - literal character '.'
. - any character (except line terminators)
+ - at least one occurrence
? - optionality (note that this one is outside the parentheses)
$ - end of the string
If you also want to exclude query strings, you can expand it like this:
\/([^\/]+?)(?>\..+)?(?>\?.*)?$
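If the URL also contains the protocol part (for example https://), use this instead, because the patterns above can start matching at the '//' and capture the first label of the host name (for example 'www'):
(?<!\/)\/([^\/]+?)(?>\..+)?(?>\?.*)?$
Here (?<!\/) is a negative lookbehind: it checks that there is no '/' immediately before the '/' that starts the match.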
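A quick way to check the last pattern is the sketch below, written in Python purely as an assumption about the target engine. Python's re module only accepts atomic groups (?>...) from version 3.11 on, so the sketch swaps in ordinary non-capturing groups (?:...); for the sample URLs the captured value is the same. The helper name last_segment is made up for illustration.

```python
import re

# Same shape as the pattern above, with (?:...) swapped in for (?>...)
# so that it also runs on Python versions older than 3.11.
PATTERN = re.compile(r"(?<!\/)\/([^\/]+?)(?:\..+)?(?:\?.*)?$")


def last_segment(url):
    """Return the last path segment without extension or query string, or None."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(last_segment("https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html"))  # bingo
print(last_segment("www.example.com/we/b345g.html"))  # b345g
print(last_segment("www.example.com/we/b345g.html?x=1"))  # b345g
```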

You are only matching using [^\/]+$ but not differentiating between the part before and after the dot.
To make that distinction, you could for example use a capture group to get the part after the last slash and before the first dot.
\S*\/([^\/\s.]+)\.[^\/\s]+$
\S*\/ Match optional non whitespace chars till the last occurrence of /
([^\/\s.]+) Capture group 1 Match 1+ times any char except a / whitespace char or .
\. Match a dot
[^\/\s]+ Match 1+ times any char except a / or a whitespace char
$ End of string
See a regex demo.
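As a rough check, the capture-group pattern can be run against both sample URLs. This is a minimal Python sketch; the variable names are only illustrative.

```python
import re

# Group 1 holds the part after the last slash and before the first dot.
PATTERN = re.compile(r"\S*\/([^\/\s.]+)\.[^\/\s]+$")

urls = [
    "https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html",
    "www.example.com/we/b345g.html",
]

names = []
for url in urls:
    match = PATTERN.search(url)
    names.append(match.group(1) if match else None)

print(names)  # ['bingo', 'b345g']
```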

Related

Regex Pattern that has to include something after /

Using Regex, I want to match any URL that includes the /it-jobs/ but must have something after the final /.
To be a match the URL must have /it-jobs/ + characters after the trailing / otherwise it should not match. Please refer to below example.
Example: www.website.com/it-jobs/ - is not a match
www.website.com/it-jobs/java-developer - is a match
www.website.com/it-jobs/php - is a match
www.website.com/it-jobs/angular-developer - is a match
You can use
/it-jobs/[^/\s]+$
To match the whole string, add .* at the pattern start:
.*/it-jobs/[^/\s]+$
See the regex demo.
Details:
.* - zero or more chars other than line break chars as many as possible
/it-jobs/ - a literal string
[^/\s]+ - any one or more chars other than / and whitespaces
$ - end of string.
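The four example URLs can be checked with a short Python sketch like the one below. Python is an assumption here, since the question does not name a regex engine.

```python
import re

# /it-jobs/ must be followed by at least one char that is neither / nor whitespace.
PATTERN = re.compile(r"/it-jobs/[^/\s]+$")

urls = [
    "www.website.com/it-jobs/",
    "www.website.com/it-jobs/java-developer",
    "www.website.com/it-jobs/php",
    "www.website.com/it-jobs/angular-developer",
]

results = {url: bool(PATTERN.search(url)) for url in urls}
for url, is_match in results.items():
    print(url, "- is a match" if is_match else "- is not a match")
```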

REGEX: To extract particular string from path

I am looking to extract a particular string from a path.
For example, I have to extract the 4th value separated by (.) from the filename, which is "lm" in the examples below.
Examples:
/apps/java/logs/abc.defgh.ijk.lm.nopqrst.uvw.xyz.log
/apps2/java/logs/abc.defgh.ijk.lm.log
This will extract full file name:
.*\/(?<name>.*).log
You can use
.*\/(?:[^.\/]*\.){3}(?<value>[^.\/]*)[^\/]*$
Or, if .log must be the extension:
.*\/(?:[^.\/]*\.){3}(?<value>[^.\/]*)[^\/]*\.log$
See the regex demo. Details:
.* - any zero or more chars other than line break chars, as many as possible
\/ - a / char
(?:[^.\/]*\.){3} - three occurrences of zero or more chars other than . and / as many as possible and a dot
(?<value>[^.\/]*) - Group "value": zero or more chars other than . and / as many as possible
[^\/]* - zero or more chars other than /
\.log - a .log substring
$ - end of string.
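If the pattern is used from Python (an assumption, the question does not name a language), the named group has to be written as (?P<value>...), because Python's re module does not accept the (?<value>...) spelling. A minimal sketch with the .log variant; fourth_field is a hypothetical helper name:

```python
import re

# Python spelling of the named group: (?P<value>...) instead of (?<value>...).
PATTERN = re.compile(r".*\/(?:[^.\/]*\.){3}(?P<value>[^.\/]*)[^\/]*\.log$")


def fourth_field(path):
    """Return the 4th dot-separated field of the file name, or None."""
    match = PATTERN.match(path)
    return match.group("value") if match else None


print(fourth_field("/apps/java/logs/abc.defgh.ijk.lm.nopqrst.uvw.xyz.log"))  # lm
print(fourth_field("/apps2/java/logs/abc.defgh.ijk.lm.log"))  # lm
```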
You can also try
\/(?:\w+\.){3}(\w+)
Or
\/(?:\w+\.){3}(\w+).*\.log
Where:
\/ - Match string starting from "/"
(?:\w+\.){3} - Matches 3 occurrences of "xyz." e.g. abc.defgh.ijk.
(\w+) - Capture the alphanumeric string. This will contain the target value e.g. "lm"
.*\.log - Optional. Match any set of characters that ends with .log e.g. .nopqrst.uvw.xyz.log

RegEx string to find two strings and delete the rest of the text in the file including lines that don't contain the strings [duplicate]

I need to do a find and delete the rest in a text file with Notepad++.
I want to use regex to find variations on thban..... The variable always has at most 5 characters after it (see the dots).
My search string hits the last line, but it matches the whole line. I just want the word preserved.
Once this works, I also want to keep the words containing C3.....
The rest of the text file can be deleted.
It should also be case-insensitive.
(?!thban\w+).*\r?\n?
THBANES900 and C3950 bla bla
THBAN
..THBANES901.. C3850 bla bla
THBANMP900
**..thbanes900..**
This should result in
THBANES900 C3950
THBAN
THBANES901 C3850
THBANMP900
thbanes900
Maybe just capture those words of interest instead of replacing everything else? In Notepad++ search for pattern:
^.*\b(thban\S{0,5})(?:.*(\sC3\w+))?.*$|.+
See the Online Demo
^ - Start string anchor.
.*\b - Any character other than newline zero or more times up to a word boundary.
( - Open 1st capture group.
thban\S{0,5} - Match "thban" and zero to five non-whitespace chars.
) - Close 1st capture group.
(?: - Open non-capturing group.
.* - Any character other than newline zero or more times.
( - Open 2nd capture group.
\sC3\w+ - A whitespace character, then "C3" and one or more word characters.
) - Close 2nd capture group.
)? - Close non-capturing group and make it optional.
.* - Any character other than newline zero or more times.
$ - End string anchor.
| - Alternation (OR).
.+ - Any character other than newline once or more.
Replace with:
$1$2
After this, you may end up with empty lines, which you can swiftly remove using the built-in option (in the English UI: Edit > Line Operations > Remove Empty Lines).
In the English UI the case checkbox in the Find dialog is labelled "Match case"; make sure it is not ticked.
You may use
Find What: (?|\b(thban\S{0,5})|\s(C3\w+))|(?s:.)
Replace With: (?1$1\n:)
Details
(?| - start of a branch reset group:
\b(thban\S{0,5}) - Group 1: a word boundary, then thban and any 0 to 5 non-whitespace chars
| - or
\s(C3\w+) - a whitespace char, and then Group 1: C3 and one or more word chars
) - end of the branch reset group
| - or
(?s:.) - any one char (including line break chars)
The replacement is
(?1 - if Group 1 matched,
$1\n - Group 1 value with a newline
: - else, replace with empty string
) - end of the conditional replacement pattern
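Outside Notepad++, the same result can be approximated by collecting the wanted words instead of deleting everything else. The Python sketch below is a swapped-in technique, not the Notepad++ replacement itself: as far as I know, Python's re module has neither branch reset groups nor this kind of conditional replacement, so it uses findall with a lookbehind in place of \s(C3\w+).

```python
import re

# thban plus up to 5 non-whitespace chars, or a C3 word preceded by whitespace.
WORDS = re.compile(r"\bthban\S{0,5}|(?<=\s)C3\w+", re.IGNORECASE)

text = """THBANES900 and C3950 bla bla
THBAN
..THBANES901.. C3850 bla bla
THBANMP900
**..thbanes900..**"""

kept = []
for line in text.splitlines():
    found = WORDS.findall(line)
    if found:  # lines without any wanted word are dropped entirely
        kept.append(" ".join(found))

print("\n".join(kept))
```

With the sample input from the question this prints the five expected lines, starting with "THBANES900 C3950".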

Regex: get string after first occurrence of a character (including it)

I'm trying to get some old links from my site to redirect to the new ones with a 301 redirect instruction. What I need to accomplish is to remove the first part of the string up to and including the first hyphen.
Example:
http://example.com/19731-la-preservacion-de-la-biodiversidad-es-crucial-para-frenar-la-desertificacion-en-zonas-aridas
or
http://example.com/633-afecta-la-crisis-alimentaria-ya-a-miles-de-personas
Should output to:
http://example.com/la-preservacion-de-la-biodiversidad-es-crucial-para-frenar-la-desertificacion-en-zonas-aridas
http://example.com/afecta-la-crisis-alimentaria-ya-a-miles-de-personas
I have tried so far with
RewriteRule ^[^-|-](.*)$ $1 and
RewriteRule ^([^-]*-)(.*)$ $1 but I can't seem to get it to work.
Thanks!
To get a substring after the first occurrence of some character including it you may use a negated character class that will match any char(s) other than that character, and then you need to start a capturing group, place the char as the first atom in it, and add .*) after:
^[^-]*(-.*)$
Here, ^[^-]*(-.*)$ matches a whole string, and the first - with all the chars after it landing in Group 1 ($1 replacement in RewriteRule).
See the regex demo
Details
^ - start of string
[^-]* - zero or more chars other than - (negated character class)
(-.*) - Group 1 ($1): - and then any 0+ chars
$ - end of string.
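Where the capturing group starts decides whether the hyphen ends up in $1. The Python sketch below illustrates the difference on one of the sample slugs. It assumes the pattern is applied to the path without a leading slash, as in a per-directory RewriteRule; Python itself is only used here for illustration.

```python
import re

path = "633-afecta-la-crisis-alimentaria-ya-a-miles-de-personas"

# Hyphen inside the group: Group 1 starts with the first hyphen.
with_hyphen = re.sub(r"^[^-]*(-.*)$", r"\1", path)

# Hyphen outside the group: Group 1 starts right after the first hyphen.
without_hyphen = re.sub(r"^[^-]*-(.*)$", r"\1", path)

print(with_hyphen)     # -afecta-la-crisis-alimentaria-ya-a-miles-de-personas
print(without_hyphen)  # afecta-la-crisis-alimentaria-ya-a-miles-de-personas
```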
Try:
(.*?\/)\d+-(.*)
Replace:
$1$2
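Applied to the full URLs from the question, this find-and-replace can be sketched in Python as follows. Python writes the replacement as \1\2 rather than $1$2, and strip_numeric_prefix is a hypothetical helper name.

```python
import re

PATTERN = re.compile(r"(.*?\/)\d+-(.*)")


def strip_numeric_prefix(url):
    """Drop the leading 'digits-' part of the last path segment."""
    return PATTERN.sub(r"\1\2", url, count=1)


print(strip_numeric_prefix("http://example.com/633-afecta-la-crisis-alimentaria-ya-a-miles-de-personas"))
# http://example.com/afecta-la-crisis-alimentaria-ya-a-miles-de-personas
```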

Get the first occurrence of a string in a variable REGEX

I have the following variable in a database: PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT and I want to split it into two variables, the first will be PSC-CAMPO-GRANDE-I08 and the second V00-C09-H09-IPRMKT.
I'm trying the regex .*(\-I).*(\-V), this doesn't work. Then I tried .*(\-I), but it gets the last -IPRMKT string.
So my question is: is there a way to split the string PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT at the first occurrence of -I?
This should do the trick:
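regex = r"(.*?-I[\d]{2})-(.*)"
Here is a test script in Python:
import re

# Raw string, so that \d reaches the regex engine unchanged
regex = r"(.*?-I[\d]{2})-(.*)"
match = re.search(regex, "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT")
if match:
    print("yep")
    print(match.group(1))  # text up to the first -I plus its two digits
    print(match.group(2))  # everything after the separating hyphen
else:
    print("nope")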
In the regex, I'm grabbing everything up to the first -I, then two digits. Then I match but don't capture a -. Then I capture the rest. I can help tweak it if you have more logic that you are trying to do.
You may use
^(.*?-I[^-]*)-(.*)
See the regex demo
Details:
^ - start of a string
(.*?-I[^-]*) - Group 1:
.*? - any 0+ chars other than line break chars, as few as possible, up to the first -I (because *? is a lazy quantifier that matches up to the first occurrence)
-I - a literal substring -I
[^-]* - any 0+ chars other than a hyphen (your pattern was missing it)
- - a hyphen
(.*) - Group 2: any 0+ chars other than line break chars up to the end of a line.
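A minimal Python sketch of this second pattern, for comparison with the test script above; the helper name split_at_first_i is made up for illustration:

```python
import re

PATTERN = re.compile(r"^(.*?-I[^-]*)-(.*)")


def split_at_first_i(value):
    """Split at the hyphen that follows the first -I block, or return None."""
    match = PATTERN.match(value)
    return match.groups() if match else None


print(split_at_first_i("PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT"))
# ('PSC-CAMPO-GRANDE-I08', 'V00-C09-H09-IPRMKT')
```

Unlike [\d]{2}, the [^-]* part does not require exactly two digits after -I, so it also accepts values such as -I8 or -I123.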