I am having difficulty building a regex that extracts a value from a URL. The condition is to get the value between the last "/" and ".html". Please help.
Sample URL1 - https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html - The value I want to extract is bingo
Sample URL2 - www.example.com/we/b345g.html - The value I want to extract is b345g
I tried to build a regex and was able to get "bingo.html" and "b345g.html" using [^\/]+$, but was not able to remove or skip ".html".
Here you are:
\/([^\/]+?)(?>\..+)?$
Explanation:
\/ - literal character '/'
([^\/]+?) - first group: at least one character that is not a '/', matched lazily (as few characters as possible)
[^\/] - any character that is not a '/'
+ - at least one occurrence
? - lazy modifier (makes the preceding + match as few characters as possible)
(?>\..+)? - second optional group: '.' + any character (like '.html' or '.exe' or '.png')
?> - atomic group (non-capturing, so its content is excluded from the result; the engine does not backtrack into it)
\. - literal character '.'
. - any character (except line terminators)
+ - at least one occurrence
? - optionality (note that this one is outside the parenthesis)
$ - end of the string
If you want also to exclude query strings you can expand it like this:
\/([^\/]+?)(?>\..+)?(?>\?.*)?$
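If you also need to handle the protocol part of the URL (as in "https://"), you can use this:
(?<!\/)\/([^\/]+?)(?>\..+)?(?>\?.*)?$
Here (?<!\/) is a negative lookbehind that checks that there is no '/' immediately before the start of the match. Without it, the match can start at the second '/' of "https://", because .+ also matches '/' characters.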
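As a rough check, the last pattern can be tried in Python. This is only a sketch: the helper name last_segment is made up for illustration, and the atomic groups (?>...) are swapped for plain non-capturing groups (?:...), since Python's re module only accepts atomic groups from version 3.11 on.

```python
import re

# Pattern from the answer above, with (?>...) replaced by (?:...).
# Assumption: for inputs like these, the plain groups behave the same.
PATTERN = re.compile(r"(?<!/)/([^/]+?)(?:\..+)?(?:\?.*)?$")


def last_segment(url):
    """Return the last path segment without extension or query string, or None."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(last_segment("https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html"))  # bingo
print(last_segment("www.example.com/we/b345g.html"))  # b345g
```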
You are only matching using [^\/]+$ but not differentiating between the part before and after the dot.
To make that different, you could use for example a capture group to get the part after the last slash and before the first dot.
\S*\/([^\/\s.]+)\.[^\/\s]+$
\S*\/ Match optional non whitespace chars till the last occurrence of /
([^\/\s.]+) Capture group 1 Match 1+ times any char except a / whitespace char or .
\. Match a dot
[^\/\s]+ Match 1+ times any char except a / or a whitespace char
$ End of string
See a regex demo.
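For anyone who wants to try the capture-group version outside a demo page, here is a small Python sketch. The function name name_before_first_dot is hypothetical; the pattern is the one above, with the slashes left unescaped because Python does not need that.

```python
import re

# Group 1 is the part after the last slash and before the first dot.
PATTERN = re.compile(r"\S*/([^/\s.]+)\.[^/\s]+$")


def name_before_first_dot(url):
    """Return group 1, or None when the last segment has no extension."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(name_before_first_dot("https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html"))  # bingo
print(name_before_first_dot("www.example.com/we/b345g.html"))  # b345g
print(name_before_first_dot("www.example.com/we/noext"))  # None
```

Note that a last segment without a dot does not match at all, because the pattern requires the \. part.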
Related
Using Regex, I want to match any URL that includes the /it-jobs/ but must have something after the final /.
To be a match the URL must have /it-jobs/ + characters after the trailing / otherwise it should not match. Please refer to below example.
Example: www.website.com/it-jobs/ - is not a match
www.website.com/it-jobs/java-developer - is a match
www.website.com/it-jobs/php - is a match
www.website.com/it-jobs/angular-developer - is a match
You can use
/it-jobs/[^/\s]+$
To match the whole string, add .* at the pattern start:
.*/it-jobs/[^/\s]+$
See the regex demo.
Details:
.* - zero or more chars other than line break chars as many as possible
/it-jobs/ - a literal string
[^/\s]+ - any one or more chars other than / and whitespaces
$ - end of string.
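A quick way to check the pattern against the example URLs from the question is a short Python sketch like this one (the name is_job_page is just an illustrative choice):

```python
import re

# Requires at least one non-slash, non-whitespace char after /it-jobs/.
PATTERN = re.compile(r"/it-jobs/[^/\s]+$")


def is_job_page(url):
    """True when the URL has something after the final / of /it-jobs/."""
    return PATTERN.search(url) is not None


for url in [
    "www.website.com/it-jobs/",
    "www.website.com/it-jobs/java-developer",
    "www.website.com/it-jobs/php",
    "www.website.com/it-jobs/angular-developer",
]:
    print(url, is_job_page(url))
```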
I am looking to extract a particular string from a path.
For example, I have to extract the 4th value separated by (.) from the filename, which is "lm" in the examples below.
Examples:
/apps/java/logs/abc.defgh.ijk.lm.nopqrst.uvw.xyz.log
/apps2/java/logs/abc.defgh.ijk.lm.log
This will extract full file name:
.*\/(?<name>.*).log
You can use
.*\/(?:[^.\/]*\.){3}(?<value>[^.\/]*)[^\/]*$
Or, if .log must be the extension:
.*\/(?:[^.\/]*\.){3}(?<value>[^.\/]*)[^\/]*\.log$
See the regex demo. Details:
.* - any zero or more chars other than line break chars, as many as possible
\/ - a / char
(?:[^.\/]*\.){3} - three occurrences of zero or more chars other than . and / as many as possible and a dot
(?<value>[^.\/]*) - Group "value": zero or more chars other than . and / as many as possible
[^\/]* - zero or more chars other than /
\.log - a .log substring
$ - end of string.
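If you want to test this in Python, keep in mind that its re module spells named groups as (?P<value>...) rather than (?<value>...). A minimal sketch under that assumption, with a made-up helper name fourth_field:

```python
import re

# Same pattern as above, with the named group in Python syntax.
PATTERN = re.compile(r".*/(?:[^./]*\.){3}(?P<value>[^./]*)[^/]*$")


def fourth_field(path):
    """Return the 4th dot-separated field of the file name, or None."""
    match = PATTERN.match(path)
    return match.group("value") if match else None


print(fourth_field("/apps/java/logs/abc.defgh.ijk.lm.nopqrst.uvw.xyz.log"))  # lm
print(fourth_field("/apps2/java/logs/abc.defgh.ijk.lm.log"))  # lm
```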
You can also try
\/(?:\w+\.){3}(\w+)
Or
\/(?:\w+\.){3}(\w+).*\.log
Where:
\/ - Match string starting from "/"
(?:\w+\.){3} - Matches 3 occurrences of "xyz." e.g. abc.defgh.ijk.
(\w+) - Capture the alphanumeric string. This will contain the target value e.g. "lm"
.*\.log - Optional. Match any set of characters that ends with .log e.g. .nopqrst.uvw.xyz.log
I need to do a find in a text file with Notepad++ and delete the rest.
I want to use a regex to find variations on thban..... The variable always has at most 5 chars behind it (see dots).
With my search string it hits the last line, but the whole line. I just want the word preserved.
When this works I also want to keep the words containing C3.....
The rest of the text file can be deleted.
It should also be case-insensitive.
(?!thban\w+).*\r?\n?
THBANES900 and C3950 bla bla
THBAN
..THBANES901.. C3850 bla bla
THBANMP900
**..thbanes900..**
This should result in
THBANES900 C3950
THBAN
THBANES901 C3850
THBANMP900
thbanes900
Maybe just capture those words of interest instead of replacing everything else? In Notepad++ search for pattern:
^.*\b(thban\S{0,5})(?:.*(\sC3\w+))?.*$|.+
See the Online Demo
^ - Start string anchor.
.*\b - Any character other than newline zero or more times up to a word boundary.
( - Open 1st capture group.
thban\S{0,5} - Match "thban" and zero to five non-whitespace chars.
) - Close 1st capture group.
(?: - Open non-capturing group.
.* - Any character other than newline zero or more times.
( - Open 2nd capture group.
\sC3\w+ - A whitespace character, then "C3" and one or more word characters.
) - Close 2nd capture group.
)? - Close non-capturing group and make it optional.
.* - Any character other than newline zero or more times.
$ - End string anchor.
| - Alternation (OR).
.+ - Any character other than newline once or more.
Replace with:
$1$2
After this, you may end up with empty lines, which you can swiftly remove using the built-in option. I'm unaware of the English terms, so I made a GIF to show you where to find these buttons.
I'm not sure what the English label of the case checkbox is (it should be "Match case"), but make sure it is not ticked so that the search ignores case.
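To see what the replacement does without opening Notepad++, here is a rough Python simulation. It is only an approximation: Notepad++ uses a different regex engine, the function name keep_words is invented, and the re.IGNORECASE and re.MULTILINE flags stand in for the dialog settings.

```python
import re

PATTERN = re.compile(
    r"^.*\b(thban\S{0,5})(?:.*(\sC3\w+))?.*$|.+",
    re.IGNORECASE | re.MULTILINE,
)


def keep_words(text):
    """Emulate replacing with $1$2; groups that did not match become ''."""
    return PATTERN.sub(lambda m: (m.group(1) or "") + (m.group(2) or ""), text)


sample = "\n".join([
    "THBANES900 and C3950 bla bla",
    "THBAN",
    "..THBANES901.. C3850 bla bla",
    "THBANMP900",
    "**..thbanes900..**",
])
print(keep_words(sample))
```

Lines without any thban word are matched by the .+ branch and come out empty, which is why the empty-line cleanup may be needed afterwards.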
You may use
Find What: (?|\b(thban\S{0,5})|\s(C3\w+))|(?s:.)
Replace With: (?1$1\n:)
Details
(?| - start of a branch reset group:
\b(thban\S{0,5}) - Group 1: a word boundary, then thban and any 0 to 5 non-whitespace chars
| - or
\s(C3\w+) - a whitespace char, and then Group 1: C3 and one or more word chars
) - end of the branch reset group
| - or
(?s:.) - any one char (including line break chars)
The replacement is
(?1 - if Group 1 matched,
$1\n - Group 1 value with a newline
: - else, replace with empty string
) - end of the conditional replacement pattern
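Python's re module has no branch reset group (?|...), so the idea can only be approximated there. The sketch below uses two ordinary groups and picks whichever one matched; extract_words is a hypothetical name, and the list it returns corresponds to the one-word-per-line output of the replacement above.

```python
import re

# Two plain groups instead of a branch reset group (assumption: equivalent
# for this purpose, since only one of them can match at a time).
PATTERN = re.compile(r"\b(thban\S{0,5})|\s(C3\w+)", re.IGNORECASE)


def extract_words(text):
    """Return every thban... word and every C3... word, in order."""
    return [m.group(1) or m.group(2) for m in PATTERN.finditer(text)]


sample = "\n".join([
    "THBANES900 and C3950 bla bla",
    "THBAN",
    "..THBANES901.. C3850 bla bla",
    "THBANMP900",
    "**..thbanes900..**",
])
print("\n".join(extract_words(sample)))
```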
I'm trying to get some old links from my site to redirect to the new ones with a 301 redirect instruction. What I need is to remove the first part of the path up to and including the first hyphen.
Example:
http://example.com/19731-la-preservacion-de-la-biodiversidad-es-crucial-para-frenar-la-desertificacion-en-zonas-aridas
or
http://example.com/633-afecta-la-crisis-alimentaria-ya-a-miles-de-personas
Should output to:
http://example.com/la-preservacion-de-la-biodiversidad-es-crucial-para-frenar-la-desertificacion-en-zonas-aridas
http://example.com/afecta-la-crisis-alimentaria-ya-a-miles-de-personas
I have tried so far with
RewriteRule ^[^-|-](.*)$ $1 and
RewriteRule ^([^-]*-)(.*)$ $1 but I can't seem to get it to work.
Thanks!
To get the substring that starts at the first occurrence of some character (including that character), use a negated character class to match any chars other than that character, then open a capturing group, put the char as its first atom, and follow it with .*):
^[^-]*(-.*)$
Here, ^[^-]*(-.*)$ matches a whole string, and the first - with all the chars after it landing in Group 1 ($1 replacement in RewriteRule).
See the regex demo
Details
^ - start of string
[^-]* - zero or more chars other than - (negated character class)
(-.*) - Group 1 ($1): - and then any 0+ chars
$ - end of string.
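To make it visible what lands in Group 1, here is a small Python sketch. It assumes the rule is matched against the path without a leading slash, and both function names are made up. The second function is a variation that is not part of the answer above: it moves the hyphen outside the group, in case the hyphen itself should be dropped as the question asks.

```python
import re


def tail_from_first_hyphen(path):
    """Group 1 of ^[^-]*(-.*)$ : the first hyphen and everything after it."""
    match = re.match(r"^[^-]*(-.*)$", path)
    return match.group(1) if match else None


def tail_after_first_hyphen(path):
    """Variation (assumption): hyphen outside the group, so it is not kept."""
    match = re.match(r"^[^-]*-(.*)$", path)
    return match.group(1) if match else None


path = "633-afecta-la-crisis-alimentaria-ya-a-miles-de-personas"
print(tail_from_first_hyphen(path))   # -afecta-la-crisis-alimentaria-ya-a-miles-de-personas
print(tail_after_first_hyphen(path))  # afecta-la-crisis-alimentaria-ya-a-miles-de-personas
```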
Try:
(.*?\/)\d+-(.*)
Replace:
$1$2
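The same find-and-replace can be sketched in Python, where the replacement is written \1\2 instead of $1$2. The function name strip_numeric_prefix is only illustrative.

```python
import re


def strip_numeric_prefix(url):
    """Drop the leading digits and hyphen from the first path segment."""
    return re.sub(r"(.*?/)\d+-(.*)", r"\1\2", url)


print(strip_numeric_prefix(
    "http://example.com/633-afecta-la-crisis-alimentaria-ya-a-miles-de-personas"
))
# http://example.com/afecta-la-crisis-alimentaria-ya-a-miles-de-personas
```

URLs without a digits-and-hyphen prefix are returned unchanged, since re.sub leaves the string alone when nothing matches.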
I have the following variable in a database: PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT and I want to split it into two variables, the first will be PSC-CAMPO-GRANDE-I08 and the second V00-C09-H09-IPRMKT.
I'm trying the regex .*(\-I).*(\-V), but this doesn't work. Then I tried .*(\-I), but it matches up to the last -I, the one in -IPRMKT.
So my question is: is there a way to split the string PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT at the first occurrence of -I?
This should do the trick:
regex = r"(.*?-I[\d]{2})-(.*)"
Here is a test script in Python:
import re

# Raw string, so that \d reaches the regex engine unchanged
regex = r"(.*?-I[\d]{2})-(.*)"
match = re.search(regex, "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT")
if match:
    print("yep")
    print(match.group(1))  # PSC-CAMPO-GRANDE-I08
    print(match.group(2))  # V00-C09-H09-IPRMKT
else:
    print("nope")
In the regex, I'm grabbing everything up to the first -I followed by 2 digits, then matching but not capturing a -, then capturing the rest. I can help tweak it if you have more logic that you are trying to handle.
You may use
^(.*?-I[^-]*)-(.*)
See the regex demo
Details:
^ - start of a string
(.*?-I[^-]*) - Group 1:
.*? - any 0+ chars other than line break chars, as few as possible, up to the first -I (because *? is a lazy quantifier that stops at the first occurrence)
-I - a literal substring -I
[^-]* - any 0+ chars other than a hyphen (your pattern was missing it)
- - a hyphen
(.*) - Group 2: any 0+ chars other than line break chars up to the end of a line.
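A short Python check of this pattern, as a sketch only (the name split_at_first_i is made up for the example):

```python
import re


def split_at_first_i(value):
    """Split at the hyphen that follows the first -I... part, or return None."""
    match = re.match(r"^(.*?-I[^-]*)-(.*)", value)
    return (match.group(1), match.group(2)) if match else None


print(split_at_first_i("PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT"))
# ('PSC-CAMPO-GRANDE-I08', 'V00-C09-H09-IPRMKT')
```

Unlike the [\d]{2} version, [^-]* does not care how many characters follow -I, so it also handles values such as -I123.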