Regex for internal URL - regex

I'm trying to create a regex that matches internal URLs (the ones that don't include the domain or http) found in a file, like this one:
category/subcategory/sub-subcategory/item-1
For that I'm using:
/\w+\/.+\/[\w\-]+/
But some URLs are like this:
category/subcategory
And I need a regular expression that also catches those. Do I have to create a different one, or is it possible to create one that matches both examples? It is for a Bash script, but if you have an idea, it does not matter if it is for another engine.
Thank you!!
Update: I forgot the context. Each line of the file is like this:
"11","category/subcategory/sub-subcategory/item-1","index.php?option=com_trombinoscopeextended&Itemid=125&lang=es&view=trombinoscope","251","0","0000-00-00","","","","","","","0"
Or like this:
"4","category/subcategory","index.php?option=com_trombinoscopeextended&Itemid=121&lang=es","0","1","0000-00-00","","","","","","","0"
I need to extract the examples for each line.
Thanks.

You may use
/\w+(\/[\w-]+)+/
See the regex demo.
Details
\w+ - 1+ word chars
(\/[\w-]+)+ - 1 or more consecutive sequences of
\/ - a / char
[\w-]+ - 1+ word or - chars.
A hint: you might read in your string with a kind of a CSV parser using your preferred language, and then only return fields that match ^\w+(\/[\w-]+)+$ pattern (here, ^ matches the start of the string and $ matches the end of the string).
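As a rough illustration of that hint, here is a minimal Python sketch (Python is an assumption here; the question targets Bash). It parses the two sample lines from the question with the standard csv module and keeps only the fields that fully match the pattern. The names SAMPLE, INTERNAL_URL and extract_internal_urls are hypothetical.

```python
import csv
import io
import re

# Sample lines copied from the question.
SAMPLE = '''"11","category/subcategory/sub-subcategory/item-1","index.php?option=com_trombinoscopeextended&Itemid=125&lang=es&view=trombinoscope","251","0","0000-00-00","","","","","","","0"
"4","category/subcategory","index.php?option=com_trombinoscopeextended&Itemid=121&lang=es","0","1","0000-00-00","","","","","","","0"
'''

# Same pattern as in the answer; fullmatch() plays the role of ^...$.
INTERNAL_URL = re.compile(r'\w+(/[\w-]+)+')


def extract_internal_urls(text):
    """Return every CSV field that looks like an internal URL."""
    urls = []
    for row in csv.reader(io.StringIO(text)):
        urls.extend(field for field in row if INTERNAL_URL.fullmatch(field))
    return urls


print(extract_internal_urls(SAMPLE))
```

Because each field is tested as a whole, values such as index.php?option=... or 0000-00-00 are rejected without any extra anchoring logic.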

That is pretty specific. I came up with this one after some testing. We have subdomains we need to check for as well.
(?!https?:)/?[^/][^/].*|(https?:)?//([^.]*\.)?yourdomain\.com(/.*)?
Someone can probably make it better, but this works for me.

Related

Split complex string into multiple parts using regex

I've tried a lot to split this string into something I can work with, but my experience isn't enough to reach the goal. I went through the first three pages of Google results, which helped but still didn't give me an idea of how to do this properly:
I have a string which looks like this:
My Dogs,213,220#Gallery,635,210#Screenshot,219,530#Good Morning,412,408#
The result should be:
My Dogs
213,220
Gallery
635,210
Screenshot
219,530
Good Morning
412,408
Anyone have an idea how to use regex to split the string like shown above?
Given the shared patterns, it seems you're looking for a regex like the following:
[A-Za-z ]+|\d+,\d+
It matches two patterns:
[A-Za-z ]+: any combination of letters and spaces
\d+,\d+: any combination of digits + a comma + any combination of digits
Check the demo here.
If you want a stricter regex, you can wrap the previous pattern between a lookbehind and a lookahead, so that every match is preceded by a comma, a #, or the start of the string, and followed by a comma, a #, or the end of the string.
(?<=^|,|#)([A-Za-z ]+|\d+,\d+)(?=,|#|$)
Check the demo here.
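For reference, the first (simpler) pattern can be tried out with a short Python sketch; the question does not name a language, so Python is an assumption, and the names TEXT and parts are hypothetical. Note that Python's built-in re module only accepts fixed-width lookbehinds, so the stricter variant may need adjusting there.

```python
import re

# Input string copied from the question.
TEXT = "My Dogs,213,220#Gallery,635,210#Screenshot,219,530#Good Morning,412,408#"

# Either a run of letters and spaces, or a "digits,digits" pair.
parts = re.findall(r'[A-Za-z ]+|\d+,\d+', TEXT)

for part in parts:
    print(part)
```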

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can only describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of regex it is. They believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the patterns below have worked so far. I tested each one in the live system (rather than only in a regex tester, in case it behaved differently there), but they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
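A minimal sketch of that capture-and-concatenate idea, assuming the engine really is Python (the question could not confirm this) and that each line holds at most one positioner. The names POSITIONER and strip_positioner are hypothetical.

```python
import re

# Group 1: everything from T up to the positioner; group 2: everything after it.
POSITIONER = re.compile(r'(T.*?)\?\d{2}(.*)')


def strip_positioner(line):
    """Join the text before and after a single ?NN positioner."""
    m = POSITIONER.fullmatch(line)
    if m is None:
        # No positioner on this line: return it unchanged.
        return line
    return m.group(1) + m.group(2)


for sample in ["T123?214567", "T?211234567", "T1234567"]:
    print(strip_positioner(sample))
```

This avoids a replace step altogether, which may matter if the tool only exposes matching and capture groups.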
First, match the ?21 and replace it with a distinctive character, such as #:
\?21
Demo
Then you can try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
Demo, in which the target string is captured in group 1 (or \1).
Finally, replace # with ?21 or anything else you like.
A Python script might look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# Replace each ?21 with # and print the matches
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Same matches, with # turned back into ?21
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then, in the replacement, you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
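If the engine does turn out to be Python and a replace is available (both are assumptions the question could not confirm), the pattern above could be applied roughly as follows. The names PATTERN and remove_positioner are hypothetical.

```python
import re

# Group 1: T plus any digits before the positioner.
# Group 2: the digits after the (optional) positioner.
PATTERN = re.compile(r'\b(T\d*)(?:\?\d{2})?(\d+)\b')


def remove_positioner(text):
    """Rewrite each match as group 1 followed by group 2."""
    return PATTERN.sub(r'\1\2', text)


for sample in ["T0000001", "T123?214567", "T?211234567"]:
    print(remove_positioner(sample))
```

A positioner at the very end of the identifier (with no digits after it) is not covered by this pattern, since group 2 requires at least one digit.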
Example Python
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

regex - exclude substring contains more than 2 "/"

I have a list of the following strings:
/fajwe/conv_1/routing/apwfe/afjwepfj
/fajwe/conv_2/routing/apwfe
/fajwe/conv_2/routing
/fajwe/conv_3/routing/apwfe/afjwepfj/awef
/fajwe/conv_4/routing/apwfe/afjwepfj/awef/0o09
I want a regex that only matches strings containing no more than one / after the word routing, namely /fajwe/conv_2/routing/apwfe and /fajwe/conv_2/routing.
Currently I use the regex ^((?!rou\w+(\/\w+){2,}).)*$ but it matches nothing. How can I write a regex that excludes strings containing more than one / after the word routing?
I would love to learn how to achieve this using Negative Lookbehind. Many thanks!
Something like this?
^.*\/routing(\/[^\/]*){0,1}$
routing(\/[^\/]*)?$
there you go
https://regex101.com/r/KjE8ed/1/
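To check the first pattern above against the sample strings, a small Python sketch (the language is an assumption; the names PATHS, ROUTING and kept are hypothetical) could look like this. Each string is tested on its own, so no multiline flag is needed.

```python
import re

# Sample strings copied from the question.
PATHS = [
    "/fajwe/conv_1/routing/apwfe/afjwepfj",
    "/fajwe/conv_2/routing/apwfe",
    "/fajwe/conv_2/routing",
    "/fajwe/conv_3/routing/apwfe/afjwepfj/awef",
    "/fajwe/conv_4/routing/apwfe/afjwepfj/awef/0o09",
]

# "/routing" followed by at most one more path segment, then end of string.
ROUTING = re.compile(r'^.*/routing(/[^/]*)?$')

kept = [p for p in PATHS if ROUTING.match(p)]
print(kept)
```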
Your regex matches what you are looking for when the multiline flag m is enabled, as @revo pointed out.
^((?!rou\w+(\/\w+){2,}).)*$
You could also try it like this:
^\/fajwe\/conv_\d\/routing(?:\/[^\/]+)?$
Depending on your language or context, you may need to escape the forward slash as \/.

Regex to match all urls, excluding .css, .js resources

I'm looking for a regular expression to exclude URLs that have an extension I don't want.
For example resources ending with: .css, .js, .font, .png, .jpg etc. should be excluded.
However, I can put all resources to the same folder and try to exclude URLs to this folder, like:
.*\/(?!content\/media)\/.*
But that doesn't work! How can I improve this regex to match my criteria?
e.g.
Match:
http://www.myapp.com/xyzOranotherContextRoot/rest/user/get/123?some=par#/other
No match:
http://www.myapp.com/xyzOranotherContextRoot/content/media/css/main.css?7892843
The correct solution is:
^((?!\/content\/media\/).)*$
see: https://regex101.com/r/bD0iD9/4
Inspired by "Regular expression to match a line that doesn't contain a word?"
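A quick way to see this pattern at work is a short Python sketch run against the two example URLs from the question. Python is an assumption, and the names NOT_MEDIA, MATCH_URL and NO_MATCH_URL are hypothetical.

```python
import re

# Each character is consumed only if "/content/media/" does not start at it.
NOT_MEDIA = re.compile(r'^((?!/content/media/).)*$')

MATCH_URL = "http://www.myapp.com/xyzOranotherContextRoot/rest/user/get/123?some=par#/other"
NO_MATCH_URL = "http://www.myapp.com/xyzOranotherContextRoot/content/media/css/main.css?7892843"

print(bool(NOT_MEDIA.match(MATCH_URL)))
print(bool(NOT_MEDIA.match(NO_MATCH_URL)))
```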
Two things:
First, the ?! negative lookahead doesn't remove any characters from the input. Add [^\/]+ before the trailing slash. Right now it is trying to match two consecutive slashes. For example:
.*\/(?!content\/media)[^\/]+\/.*
(edit) Second, the .*s at the beginning and end match too much. Try tightening those up, or adding more detail to content\/media. As it stands, content/media can be swallowed by one of the .*s and never be checked against the lookahead.
Suggestions:
Use your original idea and test against the extensions: ^.*\.(?!css|js|font|png|jpeg)[a-z0-9]+$ (with the case-insensitive flag).
Instead of doing this in a single regular expression, use a regex that pulls out any URL (e.g., https?:\/\/\S+, perhaps?) and then test each one you find with String.indexOf: if (candidateURL.indexOf('content/media') == -1) { /* do something with the OK URL */ }
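The second suggestion, extracting URLs first and filtering them afterwards, might be sketched in Python as follows. The answer's snippet uses JavaScript's String.indexOf; the substring test with "in" is the Python counterpart. The names TEXT and allowed_urls are hypothetical, and the sample text simply joins the two URLs from the question.

```python
import re

# Hypothetical input: the two example URLs from the question, one per line.
TEXT = (
    "http://www.myapp.com/xyzOranotherContextRoot/rest/user/get/123?some=par#/other\n"
    "http://www.myapp.com/xyzOranotherContextRoot/content/media/css/main.css?7892843\n"
)


def allowed_urls(text):
    """Pull out every URL, then drop those under content/media."""
    candidates = re.findall(r'https?://\S+', text)
    return [url for url in candidates if 'content/media' not in url]


print(allowed_urls(TEXT))
```

Splitting the work this way keeps the regex trivial and moves the exclusion rule into ordinary code, where it is easier to extend.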

regular expression for multiple filenames

I have some files like these:
15.58.55.ser 16.22.20.ser 16.36.23.ser 16.40.13.ser 16.59.41.ser 17.05.08.ser 17.14.40.ser 18.14.40.ser 18.20.43.ser
I want to rename these files to the following format:
image_1.ser image_2.ser ....
I don't know how to achieve it.
Please give me some advice.
The regex is quite simple:
(?:\d{2}\.){3}ser
It matches two digits \d{2} and a dot \. three times {3}, ending in ser.
You can see from RegExr that it matches all of your test cases.
However, in order to know how to do the replacement, you'd have to specify a language that you're working with.
Try this (if you need Java code):
String regex = "\\.ser";
String fileName = "15.58.55.ser";
// split() yields "15.58.55"; replace() treats it as literal text, not as a regex
System.out.println(fileName.replace(fileName.split(regex)[0], "image_1"));
This handles only one entry. To rename multiple files, do the same in a loop.
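Since the question does not name a language, here is a hedged Python sketch of the multi-file case. It only builds the old-to-new name mapping, using the pattern from the first answer. Numbering the files in sorted order is an assumption, and SER_NAME and plan_renames are hypothetical names.

```python
import re

# Pattern from the first answer: three "NN." groups followed by "ser".
SER_NAME = re.compile(r'(?:\d{2}\.){3}ser')


def plan_renames(names):
    """Map each matching file name to image_<n>.ser, numbered in sorted order."""
    matching = sorted(n for n in names if SER_NAME.fullmatch(n))
    return {old: f'image_{i}.ser' for i, old in enumerate(matching, start=1)}


files = ["16.22.20.ser", "15.58.55.ser", "notes.txt", "16.36.23.ser"]
for old, new in plan_renames(files).items():
    print(old, '->', new)
```

Each pair in the mapping could then be passed to os.rename, once the planned names have been checked by eye.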