A pattern to match [characters]:[characters] inside a URL - regex

I have a URL like the one below and want to use a regex to extract segments such as: Id:Reference, Title:dfgdfg, Status.Title:Current Status, CreationDate:Logged...
This is the closest pattern I have come up with: [=,][^,]*:[^,]*[,&] but obviously the result is not as expected. Any better ideas?
P.S. I'm using [^,] to match any character except , because , will not appear inside a segment.
This is the site I'm using to test regex patterns:
http://regexpal.com/
The URL:
http://localhost/site/=powerManagement.power&query=_Allpowers&attributes=Id:Reference,Title:dfgdfg,Status.Title:Current Status,CreationDate:Logged,RaiseUser.Title:标题,_MinutesToBreach&sort_by=CreationDate"
Thanks,

You haven't specified which programming language you use, but many regex engines will support this:
([\p{L}\.]+):([\p{L}\.]+)
\p{L} matches any Unicode letter, in any language, provided that your regex engine supports Unicode properties. RegEx 101.
You can extract the matches via capturing groups if you want.

In Python:
import re
# url is assumed to hold the URL string from the question
matchobj = re.match(r"^.*Id:(.*?),Title:(.*?),.*$", url)
Id = matchobj.group(1)
Title = matchobj.group(2)
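The snippet above hard-codes Id and Title. As a rough sketch of a more general approach, the pattern below collects every key:value segment, assuming each segment starts right after "=" or "," and that neither side of the colon contains ",", "&", "=" or ":". It uses negated character classes rather than \p{L}, because Python's built-in re module does not support \p{L}. The names SEGMENT and pairs are made up for this illustration.

```python
import re

# URL copied from the question (without the trailing quote).
url = (
    "http://localhost/site/=powerManagement.power&query=_Allpowers"
    "&attributes=Id:Reference,Title:dfgdfg,Status.Title:Current Status,"
    "CreationDate:Logged,RaiseUser.Title:标题,_MinutesToBreach"
    "&sort_by=CreationDate"
)

# A segment must start right after "=" or ",", and neither side of the
# colon may contain ",", "&", "=" or ":".
SEGMENT = re.compile(r"(?<=[=,])([^,&=:]+):([^,&=:]+)")

pairs = SEGMENT.findall(url)
for key, value in pairs:
    print(f"{key} -> {value}")
```

The lookbehind keeps "http://..." from being treated as a segment, and entries without a colon (such as _MinutesToBreach) are skipped.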

Related

How to get string following a certain pattern?

I have a URL:
https://fakedomain.com/2017/07/01/the-string-i-want-to-get/
I can recognize the 2017/07/01/ via this pattern:
(\d{4}/\d{2}/\d{2}/)
But what I want, is the string that comes after it: the-string-i-want-to-get/.
How do I achieve that?
Depending on the language you're using, you might find a library that does this for you (instead of writing your own regex). Anyway, if you want to achieve this with a regex, you can use:
\d{4}\/\d{2}\/\d{2}\/(.*)\/
This captures everything after the date up to the last "/", because .* is greedy. To stop at the next "/" instead, use [^\/]* in place of .* inside the group.
You can also use a positive lookbehind:
(?<=\d{4}\/\d{2}\/\d{2}\/)(.*)\/
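To see how these variants differ, here is a small Python sketch. The URL with an extra path segment is a hypothetical input chosen to make the difference visible; it is not from the question. It compares the greedy (.*) group, a negated class that stops at the next slash, and a fixed-width lookbehind.

```python
import re

# Hypothetical URL with an extra path segment after the slug.
url = "https://fakedomain.com/2017/07/01/the-string-i-want-to-get/extra/"

# Greedy .* runs to the last "/" in the string.
greedy = re.search(r"\d{4}/\d{2}/\d{2}/(.*)/", url).group(1)
# [^/]* cannot cross a "/", so it stops at the next one.
next_slash = re.search(r"\d{4}/\d{2}/\d{2}/([^/]*)/", url).group(1)
# The lookbehind is fixed-width (11 characters), which Python's re requires.
lookbehind = re.search(r"(?<=\d{4}/\d{2}/\d{2}/)[^/]+", url).group(0)

print(greedy)      # the-string-i-want-to-get/extra
print(next_slash)  # the-string-i-want-to-get
print(lookbehind)  # the-string-i-want-to-get
```

On the original URL from the question all three give the same result, since nothing follows the slug.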
I suggest this regex, which captures 2017/07/01/ in the first group and the-string-i-want-to-get/ in the second group:
(\d{4}/\d{2}/\d{2}/)(.*/)
Here is an example implementation in Python 3:
import re
url = 'https://fakedomain.com/2017/07/01/the-string-i-want-to-get/'
m = re.search(r'(\d{4}/\d{2}/\d{2}/)(.*/)', url)
print(m.group(1)) # 2017/07/01/
print(m.group(2)) # the-string-i-want-to-get/

RegEx to cut out URL

I am trying to extract a URL from a string of the following format:
RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH
I have already tried a few things, especially lookbehind/lookahead, which I had used successfully on another URL format (starts with https, ends with .html).
But it seems I'm too stupid to figure out the regex for the kind of string mentioned above. I just want the URL part, from https to the end of the random last name. Is this even possible?
Any ideas?
If you can guarantee that randomfirstname_randomlastname is all lowercase and RANDOMRUBBISH is all uppercase, you can use character classes [a-z] and [A-Z]. The language the regex is for will determine how to use these.
This example works in JavaScript:
var str = "RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH";
// "_" is included in the class because the name part contains an underscore
var match = /https:\/\/www\.my-url\.com\/[a-z_]*/.exec(str);
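For comparison, a minimal Python sketch of the same idea, under the same assumption that the name is lowercase and the surrounding rubbish is uppercase. Here the underscore is matched explicitly between two lowercase runs; the variable names text and found are chosen for this illustration.

```python
import re

text = "RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH"

# Two lowercase runs joined by an underscore; the match stops at the
# first uppercase character of the trailing rubbish.
match = re.search(r"https://www\.my-url\.com/[a-z]+_[a-z]+", text)
found = match.group(0) if match else None
print(found)  # https://www.my-url.com/randomfirstname_randomlastname
```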

Crawler4j Regex Pattern for url

I'm using crawler4j and I want to write patterns that match only certain URLs, but I can't work out the regex for this URL:
http://www.site.com/liste/product_name_changable/productDetails.aspx?productId={id}&categoryId={category_id}
I tried this:
liste\/*\/productDetails:aspx?productId=*&category_id=*
and
private final static Pattern FILTERS = Pattern.compile("^/liste/*/productDetails.aspx?productId=*$");
but it's not working.
How can I turn this into a regex pattern?
You have several errors in your regex. Every asterisk should be .+, to indicate that you want to match one or more characters. The question mark needs to be escaped, and so does the dot in .aspx. category_id should be categoryId. productDetails:aspx should be productDetails.aspx. With all of these fixes, the regex looks like this:
liste\/.+\/productDetails\.aspx\?productId=.+&categoryId=.+
Also, you shouldn't have ^ or $ at the start and end of the regex. Those match the start and end of the input, so they won't work if you're trying to get a portion of the url, which you are.
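The effect of the anchors can be checked with a quick sketch. It is written in Python rather than Java purely for brevity; the constructs used here (.+, escaped dot and question mark, ^ and $) behave the same way in java.util.regex. The product name and both ids in the sample URL are made-up values.

```python
import re

# Hypothetical URL: the product name and both ids are made-up values.
url = ("http://www.site.com/liste/some_product/"
       "productDetails.aspx?productId=12&categoryId=3")

unanchored = re.compile(
    r"liste/.+/productDetails\.aspx\?productId=.+&categoryId=.+")
anchored = re.compile(
    r"^/liste/.+/productDetails\.aspx\?productId=.+&categoryId=.+$")

# The unanchored pattern is found inside the full URL.
print(bool(unanchored.search(url)))  # True
# The anchored one fails because the URL starts with "http", not "/liste".
print(bool(anchored.search(url)))    # False
```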

How to exclude a character in Regex

I have this regex:
UriPatternToMatch = new Regex(@"(href|src)=""[\d\w\/:#@%;$\(\)~_\?\+\-=\\\.&]*",
RegexOptions.Compiled | RegexOptions.IgnoreCase);
This works fine for picking up all URLs, including http, ftp and others, but it also picks up text containing the "&lt" special characters as a URL.
For example, it will wrongly pick up the text below as a URL too (adding a photo instead of the text below).
I believe something like ^&lt is what is needed, but where do I add it?
Thanks
You need to use a negative lookahead like this:
(?!.*?<)
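One possible way to place such a lookahead, sketched in Python since the constructs carry over to .NET: put it inside the repeated group, so that each character is consumed only if the text at that position does not start with "&lt". This is only one reading of what the question asks for (stop the match before "&lt"). The character class is adapted from the question, and the sample strings are invented for illustration.

```python
import re

# Each character of the class is consumed only if the text at that
# position does not start with "&lt".
URI_PATTERN = re.compile(
    r'(href|src)="(?:(?!&lt)[\w/:#@%;$()~?+\-=\\.&])*',
    re.IGNORECASE,
)

samples = [
    'src="/img?a=1&b=2"',                # an ordinary "&" is kept
    'href="http://example.com/a&lt;b"',  # the match stops before "&lt"
]
results = [URI_PATTERN.search(s).group(0) for s in samples]
print(results)
```

If the goal is instead to reject the whole attribute when it contains "&lt", the lookahead would go right after the opening quote rather than inside the repetition.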

URL safe characters RegEx that will allow UTF-8 accents!

I'm looking for a RegEx pattern to use in a rereplace() function that will keep URL safe characters, but include UTF-8 characters with accents. For example: ç and ã.
Something like: url = rereplace(local.url, "pattern") etc. I'd prefer a ColdFusion-only solution, but I'm open to using Java too, since it's so easy to integrate with CF.
My URL pattern will look like: /posts/[postId]/[title-with-accents-like-ç-and-ã]
I don't know what language you are using. Perl has some utf8 matching, see for example Tatsuhiko Miyagawa's URI::Find::UTF8
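This can be done by stripping every character that \w (alphanumerics and underscore) does not match, provided your regex engine treats accented letters as word characters: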
rereplace(string, "[^\w]", "", "all")
See this answer for reference.
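Whether \w covers accented letters depends on the engine. In Java, for instance, \w is ASCII-only unless the UNICODE_CHARACTER_CLASS flag is set, while in Python 3 it is Unicode-aware by default for str patterns. As an illustration of the idea only (not a ColdFusion solution), here is a small Python sketch; the slugify function and its hyphen handling are assumptions made for this example.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Keep Unicode letters, digits, "_" and "-"; drop everything else."""
    # NFC composes "c" + combining cedilla into a single "ç", which \w matches.
    text = unicodedata.normalize("NFC", title.strip())
    text = re.sub(r"\s+", "-", text)    # whitespace -> hyphen
    text = re.sub(r"[^\w-]", "", text)  # drop anything that is not \w or "-"
    return re.sub(r"-{2,}", "-", text)  # collapse runs of hyphens


print(slugify("Ação & reação!"))  # Ação-reação
```

The result would still need percent-encoding when placed in an actual URL, which most HTTP libraries and browsers handle automatically.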