How to get string following a certain pattern? - regex

I have a URL:
https://fakedomain.com/2017/07/01/the-string-i-want-to-get/
I can recognize the 2017/07/01/ via this pattern:
(\d{4}/\d{2}/\d{2}/)
But what I want, is the string that comes after it: the-string-i-want-to-get/.
How do I achieve that?

Depending on the language you're using, you might find a library that does this for you (instead of writing your own regex). Anyway, if you want to achieve this with a regex, you can use:
\d{4}\/\d{2}\/\d{2}\/(.*)\/
This captures everything after the date, up to the last "/" (.* is greedy; use [^\/]+ instead if you want to stop at the next "/").
You can also use a positive lookbehind:
(?<=\d{4}\/\d{2}\/\d{2}\/)(.*)\/
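As a rough illustration of the two approaches above, here is a minimal Python sketch. It assumes Python's re module (which accepts this lookbehind because it has a fixed width) and swaps the greedy .* for [^/]+ so the match stops at the next slash; the variable names are made up for the example.

```python
import re

url = "https://fakedomain.com/2017/07/01/the-string-i-want-to-get/"

# Capture group: [^/]+ stops at the next slash, unlike a greedy .*
m = re.search(r"\d{4}/\d{2}/\d{2}/([^/]+)/", url)
slug_from_group = m.group(1) if m else None

# Fixed-width lookbehind: the date is asserted but is not part of the match
m2 = re.search(r"(?<=\d{4}/\d{2}/\d{2}/)[^/]+", url)
slug_from_lookbehind = m2.group(0) if m2 else None

print(slug_from_group)
print(slug_from_lookbehind)
```

With the lookbehind form the whole match is already the wanted string, so no group index is needed.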

I suggest this regex, which matches 2017/07/01/ in the first group and the-string-i-want-to-get/ in the second group:
(\d{4}/\d{2}/\d{2}/)(.*/)
Here is an implementation example in Python3:
import re
url = 'https://fakedomain.com/2017/07/01/the-string-i-want-to-get/'
m = re.search(r'(\d{4}/\d{2}/\d{2}/)(.*/)', url)
print(m.group(1)) # 2017/07/01/
print(m.group(2)) # the-string-i-want-to-get/

Related

How to replace part of a URL with regex

I need to remove part of a URL with a regex.
From the words: http or https to the word .com.
And it can be several times in one string.
Can anyone help me with this?
For example a string:
"The request is:https://stackoverflow.com/questions"
After the removal - "The request is:/questions"
The regex that performed the deletion perfectly is: (@"\w+://[^/$]*")
with replace "".
Something like that:
var regex = new Regex(@"\w+:\/\/[^\/$]*");
url = regex.Replace(url, "");
You can use the re.sub() function from Python's re module. Alternatively, if you're working with Python, you can use the urllib.parse module (urlparse in Python 2) to extract the different parts of the URL and concatenate them to the prefix you want.
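A minimal Python sketch of both suggestions, assuming the same pattern as in the question. The second input string with example.com and example.org is invented for illustration only.

```python
import re
from urllib.parse import urlsplit

text = "The request is:https://stackoverflow.com/questions"

# Same pattern as in the question: scheme, "://", then everything up to the next "/"
stripped = re.sub(r"\w+://[^/$]*", "", text)
print(stripped)

# re.sub replaces every occurrence, so several URLs are handled in one call
several = re.sub(r"\w+://[^/$]*", "", "a https://example.com/p and http://example.org/q")
print(several)

# When the string is a single URL, urlsplit exposes the path directly
path = urlsplit("https://stackoverflow.com/questions").path
print(path)
```

The regex route works on free text containing URLs, while urlsplit only makes sense when the whole string is one URL.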

Regexp "and OR or" without repeating expression [duplicate]

Consider this (very simplified) example string:
1aw2,5cx7
As you can see, it is two digit/letter/letter/digit values separated by a comma.
Now, I could match this with the following:
>>> from re import match
>>> match("\d\w\w\d,\d\w\w\d", "1aw2,5cx7")
<_sre.SRE_Match object at 0x01749D40>
>>>
The problem is though, I have to write \d\w\w\d twice. With small patterns, this isn't so bad but, with more complex Regexes, writing the exact same thing twice makes the end pattern enormous and cumbersome to work with. It also seems redundant.
I tried using a named capture group:
>>> from re import match
>>> match("(?P<id>\d\w\w\d),(?P=id)", "1aw2,5cx7")
>>>
But it didn't work because it was looking for two occurrences of 1aw2, not digit/letter/letter/digit.
Is there any way to save part of a pattern, such as \d\w\w\d, so it can be used later on in the same pattern? In other words, can I reuse a sub-pattern in a pattern?
No, when using the standard library re module, regular expression patterns cannot be 'symbolized'.
You can always do so by re-using Python variables, of course:
digit_letter_letter_digit = r'\d\w\w\d'
then use string formatting to build the larger pattern:
match(r"{0},{0}".format(digit_letter_letter_digit), inputtext)
or, using Python 3.6+ f-strings:
dlld = r'\d\w\w\d'
match(fr"{dlld},{dlld}", inputtext)
I often do use this technique to compose larger, more complex patterns from re-usable sub-patterns.
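As a self-contained sketch of that composition technique, using only the standard library: the names DLLD, first and second are hypothetical and chosen just for this example.

```python
import re

# Reusable sub-pattern: digit, two word characters, digit
DLLD = r"\d\w\w\d"

# Build the larger pattern once; each copy gets its own named group
pair = re.compile(rf"(?P<first>{DLLD}),(?P<second>{DLLD})")

m = pair.fullmatch("1aw2,5cx7")
first, second = m.group("first"), m.group("second")
print(first, second)
```

Keeping the sub-pattern in a raw string means it is written once, and the compiled pattern can be reused like any other.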
If you are prepared to install an external library, then the regex project can solve this problem with a regex subroutine call. The syntax (?N), where N is a group number, re-uses the pattern of an already used (implicitly numbered) capturing group:
(\d\w\w\d),(?1)
^........^ ^..^
|           \
|            re-use pattern of capturing group 1
\
 capturing group 1
You can do the same with named capturing groups, where (?<groupname>...) is the named group groupname, and (?&groupname), (?P&groupname) or (?P>groupname) re-use the pattern matched by groupname (the latter two forms are alternatives for compatibility with other engines).
And finally, regex supports the (?(DEFINE)...) block to 'define' subroutine patterns without them actually matching anything at that stage. You can put multiple (..) and (?<name>...) capturing groups in that construct to then later refer to them in the actual pattern:
(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)
          ^...............^ ^......^ ^......^
          |                    \       /
creates 'dlld' pattern      uses 'dlld' pattern twice
Just to be explicit: the standard library re module does not support subroutine patterns.
Note: this will work with PyPi regex module, not with re module.
You could use the notation (?group-number), in your case:
(\d\w\w\d),(?1)
it is equivalent to:
(\d\w\w\d),(\d\w\w\d)
Be aware that \w includes \d (and the underscore). If you only want letters between the digits, the regex will be:
(\d[a-zA-Z]{2}\d),(?1)
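The difference between \w and an explicit letter class can be checked with the standard re module. This is a small sketch with made-up variable names; it tests the single-value patterns rather than the (?1) form, which re does not support.

```python
import re

loose = re.compile(r"\d\w\w\d")          # \w also accepts digits and "_"
strict = re.compile(r"\d[a-zA-Z]{2}\d")  # ASCII letters only

loose_accepts_digits = loose.fullmatch("1234") is not None
strict_accepts_digits = strict.fullmatch("1234") is not None
both_accept_letters = (loose.fullmatch("1aw2") is not None
                       and strict.fullmatch("1aw2") is not None)

print(loose_accepts_digits, strict_accepts_digits, both_accept_letters)
```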
I was troubled with the same problem and wrote this snippet
import nre
my_regex=nre.from_string('''
a=\d\w\w\d
b={{a}},{{a}}
c=(?P<id>{{a}}),(?P=id)
''')
my_regex["b"].match("1aw2,5cx7")
For lack of a more descriptive name, I named the partial regexes a, b and c.
Accessing them is as easy as {{a}}
import re
digit_letter_letter_digit = re.compile(r"\d\w\w\d")  # compile the pattern once so that it can be reused later
all_finds = re.findall(digit_letter_letter_digit, "1aw2,5cx7")  # finditer could be used instead of findall
for value in all_finds:
    print(re.match(digit_letter_letter_digit, value))
Since you're already using re, why not use string processing to manage the pattern repetition as well:
pattern = "P,P".replace("P",r"\d\w\w\d")
re.match(pattern, "1aw2,5cx7")
OR
P = r"\d\w\w\d"
re.match(f"{P},{P}", "1aw2,5cx7")
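Try using a backreference; I believe it works something like below:
You could use
(\d\w\w\d),\1
Note that a backreference repeats the text the group matched, not its pattern, so this matches 1aw2,1aw2 but not 1aw2,5cx7.
See here for reference http://www.regular-expressions.info/backref.html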
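A quick check with the standard re module shows how a backreference behaves on the example string, compared with simply repeating the group. The variable names are made up for this sketch.

```python
import re

text = "1aw2,5cx7"

backref = re.compile(r"(\d\w\w\d),\1")            # \1 = same text as group 1
repeated = re.compile(r"(\d\w\w\d),(\d\w\w\d)")   # same pattern, written twice

backref_on_different = backref.fullmatch(text)
backref_on_same = backref.fullmatch("1aw2,1aw2")
repeated_on_different = repeated.fullmatch(text)

print(backref_on_different)
print(backref_on_same is not None)
print(repeated_on_different is not None)
```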

RegEx remove part of string and replace another part

I have a challenge getting the desired result with RegEx (using C#) and I hope that the community can help.
I have a URL in the following format:
https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1
I want to make two modifications, specifically:
1) Remove everything after 'value' e.g. '&ida=0&idb=1'
2) Replace 'category' with e.g. 'newcategory'
So the result is:
https://somedomain.com/subfolder/newcategory/?abc=text:value
I can remove the string from 1) above, e.g. with ^[^&]+, but I have been unable to figure out how to replace the 'category' substring.
Any help or guidance would be much appreciated.
Thank you in advance.
Use the following:
Find: /(category/.+?value)&.+
Replace: /new$1 or /new\1 depending on your regex flavor
Demo & explanation
Update according to comment.
If the new name is completely_different_name, use the following:
Find: /category(/.+?value)&.+
Replace: /completely_different_name$1
Demo & explanation
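The question is about C#, but the same find/replace pairs can be tried out in Python as a sketch. Python's re.sub uses \1 rather than $1 for the group reference; the variable names below are assumptions for the example.

```python
import re

url = "https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1"

# First variant: keep "category" inside the group and prefix it with "new"
renamed = re.sub(r"/(category/.+?value)&.+", r"/new\1", url)

# Second variant: drop "category" from the group and write a different name
replaced = re.sub(r"/category(/.+?value)&.+", r"/completely_different_name\1", url)

print(renamed)
print(replaced)
```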
You haven't specified a language here; I mainly work in Python, so the solution is in Python (value holds the original URL).
import re
url = re.sub('category', 'newcategory', re.search('^https.*value', value).group(0))
Explanation
re.sub replaces occurrences of pattern a with b in string c.
re.search looks for a pattern in a string and stores the match in groups, so in the above code re.search stores the text from "https" to "value" in group 0.
Using Python and only built-in string methods (there is no need for regular expressions here):
url = r"https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1"
new_url = (url.split('value')[0] + "value").replace("category", 'newcategory')
print(new_url)
Outputs:
https://somedomain.com/subfolder/newcategory/?abc=text:value
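Another regex-free option, sketched here with the standard urllib.parse module, is to split the URL into components and edit the path and query separately. It assumes the parameter to keep is always the first one in the query string; the variable names are made up.

```python
from urllib.parse import urlsplit, urlunsplit

url = "https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1"

parts = urlsplit(url)

# Rename only the path segment, so a "category" in the query is left alone
new_path = parts.path.replace("/category/", "/newcategory/", 1)

# Keep only the first query parameter (assumption: it is the one to keep)
first_param = parts.query.split("&", 1)[0]

new_url = urlunsplit((parts.scheme, parts.netloc, new_path, first_param, parts.fragment))
print(new_url)
```

Working on the parsed components avoids accidentally touching the same word elsewhere in the URL.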

Perform Regex on value returned by Regex

This is probably straightforward but I'm not even sure which phrase I should google to find the answer. Forgive my noobiness.
I've got strings (filenames) that look like this:
site12345678_date20160912_23001_to_23100_of_25871.txt
What this naming convention means is "Records 23001 through 23100 out of 25871 for site 12345678 for September 12th 2016 (20160912)"
What I want to do is extract the date part (those digits between _date and the following _)
The Regex: .*(_date[0-9]{8}).* will return the string _date20160912. But what I'm actually looking for is just 20160912. Obviously, [0-9]{8} on its own doesn't give me what I want in this case because that could be confused with the site, or potentially record counts.
How can I responsibly accomplish this sort of 'substringing' with a single regular expression?
You just need to shift your parentheses so that the capture group no longer includes '_date'. Then you would look at capture group #1:
If done in python, for example, it would look something like:
import re
regex = r'.*_date([0-9]{8}).*'
filename = 'site12345678_date20160912_23001_to_23100_of_25871.txt'
m = re.match(regex, filename)
print(m.group(0))  # the whole string
print(m.group(1))  # the string you are looking for: '20160912'
See it in action here: https://eval.in/641446
The Regex: .*(_date[0-9]{8}).* will return the string _date20160912.
That means you are using the regex in a method that requires a full string match, and you can access Group 1 value. The only thing you need to change in the regex is the capturing group placement:
.*_date([0-9]{8}).*
       ^^^^^^^^^^
See the regex demo.
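If the goal is for the whole match (rather than a group) to be just the digits, lookarounds are one way to do it. The following is a sketch in Python, assuming the date is always preceded by _date and followed by an underscore as in the example filename; the datetime conversion is an optional extra.

```python
import re
from datetime import datetime

filename = "site12345678_date20160912_23001_to_23100_of_25871.txt"

# Lookbehind and lookahead assert the surroundings without consuming them
m = re.search(r"(?<=_date)\d{8}(?=_)", filename)
date_str = m.group(0) if m else None

# Optionally turn the digits into a real date object
parsed = datetime.strptime(date_str, "%Y%m%d").date()

print(date_str)
print(parsed.isoformat())
```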

A pattern to match [characters]:[characters] inside an URL

I have an url like below and wanted to use RegEx to extract segments like: Id:Reference, Title:dfgdfg, Status.Title:Current Status, CreationDate:Logged...
This is the closest pattern I got [=,][^,]*:[^,]*[,&] but obviously the result is not as expected, any better ideas?
P.S. I'm using [^,] to match any character except , because , will not exist in the segment.
This is the site I'm using for regex pattern matching:
http://regexpal.com/
The URL:
http://localhost/site/=powerManagement.power&query=_Allpowers&attributes=Id:Reference,Title:dfgdfg,Status.Title:Current Status,CreationDate:Logged,RaiseUser.Title:标题,_MinutesToBreach&sort_by=CreationDate"
Thanks,
You haven't specified what programming language you use, but almost all will support this:
([\p{L}\.]+):([\p{L}\.]+)
\p{L} matches a Unicode letter in any language, provided that your regex engine supports Unicode. RegEx 101.
You can extract the matches via capturing groups if you want.
In python:
import re
matchobj = re.match(r"^.*Id:(.*?),Title:(.*?),.*$", url)  # url holds the URL from the question
Id = matchobj.group(1)
Title = matchobj.group(2)
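To collect every key:value segment rather than two fixed ones, re.findall with two groups is one possible approach. This sketch assumes each segment starts right after "=" or "," and that values never contain "," or "&"; the pattern and names are illustrative, not taken from the answers above.

```python
import re

url = ("http://localhost/site/=powerManagement.power&query=_Allpowers"
       "&attributes=Id:Reference,Title:dfgdfg,Status.Title:Current Status,"
       "CreationDate:Logged,RaiseUser.Title:标题,_MinutesToBreach&sort_by=CreationDate")

# Key: text after "=" or "," up to ":"; value: text up to the next "," or "&"
pairs = re.findall(r"(?<=[=,])([^,:&=]+):([^,&]+)", url)

for key, value in pairs:
    print(key, "->", value)
```

Segments without a colon, such as _MinutesToBreach, are skipped because the pattern requires a ":" between key and value.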