Create RegEx to find such strings? [duplicate] - regex

This question already has answers here:
Regular expression to match a dot
(7 answers)
What special characters must be escaped in regular expressions?
(13 answers)
Closed 3 years ago.
I am new to RegEx in Python. I wrote a regex that should find certain strings in a text, but it is not working as expected:
import re

def find_short_url(str_field):
    search_string = r"moourl.com|ow.ly|goo.gl|polr.me|su.pr|bit.ly|is.gd|tinyurl.com|buff.ly|bit.do|adf.ly"
    search_string = re.search(search_string, str(str_field))
    result = search_string.group(0) if search_string else None
    return result
It should find every URL shortener in a text, but the su.pr alternative matches surpr in the text below. Is there any way to fix it?
find_short_url("It is a surprise that it is ...")
output
'surpr'
It can affect the other shorteners too. Still scratching my head.

Escape the dots:
search_string = r"moourl\.com|ow\.ly|goo\.gl|polr\.me|su\.pr|bit\.ly|is\.gd|tinyurl\.com|buff\.ly|bit\.do|adf\.ly"
In a regex, an unescaped dot matches any character (except a newline, by default), which is why su.pr also matches surpr. Escaping the dots makes them match only a literal dot.
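If escaping every dot by hand feels error-prone, one option is to let re.escape build the pattern from a plain list of domains. This is only a sketch: the names SHORTENER_DOMAINS and SHORTENER_RE are made up for illustration, and the function keeps the question's behaviour of returning the first hit or None.

```python
import re

# Hypothetical list name; the domains are the ones from the question.
SHORTENER_DOMAINS = [
    "moourl.com", "ow.ly", "goo.gl", "polr.me", "su.pr", "bit.ly",
    "is.gd", "tinyurl.com", "buff.ly", "bit.do", "adf.ly",
]

# re.escape backslash-escapes regex metacharacters such as the dot,
# so each domain is matched literally.
SHORTENER_RE = re.compile("|".join(re.escape(d) for d in SHORTENER_DOMAINS))

def find_short_url(str_field):
    """Return the first shortener domain found in the text, or None."""
    match = SHORTENER_RE.search(str(str_field))
    return match.group(0) if match else None

print(find_short_url("It is a surprise that it is ..."))  # None
print(find_short_url("Short link: su.pr/abc"))            # su.pr
```

Note that this still matches a domain inside a longer host name (for example the bit.ly in notbit.ly); whether that matters depends on the input.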

Related

Python regex to parse '#####' text in description field [duplicate]

This question already has answers here:
regex to extract mentions in Twitter
(2 answers)
Extracting #mentions from tweets using findall python (Giving incorrect results)
(3 answers)
Closed 3 years ago.
Here's the line I'm trying to parse:
#abc def#gmail.com #ghi j#klm #nop.qrs #tuv
And here's the regex I've gotten so far:
#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
My goal is to get ['#abc', '#ghi', '#tuv'], but no matter what I do, I can't get 'j#klm' to not match. Any help is much appreciated.
Try using re.findall with the following regex pattern:
(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)
import re

inp = "#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"
matches = re.findall(r'(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)', inp)
print(matches)
This prints:
['#abc', '#ghi', '#tuv']
The regex calls for an explanation. The leading lookbehind (?:(?<=^)|(?<=\s)) asserts that what precedes the # symbol is either a whitespace character or the start of the string. We can't use a word boundary here because # is not a word character. We use a similar lookahead (?=\s|$) at the end of the pattern to rule out matching things like #nop.qrs. Again, a word boundary alone would not be sufficient.
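The same idea can probably be written more compactly with negative lookarounds: (?<!\S) means "not preceded by a non-whitespace character", which covers both whitespace and the start of the string, and (?!\S) does the same for the end. The sketch below is an alternative spelling, not the answerer's pattern, and the name TAG_RE is made up for illustration.

```python
import re

# (?<!\S): no non-whitespace character immediately before the '#'
# (?!\S):  no non-whitespace character immediately after the letters
TAG_RE = re.compile(r"(?<!\S)#[A-Za-z]+(?!\S)")

inp = "#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"
matches = TAG_RE.findall(inp)
print(matches)  # ['#abc', '#ghi', '#tuv']
```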
Just add a start-of-line anchor at the beginning:
^#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
It should work!

Regexp for string starting with a + and having numbers only [duplicate]

This question already has answers here:
Match exact string
(3 answers)
Closed 4 years ago.
I have the following regex for a string which starts with a + and otherwise contains only digits:
PatternArticleNumber = $"^(\\+)[0-9]*";
However, this allows strings like:
+454545454+4545454
This should not be allowed. Only the first character should be a +; all the others must be digits.
Any idea what may be wrong with my regex?
You can probably work around this problem by just adding an ending anchor to your regex, i.e. use this:
PatternArticleNumber = $"^(\\+)[0-9]*$";
The problem with your current pattern is that its end is not anchored. The entire string +454545454+4545454 is not a match, but the engine can still match the leading portion, before the second +, and report that as a match.
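The difference is easy to see by printing what each pattern actually matched. The question uses .NET; the sketch below uses Python's re module as a stand-in, on the assumption that the two patterns behave the same way for this input. The names OPEN_ENDED, ANCHORED and matched_text are made up for illustration.

```python
import re

OPEN_ENDED = re.compile(r"^\+[0-9]*")   # no end anchor
ANCHORED = re.compile(r"^\+[0-9]*$")    # end anchor added

def matched_text(pattern, value):
    """Return the text the pattern matched, or None if there is no match."""
    m = pattern.search(value)
    return m.group(0) if m else None

print(matched_text(OPEN_ENDED, "+454545454+4545454"))  # +454545454 (prefix only)
print(matched_text(ANCHORED, "+454545454+4545454"))    # None
print(matched_text(ANCHORED, "+454545454"))            # +454545454
```

In Python specifically, $ also matches just before a trailing newline, so re.fullmatch(r"\+[0-9]*", value) is the stricter choice when the whole string must match.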

Regex find string in the middle of two strings [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 5 years ago.
I want to get the time in the following line. I want to get the string
2017-07-07 08:30:00.065156
in
[ID] = 0,[Time] = 2017-07-07 08:30:00.065156,[access]
I tried this
(?<=[Time] = )(.*?)(?=,)
I want to get the string between the [Time] tag and the first comma, but this doesn't work.
[Time] inside a regex means a T, an i, an m, or an e, unless you escape your square brackets.
You can drop the reluctant quantifier if you use [^,]* in place of .*:
(?<=\[Time\] = )([^,]*)(?=,)
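A quick check of both patterns against the sample line, sketched in Python (the question does not say which regex engine it uses, so this assumes Python's re behaves the same here; the helper name extract is made up):

```python
import re

line = "[ID] = 0,[Time] = 2017-07-07 08:30:00.065156,[access]"

# Unescaped: [Time] is a character class (one of T, i, m, e).
UNESCAPED = r"(?<=[Time] = )(.*?)(?=,)"
# Escaped: \[Time\] matches the literal text "[Time]".
ESCAPED = r"(?<=\[Time\] = )([^,]*)(?=,)"

def extract(pattern, text):
    """Return the first captured group, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

print(extract(UNESCAPED, line))  # None
print(extract(ESCAPED, line))    # 2017-07-07 08:30:00.065156
```

The unescaped version finds nothing because the character just before each " = " in the line is a ], which is not in the class [Time].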

How do I add special characters into a text string using regex.Replace method? [duplicate]

This question already has answers here:
Is there "\n" equivalent in VBscript?
(6 answers)
Closed 5 years ago.
The first character on every line of a file I have is a comma. How can I remove just this comma?
I have tried to use the replace method but it doesn't seem to accept special characters. Here is an example:
myRegExp.Pattern = "\n,"
strText5 = myRegExp.Replace(strText4, "\n")
The above snippet replaces the first newline character and comma with the literal text \n. How can I replace with a special character instead of a literal string?
The MSDN library doesn't seem to have the answers I need.
TIA.
If you enable MultiLine mode (and Global if you have not done so) then ^ will match the start of a line:
myRegExp.Pattern = "^,"
myRegExp.Multiline = true
myRegExp.Global = true
strText5 = myRegExp.Replace(strText4, "")
(In a vanilla VB string there are no escape sequences: "\n" is just the two characters backslash and n. For a line feed you would use vbLf or Chr(10).)
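For comparison, here is the same multiline-anchor technique sketched in Python rather than VBScript. re.MULTILINE plays the role of the Multiline property, and re.sub replaces every occurrence, so no separate Global switch is needed. The function name strip_leading_commas is made up for illustration.

```python
import re

def strip_leading_commas(text):
    """Remove a single comma at the start of every line."""
    # With re.MULTILINE, ^ matches at the start of each line,
    # not only at the start of the whole string.
    return re.sub(r"^,", "", text, flags=re.MULTILINE)

sample = ",a,b\n,c,d\n,e"
print(strip_leading_commas(sample))
# a,b
# c,d
# e
```

Commas elsewhere in a line are left alone, since only a comma directly after a line start can match.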

Regex to match all character groups in a string [duplicate]

This question already has an answer here:
Learning Regular Expressions [closed]
(1 answer)
Closed 5 years ago.
I need a regex to match the groups of characters in a string.
For example, this is-a#beautiful^day should result in the following list: this, is, a, beautiful, day.
Note that I don't know how long the string is or which characters separate the words.
Any ideas? I have no clue how to build a regex for this.
If you want to find all groups of letters:
import re
text = "this is-a#beautiful^day"
words = re.findall(r'[A-Za-z]+', text)  # each run of ASCII letters is one word
print(words)
['this', 'is', 'a', 'beautiful', 'day']
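The same result can also be reached from the other direction, by splitting on everything that is not a letter. This is a sketch of that alternative, not part of the original answer; the function name split_words is made up. re.split leaves empty strings when the text starts or ends with a separator, so they are filtered out.

```python
import re

def split_words(text):
    """Split on runs of non-letters and drop the empty pieces."""
    return [piece for piece in re.split(r"[^A-Za-z]+", text) if piece]

print(split_words("this is-a#beautiful^day"))
# ['this', 'is', 'a', 'beautiful', 'day']
```

Both versions only recognise ASCII letters; accented or non-Latin letters would be treated as separators.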