This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
What does this regular expression stand for: (.*)#(.*)? I understand that .* matches any character (the period) any number of times (the asterisk).
But I couldn't understand the meaning properly. Also, what do two of them separated by # mean?
.*#.* matches any string containing the # character
Example of strings that this pattern would match
#
#qe
asrrd#
qw3e#as112d
(.*)#(.*) matches the same strings, but it also captures whatever is before and after the # character (the last #, if there are several, because the first .* is greedy)
Example:
for # it would return two empty strings, '' and ''
for #qe it would return '' and 'qe'
for asrrd# it would return 'asrrd' and ''
for qw3e#as112d it would return 'qw3e' and 'as112d'
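The captures listed above can be checked with a short Python sketch. The helper name split_on_hash is hypothetical, and re.fullmatch is used on the assumption that the whole string should match the pattern; note that . does not match a newline by default.

```python
import re

# Hypothetical helper: returns the text before and after the '#',
# or None when the string contains no '#'.
def split_on_hash(text):
    match = re.fullmatch(r"(.*)#(.*)", text)
    return match.groups() if match else None

for sample in ["#", "#qe", "asrrd#", "qw3e#as112d", "a#b#c", "no hash"]:
    print(repr(sample), "->", split_on_hash(sample))
```

When the input contains several # characters, the greedy first group takes as much as it can, so "a#b#c" is split at the last #.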
(.*)#(.*) can match any of the following:
#, .#., ..#., jbkbhjh...#...njbh ...
* means zero or more repetitions of the preceding element, so .* matches any number of characters (including none).
So this regex means a # symbol with any number of characters on either side.
Also, what does two of them separated by # mean
Answer to this is: the # symbol is required for a text to match this regex
This question already has answers here:
regex to extract mentions in Twitter
(2 answers)
Extracting #mentions from tweets using findall python (Giving incorrect results)
(3 answers)
Closed 3 years ago.
Here's the line I'm trying to parse:
#abc def#gmail.com #ghi j#klm #nop.qrs #tuv
And here's the regex I've gotten so far:
#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
My goal is to get ['#abc', '#ghi', '#tuv'], but no matter what I do, I can't get 'j#klm' to not match. Any help is much appreciated.
Try using re.findall with the following regex pattern:
(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)
import re

inp = "#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"
matches = re.findall(r'(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)', inp)
print(matches)
This prints:
['#abc', '#ghi', '#tuv']
The regex calls for an explanation. The leading lookbehind (?:(?<=^)|(?<=\s)) asserts that what precedes the # symbol is either a space or the start of the string. We can't use a word boundary here because # is not a word character. We use a similar lookahead (?=\s|$) at the end of the pattern to rule out matching things like #nop.qrs. Again, a word boundary alone would not be sufficient.
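As a hedged alternative sketch, the same two conditions can be written with negative lookarounds on \S ("not preceded or followed by a non-whitespace character"), which cover both the whitespace case and the start or end of the string. The name find_mentions is made up for illustration; this is not the pattern from the answer above, only one that gives the same result on this input.

```python
import re

# (?<!\S) : no non-whitespace character immediately before the '#'
# (?!\S)  : no non-whitespace character immediately after the letters
MENTION = re.compile(r"(?<!\S)#[A-Za-z]+(?!\S)")

def find_mentions(text):
    return MENTION.findall(text)

print(find_mentions("#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"))
```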
Just add the start-of-line anchor at the beginning:
^#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
It should work!
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I have an HTML source page as a text file.
I need to read the file and find only those numeric strings that have 6 consecutive digits, optionally with a space somewhere among those 6 digits.
E.g.:
209 016 - should come up in the search result as 209016 (space removed)
209016 - should also come up in the search, unaltered, as 209016
Any numeric string more than 6 digits long should not come up in the search, e.g. 20901677, 209016#223, 29016
I think this can be achieved with a regex, but I was not able to work it out.
A solution in regex is preferable, but anything else is also welcome.
To match 6 digits with any number of spaces in between, you may use the following pattern:
\b(?:\d[ ]*?){6}\b
Or if you want to reject it when it's followed by an #, you may use:
\b(?:\d[ ]*?){6}\b(?!#)
Then, you can use the replace method to remove the space characters.
Python example:
import re

regex = r"\b(?:\d[ ]*?){6}\b(?!#)"
test_str = ("209016 \n"
            "209 016\n"
            "20901677','209016#223', '29016")

matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    # Strip the spaces from each match before printing it.
    print(match.group().replace(" ", ""))
Output:
209016
209016
You can try the following regex:
\b(?<!#)\d(?:\s*\d){5}\b(?!#)
demo: https://regex101.com/r/ZCcDmF/2/
But note that you might have to modify your boundaries if you need to exclude more than the #. It will become something like:
\b(?<!#|other char I need to exclude|another one|...)\d(?:\s*\d){5}\b(?!#|other char I need to exclude|another one|...)
where you have to replace other char I need to exclude, another one, ... with the actual characters.
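One way to sketch the "more excluded characters" idea in Python is to put them in a character class, which keeps the lookbehind at a fixed width of one character. The helper names build_pattern and find_codes and the excluded_chars parameter are assumptions made for this example, not part of the answer above.

```python
import re

def build_pattern(excluded_chars="#"):
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    cls = "[" + re.escape(excluded_chars) + "]"
    return re.compile(rf"\b(?<!{cls})\d(?:\s*\d){{5}}\b(?!{cls})")

def find_codes(text, excluded_chars="#"):
    pattern = build_pattern(excluded_chars)
    # Remove any whitespace inside each match before returning it.
    return [re.sub(r"\s+", "", m.group()) for m in pattern.finditer(text)]

print(find_codes("209 016, 209016, 20901677, 209016#223, 29016"))
print(find_codes("209016@x, 209 016", excluded_chars="#@"))
```

Because \s* accepts any run of whitespace between digits, two shorter numbers separated only by spaces (for example "12 3456") would also be joined into one match; tightening \s* to an optional single space is one possible adjustment if that matters.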
This question already has answers here:
Regular expression to match a dot
(7 answers)
What special characters must be escaped in regular expressions?
(13 answers)
Closed 3 years ago.
I am new to RegEx in Python. I have created a RegEx pattern that should find some special strings in a text, but it is not working as expected:
import re

def find_short_url(str_field):
    search_string = r"moourl.com|ow.ly|goo.gl|polr.me|su.pr|bit.ly|is.gd|tinyurl.com|buff.ly|bit.do|adf.ly"
    search_string = re.search(search_string, str(str_field))
    result = search_string.group(0) if search_string else None
    return result
It should find all the URL shorteners in a text. But su.pr is being detected as surpr in the text. Is there any way to fix it?
find_short_url("It is a surprise that it is ...")
output
'surpr'
It can affect other shorteners too. Still scratching my head.
Escape the dots:
search_string = r"moourl\.com|ow\.ly|goo\.gl|polr\.me|su\.pr|bit\.ly|is\.gd|tinyurl\.com|buff\.ly|bit\.do|adf\.ly"
In regex, an unescaped dot matches any character (except a newline, by default). Escaping the dots makes them match a literal dot.
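Rather than escaping each dot by hand, the pattern can be built from a plain list of domains with re.escape. The sketch below assumes the same list of shorteners as in the question; the constant names SHORTENER_DOMAINS and SHORT_URL_PATTERN are introduced only for illustration.

```python
import re

SHORTENER_DOMAINS = [
    "moourl.com", "ow.ly", "goo.gl", "polr.me", "su.pr", "bit.ly",
    "is.gd", "tinyurl.com", "buff.ly", "bit.do", "adf.ly",
]

# re.escape turns every '.' into '\.', so each domain is matched literally.
SHORT_URL_PATTERN = re.compile("|".join(re.escape(d) for d in SHORTENER_DOMAINS))

def find_short_url(str_field):
    match = SHORT_URL_PATTERN.search(str(str_field))
    return match.group(0) if match else None

print(find_short_url("It is a surprise that it is ..."))
print(find_short_url("See http://su.pr/abc for details"))
```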
This question already has answers here:
Regex Match all characters between two strings
(16 answers)
Closed 5 years ago.
I have file names in the format below:
India_AP_Dev1.txt
USA_GA_QA2.txt
USA_NY_AWSDev1.txt
AUS_AA_BB_QA4.txt
I want to extract only the environment part from the file name, i.e. Dev1, QA2, AWSDev1, QA4, etc. How can I go about this with file names like these? I thought about substring, but the environment length is not constant. Is it possible to do it with regex?
Appreciate your help. TIA
It is definitely possible using lookarounds:
(?<=_)[^._]*(?=\.)
(?<=_) asserts that the match is preceded by _
[^._]* takes all characters except . and _
(?=\.) asserts that the match is followed by .
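A minimal Python sketch of the lookaround pattern applied to the file names from the question; the function name extract_environment is hypothetical.

```python
import re

# Text that follows an underscore, contains no '.' or '_',
# and is immediately followed by a dot.
ENV_PATTERN = re.compile(r"(?<=_)[^._]*(?=\.)")

def extract_environment(filename):
    match = ENV_PATTERN.search(filename)
    return match.group(0) if match else None

for name in ["India_AP_Dev1.txt", "USA_GA_QA2.txt",
             "USA_NY_AWSDev1.txt", "AUS_AA_BB_QA4.txt"]:
    print(name, "->", extract_environment(name))
```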
This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 5 years ago.
I created a regular expression to match tokens in a German text of type string.
My regular expression works as expected on regex101.com with an example sentence.
So I implemented it in Python 2.7 like this:
GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
([A-ZÄÖÜ]\.)+ # abbreviations including ÄÖÜ
|\d+([.,]\d+)?([€$%])? # numbers, allowing commas as separators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():-_'!] # matches special characters including !
'''
def tokenize_german_text(text):
    '''
    Takes a text of type string and
    tokenizes the text
    '''
    matchObject = re.findall(GERMAN_TOKENIZER, text)
    pass
tokenize_german_text(u'Das ist ein Deutscher Text! Er enthält auch Währungen, 10€')
When I was debugging this I found out that the matchObject is only a list containing 11 entries with empty characters. Why is it not working as expected and how can I fix this?
re.findall() collects only the matches in capturing groups (unless there are no capturing groups in your regex, in which case it captures each match).
So your regex matches several times, but every time the match is one where no capturing group is participating. Remove the capturing groups, and you'll see results. Also, place the - at the end of the character class unless you actually want to match the range of characters between : and _ (but not the - itself):
GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
(?:[A-ZÄÖÜ]\.)+ # abbreviations including ÄÖÜ
|\d+(?:[.,]\d+)?[€$%]? # numbers, allowing commas as separators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():_'!-] # matches special characters including !
'''
Result:
['Das', 'ist', 'ein', 'Deutscher', 'Text', '!', 'Er', 'enthält', 'auch', 'Währungen', ',', '10€']
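Both points from the answer can be seen on a much smaller input. The following Python 3 sketch uses made-up sample strings and variable names; it only illustrates how re.findall treats capturing groups and how a hyphen in the middle of a character class forms a range.

```python
import re

text = "10,5 20"

# With a capturing group, findall returns the group's content
# (an empty string when the group did not participate in the match).
with_capture = re.findall(r"\d+(,\d+)?", text)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r"\d+(?:,\d+)?", text)

# ':-_' is a range from ':' to '_', which includes 'A' but not '-'.
range_class = re.findall(r"[:-_]", "A-:")

# With the hyphen last, it is a literal '-'.
literal_hyphen = re.findall(r"[:_-]", "A-:")

print(with_capture, without_capture, range_class, literal_hyphen)
```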