Python regex to parse '#####' text in description field [duplicate] - regex

This question already has answers here:
regex to extract mentions in Twitter
(2 answers)
Extracting #mentions from tweets using findall python (Giving incorrect results)
(3 answers)
Closed 3 years ago.
Here's the line I'm trying to parse:
#abc def#gmail.com #ghi j#klm #nop.qrs #tuv
And here's the regex I've gotten so far:
#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
My goal is to get ['#abc', '#ghi', '#tuv'], but no matter what I do, I can't get 'j#klm' to not match. Any help is much appreciated.

Try using re.findall with the following regex pattern:
(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)
import re
inp = "#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"
matches = re.findall(r'(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)', inp)
print(matches)
This prints:
['#abc', '#ghi', '#tuv']
The regex calls for an explanation. The leading lookbehind (?:(?<=^)|(?<=\s)) asserts that what precedes the # symbol is either a space or the start of the string. We can't use a word boundary here because # is not a word character. We use a similar lookahead (?=\s|$) at the end of the pattern to rule out matching things like #nop.qrs. Again, a word boundary alone would not be sufficient.
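The same "whitespace or string edge on both sides" idea can probably be written more compactly with negative lookarounds. The sketch below is only an illustration: the helper name extract_tags is made up, and (?<!\S) / (?!\S) are swapped in for the lookarounds used in the answer.

```python
import re

# (?<!\S) : not preceded by a non-whitespace character (start of string or whitespace)
# (?!\S)  : not followed by a non-whitespace character (end of string or whitespace)
TAG_RE = re.compile(r"(?<!\S)#[A-Za-z]+(?!\S)")

def extract_tags(text):
    """Return the standalone #tags found in text (hypothetical helper)."""
    return TAG_RE.findall(text)

print(extract_tags("#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"))
```

Because both lookarounds are zero-width, the surrounding whitespace never ends up in the results.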

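Just add a start-of-line anchor at the beginning:
^#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
This stops j#klm from matching. Note, though, that the spaces in the pattern are matched literally and end up in the results, and that #nop (from #nop.qrs) still matches.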

Related

How can I remove a certain pattern from a string? [duplicate]

I have a string like "682_2, 682_3, 682_4" (682 is a random number).
How can I get the string "2, 3, 4" using regex and Ruby?

You can do this in Ruby:
input="682_2, 682_3, 682_4"
output = input.gsub(/\d+_/,"")
puts output

A simple regex could be
/_([0-9]+)$/ and in the capture group of the result you will have 2 for 682_2 and 3 for 682_3.
The Ruby code snippet would be "64532_2".match(/_([0-9]+)/).captures[0]

You can use scan, which returns an array containing the matches:
string_code.scan(/(?<=_)\d/)
(?<=_) is a lookbehind: it requires an underscore immediately before the match without including it in the result, so only the \d is returned. If the number after the underscore can have more than one digit, as in 682_13, 682_33, then \d+ is necessary.
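For comparison, the two approaches above (substituting the prefix away, or scanning with a lookbehind) can be sketched in Python. This is only an illustration, since the answers themselves are Ruby; the variable names are made up.

```python
import re

source = "682_2, 682_3, 682_4"

# Approach 1: delete every "digits followed by underscore" prefix (like gsub).
stripped = re.sub(r"\d+_", "", source)

# Approach 2: collect the digits that directly follow an underscore (like scan).
suffixes = re.findall(r"(?<=_)\d+", source)

print(stripped)
print(", ".join(suffixes))
```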

Regex function to find all and only 6 digit numeric string ignoring spaces if any any between [duplicate]

I have an HTML source page as a text file.
I need to read the file and find only those numeric strings that have 6 consecutive digits, possibly with a space in between those 6 digits.
E.g.
209 016 - should come up in the search result as 209016 (space removed)
209016 - should also come up in the search, unaltered, as 209016
Any numeric string that is not exactly 6 digits long should not come up in the search, e.g. 20901677, 209016#223, 29016.
I think this can be achieved with regex but I was not able to do it.
A solution in regex is more desirable, but anything else is also welcome.

To match 6 digits with any number of spaces in between, you may use the following pattern:
\b(?:\d[ ]*?){6}\b
Or if you want to reject it when it's followed by an #, you may use:
\b(?:\d[ ]*?){6}\b(?!#)
Then, you can use the replace method to remove the space characters.
Python example:
import re
regex = r"\b(?:\d[ ]*?){6}\b(?!#)"
test_str = ("209016 \n"
            "209 016\n"
            "20901677','209016#223', '29016")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    # Strip the inner spaces so "209 016" is reported as "209016".
    print(match.group().replace(" ", ""))
Output:
209016
209016
You can try the following regex:
\b(?<!#)\d(?:\s*\d){5}\b(?!#)
demo: https://regex101.com/r/ZCcDmF/2/
But note that you might have to modify your boundaries if you need to exclude more than the #. The pattern will become something like:
\b(?<!#|other char I need to exclude|another one|...)\d(?:\s*\d){5}\b(?!#|other char I need to exclude|another one|...)
where you replace "other char I need to exclude", "another one", ... with the actual characters.
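When every excluded item is a single character, a character class inside the lookarounds may be tidier than an alternation. The following is a minimal sketch under that assumption; the names SIX_DIGITS and find_six_digit_codes are hypothetical, and it allows only plain spaces (not \s) between the digits so that a match cannot run across a line break.

```python
import re

# Put every character that must not touch the number inside the brackets;
# only "#" comes from the question, anything else would be an assumption.
SIX_DIGITS = re.compile(r"\b(?<![#])\d(?: *\d){5}\b(?![#])")

def find_six_digit_codes(text):
    """Return each 6-digit run (spaces allowed inside) with the spaces removed."""
    return [m.group().replace(" ", "") for m in SIX_DIGITS.finditer(text)]

sample = "209016 \n209 016\n20901677','209016#223', '29016"
print(find_six_digit_codes(sample))
```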

Groovy - Extract a string between two different strings [duplicate]

I have file names in the format below:
India_AP_Dev1.txt
USA_GA_QA2.txt
USA_NY_AWSDev1.txt
AUS_AA_BB_QA4.txt
I want to extract only the environment part from the file name, i.e. Dev1, QA2, AWSDev1, QA4, etc. How can I go about this with this type of file name? I thought about substring, but the environment length is not constant. Is it possible to do it with regex?
Appreciate your help. TIA

It is definitely possible using lookarounds:
(?<=_)[^._]*(?=\.)
(?<=_) asserts that the match is preceded by _
[^._]* takes as many characters as possible other than . and _
(?=\.) asserts that the match is followed by .
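The question is about Groovy, but the pattern can be tried out quickly in Python, whose lookaround syntax is the same for this case. A small sketch, with environment_of as a made-up helper name and the file names taken from the question:

```python
import re

# Last underscore-delimited chunk that sits directly before a dot.
ENV_RE = re.compile(r"(?<=_)[^._]*(?=\.)")

def environment_of(file_name):
    """Return the environment part of file_name, or None if there is none."""
    match = ENV_RE.search(file_name)
    return match.group() if match else None

for name in ["India_AP_Dev1.txt", "USA_GA_QA2.txt",
             "USA_NY_AWSDev1.txt", "AUS_AA_BB_QA4.txt"]:
    print(name, "->", environment_of(name))
```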

How to include regex pattern matches into substitution output [duplicate]

For example, if I want to add a space in-between all instances where I have one uppercase letter preceding a hyphen (A-, C-, etc...), then what function can I use to achieve this?
Alternatively, is there a way to get re.sub to output the pattern that was matched? :
>>> text = 'T- AB-'
>>> re.sub(r'\b[A-Z]-', 'what goes here?', text)
>>> text
'T - AB-'
You are looking to use capturing parentheses and a \1:
import re
text = 'T- AB-'
text = re.sub(r'\b([A-Z])-', r'\1 -', text)
print (text)
results:
T - AB-
That should do the trick. Whatever you capture in the ( ) can be referenced with \1. If you have several sets of parentheses, each set can be referenced in order as \1, \2, \3, etc. Good luck!
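Two other ways to reuse the matched text in re.sub are worth knowing: \g<0> refers to the whole match, and the replacement can be a function that receives the match object. A short sketch of both, using the example string from the question (the variable names are illustrative only):

```python
import re

text = "T- AB-"

# \g<0> stands for the whole match; the lookahead keeps the hyphen out of it,
# so only the single uppercase letter is replaced by "letter + space".
with_whole_match = re.sub(r"\b[A-Z](?=-)", r"\g<0> ", text)

# A function replacement gets the match object and returns the new text.
with_function = re.sub(r"\b([A-Z])-", lambda m: m.group(1) + " -", text)

print(with_whole_match)
print(with_function)
```

The function form is the more flexible of the two when the replacement depends on what was matched.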

Regexp - Match any character except "Something.AnyChar" [duplicate]

I have a string:
Input:
"Feature.. sklsd " AND klsdjkls 9290 "Feass . lskdk SDFSD __ ksdljsklfsd" NOT "Feuas" "Feature.lskd" OR PUT klasdkljf al9- .s.a, 9a0sd90209 .a,sdklf jalkdfj al;akd
I need to match any character except OR, NOT, AND and "Feature.any_count_of_characters".
The last one is important: it starts with "Feature. followed by any number of characters, and ends with a " character.
I'm trying to solve this using lookahead or lookbehind, but I can't get the expected output, only a portion of the characters that I don't want.
My expected output is
"Feature.. sklsd " AND klsdjkls 9290 "Feass . lskdk SDFSD __ ksdljsklfsd" NOT "Feuas" "Feature.lskd" OR PUT klasdkljf al9- .s.a, 9a0sd90209 .a,sdklf jalkdfj al;akd
All that is in black.
To test it I'm using these links:
http://gskinner.com/RegExr/
http://regexpal.com/
Thanks.
EDIT
Check this link http://regexr.com?37v36
Inside the link, some expressions get matched. But I don't need the expressions that matched; I need the inverse. How can I get it?
Thanks.

Just use
\s*(?:AND|OR|NOT|"[^"]+")\s*
but do a replace operation. That will leave what you want.
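One way to get "the inverse" in Python is to split on that pattern and keep what lies between the matches. This is a sketch under the assumption that the wanted result is the leftover text; the sample is a fragment of the input from the question and the names are made up. Note that the pattern as given drops every quoted string, not only those starting with "Feature., and it would also match AND, OR or NOT inside longer words.

```python
import re

# Pattern copied from the answer: operators and quoted strings,
# together with the whitespace around them.
UNWANTED = re.compile(r'\s*(?:AND|OR|NOT|"[^"]+")\s*')

sample = 'klsdjkls 9290 NOT "Feuas" "Feature.lskd" OR PUT klasdkljf'

# Split on the unwanted parts and skip the empty pieces left between
# two unwanted parts that sit next to each other.
kept = [piece for piece in UNWANTED.split(sample) if piece]

print(kept)
print(" ".join(kept))
```

Splitting rather than replacing with an empty string avoids gluing neighbouring words together.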
Your basic problem is that lookbehinds cannot have arbitrary length, but you need that. There are workarounds, but a simpler approach is to use a capturing group:
"Feature\.[^"]*" (?:OR|NOT|AND) ([^"])
And your target will be in group 1 of the match.