How to include regex pattern matches into substitution output [duplicate] - regex

For example, if I want to add a space in every place where a single uppercase letter precedes a hyphen (A-, C-, etc.), what function can I use to achieve this?
Alternatively, is there a way to get re.sub to output the pattern that was matched?
>>> text = 'T- AB-'
>>> re.sub(r'\b[A-Z]-', 'what goes here?', text)
>>> text
'T - AB-'

You are looking for capturing parentheses and a \1 backreference:
import re
text = 'T- AB-'
text = re.sub(r'\b([A-Z])-', r'\1 -', text)
print(text)
Result:
T - AB-
That should do the trick. Whatever you capture in the ( ) can be referenced with \1. If you have a series of parentheses, each set can be referenced in order as \1, \2, \3, etc. Good luck!
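To address the second half of the question (getting re.sub to output whatever was matched), here is a minimal sketch of two alternatives to a numbered group. The variable names spaced and spaced_fn are made up for illustration; \g<0> and callable replacements are standard features of Python's re module.

```python
import re

text = 'T- AB-'

# \g<0> in the replacement stands for the entire match, so no capturing
# group is needed. The lookahead (?=-) keeps the hyphen out of the match.
spaced = re.sub(r'\b[A-Z](?=-)', r'\g<0> ', text)

# A replacement function receives the match object and returns the new text.
spaced_fn = re.sub(r'\b[A-Z]-', lambda m: m.group(0)[0] + ' -', text)

print(spaced)     # T - AB-
print(spaced_fn)  # T - AB-
```

The function form is the more flexible of the two when the replacement depends on the matched text in a way a template cannot express.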

Related

How can I remove a certain pattern from a string? [duplicate]

I have a string like "682_2, 682_3, 682_4" (682 is a random number).
How can I get the string "2, 3, 4" using regex and Ruby?
You can do this in Ruby:
input="682_2, 682_3, 682_4"
output = input.gsub(/\d+_/,"")
puts output
A simple regex could be
/_([0-9]+)$/ and the first match group of the result will hold 2 for 682_2 and 3 for 682_3.
A Ruby code snippet would be "64532_2".match(/_([0-9]+)/).captures[0]
You can use scan, which returns an array containing the matches:
string_code.scan(/(?<=_)\d/)
(?<=_) is a lookbehind: it requires the given pattern (_ in this case) to appear immediately before the match without including it in the match, so only \d is returned. If the suffix can have more than one digit, as in 682_13,682_33, then \d+ is necessary.
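For comparison only, the same two ideas (deleting the prefix and collecting the suffixes with a lookbehind) can be sketched in Python. The question itself is about Ruby; the variable names below are illustrative and not taken from the answers.

```python
import re

text = "682_2, 682_3, 682_4"

# Delete every run of digits that is followed by an underscore.
stripped = re.sub(r'\d+_', '', text)

# Or collect only the digits that come right after an underscore.
suffixes = re.findall(r'(?<=_)\d+', text)

print(stripped)  # 2, 3, 4
print(suffixes)  # ['2', '3', '4']
```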

Python regex to parse '#####' text in description field [duplicate]

Here's the line I'm trying to parse:
#abc def#gmail.com #ghi j#klm #nop.qrs #tuv
And here's the regex I've gotten so far:
#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
My goal is to get ['#abc', '#ghi', '#tuv'], but no matter what I do, I can't get 'j#klm' to not match. Any help is much appreciated.
Try using re.findall with the following regex pattern:
(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)
import re
inp = "#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"
matches = re.findall(r'(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)', inp)
print(matches)
This prints:
['#abc', '#ghi', '#tuv']
The regex calls for an explanation. The leading lookbehind (?:(?<=^)|(?<=\s)) asserts that what precedes the # symbol is either a space or the start of the string. We can't use a word boundary here because # is not a word character. We use a similar lookahead (?=\s|$) at the end of the pattern to rule out matching things like #nop.qrs. Again, a word boundary alone would not be sufficient.
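A slightly shorter way to express the same two conditions, offered as a sketch and not as the answerer's own pattern, is to use negative lookarounds on \S. (?<!\S) means "not preceded by a non-whitespace character", which covers both the start of the string and a preceding space; (?!\S) does the same for the end.

```python
import re

inp = "#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"

# (?<!\S): start of string or whitespace before '#'
# (?!\S):  end of string or whitespace after the letters
matches = re.findall(r'(?<!\S)#[A-Za-z]+(?!\S)', inp)

print(matches)  # ['#abc', '#ghi', '#tuv']
```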
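Just add a start-of-line anchor (^) at the beginning:
^#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
It should work! Note, though, that the spaces around | are literal here, so each match carries a leading or trailing space, and #nop.qrs still yields ' #nop'.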

Regex function to find all and only 6 digit numeric strings, ignoring spaces if any in between [duplicate]

I have an HTML source page as a text file.
I need to read the file and find only those numeric strings that have 6 consecutive digits, possibly with a space between them.
E.g.
209 016 - should come up in the search result as 209016 (space removed)
209016 - should also come up in the search, unaltered, as 209016
Any numeric string more than 6 digits long should not come up in the search, e.g. 20901677, 209016#223, 29016
I think this can be achieved with a regex, but I was not able to work it out.
A solution using regex is preferable, but anything else is also welcome.
To match 6 digits with any number of spaces in between, you may use the following pattern:
\b(?:\d[ ]*?){6}\b
Or, if you want to reject it when it's followed by a #, you may use:
\b(?:\d[ ]*?){6}\b(?!#)
Then, you can use the replace method to remove the space characters.
Python example:
import re
regex = r"\b(?:\d[ ]*?){6}\b(?!#)"
test_str = ("209016 \n"
"209 016\n"
"20901677','209016#223', '29016")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group().replace(" ", ""))
Output:
209016
209016
You can try the following regex:
\b(?<!#)\d(?:\s*\d){5}\b(?!#)
demo: https://regex101.com/r/ZCcDmF/2/
But note that you might have to modify your boundaries if you need to exclude more than the #. It will become something like:
\b(?<!#|other char I need to exclude|another one|...)\d(?:\s*\d){5}\b(?!#|other char I need to exclude|another one|...)
where you replace "other char I need to exclude", "another one", ... with the actual characters.
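As a rough sketch of the "exclude more characters" idea, the excluded characters can be put in a character class and the pattern built from it. The helper name find_six_digit_codes and its excluded parameter are hypothetical, and this version allows only literal spaces (not any whitespace) between digits, which is an assumption about the intended input.

```python
import re

def find_six_digit_codes(text, excluded="#"):
    """Return 6-digit numbers found in text, with inner spaces removed.

    `excluded` lists characters that must not touch the number on either
    side (hypothetical parameter, defaulting to '#').
    """
    if excluded:
        char_class = "[" + re.escape(excluded) + "]"
        before = "(?<!" + char_class + ")"
        after = "(?!" + char_class + ")"
    else:
        before = after = ""
    # One digit, then five more digits, each optionally preceded by spaces.
    pattern = r"\b" + before + r"\d(?: *\d){5}\b" + after
    return [m.group().replace(" ", "") for m in re.finditer(pattern, text)]

sample = "209016 \n209 016\n20901677','209016#223', '29016"
print(find_six_digit_codes(sample))  # ['209016', '209016']
```

A character class keeps the lookbehind at a fixed width of one character, which Python's re module requires.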

Python 2.7 Regex Tokenizer Implementation Not Working [duplicate]

I created a regular expression to match tokens in a German text of type string.
My regular expression works as expected on regex101.com. Here is a link to my regex with an example sentence: My regex + example on regex101.com
So I implemented it in Python 2.7 like this:
# -*- coding: utf-8 -*-
import re

GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
([A-ZÄÖÜ]\.)+ # abbreviations including ÄÖÜ
|\d+([.,]\d+)?([€$%])? # numbers, allowing commas as separators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():-_'!] # matches special characters including !
'''
def tokenize_german_text(text):
    '''
    Takes a text of type string and
    tokenizes the text
    '''
    matchObject = re.findall(GERMAN_TOKENIZER, text)
    pass
tokenize_german_text(u'Das ist ein Deutscher Text! Er enthält auch Währungen, 10€')
Result:
While debugging this, I found that matchObject is only a list of 11 entries containing empty strings. Why is it not working as expected, and how can I fix this?
re.findall() returns only what the capturing groups matched (unless there are no capturing groups in your regex, in which case it returns each full match).
So your regex matches several times, but every time the match is one where no capturing group is participating. Remove the capturing groups, and you'll see results. Also, place the - at the end of the character class unless you actually want to match the range of characters between : and _ (but not the - itself):
GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
(?:[A-ZÄÖÜ]\.)+ # abbreviations including ÄÖÜ
|\d+(?:[.,]\d+)?[€$%]? # numbers, allowing commas as separators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():_'!-] # matches special characters including !
'''
Result:
['Das', 'ist', 'ein', 'Deutscher', 'Text', '!', 'Er', 'enthält', 'auch', 'Währungen', ',', '10€']
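The difference can be seen in isolation with a smaller pattern. This is a minimal sketch written for Python 3 (the question uses Python 2.7); the sample text and the variable names are made up for illustration.

```python
import re

text = "10€ and 3,5%"

with_groups = r'\d+([.,]\d+)?([€$%])?'      # two capturing groups
without_groups = r'\d+(?:[.,]\d+)?[€$%]?'   # same pattern, no capturing groups

# With capturing groups, findall returns one tuple of group contents per
# match; groups that did not participate show up as empty strings.
print(re.findall(with_groups, text))     # [('', '€'), (',5', '%')]

# Without capturing groups, findall returns the full matches.
print(re.findall(without_groups, text))  # ['10€', '3,5%']

# finditer always gives access to the full match through group(0).
print([m.group(0) for m in re.finditer(with_groups, text)])  # ['10€', '3,5%']
```

If the groups must stay capturing for other reasons, re.finditer with group(0) is one way to still get the whole tokens.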

Parse formula name and arguments with regex [duplicate]

The objective of the regex (\w*)\s*\([(\w*),]*\) is to capture a function name and its arguments.
For example, given f1 (11,22,33)
the regex should capture four elements:
f1
11
22
33
What's wrong with this regex?
You can do it with split. Here is an example in JavaScript:
var ar = str.match(/\((.*?)\)/);
if (ar) {
  // ar[1] is the text captured between the parentheses (ar[0] would include them)
  var result = ar[1].split(",");
}
Reference: https://stackoverflow.com/a/13953005/1827594
Some things are hard for regexes :-)
As the commenters above are saying, '*' can be too lax. It means zero or more. So foo(,,) also matches. Not so good.
(\w+)\s*\((\w+)(?:,\s*(\w+)\s*)*\)
That is closer to what you want I think. Let's break that down.
\w+ <-- The function name, has to have at least one character
\s* <-- zero or more whitespace
\( <-- parens to start the function call
(\w+) <-- at least one parameter
(?: ... ) <-- a non-capturing group: it groups without saving the match
,\s* <-- a comma with optional space
(\w+) <-- another parameter
\s* <-- followed by optional space
This is the result from Python:
>>> m = re.match(r'(\w+)\s*\((\w+)(?:,\s*(\w+)\s*)*\)', "foo(a,b,c)")
>>> m.groups()
('foo', 'a', 'c')
Note that 'b' is missing: a capturing group that repeats keeps only its last match.
But what about something like this:
foo(a,b,c
d,e,f)
?? Yeah, it gets hard fast with regexes and you move on to richer parsing tools.
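One way to get every argument, sketched here under the assumption that arguments are simple comma-separated tokens without nested parentheses, is to let the regex capture the name and the raw argument list and then split that list in ordinary code. It also sidesteps the problem in the original pattern, where [(\w*),] is a character class and not a group. The function name parse_call is hypothetical.

```python
import re

def parse_call(source):
    """Split 'name(arg1, arg2, ...)' into (name, [args]); None if no match.

    Assumes flat, comma-separated arguments with no nested parentheses.
    """
    m = re.fullmatch(r'\s*(\w+)\s*\(([^()]*)\)\s*', source)
    if m is None:
        return None
    name, arg_text = m.groups()
    # An empty argument list should give [], not [''].
    args = [a.strip() for a in arg_text.split(',')] if arg_text.strip() else []
    return name, args

print(parse_call("f1 (11,22,33)"))  # ('f1', ['11', '22', '33'])
print(parse_call("foo(a, b, c)"))   # ('foo', ['a', 'b', 'c'])
```

Anything beyond this flat shape (nested calls, quoted strings containing commas) is where a real parser becomes the better tool.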