Keeping special marks when splitting text into tokens using regex

I have the text 'I love this but I have a! question to?' and I am currently using:
token_pattern = re.compile(r"(?u)\b\w+\b")
token_pattern.findall(text)
When using this regex I'm getting
['I','love', 'this', 'but', 'I', 'have', 'a', 'question', 'to']
I didn't write this regex and I know very little about regular expressions (I tried to learn from examples but gave up). I need to change it so that it keeps question marks and exclamation marks and returns each one as its own token, so that the result is:
['I','love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']
Any suggestions on how I can do that?

Try this:
token_pattern = re.compile(r"(?u)[^\w ]|\b\w+\b")
token_pattern.findall(text)
In addition to words, it matches every character that is neither alphanumeric nor a space, each as a separate token.
If you really only need question marks and exclamation marks, you can change the regex to
token_pattern = re.compile(r"(?u)[!?]|\b\w+\b")
token_pattern.findall(text)
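To make the difference between the two patterns concrete, here is a minimal sketch that runs both on the question's text and on a second sentence. The names any_symbol, marks_only and other are made up for this illustration.

```python
import re

text = "I love this but I have a! question to?"

# Pattern from the answer: any single character that is neither a word
# character nor a space, or a whole word.
any_symbol = re.compile(r"(?u)[^\w ]|\b\w+\b")
# Narrower variant: only '!' and '?' become tokens of their own.
marks_only = re.compile(r"(?u)[!?]|\b\w+\b")

tokens = marks_only.findall(text)
print(tokens)
# ['I', 'love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']

other = "Wait, what?!"
print(any_symbol.findall(other))  # ['Wait', ',', 'what', '?', '!']
print(marks_only.findall(other))  # ['Wait', 'what', '?', '!']
```

On the question's text both patterns give the same list; they only differ once other punctuation, such as the comma, shows up.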

Related

Splitting/Tokenizing a sentence into string words with special conditions

I am trying to implement a tokenizer that splits a string into words.
The special conditions I have are: split the punctuation marks . , ! ? into separate strings,
and split everything else on spaces, i.e. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.
I don't plan to use the nltk package, and I have looked at re.split and re.findall, but I'm stuck in both cases:
re.split: I don't know how to split off punctuation that is attached to a word, as in 'Dog,'
re.findall: it returns all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.
Are you trying to split on a delimiter (punctuation) while keeping it in the final result? One way of doing that would be this:
import re
sent = "I have a dog!'-4#"
# The capturing group keeps each delimiter in the result.
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get.
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]
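The result above still contains the spaces as tokens. A possible variation, sketched here under the assumption that whitespace should be dropped and only . , ! ? kept, captures the punctuation but not the whitespace and then filters out the leftovers:

```python
import re

sent = "I have a dog!'-4#"

# Punctuation sits in a capturing group, so re.split keeps it.
# Whitespace is matched outside the group, so it is dropped; for those
# splits the group is None, and adjacent delimiters produce '' pieces,
# so both are filtered out.
tokens = [t for t in re.split(r"([.,!?])|\s+", sent) if t]
print(tokens)  # ['I', 'have', 'a', 'dog', '!', "'-4#"]
```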
Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (ignore case),
[.!?] - any (single) char from your list (note that between brackets neither a dot nor a '?' needs to be quoted),
(?:(?![.!?])\S)+ - a non-empty sequence of non-white characters, other than those in your list.
E.g. for text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
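As a runnable sketch, the pattern can be wrapped in a small function. The name tokenize is made up for this example; the pattern itself is the one given above.

```python
import re

def tokenize(txt):
    # letters | one of . ! ? | a run of other non-whitespace characters
    return re.findall(r"[a-z]+|[.!?]|(?:(?![.!?])\S)+", txt, re.I)

print(tokenize("I have a dog!'-4#?."))
# ['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
print(tokenize("Dog, cat."))
# ['Dog', ',', 'cat', '.']
```

Note that the comma is not in [.!?], so it only comes out alone when nothing else is attached to it. If a comma should always be its own token, it would presumably have to be added to both character classes.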

How to extract - capital letters and digits in between two words, in a string?

I'm extracting information from an image of an invoice using PyTesseract, and I need to tag the relevant fields with their values.
I've tried using regex to extract the content, but regex is new to me. I've been able to extract words made of capital letters, but not a combination of letters and digits between particular words:
re.findall(r'[A-Z]+', string)
Example Sentence - Hello. I AM IRONMAN even though I would've preferred TO BE BATMAN. 123457678. Superhero FANTASY.
Expected Result - I AM IRONMAN I TO BE BATMAN. 123457678.
You can split on the delimiter of interest, 123457678., and then apply a regex to the part before the split:
import re
string = "Hello. I AM IRONMAN even though I would've preferred TO BE BATMAN. 123457678. Superhero FANTASY"
# Split right after "123457678." and keep only the first piece.
re.findall(r'[A-Z0-9]+\b\.?', re.split(r'(?<=123457678\.).', string)[0])
# ['I', 'AM', 'IRONMAN', 'I', 'TO', 'BE', 'BATMAN.', '123457678.']
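A variation on the same idea, sketched with str.partition instead of a lookbehind split and joined back into the expected string. The names delimiter, head, sep and result are made up for this example.

```python
import re

string = ("Hello. I AM IRONMAN even though I would've preferred "
          "TO BE BATMAN. 123457678. Superhero FANTASY")
delimiter = "123457678."

# Keep everything up to and including the delimiter, drop the rest.
head, sep, _tail = string.partition(delimiter)

# Runs of capitals/digits that end at a word boundary, plus an optional dot.
tokens = re.findall(r"[A-Z0-9]+\b\.?", head + sep)
result = " ".join(tokens)
print(result)  # I AM IRONMAN I TO BE BATMAN. 123457678.
```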

Regex expression to separate collapsed title

First time post. I have a text where lots of text in title case is collapsed without spaces. I'm trying to:
a) keep the full text (not lose any words),
b) use logic to separate 'A' as in 'A Way Forward',
c) avoid separating acronyms such as EPA, DOJ, etc. (which are already in full caps).
My regex code comes pretty close, but it's leaving 'A' at the beginning or end of words:
f = "TheCuriousIncidentOfAManInAWhiteHouseAt1600PennsylvaniaAveAndTheEPA"
re.sub( r"([A-Z][a-z]|[A-Z][A-Z]|\d+)", r" \1", f).split()
output:
['The', 'Curious', 'Incident', 'Of', 'AMan','In', 'AWhite','House', 'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
The problem is output like 'AMan', 'AWhite', etc.
It should be:
['The', 'Curious', 'Incident', 'Of', 'A', 'Man', 'In', 'A', 'White', 'House', 'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
Thank you
Welcome to Stack Overflow Greg. Good start on your regex.
I'd try something like this:
([A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+)
Broken down, for explanation:
([A-Z]{2,}(?![a-z]) // 2 or more capital letters, not followed by a lowercase letter
| // OR
[a-zA-Z][a-z]* // Any letter, followed by any number of lowercase letters
| // OR
[0-9]+) // One or more digits
Best used like this:
re.findall(r'([A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+)', s)
Try it online (contains \W* for formatting)
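For reference, a minimal runnable sketch of that pattern applied to the string from the question (the variable names s and tokens are just for this example):

```python
import re

s = "TheCuriousIncidentOfAManInAWhiteHouseAt1600PennsylvaniaAveAndTheEPA"

# acronym (2+ capitals not followed by lowercase) | word | number
tokens = re.findall(r"([A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+)", s)
print(tokens)
# ['The', 'Curious', 'Incident', 'Of', 'A', 'Man', 'In', 'A', 'White',
#  'House', 'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
```

The lookahead is what separates 'A' from 'Man': 'AM' is two capitals, but it is followed by a lowercase 'a', so the acronym alternative fails and the single-letter word alternative takes over.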

Avoid twitter profile names (#Profile) with regex

I am trying to analyze tweets, but I want to skip the profile names, which are prefixed with a # (#Profile_name), using regex.
I've tried:
re.findall(r'(?!#[\w+]*)(\w+)', "I want to take everything but #this, but I cannot find a way")
and it gives me:
>>> ['I', 'want', 'to', 'take', 'everything', 'but', 'this', 'but', 'I', 'cannot', 'find', 'a', 'way']
I don't want the "this" :/
I'm quite new to regex, and I really cannot solve this one.
Thanks!
Try re.sub
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
>>> re.sub(r'(#\w+)', "", "I want to take everything but #this, but I cannot find a way")
'I want to take everything but , but I cannot find a way'
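If the goal is the list of words rather than a cleaned string, one alternative (a sketch, not part of the answer above) is to stay with re.findall and use a negative lookbehind, so that a word only starts where the previous character is neither # nor a word character:

```python
import re

tweet = "I want to take everything but #this, but I cannot find a way"

# (?<![#\w]) rejects a start right after '#', and also rejects starting
# in the middle of a word (which would otherwise yield 'his').
words = re.findall(r"(?<![#\w])\w+", tweet)
print(words)
# ['I', 'want', 'to', 'take', 'everything', 'but', 'but', 'I', 'cannot',
#  'find', 'a', 'way']
```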

Why does this regular expression not capture arithmetic operators?

I'm trying to capture tokens from a pseudo-programming-language script, but the +-*/, etc are not captured.
I tried this:
[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!
For example, I have this code:
for i = 1 to 10
test_123 = 3.55 + i- -10 * .5
next
msg "this is a ""string"" with quotes in it..."
In this code the regular expression has to pick out:
valid variable names,
strings enclosed in quotes,
operators like (),+-*/!
numbers like 0.1 123 .5 10.
The result of the regular expression has to be:
'for',
'i',
'=',
'1',
'to',
'10',
'test_123',
'=',
'3.55',
'+'
etc....
The problem is that the operators are not selected when I use this regular expression.
Your exact requirements aren't clear, but your regex seems to capture only a few kinds of token.
Try something like this, grouping the tokens you want to capture:
'([a-z_]+)|([\.\d]+)|([\+\-\*\/])|(\=)|([\(\)\[\]\{\}])|(['":,;])'
EDIT: With the new information you wrote in your question, I adjusted the regex to this new one, and tried it with python. I don't know vbscript.
import re
test_string = r'''for i = 1 to 10:
test_123 = 3.55 + i- -10 * .5
next
msg "this is a 'string' with quotes in it..."'''
pattern = r'''([\da-z_^\.]+|[\.\d]+|[\+\-\*\/]|\=|[\(\)\[\]\{\}]|[:,;]|".*[^"]"|'.*[^']')'''
print(re.findall(pattern, test_string, re.MULTILINE))
And this is the list with the matches:
['for', 'i', '=', '1', 'to', '10', ':', 'test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5', 'next', 'msg', '"this is a \'string\' with quotes in it..."']
I think it captures all you need.
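A probable cause of the original symptom, which can be checked directly: in the question's pattern the number alternative \d*\.?\d* is able to match the empty string, and it is listed before the operator alternatives. At a '+' the regex engine therefore succeeds with an empty match and never tries \+. The sketch below compares the question's pattern with a variant whose number alternative needs at least one digit; the names original and adjusted are made up here.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(
    r'[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!')
# Same pattern, except that a number must contain at least one digit.
adjusted = re.compile(
    r'[a-z_]\w*|"([^"\r\n]+|"")*"|\d+\.?\d*|\.\d+|\+|\*|\/|\(|\)|&|-|=|,|!')

empty_hit = original.match("+ 1").group(0)
operator_hit = adjusted.match("+ 1").group(0)
print(repr(empty_hit))     # ''
print(repr(operator_hit))  # '+'
```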
This fits my needs, I guess:
"([^"]+|"")*"|[\-+*/&|!()=,]|[a-z_]\w*|(\d*\.)?\d*
But only whitespace should be left over, so I need a way to also capture everything else that is not whitespace and is not matched by any of the other alternatives in my regular expression.
Characters like "$%µ°" are ignored even when I append "|." to my regular expression :(
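A likely explanation is that (\d*\.)?\d* can match the empty string, so at a character such as $ the regex succeeds with an empty match before a trailing |. is ever tried. The following is a sketch, not a definitive tokenizer, that assumes numbers must contain at least one digit and that any other non-whitespace character should become a one-character token. The names alternatives, token_re and tokenize are made up for this example.

```python
import re

alternatives = [
    r'"(?:[^"\r\n]|"")*"',  # quoted string, "" is an escaped quote
    r'[a-z_]\w*',           # identifier
    r'\d+\.?\d*|\.\d+',     # number: 123, 0.1, 10., .5 (never empty)
    r'[-+*/&|!()=,]',       # operators and punctuation from the question
    r'\S',                  # catch-all: any other non-whitespace character
]
token_re = re.compile("|".join(alternatives))

def tokenize(code):
    # No capturing groups are used, so findall returns the full matches.
    return token_re.findall(code)

print(tokenize("test_123 = 3.55 + i- -10 * .5"))
# ['test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5']
print(tokenize('x = .5 $%µ° "a ""b"" c"'))
# ['x', '=', '.5', '$', '%', 'µ', '°', '"a ""b"" c"']
```

Because no alternative can match the empty string, the catch-all \S is actually reached, and only whitespace is left unmatched.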