Splitting/Tokenizing a sentence into string words with special conditions

I am trying to implement a tokenizer that splits a string into words.
My special conditions are: split the punctuation marks . , ! ? into separate strings,
and otherwise split on spaces, keeping any other run of characters together, e.g. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.
I don't plan on using the nltk package. I have looked at re.split and re.findall, but for both:
re.split = I don't know how to split off punctuation that sits next to a word, such as 'Dog,'
re.findall = Sure, it returns all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.

Are you trying to split on a delimiter (punctuation) while keeping it in the final result? One way of doing that is to put the delimiters in a capturing group:
import re
sent = "I have a dog!'-4#"
# The capturing group makes re.split keep each delimiter in the output.
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get:
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]
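If the single-space tokens in that result are unwanted, one possible variation (a sketch, not part of the answer above) is to match whitespace outside the capturing group and drop the empty pieces. The function name split_keep_punct is made up for illustration, and the punctuation set is narrowed to the four marks listed in the question.

```python
import re

def split_keep_punct(sentence):
    # Punctuation is captured (and so kept); whitespace is matched but not captured.
    parts = re.split(r"([.,!?])|\s+", sentence)
    # re.split yields None when the group did not take part in a match and ''
    # between adjacent delimiters; filter both out.
    return [p for p in parts if p]

print(split_keep_punct("I have a dog!'-4#"))
# ['I', 'have', 'a', 'dog', '!', "'-4#"]
```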

Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (case is ignored because of re.I),
[.!?] - any single character from your list (note that inside brackets neither a dot nor a '?' needs to be escaped),
(?:(?![.!?])\S)+ - a non-empty sequence of non-whitespace characters other than those in your list.
E.g. for text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
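The same three alternatives can be wrapped in a small helper so that the punctuation list is easy to change. This is only a sketch: the function name tokenize and its punct parameter are assumptions for illustration, and the comma is added to the default list because the question mentions it.

```python
import re

def tokenize(text, punct=".,!?"):
    # Character class built from the chosen punctuation marks.
    cls = "[" + re.escape(punct) + "]"
    # letters | one punctuation mark | run of other non-whitespace characters
    pattern = rf"[a-z]+|{cls}|(?:(?!{cls})\S)+"
    return re.findall(pattern, text, re.I)

print(tokenize("I have a dog!'-4#?."))
# ['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
```

With the comma in the list, consecutive commas come out as separate tokens instead of one run.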

Related

Regex expression to separate collapsed title

First-time poster. I have a text in which lots of title-case text is collapsed without spaces. I'm trying to:
a) keep the full text (not lose any words),
b) use logic to separate 'A', as in 'A Way Forward',
c) avoid separating acronyms such as EPA, DOJ, etc. (which are already in full caps).
My regex comes pretty close, but it leaves 'A' attached to the beginning or end of words:
f = "TheCuriousIncidentOfAManInAWhiteHouseAt1600PennsylvaniaAveAndTheEPA"
re.sub( r"([A-Z][a-z]|[A-Z][A-Z]|\d+)", r" \1", f).split()
output:
['The', 'Curious', 'Incident', 'Of', 'AMan','In', 'AWhite','House', 'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
The problem is output like 'AMan', 'AWhite', etc.
It should be:
['The', 'Curious', 'Incident', 'Of', 'A', 'Man', 'In', 'A', 'White', 'House', 'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
Thank you

Welcome to Stack Overflow Greg. Good start on your regex.
I'd try something like this:
([A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+)
Broken down, for explanation:
([A-Z]{2,}(?![a-z]) // 2 or more capital letters, not followed by a lowercase letter
| // OR
[a-zA-Z][a-z]* // Any letter, followed by any number of lowercase letters
| // OR
[0-9]+) // One or more digits
Best used like this:
re.findall(r'([A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+)', s)
Try it online (contains \W* for formatting)
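As a quick check, the pattern can be run against the title from the question. The wrapper name split_title is hypothetical; the outer capturing group is dropped here because findall returns whole matches when the pattern has no groups.

```python
import re

# acronym (2+ capitals not followed by lowercase) | word | number
TITLE_TOKEN = re.compile(r"[A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+")

def split_title(s):
    return TITLE_TOKEN.findall(s)

f = "TheCuriousIncidentOfAManInAWhiteHouseAt1600PennsylvaniaAveAndTheEPA"
print(split_title(f))
# ['The', 'Curious', 'Incident', 'Of', 'A', 'Man', 'In', 'A', 'White', 'House',
#  'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
```

An acronym in the middle of the text also survives, because the engine backtracks one capital when the lookahead fails: split_title("TheEPAReport") gives ['The', 'EPA', 'Report'].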

Keeping special marks when splitting text into tokens using regex

I have this text 'I love this but I have a! question to?' and currently using
token_pattern = re.compile(r"(?u)\b\w+\b")
token_pattern.findall(text)
When using this regex I'm getting
['I','love', 'this', 'but', 'I', 'have', 'a', 'question', 'to']
I didn't write this regex and I know very little about regular expressions (I tried to understand them from examples but gave up). I now need to change it so that it keeps question marks and exclamation marks and returns each one as its own token, so that it returns this list:
['I','love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']
Any suggestions on how I can do that?

Try this:
token_pattern = re.compile(r"(?u)[^\w ]|\b\w+\b")
token_pattern.findall(text)
It also matches every character that is neither a word character nor a space as its own single-character token.
If you really only need question and exclamation marks, you can change the regex to
token_pattern = re.compile(r"(?u)[!?]|\b\w+\b")
token_pattern.findall(text)
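A short side-by-side run of the two patterns, using the sentence from the question plus one made-up sentence to show the difference. The variable names are illustrative, and the (?u) flag is left out on the assumption of Python 3, where str patterns are Unicode-aware by default.

```python
import re

only_marks = re.compile(r"[!?]|\b\w+\b")     # keeps just ! and ?
any_symbol = re.compile(r"[^\w ]|\b\w+\b")   # keeps every non-word, non-space character

text = "I love this but I have a! question to?"
print(only_marks.findall(text))
# ['I', 'love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']

print(any_symbol.findall("wait... what?"))
# ['wait', '.', '.', '.', 'what', '?']
print(only_marks.findall("wait... what?"))
# ['wait', 'what', '?']
```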

Regex: how to separate strings by apostrophes in certain cases only

I am looking to capitalize the first letter of words in a string. I've managed to put together something by reading examples on here. However, I'm trying to get any names that start with O' to separate into 2 strings so that each gets capitalized. I have this so far:
\b([^\W_\d](?!')[^\s-]*) *
which omits selecting the X' from any string X'XYZ. That works for capitalizing the part after the ', but doesn't capitalize the X'. Furthermore, i'm becomes i'M since it's not specific to O'. To state the goal:
o'malley should go to O'Malley
o'malley's should go to O'Malley's
don't should go to Don't
i'll should go to I'll
(as an aside, I want to omit any strings that start with numbers, like 23F, that seems to work with what I have)
How can I make it specific to the strings that start with O'? Thanks.

If you use the following pattern:
([oO])'([\w']+)|([\w']+)
then re.findall returns one tuple of groups per word. With match standing for the tuple of a single word, you can access the parts like this:
match[0] == 'o' and match[1] == 'name'  # if the word is "o'name"
match[2] == 'word'  # if the word is "word"
Only one of the two alternatives matches a given word, so the groups of the other one are empty, i.e. if the word is "word" then
match[0] == match[1] == ""
since there is no o' prefix.
Test Example:
>>> import re
>>> string = "o'malley don't i'm hello world"
>>> match = re.findall(r"([oO])'([\w']+)|([\w']+)",string)
>>> match
[('o', 'malley', ''), ('', '', "don't"), ('', '', "i'm"), ('', '', 'hello'), ('', '', 'world')]
NOTE: This is for python. This MIGHT not work for all engines.
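To get from those groups to the capitalized text the question asks for, one option is re.sub with a replacement function. This is a sketch built on the pattern above; the names capitalize_words and _cap_first are invented for illustration.

```python
import re

WORD = re.compile(r"([oO])'([\w']+)|([\w']+)")

def _cap_first(s):
    # Upper-case only the first character and leave the rest untouched.
    return s[:1].upper() + s[1:]

def capitalize_words(text):
    def fix(m):
        if m.group(1):  # the word has an o' / O' prefix
            return m.group(1).upper() + "'" + _cap_first(m.group(2))
        return _cap_first(m.group(3))
    return WORD.sub(fix, text)

print(capitalize_words("o'malley o'malley's don't i'll 23F"))
# O'Malley O'Malley's Don't I'll 23F
```

Words that start with a digit, such as 23F, pass through unchanged because upper-casing a digit has no effect.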

Why does this regular expression not capture arithmetic operators?

I'm trying to capture tokens from a script in a pseudo-programming language, but operators such as + - * / are not captured.
I tried this:
[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!
For example, I have this code:
for i = 1 to 10
test_123 = 3.55 + i- -10 * .5
next
msg "this is a ""string"" with quotes in it..."
In this code, the regular expression has to match:
valid variable names,
strings enclosed in quotes,
operators like (),+-*/!
numbers like 0.1 123 .5 10.
The result of the regular expression has to be:
'for',
'i',
'=',
'1',
'to',
'10',
'test_123',
'=',
'3.55',
'+'
etc....
The problem is that the operators are not matched when I use this regular expression.

We don't know your exact requirements, but it seems that your regex captures only a few kinds of characters besides excluding \n, \r, etc.
Try something like this, grouping the tokens you want to capture:
'([a-z_]+)|([\.\d]+)|([\+\-\*\/])|(\=)|([\(\)\[\]\{\}])|(['":,;])'
EDIT: With the new information you added to your question, I adjusted the regex and tried it with Python. I don't know VBScript.
import re
test_string = r'''for i = 1 to 10:
test_123 = 3.55 + i- -10 * .5
next
msg "this is a 'string' with quotes in it..."'''
# One capturing group around everything, so findall returns the matched token itself.
pattern = r'''([\da-z_^\.]+|[\.\d]+|[\+\-\*\/]|\=|[\(\)\[\]\{\}]|[:,;]|".*[^"]"|'.*[^']')'''
print(re.findall(pattern, test_string, re.MULTILINE))
And this is the list with the matches:
['for', 'i', '=', '1', 'to', '10', ':', 'test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5', 'next', 'msg', '"this is a \'string\' with quotes in it..."']
I think it captures all you need.

This fits my needs, I guess:
"([^"]+|"")*"|[\-+*/&|!()=,]|[a-z_]\w*|(\d*\.)?\d*
But only white space should be left over, so I still have to find a way to capture everything else that is not white space and is not matched by any of the other alternatives in my regular expression.
Characters like "$%µ°" are ignored even when I put "|." at the end of my regular expression :(
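One likely cause of the leftover characters is that the number alternative can match the empty string: in engines that take the first alternative that succeeds, an empty match there pre-empts anything listed after it, including a trailing "|.". The following Python sketch (an assumption about what is wanted, not a VBScript solution) requires at least one digit in a number and ends with a \S catch-all. The names TOKEN and tokenize are made up for illustration.

```python
import re

TOKEN = re.compile(
    r'''
      "(?:[^"\r\n]|"")*"    # quoted string; "" inside stands for one quote
    | \d+\.?\d* | \.\d+     # number with at least one digit: 10  10.  3.55  .5
    | [a-z_]\w*             # identifier
    | [-+*/&|!()=,]         # single-character operators
    | \S                    # anything else that is not white space
    ''',
    re.VERBOSE | re.IGNORECASE,
)

def tokenize(code):
    return [m.group(0) for m in TOKEN.finditer(code)]

print(tokenize("test_123 = 3.55 + i- -10 * .5"))
# ['test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5']
print(tokenize('msg "this is a ""string"" with quotes in it..."'))
# ['msg', '"this is a ""string"" with quotes in it..."']
```

Because no alternative can match an empty string, every non-whitespace character ends up in some token.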

Splitting Two Characters In a String - Perl

I'm trying to split this string. Here's the code:
my $string = "585|487|314|1|1,651|365|302|1|1,585|487|314|1|1,651|365|302|1|1,656|432|289|1|1,136|206|327|1|1,585|487|314|1|1,651|365|302|1|1,585|487|314|1|1,651|365|302|1|1%656|432|289|1|1%136|206|327|1|1%654|404|411|1|1";
my @ids = split(",", $string);
What I want is to split only on % and , in the string. I was told that I could use a pattern, something like this: /[^a-zA-Z0-9_]/

Character classes can be used to represent a group of possible single characters that can match. And the ^ symbol at the beginning of a character class negates the class, saying "Anything matches except for ...." In the context of split, whatever matches is considered the delimiter.
That being the case, [^a-zA-Z0-9_] would match any character except for the ASCII letters 'a' through 'z', 'A' through 'Z', and the numeric digits '0' through '9', plus underscore. In your case, while this would correctly split on "," and "%" (since they're not included in a-z, A-Z, 0-9, or _), it would mistakenly also split on "|", as well as on any other character not included in the character class you attempted.
In your case it makes a lot more sense to be specific as to what delimiters to use, and to not use a negated class; you want to specify the exact delimiters rather than the entire set of characters that delimiters cannot be. So as mpapec stated in his comment, a better choice would be [%,].
So your solution would look like this:
my @ids = split /[%,]/, $string;
Once you split on '%' and ',', you'll be left with a bunch of substrings that look like this: 585|487|314|1|1 (or some variation on those numbers). In each case, it's five positive integers separated by '|' characters. It seems possible to me that you'll end up wanting to break those down as well by splitting on '|'.
You could build a single data structure, a list of lists, where each top-level element represents a [,%]-delimited field and is a reference to an anonymous array holding the pipe-delimited fields. The following code will build that structure:
my @ids = map { [ split /\|/, $_ ] } split /[%,]/, $string;
When that is run, you will end up with something like this:
@ids = (
[ '585', '487', '314', '1', '1' ],
[ '651', '365', '302', '1', '1' ],
# ...
);
Now each field within an ID can be inspected and manipulated individually.
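For readers following the Python examples elsewhere on this page, the same two-level split can be sketched in Python as well. This is an illustrative translation rather than part of the Perl answer; the function name parse_ids and the shortened sample string are assumptions.

```python
import re

def parse_ids(s):
    # Outer split on ',' or '%', inner split on '|'.
    return [record.split("|") for record in re.split(r"[%,]", s)]

sample = "585|487|314|1|1,651|365|302|1|1%656|432|289|1|1"
print(parse_ids(sample))
# [['585', '487', '314', '1', '1'], ['651', '365', '302', '1', '1'],
#  ['656', '432', '289', '1', '1']]

# The negated class from the question also splits on '|':
print(re.split(r"[^a-zA-Z0-9_]", "585|487,651"))
# ['585', '487', '651']
```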
To understand more about how character classes work, you could check perlrequick, which has a nice introduction to character classes. And for more information on split, there's always perldoc -f split (as mentioned by mpapec). split is also discussed in chapter nine of the O'Reilly book, Learning Perl, 6th Edition.
