Regex expression to separate collapsed title - regex

First time post. I have a text where lots of text in title case is collapsed without spaces. I'm trying to:
a) keep the full text (not lose any words),
b) use logic to separate 'A' as in 'A Way Forward',
c) avoid separating acronyms such as EPA, DOJ, etc. (which are already in full caps).
My regex code comes pretty close, but it's leaving 'A' at the beginning or end of words:
f = "TheCuriousIncidentOfAManInAWhiteHouseAt1600PennsylvaniaAveAndTheEPA"
re.sub( r"([A-Z][a-z]|[A-Z][A-Z]|\d+)", r" \1", f).split()
output:
['The', 'Curious', 'Incident', 'Of', 'AMan','In', 'AWhite','House', 'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
The problem is output like 'AMan', 'AWhite', etc.
It should be:
['The', 'Curious', 'Incident', 'Of', 'A', 'Man', 'In', 'A', 'White', 'House', 'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
Thank you

Welcome to Stack Overflow Greg. Good start on your regex.
I'd try something like this:
([A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+)
Broken down, for explanation:
([A-Z]{2,}(?![a-z]) // 2 or more capital letters, not followed by a lowercase letter
| // OR
[a-zA-Z][a-z]* // Any letter, followed by any number of lowercase letters
| // OR
[0-9]+) // One or more digits
Best used like this:
re.findall(r'([A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+)', s)
Try it online (contains \W* for formatting)
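As a quick check, here is a minimal sketch that runs the pattern above on the string from the question. The names TITLE_TOKEN and split_title are illustrative only, and the outer capturing group is dropped because re.findall returns whole matches when the pattern has no groups.

```python
import re

# Pattern from the answer above, without the outer capturing group.
# TITLE_TOKEN and split_title are illustrative names, not from the original post.
TITLE_TOKEN = re.compile(r'[A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+')

def split_title(s):
    """Return the acronyms, words and numbers found in a collapsed title."""
    return TITLE_TOKEN.findall(s)

f = "TheCuriousIncidentOfAManInAWhiteHouseAt1600PennsylvaniaAveAndTheEPA"
print(split_title(f))
# ['The', 'Curious', 'Incident', 'Of', 'A', 'Man', 'In', 'A', 'White', 'House',
#  'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
```

The negative lookahead is what lets 'AMan' fall apart: 'AM' is rejected as an acronym because it is followed by a lowercase 'a', so the second alternative matches 'A' on its own.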

Related

Letters-only regex is including numerals

I wrote a regex in a function meant to separate capital words by dashes and remove numbers:
def kebabize(string):
    import re
    split_lower_string = [item.lower() for item in re.findall('[a-zA-Z][^A-Z]*', string)]
    return '-'.join(split_lower_string)
Based on my (limited) understanding of regex, the function call on myCamelHas3Humps should return my-camel-has-humps, based on the [a-zA-Z][^A-Z]* expression.
Can someone please walk me through the faulty logic of this regex?
As @Mathias R. Jessen mentioned in the comments, the problem with your regex is that [^A-Z]* means anything other than A-Z, including digits. Your problem can be solved in two steps.
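To see the fault concretely, this small sketch prints what the original pattern actually captures. The variable name tokens is just for illustration.

```python
import re

# The original pattern: one letter, then anything that is not an uppercase letter.
# Digits are "not uppercase letters", so the 3 stays attached to "Has".
tokens = re.findall(r'[a-zA-Z][^A-Z]*', "myCamelHas3Humps")
print(tokens)                                # ['my', 'Camel', 'Has3', 'Humps']
print('-'.join(t.lower() for t in tokens))   # my-camel-has3-humps
```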
Logic
First, we remove any digits from the input string. The regex to find digits is quite simple: \d. All the digits need to be replaced with an empty string.
re.sub("\\d", "", input_string)
Now, we find all the positions where we have a lowercase alphabet on the left and uppercase alphabet on the right. Using positive lookbehind and lookahead we get: (?<=[a-z])(?=[A-Z]). We need to replace all such positions with -.
re.sub("(?<=[a-z])(?=[A-Z])", "-", new_string)
Code
import re
input_string = "myCamelHas3Humps"
new_string = re.sub("\\d", "", input_string)
string_with_dash = re.sub("(?<=[a-z])(?=[A-Z])", "-", new_string).lower()
print(string_with_dash)
There is an extra lower() method since you need the final answer in lowercase.
Output
my-camel-has-humps
[a-zA-Z][^A-Z]* will match a&#% and A$.+.
If you mean to capture only lowercase letters, you should write [a-zA-Z][a-z]*.
You can exclude matching digits as well using [^A-Z\d]* if you want to "remove numbers"
import re
def kebabize(string):
    # Raw string so that \d reaches the regex engine unchanged
    split_lower_string = [item.lower() for item in re.findall(r'[a-zA-Z][^A-Z\d]*', string)]
    return '-'.join(split_lower_string)
print(kebabize("myCamelHas3Humps"))
Output
my-camel-has-humps

Splitting/Tokenizing a sentence into string words with special conditions

I am trying to implement a tokenizer to split string of words.
The special conditions I have are: split punctuation . , ! ? into a separate string
and split any characters that have a space between them, i.e. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.....
I don't plan on using the nltk package, and I have looked at re.split and re.findall, yet for both cases:
re.split = I don't know how to split out words with punctuation next to them, such as 'Dog,'
re.findall = Sure, it prints out all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.
Are you trying to split on a delimiter(punctuation) while keeping it in the final results? One way of doing that would be this:
import re
import string
sent = "I have a dog!'-4#"
punc_Str = str(string.punctuation)
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get.
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]
Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (ignore case),
[.!?] - any (single) char from your list (note that between brackets
neither a dot nor a '?' need to be quoted),
(?:(?![.!?])\S)+ - a non-empty sequence of non-white characters,
other than in your list.
E.g. for text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
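For reference, a minimal runnable sketch of the findall approach above. The names TOKEN and tokenize are illustrative, and the pattern is taken unchanged from the answer.

```python
import re

# Letters | one of . ! ? | any run of other non-whitespace characters.
# TOKEN and tokenize are illustrative names, not from the original post.
TOKEN = re.compile(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', re.I)

def tokenize(txt):
    """Split text into words, sentence punctuation and leftover symbol runs."""
    return TOKEN.findall(txt)

print(tokenize("I have a dog!'-4#?."))
# ['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
```

Because the last alternative accepts any non-whitespace run, nothing in the input is silently dropped; whitespace is the only thing left unmatched.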

How to extract - capital letters and digits in between two words, in a string?

I'm extracting information from an image of an invoice using PyTesseract, and I need to tag the relevant fields to their values.
I've tried using regex to extract the content, but regex is new to me. I've been able to extract words made of capital letters, but not a combination of letters and digits between particular words:
re.findall(r'[A-Z]+', string)
Example Sentence - Hello. I AM IRONMAN even though I would've preferred TO BE BATMAN. 123457678. Superhero FANTASY.
Expected Result - I AM IRONMAN I TO BE BATMAN. 123457678.
You can combine a split on the delimiter of interest, 123457678, and then apply a regex:
import re
string = "Hello. I AM IRONMAN even though I would've preferred TO BE BATMAN. 123457678. Superhero FANTASY"
re.findall(r'[A-Z0-9]+\b\.?', ''.join(re.split(r'(?<=123457678\.).', string)[0]))
# ['I', 'AM', 'IRONMAN', 'I', 'TO', 'BE', 'BATMAN.', '123457678.']

Splitting Two Characters In a String - Perl

I'm trying to split this string. Here's the code:
my $string = "585|487|314|1|1,651|365|302|1|1,585|487|314|1|1,651|365|302|1|1,656|432|289|1|1,136|206|327|1|1,585|487|314|1|1,651|365|302|1|1,585|487|314|1|1,651|365|302|1|1%656|432|289|1|1%136|206|327|1|1%654|404|411|1|1";
my @ids = split(",", $string);
What I want is to split only on % and , in the string. I was told that I could use a pattern, something like this? /[^a-zA-Z0-9_]/
Character classes can be used to represent a group of possible single characters that can match. And the ^ symbol at the beginning of a character class negates the class, saying "Anything matches except for ...." In the context of split, whatever matches is considered the delimiter.
That being the case, [^a-zA-Z0-9_] would match any character except for the ASCII letters 'a' through 'z', 'A' through 'Z', and the numeric digits '0' through '9', plus underscore. In your case, while this would correctly split on "," and "%" (since they're not included in a-z, A-Z, 0-9, or _), it would mistakenly also split on "|", as well as any other character not included in the character class you attempted.
In your case it makes a lot more sense to be specific as to what delimiters to use, and to not use a negated class; you want to specify the exact delimiters rather than the entire set of characters that delimiters cannot be. So as mpapec stated in his comment, a better choice would be [%,].
So your solution would look like this:
my @ids = split /[%,]/, $string;
Once you split on '%' and ',', you'll be left with a bunch of substrings that look like this: 585|487|314|1|1 (or some variation on those numbers). In each case, it's five positive integers separated by '|' characters. It seems possible to me that you'll end up wanting to break those down as well by splitting on '|'.
You could build a single data structure represented by list of lists, where each top level element represents a [,%] delimited field, and consists of a reference to an anonymous array consisting of the pipe-delimited fields. The following code will build that structure:
my @ids = map { [ split /\|/, $_ ] } split /[%,]/, $string;
When that is run, you will end up with something like this:
@ids = (
[ '585', '487', '314', '1', '1' ],
[ '651', '365', '302', '1', '1' ],
# ...
);
Now each field within an ID can be inspected and manipulated individually.
To understand more about how character classes work, you could check perlrequick, which has a nice introduction to character classes. And for more information on split, there's always perldoc -f split (as mentioned by mpapec). split is also discussed in chapter nine of the O'Reilly book, Learning Perl, 6th Edition.
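For readers who want to experiment outside Perl, here is a rough Python sketch of the same contrast between a negated class and an explicit delimiter class. It uses a shortened version of the string from the question; the names records and ids are illustrative, and this is a translation of the idea rather than the answer's own code.

```python
import re

# Shortened sample of the string from the question.
s = "585|487|314|1|1,651|365|302|1|1%656|432|289|1|1"

# Negated class: every non-word character is a delimiter, so "|" splits too.
too_many = re.split(r'[^a-zA-Z0-9_]', s)
print(len(too_many))   # 15 single numbers, record boundaries are lost

# Explicit class: only "%" and "," separate records.
records = re.split(r'[%,]', s)
print(records)         # ['585|487|314|1|1', '651|365|302|1|1', '656|432|289|1|1']

# List of lists: split each record on the pipe afterwards.
ids = [rec.split('|') for rec in records]
print(ids[0])          # ['585', '487', '314', '1', '1']
```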

negative look ahead on whole number but preceded by a character(perl)

I have text like this;
2500.00 $120.00 4500 12.00 $23.00 50.0989
I've written a regex:
/(?!$)\d+\.\d{2}/g
I want it to only match 2500.00, 12.00 nothing else.
The requirement is to add a '$' sign to numeric values that have exactly two digits after the decimal point. With the current regex, it adds an extra '$' to the values that already have one. The real case is longer, but I'm describing it briefly. I know I could use one regex to remove every '$' and then another to add '$' to all the desired numbers.
Any help would be appreciated, thanks!
To answer your question, you need to look behind the position where the first digit is.
(?<!\$)
But that alone is not going to work, as it will match 23.45 of $123.45 and change it into $1$23.45, and it will match 123.45 of 123.456 and change it into $123.456. You want to make sure there are no digits before or after what you match.
s/(?<![\$\d])(\d+\.\d{2})(?!\d)/\$$1/g;
Or the quicker
s/(?<![\$\d])(?=\d+\.\d{2}(?!\d))/\$/g;
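The difference between the bare lookbehind and the stricter pattern can be reproduced in a short sketch. It is written in Python rather than Perl purely for experimentation (the lookaround syntax used here is the same in both), and the variable names naive and strict are illustrative.

```python
import re

text = "2500.00 $120.00 4500 12.00 $23.00 50.0989"

# Lookbehind for "$" only: the match can still start in the middle of a number.
naive = re.sub(r'(?<!\$)(\d+\.\d{2})', r'$\1', text)
print(naive)
# $2500.00 $1$20.00 4500 $12.00 $2$3.00 $50.0989

# Also forbid a digit before the match and a digit after it.
strict = re.sub(r'(?<![$\d])(\d+\.\d{2})(?!\d)', r'$\1', text)
print(strict)
# $2500.00 $120.00 4500 $12.00 $23.00 50.0989
```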
This is tricky only because you are trying to include too many functionalities in your single regex. If you manipulate the string first to isolate each number, this becomes trivial, as this one-liner demonstrates:
$ perl -F"(\s+)" -lane's/^(?=\d+\.\d{2}$)/\$/ for @F; print @F;'
2500.00 $120.00 4500 12.00 $23.00 50.0989
$2500.00 $120.00 4500 $12.00 $23.00 50.0989
The full code for this would be something like:
while (<>) { # or whatever file handle or input you read from
my @line = split /(\s+)/;
s/^(?=\d+\.\d{2}$)/\$/ for @line;
print @line; # or select your desired means of output
# my $out = join "", @line; # as string
}
Note that this split is non-destructive because we use parentheses to capture our delimiters. So for our sample input, the resulting list looks like this when printed with Data::Dumper:
$VAR1 = [
'2500.00',
' ',
'$120.00',
' ',
'4500',
' ',
'12.00',
' ',
'$23.00',
' ',
'50.0989'
];
Our regex here is simply anchored in both ends, and allowed to contain numbers, followed by a period . and two numbers, and nothing else. Because we use a look-ahead assertion, it will insert the dollar sign at the beginning, and keep everything else. Because of the strictness of our regex, we do not need to worry about checking for any other characters, and because we split on whitespace, we do not need to check for any such.
You can use this pattern:
s/(?<!\S)\d+\.\d{2}(?!\S)/\$${^MATCH}/gp
or
s/(?<!\S)(?=\d+\.\d{2}(?!\S))/\$/g
I think it is the shorter way.
(?<!\S) not preceded by a non-whitespace character
(?!\S) not followed by a non-whitespace character
The main benefit of these double negations is that they automatically cover the beginning and the end of the string.