Regex: how to separate strings by apostrophes in certain cases only


I am looking to capitalize the first letter of words in a string. I've managed to put together something by reading examples on here. However, I'm trying to get any names that start with O' to separate into 2 strings so that each gets capitalized. I have this so far:
\b([^\W_\d](?!')[^\s-]*) *
which omits selecting the X' from any string X'XYZ. That works for capitalizing the part after the ', but doesn't capitalize the X'. Furthermore, i'm becomes i'M since the pattern isn't specific to O'. To state the goal:
o'malley should go to O'Malley
o'malley's should go to O'Malley's
don't should go to Don't
i'll should go to I'll
(As an aside, I want to skip any strings that start with numbers, like 23F; that seems to work with what I have.)
How can I make it specific to the strings that start with O'? Thanks.

If you use the following pattern:
([oO])'([\w']+)|([\w']+)
then each match gives you three groups, which you can access like this:
match[0] == 'o' and match[1] == 'name'  # if the word is "o'name"
match[2] == 'word'  # if the word is "word"
Only one of the two alternatives matches a given word, so the groups of the other one are empty, i.e. if the word is "word" then
match[0] == match[1] == ""
since there is no o' prefix.
Test Example:
>>> import re
>>> string = "o'malley don't i'm hello world"
>>> match = re.findall(r"([oO])'([\w']+)|([\w']+)",string)
>>> match
[('o', 'malley', ''), ('', '', "don't"), ('', '', "i'm"), ('', '', 'hello'), ('', '', 'world')]
NOTE: This is for Python. It might not work in all regex engines.
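To go from the matched groups to the capitalized output the question asks for, one option is re.sub with a replacement function. This is a minimal sketch and not part of the answer above: the names WORD, _capitalize and capitalize_words are made up for illustration, and the pattern is tightened (an assumption) so that every word has to start with a letter, which leaves strings such as 23F alone.

```python
import re

# First alternative: an o' prefix followed by a name that starts with a letter.
# Second alternative: any other word that starts with a letter.
# [^\W\d_] is "a word character that is neither a digit nor an underscore".
WORD = re.compile(r"\b([oO])'([^\W\d_][\w']*)|\b([^\W\d_][\w']*)")


def _capitalize(match):
    """Replacement callback: upper-case the O and the first letter after it."""
    if match.group(1):  # the o' alternative matched
        name = match.group(2)
        return "O'" + name[0].upper() + name[1:]
    word = match.group(3)  # ordinary word: only the first letter changes
    return word[0].upper() + word[1:]


def capitalize_words(text):
    """Capitalize each word in text, treating an o' prefix specially."""
    return WORD.sub(_capitalize, text)


if __name__ == "__main__":
    print(capitalize_words("o'malley's friend said i'll go, don't wait"))
```

Words that start with a digit are not matched because there is no word boundary inside 23F and the first character is not a letter, so they pass through unchanged.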

Related

How to compare two columns by ignoring special characters?

I am comparing two columns from different tables to get the matching records. Those tables do not have any unique key other than first and last names. But I don't get the correct output if tableA has Aa'aa and tableB has Aaaa. Could anyone advise how to compare while ignoring the special characters, or suggest an alternative way to get them matched?
SELECT * FROM TableA A where EXISTS
(SELECT '' FROM TableB B
WHERE
TRIM(A.namef) = TRIM(B.namef)
AND TRIM(A.namel) = TRIM(B.namel)
)
-Thanks

You could try a regex approach. Assuming that you want to compare only the alphabetic and numeric characters, you can do:
where
regexp_replace(a.namef, '\W', '', 'g') = regexp_replace(b.namef, '\W', '', 'g')
and regexp_replace(a.namel, '\W', '', 'g') = regexp_replace(b.namel, '\W', '', 'g')
Basically this removes non-word characters from each string before comparing them - with a word character being defined as a letter or a digit, plus the underscore character.

If you only want to remove anything that is not a letter, use:
regexp_replace(a.namef, '[^A-Za-z]', '', 'g') = regexp_replace(b.namef, '[^A-Za-z]', '', 'g')
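To see what the two regexp_replace expressions do to a single value, here is a small illustration sketched in Python rather than SQL. The helper names strip_non_word and letters_only are hypothetical, and the database's idea of a "word character" may differ slightly from Python's, so treat this as an approximation of the behavior.

```python
import re


def strip_non_word(name):
    """Remove everything except letters, digits and underscores (like '\\W')."""
    return re.sub(r"\W", "", name)


def letters_only(name):
    """Remove everything except ASCII letters (like '[^A-Za-z]')."""
    return re.sub(r"[^A-Za-z]", "", name)


if __name__ == "__main__":
    # Both spellings normalize to the same value, so they would compare equal.
    print(strip_non_word("Aa'aa"), strip_non_word("Aaaa"))
    # Digits survive the first helper but not the second.
    print(strip_non_word("O'Neil-2"), letters_only("O'Neil-2"))
```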

Splitting/Tokenizing a sentence into string words with special conditions

I am trying to implement a tokenizer to split a string into words.
The special conditions I have are: split the punctuation marks . , ! ? into separate strings,
and split any characters that have a space between them, i.e. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.
I don't plan on using the nltk package, and I have looked at re.split and re.findall, yet for both cases:
re.split = I don't know how to split out words with punctuation next to them, such as 'Dog,'
re.findall = Sure, it prints out all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.

Are you trying to split on a delimiter (punctuation) while keeping it in the final results? One way of doing that is to put the delimiters in a capturing group, so that re.split keeps them:
import re
sent = "I have a dog!'-4#"
# the capturing group makes re.split return the delimiters as well
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get.
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]

Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (ignore case),
[.!?] - any (single) char from your list (note that between brackets
neither a dot nor a '?' need to be quoted),
(?:(?![.!?])\S)+ - a non-empty sequence of non-whitespace characters,
other than those in your list.
E.g. for text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
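For reference, here is the same findall call as a self-contained sketch. The function name tokenize and the constant TOKEN are made up for illustration; the pattern is the one from the answer, unchanged.

```python
import re

# Three alternatives, tried in order at each position:
#   [a-z]+             a run of letters (case-insensitive via re.I)
#   [.!?]              a single punctuation mark from the list
#   (?:(?![.!?])\S)+   any other run of non-whitespace characters
TOKEN = re.compile(r"[a-z]+|[.!?]|(?:(?![.!?])\S)+", re.I)


def tokenize(text):
    """Return the tokens of text; whitespace is skipped, nothing else is lost."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    print(tokenize("I have a dog!'-4#?."))
```

Because the last alternative accepts any non-whitespace character, every character other than whitespace ends up in some token, which addresses the worry about unmatched text with re.findall.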

Replacing two characters with two different characters using regex in python

I want to replace two different characters with two other different characters using regex in Python in one operation. For example: the word is "a/c stuff" and I want to transform it into "ac_stuff" using a single re.sub() line.
I searched here, but only found ways to solve this using the replace function; I am looking to do this using regex in one line.
Thank you for the help!

Technically possible, but not pretty, to do this in one line using re.sub:
re.sub("[/ ]", (lambda match: '' if match.group(0) == '/' else '_'), "a/c stuff")
A much nicer (and faster) way is to use str.translate:
"a/c stuff".translate(str.maketrans({'/': None, ' ': '_'}))
or
"a/c stuff".translate(str.maketrans(' ', '_', '/'))
Probably the most readable way is through str.replace, though this doesn't scale well to many replacements:
"a/c stuff".replace('/', '').replace(' ', '_')
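If the re.sub route is still wanted and the list of replacements may grow, the lambda can look each match up in a dictionary instead of hard-coding the cases. This is a sketch under that assumption; REPLACEMENTS, PATTERN and replace_all are illustrative names, not anything from the answer above.

```python
import re

# Each key is replaced by its value; an empty string deletes the character.
REPLACEMENTS = {"/": "", " ": "_"}

# Build one alternation from the keys, escaping any regex metacharacters.
PATTERN = re.compile("|".join(re.escape(key) for key in REPLACEMENTS))


def replace_all(text):
    """Apply every replacement in a single pass over text."""
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)


if __name__ == "__main__":
    print(replace_all("a/c stuff"))
```

Adding another replacement then only means adding another dictionary entry.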

Why is a space required in Regex when part of match is optional?

I have been looking for noun phrases (noun, plus optional determiner, plus multiple optional adjectives). I wrote this long and terrible bit:
import argparse, re, nltk

def get_words(tagged_sentences):
    words = re.findall(r'\w*\.*\,*/', tagged_sentences)
    clean_word = []
    for word in words:
        word = word[:-1]
        clean_word.append(word)
    # return clean_word
    return ' '.join(clean_word)

# The def line for this part is missing from the post; the name get_noun_phrases is assumed.
def get_noun_phrases(tagged_sentences):
    noun_phrase = re.findall(r'(\w*/DT\s\w*/JJ\s\w*/NN)|(\w*/DT\s\w*/JJ\s\w*/NN)|(\w*/DT\s\w*/JJ\s\w*/NNP)|(\w*/DT\s\w*/JJ\s\w*/NNPS)|(\w*/JJ\s\w*/NNS)|(\w*/JJ\s\w*/NN)|(\w*/JJ\s\w*/NNP)|(\w*/JJ\s\w*/NNPS)|(\w*/DT\s\w*/NNS)|(\w*/DT\s\w*/NN)|(\w*/DT\s\w*/NNP)|(\w*/DT\s\w*/NNPS)|(\w*/NNS)|(\w*/NN)|(\w*/NNP)|(\w*/NNPS)', tagged_sentences)
    phrases = []
    for word in noun_phrase:
        phrase = get_words(str(word))
        phrases.append(phrase)
    return phrases
At first, I was trying to use .* after the NN or the JJ, but that didn't work. What was I doing wrong? I did something like (\w*/DT\s\w*/JJ.* \s\w*/NN.*) to account for all the different ways words could be tagged as (Adjectives can be JJ,JJR,JJS while Nouns can be NN,NNS,NNP,NNPS)
pos_sent = 'All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.'
Then I saw this:
noun_phrase = re.findall(r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)', tagged_sentences)
I liked it because it is better in every way than what I first did. BUT I don't understand why the spaces are required after 'DT', 'JJ', and the first 'NN' (but cannot be there after the second 'NN'). I am not even sure why the two NN 'finds' cannot be combined into one.
I would also prefer to use \w rather than \S, because the tokens should be real letters, not just non-whitespace. Anyway, help understanding WHY would be very much appreciated.

Ok, here is your example text:
All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT
animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.
And this is your regular expression you want to understand:
r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)'
Let's take it one group at a time:
(\S+\/DT ) matches 'All/DT ' and 'some/DT '
(\S+\/JJ ) matches 'good/JJ ' and 'equal/JJ '
(\S+\/NN ) matches nothing
(\S+\/NN) matches 'animals/NN' and 'others/NN'
You're using re.findall(), but that doesn't mean find all of these groups; it means consider the entire regex and find all occurrences of the entire pattern. In addition to the groups, it's key to note that, because of the question mark, your first pattern (\S+\/DT ) is optional. Because of the asterisks, your second (\S+\/JJ ) and third (\S+\/NN ) patterns will match zero or more times. Thus, they are also effectively optional and the only thing required is your last pattern (\S+\/NN).
A quick test looks like this
import re
s = 'All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.'
pat = r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)'
res = re.findall(pat, s)
for i, g in enumerate(res):
    print('{}: {}'.format(i, g))
which gives this output:
0: ('All/DT ', 'good/JJ ', '', 'animals/NN')
1: ('some/DT ', '', '', 'animals/NN')
2: ('', '', '', 'others/NN')
If we remove the spaces,
pat2 = r'(\S+\/DT)?(\S+\/JJ)*(\S+\/NN)*(\S+\/NN)'
res2 = re.findall(pat2, s)
for i, g in enumerate(res2):
    print('{}: {}'.format(i, g))
the output will be exactly what you'd expect,
0: ('', '', '', 'animals/NN')
1: ('', '', '', 'animals/NN')
2: ('', '', '', 'others/NN')
i.e., only the required ones match. Your question is why? I think the issue is that you may feel you are issuing a series of patterns to look for, but you are looking for a single pattern that has multiple match groups. In other words, your regular expression requires these patterns to exist in the order you specify. If they aren't there, then sure, it doesn't matter that they are ordered, but if they are there, they have to be ordered exactly as the regex specifies.
So with the spaces, (\S+\/DT )?(\S+\/JJ ) matches 'All/DT good/JJ ' because it literally says: match 1 or more non-whitespace characters plus a forward slash plus DT plus a space, followed by 1 or more non-whitespace characters plus a forward slash plus 'JJ' plus a space. Without the spaces, (\S+\/DT)?(\S+\/JJ) requires either that the entire (\S+\/DT) pattern NOT be there, or that, if it IS there, the next token follows it directly with no space after 'DT'. Since the tokens in the text are separated by spaces and \S+ cannot match a space, that second case never occurs.
I think the key is that you're matching the entire sequence. Without the space, it simply doesn't match the text anymore. If you want these patterns to be considered independently, you will need to use the pipe symbol (|) to indicate OR between your pattern groups.
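One way to see the "entire sequence" point is to look at the whole match, group 0, rather than the individual groups that re.findall returns. The sketch below does that with re.finditer for both patterns from this answer; the helper name whole_matches is made up for illustration.

```python
import re

s = ('All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT '
     'animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.')

with_spaces = r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)'
without_spaces = r'(\S+\/DT)?(\S+\/JJ)*(\S+\/NN)*(\S+\/NN)'


def whole_matches(pattern, text):
    """Return the full text of every match, ignoring the capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]


if __name__ == "__main__":
    # With the spaces, the optional groups can chain across several tokens.
    print(whole_matches(with_spaces, s))
    # Without them, \S+ cannot cross the space between tokens, so only the
    # required last group ever contributes to a match.
    print(whole_matches(without_spaces, s))
```

The first call yields whole phrases such as 'All/DT good/JJ animals/NN', while the second yields only the single noun tokens.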

What you may do is write a linear regex using optional groups, simplify your code so that it only processes valid matches, and use a list comprehension:
import re
def get_words(tagged_sentences):
    clean_word = re.findall(r'(\w+)/', tagged_sentences)
    return ' '.join(clean_word)
tagged_sentences = 'All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.'
pat = r"""\w*(?:/(?:JJ\s\w*|DT\s\w*(?:/JJ\s\w*)?))?/NN(?:[SP]?|PS)"""
noun_phrase = re.findall(pat, tagged_sentences)
phrases = [get_words(str(word)) for word in noun_phrase]
print(phrases)
# => ['All good animals', 'some animals', 'others']
The pattern extraction regex now matches:
\w* - 0+ word chars (letters, digits or _); replace * with + to match one or more
(?:/(?:JJ\s\w*|DT\s\w*(?:/JJ\s\w*)?))? - an optional sequence (due to the ? quantifier) of:
  / - a slash
  (?:JJ\s\w*|DT\s\w*(?:/JJ\s\w*)?) - either of the sequences:
    JJ\s\w* - JJ followed by 1 whitespace char (add + after \s to match one or more) and then 0+ word chars
    | - or
    DT\s\w*(?:/JJ\s\w*)? - DT, then 1 whitespace char, then 0+ word chars, and then an optional sequence of /JJ followed by 1 whitespace char and 0+ word chars
/NN - a literal substring /NN
(?:[SP]?|PS) - S, P, or an empty string (since [SP]? is optional). Note that [SP]? is tried first and always succeeds, so the PS alternative is never used and an NNPS tag is matched only up to NNP. That does not change the output here, because get_words only keeps the words before each slash; write (?:PS|[SP])? if the full tag should be matched.
The regex that gets the word from the tagged token is
re.findall(r'(\w+)/', tagged_sentences)
Here, (\w+)/ matches and captures 1+ word chars, and they are extracted with re.findall while the / is omitted from the result as it is not part of the capturing group.

Regex pattern to match where my code breaks

I have the following values that I want to place into a MySQL db.
The values should always follow this pattern, and I need a regex to check that they do:
('', '', '', '', '')
In some rare executions of my code, however, I get output where one of the apostrophes disappears. It disappears every now and then on the 4th field, as in the line below, where I placed the *
('1', '2576', '1', '*, 'y')
Any ideas to solve this will be welcomed!
This should be able to match one of the cases where the code breaks:
string.replace(/, \',/ig, ', \'\',');
How would I do it if it is like this?
('1', '2576', '1', 'where I have text here and it breaks at the end*, 'y')
I am using JavaScript and ASP.
I think the solution would be something like this
string.replace(/, \'[a-zA-Z0-9],/ig, ', \'\','); but I am not exactly sure how to write it.
This is almost the solution that I am looking for...
string.replace(/[a-zA-Z0-9], \'/ig, '\', \'');
This code, however, replaces the last letter of the text along with the , ' so if the text inside the string is 'approved, ' the result is 'approve', ' with the letter d cut off.
I know there is a way to reference the matched letter so that it isn't removed, but I'm not sure how to do it.

Is this what you're looking for? It matches when all but the last field is missing the '
\('.*?'\)

Your regular expression would be something like this:
^\('.*?',\ '.*?',\ '.*?',\ '.*?',\ '.*?'\)$
You could check whether your string matches in ASP.NET with some code similar to this:
Match m = Regex.Match(inputString, @"^\('.*?',\ '.*?',\ '.*?',\ '.*?',\ '.*?'\)$");
if (!m.Success)
{
    // some fix logic here
}
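The "reference it so the last letter isn't removed" part of the question is what a capture group and a backreference are for. Below is a minimal sketch of the idea, written in Python for illustration even though the thread uses JavaScript and C#; in JavaScript's string.replace the backreference is spelled $1 instead of \1. The names ROW_SHAPE, is_valid and repair_row are hypothetical, and the repair assumes that field text never legitimately contains a letter or digit followed by a comma, a space and an apostrophe.

```python
import re

# Five quoted fields separated by ", " (same shape as the pattern above).
ROW_SHAPE = re.compile(r"\('.*?', '.*?', '.*?', '.*?', '.*?'\)")


def is_valid(row):
    """True if row looks like ('a', 'b', 'c', 'd', 'e')."""
    return ROW_SHAPE.fullmatch(row) is not None


def repair_row(row):
    """Put back a closing apostrophe that went missing before a ', ' separator."""
    # Case 1: an empty field lost its closing quote:  , ',   becomes  , '',
    row = re.sub(r", ',", ", '',", row)
    # Case 2: a text field lost its closing quote. The letter or digit is
    # captured in group 1 and written back with \1, so it is not cut off.
    row = re.sub(r"([A-Za-z0-9]), '", r"\1', '", row)
    return row


if __name__ == "__main__":
    for broken in ("('1', '2576', '1', ', 'y')",
                   "('1', '2576', '1', 'it breaks at the end, 'y')"):
        fixed = repair_row(broken)
        print(is_valid(broken), fixed, is_valid(fixed))
```

Checking with is_valid first and repairing only rows that fail keeps the substitution away from rows that are already well formed.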
