Python3 - Handling hyphenated words: combine and split - regex

I want to handle hyphenated words. For example, I'd like to handle the word "well-known" in two different ways.
First, combine the word (i.e. "wellknown"); second, split the word (i.e. "well", "known").
The input would be: "well-known" and the expected output is:
--wellknown
--well
--known
But I can only do one of the two at a time, not both. When I loop through my text file and find a hyphenated word, I combine it first. Then, after I've combined it, I don't know how to go back to the original word again and do the split operation. The following are short pieces from my code (please let me know if you need to see more details):
for text in contents.split():
    if not re.search(r'\d', text):          # remove numbers
        if text not in string.punctuation:  # remove punctuation
            if '-' in text:
                combine = text.replace("-", '')       # ??problem part ("wellknown")
                separate = re.split(r'[^a-z]', text)  # ??problem part ("well", "known")
I know why I cannot do both operations at the same time: after I replace the hyphenated word, the original word is gone, so I can no longer find it to do the split operation ("separate" in the code). Does anyone have an idea how to do it, or how to fix the logic?

Why not just use a tuple containing the separated words and the combined word? Split first, then combine:
Sample Code
separate = text.split('-')
combined = ''.join(separate)
words = (combined, separate[0], separate[1])
Output
('wellknown', 'well', 'known')
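A minimal sketch of how this could slot into the loop from the question (contents here is a stand-in for the text read from your file, and the results list is just for illustration):
import re
import string

contents = "This is a well-known example"  # stand-in for your file's text

results = []
for text in contents.split():
    if not re.search(r'\d', text) and text not in string.punctuation:
        if '-' in text:
            separate = text.split('-')
            combined = ''.join(separate)
            results.append((combined, *separate))  # e.g. ('wellknown', 'well', 'known')
        else:
            results.append(text)
print(results)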

Think of tokens as objects rather than strings; then you can create a token with multiple attributes.
For example, we can use the collections.namedtuple container as a simple object to hold the token:
from collections import namedtuple
from nltk import word_tokenize

Token = namedtuple('Token', ['surface', 'splitup', 'combined'])

text = "This is a well-known example of a small-business grant of $123,456."

tokenized_text = []
for token in word_tokenize(text):
    if '-' in token:
        this_token = Token(token, tuple(token.split('-')), token.replace('-', ''))
    else:
        this_token = Token(token, token, token)
    tokenized_text.append(this_token)
Then you can iterate through tokenized_text as a list of Token namedtuples, e.g. if we just need the list of surface strings:
for token in tokenized_text:
    print(token.surface)
[out]:
This
is
a
well-known
example
of
a
small-business
grant
of
$
123,456
.
If you need to access the combined tokens:
for token in tokenized_text:
    print(token.combined)
[out]:
This
is
a
wellknown
example
of
a
smallbusiness
grant
of
$
123,456
.
If you want to access the split-up tokens, use the same loop, but you'll see that you get a tuple instead of a string, e.g.
for token in tokenized_text:
    print(token.splitup)
[out]:
This
is
a
('well', 'known')
example
of
a
('small', 'business')
grant
of
$
123,456
.
You can use a list comprehension to access the attributes of the Token namedtuples too, e.g.
>>> [token.splitup for token in tokenized_text]
['This', 'is', 'a', ('well', 'known'), 'example', 'of', 'a', ('small', 'business'), 'grant', 'of', '$', '123,456', '.']
To identify the tokens that had a hyphen and have been split up, you can easily check their type, e.g.
>>> [type(token.splitup) for token in tokenized_text]
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'tuple'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'tuple'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]
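And to pick out just the tokens that were hyphenated, you can filter on that type, e.g.
>>> [token for token in tokenized_text if isinstance(token.splitup, tuple)]
[Token(surface='well-known', splitup=('well', 'known'), combined='wellknown'), Token(surface='small-business', splitup=('small', 'business'), combined='smallbusiness')]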

Related

How to create Elasticsearch tokens for a unique use case for a text value?

I have a use case where I want to tokenise an emailId with:
Category 1: words separated by punctuation & prefix tokens
Category 2: words separated by punctuation & prefix tokens, along with the punctuation
For example, for the email ona.ki#gl.co, I want to know whether the following tokens for the above emailId are possible to achieve:
on, ona, ona., ona.k, ona.ki, ona.ki#, ona.ki#g, ona.ki#gl, ona.ki#gl., ona.ki#gl.c, ona.ki#gl.co
.k, .ki, .ki#, .ki#g, .ki#gl, .ki#gl., .ki#gl.c, .ki#gl.co
ki, ki#, ki#g, ki#gl, ki#gl., ki#gl.c, ki#gl.co
#g, #gl, #gl., #gl.c, #gl.co
gl, gl., gl.c, gl.co
.c, .co
co
Use case examples for ona.ki#gl.co:
ona.k - should match
na.k - should not match
.ki# - should match
ki# - should match
i# - should not match
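Just to make the spec concrete, the token set I am after could be sketched in Python like this (assuming a minimum token length of 2 and . and # as the separators; this is only an illustration, not an Elasticsearch analyzer):
email = "ona.ki#gl.co"
separators = ".#"

# start positions: the beginning of the string, every separator,
# and the character right after every separator
starts = {0}
for i, ch in enumerate(email):
    if ch in separators:
        starts.add(i)        # token that begins with the separator itself
        starts.add(i + 1)    # token that begins with the following word

# every prefix of length >= 2 taken from each start position
tokens = {email[s:e] for s in starts for e in range(s + 2, len(email) + 1)}

print("ona.k" in tokens, "na.k" in tokens)  # True False
print(".ki#" in tokens, "i#" in tokens)     # True False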
The reason I want to tokenise this way is that, if there are two docs with the text values
ona.ki#gl.com
mona.gh#gl.com
then when the user types on, ona, ... I want to fetch and show only ona.ki#gl.com, not the other one.
Thanks in advance.

Preprocess words that do not match list of words

I have a very specific case I'm trying to match: I have some text and a list of words (which may contain numbers, underscores, or ampersands), and I want to clean the text of numeric characters (for instance) unless they occur in a word from my list. This list is also long enough that I can't just write a regex that matches every one of the words.
I've tried to use regex for this (i.e. something along the lines of re.sub(r'\d+', '', text)) while trying to come up with a more complex pattern to cover my case. This obviously isn't quite working, as I don't think regex alone is meant to handle that kind of case.
I'm trying to experiment with other options like pyparsing, and tried something like the below, but this also gives me an error (probably because I'm not understanding pyparsing correctly):
from pyparsing import *
import re
phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was once a potato_man with tw3nty cars and d& 76 different homes"
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(lambda word: re.sub(r'\d+', '', word)))
parser.parseString(text)
What's the best way to approach this sort of matching, or are there other better suited libraries that would be worth a try?
You are very close to getting this pyparsing cleaner-upper working.
Parse actions generally get their matched tokens as a list-like structure, a pyparsing-defined class called ParseResults.
You can see what actually gets sent to your parse action by wrapping it in the pyparsing decorator traceParseAction:
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(traceParseAction(lambda word: re.sub(r'\d+', '', word))))
It's actually a little easier to read if you make your parse action a regular def'ed function instead of a lambda:
@traceParseAction
def unnumber(word):
    return re.sub(r'\d+', '', word)

parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(unnumber))
traceParseAction will report what is passed to the parse action and what is returned.
>>entering unnumber(line: 'there was once a potato_man with tw3nty cars and d& 76 different homes', 0, ParseResults(['there'], {}))
<<leaving unnumber (exception: expected string or bytes-like object)
You can see that the value passed in is in a list structure, so you should replace word in your call to re.sub with word[0] (I also modified your input string to add some numbers to the unguarded words, to see the parse action in action):
text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"
def unnumber(word):
return re.sub(r'\d+', '', word[0])
and I get:
['there', 'was', 'once', 'a', 'potato_man', 'with', 'tw3nty', 'cars', 'and', 'd&', '76', 'different', 'homes']
Also, you use the '^' operator in your parser. You may get a little better performance if you use the '|' operator instead, since '^' (which creates an Or instance) will evaluate all the alternatives and choose the longest match - necessary in cases where there is some ambiguity in what the alternatives might match. '|' creates a MatchFirst instance, which stops once it finds a match and does not look any further at the other alternatives. Since your first alternative is the list of guard words, '|' is actually more appropriate - if one gets matched, don't look any further.
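Putting the pieces together with the '|' operator, the whole thing would look something like this (a sketch assembled from the snippets above):
import re
from pyparsing import OneOrMore, Word, alphanums, oneOf

phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"

def unnumber(word):
    # word is a ParseResults; the matched string is at index 0
    return re.sub(r'\d+', '', word[0])

parser = OneOrMore(oneOf(phrases) | Word(alphanums).setParseAction(unnumber))
print(parser.parseString(text).asList())
# ['there', 'was', 'once', 'a', 'potato_man', 'with', 'tw3nty', 'cars', 'and', 'd&', '76', 'different', 'homes']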

Entire text is matched but not able to group in named groups

I have following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in the following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem you have is trying to write a regex that matches ALL the key=value pairs at once. That's not the way to do it. The correct way is based on a pattern that matches ONLY ONE key=value pair, applied by a function that finds all occurrences of the pattern. Every language supplies such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric chars - (\w+) - together with a value. The value is designated by ([^|]+), that is, everything but a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function: there's a search function to catch a pattern once, and there's a findall function to catch all occurrences of the pattern within the text.
I tested it on regex101 and it worked.
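If you also need the appName part, it can be peeled off before the first vertical bar without any regex, e.g. (a small addition to the snippet above):
appName, _, rest = txt.partition('|')
print(appName)   # my_app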
I must comment, though, that the specific text pattern you are working with doesn't require regex. All high-level languages supply a split function. You can split by the vertical bar, and then split each slice you get (except the first one) again by the equals sign.
Use the PyPi regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
The .captures() method returns all the values captured by a group across all the iterations of that group.
Not sure, but maybe a regular expression is unnecessary, and splitting along these lines,
data = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x = data.split('|')
appName = []
for index, item in enumerate(x):
    if index > 0:
        element = item.split('=')
        temp = {"key": element[0], "value": element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
Each pair is built using a dict:
temp = {"key": element[0], "value": element[1]}
temp can be changed to whatever other data structure you prefer.

Regular expression within count() of a list not working

I am trying to count certain expressions in tokenized texts. My code is:
tokens = nltk.word_tokenize(raw)
print(tokens.count(r"<cash><flow>"))
'tokens' is a list of tokenized text (partly shown below). But the regex here is not working: the output shows 0 occurrences of 'cash flow', which is not correct, and I receive no error message. If I only count 'cash', it works fine.
'that', 'produces', 'cash', 'flow', 'from', 'operations', ',', 'none', 'of', 'which', 'are', 'currently', 'planned', ',', 'the', 'cash', 'flows', 'that', 'could', 'result', 'from'
Does anyone know what the problem is?
You don't need regex for this.
Just find the matching keywords in tokens and count the elements.
Example:
tokens = ['that','produces','cash','flow','from','operations','with','cash']
keywords = ['cash','flow']
keywords_in_tokens = [x for x in keywords if x in tokens]
count_keywords_in_tokens = len(keywords_in_tokens)
print(keywords_in_tokens)
print(count_keywords_in_tokens)
count_keywords_in_tokens returns 2 because both words are found in the list.
To do it the regex way, you need a string to find the matches in, based on a regex pattern.
In the example below, the two keywords are separated by an OR (the pipe):
import re
tokens = ['that','produces','cash','flow','from','operations','with','cash']
string = ' '.join(tokens)
pattern = re.compile(r'\b(cash|flow)\b', re.IGNORECASE)
keyword_matches = re.findall(pattern, string)
count_keyword_matches = len(keyword_matches)
print(keyword_matches)
print(count_keyword_matches)
count_keyword_matches returns 3 because there are 3 matches.
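If what you actually want is the number of times the two-word phrase "cash flow" occurs (rather than either keyword on its own), a pattern over the joined string can do that as well; a small sketch along the same lines:
phrase_pattern = re.compile(r'\bcash\s+flow\b', re.IGNORECASE)
phrase_matches = re.findall(phrase_pattern, string)
print(len(phrase_matches))  # 1 occurrence of "cash flow" in this sample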

python 2.7.8 delimiter - how to delete a letter of a string

def split(string, x):
    if string == "":
        return ""
    if string[0] == x:
        return split(string[1:], x)
    return string[0] + [split(string[1:], x)]
I want to give this func a string like "ballooolam" and "l", and I want it to give me ["ba","ooo","am"]. I've been thinking about it for 3 days.
The correct way to split on a character in Python is to use the split builtin. It will be much faster than anything you implement yourself in pure Python, as it is compiled C code, as are all builtins at this point:
lst = "ballooolam".split("l")
However, as discussed in this question, this might not quite do what you expect. split leaves empty strings in the list when there are tokens of zero length, i.e. if the delimiter is at the first/last position in the string, or if two delimiters are next to each other. This is so that doing word = 'l'.join(lst) will return the original value; without the empty strings, you would get 'balooolam' back instead of 'ballooolam'. If you want to remove these empty strings, you can do it easily with a list comprehension:
def splitter(string, x):
    return [token for token in string.split(x) if token]
The if token will reject any string that is 'falsy', which empty strings are. If you'd also like to exclude whitespace-only strings from the final list, you can do that with a little tweak:
def splitter(string, x):
    return [token for token in string.split(x) if token.strip()]
strip() removes any leading/trailing whitespace from a string. In the case of a whitespace-only string, this will result in an empty string, which will then be falsy.
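Used on the example from the question, the first splitter gives the desired result:
>>> "ballooolam".split("l")
['ba', '', 'ooo', 'am']
>>> splitter("ballooolam", "l")
['ba', 'ooo', 'am']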