I want to handle hyphenated words. For example, I'd like to handle the word "well-known" in two different ways.
First, combine the word ("wellknown"); second, split the word ("well", "known").
The input would be "well-known", and the expected output is both forms: "wellknown" and ("well", "known").
However, I can only produce one form at a time, not both. When I loop through my text file and find a hyphenated word, I combine it first.
After combining, I don't know how to get back to the original word to do the split. The following are short pieces of my code (please let me know if you need to see more details):
for text in contents.split():
    if not re.search(r'\d', text):  # remove numbers
        if text not in string.punctuation:  # remove punctuation
            if '-' in text:
                combine = text.replace("-", '')  # ??problem parts (wellknown)
                separate = re.split(r'[^a-z]', text)  # ??problem parts (well, known)
I think I know why I cannot do both operations at the same time: after I replace the hyphen, the hyphenated word is gone, so I can't find it again to do the split ("separate" in the code). Does anyone have an idea how to do this, or how to fix the logic?

Why not just use a tuple containing the combined word and the separated words?
Split first, then combine.
Sample code:
text = "well-known"
separate = text.split('-')
combined = ''.join(separate)
words = (combined, separate[0], separate[1])
print(words)
('wellknown', 'well', 'known')
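The tuple above assumes exactly one hyphen. As a rough sketch, assuming you also want words with several hyphens handled, a small helper can return the combined form followed by however many parts there are. The name hyphen_variants is hypothetical and not from the question:

```python
def hyphen_variants(word):
    """Return (combined, part1, part2, ...) for a hyphenated word.

    hyphen_variants is a hypothetical helper name used for illustration.
    """
    # Drop empty strings produced by leading, trailing, or doubled hyphens.
    parts = [part for part in word.split('-') if part]
    combined = ''.join(parts)
    return (combined, *parts)


print(hyphen_variants("well-known"))
# ('wellknown', 'well', 'known')
print(hyphen_variants("state-of-the-art"))
# ('stateoftheart', 'state', 'of', 'the', 'art')
```

Note that a word without a hyphen comes back as a pair of identical strings, so you may want to keep the '-' check from the question before calling it.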

Think of tokens as objects rather than plain strings; then you can create a token with multiple attributes.
For example, we can use the collections.namedtuple container as a simple object to hold the token:
from collections import namedtuple
from nltk import word_tokenize
Token = namedtuple('Token', ['surface', 'splitup', 'combined'])
text = "This is a well-known example of a small-business grant of $123,456."
tokenized_text = []
for token in word_tokenize(text):
    if '-' in token:
        # Hyphenated: keep the surface form, the parts, and the joined form.
        this_token = Token(token, tuple(token.split('-')), token.replace('-', ''))
    else:
        # Not hyphenated: all three attributes are the same string.
        this_token = Token(token, token, token)
    tokenized_text.append(this_token)
Then you can iterate through tokenized_text as a list of Token namedtuples, e.g. if we just need the surface strings:
for token in tokenized_text:
    print(token.surface)
If you need to access the combined tokens:
for token in tokenized_text:
    print(token.combined)
If you want to access the split-up tokens, use the same loop, but you'll see that hyphenated tokens give a tuple instead of a string, e.g.
for token in tokenized_text:
    print(token.splitup)
Among the plain strings, the output contains:
('well', 'known')
('small', 'business')
You can use a list comprehension to access the attributes of the Token namedtuples too, e.g.
>>> [token.splitup for token in tokenized_text]
['This', 'is', 'a', ('well', 'known'), 'example', 'of', 'a', ('small', 'business'), 'grant', 'of', '$', '123,456', '.']
To identify the tokens that have a hyphen and have been split up, you can check the type of the splitup attribute, e.g.
>>> [type(token.splitup).__name__ for token in tokenized_text]
['str', 'str', 'str', 'tuple', 'str', 'str', 'str', 'tuple', 'str', 'str', 'str', 'str', 'str']
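If what you ultimately need is one flat list containing both forms, the Token attributes can be expanded back out. Here is a minimal sketch, assuming plain str.split() in place of word_tokenize so that it runs without NLTK. The helper names make_token and expand are hypothetical:

```python
from collections import namedtuple

Token = namedtuple('Token', ['surface', 'splitup', 'combined'])


def make_token(word):
    """Build a Token; hyphenated words get a tuple in `splitup`."""
    if '-' in word:
        return Token(word, tuple(word.split('-')), word.replace('-', ''))
    return Token(word, word, word)


def expand(tokens):
    """Flatten tokens: hyphenated ones yield the combined form, then each part."""
    result = []
    for token in tokens:
        if isinstance(token.splitup, tuple):
            result.append(token.combined)
            result.extend(token.splitup)
        else:
            result.append(token.surface)
    return result


# str.split() is an assumption here; swap in word_tokenize if NLTK is available.
tokens = [make_token(word) for word in "a well-known example".split()]
print(expand(tokens))
# ['a', 'wellknown', 'well', 'known', 'example']
```

Checking isinstance(token.splitup, tuple) is the same type test as above, just used to decide which attributes to emit.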


