I have a list of 3800 names I want to remove from 750K sentences.
The names can contain multiple words such as "The White Stripes".
Some names may also look like a subset of a longer name, e.g. 'Ame' may be one name and 'Amelie' may be another.
This is what my current implementation looks like:
import re

def find_whole_word(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

names_lowercase = ['the white stripes', 'the beatles', 'slayer', 'ame', 'amelie']  # 3800+ names

def strip_names(sentence: str):
    token = sentence.lower()
    matches = []
    for name in names_lowercase:
        match = find_whole_word(name)(token)
        if match:
            matches.append(match)

    def get_match(match):
        return match.group(1)

    matched_strings = list(map(get_match, matches))
    matched_strings.sort(key=len, reverse=True)
    for matched_string in matched_strings:
        # strip names at the start, end and in the middle of the text (with whitespace around them)
        token = re.sub(rf"(?<!\S){matched_string}(?!\S)", "", token)

    return token
sentences = [
    "how now brown cow",
    "die hard fan of slayer",
    "the white stripes kill",
    "besides slayer I believe the white stripes are the best",
    "who let ame out",
    "amelie has got to go"
]  # 750K+ sentences

filtered_list = [strip_names(sentence) for sentence in sentences]
# Expected: filtered_list = ["how now brown cow", "die hard fan of ", " kill", "besides I believe are the best", "who let out", " has got to go"]
My current implementation takes several hours. I don't care about readability as this code won't be used for long.
Any suggestions on how I can reduce the run time?
My previous solution was overkill.
All I really had to do was use the word boundary \b as described in the documentation.
Usage example: https://regex101.com/r/2CZ8el/1
import re
names_joined = "|".join(names_lowercase)
names_whole_words_filter_expression = re.compile(rf"\b({names_joined})\b", flags=re.IGNORECASE)
def strip_names(text: str):
    return re.sub(names_whole_words_filter_expression, "", text).strip()
Now it takes a few minutes instead of a few hours 🙌
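One caveat worth noting (an extra precaution, not something the names in the question happened to need): if any of the 3800 names contains regex metacharacters such as '.', '(' or '+', the joined pattern will misbehave, so escaping each name before joining keeps it safe. A minimal sketch:

import re

names_lowercase = ['the white stripes', 'the beatles', 'slayer', 'ame', 'amelie']  # 3800+ names

# escape each name so characters like '.' or '(' are treated literally,
# then join them into a single alternation wrapped in word boundaries
names_joined = "|".join(re.escape(name) for name in names_lowercase)
names_whole_words_filter_expression = re.compile(rf"\b({names_joined})\b", flags=re.IGNORECASE)

def strip_names(text: str) -> str:
    return names_whole_words_filter_expression.sub("", text).strip()

print(strip_names("die hard fan of Slayer"))  # -> "die hard fan of"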
I would like to lemmatize some Italian text in order to perform some frequency counting of words and further investigations on the output of this lemmatized content.
I prefer lemmatization over stemming because I could extract the word's meaning from the context of the sentence (e.g. distinguish between a verb and a noun) and obtain words that actually exist in the language, rather than roots of those words that don't usually carry a meaning.
I found out that this library called pattern (pip2 install pattern) should complement nltk in order to perform lemmatization of the Italian language; however, I am not sure the approach below is correct, because each word is lemmatized by itself, not in the context of a sentence.
Probably I should give pattern the responsibility to tokenize a sentence (and so also annotate each word with metadata about verbs/nouns/adjectives etc.), and then retrieve the lemmatized word, but I have not been able to do this and I am not even sure it is possible at the moment.
Also: in Italian some articles are rendered with an apostrophe, so for example "l'appartamento" (in English "the flat") is actually 2 words: "lo" and "appartamento". Right now I am not able to find a way to split these 2 words with a combination of nltk and pattern, and so I am not able to count the frequency of the words in the correct way.
import nltk
import string
import pattern.it
# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()
# the following function is just to get the lemma
# out of the original input word (but right now
# it may be losing the context about the sentence
# the word is coming from, i.e.
# the same word could be either a noun/verb/adjective
# according to the context)
def lemmatize_word(input_word):
    in_word = input_word  # .decode('utf-8')
    # print('Something: {}'.format(in_word))
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # print("Input: {} Output: {}".format(in_word, word_it))
    the_lemmatized_word = word_it.split()[0][0][4]
    # print("Returning: {}".format(the_lemmatized_word))
    return the_lemmatized_word
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))
# 2nd remove punctuation and everything lower case
word_tokenized_no_punct = [string.lower(x) for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))
# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))
# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))
# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))
# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
    )
)
Gives this output:
1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)
How can I effectively lemmatize sentences with pattern using its own tokenizer (so that words are recognized as nouns/verbs/adjectives etc. before being lemmatized)?
Is there a Python alternative to pattern for Italian lemmatization that works with nltk?
How can I split articles that are bound to the next word by an apostrophe?
I'll try to answer your questions, knowing that I don't know a lot about Italian!
1) As far as I know, the main responsibility for handling the apostrophe lies with the tokenizer, and as such the nltk Italian tokenizer seems to have failed here.
3) A simple thing you can do about it is split on the apostrophe (although for more complicated patterns you will probably have to use the re package). An example:
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
It yields:
['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
2) An alternative to pattern would be treetagger; granted, it is not the easiest of installs (you need the Python package and the tool itself), but after that it works on Windows and Linux.
A simple example with your example above:
import treetaggerwrapper
from pprint import pprint
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = tagger.tag_text(it_string)
pprint(treetaggerwrapper.make_tags(tags))
The pprint yields:
[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
Tag(word=u'in', pos=u'PRE', lemma=u'in'),
Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
Tag(word=u'con', pos=u'PRE', lemma=u'con'),
Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
Tag(word=u'.', pos=u'SENT', lemma=u'.')]
It also tokenized all'ippodromo quite nicely into al and ippodromo (which is hopefully correct) under the hood before lemmatizing. Now we just need to remove the stop words and punctuation and it will be fine.
The doc for installing the TreeTaggerWrapper library for Python
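For completeness, here is a rough sketch of that last filtering step on top of the TreeTagger output (it reuses NLTK's Italian stop-word list and string.punctuation from the question; treat it as an illustration rather than a finished pipeline):

import string
import nltk
import treetaggerwrapper

it_stop_words = set(nltk.corpus.stopwords.words('italian'))

tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo."
tags = treetaggerwrapper.make_tags(tagger.tag_text(it_string))

# keep the lemma of every token that is neither punctuation nor a stop word
lemmas = [tag.lemma for tag in tags
          if tag.word not in string.punctuation
          and tag.word.lower() not in it_stop_words]
print(lemmas)  # lemmas of the remaining content words, ready for frequency counting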
I know this issue was solved a few years ago, but I am facing the same problem with nltk tokenization and Python 3 when parsing words like all'ippodromo or dall'Italia. So I want to share my experience and give a partial, although late, answer.
The first action/rule that an NLP pipeline must take into account is to prepare the corpus. I discovered that by replacing the ' character with a proper apostrophe ’ (using a careful regex replacement during text parsing, or simply a preliminary replace-all in a basic text editor), the tokenization works correctly and I get the proper splitting with just nltk.tokenize.word_tokenize(text).
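A minimal sketch of that preparation step (the exact tokens you get back depend on the NLTK version and punkt model, so treat the result as indicative):

import nltk

text = "Oggi volevo andare all'ippodromo."
# replace the straight apostrophe with the typographic one before tokenizing
prepared = text.replace("'", "\u2019")
print(nltk.tokenize.word_tokenize(prepared, language="italian"))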
I wrote an executable example - you can test it. When you start this program you will get three QPushButton() objects and one QLineEdit() object. There you can install or deinstall the event filter, or close the application. Please install the event filter and type some text; you will see what I mean. I want the example program to guard the space key: in the current version the user can't press the space key more than 2 times. This program does work.
But I have a little problem. When I write some text in the QLineEdit() object and then highlight the text and press the Delete or Return key, nothing happens. I am not able to delete the text. I am also not able to copy the selected text.
What's wrong with the code below?
#!/usr/bin/env python

import sys
from PyQt4.QtCore import QEvent, Qt
from PyQt4.QtGui import QMainWindow, QWidget, QApplication, QVBoxLayout, QLineEdit, QPushButton


class Window(QMainWindow):

    def __init__(self, parent=None):
        QMainWindow.__init__(self, parent)
        self.count_space_pressed = 0
        self.current_pos = None
        self.init_ui()
        self.init_signal_slot_push_button()

    def init_ui(self):
        centralwidget = QWidget(self)
        self.input_line_edit = QLineEdit(self)
        self.close_push = QPushButton(self)
        self.close_push.setEnabled(False)
        self.close_push.setText("Close")
        self.push_install = QPushButton(self)
        self.push_install.setText("Install eventFilter")
        self.push_deinstall = QPushButton(self)
        self.push_deinstall.setText("Deinstall eventFilter")
        layout = QVBoxLayout(centralwidget)
        layout.addWidget(self.input_line_edit)
        layout.addWidget(self.push_install)
        layout.addWidget(self.push_deinstall)
        layout.addWidget(self.close_push)
        self.setCentralWidget(centralwidget)
        return

    def install_filter_event(self, widget_object):
        widget_object.installEventFilter(self)
        return

    def deinstall_filter_event(self, widget_object):
        widget_object.removeEventFilter(self)
        return

    def init_signal_slot_push_button(self):
        self.close_push.clicked.connect(self.close)
        self.push_install.clicked.connect(lambda: self.install_filter_event(self.input_line_edit))
        self.push_deinstall.clicked.connect(lambda: self.deinstall_filter_event(self.input_line_edit))
        return

    def strip_string(self, content, site=None):
        if site == "right":
            return content.rstrip()
        elif site == "right_left":
            return content.strip()
        elif site == "left":
            return content.lstrip()
    def eventFilter(self, received_object, event):
        content_line_edit = unicode(received_object.text())
        if event.type() == QEvent.KeyPress:
            if event.key() == Qt.Key_Space:
                '''
                Yes, the user did press the space key. We
                count how often he pressed the space key.
                '''
                self.count_space_pressed = self.count_space_pressed + 1
                if int(self.count_space_pressed) > 1:
                    '''
                    The user pressed the space key more than 1 time.
                    '''
                    self.close_push.setEnabled(False)
                    '''
                    Now we know the user pressed the
                    space key more than 1 time. We check whether
                    the variable named self.current_pos is None.
                    That means no current position is saved.
                    '''
                    if self.current_pos is None:
                        '''
                        No current position is saved,
                        so we save the new position and
                        then set the position of the cursor.
                        '''
                        self.current_pos = received_object.cursorPosition()
                        received_object.setCursorPosition(int(self.current_pos))
                        received_object.clear()
                        received_object.setText(self.strip_string(content_line_edit, site="right"))
                    else:
                        '''
                        The user pressed the space key again, for
                        example 3, 4, 5, 6 times; we want to keep the
                        old position of the cursor until he presses
                        a key other than space.
                        '''
                        received_object.setCursorPosition(int(self.current_pos))
                        '''
                        We have to remove all spaces on the right side
                        of the string and set the content of the QLineEdit widget.
                        '''
                        received_object.clear()
                        received_object.setText(self.strip_string(content_line_edit, site="right"))
                else:
                    pass
            else:
                '''
                No, the user didn't press the space key,
                so we reset everything to the defaults.
                '''
                self.close_push.setEnabled(True)
                self.current_pos = None
                self.count_space_pressed = 0
                received_object.clear()
                received_object.setText(self.strip_string(content_line_edit, site="left"))
        # Call the base class method to continue normal event processing
        return QMainWindow.eventFilter(self, received_object, event)


if __name__ == '__main__':
    app = QApplication(sys.argv)
    window = Window()
    window.show()
    app.exec_()
EDIT:
import sys, re
from PyQt4 import QtCore, QtGui

class Window(QtGui.QWidget):
    def __init__(self):
        super(Window, self).__init__()
        self.edit = QtGui.QLineEdit(self)
        self.edit.textChanged.connect(self.handleTextChanged)
        layout = QtGui.QVBoxLayout(self)
        layout.addWidget(self.edit)
        # First we save the regular expression pattern
        # in a variable named regex.
        ## This means: one whitespace character, followed by
        ## one or more whitespace characters
        regex = r"\s\s+"
        # Now we compile the pattern.
        # Afterwards we save the compiled pattern
        # in a variable named compiled_re.
        self.compiled_re = re.compile(regex)

    def handleTextChanged(self, text):
        # When the text of the widget object is changed,
        # we do something.
        # Here I am really not sure.
        # Do you want to check that the given text isn't empty?
        ## No, we want to search the string to see if it
        ## contains any runs of multiple spaces
        if self.compiled_re.search(text):
            # We know that the given text is a QString object.
            # So we have to convert the given text
            # into a Python string, because we want to work
            # with it in Python.
            text = unicode(text)
            # NOTICE: Do replacements before and after cursor pos
            # We save the current cursor position
            # of the QLineEdit object in the variable named pos.
            pos = self.edit.cursorPosition()
            # Search and replace: here the sub() method
            # replaces all occurrences of the RE pattern
            # in the string with a single space.
            # It returns the modified string, which we save
            # in the variables prefix and suffix.
            # BUT I am not sure if I understand this: [:pos]
            # and [pos:]. I will try to understand.
            # I think we are talking about slicing, right?
            # And I think slicing works like string[start:end]:
            # So text[:pos] means: search and replace all whitespace
            # at the end of the text string. And the same again, but
            # text[pos:] means: search and replace all whitespace
            # at the start of the string.
            ## Right, but the wrong way round. text[:pos] means from
            ## the start of the string up to pos (the prefix); and
            ## text[pos:] means from pos up to the end of the string
            ## (the suffix)
            prefix = self.compiled_re.sub(' ', text[:pos])
            suffix = self.compiled_re.sub(' ', text[pos:])
            # NOTICE: Cursor might be between spaces
            # Now we check whether the variable prefix ends
            # with a whitespace and whether suffix starts
            # with a whitespace.
            # BUT why do we do that?
            ## Imagine that the string is "A |B C" (with the cursor
            ## shown as "|"). If "B" is deleted, we will get "A | C"
            ## with the cursor left between multiple spaces. But
            ## when the string is split into prefix and suffix,
            ## each part will contain only *one* space, so the
            ## regexp won't replace them.
            if prefix.endswith(' ') and suffix.startswith(' '):
                # Yes, it's True, so we overwrite the variable named
                # suffix and slice it. suffix[1:] means we start
                # at 1 and go to the open end.
                ## This removes the extra space at the start of the
                ## suffix that was missed by the regexp (see above)
                suffix = suffix[1:]
            # Now we have to set the text of the QLineEdit object,
            # so we join the two variables named prefix and suffix
            # together.
            self.edit.setText(prefix + suffix)
            # Afterwards we have to set the cursor position.
            # I know that len() returns the length of the
            # variable named prefix.
            # BUT why do we have to do that?
            ## When the text is set, it will clear the cursor. The
            ## prefix and suffix give the text before and after the
            ## old cursor position. Removing spaces may have shifted
            ## the old position, so the new position is calculated
            ## from the length of the current prefix
            self.edit.setCursorPosition(len(prefix))

if __name__ == '__main__':
    app = QtGui.QApplication(sys.argv)
    window = Window()
    window.setGeometry(500, 150, 300, 100)
    window.show()
    sys.exit(app.exec_())
EDIT 2:
Two questions:
First question: in the if-condition where we check whether prefix ends and suffix starts with spaces, we are about to remove the extra space at the start of the suffix. But why don't we also remove the extra space at the start of the prefix?
Imagine: the user types " Prefix and Suffix " - with extra whitespace at the start and at the end. Don't we have to remove the extra space at the start of the prefix as well - something like:
prefix = prefix[1:]?
Second question: at the end of the handleTextChanged() method we have to calculate the new position of the cursor. In the current case we use prefix to get the length of the string. Why not the length of the new modified text, i.e. the combination of prefix and suffix?
Example: the old string is " Prefix and Suffix " and the user removes the word "and". Now our string looks like " Prefix | Suffix ". After all extra whitespace is removed we get the new modified text: "Prefix Suffix". Why don't we calculate the new position from the modified text? Or did I miss something?
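To make the slicing concrete, here is a small plain-Python walk-through of the prefix/suffix logic for that example (pos is a hypothetical cursor position, chosen as if "and" had just been deleted):

import re

compiled_re = re.compile(r'\s\s+')

text = " Prefix  Suffix "  # the string after "and" was deleted; the cursor sat between the two spaces
pos = 8                    # hypothetical cursor position, right between the double space
prefix = compiled_re.sub(' ', text[:pos])   # " Prefix " - only single spaces, so unchanged
suffix = compiled_re.sub(' ', text[pos:])   # " Suffix " - only single spaces, so unchanged
if prefix.endswith(' ') and suffix.startswith(' '):
    suffix = suffix[1:]                     # drop the duplicated space around the cursor
print(repr(prefix + suffix))  # ' Prefix Suffix '
print(len(prefix))            # 8 - the cursor goes back to just before "Suffix"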
EDIT 3:
I am sorry, I still don't understand the situation.
First situation: the user types the following string: "A B C |" (| is shown as the cursor). Now the user presses the space key more than 2 times; we get a prefix that contains "A B C |" - and no suffix. Currently the length of the prefix is 6 - the suffix has no length, because it is empty. The whole string has length 6. The current position of the cursor is 7.
Second situation: the user types "A B D E F |". Now he realizes that a letter is missing: C. He moves his cursor back between B and D, types C, and then presses the space key 2 times. Now we have a prefix that contains "A B C " and a suffix whose content is "D E F". The length of the prefix is 6 and that of the suffix is 5. The length of the whole string is 11. And at this moment the current position of the cursor is 7. In this situation you take the length of the prefix and set the cursor position, right?
Filtering key-presses is not enough if you really want to prevent multiple spaces.
For instance, the user can simply drag and drop multiple spaces, or paste them with the mouse, the built-in context menu, or the standard keyboard shortcuts.
It's also very easy to break your space-key counting method: for example, just type A B C, then move back two places and delete B!
A much more robust way to do this is to connect to the textChanged signal and use a regexp to check whether there are any multiple spaces. If there are, use the same regexp to replace them, and then restore the cursor to its original position.
Here's a demo:
import sys, re
from PyQt4 import QtCore, QtGui

class Window(QtGui.QWidget):
    def __init__(self):
        super(Window, self).__init__()
        self.edit = QtGui.QLineEdit(self)
        self.edit.textChanged.connect(self.handleTextChanged)
        layout = QtGui.QVBoxLayout(self)
        layout.addWidget(self.edit)
        self.regexp = re.compile(r'\s\s+')

    def handleTextChanged(self, text):
        if self.regexp.search(text):
            text = unicode(text)
            # do replacements before and after cursor pos
            pos = self.edit.cursorPosition()
            prefix = self.regexp.sub(' ', text[:pos])
            suffix = self.regexp.sub(' ', text[pos:])
            # cursor might be between spaces
            if prefix.endswith(' ') and suffix.startswith(' '):
                suffix = suffix[1:]
            self.edit.setText(prefix + suffix)
            self.edit.setCursorPosition(len(prefix))

if __name__ == '__main__':
    app = QtGui.QApplication(sys.argv)
    window = Window()
    window.setGeometry(500, 150, 300, 100)
    window.show()
    sys.exit(app.exec_())
If you are using Python and have created a button for removing the last character, do the following:
self.PB_Back.clicked.connect(self.Keypad_Back)
def Keypad_Back(self):
    self.LE_Edit.setText(self.LE_Edit.text()[:-1])
This will remove the last character, one at a time.
To delete all the characters at once, do the following:
self.PB_DeleteResult.clicked.connect(self.Keypad_DeleteResult)
def Keypad_DeleteResult(self):
    self.LE_Edit.setText("")
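As a side note (assuming the same LE_Edit and PB_DeleteResult attribute names as above), QLineEdit already provides a clear() slot, so the delete-all button can also be wired up directly without a helper method:

# connect the button straight to the line edit's built-in clear() slot
self.PB_DeleteResult.clicked.connect(self.LE_Edit.clear)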
I have some data with name and ethnicity
j-bte letourneau scotish
jane mc-earthar french
amabil bonneau english
I then normalize the name by replacing the spaces with "#" and adding trailing "?" characters to standardize the total length of the name entries. I would like to use sequential three-letter substrings as my features to predict ethnicity.
name_filled substr1 substr2 substr3 \
0 j-bte#letourneau??????????????????????????? j-b -bt bte
1 jane#mc-earthar???????????????????????????? jan ane ne#
2 amabil#bonneau????????????????????????????? ama mab abi
Here is my code for data manipulation to this point:
import pandas as pd
from pandas import DataFrame
import re
# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity
# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]
# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()
# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s\-]', '') # Retain space and hyphen
# Replace empty space as "#"
frame3.loc[:, "name"] = frame3["name"].str.replace('[\s]', '#')
# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print nameLength.max() # Longest name has !!!40 characters!!! including spaces and hyphens
# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")
# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]
My question is: would it be a problem to store my 3-character substrings this way when running a machine learning algorithm? It could be a problem, as in the example below.
Imagine two Chinese people, both with the last name Chan, but one called "Li Chan" and the other called "Joseph Chan".
"Chan" will be split into "cha" and "han", but in the first case "cha" will be stored in substr4, while in the second it will be stored in substr8, because the longer first name pushes it much further along. I wonder whether I could and should store the 3-character substrings in just one single variable as a list (for example ["j-b", "-bt", "bte", ...] for case 0), and, if the substrings are stored in one single variable, whether it can still be used by machine learning algorithms to predict ethnicity.
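For what it's worth, a minimal sketch of the list-per-row idea (the trigrams column name and the helper function are made up for illustration; it assumes frame3 from the code above):

def char_trigrams(name):
    # all overlapping 3-character substrings of the normalized name
    return [name[i:i+3] for i in range(len(name) - 2)]

frame3["trigrams"] = frame3["name"].apply(char_trigrams)
print(frame3[["name", "trigrams"]].head())

Whether a plain list column like this can be fed directly to a model depends on the library; scikit-learn, for instance, would typically want such lists turned into a bag-of-n-grams matrix first.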
I have file which has data in lines as follows:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour', 'The Smashing Pumpkins', 'Warner Bros. Entertainment','This is a good Beer']
['Voices Inside', 'Expressivista', 'The Kentucky Fried Movie', 'The Bridges of Madison County']
and so on. I want to re-write the data into a file whose lines keep only the tokens with fewer than 3 words (or some other number), e.g.:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour']
['Voices Inside', 'Expressivista']
this is what I have tried so far:
for line in open(file):
    line = line.strip()
    line = line.rstrip()
    prog = re.compile("([a-z0-9]){32}")
    if line:
        line = line.replace('"', '')
        line = line.split(",")
        if re.match(prog, line[0]) and len(line) > 2:
            wo = []
            for words in line:
                word = words.split()
                if len(word) < 3:
                    print word.append(word)
But the output says None. Any thoughts on where I am making a mistake?
A better way to do what you're doing is to use ast.literal_eval, which automagically converts string representations of Python objects (e.g. lists) into actual Python objects.
import ast
# raw data
data = """
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour', 'The Smashing Pumpkins', 'Warner Bros. Entertainment','This is a good Beer']
['Voices Inside', 'Expressivista', 'The Kentucky Fried Movie', 'The Bridges of Madison County']
"""
# set threshold number of tokens
threshold = 3
# split into lines
lines = data.split('\n')
# parse non-blank lines into python lists
lists = [ast.literal_eval(line) for line in lines if line]
# for each list, keep only those tokens with less than `threshold` tokens
result = [[item for item in lst if len(item.split()) < threshold]
          for lst in lists]

# show result
for line in result:
    print(line)
Result:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour']
['Voices Inside', 'Expressivista']
I think the reason your code isn't working is that you're trying to match line[0] against your regex prog - but the problem is that line[0] isn't 32 characters long for either of your lines, so your regex won't match.
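To see that the pattern really doesn't match, here is a quick check against the first element of the first line (as described above):

import re

prog = re.compile("([a-z0-9]){32}")
print(re.match(prog, "['Marilyn Manson'"))  # None - there is no run of 32 lowercase letters/digits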