Using re.findall to match a string in html - regex

I want to use re.findall() to match instances of business names from a review website. For example, I'd like to capture names in a list like the ones in the example below:
website_html = ', Jimmy Bob's Tune & Lube, Allen's Bar & Grill, Joanne's - Restaurant,'
name_list = re.findall('[,]\s*([\w\'&]*\s?)*[,]', website_html)
My code isn't catching any patterns. Any ideas?

You only provided one input example, so this answer is based on your question as is:
import re

# I replaced the single quotes at the start and end of your input with
# double quotes, because the apostrophe in Bob's throws a
# SyntaxError: invalid syntax
website_html = ", Jimmy Bob's Tune & Lube,"
# I replaced re.findall with re.search, because with only one example
# re.search or re.match is enough.
name_list = re.search(r'[,]\s*([\w\'&]*\s?)*[,]', website_html)
print (name_list.group(0))
# output
, Jimmy Bob's Tune & Lube,
If you have additional input values in website_html, please provide them so that I can modify my answer.
Here is the version that uses re.findall.
# Same fix as above: double quotes around the input so the apostrophe in
# Bob's doesn't raise SyntaxError: invalid syntax
website_html = ", Jimmy Bob's Tune & Lube,"
# I wrapped your pattern as a capture group
name_list = re.findall(r'([,]\s*([\w\'&]*\s?)*[,])', website_html)
print (type(name_list))
# output
<class 'list'>
print (name_list)
# output
[(", Jimmy Bob's Tune & Lube,", '')]
UPDATED ANSWER
This answer is based on the modified input to your original question.
website_html = ", Jimmy Bob's Tune & Lube, Allen's Bar & Grill, Joanne's - Restaurant,"
# grab everything between the first and the last comma
name_list = re.findall(r',.*,', website_html)
for item in name_list:
    split_strings = str(item).split(',')
    for string in split_strings:
        print(string)
# output
Jimmy Bob's Tune & Lube
Allen's Bar & Grill
Joanne's - Restaurant
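Alternatively, if you only want the names themselves, one capture group per name avoids the split step. This is only a minimal sketch against the same sample input, not part of the original answer:
import re

website_html = ", Jimmy Bob's Tune & Lube, Allen's Bar & Grill, Joanne's - Restaurant,"
# A comma, optional whitespace, then everything up to (but not including) the next comma.
name_list = [name.strip() for name in re.findall(r",\s*([^,]+)", website_html)]
print(name_list)
# ["Jimmy Bob's Tune & Lube", "Allen's Bar & Grill", "Joanne's - Restaurant"]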

Related

Fastest way to replace phrases from sentences with Python?

I have a list of 3800 names I want to remove from 750K sentences.
The names can contain multiple words such as "The White Stripes".
Some names might also look like a subset of a larger name, e.g. 'Ame' may be one name and 'Amelie' may be another.
This is what my current implementation looks like:
import re

def find_whole_word(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

names_lowercase = ['the white stripes', 'the beatles', 'slayer', 'ame', 'amelie']  # 3800+ names

def strip_names(sentence: str):
    token = sentence.lower()
    has_name = False
    matches = []
    for name in names_lowercase:
        match = find_whole_word(name)(token)
        if match:
            matches.append(match)

    def get_match(match):
        return match.group(1)

    matched_strings = list(map(get_match, matches))
    matched_strings.sort(key=len, reverse=True)
    for matched_string in matched_strings:
        # strip names at the start, end and when they occur in the middle of text (with whitespace around)
        token = re.sub(rf"(?<!\S){matched_string}(?!\S)", "", token)
    return token
sentences = [
    "how now brown cow",
    "die hard fan of slayer",
    "the white stripes kill",
    "besides slayer I believe the white stripes are the best",
    "who let ame out",
    "amelie has got to go"
]  # 750K+ sentences
filtered_list = [strip_names(sentence) for sentence in sentences]
# Expected: filtered_list = ["how now brown cow", "die hard fan of ", " kill", "besides I believe are the best", "who let out", " has got to go"]
My current implementation takes several hours. I don't care about readability as this code won't be used for long.
Any suggestions on how I can reduce the run time?
My previous solution was overkill.
All I really had to do was use the word boundary \b as described in the documentation.
Usage example: https://regex101.com/r/2CZ8el/1
import re

names_joined = "|".join(names_lowercase)
names_whole_words_filter_expression = re.compile(rf"\b({names_joined})\b", flags=re.IGNORECASE)

def strip_names(text: str):
    return re.sub(names_whole_words_filter_expression, "", text).strip()
Now it takes a few minutes instead of a few hours 🙌
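For completeness, here is a rough sketch of how the compiled pattern plugs into the sentence list from the question; it is only an illustration, and it adds a whitespace-normalisation pass because deleting a name from the middle of a sentence leaves a double space behind:
import re

names_lowercase = ['the white stripes', 'the beatles', 'slayer', 'ame', 'amelie']
# If a name could contain regex metacharacters, wrap each one in re.escape() first.
names_joined = "|".join(names_lowercase)
names_whole_words_filter_expression = re.compile(rf"\b({names_joined})\b", flags=re.IGNORECASE)

def strip_names(text: str):
    # Remove every listed name, then collapse leftover runs of whitespace.
    stripped = names_whole_words_filter_expression.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()

sentences = ["die hard fan of slayer", "besides slayer I believe the white stripes are the best"]
print([strip_names(s) for s in sentences])
# ['die hard fan of', 'besides I believe are the best']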

Lemmatizing Italian sentences for frequency counting

I would like to lemmatize some Italian text in order to perform some frequency counting of words and further investigations on the output of this lemmatized content.
I prefer lemmatizing over stemming because I can take the word's meaning from its context in the sentence (e.g. distinguish between a verb and a noun) and obtain words that actually exist in the language, rather than roots that usually don't carry a meaning on their own.
I found a library called pattern (pip2 install pattern) that is supposed to complement nltk for lemmatizing Italian, however I am not sure the approach below is correct, because each word is lemmatized by itself, not in the context of a sentence.
Probably I should let pattern tokenize the sentence (annotating each word with metadata about verbs/nouns/adjectives etc.) and then retrieve the lemmatized word, but I have not been able to do this and I am not even sure it is currently possible.
Also: in Italian some articles are written with an apostrophe, so for example "l'appartamento" (in English "the flat") is actually 2 words: "lo" and "appartamento". Right now I cannot find a way to split these 2 words with a combination of nltk and pattern, so I cannot count the word frequencies correctly.
import nltk
import string
import pattern
# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()
# the following function is just to get the lemma
# out of the original input word (but right now
# it may be losing the context about the sentence
# from where the word is coming from i.e.
# the same word could either be a noun/verb/adjective
# according to the context)
def lemmatize_word(input_word):
    in_word = input_word  # .decode('utf-8')
    # print('Something: {}'.format(in_word))
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # print("Input: {} Output: {}".format(in_word, word_it))
    the_lemmatized_word = word_it.split()[0][0][4]
    # print("Returning: {}".format(the_lemmatized_word))
    return the_lemmatized_word
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))
# 2nd remove punctuation and everything lower case
word_tokenized_no_punct = [string.lower(x) for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))
# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))
# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))
# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))
# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
    )
)
Gives this output:
1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)
How to effectively lemmatize some sentences with pattern using their tokenizer? (assuming lemmas are recognized as nouns/verbs/adjectives etc.)
Is there a python alternative to pattern to use for Italian lemmatization with nltk?
How to split articles that are bound to the next word using apostrophes?
I'll try to answer your questions, though I should say up front that I don't know a lot about Italian!
1) As far as I know, splitting off the apostrophe is mainly the tokenizer's responsibility, and in that respect the nltk Italian tokenizer seems to have failed.
3) A simple thing you can do about it is split on the apostrophe (although you will probably have to use the re package for more complicated patterns), for example:
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
It yields:
['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
2) An alternative to pattern would be treetagger, granted it is not the easiest install of all (you need the python package and the tool itself); after that part, though, it works on both Windows and Linux.
A simple example with your example above:
import treetaggerwrapper
from pprint import pprint
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = tagger.tag_text(it_string)
pprint(treetaggerwrapper.make_tags(tags))
The pprint yields:
[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
Tag(word=u'in', pos=u'PRE', lemma=u'in'),
Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
Tag(word=u'con', pos=u'PRE', lemma=u'con'),
Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
Tag(word=u'.', pos=u'SENT', lemma=u'.')]
It also tokenized all'ippodromo quite nicely into al and ippodromo (which is hopefully correct) under the hood before lemmatizing. Now we just need to remove the stop words and punctuation and we are done.
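As a rough sketch of that last step (this is not from the original answer; it assumes the treetaggerwrapper setup shown above and reuses the stop-word list from the question), you could filter the tags and count lemma frequencies like this:
import nltk
import treetaggerwrapper
from collections import Counter

it_stop_words = set(nltk.corpus.stopwords.words('italian'))
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."

tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = treetaggerwrapper.make_tags(tagger.tag_text(it_string))

# Drop sentence punctuation and stop words, then count the remaining lemmas.
lemmas = [tag.lemma.lower() for tag in tags
          if tag.pos != 'SENT' and tag.word.lower() not in it_stop_words]
print(Counter(lemmas))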
The doc for installing the TreeTaggerWrapper library for python
I know this issue was solved a few years ago, but I am facing the same problem with nltk tokenization and Python 3 when parsing words like all'ippodromo or dall'Italia, so I want to share my experience and give a partial, although late, answer.
The first step of any NLP pipeline is to prepare the corpus. I discovered that if the straight apostrophe ' is replaced with the typographic apostrophe ’ (using a careful regex replacement during text parsing, or simply a preparatory replace-all in a basic text editor beforehand), then tokenization works correctly and I get the proper splitting with just nltk.tokenize.word_tokenize(text)
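A minimal sketch of that preprocessing step (the sample sentence comes from this thread; the rest is just an illustration, not code from the answer above):
import nltk
# nltk.download('punkt')  # first run only, if the tokenizer model is missing

it_string = "Oggi volevo andare all'ippodromo."
# Swap the straight apostrophe for the typographic one before tokenizing.
prepared = it_string.replace("'", "\u2019")
print(nltk.tokenize.word_tokenize(prepared))
# Per the note above, the article and the noun should now come out as separate tokens.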

creating list with multiple regex

I have information in the following format:
AAMOD, Robert Kevin; Salt Lake, '91; Sales Associate, Xyz, UT; r: 101 Williams Ave, Salt Lake City, UT 84105, cell: (xxx) xxx- xxxx, abc#yahoo.com.
I am trying to convert the information to CSV.
I have converted the information to a list by splitting on ';', and now, for each item of that list, I am using regex to build another list holding only the required information in a particular sequence (or an empty entry if that information is not present).
for item in list_1:
    direct = []
    if re.search(r'([A-Z]{3,}),([\d\w\s.]+)', item):
        match = re.search(r'([A-Z]{3,}),([\d\w\s.]+)', item)
        direct.append(match.group(1))
        direct.append(match.group(2))
        break
    else:
        match1 = re.search(r'\'(\d+)', item)
        if match1:
            direct.append(match1.group(1))
            break
        else:
            match2 = re.search(r'(r:[\w\d\s.]*,*\s*([\w]*)\s*([A-Z]{2})\s([\d]{5}),\s*(.\d{3}.\s\d{3}-\d{4}),\s*(\w+[\w\d\s.]+#+\s*[\w\d.]+\.+\w+))', item)
            if match2:
                direct.append(match2.group(2))
                direct.append(match2.group(3))
                direct.append(match2.group(4))
                direct.append(match2.group(5))
                direct.append(match2.group(6))
                break
            else:
                direct.append('')
                break
print direct
When I run this code, the list only shows the first match.
If I run each re.search operation individually, it works. But the moment I try to combine them using nested if-else, nothing happens. Can anyone suggest where the logic is wrong?
Expected output:
[AAMOD, Robert Kevin, 91, Sales Associate, Salt Lake City, UT, 84105, (xxx) xxx- xxxx, abc#yahoo.com]

Print line if any of these words are matched

I have a text file with 1000+ lines, each one representing a news article about a topic that I'm researching. Several hundred lines/articles in this dataset are not about the topic, however, and I need to remove these.
I've used grep to remove many of them (grep -vwE "(wordA|wordB)" test8.txt > test9.txt), but I now need to go through the rest manually.
I have working code that finds all lines that do not contain a certain word, prints each such line to me, and asks if it should be removed or not. It works well, but I'd like to include several other words. E.g. let's say my research topic is meat eating trends. I hope to write a script that prints lines that do not contain 'chicken' or 'pork' or 'beef', so I can manually verify whether the lines/articles are about the relevant topic.
I know I can do this with elif, but I wonder if there is a better and simpler way? E.g. I tried if "chicken" or "beef" not in line: but it did not work.
Here's the code I have:
orgfile = 'text9.txt'
newfile = 'test10.txt'
newFile = open(newfile, 'wb')
with open("test9.txt") as f:
    for num, line in enumerate(f, 1):
        if "chicken" not in line:
            print "{} {}".format(line.split(',')[0], num)
            testVar = raw_input("1 = delete, enter = skip.")
            testVar = testVar.replace('', '0')
            testVar = int(testVar)
            if testVar == 10:
                print ''
                os.linesep
            else:
                f = open(newfile, 'ab')
                f.write(line)
                f.close()
        else:
            f = open(newfile, 'ab')
            f.write(line)
            f.close()
Edit: I tried Pieter's answer to this question but it does not work here, presumably because I am not working with integers.
You can use any or all with a generator expression. For example:
>>> key_word={"chicken","beef"}
>>> test_texts=["the price of beef is too high", "the chicken farm now open","tomorrow there is a lunar eclipse","bla"]
>>> for title in test_texts:
if any(key in title for key in key_words):
print title
the price of beef is too high
the chicken farm now open
>>>
>>> for title in test_texts:
if not any(key in title for key in key_words):
print title
tomorrow there is a lunar eclipse
bla
>>>
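Applied to your script, a rough sketch might look like the following (the filenames and the keyword set are taken from your description, and the interactive delete/skip prompt is left out to keep it short):
keywords = {"chicken", "pork", "beef"}

with open("test9.txt") as src, open("test10.txt", "w") as dst:
    for num, line in enumerate(src, 1):
        if any(word in line for word in keywords):
            # Clearly on-topic: keep the article without asking.
            dst.write(line)
        else:
            # Flag the line for manual review, as in your original loop.
            print("{} {}".format(line.split(',')[0], num))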

extract substring using regex in groovy

If I have the following pattern in some text:
def articleContent = "<![CDATA[ Hellow World ]]>"
I would like to extract the "Hellow World" part, so I use the following code to match it:
def contentRegex = "<![CDATA[ /(.)*/ ]]>"
def contentMatcher = ( articleContent =~ contentRegex )
println contentMatcher[0]
However, I keep getting a null pointer exception because the regex doesn't seem to be working. What would be the correct regex for "any piece of text", and how do I collect it from a string?
Try:
def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[0][1]
However, I worry that you are planning to parse XML with regular expressions. If this CDATA is part of a larger valid XML document, it would be better to use an XML parser.
The code below shows the substring extraction using regex in groovy:
class StringHelper {
    @NonCPS
    static String stripSshPrefix(String gitUrl) {
        def match = (gitUrl =~ /ssh:\/\/(.+)/)
        if (match.find()) {
            return match.group(1)
        }
        return gitUrl
    }

    static void main(String... args) {
        def gitUrl = "ssh://git@github.com:jiahut/boot.git"
        def gitUrl2 = "git@github.com:jiahut/boot.git"
        println(stripSshPrefix(gitUrl))
        println(stripSshPrefix(gitUrl2))
    }
}
A little bit late to the party, but try using backslashes when defining your pattern, for example:
def articleContent = "real groovy"
def matches = (articleContent =~ /gr\w{4}/) //grabs 'gr' and its following 4 chars
def firstmatch = matches[0] //firstmatch would be 'groovy'
You were on the right track; it was just the pattern definition that needed to be altered.
References:
https://www.regular-expressions.info/groovy.html
http://mrhaki.blogspot.com/2009/09/groovy-goodness-matchers-for-regular.html
One more single-line solution, in addition to tim_yates's:
def result = articleContent.replaceAll(/<!\[CDATA\[(.+)]]>/,/$1/)
Please take into account that if the regexp doesn't match, the result will be equal to the source. Unlike
def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[0][1]
which will raise an exception in that case.
In my case, the actual string was multi-line, like below:
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Actual Start Date : 2020-11-25 00:00:00
I wanted to extract the Actual Start Date value from this string, so here is what my script looks like:
def matches = (originalData =~ /(?<=Actual Start Date :).*/)
def extractedData = matches[0]
This regex extracts the content of any line that starts with the prefix Actual Start Date :
In my case, the result is 2020-11-25 00:00:00
Note: if your originalData is a multi-line string, then in Groovy you can write it as follows:
def originalData =
"""
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Actual Start Date : 2020-11-25 00:00:00
"""
This script looks simple, but it took me a good amount of time to figure out a few things, so I'm posting it here.