I want to count the occurrences of certain words and names in a file. The code below incorrectly counts "fish and chips" as one case of "fish" and one case of "chips", instead of one count of "fish and chips".
ngh.txt = 'test file with words fish, steak fish chips fish and chips'
import re
from collections import Counter

wanted = '''
"fish and chips"
fish
chips
steak
'''

cnt = Counter()
words = re.findall(r'\w+', open('ngh.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print cnt
Output:
Counter({'fish': 3, 'chips': 2, 'and': 1, 'steak': 1})
What I want is:
Counter({'fish': 2, 'fish and chips': 1, 'chips': 1, 'steak': 1})
(And ideally, I can get the output like this:
fish: 2
fish and chips: 1
chips: 1
steak: 1
)
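For the display format alone, once a Counter such as cnt holds the right numbers, its most_common method prints that layout directly (a small sketch in the question's Python 2 syntax; it does not fix the counting itself):

for item, count in cnt.most_common():
    print '%s: %d' % (item, count)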
Definition:
Wanted item: A string that is being searched for within the text.
To count wanted items without re-counting them within longer wanted items, first count the number of times each one occurs within the string. Next, go through the wanted items from longest to shortest, and whenever a shorter wanted item occurs inside a longer one, subtract the longer item's count from the shorter item's count. For example, assume your wanted items are "a", "a b", and "a b c", and your text is "a/a/a b/a b c". Searching for each of those individually produces: { "a": 4, "a b": 2, "a b c": 1 }. The desired result is:
"a b c": 1
"a b": #("a b") - #("a b c") = 2 - 1 = 1
"a": #("a") - #("a b c") - #("a b") = 4 - 1 - 1 = 2
def get_word_counts(text, wanted):
    counts = {}  # The number of times each wanted item was found
    # Dictionary mapping word lengths onto wanted items
    # (in the form of a dictionary whose keys are wanted items)
    lengths = {}
    # Find the number of times each wanted item occurs
    for item in wanted:
        matches = re.findall('\\b' + item + '\\b', text)
        counts[item] = len(matches)
        l = len(item)  # Length of wanted item
        # No wanted item of the same length has been encountered yet
        if l not in lengths:
            # Create a new dictionary for items of the given length
            lengths[l] = {}
        # Add wanted item to the dictionary of items with the given length
        lengths[l][item] = 1
    # Get and sort lengths of wanted items from largest to smallest
    keys = lengths.keys()
    keys.sort(reverse=True)
    # Remove overlapping wanted items from the counts, working from
    # the longest strings to the shortest strings
    for i in range(1, len(keys)):
        for j in range(0, i):
            for i_item in lengths[keys[i]]:
                for j_item in lengths[keys[j]]:
                    # print str(i)+','+str(j)+': '+i_item+' , '+j_item
                    matches = re.findall('\\b' + i_item + '\\b', j_item)
                    counts[i_item] -= len(matches) * counts[j_item]
    return counts
The following code contains test cases:
tests = [
    {
        'text': 'test file with words fish, steak fish chips fish and '+
                'chips and fries',
        'wanted': ["fish and chips","fish","chips","steak"]
    },
    {
        'text': 'fish, fish and chips, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'My fish and chips and burgers. My fish and chips and '+
                'burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish","fish fish fish"]
    }
]
for i in range(0, len(tests)):
    test = tests[i]['text']
    print test
    print get_word_counts(test, tests[i]['wanted'])
    print ''
The output is as follows:
test file with words fish, steak fish chips fish and chips and fries
{'fish and chips': 1, 'steak': 1, 'chips': 1, 'fish': 2}
fish, fish and chips, fish and chips and burgers
{'fish and chips': 1, 'fish and chips and burgers': 1, 'fish': 1}
fish, fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 1, 'fish': 1}
My fish and chips and burgers. My fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 2, 'fish': 0}
fish fish fish
{'fish fish': 1, 'fish': 1}
fish fish fish
{'fish fish fish': 1, 'fish fish': 0, 'fish': 0}
So this solution works with your test data (and with some added terms to the test data, just to be thorough), though it can probably be improved upon.
The crux of it is to find occurrences of 'and' in the words list, then replace 'and' and its neighbours with a compound word (the neighbours concatenated around 'and'), adding this back to the list along with a copy of 'and'.
I also converted the 'wanted' string to a list to handle the 'fish and chips' string as a distinct item.
import re
from collections import Counter

# changed 'wanted' string to a list
wanted = ['fish and chips','fish','chips','steak', 'and']

cnt = Counter()
words = re.findall(r'\w+', open('ngh.txt').read().lower())

for word in words:
    # look for 'and', replace it and its neighbours with 'comp_word'
    # slice, concatenate, and append to make a new words list
    if word == 'and':
        and_pos = words.index('and')
        comp_word = str(words[and_pos-1]) + ' and ' + str(words[and_pos+1])
        words = words[:and_pos-1] + words[and_pos+2:]
        words.append(comp_word)
        words.append('and')

for word in words:
    if word in wanted:
        cnt[word] += 1

print cnt
The output from your text would be:
Counter({'fish': 2, 'and': 1, 'steak': 1, 'chips': 1, 'fish and chips': 1})
As noted in the comment above, it's unclear why you want/expect the output to be 2 for fish, 2 for chips, and 1 for fish-and-chips in your ideal output. I'm assuming it's a typo, since the output above it has 'chips': 1.
I am suggesting two algorithms that will work on any patterns and any file.
The first algorithm has a run time proportional to (number of characters in the file) * (number of patterns).
1> For every pattern, search all the other patterns and create a list of super-patterns. This can be done by matching one pattern, such as 'cat', against all patterns to be searched.
patterns = ['cat', 'cat and dogs', 'cat and fish']
superpattern['cat'] = ['cat and dogs', 'cat and fish']
2> Search for 'cat' in the file; let's say the result is cat_count.
3> Now search for every super-pattern of 'cat' in the file and get their counts:
for sp in superpattern['cat']:
    sp_count = number of matches of sp in the file
    cat_count = cat_count - sp_count
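Here is a minimal runnable sketch of this brute-force idea (my own illustration; count_patterns is a hypothetical helper name, and processing the patterns from longest to shortest keeps the arithmetic consistent with the worked example above):

import re

def count_patterns(text, patterns):
    # Raw hit counts for every pattern, using word boundaries.
    counts = {p: len(re.findall(r'\b' + re.escape(p) + r'\b', text))
              for p in patterns}
    # Longest first, so a pattern's count is already final before any
    # shorter pattern contained in it gets adjusted.
    for p in sorted(patterns, key=len, reverse=True):
        for sp in patterns:
            if len(sp) > len(p):
                inside = len(re.findall(r'\b' + re.escape(p) + r'\b', sp))
                counts[p] -= inside * counts[sp]
    return counts

print(count_patterns('cat, cat and dogs, cat and fish',
                     ['cat', 'cat and dogs', 'cat and fish']))
# -> counts: cat 1, cat and dogs 1, cat and fish 1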
This is a general solution that is brute force. We should be able to come up with a linear-time solution if we arrange the patterns in a trie.
Root-->f-->i-->s-->h-->a and so on.
Now when you are at the h of fish and you do not get an a, increment fish_count and go back to the root. If you get 'a', continue. Any time you get something unexpected, increment the count of the most recently found pattern and go back to the root, or to some other node (the node for the longest matched prefix that is a suffix of the current match). This is the Aho-Corasick algorithm; you can look it up on Wikipedia or at:
http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
This solution is linear in the number of characters in the file.
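For a concrete starting point, the third-party pyahocorasick package implements this automaton; the sketch below (my own, assuming the package is installed via pip install pyahocorasick) makes the single linear pass. Note that Automaton.iter reports every match, including overlapping shorter patterns and matches inside other words, so the longest-match bookkeeping or the subtraction step above is still needed afterwards.

import ahocorasick  # third-party: pip install pyahocorasick

patterns = ['fish and chips', 'fish', 'chips', 'steak']
A = ahocorasick.Automaton()
for p in patterns:
    A.add_word(p, p)  # store each pattern as its own payload
A.make_automaton()  # build the failure links

text = 'test file with words fish, steak fish chips fish and chips'
raw = dict.fromkeys(patterns, 0)
for end_index, p in A.iter(text):  # one pass over the text
    raw[p] += 1
print(raw)  # raw overlapping counts; apply the subtraction step to de-duplicate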
Related
If I have a list of sentences, I need to traverse each sentence and check whether any two sentences share the same word. If they do, the word in the second sentence should be replaced with a third, initialized word (var3), which is a common replacement word. For example: "Rahul is eating an apple. Rahul drinks milk." Output: "Rahul is eating an apple. He is drinking milk."
var3 = 'तो'  # word to replace if words are same
summary = ['Rahul drinks milk', 'Rahul eats rice', 'Seema is going to the market']
for sent in summary:
    occurences = [index for index, value in enumerate(summary) if value == sent]
    if len(occurences) > 1:
        for i in range(len(summary)):
            for word in i:
                var1 = sent[i]
                var2 = sent[i+1]
                if var1 == var2:
                    var3 = var1
summary is the list of sentences. In this case there are three sentences, where "Rahul" is the same in two sentences, so the word in the second sentence should be replaced.
Can somebody please help me out with this?
class People():
    def __init__(self, name, replace_with):
        self.name = name
        self.replace_with = replace_with
        self.first_encountered = False
    def __str__(self):
        return self.name + " -- " + str(self.first_encountered)

sentences = ["Rahul is eating an apple.",
             "Rahul drinks milk.",
             "Rahul also drinks Beer.",
             "Rahul likes Pizza",
             "Seema is going to the market",
             "Seema also drinks beer",
             "and i am going to hell"
             ]
names = ["Rahul", "Seema"]
replaces = ["He", "She"]
people = [People(n, r) for n, r in zip(names, replaces)]
new_sentence = []
found_in_any = [False, False]
for sentence in sentences:
    for index, person in enumerate(people):
        if sentence.find(person.name) != -1:
            found_in_any[index] = True
            if not person.first_encountered:
                person.first_encountered = True
                new_sentence.append(sentence)
                continue
            if person.first_encountered:
                new_sentence.append(sentence.replace(person.name, person.replace_with))
        else:
            found_in_any[index] = False
        if len(list(set(found_in_any))) == 1 and list(set(found_in_any))[0] == False:
            new_sentence.append(sentence)
print(new_sentence)
output : ['Rahul is eating an apple.',
'He drinks milk.',
'He also drinks Beer.',
'He likes Pizza',
'Seema is going to the market',
'Seema is going to the market',
'She also drinks beer',
'and i am going to hell']
Here is a suggested solution:
sen1 = "Rahul is eating an apple"
sen2 = "Rahul drinks milk"
var = "He"
for i in sen1.split(" "):
    if i in sen2.split(" "):
        sen2 = sen2.replace(i, var)
print(sen1)
print(sen2)
Output:
Rahul is eating an apple
He drinks milk
# assumed from the output below: replace "Rahul" with "He"
replace = "Rahul"
replace_with = "He"
sentences = ["Rahul is eating an apple.",
             "Rahul drinks milk.",
             "Rahul also drinks Beer.",
             "Rahul likes Pizza",
             "Seema is going to the market"]
new_sentence = []
first_encountered = False
for sentence in sentences:
    if sentence.find(replace) != -1:
        if not first_encountered:
            first_encountered = True
            new_sentence.append(sentence)
            continue
        if first_encountered:
            new_sentence.append(sentence.replace(replace, replace_with))
    else:
        new_sentence.append(sentence)
print(new_sentence)
Output :
['Rahul is eating an apple.',
'He drinks milk.',
'He also drinks Beer.',
'He likes Pizza',
'Seema is going to the market']
Hi, so I've been trying to count the elements in the list that I have made, but when I do it, it doesn't work.
The result should be:
a 2
above 2
across 1
etc.
Here's what I've got:
word = []
with open('Lateralus.txt', 'r') as my_file:
    for line in my_file:
        temporary_holder = line.split()
        for i in temporary_holder:
            word.append(i)
for i in range(0, len(word)):
    word[i] = word[i].lower()
word.sort()
for count in word:
    if count in word:
        word[count] = word[count] + 1
    else:
        word[count] = 1
for (word, many) in word.items():
    print('{:20}{:1}'.format(word, many))
@Kimberly, as I understood from your code, you want to read a text file of alphabetic characters. You also want to ignore the case of the alphabetic characters in the file. Finally, you want to count the occurrences of each unique letter in the text file.
I suggest you use a dictionary for this. I have written sample code for this task which satisfies the following 3 conditions (please comment if you want a different result, providing inputs and expected outputs, and I will update my code based on that):
1. Reads the text file and creates a single line of text by removing any spaces in between.
2. Converts upper-case letters to lower case.
3. Creates a dictionary containing the unique letters with their frequencies.
» Lateralus.txt
abcdefghijK
ABCDEfgkjHI
IhDcabEfGKJ
mkmkmkmkmoo
pkdpkdpkdAB
A B C D F Q
ab abc ab c
» Code
import json

char_occurences = {}
with open('Lateralus.txt', 'r') as file:
    all_lines_combined = ''.join([line.replace(' ', '').strip().lower()
                                  for line in file.readlines()])
print all_lines_combined  # abcdefghijkabcdefgkjhiihdcabefgkjmkmkmkmkmoopkdpkdpkdababcdfqababcabc
print len(all_lines_combined)  # 69 (7 lines of 11 characters, 8 spaces => 77-8 = 69)

while all_lines_combined:
    ch = all_lines_combined[0]
    char_occurences[ch] = all_lines_combined.count(ch)
    all_lines_combined = all_lines_combined.replace(ch, '')

# Pretty-print the char_occurences dictionary containing occurrences of
# alphabetic characters in the text file
print json.dumps(char_occurences, indent=4)
"""
{
"a": 8,
"c": 6,
"b": 8,
"e": 3,
"d": 7,
"g": 3,
"f": 4,
"i": 3,
"h": 3,
"k": 10,
"j": 3,
"m": 5,
"o": 2,
"q": 1,
"p": 3
}
"""
I have written the code below. It works without errors; the problem I am facing is that if two words in a sentence are repeated the same number of times, the code does not return the first word in alphabetical order. Can anyone please suggest any alternatives? This code is going to be evaluated in Python 2.7.
"""Quiz: Most Frequent Word"""
def most_frequent(s):
"""Return the most frequently occuring word in s."""
""" Step 1 - The following assumptions have been made:
- Space is the default delimiter
- There are no other punctuation marks that need removing
- Convert all letters into lower case"""
word_list_array = s.split()
"""Step 2 - sort the list alphabetically"""
word_sort = sorted(word_list_array, key=str.lower)
"""Step 3 - count the number of times word has been repeated in the word_sort array.
create another array containing the word and the frequency in which it is repeated"""
wordfreq = []
freq_wordsort = []
for w in word_sort:
wordfreq.append(word_sort.count(w))
freq_wordsort = zip(wordfreq, word_sort)
"""Step 4 - output the array having the maximum first index variable and output the word in that array"""
max_word = max(freq_wordsort)
word = max_word[-1]
result = word
return result
def test_run():
"""Test most_frequent() with some inputs."""
print most_frequent("london bridge is falling down falling down falling down london bridge is falling down my fair lady") # output: 'bridge'
print most_frequent("betty bought a bit of butter but the butter was bitter") # output: 'butter'
if __name__ == '__main__':
test_run()
Without messing too much around with your code, I find that a good solution can be achieved through the use of the index method.
After having found the highest frequency (max_word), you simply call the index method on wordfreq, providing max_word as input, which returns its position in the list; then you return the word at this index in word_sort.
Code example is below (I removed the zip function as it is not needed anymore, and added two simpler examples):
"""Quiz: Most Frequent Word"""
def most_frequent(s):
"""Return the most frequently occuring word in s."""
""" Step 1 - The following assumptions have been made:
- Space is the default delimiter
- There are no other punctuation marks that need removing
- Convert all letters into lower case"""
word_list_array = s.split()
"""Step 2 - sort the list alphabetically"""
word_sort = sorted(word_list_array, key=str.lower)
"""Step 3 - count the number of times word has been repeated in the word_sort array.
create another array containing the word and the frequency in which it is repeated"""
wordfreq = []
# freq_wordsort = []
for w in word_sort:
wordfreq.append(word_sort.count(w))
# freq_wordsort = zip(wordfreq, word_sort)
"""Step 4 - output the array having the maximum first index variable and output the word in that array"""
max_word = max(wordfreq)
word = word_sort[wordfreq.index(max_word)] # <--- solution!
result = word
return result
def test_run():
"""Test most_frequent() with some inputs."""
print(most_frequent("london bridge is falling down falling down falling down london bridge is falling down my fair lady")) # output: 'down'
print(most_frequent("betty bought a bit of butter but the butter was bitter")) # output: 'butter'
print(most_frequent("a a a a b b b b")) #output: 'a'
print(most_frequent("z z j j z j z j")) #output: 'j'
if __name__ == '__main__':
test_run()
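For comparison, here is a sketch of the same logic with collections.Counter (my own variant, not from the answer above): keying min() on (-count, word) picks the highest count and breaks ties by taking the alphabetically first word.

from collections import Counter

def most_frequent(s):
    counts = Counter(s.lower().split())
    # min() with a (-count, word) key prefers higher counts first,
    # then the alphabetically earlier word on ties.
    word, _ = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return word

print(most_frequent("z z j j z j z j"))  # j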
I am trying to take a list of sentences and split each list into new lists containing the words of each sentence.
def create_list_of_words(file_name):
    for word in file_name:
        word_list = word.split()
    return word_list

sentence = ['a frog ate the dog']
x = create_list_of_words(sentence)
print x
This is fine as my output is
['a', 'frog', 'ate', 'the', 'dog']
However, when I try to do a list of sentences it no longer reacts the same.
my_list = ['the dog hates you', 'you love the dog', 'a frog ate the dog']
for i in my_list:
    x = create_list_of_words(i)
    print x
Now my output is no longer what I expect.
You had a few issues in your second script:
i is 'the dog hates you', while in the first script the parameter was ['a frog ate the dog'] -> one is a string and the other is a list.
word_list = word.split(): with this line inside the loop, you re-instantiate word_list on each iteration; instead, use the append function as I did in my code sample.
When sending a string to the function, you need to split the string before the word loop.
Try this:
def create_list_of_words(str_sentence):
    sentence = str_sentence.split()
    word_list = []
    for word in sentence:
        word_list.append(word)
    return word_list

li_sentence = ['the dog hates you', 'you love the dog', 'a frog ate the dog']
for se in li_sentence:
    x = create_list_of_words(se)
    print x
I have a character vector like this:
text <- c("Car", "Ca-R", "My Car", "I drive cars", "Chars", "CanCan")
I would like to match a pattern so that it is only matched once, with at most one substitution/insertion. The result should look like this:
> "Car"
I tried the following to match my pattern only once, with at most one substitution/insertion etc., and I get the following:
> agrep("ca?", text, ignore.case = T, max = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = T)
[1] "Car" "Ca-R" "My Car" "I drive cars" "CanCan"
Is there a way to exclude the strings which are n-characters longer than my pattern?
An alternative which replaces agrep with adist:
text[which(adist("ca?", text, ignore.case=TRUE) <= 1)]
adist gives the number of insertions/deletions/substitutions required to convert one string to another, so keeping only elements with an adist of one or less should give you what you want, I think.
This answer is probably less appropriate if you really want to exclude things "n-characters longer" than the pattern (with n being variable), rather than just match whole words (where n is always 1 in your example).
You can use nchar to limit the strings based on their length:
pattern <- "ca?"
matches <- agrep(pattern, text, ignore.case = T, max = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = T)
n <- 4
matches[nchar(matches) < n+nchar(pattern)]
# [1] "Car" "Ca-R" "My Car" "CanCan"