Python - Obtain the most frequent word in a sentence, if there is a tie return the word that appears first in alphabetical order - python-2.7

I have written the code below. It runs without errors; the problem I am facing is that if two words in a sentence are repeated the same number of times, the code does not return the first word in alphabetical order. Can anyone please suggest any alternatives? This code is going to be evaluated in Python 2.7.
"""Quiz: Most Frequent Word"""
def most_frequent(s):
"""Return the most frequently occuring word in s."""
""" Step 1 - The following assumptions have been made:
- Space is the default delimiter
- There are no other punctuation marks that need removing
- Convert all letters into lower case"""
word_list_array = s.split()
"""Step 2 - sort the list alphabetically"""
word_sort = sorted(word_list_array, key=str.lower)
"""Step 3 - count the number of times word has been repeated in the word_sort array.
create another array containing the word and the frequency in which it is repeated"""
wordfreq = []
freq_wordsort = []
for w in word_sort:
wordfreq.append(word_sort.count(w))
freq_wordsort = zip(wordfreq, word_sort)
"""Step 4 - output the array having the maximum first index variable and output the word in that array"""
max_word = max(freq_wordsort)
word = max_word[-1]
result = word
return result
def test_run():
"""Test most_frequent() with some inputs."""
print most_frequent("london bridge is falling down falling down falling down london bridge is falling down my fair lady") # output: 'bridge'
print most_frequent("betty bought a bit of butter but the butter was bitter") # output: 'butter'
if __name__ == '__main__':
test_run()

Without changing your code too much, I find that a good solution can be achieved through the use of the index method.
After finding the highest frequency (max_word), you simply call the index method on wordfreq with max_word as input, which returns the position of its first occurrence in the list; since word_sort is sorted alphabetically, the word at that same index in word_sort is the alphabetically first word among any ties.
A code example is below (I removed the zip function as it is no longer needed, and added two simpler examples):
"""Quiz: Most Frequent Word"""
def most_frequent(s):
"""Return the most frequently occuring word in s."""
""" Step 1 - The following assumptions have been made:
- Space is the default delimiter
- There are no other punctuation marks that need removing
- Convert all letters into lower case"""
word_list_array = s.split()
"""Step 2 - sort the list alphabetically"""
word_sort = sorted(word_list_array, key=str.lower)
"""Step 3 - count the number of times word has been repeated in the word_sort array.
create another array containing the word and the frequency in which it is repeated"""
wordfreq = []
# freq_wordsort = []
for w in word_sort:
wordfreq.append(word_sort.count(w))
# freq_wordsort = zip(wordfreq, word_sort)
"""Step 4 - output the array having the maximum first index variable and output the word in that array"""
max_word = max(wordfreq)
word = word_sort[wordfreq.index(max_word)] # <--- solution!
result = word
return result
def test_run():
"""Test most_frequent() with some inputs."""
print(most_frequent("london bridge is falling down falling down falling down london bridge is falling down my fair lady")) # output: 'down'
print(most_frequent("betty bought a bit of butter but the butter was bitter")) # output: 'butter'
print(most_frequent("a a a a b b b b")) #output: 'a'
print(most_frequent("z z j j z j z j")) #output: 'j'
if __name__ == '__main__':
test_run()
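For comparison, the whole quiz can also be solved with a single min over the distinct words, keyed on (-count, word): larger counts sort first, and the word itself breaks ties alphabetically. This is only a minimal sketch, not the structure the quiz asks for; the name most_frequent_compact is just for illustration, and it assumes the same lower-case, space-delimited input as above:

def most_frequent_compact(s):
    """Return the most frequent word in s; ties go to the alphabetically first word."""
    words = s.split()
    # -count puts higher frequencies first; the word itself breaks ties.
    return min(set(words), key=lambda w: (-words.count(w), w))

print(most_frequent_compact("z z j j z j z j"))  # 'j' (tie between 'z' and 'j')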

Related

Find value of a string given a superstring regex

How can I match a string that is a substring of a given input string, preferably with regex?
Given a value: A789Lfu891MatchMe2ENOTSTH, construct a regex that would match a string where the string is a substring of the given value.
Expected matches:
MatchMe
ENOTST
891
Expected Non Match
foo
A789L<fu891MatchMe2ENOTSTH_extra
extra_A789L<fu891MatchMe2ENOTSTH
extra_A789L<fu891MatchMe2ENOTSTH_extra
It seems easier for me to do the reverse: /\w*MatchMe\w*/, but I can't wrap my head around this problem.
Something like how SQL would do it:
SELECT * FROM my_table mt WHERE 'A789Lfu891MatchMe2ENOTSTH' LIKE '%' || mt.foo || '%';
Prefix suffixes
You could alternate prefix suffixes, like turn the superstring abcd into a pattern like ^(a|(a)?b|((a)?b)?c|(((a)?b)?c)?d)$. For your example, the pattern has 1253 characters (exactly 2000 fewer than @tobias_k's).
Python code to produce the regex; it can then be tested with tobias_k's code (try it online):
from itertools import accumulate
t = "A789Lfu891MatchMe2ENOTSTH"
p = '^(' + '|'.join(accumulate(t, '({})?{}'.format)) + ')$'
Suffix prefixes
Suffix prefixes look nicer and match faster: ^(a(b(c(d)?)?)?|b(c(d)?)?|c(d)?|d)$. Sadly the generating code is less elegant.
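For completeness, the generating code (the same line used in the benchmark below) builds the suffix prefixes by accumulating over the reversed string and then reversing the parts back:

from itertools import accumulate

t = "A789Lfu891MatchMe2ENOTSTH"
p = '^(' + '|'.join(list(accumulate(t[::-1], '{1}({0})?'.format))[::-1]) + ')$'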
Divide and conquer
For a shorter regex, we can use divide and conquer. For example, for the superstring abcdefg, every substring falls into one of three cases:
- It contains the middle character (the d). Pattern for that: ((a?b)?c)?d(e(fg?)?)?
- It lies left of that middle character, so recursively build a regex for the superstring abc: a|a?bc?|c.
- It lies right of that middle character, so recursively build a regex for the superstring efg: e|e?fg?|g.
Then make an alternation of those three cases:
a|a?bc?|c|((a?b)?c)?d(e(fg?)?)?|e|e?fg?|g
Regex length will be Θ(n log n) instead of our previous Θ(n²).
The result for your superstring example of 25 characters is this regex with 301 characters:
^(A|A?78?|8|((A?7)?8)?9(Lf?)?|Lf?|f|(((((A?7)?8)?9)?L)?f)?u(8(9(1(Ma?)?)?)?)?|89?|9|(8?9)?1(Ma?)?|Ma?|a|(((((((((((A?7)?8)?9)?L)?f)?u)?8)?9)?1)?M)?a)?t(c(h(M(e(2(E(N(O(T(S(TH?)?)?)?)?)?)?)?)?)?)?)?|c|c?hM?|M|((c?h)?M)?e(2E?)?|2E?|E|(((((c?h)?M)?e)?2)?E)?N(O(T(S(TH?)?)?)?)?|OT?|T|(O?T)?S(TH?)?|TH?|H)$
Benchmark
Speed benchmarks don't really make much sense here, since in reality we'd just do a regular substring check (in Python, s in t), but let's do one anyway.
Results for matching all substrings of your superstring, using Python 3.9.6 on my PC:
1.09 ms just_all_substrings
25.18 ms prefix_suffixes
3.47 ms suffix_prefixes
3.46 ms divide_and_conquer
And on TIO and their "Python 3.8 (pre-release)" with quite different results:
0.30 ms just_all_substrings
46.90 ms prefix_suffixes
2.24 ms suffix_prefixes
2.95 ms divide_and_conquer
Regex lengths (also printed by the below benchmark code):
3253 characters - just_all_substrings
1253 characters - prefix_suffixes
1253 characters - suffix_prefixes
301 characters - divide_and_conquer
Benchmark code (Try it online!):
from timeit import repeat
import re
from itertools import accumulate


def just_all_substrings(t):
    return "^(" + '|'.join(t[i:k] for i in range(0, len(t))
                           for k in range(i+1, len(t)+1)) + ")$"


def prefix_suffixes(t):
    return '^(' + '|'.join(accumulate(t, '({})?{}'.format)) + ')$'


def suffix_prefixes(t):
    return '^(' + '|'.join(list(accumulate(t[::-1], '{1}({0})?'.format))[::-1]) + ')$'


def divide_and_conquer(t):
    def suffixes(t):
        # Example: abc => ((a?b)?c)?
        regex = f'{t[0]}?'
        for c in t[1:]:
            regex = f'({regex}{c})?'
        return regex

    def prefixes(t):
        # Example: efg => (e(fg?)?)?
        regex = f'{t[-1]}?'
        for c in t[-2::-1]:
            regex = f'({c}{regex})?'
        return regex

    def superegex(t):
        n = len(t)
        if n == 1:
            return t
        if n == 2:
            return f'{t}?|{t[1]}'
        mid = n // 2
        contain = suffixes(t[:mid]) + t[mid] + prefixes(t[mid+1:])
        left = superegex(t[:mid])
        right = superegex(t[mid+1:])
        return '|'.join([left, contain, right])

    return '^(' + superegex(t) + ')$'


creators = just_all_substrings, prefix_suffixes, suffix_prefixes, divide_and_conquer

t = "A789Lfu891MatchMe2ENOTSTH"
substrings = [t[start:stop]
              for start in range(len(t))
              for stop in range(start+1, len(t)+1)]


def test(p):
    match = re.compile(p).match
    return all(map(match, substrings))


for creator in creators:
    print(test(creator(t)), creator.__name__)
print()

print('Regex lengths:')
for creator in creators:
    print('%5d characters -' % len(creator(t)), creator.__name__)
print()

for _ in range(3):
    for creator in creators:
        p = creator(t)
        number = 10
        time = min(repeat(lambda: test(p), number=number)) / number
        print('%5.2f ms ' % (time * 1e3), creator.__name__)
    print()
One way to "construct" such a regex would be to build a disjunction of all possible substrings of the original value. Example in Python:
import re

t = "A789Lfu891MatchMe2ENOTSTH"
p = "^(" + '|'.join(t[i:k] for i in range(0, len(t))
                    for k in range(i+1, len(t)+1)) + ")$"

good = ["MatchMe", "ENOTST", "891"]
bad = ["foo", "A789L<fu891MatchMe2ENOTSTH_extra",
       "extra_A789L<fu891MatchMe2ENOTSTH",
       "extra_A789L<fu891MatchMe2ENOTSTH_extra"]

assert all(re.match(p, s) is not None for s in good)
assert all(re.match(p, s) is None for s in bad)
For the value "abcd", this would e.g. be "^(a|ab|abc|abcd|b|bc|bcd|c|cd|d)$"; for the given example it's a bit longer, with 3253 characters...

Pseudo code to find number of occurrence of characters in a documents

I am trying to write pseudo-code for a MapReduce technique where I need to find the number of occurrences of each character in a document. For example:
m: 1000 times, M: 5000 times, ' ' (space): 3000 times, \n: 100 times, '.': 20000 times, etc.
Can someone please let me know if this is correct, or whether I can make it better?
I have written the pseudo-code shown below:
def Map(documentName, documentContent)
    For Character in documentContent
        EmitIntermediate(Character, 1)

def Reduce(Character, Counts)
    Char_Count = 0
    For count in Counts
        Char_Count += count
    Emit(Character, Char_Count)
I referred to some of the pseudo-code available online for the MapReduce technique and wrote this one.
For example, the following pseudo-code is used to find the number of occurrences of each word in a document:
def map(documentName, documentContent):
    for line in documentContent:
        words = line.split(" ")
        for word in words:
            EmitIntermediate(word, 1)

def reduce(word, counts):
    wordCount = 0
    for count in counts:
        wordCount += count
    Emit(word, wordCount)
A line-by-line variant of the same pseudo-code would be:
def Map(documentName, documentContent)
    For line in documentContent
        Line_String = line
        For Character in Line_String
            EmitIntermediate(Character, 1)

def Reduce(Character, Counts)
    Char_Count = 0
    For count in Counts
        Char_Count += count
    Emit(Character, Char_Count)
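Not part of the question, but as a sanity check, here is a minimal runnable Python simulation of the character-count pseudo-code, with EmitIntermediate modeled as appending to a dict of lists (the shuffle step) and Emit as printing. It is only a sketch of the MapReduce semantics, not a real framework:

from collections import defaultdict

def map_phase(document_content):
    # Emit (character, 1) for every character, as in Map above.
    for character in document_content:
        yield (character, 1)

def reduce_phase(character, counts):
    # Sum the counts for one character, as in Reduce above.
    return character, sum(counts)

# Simulated shuffle: group the intermediate pairs by key.
intermediate = defaultdict(list)
for character, one in map_phase("mM mM.\n"):
    intermediate[character].append(one)

for character, counts in intermediate.items():
    print(reduce_phase(character, counts))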

Is there a pythonic way to count the number of leading matching characters in two strings?

For two given strings, is there a pythonic way to count how many consecutive characters of both strings (starting at position 0 of each string) are identical?
For example, in aaa_Hello and aa_World the "leading matching characters" are aa, with a length of 2. For another and example there are no leading matching characters, which gives a length of 0.
I have written a function to achieve this, which uses a for loop and thus seems very unpythonic to me:
def matchlen(string0, string1):  # Note: does not work if a string is ''
    for counter in range(min(len(string0), len(string1))):
        # run until there is a mismatch between the characters in the strings
        if string0[counter] != string1[counter]:
            # in this case the function terminates
            return counter
    return counter + 1
matchlen(string0='aaa_Hello', string1='aa_World') # returns 2
matchlen(string0='another', string1='example') # returns 0
You could use zip and enumerate:
def matchlen(str1, str2):
    i = -1  # needed if you don't enter the loop (an empty string)
    for i, (char1, char2) in enumerate(zip(str1, str2)):
        if char1 != char2:
            return i
    return i + 1
An unexpected function in os.path, commonprefix, can help (because it is not limited to file paths, any strings work). It can also take in more than 2 input strings.
Return the longest path prefix (taken character-by-character) that is a prefix of all paths in list. If list is empty, return the empty string ('').
from os.path import commonprefix
print(len(commonprefix(["aaa_Hello","aa_World"])))
output:
2
from itertools import takewhile

common_prefix_length = sum(
    1 for _ in takewhile(lambda x: x[0] == x[1], zip(string0, string1)))
zip will pair up letters from the two strings; takewhile will yield them as long as they're equal; and sum will see how many there are.
As bobble bubble says, this indeed does exactly the same thing as your loopy thing. Its sole pro (and also its sole con) is that it is a one-liner. Take it as you will.
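For example, a quick check against the question's first pair of inputs:

from itertools import takewhile

string0, string1 = 'aaa_Hello', 'aa_World'
print(sum(1 for _ in takewhile(lambda x: x[0] == x[1], zip(string0, string1))))  # 2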

How to write RNG code in Python 2.7 that writes Shakespeare

For fun, I'm trying to write code in Python that associates a random number with a letter of the alphabet or a punctuation mark and adds that letter to a list. I then want the code to keep making new lists of random letters until it outputs "to be or not to be, that is the question." Then I want to print that list and see how many evaluations it took. This is what I have so far.
from random import *

alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',',',' ','.']
sentence = []
numbers = []

def random(x):
    randval = x
    return randval

count = 0
for i in range(1000):  # trying to place an upper bound on how many times to try
    for i in range(41):  # the number of characters in the sentence
        randomness = random(randint(0,28))  # the number of entries in the alphabet list
        numbers.append(randomness)
    for i in numbers:
        count += 1
        sentence.append(alphabet[i])
    if sentence != ['t','o',' ','b','e',' ','o','r',' ','n','o','t',' ','t','o',' ','b','e',',',' ','t','h','a','t',' ','i','s','t','h','e',' ','q','u','e','s','t','i','o','n','.']:
        sentence = []  # This is supposed to empty the list if it gets the wrong order, but doesn't quite do that.
    if sentence == ['t','o',' ','b','e',' ','o','r',' ','n','o','t',' ','t','o',' ','b','e',',',' ','t','h','a','t',' ','i','s','t','h','e',' ','q','u','e','s','t','i','o','n','.']:
        print sentence
        print count
        break

new_sentence = ''.join(sentence)
print new_sentence
I'm not sure what I'm doing wrong. The list size keeps blowing up instead of keeping a length of 41. Suggestions?
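For what it's worth, the growth comes from numbers and sentence never being cleared between attempts, so each pass through the outer loop appends another 41 entries. A minimal sketch of the intended loop (Python 2.7; the target string and per-character counting are assumptions based on the question) builds a fresh candidate each attempt. Note that a random match on a 41-character sentence is astronomically unlikely within any practical bound:

from random import randint

alphabet = list('abcdefghijklmnopqrstuvwxyz, .')  # 29 entries, as in the question
target = 'to be or not to be, that is the question.'

count = 0
for attempt in range(1000):  # upper bound on attempts
    # build a fresh candidate every attempt instead of appending forever
    sentence = [alphabet[randint(0, len(alphabet) - 1)]
                for _ in range(len(target))]
    count += len(target)  # evaluations: one per character drawn
    if ''.join(sentence) == target:
        print ''.join(sentence)
        print count
        break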

How can I correctly print out this dictionary so that each word is sorted by the number of times (frequency) it appears in the text?

How can I correctly print out this dictionary so that each word is sorted by the number of times (frequency) it appears in the text?
slova = dict()
for line in text:
    line = re.split('[^a-z]', text)
    line[i] = filter(None, line)
    i =+ 1

i = 0
for line in text:
    for word in line:
        if word not in slova:
            slova[word] = i
            i += 1
I'm not sure what your text looks like, and you also haven't provided example output, but here is my guess. If this doesn't help, please update your question and I'll try again. The code makes use of Counter from collections to do all the heavy lifting. First, all of the words in all of the lines of the text are flattened into a single list; then this list is simply passed to Counter. The keys of the Counter (the words) are then sorted by their counts and printed out.
CODE:
from collections import Counter
import re

text = ['hello hi hello yes hello',
        'hello hi hello yes hello']

all_words = [w for l in text for w in re.split('[^a-z]', l)]
word_counts = Counter(all_words)
sorted_words = sorted(word_counts.keys(),
                      key=lambda k: word_counts[k],
                      reverse=True)

# Print out the word and counts
for word in sorted_words:
    print word, word_counts[word]
OUTPUT:
hello 6
yes 2
hi 2