Pseudocode to count the occurrences of each character in a document - MapReduce

I am trying to write pseudocode for a MapReduce job that counts the number of occurrences of each character in a document. For example:
m: 1000 times, M: 5000 times, " ": 3000 times, \n: 100 times, .: 20000 times, etc.
Can someone please tell me whether this is correct, or how I can improve it?
I have written the pseudocode shown below:
def Map(documentName, documentContent)
    For Character in documentContent
        EmitIntermediate(Character, 1)
def Reduce(Character, Counts)
    Char_Count = 0
    For count in Counts
        Char_Count += count
    Emit(Character, Char_Count)
I based this on some MapReduce pseudocode that I found online.
For example, the following pseudocode is used there to count the occurrences of each word in a document:
def map(documentName, documentContent):
    for line in documentContent:
        words = line.split(" ")
        for word in words:
            EmitIntermediate(word, 1)
def reduce(word, counts):
    wordCount = 0
    for count in counts:
        wordCount += count
    Emit(word, wordCount)

def Map(documentName, documentContent)
    For line in documentContent
        Line_String = line
        For Character in Line_String
            EmitIntermediate(Character, 1)
def Reduce(Character, Counts)
    Char_Count = 0
    For count in Counts
        Char_Count += count
    Emit(Character, Char_Count)
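One way to sanity-check the pseudocode is to simulate the map, shuffle, and reduce steps in plain Python. This is only a single-process sketch, not a real MapReduce framework; the names map_chars, reduce_counts, and run_job, and the two sample documents, are made up for illustration.

```python
from collections import defaultdict

def map_chars(document_name, document_content):
    """Map step: emit (character, 1) for every character in the document."""
    for character in document_content:
        yield character, 1

def reduce_counts(character, counts):
    """Reduce step: sum all the 1s emitted for one character."""
    return character, sum(counts)

def run_job(documents):
    """Hypothetical driver: groups intermediate pairs by key, then reduces."""
    grouped = defaultdict(list)
    for name, content in documents.items():
        for key, value in map_chars(name, content):
            grouped[key].append(value)  # the "shuffle" phase
    return dict(reduce_counts(key, values) for key, values in grouped.items())

# Sample input invented for this sketch.
result = run_job({"doc1": "Mm m\n", "doc2": "m."})
print(result)
```

Note that upper and lower case, spaces, and newlines are all counted as distinct keys, which matches the example in the question.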

Related

How to write RNG code in Python 2.7 that writes Shakespeare

For fun, I'm trying to write a Python program that associates a random number with a letter of the alphabet or a punctuation mark and adds that letter to a list. I then want the code to keep making new lists of random letters until it outputs "to be or not to be, that is the question." I then want to print that list and see how many evaluations it took. This is what I have so far:
from random import *
alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',',',' ','.']
sentence = []
numbers = []
def random(x):
    randval = x
    return randval
count = 0
for i in range(1000): # trying to place an upper bound on how many times to try
    for i in range(41): # the number of characters in the sentence
        randomness = random(randint(0,28)) # the number of entries in the alphabet list
        numbers.append(randomness)
    for i in numbers:
        count += 1
        sentence.append(alphabet[i])
    if sentence!=['t','o',' ','b','e',' ','o','r',' ','n','o','t',' ','t','o',' ','b','e',',',' ','t','h','a','t',' ','i','s','t','h','e',' ','q','u','e','s','t','i','o','n','.']:
        sentence = [] ### This is supposed to empty the list if it gets the wrong order, but doesn't quite do that.
    if sentence == ['t','o',' ','b','e',' ','o','r',' ','n','o','t',' ','t','o',' ','b','e',',',' ','t','h','a','t',' ','i','s','t','h','e',' ','q','u','e','s','t','i','o','n','.']:
        print sentence
        print count
        break
new_sentence = ''.join(sentence)
print new_sentence
I'm not sure what I'm doing wrong. The list size keeps blowing up instead of keeping a length of 41. Any suggestions?
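The list grows because numbers is never cleared: every pass of the outer loop appends 41 more indices, and sentence is then rebuilt from all of them. A minimal sketch of the same idea that builds a fresh candidate on every try is shown here. It is written to run on Python 3, and the function name, the short target "to", and the fixed seed are assumptions chosen so the demo finishes quickly.

```python
import random

# Same 29 symbols as the question's alphabet list.
ALPHABET = "abcdefghijklmnopqrstuvwxyz, ."

def attempts_until_match(target, max_tries, rng):
    """Draw len(target) random characters per try.

    Returns the number of tries needed to hit target, or None if
    max_tries is exhausted first.
    """
    for attempt in range(1, max_tries + 1):
        # A new candidate is built each time, so nothing carries over.
        candidate = "".join(rng.choice(ALPHABET) for _ in range(len(target)))
        if candidate == target:
            return attempt
    return None

rng = random.Random(0)  # fixed seed only to make the demo repeatable
tries = attempts_until_match("to", 100000, rng)
print(tries)
```

With 29 symbols, a 41-character target has 29**41 equally likely candidates, so an upper bound of 1000 tries will not realistically reach the full sentence; a short target is the practical way to see the loop terminate.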

Python - Obtain the most frequent word in a sentence, if there is a tie return the word that appears first in alphabetical order

I have written the code below. It runs without errors, but if two words in a sentence are repeated the same number of times, the code does not return the one that comes first in alphabetical order. Can anyone please suggest an alternative? This code is going to be evaluated in Python 2.7.
"""Quiz: Most Frequent Word"""
def most_frequent(s):
    """Return the most frequently occurring word in s."""
    """ Step 1 - The following assumptions have been made:
    - Space is the default delimiter
    - There are no other punctuation marks that need removing
    - Convert all letters into lower case"""
    word_list_array = s.split()
    """Step 2 - sort the list alphabetically"""
    word_sort = sorted(word_list_array, key=str.lower)
    """Step 3 - count the number of times word has been repeated in the word_sort array.
    create another array containing the word and the frequency in which it is repeated"""
    wordfreq = []
    freq_wordsort = []
    for w in word_sort:
        wordfreq.append(word_sort.count(w))
    freq_wordsort = zip(wordfreq, word_sort)
    """Step 4 - output the array having the maximum first index variable and output the word in that array"""
    max_word = max(freq_wordsort)
    word = max_word[-1]
    result = word
    return result
def test_run():
    """Test most_frequent() with some inputs."""
    print most_frequent("london bridge is falling down falling down falling down london bridge is falling down my fair lady") # expected output: 'down'
    print most_frequent("betty bought a bit of butter but the butter was bitter") # output: 'butter'
if __name__ == '__main__':
    test_run()
Without changing your code too much, a good solution is to use the list index method.
After finding the highest frequency (stored in max_word), call index on wordfreq with max_word as the argument. This returns the position of the first occurrence of that frequency; because word_sort is sorted alphabetically, the word at that position in word_sort is the alphabetically first word with the highest count.
A code example is below (I removed the zip call as it is no longer needed, and added two simpler examples):
"""Quiz: Most Frequent Word"""
def most_frequent(s):
    """Return the most frequently occurring word in s."""
    """ Step 1 - The following assumptions have been made:
    - Space is the default delimiter
    - There are no other punctuation marks that need removing
    - Convert all letters into lower case"""
    word_list_array = s.split()
    """Step 2 - sort the list alphabetically"""
    word_sort = sorted(word_list_array, key=str.lower)
    """Step 3 - count the number of times word has been repeated in the word_sort array.
    create another array containing the word and the frequency in which it is repeated"""
    wordfreq = []
    # freq_wordsort = []
    for w in word_sort:
        wordfreq.append(word_sort.count(w))
    # freq_wordsort = zip(wordfreq, word_sort)
    """Step 4 - output the array having the maximum first index variable and output the word in that array"""
    max_word = max(wordfreq)
    word = word_sort[wordfreq.index(max_word)] # <--- solution!
    result = word
    return result
def test_run():
    """Test most_frequent() with some inputs."""
    print(most_frequent("london bridge is falling down falling down falling down london bridge is falling down my fair lady")) # output: 'down'
    print(most_frequent("betty bought a bit of butter but the butter was bitter")) # output: 'butter'
    print(most_frequent("a a a a b b b b")) # output: 'a'
    print(most_frequent("z z j j z j z j")) # output: 'j'
if __name__ == '__main__':
    test_run()
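As a different technique from the index-based fix, the same tie-breaking rule can be sketched with collections.Counter and a composite sort key. This is an alternative, not the approach used in the code above, and it assumes (as the question does) that whitespace is the only delimiter and that case should be ignored.

```python
from collections import Counter

def most_frequent(s):
    """Return the most frequent word in s; ties go to the alphabetically first word."""
    counts = Counter(s.lower().split())
    # min() over (-count, word): the highest count sorts first, and among
    # equal counts the alphabetically smallest word wins.
    return min(counts, key=lambda word: (-counts[word], word))

print(most_frequent("london bridge is falling down falling down falling down "
                    "london bridge is falling down my fair lady"))
print(most_frequent("z z j j z j z j"))
```

This counts each word once instead of calling count() inside a loop, which matters for long inputs. An empty string raises ValueError here, since min() has nothing to choose from.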

Average word length from a file

I have this exercise: write a program that calculates the average word length of a text stored in a file (i.e., the sum of the lengths of all the word tokens in the text, divided by the number of word tokens).
My code:
allword = 0
words = 0
average = 0
with open('/home/......', 'r') as f:
    for i in f:
        me = i.split()
        allword += len(me)
        words += len(i)
        average += allword / float(words)
print average
So, I have 4 lines and 55 characters not counting blank spaces, and I get an average of 27.54, which I don't think is right.
Can anybody tell me in simple words where the problem is?
Thanks very much!
@mustaccio
Maybe 27.54 is too high. Here is the code with a small change:
allword = 0
words = 0
average = 0
with open('/home/....', 'r') as f:
    for i in f:
        me = "".join(i.split(" "))
        allword += len(me)
        words += len(i)
        average += allword / float(words)
print average
Now I get 4.32.
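Both versions divide by the number of characters on the line rather than by the number of word tokens. A minimal sketch of the definition from the exercise (total length of all tokens divided by the token count) could look like the following; the function name and the sample lines are assumptions for illustration, and the function accepts any iterable of lines, so an open file object can be passed in place of the list.

```python
def average_word_length(lines):
    """Sum of the lengths of all word tokens divided by the number of tokens."""
    total_chars = 0
    total_words = 0
    for line in lines:
        tokens = line.split()  # splits on any whitespace, drops the newline
        total_words += len(tokens)
        total_chars += sum(len(token) for token in tokens)
    if total_words == 0:
        return 0.0  # avoid dividing by zero on an empty file
    return total_chars / float(total_words)

# Sample lines invented for this sketch: 6 tokens, 13 characters in total.
sample = ["to be or not\n", "to be\n"]
print(average_word_length(sample))
```

The division happens once, after the loop, so the result is a single average over the whole text rather than a running sum of per-line ratios.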

Manipulating strings in Python 2.7

I am trying to write a program that inserts specific numbers before parts of an input. For example, given the input "171819-202122-232425", I would like to split the number into two-digit pieces, using the dash as the delimiter between groups. I have split up the number using list(str(input)) but have no idea how to insert the appropriate numbers. It has to work for any number. Thanks for the help.
Output =
(number)17
(number)18
(number)19
(number+1)20
(number+1)21
(number+1)22
(number+2)23
(number+2)24
(number+2)25
You could use split and regexps to dig out lists of your numbers:
Code
import re
mynum = "171819-202122-232425"
start_number = 5
groups = mynum.split('-') # list of numbers separated by "-"
number_of_groups = xrange(start_number, start_number + len(groups))
for (i, number_group) in zip(number_of_groups, groups):
    numbers = re.findall(r"\d{2}", number_group) # return list of two-digit numbers
    for x in numbers:
        print "(%s)%s" % (i, x)
Result
(5)17
(5)18
(5)19
(6)20
(6)21
(6)22
(7)23
(7)24
(7)25
Try this:
Code:
mInput = "171819-202122-232425"
number = 9 # Just an example
result = ""
i = 0
for n in mInput:
    if n == '-': # To handle dash case
        number += 1
        continue
    i += 1
    if i % 2 == 1: # Each two digits
        result += "\n(" + str(number) + ")"
    result += n # Add current digit
print result
Output:
(9)17
(9)18
(9)19
(10)20
(10)21
(10)22
(11)23
(11)24
(11)25
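The same idea can also be sketched without a regular expression by slicing each group two characters at a time. This version runs on Python 2.7 and Python 3; the function name label_pairs and its start parameter are hypothetical, and it assumes every group has an even number of digits.

```python
def label_pairs(text, start):
    """Return '(n)dd' strings, where n increases by one for each dash-separated group."""
    labelled = []
    for offset, group in enumerate(text.split("-")):
        # Step through the group two characters at a time.
        for pos in range(0, len(group), 2):
            labelled.append("(%d)%s" % (start + offset, group[pos:pos + 2]))
    return labelled

for line in label_pairs("171819-202122-232425", 5):
    print(line)
```

Returning a list instead of printing inside the loop keeps the formatting separate from the output, which makes the function easier to reuse.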

How to parse parentheses to sum word frequencies in Python 3

I have an input with words and their frequencies for each line, but I would like a total count for each word. I know there are many solutions for calculating word frequency from a file as a whole, but my input has brackets around each line and parentheses around each word. I have not been able to extract the word and count because each line has a different number of words. Any help would be greatly appreciated!
A sample input:
[('Company', 1)]
[('Tax', 1), ('Service', 1)]
[('"Birchwood', 1), ('LLC"', 1), ('Enterprise,', 1)]
[("Wendy's", 1), ('Salon', 1)]
Code I have been trying:
from collections import defaultdict
def wordCountTotals (fh):
    d = defaultdict(int)
    for line in fh:
        word, count = line.split()
        d[word] += count
    return d[word], count
I have also tried using:
re.search("\((\w+)\, [0-9]+)", s)
but still got no results.
Because of the brackets and parentheses, this code does not work: there are too many values to unpack. If anyone could help with this, I would be so grateful!
Each line of your input uses exactly the same syntax as a Python list of tuples, so we can exploit this with ast.literal_eval.
>>> import ast
>>> ast.literal_eval(" [('Company', 1)]".strip())
[('Company', 1)]
So, something along the lines of:
d = defaultdict(int)
for line in fh:
    val = ast.literal_eval(line.strip())
    for s, c in val:
        d[s] += c
return d
would be enough as the body of your function. I have not tried this, so it might need some fixes.
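Put together as a complete function, the ast.literal_eval approach could be sketched as follows. The function name and the sample lines are assumptions for illustration (the third sample line is invented so that some words repeat), and blank lines are skipped because literal_eval raises an error on an empty string.

```python
import ast
from collections import defaultdict

def word_count_totals(lines):
    """Sum the per-line (word, count) pairs into one total per word."""
    totals = defaultdict(int)  # missing words start at 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Each line is a Python literal: a list of (word, count) tuples.
        for word, count in ast.literal_eval(line):
            totals[word] += count
    return dict(totals)

# Sample lines invented for this sketch; a file object works the same way.
sample = [
    "[('Company', 1)]\n",
    "[('Tax', 1), ('Service', 1)]\n",
    "[('Tax', 2), ('Company', 1)]\n",
]
totals = word_count_totals(sample)
print(totals)
```

ast.literal_eval only evaluates literals such as strings, numbers, tuples, and lists, so it is a safer choice than eval for input like this.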