Average word length from a file - python-2.7

I have this question: Write a program that will calculate the average word length of a text stored in a file (i.e. the sum of the lengths of all the word tokens in the text, divided by the number of word tokens).
My code:
allword = 0
words = 0
average = 0
with open('/home/......', 'r') as f:
    for i in f:
        me = i.split()
        allword += len(me)
        words += len(i)
        average += allword / float(words)
print average
So, I have 4 lines and 55 characters, not counting whitespace, and I get an average of 27.54... and I think that result is not right.
Can anybody tell me, in simple words, where the problem is?
Thanks very much!

@mustaccio
Maybe 27.54 is too high... here is the code with a small change:
allword = 0
words = 0
average = 0
with open('/home/....', 'r') as f:
    for i in f:
        me = "".join(i.split(" "))
        allword += len(me)
        words += len(i)
        average += allword / float(words)
print average
Now I get 4.32...
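For comparison, here is a minimal sketch of the calculation as the exercise states it: the sum of the word lengths divided by the number of words. It assumes that a "word" is anything separated by whitespace, so punctuation counts toward the length. The function name average_word_length is made up for this illustration.

```python
def average_word_length(lines):
    """Return the sum of all word lengths divided by the number of words."""
    total_chars = 0   # sum of the lengths of all word tokens
    total_words = 0   # number of word tokens
    for line in lines:
        for word in line.split():   # split() drops spaces and the trailing newline
            total_chars += len(word)
            total_words += 1
    if total_words == 0:
        return 0.0                  # empty input: avoid dividing by zero
    return total_chars / float(total_words)


# An open file can be passed directly, because iterating a file yields lines:
#     with open(path) as f:         # path is whichever file you want to measure
#         print(average_word_length(f))
print(average_word_length(["the quick brown\n", "fox\n"]))  # prints 4.0
```

The difference from the attempts above is that the two totals are "characters inside words" and "number of words", and the division happens once on the final totals instead of being added up with += for every line.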

Related

Pseudo-code to find the number of occurrences of characters in a document

I am trying to write pseudo-code for a MapReduce technique where I need to find the number of occurrences of each character in a document. For example:
m: 1000 times, M: 5000 times, "": 3000 times, \n: 100 times, .:20000 times etc.
Can someone please let me know whether this is correct or how I can make it better?
I have written the pseudo-code shown below:
def Map(documentName, documentContent)
    For Character in documentContent
        EmitIntermediate(Character, 1)
def Reduce(Character, Counts)
    Char_Count = 0
    For count in Counts
        Char_Count += count
    Emit(Character,Char_Count)
I referred to some of the pseudo-code available online for the map-reduce technique and wrote this one.
For example, they used the following pseudo-code to find the number of occurrences of each word in a document:
def map(documentName, documentContent):
    for line in documentContent:
        words = line.split(" ")
        for word in words:
            EmitIntermediate(word, 1)
def reduce(word, counts):
    wordCount = 0
    for count in counts:
        wordCount += count
    Emit(word, wordCount)
def Map(documentName, documentContent)
    For line in documentContent
        Line_String = line
        For Character in Line_String
            EmitIntermediate(Character, 1)
def Reduce(Character, Counts)
    Char_Count = 0
    For count in Counts
        Char_Count += count
    Emit(Character,Char_Count)
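To check the logic of the pseudo-code, it can be run in memory with plain Python. This is only a sketch: run_job is a hypothetical stand-in for the grouping ("shuffle") step that a real MapReduce framework performs between Map and Reduce, and the two tiny documents are made-up test data.

```python
from collections import defaultdict


def map_chars(document_name, document_content):
    """Map step: emit a (character, 1) pair for every character."""
    for character in document_content:
        yield (character, 1)


def reduce_counts(character, counts):
    """Reduce step: add up all the counts emitted for one character."""
    return (character, sum(counts))


def run_job(documents):
    """Hypothetical in-memory stand-in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for name, content in documents.items():
        for character, count in map_chars(name, content):
            grouped[character].append(count)   # group the values by key
    return dict(reduce_counts(c, counts) for c, counts in grouped.items())


result = run_job({"doc1": "Mm m\n", "doc2": "m."})
print(sorted(result.items()))
```

Upper and lower case, spaces and newlines all end up as separate keys, which matches the example counts in the question.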

What is the best way to sum the numbers in a big text file?

What is the best way to sum the numbers in a big text file?
The text file contains numbers separated by commas (',').
The numbers can be of any type.
There is no limit on lines or rows.
For example:
1 ,-2, -3.45-7.8j ,99.6,......
...
...
Input: path to the text file
Output: the sum of the numbers
I tried to write a solution myself and would like to know of better ones.
This is my attempt:
I work with chunks of data instead of reading line by line. Because the end of a chunk can contain only part of a number (just -2 instead of -2+3j), I process only the "safe piece" up to the last comma (',') and keep the rest for the next chunk.
import re
CHUNK_SIZE = 1017
def calculate_sum(file_path):
    _sum = 0
    with open(file_path, 'r') as _f:
        chunk = _f.read(CHUNK_SIZE)
        while chunk:
            chunk = chunk.replace(' ', '')
            safe_piece = chunk.rfind(',')
            next_chunk = chunk[safe_piece:] if safe_piece != 0 else ''
            if safe_piece != 0:
                chunk = chunk[:safe_piece]
            _sum += sum(map(complex, re.findall(r"[+-]\d*\.?\d*[+-]?\d*\.?\d*j|[+-]?\d+(?:\.\d+)?", chunk)))
            chunk = next_chunk + _f.read(CHUNK_SIZE)
    return _sum
Thanks!
This will add up any number of values in a text file. Example:
input.txt
1,-2,-3.45-7.8j,99.6
-1,1-2j
1.5,2.5,1+1j
example.py
import csv
with open('input.txt','rb') as f:
    r = csv.reader(f)
    total = 0
    for line in r:
        total += sum(complex(col) for col in line)
print total
Output
(100.15-8.8j)
If you have really long lines and insufficient memory to read it in one go, then you could use a buffering class to chunk the reads and split numbers out of the buffer using a generator function:
import re
class Buffer:
    def __init__(self,filename,chunksize=4096):
        self.filename = filename
        self.chunksize = chunksize
        self.buf = ''
    def __iter__(self):
        with open(self.filename) as f:
            while True:
                if ',' in self.buf or '\n' in self.buf:
                    data,self.buf = re.split(r',|\n',self.buf,1) # split off the text up to the first separator
                    yield complex(data)
                else:
                    chunk = f.read(self.chunksize)
                    if not chunk: # if no more data to read, return the remaining buffer and exit function
                        if self.buf:
                            yield complex(self.buf)
                        return
                    self.buf += chunk
total = 0
for num in Buffer('input.txt'):
    total += num
print total
Output:
(100.15-8.8j)
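The same buffering idea can also be written as a plain generator that works on any open file object. The sketch below is one possible variant, not the only way to do it: the names iter_numbers and sum_numbers are made up, and it assumes that empty fields (for example from a trailing comma) should be skipped and that spaces appear only around a number, not inside it.

```python
import io


def iter_numbers(stream, chunk_size=4096):
    """Yield complex numbers from a comma/newline-separated text stream."""
    tail = ''                                   # possibly incomplete last number
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pieces = (tail + chunk).replace('\n', ',').split(',')
        tail = pieces.pop()                     # keep the unfinished piece for later
        for piece in pieces:
            if piece.strip():                   # skip empty fields
                yield complex(piece.strip())
    if tail.strip():
        yield complex(tail.strip())             # whatever is left at end of file


def sum_numbers(stream, chunk_size=4096):
    return sum(iter_numbers(stream, chunk_size), 0)


# io.StringIO stands in for a real file here; a tiny chunk size forces numbers
# to be cut in the middle so the carry-over logic is exercised.
data = io.StringIO(u"1 ,-2, -3.45-7.8j ,99.6\n-1,1-2j\n1.5,2.5,1+1j\n")
total = sum_numbers(data, chunk_size=5)
print(total)
```

With a real file, the call would be along the lines of sum_numbers(open('input.txt')). Only one chunk plus one partial number is held in memory at a time.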

Process a text file to find a value above the PE score threshold of 3.19

The text file can be found at this link. What I am interested in is the value of PE score. Graphically, it appears under the column Feature2 sys.
This is my code:
def main():
    file = open ( "combined_scores.txt" , "r" )
    lines = file.readlines()
    file.close()
    count_pe=0
    for line in lines:
        line=line.strip()
        line=line[24:31] # problem 1 is here: the range is not the same in every line of the file
        if line.find( "3.19") != -1 : # I need values >= 3.19, not only 3.19
            count_pe = count_pe + 1
    print ( ">=3.19: ", count_pe ) # at the end I need how many times PE >= 3.19 occurs
main()
I suggest you split each line on tabs (\t) and compare the value with 3.19 as a number rather than as text. It should be something like the following (Python 2.7):
with open('combined_scores.txt') as f:
    lines = f.readlines()[1:] # remove the header line
# reset counter
n = 0
for line in lines:
    if float(line.split('\t')[-3]) >= 3.19:
        n = n + 1
# print total count
print 'total=', n
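A slightly more defensive sketch of the same approach is shown here. It keeps the assumption from the answer that the PE score is the third field from the end of each tab-separated line, and it skips any line where that field is not a number instead of relying on the header being exactly one line. The function name count_at_least and the sample rows are invented for illustration; they are not taken from the real file.

```python
def count_at_least(lines, threshold=3.19, column=-3):
    """Count the lines whose tab-separated field `column` is >= threshold."""
    count = 0
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        try:
            value = float(fields[column])
        except (ValueError, IndexError):
            continue          # header, blank line, or non-numeric field
        if value >= threshold:
            count += 1
    return count


# Hypothetical rows: the score sits in the third field from the end.
sample = [
    "id\tpe_score\tx\ty\n",
    "a\t3.19\t0\t0\n",
    "b\t2.50\t0\t0\n",
    "c\t10.4\t0\t0\n",
]
print(count_at_least(sample))  # prints 2
```

Comparing floats is what makes 10.4 count as a hit; searching for the text "3.19" would miss it and would also match unrelated values such as 13.19.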

Python script to limit the length of words in a text file

I have an input file like:
input.txt:
to
the
cow
eliphant
pigen
then
enthosiastic
I want to remove the words whose length is <= 4 characters, and if a word has more than 8 characters, write only its first 8 characters to the new file.
The output should look like:
output.txt:
eliphant
pigen
enthosia
This is my code:
f2 = open('output.txt', 'w+')
x2 = open('input.txt', 'r').readlines()
for y in x2:
    if (len(y) <= 4):
        y = y.replace(y, '')
        f2.write(y)
    elif (len(y) > 8):
        y = y[0:8]
        f2.write(y)
    else:
        f2.write(y)
f2.close()
print "Done!"
When I run it, it gives output like:
eliphantpigen
then
enthosia
It also writes the 4-character word... I don't understand what the problem is, or how to write the code to limit the length of the words in a text file.
Use with when working with files; this guarantees that the file will be closed.
You have "then" in your results because you are reading lines, not words.
Each line ends with a newline character '\n'. So when you read the word, you actually get the string
'then\n', and the length of this string is 5.
with open('output.txt', 'w+') as ofp, open('input.txt', 'r') as ifp:
    for line in ifp:
        line = line.strip()
        if len(line) > 8:
            line = line[:8]
        elif len(line) <= 4:
            continue
        ofp.write(line + '\n')
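The effect of the trailing newline is easy to check without touching any files. The snippet below is only an illustration: limit_words is a made-up helper that applies the same rule as the answer to a list of lines, using the words from the question as input.

```python
def limit_words(lines, min_len=5, max_len=8):
    """Drop words shorter than min_len and cut the rest to max_len characters."""
    kept = []
    for line in lines:
        word = line.strip()          # remove the trailing '\n' before measuring
        if len(word) < min_len:
            continue
        kept.append(word[:max_len])
    return kept


print(len("then\n"))           # prints 5: the newline is counted
print(len("then\n".strip()))   # prints 4

words = ["to\n", "the\n", "cow\n", "eliphant\n", "pigen\n", "then\n", "enthosiastic\n"]
print(limit_words(words))      # prints ['eliphant', 'pigen', 'enthosia']
```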

Manipulating strings python 2.7

I am trying to write a program that inserts specific numbers before parts of an input. For example, given the input "171819-202122-232425", I would like to split the number into pieces, using the dash as a delimiter. I have split up the number using list(str(input)) but have no idea how to insert the appropriate numbers. It has to work for any number. Thanks for the help.
Output =
(number)17
(number)18
(number)19
(number+1)20
(number+1)21
(number+1)22
(number+2)23
(number+2)24
(number+2)25
You could use split and regexps to dig out lists of your numbers:
Code
import re
mynum = "171819-202122-232425"
start_number = 5
groups = mynum.split('-') # list of numbers separated by "-"
number_of_groups = xrange(start_number , start_number + len(groups))
for (i, number_group) in zip(number_of_groups, groups):
    numbers = re.findall(r"\d{2}", number_group) # return list of two-digit numbers
    for x in numbers:
        print "(%s)%s" % (i, x)
Result
(5)17
(5)18
(5)19
(6)20
(6)21
(6)22
(7)23
(7)24
(7)25
Try this:
Code:
mInput = "171819-202122-232425"
number = 9 # Just an example
result = ""
i = 0
for n in mInput:
    if n == '-': # To handle dash case
        number += 1
        continue
    i += 1
    if i % 2 == 1: # Each two digits
        result += "\n(" + str(number) + ")"
    result += n # Add current digit
print result
Output:
(9)17
(9)18
(9)19
(10)20
(10)21
(10)22
(11)23
(11)24
(11)25
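Both answers can also be combined into a small function that uses only split and slicing. This is a sketch under the same assumptions as above: groups are separated by dashes and each piece is two digits wide. The name label_pairs and the width parameter are made up here.

```python
def label_pairs(text, start, width=2):
    """Return lines like '(5)17', one per `width`-digit piece of each group."""
    lines = []
    for offset, group in enumerate(text.split('-')):   # one label per dash group
        for i in range(0, len(group), width):          # walk the group in steps
            lines.append("(%d)%s" % (start + offset, group[i:i + width]))
    return lines


print("\n".join(label_pairs("171819-202122-232425", 5)))
```

Returning a list instead of printing directly makes it easy to write the lines to a file or to test them.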