Error in mapper and reducer with python - python-2.7

There is a problem in the mapper.py file when I run it on the cluster. The error is "unexpected syntax before line" at "strl = line.strip()".
There is no error when I test it locally. I want to read the words from a text file, change their format, count them, and send the output to an S3 bucket.
Guidance is most welcome. Thanks.
mapper:
import sys, re

for line in sys.stdin:
    strl = line.strip()
    words = strl.split()
    for word in words:
        word = word.lower()
        result = ""
        charref = re.compile("[a-f]")
        match = charref.search(word[0])
        if match:
            result += "TR2234J"
        else:
            result += ""
        print result, "\t"
reducer:
import sys

for line in sys.stdin:
    line = line.strip()
    new_word = ""
    words = line.split("\t")
    final_count = len(words)
    my_num = final_count / 6
    for i in range(my_num):
        new_word = "".join(words[i*6:10+(i*6)])
        print new_word, "\t"
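A likely culprit, offered as a guess since the full cluster error isn't shown: Hadoop streaming executes mapper.py as a standalone program, so if the file lacks a shebang line the node's shell may try to interpret it, and if the node's default python is Python 3 the print statements are syntax errors. A minimal sketch of a mapper that sidesteps both problems, assuming /usr/bin/env python exists on the nodes:

#!/usr/bin/env python
# Shebang so Hadoop streaming hands the script to Python, not the shell
from __future__ import print_function  # makes print() work on Python 2 and 3
import sys, re

charref = re.compile("[a-f]")  # compile the pattern once, not per word

for line in sys.stdin:
    for word in line.strip().split():
        word = word.lower()
        # emit the marker only when the word starts with a letter a-f
        if word and charref.search(word[0]):
            print("TR2234J", end="\t")

Remember to make the script executable (chmod +x mapper.py) and to ship it to the cluster with the streaming job; errors that appear only on the cluster are very often environment differences rather than logic bugs.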


rstrip, split and sort a list from input text file

I am new to Python. I am trying to rstrip the whitespace, split each line into words, append the words to a list, and then sort it in alphabetical order. I don't know what I am doing wrong.
fname = input("Enter file name: ")
fh = open(fname)
lst = list(fh)
for line in lst:
    line = line.rstrip()
    y = line.split()
    i = lst.append()
    k = y.sort()
print y
I have since been able to fix my code and get the expected output.
This is what I was hoping to write:
name = input('Enter file: ')
handle = open(name, 'r')
wordlist = list()
for line in handle:
    words = line.split()
    for word in words:
        if word in wordlist: continue
        wordlist.append(word)
wordlist.sort()
print(wordlist)
If you are using Python 2.7, I believe you need to use raw_input(); in Python 3.x it is correct to use input(). Also, you are not using append() correctly: append() is a list method and takes the item to add as its argument.
fname = raw_input("Enter filename: ")  # Store the filename given by the user input
fh = open(fname, "r")  # 'r' opens the file in read mode
lines = fh.readlines()  # Create a list of the lines from the file
# Sort the lines alphabetically
lines.sort()
# Rstrip each line of the lines list
y = [l.rstrip() for l in lines]
# Print out the result
print y
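Note that this sorts whole lines rather than individual words, which is slightly different from what the question asked for. A small variant that collects unique words and sorts those instead, sketched in Python 2.7 style to match the answer:

fname = raw_input("Enter filename: ")
fh = open(fname, "r")
words = set()  # a set keeps each word only once
for line in fh:
    words.update(line.rstrip().split())  # strip, split, and collect the words
print sorted(words)  # sorted() returns the unique words in alphabetical order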

Notepad++: Replace find query with words from list

I would like to replace all "var_"
var_
Hello
var_
Whats
var_
Up?
...
with words from this list
alpha
beta
gamma
...
so the end result is
alpha
Hello
beta
Whats
gamma
Up?
...
Would appreciate help on achieving this!
This is essentially impossible, or at least overly complicated, with a regex alone. However, if you combine it with a programming language, you can get it done quickly. E.g. in Python it would look like this:
import sys
import re
import fileinput

if len(sys.argv) < 3:
    exit("Usage: " + sys.argv[0] + " <filename> <replacements>")

input_file = sys.argv[1]
replacements = sys.argv[2:]
num_of_replacements = len(replacements)
replacement_index = 0
searcher = re.compile("^var_\\b")

for line in fileinput.input(input_file, inplace=True, backup='.bak'):
    match = searcher.match(line)
    if match is None:
        print(line.rstrip())
    else:
        # re.sub(pattern, replacement, string): swap the next word in for the match
        print(re.sub("^var_\\b", replacements[replacement_index], line.rstrip()))
        replacement_index = replacement_index + 1
Usage: replacer.py ExampleInput.txt alpha beta gamma
Update
It's possible to modify the program to accept the string you search for as the 1st param:
replacer.py "var_" ExampleInput.txt alpha beta gamma
The modified Python script looks like this:
import sys
import re
import fileinput

if len(sys.argv) < 4:
    exit("Usage: " + sys.argv[0] + " <pattern> <filename> <replacements>")

search = "\\b" + sys.argv[1] + "\\b"
input_file = sys.argv[2]
replacements = sys.argv[3:]
num_of_replacements = len(replacements)
replacement_index = 0
searcher = re.compile(search)

for line in fileinput.input(input_file, inplace=True, backup='.bak'):
    match = searcher.match(line)
    if match is None:
        print(line.rstrip())
    else:
        # re.sub(pattern, replacement, string): swap the next word in for the match
        print(re.sub(search, replacements[replacement_index], line.rstrip()))
        replacement_index = replacement_index + 1
Note: this script still has a few limitations:
it expects that the string you search for occurs only once per line;
it replaces the searched string only if it is a distinct word;
you can accidentally incorporate Python regex syntax into the search param (a possible mitigation is sketched below).
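For the last point, re.escape() from the standard library neutralizes any regex metacharacters in the user-supplied string; this one-line change is my addition, not part of the original answer:

search = "\\b" + re.escape(sys.argv[1]) + "\\b"  # treat the user's pattern as literal text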

Ascii codec can't decode byte 0xc2 python nltk

I have code that I'm using for spam classification, and it works great, but every time I try to stem/lemmatize a word I get this error:
File "/Users/Ramit/Desktop/Bayes1/src/filter.py", line 16, in trim_word
word = ps.stem(word)
File "/Library/Python/2.7/site-packages/nltk/stem/porter.py", line 664, in stem
stem = self._step1a(stem)
File "/Library/Python/2.7/site-packages/nltk/stem/porter.py", line 289, in _step1a
if word.endswith('ies') and len(word) == 4:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Here is my code:
from word import Word
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()

class Filter():
    def __init__(self):
        self.words = dict()

    def trim_word(self, word):
        # Helper method to trim away some of the non-alphabetic characters
        # I deliberately do not remove all non-alphabetic characters.
        word = word.strip(' .:,-!()"?+<>*')
        word = word.lower()
        word = ps.stem(word)
        return word

    def train(self, train_file):
        lineNumber = 1
        ham_words = 0
        spam_words = 0
        stop = set(stopwords.words('english'))
        # Loop through all the lines
        for line in train_file:
            if lineNumber % 2 != 0:
                line = line.split('\t')
                category = line[0]
                input_words = line[1].strip().split(' ')
                # Loop through all the words in the line, remove some characters
                for input_word in input_words:
                    input_word = self.trim_word(input_word)
                    if (input_word != "") and (input_word not in stop):
                        # Check if the word is in the dictionary, else add it
                        if input_word in self.words:
                            word = self.words[input_word]
                        else:
                            word = Word(input_word)
                            self.words[input_word] = word
                        # Check whether the word is in a ham or spam sentence, increment counters
                        if category == "ham":
                            word.increment_ham()
                            ham_words += 1
                        elif category == "spam":
                            word.increment_spam()
                            spam_words += 1
                        # Probably bad training file input...
                        else:
                            print "Not valid training file format"
            lineNumber += 1
        # Compute the probability for each word in the training set
        for word in self.words:
            self.words[word].compute_probability(ham_words, spam_words)

    def get_interesting_words(self, sms):
        interesting_words = []
        stop = set(stopwords.words('english'))
        # Go through all words in the SMS and append to the list.
        # If we have not seen the word in training, assign a probability of 0.4
        for input_word in sms.split(' '):
            input_word = self.trim_word(input_word)
            if (input_word != "") and (input_word not in stop):
                if input_word in self.words:
                    word = self.words[input_word]
                else:
                    word = Word(input_word)
                    word.set_probability(0.40)
                interesting_words.append(word)
        # Sort the list of interesting words, return the top 15 elements if the list is longer
        interesting_words.sort(key=lambda word: word.interesting(), reverse=True)
        return interesting_words[0:15]

    def filter(self, input_file, result_file):
        # Loop through all SMSes and compute the total spam probability of each message
        lineNumber = 0
        for sms in input_file:
            lineNumber += 1
            spam_product = 1.0
            ham_product = 1.0
            if lineNumber % 2 != 0:
                try:
                    for word in self.get_interesting_words(sms):
                        spam_product *= word.get_probability()
                        ham_product *= (1.0 - word.get_probability())
                    sms_spam_probability = spam_product / (spam_product + ham_product)
                except:
                    result_file.write("error")
                    continue  # skip this SMS; sms_spam_probability was never set
                if sms_spam_probability > 0.8:
                    result_file.write("SPAM: " + sms)
                else:
                    result_file.write("HAM: " + sms)
                result_file.write("\n")
I'm just looking for a solution that would allow me to lemmatize/stem the words. I tried looking around the net and found similar problems, but the suggested fixes haven't worked for me.
Use sys. Note that reload(sys) has to come before setdefaultencoding(), because Python removes that function from sys at startup and reloading the module restores it:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
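Changing the default encoding process-wide is generally considered a fragile workaround. A more targeted fix, assuming the input is UTF-8 (0xc2 is the lead byte of two-byte UTF-8 sequences such as a non-breaking space), is to decode each word to unicode before it reaches the stemmer, for example inside trim_word:

def trim_word(self, word):
    # Decode byte strings to unicode first; 'ignore' drops undecodable bytes
    if isinstance(word, str):
        word = word.decode('utf-8', 'ignore')
    word = word.strip(u' .:,-!()"?+<>*')
    word = word.lower()
    return ps.stem(word)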

Merging every two lines in a text file - Python

As the title says: is there an easy way of merging every two lines of a text file in Python? For example, my text file looks like this:
fname=xxx
uname=yyy
fname=zzz
uname=ppp
What I want as an output is :
fname=xxx uname=yyy
fname=zzz uname=ppp
and so on. Any help is appreciated!
Instead of printing, you can append these to a text file or a list:
with open("test.txt") as f:
    content = f.readlines()

pair = ""  # avoid shadowing the built-in name str
for i in xrange(1, len(content) + 1):
    pair += content[i-1].strip() + " "  # a space separates the merged lines
    if i % 2 == 0:
        print pair.rstrip()
        pair = ""
or
with open("test.txt") as f:
    content = f.readlines()

for i in xrange(1, len(content) + 1):
    if i % 2 == 0: print content[i-2].strip() + " " + content[i-1].strip()
Here is another solution with a sliding window, taking two lines at a time:
with open("test.txt") as f:
    data = [x for x in f.read().split("\n") if x.strip() != ""]

for line1, line2 in list(zip(data, data[1:]))[::2]:
    print(" ".join([line1, line2]))
This will only work for files with an even number of lines.
I hope it helps:
import itertools

a = ["fname=xxx", "uname=yyy", "fname=zzz", "uname=ppp"]
res = ''
for i in itertools.islice(a, 0, len(a), 2), itertools.islice(a, 1, len(a), 2):
    res += ' '.join(i)
    res += '\n'
print(res)
output:
fname=xxx fname=zzz
uname=yyy uname=ppp
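Note that this last answer pairs the odd-numbered lines together and the even-numbered lines together, as its own output shows. If the pairing from the question is wanted, a sketch that zips the even-indexed slice with the odd-indexed one (same sample list):

a = ["fname=xxx", "uname=yyy", "fname=zzz", "uname=ppp"]
# pair element 0 with 1, 2 with 3, and so on
for first, second in zip(a[::2], a[1::2]):
    print(' '.join((first, second)))
# fname=xxx uname=yyy
# fname=zzz uname=ppp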

Python: count function does not work

I am stuck on an exercise from a Coursera Python course, this is the question:
"Open the file mbox-short.txt and read it line by line. When you find a line that starts with 'From ' like the following line:
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
You will parse the From line using split() and print out the second word in the line (i.e. the entire address of the person who sent the message). Then print out a count at the end.
Hint: make sure not to include the lines that start with 'From:'.
You can download the sample data at http://www.pythonlearn.com/code/mbox-short.txt"
Here is my code:
fname = raw_input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"
fh = open(fname)
count = 0
for line in fh:
    words = line.split()
    if len(words) > 2 and words[0] == 'From':
        print words[1]
        count = count + 1
    else:
        continue
print "There were", count, "lines in the file with From as the first word"
The output should be a list of emails and their total count, but it doesn't work and I don't know why: the output is just "There were 0 lines in the file with From as the first word".
I used your code and downloaded the file from the link, and I am getting this output:
There were 27 lines in the file with From as the first word
Have you checked that you downloaded the file to the same location as your code file?
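One quick way to check this (my suggestion, not part of the original answer) is to print the directory Python resolves relative paths against:

import os
print(os.getcwd())                       # the current working directory
print(os.path.exists("mbox-short.txt"))  # False means the file is not there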
fname = input("Enter file name: ")
counter = 0
fh = open(fname)
for line in fh:
    line = line.rstrip()
    if not line.startswith('From '): continue
    words = line.split()
    print(words[1])
    counter += 1
print("There were", counter, "lines in the file with From as the first word")
fname = input("Enter file name: ")
fh = open(fname)
count = 0
for line in fh:
    if line.startswith('From '):  # consider the lines which start with the word "From "
        y = line.split()          # split the line into words and store them in a list
        print(y[1])               # print the word present at index 1
        count = count + 1         # increment the count variable
print("There were", count, "lines in the file with From as the first word")
I have written comments on every line in case anyone faces any difficulty; if you need help, feel free to contact me. This is the simplest version I have found. Hope you benefit from my answer.
fname = input('Enter the file name:')
fh = open(fname)
count = 0
for line in fh:
    if line.startswith('From '):  # the trailing space excludes 'From:' lines
        linesplit = line.split()
        print(linesplit[1])
        count = count + 1
print("There were", count, "lines in the file with From as the first word")
fname = input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"
fh = open(fname)
count = 0
for i in fh:
    i = i.rstrip()
    if not i.startswith('From '): continue
    word = i.split()
    count = count + 1
    print(word[1])
print("There were", count, "lines in the file with From as the first word")
fname = input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"
fh = open(fname)
count = 0
for line in fh:
    if line.startswith('From '):  # the trailing space skips 'From:' lines
        line = line.rstrip()
        lt = line.split()
        print(lt[1])
        count = count + 1
print("There were", count, "lines in the file with From as the first word")
My code looks like this and works like a charm:
fname = input("Enter file name: ")
if len(fname) < 1:
    fname = "mbox-short.txt"
fh = open(fname)
count = 0  # initialize the counter to 0 at the start
for line in fh:  # iterate over the document line by line
    words = line.split()  # split the line into words
    if not len(words) < 2 and words[0] == "From":  # keep lines whose first word is "From" and that have at least two words
        print(words[1])  # print the word at position 1 in the list
        count += 1  # count it
    else:
        continue
print("There were", count, "lines in the file with From as the first word")
It is a nice exercise from Dr. Chuck's course.
There is also another way: you can store the found words in a separate, initially empty list and then print out the length of the list. It will deliver the same result.
My tested code is as follows:
fname = input("Enter file name: ")
if len(fname) < 1:
    fname = "mbox-short.txt"
fh = open(fname)
newl = list()
for line in fh:
    words = line.split()
    if not len(words) < 2 and words[0] == 'From':
        newl.append(words[1])
    else:
        continue
print(*newl, sep="\n")
print("There were", len(newl), "lines in the file with From as the first word")
I passed the exercise with it as well. Enjoy and keep up the good work. Python is so much fun to me, even though I always hated programming.