Py4JJavaError in Pyspark - python-2.7

I am working with Spark through the Python API. My code is below. When I execute the line wordCount.first(), I get ValueError: need more than 1 value to unpack. Any light on this error would be appreciated. Thanks.
#create an RDD with textFile method
text_data_file=sc.textFile('/resources/yelp_labelled.txt')
#import the required library for word count operation
from operator import add
#Use filter to return RDD for words length greater than zero
wordCountFilter=text_data_file.filter(lambda x:len(x)>0)
#use flat map to split each line into words
wordFlatMap=wordCountFilter.flatMap(lambda x: x.split())
#map each key with value 1 using map function
wordMapper=wordFlatMap.flatMap(lambda x:(x,5))
#Use reducebykey function to reduce above mapped keys
#returns the key-value pairs by adding values for similar keys
wordCount=wordMapper.reduceByKey(add)
#view the first element
wordCount.first()
File "/home/notebook/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
ValueError: need more than 1 value to unpack

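Your mistake is here:
wordMapper=wordFlatMap.flatMap(lambda x:(x,5))
It should be
wordMapper=wordFlatMap.map(lambda x:(x,1))
With flatMap the returned tuple is flattened, so you emit x and the number as two separate elements instead of one (key, value) pair. reduceByKey then tries to unpack each element into a key and a value. Unpacking the string x fails if its length is not equal to 2, and unpacking the integer fails as well. Note also that your comment says each key is mapped to 1; with 5, every count would come out five times too large.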
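The difference can be reproduced without Spark. The following is only a plain-Python sketch: itertools.chain.from_iterable stands in for the flattening that flatMap performs, count_by_key is a hypothetical helper that imitates the "for k, v in iterator" unpacking seen in the traceback, and the word list is made up.

```python
from itertools import chain

# Made-up sample words (assumption, not data from the question).
words = ["hello", "hi", "spark"]

# map-like behaviour: each returned tuple stays one element.
mapped = [(w, 1) for w in words]

# flatMap-like behaviour: each returned tuple is flattened into the output.
flat_mapped = list(chain.from_iterable((w, 1) for w in words))


def count_by_key(pairs):
    """Hypothetical stand-in for reduceByKey(add): unpack pairs and sum values."""
    counts = {}
    for k, v in pairs:  # same unpacking pattern as in the traceback
        counts[k] = counts.get(k, 0) + v
    return counts


map_counts = count_by_key(mapped)

try:
    count_by_key(flat_mapped)
    flat_error = None
except (ValueError, TypeError) as exc:
    # The first element is a bare string, so unpacking it into k, v fails.
    flat_error = exc

print(map_counts)
print(flat_mapped)
print(type(flat_error).__name__)
```

The flattened list interleaves words and integers, which is why no element can be treated as a (key, value) pair.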

Related

Convert a single list item of key-value pairs to a dictionary in Python

I have a function that returns just one list of key-value pairs. How can I convert this to an actual key-value structure or an object type so I can get each attribute from the list? For example, I would like to be able to get just the time or the price or any other property, and not the whole list as one item.
{'time': 1512858529643, 'price': '0.00524096', 'origQty': '530.00000000'
I know it doesn't look like a list, but it actually is! The function that I am calling returns this as a list. I am simply storing it in a variable and nothing else.
open_order=client.get_open_orders(symbol="BNBETH",recvWindow=1234567)
If you still have doubts: when I try to print a dictionary item like this, print(open_order['time']),
I get the following error.
Traceback (most recent call last):
File "C:\Python27\python-binance-master\main.py", line 63, in <module>
print(open_order['time'])
TypeError: list indices must be integers, not str
Also, if I print the type, it shows as list.
print(type(open_order))
So, I was able to come up with a solution, sort of, by converting the list to a string and splitting at the "," character. Now I have a list of items that I can actually print by selecting one, e.g. print(split_order_items[5]). There has to be a better solution.
open_order=client.get_open_orders(symbol="BNBETH",recvWindow=1234567)
y=''.join(str(e)for e in open_order)
split_order_items =([x.strip() for x in y.split(',')])
print(split_order_items[5])
I was able to create multiple list items using the above code. I just can't seem to convert it to a dictionary object!
Thanks!
What you have posted is a dict, not a list. You can do something like this:
data = {'time': 1512858529643, 'price': '0.00524096', 'orderId': 7848174, 'origQty': '530.00000000'}
print(data['time']) # this gets just the time and prints it
print(data['price']) # this gets just the price and prints it
I strongly suggest reading up on the Python dict: https://docs.python.org/3/tutorial/datastructures.html#dictionaries
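The traceback in the question (list indices must be integers, not str) suggests that the call returns a list wrapping the dict, even though the printed output did not show the brackets. A minimal sketch under that assumption follows. Here open_orders is a hard-coded, hypothetical stand-in for the real return value of client.get_open_orders, whose exact shape is not confirmed by the question.

```python
# Hypothetical stand-in for the value returned by client.get_open_orders(...):
# a list that contains one dict per open order (assumed shape).
open_orders = [
    {'time': 1512858529643, 'price': '0.00524096',
     'orderId': 7848174, 'origQty': '530.00000000'},
]

# Index into the list first, then use the dict keys.
first_order = open_orders[0]
order_time = first_order['time']
order_price = first_order['price']

# If several orders can come back, collect one field from each of them.
all_prices = [order['price'] for order in open_orders]

print(order_time)
print(order_price)
print(all_prices)
```

Indexing the list first avoids converting anything to a string and splitting on commas.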

How can I calculate mean of list of strings?

I am trying to calculate the mean of one column in a csv file. First, I read one column from the .csv file and save it into a list. Then, when I try to get the mean, I get an error:
TypeError: 'builtin_function_or_method' object has no attribute '__getitem__'
my code is :
with open('XXXXXX.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        for (k,v) in row.items():
            columns_95[k].append(v)
sVaR5 = columns_95['95%']
mean_95 = sum(sVaR5)/len(sVaR5)
and my csv looks like:
95% 99%
1.225 2.332
1.252 10.252
2.336 4.213
... ...
When I check my list, the output is ['1.225','1.252','2.336']. I think maybe the quote marks are the reason why my code fails, but how do I fix it? Thanks!
sum is a function. If you want to call the function sum with the argument sVaR5, you need to write:
sum(sVaR5)
If your sVaR5 is a list of strings, you could convert them to floats for the sum:
sum(map(float, sVaR5))
If you write sum[sVaR5], Python tries to call __getitem__ on the function object sum, hence the error:
'builtin_function_or_method' object has no attribute '__getitem__'
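Putting the pieces together, the mean could be computed as in the sketch below. It assumes a comma-delimited file (the question does not show the delimiter), uses an in-memory string as a stand-in for 'XXXXXX.csv', and converts each value to float while reading.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for the csv file; the comma delimiter is an assumption.
csv_text = u"95%,99%\n1.225,2.332\n1.252,10.252\n2.336,4.213\n"

columns = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    for name, value in row.items():
        # csv always yields strings, so convert before doing arithmetic.
        columns[name].append(float(value))

sVaR5 = columns['95%']
mean_95 = sum(sVaR5) / len(sVaR5)
print(mean_95)
```

Converting at read time keeps the later arithmetic free of string handling.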

IndexError: list index out of range for list of lists in for loop

I've looked at the other questions posted on the site about index errors, but I still don't understand how to fix my own code. I'm a beginner when it comes to Python. Based on the user's input, I want to check whether that input is in the fourth position of each line in the list of lists.
Here's the code:
#create a list of lists from the missionPlan.txt
from __future__ import with_statement
listoflists = []
with open("missionPlan.txt", "r") as f:
    results = [elem for elem in f.read().split('\n') if elem]
for result in results:
    listoflists.append(result.split())
#print(listoflists)
#print(listoflists[2][3])
choice = int(input('Which command would you like to alter: '))
i = 0
for rows in listoflists:
    while i < len(listoflists):
        if listoflists[i][3]==choice:
            print (listoflists[i][0])
        i += 1
This is the error I keep getting:
not getting inside the if statement
So, I think this is what you're trying to do - find any line in your "missionPlan.txt" where the 4th word (after splitting on whitespace) matches the number that was input, and print the first word of such lines.
If that is indeed accurate, then perhaps something along this line would be a better approach.
choice = int(input('Which command would you like to alter: '))
allrecords = []
with open("missionPlan.txt", "r") as f:
    for line in f:
        words = line.split()
        allrecords.append(words)
        try:
            # compare numbers with numbers, and only if a 4th word exists
            if len(words) > 3 and int(words[3]) == choice:
                print(words[0])
        except ValueError:
            # the 4th word is not a number, so it cannot match
            pass
Also, if, as your tags suggest, you are using Python 3.x, I'm fairly certain the from __future__ import with_statement isn't particularly necessary...
EDIT: added a couple of lines based on the comments below. In addition to examining every line as it is read and printing the first field of every line whose fourth field matches the input, the code now gathers each line, split into a list of words, into the allrecords list, which corresponds to the original question's listoflists. This enables further processing of the file later in the code. Also fixed one glaring mistake: the line must be split into words, not f...
Also, to answer your "I cant seem to get inside that if statement" observation - that's because you're comparing a string (listoflists[i][3]) with an integer (choice). The code above addresses both that comparison mismatch and the check for there actually being enough words in a line to do the comparison meaningfully...
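The comparison mismatch can be seen in isolation with a few made-up rows. In this sketch the row contents and the value of choice are hypothetical; only the pattern of comparing the fourth word with an integer comes from the question.

```python
# Made-up rows standing in for the split lines of missionPlan.txt.
rows = [line.split() for line in ["CMD1 a b 3", "CMD2 a b 7", "short line"]]
choice = 7  # what int(input(...)) would return for the input "7"

# String compared with int: never equal, so nothing is selected.
naive = [r[0] for r in rows if len(r) > 3 and r[3] == choice]

# Convert the fourth word to int first (only when it is all digits).
fixed = [r[0] for r in rows
         if len(r) > 3 and r[3].isdigit() and int(r[3]) == choice]

print(naive)
print(fixed)
```

The length check also protects against lines with fewer than four words, which is what would otherwise raise IndexError.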

For loop using enumerate through a list with an if statement to search lines for a particular string

I want to compile a list of recurring strings (transaction IDs).
I am flummoxed. I've researched the correct method and feel like this code should work.
However, I'm doing something wrong in the second block.
The first block correctly compiles a list of the strings that I want.
I can't get the second block to work. If I simplify, I can print each value in the list
by using
for idx, val in enumerate(tidarray): print val
It seems like I should now be able to use that value to search each line for that string,
then print the line. (Actually I'll be using it in conjunction with another search term to
reduce the number of line reads, but this is my basic test before homing in further.)
def main():
    pass
samlfile= "2013-08-18 06:24:27,410 tid:5af193fdc DEBUG org.sourceid.saml20.domain.AttributeMapping] Source attributes:{SAML_AUTHN_CTX=urn:oasis:names:tc:SAML:2.0:ac:classes"
tidarray = []
for line in samlfile:
    if "tid:" in line:
        str=line
        tid = re.search(r'(tid:.*?)(?= )', str)
        if tid.group() not in tidarray:
            tidarray.append(tid.group())
for line in samlfile:
    for idx, val in enumerate(tidarray):
        if val in line:
            print line
Can someone suggest a correction for the second block of code? I recognize that reading the file twice isn't the most elegant solution... My main goal here is to learn how to enumerate through the list and use each value in the subsequent code.
Iterating over a file twice
Basically what you do is:
for line in somefile: pass # first run
for line in somefile: pass # second run
The first run will complete just fine, the second run will not run at all.
This is because the file was read until the end and there's no more data to read lines from.
Call somefile.seek(0) to go to the beginning of the file:
for line in somefile: pass # first run
somefile.seek(0)
for line in somefile: pass # second run
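The effect is easy to check with an in-memory file object. In this sketch io.StringIO stands in for a real file opened with open(); the two lines of content are made up.

```python
import io

# In-memory stand-in for an opened log file (made-up content).
somefile = io.StringIO(u"line one\nline two\n")

first_run = [line for line in somefile]    # reads everything
second_run = [line for line in somefile]   # already at the end: nothing left

somefile.seek(0)                           # rewind to the beginning
third_run = [line for line in somefile]    # reads everything again

print(len(first_run), len(second_run), len(third_run))
```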
Storing things uniquely
Basically, what you seem to want is a way to store the IDs from the file in a
data structure in which every ID appears only once.
If you want to store elements uniquely you can use, for example, dictionaries (help(dict))
or sets (help(set)). Example with sets:
myset = set()
myset.add(2) # set([2])
myset.add(3) # set([2,3])
myset.add(2) # set([2,3])
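Applied to the transaction IDs, a set can handle the uniqueness check while a list keeps the order of first appearance. This is only a sketch: the log lines are made up, the pattern tid:\S+ is a simplification of the regular expression in the question, and lines_by_tid is a hypothetical name.

```python
import re

# Made-up log lines in the style of the sample line from the question.
log_lines = [
    "2013-08-18 06:24:27,410 tid:5af193fdc DEBUG first message",
    "2013-08-18 06:24:28,001 tid:77b2c01aa DEBUG second message",
    "2013-08-18 06:24:29,120 tid:5af193fdc DEBUG third message",
    "2013-08-18 06:24:30,000 INFO no transaction id here",
]

tid_pattern = re.compile(r'tid:\S+')  # simplified pattern (assumption)

seen = set()       # fast membership test
unique_tids = []   # keeps the order of first appearance
for line in log_lines:
    match = tid_pattern.search(line)
    if match and match.group() not in seen:  # skip lines without a tid
        seen.add(match.group())
        unique_tids.append(match.group())

# Second pass: use enumerate over the IDs to group the matching lines.
lines_by_tid = dict((tid, []) for tid in unique_tids)
for line in log_lines:
    for idx, tid in enumerate(unique_tids):
        if tid in line:
            lines_by_tid[tid].append(line)

print(unique_tids)
print(len(lines_by_tid['tid:5af193fdc']))
```

Because the lines are held in a list here, the second pass needs no seek(0); with a real file object the rewind described above would still be required.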

Length of Python dictionary created doesn't match length from input file

I'm currently trying to create a dictionary from the following input file:
1776344_at 1779734_at 0.755332745 1.009570769 -0.497209846
1776344_at 1771911_at 0.931592828 0.830039019 2.28101445
1776344_at 1777458_at 0.746306282 0.753624146 3.709120716
...
...
There are a total of 12552 lines in this file.
What I wanted to do is to create a dictionary where the first 2 columns are the keys and the rest are the values. This I've successfully done and it looks something like this:
1770449_s_at;1777263_at:0.825723773;1.188969175;-2.858979578
1772892_at;1772051_at:-0.743866602;-1.303847456;26.41464414
1777227_at;1779218_s_at:0.819554413;0.677758609;4.51390617
But here's THE THING: I ran my Python script from the MS-DOS command prompt, and the generated output not only has a different order from the input file (i.e. the 1st line is the 34th line), the whole file has only 739 lines.
Can someone enlighten me on what's going on? Does it have something to do with memory? Last I checked, I still had 305GB of disk space.
The script I wrote is as follow:
import sys
import os
input_file = sys.argv[1]
infile = open(input_file, 'r')
model_dict = {}
for line in infile:
    key = ';'.join(line.split('\t')[0:2]).rstrip(os.linesep)
    value = ';'.join(line.split('\t')[2:]).rstrip(os.linesep)
    print 'keys are:',key,'\n','values are:',value
    model_dict[key] = value
    print model_dict
    outfile = open('model_dict', 'w')
    for key,value in model_dict.items():
        print key,value
        outfile.write('%s:%s\n' % (key,value))
outfile.close()
Based on the information given, and since each dictionary key is unique, I suspect your input file contains lines that generate the same key. In that case the dictionary holds only the last value associated with that key.
Python dictionaries (in Python 2, which your print statements indicate, and in Python 3 before 3.7) are unordered sets of key: value pairs. So when you write their elements to the output file, don't expect the input order to be preserved.
Another problem I see in your script is the loop that writes the output file: it shouldn't be "inside" the loop that reads from the input file.
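The three points can be checked on a small sample. The sketch below is an illustration under assumptions: the input is an in-memory, tab-separated string whose third line deliberately repeats the key of the first, collections.OrderedDict is one possible way to keep input order, and the output is built only after the reading loop has finished.

```python
import io
from collections import OrderedDict

# Made-up tab-separated sample; the third line repeats the first line's key.
sample = (u"1776344_at\t1779734_at\t0.75\t1.00\t-0.49\n"
          u"1776344_at\t1771911_at\t0.93\t0.83\t2.28\n"
          u"1776344_at\t1779734_at\t9.99\t9.99\t9.99\n")

model_dict = OrderedDict()   # remembers the order in which keys were first added
duplicate_keys = []

for line in io.StringIO(sample):
    fields = line.rstrip('\r\n').split('\t')
    key = ';'.join(fields[:2])
    value = ';'.join(fields[2:])
    if key in model_dict:
        duplicate_keys.append(key)   # this line overwrites an earlier one
    model_dict[key] = value

# Build the output only after the reading loop is done.
output_lines = ['%s:%s' % (k, v) for k, v in model_dict.items()]

print(len(model_dict))
print(duplicate_keys)
print(output_lines)
```

Counting the entries in duplicate_keys for the real file would show directly whether repeated keys explain the drop from 12552 input lines to 739 output lines.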