Code is not able to find my function in Python (Spark) class - python-2.7

I need some help with an error in my code. The code retrieves Zomato reviews, stores them in HDFS, then reads them back and performs recommender analytics on them. The problem is that one of my functions is not recognized in the PySpark code. I am not pasting the entire code, as it might be confusing; instead I have written a small, similar use case below for easier understanding.
I am trying to read a file from the local filesystem, convert the RDD to a DataFrame, perform some operations, convert it back to an RDD, and apply a map operation that joins the fields with a '|' delimiter before saving the result to HDFS.
When I try to call self.filter_data(y) inside the lambda in the check function, the function is not recognized and I get this error:
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
Can anyone tell me why my filter_data function is not recognized? Do I need to add anything, or is something wrong with the way I am calling it? Please help. Thanks in advance.
Input:
starting
0|0|ffae4f|0|https://b.zmtcdn.com/data/user_profile_pictures/565/aed32fa2eb18bb4a5a3ba426870fd565.jpg?fit=around%7C100%3A100&crop=100%3A100%3B%2A%2C%2A|https://www.zomato.com/akellaram87?utm_source=api_basic_user&utm_medium=api&utm_campaign=v2.1|2.5|FFBA00|Well...|unknown|16946626|2017-08-01T00-25-43.455182Z|30059877|Have been here for a quick bite for lunch, ambience and everything looked good, food was okay but presentation was not very appealing. We or...|2017-04-15 16:38:38|Big Foodie|6|Venkata Ram Akella|akellaram87|Bad Food|0.969352505662|0|0|0|0|0|0|1|1|0|0|1|0|0|0.782388212399
ending
starting
1|0|ffae4f|0|https://b.zmtcdn.com/data/user_profile_pictures/4d1/d70d7a57e1bfdf296ff4db3d8daf94d1.jpg?fit=around%7C100%3A100&crop=100%3A100%3B%2A%2C%2A|https://www.zomato.com/users/sm4-2011696?utm_source=api_basic_user&utm_medium=api&utm_campaign=v2.1|1|CB202D|Avoid!|unknown|16946626|2017-08-01T00-25-43.455182Z|29123338|Giving a 1.0 rating because one cannot proceed with writing a review, without rating it. This restaurant deserves a 0 star rating. The qual...|2017-01-04 10:54:53|Big Foodie|4|Sm4|unknown|Bad Service|0.964402034541|0|1|0|0|0|0|0|1|0|0|0|1|0|0.814540622345
ending
My code:
# note: the class must be defined before the __main__ block that instantiates it
class new:
    def __init__(self):
        print 'entered into init'

    def check(self):
        # read the raw file, split each line on '|' and build Row objects
        data = sc.textFile('file:///bdaas/src/spark_dependencies/classifier_data/final_Output.txt') \
            .map(lambda x: x.split('|')) \
            .map(lambda z: Row(restaurant_id=z[0], rating=z[1], review_id=z[2], review_text=z[3],
                               rating_color=z[4], rating_time_friendly=z[5], rating_text=z[6],
                               time_stamp=z[7], likes=z[8], comment_count=z[9], user_name=z[10],
                               user_zomatohandle=z[11], user_foodie_level=z[12], user_level_num=z[13],
                               foodie_color=z[14], profile_url=z[15], profile_image=z[16],
                               retrieved_time=z[17]))
        data_r = sqlContext.createDataFrame(data)
        data_r.show()
        d = data_r.rdd.collect()
        print d
        # this is the line that raises the SPARK-5063 error
        data_r.rdd.map(lambda x: list(x)).map(lambda y: self.filter_data(y)).collect()
        print data_r

    def filter_data(self, y):
        # join the non-empty fields back together with '|'
        s = str()
        for i in y:
            print i.encode('utf-8')
            if i != '':
                s = s + i.encode('utf-8') + '|'
        print s[0:-1]
        return s[0:-1]

if __name__ == '__main__':
    import os, logging, sys, time, pandas, json
    from subprocess import PIPE, Popen, call
    from datetime import datetime, time, timedelta
    from pyspark import SparkContext, SparkConf
    conf = SparkConf().setAppName('test')
    sc = SparkContext(conf=conf, pyFiles=['/bdaas/exe/nlu_project/spark_classifier.py',
                                          '/bdaas/exe/spark_zomato/other_files/spark_zipcode.py',
                                          '/bdaas/exe/spark_zomato/other_files/spark_zomato.py',
                                          '/bdaas/exe/spark_zomato/conf_files/spark_conf.py',
                                          '/bdaas/exe/spark_zomato/conf_files/date_comparision.py'])
    from pyspark.sql import Row, SQLContext, HiveContext
    from pyspark.sql.functions import lit
    sqlContext = HiveContext(sc)
    import sys, logging, pandas as pd
    import spark_conf
    n = new()
    n.check()
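For reference, the error above is the classic SPARK-5063 symptom: referencing self.filter_data inside the lambda makes Spark serialize the bound method, and with it the whole instance and everything reachable from it, which can pull driver-only objects such as the SparkContext into the closure shipped to the workers. A minimal sketch of the usual workaround, under the same data layout as above, is to make filter_data a plain module-level function so that no instance is captured:

# sketch: a free function captures nothing but its own argument
def filter_data(y):
    s = str()
    for i in y:
        if i != '':
            s = s + i.encode('utf-8') + '|'
    return s[0:-1]

# inside check(), reference the free function instead of self.filter_data
data_r.rdd.map(lambda x: list(x)).map(filter_data).collect()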

Related

Cannot iterate over AbstractOrderedScalarSet before it has been constructed (initialized)

I have just started with Pyomo and Python, and I am trying to create a simple model, but I have a problem adding a constraint.
I followed this example from GitHub:
https://github.com/brentertainer/pyomo-tutorials/blob/master/introduction/02-lp-pyomo.ipynb
import pandas as pd
import pyomo.environ as pe
import pyomo.opt as po

# DATA
T = 3
CH = 2
time = ['t{0}'.format(t+1) for t in range(T)]
CHP = ['CHP{0}'.format(s+1) for s in range(CH)]

# Technical characteristics
heat_maxprod = {'CHP1': 250, 'CHP2': 250}  # only for CHPs

# MODEL
seq = pe.ConcreteModel

### SETS
seq.CHP = pe.Set(initialize=CHP)
seq.T = pe.Set(initialize=time)

### PARAMETERS
seq.heat_maxprod = pe.Param(seq.CHP, initialize=heat_maxprod)  # max heat production

### VARIABLES
seq.q_DA = pe.Var(seq.CHP, seq.T, domain=pe.Reals)

### CONSTRAINTS
## Maximum and minimum heat production
seq.Heat_DA1 = pe.ConstraintList()
for t in seq.T:
    for s in seq.CHP:
        seq.Heat_DA1.add(0 <= seq.q_DA[s, t])

seq.Heat_DA2 = pe.ConstraintList()
for t in seq.T:
    for s in seq.CHP:
        seq.Heat_DA2.add(seq.q_DA[s, t] <= seq.heat_maxprod[s])

### OBJECTIVE
seq.obj = Objective(expr=sum(seq.C_fuel[s]*(seq.rho_heat[s]*seq.q_DA[s, t]) for t in seq.T for s in seq.CHP))
When I run the program I am getting the following error:
RuntimeError: Cannot iterate over AbstractOrderedScalarSet 'AbstractOrderedScalarSet' before it has been constructed (initialized): 'iter' is an attribute on an Abstract component and cannot be accessed until the component has been fully constructed (converted to a Concrete component) using AbstractModel.create_instance() or AbstractOrderedScalarSet.construct().
Can someone please help with this issue? Thanks!
P.S. I know that the resulting answer for the problem is zero; I just want to get it working with the correct syntax.
In this line of code:
seq=pe.ConcreteModel
You are missing the parentheses, so you are just creating an alias for the class instead of instantiating it.
Try:
seq=pe.ConcreteModel()
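As a side note, once the model is constructed the objective line will also fail with a NameError, because Objective is referenced without the pe. prefix used everywhere else. Assuming C_fuel and rho_heat are defined elsewhere in the full model, it would need to be:

seq.obj = pe.Objective(expr=sum(seq.C_fuel[s] * (seq.rho_heat[s] * seq.q_DA[s, t])
                                for t in seq.T for s in seq.CHP))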

Writing to a table after transformation (bonobo-sqlalchemy)

I'm trying to read a table, modify a column, and write to another table. I followed the available documentation and ran the following code. It doesn't give any errors, but the task doesn't get performed either.
When I remove the transformation step, the information does get written.
import sqlalchemy
import bonobo
import bonobo_sqlalchemy

def get_services(**options):
    return {
        'sql_alchemy.engine': sqlalchemy.create_engine('postgresql://postgres:password@localhost:5432/postgres')
    }

def transform(*row):
    new_row = row[0] + 1, row[1]
    yield new_row

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(
        bonobo_sqlalchemy.Select('SELECT * FROM users', engine='sql_alchemy.engine'),
        transform,
        bonobo_sqlalchemy.InsertOrUpdate(table_name='table_1', engine='sql_alchemy.engine'),
    )
    return graph

# The __main__ block actually executes the graph.
if __name__ == '__main__':
    parser = bonobo.get_argument_parser()
    with bonobo.parse_args(parser) as options:
        bonobo.run(get_graph(**options), services=get_services(**options))
Output:
- Select in=1 out=6 [done]
- format_for_db in=6 out=6 [done]
- InsertOrUpdate in=6 out=6 [done]
It works when a dictionary is yielded instead, as follows:
yield {"id": row[0], "text": row[1], "count": row[2]}
with a bonobo.UnpackItems(0) node in the chain after the transformation.
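For completeness, a sketch of how the working chain might look based on that answer (the column names id, text, and count are assumed from the example above, not taken from the actual users table):

def transform(*row):
    # yield a dict of named fields instead of a bare tuple
    yield {"id": row[0] + 1, "text": row[1], "count": row[2]}

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(
        bonobo_sqlalchemy.Select('SELECT * FROM users', engine='sql_alchemy.engine'),
        transform,
        bonobo.UnpackItems(0),  # unpack the dict so the writer sees named columns
        bonobo_sqlalchemy.InsertOrUpdate(table_name='table_1', engine='sql_alchemy.engine'),
    )
    return graph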

How to loop python scripts together

I have two files: an 'initialization' script, file1.py, and a 'main code' script, main_code.py. The main_code.py is really a several-hundred-line .ipynb that was converted to a .py file. I want to run the same skeleton of the code each time, with the only adjustment being the different parameters found in the file1.py script.
In reality it is much more complex than what I have laid out below, with more references to other locations/DBs and so on.
However, I receive errors such as "'each_item[0]' is not defined". I can't seem to pass the values/variables that come from my loop in file1.py into the script that is called inside the loop.
I must be doing something obviously wrong, as I imagine this is a simple fix.
file1.py:
import pandas as pd
import os
import numpy as np  # 'bumpy' was a typo for numpy
import jaydebeapi as jd
# etc...

cities = ['NYC', 'LA', 'DC', 'MIA']  # really comes from a query/column
value_list = [5, 500, 5000, 300]  # comes from a query/column
zipped = list(zip(cities, value_list))  # make tuples

for each_item in zipped:
    os.system('python main_code.py')

main_code.py (where I'm getting the errors):
names = each_item[0]
value = each_item[1]
# lots of things going on here in the real code, but for simplicity...
value = value * 4
print value
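A minimal sketch of one common fix, assuming the two files above: os.system launches a separate Python process that cannot see the loop variables in file1.py, so the values have to be passed explicitly, for example as command-line arguments:

# file1.py: pass the tuple's fields as arguments to the child process
for each_item in zipped:
    os.system('python main_code.py {0} {1}'.format(each_item[0], each_item[1]))

# main_code.py: read the values back from sys.argv
import sys
names = sys.argv[1]
value = int(sys.argv[2])
value = value * 4
print value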

Bigquery : job is done but job.query_results().total_bytes_processed returns None

The following code:
import time
from google.cloud import bigquery

client = bigquery.Client()

query = """\
select 3 as x
"""

dataset = client.dataset('dataset_name')
table = dataset.table(name='table_name')

job = client.run_async_query('job_name_76', query)
job.write_disposition = 'WRITE_TRUNCATE'
job.destination = table
job.begin()

retry_count = 100
while retry_count > 0 and job.state != 'DONE':
    retry_count -= 1
    time.sleep(10)
    job.reload()

print job.state
print job.query_results().name
print job.query_results().total_bytes_processed
prints:
DONE
job_name_76
None
I do not understand why total_bytes_processed returns None, because the job is done and the documentation says:
total_bytes_processed:
Total number of bytes processed by the query.
See
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#totalBytesProcessed
Return type: int, or NoneType
Returns: Count generated on the server (None until set by the server).
Looks like you are right. As you can see in the code, the current API does not populate the bytes-processed data.
This has been reported in an issue, and as you can see in tseaver's PR, the feature has already been implemented and awaits review/merging, so we will probably have this code in production quite soon.
In the meantime you could get the result from the _properties attribute of the job, like:
import os
import time
from google.cloud.bigquery import Client

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/key.json'

def wait_job(job, retries=100, interval=10):
    # poll until the job finishes, same pattern as the loop in the question
    while retries > 0 and job.state != 'DONE':
        retries -= 1
        time.sleep(interval)
        job.reload()

bc = Client()
query = 'your query'
job = bc.run_async_query('name', query)
job.begin()
wait_job(job)

query_results = job._properties['statistics'].get('query')
query_results should have the totalBytesProcessed you are looking for.
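For example, a hypothetical lookup once the job has finished (totalBytesProcessed is the raw field name used by the REST API):

print query_results.get('totalBytesProcessed')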

TypeError('The value of a feed cannot be a tf.Tensor object...') though I am providing it a numpy array

I am passing a NumPy array in my feed_dict, but it still gives the error that the value of a feed cannot be a tf.Tensor object.
index = tf.placeholder(tf.int32, shape=[None], name='index')
dontknow = np.random.choice(range(1, 200), 180)

_, summary = sess.run([train, merged], feed_dict={
    input_placeholder: train_batch_x,
    attr_placeholder: train_class_attr,
    label_placeholder: train_batch_y,
    index: dontknow
})
Is this a bug in the TensorFlow library? I wanted to post it as an issue but wasn't sure. Any help is highly appreciated.
Thanks
I think your problem is not with the dontknow variable; it is with one of these:
input_placeholder:train_batch_x,
attr_placeholder:train_class_attr,
label_placeholder:train_batch_y,
When I remove them, I can execute your code without any error:
import tensorflow as tf
import numpy as np

index = tf.placeholder(tf.int32, shape=[None], name='index')
dontknow = np.random.choice(range(1, 200), 180)

with tf.Session() as sess:
    print sess.run(index, {index: dontknow})
Print each of them before calling sess.run to find out which one is a tf.Tensor.
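A quick, hypothetical check along those lines (the three train_* variables are assumed to exist in the original script):

# print the type of each feed value; a tf.Tensor here is the culprit
for name, value in [('input', train_batch_x),
                    ('attr', train_class_attr),
                    ('label', train_batch_y)]:
    print name, type(value)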