PySpark Processing Stream data and saving processed data to file - python-2.7

I am trying to replicate a device that is streaming its location coordinates, then process the data and save it to a text file.
I am using Kafka and Spark Streaming (on PySpark). This is my architecture:
1- A Kafka producer emits data to a topic named test as strings of the form:
"LG float LT float", for example: LG 8100.25191107 LT 8406.43141483
Producer code:
from kafka import KafkaProducer
import random

producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(0, 10000):
    lg_value = str(random.uniform(5000, 10000))
    lt_value = str(random.uniform(5000, 10000))
    producer.send('test', 'LG ' + lg_value + ' LT ' + lt_value)
producer.flush()
The producer works fine and I get the streamed data in the consumer (and even in Spark).
2- Spark Streaming receives this stream; I can even pprint() it.
Spark Streaming processing code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"bootstrap.servers": "localhost:9092"})
lines = kvs.map(lambda x: x[1])
words = lines.flatMap(lambda line: line.split(" "))
words.pprint()
word_pairs = words.map(lambda word: (word, 1))
counts = word_pairs.reduceByKey(lambda a, b: a + b)
results = counts.foreachRDD(lambda rdd: rdd.saveAsTextFile("C:\path\spark_test.txt"))
# I tried kvs.saveAsTextFiles('C:\path\spark_test.txt')
# to copy the whole stream, and that works fine
ssc.start()
ssc.awaitTermination()
The error I get is:
16/12/26 00:51:53 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker did not connect back in time
And other exceptions.
What I actually want is to save each entry "LG float LT float" in JSON format to a file, but first I want to simply save the coordinates to a file; I can't seem to make that happen. Any ideas?
I can provide the full stack trace if needed.

I solved this by writing a function that saves each RDD to the file. This is the code that solved my problem:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"bootstrap.servers": "localhost:9092"})
lines = kvs.map(lambda x: x[1])
coords = lines.map(lambda line: line)

def saveCoord(rdd):
    rdd.foreach(lambda rec: open("C:\path\spark_test.txt", "a").write(
        "{" + rec.split(" ")[0] + ":" + rec.split(" ")[1] + "," + rec.split(" ")[2] + ":" + rec.split(" ")[3] + "},\n"))

coords.foreachRDD(saveCoord)
coords.pprint()
ssc.start()
ssc.awaitTermination()
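If you eventually want valid JSON rather than hand-built strings, a minimal sketch of a saveCoord variant using the json module could look like the following; the .json output path is just a placeholder and, like the original saveCoord, this assumes the workers write to the local filesystem:

import json

def saveCoordJson(rdd):
    # sketch: parse "LG <float> LT <float>" and append one JSON object per line
    def write_record(rec):
        parts = rec.split(" ")  # ["LG", "<float>", "LT", "<float>"]
        doc = {parts[0]: float(parts[1]), parts[2]: float(parts[3])}
        with open("C:\path\spark_test.json", "a") as f:  # assumed placeholder path
            f.write(json.dumps(doc) + "\n")
    rdd.foreach(write_record)

coords.foreachRDD(saveCoordJson)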

Related

How to read/iterate through a data stream in Python

I have a stream created at port 9999 of my computer.
I have to implement the DGIM algorithm on it.
However, I am not able to read the bits in the data stream one by one.
Below is my code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import math

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream("localhost", 9999)  # the stream on port 9999 mentioned above
When I use the following command, I am able to print the stream in batches:
lines.pprint()
ssc.start() # Start the computation
ssc.awaitTermination()
But when I try to print each bit it gives an error:
for l in lines.iter_lines():
    print l
ssc.start()             # Start the computation
ssc.awaitTermination()
Can someone tell me how I can read each bit from the stream so that I can implement the algorithm on it?
I used the following code:
def function(c):
    return c.collect()

streams.foreachRDD(lambda c: function(c))
This makes an RDD out of each batch of the stream, and the function collects all of its records.
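To actually iterate element by element (for example to feed bits to DGIM), a rough sketch could look like this, where lines is the socket DStream from the question, process_bit is a hypothetical per-element handler, and each record is assumed to be a space-separated string of bits:

def process_bit(bit):
    # hypothetical handler: feed a single 0/1 value into the DGIM state
    pass

def handle_batch(rdd):
    # collect() brings this batch's records back to the driver,
    # where they can be iterated one by one
    for record in rdd.collect():
        for bit in record.split(" "):
            process_bit(int(bit))

lines.foreachRDD(handle_batch)
ssc.start()
ssc.awaitTermination()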

Saving data from traceplot in PyMC3

Below is the code for a simple Bayesian Linear regression. After I obtain the trace and the plots for the parameters, is there any way in which I can save the data that created the plots in a file so that if I need to plot it again I can simply plot it from the data in the file rather than running the whole simulation again?
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 9, 5)
y = 2*x + 5
yerr = np.random.rand(len(x))

def soln(x, p1, p2):
    return p1 + p2*x

with pm.Model() as model:
    # Define priors
    intercept = pm.Normal('Intercept', 15, sd=5)
    slope = pm.Normal('Slope', 20, sd=5)
    # Model solution
    sol = soln(x, intercept, slope)
    # Define likelihood
    likelihood = pm.Normal('Y', mu=sol,
                           sd=yerr, observed=y)
    # Sampling
    trace = pm.sample(1000, nchains=1)

pm.traceplot(trace)
print pm.summary(trace, ['Slope'])
print pm.summary(trace, ['Intercept'])
plt.show()
There are two easy ways of doing this:
Use a version after 3.4.1 (currently this means installing from master, with pip install git+https://github.com/pymc-devs/pymc3). There is a new feature that allows saving and loading traces efficiently. Note that you need access to the model that created the trace:
...
pm.save_trace(trace, 'linreg.trace')

# later
with model:
    trace = pm.load_trace('linreg.trace')
Use cPickle (or pickle in python 3). Note that pickle is at least a little insecure, don't unpickle data from untrusted sources:
import cPickle as pickle  # just `import pickle` on python 3
...
with open('trace.pkl', 'wb') as buff:
    pickle.dump(trace, buff)

# later
with open('trace.pkl', 'rb') as buff:
    trace = pickle.load(buff)
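Since the trace is usually only useful together with the model that created it (as noted above), a common pattern is to pickle both in one dict, sketched below; whether the model itself pickles cleanly depends on your PyMC3/Theano version:

import cPickle as pickle  # `import pickle` on python 3

# sketch: bundle the model and the trace so both can be restored later
with open('linreg.pkl', 'wb') as buff:
    pickle.dump({'model': model, 'trace': trace}, buff)

# later
with open('linreg.pkl', 'rb') as buff:
    data = pickle.load(buff)
model, trace = data['model'], data['trace']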
Update for someone like me who is still coming across this question:
The load_trace and save_trace functions were removed. Since version 4.0, even the deprecation warning for these functions was removed.
The way to do it now is to use arviz:
with model:
    trace = pymc.sample(return_inferencedata=True)
trace.to_netcdf("filename.nc")
And it can be loaded with:
trace = arviz.from_netcdf("filename.nc")
This way works for me:
# saving trace
pm.save_trace(trace=trace_nb, directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")

# loading saved traces
with model_nb:
    t_nb = pm.load_trace(directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")

multiprocessing Queue deadlock when spawning multiple threads in one process

I created two processes: one process, which spawns multiple threads, is responsible for writing data to a Queue; the other reads data from the Queue. It deadlocks frequently; adding a sleep in the run method of the write module (see the comment in the code) makes it happen less often. Let me put my code below:
environments: python2.7
main.py
from multiprocessing import Process, Queue
from write import write
from read import read

if __name__ == "__main__":
    record_queue = Queue()
    table_queue = Queue()
    pw = Process(target=write, args=[record_queue, table_queue])
    pr = Process(target=read, args=[record_queue, table_queue])
    pw.start()
    pr.start()
    pw.join()
    pr.join()
write.py
from concurrent.futures import ThreadPoolExecutor, as_completed

def write(record_queue, table_queue):
    thread_num = 3
    pool = ThreadPoolExecutor(thread_num)
    futures = [pool.submit(run, record_queue, table_queue) for _ in range(thread_num)]
    results = [r.result() for r in as_completed(futures)]

def run(record_queue, table_queue):
    while True:
        if table_queue.empty():
            break
        table = table_queue.get()
        # adding the code below reduces the chance of deadlock
        #import time
        #import random
        #time.sleep(random.randint(1, 3))
        process_with_table(record_queue, table_queue, table)

def process_with_table(record_queue, table_queue, table):
    # shortened for the example
    for item in [x for x in range(1000)]:
        record_queue.put(item)
read.py
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
import Queue

def read(record_queue, table_queue):
    count = 0
    while True:
        item = record_queue.get()
        count += 1
        print ("item: ", item)
        if count == 4:
            break
I googled it and there are similar questions on SO, but I can't see how they relate to my code. Can anyone help with my code? Thanks...
I seem to have found a solution: change the run method in the write module to:
import multiprocessing.queues  # needed so multiprocessing.queues.Empty can be caught

def run(record_queue, table_queue):
    while True:
        try:
            if table_queue.empty():
                break
            table = table_queue.get(timeout=3)
            process_with_table(record_queue, table_queue, table)
        except multiprocessing.queues.Empty:
            import time
            time.sleep(0.1)
and I never see a deadlock or blocking on the get method anymore.
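The likely reason this helps is that empty() and get() are not atomic: a thread can see a non-empty queue and then block forever in get() after the other threads drain it, so a get() with a timeout breaks that wait. An alternative sketch avoids empty() entirely by having whoever fills table_queue also push one sentinel per worker thread:

# sketch: one sentinel per worker thread instead of relying on empty()
SENTINEL = None

def run(record_queue, table_queue):
    while True:
        table = table_queue.get()   # blocks until something arrives
        if table is SENTINEL:       # a sentinel means there is no more work
            break
        process_with_table(record_queue, table_queue, table)

# on the producer side, after enqueueing the real tables:
# for _ in range(thread_num):
#     table_queue.put(SENTINEL)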

Unable to retrieve Messages from AWS SQS queue using Boto

My Python code looks like this:
import json
import boto.sqs
import boto
from boto.sqs.connection import SQSConnection
from boto.sqs.message import Message
from boto.sqs.message import RawMessage

sqs = boto.connect_sqs(aws_access_key_id='XXXXXXXXXXXXXXX', aws_secret_access_key='XXXXXXXXXXXXXXXXX')
q = sqs.create_queue("Nishantqueue")  # already present
q.set_message_class(RawMessage)
results = q.get_messages()
ret = "Got %s result(s) this time.\n\n" % len(results)
for result in results:
    msg = json.loads(result.get_body())
    ret += "Message: %s\n" % msg['message']
ret += "\n... done."
print ret
My SQS queue contains at least 5 to 6 messages... when I execute this, I get the output below, and this happens on every run; this code isn't able to pull the messages from the queue:
Got 0 result(s) this time.
...done.
I am sure I am missing something in the loop... I couldn't find it though.
Your code is retrieving messages from an Amazon SQS queue, but it doesn't seem to be deleting them. This means that messages will be invisible for a period of time (specified by the visibility_timeout parameter), after which they will reappear. The expectation is that if a message is not deleted within this time, then it has failed to be processed and should reappear on the queue to try again.
Here's some code that pulls a message from a queue, then deletes it after processing. Note the visibility_timeout specified when a message is retrieved. It is using read() to simply return one message:
#!/usr/bin/python27
import boto, boto.sqs
from boto.sqs.message import Message

# Connect to Queue
q_conn = boto.sqs.connect_to_region("ap-southeast-2")
q = q_conn.get_queue('queue-name')

# Get a message
m = q.read(visibility_timeout=15)
if m == None:
    print "No message!"
else:
    print m.get_body()
    q.delete_message(m)
It's possible that your messages were invisible ("in-flight") when you tried to retrieve them.
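If you prefer to keep the original get_messages() approach, a sketch along these lines could work, assuming your boto 2 version supports the num_messages, visibility_timeout and wait_time_seconds parameters: fetch a batch and delete each message after processing it.

# sketch: batch retrieval with explicit deletion after processing
results = q.get_messages(num_messages=5, visibility_timeout=15, wait_time_seconds=10)
print "Got %s result(s) this time." % len(results)
for result in results:
    print result.get_body()
    q.delete_message(result)  # delete it so it does not reappear after the visibility timeout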

Python multiprocessing: parsing, editing, and writing long series of csv files

I have a very long series of similar csv files (14 GB altogether). I need to open each file, replace certain characters, and write the fixed version to a new file. I want to use the processing power of my multicore computer. I tried with mp.Pool and with mp.Process/mp.Queue. The Pool version works, but the Queue approach produces this error:
IOError: [Errno 22] invalid mode ('r') or filename: '<multiprocessing.queues.Queue object at 0x0000000002775A90>'
This is a simplified version of my Pool code:
import os
import pandas as pd
import multiprocessing as mp

def fixer(a_file):
    lines = []
    opened_file = open(a_file)
    for each_line in opened_file:
        lines.append(each_line.replace('mad', 'rational'))
    opened_file.close()
    df = pd.DataFrame(lines)
    # some pandas magic here
    df.to_csv(a_file[:-4] + '_fixed.csv')

if __name__ == "__main__":
    my_path = os.getcwd()
    my_files = list(os.walk(my_path))[0][2]  # I just get the list of file names here
    processors = mp.cpu_count()
    pool = mp.Pool(processes=processors)  # I set as many processes as my computer has processors.
    pool.map(fixer, my_files)
And this is the one for the Queue approach:
import os
import pandas as pd
import multiprocessing as mp

def fixer(a_file):
    lines = []
    opened_file = open(a_file)
    for each_line in opened_file:
        lines.append(each_line.replace('mad', 'rational'))
    opened_file.close()
    df = pd.DataFrame(lines)
    # some pandas magic here
    df.to_csv(a_file[:-4] + '_fixed.csv')

if __name__ == "__main__":
    my_path = os.getcwd()
    my_files = list(os.walk(my_path))[0][2]  # I just get the list of file names here
    processors = mp.cpu_count()
    queue = mp.Queue()
    for each_file in my_files:
        queue.put(each_file)
    processes = [mp.Process(target=fixer, args=(queue,)) for core in range(processors)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
I would appreciate it if you could provide an example to make the Queue version work. In a second processing step, before the files are written, I need the processes to get an intermediate result and do some calculations. This is the reason why I need the queues.
The problem in the Queue script was that I was not getting the next element from the Queue, but passing the whole Queue to the fixer function. This is solved by assigning the value of queue.get() to a variable inside the fixer function:
import os
import pandas as pd
import multiprocessing as mp

def fixer(a_queue):
    a_file = a_queue.get()
    lines = []
    opened_file = open(a_file)
    for each_line in opened_file:
        lines.append(each_line.replace('mad', 'rational'))
    opened_file.close()
    df = pd.DataFrame(lines)
    # some pandas magic here
    df.to_csv(a_file[:-4] + '_fixed.csv')

if __name__ == "__main__":
    my_path = os.getcwd()
    my_files = list(os.walk(my_path))[0][2]  # I just get the list of file names here
    processors = mp.cpu_count()
    queue = mp.Queue()
    for each_file in my_files:
        queue.put(each_file)
    processes = [mp.Process(target=fixer, args=(queue,)) for core in range(processors)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
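Note that, as written, each worker process calls queue.get() once, fixes a single file, and exits, so at most as many files as there are processes get handled per run. A sketch of a fixer that keeps pulling file names until a sentinel is reached could look like this:

def fixer(a_queue):
    # keep pulling file names until a sentinel (None) signals there is no more work
    while True:
        a_file = a_queue.get()
        if a_file is None:
            break
        lines = []
        with open(a_file) as opened_file:
            for each_line in opened_file:
                lines.append(each_line.replace('mad', 'rational'))
        df = pd.DataFrame(lines)
        # some pandas magic here
        df.to_csv(a_file[:-4] + '_fixed.csv')

# in the __main__ block, after putting the file names on the queue:
# for _ in range(processors):
#     queue.put(None)  # one sentinel per worker process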