How to read/iterate through a data stream in Python - python-2.7

I have a stream created at port 9999 of my computer.
I have to implement the DGIM algorithm on it.
However, I am not able to read the bits in the data stream one by one.
Below is my code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import math
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream("localhost", 9999)  # the socket stream on port 9999 mentioned above
When I use the following command, I am able to print the stream in batches:
lines.pprint()
ssc.start() # Start the computation
ssc.awaitTermination()
But when I try to print each bit, it gives an error:
for l in lines.iter_lines():
    print l
ssc.start() # Start the computation
ssc.awaitTermination()
Can someone tell me how I can read each bit from the stream so that I can
implement the algorithm on it?

I used the following code:
streams.foreachRDD(lambda c: function(c))
def function(c):
    return c.collect()
This makes an RDD out of each batch of the stream, and the function collects the elements of that RDD.
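To make this concrete, here is a minimal sketch of one way to visit each element of the stream individually, assuming the lines DStream from the question; process_bit is a hypothetical placeholder for the DGIM bucket update.
def process_bit(bit):
    # hypothetical placeholder: update the DGIM buckets with one incoming bit
    print(bit)
def process_batch(rdd):
    # collect() brings each batch to the driver, so the DGIM state can live in one place;
    # this is fine for small test streams
    for element in rdd.collect():
        process_bit(element)
lines.foreachRDD(process_batch)
ssc.start()
ssc.awaitTermination()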

Related

Delete pcap file after parsing it with PcapReader in scapy

I'm parsing a pcap file with PcapReader in scapy. After that I want to delete the pcap file, but it fails with this error:
OSError: [Errno 26] Text file busy: '/media/sf_SharedFolder/AVB/test.pcap'
This is my python code:
from scapy.all import *
import os
var = []
for packet in PcapReader('/media/sf_SharedFolder/AVB/test.pcap'):
    var.append(packet[Ether].src)
os.remove('/media/sf_SharedFolder/AVB/test.pcap')
I think this error occurs with any pcap file.
Does somebody have any idea?
You may want to try with Scapy's latest development version (from https://github.com/secdev/scapy), since I cannot reproduce your issue with it.
If that does not work, check with lsof /media/sf_SharedFolder/AVB/test.pcap (as root) if another program has opened your capture file. If so, try to find (and kill, if possible) that program.
You can try two different hacks to figure out what exactly is happening:
Test 1: wait.
from scapy.all import *
import os
import time
var = []
for packet in PcapReader('/media/sf_SharedFolder/AVB/test.pcap'):
    var.append(packet[Ether].src)
time.sleep(2)
os.remove('/media/sf_SharedFolder/AVB/test.pcap')
Test 2: explicitly close.
from scapy.all import *
import os
var = []
pktgen = PcapReader('/media/sf_SharedFolder/AVB/test.pcap')
for packet in pktgen:
    var.append(packet[Ether].src)
pktgen.close()
os.remove('/media/sf_SharedFolder/AVB/test.pcap')
Found the solution. I replaced "PcapReader()" with "rdpcap()". It seems PcapReader keeps the file open until the Python script finishes.
This is the working code:
from scapy.all import *
import os
var = []
p=rdpcap('/media/sf_SharedFolder/AVB/test.pcap')
for packet in p:
    var.append(packet[Ether].src)
os.remove('/media/sf_SharedFolder/AVB/test.pcap')
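Note that rdpcap() reads the whole capture into memory, which may not be what you want for very large files. If you would rather keep the streaming PcapReader, a small sketch with try/finally guarantees the handle is closed before the file is removed, even if parsing raises:
from scapy.all import *
import os
var = []
pktgen = PcapReader('/media/sf_SharedFolder/AVB/test.pcap')
try:
    for packet in pktgen:
        var.append(packet[Ether].src)
finally:
    pktgen.close()  # release the file handle no matter what happened above
os.remove('/media/sf_SharedFolder/AVB/test.pcap')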

How to use regular expressions to pull something from a terminal output?

I'm attempting to use the re module to look through some terminal output. When I ping a server through terminal using ping -n 1 host (I'm using Windows), it gives me much more information than I want. I want just the amount of time that it takes to get a reply from the server, which in this case is always denoted by an integer and then the letters 'ms'. The error I get explains that the output from the terminal is not a string, so I cannot use regular expressions on it.
from os import system as system_call
import re
def ping(host):
    return system_call("ping -n 1 " + host) == 0
host = input("Select a host to ping: ")
regex = re.compile(r"\w\wms")
final_ping = regex.search(ping(host))
print(final_ping)
os.system only returns the exit status (0 on success), not the output. However, if we use subprocess, we can capture the output, store it in a variable out, and then run the regex search on that.
import subprocess
import re
def ping(host):
    ping = subprocess.Popen(["ping", "-n", "1", host], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, error = ping.communicate()
    return str(out)
host = input("Select a host to ping: ")
final_ping = re.findall("\d+ms",ping(host))[0]
print(final_ping)
Output:
22ms
There are two problems with your code:
Your ping function doesn't return the terminal output. It only returns a bool that reports if the ping succeeded. The ping output is directly forwarded to the terminal that runs the Python script.
Python 3 differentiates between strings (for text, consisting of Unicode codepoints) and bytes (for any data, consisting of bytes). As Python cannot know that ping only outputs ASCII text, you will get a bytes object if you don't specify which text encoding is in use.
It would be best to use the subprocess module instead of os.system. This is also suggested by the Python documentation.
One possible way is to use subprocess.check_output with the encoding parameter to get a string instead of bytes:
from subprocess import check_output
import sys
def ping(host):
    return check_output(
        "ping -n 1 " + host,
        shell=True,
        encoding=sys.getdefaultencoding()
    )
...
EDIT: The encoding parameter is only supported since Python 3.6. If you are using an older version, try this:
from subprocess import check_output
import sys
def ping(host):
    return check_output(
        "ping -n 1 " + host,
        shell=True
    ).decode()
...
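Either way, the returned text can be searched exactly as in the first answer. Here is a small usage sketch, using the ping function defined above (the host is only an example):
import re
output = ping("127.0.0.1")           # hypothetical host, purely for illustration
match = re.search(r"\d+ms", output)
if match:
    print(match.group(0))            # e.g. "22ms"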

PySpark Processing Stream data and saving processed data to file

I am trying to replicate a device that is streaming its location coordinates, then process the data and save it to a text file.
I am using Kafka and Spark Streaming (on pyspark). This is my architecture:
1- The Kafka producer emits data to a topic named test in the following string format:
"LG float LT float", for example: LG 8100.25191107 LT 8406.43141483
Producer code:
from kafka import KafkaProducer
import random
producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(0, 10000):
    lg_value = str(random.uniform(5000, 10000))
    lt_value = str(random.uniform(5000, 10000))
    producer.send('test', 'LG '+lg_value+' LT '+lt_value)
producer.flush()
The producer works fine and I get the streamed data in the consumer (and even in Spark).
2- Spark Streaming receives this stream; I can even pprint() it.
Spark Streaming processing code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
ssc = StreamingContext(sc, 1)
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"bootstrap.servers": "localhost:9092"})
lines = kvs.map(lambda x: x[1])
words = lines.flatMap(lambda line: line.split(" "))
words.pprint()
word_pairs = words.map(lambda word: (word, 1))
counts = word_pairs.reduceByKey(lambda a, b: a+b)
results = counts.foreachRDD(lambda word: word.saveAsTextFile("C:\path\spark_test.txt"))
# I tried kvs.saveAsTextFiles('C:\path\spark_test.txt')
# to copy the whole stream, and it works fine
ssc.start()
ssc.awaitTermination()
The error I get is:
16/12/26 00:51:53 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker did not connect back in time
And other exceptions.
What I actually want is to save each entry "LG float LT float" in JSON format to a file, but first I want to simply save the coordinates to a file; I can't seem to make that happen. Any ideas?
I can provide the full stack trace if needed.
I solved this by writing a function that saves each RDD to the file. This is the code that solved my problem:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
ssc = StreamingContext(sc, 1)
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"bootstrap.servers": "localhost:9092"})
lines = kvs.map(lambda x: x[1])
coords = lines.map(lambda line: line)
def saveCoord(rdd):
    rdd.foreach(lambda rec: open("C:\path\spark_test.txt", "a").write(
        "{"+rec.split(" ")[0]+":"+rec.split(" ")[1]+","+rec.split(" ")[2]+":"+rec.split(" ")[3]+"},\n"))
coords.foreachRDD(saveCoord)
coords.pprint()
ssc.start()
ssc.awaitTermination()
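Since the stated goal is JSON output, here is a small sketch of an alternative to saveCoord, assuming every record really has the "LG <float> LT <float>" shape; it uses the json module and a with block instead of concatenating braces by hand:
import json
def save_coord_json(rdd):
    records = rdd.collect()  # batches are small here, so collecting to the driver is acceptable
    if not records:
        return
    with open(r"C:\path\spark_test.txt", "a") as f:
        for rec in records:
            key1, val1, key2, val2 = rec.split(" ")  # "LG <float> LT <float>"
            f.write(json.dumps({key1: float(val1), key2: float(val2)}) + "\n")
coords.foreachRDD(save_coord_json)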

Python on Raspberry Pi: results are only integers

I am using the following Python code on the Raspberry Pi to collect an audio signal and output the volume. I can't understand why my output contains only integers.
#!/usr/bin/env python
import alsaaudio as aa
import audioop
# Set up audio
data_in = aa.PCM(aa.PCM_CAPTURE, aa.PCM_NONBLOCK, 'hw:1')
data_in.setchannels(2)
data_in.setrate(44100)
data_in.setformat(aa.PCM_FORMAT_S16_LE)
data_in.setperiodsize(256)
while True:
    # Read data from device
    l, data = data_in.read()
    if l:
        # catch frame error
        try:
            max_vol = audioop.max(data, 2)
            scaled_vol = max_vol/4680
            if scaled_vol == 0:
                print "vol 0"
            else:
                print scaled_vol
        except audioop.error, e:
            if e.message != "not a whole number of frames":
                raise e
Also, I don't understand the syntax in this line:
l,data = data_in.read()
It's likely reading in bytes. The line l,data = data_in.read() returns a tuple, which is unpacked into l and data. Run the type() built-in function on those variables and see what you have to work with.
Otherwise, look into the PCM Terminology and Concepts section of the documentation for the pyalsaaudio package, located here.
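Two short illustrations may help, assuming Python 2 semantics as in the question: read() returns a tuple that is unpacked into l and data, and / between two ints performs floor division, which is the likely reason the printed volume is always a whole number.
# l, data = data_in.read() is ordinary tuple unpacking:
result = (256, '\x00\x01\x02\x03')   # hypothetical return value: (frame count, raw sample bytes)
l, data = result
print type(l), type(data)            # check what you actually got back
# In Python 2, / between two ints is floor division:
max_vol = 2500                       # hypothetical peak value
print max_vol / 4680                 # 0 -> only integers
print max_vol / 4680.0               # 0.534... once one operand is a float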

Unable to write to file using Python 2.7

I have written the following code. I am able to print out the parsed values of lat and lon, but I am unable to write them to a file. I tried flush and also tried closing the file, but to no avail. Can somebody point out what's wrong here?
import os
import serial
def get_present_gps():
    ser = serial.Serial('/dev/ttyUSB0', 4800)
    ser.open()
    # open a file to write gps data
    f = open('/home/iiith/Desktop/gps1.txt', 'w')
    data = ser.read(1024)  # read 1024 bytes
    f.write(data)  # write data into file
    f = open('/home/iiith/Desktop/gps1.txt', 'r')  # fetch the required file
    f1 = open('/home/iiith/Desktop/gps2.txt', 'a+')
    for line in f.read().split('\n'):
        if line.startswith('$GPGGA'):
            try:
                lat, _, lon = line.split(',')[2:5]
                lat = float(lat)
                lon = float(lon)
                print lat/100
                print lon/100
                a = [lat, lon]
                f1.write(lat+",")
                f1.flush()
                f1.write(lon+"\n")
                f1.flush()
                f1.close()
            except:
                pass
while True:
    get_present_gps()
You're covering up the error by using except: pass. Don't do that... ever. At least log the exception.
One error it definitely covers is lat+",", which fails because it is float + str and that operation is not implemented. But there may be more.
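To illustrate both points, here is a hedged sketch of just the parse-and-write step, with names taken from the question: format the floats into a string before writing, and at least report the exception instead of silently discarding it.
def write_coords(line, f1):
    # line is one "$GPGGA" sentence, f1 an open file object (both as in the question)
    try:
        lat, _, lon = line.split(',')[2:5]
        lat = float(lat)
        lon = float(lon)
        f1.write("{0},{1}\n".format(lat, lon))  # str.format converts the floats to text
        f1.flush()
    except (ValueError, IndexError) as e:
        print "skipping malformed line: %s" % e  # log instead of silently passing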