Do emr-dynamodb-connector reads data in parallel - amazon-web-services

Do emr-dynamodb-connector reads data in parallel in spark? I checked that RDD which I am getting from it has only one partition.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.input.tableName", "TableName")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
Above code to read data from DynamoDB. Following is the number of partitions.
scala> orders.getNumPartitions
res4: Int = 1
Is there any way to read data in parallel and process it?

Related

Can't connect to Online Sharepoint using Python

I'am trying to display all sharepoint's list name but i'am getting this error :
No handlers could be found for logger "office365.runtime.auth.saml_token_provider.SamlTokenProvider._process_service_token_response"
This is my code :
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
url = 'https://abc.sharepoint.com/sites/siteName/'
ctx_auth = AuthenticationContext(url)
if ctx_auth.acquire_token_for_user(username='username#abc.com'
,password ='password'):
ctx = ClientContext(url, ctx_auth)
lists = ctx.web.lists
ctx.load(lists)
ctx.execute_query()
for l in lists:
print(l.properties["Title"])
Thanks
I tested below code here with python 2.7 and it works well.
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
tenant_url= "https://company.sharepoint.com"
site_url="https://company.sharepoint.com/sites/sname"
ctx_auth = AuthenticationContext(tenant_url)
if ctx_auth.acquire_token_for_user("abc#company.onmicrosoft.com","mypassword"):
ctx = ClientContext(site_url, ctx_auth)
lists = ctx.web.lists
ctx.load(lists)
ctx.execute_query()
for l in lists:
print(l.properties["Title"])
else:
print(ctx_auth.get_last_error())
Result:
If this is related to ADFS, please refer to this closed question:
https://github.com/vgrem/Office365-REST-Python-Client/issues/85
BR
Well i found a solution to get data for specific sharepoint List
from shareplum import Site
from shareplum import Office365
import json
import csv
import pandas
authcookie = Office365('https://abc.sharepoint.com/', username='username', password='password').GetCookies()
site = Site('https://abc.sharepoint.com/sites/SitesName/', authcookie=authcookie)
sp_list = site.List('ListName')
#print(sp_list)
data = sp_list.GetListItems(fields=['FieldName1','FieldName2'])
c = pandas.read_json(json.dumps(data)).to_csv("output.csv")

Regex in spark.read.json

I want to read all json files which are having timestamp one hour before the current time from the hadoop directory.
File name is like test_2020021418553333
import java.util.Calendar;
import java.text.SimpleDateFormat;
val form = new SimpleDateFormat("yyyyMMddhh");
val c = Calendar.getInstance();
c.add(Calendar.HOUR, -1);
val path ="/Test_"+form.format(c.getTime())+"*";
val test_df = spark.read.json(path)
When I run this code: Path does not exist error is coming.
Can anyone suggest how to read file names like Test_20200214{Any Possible combination of Digit}??
A quick test show that you have minutes
form.format(c.getTime())
res2: String = 2020021401
So remove the latest 2 cars
regards

How to page through QueryResults

I am getting result from BigQuery using the following code:
from google.oauth2 import service_account
from google.cloud import bigquery
credential = service_account.Credentials.from_service_account_file(SERVICE_ACCOUNT_FILE)
scoped_credential = credential.with_scopes(BIG_QUERY_SCOPE)
client = bigquery.Client(project="XX-XX",credentials=scoped_credential)
query_results = client.run_sync_query(query_detail)
query_results.use_legacy_sql = False
query_results.run()
iterator = query_results.fetch_data()
rows = iterator.query_result.rows
But it only returns up-to 50000 rows. I tried to paginate while fetching data, but failed to figure out how to do it:
page_token = query_results.page_token
iterator = query_results.fetch_data(max_results=500, page_token=page_token)
I could not find out how to get the updated page_token.
Thanks,
I think you are close. Try running this code now:
data = list(query_results.fetch_data()) # changed from `iterator` to `data` the variable name
The management of page tokens is done automatically for you.

Code is Not able to find my function in Python(Spark) class

I need some help regarding the error in code. My Code consists of retrieving the zomato reviews and storing it in HDFS and again reading it performing Recommender Analtyics on it. I am getting a problem regarding my function is not recognizing in pyspark code. I am not entirely pasting the code as it might be confusing so i am writing a small similar use case for your easy understanding.
I am trying to read a file from local and converting it to dataframe from rdd and performing some operations and again converting it to rdd and performing map operation to have delimiter by '|' and then save it to HDFS.
When i try to call self.filter_data(y) in lambda func of check function its not recognizing and giving me error as
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
****CAN ANY ONE HELP ME WHY MY FILTER_DATA FUNCTION IS NOT RECOGNISING? SHOULD I NEED TO ADD ANY THING OR ANY THING WRONG IN THE WAY I AM CALLING. PLEASE HELP ME. THANKS IN ADVANCE****
INPUT VALUE
starting
0|0|ffae4f|0|https://b.zmtcdn.com/data/user_profile_pictures/565/aed32fa2eb18bb4a5a3ba426870fd565.jpg?fit=around%7C100%3A100&crop=100%3A100%3B%2A%2C%2A|https://www.zomato.com/akellaram87?utm_source=api_basic_user&utm_medium=api&utm_campaign=v2.1|2.5|FFBA00|Well...|unknown|16946626|2017-08-01T00-25-43.455182Z|30059877|Have been here for a quick bite for lunch, ambience and everything looked good, food was okay but presentation was not very appealing. We or...|2017-04-15 16:38:38|Big Foodie|6|Venkata Ram Akella|akellaram87|Bad Food|0.969352505662|0|0|0|0|0|0|1|1|0|0|1|0|0|0.782388212399
ending
starting
1|0|ffae4f|0|https://b.zmtcdn.com/data/user_profile_pictures/4d1/d70d7a57e1bfdf296ff4db3d8daf94d1.jpg?fit=around%7C100%3A100&crop=100%3A100%3B%2A%2C%2A|https://www.zomato.com/users/sm4-2011696?utm_source=api_basic_user&utm_medium=api&utm_campaign=v2.1|1|CB202D|Avoid!|unknown|16946626|2017-08-01T00-25-43.455182Z|29123338|Giving a 1.0 rating because one cannot proceed with writing a review, without rating it. This restaurant deserves a 0 star rating. The qual...|2017-01-04 10:54:53|Big Foodie|4|Sm4|unknown|Bad Service|0.964402034541|0|1|0|0|0|0|0|1|0|0|0|1|0|0.814540622345
ending
My code:
if __name__== '__main__':
import os,logging,sys,time,pandas,json;from subprocess
import PIPE,Popen,call;from datetime import datetime, time, timedelta
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('test')
sc = SparkContext(conf = conf,pyFiles=['/bdaas/exe/nlu_project/spark_classifier.py','/bdaas/exe/spark_zomato/other_files/spark_zipcode.py','/bdaas/exe/spark_zomato/other_files/spark_zomato.py','/bdaas/exe/spark_zomato/conf_files/spark_conf.py','/bdaas/exe/spark_zomato/conf_files/date_comparision.py'])
from pyspark.sql import Row, SQLContext,HiveContext
from pyspark.sql.functions import lit
sqlContext = HiveContext(sc)
import sys,logging,pandas as pd
import spark_conf
n = new()
n.check()
class new:
def __init__(self):
print 'entered into init'
def check(self):
data = sc.textFile('file:///bdaas/src/spark_dependencies/classifier_data/final_Output.txt').map(lambda x: x.split('|')).map(lambda z: Row(restaurant_id=z[0], rating = z[1], review_id = z[2],review_text = z[3],rating_color = z[4],rating_time_friendly=z[5],rating_text=z[6],time_stamp=z[7],likes=z[8],comment_count =z[9],user_name = z[10],user_zomatohandle=z[11],user_foodie_level = z[12],user_level_num=z[13],foodie_color=z[14],profile_url=z[15],profile_image=z[16],retrieved_time=z[17]))
data_r = sqlContext.createDataFrame(data)
data_r.show()
d = data_r.rdd.collect()
print d
data_r.rdd.map(lambda x: list(x)).map(lambda y: self.filter_data(y)).collect()
print data_r
def filter_data(self,y):
s = str()
for i in y:
print i.encode('utf-8')
if i != '':
s = s + i.encode('utf-8') + '|'
print s[0:-1]
return s[0:-1]

Continuous REST consumption using Akka stream and http

I am trying to build an akka based system which will periodically(every 15 second) send REST request, do some filter, some data cleansing and validation on the received data and save into HDFS.
Below is the code that I wrote.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, HttpResponse, StatusCodes}
import akka.actor.Props
import akka.event.Logging
import akka.actor.Actor
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Try
import akka.http.scaladsl.client.RequestBuilding._
/**
* Created by rabbanerjee on 4/6/2017.
*/
class MyActor extends Actor {
val log = Logging(context.system, this)
import scala.concurrent.ExecutionContext.Implicits.global
def receive = {
case j:HttpResponse => log.info("received" +j)
case k:AnyRef => log.info("received unknown message"+k)
}
}
object STest extends App{
implicit val system = ActorSystem("Sys")
import system.dispatcher
implicit val materializer = ActorMaterializer()
val ss = system.actorOf(Props[MyActor])
val httpClient = Http().outgoingConnection(host = "rest_server.com", port = 8080)
val filterSuccess = Flow[HttpResponse].filter(_.status.isSuccess())
val runnnn = Source.tick(
FiniteDuration(1,TimeUnit.SECONDS),
FiniteDuration(15,TimeUnit.SECONDS),
Get("/"))
.via(httpClient)
.via(filterSuccess)
.to(Sink.actorRef(ss,onCompleteMessage = "done"))
runnnn.run()
}
The problem I am currently facing is,
Even though I used a repeat/tick with source, I can see the result once. It's not repetitively firing the request.
I am also trying to find grouping the result of say 50 such request, coz as I will be writing it to hadoop, I cant write every request, as it will flood HDFS with multiple file.
You are not consuming the responses you are getting back from the HTTP call. It is compulsory to consume the entity bytes returned by Akka HTTP, even if you are not interested in them.
More about this can be found in the docs.
In your example, as you are not using the response entity, you can just discard its bytes. See example below:
val runnnn = Source.tick(FiniteDuration(1,TimeUnit.SECONDS),FiniteDuration(15,TimeUnit.SECONDS),Get("/"))
.via(httpClient)
.map{resp => resp.discardEntityBytes(); resp}
.via(filterSuccess)
.to(Sink.actorRef(ss,onCompleteMessage = "done"))