Scala : Writing Data to S3 bucket - amazon-web-services

I am trying to write data to an S3 bucket, but I am getting the errors below.
SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
18/11/18 23:32:14 ERROR Utils: Aborting task
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/11/18 23:32:14 WARN FileOutputCommitter: Could not delete s3a://Accesskey:SecretKey#test-bucket/Output/Check1Result/_temporary/0/_temporary/attempt_20181118233210_0004_m_000000_0
18/11/18 23:32:14 ERROR FileFormatWriter: Job job_20181118233210_0004 aborted.
18/11/18 23:32:14 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 209)
org.apache.spark.SparkException: Task failed while writing rows.
I have tried the code below and am able to write the data to the local file system, but when I try to write the data to the S3 bucket I get the errors above.
My code:
package Spark_package

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object dataload {
  def main(args: Array[String]) {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("dataload")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()
    val sc = spark.sparkContext
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
    val conf = new SparkConf().setAppName("dataload").setMaster("local[*]").set("spark.speculation", "false")
    val sqlContext = spark.sqlContext

    val data = "C:\\docs\\Input_Market.csv"
    val ddf = spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .load(data)
    ddf.createOrReplaceTempView("data")

    val res = spark.sql("select count(*),cust_id,sum_cnt from data group by cust_id,sum_cnt")
    res.write.option("header", "true").format("csv").save("s3a://Accesskey:SecretKey#test-bucket/Output/Check1Result1")
    spark.stop()
  }
}

Related

ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1] org.apache.spark.SparkException: Exception thrown in awaitResult:

I am new to PySpark and AWS. I am trying to read data from AWS S3 (PySpark version 3.3.0).
I tried this:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config('spark.master', 'local') \
    .config('spark.app.name', 's3app') \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4') \
    .getOrCreate()

sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'access-key')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'secret-key')

df = spark.read.format('parquet').load('s3a://path-to-s3')
I tried almost all of the solutions available on Stack Overflow, but none of them worked for me.
This error is due to the bucket's permissions; please check your IAM policies.
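To separate a Spark configuration problem from an actual permission problem, it can help to try the same credentials outside Spark first. A minimal sketch with boto3; the bucket name and prefix below are placeholders, not values from the question:

import boto3
from botocore.exceptions import ClientError

# Use the same access key / secret key that the Spark job is configured with.
s3 = boto3.client(
    's3',
    aws_access_key_id='access-key',
    aws_secret_access_key='secret-key',
)
try:
    # Fails fast with AccessDenied or NoSuchBucket if IAM or the bucket policy is the issue.
    s3.head_bucket(Bucket='my-bucket')
    resp = s3.list_objects_v2(Bucket='my-bucket', Prefix='path-to-s3/', MaxKeys=1)
    print('Bucket reachable, sample key:', [o['Key'] for o in resp.get('Contents', [])])
except ClientError as e:
    print('Permission problem:', e.response['Error']['Code'])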

Error waiting for Creating CloudFunctions Function: Error Code 3,

Hello, I am trying to create my Google Cloud Function in Python using Terraform:
resource "google_cloudfunctions_function" "function" {
name = "scheduled-cloud-function-python-api-request"
description = "A Python Cloud Function that is triggered by a Cloud Schedule."
runtime = "python37"
available_memory_mb = 128
source_archive_bucket = google_storage_bucket.bucket.name
source_archive_object = google_storage_bucket_object.archive.name
trigger_http = true
entry_point = "http_handler" # This is the name of the function that will be executed in your Python code
}
And it's giving me an error at this code block after I do terraform apply:
Error code 3, message: Function failed on loading user code.
What am I doing wrong here?
This is my Python code:
import requests
import csv

request = requests.get(f'https://api.sportsdata.io/v3/nba/scores/json/TeamSeasonStats/2022?key=')
nba_team_stats = request.json()

data_file = open('nba_team_stats.csv', 'w')
csv_writer = csv.writer(data_file)

count = 0
for stats in nba_team_stats:
    if count == 0:
        header = stats.keys()
        csv_writer.writerow(header)
        count += 1
    csv_writer.writerow(stats.values())
data_file.close()
Am I writing my Python Code in an undesired way for the GCP function to execute?
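One likely cause, offered as a guess: the Cloud Functions runtime imports your file and looks for a function whose name matches entry_point ("http_handler"); the snippet above only has module-level code, so loading can fail with "Function failed on loading user code". A minimal sketch of how the same logic might be wrapped, assuming requests is listed in requirements.txt; the API key is left out exactly as in the question, and only /tmp is writable inside a function:

import csv
import requests

def http_handler(request):
    # Entry point; the name must match entry_point = "http_handler" in the Terraform resource.
    resp = requests.get('https://api.sportsdata.io/v3/nba/scores/json/TeamSeasonStats/2022?key=')
    nba_team_stats = resp.json()
    # /tmp is the only writable location in the Cloud Functions filesystem.
    with open('/tmp/nba_team_stats.csv', 'w', newline='') as data_file:
        csv_writer = csv.writer(data_file)
        for count, stats in enumerate(nba_team_stats):
            if count == 0:
                csv_writer.writerow(stats.keys())
            csv_writer.writerow(stats.values())
    return 'ok'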

How do I transform a Python list to a DynamicFrame that can be used to create a CSV file in S3?

I'm using AWS Glue with a custom PySpark script which loads data from an Aurora instance and transforms it. Due to the nature of my data source (I need to recursively run SQL commands on a list of ids), I ended up with a normal Python list which contains lists of tuples. The list looks something like this:
list = [[('id', 1), ('name1', value1), ('name2', value2)], [('id', 2), ...]]
I've tried to convert it into a normal DataFrame using Spark's createDataFrame method:
listDataFrame = spark.createDataFrame(list)
and converting that DataFrame to a DynamicFrame using the fromDF method on the DynamicFrame class, ending up with something like this:
listDynamicFrame = fromDF(dataframe = listDataFrame, glue_ctx = glueContext, name = listDynamicFrame)
and then passing that to the from_options method:
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = listDynamicFrame,
    connection_type = "s3",
    connection_options = {"path": "s3://glue.xxx.test"},
    format = "csv",
    transformation_ctx = "datasink2")
job.commit()
So yes, unfortunately, this doesn't seem to work: I'm getting the following error message:
1554379793700 final status: FAILED tracking URL: http://169.254.76.1:8088/cluster/app/application_1554379233180_0001 user: root
Exception in thread "main"
org.apache.spark.SparkException: Application application_1554379233180_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1122)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1168)
at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/04/04 12:10:00 INFO ShutdownHookManager: Shutdown hook called
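For reference, a minimal sketch of the conversion described above, assuming each inner list of (key, value) tuples describes one flat record. The variable is renamed from list to records to avoid shadowing the built-in, the sample values are illustrative, the name argument of fromDF has to be a string, and the glueContext setup is included only so the snippet stands alone:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import Row

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Illustrative stand-in for the list built from the recursive SQL calls.
records = [[('id', 1), ('name1', 'value1'), ('name2', 'value2')],
           [('id', 2), ('name1', 'value3'), ('name2', 'value4')]]

# One Row per inner list of (key, value) tuples.
rows = [Row(**dict(pairs)) for pairs in records]
list_df = spark.createDataFrame(rows)

# fromDF is called on the DynamicFrame class, and the name argument is a string.
list_dyf = DynamicFrame.fromDF(list_df, glueContext, "list_dyf")

glueContext.write_dynamic_frame.from_options(
    frame=list_dyf,
    connection_type="s3",
    connection_options={"path": "s3://glue.xxx.test"},
    format="csv",
    transformation_ctx="datasink2")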

Run local DynamoDB spark job without EMR

I want to run a local DynamoDB Spark job without using an EMR cluster,
that reads data from some table and writes it to a Parquet/CSV file.
I didn't find any Spark-DynamoDB connector that supports that; maybe you have some ideas?
My code sample:
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.SparkSession

object copyDynamoTable extends App {
  val spark = SparkSession
    .builder()
    .appName("test")
    .master("local")
    .getOrCreate()

  val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration)
  jobConf.set("dynamodb.servicename", "dynamodb")
  jobConf.set("dynamodb.input.tableName", "hen.poc.client") // Pointing to DynamoDB table
  jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
  jobConf.set("dynamodb.regionid", "us-east-1")
  jobConf.set("dynamodb.throughput.read", "1")
  jobConf.set("dynamodb.throughput.read.percent", "1")
  jobConf.set("dynamodb.version", "2011-12-05")
  jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
  jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

  val orders = spark.sparkContext.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
  println(orders.count)
}
I received the following exception:
18/09/05 17:06:41 INFO util.TaskCalculator: Cluster has 1 active nodes.
18/09/05 17:06:41 WARN util.ClusterTopologyNodeCapacityProvider: Exception when trying to determine instance types
java.nio.file.NoSuchFileException: /mnt/var/lib/info/job-flow.json
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.Files.readAllBytes(Files.java:3152)
at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.readJobFlowJsonString(ClusterTopologyNodeCapacityProvider.java:103)
at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.getCoreNodeMemoryMB(ClusterTopologyNodeCapacityProvider.java:42)
at org.apache.hadoop.dynamodb.util.TaskCalculator.getMaxMapTasks(TaskCalculator.java:54)
at org.apache.hadoop.dynamodb.DynamoDBUtil.calcMaxMapTasks(DynamoDBUtil.java:265)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBInputFormat.getSplits(AbstractDynamoDBInputFormat.java:47)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD.count(RDD.scala:1162)
at com.data.spark.dynamodb.copyDynamoTable$.delayedEndpoint$com$riskified$data$spark$dynamodb$copyDynamoTable$1(copyDynamoTable.scala:30)
at com.data.spark.dynamodb.copyDynamoTable$delayedInit$body.apply(copyDynamoTable.scala:9)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.data.spark.dynamodb.copyDynamoTable$.main(copyDynamoTable.scala:9)
at com.data.spark.dynamodb.copyDynamoTable.main(copyDynamoTable.scala)
Exception in thread "main" java.lang.ArithmeticException: / by zero
This is a file that is present on an EMR cluster. It is read to try to determine what instance type the job is running on, in order to derive some job settings such as memory. Obviously, running locally you wouldn't have this file, so this is expected.
Please follow the thread below:
emr/github.com/issues/50

PySpark Processing Stream data and saving processed data to file

I am trying to replicate a device that is streaming its location's coordinates, then process the data and save it to a text file.
I am using Kafka and Spark Streaming (on PySpark). This is my architecture:
1- The Kafka producer emits data to a topic named test in the following string format:
"LG float LT float", for example: LG 8100.25191107 LT 8406.43141483
Producer code:
from kafka import KafkaProducer
import random

producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(0, 10000):
    lg_value = str(random.uniform(5000, 10000))
    lt_value = str(random.uniform(5000, 10000))
    producer.send('test', 'LG ' + lg_value + ' LT ' + lt_value)
producer.flush()
The producer works fine and I get the streamed data in the consumer (and even in Spark).
2- Spark Streaming is receiving this stream; I can even pprint() it.
Spark Streaming processing code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"bootstrap.servers": "localhost:9092"})
lines = kvs.map(lambda x: x[1])
words = lines.flatMap(lambda line: line.split(" "))
words.pprint()
word_pairs = words.map(lambda word: (word, 1))
counts = word_pairs.reduceByKey(lambda a, b: a + b)
results = counts.foreachRDD(lambda word: word.saveAsTextFile("C:\path\spark_test.txt"))
# I tried this: kvs.saveAsTextFiles('C:\path\spark_test.txt')
# to copy the whole stream, and it works fine
ssc.start()
ssc.awaitTermination()
As an error I get:
16/12/26 00:51:53 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker did not connect back in time
And other exceptions.
What I actually want is to save each entry "LG float LT float" in JSON format in a file, but first I want to simply save the coordinates in a file; I can't seem to make that happen. Any ideas?
I can provide the full stack trace if needed.
I solved it by making a function to save each RDD to the file. This is the code that solved my problem:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"bootstrap.servers": "localhost:9092"})
lines = kvs.map(lambda x: x[1])
coords = lines.map(lambda line: line)

def saveCoord(rdd):
    rdd.foreach(lambda rec: open("C:\path\spark_test.txt", "a").write(
        "{" + rec.split(" ")[0] + ":" + rec.split(" ")[1] + "," + rec.split(" ")[2] + ":" + rec.split(" ")[3] + "},\n"))

coords.foreachRDD(saveCoord)
coords.pprint()
ssc.start()
ssc.awaitTermination()
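As a possible follow-up for the JSON goal mentioned above, a small sketch that writes each record as real JSON via json.dumps instead of concatenating braces by hand. Collecting each micro-batch to the driver is assumed to be acceptable for this small local test, and the output path is the same placeholder style used above:

import json

def save_as_json(rdd):
    records = rdd.collect()  # small micro-batches only; fine for a local test
    if not records:
        return
    with open(r"C:\path\spark_test.txt", "a") as out:
        for rec in records:
            parts = rec.split(" ")  # ["LG", "8100.25191107", "LT", "8406.43141483"]
            out.write(json.dumps({parts[0]: float(parts[1]), parts[2]: float(parts[3])}) + "\n")

coords.foreachRDD(save_as_json)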