Using the Kinesis Client Library with Spark Streaming in PySpark - python-2.7

I am looking to use the KCL (Kinesis Client Library) with Spark Streaming from PySpark.
Any pointers would be helpful.
I tried a few of the steps from the Spark Kinesis Integration guide,
but I get an error about a missing Java class reference;
it seems the Python API is calling into a Java class.
I tried linking
spark-streaming-kinesis-asl-assembly_2.10-2.0.0-preview.jar
while submitting the KCL app to Spark,
but I still get the error.
Please let me know if anyone has done this already.
Searching online mostly turns up material about Twitter and Kafka;
I have not been able to find much about Kinesis.
Spark version used: 1.6.3
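For reference, the Spark Streaming Kinesis integration from PySpark looks roughly like the minimal sketch below, assuming the kinesis-asl assembly jar is on the driver and executor classpath; the app name, stream name, endpoint, and region are placeholders, not values from the question.

# Minimal sketch of the Spark Streaming Kinesis integration from PySpark.
# Assumes spark-streaming-kinesis-asl-assembly is on the classpath.
# "myKCLApp", "myStream", the endpoint and the region are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="KinesisPySparkExample")
ssc = StreamingContext(sc, 10)  # 10-second batches

stream = KinesisUtils.createStream(
    ssc,
    "myKCLApp",                                   # KCL application / checkpoint table name
    "myStream",                                   # Kinesis stream name
    "https://kinesis.us-east-1.amazonaws.com",    # endpoint URL
    "us-east-1",                                  # region
    InitialPositionInStream.LATEST,               # where to start reading
    10)                                           # checkpoint interval (seconds)

stream.pprint()
ssc.start()
ssc.awaitTermination()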

I encountered the same problem: the kinesis-asl jar was missing several files.
To work around it, I included the following jars in my spark-submit:
amazon-kinesis-client-1.9.0.jar
aws-java-sdk-1.11.310.jar
jackson-dataformat-cbor-2.6.7.jar
Note: I am using Spark 2.3.0, so the jar versions listed might not match the ones you need for your Spark version.
Hope this helps.
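As an illustration only (not the exact command used in this answer), the extra jars can be attached on the spark-submit command line with --jars, or programmatically through the spark.jars configuration as in the sketch below; the local paths and the assembly version are placeholders.

# Sketch: attaching the extra jars via the spark.jars configuration,
# equivalent to passing them with --jars on spark-submit. Paths are placeholders.
from pyspark import SparkConf, SparkContext

extra_jars = ",".join([
    "/opt/jars/spark-streaming-kinesis-asl-assembly_2.11-2.3.0.jar",
    "/opt/jars/amazon-kinesis-client-1.9.0.jar",
    "/opt/jars/aws-java-sdk-1.11.310.jar",
    "/opt/jars/jackson-dataformat-cbor-2.6.7.jar",
])

conf = SparkConf().setAppName("KinesisJarsExample").set("spark.jars", extra_jars)
sc = SparkContext(conf=conf)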

Related

When I try to fetch data from Amazon Keyspaces with PySpark, I get an "Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner" error

I'm not experienced with Java or the Hadoop ecosystem. I configured my Spark cluster to connect to Amazon Keyspaces using the spark-cassandra-connector from DataStax, and I'm using PySpark to fetch data from Cassandra. I can successfully connect to the Keyspaces/Cassandra cluster, but when I try to fetch data from it:
df = spark.sql("SELECT * FROM cass.tutorialkeyspace.tutorialtable")
print("Table Row Count: ")
print(df.count())
I get this error:
Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner
Yes, the keyspace and table exist and contain data. How can I fix or work around this? Thanks!
As an FYI, Keyspaces now supports using the RandomPartitioner, which enables reading and writing data in Apache Spark by using the open-source Spark Cassandra Connector.
Docs: https://docs.aws.amazon.com/keyspaces/latest/devguide/spark-integrating.html
Launch announcement: https://aws.amazon.com/about-aws/whats-new/2022/04/amazon-keyspaces-read-write-data-apache-spark/
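Following the integration guide linked above, a read from Keyspaces through the Spark Cassandra Connector looks roughly like the sketch below; it assumes the connector package is already on the classpath, and the region, service-specific credentials, keyspace, and table names are placeholders.

# Rough sketch of reading a Keyspaces table through the Spark Cassandra Connector.
# The connector jar/package is assumed to be on the classpath; host/region,
# credentials, keyspace and table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("KeyspacesRead")
    .config("spark.cassandra.connection.host", "cassandra.us-east-1.amazonaws.com")
    .config("spark.cassandra.connection.port", "9142")
    .config("spark.cassandra.connection.ssl.enabled", "true")
    .config("spark.cassandra.auth.username", "service-specific-user")
    .config("spark.cassandra.auth.password", "service-specific-password")
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="tutorialkeyspace", table="tutorialtable")
    .load()
)

print("Table Row Count:", df.count())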
The Spark Cassandra Connector relies on a specific partitioner implementation to define data splits, etc. There is no workaround for this problem right now, until somebody adds an implementation of the corresponding TokenFactory to this code. It shouldn't be very complex; it just needs to be done by someone who is interested in it.
Thank you for the feedback. At this time, you can write to Keyspaces using the Cassandra Spark Connector. Reading requires support for token range. Please see the following doc page for the list of supported APIs: https://docs.aws.amazon.com/keyspaces/latest/devguide/cassandra-apis.html.
Although we don't have timelines to share at the moment, we prioritize our roadmap based on customer feedback, and we are releasing new features all the time. To learn more about our roadmap and upcoming features, please contact your AWS account manager.

ModuleNotFoundError: No module named 'aiohttp' in AWS Glue

I am using AWS Glue to create an ETL workflow that fetches data from an API and loads it into RDS. In AWS Glue I use a PySpark script, and in the same script I use the Python 'aiohttp' and 'asyncio' modules to call my API asynchronously. But AWS Glue throws a ModuleNotFoundError, and only for aiohttp.
I have already tried different versions of the aiohttp module and tested them in the Glue job, but it still throws the same error. Can someone please help me with this?
Glue 2.0
AWS Glue version 2.0 lets you provide additional Python modules, or different versions of existing ones, at the job level. You can use the --additional-python-modules job parameter with a comma-separated list of Python modules to add a new module or change the version of an existing module.
Within the --additional-python-modules option you can also specify an Amazon S3 path to a Python wheel module.
The official documentation lists all modules that are already available. If you need a different version, or need a module installed, specify it in the parameter mentioned above.
Glue 1.0 & 2.0
You can zip the Python library, upload it to S3, and specify the path in the --extra-py-files job parameter.
See the official documentation for more information.
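For illustration, here is roughly how those parameters can be set when defining a job through boto3 (the same keys can also be added as job parameters in the console); the job name, role, bucket paths, and pinned module version below are placeholders.

# Sketch: creating a Glue 2.0 job that installs extra Python modules at start.
# Job name, role, S3 paths and the pinned aiohttp version are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="api-to-rds-etl",
    Role="MyGlueServiceRole",
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Glue 2.0+: pip-installable modules added or upgraded at job start
        "--additional-python-modules": "aiohttp==3.7.4",
        # Glue 1.0/2.0 alternative: zipped pure-Python libraries uploaded to S3
        "--extra-py-files": "s3://my-bucket/libs/mylibs.zip",
    },
)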

AWS Glue Sagemaker Notebook "No module named awsglue.transforms"

I've created a Sagemaker notebook to develop AWS Glue jobs, but when running through the provided example ("Joining, Filtering, and Loading Relational Data with AWS Glue") I get the "No module named awsglue.transforms" error from the title.
Does anyone know what I've set up wrong, or haven't set up, that causes the import to fail?
You'll need to download the library files from here for Glue 0.9 or here for Glue 1.0 (Check your Glue jobs for the version).
Put the zip in S3 and reference it in the "Python library path" on your Dev Endpoint.
I had the same issue and the selected solution did not work for me.
I did manage to get it working by using CloudFormation (AWS::Glue::DevEndpoint).
Through trial and error I noticed that you can't specify both NumberOfNodes and NumberOfWorkers at the same time; you have to specify one or the other.
Using NumberOfNodes: 5 resulted in the exact same error as in the question, but using the second option worked perfectly.
So, to conclude: to fix this error you can use CloudFormation and make sure to use the NumberOfWorkers property.
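As a rough boto3 equivalent of that CloudFormation resource (a sketch only; the endpoint name, role ARN, worker settings, and S3 library path are placeholders, not values from the answer):

# Sketch: creating a Glue dev endpoint with NumberOfWorkers rather than
# NumberOfNodes, roughly mirroring the AWS::Glue::DevEndpoint resource.
# Endpoint name, role ARN and the S3 library path are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_dev_endpoint(
    EndpointName="glue-notebook-endpoint",
    RoleArn="arn:aws:iam::123456789012:role/MyGlueDevEndpointRole",
    GlueVersion="1.0",
    WorkerType="G.1X",
    NumberOfWorkers=5,  # use this instead of NumberOfNodes
    ExtraPythonLibsS3Path="s3://my-bucket/libs/awsglue-libs.zip",
)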
Hm, this approach doesn't work for me. I've just put the zip in the "Python library path", referenced it, and it doesn't work.
Add AWSGlueServiceNotebookRole to your Dev Endpoint IAM role, restart your kernel, and rerun.

What is the proper way to use Google Pub/Sub with Flink Streaming using Dataproc?

I'm trying to figure out the proper way to run Apache Flink on Dataproc and use Google Pub/Sub as a source/sink. When I create a Dataproc cluster and apply the Flink initialization action to the most recent image (1.4), Flink 1.6.4 gets installed.
The problem is that flink-connector-gcp-pubsub is only available starting from Flink version 1.9.0.
So my question is: what is the proper way to use all of this together? Should I build my own GCE image with the latest Flink? Does one already exist?
As you already said, flink-connector-gcp-pubsub is only available from Flink 1.9.0 onward. So you have two options:
Implement the connector yourself
Build your own image based on the Flink initialization actions
I would not recommend implementing the connector, as it is a complex task that requires an in-depth understanding of Flink, while building your own image should be relatively easy given the existing example for Flink 1.6.4.
I solved this problem by running Flink 1.9.0 in Kubernetes. This way I do not depend on anybody and can run whatever version I need.

Read file from s3a along with AWS Athena SDK (1.11+)

I am writing a Spark/Scala program that submits a query to Athena (using aws-java-sdk-athena:1.11.420) and waits for the query to complete. Once the query is complete, my Spark program reads the query's output location directly from the S3 bucket over the s3a protocol, using Spark's sparkSession.read.csv() function.
In order to read the CSV file, I need to use org.apache.hadoop:hadoop-aws:2.8+ and org.apache.hadoop:hadoop-client:2.8+. Both of these libraries are built against AWS SDK version 1.10.6. However, the AWS Athena SDK does not exist for that version; the oldest they publish is 1.11+.
How can I resolve the conflict? I need a recent version of the AWS SDK to get access to Athena, but hadoop-aws pulls me back to an older version.
Is there a version of hadoop-aws that uses the 1.11+ AWS SDKs? If so, which versions will work for me? If not, what other options do I have?
I found out that I can use hadoop-aws:3.1+, which comes with aws-java-sdk-bundle:1.11+. This AWS SDK bundle includes the Athena client.
However, I still need to run Spark with hadoop-common:3.1+ libraries, while my Spark cluster runs the 2.8 versions.
Because my cluster runs 2.8, spark-submit jobs were failing while running the jar directly (java -jar myjar.jar) worked fine. This is because Spark was replacing the Hadoop libraries I provided with the versions it was bundled with.
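As a sketch of that dependency setup (shown from PySpark for consistency with the rest of this page; the hadoop-aws version and the s3a path are placeholders, and it assumes the cluster's own Hadoop libraries are also on the 3.1 line):

# Sketch: pulling in hadoop-aws 3.1.x, which depends on aws-java-sdk-bundle 1.11.x
# (including the Athena client), via spark.jars.packages. This only works cleanly
# when the cluster's Hadoop libraries are also 3.1+; version and path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("AthenaResultReader")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.1.1")
    .getOrCreate()
)

df = spark.read.csv("s3a://my-athena-results-bucket/query-output/result.csv",
                    header=True)
df.show()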