Can't reach Flask on the Spark master node using Amazon EMR - amazon-web-services

I want to understand whether it's possible to use a Flask application connected to the Spark master node on an Amazon EMR cluster. The goal is to call Flask from a web app to retrieve Spark outputs. The ports are open in the EMR cluster's security group, but I can't reach the application from outside on its port.
What do you think about it? Are there any other solutions?

While it is entirely possible to call Flask (or anything else) running on EMR, depending on what you are doing you might find Apache Livy handy. The good thing is that Livy is fully supported by EMR. You can use Livy to submit jobs and retrieve results synchronously or asynchronously. It gives you a REST API to interact with Spark.
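For example, a minimal sketch of submitting a batch job through Livy's REST API, assuming Livy is listening on its default port 8998 on the EMR master node; the host name, S3 path, and arguments are placeholders:
import time
import requests

# Placeholder master-node host; Livy listens on port 8998 by default on EMR.
LIVY_URL = "http://ec2-xx-xx-xx-xx.compute.amazonaws.com:8998"

# Submit a batch job (the S3 path and arguments are placeholders).
resp = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "s3://my-bucket/jobs/my_spark_job.py",
        "args": ["--run-date", "2023-01-01"],
    },
)
batch_id = resp.json()["id"]

# Poll until the batch reaches a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(10)
print(f"Batch {batch_id} finished with state: {state}")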

Related

Configure a consolidated Spark history server on EMR

Is there any way to have a single history server showing Spark applications running on different EMR clusters?
According to this link - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-cluster-application-history.html
The section "Off-cluster access to persistent application user interfaces" states that the persistent application UIs run off-cluster, but can this be configured so that every Spark application (running on different clusters) points to a single application UI? Or is it cluster-specific only?
I have tried figuring it out from the AWS docs but can't find anything relevant. Any reference/suggestion will be appreciated.
Thanks.

Spark on EMR | EKS | YARN

I am migrating from on-premises to the AWS stack. I have a doubt, and I'm often confused about how Apache Spark works in AWS and similar environments.
I will just share my current understanding of on-premises Spark running on YARN. When an application is submitted to the Spark cluster, an application master is created on one of the data nodes (as a container), and it takes care of the application by spawning executor tasks on the data nodes. This means the Spark code is deployed to the nodes where the data resides, which means less network transfer. Moreover, this is logical and easy to visualise (at least to me).
But suppose the same Spark application runs on AWS. It fetches the data from S3 and runs on top of EKS. Here, as I understand it, the Spark driver and the executor tasks are spawned in Kubernetes pods.
- Does this mean the data has to be transferred over the network from S3 to the EKS cluster, to the node where the executor pod gets spawned?
I have seen some videos that use EMR on top of EKS, but I am a little confused here.
- Since EMR provides the Spark runtime, why do we use EKS here? Can't we run EMR alone for Spark applications in an actual production environment? (I know that EKS can be a replacement for YARN in the Spark world.)
- Can't we run Spark on top of EKS without using EMR? (I am thinking of EMR as a cluster where Spark drivers and executors can run.)
Edit: This is a question more about Kubernetes integration with Spark, not specific to AWS.

Orchestration of Redshift Stored Procedures using AWS Managed Airflow

I have many Redshift stored procedures (15-20); some can run asynchronously, while many have to run in a synchronous manner.
I tried scheduling them asynchronously and synchronously using AWS EventBridge but found many limitations (around failure handling and orchestration).
So I went ahead with AWS Managed Airflow.
How can we set up the Redshift cluster connection in Airflow,
so that we can call our stored procedures from Airflow DAGs and have them run in the Redshift cluster?
Is there a RedshiftOperator for this, or can we create a direct connection to the Redshift cluster using the Connections option in the Airflow UI?
If possible, can we achieve all of this using the AWS console only, without the AWS CLI?
How can we set up the Redshift cluster connection in Airflow,
so that we can call our stored procedures from Airflow DAGs and have them run in the Redshift cluster?
You can use Airflow Connections to connect to Redshift. This is the native approach for managing connections to external services such as databases.
Managing Connections (Airflow)
Amazon Redshift Connection (Airflow)
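As a concrete illustration, in plain Airflow a connection can also be supplied as a URI through an AIRFLOW_CONN_<CONN_ID> environment variable; the host, credentials, and database below are placeholders, and on MWAA you would normally enter the same values on the Admin > Connections page of the Airflow UI instead:
import os

# Placeholder Redshift connection expressed as an Airflow connection URI;
# the connection ID becomes "redshift_default" (the lower-cased suffix of the variable name).
os.environ["AIRFLOW_CONN_REDSHIFT_DEFAULT"] = (
    "postgres://awsuser:my_password"
    "@my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com:5439/dev"
)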
Is there a RedshiftOperator for this, or can we create a direct connection to the Redshift cluster using the Connections option in the Airflow UI?
You can use the PostgresOperator to execute SQL commands in the Redshift cluster. When initializing the PostgresOperator, set the postgres_conn_id parameter to the Redshift connection ID (e.g. redshift_default). Example:
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Runs the SQL file (here, a CALL to the stored procedure) against Redshift
call_stored_proc = PostgresOperator(
    task_id="call_stored_proc",
    postgres_conn_id="redshift_default",
    sql="sql/stored_proc.sql",
)
PostgresOperator (Airflow)
How-to Guide for PostgresOperator (Airflow)
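For the synchronous vs. asynchronous requirement, below is a minimal sketch of how the procedures could be wired up in a DAG; the DAG ID, schedule, and SQL file names are assumptions for illustration:
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="redshift_stored_procs",  # assumed DAG ID
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Procedures that must run one after the other (synchronous)
    proc_a = PostgresOperator(
        task_id="proc_a",
        postgres_conn_id="redshift_default",
        sql="sql/proc_a.sql",
    )
    proc_b = PostgresOperator(
        task_id="proc_b",
        postgres_conn_id="redshift_default",
        sql="sql/proc_b.sql",
    )

    # Procedures that can run in parallel once proc_b has finished
    proc_c = PostgresOperator(
        task_id="proc_c",
        postgres_conn_id="redshift_default",
        sql="sql/proc_c.sql",
    )
    proc_d = PostgresOperator(
        task_id="proc_d",
        postgres_conn_id="redshift_default",
        sql="sql/proc_d.sql",
    )

    proc_a >> proc_b >> [proc_c, proc_d]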
If possible, can we achieve all of this using the AWS console only, without the AWS CLI?
No, it's not possible to achieve this only using the AWS console.

SparkSubmitOperator vs SSHOperator for submitting pyspark applications in airflow

I have separate Spark and Airflow servers, and I don't have the Spark binaries on the Airflow servers. I am able to use the SSHOperator and run the Spark jobs in cluster mode perfectly well. I would like to know which would be better in the long run for submitting PySpark jobs: the SSHOperator or the SparkSubmitOperator. Any help would be appreciated in advance.
Below are the pros and cons of using the SSHOperator vs the SparkSubmitOperator in Airflow, followed by my recommendation.
SSHOperator: this operator SSHes into the remote Spark server and executes the spark-submit on the remote cluster (see the sketch after this list).
Pros:
No additional configuration required on the Airflow workers
Cons:
Tough to maintain the Spark configuration parameters
Need to open SSH port 22 from the Airflow servers to the Spark servers, which raises security concerns (even though you are on a private network, SSH-based remote execution is not a best practice).
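A minimal sketch of the SSHOperator approach, assuming an SSH connection ID of "spark_ssh"; the job path and arguments are placeholders:
from airflow.providers.ssh.operators.ssh import SSHOperator

# SSH into the Spark edge node and run spark-submit there.
# "spark_ssh" is an assumed Airflow SSH connection ID; the job path is a placeholder.
submit_via_ssh = SSHOperator(
    task_id="submit_spark_job_via_ssh",
    ssh_conn_id="spark_ssh",
    command=(
        "spark-submit --master yarn --deploy-mode cluster "
        "/home/hadoop/jobs/my_job.py --run-date {{ ds }}"
    ),
)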
SparkSubmitOperator: this operator performs the spark-submit operation in a cleaner way, but you still need additional infrastructure configuration (see the sketch after this list).
Pros:
As mentioned above, it comes with handy Spark configuration handling and no additional effort to invoke spark-submit
Cons:
Need to install Spark on all Airflow servers.
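A minimal sketch of the SparkSubmitOperator approach, assuming the Spark binaries are installed on the Airflow workers and a Spark connection ID of "spark_default"; the application path and settings are placeholders:
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Invokes spark-submit locally on the Airflow worker against the cluster
# configured in the "spark_default" connection.
submit_job = SparkSubmitOperator(
    task_id="submit_spark_job",
    conn_id="spark_default",
    application="/opt/airflow/jobs/my_job.py",
    conf={"spark.executor.memory": "4g"},
    application_args=["--run-date", "{{ ds }}"],
)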
Apart from these 2 options, I have listed 2 additional options below.
Install a Livy server on the Spark clusters and use the pylivy Python library to interact with the Spark servers from Airflow. Refer: https://pylivy.readthedocs.io/en/stable/
If your Spark clusters are on AWS EMR, I would encourage using the EmrAddStepsOperator (see the sketch below).
Refer here for additional discussion: To run Spark Submit programs from a different cluster (1**.1*.0.21) in airflow (1**.1*.0.35). How to connect remotely other cluster in airflow
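A minimal sketch of the EmrAddStepsOperator approach, assuming an existing EMR cluster; the cluster ID, S3 path, and job arguments are placeholders:
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

# Submits spark-submit as an EMR step on an existing cluster via command-runner.jar.
# "j-XXXXXXXXXXXXX" and the S3 path are placeholders.
add_spark_step = EmrAddStepsOperator(
    task_id="add_spark_step",
    job_flow_id="j-XXXXXXXXXXXXX",
    aws_conn_id="aws_default",
    steps=[
        {
            "Name": "my_spark_job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-bucket/jobs/my_job.py",
                ],
            },
        }
    ],
)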
SparkSubmitOperator is a specialized operator. That is, it should make writing tasks for submitting Spark jobs easier and the code itself more readable and maintainable. Therefore, I would use it if possible.
In your case, you should consider whether the effort of modifying the infrastructure so that you can use the SparkSubmitOperator is worth the benefits mentioned above.

How to deploy and run an Amazon Kinesis application on the Amazon Kinesis service

I am trying to understand how to deploy an Amazon Kinesis client application that was built using the Kinesis Client Library (KCL).
I found this but it only states
You can follow your own best practices for deploying code to an Amazon EC2 instance when you deploy a Amazon Kinesis application. For example, you can add your Amazon Kinesis application to one of your Amazon EC2 AMIs.
which does not give me the broader picture.
These examples use an Ant script to run the Java program. Is this the best practice to follow?
Also, I understand that even before running the EC2 instances I need to make sure:
The developed code (JAR/WAR or any other format) is on the EC2 instance
The EC2 instance has the required environment (such as an Ant setup) already in place to execute the program.
Could someone please add some more detail on this?
Amazon Kinesis is responsible for ingesting data, not for running your application. You can run your application anywhere, but it is a good idea to run it in EC2, as you are probably going to use other AWS services such as S3 or DynamoDB (the Kinesis Client Library uses DynamoDB to track shard leases and checkpoints, for example).
To understand Kinesis better, I'd recommend that you launch the Kinesis Data Visualization Sample. When you launch this app, use the provided CloudFormation template. It will create a stack with the Kinesis stream and an EC2 instance running the application, which uses the Kinesis Client Library and is a fully working example to start from.
The best way I have found to host a consumer program is to use EMR, but not as a Hadoop cluster. Package your program as a JAR and place it in S3, then launch an EMR cluster and have it run your JAR as a step. Using AWS Data Pipeline you can schedule this job flow to run at regular intervals. You can also scale the EMR cluster, or use an actual EMR job to process the stream if you want to go more high-tech.
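A rough sketch of adding such a JAR as a step with boto3; the region, cluster ID, S3 path, main class, and stream name are all placeholders:
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

# Add the packaged KCL consumer JAR (uploaded to S3) as a step on an existing cluster.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "kinesis-consumer",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/jars/kinesis-consumer.jar",
                "MainClass": "com.example.ConsumerApp",
                "Args": ["--stream-name", "my-stream"],
            },
        }
    ],
)
print(response["StepIds"])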
You can also use Elastic Beanstalk. I believe this article is highly useful.