How to access the Spark History Server from DSX? - data-science-experience

I need to access the Spark History Server so that I can performance-tune a slow Spark job.
I was looking for a link within DSX but could not find one, so I opened the Spark service in the Bluemix console and navigated to the Spark History Server directly from there (Job History link).
Is there a way to access the spark history server directly from DSX?

It seems that you have to access the spark history server by logging in to the Bluemix console as I have been doing.
There is a feature request here asking to make this process much more integrated: https://datascix.uservoice.com/forums/387207-general/suggestions/17707459-give-access-to-spark-ui-to-monitor-jobs-without-ha - please vote for this!

Related

When I try to fetch data from Amazon Keyspaces with PySpark, I get an "Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner" error

I'm not experienced in Java or the Hadoop ecosystem. I configured my Spark cluster to connect to Amazon Keyspaces using the spark-cassandra-connector from DataStax. I'm using PySpark to fetch data from Cassandra. I can successfully connect to the Keyspaces/Cassandra cluster, but it fails when I try to fetch data from it:
df = spark.sql("SELECT * FROM cass.tutorialkeyspace.tutorialtable")
print ("Table Row Count: ")
print (df.count())
I get this error:
Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner
Yes, the keyspace and table exist and have data. How can I fix or work around this? Thanks!
As an FYI, Keyspaces now supports using the RandomPartitioner, which enables reading and writing data in Apache Spark by using the open-source Spark Cassandra Connector.
Docs: https://docs.aws.amazon.com/keyspaces/latest/devguide/spark-integrating.html
Launch announcement: https://aws.amazon.com/about-aws/whats-new/2022/04/amazon-keyspaces-read-write-data-apache-spark/
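A minimal sketch of the connector settings that a catalog named `cass` (as in the question's query) would need. The property names are the Spark Cassandra Connector's documented ones as far as I know; the region, the helper names, and the omission of credentials are my own assumptions, and the partitioner itself is changed on the Keyspaces side as described in the linked docs, not here.

```python
def keyspaces_connector_conf(region, catalog="cass"):
    """Build Spark settings for the open-source Spark Cassandra Connector.

    `region` and `catalog` are placeholders; authentication settings are
    deliberately left out and must be added for a real cluster.
    """
    return {
        # Register a Spark SQL catalog backed by the connector.
        f"spark.sql.catalog.{catalog}":
            "com.datastax.spark.connector.datasource.CassandraCatalog",
        # Amazon Keyspaces service endpoint; it listens on port 9142 with TLS.
        "spark.cassandra.connection.host": f"cassandra.{region}.amazonaws.com",
        "spark.cassandra.connection.port": "9142",
        "spark.cassandra.connection.ssl.enabled": "true",
    }


def apply_conf(builder, conf):
    """Apply each setting to a SparkSession.builder-like object."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, `apply_conf(SparkSession.builder, keyspaces_connector_conf("us-east-1")).getOrCreate()` would give a session on which the original `spark.sql(...)` query can be retried.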
Spark Cassandra Connector relies on a specific partitioner implementation to define data splits, etc. There is no workaround for this problem right now, until somebody adds an implementation of the corresponding TokenFactory to the connector code. It shouldn't be very complex; it just needs to be done by someone who is interested in it.
Thank you for the feedback. At this time, you can write to Keyspaces using the Spark Cassandra Connector. Reading requires support for token ranges. Please see the following doc page for the list of supported APIs: https://docs.aws.amazon.com/keyspaces/latest/devguide/cassandra-apis.html.
Although we don't have timelines to share at the moment, we prioritize our roadmap based on customer feedback. We are releasing new features all the time. To learn more about our roadmap and upcoming features please contact your AWS Account manager.

On Prem Application migration to the AWS

We are migrating some of our J2EE-based applications from on-prem to the AWS cloud. I am trying to find a good document on which steps to consider for the app migration. Since we already have an AWS account, and some of the applications have been migrated earlier, I don't have to worry about those aspects. However, I am thinking more about:
- Which App-server to use?
- Do I need to migrate the DB as well, or just the app?
- Any licensing requirements for the app? We use mostly open source, so that should be fine.
- Operational monitoring after migrating to the cloud.
I came across these articles:
https://serverguy.com/cloud/aws-migration/
Migration Scenario: Migrating Web Applications to the AWS Cloud : https://d36cz9buwru1tt.cloudfront.net/CloudMigration-scenario-wep-app.pdf
I would like to know if you have worked on this kind of migration, and if you can point me to some helpful documents or links, or share your own experience.
So there are 2 good resources I'd recommend for migration:
AWS Whitepaper for migration
AWS Well-Architected Framework.
The key is planning, but don't be afraid to experiment. This is the cloud, so an instance size is not set in stone; you can easily change it.

How to get the last 3 months of AWS billing data using Python Flask

I need the last 3 months of AWS billing data shown in a graph, using Python and Flask.
I found some articles about this and created an environment on my local machine. After that, I don't know how to fetch the billing data with a Python and Flask script. If anyone has an idea, please help me. Thanks in advance.
You can use Boto to connect to AWS using Python.
In general you need to cover several items:
Authentication - You'll need to set up your credentials to be able to connect to AWS. Check the Documentation
Enable DBR or CUR reports in the billing console. This will export your monthly billing information to S3. Check the Documentation
Once ready use Boto3 to download the reports and import them to DB, ES, whatever you're working with to process large excel files. Check the Documentation
Good luck!
EDIT:
For previous months you can just go to console -> bills and download the reports from the console directly, then process them in your application.
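Once a report has been downloaded (for example with Boto3's `download_file` on an S3 client), the processing step could look roughly like the sketch below. It assumes the report is a CUR CSV that has already been decompressed and read into a string, and that it carries the `bill/BillingPeriodStartDate` and `lineItem/UnblendedCost` columns; both column names and the helper names are assumptions to check against your own report.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def monthly_totals(cur_csv_text,
                   period_col="bill/BillingPeriodStartDate",
                   cost_col="lineItem/UnblendedCost"):
    """Sum the cost per billing month from the text of one CUR CSV file."""
    totals = defaultdict(Decimal)  # missing months start at Decimal("0")
    for row in csv.DictReader(io.StringIO(cur_csv_text)):
        month = row[period_col][:7]  # keep only the "YYYY-MM" prefix
        totals[month] += Decimal(row[cost_col] or "0")  # blank cost counts as 0
    return dict(totals)


def last_months(totals, n=3):
    """Return the n most recent (month, cost) pairs, oldest first."""
    return sorted(totals.items())[-n:]
```

A Flask view could then return `last_months(...)` as JSON (converting each `Decimal` to a string or float first) for whatever charting library draws the graph.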

DC/OS service development with Akka

First of all, I'm new to DC/OS ...
I installed DC/OS locally with Vagrant, and everything worked fine. Then I installed Cassandra and Spark, and I think I understand the container concept with Docker, so far so good.
Now it's time to develop an Akka service and I'm a little bit confused how I should start. The Akka service should simply offer a HTTP REST endpoint and store some data to Cassandra.
So I have my DC/OS ready, and Eclipse in front of me. Now I would like to develop the Akka service and connect to Cassandra from outside DC/OS, how can I do that? Is this the wrong approach? Should I install Cassandra separately and only if I’m ready I would deploy to DC/OS?
Because it was so simple to install Cassandra, Spark and all the rest I would like to use it for development as well.
While slightly outdated (since it's using DC/OS 1.7 and you should really be using 1.8 these days), there's a very nice tutorial from codecentric that should contain everything you need to get started:
It walks you through setting up DC/OS, Cassandra, Kafka, and Spark
It shows how to use Akka reactive streams and the reactive kafka extension to ingest data from Twitter into Kafka
It shows how to use Spark to ingest data into Cassandra
Another great walkthrough resource is available via Cake Solutions:
It walks you through setting up DC/OS, Cassandra, Kafka, and Marathon-LB (a load balancer)
It explains service discovery for Akka
It shows how to expose a service via Marathon-LB

Spark on YARN - Submitting Spark jobs from Django

I am developing a web app with the following components :
Apache Spark running on a Cluster with 3 nodes (spark 1.4.0, hadoop 2.4 and YARN)
Django Web App server
The Django app will create 'on demand' Spark jobs (they can be concurrent jobs, depending on how many users are using the app)
I would like to know if there is any way to submit Spark jobs from the Python code in Django. Can I integrate PySpark in Django? Or can I call the YARN API directly to submit jobs?
I know that I can submit jobs to the cluster using the spark-submit script, but I'm trying to avoid it (because it would have to be a shell command execution from the code, which is not very safe).
Any help would be very appreciated.
Thanks a lot,
JG
A partial, untested answer: Django is a web framework, so it's difficult to manage long jobs (more than 30sec), which is probably the case for your spark jobs.
So you'll need an asynchronous job queue, such as Celery. It's a bit of a pain (not that bad but still), but I would advise you to start with that.
You would then have :
Django to launch/monitor jobs
rabbitMQ/celery asynchronous job queue
custom Celery tasks, using PySpark and launching Spark jobs
There's a project on github called Ooyala's job server:
https://github.com/ooyala/spark-jobserver.
This allows you to submit spark jobs to YARN via HTTP request.
In Spark 1.4.0+ support was added to monitor job status via HTTP request as well.
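As a rough sketch of what the HTTP submission could look like from Python, using only the standard library. The `/jobs?appName=...&classPath=...` and `/jobs/<jobId>` routes are the ones I recall from the spark-jobserver README, so verify them against the version you deploy; the base URL, app name, and helper names below are placeholders.

```python
import urllib.parse
import urllib.request


def build_submit_request(base_url, app_name, class_path, job_config=""):
    """Build (but do not send) a POST that asks the job server to start a job."""
    query = urllib.parse.urlencode({"appName": app_name, "classPath": class_path})
    return urllib.request.Request(
        f"{base_url}/jobs?{query}",
        data=job_config.encode("utf-8"),  # job configuration goes in the body
        method="POST",
    )


def build_status_request(base_url, job_id):
    """Build a GET that asks the job server for the status of one job."""
    return urllib.request.Request(f"{base_url}/jobs/{urllib.parse.quote(job_id)}")
```

Passing either request to `urllib.request.urlopen(...)` sends it; doing so from a background task rather than inside the Django view keeps the web request short.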