Spark progress monitoring in DSX lost after changing Spark configuration parameters - data-science-experience

What is the recommended way to turn on Spark speculation in a DSX notebook? We have a way that works, but we no longer see Spark progress being shown in the notebook, e.g., the number of tasks that have completed so far.
Here is what we do:
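Roughly, we set spark.speculation in the notebook's Spark configuration. The snippet below is only an illustrative sketch of one common way to do that (stopping and recreating the SparkContext), with hypothetical values rather than our exact code:
from pyspark import SparkConf, SparkContext

sc.stop()  # stop the context the DSX notebook pre-created for us

conf = SparkConf() \
    .set("spark.speculation", "true") \
    .set("spark.speculation.multiplier", "1.5")  # hypothetical tuning value

sc = SparkContext(conf=conf)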
We checked the icon for enabling/disabling Spark progress monitoring and it is already enabled. Any tips? Thanks!

Related

AWS Sagemaker Notebook not working, how can I solve the issue?

The code failed because of a fatal error:
Error sending http request and maximum retry encountered..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
Note: there are no such logs in CloudWatch to figure out the issue.
Are you looking to run Spark queries? If not, you can use the Python kernel, or any kernel other than Sparkmagic, and proceed with your work.
If you do need Spark, see this blog and the documentation on using Spark with notebook instances.

Run Sagemaker notebook instance and be able to close tab

I'm currently using a Sagemaker notebook instance (not from Sagemaker Studio), and I want to run a notebook that is expected to take around 8 hours to finish. I want to leave it overnight and see the output from each cell; the output is a combination of print statements and plots.
However, when I start running the notebook and make sure the initial cells run, I close the JupyterLab tab in my browser, and some minutes later I open it again to see how it is going, but the notebook has stopped.
Is there any way I can still use my notebook as it is, see the output from each cell (prints and plots), and not have to keep the JupyterLab tab open (so I can turn my laptop off, etc.)?
Jupyter will stop your kernel when you close the tab. If you want your jobs to keep running after you close the Jupyter tab, I would recommend looking into SageMaker Processing or Training jobs for your workloads. Alternatively, this link provides some options for keeping the notebook running with the tab closed.
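For example, a minimal Processing job sketch with the SageMaker Python SDK could look like the following (the image URI, role, script name, and S3 path are placeholders you would replace with your own):
from sagemaker.processing import ScriptProcessor, ProcessingOutput

processor = ScriptProcessor(
    image_uri="<ecr-image-with-your-dependencies>",  # placeholder
    command=["python3"],
    role="<your-sagemaker-execution-role-arn>",  # placeholder
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# The job runs on its own managed instance, so the JupyterLab tab can be closed.
processor.run(
    code="long_running_job.py",  # placeholder: your notebook exported as a script
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",  # write plots/artifacts here inside the job
        destination="s3://<your-bucket>/processing-output/",  # placeholder
    )],
)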
Answering my own question.
I ended up using Sagemaker Processing jobs for this, as initially suggested by the other answer. I found a library developed a few months ago, Sagemaker run notebook, which let me keep my notebook structure and cells as I had them, run the notebook on a bigger instance, and keep modifying it on a smaller one.
The output of each cell, along with the plots I had, was saved to S3 as a Jupyter notebook.
The library does not seem to be actively maintained, but you can fork it, make changes, and use it as per your requirements, for example by creating a Docker container based on your needs.
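For reference, the notebook execution in that tooling is done with papermill (as far as I can tell), which runs a notebook headlessly and writes an executed copy containing every cell's output. A minimal local sketch with hypothetical file names:
import papermill as pm

# Execute the notebook without a browser session; the output notebook keeps
# each cell's prints and plots so they can be inspected afterwards.
pm.execute_notebook(
    "analysis.ipynb",  # hypothetical input notebook
    "analysis_output.ipynb",  # executed copy with all cell outputs
    parameters={"n_epochs": 50},  # optional: injected into a cell tagged "parameters"
)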

AWS Sagemaker: Jupyter Notebook kernel keeps dying

I get disconnected every now and then when running a piece of code in Jupyter Notebooks on Sagemaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without losing my progress. At the moment, it shows that there is "No Kernel" at the bottom bar, but my file seems active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted to tornado 5.1.1, which seems to decrease the number of disconnections, but it still happens every now and then.
Often, disconnections are caused by inactivity when a job runs for a long time with no user input. If it's pre-processing that's taking a long time, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, since December 2021 you can run Spark queries directly against the EMR cluster from SageMaker Studio notebooks:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a useful blog at https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/ that can help get you up and running.
Please let me know if you need more information, or vote for the answer if it's useful. :-)
For me a quick solution was to open a Terminal instead, save the notebook file as a Python file, and run it from the terminal within Sagemaker.

Spark step on EMR just hangs as "Running" after done writing to S3

Running a PySpark 2 job on EMR 5.1.0 as a step. Even after the script is done, with a _SUCCESS file written to S3 and the Spark UI showing the job as completed, EMR still shows the step as "Running". I've waited for over an hour to see if Spark was just trying to clean itself up, but the step never shows as "Completed". The last thing written in the logs is:
INFO MultipartUploadOutputStream: close closed:false s3://mybucket/some/path/_SUCCESS
INFO DefaultWriterContainer: Job job_201611181653_0000 committed.
INFO ContextCleaner: Cleaned accumulator 0
I didn't have this problem with Spark 1.6. I've tried a bunch of different hadoop-aws and aws-java-sdk jars to no avail.
I'm using the default Spark 2.0 configurations so I don't think anything else like metadata is being written. Also the size of the data doesn't seem to have an impact on this problem.
If you aren't doing so already, you should close your Spark context:
sc.stop()
Also, if you are watching the Spark web UI in a browser, you should close that, as it sometimes keeps the Spark context alive. I recall seeing this on the Spark dev mailing list but can't find the JIRA for it.
We experienced this problem and resolved it by running the job in cluster deploy mode using the following spark-submit option:
spark-submit --deploy-mode cluster
It had something to do with the fact that in client mode the driver runs on the master instance, and the spark-submit process was getting stuck even though the Spark context had closed. This caused the instance controller to keep polling for the process, since it never received a completion signal. Running the driver on one of the cluster nodes using the option above doesn't seem to have this problem. Hope this helps.
I experienced the same issue with Spark on AWS EMR and solved it by calling sys.exit(0) at the end of my Python script. The same worked in a Scala program with System.exit(0).
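Putting the two suggestions together, the tail end of a PySpark step script might look like this sketch (whether the explicit exit is needed depends on your setup):
import sys
from pyspark import SparkContext

sc = SparkContext(appName="my-emr-step")  # hypothetical app name

# ... job logic ...

sc.stop()  # release the Spark context so EMR can mark the step as finished
sys.exit(0)  # explicitly signal a clean exit to the step runner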

Job Scheduling in SAS Data Integration Studio

I want to schedule a job in SAS DIS. I tried the process using SAS Management Console, but an error pops up saying the scheduling server was not found.
Can anyone help me with how to set up a scheduling server? Or is it a piece of software that has to be installed?
Thanks
I think a scheduling server is an extra package that has to be purchased. Our BI setup lacks that option, and no matter what we do we can't seem to get it approved. Check with your SAS server admin to see if job scheduling has been enabled; if so, they should be able to tell you the process for getting your job scheduled.
Alternatively, without a scheduling server you can still deploy your jobs and use either
1. cron and crontab (on Unix or Linux), or
2. the Windows OS scheduler
to schedule them manually; this is the best option available if no scheduling server exists. I know this can be very tedious and cumbersome, but it's worth a try if you only have a small number of jobs to schedule.