Job Scheduling in SAS Data Integration Studio - sas

I want to schedule a job in SAS DIS. I tried doing it through SAS Management Console, but an error pops up saying the scheduling server was not found.
Can anyone help me set up a scheduling server? Or is it separate software that has to be installed?
Thanks

I think a scheduling server is an extra package that has to be purchased. Our BI setup is lacking that option and no matter what we can't seem to get it approved. Check with your SAS server admin to see if the job scheduling has been enabled. If so he/she should be able to tell you the process for getting it scheduled.

Alternatively, without a scheduling server you can still deploy your jobs and schedule them manually using either
1. Cron and crontab (on Unix or Linux)
2. The Windows Task Scheduler
This is the best option available when there is no scheduling server. It can be tedious and cumbersome, but it is worth a try if you only have a few jobs to schedule.
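As a rough illustration of the cron route, the sketch below builds one crontab entry that runs a deployed job in batch once a day. It assumes the SAS executable is on the PATH as `sas` and uses the `-sysin` and `-log` options for the program and log file; the function name and both paths are hypothetical.

```python
import shlex


def crontab_line(minute, hour, sas_program, log_file, sas_cmd="sas"):
    """Return one crontab entry that runs a deployed SAS job every day."""
    # Crontab fields: minute hour day-of-month month day-of-week command
    command = " ".join(
        shlex.quote(part)
        for part in (sas_cmd, "-sysin", sas_program, "-log", log_file)
    )
    return f"{minute} {hour} * * * {command}"


if __name__ == "__main__":
    # Hypothetical paths: run the deployed job every day at 02:30.
    print(crontab_line(30, 2, "/sas/jobs/load_sales.sas", "/sas/logs/load_sales.log"))
```

The printed line would be added with `crontab -e` under the account that is allowed to run SAS on the server.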

Data flow pipeline got stuck

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all the required IAM roles.
Generally, the error "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" can be caused by worker setup taking too long. To resolve it, you can try increasing the worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1, with 1 vCPU and 3.75 GB RAM). Using a more powerful instance (n1-standard-4, which has four times the resources) should solve the problem.
You can debug this by looking at the worker startup logs in Cloud Logging. You are likely to see pip having trouble installing dependencies.
Do you have any error logs showing that Dataflow Workers are crashing when trying to start?
If not, maybe worker VMs are started but they can't reach the Dataflow service, which is often related to network connectivity.
Please note that by default, Dataflow creates jobs using the network and subnetwork named default (please check that they exist in your project). You can switch to a specific subnetwork by specifying --subnetwork. Check https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
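To make the two suggestions above concrete, here is a minimal sketch that assembles the relevant pipeline flags as a list of strings. The function name, project ID, region, and subnetwork value are hypothetical; only the flag names come from the answers above and the Beam/Dataflow pipeline options.

```python
def dataflow_args(project, region, machine_type="n1-standard-4", subnetwork=None):
    """Build command-line flags for a Beam pipeline that runs on Dataflow."""
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        # A larger worker shortens dependency installation at startup.
        f"--machine_type={machine_type}",
    ]
    if subnetwork:
        # Only needed when the project has no usable "default" network.
        args.append(f"--subnetwork={subnetwork}")
    return args


if __name__ == "__main__":
    # Hypothetical project, region and subnetwork.
    for flag in dataflow_args(
        "my-project",
        "us-central1",
        subnetwork="regions/us-central1/subnetworks/my-subnet",
    ):
        print(flag)
```

Assuming the Apache Beam Python SDK, a list like this can be passed on the command line of the pipeline script or handed to `PipelineOptions`.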

What is the best way to transfer large files through Django app hosted on HEROKU?

HEROKU gives me an H12 error when my Django application transfers a file to an API (I understand it's a long-running process, and I guess there is some memory/worker tradeoff). I am on a single hobby dyno right now.
The function runs smoothly for files of around 50 MB. The file itself comes from a different source (fetched with the Python requests package).
The idea is to build a file transfer utility using a Django app on HEROKU. The file is not stored on my app's side; it is just fetched from point A and sent to point B.
I went through multiple discussions and the standard HEROKU documentation, but I am still struggling with some concepts:
Will this problem really be solved by background tasks? (If yes, I am looking for an explanation of the process rather than just the direct way to do it, so that I can optimize my flow.)
The standard docs recommend background tasks using the RQ package for Python. I am using PostgreSQL at the moment. Will I need to install and manage a Redis database as well for this? Is this even related to the database?
Some recommend using an extra worker in addition to the web worker we have by default. How does this relate to my problem?
Some say to add multiple workers, but I am not sure how this solves it. Let's say it starts working for large files today using background tasks; what if the number of simultaneous users increases? How will this impact my solution, and how should I plan mitigation around the risks?
If someone here has a strong understanding of the architecture, I would like to hear your experiences and thoughts. Also, let me know if there is an option other than HEROKU that would make this easier for me.
Have you looked at using Celery to run this as a background task?
This is a very standard way of dealing with requests that take a long time to complete.
Will this problem really be solved by background tasks? (If yes, I am looking for an explanation of the process rather than just the direct way to do it, so that I can optimize my flow.)
Yes, it can be solved by background tasks. If you use something like Celery, which has direct support for Django, you will be running another instance of your Django application, but with a different startup command for Celery. That instance keeps polling for new tasks: it reads the task name from the Redis queue (or RabbitMQ, whichever you use as the broker), executes the task, and writes the status back to Redis (or the broker you use).
You can also use Flower along with Celery to get a dashboard showing how many tasks are being executed, what their statuses are, and so on.
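The flow described above can be sketched with the standard library alone. This is not Celery: `queue.Queue` stands in for the broker, a thread stands in for the separate worker process, and a dict stands in for the stored task status. All function names here are hypothetical.

```python
import queue
import threading

task_queue = queue.Queue()  # stands in for the broker (Redis or RabbitMQ)
statuses = {}               # stands in for the stored task status


def transfer_file(file_id):
    """Placeholder for the slow transfer (hypothetical)."""
    return f"transferred {file_id}"


def worker():
    """Worker loop: take a task off the queue, run it, record the status."""
    while True:
        file_id = task_queue.get()
        if file_id is None:  # sentinel value: stop the worker
            task_queue.task_done()
            break
        statuses[file_id] = "STARTED"
        try:
            transfer_file(file_id)
            statuses[file_id] = "SUCCESS"
        except Exception:
            statuses[file_id] = "FAILURE"
        finally:
            task_queue.task_done()


def enqueue_transfer(file_id):
    """What the web request does: queue the work and return immediately."""
    statuses[file_id] = "PENDING"
    task_queue.put(file_id)
    return file_id


if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    enqueue_transfer("report.csv")
    task_queue.put(None)  # ask the worker to stop after the queued task
    task_queue.join()     # wait until everything queued has been processed
    print(statuses)
```

The point of the design is that `enqueue_transfer` returns right away, so the web request finishes well within the platform's timeout while the slow work happens elsewhere.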
The standard docs recommend background tasks using the RQ package for Python. I am using PostgreSQL at the moment. Will I need to install and manage a Redis database as well for this? Is this even related to the database?
To use background tasks with Celery, you will need to set up some sort of message broker, such as Redis or RabbitMQ.
Some recommend using an extra worker in addition to the web worker we have by default. How does this relate to my problem?
I don't think that would help for your use case.
Some say to add multiple workers, but I am not sure how this solves it. Let's say it starts working for large files today using background tasks; what if the number of simultaneous users increases? How will this impact my solution, and how should I plan mitigation around the risks?
When you use Celery, you will have to start a few workers for that Celery instance; these workers are the ones that execute your background tasks. The Celery documentation will help you work out how many workers to run based on your instance's CPU, memory, and so on.
If someone here has a strong understanding of the architecture, I would like to hear your experiences and thoughts. Also, let me know if there is an option other than HEROKU that would make this easier for me.
I have worked on a few projects where we used Celery background tasks to upload large files. It has worked well for our use cases.
Here is my final take on this after a full evaluation, some trials, and the earlier recommendations made here (thanks #arun).
HEROKU needs a web worker to serve the website, and it has 512 MB of memory. Operations that stay below this limit should be fine.
Beyond that, say you have a scenario like the one above, where a large file comes from a source API and goes to a target API through the Django app. You will have to do the following:
First, run the file upload function as a background process, since it will take longer than the 30 seconds within which HEROKU expects a response. Otherwise, an H12 error is waiting for you. The solution is to implement Django background tasks; Celery worked in my case. Here Celery is your same Django app running as a background handler, which needs its own dyno (the worker). This can be scaled as needed in the future.
To make your Django WSGI app (the frontend) talk to Celery (the background app), you need a message broker in between, which can be HEROKU Redis, RabbitMQ, etc.
Second, the problem is not solved yet, even though you have a new worker dedicated to the Celery app: the memory limit still applies, since the worker is also a dyno with its own memory.
To overcome this, download the file as a stream with the Python requests module instead of loading the complete file into a single memory buffer. Iterate over the stream in chunks and send each chunk to the target endpoint.
The chunk size also plays an important role here. I will not give an exact number, since it depends on various factors:
It should not be too small, or the transfer will take longer.
It should not be too big for either the source or target endpoint server to handle.
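The chunked relay described above can be sketched with plain file-like objects, so that it runs without any network access. The `relay` function and the 1 MiB default chunk size are assumptions for illustration; with requests, the reading side would typically be `requests.get(url, stream=True)` followed by `response.iter_content(chunk_size=...)`.

```python
import io


def relay(source, sink, chunk_size=1024 * 1024):
    """Copy from a readable stream to a writable one, one chunk at a time.

    Only one chunk is held in memory at any moment, whatever the total size.
    Returns the number of bytes transferred.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read means the source is exhausted
            break
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    src = io.BytesIO(b"x" * 2_500_000)  # stand-in for the source download
    dst = io.BytesIO()                  # stand-in for the target upload
    print(relay(src, dst))
```

Memory use is bounded by `chunk_size` rather than by the file size, which is what keeps the worker dyno under its memory limit.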

AWS Sagemaker: Jupyter Notebook kernel keeps dying

I get disconnected every now and then when running a piece of code in Jupyter Notebooks on Sagemaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without losing my progress. At the minute, it shows "No Kernel" in the bottom bar, but my file seems active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted back to tornado = 5.1.1, which seems to decrease the number of disconnections, but it still happens every now and then.
Often, disconnections are caused by inactivity, because a job runs for a long time with no user input. If it's pre-processing that's taking a long time, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, since December 2021 you have been able to run Spark queries directly on the EMR cluster:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a useful blog post here https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/ that helps with getting up and running.
Please let me know if you need more information, or vote for the answer if it's useful. :-)
For me, a quick solution was to open a Terminal instead, save the notebook file as a Python file, and run it from the terminal within Sagemaker.
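One possible way to do the "save as a Python file" step without the notebook UI is sketched below. It assumes the notebook is stored as nbformat 4 JSON (a top-level "cells" list); the function name and file names are hypothetical. IPython-only lines such as `%magic` or `!shell` commands are copied unchanged and would need to be removed by hand before the script runs under plain Python.

```python
import json
import os
import tempfile


def notebook_to_script(ipynb_path, script_path):
    """Write the code cells of a notebook (nbformat 4 JSON) to a .py file."""
    with open(ipynb_path, encoding="utf-8") as f:
        notebook = json.load(f)
    blocks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be a single string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            blocks.append(source.rstrip() + "\n")
    with open(script_path, "w", encoding="utf-8") as f:
        f.write("\n".join(blocks))
    return len(blocks)


if __name__ == "__main__":
    # Tiny hypothetical notebook, written to a temporary directory.
    demo = {"cells": [{"cell_type": "code", "source": ["x = 1\n", "print(x)"]}]}
    with tempfile.TemporaryDirectory() as tmp:
        nb = os.path.join(tmp, "demo.ipynb")
        with open(nb, "w", encoding="utf-8") as f:
            json.dump(demo, f)
        print(notebook_to_script(nb, os.path.join(tmp, "demo.py")))
```

The resulting file can then be started from the terminal with `python`, so the run no longer depends on the browser session staying connected.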

Informatica - Trigger next workflow upon completion of the first workflow

I am working in Informatica and want Workflow B to run automatically upon completion of Workflow A. I researched how to do this, and the best option I found is PMCMD, but I cannot find the PMCMD.exe file in the installation folder of my Informatica PowerCenter. I am using version 8.1.1, and I don't know whether PMCMD is available in this version. Kindly advise on alternative solutions. Thank you in advance.
It's possible with the pmcmd utility, but there's another option. You can use an Event Wait task in Workflow B, right after the Start task, and make it wait for a flat file, e.g. workflowA.done. Then add a Command Task as the last task in Workflow A to perform a touch workflowA.done command. Use the appropriate path for your case (it might be $PMTargetFileDir, for example).
Start both workflows at the same time. Workflow B will process its tasks only after the control file has been created.
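The control-file handshake described above is a general pattern, sketched here outside Informatica: one side creates a marker file when it finishes, and the other polls until the file appears. The function names, timeout, and polling interval are assumptions; only the file name workflowA.done comes from the answer.

```python
import os
import tempfile
import time


def wait_for_flag(path, timeout=3600.0, poll_interval=5.0):
    """Block until `path` exists; return True, or False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


def consume_flag(path):
    """Delete the flag so that the next run waits for a fresh one."""
    os.remove(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        flag = os.path.join(tmp, "workflowA.done")
        open(flag, "w").close()  # what "touch workflowA.done" does
        if wait_for_flag(flag, timeout=1.0, poll_interval=0.1):
            consume_flag(flag)
            print("flag seen; downstream work can start")
```

In this sketch the flag has to be removed after it is seen; otherwise a stale file left over from the previous run would release the next run immediately.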
pmcmd.exe is available in the Informatica installation folder for Informatica server.
For my system it was in the below path:
/infa/server/bin
Usually this is controlled by an external, independent scheduler.

How to run SAS using batch if I do not have it locally

Is there a way to run SAS in batch if I don't have sas.exe on my machine?
My computer has SAS EG, but the code is run on our company's servers.
Thanks
If you are asking whether it is possible to run SAS batch on your local machine without having SAS on your local machine, the answer is no.
If you are using EG to connect to a SAS server, and you want to execute a batch job on the SAS server, that is possible (just not with EG). For example, if you have terminal access to the SAS server via putty or whatever, you can do a batch submit.
Enterprise Guide is quite capable of scheduling jobs, whether or not you have a local SAS installation.
Wendy McHenry covers this well in Four Ways to Schedule SAS Tasks. Way 1 is what you probably are familiar with ('batch'), but Ways 2 through 4 are all possible in server environments.
Way 2 is what I use, which is specifically covered in Chris Hemedinger's post Doing More with SAS Enterprise Guide Automation. In Enterprise Guide since I think EG 4.3, there has been an option in the File menu "Schedule ...", as well as a right-click option on a process flow "Schedule ...". These create VBScript files that can be scheduled using your normal Windows scheduler, and allow you to schedule a process flow or a project to run unattended, even if it needs to connect to a server.
You need to make sure you can connect to that server using the credentials you'll schedule the job to run under, of course, and that any network connections are created when you're not logged in interactively, but other than that it's quite simple to schedule the job. Then, once you've run it, it will save the project with the updated log and results tabs.
If your company uses the full suite of server products, I would definitely recommend seeing if you can get Way 3 to work (using SAS Management Console) - that is likely easier than doing it through EG. That's how SAS would expect you to schedule jobs in that kind of environment (and lets your SAS Administrator have better visibility on when the server will be more/less busy).