What will be the query to check completion of a workflow? - Informatica

I have to check the status of a workflow, i.e. whether the workflow completed within the scheduled time or not, in SQL query format. I also have to send an email with the workflow status, like 'completed within time' or 'not completed within time'. So, please help me out.

You can do it using either option 1 or option 2.
You need access to the repository metadata database.
Create a post-session shell script. You can pass the workflow name and a benchmark value to the shell script.
Get the workflow run time from the repository metadata database.
SQL you can use:
SELECT WORKFLOW_NAME,
       (END_TIME - START_TIME) * 24 * 60 * 60 AS diff_seconds
FROM REP_WFLOW_RUN
WHERE WORKFLOW_NAME = 'myWorkflow'
You can then compare the above value with the benchmark value. The shell script can send a mail depending on the outcome.
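A rough sketch of such a post-session script is shown below. It assumes an Oracle repository database reachable via sqlplus, mailx for sending mail, and that you only want the latest run; the connection string and recipients are placeholders, so adapt everything to your environment.
#!/bin/sh
# Post-session sketch: compare a workflow's run time against a benchmark
# (in seconds) and mail the result.

WORKFLOW_NAME="$1"      # e.g. myWorkflow
BENCHMARK_SECS="$2"     # allowed run time in seconds
RECIPIENTS="team@example.com"

# Run time of the latest run, in seconds (restricting to the latest run
# is an assumption; adjust the SQL if you need something else).
DIFF_SECS=$(sqlplus -s repo_user/repo_pwd@REPO_DB <<EOF
set heading off feedback off pagesize 0
SELECT ROUND((END_TIME - START_TIME) * 24 * 60 * 60)
FROM REP_WFLOW_RUN
WHERE WORKFLOW_NAME = '${WORKFLOW_NAME}'
  AND START_TIME = (SELECT MAX(START_TIME)
                    FROM REP_WFLOW_RUN
                    WHERE WORKFLOW_NAME = '${WORKFLOW_NAME}');
EOF
)
DIFF_SECS=$(echo "$DIFF_SECS" | tr -d '[:space:]')

if [ "$DIFF_SECS" -le "$BENCHMARK_SECS" ]; then
    STATUS="completed within time"
else
    STATUS="not completed within time"
fi

echo "Workflow ${WORKFLOW_NAME} ran for ${DIFF_SECS} seconds (benchmark ${BENCHMARK_SECS})." \
    | mailx -s "Workflow ${WORKFLOW_NAME}: ${STATUS}" "$RECIPIENTS"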
You need to create another workflow to check this workflow.
If you do not have access to the metadata database, follow the steps above, except for the metadata SQL.
Use pmcmd GetWorkflowDetails to check the status, start time and end time of a workflow.
pmcmd GetWorkflowDetails -sv service -d domain -f folder myWorkflow
You can then grep the start and end time from the output and compare them with your benchmark values. The catch is the timestamp format and so on, so you need a little bit of scripting here.
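For example, a rough sketch of that script might look like the following. The grep/sed patterns and the date parsing are assumptions, since the exact pmcmd output format varies between versions, so adjust them to what getworkflowdetails actually prints in your environment.
#!/bin/sh
# Sketch: derive the run time from pmcmd getworkflowdetails output and
# compare it with a benchmark value.

WORKFLOW_NAME="myWorkflow"
BENCHMARK_SECS=3600
RECIPIENTS="team@example.com"

# Add the user/password options your environment requires.
DETAILS=$(pmcmd getworkflowdetails -sv service -d domain -f folder "$WORKFLOW_NAME")

# Pull the timestamps out of the textual output (assumed "Start time:" / "End time:" lines).
START_TS=$(echo "$DETAILS" | grep -i "start time" | sed 's/^[^:]*: *//' | tr -d '[]')
END_TS=$(echo "$DETAILS"   | grep -i "end time"   | sed 's/^[^:]*: *//' | tr -d '[]')

# Convert to epoch seconds; this only works if `date -d` understands the format.
START_EPOCH=$(date -d "$START_TS" +%s)
END_EPOCH=$(date -d "$END_TS" +%s)
DIFF_SECS=$((END_EPOCH - START_EPOCH))

if [ "$DIFF_SECS" -le "$BENCHMARK_SECS" ]; then
    STATUS="completed within time"
else
    STATUS="not completed within time"
fi

echo "Workflow ${WORKFLOW_NAME} ran for ${DIFF_SECS} seconds (benchmark ${BENCHMARK_SECS})." \
    | mailx -s "Workflow ${WORKFLOW_NAME}: ${STATUS}" "$RECIPIENTS"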

How to load data/update Power BI Dataset monthly

I've been asked to implement a way to load data to my datasets once a month. As the Power BI Service doesn't have this option, I had to find a solution using Power Query, and below I describe my solution step by step.
If it helps you in some way, please let me know by posting a comment below. If you have a better and/or more elegant solution, I'm glad to hear from you.
So, as my first solution didn't work, here I'll post the definitive solution that we (my colleagues and I) found.
I have to say that this solution is not so simple, as it uses a Linux server, GitLab and Jenkins, so it requires a relatively complex environment, and I won't describe how to build it.
At the end, I'll suggest a simpler solution.
THE ENVIRONMENT
At my company we use Jenkins to schedule jobs, GitLab to store source code, and a Linux server to execute small tasks using shell scripts. For this problem I used all three services along with the Power BI API.
JENKINS
I use Jenkins to schedule a job that runs monthly. This job was created using the following configs:
Parameters: I created 2 parameters (workspace_id and dataset_id) so I can test the script in any environment (Power BI workspace) just by changing the values of those parameters;
Schedule Job: this job was scheduled to run on day 1 of every month at 02:00 a.m. As Jenkins uses the same syntax as cron (I think it is just an intermediary between you and cron), the value of this field is 0 2 1 * *.
Build: as we have a remote Linux server to execute the scripts, I used an Execute shell script on remote host using ssh build step. I don't know why, but on Jenkins you cannot execute the curl command directly in the job, it just didn't work, so I had to split the solution between Jenkins and the Linux server. Under SSH site you have to select the credentials (previously created by my team), and under command you put the commands below:
# Navigate to the shell script directory
cd "script-shell-script/"
# Pull the latest version of the script. If you aren't using GitLab,
# remove this command.
git pull
# Every time git pulls a new version of the file, it only has read permission.
# This command allows execution of the file.
chmod +x powerbi_refresh_dataset.sh
# Call the script, passing the workspace id and dataset id as parameters
./powerbi_refresh_dataset.sh $ID_WORKSPACE $ID_DATASET
SHELL SCRIPT
As you can already imagine, the core of the solution is the content of powerbi_refresh_dataset.sh. But before going there, you must understand how the Power BI API works, and you have to configure your Power BI environment so that API calls work. So please make sure that you already have your service principal properly configured by following this tutorial: https://learn.microsoft.com/en-us/power-bi/developer/embedded/embed-service-principal
Once you have your object_id, client_id and client_secret, you can create your shell script file. Below is the code of my .sh file.
# load OBJECT_ID, CLIENT_ID and CLIENT_SECRET as environment variables
source credential_file.sh
# This command retrieves a new token from Microsoft Credentials Manager
token_msg=$(curl -X POST "https://login.windows.net/$OBJECT_ID/oauth2/token" \
-H 'Content-Type: application/x-www-form-urlencoded' \
-H 'Accept: application/json' \
-d 'grant_type=client_credentials&resource=https://analysis.windows.net/powerbi/api&client_id='$CLIENT_ID'&client_secret='$CLIENT_SECRET
)
# Extract the token from the response message
token=$(echo "$token_msg" | jq -r '.access_token')
# Ask Power BI to refresh dataset
refresh_msg=$(curl -X POST 'https://api.powerbi.com/v1.0/myorg/groups/'$1'/datasets/'$2'/refreshes' \
-H 'Authorization: Bearer '$token \
-H 'Content-Type: application/json' \
-d '{"notifyOption": "NoNotification"}')
And here goes some explanation. The first command is source credential_file.sh, which loads 3 variables (OBJECT_ID, CLIENT_ID and CLIENT_SECRET). The intention here is to separate confidential info from the script so I can store the main script file in version control (Git) and not disclose any sensitive information. So, besides the powerbi_refresh_dataset.sh file, you must have credential_file.sh in the same directory, with the following content:
OBJECT_ID=OBJECT_ID_VALUE
CLIENT_ID=CLIENT_ID_VALUE
CLIENT_SECRET=CLIENT_SECRET_VALUE
It's important to say that if you are using Git or any other version control, only the powerbi_refresh_dataset.sh file goes to version control; the credential_file.sh file must remain only on your Linux server. I suggest saving its contents in a password store application like KeePass, as the CLIENT_SECRET cannot be retrieved later.
FINAL CONSIDERATIONS
So, the above is the most relevant info about my solution. As you can see, I'm intentionally omitting how to build the environment and make the parts talk to each other (Jenkins with Linux, Jenkins with Git and so on).
If all you have is a Linux or Windows host, I suggest this:
Linux Host
In this simpler environment, just create powerbi_refresh_dataset.sh and credential_file.sh, place them in any directory, and create a cron task to call powerbi_refresh_dataset.sh as often as you wish.
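For example, a crontab entry that runs it at 02:00 a.m. on the 1st of every month could look like this (the paths and the two IDs are placeholders):
# m h dom mon dow command
0 2 1 * * /home/user/scripts/powerbi_refresh_dataset.sh WORKSPACE_ID DATASET_ID >> /home/user/powerbi_refresh.log 2>&1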
Windows Host
On Windows you can do almost the same as on Linux, but you'll have to replace the content of the shell script with PowerShell commands (google it) and use Task Scheduler to regularly execute your PowerShell file.
Well, I think this will help you. I know it's not a complete answer, as it will only work if you have a similar environment, but I hope the final tips help you.
Best regards
The Solution
First let me summarize the solution. I just put a conditional check at the end of each query that checks whether today is the day when new data must be loaded or not. If yes, it returns the step to be executed; if not, it raises an error.
There are many ways to implement this, and I'll go from the simplest form to the most complex one.
Simplest Version: checking if it's the day to load new data directly in the query
This is the simplest way to implement the solution, but, depending on your dataset, it may not be the smartest one.
Let's say you have this foo query:
let
step1 = ...,
...,
...,
step10 = SomeFunction(Somevariable, someparameter)
in
step10
Now let's say you want that query to load new data only on the 1st day of the month. To do that, you just insert a conditional statement in the in clause.
let
step1 = ...,
...,
...,
step10 = SomeFunction(Somevariable, someparameter)
in
if Date.Day(DateTime.LocalNow()) = 1 then step10 else error "Today is not the day to load data"
In this example I just replaced step10 in the return of the query with this piece of code: if Date.Day(DateTime.LocalNow()) = 1 then step10 else error "Today is not the day to load data". By doing that, step10 will be the result of this query only if the query is being executed on the 1st day of the month; otherwise, it will return an error.
And here some explanation is worthwhile. Power Query is not a scripting language that runs in the same order in which it is declared. So the fact that the conditional statement was placed at the end of the query doesn't mean that all the code above it will be executed before the error is raised. As Power Query only executes what's necessary, the if... statement will probably be the first one to be evaluated. For more info about how Power Query works behind the scenes, I strongly recommend this reading: https://bengribaudo.com/blog/2018/02/28/4391/power-query-m-primer-part5-paradigm
Using function
Now let's move forward. Let's say that your dataset has not only one but many queries, and all of them need to be executed only once a month. In this case, a smart way to do that is to use what all other programming languages have for reusing blocks of code: create a function!
For this, create a new Blank Query and paste this code into its body:
(step) =>
let
result = if Date.Day(DateTime.LocalNow()) = 1 then step else error "Today is not the day to load data"
in
result
Now, in each query you'll call this function, passing the last step as a parameter. The function will check what day it is today and return the same step passed as a parameter if it's the day to load the data. Otherwise, it will return the error.
Below is the code of our query using our function, called check_if_upload:
let
step1 = ...,
...,
...,
step10 = SomeFunction(Somevariable, someparameter),
step11 = check_if_upload(step10)
in
step11
Using parameters
One final tip. As your query raises an error if today is not the day to load data, it means that you can only test your ETL once a month, right? The error also gets in the way of saving your Power Query changes, which means that if you can't apply the modifications you can't publish the new Power Query version (with this implementation) to the Power BI Service.
Well, you could change the day value checked inside the function, but that's, let's say, a bit clumsy.
A more elegant way to change this value is by using parameters. So, let's do it. Create a parameter (I'll call it Upload Day) of number type. Now, all you have to do is use this parameter in your function. It will look like this:
(step) =>
let
result = if Date.Day(DateTime.LocalNow()) = #"Upload Day" then step else error "Today is not the day to load data"
in
result
That's it. Now you can change the upload day directly in the Power BI Service, just by changing this parameter on the dataset (click on the dataset name and go to Settings >> Parameters).
Hope you nailed it and that it's helpful for you.
Best regards.

How to schedule an export from a BigQuery table to Cloud Storage?

I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfer in to BigQuery or Cloud Storage, but I haven't found anything regarding scheduling an export from a BigQuery table to Cloud Storage yet.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export to Cloud Storage from the BigQuery table. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.
Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn get triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )

    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri
    )
Specify the client library dependency in the requirements.txt file by adding this line:
google-cloud-bigquery
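If you prefer the command line over the console, a deployment command could look roughly like this (the function name, region and runtime are assumptions, and you may prefer authenticated invocations instead of --allow-unauthenticated):
gcloud functions deploy hello_world \
  --runtime=python39 \
  --trigger-http \
  --entry-point=hello_world \
  --region=us-central1 \
  --allow-unauthenticated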
Create a Cloud Scheduler job. Set the Frequency you wish the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL to the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and as HTTP method choose GET.
Once created, and by pressing the RUN NOW button, you can test how the export behaves. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
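Alternatively, the same Cloud Scheduler job can be created from the command line. A rough sketch (the job name and URL are placeholders, and depending on your gcloud version you may also need to pass a location):
gcloud scheduler jobs create http bq-export-weekly \
  --schedule="0 1 * * 0" \
  --uri="https://REGION-YOUR_PROJECT_ID.cloudfunctions.net/hello_world" \
  --http-method=GET \
  --time-zone="Etc/UTC"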
If you wish to execute exports on different tables, datasets and buckets for each execution, but essentially employing the same Cloud Function, you can use the HTTP POST method instead, and configure a Body containing said parameters as data, which would be passed on to the Cloud Function - although that would imply making some small changes to its code.
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
OPTIONS (
uri = 'gs://bucket/folder/*.csv',
format = 'CSV',
overwrite = true,
header = true,
field_delimiter = ';')
AS (
SELECT field1, field2
FROM mydataset.table1
ORDER BY field1
);
This could just as well be trivially set up via a scheduled query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permission to read the source datasets and tables and to write to the destination bucket.
Hopefully this is useful for other peeps visiting this question if not for OP :)
You have an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
But when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. That way, the Cloud Scheduler setup described by Maxim is optional, and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status in the Pub/Sub notification. You also get a lot of information about the scheduled query, which is useful if you want to perform more checks or if you want to generalize the function.
One more point, about the SFTP transfer. I open-sourced a project for querying BigQuery, building a CSV file and transferring this file to an FTP server (SFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use it. Let me know.

Where to find the number of active concurrent invocations in Google Cloud Functions

I am looking for a way to see how many concurrent invocations there are active at any point in time, e.g. in a minute range. I am looking for this as I received the error:
Forbidden: 403 Exceeded rate limits: too many concurrent queries for
this project_and_region. For more information, see
https://cloud.google.com/bigquery/
The quotas are listed here: https://cloud.google.com/functions/quotas
I am fine with having quotas, but I would like to see this number in a chart. Where can I find this?
Currently there is no way of seeing that information directly. There is a workaround though. You can do as follows:
Go to Google Cloud Console > Stackdriver Logging
At the text box that says "Filter by label or text search", click on the small arrow at the end of the text box.
Choose "Convert to advanced filter"
Type this query inside:
resource.type="cloud_function"
resource.labels.function_name="[GOOGLE_CLOUD_FUNCTION_NAME]"
"Function execution started"
At "Last hour" drop down menu, choose "Custom"
Fix the start and end time
This will list all the times that the Cloud Function was executed in the time range. If it was executed multiple times, instead of counting one by one you can use the following Python script:
Open Google Cloud Shell
Install Google Cloud Logging Library $ pip install google-cloud-logging
Create a main.py file using my GitHub code example. (I have tested it and it is working as expected)
Change the date_a_str and set it as start date.
Change the date_b_str and set it as end date.
In function_name = "[CLOUD_FUNCTION_NAME]" change [CLOUD_FUNCTION_NAME] to the name of your Cloud Function.
Execute the Python code $ python main.py
You should see a response as follows:
Found entries: [XX]
Waiting up to 5 seconds.
Sent all pending logs.
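Alternatively, if you just want a quick count without the Python script, a rough equivalent with the gcloud CLI could look like this (the function name and the freshness window are placeholders):
gcloud logging read \
  'resource.type="cloud_function" AND resource.labels.function_name="[CLOUD_FUNCTION_NAME]" AND "Function execution started"' \
  --freshness=1h \
  --format="value(timestamp)" | wc -l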

Deploy dacpac through VSTS only if it's been modified since last deploy

In the deploy dacpac step in VSTS, you can set the step to run only based on custom conditions. The condition examples are based on VSTS build information, and I can't find any documentation on using conditions from a connected Azure subscription or dacpac metadata. On the conditions page, they have a version variable that seems like it might be useful, but I can't find any other information about it.
Basically, when the dacpac step is triggered, I want to check metadata against existing data, conditionally run the build step, and update metadata. Is this possible through a VSTS build step?
Yes, it is possible. You can add a user-defined variable (such as a variable result with default value 0) in the VSTS build definition, and use the value 1 to run the dacpac step and the value 0 to skip it.
Detailed steps are below:
Add a PowerShell task with two operations before the dacpac step:
Check if there are new changes to the existing data.
If the metadata is only stored in Azure, you can refer to this way of connecting to Azure in PowerShell. If the metadata is also stored in the repository (such as a Git repo) you build from, you can also check for the update in the repository.
Set the result variable's value based on whether the metadata is updated or not.
If the data is updated, then change the result variable to 1:
Write-Host ("##vso[task.setvariable variable=result]1")
Otherwise, do not change the value (keep it at 0).
Since the data is managed in a Git VCS, you can check whether the data is updated or not in the Git repo. If the data has changed, then set the variable result to 1. A detailed PowerShell script is below:
$files = $(git diff HEAD HEAD~1 --name-only)
echo "changed files as below: $files"
if ($files -contains 'filename') {
    Write-Host ("##vso[task.setvariable variable=result]1")
}
Set conditions for the dacpac step:
In the task, select Custom conditions for Run this task. If you want to run this task when the build is succeeding and the result variable is 1, you can use the expression:
and(succeeded(), eq(variables['result'], '1'))
Now, if result has the value 0, the dacpac step will be skipped; if result has the value 1, the dacpac step will be executed.

Is there an API to send notifications based on job outputs?

I know there are APIs to configure notifications when a job fails or finishes.
But what if, say, I run a Hive query that counts the number of rows in a table? If the returned result is zero, I want to send out emails to the concerned parties. How can I do that?
Thanks.
You may want to look at Airflow and Qubole's operator for Airflow. We use Airflow to orchestrate all jobs being run using Qubole and, in some cases, non-Qubole environments. We use the DataDog API to report success/failure of each task (Qubole / non-Qubole). DataDog in this case can be replaced by Airflow's email operator. Airflow also has some chat operators (like Slack).
There is no direct API for triggering a notification based on the results of a query.
However there is a way to do this using Qubole:
- Create a workflow in Qubole with the following steps:
1. Your query (any query) that writes its output to a particular location on S3.
2. A shell script - this script reads the result from S3 and fails the job based on any criteria. For instance, in your case, fail the job if the result returns 0 rows (see the sketch below).
- Schedule this workflow using the "Scheduler" API to notify on failure.
You can also use the "sendmail" shell command to send mail based on the results in step 2 above.
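For step 2, the shell script could look roughly like this. It assumes the AWS CLI is available on the node and that the query wrote its result to a known S3 path (both assumptions, so adapt the path and tooling to your setup):
#!/bin/sh
# Fail the workflow step when the query result is empty, so the
# scheduler's failure notification fires.
RESULT_PATH="s3://my-bucket/query-output/result.csv"   # hypothetical path

ROW_COUNT=$(aws s3 cp "$RESULT_PATH" - | wc -l)

if [ "$ROW_COUNT" -eq 0 ]; then
    echo "Query returned 0 rows" >&2
    # Optionally send a mail from here as well (see the sendmail note above).
    exit 1    # a non-zero exit fails the job
fi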