I am using a Google Cloud virtual machine to run several Python scripts scheduled via cron, and I am looking for some way to check that they ran.
When I look in my logs I see nothing, so I guess simply running a .py file is not logged? Is there a way to turn on logging at this level? What are the usual approaches for such things?
The technology for recording log information in GCP is called Stackdriver. You have a couple of choices for how to log from within your application. The first is to instrument your code with the Stackdriver APIs, which explicitly write data to the Stackdriver subsystem. Here are the docs for that, and here is a further recipe.
A second option is to install the Stackdriver Logging Agent on your Compute Engine instance. This lets you tap into other sources of logging output, such as the local syslog.
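As a rough sketch of the first approach, a cron-driven script could write an explicit entry with the google-cloud-logging client library. The logger name cron-jobs and the message below are just placeholders for illustration, not anything your setup requires:
# Sketch: explicitly record a cron run in Cloud Logging (Stackdriver).
# Assumes the google-cloud-logging package is installed and the VM's
# service account is allowed to write logs.
import google.cloud.logging

# Instantiate a client once at the start of the script
client = google.cloud.logging.Client()

# "cron-jobs" is an arbitrary logger name used here for illustration
logger = client.logger("cron-jobs")

# Write one entry per run so every cron execution is visible in the logs
logger.log_text("my_script.py ran successfully", severity="INFO")
You could emit one entry at the start and one at the end of each script (or in an exception handler), so that a missing entry tells you a run did not complete.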
It's the first time I'm using Google Cloud Platform, so please bear with me!
I've built a scheduled workflow that simply runs a Batch job. The job runs Python code and uses the standard logging library for logging. When the job is executed, I can see all the entries in Cloud Logging as expected, but they all have severity ERROR even though they are all logged at INFO.
One possible reason I've been thinking about is that I haven't used the setup_logging function as described in the documentation here. The thing is, I didn't want to run the Cloud Logging setup when I run the code locally.
The questions I have are:
why does logging "work" (in the sense that logs end up in Cloud Logging) even if I did not use the setup_logging function? What is its real role?
why do my INFO entries show up with ERROR severity?
if I include that snippet and that snippet solves this issue, should I include an if statement in my code that detects if I am running the code locally and skips that Cloud Logging setup step?
According to the documentation, you have to run a setup step so that logs are sent to Cloud Logging correctly.
This setup then lets you use the Python standard logging library.
Once installed, this library includes logging handlers to connect
Python's standard logging module to Logging, as well as an API client
library to access Cloud Logging manually.
# Imports the Cloud Logging client library
import google.cloud.logging
# Instantiates a client
client = google.cloud.logging.Client()
# Retrieves a Cloud Logging handler based on the environment
# you're running in and integrates the handler with the
# Python logging module. By default this captures all logs
# at INFO level and higher
client.setup_logging()
Then you can use the Python standard library to add logs to Cloud Logging.
# Imports Python standard library logging
import logging
# The data to log
text = "Hello, world!"
# Emits the data using the standard logging module
logging.warning(text)
why does logging "work" (in the sense that logs end up in Cloud Logging) even if I did not use the setup_logging function? What is its real role?
Without the setup, Python's standard logging just writes plain text to stderr, and the Batch agent still forwards that output to Cloud Logging, which is why logging appears to work. However, the entries arrive as unstructured output rather than properly typed log records, so the severity is not what you expect. It's better to use the setup.
why do my INFO entries show up with ERROR severity?
The same reason as explained above: unstructured output forwarded from stderr is typically labelled ERROR rather than the level you actually logged at.
if I include that snippet and that snippet solves this issue, should I include an if statement in my code that detects if I am running the
code locally and skips that Cloud Logging setup step?
I don't think you need to add an if statement for running the code locally. In that case, the logs should still be printed to the console even when the setup is present.
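If you still prefer an explicit guard, a rough sketch could look like the following. The RUNNING_ON_GCP environment variable is purely an assumption for illustration; you would set it in your Batch job definition or use whatever environment detection you trust:
# Sketch: only attach the Cloud Logging handler when running on GCP.
import logging
import os

if os.environ.get("RUNNING_ON_GCP") == "1":
    # In the cloud: route standard-library logging to Cloud Logging
    import google.cloud.logging
    google.cloud.logging.Client().setup_logging()
else:
    # Locally: just print INFO-and-above to the console
    logging.basicConfig(level=logging.INFO)

logging.info("Shows up as INFO in Cloud Logging, or on the console locally")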
I have an application (Automation Anywhere A360) that writes its log output to a txt/csv file whenever I log something from it. I run a process in Automation Anywhere across 10 bot runners (Windows VMs) concurrently, so each bot runner logs what is going on locally.
My intention is that, instead of having separate log files for each bot runner, I'd like a centralized place where I store all the logs (i.e. Cloud Logging).
I know this can be accomplished using Python, Java, etc. However, if I invoke a Python script every time I need to log something into Cloud Logging, it does the job but takes around 2-3 seconds, which I think is a bit slow; most of that time is spent creating the GCP client and authenticating.
How would you tackle this?
The solution I was looking for is something like this. It is called BindPlane, and it can collect log data from on-premises and hybrid infrastructure and send it to the GCP monitoring/logging stack.
To whom it may (still) concern: you could use fluentd to forward logs to Pub/Sub and from there to a Cloud Logging bucket.
https://flugel.it/infrastructure-as-code/how-to-setup-fluentd-to-retrieve-logs-send-them-to-gcp-pub-sub-to-finally-push-them-to-elasticsearch/
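Whichever transport you pick, note that the 2-3 seconds per call comes mostly from creating a new client and authenticating on every invocation. As a rough alternative sketch, assuming a small long-lived helper per bot runner is acceptable, you could tail the local txt/csv file and ship new lines to Cloud Logging in batches with a single reused client; the file path and logger name below are made up:
# Sketch: a long-running shipper that tails the bot runner's local log file
# and forwards new lines to Cloud Logging in batches, reusing one client.
import time
import google.cloud.logging

LOG_FILE = "C:/BotLogs/output.csv"           # placeholder path to the local A360 log

client = google.cloud.logging.Client()       # created once, not per log call
logger = client.logger("aa360-bot-runner")   # arbitrary logger name

with open(LOG_FILE, "r") as handle:
    handle.seek(0, 2)                        # start tailing from the end of the file
    while True:
        lines = handle.readlines()
        if lines:
            batch = logger.batch()           # one API round trip per group of lines
            for line in lines:
                batch.log_text(line.rstrip(), severity="INFO")
            batch.commit()
        time.sleep(5)                        # poll interval; tune as needed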
I have read the question for vanilla Airflow.
How can broken DAGs be debugged effectively in Google Cloud Composer?
How can I see the full logs of a broken DAG?
Right now I can only see one line of trace in Airflow UI main page.
EDIT:
The answers don't seem to be addressing my question.
I am looking to fix broken DAGs, i.e. the DAG does not even appear in the DAG list, so of course there are no tasks running and no task logs to view.
As hexacynide pointed out, you can look at the task logs; there are details in the Composer docs about doing that, found here. You can also use Stackdriver logging, which is enabled by default in Composer projects. In Stackdriver logs, you can filter your logs on many variables, including by time, by pod (airflow-worker, airflow-webserver, airflow-scheduler, etc.) and by whatever keywords you suspect might appear in the logs.
EDIT: Adding screenshots and more clarity in response to question update
In Airflow, when there's a broken DAG, there is usually some form of error message at the top. (Yes, I know this particular error message is already helpful and I wouldn't need to debug further, but I'll walk through it anyway just to show how.)
In the message, I can see that my DAG bq_copy_across_locations is broken.
To debug, I go to Stackdriver, and search for the name of my DAG. I limit the results to the logs from this Composer environment. You can also limit the time frame if needed.
I looked through the error logs and found the traceback for the broken DAG.
Alternatively, if you know you only want to search for the stack traceback, you can run an advanced filter looking for your DAG name and the word "traceback". To do so, click the arrow at the right side of the Stackdriver logging bar and hit "convert to advanced filter".
Then enter your advanced filter
resource.type="cloud_composer_environment"
resource.labels.location="YOUR-COMPOSER-REGION"
resource.labels.environment_name="YOUR-ENV-NAME"
("BROKEN-DAG-NAME" AND
"Traceback")
The only logs returned will be the stack traceback entries for that DAG.
To determine run-time issues that occur when a DAG is triggered, you can always look at task logs as you would for any typical Airflow installation. These can be found using the web UI, or by looking in the associated logs folder in your Cloud Composer environment's associated Cloud Storage bucket.
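For instance, here is a rough sketch of pulling a task's log files straight out of that bucket with the Cloud Storage Python client; the bucket name, DAG id and task id below are placeholders:
# Sketch: read Composer task logs from the environment's Cloud Storage bucket.
# Task logs live under the logs/ prefix, organized by DAG id and task id.
from google.cloud import storage

client = storage.Client()
bucket_name = "us-central1-my-env-12345-bucket"   # placeholder bucket name
prefix = "logs/my_dag_id/my_task_id/"             # placeholder DAG and task ids

for blob in client.list_blobs(bucket_name, prefix=prefix):
    print("====", blob.name)
    print(blob.download_as_text())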
To identify issues at parse time, you can execute Airflow commands using gcloud composer. For example, to run airflow list_dags, the gcloud CLI equivalent would be:
$ gcloud composer environments run $ENV_NAME --location=$REGION list_dags -- --report
Note that the bare -- before --report is intentional. It is there so that the argument parser can differentiate between arguments to gcloud itself and arguments to be passed to the Airflow subcommand (in this case list_dags).
All I know is that we can fetch logs using the Stackdriver Logging or Monitoring services. But where are these logs being fetched from?
If I knew where these logs are stored, there would be no need to make API calls or use another service to see my logs. I could simply download them and process them with my own code.
Is there any way to do this?
There is a capability of Stackdriver Logging called "exporting". Here is a link to the documentation. At a high level, exporting means that when a new log message is written to a log, a copy of that message is then exported. The targets of the export (called sinks) can be:
Cloud Storage
BigQuery
Pub/Sub
From your description, if you set up Cloud Storage as a sink, then you will have new files written to your Cloud Storage bucket that you can then retrieve and process.
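To make that concrete, here is a rough sketch of retrieving and processing exported entries from the sink bucket. The bucket name and prefix are placeholders, and it assumes the exported files contain one JSON log entry per line:
# Sketch: pull exported log files back out of the sink bucket and process them.
import json
from google.cloud import storage

client = storage.Client()
bucket_name = "my-log-sink-bucket"   # placeholder: the bucket chosen as the sink target
prefix = "syslog/2019/01/01/"        # exported files are organized by log name and date

for blob in client.list_blobs(bucket_name, prefix=prefix):
    for line in blob.download_as_text().splitlines():
        entry = json.loads(line)
        print(entry.get("timestamp"), entry.get("severity"), entry.get("textPayload"))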
(The export overview diagram in the docs gives the best high-level picture of this flow.)
If you don't wish to use exports of new log entries, you can use either the API or gcloud to read the current logs. Bear in mind that GCP-held logs (within Stackdriver) expire after a retention period (30 days). See gcloud logging read.
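If you go the API route from Python, a minimal sketch might look like this; the filter is only an example and should be adjusted to your own resources:
# Sketch: read recent log entries through the Cloud Logging API rather than
# exporting them. The filter below is an example, not a requirement.
import google.cloud.logging

client = google.cloud.logging.Client()
log_filter = 'resource.type="gce_instance" AND severity>=WARNING'

for entry in client.list_entries(filter_=log_filter,
                                 order_by=google.cloud.logging.DESCENDING):
    print(entry.timestamp, entry.severity, entry.payload)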
Is there a way to trigger a Dataprep flow on GCS (Google Cloud Storage) file upload? Or, at least, is it possible to make Dataprep run each day and take the newest file from a certain directory in GCS?
It should be possible, because otherwise what is the point of scheduling? Running the same job over the same data source with the same output?
It seems this product is very immature at the moment, so no API endpoint exists to run a job in this service. It is only possible to run a job in the UI.
In general, this is a pattern that is typically used for running jobs on a schedule. Maybe at some point the service will allow you to publish into the "queue" that Run Job already uses.