I was running a test job on Google Dataflow, and I wanted to download the metrics for it afterwards. I can view the metrics in the job metrics section of the job, but it doesn't let me download them as a CSV (it only lets me download a PNG). Some jobs do allow that, which is why I am confused. Is there a way I can enable the metric download option on every job?
There are two versions of the job metrics section. It appears that only the plots in the newer one (the one with a "GO BACK TO CLASSIC METRICS" button) let you download as CSV when you click and expand the three-dots menu.
You can always "View in Metrics Explorer", do some analysis there, and read the metric data programmatically. More details about how to export Cloud Monitoring data can be found here.
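For the programmatic route, here is a minimal sketch using the Cloud Monitoring client library; the project, job name, and metric type are placeholders you would adjust, and it assumes the google-cloud-monitoring package is installed:

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder project

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# One of the per-job Dataflow metrics; filter on the job name you ran.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type="dataflow.googleapis.com/job/element_count" '
            'AND resource.labels.job_name="my-test-job"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)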
I have some experience with Google Cloud Functions (CF). I recently tried to deploy a function with a Python app, but it uses an NLP model, so the 8 GB memory limit is exceeded when the model is triggered. The function is triggered when a JSON file is uploaded to a bucket.
So, I plan to try Google Cloud Run, but I have no experience with it, and I am not completely sure it is the best course of action.
If it is, what is the best way of implementing it, given that the Cloud Run service will be triggered by a file uploaded to a bucket? In CF you can select the triggering event; in Cloud Run I didn't see anything like that. I could use some starting points, as I couldn't find my case in the GCP documentation.
Any help will be appreciated.
There are at least these two options:
The legacy one: create a GCS notification to Pub/Sub, then create a push subscription and set the Cloud Run URL as the HTTP push endpoint (a sketch of this setup follows after the list).
A more recent way is to use Eventarc to invoke a Cloud Run endpoint directly from an event (it roughly creates the same thing, a Pub/Sub topic and push subscription, but it's fully configured for you).
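For the first option, a rough sketch of the setup with the client libraries might look like this; the project, bucket, topic, and Cloud Run URL are placeholders, and it assumes the Pub/Sub topic already exists and the GCS service agent is allowed to publish to it:

from google.cloud import pubsub_v1, storage

PROJECT = "my-project"
BUCKET = "my-upload-bucket"
TOPIC = "gcs-uploads"
RUN_URL = "https://my-service-abc123-uc.a.run.app/"  # placeholder Cloud Run URL

# 1. Have GCS publish a Pub/Sub notification when an object is finalized.
storage_client = storage.Client(project=PROJECT)
bucket = storage_client.bucket(BUCKET)
notification = bucket.notification(
    topic_name=TOPIC,
    event_types=["OBJECT_FINALIZE"],
    payload_format="JSON_API_V1",
)
notification.create()

# 2. Create a push subscription that forwards each message to Cloud Run.
subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(PROJECT, "gcs-uploads-to-run"),
        "topic": subscriber.topic_path(PROJECT, TOPIC),
        "push_config": pubsub_v1.types.PushConfig(push_endpoint=RUN_URL),
    }
)

If the Cloud Run service requires authentication, the push config would also need an OIDC token for a service account that has the run.invoker role on the service.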
EDIT 1
When you use a push notification, you receive a standard Pub/Sub message. The format is described in the documentation for the attributes and for the body content; keep in mind that the raw content is base64 encoded and you have to decode it to get the final format.
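For illustration, here is a minimal sketch of a Cloud Run endpoint that unpacks such a push message, assuming Flask; the envelope fields follow the Pub/Sub push format:

import base64
import json

from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_push():
    envelope = request.get_json()
    message = envelope["message"]

    # The GCS metadata (eventType, bucketId, objectId, ...) arrives as attributes.
    attributes = message.get("attributes", {})

    # The body is the base64-encoded JSON representation of the object.
    payload = {}
    if "data" in message:
        payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))

    print("attributes:", attributes)
    print("payload:", payload)
    return ("", 204)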
I personally have a Cloud Run service that logs the contents of every request, so that all the data I need for development ends up in the logs. When there is a new message format, I point the push at that Cloud Run endpoint and I automatically get the format.
For Eventarc, the format will be added to the UI soon (I saw that feature in preview, but it's not yet available). The best solution is to log the content so you can see exactly what you receive and decide what to do with it!
Currently, my data goes through the following steps:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load the data into BigQuery.
I need a low-cost solution to know when this BigQuery job is finished, so that I can trigger a Dataflow pipeline only after the load is complete.
Notes:
I know about the alpha BigQuery trigger for Google Cloud Functions, but I'm not sure it's a good idea: from what I saw, this trigger uses the job ID, which apparently cannot be fixed in advance, so the function would have to be redeployed for every job. And of course it's an alpha feature.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log entry that indicates the job finished.
My files are large, so it isn't a good idea to have a Google Cloud Function wait until the job finishes.
Despite what you mention about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can add a dataset filter as well if needed.
Then create a sink on this advanced filter (for example to Pub/Sub), trigger a Cloud Function from it, and launch your Dataflow job there.
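As an illustration, a sketch of a Pub/Sub-triggered Cloud Function behind such a sink could look like this; the project, region, and template path are placeholders, and the Dataflow launch goes through the templates.launch API via the discovery client:

import base64
import json

from googleapiclient.discovery import build


def on_bq_job_done(event, context):
    # The sink delivers the LogEntry as base64-encoded JSON in the Pub/Sub message.
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    job = entry["protoPayload"]["serviceData"]["jobCompletedEvent"]["job"]
    print("BigQuery job finished:", job["jobName"]["jobId"])

    # Launch a templated Dataflow pipeline once the load job is done.
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    dataflow.projects().locations().templates().launch(
        projectId="my-project",
        location="us-central1",
        gcsPath="gs://my-bucket/templates/my-template",
        body={"jobName": "run-after-bq-load", "parameters": {}},
    ).execute()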
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. You define a DAG, and Composer executes each node of the DAG and checks the dependencies to ensure that things run in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
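For a rough idea of what this looks like in Composer, here is a sketch of a DAG that loads the file into BigQuery and then starts a Dataflow template; all names, paths, and operator choices are assumptions for illustration, based on the Google provider package:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="load_then_dataflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger it per file, or give it a cron schedule
    catchup=False,
) as dag:
    load_to_bq = BigQueryInsertJobOperator(
        task_id="load_to_bq",
        configuration={
            "load": {
                "sourceUris": ["gs://my-bucket/incoming/*.json"],
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "my_dataset",
                    "tableId": "my_table",
                },
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
            }
        },
    )
    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow",
        template="gs://my-bucket/templates/my-template",
        location="us-central1",
        parameters={},
    )

    # The Dataflow task only runs after the load job has completed successfully.
    load_to_bq >> run_dataflow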
I am using a Google Cloud virtual machine to run several Python scripts scheduled with cron, and I am looking for some way to check that they actually ran.
When I look in my logs I see nothing, so I guess simply running a .py file is not logged? Is there a way to turn on logging at this level? What are the usual approaches for such things?
The technology for recording log information in GCP is called Stackdriver. You have a couple of choices for how to log within your application. The first is to instrument your code with the Stackdriver APIs, which explicitly write data to the Stackdriver subsystem. Here are the docs for that and here is a further recipe.
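For the first approach, a minimal sketch from a cron-run script might look like this; it assumes the google-cloud-logging client library is installed, the VM's service account can write logs, and the script name is just a placeholder:

import logging

import google.cloud.logging

# Attach the Cloud Logging handler to the standard logging module.
client = google.cloud.logging.Client()
client.setup_logging()

logging.info("nightly_job.py started")
# ... the actual work of the script ...
logging.info("nightly_job.py finished successfully")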
A second option is to install the Stackdriver Logging agent on your Compute Engine instance. This then allows you to tap into other sources of logging output, such as the local syslog.
I read many articles and solutions about scheduling query results to be exported to external storage from Google BigQuery, but they didn't seem very clear.
Note: my company has a subscription only to Google BigQuery, not to the complete set of Google Cloud Platform services.
I know how to do it manually but I am looking to automate the process since I need the same data every week.
Any suggestions will be appreciated. Thank you.
Option 1
You can use Apache Airflow, which provides the option to create scheduled tasks on top of BigQuery using the BigQuery operators.
You can find in this link the basic steps required to start setting this up.
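For example, a sketch of a weekly DAG built on the BigQuery-to-GCS transfer operator could look like this; the table, bucket, and schedule are placeholders, and the operator comes from the Google provider package:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

with DAG(
    dag_id="weekly_bq_export",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * 1",  # every Monday at 06:00
    catchup=False,
) as dag:
    export_to_gcs = BigQueryToGCSOperator(
        task_id="export_to_gcs",
        source_project_dataset_table="my-project.my_dataset.my_table",
        destination_cloud_storage_uris=["gs://my-bucket/exports/my_table-*.csv.gz"],
        export_format="CSV",
        compression="GZIP",
        print_header=True,
    )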
Option 2
You can use the Google BigQuery command-line tool (bq) to export your data the same way you would from the web UI, for example:
bq --location=[LOCATION] extract --destination_format [FORMAT] --compression [COMPRESSION_TYPE] --field_delimiter [DELIMITER] --print_header [BOOLEAN] [PROJECT_ID]:[DATASET].[TABLE] gs://[BUCKET]/[FILENAME]
Once you get this working, you can use any scheduling process of your liking to schedule the run of this job.
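If you would rather drive the same export from a scheduled Python script instead of the bq tool, the client library exposes the same extract job; the project, table, and bucket below are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.ExtractJobConfig(
    destination_format="CSV",
    compression="GZIP",
    field_delimiter=",",
    print_header=True,
)
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/exports/my_table-*.csv.gz",
    location="US",
    job_config=job_config,
)
extract_job.result()  # block until the export job finishes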
BTW: Airflow has a connector which enables you to run the command line tool
Once the file is in GCP, you can use the Box G Suite integration to see and manage your files.
All I know is that we can fetch logs using the Stackdriver Logging or Monitoring services. But where are these logs being fetched from?
If I knew where these logs are fetched from, there would be no need to make API calls or use another service to see my logs. I could simply download them and use my own code to process them.
Is there any way to do this?
There is a capability of Stackdriver Logging called "exporting". Here is a link to the documentation. At a high level, exporting means that when a new log message is written to a log, a copy of that message is also exported. The targets of the export (called sinks) can be:
Cloud Storage
BigQuery
Pub/Sub
From your description, if you set up Cloud Storage as a sink, then you will have new files written to your Cloud Storage bucket that you can then retrieve and process.
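For instance, here is a short sketch of processing the exported files; the bucket and prefix are placeholders, and it assumes the entries exported to Cloud Storage are stored as newline-delimited JSON:

import json

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-log-export-bucket")

# Exported files are organized by log name and date under the bucket.
for blob in bucket.list_blobs(prefix="syslog/2024/01/01/"):
    for line in blob.download_as_text().splitlines():
        entry = json.loads(line)
        print(entry.get("timestamp"), entry.get("textPayload"))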
The overview diagram in the documentation gives the best picture of this flow.
If you don't wish to export new log entries, you can use either the API or gcloud to read the current logs. Be aware that logs held by GCP (within Stackdriver) expire after a period of time (30 days). See gcloud logging read.
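The same read can also be done from code; here is a minimal sketch with the client library, where the project and filter are placeholders:

from google.cloud import logging

client = logging.Client(project="my-project")
log_filter = 'resource.type="gce_instance" AND severity>=WARNING'

# Iterate over the matching entries still retained by Stackdriver.
for entry in client.list_entries(filter_=log_filter, page_size=100):
    print(entry.timestamp, entry.severity, entry.payload)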