Writing to Cloud SQL via Dataflow pipeline is very slow

I managed to connect to Cloud SQL via JdbcIO:
DataSourceConfiguration.create("com.mysql.jdbc.Driver","jdbc:mysql://google/?cloudSqlInstance=::&socketFactory=com.google.cloud.sql.mysql.SocketFactory&user=&password=")
This works; however, batch writes take between 2 and 5 minutes for 1,000 records, which is terrible. I have tried different networks to see whether this was related, and the results were consistent.
Does anyone have any ideas?

Where are you initializing this connection? If you are doing this inside your DoFn's processing code, it will add latency, because the socket is built up and torn down on each bundle.
Have a look at DoFn.Setup, which provides a clean way to initialize resources that persist across bundle calls.
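
A minimal Python sketch of the idea, assuming the lifecycle described above: a setup hook opens the connection once, and each bundle reuses it and writes its rows in a single batch. The class below only imitates the shape of a Beam DoFn (setup, start_bundle, process, finish_bundle, teardown) and is not the Beam API itself; sqlite3 stands in for Cloud SQL, and BatchedWriter, run_bundles, and the records table are hypothetical names.

```python
import sqlite3


class BatchedWriter:
    """Imitates the shape of a Beam DoFn; this is not the Beam API itself."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path  # sqlite3 is a stand-in for Cloud SQL (assumption)
        self.conn = None
        self.connections_opened = 0
        self.buffer = []

    def setup(self):
        # Runs once per instance: open the connection here, not per bundle.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS records (id INTEGER, value TEXT)"
        )
        self.connections_opened += 1

    def start_bundle(self):
        self.buffer = []

    def process(self, element):
        # Only buffer the row; no database round trip per element.
        self.buffer.append(element)

    def finish_bundle(self):
        # One batched write and one commit per bundle.
        self.conn.executemany("INSERT INTO records VALUES (?, ?)", self.buffer)
        self.conn.commit()
        self.buffer = []

    def teardown(self):
        self.conn.close()
        self.conn = None


def run_bundles(writer, bundles):
    """Hypothetical stand-in for a runner: drives the lifecycle by hand."""
    writer.setup()
    for bundle in bundles:
        writer.start_bundle()
        for element in bundle:
            writer.process(element)
        writer.finish_bundle()
    count = writer.conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    writer.teardown()
    return count
```

Driving ten bundles of 100 rows through run_bundles opens a single connection; opening it inside process or start_bundle instead would pay the connection cost on every bundle.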

Related

How can I push text log files into Cloud Logging?

I have an application (Automation Anywhere A360) that writes to a txt/csv file whenever I want to log something. I run a process in Automation Anywhere on 10 bot runners (Windows VMs) concurrently, so each bot runner logs what is going on locally.
Instead of having separate log files for each bot runner, I'd like a centralized place where I store all the logs (i.e. Cloud Logging).
I know that this can be accomplished using Python, Java, etc. However, if I invoke a Python script every time I need to log something into Cloud Logging, it does the job but takes around 2-3 seconds, which I think is a bit slow. Most of that time goes into the first step: connecting the GCP client and logging in.
How would you tackle this?

The solution I am looking for is something like BindPlane, which can collect log data from on-premises and hybrid infrastructure and send it to the GCP monitoring/logging stack.

To whom it may (still) concern: you could use fluentd to forward logs to Pub/Sub and from there to a Cloud Logging bucket.
https://flugel.it/infrastructure-as-code/how-to-setup-fluentd-to-retrieve-logs-send-them-to-gcp-pub-sub-to-finally-push-them-to-elasticsearch/

Trigger ALL cloud run instances at once to do async job (rebuild cache)

I have a Cloud Run service with multiple instances, running or idle.
I want all the instances to do an async job periodically (to rebuild a cache).
Example of the async job:
Periodically check whether there is a new version of a JSON file in the object storage bucket.
Do some processing on the JSON and store it in a variable (cache) that will be used by the API endpoints, so I do not need to contact the database on each request.
Options on how to do it:
setInterval() to call rebuildCacheIfNeeded(). Cloud Run instances cannot do async tasks in the background (they are assigned CPU resources only while handling a request).
A webcron will not work. Only one instance would handle the request, and the cache would be rebuilt only on that instance.
Pub/Sub on a new file added to the bucket. Can Pub/Sub be set up so that all instances are woken up and all of them rebuild the cache? If yes, this would be the best solution.
Call rebuildCacheIfNeeded() on each request and hold the HTTP connection until the cache is rebuilt. I would like to avoid this for obvious reasons.
Kill all instances of Cloud Run when a new file is added to the bucket. Cloud Run should be stateless, so this is the only solution that complies with the statelessness rule. But how do I kill all instances without running a whole redeploy?
Any other possible solutions that I am missing?
Thank you
Please do not suggest "Just use a database"... The cached data is small, and I would like to avoid database latency and another possible point of failure.

You are trying to use side-effects of a service that is neither predictable nor manageable. That will lead to problems today and possibly failure when features are updated or new features are released. Design your application to use documented features.
There is no documented method to achieve your objective.
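
For reference, the per-request option listed in the question can be sketched so that most requests pay almost nothing: the version check runs at most once per interval, and the cache is rebuilt only when the version has changed. This is a minimal Python sketch, not a way to wake every instance. VersionedCache, fetch_version, load_data, and min_interval are hypothetical names, and the two callables stand in for whatever reads the object's version marker and its contents from the bucket.

```python
import threading
import time

_UNSET = object()  # distinguishes "never loaded" from any real version value


class VersionedCache:
    def __init__(self, fetch_version, load_data, min_interval=60.0,
                 clock=time.monotonic):
        self._fetch_version = fetch_version  # hypothetical: cheap version lookup
        self._load_data = load_data          # hypothetical: download and process
        self._min_interval = min_interval
        self._clock = clock
        self._version = _UNSET
        self._data = None
        self._last_check = None
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            now = self._clock()
            due = (self._last_check is None
                   or now - self._last_check >= self._min_interval)
            if due:
                self._last_check = now
                version = self._fetch_version()
                if version != self._version:
                    # Rebuild only when the source actually changed.
                    self._data = self._load_data()
                    self._version = version
            return self._data
```

Each instance refreshes independently the next time it serves a request after the interval has elapsed, so instances can briefly serve different versions.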

AWS service for doing jobs

I have the following need: the code needs to call some APIs, get some data, and store it in a database (a flat file will do for our purpose). As the APIs give access to a huge number of records, we want to split the work into 30 parts, each part scraping a certain section of the data from the APIs. We want these 30 scrapers to run on 30 different machines, and for that we have a Python program that does the following:
Call the API, get the data, based on parameters (which part of the API to call)
Dump it to the local flatfile.
And then later, we will merge the output from the 30 files into one giant DB.
The question is: which AWS tool should we use for our purpose? We could use EC2 instances, but we have to keep the EC2 console open on the desktop we connect from in order to run the Python program, and it is not feasible to keep 30 connections open on my laptop. It is very complicated to get remote desktop on those machines, so logging in there, starting the job, and then disconnecting is not feasible either.
What we want is this - start the tasks (one each on 30 machines), let them run and finish by themselves, and if possible notify me (or I can myself check for health periodically).
Can anyone guide me which AWS tool suits our purpose, and how?

"We can use EC2 instance, but we have to keep the EC2 console open on our desktop where we connect to it to run the Python program"
That just means you are running the script wrong, and you need to look into running it as a service.
In general, you need to look into queueing up these tasks in SQS and then triggering either EC2 auto-scaling or Lambda functions, depending on whether your script can run within the Lambda runtime restrictions.
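
The queueing approach starts by turning the 30 parts into 30 independent task messages that any worker can pick up. A minimal Python sketch of that step, assuming the records can be addressed by a contiguous offset range; build_tasks and the message field names are hypothetical, and sending each message to SQS is left out.

```python
import json


def build_tasks(total_records, parts=30):
    """Split [0, total_records) into contiguous ranges, one JSON message each."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(total_records, parts)
    tasks = []
    start = 0
    for part in range(parts):
        # The first `extra` parts take one more record, so sizes differ by at most 1.
        size = base + (1 if part < extra else 0)
        tasks.append(json.dumps({"part": part, "start": start, "end": start + size}))
        start += size
    return tasks
```

Each worker would then read one message, scrape the records from start up to (but not including) end, and write its own output file.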

This seems like a good application for Step Functions. Step Functions allow you to orchestrate multiple Lambda functions, Glue jobs, and other services into a business process. You could write Lambda functions that call the API endpoints and store the results in S3. Once all the data is gathered, your state machine could trigger a Lambda function, Glue job, or something else that processes the data into your database. Step Functions help with error handling and retries and allow easy monitoring of your process.

What service should I use to process my files in a Cloud Storage bucket and upload the result?

I have software that processes some files. What I need is to:
start a default image on Google Cloud (I think Docker should be a good solution) using an API or a run command
download files from Google Cloud Storage
process them by running my software on the downloaded files
upload the result to Google Cloud Storage
shut the image down, expecting not to be billed anymore
What I do know is how to create my image, hehe. But I can't find any info telling me which Google Cloud service I should use, or even whether I can do it the way I'm thinking. I think I'm not using the right keywords to find what I need.
I was looking at Kubernetes, but I couldn't figure out how to manipulate those instances to execute one-time processing.
[EDIT]
Explaining the process better: I have an app that receives images and sends them to Google Cloud Storage. After that, I need to process those images: apply filters, georeference, split the image, etc. So I want to start a Docker image to process them and upload the results to Google Cloud Storage again.

If you are using any of the runtimes supported by Google Cloud Functions, they are the easiest way to do this kind of operation (i.e. fetch something from Google Cloud Storage, perform some actions on those files, and upload them again). The Cloud Function is triggered by an event of your choice, and it terminates after the job is done.
The next option in terms of complexity would be to deploy a Google App Engine application in the standard environment. It allows you to deploy your own application written in any of the languages supported by this environment. While there is traffic to your application, you will have instances serving it, but the number of running instances can go down to 0 when there is no traffic, which means lower cost.
Another option would be Google App Engine in the flexible environment. This product allows you to deploy your application in any custom runtime. This option always has at least one instance running, so it would never shut down.
Lastly, you can use Google Compute Engine to "create and run virtual machines on Google infrastructure". Unlike GAE, this is much less managed by Google, which means that most of the configuration is up to you. In this case, you would need to shut your VM down programmatically after you have finished your operations.

Based on your edit where you stated that you already have an app that is inserting images into Google Cloud Storage, your easiest option would be to use Cloud Functions that are triggered by additions, changes, or deletions to objects in Cloud Storage buckets.
You can follow the Cloud Functions tutorial for Cloud Storage to get an idea of the generic process and then implement your own code that handles your specific tasks. There are other tutorials like the Imagemagick tutorial for Cloud Functions that might also be relevant to the type of processing you intend to do.
Cloud Functions is probably your most lightweight approach. You could of course build a more full-scale application, but that is likely overkill, more expensive, and more complex. You can write your processing code in Node.js, Python, or Go.
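
A minimal Python sketch of the handler shape for a storage-triggered function. It assumes the event payload carries the bucket and object name under the keys bucket and name; process_image, the processed/ prefix, and the injected download and upload callables are hypothetical placeholders for the real Cloud Storage client calls and the actual image processing. Skipping objects under the output prefix keeps the function from re-triggering on its own results when input and output share a bucket.

```python
OUTPUT_PREFIX = "processed/"  # hypothetical output location


def process_image(data: bytes) -> bytes:
    # Placeholder for the real work (filters, georeferencing, splitting).
    return data[::-1]


def handle_event(event, download, upload):
    """Process one newly added object; return the output name, or None if skipped."""
    bucket = event["bucket"]
    name = event["name"]
    if name.startswith(OUTPUT_PREFIX):
        # This object is one of our own results; do not process it again.
        return None
    result = process_image(download(bucket, name))
    out_name = OUTPUT_PREFIX + name
    upload(bucket, out_name, result)
    return out_name
```

Passing download and upload in as arguments keeps the handler testable without a real bucket; in a deployed function they would wrap the storage client.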

Google Dataprep: Scheduling with updated data source

Is there a way to trigger a Dataprep flow on a GCS (Google Cloud Storage) file upload? Or, at least, is it possible to make Dataprep run each day and take the newest file from a certain directory in GCS?
It should be possible, because otherwise what is the point in scheduling? Running the same job over the same data source with the same output?

It seems this product is very immature at the moment, so no API endpoint exists to run a job in this service. It is only possible to run a job in the UI.
In general, this is a pattern that is typically used for running jobs on a schedule. Maybe at some point the service will allow you to publish into the "queue" that Run Job already uses.
