Every user on our website can schedule multiple jobs on GCP using an API. The jobs are called projects/PROJECT_ID/locations/LOCATION_ID/jobs/USER_ID-RANDOM_STRING.
Now I need to delete all jobs of a user. Assuming there are many jobs (100k+), how do I delete all jobs of that user? Simply iterating through them does not scale. Any ideas? I'd rather avoid having to store all of the job IDs separately.
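For concreteness, this is the naive loop I'm trying to get away from: a rough sketch with the Python google-cloud-scheduler client (project, location, and user ID are placeholders), listing every job in the location and deleting the ones whose name starts with the user's prefix, one API call per job.

```python
# Sketch of the per-job iteration the question says does not scale at 100k+ jobs.
# Project, location, and user ID below are placeholders; the USER_ID prefix is
# checked client-side here (an assumption, not a server-side filter).
from google.cloud import scheduler_v1

PROJECT_ID = "my-project"
LOCATION_ID = "us-central1"
USER_ID = "user-123"

client = scheduler_v1.CloudSchedulerClient()
parent = f"projects/{PROJECT_ID}/locations/{LOCATION_ID}"
prefix = f"{parent}/jobs/{USER_ID}-"

# list_jobs pages through every job in the location.
for job in client.list_jobs(parent=parent):
    if job.name.startswith(prefix):
        client.delete_job(name=job.name)  # one delete call per job
```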
I have a list of 10 timestamps which keeps updating dynamically. In total there are 3 such lists for 3 users. I want to build a utility to trigger a function at the next upcoming timestamp (preferably with everything on serverless compute).
I am stuck on how to achieve this on AWS or Firebase.
On Firebase/Google Cloud Functions the two most common options are either to store the schedule in a database and then periodically trigger a Cloud Function and run the tasks that are due, or to use Cloud Tasks to dynamically schedule a callback to a separate Cloud Function for each task.
I recommend also reading:
Doug's blog post on How to schedule a Cloud Function to run in the future with Cloud Tasks (to build a Firestore document TTL)
Fireship.io's tutorial on Dynamic Scheduled Background Jobs in Firebase
How can scheduled Firebase Cloud Messaging notifications be made outside of the Firebase Console?
Previous questions on dynamically scheduling functions, as this has been covered quite well before.
Update (late 2022): there is now also a built-in way to schedule Cloud Functions dynamically: enqueue functions with Cloud Tasks.
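For the Cloud Tasks approach, a rough sketch with the Python google-cloud-tasks client looks like the following; the queue name, function URL, and payload shape are placeholders you would replace with your own.

```python
# Hedged sketch: enqueue one Cloud Task per upcoming timestamp so that an
# HTTP-triggered Cloud Function is called at that time. The queue "timestamps"
# and the function URL are placeholders.
import datetime
import json

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

PROJECT = "my-project"
LOCATION = "us-central1"
QUEUE = "timestamps"  # assumption: queue created beforehand
FUNCTION_URL = "https://us-central1-my-project.cloudfunctions.net/onTimestamp"

client = tasks_v2.CloudTasksClient()
parent = client.queue_path(PROJECT, LOCATION, QUEUE)

def schedule_callback(user_id: str, run_at: datetime.datetime) -> None:
    """Create a task that POSTs to the function at run_at (UTC)."""
    schedule_time = timestamp_pb2.Timestamp()
    schedule_time.FromDatetime(run_at)

    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": FUNCTION_URL,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"user_id": user_id}).encode(),
        },
        "schedule_time": schedule_time,
    }
    client.create_task(parent=parent, task=task)
```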
I'm running into LimitExceededException when starting new AWS Rekognition jobs with the StartFaceDetection API call, so I would like to see a list of my currently running face detection jobs. The GetFaceDetection command apparently requires you to pass in a specific single job ID, but I would like to see all jobs that are currently in progress (or even all jobs that were started recently). Is this possible?
The Rekognition API does not have an operation to retrieve all running operations; you would need to handle this on your side. You could use a RekognitionWaiter to get notifications on the state of individual jobs and keep track of them in a database of some sort.
Depending on the number of jobs you want to run in parallel and the region you're using right now, you might get a higher limit by using another region. You can check the service quotas for Rekognition here.
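As a sketch of the "keep track in a database of some sort" idea with boto3: record every JobId you start in a table and refresh its status with GetFaceDetection. The DynamoDB table name and schema here are assumptions.

```python
# Sketch: record every StartFaceDetection JobId in a DynamoDB table and poll
# GetFaceDetection for its status, since Rekognition has no "list jobs" call.
# The table name "rekognition_jobs" and its key schema are assumptions.
import time

import boto3

rekognition = boto3.client("rekognition")
table = boto3.resource("dynamodb").Table("rekognition_jobs")

def start_job(bucket: str, key: str) -> str:
    resp = rekognition.start_face_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = resp["JobId"]
    table.put_item(Item={"job_id": job_id, "status": "IN_PROGRESS",
                         "started_at": int(time.time())})
    return job_id

def refresh_status(job_id: str) -> str:
    status = rekognition.get_face_detection(JobId=job_id)["JobStatus"]
    table.update_item(
        Key={"job_id": job_id},
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},  # "status" is reserved
        ExpressionAttributeValues={":s": status},
    )
    return status
```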
I want to create an ingestion/aggregation flow on Google Cloud using Dataproc, where once a day/hour I want a Spark job to run on the data collected up to that point.
Is there any way to schedule the Spark jobs? Or to make this trigger-based, e.g. when new data arrives in the flow?
Dataproc Workflow + Cloud Scheduler might be a solution for you. It supports exactly what you described, e.g. running a flow of jobs on a daily basis.
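As a rough sketch, Cloud Scheduler can trigger a small Cloud Function like the one below, which instantiates an existing Dataproc workflow template (containing your Spark job) via the Python client; the template name and region are placeholders.

```python
# Sketch: instantiate an existing Dataproc workflow template from code; Cloud
# Scheduler can invoke this on a daily/hourly cron, e.g. via a Cloud Function.
# Project, region, and template name are placeholders.
from google.cloud import dataproc_v1

PROJECT = "my-project"
REGION = "us-central1"
TEMPLATE = "daily-aggregation"  # assumption: template created beforehand

def run_workflow(event=None, context=None):
    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    name = f"projects/{PROJECT}/regions/{REGION}/workflowTemplates/{TEMPLATE}"
    operation = client.instantiate_workflow_template(name=name)
    # Optionally block until the workflow finishes (may exceed short function
    # timeouts for long-running Spark jobs).
    operation.result()
```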
I'm new to BigQuery and need to do some tests on it. Looking through the BigQuery documentation, I can't find anything about creating jobs and scheduling them.
I found on another page on the internet that the only available method is to create a bucket in Google Cloud Storage and a function in Cloud Functions using JavaScript, with the SQL query written inside its body.
Can someone help me here? Is that true?
Your question is a bit confusing as you mix scheduling jobs with defining a query in a cloud function.
There is a difference in scheduling jobs vs scheduling queries.
BigQuery offers Scheduled queries. See docs here.
BigQuery Data Transfer Service (schedules recurring data loads from GCS). See docs here.
If you want to schedule jobs (load, delete, copy jobs, etc.), you are better off doing this with a trigger on the observed resource, such as a new Cloud Storage file, a Pub/Sub message, or an HTTP trigger, all wired into a Cloud Function.
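For example, here is a hedged sketch of a Cloud Function fired by a new Cloud Storage file that starts a BigQuery load job; the destination table and the CSV settings are placeholders.

```python
# Sketch: a Cloud Function fired on a Cloud Storage "finalize" event that loads
# the new file into BigQuery. The destination table and CSV assumptions are
# placeholders.
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.raw_events"  # assumption

def load_new_file(event, context):
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()  # wait for the load job to complete
```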
Some other related blog posts:
How to schedule a BigQuery ETL job with Dataprep
Scheduling BigQuery Jobs: This time using Cloud Storage & Cloud Functions
I'm planning a project whereby I'd be hitting the (rate-limited) Reddit API and storing data in GCS and BigQuery. Initially, Cloud Functions would be the choice, but I'd have to create a Datastore implementation to manage the "pseudo" queue of requests and use GAE for cron jobs.
Doing everything in Dataflow wouldn't make sense, because it's not advised to make external requests (i.e. hit the Reddit API) from it or to run a single job perpetually.
Could I use Cloud Composer to read fields from a Google Sheet, create a queue of requests based on the sheet, have a task queue execute those requests, store the results in GCS, and load them into BigQuery?
Sounds like a legitimate use case for Composer. Additionally, you could leverage the pool concept in Airflow to manage concurrent calls to the same endpoint (e.g., the Reddit API).
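As a rough sketch, assuming Airflow 2-style imports and a pool named reddit_api that you've created beforehand (via the UI or CLI) with the desired number of slots, the API-calling tasks just reference the pool:

```python
# Sketch: an Airflow DAG where the API-calling tasks are assigned to a pool, so
# Airflow caps how many of them run concurrently against the Reddit API. The
# pool "reddit_api" is assumed to exist with the desired slot count.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_subreddit(subreddit: str, **_):
    # Placeholder for the actual rate-limited Reddit API call.
    print(f"fetching {subreddit}")

with DAG(
    dag_id="reddit_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for sub in ["python", "dataengineering", "bigquery"]:
        PythonOperator(
            task_id=f"fetch_{sub}",
            python_callable=fetch_subreddit,
            op_kwargs={"subreddit": sub},
            pool="reddit_api",  # concurrency limited by the pool's slots
        )
```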