Developing a simple cloud data processing system

Developing a simple cloud data processing system - amazon-web-services

What would be the most simple way to take a public data API, for example, and schedule a daily job to calculate a set of statistics and land the computed statistics in a cloud database?

What about using CloudWatch Event rule with Schedule Expression and a target of a Lambda function?
The event rule would trigger your Lambda function, e.g., once a day. The function would then call the API, process the data from the API, and write the results into a DynamoDB or RDS database, depending whether you require relational or non-SQL database.

Same on GCP. You can use Cloud Scheduler for the periodic trigger, and call a Cloud Functions that perform your statistics.
You can use Firestore for storing your data in document format.
The free tiers of each product are huge and if your processing is simple and not running full time, you should pay nothing.

Related

Trigger an AWS lambda when a table is created in BigQuery

Our Google Analytics data events are exported to BigQuery tables. I have reports that need to run when the events data arrives which are set up as AWS lambdas with python code (for various reasons and I can't immediately move these to be Google Cloud Functions etc).
Is it possible to have the creation of a table trigger a lambda? At present, I have a lambda periodically checking to see if the table has been created which seems suboptimal. Eventarc looks like it might possibly be the way to monitor for the creation event at the BigQuery end but it doesn't seem obvious how you'd interface with AWS.
Any genius ideas? I have dug repeatedly through StackOverflow, but can't see a match for this issue

Eventarc isn't magic, it's only a wrapper of different things that you can do and customize (with a custom destination and not a Cloud Run).
Typically, Eventarc do:
Create a Cloud Logging sink on a specific log filter (filter what you want to get custom events)
Sink the filtered log entries in PubSub topic
Create a PubSub push subscription that invoke Cloud Run HTTP endpoint.
You can create piece by piece all those steps. And in the latest one, invoke your AWS Lambda instead of Cloud Run.
But the difficulty is not here. The difficulty comes from the variety of table creation possibilities:
By API call (table creation API)
By Load Job (load a file into a table create it automatically but without invoking the table creation API)
Directly in SQL with CREATE TABLE statement (but you can have also this statement in a script, you can have dynamic SQL,...)
And you might want to capture also the other creations (views, materialized views, procedure, functions,....)
At the end, your current method (invoque periodically the schema metadata info and get the recent addition in a dataset) could be the most "effortless" efficient!

Can I query Logs Explorer directly from a cloud function?

For a monitoring project I'm using Logs Router to send log data to BigQuery Table so I can then query the BigQuery table from cloud functions. Would it be possible to directly query Log Explorer from Cloud Functions? (i.e not having to replicate my logs to BigQuery?)
Thanks

Yes, of course you can. You even have client libraries for that. However, keep in mind that, by default, your logs are kept only 30 days. It could be enough, or not, depending on your use case.
You can create custom log bucket, with a different retention period, or sing the logs in BigQuery.
The main advantage of BigQuery if the capacity to join the logs data with other data in BigQuery, to perform powerful analytics computation. But still depends on your use case.

Is AWS Lambda preferred over AWS Glue Job?

In AWS Glue job, we can write some script and execute the script via job.
In AWS Lambda too, we can write the same script and execute the same logic provided in above job.
So, my query is not whats the difference between AWS Glue Job vs AWS Lambda, BUT iam trying to undestand when AWS Glue job should be preferred over AWS Lambda, especially while when both does the same job? If both does the same job, then ideally I would blindly prefer using AWS Lambda itself, right?
Please try to understand my query..

Additional points:
Per this source and Lambda FAQ and Glue FAQ
Lambda can use a number of different languages (Node.js, Python, Go, Java, etc.) vs. Glue can only execute jobs using Scala or Python code.
Lambda can execute code from triggers by other services (SQS, Kafka, DynamoDB, Kinesis, CloudWatch, etc.) vs. Glue which can be triggered by lambda events, another Glue jobs, manually or from a schedule.
Lambda runs much faster for smaller tasks vs. Glue jobs which take longer to initialize due to the fact that it's using distributed processing. That being said, Glue leverages its parallel processing to run large workloads faster than Lambda.
Lambda looks to require more complexity/code to integrate into data sources (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.) while Glue can easily integrate with these. However, with the addition of Step Functions, multiple lambda functions can be written and ordered sequentially due reduce complexity and improve modularity where each function could integrate into a aws service (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.)
Glue looks to have a number of additional components, such as Data Catalog which is a central metadata repository to view your data, a flexible scheduler that handles dependency resolution/job monitoring/retries, AWS Glue DataBrew for cleaning and normalizing data with a visual interface, AWS Glue Elastic Views for combining and replicating data across multiple data stores, AWS Glue Schema Registry to validate streaming data schema.
There are other examples I am missing, so feel free to comment and I can update.

Lambda has a lifetime of fifteen minutes. It can be used to trigger a glue job as an event based activity. That is, when a file lands in S3 for example, we can have an event trigger which can run a glue job. Glue is a managed services for all data processing.
If the data is very low maybe you can do it in lambda, but for some reason the process goes beyond fifteen minutes, then data processing would fail.

The answer to this can involve some foundational design decisions. What is this job doing? What kind of data are you dealing with? Is there a decision to be made whether the task should be executed in a batch or event oriented paradigm?
Batch
This may be necessary or desirable because the task:
Is being done over large monolithic data (e.g., binary).
Relies on context of multiple records in a dataset such that they must be loaded into a single job.
Order matters.
I feel like just as often I see batch handling chosen by default because "this is the way we've always done it" but breaking from this approach could be worth consideration.
Glue is built for batch operations. With a current maximum execution time of 15 minutes and maximum memory of 10gb, Lambda has become capable of processing fairly large datasets in a single execution, as well. It can be difficult to pin down a direct cost comparison without specifics of the workload. When it comes to development, I feel that Lambda has the edge as far as tooling to build, test, deploy.
Event
In the case where your data consists of a set of records, it might behoove you to parse and "stream" them into Lambda. Consider a flow like:
CSV lands in S3.
S3 event triggers Lambda.
Lambda reads and parses CSV into discrete events, submits to another Lambda or publishes to SNS for downstream processing. Concurrent instances of this Lambda can be employed to speed up ingest, where each instance is responsible for certain lines of the S3 object.
This pushes all logic and error handling, as well as resources required, to the level of individual event/record level. Often mechanisms such as dead-letter queues are employed for remediation. While context of a given container persists across invocations - assuming the container has not been idle and torn down - Lambda should generally be considered stateless such that the processing of an event/record is thought of as occurring within its own scope, outside that of others in the dataset.

Create Jobs and schedule them on Bigquery

I'm new to bigquery and need to do some tests on it. Looking through bigquery documentation, i can't find nothing about creating jobs and scheduling them.
I found in other page on internet that the only available method is creating a bucket in google cloud storage and create a function in cloud functions using javascript, and inside it's body write down the sql query.
Can someone help me here? Is it true?

Your question is a bit confusing as you mix scheduling jobs with defining a query in a cloud function.
There is a difference in scheduling jobs vs scheduling queries.
BigQuery offers Scheduled queries. See docs here.
BigQuery Data Transfer Service (schedule recurring data loads from GCS.) See docs here.
If you want to schedule jobs for (load, delete, copy jobs etc) you better do this with a trigger on the observed resource like Cloud Storage new file, a Pub/Sub message, a HTTP trigger all this wired in a Cloud Function.
Some other related blog posts:
How to schedule a BigQuery ETL job with Dataprep
Scheduling BigQuery Jobs: This time using Cloud Storage & Cloud Functions

Rate limited API requests in Cloud Composer

I'm planning a project whereby I'd be hitting the (rate-limited) Reddit API and storing data in GCS and BigQuery. Initially, Cloud Functions would be the choice, but I'd have to create a Datastore implementation to manage the "pseudo" queue of requests and GAE for cron jobs.
Doing everything in Dataflow wouldn't make sense because it's not advised the make external requests (i.e. hitting the Reddit API) and perpetually running a single job.
Could I use Cloud Composer to read fields from a Google Sheet, then create a queue of requests based on the Google Sheet, then have a task queue execute those requests, store them in GCS and load into BigQuery?

Sounds like a legitimate use case for Composer, additionally you could also leverage the pool concept in Airflow to manage concurrent calls to the same endpoint (e.g., Reddit API).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js