Which Google Cloud function is preferable to fetch data from external API into GCP? - google-cloud-platform

This should be a very easy question, but I can't wrap my head around what to use. I would like to create a data pipeline that fetches data from an outside/external API (for example, the Spotify API), performs some fairly simple data cleaning on it, and then either writes a JSON file to Cloud Storage or loads the data into BigQuery.
As far as I understand, I can use Composer for this, using DAGs etc., but what I need here is something simpler and more lightweight (mainly UI based) that doesn't cost as much as Composer and is easier to use. What I am looking for is something like Data Factory in Azure.
So, in brief:
Log in to a data source using a username/password
Extract data in a well-known format (CSV/JSON)
Transform the data, e.g. remove columns or apply simple filtering such as date filtering
Reformat the data into another format (JSON/CSV/BigQuery)
...without having to code everything from scratch.
Can I handle all of this with one GCP application or do I need to use combinations like Cloud Scheduler, Cloud Functions etc?

As always, you have several options...
Cloud Scheduler seems to be a requirement to trigger the process regularly (down to every minute).
Then, you have 2 options:
Code the process: call the API, transform/clean the data, and sink the data into the destination
Use Cloud Workflows: you can define the API calls that you want to make:
Call the API
Store the raw data in BigQuery (also an API call; there are connectors to simplify the process)
Run a query in BigQuery to clean/format your data and store it in a final table (also an API call)
You can also mix the two: use a Cloud Function to get the data, then clean/format the data with a query in BigQuery.
Doing something that specific without starting from scratch is... difficult.
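If you do end up coding the process (option 1, or the Cloud Functions part of the mix above), a minimal sketch in Python could look like the following. The API URL, BigQuery table name, and field names are placeholders, and it assumes requests and google-cloud-bigquery are listed in the function's requirements.txt:
# Hedged sketch: fetch data from an external API, do light cleaning,
# and stream it into BigQuery. All names below (URL, table, fields)
# are placeholders, not values from the original question.
import requests
from google.cloud import bigquery

API_URL = "https://api.example.com/tracks"        # placeholder external API
TABLE_ID = "my-project.my_dataset.raw_tracks"     # placeholder BigQuery table

def fetch_and_load(request):
    """HTTP-triggered Cloud Function entry point (called by Cloud Scheduler)."""
    records = requests.get(API_URL, timeout=30).json()

    # Simple cleaning: keep only the columns we care about.
    cleaned = [
        {"id": r.get("id"), "name": r.get("name"), "added_at": r.get("added_at")}
        for r in records
    ]

    client = bigquery.Client()
    errors = client.insert_rows_json(TABLE_ID, cleaned)  # streaming insert
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
    return f"Loaded {len(cleaned)} rows", 200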
EDIT 1
If you have a look at the documentation, you can see this sample:
- getCurrentTime:
    call: http.get
    args:
        url: https://us-central1-workflowsample.cloudfunctions.net/datetime
    result: currentTime
- readWikipedia:
    call: http.get
    args:
        url: https://en.wikipedia.org/w/api.php
        query:
            action: opensearch
            search: ${currentTime.body.dayOfTheWeek}
    result: wikiResult
- returnResult:
    return: ${wikiResult.body[1]}
The first step, getCurrentTime, performs an external call and stores the result in currentTime (via result: currentTime).
In the next step, you can reuse the result currentTime and extract only the value that you want for another API call.
And you can chain steps like that.
If you need authentication, you can call Secret Manager to get the secret values and then reuse that call's result in subsequent steps.
For an easier connection to Google APIs, you can use connectors.
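If you code the process in a Cloud Function instead of Workflows, a comparable Secret Manager lookup in Python might look like this hedged sketch (the project ID and secret name are placeholders):
# Hedged sketch: read an API credential from Secret Manager inside a
# Cloud Function. The project ID and secret name are placeholders.
from google.cloud import secretmanager

def get_api_token(project_id="my-project", secret_id="spotify-api-token"):
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(name=name)
    return response.payload.data.decode("utf-8")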

Related

Can I load data directly from a S3 Bucket for detecting key phrases in the AWS SDK for Java?

I want to perform Key Phrase detection using AWS Comprehend.
Is there any way to load data directly from an S3 URI instead of manually loading data from S3 and passing it to the SDK?
Yes.
For Amazon Comprehend, there are usually 3 ways to do the same action:
Synchronous action for one document e.g. DetectKeyPhrases
Synchronous action for multiple documents e.g. BatchDetectKeyPhrases
Asynchronous action for multiple documents e.g. StartKeyPhrasesDetectionJob
Most, if not all, of the time the synchronous actions take in Text or TextList directly, and the asynchronous operations allow you to specify an S3 URI.
For detecting key phrases, this would be StartKeyPhrasesDetectionJob, which takes an S3Uri for the input data as well as for the output data.
All of these operations are available in the AWS SDK for Java v2, so feel free to refer to the SDK documentation to get started.
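For illustration, here is a hedged sketch of that call using boto3 (the Python SDK); the Java v2 SDK exposes the same operation with the same parameters. The S3 paths and IAM role ARN are placeholders:
# Hedged sketch: start an asynchronous key phrases detection job that reads
# from and writes to S3. Bucket paths and the IAM role ARN are placeholders.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

response = comprehend.start_key_phrases_detection_job(
    InputDataConfig={
        "S3Uri": "s3://my-input-bucket/documents/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://my-output-bucket/results/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3Access",
    LanguageCode="en",
    JobName="key-phrases-demo",
)
print(response["JobId"], response["JobStatus"])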

Need recommendation to create an API by aggregating data from multiple source APIs

Before I start doing this I wanted to get advice from the community on the best and most efficient manner to go about doing it.
Here is what I want to do:
Ingest data from multiple APIs which return JSON
Store it in either S3 or DynamoDB
Modify the data to use my JSON structure
Pipe out the aggregate data as an API
The data will be updated twice a day, so I would pull in the data from the source APIs and put it through my pipeline twice a day.
So basically I want to create an API by aggregating data from multiple source APIs.
I've started playing with Lambda and created the following function using Python.
#https://stackoverflow.com/a/41765656
import requests
import json

def lambda_handler(event, context):
    #https://www.nylas.com/blog/use-python-requests-module-rest-apis/ USEFUL!!!
    #https://stackoverflow.com/a/65896274
    response = requests.get("https://remoteok.com/api")
    #print(response.json())
    return {
        'statusCode': 200,
        'body': response.json()
    }
#https://stackoverflow.com/questions/63733410/using-lambda-to-add-json-to-dynamodb DYNAMODB
This works and returns a JSON response.
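One possible next step, sketched here with a placeholder bucket and key, would be persisting that raw response to S3 from the same Lambda:
# Hedged sketch: persist the fetched JSON to S3. The bucket name and
# object key are placeholders.
import json
import boto3

s3 = boto3.client("s3")

def save_raw_response(payload):
    s3.put_object(
        Bucket="my-aggregator-raw-data",      # placeholder bucket
        Key="remoteok/latest.json",           # placeholder key
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )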
Here are my questions:
Should I store the data on S3 or DynamoDB?
Which AWS service should I use to aggregate the data into my JSON structure?
Which service should I use to publish the aggregate data as an API, API Gateway?
However, before I go further I would like to know what is the best way to go about doing this.
If you have experience with this I would love to hear from you.
The answer will vary depending on the quantity of data you're planning to mine. Lambdas are designed for short-duration, high-frequency workloads and thus might not be suitable.
I would recommend looking into AWS Glue, as this seems like a fairly typical ETL (Extract, Transform, Load) problem. You can set up Glue jobs to run on a schedule, and as for data aggregation, that's the T in ETL.
It's simple to output the Glue dataframe (the result of a transformation) as S3 files, which can then be queried directly by Amazon Athena (as if they were database tables).
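As a hedged sketch of that read-transform-write flow (the S3 paths and field mappings below are placeholders, not details from the question):
# Hedged sketch of a Glue ETL job: read raw JSON from S3, reshape a few
# fields, and write the result back to S3. Paths and mappings are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-raw-bucket/source-apis/"]},
    format="json",
)

# Keep/rename only the fields needed for the aggregate JSON structure.
shaped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("position", "string", "title", "string"),
        ("company", "string", "company", "string"),
    ],
)

glue_context.write_dynamic_frame.from_options(
    frame=shaped,
    connection_type="s3",
    connection_options={"path": "s3://my-aggregate-bucket/jobs/"},
    format="json",
)
job.commit()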
As for exposing that data via an API, the serverless framework or SST are great tools for taking the sting out of spinning up a serverless API and associated resources.

Get all items in DynamoDB with API Gateway's Mapping Template

Is there a simple way to retrieve all items from a DynamoDB table using a mapping template in an API Gateway endpoint? I usually use a Lambda to process the data before returning it, but this is such a simple task that a Lambda seems like overkill.
I have a table that contains data with the following format:
roleAttributeName    roleHierarchyLevel    roleIsActive    roleName
"admin"              99                    true            "Admin"
"director"           90                    true            "Director"
"areaManager"        80                    false           "Area Manager"
I'm happy just getting the data; the representation doesn't matter, as I can transform it further down in my code.
I've been looking around, but all the tutorials explain how to get specific bits of data through queries and params like roles/{roleAttributeName}, whereas I just want to hit roles/ and get all items.
All you need to do is:
create a resource (without curly braces, since we don't need a particular item)
create a GET method
use Scan instead of Query as the Action when configuring the integration request.
Configure the integration request as follows (the original answer showed a screenshot of these console settings).
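For reference, the integration points at the DynamoDB service with Scan as the Action, and the application/json request mapping template can be as small as the following (assuming the table is named roles):
{
    "TableName": "roles"
}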
Now use the Test feature and you should get the response.
To try it out in Postman, deploy the API first, then use the provided invoke URL followed by your resource name.
API Gateway allows you to Proxy DynamoDB as a service. Here you have an interesting tutorial on how to do it (you can ignore the part related to index to make it work).
To retrieve all the items from a table, you can use Scan as the action in API Gateway. Keep in mind that DynamoDB limits result sizes to 1 MB for both Scan and Query actions.
You can also cap your query yourself, before that limit is applied automatically, by using the Limit parameter.
AWS DynamoDB Scan Reference

ELK stack (Elasticsearch, Logstash, Kibana) - is logstash a necessary component?

We're currently processing daily mobile app log data with AWS lambda and posting it into redshift. The lambda structures the data but it is essentially raw. The next step is to do some actual processing of the log data into sessions etc, for reporting purposes. The final step is to have something do feature engineering, and then use the data for model training.
The steps are
Structure the raw data for storage
Sessionize the data for reporting
Feature engineering for modeling
For step 2, I am looking at using QuickSight and/or Kibana to create reporting dashboards. But the typical stack as I understand it is to do the log processing with Logstash, then have it go to Elasticsearch, and finally to Kibana/QuickSight. Since we're already handling the initial log processing through Lambda, is it possible to skip this step and pass it directly into Elasticsearch? If so, where does this happen: in the Lambda function, or from Redshift after it has been stored in a table? Or can Elasticsearch just read it from the same S3 bucket where I'm posting the data for ingestion into a Redshift table?
Elasticsearch uses JSON to perform all operations. For example, to add a document to an index, you use a PUT operation (copied from docs):
PUT twitter/_doc/1
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
Logstash exists to collect log messages, transform them into JSON, and make these PUT requests. However, anything that produces correctly-formatted JSON and can perform an HTTP PUT will work. If you already invoke Lambdas to transform your S3 content, then you should be able to adapt them to write JSON to Elasticsearch. I'd use separate Lambdas for Redshift and Elasticsearch, simply to improve manageability.
Performance tip: you're probably processing lots of records at a time, in which case the bulk API will be more efficient than individual PUTs. However, there is a limit on the size of a request, so you'll need to batch your input.
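A hedged sketch of what such a bulk request from a Lambda might look like using the requests library (the endpoint and index name are placeholders; authentication is covered in the next paragraph):
# Hedged sketch: send a batch of log records to Elasticsearch via the bulk API.
# The endpoint and index name are placeholders; authentication is omitted here.
import json
import requests

ES_ENDPOINT = "https://my-es-domain.example.com"    # placeholder endpoint
INDEX = "app-logs"                                  # placeholder index

def bulk_index(records):
    # The bulk API expects newline-delimited JSON: an action line, then the document.
    lines = []
    for doc in records:
        lines.append(json.dumps({"index": {"_index": INDEX}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"                  # trailing newline is required

    resp = requests.post(
        f"{ES_ENDPOINT}/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()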
Also: you don't say whether you're using an AWS Elasticsearch cluster or a self-managed one. If the former, you'll also have to deal with authenticated requests or use an IP-based access policy on the cluster. You don't say what language your Lambdas are written in, but if it's Python, you can use the aws-requests-auth library to make authenticated requests.

Is there an api to send notifications based on job outputs?

I know there are APIs to configure notifications when a job fails or finishes.
But what if, say, I run a Hive query that counts the number of rows in a table, and I want to send emails to the concerned parties if the returned result is zero? How can I do that?
Thanks.
You may want to look at Airflow and Qubole's operator for Airflow. We use Airflow to orchestrate all jobs run with Qubole, and in some cases non-Qubole environments. We use the DataDog API to report the success/failure of each task (Qubole or non-Qubole). DataDog in this case can be replaced by Airflow's email operator. Airflow also has chat operators (like Slack).
There is no direct API for triggering a notification based on the results of a query.
However, there is a way to do this using Qubole:
- Create a workflow in Qubole with the following steps:
1. Your query (any query), writing its output to a particular location on S3.
2. A shell script that reads the result from S3 and fails the job based on any criteria; in your case, fail the job if the result returns 0 rows (a sketch of such a check follows at the end of this answer).
- Schedule this workflow using the Scheduler API so that it notifies you on failure.
You can also use the sendmail shell command to send mail based on the results of step 2 above.
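As a hedged sketch of the step-2 check, written in Python rather than shell (the bucket and key are placeholders):
# Hedged sketch: read the query result from S3 and fail the step when it
# contains zero rows. The bucket name and key are placeholders.
import sys
import boto3

s3 = boto3.client("s3")

obj = s3.get_object(Bucket="my-query-results", Key="row_count/000000_0")
lines = [l for l in obj["Body"].read().decode("utf-8").splitlines() if l.strip()]

if not lines or lines[0].strip() == "0":
    print("Row count is zero - failing the job so the scheduler sends a notification")
    sys.exit(1)   # a non-zero exit code marks the workflow step as failed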