POST Request to REST API with Apache Beam

I have the use case that we're pulling messages from PubSub, and then the idea is to POST those messages to the REST API of PowerBI. We want to create a Live Report using the PushDatasets feature.
The main idea should be something like this:
PubSub -> Apache Beam -> POST REST API -> PowerBI Dashboard
I haven't found any implementation of a POST request inside an Apache Beam job (the runner is not an issue right now), only a GET request inside a DoFn. I don't even know if this is possible.
Has anyone done something like this? Or is there another framework/tool that might be more helpful?
Thanks.

Sending POST requests to an external API is certainly possible, but it requires some care. It could be as simple as making the POST inside the body of a DoFn, but be aware that this can lead to duplicates, since elements in your pipeline are processed in bundles and the Beam model allows entire bundles to be reprocessed in case of worker failures, exceptions, etc.
There is some advice in the Beam docs on grouping elements for efficient external service calls.
Choosing the best course of action here largely depends on the details of the API you're calling. Does it take message IDs that can be used for deduplication on the PowerBI side? Can the API accept batches of messages? Is there rate limiting?
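For illustration, here is a minimal sketch of that DoFn approach using the Beam Python SDK. The push-dataset URL, the batch size of 100, and the payload shape are assumptions you would need to adapt to your dataset:

# Sketch only: buffer elements per bundle and POST them to a PowerBI
# push-dataset endpoint. URL, batch size and payload shape are assumptions.
import json
import apache_beam as beam
import requests
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder push-dataset URL with an embedded push key.
POWERBI_PUSH_URL = "https://api.powerbi.com/beta/<workspace>/datasets/<dataset>/rows?key=<push_key>"

class PostToPowerBI(beam.DoFn):
    def start_bundle(self):
        self._rows = []

    def process(self, element):
        # element is assumed to be a dict parsed from the Pub/Sub message.
        self._rows.append(element)
        if len(self._rows) >= 100:   # flush in small batches
            self._flush()

    def finish_bundle(self):
        if self._rows:
            self._flush()

    def _flush(self):
        resp = requests.post(POWERBI_PUSH_URL, json={"rows": self._rows}, timeout=10)
        resp.raise_for_status()      # a failure makes Beam retry the whole bundle
        self._rows = []

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | beam.io.ReadFromPubSub(subscription="projects/<project>/subscriptions/<sub>")
     | beam.Map(json.loads)
     | beam.ParDo(PostToPowerBI()))

Note that because raise_for_status() makes the bundle fail and be retried, the same rows may be POSTed more than once; if the API accepts identifiers, deduplicating on the PowerBI side is the usual mitigation.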

Related

How do you ensure a message is processed only once with Google Cloud Pub/Sub?

I am currently working on a distributed crawling service. While building it, I have run into a few issues that need to be addressed.
First, let's explain how the crawler works and the problems that need to be solved.
The crawler needs to save all posts on each and every bulletin board on a particular site.
To do this, it automatically discovers crawling targets and publishes several messages to Pub/Sub. A message looks like this:
{
  "boardName": "test",
  "targetDate": "2020-01-05"
}
When the corresponding message is published, a Cloud Run function is triggered, and the data corresponding to the given JSON is crawled.
However, if a duplicate of the same message is published, duplicate data is produced because the same data is crawled again. How can I ignore subsequent messages when the same message comes in?
Also, are there Pub/Sub or other features I can refer to for a stable implementation of a distributed crawler?
Because Pub/Sub is, by default, designed for at-least-once delivery, it's better to have idempotent processing. (Exactly-once delivery is coming.)
Anyway, your issue is essentially the same either way: the same message delivered twice, or two different messages with the same content, will cause the same problem. There is no magic feature in Pub/Sub for that. You need an external tool, like a database, to store which messages have already been received.
Firestore/Datastore is a good, serverless place for that. If you need low latency, Memorystore and its in-memory database is the fastest.
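To make that concrete, here is a minimal sketch in Python, assuming a Firestore collection called processed_messages and a crawl() function standing in for your crawler logic (both names are made up):

# Sketch: skip the crawl if an identical message was already processed.
# "processed_messages" and crawl() are illustrative names.
import hashlib
import json
from google.api_core.exceptions import AlreadyExists
from google.cloud import firestore

db = firestore.Client()

def handle_message(message: dict) -> None:
    # Key on the message content, so two different Pub/Sub messages with
    # the same payload count as duplicates too.
    key = hashlib.sha256(json.dumps(message, sort_keys=True).encode()).hexdigest()
    doc_ref = db.collection("processed_messages").document(key)

    try:
        # create() fails if the document exists: an atomic "first writer wins".
        doc_ref.create({"message": message, "status": "processing"})
    except AlreadyExists:
        print("Duplicate message, skipping:", message)
        return

    crawl(message["boardName"], message["targetDate"])  # your existing crawler
    doc_ref.update({"status": "done"})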

AWS Lambda best practices for Real Time Tracking

We currently run an AWS Lambda function that simply redirects the user to a different URL. The function is invoked via API Gateway.
For tracking purposes, we would like to create a widget on our dashboard that provides real-time insights into how many redirects are performed each second. The creation of the widget itself is not the problem.
My main question is which AWS service is best suited for telling our other services that an invocation took place. We plan to record the invocation in our database.
Some additional things:
low latency (< 5 seconds) so that the data is real-time
almost no added wait time for the user; we aim to redirect the user as fast as possible
Many thanks in advance!
Best Regards
Martin
I understand that your goal is to simply persist the information that an invocation happened somewhere with minimal impact on the response time of the Lambda.
For that purpose I'd probably use an SQS standard queue and just send a message to the queue that the invocation happened.
You can then have an asynchronous process (Lambda, Docker, EC2) process the messages from the queue and update your Dashboard.
Depending on the scalability requirements, looking into Kinesis Data Analytics might also be worth it.
It's a fully managed streaming data solution and the analytics part allows you to do sliding window analyses using SQL on data in the Stream.
In that case you'd write the info that something happened to the stream, which also has a low latency.
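As a rough sketch of the SQS variant in Python, assuming QUEUE_URL and REDIRECT_URL as environment variables and an HTTP API v2 event shape:

# Sketch: enqueue a tracking message, then redirect right away.
import json
import os
import time
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]        # assumed configuration
REDIRECT_URL = os.environ["REDIRECT_URL"]  # assumed configuration

def handler(event, context):
    # Fire-and-forget tracking message; a separate consumer (Lambda,
    # container, EC2) drains the queue and updates the dashboard/database.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "timestamp": time.time(),
            "path": event.get("rawPath", "/"),
        }),
    )
    # API Gateway proxy response that performs the redirect.
    return {"statusCode": 302, "headers": {"Location": REDIRECT_URL}}

The user-facing redirect is then delayed only by the single queue write.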

Is there any way that I can read from BigQuery using a Dialogflow chatbot?

I want to implement a function that displays data in the Dialogflow chatbot, retrieved from BigQuery using a SELECT statement. Is this possible? Kindly help.
Based on what was mentioned in the comments section, Dialogflow Fulfillment seems to be exactly what you need here. When a user types an expression, Dialogflow matches the intent and sends a webhook request to the configured fulfillment; the webhook service then performs the intended action, such as calling API services or running some other business logic.
Integrating Dialogflow with BigQuery likewise requires fulfillment code, i.e. a GCP Cloud Function that handles communication with the BigQuery API service. That said, you can use the built-in Inline Editor to write your fulfillment function; however, it accepts no programming language other than Node.js.
Regarding the implementation, I think you can follow the codelabs tutorial, which walks through the general workflow in detail, assuming you can inject your own code in place of the addToBigQuery() function in Index.js from the example. For this purpose you can visit the nodejs-bigquery GitHub repository, which contains many useful code samples, in particular the generic query() function that might match your initial aim.
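For reference, a minimal sketch of such a webhook, written here as a standalone Python Cloud Function rather than the Node.js inline editor; the table, field, and parameter names are invented for the example:

# Sketch: Dialogflow webhook that answers an intent with a BigQuery result.
# Dataset/table/parameter names are illustrative.
from google.cloud import bigquery

bq = bigquery.Client()

def dialogflow_webhook(request):
    req = request.get_json(silent=True) or {}
    params = req.get("queryResult", {}).get("parameters", {})

    query = """
        SELECT status
        FROM `my_project.support.tickets`      -- hypothetical table
        WHERE ticket_id = @ticket_id
        LIMIT 1
    """
    job = bq.query(
        query,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter(
                    "ticket_id", "STRING", params.get("ticket_id", "")),
            ]
        ),
    )
    rows = list(job.result())
    answer = rows[0].status if rows else "I could not find that ticket."

    # Dialogflow reads fulfillmentText and shows it in the chat.
    return {"fulfillmentText": str(answer)}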

Google Tag Manager clickstream to Amazon

So the question has more to do with which services I should be using to get efficient performance.
Context and goal:
What I am trying to do, exactly, is use a Tag Manager custom HTML tag so that after each Universal Analytics tag (event or pageview) fires, an HTTP request is sent to my own EC2 server with a payload similar to what is sent to Google Analytics.
What I have thought about, planned and researched so far:
At this moment I have two big options:
Use AWS Kinesis, which seems like a great idea, but the problem is that it only drops the information into one Redshift table, and I would like to have at least 4 or 5 so I can differentiate pageviews from events, etc. My solution to this would be to split each request, on the server side, into a separate stream.
The other option is to use Spark + Kafka. (Here is a detailed explanation)
I know that at some point this means I am building a parallel Google Analytics, with everything that implies. I still need to decide what information I should send (I am referring to which parameters, for example source and medium), how to format it correctly, and how to process it correctly.
Questions and debate points:
Which option is more efficient and easiest to set up?
Should I send this information directly from the server of the page/app, or send it from the user side, making it do requests as I explained before?
Has anyone done something like this in the past? Any personal recommendations?
You'd definitely benefit from the Google Analytics customTask feature instead of custom HTML. More on this from Simo Ahava. Also, Google BigQuery is quite a popular destination for streaming hit data, since it allows many on-the-fly computations such as sessionization, and there are many ready-to-use cases for BQ.
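To illustrate the BigQuery route, here is a sketch of a tiny collector endpoint that streams each incoming hit into a table; the Flask app, the table name, and the column mapping are all assumptions for the example:

# Sketch: receive a GA-like hit payload and stream it into BigQuery.
# The table and its schema (hit_type, page, ...) are made up.
import datetime
from flask import Flask, request
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()
TABLE_ID = "my_project.analytics.hits"    # hypothetical table

@app.route("/collect", methods=["POST"])
def collect():
    hit = request.form.to_dict()           # GA-style key/value payload
    row = {
        "hit_type": hit.get("t", "pageview"),
        "page": hit.get("dp", ""),
        "source": hit.get("cs", ""),
        "medium": hit.get("cm", ""),
        "ts": datetime.datetime.utcnow().isoformat(),
    }
    errors = bq.insert_rows_json(TABLE_ID, [row])   # streaming insert
    return ("", 500) if errors else ("", 204)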

How to persist a runtime parameter of a service call, then use it as a parameter for the next service call in WSO2 ESB

I am seeking advice on the most appropriate method for the following use case.
I have created a number of services using WSO2 Data Services Server which I want to run periodically, passing in parameters based on the last run date; i.e. each data service has two parameters, start and end dates, to run the SQL against.
I plan to create a service within WSO2 ESB to mediate the execution of these services and combine the results to pass on to another web service. I think I can manage this ;-) I will use a scheduled task to start it at a predefined interval.
Where I am seeking advice is how to keep track of the last successful run time, as I need to use it as a parameter for the data services web services.
My options as I see them
create a config table in my database and create another data services web service to retrieve and persist these values
use the VFS transport and somehow persist these values to a text file as XML, CSV or JSON
use some other way, like property values in the ESB sequence, and somehow persist them
any other??
With my current knowledge it would seem that option 1 is easiest, but it doesn't feel right, as I would need write access to the database, something I wouldn't normally have when architecting a solution like this in the future. Option 2 looks like it could work with my limited knowledge of WSO2 ESB to date, but is option 3 the best choice? As you can see from the detail above, this is where I start to flounder.
Any suggestions would be most welcome
I do not have much experience with the ESB. However, I also feel that your first option would be the easiest to implement.
A related topic was also discussed on the WSO2 architecture mailing list recently, with the subject "[Architecture] Allow ESB to put and update registry properties".
Introducing a registry mediator was discussed there, but I'm not sure it will be implemented soon.
I hope this helps.
As of now there is no direct method to save content to the registry through the ESB, but you can always write a custom mediator to do that, or use the script mediator to achieve it.
The following is the code snippet for the script mediator:
<script language="js"><![CDATA[
    importPackage(Packages.org.apache.synapse.config);
    /* create a new registry resource */
    mc.getConfiguration().getRegistry().newResource("conf:/store/myStore", false);
    /* update the resource with the property value */
    mc.getConfiguration().getRegistry().updateResource(
        "conf:/store/myStore", mc.getProperty("myProperty").toString());
]]></script>
I've written a blog post on how to do this in ESB 4.8.1. You can find it here