How to write a Kafka connector to integrate with the Facebook API?

I am trying to write a Kafka connector to fetch data from Facebook. The problems are:
How do I fetch data from Facebook through their API without exceeding the API hit limit Facebook imposes? The connector should call the Facebook API only after a specific time interval so that the number of hits stays within the limit.
Each user hits the Facebook API with their own access token, so users can't share the same topic partition. How should this scenario be handled? Do we have to create one partition per user?
I read a few guides and blog posts to understand Kafka Connect and how to write a connector.
Confluent- https://docs.confluent.io/current/connect/index.html
Kafka Documentation- https://kafka.apache.org/documentation/#connect
Conceptually, they gave me an idea of what Kafka Connect is, how it works, and which classes are important when writing a connector. But I am still confused about how to actually write and run a connector in practice. I tried to find a step-by-step development guide but couldn't.
Could you suggest any tutorial or PDF with a detailed, step-by-step guide to writing and running a Kafka connector?

The only "official guide" is in those links you have
https://docs.confluent.io/current/connect/devguide.html#developing-a-simple-connector
I personally have no experience with the Facebook API, but I assume it uses REST, so you could start by forking the kafka-connect-rest project. The simplest answer to not exceeding the limit is to not send more requests than you are allowed within a given time period (add a timer to the code that waits between requests).
Also, one connector would only have one set of access keys. How you create the ConnectRecord objects, and thereby ultimately partition the records, is up to you, but I don't think having an access key per user will scale very well. It might make more sense to have one key tied to one application; each user then accepts that the application has access to read certain details from their account.
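To make that concrete, here is a minimal sketch of what a throttled source task could look like. It is not real Graph API code: FacebookPost, fetchPage(), and the 60-second interval are placeholder assumptions. The throttle in poll() addresses the rate-limit question, and keying each record by user id addresses the partitioning question, since Kafka's default partitioner sends records with the same key to the same partition.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.source.SourceRecord;
    import org.apache.kafka.connect.source.SourceTask;

    public class FacebookSourceTask extends SourceTask {

        // Assumption: one call per minute keeps us under the rate limit.
        private static final long POLL_INTERVAL_MS = 60_000L;

        private String topic;
        private long lastPoll = 0L;

        @Override
        public String version() {
            return "0.1";
        }

        @Override
        public void start(Map<String, String> props) {
            topic = props.get("topic");
        }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            // Throttle: wait until the interval has elapsed since the last call.
            long wait = lastPoll + POLL_INTERVAL_MS - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait);
            }
            lastPoll = System.currentTimeMillis();

            List<SourceRecord> records = new ArrayList<>();
            for (FacebookPost post : fetchPage()) {
                records.add(new SourceRecord(
                        Map.of("source", "facebook"),        // source partition
                        Map.of("since", post.createdTime()), // source offset
                        topic,
                        Schema.STRING_SCHEMA, post.userId(), // key by user
                        Schema.STRING_SCHEMA, post.json())); // value
            }
            return records;
        }

        @Override
        public void stop() {
        }

        // Placeholder: a real task would page through the Graph API here.
        private List<FacebookPost> fetchPage() {
            return List.of();
        }

        record FacebookPost(String userId, String createdTime, String json) {
        }
    }

A single task with one application token, as suggested above, keeps the rate limiting in one place; keying by user id then lets Kafka spread users across partitions while keeping each user's records ordered.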


Is pubsub suitable to be used by client desktop applications?

I'm creating a client desktop application and trying to find a reliable way to notify client applications of new data that needs to be queried from the server. Would Pub/Sub be a good fit for this? Most of the documentation I see for it is focused on server-to-server communication, and it's a bit ambiguous whether it would work well for server-to-client notifications.
If it would work, would I be able to properly authenticate subscribers to limit the topics they can subscribe to? This application could potentially be downloaded by anyone, and I need to ensure that information intended for one client can't end up in the hands of another.
Cloud Pub/Sub is not going to be a good choice for this use case. First, note that each topic and each project is limited to 10,000 subscriptions, so if you intend to have more than that, you will run out. Second, a subscription only receives messages published after the subscription was created. If you only need to deliver messages that were published after the user connected, this may be okay. However, with these two issues combined, you'll need to consider the lifetime of your subscriptions. Do they get deleted when a user logs out? If not, when a user comes back, do you expect them to get all of the messages published since their last visit?
Additionally, as discussed in the comments, there is the issue of authentication. Your client-side app would have to have the credentials to subscribe. This would require you to essentially leak those credentials into your client-side code, which could be a vulnerability in your application.
The service designed to deliver notifications of this nature is Firebase Cloud Messaging.
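For illustration, a backend ping through the Firebase Admin SDK for Java might look like the sketch below. The device token is a placeholder: the client app obtains it from the FCM SDK and registers it with your backend, so only that device receives the message, which sidesteps the leaked-credentials problem above.

    import com.google.firebase.FirebaseApp;
    import com.google.firebase.messaging.FirebaseMessaging;
    import com.google.firebase.messaging.Message;

    public class Notifier {
        public static void main(String[] args) throws Exception {
            // Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key.
            FirebaseApp.initializeApp();

            // Placeholder: the real token comes from the client-side FCM SDK.
            String deviceToken = "...";

            Message message = Message.builder()
                    .setToken(deviceToken)
                    // Send a lightweight "go query the server" signal, not the data itself.
                    .putData("event", "new-data-available")
                    .build();

            String id = FirebaseMessaging.getInstance().send(message);
            System.out.println("Sent message: " + id);
        }
    }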
If you want to open the application to anyone on the internet, you can't rely on the IAM service, which only works with Google identities: you can't require every user to have a Google account, as the user experience would be bad.
Thus, you can't use IAM to secure Pub/Sub access, and therefore you can't use Pub/Sub at all, because anyone could access it.
In your use case, the first step is to ask the user to register (create an account, validate their email, maybe add a payment method, ...). Then you have an identity, but one managed by you, not by IAM. You know which messages are for this user and which aren't.
If you want to notify users "in real time", I propose using long polling or streaming to push data to the user. Cloud Run is now capable of this, and I recommend you have a look at it.
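As a minimal sketch of the streaming idea, assuming a plain Java service of the kind you could containerize for Cloud Run, a server-sent-events endpoint using only the JDK's built-in HttpServer could look like this (the auth check is where your own account system would verify the user):

    import com.sun.net.httpserver.HttpServer;

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class SseServer {
        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

            server.createContext("/events", exchange -> {
                // Placeholder: validate the user's own session token here before
                // streaming anything (the "identity managed by you" from above).
                exchange.getResponseHeaders().add("Content-Type", "text/event-stream");
                exchange.sendResponseHeaders(200, 0); // 0 = streamed/chunked body

                try (OutputStream out = exchange.getResponseBody()) {
                    for (int i = 0; i < 5; i++) {
                        // Each server-sent event is "data: <payload>\n\n".
                        out.write(("data: update " + i + "\n\n")
                                .getBytes(StandardCharsets.UTF_8));
                        out.flush();
                        Thread.sleep(1_000);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            server.start();
            System.out.println("Streaming on http://localhost:8080/events");
        }
    }

Pointing curl -N at the endpoint prints one event per second; in a browser, the EventSource API consumes the same format.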

Google Tag Manager clickstream to Amazon

The question has more to do with which services I should use to get efficient performance.
Context and goal:
What I am trying to do is use a Tag Manager custom HTML tag so that after each Universal Analytics tag (event or pageview) fires, it sends my own EC2 server an HTTP request with a payload similar to what is sent to Google Analytics.
What I have thought, planned, and researched so far:
At the moment I have two big options:
Use AWS Kinesis, which seems like a great idea, but the problem is that it only drops the information into one Redshift table, and I would like at least 4 or 5 so I can differentiate pageviews from events, etc. My solution would be to split each request on the server side into a separate stream (see the sketch after this list).
The other option is to use Spark + Kafka. (Here is a detailed explanation.)
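For the Kinesis option, a sketch of that server-side split might look like the following. The stream names and the use of the Measurement Protocol's t (hit type) and cid (client id) parameters are assumptions, not a prescribed design:

    import java.util.Map;

    import software.amazon.awssdk.core.SdkBytes;
    import software.amazon.awssdk.services.kinesis.KinesisClient;
    import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

    public class HitRouter {

        private final KinesisClient kinesis = KinesisClient.create();

        // Assumed stream names: one stream per hit type you want to separate.
        private static final Map<String, String> STREAM_BY_HIT_TYPE = Map.of(
                "pageview", "ga-pageviews",
                "event", "ga-events");

        // Called by the HTTP endpoint on the EC2 server that receives the
        // GA-style payload sent from the custom HTML tag.
        public void route(String hitType, String clientId, String rawPayload) {
            // Unknown hit types fall through to a catch-all stream.
            String stream = STREAM_BY_HIT_TYPE.getOrDefault(hitType, "ga-other");
            kinesis.putRecord(PutRecordRequest.builder()
                    .streamName(stream)
                    .partitionKey(clientId) // keeps one visitor's hits ordered per shard
                    .data(SdkBytes.fromUtf8String(rawPayload))
                    .build());
        }
    }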
I know that at some point this means I'm building a parallel Google Analytics, with everything that implies. I still need to decide what information I should send (I'm referring to which parameters, for example source and medium), how to format it correctly, and how to process it correctly.
Questions and debate points:
Which option is more efficient and easier to set up?
Should I send this information directly from the server of the page/app, or from the user's side, making the browser do the requests as I explained before?
Has anyone done something like this in the past? Any personal recommendations?
You'd definitely benefit from Google Analytics' customTask feature instead of custom HTML. More on this from Simo Ahava. Also, Google BigQuery is quite a popular destination for streaming hit data, since it allows many on-the-fly computations such as sessionization, and there are many ready-to-use cases for BQ.

What is the "proper" way to use DynamoDB for an iOS app?

I've just started messing around with AWS DynamoDB in my iOS app and I have a few questions.
Currently, I have my app communicating directly with my DynamoDB database. I've been reading around lately, and people are saying this isn't the proper way to get data from my database.
By this I mean that I just have a function in my code that queries my DynamoDB database and returns the result.
The way I do it works, but is there a better way I should be going about this?
Amazon DynamoDB itself is a highly scalable service, and standing up another server in front of it requires scaling that service too, in line with the RCU/WCU configured for your tables, which we can and should avoid.
If your mobile application doesn't need a backend server and you can perform all the business functions from the mobile device, then you should probably think about:
Using the AWS DynamoDB SDK for iOS to write your client application that runs on the mobile device.
Using the AWS Token Vending Machine to authenticate your mobile users and grant them credentials for operations on DynamoDB tables (note that Amazon Cognito has since superseded the Token Vending Machine for this).
Controlling access (i.e., which operations are allowed on which tables, etc.) using IAM policies.
HTH.
From what you say, I guess you are talking about a way to distribute data to many clients (iOS apps).
There are a few integration patterns (a very good book on this: Enterprise Integration Patterns), one of which is called shared database. It is essentially about using a common database that multiple clients share. The main drawback of that pattern (in your case) is that every client makes assumptions about what the database schema looks like, which can give you headaches supporting the schema in the future if your business logic changes.
The more advanced approach is to send events on every change in your data instead of writing changes to the database directly from the client apps. This way you can add processing to the events before the data they carry is written to the database. For example, you may want to change the event format in a new version of your app but still support legacy users, so you add a translation step that transforms both types of events into the format that fits the database schema. It's basically a question of working with diffs versus snapshots.
You should be aware of the added complexity of working with events; it can be overkill if your app is simple and schema changes are unlikely.
Also consider that you can preprocess data using DynamoDB Streams, which gives you some of the advantages of events while keeping the implementation simple.
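As a rough sketch of that idea, assuming a Lambda function subscribed to the table's stream (the processing body is a placeholder):

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
    import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

    public class StreamProcessor implements RequestHandler<DynamodbEvent, Void> {

        @Override
        public Void handleRequest(DynamodbEvent event, Context context) {
            for (DynamodbStreamRecord record : event.getRecords()) {
                // eventName is INSERT, MODIFY, or REMOVE.
                context.getLogger().log(record.getEventName() + ": "
                        + record.getDynamodb().getNewImage());
                // Placeholder: validation, enrichment, or fan-out would go here,
                // instead of putting that logic inside every client app.
            }
            return null;
        }
    }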

Facebook Graph API-Account suspension

I have a .NET application that takes a list of names/email addresses and finds their matches on Facebook using the Graph API. During testing, my list had 900 names. I was checking Facebook matches for each name in a loop. The process completed, but when I then opened my Facebook page, I got a message that my account had been suspended due to suspicious activity.
What am I doing wrong here? Doesn't Facebook allow a large number of search requests to their server? 900 doesn't seem like a big number either.
Per the platform policies (https://developers.facebook.com/policy/), this may be a suspected breach of their "Principles" section.
See Policies I.5:
"If you exceed, or plan to exceed, any of the following thresholds please contact us by creating a confidential bug report with the 'threshold policy' tag as you may be subject to additional terms: (>5M MAU) or (>100M API calls per day) or (>50M impressions per day)."
Also IV.5:
"Facebook messaging (i.e., email sent to an @facebook.com address) is designed for communication between users, and not a channel for applications to communicate directly with users."
Then the biggie, V. Enforcement. No surprise: it's both automated and monitored by humans, so they may well be seeing 900+ requests coming from your app.
What I'd recommend doing:
Store what you can client-side (in a cache or data store) so you make fewer calls to the API.
Put logging on your API calls so you, the developer, can see exactly what is happening. You might be surprised at what you find there.
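A hedged sketch of both suggestions together, using only JDK classes; the endpoint URL and the one-second pacing are assumptions to replace with whatever the current Graph API documentation and your real rate limits dictate:

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class GraphLookup {

        private final HttpClient http = HttpClient.newHttpClient();

        // Cache results so repeated lookups for the same name never hit the API twice.
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        public String search(String name, String accessToken) throws Exception {
            String cached = cache.get(name);
            if (cached != null) {
                return cached;
            }

            // Placeholder endpoint: check the current Graph API docs for the
            // real search URL, API version, and required permissions.
            URI uri = URI.create("https://graph.facebook.com/search?q="
                    + URLEncoder.encode(name, StandardCharsets.UTF_8)
                    + "&type=user&access_token=" + accessToken);

            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> response =
                    http.send(request, HttpResponse.BodyHandlers.ofString());

            // Log every call so you can see exactly how many requests you make
            // and what comes back (status codes, rate-limit errors, etc.).
            System.out.printf("GET %s -> %d%n", uri, response.statusCode());

            cache.put(name, response.body());
            Thread.sleep(1_000); // crude pacing so 900 names don't arrive in one burst
            return response.body();
        }
    }

Even with the cache, spreading the 900 lookups over time rather than running one tight loop is probably what keeps the automated enforcement from flagging the account.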

Amazon Mechanical Turk task retrieval API

I'm writing an app whose premise is that people who can't directly fund a charity out of their own pockets could automatically work on Amazon Mechanical Turk HITs in order to bring clean water to the third world. I knew there was an Amazon MTurk API, but I am very unclear on how to retrieve a HIT to be worked on, rather than just consuming some data about tasks I'm trying to complete.
Is there any way to retrieve a HIT to be worked on, through the AWS API or otherwise?
Thanks ahead of time.
As far as I know, the only way to work on a HIT is through the MTurk website, i.e., not via the API.
There is a site trying to do something very similar to what you have described: http://www.sparked.com/