Kafka Connect HDFS - How to make it work?

This is not a very specific question; however, I have not found a single document that explains how you actually use the Kafka HDFS connector.
Basically, I have a Kafka topic containing JSON-encoded strings. I would like to send the data to HDFS as Avro-formatted data.
Any help would be more than welcome!

What specifically are you trying to achieve with the HDFS connector? While the docs could certainly use some work, they do cover the basics of how to configure and run the hdfs-connector. If you could be a little more specific about the goal you are trying to achieve, it will be easier for us to offer you some guidance.
Thanks,
Ryan
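
For what it's worth, the setup the docs describe boils down to a sink-connector properties file. Below is a minimal sketch assuming the Confluent HDFS sink connector; the topic name, namenode URL, and flush size are placeholders. Note that Avro output needs schema-aware records, so plain JSON strings would first need a schema attached (for example, JsonConverter with schemas enabled, or a Schema Registry).

    # hdfs-sink.properties - a minimal sketch; names and URLs are placeholders
    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    # Kafka topic holding the JSON-encoded strings
    topics=my-json-topic
    # HDFS namenode to write to
    hdfs.url=hdfs://namenode:8020
    # Number of records written per file before committing to HDFS
    flush.size=1000
    # Write the output files as Avro
    format.class=io.confluent.connect.hdfs.avro.AvroFormat

With a file like this, the connector can be started via the connect-standalone script, passing a worker properties file followed by this connector properties file.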

Related

GCP dataflow to PostgreSQL connectivity

We are trying to replace Alteryx with GCP Dataflow for an ETL job. The requirement is to read data from a PostgreSQL table, join, apply some formulas, and group by to add missing columns, then write back to a PostgreSQL table for Qlik to consume and generate visualizations.
I am new to Java. Can anyone point me to sample code for a similar use case? That would be really helpful. Thank you.
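A minimal sketch of this kind of read/transform/write pipeline, using Apache Beam's JdbcIO from the Java SDK (which is what Dataflow runs), is below. All connection details and table and column names are hypothetical, and the per-region sum stands in for whatever joins and formulas the real job needs.

    // Sketch: read rows from PostgreSQL, aggregate per key, write results back.
    // Connection details and table/column names are hypothetical.
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.DoubleCoder;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;

    public class PostgresEtl {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        JdbcIO.DataSourceConfiguration db = JdbcIO.DataSourceConfiguration
            .create("org.postgresql.Driver", "jdbc:postgresql://host:5432/mydb")
            .withUsername("user")
            .withPassword("secret");

        p.apply("ReadFromPostgres", JdbcIO.<KV<String, Double>>read()
                .withDataSourceConfiguration(db)
                .withQuery("SELECT region, amount FROM sales")
                .withRowMapper(rs -> KV.of(rs.getString("region"), rs.getDouble("amount")))
                .withCoder(KvCoder.of(StringUtf8Coder.of(), DoubleCoder.of())))
         // Group by key and sum; stands in for the real joins and formulas.
         .apply("TotalPerRegion", Sum.doublesPerKey())
         .apply("WriteBack", JdbcIO.<KV<String, Double>>write()
                .withDataSourceConfiguration(db)
                .withStatement("INSERT INTO region_totals (region, total) VALUES (?, ?)")
                .withPreparedStatementSetter((kv, stmt) -> {
                  stmt.setString(1, kv.getKey());
                  stmt.setDouble(2, kv.getValue());
                }));

        p.run().waitUntilFinish();
      }
    }

To run it on Dataflow rather than locally, pass --runner=DataflowRunner along with the usual project and region options.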

Google Tag Manager clickstream to Amazon

The question is really about which services I should use to get efficient performance.
Context and goal:
What I am trying to do, exactly, is use a Tag Manager custom HTML tag so that, after each Universal Analytics tag (event or pageview) fires, an HTTP request is sent to my own EC2 server with a payload similar to what is sent to Google Analytics.
What I have thought, planned, and researched so far:
At this moment I have two big options:
Use AWS Kinesis, which seems like a great idea, but the problem is that it only drops the information into one Redshift table, and I would like to have at least 4 or 5 so I can differentiate pageviews from events, etc. My solution would be to route each request, on the server side, to a separate stream (see the sketch below).
The other option is to use Spark + Kafka. (Here is a detailed explanation.)
I know that at some point this means I am building a parallel Google Analytics, with everything that implies. I still need to decide what information I should send (I am referring to which parameters, for example the source and medium), how to format it correctly, and how to process it correctly.
Questions and debate points:
Which option is more efficient and easiest to set up?
Should I send this information directly from the page/app's server, or from the user's side, making the browser issue the requests as I explained before?
Has anyone done something like this in the past? Any personal recommendations?
You'd definitely benefit from the Google Analytics customTask feature instead of custom HTML. More on this from Simo Ahava. Also, Google BigQuery is quite a popular destination for streaming hit data, since it allows many 'on the fly' computations such as sessionization, and there are many ready-to-use cases for BQ.
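
As for the server-side routing idea mentioned above, a minimal sketch with the AWS SDK for Java v2 follows. The stream names and the choice of partitioning by client ID are assumptions for illustration, not a prescribed design.

    // Sketch: route each incoming hit to a per-hit-type Kinesis stream.
    // Stream names and the partition-by-clientId choice are hypothetical.
    import software.amazon.awssdk.core.SdkBytes;
    import software.amazon.awssdk.services.kinesis.KinesisClient;
    import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

    public class HitRouter {
      private final KinesisClient kinesis = KinesisClient.create();

      public void route(String hitType, String clientId, String payload) {
        // One stream per hit type, so each can feed its own Redshift table.
        String stream;
        if ("pageview".equals(hitType)) {
          stream = "hits-pageviews";
        } else if ("event".equals(hitType)) {
          stream = "hits-events";
        } else {
          stream = "hits-other";
        }
        kinesis.putRecord(PutRecordRequest.builder()
            .streamName(stream)
            .partitionKey(clientId) // keeps one visitor's hits on the same shard
            .data(SdkBytes.fromUtf8String(payload))
            .build());
      }
    }

Each stream could then have its own delivery (for example via Kinesis Data Firehose) into a separate Redshift table, which would address the one-table limitation without any client-side changes.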

what are the best ways to transfer large streaming data files to cloud

Is it possible to use AWS Kinesis? If possible, how can we use it?
Any suggestions? Please reply. Thanks in advance.
Is it possible to use AWS Kinesis?
Yup. You can pretty much use anything for anything these days. Just need a little bit of creativity, that's all.
If possible, how can we use it?
A good starting point would be the documentation, but it would really depend on what you want to use it for.
what are the best ways to transfer large streaming data files to cloud
That's hard to answer without any details. In the end, you will only be given options and opinions; it'll be your call to figure out what's best. Kinesis is reliable and works for basic use cases, but it can be slower and less flexible than other options. It also costs a pretty penny if you use it unwisely. If you need options, check out Apache Kafka.
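
If Kafka is on the table, producing data into it is only a few lines. A minimal sketch follows; the broker address, topic name, and the idea of streaming a file as string chunks are assumptions for illustration.

    // Sketch: stream pieces of a large file into a Kafka topic.
    // Broker address, topic name, and chunking scheme are hypothetical.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class FileStreamer {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // Keying by file name keeps all chunks of one file ordered on one partition.
          producer.send(new ProducerRecord<>("file-chunks", "file-123", "<chunk as string>"));
        }
      }
    }

A consumer on the cloud side would then reassemble the chunks, or hand them to a sink connector that writes to object storage.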

Amazon Mechanical Turk -- Trouble getting started, AWS documentation of little to no use

I have been combing through the AWS MTurk documentation for hours and it is of little to no help for getting started on MTurk.
I am trying to have people upload a small video based on a set of instructions that I will provide. I am in the requester sandbox and I see no way to integrate anything from the API reference. I am trying to put together a QuestionForm with an AnswerSpecification containing a FileUploadAnswer.
http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_QuestionFormDataStructureArticle.html
From there, I'm also struggling to understand how I would use GetFileUploadURL to get the link to download the video a worker uploaded so that I can approve their task.
http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_GetFileUploadURLOperation.html
Any insights?
You can only do file uploads via the Requester API, using the file upload question type. The QuestionForm documentation is decent, but the format is not necessarily intuitive, and, being XML, it is very strict.
An alternative would be to give workers some other way to send you the file, like Dropbox, etc.
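
For what the XML might look like, here is a minimal sketch of a QuestionForm with a FileUploadAnswer; the identifier, display text, and size limits are placeholders, and element order must match the XSD linked above.

    <!-- Sketch of a file-upload QuestionForm; identifier, text,
         and size limits are placeholders. -->
    <QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
      <Question>
        <QuestionIdentifier>video_upload</QuestionIdentifier>
        <DisplayName>Upload your video</DisplayName>
        <IsRequired>true</IsRequired>
        <QuestionContent>
          <Text>Record a short video following the instructions, then upload it here.</Text>
        </QuestionContent>
        <AnswerSpecification>
          <FileUploadAnswer>
            <MaxFileSizeInBytes>104857600</MaxFileSizeInBytes>
            <MinFileSizeInBytes>1024</MinFileSizeInBytes>
          </FileUploadAnswer>
        </AnswerSpecification>
      </Question>
    </QuestionForm>

Once a worker submits, GetFileUploadURL (second link above) takes the AssignmentId and the QuestionIdentifier and returns a temporary URL from which the uploaded file can be downloaded.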

Is it possible to increase the playback rate (speed up the video) when using Amazon Elastic Transcoder?

I'm looking at speeding up a video before storing it in S3. I haven't found anything on the AWS docs about this.
Is this something that can be done with AWS Elastic Transcoder?
Thanks!
Sébastien
It's not possible. Yet.
Though you can try writing to their forum, asking for this feature...
It sounds like that's the only way to get this kind of functionality exposed in the API.
Extracted from the FAQs:
Q: Why is the codec parameter that I want to change not exposed by the API?
In designing Amazon Elastic Transcoder, we wanted to create a service that was simple to use. Therefore, we expose the most frequently used codec parameters. If there is a parameter that you require, please let us know through our forum.
Aha, apparently someone already did :}