I am new to Kafka; we have the following requirements:
1) Do a daily sync of data from Kafka to HDFS, partitioned by a specific key in the JSON payload stored in the Kafka clusters.
2) The JSON payload needs to be broken into two different files.
Wondering if this can be achieved using the HDFS Kafka connector? I saw some documentation and I think I can get #1 working easily, but I can't tell whether there is anything out of the box for my second requirement.
Any suggestion on how to achieve this will be highly appreciated.
Thanks in advance.
Take a look at message transforms and see if they work for your use case: https://kafka.apache.org/documentation/#connect_transforms. Basically I'm envisioning 2 different HDFS connector instances reading from the same topic, each using something like ExtractField to pull what you want out of the payload, and then writing to two different HDFS locations.
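To make the idea concrete, here is a sketch of what one of the two sink instances might look like, assuming the Confluent HDFS sink connector and structured (schema'd) JSON values. The connector name, topic, HDFS URL, and field names (`my_partition_key`, `part_a`) are all assumptions for illustration; the second instance would be identical apart from `transforms.extract.field` and `topics.dir`:

```json
{
  "name": "hdfs-sink-part-a",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics": "my-topic",
    "hdfs.url": "hdfs://namenode:8020",
    "topics.dir": "/data/part_a",
    "format.class": "io.confluent.connect.hdfs.json.JsonFormat",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.FieldPartitioner",
    "partition.field.name": "my_partition_key",
    "transforms": "extract",
    "transforms.extract.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.extract.field": "part_a"
  }
}
```

One caveat: ExtractField replaces the whole record value with the extracted field, so if the partitioning key needs to survive into the written file, make sure it lives inside (or is copied into) the extracted sub-structure.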
Related
I am in a situation where I need to extract log JSON data, which might have changes in its structure, to AWS S3 in a real-time manner.
I am thinking of using AWS S3 + AWS Glue Streaming ETL. The thing is, the structure or schema of the log JSON data might change (and these changes are unpredictable), so my solution needs to be aware of such changes and should still stream the log data smoothly without causing errors. But as far as I know, all the AWS Glue tutorials demo the case where the structure of the incoming data never changes.
Can you recommend or tell me the solution within AWS that's suitable for my case?
Thanks.
I'm new to Spark Streaming, and as far as I can see there are different ways of doing the same thing, which makes me a bit confused.
This is the scenario:
We have multiple events (over 50 different event types) arriving every minute, and I want to do some data transformation, change the format from JSON to Parquet, and store the data in an S3 bucket. I'm creating a pipeline where we get the data, store it in an S3 bucket, and then the transformation happens (Spark jobs). My questions are:
1- Is it good if I run a Lambda function which sorts each event type into a separate subdirectory and then read each folder in Spark Streaming? Or is it better to store all the events in the same directory and then read them in my Spark Streaming job?
2- How can I run multiple Spark Streaming jobs at the same time? (I tried to loop through a list of schemas and folders, but apparently it doesn't work.)
3- Do I need an orchestration tool (Airflow) for my purpose? I need to look for new events all the time with no pause in between.
I'm going to use Kinesis Firehose -> S3 (data lake) -> EMR (Spark) -> S3 (data warehouse).
Thank you so much beforehand!
I have an IoT sensor which sends the following message to an IoT MQTT Core topic:
{"ID1":10001,"ID2":1001,"ID3":101,"ValueMax":123}
I have added an ACT/RULE which stores the incoming message in an S3 bucket with the timestamp as the key (each message is stored as a separate file/object in the bucket).
I have only worked with SQL databases before, so having them stored like this is new to me.
1) Is this the proper way to work with S3 storage?
2) How can I visualize the values in a schema instead of separate files?
3) I am trying to create ML Datasource from the S3 Bucket, but get the error below when Amazon ML tries to create schema:
"Amazon ML can't retrieve the schema. If you've just created this
datasource, wait a moment and try again."
Appreciate all advice there is!
1) Is this the proper way to work with S3 storage?
With only one sensor, using the [timestamp()](https://docs.aws.amazon.com/iot/latest/developerguide/iot-sql-functions.html#iot-function-timestamp) function in your IoT rule would be a way to name unique objects in S3, but there are issues that might come up.
With more than one sensor, you might have multiple messages arrive at the same timestamp and this would not generate unique object names in S3.
Timestamps from nearly the same time are going to have similar prefixes, and designing your S3 keys this way may not give you the best performance at higher message rates.
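As a minimal sketch of the key-naming idea (the key layout and `sensor_id` parameter here are illustrative assumptions, not an AWS-prescribed scheme), a random component both guarantees uniqueness across sensors and spreads keys over many prefixes instead of clustering them by time:

```python
import uuid
from datetime import datetime, timezone

def make_object_key(sensor_id: str) -> str:
    # A random UUID guarantees uniqueness even if two sensors report at
    # the exact same timestamp; leading with a slice of it spreads keys
    # across many prefixes instead of clustering them by time.
    u = uuid.uuid4().hex
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    return f"{u[:4]}/{sensor_id}/{ts}-{u}.json"

key = make_object_key("sensor-01")
```

The timestamp is kept in the key so the arrival time is still visible when listing objects, while the randomized leading prefix addresses both the collision and the hot-prefix concerns above.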
Since you're using MQTT, you could use the traceid() function instead of the timestamp to avoid these two issues if they come up.
2) How can I visualize the values in a schema instead of separate files?
3) I am trying to create ML Datasource from the S3 Bucket, but get the error below when Amazon ML tries to create schema:
For the third question, I think you could be running into a data format problem in ML because your S3 objects contain the JSON data from your messages and not a CSV.
For the second question, I think you're trying to combine message data from successive messages into a CSV, or at least output the message data as a single line of a CSV file. I don't think this is possible with just the IoT SQL language, since it's intended to produce JSON.
One alternative is to configure your IoT rule with a Lambda action and use a Lambda function to do the JSON-to-CSV conversion and then write the CSV to your S3 bucket. If you go this direction, you may have to enrich your IoT message data with the timestamp (or traceid) when you call the Lambda.
A rule like `select timestamp() as timestamp, traceid() as traceid, concat(ID1, ID2, ID3, ValueMax) as values, * as message` would produce JSON like
`{"timestamp":1538606018066,"traceid":"abab6381-c369-4a08-931d-c08267d12947","values":[10001,1001,101,123],"message":{"ID1":10001,"ID2":1001,"ID3":101,"ValueMax":123}}`
That would be straightforward to use as the source for a CSV row with the data from its values property.
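A minimal sketch of the Lambda-side conversion, assuming the enriched payload shape shown above (top-level `timestamp`, `traceid`, and a `values` array); the S3 upload itself is omitted:

```python
import csv
import io
import json

def message_to_csv_row(payload: str) -> str:
    # Assumes the enriched shape produced by the example rule above:
    # top-level "timestamp", "traceid", and a "values" array.
    msg = json.loads(payload)
    buf = io.StringIO()
    csv.writer(buf).writerow([msg["timestamp"], msg["traceid"], *msg["values"]])
    # A real Lambda handler would then s3.put_object(...) this line;
    # the boto3 client setup is omitted here.
    return buf.getvalue().strip()

row = message_to_csv_row(
    '{"timestamp":1538606018066,'
    '"traceid":"abab6381-c369-4a08-931d-c08267d12947",'
    '"values":[10001,1001,101,123]}'
)
# row == "1538606018066,abab6381-c369-4a08-931d-c08267d12947,10001,1001,101,123"
```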
Just a few simple questions on the actual mechanism behind reading a file on S3 into an EMR cluster with Spark:
Does `spark.read.format("com.databricks.spark.csv").load("s3://my/dataset/").where($"state" === "WA")` pull the whole dataset into the EMR cluster's local HDFS and then perform the filter afterwards? Or does it filter records as it brings the dataset into the cluster? Or does it do neither? If so, what's actually happening?
The official documentation lacks an explanation of what's going on (or if it does have one, I cannot find it). Can someone explain, or link to a resource with such an explanation?
I can't speak for the closed-source AWS one, but the ASF `s3a://` connector does its work in `S3AInputStream`.
Reading data is via HTTPS, which has a slow startup time, and if you need to stop the download before the GET is finished, it forces you to abort the TCP stream and create a new one.
To keep this cost down, the code has features like:
- Lazy seek: when you do a seek(), it updates its internal pointer but doesn't issue a new GET until you actually do a read.
- Chooses whether to abort() vs read to the end of a GET based on how much is left.
- Has 3 IO modes:
  - "sequential": GET content range is from (pos, EOF). Best bandwidth, worst performance on seek. For: CSV, .gz, ...
  - "random": small GETs, min(block-size, length(read)). Best for columnar data (ORC, Parquet) compressed in a seekable format (snappy).
  - "adaptive" (new last week, based on some work from Microsoft on the Azure WASB connector): starts off sequential, and as soon as you do a backwards seek, switches to random IO.
Code is all there; improvements welcome. The current perf work (especially random IO) is based on TPC-DS benchmarking of ORC data on Hive, BTW.
Assuming you are reading CSV and filtering there, it'll read the entire CSV file and then filter. This is horribly inefficient for large files. Best to import into a columnar format and use predicate pushdown, so the layers below can seek around the file to filter and read only the needed columns.
Loading data from S3 (s3://...) usually goes via EMRFS in EMR.
EMRFS accesses S3 directly (not via HDFS).
When Spark loads data from S3, it is stored as a Dataset in the cluster according to the StorageLevel (memory or disk).
Finally, Spark filters the loaded data.
When you specify files located on S3 they are read into the cluster. The processing happens on the cluster nodes.
However, this may be changing with S3 Select, which is now in preview.
I have a set of very large XML files and I would like to import them into DynamoDB after doing some data massaging.
Is this possible through AWS Data Pipeline or some other tool? Currently this is done manually through a program that runs the ETL process.
I am not sure how much Data Pipeline would help you with the custom processing of the XML.
I would like to recommend a few approaches [definitely non-exhaustive options] - either way, it would be beneficial if you keep those XML files in S3:
- Try the Elastic MapReduce route [bonus points for Spot instances]
- Try using AWS Lambda to process and push it to DynamoDB
- Try Elastic Beanstalk batch processing
Currently it is not possible to directly import the XML into DynamoDB through Data Pipeline.
But if you preprocess the XML files and convert the XML data to the format described in DynamoDBExportDataFormat (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-dynamodbexportdataformat.html), then you should be able to use the templates provided in the Data Pipeline console to accomplish the task (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.Templates.html).
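As a sketch of the preprocessing step, assuming a hypothetical `<item>` element shape (adapt the attribute names and the exact output layout to your XML schema and to what the DynamoDBExportDataFormat docs specify), flattening each element into one JSON line with DynamoDB-style type descriptors might look like:

```python
import json
import xml.etree.ElementTree as ET

def xml_items_to_jsonl(xml_text: str) -> str:
    # The <item id=... name=... price=...> shape here is hypothetical;
    # adapt the mapping to your actual XML schema and to the exact layout
    # described in the DynamoDBExportDataFormat documentation.
    root = ET.fromstring(xml_text)
    lines = []
    for item in root.iter("item"):
        record = {
            "id": {"s": item.get("id")},
            "name": {"s": item.get("name")},
            "price": {"n": item.get("price")},
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

sample = '<items><item id="1" name="widget" price="9.99"/></items>'
out = xml_items_to_jsonl(sample)
```

Running this step over the S3-hosted XML files (e.g. in an EMR or Lambda job) would produce output the Data Pipeline import templates can consume.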