I'm new to Spark Streaming, and as far as I can see there are different ways of doing the same thing, which makes me a bit confused.
This is the scenario:
We have multiple event types (over 50 different ones) arriving every minute. I want to do some data transformation, convert the format from JSON to Parquet, and store the data in an S3 bucket. I'm building a pipeline where we receive the data, store it in an S3 bucket, and then run the transformations (Spark jobs). My questions are:
1- Is it a good idea to run a Lambda function that sorts each event type into a separate subdirectory and then read each folder in Spark Streaming, or is it better to store all the events in the same directory and read that in my Spark Streaming job?
2- How can I run multiple Spark Streaming queries at the same time? (I tried to loop through a list of schemas and folders, but apparently it doesn't work; a sketch of what I mean is below.)
3- Do I need an orchestration tool (Airflow) for this? I need to look for new events continuously, with no pause in between.
I'm going to use: Kinesis Firehose -> S3 (data lake) -> EMR (Spark) -> S3 (data warehouse).
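For question 2, this is roughly what I mean by looping over schemas and folders, as a minimal sketch (the bucket paths, event names, and schemas are placeholders, not our real ones):

```python
# Minimal sketch of looping over event types, one streaming query per folder.
# Paths, event names, and schemas below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("events-json-to-parquet").getOrCreate()

event_schemas = {
    "event_a": StructType([StructField("id", StringType()),
                           StructField("ts", TimestampType())]),
    "event_b": StructType([StructField("id", StringType()),
                           StructField("payload", StringType())]),
}

queries = []
for event_name, schema in event_schemas.items():
    df = (spark.readStream
          .schema(schema)
          .json(f"s3://my-data-lake/events/{event_name}/"))   # one folder per event type

    q = (df.writeStream
         .format("parquet")
         .option("path", f"s3://my-warehouse/{event_name}/")
         .option("checkpointLocation", f"s3://my-warehouse/_checkpoints/{event_name}/")
         .start())
    queries.append(q)

spark.streams.awaitAnyTermination()   # keep all queries running
```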
Thank you so much beforehand!
I'm working with a pipeline that pushes JSON entries in batches to my Gcloud Storage bucket. I want to get this data into Kafka.
The way I'm going about it now is a lambda function that gets triggered every minute to find the files that have changed, open streams from them, read them line by line, and every so often batch those lines as messages into a Kafka producer.
This process is pretty terrible, but it works.... eventually.
I was hoping there'd be a way to do this with Kafka Connect or Flink, but there really isn't much development around detecting incremental file additions to a bucket.
Do the JSON entries end up in different files in your bucket? Flink has support for streaming in new files from a source.
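For example, something along these lines with PyFlink's FileSource might work (just a sketch; the bucket path is a placeholder, and reading gs:// paths assumes the GCS filesystem plugin, flink-gs-fs-hadoop, is available on the cluster):

```python
# Sketch: continuously pick up new files from a bucket with Flink's FileSource.
# The gs:// path is a placeholder; reading it assumes the GCS filesystem
# plugin (flink-gs-fs-hadoop) is installed on the cluster.
from pyflink.common import Duration, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.file_system import FileSource, StreamFormat

env = StreamExecutionEnvironment.get_execution_environment()

source = (FileSource
          .for_record_stream_format(StreamFormat.text_line_format(),
                                    "gs://my-bucket/json-entries/")
          .monitor_continuously(Duration.of_seconds(60))   # keep scanning for new files
          .build())

lines = env.from_source(source, WatermarkStrategy.no_watermarks(), "gcs-json-files")
lines.print()   # replace with a Kafka sink in the real job
env.execute("bucket-to-kafka-sketch")
```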
I have an ETL application which is supposed to migrate to AWS infrastructure. The scheduler used in my application is Tivoli Workload Scheduler, and we want to keep using it on the cloud as well; it has file dependencies.
Now, when we move to AWS, the files to be watched will land in an S3 bucket. Can we put the OPEN dependency on files in S3? If yes, what would be the hostname (HOST#Filepath)?
If not, which services should be used to serve this purpose? I have both time and file dependencies in my SCHEDULES.
E.g., a file might get uploaded to S3 at 1 AM. At 3 AM my schedule gets triggered and looks for the file in the S3 bucket. If it's present, execution starts; if not, it should wait according to the other parameters in TWS.
Any help or advice would be nice to have.
If I understand this correctly, the job triggered at 3 AM will identify all files uploaded within the last, e.g., 24 hours.
You can list the S3 objects and filter for everything uploaded within a specific period of time.
A better solution would be to create an S3 upload trigger that sends a notification to SQS, and have your code inspect the queue depth (number of messages) and process the files one by one. An additional benefit is the assurance that all items are processed without having to worry about time overlaps.
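As a rough sketch of both options (bucket name, prefix, and queue URL are placeholders):

```python
# Sketch of both options; bucket name, prefix, and queue URL are placeholders.
import datetime
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Option 1: list objects and keep only those uploaded in the last 24 hours.
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=24)
recent = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-landing-bucket", Prefix="incoming/"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            recent.append(obj["Key"])

# Option 2: check how many S3 upload notifications are waiting in SQS.
attrs = sqs.get_queue_attributes(
    QueueUrl="https://sqs.ap-southeast-1.amazonaws.com/123456789012/my-upload-queue",
    AttributeNames=["ApproximateNumberOfMessages"],
)
pending = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
```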
I have a large number of logfiles from a service that I need to regularly run analysis on via EMR/Hive. There are thousands of new files per day, and they can technically come out of order relative to the file name (e.g. a batch of files comes a week after the date in the file name).
I did an initial load of the files via Snowball, then set up a script that syncs the entire directory tree once per day using the 'aws s3 sync' CLI command. This is good enough for now, but I will need a more real-time solution in the near future. The issue with this approach is that it takes a very long time, on the order of 30 minutes per day, and it uses a ton of bandwidth all at once! I assume this is because it needs to scan the entire directory tree to determine which files are new, then send them all at once.
A real-time solution would be beneficial in two ways. One, I can get the analysis I need without waiting up to a day. Two, the network use would be lower and more spread out, instead of spiking once a day.
It's clear that 'aws s3 sync' isn't the right tool here. Has anyone dealt with a similar situation?
One potential solution could be:
Set up a service on the log-file side that continuously syncs (or aws s3 cp) new files based on the modified date (a rough sketch is below, after the directory layout). But wouldn't that need to scan the whole directory tree on the log server as well?
For reference, the log-file directory structure is like:
/var/log/files/done/{year}/{month}/{day}/{source}-{hour}.txt
There is also a /var/log/files/processing/ directory for files being written to.
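A rough sketch of the kind of service I mean (bucket name and state file are made up; note it still walks the whole tree, which is exactly my concern):

```python
# Rough sketch of the uploader service I have in mind.
# Bucket name and state-file path are placeholders; note that this still
# walks the whole tree, which is exactly what I'm worried about.
import os
import boto3

LOG_ROOT = "/var/log/files/done"
STATE_FILE = "/var/log/files/.last_sync_ts"
s3 = boto3.client("s3")

last_sync = float(open(STATE_FILE).read()) if os.path.exists(STATE_FILE) else 0.0
newest = last_sync

for dirpath, _, filenames in os.walk(LOG_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        mtime = os.path.getmtime(path)
        if mtime > last_sync:                      # only files modified since last run
            key = os.path.relpath(path, LOG_ROOT)  # keep the {year}/{month}/{day} layout
            s3.upload_file(path, "my-log-bucket", key)
            newest = max(newest, mtime)

with open(STATE_FILE, "w") as f:
    f.write(str(newest))
```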
Any advice would be appreciated. Thanks!
You could have a Lambda function triggered automatically whenever a new object is saved to your S3 bucket. Check Using AWS Lambda with Amazon S3 for details. The event passed to the Lambda function will contain the file name, allowing you to target only the new files in the syncing process.
If you'd like to wait until you have, say, 1,000 files in order to sync in batches, you could use AWS SQS and the following workflow (using two Lambda functions, one CloudWatch rule, and one SQS queue):
S3 invokes Lambda whenever there's a new file to sync
Lambda stores the filename in SQS
CloudWatch triggers another Lambda function every X minutes/hours to check how many files are waiting in SQS for syncing. Once there are 1,000 or more, it retrieves those filenames and runs the syncing process (a sketch of this check follows below).
Keep in mind that Lambda has a hard timeout of 5 minutes. If your sync job takes too long, you'll need to break it into smaller chunks.
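A sketch of that CloudWatch-triggered check might look like this (queue URL, threshold, and the sync_files helper are placeholders):

```python
# Sketch of the CloudWatch-triggered Lambda that checks the SQS backlog.
# Queue URL and threshold are placeholders, and sync_files() stands in for
# whatever syncing process you run on the collected filenames.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/files-to-sync"
THRESHOLD = 1000

def handler(event, context):
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"])
    if int(attrs["Attributes"]["ApproximateNumberOfMessages"]) < THRESHOLD:
        return  # not enough files yet

    filenames = []
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            filenames.append(msg["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    sync_files(filenames)  # placeholder for the actual syncing process
```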
You could set the bucket up to log HTTP requests to a separate bucket, then parse the logs to look for newly created files and their paths. One trouble spot: as well as PUT requests, you have to look for the multipart upload operations, which are a sequence of POSTs. It's best to log for a few days to see what gets recorded before putting any effort into this approach.
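A sketch of that log scan (the exact field handling is an assumption; verify it against a few days of real log lines first):

```python
# Sketch: scan S3 server-access-log lines for object-creation operations.
# The operation names and the "key follows the operation field" layout come
# from the documented access-log format, but verify both against real samples.
OPS = ("REST.PUT.OBJECT", "REST.POST.UPLOAD")  # completed multipart uploads
                                               # typically appear as REST.POST.UPLOAD

def new_object_keys(log_lines):
    keys = []
    for line in log_lines:
        parts = line.split()
        for i, token in enumerate(parts):
            if token in OPS and i + 1 < len(parts):
                keys.append(parts[i + 1])   # the object key follows the operation field
                break
    return keys
```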
I'm currently playing around with AWS for some serverless CSV processing. Decent familiarity with EC2 and Dynamo. I'm sure there is a better way to structure this, and I've not found an efficient way to store the data. Any architecture suggestions would be much appreciated.
This flow will take in a CSV uploaded to S3, process all the rows of tuples and output a new CSV of processed data to S3.
What's 1) the optimal architecture, and 2) the optimal place to store the data until queue processing is complete and the output CSV can be built?
Data flow and service architecture:
CSV (contains tuples) (S3) -> CSV processing (Lambda) -> Queue (SNS) -> Queue Processing (Lambda) -> ????? temporary storage for queue items that have been processed before they get written to CSV ???? (what to use here?) -> CSV building (Lambda) -> CSV storage (S3)
Clever ideas appreciated.
I believe you are overcomplicating matters.
S3 can invoke a Lambda function when events occur. This is set up directly in the S3 bucket's event notifications.
So use this to write a converted version of the CSV to another bucket.
Amazon has an example of how to do this sort of thing here:
http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
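A sketch of such an S3-triggered conversion Lambda (the output bucket and transform_row are placeholders):

```python
# Sketch of an S3-triggered Lambda that writes a converted CSV to another
# bucket. The output bucket name and transform_row() are placeholders.
import csv
import io
import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "my-processed-csv-bucket"

def transform_row(row):
    return row  # placeholder for the real per-row processing

def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(body)):
        writer.writerow(transform_row(row))

    s3.put_object(Bucket=OUTPUT_BUCKET, Key=key, Body=out.getvalue())
```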
Update (in reply to the comment "it doesn't parallelize anything"):
You can divide the task equally if you have a good idea of how many tuples can be processed by a single Lambda within its time limit.
For example, given the following info...
original CSV contains 50,000 tuples
a single Lambda can process 5,000 tuples within the time limit.
You can then do 10 parallel asynchronous invocations of the processor Lambda with each of them working with a different offset.
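A sketch of that fan-out (the function name and S3 path are placeholders; the counts come from the example above):

```python
# Sketch of kicking off parallel asynchronous invocations, each with its own
# offset. Function name and S3 path are placeholders; counts match the example.
import json
import boto3

lam = boto3.client("lambda")
TOTAL_TUPLES = 50000
CHUNK = 5000

for offset in range(0, TOTAL_TUPLES, CHUNK):
    lam.invoke(
        FunctionName="csv-processor",
        InvocationType="Event",   # asynchronous: returns immediately
        Payload=json.dumps({"s3_path": "s3://my-bucket/input.csv",
                            "offset": offset,
                            "limit": CHUNK}),
    )
```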
Original answer:
You can make it work with two Lambdas:
Listener
An S3-triggered Lambda whose only job is to pass the S3 path of the newly uploaded CSV to the Processor Lambda.
Processor
A Lambda that is triggered by the Listener. It requires the S3 path and an offset as parameters (where the offset is the row of the CSV at which it should start processing).
This Lambda performs the actual processing of your CSV rows. It should keep track of which row it's currently processing, and just before the Lambda time limit is reached, it stops and invokes itself with the same S3 path but a new offset.
So, basically, it's a recursive Lambda that invokes itself until all CSV rows are processed.
To check for the remaining time, you can use the context.getRemainingTimeInMillis() method (NodeJS) in a while or for loop in your handler.
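For reference, roughly the same idea in Python, where the equivalent method is context.get_remaining_time_in_millis() (a sketch; load_rows and process_row are placeholders):

```python
# Sketch of the self-invoking Processor Lambda in Python. load_rows() and
# process_row() are placeholders; context.get_remaining_time_in_millis() is
# the Python counterpart of getRemainingTimeInMillis().
import json
import boto3

lam = boto3.client("lambda")

def handler(event, context):
    s3_path = event["s3_path"]
    offset = event.get("offset", 0)
    rows = load_rows(s3_path)          # placeholder: read the CSV rows from S3

    while offset < len(rows):
        if context.get_remaining_time_in_millis() < 10_000:
            # Almost out of time: re-invoke ourselves with the current offset.
            lam.invoke(FunctionName=context.function_name,
                       InvocationType="Event",
                       Payload=json.dumps({"s3_path": s3_path, "offset": offset}))
            return
        process_row(rows[offset])      # placeholder for the per-row work
        offset += 1
```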
Considering that AWS Data Pipeline is not available in the Singapore region, are there any alternatives for efficiently pushing CSV data to DynamoDB?
If it were me, I would set up an S3 event notification on a bucket that fires a Lambda function each time a CSV file is dropped into it.
The notification would let Lambda know that a new file is available, and the Lambda function would be responsible for loading the data into DynamoDB.
This works better (because of Lambda's limits) if the CSV files are not huge, so they can be processed in a reasonable amount of time. The bonus is that, once it's working, the only work left is to simply drop new files into the right bucket; no server required.
Here is a GitHub repository that has a CSV -> DynamoDB loader written in Java; it might help get you started.
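A minimal Python sketch of the same idea (the table name is a placeholder, and it assumes the CSV has a header row whose columns match your attribute names, including the partition key):

```python
# Sketch of an S3-triggered Lambda that batch-writes CSV rows into DynamoDB.
# Table name is a placeholder; the CSV is assumed to have a header row whose
# columns match the table's attribute names, including the partition key.
import csv
import io
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my-table")

def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body))   # header row -> attribute names

    with table.batch_writer() as batch:          # batches PutItem calls for you
        for row in reader:
            batch.put_item(Item=row)
```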