I am using boto 2.8.0 to create EMR jobflows over large log files stored in S3. I am relatively new to Elastic MapReduce and am still getting a feel for how to properly handle jobflows, hence this question.
The log files in question are stored in S3 with keys that correspond to the dates they are emitted from the logging server, e.g. /2013/03/01/access.log. These files are very, very large. My MapReduce job runs an Apache Pig script that simply examines some of the URI paths stored in the log files and outputs generalized counts that correspond to our business logic.
My client code in boto takes datetimes as input on the CLI and schedules a jobflow with a PigStep instance for every date needed. Thus, passing something like python script.py 2013-02-01 2013-03-01 would iterate over 29 days' worth of datetime objects and create PigSteps with the respective input keys for S3. This means that the resulting jobflow could have many, many steps, one for each day in the timedelta between the from_date and to_date.
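Roughly, the client code looks like this (a minimal sketch assuming boto 2's EMR API; the bucket names, Pig script path, and Pig parameter names are placeholders, not my exact code):

from datetime import date, timedelta

import boto.emr
from boto.emr.step import PigStep

def daterange(start, end):
    current = start
    while current < end:
        yield current
        current += timedelta(days=1)

conn = boto.emr.connect_to_region('us-east-1')

# One PigStep per day, each pointed at that day's log key in S3.
steps = [
    PigStep(
        name='uri-counts-%s' % day.isoformat(),
        pig_file='s3://my-bucket/scripts/counts.pig',   # placeholder
        pig_args=[
            '-p', 'INPUT=s3://my-logs/%04d/%02d/%02d/access.log' % (day.year, day.month, day.day),
            '-p', 'OUTPUT=s3://my-output/%s/' % day.isoformat(),
        ],
    )
    for day in daterange(date(2013, 2, 1), date(2013, 3, 1))
]

jobflow_id = conn.run_jobflow(
    name='daily-uri-counts',
    log_uri='s3://my-bucket/emr-logs/',
    steps=steps,
    num_instances=2,
    master_instance_type='m1.small',
    slave_instance_type='m1.small',
)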
My problem is that my EMR jobflow is exceedingly slow, almost absurdly so. It has been running for a night now and hasn't made it even halfway through that example set. Am I doing something wrong by creating many jobflow steps like this? Should I generalize the Pig script to handle the different keys instead, rather than preprocessing in the client code and creating a step for each date? Is this a feasible place to look for an optimization on Elastic MapReduce? It's worth mentioning that a similar job for a month's worth of comparable data, passed to the AWS elastic-mapreduce Ruby CLI client, took about 15 minutes to execute (that job was driven by the same Pig script).
EDIT
I neglected to mention that the job was scheduled for two instances of type m1.small, which admittedly may itself be the problem.
Related
What is the fastest way of getting an exact count of rows for a 100GB CSV file stored on Amazon S3, without using Athena or any Fargate or EC2 VM? I can't use Athena because the CSV file isn't clean enough for it. I can't use Fargate tasks or EC2 VMs because I need a purely serverless solution, and I can't use third-party services like Snowflake (native AWS services only).
Also, 100GB is too large to fit within a Lambda function's /tmp (limited to 10GB). I could try to run something like DuckDB (or any other streaming database engine) on a Lambda and scan the entire file with a SELECT COUNT(*) FROM "s3://myBucket/myFile.csv" query, but the Lambda is quite likely to time out, because its read bandwidth from S3 is 100MB/s at best and it cannot run for more than 15 minutes (900s).
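For context, that DuckDB attempt would look roughly like this (a sketch of the query only; packaging DuckDB as a Lambda layer and wiring up S3 credentials through httpfs are omitted, and the bucket/key are placeholders):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")   # needed for s3:// paths
con.execute("SET s3_region='us-east-1';")     # credentials setup omitted
count = con.execute(
    "SELECT COUNT(*) FROM read_csv_auto('s3://myBucket/myFile.csv')"
).fetchone()[0]
print(count)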
I know the approximate size of the file.
Note: I have an inaccurate estimate of the number of rows, provided by an AWS Glue Data Catalog crawler, with an error margin of -50%/+100%. This could be used for some kind of iterative or dichotomous process, but I could not figure one out. For example, I tried adding an OFFSET with a value lower than but close to the estimated number of rows to the aforementioned query, but the Lambda running DuckDB timed out. That was disappointing and somewhat surprising, because a query like SELECT * FROM "s3://myBucket/myFile.csv" LIMIT 10 OFFSET 10000000 worked well.
The fastest solution is probably to use SelectObjectContent with ScanRange to parallelize the request on chunks of 50MB or so.
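A rough sketch of what that could look like with boto3 (bucket, key, and chunk size are placeholders; each ScanRange only counts records that start inside it, so summing the per-chunk counts gives the total; in practice you would fan the chunks out across parallel Lambda invocations rather than a local loop, and you should check the ScanRange boundary semantics so adjacent ranges don't overlap):

import boto3

s3 = boto3.client("s3")

def count_rows_in_range(bucket, key, start, end):
    # S3 Select counts only the records that start inside the scan range,
    # so disjoint ranges can be summed without double counting.
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT COUNT(*) FROM s3object",
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
        ScanRange={"Start": start, "End": end},
    )
    payload = b""
    for event in resp["Payload"]:
        if "Records" in event:
            payload += event["Records"]["Payload"]
    return int(payload.decode().strip())

size = s3.head_object(Bucket="myBucket", Key="myFile.csv")["ContentLength"]
chunk = 50 * 1024 * 1024   # ~50MB per scan range
ranges = [(s, min(s + chunk, size)) for s in range(0, size, chunk)]
rows = sum(count_rows_in_range("myBucket", "myFile.csv", s, e) for s, e in ranges)
print(rows)   # subtract 1 if the file has a header row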
Have you tried "AWS S3 select":https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html. It lets you run queries on S3 files. I use the service to get basic insight into any file on S3(Provided it can be queried).
I want to import a large CSV file (around 1GB, with 2.5m rows and 50 columns) into DynamoDB, so I have been following this blog post from AWS.
However it seems I'm up against a timeout issue. I've got to ~600,000 rows ingested, and it falls over.
I think, from reading the CloudWatch logs, that the timeout is occurring due to the boto3 read of the CSV file (it opens the entire file first, iterates through it, and batches rows up for writing)... I tried reducing the file size (3 columns, 10,000 rows as a test), and I still got a timeout, after 2,500 rows.
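For reference, the read-and-batch-write pattern in question looks roughly like this (a simplified sketch, not the blog's exact code; the bucket, key, and table names are placeholders):

import csv

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")   # placeholder

def handler(event, context):
    # Read the CSV object from S3, then iterate and batch-write every row.
    obj = s3.get_object(Bucket="my-bucket", Key="my-file.csv")
    lines = (line.decode("utf-8") for line in obj["Body"].iter_lines())
    reader = csv.DictReader(lines)

    # batch_writer buffers puts into 25-item BatchWriteItem calls under the hood.
    with table.batch_writer() as batch:
        for row in reader:
            batch.put_item(Item=row)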
Any thoughts here?!
TIA :)
I really appreciate the suggestions (Chris & Jarmod). After trying and failing to break things programmatically into smaller chunks, I decided to look at the approach in general.
Through research I understood there were 4 options:
Lambda function - as per the above, this fails with a timeout.
AWS Data Pipeline - doesn't have a template for importing CSV into DynamoDB.
Manual entry - of 2.5m items? No thanks! :)
Use an EC2 instance to load the data into RDS and use DMS to migrate it to DynamoDB.
The last option actually worked well. Here's what I did:
Create an RDS database (I used the db.t2.micro tier as it was free) and create a blank table.
Create an EC2 instance (free Linux tier) and:
On the EC2 instance: use SCP to upload the CSV file to the ec2 instance
On the EC2 instance: first run sudo yum install mysql to get the tools needed, then use mysqlimport with the --local option to import the CSV file into the RDS MySQL database, which took literally seconds to complete.
At this point I also did some data cleansing to remove some white spaces and some character returns that had crept into the file, just using standard SQL queries.
Using DMS I created a replication instance, endpoints for the source (rds) and target (dynamodb) databases, and finally created a task to import.
The import took around 4hr 30m
After the import, I removed the EC2, RDS, and DMS objects (and associated IAM roles) to avoid any potential costs.
Fortunately, I had a flat structure to work against, and it was only one table. I needed the cheap speed of DynamoDB; otherwise, I'd have stuck with RDS (I almost did, halfway through the process!!!).
Thanks for reading, and best of luck if you have the same issue in the future.
I was wondering if it would be possible to compose Google Cloud Storage objects (specifically CSV files) without headers (i.e. without the row with column names) while using gsutil.
Currently, I can do the following:
gsutil compose gs://bucket/test_file_1.csv gs://bucket/test_file_2.csv gs://bucket/test-composition-files.csv
However, I will be unable to ingest test-composition-files.csv into Google BigQuery because compose blindly appended the files (including the column names).
One possible solution would be to download the file locally and process it with pandas, but this is not ideal for large files.
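For what it's worth, that local fallback would be something like this (a sketch only, assuming the parts have already been downloaded with gsutil cp; file names are placeholders):

import pandas as pd

# pandas treats the first row of each file as the header, so the duplicate
# header rows disappear when the frames are concatenated.
parts = [pd.read_csv(name) for name in ["test_file_1.csv", "test_file_2.csv"]]
merged = pd.concat(parts, ignore_index=True)

# A single CSV with one header row, ready to upload and load into BigQuery.
merged.to_csv("test-composition-files.csv", index=False)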
Is there any way to do this via the CLI? I could not find anything in the docs.
Reading the comments, I think you are spending effort in the wrong place. As I understand it, you want to load your files into BigQuery, but the large number of files prevents you from doing so (too many API calls), and Dataflow is too slow.
Maybe you can think about it differently. I have 2 solutions to propose:
If you need "near real time" ingestion, and if the file size is below 1.5GB, the best way is to build a Cloud Function that reads the file and performs a streaming insert into BigQuery. This function is triggered by a Cloud Storage event. If several files arrive at the same time, several function instances will be spawned. Be careful: streaming inserts into BigQuery are not free.
If you can wait up to 2 minutes after a file arrives, I recommend building a Cloud Function triggered every 2 minutes. This function reads the file names in a bucket, moves them to a subdirectory, and performs one load job over all the files in that subdirectory. You are limited to 1,000 load jobs per day (and per table); a day contains 1,440 minutes, so batching every 2 minutes keeps you within the limit. Load jobs are free.
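A minimal sketch of that second option (assuming the google-cloud-bigquery and google-cloud-storage Python clients; the bucket, prefixes, dataset, and table names are placeholders, and the scheduler wiring and error handling are omitted):

from google.cloud import bigquery, storage

def load_pending_files(event=None, context=None):
    storage_client = storage.Client()
    bq_client = bigquery.Client()
    bucket = storage_client.bucket("my-bucket")   # placeholder

    # Move newly arrived files into a staging prefix so the next run
    # doesn't pick them up again.
    staged = []
    for blob in list(storage_client.list_blobs("my-bucket", prefix="incoming/")):
        new_name = blob.name.replace("incoming/", "staging/", 1)
        bucket.rename_blob(blob, new_name)
        staged.append("gs://my-bucket/" + new_name)

    if not staged:
        return

    # One load job for the whole batch; load jobs are free.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    job = bq_client.load_table_from_uri(
        staged, "my_dataset.my_table", job_config=job_config
    )
    job.result()   # wait for the load job to complete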
Are these acceptable alternatives?
I'd like to use EMR and Spark to process an AWS S3 inventory report generated in ORC format that has many ORC files (hundreds) and the total size of all the data is around 250GB.
Is there a specific or best-practice way to read all the files into one Dataset? It seems like I can pass the sqlContext.read().orc() method a list of files, but I wasn't sure if this would scale/parallelize properly if I pass it a large list of hundreds of files.
What is the best practice way of doing this? Ultimately my goal is to have the contents of all the files in one dataset so that I can run a sql query on the dataset and then call .map on the results for subsequent processing on that result set.
Thanks in advance for your suggestions.
Just specify the folder where your ORC files are located. Spark will automatically detect all of them and put them into a single DataFrame.
sparkSession.read.orc("s3://bucket/path/to/folder/with/orc/files")
You shouldn't worry much about scalability, since everything is handled by Spark based on the default config EMR provides for the EC2 instance type selected. You can experiment with the number of slave nodes and their instance type, though.
Besides that, I would suggest setting maximizeResourceAllocation to true to configure the executors to utilize the maximum resources on each slave node.
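To tie this back to the SQL-then-map goal from the question, something along these lines should work (a sketch; the path, view name, column names, and query are placeholders, since the inventory schema depends on the fields you configured):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-inventory").getOrCreate()

# Point Spark at the folder; it reads all the ORC part files in parallel.
df = spark.read.orc("s3://bucket/path/to/folder/with/orc/files")

# Run SQL over the whole dataset...
df.createOrReplaceTempView("inventory")
result = spark.sql("SELECT key, size FROM inventory WHERE size > 0")

# ...then map over the result set for subsequent processing.
processed = result.rdd.map(lambda row: (row.key, row.size))
print(processed.take(5))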
I have a large number of logfiles from a service that I need to regularly run analysis on via EMR/Hive. There are thousands of new files per day, and they can technically come out of order relative to the file name (e.g. a batch of files comes a week after the date in the file name).
I did an initial load of the files via Snowball, then set up a script that syncs the entire directory tree once per day using the 'aws s3 sync' CLI command. This is good enough for now, but I will need a more realtime solution in the near future. The issue with this approach is that it takes a very long time, on the order of 30 minutes per day, and uses a ton of bandwidth all at once! I assume this is because it needs to scan the entire directory tree to determine which files are new, then send them all at once.
A realtime solution would be beneficial in two ways. One, I can get the analysis I need without waiting up to a day. Two, the network use would be lower and more spread out, instead of spiking once a day.
It's clear that 'aws s3 sync' isn't the right tool here. Has anyone dealt with a similar situation?
One potential solution could be:
Set up a service on the log-file side that continuously syncs (or aws s3 cp) new files based on the modified date. But wouldn't that need to scan the whole directory tree on the log server as well?
For reference, the log-file directory structure is like:
/var/log/files/done/{year}/{month}/{day}/{source}-{hour}.txt
There is also a /var/log/files/processing/ directory for files being written to.
Any advice would be appreciated. Thanks!
You could have a Lambda function triggered automatically as a new object is saved on your S3 bucket. Check Using AWS Lambda with Amazon S3 for details. The event passed to the Lambda function will contain the file name, allowing you to target only the new files in the syncing process.
If you'd like to wait until you have, say, 1,000 files in order to sync in batches, you could use AWS SQS and the following workflow (using 2 Lambda functions, 1 CloudWatch rule and 1 SQS queue):
S3 invokes Lambda whenever there's a new file to sync
Lambda stores the filename in SQS
CloudWatch triggers another Lambda function every X minutes/hours to check how many files are in SQS for syncing. Once there are 1,000 or more, it retrieves those filenames and runs the syncing process.
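A rough sketch of those first steps (the queue URL is a placeholder, and the batch-check function only hints at draining the queue and running the sync):

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/files-to-sync"   # placeholder

def on_new_object(event, context):
    # Triggered by S3: each record carries the bucket and key of the new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )

def check_queue(event, context):
    # Triggered by a CloudWatch rule: sync only once enough files have accumulated.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    if int(attrs["Attributes"]["ApproximateNumberOfMessages"]) >= 1000:
        pass   # drain the queue with receive_message/delete_message and run the sync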
Keep in mind that Lambda has a hard timeout (15 minutes at most). If your sync job takes too long, you'll need to break it into smaller chunks.
You could set the bucket up to log HTTP requests to a separate bucket, then parse the logs to look for newly created files and their paths. One trouble spot: as well as PUT requests, you have to look for the multipart upload operations, which are a sequence of POSTs. Best to log for a few days to see what gets created before putting any effort into this approach.
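A sketch of what that parsing might look like (assuming the standard space-delimited S3 server access log format, where the operation field, e.g. REST.PUT.OBJECT, is immediately followed by the object key; the exact multipart operation names are worth verifying against a few days of real logs, as suggested above):

# Collect object keys for newly created files from S3 server access logs.
OPS_OF_INTEREST = ("REST.PUT.OBJECT", "REST.POST.UPLOAD")   # verify against your logs

def new_keys(log_lines):
    keys = set()
    for line in log_lines:
        fields = line.split()
        for op in OPS_OF_INTEREST:
            if op in fields:
                keys.add(fields[fields.index(op) + 1])
    return keys

with open("access-log-sample.txt") as f:   # placeholder: downloaded log records
    for key in sorted(new_keys(f)):
        print(key)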