Assuming I have the names of temporary files from a Hive job, and multiple jobs were executing at that time, is it possible to identify which Hive job produced the chosen files? Is this logged somewhere?
Information about running and past jobs is available at http://jobtrackerhost:50030/jobtracker.jsp.
Following the link to the job details, there is also a link to the Job File, which shows the names of the files used in HDFS under /tmp/hive-${username} (the mapred.cache.files property).
This information is also available in the per-job XML configuration files under $HADOOP_HOME/mr1/logs/ (files matching *conf.xml).
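If you need to do this lookup from a script rather than through the JobTracker UI, here is a minimal sketch that scans those *conf.xml files for a given temporary file name; the function name and the example path are placeholders:

import glob
import os

def find_jobs_referencing(tmp_file_name, log_dir=os.path.expandvars("$HADOOP_HOME/mr1/logs")):
    """Return the per-job *conf.xml files that mention the given temporary file."""
    matches = []
    for conf_path in glob.glob(os.path.join(log_dir, "*conf.xml")):
        with open(conf_path, encoding="utf-8") as f:
            if tmp_file_name in f.read():
                matches.append(conf_path)  # the conf file name contains the job id
    return matches

print(find_jobs_referencing("/tmp/hive-myuser/<temporary-file-name>"))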
I have multiple sources sending incremental data and there are no metadata columns at the record level. How can I ensure that Airflow processes the data in the order of receipt? Otherwise I may end up processing the files out of order.
Does Airflow have a built-in way to handle the files in the order they were received?
Airflow version used: 2.4.3
You can use boto to retrieve the last modified timestamp from files in your S3 bucket within a PythonOperator.
This question has an answer that shows how to pull the last modified timestamp. Then you can sort the keys by that timestamp, process the files in that order, and move the files to an archive folder or bucket so only new files are processed with every DAG run.
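A minimal sketch of that approach with boto3, suitable for a PythonOperator callable; the bucket, prefixes, and process_file() below are placeholders:

import boto3

def process_new_files_in_order(bucket="my-bucket", prefix="incoming/"):
    s3 = boto3.client("s3")
    # Note: list_objects_v2 returns at most 1000 keys per call; paginate if you expect more.
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])

    # Sort by LastModified so files are handled in order of receipt.
    for obj in sorted(objects, key=lambda o: o["LastModified"]):
        key = obj["Key"]
        process_file(bucket, key)  # your transformation logic goes here

        # Move the processed file to an archive prefix so the next DAG run only sees new files.
        archive_key = key.replace(prefix, "archive/", 1)
        s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=archive_key)
        s3.delete_object(Bucket=bucket, Key=key)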
As a general note, if you have any control over your sources, I would try to add a timestamp at the record level instead; that seems like the easier option.
I have a problem when I try to do ETL on a large bunch of files on AWS.
The goal is to convert JSON files to Parquet files. Due to the size of the files I have to do it batch by batch. Let's say I need to do it in 15 batches, i.e. 15 separate runs, to be able to convert all of them.
I am using write.mode("append").format("parquet") in each Glue PySpark job to write into the Parquet files.
My problem is that if one job fails for some reason, I don't know what to do: some partitions are updated while some are not, and some files in the batch have been processed while some have not. For example, if my 9th job fails, I am kind of stuck. I don't want to delete all the Parquet files and start over, but I also don't want to just re-run that 9th job and cause duplicates.
Is there a way to protect the Parquet output so that new files are only appended if the whole job succeeds?
Thank you!
Based on your comment and a similar experience I had with this problem, I believe this happens because of S3 eventual consistency. Have a look at the Amazon S3 Data Consistency Model here: https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html.
We found that using the partitioned staging s3a committer with the conflict-resolution mode set to replace stopped our jobs from failing.
Try the following parameters with your spark jobs:
spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
spark.hadoop.fs.s3a.committer.name partitioned
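For a PySpark job, one way to set these is when building the SparkSession (this is just a sketch; they can equally be passed as --conf options to spark-submit):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json-to-parquet")
    .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
    .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
    .getOrCreate()
)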
Also have a read about the committers here:
https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html
Hope this helps!
P.S. If all else fails and your files are not too big, you can use a hacky solution where you save your Parquet files locally and upload them once your Spark tasks are complete, but I personally do not recommend it.
I am backfilling some data via Glue jobs. The job itself reads a TSV from S3, transforms the data slightly, and writes it to S3 as Parquet. Since I already have the data, I am trying to launch multiple jobs at once to reduce the amount of time needed to process it all. When I launch multiple jobs at the same time, I sometimes run into an issue where one of the jobs fails to output the resulting Parquet files to S3, even though the job itself completes successfully without throwing an error. When I rerun the job as a non-parallel task, the files are output correctly. Is there some issue, either with Glue (or the underlying Spark) or S3, that would cause this?
The same Glue job running in parallel may produce files with the same names, and therefore some of them can be overwritten. If I remember correctly, the transformation context is used as part of the name. I assume you don't have bookmarking enabled, so it should be safe for you to generate the transformation context value dynamically to ensure it is unique for each job run.
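A hedged sketch of what that could look like in a Glue PySpark script; the S3 paths are placeholders and the unique suffix is generated with uuid (any per-run unique value would do). Note this is only reasonable with bookmarking disabled, since bookmarks rely on a stable transformation context:

import uuid

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
run_suffix = uuid.uuid4().hex  # unique per job run

# Read the TSV input (placeholder path).
frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"separator": "\t"},
    transformation_ctx="source_" + run_suffix,
)

# Write Parquet with a per-run transformation context so parallel runs
# are less likely to produce colliding output file names.
glueContext.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
    transformation_ctx="sink_" + run_suffix,
)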
I have data being written from Kafka to a directory in s3 with a structure like this:
s3://bucket/topics/topic1/files1...N
s3://bucket/topics/topic2/files1...N
.
.
s3://bucket/topics/topicN/files1...N
There is already a lot of data in this bucket and I want to use AWS Glue to transform it into Parquet and partition it, but there is way too much data to do it all at once. I was looking into bookmarking, and it seems like you can't use it to only read the most recent data or to process data in chunks. Is there a recommended way of processing data like this so that bookmarking will work when new data comes in?
Also, does bookmarking require Spark or Glue to scan my entire dataset each time I run a job to figure out which files are newer than the last run's max_last_modified timestamp? That seems pretty inefficient, especially as the data in the source bucket continues to grow.
I have learned that Glue wants all similar files (files with the same structure and purpose) to be under one folder, with optional subfolders.
s3://my-bucket/report-type-a/yyyy/mm/dd/file1.txt
s3://my-bucket/report-type-a/yyyy/mm/dd/file2.txt
...
s3://my-bucket/report-type-b/yyyy/mm/dd/file23.txt
All of the files under the report-type-a folder must have the same format. Put a different report, like report-type-b, in a different folder.
You might try putting just a few of your input files in the proper place, running your ETL job, placing more files in the bucket, running again, etc.
I tried this by getting the current files working (one file per day), then backfilling historical files. Note, however, that this did not work completely. Files in s3://my-bucket/report-type/2019/07/report_20190722.gzip were processed fine, but when I added past files such as s3://my-bucket/report-type/2019/05/report_20190510.gzip, Glue did not "see" or process the files in the older folder.
However, if I moved the old report to the current partition, it worked: s3://my-bucket/report-type/2019/07/report_20190510.gzip.
AWS Glue bookmarking works only with a select few formats (more here) and only when the data is read using the glueContext.create_dynamic_frame.from_options function. Along with this, job.init() and job.commit() should also be present in the Glue script. You can check out a related answer.
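A minimal sketch of a bookmark-friendly Glue script for the layout in the question, assuming bookmarking is enabled in the job properties; the bucket paths and the JSON/Parquet formats are assumptions:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # required for bookmarks to be tracked

# Reading with from_options and a stable transformation_ctx lets the bookmark
# remember which files under the prefix have already been processed.
frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/topics/"]},
    format="json",
    transformation_ctx="source",
)

glueContext.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://bucket/topics-parquet/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # persists the bookmark state for the next run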
I'd like to use EMR and Spark to process an AWS S3 inventory report generated in ORC format. The report consists of many ORC files (hundreds), and the total size of all the data is around 250 GB.
Is there a specific or best practice way to read all the files in to one Dataset? It seems like I can pass the sqlContext.read().orc() method a list of files, but I wasn't sure if this would scale/parallelize properly if I pass it a large list of hundreds of files.
What is the best practice way of doing this? Ultimately my goal is to have the contents of all the files in one dataset so that I can run a sql query on the dataset and then call .map on the results for subsequent processing on that result set.
Thanks in advance for your suggestions.
Just specify the folder where your ORC files are located. Spark will automatically detect all of them and read them into a single DataFrame.
sparkSession.read.orc("s3://bucket/path/to/folder/with/orc/files")
You shouldn't worry much about scalability, since everything is handled by Spark based on the default configuration EMR provides for the selected EC2 instance type. You can experiment with the number of slave nodes and their instance type, though.
Besides that, I would suggest setting maximizeResourceAllocation to true so that the executors utilize the maximum resources on each slave node.
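Putting the answer together with the goal in the question, a rough PySpark sketch could look like this; the column names in the SQL query are assumptions about the inventory report schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-inventory").getOrCreate()

# Spark discovers and parallelizes the read over every ORC file under the prefix.
inventory = spark.read.orc("s3://bucket/path/to/folder/with/orc/files")
inventory.createOrReplaceTempView("inventory")

# Example query: objects larger than 1 GiB (assumes `key` and `size` columns).
large_objects = spark.sql("SELECT key, size FROM inventory WHERE size > 1073741824")

# Subsequent per-row processing on the query result.
keys = large_objects.rdd.map(lambda row: row["key"])
print(keys.take(10))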