I'm trying to run a UDF in Hive that scans through an external CSV file, using a value from the table as another argument.
The query I use:
add jar s3://bucket_name/udf/hiveudf.jar;
add FILE hdfs:///myfile/myfile.csv;
CREATE TEMPORARY FUNCTION MyFunc AS '....udf.myUDF';
SELECT mydate, record_id, value, MyFunc('myfile.csv',value) from my_table;
The results are unstable: in some cases the exact same query works just fine, but in about 80% of cases it returns an exception:
java.io.FileNotFoundException: myfile.csv (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at java.io.FileReader.<init>(FileReader.java:58)
...
The file seems to be added to the distributed cache:
hive> list files;
/mnt/tmp/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx_resources/myfile.csv
I tried it with various releases of EMR as well as with various instance types and couldn't find a pattern for what triggers this issue. Any advice will be highly appreciated.
You might enable DEBUG logging to find more info. But in general, I've seen similar issues when there was a resize (shrink) of the EMR cluster, causing certain blocks of the expected HDFS distributed cache file to be removed from the cluster because of insufficient replication.
I am creating a very big file that cannot fit in memory directly. So I have created a bunch of small files in S3 and am writing a script that can read these files and merge them. I am using AWS Data Wrangler (awswrangler) to do this.
My code is as follows:
import logging
import awswrangler as wr

logger = logging.getLogger(__name__)

try:
    # read the small parquet files in chunks to avoid loading everything into memory at once
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file, and I can't remove chunked=True because otherwise my program fails with an OOM.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
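For example, a minimal sketch of that suggestion, assuming hypothetical bucket and key names:

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})

# with the default dataset=False, the path is treated as the full object key,
# so awswrangler writes exactly one parquet file at that location
wr.s3.to_parquet(df=df, path="s3://my-bucket/merged/output.parquet")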
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post, it's not possible in a local folder, let alone S3. To add to this, Parquet is compressed, and I don't think it would be easy to add a line to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue, and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one Parquet file unless you get a machine with enough RAM.
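A rough sketch of that batching idea, assuming hypothetical paths and a batch size small enough that each combined chunk fits in memory:

import awswrangler as wr

# hypothetical locations and sizing
input_folder = "s3://my-bucket/small-files/"
target_path = "s3://my-bucket/bigger-files/"
batch_size = 50  # number of small files to combine into one output file

keys = wr.s3.list_objects(input_folder, suffix=".parquet")
for i in range(0, len(keys), batch_size):
    batch = keys[i:i + batch_size]
    df = wr.s3.read_parquet(path=batch)  # load one batch of small files as a single dataframe
    wr.s3.to_parquet(df=df, path=f"{target_path}part-{i // batch_size:05d}.parquet")

This still leaves several Parquet files, just fewer and larger ones; getting down to exactly one file would require reading everything into memory at once.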
I am working on the AWS Glue ETL part for reading a huge JSON file (testing with only one file of around 9 GB) in an ETL process, but I got an error from AWS Glue of java.lang.OutOfMemoryError: Java heap space after it ran and processed for a while.
My code and flow are as simple as:
df = spark.read.option("multiline", "true").json(f"s3/raw_path")
# ...
# and write it as source_df to another object in S3
df.write.json(f"s3/source_path", lineSep=",\n")
From the error/log, it seems it failed and the container was terminated while reading this huge file. I have already tried upgrading the worker type to G.1X with a number of worker nodes; however, I would like to find another solution that does not amount to vertical scaling by simply increasing resources.
I am very new to this area and this service, so I want to keep cost and time as low as possible :-)
Thank you all in advance.
After looking into Glue and Spark, I found that to get the benefit of parallel processing across multiple executors, in my case I had to split the (large) file into multiple smaller files, and it worked! The files are distributed to multiple executors.
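For illustration, a minimal PySpark sketch of that approach, assuming the large file has already been split into line-delimited parts under a hypothetical prefix:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-split-json").getOrCreate()

# hypothetical paths: the single 9 GB file was pre-split into many smaller parts,
# ideally one JSON record per line, so the "multiline" option is no longer needed
# and each part can be read by a separate executor
raw_split_path = "s3://my-bucket/raw_split/"
source_path = "s3://my-bucket/source_path/"

df = spark.read.json(raw_split_path)
df.write.mode("overwrite").json(source_path)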
I'm looking for advice on how to read Parquet files from an HDFS cluster using Apache NiFi. In the cluster there are multiple files present under a single directory, and I want to read them all in one flow. Does NiFi provide a built-in component to read the files in an HDFS directory (Parquet in this case)?
Example: 3 files present in the directory:
hdfs://app/data/customer/file1.parquet
hdfs://app/data/customer/file2.parquet
hdfs://app/data/customer/file3.parquet
Thanks!
You can use the FetchParquet processor in combination with the ListHDFS/GetHDFS (etc.) processors.
This processor was added in NiFi 1.2, and Jira NiFi-3724 addresses this improvement.
ListHDFS // stores state and runs incrementally.
GetHDFS // doesn't store state; it gets all the files from the configured directory (set the Keep Source File property to true in case you don't want to delete the source files).
You can use other ways (UpdateAttribute, etc.) to add the fully qualified filename as an attribute to the flowfile, then feed the connection to the FetchParquet processor, which fetches those Parquet files.
Based on the Record Writer specified, the FetchParquet processor reads the Parquet files and writes them out in the format specified by the Record Writer.
Flow:
ListHDFS/GetHDFS -> FetchParquet -> other processors
If your requirement is to read the files from HDFS, you can use the HDFS processors available in the nifi-hadoop-bundle. You can use either of the two approaches:
A combination of ListHDFS and FetchHDFS
GetHDFS
The difference between the two approaches is that GetHDFS will keep listing the contents of the configured directories on each run, so it will produce duplicates. The former approach, however, keeps track of state, so only new additions and/or modifications are returned on each subsequent run.
Every time I try to run some very simple jobs (importing JSON on S3 into Redshift) I get the following error:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
s3://my-temp-glue-dir/f316d46f-eaf3-497a-927b-47ff04462e4a;'
This is not a permissions issue, since I have some other (more complex) jobs with joins working reliably. I'm really not sure what the issue could be; any help would be appreciated.
I'm using 2 DPUs, but have tried 5. I also tried using a different temp directory. Also, there are hundreds of files, and some of them are very small (a few lines), but I'm not sure if that is relevant.
I believe the cause of this error is simply the number of files I'm attempting to load at the same time (and that the error itself is misleading). After disabling bookmarks, and using a subset of the data, things are working as expected.
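For reference, a minimal sketch of one way to start a run with bookmarks disabled (the job name is hypothetical; the same option can also be set in the job properties in the console):

import boto3

glue = boto3.client("glue")

# start a run of a hypothetical job with job bookmarks turned off
glue.start_job_run(
    JobName="my-simple-import-job",
    Arguments={"--job-bookmark-option": "job-bookmark-disable"},
)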
After reading this page:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
"Operational Differences and Considerations" -> "Direct writes to Amazon S3 eliminated" section.
I wonder: does this mean that writing to S3 from Hive in EMR 4.x versions will be faster than in 5.x versions?
If so, isn't that kind of a regression? Why would AWS want to eliminate this optimization?
Writing to a Hive table which is located in S3 is a very common scenario.
Can someone clear up that issue?
This optimization was originally developed by Qubole and pushed to Apache Hive.
See here.
This feature is rather dangerous because it bypasses Hive's fault-tolerance mechanism and also forces developers to use normally unnecessary intermediate tables, which in turn leads to performance degradation and increased cost.
A very common use case is when we need to merge incremental data into a partitioned target table, described here. The query is an insert overwrite of the table from itself; without an intermediate table (in a single query) it is rather efficient. The query can be much more complex, with many tables joined. This is what happens with direct writes enabled in this use case:
The partition folder is deleted before the query finishes, which causes a FileNotFoundException: a mapper reading the same table that is being written fails because the partition folder was deleted before the mapper executed.
If the target table is initially empty, the first run succeeds because Hive knows there are no partitions and does not read the folder. The second run fails for the same reason: the folder is deleted before the mapper finishes.
The known workaround has a performance impact. Loading data incrementally is quite a common use case. The direct-write-to-S3 feature forces developers to use a temporary table in this case to eliminate the FileNotFoundException and table corruption. As a result, we do this task much more slowly and at much higher cost than if this feature were disabled and we were writing the target table from itself.
After the first failure, a successful restart is impossible: the table is neither selectable nor writable, because the Hive partition exists in the metadata but the folder does not exist, and this causes a FileNotFoundException in other queries against this table, even ones that are not overwriting it.
The same is described in less detail on the Amazon page you are referring to: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
Another possible issue is described on the Qubole page; an existing fix using prefixes is mentioned, though it does not work for the use case above, because writing new files into a folder that is being read will still create a problem.
Also, mappers and reducers may fail and restart, and the whole session may fail and restart, so writing files directly, even with postponed deletion of the old ones, does not seem like a good idea because it increases the chance of unrecoverable failure or data corruption.
To disable direct writes, set this configuration property:
set hive.allow.move.on.s3=true; --this disables direct write
You can use this feature for small tasks and when you are not reading the same table that is being written, though for small tasks it will not give you much. This optimization is most efficient when you are rewriting many partitions in a very big table and the move task at the end is extremely slow; then you may want to enable it, at the risk of data corruption.