AWS Glue Error "Path does not exist" - amazon-web-services

Every time I try to run some very simple jobs (importing JSON on S3 to Redshift), I get the following error:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
s3://my-temp-glue-dir/f316d46f-eaf3-497a-927b-47ff04462e4a;'
This is not a permissions issue, since I have some other, more complex jobs (with joins) working reliably. I'm really not sure what the issue could be - any help would be appreciated.
I'm using 2 DPUs, but have also tried 5. I also tried using a different temp directory. There are hundreds of files, and some of them are very small (a few lines), but I'm not sure if that is relevant.

I believe the cause of this error is simply the number of files I'm attempting to load at the same time (and that the error itself is misleading). After disabling bookmarks and using a subset of the data, things are working as expected.
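If reducing the input manually is not convenient, one alternative worth trying is Glue's file-grouping options, which combine many small S3 objects into larger read tasks. This is only a sketch under the assumption that the input is JSON under a single prefix; the bucket path and group size are placeholders, not values from the question:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Group many small S3 objects into larger read tasks so the job is not
# overwhelmed by hundreds of tiny files.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/raw-json/"],  # placeholder prefix
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "10485760",  # ~10 MB per group
    },
    format="json",
)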

Related

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory directly. So I have created a bunch of small files in S3, and I am writing a script that can read these files and merge them. I am using AWS Wrangler to do this.
My code is as follows:
import logging
import awswrangler as wr

logger = logging.getLogger(__name__)

try:
    # Read the small parquet files lazily, in chunks, to avoid loading everything into memory
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file. I also can't remove chunked=True, because otherwise my program fails with an OOM error.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or setting it to False should do the trick, as long as you are specifying a full path.
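For a single DataFrame that fits in memory, that would look something like the sketch below (the output key is a placeholder, and as the next answer points out, this does conflict with mode="append"):
import awswrangler as wr

# Without dataset=True, the full object key is used and exactly one file is written.
wr.s3.to_parquet(df=df, path="s3://my-bucket/output/merged.parquet")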
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset=True or it will throw an error. I believe that in this case, "append" has more to do with appending the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post, it's not possible even in a local folder, let alone S3. To add to this, Parquet is compressed, and I don't think it would be easy to append rows to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue, and I think I'm just going to loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to a single Parquet file unless you get a machine with enough RAM.
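A rough sketch of that batching idea, assuming awswrangler is available and using placeholder paths and a placeholder batch size:
import awswrangler as wr

input_folder = "s3://my-bucket/small-files/"  # placeholder
target_path = "s3://my-bucket/merged/"        # placeholder
batch_size = 50                               # tune to whatever fits in memory

# List the small parquet objects and merge them in fixed-size batches,
# so each output file is much larger but no single batch exceeds memory.
files = wr.s3.list_objects(input_folder, suffix=".parquet")
for i in range(0, len(files), batch_size):
    batch = files[i:i + batch_size]
    merged = wr.s3.read_parquet(path=batch)  # one DataFrame per batch
    wr.s3.to_parquet(df=merged, path=target_path, dataset=True, mode="append")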

AWS Glue ETL: Reading a huge JSON file to process, but got an OutOfMemory error

I am working on the AWS Glue ETL part, reading a huge JSON file (only testing with 1 file of around 9 GB) to work through the ETL process, but I got an error from AWS Glue of java.lang.OutOfMemoryError: Java heap space after it had been running and processing for a while.
My code and flow are as simple as:
df = spark.read.option("multiline", "true").json(f"s3/raw_path")
# ...
# and write as source_df to another object in S3
df.write.json(f"s3/source_path", lineSep=",\n")
From the error/log, it seems the container failed and was terminated while reading this huge file. I have already tried upgrading the worker type to G.1X with a small number of worker nodes; however, I would like to find another solution that doesn't amount to vertical scaling, i.e. simply increasing resources.
I am quite new to this area and service, so I want to keep cost and time as low as possible :-)
Thank you all in advance.
After looking into Glue and Spark, I found that to get the benefit of parallel processing across multiple executors, in my case I had to split the (large) file into multiple smaller files, and it worked! The files are then distributed across multiple executors.
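For context, reading with option("multiline", "true") means each JSON file has to be consumed as a whole, so a single 9 GB file cannot be split across executors. A minimal sketch of the split approach, assuming the data has been converted to newline-delimited JSON (one record per line) and using placeholder bucket paths:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-json-read").getOrCreate()

# With one JSON record per line (JSON Lines), the multiline option is unnecessary
# and Spark can split the input into many partitions that run in parallel.
df = spark.read.json("s3://my-bucket/raw_path_split/")

df.write.mode("overwrite").json("s3://my-bucket/source_path/")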

Redshift: COPY command failure

While loading the data into Redshift from S3 using the copy command, I am getting the below error:
Undoing 1 transactions on table 456834 with current xid 9286221: 9286221
I tried changing the COPY command with various options, but no luck.
I checked the console and found that even COPY ANALYZE is failing.
I spent a lot of time on it but couldn't find a way to solve the issue. Any help would be highly appreciated.
Thanks,
S

What to do with Athena Results Files?

Newer to AWS and working with Athena for the first time. Would appreciate any help/clarification.
I set the query results location to s3://aws-athena-query-results-{ACCOUNTID}-{Region}, and I can see that whenever I run a query, whether from the console or externally elsewhere, the two result files are created as expected.
However, my question is: what am I supposed to do with these files long term? What are some recommendations on rotating them? From what I understand, these are the query results (the other one is a metadata file) that contain the results of the user's query and are passed back to them. What are the recommendations for managing the files in the query results bucket? I don't want to just let them accumulate and come back to a million files, if that makes sense.
I did search through the docs and couldn't find info on the above topic - maybe I missed it? Would appreciate any help!
Thanks!
From the documentation,
You can delete metadata files (*.csv.metadata) without causing errors,
but important information about the query is lost
The query result files can be safely deleted if you don't need to refer back to a query that ran on a particular date in the past and the result it returned. If you have deleted the result files from the S3 bucket and then try to download the result from the Athena "History" tab, it will just give you an error message that the result file is not available.
In summary, it's up to your use case: can you afford to re-run the same query in the future if required, or do you want to be able to pull the result from past run history?
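If you do decide to clean them up periodically, one option (a hedged sketch only, not from the answer above; the bucket name and retention period are placeholders) is a small script, or equivalently an S3 lifecycle rule, that removes result objects older than a cutoff:
from datetime import datetime, timedelta, timezone

import boto3

BUCKET = "aws-athena-query-results-ACCOUNTID-Region"  # placeholder
RETENTION_DAYS = 30                                   # placeholder

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

# Walk the results bucket and delete objects (result files and *.csv.metadata files)
# whose LastModified timestamp is older than the retention window.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if obj["LastModified"] < cutoff:
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])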

Hive cannot find file from distributed cache on EMR

I'm trying to run a UDF in Hive which should scan through an external CSV file, using a value from the table as another argument.
The query I use:
add jar s3://bucket_name/udf/hiveudf.jar;
add FILE hdfs:///myfile/myfile.csv;
CREATE TEMPORARY FUNCTION MyFunc AS '....udf.myUDF';
SELECT mydate, record_id, value, MyFunc('myfile.csv',value) from my_table;
Results are unstable: in some cases the exact same query works just fine, but in about 80% of cases it returns an exception:
java.io.FileNotFoundException: myfile.csv (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at java.io.FileReader.<init>(FileReader.java:58)
...
The file seems to be added to the distributed cache:
hive> list files;
/mnt/tmp/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx_resources/myfile.csv
I tried it with various releases of EMR as well as with various instance types and couldn't find a pattern for what triggers this issue. Any advice will be highly appreciated.
You might enable DEBUG logging to find more info. But in general, I've seen similar issues when there was a resize (shrink) of the EMR cluster, causing certain blocks of the expected HDFS distributed cache file to be removed from the cluster because of insufficient replication.