Redshift: COPY command failure

While loading data into Redshift from S3 using the COPY command, I am getting the error below:
Undoing 1 transactions on table 456834 with current xid 9286221: 9286221
I tried changing the COPY command with various options, but no luck.
I checked the console and found that even COPY ANALYZE is failing.
I spent a lot of time on it but couldn't find a way to solve the issue. Any help would be highly appreciated.
Thanks,
S
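A common first diagnostic step for a failed COPY is to query Redshift's stl_load_errors system table, which records per-row load failures. A minimal sketch, assuming psycopg2 and placeholder connection details (the endpoint, database, and credentials below are examples, not values from the question):

# Inspect recent COPY failures recorded in Redshift's stl_load_errors system table.
# Connection details are placeholders -- substitute your own cluster endpoint.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT starttime, filename, line_number, colname, err_code, err_reason
        FROM stl_load_errors
        ORDER BY starttime DESC
        LIMIT 20;
    """)
    for row in cur.fetchall():
        print(row)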

Related

How do you clear the persistent storage for a notebook instance on AWS SageMaker?

So I'm running into the following error on AWS SageMaker when trying to save:
Unexpected error while saving file: untitled.ipynb [Errno 28] No space left on device
If I remove my notebook, create a new identical one and run it, everything works fine. However, I suspect the Jupyter checkpoint takes up too much space if I save the notebook while it's running, and that's why I'm running out of space. Sadly, getting more storage is not an option for me, so I'm wondering if there's any command I can use to clear the storage before running my notebook?
More specifically, clearing the persistent storage in the beginning and at the end of the training process.
I have googled like a maniac but there is no suggestion aside from "just increase the amount of storage bro" and that's why I'm asking the question here.
Thanks in advance!
If you don't want your data to persist across multiple notebook runs, just store it in /tmp, which is not persistent. You have at least 10 GB. More details here.
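For example, a minimal sketch of keeping scratch files under /tmp and wiping them at the start and end of training (the directory name is just an example):

# Keep scratch data under /tmp (not persisted across notebook restarts) and
# clean it up at the start and end of a training run.
import os
import shutil

SCRATCH_DIR = "/tmp/my-training-scratch"  # example path; anything under /tmp works

def reset_scratch():
    shutil.rmtree(SCRATCH_DIR, ignore_errors=True)  # drop any leftovers
    os.makedirs(SCRATCH_DIR, exist_ok=True)

reset_scratch()   # before training
# ... write checkpoints, intermediate files, etc. into SCRATCH_DIR ...
reset_scratch()   # after training, free the space again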
I had the exact same problem and was unable to find a decent answer to it online. However, I was fortunately able to resolve the issue.
I use an R kernel, so the solution might be slightly different.
You can check the storage by opening a terminal and typing df -kh.
You are likely mounted on /home/ec2-user/SageMaker and can see its "Size", "Used", "Avail", and "Use%".
There are hidden folders that function as a recycle bin. When I use the R command list.dirs(), it reveals a folder named ./.Trash-1000/ that kept a lot of random things that had supposedly been removed from storage.
I just deleted the folder with unlink('./.Trash-1000/', recursive = T) and the entire storage was freed.
Hope it helps.
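If you are on a Python kernel rather than R, the equivalent cleanup might look like this (same trash folder, assuming the default /home/ec2-user/SageMaker layout):

# Remove the hidden trash folder that can silently hold "deleted" files
# on the SageMaker volume. The path assumes the default EC2 home layout.
import shutil

trash_dir = "/home/ec2-user/SageMaker/.Trash-1000"
shutil.rmtree(trash_dir, ignore_errors=True)  # no error if the folder does not exist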

What to do with Athena Results Files?

Newer to AWS and working with Athena for the first time. Would appreciate any help/clarification.
I set the query results location to s3://aws-athena-query-results-{ACCOUNTID}-{Region}, and I can see that whenever I run a query, whether from the console or externally elsewhere, the two result files are created as expected.
However, my question is: what are we supposed to do with these files long term? What are some recommendations on rotating them? From what I understand, these are the query results (the other one is a metadata file) that contain the results of the user's query and are passed back to them. What are the recommendations on how to manage the files in the query results bucket? I don't want to just let them accumulate and come back to a million files, if that makes sense.
I did search through the docs and couldn't find info on this topic; maybe I missed it? Would appreciate any help!
Thanks!
From the documentation,
You can delete metadata files (*.csv.metadata) without causing errors,
but important information about the query is lost
The query result files can be safely deleted if you don't need to refer back to the query that ran at a particular date in the past and the result it returned. If you have deleted the result files from the S3 bucket and then try to download the result from Athena's "History" tab, it will just give you an error message saying the result file is not available.
In summary, it's up to your use case: can you afford to re-run the same query in the future if required, or do you just want to pull the result from a past run's history?
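One common approach, not mentioned in the answer above, is to put an S3 lifecycle rule on the query-results bucket so old result and metadata files expire automatically. A sketch using boto3; the bucket name and retention period are placeholders:

# Expire Athena query-result objects automatically after 30 days
# using an S3 lifecycle rule. Bucket name and retention are examples.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="aws-athena-query-results-123456789012-us-east-1",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-athena-results",
                "Filter": {"Prefix": ""},    # apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # adjust to your retention needs
            }
        ]
    },
)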

AWS Glue Error "Path does not exist"

Every time I try to run some very simple jobs (importing JSON on S3 to Redshift) I get the following error:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
s3://my-temp-glue-dir/f316d46f-eaf3-497a-927b-47ff04462e4a;'
This is not a permissions issue, since I have some other (more complex jobs with joins) working reliably. Really not sure what the issue could be - any help would be appreciated.
I'm using 2 DPUs, but have tried 5. I also tried using a different temp directory. Also, there are hundreds of files, and some of them are very small (a few lines), but I'm not sure if that is relevant.
I believe the cause of this error is simply the number of files I'm attempting to load at the same time (and that the error itself is misleading). After disabling bookmarks, and using a subset of the data, things are working as expected.
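If the number of small files is indeed the trigger, one option worth trying (an assumption, not something confirmed above) is Glue's file-grouping options when reading from S3, so the job works with fewer, larger chunks. A rough sketch; the paths, Redshift connection name, and group size are placeholders:

# Sketch of a Glue job that reads many small JSON files from S3,
# grouping them into larger chunks, then writes to Redshift.
# Paths, connection name, and sizes are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-input-bucket/json/"],  # placeholder input path
        "groupFiles": "inPartition",              # combine small files
        "groupSize": "134217728",                 # ~128 MB per group
    },
    format="json",
)

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",  # placeholder Glue connection
    connection_options={"dbtable": "my_table", "database": "mydb"},
    redshift_tmp_dir="s3://my-temp-glue-dir/",    # placeholder temp dir
)
job.commit()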

Hive cannot find file from distributed cache on EMR

I'm trying to run a UDF in Hive which should scan through an external CSV file, using a value from the table as another argument.
The query I use:
add jar s3://bucket_name/udf/hiveudf.jar;
add FILE hdfs:///myfile/myfile.csv;
CREATE TEMPORARY FUNCTION MyFunc AS '....udf.myUDF';
SELECT mydate, record_id, value, MyFunc('myfile.csv',value) from my_table;
The results are unstable: in some cases the exact same query works just fine, but in about 80% of cases it returns an exception:
java.io.FileNotFoundException: myfile.csv (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at java.io.FileReader.<init>(FileReader.java:58)
...
The file seems to have been added to the distributed cache:
hive> list files;
/mnt/tmp/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx_resources/myfile.csv
I tried various EMR releases as well as various instance types and couldn't find a pattern for what triggers this issue. Any advice will be highly appreciated.
You might enable DEBUG logging to find more info. In general, I've seen similar issues when a resize (shrink) of the EMR cluster caused certain blocks of the HDFS file backing the distributed cache to be removed from the cluster because of insufficient replication.
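One way to check whether missing or under-replicated blocks are the cause is to run an HDFS fsck against the cached file on the master node; a small sketch that simply shells out to the hdfs CLI (the path matches the one added in the query above):

# Health check of the HDFS file backing the distributed cache entry:
# run "hdfs fsck" to see whether any blocks are missing or under-replicated.
import subprocess

result = subprocess.run(
    ["hdfs", "fsck", "/myfile/myfile.csv", "-files", "-blocks", "-locations"],
    capture_output=True,
    text=True,
)
print(result.stdout)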

Why are direct writes to Amazon S3 eliminated in EMR 5.x versions?

After reading this page:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
"Operational Differences and Considerations" -> "Direct writes to Amazon S3 eliminated" section.
I wonder: does this mean that writing to S3 from Hive will be faster in EMR 4.x than in 5.x?
If so, isn't that a regression of sorts? Why would AWS want to eliminate this optimization?
Writing to a Hive table which is located in S3 is a very common scenario.
Can someone clear up that issue?
This optimization was originally developed by Qubole and pushed to Apache Hive.
See here.
This feature is rather dangerous because it bypasses Hive's fault-tolerance mechanism and also forces developers to use normally unnecessary intermediate tables, which in turn degrades performance and increases cost.
A very common use case is merging incremental data into a partitioned target table, as described here. The query is an INSERT OVERWRITE of the table from itself; without an intermediate table (in a single query) it is rather efficient. The query can be much more complex, with many tables joined. This is what happens with direct writes enabled in this use case:
1. The partition folder is deleted before the query finishes. This causes a FileNotFoundException: a mapper reading the same table that is being written fails because the partition folder is deleted before the mapper executes.
2. If the target table is initially empty, the first run succeeds because Hive knows there is no partition and does not read the folder. The second run fails because, see (1), the folder is deleted before the mapper finishes.
3. The known workaround has a performance impact. Loading data incrementally is quite a common use case. The direct-write-to-S3 feature forces developers to use a temporary table in this case to eliminate the FileNotFoundException and table corruption. As a result, the task becomes much slower and more costly than if the feature were disabled and we wrote the target table from itself.
4. After the first failure, a successful restart is impossible: the table is neither selectable nor writable, because the Hive partition exists in the metadata but the folder does not, and this causes a FileNotFoundException in other queries against this table, even ones that are not overwriting it.
The same is described, with less detail, on the Amazon page you are referring to: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
Another possible issue is described on the Qubole page; an existing fix using prefixes is mentioned, though it does not work with the use case above, because writing new files into a folder that is being read will create a problem anyway.
Also, mappers and reducers may fail and restart, and the whole session may fail and restart; writing files directly, even with postponed deletion of the old ones, does not seem like a good idea because it increases the chance of unrecoverable failure or data corruption.
To disable direct writes, set this configuration property:
set hive.allow.move.on.s3=true; --this disables direct write
You can use this feature for small tasks and when you are not reading the same table that is being written, though for small tasks it will not gain you much. This optimization is most efficient when you are rewriting many partitions in a very big table and the move task at the end is extremely slow; then you may want to enable it at the risk of data corruption.