What to do with Athena Results Files? - amazon-athena

I'm newer to AWS and working with Athena for the first time. Would appreciate any help/clarification.
I set the query results location to s3://aws-athena-query-results-{ACCOUNTID}-{Region}, and I can see that whenever I run a query, whether from the console or externally elsewhere, the two result files are created as expected.
However, my question is: what am I supposed to do with these files long term? What are some recommendations on rotating them? From what I understand, these are the query results (the other is a metadata file) that contain the results of the user's query and are passed back to them. What are the recommendations for managing the files in the query results bucket? I don't want to just let them accumulate and come back to a million files, if that makes sense.
I did search through the docs and couldn't find info on the above topic, maybe I missed it? Would appreciate any help!
Thanks!

From the documentation,
You can delete metadata files (*.csv.metadata) without causing errors,
but important information about the query is lost
The query result files can be safely deleted if you don't need to refer back to a query that ran on a particular date in the past and the result it returned. If you have deleted the result files from the S3 bucket, then trying to download the result from Athena's "History" tab will just give you an error message that the result file is not available.
In summary, it's up to your use case: can you afford to re-run the same query in the future if required, or do you want to be able to pull the result from a past run's history?
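If you don't want to keep results beyond a retention window at all, a common approach is to put an S3 lifecycle rule on the query results bucket so old objects expire automatically. Below is a minimal boto3 sketch, assuming a 30-day retention and the bucket naming used above (both are placeholders to adjust):

import boto3

s3 = boto3.client("s3")

# Expire Athena query result files (and their .metadata files) after 30 days.
# The bucket name and retention period are placeholders -- adjust to your setup.
s3.put_bucket_lifecycle_configuration(
    Bucket="aws-athena-query-results-ACCOUNTID-REGION",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-athena-query-results",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)

The same rule can also be set up by hand in the S3 console (Management > Lifecycle rules) if you'd rather not run any code.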

Related

DBT - write dbt test --store-failures to a specific table in my data warehouse

Good afternoon,
I want to write the dbt test values to a specific table in my data warehouse.
I have tried multiple schema settings in all the possible .yml files, and I am not finding the correct place to specify which database I want the test results recorded in.
Right now it always fails with an error about not having permission to perform the glue:CreateDatabase action, which is not what I want to do; I want it to write to a table I specify.
To conclude, what I am asking here is: how can I specify where the dbt test results are written, instead of letting dbt create and store them in the default schemas?
If somebody could help me on this I would really appreciate it!
Well, I managed to fix it. I was about to throw my computer away because I had run out of ideas, but basically you can specify the target schema in dbt_project.yml. To do that, just add the following to dbt_project.yml:
tests:
  +store_failures: true
  +schema: "schema that you want"
I did not find any information in the dbt documentation or the community forums, so it was just an iterative process of testing possible approaches.

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory directly. So I have created a bunch of small files in S3 and am writing a script that can read these files and merge them. I am using AWS Wrangler (awswrangler) to do this.
My code is as follows:
import logging

import awswrangler as wr

logger = logging.getLogger(__name__)

try:
    # input_folder and target_path are defined elsewhere in the script
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file, and I can't remove chunked=True because otherwise my program fails with OOM.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post, it's not possible even in a local folder, let alone S3. On top of that, Parquet is compressed, and I don't think it would be easy to append to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue, and I think I'm going to just loop over all the small files and create bigger ones. For example, you could concatenate several dataframes together and then rewrite those, but you won't be able to get back to a single Parquet file unless you get a machine with enough RAM; a sketch of this batching approach follows below.
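As a rough illustration of that batching idea, here is a minimal sketch using awswrangler and pandas. It assumes the small files live under input_folder, merged files go under target_path, and that batch_size small files at a time fit comfortably in memory; all three names are placeholders, not values from the question.

import awswrangler as wr
import pandas as pd

input_folder = "s3://my-bucket/small-files/"  # placeholder
target_path = "s3://my-bucket/merged/"        # placeholder
batch_size = 50                               # small files per merged file

# List the small Parquet objects and merge them in fixed-size batches.
keys = wr.s3.list_objects(input_folder, suffix=".parquet")
for i in range(0, len(keys), batch_size):
    batch = keys[i:i + batch_size]
    df = wr.s3.read_parquet(path=batch, use_threads=True)
    # With dataset=False (the default), to_parquet writes exactly one object
    # at the given path, so each batch becomes a single, larger file.
    wr.s3.to_parquet(df=df, path=f"{target_path}merged_{i // batch_size:05d}.parquet")

This only reduces the file count rather than producing one single Parquet file, since each merged file is still capped by how many batches fit in memory.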

Power BI - "Expression.Error: The column of the table wasn't found

This error pops up randomly about a day after I import data to Power BI using the Supermetrics API, when I try to refresh.
For example, with Adobe Analytics data, when I try to refresh my dashboard a day later, I sometimes get this error saying a column of the table wasn't found.
When I click on "Go to error" it takes me to the Changed Type step, but I see that the code remains the same as when I first imported it.
Going to the Navigation step, I see that the data has changed, but I don't know why, since I haven't changed anything since I pulled the data into Power BI.
What it used to look like vs. what it looks like now (comparison screenshots not reproduced here).
I have tried seeking help from the Microsoft community, but with no luck, since most topics suggest changes to the code, which I haven't made.
This is 100% an issue with the import, and since it has only just come up now, it is due to a change in the source format.
Looking at the breakdown of your imported data, it seems as though you are accessing a web source? This makes me think that the HTML make-up of that webpage has changed.
Unfortunately, the easiest way of dealing with something like this is to rebuild the query from the ground up. If you are still struggling, let me know and I will look into it further.
Possibly, the 'Date' column is not arriving from the data feed (Supermetrics API) during your refresh. When you come across this issue, read the data feed in another Power BI window or check the source so you can confirm whether the 'Date' column is missing at that particular point in time.
Thanks.
For me, I found that this error was thrown when a column header name changed in my data source. You have probably already sorted it out by now, but for any future readers: I would double-check the source to make sure no changes were made that might affect this, and if the source is unchanged in its version history, then rebuild the query again. It sucks if this is due to a Power BI issue, but these two methods should solve 99% of the instances of this error.

AWS Glue Error "Path does not exist"

Every time I try to run some very simple jobs (importing JSON from S3 to Redshift) I get the following error:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
s3://my-temp-glue-dir/f316d46f-eaf3-497a-927b-47ff04462e4a;'
This is not a permissions issue, since I have some other, more complex jobs (with joins) working reliably. Really not sure what the issue could be - any help would be appreciated.
I'm using 2 DPUs, but have tried 5. I also tried using a different temp directory. Also, there are hundreds of files, and some of the files are very small (a few lines), but I'm not sure if that is relevant.
I believe the cause of this error is simply the number of files I'm attempting to load at the same time (and that the error itself is misleading). After disabling bookmarks and using a subset of the data, things are working as expected.
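For reference, job bookmarks can be disabled for a single run without editing the job definition by passing the --job-bookmark-option job parameter. A small boto3 sketch, where "my-glue-job" is a placeholder job name:

import boto3

glue = boto3.client("glue")

# Start a run with job bookmarks disabled for this run only.
glue.start_job_run(
    JobName="my-glue-job",  # placeholder
    Arguments={"--job-bookmark-option": "job-bookmark-disable"},
)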

Export single column from DynamoDB to csv (or the like)

My DynamoDB table is quite large and I don't particularly want to dump the whole thing. There is one column that I want to test on, so I would like a dump of all of its values that I could have locally to code/test with. However, I am not finding anything that lets me do this.
I found RazorSQL and it semi-worked (in the sense that it let me pull down just one column of information from the table, but it clearly didn't pull down all the data).
I also found a Data Pipeline template on AWS, but from what I can tell this will dump the entire table. I am relatively new to AWS, so it's possible I'm not understanding something about pipelines properly.
I'm okay with writing to S3, because I can pull down all the data from there, but anything that gets to my local machine is fine by me.
Thanks for the help!
UPDATE: This tutorial looks promising, but I want to achieve this non-interactively.
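For a non-interactive route, one option is a small boto3 script that scans the table with a ProjectionExpression so only the one attribute comes back, and writes the values to a local CSV. A rough sketch, where "MyTable" and "my_column" are placeholders for your table and attribute names:

import csv

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MyTable")  # placeholder table name

with open("my_column.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["my_column"])  # header row

    # Request only the one attribute and paginate with LastEvaluatedKey.
    kwargs = {
        "ProjectionExpression": "#c",
        "ExpressionAttributeNames": {"#c": "my_column"},  # placeholder attribute
    }
    while True:
        response = table.scan(**kwargs)
        for item in response["Items"]:
            if "my_column" in item:
                writer.writerow([item["my_column"]])
        if "LastEvaluatedKey" not in response:
            break
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

Note that a scan still reads every item on the server side (the projection only trims what is returned), so for a very large table an export to S3 may still be the cheaper option.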