How do I clear the cache in AWS Redshift? [duplicate]

I am interested in performance testing my query in Redshift.
I would like to prevent the query from using any cached results from prior queries. In other words, I would like the query to run from scratch. Is it possible to disable cached results only for the execution of my query?
I do not want to disable cached results for the entire database or for all queries.

SET enable_result_cache_for_session TO OFF;
From enable_result_cache_for_session - Amazon Redshift:
Specifies whether to use query results caching. If enable_result_cache_for_session is on, Amazon Redshift checks for a valid, cached copy of the query results when a query is submitted. If a match is found in the result cache, Amazon Redshift uses the cached results and doesn’t execute the query. If enable_result_cache_for_session is off, Amazon Redshift ignores the results cache and executes all queries when they are submitted.
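For completeness, here is a minimal sketch of how this looks when driven from code, assuming the redshift_connector Python driver and placeholder connection details and table name; the important point is that the SET statement and the benchmarked query must run on the same session (connection):

import time
import redshift_connector  # Amazon Redshift Python driver

# Placeholder connection details -- replace with your own cluster settings.
conn = redshift_connector.connect(
    host="my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="my_password",
)
conn.autocommit = True  # run statements outside an explicit transaction
cursor = conn.cursor()

# Disable the result cache for this session only; other sessions are unaffected.
cursor.execute("SET enable_result_cache_for_session TO OFF;")

# Time the query without the result cache (hypothetical query).
start = time.time()
cursor.execute("SELECT count(*) FROM my_table;")
print(cursor.fetchall(), f"elapsed: {time.time() - start:.2f}s")

cursor.close()
conn.close()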

Ran across this during a benchmark today and wanted to add an alternative to this. The benchmark tool I was using has a setup and teardown, but they don't run in the same session/transaction, so the enable_result_cache_for_session setting was having no effect. So I had to get a little clever.
From the Redshift documentation:
Amazon Redshift uses cached results for a new query when all of the following are true:
The user submitting the query has access permission to the objects used in the query.
The table or views in the query haven't been modified.
The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.
The query doesn't reference Amazon Redshift Spectrum external tables.
Configuration parameters that might affect query results are unchanged.
The query syntactically matches the cached query.
In my case, I just added a GETDATE() column to the query to force it to not use the result cache on each run.
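As a small sketch of that trick (the table and column names are hypothetical), appending a GETDATE() column is enough to make Redshift treat every run as a fresh query, even when the tool running it doesn't let you control the session:

# Hypothetical benchmarked query.
base_query = "SELECT customer_id, sum(amount) FROM sales GROUP BY customer_id"

# GETDATE() must be re-evaluated on every run, so Redshift will not serve
# this query from the result cache (see the caching rules quoted above).
uncached_query = base_query.replace("SELECT", "SELECT GETDATE() AS cache_buster,", 1)

print(uncached_query)
# SELECT GETDATE() AS cache_buster, customer_id, sum(amount) FROM sales GROUP BY customer_id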

Related

Set different output result location for different tables in AWS Athena

I have enabled CloudFront logs and want to capture some information through them.
For that I am using AWS Athena to query the CloudFront logs, which is working fine for my staging environment, as it stores the query results in my staging bucket.
For production I have created another table that queries the CloudFront log files from the production bucket, and I want to store those results in a different S3 bucket (the production bucket). But I cannot find any way to set a different output query result location. Is there any way I can set a different output result location for different tables?
I assume you are talking about the QueryResultLocation in S3:
QueryResultsLocationInS3 is the query result location specified either by workgroup settings or client-side settings.
You can find more detailed information on how to set the location in Specifying a Query Result Location:
The query result location that Athena uses is determined by a combination of workgroup settings and client-side settings. Client-side settings are based on how you run the query.
If you run the query using the Athena console, the Query result location entered under Settings in the navigation bar determines the client-side setting.
If you run the query using the Athena API, the OutputLocation parameter of the StartQueryExecution action determines the client-side setting.
If you use the ODBC or JDBC drivers to run queries, the S3OutputLocation property specified in the connection URL determines the client-side setting.
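For the API route, a minimal boto3 sketch might look like the following (the database, query, and bucket names are placeholders); the point is that OutputLocation inside ResultConfiguration is chosen per StartQueryExecution call, so queries against different tables or environments can write their results to different buckets:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM production_cf_logs LIMIT 10",  # placeholder query
    QueryExecutionContext={"Database": "my_database"},         # placeholder database
    ResultConfiguration={
        # Results for this query go to the production bucket; a staging query
        # could simply pass a different OutputLocation instead.
        "OutputLocation": "s3://my-production-results-bucket/athena/"
    },
)
print(response["QueryExecutionId"])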
If you are creating the tables with a CREATE TABLE AS statement, you can set the result location for that specific table in the WITH clause, e.g.:
CREATE TABLE table_name
WITH (
    external_location = 's3://my-bucket/tables/table_name/'
)
AS
SELECT * FROM database.my_table;
Unfortunately this needs to be an empty location for the query to work. You would need to delete all the files in that location if you wanted to re-run this query, or change the location. See the Athena documentation for more details.

AWS Athena - Query over large external table generated from Glue crawler?

I have a large set of historical log files on AWS S3 that add up to billions of lines.
I used a Glue crawler with a grok deserializer to generate an external table in Athena, but querying it has proven to be unfeasible.
My queries have timed out and I am trying to find another way of handling this data.
From what I understand, through Athena, external tables are not actual database tables, but rather, representations of the data in the files, and queries are run over the files themselves, not the database tables.
How can I turn this large dataset into a query friendly structure?
Edit 1: For clarification, I am not interested in reshaping the log files going forward; those are taken care of. Rather, I want a way to work with the current set of files I have on S3. I need to query these old logs, and in their current state it's impossible.
I am looking for a way to either convert these files into an optimal format or to take advantage of the current external table to make my queries feasible.
Right now, by default of the crawler, the external tables are only partitioned by day and instance; my grok pattern explodes the formatted logs into a couple more columns that I would love to repartition on, if possible, which I believe would make my queries easier to run.
Your WHERE condition should be on partitions (at least one condition). By raising a support ticket, you may be able to increase the Athena timeout. Alternatively, you may use Redshift Spectrum.
But you should seriously think about optimizing the query. The Athena query timeout is 30 minutes; it means your query ran for 30 minutes before it timed out.
By default Athena times out after 30 minutes. This timeout period can be increased by raising a support ticket with the AWS team. However, you should first optimize your data and query, as 30 minutes is enough time for executing most queries.
Here are a few tips to optimize the data that will give a major boost to Athena performance (a sketch illustrating the first two follows the article link below):
Use columnar formats like ORC/Parquet with compression to store your data.
Partition your data. In your case you can partition your logs based on year -> month -> day.
Create fewer, larger files per partition instead of many small files.
The following AWS article gives detailed information on performance tuning in Amazon Athena:
Top 10 Performance Tuning Tips for Amazon Athena
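To make the first two tips concrete, here is a hedged sketch (database, table, bucket, and column names are all hypothetical) of an Athena CREATE TABLE AS statement, submitted through boto3, that rewrites the raw crawled logs as compressed, partitioned Parquet:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# In Athena CTAS, the partition columns must come last in the SELECT list.
ctas = """
CREATE TABLE logs_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/optimized/logs_parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
)
AS
SELECT request_ip, status, user_agent, year, month, day
FROM my_database.raw_crawled_logs
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

Queries against the new table can then filter on the partition columns (for example WHERE year = '2019' AND month = '01' AND day = '15'), so Athena only reads the files it actually needs.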

Is intermediate IO in BigQuery queries charged?

I have a fairly complex Bigquery query and it seems to cost more than I expect. It has 97 intermediate stages... are those charged?
You can get the value of how much data will be scanned (and therefore charged) by your query using the --dry_run flag from the CLI, or by looking at the estimate shown in the query editor UI before you run the query.
The BigQuery pricing model is per byte read. To my understanding, at the moment, if you reference a table in multiple CTEs you will be charged for it once, but this might depend on how the query is written.
The best practice is to always use the --dry_run feature, which is very accurate.
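As a minimal sketch with the google-cloud-bigquery Python client (the query itself is just a placeholder), a dry run returns the number of bytes the query would read without actually running it or charging you:

from google.cloud import bigquery

client = bigquery.Client()

# dry_run estimates the bytes processed; use_query_cache=False makes the
# estimate reflect a full, uncached execution.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query_job = client.query(
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10",
    job_config=job_config,
)

print(f"This query would process {query_job.total_bytes_processed} bytes.")

The command-line equivalent is bq query --use_legacy_sql=false --dry_run 'SELECT ...'.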

AWS Athena disable cache

I am in the process of comparing the performance of CSV and Parquet files in AWS Athena.
To ensure that I do not get a considerable reduction in the execution times of two consecutive runs of the same query, I would like to make sure that the cache is disabled.
Do we know if there is a solution for this?
Or does Athena not even have a cache enabled by default?
How Athena configures the Presto engine behind the scenes is totally out of our control. I have thoroughly tested AWS Athena and, from my findings, it doesn't cache the data. I see that the same query executed consecutively takes a similar amount of time and scans a similar amount of data.
But Parquet should definitely give you better performance and less data scanned, for cost efficiency.
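One way to check this yourself during a benchmark (a sketch with placeholder database, bucket, and query) is to read the execution statistics Athena reports for each run; if consecutive identical runs scan roughly the same number of bytes and take roughly the same time, nothing was served from a cache:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run_and_measure(sql):
    """Run a query and return (milliseconds, bytes scanned) from Athena's own statistics."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    stats = execution["Statistics"]
    return stats["EngineExecutionTimeInMillis"], stats["DataScannedInBytes"]

sql = "SELECT count(*) FROM my_csv_table"  # placeholder query
print(run_and_measure(sql))
print(run_and_measure(sql))  # similar numbers on the second run suggest no result caching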

DAX object cache and query cache get out of sync; no way to tell query cache to evict bad data?

According to the DynamoDB DAX documentation, DAX maintains two separate caches: one for objects and one for queries. Which is OK, I guess.
Trouble is, if you change an object and the changed value of the object should impact a value stored in the query cache, there appears to be no way to inform DAX about it, meaning that the query cache will be wrong until its TTL expires.
This is rather limiting and there doesn't appear to be any easy way to work around it.
Someone tell me I don't know what I'm talking about and there is a way to advise DAX to evict query cache values.
I wish there were a better answer, but unfortunately there is currently no way to update the query cache values other than TTL expiry. The item cache values are immediately updated by any Put or Update requests made through DAX, but not if changes are made directly to DynamoDB.
However, keep in mind that the key for the query cache is the full request; thus changing any field in the request triggers a cache miss. Obviously, this is not a solution, but it could be an option (hack) to work around the current limitation.
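To illustrate the cache-key behaviour, here is a sketch only; the endpoint, table, keys, and the client interface (assumed to mirror the low-level DynamoDB client, as in the AWS TryDax samples for the amazondax package) are all assumptions. Two byte-for-byte identical Query requests can be answered from the query cache, while changing any parameter, even one as minor as Limit, produces a different cache key and forces a round trip to DynamoDB:

import botocore.session
from amazondax import AmazonDaxClient

session = botocore.session.get_session()
# Placeholder DAX cluster endpoint and region.
dax = AmazonDaxClient(session, region_name="us-east-1",
                      endpoints=["my-dax.abc123.dax-clusters.us-east-1.amazonaws.com:8111"])

request = dict(
    TableName="orders",                                   # hypothetical table
    KeyConditionExpression="customer_id = :cid",
    ExpressionAttributeValues={":cid": {"S": "c-42"}},
    Limit=100,
)

dax.query(**request)    # first call: cache miss, result stored in the query cache
dax.query(**request)    # identical request: can be answered from the query cache
request["Limit"] = 101  # any changed field => different cache key
dax.query(**request)    # goes back to DynamoDB (and is cached under the new key)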
As per the DynamoDB documentation, you have to pass your update request through DAX.
DAX supports the following write operations: PutItem, UpdateItem, DeleteItem, and BatchWriteItem. When you send one of these requests to DAX, it does the following:
DAX sends the request to DynamoDB.
DynamoDB replies to DAX, confirming that the write succeeded.
DAX writes the item to its item cache.
DAX returns success to the requester.
If a write to DynamoDB fails for any reason, including throttling, then the item will not be cached in DAX and the exception for the failure will be returned to the requester. This ensures that data is not written to the DAX cache unless it is first written successfully to DynamoDB.
So instead of calling UpdateItem on DynamoDB directly, send the UpdateItem request through DAX.
To dig deeper, you can refer to this link.
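A minimal sketch of that write path (the endpoint, table, and attribute names are placeholders, and the amazondax client is assumed to expose the same update_item call as the low-level DynamoDB client): sending UpdateItem through the DAX client both performs the write in DynamoDB and refreshes the DAX item cache.

import botocore.session
from amazondax import AmazonDaxClient

session = botocore.session.get_session()
# Placeholder DAX cluster endpoint.
dax = AmazonDaxClient(session, region_name="us-east-1",
                      endpoints=["my-dax.abc123.dax-clusters.us-east-1.amazonaws.com:8111"])

# Write through DAX: DynamoDB is updated first, then the item cache.
dax.update_item(
    TableName="orders",                                    # hypothetical table
    Key={"customer_id": {"S": "c-42"}, "order_id": {"S": "o-1001"}},
    UpdateExpression="SET order_status = :s",
    ExpressionAttributeValues={":s": {"S": "SHIPPED"}},
)

Note that this refreshes the item cache, but existing query cache entries that include this item still remain stale until their TTL expires, as discussed above.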