I have enabled CloudFront logs and want to capture some information from them.
For that I am using AWS Athena to query the CloudFront logs, which works fine for my staging environment because it stores the query results in my staging bucket.
For production I have created another table that queries the CloudFront log files from the production bucket, and I want to store those results in a different S3 bucket (the production bucket). But I am not finding any way to set a different output location for query results. Is there any way I can set a different output result location for different tables?
I assume you are talking about the QueryResultLocation in S3:
QueryResultsLocationInS3 is the query result location specified either by workgroup settings or client-side settings.
You can find more detailed information on how to set the location in Specifying a Query Result Location:
The query result location that Athena uses is determined by a combination of workgroup settings and client-side settings. Client-side settings are based on how you run the query.
If you run the query using the Athena console, the Query result location entered under Settings in the navigation bar determines the client-side setting.
If you run the query using the Athena API, the OutputLocation parameter of the StartQueryExecution action determines the client-side setting.
If you use the ODBC or JDBC drivers to run queries, the S3OutputLocation property specified in the connection URL determines the client-side setting.
If you are creating the tables with a CREATE TABLE AS (CTAS) statement, you can set the result location for that specific table in the WITH clause, e.g.:
CREATE TABLE table_name
WITH (
    external_location = 's3://my-bucket/tables/table_name/'
)
AS
SELECT * FROM database.my_table
Unfortunately this needs to be an empty location for the query to work. You would need to delete all the files in that location if you wanted to re-run this query, or change the location. See the Athena documentation for more details.
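For what it's worth, if you need to re-run such a CTAS query, a rough sketch of the sequence looks like this (table and bucket names are just the placeholders from the example above; note that Athena's DROP TABLE removes only the table metadata, not the files in S3):

-- Drop the CTAS table; this removes the table metadata but not the S3 objects.
DROP TABLE IF EXISTS table_name;

-- Delete the objects under s3://my-bucket/tables/table_name/ separately
-- (S3 console, CLI, or a lifecycle rule), then run the CTAS again.
CREATE TABLE table_name
WITH (
    external_location = 's3://my-bucket/tables/table_name/'
)
AS
SELECT * FROM database.my_table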
Related
I am trying to migrate a database using AWS DMS. The source is Azure SQL Server and the destination is Redshift. Is there any way to know which rows were updated or inserted? We don't have any audit columns in the source database.
Redshift doesn't track changes, and you would need audit columns to do this at the user level. You may be able to deduce this from Redshift query history and saved data input files, but that will be solution dependent. Query history can be obtained in a couple of ways, but both require some action. The first is to review the query logs, but these are only saved for a few days; if you need to look back further than that, you need a process to save these tables so the information isn't lost. The other is to turn on Redshift logging to S3, but this would need to be enabled before you run the queries on Redshift. There may be some logging from DMS that could be helpful, but the bottom-line answer is that row-level change tracking is not enabled in Redshift by default.
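To illustrate the query-log option, a rough sketch of the kind of query you could run against the Redshift system tables is below ('target_table' is a placeholder, and the STL tables only retain a few days of history):

-- Recent queries that inserted rows into a given table.
SELECT q.query,
       q.starttime,
       SUM(i.rows) AS rows_inserted,
       q.querytxt
FROM stl_insert i
JOIN stl_query q ON q.query = i.query
WHERE i.tbl = (SELECT table_id FROM svv_table_info WHERE "table" = 'target_table')
GROUP BY q.query, q.starttime, q.querytxt
ORDER BY q.starttime DESC;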
I have enabled logging on my GCP PostgreSQL 11 Cloud SQL database. The logs are being redirected to a bucket in the same project and they are in a JSON format.
The logs contain queries which were executed on the database. Is there a way to create a decent report from these JSON logs with a few fields from the log entries? Currently the log files are in JSON and not very reader friendly.
Additionally, if a multi-line query is run, that many log entries are created for that query. If there is also a way to recognize which log entries belong to the same query, that would be helpful, too!
I guess the easiest way is using BigQuery.
BigQuery will import those JSONL files properly and assign sensible field names for the JSON data.
When you have multi-line queries, you'll see that they appear as multiple log entries in the JSON files.
It looks like all entries from a multi-line query have the same receiveTimestamp (which makes sense, since they were produced at the same time).
Also, the insertId field has an 's=xxxx' subfield that does not change across the lines of the same statement. For example:
insertId: "s=6657e04f732a4f45a107bc2b56ae428c;i=1d4598;b=c09b782e120c4f1f983cec0993fdb866;m=c4ae690400;t=5b1b334351733;x=ccf0744974395562-0#a1"
The strategy to extract the statements in the right line order (sketched in the query after this list) is:
Sort by the 's' field in insertId
Then sort by receiveTimestamp ascending (to get all the lines sent at once to the syslog agent in the cloudsql service)
And finally sort by timestamp ascending (to get the line ordering right)
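A rough sketch of that in BigQuery SQL, assuming the log files were loaded into a table called my_dataset.cloudsql_logs and that the query text ends up in textPayload (verify the field names against your own export):

-- Group lines by the s= subfield of insertId, then order them within each statement.
SELECT
  REGEXP_EXTRACT(insertId, r's=([^;]+)') AS statement_id,
  receiveTimestamp,
  timestamp,
  textPayload
FROM my_dataset.cloudsql_logs
ORDER BY statement_id, receiveTimestamp, timestamp;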
I am interested in performance testing my query in Redshift.
I would like to disable the query from using any cached results from prior queries. In other words, I would like the query to run from scratch. Is it possible to disable cached results only for the execution of my query?
I would not like to disable cached results for the entire database/all queries.
SET enable_result_cache_for_session TO OFF;
From enable_result_cache_for_session - Amazon Redshift:
Specifies whether to use query results caching. If enable_result_cache_for_session is on, Amazon Redshift checks for a valid, cached copy of the query results when a query is submitted. If a match is found in the result cache, Amazon Redshift uses the cached results and doesn’t execute the query. If enable_result_cache_for_session is off, Amazon Redshift ignores the results cache and executes all queries when they are submitted.
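As a sketch, a typical performance-test session might look like this ('my_table' is a placeholder; svl_qlog.source_query is non-NULL only when a result came from the cache, which is a handy sanity check):

-- Turn result caching off for this session only.
SET enable_result_cache_for_session TO OFF;

-- Run the query being benchmarked.
SELECT count(*) FROM my_table;

-- Optional check: source_query is NULL when the result was not served from the cache.
SELECT query, elapsed, source_query
FROM svl_qlog
ORDER BY query DESC
LIMIT 5;

-- Re-enable caching when done (the setting only lasts for the session anyway).
SET enable_result_cache_for_session TO ON;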
I ran across this during a benchmark today and wanted to add an alternative. The benchmark tool I was using has a setup and teardown, but they don't run in the same session/transaction, so the enable_result_cache_for_session setting had no effect. So I had to get a little clever.
From the Redshift documentation:
Amazon Redshift uses cached results for a new query when all of the following are true:
The user submitting the query has access permission to the objects used in the query.
The table or views in the query haven't been modified.
The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.
The query doesn't reference Amazon Redshift Spectrum external tables.
Configuration parameters that might affect query results are unchanged.
The query syntactically matches the cached query.
In my case, I just added a GETDATE() column to the query to force it to not use the result cache on each run.
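In other words, the trick is to make the query violate one of the conditions above. A minimal sketch (my_table is a placeholder):

-- The volatile GETDATE() call prevents a result-cache match,
-- so the query is executed from scratch on every run.
SELECT my_table.*, GETDATE() AS cache_buster
FROM my_table;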
How can I create a query result location specific to a database/table when I am querying data using the AWS console? At the moment I see "Query results location" under Settings, but it seems to be a global setting that applies to all databases. Is there a way to tweak the table creation script below, or use any other mechanism, to specify a database/table-specific results location when querying via the AWS console?
My query:
CREATE EXTERNAL TABLE IF NOT EXISTS SAMPLE (
    customer_id string,
    customer_name string
)
STORED AS PARQUET
LOCATION 's3://<bucket>/files/';
The closest you can get is to use the Athena workgroups feature. Create a new workgroup and set its query result location, then use this workgroup when running queries from the console.
It won't be something you would do per query, but I'm not sure if that is what you're asking for.
Running queries from the console is more limited than using the API. As you probably know, when using the StartQueryExecution API call you can specify any S3 location as the output location for that specific query, but the console is an application that abstracts all of that away and is not really built to expose the full API.
I am configuring AWS Data Pipeline to load a Redshift table with data from a JSON file in S3.
I am using RedshiftActivity, and everything was fine until I tried to configure the KEEP_EXISTING load method. I really do not want to truncate my table with each load, but rather keep the existing information and add new records.
RedshiftActivity seems to require a PRIMARY KEY defined on the table in order to work (OK)... now it is also asking me to configure a DISTRIBUTION KEY, but I am interested in EVEN distribution, and it seems that a DISTRIBUTION KEY cannot be used together with the EVEN distribution style.
Can I simulate EVEN distribution using a distribution key?
Thanks.
I don't bother with a primary key when creating tables in Redshift. For the distkey, you ideally want to pick a field whose values are randomly distributed.
In your case of incremental insertion, what I normally do is just use SQLActivity to copy the data from S3 to a staging table in Redshift. Then I perform the update/insert/dedup and whatever other steps the business logic requires. Finally I drop the staging table. Done.
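A rough sketch of that staging pattern, with placeholder table, column, and role names, might look like:

-- 1. Load the new data from S3 into a staging table.
CREATE TEMP TABLE stage_events (LIKE events);

COPY stage_events
FROM 's3://<bucket>/incoming/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
FORMAT AS JSON 'auto';

-- 2. Upsert: delete the rows being replaced, then insert everything from staging.
BEGIN;

DELETE FROM events
USING stage_events
WHERE events.event_id = stage_events.event_id;

INSERT INTO events
SELECT * FROM stage_events;

END;

-- 3. The temp staging table disappears at the end of the session
--    (drop it explicitly if you use a regular table instead).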