DB / TABLE specific query results location - amazon-web-services

How can I create a query result location specific to a database/table when querying data using the AWS console? At the moment I see "Query result location" in Settings, but it appears to be a global setting that applies to all databases. Is there a way to tweak the table creation script below, or to use some other mechanism, to specify a database/table-specific results location when querying from the AWS console?
My query:
CREATE EXTERNAL TABLE IF NOT EXISTS SAMPLE (
customer_id String,
customer_name String
)
STORED AS PARQUET
LOCATION 's3://<bucket>/files/';

The closest you can get is to use the Athena workgroups feature. Create a new workgroup, set its query result location, and select that workgroup when running queries from the console.
It won't be something you do per query, but I'm not sure if that is what you're asking for.
Running queries from the console is more limited than using the API. With the StartQueryExecution API call you can set the output location for that specific query to any location on S3, but the console is an application that abstracts all of that away and is not really built to expose the full API.
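For completeness, here is a minimal sketch of that API route using boto3 (the database name and bucket path are placeholders, not values from the question):

import boto3

athena = boto3.client("athena")

# Per-query output location via the API; the database name and the
# s3:// path below are placeholders.
response = athena.start_query_execution(
    QueryString="SELECT * FROM sample LIMIT 10",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-results-bucket/sample-table/"},
    # Alternatively, drop ResultConfiguration and pass WorkGroup="my-workgroup"
    # to inherit that workgroup's result location instead.
)
print(response["QueryExecutionId"])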

Related

Is it possible to retrieve the Timestream schema via the API?

The AWS console for Timestream can show you the schema for tables, but I cannot find any way to retrieve this information programmatically. While the schema is created dynamically from write requests, it would be useful to check the schema to validate that writes are going to generate the same measure types.
As shown in this document, you can run the command below and it will return the table's schema.
DESCRIBE "database"."tablename"
With the right profile set up in your environment, you can run this query from the CLI with:
aws timestream-query query --query-string 'DESCRIBE "db-name"."tb-name"' --no-paginate
This should return the table schema.
You can also retrieve the same data from code in TypeScript/JavaScript, Python, Java, or C# by using the Timestream Query client from the AWS SDK.
NOTE:
If the table is empty, this query returns an empty result, so make sure you have data in your db/table before querying its schema. Good luck!
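As a small sketch of the SDK route in Python (boto3), with the same placeholder database/table names as the CLI example:

import boto3

# Timestream Query client; database and table names are placeholders.
tsq = boto3.client("timestream-query")
resp = tsq.query(QueryString='DESCRIBE "db-name"."tb-name"')

# Each returned row describes one column of the table (name, type, attribute role).
for row in resp["Rows"]:
    print([cell.get("ScalarValue") for cell in row["Data"]])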

How to fetch the latest schema change in BigQuery and restore deleted column within 7 days

Right now I fetch the columns and data types of BQ tables via the query below:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command ALTER TABLE User DROP COLUMN blabla,
the column blabla is not actually deleted for 7 days (the TTL), according to the official documentation.
If I use the query above, the column is still present in the schema as well as in the table Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.
It is just that I cannot insert data into that column or view it in the GCP console. This inconsistency really causes an issue.
I want to write a bash script to monitor schema changes and act on them, so I need more visibility into the table schema in BigQuery. The least I would need is:
for Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS to expose a flag column that indicates the column is deleted, or its TTL (7 days).
My questions are:
How can I fetch the correct schema in Spanner which reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to fetch the recently deleted column, you can try searching through Cloud Logging. I'm not sure what tools Spanner supports, but if you want to use Bash you can use gcloud to fetch the logs, though it will be difficult to parse the output and extract the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob (an ALTER TABLE is treated as an InsertJob) and filters them on the actual query text containing drop. The regex I used is not strict (for the sake of example); I suggest making it stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
Sample output from the command above (screenshot omitted): the logged query shows the column PADDING being dropped.
If you have options other than Bash, I suggest that you create a BigQuery sink for your logs; you can then run queries there to get this information. You can also use client libraries (Python, Node.js, etc.) to either query the sink or query Cloud Logging directly.
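A rough sketch of that client-library route in Python, assuming the google-cloud-logging package and reusing the (deliberately loose) filter from the gcloud command above:

from google.cloud import logging

client = logging.Client()

# Same filter as the gcloud command above; tighten the regex for real use.
log_filter = (
    'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" '
    'AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    # entry.payload holds the audit-log protoPayload as a dict; inspect its
    # nesting before relying on specific keys.
    print(entry.timestamp, entry.payload)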
As per this SO answer, you can use the time travel feature of BigQuery to query the deleted column. That answer also explains BigQuery's behavior of retaining the deleted column for 7 days, and a workaround to delete the column instantly. See the linked answer for the actual query used to retrieve the deleted column and for the workaround on deleting a column.
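To illustrate the time-travel idea, a hedged sketch only (the project/dataset names, the dropped column name, and the one-hour offset are assumptions; the snapshot time must predate the ALTER TABLE):

from google.cloud import bigquery

client = bigquery.Client()

# Query a snapshot of the table from before the column was dropped.
# `my-project.my_dataset.User` and the column blabla are placeholders.
sql = """
SELECT blabla
FROM `my-project.my_dataset.User`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""

for row in client.query(sql).result():
    print(dict(row))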

Extract script with which data was inserted into an old table and the owner in BigQuery - GCP

I am trying to find out what query was run to insert data into an empty table created earlier.
It would also be useful to know the user who created that table, so that I could at least ask them about that script.
I tried "INFORMATION_SCHEMA.TABLES;" but I only get the CREATE script of the table.
I would look into the BigQuery audit logs; most probably you can get the information you want there.
Here is the reference: https://cloud.google.com/bigquery/docs/reference/auditlogs/
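Complementary to the audit logs, a hedged sketch using BigQuery's INFORMATION_SCHEMA.JOBS view, which also records the query text and the user who ran it (assuming the insert job is still within that view's retention window; the region qualifier, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Find jobs that wrote to the table, with the query text and the user who ran it.
# The region qualifier, dataset, and table names below are placeholders.
sql = """
SELECT creation_time, user_email, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE destination_table.dataset_id = 'my_dataset'
  AND destination_table.table_id = 'my_table'
ORDER BY creation_time DESC
"""

for row in client.query(sql).result():
    print(row.creation_time, row.user_email, row.query)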

Set different output result location for different tables in aws athena

I have enabled CloudFront logs and want to capture some information from them.
For that I am using AWS Athena to query the CloudFront logs, which is working absolutely fine for my staging environment, as it stores the query results in my staging bucket.
For production I have created another table which queries the CloudFront log files from the production bucket, and I want to store that result in a different S3 bucket (the production bucket). But I am not finding any way to set a different output location for the query results. Is there any way I can set a different output result location for different tables?
I assume you are talking about the QueryResultLocation in S3:
QueryResultsLocationInS3 is the query result location specified either by workgroup settings or client-side settings.
You can find more detailed information on how to set the location in Specifying a Query Result Location:
The query result location that Athena uses is determined by a combination of workgroup settings and client-side settings. Client-side settings are based on how you run the query.
If you run the query using the Athena console, the Query result location entered under Settings in the navigation bar determines the client-side setting.
If you run the query using the Athena API, the OutputLocation parameter of the StartQueryExecution action determines the client-side setting.
If you use the ODBC or JDBC drivers to run queries, the S3OutputLocation property specified in the connection URL determines the client-side setting.
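Building on the workgroup-settings option, a rough boto3 sketch with one workgroup per environment, each pointing at its own result bucket (workgroup names and bucket paths are placeholders):

import boto3

athena = boto3.client("athena")

# One workgroup per environment, each with its own query result location.
for name, output in [
    ("staging", "s3://my-staging-bucket/athena-results/"),
    ("production", "s3://my-production-bucket/athena-results/"),
]:
    athena.create_work_group(
        Name=name,
        Configuration={"ResultConfiguration": {"OutputLocation": output}},
    )

# Then pass WorkGroup="staging" or WorkGroup="production" to StartQueryExecution,
# or select the workgroup in the console, to route results to the matching bucket.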
If you are creating the tables with a CREATE TABLE AS statement, you can set the result location for that specific table in the WITH clause, e.g.:
CREATE TABLE table_name
WITH (
external_location ='s3://my-bucket/tables/table_name/'
)
AS
SELECT * FROM database.my_table
Unfortunately this needs to be an empty location for the query to work. You would need to delete all the files in that location if you wanted to re-run this query, or change the location. See the Athena documentation for more details.
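If you do need to re-run the CTAS, a small boto3 sketch for clearing the external_location prefix first (the bucket and prefix mirror the placeholder example above):

import boto3

# Delete everything under the CTAS external_location before re-running the query.
s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")
bucket.objects.filter(Prefix="tables/table_name/").delete()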

Extracting multiple RDS MySQL tables to S3

Rather new to AWS Data Pipeline, so any help will be appreciated. I have used the pipeline template RDStoS3CopyActivity to extract all contents of a table in RDS MySQL. It seems to be working well, but there are 90 other tables to be extracted and dumped to S3, and I cannot imagine creating 90 pipelines, one for each table.
What is the best approach to this task? How could the pipeline be instructed to iterate through a list of table names?
I am not sure if this will ever get a response. However, at this early stage of exploration, I have developed a pipeline that seems to fit a preliminary purpose: extracting from 10 RDS MySQL tables and copying each to its respective sub-bucket on S3.
The logic is rather simple.
Configure the connection for RDS MySQL.
Extract data by filling in the "Select Query" field for each table.
Drop a CopyActivity and link it up for each table above. It runs on a specified EC2 instance; if you're running an expensive query, make sure you choose an EC2 instance with enough CPU and memory. This step copies the extracted dump, which lives temporarily in the EC2 tmp filesystem, to the designated S3 bucket you will set up next.
Finally, set up the designated / target S3 destination.
By default, data extracted and loaded to the S3 bucket will be comma-separated. If you need it to be tab-delimited, then in the last target S3 destination:
- Add an optional field > select Data Format.
- Create a new data format; it will appear under the 'Others' category.
- Give it a name. I call it Tab Separated.
- Type: TSV. Hover over 'Type' to learn about the other data formats.
- Column separator: \t (I could leave this blank since the type was already specified as TSV).
If the tables are all in the same RDS instance, why not use a SqlActivity pipeline with a SQL statement containing multiple unload commands to S3?
You can just write one query and use one pipeline.
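Either way, the per-table boilerplate can be generated rather than hand-written. A minimal plain-Python sketch (outside Data Pipeline itself; the table names and bucket are hypothetical) that emits one SELECT statement and one S3 prefix per table, which you could feed into a templated pipeline definition or a combined SqlActivity script:

# Generate per-table extract statements and S3 destinations from a single list,
# instead of maintaining 90 hand-built pipelines. Table names and the bucket
# below are hypothetical placeholders.
TABLES = ["customers", "orders", "invoices"]  # ...extend to all 90 tables
BUCKET = "my-export-bucket"

for table in TABLES:
    select_query = f"SELECT * FROM {table};"
    s3_prefix = f"s3://{BUCKET}/{table}/"
    print(f"{table}: {select_query} -> {s3_prefix}")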