Is there a way to clear multiple tables at once? - google-cloud-platform

I want to clear the tables stored in a specific dataset of bigquery.
In the console screen, you cannot delete multiple tables at once.
It was also not possible to delete using the * in the bq CLI.
Is there a way to clear multiple tables at once?

While in the documentation it is stated that you can delete only one table at a time, it is possible to make an API request with a Python script in order to delete all the tables inside a dataset.
I created and tested the following script:
from google.cloud import bigquery
#construct BigQuery client object
client = bigquery.Client()
#select your dataset and list the tables within it
dataset_id='project_id.dataset'
tables = client.list_tables(dataset_id)
#inititalizing the list of tables
list_tables=[]
for table in tables:
#Create a list with the able reference for deletion 'project.dataset_id.table_id'
id =".".join([table.project,table.dataset_id,table.table_id])
list_tables.append(id)
#List of tables
print(list_tables)
#Delete all the tables inside the list of tables
for table in list_tables:
#print(table)
client.delete_table(table)
print("{} {}".format("Number of deleted tables in the dataset:", len(list_tables)))
I executed the above code using Jupyter Notebook with Python 3. If you run it in your cloud shell environment, make sure you install all the dependencies pip install --upgrade google-cloud-bigquery.

I had similar situation and used the following Java code to delete the large no of tables.
Replace dataset and table_prefix values.
Set the service account JSON key file path as GOOGLE_APPLICATION_CREDENTIALS environment variable.
Code:
import java.util.ArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
public class TableKiller {
String dataset = "MY_DATASET";
String table_prefix = "temp_table_";
public static void main(String[] args) throws ExecutionException, InterruptedException {
new TableKiller().deleteAll();
}
public void deleteAll() throws InterruptedException, ExecutionException {
ArrayList<String> tableNames = new ArrayList<>();
BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();
int i = 1;
bigQuery.listTables(dataset, BigQuery.TableListOption.pageSize(1000))
.iterateAll()
.forEach(table -> {
String tableName = table.getTableId().getTable();
if (tableName.startsWith(table_prefix)) {
tableNames.add(tableName);
System.out.println("Added " + tableName + i);
}
});
ForkJoinPool forkJoinPool = new ForkJoinPool(200);
forkJoinPool.submit(() -> tableNames
.parallelStream()
.forEach(this::deleteTable)).get();
}
private void deleteTable(String tableName) {
BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();
bigQuery.delete(TableId.of(dataset, tableName));
System.out.println("Deleted " + tableName);
}
}

Related

How to fetch DBPROPERTIES, S3Location and comment set while creating database in AWS Athena?

As provided in AWS athena documentation.
https://docs.aws.amazon.com/athena/latest/ug/create-database.html
We can specify DBPROPERTIES, S3Location and comment while creating Athena database as
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT 'database_comment']
[LOCATION 'S3_loc']
[WITH DBPROPERTIES ('property_name' = 'property_value') [, ...]]
For example:
CREATE DATABASE IF NOT EXISTS clickstreams
COMMENT 'Site Foo clickstream data aggregates'
LOCATION 's3://myS3location/clickstreams/'
WITH DBPROPERTIES ('creator'='Jane D.', 'Dept.'='Marketing analytics');
But once the properties are set. How can I fetch the properties back using Query.
Let say, I want to fetch creator name from the above example.
You can get these using the Glue Data Catalog GetDatabase API call.
Databases and tables in Athena are stored in the Glue Data Catalog. When you run DDL statements in Athena it translates these into Glue API calls. Not all operations you can do in Glue are available in Athena, because of historical reasons.
I was able to fetch AWS Athena Database properties in Json format using following code of Glue data catalog.
package com.amazonaws.samples;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClient;
import com.amazonaws.services.glue.model.GetDatabaseRequest;
import com.amazonaws.services.glue.model.GetDatabaseResult;
public class Glue {
public static void main(String[] args) {
BasicAWSCredentials awsCreds = new BasicAWSCredentials("*api*","*key*");
AWSGlue glue = AWSGlueClient.builder().withRegion("*bucket_region*")
.withCredentials(new AWSStaticCredentialsProvider(awsCreds)).build();
GetDatabaseRequest req = new GetDatabaseRequest();
req.setName("*database_name*");
GetDatabaseResult result = glue.getDatabase(req);
System.out.println(result);
}
}
Also, following permissions are required for user
AWSGlueServiceRole
AmazonS3FullAccess

Query a table/database in Athena from a Notebook instance

I have developed different Athena Workgroups for different teams so that I can separate their queries and their query results. The users would like to query the tables available to them from their notebook instances (JupyterLab). I am having difficulty finding code which successfully covers the requirement of querying a table from the user's specific workgroup. I have only found code that will query the table from the primary workgroup.
The code I have currently used is added below.
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
region_name='<YOUR REGION, for example, us-west-2>')
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
df
This code does not work as the users only have access to perform queries from their specific workgroups hence get errors when this code is run. It also does not cover the requirement of separating the user's queries in user specific workgroups.
Any suggestions on how I can add alter the code so that I can run the queries within a specific workgroup from the notebook instance?
Documentation of pyathena is not super extensive, but after looking into source code we can see that connect simply creates instance of Connection class.
def connect(*args, **kwargs):
from pyathena.connection import Connection
return Connection(*args, **kwargs)
Now, after looking into signature of Connection.__init__ on GitHub we can see parameter work_group=None which name in the same way as one of the parameters for start_query_execution from the official AWS Python API boto3. Here is what their documentation say about it:
WorkGroup (string) -- The name of the workgroup in which the query is being started.
After following through usages and imports in Connection we endup with BaseCursor class that under the hood makes a call to start_query_execution while unpacking a dictionary with parameters assembled by BaseCursor._build_start_query_execution_request method. That is excatly where we can see familar syntax for submitting queries to AWS Athena, in particular the following part:
if self._work_group or work_group:
request.update({
'WorkGroup': work_group if work_group else self._work_group
})
So this should do a trick for your case:
import pandas as pd
from pyathena import connect
conn = connect(
s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
region_name='<YOUR REGION, for example, us-west-2>',
work_group='<USER SPECIFIC WORKGROUP>'
)
df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
I implemented this it worked for me.
!pip install pyathena
Ref link
from pyathena import connect
from pyathena.pandas.util import as_pandas
import boto3
query = """
Select * from "s3-prod-db"."CustomerTransaction" ct where date(partitiondate) >= date('2022-09-30') limit 10
"""
query
cursor = connect(s3_staging_dir='s3://s3-temp-analytics-prod2/',
region_name=boto3.session.Session().region_name, work_group='data-scientist').cursor()
df = cursor.execute(query)
print(cursor.state)
print(cursor.state_change_reason)
print(cursor.completion_date_time)
print(cursor.submission_date_time)
print(cursor.data_scanned_in_bytes)
print(cursor.output_location)
df = as_pandas(cursor)
print(df)
If we dont pass work_group parameter will use "primary" as the default work_group.
If we pass s3_staging_dir='s3://s3-temp-analytics-prod2/' s3 bucket which does not exist, it will create this bucket.
But if the user role that you are running the script does not have to create bucket privilege it will throw an exception.

How to delete / drop multiple tables in AWS athena?

I am trying to drop few tables from Athena and I cannot run multiple DROP queries at same time. Is there a way to do it?
Thanks!
You are correct. It is not possible to run multiple queries in the one request.
An alternative is to create the tables in a specific database. Dropping the database will then cause all the tables to be deleted.
For example:
CREATE DATABASE foo;
CREATE EXTERNAL TABLE bar1 ...;
CREATE EXTERNAL TABLE bar2 ...;
DROP DATABASE foo CASCADE;
The DROP DATABASE command will delete the bar1 and bar2 tables.
You can use aws-cli batch-delete-table to delete multiple table at once.
aws glue batch-delete-table \
--database-name <database-name> \
--tables-to-delete "<table1-name>" "<table2-name>" "<table3-name>" ...
You can use AWS Glue interface to do this now. The prerequisite being you must upgrade to AWS Glue Data Catalog.
If you Upgrade to the AWS Glue Data Catalog from Athena, the metadata for tables created in Athena is visible in Glue and you can use the AWS Glue UI to check multiple tables and delete them at once.
FAQ on Upgrading data catalog: https://docs.aws.amazon.com/athena/latest/ug/glue-faq.html
You could write a shell script to do this for you:
for table in products customers stores; do
aws athena start-query-execution --query-string "drop table $table" --result-configuration OutputLocation=s3://my-ouput-result-bucket
done
Use AWS Glue's Python shell and invoke this function:
def run_query(query, database, s3_output):
client = boto3.client('athena')
response = client.start_query_execution(
QueryString=query,
QueryExecutionContext={
'Database': database
},
ResultConfiguration={
'OutputLocation': s3_output,
}
)
print('Execution ID: ' + response['QueryExecutionId'])
return response
Athena configuration:
s3_input = 's3://athena-how-to/data'
s3_ouput = 's3://athena-how-to/results/'
database = 'your_database'
table = 'tableToDelete'
query_1 = "drop table %s.%s;" % (database, table)
queries = [ query_1]
#queries = [ create_database, create_table, query_1, query_2 ]
for q in queries:
print("Executing query: %s" % (q))
res = run_query(q, database, s3_ouput)
#Vidy
I would second what #Prateek said. Please provide an example of your code. Also, please tag your post with the language/shell that you're using to interact with AWS.
Currently, you cannot run multiple queries in one request. However, you can make multiple requests simultaneously. Currently, you can run 20 requests simultaneously (2018-06-15). You could do this through an API call or the console. In addition you could use the CLI or the SDK (if available for your language of choice).
For example, in Python you could use the multiprocess or threading modules to manage concurrent requests. Just remember to consider thread/multiprocess safety when creating resources/clients.
Service Limits:
Athena Service Limits
AWS Service Limits for which you can request a rate increase
I could not get Carl's method to work by executing DROP TABLE statements even though they did work in the console.
So I just thought it was worth posting my approach that worked for me, which uses a combination of the AWS Pandas SDK and the CLI
import awswrangler as wr
import boto3
import os
session = boto3.Session(
aws_access_key_id='XXXXXX',
aws_secret_access_key='XXXXXX',
aws_session_token='XXXXXX'
)
database_name = 'athena_db'
athena_s3_output = 's3://athena_s3_bucket/athena_queries/'
df = wr.athena.read_sql_query(
sql= "SELECT DISTINCT table_name FROM information_schema.tables WHERE
table_schema = '" + database_name + "'",
database= database_name,
s3_output = athena_s3_output,
boto3_session = session
)
print(df)
# ensure that your aws profile is valid for CLI commands
# i.e. your credentials are set in C:\Users\xxxxxxxx\.aws\credentials
for table in df['table_name']:
cli_string = 'aws glue delete-table --database-name ' + database_name + ' --name ' + table
print(cli_string)
os.system(cli_string)

Copying one table to another in DynamoDB

What's the best way to identically copy one table over to a new one in DynamoDB?
(I'm not worried about atomicity).
Create a backup(backups option) and restore the table with a new table name. That would get all the data into the new table.
Note: Takes considerable amount of time depending on the table size
I just used the python script, dynamodb-copy-table, making sure my credentials were in some environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY), and it worked flawlessly. It even created the destination table for me.
python dynamodb-copy-table.py src_table dst_table
The default region is us-west-2, change it with the AWS_DEFAULT_REGION env variable.
AWS Pipeline provides a template which can be used for this purpose: "CrossRegion DynamoDB Copy"
See: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-crossregion-ddb-create.html
The result is a simple pipeline that looks like:
Although it's called CrossRegion you can easily use it for the same region as long the destination table name is different (Remember that table names are unique per account and region)
You can use Scan to read the data and save it to the new table.
On the AWS forums a guy from the AWS team posted another approach using EMR: How Do I Duplicate a Table?
Here's one solution to copy all items from one table to another, just using shell scripting, the AWS CLI and jq. Will work OK for smallish tables.
# exit on error
set -eo pipefail
# tables
TABLE_FROM=<table>
TABLE_TO=<table>
# read
aws dynamodb scan \
--table-name "$TABLE_FROM" \
--output json \
| jq "{ \"$TABLE_TO\": [ .Items[] | { PutRequest: { Item: . } } ] }" \
> "$TABLE_TO-payload.json"
# write
aws dynamodb batch-write-item --request-items file://"$TABLE_TO-payload.json"
# clean up
rm "$TABLE_TO-payload.json"
If you both tables to be identical, you'd want to delete all items in TABLE_TO first.
DynamoDB now supports importing from S3.
https://aws.amazon.com/blogs/database/amazon-dynamodb-can-now-import-amazon-s3-data-into-a-new-table/
So, probably in almost all use cases, the easiest and cheapest way to replicate a table is
Use "Export to S3" feature to dump entire table into S3. Since this uses backup to generate the dump, table's throughput is not affected, and is very fast as well. You need to have backups (PITR) enabled. See https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
Use "Import from S3" to import the dump created in step 1. This automatically requires you to create a new table.
Use this node js module : copy-dynamodb-table
This is a little script I made to copy the contents of one table to another.
It's based on the AWS-SDK v3. Not sure how well it would scale to big tables but as a quick and dirty solution it does the job.
It gets your AWS credentials from a profile in ~/.aws/credentials change default to the name of the profile you want to use.
Other than that it takes two args one for the source table and one for destination
const { fromIni } = require("#aws-sdk/credential-providers");
const { DynamoDBClient, ScanCommand, PutItemCommand } = require("#aws-sdk/client-dynamodb");
const ddbClient = new DynamoDBClient({
credentials: fromIni({profile: "default"}),
region: "eu-west-1",
});
const args = process.argv.slice(2);
console.log(args)
async function main() {
const { Items } = await ddbClient.send(
new ScanCommand({
TableName: args[0],
})
);
console.log("Successfully scanned table")
console.log("Copying", Items.length, "Items")
const putPromises = [];
Items.forEach((item) => {
putPromises.push(
ddbClient.send(
new PutItemCommand({
TableName: args[1],
Item: item,
})
)
);
});
await Promise.all(putPromises);
console.log("Successfully copied table")
}
main();
Usage
node copy-table.js <source_table_name> <destination_table_name>
Python + boto3 🚀
The script is idempotent as far as you maintain the same Keys.
import boto3
def migrate(source, target):
dynamo_client = boto3.client('dynamodb', region_name='us-east-1')
dynamo_target_client = boto3.client('dynamodb', region_name='us-west-2')
dynamo_paginator = dynamo_client.get_paginator('scan')
dynamo_response = dynamo_paginator.paginate(
TableName=source,
Select='ALL_ATTRIBUTES',
ReturnConsumedCapacity='NONE',
ConsistentRead=True
)
for page in dynamo_response:
for item in page['Items']:
dynamo_target_client.put_item(
TableName=target,
Item=item
)
if __name__ == '__main__':
migrate('awesome-v1', 'awesome-v2')
On November 29th, 2017 Global Tables was introduced. This may be useful depending on your use case, which may not be the same as the original question. Here are a few snippets from the blog post:
Global Tables – You can now create tables that are automatically replicated across two or more AWS Regions, with full support for multi-master writes, with a couple of clicks. This gives you the ability to build fast, massively scaled applications for a global user base without having to manage the replication process.
...
You do not need to make any changes to your existing code. You simply send write requests and eventually consistent read requests to a DynamoDB endpoint in any of the designated Regions (writes that are associated with strongly consistent reads should share a common endpoint). Behind the scenes, DynamoDB implements multi-master writes and ensures that the last write to a particular item prevails. When you use Global Tables, each item will include a timestamp attribute representing the time of the most recent write. Updates are propagated to other Regions asynchronously via DynamoDB Streams and are typically complete within one second (you can track this using the new ReplicationLatency and PendingReplicationCount metrics).
Another option is to download the table as a .csv file and upload it with the following snippet of code.
This also eliminates the need for providing your AWS credentials to a packages such as the one #ezzat suggests.
Create a new folder and add the following two files and your exported table
Edit uploadToDynamoDB.js and add the filename of the exported table and your table name
Run npm install in the folder
Run node uploadToDynamodb.js
File: Package.json
{
"name": "uploadtodynamodb",
"version": "1.0.0",
"description": "",
"main": "uploadToDynamoDB.js",
"author": "",
"license": "ISC",
"dependencies": {
"async": "^3.1.1",
"aws-sdk": "^2.624.0",
"csv-parse": "^4.8.5",
"fs": "0.0.1-security",
"lodash": "^4.17.15",
"uuid": "^3.4.0"
}
}
File: uploadToDynamoDB.js
var fs = require('fs');
var parse = require('csv-parse');
var async = require('async');
var _ = require('lodash')
var AWS = require('aws-sdk');
// If your table is in another region, make sure to update this
AWS.config.update({ region: "eu-central-1" });
var ddb = new AWS.DynamoDB({ apiVersion: '2012-08-10' });
var csv_filename = "./TABLE_CSV_EXPORT_FILENAME.csv";
var tableName = "TABLENAME"
function prepareData(data_chunk) {
const items = data_chunk.map(obj => {
const keys = Object.keys(obj)
let attr = Object.values(obj)
attr = attr.map(a => {
let newAttr;
// Can we make this an integer
if (isNaN(Number(a))) {
newAttr = { "S": a }
} else {
newAttr = { "N": a }
}
return newAttr
})
let item = _.zipObject(keys, attr)
return {
PutRequest: {
Item: item
}
}
})
var params = {
RequestItems: {
[tableName]: items
}
};
return params
}
rs = fs.createReadStream(csv_filename);
parser = parse({
columns : true,
delimiter : ','
}, function(err, data) {
var split_arrays = [], size = 25;
while (data.length > 0) {
split_arrays.push(data.splice(0, size));
}
data_imported = false;
chunk_no = 1;
async.each(split_arrays, function(item_data, callback) {
const params = prepareData(item_data)
ddb.batchWriteItem(
params,
function (err, data) {
if (err) {
console.log("Error", err);
} else {
console.log("Success", data);
}
});
}, function() {
// run after loops
console.log('all data imported....');
});
});
rs.pipe(parser);
It's been a very long time since the question was posted and AWS has been continuously improvising features. At the time of writing this answer, we have the option to export the Table to S3 bucket then use the import feature to import this data from S3 into a new table which automatically will re-create a new table with the data. Plese refer this blog for more idea on export & import
Best part is that you get to change the name, PK or SK.
Note: You have to enable PITR (might incur additional costs). Always best to refer documents.
Here is another simple python util script for this: ddb_table_copy.py. I use it often.
usage: ddb_table_copy.py [-h] [--dest-table DEST_TABLE] [--dest-file DEST_FILE] source_table
Copy all DynamoDB items from SOURCE_TABLE to either DEST_TABLE or DEST_FILE. Useful for migrating data during a stack teardown/re-creation.
positional arguments:
source_table Name of source table in DynamoDB.
optional arguments:
-h, --help show this help message and exit
--dest-table DEST_TABLE
Name of destination table in DynamoDB.
--dest-file DEST_FILE
2) a valid file path string to save the items to, e.g. 'items.json'.

Sync framework provisioning error on table 'aspnet_personalizationperuser'

i have a database on server A which also contains the default tables and sp that used by the aspnet_Regsql.exe process (mainly i use the authentication module of asp.net with those tables) and a backup Server B, which i use sync framework sqlOrchester and SqlProvider to sync between them together.
but for some reason, when i'm trying to do the provisioning process on the remote server A
i get an error on the line that creates the SqlSyncScopeProvisioning object with exception of DbSyncException exception, stating that i'm missing a primary key column on table aspnet_personalizationperuser, although that table has the primary key column [Id] which ig a uniqueidentifier column.
I added the lines of code i'm using for the provisioning process:
var TableNames = [Collection of all the table names in the database];
var dbScopeDescription =
BuildScopeFromTableList(this.RemoteProvider, TableNames);
protected DbSyncScopeDescription
BuildScopeFromTableList(RelationalSyncProvider provider, string[] SyncTablesNames)
{
DbSyncScopeDescription scopeDesc = new
DbSyncScopeDescription(SyncSettings.ScopeName);
SqlConnection serverConn =
(SqlConnection)provider.Connection;
foreach (String table in SyncTablesNames)
{
DbSyncTableDescription tableDesc =
SqlSyncDescriptionBuilder.GetDescriptionForTable(table, serverConn);
scopeDesc.Tables.Add(tableDesc);
}
return scopeDesc;
}
var serverConfig = new SqlSyncScopeProvisioning((SqlConnection)this.remoteProvider.Connection, dbScopeDescription);