How does Sqoop fetch metadata of tables mentioned in the --query option?

As per "Apache Sqoop Cookbook", by using query imports, Sqoop can’t use the database catalog to fetch the metadata.
Then how does Sqoop fetch metadata of tables (cities & normcities) mentioned in query option? eg:
--query "SELECT cities.city AS first_city normcities.city AS second_city FROM cities LEFT JOIN normcities USING(id)"

Related

AWS Printing DynamoDB Table Via CLI

I'm trying to find the right command to use in the CLI to print the contents of a table within DynamoDB.
I've tried using the following command but it gives me a "parameter validation failed" error.
aws dynamodb get-item \
--table-name Traffic \
--key file://traffic.json \
--return-consumed-capacity TOTAL
The AWS website is giving me a 403 error at the moment, so I can't search for the solution on the official site.
To get all items in a table, use a scan operation, not a get item operation. This basic scan operation works fine with the CLI:
aws dynamodb scan --table-name Work
You can find all valid options here:
https://docs.aws.amazon.com/cli/latest/reference/dynamodb/scan.html
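If the full scan output is too noisy for printing, the CLI's global --query option can trim it down, and --select COUNT returns only a count; a couple of hedged variations on the command above:

# Print only the items, dropping the Count/ScannedCount wrapper fields.
aws dynamodb scan --table-name Work --query 'Items[]'

# Ask DynamoDB for a count only (this still scans the table; very large tables are paginated).
aws dynamodb scan --table-name Work --select COUNT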
You can also run the Scan API; items come back in DynamoDB JSON format by default, and the CLI output format can be changed, e.g. to plain text:
aws dynamodb scan \
--table-name test \
--output text
If you have a list of keys to fetch in your traffic.json file, then you should use batch-get-item.
If it's a single item you need, then please share the contents of your traffic.json file.
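For the original get-item attempt, the "parameter validation failed" error usually points at the key file itself; a minimal sketch of what traffic.json could look like, assuming (hypothetically) that the Traffic table's partition key is a string attribute named route_id:

# Hypothetical key file; replace route_id with the table's real key attribute(s) and types.
cat > traffic.json <<'EOF'
{
  "route_id": {"S": "route-42"}
}
EOF

aws dynamodb get-item \
  --table-name Traffic \
  --key file://traffic.json \
  --return-consumed-capacity TOTAL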

Creating a backup of Amazon S3 tables | command line

I have some tables in my S3 bucket and I want to create a backup for them via the EMR command line.
Here is a breakdown of what I want to do:
1. copy S3 objects into a backup S3 location
2. create metadata for the backup table (use the DDL from the original table, but read data in from the backup S3 location)
3. validate row count between the main and the backup table
So far I have been able to write a script to copy objects from the main table's external location to the backup table's external location:
backup_table() {
  db=${1}
  table=${2}
  # --output text strips the quotes that the default JSON output wraps around the location
  s3_location=$(aws glue get-table --database-name "${db}" --name "${table}" --query "Table.StorageDescriptor.Location" --output text)
  aws s3 cp --recursive "${s3_location}" "${s3_location}_bkp"
}
backup_table "database" "table"
I can't figure out how to access the main table's DDL from the CLI (the equivalent of SHOW CREATE TABLE db.table in Athena).
Will greatly appreciate any help.
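One way to get the DDL from the command line is to run SHOW CREATE TABLE through the Athena CLI; a rough sketch, assuming Athena can already query the table and that the results bucket below (a placeholder) exists:

# Submit the DDL statement; Athena writes its output to the given S3 location.
query_id=$(aws athena start-query-execution \
  --query-string "SHOW CREATE TABLE database.table" \
  --result-configuration "OutputLocation=s3://my-athena-results/" \
  --query 'QueryExecutionId' --output text)

# Crude wait for the query to finish, then print the generated DDL.
sleep 5
aws athena get-query-results --query-execution-id "$query_id" \
  --query 'ResultSet.Rows[].Data[].VarCharValue' --output text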

Get number of partitions in AWS Glue for specific range

I want to list all the partitions for a given table and get a count of them, but
aws glue get-partitions --database-name ... returns detailed information about each partition, which is not very helpful in this case.
Let's say my table is partitioned by input_data_date and country. I want to know how many partitions I have for a given day.
I can do something like this:
aws glue get-partitions --database-name MYDB --table-name MYTABLE --expression "input_data_date = '2021-07-09' "
But that needs some scripting; I was looking for a better and cleaner way using just the AWS CLI or ...
The AWS CLI uses JMESPath, which has a length() function. Therefore, you can use:
aws glue get-partitions --database-name xx --table-name xx --query 'length(Partitions[])'
That will return the total number of partitions.
If you want to do something more specific ("how many partitions I have for a given day"), you'd probably need to use an SDK (e.g. Python with boto3) to process the information.
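That said, the --expression filter from the question and the length() trick can be combined, so the per-day count comes back from a single CLI call:

# Count only the partitions matching the expression for one day.
aws glue get-partitions --database-name MYDB --table-name MYTABLE \
  --expression "input_data_date = '2021-07-09'" \
  --query 'length(Partitions[])'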

Bulk hive table creation in Google Dataproc

I am very new to Google Cloud Platform, and I am doing a POC for moving a Hive application (tables and jobs) to Google Dataproc. The data has already been moved to Google Cloud Storage.
Is there a built-in way to create all the Hive tables in Dataproc in bulk, instead of creating them one by one at the Hive prompt?
Dataproc supports the Hive job type, so you can use the gcloud command:
gcloud dataproc jobs submit hive --cluster=CLUSTER \
-e 'create table t1 (id int, name string); create table t2 ...;'
or
gcloud dataproc jobs submit hive --cluster=CLUSTER -f create_tables.hql
You can also SSH into the master node, then use beeline to execute the script:
beeline -u jdbc:hive2://localhost:10000 -f create_tables.hql
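Since the data is already in Cloud Storage, the bulk DDL file would typically declare EXTERNAL tables pointing at the GCS locations; a rough sketch (bucket, columns, storage format, and region are placeholders):

# Hypothetical DDL file; adjust columns, format, and gs:// paths to the actual data.
cat > create_tables.hql <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS t1 (id INT, name STRING)
STORED AS PARQUET
LOCATION 'gs://my-bucket/warehouse/t1/';

CREATE EXTERNAL TABLE IF NOT EXISTS t2 (id INT, value DOUBLE)
STORED AS PARQUET
LOCATION 'gs://my-bucket/warehouse/t2/';
EOF

gcloud dataproc jobs submit hive --cluster=CLUSTER --region=REGION -f create_tables.hql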

How to run Queries on Redshift Database using AWS CLI

I want to run queries by passing them as a string to some supported AWS command through its CLI.
I can see that the Redshift-specific commands in the reference linked below don't have anything which says they can execute queries remotely.
Link : https://docs.aws.amazon.com/cli/latest/reference/redshift/index.html
Need help on this.
You need to use psql. There is no API interface to Redshift.
Redshift is loosely based on PostgreSQL, however, so you can connect to the cluster using the psql command-line tool.
https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-from-psql.html
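A minimal psql invocation against the cluster endpoint would look roughly like this (the host, user, and database names are placeholders; 5439 is Redshift's default port):

# Prompts for the password, runs the query, and prints the result.
psql -h mycluster-test.abc123xyz.us-west-2.redshift.amazonaws.com \
     -p 5439 -U myuser -d dev \
     -c "select * from stl_query limit 1"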
You can use the Redshift Data API to execute queries on Redshift using the AWS CLI:
aws redshift-data execute-statement \
    --region us-west-2 \
    --secret arn:aws:secretsmanager:us-west-2:123456789012:secret:myuser-secret-hKgPWn \
    --cluster-identifier mycluster-test \
    --sql "select * from stl_query limit 1" \
    --database dev
https://aws.amazon.com/about-aws/whats-new/2020/09/announcing-data-api-for-amazon-redshift/
https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html
https://docs.aws.amazon.com/cli/latest/reference/redshift-data/index.html#cli-aws-redshift-data
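Note that execute-statement runs asynchronously; to actually see the rows, a follow-up along these lines is needed, using the Id returned by the call above (shown here as a placeholder):

# Check whether the statement has finished.
aws redshift-data describe-statement --id <statement-id>

# Once the status is FINISHED, fetch the result set.
aws redshift-data get-statement-result --id <statement-id>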