How to find the schema of Airflow Backend database? - google-cloud-platform

I am using apache airflow (v 1.10.2) on Google Cloud Composer, and I would like to view the schema of the airflow database. Where can I find this information?

There are a couple of ways I can think of, comparing with our current design.
External metadata DB: if you can connect to the database directly, you can read the schema from it.
From the Airflow UI you can go to Data Profiling and run queries against the metadata tables (the exact queries depend on your database type, MySQL or Postgres etc.), and build a schema diagram from the results.
I hope this helps.
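As a quick illustration of the first option, here is a minimal sketch (assuming shell access to an Airflow 1.10.x worker; sql_alchemy_conn is the standard config key) of how to locate the metadata DB connection string you would need:
# Locate the connection string Airflow itself uses for the metadata DB
# (sql_alchemy_conn lives under [core] in airflow.cfg in Airflow 1.10.x)
grep '^sql_alchemy_conn' "${AIRFLOW_HOME:-$HOME/airflow}/airflow.cfg"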

According to the Composer architecture design, Cloud SQL is the main place where all the Airflow metadata is stored. However, in order to grant client applications running on the GKE cluster authorized access to the database, Composer uses the Cloud SQL Proxy service. In particular, a Composer environment contains an airflow-sqlproxy* Pod that brokers connections to the Airflow Cloud SQL instance.
That said, I believe there is no problem establishing a connection to the above-mentioned Airflow database from any of the GKE cluster workloads (Pods).
For instance, I will connect from an Airflow worker through the airflow-sqlproxy-service.default Cloud SQL proxy service and then explore the database with the mysql command-line utility:
kubectl exec -it $(kubectl get po -l run=airflow-worker -o jsonpath='{.items[0].metadata.name}' \
  -n $(kubectl get ns | grep composer | awk '{print $1}')) \
  -n $(kubectl get ns | grep composer | awk '{print $1}') \
  -c airflow-worker -- mysql -u root -h airflow-sqlproxy-service.default
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show databases;
+---------------------------------+
| Database                        |
+---------------------------------+
| information_schema              |
| composer-1-8-3-airflow-1-10-3-* |
| mysql                           |
| performance_schema              |
| sys                             |
+---------------------------------+
5 rows in set (0.00 sec)
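Building on the same connection, here is a hedged sketch of how the actual table definitions could be pulled out non-interactively (the label selector, namespace pattern and proxy service name are the same ones used above and may differ in your environment):
# Resolve the Composer namespace and an Airflow worker Pod, as above
NS=$(kubectl get ns | grep composer | awk '{print $1}')
POD=$(kubectl get po -n "$NS" -l run=airflow-worker -o jsonpath='{.items[0].metadata.name}')
# Dump every table and column of the Airflow metadata database via the SQL proxy
kubectl exec -n "$NS" "$POD" -c airflow-worker -- \
  mysql -u root -h airflow-sqlproxy-service.default -e \
  "SELECT table_name, column_name, data_type
     FROM information_schema.columns
    WHERE table_schema LIKE 'composer%'
    ORDER BY table_name, ordinal_position;"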

Related

Using Harbor Helm with RDS?

Is it possible to use Harbor Helm with RDS?
The original installation of Harbor, without using Helm Charts and Kubernetes, involves a harbor.yml that requires 4 databases to be set up: Harbor Core, Clair, Notary Server, and Notary Signer.
I have been told that using Harbor Helm still requires these databases to be set up and managed. Therefore, when using Harbor Helm, which installs Harbor in a Kubernetes cluster, do we still need these 4 databases to be set up and configured? If so, should RDS be used?
Yes, you do. We are using Postgres via RDS, deployed via Terraform. I then updated the Harbor Helm chart via Kustomize to inject an initContainer.
The initContainer then executes the following script, which is passed the 4 database names: registry, clair, notary_signer, notary_server.
#!/bin/bash
# Create each database passed as an argument, if it does not already exist
echo "Creating databases: $#"
for var in "$@"
do
  select="SELECT 1 FROM pg_database WHERE datname = '$var'"
  create="CREATE DATABASE $var;"
  echo "psql -h <%=database.external.host%> -U postgres -tc \"$select\""
  psql -h <%=database.external.host%> -U postgres -tc "$select" | grep -q 1 || \
    psql -h <%=database.external.host%> -U postgres -c "$create"
done
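For completeness, a hypothetical invocation of that script (the file name here is illustrative; in our setup the initContainer passes the database names as arguments):
# Hypothetical: how the initContainer calls the script above
./create-harbor-dbs.sh registry clair notary_signer notary_server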
It sort of stinks that Postgres does not have CREATE DATABASE IF NOT EXISTS like CockroachDB does.

How to migrate elasticsearch data to AWS elasticsearch domain?

I have Elasticsearch 5.5 running on a server with some data indexed in it. I want to migrate this ES data to an AWS Elasticsearch cluster. How can I perform this migration? I learned that one way is to create a snapshot of the ES cluster, but I am not able to find any proper documentation for this.
The best way to migrate is by using snapshots. You will need to snapshot your data to Amazon S3 and then restore from there. Documentation for snapshots to S3 can be found here. Alternatively, you can re-index your data, though this is a longer process and there are limitations depending on the version of AWS ES.
I also recommend looking at Elastic Cloud, the official hosted offering on AWS that includes the additional X-Pack monitoring, management, and security features. The migration guide for moving to Elastic Cloud also covers snapshots and re-indexing.
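If it helps, here is a rough sketch of the snapshot route on the self-managed 5.5 cluster (this assumes the repository-s3 plugin is installed there; on the AWS ES side, registering the repository additionally requires a signed request with an IAM role, as described in the AWS documentation; bucket and repository names below are placeholders):
# Register an S3 snapshot repository on the source cluster
curl -XPUT "http://localhost:9200/_snapshot/my_s3_repo" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": { "bucket": "my-es-migration-bucket", "region": "ap-south-1" }
}'
# Take a snapshot of all indices and wait for it to finish
curl -XPUT "http://localhost:9200/_snapshot/my_s3_repo/snapshot_1?wait_for_completion=true"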
I recently created a shell script for this -
Github - https://github.com/vivekyad4v/aws-elasticsearch-domain-migration/blob/master/migrate.sh
#!/bin/bash
#### Make sure you have Docker engine installed on the host ####
###### TODO - Support parameters ######

export AWS_ACCESS_KEY_ID=xxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxx
export AWS_DEFAULT_REGION=ap-south-1
export AWS_DEFAULT_OUTPUT=json
export S3_BUCKET_NAME=my-es-migration-bucket
export DATE=$(date +%d-%b-%H_%M)

old_instance="https://vpc-my-es-ykp2tlrxonk23dblqkseidmllu.ap-southeast-1.es.amazonaws.com"
new_instance="https://vpc-my-es-mg5td7bqwp4zuiddwgx2n474sm.ap-south-1.es.amazonaws.com"

# Build the list of indexes to copy, excluding .kibana
delete=(.kibana)
es_indexes=$(curl -s "${old_instance}/_cat/indices" | awk '{ print $3 }')
es_indexes=${es_indexes//$delete/}
es_indexes=$(echo $es_indexes | tr -d '\n')
echo "indexes to be copied are - $es_indexes"

for index in $es_indexes; do
  # Export ES data to S3 (using s3urls)
  docker run --rm -ti taskrabbit/elasticsearch-dump \
    --s3AccessKeyId "${AWS_ACCESS_KEY_ID}" \
    --s3SecretAccessKey "${AWS_SECRET_ACCESS_KEY}" \
    --input="${old_instance}/${index}" \
    --output="s3://${S3_BUCKET_NAME}/${index}-${DATE}.json"

  # Import data from S3 into ES (using s3urls)
  docker run --rm -ti taskrabbit/elasticsearch-dump \
    --s3AccessKeyId "${AWS_ACCESS_KEY_ID}" \
    --s3SecretAccessKey "${AWS_SECRET_ACCESS_KEY}" \
    --input="s3://${S3_BUCKET_NAME}/${index}-${DATE}.json" \
    --output="${new_instance}/${index}"

  # Show what has landed on the new cluster so far
  new_indexes=$(curl -s "${new_instance}/_cat/indices" | awk '{ print $3 }')
  echo $new_indexes
  curl -s "${new_instance}/_cat/indices"
done

Migrate postgres dump to RDS

I have a Django postgres db (v9.3.10) running on digital ocean and am trying to migrate it over to Amazon RDS (postgres v 9.4.5). The RDS is a db.m3.xlarge instance with 300GB. I've dumped the Digital Ocean db with:
sudo -u postgres pg_dump -Fc -o -f /home/<user>/db.sql <dbname>
And now I'm trying to migrate it over with:
pg_restore -h <RDS endpoint> --clean -Fc -v -d <dbname> -U <RDS master user> /home/<user>/db.sql
The only error I see is:
pg_restore: [archiver (db)] Error from TOC entry 2516; 0 0 COMMENT EXTENSION plpgsql
pg_restore: [archiver (db)] could not execute query: ERROR: must be owner of extension plpgsql
Command was: COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language';
Apart from that everything seems to be going fine and then it just grinds to a halt. The dumped file is ~550MB and there are a few tables with multiple indices, otherwise pretty standard.
The Read and Write IOPS on the AWS interface are near 0, as is the CPU, memory, and storage. I'm very new to AWS and know that the parameter groups might need tweaking to do this better. Can anyone advise on this or a better way to migrate a Django db over to RDS?
Edit:
Looking at the db users the DO db looks like:
Role Name   Attr                                             Member Of
<user>      Superuser                                        {}
postgres    Superuser, Create role, Create DB, Replication   {}
And the RDS one looks like:
Role Name       Attr                     Member Of
<user>          Create role, Create DB   {rds_superuser}
rds_superuser   Cannot login             {}
rdsadmin        ...                      ...
So it doesn't look like it's a permissions issue to me as <user> has superuser permissions in each case.
Solution for anyone looking:
I finally got this working using:
cat <db.sql> | sed -e '/^COMMENT ON EXTENSION plpgsql IS/d' > edited.dump
psql -h <RDS endpoint> -U <user> -e <dbname> < edited.dump
It's not ideal for a reliable backup/restore mechanism but given it is only a comment I guess I can do without. My only other observation is that running psql/pg_restore to a remote host is slow. Hopefully the new database migration service will add something.
Considering that your dumped DB file is ~550MB, I think following the Amazon guide for doing this is the way to go. I hope it helps.
Importing Data into PostgreSQL on Amazon RDS
I think it did not halt; it was probably just recreating indexes, foreign keys etc. Use pg_restore -v to see what's going on during the restore, and redirect the output to a file so you can check it for errors after the import, as it is verbose.
Also, I'd recommend using the directory format (pg_dump -v -Fd), as it allows a parallel restore (pg_restore -v -j4).
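A minimal sketch of that directory-format round trip, reusing the placeholders from the question (parallel dump needs a 9.3+ client and enough free disk space for the dump directory):
# Parallel dump into a directory, then parallel restore into RDS
sudo -u postgres pg_dump -Fd -v -j 4 -f /home/<user>/db_dir <dbname>
pg_restore -h <RDS endpoint> -U <RDS master user> -d <dbname> -v -j 4 /home/<user>/db_dir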
You can ignore the ERROR: must be owner of extension plpgsql. It only sets a comment on the extension, which is installed by default anyway. The error is caused by a peculiarity of the RDS flavour of PostgreSQL, which does not give you a true superuser connection, so you cannot restore objects (such as this comment) owned by the postgres user.

Best practice for obtaining the credentials when executing a Redshift copy command

What's the best practice for obtaining the AWS credentials needed for executing a Redshift copy command from S3? I'm automating the ingestion process from S3 into Redshift by having a machine trigger the Copy command.
I know it's recommended to use IAM roles on ec2 hosts so that you do not need to store AWS credentials. How would it work though with the Redshift copy command? I do not particularly want the credentials in the source code. Similarly the hosts are being provisioned by Chef and so if I wanted to set the credentials as environment variables they would be available in the Chef scripts.
You need credentials to use the COPY command. If your question is about how to get those credentials on the host where the program runs, you can read the instance's IAM role metadata and use the access key, secret key and token from it. You can set them as parameters on the fly just before the COPY command and reference them in the COPY command. Command to extract the access key and create a parameter (IAMROLE stands for the metadata path of your instance's role):
export ACCESSKEY=$(curl -s http://169.254.169.254/IAMROLE | grep '"AccessKeyId" : *' | cut -f5 -d " " | cut -b2- | rev | cut -b3- | rev)
Command to extract the secret key and create a parameter:
export SECRETKEY=$(curl -s http://169.254.169.254/IAMROLE | grep '"SecretAccessKey" : *' | cut -f5 -d " " | cut -b2- | rev | cut -b3- | rev)
Command to extract the token and create a parameter:
export TOKEN=$(curl -s http://169.254.169.254/IAMROLE | grep '"Token" : *' | cut -f5 -d " " | rev | cut -b2- | rev)
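For completeness, a rough sketch of how those parameters could then be fed into the COPY command itself (the table, bucket and connection details are placeholders, not from the question):
# Hypothetical example: run the COPY through psql; the shell expands the
# credentials gathered above into the CREDENTIALS string before it is sent.
psql "host=<redshift-endpoint> port=5439 dbname=<db> user=<user>" -c "
  COPY my_table
  FROM 's3://my-bucket/my-prefix/'
  CREDENTIALS 'aws_access_key_id=${ACCESSKEY};aws_secret_access_key=${SECRETKEY};token=${TOKEN}'
  CSV;"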
It seems the recommended way (as suggested by an AWS developer) is to use temporary credentials obtained from the AWS Security Token Service (AWS STS).
I'm yet to implement this, but it is the approach I'll be looking to take.
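I haven't tried it yet, but a rough sketch of what that could look like (the role ARN and session name below are placeholders):
# Assume a role via STS and capture short-lived credentials for the COPY command
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/redshift-copy-role \
  --role-session-name redshift-copy \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
read -r ACCESSKEY SECRETKEY TOKEN <<< "$CREDS"   # reuse the same variable names as above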

gcloud compute no longer available after migration to v2beta2

After running: gcloud components update beta
It seems I lost compute commands:
gcloud -h
Usage: gcloud [optional flags] <group | command>
group may be auth | beta | components | config | dns | preview |
topic
command may be docker | help | info | init | version
How do I get compute back in order to run simple things like: gcloud compute images list --project google-containers | grep container-vm?
I followed migration path from: https://cloud.google.com/deployment-manager/latest/migration-guide
This is my gcloud -v:
Google Cloud SDK 0.9.67
beta 2015.06.30
bq 2.0.18
bq-nix 2.0.18
core 2015.06.30
core-nix 2015.06.02
gcloud 2015.06.30
gcutil-msg 2015.06.09
gsutil 4.13
gsutil-nix 4.12
preview 2015.06.30
Every time I run a compute command, the console gets stuck in a loop until I kill it with Ctrl+C:
[gcloud _19:33:01 $]]^_^ gcloud compute -h
You do not currently have this command group installed. Using it
requires the installation of components: [compute]
WARNING: Component [compute] no longer exists.
All components are up to date.
Restarting gcloud command:
$ gcloud compute -h
You do not currently have this command group installed. Using it
requires the installation of components: [compute]
WARNING: Component [compute] no longer exists.
All components are up to date.
Restarting gcloud command:
$ gcloud compute -h
^C
Is there anything I missed?
cheers
Leo
I had the same error after updating my gcloud tools. I had to re-install the Cloud SDK to get it working again.
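For reference, a sketch of the re-install route described above (the installer script is the standard Cloud SDK install method; exact component behaviour varied between SDK releases of that era):
# Re-install the Cloud SDK, reload the shell, then re-initialise and test compute
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
gcloud compute images list --project google-containers | grep container-vm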