I want to create a PySpark job, and I want to have two separate environments: sandbox and production. Based on the environment it runs in, the script should connect to either the sandbox or the production Postgres database running on Google Cloud SQL.
For all the other functionality required (things that run in Cloud Functions or Cloud Run), I managed to separate the environments using pydantic settings with the following class:
from pydantic import BaseSettings, Field


class Settings(BaseSettings):
    db_user: str = Field()
    db_pass: str = Field()
    db_name: str = Field()
    db_host: str = Field()

    class Config:
        env_file = ".env"
These values are passed in as environment variables through Secret Manager, so only requests from within Google Cloud data centers can access the sensitive data. I did not find a suitable way to pass secrets to a Dataproc batch job; the best alternative I found is properties (https://cloud.google.com/dataproc-serverless/docs/concepts/properties), which become environment variables.
Functionally I can simply set all the variables I need using the properties, but that would mean sending all my database information over the network from the computer where I launch the job to GCP, which looks like a security risk.
Is there a better way to separate sandbox and production environments in Dataproc, or can I somehow access secrets when deploying a Dataproc batch job?
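For illustration only (not necessarily what the asker ended up doing), one way to avoid sending credentials over the network is to submit just the non-sensitive environment name as a property and let the job fetch the secrets itself at runtime. A minimal sketch, assuming the google-cloud-secret-manager client is available to the batch and using hypothetical project and secret names:

import os

from google.cloud import secretmanager  # assumed to be installed for the batch


def access_secret(project_id: str, secret_id: str) -> str:
    # Read the latest version of a secret from Secret Manager.
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")


# "ENVIRONMENT" would be set through a Dataproc property that becomes an
# environment variable; "my-project" and the secret names are placeholders.
env = os.environ.get("ENVIRONMENT", "sandbox")
db_host = access_secret("my-project", f"{env}-db-host")
db_pass = access_secret("my-project", f"{env}-db-pass")

With this layout, switching between sandbox and production only changes the single ENVIRONMENT value submitted with the batch.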
Related
Currently I have several Redis clusters for different environments. In my code I write data to Redis inside my Lambda function. If I deploy this Lambda to my AWS account, how can it update the corresponding Redis cluster, since every environment has its own Redis cluster?
You can have a config file with the Redis cluster names and their hostnames, and in code you can pick the right cluster based on the environment provided.
If you are using per-environment roles in the AWS account, you should also use STS to assume the required role.
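A rough sketch of that approach, with hypothetical cluster hostnames, role ARNs and an ENV variable (boto3 assumed to be available in the Lambda runtime):

import os

import boto3

# Placeholder per-environment configuration; real values would come from a
# checked-in config file or the deployment pipeline.
REDIS_CLUSTERS = {
    "dev":  {"host": "redis-dev.example.internal",  "port": 6379},
    "prod": {"host": "redis-prod.example.internal", "port": 6379},
}
ROLE_ARNS = {
    "dev":  "arn:aws:iam::111111111111:role/app-dev",
    "prod": "arn:aws:iam::222222222222:role/app-prod",
}

env = os.environ.get("ENV", "dev")      # set per deployment
redis_config = REDIS_CLUSTERS[env]      # host/port for this environment

# If each environment uses its own IAM role, assume it explicitly via STS.
sts = boto3.client("sts")
credentials = sts.assume_role(
    RoleArn=ROLE_ARNS[env],
    RoleSessionName=f"lambda-{env}",
)["Credentials"]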
Your resource files are just files. They will be loaded by your application following different strategies, depending on your application's framework and how it has been configured.
Some applications apply the correct configuration at build time, for example by passing a flag such as --uat or --prod to the build. If your application is one of those, you can just build the correct version and push it to AWS. It will connect to the correct Redis, given that you put the Redis configuration into the corresponding env files.
The other option is to use environment variables.
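For example, a minimal sketch of the environment-variable option, assuming the redis-py client and hypothetical variable names:

import os

import redis  # redis-py, assumed to be packaged with the Lambda

# REDIS_HOST / REDIS_PORT are hypothetical names injected per environment
# through the Lambda configuration.
r = redis.Redis(
    host=os.environ["REDIS_HOST"],
    port=int(os.environ.get("REDIS_PORT", "6379")),
)
r.set("healthcheck", "ok")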
I am trying to deploy a Python script via a Dockerfile on Google Cloud Run. The script takes a service account key as one of its variables. What is the best practice for mounting the service account key as an environment variable in order to deploy on Google Cloud Run? To deploy the same thing on Google Kubernetes Engine I tried using a ConfigMap to store the key and referenced it while deploying. Is there any such provision for Google Cloud Run?
As John Hanley and Guillaume Blaquiere have already advised, it's not a good practice to pass a key file as an environment variable. Nevertheless, in response to your question I will explain how to use environment variables in Cloud Run.
Defining environment variables for a Cloud Run service
You can specify the environment variables for your Cloud Run service upon its creation in the Cloud Console or you can set them for an existing service using specific flags in the Command line.
Alternatively, the environment variables can be set in the container using the ENV statement.
Using environment variables in Cloud Run
In order to retrieve the values of the environment variables in Python, you can use the os.environ mapping of the os module:
import os
os.environ['<name-of-the-env-variable>']
Example
If you set an environment variable called ‘testENV’ in your Dockerfile:
ENV testENV="my variable"
you will be able to retrieve it like this:
import os
sample = os.environ['testENV']
What would be the appropriate way to configure infrastructure-independent parameters inside a Docker container running on ECS?
Let's say there's an API that needs to connect to external sources (a DB, for example, that doesn't live inside the AWS infrastructure). How would one configure the container to discover the external sources?
What I've come up with:
Environment variables;
Injecting configuration during Docker image building;
Using AWS Systems Manager Parameter Store;
Using AWS Secrets Manager;
Hosting the configuration in S3 for example and reading from there.
To me, using environment variables seems to be the way to go.
After all, one wouldn't want to make an extra query to AWS Systems Manager or Secrets Manager just to get the DB host & port every time a connection to the external source is made.
Another possibility I thought about was that, after the container is started, the required parameters are queried from AWS Systems Manager or Secrets Manager and then stored in some sort of configuration file. But how would one then distinguish between test & production?
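As a sketch of that last idea, the test/production distinction can live in the parameter path itself, so the container only needs a single non-sensitive ENV value (the parameter paths and names below are made up):

import os

import boto3

env = os.environ.get("ENV", "test")     # "test" or "production", injected into the container
ssm = boto3.client("ssm")


def get_param(name: str) -> str:
    # Fetch a (possibly encrypted) parameter for the current environment.
    response = ssm.get_parameter(
        Name=f"/myapp/{env}/{name}",    # hypothetical parameter naming scheme
        WithDecryption=True,
    )
    return response["Parameter"]["Value"]


# Queried once at startup and kept in memory rather than per connection.
db_host = get_param("db_host")
db_port = int(get_param("db_port"))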
Am I missing something obvious here?
I'm working on an Apache Spark application which I submit to an AWS EMR cluster from an Airflow task.
In the Spark application logic I need to read files from AWS S3 and information from AWS RDS. For example, in order to connect to AWS RDS for PostgreSQL from the Spark application, I need to provide the username/password for the database.
Right now I'm looking for the best and most secure way to keep these credentials in a safe place and provide them as parameters to my Spark application. Please suggest where to store these credentials in order to keep the system secure - as env vars, somewhere in Airflow, or where?
In Airflow you can create Variables to store this information. Variables can be listed, created, updated and deleted from the UI (Admin -> Variables). You can then access them from your code as follows:
from airflow.models import Variable
foo = Variable.get("foo")
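As a small extension of the above, a hedged sketch of keeping the whole credential set in a single JSON Variable (the variable name "rds_credentials" is just an illustration):

from airflow.models import Variable

# The Variable would be created via Admin -> Variables as a JSON blob, e.g.
# {"username": "...", "password": "..."}.
creds = Variable.get("rds_credentials", deserialize_json=True)
db_user = creds["username"]
db_pass = creds["password"]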
Airflow has us covered beautifully on the credentials-management front by offering the Connection SQLAlchemy model, which can be accessed from the WebUI (where passwords still remain hidden).
You can control the key that Airflow uses to encrypt passwords while storing Connection details in its backend meta-db (the fernet_key setting).
It also provides an extra param for storing unstructured / client-specific stuff such as a {"use_beeline": true} config for HiveServer2.
In addition to the WebUI, you can also edit Connections via the CLI (which is true for pretty much every feature of Airflow).
Finally, if your use-case involves dynamically creating / deleting Connections, that is also possible by exploiting the underlying SQLAlchemy Session. You can see the implementation details in cli.py.
Note that Airflow treats all Connections equally irrespective of their type (the type is just a hint for the end user); Airflow distinguishes them on the basis of conn_id only.
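To make that concrete, a minimal sketch of reading such a Connection from code, assuming Airflow 2.x import paths and a hypothetical conn_id:

from airflow.hooks.base import BaseHook

# "my_postgres" is a placeholder conn_id created via the WebUI or CLI.
conn = BaseHook.get_connection("my_postgres")
jdbc_url = f"jdbc:postgresql://{conn.host}:{conn.port}/{conn.schema}"
db_user = conn.login
db_pass = conn.password          # stored encrypted in the Airflow meta-db
extras = conn.extra_dejson       # e.g. the {"use_beeline": true} style extras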
I am deploying a Django app to the Google App Engine flexible environment using the following command
gcloud app deploy
But I get this error
Updating service [default] (this may take several minutes)...failed.
ERROR: (gcloud.app.deploy) Error Response: [13] Invalid Cloud SQL name: gcloud beta sql instances describe
What could be the problem?
In your app.yaml file (as well as in mysite/settings.py), you have to provide the instance connection name of your Cloud SQL instance. It is in the format:
[PROJECT_NAME]:[REGION_NAME]:[INSTANCE_NAME].
You can get this instance connection name by running the gcloud command gcloud sql instances describe [YOUR_INSTANCE_NAME] and copying the value shown for connectionName. In your case it seems that you have copied the command itself instead of the connectionName value.
Alternatively, you can also get the instance connection name by going to your Developer Console > SQL and clicking on your instance. You'll find the instance connection name under the "Connect to this instance" section.
LundinCast's post contains the most important information to fix the issue. Also take into account that the Cloud SQL Proxy provides secure access to your Cloud SQL Second Generation instances (as described here). Use this command to run the proxy, if you have already created one, as suggested in this Django in App Engine Flexible guide:
./cloud_sql_proxy -instances="[YOUR_INSTANCE_CONNECTION_NAME]"=tcp:5432
The mentioned command establishes a connection from your local computer to your Cloud SQL instance for local testing purposes; it must be running while testing, but it is not required when deploying.
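To tie the two answers together, here is a hedged sketch of what the database block in mysite/settings.py could look like; the bracketed values, credentials and the GAE_INSTANCE check are assumptions rather than details taken from the question:

import os

if os.getenv("GAE_INSTANCE"):
    # Assumed check for running on App Engine: connect through the
    # /cloudsql unix socket exposed to the instance.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "HOST": "/cloudsql/[PROJECT_NAME]:[REGION_NAME]:[INSTANCE_NAME]",
            "NAME": "mydb",
            "USER": "myuser",
            "PASSWORD": "mypassword",
        }
    }
else:
    # Local development: go through the running cloud_sql_proxy.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "HOST": "127.0.0.1",
            "PORT": "5432",
            "NAME": "mydb",
            "USER": "myuser",
            "PASSWORD": "mypassword",
        }
    }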