My Beam Dataflow job succeeds locally (with DirectRunner) and fails on the cloud (with DataflowRunner).
The issue is localized to this code snippet:
class SomeDoFn(beam.DoFn):
    ...
    def process(self, gcs_blob_path):
        gcs_client = storage.Client()
        bucket = gcs_client.get_bucket(BUCKET_NAME)
        blob = Blob(gcs_blob_path, bucket)
        # NEXT LINE IS CAUSING ISSUES! (when run remotely)
        url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')
and Dataflow points to the error: "AttributeError: you need a private key to sign credentials.the credentials you are currently using just contains a token."
My Dataflow job uses a service account (and the appropriate service_account_email is provided in the PipelineOptions), but I don't see how I could pass the .json credentials file of that service account to the Dataflow job. I suspect my job runs successfully locally because I set the environment variable GOOGLE_APPLICATION_CREDENTIALS=<path to local file with service account credentials>, but how do I set it similarly for the remote Dataflow workers? Or maybe there is another solution, if anyone could help.
You can see an example here of how to add custom options to your Beam pipeline. With this, we can create a --key_file argument that points to the credentials stored in GCS:
parser.add_argument('--key_file',
                    dest='key_file',
                    required=True,
                    help='Path to service account credentials JSON.')
This will allow you to add the --key_file gs://PATH/TO/CREDENTIALS.json flag when running the job.
Then, you can read it from within the job and pass it as a side input to the DoFn that needs to sign the blob. Starting from the example here we create a credentials PCollection to hold the JSON file:
credentials = (p
               | 'Read Credentials from GCS' >> ReadFromText(known_args.key_file))
and we broadcast it to all workers processing the SignFileFn function:
(p
 | 'Read File from GCS' >> beam.Create([known_args.input])
 | 'Sign File' >> beam.ParDo(SignFileFn(), pvalue.AsList(credentials)))
Inside the ParDo, we build the JSON object to initialize the client (using the approach here) and sign the file:
class SignFileFn(beam.DoFn):
    """Signs a GCS file with GCS-stored credentials."""
    def process(self, gcs_blob_path, creds):
        import datetime
        import json
        import logging

        from google.cloud import storage
        from google.oauth2 import service_account

        # Rebuild the key JSON from the lines read by ReadFromText.
        credentials_json = json.loads('\n'.join(creds))
        credentials = service_account.Credentials.from_service_account_info(credentials_json)

        gcs_client = storage.Client(credentials=credentials)
        bucket = gcs_client.get_bucket(gcs_blob_path.split('/')[2])
        blob = bucket.blob('/'.join(gcs_blob_path.split('/')[3:]))

        url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')
        logging.info(url)
        yield url
See the full code here.
You will need to provide the service account JSON key, similarly to what you are doing locally with the GOOGLE_APPLICATION_CREDENTIALS environment variable.
To do so you can follow a few of the approaches mentioned in the answers to this question, such as passing it using PipelineOptions.
However, keep in mind that the safest way is to store the JSON key in, say, a GCS bucket and fetch the file from there (a rough sketch of that follows below).
The easy but less safe workaround is to take the key, open it, and build a JSON object from it in your code so you can pass it along later.
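For reference, a minimal sketch of the "fetch the key from a bucket" approach. The gs://my-bucket/sa-key.json path, the SignUrlFn name, and the use of DoFn.setup() are my own assumptions, not part of the answer above; the worker reads the key once with its own default credentials and then signs with the loaded key:
import datetime
import json

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from google.cloud import storage
from google.oauth2 import service_account


class SignUrlFn(beam.DoFn):
    """Signs GCS blobs using a key file stored in a (hypothetical) bucket."""

    def setup(self):
        # Each worker reads the key once at startup using its own credentials.
        with FileSystems.open('gs://my-bucket/sa-key.json') as f:
            info = json.loads(f.read().decode('utf-8'))
        self._signing_credentials = service_account.Credentials.from_service_account_info(info)

    def process(self, gcs_blob_path):
        client = storage.Client(credentials=self._signing_credentials)
        bucket = client.get_bucket(gcs_blob_path.split('/')[2])
        blob = bucket.blob('/'.join(gcs_blob_path.split('/')[3:]))
        yield blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')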
I am trying to run a Custom Training Job to deploy my model in Vertex AI directly from a Jupyterlab. This Jupyterlab is instantiated from a Vertex AI Managed Notebook where I already specified the service account.
My aim is to deploy the training script that I specify to the method CustomTrainingJob directly from the cells of my notebook. This would be equivalent to pushing an image that contains my script to Container Registry and deploying the training job manually from the Vertex AI UI (in this way, by specifying the service account, I was able to correctly deploy the training job). However, I need everything to be executed from the same notebook.
In order to specify the credentials to the CustomTrainingJob of aiplatform, I execute the following cell, where all variables are correctly set:
import google.auth
from google.cloud import aiplatform
from google.auth import impersonated_credentials

source_credentials = google.auth.default()

target_credentials = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal='SERVICE_ACCOUNT.iam.gserviceaccount.com',
    target_scopes=['https://www.googleapis.com/auth/cloud-platform'])

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path=SCRIPT_PATH,
    container_uri=MODEL_TRAINING_IMAGE,
    credentials=target_credentials
)
When the job.run() command is executed, it seems that the credentials are not correctly set. In particular, the following error is returned:
/opt/conda/lib/python3.7/site-packages/google/auth/impersonated_credentials.py in _update_token(self, request)
254
255 # Refresh our source credentials if it is not valid.
--> 256 if not self._source_credentials.valid:
257 self._source_credentials.refresh(request)
258
AttributeError: 'tuple' object has no attribute 'valid'
I also tried different ways to configure the credentials of my service account, but none of them seem to work. In this case it looks like the tuple that contains the source credentials is missing the 'valid' attribute, even though the method google.auth.default() only returns two values.
To run the custom training job using a service account, you could try using the service_account argument for job.run(), instead of trying to set credentials. As long as the notebook executes as a user that has act-as permissions for the chosen service account, this should let you run the custom training job as that service account.
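A minimal sketch of that approach, reusing the placeholders from the question (the exact service account email shown here is an assumption):
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path=SCRIPT_PATH,
    container_uri=MODEL_TRAINING_IMAGE,
)

# The training job itself executes as this service account; the notebook's
# identity only needs permission to act as it (Service Account User role).
job.run(
    service_account='SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com',
)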
I've created a dashboard and deployed it on AWS Elastic Beanstalk. The data fed into my dashboard is supplied by a CSV file in my S3 bucket, set to update every 12 hours with AWS EventBridge. For some reason, my deployed dashboard is not updating. It's still using the same old data from my previous deployment even though the CSV file has been updating correctly.
More specifically:
I'm trying to create a Dashboard with Plotly Dash to visualize some trends starting from 2020-01-01.
I had a Lambda function that scrapes the data and saves them as a CSV file in an S3 bucket. This CSV file gets overwritten every 12 hours to capture the latest available trends.
I used boto3 to fetch the CSV file directly from my S3 bucket and use its data to construct my dashboard.
The app was then deployed with Elastic Beanstalk.
Everything was written in a Cloud9 environment, except for setting up the EventBridge trigger.
Say I deployed the app on 2020-12-10. The CSV file would contain all data up till 2020-12-10, and my dashboard would show trends between 2020-01-01 and 2020-12-10.
However, if I check the dashboard anytime after 2020-12-10 (or when the CSV file is updated with data post 2020-12-10), it still shows the same trends (between 2020-01-01 and 2020-12-10), though the CSV file in my S3 bucket is up to date.
The dashboard would update only if I redeploy the app on Elastic Beanstalk. Not sure why this is the case since my app is pulling the data directly from the updated CSV file.
Is my architecture incorrect here? Or do I need to tweak some settings in AWS?
Thanks in advance!
Update:
I'm using the following code to load my data into the trends_data dataframe.
# define bucket name
bucket = "mobilitytrends"

# define s3 client
s3 = boto3.client('s3')

# define file names
historical_file_name = 'historical_trends.csv'

# load historical data from s3
data_obj = s3.get_object(Bucket=bucket, Key=historical_file_name)
trend_data = pd.read_csv(data_obj['Body'], low_memory=False)
I then have some functions that clean this dataframe. I have a scatterplot that's rendered using the code snippet below:
fig.add_scatter(x=filtered_trend.index,
                y=filtered_trend[transportation],
                line=dict(color=line_color[idx]),
                name=transportation)
filtered_trend is a subset of trends_data, which gets selected based on some callback functions that I set up. But I don't think that's where the problem lies since everything worked fine locally.
In Dash, global variables will break your app. More specifically, modifying global variables will not work, at least not reliably.
One approach to avoid the use of global variables would be to create a single callback that first loads the data from S3, and then renders the layout. Other approaches are discussed in this similar question.
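As a rough illustration (not your exact code), here is a sketch of a callback that re-reads the CSV from S3 on a timer instead of loading it once at import time. The dcc.Interval trigger, the figure layout, and the column handling are assumptions:
import boto3
import pandas as pd
from dash import Dash, dcc, html
from dash.dependencies import Input, Output

app = Dash(__name__)
# Elastic Beanstalk looks for a WSGI callable named "application".
application = app.server

app.layout = html.Div([
    # Re-trigger the data load every 12 hours (interval is in milliseconds).
    dcc.Interval(id='refresh', interval=12 * 60 * 60 * 1000),
    dcc.Graph(id='trends-graph'),
])

@app.callback(Output('trends-graph', 'figure'), Input('refresh', 'n_intervals'))
def update_trends(_):
    # Fetch the object inside the callback so every refresh sees the latest
    # CSV in the bucket, not the copy read at deployment time.
    s3 = boto3.client('s3')
    data_obj = s3.get_object(Bucket='mobilitytrends', Key='historical_trends.csv')
    trend_data = pd.read_csv(data_obj['Body'], low_memory=False)
    return {
        'data': [{'x': trend_data.index, 'y': trend_data[col], 'name': col}
                 for col in trend_data.columns],
        'layout': {'title': 'Mobility trends'},
    }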
I had a similar problem: EB was not fetching the latest version of the CSV from the S3 bucket.
The only option I could find was to restart the app server after a new version of the CSV is uploaded to the S3 bucket.
You can use the code below in an AWS Lambda function to restart your app server at specific times of the day:
import boto3

client = boto3.client('elasticbeanstalk', region_name='your-region')

def lambda_handler(event, context):
    try:
        response = client.restart_app_server(EnvironmentName='your-environment-name')
        if response:
            print('restarting app server')
        else:
            print('Failed to restart server')
    except Exception as e:
        print(e)
Make sure to set up a cron schedule with EventBridge for the timings.
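For the timing piece, a hypothetical sketch of the EventBridge schedule that invokes such a Lambda twice a day (the rule name, region, and Lambda ARN below are placeholders):
import boto3

events = boto3.client('events', region_name='your-region')

# Fire at 00:00 and 12:00 UTC every day.
events.put_rule(
    Name='restart-eb-app-server',
    ScheduleExpression='cron(0 0,12 * * ? *)',
    State='ENABLED',
)

# Point the rule at the restart Lambda (replace with your function's ARN).
# You also need to grant EventBridge permission to invoke the Lambda,
# e.g. via lambda add_permission or the console.
events.put_targets(
    Rule='restart-eb-app-server',
    Targets=[{'Id': 'restart-lambda',
              'Arn': 'arn:aws:lambda:your-region:123456789012:function:restart-eb'}],
)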
I am trying to run the demo code given for PDF parsing with GCP Document AI. Exporting the Google credentials as an environment variable on the command line works fine for running the code. The problem comes when the code needs to run in memory, so no credential file can be read from disk. Is there a way to pass the credentials to the Document AI parsing function?
Google's sample code:
from google.cloud import documentai_v1beta2 as documentai


def main(project_id='YOUR_PROJECT_ID',
         input_uri='gs://cloud-samples-data/documentai/invoice.pdf'):
    """Process a single document with the Document AI API, including
    text extraction and entity extraction."""

    client = documentai.DocumentUnderstandingServiceClient()

    gcs_source = documentai.types.GcsSource(uri=input_uri)

    # mime_type can be application/pdf, image/tiff,
    # and image/gif, or application/json
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type='application/pdf')

    # Location can be 'us' or 'eu'
    parent = 'projects/{}/locations/us'.format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        parent=parent,
        input_config=input_config)

    document = client.process_document(request=request)

    # All text extracted from the document
    print('Document Text: {}'.format(document.text))

    def _get_text(el):
        """Convert text offset indexes into text snippets."""
        response = ''
        # If a text segment spans several lines, it will
        # be stored in different text segments.
        for segment in el.text_anchor.text_segments:
            start_index = segment.start_index
            end_index = segment.end_index
            response += document.text[start_index:end_index]
        return response

    for entity in document.entities:
        print('Entity type: {}'.format(entity.type))
        print('Text: {}'.format(_get_text(entity)))
        print('Mention text: {}\n'.format(entity.mention_text))
When you run your workloads on GCP, you don't need a service account key file. You MUSTN'T use one!!
Why? Two reasons:
It's useless, because every GCP product has at least a default service account, and most of the time you can customize it. You can have a look at the Cloud Functions identity in your case.
A service account key file is a file. That means a lot: you can copy it, send it by email, commit it to a Git repository... Many people can end up with access to it and you lose control of this secret. And because it's a secret, you have to store it securely and rotate it regularly (at least every 90 days, per Google's recommendation)... It's a nightmare! When you can, don't use service account key files!
What do the client libraries do?
They check whether the GOOGLE_APPLICATION_CREDENTIALS env var exists.
They look in the "well-known" location (when you perform a gcloud auth application-default login to let a local application use your credentials to access Google resources, a file is created in a "standard location" on your computer).
If neither is found, they check whether the metadata server exists (only on GCP). This server provides the authentication information to the libraries.
Otherwise, they raise an error.
So, simply use the correct service account on your function and grant it the roles it needs to achieve what you want to do.
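For illustration, a minimal sketch assuming the code runs on a GCP service such as Cloud Functions; no key file is involved and the import matches the sample above:
import google.auth
from google.cloud import documentai_v1beta2 as documentai

# On GCP this resolves to the runtime's service account via the metadata
# server; no GOOGLE_APPLICATION_CREDENTIALS and no key file on disk.
credentials, project_id = google.auth.default()

# Passing credentials explicitly is optional: with no argument the client
# walks the same ADC chain described above.
client = documentai.DocumentUnderstandingServiceClient(credentials=credentials)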
I am creating temporary credentials via AWS Security Token Service (AWS STS).
I am using these credentials to upload a file to S3 from the S3 Java SDK.
I need some way to restrict the size of the file upload.
I tried adding a policy (with s3:content-length-range) while creating the user, but that doesn't seem to work.
Is there any other way to specify the maximum file size a user can upload?
An alternative method would be to generate a pre-signed URL instead of temporary credentials. It will be valid for one file with a name you specify, and you can also enforce a content-length range when you generate the URL. Your user will get the URL and will have to use a specific method (POST/PUT/etc.) for the request. They set the content while you set everything else.
I'm not sure how to do that with Java (it doesn't seem to have support for conditions), but it's simple with Python and boto3:
import boto3

# Get the service client
s3 = boto3.client('s3')

# Make sure everything posted stays private
fields = {"acl": "private"}

# Ensure that the ACL isn't changed and restrict the upload to a length
# between 10 and 100 bytes.
conditions = [
    {"acl": "private"},
    ["content-length-range", 10, 100]
]

# Generate the POST attributes
post = s3.generate_presigned_post(
    Bucket='bucket-name',
    Key='key-name',
    Fields=fields,
    Conditions=conditions
)
When testing this, make sure every single header item matches, or you'll get vague access-denied errors. It can take a while to match everything completely.
I believe there is no way to limit the object size before uploading, and reacting to it afterwards is quite hard. A workaround would be to create an S3 event notification that triggers your code, through a Lambda function or an SNS topic. That code could validate or delete the object and notify the user, for example.
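A hypothetical sketch of that workaround: a Lambda triggered by s3:ObjectCreated:* that removes objects over a size limit (the limit value and the notification step are assumptions):
import boto3

MAX_BYTES = 10 * 1024 * 1024  # example limit: 10 MB

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        size = record['s3']['object']['size']
        if size > MAX_BYTES:
            # The object was already uploaded, so the best we can do after
            # the fact is delete it (and optionally notify the uploader,
            # e.g. via an SNS topic).
            s3.delete_object(Bucket=bucket, Key=key)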
I tried creating a datasource using boto for machine learning but ended up with an error.
Here's my code :
import boto

bucketname = 'mybucket'
filename = 'myfile.csv'
schema = 'myfile.csv.schema'
conn = boto.connect_s3()
datasource = 'my_datasource'

ml = boto.connect_machinelearning()

# create a data source
ds = ml.create_data_source_from_s3(
    data_source_id=datasource,
    data_spec={
        'DataLocationS3': 's3://' + bucketname + '/' + filename,
        'DataSchemaLocationS3': 's3://' + bucketname + '/' + schema},
    data_source_name=None,
    compute_statistics=True)

print(ml.get_data_source(datasource, verbose=None))
I get this error as a result of the get_data_source call:
Could not access 's3://mybucket/myfile.csv'. Either there is no file at that location, or the file is empty, or you have not granted us read permission.
I have checked and I have FULL_CONTROL as my permissions. The bucket, file and schema all are present and are non-empty.
How do I solve this?
You may have FULL_CONTROL over that S3 resource but in order for this to work you have to grant the Machine Learning service the appropriate access to that S3 resource.
I know links to answers are frowned upon, but in this case I think it's best to link to the definitive documentation from the Machine Learning service, since the actual steps are involved and could change in the future.
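As a rough sketch of what the linked documentation describes (please verify the exact policy there; the bucket name is a placeholder), you attach a bucket policy that lets the machinelearning.amazonaws.com service principal read your data:
import json
import boto3

# Allow the Amazon ML service to read the objects and list the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "machinelearning.amazonaws.com"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::mybucket/*"
        },
        {
            "Effect": "Allow",
            "Principal": {"Service": "machinelearning.amazonaws.com"},
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::mybucket"
        }
    ]
}

s3 = boto3.client('s3')
s3.put_bucket_policy(Bucket='mybucket', Policy=json.dumps(policy))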