I need help setting an expiry time for a new table in Google BigQuery (GBQ).
I am creating/uploading a new file as a table in GBQ using the code below:
def uploadCsvToGbq(self, table_name, jsonSchema, csvFile, delim):
    job_data = {
        'jobReference': {
            'projectId': self.project_id,
            'job_id': str(uuid.uuid4())
        },
        #"expires":str(datetime.now()+timedelta(seconds=60)),
        #"expirationTime": 20000,
        #"defaultTableExpirationMs":20000,
        'configuration': {
            'load': {'writeDisposition': 'WRITE_TRUNCATE',
                     'fieldDelimiter': delim,
                     'skipLeadingRows': 1,
                     'sourceFormat': 'CSV',
                     'schema': {
                         'fields': jsonSchema
                     },
                     'destinationTable': {
                         'projectId': self.project_id,
                         'datasetId': self.dataset_id,
                         'tableId': table_name
                     }
                     }
        }
    }
    upload = MediaFileUpload(csvFile,
                             mimetype='application/octet-stream', chunksize=1048576,
                             # This enables resumable uploads.
                             resumable=True)
    start = time.time()
    job_id = 'job_%d' % start
    # Create the job.
    return self.bigquery.jobs().insert(projectId=self.project_id,
                                       body=job_data,
                                       media_body=upload).execute()
This code works and uploads the file into GBQ as a new table. Now I need to set an expiry time for the table. I have already tried setting expires, expirationTime and defaultTableExpirationMs (commented out above), but nothing works.
Does anyone have any idea?
You should use the Tables: patch API and set the expirationTime property.
The function below creates a table with an expirationTime, so as an alternative solution you can create the table first and insert the data later.
def createTableWithExpire(bigquery, dataset_id, table_id, expiration_time):
    """
    Creates a BQ table that will expire at the specified time.
    Expiration time is a timestamp in milliseconds since the epoch, e.g. 1452627594000
    """
    table_data = {
        "expirationTime": expiration_time,
        "tableReference":
        {
            "tableId": table_id
        }
    }
    return bigquery.tables().insert(
        projectId=_PROJECT_ID,
        datasetId=dataset_id,
        body=table_data).execute()
Also answered by Mikhail in this SO question.
Thank you both. I combined both solutions, but made some modifications to make them work for my case.
Since I am creating the table by uploading a CSV, I set the expirationTime afterwards by calling the patch method and passing the table ID to it:
def createTableWithExpire(bigquery, dataset_id, table_id, expiration_time):
    """
    Sets an expiration time on an existing BQ table.
    Expiration time is a timestamp in milliseconds since the epoch, e.g. 1452627594000
    """
    table_data = {
        "expirationTime": expiration_time,
    }
    return bigquery.tables().patch(
        projectId=_PROJECT_ID,
        datasetId=dataset_id,
        tableId=table_id,
        body=table_data).execute()
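For reference, here is one way the expiration_time value could be computed before calling the function above, assuming the milliseconds-since-epoch convention of the tables resource (the one-hour offset is just an example):
import time

# Expire the table one hour from now; the tables resource expects
# expirationTime in milliseconds since the Unix epoch.
expiration_time = int((time.time() + 3600) * 1000)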
Another alternative is to set the expiration time after the table has been created:
from google.cloud import bigquery
import datetime

client = bigquery.Client()
table_ref = client.dataset('my-dataset').table('my-table')  # get table ref
table = client.get_table(table_ref)  # get Table object

# set datetime of expiration, must be a datetime type
table.expires = datetime.datetime.combine(datetime.date.today() +
                                          datetime.timedelta(days=2),
                                          datetime.time())
table = client.update_table(table, ['expires'])  # update table
Related
Hi Stack Overflow, I'm trying to conditionally put an item into a DynamoDB table. The table has the following attributes:
ticker - Partition Key
price_date - Sort Key
price - Attribute
Every minute I'm calling an API which gives me a minute-by-minute list of dictionaries for all stock prices within the day so far. However, the data I receive from the API can sometimes be behind by a minute or two. I don't particularly want to overwrite all the records within the DynamoDB table every time I get new data. To achieve this I've tried to create a condition expression so that put_item only succeeds when the ticker matches but the price_date is new.
I've created a simplification of my code below to better illustrate my problem.
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('stock-intraday')

data = [
    {'ticker': 'GOOG', 'price_date': '2021-10-08T9:30:00.000Z', 'price': 100},
    {'ticker': 'GOOG', 'price_date': '2021-10-08T9:31:00.000Z', 'price': 101}
]

for item in data:
    dynamodb_response = table.put_item(
        Item=item,
        ConditionExpression=Attr("ticker").exists() & Attr("price_date").not_exists())
However, when I run this code I get this error...
What is wrong with my conditional expression?
I found an answer to my own problem. DynamoDB was throwing an error because my code WAS working; it just needed some minor changes.
There needed to be a TRY/EXCEPT block, and since the partition key is already evaluated by put_item, only price_date needed to be included in the condition expression:
import boto3
from boto3.dynamodb.conditions import Attr
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('stock-intraday')

data = [
    {'ticker': 'GOOG', 'price_date': '2021-10-08T9:30:00.000Z', 'price': 100},
    {'ticker': 'GOOG', 'price_date': '2021-10-08T9:31:00.000Z', 'price': 101}]

for item in data:
    try:
        dynamodb_response = table.put_item(
            Item=item,
            ConditionExpression=Attr("price_date").not_exists())
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            pass
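If you are using the low-level boto3 client instead of the Table resource, the Attr helpers aren't available; a sketch of the equivalent check using the expression-string syntax, reusing the same table and attribute names as above, would look like this:
import boto3
from botocore.exceptions import ClientError

client = boto3.client('dynamodb')

try:
    client.put_item(
        TableName='stock-intraday',
        Item={
            'ticker': {'S': 'GOOG'},
            'price_date': {'S': '2021-10-08T9:30:00.000Z'},
            'price': {'N': '100'},
        },
        # Only write if no item with this partition key + sort key exists yet
        ConditionExpression='attribute_not_exists(price_date)')
except ClientError as e:
    if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
        raise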
I have a few Google Cloud Storage Transfer jobs running in my GCP account, which transfer data from Azure to a GCS bucket.
As per this document - https://cloud.google.com/storage-transfer/docs/reference/rest/v1/transferJobs/get?apix_params=%7B%22jobName%22%3A%22transferJobs%2F213858246512856794%22%2C%22projectId%22%3A%22merlincloud-gcp-preprod%22%7D
the "get" method can fetch details of the job such as name, description, bucketName, status, includePrefixes, storageAccount and so on.
Here's the sample output of "get" method.
{
    "name": "transferJobs/<job_name>",
    "description": "<description given while creating job>",
    "projectId": "<project_id>",
    "transferSpec": {
        "gcsDataSink": {
            "bucketName": "<destination_bucket>"
        },
        "objectConditions": {
            "includePrefixes": [
                "<prefix given while creating job>"
            ],
            "lastModifiedSince": "2021-06-30T18:30:00Z"
        },
        "transferOptions": {
        },
        "azureBlobStorageDataSource": {
            "storageAccount": "<account_name>",
            "container": "<container_name>"
        }
    },
    "schedule": {
        "scheduleStartDate": {
            "year": 2021,
            "month": 7,
            "day": 1
        },
        "startTimeOfDay": {
            "hours": 13,
            "minutes": 45
        },
        "repeatInterval": "86400s"
    },
    "status": "ENABLED",
    "creationTime": "2021-07-01T06:08:19.392111916Z",
    "lastModificationTime": "2021-07-01T06:13:32.460934533Z",
    "latestOperationName": "transferOperations/transferJobs-<job_name>"
}
Now, how do I fetch the run history details of a particular job in Python?
By "run history details" I mean the metrics (data transferred, number of files, status, size, duration) displayed in the GTS console as shown in the picture below.
I'm unfamiliar with the transfer service but I'm very familiar with GCP.
The only other resource that's provided by the service is transferOperations.
Does that provide the data you need?
If not (!), it's possible that Google hasn't exposed this functionality beyond the Console. This happens occasionally even though the intent is always to be (public) API first.
One way you can investigate is to check the browser's developer tools 'network' tab to see what REST API calls the Console is making to fulfill the request. Another way is to use the equivalent gcloud command and tack on --log-http to see the underlying REST API calls that way.
As #DazWilkin mentioned, I was able to fetch each job's run history details using the transferOperations - list API.
I wrote a Cloud Function to fetch GTS metrics by making API calls.
It first makes a transferJobs - list API call, fetches the list of jobs, and picks out only the required job details. It then makes a transferOperations - list API call, passing the job name, to fetch the run history details.
Here's the code:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
from datetime import datetime
import logging

"""
requirements.txt
google-api-python-client==2.3.0
oauth2client==4.1.3
"""


class GTSMetrics:
    def __init__(self):
        self.project = "<your_gcp_project_name>"
        self.source_type_mapping = {"gcsDataSource": "Google Cloud Storage", "awsS3DataSource": "Amazon S3",
                                    "azureBlobStorageDataSource": "Azure Storage"}
        self.transfer_job_names = ["transferJobs/<your_job_name>"]
        self.credentials = GoogleCredentials.get_application_default()
        self.service = discovery.build('storagetransfer', 'v1', credentials=self.credentials)
        self.metric_values = {}

    def build_run_history_metrics(self, job=None):
        try:
            if job:
                operation_filters = {"projectId": self.project, "jobNames": [job['name']]}
                request = self.service.transferOperations().list(name='transferOperations', filter=operation_filters)
                while request is not None:
                    response = request.execute()
                    if 'operations' in response:
                        self.metric_values['total_runs'] = len(response['operations'])
                        metadata = response['operations'][0]['metadata']
                        status = metadata['status'] if 'status' in metadata else ""
                        start_time = metadata['startTime'] if 'startTime' in metadata else ""
                        end_time = metadata['endTime'] if 'endTime' in metadata else ""
                        start_time_object = datetime.strptime(start_time[:-4], "%Y-%m-%dT%H:%M:%S.%f")
                        end_time_object = datetime.strptime(end_time[:-4], "%Y-%m-%dT%H:%M:%S.%f")
                        gts_copy_duration = end_time_object - start_time_object
                        self.metric_values['latest_run_status'] = status
                        self.metric_values['latest_run_time'] = str(start_time_object)
                        self.metric_values['latest_run_errors'] = ""
                        self.metric_values['start_time'] = str(start_time_object)
                        self.metric_values['end_time'] = str(end_time_object)
                        self.metric_values['duration'] = gts_copy_duration.total_seconds()
                        if status == "FAILED":
                            if 'errorBreakdowns' in metadata:
                                errors = metadata['errorBreakdowns'][0]['errorCount']
                                error_code = metadata['errorBreakdowns'][0]['errorCode']
                                self.metric_values['latest_run_errors'] = f"{errors} - {error_code}"
                        elif status == "SUCCESS":
                            counters = metadata['counters']
                            data_bytes = counters['bytesCopiedToSink'] if 'bytesCopiedToSink' in counters else '0 B'
                            obj_from_src = str(
                                counters['objectsFoundFromSource']) if 'objectsFoundFromSource' in counters else 0
                            obj_copied_sink = str(
                                counters['objectsCopiedToSink']) if 'objectsCopiedToSink' in counters else 0
                            data_skipped_bytes = counters[
                                'bytesFromSourceSkippedBySync'] if 'bytesFromSourceSkippedBySync' in counters else '0 B'
                            data_skipped_files = counters[
                                'objectsFromSourceSkippedBySync'] if 'objectsFromSourceSkippedBySync' in counters else '0'
                            self.metric_values['data_transferred'] = data_bytes
                            self.metric_values['files_found_in_source'] = obj_from_src
                            self.metric_values['files_copied_to_sink'] = obj_copied_sink
                            self.metric_values['data_skipped_in_bytes'] = data_skipped_bytes
                            self.metric_values['data_skipped_files'] = data_skipped_files
                    break
                    # request = self.service.transferOperations().list_next(previous_request=request,
                    #                                                       previous_response=response)
        except Exception as e:
            logging.error(f"Exception in build_run_history_metrics - {str(e)}")

    def build_job_metrics(self, job):
        try:
            transfer_spec = list(job['transferSpec'].keys())
            source = ""
            source_type = ""
            if "gcsDataSource" in transfer_spec:
                source_type = self.source_type_mapping["gcsDataSource"]
                source = job['transferSpec']["gcsDataSource"]["bucketName"]
            elif "awsS3DataSource" in transfer_spec:
                source_type = self.source_type_mapping["awsS3DataSource"]
                source = job['transferSpec']["awsS3DataSource"]["bucketName"]
            elif "azureBlobStorageDataSource" in transfer_spec:
                source_type = self.source_type_mapping["azureBlobStorageDataSource"]

            frequency = "Once"
            schedule = list(job['schedule'].keys())
            if "repeatInterval" in schedule:
                interval = job['schedule']['repeatInterval']
                if interval == "86400s":
                    frequency = "Every day"
                elif interval == "604800s":
                    frequency = "Every week"
                else:
                    frequency = "Custom"

            prefix = ""
            if 'objectConditions' in transfer_spec:
                obj_con = job['transferSpec']['objectConditions']
                if 'includePrefixes' in obj_con:
                    prefix = job['transferSpec']['objectConditions']['includePrefixes'][0]

            self.metric_values['job_description'] = job['description']
            self.metric_values['job_name'] = job['name']
            self.metric_values['source_type'] = source_type
            self.metric_values['source'] = source
            self.metric_values['destination'] = job['transferSpec']['gcsDataSink']['bucketName']
            self.metric_values['frequency'] = frequency
            self.metric_values['prefix'] = prefix
        except Exception as e:
            logging.error(f"Exception in build_job_metrics - {str(e)}")

    def build_metrics(self):
        try:
            request = self.service.transferJobs().list(pageSize=None, pageToken=None, x__xgafv=None,
                                                       filter={"projectId": self.project})
            while request is not None:
                response = request.execute()
                for transfer_job in response['transferJobs']:
                    if transfer_job['name'] in self.transfer_job_names:
                        # fetch job details
                        self.build_job_metrics(job=transfer_job)
                        # fetch run history details for the job
                        self.build_run_history_metrics(job=transfer_job)
                request = self.service.transferJobs().list_next(previous_request=request, previous_response=response)
            logging.info(f"GTS Metrics - {str(self.metric_values)}")
        except Exception as e:
            logging.error(f"Exception in build_metrics - {str(e)}")


def build_gts_metrics(request):
    gts_metrics = GTSMetrics()
    gts_metrics.build_metrics()
My Requirement
I want to create a CloudWatch metric from Athena query results.
Example
I want to create a metric like user_count for each day.
In Athena, I will write an SQL query like this:
select date,count(distinct user) as count from users_table group by 1
In the Athena editor I can see the result, but I want to see these results as a metric in CloudWatch.
CloudWatch-Metric-Name ==> user_count
Dimensions ==> Date,count
If I have this CloudWatch metric and dimensions, I can easily create a monitoring dashboard and send alerts.
Can anyone suggest a way to do this?
You can use CloudWatch custom widgets; see "Run Amazon Athena queries" in the samples.
It's somewhat involved, but you can use a Lambda for this. In a nutshell:
Setup your query in Athena and make sure it works using the Athena console.
Create a Lambda that:
Runs your Athena query
Pulls the query results from S3
Parses the query results
Sends the query results to CloudWatch as a metric
Use EventBridge to run your Lambda on a recurring basis
Here's an example Lambda function in Python that does step #2. Note that the Lambda function will need IAM permissions to run queries in Athena, read the results from S3, and then put a metric into CloudWatch.
import time
import boto3

query = 'select count(*) from mytable'
DATABASE = 'default'
bucket = 'BUCKET_NAME'
path = 'yourpath'


def lambda_handler(event, context):
    # Run query in Athena
    client = boto3.client('athena')
    output = "s3://{}/{}".format(bucket, path)

    # Execution
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )

    # S3 file name uses the QueryExecutionId so
    # grab it here so we can pull the S3 file.
    qeid = response["QueryExecutionId"]

    # occasionally Athena hasn't written the file
    # before the lambda tries to pull it out of S3, so pause a few seconds.
    # Note: You are charged for time the lambda is running.
    # A more elegant but more complicated solution would try to get the
    # file first then sleep.
    time.sleep(3)

    # Get query result from S3.
    s3 = boto3.client('s3')
    objectkey = path + "/" + qeid + ".csv"

    # load object as file
    file_content = s3.get_object(
        Bucket=bucket,
        Key=objectkey)["Body"].read()

    # split file on carriage returns
    lines = file_content.decode().splitlines()

    # get the second line in file
    count = lines[1]

    # remove double quotes
    count = count.replace("\"", "")

    # convert string to int since cloudwatch wants numeric for value
    count = int(count)

    # post query results as a CloudWatch metric
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.put_metric_data(
        MetricData=[
            {
                'MetricName': 'MyMetric',
                'Dimensions': [
                    {
                        'Name': 'DIM1',
                        'Value': 'dim1'
                    },
                ],
                'Unit': 'None',
                'Value': count
            },
        ],
        Namespace='MyMetricNS'
    )
    return response
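For the last step, here is a sketch of how the recurring EventBridge schedule could be wired up with boto3. The rule name, schedule expression, function name, and Lambda ARN below are placeholders; the Lambda also needs a resource policy allowing EventBridge to invoke it, added here with add_permission.
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Placeholder ARN of the Lambda created above
lambda_arn = 'arn:aws:lambda:us-east-1:123456789012:function:athena-metric'

# Create (or update) a rule that fires once an hour
rule = events.put_rule(
    Name='athena-metric-schedule',
    ScheduleExpression='rate(1 hour)',
    State='ENABLED'
)

# Point the rule at the Lambda
events.put_targets(
    Rule='athena-metric-schedule',
    Targets=[{'Id': 'athena-metric-lambda', 'Arn': lambda_arn}]
)

# Allow EventBridge to invoke the Lambda
lambda_client.add_permission(
    FunctionName='athena-metric',
    StatementId='allow-eventbridge-invoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)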
I would like to get the usage cost report of each instance in my AWS account for a period of time.
I'm able to get linked_account_id and service in the output, but I need instance_id as well. Please help.
import argparse
import boto3
import datetime

cd = boto3.client('ce', 'ap-south-1')

results = []
token = None
while True:
    if token:
        kwargs = {'NextPageToken': token}
    else:
        kwargs = {}
    data = cd.get_cost_and_usage(
        TimePeriod={'Start': '2019-01-01', 'End': '2019-06-30'},
        Granularity='MONTHLY',
        Metrics=['BlendedCost', 'UnblendedCost'],
        GroupBy=[
            {'Type': 'DIMENSION', 'Key': 'LINKED_ACCOUNT'},
            {'Type': 'DIMENSION', 'Key': 'SERVICE'}
        ], **kwargs)
    results += data['ResultsByTime']
    token = data.get('NextPageToken')
    if not token:
        break

print('\t'.join(['Start_date', 'End_date', 'LinkedAccount', 'Service', 'blended_cost', 'unblended_cost', 'Unit', 'Estimated']))
for result_by_time in results:
    for group in result_by_time['Groups']:
        blended_cost = group['Metrics']['BlendedCost']['Amount']
        unblended_cost = group['Metrics']['UnblendedCost']['Amount']
        unit = group['Metrics']['UnblendedCost']['Unit']
        print(result_by_time['TimePeriod']['Start'], '\t',
              result_by_time['TimePeriod']['End'], '\t',
              '\t'.join(group['Keys']), '\t',
              blended_cost, '\t',
              unblended_cost, '\t',
              unit, '\t',
              result_by_time['Estimated'])
As far as I know, Cost Explorer can't break down usage per instance. There is a feature, Cost and Usage Reports, which produces a detailed billing report as dump files. In these files you can see the instance ID.
It can also be connected to AWS Athena. Once you have done this, you can query the files directly from Athena.
Here is my presto example.
select
    lineitem_resourceid,
    sum(lineitem_unblendedcost) as unblended_cost,
    sum(lineitem_blendedcost) as blended_cost
from
    <table>
where
    lineitem_productcode = 'AmazonEC2' and
    product_operation like 'RunInstances%'
group by
    lineitem_resourceid
The result is
lineitem_resourceid unblended_cost blended_cost
i-***************** 279.424 279.424
i-***************** 139.948 139.948
i-******** 68.198 68.198
i-***************** 3.848 3.848
i-***************** 0.013 0.013
where the resourceid contains the instance ID. The cost is summed over all usage in the month. For other types of product_operation, it will contain different resource IDs.
You can add an individual tag to all instances (e.g. Id) and then group by that tag:
GroupBy=[
    {
        'Type': 'TAG',
        'Key': 'Id'
    },
],
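Plugged into the question's script, the call might look like the sketch below. Note that the tag generally has to be activated as a cost allocation tag in the Billing console before Cost Explorer will group by it; the tag key 'Id' is just an example.
import boto3

cd = boto3.client('ce', 'ap-south-1')

data = cd.get_cost_and_usage(
    TimePeriod={'Start': '2019-01-01', 'End': '2019-06-30'},
    Granularity='MONTHLY',
    Metrics=['BlendedCost', 'UnblendedCost'],
    GroupBy=[
        {'Type': 'DIMENSION', 'Key': 'LINKED_ACCOUNT'},
        # group by the per-instance tag instead of SERVICE
        {'Type': 'TAG', 'Key': 'Id'}
    ])

for result_by_time in data['ResultsByTime']:
    for group in result_by_time['Groups']:
        # group['Keys'] holds the linked account and a value like 'Id$<tag value>'
        print(group['Keys'], group['Metrics']['UnblendedCost']['Amount'])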
Is there a way at all to query on a global secondary index of DynamoDB using boto3? I can't find any online tutorials or resources.
You need to provide an IndexName parameter for the query function.
This is the name of the index, which is usually different from the name of the index attribute (the name of the index has an -index suffix by default, although you can change it during table creation). For example, if your index attribute is called video_id, your index name is probably video_id-index.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('videos')

video_id = 25
response = table.query(
    IndexName='video_id-index',
    KeyConditionExpression=Key('video_id').eq(video_id)
)
To check the index name, go to the Indexes tab of the table on the web interface of AWS. You'll need a value from the Name column.
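You can also list the index names programmatically with describe_table; a small sketch against the same hypothetical videos table:
import boto3

client = boto3.client('dynamodb')

description = client.describe_table(TableName='videos')
# Each entry carries the IndexName to pass to query()
for gsi in description['Table'].get('GlobalSecondaryIndexes', []):
    print(gsi['IndexName'])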
For anyone using the boto3 client, the example below should work:
import boto3

# for production
client = boto3.client('dynamodb')

# for local development if running local dynamodb server
client = boto3.client(
    'dynamodb',
    region_name='localhost',
    endpoint_url='http://localhost:8000'
)

resp = client.query(
    TableName='UsersTable',
    IndexName='MySecondaryIndexName',
    ExpressionAttributeValues={
        ':v1': {
            'S': 'some#email.com',
        },
    },
    KeyConditionExpression='emailField = :v1',
)

# will always return list
items = resp.get('Items')
first_item = items[0]
Adding the updated technique:
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource(
    'dynamodb',
    region_name='localhost',
    endpoint_url='http://localhost:8000'
)
table = dynamodb.Table('userTable')

attributes = table.query(
    IndexName='UserName',
    KeyConditionExpression=Key('username').eq('jdoe')
)

if 'Items' in attributes and len(attributes['Items']) == 1:
    attributes = attributes['Items'][0]
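One thing to keep in mind with either style: a single query call returns at most 1 MB of data, so for larger result sets you would follow LastEvaluatedKey. A minimal sketch against the same hypothetical table and index:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('userTable')

items = []
kwargs = {
    'IndexName': 'UserName',
    'KeyConditionExpression': Key('username').eq('jdoe'),
}
while True:
    response = table.query(**kwargs)
    items.extend(response['Items'])
    # LastEvaluatedKey is present only when there are more pages to fetch
    if 'LastEvaluatedKey' not in response:
        break
    kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']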
There are so many questions like this because calling DynamoDB through boto3 is not intuitive. I use the dynamof library to make things like this a lot more straightforward. Using dynamof, the call looks like this:
from dynamof.operations import query
from dynamof.conditions import attr

query(
    table_name='users',
    conditions=attr('role').equals('admin'),
    index_name='role_lookup_index')
https://github.com/rayepps/dynamof
disclaimer: I wrote dynamof