Updating an AWS Glue catalog table with the Glue API - amazon-web-services

I am trying to use the create_table Glue API to create the Data Catalog table myself, bypassing the need for a crawler because the schema is going to be the same every time.
I am able to create the Data Catalog table, and whenever an updated CSV file arrives in S3 the table is updated (when I run an Athena query it shows the updated data).
My problem is that in some cases I will only receive deltas (only the data that changed, not the complete data set). When only the deltas arrive and I run the Athena query manually on the table, it shows only the deltas; the data is not merged with the earlier complete data.
So I don't understand how I should apply only the deltas and merge them into the original Data Catalog table.
Is this even possible with AWS Glue?
Below is my current code:
import boto3
import json

client = boto3.client('glue')

def lambda_handler(event, context):
    response = client.create_table(
        DatabaseName='sample',
        TableInput={
            'Name': 'lawyertable4',
            'Description': 'Table created with boto3 API',
            'StorageDescriptor': {
                'Columns': [
                    {'Name': 'id', 'Type': 'bigint'},
                    {'Name': 'username', 'Type': 'string'},
                    {'Name': 'time_stamp', 'Type': 'string'},
                ],
                'Location': 's3://location/sample/',
                'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
                'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
                'Compressed': False,
                'SerdeInfo': {
                    'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
                    'Parameters': {
                        'field.delim': ',',
                        'skip.header.line.count': '1'
                    }
                }
            },
        }
    )
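As a side note, create_table raises an AlreadyExistsException on later runs once the table is in the catalog. A minimal create-or-update sketch, assuming the same TableInput dictionary as above (the database and table names are placeholders):

import boto3

glue = boto3.client('glue')

def create_or_update_table(database_name, table_input):
    """Create the catalog table if it is missing, otherwise replace its definition."""
    try:
        # Raises EntityNotFoundException when the table does not exist yet.
        glue.get_table(DatabaseName=database_name, Name=table_input['Name'])
    except glue.exceptions.EntityNotFoundException:
        glue.create_table(DatabaseName=database_name, TableInput=table_input)
    else:
        # TableInput fully replaces the existing definition (schema, location, SerDe).
        glue.update_table(DatabaseName=database_name, TableInput=table_input)

Note that this only keeps the catalog definition in sync; merging delta files with the earlier complete data has to happen on the S3 data itself (for example with a Glue ETL job or an Athena CTAS/INSERT INTO query), because the catalog table is only a pointer to the S3 location.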

Related

How to Specify a Unique Column for QuestDB Import?

I am importing data to QuestDB with the HTTP method in Python.
The table (test_data) has the following properties:
'name': 'time', 'size': 8, 'type': 'TIMESTAMP'
'name': 'open', 'size': 8, 'type': 'DOUBLE'
'name': 'high', 'size': 8, 'type': 'DOUBLE'
'name': 'low', 'size': 8, 'type': 'DOUBLE'
'name': 'close', 'size': 8, 'type': 'DOUBLE'
'name': 'volume', 'size': 8, 'type': 'DOUBLE'
'name': 'ts', 'size': 8, 'type': 'TIMESTAMP'
Note: the 'time' column is the designated timestamp
The imported data is sourced from a pandas DataFrame. The DataFrame has the same headers, the 'time' column is the index, and the 'ts' column is the timestamp of when the data was acquired. The code below shows the import function.
import csv
import io
import json
import pprint

import requests


def to_csv_str(table):
    output = io.StringIO()
    csv.writer(output, dialect="excel").writerows(table)
    return output.getvalue().encode("utf-8")


def write_to_table(df, table="test_data"):
    table_name = table
    table = [[df.index.name] + df.columns.tolist()] + df.reset_index().values.tolist()
    table_csv = to_csv_str(table)
    schema = json.dumps([])
    response = requests.post(
        "http://localhost:9000/imp",
        params={"fmt": "json"},
        files={"schema": schema, "data": (table_name, table_csv)},
    ).json()
    pprint.pprint(response)
The import executes successfully the first time. If I rerun the import for the same data (all values are the same except the 'ts' column, which records when the data was acquired), then one additional row is appended per record with all of the same values except the 'ts' column. How can I define the 'time' column so that it is forced to be unique, and any imported row with a duplicate 'time' value is omitted?
Example screenshots for a 6-row import below:
Initial import with all rows successful (image 1)
Reissued import with only 5 errors where 6 were expected (image 2)
Table data from the web console (image 3)
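As far as I know, QuestDB does not enforce uniqueness on a column, so a duplicate 'time' value is simply appended as another row. One workaround is to deduplicate on the client side before posting: ask the server for the latest stored timestamp over the HTTP /exec endpoint and drop any DataFrame rows at or before it. A rough sketch under those assumptions (table and column names taken from the question, error handling omitted):

import pandas as pd
import requests


def drop_already_imported(df, table="test_data", host="http://localhost:9000"):
    """Remove rows whose 'time' index is already covered by the table.

    Assumes the 'time' column is the designated timestamp and the DataFrame index,
    as described above.
    """
    resp = requests.get(
        host + "/exec",
        params={"query": "SELECT max(time) AS max_time FROM " + table},
    ).json()
    rows = resp.get("dataset") or []
    if not rows or rows[0][0] is None:
        return df  # empty table, nothing to filter
    latest = pd.Timestamp(rows[0][0])
    # Keep only rows strictly newer than what is already stored.
    return df[df.index > latest]

Calling write_to_table(drop_already_imported(df)) then only posts rows that are not yet in the table; this guards against re-importing the same batch, but not against duplicates inside a single batch.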

Dataflow XML to BigQuery using the dynamic table destination feature (facing latency issue)

I am new to Dataflow and got stuck with the issue below.
Problem statement: I need a Dataflow job (Python) to load XML from GCS into BigQuery (batch load). The destination table in BigQuery is dynamic and calculated at run time based on the XML file name.
Solution decided: I followed this article - https://medium.com/google-cloud/how-to-load-xml-data-into-bigquery-using-python-dataflow-fd1580e4af48. In the article a static table was used in the WriteToBigQuery transform, but I am using a dynamic table name obtained via a callable function (code reference attached).
JOB Graph:
Code:
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage

# List the GCS file objects
# ToDo: Use Apache Beam GCSIO match patterns to list the file objects
storage_client = storage.Client()
bucket_name = "xmltobq"
bucket = storage_client.get_bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix="xmlfiles/"))
blob_files = [blob.name for blob in blobs if ".xml" in blob.name]

# Static schema
table_schema = {
    "fields": [
        {'name': 'filename', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'CustomerID', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'EmployeeID', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'OrderDate', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'RequiredDate', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'ShipInfo', 'type': 'RECORD', 'mode': 'NULLABLE', 'fields': [
            {'name': 'ShipVia', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'Freight', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'ShipName', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'ShipAddress', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'ShipCity', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'ShipRegion', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'ShipPostalCode', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'ShipCountry', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'ShippedDate', 'type': 'STRING', 'mode': 'NULLABLE'},
        ]},
    ]
}


def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=pipeline_options) as p:

        def readfiles(element):
            '''
            Input PCollection: GCS element path
            Output PCollection: (XML, filename)
            '''
            # Simple XML conversion using the xmltodict package
            # ToDo: Once specific XML paths are acquired, we can parse only the required fields
            import xmltodict
            import apache_beam as beam
            gcs_file = beam.io.filesystems.FileSystems.open("gs://xmltobq/" + element)
            parsed_xml = xmltodict.parse(gcs_file)
            return parsed_xml, element.split("/")[1].split(".")[0]

        def xmlformatting(element):
            '''
            Input PCollection: XML
            Output PCollection: a generator of modified XML elements
            '''
            data, filename = element
            for order in data['Root']['Orders']['Order']:
                yield formatting(order, filename)

        # def tablename(e):
        #     import re
        #     return "gcp-bq-2021:dataset4." + re.sub(r"[\s+,(,)]", "", e)

        def formatting(order, filename):
            '''
            Input PCollection: (XML element, filename)
            Output PCollection: modified XML
            ToDo: This is just to handle the sample XML; production code will have a different
            formatting process
            '''
            import copy
            import re
            order_copy = copy.deepcopy(order)
            if "#ShippedDate" in order['ShipInfo']:
                order_copy['ShipInfo']['ShippedDate'] = order['ShipInfo']['#ShippedDate']
                del order_copy['ShipInfo']['#ShippedDate']
            order_copy['filename'] = "gcp-bq-2021:testdataset." + re.sub(r"[\s+,(,)]", "", filename)
            return order_copy

        # Dynamic table name, same as that of the input XML file, by adding the file name as a key
        # in the dictionary and accessing it in WriteToBigQuery
        # ToDo: In production code a dynamic schema option will be included in the WriteToBigQuery transform
        pipeline_data = (
            p
            | "Create GCS Object List" >> beam.Create(blob_files)
            | "XMLConversion" >> beam.Map(readfiles)
            | "XMLformatting" >> beam.FlatMap(xmlformatting)
            | "shuffle" >> beam.Reshuffle()
            | beam.io.WriteToBigQuery(
                table=lambda row: row['filename'],  # a lambda function to return the dynamic table name
                schema=table_schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,  # or WRITE_TRUNCATE
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                custom_gcs_temp_location="gs://xmltobq"))
Sample XML File:
ISSUE:
I am able to run the job successfully on the Dataflow runner and move the files into BigQuery. But the time spent in WriteToBigQuery is too long, especially in the ParDo(TriggerCopyJobs) step, where the throughput is almost below 1 element/second.
If I use a single static table instead of a dynamic one, the job completes lightning fast.
Is there anything I am doing wrong that is preventing parallel processing?
Machine type used: n1-highcpu-8.
Job ID: 2022-03-09_07_28_47-10567887862012507747
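One thing that may be worth trying (my assumption, not something confirmed in the question): with the default batch-load path, WriteToBigQuery runs separate load/copy jobs per dynamic destination, and the ParDo(TriggerCopyJobs) step can become the serial bottleneck. Forcing streaming inserts sidesteps those copy jobs entirely, at the cost of streaming-insert quotas and pricing. A sketch of that variant, reusing the imports and table_schema from the code above; "formatted" is a hypothetical name for the PCollection produced by the Reshuffle step:

# Hypothetical variant of the final write step from the pipeline above.
formatted | "WriteToBQ" >> beam.io.WriteToBigQuery(
    table=lambda row: row['filename'],  # callable returning the dynamic table name
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    # STREAMING_INSERTS avoids the per-destination load and copy jobs of FILE_LOADS.
    method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS)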

How do I format the Key in my boto3 Lambda function to update DynamoDB?

I need help figuring out how to format my Key in Lambda to update an item in DynamoDB. Below is the code I have, but I can't figure out how to format the Key.
My table looks as follows:
import json
import boto3

dynamodb = boto3.resource('dynamodb')
client = boto3.client('dynamodb')

def lambda_handler(event, context):
    response = client.update_item(
        TableName='hitcounter',
        Key={????????},
        UpdateExpression='ADD visits :incr',
        ExpressionAttributeValues={':incr': 1}
    )
    print(response)
ERROR MESSAGE:
{
    "errorMessage": "'path'",
    "errorType": "KeyError",
    "stackTrace": [
        "  File \"/var/task/lambda_function.py\", line 11, in lambda_handler\n    Key={'path': event['path']},\n"
    ]
}
The AWS docs provide an example of updating an item:
table.update_item(
    Key={
        'username': 'janedoe',
        'last_name': 'Doe'
    },
    UpdateExpression='SET age = :val1',
    ExpressionAttributeValues={
        ':val1': 26
    }
)
I'm not sure from your question whether the AWS examples are unclear or what the issue is specifically.
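For what it's worth, a sketch of the two Key formats, assuming the table's partition key is named path (that name comes from the stack trace, not from a confirmed table definition). The low-level client needs typed attribute values, while the Table resource takes plain Python values; the KeyError in the trace also suggests the incoming event has no 'path' key, so that lookup is guarded here:

import boto3

dynamodb = boto3.resource('dynamodb')
client = boto3.client('dynamodb')
table = dynamodb.Table('hitcounter')

def lambda_handler(event, context):
    # Hypothetical fallback to '/' when the event carries no 'path' key
    # (the KeyError in the question comes from event['path'] being missing).
    path = event.get('path', '/')

    # Low-level client: key and expression values must be wrapped in DynamoDB types.
    client.update_item(
        TableName='hitcounter',
        Key={'path': {'S': path}},
        UpdateExpression='ADD visits :incr',
        ExpressionAttributeValues={':incr': {'N': '1'}},
    )

    # Table resource: plain Python values are accepted.
    table.update_item(
        Key={'path': path},
        UpdateExpression='ADD visits :incr',
        ExpressionAttributeValues={':incr': 1},
    )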

AWS CloudWatch Python SDK API not working

I have a root AWS account which is linked with three other sub AWS accounts. In my root account I created a Lambda function that gets billing metrics from CloudWatch using the Python SDK, and it works. I am using the access key and secret key of an IAM user that has billing access and full admin access. But when I copied the Lambda code into a sub account's Lambda function, it doesn't retrieve any data. I can't understand why it is not working in the sub account.
import boto3
from datetime import datetime, timedelta

def get_metrics(event, context):
    ACCESS_KEY = 'accesskey'
    SECRET_KEY = 'secretkey'
    client = boto3.client('cloudwatch',
                          aws_access_key_id=ACCESS_KEY,
                          aws_secret_access_key=SECRET_KEY)
    response = client.get_metric_statistics(
        Namespace='AWS/Billing',
        MetricName='EstimatedCharges',
        Dimensions=[
            {
                'Name': 'LinkedAccount',
                'Value': '12 digit account number'
            },
            {
                'Name': 'Currency',
                'Value': 'USD'
            },
        ],
        StartTime='2017, 12, 19',
        EndTime='2017, 12, 21',
        Period=86400,
        Statistics=[
            'Maximum',
        ],
    )
    print(response)
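One detail worth checking (an assumption based on how billing metrics generally behave, not something stated in the question): the AWS/Billing EstimatedCharges metrics are only published to CloudWatch in the us-east-1 region, and the per-account LinkedAccount dimension is normally only populated in the payer (root) account when consolidated billing metrics are enabled. A sketch that pins the region and queries without the LinkedAccount dimension, as a sub account would:

import boto3
from datetime import datetime

# Billing metrics only exist in us-east-1, so pin the region explicitly.
client = boto3.client('cloudwatch', region_name='us-east-1')

def get_metrics(event, context):
    response = client.get_metric_statistics(
        Namespace='AWS/Billing',
        MetricName='EstimatedCharges',
        # In a member (sub) account there is no LinkedAccount dimension;
        # query by Currency only.
        Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
        StartTime=datetime(2017, 12, 19),
        EndTime=datetime(2017, 12, 21),
        Period=86400,
        Statistics=['Maximum'],
    )
    print(response)
    return response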

format data in views.py for ajax

I'm trying to use this event calendar in Django: http://jquery-week-calendar.googlecode.com/svn/trunk/jquery.weekcalendar/full_demo/weekcalendar_full_demo.html
I'm supposed to write the AJAX code myself, but I'm a newbie in jQuery/AJAX. I want to send event data, including startTime, endTime, etc., to show the events on the calendar:
$('#calendar').weekCalendar({
    data: function(callback) {
        $.getJSON("{% url DrHub.views.getEvents %}",
            {},
            function(result) {
                callback(result);
            }
        );
    }
});
This calendar expects data in the following format:
return {
    events: [
        {
            "id": 1,
            "start": new Date(year, month, day, 12),
            "end": new Date(year, month, day, 13, 30),
            "title": "Lunch with Mike"
        },
        {
            "id": 2,
            "start": new Date(year, month, day, 14),
            "end": new Date(year, month, day, 14, 45),
            "title": "Dev Meeting"
        },
        ...
    ]
};
How can I format the data fetched from the database in the getEvents view?
from django.http import HttpResponse
from django.utils import simplejson

def some_view(request):
    # Build the output -> it's a standard Python dict
    output = {
        "events": [
            {
                "id": 1,
                "start": "2009-05-10T13:15:00.000+10:00",
                "end": "2009-05-10T14:15:00.000+10:00",
                "title": "Lunch with Mike"
            },
        ]
    }
    # With db data you would do something like:
    # events = Event.objects.all()
    # for event in events:
    #     event_out = {
    #         "title": event.title,
    #         # other fields here
    #     }
    #     output['events'].append(event_out)
    # Return the output as JSON
    return HttpResponse(simplejson.dumps(output), mimetype='application/json')
You can construct the dictionary as usual; just take into account that date strings will not be interpreted in JavaScript without processing. My advice is to send JavaScript-interpretable dates directly, not strings, as follows:
from django.http import HttpResponse
from django.utils import simplejson
import datetime
import time

# occ is an event occurrence fetched from the database
occ.start = time.mktime(occ.start.timetuple()) * 1000
occ.end = time.mktime(occ.end.timetuple()) * 1000
event = {'id': occ.id, 'title': occ.title, 'start': occ.start, 'end': occ.end,
         'body': occ.description, 'readOnly': '%r' % occ.read_only,
         'recurring': '%r' % occ.recurring, 'persisted': '%r' % occ.persisted,
         'event_id': occ.event.id}
mimetype = 'application/json'
return HttpResponse(simplejson.dumps(event), mimetype)
Take into account that the calendar expects the 'events' key, so:
$.getJSON(url, function(data) {
    res = {events: data};
    //alert(JSON.stringify(res, null, 4));
    callback(res);
});
If you prefer processing on the JavaScript side, try the Datejs library, which can parse a date from text.
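Putting the two answers together, a minimal sketch of what the getEvents view could look like, assuming an Event model with start, end, and title fields (the model and field names are hypothetical); it sends millisecond timestamps that the calendar can consume directly:

import time

from django.http import HttpResponse
from django.utils import simplejson  # on modern Django, use the standard json module instead


def getEvents(request):
    # Hypothetical model: Event with start/end DateTimeFields and a title CharField.
    from DrHub.models import Event

    events = []
    for event in Event.objects.all():
        events.append({
            'id': event.id,
            # Millisecond timestamps are directly usable with new Date(value) in JS.
            'start': time.mktime(event.start.timetuple()) * 1000,
            'end': time.mktime(event.end.timetuple()) * 1000,
            'title': event.title,
        })
    return HttpResponse(simplejson.dumps({'events': events}),
                        mimetype='application/json')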