How to change name of a table created by AWS Glue crawler using boto3 - amazon-web-services

I'm trying to change the name of a table created by the AWS Glue crawler using boto3. Here is the code:
import boto3
database_name = "eventbus"
table_name = "enrollment_user_enroll_cancel_1_0_0"
new_table_name = "enrollment_user_enroll_cancel"
client = boto3.client("glue", region_name='us-west-1')
response = client.get_table(DatabaseName=database_name, Name=table_name)
table_input = response["Table"]
table_input["Name"] = new_table_name
print(table_input)
print(table_input["Name"])
table_input.pop("CreatedBy")
table_input.pop("CreateTime")
table_input.pop("UpdateTime")
client.create_table(DatabaseName=database_name, TableInput=table_input)
Getting the below error:
botocore.exceptions.ParamValidationError: Parameter validation failed:
Unknown parameter in TableInput: "DatabaseName", must be one of: Name, Description, Owner, LastAccessTime, LastAnalyzedTime, Retention, StorageDescriptor, PartitionKeys, ViewOriginalText, ViewExpandedText, TableType, Parameters
Unknown parameter in TableInput: "IsRegisteredWithLakeFormation", must be one of: Name, Description, Owner, LastAccessTime, LastAnalyzedTime, Retention, StorageDescriptor, PartitionKeys, ViewOriginalText, ViewExpandedText, TableType, Parameters
Could you please let me know the resolution for this issue? Thanks!

To get rid of the botocore.exceptions.ParamValidationError thrown by client.create_table, you need to delete the corresponding items from table_input in the same way as you did with CreatedBy etc.:
...
table_input.pop("DatabaseName")
table_input.pop("IsRegisteredWithLakeFormation")
client.create_table(DatabaseName=database_name, TableInput=table_input)
In case your original table had partitions which you want to add to the new table, you need to use a similar approach. First you need to retrieve meta information about those partitions with one of:
batch_get_partition()
get_partition()
get_partitions()
Note: depending on which one you choose, you need to pass different parameters. There is a limitation on how many partitions you can retrieve within a single request; if I remember correctly it is around 200 or so. On top of that, you might need to use a paginator to list all of the available partitions. This is the case when your table has more than 400 partitions.
In general, I would suggest something like:
paginator = client.get_paginator('get_partitions')
response = paginator.paginate(
    DatabaseName=database_name,
    TableName=table_name
)

partitions = list()
for page in response:
    for p in page['Partitions']:
        partitions.append(p.copy())
# Here you need to remove "DatabaseName", "TableName", "CreationTime" from
# every partition
Now you are ready to add those retrieved partitions to the new table with either:
batch_create_partition()
create_partition()
I'd suggest using batch_create_partition(); however, it has a limit on how many partitions can be created in a single request.
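A minimal sketch of that last step, assuming the partitions list gathered above (the per-request cap on PartitionInputList is, to my knowledge, around 100, so the list is sent in chunks):

# strip the read-only fields that PartitionInput does not accept
partition_inputs = []
for p in partitions:
    for key in ("DatabaseName", "TableName", "CreationTime", "CatalogId"):
        p.pop(key, None)
    partition_inputs.append(p)

# send the partitions to the new table in chunks of 100
for i in range(0, len(partition_inputs), 100):
    client.batch_create_partition(
        DatabaseName=database_name,
        TableName=new_table_name,
        PartitionInputList=partition_inputs[i:i + 100]
    )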

Related

Django-ORM: Check whether multiple items are in DB while minimizing calls

Assume that from an external API call, we get the following response:
resp = ['123', '67283', '99829', '786232']
These are external_id fields for our objects, defined in our Article model. Some of them may already exist in the database, while others don't.
Before returning a response, we need to check whether each external_id corresponds to a record in our database, and if not, we need to create it and fetch additional info from another, third, source.
What is the most efficient way to do this? Right now I can't think of something better than:
for external_id in resp:
    if not Article.objects.filter(external_id=external_id).exists():
        ...  # item doesn't exist, go fetch more data and create object
    else:
        ...  # already exists, do something else
But there must be a better way..?
You can use sets for this task. The following code will issue only one database call:
expected_ids = set(int(pk) for pk in resp)
exist_ids = set(Article.objects.filter(external_id__in=resp)
                .values_list('external_id', flat=True))
not_exist_ids = list(expected_ids - exist_ids)
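If you then need to create the missing records, a rough follow-up sketch (fetch_details() is a hypothetical helper standing in for the third-party lookup mentioned above):

new_articles = []
for external_id in not_exist_ids:
    details = fetch_details(external_id)  # hypothetical call to the third source
    new_articles.append(Article(external_id=external_id, **details))

# insert everything that was missing with a single additional query
Article.objects.bulk_create(new_articles)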

Pulling data from datastore and converting it in Json in python(Google Appengine)

I am creating an application using Google App Engine, in which I am fetching data from a website and storing it in my database (Datastore). Now whenever a user hits my application URL as "application_url\name=xyz&city=abc", I fetch the data from the DB and want to show it as JSON. Right now I am using a filter to fetch data based on the name and city, but I am getting [] as output and I don't know how to get the data from this. My code looks like this:
class MainHandler(webapp2.RequestHandler):
    def get(self):
        commodityname = self.request.get('veg', "Not supplied")
        market = self.request.get('market', "No market found with this name")
        self.response.write(commodityname)
        self.response.write(market)
        query = commoditydata.all()
        logging.info(commodityname)
        query.filter('commodity = ', commodityname)
        result = query.fetch(limit=1)
        logging.info(result)
and the DB structure for the "commoditydata" table is:
class commoditydata(db.Model):
    commodity = db.StringProperty()
    market = db.StringProperty()
    arrival = db.StringProperty()
    variety = db.StringProperty()
    minprice = db.StringProperty()
    maxprice = db.StringProperty()
    modalprice = db.StringProperty()
    reporteddate = db.DateTimeProperty(auto_now_add=True)
Can anyone tell me how to get data from the DB using name and market and convert it to JSON? Getting data from the DB is the higher priority. Any suggestions will be of great use.
If you are starting with a new app, I would suggest using the NDB API rather than the old DB API. Your code would look almost the same though.
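For example, a rough sketch of what the model and a query might look like with NDB (untested, keeping the same property names):

from google.appengine.ext import ndb

class commoditydata(ndb.Model):
    commodity = ndb.StringProperty()
    market = ndb.StringProperty()
    # ... the remaining StringProperty fields as in the db.Model version
    reporteddate = ndb.DateTimeProperty(auto_now_add=True)

# querying looks very similar to the old API:
result = commoditydata.query(commoditydata.commodity == commodityname).fetch(1)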
As far as I can tell from your code sample, the query should give you results as long as the HTTP query parameters from the request match entity objects in the datastore.
I can think of some possible reasons for the empty result:
you only think the output is empty because you use write() too early; App Engine doesn't stream the response, you must write everything in one go, and you should do this after you have queried the datastore
the properties you are filtering are not indexed (yet) in the datastore, at least not for the entities you were looking for
the filters are just not matching anything (check the log for the values you got from the request)
your query uses a namespace different from the one the data was stored in (but this is unlikely if you haven't explicitly set namespaces anywhere)
In the Cloud Developer Console you can query your datastore and even apply filters, so you can see the results without writing actual code.
Go to https://console.developers.google.com
On the left side, select Storage > Cloud Datastore > Query
Select the namespace (default should be fine)
Select the kind "commoditydata"
Add filters with example values you expect from the request and see how many results you get
Also look into Monitoring > Log, which together with your logging.info() calls is really helpful to better understand what is going on during a request.
The conversion to JSON is rather easy once you have your data. In your request handler, create an empty list of dictionaries. For each object you get from the query result: for every property you want to send, define a key in the dict and set the value to the value you got from the datastore. At the end, dump the list as a JSON string.
class MainHandler(webapp2.RequestHandler):
    def get(self):
        commodityname = self.request.get('veg')
        market = self.request.get('market')
        if commodityname is None and market is None:
            # the request will be complete after this:
            self.response.out.write("Please supply filters!")
            return
        # everything ok, try query:
        query = commoditydata.all()
        logging.info(commodityname)
        query.filter('commodity = ', commodityname)
        result = query.fetch(limit=1)
        logging.info(result)
        # now build the JSON payload for the response
        dicts = []
        for match in result:
            # reporteddate is a datetime, which json.dumps cannot serialize
            # directly, so convert it to a string first
            dicts.append({'market': match.market,
                          'reporteddate': str(match.reporteddate)})
        # set the appropriate header of the response:
        self.response.headers['Content-Type'] = 'application/json; charset=utf-8'
        # convert everything into a JSON string
        import json
        jsonString = json.dumps(dicts)
        self.response.out.write(jsonString)

how to filter items with a certain attribute empty

I'm using boto and DynamoDB and I want to count all the items whose Feature attribute is empty. I tried the following:
not_empty_ct = db.instance.query_count(
    Feature__eq=''
)
But it didn't work:
boto.dynamodb2.exceptions.ValidationException: ValidationException: 400 Bad Request
{'message': 'One or more parameter values were invalid: An AttributeValue may not contain an empty string', '__type':
'com.amazon.coral.validate#ValidationException'}
I didn't find too much information in boto's API Docs.
You probably need to do a Scan, which you should probably not be doing in your code, since a scan reads the entire table. Also note that DynamoDB does not allow empty string attribute values (that is exactly what the error message is telling you), so an "empty" Feature really means the attribute is absent from the item.
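If you can move to the newer boto3, a rough sketch of counting such items with a paginated scan might look like this (the table name is made up; since an empty string cannot be stored, the filter checks for the attribute being absent):

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')  # hypothetical table name

count = 0
scan_kwargs = {
    'FilterExpression': Attr('Feature').not_exists(),
    'Select': 'COUNT',
}
while True:
    resp = table.scan(**scan_kwargs)
    count += resp['Count']
    if 'LastEvaluatedKey' not in resp:
        break
    scan_kwargs['ExclusiveStartKey'] = resp['LastEvaluatedKey']

print(count)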

how to get amazon prices using Boto?

It seems that Boto is the official Amazon API module for Python, and this one is for Tornado, so here are my questions:
does it offer pagination (requesting only 10 products, since Amazon offers 10 products per page; I only want to get the first page...)? If so, how (sample code?)
how then to parse the product data? I've used python-amazon-simple-product-api, but sadly it doesn't offer pagination, so all the offers keep iterating.
Generally, pagination is performed by the client requesting the API. To do this in boto, you'll need to cut your systems up. So for instance, say you make a call to AWS via boto using the get_all_instances def; you'll need to store those somehow and then keep track of which servers have been displayed and which not. To my knowledge, boto does not have the LIMIT functionality most devs are used to from MySQL. Personally, I scan all my instances and stash them in mongo like so:
for r in conn.get_all_instances():  # loop through all reservations
    groups = [g.name for g in r.groups]  # get a list of groups for this reservation
    groups = ','.join(groups)  # join the groups into a comma separated list
    for x in r.instances:  # loop through all instances within the reservation
        name = x.tags.get('Name', '')  # get instance name from the 'Name' tag
        new_record = {"tagname": name, "ip_address": x.private_ip_address,
                      "external_ip_nat": x.ip_address, "type": x.instance_type,
                      "state": x.state, "base_image": x.image_id, "placement": x.placement,
                      "public_ec2_dns": x.public_dns_name,
                      "launch_time": x.launch_time, "parent": ObjectId(account['_id'])}
        new_record['groups'] = groups
        systems_coll.update({'_id': x.id}, {"$set": new_record}, upsert=True)
        error = db.error()
        if error != None:
            print "err:%s:" % str(error)
You could also wrap these in try/catch blocks. Up to you. Once you get them out of boto, it should be trivial to do the cut-up work, for example:
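A minimal sketch of that cut-up step, assuming the records were stashed in the systems_coll collection as above and you want pages of 10 (the page number would come from your own request handling):

def get_instances_page(page, page_size=10):
    # emulate the missing LIMIT/OFFSET on the cached copy in mongo,
    # not against the live AWS API
    cursor = systems_coll.find().skip(page * page_size).limit(page_size)
    return list(cursor)

first_page = get_instances_page(0)   # first 10 stashed instances
second_page = get_instances_page(1)  # next 10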
-- Jess

Bulk create model objects in django

I have a lot of objects to save in the database, and I want to create model instances from that data.
With Django, I can create all the model instances with MyModel(data), and then I want to save them all.
Currently, I have something like that:
for item in items:
    object = MyModel(name=item.name)
    object.save()
I'm wondering if I can save a list of objects directly, eg:
objects = []
for item in items:
    objects.append(MyModel(name=item.name))
objects.save_all()
How to save all the objects in one transaction?
As of Django 1.4, there exists bulk_create as an object manager method which takes as input an array of objects created using the class constructor. Check out the Django docs.
Use bulk_create() method. It's standard in Django now.
Example:
Entry.objects.bulk_create([
    Entry(headline="Django 1.0 Released"),
    Entry(headline="Django 1.1 Announced"),
    Entry(headline="Breaking: Django is awesome")
])
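If you are inserting a very large number of rows, bulk_create() also accepts an optional batch_size argument so the insert is split into several smaller queries; a small sketch (500 is just an example value):

entries = [Entry(headline="Headline %d" % i) for i in range(10000)]
# insert in chunks of 500 rows per query instead of one huge statement
Entry.objects.bulk_create(entries, batch_size=500)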
It worked for me to use manual transaction handling for the loop (Postgres 9.1):
from django.db import transaction

with transaction.atomic():
    for item in items:
        MyModel.objects.create(name=item.name)
In fact it's not the same as a 'native' database bulk insert, but it lets you avoid/decrease the cost of transport, ORM operations, and SQL query analysis.
import uuid

name = request.data.get('name')
period = request.data.get('period')
email = request.data.get('email')
prefix = request.data.get('prefix')
bulk_number = int(request.data.get('bulk_number'))

bulk_list = list()
for _ in range(bulk_number):
    code = prefix + uuid.uuid4().hex.upper()  # prefix comes from the request above
    bulk_list.append(
        DjangoModel(name=name, code=code, period=period, user=email))
bulk_msj = DjangoModel.objects.bulk_create(bulk_list)
Here is how to bulk-create entities from a column-separated file, leaving aside all unquoting and un-escaping routines:
class SomeModel(Model):
    @classmethod
    def from_file(model, file_obj, headers, delimiter):
        model.objects.bulk_create([
            model(**dict(zip(headers, line.split(delimiter))))
            for line in file_obj],
            batch_size=None)
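A hypothetical usage example of that helper (the file name and headers are made up; newline stripping is left aside, as noted above):

# contents of data.csv (made-up example):
#   Alice;A-001
#   Bob;A-002
with open('data.csv') as f:
    SomeModel.from_file(f, headers=['name', 'code'], delimiter=';')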
Using create will cause one query per new item. If you want to reduce the number of INSERT queries, you'll need to use something else.
I've had some success using the Bulk Insert snippet, even though the snippet is quite old.
Perhaps there are some changes required to get it working again.
http://djangosnippets.org/snippets/446/
Check out this blog post on the bulkops module.
On my django 1.3 app, I have experienced significant speedup.
The bulk_create() method is one of the ways to insert multiple records into a database table. Here is how bulk_create() is used:
Event.objects.bulk_create([
    Event(event_name="Event WF -001", event_type="sensor_value"),
    Event(event_name="Event WT -002", event_type="geozone"),
    Event(event_name="Event WD -001", event_type="outage")])
For a single-line implementation, you can use a lambda expression in a map:
map(lambda x: MyModel.objects.get_or_create(name=x), items)
Here, the lambda matches each item in the items list to x and creates a database record if necessary. Note that in Python 3, map() is lazy, so you need to wrap it in list() to actually execute the calls.
Lambda Documentation
The easiest way is to use the create Manager method, which creates and saves the object in a single step.
for item in items:
    MyModel.objects.create(name=item.name)