NDB - Most Efficient Way to Delete a List of Keys - python-2.7

I believe I need to use ndb.delete_multi but I am confused as to how to make it work for a specific set of keys and whether it is the most efficient way to go. Using Python 2.7 with Google App Engine.
First, I collect the keys that I want to delete. I don't want to delete everything, only those entries that are 1 hour or more old. Toward that end, I build the list of keys that meet this criterion.
cs = ChannelStore()
delMsgKeys = []
for x in cs.allMessages():
    current = datetime.datetime.now()
    recordTime = x.channelMessageCreated
    timeDiffSecs = (current - recordTime).total_seconds()
    timeDiff = (timeDiffSecs / 60) / 60
    if timeDiff >= 1:
        delMsgKeys.append(x.key.id())
ndb.delete_multi(?????)
The definition for cs.allMessages():
def allMessages(self):
    return ChannelStore.query().fetch()
First, is this overall the most efficient approach? Second, how do I use the list of keys I created with the ndb.delete_multi statement?
---Update----
The issue with ndb.delete_multi had to do with the keys I was passing it. In the code I posted above, the keys should have been stored as full key objects:
delMsgKeys.append(x.key)
With that change, ndb.delete_multi works.
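In other words, ndb.delete_multi expects full ndb.Key objects. If you only have the integer IDs (as originally collected), a minimal sketch of rebuilding the keys, assuming ChannelStore is the kind, would be:
from google.appengine.ext import ndb

# rebuild full keys from the integer IDs collected above
ndb.delete_multi([ndb.Key(ChannelStore, msg_id) for msg_id in delMsgKeys])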

Per the NDB documentation, you can just pass in a list of keys to ndb.delete_multi, so based on your code, this should work:
ndb.delete_multi(delMsgKeys)
I'm not sure what the limit is for the number of keys that you can pass in a single ndb.delete_multi() call, though.
For this query:
ChannelStore.query().fetch()
You can add an entity property that stores a timestamp automatically whenever you create or update the entity, by defining a DateTimeProperty with auto_now=True (see the NDB property documentation). Then you can query on that timestamp property like this:
sixty_mins_ago = datetime.datetime.now() - datetime.timedelta(minutes=60)
qry = ChannelStore.query()
list_of_keys = qry.filter(ChannelStore.timestamp < sixty_mins_ago).fetch(keys_only=True)
Since you don't need the entities, a keys_only fetch will be cheaper. Of course, this code assumes your ChannelStore model has a timestamp property, so your model will have to be something like this:
class ChannelStore(ndb.Model):
    # other properties go here
    timestamp = ndb.DateTimeProperty(auto_now=True)
Putting it all together, something like this could work for the code block you have above:
from models import ChannelStore
from google.appengine.ext import ndb
from datetime import datetime, timedelta
# other imports

def delete_old_entities():
    sixty_mins_ago = datetime.now() - timedelta(minutes=60)
    qry = ChannelStore.query()
    qry = qry.filter(ChannelStore.timestamp < sixty_mins_ago)
    list_of_keys = qry.fetch(keys_only=True)
    ndb.delete_multi(list_of_keys)
In case you have to delete a lot of keys and are running into some kind of API limit with the ndb.delete_multi call, you can change the delete_old_entities() method to the following:
def delete_old_entities():
    sixty_mins_ago = datetime.now() - timedelta(minutes=60)
    qry = ChannelStore.query()
    qry = qry.filter(ChannelStore.timestamp < sixty_mins_ago)
    list_of_keys = qry.fetch(keys_only=True)
    while list_of_keys:
        # delete 100 at a time
        ndb.delete_multi(list_of_keys[:100])
        list_of_keys = list_of_keys[100:]

Related

Importing salesforce report data using python

I am new to SFDC. I have a report already created by a user, and I would like to use Python to dump the report's data into a CSV/Excel file.
I see there are a couple of Python packages for that, but my code gives an error:
from simple_salesforce import Salesforce
sf = Salesforce(instance_url='https://cs1.salesforce.com', session_id='')
sf = Salesforce(password='xxxxxx', username='xxxxx', organizationId='xxxxx')
Can I have the basic steps for setting up the API, and some example code?
This worked for me:
import requests
import csv
from simple_salesforce import Salesforce
import pandas as pd
sf = Salesforce(username=your_username, password=your_password, security_token=your_token)
login_data = {'username': your_username, 'password': your_password_plus_your_token}
with requests.session() as s:
    d = s.get("https://your_instance.salesforce.com/{}?export=1&enc=UTF-8&xf=csv".format(reportid), headers=sf.headers, cookies={'sid': sf.session_id})
d.content will contain a string of comma-separated values, which you can read with the csv module.
I take the data into pandas from there, hence the pandas import. I removed the rest of the function where it puts the data into a DataFrame, but if you're interested in how that's done, let me know.
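In case it saves someone the follow-up question, here is a minimal sketch of that last step, parsing d.content into a DataFrame (this assumes d is the response object from the request above; note that Salesforce report exports sometimes append a footer of metadata rows that you may need to trim):
import pandas as pd
from io import StringIO

# decode the exported CSV bytes and parse them into a DataFrame
report_df = pd.read_csv(StringIO(d.content.decode('utf-8')))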
In case it is helpful, I wanted to write out the steps I used to answer this question now (Aug-2018), based on Obol's comment. For reference, I followed the README instructions at https://github.com/cghall/force-retrieve/blob/master/README.md for the salesforce_reporting package.
To connect to Salesforce:
from salesforce_reporting import Connection, ReportParser
sf = Connection(username='your_username',password='your_password',security_token='your_token')
Then, to get the report I wanted into a Pandas DataFrame:
import pandas as pd

report = sf.get_report(your_reports_id)
parser = ReportParser(report)
report = parser.records_dict()
report = pd.DataFrame(report)
If you were so inclined, you could also simplify the four lines above into one, like so:
report = pd.DataFrame(ReportParser(sf.get_report(your_reports_id)).records_dict())
One difference I ran into from the README is that sf.get_report('report_id', includeDetails=True) threw an error stating get_report() got an unexpected keyword argument 'includeDetails'. Simply removing it seemed to result in the code working fine.
report can now be exported via report.to_csv('report.csv', index=False), or manipulated directly.
EDIT: parser.records() changed to parser.records_dict(), as this allows the DataFrame to have the columns already listed, rather than indexing them numerically.
The code below is rather long and might be specific to our use case, but the basic idea is the following:
Find out the date interval length and any additional filtering needed so you never run into the "more than 2,000 rows" limit. In my case I could use a weekly date-range filter, but I needed to apply some additional filters on top of that.
Then run it like this:
report_id = '00O4…'
sf = SalesforceReport(user, password, token, report_id)
it = sf.iterate_over_dates_and_filters(datetime.date(2020, 2, 1),
                                       'Invoice__c.InvoiceDate__c',
                                       'Opportunity.CustomField__c',
                                       [('a', 'startsWith'), ('b', 'startsWith'), …])
for row in it:
    # do something with the dict
The iterator goes through every week since 2020-02-01 (if you need daily or monthly iteration you'd need to change the code, but the change should be minimal), applies the filter CustomField__c.startswith('a'), then CustomField__c.startswith('b'), …, and acts as a generator, so you don't need to manage the filter cycling yourself.
The iterator throws an Exception if there's a query which returns more than 2000 rows, just to be sure that the data is not incomplete.
One warning here: SF has a limit of at most 500 queries per hour. If you have one year of 52 weeks and 10 additional filters, that is already 520 queries, over the limit.
Here's the class (relies on simple_salesforce)
import simple_salesforce
import json
import datetime


class SalesforceReport(simple_salesforce.Salesforce):
    """
    Helper class to iterate over salesforce report data,
    maneuvering around the 2000-row max limit.
    """
    def __init__(self, username, password, security_token, report_id):
        super(SalesforceReport, self).__init__(username=username, password=password, security_token=security_token)
        self.report_id = report_id
        self._fetch_describe()

    def _fetch_describe(self):
        url = f'{self.base_url}analytics/reports/{self.report_id}/describe'
        result = self._call_salesforce('GET', url)
        self.filters = dict(result.json()['reportMetadata'])

    def apply_report_filter(self, column, operator, value, replace=True):
        """
        adds/replaces filter, example:
        apply_report_filter('Opportunity.InsertionId__c', 'startsWith', 'hbob').
        For date filters use apply_standard_date_filter.

        column: needs to correspond to a column in your report, AND the report
                needs to have this filter configured (so in the UI the filter
                can be applied)
        operator: equals, notEqual, lessThan, greaterThan, lessOrEqual,
                  greaterOrEqual, contains, notContain, startsWith, includes
                  see https://sforce.co/2Tb5SrS for an up-to-date list
        value: value as a string
        replace: if set to True and there's already a restriction on column,
                 that restriction will be replaced; otherwise it's added additionally
        """
        filters = self.filters['reportFilters']
        if replace:
            filters = [f for f in filters if not f['column'] == column]
        filters.append(dict(
            column=column,
            isRunPageEditable=True,
            operator=operator,
            value=value))
        self.filters['reportFilters'] = filters

    def apply_standard_date_filter(self, column, startDate, endDate):
        """
        replaces the date filter. The date filter needs to be available as a
        filter in the UI already

        Example: apply_standard_date_filter('Invoice__c.InvoiceDate__c', d_from, d_to)

        column: needs to correspond to a column in your report
        startDate, endDate: instances of datetime.date
        """
        self.filters['standardDateFilter'] = dict(
            column=column,
            durationValue='CUSTOM',
            startDate=startDate.strftime('%Y-%m-%d'),
            endDate=endDate.strftime('%Y-%m-%d')
        )

    def query_report(self):
        """
        returns a generator which yields one report row as a dict at a time
        """
        url = self.base_url + "analytics/reports/query"
        result = self._call_salesforce('POST', url, data=json.dumps(dict(reportMetadata=self.filters)))
        r = result.json()
        columns = r['reportMetadata']['detailColumns']
        if not r['allData']:
            raise Exception('got more than 2000 rows! Quitting as data would be incomplete')
        for row in r['factMap']['T!T']['rows']:
            values = []
            for c in row['dataCells']:
                t = type(c['value'])
                if t == str or t == type(None) or t == int:
                    values.append(c['value'])
                elif t == dict and 'amount' in c['value']:
                    values.append(c['value']['amount'])
                else:
                    print(f"don't know how to handle {c}")
                    values.append(c['value'])
            yield dict(zip(columns, values))

    def iterate_over_dates_and_filters(self, startDate, date_column, filter_column, filter_tuples):
        """
        returns a generator which iterates over every week and applies each
        of the filters for the filter column
        """
        date_runner = startDate
        while True:
            print(date_runner)
            self.apply_standard_date_filter(date_column, date_runner, date_runner + datetime.timedelta(days=6))
            for val, op in filter_tuples:
                print(val)
                self.apply_report_filter(filter_column, op, val)
                for row in self.query_report():
                    yield row
            date_runner += datetime.timedelta(days=7)
            if date_runner > datetime.date.today():
                break
For anyone just trying to download a report into a DataFrame, this is how you do it (I added some notes and links for clarification):
import pandas as pd
import csv
import requests
from io import StringIO
from simple_salesforce import Salesforce

# Input Salesforce credentials:
sf = Salesforce(
    username='johndoe@mail.com',
    password='<password>',
    security_token='<security_token>')  # See below for help with finding your token

# Basic report URL structure:
orgParams = 'https://<INSERT_YOUR_COMPANY_NAME_HERE>.my.salesforce.com/'  # you can see this in your Salesforce URL
exportParams = '?isdtp=p1&export=1&enc=UTF-8&xf=csv'

# Downloading the report:
reportId = 'reportId'  # You find this in the URL of the report in question, between "Report/" and "/view"
reportUrl = orgParams + reportId + exportParams
reportReq = requests.get(reportUrl, headers=sf.headers, cookies={'sid': sf.session_id})
reportData = reportReq.content.decode('utf-8')
reportDf = pd.read_csv(StringIO(reportData))
You can get your security token from your Salesforce account settings ("Reset My Security Token"); it is emailed to you when you reset it.

Django filter query with date and integer

I'm trying to write a query that shows planes whose working life is over.
Models.py
class Aircraft(models.Model):
    registration_num = models.IntegerField(primary_key=True)
    airplane_age = models.IntegerField()
    commision_day = models.DateField()
Views.py
def query1(request):
    air_list = Aircraft.objects.filter(commision_day__year__lte=(datetime.date.today().year - 10))
    #air_list = Aircraft.objects.extra(where=['commision_day + airplane_age < CURDATE()'])
    #air_list = Aircraft.objects.extra(where=["commision_day+airplane_age<datetime.date.today()"]) F('airplane_age') year=10, month=1, day=1
Data example
commision_day = 1995.10.05
airplane_age = 15
I've tried many methods, but none of them work.
So I need to show planes for which commision_day + airplane_age (in years) < today.
Thanks
You could try setting up a datetime.date object to use as a threshold, and then filter Aircraft objects against that by their date directly:
today = datetime.date.today()
threshold = datetime.date(today.year - 10, today.month, today.day)
Aircraft.objects.filter(commision_day__lte=threshold)
Try this queryset (note that timedelta has no years argument, so the ten-year interval is expressed in days):
from datetime import datetime, timedelta
Aircraft.objects.filter(commision_day__lte=(datetime.now() - timedelta(days=365 * 10)))
The goal here is to combine the values of two fields, and compare the result with a value calculated in the current process, essentially:
Aircraft.objects.extra(where=['somedate + years < today'])
The expression within where=['...'] must be valid SQL and the right syntax will depend on your database backend. That is, you need to know the right way to add years to a date to form the somedate + years correctly, for example in SQLite it would be like this:
date(commision_day, '+' || airplane_age || ' year') < date('now')
This is more efficient than your solution, but it is not portable, as it depends on the database backend due to non-standard syntax of date operations.
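For concreteness, a hedged sketch of plugging that SQLite expression into extra() (assuming the default table and column names Django generates for the Aircraft model above):
# SQLite only; other backends need their own date arithmetic
expired = Aircraft.objects.extra(
    where=["date(commision_day, '+' || airplane_age || ' year') < date('now')"])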
The tempting one-liner datetime.now() - relativedelta(years=F('airplane_age')) does not work: relativedelta needs a plain integer, while F() is only resolved inside the database, so it raises a TypeError. On recent Django versions you can instead compare year components, which keeps the per-row age in the ORM:
import datetime
from django.db.models import F
Aircraft.objects.filter(commision_day__year__lte=datetime.date.today().year - F('airplane_age'))

app-engine ndb delete data

I'm new to App Engine [Python 2.7].
I would like to delete elements from my ndb datastore (currently I don't care whether it is one by one or all at once, since neither is working for me).
Version 1 based on this Q:
ps_ancestors = req_query.fetch()
for ps_ancestor in ps_ancestors:
    self.response.write(ps_ancestor.key)
    ps_ancestor.key.delete()
It continues to print the same data without actually deleting anything
Version 2:
[myId currently has only the values 1, 2, 3]
ndb.Key(myId, 1).delete()
ndb.Key(myId, 2).delete()
ndb.Key(myId, 3).delete()
The model:
class tmpReport(ndb.Model):
    myId = ndb.IntegerProperty()
    hisId = ndb.IntegerProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)
What am I missing?
k = users.query(users.name == 'abhinav')
for l in k.fetch(limit=1):
    l.key.delete()
First of all, you should not define your Entity key as an IntegerProperty. Take a look at this documentation: NDB Entities and Keys
In order to delete an entity from the datastore you should first retrieve it, by using a query or by its ID. I recommend using a "keyname" when creating your entities (to use it as your custom ID):
# Model declaration
class tmpReport(ndb.Model):
    hisId = ndb.IntegerProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)

# Store the entity
report = tmpReport(id=1, hisId=5)
report.put()
Then to retrieve and delete the previous entity use:
# Retrieve the entity
report = ndb.Key("tmpReport", 1).get()

# Delete the entity
report.key.delete()
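As a side note, when you already know the ID you can skip the get() round trip and delete directly by key; a one-line sketch:
ndb.Key("tmpReport", 1).delete()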
Hope it helps.
query1 = tmpReport.query()
query2 = query1.filter(tmpReport.myId == int(myId))
# Add other filters as necessary
query3 = query2.fetch()
query3[0].key.delete()
This removes the first entity returned, assuming myId is unique so there is only one element in the list.
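If you only need to delete and don't need the entity itself, a cheaper variant (sketched here under the same uniqueness assumption) is a keys-only fetch combined with delete_multi:
keys = tmpReport.query(tmpReport.myId == int(myId)).fetch(keys_only=True)
ndb.delete_multi(keys)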

Django compare values of two objects

I have a Django model that looks something like this:
class Response(models.Model):
    transcript = models.TextField(null=True)

class Coding(models.Model):
    qid = models.CharField(max_length=30)
    value = models.CharField(max_length=200)
    response = models.ForeignKey(Response)
    coder = models.ForeignKey(User)
For each Response object, there are two coding objects with qid = "risk", one for coder 3 and one for coder 4. What I would like to be able to do is get a list of all Response objects for which the difference in value between coder 3 and coder 4 is greater than 1. The value field stores numbers 1-7.
I realize in hindsight that setting up value as a CharField may have been a mistake, but hopefully I can get around that.
I believe something like the following SQL would do what I'm looking for, but I'd rather do this with the ORM:
SELECT DISTINCT c1.response_id FROM coding c1, coding c2
WHERE c1.coder_id = 3 AND
      c2.coder_id = 4 AND
      c1.qid = "risk" AND
      c2.qid = "risk" AND
      c1.response_id = c2.response_id AND
      c1.value - c2.value > 1
from django.db.models import F

qset = Coding.objects.filter(
    response__coding__value__gt=F('value') + 1,
    qid='risk', coder=4
).extra(where=['T3.qid = %s', 'T3.coder_id = %s'],
        params=['risk', 3])
responses = [c.response for c in qset.select_related('response')]
When you join to a table already in the query, the ORM will assign the second one an alias, in this case T3, which you can use in parameters to extra(). To find out what the alias is, you can drop into the shell and print qset.query.
See Django documentation on F objects and extra
Update: It seems you actually don't have to use extra(), or figure out what alias django uses, because every time you refer to response__coding in your lookups, django will use the alias created initially. Here's one way to look for differences in either direction:
from django.db.models import Q, F
gt = Q(response__coding__value__gt=F('value') + 1)
lt = Q(response__coding__value__lt=F('value') - 1)
match = Q(response__coding__qid='risk', response__coding__coder=4)
qset = Coding.objects.filter(match & (gt | lt), qid='risk', coder=3)
responses = [c.response for c in qset.select_related('response')]
See Django documentation on Q objects
BTW, if you are going to want both Coding instances, you have an N + 1 queries problem here, because Django's select_related() won't follow reverse FK relationships. But since you have the data in the query already, you can retrieve the required information using the T3 alias as described above with extra(select={'other_value': 'T3.value'}). The value from the corresponding Coding record is then accessible as an attribute on each retrieved Coding instance, i.e. as c.other_value.
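A short sketch of that extra(select=...) trick, under the same assumption that T3 is the alias Django assigned (verify it by printing qset.query):
# pull the matched coder's value into each row, avoiding the N + 1 queries
qset = qset.extra(select={'other_value': 'T3.value'})
for c in qset.select_related('response'):
    print(c.value, c.other_value, c.response_id)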
Incidentally, your question is general enough, but it looks like you have an entity-attribute-value schema, which in an RDB scenario is generally considered an anti-pattern. You might be better off long-term (and this query would be simpler) with a risk field:
class Coding(models.Model):
    response = models.ForeignKey(Response)
    coder = models.ForeignKey(User)
    risk = models.IntegerField()
    # other fields for other qid 'attribute' names...

How to 'bulk update' with Django?

I'd like to update a table with Django - something like this in raw SQL:
update tbl_name set name = 'foo' where name = 'bar'
My first result is something like this - but that's nasty, isn't it?
list = ModelClass.objects.filter(name='bar')
for obj in list:
    obj.name = 'foo'
    obj.save()
Is there a more elegant way?
Update:
Django 2.2 now has a bulk_update() method.
Old answer:
Refer to the following django documentation section
Updating multiple objects at once
In short you should be able to use:
ModelClass.objects.filter(name='bar').update(name="foo")
You can also use F objects to do things like incrementing rows:
from django.db.models import F
Entry.objects.all().update(n_pingbacks=F('n_pingbacks') + 1)
See the documentation.
However, note that:
This won't call the ModelClass.save() method (so if you have custom logic there, it won't be triggered).
No django signals will be emitted.
You can't perform an .update() on a sliced QuerySet, it must be on an original QuerySet so you'll need to lean on the .filter() and .exclude() methods.
Consider using django-bulk-update found here on GitHub.
Install: pip install django-bulk-update
Implement: (code adapted from the project's README file)
import random
from bulk_update.helper import bulk_update

random_names = ['Walter', 'The Dude', 'Donny', 'Jesus']
people = Person.objects.all()
for person in people:
    r = random.randrange(4)
    person.name = random_names[r]
bulk_update(people)  # updates all columns using the default db
Update: As Marc points out in the comments, this is not suitable for updating thousands of rows at once, though it is suitable for smaller batches of tens to hundreds. The size of the batch that is right for you depends on your CPU and query complexity. This tool is more like a wheelbarrow than a dump truck.
Django 2.2 version now has a bulk_update method (release notes).
https://docs.djangoproject.com/en/stable/ref/models/querysets/#bulk-update
Example:
from django.db import transaction

# get a {pk: record} dictionary of existing records
updates = YourModel.objects.filter(...).in_bulk()
....
# do something with the updates dict
....
if hasattr(YourModel.objects, 'bulk_update') and updates:
    # Use the new method
    YourModel.objects.bulk_update(updates.values(), [list the fields to update], batch_size=100)
else:
    # The old & slow way
    with transaction.atomic():
        for obj in updates.values():
            obj.save(update_fields=[list the fields to update])
If you want to set the same value on a collection of rows, you can use the update() method combined with any query term to update all rows in one query:
some_list = ModelClass.objects.filter(some condition).values('id')
ModelClass.objects.filter(pk__in=some_list).update(foo=bar)
If you want to update a collection of rows with different values depending on some condition, you can, in the best case, batch the updates according to values. Let's say you have 1000 rows where you want to set a column to one of X values; then you could prepare the batches beforehand and run only X update queries (each essentially having the form of the first example above) plus the initial SELECT query.
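A minimal sketch of that batching idea; compute_new_value is a hypothetical function standing in for whatever per-row rule you have, and foo is a hypothetical column:
from collections import defaultdict

# group primary keys by the value each row should receive...
batches = defaultdict(list)
for pk, current in ModelClass.objects.values_list('pk', 'foo'):
    batches[compute_new_value(current)].append(pk)

# ...then issue one UPDATE per distinct target value
for value, pks in batches.items():
    ModelClass.objects.filter(pk__in=pks).update(foo=value)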
If every row requires a unique value there is no way to avoid one query per update. Perhaps look into other architectures like CQRS/Event sourcing if you need performance in this latter case.
Here is some useful content I found on the internet regarding this question:
https://www.sankalpjonna.com/learn-django/running-a-bulk-update-with-django
The inefficient way:
model_qs = ModelClass.objects.filter(name='bar')
for obj in model_qs:
    obj.name = 'foo'
    obj.save()
The efficient way:
ModelClass.objects.filter(name='bar').update(name='foo')  # for a single value 'foo'; otherwise add a loop
Using bulk_update:
update_list = []
model_qs = ModelClass.objects.filter(name='bar')
for model_obj in model_qs:
    model_obj.name = "foo"  # or whatever the value is; for simplicity I'm using "foo" only
    update_list.append(model_obj)
ModelClass.objects.bulk_update(update_list, ['name'])
Using an atomic transaction:
from django.db import transaction

with transaction.atomic():
    model_qs = ModelClass.objects.filter(name='bar')
    for obj in model_qs:
        obj.name = 'foo'
        obj.save()
To update with the same value we can simply use this:
ModelClass.objects.filter(name='bar').update(name='foo')
To update with different values:
obj_list = ModelClass.objects.filter(name='bar')
obj_to_be_update = []
for obj in obj_list:
    obj.name = "Dear " + obj.name
    obj_to_be_update.append(obj)
ModelClass.objects.bulk_update(obj_to_be_update, ['name'], batch_size=1000)
It won't trigger the save signal for every object; instead we collect all the objects to be updated in a list and trigger a single update.
update() returns the number of rows updated in the table:
update_counts = ModelClass.objects.filter(name='bar').update(name="foo")
You can refer to this link for more information on bulk update and create:
Bulk update and Create