I have a query that filters records according to some dates and uses subqueries to aggregate data. It goes like this:
# Parse datetime
st_date = timezone.datetime.strptime(...)
ed_date = timezone.datetime.strptime(...)
# Main queryset
clients = Client.objects.all()
# Subquery
details = Detail.objects.filter(
    temp__client__id=OuterRef('pk'),
    temp__datetime__date__range=(st_date, ed_date))
# Filter brand
brand_values = details.filter(
    product__brand__id=...).values(
    'temp__client')
# Aggregate
total_qtys = brand_values.annotate(t=Sum('qty')).values('t')
# Annotate to main query
qs = clients.annotate(
    brand_1=Subquery(total_qtys, output_field=FloatField()))
It works on my local environment, but in production it always returns null for all rows. Upon inspection I noticed that the generated query looks like this:
SELECT ...,
COALESCE((SELECT SUM(U0.`qty`) AS `t`
FROM `sales_detail` U0
INNER JOIN `sales_temp` U1
ON (U0.`temp_id` = U1.`id`)
INNER JOIN `inventory_product` U3
ON (U0.`product_id` = U3.`id`)
WHERE (U1.`client_id` = (`sales_client`.`id`) AND DATE(
CONVERT_TZ(U1.`datetime`, 'UTC',
'America/Mexico_City')) BETWEEN 2018-12-01 AND 2018-12-31 AND
U3.`brand_id` = 6)
GROUP BY U1.`client_id`
ORDER BY NULL), 0) AS `brand_1`
FROM `sales_client`
WHERE NOT (`sales_client`.`status_rip` = True)
If I run the raw query against my local db it fails, too. I then noticed that the dates are left unquoted, and the problem is solved by quoting them, i.e. by replacing 2018-12-01 with '2018-12-01' in the raw query.
So the question is: why doesn't Django quote the dates? And more importantly: why does it work only when executed by the ORM?
Edit:
This is the configuration on both local and production
Local:
Fedora 28
Mariadb 10.1.25 running on Docker with the following config:
#mariadb.cnf
[client]
default-character-set = utf8
[mysqld]
character-set-server = utf8
collation-server = utf8_general_ci
character_set_server = utf8
collation_server = utf8_general_ci
!includedir /etc/mysql/mariadb.conf.d/
Production:
Fedora 24
Mariadb installed from repos, version 10.1.25, with no config changes
You may not like this answer. QuerySet.query is not yet specific to any database: it is the generic version of the query, as it could apply to any backend. So if you want to run that string by hand, you will have to add the quotes around dates, datetimes and maybe a few other data types yourself (I can't remember which). The cursor is what handles the actual escaping of parameters.
From Django's code:
def _execute_query(self):
    connection = connections[self.using]

    # Adapt parameters to the database, as much as possible considering
    # that the target type isn't known. See #17755.
    params_type = self.params_type
    adapter = connection.ops.adapt_unknown_value
    if params_type is tuple:
        params = tuple(adapter(val) for val in self.params)
    elif params_type is dict:
        params = {key: adapter(val) for key, val in self.params.items()}
    else:
        raise RuntimeError("Unexpected params type: %s" % params_type)

    self.cursor = connection.cursor()
    self.cursor.execute(self.sql, params)
As you can see, neither the QuerySet nor the Query actually escapes the parameters; they are passed on to the database wrapper / cursor, since each database may need to handle them differently.
As for why production gives different results than local, I would check your data. The code should be generating the same query for the database, which means it's most likely the data that is causing the different results.
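If you want to see what actually reaches the database, a minimal sketch (using the qs queryset from the question) is to pull the SQL template and the parameters apart with sql_with_params() instead of relying on str(qs.query):
from django.db import connection

# str(qs.query) interpolates parameters without backend-specific quoting,
# which is why the dates show up unquoted.
print(qs.query)

# The SQL template and the parameter tuple Django hands to the cursor:
sql, params = qs.query.sql_with_params()
print(sql)     # contains %s placeholders
print(params)  # e.g. date objects that the DB driver itself will quote

# With DEBUG = True, the query as recorded after execution:
list(qs)
print(connection.queries[-1])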
Related
I use Django 3.1.0.
I have a release_time field (a DateTimeField) in my model, and I want to retrieve the records that have been released on the current date.
Here is my model:
class Test(models.Model):
    title = models.CharField(max_length=100)
    release_time = models.DateTimeField()
    end_time = models.DateTimeField()
These are my time settings:
TIME_ZONE = 'Asia/Tehran'
USE_I18N = True
USE_L10N = True
USE_TZ = True
I use the following line to get my data:
cur_time = localtime()  # django.utils.timezone.localtime
Test.objects.filter(release_time__date=cur_time.date())
This always returns an empty queryset. I am sure that there are records that satisfy this condition.
Please tell me, am I doing this wrong?
UPDATE
This is the WHERE clause in the SQL query (the date is right and there is a record in the database with this date):
WHERE DATE(CONVERT_TZ(`contests_test`.`release_time`, 'Asia/Tehran', 'Asia/Tehran')) = 2020-11-23
If i use this query:
Test.objects.filter(release_time__date=TruncDate(Now()))
I get this sql:
WHERE DATE(CONVERT_TZ(`contests_test`.`release_time`, 'UTC', 'Asia/Tehran')) = DATE(CONVERT_TZ(CURRENT_TIMESTAMP, 'UTC', 'Asia/Tehran'))
Neither of the above works.
I found a logical workaround and I don't depend on this query anymore, but this really keeps me awake at night!
To get the current date for ORM purposes, you can use Django's database function Now() and truncate it to a date with TruncDate(). This forces the query to use the database's time and avoids the mess of juggling timezones between the database and the Python side. For example, if the USE_TZ setting is set to False, localtime() will fail with ValueError: localtime() cannot be applied to a naive datetime.
So you may try:
Test.objects.filter(release_time__date = TruncDate(Now()))
As for your query, I can't guess why it does not return data; it is well formed and should fetch the result. It is always good practice to see what SQL Django actually generates, by using .query.
Execute the following and look at the WHERE clause of the SQL statement; it may contain additional hints, such as unexpected timezones.
q = Test.objects.filter(
    release_time__date=TruncDate(  # django.db.models.functions.datetime.TruncDate
        Now()  # django.db.models.functions.datetime.Now
    )
)
print(q.query)
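Bear in mind that .query shows the generic form of the query, with parameters left unquoted. To see what was actually sent to the database, one option (a sketch; it requires DEBUG = True) is to evaluate the queryset and read the connection's query log:
from django.db import connection, reset_queries

reset_queries()
list(q)  # force evaluation of the queryset from above
# The last entry is a dict with the recorded 'sql' and its 'time'.
print(connection.queries[-1]['sql'])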
My Goal
I need PostgreSQL's rank() window function applied to an annotated queryset from Django's ORM. Django's SQL query has to be a subquery in order to apply the window function, and this is what I'm doing so far:
queryset = Item.objects.annotate(…)
queryset_with_rank = Item.objects.raw("""
    select rank() over (order by points), *
    from (%(subquery)s)""", { 'subquery': queryset.query }
)
The problem
Unfortunately, the query returned by queryset.query does not quote the parameters used for annotation correctly, although the query itself executes perfectly fine.
Example of returned query
The query returned by queryset_with_rank.query or queryset.query contains the following
"participation"."category" = )
"participation"."category" = amateur)
which I rather expected to be
"participation"."category" = '')
"participation"."category" = 'amateur')
Question
I noticed that the Django documentation states the following about Query.__str__()
Parameter values won't necessarily be quoted correctly, since that is done by the database interface at execution time.
As long as I fix the quoting manually and pass it to Postgres myself, everything works as expected. Is there a way to obtain the needed subquery with correct quoting? Or is there an alternative and better approach to applying a window function to a Django ORM queryset altogether?
As Django core developer Aymeric Augustin said, there's no way to get the exact query that is executed by the database backend beforehand.
I still managed to build the query the way I hoped to, although it is a bit cumbersome:
# Obtain query and parameters separately
query, params = item_queryset.query.sql_with_params()

# Put additional quotes around strings. I guess this is what
# the database adapter does as well.
params = [
    '\'{}\''.format(p)
    if isinstance(p, basestring) else p
    for p in params
]

# Cast list of parameters to tuple because I got
# "not enough format characters" otherwise. Dunno why.
params = tuple(params)

participations = Item.objects.raw("""
    select *,
           rank() over (order by points DESC) as rank
    from ({subquery})
    """.format(subquery=query % params), []
)
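As a side note: on Django 2.0 and newer the ORM has native window expressions, so the raw-SQL subquery can often be avoided altogether. A rough sketch, assuming the points annotation from the question:
from django.db.models import F, Window
from django.db.models.functions import Rank

queryset_with_rank = Item.objects.annotate(
    # ... the original annotations producing `points` go here ...
    rank=Window(expression=Rank(), order_by=F('points').desc()),
)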
I am new to SFDC. I have a report already created by a user. I would like to use Python to dump the data of the report into a CSV/Excel file.
I see there are a couple of Python packages for that, but my code gives an error:
from simple_salesforce import Salesforce
sf = Salesforce(instance_url='https://cs1.salesforce.com', session_id='')
sf = Salesforce(password='xxxxxx', username='xxxxx', organizationId='xxxxx')
Can I have the basic steps for setting up the API and some example code?
This worked for me:
import requests
import csv
from simple_salesforce import Salesforce
import pandas as pd
sf = Salesforce(username=your_username, password=your_password, security_token = your_token)
login_data = {'username': your_username, 'password': your_password_plus_your_token}
with requests.session() as s:
    d = s.get("https://your_instance.salesforce.com/{}?export=1&enc=UTF-8&xf=csv".format(reportid), headers=sf.headers, cookies={'sid': sf.session_id})
d.content will contain a string of comma separated values which you can read with the csv module.
I take the data into pandas from there, hence the function name and import pandas. I removed the rest of the function where it puts the data into a DataFrame, but if you're interested in how that's done let me know.
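For completeness, the omitted DataFrame step might look roughly like this (a sketch; the decoding and read_csv call are my guess, not the original function):
from io import StringIO
import pandas as pd

# d.content is the raw CSV export as bytes; decode it and let pandas parse it.
report_df = pd.read_csv(StringIO(d.content.decode('utf-8')))
print(report_df.head())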
In case it is helpful, I wanted to write out the steps I used to answer this question now (Aug-2018), based on Obol's comment. For reference, I followed the README instructions at https://github.com/cghall/force-retrieve/blob/master/README.md for the salesforce_reporting package.
To connect to Salesforce:
from salesforce_reporting import Connection, ReportParser
sf = Connection(username='your_username',password='your_password',security_token='your_token')
Then, to get the report I wanted into a Pandas DataFrame:
report = sf.get_report(your_reports_id)
parser = ReportParser(report)
report = parser.records_dict()
report = pd.DataFrame(report)
If you were so inclined, you could also simplify the four lines above into one, like so:
report = pd.DataFrame(ReportParser(sf.get_report(your_reports_id)).records_dict())
One difference I ran into from the README is that sf.get_report('report_id', includeDetails=True) threw an error stating get_report() got an unexpected keyword argument 'includeDetails'. Simply removing it resulted in the code working fine.
report can now be exported via report.to_csv('report.csv', index=False), or manipulated directly.
EDIT: parser.records() changed to parser.records_dict(), as this allows the DataFrame to have the columns already listed, rather than indexing them numerically.
The code below is rather long and might be just for our use case, but the basic idea is the following:
Find out the date interval length and any additional filtering needed so you never run into the "more than 2,000 rows" limit. In my case I could use a weekly date range filter but needed to apply some additional filters.
Then run it like this:
report_id = '00O4…'
sf = SalesforceReport(user, password, token, report_id)

it = sf.iterate_over_dates_and_filters(datetime.date(2020, 2, 1),
    'Invoice__c.InvoiceDate__c', 'Opportunity.CustomField__c',
    [('a', 'startswith'), ('b', 'startswith'), …])

for row in it:
    # do something with the dict
The iterator goes through every week (if you need daily iterators or monthly then you'd need to change the code, but the change should be minimal) since 2020-02-01 and applies the filter CustomField__c.startswith('a'), then CustomField__c.startswith('b'), … and acts as a generator so you don't need to mess with the filter cycling yourself.
The iterator throws an Exception if there's a query which returns more than 2000 rows, just to be sure that the data is not incomplete.
One warning here: SF has a limit of max 500 queries per hour. Say if you have one year with 52 weeks and 10 additional filters you'd already run into that limit.
Here's the class (relies on simple_salesforce)
import simple_salesforce
import json
import datetime

"""
helper class to iterate over salesforce report data
and maneuvering around the 2000 max limit
"""
class SalesforceReport(simple_salesforce.Salesforce):
    def __init__(self, username, password, security_token, report_id):
        super(SalesforceReport, self).__init__(username=username, password=password, security_token=security_token)
        self.report_id = report_id
        self._fetch_describe()

    def _fetch_describe(self):
        url = f'{self.base_url}analytics/reports/{self.report_id}/describe'
        result = self._call_salesforce('GET', url)
        self.filters = dict(result.json()['reportMetadata'])

    def apply_report_filter(self, column, operator, value, replace=True):
        """
        adds/replaces filter, example:
        apply_report_filter('Opportunity.InsertionId__c', 'startsWith', 'hbob').
        For date filters use apply_standard_date_filter.

        column: needs to correspond to a column in your report, AND the report
            needs to have this filter configured (so in the UI the filter
            can be applied)
        operator: equals, notEqual, lessThan, greaterThan, lessOrEqual,
            greaterOrEqual, contains, notContain, startsWith, includes
            see https://sforce.co/2Tb5SrS for up to date list
        value: value as a string
        replace: if set to True, then if there's already a restriction on column
            this restriction will be replaced, otherwise it's added additionally
        """
        filters = self.filters['reportFilters']
        if replace:
            filters = [f for f in filters if not f['column'] == column]
        filters.append(dict(
            column=column,
            isRunPageEditable=True,
            operator=operator,
            value=value))
        self.filters['reportFilters'] = filters

    def apply_standard_date_filter(self, column, startDate, endDate):
        """
        replace date filter. The date filter needs to be available as a filter in the
        UI already

        Example: apply_standard_date_filter('Invoice__c.InvoiceDate__c', d_from, d_to)

        column: needs to correspond to a column in your report
        startDate, endDate: instance of datetime.date
        """
        self.filters['standardDateFilter'] = dict(
            column=column,
            durationValue='CUSTOM',
            startDate=startDate.strftime('%Y-%m-%d'),
            endDate=endDate.strftime('%Y-%m-%d')
        )

    def query_report(self):
        """
        return generator which yields one report row as dict at a time
        """
        url = self.base_url + f"analytics/reports/query"
        result = self._call_salesforce('POST', url, data=json.dumps(dict(reportMetadata=self.filters)))
        r = result.json()
        columns = r['reportMetadata']['detailColumns']
        if not r['allData']:
            raise Exception('got more than 2000 rows! Quitting as data would be incomplete')
        for row in r['factMap']['T!T']['rows']:
            values = []
            for c in row['dataCells']:
                t = type(c['value'])
                if t == str or t == type(None) or t == int:
                    values.append(c['value'])
                elif t == dict and 'amount' in c['value']:
                    values.append(c['value']['amount'])
                else:
                    print(f"don't know how to handle {c}")
                    values.append(c['value'])
            yield dict(zip(columns, values))

    def iterate_over_dates_and_filters(self, startDate, date_column, filter_column, filter_tuples):
        """
        return generator which iterates over every week and applies the filters
        each for column
        """
        date_runner = startDate
        while True:
            print(date_runner)
            self.apply_standard_date_filter(date_column, date_runner, date_runner + datetime.timedelta(days=6))
            for val, op in filter_tuples:
                print(val)
                self.apply_report_filter(filter_column, op, val)
                for row in self.query_report():
                    yield row
            date_runner += datetime.timedelta(days=7)
            if date_runner > datetime.date.today():
                break
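If the end goal is a DataFrame as in the other answers, the generator's dicts can be collected directly; a sketch reusing the hypothetical credentials and field names from the usage example above:
import datetime
import pandas as pd

sf = SalesforceReport(user, password, token, report_id)
rows = sf.iterate_over_dates_and_filters(datetime.date(2020, 2, 1),
    'Invoice__c.InvoiceDate__c', 'Opportunity.CustomField__c',
    [('a', 'startswith'), ('b', 'startswith')])
df = pd.DataFrame(rows)  # one column per detail column of the report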
For anyone just trying to download a report into a DataFrame this is how you do it (I added some notes and links for clarifications):
import pandas as pd
import csv
import requests
from io import StringIO
from simple_salesforce import Salesforce
# Input Salesforce credentials:
sf = Salesforce(
    username='johndoe@mail.com',
    password='<password>',
    security_token='<security_token>')  # See below for help with finding the token
# Basic report URL structure:
orgParams = 'https://<INSERT_YOUR_COMPANY_NAME_HERE>.my.salesforce.com/' # you can see this in your Salesforce URL
exportParams = '?isdtp=p1&export=1&enc=UTF-8&xf=csv'
# Downloading the report:
reportId = 'reportId' # You find this in the URL of the report in question between "Report/" and "/view"
reportUrl = orgParams + reportId + exportParams
reportReq = requests.get(reportUrl, headers=sf.headers, cookies={'sid': sf.session_id})
reportData = reportReq.content.decode('utf-8')
reportDf = pd.read_csv(StringIO(reportData))
You can get your token by following the instructions at the bottom of this page
I am using bulk_create to load thousands of rows into a PostgreSQL DB. Unfortunately some of the rows are causing an IntegrityError and stopping the bulk_create process. I was wondering if there was a way to tell Django to ignore such rows and save as much of the batch as possible?
This is now possible on Django 2.2
Django 2.2 adds a new ignore_conflicts option to the bulk_create method, from the documentation:
On databases that support it (all except PostgreSQL < 9.5 and Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values. Enabling this parameter disables setting the primary key on each model instance (if the database normally supports it).
Example:
Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
], ignore_conflicts=True)
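Note that, as quoted above, ignore_conflicts prevents Django from setting the primary key on the returned instances, so if you need the saved rows afterwards you have to re-fetch them; a small sketch:
created = Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
], ignore_conflicts=True)

# created[i].pk is not set here; re-fetch by a natural key if you need the rows:
saved = Entry.objects.filter(headline__in=[e.headline for e in created])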
One quick-and-dirty workaround for this that doesn't involve manual SQL and temporary tables is to just attempt to bulk insert the data. If it fails, revert to serial insertion.
objs = [(Event), (Event), (Event)...]
try:
    Event.objects.bulk_create(objs)
except IntegrityError:
    for obj in objs:
        try:
            obj.save()
        except IntegrityError:
            continue
If you have lots and lots of errors this may not be so efficient (you'll spend more time serially inserting than doing so in bulk), but I'm working through a high-cardinality dataset with few duplicates so this solves most of my problems.
(Note: I don't use Django, so there may be more suitable framework-specific answers)
It is not possible for Django to do this by simply ignoring INSERT failures because PostgreSQL aborts the whole transaction on the first error.
Django would need one of these approaches:
INSERT each row in a separate transaction and ignore errors (very slow);
Create a SAVEPOINT before each insert (can have scaling problems);
Use a procedure or query to insert only if the row doesn't already exist (complicated and slow); or
Bulk-insert or (better) COPY the data into a TEMPORARY table, then merge that into the main table server-side.
The upsert-like approach (3) seems like a good idea, but upsert and insert-if-not-exists are surprisingly complicated.
Personally, I'd take (4): I'd bulk-insert into a new separate table, probably UNLOGGED or TEMPORARY, then I'd run some manual SQL to:
LOCK TABLE realtable IN EXCLUSIVE MODE;
INSERT INTO realtable
SELECT * FROM temptable WHERE NOT EXISTS (
SELECT 1 FROM realtable WHERE temptable.id = realtable.id
);
The LOCK TABLE ... IN EXCLUSIVE MODE prevents a concurrent insert that creates a row from conflicting with an insert done by the above statement and causing it to fail. It does not prevent concurrent SELECTs, only SELECT ... FOR UPDATE, INSERT, UPDATE and DELETE, so reads from the table carry on as normal.
If you can't afford to block concurrent writes for too long you could instead use a writable CTE to copy ranges of rows from temptable into realtable, retrying each block if it failed.
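For illustration, a rough Django-side sketch of approach (4), assuming a hypothetical StagingEvent model mapped to the temptable table and a destination table named realtable (both names from the SQL above):
from django.db import connection, transaction

def bulk_insert_ignoring_duplicates(objs):
    # Stage everything first; duplicates are harmless in the staging table.
    StagingEvent.objects.bulk_create(objs)
    with transaction.atomic(), connection.cursor() as cursor:
        # Block concurrent writers (reads still proceed) while merging.
        cursor.execute("LOCK TABLE realtable IN EXCLUSIVE MODE")
        cursor.execute("""
            INSERT INTO realtable
            SELECT * FROM temptable WHERE NOT EXISTS (
                SELECT 1 FROM realtable WHERE temptable.id = realtable.id
            )""")
        cursor.execute("TRUNCATE temptable")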
Or 5. Divide and conquer
I didn't test or benchmark this thoroughly, but it performs pretty well for me. YMMV, depending in particular on how many errors you expect to get in a bulk operation.
def psql_copy(records):
    count = len(records)
    if count < 1:
        return True
    try:
        pg.copy_bin_values(records)
        return True
    except IntegrityError:
        if count == 1:
            # found culprit!
            msg = "Integrity error copying record:\n%r"
            logger.error(msg % records[0], exc_info=True)
            return False
    finally:
        connection.commit()

    # There was an integrity error but we had more than one record.
    # Divide and conquer.
    mid = count // 2
    return psql_copy(records[:mid]) and psql_copy(records[mid:])
    # or just return False
Even in Django 1.11 there is no way to do this. I found a better option than using raw SQL: it uses django-query-builder, which has an upsert method.
from querybuilder.query import Query
q = Query().from_table(YourModel)
# replace with your real objects
rows = [YourModel() for i in range(10)]
q.upsert(rows, ['unique_fld1', 'unique_fld2'], ['fld1_to_update', 'fld2_to_update'])
Note: the library only supports PostgreSQL.
Here is a gist that I use for bulk insert that supports ignoring IntegrityErrors and returns the records inserted.
Late answer for pre-Django 2.2 projects:
I ran into this situation recently and found my way out with a secondary list for checking uniqueness.
In my case, the model has a unique_together constraint, and bulk_create throws an IntegrityError because the list passed to bulk_create contains duplicate data.
So I decided to keep a checklist alongside the bulk_create objects list. Here is the sample code; the unique keys are owner and brand, and in this example owner is a user object instance and brand is a string:
create_list = []
create_list_check = []
for brand in brands:
    if (owner.id, brand) not in create_list_check:
        create_list_check.append((owner.id, brand))
        create_list.append(ProductBrand(owner=owner, name=brand))

if create_list:
    ProductBrand.objects.bulk_create(create_list)
This works for me. I use this function in a thread; my CSV file contains 120,907 rows.
def products_create():
    full_path = os.path.join(settings.MEDIA_ROOT, 'productcsv')
    filename = os.listdir(full_path)[0]
    logger.debug(filename)

    logger.debug(len(Product.objects.all()))
    if len(Product.objects.all()) > 0:
        logger.debug("Products Data Erasing")
        Product.objects.all().delete()
        logger.debug("Products Erasing Done")

    csvfile = os.path.join(full_path, filename)
    csv_df = pd.read_csv(csvfile, sep=',')
    csv_df['HSN Code'] = csv_df['HSN Code'].fillna(0)
    row_iter = csv_df.iterrows()
    logger.debug(row_iter)

    logger.debug("New Products Creating")
    for index, row in row_iter:
        Product.objects.create(part_number=row[0],
                               part_description=row[1],
                               mrp=row[2],
                               hsn_code=row[3],
                               gst=row[4],
                               )

    # products_list = [
    #     Product(
    #         part_number=row[0],
    #         part_description=row[1],
    #         mrp=row[2],
    #         hsn_code=row[3],
    #         gst=row[4],
    #     )
    #     for index, row in row_iter
    # ]
    # logger.debug(products_list)
    # Product.objects.bulk_create(products_list)

    logger.debug("Products uploading done")
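On Django 2.2+ the commented-out bulk path above could be re-enabled with the ignore_conflicts option discussed earlier, which is also much faster than per-row create() calls; a sketch using the same columns:
products_list = [
    Product(
        part_number=row[0],
        part_description=row[1],
        mrp=row[2],
        hsn_code=row[3],
        gst=row[4],
    )
    for index, row in row_iter
]
# Rows violating constraints (e.g. duplicates) are skipped instead of
# aborting the whole batch.
Product.objects.bulk_create(products_list, ignore_conflicts=True)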
For a QuerySet in Django, we can access its .query attribute to get the raw SQL.
For example,
queryset = AModel.objects.all()
print queryset.query
the output could be: SELECT "id", ... FROM "amodel"
But for retrieving an object with get(), say,
item = AModel.objects.get(id=100)
how do I get the equivalent raw SQL? Note: the item might be None.
The item = AModel.objects.get(id=100) is equivalent to:
items = AModel.objects.filter(id=100)
if len(items) == 1:
    return items[0]
else:
    raise exception
Thus the executed query is the same as for AModel.objects.filter(id=100).
Also, you could check the latest item of connection.queries
from django.db import connection # use connections for non-default dbs
print connection.queries[-1]
And, as FoxMaSk said, install django-debug-toolbar and enjoy it in your browser.
It's the same SQL, just with a WHERE id=100 clause tacked to the end.
However, FWIW, if a filter is specific enough to only return one result, it's the same SQL as get() would produce; the only difference is on the Python side at that point, e.g.
AModel.objects.get(id=100)
is the same as:
AModel.objects.filter(id=100).get()
So, you can simply query AModel.objects.filter(id=100) and then use queryset.query with that.
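So a minimal way to see the SQL behind a get() is roughly:
qs = AModel.objects.filter(id=100)
print(qs.query)
# e.g. SELECT "amodel"."id", ... FROM "amodel" WHERE "amodel"."id" = 100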
If it's just for debugging purposes, you can use the Django Debug Toolbar, which can be installed with:
pip install django-debug-toolbar