I'm using Django 1.7.7.
I'm wondering if anyone has experienced this. This is my query:
events = Event.objects.filter(
Q(date__gt=my_date) | Q(date__isnull=True)
).filter(type__in=[...]).order_by('date')
When I then try to paginate it:
p = Paginator(events, 10)
p.count # Gives 91
event_ids = []
for i in xrange(1, p.count / 10 + 2):
    event_ids += [i.id for i in p.page(i)]
print len(event_ids) # Still 91
print len(set(event_ids)) # 75
I noticed that if I removed the .order_by, I don't get any duplicates. I then tried just .order_by with Event.objects.all().order_by('date') which gave no duplicates.
Finally, I tried this:
events = Event.objects.filter(
Q(date__gt=my_date) | Q(date__isnull=True)
).order_by('date')
p = Paginator(events, 10)
events.count() # Gives 131
p.count # Gives 131
event_ids = []
for i in xrange(1, p.count / 10 + 2):
    event_ids += [i.id for i in p.page(i)]
len(event_ids) # Gives 131
len(set(event_ids)) # Gives 118
... and there are duplicates. Can anyone explain what's going on?
I dug into the Django source (https://github.com/django/django/blob/master/django/core/paginator.py#L46-L55) and it seems to have something to do with how Django slices the object_list.
Any help is appreciated. Thanks.
Edit: distinct() has no effect on the duplicates. There aren't any duplicates in the database and I don't think the query introduces any duplicates ([e for e in events.iterator()] doesn't produce any duplicates). It's just when the Paginator is slicing.
Edit2: Here's a more complete example
In [1]: from django.core.paginator import Paginator
In [2]: from datetime import datetime, timedelta
In [3]: my_date = timezone.now()
In [4]: events = Event.objects.filter(
   ...:     Q(date__gt=my_date) | Q(date__isnull=True)
   ...: ).order_by('date')
In [5]: events.count()
Out[5]: 134
In [6]: p = Paginator(events, 10)
In [7]: p.count
Out[7]: 134
In [8]: event_ids = []
In [9]: for i in xrange(1, p.num_pages + 1):
   ...:     event_ids += [j.id for j in p.page(i)]
In [10]: len(event_ids)
Out[10]: 134
In [11]: len(set(event_ids))
Out[11]: 115
A shot in the dark, but I think I might know what it is. I wasn't able to reproduce it in SQLite, but I could with MySQL. I think that when MySQL sorts on a column where many rows share the same value (here, NULL dates), the order of those rows is not deterministic, so slicing can return the same rows more than once.
The pagination slicing basically issues an SQL statement of
SELECT ... FROM ... WHERE (date > D OR date IS NULL) ORDER BY date ASC LIMIT X OFFSET Y
But when date is NULL, I'm not sure how MySQL orders the rows. When I tried two SQL queries, LIMIT 10 and LIMIT 10 OFFSET 10, they returned sets containing some of the same rows, while a single LIMIT 20 produced a unique set.
You can try updating your order_by to include a unique field as a tie-breaker, e.g. order_by('date', 'id'), so the date ordering is kept but the overall ordering is deterministic; that may fix it.
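For example, a minimal sketch of that tie-breaker approach, reusing the query from the question (this is an assumption about the fix rather than something verified against the original data):
from django.core.paginator import Paginator
from django.db.models import Q

# Adding the primary key as a secondary sort key makes the row order
# deterministic, so MySQL returns consistent slices for each LIMIT/OFFSET.
events = Event.objects.filter(
    Q(date__gt=my_date) | Q(date__isnull=True)
).order_by('date', 'id')

p = Paginator(events, 10)
page_1_ids = [e.id for e in p.page(1)]
page_2_ids = [e.id for e in p.page(2)]
assert not set(page_1_ids) & set(page_2_ids)  # pages should no longer overlap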
Try to use .distinct() on your query before passing it to Paginator.
Related
I have a class-based view that isn't working properly (it duplicates some objects and leaves others out).
I tested it in the shell:
from django.core.paginator import Paginator
from report.models import Grade, Exam
f = Exam.objects.all().filter(id=7)[0].full_mark
all = Grade.objects.all().filter(exam_id=7, value__gte=(f*0.95)).order_by('-value')
p = Paginator(all, 12)
for i in p.page(1).object_list:
... print(i.id)
2826
2617
2591
2912
2796
2865
2408
2501
2466
2681
2616
2563
for i in p.page(2).object_list:
... print(i.id)
2558
2466
2563
2920
2681
2824
2498
2854
2546
2606
2598
2614
Making an order_by call on a field with non-unique values before passing the queryset all to the Paginator is the root of the problem, and it is well explained here. All you need is to call distinct(), or to specify an additional field in the order_by to break ties between rows with the same value.
Below is code that should work. You also don't need to call all() in your queries; filter() already operates on all of the model's objects.
from django.core.paginator import Paginator
from report.models import Grade, Exam
f = Exam.objects.filter(id=7).first().full_mark
all = Grade.objects.filter(exam_id=7, value__gte=(f*0.95)).order_by('-value').distinct()
p = Paginator(all, 12)
for i in p.page(1).object_list:
... print(i.id)
By the way, your code will crash if an exam object with id=7 is not found. You should assign the full_mark value to your f variable conditionally.
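For example, a minimal sketch of that guard (raising Http404 is just one option and an assumption about what fits your view):
from django.core.paginator import Paginator
from django.http import Http404
from report.models import Grade, Exam

exam = Exam.objects.filter(id=7).first()
if exam is None:
    # No exam with id=7; handle this however fits your view
    raise Http404("Exam not found")

f = exam.full_mark
grades = Grade.objects.filter(exam_id=7, value__gte=(f * 0.95)).order_by('-value').distinct()
p = Paginator(grades, 12)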
I'm trying to use TensorFlow's Boston housing price example to learn how to use TensorFlow/Keras for regression, but I keep running into a problem when I use my own data, even when I make the smallest possible changes. After giving up on writing everything myself, I simply changed the two lines of the code that load the data:
boston_housing = keras.datasets.boston_housing
(train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()
to something that, based on what I found online, should also create numpy arrays from my CSV:
np_array = genfromtxt('trainingdata.csv', delimiter=',')
np_array = np.delete(np_array, (0), axis=0) # Remove header
test_np_array = np_array[:800,:]
tr_np_array = np_array[800:,:] # Separate out test and train data
train_labels = tr_np_array[:, 20] # Get the last column for the labels
test_labels = test_np_array[:, 20]
train_data = np.delete(tr_np_array, (20), axis=1)
test_data = np.delete(test_np_array, (20), axis=1) # Remove the last column so the data is only the features
Everything I can check looks right – the shapes of the arrays are correct, the arrays appear to be proper numpy arrays, the features do get normalized, etc. – and yet when I set verbose to 1 on model.fit(...), the very first lines of output show a problem with the loss:
Epoch 1/500
32/2560 [..............................] - ETA: 18s - loss: nan - mean_absolute_error: nan
2016/2560 [======================>.......] - ETA: 0s - loss: nan - mean_absolute_error: nan
2560/2560 [==============================] - 0s 133us/step - loss: nan - mean_absolute_error: nan - val_loss: nan - val_mean_absolute_error: nan
I'm especially confused because everywhere else on Stack Overflow where I've seen the "TensorFlow loss is 'NaN'" error, it has generally a) involved a custom loss function, and b) appeared only after the model had trained for a while, not (as here) within the very first batches. Where that's not the case, it's because the data wasn't normalized, but I do that later in the code, and the normalization works for the housing price example and prints out numbers clustered around 0. At this point, my best guess is that the problem is in the genfromtxt call, but if anyone can see what I'm doing wrong or where I might find my issue, I'd be incredibly appreciative.
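One quick way to test that guess is to look for NaNs right after loading, since genfromtxt inserts nan for any cell it cannot parse (empty fields, stray text, a wrong delimiter). A minimal sketch, assuming the same file and layout as above:
import numpy as np
from numpy import genfromtxt

np_array = genfromtxt('trainingdata.csv', delimiter=',')
np_array = np.delete(np_array, (0), axis=0)  # remove header, as in the question

print(np.isnan(np_array).any())              # True means some cells failed to parse
print(np.argwhere(np.isnan(np_array))[:10])  # (row, column) positions of the first bad cells
A single nan in the features or labels is enough to make the loss nan on the first batch; a feature column with zero standard deviation would also break the later normalization in the same way.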
Edit:
Here is the full code for the program. Commenting out lines 13 through 26 and uncommenting lines 10 and 11 makes the program work perfectly. Commenting out lines 13 and 14 and uncommenting lines 16 and 17 was my attempt at using pandas, but that led to the same errors.
import tensorflow as tf
from tensorflow import keras
import numpy as np
from numpy import genfromtxt
import pandas as pd
print(tf.__version__)
# boston_housing = keras.datasets.boston_housing # Line 10
# (train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()
np_array = genfromtxt('trainingdata.csv', delimiter=',') # Line 13
np_array = np.delete(np_array, (0), axis=0)
# df = pd.read_csv('trainingdata.csv') # Line 16
# np_array = df.get_values()
test_np_array = np_array[:800,:]
tr_np_array = np_array[800:,:]
train_labels = tr_np_array[:, 20]
test_labels = test_np_array[:, 20]
train_data = np.delete(tr_np_array, (20), axis=1)
test_data = np.delete(test_np_array, (20), axis=1) # Line 26
order = np.argsort(np.random.random(train_labels.shape))
train_data = train_data[order]
train_labels = train_labels[order]
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)
train_data = (train_data - mean) / std
test_data = (test_data - mean) / std
labels_mean = train_labels.mean(axis=0)
labels_std = train_labels.std(axis=0)
train_labels = (train_labels - labels_mean) / labels_std
test_labels = (test_labels - labels_mean) / labels_std
def build_model():
    model = keras.Sequential([
        keras.layers.Dense(64, activation=tf.nn.relu,
                           input_shape=(train_data.shape[1],)),
        keras.layers.Dense(64, activation=tf.nn.relu),
        keras.layers.Dense(1)
    ])
    optimizer = tf.train.RMSPropOptimizer(0.001)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae'])
    return model
model = build_model()
model.summary()
EPOCHS = 500
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
history = model.fit(train_data, train_labels, epochs=EPOCHS,
validation_split=0.2, verbose=1,
callbacks=[early_stop])
[loss, mae] = model.evaluate(test_data, test_labels, verbose=0)
print("Testing set Mean Abs Error: ${:7.2f}".format(mae * 1000 * labels_std))
I am trying to run the Dickey-Fuller test from statsmodels in Python, but I am getting an error.
I am running Python 2.7 and pandas 0.19.2. The dataset is from GitHub and I imported it as-is.
import pandas as pd
import statsmodels.tsa.stattools as ts
from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries):
    print 'Results of Dickey-Fuller Test:'
    dftest = ts.adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print dfoutput
test_stationarity(tr)
This gives me the following error:
Results of Dickey-Fuller Test:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-10ab4b87e558> in <module>()
----> 1 test_stationarity(tr)
<ipython-input-14-d779e1ed35b3> in test_stationarity(timeseries)
19 #Perform Dickey-Fuller test:
20 print 'Results of Dickey-Fuller Test:'
---> 21 dftest = ts.adfuller(timeseries, autolag='AIC' )
22 #dftest = adfuller(timeseries, autolag='AIC')
23 dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
C:\Users\SONY\Anaconda2\lib\site-packages\statsmodels\tsa\stattools.pyc in adfuller(x, maxlag, regression, autolag, store, regresults)
209
210 xdiff = np.diff(x)
--> 211 xdall = lagmat(xdiff[:, None], maxlag, trim='both', original='in')
212 nobs = xdall.shape[0] # pylint: disable=E1103
213
C:\Users\SONY\Anaconda2\lib\site-packages\statsmodels\tsa\tsatools.pyc in lagmat(x, maxlag, trim, original)
322 if x.ndim == 1:
323 x = x[:,None]
--> 324 nobs, nvar = x.shape
325 if original in ['ex','sep']:
326 dropidx = nvar
ValueError: too many values to unpack
tr must be a 1-D array-like, as you can see here. I don't know what tr is in your case. Assuming you defined tr as the DataFrame that contains the time series data, you should do something like this:
tr = tr.iloc[:,0].values
Then adfuller will be able to read the data.
Just change the line to:
dftest = adfuller(timeseries.iloc[:,0].values, autolag='AIC' )
It will work. adfuller requires a 1-D array-like; in your case you have a DataFrame, so select the column or edit the line as shown above.
I am assuming that, since you are using the Dickey-Fuller test, you want to keep the time series (i.e. the datetime column) as the index. To do that:
tr = tr.set_index('Month')  # I am assuming here that the time series column name is Month
ts = tr['othercolumnname']  # Just use the other column name here; it might be count or anything
I hope this helps.
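Putting the two suggestions together, a minimal sketch (the file name, the 'Month' index column, and the value column are assumptions; adjust them to your dataset):
import pandas as pd
from statsmodels.tsa.stattools import adfuller

tr = pd.read_csv('your_data.csv')   # hypothetical file name
tr = tr.set_index('Month')          # assumed datetime column
series = tr.iloc[:, 0].values       # adfuller needs a 1-D array, not a DataFrame

dftest = adfuller(series, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value',
                                         '#Lags Used', 'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)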
I'm parsing logs generated from multiple sources and joined together to form a huge log file in the following format:
My_testNumber: 14, JobType = testx.
ABC 2234
**SR 111**
1483529571 1 1 Wed Jan 4 11:32:51 2017 0 4
datatype someRandomValue
SourceCode.Cpp 588
DBConnection failed
TB 132
**SR 284**
1483529572 0 1 Wed Jan 4 11:32:52 2017 5010400 4
datatype someRandomXX
SourceCode2.cpp 455
DBConnection Success
TB 102
**SR 299**
1483529572 0 1 **Wed Jan 4 11:32:54 2017** 5010400 4
datatype someRandomXX
SourceCode3.cpp 455
ConnectionManager Success
....
(there are dozens of SR Numbers here)
Now I'm looking for a smart way to parse the logs so that it calculates the time differences in seconds for each testNumber and SR number.
For example, for My_testNumber: 14 it subtracts the SR 111 and SR 284 times (the difference would be 1 second here); for SR 284 and SR 299 it is 2 seconds, and so on.
You can parse your posted log file and save the corresponding data accordingly. Then, you can work with the data to get the time differences. The following should be a decent start:
from itertools import combinations
from itertools import permutations # if order matters
from collections import OrderedDict
from datetime import datetime
import re
sr_numbers = []
dates = []
# Loop through the file and get the test number and times
# Save the data in a list
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
    if '**' in line:
        # Get the data between the asterisks
        if 'SR' in line:
            sr_numbers.append(re.sub(pattern, "\\2", line.strip()))
        else:
            dates.append(datetime.strptime(re.sub(pattern, "\\2", line.strip()), '%a %b %d %H:%M:%S %Y'))
    else:
        continue
# Use hashmap container (ordered dictionary) to make it easy to get the time differences
# Using OrderedDict here to maintain the order of the SR numbers along the file
log_dict = OrderedDict((k,v) for k,v in zip(sr_numbers, dates))
# Use combinations to get the possible combinations (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x):(log_dict[x[1]] - log_dict[x[0]]).seconds for x in combinations(log_dict, 2)}
print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
Edit:
Parsing the file without relying on the asterisks around the dates:
from itertools import combinations
from itertools import permutations # if order matters
from collections import OrderedDict
from datetime import datetime
import re
sr_numbers = []
dates = []
# Loop through the file and get the test number and times
# Save the data in a list
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
    if 'SR' in line:
        current_sr_number = re.sub(pattern, "\\2", line.strip())
        sr_numbers.append(current_sr_number)
    elif line.strip().count(":") > 1:
        try:
            dates.append(datetime.strptime(re.split("\s{3,}", line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
        except IndexError:
            #print(re.split("\s{3,}",line))
            dates.append(datetime.strptime(re.split("\t+", line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
    else:
        continue
# Use hashmap container (ordered dictionary) to make it easy to get the time differences
# Using OrderedDict here to maintain the order of the SR numbers along the file
log_dict = OrderedDict((k,v) for k,v in zip(sr_numbers, dates))
# Use combinations to get the possible combinations (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x):(log_dict[x[1]] - log_dict[x[0]]).seconds for x in combinations(log_dict, 2)}
print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
I hope this proves useful.
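If you only need the differences between consecutive SR numbers, as in the example from the question, here is a small variation on the dictionary above (a sketch reusing log_dict from the code above):
# Pair each SR entry with the next one and subtract their timestamps
keys = list(log_dict)
consecutive_differences = {
    "{} - {}".format(prev, curr): (log_dict[curr] - log_dict[prev]).seconds
    for prev, curr in zip(keys, keys[1:])
}
print(consecutive_differences)
# {'SR 111 - SR 284': 1, 'SR 284 - SR 299': 2}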
I am looking for a fast method to count a model's objects created within the past 30 days, for each day separately. For example:
27.07.2013 (today) - 3 objects created
26.07.2013 - 0 objects created
25.07.2013 - 2 objects created
...
27.06.2013 - 1 objects created
I am going to use this data in google charts API. Have you any idea how to get this data efficiently?
from django.db.models import Count
import datetime
items = Foo.objects.filter(createdate__lte=datetime.datetime.today(), createdate__gt=datetime.datetime.today()-datetime.timedelta(days=30)).\
    values('createdate').annotate(count=Count('id'))
This will (1) filter the results to the last 30 days, (2) select just the createdate field and (3) count the ids, grouping by all selected fields (i.e. createdate). It will return a list of dictionaries of the format:
[
{'createdate': <datetime.date object>, 'count': <int>},
{'createdate': <datetime.date object>, 'count': <int>},
...
]
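If you then need plain [date, count] rows, for example to feed the Google Charts API mentioned in the question, a minimal sketch:
rows = [[item['createdate'].isoformat(), item['count']] for item in items]
# e.g. [['2013-07-27', 3], ['2013-07-25', 2], ...]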
EDIT:
I don't believe there's a way to get all dates, even those with count == 0, with just SQL. You'll have to insert each missing date through python code, e.g.:
import datetime
# needed to use .append() later on
items = list(items)
dates = [x.get('createdate') for x in items]
# use date objects so the comparison matches the DateField values returned above
for d in (datetime.date.today() - datetime.timedelta(days=x) for x in range(0, 30)):
    if d not in dates:
        items.append({'createdate': d, 'count': 0})
I think this is a somewhat more optimized version of @knbk's solution. It has fewer iterations, and set operations are highly optimized in Python (both in processing and in CPU cycles).
from_date = datetime.date.today() - datetime.timedelta(days=7)
orders = Order.objects.filter(created_at__gte=from_date, dealer__executive__branch__user=user)
orders = orders.values('created_at').annotate(count=Count('id')).order_by('created_at')
if len(orders) < 7:
    orders_list = list(orders)
    dates = set(datetime.date.today() - datetime.timedelta(days=i) for i in range(7))
    order_set = set(o['created_at'] for o in orders)
    # add a zero count for each day that has no orders
    for dt in (dates - order_set):
        orders_list.append({'created_at': dt, 'count': 0})
    orders_list = sorted(orders_list, key=lambda item: item['created_at'])
else:
    orders_list = orders