I have a model that lets me log errors together with a hash I generate, so that identical errors share the same hash and I can group and count them. The model looks something like this:
class ErrorLog(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    date = models.DateTimeField(null=True, blank=True, db_index=True)
    log = models.TextField(blank=True, null=True)
    log_hash = models.CharField(max_length=255, blank=True, null=True)
In my view, I perform the following query to count the errors by hash.
def get(self, request):
    qs = self.filter_queryset(self.get_queryset())
    total_errors = qs.count()
    qs = qs.values(
        'log_hash', 'log'
    ).annotate(
        error_count=Count('log_hash')
    ).annotate(
        percentage_of_occurrence=Concat(
            Cast(
                funcs.Round(
                    (F('error_count') / total_errors) * 100, 1
                ), CharField()
            ), Value('%')
        )
    )
This works like a charm, because I can get my results back just as I want them.
"results": [
{
"error_count": 2,
"percentage_of_occurrence": "50.0%",
"log_hash": "8f7744ba51869f93ce990c67bd8d3544",
"log": "Error 1"
},
{
"error_count": 1,
"percentage_of_occurrence": "25.0%",
"log_hash": "de54a1e3be2cab4d04d8c61f538a71df",
"log": "Error 2"
},
{
"error_count": 1,
"percentage_of_occurrence": "25.0%",
"log_hash": "05988dc15543ef06e803a930923d11d4",
"log": "Error 3"
}
]
Here is the problem: this is REALLY slow on a large table. After inspecting the generated SQL, I noticed that I am counting twice: once to get error_count, and once more to calculate percentage_of_occurrence.
SELECT `errorlog`.`log_hash`, `errorlog`.`log`, COUNT(`errorlog`.`log_hash`) AS `error_count`,
((COUNT(`errorlog`.`log_hash`) / ) * 100) AS `percentage_of_occurrence`
FROM `errorlog`
GROUP BY `errorlog`.`log_hash`, `errorlog`.`log`
ORDER BY `error_count` DESC
Is there any way I can reuse the first count to calculate the percentage_of_occurrence without having to count again? Also, I am not very savvy on SQL, but would it be better if the log_hash column was indexed?
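One way to avoid the second COUNT, sketched here under the assumption that the number of distinct groups is small enough to post-process in Python: keep only the error_count annotation in the query and derive the percentage from the rows afterwards. The helper name add_percentages and the sample rows are hypothetical; the rows mimic what qs.values('log_hash', 'log').annotate(error_count=Count('log_hash')) returns.

```python
def add_percentages(rows, total_errors):
    """Attach a percentage string to each grouped row.

    `rows` is assumed to be an iterable of dicts shaped like the output of
    .values('log_hash', 'log').annotate(error_count=Count('log_hash')).
    """
    result = []
    for row in rows:
        # Guard against an empty table to avoid division by zero.
        share = (row["error_count"] / total_errors * 100) if total_errors else 0.0
        result.append({**row, "percentage_of_occurrence": f"{share:.1f}%"})
    return result


# Hypothetical sample data standing in for the grouped queryset.
sample_rows = [
    {"log_hash": "8f7744ba51869f93ce990c67bd8d3544", "log": "Error 1", "error_count": 2},
    {"log_hash": "de54a1e3be2cab4d04d8c61f538a71df", "log": "Error 2", "error_count": 1},
    {"log_hash": "05988dc15543ef06e803a930923d11d4", "log": "Error 3", "error_count": 1},
]
with_percentages = add_percentages(sample_rows, total_errors=4)
```

As for the index: adding db_index=True to log_hash is a reasonable thing to try, but since the query also groups by the log text column, whether it helps would need to be checked with EXPLAIN on the real table.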
Related
I'm trying to create a high score statistic table/list for a quiz, where the table/list is supposed to be showing the percentage of (or total) correct guesses on a person which was to be guessed on. To elaborate further, these are the models which are used.
The Quiz model:
class Quiz(models.Model):
    participants = models.ManyToManyField(
        User,
        through="Participant",
        through_fields=("quiz", "correct_user"),
        blank=True,
        related_name="related_quiz",
    )
    fake_users = models.ManyToManyField(User, related_name="quiz_fakes")
    user_quizzed = models.ForeignKey(
        User, related_name="user_taking_quiz", on_delete=models.CASCADE, null=True
    )
    time_started = models.DateTimeField(default=timezone.now)
    time_end = models.DateTimeField(blank=True, null=True)
    final_score = models.IntegerField(blank=True, default=0)
This model does also have some properties; I deem them to be unrelated to the problem at hand.
The Participant model:
class Participant(models.Model):  # QuizAnswer FK -> QUIZ
    guessed_user = models.ForeignKey(
        User, on_delete=models.CASCADE, related_name="clicked_in_quiz", null=True
    )
    correct_user = models.ForeignKey(
        User, on_delete=models.CASCADE, related_name="solution_in_quiz", null=True
    )
    quiz = models.ForeignKey(
        Quiz, on_delete=models.CASCADE, related_name="participants_in_quiz"
    )

    @property
    def correct(self):
        return self.guessed_user == self.correct_user
To walk through what I am trying to do, here is how I think this should work:
For each User in User.objects.all(), find all Participant objects where user.id equals correct_user (from the Participant model)
For each Participant object, evaluate whether correct_user == guessed_user
Count the Participant objects for which the above comparison is True for the User, stored in a field sum_of_correct_guesses
Return a queryset including all users with the fields [User, sum_of_correct_guesses]
^Ideally this should be percentage_of_correct_guesses, but that is an afterthought which should be easy enough to add by dividing sum_of_correct_guesses by the number of times that person was the one to be guessed.
I've even written some pseudocode for a single person to illustrate to myself roughly how it should work, using plain Python arithmetic:
# PYTHON PSEUDO QUERY ---------------------
person = get_object_or_404(User, pk=3)  # Example-person
y = Participant.objects.filter(
    correct_user=person
)  # Find participant-objects where person is used as guess
y_corr = []  # empty list to act as "queryset" in for-loop
for el in y:  # for each participant object
    if el.correct:  # if correct_user == guessed_user
        y_corr.append(el)  # add to queryset
y_percentage_corr = len(y_corr) / len(y)  # do arithmetic division
print("Percentage correct: ", y_percentage_corr)  # debug-display
# ---------------------------------------------
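The single-person loop above can be extended to every user in one pass. The following is only a plain-Python sketch of that idea, assuming the data arrives as (correct_user_id, guessed_user_id) tuples, for example from Participant.objects.values_list("correct_user_id", "guessed_user_id"); the function name and sample data are made up.

```python
from collections import defaultdict


def correct_guess_ratio(pairs):
    """Return {correct_user_id: fraction of guesses that were right}.

    `pairs` is assumed to be an iterable of
    (correct_user_id, guessed_user_id) tuples.
    """
    asked = defaultdict(int)    # times each user was the one to be guessed
    correct = defaultdict(int)  # times the guess matched that user
    for correct_user_id, guessed_user_id in pairs:
        if correct_user_id is None:
            continue  # both foreign keys are nullable in the model
        asked[correct_user_id] += 1
        if correct_user_id == guessed_user_id:
            correct[correct_user_id] += 1
    return {user_id: correct[user_id] / asked[user_id] for user_id in asked}


# Hypothetical data: user 3 was the answer three times and guessed right twice.
sample_pairs = [(3, 3), (3, 5), (3, 3), (5, 3), (None, None)]
ratios = correct_guess_ratio(sample_pairs)
```

Users who were never the correct answer do not appear in the result, which avoids dividing by zero.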
What I've tried (with no success so far) is to use an ExpressionWrapper with Count() and a Q object:
percentage_correct_guesses = ExpressionWrapper(
    Count("pk", filter=Q(clicked_in_quiz=F("id")), distinct=True)
    / Count("solution_in_quiz"),
    output_field=fields.DecimalField(),
)
all_users = (
    User.objects.all().annotate(score=percentage_correct_guesses).order_by("score")
)
Any help or directions to resources on how to do this is greatly appreciated :))
I found an answer while looking around for related problems:
Django 1.11 Annotating a Subquery Aggregate
What I've done is:
Create a filter with an OuterRef() that points to a User and checks whether that User is the same as correct_user, combined with a comparison between guessed_user and correct_user; for every row the filter accepts, the queryset outputs the value correct_user.
Do an annotated count of how many times each correct_user occurs in the filtered queryset.
Annotate User with that count; this is the annotation that really drives the whole operation. Notice how OuterRef() and Subquery are used to tell the filter which user is supposed to be correct_user.
Below is the code snippet I got working; it looks very similar to the answer in the question linked above:
from django.db.models import Count, OuterRef, Subquery, F, Q
crit1 = Q(correct_user=OuterRef('pk'))
crit2 = Q(correct_user=F('guessed_user'))
compare_participants = Participant.objects.filter(crit1 & crit2).order_by().values('correct_user')
count_occurrences = compare_participants.annotate(c=Count('*')).values('c')
most_correctly_guessed_on = (
User.objects.annotate(correct_clicks=Subquery(count_occurrences))
.values('first_name', 'correct_clicks')
.order_by('-correct_clicks')
)
return most_correctly_guessed_on
This works wonderfully, thanks to Oli.
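To see roughly what the annotated Subquery does at the SQL level, here is a small sketch using the standard-library sqlite3 module. The table and column names (app_user, app_participant) are simplified stand-ins, not the ones Django generates, and the data is made up. It also shows a detail worth knowing: users with no correct guesses get NULL (None) rather than 0, because the grouped subquery returns no row for them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_user (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE app_participant (
        id INTEGER PRIMARY KEY,
        correct_user_id INTEGER,
        guessed_user_id INTEGER
    );
    INSERT INTO app_user VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
    INSERT INTO app_participant (correct_user_id, guessed_user_id) VALUES
        (1, 1), (1, 1), (1, 2), (2, 2), (3, 1);
""")

# Correlated subquery: count the rows where the guess was right, per user.
rows = conn.execute("""
    SELECT u.first_name,
           (SELECT COUNT(*)
              FROM app_participant p
             WHERE p.correct_user_id = u.id
               AND p.correct_user_id = p.guessed_user_id
             GROUP BY p.correct_user_id) AS correct_clicks
      FROM app_user u
     ORDER BY correct_clicks DESC
""").fetchall()
conn.close()
```

If 0 is preferred over None, wrapping the Subquery in Coalesce from django.db.models.functions is one option to look at.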
class Order(models.Model):
    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    category = models.ForeignKey(
        Category, null=True, on_delete=models.SET_NULL
    )
    user = models.ForeignKey(User, null=True, on_delete=models.SET_NULL)
    placed = models.DateTimeField(auto_now=True)
    shipped = models.DateTimeField(null=True)
    delivered = models.DateTimeField(null=True)
I want to calculate statistics on how fast orders have been processed for each category,
where the processing time is delivered - shipped.
As a result I want to get something like this:
[
{
"category": <category 1>
"processed_time": <average processed time in seconds>
},
{
"category": <category 2>
"processed_time": <average processed time in seconds>
},
{
"category": <category 3>
"processed_time": <average processed time in seconds>
},
]
I can calculate this outside of the ORM but I'd like to achieve this somehow with annotation/aggregation
delivered = delivered_qs.annotate(first_processed=Min("delivered"), last_processed=Max("delivered")) \
.aggregate(processed_time=F("last_processed")-F("first_processed"))
This queryset returns a single time for all categories together, and I don't know how to retrieve the time for each individual category.
You want to do a GROUP BY, which in Django works in a slightly unusual way. For more information, see the documentation.
By calling .values() first, you tell the queryset to group by category. Then you determine the min, the max and the difference.
delivered = (
    delivered_qs
    .values('category')
    .annotate(
        first_processed=Min("delivered"),
        last_processed=Max("delivered"),
        processed_time=F("last_processed") - F("first_processed"),
    )
)
Which, I expect, would return:
[{
    "category": 1,
    "first_processed": datetime(...),
    "last_processed": datetime(...),
    "processed_time": timedelta(...)
}, ...]
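Note that the query above measures the span between the first and the last delivery in each category. If the goal is the average of delivered - shipped per order, as the question describes, the grouped SQL would look more like the sketch below. It uses the standard-library sqlite3 module with timestamps stored as integer seconds purely to keep the example self-contained; the table name orders and the sample data are assumptions, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        category_id INTEGER,
        shipped INTEGER,    -- epoch seconds, NULL if not shipped yet
        delivered INTEGER   -- epoch seconds, NULL if not delivered yet
    );
    INSERT INTO orders (category_id, shipped, delivered) VALUES
        (1, 1000, 1600),
        (1, 2000, 2200),
        (2, 5000, 5900),
        (2, 6000, NULL);
""")

# One row per category; AVG ignores the NULL produced by undelivered orders.
rows = conn.execute("""
    SELECT category_id AS category,
           AVG(delivered - shipped) AS processed_time
      FROM orders
     GROUP BY category_id
     ORDER BY category_id
""").fetchall()
conn.close()

result = [{"category": c, "processed_time": t} for c, t in rows]
```

In the ORM this would presumably correspond to .values('category').annotate(processed_time=Avg(F('delivered') - F('shipped'))), which may need an ExpressionWrapper with output_field=DurationField() depending on the Django version and database.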
The question is about keeping the X most recent entries in each category of a queryset.
The goal is as follows:
I have an incoming queryset based on the following model.
class UserStatusChoices(models.TextChoices):
    CREATOR = 'CREATOR'
    SLAVE = 'SLAVE'
    MASTER = 'MASTER'
    FRIEND = 'FRIEND'
    ADMIN = 'ADMIN'
    LEGACY = 'LEGACY'

class OperationTypeChoices(models.TextChoices):
    CREATE = 'CREATE'
    UPDATE = 'UPDATE'
    DELETE = 'DELETE'

class EntriesChangeLog(models.Model):
    content_type = models.ForeignKey(
        ContentType,
        on_delete=models.CASCADE,
    )
    object_id = models.PositiveIntegerField(
    )
    content_object = GenericForeignKey(
        'content_type',
        'object_id',
    )
    user = models.ForeignKey(
        get_user_model(),
        verbose_name='user',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='access_logs',
    )
    access_time = models.DateTimeField(
        verbose_name='access_time',
        auto_now_add=True,
    )
    as_who = models.CharField(
        verbose_name='Status of the accessed user.',
        choices=UserStatusChoices.choices,
        max_length=7,
    )
    operation_type = models.CharField(
        verbose_name='Type of the access operation.',
        choices=OperationTypeChoices.choices,
        max_length=6,
    )
I need to filter this incoming queryset so that only the 4 most recent objects (by the access_time field) are kept in each category. Categories are defined by the 'content_type_id' field, and there are 3 possible options.
Let's call them 'option1', 'option2' and 'option3'.
The incoming queryset might contain a different number of objects from 1, 2 or all 3 categories. This can't be predicted beforehand.
DISTINCT can't be used, because the queryset might be ordered after the filtering operation.
I managed to get the single most recent object in the following way:
# get the most recent operation in each category
last_operation_time = Subquery(
    EntriesChangeLog.objects.filter(user=OuterRef('user'))
    .values('content_type_id')
    .annotate(last_access_time=Max('access_time'))
    .values_list('last_access_time', flat=True)
)
queryset.filter(access_time__in=last_operation_time)
But I'm having a hard time figuring out how to get the 4 most recent objects instead of just the last one.
This is needed for django-filter and needs to be done in one query.
DB: Postgres 12
Do you have any ideas on how to do this kind of filtering?
Thanks...
pk_to_rank = queryset.annotate(rank=Window(
    expression=DenseRank(),
    partition_by=('content_type_id',),
    order_by=F('access_time').desc(),
)).values_list('pk', 'rank', named=True)
pks_list = sorted(log.pk for log in pk_to_rank if log.rank <= value)
return queryset.filter(pk__in=pks_list)
I only managed to do it this way, by splitting the query into 2 parts. An option with 3 unions is also possible, but what if we have 800 options instead of 3: make 800 union() calls? I guess not...
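For reference, this is the selection the window query is meant to perform, written as a plain-Python sketch over (pk, content_type_id, access_time) tuples. The function name and data are hypothetical. One thing it makes visible: DenseRank() gives tied access_time values the same rank, so rank <= N can return more than N rows per category, whereas RowNumber() from django.db.models.functions would cap it at exactly N, which is what this sketch does.

```python
from collections import defaultdict
from heapq import nlargest


def newest_per_category(rows, n):
    """Return the pks of the n most recent rows in every category.

    `rows` is assumed to be an iterable of (pk, category, access_time) tuples.
    """
    by_category = defaultdict(list)
    for pk, category, access_time in rows:
        by_category[category].append((access_time, pk))
    kept = []
    for entries in by_category.values():
        # nlargest compares access_time first, then pk as a tie-breaker.
        kept.extend(pk for _, pk in nlargest(n, entries))
    return sorted(kept)


# Hypothetical data: integers stand in for timestamps.
sample = [
    (1, "option1", 10), (2, "option1", 20), (3, "option1", 30),
    (4, "option2", 5), (5, "option2", 15),
    (6, "option3", 7),
]
top_two = newest_per_category(sample, n=2)
```

This works for any number of categories, so it does not matter whether there are 3 options or 800.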
I have the following model defined:
class TestCaseResult(models.Model):
    run_result = models.ForeignKey(
        RunResult,
        on_delete=models.CASCADE,
    )
    name = models.CharField(max_length=128)
    duration = models.DurationField(default=datetime.timedelta)
    result = models.CharField(
        max_length=1,
        choices=(('f', 'failure'), ('s', 'skipped'), ('p', 'passed'), ('e', 'error')),
    )
I'm trying to get, in a single query, the count of each kind of result for a given run_result, along with the sum of the durations for the test cases with that result.
This gives me the count of each type of result, but I can't figure out how to get the sum of the durations included.
qs = TestCaseResult.objects.filter(run_result=run_result).values('result').annotate(result_count=Count('result'))
I basically want this as the resulting SQL:
SELECT
"api2_testcaseresult"."result",
SUM("api2_testcaseresult"."duration") AS "duration",
COUNT("api2_testcaseresult"."result") AS "result_count"
FROM "api2_testcaseresult"
WHERE "api2_testcaseresult"."run_result_id" = 3
GROUP BY "api2_testcaseresult"."result";
Note how 'duration' is not part of the 'group by' clause.
You can simply append a second annotate():
qs = (TestCaseResult.objects
.filter(run_result=run_result)
.values('result')
.annotate(result_count=Count('result'))
.annotate(result_duration=Sum('duration'))
)
This should give you exactly the desired SQL query.
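To check that the grouped query returns what you expect, it can help to compare it against the same aggregation done in plain Python on a handful of rows. The sketch below assumes (result, duration) tuples, e.g. from values_list('result', 'duration'); the function name and sample data are made up.

```python
from collections import defaultdict
from datetime import timedelta


def summarize_results(rows):
    """Group (result, duration) tuples into a count and total duration per result."""
    counts = defaultdict(int)
    durations = defaultdict(timedelta)  # timedelta() is a zero duration
    for result, duration in rows:
        counts[result] += 1
        durations[result] += duration
    return [
        {"result": r, "result_count": counts[r], "result_duration": durations[r]}
        for r in sorted(counts)
    ]


sample = [
    ("p", timedelta(seconds=2)),
    ("p", timedelta(seconds=3)),
    ("f", timedelta(seconds=10)),
]
summary = summarize_results(sample)
```

The dictionaries have the same keys as the rows produced by the two annotate() calls above.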
Here's what my model structure looks like:
class Visitor(models.Model):
    id = models.AutoField(primary_key=True)

class Session(models.Model):
    id = models.AutoField(primary_key=True)
    visit = models.ForeignKey(Visitor)
    sequence_no = models.IntegerField(null=False)

class Track(models.Model):
    id = models.AutoField(primary_key=True)
    session = models.ForeignKey(Session)
    action = models.ForeignKey(Action)
    when = models.DateTimeField(null=False, auto_now_add=True)
    sequence_no = models.IntegerField(null=False)

class Action(models.Model):
    id = models.AutoField(primary_key=True)
    url = models.CharField(max_length=65535, null=False)
    host = models.IntegerField(null=False)
As you can see, each Visitor has multiple Sessions; each Session has multiple Tracks and each Track has one Action. Tracks are always ordered ascending by session and sequence_no. A Visitor's average time on a site (i.e. a particular Action.host) is the difference in Track.when (time) between the highest and lowest Track.sequence_no, divided by the number of Sessions of that Visitor.
I need to calculate the average time visitors spend on the site, which would be the sum of each visitor's time on the site (Action.host) divided by the number of visitors.
I could query this using SQL, but I'd like to keep my query as Djangonic as possible, and I'm still very lost with complex queries.
For a specific Action object you can gather interesting data about Sessions:
from django.db.models import Min, Max
from yourapp.models import *
host = 1 # I suppose you want to calculate for each site
sessions = list(Session.objects.filter(
track__action__host=host,
).annotate(
start=Min('track__when'),
end=Max('track__when'),
).values('visit_id', 'start', 'end'))
You will get something in the line of:
[
{ 'visit_id': 1, 'start': datetime(...), 'end': datetime(...) },
{ 'visit_id': 1, 'start': datetime(...), 'end': datetime(...) },
{ 'visit_id': 2, 'start': datetime(...), 'end': datetime(...) },
....
]
Now it's only a matter of getting the desired result from the data:
number_of_visitors = len(set(s['visit_id'] for s in sessions))
total_time = sum((s['end'] - s['start']).total_seconds() for s in sessions)
average_time_spent = total_time / number_of_visitors
Another way is to use two queries instead of one, and avoid the len(set(...)) snippet:
sessions = Session.objects.filter(
track__action__host=host,
).annotate(
start=Min('track__when'),
end=Max('track__when'),
)
number_of_visitors = sessions.values('visit_id').distinct().count()
total_time = sum((s['end'] - s['start']).total_seconds()
for s in sessions.values('start', 'end'))
There is NO WAY to do actual calculated fields beyond the provided aggregations, so either you do it in raw SQL or you do it in code like this.
At least the proposed solution uses Django's ORM as far as possible.
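Put together as a self-contained function, the post-processing step could look like the sketch below. It assumes a list of dicts with visit_id, start and end keys, as returned by the .values('visit_id', 'start', 'end') call earlier; the function name and sample data are hypothetical.

```python
from datetime import datetime


def average_time_per_visitor(sessions):
    """Total session time in seconds divided by the number of distinct visitors."""
    visitors = {s["visit_id"] for s in sessions}
    if not visitors:
        return 0.0  # no sessions recorded for this host
    total = sum((s["end"] - s["start"]).total_seconds() for s in sessions)
    return total / len(visitors)


sample_sessions = [
    {"visit_id": 1, "start": datetime(2020, 1, 1, 10, 0), "end": datetime(2020, 1, 1, 10, 5)},
    {"visit_id": 1, "start": datetime(2020, 1, 2, 9, 0), "end": datetime(2020, 1, 2, 9, 1)},
    {"visit_id": 2, "start": datetime(2020, 1, 1, 12, 0), "end": datetime(2020, 1, 1, 12, 2)},
]
average_seconds = average_time_per_visitor(sample_sessions)
```

The empty-list guard matters in practice, since a host with no tracked sessions would otherwise raise ZeroDivisionError.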