I have a model that has a couple million objects. Each object represents a call made/received by a company.
To simplify things, let's say this model, Call, has these fields:
calldate, context, channel.
My goal is to know the average # of calls made and received during each hour of the day of the month (load by hour). The catch is: I need to find this for port1 and port2 separately.
As of now, my code works fine, except that it takes around a whole minute to give me the result for a range of 4 months, and it seems extremely inefficient.
I've done some simple profiling and discovered that the extend is taking around 99% of the processing time:
queryset = Call.objects.filter(calldate__gte='SOME_DATE')
port1, port2 = [],[]
port1.extend(queryset.filter(context__icontains="e1-1"))
port2.extend(queryset.filter(context__icontains="e1-2"))
channels_in_port1 = ["Port/%d-2" % x for x in range(1,32)]
channels_in_port2 = ["Port/%d-2" % x for x in range(32,63)]
for i in channels_in_port1:
    port1.extend(queryset.filter(channel__icontains=i))
for i in channels_in_port2:
    port2.extend(queryset.filter(channel__icontains=i))
port1 and port2 have around 150k objects combined now.
As soon as I have all calls for port1 and port2, I'm good to go. The rest of the code is basically some for loops for port1 and port2 that sums up and takes the average of calls according to the hour/day/month. Trivial stuff.
I tried to avoid using any "extend" by using itertools.chain and chaining the querysets instead. However, that made the processing time shift to the part where I do the trivial for loops to calculate the load by hour.
Any alternatives? Better ways to filter the queryset?
Thanks very much!!
Have you considered using Django's aggregate functions? http://docs.djangoproject.com/en/dev/topics/db/aggregation/
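For instance, more recent Django versions can group calls by hour directly in the database. A rough, untested sketch; the port filter shown here is just a guess based on the context field in the question:

from django.db.models import Count
from django.db.models.functions import ExtractHour

calls_per_hour = (
    Call.objects
    .filter(calldate__gte='SOME_DATE', context__icontains='e1-1')  # port1, per the question
    .annotate(hour=ExtractHour('calldate'))
    .values('hour')
    .annotate(total=Count('id'))   # number of calls falling in each hour of the day
    .order_by('hour')
)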
I presume your problem is with the second set of extends, i.e. those within the for loops, rather than the first. (The first is completely unnecessary, in any case: rather than defining an empty list up front and extending it, you can just do port1 = list(queryset.filter(context__icontains="e1-1")).)
Anyway, to summarize what I think you are trying to do: you want to get all Call objects for a certain date, in two blocks depending on the value of channel: one where it contains values from 1 to 31, and one with values between 32 and 62.
It seems like you could do this with just two queries, without any extending at all:
port1 = queryset.filter(channel__range=["Port/1-2", "Port/31-2"])
port2 = queryset.filter(channel__range=["Port/32-2", "Port/62-2"])
Does that not do what you want?
Edit in response to comment: that's then just two queries per port, which you can extend or concatenate. The problem with your code as posted is that you are doing 31 queries and extend operations for each port, which is bound to be expensive. If you just do one of each, plus one extend/concat, that will be much cheaper.
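In other words, something along these lines (an untested sketch, continuing from the queryset defined in the question):

port1 = list(queryset.filter(context__icontains="e1-1"))
port1.extend(queryset.filter(channel__range=["Port/1-2", "Port/31-2"]))

port2 = list(queryset.filter(context__icontains="e1-2"))
port2.extend(queryset.filter(channel__range=["Port/32-2", "Port/62-2"]))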
Related
I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and order them according to the sum of all diffs with another list:
def diff(l1, l2):
    return sum([abs(v1 - v2) for v1, v2 in zip(l1, l2)])
list2 = [0.3, 0, 1, 0.5]
entries = list(Model.objects.all())
entries.sort(key=lambda t: diff(t.values, list2))
This works fast if my number of entries is very small. But I'm afraid that with a large number of entries, the comparison and sorting of all the entries will get slow, since they have to be loaded from the database. Is there a way to make this more efficient?
The best way is to write it yourself; right now you are iterating over the list more than 4 times!
Although this approach looks pretty, it's not good.
One thing that you can do is:
Have a variable called last_diff and set it to 0.
Iterate through all entries.
Iterate through each entry.values:
From i = 0 to the end of the list, calculate abs(entry.values[i] - list2[i]).
Sum these values into a variable called new_diff.
If new_diff > last_diff, break from the inner loop and push the entry into its right place (this is called insertion sort, check it out!).
In this way, in the average case, the time complexity is much lower than what you are doing now!
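One way to read the break-early suggestion is the simpler case where you only need the single closest entry rather than a full ordering; here is a minimal, untested sketch of that idea (names are illustrative):

def closest_entry(entries, list2):
    # Track the best (smallest) diff seen so far, and abandon a running
    # sum as soon as it can no longer beat it.
    best_entry, best_diff = None, float('inf')
    for entry in entries:
        running = 0.0
        for v1, v2 in zip(entry.values, list2):
            running += abs(v1 - v2)
            if running >= best_diff:  # cannot win any more, stop early
                break
        else:  # only reached when the inner loop did not break
            best_entry, best_diff = entry, running
    return best_entry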
And maybe you need to be creative too. I'm going to share some ideas; check them for yourself to make sure that they are fine.
Assuming that:
the values list elements are always positive floats, and
list2 is always the same for all entries,
then you may be able to say that the bigger the sum over the elements in values, the bigger the diff value is going to be, no matter what the elements of list2 are.
Then you might be able to just forget about the whole diff function. (Test this!)
The only way to make this really go faster is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to actually write a query for that in almost pure Django:
class Unnest(models.Func):
    function = 'UNNEST'

class Abs(models.Func):
    function = 'ABS'

class SubquerySum(models.Subquery):
    template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'

x = [0.3, 0, 1, 0.5]
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
    pairdiff=Abs(Unnest('values') - Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
    diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The unnest function turns each element of values into a row. In this case it happens twice, but the two resulting columns are immediately subtracted and made positive. Still, there are as many rows per pk as there are values. These need to be summed, but that's not as easy as it sounds: the column can't simply be aggregated. This was by far the trickiest part; even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the only one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as zip, i.e. when one array is longer than the other, the remainder is ignored.
Further improvements
While it will already be a lot faster when you don't have to retrieve all rows and loop over them in Python, this still results in a full table scan: all rows have to be processed, every single time. You can do better, though. Have a look into the cube extension. Use it to calculate the L1 distance (at least, that seems to be what you're calculating) directly with the <#> operator. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change the type from float[] to cube. In the case of the latter, you might have to implement your own CubeField too; I haven't found any package yet that provides it. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed and hence blazing fast.
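For illustration, an untested sketch of what the RawSQL route might look like; it assumes the cube extension is installed and that <#> is the taxicab (L1) distance operator:

from django.db.models.expressions import RawSQL

x = [0.3, 0, 1, 0.5]
entries = Model.objects.annotate(
    # cube("values") <#> cube(%s) should evaluate the L1 distance in SQL;
    # the list parameter is adapted to a Postgres float array by the driver.
    diff=RawSQL('cube("values") <#> cube(%s)', ([float(v) for v in x],))
).order_by('diff')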
Community,
Objective: I'm running a Pi project (i.e. Python) that communicates with an Arduino to get data from a load cell once a second. What data structure should I use to log (and do real-time analysis on) this data in Python?
I want to be able to do things like:
Slice the data to get the value of the last logged datapoint.
Slice the data to get the mean of the datapoints for the last n seconds.
Perform a regression on the last n data points to get g/s.
Remove from the log data points older than n seconds.
Current Attempts:
Dictionaries: I have appended a new key with a rounded time to a dictionary (see below), but this makes slicing and analysis hard.
log = {}

def log_data():
    log[round(time.time(), 4)] = read_data()
Pandas DataFrame: this was the one I was hoping for, because it makes time-series slicing and analysis easy, but this (How to handle incoming real time data with python pandas) seems to say it's a bad idea. I can't follow their solution (i.e. storing in a dictionary and df.append()-ing in bulk every few seconds) because I want my rate calculations (regressions) to be in real time.
This question (ECG Data Analysis on a real-time signal in Python) seems to have the same problem as I did, but with no real solutions.
Goal:
So what is the proper way to handle and analyze real-time time-series data in Python? It seems like something everyone would need to do, so I imagine there has to be pre-built functionality for this?
Thanks,
Michael
To start, I would question two assumptions:
You mention in your post that the data comes in once per second. If you can rely on that, you don't need the timestamps at all -- finding the last N data points is exactly the same as finding the data points from the last N seconds.
You have a constraint that your summary data needs to be absolutely 100% real time. That may make life more complicated -- is it possible to relax that at all?
Anyway, here's a very naive approach using a list. It satisfies your needs. Performance may become a problem depending on how many of the previous data points you need to store.
Also, you may not have thought of this, but do you need the full record of past data? Or can you just drop stuff?
from statistics import mean
import time

data = []

# New data comes in (timestamp and value come from your read loop).
new_observation = (timestamp, value)
data.append(new_observation)

# Slice the data to get the value of the last logged datapoint.
data[-1]

# Slice the data to get the mean of the datapoints for the last n seconds.
current_time = time.time()
mean(value for ts, value in data if current_time - ts < n)

# Perform a regression on the last n data points to get g/s.
regression_function(data[-n:])

# Remove from the log data points older than n seconds.
data = [obs for obs in data if current_time - obs[0] < n]
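If the one-reading-per-second assumption above holds and you only ever need the last n points, a bounded deque would also work; a small sketch, where the tuple of sample values stands in for readings from the load cell:

from collections import deque
from statistics import mean

n = 60                      # keep the last n one-per-second samples
recent = deque(maxlen=n)    # the oldest sample falls off automatically

for value in (0.1, 0.2, 0.3):   # stand-in for successive read_data() calls
    recent.append(value)

last_value = recent[-1]     # last logged datapoint
mean_last_n = mean(recent)  # mean over the last n seconds (here, 3 samples)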
Let's say I have a team model, and a team has members.
So
class Team(models.Model):
    team_member = models.ManyToManyField('Employee')

class Employee(models.Model):
    ....
Let's say I have a list of employee ids like team_members = [1001, 1003, 1004], and I want to find the Team that is made up of exactly those three members.
I don't want the team that has [1001, 1003, 1004, 1005] or the team that has [1001, 1003].
Only team [1001, 1003, 1004].
This is what I'm doing now:
teams = Team.objects.all()
team = None
for t in teams:
    if set([x.id for x in t.team_member.all()]) == set(team_members):
        team = t
if not team:
    team = Team.objects.create()
    team.team_member = team_members
But it seems a bit ham-handed. Is there a cleaner way, with fewer nested loops?
The short answer
No, I don't know of a much simpler way in terms of code appearance.
However there are some things you could do to make your code a little more graceful and potentially a lot faster. Plus it is possible to do the work in the database, albeit quite inefficiently for large team sizes.
The DB option listed below is pretty much as ham-handed as the for loop you provided, but could be more efficient depending on your data set, DB, etc.
Longer answer: ways to be less 'ham-handed'
There are a couple of places I'd clean up the style here.
Plus, in my experience with Django, loops like the one you built do tend to become pretty expensive on large data sets. If you end up loading, say, 10,000 teams into memory, having the ORM convert them to Team objects, and then iterating over them, you'll probably see some significant slowdown.
A few things to try for speed & grace:
Use Team.objects.values_list('team_member') for your in-Python filter loop, which skips the step where Django organizes all of the SQL data into Model objects. I've found this to save lots of time instantiating objects (sometimes around an order of magnitude).
Straighten out your set() calls. Currently you're re-converting team_members to a set() on every iteration, plus you're implicitly turning t.team_member into Employee objects (as they're fetched from the DB), then into a list of ids, and then into a set. For the first item, just make a team_members_set = set(team_members) up front and reuse it. For the second, you can do set(t.team_member.values_list('id', flat=True)), which skips the heaviest ORM step of instantiating Employees (which could be as bad as O(n^2) in your example, depending on the data set and Django's caching). See the sketch after this list.
Use Team.objects.all().iterator() so as not to load all the Teams into memory at once. This will help if you're running into memory issues.
But with any performance optimization, of course test your perf with real or real-ish data to be sure you're not making things worse!
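Putting those adjustments together, an untested sketch of the in-Python check might look like this:

team_members_set = set(team_members)  # build the comparison set once

team = None
for t in Team.objects.all().iterator():  # don't hold every Team in memory
    # values_list avoids instantiating Employee objects for each member
    member_ids = set(t.team_member.values_list('id', flat=True))
    if member_ids == team_members_set:
        team = t
        break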
Longer answer: the DB option
After trying all manner of Q() manipulation and other approaches listed in the answers here, to no avail, I found this answer by #Todor.
Basically you need to do repeated filter()s, one for each team_member. On top of that you use a Count filter to make sure that you don't end up choosing a Team with a superset of the desired members.
from functools import reduce  # needed on Python 3

desired_members = [1001, 1003, 1004]
initial_queryset = Team.objects.annotate(
    cnt=models.Count('team_member')
).filter(cnt=len(desired_members))
matching_teams = reduce(  # can of course use a for loop if you prefer that to reduce()
    lambda queryset, member: queryset.filter(team_member=member),
    desired_members,
    initial_queryset
)
Note that the resulting query will likely have perf issues for large teams, since it will do one JOIN for every one of your desired_members. It'd be nice to avoid that, but I don't know of another way to do this all in the database without changing your data structure. I'd love to learn a better way, and if you end up doing some perf testing I'd be curious to hear what you learn!
Maybe you can use annotate for the count of team_member. Can you try this?
Team.objects.filter(team_member__pk__in=team_members).annotate(num_team=Count('team_member')).filter(num_team=len(team_members))
To get the Team with those exact three members you can use:
Team.objects.get(team_member__pk=team_members) # This code was untested
You could also try with a list of Employee objects:
# team_members = Employee.objects.filter(pk__in=team_members)
team_members = [<Employee: Employee object>, <Employee: Employee object>, <Employee: Employee object>]
Team.objects.get(team_member=team_members)
I'm currently implementing a histogram that will show very large scale data using Qt, and I have some doubts about which data structure(s) I should be using for my problem. I will be displaying the number of queries received from users of an application, in a single application that will show different histograms upon clicking different "show me this data etc." buttons, as follows:
1) Display the histogram of total queries per month. There are 4 months of data here; I kept four variables and incremented them as I came across queries belonging to those months in the CSV file.
2) Display the histogram of total queries per day in a selected month. I was thinking of using 4 QVectors to represent the months for this one, incrementing an element of the vector (a day) each time I come across that specific day. E.g. one vector represents the month of August, and whenever I come across a record with 2011-08-XY, I increment the (XY + 1)th element of that vector by 1. My second alternative is to use 4 QLinkedLists for the sake of better complexity, but I'm not sure the ways I've come up with are efficient enough, and I'm willing to listen to any other idea.
3) Here's where things get a bit complicated. Display the histogram of total queries per hour in a selected day and month. The amount of data represented grows vastly here, and I don't know which data structure (or combination of structures) I should use to implement this one. A list of lists, perhaps?
Any ideas on my problems in 2) and 3) would be helpful. Thanks in advance.
Actually, it shouldn't be too unmanageable to always track queries per hour. Assuming that the number of queries per hour never exceeds the maximum int value, that's only 24 ints per day, each 32 or 64 bits depending on your machine. Assuming 32-bit ints, you could fit roughly 28 years' worth of data per MB.
There's no need to store the month/year; your program can work that out. Just assign hour 0 to the earliest point in your data, which you keep as a constant, then work out the date based on the hours passed since then.
This avoids having to have a list of lists or anything fancy: just use an array where each index is the number of hours since hour 0 and each element holds the number of queries for that hour.
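A small illustration of that flat hours-since-start array (written in Python for brevity, though the question targets Qt/C++; the start date and array size are assumptions):

from datetime import datetime, timedelta

HOUR_ZERO = datetime(2011, 5, 1)   # earliest timestamp in the data (assumed)
counts = [0] * (24 * 31 * 4)       # one counter per hour, roughly 4 months

def record_query(ts):
    """Increment the counter for the hour that ts falls into."""
    index = int((ts - HOUR_ZERO).total_seconds() // 3600)
    counts[index] += 1

def hour_of(index):
    """Recover the calendar date/hour for a given array position."""
    return HOUR_ZERO + timedelta(hours=index)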
Why don't you simply use a classical database?
When you start asking these kinds of questions, I think it is a good time to consider a more robust structure. There are multiple data structures implemented inside any DB, each optimized for different access patterns. You should consider at least lookup, insertion, deletion, and range queries. No structure beats the others on every cost, so there is always a trade-off.
Qt has some database classes you can use. I never used the Qt SQL library, but I think you should give it a shot. Fortunately, there is a Qt SQL programming guide at the end of the linked page.
I am looking for a strategy to batch all my queries (with IN clause) to overcome the restrictions by databases on IN clause (See here).
I usually get lists of size 100,000 to 305,000, so this has become very important to tackle.
I have tried two strategies so far.
Strategy 1:
Create an entity, and hence a table, with one column to hold such values (can we create temp tables on the fly with JPA 2.0 in a vendor-independent way?) and use the data from the temp table as a subquery to the original query, before eventually cleaning up the temp table.
Advantage: Very performant queries. Really quick, I must admit; for the numbers I have mentioned, it was mostly under a minute.
Possible drawback: Use of a temp table, which is actually a permanent one in my case thus far.
Strategy 2:
Calculate the batch size for the given input list and for each batch execute the query and accumulate the result.
Advantage: No temp tables. Easy for any threads within the same transaction.
Disadvantage: A big disadvantage is the amount of time it takes to execute all the batches. For the numbers mentioned, this is at an unacceptable level at the moment: it takes anything between 5 and 15 minutes!
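For clarity, the batching in Strategy 2 amounts to something like the following (a small Python sketch for brevity; the real code runs a JPA query with a parameterized IN clause, and all_ids, run_query, and the batch size of 1000 are placeholders):

def chunked(values, batch_size=1000):      # 1000 is just an example IN-clause limit
    """Yield successive batch_size-sized slices of values."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]

def run_query(id_batch):                   # stands in for the real query
    return []                              # with "WHERE id IN (:id_batch)"

all_ids = list(range(100000))              # stands in for the big input list
results = []
for batch in chunked(all_ids):
    results.extend(run_query(batch))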
I would appreciate any feedback, suggestions or improvements from all you JPA gurus.
Thanks.
I only tested up to 50,000 integers but I have some pretty good performance data around splitting large lists using various methods, with CLR and a numbers table leading the pack at the higher end:
Splitting a list of integers : another roundup
Not sure if you are using integers or strings but the results should be roughly equivalent.
As an aside, I'll confess I have no idea what JPA 2.0 is, but I assume you can control the format of the lists that it sends to SQL Server.