I'm developing an application that recognizes an element, given a library of models.
This library is saved in a local database, in one table. The table contains each model's identification number, its name, and the number of times it was recognized. I keep this last value because I developed a separate application for managing that library, and I want to sort the models by how many times each one was matched.
The application that matches the current element against the models saved in the library runs on the same PC and has access to the model library database.
What is the best way to track the number of times a model is matched against the element under analysis?
One solution would be to execute a query that increments the match count in the models table every time I get a match, but I'm worried because I perform a match operation every 2-3 seconds, so I think I would be hammering the database with queries.
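To make that first option concrete, here is a minimal sketch of the increment, assuming SQLite and hypothetical table/column names (models, id, match_count):

import sqlite3

def increment_match_count(db_path, model_id):
    # One UPDATE per match; a single increment every 2-3 seconds is a very light load.
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "UPDATE models SET match_count = match_count + 1 WHERE id = ?",
            (model_id,),
        )
    conn.close()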
Another option comes from the fact that, for later analysis, I save the history of analysed elements and their match results in another table of the same database. When I start the application that manages the model library, I could count the occurrences of each model in that history table and update the models' match counts from it. However, I have roughly 20,000 candidates for matching per day, I'm planning to clear the history table every day or month, and at the moment I don't save the date in that table.
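A rough sketch of that recount, again with assumed table and column names (a history table with a matched_model_id column):

import sqlite3

def rebuild_match_counts(db_path):
    # Recompute every model's count from the history table in one statement.
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute("""
            UPDATE models
            SET match_count = (
                SELECT COUNT(*)
                FROM history
                WHERE history.matched_model_id = models.id
            )
        """)
    conn.close()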
Thanks in advance.
I have two tables that I need to relate. They both have a timestamp that I want to use to match the data. The problem is that the timestamps are not exact and may be off by a few seconds, e.g. 8.32.50 and 8.32.45. I could match on the hh.mm part (converting the datetime to an integer), but the problem arises in situations where the times straddle a minute boundary, e.g. 8.32.58 and 8.33.07.
You will need to set up a time-dimension table. A time-dimension table goes down to the granularity of seconds, so you can relate both tables to that one dimension table so that they can "speak" to one another. Radacad has a great script for creating one here, which also has time-binning columns.
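As an illustration of the idea only (not the linked script), here is a rough sketch in Python with pandas of a seconds-granularity time dimension, one row per second, that both tables can join to after truncating their timestamps to the second:

import pandas as pd

# One row for every second of the day, plus simple binning columns.
seconds = pd.date_range("00:00:00", "23:59:59", freq="s")  # 86,400 rows
time_dim = pd.DataFrame({
    "time_key": seconds.strftime("%H:%M:%S"),
    "hour": seconds.hour,
    "minute": seconds.minute,
    "second": seconds.second,
})

# Relate each of your two tables to time_dim on its own truncated time_key,
# so they can "speak" to one another through the shared dimension.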
I am working with Django and I am a bit lost on how to extract information from models (tables).
I have a table containing different information from various sensors. What I would like to know is whether, using the Django models, it is possible to obtain for each sensor (each sensor has an identifier) the last row of data, based on the timestamp column.
In SQL it would be something like this (the query is probably not correct, but I think you can see what I'm trying to do):
SELECT sensorID,timestamp,sensorField1,sensorField2
FROM sensorTable
GROUP BY sensorID
ORDER BY max(timestamp);
I have seen that there is a group_by() function and also latest(), but I don't get anything coherent, and I'm also not sure I'm choosing the best approach.
Can anyone help me get started with this topic? I imagine it is very easy but it is a new world and it is difficult to start.
Greetings!
When you use a PostgreSQL database, you can make use of the queryset's .distinct(..) method [Django-doc], to which you pass the fields that the results should be distinct on.
So you can obtain the latest sensors in Django with:
SensorModel.objects.order_by('sensor', '-timestamp').distinct('sensor')
We thus order by sensor (which is required for .distinct(..) on that field), and in case of a tie (the same sensor appearing more than once) we order by timestamp in descending order, so we pick the latest SensorModel object for each sensor.
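For reference, a minimal sketch of the model this assumes; the field names sensor and timestamp come from the query above, everything else is illustrative:

from django.db import models

class SensorModel(models.Model):
    sensor = models.CharField(max_length=50)      # the sensor identifier
    timestamp = models.DateTimeField()
    sensor_field1 = models.FloatField(null=True)  # illustrative data columns
    sensor_field2 = models.FloatField(null=True)

# Latest reading per sensor (PostgreSQL only, since it relies on DISTINCT ON):
latest = SensorModel.objects.order_by('sensor', '-timestamp').distinct('sensor')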
In a Django project, I'm refreshing tens of thousands of lines of data from an external API on a daily basis. The problem is that since I don't know if the data is new or just an update, I can't do a bulk_create operation.
Note: some, or perhaps many, of the rows do not actually change from one day to the next, but I don't know which ones, or how many, ahead of time.
So for now I do:
for row in csv_data:
    try:
        MyModel.objects.update_or_create(id=row['id'], defaults={'field1': row['value1'], ...})
    except:
        print 'error!'
And it takes... forever! One or two rows a second at best, sometimes several seconds per row. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete everything and reinsert every day. I can't wrap my head around this one: how can I significantly cut down the number of database operations so the refresh doesn't take hours and hours?
Thanks for any help.
The problem is that you are doing a database operation for each data row you grabbed from the API. You can avoid that by working out which rows are new (and bulk-insert all of those), which rows actually need an update, and which didn't change.
To elaborate:
Grab all the relevant rows from the database (meaning all the rows that could possibly be updated):
old_data = MyModel.objects.all() # if possible than do MyModel.objects.filter(...)
Grab all the API data you need to insert or update:
api_data = [...]
For each row of data, work out whether it's new (and put it in an array) or whether it needs to update the DB:
new_rows_array = []
for row in api_data:
    if is_new_row(row, old_data):
        new_rows_array.append(row)
    else:
        if is_data_modified(row, old_data):
            ...
            # do the update
        else:
            continue

MyModel.objects.bulk_create(new_rows_array)
is_new_row - checks whether the row is new, so it can be added to the array that will be bulk-created.
is_data_modified - looks the row up in the old data and checks whether its data has changed, so you only update rows that actually changed.
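A rough end-to-end sketch of this approach; the names id and field1 are borrowed from the question, and it assumes Django 2.2+ for bulk_update (everything else is illustrative):

# Index the existing rows by primary key so lookups are O(1).
old_by_id = {obj.id: obj for obj in MyModel.objects.all()}

new_rows = []
changed_rows = []

for row in api_data:
    existing = old_by_id.get(row['id'])
    if existing is None:
        # New row: build an unsaved instance for bulk_create.
        new_rows.append(MyModel(id=row['id'], field1=row['value1']))
    elif existing.field1 != row['value1']:
        # Existing row whose data changed: modify in memory, save in bulk later.
        existing.field1 = row['value1']
        changed_rows.append(existing)
    # Otherwise the row is unchanged and nothing is done.

MyModel.objects.bulk_create(new_rows)
MyModel.objects.bulk_update(changed_rows, ['field1'])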
If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency - for example, this ensures that your model's save() method is called in either case.
But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:
updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])
(Note that this simplified example has a race condition, see the Django source for how to deal with it.)
In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.
Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.
Just to add to the accepted answer: one way of recognizing whether the operation is an update or a create is to ask the API owner to include a last-updated timestamp with each row (if possible) and store it in your DB for each row. That way you only have to touch the rows whose timestamp from the API differs from the one in your DB.
I faced exactly this issue: I was updating every existing row and creating new ones, and it took a whole minute to update 8,000-odd rows. With selective updates I cut that down to just 10-15 seconds, depending on how many rows had actually changed.
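A small sketch of that selective check, assuming a hypothetical last_updated value stored on the model and present in each API row:

# Map of primary key -> stored last-updated timestamp.
stored = dict(MyModel.objects.values_list('id', 'last_updated'))

to_process = [
    row for row in api_data
    if row['id'] not in stored or row['last_updated'] != stored[row['id']]
]
# Only to_process needs update_or_create (or the bulk approach above);
# every other row is known to be unchanged and is skipped entirely.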
I think the two calls below can, together, do the same thing as update_or_create:
MyModel.objects.filter(...).update()
MyModel.objects.get_or_create()
I am trying to implement a search engine for a new app.
The app allows people to rate items (+1 or -1), giving each item a positive or negative score.
When people search for items, I'd like to take the items' ratings into account and order the results accordingly. If an item is a match, it should show up; but if it's a match with a high score, it should be boosted up the results a bit.
A really good match should still win over a fairly good match with a high score, so the rating needs to be weighted along with everything else (for instance, I boosted my titles a bit).
I'm not stuck on Solr by any means; I only just started playing with it today.
With Solr, you can maintain a field on each document which holds the difference.
The difference here is between the total number of +1's and -1's.
Solr allows you to boost on field values using function queries.
So you can query with a boost on the difference field, so that documents with a better difference rank above the others.
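For example, a minimal sketch assuming the edismax parser, a field named score_diff holding the difference, and the pysolr client (none of these names come from the question):

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/items')  # assumed core name

results = solr.search('laptop', **{
    'defType': 'edismax',
    'qf': 'title^2 description',  # plain relevance still dominates (titles boosted)
    'bf': 'score_diff',           # additive boost from the +1/-1 difference field
    'rows': 20,
})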
On the indexing front, as this difference would change quite often, the respective document needs to be updated every time.
Solr does not allow updating a single field in isolation, so you need to handle the incremental updates of the difference field yourself.
If that is a concern for you, you can try using ExternalFileField.
This allows certain document fields, such as ranking or popularity, to be kept outside the index in a separate file.
The file can be updated and the index committed to reflect the changes.
The field can also be used with function queries to boost the results as needed; however, it has a lot of limitations.
You can order your results by a field that stores the ranking.
sqs.filter(content='blah').order_by('rating')
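(That snippet is Haystack syntax; a slightly fuller sketch, assuming the search index defines a rating field for each item:)

from haystack.query import SearchQuerySet

# Order matches by the stored rating; '-rating' puts the highest rated first.
sqs = SearchQuerySet().filter(content='blah').order_by('-rating')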
I need to pick a document from a collection at random (alternatively - a small number of successive documents from a randomly-positioned "window").
I've found two solutions: 1 and 2. The first is unacceptable, since I anticipate a large collection size and wish to minimize the document size. The second seems inefficient (I'm not sure about the complexity of the skip operation). And here one can find a mention of querying a document at a specified index, but I don't know how to do it (I'm using the C++ driver).
Are there other solutions to the problem? Which is the most efficient?
I had a similar issue once. In my case, I had a date property on my documents. I knew the earliest possible date in the dataset, so in my application code I would generate a random date between EARLIEST_DATE_IN_SET and NOW, then query MongoDB using a $gte query on the date property and simply limit it to 1 result.
There was a small chance that the random date would be greater than the highest date in the dataset, so I accounted for that in the application code.
With an index on the date property, this was a super fast query.
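A minimal sketch of that idea in Python with pymongo (the question mentions the C++ driver, which has equivalent calls; the field name date, the collection names, and the lower bound are assumptions):

import random
from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

coll = MongoClient()['mydb']['mycoll']  # assumed database/collection names

EARLIEST_DATE_IN_SET = datetime(2015, 1, 1)  # assumed lower bound
span = (datetime.utcnow() - EARLIEST_DATE_IN_SET).total_seconds()
pivot = EARLIEST_DATE_IN_SET + timedelta(seconds=random.uniform(0, span))

# First document at or after the random pivot; fast with an index on "date".
doc = coll.find_one({'date': {'$gte': pivot}}, sort=[('date', ASCENDING)])
if doc is None:
    # The pivot fell past the newest document; fall back to the latest one.
    doc = coll.find_one(sort=[('date', -1)])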
It seems like you could adapt solution 1 there (assuming your _id key is an auto-incrementing value): just do a count on your records, use that as the upper limit for a random int in C++, then grab that row.
Likewise, if you don't have an auto-inc _id key, just create one for your documents; an additional INT field shouldn't add that much to your document size.
If you don't have an auto-inc field, the MongoDB docs talk about how to quickly add one here:
Auto Inc Field.
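A sketch of that counter-based pick, in Python/pymongo for brevity (the C++ driver has equivalent calls); it assumes an integer field seq, my own naming, that runs 0..count-1 with no gaps:

import random
from pymongo import MongoClient

coll = MongoClient()['mydb']['mycoll']  # assumed database/collection names

# Use the document count as the upper bound for a random index,
# then fetch the document whose auto-inc "seq" field equals it.
n = coll.count_documents({})
if n:
    doc = coll.find_one({'seq': random.randrange(n)})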