A beginner here!
Here's how I'm using the URL path (from the DRF tutorials):
path('articles/', views.ArticleList.as_view()),
path('articles/<int:pk>/', views.ArticleDetail.as_view())
And I noticed that after deleting an 'Article' (this is my model), the pk stays the same.
An example:
1st Article pk = 1, 2nd Article pk = 2, 3rd Article pk = 3
After deleting the 2nd Article I'm expecting:
1st Article pk = 1, 3rd Article pk = 2
yet it remains
3rd Article pk = 3.
Is there a better way to implement the URL? Maybe the pk is not the variable I'm looking for?
Or should I update the list somehow?
Thanks
and I noticed that after deleting an Article (this is my model), the pk stays the same.
This is indeed the expected behaviour. Removing an object will not "fill the gap" by shifting all the other primary keys. This would mean that for huge tables, you start updating thousands (if not millions) of records, resulting in a huge amount of disk IO. This would make the update (very) slow.
Furthermore, not only would the primary keys of the table that stores the records need to be updated, but also all the foreign keys that refer to these records. This means that several tables need to be updated, which results in even more disk IO and can slow down a lot of unrelated updates to the database due to locking.
This problem can be even more severe if you are working with a distributed system where you have multiple databases on different servers. Updating these databases atomically is a serious challenge. The CAP theorem states that when a network partition occurs, you cannot guarantee both availability and consistency. By updating primary keys, you put more "pressure" on this.
Shifting the primary key is also not a good idea anyway. It would mean that if your REST API for example returns the primary key of an object, then the client that wants to access that object might not be able to access that object, because the primary key changed in between. A primary key thus can be seen as a permanent identifier. It is usually not a good idea to change the token(s) that a client uses to access an object. If you use a primary key, or a slug, you know that if you later refer to the same item, you will again retrieve the same item.
how to 'update' the pk after deleting an object?
Please don't. Sorting elements can be done with a timestamp, but that is something different from having an identifier space that does not contain any gaps. A gap is usually not a real problem, so it is better not to turn it into one.
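If a gap-free number is only needed for display, it can be computed when reading instead of being stored. The following is a minimal sketch using Python's built-in sqlite3 module rather than Django; the `article` table and its columns are made up for illustration. It shows that deleting a row leaves the remaining primary keys untouched, while a position counter derived at read time has no gaps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the Article model.
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)")
conn.executemany(
    "INSERT INTO article (title) VALUES (?)",
    [("first",), ("second",), ("third",)],
)

# Delete the second article: the third one keeps id 3.
conn.execute("DELETE FROM article WHERE id = 2")
rows = conn.execute("SELECT id, title FROM article ORDER BY id").fetchall()

# A display position is derived while reading; the stored ids never change.
numbered = [(position, pk, title) for position, (pk, title) in enumerate(rows, start=1)]
```

The URL can keep using the pk as the permanent identifier, and the position is only used where a running number is shown to the user.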
I'm currently facing some questions regarding my database design. I'm developing an API which lets users do the following:
Create an Account ( 1 User owns 1 Account)
Create a Profile ( 1 Account owns 1-n Profiles)
Let a profile upload 2 types of items ( 1 Profile owns 0-n Items ; the items differ in type and purpose)
Calling the API methods triggers AWS Lambda to perform the requested operations in the DynamoDB tables.
My current plan looks like this:
It should be possible to query items by specifying a time frame and the Profile ID. But I think my design completely defeats the purpose of DynamoDB. The AWS documentation says that a well-designed application only requires one table.
What would be a good way to realise this architecture in one table?
Are there any drawbacks on using the current design?
What would you specify as Primary/Partition/sort key/secondary indexes in both the current design and a one-table-approach?
I’m going to give this answer assuming that you need to be able to do the following queries.
Given an Account, find all profiles
Given a Profile, find all Items
Given a Profile and a specific ItemType, find all Items
Given an Item, find the owning Profile
Given a Profile, find the owning account
One of the beauties of DynamoDB (and also a bane, perhaps) is that it is mostly schema-less. You need to have the mandatory Primary Key attributes for every item in the table, but all of the other attributes can be anything you like. In order to have a DynamoDB design with only one table, you usually need to get used to the idea of having mixed types of objects in the same table.
That being said, here’s a possible schema for your use case. My suggestion assumes that you are using something like UUIDs for your identifiers.
The partition key is a field that is simply called pkey (or whatever you want). We’ll also call the sort key skey (but again, it doesn’t really matter). Now, for an Account, the value of pkey is Account-{{uuid}} and the value of skey would be the same. For a Profile, the pkey value is also Account-{{uuid}}, but the skey value is Profile-{{uuid}}. Finally, for an Item, the pkey is Profile-{{uuid}} and the skey is Item-{{type}}-{{uuid}}. For all of the attributes of an item, don’t worry about it, just use whatever attributes you want to use.
Since the “parent” object is always the partition key, you can get any of the “child” objects simply by querying for the ID of the parent. For example, your key condition expression to get all the ‘ItemType2’s for a Profile would be
pkey = "Profile-{{uuid}}" AND begins_with(skey, "Item-Type2")
In this schema, your GSI has the same keys as the table, but reversed. You can query the GSI for ‘Item-{{type}}-{{uuid}}’ to get the owning Profile, and similarly query it for ‘Profile-{{uuid}}’ to get the owning Account.
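To make the key layout concrete, here is a rough sketch in plain Python that does not talk to DynamoDB at all. A list of dicts stands in for the table, and the helper names (`account_keys`, `profile_keys`, `item_keys`, `query`, `query_gsi`) are invented for this illustration. It only mimics how the partition key, the `begins_with` condition on the sort key, and the reversed index would select rows.

```python
def account_keys(account_id):
    # An Account uses the same value for both keys.
    return {"pkey": f"Account-{account_id}", "skey": f"Account-{account_id}"}

def profile_keys(account_id, profile_id):
    # A Profile lives in its owning Account's partition.
    return {"pkey": f"Account-{account_id}", "skey": f"Profile-{profile_id}"}

def item_keys(profile_id, item_type, item_id):
    # An Item lives in its owning Profile's partition; the type is part of the sort key.
    return {"pkey": f"Profile-{profile_id}", "skey": f"Item-{item_type}-{item_id}"}

def query(table, pkey, skey_prefix=""):
    """Mimic: pkey = :p AND begins_with(skey, :prefix)."""
    return [r for r in table if r["pkey"] == pkey and r["skey"].startswith(skey_prefix)]

def query_gsi(table, skey):
    """Mimic the reversed index: look up by skey to find the parent in pkey."""
    return [r for r in table if r["skey"] == skey]

# Short fixed IDs instead of UUIDs, purely for readability.
table = [
    account_keys("a1"),
    profile_keys("a1", "p1"),
    item_keys("p1", "Type1", "i1"),
    item_keys("p1", "Type2", "i2"),
    item_keys("p1", "Type2", "i3"),
]

profiles = query(table, "Account-a1", "Profile-")
type2_items = query(table, "Profile-p1", "Item-Type2-")
item_owner = query_gsi(table, "Item-Type2-i2")[0]["pkey"]
profile_owner = query_gsi(table, "Profile-p1")[0]["pkey"]
```

The sketch uses the prefix "Item-Type2-" with a trailing dash, an assumption that keeps a hypothetical type such as "Type20" from matching as well.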
What I have illustrated here is the adjacency list pattern. DynamoDB also has an article describing how to use composite sort keys for hierarchical data, which would also be suitable for your data, and depending on your expected queries, it might be more suitable than using the adjacency list.
You don’t have to put everything in a single table. Yes, DynamoDB recommends it, but it is far more important to make sure that your application is correct and maintainable. If having multiple tables means it’s easier to write a defect free application, then use multiple tables.
I have a single celery worker with 5 threads. It's scraping websites and saving domains to DB via django's ORM.
Here is roughly what it looks like:
domain_all = list(Domain.objects.all())
needs_domain = set()
for x in dup_free_scrape:
    domain = x['domain']
    if any(domain.lower() == s.name.lower() for s in domain_all):
        x['domainn'] = [o for o in domain_all if domain.lower() == o.name.lower()][0]
    else:
        print('adding: {}'.format(domain))
        needs_domain.add(domain)
create_domains = [Domain(name=b.lower()) for b in needs_domain]
create_domains_ids = Domain.objects.bulk_create(create_domains)
Probably not the best way, but it checks domains in one dict(dup_free_scrape) against all domains already in database.
It can go over hundreds or even thousands before encountering the error, but sometimes it does:
Task keywords.domains.model_work[285c3e74-8e47-4925-9ab6-a99540a24665]
raised unexpected: IntegrityError('duplicate key value violates unique
constraint "keywords_domain_name_key"\nDETAIL: Key
(name)=(domain.com) already exists.\n',)
django.db.utils.IntegrityError: duplicate key value violates unique
constraint "keywords_domain_name_key"
The only reason for this issue I can think of would be: One thread saved domain to DB while another was in the middle of code above?
I can't find any good solutions, but here is an idea (not sure if it's any good): wrap the whole thing in a transaction and, if the database raises an error, simply retry (querying the database for "Domain.objects.all()" again).
If you are creating these records in bulk and multiple threads are at it, it's indeed very likely that IntegrityErrors are caused by different threads inserting the same data. Do you really need multiple threads working on this? If yes you could try:
create_domains = []
create_domain_ids = []
for x in dup_free_scrape:
    domain = x['domain']
    # get_or_create returns a tuple of (object, created)
    new_domain, created = Domain.objects.get_or_create(name=domain.lower())
    if created:
        create_domains.append(domain.lower())
        create_domain_ids.append(new_domain.pk)
Note that this is all the code. The select all query which you had right at the start is not needed. Domain.objects.all() is going to be very inefficient because you are reading the entire table there.
Also note that your list comprehension for x['domain'] appeared to be completely redundant.
create_domains and create_domain_ids lists may not be needed unless you want to keep track of what was being created.
Please make sure that you have the proper index on domain name. From get_or_create docs:
This method is atomic assuming correct usage, correct database
configuration, and correct behavior of the underlying database.
However, if uniqueness is not enforced at the database level for the
kwargs used in a get_or_create call (see unique or unique_together),
this method is prone to a race-condition which can result in multiple
rows with the same parameters being inserted simultaneously.
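To illustrate why the database-level unique constraint matters, here is a small sketch with Python's built-in sqlite3 module. It is not how Django implements get_or_create; the `domain` table and the helper `get_or_create_domain` are hypothetical. The idea it shows is the same, though: try the insert, and if the unique constraint rejects it because another worker got there first, fall back to reading the existing row.

```python
import sqlite3

def get_or_create_domain(conn, name):
    """Return (id, created) for a lowercased domain name."""
    name = name.lower()
    try:
        with conn:  # commits on success, rolls back if the insert fails
            cur = conn.execute("INSERT INTO domain (name) VALUES (?)", (name,))
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: the row already exists, so fetch it.
        row = conn.execute("SELECT id FROM domain WHERE name = ?", (name,)).fetchone()
        return row[0], False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domain (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")

first = get_or_create_domain(conn, "Example.com")
second = get_or_create_domain(conn, "example.COM")
row_count = conn.execute("SELECT COUNT(*) FROM domain").fetchone()[0]
```

Without the UNIQUE constraint the second insert would simply succeed and leave two rows with the same name, which is the race the documentation warns about.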
Say I have a general website that allows someone to download their feed in a small amount of time. A user can be subscribed to many different pages, and the user's feed must be returned to the user from the server with only N of the most recent posts between all of the pages subscribed to. Originally when a user queried the server for a feed, the algorithm was as follows:
look at all of the pages a user is subscribed to
get the N most recent posts from each page
sort all of the posts
return the N most recent posts to the user as their feed
As it turns out, doing this EVERY TIME a user tried to refresh a feed was really slow. Thus, I changed the database to have a table of feedposts, which simply has a foreign key to a user and a foreign key to the post. Every time a page makes a new post, it creates a feed post for each of its subscribing followers. That way, when a user wants their feed, it is already created and does not have to be created upon retrieval.
The way I am doing this is creating far too many rows and simply does not seem scalable. For instance, if a single page makes 1 post & has 1,000,000 followers, we just created 1,000,000 new rows in our feedpost table.
Please help!
How do companies such as facebook handle this problem? Do they generate the feed upon request? Are my database relationships terrible?
It's not that the original schema itself would be inherently wrong, at least not based on the high-level description you have provided. The slowness stems from the fact that you're not accessing the database in a way relational databases should be accessed.
In general, when querying a relational database, you should use JOINs and in-database ordering where possible, instead of fetching a bunch of data, and then trying to connect related objects and sort them in your code. If you let the database do all this for you, it will be much faster, because it can take advantage of indices, and only access those objects that are actually needed.
As a rule of thumb, if you need to sort the results of a QuerySet in your Python code, or loop through multiple querysets and combine them somehow, you're most likely doing something wrong and you should figure out how to let the database do it for you. Of course, it's not true every single time, but certainly often enough.
Let me try to illustrate with a simple piece of code. Assume you have the following models:
from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=47)
    followers = models.ManyToManyField('auth.User', related_name='followed_pages')

class Post(models.Model):
    title = models.CharField(max_length=147)
    # on_delete is a required argument since Django 2.0
    page = models.ForeignKey(Page, related_name='posts', on_delete=models.CASCADE)
    content = models.TextField()
    time_published = models.DateTimeField(auto_now_add=True)
You could, for example, get the list of the last 20 posts posted to pages followed by the currently logged in user with the following single line of code:
latest_posts = Post.objects.filter(page__followers=request.user).order_by('-time_published')[:20]
This runs a single SQL query against your database, which only returns the (up to) 20 results that match, and nothing else. And since you're joining on primary keys of all tables involved, it will conveniently use indices for all joins, making it really fast. In fact, this is exactly the kind of operation relational databases were designed to perform efficiently.
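For readers who want to see the database doing the work, here is a rough standalone sketch using Python's built-in sqlite3 module. The table and column names are assumptions that loosely mirror the models above, and the SQL is only approximately the shape of what the ORM would generate, not its exact output. The point is that the join, the ordering, and the limit all happen in one query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE page_followers (page_id INTEGER REFERENCES page(id), user_id INTEGER);
    CREATE TABLE post (
        id INTEGER PRIMARY KEY,
        page_id INTEGER REFERENCES page(id),
        title TEXT,
        time_published INTEGER
    );
    INSERT INTO page VALUES (1, 'news'), (2, 'sports'), (3, 'unfollowed');
    INSERT INTO page_followers VALUES (1, 42), (2, 42);
    INSERT INTO post VALUES
        (1, 1, 'old news', 100),
        (2, 2, 'match report', 200),
        (3, 3, 'not in feed', 300),
        (4, 1, 'breaking news', 400);
""")

def latest_posts(conn, user_id, limit):
    """Fetch the newest posts from pages the user follows in a single query."""
    return conn.execute(
        """
        SELECT post.title
        FROM post
        JOIN page_followers ON page_followers.page_id = post.page_id
        WHERE page_followers.user_id = ?
        ORDER BY post.time_published DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()

feed = latest_posts(conn, 42, 2)
```

Only the rows that end up in the feed are returned to Python; nothing is sorted or merged in application code.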
Caching will be the solution here.
You will have to reduce the database reads, which are much slower than cache reads.
You can use something like Redis to cache the posts.
Here is an amazing answer for better understanding
Is Redis just a cache
Each page can be assigned a key, and you can pull all of the posts for that page under that key.
You need not cache everything; just cache the most recent M posts, where M >> N and is large enough to noticeably reduce the database calls. If a user requests posts beyond the latest M, they can be fetched directly from the DB.
Now when you have to generate the feed, you can make a DB call to get all of the subscribed pages (or you can put those in the cache as well) and then just get the required number of posts from the cache.
The problem here would be keeping the cache up to date.
For that you can use something like Django signals. Whenever a new post is added, add it to the cache as well using the signal.
So for each DB write you will have to write to the cache as well.
But then you will not have to read from the DB, and as Redis is an in-memory datastore it is pretty fast compared to standard relational databases.
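The scheme above can be sketched in plain Python, with a dict of bounded deques standing in for Redis. This is only an illustration under that assumption: the names `cache`, `on_new_post` and `build_feed` are made up, and in a real setup the write would happen in a signal handler and go to Redis instead of process memory.

```python
import heapq
from collections import defaultdict, deque
from itertools import islice

M = 3  # how many recent posts to keep per page (tiny here for demonstration)

# page_id -> newest-first posts; the deque silently drops the oldest beyond M.
cache = defaultdict(lambda: deque(maxlen=M))

def on_new_post(page_id, timestamp, title):
    """Called on every post write so the cache stays in sync with the DB."""
    cache[page_id].appendleft((timestamp, title))

def build_feed(page_ids, n):
    """Merge the per-page lists (each already newest-first) and take the top n."""
    per_page = [cache[p] for p in page_ids]
    merged = heapq.merge(*per_page, key=lambda post: post[0], reverse=True)
    return list(islice(merged, n))

on_new_post("a", 1, "a1")
on_new_post("b", 2, "b1")
on_new_post("a", 3, "a2")
on_new_post("a", 4, "a3")
on_new_post("a", 5, "a4")  # pushes "a1" out of page a's cache

feed = build_feed(["a", "b"], 4)
```

Because each per-page list is already sorted, merging them is cheap and the full feed never has to be sorted from scratch.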
Edit:
These are a few more articles which can help for better understanding
Does Stack Exchange use caching and if so, how
How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances
I have a Django 1.7rc project running on multiple app servers with a MySQL database.
I have noticed the primary key of a model has gaps, e.g., it jumps from 10001 to 10003, and from 10011 to 10014. I cannot figure out why. There is no code that deletes the records directly; however, they could be cascade deleted, which I will investigate further.
order = Order(cart=cart)
order.billing_address = billing_address
order.payment = payment
order.account = account
order.user_uuid = account.get('uuid')
order.save()
Thought I would ask here if this is normal on a multiple app server setup?
Gaps in a primary key are normal (unless you're using a misconfigured SQLite table, which does not use a monotonic PK by default) and help to maintain referential integrity. Having said that, they are usually only caused by deletions or updates within the table, cascaded or otherwise. Verify that you have no code which may delete or update the PK in that table, directly or indirectly. Note also that with MySQL's InnoDB engine, an INSERT that fails or is rolled back still consumes an auto-increment value, so gaps can appear without any deletion.
I can't detect any pattern; maybe 1 in every 1000 edits of a certain model returns an IntegrityError on an m2m field. Most of the time this field wasn't even modified. When a model is saved I believe Django always wipes the m2m field and then re-adds the items, right? I saw Django call clear() and then add() the items.
My code then fails with:
IntegrityError: duplicate key value violates unique constraint
"app_model_m2m_field_key" DETAIL: Key (model1_id, model2_id)=(597,
1009) already exists.
It seems like the add of items is performed before the items are cleared, which is very weird. I've tried to reproduce it but it's very hard, only happens occasionally. Any idea what could cause it? Could maybe setting auto commit solve this problem?
Thanks in advance
Most likely, you have two requests racing to commit similar changes at the same time.
Request 1 begins a transaction and DELETEs the existing M2M rows.
Request 2 begins a transaction and DELETEs the M2M rows with the same where clause. This blocks waiting for request 1's transaction to commit.
Request 1 re-INSERTs all the M2M rows and commits.
Request 2 resumes, and the delete succeeds without deleting any rows, because all rows that existed when the statement began have already been deleted.
Request 2 tries to re-INSERT an M2M row, but the database detects that it already exists and returns an error.
It's possible to fix this by upgrading to the SERIALIZABLE isolation level (instead of PostgreSQL's default of READ COMMITTED) but at the cost of even more exciting potential failure modes and worse performance.
I'm assuming you're right that Django is performing a DELETE followed by a series of INSERTs, although that wouldn't be a very good plan precisely because it exacerbates this kind of race.
The best plan is to identify what has actually changed and only ask the database to make those changes, because then if you get an integrity error it's because there was a real conflict that you probably couldn't do anything about anyway.
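As a minimal sketch of that idea, the change set can be computed with plain set arithmetic before touching the database. The function name `m2m_diff` is made up for this example, and the IDs are arbitrary.

```python
def m2m_diff(current_ids, desired_ids):
    """Return (to_add, to_remove) so only real changes are sent to the database."""
    current, desired = set(current_ids), set(desired_ids)
    return desired - current, current - desired

# 1009 and 1011 are unchanged, so they are neither deleted nor re-inserted.
to_add, to_remove = m2m_diff([1009, 1010, 1011], [1009, 1011, 1012])

# When nothing changed, both sets are empty and no statement needs to run.
unchanged = m2m_diff([1009], [1009])
```

With Django, the two sets could then be passed to the related manager's add() and remove() methods instead of clearing and re-adding everything, so a save that did not modify the field issues no M2M writes at all.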