Short incremental unique id for neo4j - django

I use Django with neo4j as my database. I need short URLs based on node ids in my REST API. Neo4j has an internal id that is not recommended for use in applications, and the usual alternative is a uuid, which is too long for my short URLs. So I added my own uid generator:
import time
import uuid
from hashlib import sha256

from neomodel import db  # assuming neomodel, as in db.cypher_query below

def uid_generator():
    # total node count, used only to decide how many characters to keep
    last_id = db.cypher_query("MATCH (n) RETURN count(*) AS lastId")[0][0][0]
    if last_id is None:
        last_id = 0
    last_id = str(last_id)
    hash = sha256()
    hash.update(str(time.time()).encode())
    return hash.hexdigest()[0:max(2, len(last_id))] + str(uuid.uuid4()).replace('-', '')[0:max(2, len(last_id))]
I have two questions. First, I read this question on Stack Overflow and I'm still not sure that MATCH (n) RETURN count(*) AS lastId is O(1); there was no reference for that claim! Is there any reference for that answer? Second, is there a better approach in terms of both id uniqueness and speed?

First, you should put a unique constraint on the id property to make sure there are no collisions created by parallel create statements. This requires using a label, but you NEED this fail-safe if you plan to do anything serious with this data. As a bonus, this way you can have rolling ids for different labels. (All indexed labels will have a count table, and a UNIQUE CONSTRAINT also creates an index.)
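For reference, a minimal sketch of creating such a constraint from Python (assuming neomodel's db as in the question, and a Node label with an id property; the exact Cypher syntax depends on your Neo4j version):
from neomodel import db

# Neo4j 3.x-style syntax; newer versions use
# CREATE CONSTRAINT ... FOR (n:Node) REQUIRE n.id IS UNIQUE
db.cypher_query("CREATE CONSTRAINT ON (n:Node) ASSERT n.id IS UNIQUE")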
Second, you should do the generation and the creation in the same Cypher statement, like this:
MATCH (n:Node) WITH count(*) AS lastId
CREATE (:Node{id:lastId})
This minimizes the time between generation and commit, reducing the chance of collisions. (Remember to retry attempts that fail with a unique-constraint violation.)
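A rough retry sketch around that statement (again assuming neomodel's db.cypher_query; the exact exception raised on a constraint violation depends on your neomodel/driver version, so it is left generic here):
from neomodel import db

def create_node_with_rolling_id(retries=3):
    query = (
        "MATCH (n:Node) WITH count(*) AS lastId "
        "CREATE (m:Node {id: lastId}) "
        "RETURN m.id"
    )
    for attempt in range(retries):
        try:
            rows, _ = db.cypher_query(query)
            return rows[0][0]
        except Exception:  # replace with your driver's constraint-violation exception
            if attempt == retries - 1:
                raise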
I'm not sure what you are doing with the hash, just that you are doing it wrong. Either you generate a new time-based UUID (it requires no parameters) and use it as is, or you use the incrementing id. (By altering a UUID you invalidate the logic that guarantees its uniqueness, thus significantly increasing the collision chance.)
You can also store the current index count in a node, as explained here. It's not guaranteed to be thread safe, but that shouldn't be a problem as long as you have unique constraints in place and retry on constraint violations. This approach is also more tolerant of node deletions.
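A sketch of that counter-node variant (the Counter label and property names are made up for the example; the unique constraint on Node.id still provides the real collision protection):
from neomodel import db

query = (
    "MERGE (c:Counter {name: 'node_id'}) "
    "SET c.value = coalesce(c.value, 0) + 1 "
    "WITH c.value AS nextId "
    "CREATE (n:Node {id: nextId}) "
    "RETURN n.id"
)
rows, _ = db.cypher_query(query)
new_id = rows[0][0]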

Your approach is not good because it's based on the number of nodes in the database.
What happens if you create a node (call it A), then delete a random node, and then create a new node (call it B)?
A and B will have the same ID, and I think that's why you have added a time-based hash in your code (but I barely understand that line :)).
On the other hand, Neo4j's internal ID guarantees a unique ID across the database, but not over time. By default, Neo4j recycles unused IDs (an ID is released when a node is deleted).
You can change this behaviour in the configuration (see the doc HERE): dbms.ids.reuse.types.override=RELATIONSHIP
Be careful with such a configuration: the size of your database on disk can only grow, even if you delete nodes.

Why not create your own identifier? You can get the maximum of your last identifier (let's call it RN, for record number):
MATCH (n) RETURN max(n.RN) AS lastID
max() is one of Cypher's aggregating functions.
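A hedged sketch of that idea, keeping the read and the create in one statement (the Node label is just an example; RN is the property name from above, and db is neomodel's as in the question):
from neomodel import db

query = (
    "MATCH (n) WITH coalesce(max(n.RN), 0) + 1 AS nextRN "
    "CREATE (m:Node {RN: nextRN}) "
    "RETURN m.RN"
)
rows, _ = db.cypher_query(query)
new_rn = rows[0][0]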

Related

Is this a reasonable way to design this DynamoDB table? Alternatives?

Our team has started to use AWS and one of our projects will require storing approval statuses of various recommendations in a table.
There are various things that identify a single recommendation, let's say they're: State, ApplicationDate, LocationID, and Phase. And then there are a bunch of attributes corresponding to the recommendation (title, volume, etc.).
The use case will often require grabbing all entries for a given State and ApplicationDate (and then we will look at all the LocationId and Phase items that correspond to it) for review from a UI. Items are added to the table one at a time for a given State, ApplicationDate, LocationId, and Phase, and are updated frequently.
A dev with a little more AWS experience mentioned we should probably use State+ApplicationDate as the partition key, and LocationId+Phase as the sort key. These two pieces combined would make the primary key. I generally understand this, but how does that work if we start getting multiple recommendations for the same primary key? I figure we either are ok with just overwriting what was previously there, OR we have to add some other attribute so we can write a recommendation for the State+ApplicationDate/LocationId+Phase multiple times and get all previous values if we need to... but that would require adding something to the primary key right? Would that be like adding some kind of unique value to the sort key? Or for example, if we need to do status and want to record different values at different statuses, would we just need to add status to the sort key?
Does this sound like a reasonable approach, or should I be exploring a different AWS offering for storing this data?
Use a time-based id property, such as a ULID or KSUID. This provides randomness to avoid overwriting data, but also gives you time-based sorting of your data when used as part of a sort key.
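If you don't want to pull in a ULID/KSUID library, here is a rough illustration of the same idea in Python (a fixed-width millisecond timestamp prefix plus a random suffix, so lexicographic order follows creation time; this is not a spec-compliant ULID):
import secrets
import time

def time_sortable_id():
    # zero-padded ms timestamp keeps string sort order equal to time order
    return f"{int(time.time() * 1000):013d}-{secrets.token_hex(8)}"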
Because the id value is random, you will want to add it to your sort key for the table or index where you perform your list operations, and reserve the pk for known values that can be specified exactly.
It sounds like the 'State' is a value that can change. You can't update an item's key attributes on the table, so it is more common to use these attributes in a key for a GSI if they are needed to list data.
Given the above, an alternative design is to use the LocationId as the pk, the random id value as the sk, and a GSI with 'State' as the pk and the random id as the sk. Or, if you want to list the items by State -> Phase -> date, the GSI sk could be a concatenation of the Phase and the id property. This pattern also gives you another list mechanism using the LocationId plus the timestamp of the recommendation's creation time.
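A sketch of what writing one recommendation could look like under that layout (the table name, attribute values, and GSI key names here are all hypothetical, not anything DynamoDB prescribes):
import secrets
import time

import boto3

rec_id = f"{int(time.time() * 1000):013d}-{secrets.token_hex(8)}"  # or a real ULID/KSUID
table = boto3.resource("dynamodb").Table("Recommendations")        # hypothetical table name

table.put_item(Item={
    "pk": "LOC#12345",              # LocationId
    "sk": rec_id,                   # random, time-sortable id
    "gsi1pk": "STATE#pending",      # whatever the 'State' value is
    "gsi1sk": f"PHASE#2#{rec_id}",  # Phase + id, for the State -> Phase -> date listing
    "title": "Example recommendation",
})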

DRF - how to 'update' the pk after deleting an object?

A beginner here!
Here's how I'm using URL paths (from the DRF tutorial):
path('articles/', views.ArticleList.as_view()),
path('articles/<int:pk>/', views.ArticleDetail.as_view())
and I noticed that after deleting an 'Article' (this is my model), the pk stays the same.
An example:
1st Article pk = 1, 2nd Article pk = 2, 3rd Article pk = 3
After deleting the 2nd Article I'm expecting:
1st Article pk = 1, 3rd Article pk = 2
yet it remains
3rd Article pk = 3.
Is there a better way to implement the URL? Maybe the pk is not the variable I'm looking for?
Or should I update the list somehow?
Thanks!
and I noticed that after deleting an Article (this is my model), the pk stays the same.
This is indeed the expected behaviour. Removing an object will not "fill the gap" by shifting all the other primary keys. This would mean that for huge tables, you start updating thousands (if not millions) of records, resulting in a huge amount of disk IO. This would make the update (very) slow.
Furthermore, it's not only the primary keys of the table that stores the records that would need to be updated, but also all the foreign keys that refer to these records. This means that several tables need to be updated, which results in even more disk IO, and it can slow down a lot of unrelated updates to the database due to locking.
This problem can be even more severe if you are working with a distributed system where you have multiple databases on different servers. Updating these databases atomically is a serious challenge. The CAP theorem [wiki] demonstrates that in case a network partition failure happens, then you either can not guarantee availability or consistency. By updating primary keys, you put more "pressure" on this.
Shifting the primary key is also not a good idea anyway. It would mean that if your REST API for example returns the primary key of an object, then the client that wants to access that object might not be able to access that object, because the primary key changed in between. A primary key thus can be seen as a permanent identifier. It is usually not a good idea to change the token(s) that a client uses to access an object. If you use a primary key, or a slug, you know that if you later refer to the same item, you will again retrieve the same item.
how to 'update' the pk after deleting an object?
Please don't. Sorting elements can be done with a timestamp, but that is something different from having an identifier space without gaps. A gap is usually not a real problem, so you had better not turn it into one.
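If the goal behind renumbering was just a stable display order, a created timestamp on the model (field names below are illustrative) gives you that without touching the primary key:
from django.db import models

class Article(models.Model):
    title = models.CharField(max_length=200)
    created = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['created']  # list views come back in creation order; gaps in pk don't matter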

how to deal with virtual index in a database table in Django + PostgreSQL

Here is my current scenario:
I need to add a new field to an existing table that will be used for ordering a QuerySet.
This field will be an integer between 1 and a fairly small maximum; I expect fewer than 1000 values. The whole reasoning behind this field is to use it for visual ordering on the front-end, thus index 1 would be the first element to be returned, index 2 the second, etc.
This is how the field is defined in model:
priority = models.PositiveSmallIntegerField(verbose_name=_(u'Priority'),
                                            default=0,
                                            null=True)
I will need to re-arrange (reorder) the whole set of elements in this table if a new or existing element gets this field updated. So for instance, imagine I have 3 objects in this table:
Element A: priority 1
Element B: priority 2
Element C: priority 3
If I change Element C's priority to 1, I should have:
Element C: priority 1
Element A: priority 2
Element B: priority 3
Since this is not a real db index (and it can have empty values), I'm going to have to query for all the elements in the database each time an element is created or updated, and change the priority value of each record in the table. I'm not really worried about performance since the table will always be small, BUT I'm worried that this way of proceeding is not the way to go, or that it simply generates too much overhead.
Maybe there is a simpler way to do this with plain SQL? If I use a unique index though, I will get an error every time an existing priority is reused, which I don't want either.
Any pointers?
To insert at the 10th position, all you need is a single SQL query:
MyModel.objects.filter(priority__gte=10).update(priority=models.F('priority')+1)
Then you would need a similar one for deleting an element, and swapping two elements (or whatever your use case requires). It all should be doable in a similar manner with bulk update queries, no need to manually update entry by entry.
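For example, moving an existing element to a new priority could be sketched like this (MyModel and the helper name are placeholders; the two bulk updates mirror the query above):
from django.db import transaction
from django.db.models import F

def move_to_priority(obj, new_priority):
    with transaction.atomic():
        old_priority = obj.priority
        if new_priority < old_priority:
            # make room by pushing the affected rows down one position
            MyModel.objects.filter(priority__gte=new_priority,
                                   priority__lt=old_priority).update(priority=F('priority') + 1)
        elif new_priority > old_priority:
            # pull the rows in between up one position
            MyModel.objects.filter(priority__gt=old_priority,
                                   priority__lte=new_priority).update(priority=F('priority') - 1)
        obj.priority = new_priority
        obj.save(update_fields=['priority'])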
First, you can very well index this column, just don't enforce it to contain unique values. Such standard indexes can have nulls and duplicates; they are just used to locate the row(s) matching a criterion.
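In Django terms that is just a plain, non-unique index on the column (assuming Django 1.11+ for Meta.indexes; on older versions db_index=True on the field does the same thing):
from django.db import models

class MyModel(models.Model):
    priority = models.PositiveSmallIntegerField(default=0, null=True)

    class Meta:
        indexes = [models.Index(fields=['priority'])]  # allows duplicates and nulls, speeds up ordering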
Second, whether updating each populated* row every time you insert or update a record is acceptable should be judged against the expected update frequency. If each user is inserting several records each time they use the system and you have thousands of concurrent users, it might not be a good idea... whereas if you have a single user updating any number of rows once in a while, it is not much of an issue. In the same vein, you need to consider whether other updates are occurring on the same rows or not. You don't want to lock all the rows too often if they also need to be updated/locked for changes to other fields.
*: to be accurate, you wouldn't update all populated rows, but only the ones whose priority value is greater than or equal to the inserted one (inserting at priority 999 would only shift the items currently at 999 and 1000).

Celery/django duplicate key violations

I have a single Celery worker with 5 threads. It's scraping websites and saving domains to the DB via Django's ORM.
Here is roughly what it looks like:
domain_all = list(Domain.objects.all())
needs_domain = set()
for x in dup_free_scrape:
    domain = x['domain']
    if any(domain.lower() == s.name.lower() for s in domain_all):
        x['domainn'] = [o for o in domain_all if domain.lower() == o.name.lower()][0]
    else:
        print('adding: {}'.format(domain))
        needs_domain.add(domain)
create_domains = [Domain(name=b.lower()) for b in needs_domain]
create_domains_ids = Domain.objects.bulk_create(create_domains)
Probably not the best way, but it checks the domains in one dict (dup_free_scrape) against all the domains already in the database.
It can go through hundreds or even thousands of domains before encountering the error, but sometimes it does:
Task keywords.domains.model_work[285c3e74-8e47-4925-9ab6-a99540a24665]
raised unexpected: IntegrityError('duplicate key value violates unique
constraint "keywords_domain_name_key"\nDETAIL: Key
(name)=(domain.com) already exists.\n',)
django.db.utils.IntegrityError: duplicate key value violates unique
constraint "keywords_domain_name_key"
The only reason for this issue I can think of would be: one thread saved a domain to the DB while another was in the middle of the code above?
I can't find any good solutions, but here is an idea (not sure if it's any good): wrap the whole thing in a transaction and, if the database raises an error, simply retry (query the database for Domain.objects.all() again).
If you are creating these records in bulk and multiple threads are at it, it's indeed very likely that the IntegrityErrors are caused by different threads inserting the same data. Do you really need multiple threads working on this? If yes, you could try:
create_domains = []
create_domain_ids = []
for x in dup_free_scrape:
    domain = x['domain']
    new_domain, created = Domain.objects.get_or_create(name=domain.lower())
    if created:
        create_domains.append(domain.lower())
        create_domain_ids.append(new_domain.pk)
Note that this is all the code. The select all query which you had right at the start is not needed. Domain.objects.all() is going to be very inefficient because you are reading the entire table there.
Also note that your list comprehension for x['domain'] appeared to be completely redundant.
create_domains and create_domain_ids lists may not be needed unless you want to keep track of what was being created.
Please make sure that you have the proper index on domain name. From get_or_create docs:
This method is atomic assuming correct usage, correct database
configuration, and correct behavior of the underlying database.
However, if uniqueness is not enforced at the database level for the
kwargs used in a get_or_create call (see unique or unique_together),
this method is prone to a race-condition which can result in multiple
rows with the same parameters being inserted simultaneously.
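If you do keep multiple threads and still hit that race, the usual pattern is to treat the IntegrityError as "another thread won" and fetch the existing row; a minimal sketch (relying on the unique constraint on Domain.name that your traceback shows):
from django.db import IntegrityError, transaction

def get_or_create_domain(name):
    name = name.lower()
    try:
        # the savepoint from atomic() keeps any outer transaction usable after the error
        with transaction.atomic():
            return Domain.objects.create(name=name), True
    except IntegrityError:
        # another worker inserted the same domain first; just fetch it
        return Domain.objects.get(name=name), False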

Get unique ID of row as it is being created in the same query

I want to predict the unique ID that I will get from the next row that will be created in MySQL, via C++.
For example, if I create a user in my database, I want to predict the auto-incremented unique ID of the user that will be created, using one query. This is for security purposes, so I have no real idea how to go about this. Any nudges in the right direction would be great, thank you.
To sum it up: I want to predict the next unique ID of a user in the database and return it while the user is created. One last note: the unique ID is auto-incremented.
You do not want to predict the value as there is no guarantee it will be anywhere close to accurate.
For example, if something attempts to add a user but fails, the auto-increment counter is usually already advanced (so you may have users 1...N, but since N+1 failed, the next ID would be N+2, not N+1).
You can use mysql_insert_id() to get the last id that was added by your connection, but you cannot really get a prediction for what the next value will be (at least not accurately).
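For illustration in Python (the question is about C++, where the equivalent call is mysql_insert_id(); the connector, credentials, and table here are assumptions), you read the id back after the INSERT instead of predicting it:
import mysql.connector  # any driver exposing cursor.lastrowid works the same way

conn = mysql.connector.connect(user="app", password="secret", database="mydb")
cur = conn.cursor()
cur.execute("INSERT INTO users (name) VALUES (%s)", ("alice",))
conn.commit()
print(cur.lastrowid)  # the auto-increment id MySQL actually assigned to this row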