I have an events table and a sessions table. Event has_many :sessions; that is the association. Now I want to move the time_zone column from the sessions table to the events table only. How do I do this with migrations, and how do I move the existing time_zone values from the sessions table into the events table?
First, you need to be sure that sessions associated with the same event all have the same time zone. You can check this with:
Session.group(:event_id).distinct.count(:time_zone)
This returns a hash mapping each event_id to the number of distinct time zones associated with it. That number should always be one.
Second, I recommend that you first add events.time_zone and start using it, and remove sessions.time_zone in a separate migration after the new code has been in production for some time and has proven to work.
Third, the migration to add events.time_zone should look like this (I added some comments for clarity):
class AddTimeZoneToEvents < ActiveRecord::Migration
  class Event < ActiveRecord::Base; end
  class Session < ActiveRecord::Base; end

  def up
    # Add a NULLable time_zone column to events. Even if the column should be
    # non-NULLable, we first allow NULLs and will set the appropriate values
    # in the next step.
    add_column :events, :time_zone, :string

    # Ensure the new column is visible.
    Event.reset_column_information

    # Iterate over events in batches. Use #update_columns to set the newly
    # added time_zone without modifying updated_at. If you want to update
    # updated_at you have at least two options:
    #
    # 1. Set it to the time at which the migration is run. In this case, just
    #    replace #update_columns with #update!
    # 2. Set it to the maximum of `events.updated_at` and
    #    `sessions.updated_at`.
    #
    # Also, if your database is huge you may consider a different query to
    # perform the update (it also depends on your database).
    Event.find_each do |event|
      session = Session.where(event_id: event.id).last
      event.update_columns(time_zone: session.time_zone)
    end

    # If events don't always need to have time zone information then
    # you can remove the line below.
    change_column_null :events, :time_zone, false
  end

  def down
    remove_column :events, :time_zone
  end
end
Note that I redefined models in the migration. It's crucial to do so because:
The original model may have callbacks and validations (sure, you can skip them, but that's one extra precaution that adds zero value here).
If you remove those models 6 months down the road, the migration will stop working.
Once you're sure your changes work as expected you can remove sessions.time_zone. If something goes awry you can simply roll back the above migration and restore a working version easily.
You can simply use the following migration.
class Test < ActiveRecord::Migration
  def change
    add_column :events, :time_zone, :string

    Event.all.each do |e|
      e.update_attributes(time_zone: e.sessions.last.time_zone)
    end

    remove_column :sessions, :time_zone
  end
end
I have an existing table in which the default value of a column is already set. This table contains a lot of data. I don't want to change any of the existing records (their current column values should stay as they are), but from here onwards I want the column to have a different default value. How do I do that?
Rails version: Rails 4.0.13
Ruby version: ruby 2.2.10p489
Say I want to change the default value of a field (is_male) in a table (users). My users table contains a lot of data, and I want to change the default value of the is_male column, which is of type tinyint, from true to false.
I have created a migration using:
rails g migration add_default_false_to_is_male_for_users
and made the following changes in the migration file:
class AddDefaultFalseToColumnNameTableName < ActiveRecord::Migration
  def up
    change_column_default :users, :is_male, false
  end

  def down
    change_column_default :users, :is_male, true
  end
end
It won't change existing data.
I have the following fields in my Model:
class Event(models.Model):
    starts = models.DateTimeField()
    ends = models.DateTimeField()
I want to prevent overlapping dates (starts, ends). I have managed to do this with model validation, but now I want it enforced at the database level, so that an IntegrityError is raised if an insert happens outside the model's save method.
My validation was as follows:
...
def clean(self):
    if self.starts and self.ends:
        if self.__class__.objects.filter(
            models.Q(ends__gte=self.starts, starts__lt=self.starts) |
            models.Q(ends__gte=self.ends, starts__lt=self.ends) |
            models.Q(starts__gt=self.starts, ends__lt=self.ends)
        ).exists():
            raise ValidationError('Event times overlap with existing record!')
This works. Say an event starts 2020 Oct 11 @ 19:00 and ends 2020 Oct 11 @ 20:00; the following values will be flagged as overlapping:
same date, starting @ 18:00, ending @ 21:00
same date, starting @ 19:30, ending @ 19:50
same date, starting @ 19:30, ending @ 20:50
But there are situations where the model's .clean() method will not be invoked, which may allow invalid data to be inserted.
My question is: how can I enforce a constraint on the model that is applied at the database level itself, the way unique_together does?
I have used Postgres-specific fields like DateRangeField, but in this case their functionality is limited; for one thing, they can contain empty upper bounds.
I have also come across this question here on S/O, which uses the new (as of Django 2.2) CheckConstraint. I have tried to implement it, but it doesn't work.
I have used Postgres-specific fields like DateRangeField, but in this case their functionality is limited; for one thing, they can contain empty upper bounds.
Why not just add an additional constraint to prevent empty upper values? Then you can get all of the benefits of DateRangeField.
Either way, these days we have ExclusionConstraint for postgres, and there are examples in the docs for using ExclusionConstraint with range fields or with two separate fields like your current model.
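A minimal sketch of that second option, adapted to the Event model from the question. It assumes PostgreSQL and a Django version that ships ExclusionConstraint (3.0+); the constraint name and the TsTzRange helper are illustrative and follow the two-separate-fields pattern from the docs:
from django.contrib.postgres.constraints import ExclusionConstraint
from django.contrib.postgres.fields import DateTimeRangeField, RangeBoundary, RangeOperators
from django.db import models
from django.db.models import Func


class TsTzRange(Func):
    function = "TSTZRANGE"
    output_field = DateTimeRangeField()


class Event(models.Model):
    starts = models.DateTimeField()
    ends = models.DateTimeField()

    class Meta:
        constraints = [
            # Rejects any row whose [starts, ends) range overlaps an existing
            # one, raising IntegrityError at the database level.
            ExclusionConstraint(
                name="exclude_overlapping_events",
                expressions=[
                    (TsTzRange("starts", "ends", RangeBoundary()), RangeOperators.OVERLAPS),
                ],
            ),
        ]
The constraint is created by a normal migration (makemigrations / migrate); after that, overlapping inserts fail with IntegrityError even if they bypass Model.clean().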
I have a channel model with 2 associations, "contents" and "subscriptions".
In the channel index the user has the possibility of ordering the channels by number of subscriptions or number of approved contents.
While in development everything seems to work properly (judging by the results; it could still be malfunctioning and I may simply not have enough data to notice), in staging the results are erratic: sometimes the ordering is correct, sometimes it isn't.
At first I wasn't using delta indexes and thought the problem could be there, so every time I approve a content I call:
Delayed::Job.enqueue(DelayedRake.new("ts:index"), queue: "sphinx")
Since subscriptions don't have their own index, I don't reindex every time I create one (should I?).
Then I started using delta indexes in the channel and I still get the same problems:
ThinkingSphinx::Index.define :channel, with: :active_record, delta: true do
  # fields
  indexes :name, sortable: true
  indexes description

  # attributes
  has created_at, sortable: true
  has approved, type: :boolean
  has public, type: :boolean

  join subscriptions
  has "COUNT(subscriptions.id)", as: :subscription_count, type: :integer, sortable: true

  join contents.approved
  has "COUNT(contents.id)", as: :content_count, type: :integer, sortable: true
end
And here is the search call in the controller:
def index
  if params[:order_by].present?
    @channels = Channel.search params[:search],
                               order: "#{params[:order_by]} DESC",
                               page: params[:page], per_page: 6
  else
    @channels = Channel.search params[:search],
                               order: :name,
                               page: params[:page], per_page: 6
  end
end
Summarising, my questions would be:
1. Are my channel indexes well formed?
2. Should subscriptions be indexed as well, or is it enough to join them in my channel index?
3. Should I run a reindex after I create a subscription / approve a content, or does the delta index on the channel handle that, since I have those two associations joined in the channel index?
Your index looks fine, but if you're using deltas (and I think that's the wisest approach here, to have the data up-to-date), then you want to fire deltas for the related channels when a subscription or content is created/edited/deleted. This is covered in the documentation (see the "Deltas and Associations" section), but you'd be looking at something like this in both Subscription and Content:
after_save :set_channel_delta_flag
after_destroy :set_channel_delta_flag

# ...

private

def set_channel_delta_flag
  channel.update_attributes :delta => true
end
Given you're using Delayed Job, I'd recommend investigating ts-delayed-delta to ensure delta updates are happening out of your normal HTTP request flow. And I highly recommend not running a full index after every change - that has the potential of getting quite slow quite quickly (and adding to the server load unnecessarily).
My problem is as follows:
I have a car dealer A, and a db table named sold_cars. When a car is sold I create an entry in this table.
The table has an integer column named order_no. It should be unique among the cars sold by a given dealer.
So if dealer A sold cars a, b and c, then this column should be 1, 2, 3. I have to use this column, and not the primary key, because I don't want any holes in my numbering - dealer A and B (which might be added later) should each have order numbers 1, 2, 3, and not A: 1, 3, 5 and B: 2, 4, 6. So... I select the greatest order_no for the given dealer, increment it by 1 and save.
The problem is that two people bought a car from dealer A in the same millisecond and both orders got the same order_no. Any advice? I was thinking of wrapping this process in a transaction block and locking the table until the transaction completes, but I can't find any info on how to do that.
I know this question is a bit older, but I just had the same issue and wanted to share my learnings.
I wasn't quite satisfied with st0nes' answer, since (at least for Postgres) a LOCK TABLE statement can only be issued within a transaction. And although in Django almost everything usually happens within a transaction, this LockingManager does not ensure that you actually are inside one, at least to my understanding. I also didn't want to completely replace the model's Manager just to be able to lock it in one spot, so I was looking for something that works much like with transaction.atomic(): but also locks a given model.
So I came up with this:
from django.conf import settings
from django.db import DEFAULT_DB_ALIAS
from django.db.transaction import Atomic, get_connection


class LockedAtomicTransaction(Atomic):
    """
    Does an atomic transaction, but also locks the entire table for any transactions, for the duration of this
    transaction. Although this is the only way to avoid concurrency issues in certain situations, it should be used with
    caution, since it has impacts on performance, for obvious reasons...
    """
    def __init__(self, model, using=None, savepoint=None):
        if using is None:
            using = DEFAULT_DB_ALIAS
        super().__init__(using, savepoint)
        self.model = model

    def __enter__(self):
        super(LockedAtomicTransaction, self).__enter__()

        # Make sure not to lock when sqlite is used, or you'll run into problems while running tests!!!
        if settings.DATABASES[self.using]['ENGINE'] != 'django.db.backends.sqlite3':
            cursor = None
            try:
                cursor = get_connection(self.using).cursor()
                cursor.execute(
                    'LOCK TABLE {db_table_name}'.format(db_table_name=self.model._meta.db_table)
                )
            finally:
                if cursor and not cursor.closed:
                    cursor.close()
So if I now want to lock the model ModelToLock, this can be used like this:
with LockedAtomicTransaction(ModelToLock):
    # do whatever you want to do
    ModelToLock.objects.create()
EDIT: Note that I have only tested this using postgres. But to my understanding, it should also work on mysql just like that.
from contextlib import contextmanager

from django.db import transaction
from django.db.transaction import get_connection


@contextmanager
def lock_table(model):
    with transaction.atomic():
        cursor = get_connection().cursor()
        cursor.execute(f'LOCK TABLE {model._meta.db_table}')
        try:
            yield
        finally:
            cursor.close()
This is very similar to @jdepoix's solution, but a bit more concise.
You can use it like this:
with lock_table(MyModel):
    MyModel.do_something()
Note that this only works with PostgreSQL and uses Python 3.6's f-strings, a.k.a. literal string interpolation.
I would recommend using the F() expression instead of locking the entire table. If your app is being heavily used, locking the table will have significant performance impact.
The exact scenario you described is mentioned in Django documentation here. Based on your scenario, here's the code you can use:
from django.db.models import F

# Populate sold_cars as you normally do..

# Before saving, use the "F" expression
sold_cars.order_num = F('order_num') + 1
sold_cars.save()

# You must do this before referring to order_num:
sold_cars.refresh_from_db()

# Now you have the database-assigned order number in sold_cars.order_num
Note that if you set order_num during an update operation, use the following instead:
sold_cars.update(order_num=F('order_num')+1)
sold_cars.refresh_from_db()
Since the database is in charge of updating the field, there won't be any race conditions or duplicated order_num values. Plus, this approach is much faster than one with locked tables.
I think this code snippet meets your need, assuming you are using MySQL. If not, you may need to tweak the syntax a little, but the idea should still work.
Source: Locking tables
import logging

from django.db import connection, models

logger = logging.getLogger(__name__)


class LockingManager(models.Manager):
    """ Add lock/unlock functionality to manager.

    Example::

        class Job(models.Model):

            objects = LockingManager()

            counter = models.IntegerField(null=True, default=0)

            @staticmethod
            def do_atomic_update(job_id):
                ''' Updates job integer, keeping it below 5 '''
                try:
                    # Ensure only one HTTP request can do this update at once.
                    Job.objects.lock()

                    job = Job.objects.get(id=job_id)
                    # If we don't lock the tables two simultaneous
                    # requests might both increase the counter
                    # going over 5
                    if job.counter < 5:
                        job.counter += 1
                        job.save()

                finally:
                    Job.objects.unlock()

    """

    def lock(self):
        """ Lock table.

        Locks the object model table so that atomic update is possible.
        Simultaneous database access requests pend until the lock is unlock()'ed.

        Note: If you need to lock multiple tables, you need to lock them
        all in one SQL clause and this function is not enough. To avoid
        deadlock, all tables must be locked in the same order.

        See http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html
        """
        cursor = connection.cursor()
        table = self.model._meta.db_table
        logger.debug("Locking table %s" % table)
        cursor.execute("LOCK TABLES %s WRITE" % table)
        row = cursor.fetchone()
        return row

    def unlock(self):
        """ Unlock the table. """
        cursor = connection.cursor()
        table = self.model._meta.db_table
        cursor.execute("UNLOCK TABLES")
        row = cursor.fetchone()
        return row
I had the same problem. The F() solution solves a different problem: it doesn't get the max(order_no) across all sold_cars rows for a specific dealer, but rather provides a way to update order_no based on the value already set in that field for a particular row.
Locking the entire table is overkill here; it's sufficient to lock only the specific dealer's rows.
Below is the solution I ended up with. The code assumes the sold_cars table references the dealers table via a sold_cars.dealer field. Imports, logging and error handling are omitted for clarity:
DEFAULT_ORDER_NO = 0

def save_sold_car(sold_car, dealer):
    # update sold_car instance as you please

    with transaction.atomic():
        # to successfully use locks the processes must query for row ranges that
        # intersect. If no common rows are present, no locks will be set.
        # We save the sold_car entry without an order_no to create at least one row
        # that can be locked. If order_no assignment fails later at some point,
        # the transaction will be rolled back and the 'incomplete' sold_car entry
        # will be removed
        sold_car.save()

        # each process adds its own sold_car entry. Concurrently getting sold_cars
        # by their dealer may result in row ranges which don't intersect.
        # For example process A saves sold_car 'a1' the same moment process B saves
        # its 'b1' sold_car. Then both these processes get sold_cars for the same
        # dealer. Process A gets single 'a1' row, while process B gets
        # single 'b1' row.
        # Since all the sold_cars here belong to the same dealer, adding the
        # related dealer's row to each range with 'select_related' will ensure
        # having at least one common row to acquire the lock on.
        dealer_sold_cars = (SoldCar.objects.select_related('dealer')
                                           .select_for_update()
                                           .filter(dealer=dealer))

        # django queries are lazy, make sure to explicitly evaluate them
        # to acquire the locks
        len(dealer_sold_cars)

        max_order_no = (dealer_sold_cars.aggregate(Max('order_no'))
                                        .get('order_no__max') or DEFAULT_ORDER_NO)
        sold_car.order_no = max_order_no + 1
        sold_car.save()
I have a model that has four fields. How do I remove duplicate objects from my database?
Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend it to a situation where there are four fields to compare per object.
Thanks,
W.
def remove_duplicated_records(model, fields):
    """
    Removes records from `model` duplicated on `fields`
    while leaving the most recent one (biggest `id`).
    """
    duplicates = model.objects.values(*fields)

    # override any model specific ordering (for `.annotate()`)
    duplicates = duplicates.order_by()

    # group by same values of `fields`; count how many rows are the same
    duplicates = duplicates.annotate(
        max_id=models.Max("id"), count_id=models.Count("id")
    )

    # leave out only the ones which are actually duplicated
    duplicates = duplicates.filter(count_id__gt=1)

    for duplicate in duplicates:
        to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})

        # leave out the latest duplicated record
        # you can use `Min` if you wish to leave out the first record
        to_delete = to_delete.exclude(id=duplicate["max_id"])

        to_delete.delete()
You shouldn't need to do this often. Use a unique_together constraint on the database instead (see the sketch below).
The function above keeps the record with the biggest id in the DB. If you want to keep the original record (the first one), modify the code a bit to use models.Min. You can also use a completely different field, such as the creation date.
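To make the unique_together suggestion concrete, here is a minimal sketch with a hypothetical four-field model (the model and field names are placeholders; on current Django versions a Meta UniqueConstraint achieves the same thing):
from django.db import models


class MyModel(models.Model):
    field_a = models.CharField(max_length=100)
    field_b = models.CharField(max_length=100)
    field_c = models.CharField(max_length=100)
    field_d = models.CharField(max_length=100)

    class Meta:
        # enforced as a UNIQUE index at the database level, so new
        # duplicates are rejected with an IntegrityError on insert
        unique_together = [["field_a", "field_b", "field_c", "field_d"]]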
Underlying SQL
When annotating, the Django ORM uses a GROUP BY statement over all the model fields used in the query; hence the use of the .values() method. GROUP BY groups all records having identical values. The duplicated ones (more than one id for the same combination of `fields`) are then filtered out by the HAVING clause that .filter() generates on the annotated QuerySet.
SELECT
    field_1,
    …
    field_n,
    MAX(id) as max_id,
    COUNT(id) as count_id
FROM
    app_mymodel
GROUP BY
    field_1,
    …
    field_n
HAVING
    count_id > 1
The duplicated records are then deleted in the for loop, except for the most recent one (the biggest id) in each group.
Empty .order_by()
Just to be sure, it's always wise to add an empty .order_by() call before aggregating a QuerySet.
The fields used for ordering the QuerySet are also included in the GROUP BY statement. An empty .order_by() overrides the columns declared in the model's Meta, so they are not included in the SQL query (e.g. a default sort by date can ruin the results).
You might not need to override it right now, but someone might add a default ordering later and ruin your precious delete-duplicates code without even knowing it. Yes, I'm sure you have 100% test coverage…
Just add an empty .order_by() to be safe. ;-)
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
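To illustrate, assume a hypothetical Article model (made up for this example) where someone later adds a default ordering:
from django.db import models


class Article(models.Model):
    title = models.CharField(max_length=200)
    created = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ["created"]  # default ordering added later by a colleague


# Because of Meta.ordering, this groups by (title, created), so rows with the
# same title but different creation dates are never counted as duplicates:
Article.objects.values("title").annotate(count_id=models.Count("id"))

# Clearing the ordering groups by title only, as intended:
Article.objects.values("title").order_by().annotate(count_id=models.Count("id"))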
Transaction
Of course you should consider doing it all in a single transaction.
https://docs.djangoproject.com/en/3.2/topics/db/transactions/#django.db.transaction.atomic
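For example, assuming the remove_duplicated_records function above and a hypothetical MyModel with four fields, the whole clean-up can be wrapped like this:
from django.db import transaction

# MyModel and the field names are placeholders. If anything fails halfway
# through, every delete is rolled back instead of leaving the table
# partially cleaned.
with transaction.atomic():
    remove_duplicated_records(MyModel, ["field_a", "field_b", "field_c", "field_d"])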
If you want to delete duplicates on single or multiple columns, you don't need to iterate over millions of records.
Fetch all unique columns (don't forget to include the primary key column)
fetch = Model.objects.all().values("id", "skuid", "review", "date_time")
Read the result using pandas (I used pandas here instead of an ORM query)
import pandas as pd
df = pd.DataFrame.from_dict(fetch)
Drop duplicates on unique columns
uniq_df = df.drop_duplicates(subset=["skuid", "review", "date_time"])
# don't include the primary key column in `subset`
Now you have the unique records, from which you can pick the primary keys
primary_keys = uniq_df["id"].tolist()
Finally, it's showtime: exclude those ids from the records and delete the rest of the data
records = Model.objects.all().exclude(pk__in=primary_keys).delete()