I have an application that has a model called CalculationResults. I also have two databases with the same schema - let's call them "source" and "target". On the source database, I have a specific instance of CalculationResults that I want to migrate to the "target" database - preferably, I'd also like to change the "owner" field of this instance in the process. What is the easiest way to achieve this? The data I'm talking about is not huge, so it's more a question of manpower vs. computational power.
I have never done this, but I believe the following will work:
results = CalculationResults.objects.using('source').filter(field=search_field)
for result in results:
    result.owner = 'new owner'
    result.save(using='target')
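One thing to watch out for: each instance keeps the primary key it had on "source", so the save above may overwrite an existing row on "target" rather than add one. If you would rather always insert fresh rows, a minimal variation of the same loop (assuming CalculationResults has an auto-generated primary key) would be:

# Sketch: copy matching rows from "source" into "target" as new rows,
# assuming CalculationResults uses an auto-generated primary key.
results = CalculationResults.objects.using('source').filter(field=search_field)
for result in results:
    result.pk = None                    # drop the source primary key so a new row is created
    result.owner = 'new owner'
    result.save(using='target', force_insert=True)  # force an INSERT on the target database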
I am fairly new to Great Expectations and have a question. Essentially I have a PostgreSQL database, and every time I run my data pipeline, I want to validate a specific subset of the PostgreSQL table based on some key. E.g. if the data pipeline is run every day, there would be a field called current_batch, and the validation would occur for the query below:
SELECT * FROM jobs WHERE current_batch = <input_batch>.
I am unsure of the best way to accomplish this. I am using the v3 API of Great Expectations and am a bit confused as to whether to use a checkpoint or a validator. I assume I want a checkpoint, but I can't figure out how to create one and then validate only a specific subset of the PostgreSQL datasource.
Any help or guidance would be much appreciated.
Thanks,
I completely understand your confusion because I am working with GE too and the documentation is not really clear.
First of all "Validators" are now called "Checkpoints", so they are not a different entity, as you can read here.
I am working on an Oracle database and the only way I found to apply a query before testing my data with expectations is to put the query inside the checkpoint.
To create a checkpoint you should run the great_expectations checkpoint new command from your terminal. After creating it, add the "query" field to the checkpoint's .yml file.
Below you can see a snippet of a checkpoint I am working with. When I want to validate my data, I run the command great_expectations checkpoint run check1
name: check1
module_name: great_expectations.checkpoint
class_name: LegacyCheckpoint
batches:
  - batch_kwargs:
      table: pso
      schema: test
      query: SELECT p AS c,
        [ ... ]
        AND lsr = c)
      datasource: my_database
      data_asset_name: test.pso
    expectation_suite_names:
      - exp_suite1
Hope this helps! Feel free to ask if you have any doubts :)
I managed this using views (in Postgres). Before running GE, I create (or replace) the view with a query that does all the necessary joins, filtering, aggregations, etc., and then specify the name of this view in the GE checkpoint.
Yes, it is not the ideal solution. I would rather use a query in checkpoints too. But as a workaround, it covers all my cases.
Say we have a view like this:
CREATE OR REPLACE VIEW table_to_check_1_today AS
SELECT * FROM initial_table
WHERE dt = current_date;
And the checkpoint configured something like this:
name: my_task.my_check
config_version: 1.0
validations:
  - expectation_suite_name: my_task.my_suite
    batch_request:
      datasource_name: my_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: table_to_check_1_today
Yes, a view can be created using current_date - and the checkpoint can simply run against the view. However, that means the variable (current_date) lives in the database, which may not be desirable; you might want to run the checkpoint's query for a different date, passed in from an environment variable or elsewhere to the CLI or a Python script/notebook.
I have yet to find a solution where we can substitute a string into the checkpoint query; using a config variable from the file is very static - there may be different checkpoints running for different dates.
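For what it's worth, the v3 API also lets you hand the checkpoint a query at run time via a RuntimeBatchRequest, which avoids baking the date into a view or a static config variable. A rough sketch, assuming your datasource is configured with a runtime data connector named default_runtime_data_connector_name and reusing the datasource/suite names from the checkpoint above (the asset name and identifier key are just illustrative):

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()

input_batch = "2021-06-01"  # could come from an environment variable or a CLI argument

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource",
    data_connector_name="default_runtime_data_connector_name",  # assumes a runtime connector exists
    data_asset_name="jobs_current_batch",                        # arbitrary label for this batch
    runtime_parameters={
        "query": f"SELECT * FROM jobs WHERE current_batch = '{input_batch}'"
    },
    batch_identifiers={"default_identifier_name": input_batch},
)

# Run an existing checkpoint, but swap in the runtime batch request.
result = context.run_checkpoint(
    checkpoint_name="my_task.my_check",
    validations=[{
        "batch_request": batch_request,
        "expectation_suite_name": "my_task.my_suite",
    }],
)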
I'm trialling a setup in Django where I specify the database connection to use at runtime. I suspect I may need to go fairly low level, but I want to work within Django's idioms where possible, perhaps stretching them as far as they will go.
The general premise is that I have a centralised database that stores meta information about Datasets - but the actual datasets are created as dynamic models at runtime, in the database in question. I need to be able to specify which database to connect to at runtime in order to extract the data back out...
I have kind of the following idea:
from django.db import connections

db = {}
db['ENGINE'] = 'django.db.backends.postgresql'
db['OPTIONS'] = {'autocommit': True}
db['NAME'] = my_model_db['database']
db['PASSWORD'] = my_model_db['password']
db['USER'] = my_model_db['user']
db['HOST'] = my_model_db['host']

logger.info("Connecting to database {db} on {host}".format(db=db['NAME'], host=db['HOST']))

connections.databases['my_model_dynamic_db'] = db

DynamicObj.objects.using('my_model_dynamic_db').all()
Has anyone achieved this? And how?
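Not a full answer, but adding an alias to connections.databases at runtime, as in the snippet above, is a known approach; connections.databases is not a documented API, though, so treat the following as a sketch and verify it against your Django version. Closing and removing the alias afterwards keeps connections from piling up (DynamicObj here is the model from the question):

from django.db import connections

def query_dynamic_db(alias, db_settings):
    # db_settings is a dict like the one built in the snippet above.
    connections.databases[alias] = db_settings            # register the alias at runtime
    try:
        return list(DynamicObj.objects.using(alias).all())
    finally:
        connections[alias].close()                        # close the low-level connection
        del connections.databases[alias]                  # and drop the alias again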
I'm building a product with Zend 2 and Doctrine 2 and it requires that I have a separate table for each user to contain data unique to them. I've made an entity that defines what that table looks like but how do I change the name of the table to persist the data to, or in fact retrieve the data from, at run time?
Alternatively am I going to be better off giving each user their own database, and just changing which DB I am connecting to?
I'd question the design choice first. What happens if you create a new user at runtime - does the table have to be created first? Furthermore, what kind of data are you storing? To me this sounds like pretty common multi-client functionality. Like:
tbl_clients
- id
- name
tbl_clientdata
- client_id
- data_1_value
- data_2_value
- data_n_value
If you really want to silo users' data, you'd have to go the separate-databases route. But that only works if each "user" is really independent of the others. Think very hard about that.
If you're building some kind of software-as-a-service, and user A and user B are just two different customers of yours with no relationship to each other, then an N+1 database setup might be appropriate: one DB for each of your N users, plus one "meta" database which just holds user accounts (and maybe billing-related stuff).
I've implemented something like this in ZF2/Doctrine2, and it's not terribly bad. You just create a factory for EntityManager that looks up the database information for whatever user is active, and configures the EM to connect to it. The only place it gets a bit tricky is when you find yourself writing some kind of shared job queue, where long-running workers need to switch database connections with some regularity -- but that's doable too.
I've been looking for a way to define database tables and alter them via a Django API.
For example, I'd like to be able to write some code that directly manipulates table DDL and lets me define tables or add columns to a table on demand, programmatically (without running a syncdb). I realize that django-south and django-evolution may come to mind, but I don't really think of these as tools meant to be integrated into an application and used by an end user... rather, they are utilities for upgrading your database tables. I'm looking for something where I can do something like:
class MyModel(models.Model): # wouldn't run syncdb.. instead do something like below
    a = models.CharField()
    b = models.CharField()

model = MyModel()
model.create()                          # this runs the create table (instead of a syncdb)
model.add_column(c=models.CharField())  # this would set a column to be added
model.alter()                           # and this would apply the alter statement
model.del_column('a')                   # this would set column 'a' for removal
model.alter()                           # and this would apply the removal
This is just a toy example of how such an API would work, but the point is that I'd be very interested in finding out if there is a way to programmatically create and change tables like this. This might be useful for things such as content management systems, where one might want to dynamically create a new table. Another example would be a site that stores datasets of arbitrary width, for which tables need to be generated dynamically by the interface or by data imports. Does anyone know any good ways to dynamically create and alter tables like this?
(Granted, I know one can do direct SQL statements against the database, but that solution lacks the ability to treat the databases as objects)
Just curious as to if people have any suggestions or approaches to this...
You can try to interface with Django's code that manages changes in the database. It is a bit limited (no ALTER, for example, as far as I can see), but you may be able to extend it. Here's a snippet from django.core.management.commands.syncdb:
for app in models.get_apps():
    app_name = app.__name__.split('.')[-2]
    model_list = models.get_models(app)
    for model in model_list:
        # Create the model's database table, if it doesn't already exist.
        if verbosity >= 2:
            print "Processing %s.%s model" % (app_name, model._meta.object_name)
        if connection.introspection.table_name_converter(model._meta.db_table) in tables:
            continue
        sql, references = connection.creation.sql_create_model(model, self.style, seen_models)
        seen_models.add(model)
        created_models.add(model)
        for refto, refs in references.items():
            pending_references.setdefault(refto, []).extend(refs)
            if refto in seen_models:
                sql.extend(connection.creation.sql_for_pending_references(refto, self.style, pending_references))
        sql.extend(connection.creation.sql_for_pending_references(model, self.style, pending_references))
        if verbosity >= 1 and sql:
            print "Creating table %s" % model._meta.db_table
        for statement in sql:
            cursor.execute(statement)
        tables.append(connection.introspection.table_name_converter(model._meta.db_table))
Take a look at connection.creation.sql_create_model. The creation object is created in the database backend relevant to the database you are using in your settings.py. All of them are under django.db.backends.
If you must have ALTER TABLE, I think you can create your own custom backend that extends an existing one and adds this functionality. Then you can interface with it directly through an ExtendedModelManager you create.
Quickly off the top of my head..
Create a Custom Manager with the Create/Alter methods.
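On Django 1.7+ the building blocks for this are exposed as connection.schema_editor() (the same machinery the migration framework uses), so such a manager can be a thin wrapper around it. A rough sketch - the method names are made up for illustration:

from django.db import connection, models

class SchemaManager(models.Manager):
    """Hypothetical manager exposing create/alter operations for its model."""

    def create_table(self):
        # Issue the CREATE TABLE for this manager's model.
        with connection.schema_editor() as editor:
            editor.create_model(self.model)

    def add_column(self, field_name, field):
        # Add a new column; the field should be nullable or have a default.
        field.set_attributes_from_name(field_name)
        with connection.schema_editor() as editor:
            editor.add_field(self.model, field)

    def drop_column(self, field_name):
        # Remove an existing column by name.
        field = self.model._meta.get_field(field_name)
        with connection.schema_editor() as editor:
            editor.remove_field(self.model, field)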
I have a django model in use on a production application and I need to change the name and data type of the field with zero downtime to the site. So here is what I was planning:
1) Create the new field in the database that will replace the original field
2) Every time an instance of the Model is loaded, convert the data from the original field and store it in the new field, then save the object (only save the object if the new field is empty)
3) Over time the original field can be removed once every object has a non-blank new field
What method can I hook into for the 2nd step?
Won't you have to change your business logic (and perhaps templates) first to accommodate the new field name?
Unless stuff gets assigned to the field in question at dozens of places in your code, you could (after creation of the field in the database)
1) adapt the code to recognize the old (read) and the new field(name)s (write).
2) change the data in the database from the old to the new field via locking / an .update() call, etc. (see the sketch after this list)
3) remove the old field(name) from the model/views/templates completely
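For step 2, the conversion can often be a single UPDATE so readers never see a half-migrated table. A minimal sketch on current Django, assuming the new column is an integer version of an old character column (model and field names are made up):

from django.db import transaction
from django.db.models import IntegerField
from django.db.models.functions import Cast

from myapp.models import MyModel  # hypothetical model with old_field (char) and new_field (int)

with transaction.atomic():
    # Fill the new column from the old one in one statement.
    MyModel.objects.filter(new_field__isnull=True).update(
        new_field=Cast('old_field', IntegerField())
    )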
Without downtime, I don't see how users of your site will avoid getting "old" values for a few seconds (depending on how many rows are in the table, how costly the recalculation to the new datatype is, etc.).
Sounds complex, and affects a lot of production code.
Are you trying to avoid doing this in bulk because of downtime? What volume of data are you working with?
Have you looked at any of the Django migration tools that are out there? South is a very popular one:
http://south.aeracode.org/
As you seemingly can't afford ANY downtime whatsoever (I wouldn't want your job!!!), you probably don't want to risk overriding the model's constructor method. What you could try instead is catching the post_init signal...
https://docs.djangoproject.com/en/1.0/ref/signals/#django.db.models.signals.post_init
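To make that concrete, a post_init receiver for step 2 could look roughly like the sketch below; MyModel, old_field, new_field and convert() are placeholders for your real names, not anything from the question:

from django.db.models.signals import post_init
from django.dispatch import receiver

from myapp.models import MyModel  # hypothetical model being migrated


def convert(value):
    # Hypothetical conversion from the old data type to the new one.
    return int(value)


@receiver(post_init, sender=MyModel)
def backfill_new_field(sender, instance, **kwargs):
    # Step 2: when an existing row is loaded and the new field is still empty,
    # fill it from the original field and persist just that one column.
    if instance.pk and instance.new_field is None and instance.old_field is not None:
        instance.new_field = convert(instance.old_field)
        sender._default_manager.filter(pk=instance.pk).update(new_field=instance.new_field)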