dbt - best practices: to which schema should dbt-deployed models belong?

I am working on a database-intensive project with many schemas containing lots of tables, views, etc.
My question is about the best (or recommended) practice for which schema the dbt-deployed models should belong to.
One option is to have them belong to different schemas based on their definitions.
The other option is to have them all belong to a single schema, separate from the schemas that already exist in the database, so we can look them up by that schema name.

You're asking a great question, albeit one without a single right answer.
From the dbt docs' Using Aliases page:
The names of schemas and tables are effectively the "user interface" of your data warehouse. Well-named schemas and tables can help provide clarity and direction for consumers of this data. In combination with custom schemas, model aliasing is a powerful mechanism for designing your warehouse.
Perhaps the answer is:
Do what you think is best for your users. You are making a library: what's the easiest way for folks to find what they are looking for?
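For concreteness, here is a minimal sketch of the first option using dbt's custom schemas; the project and folder names below are hypothetical, not from the question:

    # dbt_project.yml -- route models to schemas by folder/layer (names made up)
    models:
      my_project:
        staging:
          +schema: staging   # models in models/staging/ deploy to <target_schema>_staging
        marts:
          +schema: marts     # models in models/marts/ deploy to <target_schema>_marts

By default dbt concatenates the target schema with the custom one (e.g. analytics_staging); overriding the generate_schema_name macro changes that. Omitting +schema everywhere gives you the second option: every model lands in the single target schema.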

Related

Can Elasticsearch be used as a database in a Django Project?

Can we use Elasticsearch directly as the primary database in Django? I have tried to find a solution but could not find any relevant information. Everywhere it is said that we can use Elasticsearch as a search engine on top of another primary database. But as I understand it, Elasticsearch is a NoSQL database, so there should be a way to use it as the primary database in a Django project.
Please help, if someone has any idea about it.
The short answer is no.
SO already has an answer here and this is still valid: Using ElasticSearch as Database/Storage with Django
ES is not ACID compliant.
Indexing is not immediate (it's near-real-time), so any kind of write-heavy load would be an issue.
It's very weakly consistent.
Use it together with a proper database and it will help with real time searches, analytics, expensive queries etc. but treat it as derived data.
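To illustrate that "derived data" pattern in a Django project: keep the relational database as the source of truth and mirror documents into ES after each save. The Article model, its fields, and the index name below are all hypothetical:

    # Hypothetical sketch: mirror saved rows into Elasticsearch as derived data.
    from django.db.models.signals import post_save
    from django.dispatch import receiver
    from elasticsearch import Elasticsearch

    from myapp.models import Article  # hypothetical model

    es = Elasticsearch("http://localhost:9200")

    @receiver(post_save, sender=Article)
    def index_article(sender, instance, **kwargs):
        # If indexing fails, the row is still safe in the primary database;
        # a periodic re-sync job can repair the search index later.
        es.index(
            index="articles",
            id=instance.pk,
            document={"title": instance.title, "body": instance.body},
        )

If the index is lost or drifts, it can always be rebuilt from the primary database, which is exactly what makes it derived data.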

Is dynamodb suitable for growing or pivotable product?

Amazon said
NoSQL design requires a different mindset than RDBMS design. For an RDBMS, you can create a normalized data model without thinking about access patterns. You can then extend it later when new questions and query requirements arise. For DynamoDB, by contrast, you shouldn't start designing your schema until you know the questions it needs to answer. Understanding the business problems and the application use cases up front is absolutely essential.
It seems that I should design the tables only after designing the product, for efficient query cost.
But a product can pivot or have new features added. In the early stages, nobody knows where the product will go.
Is dynamodb suitable for growing or pivotable product?
In my opinion, the main benefit of DynamoDB over other NoSQL solutions is that it is a managed database service. You pay for reads and writes, and you never worry about scaling to handle more data or more users. If you are building a prototype or don't have the technical know-how to set up a database server and host it in the cloud, it can be useful and cost-effective. It has its limitations, however, so if you do have technical resources, consider another open-source NoSQL option.
I think that statement by Amazon is confusing and is probably more marketing than anything else. Use NoSQL in cases where your data is only accessed as distinct elements that do not have to be combined in a complex manner. It's also helpful if you don't have an exact schema defined, because NoSQL doesn't require a hard-set schema: you can store any fields in a table, and you can always add new fields. This is helpful when things are changing rapidly and you don't want to migrate everything as strictly as an RDBMS would require.
If, however, you're going to run complex logic or calculations combining data from across tables, you should use an RDBMS. You could use NoSQL for some data and an RDBMS for other data in a hybrid fashion, but in that case you probably wouldn't want to use DynamoDB, because you'd want full ownership to set it up properly.
Hope this helps. I'm sure others have more to say, and I welcome comments to help me refine my answer.
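To make "design around access patterns" concrete, here is a hypothetical boto3 sketch; the table, key schema, and attribute names are made up:

    # DynamoDB access is key-driven: a question like "get all orders for a
    # customer" must be baked into the table's key schema up front.
    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    orders = dynamodb.Table("Orders")  # partition key: customer_id, sort key: order_date

    # Efficient: this access pattern matches the key design.
    response = orders.query(KeyConditionExpression=Key("customer_id").eq("c-123"))
    for item in response["Items"]:
        print(item)

    # A question the keys don't anticipate (e.g. all orders over $100) needs
    # a full table Scan or a new secondary index -- which is why pivots hurt.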

How can I use Django to present an existing database without ruining it when running syncdb?

I have a database in sqlite and I want to use Django to present it and make queries on it.
I know how to create a new database by creating new classes in models.py, but what is the best way to use Django to access an existing database?
This seems to be a question in two parts: firstly, how can one write django model classes to represent an existing database, and secondly how that interacts with syncdb.
The answer to the first of these is that Django models are not expressive enough to describe every possible SQL database schema; instead they use a subset that works well with the ORM's usage patterns. Therefore you may need to accept some adjustments to your schema in order to describe it with Django models. In particular:
Django does not support composite primary keys. That is, you can't have a primary key that spans multiple columns.
Django expects tables to be named appname_modelname, because this convention allows the tables from many apps to easily co-exist in the same database.
If your schema happens to match the subset that Django models support, or you are willing to adapt it so that it does, then your task is simply to write models that match the schema. The inspectdb tool may provide a useful starting point; a sketch of the result follows.
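For example, a model mapped onto a pre-existing table typically pins the table name and tells Django not to manage the table's lifecycle. This is a minimal hypothetical sketch (the legacy_customer table and its fields are made up); manage.py inspectdb generates output much like it:

    # Hypothetical model mapped onto an existing table.
    from django.db import models

    class Customer(models.Model):
        name = models.CharField(max_length=100)
        email = models.CharField(max_length=254)

        class Meta:
            db_table = "legacy_customer"  # point at the existing table name
            managed = False               # Django will never create or alter it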
You can test whether you've described your database correctly by temporarily reconfiguring your project to use a different, empty database, running manage.py syncdb, and then comparing the schema Django created with the one that already existed. If they are the same (or at least close enough), you got it right.
If your existing database is not a good match for the Django ORM's assumptions, a more flexible alternative is SQLAlchemy. It doesn't natively integrate into Django's application system, but it does provide a more complete database interface that can work with almost any database: some databases will map easily, others will require more manual mapping work, but almost all cases should be possible with some creativity.
As for the interaction with syncdb: the default behavior for this command is to skip over any models that already seem to have tables in the database. Therefore if you've defined models that do indeed match with your existing database tables it should leave them alone. It will, however, create the additional tables required for other apps in your project, including Django's own tables.
Modern Django has support for multiple databases, which could provide a further approach: configure your existing database as a second database source in your project, and use a database router to ensure that the appropriate models are loaded from that second database and that Django won't attempt to run syncdb against it (a sketch of such a router follows). This provides true separation at the expense of some additional complexity, but it still requires that your schema be compatible with the ORM's assumptions. It also has some limitations, largely pertaining to relationships between objects persisted in different databases.
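A minimal sketch of such a router, with a hypothetical app label legacy and database alias legacy_db:

    # Hypothetical router: models in the `legacy` app read and write against
    # the second database, and Django never creates tables there.
    class LegacyRouter(object):
        def db_for_read(self, model, **hints):
            return "legacy_db" if model._meta.app_label == "legacy" else None

        def db_for_write(self, model, **hints):
            return "legacy_db" if model._meta.app_label == "legacy" else None

        def allow_migrate(self, db, app_label, model_name=None, **hints):
            # In the syncdb era this hook was called allow_syncdb.
            if db == "legacy_db" or app_label == "legacy":
                return False
            return None

Register it with DATABASE_ROUTERS = ["myproject.routers.LegacyRouter"] in settings (that module path is, again, hypothetical).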
If you'd like to be able to make versioned changes to the database Django uses, starting with the schema you've inherited from the existing database, then South provides a more flexible and complete alternative to the built-in syncdb mechanism, with support for running arbitrary SQL data definition language statements to change your database schema.
It sounds like you need something like South, which will allow you to version and revert changes to your models.
You just need ./manage.py inspectdb.

Data Warehouse and Django

This is more of an architectural question than a technological one per se.
I am currently building a business website/social network that needs to store large volumes of data and use that data to draw analytics (consumer behavior).
I am using Django and a PostgreSQL database.
Now my question is: I want to expand this architecture to include a data warehouse. The ideal would be: the operational DB would be the current Django PostgreSQL database, and the data warehouse would be something additional, preferably in a multidimensional model.
We are still in a very early phase, we are going to test with 50 users, so something primitive such as a one-column table for starters would be enough.
I would like to know if somebody has experience with this situation and could recommend a framework for creating a data warehouse, all while maintaining the operational DB with the Django models for ease of use (if possible).
Thank you in advance!
Here are some cool Open Source tools I used recently:
Kettle - a great ETL tool; you can use it to extract data from your operational database into your warehouse. It supports any database with a JDBC driver and makes it very easy to build e.g. a star schema (a bare-bones Python sketch of the same idea follows this list).
Saiku - nice Web 2.0 frontend built on Pentaho Mondrian (MDX implementation). This allows your users to easily build complex aggregation queries (think Pivot table in Excel), and the Mondrian layer provides caching etc. to make things go fast. Try the demo here.
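Since you mentioned wanting to start primitive, here is a bare-bones hypothetical version of that extract-and-load step in plain Python; the database names, tables, and columns are all made up:

    # Hypothetical nightly ETL: copy yesterday's events from the operational
    # (Django) database into a fact table in the warehouse.
    import psycopg2

    with psycopg2.connect(dbname="operational") as src, \
            psycopg2.connect(dbname="warehouse") as dwh:
        with src.cursor() as read, dwh.cursor() as write:
            read.execute(
                "SELECT user_id, action, created_at FROM app_event "
                "WHERE created_at >= now() - interval '1 day'"
            )
            for user_id, action, created_at in read:
                write.execute(
                    "INSERT INTO fact_user_action (user_id, action, occurred_at) "
                    "VALUES (%s, %s, %s)",
                    (user_id, action, created_at),
                )

A real warehouse load would deduplicate and batch the inserts, but the shape is the same as what Kettle does graphically.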
My answer does not necessarily apply to data warehousing. In your case I see the possibility of implementing a NoSQL database solution alongside the OLTP relational storage, which in this case is PostgreSQL.
Why consider NoSQL? In addition to the obvious scalability benefits, NoSQL offers a number of advantages that will probably apply to your scenario: for instance, the flexibility of records with different sets of fields, and key-based access.
Since you're still in the "trial" stage, you might find it easier to decide on a NoSQL database solution depending on your hosting provider. For instance, AWS has SimpleDB, Google App Engine provides its own Datastore, etc. However, there are plenty of other NoSQL solutions you can go for that have nice Python bindings.

Philosophical Question Regarding OneToOne Models in Django

I'm working on a project that is scheduled to be deployed in phases. In each phase, a specific table will progressively receive additional fields. Since I already know which fields will be added in the future, I could add them now and leave them empty until the later phases are reached, but I was considering a different strategy: implement the first phase's table, and in subsequent phases create new tables whose fields relate to the first table via OneToOne relations.
Am I doing this right? Does this sound like a good strategy?
ps.: I'm not a native English speaker. I apologize for any mistake. :)
You might want to look at Django South. It can help migrate schemas. So if you want to add fields to existing tables, South will help you.
There's plenty of documentation at their website, and also from other sources.
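For reference, the extension-table strategy from the question looks like this in Django (model and field names are hypothetical):

    # Hypothetical sketch of the phased OneToOne extension approach.
    from django.db import models

    class Customer(models.Model):  # phase 1
        name = models.CharField(max_length=100)

    class CustomerPhase2(models.Model):  # phase 2 fields live in their own table
        customer = models.OneToOneField(Customer, primary_key=True,
                                        on_delete=models.CASCADE)
        loyalty_points = models.IntegerField(default=0)

The trade-off: every read of phase-2 fields costs an extra join, whereas adding nullable columns later with a migration tool like South keeps the queries simple.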