I've started a model with the default, automatically handled IDs Django provides.
Now I've started interfacing with an external system which has its own IDs for the same objects, and it would be very convenient to align my own IDs with the external system's.
However, the numerical ranges overlap, so a naive solution wouldn't work.
Is there some elegant way to alter the IDs in a safe manner? (the object have multiple foreign keys/m2ms etc)
Thanks
I don't know if this is the best way but I would set all foreign/m2m keys to cascade updates and add a defined number to all IDs in the django db to remove the overlapping of the ranges (10k or 100K or more depeding on your data).
with that done, you can copy IDs from the other system. For safety, I would duplicate the ID columns while doing that in order to not lose the original ones... until sure all works ok
Related
I'm learning SQLAlchemy and using a database where multiple table lookups are needed to find a single piece of data.
I'm trying to find the best (most efficient and Pythonic) way to map the multiple lookups to a single SQLAlchemy object or reusable python method.
Ultimately, there will be dozens if not hundreds of mapped object such as these, so something like a .map file might be handy.
I.e. (Using pseudocode)
If I want to find the data 'Status' from 'Patient Name' have to use three tables.
Instead of writing a function for every potential 'this' to 'that' data request, is there an SQLAlchemy or Pythonic way to make the mappings?
I CAN make new, temporary SQLAlchemy Tables to store data. I am NOT at liberty to change the database I'm reading from. I'm hoping to reduce the number of individual calls to the database, because it is remote and slow.
I'm not sure a data join will work, because the Primary Keys, Foreign Keys and Column names are inconsistent in the database. But, I don't really know how to make select-joins in SQLAlchemy.
Perhaps I need to create a new table, with relationships to those three previous tables? But I'm not understanding the relationships well.
Can these tables be auto-generated from a map.ini file?
EDIT:
I might add, that some of these relationships could be one to many. I.e. a patient may be associated with more than one statusID...such as...
I have a view which returns data associated with a randomly-chosen row from one of my models. I'm aware of order_by('?') and its performance problems, and I want to avoid using order_by('?') in my view.
Because the data in my model changes very rarely (if at all), I'm considering the approach of caching the entire model in memory between requests. I know how many records I'm dealing with, and I'm comfortable taking the memory hit. If the model does change somehow, I could regenerate the cache at that moment.
Is my strategy reasonable? If so, how do I implement it? If not, how can I quickly select a random row from a model that changes very rarely if at all?
If you know the ids of your object, and its range you can randomize over the ids, and then query the database
A better approach might be to keep the number of objects in your cache, and simply retrieve a random one when you need it:
item_number = random.randint(MODEL_COUNT)
MyModel.objects.all()[item_number]
How do people deal with index data (the data usually shown on index pages, like a customer list) -vs- the model detail data?
When somebody goes to the customer/index route -- they only need access to a small subset of the full customer resource. Since I am dealing with legacy data, my customer model has > 10 relationships. It seems wasteful to have the api return a complete and full customer representation for every customer just to render a list/select/index view.
I know those relationships are somewhat lazy-loaded, but it still takes effort on the backend to pull all those relationships in. For some relationships (such as customer->invoices) this could be a large list of ids.
I feel answers to this can be very opinionated. But my two cents:
The API you are drawing on for your data should have an end-point to fetch the subset of data you're interested in, e.g. /api/mini-customer vs /api/customer.
You can then either define two separate models (one to represent the model in the list and one to represent the detailed view), or simply populate the original model with the subset of data and merge the extra data in at a later point.
That said, I've also seen plenty of cases such as the one you describe, where you load all data initially and just display the subset to begin with. If it's reasonable that the data will eventually be used and your page-load constraints can handle it, then this can be an acceptable approach.
I have somewhat of an interesting problem, and I'm looking for data store solutions for efficient querying.
I have a large (1M+) number of business objects, and each object has a large number of attributes (on the order of 100). The attributes are relatively unstructured -- the system has thousands of possible attributes, their number grows over time, and each object has an arbitrary (e.g. sparse) subset of them.
I frequently have to perform the following operation: find all objects with some concrete set of attributes S and perform an aggregation on them. I never know S ahead of time, and so on every request I have to perform an expensive sweep of the database which doesn't scale.
What are some data store solutions for this kind of problem? One possible solution would be to have a data store that parallelizes the aggregations -- maybe Cassandra with Hive/Pig on top?
Thoughts?
At this point, Cassandra + Spark is a likely candidate.
In a pure Cassandra world, you could (in theory) create a manual mapping of all possible S attributes to data objects, and then load those in via app and process (where the name of the S attribute is the partition key, the value of the S attribute is the clustering key, and the data object ID itself is another clustering key, that way you can quickly iterate over all objects with S attribute set).
It's not incredibly sexy, but could be made to work.
For a model in my database I need to store around 300 values for a specific field. What would be the drawbacks, in terms of performance and simplicity in query, if I use Postgres-specific ArrayField instead of a separate table with One-to-Many relationship?
If you use an array field
The size of each row in your DB is going to be a bit large thus Postgres is going to be using a lot more toast tables (http://www.postgresql.org/docs/9.5/static/storage-toast.html)
Every time you get the row, unless you specifically use defer (https://docs.djangoproject.com/en/1.9/ref/models/querysets/#defer) the field or otherwise exclude it from the query via only, or values or something, you paying the cost of loading all those values every time you iterate across that row. If that's what you need then so be it.
Filtering based on values in that array, while possible isn't going to be as nice and the Django ORM doesn't make it as obvious as it does for M2M tables.
If you use M2M
You can filter more easily on those related values
Those fields are postponed by default, you can use prefetch_related if you need them and then get fancy if you want only a subset of those values loaded
Total storage in the DB is going to be slightly higher with M2M because of keys, and extra id fields
The cost of the joins in this case is completely negligible because of keys.
Personally I'd say go with the M2M tables, but I don't know your specific application. If you're going to be working with a massive amount of data it's likely worth grabbing a representative dataset and testing both methods with it.