Rails - Keep a table out of structure.sql during migrations - ruby-on-rails-4

It is straightforward to ignore tables when your schema format is :ruby, but is there a way to do it when your schema format is :sql?
Ideally something like this in environment.rb:
ActiveRecord::SQLDumper.ignore_tables = ['table_name']
After a quick perusal through the AR source code it looks unpromising.

There is currently no way to do this. When the schema format is set to :sql, Rails doesn't go through the regular SchemaDumper but instead uses the tasks in ActiveRecord::Tasks::PostgreSQLDatabaseTasks to do the dump; check it out here.
The code is quite straightforward. I came up with a simple patch for ActiveRecord that should work as expected. It relies on setting the tables to ignore in your database.yml file. It basically adds the following code:
# ignore_tables is read from database.yml as a comma-separated string
# and turned into pg_dump -T (exclude table) flags.
ignore_tables = configuration['ignore_tables']

unless ignore_tables.blank?
  args += ignore_tables.split(',').map do |table|
    "-T #{table}"
  end
end
I just submitted a pull request to Rails with those changes, in case you'd like to test it.

For the sake of people landing here after Googling this problem in 2022: as of this PR the structure.sql dump should respect the ignore_tables configuration of ActiveRecord::SchemaDumper. If you're using a regular schema.rb, this option can be a mix of strings and regexes; however, if you're using a structure.sql, it will only take strings, according to the docs.
In any case, you can add an initializer to modify the list of ignored tables however you need. In my case, I have a number of backup tables that are created when certain risky operations are performed, and I'd like to exclude those from the structure.sql. Adding this initializer solved it for me:
# config/initializers/ignore_tables.rb
backup_tables = ActiveRecord::Base.connection.tables.select do |t|
  t.match?(/_backup_.*$/)
end

ActiveRecord::SchemaDumper.ignore_tables += backup_tables

Related

Query to get a field to field logic from source to target using informatica repository tables

I would like a query for the source-to-target field flow using Informatica repository tables, for example the repository view REP_MAPPING_CONN_PORTS.
We do not have any additional licensing for the Metadata Manager. The idea is basically to automate documenting the end-to-end source-to-target logical flow of all fields in a mapping. Say after every release a job runs which automatically updates the logical flow, which would then be very easy for anyone to go through and understand the logic.
Say I have a mapping m_temp with four transformations:
Source --> Source Qualifier --> Expression --> Target
I need to extract the data from the repository, something like the example below, so that I can then showcase it in a front-end tool of some sort.
Say in the above mapping, FIELD_1 starts from the Source, flows through the Source Qualifier, and a logical IF in the Expression connects it to a field FIELD_2 in the Target. This is how I expect the output of the query to look:
From      Logic                      To
FIELD_1   IF(FIELD_1='1','A','B')    FIELD_2
Could someone please assist me with a query that I could run on the Informatica repository?
Here's a tool that I'm using: https://xmlanalyzer.maciejg.pl
It generates documentation based on XML exports, not on the repository, and it shows basic data lineage without the full logic. Trying to include all the expressions, aggregation conditions, lookups, joins, etc. would make it huge and unreadable, I'm afraid.

Django: how do I create a model dynamically

How do I create a model dynamically upon uploading a csv file? I have done the part where it can read the csv file.
This doc explains very well how to dynamically create models at runtime in Django. It also links to an example of doing so.
However, as you will see from the document, doing this is quite complex and cumbersome. I would not recommend it; it is quite likely you can determine a model ahead of time that is flexible enough to handle the CSV. That would be much better practice, since dynamically changing your database schema while your application is running is a recipe for a ton of bugs in your code.
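For illustration, here is a minimal sketch of what runtime model creation looks like with type(), which is the core of the approach the linked doc describes. The model name, field names, and app label are purely hypothetical, and note that building the class is only half the job: the table still has to be created, which is where most of the complexity lives.

from django.db import connection, models

# Hypothetical fields derived from CSV headers; 'myapp' stands in for a
# real installed app.
attrs = {
    'name': models.CharField(max_length=255),
    'value': models.TextField(),
    'Meta': type('Meta', (), {'app_label': 'myapp'}),
    '__module__': 'myapp.models',  # Django needs this to register the class
}

# Build the model class at runtime...
CsvRow = type('CsvRow', (models.Model,), attrs)

# ...but the class alone is not enough: the table has to be created too.
with connection.schema_editor() as schema_editor:
    schema_editor.create_model(CsvRow)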
I understand that you want to create new schemas on the fly based on the fields in a CSV. While that's a valid use case and could be the absolute right call, I doubt it; it lends itself to a data model for a single-tenant SaaS application that could have goofy performance and migration issues.
I'd try Mongo or some other NoSQL solution, as others have mentioned. But a simpler approach may be a modified star schema implemented in SQL. In this case you create a dimensions table that stores each header, then create an instance of each data element that has a foreign key to its dimension and records the value of that dimension.
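A minimal sketch of what those two tables might look like as Django models, using the Dimension/DimensionRecord names from the pseudo code below; the field names and lengths are just illustrative:

from django.db import models

class Dimension(models.Model):
    # one row per CSV header
    name = models.CharField(max_length=255, unique=True)

class DimensionRecord(models.Model):
    # one row per cell, pointing back to the header it belongs to
    dimension = models.ForeignKey(Dimension, on_delete=models.CASCADE)
    value = models.TextField()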
If you read the CSV, the pseudo code would look something like this:
from csv import DictReader

for row in DictReader(file):
    for k in row.keys():
        try:
            dim = Dimension.objects.get(name=k)
        except Dimension.DoesNotExist:
            dim = Dimension(name=k)
            dim.save()
        DimensionRecord.objects.create(dimension=dim, value=row[k])
Obviously you could better handle reading the headers and error trapping when dimensions already exist, but this is an example of how you could dynamically load variable-headered CSVs into a SQL database.

Modifying field with regex in Mongo and adding it to a new field

I'm a mongo noob and have what I hope is a pretty easy question. I received a 100gb .bson file yesterday and need to quickly retrieve some documents associated with urls. Unfortunately, the people that managed the database decided to change the schema for storing urls halfway through its life. This means that the url field must be queried via regex and cannot be indexed.
What I am hoping to do is this: regex out some common string between the two versions of the URLs and store it in a new field called url_id. This field could then be indexed to make for quicker queries. Looking through some past SO posts, I cobbled together some pseudo-code that might do the trick:
// pseudo code, I don't know JavaScript that well.
db.eval(function() {
    db.foo.find({}, {url: 1}).forEach(function(e) {
        // strip protocol, www, and query string
        var matches = e.url.match(/.*(domain\.com\/[^?]*)(\?.*)?$/);
        e.url_id = matches[1];
        db.foo.save(e);
    });
});
Then I could run:
db.foo.ensureIndex({url_id:1})
This would create a new index that is quicker to query against, so long as I modify the URLs the same way before querying for them.
However, I'm scared at the prospect of running a for loop across 100gb of records. Is there a better way to do this that I'm not thinking of?
Figured out a workaround...
By simply scripting the modification of the input URL to create various versions of itself, I was able to run multiple queries against the indexed database and concatenate the results. Hacky, but it worked!
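A rough sketch of that workaround in Python with PyMongo, under the assumption that the url field is indexed for exact matches and that the two schema versions differ only in protocol/www prefixes and query strings; the variant list here is purely illustrative:

from pymongo import MongoClient

def url_variants(url):
    # drop any query string, then normalise protocol and www prefix
    base = url.split('?', 1)[0]
    for prefix in ('https://', 'http://'):
        if base.startswith(prefix):
            base = base[len(prefix):]
    bare = base[4:] if base.startswith('www.') else base
    return {bare, 'www.' + bare, 'http://' + bare, 'http://www.' + bare}

def find_by_url(collection, url):
    results = []
    for variant in url_variants(url):
        # exact match, so the index on url can be used
        results.extend(collection.find({'url': variant}))
    return results

collection = MongoClient().mydb.foo
docs = find_by_url(collection, 'http://www.domain.com/page?utm_source=x')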

Inexact full-text search in PostgreSQL and Django

I'm new to PostgreSQL, and I'm not sure how to go about doing an inexact full-text search. Not that it matters too much, but I'm using Django. In other words, I'm looking for something like the following:
q = 'hello world'
queryset = Entry.objects.extra(
    where=['body_tsv @@ plainto_tsquery(%s)'],
    params=[q])
for entry in queryset:
    print entry.title
where the list of entries should contain either exactly 'hello world' or something similar. The listing should then be ordered according to how far away each entry's value is from the specified string. For instance, I would like the query to include entries containing "Hello World", "hEllo world", "helloworld", "hell world", etc., with some sort of ranking indicating how far away each item is from the perfect, unchanged query string.
How would you go about doing this?
Your best bet is to use Django raw querysets; I use them with MySQL to perform full-text matching. If the data is all in the database and Postgres provides the matching capability, then it makes sense to use it. Plus, Postgres offers some really useful things in terms of stemming etc. with full-text queries.
Basically it lets you write the actual query you want yet returns models (as long as you are querying a model table obviously).
The advantage this gives you is that you can test the exact query you will be using first in Postgres, the documentation covers full text queries pretty well.
The main gotcha with raw querysets at the moment is they don't support count. So if you will be returning lots of data and have memory constraints on your application you might need to do something clever.
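As a hedged illustration of the raw-queryset approach, a query against a Postgres tsvector column might look something like this; the table name (blog_entry) is an assumption, and body_tsv is the column from the question:

q = 'hello world'
entries = Entry.objects.raw(
    """
    SELECT id, title,
           ts_rank(body_tsv, plainto_tsquery(%s)) AS rank
    FROM blog_entry
    WHERE body_tsv @@ plainto_tsquery(%s)
    ORDER BY rank DESC
    """,
    [q, q],
)
for entry in entries:
    # extra selected columns (rank) become attributes on the model instances
    print(entry.title, entry.rank)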
"Inexact" matching however isn't really part of the full text searching capabilities. Instead you want the postgres fuzzystrmatch contrib module. It's use is described here with indexes.
The best would be to use a search engine for this purpose. Django-haystack supports the integration of three different search engines.
As of 2022, Django supports full-text search with Postgres out of the box. Full documentation here: https://docs.djangoproject.com/en/4.0/ref/contrib/postgres/search/
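For completeness, a minimal sketch with Django's built-in Postgres search, assuming an Entry model with a plain body text field; the trigram part additionally requires the pg_trgm extension to be enabled:

from django.contrib.postgres.search import (
    SearchQuery, SearchRank, SearchVector, TrigramSimilarity,
)

# Ranked full-text matching (handles stemming, word order, etc.)
query = SearchQuery('hello world')
vector = SearchVector('body')
results = (
    Entry.objects
    .annotate(rank=SearchRank(vector, query))
    .filter(rank__gt=0.0)
    .order_by('-rank')
)

# For "hell world" / "helloworld" style typos, trigram similarity is
# usually a better fit than tsquery matching:
fuzzy = (
    Entry.objects
    .annotate(similarity=TrigramSimilarity('body', 'hello world'))
    .filter(similarity__gt=0.3)
    .order_by('-similarity')
)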

Pitfalls of generating JSON in Django templates

I've found myself unsatisfied with Django's ability to render JSON data. If I use the built-in serializers, then database foreign-key relationships are not included in the data (only the keys). Also, it seems to be impossible to include custom data in the JSON feed that isn't part of the model being serialized.
As a test I implemented a template that rendered some JSON for the resultset of a particular model. I was able to include/exclude whatever parts of the model I wanted and was able to include custom data as well.
The test seemed to work well and wasn't slower than the recommended serialization methods.
Are there any pitfalls to this using this method of serialization?
While it's hard to say definitively whether this method has any pitfalls, it's the method we use in production, as you control everything that is serialized even if the underlying model is changed. We've been running a high-traffic application using this method for almost two years.
Hope this helps.
One problem might be escaping metacharacters like the double quote ("). Django's template system automatically escapes dangerous characters, but it's set up to do that for HTML. You should look up exactly what the template escaping does and compare that to what's dangerous in JSON. Otherwise, you could cause XSS problems.
You could think about constructing a data structure of dicts and lists, and then running a JSON serializer on that, rather than directly on your database model.
I don't understand why you see the choice as being either 'use Django serializers' or 'write JSON in templates'. The middle way, which to my mind is much more robust and fits your use case well, is to build up your data as Python lists/dictionaries and then simply use simplejson.dumps() to convert it to a JSON string.
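A small sketch of that middle way, assuming a hypothetical Entry model with an author foreign key and a hypothetical compute_custom_value() helper for the custom data; the standard-library json module stands in for simplejson here:

import json  # simplejson in older projects

from django.http import HttpResponse

def entries_json(request):
    data = [
        {
            'title': entry.title,
            'author': entry.author.name,           # follow the FK explicitly
            'extra': compute_custom_value(entry),   # custom, non-model data
        }
        for entry in Entry.objects.select_related('author')
    ]
    return HttpResponse(json.dumps(data), content_type='application/json')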
We use this method to get custom JSON format consumed by datatables.net
It was the easiest method we found to accomplish this task, and it has worked fine with no problems so far.
You can find details here: http://datatables.net/development/server-side/django
So far, generating JSON from templates, we've run into the need to escape newlines. We're looking at switching to simplejson.dumps() next.