How to manage schema migrations in Google BigQuery - google-cloud-platform

How do we manage schema migrations for Google BigQuery? We have used Liquibase and Flyway in the past. What kind of tools can we use to manage schema modifications and the like (e.g. adding a new column) across dev/staging environments?

I found an open source framework for BigQuery schema migration:
https://github.com/medjed/bigquery_migration
One more solution:
https://robertsahlin.com/automatic-builds-and-version-control-of-your-bigquery-views/
P.S. Someone opened a ticket asking Flyway to support BigQuery.

Flyway, a very popular database migration tool, now offers beta support for BigQuery while it is pending certification.
You can get access to the beta version here: https://flywaydb.org/documentation/database/big-query after answering a short survey.
I've tested it from the command line and it works great! It took me about an hour to get familiar with Flyway's configuration, and I now call it with a yarn command.
Here's an example for a NodeJS project with the following file structure:
package.json
flyway/
  <SERVICE_ACCOUNT_JSON_FILE>
  flyway.conf
  migrations/
    V1_<YOUR_MIGRATION>.sql
package.json:
{
  ...
  "scripts": {
    ...
    "migrate": "flyway -configFiles=flyway/flyway.conf migrate"
  },
  ...
}
and flyway.conf:
flyway.url=jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<YOUR_PROJECT_ID>;OAuthType=0;OAuthServiceAcctEmail=<SERVICE_ACCOUNT_NAME>;OAuthPvtKeyPath=flyway/<SERVICE_ACCOUNT_JSON_FILE>;
flyway.schemas=<YOUR_DATASET_NAME>
flyway.user=
flyway.password=
flyway.locations=filesystem:./flyway/migrations
flyway.baselineOnMigrate=true
Then you can just call yarn migrate any time you have new migrations to apply.
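For reference, each migration is just a standard SQL file. A minimal sketch of flyway/migrations/V1_<YOUR_MIGRATION>.sql might look like this (the table and column names are only illustrative):
-- V1: create an initial table in the target dataset
CREATE TABLE <YOUR_DATASET_NAME>.users (
  id INT64 NOT NULL,
  email STRING
);
Later files (V2_..., V3_...) would then hold incremental changes to the same dataset.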

I created an adapter for Sequel, sequel-bigquery, so we could manage our BigQuery database schema as a set of Ruby migration files which use Sequel's DSL - the same way we do for our PostgreSQL database.
Example
# migrations/bigquery/001_create_people_table.rb
Sequel.migration do
  change do
    create_table(:people) do
      String :name, null: false
      Integer :age, null: false
      TrueClass :is_developer, null: false
      DateTime :last_skied_at
      Date :date_of_birth, null: false
      BigDecimal :height_m
      Float :distance_from_sun_million_km
    end
  end
end

require 'sequel-bigquery'
require 'logger'

db = Sequel.connect(
  adapter: :bigquery,
  project: 'your-gcp-project',
  database: 'your_bigquery_dataset_name',
  location: 'australia-southeast2',
  logger: Logger.new(STDOUT),
)

Sequel.extension(:migration)
Sequel::Migrator.run(db, 'migrations/bigquery')

According to the BQ docs, you can add a column to the schema without any additional process.
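For instance, adding a new column can now be done with a single DDL statement (the table and column names below are only illustrative):
ALTER TABLE `example.dataset.table`
ADD COLUMN new_column STRING;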
For more complex transformations, if the change can be expressed as a SQL query, you can just run that query with the destination table set to the source table (although I would suggest creating a backup of the table in case something goes wrong).
Example
Let's say I have a table with a column (d) that should be an integer, but at insertion time it was written as a string. I can modify the table by setting the table itself as the destination table and running a query like:
SELECT
  a,
  b,
  c,
  CAST(d AS INT64) AS d,
  e,
  f
FROM
  `example.dataset.table`
This example changes the schema, but the approach can be applied to any change whose result you can produce with a BQ query.

Related

Run GQL query from Google Cloud Workflows

I have a simple GQL query, SELECT * FROM PacerFeed WHERE pollInterval >= 0, that I want to run in a GCP Workflow with the Firestore connector.
What is the database ID in the parent field? Is there a way I can just provide the whole query rather than the yaml'd fields? If not, what are the correct yaml args for this query?
- getFeeds:
    call: googleapis.firestore.v1.projects.databases.documents.runQuery
    # These args are not correct, just demonstrative.
    args:
      parent: projects/{projectId}/databases/{database_id}/documents
      body:
        structuredQuery:
          from: [PacerFeed]
          select: '*'
          where: pollInterval >= 0
    result: got
As noted in the comments, Jim pointed out what over a week of back and forth with GCP support could not: the database must be in Firestore Native mode.
I suggest not using GCP Workflows at all. The documentation sucks, there is no schema, and the only thing GCP Support can do is tell you to hire an MSP to fill the documentation gaps and point you to an example ... that does not include the parent field.

Automatically migrate JSON data to newest version of JSON schema

I have a service running on my Linux machine that reads data stored in a .json file when the machine is booting. The service then validates the incoming JSON data and modifies specific system configurations according to the data. The service is written in C++ and for the validation I'm using https://github.com/pboettch/json-schema-validator.
In development it was easy to modify the JSON schema and just adapt the data manually. I've started to use semantic versioning for my JSON schema and included it in the following way:
JSON schema:
{
  "$id": "https://my-company.org/schemas/config/0.1.0/config.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  // Start of Schema definition
}
JSON data:
{
  "$schema": "https://my-company.org/schemas/config/0.1.0/config.schema.json",
  // Rest of JSON data
}
With the addition of the version, I am able to check if a version mismatch exists before validating.
What I am looking for is a way to automatically migrate the JSON data to match the newer schema version, if a version mismatch is identified. Is there any way to automatically achieve this, or is the only way to manually edit the JSON data to match the schema?
Since I plan on releasing this as open source, I would really like to include some form of automatic migration so that, when a version mismatch is identified, I can ask the user whether they want to migrate to conform to the newest schema version instead of just throwing an error.
What you're asking for is something that needs to make assumptions in order to work.
This is an age-old problem, and it is similar for databases. Schema migrations can be generated for many simple changes, but that is not viable if you also want existing data translated automatically.
Let's look at a basic example: you rename a field.
How would a tool know you've renamed a field versus removed an old one and added a new one? It essentially cannot.
So, you need to write your migrations by hand.
You could use JSON transformation tools like jq or fx to create migration scripts without writing them in code, which may or may not be preferable. (jq has a steeper learning curve, but it's also very powerful.)
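For illustration, a migration that renames a hypothetical field oldName to newName and bumps the $schema version could be a single jq call (the field names and the 0.2.0 version are made up for this sketch):
jq '."$schema" = "https://my-company.org/schemas/config/0.2.0/config.schema.json" | .newName = .oldName | del(.oldName)' config.json > config.migrated.json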

Teiid: importing DDL into VDB DDL

Currently my VDB DDL file is getting quite big. I want to split it into different files using the following:
IMPORT FROM REPOSITORY "DDL-FILE"
INTO test OPTIONS ("ddl-file" '/path/to/schema1.ddl')
However, this does not seem to work.
Can the DDL file path be relative, and how?
Can the schema test be VIRTUAL?
Does "DDL-FILE" refer to "ddl-file"?
What should I put in my main VDB DDL and what should I put in my extra DDLs? Should the extra DDLs contain server configuration details, or should they be defined as a VDB?
I would like to see a working example of how to use this.
This will be used in a Teiid Spring Boot project where you can only load one main VDB file. It is not workable to have one very large DDL file.
I tried multiple approaches, but it does not seem to work, either giving me a null pointer exception with no error codes or error codes that tell me nothing.
Also the syntax in Teiid 9.3 seems different:
IMPORT FOREIGN SCHEMA public
FROM REPOSITORY DDL-FILE
INTO test OPTIONS ("ddl-file" '/path/to/schema.ddl')
This feature is currently not implemented in Teiid Spring Boot. This issue is captured in https://issues.redhat.com/browse/TEIIDSB-219
Update: I added the needed code to master; it should be available with the 1.7 release. Meanwhile, you can build the master branch and test it out.

Django fixtures for permissions

I'm creating fixtures for permissions in Django, and I'm able to get them loaded the way it's needed. However, my question is: say I want to load a fixture for the table auth_group_permissions; I need to specify a group_id and a permission_id, and unfortunately fixtures aren't the best way to handle this. Is there an easier way to do this programmatically, so that I can get the id for particular values and have them filled in? How is this normally done?
As of Django 1.7 (at least), it is possible to store permissions in fixtures thanks to the introduction of "natural keys" as a serialization option.
You can read more about natural keys in the Django serialization documentation.
The documentation explicitly mentions that the use case for natural keys is when...
...objects are automatically created by Django during the database synchronization process, the primary key of a given relationship isn’t easy to predict; it will depend on how and when migrate was executed. This is true for all models which automatically generate objects, notably including Permission, Group, and User.
So for your specific question, regarding auth_group_permissions, you would dump your fixture using the following syntax:
python manage.py dumpdata auth --natural-foreign --natural-primary -e auth.Permission
The auth_permissions table must be explicitly excluded with the -e flag as that table is populated by the migrate command and will already have data prior to loading fixtures.
This fixture would then be loaded in the same way as any other fixture.
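For example, assuming the dump above was saved to a file named auth_group_permissions.json (the file name here is arbitrary), loading it is the usual loaddata call:
python manage.py loaddata auth_group_permissions.json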
The proper solution is to create the permissions in the same manner the framework itself does.
You should connect to the built-in post_migrate signal either in the module management.py or management/__init__.py and create the permissions there. The documentation does say that any work performed in response to the post_migrate signal should not perform any database schema alterations, but you should also note that the framework itself creates the permissions in response to this signal.
So I'd suggest that you take a look at the management module of the django.contrib.auth application to see how it's supposed to be done.
Just to add to @jonpa's comment: if you are using a multitenant app and you want to directly save the fixtures to a file, you can do:
python manage.py tenant_command dumpdata --schema=<schema_name> --natural-foreign --natural-primary -e auth.Permission --indent 4 > /path/to/fixtures/fixtures.json

Generate Symfony2 fixtures from DB?

Is it possible to generate fixtures from an existing DB in Symfony2/Doctrine? How could I do that?
Example:
I have defined 15 entities and my Symfony2 application is working. Some people have been using the application and it has now inserted about 5000 rows. I want that inserted data as fixtures, but I don't want to write them by hand. How can I generate them from the DB?
There's no direct way within Doctrine or Symfony2, but writing a code generator for it (either within or outside of sf2) would be trivial. Just pull each property and generate a line of code to set each property, then put it in your fixture loading method. Example:
<?php
$i = 0;
$code = '';
$entities = $em->getRepository('MyApp:Entity')->findAll();
foreach ($entities as $entity) {
    $code .= "\$entity_{$i} = new MyApp\\Entity();\n";
    $code .= "\$entity_{$i}->setMyProperty('" . addslashes($entity->getMyProperty()) . "');\n";
    $code .= "\$manager->persist(\$entity_{$i});\n\$manager->flush();\n";
    ++$i;
}
// store code somewhere with file_put_contents
As I understand your question, you have two databases: the first is already in production and filled with 5000 rows, the second one is a new database you want to use for tests and development. Is that right?
If it is, I suggest you create two entity managers in your test environment: the first will be the 'default' one, which will be used in your project (your controllers, etc.). The second one will be used to connect to your production database. You will find here how to deal with multiple entity managers: http://symfony.com/doc/current/cookbook/doctrine/multiple_entity_managers.html
Then, you should create a Fixture class which has access to your container. There is a "how to" here: http://symfony.com/doc/current/bundles/DoctrineFixturesBundle/index.html#using-the-container-in-the-fixtures.
Using the container, you will have access to both entity managers. And this is the 'magic': you will have to retrieve the objects from your production database (using the second entity manager) and persist them with the default entity manager, which will insert them into your test database.
I point your attention to two points:
If there are relationships between objects, you will have to take care of those dependencies: owning side, inverse side, ...
If you have 5000 rows, keep an eye on the memory your script will use. Another solution may be to use native SQL to retrieve all the rows from your production database and insert them into your test database. Or a SQL script...
I do not have any code to suggest to you, but I hope this idea will help you.
I assume that you want to use fixtures (and not just dump the production or staging database into the development database) because a) your schema changes, so the dumps would not work after you update your code, or b) you don't want to dump the whole database but only want to extend some custom fixtures. An example I can think of: you have 206 countries in your staging database and users add cities to those countries; to keep the fixtures small you only have 5 countries in your development database, but you want to add the cities that users added to those 5 countries in the staging database to the development database.
The only solution I can think of is to use the mentioned DoctrineFixturesBundle and multiple entity managers.
First of all you should configure two database connections and two entity managers in your config.yml
doctrine:
    dbal:
        default_connection: default
        connections:
            default:
                driver: %database_driver%
                host: %database_host%
                port: %database_port%
                dbname: %database_name%
                user: %database_user%
                password: %database_password%
                charset: UTF8
            staging:
                ...
    orm:
        auto_generate_proxy_classes: %kernel.debug%
        default_entity_manager: default
        entity_managers:
            default:
                connection: default
                mappings:
                    AcmeDemoBundle: ~
            staging:
                connection: staging
                mappings:
                    AcmeDemoBundle: ~
As you can see, both entity managers map the AcmeDemoBundle (in this bundle I will put the code to load the fixtures). If the second database is not on your development machine, you could just dump the SQL from the other machine to the development machine. That should be possible since we are talking about 5000 rows and not about millions of rows.
What you can do next is to implement a fixture loader that uses the service container to retrieve the second entity manager and use Doctrine to query the data from the second database and save it to your development database (the default entity manager):
<?php

namespace Acme\DemoBundle\DataFixtures\ORM;

use Doctrine\Common\DataFixtures\FixtureInterface;
use Doctrine\Common\Persistence\ObjectManager;
use Symfony\Component\DependencyInjection\ContainerAwareInterface;
use Symfony\Component\DependencyInjection\ContainerInterface;
use Acme\DemoBundle\Entity\City;
use Acme\DemoBundle\Entity\Country;

class LoadData implements FixtureInterface, ContainerAwareInterface
{
    private $container;
    private $stagingManager;

    public function setContainer(ContainerInterface $container = null)
    {
        $this->container = $container;
        $this->stagingManager = $this->container->get('doctrine')->getManager('staging');
    }

    public function load(ObjectManager $manager)
    {
        $this->loadCountry($manager, 'Austria');
        $this->loadCountry($manager, 'Germany');
        $this->loadCountry($manager, 'France');
        $this->loadCountry($manager, 'Spain');
        $this->loadCountry($manager, 'Great Britain');

        $manager->flush();
    }

    protected function loadCountry(ObjectManager $manager, $countryName)
    {
        $country = new Country($countryName);

        $cities = $this->stagingManager->createQueryBuilder()
            ->select('c')
            ->from('AcmeDemoBundle:City', 'c')
            ->leftJoin('c.country', 'co')
            ->where('co.name = :country')
            ->setParameter('country', $countryName)
            ->getQuery()
            ->getResult();

        foreach ($cities as $city) {
            $city->setCountry($country);
            $manager->persist($city);
        }

        $manager->persist($country);
    }
}
What I did in the loadCountry method is load the objects from the staging entity manager, add a reference to the fixture country (the one that already exists in your current fixtures), and persist them using the default entity manager (your development database).
Sources:
DoctrineFixturesBundle
How to work with Multiple Entity Managers
You could use https://github.com/Webonaute/DoctrineFixturesGeneratorBundle
It adds the ability to generate fixtures for a single entity using commands like:
$ php bin/console doctrine:generate:fixture --entity=Blog:BlogPost --ids="12 534 124" --name="bug43" --order="1"
Or you can create a full snapshot:
php app/console doctrine:generate:fixture --snapshot --overwrite
The Doctrine Fixtures are useful because they allow you to create objects and insert them into the database. This is especially useful when you need to create associations or say, encode a password using one of the password encoders. If you already have the data in a database, you shouldn't really need to bring them out of that format and turn it into PHP code, only to have that PHP code insert the same data back into the database. You could probably just do an SQL dump and then re-insert them into your database again that way.
Using a fixture would make more sense if you were initializing your project but wanted to use user input to create it. If you had the default user in your config file, you could read that and insert the object.
The AliceBundle can help you do this. Indeed, it allows you to load fixtures from YAML (or PHP array) files.
For instance you can define your fixtures with:
Nelmio\Entity\Group:
    group1:
        name: Admins
        owner: '#user1->id'
Or with the same structure in a PHP array. It's WAY easier than generating working PHP code.
It also supports references:
Nelmio\Entity\User:
    # ...

Nelmio\Entity\Group:
    group1:
        name: Admins
        owner: '#user1'
In the doctrine_fixture cookbook, you can see in the last example how to get the service container in your entity.
With this service container, you can retrieve the Doctrine service and then the entity manager. With the entity manager, you will be able to get all the data you need from your database.
Hope this will help you!