Greenplum Query : Best strategy to Move Objects from Pre-Prod to Prod Env - database-migration

I have two different environments: Production (new) and Pre-Production (existing). We have been given a ready cluster with Greenplum installed on the new Production environment.
I want to know the best way to move objects from the Pre-Production environment to the Production environment.
I know:
using gp_dump
using pg_dump
Manually dumping each object (table DDL, function DDL, view DDL, sequence DDL, etc.)
I want to know the best strategy, and the pros and cons of each, if only objects need to be backed up and restored from one environment to another.
Need your valuable input for the same.

The available strategies, ranked by priority:
Use gpcrondump and gpdbrestore. This works only if the number of segments in Pre-Production and Production is the same and the dbids match. It is the fastest way to transfer the whole database with its schema, as it performs a parallel dump and a parallel restore. Because it is a backup, it will lock pg_class for a short time, which might cause some problems on the Production system
If the number of objects to transfer is small, you can use the gptransfer utility; see the user guide for reference. It lets you transfer data directly between the segments of Pre-Production and Production. The requirement is that all the segment servers of the Pre-Production environment must be able to see all the segments of Production, which means they should be added to the same VLAN for data transfer.
Write custom code and use writable external tables and readable external tables over a pipe object on a shared host. You would also have to write some manual code to compare DDL. The benefit of this method is that you can reuse the external tables to transfer data between environments many times; if the DDL has not changed, the transfer is fast because the data is never written to disk. But all the data is transferred through a single host, which might be a bottleneck (up to 10 Gbps transfer rate with dual 10GbE connections on the shared host). Another big advantage is that there are no locks on pg_class
Run gpcrondump on the source system and restore the data serially on the target system. This is the way to go if you want a backup-restore solution and your source and target systems have different numbers of segments
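The external-table approach (strategy 3) can be sketched in code. The snippet below generates the paired DDL for a writable external table on the source and a matching readable external table on the target, both pointing at the same gpfdist location; the host name, port, and pipe file name are hypothetical placeholders you would replace with your shared host's details.

```python
# Sketch of strategy 3: generate the DDL for a writable external table on the
# source cluster and a matching readable external table on the target cluster.
# "shared-host", port 8081, and the ".pipe" file name are placeholders.

def external_table_ddl(table, columns, host="shared-host", port=8081):
    """Return (writable_ddl, readable_ddl) for transferring `table` via gpfdist."""
    col_list = ", ".join(f"{name} {type_}" for name, type_ in columns)
    location = f"gpfdist://{host}:{port}/{table}.pipe"
    writable = (
        f"CREATE WRITABLE EXTERNAL TABLE wext_{table} ({col_list}) "
        f"LOCATION ('{location}') FORMAT 'TEXT';"
    )
    readable = (
        f"CREATE EXTERNAL TABLE rext_{table} ({col_list}) "
        f"LOCATION ('{location}') FORMAT 'TEXT';"
    )
    return writable, readable

w, r = external_table_ddl("sales", [("id", "int"), ("amount", "numeric")])
print(w)
print(r)
```

On the source you would then run `INSERT INTO wext_sales SELECT * FROM sales;` while the target runs `INSERT INTO sales SELECT * FROM rext_sales;`, with gpfdist serving the named pipe on the shared host.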
In general, everything depends on what you want to achieve: move the objects a single time, move them once a month during a period of inactivity on the clusters, move all the objects weekly without stopping Production, move selected objects daily without stopping Production, etc. The right choice really depends on your needs

Create a copy of Redshift production with limited # records in each table

I have a production Redshift cluster with a significant amount of data on it. I would like to create a 'dummy' copy of the cluster that I can use for ad-hoc development and testing of various data pipelines. The copy would have all the schemas/tables of production, but only a small subset of the records in each table (say, limited to 10,000 rows per table).
What would be a good way to create such a copy, and refresh it on a regular basis (in case production schemas change)? Is there a way to create a snapshot of a cluster with limits on each table?
So far my thinking is to create a new cluster, use some of the admin views as defined here to automatically get the DDL of schemas/tables etc., and write scripts that generate UNLOAD statements (with limits on the number of records) for each table. I can then use these to populate my dev cluster. However, I feel there must be a cleaner solution.
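The UNLOAD-generating script described above can be sketched briefly. One gotcha worth noting: Redshift rejects a LIMIT clause in the top-level SELECT of an UNLOAD, so the limit has to go in a subquery. The bucket name and IAM role ARN below are hypothetical placeholders.

```python
# Sketch: generate one UNLOAD statement per table, capping rows per table.
# Redshift does not allow LIMIT in the outer SELECT of an UNLOAD, hence the
# nested subquery. Bucket and IAM role are placeholder values.

def unload_statement(schema, table, bucket="my-dev-dump",
                     iam_role="arn:aws:iam::123456789012:role/RedshiftUnload",
                     max_rows=10000):
    return (
        f"UNLOAD ('SELECT * FROM (SELECT * FROM {schema}.{table} LIMIT {max_rows})') "
        f"TO 's3://{bucket}/{schema}/{table}/' "
        f"IAM_ROLE '{iam_role}' GZIP ALLOWOVERWRITE;"
    )

# The (schema, table) list would come from pg_table_def or the admin views.
tables = [("public", "orders"), ("public", "customers")]
for schema, table in tables:
    print(unload_statement(schema, table))
```

The generated files can then be loaded into the dev cluster with matching COPY statements.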
I presume your basic goal is cost-saving. This needs to be balanced against administrative effort (how expensive is your time?).
It might be cheaper to produce a full-copy (restore from backup) of the cluster but turn it off at night/weekends to save money. If you automate the restoration process you could even schedule it to start before you come into work.
That way, you'll have a complete replica of the production system with effectively zero administration overhead (once you write a couple of scripts to create/delete the cluster), and you can save roughly 75% of the costs (running 40 out of 168 hours per week). Plus, each time you create a new cluster it contains the latest data from the snapshot, so there is no need to keep them "in sync".
The simplest solutions are often the best.

Replicated caching solutions compatible with AWS

My use case is as follow:
We have about 500 servers running in an autoscaling EC2 cluster that need to access the same configuration data (laid out in a key/value fashion) several million times per second.
The configuration data isn't very large (1 or 2 GBs) and doesn't change much (a few dozen updates/deletes/inserts per minute during peak time).
Latency is critical for us, so the data needs to be replicated and kept in memory on every single instance running our application.
Eventual consistency is fine, but we need to make sure that every update will be propagated at some point (knowing that the servers can be shut down at any time).
The update propagation across the servers should be reliable and easy to set up (we can't have static IPs for our servers, and we don't want to go the route of "faking" multicast on AWS, etc.).
Here are the solutions we've explored in the past:
Using regular Java maps and our custom-built system to propagate updates across the cluster (obviously, it doesn't scale that well).
Using EhCache and its replication feature. But setting it up on EC2 is very painful and somewhat unreliable.
Here are the solutions we're thinking of trying out:
Apache Ignite (https://ignite.apache.org/) with a REPLICATED strategy.
Hazelcast's Replicated Map feature. (http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#replicated-map)
Apache Geode on every application node. (http://geode.apache.org/)
I would like to know if each of those solutions would work for our use case. And eventually, what issues I'm likely to face with each of them.
Here is what I found so far:
Hazelcast's Replicated Map is somewhat recent and still a bit unreliable (async updates can be lost when scaling down)
It seems like Geode became "stable" fairly recently (even though it has supposedly been in development since the early 2000s)
Ignite looks like it could be a good fit, but I'm not too sure how its S3-based discovery system will work out if we keep adding/removing nodes regularly.
Thanks!
Geode should work for your use case. You should be able to use a Geode Replicated region on each node. You can choose to do synchronous OR asynchronous replication. In case of failures, the replicated region gets an initial copy of the data from an existing member in the system, while making sure that no in-flight operations are lost.
In terms of configuration, you will have to start a couple/few member discovery processes (Geode locators) and point each member to these locators. (We recommend that you start one locator per AZ and use 3 AZs to protect against network partitioning.)
Geode/GemFire has been stable for a while, powering low-latency, high-scalability requirements for reservation systems at Indian and Chinese railways, among other users, for a very long time.
Disclosure: I am a committer on Geode.
Ignite provides native AWS integration for discovery over S3 storage: https://apacheignite-mix.readme.io/docs/amazon-aws. It solves the main issue: you don't need to change the configuration when instances are restarted. In a nutshell, any node that successfully joins the topology writes its coordinates to a bucket (and removes them when it fails or leaves). When you start a new node, it reads this bucket and connects to one of the listed addresses.
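A minimal Spring XML fragment for this S3-based discovery might look like the following; the bucket name is a placeholder, and credentials configuration is omitted for brevity.

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
      <property name="ipFinder">
        <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.s3.TcpDiscoveryS3IpFinder">
          <!-- "my-ignite-discovery" is a placeholder bucket name -->
          <property name="bucketName" value="my-ignite-discovery"/>
        </bean>
      </property>
    </bean>
  </property>
</bean>
```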
Hazelcast's Replicated Map will not work for your use case. Note that it is a map that is replicated across all of its (server) nodes, not on the client nodes/servers. Also, as you said, it is not fully reliable yet.
Here is the Hazelcast solution:
Create a Hazelcast cluster with a set of nodes depending upon the size of data.
Create a Distributed Map (IMap) and tweak the count and eviction configurations based on the size/number of key/value pairs. The data gets partitioned across all the nodes.
Set up the backup count based on how critical the data is and how much time it takes to pull the data from the actual source (DB/files). Distributed Maps have 1 backup by default.
On the client side, set up a NearCache and attach it to the Distributed Map. This NearCache will hold the key/value pairs on the local/client side itself, so get operations complete in milliseconds.
Things to consider with NearCache solution:
The first get operation will be slower, as it has to go over the network to get the data from the cluster.
Cache invalidation is not fully reliable, as there will be a delay in synchronization with the cluster and you may end up reading stale data. Then again, this is the same across all cache solutions.
It is the client's responsibility to set up timeouts and invalidation of NearCache entries, so that future pulls get fresh data from the cluster. This depends on how often the data gets refreshed or a value is replaced for a key.
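The near-cache pattern described above can be illustrated with a toy sketch: a small client-side cache with a TTL sitting in front of the (remote) distributed map. The `cluster_map` dict below stands in for the Hazelcast IMap; a real client would fetch over the network.

```python
import time

# Toy illustration of the near-cache pattern: a local TTL cache in front of a
# "remote" distributed map. `cluster_map` is a stand-in for the real IMap.

class NearCache:
    def __init__(self, cluster_map, ttl_seconds=30.0):
        self._cluster = cluster_map
        self._ttl = ttl_seconds
        self._local = {}           # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._local.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]        # fast local hit, no network round trip
        value = self._cluster.get(key)   # slow path: fetch from the cluster
        self._local[key] = (value, time.monotonic() + self._ttl)
        return value

    def invalidate(self, key):
        self._local.pop(key, None)

cluster = {"feature.flag": "on"}   # pretend this is the distributed map
cache = NearCache(cluster)
print(cache.get("feature.flag"))   # first read goes to the "cluster"
```

Note how the sketch also exhibits the stale-read caveat from the list above: between a cluster-side update and the TTL expiry (or an explicit invalidation), the local copy is out of date.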

Can you run a whole service using Redis?

So I'm currently developing a messaging application to learn the process and I'm actually using Redis as a cache and use it with websockets to push real-time messages.
And then, this question popped in my mind:
Is it possible to use Redis only, to run a whole service (like a messaging application, for example)?
NOTE: this implies removing any other form of database (we're only keeping strings).
I know you can set up Redis to be persistent, but is that enough? Is it robust enough? Would it be a safe move, or totally insane?
What are you thoughts ? I'd really like to know, and if you think it is possible, I'll give it a shot.
Thanks !
A few companies use Redis as their unique or primary database, so it is definitely not insane.
You can develop and run a full service using Redis as a backend, as long as you understand and accept the tradeoffs it implies.
By this I mean:
that you can use a Redis server as a high-performance database as long as all your data can reside in memory. It may imply that you reduce the size of your data, or choose not to store some of it, which may be computed by your app on read access or imported from another source;
that if you can't store all of your data in the memory of a single server, you can use a Redis cluster, but it will limit the available Redis features (see the implemented subset);
that you have to think about the potential data losses when a server crashes, and determine whether they are acceptable. It may be OK to lose some data if the process which produced it is robust and will create it again when the database restarts (for example, when the data stored in Redis comes from an import process, which will start again from the last imported item). You can also use several Redis instances with different persistence configurations: one which writes to disk each time a key is modified, avoiding potential data loss but with much lower performance; and another to store non-critical data, which is written to disk every couple of seconds.
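The two-instance persistence setup mentioned above maps directly onto Redis's AOF configuration directives. A sketch of the two redis.conf fragments:

```conf
# redis-critical.conf -- instance for critical data: fsync on every write,
# minimizing data loss at the cost of throughput.
appendonly yes
appendfsync always

# redis-bulk.conf -- instance for non-critical data: fsync once per second,
# accepting up to ~1 second of writes lost on a crash.
appendonly yes
appendfsync everysec
```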
Redis may be used to store structured data, not only strings, using hashes. Each time you would create an index in a relational model, you have to create a data structure in Redis. For example, if you want to store Person objects, you create a HASH for each of them to store their properties, including a unique ID. If you want to be able to get people by city, you create a SET for each city and insert the ID of each newly created Person into the corresponding SET. That way, you can get the list of persons in a given city. It's just an example; you have to define the model and data structures to be used according to your application.
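The Person/city modeling described above can be sketched in a few lines. To keep the example self-contained it runs against a tiny in-memory stand-in for a Redis client; the method names (hset, sadd, smembers) match redis-py, so the same calls would work against a real Redis server.

```python
# In-memory stand-in for a Redis client; method names mirror redis-py.
class FakeRedis:
    def __init__(self):
        self._hashes, self._sets = {}, {}
    def hset(self, key, mapping):
        self._hashes.setdefault(key, {}).update(mapping)
    def hgetall(self, key):
        return dict(self._hashes.get(key, {}))
    def sadd(self, key, *members):
        self._sets.setdefault(key, set()).update(members)
    def smembers(self, key):
        return set(self._sets.get(key, set()))

def add_person(r, person_id, name, city):
    # One HASH per person holds its properties...
    r.hset(f"person:{person_id}", mapping={"name": name, "city": city})
    # ...and one SET per city acts as the secondary index.
    r.sadd(f"city:{city}", person_id)

def people_in_city(r, city):
    return [r.hgetall(f"person:{pid}") for pid in r.smembers(f"city:{city}")]

r = FakeRedis()
add_person(r, 1, "Alice", "Paris")
add_person(r, 2, "Bob", "Paris")
print(people_in_city(r, "Paris"))
```

The key naming scheme (`person:<id>`, `city:<name>`) is a convention chosen for this example, not anything mandated by Redis.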

Suggested lightweight key-value store for distributing state between many hosts

I'm looking for a key-value store that will be used to share some state between multiple hosts.
Goal:
- Achieve high availability for a limited set of data that needs to be accessible on every host/node
Requirements:
put/get/incr/decr operations
simple numeric data - int/float values, nothing more, no JSON, blobs and so on
full copy of dataset on every node or automated failure tolerance
automatic adding/removing of hosts with no need to reconfigure application
small dataset - only a few megabytes of shared data
node traffic is load balanced with user-to-node stickiness, so only one node at a time will change data related to the users stuck to that node. This only changes on node failure, but the constraint of one master for a given set of keys will be kept - so: many readers, one master for its own small dataset
multiple small VM instances will be used, so it should be lightweight in terms of required memory
automated operation - configure once and forget
I've looked at Riak and CouchDB, but they look too complicated and too heavy
Any suggestions?
After more research I'm heading toward Hazelcast; it provides a memcache-like interface and it's easy to configure a simple cluster with automated failover

Nuodb and HDFS as storage

Using HDFS as storage for NuoDB - would this have a performance impact?
If I understand correctly, HDFS is better suited for batch-mode or write-once/read-many types of applications. Wouldn't it increase the latency for a record to be fetched when it needs to be read from storage?
On top of that, given the HDFS block-size concept, keeping file sizes small would increase the network traffic while data is being fetched. Am I missing something here? Please point it out.
How would NuoDB manage these kinds of latency gotchas?
Good afternoon,
My name is Elisabete and I am the Technical Support Engineer over at NuoDB. I believe that I may have just answered this via your post on our own forum, but I'm responding here as well for anyone else who's curious.
First... a mini lesson on NuoDB architecture/layout:
The most basic NuoDB set-up includes:
Broker Agent
Transaction Engine (TE)
Storage Manager (SM) connected to an Archive Directory
Broker Agents keep track of all the moving parts in the domain (collection of machines hosting NuoDB processes) and provide client applications with connection information for the next available Transaction Engine.
Transaction Engines process incoming SQL requests and manage transactions.
Storage Managers read and write data to and from "disk" (the Archive Directory).
All of these components can reside on a single machine, but an optimal set up would have them spread across multiple host machines (allowing each process to take full advantage of the host's available CPU/RAM). Also, while it's possible to run with just one of each component, this is a case where more is definitely more. Additional Brokers provide resiliency, additional TE's increase performance/speed and additional SM's ensure durability.
Ok, so now lets talk about Storage:
This is the "Archive Directory" that your storage manager is writing to. Currently, we support three modes of storage:
Local File System
Amazon Web Services: Simple Storage Service (S3), Elastic Block Store (EBS)
Hadoop Distributed File System (HDFS)
So, to elaborate on how NuoDB works with HDFS... it doesn't know about the multiple machines that the HDFS layer is writing to. As far as the SM is concerned, it is reading and writing data atoms to a single directory. The HDFS layer decides how to then distribute and retrieve data to and from the cluster of machines it resides over.
And now to finally address the question of latency:
Here's the thing: whenever we introduce a remote storage device, we inevitably introduce some additional latency, because the SM now has further to go when reading/writing atoms to/from memory. HDFS likely adds a bit more, because now it needs to do its magic of divvying up, distributing, retrieving and reassembling data. Add to that discrepancies in network speed, etc.
I imagine that the gained disk space outweighs the cost in travel time, but this is something you'd have to decide on a case by case basis.
Now, all of that said... I haven't mentioned that TEs and SMs both have the ability to cache data in local memory. The size of this cache is something you can set when starting up each process. NuoDB uses a combination of Multi-Version Concurrency Control (MVCC) and a near-constant stream of communication between all of the processes to ensure that data held in cache is kept up to date with all of the changes happening within the system. Garbage Collection also kicks in and clears out atoms in least-recently-used order when the cache grows close to its limit.
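The least-recently-used eviction just described can be illustrated with a toy cache: when the cache is full, the entry that hasn't been touched for the longest is dropped. (NuoDB's actual cache management is of course more involved; this only shows the eviction order.)

```python
from collections import OrderedDict

# Toy LRU cache: reads and writes move a key to the "most recently used" end;
# when capacity is exceeded, the key at the other end is evicted.
class LRUCache:
    def __init__(self, capacity):
        self._capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)        # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self._capacity:
            self._data.popitem(last=False) # evict least recently used

cache = LRUCache(2)
cache.put("atom1", "a"); cache.put("atom2", "b")
cache.get("atom1")                          # touch atom1
cache.put("atom3", "c")                     # evicts atom2, the LRU entry
print(list(cache._data))                    # -> ['atom1', 'atom3']
```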
All of this helps reduce latency, because the TE's can hold onto the data they reference most often and grab copies of data they don't have from sibling TE's. When they do resort to asking the SM's for data, there's a chance that the SM (or one of its sibling SM's) has a copy of the requested data in local cache, saving itself the trip out to the Archive Directory.
Whew.. that was a lot and I absolutely glossed over more than a few concepts. These topics are covered in greater depth via the new suite of white papers (and the new "green book") available on our main website. I'm currently also working on some visual guides, to help explain all of this.
If you'd like to know more about NuoDB or if I didn't quite answer your question.... please reach out to me directly via the NuoDB Community Forums (I respond to posts there, a bit faster).
Thank you,
Elisabete
Technical Support Engineer at NuoDB