Can I sync/backup RocksDB over the network? - c++

I have several machines processing large amounts of text data (100s of GB) that is indexed in RocksDB. The machines are for load balancing and are operating on the same data. When I add new machines, I want to copy the database over the network from an existing machine, as quickly as possible.
Is there an elegant way to make a RocksDB backup over the network? I have read https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB, but this would require twice the amount of disk space: the backup goes onto the local filesystem first, before it can be copied over the network. I would also have to deal with rsyncing files, for example.

You can attach a volume to the server, make the backup to that volume, and then use that volume on the new node.
Alternatively, you can iterate over the entire DB at a particular LSN (a consistent snapshot) and stream it over some protocol; a sketch follows.
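For the second approach, here is a minimal sketch against the RocksDB C++ API. The database path and the send_over_network() transport function are placeholders for your own setup; the idea is to pin a snapshot (a fixed sequence number) and stream every key/value pair as of that point in time:

#include <cassert>
#include <string>

#include "rocksdb/db.h"

// Placeholder: replace with your real transport (TCP socket, gRPC stream, etc.).
void send_over_network(const std::string& key, const std::string& value);

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  rocksdb::Status status = rocksdb::DB::Open(options, "/path/to/db", &db);
  assert(status.ok());

  // Pin a consistent view of the database (a fixed sequence number).
  const rocksdb::Snapshot* snapshot = db->GetSnapshot();
  rocksdb::ReadOptions read_options;
  read_options.snapshot = snapshot;

  // Walk every key/value pair as of that snapshot and stream it out.
  rocksdb::Iterator* it = db->NewIterator(read_options);
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    send_over_network(it->key().ToString(), it->value().ToString());
  }
  assert(it->status().ok());  // surface any iteration error

  delete it;
  db->ReleaseSnapshot(snapshot);
  delete db;
  return 0;
}

On the receiving machine you can replay the stream with ordinary Put calls, or, usually faster for bulk loads, build SST files with rocksdb::SstFileWriter and pull them in with DB::IngestExternalFile.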

Related

Share data across Amazon Elastic Beanstalk nodes

I have a Spring Boot application which downloads around 300 MB of data at start-up and saves it to a path /app/local/mydata. Currently, I have just one dev environment with a single node and it is not a problem. However, once I create a prod instance with (say) 10 nodes, it would be a waste of bandwidth for each node to individually download the same 300 MB of data. It would put a lot of stress on the service it is downloading the data from, and there is a cost associated with data flowing in/out of EC2.
I can build logic using a touch file to make sure that only one box downloads the data and the others just wait until the download is complete. However, I don't know where to download the data so that the other nodes can read it too.
Any suggestions?
Store it in S3 if you want to keep it as a file, but it sounds like you might need to put the data in a database (RDS) or maybe cache it in Redis (ElastiCache).
I'm not sure what a "touchfile" is, but I assume you mean some sort of file-lock mechanism. I don't see that as the best option for coordinating this across multiple servers. I would probably use a DynamoDB table with consistent reads and conditional writes as a distributed locking mechanism (see the sketch after this answer).
How often does the data you are downloading change? Perhaps you could just schedule a Lambda function to refresh the data periodically and update a database or something?
In general, you need to stop thinking about using the web server's local file system for this sort of thing.
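For the conditional-write lock idea, here is a rough sketch using the AWS SDK for C++ (the question is a Spring Boot app, so you would do the equivalent with the AWS SDK for Java, but the pattern is identical). The table name "locks" and the key value "download-mydata" are made-up examples; exactly one node's write succeeds because of the attribute_not_exists condition:

#include <iostream>

#include <aws/core/Aws.h>
#include <aws/dynamodb/DynamoDBClient.h>
#include <aws/dynamodb/model/AttributeValue.h>
#include <aws/dynamodb/model/PutItemRequest.h>

int main() {
  Aws::SDKOptions sdk_options;
  Aws::InitAPI(sdk_options);
  {
    Aws::DynamoDB::DynamoDBClient client;

    // Try to claim the lock: the item can only be created once, so exactly
    // one node wins and performs the 300 MB download.
    Aws::DynamoDB::Model::PutItemRequest request;
    request.SetTableName("locks");  // hypothetical table with partition key LockId
    request.AddItem("LockId", Aws::DynamoDB::Model::AttributeValue("download-mydata"));
    request.SetConditionExpression("attribute_not_exists(LockId)");

    auto outcome = client.PutItem(request);
    if (outcome.IsSuccess()) {
      std::cout << "Lock acquired: this node downloads the data." << std::endl;
    } else {
      // Conditional check failed: another node already holds the lock.
      std::cout << "Lock held elsewhere: wait for the data to appear." << std::endl;
    }
  }
  Aws::ShutdownAPI(sdk_options);
  return 0;
}

The losing nodes can then poll the shared store (S3, RDS, ElastiCache) until the data shows up; adding a timestamp or TTL attribute to the lock item keeps a crashed winner from holding the lock forever.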

Greenplum query: best strategy to move objects from pre-prod to prod environment

I have two different environments: Production (new) and Pre-Production (existing). We have been given a cluster with Greenplum installed on the new Production environment.
I want to know the best way to move objects from the Pre-Production environment to the Production environment.
I know of:
using gp_dump
using pg_dump
manually dumping each object (table DDL, function DDL, view DDL, sequence DDL, etc.)
I want to know the best strategy, and the pros and cons of each, if only the objects need to be backed up and restored from one environment to the other.
I need your valuable input on this.
The available strategies, ranked by priority:
Use gpcrondump and gpdbrestore. This will work only if the number of segments in Pre-Production and Production is the same and the dbids are the same. It is the fastest way to transfer the whole database with its schema, as it works as a parallel dump and parallel restore. Since it is a backup, it will lock pg_class for a short time, which might create some problems on the Production system.
If the number of objects to transfer is small, you can use the gptransfer utility; see the user guide for reference. It gives you the ability to transfer data directly between the segments of Pre-Production and Production. The requirement is that all the segment servers of the Pre-Production environment must be able to see all the segments of Production, which means they should be added to the same VLAN for data transfer.
Write custom code and use writable external tables and readable external tables over a pipe object on a shared host. You would also have to write some manual code to compare DDL. The benefit of this method is that you can reuse the external tables to transfer data between environments many times, and if the DDL has not changed, your transfer will be fast, as the data is not written to disk. But all the data is transferred through a single host, which might be a bottleneck (up to 10 Gbps transfer rate with dual 10GbE connections on the shared host). Another big advantage is that there are no locks on pg_class.
Run gpcrondump on the source system and restore the data serially on the target system. This is the way to go if you want a backup-restore solution and your source and target systems have a different number of segments.
In general, everything depends on what you want to achieve: move the objects a single time, move them once a month during a period of inactivity on the clusters, move all the objects weekly without stopping production, move selected objects daily without stopping production, etc. The choice really depends on your needs.

Big Data with Minimal Disk Operation - MapReduce

I need to process 10 TB of text in thousands of files that are on a remote server. I want to process them on my local machine with 3 GB RAM and a 50 GB HDD. I need an abstraction layer to download the files from the remote server on demand, process them (MapReduce), discard them, and load some more files.
With HDFS, I would need to load the files into HDFS first, and then things should be straightforward, but I would have to do the memory management myself. I want something that takes care of this: something like remote links or symbolic links in HDFS to a remote file, which downloads the files, loads them into memory, processes them, then discards them and moves on to more files.
For now I use AMPLab Spark to do the parallel processing for me, but at this level of processing it gives up.
I want a one-liner in something like Spark:
myFilesRDD.map(...).reduce(...)
The RDD should take care of it.
Map/Reduce is for breaking up work over a cluster of machines, but it sounds like you have a single machine: your local one. You might want to look at R, as it has built-in commands to load data across the network. Out of the box it won't give you the virtual-memory-like facade you've described, but if you can tolerate writing an iterative loop and loading the data in chunks yourself, then R can not only give you the remote data loading you seek, its rich collection of available libraries can also facilitate any sort of processing you could desire. A sketch of that chunked loop follows.
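The chunked loop described above is language-agnostic; here is a rough C++ sketch (not R, and not a Spark RDD) that streams each remote file through curl via POSIX popen, processes it in fixed-size chunks, and writes nothing to the 50 GB disk. The URLs and process_chunk() are placeholders:

#include <array>
#include <cstdio>
#include <string>
#include <vector>

// Placeholder for your real map/reduce-style processing of one chunk of text.
void process_chunk(const char* data, size_t length);

int main() {
  // Hypothetical file list; in practice you would fetch a manifest from the server.
  std::vector<std::string> urls = {
      "https://files.example.com/part-0001.txt",
      "https://files.example.com/part-0002.txt",
  };

  std::array<char, 1 << 16> buffer;  // 64 KiB at a time: fits easily in 3 GB RAM
  for (const std::string& url : urls) {
    // Stream the remote file; nothing touches the local disk.
    std::string command = "curl -s " + url;
    FILE* pipe = popen(command.c_str(), "r");  // POSIX popen
    if (pipe == nullptr) continue;

    while (size_t n = fread(buffer.data(), 1, buffer.size(), pipe)) {
      process_chunk(buffer.data(), n);  // process, then the chunk is discarded
    }
    pclose(pipe);  // done with this file; move on to the next one
  }
  return 0;
}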

NuoDB and HDFS as storage

Using HDFS as storage for NuoDB: would this have a performance impact?
If I understand correctly, HDFS is better suited for batch-mode or write-once/read-many types of applications. Would it not increase the latency for a record to be fetched in case it needs to be read from storage?
On top of that, with the HDFS block-size concept, keeping file sizes small would increase the network traffic while data is being fetched. Am I missing something here? Please point it out.
How would NuoDB manage these kinds of latency gotchas?
Good afternoon,
My name is Elisabete and I am the Technical Support Engineer over at NuoDB. I believe that I may have just answered this via your post on our own forum, but I'm responding here as well for anyone else who's curious.
First... a mini lesson on NuoDB architecture/layout:
The most basic NuoDB set-up includes:
Broker Agent
Transaction Engine (TE)
Storage Manager (SM) connected to an Archive Directory
Broker Agents keep track of all the moving parts in the domain (collection of machines hosting NuoDB processes) and provide client applications with connection information for the next available Transaction Engine.
Transaction Engines process incoming SQL requests and manage transactions.
Storage Managers read and write data to and from "disk" (the Archive Directory).
All of these components can reside on a single machine, but an optimal setup would have them spread across multiple host machines (allowing each process to take full advantage of the host's available CPU/RAM). Also, while it's possible to run with just one of each component, this is a case where more is definitely more. Additional Brokers provide resiliency, additional TEs increase performance/speed, and additional SMs ensure durability.
OK, so now let's talk about storage:
This is the "Archive Directory" that your storage manager is writing to. Currently, we support three modes of storage:
Local File System
Amazon Web Services: Simple Storage Service (S3), Elastic Block Store (EBS)
Hadoop Distributed File System (HDFS)
So, to elaborate on how NuoDB works with HDFS... it doesn't know about the multiple machines that the HDFS layer is writing to. As far as the SM is concerned, it is reading and writing data atoms to a single directory. The HDFS layer decides how to then distribute and retrieve data to and from the cluster of machines it resides over.
And now to finally address the question of latency:
Here's the thing: whenever we introduce a remote storage device, we inevitably introduce some amount of additional latency, because the SM now has further to go when reading/writing atoms to/from memory. HDFS likely adds a bit more, because now it needs to do its magic of divvying up, distributing, retrieving and reassembling data. Add to that discrepancies in network speed, etc.
I imagine that the gained disk space outweighs the cost in travel time, but this is something you'd have to decide on a case by case basis.
Now, all of that said... I haven't yet mentioned that TEs and SMs both have the ability to cache data in local memory. The size of this cache is something you can set when starting up each process. NuoDB uses a combination of Multi-Version Concurrency Control (MVCC) and a near-constant stream of communication between all of the processes to ensure that data held in cache is kept up to date with all of the changes happening within the system. Garbage collection also kicks in and clears out atoms in least-recently-used order when the cache grows close to hitting its limit (a toy sketch of the LRU idea follows below).
All of this helps reduce latency, because the TEs can hold onto the data they reference most often and grab copies of data they don't have from sibling TEs. When they do resort to asking the SMs for data, there's a chance that the SM (or one of its sibling SMs) has a copy of the requested data in its local cache, saving itself the trip out to the Archive Directory.
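This is not NuoDB code, just a toy sketch of the least-recently-used eviction policy described above, so the "clears out atoms in LRU order" idea is concrete (keys and values are plain strings here; real atoms are obviously richer):

#include <list>
#include <string>
#include <unordered_map>

// Toy LRU cache: evict the least recently used entry once capacity is reached.
class LruCache {
 public:
  explicit LruCache(size_t capacity) : capacity_(capacity) {}

  void Put(const std::string& key, const std::string& value) {
    auto it = index_.find(key);
    if (it != index_.end()) order_.erase(it->second);  // refresh existing entry
    order_.push_front({key, value});                   // newest entry at the front
    index_[key] = order_.begin();
    if (order_.size() > capacity_) {                   // cache is full:
      index_.erase(order_.back().first);               // drop the least recently
      order_.pop_back();                               // used entry
    }
  }

  const std::string* Get(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return nullptr;            // miss: caller goes to storage
    order_.splice(order_.begin(), order_, it->second); // mark as recently used
    return &it->second->second;
  }

 private:
  size_t capacity_;
  std::list<std::pair<std::string, std::string>> order_;
  std::unordered_map<std::string,
                     std::list<std::pair<std::string, std::string>>::iterator>
      index_;
};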
Whew... that was a lot, and I absolutely glossed over more than a few concepts. These topics are covered in greater depth in the new suite of white papers (and the new "green book") available on our main website. I'm also currently working on some visual guides to help explain all of this.
If you'd like to know more about NuoDB, or if I didn't quite answer your question, please reach out to me directly via the NuoDB Community Forums (I respond to posts there a bit faster).
Thank you,
Elisabete
Technical Support Engineer at NuoDB

Amazon EC2 and EBS using Windows AMIs

I put our application on EC2 (Windows 2003 x64 server) and attached up to 7 EBS volumes. The app is very I/O intensive to storage -- typically we use DAS with NTFS mount points (usually around 32 mount points, each to 1 TB drives), so I tried to replicate that using EBS, but the I/O rates are bad: 22 MB/s tops. We suspect the NIC to the EBS volumes (which are dynamic SANs, if I read correctly) is limiting the pipeline. Our app mostly uses streaming disk access (not random), so it works best for us when very little gets in the way of our talking to the disk controllers and handling I/O directly.
Also, when I create a volume and attach it, I see it appear in the instance (fine), then I make it into a dynamic disk pointing to my mount point and quick format it -- when I do this, does all the data on the volume get wiped? It certainly seems so when I attach it to another instance. I must be missing something.
I'm curious if anyone has experience putting I/O-intensive apps up on the EC2 cloud and, if so, what's the best way to set up the volumes?
Thanks!
I've had limited experience, but I have noticed one small thing:
The initial write is generally slower than subsequent writes.
So if you're streaming a lot of data to disk, like writing logs, this will likely bite you. But if you make a big file, fill it with data, and do a lot of random-access I/O to it, it gets better the second time you write to any specific location. A sketch of that kind of pre-warming follows.
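A rough sketch of the "fill it with data first" idea: pre-touch every block of a scratch file once, so the later latency-sensitive writes land on already-initialized blocks. The path, file size and block size are arbitrary placeholders:

#include <cstdio>
#include <vector>

int main() {
  const size_t kBlockSize = 1 << 20;                 // write 1 MiB at a time
  const unsigned long long kFileSize = 10ULL << 30;  // 10 GiB scratch file (arbitrary)
  std::vector<char> block(kBlockSize, 0);

  // Hypothetical mount point on the EBS-backed dynamic disk.
  FILE* file = std::fopen("D:\\prewarm.dat", "wb");
  if (file == nullptr) return 1;

  // Touch every block once so the first-write penalty is paid up front,
  // before the application's real streaming writes begin.
  for (unsigned long long written = 0; written < kFileSize; written += kBlockSize) {
    std::fwrite(block.data(), 1, block.size(), file);
  }
  std::fclose(file);
  return 0;
}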