Is it safe to use an embedded database (RocksDB, BoltDB, BadgerDB) on DigitalOcean block storage?

DigitalOcean block storage uses Ceph, which means that a volume attached to a droplet is physically located on a different machine. So a database file written to that volume goes over the network, not to a local disk.
BoltDB specifically mentions that it is not safe to use over a network file system, but I'm not sure whether that also applies to DO block storage (it's not NFS, but it does go over the network).
Is it safe to use DO block storage for embedded databases? Performance won't be as good, but that is irrelevant if it isn't safe at all.
If the answer is "no, embedded databases should only use local disk", then what are the simple ways to replicate the database (e.g. just once a day or every few hours)?

Is it safe to use embedded database (RocksDB, BoltDB, BadgerDB) on DigitalOcean block storage?
Yes, it is safe to use. DigitalOcean block storage is presented to the droplet as a regular block device, not as a network file system, so the file-locking and mmap semantics that embedded databases rely on still apply.
then what are the simple ways to replicate the database (e.g. just once a day or every few hours)?
Put a timer in your app that creates a checkpoint/backup and uploads it to S3, and also snapshot the instance periodically.
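A minimal sketch of that timer in Go, assuming the database is BoltDB (bbolt) and the AWS SDK for Go v2 is used for the upload; the bucket name, key format, interval, and temp path are all placeholders, not anything prescribed above:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	bolt "go.etcd.io/bbolt"
)

// backupLoop writes a consistent copy of the BoltDB file every interval
// and uploads it to S3 ("my-backups" and the key format are placeholders).
func backupLoop(db *bolt.DB, interval time.Duration) {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	for range time.Tick(interval) {
		tmp := "/tmp/app.db.bak"
		// tx.CopyFile runs inside a read transaction, so the copy is a
		// consistent snapshot even while the app keeps writing.
		if err := db.View(func(tx *bolt.Tx) error {
			return tx.CopyFile(tmp, 0600)
		}); err != nil {
			log.Println("backup failed:", err)
			continue
		}
		f, err := os.Open(tmp)
		if err != nil {
			log.Println(err)
			continue
		}
		_, err = client.PutObject(context.TODO(), &s3.PutObjectInput{
			Bucket: aws.String("my-backups"),
			Key:    aws.String("app-" + time.Now().Format("2006-01-02T15-04") + ".db"),
			Body:   f,
		})
		f.Close()
		if err != nil {
			log.Println("upload failed:", err)
		}
	}
}

func main() {
	db, err := bolt.Open("app.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	go backupLoop(db, 6*time.Hour)
	select {} // the rest of the application would run here
}
```

Instance snapshots would still be taken separately through the provider's API or console; the loop above only covers the database file itself.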

Related

Optimal way to use AWS S3 for a backend application

In order to learn how to connect a backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to a traditional WYSIWYG editor. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the API endpoint of the server to upload the image to. The server has to respond with the URL of the saved file. My server saves the data to S3 and returns the link.
But what if someone for example adds and removes the same file over and over again? Each time, there will be a new request to aws.
And here is the main part of the question: should I optimize this somehow in practice? I'm thinking of saving the files temporarily on my server first, and only synchronising with AWS from time to time. How is this done in practice? I would be very grateful if you could share any tips or resources that I may have missed.
I am sorry for possible mistakes in my English; I am doing my best.
Thank you for your help!
I think you should upload them to S3 as soon as they are available. This way you are ensuring their availability and resistance to failure of your instance. S3 stores files across multiple availability zones (AZs), ensuring reliable long-term storage. On the other hand, an instance operates only within one AZ, and if something happens to it, all your data on the instance is lost. So you could potentially lose an entire batch of images if you wait to upload them.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. If you keep the files in batches on an instance, then depending on the image sizes you may simply run out of space.
Finally, the good practice of developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time. This is achieved by not storing any user data on the instances. This enables you to auto-scale your application and makes it fault tolerant.
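A hedged sketch of that "upload straight to S3 and return the URL" endpoint in Go, using the AWS SDK for Go v2. The bucket name, route, and URL format are placeholders; the form field name ("image") and the response shape ({"success": 1, "file": {"url": ...}}) follow what the editor-js/image README describes, so double-check them against your plugin version:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/google/uuid"
)

var s3Client *s3.Client

const bucket = "my-notepad-images" // placeholder bucket name

func uploadImage(w http.ResponseWriter, r *http.Request) {
	// The image tool posts the file in the "image" multipart form field.
	file, header, err := r.FormFile("image")
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	defer file.Close()

	key := uuid.NewString() + "-" + header.Filename
	_, err = s3Client.PutObject(r.Context(), &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   file,
	})
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Respond with the URL in the shape the editor expects.
	json.NewEncoder(w).Encode(map[string]any{
		"success": 1,
		"file": map[string]string{
			"url": fmt.Sprintf("https://%s.s3.amazonaws.com/%s", bucket, key),
		},
	})
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	s3Client = s3.NewFromConfig(cfg)
	http.HandleFunc("/upload-image", uploadImage)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Because the handler streams each upload directly to S3 and keeps nothing on local disk, the instance stays stateless and can be replaced or scaled out at any time.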

Greenplum Query: Best strategy to Move Objects from Pre-Prod to Prod Env

I have two different environments, Production (new) and Pre-Production (existing). We have been given a ready cluster with Greenplum installed in the new Production environment.
I want to know the best way to move objects from the Pre-Production environment to the Production environment.
I know:
using gp_dump
using pg_dump
Manually dumping each object (table DDL, function DDL, view DDL, sequence DDL, etc.)
I want to know the best strategy and the pros and cons of each, given that only the objects need to be backed up and restored from one environment to the other.
Need your valuable input for the same.
The available strategies, ranked by priority:
Use gpcrondump and gpdbrestore. This will work only if the number of segments in Pre-Production and Production is the same and the dbids are the same. It is the fastest way to transfer the whole database with its schema, as it works as a parallel dump and parallel restore. Because it is a backup, it will lock pg_class for a short time, which might cause some problems on the Production system.
If the number of objects to transfer is small, you can use the gptransfer utility; see the user guide for reference. It gives you the ability to transfer data directly between the segments of Pre-Production and Production. The requirement is that all the segment servers of the Pre-Production environment must be able to see all the segments of Production, which means they should be added to the same VLAN for the data transfer.
Write custom code and use writable external tables and readable external tables over a pipe object on a shared host. You would also have to write some manual code to compare DDL. The benefit of this method is that you can reuse the external tables to transfer the data between environments many times, and if the DDL has not changed, the transfer will be fast because the data is never written to disk. But all the data is transferred through a single host, which might be a bottleneck (up to 10 Gbps transfer rate with dual 10GbE connections on the shared host). Another big advantage is that there are no locks on pg_class. (A rough sketch of the single-host streaming idea appears after this answer.)
Run gpcrondump on the source system and restore the data serially on the target system. This is the way to go if you want a backup-restore solution and your source and target systems have a different number of segments.
In general, everything depends on what you want to achieve: move the objects a single time, move them once a month during a period of inactivity on the clusters, move all the objects weekly without stopping production, move selected objects daily without stopping production, etc. The right choice really depends on your needs.
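None of the utilities above require custom code, but as an illustration of the third option's "stream everything through one host" idea: Greenplum speaks the PostgreSQL wire protocol, so a hedged Go sketch with pgx can pipe a table from one master to the other via COPY without staging anything on disk. The hostnames, credentials, and table name are placeholders, and the target table's DDL is assumed to already exist:

```go
package main

import (
	"context"
	"io"
	"log"

	"github.com/jackc/pgx/v5"
)

// Streams one table from the Pre-Production cluster to Production via COPY,
// routing the data through this single host (hostnames and credentials are placeholders).
func main() {
	ctx := context.Background()

	src, err := pgx.Connect(ctx, "postgres://gpadmin@preprod-master:5432/db")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close(ctx)

	dst, err := pgx.Connect(ctx, "postgres://gpadmin@prod-master:5432/db")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close(ctx)

	pr, pw := io.Pipe()

	// Reader side: COPY the source table out into the pipe.
	go func() {
		_, err := src.PgConn().CopyTo(ctx, pw, "COPY my_schema.my_table TO STDOUT")
		pw.CloseWithError(err)
	}()

	// Writer side: COPY from the pipe into the target table.
	if _, err := dst.PgConn().CopyFrom(ctx, pr, "COPY my_schema.my_table FROM STDIN"); err != nil {
		log.Fatal(err)
	}
	log.Println("transfer complete")
}
```

Like the external-table approach, this funnels all traffic through the host running the program, so the same single-host bandwidth caveat applies.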

Sync data between EC2 instances

While I'm looking to move our servers to AWS, I'm trying to figure out how to sync data between our web nodes.
I would like to mount a disk on every web node and have a local cache of the entire share.
Are there any preferred ways to do this?
Sounds like you should consider storing your files on S3 in the first place and, if performance is key, having a sync job that pulls copies of the files locally to your EC2 instance. S3 is fast, durable and cheap - maybe even fast enough without keeping a local cache - but if you do indeed need a local copy, there are tools such as the AWS CLI and other third-party tools.
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
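A minimal sketch of such a sync job in Go, simply shelling out to `aws s3 sync` on a timer; the bucket, prefix, local path, and interval are placeholders:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// Periodically mirrors an S3 prefix into a local cache directory using the AWS CLI.
// Bucket name and local path are placeholders.
func main() {
	for {
		cmd := exec.Command("aws", "s3", "sync", "s3://my-shared-bucket/assets", "/var/cache/assets")
		out, err := cmd.CombinedOutput()
		if err != nil {
			log.Printf("sync failed: %v\n%s", err, out)
		}
		time.Sleep(5 * time.Minute)
	}
}
```

Running the same job on every web node gives each of them an eventually consistent local copy of the shared files, with S3 remaining the source of truth.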
Depending on what you are trying to sync - take a look at
http://aws.amazon.com/elasticache/
It is an extremely fast and efficient method for sharing data.
One absolutely easy solution is to install the Dropbox sync client on both machines and keep your files in Dropbox. This is by far the easiest!
With this approach, you can "load" data onto the machines by adding files to your Dropbox account externally (without even touching an AWS service), from another machine or even from the Dropbox browser interface.

Cloud hosting - shared storage with direct access

We have an application deployed on AWS using the EC2 and EBS services.
The infrastructure is split into layers (independent instances):
application (with load balancer)
database (master-slave standard schema)
media server (streaming)
background processing (redis, delayed_job)
The application and database instances use a number of EBS block storage devices (root, data), which let us attach/detach them and take EBS snapshots to S3. That's the default way AWS works.
But an EBS volume lives in a specific availability zone and can be attached to only one instance at a time.
The media server is one of the bottlenecks, so we'd like to scale it with a master/slave scheme. For the media server storage we'd like to try a distributed file system that can be attached to multiple servers. What do you advise?
If you're not Facebook or Amazon, then you have no real reason to use something as elaborate as Hadoop or Cassandra. When you reach that level of growth, you'll be able to afford engineers who can choose/design the perfect solution to your problems.
In the meantime, I would strongly recommend GlusterFS for distributed storage. It's extremely easy to install, configure and get up and running. Also, if you're currently streaming files from local storage, you'll appreciate that GlusterFS also acts as local storage while remaining accessible by multiple servers. In other words, no changes to your application are required.
I can't tell you the exact configuration options for your specific application, but there are many available such as distributed, replicated, striped data. You can also play with cache settings to avoid hitting disks on every request, etc.
One thing to note: since GlusterFS is a layer above the other storage layers (particularly with Amazon), you might not get impressive disk performance. Actually it might be much worse than what you have now, for the sake of scalability... basically you could be better off designing your application to serve streaming media from a CDN that already has the correct infrastructure for your type of application. It's something to think about.
HBase/Hadoop
Cassandra
MogileFS
A good, similar question (if I understand correctly):
Lustre, Gluster or MogileFS?? for video storage, encoding and streaming
There are many distributed file systems, just find the one you need.
The above are just the ones I personally know of (I haven't tested them).

Amazon EC2 and EBS using Windows AMIs

I put our application on EC2 (Windows 2003 x64 server) and attached up to 7 EBS volumes. The app is very I/O intensive to storage -- typically we use DAS with NTFS mount points (usually around 32 mount points, each to 1 TB drives), so I tried to replicate that using EBS, but the I/O rates are bad: 22 MB/s tops. We suspect the NIC between the instance and EBS (which are dynamic SANs, if I read correctly) is limiting the pipeline. Our app mostly uses streaming disk access (not random), so it works better for us when very little gets in the way of our talking to the disk controllers and handling I/O directly.
Also, when I create a volume and attach it, I see it appear in the instance (fine), then I make it into a dynamic disk pointing to my mount point and quick format it -- when I do this, does all the data on the volume get wiped? Because it certainly seems so when I attach it to another AMI. I must be missing something.
I'm curious if anyone has any experience putting IO intensive apps up on the EC2 cloud and if so what's the best way to setup the volumes?
Thanks!
I've had limited experience, but I have noticed one small thing:
The initial write is generally slower than subsequent writes.
So if you're streaming a lot of data to disk, like writing logs, this will likely bite you. But if you make a big file, fill it with data, and do a lot of random-access I/O to it, it gets better the second time you write to any specific location.
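A hedged sketch of acting on that observation: pre-fill the file the application will later write to, so the slow first touch of each block happens before real traffic hits. This is written in Go as an illustration only; the path and size are placeholders:

```go
package main

import (
	"log"
	"os"
)

// Pre-fills a file on the EBS-backed disk so the slow first write to each
// block happens ahead of time; later random writes hit already-touched blocks.
// Path and size are placeholders.
func main() {
	const size = 10 << 30 // 10 GiB
	f, err := os.Create("D:\\data\\prewarm.dat")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, 4<<20) // write in 4 MiB chunks of zeros
	var written int64
	for written < size {
		n, err := f.Write(buf)
		if err != nil {
			log.Fatal(err)
		}
		written += int64(n)
	}
	if err := f.Sync(); err != nil { // flush so the blocks are actually allocated
		log.Fatal(err)
	}
	log.Printf("pre-warmed %d bytes", written)
}
```

Whether this is worth the extra write traffic depends on how sensitive your workload is to the first-write penalty; measure before and after rather than assuming a gain.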