Is it possible to run QuestDB on NFS? - questdb

I'm curious if it's possible to run QuestDB on any kind of distributed or networked file system. Are there any limitations or considerations for using this as the main storage for a QuestDB instance? Are there any guidelines or recommendations for other database systems that can be applied to this use case?

We can't use NFS or a similar networked filesystem directly with QuestDB, but we may copy a backup to such a filesystem after a backup has been made.
see
https://questdb.io/docs/operations/backup/

Related

Costs associated with Data Analysis (data cleaning) on the cloud

I am data analyst. My company is moving all data science to a cloud provider (it could be Azure, GCP,AWS). All the data science programming tools like Jupyter notebook will be installed on the cloud environment (there will be no local installations of Python, or Jupyter Notebooks on the laptop).
For most of my work, I will be reading/ingesting relational database tables directly from an on-premise Database. Also most of my data analysis work does not require any GPU instances for data processing. Sometimes, I also do simple research or experimentation data analysis programming such as data cleaning using Jupyter notebooks without the need for usage of GPU instances.
I would like to find out if it would be possible to do such activities without incurring any pay-per-use costs or unnecessary expenses for my company on their data science cloud computing platform given that none of my tasks utilize GPUs? Please advise, thank you.
EDIT Note: It is difficult to work & develop locally with Jupyter on my company PC because I do not have full permissions to install Python packages(usually this has to be requested for approval, which is very painful and takes a very long time).
Jupyter Notebook can be installed in the cloud, but also on prem and on your workstation. You pay either resource in the cloud, on prem, or your worstation.
Of course, if you add large disk, GPUs, CPUs, memory, it costs more! The problem isn't the cost, it is more where do you want to run your notebook?
I think, there is a bad alternative. With Colab you have free Jupyter Notebook instance. But, AFAIK, it's not private, it's public instances and if you work for your company, you can have data leakage. (Not sure, to validate, but it's not a recommended solution in any case)
EDIT 1
Considering your latest comment, I wondering if you need a jupyter notebook to run your code.
Indeed, Jupyter is simply and IDE: you could create your script, even this one that need GPU locally, and to run it on production data on Compute Engine that you provision only for the process. At the end of the script destroy the VM. No Jupyter notebook environment for that, no?
EDIT 2
Thanks to your note, I understand that developing locally isn't an option. In this case, I recommend you to use a managed Jupyter Notebook solution. You can provision this VM on Google Cloud if you want, you can also have different VM, with or without GPU.
The principle is the same: when you stop to work with your instance, stop it. You will only pay for the storage (the disk) when the instance is down.
And the dev principle can be the same: use a small CPU/GPU for your dev, and when you have to process big data, run your script on a powerful VM. Because you pay only when the VM is running, you can optimize cost like that.
In addition to Guillaume's answer, if you want to keep track or to plan ahead if there are cost that will occur while using instances. You can use Google Cloud Platform's Pricing calculator:
https://cloud.google.com/products/calculator?hl=en
With this, you can can choose what product do you're interested to, what kind of components do want in your set-up (e.g. how many RAM, capacity of your storage space, CPU)in case you choose to use GCP Compute Engine, choose what location you are and check if that location price suits your company's budget.
If you want to have more information regarding Google Cloud Platform pricing, you can check out this link:
https://cloud.google.com/compute/all-pricing#compute-optimized_machine_types

Is it safe to use embedded database (RocksDB, BoltDB, BadgerDB) on DigitalOcean block storage?

DigitalOcean block storage uses ceph which means that volume attached to the droplet would be physically located on a different machine. So a database file written to this volume would be using network, not the local disk.
BoltDB specifically mentions that it's not safe to use over network file system, but I'm not sure if that also applies to DO block storage (it's not NFS, but it does use network).
Is it safe to use DO block storage for embedded databases? Yes, performance would be not as good, but that is not relevant if it's entirely not safe.
If the answer is "no, embdeeded databases should only use local disk", then what are the simple ways to replicate the database (e.g. just once a day or few hours)?
Is it safe to use embedded database (RocksDB, BoltDB, BadgerDB) on DigitalOcean block storage?
Yes, it is safe to use
then what are the simple ways to replicate the database (e.g. just once a day or few hours)?
Put a timer thread in your app that creates a checkpoint/backup and uploads it to s3 and also snapshot the instance

Does Dask communicate with HDFS to optimize for data locality?

In Dask distributed documentation, they have the following information:
For example Dask developers use this ability to build in data locality
when we communicate to data-local storage systems like the Hadoop File
System. When users use high-level functions like
dask.dataframe.read_csv('hdfs:///path/to/files.*.csv') Dask talks to
the HDFS name node, finds the locations of all of the blocks of data,
and sends that information to the scheduler so that it can make
smarter decisions and improve load times for users.
However, it seems that the get_block_locations() was removed from the HDFS fs backend, so my question is: what is the current state of Dask regarding to HDFS ? Is it sending computation to nodes where data is local ? Is it optimizing the scheduler to take into account data locality on HDFS ?
Quite right, with the appearance of arrow's HDFS interface, which is now preferred over hdfs3, the consideration of block locations is no longer part of workloads accessing HDFS, since arrow's implementation doesn't include the get_block_locations() method.
However, we already wanted to remove the somewhat convoluted code which made this work, because we found that the inter-node bandwidth on test HDFS deployments was perfectly adequate that it made little practical difference in most workloads. The extra constrains on the size of the blocks versus the size of the partitions you would like in-memory created an additional layer of complexity.
By removing the specialised code, we could avoid the very special case that was being made for HDFS as opposed to external cloud storage (s3, gcs, azure) where it didn't matter which worker accessed which part of the data.
In short, yes the docs should be updated.

Why HDFS is write once and read multiple times?

I am a new learner of Hadoop.
While reading about Apache HDFS I learned that HDFS is write once file system. Some other distributions ( Cloudera) provides append feature. It will be good to know rational behind this design decision. In my humble opinion, this design creates lots of limitations on Hadoop and make it suitable for limited set of problems( problems similar to log analytic).
Experts comment will help me to understand HDFS in better manner.
HDFS follows the write-once, read-many approach for its files and applications. It assumes that a file in HDFS once written will not be modified, though it can be access ‘n’ number of times (though future versions of Hadoop may support this feature too)! At present, in HDFS strictly has one writer at any time. This assumption enables high throughput data access and also simplifies data coherency issues. A web crawler or a MapReduce application is best suited for HDFS.
As HDFS works on the principle of ‘Write Once, Read Many‘, the feature of streaming data access is extremely important in HDFS. As HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data. HDFS overlooks a few POSIX requirements in order to implement streaming data access.
http://www.edureka.co/blog/introduction-to-apache-hadoop-hdfs/
There are three major reasons that HDFS has the design it has,
HDFS was designed by slavishly copying the design of Google's GFS, which was intended to support batch computations only
HDFS was not originally intended for anything but batch computation
Design a real distributed file system that can support high performance batch operations as well as real-time file modifications is difficult and was beyond the budget and experience level of the original implementors of HDFS.
There is no inherent reason that Hadoop couldn't have been built as a fully read/write file system. MapR FS is proof of that. But implementing such a thing was far outside of the scope and capabilities of the original Hadoop project and the architectural decisions in the original design of HDFS essentially preclude changing this limitation. A key factor is the presence of the NameNode since HDFS requires that all meta-data operations such as file creation, deletion or file length extensions round-trip through the NameNode. MapR FS avoids this by completely eliminating the NameNode and distributing meta-data throughout the cluster.
Over time, not having a real mutable file system has become more and more annoying as the workload for Hadoop-related systems such as Spark and Flink have moved more and more toward operational, near real-time or real-time operation. The responses to this problem have included
MapR FS. As mentioned ... MapR implemented a fully functional high performance re-implementation of HDFS that includes POSIX functionality as well as noSQL table and streaming API's. This system has been in performance for years at some of the largest big data systems around.
Kudu. Cloudera essentially gave up on implementing viable mutation on top of HDFS and has announced Kudu with no timeline for general availability. Kudu implements table-like structures rather than fully general mutable files.
Apache Nifi and the commercial version HDF. Hortonworks also has largely given up on HDFS and announced their strategy as forking applications into batch (supported by HDFS) and streaming (supported by HDF) silos.
Isilon. EMC implemented the HDFS wire protocol as part of their Isilon product line. This allows Hadoop clusters to have two storage silos, one for large-scale, high-performance, cost-effective batch based on HDFS and one for medium-scale mutable file access via Isilon.
other. There are a number of essentially defunct efforts to remedy the write-once nature of HDFS. These include KFS (Kosmix File System) and others. None of these have significant adoption.
An advantage of this technique is that you don't have to bother with synchronization. Since you write once, your reader are guaranteed that the data will not be manipulated while they read.
Though this design decision does impose restrictions, HDFS was built keeping in mind efficient streaming data access.
Quoting from Hadoop - The Definitive Guide:
HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern. A dataset is typically generated or copied
from source, and then various analyses are performed on that dataset over time.
Each analysis will involve a large proportion, if not all, of the dataset, so the time
to read the whole dataset is more important than the latency in reading the first
record.

Cloud hosting - shared storage with direct access

We have an application deployed across AWS with using EC2, EBS services.
The infrastructure dropped by layers (independent instances):
application (with load balancer)
database (master-slave standard schema)
media server (streaming)
background processing (redis, delayed_job)
Application and Database instance use number of EBS block storage devices (root, data), which help us to attach/detach them and do EBS snapshots to S3. It's pretty default way how AWS works.
But EBS should be located in a specific zone and can be attached to one instance only in the same time.
Media server is one of bottlenecks, so we'd like to scale them with master/slave schema. So for the media server storage we'd like to try distributed file systems can be attached to multiple servers. What do you advice?
If you're not Facebook or Amazon, then you have no real reason to use something as elaborate as Hadoop or Cassandra. When you reach that level of growth, you'll be able to afford engineers who can choose/design the perfect solution to your problems.
In the meantime, I would strongly recommend GlusterFS for distributed storage. It's extremely easy to install, configure and get up and running. Also, if you're currently streaming files from local storage, you'll appreciate that GlusterFS also acts as local storage while remaining accessible by multiple servers. In other words, no changes to your application are required.
I can't tell you the exact configuration options for your specific application, but there are many available such as distributed, replicated, striped data. You can also play with cache settings to avoid hitting disks on every request, etc.
One thing to note, since GlusterFS is a layer above the other storage layers (particularly with Amazon), you might not get impressive disk performance. Actually it might be much worst than what you have now, for the sake of scalability... basically you could be better-off designing your application to serve streaming media from a CDN who already has the correct infrastructure for your type of application. It's something to think about.
HBase/Hadoop
Cassandra
MogileFS
Good same question (if I understand correctly):
Lustre, Gluster or MogileFS?? for video storage, encoding and streaming
There are many distributed file systems, just find the one you need.
The above are just part which I personally know (haven't tested them).