HBase: Does copy pasting the directories work?

I'm planning to get a Hadoop/HBase cluster up and I'm trying to figure out which EC2 instance type to use and how much EBS space.
I'm going initially with
1 master (m1.small)
2 slaves (m1.small)
I'm not expecting more than 100 simultaneous users on my website (is that a big load?).
Well, I would attach a 20 GB EBS volume to each of the master and the slaves. These volumes will hold the data and logs from HDFS and HBase.
The HBase path would look like /mnt/hadoop/hbase/root, where /mnt/hadoop is the directory on which the EBS volume (e.g. /dev/sda) is mounted.
Eventually this space will fill up, and when I realize that 20 GB is not enough I would create a bigger volume, say 60 GB (/dev/sdb), and attach it. I'd then copy everything from /dev/sda to /dev/sdb and finally mount /dev/sdb at /mnt/hadoop.
Does HDFS/HBase see any difference after this change? Is it legitimate to do it this way, or is it discouraged?
How do we increase the storage of the device that HBase/HDFS write their data to?

Neither HBase nor Hadoop stops you from doing that; it's merely a matter of changing your configuration parameters accordingly. You need to be careful, though.
But why would you do that when you are already sure you are going to hit the limit? Prevention is better than cure, IMHO.
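For what it's worth, a rough sketch of what the swap could look like on each node, following the question's device names and paths (a key precaution: stop the daemons first so nothing writes to the old volume mid-copy):

    # stop the services before touching their data directories
    stop-hbase.sh
    stop-dfs.sh

    # prepare and temporarily mount the new, larger volume
    sudo mkfs.ext4 /dev/sdb
    sudo mkdir -p /mnt/hadoop-new
    sudo mount /dev/sdb /mnt/hadoop-new

    # copy everything, preserving ownership, permissions and xattrs
    sudo rsync -aHAX /mnt/hadoop/ /mnt/hadoop-new/

    # swap the mounts so the path stays /mnt/hadoop (update /etc/fstab too)
    sudo umount /mnt/hadoop-new /mnt/hadoop
    sudo mount /dev/sdb /mnt/hadoop

    # bring everything back up
    start-dfs.sh
    start-hbase.sh

Because the mount point stays /mnt/hadoop, hbase.rootdir and dfs.datanode.data.dir don't change, so HDFS and HBase should come back up without noticing the difference.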

Related

Getting "[Errno 28] No space left on device" on AWS SageMaker with plenty of storage left

I am running a notebook instance on Amazon SageMaker and my understanding is that notebooks by default have 5GB of storage. I am running into [Errno 28] No space left on device on a notebook that worked just fine the last time I tried it. I checked and I'm using approximately 1.5GB out of 5GB. I'm trying to download a bunch of files from my S3 bucket, but I get the error before even one file is downloaded. Additionally, the notebook no longer autosaves.
Has anyone run into this and figured out a way to fix it? I've already tried clearing all outputs.
Thanks in advance!
Open a terminal and run df -kh to see which filesystem is running out of disk space.
There's a root filesystem, which is 100GB, and there's a user filesystem whose size you can customize (default 5GB) (doc).
My guess: I've seen the root FS run out of space, especially when using Docker.
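To narrow it down quickly from that terminal (the paths below are just the usual suspects and may differ on your instance):

    df -kh                                        # which filesystem is actually full?
    du -sh /home/ec2-user/SageMaker/*             # biggest items on the user volume
    sudo du -sh /tmp /var/lib/docker 2>/dev/null  # common culprits on the root FS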
You will want to try restarting or killing the kernel; sometimes cells are left running while you are trying to execute another operation. Log files, if you have any, can also eat up the space, so try to remove any auxiliary files you are not using.
I work for AWS & my opinions are my own
This is often seen when unused resources are left running.
The default FS size is 100GB.
If you are using SageMaker Studio, you can use [this][1] JupyterLab extension to automatically shut down kernels, terminals and apps in SageMaker Studio when they have been idle for a stipulated period of time. You can configure the idle time limit through the user interface this extension provides.
[1]: https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension
You can resize the persistent storage of your notebook up to 16 TB by editing the notebook details in the AWS console. This volume, however, is mounted under /home/ec2-user/SageMaker. Download your files under this folder and you'll see all the storage you allocated.
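For example, if you pull the files with the AWS CLI, something like this keeps them on that resizable volume instead of on the small root filesystem (bucket and prefix are placeholders):

    cd /home/ec2-user/SageMaker
    mkdir -p data
    aws s3 sync s3://your-bucket/your-prefix/ data/
    df -h /home/ec2-user/SageMaker    # shows the size you allocated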

How to back-up local SSD in Google Cloud Platform

Local SSDs are the fastest storage available in Google Cloud Platform, which makes it clear why people would want to use it. But it comes with some severe drawbacks, most notably that all data on the SSDs will be lost if the VM is shut down manually.
It also seems to be impossible to take an image of a local SSD and restore it to a restarted disk.
What then is the recommended way to back-up local SSD data if I have to shut down the VM (if I want to change the machine specs for example)?
So far I can only think of manually copying my files to a bucket or a persistent disk.
If you have a solution to this, would it work if multiple local SSDs were mounted into a single logical volume?
I'm guessing that you want to create a backup of data stored on local SSD every time you shut down the machine (or reboot).
To achieve this (as @John Hanley commented) you have to copy the data, either manually or with some script, to other storage (persistent disk, bucket, etc.).
If you're running Linux:
Here's a great answer on how to run a script at reboot/shutdown. You can then create a script that copies all the data to some more persistent storage solution.
If I were you I'd just use rsync and cron, run every hour or day (depending on your use case). Here's another great example of how to use rsync to synchronize folders; a sketch of the cron approach follows.
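A minimal sketch of that approach (the mount point, bucket name and schedule are placeholders; a plain rsync to a persistent disk would work the same way):

    # crontab -e, then e.g. once per hour:
    0 * * * * gsutil -m rsync -r /mnt/disks/local-ssd gs://your-backup-bucket/local-ssd >> /var/log/ssd-backup.log 2>&1

Since this copies at the file level rather than the device level, it should also work when several local SSDs are combined into a single logical volume mounted at one path.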
If you're running Windows:
It is also possible to run a command at Windows shutdown, and here's how.

Not able to run Airflow in Cloud Run, getting a disk I/O error

I am trying to run Airflow in Google Cloud Run.
I am getting a disk I/O error; I guess the disk write permission is missing.
Can someone please help me with how to give write permission inside Cloud Run?
I also have to write a file and later delete it.
Only the directory /tmp is writable in Cloud Run. So, change the default write location to write into this directory.
However, you have to be aware of 2 things:
Cloud Run is stateless, which means that when a new instance is created, the container starts from scratch with an empty /tmp directory.
The /tmp directory is an in-memory file system. The maximum allowed memory on Cloud Run is 2 GB, your app's memory footprint included. Between your files and Airflow itself, I'm not sure you will have a lot of space.
A final remark: Cloud Run is active only while it processes a request, and a request has a maximum timeout of 15 minutes. When there is no request, the allocated CPU is close to 0%. I'm not sure what you want to achieve with Airflow on Cloud Run, but my feeling tells me your design is strange, and I prefer to warn you before you spend too much effort on this.
EDIT 1:
The Cloud Run service has evolved in the right direction. As of 2022:
/tmp is no longer the only writable directory (you can write everywhere, but it's still in memory)
the timeout is no longer limited to 15 minutes, but to 60 minutes
The 2nd gen runtime execution (still in preview) allows you to mount NFS (Filestore) or Cloud Storage (GCSFuse) volume to have services "more stateful".
You can also execute jobs now. So, a lot of very great evolution!
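As a sketch of the original suggestion above (keep everything Airflow writes under /tmp; AIRFLOW_HOME is a standard Airflow setting, the path itself is just an example):

    export AIRFLOW_HOME=/tmp/airflow
    mkdir -p "$AIRFLOW_HOME"
    # with the defaults, the SQLite DB and the logs now live under /tmp/airflow
    # (in memory on Cloud Run, and lost whenever a new instance starts)
    airflow db init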
My impression is that you have a write I/O error because you are using SQLite. Is that possible?
If you want to run Airflow in containers, I would recommend using Postgres or MySQL as the backend database.
You can also mount the plugins and DAGs folders on some external volume.
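For example, switching the backend is just a matter of overriding the connection string (host, user and password below are placeholders; the variable name is the Airflow 2.x one):

    export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:secret@10.0.0.5:5432/airflow"
    # older Airflow versions use AIRFLOW__CORE__SQL_ALCHEMY_CONN instead
    airflow db init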

How to make a Linux VM instance on GCP with more than 2 TB of storage? I am trying to download a 2.4 TB file

Is it possible to make one drive on a GCP VM instance with more than 2 TB of storage? I have a 2.5 TB file that I need to get into the VM instance, but if I set the size to 4 TB I still only see 2 TB on the command line.
(screenshot of df -h output)
The reason you only see 2 TB even though you created a 4 TB disk is that the disk uses MBR by default, as described here.
If you would like to change the disk from MBR to GPT, which will allow you to use the full volume size of your disk, you will need to download a third-party tool to help with the conversion.
You can search online for different tools that will convert the disk, such as gptgen for Windows or gdisk for Linux; however, there is a possibility that some data may get corrupted.
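If the 4 TB disk is an additional, still-empty data disk, you can avoid the conversion entirely by giving it a GPT partition table from the start. A sketch on Linux, assuming it shows up as /dev/sdb:

    sudo parted /dev/sdb mklabel gpt
    sudo parted -a opt /dev/sdb mkpart primary ext4 0% 100%
    sudo mkfs.ext4 /dev/sdb1
    sudo mkdir -p /mnt/data
    sudo mount /dev/sdb1 /mnt/data
    df -h /mnt/data    # should now show the full ~4 TB

If the disk already holds data under MBR, back it up before attempting any in-place conversion.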

s3distcp copy from S3 to EMR HDFS data replica always on one node

I am using s3distcp to copy a 500GB dataset into my EMR cluster. It's a 12 node r4.4xlarge cluster each with 750GB disk. It's using the EMR release label emr-5.13.0 and I'm adding Hadoop: Amazon 2.8.3, Ganglia: 3.7.2 and Spark 2.3.0. I'm using the following command to copy the data into the cluster:
s3-dist-cp --src=s3://bucket/prefix/ --dest=hdfs:///local/path/ --groupBy=.*(part_).* --targetSize=128 --outputCodec=none
When I look at the disk usage in either Ganglia or the namenode UI (port 50070 on the EMR cluster), I can see that one node has most of its disk filled while the others all have a similar percentage used. Clicking through a lot of the files (~50) I can see that a replica of each file always appears on the full node.
I'm using Spark to transform this data, write it to HDFS and then copy it back to S3. I'm having trouble with this dataset as my tasks are being killed, though I'm not certain this is the cause of the problem. I don't need to copy the data locally, nor decompress it. Initially I thought the BZIP2 codec was not splittable and that decompressing would help gain parallelism in my Spark jobs, but I was wrong: it is splittable. I have also discovered the hdfs balancer command, which I'm using to redistribute the replicas and see if this solves my Spark problems.
However, now that I've seen what I think is odd behaviour, I would like to understand: is it normal for s3distcp/HDFS to always place a replica of the files on one node?
s3distcp is closed source; I can't comment in detail about its internals.
When HDFS creates replicas of data, it tries to save one block to the local machine, then 2 more elsewhere (assuming replication == 3). Whichever host is running the distcp worker processes will end up having a copy of the entire file. So if only one host is used for the copy, that host fills up.
FWIW, I don't believe you need that distcp at all, not if you can read and filter the data straight off S3 and save the result to HDFS. Your Spark workers will do the filtering and write their blocks back to the machines running those workers and other hosts in the chain. And for short-lived clusters, you could also try lowering the HDFS replication factor (2?) to save on HDFS data across the cluster, at the cost of having one less place for Spark to schedule work adjacent to the data.
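If you do keep a copy in HDFS, the rebalance and the lower replication factor can be applied from the master node. A rough sketch, using the path from the question:

    # spread the existing blocks more evenly across the datanodes
    hdfs balancer

    # drop the replication factor of the already-copied files to 2
    hdfs dfs -setrep -w 2 /local/path

    # for new files, set dfs.replication=2 in hdfs-site.xml (or via the EMR
    # hdfs-site configuration classification at cluster creation)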