HDFS Storage Policies Change, but file size does not change

When I change a file's storage policy from HOT to COLD, I expect the file size to shrink because the archive storage type has a higher storage density, but so far nothing has changed.

Changing the storage type won't change the file size; the file size is determined by the content of the file.
Changing the storage policy from HOT to COLD only ensures that the blocks of data belonging to that file move from DISK to ARCHIVE storage. The size does not change; only the storage type of the blocks does.
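For reference, here is a minimal sketch of the CLI calls usually involved in such a policy change, built as argument lists in Python so nothing is executed by accident. The helper name `storage_policy_commands` and the example path are hypothetical. As far as I know, setting the policy only records it on the file, and blocks that were already written are relocated when the mover runs, which may be why no change is visible right away. Either way, the file length stays the same.

```python
def storage_policy_commands(path, policy="COLD"):
    """Build the HDFS CLI calls (as argv lists) for a policy change.

    Nothing is executed here; pass each list to subprocess.run on a
    cluster node if you want to run them.
    """
    return [
        # Record the new policy on the file or directory.
        ["hdfs", "storagepolicies", "-setStoragePolicy", "-path", path, "-policy", policy],
        # Migrate existing block replicas so they match the policy.
        ["hdfs", "mover", "-p", path],
        # Read the policy back to confirm it was applied.
        ["hdfs", "storagepolicies", "-getStoragePolicy", "-path", path],
    ]


# "/data/archive/part-0000" is a hypothetical path used for illustration.
for argv in storage_policy_commands("/data/archive/part-0000"):
    print(" ".join(argv))
```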


How does the CAP Theorem apply on HDFS?

I just started reading about Hadoop and came across the CAP theorem. Can you please shed some light on which two components of CAP would be applicable to an HDFS system?
Argument for Consistency
The document very clearly says:
"The consistency model of a Hadoop FileSystem is one-copy-update-semantics; that of a traditional local POSIX filesystem."
(One-copy update semantics means that all processes accessing or updating a given file see its contents as if only a single copy of the file existed.)
Moving forward, the document says:
"Create. Once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the file and its data."
"Update. Once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the new data."
"Delete. once a delete() operation on a path other than “/” has completed successfully, it MUST NOT be visible or accessible. Specifically, listStatus(), open() ,rename() and append() operations MUST fail."
The above mentioned characteristics point towards the presence of "Consistency" in the HDFS.
Source: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/introduction.html
Argument for Partition Tolerance
HDFS provides High Availability for both Name Nodes and Data Nodes.
Source: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
Argument for Lack of Availability
It is very clearly mentioned in the documentation (under the section "Operations and failures"):
"The time to complete an operation is undefined and may depend on the implementation and on the state of the system."
This indicates that the "Availability" in the context of CAP is missing in HDFS.
Given the above-mentioned arguments, I believe HDFS supports "Consistency and Partition Tolerance" and not "Availability" in the context of the CAP theorem.
C – Consistency (All nodes see the data in homogeneous form i.e. every node has the same knowledge of data at any instant of time)
A – Availability (A guarantee that every request receives a response which may be processed or failed)
P – Partition Tolerance (The system continues to operate even if a message is lost or part of the system fails)
As for Hadoop, it supports the Availability and Partition Tolerance properties. The Consistency property is not supported because only the NameNode has the information about where the replicas are placed; this information is not available to every node of the cluster.

Fixed chunk size S3

Can I change the chunk size so that it is always the same? What I need is a fixed chunk size of 100 MB. I want to send files from the browser without passing them through the server. I use signature version 4. It would be great if we could restrict the maximum file size and the maximum chunk size.

Spark writing/reading to/from S3 - Partition Size and Compression

I am doing an experiment to understand which file size behaves best with s3 and [EMR + Spark]
Input data :
Incompressible data: Random Bytes in files
Total Data Size: 20GB
Each folder has varying input file size: From 2MB To 4GB file size.
Cluster Specifications :
1 master + 4 nodes : C3.8xls
--driver-memory 5G \
--executor-memory 3G \
--executor-cores 2 \
--num-executors 60 \
Code :
scala> def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result // return the block's value so the caller can keep using it
}
time: [R](block: => R)R
scala> val inputFiles = time { sc.textFile("s3://bucket/folder/2mb-10240files-20gb/*/*") }
scala> val outputFiles = time { inputFiles.saveAsTextFile("s3://bucket/folder-out/2mb-10240files-20gb/") }
2MB - 32MB: Most of the time is spent opening file handles [Not efficient]
64MB to 1GB: Spark launches 320 tasks for all of these file sizes, so the task count no longer follows the number of files in the bucket holding the 20GB. For example, with 512 MB files there were 40 files making up the 20GB, which could have been processed with just 40 tasks, but instead there were 320 tasks, each dealing with 64MB of data.
4GB file size: 0 bytes output [Not able to handle in-memory / Data not even splittable ???]
Any default setting that forces input size to be dealt with to be 64MB ??
Since the data I am using is random bytes and is already compressed how is it splitting this data further? If it can split this data why is it not able to split file size of 4gb object file
Why is compressed file size increased after uploading via spark? The 2MB compressed input file becomes 3.6 MB in the output bucket.
Since it is not specified, I'm assuming usage of gzip and Spark 2.2 in my answer.
Any default setting that forces input size to be dealt with to be 64MB ??
Yes, there is. Spark reads files through Hadoop's FileSystem API, and therefore treats S3 as a block-based file system even though it is an object store.
So the real question here is: which implementation of the S3 file system are you using (s3a, s3n, etc.)? A similar question can be found here.
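The 320 tasks reported in the question are consistent with a 64 MB split size. Below is a rough sketch of that arithmetic, assuming each file is cut into 64 MB pieces independently and files are never merged into one task. The function name `estimate_tasks` is made up for illustration, and the model ignores details such as split slop and `minPartitions`, so treat it as an approximation rather than what Spark actually computes.

```python
MIB = 1024 ** 2


def estimate_tasks(file_sizes_bytes, split_bytes=64 * MIB):
    """Estimate the task count: every file yields ceil(size / split) splits."""
    total = 0
    for size in file_sizes_bytes:
        splits = -(-size // split_bytes)  # ceiling division on integers
        total += max(1, splits)           # even a tiny file needs one task
    return total


# Layouts taken from the question: 20 GB as 40 x 512 MB or 10240 x 2 MB.
print(estimate_tasks([512 * MIB] * 40))
print(estimate_tasks([2 * MIB] * 10240))
```

Under this model the 512 MB layout gives 8 splits per file, so 40 files end up as 320 tasks, while the 2 MB layout gives one task per file.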
Since the data I am using is random bytes and is already compressed how is it splitting this data further? If it can split this data why is it not able to split file size of 4gb object file size?
Spark docs indicate that it is capable of reading compressed files:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
This means that your files were read quite easily and converted to a plaintext string for each line.
However, you are using compressed files. Assuming it is a non-splittable format such as gzip, the entire file is needed for decompression. You are running with 3 GB executors, which can satisfy the needs of 4 MB-1 GB files quite well, but can't handle a file larger than 3 GB at once (probably less after accounting for overhead).
Some further info can be found in this question. Details of splittable compression types can be found in this answer.
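To illustrate why gzip counts as non-splittable, here is a small sketch using only Python's standard library (not Spark). It tries to decode the same gzip stream from the beginning and from the middle. The payload and the helper name `can_decompress_from` are made up for the example.

```python
import gzip
import zlib


def can_decompress_from(data: bytes, offset: int) -> bool:
    """Return True if a gzip stream can be decoded starting at `offset`."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # Missing header or corrupt deflate data: decoding cannot start here.
        return False


# Hypothetical sample payload; any text would do.
payload = b"line of text\n" * 10_000
compressed = gzip.compress(payload)

print(can_decompress_from(compressed, 0))
print(can_decompress_from(compressed, len(compressed) // 2))
```

A reader that starts at byte 0 can decode the stream, while one that starts halfway through cannot, which is why a single task has to process the whole file.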
Why is compressed file size increased after uploading via spark?The 2MB compressed input file becomes 3.6 MB in output bucket.
As a corollary to the previous point, this means that Spark decompressed the data while reading it as plaintext. When it is written back, it is no longer compressed. To compress the output, you can pass a compression codec as a parameter to the RDD's saveAsTextFile:
inputFiles.saveAsTextFile("s3://path", classOf[org.apache.hadoop.io.compress.GzipCodec])
There are other compression formats available.

Extend GlusterFS on top of LVM

I need to add more space to one of our gluster volumes. The volumes are replica 2 and sit on top of an LVM. The file system is XFS. The current size is 4TB and I want to resize to 6TB. The LVM has enough Free PEs on both replica servers.
--- Physical volume ---
PV Name /dev/sdb
VG Name gluster
PV Size 10,91 TiB / not usable 4,00 MiB
Allocatable yes
PE Size 4,00 MiB
Total PE 2861183
Free PE 1633407
Allocated PE 1227776
PV UUID F3CwNm-dceK-ezPY-7w12-OYT5-FLAH-U0a239
--- Physical volume ---
PV Name /dev/sdb
VG Name gluster
PV Size 10,91 TiB / not usable 4,00 MiB
Allocatable yes
PE Size 4,00 MiB
Total PE 2861183
Free PE 1618047
Allocated PE 1243136
PV UUID dWDEgF-0brq-9e6r-eqpO-jTeK-GJfb-c3MGbE
I've read somewhere that it's enough to extend the LVM and resize the FS on both hosts.
# lvextend -L +2T <lvm>
# xfs_growfs <lvm mountpoint>
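As a quick sanity check on the pvdisplay output above, the sketch below converts the requested +2T into physical extents and compares it with the Free PE count on each server. The host labels and function names are invented for the example; the PE size and Free PE values are the ones shown above.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def extents_needed(grow_bytes, pe_size_bytes=4 * MIB):
    """Number of physical extents required to grow an LV by grow_bytes."""
    return -(-grow_bytes // pe_size_bytes)  # ceiling division


def can_extend(free_pe, grow_bytes, pe_size_bytes=4 * MIB):
    """True if the volume group has enough free extents for the growth."""
    return free_pe >= extents_needed(grow_bytes, pe_size_bytes)


# Free PE values from the two pvdisplay listings; host labels are hypothetical.
free_pe_by_host = {"replica-1": 1633407, "replica-2": 1618047}

for host, free_pe in free_pe_by_host.items():
    free_tib = free_pe * 4 * MIB / TIB
    print(host, round(free_tib, 2), "TiB free,", can_extend(free_pe, 2 * TIB))
```

Growing by 2 TiB with 4 MiB extents takes 524288 extents, and both servers have well over that many free.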
I know that XFS has to be resized while it's mounted. The LVM can also be resized during operation (although that's not recommended). And I've read somewhere that GlusterFS will automatically adapt to the new volume size as soon as both/all volumes have the new size.
Since the storage is used in a production environment, it's important to do this on the fly.
Has anyone any experience with this combination or can confirm that my approach is correct?
Thanks in advance.
To answer my own question:
Eventually we had to increase Gluster storage without downtime.
After extending the logical volume of all Gluster bricks to the same size and growing the XFS filesystem to the maximum size of the volume, the Gluster volume automatically adjusted to the new size.
I'm not sure if it's the recommended way, but it worked for us like a charm (both replica 2 and replica 3 volumes).
According to https://access.redhat.com/solutions/1517993, you can grow a GlusterFS volume by growing its bricks rather than adding new ones.
You can do it following a usual file system resizing method (the method will depend on the underlying block device).
Example: the brick is set on an XFS file system on the LV VGgluster/brick1. We want to add another 500MB to this brick.
Check that you have free space on the VG containing the brick:
# vgs VGgluster
Grow the file system:
# lvextend --resizefs -L+500M VGgluster/brick1
The size has been updated on the client:
# df /glusterfs/mountpoint
From my experience: if it's a replica volume, you have to resize all of the volume's bricks. Until all bricks have the same size, the actual size of that Gluster volume will equal the size of its smallest brick.

Dynamically creating the volume based on the size of ubifs image size

I have a requirement to create a new volume (it can be static) based on the size of the ubifs image (say rootfs.ubifs) which I am going to write into that volume. The aim is to create the volume with the minimum possible size required to write 'rootfs.ubifs' to that volume and boot the device from it.
Can somebody please help me in this regard?
The difference is the overhead of the UBI layer. This is documented as O on the web page:
O - the overhead related to storing EC and VID headers in bytes, i.e. O = SP - SL.
SP is the physical erase block size and SL is what UBIFS will get. Usually, the overhead is the minimum page size times two: one page for the EC header and another for the VID header; these are the two structures that UBI uses to manage the flash. Both are defined in ubi-media.h. EC is the ubi_ec_hdr structure and VID is the ubi_vid_hdr structure. The EC (erase count) header is written every time an erase block is erased and is responsible for wear leveling (see the note at the end). The VID (volume ID) header allows UBI to support multiple volumes and provides the PEB-to-LEB (physical to logical erase block) management.
So for a 2k page NAND flash without sub-pages, it is 4k; if sub-pages are supported, it is possible to put both headers in the same page and only 2k is needed. If your flash page size differs, multiply it by two without sub-pages, and add only a single page of overhead if you have sub-pages. The overhead for NOR flash is 128 bytes (two 64-byte headers), as it doesn't have the idea of pages.
To create your rootfs.ubifs, you must have specified a logical erase block size (to mkfs.ubifs). The difference between the logical erase block (LEB) and the physical erase block (PEB) is just the overhead documented above. Multiply the size of your rootfs.ubifs by PEB/LEB to get the minimum possible size for the UBI volume.
note: If an erase is interrupted (reset/power cycle) between the actual erase and the EC write, an average of all other erase blocks is used to set the erase count when UBI re-reads the ubi device.
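The sizing rule from the answer (image size times PEB/LEB) can be sketched as below. The numbers are hypothetical: a 20 MiB rootfs.ubifs on NAND with 128 KiB erase blocks and 2 KiB pages, no sub-pages. The function names are made up, and rounding up to whole erase blocks is my own assumption on top of the plain PEB/LEB ratio. The sketch also does not account for any blocks UBI reserves for its own use.

```python
def ubi_overhead(page_size, has_sub_pages=False):
    """Bytes per PEB taken by the EC and VID headers on NAND."""
    # Two pages without sub-pages, a single page when both headers share one.
    return page_size if has_sub_pages else 2 * page_size


def min_volume_size(image_bytes, peb_size, page_size, has_sub_pages=False):
    """Return (LEB size, LEB count, physical bytes) needed to hold the image."""
    leb_size = peb_size - ubi_overhead(page_size, has_sub_pages)
    leb_count = -(-image_bytes // leb_size)  # ceiling division
    return leb_size, leb_count, leb_count * peb_size


# Hypothetical numbers: 20 MiB image, 128 KiB PEBs, 2 KiB pages, no sub-pages.
leb_size, leb_count, physical = min_volume_size(20 * 1024 * 1024, 128 * 1024, 2048)
print(leb_size, leb_count, physical)
```

With these inputs the LEB size is 126976 bytes, the image needs 166 LEBs, and those occupy 21757952 bytes of flash.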