If we have a file divided into 128MB blocks and to be stored in HDFS, also lets say we have 2 racks each having 5 nodes to replicate data.
How the replication happens?
Will the different data blocks (lets say 2,3,4) of a same file can be stored on same node(lets say datanode1) in HDFS?
First of all the number of data nodes must be greater than or equal to the replication factor.
How the replication happens: read this article: https://blog.cloudera.com/blog/2015/02/understanding-hdfs-recovery-processes-part-1/
Different data blocks of same file can be stored on same data node - Yes.
Related
Hi,
I would like to know if the following the option A in the above image is a valid answer to the question and why is option C incorrect.
As per the documentation on splitting data into multiple files:
Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For example, each DS2.XL compute node has two slices, and each DS2.8XL compute node has 32 slices. For more information about the number of slices that each node size has, go to About Clusters and Nodes in the Amazon Redshift Cluster Management Guide.
Shouldn't option C, split the data into 10 files of equal size be the right answer?
"A" is correct because of the nature of S3. S3 takes time to "look up" the object you are accessing and then transfers the data to the requestor. This "look up" time is on the order of .5 sec. and a lot of data can be transferred in that amount of time. The worst case of this (and I have seen this) is breaking the data into one S3 object per row of data. This means that most of the COPY time will be in "look up" time, not transfer time. My own analysis of this (years ago) showed that objects of 100MB will spend less than 50% of the COPY time in object look ups. So 1GB is likely a good safe and future proofed size target.
"C" is wrong because you want to have as many independent parallel S3 data transfer occurring during the COPY (within reason and network card bandwith). Since Redshift will start one S3 object per slice and there are multiple slices per node in Redshift. The minimal number of slices is 2 for the small node types so you would want at least 20 S3 objects and many more for the larger node types.
Combining these you want many S3 objects but not a lot of small (<1GB) objects. Big enough that object look up time is not a huge overhead and lots of objects so all the slices will be busy doing COPY work.
It is not the correct answer because
Unless you know the exact size of the output files (combined) you won't be able to split to exactly 10 and if you do this redshift will probably write file by file so the unload will be slow
It is not the optimal because the more smaller files you have, the better it becomes because you will utilize all the cores/nodes you have in the redshift cluster on unload and on copy later if you needed to load this data again
I'm UNLOADing tables from Redshift to S3 for backup. So I am checking to make sure the files are complete if we need them again.
I just did UNLOAD on a table that has size = 1,056 according to:
select "table", size, tbl_rows
FROM svv_table_info;
According to the documentation, the size is "in 1 MB data blocks", so this table is using 1,056 MB. But after copying to S3, the file size is 154 MB (viewing in AWS console).
I copied back to Redshift and all the rows are there, so this has to do with "1 MB data blocks". This is related to how it's saved in the file system, yes?
Can someone please explain? Thank you.
So you're asking why the SVV_TABLE_INFO view claims that the table consumes 1 GB, but when you dump it to disk the result is only 154 MB?
There are two primary reasons. The first is that you're actively updating the table but not vacuuming it. When a row is updated or deleted, Redshift actually appends a new row (yes, stored as columns) and tombstones the old row. To reclaim this space, you have to regularly vacuum the table. While Redshift will do some vacuuming in the background, this may not be enough, or it may not have happened at the time you're looking.
The second reason is that there's overhead required to store table data. Each column in a table is stored as a list of 1 MB blocks, one block per slice (and multiple slices per node). Depending on the size of your cluster and the column data type, this may lead to a lot of wasted space.
For example, if you're storing 32-bit integers, one 1MB block can store 256,000 such integers, requiring a total of 4 blocks to store 1,000,000 values (which is probably close to number of rows in your table). But, if you have a 4-node cluster, with 2 slices per node (ie, a dc2.large), then you'll actually require 8 blocks, because the column will be partitioned across all slices.
You can see the number of blocks that each column uses in STV_BLOCKLIST.
I'm running a Redshift unload command, but am not getting the name I desire. The command is:
UNLOAD ('select * from foo')
TO 's3://mybucket/foo'
CREDENTIALS 'xxxxxx'
GZIP
NULL AS 'NULL'
DELIMITER as '\t'
allowoverwrite
parallel off
The result is mybucket/foo-000.gz. I don't want the slice number to be the end of the file name (it'd be great if it can be eliminated completely), I want to add a file extension at end of the file name. I'd like to see either of the following:
mybucket/foo-000.txt.gz
mybucket/foo.txt.gz
Is there any way to do this (without writing a lambda post process renamer script)?
TL;DR
No.
Explanation:
As it says in Amazon Redshift UNLOAD document, if you do not want it to be split into several parts, you can use PARALLEL FALSE, but it is strongly recommended to leave it enabled. Even then, the file will always include the 000.[EXT] suffix (when the [EXT] exists only when the compression is enabled), because there is a limit to a file size that Redshift can output, as says in the documentation:
By default, UNLOAD writes data in parallel to multiple files,
according to the number of slices in the cluster. The default option
is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or
more data files serially, sorted absolutely according to the ORDER BY
clause, if one is used. The maximum size for a data file is 6.2 GB.
So, for example, if you unload 13.4 GB of data, UNLOAD creates the
following three files.
s3://mybucket/key000 6.2 GB
s3://mybucket/key001 6.2 GB
s3://mybucket/key002 1.0 GB
Therefore, it will alway add at least the prefix 000, because Redshift doesn't know what size of the file he is going to output in the first place, so he's adding this suffix in case the output will reach the size of 6.2 GB.
If you ask why the use of PARALLEL FALSE is not recommended, I'll try to explain it in several points:
The most important reason is because of the way a Redshift cluster designed. Each cluster includes at least 2 servers, when one of them is a leader node and the rest are data nodes. The purpose of leader node, is to control the data nodes, it hold the necessary information to work with all data in Redshift, either read or write.
When you unload data from Redshift while the flag PARALLEL is TRUE, it will create at least X files, when X is the number of nodes you choose to construct the Redshift cluster of, in the first place. It means, that the data is written directly from the data nodes themselves, which is much faster because it's doing it in parallel and skips the leader node.
When you decide to turn this flag to off, all data is gathered from all of the data nodes into a single node, the leader node, because it needs to reorganize the sorting of the rows to output and also compress it if needed as a single stream. This action causes you data to be written much slower.
Also, this is significantly decreases Redshift cluster performance in a matter of reading and writing data, because everything (read and write queries) goes through the leader node, and as it says above, when the leader node is overloaded, there will be a performance issue.
The queries COPY and UNLOAD work directly with the data nodes, therefore, they behave almost the same way as if you would use PARALLEL TRUE. In the contrary, queries like SELECT, UPDATE, DELETE and INSERT, are processed by the leader node, that's why they suffer from the leader node loads.
i have in-memroy of 4GB.the data file iam going to load into GEMFIREXD is of 8GB. how in-memory organize the Remaining data 4 GB data.i read about EVICTION Class but i didn't get any clarification.
While loading the data it copied into disk OR after filling the 4GB it start coping into disk?
help onthis ..
thank you
If you use the EVICTION clause without using the PERSISTENT clause, the data will start being written to disk once you reach the eviction threshold. The least recently used rows will be written to disk and dropped from memory.
If you have a PERSISTENT table, the data is already on disk when you reach your eviction threshold. At that point, the least recently used rows are dropped from memory.
Note that there is still a per row overhead in memory even if the row is evicted.
Doc reference for details:
http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#overflow/configuring_data_eviction.html
- http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#caching_database/eviction_limitations.html
While unloading a large result set to s3, redshift automatically split the files into multiple parts. Is there a way to set the size of each part while unloading?
When unload, you can use maxfilesize to indicate the maximum size of the file.
For exampe:
unload ('select * from venue')
to 's3://mybucket/unload/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
maxfilesize 1 gb;
From here
Redshift by default unloads data into multiple files according to the number of slices in your cluster. So, if you have 4 slices in the cluster, you would have 4 files written by each cluster concurrently.
Here is short answer to your question from the Documentation. Go here for details.
"By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files."
I hope this helps.