HDFS Balancer for a cluster with 1KB files

I have an HDFS cluster with 3 nodes. The cluster holds lots of small (KB-sized) files and I have reached millions of blocks per node.
I have added 4 new servers to the cluster and started the balancer process, but it does not seem to do much. The goal is to reduce the millions of blocks per server.
In order to balance the small files, should I change the value of the following parameter so that files as small as 1KB are moved?
-Ddfs.balancer.getBlocks.min-block-size=1048
** I do know that HDFS is meant to manage big files - we are working on compaction

If you are running a version with the dfs.balancer.getBlocks.min-block-size option, then the balancer will not move blocks below that size.
If you have a cluster with a mix of small and large files, the balancer picks blocks somewhat randomly. So if the majority of blocks are small, it will tend to move many more small blocks than large ones, and then the smaller blocks tend to build up on the nodes with less disk space used.
The above parameter was introduced to stop that happening.
Therefore, if you need to get the small blocks moving, you will need to adjust that setting to something smaller than the default.
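As a minimal sketch of what that might look like, assuming your Hadoop version accepts generic -D overrides on the balancer command line (otherwise the property can be set in hdfs-site.xml on the node running the balancer); the 1024-byte value and the -threshold setting here are only illustrative:

```python
import subprocess

# Minimal sketch: run the balancer with the minimum candidate block size lowered
# to 1KB so that small blocks become eligible for moving. The default for
# dfs.balancer.getBlocks.min-block-size is 10MB (10485760 bytes) in recent versions.
cmd = [
    "hdfs", "balancer",
    "-Ddfs.balancer.getBlocks.min-block-size=1024",  # 1KB instead of the ~10MB default
    "-threshold", "10",                              # example utilisation threshold (%)
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout, result.stderr)
```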


AWS Elasticsearch indexing memory usage issue

The problem: very frequent "403 Request throttled due to too many requests" errors during data indexing, which appears to be a memory usage issue.
The infrastructure:
Elasticsearch version: 7.8
t3.small.elasticsearch instance (2 vCPU, 2 GB memory)
Default settings
Single domain, 1 node, 1 shard per index, no replicas
There are 3 indices with searchable data. Two of them have roughly 1 million documents (500-600 MB) each, and one has 25k (~20 MB). Indexing is not very simple (it has history tracking), so I've been testing refresh with true and wait_for values, or calling it separately when needed. The process uses search and bulk queries (I've been trying sizes of 500 and 1000). There should be a 10MB limit on the AWS side, so these are safely below that. I've also tested adding 0.5/1 second delays between requests, but none of this fiddling really has any noticeable benefit.
The project is currently in development so there is basically no traffic besides the indexing process itself. The smallest index generally needs an update once every 24 hours, larger ones once a week. Upscaling the infrastructure is not something we want to do just because indexing is so brittle. Even only updating the 25k data index twice in a row tends to fail with the above mentioned error. Any ideas how to reasonably solve this issue?
Update 2020-11-10
Did some digging in past logs and found that we used to get 429 circuit_breaking_exception-s (instead of the current 403) with a reason along the lines of [parent] Data too large, data for [<http_request>] would be [1017018726/969.9mb], which is larger than the limit of [1011774259/964.9mb], real usage: [1016820856/969.7mb], new bytes reserved: [197870/193.2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=197870/193.2kb, accounting=4309694/4.1mb]. I used the cluster stats API to track memory usage during indexing, but didn't find anything that I could identify as a direct cause of the issue.
Ended up creating a solution based on the information that I could find. After some searching and reading it seemed like just trying again when running into errors is a valid approach with Elasticsearch. For example:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes
(EsRejectedExecutionException with the Java client), which is the way
that Elasticsearch tells you that it cannot keep up with the current
indexing rate. When it happens, you should pause indexing a bit before
trying again, ideally with randomized exponential backoff.
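As a rough sketch of that randomized exponential backoff, assuming plain HTTP bulk requests against the domain endpoint (the URL and retry limit are placeholders, not part of the original setup):

```python
import random
import time

import requests

ES_URL = "https://my-domain.es.amazonaws.com"  # placeholder domain endpoint


def bulk_with_backoff(payload: str, max_retries: int = 6) -> requests.Response:
    """Send a _bulk request, backing off exponentially (with jitter) when throttled."""
    for attempt in range(max_retries):
        resp = requests.post(
            f"{ES_URL}/_bulk",
            data=payload,
            headers={"Content-Type": "application/x-ndjson"},
        )
        # 429 = too many requests; AWS-managed domains can also answer with 403 throttling errors.
        if resp.status_code not in (429, 403):
            return resp
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    return resp
```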
The same guide also has useful information about refreshes:
The operation that consists of making changes visible to search -
called a refresh - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds.
In my use case indexing is a single linear process that does not occur frequently so this is what I did:
Disabled automatic refreshes (index.refresh_interval set to -1)
Using refresh API and refresh parameter (with true value) when and where needed
When running into a "403 Request throttled due to too many requests" error the program will keep trying every 15 seconds until it succeeds or the time limit (currently 60 seconds) is hit. Will adjust the numbers/functionality if needed, but results have been good so far.
This way the indexing is still fast, but will slow down when needed to provide better stability.
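Roughly, the refresh handling and retry policy described above can be sketched with the same plain HTTP approach; the endpoint and index name are placeholders, and the 15-second/60-second values simply mirror the description:

```python
import time

import requests

ES_URL = "https://my-domain.es.amazonaws.com"  # placeholder domain endpoint
INDEX = "my-index"                              # placeholder index name


def set_refresh_interval(value: str) -> None:
    # index.refresh_interval = "-1" disables automatic refreshes for the index
    requests.put(
        f"{ES_URL}/{INDEX}/_settings",
        json={"index": {"refresh_interval": value}},
    ).raise_for_status()


def refresh_now() -> None:
    # Explicit refresh, called only when changes actually need to become searchable
    requests.post(f"{ES_URL}/{INDEX}/_refresh").raise_for_status()


def send_with_retry(do_request, retry_every: int = 15, time_limit: int = 60):
    # Retry a throttled request every 15 seconds until it succeeds or 60 seconds pass
    deadline = time.monotonic() + time_limit
    while True:
        resp = do_request()
        if resp.status_code != 403 or time.monotonic() >= deadline:
            return resp
        time.sleep(retry_every)


# Typical run: disable refreshes, index with retries, then refresh explicitly.
# set_refresh_interval("-1")
# send_with_retry(lambda: requests.post(f"{ES_URL}/_bulk", data=payload,
#                                       headers={"Content-Type": "application/x-ndjson"}))
# refresh_now()
```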

Split Files - Redshift Copy Command

Hi,
I would like to know if option A in the above image is a valid answer to the question, and why option C is incorrect.
As per the documentation on splitting data into multiple files:
Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For example, each DS2.XL compute node has two slices, and each DS2.8XL compute node has 32 slices. For more information about the number of slices that each node size has, go to About Clusters and Nodes in the Amazon Redshift Cluster Management Guide.
Shouldn't option C, splitting the data into 10 files of equal size, be the right answer?
"A" is correct because of the nature of S3. S3 takes time to "look up" the object you are accessing and then transfers the data to the requestor. This "look up" time is on the order of 0.5 sec, and a lot of data can be transferred in that amount of time. The worst case of this (and I have seen this) is breaking the data into one S3 object per row of data, which means that most of the COPY time will be spent in "look up" time, not transfer time. My own analysis of this (years ago) showed that objects of 100MB spend less than 50% of the COPY time in object look-ups. So 1GB is likely a good, safe, and future-proofed size target.
"C" is wrong because you want as many independent, parallel S3 data transfers as possible occurring during the COPY (within reason and network card bandwidth). Redshift will start one S3 object per slice, and there are multiple slices per node in Redshift. The minimum number of slices is 2 for the small node types, so you would want at least 20 S3 objects, and many more for the larger node types.
Combining these, you want many S3 objects, but not a lot of small (<1GB) objects: big enough that object look-up time is not a huge overhead, and enough of them that all the slices stay busy doing COPY work.
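To make that arithmetic concrete, here is a small worked sketch; the node type, node count, data size, and 1GB target are made-up illustration values, not part of the original question:

```python
import math

# Worked example of the reasoning above; all numbers are illustrative.
TOTAL_DATA_GB = 400        # data to COPY from S3
NODES = 10                 # e.g. 10 DS2.XL nodes
SLICES_PER_NODE = 2        # DS2.XL has 2 slices; DS2.8XL has 32
TARGET_FILE_GB = 1         # ~1GB keeps S3 look-up time a small fraction of COPY time

slices = NODES * SLICES_PER_NODE                              # 20 slices in the cluster
# Keep the file count a multiple of the slice count so every slice gets equal work,
# while keeping each file near the ~1GB target.
multiple = max(1, math.ceil(TOTAL_DATA_GB / (slices * TARGET_FILE_GB)))
file_count = slices * multiple                                 # 20 * 20 = 400 files
file_size_gb = TOTAL_DATA_GB / file_count                      # ~1GB each

print(f"{file_count} files of ~{file_size_gb:.1f}GB across {slices} slices")
```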
It is not the correct answer because:
Unless you know the exact combined size of the output files, you won't be able to split into exactly 10, and if you do, Redshift will probably write file by file, so the unload will be slow.
It is not optimal, because having more, smaller files is better: you will utilize all the cores/nodes you have in the Redshift cluster on unload, and on copy later if you need to load this data again.

Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’

Our ETL developer reports that they have been trying to run our weekly and daily processes on ADW consistently. While for the most part they execute without exception, I am now getting this error:
“Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.”
Is there a limit on TEMPDB space associated with the DWU setting?
The database is limited to 100TB (per the portal) and not full.
Azure SQL Data Warehouse does allocate space for a tempdb, at around 399 GB per 100 DWU. Reference here.
What DWU are you using at the moment? Consider temporarily raising your DWU (aka service objective), lowering it again when your batch process is finished, or refactoring your job to be less dependent on tempdb.
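As a back-of-the-envelope sketch of how that allocation scales (the ~399 GB per 100 DWU figure is the one cited above; the DWU values are just examples):

```python
# Rough tempdb capacity estimate based on the ~399GB-per-100-DWU figure above.
GB_PER_100_DWU = 399


def tempdb_gb(dwu: int) -> float:
    return dwu / 100 * GB_PER_100_DWU


for dwu in (100, 400, 1000):          # example service objectives
    print(f"DWU {dwu}: ~{tempdb_gb(dwu):,.0f} GB of tempdb")
# DWU 100:  ~399 GB
# DWU 400:  ~1,596 GB
# DWU 1000: ~3,990 GB
```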
It might also be worth checking your workload for anything like Cartesian products, excessive sorting, over-dependency on temp tables, etc., to see if any optimisation can be done.
Have a look at the Explain Plans for your code, and see whether you have a lot more data movement going on than you expect. If you find that one query moves a lot more data into Q tables than expected, you can probably tune it to avoid the data movement (which may mean redesigning tables to distribute on a different key).

What is the disk strategy of HDFS datanode when it has multiple disks and block directories?

For example, I have two disks, /dev/sda and /dev/sdb.
The mapping between devices and block directories is as follows:
/dev/sda /data/1/dns/dn
/dev/sdb /data/2/dns/dn
The situation is that /data/1/dns/dn has 10MB of free space and /data/2/dns/dn has 400GB.
My questions about the HDFS strategy are:
Should these two directories store different blocks of an HDFS file?
The space of /data/1/dns/dn is obviously small; will HDFS detect that and write the blocks to the bigger one?
I found that two algorithms exist in the HDFS datanode configuration:
Round robin
Available space
My choice is round robin; it seems that when the disk space is small, the available space algorithm is still applied.

Cassandra disk space overhead

We are running a 6-node Cassandra 2.0.11 cluster with RF=3 at AWS, in a single datacenter across 3 AZs.
Our average data size is about 110GB, and each node has two 80GB disks in RAID 0, creating a single 160GB volume.
We are starting to see the disk fill up whenever a repair or subsequent compaction takes place and are no longer able to rebalance the ring.
Is it time to horizontally scale and move from 6 to 9 nodes?
It seems like 50GB out of 160GB is a lot of overhead to require for "normal" Cassandra operation.
Get more disk space if you can.
Otherwise, consider using leveled compaction if you're low on disk space and only have a small to moderate write load. LCS can save significant disk space during compaction compared to size-tiered compaction.
Also check if you can delete some old snapshots.
First, find the root cause of your disks filling up.
From what you wrote, it sounds to me like the load on the cluster is too high which causes compaction to fall behind. This in turn would cause the disks to fill up.
Check nodetool tpstats to see whether there is a backlog of compactions, and check how many SSTables are in your column families. If this is the case, either scale horizontally to handle the load or tune your current cluster so that it can handle the load being pushed to it.
The cause could also stem from a huge compaction that floods the data drive. I assume you use the size-tiered compaction strategy; its overhead is 50% of your current data at all times, as a big compaction can temporarily add that much data.
One option could be switching to Leveled Compaction Strategy as this only requires an overhead of 10%. Note however that LCS is much harder on the disks.
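To put rough numbers on that for the nodes in the question (110GB of data on a 160GB volume), using the 50% and 10% rules of thumb from this answer:

```python
# Back-of-the-envelope headroom check for the numbers in the question,
# using the rule-of-thumb compaction overheads mentioned above.
DISK_GB = 160                  # RAID 0 of two 80GB disks
DATA_GB = 110                  # approximate data per node
FREE_GB = DISK_GB - DATA_GB    # 50GB left

stcs_headroom = DATA_GB * 0.50   # size-tiered: a big compaction can need ~50% of the data size
lcs_headroom = DATA_GB * 0.10    # leveled: roughly 10% temporary overhead

print(f"free: {FREE_GB}GB, STCS needs ~{stcs_headroom:.0f}GB, LCS needs ~{lcs_headroom:.0f}GB")
# free: 50GB, STCS needs ~55GB, LCS needs ~11GB
# -> with STCS the node is already short on headroom before snapshot and repair overhead.
```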