HDFS Storage Policies SSD and Memory do not work as expected [closed]

Closed. This question is not about programming or software development. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 3 days ago.
This post was edited and submitted for review 3 days ago and failed to reopen the post:
Original close reason(s) were not resolved
How can I determine whether the data is actually written to memory, which would show that the LAZY_PERSIST setting works? When I put a file, the data is still written to disk.
hdfs storagepolicies -getStoragePolicy -path /hot/lazy_persist
The storage policy of /hot/lazy_persist:
BlockStoragePolicy{LAZY_PERSIST:15, storageTypes=[RAM_DISK, DISK], creationFallbacks=[DISK], replicationFallbacks=[DISK]}
hdfs --loglevel debug dfs -put test.csv /hot/lazy_persist
23/02/16 18:22:55 DEBUG sasl.SaslDataTransferClient: SASL client skipping handshake in secured configuration with privileged port for addr = /10.12.16.198, datanodeId = DatanodeInfoWithStorage[10.12.16.198:1019,DS-4430a0df-5509-4d74-b30a-764534e28a2f,DISK]
23/02/16 18:22:55 DEBUG hdfs.DataStreamer: nodes [DatanodeInfoWithStorage[10.12.16.198:1019,DS-4430a0df-5509-4d74-b30a-764534e28a2f,DISK], DatanodeInfoWithStorage[10.12.65.71:1019,DS-a2dcb888-c61f-446f-ad25-4df7e18ffc28,DISK], DatanodeInfoWithStorage[10.12.17.2:1019,DS-a15a2500-ffda-4b3b-bcd4-35d4d7f41952,DISK]] storageTypes [DISK, DISK, DISK] storageIDs [DS-4430a0df-5509-4d74-b30a-764534e28a2f, DS-a2dcb888-c61f-446f-ad25-4df7e18ffc28, DS-a15a2500-ffda-4b3b-bcd4-35d4d7f41952]
The log shows that every replica went to DISK; there should be one replica on RAM_DISK (in memory).
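A minimal Java sketch for checking where the replicas of a file actually land, assuming a Hadoop 2.8+/3.x client where BlockLocation exposes getStorageTypes(); the path is just an example:

// Minimal sketch: print the storage type of every replica of the uploaded file.
// Assumes a Hadoop 2.8+/3.x client where BlockLocation.getStorageTypes() exists.
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.StorageType;

public class CheckReplicaStorage {
    public static void main(String[] args) throws Exception {
        Path file = new Path("/hot/lazy_persist/test.csv"); // example path
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block; each lists the hosts and their storage types.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            StorageType[] types = loc.getStorageTypes();
            System.out.println(Arrays.toString(loc.getHosts()) + " -> " + Arrays.toString(types));
            // If LAZY_PERSIST took effect, at least one entry should be RAM_DISK.
        }
    }
}

Running hdfs fsck /hot/lazy_persist/test.csv -files -blocks -locations should show the same DatanodeInfoWithStorage[...] entries, including the storage type.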
free -g
total used free shared buff/cache available
Mem: 125 10 42 0 72 112
Swap: 0 0 0
df -h
...
tmpfs 10G 28K 10G 1% /mnt/dn-tmpfs
dfs.datanode.data.dir config:
[RAM_DISK]/mnt/dn-tmpfs,[SSD]/data/hadoop/hdfs/data,/data1/hadoop/hdfs/data.....
The configuration above follows the documentation; I checked it and found no problem:
https://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Introduction
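The policy can also be applied and exercised from the Java API. A rough sketch, assuming Hadoop 2.8+ where FileSystem exposes setStoragePolicy(); paths and values are illustrative:

// Rough sketch: set the LAZY_PERSIST policy programmatically and request
// lazy persist semantics on the create call itself. Paths are illustrative.
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class LazyPersistWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/hot/lazy_persist");
        Path file = new Path(dir, "from_api.csv"); // illustrative file name

        // Same effect as: hdfs storagepolicies -setStoragePolicy -path /hot/lazy_persist -policy LAZY_PERSIST
        fs.setStoragePolicy(dir, "LAZY_PERSIST");

        // CreateFlag.LAZY_PERSIST requests in-memory (lazy persist) semantics for the new file.
        try (FSDataOutputStream out = fs.create(
                file,
                FsPermission.getFileDefault(),
                EnumSet.of(CreateFlag.CREATE, CreateFlag.LAZY_PERSIST),
                4096,                               // buffer size
                fs.getDefaultReplication(file),
                fs.getDefaultBlockSize(file),
                null)) {                            // no Progressable
            out.writeBytes("a,b,c\n");
        }
    }
}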
Something else also turned out to be relevant here: the value of dfs.datanode.du.reserved is too large, and it limits the capacity HDFS will use on the SSD and RAM disk volumes. 536870912000 bytes is 500 GiB of reserved space per volume, far more than the 10 GiB tmpfs, so the RAM_DISK volume is treated as having no usable space and the write falls back to DISK.
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>536870912000</value>
  <final>false</final>
  <source>hdfs-site.xml</source>
</property>
But I don't understand why my question was voted down to -1 and closed.

Related

Ethereum: Why do I keep creating DAG files?

After reading another question on Stack Exchange, I understood that a DAG file stands for Directed Acyclic Graph.
However, I do not understand how it is used. When I typed ethminer -G, I started to see Creating DAG. XX% done and DAG 16:37:39.331|ethminer Generating DAG file. Progress: XX %. It has already reached 100% three times, and it just keeps restarting the same process after printing:
Creating DAG. 100% done...miner 16:22:32.015|ethminer Got work package:
miner 16:22:32.015|ethminer Header-hash: xxx
miner 16:22:32.015|ethminer Seedhash: xxx
miner 16:22:32.015|ethminer Target: xxx
ℹ 16:22:32.041|gpuminer0 workLoop 1 #xxx… #xxx…
ℹ 16:22:32.041|gpuminer0 Initialising miner...
[OPENCL]:Using platform: NVIDIA CUDA
[OPENCL]:Using device: GeForce 840M(OpenCL 1.2 CUDA)
miner 16:22:32.542|ethminer Mining on PoWhash #xxx… : 0 H/s = 0 hashes / 0.5 s
miner 16:22:32.542|ethminer Grabbing DAG for #xxx…
[OPENCL]:Printing program log
[OPENCL]:
[OPENCL]:Creating one big buffer for the DAG
[OPENCL]:Loading single big chunk kernels
[OPENCL]:Mapping one big chunk.
[OPENCL]:Creating buffer for header.
[OPENCL]:Creating mining buffer 0
[OPENCL]:Creating mining buffer 1
To be precise, I am using Ubuntu 16.04 and CUDA 8.0 with driver version 367 for my NVIDIA card.
Ethash, the proof-of-work algorithm used by Ethereum, was designed to be memory-hard. Part of this is the requirement for the entire DAG file to be stored in the GPU's memory.
There is a better explanation here: https://ethereum.stackexchange.com/questions/1993/what-actually-is-a-dag/1996
The reason ethminer keeps restarting is that your NVIDIA GeForce 840M has only 2 GB of memory, whereas at the time this question was posted, the DAG size on the Ethereum network was ~3 GB.

Is there a maximum concurrency for AWS s3 multipart uploads?

Referring to the docs, you can specify the number of concurrent connections when pushing large files to Amazon Web Services S3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether the size of each chunk is derived from total filesize / concurrency.
I trawled through the source code and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the vars are defined, by the way; this is just condensed for the example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($file)
    ->setBucket($bucket)
    ->setKey($file)
    ->setConcurrency(30)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();
It works great, except that a 200 MB file takes 9 minutes to upload... with 30 concurrent connections? That seemed suspicious, so I upped the concurrency to 100, and the upload time was 8.5 minutes. Such a small difference could just be the connection and not the code.
So my question is whether there is a concurrency maximum, what it is, and whether you can specify the size of the chunks or whether chunk size is calculated automatically. My goal is to get a 500 MB file to transfer to AWS S3 within 5 minutes (roughly 1.7 MB/s), so I have to optimize that wherever possible.
Looking through the source code, it looks like 10,000 is the maximum number of concurrent connections. There is no automatic calculation of chunk size based on the number of concurrent connections, but you can set the chunk size yourself if needed.
I set the chunk size to 10 MB with 20 concurrent connections and it seems to work fine. On a real server I got a 100 MB file to transfer in 23 seconds (about 4.3 MB/s), much better than the 3.5 to 4 minutes it was taking in the dev environments. Interesting, but those are the stats, should anyone else come across this same issue.
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($file)
    ->setBucket($bucket)
    ->setKey($file)
    ->setConcurrency(20)
    ->setMinPartSize(10485760)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();
I may need to raise that cache max-age, but so far this works acceptably. The key was moving the upload code to the server and not relying on the weakness of my dev environments, no matter how powerful the machine or how high-class the internet connection.
You can abort the upload and halt all operations at any point. You can also set the concurrency and the minimum part size:
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource('/path/to/large/file.mov')
    ->setBucket('mybucket')
    ->setKey('my-object-key')
    ->setConcurrency(3)
    ->setMinPartSize(10485760)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();

try {
    $uploader->upload();
    echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
    $uploader->abort();
    echo "Upload failed.\n";
}

Not enough space to cache rdd in memory warning

I am running a Spark job, and I got a Not enough space to cache rdd_128_17000 in memory warning. However, the attached file clearly shows that only 90.8 GB out of 719.3 GB is used. Why is that? Thanks!
15/10/16 02:19:41 WARN storage.MemoryStore: Not enough space to cache rdd_128_17000 in memory! (computed 21.4 GB so far)
15/10/16 02:19:41 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 21.2 GB (scratch space shared across 1 thread(s)) = 25.2 GB. Storage limit = 36.0 GB.
15/10/16 02:19:44 WARN storage.MemoryStore: Not enough space to cache rdd_129_17000 in memory! (computed 9.4 GB so far)
15/10/16 02:19:44 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 30.6 GB (scratch space shared across 1 thread(s)) = 34.6 GB. Storage limit = 36.0 GB.
15/10/16 02:25:37 INFO metrics.MetricsSaver: 1001 MetricsLockFreeSaver 339 comitted 11 matured S3WriteBytes values
15/10/16 02:29:00 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt1/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0000 134217728 bytes md5: qkQ8nlvC8COVftXkknPE3A== md5hex: aa443c9e5bc2f023957ed5e49273c4dc
15/10/16 02:38:15 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0001 134217728 bytes md5: RgoGg/yJpqzjIvD5DqjCig== md5hex: 460a0683fc89a6ace322f0f90ea8c28a
15/10/16 02:42:20 INFO metrics.MetricsSaver: 2001 MetricsLockFreeSaver 339 comitted 10 matured S3WriteBytes values
This is likely caused by spark.storage.memoryFraction being set too low. Spark will only use this fraction of the allocated memory to cache RDDs.
Try one of the following (a short sketch follows this list):
increasing the storage fraction
rdd.persist(StorageLevel.MEMORY_ONLY_SER) to reduce memory usage by serializing the RDD data
rdd.persist(StorageLevel.MEMORY_AND_DISK) to partially persist onto disk if the memory limit is reached
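A minimal Java sketch of those suggestions combined, assuming a Spark 1.x setup where spark.storage.memoryFraction still applies (the input path is a placeholder):

// Minimal sketch: raise the storage fraction and use a serialized,
// spill-to-disk storage level. Assumes Spark 1.x, where
// spark.storage.memoryFraction is still the relevant knob.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class CacheTuning {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("cache-tuning")
                // Raise the fraction of executor heap reserved for cached blocks
                // (the Spark 1.x default is 0.6).
                .set("spark.storage.memoryFraction", "0.7");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.textFile("hdfs:///some/input"); // placeholder path
            // Serialized storage usually shrinks the in-memory footprint; DISK takes the overflow.
            rdd.persist(StorageLevel.MEMORY_AND_DISK_SER());
            System.out.println(rdd.count());
        }
    }
}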
This could be due to the following issue if you're loading lots of avro files:
https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCANx3uAiJqO4qcTXePrUofKhO3N9UbQDJgNQXPYGZ14PWgfG5Aw#mail.gmail.com%3E
With a PR in progress at:
https://github.com/databricks/spark-avro/pull/95
I have a Spark-based batch application (a JAR with a main() method, not written by me; I'm not a Spark expert) that I run in local mode without spark-submit, spark-shell, or spark-defaults.conf. When I tried to use the IBM JRE (like one of my customers) instead of the Oracle JRE, on the same machine and with the same data, I started getting those warnings.
Since the memory store is a fraction of the heap (see the page that Jacob suggested in his comment), I checked the heap size: the IBM JRE uses a different strategy to decide the default heap size, and it was too small. I simply added appropriate -Xms and -Xmx params and the problem disappeared: the batch now works fine with both the IBM and the Oracle JRE.
My usage scenario is not typical, I know, but I hope this can help someone.

Data error(cyclic redundancy check) while logging transaction status using bitronix transaction manager

The exception below occurred. Are there any possible explanations? My notion is that it may be a problem with the filesystem.
Caused by: bitronix.tm.internal.BitronixSystemException: error logging status
at bitronix.tm.BitronixTransaction.setStatus(BitronixTransaction.java:400)
at bitronix.tm.BitronixTransaction.setStatus(BitronixTransaction.java:379)
at bitronix.tm.BitronixTransaction.setActive(BitronixTransaction.java:367)
at bitronix.tm.BitronixTransactionManager.begin(BitronixTransactionManager.java:126)
... 8 more
Caused by: java.io.IOException: Data error (cyclic redundancy check)
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:71)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
at sun.nio.ch.IOUtil.write(IOUtil.java:60)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:195)
at bitronix.tm.journal.TransactionLogAppender.writeLog(TransactionLogAppender.java:121)
at bitronix.tm.journal.DiskJournal.log(DiskJournal.java:98)
at bitronix.tm.BitronixTransaction.setStatus(BitronixTransaction.java:389)
... 12 more
There are two possible reasons for such a problem: a bug in the BTM disk journal or a hardware failure (it could be RAM, disk, power supply, motherboard... almost anything).
Since the disk journal is, IMHO, quite a solid piece of software that has been running on many production systems for years, I'd suspect your hardware first.

Curl slow multithreading dns

The program is written in C++ and it indexes webpages, so all domains are random domain names from the web. The strange part is that the DNS fail/not-found percentage is small (>5%).
Here is the pmp stack trace:
3886 __GI___poll,send_dg,buf=0xADDRESS,__libc_res_nquery,__libc_res_nquerydomain,__libc_res_nsearch,_nss_dns_gethostbyname3_r,gaih_inet,__GI_getaddrinfo,Curl_getaddrinfo_ex
601 __GI___poll,Curl_socket_check,waitconnect,singleipconnect,Curl_connecthost,ConnectPlease,protocol_done=protocol_done#entry=0xADDRESS),Curl_connect,connect_host,at
534 __GI___poll,Curl_socket_check,Transfer,at,getweb,athread,start_thread,clone,??
498 nanosleep,__sleep,athread,start_thread,clone,??
50 __GI___poll,Curl_socket_check,Transfer,at,getweb,getweb,athread,start_thread,clone,??
15 __GI___poll,Curl_socket_check,Transfer,at,getweb,getweb,getweb,athread,start_thread,clone
7 nanosleep,usleep,main
Why are there so many threads in _nss_dns_gethostbyname3_r? What could I do to speed it up?
Could it be because I'm using curl's default synchronous DNS resolver with CURLOPT_NOSIGNAL?
The program is running on an Intel i7 (8 cores with HT), 16 GB RAM, Ubuntu 12.10.
The bandwidth varies from 6 MB/s (the ISP limit) down to 2 MB/s at irregular intervals, and it sometimes even drops to a few hundred KB/s.
The threads you are seeing are probably waiting for DNS answers. One way of speeding that up would be to do the lookups beforehand, so the answers get cached in your nearby recursive DNS server. Also make sure nobody is asking for authoritative answers; that is always slow.
I found that the solution was to change curl's default DNS resolver to c-ares and to specifically ask for IPv4, as IPv6 is not yet supported by my network.
Changing to c-ares also allowed me to configure more DNS servers and cycle through them in order to improve the number of DNS queries per second.
The outcome:
// Resolve IPv4 only
curl_easy_setopt(curl, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);

// Cycle DNS servers; read and advance the shared index under the mutex
pthread_mutex_lock(&running_mutex);
dns_index = DNS_SERVER_I;
if (DNS_SERVER_I > DNS_SERVERS.size())
{
    DNS_SERVER_I = 1;
}
else
{
    DNS_SERVER_I++;
}
pthread_mutex_unlock(&running_mutex);

// Pick three consecutive servers (wrapping around) for this handle
string dns_servers_string = DNS_SERVERS.at(dns_index % DNS_SERVERS.size()) + ","
                          + DNS_SERVERS.at((dns_index + 1) % DNS_SERVERS.size()) + ","
                          + DNS_SERVERS.at((dns_index + 2) % DNS_SERVERS.size());

// Set curl's DNS servers (option available only when curl is built with c-ares)
curl_easy_setopt(curl, CURLOPT_DNS_SERVERS, dns_servers_string.c_str());