How to avoid remote memory allocation in a NUMA architecture? - c++

I have a NUMA machine with two nodes; each node has 8 cores and 64 GB of memory. There are two services, service-1 and service-2: service-1 is deployed on node-0 and service-2 on node-1.
I just want these two services to run separately, so I start the services as follows:
numactl --membind=0 ./service-1
numactl --membind=1 ./service-2
In the service code I use pthread_setaffinity_np to bind threads to the corresponding node's CPUs (binding service-1's threads to CPUs 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30).
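For illustration, a minimal sketch of that binding might look like this (the helper name is hypothetical; CPU_ZERO/CPU_SET and pthread_setaffinity_np are GNU extensions, so _GNU_SOURCE may be required):

#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to the node-0 CPUs reported by lscpu below.
// Note: this only controls where the thread runs; memory placement is
// governed separately (numactl --membind, mbind, first-touch policy).
static int bind_current_thread_to_node0()
{
    static const int node0_cpus[] = {0, 2, 4, 6, 8, 10, 12, 14,
                                     16, 18, 20, 22, 24, 26, 28, 30};
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : node0_cpus)
        CPU_SET(cpu, &set);

    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    return rc;
}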
localhost:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
localhost:~$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 64327 MB
node 0 free: 17231 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 64498 MB
node 1 free: 37633 MB
node distances:
node 0 1
0: 10 21
1: 21 10
What I expected was that Linux would allocate memory for service-2 on node-1, so that service-2 would mainly access node-1 memory, plus some dynamic-library pages on node-0. But sadly I find that service-2 has a lot of memory on the remote node (node-0), and accessing it costs too much time.
service-2's numa_maps is as follows:
7f07e8c00000 bind:1 anon=455168 dirty=455168 N0=442368 N1=12800
7f08e4200000 bind:1 anon=128000 dirty=128000 N0=121344 N1=6656
7f092be00000 bind:1 anon=21504 dirty=21504 N0=20992 N1=512
7f093ac00000 bind:1 anon=32768 dirty=32768 N0=27136 N1=5632
7f0944e00000 bind:1 anon=82432 dirty=82432 N0=81920 N1=512
7f0959400000 bind:1 anon=1024 dirty=1024 N0=1024
7f0959a00000 bind:1 anon=4096 dirty=4096 N0=4096
7f095aa00000 bind:1 anon=2560 dirty=2560 N0=2560
7f095b600000 bind:1 anon=4608 dirty=4608 N0=4608
7f095c800000 bind:1 anon=512 dirty=512 N0=512
7f095cc00000 bind:1 anon=512 dirty=512 N0=512
...
So here are my questions:
1. Does Linux really allocate remote memory (node-0) for service-2, even though service-2 is already bound to node-1 by the membind command?
--membind=nodes, -m nodes
Only allocate memory from nodes. Allocation will fail when there is not enough memory available on these nodes. nodes may be specified as noted above.
2. Is this related to kernel.numa_balancing, which is set to 1 on my machine? My understanding is that with kernel.numa_balancing = 1, Linux only migrates tasks toward their memory, or moves memory to the node where the task is executing. Since service-2 is already bound to node-1, shouldn't no balancing happen?
3. Can someone explain how this remote allocation happened? Are there any methods to avoid it?
Thank you very much!
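For reference, one way to take placement of your own allocations out of the kernel's hands is to ask libnuma for node-bound memory explicitly, and move_pages(2) can report which node a page actually landed on. A minimal sketch, assuming the libnuma headers are installed and the binary is linked with -lnuma:

#include <numa.h>      // numa_alloc_onnode, numa_free
#include <numaif.h>    // move_pages
#include <cstdio>
#include <cstring>

int main()
{
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    // Ask explicitly for node-1 memory instead of relying only on the
    // process-wide membind policy.
    const size_t sz = 64UL * 1024 * 1024;
    void* p = numa_alloc_onnode(sz, 1);
    if (!p)
        return 1;
    std::memset(p, 0, sz);  // touch the pages so they are actually faulted in

    // Check which node the first page really landed on.
    void* pages[1]  = { p };
    int   status[1] = { -1 };
    if (move_pages(0, 1, pages, nullptr, status, 0) == 0)
        std::printf("first page is on node %d\n", status[0]);

    numa_free(p, sz);
    return 0;
}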

Related

Why do I get nothing with PCM (Intel Performance Counter Monitor) -- C++ API

root@dellr740:/ycsb_build# sudo ./ycsb
IBRS and IBPB supported : yes
STIBP supported : yes
Spec arch caps supported : yes
Number of physical cores: 20
Number of logical cores: 40
Number of online logical cores: 40
Threads (logical cores) per physical core: 2
Num sockets: 1
Physical cores per socket: 20
Core PMU (perfmon) version: 4
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2100000000 Hz
IBRS enabled in the kernel : yes
STIBP enabled in the kernel : no
The processor is not susceptible to Rogue Data Cache Load: yes
The processor supports enhanced IBRS : yes
Package thermal spec power: 125 Watt; Package minimum power: 63 Watt; Package maximum power: 307 Watt;
ERROR: Secure Boot detected. Recompile PCM with -DPCM_USE_PERF or disable Secure Boot.
Socket 0: 2 memory controllers detected with total number of 6 channels. 0 QPI ports detected. 2 M2M (mesh to memory) blocks detected.
.......
.......
......
Here is the result; for the PCM metrics I get 0 bytes:
WARNING: Custom counter 0 is in use. MSR_PERF_GLOBAL_INUSE on core 0: 0x800000000000000f
WARNING: Custom counter 1 is in use. MSR_PERF_GLOBAL_INUSE on core 0: 0x800000000000000f
WARNING: Custom counter 2 is in use. MSR_PERF_GLOBAL_INUSE on core 0: 0x800000000000000f
WARNING: Custom counter 3 is in use. MSR_PERF_GLOBAL_INUSE on core 0: 0x800000000000000f
WARNING: Core 0 IA32_PERFEVTSEL0_ADDR are not zeroed 1245244
Error opening PCM: 2
Zeroed PMU registers
Cleaning up
Zeroed uncore PMU registers
PCM Metrics:
L2 HitRatio: 1
L3 HitRatio: -1
L2 misses: 0
L3 misses: 0
DRAM Reads (bytes): 0
DRAM Writes (bytes): 0
NVM Reads (bytes): 0
NVM Writes (bytes): 0
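For reference, PCM's C++ API is normally used roughly as below; the "Error opening PCM: 2" line above suggests the counters were never programmed, so checking the return value of program() before reading metrics is the first thing to verify. This is only a sketch: the pcm namespace and exact helper names depend on the PCM version.

#include "cpucounters.h"   // Intel PCM
#include <cstdio>

int main()
{
    pcm::PCM* m = pcm::PCM::getInstance();
    const pcm::PCM::ErrorCode status = m->program();
    if (status != pcm::PCM::Success) {
        // If programming the PMU fails (e.g. Secure Boot without
        // -DPCM_USE_PERF, or counters already in use), every metric
        // will read as 0 or -1.
        std::fprintf(stderr, "PCM could not program the counters: %d\n", (int)status);
        return 1;
    }

    pcm::SystemCounterState before = pcm::getSystemCounterState();
    // ... run the workload being measured ...
    pcm::SystemCounterState after = pcm::getSystemCounterState();

    std::printf("L3 hit ratio : %f\n", pcm::getL3CacheHitRatio(before, after));
    std::printf("DRAM reads   : %llu bytes\n",
                (unsigned long long)pcm::getBytesReadFromMC(before, after));
    std::printf("DRAM writes  : %llu bytes\n",
                (unsigned long long)pcm::getBytesWrittenToMC(before, after));

    m->cleanup();
    return 0;
}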

Ejabberd Server Application CPU Overload

We have built Ejabberd on AWS EC2 instances and have enabled clustering across the 6 Ejabberd servers in the Tokyo, Frankfurt, and Singapore regions.
The OS, middleware, applications and settings for each EC2 instance are exactly the same.
But currently, the Ejabberd CPUs in the Frankfurt and Singapore regions are overloaded.
The CPU of Ejabberd in the Japan region is normal.
Could you please tell me what the suspicious part might be?
You could take a look at the ejabberd log files of the problematic (and the good) nodes; maybe you will find some clue.
You can use the undocumented "ejabberdctl etop" shell command on the problematic nodes. It's similar to "top", but runs inside the Erlang virtual machine that runs ejabberd:
ejabberdctl etop
========================================================================================
ejabberd#localhost 16:00:12
Load: cpu 0 Memory: total 44174 binary 1320
procs 277 processes 5667 code 20489
runq 1 atom 984 ets 5467
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
----------------------------------------------------------------------------------------
<9135.1252.0> caps_requests_cache 2393 1 2816 0 gen_server:loop/7
<9135.932.0> mnesia_recover 480 39 2816 0 gen_server:loop/7
<9135.1118.0> dets:init/2 71 2 5944 0 dets:open_file_loop2
<9135.6.0> prim_file:start/0 63 1 2608 0 prim_file:helper_loo
<9135.1164.0> dets:init/2 56 2 4072 0 dets:open_file_loop2
<9135.818.0> disk_log:init/2 49 2 5984 0 disk_log:loop/1
<9135.1038.0> ejabberd_listener:in 31 2 2840 0 prim_inet:accept0/3
<9135.1213.0> dets:init/2 31 2 5944 0 dets:open_file_loop2
<9135.1255.0> dets:init/2 30 2 5944 0 dets:open_file_loop2
<9135.0.0> init 28 1 3912 0 init:loop/1
========================================================================================

New Ceph Installation Won't Recover

I am unsure if this is the right platform to ask, but hopefully it is :).
I've got a 3-node setup of Ceph.
node1
mds.node1 , mgr.node1 , mon.node1 , osd.0 , osd.1 , osd.6
14.2.22
node2
mds.node2 , mon.node2 , osd.2 , osd.3 , osd.7
14.2.22
node3
mds.node3 , mon.node3 , osd.4 , osd.5 , osd.8
14.2.22
For some reason, though, when I take one node down, it does not start backfilling/recovery at all. It just reports 3 OSDs down, as below, but does nothing to repair it...
If I run ceph -s I get the output below:
[root@node1 testdir]# ceph -s
cluster:
id: 8932b76b-282b-4385-bee8-5c295af88e74
health: HEALTH_WARN
3 osds down
1 host (3 osds) down
Degraded data redundancy: 30089/90267 objects degraded (33.333%), 200 pgs degraded, 512 pgs undersized
1/3 mons down, quorum node1,node2
services:
mon: 3 daemons, quorum node1,node2 (age 2m), out of quorum: node3
mgr: node1(active, since 48m)
mds: homeFS:1 {0=node1=up:active} 1 up:standby-replay
osd: 9 osds: 6 up (since 2m), 9 in (since 91m)
data:
pools: 4 pools, 512 pgs
objects: 30.09k objects, 144 MiB
usage: 14 GiB used, 346 GiB / 360 GiB avail
pgs: 30089/90267 objects degraded (33.333%)
312 active+undersized
200 active+undersized+degraded
io:
client: 852 B/s rd, 2 op/s rd, 0 op/s wr
[root@node1 testdir]#
The odd thing, though: when I boot up my 3rd node again, it does recover and sync. But it looks like the backfilling just isn't starting at all...
Is there something that might be causing it?
Update
What I did notice: if I mark a drive as out, it does recover... But when the server node is down and the drive is marked as out, it does not recover at all...
Update 2:
I noticed while experimenting that if the OSD is up but out, it does recover... When the OSD is marked as down, it does not begin to recover at all...
By default, Ceph waits 10 minutes before it marks down OSDs as out (mon_osd_down_out_interval). This helps when a server just needs a reboot and returns within 10 minutes: then all is good. If you need a longer maintenance window and you're not sure whether it will take longer than 10 minutes, but the server will eventually return, run ceph osd set noout to prevent unnecessary rebalancing.

How can I determine the physical RAM installed on a computer? (Windows)

How can I get the physical RAM installed in my computer using C++ on Windows?
I mean not only the capacity parameters that GlobalMemoryStatusEx() can return, but also the number of used memory slots, the type of memory (like DDR1/DDR2/DDR3), the type of slot (DIMM/SO-DIMM), and the clock rate of the memory bus.
Do I need to use SMBIOS? Or is there some other way to get this info?
On my machine, most of the information you request is available through WMI. Take a look at the Win32_PhysicalMemory and related classes.
For example, the output of wmic memorychip on my machine is:
C:\>wmic memorychip
Attributes BankLabel Capacity Caption ConfiguredClockSpeed ConfiguredVoltage CreationClassName DataWidth Description DeviceLocator FormFactor HotSwappable InstallDate InterleaveDataDepth InterleavePosition Manufacturer MaxVoltage MemoryType MinVoltage Model Name OtherIdentifyingInfo PartNumber PositionInRow PoweredOn Removable Replaceable SerialNumber SKU SMBIOSMemoryType Speed Status Tag TotalWidth TypeDetail Version
2 BANK 0 17179869184 Physical Memory 2133 1200 Win32_PhysicalMemory 64 Physical Memory ChannelA-DIMM0 12 Samsung 0 0 0 Physical Memory M471A2K43BB1-CPB 15741117 26 2133 Physical Memory 0 64 128
2 BANK 2 17179869184 Physical Memory 2133 1200 Win32_PhysicalMemory 64 Physical Memory ChannelB-DIMM0 12 Samsung 0 0 0 Physical Memory M471A2K43BB1-CPB 21251413 26 2133 Physical Memory 2 64 128
As noted in the link above, FormFactor 12 is SODIMM.
Notably missing are the voltages (which you didn't ask for, but which are usually of interest) and the MemoryType, whose documentation is outdated on MSDN, while the recent SMBIOS docs from DMTF include enum values for DDR4, etc.
Therefore, you would probably have to resort to looking at the SMBIOS tables more or less by hand. See: How to get memory information (RAM type, e.g. DDR,DDR2,DDR3?) with WMI/C++
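A minimal sketch of the WMI route from C++ might look like the following (error handling trimmed, property handling simplified; assumes an MSVC build linking wbemuuid.lib):

#include <windows.h>
#include <comdef.h>
#include <Wbemidl.h>
#include <iostream>
#pragma comment(lib, "wbemuuid.lib")

int main()
{
    // Initialize COM and default security for this process.
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    CoInitializeSecurity(nullptr, -1, nullptr, nullptr,
                         RPC_C_AUTHN_LEVEL_DEFAULT, RPC_C_IMP_LEVEL_IMPERSONATE,
                         nullptr, EOAC_NONE, nullptr);

    // Connect to the local ROOT\CIMV2 namespace.
    IWbemLocator*  loc = nullptr;
    IWbemServices* svc = nullptr;
    CoCreateInstance(CLSID_WbemLocator, nullptr, CLSCTX_INPROC_SERVER,
                     IID_IWbemLocator, (LPVOID*)&loc);
    loc->ConnectServer(_bstr_t(L"ROOT\\CIMV2"), nullptr, nullptr, nullptr,
                       0, nullptr, nullptr, &svc);
    CoSetProxyBlanket(svc, RPC_C_AUTHN_WINNT, RPC_C_AUTHZ_NONE, nullptr,
                      RPC_C_AUTHN_LEVEL_CALL, RPC_C_IMP_LEVEL_IMPERSONATE,
                      nullptr, EOAC_NONE);

    // One row per populated memory slot, like "wmic memorychip".
    IEnumWbemClassObject* rows = nullptr;
    svc->ExecQuery(_bstr_t(L"WQL"),
                   _bstr_t(L"SELECT Capacity, Speed, DeviceLocator "
                           L"FROM Win32_PhysicalMemory"),
                   WBEM_FLAG_FORWARD_ONLY | WBEM_FLAG_RETURN_IMMEDIATELY,
                   nullptr, &rows);

    IWbemClassObject* row = nullptr;
    ULONG returned = 0;
    while (rows && SUCCEEDED(rows->Next(WBEM_INFINITE, 1, &row, &returned)) && returned) {
        VARIANT capacity, speed, locator;
        row->Get(L"Capacity", 0, &capacity, nullptr, nullptr);  // uint64 as string, bytes
        row->Get(L"Speed", 0, &speed, nullptr, nullptr);        // MHz
        row->Get(L"DeviceLocator", 0, &locator, nullptr, nullptr);
        std::wcout << locator.bstrVal << L": " << capacity.bstrVal
                   << L" bytes @ " << speed.uintVal << L" MHz\n";
        VariantClear(&capacity); VariantClear(&speed); VariantClear(&locator);
        row->Release();
    }

    if (rows) rows->Release();
    if (svc)  svc->Release();
    if (loc)  loc->Release();
    CoUninitialize();
    return 0;
}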

In Hadoop HDFS, how many data nodes does a 1 GB file use to be stored?

I have a 1 GB file to be stored on the HDFS file system. I have a cluster of 10 data nodes and a namenode. Is there any calculation by which the Namenode uses a particular number of data nodes (not counting replicas) for the storage of the file? Or is there any parameter that we can configure for file storage? If so, what is the default number of datanodes that Hadoop uses to store the file if it is not specifically configured?
I want to know if it uses all the datanodes of the cluster or only a specific number of datanodes.
Let's assume the HDFS block size is 64 MB and that free space exists on all the datanodes.
Thanks in advance.
If the configured block size is 64 MB and you have a 1 GB file, the file size is 1024 MB.
So the blocks needed will be 1024/64 = 16 blocks, i.e. 16 blocks to store one copy of your 1 GB file.
Now, let's say you have a 10-node cluster; the default replication factor is 3, which means your 1 GB file will be stored on 3 different nodes. So the number of blocks occupied by your 1 GB file is 16 * 3 = 48 blocks.
If each block is 64 MB, then the total space your 1 GB file consumes is 64 * 48 = 3072 MB.
Hope that clears your doubt.
In the second (2nd) generation of Hadoop:
If the configured block size is 128 MB and you have a 1 GB file, the file size is 1024 MB.
So the blocks needed will be 1024/128 = 8 blocks, i.e. 8 blocks to store one copy of your 1 GB file.
Now, let's say you have a 10-node cluster; the default replication factor is 3, which means your 1 GB file will be stored on 3 different nodes. So the number of blocks occupied by your 1 GB file is 8 * 3 = 24 blocks.
If each block is 128 MB, then the total space your 1 GB file consumes is 128 * 24 = 3072 MB.
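Both answers use the same formula; as a toy illustration of the arithmetic only (not part of any Hadoop API), with the names below made up for the example:

#include <cstdint>
#include <cstdio>

int main()
{
    const std::uint64_t file_mb  = 1024; // 1 GB file
    const std::uint64_t block_mb = 64;   // dfs.blocksize (128 in Hadoop 2.x)
    const std::uint64_t replicas = 3;    // dfs.replication default

    // Blocks for one copy of the file, rounded up (the last block may be
    // only partially filled; HDFS does not pad it).
    const std::uint64_t blocks = (file_mb + block_mb - 1) / block_mb;

    std::printf("blocks per copy      : %llu\n", (unsigned long long)blocks);
    std::printf("block replicas total : %llu\n", (unsigned long long)(blocks * replicas));
    std::printf("space at full blocks : %llu MB\n",
                (unsigned long long)(blocks * replicas * block_mb));
    return 0;
}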