DPDK Virtio Vring Negotiation Abnormal

I have a process (process1) with an EAL parameter like --vdev "/tmp/socket",queues=4,server=1 acting as the virtio front-end, and another process (process2) as the back-end.
When they finish negotiation, nothing looks wrong in DPDK's log, but when I print each queue's state it shows the following:
Queue idx: 0, pkt in queue: 0;
Queue idx: 1, pkts in queue: 256;
Queue idx: 2, pkts in queue: 0;
Queue idx: 3, pkts in queue: 256;
Queue idx: 4, pkts in queue: 256;
Queue idx: 5, pkts in queue: 256;
Queue idx: 6, pkts in queue: 256;
Queue idx: 7, pkts in queue: 256;
The count for queues 4 and 6 is supposed to be 0, but it is 256; I am sure the back-end process has not started to process packets yet.

OK, I found the answer myself.
I tried to investigate by using testpmd. When I use the same EAL parameters, the log shows:
Queue idx: 0, pkt in queue: 0;
Queue idx: 1, pkts in queue: 256;
Queue idx: 2, pkts in queue: 256;
Queue idx: 3, pkts in queue: 256;
Queue idx: 4, pkts in queue: 256;
Queue idx: 5, pkts in queue: 256;
Queue idx: 6, pkts in queue: 256;
Queue idx: 7, pkts in queue: 256;
I guessed the port was not configured and was only using 1 queue. I configured the port with port config all rxq 4 to use 4 RX queues (see the testpmd commands after the log below), and then the log shows:
Queue idx: 0, pkt in queue: 0;
Queue idx: 1, pkts in queue: 256;
Queue idx: 2, pkts in queue: 0;
Queue idx: 3, pkts in queue: 256;
Queue idx: 4, pkts in queue: 0;
Queue idx: 5, pkts in queue: 256;
Queue idx: 6, pkts in queue: 0;
Queue idx: 7, pkts in queue: 256;
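For reference, the queue count is typically changed at runtime in testpmd with a sequence like the following (the ports have to be stopped first; exact commands may differ slightly between DPDK versions, and txq is shown here only by analogy with rxq):

testpmd> port stop all
testpmd> port config all rxq 4
testpmd> port config all txq 4
testpmd> port start all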

pktgen cannot send packet in ovs dpdk scenario

The test setup is: pktgen sends packets to the vhost-user1 port, OVS forwards them to vhost-user2, and testpmd receives them from vhost-user2.
The problem is that pktgen cannot send any packets and testpmd receives no packets either; I don't know what the problem is.
I need some help, thanks in advance!
OVS: 2.9.0
DPDK: 17.11.6
pktgen: 3.4.4
OVS setup:
export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
export PATH=$PATH:/usr/local/share/openvswitch/scripts
rm /usr/local/etc/openvswitch/conf.db
ovsdb-tool create /usr/local/etc/openvswitch/conf.db /usr/local/share/openvswitch/vswitch.ovsschema
ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach
ovs-vsctl --no-wait init
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true other_config:dpdk-lcore=0x2 other_config:dpdk-socket-mem="1024,0"
ovs-vswitchd unix:/usr/local/var/run/openvswitch/db.sock --pidfile --detach
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x8
ovs-vsctl add-br ovs-br0 -- set bridge ovs-br0 datapath_type=netdev
ovs-vsctl add-port ovs-br0 vhost-user0 -- set Interface vhost-user0 type=dpdkvhostuser
ovs-vsctl add-port ovs-br0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser
ovs-vsctl add-port ovs-br0 vhost-user2 -- set Interface vhost-user2 type=dpdkvhostuser
ovs-vsctl add-port ovs-br0 vhost-user3 -- set Interface vhost-user3 type=dpdkvhostuser
sudo ovs-ofctl del-flows ovs-br0
sudo ovs-ofctl add-flow ovs-br0 in_port=2,dl_type=0x800,idle_timeout=0,action=output:3
sudo ovs-ofctl add-flow ovs-br0 in_port=3,dl_type=0x800,idle_timeout=0,action=output:2
sudo ovs-ofctl add-flow ovs-br0 in_port=1,dl_type=0x800,idle_timeout=0,action=output:4
sudo ovs-ofctl add-flow ovs-br0 in_port=4,dl_type=0x800,idle_timeout=0,action=output:1
run pktgen:
root@k8s:/home/haosp/OVS_DPDK/pktgen-3.4.4# pktgen -c 0xf --master-lcore 0 -n 1 --socket-mem 512,0 --file-prefix pktgen --no-pci \
> --vdev 'net_virtio_user0,mac=00:00:00:00:00:05,path=/usr/local/var/run/openvswitch/vhost-user0' \
> --vdev 'net_virtio_user1,mac=00:00:00:00:00:01,path=/usr/local/var/run/openvswitch/vhost-user1' \
> -- -P -m "1.[0-1]"
Copyright (c) <2010-2017>, Intel Corporation. All rights reserved. Powered by DPDK
EAL: Detected 4 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
Lua 5.3.4 Copyright (C) 1994-2017 Lua.org, PUC-Rio
Copyright (c) <2010-2017>, Intel Corporation. All rights reserved.
Pktgen created by: Keith Wiles -- >>> Powered by DPDK <<<
>>> Packet Burst 64, RX Desc 1024, TX Desc 2048, mbufs/port 16384, mbuf cache 2048
=== port to lcore mapping table (# lcores 4) ===
lcore: 0 1 2 3 Total
port 0: ( D: T) ( 1: 1) ( 0: 0) ( 0: 0) = ( 1: 1)
port 1: ( D: T) ( 1: 1) ( 0: 0) ( 0: 0) = ( 1: 1)
Total : ( 0: 0) ( 2: 2) ( 0: 0) ( 0: 0)
Display and Timer on lcore 0, rx:tx counts per port/lcore
Configuring 2 ports, MBUF Size 2176, MBUF Cache Size 2048
Lcore:
1, RX-TX
RX_cnt( 2): (pid= 0:qid= 0) (pid= 1:qid= 0)
TX_cnt( 2): (pid= 0:qid= 0) (pid= 1:qid= 0)
Port :
0, nb_lcores 1, private 0x5635a661d3a0, lcores: 1
1, nb_lcores 1, private 0x5635a661ff70, lcores: 1
** Default Info (net_virtio_user0, if_index:0) **
max_rx_queues : 1, max_tx_queues : 1
max_mac_addrs : 64, max_hash_mac_addrs: 0, max_vmdq_pools: 0
rx_offload_capa: 28, tx_offload_capa : 0, reta_size : 0, flow_type_rss_offloads:0000000000000000
vmdq_queue_base: 0, vmdq_queue_num : 0, vmdq_pool_base: 0
** RX Conf **
pthresh : 0, hthresh : 0, wthresh : 0
Free Thresh : 0, Drop Enable : 0, Deferred Start : 0
** TX Conf **
pthresh : 0, hthresh : 0, wthresh : 0
Free Thresh : 0, RS Thresh : 0, Deferred Start : 0, TXQ Flags:00000f00
Create: Default RX 0:0 - Memory used (MBUFs 16384 x (size 2176 + Hdr 128)) + 192 = 36865 KB headroom 128 2176
Set RX queue stats mapping pid 0, q 0, lcore 1
Create: Default TX 0:0 - Memory used (MBUFs 16384 x (size 2176 + Hdr 128)) + 192 = 36865 KB headroom 128 2176
Create: Range TX 0:0 - Memory used (MBUFs 16384 x (size 2176 + Hdr 128)) + 192 = 36865 KB headroom 128 2176
Create: Sequence TX 0:0 - Memory used (MBUFs 16384 x (size 2176 + Hdr 128)) + 192 = 36865 KB headroom 128 2176
Create: Special TX 0:0 - Memory used (MBUFs 64 x (size 2176 + Hdr 128)) + 192 = 145 KB headroom 128 2176
Port memory used = 147601 KB
Initialize Port 0 -- TxQ 1, RxQ 1, Src MAC 00:00:00:00:00:05
** Default Info (net_virtio_user1, if_index:0) **
max_rx_queues : 1, max_tx_queues : 1
max_mac_addrs : 64, max_hash_mac_addrs: 0, max_vmdq_pools: 0
rx_offload_capa: 28, tx_offload_capa : 0, reta_size : 0, flow_type_rss_offloads:0000000000000000
vmdq_queue_base: 0, vmdq_queue_num : 0, vmdq_pool_base: 0
** RX Conf **
pthresh : 0, hthresh : 0, wthresh : 0
Free Thresh : 0, Drop Enable : 0, Deferred Start : 0
** TX Conf **
pthresh : 0, hthresh : 0, wthresh : 0
Free Thresh : 0, RS Thresh : 0, Deferred Start : 0, TXQ Flags:00000f00
Create: Default RX 1:0 - Memory used (MBUFs 16384 x (size 2176 + Hdr 128)) + 192 = 36865 KB headroom 128 2176
Set RX queue stats mapping pid 1, q 0, lcore 1
Create: Default TX 1:0 - Memory used (MBUFs 16384 x (size 2176 + Hdr 128)) + 192 = 36865 KB headroom 128 2176
Create: Range TX 1:0 - Memory used (MBUFs 16384 x (size 2176 + Hdr 128)) + 192 = 36865 KB headroom 128 2176
Create: Sequence TX 1:0 - Memory used (MBUFs 16384 x (size 2176 + Hdr 128)) + 192 = 36865 KB headroom 128 2176
Create: Special TX 1:0 - Memory used (MBUFs 64 x (size 2176 + Hdr 128)) + 192 = 145 KB headroom 128 2176
Port memory used = 147601 KB
Initialize Port 1 -- TxQ 1, RxQ 1, Src MAC 00:00:00:00:00:01
Total memory used = 295202 KB
Port 0: Link Up - speed 10000 Mbps - full-duplex <Enable promiscuous mode>
!ERROR!: Could not read enough random data for PRNG seed
Port 1: Link Up - speed 10000 Mbps - full-duplex <Enable promiscuous mode>
!ERROR!: Could not read enough random data for PRNG seed
=== Display processing on lcore 0
WARNING: Nothing to do on lcore 2: exiting
WARNING: Nothing to do on lcore 3: exiting
RX/TX processing lcore: 1 rx: 2 tx: 2
For RX found 2 port(s) for lcore 1
For TX found 2 port(s) for lcore 1
Pktgen:/>set 0 dst mac 00:00:00:00:00:03
Pktgen:/>set all rate 10
Pktgen:/>set 0 count 10000
Pktgen:/>set 1 count 20000
Pktgen:/>str
| Flags:Port : P--------------:0 P--------------:1 0/0
Link State : P--------------:0 P--------------:1 ----TotalRate----
Pkts/s Max/Rx : <UP-10000-FD> <UP-10000-FD> 0/0
Max/Tx : 0/0 0/0 0/0
MBits/s Rx/Tx : 256/0 256/0 512/0
Broadcast : 0/0 0/0 0/0
Multicast : 0 0
64 Bytes : 0 0
65-127 : 0 0
128-255 : 0 0
256-511 : 0 0
512-1023 : 0 0
1024-1518 : 0 0
Runts/Jumbos : 0 0
Errors Rx/Tx : 0/0 0/0
Total Rx Pkts : 0/0 0/0
Tx Pkts : 0 0
Rx MBs : 256 256
Tx MBs : 0 0
ARP/ICMP Pkts : 0 0
Tx Count/% Rate : 0/0 0/0
Pattern Type : abcd... abcd...
Tx Count/% Rate : 10000 /10% 20000 /10%--------------------
PktSize/Tx Burst : 64 / 64 64 / 64
Src/Dest Port : 1234 / 5678 1234 / 5678--------------------
Pkt Type:VLAN ID : IPv4 / TCP:0001 IPv4 / TCP:0001
802.1p CoS : 0 0--------------------
ToS Value: : 0 0
- DSCP value : 0 0--------------------
- IPP value : 0 0
Dst IP Address : 192.168.1.1 192.168.0.1--------------------
Src IP Address : 192.168.0.1/24 192.168.1.1/24
Dst MAC Address : 00:00:00:00:00:03 00:00:00:00:00:05--------------------
Src MAC Address : 00:00:00:00:00:05 00:00:00:00:00:01
VendID/PCI Addr : 0000:0000/00:00.0 0000:0000/00:00.0--------------------
Pktgen:/> str
-- Pktgen Ver: 3.4.4 (DPDK 17.11.6) Powered by DPDK --------------------------
Pktgen:/>
run testpmd:
./testpmd -c 0xf -n 1 --socket-mem 512,0 --file-prefix testpmd --no-pci \
--vdev 'net_virtio_user2,mac=00:00:00:00:00:02,path=/usr/local/var/run/openvswitch/vhost-user2' \
--vdev 'net_virtio_user3,mac=00:00:00:00:00:03,path=/usr/local/var/run/openvswitch/vhost-user3' \
-- -i -a --burst=64 --txd=2048 --rxd=2048 --coremask=0x4
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: 1 hugepages of size 1073741824 reserved, but no mounted hugetlbfs found for that size
EAL: Probing VFIO support...
EAL: VFIO support initialized
update_memory_region(): Too many memory regions
update_memory_region(): Too many memory regions
Interactive-mode selected
Auto-start selected
Warning: NUMA should be configured manually by using --port-numa-config and --ring-numa-config parameters along with --numa.
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
update_memory_region(): Too many memory regions
update_memory_region(): Too many memory regions
update_memory_region(): Too many memory regions
update_memory_region(): Too many memory regions
Configuring Port 0 (socket 0)
Port 0: 00:00:00:00:00:02
Configuring Port 1 (socket 0)
Port 1: 00:00:00:00:00:03
Checking link statuses...
Done
Start automatic packet forwarding
io packet forwarding - ports=2 - cores=1 - streams=2 - NUMA support enabled, MP allocation mode: native
Logical Core 2 (socket 0) forwards packets on 2 streams:
RX P=0/Q=0 (socket 0) -> TX P=1/Q=0 (socket 0) peer=02:00:00:00:00:01
RX P=1/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
io packet forwarding packets/burst=64
nb forwarding cores=1 - nb forwarding ports=2
port 0: RX queue number: 1 Tx queue number: 1
Rx offloads=0x0 Tx offloads=0x0
RX queue: 0
RX desc=2048 - RX free threshold=0
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x0
TX queue: 0
TX desc=2048 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x0 - TX RS bit threshold=0
port 1: RX queue number: 1 Tx queue number: 1
Rx offloads=0x0 Tx offloads=0x0
RX queue: 0
RX desc=2048 - RX free threshold=0
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x0
TX queue: 0
TX desc=2048 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x0 - TX RS bit threshold=0
testpmd> show port info
Bad arguments
testpmd> show port stats all
######################## NIC statistics for port 0 ########################
RX-packets: 0 RX-missed: 0 RX-bytes: 0
RX-errors: 0
RX-nombuf: 0
TX-packets: 0 TX-errors: 0 TX-bytes: 0
Throughput (since last show)
Rx-pps: 0
Tx-pps: 0
############################################################################
######################## NIC statistics for port 1 ########################
RX-packets: 0 RX-missed: 0 RX-bytes: 0
RX-errors: 0
RX-nombuf: 0
TX-packets: 0 TX-errors: 0 TX-bytes: 0
Throughput (since last show)
Rx-pps: 0
Tx-pps: 0
############################################################################
OVS dump-flow show:
root@k8s:/home/haosp# ovs-ofctl dump-flows ovs-br0
cookie=0x0, duration=77519.972s, table=0, n_packets=0, n_bytes=0, ip,in_port="vhost-user1" actions=output:"vhost-user2"
cookie=0x0, duration=77519.965s, table=0, n_packets=0, n_bytes=0, ip,in_port="vhost-user2" actions=output:"vhost-user1"
cookie=0x0, duration=77519.959s, table=0, n_packets=0, n_bytes=0, ip,in_port="vhost-user0" actions=output:"vhost-user3"
cookie=0x0, duration=77518.955s, table=0, n_packets=0, n_bytes=0, ip,in_port="vhost-user3" actions=output:"vhost-user0"
ovs-ofctl dump-ports ovs-br0 show:
root@k8s:/home/haosp# ovs-ofctl dump-ports ovs-br0
OFPST_PORT reply (xid=0x2): 5 ports
port "vhost-user3": rx pkts=0, bytes=0, drop=0, errs=0, frame=?, over=?, crc=?
tx pkts=0, bytes=0, drop=6, errs=?, coll=?
port "vhost-user1": rx pkts=0, bytes=0, drop=0, errs=0, frame=?, over=?, crc=?
tx pkts=0, bytes=0, drop=8, errs=?, coll=?
port "vhost-user0": rx pkts=0, bytes=0, drop=0, errs=0, frame=?, over=?, crc=?
tx pkts=0, bytes=0, drop=8, errs=?, coll=?
port "vhost-user2": rx pkts=0, bytes=0, drop=0, errs=0, frame=?, over=?, crc=?
tx pkts=0, bytes=0, drop=8, errs=?, coll=?
port LOCAL: rx pkts=50, bytes=3732, drop=0, errs=0, frame=0, over=0, crc=0
tx pkts=0, bytes=0, drop=0, errs=0, coll=0
ovs-ofctl show ovs-br0
root@k8s:/home/haosp# ovs-ofctl show ovs-br0
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000ca4f2b8e6b4b
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
1(vhost-user0): addr:00:00:00:00:00:00
config: 0
state: LINK_DOWN
speed: 0 Mbps now, 0 Mbps max
2(vhost-user1): addr:00:00:00:00:00:00
config: 0
state: LINK_DOWN
speed: 0 Mbps now, 0 Mbps max
3(vhost-user2): addr:00:00:00:00:00:00
config: 0
state: 0
speed: 0 Mbps now, 0 Mbps max
4(vhost-user3): addr:00:00:00:00:00:00
config: 0
state: 0
speed: 0 Mbps now, 0 Mbps max
LOCAL(ovs-br0): addr:ca:4f:2b:8e:6b:4b
config: 0
state: 0
current: 10MB-FD COPPER
speed: 10 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0
ovs-vsctl show
root@k8s:/home/haosp# ovs-vsctl show
635ba448-91a0-4c8c-b6ca-4b9513064d7f
Bridge "ovs-br0"
Port "vhost-user2"
Interface "vhost-user2"
type: dpdkvhostuser
Port "ovs-br0"
Interface "ovs-br0"
type: internal
Port "vhost-user0"
Interface "vhost-user0"
type: dpdkvhostuser
Port "vhost-user3"
Interface "vhost-user3"
type: dpdkvhostuser
Port "vhost-user1"
Interface "vhost-user1"
type: dpdkvhostuser
It seems that pktgen cannot send packets, and the OVS statistics show no packets received either.
I have no idea yet; it has me confused.
If the goal is to have packets transferred between Pktgen and testpmd connected by OVS-DPDK, one has to use a net_vhost and virtio_user pair.
DPDK Pktgen (net_vhost) <==> OVS-DPDK port-1 (virtio_user) {Rule to forward} OVS-DPDK port-2 (virtio_user) <==> DPDK Pktgen (net_vhost)
In the current setup, you will have to make the following changes (illustrated right after this list):
start DPDK pktgen by changing --vdev net_virtio_user0,mac=00:00:00:00:00:05,path=/usr/local/var/run/openvswitch/vhost-user0 to --vdev net_vhost0,iface=/usr/local/var/run/openvswitch/vhost-user0
start DPDK testpmd by changing --vdev 'net_virtio_user2,mac=00:00:00:00:00:02,path=/usr/local/var/run/openvswitch/vhost-user2' to --vdev 'net_vhost0,iface=/usr/local/var/run/openvswitch/vhost-user2'
then start DPDK-OVS with --vdev=virtio_user0,path=/usr/local/var/run/openvswitch/vhost-user0 and --vdev=virtio_user1,path=/usr/local/var/run/openvswitch/vhost-user2
add rules to allow port-to-port forwarding between pktgen and testpmd
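For illustration only, applying the first two changes to the earlier invocations would look roughly like this (EAL options copied unchanged from the commands above; additional vhost parameters such as queue counts may still be needed):

pktgen -c 0xf --master-lcore 0 -n 1 --socket-mem 512,0 --file-prefix pktgen --no-pci \
    --vdev 'net_vhost0,iface=/usr/local/var/run/openvswitch/vhost-user0' \
    --vdev 'net_vhost1,iface=/usr/local/var/run/openvswitch/vhost-user1' \
    -- -P -m "1.[0-1]"

./testpmd -c 0xf -n 1 --socket-mem 512,0 --file-prefix testpmd --no-pci \
    --vdev 'net_vhost0,iface=/usr/local/var/run/openvswitch/vhost-user2' \
    --vdev 'net_vhost1,iface=/usr/local/var/run/openvswitch/vhost-user3' \
    -- -i -a --burst=64 --txd=2048 --rxd=2048 --coremask=0x4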
Note:
please update the command line for multiple ports.
A screenshot with the pktgen and l2fwd setup was shared below.

DPDK IP reassemble API returns NULL

I am new to DPDK and am currently testing the IP reassembly API, but I am having difficulties. Below is the C++ code which I wrote to test IP reassembly; I took the reference from the examples provided by DPDK itself. The DPDK version I am using is 20.08, on a Debian machine. The DPDK user guide mentions that the API works on the source address, destination address and packet ID, and even though all three values are proper, the API still returns NULL. Any kind of help will be much appreciated. Thanks in advance.
Search the Fragment Table for entry with packet’s <IPv4 Source Address, IPv4 Destination Address, Packet ID>.
If the entry is found, then check if that entry already timed-out. If yes, then free all previously received fragments, and remove information about them from the entry.
If no entry with such key is found, then try to create a new one by one of two ways:
Use as empty entry.
Delete a timed-out entry, free the mbufs associated with it, and store a new entry with the specified key in it.
Update the entry with new fragment information and check if a packet can be reassembled (the packet’s entry contains all fragments).
If yes, then, reassemble the packet, mark table’s entry as empty and return the reassembled mbuf to the caller.
If no, then return a NULL to the caller.
CONFIG_RTE_LIBRTE_IP_FRAG=y
CONFIG_RTE_LIBRTE_IP_FRAG_DEBUG=y
CONFIG_RTE_LIBRTE_IP_FRAG_MAX_FRAG=100
CONFIG_RTE_LIBRTE_IP_FRAG_TBL_STAT=n
#define DEF_FLOW_NUM 0x1000
#define DEF_FLOW_TTL MS_PER_S
#define IP_FRAG_TBL_BUCKET_ENTRIES 128
static uint32_t max_flow_num = DEF_FLOW_NUM;
static uint32_t max_flow_ttl = DEF_FLOW_TTL;
struct lcore_queue_conf {
    struct rte_ip_frag_tbl *frag_tbl;
    struct rte_mempool *pool;
    struct rte_ip_frag_death_row death_row;
} __rte_cache_aligned;

static struct lcore_queue_conf lcore_queue_conf[RTE_MAX_LCORE];

static inline int setup_queue_tbl(struct lcore_queue_conf *qconf)
{
    uint64_t frag_cycles = (rte_get_tsc_hz() + MS_PER_S - 1) / MS_PER_S * max_flow_ttl;
    qconf->frag_tbl = rte_ip_frag_table_create(max_flow_num, IP_FRAG_TBL_BUCKET_ENTRIES, max_flow_num, frag_cycles, rte_socket_id());
    if ((qconf->frag_tbl) == NULL) {
        RTE_LOG(ERR, IP_RSMBL, "Table Failed.");
        return -1;
    }
    qconf->pool = rte_pktmbuf_pool_create("BUFFER", POOL_SIZE*2, POOL_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (qconf->pool == NULL) {
        RTE_LOG(ERR, IP_RSMBL, "Mem Pool Failed.");
        return -1;
    }
    return 0;
}
static inline void reassemble(struct rte_mbuf *reassemblepkt, struct lcore_queue_conf *qconf, uint64_t cur_tsc)
{
    struct rte_mbuf *mo;
    struct rte_ether_hdr *eth_hdr;
    struct rte_ipv4_hdr *ip_hdr;
    eth_hdr = rte_pktmbuf_mtod(reassemblepkt, struct rte_ether_hdr *);
    ip_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
    if (rte_ipv4_frag_pkt_is_fragmented(ip_hdr)) {
        reassemblepkt->l2_len = sizeof(*eth_hdr);
        reassemblepkt->l3_len = sizeof(*ip_hdr);
        int ip_len;
        ip_len = rte_be_to_cpu_16(ip_hdr->total_length);
        mo = rte_ipv4_frag_reassemble_packet(qconf->frag_tbl, &qconf->death_row, reassemblepkt, cur_tsc, ip_hdr);
        if (mo == NULL) {
            cout << "Total Length: " << ip_len << ", l3 length: " << reassemblepkt->l3_len << " ,Packet ID: " << ip_hdr->packet_id << " , src add:" << ip_hdr->src_addr << " , dst add:" << ip_hdr->dst_addr << endl;
            RTE_LOG(ERR, IP_RSMBL, "Reassemble Failed.\n");
        }
        if ((mo != reassemblepkt) && (mo != NULL)) {
            cout << "Reassemble is success." << endl;
            reassemblepkt = mo;
        }
    }
}
static int
lcore_main(struct lcore_queue_conf *qconf)
{
    int rx, rec;
    struct rte_mbuf *bufs[BUFFER_LENGTH];
    uint64_t cur_tsc;
    int i;
    RTE_ETH_FOREACH_DEV(port) {
        cout << "RX Thread: Socket ID: " << rte_socket_id() << endl;
        cout << "RX Thread: lcore count: " << dec << rte_lcore_count() << endl;
        cout << "RX Thread: lcore ID: " << rte_lcore_id() << endl;
        cout << "RX Thread Started." << endl;
        cout << "=====================================================" << endl;
    }
    while (!quit_signal) {
        cur_tsc = rte_rdtsc();
        RTE_ETH_FOREACH_DEV(port) {
            rx = rte_eth_rx_burst(port, 0, bufs, BUFFER_LENGTH);
            if (unlikely(rx == 0))
                continue;
            if (rx) {
                for (i = 0; i < rx; i++)
                    reassemble(bufs[i], qconf, cur_tsc);
                rte_ip_frag_free_death_row(&qconf->death_row, PREFETCH_OFFSET);
                rec = rte_ring_enqueue_burst(Myring, (void **)bufs, rx, NULL);
            }
        }
    }
    return 0;
}
int main(int argc, char *argv[])
{
    int ret;
    uint16_t portcheck;
    DPDKPORT p1;
    struct lcore_queue_conf *qconf;
    /* catch ctrl-c so we can print on exit */
    signal(SIGINT, int_handler);
    /* EAL setup */
    ret = rte_eal_init(argc, argv);
    cout << "=====================================================" << endl;
    if (ret < 0)
        cout << "EAL initialising failed." << strerror(-ret) << endl;
    else
        cout << "EAL initialisation success." << endl;
    qconf = &lcore_queue_conf[rte_get_master_lcore()];
    if (setup_queue_tbl(qconf) != 0)
        rte_exit(EXIT_FAILURE, "%s\n", rte_strerror(rte_errno));
    .
    .
    .
    .
    RTE_ETH_FOREACH_DEV(portcheck) {
        if (p1.eth_init(portcheck, qconf->pool) != 0)
            rte_exit(EXIT_FAILURE, "Ethernet port initialisation failed.");
    }
    /* Master core call */
    lcore_main(qconf);
    return 0;
}
The output is as follows.
EAL: Detected 12 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: Invalid NUMA socket, default to 0
EAL: Invalid NUMA socket, default to 0
EAL: using IOMMU type 1 (Type 1)
EAL: Probe PCI driver: net_ixgbe (8086:15d1) device: 0000:01:00.0 (socket 0)
EAL: Invalid NUMA socket, default to 0
EAL: No legacy callbacks, legacy socket not created
=====================================================
EAL initialisation success.
USER1: rte_ip_frag_table_create: allocated of 201326720 bytes at socket 0
PORT 0: Ethernet configuration success.
TX queue configuration success.
RX queue configuration success.
PORT 0: NIC started successfully.
PORT 0: Enabled promiscuous mode.
MAC Addr b4:96:91:3f:21:b6
=====================================================
RX Thread: Socket ID: 0
RX Thread: lcore count: 12
RX Thread: lcore ID: 0
RX Thread Started.
=====================================================
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831203304 , dst add:11304
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 40960 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 40960 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 40960 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 40960 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 40960 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 40960 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 32, l3 length: 20 ,Packet ID: 40960 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831191337 , dst add:10280
IP_RSMBL: Reassemble Failed.
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831191337 , dst add:10280
IP_RSMBL: Reassemble Failed.
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831191337 , dst add:10280
IP_RSMBL: Reassemble Failed.
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831191337 , dst add:10280
IP_RSMBL: Reassemble Failed.
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831191337 , dst add:10280
IP_RSMBL: Reassemble Failed.
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831191337 , dst add:10280
IP_RSMBL: Reassemble Failed.
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831191337 , dst add:10280
IP_RSMBL: Reassemble Failed.
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831191337 , dst add:10280
IP_RSMBL: Reassemble Failed.
Total Length: 2048, l3 length: 20 ,Packet ID: 1030 , src add:2831191337 , dst add:10280
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 42752 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 42752 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 42752 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 42752 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 42752 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 44, l3 length: 20 ,Packet ID: 42752 , src add:992520384 , dst add:4294967295
IP_RSMBL: Reassemble Failed.
Total Length: 32, l3 length: 20 ,Packet ID: 42752 , src add:992520384 , dst add:4294967295
I added printf inside the ip_reassembly example program and the same thing is observed. I am unable to figure out why it is happening. When I open the input pcap in Wireshark, Wireshark is able to reassemble it properly. Can anyone please suggest a pcap or a way to check whether IP reassembly is working fine?
/* process this fragment. */
mo = rte_ipv4_frag_reassemble_packet(tbl, dr, m, tms, ip_hdr);
if (mo == NULL) {
    /* no packet to send out. */
    printf("IP reassemble failed\n");
    return;
}
/* we have our packet reassembled. */
if (mo != m) {
    printf("IP reassemble success\n");
    m = mo;
    eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    ip_hdr = (struct rte_ipv4_hdr *)(eth_hdr + 1);
}
root@user:/home/user/dpdk/examples/ip_reassembly# ./build/ip_reassembly -l 1 -- -p 1 -q 2
EAL: Detected 12 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: Invalid NUMA socket, default to 0
EAL: Invalid NUMA socket, default to 0
EAL: using IOMMU type 1 (Type 1)
EAL: Probe PCI driver: net_ixgbe (8086:15d1) device: 0000:01:00.0 (socket 0)
EAL: Invalid NUMA socket, default to 0
EAL: No legacy callbacks, legacy socket not created
0x7ffdd2cbd50e
IP_RSMBL: Creating LPM table on socket 0
IP_RSMBL: Creating LPM6 table on socket 0
USER1: rte_ip_frag_table_create: allocated of 25165952 bytes at socket 0
Initializing port 0 ... Port 0 modified RSS hash function based on hardware support,requested:0xa38c configured:0x8104
Address:B4:96:91:3F:21:B6
txq=1,0,0
IP_RSMBL: Socket 0: adding route 100.10.0.0/16 (port 0)
IP_RSMBL: Socket 0: adding route 100.20.0.0/16 (port 1)
IP_RSMBL: Socket 0: adding route 100.30.0.0/16 (port 2)
IP_RSMBL: Socket 0: adding route 100.40.0.0/16 (port 3)
IP_RSMBL: Socket 0: adding route 100.50.0.0/16 (port 4)
IP_RSMBL: Socket 0: adding route 100.60.0.0/16 (port 5)
IP_RSMBL: Socket 0: adding route 100.70.0.0/16 (port 6)
IP_RSMBL: Socket 0: adding route 100.80.0.0/16 (port 7)
IP_RSMBL: Socket 0: adding route 0101:0101:0101:0101:0101:0101:0101:0101/48 (port 0)
IP_RSMBL: Socket 0: adding route 0201:0101:0101:0101:0101:0101:0101:0101/48 (port 1)
IP_RSMBL: Socket 0: adding route 0301:0101:0101:0101:0101:0101:0101:0101/48 (port 2)
IP_RSMBL: Socket 0: adding route 0401:0101:0101:0101:0101:0101:0101:0101/48 (port 3)
IP_RSMBL: Socket 0: adding route 0501:0101:0101:0101:0101:0101:0101:0101/48 (port 4)
IP_RSMBL: Socket 0: adding route 0601:0101:0101:0101:0101:0101:0101:0101/48 (port 5)
IP_RSMBL: Socket 0: adding route 0701:0101:0101:0101:0101:0101:0101:0101/48 (port 6)
IP_RSMBL: Socket 0: adding route 0801:0101:0101:0101:0101:0101:0101:0101/48 (port 7)
Checking link status.......................................
done
Port0 Link Up. Speed 10000 Mbps - full-duplex
IP_RSMBL: entering main loop on lcore 1
IP_RSMBL: -- lcoreid=1 portid=0
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
.
.
.
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
IP reassemble failed
^C -- lcoreid=1 portid=0 frag tbl stat:
max entries: 4096;
entries in use: 11;
finds/inserts: 0;
entries added: 0;
entries deleted by timeout: 0;
entries reused by timeout: 0;
total add failures: 0;
add no-space failures: 0;
add hash-collisions failures: 0;
TX bursts: 0
TX packets _queued: 0
TX packets dropped: 0
TX packets send: 0
received signal: 2, exiting
[EDIT-2]: Either the last fragment or the first fragment length is 0. I don't know whether this has anything to do with the details below, but I am mentioning them just in case.
NIC(10GB NIC)
Ethernet controller: Intel Corporation Ethernet Controller 10G X550T
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Stepping: 10
CPU MHz: 3185.543
CPU max MHz: 3200.0000
CPU min MHz: 800.0000
BogoMIPS: 6384.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
Output of your code:
EAL: Detected 12 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: Invalid NUMA socket, default to 0
EAL: Invalid NUMA socket, default to 0
EAL: using IOMMU type 1 (Type 1)
EAL: Probe PCI driver: net_ixgbe (8086:15d1) device: 0000:01:00.0 (socket 0)
EAL: Invalid NUMA socket, default to 0
EAL: No legacy callbacks, legacy socket not created
USER1: rte_ip_frag_table_create: allocated of 201326720 bytes at socket 0
Port 0 MAC: b4 96 91 3f 21 b6
RX Thread: Socket ID: 0
RX Thread: lcore count: 12
RX Thread: lcore ID: 0
mb: 0x172980e80
fp: 0x17e05c5c0
offset: 0
, IPLen: 24
, ipflag: 8192
mb_end: (nil)
ERR, IP_RSMBL, Reassemble Failed.
mb: 0x172980540
fp: 0x17e05c5c0
offset: 24
, IPLen: 24
, ipflag: 8192
mb_end: (nil)
ERR, IP_RSMBL, Reassemble Failed.
mb: 0x17297fc00
fp: 0x17e05c5c0
offset: 48
, IPLen: 24
, ipflag: 8192
mb_end: (nil)
ERR, IP_RSMBL, Reassemble Failed.
mb: 0x17297f2c0
fp: 0x17e05c5c0
offset: 72
, IPLen: 24
, ipflag: 8192
idx: 4, frags: 0x4ip_frag_process:145 invalid fragmented packet:
ipv4_frag_pkt: 0x17e05c5c0, key: <ffffffff3b28a8c0, 0xb500>, total_size: 4294967295, frag_size: 96, last_idx: 4
first fragment: ofs: 0, len: 24
last fragment: ofs: 0, len: 0
mb_end: (nil)
ERR, IP_RSMBL, Reassemble Failed.
mb: 0x172b99680
fp: 0x17551e5c0
offset: 96
, IPLen: 24
, ipflag: 8192
mb_end: (nil)
ERR, IP_RSMBL, Reassemble Failed.
mb: 0x172b98d40
fp: 0x17551e5c0
offset: 120
, IPLen: 24
, ipflag: 8192
mb_end: (nil)
ERR, IP_RSMBL, Reassemble Failed.
mb: 0x172b98400
fp: 0x17551e5c0
offset: 144
, IPLen: 12
, ipflag: 0
mb_end: (nil)
ERR, IP_RSMBL, Reassemble Failed.
mb: 0x172c8fb00
fp: 0x17d5885c0
offset: 8
, IPLen: 2028
, ipflag: 0
idx: -1, frags: 0x4ip_frag_process:145 invalid fragmented packet:
ipv4_frag_pkt: 0x17d5885c0, key: <2c28a8c0bbe8, 0x406>, total_size: 2036, frag_size: 4056, last_idx: 2
first fragment: ofs: 0, len: 0
last fragment: ofs: 8, len: 2028
mb_end: (nil)
The DPDK API rte_ipv4_frag_reassemble_packet returns NULL on two occasions:
an error occurred
not all fragments of the packet have been collected yet
Based on the code and logs shared, it looks like you are
sending the last fragment multiple times
setting the timeout as cur_tsc
Note:
the easiest way to test your packets is to run them against the ip_reassembly example and cross-check the variance.
if (mo == NULL), it only means that not enough fragments have been received.
[edit-1] Hence I request you to model your code on the DPDK example ip_reassembly, since rte_ipv4_frag_reassemble_packet returning NULL is not always a failure.
[edit-2] After cleaning up the code and adding the missing libraries, I am able to get this working with the right set of fragment packets:
Reassemble is success.dump mbuf at 0x1736966c0, iova=7b3696740, buf_len=2176
pkt_len=5421, ol_flags=10, nb_segs=4, in_port=0
segment at 0x1736966c0, data=0x1736967c0, data_len=1514
Dump data at [0x1736967c0], len=64
00000000: 00 1D 09 94 65 38 68 5B 35 C0 61 B6 08 00 45 00 | ....e8h[5.a...E.
00000010: 15 1F F5 AF 00 00 40 11 00 00 83 B3 C4 DC 83 B3 | ......#.........
00000020: C4 2E 18 DB 18 DB 15 0B DC E2 06 FD 14 FF 07 29 | ...............)
00000030: 08 07 65 78 61 6D 70 6C 65 08 07 74 65 73 74 41 | ..example..testA
segment at 0x1733a8040, data=0x1733a8162, data_len=1480
segment at 0x1733a8980, data=0x1733a8aa2, data_len=1480
segment at 0x1734dde40, data=0x1734ddf62, data_len=947
[edit-3]
traffic generator: ./app/x86_64-native-linuxapp-gcc/pktgen -l 1-4 -- -s 0:test.pcap -P -m [2].0
application: (code snippet edited and compiled as C program) https://pastebin.pl/view/91e533e3
build: gcc $(PKG_CONFIG_PATH=[path-to-dpdk-pkgconfig] pkg-config --static --cflags libdpdk) dpdk.c -Wl,-Bstatic $(PKG_CONFIG_PATH=[path-to-dpdk-pkgconfig] pkg-config --static --libs libdpdk)
run: sudo LD_LIBRARY_PATH=/path-to-dpdk-shared-library/ ./a.out
[edit-4] Based on the live debugging, the issue appears to be the hub/switch connecting the 2 machines: packets are dropped or not forwarded at all. I requested a direct, stable connection to check the logic on @Nirmal's machine, ran the same example, and showcased the output on my machine.

MPI derived datatype problem caused by struct padding and non-blocking communication buffer's problem

Hi, I am writing a C++ program in which I want MPI to communicate using a derived data type, but the receiver does not receive the full information that the sender sends out.
Here is how I build my derived data type:
// dg_derived_datatype.cpp
#include <mpi.h>
#include "dg_derived_datatype.h"

void MPI_Face_type();   // forward declaration so Construct_data_type() can call it

namespace Hash{
    MPI_Datatype Face_type;
};

void Construct_data_type(){
    MPI_Face_type();
}

void MPI_Face_type(){
    int num = 3;
    // Number of elements in each block (array of integers)
    int elem_blocklength[num]{2, 1, 5};
    // Byte displacement of each block (array of integers).
    MPI_Aint array_of_offsets[num];
    MPI_Aint intex, charex;
    MPI_Aint lb;
    MPI_Type_get_extent(MPI_INT, &lb, &intex);
    MPI_Type_get_extent(MPI_CHAR, &lb, &charex);
    array_of_offsets[0] = (MPI_Aint) 0;
    array_of_offsets[1] = array_of_offsets[0] + intex * 2;
    array_of_offsets[2] = array_of_offsets[1] + charex;
    MPI_Datatype array_of_types[num]{MPI_INT, MPI_CHAR, MPI_INT};
    // create an MPI datatype
    MPI_Type_create_struct(num, elem_blocklength, array_of_offsets, array_of_types, &Hash::Face_type);
    MPI_Type_commit(&Hash::Face_type);
}

void Free_type(){
    MPI_Type_free(&Hash::Face_type);
}
Here I define my data type Hash::Face_type and commit it. Hash::Face_type is used to transfer a vector of my struct (face_pack: 2 int + 1 char + 5 int).
// dg_derived_datatype.h
#ifndef DG_DERIVED_DATA_TYPE_H
#define DG_DERIVED_DATA_TYPE_H
#include <mpi.h>
struct face_pack{
    int owners_key;
    int facei;
    char face_type;
    int hlevel;
    int porderx;
    int pordery;
    int key;
    int rank;
};

namespace Hash{
    extern MPI_Datatype Face_type;
};
void Construct_data_type();
void Free_type();
#endif
Then in my main program I do
// dg_main.cpp
#include <iostream>
#include <mpi.h>
#include "dg_derived_datatype.h"
#include <vector>
void Recv_face(int source, int tag, std::vector<face_pack>& recv_face);
int main(){
    // Initialize MPI.
    // some code here.
    // I create a vector of struct: std::vector<face_pack> face_info,
    // to store the info I want to let processors communicate.
    Construct_data_type(); // construct my derived data type
    MPI_Request request_pre1, request_pre2, request_next1, request_next2;
    // send
    if(num_next > 0){ // if fulfilled, the current processor sends info to the next processor (my_rank + 1)
        std::vector<face_pack> face_info;
        // some code to construct face_info
        // source my_rank, destination my_rank + 1
        MPI_Isend(&face_info[0], num_n, Hash::Face_type, mpi::rank + 1, mpi::rank + 1, MPI_COMM_WORLD, &request_next2);
    }
    // recv
    if(some criteria){ // recv from the former processor (my_rank - 1)
        std::vector<face_pack> recv_face;
        Recv_face(mpi::rank - 1, mpi::rank, recv_face); // recv info from the former processor
    }
    if(num_next > 0){
        MPI_Status status;
        MPI_Wait(&request_next2, &status);
    }
    Free_type();
    // finalize MPI
}

void Recv_face(int source, int tag, std::vector<face_pack>& recv_face){
    MPI_Status status1, status2;
    MPI_Probe(source, tag, MPI_COMM_WORLD, &status1);
    int count;
    MPI_Get_count(&status1, Hash::Face_type, &count);
    recv_face = std::vector<face_pack>(count);
    MPI_Recv(&recv_face[0], count, Hash::Face_type, source, tag, MPI_COMM_WORLD, &status2);
}
The problem is that the receiver sometimes receives incomplete info.
For example, I print out the face_info before it is sent out:
// rank 2
owners_key3658 facei 0 face_type M neighbour 192 n_rank 0
owners_key3658 facei 1 face_type L neighbour 66070 n_rank 1
owners_key3658 facei 1 face_type L neighbour 76640 n_rank 1
owners_key3658 facei 2 face_type M neighbour 2631 n_rank 0
owners_key3658 facei 3 face_type L neighbour 4953 n_rank 1
...
owners_key49144 facei 1 face_type M neighbour 844354 n_rank 2
owners_key49144 facei 1 face_type M neighbour 913280 n_rank 2
owners_key49144 facei 2 face_type L neighbour 41619 n_rank 1
owners_key49144 facei 3 face_type M neighbour 57633 n_rank 2
This is correct.
But on the receiver side, I print out the message it received:
owners_key3658 facei 0 face_type M neighbour 192 n_rank 0
owners_key3658 facei 1 face_type L neighbour 66070 n_rank 1
owners_key3658 facei 1 face_type L neighbour 76640 n_rank 1
owners_key3658 facei 2 face_type M neighbour 2631 n_rank 0
owners_key3658 facei 3 face_type L neighbour 4953 n_rank 1
... // at the beginning it's fine; however, at the end it is messed up
owners_key242560 facei 2 face_type ! neighbour 2 n_rank 2
owners_key217474 facei 2 face_type ! neighbour 2 n_rank 2
owners_key17394 facei 2 face_type ! neighbour 2 n_rank 2
owners_key216815 facei 2 face_type ! neighbour 2 n_rank 2
Clearly it lost the face_type info, which is a char. As far as I know, std::vector guarantees contiguous memory here, so I am not sure which part of my derived MPI datatype is wrong. The message passing sometimes works and sometimes does not.
OK, I kind of figured out my problems. There are two.
The first one is the use of MPI_Type_get_extent(). Since a C/C++ struct can be padded by your compiler, it is OK if you only send one element, but if you send multiple elements, the trailing padding can cause problems.
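As a rough illustration of that padding (a sketch only; the exact numbers assume a typical x86-64 compiler that aligns int to 4 bytes and are not guaranteed by the standard), the offsets computed from the MPI extents do not match the real struct layout:

#include <cstddef>   // offsetof
#include <cstdio>
#include "dg_derived_datatype.h"   // brings in struct face_pack from above

int main(){
    // The extent arithmetic assumed hlevel starts right after the char, i.e. at
    // byte 2*sizeof(int) + sizeof(char) = 9, and that one face_pack occupies
    // 7*sizeof(int) + sizeof(char) = 29 bytes. With padding, the compiler
    // typically produces different numbers:
    std::printf("offsetof(face_pack, hlevel) = %zu\n", offsetof(face_pack, hlevel)); // typically 12, not 9
    std::printf("sizeof(face_pack)           = %zu\n", sizeof(face_pack));           // typically 32, not 29
    return 0;
}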
Therefore, a safer and more portable way to define your derived datatype is to use MPI_Get_address(). Here is how I do it:
// generate the derived datatype
void MPI_Face_type(){
    int num = 3;
    int elem_blocklength[num]{2, 1, 5};
    MPI_Datatype array_of_types[num]{MPI_INT, MPI_CHAR, MPI_INT};
    MPI_Aint array_of_offsets[num];
    MPI_Aint baseadd, add1, add2;
    std::vector<face_pack> myface(1);
    MPI_Get_address(&(myface[0].owners_key), &baseadd);
    MPI_Get_address(&(myface[0].face_type), &add1);
    MPI_Get_address(&(myface[0].hlevel), &add2);
    array_of_offsets[0] = 0;
    array_of_offsets[1] = add1 - baseadd;
    array_of_offsets[2] = add2 - baseadd;
    MPI_Type_create_struct(num, elem_blocklength, array_of_offsets, array_of_types, &Hash::Face_type);
    // check that the extent is correct
    MPI_Aint lb, extent;
    MPI_Type_get_extent(Hash::Face_type, &lb, &extent);
    if(extent != sizeof(myface[0])){
        MPI_Datatype old = Hash::Face_type;
        MPI_Type_create_resized(old, 0, sizeof(myface[0]), &Hash::Face_type);
        MPI_Type_free(&old);
    }
    MPI_Type_commit(&Hash::Face_type);
}
The second one is the use of the non-blocking send MPI_Isend(). The program works properly after I changed the non-blocking send to a blocking send.
The relevant part of my program looks like this:
if(criteria1){
    // form the vector using my derived datatype
    std::vector<derived_type> my_vector;
    // use MPI_Isend to send the vector to the target rank
    MPI_Isend(... my_vector...);
}
if(criteria2){
    // need to recv message
    MPI_Recv();
}
if(criteria1){
    // the sender now needs to make sure the message has arrived.
    MPI_Wait();
}
Although I used MPI_Wait, the receiver did not get the full message. I checked the man page of MPI_Isend(), and it says:
A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender should not modify any part of the send buffer after a nonblocking send operation is called until the send completes.
But I do not think I modified the send buffer? Or could it be that there is not enough space in the send buffer to store the info to be sent? In my understanding, the non-blocking send works like this: the sender puts the message in its buffer and sends it out to the target rank when the target rank hits MPI_Recv. So could it be that the sender's buffer runs out of space to store messages before it sends them out? Correct me if I am wrong.
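For reference, here is a minimal, self-contained sketch of the pattern the man page describes: the send buffer is left untouched, and kept alive, from MPI_Isend until the matching MPI_Wait completes. It uses plain MPI_INT rather than the derived datatype and is only an illustration, not the original program:

// isend_lifetime_sketch.cpp -- illustrative only
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv){
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank == 0){
        std::vector<int> buf(1000, 42);        // send buffer: must stay valid until MPI_Wait
        MPI_Request req;
        MPI_Isend(buf.data(), (int)buf.size(), MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        // ... other work, but do not modify, resize or destroy buf here ...
        MPI_Wait(&req, MPI_STATUS_IGNORE);     // only now is it safe to reuse or free buf
    } else if (size >= 2 && rank == 1){
        std::vector<int> buf(1000);
        MPI_Recv(buf.data(), (int)buf.size(), MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d ... %d\n", buf.front(), buf.back());
    }

    MPI_Finalize();
    return 0;
}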

Can I use MPI with shared memory

I have written simulation software for highly parallelized execution, using MPI for internode parallelization and threads for intranode parallelization, to reduce the memory footprint by using shared memory where possible. (The largest data structures are mostly read-only, so I can easily manage thread-safety.)
Although my program works fine (finally), I am having second thoughts about whether this approach is really best, mostly because managing two types of parallelization does require some messy asynchronous code here and there.
I found a paper (pdf draft) introducing a shared memory extension to MPI, allowing the use of shared data structures within MPI parallelization on a single node.
I am not very experienced with MPI, so my question is: Is this possible with recent standard Open MPI implementations and where can I find an introduction / tutorial on how to do it?
Note that I am not talking about how message passing is accomplished with shared memory, I know that MPI does that. I would like to (read-)access the same object in memory from multiple MPI processors.
This can be done. Here is a test code that sets up a small table on each shared-memory node. Only one process (node rank 0) actually allocates and initialises the table, but all processes on a node can read it.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(void)
{
    int i, flag;
    int nodesize, noderank;
    int size, rank, irank;
    int tablesize, localtablesize;
    int *table, *localtable;
    int *model;
    MPI_Comm allcomm, nodecomm;
    char verstring[MPI_MAX_LIBRARY_VERSION_STRING];
    char nodename[MPI_MAX_PROCESSOR_NAME];
    MPI_Aint winsize;
    int windisp;
    int *winptr;
    int version, subversion, verstringlen, nodestringlen;

    allcomm = MPI_COMM_WORLD;

    MPI_Win wintable;

    tablesize = 5;

    MPI_Init(NULL, NULL);

    MPI_Comm_size(allcomm, &size);
    MPI_Comm_rank(allcomm, &rank);

    MPI_Get_processor_name(nodename, &nodestringlen);

    MPI_Get_version(&version, &subversion);
    MPI_Get_library_version(verstring, &verstringlen);

    if (rank == 0)
    {
        printf("Version %d, subversion %d\n", version, subversion);
        printf("Library <%s>\n", verstring);
    }

    // Create node-local communicator
    MPI_Comm_split_type(allcomm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &nodecomm);

    MPI_Comm_size(nodecomm, &nodesize);
    MPI_Comm_rank(nodecomm, &noderank);

    // Only rank 0 on a node actually allocates memory
    localtablesize = 0;
    if (noderank == 0) localtablesize = tablesize;

    // debug info
    printf("Rank %d of %d, rank %d of %d in node <%s>, localtablesize %d\n",
           rank, size, noderank, nodesize, nodename, localtablesize);

    MPI_Win_allocate_shared(localtablesize*sizeof(int), sizeof(int),
                            MPI_INFO_NULL, nodecomm, &localtable, &wintable);

    MPI_Win_get_attr(wintable, MPI_WIN_MODEL, &model, &flag);

    if (1 != flag)
    {
        printf("Attribute MPI_WIN_MODEL not defined\n");
    }
    else
    {
        if (MPI_WIN_UNIFIED == *model)
        {
            if (rank == 0) printf("Memory model is MPI_WIN_UNIFIED\n");
        }
        else
        {
            if (rank == 0) printf("Memory model is *not* MPI_WIN_UNIFIED\n");
            MPI_Finalize();
            return 1;
        }
    }

    // need to get local pointer valid for table on rank 0
    table = localtable;

    if (noderank != 0)
    {
        MPI_Win_shared_query(wintable, 0, &winsize, &windisp, &table);
    }

    // All table pointers should now point to copy on noderank 0

    // Initialise table on rank 0 with appropriate synchronisation
    MPI_Win_fence(0, wintable);

    if (noderank == 0)
    {
        for (i=0; i < tablesize; i++)
        {
            table[i] = rank*tablesize + i;
        }
    }

    MPI_Win_fence(0, wintable);

    // Check we did it right
    for (i=0; i < tablesize; i++)
    {
        printf("rank %d, noderank %d, table[%d] = %d\n",
               rank, noderank, i, table[i]);
    }

    MPI_Finalize();
}
Here is some sample output for 6 processes across two nodes:
Version 3, subversion 1
Library <SGI MPT 2.14 04/05/16 03:53:22>
Rank 3 of 6, rank 0 of 3 in node <r1i0n1>, localtablesize 5
Rank 4 of 6, rank 1 of 3 in node <r1i0n1>, localtablesize 0
Rank 5 of 6, rank 2 of 3 in node <r1i0n1>, localtablesize 0
Rank 0 of 6, rank 0 of 3 in node <r1i0n0>, localtablesize 5
Rank 1 of 6, rank 1 of 3 in node <r1i0n0>, localtablesize 0
Rank 2 of 6, rank 2 of 3 in node <r1i0n0>, localtablesize 0
Memory model is MPI_WIN_UNIFIED
rank 3, noderank 0, table[0] = 15
rank 3, noderank 0, table[1] = 16
rank 3, noderank 0, table[2] = 17
rank 3, noderank 0, table[3] = 18
rank 3, noderank 0, table[4] = 19
rank 4, noderank 1, table[0] = 15
rank 4, noderank 1, table[1] = 16
rank 4, noderank 1, table[2] = 17
rank 4, noderank 1, table[3] = 18
rank 4, noderank 1, table[4] = 19
rank 5, noderank 2, table[0] = 15
rank 5, noderank 2, table[1] = 16
rank 5, noderank 2, table[2] = 17
rank 5, noderank 2, table[3] = 18
rank 5, noderank 2, table[4] = 19
rank 0, noderank 0, table[0] = 0
rank 0, noderank 0, table[1] = 1
rank 0, noderank 0, table[2] = 2
rank 0, noderank 0, table[3] = 3
rank 0, noderank 0, table[4] = 4
rank 1, noderank 1, table[0] = 0
rank 1, noderank 1, table[1] = 1
rank 1, noderank 1, table[2] = 2
rank 1, noderank 1, table[3] = 3
rank 1, noderank 1, table[4] = 4
rank 2, noderank 2, table[0] = 0
rank 2, noderank 2, table[1] = 1
rank 2, noderank 2, table[2] = 2
rank 2, noderank 2, table[3] = 3
rank 2, noderank 2, table[4] = 4

C++ Insertion sort on a queue

I am trying to implement 'Insertion Sort' on two queues without using an array.
Queue 1 - 4, 5, 11, 8, 3
Queue 2 - 2, 3, 4, 5, 2, 11
After sorting they are as follows:
Queue 1 - 3, 4, 5, 8, 11
Queue 2 - 2, 2, 3, 4, 5, 11
They get sorted, but I sort the queue as if it were a list; I do not know how to deal with a FIFO structure.
My teacher said that my implementation would be alright for a list, but not for a queue. I am supposed to use the push() and pop() functions (I have already implemented them) and a third queue for assistance. This is my current implementation of the sorting algorithm:
void InsertionSort(queue* &left, queue* &right)
{
    int x, i = 0, j;
    queue *p = left;
    while (p)
    {
        x = getElemAt(i, left, right);
        j = i - 1;
        while (j >= 0 && x < getElemAt(j, left, right))
        {
            setElemAt(j + 1, getElemAt(j, left, right), left, right);
            j--;
        }
        setElemAt(j + 1, x, left, right);
        p = p->next;
        i++;
    }
}
getElemAt and setElemAt are additional functions I've written separately. How should I approach the problem of sorting with an additional queue?
@interjay the queues need to be sorted individually, with the use of a 3rd assisting queue which is used while queue1 or queue2 is being sorted. The queues must not be sorted at the same time, because that would require 2 assisting queues.
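For comparison, here is a minimal sketch of one way to insertion-sort a single FIFO queue using only push/pop/front and one helper queue; std::queue is used for brevity, and a hand-written queue offering push() and pop() would work the same way. It is called once per queue, reusing the same helper:

#include <cstddef>
#include <initializer_list>
#include <iostream>
#include <queue>

// Insertion sort for a FIFO queue: 'helper' always holds the already-sorted
// prefix; each new element is inserted by rotating 'helper' once.
void InsertionSortQueue(std::queue<int>& q, std::queue<int>& helper)
{
    std::size_t sorted = 0;                     // how many elements are currently in 'helper'
    while (!q.empty())
    {
        int x = q.front(); q.pop();             // next element to insert
        bool placed = false;
        for (std::size_t i = 0; i < sorted; ++i)
        {
            int y = helper.front(); helper.pop();
            if (!placed && x < y) { helper.push(x); placed = true; }
            helper.push(y);                     // keep the sorted order
        }
        if (!placed) helper.push(x);            // x is the largest so far
        ++sorted;
    }
    while (!helper.empty())                     // move the sorted run back into q
    {
        q.push(helper.front());
        helper.pop();
    }
}

int main()
{
    std::queue<int> q1, q2, helper;
    for (int v : {4, 5, 11, 8, 3}) q1.push(v);
    for (int v : {2, 3, 4, 5, 2, 11}) q2.push(v);
    InsertionSortQueue(q1, helper);             // helper is empty again afterwards
    InsertionSortQueue(q2, helper);
    while (!q1.empty()) { std::cout << q1.front() << ' '; q1.pop(); }  // 3 4 5 8 11
    std::cout << '\n';
    while (!q2.empty()) { std::cout << q2.front() << ' '; q2.pop(); }  // 2 2 3 4 5 11
    std::cout << '\n';
}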