Using move_pages() to move hugepages? - c++

This question is for:
kernel 3.10.0-1062.4.3.el7.x86_64
non-transparent hugepages allocated via boot parameters; they may or may not be mapped to a file (e.g. a mounted hugetlbfs)
x86_64
According to this kernel source, move_pages() calls do_pages_move() to move a page, but I don't see how it ever calls migrate_huge_page(), even indirectly.
So my questions are:
Can move_pages() move hugepages? If yes, should the page boundary be 4 KB or 2 MB when passing an array of addresses of pages? It seems there was a patch for supporting moving hugepages five years ago.
If move_pages() cannot move hugepages, how can I move them?
After moving hugepages, can I query the NUMA IDs of hugepages the same way I query regular pages, as in this answer?
In the code below I try to move hugepages via move_pages() with page size = 2 MB, but is that the correct way?
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <numaif.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <limits>
int main(int argc, char** argv) {
const int32_t dst_node = strtoul(argv[1], nullptr, 10);
constexpr uint64_t size = 4lu * 1024 * 1024;
constexpr uint64_t pageSize = 2lu * 1024 * 1024;
constexpr uint32_t nPages = size / pageSize;
int32_t status[nPages];
std::fill_n(status, nPages, std::numeric_limits<int32_t>::min());
void* pages[nPages];
int32_t dst_nodes[nPages];
void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
if (ptr == MAP_FAILED) {
throw "failed to map hugepages";
}
memset(ptr, 0x41, nPages*pageSize);
for (uint32_t i = 0; i < nPages; i++) {
pages[i] = &((char*)ptr)[i*pageSize];
dst_nodes[i] = dst_node;
}
std::cout << "Before moving" << std::endl;
if (0 != move_pages(0, nPages, pages, nullptr, status, 0)) {
std::cout << "failed to query pages because " << strerror(errno) << std::endl;
}
else {
for (uint32_t i = 0; i < nPages; i++) {
std::cout << "page # " << i << " is on NUMA node " << status[i] << std::endl;
}
}
// real move
if (0 != move_pages(0, nPages, pages, dst_nodes, status, MPOL_MF_MOVE_ALL)) {
std::cout << "failed to move pages because " << strerror(errno) << std::endl;
exit(-1);
}
constexpr uint64_t smallPageSize = 4lu * 1024;
constexpr uint32_t nSmallPages = size / smallPageSize;
void* smallPages[nSmallPages];
int32_t smallStatus[nSmallPages];
std::fill_n(smallStatus, nSmallPages, std::numeric_limits<int32_t>::min());
for (uint32_t i = 0; i < nSmallPages; i++) {
smallPages[i] = &((char*)ptr)[i*smallPageSize];
}
std::cout << "after moving" << std::endl;
if (0 != move_pages(0, nSmallPages, smallPages, nullptr, smallStatus, 0)) {
std::cout << "failed to query pages because " << strerror(errno) << std::endl;
}
else {
for (uint32_t i = 0; i < nSmallPages; i++) {
std::cout << "page # " << i << " is on NUMA node " << smallStatus[i] << std::endl;
}
}
}
And should I query the NUMA IDs with a 4 KB page size (like the code above), or 2 MB?

For the original 3.10 Linux kernel (not the Red Hat patched one, as I have no LXR for RHEL kernels), the move_pages syscall will force splitting of a huge page (2 MB; both THP and hugetlbfs styles) into small pages (4 KB). move_pages works in short chunks (around 0.5 MB, if I calculated correctly), and the call graph looks like:
move_pages .. -> migrate_pages -> unmap_and_move ->
static int unmap_and_move(new_page_t get_new_page, unsigned long private,
                          struct page *page, int force, enum migrate_mode mode)
{
    struct page *newpage = get_new_page(page, private, &result);
    ....
    if (unlikely(PageTransHuge(page)))
        if (unlikely(split_huge_page(page)))
            goto out;
PageTransHuge returns true for both kinds of hugepages (THP and hugetlbfs):
https://elixir.bootlin.com/linux/v3.10/source/include/linux/page-flags.h#L411
PageTransHuge() returns true for both transparent huge and hugetlbfs pages, but not normal pages.
And split_huge_page calls split_huge_page_to_list, which:
Split a hugepage into normal pages. This doesn't change the position of head page.
The split also increments the vm_event counter THP_SPLIT. The counters are exported in /proc/vmstat ("file displays various virtual memory statistics"). You can check this counter with grep thp_split /proc/vmstat before and after your test.
There was some code for hugepage migration in 3.10 in the unmap_and_move_huge_page function, but it is not called from move_pages. Its only user in 3.10 was migrate_huge_page, which is called only from the memory-failure handler soft_offline_huge_page (__soft_offline_page) (added in 2010):
Soft offline a page, by migration or invalidation,
without killing anything. This is for the case when
a page is not corrupted yet (so it's still valid to access),
but has had a number of corrected errors and is better taken
out.
Answers:
Can move_pages() move hugepages? If yes, should the page boundary be 4 KB or 2 MB when passing an array of addresses of pages? It seems there was a patch for supporting moving hugepages five years ago.
The standard 3.10 kernel's move_pages accepts an array "pages" of 4 KB page pointers; it will break (split) each huge page into 512 small pages and then migrate the small pages. The chances of them being merged back by THP are very low, because move_pages requests physical memory pages separately and they will almost always be non-contiguous.
Don't pass pointers at 2 MB strides: every huge page mentioned will still be split, but only the first 4 KB small page of each will be migrated.
The 2013 patch was not added to the original 3.10 kernel:
v2 https://lwn.net/Articles/544044/ "extend hugepage migration" (3.9);
v3 https://lwn.net/Articles/559575/ (3.11)
v4 https://lore.kernel.org/patchwork/cover/395020/ (click on Related to get access to individual patches, for example move_pages patch)
The patch seems to be accepted in September 2013: https://github.com/torvalds/linux/search?q=+extend+hugepage+migration&type=Commits
If move_pages() cannot move hugepages, how can I move hugepages?
move_pages will move data out of hugepages as small pages. You can either allocate a huge page manually on the correct NUMA node and copy your data over (copying twice if you want to keep the virtual address), or update the kernel to a version with the patch and use the methods and tests of the patch author, Naoya Horiguchi (JP). There is a copy of his tests: https://github.com/srikanth007m/test_hugepage_migration_extension
(https://github.com/Naoya-Horiguchi/test_core is required)
https://github.com/srikanth007m/test_hugepage_migration_extension/blob/master/test_move_pages.c
I'm not sure how to start the test or how to check that it works correctly. Running ./test_move_pages -v -m private -h 2048 with a recent kernel does not increment the THP_SPLIT counter.
His test looks very similar to ours: mmap, memset to fault the pages in, filling the pages array with pointers to small pages, then numa_move_pages.
After moving hugepages, can I query the NUMA IDs of hugepages the same way I query regular pages, as in this answer?
You can query the status of any memory by passing a correctly filled "pages" array to the move_pages syscall in query mode (with nodes set to NULL). The array should list every small page of the memory region you want to check.
If you have a reliable method to check whether memory is mapped to a huge page or not, you can query any single small page of the huge page. I think there could be a probabilistic method if you can export the physical address from the kernel to user space (using some LKM, for example): for a huge page, the virtual and physical addresses always share the low 21 bits, while for small pages the bits coincide only about once in a million tests. Alternatively, just write an LKM that exports the PMD directory.

Related

Memory usage doesn't increase when allocating memory

I would like to find out the number of bytes used by a process from within a C++ program by inspecting the operating system's memory information. The reason is to find possible overhead in memory allocation (due to memory control blocks/nodes in free lists, etc.). Currently I am on macOS and am using this code:
#include <mach/mach.h>
#include <iostream>
int getResidentMemoryUsage() {
task_basic_info t_info;
mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;
if (task_info(mach_task_self(), TASK_BASIC_INFO,
reinterpret_cast<task_info_t>(&t_info),
&t_info_count) == KERN_SUCCESS) {
return t_info.resident_size;
}
return -1;
}
int getVirtualMemoryUsage() {
task_basic_info t_info;
mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;
if (task_info(mach_task_self(), TASK_BASIC_INFO,
reinterpret_cast<task_info_t>(&t_info),
&t_info_count) == KERN_SUCCESS) {
return t_info.virtual_size;
}
return -1;
}
int main(void) {
int virtualMemoryBefore = getVirtualMemoryUsage();
int residentMemoryBefore = getResidentMemoryUsage();
int* a = new int(5);
int virtualMemoryAfter = getVirtualMemoryUsage();
int residentMemoryAfter = getResidentMemoryUsage();
std::cout << virtualMemoryBefore << " " << virtualMemoryAfter << std::endl;
std::cout << residentMemoryBefore << " " << residentMemoryAfter << std::endl;
return 0;
}
When running this code I expected to see that the memory usage had increased after allocating an int. However, when I run the above code I get the following output:
75190272 75190272
819200 819200
I have several questions because this output does not make sense.
Why hasn't either the virtual or resident memory changed after an integer has been allocated?
How come the operating system allocates such large amounts of memory to a running process?
When I run the code and check Activity Monitor, I find that 304 KB of memory is used, but that number differs from the virtual/resident memory usage obtained programmatically.
My end goal is to find the memory overhead when allocating data. Is there a way to do this, i.e. determine the bytes used by the OS and compare them with the bytes allocated to find the difference?
Thank you for reading
The C++ runtime typically allocates a block of memory when a program starts up, then parcels this out to your code when you use things like new, and adds it back to the block when you call delete. Hence, the operating system doesn't know anything about individual new or delete calls. This is also true for malloc and free in C (or C++).
First, you are measuring a number of pages, not really the memory allocated. Second, the runtime pre-allocates a few pages at startup. If you want to observe something, allocate more than a single int: try allocating several thousand and you will see some change.

RAM usage measured with GetProcessMemoryInfo is lower than measure through Task Manager

I have found many discussions online on why the RAM usage of a process as measured by Task Manager is often higher than measured at runtime by the application's own code. For an excellent answer about that, see: allocating ram shows double the ram usage in task manager
However, oddly, I am finding the opposite: my measurement inside the application, using the GetProcessMemoryInfo function, shows higher RAM usage than Task Manager. The code is simply:
#include <iostream>
#include <Windows.h>
#include <psapi.h>
int main(void)
{
uint64_t currentUsedRAM(0);
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize;
const int N(100000000);
int *x = new int[N];
for (int i = 0; i < N; i++)
{
x[i] = 1;
}
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize - currentUsedRAM;
std::cout << "Current RAM used: " << currentUsedRAM << "\n";
return 0;
}
The output is `Current RAM used: 400007168`, measured in bytes (around 400 MB). However, Task Manager shows the process using just 381.8 MB, which is around 18 MB less.
Why would that happen?
Is there a way to make these to converge to a same result?
EDIT:
Following the link suggested in the comments, I also tried Process Explorer instead of Task Manager. With it, the measurement is 391.892 MB. That is closer to what I get from the in-application measurement, but still quite far off.
Even more importantly, I tried increasing the array size by one order of magnitude. Interestingly, the within-application measurement, the Task Manager measurement and the Process Explorer measurement all also increased by one order of magnitude. That is, the difference between the within-application measurement and these tools' numbers also grows proportionally, going from around 18.8 MB (Task Manager) or 9 MB (Process Explorer) to around 185 MB (Task Manager) or 102 MB (Process Explorer).

How many cache misses will we have for this simple program?

I have a simple program as follows. When I compiled the code without any optimization, it took 5.986 s (user 3.677 s, sys 1.716 s) to run on a Mac with a 2.4 GHz i5 processor and 16 GB of DDR3-1600 CAS 9 memory. I am trying to figure out how many L1 cache misses happen in this program. Any suggestions? Thanks!
int main()
{
int size = 1024 * 1024 * 1024;
int * a = new int[size];
int i;
for (i = 0; i < size; i++) a[i] = i;
delete[] a;
return 0;
}
You can measure the number of cache misses using the cachegrind tool of valgrind. This page provides a pretty detailed summary.
note: If you're using C then you should be using malloc. Don't forget to call free: as it stands your program will leak memory. If you are using C++ (and this question is incorrectly tagged), you should be using new and delete.
If you want really fine-grained measurements of cache misses, you should use Intel's architectural counters, which can be accessed from userspace using the rdpmc instruction. The kernel module source I wrote in this answer will enable rdpmc in userspace for older CPUs.
Here is another kernel module to enable configuration of the counters for measuring last-level cache misses and last-level cache references. Note that I have hardcoded 8 cores, because that was what I had happened to use for my configuration.
#include <linux/module.h> /* Needed by all modules */
#include <linux/kernel.h> /* Needed for KERN_INFO */
#define PERFEVTSELx_MSR_BASE 0x00000186
#define PMCx_MSR_BASE 0x000000c1 /* NB: write when evt disabled*/
#define PERFEVTSELx_USR (1U << 16) /* count in rings 1, 2, or 3 */
#define PERFEVTSELx_OS (1U << 17) /* count in ring 0 */
#define PERFEVTSELx_EN (1U << 22) /* enable counter */
static void
write_msr(uint32_t msr, uint64_t val)
{
uint32_t lo = val & 0xffffffff;
uint32_t hi = val >> 32;
__asm __volatile("wrmsr" : : "c" (msr), "a" (lo), "d" (hi));
}
static uint64_t
read_msr(uint32_t msr)
{
uint32_t hi, lo;
__asm __volatile("rdmsr" : "=d" (hi), "=a" (lo) : "c" (msr));
return ((uint64_t) lo) | (((uint64_t) hi) << 32);
}
static uint64_t old_value_perfsel0[8];
static uint64_t old_value_perfsel1[8];
static DEFINE_SPINLOCK(mr_lock); /* SPIN_LOCK_UNLOCKED is long gone */
static unsigned long flags;
static int clear; /* module parameter: set to 1 to only clear/restore the counters */
module_param(clear, int, 0);
static void wrapper(void* ptr) {
int id;
uint64_t value;
spin_lock_irqsave(&mr_lock, flags);
id = smp_processor_id();
// Save the old values before we do something stupid.
old_value_perfsel0[id] = read_msr(PERFEVTSELx_MSR_BASE);
old_value_perfsel1[id] = read_msr(PERFEVTSELx_MSR_BASE+1);
// Clear out the existing counters
write_msr(PERFEVTSELx_MSR_BASE, 0);
write_msr(PERFEVTSELx_MSR_BASE + 1, 0);
write_msr(PMCx_MSR_BASE, 0);
write_msr(PMCx_MSR_BASE + 1, 0);
if (clear){
spin_unlock_irqrestore(&mr_lock, flags);
return;
}
// Table 19-1 in the most recent Intel Manual - Architectural
// Last Level Cache References Event select 2EH, Umask 4FH
value = 0x2E | (0x4F << 8) |PERFEVTSELx_EN |PERFEVTSELx_OS|PERFEVTSELx_USR;
write_msr(PERFEVTSELx_MSR_BASE, value);
// Table 19-1 in the most recent Intel Manual - Architectural
// Last Level Cache Misses Event select 2EH, Umask 41H
value = 0x2E | (0x41 << 8) |PERFEVTSELx_EN |PERFEVTSELx_OS|PERFEVTSELx_USR;
write_msr(PERFEVTSELx_MSR_BASE + 1, value);
spin_unlock_irqrestore(&mr_lock, flags);
}
static void restore_wrapper(void* ptr) {
int id = smp_processor_id();
if (clear) return;
write_msr(PERFEVTSELx_MSR_BASE, old_value_perfsel0[id]);
write_msr(PERFEVTSELx_MSR_BASE+1, old_value_perfsel1[id]);
}
int init_module(void)
{
printk(KERN_INFO "Entering write-msr!\n");
on_each_cpu(wrapper, NULL, 0);
/*
* A non 0 return means init_module failed; module can't be loaded.
*/
return 0;
}
void cleanup_module(void)
{
on_each_cpu(restore_wrapper, NULL, 0);
printk(KERN_INFO "Exiting write-msr!\n");
}
Here is a user-space wrapper around rdpmc.
uint64_t
read_pmc(int ecx)
{
unsigned int a, d;
__asm __volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx));
return ((uint64_t)a) | (((uint64_t)d) << 32);
}
You have to be running on a 64-bit system. You are writing 4 GB of data. The number of cache misses is 4 x 1024 x 1024 x 1024 divided by the cache line size. However, since all the memory accesses are sequential, you will not have many TLB misses etc., and the processor most likely prefetches sequential cache lines.
Your performance here (or lack thereof) is totally dominated by paging: every new page (probably 4 KB in your case) causes a page fault, since it is newly allocated and was never used, and triggers an expensive OS flow. Cachegrind and performance monitors should show the same behavior, so you may be confused if you expected just simple data accesses.
One way to avoid this is to allocate, store once to the entire array (or even once per page) to warm up the page tables, and then measure time internally in your application (using rdtsc or any C API you like) over the main loop.
Alternatively, if you want to use external time measurements, just loop multiple times (> 1000) and amortize, so the initial penalty becomes less significant.
Once you do all that, the cache misses you measure should reflect the number of accesses for each new 64-byte line (i.e. ~64M for the 4 GB written), plus the page walks (~1M pages, assuming they're 4 KB, multiplied by the number of page-table levels, since each walk has to look up memory at each level).
Under a virtualized platform the paging cost roughly squares (e.g. 9 accesses instead of 3), since each level of the guest page table also needs paging on the host.

How to Change the Maximum Size of a malloc() Allocation in C++

As far as I can tell, calling malloc() basically means the program is asking the OS for a chunk of memory. I'm writing a program to interface with a camera, in which I need to allocate chunks of memory large enough to store hundreds of images at a time (it's a fast camera).
When I allocate space for about 1.9 Gb worth of images, everything works just fine. The allocation calculation is pretty simple:
int allocateBurst( int numImages )
{
int streamSize = ZIMAGESIZE * numImages;
data.images = new unsigned short [streamSize];
return 0;
}
But as soon as I go over the 2 Gb limit, I get runtime errors like this:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
It seems like 2 GB might be the maximum size that I can allocate at once. I have 32 GB of RAM, and would like to simply be able to allocate larger pieces of memory in one allocation. Is this possible?
I'm running Ubuntu 12.10.
There may be an underlying issue: the OS can't grant your large memory allocation because it is using memory for other applications. Check with your OS to see what the limits are.
Also know that some OSes will "page" memory to the hard disk. When your program asks for memory outside a resident page, the OS will swap pages with the hard disk. Knowing this, I recommend the classic technique of "double buffering" (or, more generally, "multiple buffering").
You will need at least two threads: a reader and a writer. One thread is responsible for reading data from the camera and placing it into a buffer. When it fills up a buffer, it starts on another one. Meanwhile the writing thread starts on the first buffer and writes it to disk (block file writes). When the writing thread finishes a buffer, it moves on to the next one. The buffers should form a circular sequence so they are reused.
The magic is to have enough buffers so that the reader never catches up with the writer.
Since you are then using a few small buffers, you should not get any errors from the OS.
There are methods to optimize this, such as obtaining static buffers from the OS.
The problem is that you're computing the element count in a signed 32-bit variable when the value needs more bits.
Use "size_t" instead of "int" for holding the storage count. This has nothing to do with what you intend to store, just how large a count of them you need.
#include <iostream>
int main(int /*argc*/, const char** /*argv*/)
{
int units = 2;
// 32-bit signed, i.e. 31-bit numbers.
int intSize = units * 1024 * 1024 * 1024;
// 64-bit values (ULL suffix)
size_t sizetSize = units * 1024ULL * 1024ULL * 1024ULL;
std::cout << "intSize = " << intSize << ", sizetSize = " << sizetSize << std::endl;
try {
unsigned short* intAlloc = new unsigned short[intSize];
std::cout << "intAlloc = " << intAlloc << std::endl;
delete [] intAlloc;
} catch (const std::bad_alloc&) {
std::cout << "intAlloc failed (std::bad_alloc)" << std::endl;
}
try {
unsigned short* sizetAlloc = new unsigned short[sizetSize];
std::cout << "sizetAlloc = " << sizetAlloc << std::endl;
delete [] sizetAlloc;
} catch (const std::bad_alloc&) {
std::cout << "sizetAlloc failed (std::bad_alloc)" << std::endl;
}
return 0;
}
Output (g++ -m64 -o test test.cpp under Mint 15 64 bit with g++ 4.7.3 on a virtual machine with 4Gb of memory)
intSize = -2147483648, sizetSize = 2147483648
intAlloc failed (std::bad_alloc)
sizetAlloc = 0x7f55affff010
int allocateBurst( int numImages )
{
// compute in 64 bits: cast before multiplying, or the product still overflows as int
long streamSize = (long)ZIMAGESIZE * numImages;
data.images = new unsigned short [streamSize];
return 0;
}
Try using long, or make the size calculation use uint64_t.
With int the count is computed in 32 bits, while long (on 64-bit Linux) or uint64_t computes it in 64 bits, which allows the larger allocation.
Hope that helps

Artificially Limit C/C++ Memory Usage

Is there any way to easily limit a C/C++ application to a specified amount of memory (30 MB or so)? E.g., if my application tries to completely load a 50 MB file into memory, it will die / print a message and quit / etc.
Admittedly I can just constantly check the memory usage for the application, but it would be a bit easier if it would just die with an error if I went above.
Any ideas?
Platform isn't a big issue, windows/linux/whatever compiler.
Read the manual page for ulimit on unix systems. There is a shell builtin you can invoke before launching your executable or (in section 3 of the manual) an API call of the same name.
On Windows, you can't set a quota for memory usage of a process directly. You can, however, create a Windows job object, set the quota for the job object, and then assign the process to that job object.
Override all the malloc APIs and provide handlers for new/delete, so that you can bookkeep memory usage and throw exceptions when needed.
I'm not sure this is any easier or less effort than just monitoring memory through OS-provided APIs, though.
In bash, use the ulimit builtin:
bash$ ulimit -v 30000
bash$ ./my_program
The -v limit is in 1 KB blocks.
Update:
If you want to set this from within your app, use setrlimit. Note that the man page for ulimit(3) explicitly says that it is obsolete.
You can limit the size of your process's virtual memory using the system limits. If your process exceeds this amount, it will be killed with a signal (SIGBUS, I think).
You can use something like:
#include <sys/resource.h>
#include <iostream>
using namespace std;
class RLimit {
public:
RLimit(int cmd) : mCmd(cmd) {
}
void set(rlim_t value) {
clog << "Setting " << mCmd << " to " << value << endl;
struct rlimit rlim;
rlim.rlim_cur = value;
rlim.rlim_max = value;
int ret = setrlimit(mCmd, &rlim);
if (ret) {
clog << "Error setting rlimit" << endl;
}
}
rlim_t getCurrent() {
struct rlimit rlim = {0, 0};
if (getrlimit(mCmd, &rlim)) {
clog << "Error in getrlimit" << endl;
return 0;
}
return rlim.rlim_cur;
}
rlim_t getMax() {
struct rlimit rlim = {0, 0};
if (getrlimit(mCmd, &rlim)) {
clog << "Error in getrlimit" << endl;
return 0;
}
return rlim.rlim_max;
}
private:
int mCmd;
};
And then use it like that:
RLimit dataLimit(RLIMIT_DATA);
dataLimit.set(128 * 1024); // 128 KiB -- setrlimit takes the limit in bytes
clog << "soft: " << dataLimit.getCurrent() << " hard: " << dataLimit.getMax() << endl;
This implementation seems a bit verbose but it lets you easily and cleanly set different limits (see ulimit -a).