How do dmapped domains in the Chapel language actually get mapped?

I need to know a few things about how array elements are allocated over a domain map in Chapel. Let me keep this as short as possible:
const region = {1..10, 5..10},
      regionbox = {1..5, 1..5},
      grid2d = /* a 2D arrangement of locales */;
const Space: domain(2) dmapped Block(boundingBox = regionbox,
                                     targetLocales = grid2d)
           = region;
var myarray: [Space] int;
Now Space is a distributed domain, and here is where my question comes in.
In a distributed domain, does each locale have to keep all of the indices locally? That is, for the above example, are the indices that map to a locale actually stored on that locale?
I assume the domain map supports global-view programming, so that when we access myarray[3,5], the access is dynamically mapped to the owning locale using the distribution. Please correct me if I'm wrong.
And how are arrays allocated over distributed domains? Do domain maps compute each locale's local size up front from the given parameters and then allocate local_size elements on each locale? For example, blocking 10 elements over 2 locales would need a local size of 5.
In short, I want to know how array elements are created over a distributed domain, and whether the indices that the distribution maps to a locale are stored on that locale.
Please let me know if this question needs more info. Thank you for your kind help.

As with your previous question, the answer to this question depends on the specific domain map. The Chapel language and compiler expect a domain map (and its implementation of domains and arrays) to support a standard interface, but how it implements that interface is completely up to its author. This interface includes things like "allocate a new domain for me", "allocate a new array over that domain for me", "iterate over the indices/elements of the domain/array", "randomly access the array", etc. Thus, a given domain map implementation may be very space efficient and minimal, or it can allocate everything on every locale redundantly, as its author thinks best.
That said, if we consider standard domain maps like Block, they behave the way you would expect: E.g., for a {1..n, 1..n} array mapped across 4 locales, each locale will store ~(n**2 / 4) elements of the array rather than all n**2 elements. A random access to that array will be implemented by determining which locale owns the element and having the compiler/runtime manage the communication required to get at that remote element (as implemented by the domain map). Information is stored redundantly when it only requires O(1) storage, since this redundancy is better than communicating to get the values. E.g., each locale would store the {1..n, 1..n} bounds of the domain/array since it is cheaper to store those bounds than to communicate with some centralized location to get them.
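To make that concrete, here is a rough C++ sketch (hypothetical names, not Chapel's actual implementation) of the O(1) owner computation a Block-style domain map can perform on each random access:
#include <cstdint>
#include <iostream>

// Hypothetical sketch: which locale owns index (i, j) under a 2D block
// distribution of boundingBox {lo..hi, lo..hi} across an lx-by-ly locale grid.
struct Block2D {
    int64_t lo, hi;   // bounding box extent in each dimension
    int     lx, ly;   // locale grid dimensions

    int ownerLocale(int64_t i, int64_t j) const {
        const int64_t n = hi - lo + 1;
        // Clamp so indices outside the bounding box map to edge locales.
        auto blockOf = [&](int64_t x, int dim) {
            int64_t b = (x - lo) * dim / n;
            return static_cast<int>(b < 0 ? 0 : (b >= dim ? dim - 1 : b));
        };
        return blockOf(i, lx) * ly + blockOf(j, ly);  // row-major locale id
    }
};

int main() {
    Block2D d{1, 8, 2, 2};                    // {1..8, 1..8} over 2x2 locales
    std::cout << d.ownerLocale(3, 5) << "\n"; // prints 1 (locale row 0, col 1)
}
Because this computation needs only the bounding box and the locale grid shape, those O(1) parameters are what each locale replicates; the element buffers themselves stay local.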
This is one of those cases where a picture can be worth a thousand words. Taking a look at the slides for the talks where we introduced these concepts (like slide 34 of this presentation) could be much more instructive than the following text-based description.
Walking through your declarations, here's roughly what happens as this code is executed:
const region    = {1..10, 5..10},
      regionbox = {1..5, 1..5},
      grid2d    = /* a 2D arrangement of locales */;
Nothing about these declarations refers to other locales (no on-clauses, no dmapped clauses), so these would all result in domains and arrays that are stored locally on the locale where the task encountering the declarations is executing (locale #0 at the program's start-up time).
const Space : domain(2) dmapped Block(boundingBox = regionbox,
                                      targetLocales = grid2d)
            = region;
The dmapped Block(...) clause causes an instance of the Block domain map class to be allocated on each locale in grid2d. Each instance of the class stores the bounding box (regionbox) and the set of target locales. Each locale also gets an instance of a class representing the local view of the distribution, named LocBlock, which stores the subset of the 2D plane owned by that locale, as defined by the bounding box and the target locale set.
The declaration and initialization of Space invokes a method on the current locale's copy of the Block domain map object created in the previous step, asking it to create a new domain. This causes each locale in grid2d to allocate a pair of classes corresponding to the global and local views of the domain, respectively. The global descriptor describes the domain's indices as a whole (e.g., region) while the local descriptor describes that locale's personal subset of region.
var myarray: [Space] int;
This declaration asks the current locale's copy of the global Space domain class created in the previous step to create a new array. This causes each locale in grid2d to allocate a pair of classes representing the global and local views of the array. The global view of the array tends not to store much state and is used primarily to dispatch methods on the array to the appropriate local descriptor. The local descriptor stores the array elements corresponding to the locale's subarray.
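To make the global/local pairing concrete, here is a hedged C++ sketch of the six descriptors involved ({global, local} x {distribution, domain, array}); all names are made up for illustration and do not match Chapel's internal classes:
#include <vector>
#include <cstdint>

// Hypothetical sketch of the {global, local} x {dist, domain, array}
// descriptors; in Chapel these are classes allocated on each target locale.
struct LocBlockDist { /* this locale's slice of the 2D plane */ };
struct BlockDist {                        // global view of the distribution
    int64_t bbLo, bbHi;                   // bounding box (replicated, O(1))
    std::vector<LocBlockDist*> locDists;  // one per target locale
};
struct LocBlockDom { /* this locale's subset of the domain's index set */ };
struct BlockDom {                         // global view of the domain
    BlockDist* dist;
    std::vector<LocBlockDom*> locDoms;
};
struct LocBlockArr {                      // local view of the array
    std::vector<int> elems;               // only this locale's elements
};
struct BlockArr {                         // global view: mostly dispatch
    BlockDom* dom;
    std::vector<LocBlockArr*> locArrs;
};
The key point is that only the local array descriptor owns element storage; the global descriptors hold O(1) parameters and pointers, which is why they can be replicated cheaply.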
I hope this helps clarify the issues you are asking about.

Related

How to set "user data" for a body in PlayRho?

How can I associate "user data" - i.e. arbitrary data for my application - with bodies in the PlayRho 0.10.0 physics engine?
In the Box2D 2.4.1 physics engine, I can associate "user data" with bodies, using the userData field of a b2BodyDef instance that I pass to the b2World::CreateBody function and get the value back by calling b2Body::GetUserData(). How do you do this in PlayRho?
Possible Solution:
In your application, you can use an array whose elements are your user-data values and whose indices match the underlying values returned from creating your bodies in PlayRho.
For example, a simple/naive implementation for any void*-compatible user data might look like this:
#include <vector>
#include <PlayRho/PlayRho.hpp> // umbrella header; path/namespace may vary by version

using namespace playrho;

struct MyEntity {
    int totalHealth;
    int currentHealth;
    int strength;
    int intellect;
};

int main() {
    std::vector<void*> myUserData;
    auto world = d2::World{};

    // If you want to pre-allocate 100 spaces...
    myUserData.resize(100);

    const auto body = world.CreateBody();

    // If your # of bodies is unbounded, or you don't want to preallocate space...
    if (body.get() >= myUserData.size()) myUserData.resize(body.get() + 1u);

    // Set your user data...
    myUserData[body.get()] = new MyEntity();

    // Get your user data...
    const auto myEntity = static_cast<MyEntity*>(myUserData[body.get()]);
    (void)myEntity;

    // Free your dynamically allocated MyEntity instances.
    // (Even with Box2D userData you may want to free these.)
    // Note: cast before deleting; deleting through a void* is undefined behavior.
    for (const auto& element : myUserData) {
        delete static_cast<MyEntity*>(element);
    }
    return 0;
}
But if you'd like to avoid dealing with headaches like memory leaks, myUserData could instead be a std::vector<MyEntity>, and the new MyEntity() and delete calls could be avoided entirely.
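For instance, a value-semantics variant of the sketch above (same assumptions about PlayRho's header, World, and body identifiers) might look like:
#include <vector>
#include <PlayRho/PlayRho.hpp> // umbrella header; path/namespace may vary by version

using namespace playrho;

struct MyEntity { int totalHealth, currentHealth, strength, intellect; };

int main() {
    std::vector<MyEntity> myUserData;
    auto world = d2::World{};
    const auto body = world.CreateBody();

    // Grow on demand; elements are value-initialized and nothing needs deleting.
    if (body.get() >= myUserData.size()) myUserData.resize(body.get() + 1u);

    myUserData[body.get()].totalHealth = 100;               // set a field
    const auto health = myUserData[body.get()].totalHealth; // read it back
    (void)health;
    return 0;
}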
Some advantages to this:
Provides greater flexibility. User data is often application specific. Since you implement this storage yourself when using PlayRho, you're freer to make the elements any type you want, and you don't have to make all of your uses of the physics engine share the same user-data type. Your user-data type could be world-specific, for instance, whereas in Box2D all uses of its userData field would have the same type.
Avoids wasting memory. Not all applications need user data, so those that don't won't waste this memory or have to modify the library code to avoid it.
Disadvantages:
It is different from how this is done in Box2D.
It may require more coding effort on your part if you don't care about the extra flexibility (though that flexibility might save you some coding effort, too).
Background/explanation:
While the PlayRho physics engine started off as a fork of the Box2D physics engine, PlayRho has moved away from reference semantics (pointers) and toward value semantics (values). So pointers were replaced, and "user data" was removed outright in favor of alternatives like this possible solution. Additionally, with this shift to value semantics, creating a body changed from getting a pointer to a new body back from the world to getting what is basically an integer index to the new body. That index acts as an identifier for the new body within the world, and is essentially a counter that starts from 0 and is incremented every time you create a body. This means you can have O(1) lookups from an array, using the underlying body ID value as the index of the element that stores your user data. Using a std::unordered_map<b2Body*, b2BodyUserData> would also provide O(1) lookups, but hashed maps tend to be less cache friendly on modern hardware than arrays, so it makes more sense to avoid that overhead in Box2D, by setting aside storage for a user-data value per body, than it does in PlayRho.

Continuation on: "How do dmapped domains in chapel language get actually mapped onto?"

Sorry for asking again for clarity on this answer! I have a few more questions about domain maps. I would highly appreciate it, and be very thankful, if you could clear up my doubts about them.
I hope I have ordered the questions sequentially.
1.) What are domain maps? - A domain map defines a mapping from the global indices of domains and arrays to a set of locales in the cluster.
I have summarized below what I understood from the research paper and various presentations, which may well be wrong. Please feel free to correct it.
const Domain = {1..8, 1..8} dmapped Block({1..8, 1..8});
Here {1..8, 1..8} is the index space (domain) that is distributed to the locales using a Block-distribution domain map with boundingBox = {1..8, 1..8}.
From the Block domain map constructor:
proc Block( boundingBox: domain,
            targetLocales: [] locale = Locales,
            dataParTasksPerLocale = ...,
            dataParMinGranularity = ... )
The Block domain map only wants to know about the boundingBox, targetLocales, and the dataPar* arguments; there is no need for a domain here, in this case {1..8, 1..8}. I find it difficult to get things straight because Chapel itself has many interfaces for creating domain maps, and some of them hide information from the user.
So my question is: does the Block domain map create instances on targetLocales which hold the local index sets, such as {1..2, 3..4} on locale 1 and {1..2, 1..2} on locale 2 (where these numbers are just examples, to illustrate the mapping process)?
In the previous answer, Dr. Brad Chamberlain mentioned that
"the dmapped Block(...) clause creates instances on the target locales. Each instance of the Block domain map class stores the bounding box and the set of target locales."
I couldn't quite make sense of that. :(
Overall, please explain how domain maps, domains, and arrays work together. I have researched some of the steps, but every source misses some piece of information I need to fully understand domain maps.
In this presentation, on slide 34, the local instances of the domain map and the domain store only the index space, nothing special.
In the previous answer, Dr. Brad Chamberlain also mentioned that
"a given domain map implementation may be very space efficient and minimal, or it can allocate everything on every locale redundantly, as its author thinks best";
in this context, what does "allocate everything on every locale redundantly" actually mean? Does it mean storing the whole array on every locale?
In PGAS, are the global instances of a domain map, domain, and array visible across all locales? I also assume that every query to them takes place through the global instances.
I kindly request that you point out the required interfaces for a domain map, as mentioned in the documentation.
I will highly appreciate, and be thankful for, any explanation. Thank you very much.
1.) What are domain maps?
A domain map is a Chapel class whose purpose is to describe how domains—and any arrays declared over those domains—are implemented. Practically speaking, they specify:
how to implement domains declared in terms of that domain map, including how to store them in memory and how to implement methods that the compiler may require (like iteration, membership tests, etc.)
how to implement arrays declared in terms of that domain, including how to store them in memory and how to implement methods that the compiler may require (like iteration, random access, etc.)
The Block domain map only wants to know about the boundingBox, targetLocales, and the dataPar* arguments; there is no need for a domain here, in this case {1..8, 1..8}.
That's correct, and this is often a point of confusion when learning the Block distribution. Let's use a slightly artificial example to make the point:
const D = {1..4, 1..4} dmapped Block({1..8, 1..8});
This declaration says that the plane of 2D indices should be distributed to the locales by blocking the bounding box {1..8, 1..8} between the locales as evenly as possible. If I'm running on 4 locales, locale 0 will own {1..4, 1..4} of the bounding box, which implies it will own {-inf..4, -inf..4} in the 2D plane (where inf is meant to mean "infinity"; practically speaking, -inf will be represented by a small integer). Similarly, locale 3 will own {5..8, 5..8} of the bounding box, or {5..inf, 5..inf} of the 2D plane.
As you note, the {1..8, 1..8} above is only an argument to the Block domain map and thus has no bearing on D's value, only its implementation. Because D's value is {1..4, 1..4} it is owned completely by locale 0 since it fits completely within that locale's portion of the 2D plane. If instead it were {2..5, 2..5}, locale 0 would own {2..4, 2..4}, locale 1 would own {2..4, 5..5}, locale 2 would own {5..5, 2..4} and locale 3 would own {5..5, 5..5}.
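As a quick check of that arithmetic, here is a small hedged C++ sketch (hypothetical, just mirroring the example) that intersects a domain's range with each locale's chunk along one dimension, using the edge-extension to +/- infinity described above:
#include <cstdint>
#include <iostream>
#include <algorithm>

// Hypothetical sketch: one dimension of boundingBox {1..8} over 2 locales,
// with the edge chunks extended toward +/- infinity.
struct Range { int64_t lo, hi; };  // inclusive bounds

int main() {
    const int64_t INF = INT64_MAX / 2;
    const Range chunk[2] = {{-INF, 4}, {5, INF}};  // locale chunks
    const Range D{2, 5};                           // the domain {2..5}

    for (int l = 0; l < 2; ++l) {
        const Range owned{std::max(D.lo, chunk[l].lo),
                          std::min(D.hi, chunk[l].hi)};
        std::cout << "block " << l << " owns {" << owned.lo
                  << ".." << owned.hi << "}\n";
    }
}
Running this prints {2..4} for the first block and {5..5} for the second, matching the {2..5, 2..5} example above in each dimension.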
Because a Block distribution is parameterized by this bounding box domain, this often leads to confusion about why there are two domains used in a declaration like the one above, especially since they are often identical in practice. This has nothing to do with domain maps inherently, and everything to do with the fact that the Block domain map's constructor expects a domain as one of its arguments. In contrast, the Cyclic and BlockCyclic domain maps don't take domain arguments, and are often simpler to understand as a result.
Did the Block domain map create instances on targetLocales which hold the local index sets...?
Yes, in practice the Block domain map creates an instance of a LocBlock class which stores each locale's local index set—e.g., {-inf..4, -inf..4} on locale 0, {-inf..4, 5..inf} on locale 1, etc., for my previous example.
Each instance of the Block domain map class stores the bounding box and the set of target locales.
Yes, each locale also stores a copy of the Block domain map class which stores all of the key arguments which parameterize the domain map so that the distribution it defines can be reasoned about by each locale in isolation.
In this presentation, on slide 34, the local instances of the domain map and the domain store only the index space, nothing special.
That's correct: the role of the local Block domain map class is to store the portion of the 2D plane owned by that locale (locale #4, or L4, in the slide). Similarly, the role of the local Block domain class is to store the portion of the domain's index set owned by that locale. These are not special, but they are important for defining the distribution and the distributed domain, respectively.
2.) In the previous answer, Dr. Brad also mentioned that "a given domain map implementation may be very space efficient and minimal, or it can allocate everything on every locale redundantly, as its author thinks best"; in this context, what does "allocate everything on every locale redundantly" actually mean? Does it mean storing the whole array on every locale?
Yes, by "allocate everything on every locale redundantly" I meant each locale stores the entire array. This is probably not particularly scalable, but the point is that the language and domain map framework say nothing about how arrays are stored as long as they support the expected interface. So the author of a domain map could do this if they chose to.
In PGAS, are the global instances of a domain map, domain, and array visible across all locales? I also assume that every query to them takes place through the global instances.
You are correct that the global instances of the classes are visible across all locales due to Chapel's PGAS (Partitioned Global Address Space) model. That said, since communication between locales is expensive, and these objects tend to change their fields only rarely, in practice we tend to replicate them across the locales. On the slide 34 that you referred to, the tag "(logically)" refers to this: there is one conceptual copy of each object, but in practice we tend to replicate them across locales (the author of a domain map may or may not do this, as they wish).
I kindly request that you point out the required interfaces for a domain map, as mentioned in the documentation.
The current documentation on domain map interfaces is available here. The current sources for the Block domain map which implement the 6 descriptors ({global, local} x {domain map, domain, array}) can be found here.

Avoid cache misses related to a 1:N relationship in an entity-component-system

How can I avoid cache misses related to a 1:N (pirateship-cannon) relationship in an entity-component-system (ECS)?
For example, a PirateShip can have 1-100 cannons (1:N).
Each cannon can detach/attach freely to any pirateship at any time.
For various reasons, both PirateShip and Cannon have to be entities.
Memory diagram
In the first few time-steps, while ships and cannons are gradually created, the ECS memory layout looks very nice:
Image note:
Left = low address, right = high address.
Although there appear to be gaps, ShipCom and CannonCom are actually compact arrays.
It is really fast to access cannon information from a ship and vice versa (pseudo-code):
Ptr<ShipCom> shipCom = ...;
EntityPtr ship = shipCom;   // implicit conversion
Array<EntityPtr> cannons = getAllCannonFromShip(ship);
for (auto cannon : cannons) {
    Ptr<CannonCom> cannonCom = cannon;
    // ... use cannonCom ...
}
Problem
In later time-steps, some ships/cannons are randomly created and destroyed.
As a result, the Entity, ShipCom, and CannonCom arrays have gaps scattered around.
When I allocate a new one, I get a random memory block from the pool.
EntityPtr ship = ...;  // an old existing ship
EntityPtr cannon = createNewEntity();
Ptr<CannonCom> cannonCom = createNew<CannonCom>(cannon);
attach_Pirate_Cannon(ship, cannon);
// ^ the ship entity and the cannon tend to have very different (far-apart) addresses
Thus, the "really fast" code above become bottom-neck. (I profiled.)
(Edit) I believe the underlying cache misses also occur because of differing addresses between cannons within the same ship.
For example (# is the address of a turret component):
ShipA has turret#01 to turret#49
ShipB has turret#50 to turret#99
In later timesteps, if turret#99 is moved to ShipA, it becomes:
ShipA has turret#01 to turret#49 + turret#99 (mem jump)
ShipB has turret#50 to turret#98
Question
What design pattern / C++ technique can reduce cache misses from this kind of frequently-used relationship?
More information:
In the real case, there are a lot of 1:1 and 1:N relationships. A given relationship binds a specific component type to another specific component type.
For example, relation_Pirate_Cannon = (ShipCom:CannonCom), relation_physic_graphic = (PhysicCom:GraphicCom).
Only some of the relationships are accessed indirectly this often.
The current architecture has no limit on the number of Entity/ShipCom/CannonCom instances;
I don't want to restrict it at the beginning of the program.
I'd prefer an improvement that doesn't make game-logic coding harder.
The first solution that comes to mind is to enable relocation, but I believe that is a last-resort approach.
A possible solution is to add another layer of indirection. It slows things down a bit, but it helps keep your arrays compact and could speed up the whole thing. Profiling is the only way to know if it really helps.
That being said, how to do that?
Here is a brief introduction to sparse sets; it's worth reading before proceeding, to better understand what I'm describing.
Instead of creating relationships between items within the same array, use a second array to point into.
Let's call the two arrays reverse and direct:
reverse is accessed through the entity identifier (a number, thus just an index into the array). Each slot contains an index into the direct array.
direct is accessed, well... directly, and each slot contains the entity identifier (that is, an index into the reverse array) and the actual component.
Whenever you add a cannon, get its entity identifier and the first free slot in the direct array. Set slot.entity to your entity identifier and put the index of the slot into reverse[entity]. Whenever you drop something, copy the last element of the direct array into the freed slot to keep the array compact, and adjust the indexes so that the relationships hold up.
The ship stores indexes into the outer array (reverse), so you are free to shuffle things back and forth within the inner array (direct), as in the sketch below.
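Here is a minimal C++ sketch of that direct/reverse scheme (the names and the Cannon payload are made up for illustration):
#include <cstdint>
#include <vector>

using Entity = std::uint32_t;

struct Cannon { int damage = 0; };  // hypothetical payload

class CannonSet {
    struct Slot { Entity entity; Cannon component; };
    std::vector<std::uint32_t> reverse; // entity -> index into direct
    std::vector<Slot> direct;           // compact, iteration-friendly

public:
    void add(Entity e, Cannon c) {
        if (e >= reverse.size()) reverse.resize(e + 1);
        reverse[e] = static_cast<std::uint32_t>(direct.size());
        direct.push_back({e, c});
    }

    void remove(Entity e) {
        const auto idx = reverse[e];
        direct[idx] = direct.back();       // keep direct compact
        reverse[direct[idx].entity] = idx; // fix moved element's back-link
        direct.pop_back();
    }

    Cannon& get(Entity e) { return direct[reverse[e]].component; }

    // Systems iterate over the compact direct array: minimal cache misses.
    auto begin() { return direct.begin(); }
    auto end()   { return direct.end(); }
};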
What are advantages and disadvantages?
Well, whenever you access the cannons through the outer array, you pay an extra jump because of the extra layer of indirection. Still, as long as you keep the number of accesses made this way low, and iterate over the direct array in your systems, you have a compact array to iterate and the lowest number of cache misses.
How about sorting the entities to minimize cache misses?
Every time a cannon and/or ship is added/destroyed/moved, sort the entities.
I am not sure if this is feasible in your ECS;
it wouldn't be practical if you depend heavily on entity indices, since they would change every time you sort. A rough sketch follows.
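As a sketch of what such a reordering pass might look like (hypothetical types; this assumes nothing holds raw indices into the array across the sort):
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical: group cannon components by owning ship so that each
// ship's cannons end up contiguous in memory.
struct CannonCom { std::uint32_t ownerShip; int damage; };

void sortByOwner(std::vector<CannonCom>& cannons) {
    std::stable_sort(cannons.begin(), cannons.end(),
                     [](const CannonCom& a, const CannonCom& b) {
                         return a.ownerShip < b.ownerShip;
                     });
    // Any stored indices into `cannons` are now stale and must be rebuilt,
    // e.g. via a reverse table as in the sparse-set answer above.
}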

Emulate memory-mapping of a game console, access different locations based on the address provided

I am implementing an emulator for an old game console, mostly for learning purposes.
This console maps ROMs, and a lot of other things, into regions within its address space. Certain locations are also mirrored, so that multiple addresses can correspond to the same physical location. I would like to emulate this, but I am not sure what a good approach would be (I also have no idea what this process is called, hence this somewhat generic question).
One thing that does work is a simple unordered map. It contains absolute addresses and the corresponding pointers to my data structures. This way, I can easily map everything I need into the system's address space. The problem with this approach is that it's obviously a memory hog: even with small ROMs, I end up with close to ten million entries, thanks to the aforementioned mirroring. Surely this can't be the right thing to do?
Any help is much appreciated.
Edit:
To provide some details as to how exactly I am doing this:
The system in question is, of course, the SNES. Using this wiki as my primary resource, I implemented what I mentioned above as follows:
Create a std::unordered_map<uint32_t, uint8_t*> mMemoryMap;
Check whether the ROM is LoROM or HiROM.
For each byte in the ROM:
Calculate the address where it should be mapped, and emplace both the address and a pointer to said byte into the map.
If the section needs to be mirrored somewhere else, repeat the above.
The same is applied to anything else I need to make available, such as video or system memory.
If I now want to access anything within the address space, I can simply use the address the system would use internally.
I'm assuming that for contiguous addresses the physical locations are also contiguous, within certain blocks of memory or "chunks". That is, if the address 0x0000 maps to 0xFF00, then 0x0004 maps to 0xFF04.
If they work like that, then you can make a list that contains the information of those chunks. Say:
struct Chunk
{
    int addressStart, memoryStart, size;
};
The chunks can be ordered by addressStart, so you can find the correct chunk for any given address. This requires you to iterate over the list, but if you have only a few chunks this may be acceptable.
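A hedged sketch of that lookup (types adjusted slightly; a real emulator would also distinguish reads from writes and handle unmapped addresses):
#include <cstdint>
#include <vector>

struct Chunk {
    std::uint32_t addressStart; // first CPU address covered by this chunk
    std::uint8_t* memoryStart;  // backing storage for addressStart
    std::uint32_t size;
};

// Chunks sorted by addressStart; a linear scan is fine for a handful of them.
std::uint8_t* resolve(std::vector<Chunk>& chunks, std::uint32_t addr) {
    for (auto& c : chunks) {
        if (addr >= c.addressStart && addr < c.addressStart + c.size)
            return c.memoryStart + (addr - c.addressStart);
    }
    return nullptr; // unmapped (open bus on real hardware)
}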
Rather than using simple maps (which even with ranges can grow to large sizes), you can use a more intelligent map.
For instance, if the console maps 0x10XXXX through 0x1FXXXX all to the same 0x20XXXX, you can design a structure which encodes that repetition (start 0x100000, end 0x1FFFFF, repeat 0x010000; although you may want to use a bitmask rather than a repeat distance).
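For example, here is a hedged sketch of such a mask-based entry (the layout is hypothetical; the mask folds every mirror onto one canonical copy of the region):
#include <cstdint>

// Hypothetical mirrored-range entry: any address in [start, end] is folded
// onto the backing range by masking off the bits that select the mirror.
struct MirroredRange {
    std::uint32_t start, end;  // e.g. 0x100000 .. 0x1FFFFF
    std::uint32_t mask;        // e.g. 0x00FFFF keeps the offset, drops mirror bits
    std::uint8_t* backing;     // storage for one copy of the region

    bool contains(std::uint32_t addr) const {
        return addr >= start && addr <= end;
    }
    std::uint8_t& at(std::uint32_t addr) const {
        return backing[addr & mask];  // all mirrors hit the same byte
    }
};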
I'm currently in the same boat as you, writing an NES emulator for learning purposes. The complete memory map is declared as an array of pointers to bytes. Each pointer can point into another array, which allows me to have multiple pointers to the same data in the case of mirroring.
byte *memoryMap[cpuMemSize];
I iterate over the addresses that repeat and map them to point to the bytes in the following array. The ram array is the memory that gets mapped 4 times across the CPU's memory map.
byte ram[ramSize];
The following code goes through the RAM array and maps it across the CPU memory map 4 times.
// map RAM 4 times to emulate mirroring
mem_address memAddr = 0;
for (int x = 0; x < 4; ++x)
{
    for (mem_address ramAddr = 0; ramAddr < ramSize; ++ramAddr.value)
    {
        memoryMap[memAddr.value++] = &ram[ramAddr.value];
    }
}
You can then write a value to the 256th byte using something like this, which would of course be propagated to the other parts of the memory map, since they point to the same byte in memory:
*memoryMap[0x00ff] = 10;
I haven't really tested this yet and want to do more testing with regard to CPU cache use and performance. I was out searching for other ways of doing this when I stumbled on your question and figured I'd put in my (unverified) two cents. Hope this makes sense.

Why are empty classes 8 bytes and larger classes always > 8 bytes?

class foo { }
writeln(foo.classinfo.init.length); // = 8 bytes
class foo { char d; }
writeln(foo.classinfo.init.length); // = 9 bytes
Is D actually storing anything in those 8 bytes, and if so, what? It seems like a huge waste: if I'm just wrapping a few value types, then the class significantly bloats the program, especially if I'm using a lot of them. A char becomes 8 times larger, while an int becomes 3 times as large.
A struct's minimum size is 1 byte.
In D, objects have a header containing 2 pointers (so it is 8 or 16 bytes, depending on your architecture).
The first pointer is to the virtual method table. This is a compiler-generated array filled with function pointers, so that virtual dispatch is possible. All instances of the same class share the same virtual method table.
The second pointer is the monitor. It is used for synchronization. It is not certain that this field will stay here forever, because D emphasizes thread-local storage and immutability, which make synchronization on many objects useless. As this field predates those features, it is still here and can be used; however, it may disappear in the future.
Such a header on objects is very common; you'll find the same in Java or C#, for instance. You can look here for more information: http://dlang.org/abi.html
D uses two machine words in each class instance for:
A pointer to the virtual function table. This contains the addresses of virtual methods. The first entry points towards the class's classinfo, which is also used by dynamic casts.
The monitor, which allows the synchronized(obj) syntax, documented here.
These fields are described in the D documentation here (scroll down to "Class Properties") and here (scroll down to "Classes").
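For comparison, here is a small C++ sketch of the analogous, though smaller, header cost: a polymorphic C++ class carries only the vtable pointer and no monitor word, so where D reserves two words per object, C++ reserves one:
#include <iostream>

struct Empty {};                            // no header at all in C++
struct Poly { virtual ~Poly() = default; }; // gets a vtable pointer

int main() {
    std::cout << sizeof(Empty) << '\n'; // typically 1
    std::cout << sizeof(Poly)  << '\n'; // typically 8 on 64-bit: the vptr
}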
I don't know the particulars of D, but in both Java and .NET, every class object contains information about its type, and also holds information about whether it's the target of any monitor locks, whether it's eligible for finalization cleanup, and various other things. Having a standard means by which all objects store such information can make many things more convenient for both users and implementers of the language and/or framework. Incidentally, in 32-bit versions of .NET, the overhead for each object is 8 bytes, except that there is a 12-byte minimum object size. This minimum stems from the fact that when the garbage collector moves objects around, it needs to temporarily store in the old location a reference to the new one, as well as some sort of linked data structure that will permit it to examine arbitrarily-deep nested references without needing an arbitrarily-large stack.
Edit
If you want to use a class because you need to be able to persist references to data items, space is at a premium, and your usage patterns are such that you'll know when data items are still useful and when they become obsolete, you may be able to define an array of structures, and then pass around indices to the array elements. It's possible to write code to handle this very efficiently with essentially zero overhead, provided that the structure of your program allows you to ensure that every item that gets allocated is released exactly once and things are not used once they are released.
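Here is a hedged C++ sketch of that array-of-structures approach (hypothetical names; as stated above, it assumes each allocated item is released exactly once and never used afterward):
#include <cstdint>
#include <vector>

// Hypothetical pool: items live in a compact array; "references" are indices.
struct Item { int payload; };

class ItemPool {
    struct Slot { Item item; std::int32_t nextFree; };
    std::vector<Slot> slots;
    std::int32_t freeHead = -1;

public:
    std::int32_t allocate() {
        if (freeHead >= 0) {                  // reuse a released slot
            const auto idx = freeHead;
            freeHead = slots[idx].nextFree;
            return idx;
        }
        slots.push_back({{}, -1});            // grow the array
        return static_cast<std::int32_t>(slots.size()) - 1;
    }
    void release(std::int32_t idx) {          // caller must release exactly once
        slots[idx].nextFree = freeHead;
        freeHead = idx;
    }
    Item& operator[](std::int32_t idx) { return slots[idx].item; }
};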
If you would not be able to readily determine when the last reference to an object goes out of scope, eight bytes is a very reasonable level of overhead. I would expect most frameworks to force objects to be aligned on 32-bit boundaries (so I'm surprised that adding a byte pushes the size to nine rather than twelve). If a system is going to have a garbage collector that works better than a Commodore 64's(*), it needs an absolute minimum of one bit of overhead per object to indicate which things are used and which aren't. Further, unless one wants separate heaps for objects which can contain supplemental information and those which can't, one needs every object to either include space for a supplemental-information pointer, or include space for all the supplemental information itself (locking, abandonment notification requests, etc.). While it might be beneficial in some cases to have separate heaps for the two categories of objects, I doubt the benefits would often justify the added complexity.
(*) The Commodore 64 garbage collector worked by allocating strings from the top of memory downward, while variables (which are not GC'ed) were allocated bottom-up. When memory got full, the system would scan all variables to find the reference to the string that was stored at the highest address. That string would then be moved to the very top of memory and all references to it would be updated. The system would then scan all variables to find the reference to the string at the highest address below the one it just moved and update all references to that. The process would repeat until it didn't find any more strings to move. This algorithm didn't require any extra data to be stored with strings in memory, but it was of course dog slow. The Commodore 128 garbage collector stored with each string in GC space a pointer to the variable that holds a reference and a length byte that could be used to find the next lower string in GC space; it could thus check each string in order to find out whether it was still used, relocating it to the top of memory if so. Much faster, but at the cost of three bytes' overhead per string.
You should look into the storage requirements of the various types. Every instruction and storage allocation (i.e., variable, object, etc.) uses up a specific amount of space. In C#, an Int32 object stores its integer payload in 4 bytes (32 bits). It may carry other information too, because it is an object, while your char data type probably requires only 1 byte of information. Constructs like for or while in your class take up space as well, because each of them tells the class to do something. The class itself requires a number of instructions just to be created in memory, which would account for the initial 8 bytes.
Take an assembler language course. You'll learn all you ever wanted to know, and then some, about why your programs use however much memory or storage they do when compiled.