List VM sizes in Microsoft Azure Compute based on Type or Category - python-2.7

We are trying to list all available sizes for a particular location using the API "GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachines/{vmName}/vmSizes?api-version=2017-12-01". It returns nearly 22,400 sizes. Does a region really contain that many sizes? Is there an elegant way to get VM sizes based on type?
For example:
1. Get VM sizes based on category: General purpose, Memory optimized, Storage optimized, etc.
2. Get VM sizes based on RAM size, CPU count, etc.

I used the sample posted by Laurent (link below) and it returned the name, cores, disks, memory, etc. of every available VM size in the region (use the location=region parameter). If you put some code around it you should be able to do example 2.
Get Virtual Machine sizes list in json format using azure-sdk-for-python
def list_available_vm_sizes(compute_client, region='EastUS2', minimum_cores=1, minimum_memory_MB=768):
    vm_sizes_list = compute_client.virtual_machine_sizes.list(location=region)
    for vm_size in vm_sizes_list:
        if vm_size.number_of_cores >= int(minimum_cores) and vm_size.memory_in_mb >= int(minimum_memory_MB):
            print('Name:{0}, Cores:{1}, OSDiskMB:{2}, RSDiskMB:{3}, MemoryMB:{4}, MaxDataDisk:{5}'.format(
                vm_size.name,
                vm_size.number_of_cores,
                vm_size.os_disk_size_in_mb,
                vm_size.resource_disk_size_in_mb,
                vm_size.memory_in_mb,
                vm_size.max_data_disk_count
            ))

list_available_vm_sizes(compute_client, region='EastUS', minimum_cores=2, minimum_memory_MB=8192)
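
For example 1, the size objects returned by the SDK do not carry a category field, so a rough workaround is to filter on Azure's series naming convention (roughly: B/D series = general purpose, E/M = memory optimized, L = storage optimized, F = compute optimized). A minimal sketch along those lines, assuming the same compute_client as above; the prefix map is hand-maintained and is an approximation, not an official API:

# Approximate "category" filtering via series name prefixes (an assumption, not an official field).
SERIES_PREFIXES = {
    'general_purpose': ('Standard_B', 'Standard_D'),
    'memory_optimized': ('Standard_E', 'Standard_M'),
    'storage_optimized': ('Standard_L',),
    'compute_optimized': ('Standard_F',),
}

def list_vm_sizes_by_category(compute_client, region='EastUS', category='general_purpose'):
    prefixes = SERIES_PREFIXES[category]
    for vm_size in compute_client.virtual_machine_sizes.list(location=region):
        if vm_size.name.startswith(prefixes):  # str.startswith accepts a tuple of prefixes
            print('Name:{0}, Cores:{1}, MemoryMB:{2}'.format(
                vm_size.name, vm_size.number_of_cores, vm_size.memory_in_mb))

list_vm_sizes_by_category(compute_client, region='EastUS', category='memory_optimized')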

Related

Erlang - How is the creation integer (a part of a distributed pid representation) actually created?

In a distributed Erlang system pids can have two different representations: i) internal; ii) external.
The internal representation has the following shape: < A.B.C >
The external representation, used for instance when a message has to travel across different nodes, is instead composed of the following elements: < node_id, ID, serial, creation > according to the official documentation.
Here node_id is the name of the node, ID and serial identify the process on node_id, and creation is an integer used to distinguish the node from past (crashed) versions of itself.
What I could not find is how the creation integer is created by the VM.
In a small experiment on my PC, I observed that if I create and kill the same node several times, the counter always increases by 1. Creating the same node on different machines gives different creation integers, but they have a similar structure, for instance:
machine 1 -> creation integer = 1647595383
machine 2 -> creation integer = 1647596018
Do any of you have any knowledge about how this integer is created? If so could you please explain it to me and possibly reference some (more or less) official documentation?
The creation is sent as part of the response to node registration in epmd; see the details of that protocol.
If you have a custom erl_epmd module, you can also provide your own way of creating the creation value.
The original creation is the local time at which the node with that name is first registered, and it is then bumped once each time the name is re-registered.
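
If you want to see the value epmd hands out, here is a minimal Python sketch based on the ALIVE2 request/response described in the EPMD protocol section of the Erlang distribution docs. The node name and distribution port below are placeholders, and a real node keeps the socket open for as long as it wants to stay registered:

import socket
import struct

def epmd_register(node_name='dummy_node', dist_port=12345, epmd_addr=('127.0.0.1', 4369)):
    # ALIVE2_REQ: tag 120, distribution port, node type (72 = hidden),
    # protocol (0 = TCP/IPv4), highest/lowest distribution version, name, extra.
    name = node_name.encode('utf-8')
    req = struct.pack('>BHBBHHH', 120, dist_port, 72, 0, 6, 5, len(name)) + name
    req += struct.pack('>H', 0)                      # empty "extra" field
    sock = socket.create_connection(epmd_addr)
    sock.sendall(struct.pack('>H', len(req)) + req)  # requests are length-prefixed
    tag, result = struct.unpack('>BB', sock.recv(2))
    if tag == 118:                                   # ALIVE2_X_RESP (OTP 23+), 32-bit creation
        (creation,) = struct.unpack('>I', sock.recv(4))
    else:                                            # ALIVE2_RESP, 16-bit creation
        (creation,) = struct.unpack('>H', sock.recv(2))
    return sock, result, creation                    # keep sock open to stay registered

sock, result, creation = epmd_register()
print('result=%d creation=%d' % (result, creation))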

What determines AWS Redis' usable memory? (OOM issue)

I am using AWS Redis for a project and ran into an Out of Memory (OOM) issue. While investigating, I discovered a couple of parameters that affect the amount of usable memory, but the math doesn't seem to work out for my case. Am I missing any variables?
I'm using:
3 shards, 3 nodes per shard
cache.t2.micro instance type
default.redis4.0.cluster.on cache parameter group
The ElastiCache website says cache.t2.micro has 0.555 GiB = 0.555 * 2^30 B = 595,926,712 B memory.
default.redis4.0.cluster.on parameter group has maxmemory = 581,959,680 (just under the instance memory) and reserved-memory-percent = 25%. 581,959,680 B * 0.75 = 436,469,760 B available.
Now, looking at the BytesUsedForCache metric in CloudWatch when I ran out of memory, I see nodes around 457M, 437M, 397M, 393M bytes. It shouldn't be possible for a node to be above the 436M bytes calculated above!
What am I missing; Is there something else that determines how much memory is usable?
I remember reading it somewhere but I cannot find it right now: I believe BytesUsedForCache is the sum of the RAM and swap used by Redis to store data/buffers.
ElastiCache's docs suggest that swap should not go higher than 300 MB.
I would suggest checking the SwapUsage metric at that time.
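
For reference, a small boto3 sketch for pulling that metric out of CloudWatch. The region and CacheClusterId below are placeholders; SwapUsage (and BytesUsedForCache) live in the AWS/ElastiCache namespace, with one CacheClusterId per node in a cluster-mode-enabled replication group:

import datetime
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=3)

resp = cloudwatch.get_metric_statistics(
    Namespace='AWS/ElastiCache',
    MetricName='SwapUsage',                  # also try 'BytesUsedForCache' for comparison
    Dimensions=[{'Name': 'CacheClusterId', 'Value': 'my-redis-0001-001'}],  # placeholder node id
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=['Maximum'],
    Unit='Bytes',
)
for point in sorted(resp['Datapoints'], key=lambda p: p['Timestamp']):
    print('%s  %d bytes' % (point['Timestamp'], point['Maximum']))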

No space left on device in Sagemaker model training

I'm using a custom algorithm shipped in a Docker image, running on a p2 instance with AWS SageMaker (a bit similar to https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).
At the end of the training process, I try to write my model to the output directory that is mounted via SageMaker (as in the tutorial), like this:
model_path = "/opt/ml/model"
model.save(os.path.join(model_path, 'model.h5'))
Unfortunately, the model apparently gets too big over time and I get the following error:
RuntimeError: Problems closing file (file write failed: time = Thu Jul 26 00:24:48 2018 00:24:49, filename = 'model.h5', file descriptor = 22, errno = 28, error message = 'No space left on device', buf = 0x1a41d7d0, total write[...]
So all my hours of GPU time are wasted. How can I prevent this from happening again? Does anyone know what the size limit is for models stored on the SageMaker-mounted directories?
When you train a model with Estimators, it defaults to 30 GB of storage, which may not be enough. You can use the train_volume_size param on the constructor to increase this value. Try with a large-ish number (like 100GB) and see how big your model is. In subsequent jobs, you can tune down the value to something closer to what you actually need.
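For instance, with the v1 SageMaker Python SDK that the bring-your-own-container example uses, it would look roughly like the sketch below. The ECR image, IAM role, and S3 paths are placeholders; newer SDK versions rename these parameters to image_uri, instance_count, instance_type, and volume_size:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name='123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest',  # placeholder ECR image
    role='arn:aws:iam::123456789012:role/MySageMakerRole',                     # placeholder IAM role
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',
    train_volume_size=100,                # GB of EBS storage attached to the training instance
    output_path='s3://my-bucket/output',  # where /opt/ml/model is uploaded after training
)
estimator.fit('s3://my-bucket/training-data')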
Storage costs $0.14 per GB-month of provisioned storage. Partial usage is prorated, so giving yourself some extra room is a cheap insurance policy against running out of storage.
In the SageMaker Jupyter notebook, you can check free space on the filesystem(s) by running !df -h. For a specific path, try something like !df -h /opt.

Fine-tuning VGG raises memory error

Hi, I'm trying to fine-tune VGG on my problem, but when I try to train the net I get this error:
OOM when allocating tensor with shape[25088,4096]
The net has this structure:
I took this TensorFlow pretrained VGG implementation code from this site.
I only added this procedure to train the net:
with tf.name_scope('joint_loss'):
    joint_loss = (ya_loss + yb_loss + yc_loss + yd_loss + ye_loss + yf_loss +
                  yg_loss + yh_loss + yi_loss + yl_loss + ym_loss + yn_loss)
    # Loss with weight decay
    l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    self.joint_loss = joint_loss + self.weights_decay * l2_loss
    self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(joint_loss)
I tried reducing the batch size to 2 but that doesn't work; I get the same error. The error is due to a big tensor that cannot be allocated in memory. I only get this error during training, because if I just feed a value without minimizing, the net works. How can I avoid this error? How can I save memory on my graphics card (Nvidia GeForce GTX 970)?
UPDATE: if I use GradientDescentOptimizer the training process starts, whereas if I use AdamOptimizer I get the memory error; it seems that GradientDescentOptimizer uses less memory.
Without a backward pass ("feed a value without minimizing"), TensorFlow can immediately de-allocate intermediate activations. With a backward pass, the graph has a giant U-shape, where activations from the forward pass need to be kept in memory for the backward pass. There are some tricks (such as swapping to host memory), but in general backprop means that memory usage will be higher.
Adam does keep some extra bookkeeping variables around, so it will increase memory usage proportional to the amount of memory your weight variables are already using. If your training steps take quite a while (in which case having the variable updates on the GPU isn't important), you could instead locate the optimization ops in host memory.
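As a rough back-of-the-envelope (assuming float32 parameters), the single [25088, 4096] tensor from the error message already illustrates why Adam tips a 4 GB GTX 970 over the edge: Adam keeps two slot variables (m and v) per weight, on top of the weight and its gradient.

# Back-of-the-envelope memory for the fc layer weight in the error, float32 assumed.
weight_elems = 25088 * 4096
mb = weight_elems * 4 / 1024.0 ** 2
print('weight alone:            %.0f MB' % mb)        # ~392 MB
print('weight + gradient:       %.0f MB' % (2 * mb))  # ~784 MB
print('plus Adam m and v slots: %.0f MB' % (4 * mb))  # ~1568 MB, before any activations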
If you need a larger batch size and can't reduce image resolution or model size, combining gradients from multiple workers/GPUs using something like SyncReplicasOptimizer can be a good option. Looking at the paper associated with this model, it looks like they were training on 4 GPUs each with 12GB of memory.

Returning the memory used so I can predict the memory required to compute ML algorithm

I am running a Random Forest ML script on a test-size data set of 5k observations with a set number of parameters and a varying number of forests. My real model is closer to 1 million observations with 500+ parameters. I am trying to calculate how much memory this model would require, assuming x number of forests.
To do this, I could use a method that returns how much memory was used during a run of the script. Is it possible to return this, so that I can calculate the RAM required to compute the full model?
I currently use the following to tell me how long it takes to compute:
global starttime
print "The whole routine took %.3f seconds" % (time() - starttime)
Edit (re my own answer)
I feel like I am conversing with myself a little, but hey ho. I tried running the following code to find out how much memory is actually being used, and why my PC runs out of memory when I increase n_estimators_value. Unfortunately, all of the % memory usage readings come back the same. I assume this is because the memory usage is measured at the wrong time: it needs to be recorded at its peak, while actually fitting the random forest. See code:
psutilpercent = psutil.virtual_memory()
print "\n", " --> Memory Check 1 Percent:", str(psutilpercent.percent) + "%\n"
n_estimators_value = 500
rf = ensemble.RandomForestRegressor(n_estimators = n_estimators_value, oob_score=True, random_state = 1)
psutilpercent = psutil.virtual_memory()
print "\n", " --> Memory Check 2 Percent:", str(psutilpercent.percent) + "%\n"
Are there any methods to find out the peak memory usage? I am trying to calculate how much memory would be required to fit a rather large RF, and I can't calculate this without knowing how much memory my smaller models require.
/usr/bin/time reports peak memory usage for a program. There's also memory_profiler for Python.
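For example, memory_profiler can sample memory while a function runs, so you can take the peak around the fit itself. This sketch assumes memory_profiler is installed and that X_train / y_train are your training arrays (figures are in MiB):

from memory_profiler import memory_usage
from sklearn import ensemble

def fit_forest():
    rf = ensemble.RandomForestRegressor(n_estimators = n_estimators_value, oob_score=True, random_state = 1)
    rf.fit(X_train, y_train)   # X_train / y_train assumed to exist already
    return rf

# memory_usage samples this process every `interval` seconds while fit_forest runs
samples = memory_usage((fit_forest, (), {}), interval=0.1)
print "Peak memory while fitting: %.0f MiB" % max(samples)

From the shell, /usr/bin/time -v python your_script.py reports "Maximum resident set size" for the whole run (GNU time on Linux).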