CPU load percentage for individual processors - wmi

I am using the following WMI query to get the CPU load percentage data:
Select * from Win32_Processor
The following instance results were captured:
Win32_processor WMI query results
From the above data, it is clear that two physical processor instances are available (CPU0 and CPU1).
However, on some machines the LoadPercentage property of these instances always reports 100, and Microsoft recommends using the following WMI class instead. So I ran the query below on the same machine:
Select * from Win32_PerfFormattedData_Counters_ProcessorInformation
The results were captured as below:
Win32_PerfFormattedData_Counters_ProcessorInformation WMI query results
In the above results from the same machine, the Win32_Processor class gives two instances, while Win32_PerfFormattedData_Counters_ProcessorInformation gives only one (the other entries, such as 0,0, 0,1, 0,2 and 0,3, are for the cores and not the processor instances).
I was assuming that it would give 0,_Total and 1,_Total as the processor instances. _Total is the overall load percentage across all the processors.
Note: The above results come from a virtual machine. I don't have a physical machine with two physical processors to verify this. My assumption is that, since the Win32_Processor class returns data for two physical processor instances, Win32_PerfFormattedData_Counters_ProcessorInformation should also return data for two physical processor instances.
Please let me know how to get the individual processor information using Win32_PerfFormattedData_Counters_ProcessorInformation, as Win32_Processor gives wrong values. My requirement is to collect the CPU load percentage both overall and for each individual processor instance.

How can I aggregate intel amplifier batch results?

I'm solving a number of instances with my code and I need to find the worst hotspots, where "worst" means a hotspot across a wide range of instances. For every instance I have collected hotspot analysis data in batch mode using amplxe-cl. Now I'd like to aggregate this data and analyze it together. Is there any way to do this with VTune?
Update:
This is not an MPI application. There are a number of different datasets (problems, instances, pick your term :-) that need to be processed by my application. Depending on the data in a single instance, the application can take very different turns while processing it, so running the application on different instances can result in different hotspots. The purpose of the aggregation, as @ArunJose_Intel guessed, is to find hotspots that are common to all runs, that is, present in the processing of all kinds of instances.
I can collect hotspot analysis for every instance easily using batch mode and I can inspect them individually, but I'd like to see an aggregate analysis.
Of course, I could just process them in one run one after the other, but that would take several weeks, while I can process them as individual problems in a few hours on a cluster of identical machines.
In VTune it is not possible to combine multiple GUI reports. You can compare two different reports to see what has changed, but clearly this is not what you are looking for.
A workaround you could try is to create command line reports from the VTune results you have already collected. These reports can be written in easily parsable formats such as CSV. Once you have them, you could write your own scripts to aggregate the CSV reports with whatever logic you wish.
Below are some samples for creating command line reports.
1) Generate a Hotspots report from the r001hs result on Linux*, and save it to /home/test/MyReport.txt in text format.
vtune -report hotspots -result-dir r001hs -report-output /home/test/MyReport.txt
2) Generate a Hotspots report in CSV format from the most recent result and save it in the current Linux working directory. Use the format option with the csv argument and the csv-delimiter option to specify a delimiter, such as comma.
vtune -R hotspots -report-output MyReport.csv -format csv -csv-delimiter comma
For more information:
https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/command-line-interface/generating-command-line-reports.html
https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/command-line-interface/generating-command-line-reports/saving-and-formatting-reports.htm
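The aggregation step itself could look roughly like the sketch below. It is only an illustration: the column names "Function" and "CPU Time" are assumptions about the CSV header (check the header of your own reports), and ranking by "number of runs a function appears in, then total time" is just one possible definition of a common hotspot.

```python
import csv
from collections import defaultdict


def load_report(path, delimiter=","):
    """Read one CSV report into a list of dict rows (header row required)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def aggregate_hotspots(reports, key_col="Function", time_col="CPU Time"):
    """Combine the rows of several runs into one ranking.

    `reports` is an iterable with one list of dict rows per run.
    The column names are assumptions, not guaranteed VTune output.
    Returns (name, runs_seen, total_time) tuples, most common first.
    """
    total_time = defaultdict(float)
    runs_seen = defaultdict(int)
    for rows in reports:
        seen_in_run = set()
        for row in rows:
            name = row[key_col]
            total_time[name] += float(row[time_col])
            seen_in_run.add(name)
        # Count each function at most once per run.
        for name in seen_in_run:
            runs_seen[name] += 1
    return sorted(
        ((name, runs_seen[name], total_time[name]) for name in total_time),
        key=lambda item: (-item[1], -item[2]),
    )
```

With one report per instance in a directory, something like `aggregate_hotspots(load_report(p) for p in glob.glob("reports/*.csv"))` would produce the combined ranking (the directory name is hypothetical).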

How is the 'finalDigest' calculated in the 'sevLaunchAttestationReportEvent' log entry for confidential VMs in GCP?

I've experimented with launching some confidential VM instances. The simple scenario includes:
Launch an instance named 'Alice'.
Stop and relaunch instance 'Alice'.
Delete instance 'Alice', create a new VM instance named 'Alice'
I checked the 'sevLaunchAttestationReportEvent' log entry.
As expected, the 'guestMemoryRegion' digest was identical in all three cases.
However, the 'finalDigest' was different in all three cases. My questions are:
A. How is the 'finalDigest' calculated?
B. What is the purpose of a 'finalDigest' that is different at each launch of an identical VM image?
C. Can the 'finalDigest' be pre-calculated before instantiation?
Thanks.
First of all, a Confidential Virtual Machine runs on hosts based on the second generation of AMD EPYC processors. It is optimized for security workloads and includes inline memory encryption, which ensures that data is encrypted while it is in RAM.
You can consult the following documentation to get further information.
Regarding your questions:
A. How is the 'finalDigest' calculated?
To calculate a digest value, a digest algorithm is used. Such algorithms include:
SHA-1
SHA-256
SHA-384
SHA-512
MD5
These are functions that take a large document and compute a "digest" (also called a "hash"); this is typically used in a digital signing process.
B. What is the purpose of a 'finalDigest' that is different at each launch of an identical VM image?
A message digest or hash function turns input of arbitrary length into an output of fixed length, and this output can then be used in place of the original input. The digest can change every time the VM instance is turned on because some changes were made internally in the instance. The hash algorithm takes those changes into account: even if only a single byte changes, the digest will change completely.
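The "single byte changes the whole digest" property can be seen with any of the algorithms listed above. The sketch below uses SHA-256 from Python's standard library on two made-up byte strings; it only illustrates the general behaviour of hash functions and makes no claim about how GCP actually derives 'finalDigest'.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Two hypothetical inputs that differ in exactly one byte.
original = b"identical VM image contents"
modified = b"identical VM image contentS"

digest_a = sha256_hex(original)
digest_b = sha256_hex(modified)

# Count the hex positions at which the two digests disagree.
differing = sum(a != b for a, b in zip(digest_a, digest_b))

print(digest_a)
print(digest_b)
print(f"{differing} of {len(digest_a)} hex characters differ")
```

Hashing the same input twice always gives the same digest, so a digest that differs between launches implies that some part of the hashed input differed.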
C. Can the 'finalDigest' be pre-calculated before instantiation?
In my opinion this is not feasible because the digest algorithm is a one-way function, that is, a function which is practically infeasible to invert.
You can get more information about the hash functions on this link.

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with the Gradient Boosted Trees learning algorithm from the ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting GBT ensemble in human-readable format. My experience so far is that, for my problem size, adding more resources to the cluster does not seem to affect the length of the run. A 10-iteration training run seems to take roughly 13 hours. This isn't acceptable since I'm looking to do 100-300 iteration runs, and the execution time seems to explode with the number of iterations.
My Spark application
This isn't the exact code, but it can be reduced to:
SparkConf sc = new SparkConf().setAppName("GBT Trainer")
// unlimited max result size for intermediate Map-Reduce ops.
// Having no limit is probably bad, but I've not had time to find
// a tighter upper bound and the default value wasn't sufficient.
.set("spark.driver.maxResultSize", "0");
JavaSparkContext jsc = new JavaSparkContext(sc);
// The input file is encoded in plain-text LIBSVM format ~59GB in size
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD();
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(10);
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(1);
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
// Somewhat-convoluted code below reads in Parquet-formatted output
// of the GBT model and writes it back out as json.
// There might be cleaner ways of achieving the same, but since output
// size is only a few KB I feel little guilt leaving it as is.
// serialize and output the GBT classifier model the only way that the library allows
String outputPath = "s3://somebucket/somekeyprefex";
model.save(jsc.sc(), outputPath + "/parquet");
// read in the parquet-formatted classifier output as a generic DataFrame object
SQLContext sqlContext = new SQLContext(jsc);
DataFrame outputDataFrame = sqlContext.read().parquet(outputPath + "/parquet");
// output DataFrame-formatted classifier model as json
outputDataFrame.write().format("json").save(outputPath + "/json");
Question
What is the performance bottleneck with my Spark application (or with GBT learning algorithm itself) on input of that size and how can I achieve greater execution parallelism?
I'm still a novice Spark dev, and I'd appreciate any tips on cluster configuration and execution profiling.
More details on the cluster setup
I'm running this app on an AWS EMR cluster (emr-4.0.0, YARN cluster mode) of r3.8xlarge instances (32 cores, 244GB RAM each). I'm using such large instances in order to maximize flexibility of resource allocation. So far I've tried using 1-3 r3.8xlarge instances with a variety of resource allocation schemes between the driver and workers. For example, for a cluster of 1 r3.8xlarge instance I submit the app as follows:
aws emr add-steps --cluster-id $1 --steps Name=$2,\
Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=[/usr/lib/spark/bin/spark-submit,--verbose,\
--deploy-mode,cluster,--master,yarn,\
--driver-memory,60G,\
--executor-memory,30G,\
--executor-cores,5,\
--num-executors,6,\
--class,GbtTrainer,\
"s3://somebucket/somekey/spark.jar"],\
ActionOnFailure=CONTINUE
For a cluster of 3 r3.8xlarge instances I tweak resource allocation:
--driver-memory,80G,\
--executor-memory,35G,\
--executor-cores,5,\
--num-executors,18,\
I don't have a clear idea of how much memory is useful to give to every executor, but I feel that I'm being generous in either case. Looking through the Spark UI, I'm not seeing tasks with an input size of more than a few GB. I'm erring on the side of caution when giving the driver process so much memory, in order to ensure that it isn't memory starved for any intermediate result-aggregation operations.
I'm trying to keep the number of cores per executor down to 5, as per suggestions in Cloudera's How To Tune Your Spark Jobs series (according to them, more than 5 cores tends to introduce an HDFS I/O bottleneck). I'm also making sure that there is enough spare RAM and CPU left over for the host OS and Hadoop services.
My findings thus far
My only clue is Spark UI showing very long Scheduling Delay for a number of tasks at the tail-end of execution. I also get the feeling that the stages/tasks timeline shown by Spark UI does not account for all of the time that the job takes to finish. I suspect that the driver application is stuck performing some kind of a lengthy operation either at the end of every training iteration, or at the end of the entire training run.
I've already done a fair bit of research on tuning Spark applications. Most articles will give great suggestions on using RDD operations which reduce intermediate input size or avoid shuffling of data between stages. In my case I'm basically using an "out-of-the-box" algorithm, which is written by ML experts and should already be well tuned in this regard. My own code that outputs GBT model to S3 should take a trivial amount of time to run.
I haven't used MLlib's GBT implementation, but I have used both LightGBM and XGBoost successfully. I'd highly suggest taking a look at these other libraries.
In general, GBM implementations need to train models iteratively, as they consider the loss of the entire ensemble when building the next tree. This makes GBM training inherently sequential and not easily parallelizable (unlike random forests, which are trivially parallelizable). I'd expect it to perform better with fewer tasks, but that might not be your whole issue. With ~500K features, you're going to have very high overhead when calculating the histograms and split points during training. You should reduce the number of features, especially since they far outnumber the samples, which will cause the model to overfit.
As for tuning your cluster:
You want to minimize data movement, so use fewer executors with more memory: 1 executor per EC2 instance, with the number of cores set to whatever the instance provides.
Your data is small enough to fit into ~2 EC2 instances of that size. Assuming you are using doubles (8 bytes), it comes to 8 * 500000 * 50000 = 200 GB. Try loading it all into RAM by calling .cache() on your dataframe. If you then perform an operation over all the rows (like a sum), you force it to load and can measure how long the I/O takes. Once it's in RAM and cached, any other operations over it will be faster.
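The size estimate above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch that assumes a fully dense matrix of 8-byte doubles and decimal gigabytes; the helper name is made up for illustration, and the 244 GB figure is the per-instance RAM quoted in the question.

```python
def dense_size_gb(samples: int, features: int, bytes_per_value: int = 8) -> float:
    """Size of a dense samples x features matrix in decimal gigabytes."""
    return samples * features * bytes_per_value / 1e9


size_gb = dense_size_gb(samples=50_000, features=500_000)
ram_per_instance_gb = 244  # r3.8xlarge, as stated in the question

print(f"dense size: {size_gb:.0f} GB")
print(f"fraction of one instance's RAM: {size_gb / ram_per_instance_gb:.2f}")
```

The dense figure is an upper bound: the LIBSVM input is a sparse format, so the in-memory footprint of sparse vectors may be considerably smaller, while JVM and Spark overhead push in the other direction.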
With a dataset of that size, you may well be better off loading the full dataset into memory and using XGBoost directly rather than the Spark implementation.
If you want to stick with Spark to give greater scalability, I'd recommend taking a closer look at your partitioning strategy. If your data isn't effectively partitioned, adding machines won't improve your runtime, as you describe above, and the subset of overloaded workers will remain your bottleneck. Ensure you have an effective partition key, and repartition your RDD before you begin your training stage.

What factors decide the number of executors in a stand alone mode?

Given a Spark application
What factors decide the number of executors in standalone mode? For Mesos and YARN, according to this document, we can specify the number of executors/cores and the memory.
Once a number of executors are started, does Spark assign tasks in a round-robin fashion, or is it smart enough to see whether some of the executors are idle or busy and schedule the tasks accordingly?
Also, how does Spark decide on the number of tasks? I wrote a simple max-temperature program with a small dataset, and Spark spawned two tasks in a single executor. This is in Spark standalone mode.
Answering your questions:
The standalone mode uses the same configuration variable as the Mesos and YARN modes to set the number of executors. The variable spark.cores.max defines the maximum number of cores used in the SparkContext. The default value is infinity, so Spark will use all the cores in the cluster. The spark.task.cpus variable defines how many CPUs Spark will allocate for a single task; the default value is 1. With these two variables you can define the maximum number of parallel tasks in your cluster.
When you create an RDD subclass, you can define on which machines to run your tasks. This is defined in the getPreferredLocations method. But as the method signature suggests, this is only a preference, so if Spark detects that one machine is not busy, it will launch the task on this idle machine. However, I don't know the mechanism Spark uses to know which machines are idle. To achieve locality, we (Stratio) decided to make each partition smaller, so that each task takes less time and locality is easier to achieve.
The number of tasks of each Spark operation is defined by the number of partitions of the RDD. The partitions are the result of the getPartitions method, which you have to override if you want to develop a new RDD subclass. This method returns how an RDD is split, where the information is, and the partitions. When you join two or more RDDs, the resulting RDD by default has as many tasks as the largest of the RDDs involved (a plain union, by contrast, keeps all the partitions of its inputs). For example: if you join RDD1, which has 100 tasks, with RDD2, which has 1000 tasks, the next operation on the resulting RDD will have 1000 tasks. Note that a high number of partitions does not necessarily mean more data.
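The relationship between the two variables in the first point can be written down as a small calculation. This is a simplified sketch, assuming cores are the only constraint; the function name is hypothetical and is not part of any Spark API, and in a real cluster the result is an upper bound because task slots are counted per executor.

```python
def max_parallel_tasks(cores_max: int, task_cpus: int = 1) -> int:
    """Upper bound on concurrently running tasks.

    cores_max mirrors spark.cores.max and task_cpus mirrors
    spark.task.cpus (default 1).
    """
    if cores_max < 0 or task_cpus < 1:
        raise ValueError("cores_max must be >= 0 and task_cpus >= 1")
    return cores_max // task_cpus


print(max_parallel_tasks(16))      # every task uses one core
print(max_parallel_tasks(16, 2))   # every task reserves two cores
```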
I hope this will help.
I agree with @jlopezmat about how Spark chooses its configuration. With respect to your test code, you are seeing two tasks because of the way textFile is implemented. From SparkContext.scala:
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString)
}
and if we check what is the value of defaultMinPartitions:
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
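To see why this yields two tasks on a small file, the Scala expression above can be mirrored in Python. This is only a transliteration for illustration; the function name is made up, and the real partition count also depends on how the Hadoop input format splits the file, since minPartitions is a lower bound rather than an exact count.

```python
def default_min_partitions(default_parallelism: int) -> int:
    """Python mirror of `math.min(defaultParallelism, 2)` from the snippet above."""
    return min(default_parallelism, 2)


# On any setup with at least two cores the minimum is capped at 2,
# which matches the two tasks observed for a small input file.
for parallelism in (1, 2, 8, 64):
    print(parallelism, "->", default_min_partitions(parallelism))
```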
Spark chooses the number of tasks based on the number of partitions in the original data set. If you are using HDFS as your data source, then the number of partitions will be equal to the number of HDFS blocks, by default. You can change the number of partitions in a number of different ways. The top two: passing an extra argument to the SparkContext.textFile method, or calling the RDD.repartition method.
Answering some points that were not addressed in previous answers:
In standalone mode, you need to play with --executor-cores and --total-executor-cores to set the number of executors that will be launched (provided that you have enough memory to fit that number if you specify --executor-memory).
Spark does not allocate tasks in a round-robin manner. It uses a mechanism called "Delay Scheduling", a pull-based technique in which each executor offers its availability to the master, which then decides whether or not to send a task to it.

Machine unique ID [duplicate]

I want the processor serial number, which is a unique ID that no other processor has. I also want the hard disk serial number. I am using C++. Can anyone please help me with this?
I need a unique machine ID, such as the CPU number or motherboard number, using C++.
Win32_BaseBoard,
Win32_Processor
Win32_DiskPartition
Thank you.
According to Wikipedia, the CPUID instruction could return a processor serial number starting with the Pentium III; however, due to privacy concerns, this feature is no longer implemented in later models. See the following article for details: http://en.wikipedia.org/wiki/CPUID#EAX.3D3:_Processor_Serial_Number
The best way is to derive a machine unique ID (MID) from several sources rather than depending on a single parameter.
Check http://sowkot.blogspot.com/2008/08/generating-unique-keyfinger-print-for.html for more information.
Even the method described in the above link can't guarantee that the MID always stays the same (the user might change the hardware).
Based on my experience, it is better to generate the MID when the application starts, store it in an application-specific area (perhaps the registry), and use the stored value for all other application-related tasks instead of generating it every time. In that case a normal GUID should suffice.
If you need a unique ID, you don't have to tie it to the hardware. Simply generate a new random ID (128 bits or larger) and store it in whatever persistent storage mechanism you prefer, so that next time you read back the same ID you generated before.
If you use processor or disk serial numbers, they will be subject to change, because users could upgrade their hardware. Your own unique ID will never change. The only downside is that machines with dual boot will have two or more IDs: one ID per instance of the OS.
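The "generate once, then always read back" idea can be sketched in a few lines. The example below is in Python for brevity, although the question asks about C++; the same structure carries over. The function name and the storage location are hypothetical, and a real application might use the registry or another per-machine store instead of a file.

```python
import secrets
from pathlib import Path


def get_machine_id(store: Path) -> str:
    """Return the persisted ID, creating a random 128-bit one on first use."""
    if store.exists():
        return store.read_text().strip()
    machine_id = secrets.token_hex(16)  # 16 random bytes = 128 bits
    store.parent.mkdir(parents=True, exist_ok=True)
    store.write_text(machine_id)
    return machine_id
```

Calling `get_machine_id(Path.home() / ".myapp" / "machine_id")` twice returns the same 32-character hex string, because the second call reads the stored value instead of generating a new one (the path is only an example).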