How can I aggregate Intel Amplifier batch results? - profiling

I'm solving a number of instances with my code and I need to find the worst hotspots, where "worst" means a hotspot across a wide range of instances. So for every instance I have collected hotspot analysis data in batch mode using amplxe-cl. Now I'd like to aggregate this data and analyze it all together. Is there any way to do this with VTune?
Update:
This is not an MPI application. There are a number of different datasets (problems, instances, pick your term :-) that need to be processed by my application. Depending on the data in a single instance, the application can take very different turns while processing it, so running the application on different instances can result in different hotspots. The purpose of the aggregation, as #ArunJose_Intel guessed, is to find hotspots that are common to all runs, that are present in the processing of every kind of instance.
I can collect hotspot analysis for every instance easily using batch mode and I can inspect them individually, but I'd like to see an aggregate analysis.
Of course, I could just process them in one run one after the other, but that would take several weeks, while I can process them as individual problems in a few hours on a cluster of identical machines.

In VTune it is not possible to combine multiple GUI reports. You have an option to compare two different reports to see what has changed, but clearly this is not what you are looking for.
A workaround you could try is to create command line reports from the VTune results you have already collected. These command line reports can be produced in easily parsable data formats like CSV. Once you have reports in such a format, you can write your own scripts/code to aggregate the CSV reports with whatever logic you wish.
Please find below some samples for creating command line reports:
1) Generate a Hotspots report from the r001hs result on Linux*, and save it to /home/test/MyReport.txt in text format.
vtune -report hotspots -result-dir r001hs -report-output /home/test/MyReport.txt
2) Generate a hotspots report in the CSV format from the most recent result and save it in the current Linux working directory. Use the format option with the csv argument and the csv-delimiter option to specify a delimiter, such as comma.
vtune -R hotspots -report-output MyReport.csv -format csv -csv-delimiter comma
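Once you have one CSV report per instance, a small Python script can sum the CPU time per function across all of them. This is only a minimal sketch: the reports/ directory and the "Function" and "CPU Time" column names are assumptions you should check against the header of your own reports.

import csv
import glob
from collections import defaultdict

# Sum CPU time per function across all per-instance hotspot reports.
totals = defaultdict(float)
for path in glob.glob("reports/*.csv"):        # hypothetical report location
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumes plain numeric seconds; adjust if your reports add units.
            totals[row["Function"]] += float(row["CPU Time"])

# Print the aggregate hotspots, worst first.
for func, cpu in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{cpu:10.2f}  {func}")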
For more information
https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/command-line-interface/generating-command-line-reports.html
https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/command-line-interface/generating-command-line-reports/saving-and-formatting-reports.htm

Related

How to use Apache Beam to process historic time series data?

I have an Apache Beam pipeline that processes multiple time series in real time. Deployed on GCP Dataflow, it combines multiple time series into windows, calculates the aggregates, etc.
I now need to perform the same operations over historic data (the same (multiple) time series data) stretching all the way back to 2017. How can I achieve this using Apache Beam?
I understand that I need to use the windowing property of Apache Beam to calculate the aggregates etc., but it needs to accept data from two years back onwards.
Effectively, I need the data as it would have been available had I deployed the same pipeline two years ago. This is needed for testing/model training purposes.
That sounds like a perfect use case for Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as the events have timestamps. Without additional context, I think you will need an explicit step in your pipeline that assigns custom timestamps (from 2017) that you extract from the data. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
WithTimestamps PTransform;
You might need to configure the allowed timestamp skew if you run into timestamp ordering issues.
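For illustration, here is a minimal sketch using the Python SDK equivalent (TimestampedValue); the record shape and timestamps are made up, and the links below cover the Java SDK:

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# Hypothetical historic records as (value, unix_timestamp) pairs from 2017.
historic = [(10.0, 1483228800), (12.5, 1483230600), (11.0, 1483232400)]

with beam.Pipeline() as p:
    (
        p
        | "CreateHistoric" >> beam.Create(historic)
        # Re-assign each element's event timestamp from the data itself, so
        # windowing behaves as if the pipeline had been running back then.
        | "AssignEventTime" >> beam.Map(lambda kv: TimestampedValue(kv[0], kv[1]))
        | "Window" >> beam.WindowInto(FixedWindows(3600))   # 1-hour windows
        | "SumPerWindow" >> beam.CombineGlobally(sum).without_defaults()
        | "Print" >> beam.Map(print)
    )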
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam

How would I merge related records in Apache Beam / Dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example, data about users is coming in from different source systems, but there is no common primary key or user identifier.
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another to see if the customer in source system 1 is the same as in source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day that we'd have to unify.
Apache Beam / Dataflow implementation
An Apache Beam DAG is by definition acyclic, but I could republish the data through Pub/Sub to the same DAG to make the algorithm cyclic.
I could create a PCollection of hashmaps that continuously does a self-join against all other elements, but this seems like an inefficient method.
The immutability of a PCollection is a problem if I want to constantly modify things as they go through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
Do you know of any way to process such a problem efficiently in Dataflow?
Other thoughts
Prolog: I tried running a subset of this data with a subset of the rules, but SWI-Prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
Drools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many, many rules to individual records than about inferring record information from possibly related records.
Graph database: Something like Neo4j or Datomic would be nice, since joins are at the record level rather than row/column scans, but I don't know if it's possible to do something similar in Beam.
BigQuery or Spanner: Brute-forcing these rules in SQL and doing full table scans per record is really slow. We would much prefer to keep the graph of all records in memory and compute in memory. We could also try to concatenate all columns and run multiple compare-and-update passes across all columns.
Or maybe there's a more standard way of solving this class of problems.
It is hard to say which solution works best for you from what I have read so far. I would try to split the problem further and tackle different aspects of it separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
the collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
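For concreteness, here is a minimal in-memory Python sketch of that loop (single-threaded; the field names and the single rule standing in for the ~100 real ones are hypothetical):

# Master records accumulated so far; a real system would keep this in
# durable, keyed state rather than a Python list.
master_records = []

def last_name(rec):
    # The two source systems name the field differently.
    return rec.get("last_name") or rec.get("lname")

def matches(candidate, master):
    # Hypothetical stand-in for the ~100 real rules.
    return last_name(candidate) is not None and last_name(candidate) == last_name(master)

def process(record):
    for master in master_records:
        if matches(record, master):
            master.update(record)     # a rule inferred new data: merge it in
            return master
    master_records.append(record)     # no match yet: keep for future matching
    return record

process({"userid": 123, "first_name": "John", "last_name": "Smith"})
process({"userid": "EFCBA-09DA0", "fname": "J.", "lname": "Smith"})  # merges into the first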
I would probably try to approach this by first understanding the problem in more detail (maybe you already do); a few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how is data read from the sources?
can all the systems publish the records in batches as files, submit them to PubSub, or does your solution need to poll them, etc.?
can the data be read in parallel or is it a single stream?
then the main question of how you can efficiently match a record in general will probably look different under different assumptions and requirements as well. For example, I would think about:
can you fit all data in memory;
are your rules dynamic? Do they change at all, and what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can match some things by an id field and other things by a hash of something, etc. (see the sketch after this list);
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?
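As an illustration of the splitting/categories idea above, here is a small blocking sketch (the field names and key choice are hypothetical):

from collections import defaultdict

# "Blocking": bucket records by a cheap key so the expensive rules only
# run within a bucket instead of against every record ever seen.
buckets = defaultdict(list)

def block_key(record):
    # Hypothetical cheap key; real systems often combine several
    # (phonetic surname, postcode, email domain, ...) and union the results.
    last = (record.get("last_name") or record.get("lname") or "").lower()
    return last[:4]

def candidates(record):
    return buckets[block_key(record)]   # only these need the full rule set

def index(record):
    buckets[block_key(record)].append(record)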

SSIS Process Design Issue: Workaround for Variable Sharing in Concurrent Executions?

Let me start by saying that I am in the preliminary stages of a large ETL project: roughly 80 data sources, with hundreds of millions of rows being transformed and loaded into a (finicky) ERP system. As such, I'm open to large process design solutions, along with potential solutions to this issue of variable sharing in concurrent Sequence Container execution.
In my example, I am logging three row-count variables to an SSMS ErrorLogging table. My intention is to use these three variables to help narrow down the 'stage' of the ETL process where data loss is, or could be, occurring. I am currently building this out in a very small test case. Below are the three variables and an image of the SSIS test case package.
varRowsSource: Control Flow task that counts the rows in the source data
varRowsTransformation: Data Flow task that counts the number of rows after data flow transformations
varRowsDestination: Control Flow task that counts the number of rows that were written to the destination table (select count(*)).
SSIS TestCase package
In the image above, you'll see the range of Sequence Containers that are being used. These containers run concurrently.
I've highlighted the Control Flow tasks that overwrite each other's RowsSource and RowsDestination variable entries, and circled the Data Flows that add rows to the variable.
It can be seen how, with only one data source, this requires 3 variables per container. We expect 30-50 data sources per package, so I'm hoping to avoid having to create hundreds of variables.
SSMS ErrorLog table
Above is a snippet of the ErrorLog table in SSMS, should that help visualize how the variables are adding to, or overwriting each other. There are 4417 rows in the source data for Sequence Container CustomerInfoAddress and 4429 rows in the source data for Sequence Container CustomerTaxInfo.
In essence, I'm looking for a workaround for the issue of variable sharing in SSIS during concurrent executions, and I would like to avoid creating hundreds of variables. I've consulted as many resources as I've been able to find about this issue (including the one below) without luck. I'm hoping someone here may have some insight into a process design workaround or another type of solution.
social.msdn.microsoft.com/Forums/sqlserver/en-US/bed0d082-6c30-47a5-aaa7-0e132927bfe2/row-count-concurrency?forum=sqlintegrationservices

Tool for querying large numbers of csv files

We have large numbers of csv files; files/directories are partitioned by date and several other factors. For instance, files might be named /data/AAA/date/BBB.csv
There are thousands of files, some are in the GB range in size. Total data sizes are in the terabytes.
They are only ever appended to, and usually in bulk, so write performance is not that important. We don't want to load the data into another system because several important processes that we run rely on being able to stream the files quickly; these are written in C++.
I'm looking for a tool/library that would allow SQL-like queries directly against the data. I've started looking at Hive, Spark, and other big data tools, but it's not clear whether they can access partitioned data directly from a source, which in our case is via NFS.
Ideally, we would be able to define a table by giving a description of the columns, as well as partition information. Also, the files are compressed, so handling compression would be ideal.
Are there open source tools that do this? I've seen a product called Pivotal, which claims to do this, but we would rather write our own drivers for our data for an open source distributed query system.
Any leads would be appreciated.
Spark can be a solution. It is an in-memory distributed processing engine. Data can be loaded into memory on multiple nodes in the cluster and processed in memory. You do not need to copy data to another system.
Here are the steps for your case:
Build a multi-node Spark cluster
Mount the NFS share onto one of the nodes
Then load the data temporarily into memory in the form of RDDs and start processing it
It provides:
Support for programming languages like Scala, Python, Java, etc.
Support for SQL contexts and data frames. You can define structure for the data and start accessing it using SQL queries
Support for several compression algorithms
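For illustration, a minimal PySpark sketch using the SparkSession API (the path layout, table name, and gzip compression are assumptions, and the path must be visible to the Spark workers):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-query").getOrCreate()

# Hypothetical layout /data/AAA/date/BBB.csv.gz over NFS; Spark reads
# gzip-compressed CSV transparently. Schema inference needs an extra pass,
# so for terabytes consider declaring the schema explicitly instead.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/*/*/*.csv.gz"))

df.createOrReplaceTempView("events")   # hypothetical table name
spark.sql("SELECT COUNT(*) FROM events").show()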
Limitations
Data has to fit into memory to be processed by Spark
You need to use data frames to define structure on the data, after which you can query it using SQL embedded in programming languages like Scala, Python, Java, etc.
There are subtle differences between traditional SQL in an RDBMS and SQL in distributed systems like Spark. You need to be aware of those.
With Hive, you need to have the data copied to HDFS. As you do not want to copy the data to another system, Hive might not be a solution.

Missing the obvious with inconsistently delimited data?

I have built something in SAS to pull down Yahoo! finance .csv data. The code now works fine and has some robust error handling built in. The problem I have had with the data, though, is that the .csv feed is unsupported and not clean.
The data is comma delimited, but some of the data also has commas in it. Some of the fields are in quotes and some are not. Also, the length of the fields varies wildly. A field like Market Capitalisation, for example, could run from a few million to hundreds of billions.
As a result, if you pass multiple stock metrics for multiple stocks through to the Yahoo! API at the same time, you will get rows of .csv data where each field is in a different place, is a different length and is inconsistently delimited.
I have tried multiple infile options that could handle some of these errors in isolation, but not all of them together. My only working solution is to download a single stock metric for multiple stocks at a time.
This gives me what I want, but it takes over an hour to run the data for the NASDAQ and the NYSE. Have I overlooked another method for handling this type of problem?
Thanks
This is the outline of a way to do what you are looking for. The whole of the code to do this would be too long to post here and beyond the scope of what this site aims to do.
Create a SAS program that takes a stock ticker from the SYSPARM automatic macro variable and downloads the data into a data set, named after the ticker, in a permanent library.
The SYSPARM macro variable is set by the value you pass on the command line when calling SAS:
sas.exe myprog.sas -sysparm XYZ
This would set &SYSPARM to resolve to XYZ.
Write a SAS program that merges all the ticker data sets together for further processing.
Create a program in a language like Perl or Python (or a shell script, etc.) that loops over a range of tickers and calls your SAS program, passing the ticker through SYSPARM.
Use a threading, forking, etc. package from that language to have multiple of these running at the same time. You can probably go to some multiple of the number of CPU cores on your machine, as this processing will not be CPU intensive. Test values until you find one that works.
From that same language, call your SAS program to merge the data sets.
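For example, a minimal Python driver along those lines (sas.exe, myprog.sas, the ticker list, and merge_tickers.sas are hypothetical names standing in for your own):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical ticker list; in practice read it from a file or exchange listing.
tickers = ["AAPL", "MSFT", "IBM", "ORCL"]

def run_sas(ticker):
    # One SAS session per ticker; the ticker arrives in &SYSPARM.
    subprocess.run(["sas.exe", "myprog.sas", "-sysparm", ticker], check=True)

# Downloads are I/O-bound, so more workers than cores is reasonable;
# tune max_workers empirically as suggested above.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(run_sas, tickers))

# When all downloads have finished, run the merge program from the earlier step.
subprocess.run(["sas.exe", "merge_tickers.sas"], check=True)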