Using RRD4J and XML to create a graph? - rrdtool

I have a homework assignment for which I need to research RRD4J and create a graph using the RRD4J library. My teacher gave me only an XML file. Can I use that XML with RRD4J to draw a graph, and if so, how?

Without more information it is difficult to answer your question precisely, but these general steps should help you understand what to do:
1) Create the RRD, depending on the granularity you would like to have (and the frequency of the data in your XML file).
For example, if you would like to keep hourly and daily data, the archive definition could look like this:
RrdDef rrdDef = new RrdDef(fileName, 60); // 60 is the step: you expect a new data point every 60 seconds
rrdDef.setStartTime(...); // set the initial timestamp here (10-digit epoch seconds)
// DATASOURCE_NAME is the name of your variable in the time series; DsType is the kind of data
// (always increasing, increasing and decreasing, etc.); 120 is the heartbeat (if no data arrives
// within 120 seconds, NaN is stored); the last two arguments are the accepted minimum and maximum values
rrdDef.addDatasource(DATASOURCE_NAME, DsType.GAUGE, 120, 0, Double.NaN);
rrdDef.addArchive(ConsolFun.AVERAGE, 0.99, 1, 60);   // 60 rows of single-step (1-minute) averages
rrdDef.addArchive(ConsolFun.AVERAGE, 0.99, 24, 240); // 240 rows of 24-step (24-minute) averages
RrdDb rrdDb = new RrdDb(rrdDef);
rrdDb.close();
(All of these settings come from a detailed analysis of the time series you are working with; it is hard to recommend values without looking at the data.)
2) Parse the XML file using SAX (this is probably the better choice, since after inserting into the RRD database you won't need to access the parsed values again).
3) While parsing the XML, update the RRD (a combined sketch of steps 2 and 3 follows the snippet below):
RrdDb rrdDb = new RrdDb(fileName);
Sample sample = rrdDb.createSample();
sample.setAndUpdate(timestamp + ":" + value); // "timestamp:value", timestamp again in epoch seconds
rrdDb.close();
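A rough, combined sketch of steps 2 and 3 could look like the following. The element and attribute names ("entry", "timestamp", "value") and the file names are assumptions, since the layout of your XML is unknown; adapt the handler to your file:
import javax.xml.parsers.SAXParserFactory;
import org.rrd4j.core.RrdDb;
import org.rrd4j.core.Sample;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlToRrd {
    public static void main(String[] args) throws Exception {
        final RrdDb rrdDb = new RrdDb("data.rrd"); // the RRD created in step 1 (file name is an assumption)
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
                if ("entry".equals(qName)) { // made-up element: <entry timestamp="1325376000" value="42.5"/>
                    try {
                        Sample sample = rrdDb.createSample();
                        sample.setAndUpdate(attrs.getValue("timestamp") + ":" + attrs.getValue("value"));
                    } catch (java.io.IOException e) {
                        throw new SAXException(e);
                    }
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(new java.io.File("data.xml"), handler);
        rrdDb.close();
    }
}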
4) When all the data has been inserted, generate the graphs (check the examples and options on the RRD4J website).
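A minimal graphing sketch could look like this (startTime, endTime, the file names and DATASOURCE_NAME are placeholders that must match what you used above):
import java.awt.Color;
import org.rrd4j.ConsolFun;
import org.rrd4j.graph.RrdGraph;
import org.rrd4j.graph.RrdGraphDef;

RrdGraphDef graphDef = new RrdGraphDef();
graphDef.setTimeSpan(startTime, endTime); // same 10-digit epoch seconds as the inserted data
graphDef.setTitle("My metric");
graphDef.setFilename("graph.png");
graphDef.datasource("value", "data.rrd", DATASOURCE_NAME, ConsolFun.AVERAGE);
graphDef.line("value", Color.RED, "my metric");
RrdGraph graph = new RrdGraph(graphDef); // rendering happens in the constructor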
P.S. Consider the MongoDB integration, which outperforms the default file-backed RRD4J storage many times over; there is an example of that on their page as well.
Hope this helped :-)

Is this XML a template?
http://rrd4j.googlecode.com/git/javadoc/org/rrd4j/core/XmlTemplate.html
The best configuration for RRD4J is the FILE backend with version 2 RRD files.
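If the XML you were given is indeed an RRD definition template, a sketch along these lines would turn it into an RRD; the variable name is an assumption and must match the ${...} placeholders in your file:
import org.rrd4j.core.RrdDb;
import org.rrd4j.core.RrdDef;
import org.rrd4j.core.RrdDefTemplate;

RrdDefTemplate template = new RrdDefTemplate(new java.io.File("template.xml")); // the XML from your teacher
template.setVariable("path", "data.rrd"); // assumed ${path} placeholder in the template
RrdDef rrdDef = template.getRrdDef();
RrdDb rrdDb = new RrdDb(rrdDef);
rrdDb.close();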

Related

What is the maximum number of data values that amCharts can handle?

We are using amCharts 4 to show trend logs, and sometimes we end up with a lot of data that has to go into the chart. We'd like to know the maximum number of data points a chart can handle, so we know how much we need to aggregate (to reduce the data point count) before sending data into the package. To show as accurate a representation of the data as possible, we don't want to aggregate more aggressively than we have to. Our charts are x/y charts of value vs. date/time, with up to 8 series.
In one case, we have a data set with well in excess of 600,000 data points across 8 series, and loading this into the chart, even in batches (i.e., loading one batch, then adding the remaining batches to it in turn), causes the charting package to run out of memory. In our test, it ran out of memory on the third batch, at which point the total across the 3 batches exceeded 600,000 data points, preventing further batches from being loaded. For large sites that use our product, it is quite common for a user to have that much data when they want to chart 6 months or a year's worth of history, so it's important that we be able to show some representation of all that data, which is where aggregation comes in.

AWS Forecast cannot train the predictor due to missing data

This question is close, but doesn't quite help with my issue, since I am using a single dataset and no related time series.
I am using AWS Forecast with a single time-series dataset (no related data, just the main dataset). It is a daily dataset with about 10 years of data, ranging from 2010 to 2020.
The original dataset has 3572 data points; I manually filled missing data to ensure there were no missing days in the date range, for a total of 3739 data points. I lopped off everything in 2020 to create a validation dataset and then configured the predictor for a 180-day forecast. I keep getting the following error:
Unable to evaluate this dataset because there is missing data in the evaluation window for all items. Ensure that there is complete data for at least one item in the evaluation window starting from 2019-03-07T00:00:00 up to 2020-01-01T00:00.
There is definitely no missing data; I've double- and triple-checked the date range and the data fill, and every day between the start and end dates has a data point. I also tried adding a data point for 1/1/2020 (the data ended at 12/31/2019), and I continue to get this error. I can't figure out what it's asking for, unless I'm missing something in my math about the forecast horizon and backtest window offset?
Dataset example:
Brief model parameters (can share more if I'm missing something pertinent):
Total data points in training data: 3479
forecastHorizon = 180
create_predictor_response = forecast.create_predictor(
    PredictorName=predictorName,
    ForecastHorizon=forecastHorizon,
    PerformAutoML=True,
    PerformHPO=False,
    EvaluationParameters={"NumberOfBacktestWindows": 1,
                          "BackTestWindowOffset": 180},
    InputDataConfig={"DatasetGroupArn": datasetGroupArn},
    FeaturizationConfig={"ForecastFrequency": 'D'})
I noticed you don't have an entry for 6/24/10 (this American date format is the worst, by the way).
I faced a similar problem when leaving out days like that (assuming you are modelling at daily frequency) and letting Forecast's automatic filling set the gaps to NaN values (as opposed to zero, which is the default). I suggest you:
pre-fill literally every date within the range of the training data (and of the forecast window, if using related data);
choose zero as the option for automatic filling of missing values (I think mean or any other float value would also work, for that matter).
Let me know if that works! I am also using Forecast, and it's good to keep track of possible problems and solutions.

ML.Net LearningPipeline always has 10 rows

I have noticed that the Microsoft.Ml.Legacy.LearningPipeline.Row count is always 10 in the SentimentAnalysis sample project, no matter how much data is in the training or test sets.
https://github.com/dotnet/samples/blob/master/machine-learning/tutorials/SentimentAnalysis.sln
Can anyone explain the significance of 10 here?
// LearningPipeline allows you to add steps in order to keep everything together
// during the learning process.
// <Snippet5>
var pipeline = new LearningPipeline();
// </Snippet5>
// The TextLoader loads a dataset with comments and corresponding positive or negative sentiment.
// When you create a loader, you specify the schema by passing a class to the loader containing
// all the column names and their types. This is used to create the model, and train it.
// <Snippet6>
pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>());
// </Snippet6>
// TextFeaturizer is a transform that is used to featurize an input column.
// This is used to format and clean the data.
// <Snippet7>
pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
//</Snippet7>
// Adds a FastTreeBinaryClassifier, the decision tree learner for this project, and
// three hyperparameters to be used for tuning decision tree performance.
// <Snippet8>
pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 50, NumTrees = 50, MinDocumentsInLeafs = 20 });
// </Snippet8>
The debugger is showing only a preview of the data - the first 10 rows. The goal here is to show a few example rows and how each transform is operating on them to make debugging easier.
Reading in the entire training data and running all the transformations on it is expensive and only happens when you reach .Train(). As the transformations are only operating on a few rows, their effect might be different when operating on the entire dataset (e.g. the text dictionary will likely be bigger), but hopefully the preview of data shown before running through the full training process is helpful for debugging and making sure transforms are applied to the correct columns.
If you have any ideas on how to make this clearer or more useful, it would be great if you can create an issue on GitHub!

Can rrdtool store data for metrics whose list changes over time, for example the top 10 processes consuming CPU?

We need to create a graph of the top 10 items, which will change from time to time, for example the top 10 processes consuming CPU, or any other top 10 items we can generate values for on the monitored server, with the possibility of showing the item names on the graph.
Please tell me, is there any way to store this information using rrdtool?
Thanks
If you want to store this kind of information with rrdtool, you will have to create a separate RRD database for each item, update each of them accordingly, and finally generate the chart by picking the 10 'top' RRD files; a rough sketch of that selection step follows below.
In other words, quite a lot of the magic has to happen in the script you write around rrdtool; rrdtool itself only takes care of storing the time-series data.
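A rough sketch of the selection step, here using RRD4J since that is what the rest of this thread is about (the directory layout and file naming are assumptions; the same idea works with a shell script around rrdtool and rrdtool lastupdate):
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.rrd4j.core.RrdDb;

public class TopTen {
    public static void main(String[] args) throws IOException {
        File dir = new File("/var/lib/metrics"); // assumed directory with one .rrd file per monitored item
        File[] files = dir.listFiles((d, name) -> name.endsWith(".rrd"));
        if (files == null) return;
        Map<String, Double> lastValues = new HashMap<>();
        for (File f : files) {
            RrdDb db = new RrdDb(f.getAbsolutePath(), true); // open read-only
            try {
                lastValues.put(f.getName(), db.getDatasource(0).getLastValue());
            } finally {
                db.close();
            }
        }
        lastValues.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getKey() + " = " + e.getValue()));
        // feed these 10 file names into the graphing step (rrdtool graph or RrdGraphDef)
    }
}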

Searching for a way to get a smaller RDF (N3) dataset

I have downloaded the yago.n3 dataset.
However, for testing I wish to work with a smaller version of the dataset (the full dataset is 2 GB), because even when I make a small change it takes me a long time to debug.
Therefore, I tried to copy a small portion of the data into a separate file, but this did not work and threw lexical errors.
I saw the earlier posts, but they are about big datasets, whereas I am looking for a smaller one.
Is there any way to obtain a smaller portion of the same dataset?
If you have an RDF parser at hand that can read your yago.n3 file, you can parse it and write out, to a separate file, as many RDF triples as you want or need for your smaller experimental dataset; a sketch using Apache Jena follows below.
If you find the data in N-Triples format (i.e. one RDF triple per line), you can just take as many lines as you want and make the dataset as small as you like: head -n 10 filename.nt would give you a tiny dataset of 10 triples.
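For the first approach, a sketch using Apache Jena's streaming parser could look like this (the input and output file names and the 10,000-triple limit are just examples); note that it still parses the whole 2 GB file, it just stops writing after the limit:
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFBase;
import org.apache.jena.riot.system.StreamRDFLib;

public class Subset {
    public static void main(String[] args) throws Exception {
        final int max = 10_000;
        try (OutputStream out = new FileOutputStream("yago-small.nt")) {
            final StreamRDF writer = StreamRDFLib.writer(out); // writes N-Triples, one triple per line
            writer.start();
            StreamRDF limited = new StreamRDFBase() {
                private int count = 0;
                @Override
                public void triple(Triple triple) {
                    if (count++ < max) {
                        writer.triple(triple);
                    }
                }
            };
            RDFDataMgr.parse(limited, "yago.n3"); // format is detected from the .n3 extension
            writer.finish();
        }
    }
}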