Reverse Engineer ML.net Model Builder Settings - ml.net

I have a model that was generated using ML.NET Model Builder with the ForecastBySsa engine, and I cannot manually recreate the same results. I use the same settings for the fitting (as provided in the auto-generated console app), but the predictions from my code, using the same data, are different. Is Model Builder adding or adjusting optional parameters for the fitting? How can I discover the parameters used so I can rebuild the model without Model Builder? This is true for even the simplest data set (see below).
For example, here's my data:
Time Data
1/1/2019 4:45 0.000135722
1/1/2019 5:00 0.001085629
1/1/2019 5:15 0.000406669
1/1/2019 5:30 -0.000677507
1/1/2019 5:45 0.00040678
1/1/2019 6:00 0.00067769
1/1/2019 6:15 -0.000270893
1/1/2019 6:30 -0.000812898
1/1/2019 6:45 0.000542373
1/1/2019 7:00 0.002032796
Prediction from Model Builder: 0.0007719097
Prediction from retrained model (using the same data and the same windowSize, seriesLength, and horizon): 0.00051951635


Getting records in the last 1 hour using PROC SQL (Teradata)

I am using SAS to connect to Teradata. Given the dataset below (a transaction table whose records are updated regularly), I need to select the records from the past hour (at least 3). For example, if I run the query at 6 PM, I should get txn_id 5678, 1985, and 2985 (see the dataset below). Can you please help? This needs to be done in PROC SQL (connecting to Teradata), or even just as a SQL query running in Teradata SQL Assistant.
Dataset:
TXN_ID Date Time
1234 20200608 4:00 PM
5678 20200608 5:00 PM
1985 20200608 5:30 PM
2985 20200608 5:45 PM
2365 20200608 2:30 PM
Expected Output:
TXN_ID Date Time
5678 20200608 5:00 PM
1985 20200608 5:30 PM
2985 20200608 5:45 PM
Try the outobs option:
proc sql outobs=3;
select * from sashelp.class order by Age, Name;
quit;
This option is used to limit the number of rows in the output.
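The selection rule described in the question (keep the rows whose timestamp falls within one hour of the run time) can be checked outside SQL. The following is only a Python sketch of that rule, not PROC SQL or Teradata syntax. The function name, the 6:00 PM run time, and the inclusive lower bound are assumptions, chosen so that the 5:00 PM row is kept as in the expected output.

```python
from datetime import datetime, timedelta

# Sample rows copied from the question: (TXN_ID, Date, Time)
rows = [
    (1234, "20200608", "4:00 PM"),
    (5678, "20200608", "5:00 PM"),
    (1985, "20200608", "5:30 PM"),
    (2985, "20200608", "5:45 PM"),
    (2365, "20200608", "2:30 PM"),
]

def txns_in_last_hour(rows, now):
    """Return TXN_IDs whose timestamp falls in [now - 1 hour, now]."""
    cutoff = now - timedelta(hours=1)
    selected = []
    for txn_id, date_text, time_text in rows:
        # Combine the separate date and time columns into one timestamp.
        stamp = datetime.strptime(f"{date_text} {time_text}", "%Y%m%d %I:%M %p")
        if cutoff <= stamp <= now:
            selected.append(txn_id)
    return selected

# Hypothetical run time: 6:00 PM on the same day as the sample data.
run_time = datetime(2020, 6, 8, 18, 0)
recent = txns_in_last_hour(rows, run_time)
print(recent)  # [5678, 1985, 2985]
```

The same comparison (timestamp greater than or equal to the current time minus one hour) is what a WHERE clause would need to express. Limiting the row count with outobs does not filter by time on its own.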

Passing django-recurrence field via REST API

Folks,
I am using the django-recurrence field in my app, and it's not clear how to format the field when it is passed via a REST API.
Any help is appreciated.
from django.db import models
from recurrence.fields import RecurrenceField

class Course(models.Model):
    title = models.CharField(max_length=200)
    recurrences = RecurrenceField()
It looks like it's based on RFC 2445:
https://www.rfc-editor.org/rfc/rfc2445#section-4.8.5.4
Format Definition: This property is defined by the following
notation:
rrule = "RRULE" rrulparam ":" recur CRLF
rrulparam = *(";" xparam)
Example: All examples assume the Eastern United States time zone.
Daily for 10 occurrences:
DTSTART;TZID=US-Eastern:19970902T090000
RRULE:FREQ=DAILY;COUNT=10
==> (1997 9:00 AM EDT)September 2-11
Daily until December 24, 1997:
DTSTART;TZID=US-Eastern:19970902T090000
RRULE:FREQ=DAILY;UNTIL=19971224T000000Z
==> (1997 9:00 AM EDT)September 2-30;October 1-25
(1997 9:00 AM EST)October 26-31;November 1-30;December 1-23
Every other day - forever:
DTSTART;TZID=US-Eastern:19970902T090000
RRULE:FREQ=DAILY;INTERVAL=2
==> (1997 9:00 AM EDT)September2,4,6,8...24,26,28,30;
October 2,4,6...20,22,24
(1997 9:00 AM EST)October 26,28,30;November 1,3,5,7...25,27,29;
Dec 1,3,...
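To make the quoted notation concrete, here is a minimal Python sketch that splits an RRULE line into its parts and expands the first RFC example (daily for 10 occurrences). It uses only the standard library and does not call django-recurrence. The helper names parse_rrule and expand_daily are hypothetical, time zones are ignored, and only FREQ=DAILY with COUNT is handled. Whether a given REST serializer accepts this exact text is not confirmed here.

```python
from datetime import datetime, timedelta

def parse_rrule(line):
    """Split 'RRULE:FREQ=DAILY;COUNT=10' into a dict of rule parts."""
    name, _, value = line.partition(":")
    if name != "RRULE":
        raise ValueError("expected a line starting with 'RRULE:'")
    return dict(part.split("=", 1) for part in value.split(";"))

def expand_daily(dtstart, rule):
    """Expand a FREQ=DAILY rule that carries COUNT (INTERVAL is optional)."""
    if rule.get("FREQ") != "DAILY" or "COUNT" not in rule:
        raise ValueError("this sketch only handles FREQ=DAILY with COUNT")
    step = timedelta(days=int(rule.get("INTERVAL", "1")))
    return [dtstart + i * step for i in range(int(rule["COUNT"]))]

rule = parse_rrule("RRULE:FREQ=DAILY;COUNT=10")
# DTSTART from the RFC example, treated as a naive local time.
occurrences = expand_daily(datetime(1997, 9, 2, 9, 0), rule)
print(rule)  # {'FREQ': 'DAILY', 'COUNT': '10'}
print(occurrences[0].date(), occurrences[-1].date())  # 1997-09-02 1997-09-11
```

The last occurrence lands on September 11, which agrees with the "September 2-11" line in the RFC example.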

WEKA: Print the Indexes of Test data instances w.r.t original data at the time of cross validation

I have a question about the indexes of the test data instances chosen by WEKA during cross validation. How can I print the indexes of the test data instances that are being evaluated?
==================================
I have chosen:
Dataset : iris.arff
Total instances : 150
Classifier : J48
cross validation: 10 fold
I have also made output prediction as "PlainText"
=============
In the output window I can see like this :-
inst# actual predicted error prediction
1 3:Iris-virginica 3:Iris-virginica 0.976
2 3:Iris-virginica 3:Iris-virginica 0.976
3 3:Iris-virginica 3:Iris-virginica 0.976
4 3:Iris-virginica 3:Iris-virginica 0.976
5 3:Iris-virginica 3:Iris-virginica 0.976
6 1:Iris-setosa 1:Iris-setosa 1
7 1:Iris-setosa 1:Iris-setosa 1
....
...
...
In total there are 10 test data sets (15 instances in each).
======================
As WEKA uses stratified cross validation, the instances in the test data sets are randomly chosen.
So how can I print the indexes of the test data with respect to the data in the original file?
i.e.
inst# actual predicted error prediction
1 3:Iris-virginica 3:Iris-virginica 0.976
Which instance in the original data (among the 50 Iris-virginica instances) does this result correspond to?
===============
After a lot of searching, I found that the YouTube video below is helpful for the above problem.
I hope it will be helpful for any future visitor with the same question.
Weka Tutorial 34: Generating Stratified Folds (Data Preprocessing)
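The general idea behind the question (carrying each instance's original position through the fold assignment) can be illustrated without WEKA. The Python sketch below is an assumption-laden stand-in: it does not reproduce WEKA's own randomization or stratification, so its folds will not match WEKA's output. The function name stratified_folds and the seed are hypothetical, and the label list only mimics the class layout of iris.arff.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=1):
    """Assign original indices to k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Stand-in for iris.arff: 50 instances of each of three classes.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
folds = stratified_folds(labels, k=10)

# Print the first few test instances of fold 0 with their original index.
for inst, index in enumerate(folds[0][:3], start=1):
    print(inst, "original index", index, labels[index])
```

Because every fold stores original indices rather than copies of the rows, each prediction line can be traced back to a row of the source file.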

How to ActiveRecord get query values rather than models?

I have a query that is working as expected, but the result is coming back as an array of models, rather than the values. The query transforms a date, and then groups by that date.
The ActiveRecord query statement:
Participant.
  select(
    "count(*)",
    "date_trunc('month', participants.completed_at) as Date").
  group('Date')
which returns:
[ #<Participant id: nil>,
#<Participant id: nil>,
#<Participant id: nil>,
#<Participant id: nil> ]
Interestingly, the relevant data is there on the Participant... for example, if I look at the model it shows me the information I need:
pry(main)> results.first.attributes
=> {"count"=>70, "date"=>2014-04-01 00:00:00 UTC, "id"=>nil}
But it's sort of strange, since neither count nor date is an attribute of that model.
The raw SQL query:
SELECT
count(*),
date_trunc('month', participants.completed_at) as Date
FROM "participants"
WHERE "participants"."deleted_at" IS NULL
GROUP BY Date;
Which returns:
count | date
-------+---------------------
5742 | [NULL]
590 | 2016-05-01 00:00:00
798 | 2016-06-01 00:00:00
293 | 2017-01-01 00:00:00
I know that I can simply map the results to format it the way I want, but I would like to get the data directly from the ActiveRecord statement as an array or hash, rather than on a model object.

Select most recent rows in Django ORM with grouping

We have a system written in Django to track patients recruited to clinical trials.
Spreadsheets are used to record the number of patients recruited each month throughout a financial year, so a sheet only contains 12 months of data even though a study may run for years.
There is a table in a Django database into which the spreadsheets are imported each month. The data includes the month/year, a count of patients, and some other fields. Each import includes all the previous months' data; we need this to make sure no data has been changed on the import sheet since the last import.
For example, the import table containing two imports (the first up to January and the second up to February) would look like this:
id | study_id | data_date | patient_count | [other fields] -->
100 5456 2016-04-01 10 ...
101 5456 2016-05-01 8 ...
102 5456 2016-06-01 5 ...
... all months in between ...
109 5456 2016-01-01 12 ...
110 5456 2016-02-01 NULL ...
111 5456 2016-03-01 NULL ...
112 5456 2016-04-01 10 ...
113 5456 2016-05-01 8 ...
114 5456 2016-06-01 5 ...
... all months in between ...
121 5456 2016-01-01 12 ...
122 5456 2016-02-01 6 ...
123 5456 2016-03-01 NULL ...
The other fields include a foreign key to another table containing the actual study identification number (iras_number), so I have to join to that to select the rows for a particular study.
I want the most recent values of data_date and patient_count for a study, which may span more than one financial year, so I tried this query (iras_number is passed to the function performing this query):
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(import_study__iras_number=iras_number) \
.annotate(max_id=Max('id')).order_by()
However, this produces a SQL query which includes patient_count in the GROUP BY, resulting in duplicate rows:
data_date | patient_count | max_id
2016-04-01 10 100
2016-04-01 10 112
2016-05-01 8 101
2016-05-01 8 113
...
2016-01-01 12 109
2016-01-01 12 121
2016-02-01 NULL 110
2016-02-01 6 122
How do I select the most recent data_date and patient_count from the table using the ORM?
If I were writing the SQL I would do an inner select of the max(id) grouped by data_date and then use that to join, or use an IN query, to select the fields I require from the table; such as:
SELECT data_date, patient_count
FROM importstudydata
WHERE id IN (
SELECT MAX(importstudydata.id) AS "max_id"
FROM importstudydata INNER JOIN importstudy
ON importstudydata.import_study_id = importstudy.id
WHERE importstudy.iras_number = 5456
GROUP BY importstudydata.data_date
)
ORDER BY data_date ASC
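As a quick check of the inner-select approach, the SQL can be run against a few of the sample rows using an in-memory SQLite database. This is only a sketch in Python: the table and column names follow the SQL in the question, but the column types, the import_study id of 1, and the choice of SQLite are assumptions, and the inner id is qualified with its table name because both tables in this sketch have an id column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE importstudy (id INTEGER PRIMARY KEY, iras_number INTEGER);
    CREATE TABLE importstudydata (
        id INTEGER PRIMARY KEY,
        import_study_id INTEGER REFERENCES importstudy(id),
        data_date TEXT,
        patient_count INTEGER
    );
    INSERT INTO importstudy VALUES (1, 5456);
""")

# A few rows from the question's table: (id, data_date, patient_count).
sample = [
    (100, "2016-04-01", 10), (110, "2016-02-01", None),
    (112, "2016-04-01", 10), (122, "2016-02-01", 6),
]
conn.executemany("INSERT INTO importstudydata VALUES (?, 1, ?, ?)", sample)

# Keep only the row with the highest id for each data_date.
latest = conn.execute("""
    SELECT data_date, patient_count
    FROM importstudydata
    WHERE id IN (
        SELECT MAX(importstudydata.id)
        FROM importstudydata INNER JOIN importstudy
            ON importstudydata.import_study_id = importstudy.id
        WHERE importstudy.iras_number = ?
        GROUP BY importstudydata.data_date
    )
    ORDER BY data_date ASC
""", (5456,)).fetchall()
print(latest)  # [('2016-02-01', 6), ('2016-04-01', 10)]
```

Each data_date appears once, and February shows the value from the later import (6) rather than the earlier NULL.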
I've tried to create an inner select to replicate the SQL query; however, the inner select returns more than one field (column) and causes the query to fail:
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(id__in=ImportStudyData.objects.values('data_date') \
.filter(import_study__iras_number=iras_number) \
.annotate(max_data_id=Max('id')))
So far I can't get the inner select to return only the max(id) grouped by data_date and have everything performed in a single SQL query.
For now I'm splitting the query into a number of steps to get the result I want.
First I query for the most recent id of all rows related to the study
id_qry = ImportStudyData.objects.values('data_date')\
.filter(import_study__iras_number=iras_number)\
.annotate(max_id=Max('id'))
To get a list of just the numbers, stripping out the date, I use a list comprehension:
id_list = [x['max_id'] for x in id_qry]
This list is then used as a filter for the final query to get the number of patients
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(id__in=id_list)
It hits the database twice, and is computationally more expensive, but for now it works and I need to move on.
I'll come back to this problem at a later date.
Use distinct():
totals = ImportStudyData.objects.values('data_date', 'patient_count').filter(import_study__iras_number=iras_number).annotate(max_id=Max('id')).order_by('data_date').distinct()