Hello, I want to create the following custom ARFF file in WEKA. I saw some examples of ARFF files but am still confused about how to model my problem in ARFF. I am trying to use it for a Naive Bayes classifier.
attributes = {long, sweet, colour}
long = {yes, no}
sweet = {yes, no}
colour = {yellow, other}
fruits = {banana, orange, other}
For example, I have 400 instances of
long(yes), banana
350 of
sweet(yes), banana
and 450 of
yellow(yes), banana
likewise for orange and other fruits. How do I model this in ARFF?
If you have never created an ARFF file from scratch, the easiest way is to put your data into a CSV file instead, open it from the Weka Explorer, and save it as ARFF. The attributes will be handled appropriately. From there it will be easier to modify the generated ARFF, if needed.
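For reference, a minimal sketch of what such an ARFF file could look like for these attributes (an assumption about the intended layout, not taken from the question): ARFF stores one instance per row, so the counts above are expressed by repeating rows, e.g. 400 rows with long=yes and fruit=banana, and Naive Bayes then learns from those frequencies.

@relation fruits

@attribute long {yes, no}
@attribute sweet {yes, no}
@attribute colour {yellow, other}
@attribute fruit {banana, orange, other}

@data
yes,yes,yellow,banana
yes,no,other,orange
no,yes,other,other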
I have the following situation.
The Flight model (flights) has a field named 'airlines_codes' (a TextField) in which I store data in a JSON-array-like format:
["TB", "IR", "EP", "XX"]
I need to filter the flights by 2-letter airline code (IATA format), for example 'XX', and I achieve this primitively but successfully like this:
filtered_flights = Flight.objects.filter(airlines_codes__icontains='XX')
This seems fine, but actually it is not.
I have flights where airlines_codes look like this:
["TBZ", "IR", "EP", "XXY"]
Here there are 3-letter codes (ICAO format), and the query filter above will obviously not work: icontains='XX' also matches 'XXY'.
PS. I cannot move to PostgreSQL; I also cannot alter the database in any way. This has to be achieved only by some query.
Thanks for any ideas.
Without altering the database in any way, you need to filter the value as a string. Your best bet might be airlines_codes__contains. Matching the code together with its surrounding quotes ensures that 'XX' matches only as a complete two-letter code and never as part of a 3-letter code like 'XXY'. Here's what I would recommend, assuming your list will always be stored exactly as you represent it:
Flight.objects.filter(airlines_codes__contains='"XX"')
As of Django 3.1, JSONField is supported on a wider array of databases. Ideally, for someone else building a similar system from the ground up, that field would be the preferable approach.
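For illustration, a minimal sketch of the JSONField approach, assuming Django 3.1+ and a backend that supports the contains lookup on JSONField (it is not available on Oracle or SQLite):

from django.db import models

class Flight(models.Model):
    airlines_codes = models.JSONField(default=list)  # e.g. ["TBZ", "IR", "EP"]

# JSON containment matches whole array elements, so 'XX' will never
# match inside a 3-letter code such as 'XXY':
filtered_flights = Flight.objects.filter(airlines_codes__contains=['XX'])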
I want to forecast some data (say, countries' temperatures). Is there any way to add multiple countries' temperatures at once in DeepAR (the algorithm available in the AWS SageMaker marketplace) and have DeepAR forecast them independently? Is it possible to remove a particular country's data and add another after a few days?
I am new to forecasting and wanted to try DeepAR. If anyone has already worked on this, please give me some guidelines on how to do this using DeepAR.
Link - https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html
This is a late reply to this post, but it may be helpful to others in the future. The answer to your first question is yes.
The page you linked to references the cat field, which allows you to encode a vector representing different record groups. In your case the cat field can just be a single value, but it can also encode more complex relationships with more dimensions in the vector.
Say you have 3 countries you want to make predictions for. You have some time-series temperature training data for each country, and you would enter them as rows in the training JSON file like this:
Country 1:
{"start": "02/09/2019 00:00:00", "target": [T1,T2,T3,T4,...], "cat": [0]}
Country 2:
{"start": "02/09/2019 00:00:00", "target": [T1,T2,T3,T4,...], "cat": [1]}
Country 3:
{"start": "02/09/2019 00:00:00", "target": [T1,T2,T3,T4,...], "cat": [2]}
The category field indicates to DeepAR that these are independent data categories, in other words, different countries.
The frequency (the time between temperature measurements) has to be the same for all data; however, the start time and the number of training points do not.
Once you've trained the model and opened the endpoint, you can make a prediction for a particular country by passing its recent context along with the same cat value that country had during training.
This lets you build a single model that can make predictions for many independent groups of data.
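To make that concrete, here is a hedged sketch of an inference request payload for country 2 (the layout follows the DeepAR JSON inference format; the temperature values, configuration, and endpoint call are illustrative):

import json

# Illustrative DeepAR inference payload; the context values are made up.
request = {
    "instances": [
        {
            "start": "2019-02-09 00:00:00",
            "target": [21.3, 22.1, 20.8],  # recent context for country 2
            "cat": [1]                     # same category used in training
        }
    ],
    "configuration": {
        "num_samples": 50,
        "output_types": ["mean", "quantiles"],
        "quantiles": ["0.1", "0.5", "0.9"]
    }
}
body = json.dumps(request)
# body can then be posted to the SageMaker endpoint, e.g. with boto3:
# sagemaker_runtime.invoke_endpoint(EndpointName=..., Body=body,
#                                   ContentType="application/json")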
I'm not sure exactly what you mean by the second question. If you mean adding more training data for another country later on, this would require you to create a different training dataset with an additional category for that country and then re-train the model.
I have two models: Image (filename, uploaded_at) and Annotation (author, quality, ... FK to Image).
An Image can have multiple Annotations, and an Annotation belongs to one Image.
I'd like to build a queryset that fetches all annotations (including the relation to the image, so I can display image fields as well) that meet some criteria.
All fine until here, but I'd also like to display the images that have no annotations (a left outer join), and I'm not sure how to do this.
To clarify, I am trying to fetch the data so I can build a table like this:
Image name, Image date, Annotation author, Annotation quality
image 1,    2019,       john,              high
image 2,    2019,       doe,               high
image 3,    2019,       null,              null
image 4,    2014,       doe,               high
Maybe I'm using the wrong approach. I'm using Annotation as the main model, but then I don't seem to have a way to display the images that don't have an Annotation, which kind of makes sense, as there is no Annotation.
This is what I'm doing:
Annotation.objects.select_related('image').filter(
    Q(image__isnull=True) | Q(other condition))
However, if I use Image as the main model, the relation is many Annotations to one Image, so I can't use select_related, and I'm not sure prefetch_related works for what I need. I don't know how to get the data properly. I tried this:
Image.objects.prefetch_related('annotations').filter(Q(annotations__isnull=True) | Q(other condition))
The prefetch_related doesn't seem to make any difference to the query. Also, I'd like to have the Annotation data flat, in the same row (i.e. row 1: image 1, annotation 1; row 2: image 1, annotation 2; etc.) instead of having to go through image.annotation_set..., as that wouldn't fit my needs.
If you need an outer join, it has to be a left join, as you correctly assumed. So you'll have to start from the Image model. To get a flat representation rather than nested Annotation objects, use values(), which returns a queryset of dictionaries (rather than model objects):
queryset_of_dictionaries = (
    Image.objects
    .filter(Q(annotations__isnull=True) | Q(other condition))
    .values('name', 'date', 'annotations__author', 'annotations__quality',
            # etc. – you have to enumerate all fields you need
            )
    # you'll probably want the rows in a particular order
    .order_by(
        # a field list like in values()
    )
)

# accessing the rows
for dic in queryset_of_dictionaries:
    print(f'Image name: {dic["name"]}, quality: {dic["annotations__quality"]}')
I am pretty new to machine learning in general and to scikit-learn in particular.
I am trying to use the example given on the site http://scikit-learn.org/stable/tutorial/basic/tutorial.html
To practice on my own, I am using my own dataset. My dataset is divided into two different CSV files:
Train_data.csv (contains 32 columns; the last column is the output value).
Test_data.csv (contains 31 columns; the output column is missing, which should be the case, no?)
The test data has one column less than the training data.
I am using the following code to learn (using training data) and then predict (using test data).
The issue I am facing is the error:
*ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time*
Here is my code (sorry if it looks completely wrong :( )
import pandas as pd #import the library
from sklearn import svm
mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"] #my csv has header row, and the output label column is named "Desired"
data = mydata.ix[:,:-3] #select all but the last column as data
clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target) #Code from the URL above
test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column
clf.predict(test_data[-1:]) #Code from the URL above
The training data CSV header looks something like this:
Value1,Value2,Value3,Value4,Output
The test data CSV header looks something like this:
Value1,Value2,Value3,Value4
Thanks :)
Your problem is a supervised problem: you have some data in the form of (input, output) pairs.
The input is the features describing your example, and the output is the prediction your model should give for that input.
In your training data you have one more attribute in your CSV file because, in order to train your model, you need to give it the output.
The general workflow in sklearn for a supervised problem looks like this:
X, Y = read_data(data)                  # your own loading code
n = int(len(X) * 0.8)                   # slice indices must be integers
X_train, X_test = X[:n], X[n:]
Y_train, Y_test = Y[:n], Y[n:]
model.fit(X_train, Y_train)
model.score(X_test, Y_test)
To split your data you can use train_test_split, and you can use several metrics to judge your model's performance; see the sketch below.
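For example, a minimal sketch adapted to the question's files (the file name and the 'Desired' label column come from the question; the split ratio and SVM parameters are illustrative):

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

mydata = pd.read_csv("Train - Copy.csv")
target = mydata["Desired"]          # the output label column
data = mydata.iloc[:, :-1]          # all columns except the label

# hold out 20% of the labelled data to measure performance
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=0)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))    # accuracy on the held-out split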
You should check the shape of your data:
data.shape
It seems like you're dropping the last 3 columns instead of only the last one. Try instead:
data = mydata.iloc[:, :-1]  # all columns except the last (.ix is deprecated in recent pandas)
I'm looking to match some product info, return structured data, and rewrite or look up the values.
Example input:
"I have a 1999 Cat (D-6) and an Ingersoll Rand Model Z for sale"
From this I want to create something like:
[ { year:1999, brand:"CATERPILLAR", model:"D6" },
{ year:null, brand:"INGERSOLL-RAND", model:"MODEL Z" } ]
Based on known data:
/\d{4}/, YEAR
...
/cat(erpill[ae]r)?/, BRAND, "CATERPILLAR"
...
/d[\-\s]6/, MODEL, "D6"
Can this be done with regex alone, or do I need a lexer?
I can figure out the regexes no problem, but I'm confused about the rewriting part and about grouping things together.
I think you want to extract equipment trading details.
For this you need NLP. You can use Stanford CoreNLP to design your own NLP rules, or you can train a model on your own dataset.
Stanford NER is a pre-built model that will give you entities like date and time, organization, location, person, percentage, and price.
Other related tools: Apache OpenNLP, AYLIEN.
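That said, for the simple cases in the question, the regex-plus-rewrite idea can be sketched without a full NLP pipeline. A hedged Python sketch (the pattern table, the brand-starts-a-record grouping heuristic, and the canonical values are illustrative assumptions based on the question's examples):

import re

# Illustrative pattern table: (regex, field, canonical value or None).
# None means "use the matched text as-is"; a string is the rewritten value.
PATTERNS = [
    (re.compile(r'\b(\d{4})\b'),                   'year',  None),
    (re.compile(r'\bcat(?:erpill[ae]r)?\b', re.I), 'brand', 'CATERPILLAR'),
    (re.compile(r'\bingersoll[\s-]?rand\b', re.I), 'brand', 'INGERSOLL-RAND'),
    (re.compile(r'\bd[\s-]?6\b', re.I),            'model', 'D6'),
    (re.compile(r'\bmodel\s+z\b', re.I),           'model', 'MODEL Z'),
]

def extract(text):
    # Collect (position, field, value) hits and walk them left to right.
    hits = sorted(
        (m.start(), field, canon if canon is not None else m.group(0))
        for regex, field, canon in PATTERNS
        for m in regex.finditer(text)
    )
    records, pending_year = [], None
    for _, field, value in hits:
        if field == 'year':
            pending_year = value        # hold the year until a brand appears
        elif field == 'brand':          # each brand starts a new record
            records.append({'year': pending_year, 'brand': value, 'model': None})
            pending_year = None
        elif field == 'model' and records:
            records[-1]['model'] = value
    return records

print(extract("I have a 1999 Cat (D-6) and an Ingersoll Rand Model Z for sale"))
# [{'year': '1999', 'brand': 'CATERPILLAR', 'model': 'D6'},
#  {'year': None, 'brand': 'INGERSOLL-RAND', 'model': 'MODEL Z'}]

So regex alone can handle the extraction and rewriting; the part that pushes you toward a lexer/parser (or NLP) is the grouping, which here relies on a fragile positional heuristic.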