I am cleaning a data set with google open refine and then trying to use it in Weka to do some cluster analysis. I am dealing with a nominal column that stores range of salaries.
I've specified the attribute as below
#ATTRIBUTE Income {'0-30000','30000-50000','50000-75000','75000-150000','>150000'}
In the data set there are rows in which the 'Income' column is null and I suppose that is the reason why I get the error:
'nominal value not declared in header, read Token line 13'
Is there a way I can replace null values with a string( and then specify the string in the attribute)? - If so how do i specify it in the #ATRRIBUTE row?
Or would it be possible to include the null in the set of attributes?
Thanks
Related
I have a big CSV text file uploaded weekly to an S3 path partitioned by upload date (maybe not important). The schema of these files are all the same, the formatting is all the same, the naming conventions are all the same. Each file contains ~100 columns and ~1M rows of mixed text/numeric types. The raw data looks like this:
id,date,string,int_values,double_values
"6F87U",2021-03-21,"Text",0,1.1483
"8DU87",2021-03-22,"More text, oh yes",1,2.525
"79LO2",2021-03-23,"Moar, give me moar, text",2,3.485489
When I run a Crawler with everything default, querying with Athena like so:
select * from tb_csv_data
...the results in Athena are thus:
id
date
string
int_values
double_values
"6F87U"
2021-03-21
"Text"
0
1.1483
"8DU87"
2021-03-22
"More text
oh yes"
1
"79LO2"
2021-03-23
"Moar
give me moar
text
The problem at this level seems to be with proper detection (read: ignoring) of commas as delimiters within quotation marks. So I have a CSV classifier with the following characteristics that I have attached to the Crawler, I run the Crawler again with the classifier attached, and the resulting table properties are thus:
Input format org.apache.hadoop.mapred.TextInputFormat
Output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib org.apache.hadoop.hive.serde2.OpenCSVSerde
Serde parameters
quoteChar "
separatorChar ,
Table properties
sizeKey 4356512114
objectCount 3
UPDATED_BY_CRAWLER crawler-name
CrawlerSchemaSerializerVersion 1.0
recordCount 3145398
averageRecordSize 1384
CrawlerSchemaDeserializerVersion 1.0
compressionType none
columnsOrdered true
areColumnsQuoted true
delimiter ,
typeOfData file
The resulting table with the same simple Athena query as above seems to be correct:
id
date
string
int_values
double_values
6F87U
2021-03-21
Text, yes
0
1.1483
8DU87
2021-03-22
More text, oh yes
1
2.525
79LO2
2021-03-23
Moar, give me moar, text
2
3.485489
The expected automatic inference of data types is supposed to be this (let's simplify and presume the date is correct as a string):
Column name
Data type
id
string
date
string
string
string
int_values
bigint (or long)
double_values
double
...but instead they're all strings!
Column name
Data type
id
string
date
string
string
string
int_values
string
double_values
string
I need this data to be accurately queryable from Athena as it is, where it is, so what can I do without further processing of the raw data? I suppose I could manually adjust the table properties in the Console but is that really correct when I need the entire pipeline to be automated? I also want to avoid having to cast types in queries 80+ times for each field as most of these columns are numeric. What can I do?
Thank you!
The limitation arrives from the serde that you are using in your query. Refer to note section in this doc which has below explanation :
When you use Athena with OpenCSVSerDe, the SerDe converts all column types to STRING. Next, the parser in Athena parses the values from STRING into actual types based on what it finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT. DATE type values are also parsed as INT.
For date type to be detected it has to be in UNIX numeric format, such as 1562112000 according to the doc.
I created a custom column in PowerBI, which concatenate columns.
I have the following:
Text.Combine({[Nip],[Nap],[Noup]]},"_")
However, I would like to have a specific text which change based on whether or not data is present in columns. I need to check if there is data in four columns. If there is data, a specific string of character should be inserted, if there is not, no data should be inserted.
I am trying to insert the outcome of the "IF"s, but there is some complexity, I have tried this, but this is not working, Power BI is telling me "Token Eof expected" :
If [Lapino] <> null or [Lapinou] <> null or [Werwolf] <> null or [Ciocolato] then
Text.Combine({[Nip],"Snoubadiuba",[Nap],[Noup]},"_")
else Text.Combine({[Nip],"BruttoCativo",[Nap],[Noup]},"_")
I believe this is as simple as changing your If to lowercase if. M code is case-sensitive.
Suppose nominal attribute is Outlook which contains three values Sunny , Overcast and Rainy. I want to convert this values of outlook attribute in numeric form i.e. 1,2,3 (order can be change). I saw one filter nominaltobinary in weka but this will create three columns. I don't want to create separate column for each value. How I can do this using Weka.
In the ARFF, if you are using it, you can have a comment which specifies what the values of the "Outlook" attribute are.
For example, you ARFF can contain this comment at the top -
%% Numeric values for the "Outlook" Attribute
%% Sunny = 1
%% Overcast = 2
%% Rainy = 3
%% Windy = 4
Then you can define the attribute as -
#attribute Outlook {1,2,3,4}
I dont think there is a way to do this in the UI. But you can use a text editor to edit the ARFF itself.
For this you can use "RenameNominalValues" filter under unsupervised ---> attributes.
Then under "selectedAttribute" type the attribute and
under "valueReplacements" type as Sunny:1,Overcast:2,Rainy:3,Windy:4
There is a field in a table in a database with data type BLOB SUB_TYPE 0. It stores text data generally (the field contains 'payment description', so mostly there are values like 'sent successfully', '15$ to go' and so on in that column). I do not know for what purpose that field was designed as BLOB SUB_TYPE 0, but the database is provided 'as is' and I cannot affect this.
I need to display the contents of that field as text.
What do I do:
select
cast(paym_desc as varchar (1000))
from payments
It works, but it works for Latin characters, digits and !##$%^&*(){}[] signs (those contained in standard ASCII charset I suppose). When it comes to Cyrillic, or Euro signs or pound symbol etc., I get dots instead of symbols.
To troubleshoot the situation I have tried to re-code the column using the solution provided here:
Firebird 2.5.2 change blob subtype
I've tried creating columns with blob sub_type 0 character set win1251 and blob sub_type 0 character set win1251 types and have copied my payment description there.
It is kind of working (no more dots!), but I (as expected) get different values in these columns and none contains what I need. For example, if I set my payment description to
ùúûĀąĈô£¥®҈ѾВАСЯвасяӃ€€€€
I get
ùúûĀąĈô£¥®҈ѾВАСЯвасяӃ€€€€ in win1251 column and
ùúûĀąĈô£¥®҈ѾВРin win1252 column.
Could somebody please give an advise on how to display such values properly?
A blob sub_type 0 is a binary blob. Binary blobs don't have a notion of character sets. You need to use blob sub_type 1 (or blob sub_type text) if you want to include a character set.
Based on the conversions you show in your question, it looks like you are storing UTF-8 into the binary blob, and then retrieve it as WIN1251 or WIN1252.
I am currently in the process of writing some code to analyse the mushrooms data off UCI using Weka. I am trying to get the values (i.e. coefficients) of the attributes, but the attribute name is truncated (indicated by the "..."), and am unable to get the full set of coefficients from the attributes.
e.g.
#attribute -0.251a=e+0.242m=k+0.241n=k-0.224t=p+0.213f=f... numeric
Any help would be greatly appreciated.
I believe your attribute names are being truncated because of an option in the PCA filter.
-A
Maximum number of attributes to include in
transformed attribute names.
(-1 = include all, default: 5)
Using the following code I change the value of this option to -1 and print an attribute name from the transformed data.
Instances originalTrain=...//load the training data
PrincipalComponents pca = new PrincipalComponents(); // new PCA filter
pca.setMaximumAttributeNames(-1); //set the value to -1
pca.setInputFormat(originalTrain);// inform filter about dataset
Instances newData = Filter.useFilter(originalTrain, pca); // apply filter
System.out.println(newData.attribute(0).name()); //look at new name
An example of the obviously untruncated attribute name is (scroll to view):
0.257stalksurfacebelowring=k+0.256stalksurfaceabovering=k+0.234ringtype=l+0.231odor=f-0.215ringtype=p-0.212stalksurfaceabovering=s+0.206sporeprintcolor=h-0.195stalksurfacebelowring=s+0.185bruises+0.18 stalkroot=b-0.176stalkcolorbelowring=w-0.175stalkcolorabovering=w-0.173odor=n-0.139sporeprintcolor=n-0.134sporeprintcolor=k+0.133habitat=p+0.133gillcolor=b+0.13 stalkcolorbelowring=b+0.13 stalkcolorabovering=b+0.129population=v+0.128stalkcolorabovering=n-0.125population=s-0.124stalkroot=e+0.121stalkcolorbelowring=n-0.119capcolor=w+0.119stalkcolorbelowring=p+0.119stalkcolorabovering=p-0.11gillspacing-0.105stalkroot=c-0.101gillcolor=n+0.094sporeprintcolor=w-0.087capshape=b-0.085gillcolor=k-0.082odor=l-0.082odor=a-0.082habitat=m+0.08 capcolor=y-0.08gillcolor=w+0.078gillcolor=h-0.076population=n-0.073habitat=g-0.072gillsize+0.068odor=y+0.068odor=s-0.067population=a-0.065capsurface=s-0.064odor=p+0.063gillcolor=g-0.059stalksurfaceabovering=f+0.057capsurface=y-0.057ringnumber=t-0.057stalksurfacebelowring=f+0.055ringnumber=o+0.051population=y-0.05habitat=u-0.048stalkcolorabovering=o-0.048stalkcolorbelowring=o+0.047veilcolor=w-0.046population=c+0.046capshape=k+0.046ringtype=e-0.046gillattachment-0.045stalkcolorabovering=g-0.045stalkcolorbelowring=g+0.043capcolor=e-0.041stalkroot=r-0.039gillcolor=u+0.039capcolor=g+0.034habitat=l-0.034veilcolor=n-0.034veilcolor=o-0.033habitat=w-0.031capcolor=p-0.031odor=c-0.031stalksurfacebelowring=y-0.031sporeprintcolor=r+0.03 capshape=f-0.029capcolor=n-0.028gillcolor=o-0.024stalkshape-0.024sporeprintcolor=o-0.024sporeprintcolor=y-0.024sporeprintcolor=b-0.024gillcolor=y-0.023gillcolor=e-0.023capcolor=b-0.023stalkcolorabovering=e-0.023stalkcolorbelowring=e-0.019gillcolor=r-0.018capshape=s-0.018sporeprintcolor=u-0.015capshape=x+0.012habitat=d+0.009gillcolor=p-0.006capsurface=g+0.005capsurface=f-0.004capshape=c+0.003stalkcolorbelowring=y-0.003stalkcolorabovering=y-0.003veilcolor=y+0.001stalksurfaceabovering=y+0.001capcolor=u+0.001capcolor=r-0.001capcolor=c+0 stalkcolorabovering=c+0 odor=m+0 ringtype=n+0 stalkcolorbelowring=c+0 ringnumber=n+0 ringtype=f