Apologies as I'm a complete novice when it comes to Weka.
I have 100 instances, and each instance has 400 attributes, most of which have a single value. However, some attributes have multiple values because they contain a time component. I was wondering whether Weka can analyse multiple values for one attribute and, if so, how I should separate these values so that Weka can read them (e.g. commas, semi-colons)?
Many Thanks for your help
R
Weka natively works with a format called ARFF, an acronym for Attribute-Relation File Format. This format has a clearly differentiated structure in three parts:
1. Header. Here, the name of the relation is defined. Its format is as follows:
@relation <name-of-the-relationship>
Where <name-of-the-relationship> is of type String. If this name contains any spaces, it must be enclosed in quotation marks.
2. Attribute declarations. This section declares the attributes that make up the file, together with their types. The syntax is:
@attribute <attribute-name> <type>
Where <attribute-name> is of type String, with the same restrictions as above.
Weka accepts several types:
a) NUMERIC. Real numbers.
b) INTEGER. Whole numbers.
c) DATE. Dates. This type should be followed by a quoted format label. The format label is composed of separator characters (hyphens and/or spaces) and time units:
dd Day.
MM Month.
yyyy Year.
HH Hours.
mm Minutes.
ss Seconds.
d) STRING. Text, with the same restrictions on the String type mentioned previously.
e) NOMINAL (enumerated). This type is written as the list of possible values (or character strings) that the attribute can take, enclosed in braces and separated by commas. For example, an attribute that indicates the weather could be defined as:
@attribute weather {sunny, rainy, cloudy}
3. Data section. Here we declare the data that make up the relation, separating attribute values with commas and instances with line breaks.
@data
4,3.2
Although this is the "full" form, it is also possible to define the data in a short form (sparse data). If a dataset contains many zero values, we can omit the entries that are zero, surrounding each row in braces and placing the attribute index (counting from 0) in front of each value.
An example of this is as follows:
@data
{3 3, 14 1}
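As a rough illustration of the sparse form, here is a minimal Python sketch that turns a dense numeric row into a sparse ARFF row. The helper name to_sparse_row is made up for this example, and it assumes all values are numeric so no quoting is needed.

```python
def to_sparse_row(values):
    """Build a sparse ARFF data row from a dense list of numbers.

    Zero entries are omitted; every remaining value is prefixed with
    its 0-based attribute index, in ascending index order.
    """
    pairs = [f"{index} {value}" for index, value in enumerate(values) if value != 0]
    return "{" + ", ".join(pairs) + "}"


# A row with 15 attributes where only attributes 3 and 14 are non-zero.
dense = [0] * 15
dense[3] = 3
dense[14] = 1
sparse = to_sparse_row(dense)
```

Here sparse holds the row in the same form as the example above.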
If any value is unknown, it is written as a question mark ("?"). To add comments, use the % character.
So, you can use several values to construct your dataset.
Example:
% Test Weka.
@relation MyTest

@attribute nombre STRING
@attribute ojo_izquierdo {Bien,Mal}
@attribute dimension NUMERIC
@attribute fecha_analisis DATE "dd-MM-yyyy HH:mm"

@data
Antonio,Bien,38.43,"12-04-2003 12:23"
'Maria Jose',?,34.53,"14-05-2003 13:45"
Juan,Bien,43,"01-01-2004 08:04"
Maria,?,?,"03-04-2003 11:03"
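If the data starts out in some other form, a file like this can be generated by a short script. The following is only a sketch in Python: the function names arff_value and to_arff are hypothetical, the quoting rule (single quotes around values containing spaces or commas) is a simplifying assumption, and date attributes are left out to keep it short.

```python
def arff_value(value):
    """Format one cell: None becomes '?', text with spaces or commas is quoted."""
    if value is None:
        return "?"
    text = str(value)
    if " " in text or "," in text:
        return "'" + text.replace("'", "\\'") + "'"
    return text


def to_arff(relation, attributes, rows):
    """Assemble the header, attribute declarations and data section as one string."""
    lines = [f"@relation {relation}", ""]
    for name, type_spec in attributes:
        lines.append(f"@attribute {name} {type_spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(arff_value(cell) for cell in row))
    return "\n".join(lines)


attributes = [
    ("nombre", "STRING"),
    ("ojo_izquierdo", "{Bien,Mal}"),
    ("dimension", "NUMERIC"),
]
rows = [("Antonio", "Bien", 38.43), ("Maria Jose", None, 34.53)]
arff_text = to_arff("MyTest", attributes, rows)
```

Writing arff_text to a file with the .arff extension gives something Weka should be able to open, subject to the assumptions above.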
I have a table that has a text field which has formatted strings that represent money.
For example, it will have values like these, including some "bad" invalid data:
$5.55
$100050.44
over 10,000
$550
my money
570.00
I want to convert this to a numeric field, keeping the numbers that can be recovered and converting any values that can't be to null.
I was originally using this function, which did convert clean numbers (numbers without any formatting). The issue was that it would not convert $5.55, for example, and set it to null instead.
CREATE OR REPLACE FUNCTION public.cast_text_to_numeric(
    v_input text)
    RETURNS numeric
    LANGUAGE 'plpgsql'
    COST 100
    VOLATILE
AS $BODY$
declare
    v_output numeric default null;
begin
    begin
        v_output := v_input::numeric;
    exception when others then
        return null;
    end;
    return v_output;
end;
$BODY$;
I then created a simple update statement which removes every character that is not a word character (letter, digit or underscore) or a period:
update public.numbertesting set field_1=regexp_replace(field_1,'[^\w.]','','g')
and if I run this statement, it correctly converts the text data to numeric and maintains the number:
alter table public.numbertesting
alter column field_1 type numeric
using field_1::numeric
But I need to use the function in order to properly discard any bad data and set those values to null.
Even after I run the clean-up so that the text value is, say, 5.55, my "cast_text_to_numeric" function STILL sets this to null. I don't understand why it sets it to null when the statement above correctly converts it to a proper number.
How can I fix my cast_text_to_numeric function to properly convert values such as 5.55, etc.?
I'm OK with discarding (setting to NULL) any values that don't end up with only numbers and a period. The regular expression will strip out all other characters, and if there happen to be two numbers in the text field, the script would combine them into one (spaces are removed), and I'm fine with that.
In the example of data above, after conversion, the end result in numeric field would be:
5.55
100050.44
null
550
null
570.00
FYI, I am on Postgres 11 right now
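For reference, the intended clean-then-cast behaviour can be modelled outside the database. This is a minimal Python sketch, not a fix for the PL/pgSQL function: the helper name to_numeric_or_none is made up for illustration, and Decimal stands in for the Postgres numeric type (an assumption, since the two parsers do not accept exactly the same literals).

```python
import re
from decimal import Decimal, InvalidOperation


def to_numeric_or_none(text):
    """Strip everything except word characters and periods, then try to parse.

    Returns a Decimal when the cleaned text is a valid number, else None.
    """
    if text is None:
        return None
    cleaned = re.sub(r"[^\w.]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


samples = ["$5.55", "$100050.44", "over 10,000", "$550", "my money", "570.00"]
results = [to_numeric_or_none(sample) for sample in samples]
```

With the sample values from the question, results contains numbers for the first, second, fourth and sixth entries and None for the other two, matching the expected outcome listed above.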
I want to make a number format in Google Sheets that turns large numbers into their abbreviated form. Example: "1 200" -> "1.2k", "1 500 000 000 000 000" (one point five quadrillion) -> "1.5Qa". I have absolutely no idea what that would look like.
Thanks in advance.
this should cover your needs:
=ARRAYFORMULA(IF(A:A<10^3, A:A,
IF(1*A:A<10^6, TEXT(A:A/10^3, "#.0\k"),
IF(1*A:A<10^9, TEXT(A:A/10^6, "#.0\M"),
IF(1*A:A<10^12, TEXT(A:A/10^9, "#.0\B"),
IF(1*A:A<10^15, TEXT(A:A/10^12, "#.0\T"),
IF(1*A:A<10^18, TEXT(A:A/10^15, "#.0\Q\a"),
IF(1*A:A<10^21, TEXT(A:A/10^18, "#.0\Q\i"),
IF(1*A:A<10^24, TEXT(A:A/10^21, "#.0\S\x"),
IF(1*A:A<10^27, TEXT(A:A/10^24, "#.0\S\p"),
IF(1*A:A<10^30, TEXT(A:A/10^27, "#.0\O"),
IF(1*A:A<10^33, TEXT(A:A/10^30, "#.0\N"),
IF(1*A:A<10^36, TEXT(A:A/10^33, "#.0\D"),
IF(1*A:A<10^39, TEXT(A:A/10^36, "#.0\U"),
IF(1*A:A<10^42, TEXT(A:A/10^39, "#.0\D\d"),
IF(1*A:A<10^45, TEXT(A:A/10^42, "#.0\T\d"),
IF(1*A:A<10^48, TEXT(A:A/10^45, "#.0\Q\a\d"),
IF(1*A:A<10^51, TEXT(A:A/10^48, "#.0\Q\u\d"),
IF(1*A:A<10^54, TEXT(A:A/10^51, "#.0\S\x\d"),
IF(1*A:A<10^57, TEXT(A:A/10^54, "#.0\S\p\d"),
IF(1*A:A<10^60, TEXT(A:A/10^57, "#.0\O\d"),
IF(1*A:A<10^63, TEXT(A:A/10^60, "#.0\N\d"),
IF(1*A:A<10^66, TEXT(A:A/10^63, "#.0\V"),
IF(1*A:A<10^69, TEXT(A:A/10^66, "#.0\C"), ))))))))))))))))))))))))
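The tiered IF chain above boils down to three steps: find the largest power of 1000 that does not exceed the number, divide by it, and append the matching suffix. A minimal Python sketch of that logic follows; the function name abbreviate and the SUFFIXES list are illustrative only, and just the first dozen suffixes from the formula are included.

```python
SUFFIXES = ["", "k", "M", "B", "T", "Qa", "Qi", "Sx", "Sp", "O", "N", "D"]


def abbreviate(number):
    """Return the number with one decimal place and a magnitude suffix.

    Values below 1000 are returned unchanged (as text), like the first
    branch of the formula.
    """
    magnitude = abs(number)
    if magnitude < 1000:
        return str(number)
    index = 0
    # Step up one suffix for every further factor of 1000.
    while index < len(SUFFIXES) - 1 and magnitude >= 1000 ** (index + 1):
        index += 1
    return f"{number / 1000 ** index:.1f}{SUFFIXES[index]}"


small = abbreviate(1200)
large = abbreviate(1500000000000000)
```

Like the formula, this produces text rather than a number, so the results cannot be used directly in further arithmetic.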
Use a custom number format
Select the range of cells you want to convert
Go to Format -> Number -> More Formats -> Custom number format
Paste into the input field [>999999]#,,"M";#,"K"
Click on Apply - Done
I do not think it is possible to configure more than two formats for a cell so that it adapts dynamically to the number inside it without some scripting. That would be nice, as it would preserve the number type.
But if there is no need to preserve the number type and string is acceptable, then strings could be generated like this using TEXT function and dynamically setting format for the number based on a reference:
=INDEX(
TEXT(
E2:E24,
"0.0"
& IFNA(
REPT(",", (VLOOKUP(INT(LOG10(E2:E24)), $C$2:$C$8, 1, TRUE)) / 3)
& "\" & VLOOKUP(INT(LOG10(E2:E24)), {$C$2:$C$8, $A$2:$A$8}, 2, TRUE)
)
)
)
On the left you can see the reference columns, where I used symbols from the wiki.
I have a series of values in Tableau that are long strings intermixed with letters and numbers. I am unable to control the data output, but would like to parse the names from these strings. They follow the following format:
Potato 1TByte 4.5 NFA
Board 256GByte 553 NCA
Launch 4 512GByte 4.5 NFA
Launch 4S 512GByte 4.5 NCA
From each of these, I am attempting to capture the following:
"Potato"
"Board"
"Launch 4"
"Launch 4S"
Each string follows the same format: the name, followed by size, followed by some extra information we don't really care about.
I've tried to put together some text parsing strings, but am coming up short, and am still trying to learn regular expressions.
The Tableau calculated field I was trying to work with was something like the following:
LEFT([String], FIND([String], "Byte") - 2)
The issue is that the text and numbers preceding Byte can be anywhere from 2 to 4 characters long, and I need a way to identify that length.
Any help would be greatly appreciated!
One option which uses a regex replacement:
REGEXP_REPLACE('Launch 4 512GByte 4.5 NFA', ' \d+[A-Z]Byte .*$', '')
This strips off everything from the Byte term to the right, leaving us with only the product name.
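To see how the pattern behaves on all four sample strings, here is the same replacement sketched in Python with the standard re module. The pattern is copied from the expression above; the function name product_name is made up for the example.

```python
import re

# A space, one or more digits, one capital letter, "Byte", then the rest of the line.
SIZE_AND_REST = re.compile(r" \d+[A-Z]Byte .*$")


def product_name(text):
    """Remove the size term and everything after it, keeping only the name."""
    return SIZE_AND_REST.sub("", text)


samples = [
    "Potato 1TByte 4.5 NFA",
    "Board 256GByte 553 NCA",
    "Launch 4 512GByte 4.5 NFA",
    "Launch 4S 512GByte 4.5 NCA",
]
names = [product_name(sample) for sample in samples]
```

The digits in "Launch 4" and "Launch 4S" are left alone because they are not directly followed by a single capital letter and "Byte".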
You could try the following, which seems to work. Below are the formulas for the derived columns (your source column is called [Name]):
Step1 = LEFT([Name],FIND([Name],"Byte")-1)
Step2 = LEN([Step1])-LEN(REPLACE([Step1]," ",""))
Step3 = FINDNTH([Step1]," ",[Step2])
Step4 = LEFT([Step1],[Step3]-1)
And of course you can nest all of these in a single calculated field. I kept them as separate columns to make them easier to follow.
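The same step-by-step idea, without regular expressions, can be sketched in Python: cut the string just before "Byte", then cut again at the last remaining space. The function name name_before_size is hypothetical, and returning the input unchanged when "Byte" is missing is an assumption of this sketch rather than something the Tableau steps do.

```python
def name_before_size(text):
    """Keep the text before the last space that precedes 'Byte'."""
    byte_position = text.find("Byte")
    if byte_position == -1:
        return text  # assumption: leave strings without a size untouched
    before_byte = text[:byte_position]      # e.g. "Launch 4 512G"
    last_space = before_byte.rfind(" ")     # the space in front of the size
    return before_byte[:last_space]


parsed = [
    name_before_size(value)
    for value in [
        "Potato 1TByte 4.5 NFA",
        "Board 256GByte 553 NCA",
        "Launch 4 512GByte 4.5 NFA",
        "Launch 4S 512GByte 4.5 NCA",
    ]
]
```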
I have a data step where I have a few columns that need to be tied to one other column.
I have tried using multiple "from" and "to" statements and a couple of other permutations of that, but nothing seems to do the trick. The code looks something like this:
data analyze;
set css_email_analysis;
from = bill_account_number;
to = customer_number;
output;
from = bill_account_number;
to = email_addr;
output;
from = bill_account_number;
to = e_customer_nm;
output;
run;
I would like to see two columns showing bill accounts in the "from" column, and the other values in the "to", but instead I get a bill account and its customer number, with some "..."'s for the other values.
Issue
This is most likely because SAS has only two data types (numeric and character), and the type of the variable to is fixed the first time it is assigned, when it receives the value of customer_number. Your second assignment then tries to give to the value of email_addr. Assuming email_addr is a character variable, one of two things can happen:
Customer_number is numeric: to has already been set up as numeric, so SAS cannot turn it into a character variable, and a note like this may appear in the log:
NOTE: Invalid numeric data, 'me#mywebsite.com' , at line 15 column 8.
to=. _ERROR_=1 _N_=1
Customer_number is a character - to has been set up as a character, but without explicitly defining its length, if it happens to be shorter than the value of email_addr then the email address will be truncated. SAS will not show an error if this happens:
Code:
data _NULL_;
to = 'hiya';
to = 'me#mydomain.com';
put to=;
run;
to=me#m
to is set with a length of 4, and SAS does not expand it to fit the new data.
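As a loose analogy only, the behaviour can be modelled in a few lines of Python. The class FixedLengthChar is invented for this illustration and is not how SAS is implemented: it simply fixes the length on the first assignment and silently truncates later values.

```python
class FixedLengthChar:
    """Toy model of a character variable whose length is set by its first value."""

    def __init__(self):
        self.length = None
        self.value = ""

    def assign(self, text):
        if self.length is None:
            self.length = len(text)  # length is decided once, on first use
        # Later values are cut (or blank-padded) to the fixed length, with no error.
        self.value = text[: self.length].ljust(self.length)


to = FixedLengthChar()
to.assign("hiya")
to.assign("me#mydomain.com")
truncated = to.value
```

After the second assignment, truncated holds only the first four characters of the email address, matching the log output above.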
Detail
The thing to bear in mind here is how SAS works behind the scenes.
The data statement sets up an output location
The set statement adds the variables from the first observation of the specified dataset to a space in memory called the PDV (program data vector), inheriting lengths and data types.
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm
===================================================================
010101 | 758|me#my.com |John Smith
The first assignment to to adds another variable, which inherits the characteristics of customer_number
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |758
(to is either char length 3 or a numeric)
Subsequent assignments to to will not alter the characteristics of the variable, and SAS will continue processing
PDV (if customer_number is character = TRUNCATION):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |me#
PDV (if customer_number is numeric = DATA ERROR, to set to missing):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |.
Resolution
To resolve this issue it's probably easiest to set the length and type of to before your first to statement:
data analyze;
set css_email_analysis;
from = bill_account_number;
length to $200;
to = customer_number;
output;
...
You may get messages like this, where SAS has converted data on your behalf:
NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
27:8
N.B. it's not necessary to explicitly define the length and type of from, because as far as I can see, you only ever get the values for this variable from one variable in the source dataset. You could also achieve this with a rename if you don't need to keep the bill_account_number variable:
rename bill_account_number = from;
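The target shape (one from value paired with several to values, all stored as text) can also be sketched outside SAS. The Python below is only an illustration: the function name to_edges is made up, and converting every to value with str() plays the role that declaring to as a character variable plays in the data step.

```python
def to_edges(record, source_key, target_keys):
    """Return (from, to) pairs, converting every 'to' value to text."""
    return [(record[source_key], str(record[key])) for key in target_keys]


record = {
    "bill_account_number": "010101",
    "customer_number": 758,
    "email_addr": "me#my.com",
    "e_customer_nm": "John Smith",
}
edges = to_edges(
    record,
    "bill_account_number",
    ["customer_number", "email_addr", "e_customer_nm"],
)
```

Each input row produces three output rows, which is what the three output statements in the data step are meant to do.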
I am using the date/time format below in Google Sheets:
01 Apr at 11:00
I wonder whether it is possible to use Data Validation (or any other function) to report an error (add the small red triangle to the corner of the cell) when the format differs in any way.
Possible values in the given format:
01 -> any number between 01-31 (but not "1", there must be the leading zero)
space
Apr -> 3 letters for month (Jan, Feb, Mar... Dec)
space
at
space
11 -> hours in 24h format (00, 01...23)
:
00 -> minutes (00, 01,...59)
Is there any way to validate that the cell contains text in exactly the format described above?
The right way to do this is to use a regular expression with the "regexmatch()" function in Google Sheets. For the given example, I made the regular expression below:
[0-3][0-9] (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) at [0-2][0-9]\:[0-5][0-9]
Process:
Select range of cells to be validated
Go to Data > Data Validation
Under Criteria select "Custom formula is"
Paste: =regexmatch(to_text(K4); "[0-3][0-9] (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) at [0-2][0-9]\:[0-5][0-9]")
Make sure that K4 in "to_text(K4)" is replaced by the upper-left cell of the selected range
Save
Hope it helps someone :)
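Note that the character classes in the pattern above are looser than the rules in the question: [0-3][0-9] also accepts 00 and 32 to 39, [0-2][0-9] also accepts 24 to 29, and an unanchored pattern can match inside a longer string. A stricter variant is sketched below in Python so that it can be tested quickly. The pattern is my own suggestion rather than part of the answer above, it should carry over to REGEXMATCH if wrapped in ^ and $, and it still does not check that the day exists in the given month (31 Feb passes).

```python
import re

STRICT_PATTERN = re.compile(
    r"(0[1-9]|[12][0-9]|3[01]) "                      # day 01-31, leading zero required
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) at "
    r"([01][0-9]|2[0-3]):[0-5][0-9]"                  # 24h time 00:00-23:59
)


def is_valid(text):
    """True only if the whole string has the form '01 Apr at 11:00'."""
    return STRICT_PATTERN.fullmatch(text) is not None


checks = {
    "01 Apr at 11:00": is_valid("01 Apr at 11:00"),
    "1 Apr at 11:00": is_valid("1 Apr at 11:00"),
    "32 Apr at 11:00": is_valid("32 Apr at 11:00"),
    "01 Apr at 24:00": is_valid("01 Apr at 24:00"),
    "01 Apr at 11:00 extra": is_valid("01 Apr at 11:00 extra"),
}
```

Only the first entry in checks is accepted; the other four are rejected.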
You may try the formula for data validation:
=not(iserror(SUBSTITUTE(A1," at","")*1))*(len(A1)=15)*(right(A1,2)*1<60)
not(iserror(SUBSTITUTE(A1," at","")*1)) checks that the whole entry is a legal date
(len(A1)=15) checks that the parts are entered with 2 digits
(right(A1,2)*1<60) checks for too many minutes; for some reason 01 Apr at 11:99 is treated as a legal date.
Select the range of cells where the data validation should apply.
Go to Data -> Data validation
For "Criteria" select "Custom formula is"
Enter the following in the text field next to "Custom formula is":
=regexmatch(Tablename!B2; "^[a-z_]*$")
Here "Tablename" should be replaced by the name of the sheet (tab) and "B2" by the first cell of the range.
Put your regular expression inside the quotation marks. The one shown here allows only lowercase letters and underscores.
Additionally using the to_text() function didn't work for me, so you may want to avoid it.
Press Save