define ordinal data type in rapidminer - data-mining

I want to know how can I define ordinal data type for my attributes in rapid miner?
for example I have blood pressure attribute with three values (High,Normal,low) or cholesterol with high and normal values and some other attributes like this. should I set integer data type for these attributes?

Yes. You'd better replace text with integer, integer would be more convenient when you apply some algorithms.

Related

number expected.read Token[2015-02-02 14:19:00] weka project

i hope you all are doing well!
I have a project at data mining class.Τhe data consists of numerical data and many algorithms do not work.I have to do this:"you should compare the performance of the following categorization algorithms:
RandomForest, C4.5, JRip, Bayesian Network. Where necessary use them
Weka filters to replace or create values ​​for some properties
new properties. For comparison, adopt the Train / Test Percentage Split type with
percentage for training data equal to 80%.Describe your observations by giving tables with the results and
presenting the performance of the algorithms. Repeat the experiment by putting
percentage for training data equal to 70% and 50% presenting the results."
So my first try was to transform the data inside weka with preprocessing data numeric to nominal but a friend of mine suggest that is statistical wrong.So my second try was to use excel to transform all data even the date to numeric,remove the first row(id) and pass it to the weka(I leave double quotes only at date)
.But i have the error that i mention on the title.The dataset is:https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Thank you for the time.
If you define date-like data as a DATE attribute in the ARFF file (using the right format for parsing the strings), then WEKA will treat it as a numeric attribute internally (Java epoch, ie milli-seconds since 1970-01-01).
Instead of using NumericToNominal, use either the supervised or unsupervised Discretize filter if the algorithm cannot handle numeric attributes.
Converting nominal attributes to numeric ones is not a recommended approach. Instead, try the supervised or unsupervised NominalToBinary filter.

Storing values of arbitrary type

I want to store arbitrary key value pairs. For example,
{:foo "bar" ; string
:n 12 ; long
:p 1.2 ; float
}
In datomic, I'd like to store it as something like:
[{:kv/key "foo"
:kv/value "bar"}
{:kv/key "n"
:kv/value 12}
{:kv/key "p"
:kv/value 1.2}]
The problem is :kv/value can only have one type in datomic. A solution is to to split :kv/value into :kv/value-string, :kv/value-long, :kv/value-float, etc. It comes with its own issues like making sure only one value attribute is used at a time. Suggestions?
If you could give more details on your specific use-case it might be easier to figure out the best answer. At this point it is a bit of a mystery why you may want to have an attribute that can sometimes be a string, sometimes an int, etc.
From what you've said so far, your only real answer it to have different attributes like value-string etc. This is like in a SQL DB you have only 1 type per table column and would need different columns to store a string, integer, etc.
As your problem shows, any tool (such as a DB) is designed with certain assumptions. In this case the DB assumes that each "column" (attribute in Datomic) is always of the same type. The DB also assumes that you will (usually) want to have data in all columns/attrs for each record/entity.
In your problem you are contradicting both of these assumptions. While you can still use the DB to store information, you will have to write custom functions to ensure only 1 attribute (value-string, value-int, etc) is in use at one time. You probably want custom insertion functions like "insert-str-val", "insert-int-val", etc, as well as custom read functions "read-str-val" etc al. It might be also a good idea to have a validation function that could accept any record/entity and verify that exactly one-and-only-one "type" was in use at any given time.
You can emulate a key-value store with heterogenous values by making :kv/key a :db.unique/identity attribute, and by making :kv/value either bytes-typed or string-typed and encoding the values in the format you like (e.g fressian / nippy for :db.types/bytes, edn / json for :db.types/string). I advise that you set :db/index to false for :kv/value in this case.
Notes:
you will have limited query power, as the values will not be indexed and will need to be de-serialized for each query.
If you want to run transaction functions which read or write the values (e.g for data migrations), you should make your encoding / decoding library available to the Transactor as well.
If the values are large (say, over 20kb), don't store them in Datomic; use a complementary storage service like AWS S3 and store a URL.

Index-based access on Matrix-like structure in C++

I have a mapping Nx2 between two set of encodings (not relevant: Unicode and GB18030) under this format:
Warning: huge XML, don't open if having slow connection:
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
Snapshot:
<a u="00B7" b="A1 A4"/>
<a u="00B8" b="81 30 86 30"/>
<a u="00B9" b="81 30 86 31"/>
<a u="00BA" b="81 30 86 32"/>
I would like to save the b-values (right column) in a data structure and to access them directly (no searching) with indexes based on a-values (left column).
Example:
I can store those elements in a data structure like this:
unsigned short *my_page[256] = {my_00,my_01, ....., my_ff}
, where the elements are defined like:
static unsigned short my_00[256] etc.
.
So basically a matrix of matrix => 256x256 = 65536 available elements.
In the case of other encodings with less elements and different values (ex. Chinese Big5, Japanese Shift, Korean KSC etc), I can access the elements using a bijective function like this:
element = my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF];, where unicode[i] is filled with the a-like elements from the mapping (as mentioned above). How do I generate and fill the my_page structure is analogous. For the working encodings, I have like around 7000 characters to store (and they are stored in a unique place in my_page).
The problem comes with the GB18030 encoding, trying to store 30861 elements in my_page (65536 elements). I am trying to use the same bijective function for filling (and then accessing, analogously) the my_page structure, but it fails since the access mode does not return unique results.
For example: For the unicode values, there are more than 1 element accessed via
my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF] since the indexes can be the same for i and for i+1 for example. Do you know another way of accessing/filling the elements in the my_page structure based only on pre-computed indexes like I was trying to do?
I assume I have to use something like a pseudo-hash function that returns me a range of values VRange and based on a set of rules I can extract from the range VRange the integer indexes of my_page[256][256].
If you have any advice, please let me know :)
Thank you !
For GB18030, refer to this document: http://icu-project.org/docs/papers/gb18030.html
As explained in this article:
“The number of valid byte sequences -- of Unicode code points covered and of mappings defined between them -- makes it impractical to directly use a normal, purely mapping-table-based codepage converter. With about 1.1 million mappings, a simple mapping table would be several megabytes in size.”
So most probably is not good to implement a conversion based on a pure mapping table.
For large parts, there is a direct mapping between GB18030 and Unicode. Most of the four-bytes characters can be translated algorithmically. The author of the article suggests to handle them such ranges with a special code, and the other ones with a classic mapping table. These characters are the ones given in the XML mapping table: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
Therefore, the index-based access on Matrix-like structure in C++ can be a problem opened for whom wants to research on such bijective functions.

How can I obfuscate/de-obfuscate integer properties?

My users will in some cases be able to view a web version of a database table that stores data they've entered. For various reasons I need to include all the stored data, including a number of integer flags for each record that encapsulate adjacencies and so forth within the data (this is for speed and convenience at runtime). But rather than exposing them one-for-one in the webview, I'd like to have an obfuscated field that's just called "reserved" and contains a single unintelligible string representing those flags that I can easily encode and decode.
How can I do this efficiently in C++/Objective C?
Thanks!
Is it necessary that this field is exposed to the user visually, or just that it’s losslessly captured in the HTML content of the webview? If possible, can you include the flags as a hidden input element with each row, i.e., <input type=“hidden” …?
Why not convert each of the fields to hex, and append them as a string and save that value?
As long as you always append the strings in the same order, breaking them back apart and converting them back to numbers should be trivial.
Use symmetric encryption (example) to encode and decode the values. Of course, only you should know of the key.
Alternatively, Assymetric RSA is more powerfull encryption but is less efficient and is more complex to use.
Note: i am curios about the "various reasons" that require this design...
Multiply your flag integer by 7, add 3, and convert to base-36. To check if the resulting string is modified, convert back to base-2, and check if the result modulo 7 is still 3. If so, divide by 7 to get the flags. note that this is subject to replay attacks - users can copy any valid string in.
Just calculate a CRC-32 (or similar) and append it to your value. That will tell you, with a very high probability, if your value has been corrupted.

creating datatypes at runtime

I have a scenario where I am given data records at runtime. The datatype of the cells of the record are variable and only known at runtime. How wil I store these records?
For e.g.,
At runtime, I get record_Info = "char[]","int16","int32"
Then I get records = "abc" "2" "30", "def" "3" "40"
how can I store these when I cant initialize their types?
Assuming you want to store them in a file. Store the type information at the beginning of the file(say like a header).
There are only a predefined set of types. With the type info available you can have converter functions to convert the data into the respective types and store them as binary data in the file.
If you have some upper limit of variable data(char[]), then better store fixed data records in the file. It would be easier to access and modify.
If there is no upper limit on the variable data, then u need to store the variable data in TLV format.