How to represent a numerical dataset as a tree search for interval pattern mining - data-mining

I am a PhD student in data mining and I want to use constraint programming to solve pattern mining tasks.
Knowing that constraint programming is based on a tree search, I would like to know if there is a common way to represent the data of a numerical dataset in a CP tree search.
I have only found discrete data represented in a CP tree search.
Example of discrete data representation in a CP tree search:
Discrete dataset:
Corresponding tree representation:
So, supposing we have the following numerical dataset:
How can I represent the numerical values (the data) in a CP tree search without discretizing my dataset?
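One approach from the interval pattern mining literature (the MinIntChange-style search used with interval patterns in formal concept analysis) keeps each numerical attribute as an interval whose candidate bounds are exactly the values observed in the data; the tree search then branches by minimally tightening one bound, so no prior discretization is needed. A rough Python sketch for a single attribute (illustrative only, all names are mine):

# A MinIntChange-style sketch: every search node is an interval [lo, hi] whose
# bounds are values actually observed in the data, so no a-priori
# discretization is needed.
def interval_pattern_search(values, min_support=2):
    """Enumerate interval patterns [lo, hi] over the observed values."""
    domain = sorted(set(values))              # candidate bounds = observed values only

    def support(lo, hi):
        return sum(lo <= v <= hi for v in values)

    results = []

    def search(lo_idx, hi_idx, may_raise_lo):
        lo, hi = domain[lo_idx], domain[hi_idx]
        sup = support(lo, hi)
        if sup < min_support:                 # shrinking only loses rows -> prune subtree
            return
        results.append(((lo, hi), sup))
        # left branch: minimal change of the lower bound (raise it to the next observed value)
        if may_raise_lo and lo_idx + 1 <= hi_idx:
            search(lo_idx + 1, hi_idx, True)
        # right branch: minimal change of the upper bound; the lower bound is then
        # frozen so every interval is generated exactly once
        if hi_idx - 1 >= lo_idx:
            search(lo_idx, hi_idx - 1, False)

    search(0, len(domain) - 1, True)
    return results

print(interval_pattern_search([1, 2, 2, 5, 7], min_support=2))

Generalizing to several attributes, a pattern becomes one such interval per attribute, and each branching step picks one attribute and one bound to tighten.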

Related

Applying jack-knife weights to categorical variables in SAS

I am using SAS PROC SURVEYFREQ with jackknife replicate weights to describe frequencies across variables in a survey that used address-based sampling. Some of the variables are coded by individual selection. For example, a survey question asks respondents to pick their three top choices, so in the actual dataset each individual choice is a variable with a Yes/No (0/1) response. Which SAS procedure that incorporates jackknife weights should I use in this case to describe the frequencies for the entire question across the three top choices?

How to Find similarity in structured data using rapidminer?

I want to find similarity using the cosine similarity operator on a structured dataset, but I am not getting the desired result. Can someone guide me on how to compute the similarity in RapidMiner?
Sample dataset:
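The sample dataset did not survive here, but as a reminder of what cosine similarity computes (independently of RapidMiner's operator, which applies the same formula pairwise over an ExampleSet), here is a plain Python sketch with made-up rows:

import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 for parallel rows, 0.0 for orthogonal ones
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

row1 = [1.0, 0.0, 2.0, 3.0]      # illustrative numeric attribute values
row2 = [0.5, 0.0, 2.0, 2.5]
print(cosine_similarity(row1, row2))   # ~0.99: the rows are nearly parallel

Cosine similarity is defined on numerical vectors, so nominal columns in the structured data would need to be encoded or excluded first.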

SAS insert column for dynamically determined levels

I am attempting to set up SAS to do something I can easily do in Excel, but have not found an effective way to do. Given the first two tables shown here (dubbed TREE and LEVEL, respectively), I am trying to end up with the third table (FINAL_TREE).
Adding the Level column to TREE so that it becomes FINAL_TREE works as follows: any given tree must have an Apple count greater than or equal to Apple_Req for a given Level, as well as an Orange count greater than or equal to Orange_Req. So a Tree is assigned the highest Level for which it meets all of the requirements.
So in the example tables, Tree3 is given Level1, despite the fact that it would easily be Level3 if not for its low Orange count.
In Excel, this can be done using INDEX and taking the MIN of two MATCH functions, but I don't think that translates directly into SAS. I imagine there is a way to set this up using explicitly defined nested IF statements, but I am hoping there is a solution which can handle a LEVEL table with any number of levels (so long as the requirements are set up correctly).
In fact, this is quite a bit easier in SAS, in part because there are a lot of different ways to do it.
The most straightforward is probably SQL, if you're familiar with it. The most similar to what you're doing in Excel, though, is PROC FORMAT, and it is perhaps the fastest as well.
proc format;
  /* Thresholds come from the LEVEL table; contiguous ranges map each count to
     the highest level whose requirement it meets ('0' = below Level1). */
  value appleF
    1-<5    = '1'
    5-<15   = '2'
    15-high = '3'
    other   = '0';
  value orangeF
    5-<16   = '1'
    16-<30  = '2'
    30-high = '3'
    other   = '0';
run;
Now you can convert the values using PUT and then take the MIN, just like you would in Excel. Basically this replaces your INDEX/MATCH.
data want;
  set have;
  /* PUT applies the level formats; INPUT turns the '0'-'3' codes back into
     numbers so MIN picks the lower (binding) of the two levels. */
  level = min(input(put(apple, applef1.), 1.), input(put(orange, orangef1.), 1.));
run;
You can also produce a format from a dataset directly; see this paper, for example, on using the CNTLIN option of PROC FORMAT.

Converting mixed data set to numerical data set

In my project I have to work with a mixed dataset (i.e., it has both categorical and numerical data). Is there any algorithm or method for converting the categorical values to numerical values, so that my dataset ends up containing only numerical values? Can anyone please help me out?
(I'm doing my project in MATLAB.)
Use one-hot encoding.
But don't expect the results to be very good. There is a lot of meaning lost this way.
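The question mentions MATLAB, but the idea is language-agnostic; a minimal Python/pandas sketch with made-up column names:

import pandas as pd

# Toy mixed dataset: one categorical and one numerical column (made-up values).
df = pd.DataFrame({
    "color":  ["red", "blue", "red", "green"],
    "weight": [3.2, 1.5, 2.8, 4.0],
})

# One-hot encode the categorical column: each distinct category becomes its own
# 0/1 indicator column, while numerical columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())   # ['weight', 'color_blue', 'color_green', 'color_red']

In MATLAB, something like dummyvar on a grouping variable gives the equivalent indicator matrix.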

Amazon Redshift Equality filter performance and sortkeys

Does Redshift efficiently (i.e., with something like a binary search) find the blocks of a table that is sorted on a column A for a query with an equality condition A = <val>?
As an example, let there be a table T with ~500M rows and ~50 fields, distributed and sorted on field A. Field A has high cardinality, so there are ~4.5M distinct A values, each with exactly the same number of rows in T: ~100 rows per value.
Assume a Redshift cluster with a single XL node.
Field A is not compressed. All other fields have some form of compression, as suggested by ANALYZE COMPRESSION. A compression ratio of 1:20 was reported relative to the uncompressed table.
Given a trivial query:
select avg(B),avg(C) from
(select B,C from T where A = <val>)
After VACUUM and ANALYZE the following explain plan is given:
XN Aggregate (cost=1.73..1.73 rows=1 width=8)
-> XN Seq Scan on T (cost=0.00..1.23 rows=99 width=8)
Filter: (A = <val>::numeric)
This query takes 39 seconds to complete.
The main question is: is this the expected behavior of Redshift?
According to the documentation at Choosing the best sortkey:
"If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range."
In Choosing sort keys:
"Another optimization that depends on sorted data is the efficient handling of range-restricted predicates. Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max values to rapidly skip over large numbers of blocks during table scans. For example, if a table stores five years of data sorted by date and a query specifies a date range of one month, up to 98% of the disk blocks can be eliminated from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned. For more information about these optimizations, see Choosing distribution keys."
Secondary questions:
What is the complexity of the aforementioned skipping scan on a sort key? Is it linear (O(n)) or some variant of binary search (O(log n))?
If a key is sorted - is skipping the only optimization available?
What would this "skipping" optimization look like in the explain plan?
Is the above explain the best one possible for this query?
What is the fastest result Redshift can be expected to provide in this scenario?
Does vanilla ParAccel have different behavior in this use case?
This question is answered on the Amazon forum: https://forums.aws.amazon.com/thread.jspa?threadID=137610