Subsets for each class using weka for microarray data - weka

How can I do this in explorer of weka:
• for each class, generate subsets of the top 2, 4, 6, 8, 10, 12, 15, 20, 25, and 30 genes with the highest T-value
My data is in this format (the rows are attributes and the columns are instances):
                     instances/classes
                   +-----+-----+-----+
                   |     |     |     |
genes/attributes   +-----+-----+-----+
                   |     |     |     |
                   +-----+-----+-----+

Weka is not well suited to this kind of subset generation; it can be done, but not easily. I suggest loading your data into another program, generating the subsets there, and then using Weka for classification and clustering, which are its main strengths.
You can use R/Matlab/Python or an SQL database for this subset generation.
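As a rough illustration of the Python route, here is a minimal sketch using only the standard library. It assumes that "T-value" means a signed one-vs-rest Welch t-statistic (the class's samples against all other samples); the question does not say which variant is wanted. The names `welch_t`, `top_gene_subsets`, and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

SUBSET_SIZES = [2, 4, 6, 8, 10, 12, 15, 20, 25, 30]


def welch_t(a, b):
    """Welch's t-statistic for two samples (each needs at least 2 values)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0
    return (mean(a) - mean(b)) / se


def top_gene_subsets(expr, labels, sizes=SUBSET_SIZES):
    """expr: {gene: [value per instance]}, labels: class label per instance.

    Returns {class: {k: [top-k gene names]}}, ranking genes by their
    one-vs-rest t-value for that class, highest first (an assumption).
    """
    result = {}
    for cls in sorted(set(labels)):
        scores = {}
        for gene, values in expr.items():
            inside = [v for v, lab in zip(values, labels) if lab == cls]
            outside = [v for v, lab in zip(values, labels) if lab != cls]
            scores[gene] = welch_t(inside, outside)
        ranked = sorted(scores, key=scores.get, reverse=True)
        # Skip subset sizes larger than the number of available genes.
        result[cls] = {k: ranked[:k] for k in sizes if k <= len(ranked)}
    return result


# Toy data: rows are genes, columns are instances (made-up values).
expr = {
    "g1": [9.0, 8.5, 9.2, 1.0, 1.2, 0.9],
    "g2": [1.0, 1.1, 0.9, 7.0, 7.5, 6.8],
    "g3": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
}
labels = ["A", "A", "A", "B", "B", "B"]
subsets = top_gene_subsets(expr, labels, sizes=[1, 2])
print(subsets)
```

Each per-class gene list could then be written out as its own ARFF or CSV file and loaded into Weka for classification.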

Plot Min, Max, Average, Median values into a horizontal enumerated line in Power BI Desktop

I am trying to do a trivial task with Power BI Desktop. I have the following kind of data
| Name | Min | Max | Average | Median |
|-------- |----- |------- |--------- |-------- |
| team A | 0 | 3,817 | 120 | 120 |
| team B | -10 | 1,050 | 25 | 89 |
| team C | 5 | 14,320 | 50 | 48 |
I want to create my own horizontal line with predefined (Start, End) points and plot the Min, Max, Average, and Median values on it for each team name. Filtering by team name should adjust the numbers and the visual accordingly.
So far I have done the following static approach
The example above is entirely static because I set every point on the line myself. Also, if I select, for example, Team B, whose median is higher than its average, the spheres on the line do not change their relative positions (in the image I posted, the average is always placed higher than the median, which is not true for all teams).
Thus, I would like to know if there is any fancy and well-plotted way to represent those 4 descriptive measures for a team name in a horizontal line that will respond when I use a different team. As I have noted on the image attached, the card visuals change when I change the team name. But the spheres do not move across the line.
My desired output
For Team B
While for Team C
I literally don't know if this is feasible in Power BI apart from the static approach I already did. Thank you in advance.
Regards.

Power BI - Filtered count not grouping by values in table

I have two tables (subject & category) that are both related to the same parent table (main). Because of the foreign key constraints, it looks like Power BI automatically created the links.
Simple mock-up of table links
I need to count the subjects by type for each possible distance range. I tried a simple calculation shown below for each distance category.
less than 2m =
CALCULATE(
    COUNTA('Category'[Descr]),
    'Subject'[Distance] IN { "less than 2m" }
)
However, the filter doesn't seem to apply properly.
I want...
+-------+--------------+--------------+
| Descr | less than 2m | more than 2m |
+-------+--------------+--------------+
| Car   | 2            | 1            |
| Sign  | 4            | 2            |
+-------+--------------+--------------+
but I'm getting...
+-------+--------------+--------------+
| Descr | less than 2m | more than 2m |
+-------+--------------+--------------+
| Car   | 3            | 3            |
| Sign  | 6            | 6            |
+-------+--------------+--------------+
It's just giving me the total count by type which is correct but isn't applying the filter by distance so I can break it down.
I'm sure this is probably really simple but I'm pretty new with DAX and I can't figure this one out.
I wish I could mark Kosuke's comment as an answer. The issue was indeed that cross-filtering had to be enabled. This can be done either by clicking the relationship link in the model and changing its cross filter direction, or by using a function inside the measure (DAX's CROSSFILTER, for example) to enable the cross filter temporarily.

How to visualize multiple lines from two measures

I have a challenge in Power BI Desktop: modelling and displaying a line chart that shows multiple lines in the same visualization, where each x,y pair consists of two measures. The x axis holds the measure Average weight and the y axis holds Price per kilo. A Normal line displays the optimal curve, while a number of projects display other curves in the same chart (as legends). Below you see the coordinates for the normal curve; the project curves can have other x,y values. This is easy in Excel but not that easy in Power BI.
To draw lines, it seems that every x coordinate in the line chart must fall on the same intervals; otherwise I only get points, not separate lines. Maybe the line chart component is not suitable for this. I think a scatter chart is more suitable, but I don't think it can show lines between the points.
I hope some of you have solved this, or maybe have a pbix file to share that shows how it was solved.
Regards Geir
Sample data:
| Avg weight | Price pr kg |
|------------|-------------|
| 100 | 129.39 |
| 500 | 63.65 |
| 1000 | 40.13 |
| 1500 | 33.41 |
| 2000 | 30.05 |
| 2500 | 27.53 |
| 3000 | 25.43 |
| 3500 | 23.582 |
| 4000 | 22.91 |
| 4500 | 22.322 |
| 5000 | 21.902 |
| 5500 | 21.734 |
| 6000 | 21.65 |
Plot example:
This seems quite straightforward, although perhaps your actual data is more complex?
With Avg Weight and 2 data series in one Table, I can use Avg Weight as the X Axis and the 2 data series as Values to achieve something similar to your requirement:

Making SQLite run SELECT faster

Situation: I have about 40 million rows, 3 columns of unorganised data in a table in my SQLite DB (~300MB). An example of my data is as follows:
| filehash | filename | filesize |
|------------|------------|------------|
| hash111 | fileA | 100 |
| hash222 | fileB | 250 |
| hash333 | fileC | 380 |
| hash111 | fileD | 250 | #Hash collision with fileA
| hash444 | fileE | 520 |
| ... | ... | ... |
Problem: A single SELECT statement can take between 3 and 5 seconds. The application I am running needs to be fast, so that is too long for a single query.
# Calculate the hash (hasher is a helper defined elsewhere in the app)
md5hash = hasher(filename)
# Fetch all 3 columns so that the DB does not need to be queried a second time
cursor.execute('SELECT * FROM hashtable WHERE filehash = ?', (md5hash,))
returned = cursor.fetchall()
Question: How can I make the SELECT statement run faster (I know this sounds crazy but I am hoping for speeds of below 0.5s)?
Additional information 1: I am running it as a Python 2.7 program on an RPi 3B (1GB RAM, default 100MB SWAP). I am asking mainly because I am afraid that it will crash the RPi for lack of RAM.
For reference, when reading from the DB normally with my app running, we are looking at max 55MB of RAM free, with a few hundred MB of cached data - I am unsure if this is the SQLite caches (SWAP has not been touched).
Additional information 2: I am open to using other databases to store the table (I was looking at either PyTables or ZODB as a replacement - let's just say that I got a little desperate).
Additional information 3: There are NO unique keys, because the SELECT statement looks for matches in a column that holds only hash values, and these apparently have collisions.
Currently, the database has to scan the entire table to find all matches. To speed up searches, use an index:
CREATE INDEX my_little_hash_index ON hashtable(filehash);
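To see the effect of the index without touching the real database, here is a small sketch using Python's built-in sqlite3 module on an in-memory copy of the sample rows. The helper name `query_plan` is hypothetical, and the exact wording of the plan text differs between SQLite versions, so the sketch only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE hashtable (filehash TEXT, filename TEXT, filesize INTEGER)"
)
cur.executemany(
    "INSERT INTO hashtable VALUES (?, ?, ?)",
    [
        ("hash111", "fileA", 100),
        ("hash222", "fileB", 250),
        ("hash333", "fileC", 380),
        ("hash111", "fileD", 250),
        ("hash444", "fileE", 520),
    ],
)


def query_plan(cursor, filehash):
    """Return SQLite's description of how it would run the lookup."""
    cursor.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM hashtable WHERE filehash = ?",
        (filehash,),
    )
    # The last column of each row holds the human-readable plan detail.
    return " ".join(str(row[-1]) for row in cursor.fetchall())


plan_before = query_plan(cur, "hash111")  # full table scan expected
cur.execute("CREATE INDEX my_little_hash_index ON hashtable(filehash)")
conn.commit()
plan_after = query_plan(cur, "hash111")   # should now mention the index

print(plan_before)
print(plan_after)

# The index is not unique, so duplicate hashes still return every match.
cur.execute("SELECT * FROM hashtable WHERE filehash = ?", ("hash111",))
matches = cur.fetchall()
print(matches)
```

On the real 40-million-row table the CREATE INDEX statement runs once, will take a while, and makes the database file larger; after that, lookups by filehash no longer need to read every row.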

How to store data with large number (constant) of properties in SQL

I am parsing the USDA's food database and storing it in SQLite for query purposes. Each food has associated with it the quantities of the same 162 nutrients. It appears that the list of nutrients (name and units) has not changed in quite a while, and since this is a hobby project I don't expect to follow any sudden changes anyway. But each food does have a unique quantity associated with each nutrient.
So, how does one go about storing this kind of information sanely? My priorities are friendliness to multiple programming languages (Python and C++ having preference), sanity for me as coder, and ease of retrieving nutrient sets to sum or plot over time.
The two things that I had thought of so far were 162 columns (which I'm not particularly fond of, but it does make the queries simpler), or a food table that has a link to a nutrient_list table that then links to a static table with the nutrient name and units. The second seems more flexible in case my expectations are wrong, but I wouldn't even know where to begin on writing the queries for sums and time series.
Thanks
You should read up a bit on database normalization. Most of the normalization stuff is quite intuitive, but really going through the definitions of the steps and seeing an example helps in understanding the concepts and will help you greatly if you want to design a database in the future.
As for this problem, I would suggest you use 3 tables: one for the foods (let's call it foods), one for the nutrients (nutrients), and one for the specific nutrients of each food (foods_nutrients).
The foods table should have a unique index for referencing and the food's name. If the food has other data associated to it (maybe a link to a picture or a description), this data should also go here. Each separate food will get a row in this table.
The nutrients table should also have a unique index for referencing and the nutrient's name. Each of your 162 nutrients will get a row in this table.
Then you have the crossover table containing the nutrient values for each food. This table has three columns: food_id, nutrient_id and value. Each food gets 162 rows inside this table, one for each nutrient.
This way, you can add or delete nutrients and foods as you like and query everything independent of programming language (well, using SQL, but you'll have to use that anyway :) ).
Let's try an example. We have 2 foods in the foods table and 3 nutrients in the nutrients table:
+------------------+
| foods |
+---------+--------+
| food_id | name |
+---------+--------+
| 1 | Banana |
| 2 | Apple |
+---------+--------+
+-------------------------+
| nutrients |
+-------------+-----------+
| nutrient_id | name |
+-------------+-----------+
| 1 | Potassium |
| 2 | Vitamin C |
| 3 | Sugar |
+-------------+-----------+
+-------------------------------+
| foods_nutrients |
+---------+-------------+-------+
| food_id | nutrient_id | value |
+---------+-------------+-------+
| 1 | 1 | 1000 |
| 1 | 2 | 12 |
| 1 | 3 | 1 |
| 2 | 1 | 3 |
| 2 | 2 | 7 |
| 2 | 3 | 98 |
+---------+-------------+-------+
Now, to get the potassium content of a banana, you'd query:
SELECT foods_nutrients.value
FROM foods_nutrients, foods, nutrients
WHERE foods_nutrients.food_id = foods.food_id
  AND foods_nutrients.nutrient_id = nutrients.nutrient_id
  AND foods.name = 'Banana'
  AND nutrients.name = 'Potassium';
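Since the question also asked how sums would be written against this layout, here is a minimal sketch using Python's built-in sqlite3 module with the example tables above. The column types, the key constraints, and the use of explicit JOINs are assumptions made for illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foods (food_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE nutrients (nutrient_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE foods_nutrients (
    food_id     INTEGER REFERENCES foods(food_id),
    nutrient_id INTEGER REFERENCES nutrients(nutrient_id),
    value       REAL,
    PRIMARY KEY (food_id, nutrient_id)
);
INSERT INTO foods VALUES (1, 'Banana'), (2, 'Apple');
INSERT INTO nutrients VALUES (1, 'Potassium'), (2, 'Vitamin C'), (3, 'Sugar');
INSERT INTO foods_nutrients VALUES
    (1, 1, 1000), (1, 2, 12), (1, 3, 1),
    (2, 1, 3),    (2, 2, 7),  (2, 3, 98);
""")

# One nutrient of one food.
banana_potassium = conn.execute(
    """
    SELECT fn.value
    FROM foods_nutrients AS fn
    JOIN foods AS f ON f.food_id = fn.food_id
    JOIN nutrients AS n ON n.nutrient_id = fn.nutrient_id
    WHERE f.name = ? AND n.name = ?
    """,
    ("Banana", "Potassium"),
).fetchone()[0]

# Every nutrient summed over a set of foods (here: one banana plus one apple).
totals = dict(
    conn.execute(
        """
        SELECT n.name, SUM(fn.value)
        FROM foods_nutrients AS fn
        JOIN foods AS f ON f.food_id = fn.food_id
        JOIN nutrients AS n ON n.nutrient_id = fn.nutrient_id
        WHERE f.name IN ('Banana', 'Apple')
        GROUP BY n.name
        """
    )
)

print(banana_potassium)
print(totals)
```

The same GROUP BY pattern scales to all 162 nutrients without changing the query, which is the main practical advantage over a 162-column table.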
Use the second (more normalized) approach.
You could even get away with fewer tables than you mentioned:
tblNutrients
-- NutrientID
-- NutrientName
-- NutrientUOM (unit of measure)
-- Otherstuff
tblFood
-- FoodId
-- FoodName
-- Otherstuff
tblFoodNutrients
-- FoodID (FK)
-- NutrientID (FK)
-- UOMCount
It will be a nightmare to maintain a table with 160+ fields.
If there is a time element involved too (can measurements change?) then you could add a date field to the nutrient and/or the foodnutrient table depending on what could change.
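A minimal sketch of the date-field idea, again with Python's sqlite3 module. The column name `MeasuredOn`, the ISO 8601 text format, the composite key, and all values are hypothetical choices for illustration; the helpers `history` and `latest` are likewise made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblFoodNutrients (
    FoodID     INTEGER,
    NutrientID INTEGER,
    MeasuredOn TEXT,   -- hypothetical date column, ISO 8601 (YYYY-MM-DD)
    UOMCount   REAL,
    PRIMARY KEY (FoodID, NutrientID, MeasuredOn)
);
-- Made-up measurements for illustration only.
INSERT INTO tblFoodNutrients VALUES
    (1, 1, '2010-01-01', 990),
    (1, 1, '2012-01-01', 1000),
    (1, 2, '2010-01-01', 12);
""")


def history(conn, food_id, nutrient_id):
    """All dated measurements for one food/nutrient pair, oldest first."""
    return conn.execute(
        "SELECT MeasuredOn, UOMCount FROM tblFoodNutrients "
        "WHERE FoodID = ? AND NutrientID = ? ORDER BY MeasuredOn",
        (food_id, nutrient_id),
    ).fetchall()


def latest(conn, food_id, nutrient_id):
    """Most recent measurement, or None if the pair was never measured."""
    row = conn.execute(
        "SELECT UOMCount FROM tblFoodNutrients "
        "WHERE FoodID = ? AND NutrientID = ? "
        "ORDER BY MeasuredOn DESC LIMIT 1",
        (food_id, nutrient_id),
    ).fetchone()
    return row[0] if row else None


print(history(conn, 1, 1))
print(latest(conn, 1, 1))
```

Storing dates as ISO 8601 text keeps them sortable with a plain ORDER BY, which is why the sketch uses that format.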