I have a simple question about filtering attributes in WEKA.
Let's say I have 500 attributes, 30 classes, and 100 samples per class, which equals 3,000 rows and 500 columns. This causes time and memory problems, as you can guess.
How do I filter out attributes that occur only once or twice (or n times) in the 3,000 rows? And is it a good idea?
Thank you
Use the following filter:
weka.filters.unsupervised.attribute.RemoveUseless
This filter removes attributes that do not vary at all or that vary too much. All constant attributes are deleted automatically, along with any that exceed the maximum percentage of variance parameter.
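If you prefer to do this preprocessing outside WEKA, a rough Python/pandas analogue of what RemoveUseless does might look like the sketch below. The function name, the DataFrame, the data.csv file, and the 99% threshold are just placeholders, and unlike WEKA this version applies the distinct-value test to every column, not only nominal ones.

import pandas as pd

def remove_useless(df, max_variance_pct=99.0):
    # Rough analogue of weka.filters.unsupervised.attribute.RemoveUseless
    keep = []
    for col in df.columns:
        n_distinct = df[col].nunique()
        if n_distinct <= 1:
            continue  # constant attribute: drop it
        if n_distinct / len(df) * 100 > max_variance_pct:
            continue  # varies too much (e.g. an ID-like column): drop it
        keep.append(col)
    return df[keep]

filtered = remove_useless(pd.read_csv('data.csv'))  # data.csv is a placeholder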
I am trying to simplify a table that shows the amount of time that people are working on certain jobs, and I want to present the dataset in a table that only shows the values greater than zero.
The image below shows how the table currently looks, where each person has a % of their time allocated to 1 of 5 jobs across columns.
I am trying to create a table that looks like the one below, where it only shows the jobs that each person is working on and excludes the ones where they have no % of their time allocated.
I have been trying to use an INDEX MATCH formula with some IF logic for values greater than zero, but have only been able to get the first value greater than zero to populate.
Wondering if I am going about this in the wrong fashion; any help greatly appreciated!
Thanks
When dynamically filling CSS grid columns, I recently noticed that after column 1000 the remainder seems to be filled in the row direction. See the example below. This leads me to the question:
Is there a maximum number of rows and/or columns when using CSS grid?
Suggestions on how to get the remainder (from 1001 on) into the next columns are welcome, but they are not the core of this question.
The CSS grid seems to have a different column/row limit depending on the browser.
For Chrome, the limit seems to be 1,000 x 1,000, as explained here in the answer by Bludev.
For Firefox, the limit seems to be 10,000 x 10,000, but I can't remember exactly where I read this.
I used sklearn.preprocessing.KBinsDiscretizer(n_bins=10, encode='ordinal') to discretize my continuous feature.
The strategy is 'quantile' by default, but my data distribution is actually not uniform; about 70% of the rows are 0.
Then I got KBinsDiscretizer.bin_edges_ = [0., 0., 0., 0., 0., 0., 0., 256., 602., 1306., 18464.].
There are many duplicate bin edges. So, is there a way to drop the duplicates from KBinsDiscretizer's bins?
KBinsDiscretizer calculates the quantiles of the input. If most of the input samples are zero, the 10 quantiles will contain multiple zeros. The result I expect is a discretizer with unique bin edges; for the example I mentioned, that would be [0., 256., 602., 1306., 18464.].
That will not be possible. Set strategy='uniform' to achieve your goal.
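A minimal sketch of that suggestion (the skewed toy data below is made up to mimic the 70%-zeros case): with strategy='uniform' the edges are spaced evenly over the value range, so repeated quantiles can no longer collapse into duplicate edges.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# toy data: ~70% zeros, the rest spread over a wide range
X = np.concatenate([np.zeros(700), np.random.exponential(2000, 300)]).reshape(-1, 1)

disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
disc.fit(X)
print(disc.bin_edges_[0])  # 11 evenly spaced, unique edges from min(X) to max(X)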
I have been querying Geonames for parks per state. Mostly there are under 1,000 parks per state, but I just queried Connecticut, and there are just under 1,200 parks there.
I already got rows 1-1000 with this query:
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=1000
But increasing maxRows to 1200 gives an error that I am querying for too many at once. Is there a way to query for rows 1000-1200?
I don't really see how to do it with their API.
Thanks!
You should be using the startRow parameter in the query to page through the results. The documentation notes that it takes an integer value (0-based indexing):
Used for paging results. If you want to get results 30 to 40, use startRow=30 and maxRows=10. Default is 0.
So to get the next 1000 data points (1000-1999), you should change your query to
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=1000&startRow=1000
I'd suggest reducing the maxRows to something manageable as well - something that will put less of a load on their servers and make for quicker responses to your queries.
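A short sketch of that paging loop in Python, assuming the JSON variant of the endpoint (searchJSON) and maxRows=200 as the "something manageable"; demo still stands in for a real username:

import requests

params = {
    "featureCode": "PRK", "country": "US", "adminCode1": "CT",
    "style": "full", "maxRows": 200, "username": "demo",  # replace 'demo' with your account
}

parks, start = [], 0
while True:
    page = requests.get("http://api.geonames.org/searchJSON",
                        params={**params, "startRow": start}).json()
    batch = page.get("geonames", [])
    parks.extend(batch)
    if len(batch) < params["maxRows"]:
        break  # last (partial) page reached
    start += params["maxRows"]

print(len(parks))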
I am testing the speed of inserting multiple rows with a single INSERT statement.
For example:
INSERT INTO [MyTable] VALUES (5, 'dog'), (6, 'cat'), (3, 'fish')
This is very fast until I pass 50 rows on a single statement, then the speed drops significantly.
Inserting 10,000 rows in batches of 50 takes 0.9 seconds.
Inserting 10,000 rows in batches of 51 takes 5.7 seconds.
My question has two parts:
Why is there such a hard performance drop at 50?
Can I rely on this behavior and code my application to never send batches larger than 50?
My tests were done in C++ with ADO.
Edit:
It appears the drop-off point is not 50 rows, but 1,000 total values: I get similar results with 50 rows of 20 columns or 100 rows of 10 columns.
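For reference, a rough Python/pyodbc sketch of how the same 10,000-row timing could be reproduced (the connection string and the MyTable columns are placeholders; the original tests were in C++ with ADO):

import time
import pyodbc

# placeholder connection string; adjust driver/server/database for your setup
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=localhost;DATABASE=test;Trusted_Connection=yes")
cur = conn.cursor()

def timed_insert(total_rows, batch_size):
    cur.execute("TRUNCATE TABLE MyTable")  # start each run from an empty table
    conn.commit()
    start = time.perf_counter()
    for offset in range(0, total_rows, batch_size):
        rows = range(offset, min(offset + batch_size, total_rows))
        values = ", ".join("({0}, 'name{0}')".format(i) for i in rows)
        cur.execute("INSERT INTO MyTable VALUES " + values)
    conn.commit()
    return time.perf_counter() - start

for batch_size in (50, 51):
    print(batch_size, timed_insert(10000, batch_size))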
It could also be related to the size of the row. The table you use as an example seems to have only 2 columns. What if it has 25 columns? Is the performance drop-off also at 50 rows?
Did you also compare with the "union all" approach shown here? http://blog.sqlauthority.com/2007/06/08/sql-server-insert-multiple-records-using-one-insert-statement-use-of-union-all/
I suspect there's an internal cache/index that is used for up to 50 rows (it's a nice round decimal number). After 50 rows it falls back on a less efficient general-case insertion algorithm that can handle arbitrary numbers of inputs without using excessive memory.
The slowdown is probably caused by parsing the string values: VALUES (5, 'dog'), (6, 'cat'), (3, 'fish'), and not by an INSERT issue.
Try something like this, which will insert one row for each row returned by the query:
INSERT INTO YourTable1 (col1, col2)
SELECT Value1, Value2
FROM YourTable2
WHERE ...  -- rows will be more than 50
and see what happens
If you are using SQL 2008, then you can use table-valued parameters and just do a single INSERT statement.
Personally, I've never seen the slowdown at 50 inserted records, even with regular batches. Regardless, we moved to table-valued parameters, which gave us a significant speed increase.
Random thoughts:
Is it completely consistent when run repeatedly?
Are you checking for duplicates in the first 10k rows for the second 10k insert?
Did you try a batch size of 51 first?
Did you empty the table between tests?
For high-volume and high-frequency inserts, consider using Bulk Inserts to load your data. It's not the simplest thing in the world to implement, and it brings with it a new set of challenges, but it can be much faster than doing an INSERT.