Data Mining, PyFim.eclat

Why does the PyFIM eclat module in Python return fewer frequent itemsets than expected? My database contains frequent itemsets of length 8, but fim.eclat(db) only outputs itemsets up to length 4.

Probably because there are no frequent patterns of length 5?
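One way to test that guess is to count frequent itemsets per length by brute force on a small sample of the data. The sketch below is plain Python, not PyFIM; the function name `frequent_by_length`, the toy database and the absolute support threshold of 3 transactions are all assumptions made for illustration. It relies on the fact that every subset of a frequent itemset is itself frequent, so a frequent 8-itemset would force frequent itemsets of every length from 1 to 8 to exist.

```python
from itertools import combinations

def frequent_by_length(db, min_support):
    """Map itemset length -> number of itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in db]
    items = sorted(set().union(*transactions))
    counts = {}
    for k in range(1, len(items) + 1):
        n = sum(
            1
            for cand in combinations(items, k)
            if sum(1 for t in transactions if t.issuperset(cand)) >= min_support
        )
        if n == 0:
            break  # no frequent k-itemset means no longer one can be frequent
        counts[k] = n
    return counts

# Hypothetical toy database: the longest itemset with support >= 3 is {a, b, c, d}.
db = [list("abcde"), list("abcd"), list("abcd"), list("ab")]
counts = frequent_by_length(db, min_support=3)
print(counts)  # {1: 4, 2: 6, 3: 4, 4: 1}
```

If a check like this stops at length 4 for the real data, the longer itemsets simply do not reach the support threshold. Brute force is exponential in the number of items, so it is only suitable for a small sample.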

Related

File size of 1 million rand numbers

The file by RAND contains 1 million random numbers. It compresses down to 415 KB. How is this possible if random data cannot be compressed?
Thank you.
Jon Hutton

You're most likely talking about the famous "A Million Random Digits" test data that was published in 1955. So it's digits, not numbers, as Mark already guessed, which is why the binary version is only 415,241 bytes. Also see Mark Nelson's homepage, which has a link to the binary file.
Note that the end result (the binary file) is not compressible without knowing it - although there are some small redundancies in the file that come from the way it was created - see this forum entry for more details:
There are potentially other biases in the million random digits file
that I discussed years ago in comp.compression. The data was
originally generated by sampling a 5 bit counter driven by a noisy
oscillator to produce a set of 20,000 punched cards with 50 digits
each. But there was some correlation between consecutive digits, so
what they did was add adjacent pairs of cards modulo 10 to produce a
new set of cards which was published. That is why the sums of the
columns are even. Each of the original cards is counted twice.

Sounds like they're stored as one decimal digit per byte. So using only ten of the 256 possible byte values leaves you with the potential for a log(256)/log(10) compression ratio on random digits, which is about 2.4. You're getting 2.35 (assuming "kb" = 1024 bytes). Voila.
You can get 2.4 quite easily by coding every three digits into ten bits, since 1024 > 1000. Then you can code 1,000,000 decimal digits into 416,667 bytes, or 406.9 KiB.
With a little more difficulty, using something like GMP, you could code it as a giant million-digit integer in binary, which would take 415,242 bytes, or 405.5 KiB. That would be as good as it gets for random decimal digits.
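The figures above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation, not a packer; the variable names are made up for illustration, and it assumes the leftover single digit in the three-digits-per-ten-bits scheme is stored in 4 bits.

```python
import math

DIGITS = 1_000_000

# Best possible ratio when only 10 of the 256 byte values are used.
ratio = math.log(256) / math.log(10)

# Three digits per 10 bits (1024 > 1000); a leftover single digit needs 4 bits,
# a leftover pair would need 7 bits.
full_groups, leftover = divmod(DIGITS, 3)
packed_bits = full_groups * 10 + {0: 0, 1: 4, 2: 7}[leftover]
packed_bytes = math.ceil(packed_bits / 8)

# One giant integer: the digits carry DIGITS * log2(10) bits of information.
bigint_bits = math.ceil(DIGITS * math.log2(10))
bigint_bytes = math.ceil(bigint_bits / 8)

print(round(ratio, 3))                            # 2.408
print(packed_bytes, round(packed_bytes / 1024, 1))  # 416667 406.9
print(bigint_bytes, round(bigint_bytes / 1024, 1))  # 415242 405.5
```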

Generate a list of all possible combinations of a regular expression based string

This is not a straightforward "all possible combinations" question.
EDIT: The regex is just a fixed length string with different combinations of alpha and non alphanumeric for each index...
Given a regular expression of fixed length, what would be the fastest way of computing all combinations and storing them in a database, with the time to save to the database included? In other words, what is the fastest route from just the regular expression to a database containing every combination?
What I did, successfully but ridiculously slowly, was create an array as long as the fixed-length regular expression, where each element contained every possible character at that position. I generated this with a script. Then I just did a loopception on the array, with an SQL Server connection open from start to finish, inserting 10 possibilities at a time. It was extremely slow: we're talking about a string of 7 or 8 characters with a maximum of 36 possibilities in any given position. It took a few days.
So, my question is given this problem what would be the best combination of technologies, languages and algorithm to accomplish this the quickest?

Number of possible strings with length 8 and composed of 36 possible characters:
36^8 = 2,821,109,907,456 = 2.8 trillion
Generating that many strings in any way will take "considerable" time. Let's look at how long it will take to insert them into the DB. Assuming really good DB performance, we can take 20,000 inserts/sec. In that case the total insertion time is expected to be:
2.8 * 10^12 / 20000 = 140 million seconds
140 * 10^6 / (60*60*24) = 1620 days
So, this answers your question I guess: 1620 DAYS!
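The estimate can be reproduced, and the per-position generation sketched, as follows. The helper name `estimate_days` and the three-position example alphabet are assumptions for illustration; the 20,000 inserts per second is the figure assumed above. Without rounding 36^8 down to 2.8 trillion, the result comes out at roughly 1633 days rather than 1620.

```python
import itertools

def estimate_days(alphabet_size, length, inserts_per_sec):
    """Return (number of strings, days needed to insert them all)."""
    total = alphabet_size ** length
    seconds = total / inserts_per_sec
    return total, seconds / (60 * 60 * 24)

total, days = estimate_days(36, 8, 20_000)
print(total, round(days))  # 2821109907456 1633

# Generating the combinations lazily, one allowed character set per position.
# Hypothetical 3-position pattern; a real one would list up to 36 characters each.
positions = ["ab", "01", "xy"]
combos = ["".join(p) for p in itertools.product(*positions)]
print(len(combos), combos[0], combos[-1])  # 8 a0x b1y
```

`itertools.product` yields one combination at a time, so the full set never has to sit in memory; the insert rate, not the generation, dominates the total time.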

Clojure dictionary of words

I want a dictionary of English words available, so that I can pick random English words. I have a dictionary text file that I downloaded from the internet which has almost 1 million words. What's the best way to go about using this list in Clojure, given that most of the time I'll only need 1 randomly selected word?
Edit:
To answer the comments, this is for some tests which I may turn into load tests which is why I want a decent number of random words and I guess access speed is the most important thing. I do not want to use a database for this. I originally thought of a dictionary just because that's the first thing that popped into my mind but I think a random sequence of letters and numbers would be good enough, perhaps I will just use a UUID as a string.

Read all the words into a Vector and then call rand-nth , e.g.
(rand-nth all-words)
rand-nth uses the nth function on the underlying data structure and Clojure Vectors have log32N performance for index based retrieval.
Edit: This is assuming that it is for a test environment as you described in your question. A more memory efficient method would be to use RandomAccessFile and seek to a random location in the file of words, read until you find the first word delimiter (e.g. comma, EOL) and then read the following bytes until the next delimiter which will give you a random word.
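The seek-based idea is language-independent, so here is a rough sketch of it in Python rather than Clojure. The function name `random_word` and the one-word-per-line temporary file are assumptions for illustration, and the file is assumed to be non-empty. Note that this method is not uniform: a word that follows a long word is picked more often, because more byte offsets land in the line before it.

```python
import os
import random
import tempfile

def random_word(path):
    """Seek to a random byte offset and return the next complete line."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(random.randrange(size))
        f.readline()          # discard the (probably partial) current line
        line = f.readline()   # the next complete word
        if not line:          # landed in the last line: wrap around to the first word
            f.seek(0)
            line = f.readline()
    return line.decode("utf-8").strip()

# Hypothetical word list written to a temporary file, one word per line.
words = ["apple", "banana", "cherry"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(words) + "\n")
    path = tmp.name

sample = {random_word(path) for _ in range(200)}
os.remove(path)
print(sample <= set(words))  # True
```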

Given the life time of different elephants, find the period when maximum number of elephants lived

I came across an interview question:
"Given life times of different elephants. Find the period when maximum number of elephants were alive." For example:
Input: [5, 10], [6, 15], [2, 7]
Output: [6,7] (3 elephants)
I wonder if this problem can be related to the Longest substring problem for 'n' number of strings, such that each string represents the continuous range of a time period.
For e.g:
[5,10] <=> 5 6 7 8 9 10
If not, what can be a good solution to this problem ? I want to code it in C++.
Any help will be appreciated.

For each elephant, create two events: elephant born, elephant died. Sort the events by date. Now walk through the events and just keep a running count of how many elephants are alive; each time you reach a new maximum, record the starting date, and each time you go down from the maximum record the ending date.
This solution doesn't depend on the dates being integers.
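The event walk described above can be sketched like this, in Python rather than the C++ the question asks for. The function name `busiest_period` is made up, and the code assumes lifetimes are inclusive, so a birth on the same date as a death is processed first.

```python
def busiest_period(lifetimes):
    """Return (start, end, count) for the first period with the most elephants alive."""
    events = []
    for born, died in lifetimes:
        events.append((born, 0, +1))  # 0 sorts before 1: births first on equal dates
        events.append((died, 1, -1))
    events.sort()

    alive = best = 0
    best_start = best_end = None
    at_peak = False
    for time, _, delta in events:
        if delta == -1 and at_peak:
            best_end = time       # first death after a new maximum closes the period
            at_peak = False
        alive += delta
        if alive > best:
            best = alive
            best_start = time     # new maximum: remember where it starts
            at_peak = True
    return best_start, best_end, best

print(busiest_period([(5, 10), (6, 15), (2, 7)]))  # (6, 7, 3)
print(busiest_period([(1.5, 3.0), (2.0, 2.5)]))    # (2.0, 2.5, 2)
```

Sorting dominates, so the whole thing runs in O(n log n) for n elephants.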

If I were you at the interview, I would create a std::array sized to the maximum age of an elephant and then increment the elements covered by each elephant's lifetime, like:
[5,10] << increment all elements from index 5 to 10 in the array.
Then I would sort and find where the biggest number is.
It is also possible to use std::map, like map<int,int> (1st: period, 2nd: number of elephants). It will be sorted by key by default.
I'm wondering if you know any better solution?

This is similar to a program that checks whether parentheses are missing. It is also related to date range overlap. This subject is beaten to death on StackOverflow and elsewhere. Here it is:
Determine Whether Two Date Ranges Overlap
I have implemented this by placing all of the start/end ranges in one vector of structs (or classes) and then sorting them. Then you can run through the vector and detect transitions in the number of elephants. (Number of elephants: funny way of stating the problem!)

From your input I see that all the time periods overlap. In that case the solution is simple.
We have been given each range as [start, end],
so the answer will be the maximum of all starts and the minimum of all ends.
Just traverse each time period and find the maximum of all starts and the minimum of all ends.
Note: this solution is applicable only when all the time periods overlap.
In your example:
Maximum of all starts = 6
Minimum of all ends = 7

I would just make two arrays, one for the times elephants are born and one for the times elephants die. Sort both of the arrays.
Now keep a counter (initially at zero). Start traversing both arrays, always taking the smaller of the two current elements. If the element comes from the start array, increment the counter; otherwise decrement it. We can find the max value and its time easily by this method.
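A minimal sketch of this two-array merge, in Python rather than the C++ the question asks for. The function name `max_alive` is an assumption, and the sketch treats a birth and a death on the same date as overlapping by taking the birth first.

```python
def max_alive(lifetimes):
    """Return (max count, time at which that count is first reached)."""
    births = sorted(b for b, _ in lifetimes)
    deaths = sorted(d for _, d in lifetimes)

    i = j = alive = best = 0
    best_time = None
    # Once every birth is consumed the counter can only fall, so stop there.
    while i < len(births):
        if births[i] <= deaths[j]:
            alive += 1
            if alive > best:
                best, best_time = alive, births[i]
            i += 1
        else:
            alive -= 1
            j += 1
    return best, best_time

print(max_alive([(5, 10), (6, 15), (2, 7)]))  # (3, 6)
```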

find a string with at least n matching elements

I have a list of numbers that I want to find at least 3 of...
here is an example
I have a large list of numbers in a sql database in the format of (for example)
01-02-03-04-05-06
06-08-19-24-25-36
etc etc
basically 6 random numbers between 0 and 99.
Now I want to find the strings where at least 3 of a set of given numbers occurs.
For example:
given: 01-02-03-10-11-12
return the strings that have at least 3 of those numbers in them.
eg
01-05-06-09-10-12 would match
03-08-10-12-18-22 would match
03-09-12-18-22-38 would not
I am thinking that there might be some algorithm or even regular expression that could match this... but my lack of computer science textbook experience is tripping me up I think.
No - this is not a homework question! This is for an actual application!
I am developing in ruby, but any language answer would be appreciated

You can use a string replacement to replace - with | to turn 01-02-03-10-11-12 into 01|02|03|10|11|12. Then wrap it like this:
((01|02|03|10|11|12).*){3}
This will find any of the digit pairs, then ignore any number of characters... 3 times. If it matches, then success.
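Here is the same idea sketched in Python rather than Ruby, next to a set-intersection alternative that avoids regex backtracking. The function names are made up for illustration, and both versions assume zero-padded two-digit numbers separated by hyphens with no repeats within a row.

```python
import re

def matches_regex(row, given, n=3):
    """Build (?:(?:01|02|...).*){n} from the given numbers and search the row."""
    alternatives = given.replace("-", "|")
    pattern = "(?:(?:" + alternatives + ").*){" + str(n) + "}"
    return re.search(pattern, row) is not None

def matches_sets(row, given, n=3):
    """Count how many numbers the row shares with the given set."""
    return len(set(row.split("-")) & set(given.split("-"))) >= n

given = "01-02-03-10-11-12"
rows = ["01-05-06-09-10-12", "03-08-10-12-18-22", "03-09-12-18-22-38"]
print([matches_regex(r, given) for r in rows])  # [True, True, False]
print([matches_sets(r, given) for r in rows])   # [True, True, False]
```

The regex form has the advantage that many SQL databases can evaluate it directly in a query; the set form is easier to reason about once the rows are loaded into application code.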
