Minimal transversal of a Hypergraph - data-mining

Hi, this is my first post, so please go easy on me. I tried working through the Dualize and Advance algorithm for generating maximal frequent itemsets. I considered the following example:
Transactions
abcde
ace
bd
abc
with a minimum frequency threshold of 2.
Now, I have a problem understanding how the 'minimal transversals' part of the algorithm is generated.
I know that a transversal is a subset of the vertices of the hypergraph that intersects every hyperedge. So the initial set of minimal transversals should be {a,b,c,d,e}, if I am not wrong.
Can you please explain this 'minimal transversal' step with respect to these transactions?

OK, I'll try to answer my own question.
At the end of the first iteration of the algorithm, for the given transactions, {abc} is emitted as a maximal frequent itemset. Here is how I understood it:
minimal transversal X = S1' = {a,b,c,d,e}
S2 = {abc} is maximal and S2' = {de}
Find minimal transversal of S2' which is {d,e}
Now X = {d,e}. Consider 'd':
S3 = {bd} is maximal and S3' = {ace}
Now consider 'e':
S4 = {ace} is maximal, and meanwhile I get {ad}, {be}, {cd} and {de} as minimal transversals, all of which are infrequent.

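A subset T of V is a transversal (or hitting set) of H if it
intersects all the hyperedges of H.
'minimal transversal' -> a transversal from which no vertex can be removed without missing some hyperedge (minimal under set inclusion, not necessarily the smallest in size).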
Also, we enforce ordering in our algorithm a>b>c>d>e
Iteration 1:
S1 = {}
S1' = {abcde}
Tr = {a,b,c,d,e} [Five minimal transversals; any single node cuts S1']
Iteration 2:
S2 = {abc}
S2' = {de}
Tr = {d,e} [Two minimal transversals cut S2']
Iteration 3:
S3 = {abc, bd}
S3' = {de, ace}
Tr = {e, cd, ad} [Three minimal transversals cut S3']
Iteration 4:
S4 = {abc, bd, ace}
S4' = {de, ace, bd}
Tr = {de, cd, ad, be} [Four minimal transversals cut S4']
All minimal transversals Tr are infrequent so the algorithm ends.
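The transversal sets in the iterations above can be checked with a small brute-force sketch. This is not the Dualize and Advance algorithm itself, only an exhaustive enumeration that is practical for a five-item example; the function names `minimal_transversals` and `support` are made up for illustration.

```python
from itertools import combinations

def minimal_transversals(edges, vertices):
    """Return all inclusion-minimal vertex sets that intersect every edge."""
    edges = [frozenset(e) for e in edges]
    found = []
    # Enumerate candidates by increasing size, so every smaller
    # transversal is already known when a larger candidate is tested.
    for size in range(len(vertices) + 1):
        for candidate in combinations(sorted(vertices), size):
            cand = frozenset(candidate)
            if any(t <= cand for t in found):
                continue  # contains a smaller transversal, so not minimal
            if all(cand & e for e in edges):
                found.append(cand)
    return found

def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

transactions = ["abcde", "ace", "bd", "abc"]
complements = ["de", "ace", "bd"]  # S4' from iteration 4
tr = minimal_transversals(complements, "abcde")
print(sorted("".join(sorted(t)) for t in tr))      # ['ad', 'be', 'cd', 'de']
print([support(t, transactions) for t in tr])      # [1, 1, 1, 1]
```

Every transversal has support 1, which is below the threshold of 2, so none of them can be extended to a new maximal frequent itemset.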

Python: referring to each duplicate item in a list by unique index

I am trying to extract particular lines from a text output file. The lines I am interested in are a few lines above and a few lines below the key string that I use to search through the results. The key string is the same for each result.
fi = open('Inputfile.txt')
fo = open('Outputfile.txt', 'a')
lines = fi.readlines()
filtered_list=[]
for item in lines:
    if item.startswith("key string"):
        filtered_list.append(lines[lines.index(item)-2])
        filtered_list.append(lines[lines.index(item)+6])
        filtered_list.append(lines[lines.index(item)+10])
        filtered_list.append(lines[lines.index(item)+11])
fo.writelines(filtered_list)
fi.close()
fo.close()
The output file contains the right lines for the first record, but repeated once for every record in the file. How can I update the indexing so that it handles each record individually? I've tried to find a solution, but as a novice programmer I struggled to use the enumerate() function or the collections package.
First of all, it would probably help if you said what exactly goes wrong with your code (a stack trace, it doesn't work at all, etc.). Anyway, here are some thoughts. You can try to divide your problem into subproblems to make it easier to work with. In this case, let's separate finding the relevant lines from collecting them.
First, let's find the indexes of all the relevant lines.
key = "key string"
relevant = []
for i, item in enumerate(lines):
    if item.startswith(key):
        relevant.append(i)  # store the index, not the line itself
enumerate is actually quite simple. It takes a list and yields (index, item) pairs. So, list(enumerate(['a', 'b', 'c'])) gives [(0, 'a'), (1, 'b'), (2, 'c')].
What I had written above can be achieved with a list comprehension:
relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]
So, we have the indexes of the relevant lines. Now, let's collect them. You are interested in the line 2 lines before each match and the lines 6, 10 and 11 lines after it. If the key appears in one of the first two lines, you have a problem: you don't really want lines[-1], because that's the last item! Also, you need to handle the situation in which your offset would take you past the end of the list; otherwise Python will raise an IndexError.
out = []
for r in relevant:
    for offset in -2, 6, 10, 11:
        index = r + offset
        if 0 <= index < len(lines):  # index 0 is a valid line too
            out.append(lines[index])
You could also catch the IndexError, but that won't save us much typing, as we have to handle negative indexes anyway.
The whole program would look like this:
key = "key string"
with open('Inputfile.txt') as fi:
    lines = fi.readlines()
relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]
out = []
for r in relevant:
    for offset in -2, 6, 10, 11:
        index = r + offset
        if 0 <= index < len(lines):  # index 0 is a valid line too
            out.append(lines[index])
with open('Outputfile.txt', 'a') as fo:
    fo.writelines(out)
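Why the original code repeats the first record: list.index returns the position of the first element equal to its argument, so identical key lines all resolve to the same index. A minimal sketch on in-memory data shows the difference; the sample lines and the helper name `collect_offsets` are invented for illustration.

```python
def collect_offsets(lines, key, offsets=(-2, 6, 10, 11)):
    """Collect the lines at the given offsets around every line starting with key."""
    out = []
    for i, item in enumerate(lines):
        if not item.startswith(key):
            continue
        for offset in offsets:
            index = i + offset
            if 0 <= index < len(lines):  # skip offsets that fall outside the list
                out.append(lines[index])
    return out

# Hypothetical data: two records whose key lines are identical strings.
lines = ["header A\n", "pad\n", "key string\n", "value A\n",
         "header B\n", "pad\n", "key string\n", "value B\n"]

# list.index always finds the first match, so both records map to index 2.
print([lines.index(l) for l in lines if l.startswith("key string")])  # [2, 2]

# enumerate reports the real position of each match.
print(collect_offsets(lines, "key string", offsets=(-2, 1)))
# ['header A\n', 'value A\n', 'header B\n', 'value B\n']
```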
To get rid of duplicates, you can convert the list to a set; for example:
x=['a','b','a']
y=set(x)
print(y)
will print something like (the order of the elements may vary):
{'a', 'b'}
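A set does not keep the original order, which matters when the items are lines of a file. A minimal sketch of an order-preserving alternative, assuming Python 3.7 or later (where dict keeps insertion order); the helper name `dedupe_keep_order` is made up for illustration.

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and remember the order in which they were inserted
    return list(dict.fromkeys(items))

print(dedupe_keep_order(['a', 'b', 'a']))  # ['a', 'b']
```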

Scala: how to flatten a List of a Set of filepaths

I have a List[Set[Path]]. Update: Each Path in a Set is unique and represents a particular directory location. There are no duplicates. So, what I am looking for is the total number of path elements.
val miceData = List(Set(C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test7.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test2.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test6.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test5.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test8.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test3.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\aPowerPoint.pptx, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test1.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test4.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory2\Test10.txt), Set(C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory2\Test6.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory2\Test3.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory2\Test4.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory2\Test70.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory2\Test8.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory2\Test5.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory2\Test2.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\FlatDirectory\Test2.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\FlatDirectory\Test3.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\FlatDirectory\Test1.txt), 
Set(C:\Users\lulu\Documents\mice_data\data_mining_folder\FlatDirectory\Test80.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\FlatDirectory\Test7.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\FlatDirectory\Test40.txt, C:\Users\lulu\Documents\GitHub\data_mining_folder\FlatDirectory\Test6.txt, C:\Users\lulu\Documents\mice_data\data_mining_folder\FlatDirectory\Test5.txt), Set(C:\Users\lulu\Documents\mice_data\data_mining_folder\zipfile.zip), Set(C:\Users\lulu\Documents\mice_data\data_mining_folder\micetest.txt,C:\Users\lulu\Documents\mice_data\data_mining_folder\riley.jpg))
There are 5 Sets in this List, each Set holding Path(s). The total number of such Paths is 28, if I counted correctly.
Now, I want to find out the total number of Path elements across all Sets in this List.
I could have done this computation in an area of my code upstream, but I am curious to do so now, and learn more about Scala in the process.
Something like:
val totalPaths = <<iterate over this List and count all the paths>>
I would like the shortest, most idiomatic piece of code to accomplish this.
val paths = for { // gives you a list of all paths in all sets
  set <- miceData
  path <- set
} yield path
val totalPaths = paths.toSet.size // converting it to set will remove duplicates if any
I think flatten is just enough
val toto = List(Set(1,2,3), Set(6,7,8))
println(toto.flatten.size) // count needs a predicate; size gives the number of elements
val totalPaths = miceData.map(_.size).sum
If you have duplicates, you can do:
val totalPaths = miceData.flatten.distinct.size
val totalPaths = miceData.flatten.size
OR
val totalPaths = miceData.flatten.length
And you might want to write your paths as triple-quoted strings, because with ordinary double-quoted strings the REPL gives the following error:
<console>:1: error: invalid escape character
List(Set("C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test7.txt", "C:\Users\lulu\Documents\mice_data\data_mining_folder\DeeplyNestedDirectory\FlatDirectory\Test2.txt")).flatten.size

Recursively Assigning One List Element to Another in Prolog

This is a homework assignment and my first experience with Prolog. My goal is to create a list of assignments from a list of people and a list of tasks. If a person has a letter identifier that matches the task's, then that person's ID and the task's ID are paired up and placed in a list of assignments. My function prints out a list, but it does not look like it is comparing all the elements. A sample input: schedule([p1,p2,p3],[t1,t2],Result). A sample output would look like [[p1,t1],[p2,t2],[p3,t1],[p3,t2]].
What I have so far:
%%
%% person(ID, TASK_CAPABILITIES, AVAILABLE_HOURS)
%%
%% How many hours each person has available and what classes of tasks they
%% are capable of performing.
%%
person(p1, [c,a], 20).
person(p2, [b], 10).
person(p3, [a,b], 15).
person(p4, [c], 30).
%%
%% task(ID, TASK_CLASS, REQUIRED_HOURS)
%%
%% How long each task requires and what class it falls under.
%%
task(t1, a, 5).
task(t2, b, 10).
task(t3, c, 15).
task(t4, c, 10).
task(t5, a, 15).
task(t6, b, 10).
%test arithmetic functions
add(X, Y, Z) :- Z is X + Y.
subtract(X,Y,Z) :- Z is X - Y.
schedule([],[],[]).
schedule(People,
         [Task|OtherTasks],
         [[PersonId, TaskId]|RestOfAssignments]):-
    member(PersonId, People),
    person(PersonId, PersonCapabilities,_),
    member(TaskId, [Task|OtherTasks]),
    task(TaskId, TaskType,_),
    member(TaskType, PersonCapabilities),
    schedule( _, OtherTasks, RestOfAssignments).
My reasoning behind what I wrote was that the list of People would be compared to each task, then that task would be replaced by the next task and the comparison would repeat. What I see in the trace of this function instead is that the tasks are being removed from the list but are only compared to the first two People. My question is how can I get the schedule function to check the full list of people for each task?
Your problem seems ill-specified, and you are simplifying too much: the code should take hours availability into account as well as capabilities. Ignoring this problem, select/3 instead of member/2 could help to model a naive solution:
schedule([],_,[]).
% pick a suitable task for PersonId
schedule([PersonId|People], Tasks, [[PersonId, TaskId]|RestOfAssignments]):-
    select(TaskId, Tasks, RestTasks),
    person(PersonId, PersonCapabilities,_),
    task(TaskId, TaskType,_),
    memberchk(TaskType, PersonCapabilities),
    schedule(People, RestTasks, RestOfAssignments).
% if no suitable task for PersonId
schedule([_PersonId|People], Tasks, Assignments):-
    schedule(People, Tasks, Assignments).
yields these solutions
?- schedule([p1,p2,p3],[t1,t2],Result).
Result = [[p1, t1], [p2, t2]] ;
Result = [[p1, t1], [p3, t2]] ;
Result = [[p1, t1]] ;
Result = [[p2, t2], [p3, t1]] ;
Result = [[p2, t2]] ;
Result = [[p3, t1]] ;
Result = [[p3, t2]] ;
Result = [].

Insertion order of a list based on order of another list

I have a sorting problem in Scala that I could certainly solve with brute-force, but I'm hopeful there is a more clever/elegant solution available. Suppose I have a list of strings in no particular order:
val keys = List("john", "jill", "ganesh", "wei", "bruce", "123", "Pantera")
Then I receive the values for these keys in random order (full disclosure: I'm experiencing this problem in an Akka actor, so events are not in order):
def receive:Receive = {
case Value(key, otherStuff) => // key is an element in keys ...
And I want to store these results in a List where the Value objects appear in the same order as their key fields in the keys list. For instance, I may have this list after receiving the first two Value messages:
List(Value("ganesh", stuff1), Value("bruce", stuff2))
ganesh appears before bruce merely because he appears earlier in the keys list. Once the third message is received, I should insert it into this list in the correct location per the ordering established by keys. For instance, on receiving wei I should insert him into the middle:
List(Value("ganesh", stuff1), Value("wei", stuff3), Value("bruce", stuff2))
At any point during this process, my list may be incomplete but in the expected order. Since the keys are redundant with my Value data, I throw them away once the list of values is complete.
Show me what you've got!
I assume you want no worse than O(n log n) performance. So:
val order = keys.zipWithIndex.toMap
var part = collection.immutable.TreeSet.empty[Value](
  math.Ordering.by(v => order(v.key))
)
Then you just add your items.
scala> part = part + Value("ganesh", 0.1)
part: scala.collection.immutable.TreeSet[Value] =
TreeSet(Value(ganesh,0.1))
scala> part = part + Value("bruce", 0.2)
part: scala.collection.immutable.TreeSet[Value] =
TreeSet(Value(ganesh,0.1), Value(bruce,0.2))
scala> part = part + Value("wei", 0.3)
part: scala.collection.immutable.TreeSet[Value] =
TreeSet(Value(ganesh,0.1), Value(wei,0.3), Value(bruce,0.2))
When you're done, you can .toList it. While you're building it, you probably don't want to, since updating a list in random order so that it is in a desired sorted order is an obligatory O(n^2) cost.
Edit: with your example of seven items, my solution takes about 1/3 the time of Jean-Philippe's. For 25 items, it's 1/10th the time. 1/30th for 200 (which is the difference between 6 ms and 0.2 ms on my machine).
If you can use a ListMap instead of a list of tuples to store values while they're gathered, this could work. ListMap preserves insertion order.
class MyActor(keys: List[String]) extends Actor {
  def initial(values: ListMap[String, Option[Value]]): Receive = {
    case v @ Value(key, otherStuff) =>
      // record the new value first, then check whether every key has one
      val updated = values.updated(key, Some(v))
      if (updated.forall(_._2.isDefined))
        context.become(valuesReceived(updated.collect { case (_, Some(x)) => x }.toSeq))
      else
        context.become(initial(updated))
  }
  def valuesReceived(values: Seq[Value]): Receive = { } // whatever you need
  // start with every key present, in the desired order, and no value yet
  def receive = initial(ListMap(keys.map(k => k -> Option.empty[Value]): _*))
}
(warning: not compiled)

Regular expression puzzle

This is not homework, but an old exam question. I am curious to see the answer.
We are given an alphabet S={0,1,2,3,4,5,6,7,8,9,+}. Define the language L as the set of strings w from this alphabet such that w is in L if:
a) w is a number such as 42 or w is the (finite) sum of numbers such as 34 + 16 or 34 + 2 + 10
and
b) The number represented by w is divisible by 3.
Write a regular expression (and a DFA) for L.
This should work:
^(?:0|(?:(?:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\
+)*[369]0*)*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:
\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[
258](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0
\+)*[147])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+)(?:\+(?:0|(?:(?
:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:\+?(?:0\+)*
[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[258](?:0*(?
:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])*
(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)
*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+))*$
It works by having three states representing the sum of the digits so far modulo 3. It disallows leading zeros on numbers, and plus signs at the start and end of the string, as well as two consecutive plus signs.
Generation of regular expression and test bed:
a = r'0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*'
b = r'a[147]'
c = r'a[258]'
r1 = '[369]|[147](?:bc)*(?:c|bb)|[258](?:cb)*(?:b|cc)'
r2 = '(?:0|(?:(?:' + r1 + ')0*)+)'
r3 = '^' + r2 + r'(?:\+' + r2 + ')*$'
r = r3.replace('b', b).replace('c', c).replace('a', a)
print(r)
# Test on 10000 examples.
import random, re
random.seed(1)
r = re.compile(r)
for _ in range(10000):
    x = ''.join(random.choice('0123456789+') for j in range(random.randint(1,50)))
    # Malformed input: leading plus, double plus, leading zero, or trailing plus.
    if re.search(r'(?:\+|^)(?:\+|0[0-9])|\+$', x):
        valid = False
    else:
        valid = eval(x) % 3 == 0
    result = re.match(r, x) is not None
    if result != valid:
        print('Failed for ' + x)
Note that my memory of DFA syntax is woefully out of date, so my answer is undoubtedly a little broken. Hopefully this gives you a general idea. I've chosen to ignore + completely. As AmirW states, abc+def and abcdef are the same for divisibility purposes.
Accept state is C.
A=1,4,7,BB,AC,CA
B=2,5,8,AA,BC,CB
C=0,3,6,9,AB,BA,CC
Notice that the above language uses all 9 possible ABC pairings. It will always end at either A,B,or C, and the fact that every variable use is paired means that each iteration of processing will shorten the string of variables.
Example:
1490 = AACC = BCC = BC = B (Fail)
1491 = AACA = BCA = BA = C (Success)
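The A/B/C table is addition of residues modulo 3 (A = 1, B = 2, C = 0), so the reduction in the examples can be replayed in a few lines. This is only a sketch of that idea; it ignores + just as the answer does, and the names `combine` and `classify` are made up for illustration.

```python
# Residue class of each symbol: C = 0, A = 1, B = 2 (mod 3).
CLASS_OF_RESIDUE = {0: "C", 1: "A", 2: "B"}
RESIDUE_OF_CLASS = {"C": 0, "A": 1, "B": 2}

def combine(x, y):
    """Pairing rule from the table, e.g. AA -> B, AB -> C, CC -> C."""
    return CLASS_OF_RESIDUE[(RESIDUE_OF_CLASS[x] + RESIDUE_OF_CLASS[y]) % 3]

def classify(digits):
    """Fold a digit string left to right into a single class A, B or C."""
    state = "C"  # the empty prefix has digit sum 0
    for ch in digits:
        state = combine(state, CLASS_OF_RESIDUE[int(ch) % 3])
    return state

print(classify("1490"))  # B (fail)
print(classify("1491"))  # C (success)
```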
Not a full solution, just an idea:
(B) alone: The "plus" signs don't matter here. abc + def is the same as abcdef for the sake of divisibility by 3. For the latter case, there is a regexp here: http://blog.vkistudios.com/index.cfm/2008/12/30/Regular-Expression-to-determine-if-a-base-10-number-is-divisible-by-3
To combine this with requirement (A), we can take the solution of (B) and modify it:
First read character must be in 0..9 (not a plus)
Input must not end with a plus, so: Duplicate each state (will use S for the original state and S' for the duplicate to distinguish between them). If we're in state S and we read a plus we'll move to S'.
When reading a number we'll go to the new state as if we were in S. S' states cannot accept (another) plus.
Also, S' is not "accept state" even if S is. (because input must not end with a plus).
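The doubled-state construction described above can be simulated directly. The sketch below tracks the state as a pair (digit sum mod 3, whether the last symbol was a plus), which corresponds to S and S'. It follows only the rules listed in this answer, so unlike the long regular expression it does not reject leading zeros; the function name `accepts` is made up for illustration.

```python
DIGITS = "0123456789"

def accepts(w):
    """Simulate the doubled-state DFA on the string w."""
    if not w or w[0] not in DIGITS:
        return False              # first character must be a digit
    residue, after_plus = 0, False
    for ch in w:
        if ch == "+":
            if after_plus:
                return False      # S' states cannot accept another plus
            after_plus = True     # move from S to S'
        elif ch in DIGITS:
            residue = (residue + int(ch)) % 3  # same transition as from S
            after_plus = False
        else:
            return False          # symbol outside the alphabet
    # accept only in an S state whose digit sum is divisible by 3
    return residue == 0 and not after_plus

print(accepts("12+3"), accepts("34+2"), accepts("3+"), accepts("1490"))
# True True False False
```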