How to compare 2 tuples (files) in a Pig script

I am new to Pig scripting and am having a problem comparing two tuples.
There are two files:
random = rows containing a sequence of 6 numbers
allposs = rows containing a sequence of 5 numbers
For every row in allposs, I want to count how often its sequence occurs in the file random.
There are 2 possibilities:
the sequence occurs in the first 5 numbers of random
the sequence occurs in the last 5 numbers of random
A = load 'random' using PigStorage(':') as (bsid1:int, bsid2:int, bsid3:int, bsid4:int, bsid5:int, bsid6:int);
B = load 'Allposs' using PigStorage(':') as (bsid1:int, bsid2:int, bsid3:int, bsid4:int, bsid5:int);
C = FILTER A BY (A.bsid1==B.bsid1 AND A.bsid2==B.bsid2 AND
A.bsid3==B.bsid3 AND A.bsid4==B.bsid4 AND A.bsid5==B.bsid5) OR
(A.bsid2==B.bsid1 AND A.bsid3==B.bsid2 AND A.bsid4==B.bsid3 AND
A.bsid5==B.bsid4 AND A.bsid6==B.bsid5);
C = GROUP B ALL;
D = FOREACH C GENERATE COUNT(B);
DUMP D;
Please help me correct this Pig script.
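Since no corrected script is shown here, the matching rule itself can be sketched in Python (not Pig; the function name matches_row is only for illustration): an allposs sequence counts as a hit if it equals either the first 5 or the last 5 numbers of a random row.
def matches_row(random_row, allposs_row):
    # random_row holds 6 numbers, allposs_row holds 5;
    # a hit means allposs_row equals the first 5 or the last 5 numbers of random_row
    return allposs_row == random_row[:5] or allposs_row == random_row[1:]

print(matches_row((1, 2, 3, 4, 5, 6), (1, 2, 3, 4, 5)))  # True: matches the first 5
print(matches_row((1, 2, 3, 4, 5, 6), (2, 3, 4, 5, 6)))  # True: matches the last 5
print(matches_row((1, 2, 3, 4, 5, 6), (9, 9, 9, 9, 9)))  # False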

Related

I need to create a loop that prints a graph where each number in the list is shown by a number of characters. Here is an example:

numbers = [1,2,3,4]
results in
1: i
2: ii
3: iii
4: iiii
This is my code so far and I'm not sure where to go.
numbers = [1,2,3,4]
c = 0
for i in numbers:
    count += 1
    print(len(numbers))
Instead of 'i' you can use any character you like, enclosed in single quotes.
numbers = [1,2,3,4]
output = []
for number in numbers:
    output.append(number * 'i')
print(output)
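If the exact "number: characters" layout from the question is wanted, a minimal variant of the same idea (formatting each line instead of collecting a list) could be:
numbers = [1, 2, 3, 4]
for number in numbers:
    # print the number, a colon, then that many copies of 'i'
    print(f"{number}: {number * 'i'}")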

How do I create randomized lists using a nested for loop

I'm trying to make 2 lists of random values and compare them to each other to find the probability that they are the same. I made the 2 lists with random numbers using a for loop, and to estimate the probability I'm trying to build the lists inside another for loop so that 10000 pairs of lists get compared, but I can't get it to work.
import random
import collections

N = 10000
count = 0
playerPick = []
randomPick = []
for j in range(N):
    for i in range(4):
        playerPick.append(random.randrange(1,21))
    print(playerPick)
    for i in range(4):
        randomPick.append(random.randrange(1,21))
    print(randomPick)
    if collections.Counter(playerPick) == collections.Counter(randomPick):
        count += 1
probability = count/N
print("Probability of winning: ", probability)
The lists end up being extremely long, but I just want them to be 4 elements long.
This may be a more efficient way to calculate the average match count.
import random
import collections

N = 10000  # main list length
L = 4      # each element is a list of 4 elements

def getpct():
    # create random lists of lists
    playerPick = [[random.randrange(1,21) for x in range(L)] for n in range(N)]
    randomPick = [[random.randrange(1,21) for x in range(L)] for n in range(N)]
    # create match list, 1 = match else 0
    matches = [1 if p == r else 0 for p, r in zip(playerPick, randomPick)]
    return sum(matches)/N  # percent matches

allpcts = [getpct() for r in range(1000)]  # run the test 1000 times
avgpct = sum(allpcts)/1000                 # average percent
print(f'Avg Pct: {avgpct}%')
Output
Avg Pct: 6.2000000000000025e-06%
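As a side note on the original problem (the lists growing far past 4 elements): the appends happen inside the outer loop without the lists ever being reset. A minimal sketch of that fix, keeping the asker's Counter comparison, might be:
import random
import collections

N = 10000
count = 0
for j in range(N):
    # rebuild both 4-element lists on every trial instead of appending forever
    playerPick = [random.randrange(1, 21) for _ in range(4)]
    randomPick = [random.randrange(1, 21) for _ in range(4)]
    if collections.Counter(playerPick) == collections.Counter(randomPick):
        count += 1
print("Probability of winning:", count / N)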

Matlab: locate a string in txt and read it into a number

I have an input file like this:
number of elements = 4
number of nodes = 6
number of fixed points = 2
number of forces = 1
young = 2.0E8
poiss = 0.2
thickness = 0.002
node group
1 2 6
2 3 4
2 4 5
2 5 6
And I use this code to read the file:
fid = fopen(input_file);
tline = fgetl(fid);
line_number = 1;
while ischar(tline)
    % this will locate the string and extract the number
    if ~isempty(strfind(tline, 'number of elements'))
        NELEM = str2double(regexp(tline, '\d+', 'match'));
    end
    if ~isempty(strfind(tline, 'young'))
        YOUNG = str2double(regexp(tline, '\d+', 'match'));
    end
    line_number = line_number + 1;
    tline = fgetl(fid);
end
fclose(fid);
fclose(fid);
The first one works fine; however, for the second one, YOUNG, the output is actually [2 0 8] (the original number is 2.0E8). The regexp splits the string into an array of separate digit groups.
And poiss is read as [0 2].
How can I turn the string into the original number?
Your regular expression needs to match floating-point numbers with exponents; try changing '\d+' to
'[0-9]*\.?[0-9]+([eE][0-9]+)?'
This then matches numbers with an optional decimal point and exponent. For example:
str2double(regexp('young = 2.0E8', '[0-9]*\.?[0-9]+([eE][0-9]+)?', 'match'))
gives 200000000.
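For comparison only (the question itself is MATLAB), the same pattern idea works in Python with the re module; re.search plus group(0) returns the full match even though the pattern contains a capture group:
import re

pattern = r'[0-9]*\.?[0-9]+([eE][0-9]+)?'
m = re.search(pattern, 'young = 2.0E8')
print(float(m.group(0)))  # 200000000.0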

Multiple lists of the same length to csv

I have a couple of List&lt;string&gt;s, with a format like this:
List 1 List 2 List 3
1 A One
2 B Two
3 C Three
4 D Four
5 E Five
So in code form, it's like:
List<string> list1 = new List<string> { "1", "2", "3", "4", "5" };
List<string> list2 = new List<string> { "A", "B", "C", "D", "E" };
List<string> list3 = new List<string> { "One", "Two", "Three", "Four", "Five" };
My questions are:
How do I transform those three lists into CSV format?
list1,list2,list3
1,A,one
2,b,two
3,c,three
4,d,four
5,e,five
Should I append , to the end of each index or make the delimiter its own index within the multidimensional list?
If performance is your main concern, I would use an existing csv library for your language, as it's probably been pretty well optimized.
If that's too much overhead and you just want a simple function, I use the same concept in some of my code: use the language's join/implode function to create a list of comma-separated strings, then join that list with \n.
I'm used to doing this in a dynamic language, but you can see the concept in the following pseudocode example:
header = {"List1", "List2", "List3"}
list1 = {"1","2","3","4","5"};
list2 = {"A","B","C","D","E"};
list3 = {"One","Two","Three","Four","Five"};
values = {header, list1, list2, list3};
for index in values
    values[index] = values[index].join(",");
values = values.join("\n");
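If the row-per-record layout shown in the question (1,A,One and so on) is the goal, a zip-style sketch in a dynamic language (Python here, not the asker's C#) would be:
header = ["list1", "list2", "list3"]
list1 = ["1", "2", "3", "4", "5"]
list2 = ["A", "B", "C", "D", "E"]
list3 = ["One", "Two", "Three", "Four", "Five"]

# one CSV row per record: zip walks the three lists in parallel
rows = [",".join(header)] + [",".join(triple) for triple in zip(list1, list2, list3)]
print("\n".join(rows))
Note that this does no quoting or escaping, which is exactly why a real csv library is the safer choice when performance or correctness matters.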

Need help improving the speed of my code for duplicate columns removal in Python

I have written code that takes a text file as input and prints only the variants which occur more than once. By variants I mean the chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first 3 columns and checking whether they are the same. Here, chr1 1049083 1049083 repeats in both row 2 and row 3, so I print out that there is one duplicate, along with its position.
I have written the code below. Though it does what I want, it's quite slow. It takes about 5 min to run on a file which has 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes an input file and
prints out only the variants that occur more than once """

import shlex
import collections

rows = open('variants.txt', 'r').read().split("\n")

# removing the header and storing it in a new variable
header = rows.pop()

indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))

dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
    if y > 1:
        dup_list.append(x)

print dup_list
print len(dup_list)
Note: In this case the entire row2 is a duplicate of row3. But this is not necessarily the case all the time. Duplicate of chr positions (first three columns) is what I'm looking for.
EDIT:
Edited the code as per the suggestion of damienfrancois. Below is my new code:
import shlex

f = open('variants.txt', 'r')
indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
I used time to see how long both versions take.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s,total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
Some suggestions:
do not read the whole file at once; read it line by line and process it on the fly; you'll save memory operations
let indices be a defaultdict and increment the value at key "_".join(var[0:3]); this saves the costly (guessing here, you should use a profiler) collections.Counter(indices).items() step (a minimal sketch of this follows below)
try PyPy or a Python compiler
split your data into as many subsets as your computer has cores, apply the program to each subset in parallel, then merge the results
HTH
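Here is a minimal sketch of the second suggestion (streaming the file and counting with collections.defaultdict), assuming the same variants.txt layout as above; plain str.split is used instead of shlex.split, since the columns are whitespace-separated:
from collections import defaultdict

indices = defaultdict(int)
with open('variants.txt') as f:
    for line in f:
        cols = line.split()
        if len(cols) >= 3:
            # the first three columns (chr, start, end) identify a variant
            indices["_".join(cols[:3])] += 1

dup_pos = sum(1 for count in indices.values() if count > 1)
print(dup_pos)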
A big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...
f = open('variants.txt', 'r')
indices = {}
for line in f:
    var = line.split()
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
Another option there would be to replace the four try/except lines with:
indices[index] = 1 + indices.get(index,0)
This approach only tells you how many distinct lines are duplicated, not how many times each one is repeated. (So if one line is duplicated 3x, it will still be counted once...)
If you are only trying to count the duplicates, and not delete or note them, you could tally the lines of the file as you go and compare that to the length of the indices dictionary; the difference is the number of duplicate lines (instead of looping back through and re-counting). This might save a little time, but it gives a different answer:
#!/usr/bin/env python
f = open('variants.txt', 'r')
indices = {}
total_len = 0
for line in f:
    total_len += 1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index, 0)
f.close()

print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear what your benchmarks are for code that does not include the has_key() test...