How to create all pairwise combinations of the rows of two files - python-2.7

I did some research but I am having difficulty finding an answer.
I am using Python 2.7 and pandas so far, but I am still learning.
I have two CSVs; let's say one holds the alphabet A-Z and the other holds the numbers 0-100.
I want to merge the two files to get A0 to A100, and so on up through Z0 to Z100.
For information, the two files contain DNA sequences, so I believe they are strings.
I tried to create arrays with numpy and build a matrix, but to no avail.
Here is a preview of the files:
barcode
0 GGAAGAA
1 CCAAGAA
2 GAGAGAA
3 AGGAGAA
4 TCGAGAA
5 CTGAGAA
6 CACAGAA
7 TGCAGAA
8 ACCAGAA
9 GTCAGAA
10 CGTAGAA
11 GCTAGAA
12 GAAGGAA
13 AGAGGAA
14 TCAGGAA
659
barcode
0 CGGAAGAA
1 GCGAAGAA
2 GGCAAGAA
3 GGAGAGAA
4 CCAGAGAA
5 GAGGAGAA
6 ACGGAGAA
7 CTGGAGAA
8 CACGAGAA
9 AGCGAGAA
10 TCCGAGAA
11 GTCGAGAA
12 CGTGAGAA
13 GCTGAGAA
14 CGACAGAA
1995

Here is the way I found to do it; there might be a neater way:
import pandas as pd
# df7 and df8 are the two DataFrames previewed above
index = pd.MultiIndex.from_product([df8.barcode, df7.barcode], names = ["df8", "df7"])
df = pd.DataFrame(index = index).reset_index()
def concat_BC(x):  # concatenate the two sequences into one new column
    return str(x["df8"]) + str(x["df7"])
df["BC"] = df.apply(concat_BC, axis=1)
– Stephane Chiron
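For comparison, the same "every barcode from one file with every barcode from the other" idea can be sketched without pandas using itertools.product. The two short lists below are hypothetical stand-ins for df8.barcode and df7.barcode, and the variable names are illustrative only:

```python
from itertools import product

# Hypothetical stand-ins for the two barcode columns
# (df8.barcode and df7.barcode in the post above).
barcodes_8 = ["CGGAAGAA", "GCGAAGAA"]
barcodes_7 = ["GGAAGAA", "CCAAGAA", "GAGAGAA"]

# product() yields every (first, second) pair, with the second list varying fastest.
combined = [a + b for a, b in product(barcodes_8, barcodes_7)]

print(len(combined))  # one entry per pair
print(combined[0])
```

The result has len(barcodes_8) * len(barcodes_7) entries, so for the real files (hundreds to thousands of rows each) the list can grow large.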

Power BI : line grouping

I am just starting to use Power BI, and I don't know how to group rows.
I have this kind of data:
api user 01/07/21 02/07/21 03/07/21 ...
a 25 null 3 4
b 25 1 null 2
c 25 1 4 5
a 30 4 3 5
b 30 3 2 2
c 30 1 1 3
I would like to have the sum of the values per user, not per api and user:
user 01/07/21 02/07/21 03/07/21 ...
25 2 7 11
30 8 6 10
Do you know how to do this, please?
I created a table with your sample data (make sure your values are treated as numbers).
Then create a Matrix visual, with "user" in Rows and your desired date columns in the Values section.

Reshaping Pandas data frame (a complex case!)

I want to reshape the following data frame:
index id numbers
1111 5 58.99
2222 5 75.65
1000 4 66.54
11 4 60.33
143 4 62.31
145 51 30.2
1 7 61.28
The reshaped data frame should be like the following:
id 1 2 3
5 58.99 75.65 nan
4 66.54 60.33 62.31
51 30.2 nan nan
7 61.28 nan nan
I use the following code to do this.
import pandas as pd
dtFrame = pd.read_csv("data.csv")
ids = dtFrame['id'].unique()
temp = dtFrame.groupby(['id'])
temp2 = {}
for i in ids:
    temp2[i] = temp.get_group(i).reset_index()['numbers']
dtFrame = pd.DataFrame.from_dict(temp2)
dtFrame = dtFrame.T
Although the above code solves my problem, is there a simpler way to achieve this? I tried a pivot table, but it does not solve the problem, perhaps because it requires the same number of elements in each group. Or maybe there is another way that I am not aware of; please share your thoughts.
In [69]: df.groupby(df['id'])['numbers'].apply(lambda x: pd.Series(x.values)).unstack()
Out[69]:
        0      1      2
id
4   66.54  60.33  62.31
5   58.99  75.65    NaN
7   61.28    NaN    NaN
51  30.20    NaN    NaN
This is really quite similar to what you are doing except that the loop is replaced by apply. The pd.Series(x.values) has an index which by default ranges over integers starting at 0. The index values become the column names (above). It doesn't matter that the various groups may have different lengths. The apply method aligns the various indices for you (and fills missing values with NaN). What a convenience!
I learned this trick here.
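A different route to the same shape, sketched here as an alternative and not as the answerer's method, is to number the rows inside each id group with cumcount and then pivot on that number. The column name pos is an assumption introduced for illustration; the data is the sample from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [5, 5, 4, 4, 4, 51, 7],
    "numbers": [58.99, 75.65, 66.54, 60.33, 62.31, 30.2, 61.28],
})

# Number the rows within each id group: 1, 2, 3, ...
# ("pos" is a hypothetical helper column)
df["pos"] = df.groupby("id").cumcount() + 1

# One row per id, one column per position; missing slots become NaN.
wide = df.pivot(index="id", columns="pos", values="numbers")
print(wide)
```

Because the position starts at 1, the columns come out as 1, 2, 3, matching the layout requested in the question.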

stop pd.DataFrame.from_csv() from converting integer index to date

pandas.DataFrame.from_csv(filename) seems to be converting my integer index into a date.
This is undesirable. How do I prevent this?
The code shown here is a toy version of a larger problem. In the larger problem, I am estimating and writing the parameters of statistical models for each zone for later use. I thought by using a pandas dataframe indexed by zone, I could easily read back the parameters. While pickle or some other format like json might solve this problem I'd like to see a pandas solution....except pandas is converting the zone number to a date.
#!/usr/bin/python
cache_file="./mydata.csv"
import numpy as np
import pandas as pd
zones = [1,2,3,8,9,10]
def create():
    data = []
    for z in zones:
        info = {'m': int(10*np.random.rand()), 'n': int(10*np.random.rand())}
        info.update({'zone':z})
        data.append(info)
    df = pd.DataFrame(data,index=zones)
    print "about to write this data:"
    print df
    df.to_csv(cache_file)
def read():
    df = pd.DataFrame.from_csv(cache_file)
    print "read this data:"
    print df
create()
read()
Sample output:
about to write this data:
m n zone
1 0 3 1
2 5 8 2
3 6 4 3
8 1 8 8
9 6 2 9
10 7 2 10
read this data:
m n zone
2013-12-01 0 3 1
2013-12-02 5 8 2
2013-12-03 6 4 3
2013-12-08 1 8 8
2013-12-09 6 2 9
2013-12-10 7 2 10
The CSV file looks OK, so the problem seems to be in reading not creating.
mydata.csv
,m,n,zone
1,0,3,1
2,5,8,2
3,6,4,3
8,1,8,8
9,6,2,9
10,7,2,10
I suppose this might be useful:
pd.__version__
0.12.0
Python version is python 2.7.5+
I want to record the zone as an index so I can easily pull out the corresponding
parameters later. How do I keep pandas.DataFrame.from_csv() from turning it into a date?
According to the docstring (pandas.DataFrame.from_csv? in IPython), the parse_dates argument defaults to True. Set it to False, i.e. pd.DataFrame.from_csv(cache_file, parse_dates=False).
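On recent pandas releases DataFrame.from_csv is no longer available (it was deprecated and later removed), so the following is a minimal sketch of the equivalent read using pd.read_csv, which leaves the index alone unless parse_dates is requested. It is written for Python 3, and the CSV text is the mydata.csv content from the question, fed through io.StringIO instead of a file:

```python
import io
import pandas as pd

# Content of mydata.csv as shown in the question
csv_text = """,m,n,zone
1,0,3,1
2,5,8,2
3,6,4,3
8,1,8,8
9,6,2,9
10,7,2,10
"""

# index_col=0 uses the first (unnamed) column as the index;
# read_csv does not try to parse it as dates by default.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.index.tolist())
print(df.loc[8])  # parameters for zone 8
```

With the zone kept as an integer index, df.loc[zone] pulls out the stored parameters for that zone.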

GLPK Mathprog group of sets

I'm trying to code a model that can solve the Multiple Choice Knapsack Problem (MCKP) as described in "Knapsack Problems involving dimensions, demands and multiple choice constraints: generalization and transformations between formulations" (found here, see figures 8 and 9). You can find an example GMPL model of the basic knapsack problem here. For anyone looking for a quick explanation of the knapsack problem, read the following illustration:
You are an adventurer and have stumbled upon a treasure trove. There are hundreds of wonderful items 'i' that each have a weight 'w' and a profit 'p'. Say you have a knapsack with weight capacity as 'c' and you want to make the most profit without overfilling your knapsack. What is the best combination of items such that you make the most profit?
In code:
maximize obj :
sum{(i,w,p) in I} p*x[i];
Where 'I' is the basket of items, and x[i] is the binary variable (0 = not chosen, 1 = chosen)
The part I am having trouble with is the addition of multiple groups. MCKP requires exactly one item to be selected from each group. So, for example, let's say we have three groups from which to choose. They could be represented as follows (ignore the actual values):
# Items: index, weight, profit
set ONE :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
# Items: index, weight, profit
set TWO :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
# Items: index, weight, profit
set THREE :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
I am confused on how I can iterate over each group and how I would define the variable x. I assume it would look something like:
var x{i,j} binary;
Where i is the index of an item within group j. This assumes I define a set of sets:
set Groups{ONE,TWO,THREE}
Then I'd iterate over the groups of items:
sum{j in Groups, (i,w,p) in Groups[j]} p*x[i,j];
But I am concerned because I believe GMPL does not support ordered sets. I have seen this related question where the answer suggests defining a set within a set. However, I am not sure how it would apply in this particular scenario.
My main question, to be clear: In GMPL, how can I iterate over sets of sets (in this case a set of groups where each group has a set of items)?
Unlike AMPL, GMPL doesn't support sets of sets. Here's how to do it in AMPL:
set Groups;
set Items{Groups} dimen 3;
# define x and additional constraints
# ...
maximize obj: sum{g in Groups, (i,w,p) in Items[g]} p*x[i];
data;
set Groups := ONE TWO THREE;
# Items: index, weight, profit
set Items[ONE] :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
# Items: index, weight, profit
set Items[TWO] :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
# Items: index, weight, profit
set Items[THREE] :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
If you have no more than 300 variables, you can use a free student version of AMPL and solvers (e.g. CPLEX or Gurobi).
Based on this GNU mailing list thread, I believe GMPL/MathProg has support for what you want to do. Here's their example:
set WORKERS;
param number_of_shifts, integer, >= 1;
set WORKER_CLIQUE{1..number_of_shifts}, within WORKERS;
data;
set WORKERS := Jack Kate Sawyer Sun Juliet Richard Desmond Hugo;
param number_of_shifts := 2;
set WORKER_CLIQUE[1] := Sawyer, Juliet;
set WORKER_CLIQUE[2] := Jack, Kate, Hugo;
In your example, I assume you'd use something like set Items{1..3}, within Groups; with the data block from @vitaut's answer.
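To sanity-check a small instance outside GMPL or AMPL, the "exactly one item from each group" rule described in the question can be sketched by brute force in Python. This is only an illustration and not a replacement for a solver: the groups, weights, profits, and the capacity below are hypothetical values, and enumerating every combination is feasible only for tiny inputs.

```python
from itertools import product

# Hypothetical groups: each item is (index, weight, profit).
groups = {
    "ONE": [(1, 10, 10), (2, 15, 15), (3, 24, 24)],
    "TWO": [(1, 10, 12), (2, 20, 20)],
    "THREE": [(1, 5, 4), (2, 50, 50)],
}
capacity = 50  # hypothetical knapsack capacity c

best_profit, best_choice = None, None
# product() picks exactly one item from every group.
for choice in product(*groups.values()):
    weight = sum(w for _, w, _ in choice)
    profit = sum(p for _, _, p in choice)
    # Keep the most profitable choice that fits in the knapsack.
    if weight <= capacity and (best_profit is None or profit > best_profit):
        best_profit, best_choice = profit, dict(zip(groups, choice))

print(best_profit)
print(best_choice)
```

Each candidate is a tuple with one item per group, which mirrors indexing the binary variable by both group and item as in the question's x[i,j].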

Django query aggregation

Imagine a number guessing game where one person thinks of a number and another person has to guess it. The game is over if the correct number was guessed.
The models might look like this
from django.db import models
class SecretNumber(models.Model):
    number = models.IntegerField()
class Guess(models.Model):
    secretnumber = models.ForeignKey(SecretNumber)
    guess = models.IntegerField()
After having played four times, the database might look like this:
id number
==========
1 10
2 54
3 68
4 25
id secretnumber_id guess
=============================
1 1 50
2 1 30
3 1 10
4 2 99
5 2 60
6 2 54
7 3 1
8 3 68
9 4 73
10 4 34
11 4 86
12 4 51
13 4 25
As you can see, the guesser was very lucky: it took him 3, 3, 2 and 5 guesses. But that's just to keep this example short.
Now I need to come up with a query that lets me display the following data:
Nb. guesses Count
=====================
2 1
3 2
5 1
A manual SQL statement would look something like this:
SELECT inner_count AS 'Nb. guesses', count(inner_count) AS 'Count' FROM (
SELECT secretnumber_id, count(id) AS inner_count FROM guess GROUP BY secretnumber_id
) GROUP BY inner_count
I thought about annotating an annotation, but this seems not to be possible.
Any ideas?
If you're using Django (i.e. models instead of plain classes), you want to use the QuerySet aggregation functions, e.g.:
from django.db.models import Count
guesses = Guess.objects.values('secretnumber').annotate(Count('secretnumber'))
This will give you a queryset of dictionaries, each of which has a secretnumber and a secretnumber__count value.
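The query above covers only the inner GROUP BY of the SQL in the question. One way to finish the job, sketched here as an assumption and not as part of the answer, is to do the outer grouping in Python with collections.Counter. The list below is hypothetical data standing in for the secretnumber_id column of a Guess table:

```python
from collections import Counter

# Hypothetical secretnumber_id values, one entry per Guess row
secret_ids = [1, 1, 1, 2, 2, 2, 3, 3]

# Inner step: number of guesses per secret number
per_secret = Counter(secret_ids)

# Outer step: how many secret numbers needed each number of guesses
histogram = Counter(per_secret.values())

for nb_guesses, count in sorted(histogram.items()):
    print(nb_guesses, count)
```

With a real queryset, the inner step would come from the annotated values and only the second Counter would be needed.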