Merge lists with 0 1 encoding - list

I have the following case in python:
a = [[0,0,1,0],
[0,0,0,1],
[1,0,0,1],
[1,0,1,1]]
b = [[1,1,0,0],
[1,0,0,1],
[0,1,0,0]]
c = [[1,0,1,0],
[0,0,1,0],
[0,1,0,0]]
d = [[1,0,1,0],
[0,0,1,0],
[0,0,0,0],
[0,0,0,1],
[1,0,0,0]]
a has length 4, b has length 3, c has length 3, d has length 5, and I have several more lists of variable length.
What I want is to construct a function that can merge the "sub lists" considering the columns, for example:
def combine(foo):
    ...
    print(foo)
combine(a) = [1,0,1,1]
combine(b) = [1,1,0,1]
combine(c) = [1,1,1,0]
combine(d) = [1,0,1,1]
How can I do it?
Thanks for your help.
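One way to read the expected outputs above is as a column-wise logical OR: a position in the result is 1 if any row has a 1 in that column. The following is only a sketch under that assumption. The body of combine is a guess at the intended behaviour, and it returns the merged list instead of printing it.

```python
def combine(rows):
    """Column-wise OR of equally long 0/1 rows (assumed interpretation)."""
    # zip(*rows) yields one tuple per column; any() is True if the column holds a 1.
    return [int(any(column)) for column in zip(*rows)]


a = [[0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 1],
     [1, 0, 1, 1]]
b = [[1, 1, 0, 0],
     [1, 0, 0, 1],
     [0, 1, 0, 0]]

print(combine(a))  # [1, 0, 1, 1]
print(combine(b))  # [1, 1, 0, 1]
```

Because zip stops at the shortest row, this sketch assumes every sub list has the same number of columns.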

Related

Applying Rcpp on a dataframe

I'm new to C++ and exploring faster computation possibilities in R through the Rcpp package. The actual dataframe contains about 2 million rows, and the operation below is quite slow on it.
Existing Dataframes
Main Dataframe
df<-data.frame(z = c("a","b","c"), a = c(303,403,503), b = c(203,103,803), c = c(903,803,703))
Cost Dataframe
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8, "603" = 9, "703" = 10, "803" = 11, "903" = 12)
colnames(cost) <- c("103", "203", "303", "403", "503", "603", "703", "803", "903")
Steps
df contains z which is a categorical variable with levels a, b and c. I had done a merge operation from another dataframe to bring in a,b,c into df with the specific nos.
First step would be to match each row in z with the column names (a,b or c) and create a new column called 'type' and copy the corresponding number.
So the first row would read,
df$z[1] = "a"
df$type[1]= 303
Now it must match df$type with column names in another dataframe called 'cost' and create df$cost. The cost dataframe contains column names as numbers e.g. "103", "203" etc.
For our example, df$cost[1] = 6. It matches df$type[1] = 303 with cost$303[1]=6
Final Dataframe should look like this - Created a sample output
df1 <- data.frame(z = c("a","b","c"), type = c("303", "103", "703"), cost = c(6,4,10))
A possible solution, not very elegant but does the job:
library(reshape2)
tmp <- cbind(cost,melt(df)) # create a unique data frame
row.idx <- which(tmp$z==tmp$variable) # row index of matching values
col.val <- match(as.character(tmp$value[row.idx]), names(tmp) ) # find corresponding values in the column names
# now put all together
df2 <- data.frame('z'=unique(df$z),
                  'type' = tmp$value[row.idx],
                  'cost' = as.numeric(tmp[1,col.val]) )
the output:
> df2
z type cost
1 a 303 6
2 b 103 4
3 c 703 10
see if it works

Find starting and ending index of each unique character in a string in python

I have a string with repeated characters. My job is to find the starting index and ending index of each unique character in that string. Below is my code.
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    mo = re.search(item,x)
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag,m,n)
Output :
a 0 1
b 3 4
c 7 8
Here the end index of each character is not correct. I understand why it is happening, but how can I pass the character to be matched dynamically to the regex search function? For instance, if I hardcode the character in the search function, it provides the desired output:
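x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+",x)
flag = "b"  # hardcoded here; there is no loop variable in this snippet
m = mo.start()
n = mo.end()
print(flag,m,n-1)  # end() is exclusive, so n-1 is the index of the last matched character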
output:
b 2 5
The code above gives the correct result, but here I can't pass the character to be matched dynamically.
It would be a real help if someone could let me know how to achieve this. Any hint will do. Thanks in advance.
String literal formatting to the rescue:
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    # for patterns better use raw strings - and format the letter into it
    mo = re.search(fr"{item}+",x) # fr and rf work both :) its a raw formatted literal
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag,m,n) # fix upper limit by n-1
Output:
a 0 3 # you do see that the upper limit is off by 1?
b 3 7 # see above for fix
c 7 9
Your pattern does not need the [] around the letter - you are matching just one anyhow.
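Formatting the letter straight into the pattern works for plain letters, but a character such as "." or "+" would be read as regex syntax. A minimal sketch, assuming inclusive end indices are wanted, that escapes the character first. The helper name run_span is made up for illustration.

```python
import re


def run_span(text, ch):
    """Return (start, inclusive_end) of the first run of ch in text, or None."""
    # re.escape keeps characters such as "." or "+" from being read as regex syntax.
    match = re.search(re.escape(ch) + "+", text)
    if match is None:
        return None
    # match.end() is exclusive, so subtract 1 to get the index of the last character.
    return match.start(), match.end() - 1


x = "aaabbbbcc"
for ch in sorted(set(x)):  # sorted() only makes the print order predictable
    print(ch, *run_span(x, ch))
```

Like re.search in the code above, this only reports the first run of each character; a string such as "aabaa" would need a different approach for the second run of "a".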
Without regex [1]:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx,ch in enumerate(x[1:],1):
    if last_ch == ch:
        continue
    else:
        print(last_ch,start_idx, idx-1)
        last_ch = ch
        start_idx = idx
# the last run is still open when the loop ends
print(ch,start_idx,idx)
output:
a 0 2 # not off by 1
b 3 6
c 7 8
[1] RegEx: And now you have 2 problems...
Looking at the output, I'm guessing that another option would be,
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
    end = start + len(item[0])
    output += (f"{item[1]} {start} {end}\n")
    start = end
print(output)
Output
a 0 3
b 3 7
c 7 9
I think it runs in O(N) time; you can benchmark it if you like.
import re, time
timer_on = time.time()
for i in range(10000000):
    x = "aabbbbccc"
    xs = re.findall(r"((.)\2*)", x)
    start = 0
    output = ''
    for item in xs:
        end = start + len(item[0])
        output += (f"{item[1]} {start} {end}\n")
        start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)

How to query list of variables in Matlab struct matching a certain pattern?

Suppose I have the following struct in Matlab (read from a JSON file):
>>fs.
fs.dichte fs.hoehe fs.ts2
fs.temperatur fs.ts3 fs.viskositaet
fs.ts1 fs.ts4
Each one of the fs.ts* components contains another struct. In this particular case, the index of ts goes from 1 to 4, but in another case it could as well be 2 or 7. You get the idea, right? I want the program to be flexible enough to handle any possible input.
So my question comes down to: How to query the maximum index of ts?
In an ideal world, this would work:
who fs.ts*
But unfortunately, it just returns nothing. Any ideas?
(Btw, I'm using Octave and don't have Matlab available for testing; however, there should really be a solution to this which works in both environments.)
You can use fieldnames to get all field names of the struct, then use regexp to extract the ones that start with ts and extract the number. Then you can compare the numbers to find the largest.
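fields = fieldnames(fs);
% numeric suffix of every field named ts<N>; all other fields give NaN
number = str2double(regexp(fields, '(?<=^ts)\d+$', 'once', 'match'));
numbers = number(~isnan(number));  % the ts indices that are actually present
[max_index, ind] = max(number);    % max skips NaN, so ind still indexes into fields
max_field = fields{ind};
max_value = fs.(max_field);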
Not an answer to your exact question but sounds like instead of tsN fields, you should have a single ts field with a list.
Tip: every time you see a number in a variable or field name, think whether you shouldn't be using a vector/array/list instead.
This is true for all languages but more so for Octave, since everything is an array. Even if you have three fields named ts1, ts2, and ts3 with scalar values, what you really have is three fields whose values are arrays of size 1x1.
In Octave you have two options: either the value of ts is a cell array in which each element is a scalar struct, or it is a struct array. Use a cell array of structs when each struct has different keys; use a struct array when all structs have the same keys.
Struct array
octave> fs.ts = struct ("foo", {1, 2, 3, 4}, "bar", {"a", "b", "c", "d"});
octave> fs.ts # all keys/fields in the ts struct array
ans =
1x4 struct array containing the fields:
foo
bar
octave> fs.ts.foo # all foo values in the ts struct array
ans = 1
ans = 2
ans = 3
ans = 4
octave> numel (fs.ts) # number of structs in the ts struct array
ans = 4
octave> fs.ts(1) # struct 1 in the ts struct array
ans =
scalar structure containing the fields:
foo = 1
bar = a
octave> fs.ts(1).foo # foo value of the struct 1
ans = 1
Cell array of scalar structs
However, I'm not sure whether JSON supports anything like struct arrays, so you will probably need a list of structs. In that case, you will end up with a cell array of scalar structs.
octave> fs.ts = {struct("foo", 1, "bar", "a"), struct("foo", 2, "bar", "b"), struct("foo", 3, "bar", "c"), struct("foo", 4, "bar", "d"),};
octave> fs.ts # note, you actually have multiple structs
ans =
{
[1,1] =
scalar structure containing the fields:
foo = 1
bar = a
[1,2] =
scalar structure containing the fields:
foo = 2
bar = b
[1,3] =
scalar structure containing the fields:
foo = 3
bar = c
[1,4] =
scalar structure containing the fields:
foo = 4
bar = d
}
octave-gui:28> fs.ts{1} # get struct 1
ans =
scalar structure containing the fields:
foo = 1
bar = a
octave-gui:29> fs.ts{1}.foo # value foo from struct 1
ans = 1

Using Pandas to subset data from a dataframe based on multiple columns?

I am new to python. I have to extract a subset from a pandas dataframe based on 2 lists corresponding to 2 columns in that dataframe. The values in the two lists should be matched as pairs, by position, against the two columns of the same row. I have tried the "isin" function, but it does not work with combinations.
from pandas import *
d = {'A' : ['a', 'a', 'c', 'a','b'] ,'B' : [1, 2, 1, 4,1]}
df = DataFrame(d)
list1 = ['a','b']
list2 = [1,2]
print(df)
A B
0 a 1
1 a 2
2 c 1
3 a 4
4 b 1
### Using isin function
df[(df.A.isin(list1)) & (df.B.isin(list2)) ]
A B
0 a 1
1 a 2
4 b 1
###Desired outcome
d2 = {'A' : ['a'], 'B':[1]}
DataFrame(d2)
A B
0 a 1
Please let me know if this can be done without using loops and if there is a way to do it in a single step.
A quick and dirty way to do this is using zip:
df['C'] = list(zip(df['A'], df['B']))  # list() is needed on Python 3, where zip is lazy
list3 = list(zip(list1, list2))
df2 = df[df['C'].isin(list3)]
print(df2)
A B C
0 a 1 (a, 1)
You can of course drop the newly created column after you're done filtering on it.
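Why pairing helps can be seen without pandas. Testing each column separately accepts any mix of allowed values, while a membership test on (A, B) tuples accepts only the exact combinations. A small sketch in plain Python, using the data from the question; the names wanted, mask and rows are chosen here for illustration.

```python
A = ['a', 'a', 'c', 'a', 'b']
B = [1, 2, 1, 4, 1]
list1 = ['a', 'b']
list2 = [1, 2]

# Pair the two lists by position: ('a', 1) and ('b', 2).
wanted = set(zip(list1, list2))

# A row is kept only if its (A, B) pair is one of the wanted pairs.
mask = [pair in wanted for pair in zip(A, B)]
rows = [pair for pair, keep in zip(zip(A, B), mask) if keep]

print(mask)  # [True, False, False, False, False]
print(rows)  # [('a', 1)]
```

The rows ('a', 2) and ('b', 1) pass the per-column test but are rejected here, which matches the desired outcome in the question.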

Reading In Integers in Python

So, my question is simple; I'm just struggling with syntax here. I need to read in a set of integers: 3, 11, 2, 4, 4, 5, 6, 10, 8, -12. I want to place those integers in a list as I read them. The first integer n is the size of the n x n array in which the rest are presented, so if n = 3, I will be passed something like this: 3 \n 11 2 4 \n 4 5 6 \n 10 8 -12 (\n symbolizing a new line in the input file).
n = int(raw_input().strip())
a = []
for a_i in xrange(n):
    value = int(raw_input().strip())
    a.append(value)
print(a)
I receive this error from the above code:
value = int(raw_input().strip())
ValueError: invalid literal for int() with base 10: '11 2 4'
The actual challenge can be found here, https://www.hackerrank.com/challenges/diagonal-difference .
I have already completed this in Java and C++ and am simply trying to do it in Python now, but I'm not good at Python. If someone is willing (no obligation), I would like to see the proper way to read in an entire line, say " 11 2 4 ", create a new list out of that line, and add it to an already existing list. Then all I have to do is look up the desired index, as in list[ desiredInternalList[ ] ].
You can split the string at white space and convert the entries into integers.
This gives you one list:
for a_i in xrange(n):
    a.extend([int(x) for x in raw_input().split()])
and this a list of lists:
for a_i in xrange(n):
    a.append([int(x) for x in raw_input().split()])
You get this error because you try to give all inputs in one line. To handle this issue you may use this code
n = int(raw_input().strip())
a = []
while len(a)< n*n:
    x=raw_input().strip()
    x = map(int,x.split())
    a.extend(x)
print(a)
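The snippets above are Python 2 (raw_input, xrange). For Python 3, the same line-splitting idea might look like the sketch below. It reads from any text stream so it can be tried without typing input; the function name read_square_matrix and the io.StringIO sample are assumptions made for illustration.

```python
import io


def read_square_matrix(stream):
    """Read n, then n lines of whitespace-separated integers, from a text stream."""
    n = int(stream.readline().strip())
    matrix = []
    for _ in range(n):
        # split() with no argument copes with repeated spaces and the trailing newline.
        row = [int(token) for token in stream.readline().split()]
        matrix.append(row)
    return matrix


sample = io.StringIO("3\n11 2 4\n4 5 6\n10 8 -12\n")
matrix = read_square_matrix(sample)
print(matrix)        # [[11, 2, 4], [4, 5, 6], [10, 8, -12]]
print(matrix[2][2])  # -12
```

For real input, passing sys.stdin instead of the StringIO object should behave the same way, and matrix[i][j] then gives the element in row i, column j.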