Search for sequences in multiple vectors - regex

What is the easiest way to find the sequence I need in multiple vectors in R without using loops?
For example, I need to find the vectors in which "yahoo" comes after "google" (only the order matters).
seq = c("google","yahoo")
Matches:
vec1 = c("smth","google","smth","yahoo","smth")
Not matches:
vec2 = c("smth","yahoo","smth","google","smth")

Try this, assuming "yahoo" and "google" each appear only once:
library(dplyr)
dt = data.frame(vec1 = c("smth","google","smth","yahoo","smth"))
dt = dt %>% mutate(row = row_number()) # get the row number for each value of vec1
dt$row[dt$vec1=="google"] < dt$row[dt$vec1=="yahoo"] # returns T/F
If the values in vec1 are not unique, modify it as follows. This version compares the last occurrence (max row number) of each value:
dt = data.frame(vec1 = c("smth","google","smth","yahoo","smth"))
dt = dt %>% mutate(row = row_number()) %>%
group_by(vec1) %>% summarise(row = max(row)) # get the max row number for each unique value of vec1
dt$row[dt$vec1=="google"] < dt$row[dt$vec1=="yahoo"]

You can use the which function to find the positions of your search terms within a given vector:
which(vec1=="google")[1] < which(vec1=="yahoo")[1]
Use [1] if you are interested only in the first occurrence of each search term.


Applying Rcpp on a dataframe

I'm new to C++ and exploring faster computation in R through the Rcpp package. The actual dataframe contains over 2 million rows, and processing it is quite slow.
Existing Dataframes
Main Dataframe
df<-data.frame(z = c("a","b","c"), a = c(303,403,503), b = c(203,103,803), c = c(903,803,703))
Cost Dataframe
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8, "603" = 9, "703" = 10, "803" = 11, "903" = 12)
colnames(cost) <- c("103", "203", "303", "403", "503", "603", "703", "803", "903")
Steps
df contains z, which is a categorical variable with levels a, b and c. I used a merge operation from another dataframe to bring columns a, b and c into df with their specific numbers.
The first step is to match each row's z value with the column names (a, b or c), create a new column called 'type', and copy the corresponding number into it.
So the first row would read,
df$z[1] = "a"
df$type[1]= 303
Now it must match df$type with column names in another dataframe called 'cost' and create df$cost. The cost dataframe contains column names as numbers e.g. "103", "203" etc.
For our example, df$cost[1] = 6. It matches df$type[1] = 303 with cost$303[1]=6
Final Dataframe should look like this - Created a sample output
df1 <- data.frame(z = c("a","b","c"), type = c("303", "103", "703"), cost = c(6,4,10))
A possible solution, not very elegant but does the job:
library(reshape2)
tmp <- cbind(cost,melt(df)) # create a unique data frame
row.idx <- which(tmp$z==tmp$variable) # row index of matching values
col.val <- match(as.character(tmp$value[row.idx]), names(tmp) ) # find corresponding values in the column names
# now put all together
df2 <- data.frame('z'=unique(df$z),
'type' = tmp$value[row.idx],
'cost' = as.numeric(tmp[1,col.val]) )
the output:
> df2
z type cost
1 a 303 6
2 b 103 4
3 c 703 10
see if it works

New to Python - trying to choose individual columns from transposed matrix

My code is currently as follows:
table = []
with open("harrytest.csv") as f:
    for line in f:
        data = line.split(",")
        table.append(data)
transposed = [[table[j][i] for j in range(len(table))] for i in range(len(table[0]))]
openings = transposed[1][1: - 1]
openings = [float(i) for i in openings]
mean = sum(openings)/len(openings)
print mean
minimum = min(openings)
print minimum
maximum = max(openings)
print maximum
range1 = maximum - minimum
print range1
This only prints one column of 7 for me, and it also leaves out the bottom line. We are not allowed to import the csv module or use numpy or pandas. The only modules allowed are os, sys, math and datetime.
How do I write the code so as to get the median, first and last values for any column?
Change this line:
openings = transposed[1][1: - 1]
to this
openings = transposed[1][1:]
and the last row should appear. Your calculations for mean, min, max and range seem correct.
For the median you have to sort the row and select the one middle element, or the average of the two middle elements. The first and last elements are just row[0] and row[-1].
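A minimal sketch of that median rule, using only built-ins since the question rules out csv, numpy and pandas. The helper names `median` and `column_summary` and the sample values are hypothetical, not part of the original code:

```python
def median(values):
    """Middle element of the sorted values, or the mean of the two middle ones."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle elements (2.0 forces float division).
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def column_summary(column):
    """First, last and median value of one column of floats."""
    return {"first": column[0], "last": column[-1], "median": median(column)}


# Hypothetical sample standing in for one converted column such as `openings`.
openings = [3.0, 1.0, 4.0, 1.5]
summary = column_summary(openings)
print(summary)
```

To cover every column, the same helper could be applied to each row of `transposed` after dropping the header cell and converting the rest to float.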

Deleting duplicate x values and their corresponding y values

I am working with a list of points in Python 2.7 and running some interpolations on the data. My list has over 5000 points, and some "x" values are repeated, each with a different corresponding "y" value. I want to get rid of these repeated points so that my interpolation function will work: repeated "x" values with different "y" values raise an error because the data no longer satisfies the definition of a function. Here is a simple example of what I am trying to do:
Input:
x = [1,1,3,4,5]
y = [10,20,30,40,50]
Output:
xy = [(1,10),(3,30),(4,40),(5,50)]
The interpolation function I am using is InterpolatedUnivariateSpline(x, y)
Keep a variable that stores the previous x value; if it is the same as the current value, skip the current value. This only works if the points are already sorted by x.
For example (pseudocode; the Python is up to you):
int previousX = -1
foreach X
{
if(x == previousX)
{/*skip*/}
else
{
InterpolatedUnivariateSpline(x, y)
previousX = x /* store the x value that will be "previous" in the next iteration */
}
}
I am assuming you are already iterating, so you don't need the actual Python code.
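One way the skip-duplicates idea could look in Python, as a sketch rather than the answerer's own code. It remembers every x already seen instead of only the previous one, so it does not rely on the input being sorted; the function name `drop_duplicate_x` is hypothetical:

```python
def drop_duplicate_x(x, y):
    """Keep only the first (x, y) pair seen for each x value, preserving order."""
    seen = set()
    xy = []
    for xi, yi in zip(x, y):
        if xi in seen:
            continue  # a point with this x was already kept, so skip it
        seen.add(xi)
        xy.append((xi, yi))
    return xy


x = [1, 1, 3, 4, 5]
y = [10, 20, 30, 40, 50]
xy = drop_duplicate_x(x, y)
print(xy)  # [(1, 10), (3, 30), (4, 40), (5, 50)]
```

The filtered pairs can be split back into two lists with `zip(*xy)` and sorted by x before being passed to the spline.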
A bit late but if anyone is interested, here's a solution with numpy and pandas:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = [1,1,3,4,5]
y = [10,20,30,40,50]
#convert list into numpy arrays:
array_x, array_y = np.array(x), np.array(y)
# sort x and y by x value
order = np.argsort(array_x)
xsort, ysort = array_x[order], array_y[order]
#create a dataframe and add 2 columns for your x and y data:
df = pd.DataFrame()
df['xsort'] = xsort
df['ysort'] = ysort
#create new dataframe (mean) with no duplicate x values and corresponding mean values in all other cols:
mean = df.groupby('xsort').mean()
df_x = mean.index
df_y = mean['ysort']
# poly1d to create a polynomial line from coefficient inputs:
trend = np.polyfit(df_x, df_y, 14)
trendpoly = np.poly1d(trend)
# plot polyfit line:
plt.plot(df_x, trendpoly(df_x), linestyle=':', dashes=(6, 5), linewidth='0.8',
color=colour, zorder=9, figure=[name of figure])
Also, if you just use argsort() to put the values in order of x, the interpolation should work even without having to delete the duplicate x values. Trying on my own dataset:
polyfit on its own
sorting data in order of x first, then polyfit
sorting data, delete duplicates, then polyfit
... I get the same result twice
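For readers without pandas, the group-and-average step in that answer can be approximated with the standard library. This is a sketch under the assumption that averaging the y values of a repeated x is acceptable for the data; the function name `mean_y_per_x` is hypothetical:

```python
from collections import defaultdict


def mean_y_per_x(x, y):
    """Return sorted unique x values and the mean y for each of them."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    xs = sorted(groups)
    # float() keeps the division exact on Python 2 as well as Python 3.
    ys = [sum(groups[xi]) / float(len(groups[xi])) for xi in xs]
    return xs, ys


xs, ys = mean_y_per_x([1, 1, 3, 4, 5], [10, 20, 30, 40, 50])
print(xs)  # [1, 3, 4, 5]
print(ys)  # [15.0, 30.0, 40.0, 50.0]
```

Unlike keeping the first point, this changes the y value at x = 1 to the average of 10 and 20, so the choice depends on what the duplicates mean in the data.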

Matlab Codegen build error

I am trying to convert the below Matlab code into C++ using codegen. However it fails at build and I get the error:
"??? Unless 'rows' is specified, the first input must be a vector. If the vector is variable-size, the either the first dimension or the second must have a fixed length of 1. The input [] is not supported. Use a 1-by-0 or 0-by-1 input (e.g., zeros(1,0) or zeros(0,1)) to represent the empty set."
It then points to [id,m,n] = unique(id); being the culprit. Why doesn't it build and what's the best way to fix it?
function [L,num,sz] = label(I,n) %#codegen
% Check input arguments
error(nargchk(1,2,nargin));
if nargin==1, n=8; end
assert(ndims(I)==2,'The input I must be a 2-D array')
sizI = size(I);
id = reshape(1:prod(sizI),sizI);
sz = ones(sizI);
% Indexes of the adjacent pixels
vec = @(x) x(:);
if n==4 % 4-connected neighborhood
idx1 = [vec(id(:,1:end-1)); vec(id(1:end-1,:))];
idx2 = [vec(id(:,2:end)); vec(id(2:end,:))];
elseif n==8 % 8-connected neighborhood
idx1 = [vec(id(:,1:end-1)); vec(id(1:end-1,:))];
idx2 = [vec(id(:,2:end)); vec(id(2:end,:))];
idx1 = [idx1; vec(id(1:end-1,1:end-1)); vec(id(2:end,1:end-1))];
idx2 = [idx2; vec(id(2:end,2:end)); vec(id(1:end-1,2:end))];
else
error('The second input argument must be either 4 or 8.')
end
% Create the groups and merge them (Union/Find Algorithm)
for k = 1:length(idx1)
root1 = idx1(k);
root2 = idx2(k);
while root1~=id(root1)
id(root1) = id(id(root1));
root1 = id(root1);
end
while root2~=id(root2)
id(root2) = id(id(root2));
root2 = id(root2);
end
if root1==root2, continue, end
% (The two pixels belong to the same group)
N1 = sz(root1); % size of the group belonging to root1
N2 = sz(root2); % size of the group belonging to root2
if I(root1)==I(root2) % then merge the two groups
if N1 < N2
id(root1) = root2;
sz(root2) = N1+N2;
else
id(root2) = root1;
sz(root1) = N1+N2;
end
end
end
while 1
id0 = id;
id = id(id);
if isequal(id0,id), break, end
end
sz = sz(id);
% Label matrix
isNaNI = isnan(I);
id(isNaNI) = NaN;
[id,m,n] = unique(id);
I = 1:length(id);
L = reshape(I(n),sizI);
L(isNaNI) = 0;
if nargout>1, num = nnz(~isnan(id)); end
Just an FYI, if you are using MATLAB R2013b or newer, you can replace error(nargchk(1,2,nargin)) with narginchk(1,2).
As the error message says, for codegen unique requires that the input be a vector unless 'rows' is passed.
If you look at the report (click the "Open report" link that is shown) and hover over id you will likely see that its size is neither 1-by-N nor N-by-1. The requirement for unique can be seen if you search for unique here:
http://www.mathworks.com/help/coder/ug/functions-supported-for-code-generation--alphabetical-list.html
You could do one of a few things:
Make id a vector and treat it as a vector for the computation. Instead of the declaration:
id = reshape(1:prod(sizI),sizI);
you could use:
id = 1:numel(I)
Then id would be a row vector.
You could also keep the code as is and do something like:
[idtemp,m,n] = unique(id(:));
id = reshape(idtemp,size(id));
Obviously, this will cause a copy, idtemp, to be made but it may involve fewer changes to your code.
Remove the anonymous function stored in the variable vec and make vec a subfunction:
function y = vec(x)
coder.inline('always');
y = x(:);
Without the 'rows' option, the input to the unique function is always interpreted as a vector, and the output is always a vector, anyway. So, for example, something like id = unique(id) would have the effect of id = id(:) if all the elements of the matrix id were unique. There is no harm in making the input a vector going in. So change the line
[id,m,n] = unique(id);
to
[id,m,n] = unique(id(:));

R: combine lists of interest

I have a list like df_all (see below).
A = matrix( ceiling(10*runif(8)), nrow=4)
colnames(A) = c("country", "year_var")
dfa = data.frame(A)
df1 = dfa[1,]
df2 = dfa[2,]
df3 = dfa[3,]
df4 = dfa[4,]
df_all = list(df1, df2, df3, df4)
df_all
Now I want to combine the list elements of interest, selected by the variable a.
a <- "2,3,4"
b <- strsplit(a, ",")[[1]]
To combine these list elements, I use the following loop:
for (i in 1:length(b)){
c<-b[i]
aa <- df_all[c:c]
print(aa)
}
Now my question is: how can I combine this result and save it as a variable?
Thanks!
Would this work for you?
basnum<-as.integer(b)
do.call(rbind, df_all[basnum])
Through df_all[basnum], a list with only the relevant data.frames is created.
do.call takes a function and a list as parameters (and some more but not relevant right now). The items of the list are then passed on as parameters to the function.
So in this case, the above is the equivalent to calling:
rbind(df_all[[2]], df_all[[3]], df_all[[4]])
And this produces one data.frame holding all the rows of interest.