Test for new id combinations in R - c++

I am looking to create an indicator that checks whether a group takes new combinations of numbers or not. I have a dataset like this one:
combinations <- data.frame(combination_id = c(1, 1, 1, 1,
2, 2, 2,
3,
4,
5, 5, 5, 5,
6, 6, 6),
number = c(20, 10, 12, 18,
20, 10, 12,
20,
40,
20, 10, 30, 18,
18, 30, 10))
What I want is the following:
dataset_2 <- data.frame(combination_id = c(1, 1, 1, 1,
2, 2, 2,
3,
4,
5, 5, 5, 5,
6, 6, 6),
number = c(20, 10, 12, 18,
20, 10, 12,
20,
40,
20, 10, 30, 18,
18, 30, 10),
new_combination = c(1, 1, 1, 1,
0,0,0,
0,
1,
1,1, 1, 1,
0, 0, 0))
Basically an indicator new_combination that is 1 if any of the possible combinations in that combination_id is new (i.e. not present in the lower values of combination_id) or if it is just one number that has not been seen, and is zero if a number is alone but has been seen before (as 20 in group 3) or if all combinations have been seen before (as in groups 2 and 6).
So the first group takes value of 1 because none of those numbers or combinations have been taken before, group 2 takes the value of 0 because all possible combinations are also in group 1, group 3 is only one number that has been seen before so takes the value of 0. Group 4 has a new number (40) so takes the value of 1. Group 5 has new combinations with the number 30 so takes the value of 1 and group 6 has no new combinations so takes the value of zero.
I hope this made it clear what I am looking for.
Any ideas? Thank you so much.

library(data.table)
setDT(combinations)
combinations[, new_combinations := ifelse(
combination_id %in% combinations[rowid(number) == 1, combination_id], 1, 0)]
# combination_id number new_combinations
# 1: 1 20 1
# 2: 1 10 1
# 3: 1 12 1
# 4: 1 18 1
# 5: 2 20 0
# 6: 2 10 0
# 7: 2 12 0
# 8: 3 20 0
# 9: 4 40 1
#10: 5 20 1
#11: 5 10 1
#12: 5 30 1
#13: 5 18 1
#14: 6 18 0
#15: 6 30 0
#16: 6 10 0

dplyr approach:
require(dplyr)
combinations %>% dplyr::mutate(new_combination = !duplicated(number)) %>%
group_by(combination_id) %>%
dplyr::mutate(new_combination = as.numeric(any(new_combination))) %>%
ungroup()
combination_id number new_combination
<dbl> <dbl> <dbl>
1 1 20 1
2 1 10 1
3 1 12 1
4 1 18 1
5 2 20 0
6 2 10 0
7 2 12 0
8 3 20 0
9 4 40 1
10 5 20 1
11 5 10 1
12 5 30 1
13 5 18 1
14 6 18 0
15 6 30 0
16 6 10 0

A base R option with ave + duplicated
transform(
combinations,
new_combination = ave(+!duplicated(number), combination_id, FUN = max)
)
gives
combination_id number new_combination
1 1 20 1
2 1 10 1
3 1 12 1
4 1 18 1
5 2 20 0
6 2 10 0
7 2 12 0
8 3 20 0
9 4 40 1
10 5 20 1
11 5 10 1
12 5 30 1
13 5 18 1
14 6 18 0
15 6 30 0
16 6 10 0
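The three answers above flag a group when it contains a number that has not appeared in any earlier row, which reproduces the expected output for the sample data. A group built only from previously seen numbers, but pairing them in a way no earlier group did, would get 0 under that rule. The following is a minimal sketch of a stricter check, written in Python for illustration and assuming that "combination" means an unordered pair of numbers within one combination_id; the function name flag_new_combinations is hypothetical.

```python
from itertools import combinations


def flag_new_combinations(rows):
    """rows: list of (combination_id, number) tuples.

    Returns one 0/1 flag per row. A group is flagged 1 if it contains a
    number or an unordered pair of numbers not seen in any lower id.
    Assumption: "combination" means an unordered pair within one group.
    """
    # Collect the distinct numbers of each group.
    groups = {}
    for cid, number in rows:
        groups.setdefault(cid, set()).add(number)

    seen_numbers, seen_pairs = set(), set()
    flags = {}
    # Process groups from the lowest id upwards.
    for cid in sorted(groups):
        numbers = groups[cid]
        pairs = {frozenset(p) for p in combinations(numbers, 2)}
        is_new = bool(numbers - seen_numbers) or bool(pairs - seen_pairs)
        flags[cid] = int(is_new)
        seen_numbers |= numbers
        seen_pairs |= pairs
    return [flags[cid] for cid, _ in rows]


ids = [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 5, 5, 5, 6, 6, 6]
nums = [20, 10, 12, 18, 20, 10, 12, 20, 40, 20, 10, 30, 18, 18, 30, 10]
print(flag_new_combinations(list(zip(ids, nums))))
```

On the question's data this gives the same flags as the answers above; the two rules differ only when a group reuses known numbers in a new pairing.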

Related

I have a string that I want to split. I tried the following, but the result is wrong. Need some help finding the error in my code

library(stringr)
I want to extract the second number from the string per_ocu in a bigger table. I used a nested ifelse with substr(), but the result is the last table below: it only extracts a value for the last case, even though the same approach worked previously for the perocu_min version. I have also tried trimming the string. What is wrong with my code?
t_perocu
A tibble: 7 × 3
Groups: per_ocu [7]
per_ocu char_perocu n
1 0 a 5 personas 14 1471172
2 101 a 250 personas 18 3531
3 11 a 30 personas 16 37998
4 251 y más personas 20 3014
5 31 a 50 personas 16 5178
6 51 a 100 personas 17 3468
7 6 a 10 personas 15 85071
This is my code. (I want to make sure there is no bug in my R Studio Version.)
denue_1$perocu_max <- ifelse(denue_1$char_perocu == 14, substr(denue_1$per_ocu, 5, 1),
ifelse(denue_1$char_perocu == 15, substr(denue_1$per_ocu, 5, 2),
ifelse(denue_1$char_perocu == 16, substr(denue_1$per_ocu, 6, 2),
ifelse(denue_1$char_perocu == 17, substr(denue_1$per_ocu, 6, 3),
ifelse(denue_1$char_perocu == 18, substr(denue_1$per_ocu, 7, 3),
ifelse(denue_1$char_perocu == 20, substr(denue_1$per_ocu, 1, 3), 0))))))
table(denue_1$perocu_max)
This is what I got:
251
1606418 3014
This is what I was expecting
View(tperocu_max)
tperocu_max
Var1 Freq emp
1 5 1471172 7355860
2 10 85071 850710
3 30 37998 1139940
4 50 5178 258900
5 100 3468 346800
6 250 3531 882750
7 251 3014 756514
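A likely cause: R's substr(x, start, stop) takes a stop position, not a length, so a call such as substr(x, 5, 1) returns an empty string because stop is before start. Only the last branch, substr(x, 1, 3), has start below stop, which would explain why 1606418 rows come back blank and 3014 rows come back as 251. A position-independent alternative is to pull the number out by pattern instead of by character offsets. The sketch below shows that idea in Python for illustration; the function name upper_bound is hypothetical, and it assumes the wanted value is always the last run of digits in the label.

```python
import re


def upper_bound(label):
    """Return the last integer in a label such as '11 a 30 personas'.

    Assumption: the value wanted is the last run of digits, so
    '251 y más personas' yields 251. Returns None if no digits occur.
    """
    numbers = re.findall(r"\d+", label)
    return int(numbers[-1]) if numbers else None


labels = [
    "0 a 5 personas",
    "6 a 10 personas",
    "11 a 30 personas",
    "31 a 50 personas",
    "51 a 100 personas",
    "101 a 250 personas",
    "251 y más personas",
]
print([upper_bound(label) for label in labels])
```

Because the pattern does not depend on string length, no nested conditional on char_perocu is needed.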

Replace values with NA based on condition

I am currently working on my first dataset as a PhD student. I have a dataset where several conditions have not been finished. In the dataset, this is visible when 4 or more columns in a row have the value "1" (see example below). I want all the "1" values that do not represent real numbers (they are actually NAs) replaced by NA.
Any suggestions on how I could succeed?
library(tibble)
example <- tibble(
a = c(1, 2, 3, 4, 5, 6, 7, 3, 4, 2, 7, 1),
b = c(1, 1, 1, 1, 1, 1, 2, 3, 4, 5, 6, 2),
c = c(3, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1),
d = c(5, 1, 2, 3, 1, 1, 1, 1, 1, 4, 1, 5),
e = c(4, 1, 3, 4, 1, 1, 1, 1, 2, 3, 7, 5),
f = c(3, 7, 6, 1, 1, 1, 1, 2, 1, 1, 1, 1))
This means I have this:
a b c d e f
1 1 1 3 5 4 3
2 2 1 1 1 1 7
3 3 1 1 2 3 6
4 4 1 1 3 4 1
5 5 1 1 1 1 1
6 6 1 4 1 1 1
7 7 2 1 1 1 1
8 3 3 1 1 1 2
9 4 4 1 1 2 1
10 2 5 1 4 3 1
11 7 6 1 1 7 1
12 1 2 1 5 5 1
And I need this:
a b c d e f
1 1 1 3 5 4 3
2 2 NA NA NA NA 7
3 3 1 1 2 3 6
4 4 1 1 3 4 1
5 5 1 NA NA NA NA
6 6 1 4 1 1 1
7 7 2 NA NA NA NA
8 3 3 1 1 1 2
9 4 4 1 1 2 1
10 2 5 1 4 3 1
11 7 6 1 1 7 1
12 1 2 1 5 5 1
Thank you very much!!
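One way to read the rule is "blank out every run of at least four consecutive 1s within a row". The sketch below implements that reading in Python for illustration; the function name mask_runs_of_ones and the min_run parameter are hypothetical. Note that this is only one interpretation: for row 5 (5 1 1 1 1 1) it blanks all five 1s, whereas the desired output above keeps b = 1, and the question does not say which is intended.

```python
def mask_runs_of_ones(row, min_run=4):
    """Replace every run of at least `min_run` consecutive 1s with None.

    Assumption: a "run" is counted across adjacent columns of one row,
    and all columns (including the first) are eligible.
    """
    out = list(row)
    i = 0
    while i < len(row):
        if row[i] != 1:
            i += 1
            continue
        # Find the end of the current run of 1s.
        j = i
        while j < len(row) and row[j] == 1:
            j += 1
        if j - i >= min_run:
            out[i:j] = [None] * (j - i)
        i = j
    return out


print(mask_runs_of_ones([2, 1, 1, 1, 1, 7]))
print(mask_runs_of_ones([6, 1, 4, 1, 1, 1]))
```

Rows with four 1s that are not adjacent, such as row 6, are left unchanged under this reading.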

Conditional mutation across rows (by group/id)?

I have a large dataset that I would like some help with. An example is given below:
id id_row material
1 1 1 1
2 1 2 1
3 1 3 1
4 2 1 1
5 2 2 2
6 2 3 1
7 3 1 2
8 3 2 2
9 3 3 2
10 4 1 1
11 4 2 2
I would like to add a new column based on the values in material for the same id (across rows). In the new column, I would like every id that has both values 1 and 2 in material (across rows) to be identified (e.g. with the value 99); if only one of them is present, the new column should return that value, either 1 or 2.
Something like this:
id id_row material new_column
1 1 1 1 1
2 1 2 1 1
3 1 3 1 1
4 2 1 1 99
5 2 2 2 99
6 2 3 1 99
7 3 1 2 2
8 3 2 2 2
9 3 3 2 2
10 4 1 1 99
11 4 2 2 99
I have been looking online for a solution without any luck as well as tried using dplyr and group_by, mutate and ifelse without any luck. Thank you in advance!
Try this approach:
library(tidyverse)
tribble(
~id, ~id_row, ~material,
1, 1, 1,
1, 2, 1,
1, 3, 1,
2, 1, 1,
2, 2, 2,
2, 3, 1,
3, 1, 2,
3, 2, 2,
3, 3, 2,
4, 1, 1,
4, 2, 2
) |>
group_by(id) |>
mutate(new_column = if_else(any(material == 2) & any(material == 1), 99, NA_real_),
new_column = if_else(is.na(new_column), material, new_column))
#> # A tibble: 11 × 4
#> # Groups: id [4]
#> id id_row material new_column
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1
#> 2 1 2 1 1
#> 3 1 3 1 1
#> 4 2 1 1 99
#> 5 2 2 2 99
#> 6 2 3 1 99
#> 7 3 1 2 2
#> 8 3 2 2 2
#> 9 3 3 2 2
#> 10 4 1 1 99
#> 11 4 2 2 99
Created on 2022-05-25 by the reprex package (v2.0.1)
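The grouping logic used in the answer can also be restated without dplyr: first collect the set of material values per id, then assign 99 to every row whose id has both 1 and 2, and the row's own material otherwise. A minimal sketch in Python for illustration, using only the standard library; the function name add_new_column is hypothetical.

```python
def add_new_column(rows):
    """rows: list of (id, id_row, material) tuples.

    Returns the rows with a fourth value: 99 if the id has both
    material 1 and material 2 somewhere in its rows, else the material.
    """
    # First pass: collect the distinct materials seen for each id.
    materials = {}
    for gid, _, material in rows:
        materials.setdefault(gid, set()).add(material)

    # Second pass: flag ids that contain both 1 and 2.
    result = []
    for gid, id_row, material in rows:
        has_both = {1, 2} <= materials[gid]
        result.append((gid, id_row, material, 99 if has_both else material))
    return result


data = [
    (1, 1, 1), (1, 2, 1), (1, 3, 1),
    (2, 1, 1), (2, 2, 2), (2, 3, 1),
    (3, 1, 2), (3, 2, 2), (3, 3, 2),
    (4, 1, 1), (4, 2, 2),
]
for row in add_new_column(data):
    print(row)
```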

C++ Sort vector by index

I need to sort a std::vector by index. Let me explain it with an example:
Imagine I have a std::vector of 12 positions (but can be 18 for example) filled with some values (it doesn't have to be sorted):
Vector Index: 0 1 2 3 4 5 6 7 8 9 10 11
Vector Values: 3 0 2 3 2 0 1 2 2 4 5 3
I want to reorder it in blocks of 3 indices. This means: the first 3 [0-2] stay, then I need to have [6-8], and then the others. So it will end up like this (new index 3 has the value of previous index 6):
Vector Index: 0 1 2 3 4 5 6 7 8 9 10 11
Vector Values: 3 0 2 1 2 2 3 2 0 4 5 3
I'm trying to make it in one line using std::sort + lambda but I can't get it. Also discovered the std::partition() function and tried to use it but the result was really bad hehe
Found also this similar question which orders by odd and even index but can't figure out how to make it in my case or even if it is possible: Sort vector by even and odd index
Thank you so much!
Note 0: No, my vector is not always sorted. It was just an example. I've changed the values
Note 1: I know it sounds strange... think of it as if the vector positions were: yes yes yes no no no yes yes yes no no no yes yes yes... so the 'yes' positions keep their relative order but come before the 'no' positions.
Note 2: If there isn't a way with a lambda, then I thought of doing it with a loop and auxiliary variables, but I think that is uglier.
Note 3: Another example:
Vector Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Vector Values: 3 0 2 3 2 0 1 2 2 4 5 3 2 3 0 0 2 1
Sorted Values: 3 0 2 1 2 2 2 3 0 3 2 0 4 5 3 0 2 1
The final Vector Values is sorted (in term of old index): 0 1 2 6 7 8 12 13 14 3 4 5 9 10 11 15 16 17
You can imagine those indices in 2 columns, so I want first the left ones and then the right ones:
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
You don't want std::sort, you want std::rotate.
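#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::vector<int> v = {20, 21, 22, 23, 24, 25,
                          26, 27, 28, 29, 30, 31};
    auto b = std::next(std::begin(v), 3); // skip first three elements
    auto const re = std::end(v);          // keep track of the actual end
    // Same condition as "b + 6 < re", but never moves an iterator past the end.
    while (std::distance(b, re) > 6) {
        auto mid = std::next(b, 3); // boundary between the two groups of 3
        auto e = std::next(b, 6);   // the end of our current block
        std::rotate(b, mid, e);     // move [mid, e) in front of [b, mid)
        b = e;
    }
    // print the results
    std::copy(std::begin(v), std::end(v), std::ostream_iterator<int>(std::cout, " "));
}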
This code assumes you always do two groups of 3 for each rotation, but you could obviously work with whichever arbitrary ranges you wanted.
The output looks like what you'd want:
20 21 22 26 27 28 23 24 25 29 30 31
Update: @Blastfurnace pointed out that std::swap_ranges would work as well. The rotate call can be replaced with the following line:
std::swap_ranges(b, mid, mid); // passing mid twice on purpose
With the range-v3 library, you can write this quite conveniently, and it's very readable. Assuming your original vector is called input:
namespace rs = ranges;
namespace rv = ranges::views;
// input [3, 0, 2, 3, 2, 0, 1, 2, 2, 4, 5, 3, 2, 3, 0, 0, 2, 1]
auto by_3s = input | rv::chunk(3); // [[3, 0, 2], [3, 2, 0], [1, 2, 2], [4, 5, 3], [2, 3, 0], [0, 2, 1]]
auto result = rv::concat(by_3s | rv::stride(2), // [[3, 0, 2], [1, 2, 2], [2, 3, 0]]
by_3s | rv::drop(1) | rv::stride(2)) // [[3, 2, 0], [4, 5, 3], [0, 2, 1]]
| rv::join
| rs::to<std::vector<int>>; // [3, 0, 2, 1, 2, 2, 2, 3, 0, 3, 2, 0, 4, 5, 3, 0, 2, 1]
Here's a demo.
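The reordering asked for in Note 3 (all "left" blocks first, then all "right" blocks, each keeping its original order) amounts to a stable partition on the parity of the block index i / 3. A rotate over each block of six handles the 12-element case, but for the 18-element example the left blocks from later in the vector also have to move to the front, which is what the range-v3 version does. The same idea as a minimal sketch in Python, for illustration only; the function name left_blocks_first and the block parameter are hypothetical.

```python
def left_blocks_first(values, block=3):
    """Stable reorder: even-numbered blocks first, then odd-numbered ones.

    Element i belongs to block i // block; relative order inside each
    half is preserved.
    """
    left = [v for i, v in enumerate(values) if (i // block) % 2 == 0]
    right = [v for i, v in enumerate(values) if (i // block) % 2 == 1]
    return left + right


values = [3, 0, 2, 3, 2, 0, 1, 2, 2, 4, 5, 3, 2, 3, 0, 0, 2, 1]
print(left_blocks_first(values))
# Applied to the indices themselves, this shows where each element came from.
print(left_blocks_first(list(range(18))))
```

In C++ the same predicate could be applied while copying by index into a second vector; an in-place std::stable_partition is less direct because its predicate receives the value, not the position.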

pandas pivot table using index data of dataframe

I want to create a pivot table from a pandas dataframe
using dataframe.pivot()
and include not only dataframe columns but also the data within the dataframe index.
Couldn't find any docs that show how to do that.
Any tips?
Use reset_index to make the index a column:
In [45]: df = pd.DataFrame({'x': [1, 2, 2, 3, 1, 3], 'y': [0, 1, 2, 3, 4, 4]}, index=np.arange(6)*10)
In [46]: df
Out[46]:
x y
0 1 0
10 2 1
20 2 2
30 3 3
40 1 4
50 3 4
In [47]: df.reset_index()
Out[47]:
index x y
0 0 1 0
1 10 2 1
2 20 2 2
3 30 3 3
4 40 1 4
5 50 3 4
So pivot uses the index as values:
In [48]: df.reset_index().pivot(index='y', columns='x')
Out[48]:
index
x 1 2 3
y
0 0 NaN NaN
1 NaN 10 NaN
2 NaN 20 NaN
3 NaN NaN 30
4 40 NaN 50