I have a large dataset that I would like some help with. An example is given below:
id id_row material
1 1 1 1
2 1 2 1
3 1 3 1
4 2 1 1
5 2 2 2
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 1
11 4 2 2
I would like to add a new column based on the values in material for the same id (across rows). In the new colum, I would like all id with values 1 and 2 in material (across rows) to be identified (e.g. as value 99) and if not both are present then return either 1 or 2.
Something like this:
id id_row material new_column
1 1 1 1 1
2 1 2 1 1
3 1 3 1 1
4 2 1 1 99
5 2 2 2 99
6 2 3 1 99
7 3 1 2 2
8 3 2 2 2
9 3 3 2 2
10 4 1 1 99
11 4 2 2 99
I have been looking online for a solution without any luck as well as tried using dplyr and group_by, mutate and ifelse without any luck. Thank you in advance!
Try this approach:
library(tidyverse)
tribble(
~id, ~id_row, ~material,
1, 1, 1,
1, 2, 1,
1, 3, 1,
2, 1, 1,
2, 2, 2,
2, 3, 1,
3, 1, 2,
3, 2, 2,
3, 3, 2,
4, 1, 1,
4, 2, 2
) |>
group_by(id) |>
mutate(new_column = if_else(any(material == 2) & any(material == 1), 99, NA_real_),
new_column = if_else(is.na(new_column), material, new_column))
#> # A tibble: 11 × 4
#> # Groups: id [4]
#> id id_row material new_column
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1
#> 2 1 2 1 1
#> 3 1 3 1 1
#> 4 2 1 1 99
#> 5 2 2 2 99
#> 6 2 3 1 99
#> 7 3 1 2 2
#> 8 3 2 2 2
#> 9 3 3 2 2
#> 10 4 1 1 99
#> 11 4 2 2 99
Created on 2022-05-25 by the reprex package (v2.0.1)
Related
I am currently working on my first dataset as a PhD student. I have a dataset where several conditions have not been finished. In the dataset, this is visibly when 4 or more columns in a row have the value "1" (see example below). I want all the "1" values which do not depict "real" numbers (instead, they are "NAs) replaced by NA.
Any suggestions on how I could succeed?
example <- tibble(
a = c(1, 2, 3, 4, 5, 6, 7, 3, 4, 2, 7, 1),
b = c(1, 1, 1, 1, 1, 1, 2, 3, 4, 5, 6, 2),
c = c(3, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1),
d = c(5, 1, 2, 3, 1, 1, 1, 1, 1, 4, 1, 5),
e = c(4, 1, 3, 4, 1, 1, 1, 1, 2, 3, 7, 5),
f = c(3, 7, 6, 1, 1, 1, 1, 2, 1, 1, 1, 1))
This means I have this:
a b c d e f
1 1 1 3 5 4 3
2 2 1 1 1 1 7
3 3 1 1 2 3 6
4 4 1 1 3 4 1
5 5 1 1 1 1 1
6 6 1 4 1 1 1
7 7 2 1 1 1 1
8 3 3 1 1 1 2
9 4 4 1 1 2 1
10 2 5 1 4 3 1
11 7 6 1 1 7 1
12 1 2 1 5 5 1
And I need this:
a b c d e f
1 1 1 3 5 4 3
2 2 NA NA NA NA 7
3 3 1 1 2 3 6
4 4 1 1 3 4 1
5 5 1 NA NA NA NA
6 6 1 4 1 1 1
7 7 2 NA NA NA NA
8 3 3 1 1 1 2
9 4 4 1 1 2 1
10 2 5 1 4 3 1
11 7 6 1 1 7 1
12 1 2 1 5 5 1
Thank you very much!!
I have a pandas dataframe which has lists as values. I would like to transform this dataframe into the format in expected result. The dataframe is too large(1 million rows)
import pandas as pd
import numpy as np
df = pd.DataFrame(
[[['A', 'Second'], [], 'N/A', [6]],
[[2, 3], [3, 4, 6], [3, 4, 5, 7], [2, 6, 3, 4]]],
columns=list('ABCD')
)
df.replace('N/A',np.NaN, inplace=True)
df
A B C D
0 [A,Second] [] NaN [6]
1 [2,3] [3,4,6] [3,4,5,7] [2,6,3,4]
Expected result
0 A A
0 A Second
0 D 6
1 A 2
1 A 3
1 B 3
1 B 4
1 B 6
1 C 3
1 C 4
1 C 5
1 C 7
1 D 2
1 D 6
1 D 3
1 D 4
`
You can use double stack:
df1 = df.stack()
df = pd.DataFrame(df1.values.tolist(), index=df1.index).stack()
.reset_index(level=2,drop=True).reset_index()
df.columns = list('abc')
print (df)
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4
df.stack().apply(pd.Series).stack().reset_index(2, True) \
.rename_axis(['a', 'b']).reset_index(name='c')
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4
I have dataframe look like this:
a b c d e
0 0 1 2 1 0
1 3 0 0 4 3
2 3 4 0 4 2
3 4 1 0 4 3
4 2 1 3 4 3
5 3 2 0 3 3
6 2 1 1 1 0
7 1 1 0 3 3
8 3 3 3 3 4
9 2 3 4 2 2
I do following command:
df.groupby('A').sum()
And i get:
b c d e
a
0 1 2 1 0
1 1 0 3 3
2 5 8 7 5
3 9 3 14 12
4 1 0 4 3
And after that I want to access
labels = df['A']
But I have an error that there are no such column.
So does pandas have some syntax to get something like this?
a b c d e
0 0 1 2 1 0
1 1 1 0 3 3
2 2 5 8 7 5
3 3 9 3 14 12
4 4 1 0 4 3
I need to sum all values of columns b, c, d, e to column a with the relevant index
You can just access the index with df.index, and add it back into your dataframe as another column.
grouped_df = df.groupby('A').sum()
grouped_df['A'] = grouped_df.index
grouped_df.sum(axis=1)
Alternatively, groupby has 'as_index' option to keep the column 'A'
groupby('A', as_index=False)
or, after groupby, you can use reset_index to put the column 'A' back.
I create a working example dataset:
input ///
group value
1 3
1 2
1 3
2 4
2 6
2 7
3 4
3 4
3 4
3 4
4 17
4 2
5 3
5 5
5 12
end
My goal is to figure out the maximum distance between incremental values within group. For group 2, this would be 2, because the next highest value after 4 is 6. Note that the only value relevant to 4 is 6, not 7, because 7 is not the next highest value after 4. The result for group 3 is 0 because there is only one value in group 3. There will only be one result per group.
What I want to get:
input ///
group value result
1 3 1
1 2 1
1 3 1
2 4 2
2 6 2
2 7 2
3 4 0
3 4 0
3 4 0
3 4 0
4 17 15
4 2 15
5 3 7
5 5 7
5 12 7
end
The order is not important, so the order just above can change with no problem.
Any tips?
I may have figured it out:
bys group (value): gen d = value[_n+1] - value[_n]
bys group: egen result = max(d)
drop d
I want to create a pivot table from a pandas dataframe
using dataframe.pivot()
and include not only dataframe columns but also the data within the dataframe index.
Couldn't find any docs that show how to do that.
Any tips?
Use reset_index to make the index a column:
In [45]: df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]}, index=np.arange(6)*10)
In [46]: df
Out[46]:
x y
0 1 0
10 2 1
20 2 2
30 3 3
40 1 4
50 3 4
In [47]: df.reset_index()
Out[47]:
index x y
0 0 1 0
1 10 2 1
2 20 2 2
3 30 3 3
4 40 1 4
5 50 3 4
So pivot uses the index as values:
In [48]: df.reset_index().pivot(index='y', columns='x')
Out[48]:
index
x 1 2 3
y
0 0 NaN NaN
1 NaN 10 NaN
2 NaN 20 NaN
3 NaN NaN 30
4 40 NaN 50