obj_id  Column B  Column C
a1      cat       bat
a2      bat       man
r1      man       apple
r2      apple     cat
The original dataframe (above) is called df.
I am trying to make a new column called new_obj_id: for each row, if the value in Column C matches the value of Column B in any row, new_obj_id should hold the obj_id of that matching row.
obj_id  Column B  Column C  new_obj_id
a1      cat       bat       a2
a2      bat       man       r1
r1      man       apple     r2
r2      apple     cat       a1
This is the expected table
This is what I tried but couldn't get through:
dataframe1['new_obj_id'] = dataframe1.apply(lambda x: x['obj_id']
if x['Column_B'] in x['Column C']
else 'none', axis=1)
Try this:
df['new_obj_id'] = df['Column C'].map(dict(zip(df['Column B'],df['obj_id'])))
Output:
0 a2
1 r1
2 r2
3 a1
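For reference, here is a self-contained version of that lookup built from the sample data in the question. It is only a sketch and assumes the values in Column B are unique: with duplicates, dict(zip(...)) keeps the last obj_id seen, and any value in Column C that has no match in Column B comes out as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "obj_id": ["a1", "a2", "r1", "r2"],
    "Column B": ["cat", "bat", "man", "apple"],
    "Column C": ["bat", "man", "apple", "cat"],
})

# Lookup table: value in Column B -> obj_id of the row it belongs to
lookup = dict(zip(df["Column B"], df["obj_id"]))

# For each value in Column C, fetch the obj_id of the row whose Column B matches
df["new_obj_id"] = df["Column C"].map(lookup)

print(df)
```

If unmatched rows should show a placeholder instead of NaN, chaining .fillna('none') after .map(...) is one option.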
I have an .xlsx file. The rows and values are fine, but I need to reorder the columns according to a list of new column positions, e.g.:
old = [1, 2, 3, 4]
new = [2, 1, 4, 3]
I have checked the docs and there is no straightforward option for this.
I've tried iterating over the columns:
old = {cell.column: cell.value for cell in ws[1]}.keys()  # [1, 2, 3, 4]
new = [2, 1, 4, 3]
for i, col in enumerate(ws.iter_cols(max_col=old[-1]), 1):
    if old[i-1] != new[i-1]:
        for one in ws[get_column_letter(i)]:
            old_cell = one
            new_cell = ws.cell(old_cell.row, old[new[i-1]]-1)
            new_cell.value, old_cell.value = old_cell.value, new_cell.value
        old[i] = new_cell.column
        old[new_cell.column] = old_cell.column
...but it works only in a few cases. I'm probably missing a general solution.
In the end it should look like this, for example with old = [1, 2, 3, 4] and new = [2, 1, 4, 3]:
Input file:
x A B C D
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
Output file:
x A B C D
1 B1 A1 D1 C1
2 B2 A2 D2 C2
3 B3 A3 D3 C3
4 B4 A4 D4 C4
Your current approach risks overwriting cells. I'd be tempted to move the cells from existing columns to new columns in the correct order and then delete the old ones.
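# Stage the columns to the right of the data in their new order, then delete the originals
for c in ws['A']:
    new_cell = c.offset(column=5)  # A -> F, the second staged column
    new_cell.value = c.value
for c in ws['B']:
    new_cell = c.offset(column=3)  # B -> E, the first staged column
    new_cell.value = c.value
# ... stage C and D the same way before deleting ...
ws.delete_cols(1, amount=4)  # the staged columns shift left into A, B, ...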
This is just to give you the general idea; it could be optimised and parametrised so that you handle each row at once.
Or you could use move_range:
ws.move_range(f"A1:A{ws.max_row}", cols=5)
In either case be careful not to overwrite existing cells before you've moved them.
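One way to avoid the overwriting problem altogether is to read every row into memory first, rearrange it, and only then write it back. Below is a minimal sketch of that idea. The helper names reorder_row and reorder_columns are made up for illustration, and I am assuming that new[i] is the old (1-based) column that should end up at position i + 1. Note that this copies values only, not styles or formulas adjusted for the new position.

```python
def reorder_row(values, new_order):
    """Return the row rearranged so that position i holds old column new_order[i] (1-based)."""
    return [values[old - 1] for old in new_order]


def reorder_columns(ws, new_order):
    """Rewrite the first len(new_order) columns of ws in the requested order.

    ws is assumed to be an openpyxl worksheet; only iter_rows() and cell() are used.
    """
    # Read everything first so that no cell is overwritten before it has been copied
    rows = [
        reorder_row(row, new_order)
        for row in ws.iter_rows(min_col=1, max_col=len(new_order), values_only=True)
    ]
    for r, values in enumerate(rows, start=1):
        for c, value in enumerate(values, start=1):
            ws.cell(row=r, column=c, value=value)


# The pure-Python part can be checked without a workbook
print(reorder_row(["A1", "B1", "C1", "D1"], [2, 1, 4, 3]))
```

With a loaded workbook you would call reorder_columns(ws, [2, 1, 4, 3]) and then save the workbook as usual.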
I have a CSV file that contains "pivot-like" data that I would like to store in a pandas DataFrame. The original data file uses a varying number of leading spaces to indicate the level within the pivot data, like so:
Text that I do not want to include,,
,Text that I do not want to include,Text that I do not want to include
,header A,header B
Total,100,100
A,,2.15
a1,,2.15
B,,0.22
b1,,0.22
" slightly longer name"...,,0.22
b3,,0.22
C,71.08,91.01
c1,57.34,73.31
c2,5.34,6.76
c3,1.33,1.67
x1,0.26,0.33
x2,0.26,0.34
x3,0.48,0.58
x4,0.33,0.42
c4,3.52,4.33
x5,0.27,0.35
x6,0.21,0.27
x7,0.49,0.56
x8,0.44,0.47
x9,0.15,0.19
x10,,0.11
x11,0.18,0.23
x12,0.18,0.23
x13,0.67,0.85
x14,0.24,0.2
x15,0.68,0.87
c5,0.48,0.76
x16,,0.15
x17,0.3,0.38
x18,0.18,0.23
d2,6.75,8.68
d3,0.81,1.06
x19,0.3,0.38
x20,0.51,0.68
Others,24.23,0
N/A,,
"Text that I do not want to include(""at all"") ",,
(It looks awful, but you should be able to paste it into e.g. Notepad to see it a bit more clearly.)
Basically, there are only two columns a and b, but the rows are indented using 0, 3, 6, 9, ... etc whitespaces to differentiate between the levels. So for instance,
zero level, the main group, A has 0 spaces,
first level a1 has 3 spaces,
second level a2 has 6 spaces,
third level a3 has 9 spaces and
fourth and final level has 12 spaces with the corresponding values for columns a and b respectively.
I would now like to be able to read and group this data on these levels in order to create a new summarizing DataFrame, with columns corresponding to these different levels, looking like:
Level 4 Diff(a,b) Level 0 Level 1 Level 2 Level 3
x7 525 C c1 c2 c3
x5 -0.03 A a1 a22 NaN
x4 -0.04 A a1 a22 NaN
x8 -0.08 C c1 c2 c3
…
Any clue on how to do this?
Thanks
Easiest is to split this into different functions
read the file
parse the lines
generate the 'tree'
construct the DataFrame
Parse the lines
def parse_file(file):
    import ast
    import re
    pat = re.compile(r'^( *)(\w+),([\d.]+),([\d.]+)$')
    for line in file:
        r = pat.match(line)
        if r:
            spaces, label, a, b = r.groups()
            diff = ast.literal_eval(a) - ast.literal_eval(b)
            yield len(spaces)//3, label, diff
Reads each line, yields the level, 'label' and diff using a regular expression. I use ast to convert the string to int or float
Generate the tree
def parse_lines(lines):
    previous_label = list(range(5))
    for level, label, diff in lines:
        previous_label[level] = label
        if level == 4:
            yield tuple(previous_label), diff
Initiates a list of length 5, and then overwrites the level this node is on.
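To see the two steps working together without pandas, here is a small standalone sketch. The sample text is hypothetical (3 spaces of indentation per level, as described in the question), the depth parameter is my own addition, and I use float() instead of ast for brevity, so treat it as an illustration rather than a drop-in replacement.

```python
import re
from io import StringIO

# Hypothetical sample: 3 spaces of indentation per level, levels 0 to 4
SAMPLE = (
    "C,71.08,91.01\n"
    "   c1,57.34,73.31\n"
    "      c2,5.34,6.76\n"
    "         c3,1.33,1.67\n"
    "            x1,0.26,0.33\n"
    "            x2,0.26,0.34\n"
    "         c4,3.52,4.33\n"
    "            x5,0.27,0.35\n"
)

LINE = re.compile(r"^( *)(\w+),([\d.]+),([\d.]+)$")


def parse_file(file):
    """Yield (level, label, a - b) for every line that carries both values."""
    for line in file:
        m = LINE.match(line)
        if m:
            spaces, label, a, b = m.groups()
            yield len(spaces) // 3, label, float(a) - float(b)


def parse_lines(lines, depth=5):
    """Yield (path, diff) for each leaf; path holds one label per level."""
    path = [None] * depth
    for level, label, diff in lines:
        path[level] = label  # remember the most recent label seen at this level
        if level == depth - 1:
            yield tuple(path), diff


rows = list(parse_lines(parse_file(StringIO(SAMPLE))))
for path, diff in rows:
    print(path, round(diff, 2))
```

Each leaf comes out with the labels of all its ancestors, which is exactly what the MultiIndex is built from in the next step.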
Construct the DataFrame
from io import StringIO
import pandas as pd

# file_content is assumed to hold the raw text of the file
with StringIO(file_content) as file:
    lines = parse_file(file)
    index, data = zip(*parse_lines(lines))
idx = pd.MultiIndex.from_tuples(index, names=[f'level_{i}' for i in range(len(index[0]))])
df = pd.DataFrame(data={'Diff(a,b)': list(data)}, index=idx)
Opens the file, constructs the index and generates the DataFrame with the different levels in the index. If you don't want the levels in the index, you can add a .reset_index() or construct the DataFrame slightly differently.
df
level_0 level_1 level_2 level_3 level_4 Diff(a,b)
A a1 a2 a3 x1 -0.07
A a1 a2 a3 x2 -0.08000000000000002
A a1 a22 a3 x3 -0.04999999999999999
A a1 a22 a3 x4 -0.04000000000000001
A a1 a22 a3 x5 -0.03
A a1 a22 a3 x6 -0.06999999999999998
C c1 c2 c3 x7 525.0
C c1 c2 c3 x8 -0.08000000000000002
alternative for missing levels
def parse_lines(lines):
    labels = [None] * 5
    previous_level = None
    for level, label, diff in lines:
        labels[level] = label
        if level == 4:
            # the previous line was above level 3, so the levels in between are missing
            if previous_level is not None and previous_level < 3:
                labels = labels[:previous_level + 1] + [None] * (4 - previous_level)
                labels[level] = label
            yield tuple(labels), diff
        previous_level = level
The items under a22 don't seem to have a level_3, so the first version copies it from the previous group. If this is unwanted, you can use this variation.
df
level_0 level_1 level_2 level_3 level_4 Diff(a,b)
C c1 c2 c3 x1 -0.07
C c1 c2 c3 x2 -0.08000000000000002
C c1 c2 c3 x3 -0.09999999999999998
C c1 c2 c3 x4 -0.08999999999999997
C c1 c2 c4 x5 -0.07999999999999996
C c1 c2 c4 x6 -0.060000000000000026
C c1 c2 c4 x7 -0.07000000000000006
C c1 c2 c4 x8 -0.02999999999999997
C c1 c2 c4 x9 -0.04000000000000001
C c1 c2 c4 x11 -0.05000000000000002
C c1 c2 c4 x12 -0.05000000000000002
C c1 c2 c4 x13 -0.17999999999999994
C c1 c2 c4 x14 0.03999999999999998
C c1 c2 c4 x15 -0.18999999999999995
C c1 c2 c5 x17 -0.08000000000000002
C c1 c2 c5 x18 -0.05000000000000002
C c1 d2 d3 x19 -0.08000000000000002
C c1 d2 d3 x20 -0.17000000000000004
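As a quick check of the missing-levels idea, here is a standalone sketch run on a hypothetical list of (level, label, diff) tuples in which the leaves x3 and x4 sit directly under a22, with no level-3 line in between. The depth parameter and the sample values are my own assumptions.

```python
def parse_lines(lines, depth=5):
    """Yield (labels, diff) per leaf, blanking levels that were skipped in the input."""
    labels = [None] * depth
    previous_level = None
    for level, label, diff in lines:
        labels[level] = label
        if level == depth - 1:
            if previous_level is not None and previous_level < depth - 2:
                # Levels between the previous line and this leaf are missing: reset them
                labels = labels[:previous_level + 1] + [None] * (depth - 1 - previous_level)
                labels[level] = label
            yield tuple(labels), diff
        previous_level = level


lines = [
    (0, "A", 0.0), (1, "a1", 0.0), (2, "a2", 0.0), (3, "a3", 0.0),
    (4, "x1", -0.07),
    (2, "a22", 0.0),
    (4, "x3", -0.05),  # leaf directly under a22, no level-3 line
    (4, "x4", -0.04),
]

rows = list(parse_lines(lines))
for labels, diff in rows:
    print(labels, diff)
```

The None entries become NaN once the tuples are turned into a DataFrame, which matches the NaN in the Level 3 column of the table the question asks for.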
I have a dataframe like the following:
f1 f2 class n
0 weekly_return 0.155796 ab weekly
1 monthly_return 0.153907 ab monthly
2 volume_ratio 0.123844 NaN volume
3 margin_selling_balance 0.115411 ad margin
4 margin_debt_balance 0.107883 ae margin
5 rv_ratio 0.077373 NaN rv
..................................................................
and there is a list named lst_n:
lst_n = ['rv', 'ag', 'rg', ...........]
I want to set the value of the class column to 'class_a' wherever the value of n is in lst_n. For example, in the row with index 5, n is rv, which is in lst_n, so class should be set to 'class_a'.
My code is below, but it raises an error:
lst_n = ['rv', 'ag', 'rg', ...........]
df.loc[df.n is in lst_n, 'class'] = 'class_a'
The error:
df.loc[df.n is in lst_n, 'class'] = 'class_a'
^
SyntaxError: invalid syntax
thanks!
You need isin for mask:
lst_n = ['rv', 'ag', 'rg']
df.loc[df['n'].isin(lst_n), 'class'] = 'class_a'
print (df)
f1 f2 class n
0 weekly_return 0.155796 ab weekly
1 monthly_return 0.153907 ab monthly
2 volume_ratio 0.123844 NaN volume
3 margin_selling_balance 0.115411 ad margin
4 margin_debt_balance 0.107883 ae margin
5 rv_ratio 0.077373 class_a rv
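To make it clearer what is going on: `is in` is not a Python operator, and even a plain `df.n in lst_n` would not be evaluated row by row. isin returns one boolean per row, and .loc then assigns only where that mask is True. A small sketch on a cut-down, made-up version of the frame:

```python
import pandas as pd

df = pd.DataFrame({
    "f1": ["weekly_return", "volume_ratio", "rv_ratio"],
    "class": ["ab", None, None],
    "n": ["weekly", "volume", "rv"],
})
lst_n = ["rv", "ag", "rg"]

# Element-wise membership test: one boolean per row
mask = df["n"].isin(lst_n)
print(mask.tolist())

# Assign only in the rows where the mask is True
df.loc[mask, "class"] = "class_a"
print(df)
```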
Another solution with Series.mask:
df['class'] = df['class'].mask(df.n.isin(lst_n), 'class_a')
print (df)
f1 f2 class n
0 weekly_return 0.155796 ab weekly
1 monthly_return 0.153907 ab monthly
2 volume_ratio 0.123844 NaN volume
3 margin_selling_balance 0.115411 ad margin
4 margin_debt_balance 0.107883 ae margin
5 rv_ratio 0.077373 class_a rv
If you need a bit of performance, you can use np.where.
import numpy as np
df['class'] = np.where(df.n.isin(lst_n), 'class_a', df['class'])
df
Out[942]:
f1 f2 class n
0 weekly_return 0.155796 ab weekly
1 monthly_return 0.153907 ab monthly
2 volume_ratio 0.123844 NaN volume
3 margin_selling_balance 0.115411 ad margin
4 margin_debt_balance 0.107883 ae margin
5 rv_ratio 0.077373 class_a rv
I am very new to Splunk and need your help resolving the issue below.
I have two CSV files uploaded to a Splunk instance. Each file and its fields are listed below.
Apple.csv
a. A1
b. A2
c. A3
Orange.csv
a. O1 (may have values matching values of A3)
b. O2
My requirement is as follows:
Select the values of A1, A2, A3 and O2 from Apple.csv and Orange.csv
where A1="X" and A2="Y" and A3=O1
and display the values in a table:
A1 A2 A3
X Y 123
LP HJK 222
X Y 999
O1 O2
999 open
123 closed
65432 open
Output
A1 A2 A3 O2
X Y 123 closed
X Y 999 open
Very much appreciate your help.
You could do this
source="apple.csv" OR source="orange.csv"
| eval grouping=coalesce(A3,O1)
| stats first(A1) as A1 first(A2) as A2 first(A3) as A3 first(O2) as O2 by grouping
| fields - grouping
Although I would think that considering the timestamp of the events might also be important...
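If it helps to see what the query does, here is the same logic sketched in plain Python on the sample rows from the question. This is only an illustration of the idea (one shared key per event, then the first value of each field per key), not how Splunk evaluates it. The final filter on A1, A2 and a matching O2 is something I added to reproduce the table asked for; the query above does not include it.

```python
apple = [
    {"A1": "X", "A2": "Y", "A3": "123"},
    {"A1": "LP", "A2": "HJK", "A3": "222"},
    {"A1": "X", "A2": "Y", "A3": "999"},
]
orange = [
    {"O1": "999", "O2": "open"},
    {"O1": "123", "O2": "closed"},
    {"O1": "65432", "O2": "open"},
]

groups = {}
for event in apple + orange:
    # eval grouping=coalesce(A3, O1): every event gets one shared key
    key = event.get("A3") or event.get("O1")
    merged = groups.setdefault(key, {})
    # stats first(...) by grouping: keep the first value seen for each field
    for field, value in event.items():
        merged.setdefault(field, value)

# Extra filter taken from the question (not part of the query above)
result = [
    {field: g[field] for field in ("A1", "A2", "A3", "O2")}
    for g in groups.values()
    if g.get("A1") == "X" and g.get("A2") == "Y" and "O2" in g
]
print(result)
```

Here "first" simply means list order, whereas in Splunk it depends on the order in which events are returned, which is one more reason the timestamps may matter.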