I'm trying to adjust some columns with negative values in my table; I want all negative values to be changed to 0.
The only problem is that the columns keep changing their names, so I would like to make this adjustment based on column position.
For example, the columns are in the 3rd and 4th positions.
I have created a conditional column to adjust the negative volumes:
#"New Column" = Table.AddColumn(#"Previous Step", "New Column", each if OldColumnName < 0 then 0 else NewColumn),
Is there a way to make this conditional column based on the OldColumn position, and not by its name?
Add a custom column with this formula:
= if Record.Field(_,Table.ColumnNames(Source){2})<0 then 0 else Record.Field(_,Table.ColumnNames(Source){2})
or
= if Record.Field(_,Table.ColumnNames(Source){2})<0 then 0 else [some other column]
where {2} is the zero-based position in the list of column names, so {2} refers to the third column.
To transform the column in place and remove the negatives, instead of adding a new column:
Stepname = Table.TransformColumns(#"PriorStepNameHere",{{Table.ColumnNames(#"PriorStepNameHere"){2}, each if _<0 then 0 else _, Int64.Type}})
For multiple column transformations:
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
ColumnsToTransform = {Table.ColumnNames(Source){2},Table.ColumnNames(Source){3}},
#"MultipleTransform" = Table.TransformColumns(Source, List.Transform(ColumnsToTransform,(x)=>{x, each if _<0 then 0 else _, type number}))
in #"MultipleTransform"
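For readers who want to see the position-based idea outside of Power Query, here is a minimal sketch in plain Python. It is an illustration only, not a replacement for the M code above: the function name clamp_negatives_by_position and the sample column names are made up, and the table is assumed to be a list of dicts.

```python
def clamp_negatives_by_position(rows, header, positions):
    """Return new rows with negatives replaced by 0 in the columns at `positions`.

    `positions` are zero-based, like {2} and {3} in the M code above.
    """
    # Resolve the current column names from their positions, so the logic
    # keeps working when the names change.
    targets = {header[i] for i in positions}
    return [
        {
            name: (0 if name in targets and value < 0 else value)
            for name, value in row.items()
        }
        for row in rows
    ]


# Hypothetical sample data: the 3rd and 4th columns hold the volumes.
header = ["Region", "Product", "Vol Jan", "Vol Feb"]
rows = [
    {"Region": "N", "Product": "A", "Vol Jan": -5, "Vol Feb": 12},
    {"Region": "S", "Product": "B", "Vol Jan": 7, "Vol Feb": -3},
]
cleaned = clamp_negatives_by_position(rows, header, [2, 3])
```

Only the columns at the given positions are touched; text columns are never compared with 0 because the name check comes first.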
I have a table with the following column names:
A
B
C
D
E
F
G
I need to rename my columns so that from a certain column onwards they are numbered sequentially:
A
B
C
D (1)
E (2)
F (3)
G (4)
I know how to do it manually, but since I have 65 such columns I was hoping to use something like Table.TransformColumnNames to do it programmatically.
Many thanks!
Here's one way. It starts with a table named Table1 as the source.
let
Source = Table1,
//Replace the "D" below with the name of your column that you want to start numbering at
#"Get Column Number to Start Adding Numbers At" = List.PositionOf(Table.ColumnNames(Source),"D"),
#"Setup Column Numbers" = List.Transform({1..List.Count(Table.ColumnNames(Source))}, each if _-#"Get Column Number to Start Adding Numbers At" > 0 then " (" & Text.From(_-#"Get Column Number to Start Adding Numbers At") & ")" else ""),
#"Create New Column Names" = List.Zip({Table.ColumnNames(Source), #"Setup Column Numbers"}),
#"Converted to Table" = Table.FromList(#"Create New Column Names", Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Extracted Values" = Table.TransformColumns(#"Converted to Table", {"Column1", each Text.Combine(List.Transform(_, Text.From)), type text}),
Result = Table.RenameColumns(Source, List.Zip({Table.ColumnNames(Source),#"Extracted Values"[Column1]}))
in
Result
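The renaming rule itself (leave the names before the start column alone, then append a running number) can be sketched in a few lines of Python. This is only an illustration of the logic in the query above; number_columns_from is a made-up name.

```python
def number_columns_from(names, start_name):
    """Append " (1)", " (2)", ... to every name from `start_name` onwards."""
    start = names.index(start_name)  # zero-based, like List.PositionOf
    return [
        name if i < start else "{} ({})".format(name, i - start + 1)
        for i, name in enumerate(names)
    ]


new_names = number_columns_from(list("ABCDEFG"), "D")
print(new_names)
```

With the seven columns from the question, the first three names stay as they are and D to G become D (1) to G (4).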
Maybe you could pivot the columns that need a number, then add an index and create a new concatenated column that includes the number. Then remove the other columns and unpivot again?
I have a table of Project:
that I would like to filter by the FIELD, OPERATOR, and VALUE columns contained in the Project Group table:
The Power Query M to apply this filter would be:
let
Source = #"Project",
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Projectid", Int64.Type}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type", each [Projectid] >= 100000 and [Projectid] <= 500000)
in
#"Filtered Rows"
Results (need to remove the error row):
How do I convert the FIELD, OPERATOR, and VALUE columns into a function that can be used as a condition for the SelectRows function?
If you need to do comparisons, it might be best to first change the types of the columns being compared (in both tables), preferably to type number.
The code below assumes that:
the OPERATOR column of Project Group table can only contain: > or < and that these values should be interpreted as >= and <= respectively.
the column in Project table (that needs to be compared) can change and its name will be in the FIELD column of the Project Group. It's assumed that the name matches exactly. If this is not the case, you might need to standardise things (or at least perform a case-insensitive search) to ensure values can be mapped to column names correctly.
Based on the assumptions above, here's one approach:
let
// Dummy table for example purposes
project = Table.FromColumns({
{0..10},
{5..15}
}, type table [projectId = number, name = number]),
// Dummy table for example purposes
projectGroup = Table.FromColumns({
{"projectId", "projectId"},
{">", "<"},
{5, 7}
}, type table [FIELD = text, OPERATOR = text, VALUE = number]),
// Should take in a row from "Project" table and return a boolean
// representing whether said row matches the criteria contained
// within "Project Group" table.
selectorFunc = (projectRow as record) as logical =>
let
shouldKeepProjectRow = Table.MatchesAllRows(projectGroup, (projectGroupRow as record) =>
let
fieldNameToCheck = projectGroupRow[FIELD],
valueFromProjectRow = Record.Field(projectRow, fieldNameToCheck),
compared = if projectGroupRow[OPERATOR] = ">" then
valueFromProjectRow >= projectGroupRow[VALUE]
else
valueFromProjectRow <= projectGroupRow[VALUE]
in compared
)
in shouldKeepProjectRow,
selectedRows = Table.SelectRows(project, selectorFunc)
in
selectedRows
The main function used is Table.MatchesAllRows (https://learn.microsoft.com/en-us/powerquery-m/table-matchesallrows).
Another approach could potentially be: Expression.Evaluate: https://learn.microsoft.com/en-us/powerquery-m/expression-evaluate. However, I've not used it, so I'm not sure whether there are any "gotchas"/implications to be aware of.
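The general pattern here (turn each FIELD / OPERATOR / VALUE row into a test, and keep a row only if every test passes) can be sketched in Python as well. This is a rough illustration under the same assumptions as the M code above, in particular that ">" means >= and "<" means <=. The names OPERATORS and make_selector are hypothetical.

```python
import operator

# Same assumption as in the M code: ">" is read as >= and "<" as <=.
OPERATORS = {">": operator.ge, "<": operator.le}


def make_selector(rules):
    """Build a predicate that is True when a row satisfies every rule."""
    def selector(row):
        return all(
            OPERATORS[rule["OPERATOR"]](row[rule["FIELD"]], rule["VALUE"])
            for rule in rules
        )
    return selector


# Dummy data mirroring the example tables above.
project = [{"projectId": i, "name": i + 5} for i in range(11)]
project_group = [
    {"FIELD": "projectId", "OPERATOR": ">", "VALUE": 5},
    {"FIELD": "projectId", "OPERATOR": "<", "VALUE": 7},
]

keep = make_selector(project_group)
selected = [row for row in project if keep(row)]
```

Looking the operator up in a dictionary also means an unexpected OPERATOR value fails loudly with a KeyError instead of silently being treated as "<".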
I am looping through an Excel sheet, looking for a specific name. When it is found, I print the position of the cell and its value.
I would like to find the position and value of a neighbouring cell, but I can't get .cell() to work by adding 2 to the column to indicate that I want the cell 2 columns away in the same row.
row=row works, but column=column gives an error, and so does column + 2. Maybe this is because I listed the columns as 'ABCDEFGHIJ' earlier in my code? (For the full code, see below.)
print 'Cell position {} has value {}'.format(cell_name, currentSheet[cell_name].value)
print 'Cell position next door TEST {}'.format(currentSheet.cell(row=row, column=column +2))
Full code:
import openpyxl

file = openpyxl.load_workbook('test6.xlsx', read_only = True)
allSheetNames = file.sheetnames
#print("All sheet names {}" .format(file.sheetnames))
for sheet in allSheetNames:
    print('Current sheet name is {}'.format(sheet))
    currentSheet = file[sheet]
    for row in range(1, currentSheet.max_row + 1):
        #print row
        for column in 'ABCDEFGHIJ':
            cell_name = '{}{}'.format(column,row)
            if currentSheet[cell_name].value == 'sign_name':
                print 'Cell position {} has value {}'.format(cell_name, currentSheet[cell_name].value)
                print 'Cell position TEST {}'.format(currentSheet.cell(row=row, column=column +2))
I get this output:
Current sheet name is Sheet1
Current sheet name is Sheet2
Cell position D5 has value sign_name
and:
TypeError: cannot concatenate 'str' and 'int' objects
I get the same error with column=column as with column=column + 2.
Why does row=row work, but column=column doesn't? And how do I find the cell name of the cell to the right of my resulting D5 cell?
The reason row=row works and column=column doesn't is because your column value is a string (letter from A to J) while the column argument of a cell is expecting an int (A would be 1, B would be 2, Z would be 26, etc.)
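To make the letter-to-number mapping concrete, here is a small pure-Python sketch of the conversion in both directions. The function names are my own; openpyxl itself ships helpers for this in openpyxl.utils (column_index_from_string and get_column_letter), which are likely the better choice in real code.

```python
def column_letter_to_index(letters):
    """Convert a column letter such as 'D' or 'AA' to its 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def index_to_column_letter(index):
    """Convert a 1-based column index back to its letter(s)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


# The cell two columns to the right of D5:
neighbour = "{}{}".format(index_to_column_letter(column_letter_to_index("D") + 2), 5)
print(neighbour)
```

With a conversion like this, the original loop over 'ABCDEFGHIJ' could pass column_letter_to_index(column) + 2 as the column argument instead of adding 2 to a string.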
There are a few changes I would make in order to more effectively iterate through the cells and find a neighbor. Firstly, OpenPyXl offers sheet.iter_rows(), which given no arguments, will provide a generator of all rows that are used in the sheet. So you can iterate with
for row in currentSheet.iter_rows():
    for cell in row:
because each row is a sequence of the cells in that row.
Then in this new nested for loop, you can get the current column index with cell.column (D would give 4 in current openpyxl releases), and the cell to the right (one column over) would be currentSheet.cell(row=cell.row, column=cell.column+1). Note that row is no longer an integer in this loop, so the row number has to come from cell.row.
Note the difference between the two uses of "cell": currentSheet.cell() is a request for a specific cell, while cell.column+1 is the column index of the current cell incremented by 1.
Relevant OpenPyXl documentation:
https://openpyxl.readthedocs.io/en/stable/api/openpyxl.cell.cell.html
https://openpyxl.readthedocs.io/en/stable/api/openpyxl.worksheet.worksheet.html
I need to add a column that sums the value column over all rows that share a common id. However, rows where the id is null should not be summed; for those, the new column should simply equal the value column.
The above example should result in:
TopPaymendId JournalLineNetAmount TopAmount
fcbcd407-ca26-4ea0-839a-c39767d05403 -3623.98 -7061.23
fcbcd407-ca26-4ea0-839a-c39767d05403 -3437.25 -7061.23
ce77faac-1638-40e9-ad62-be1813ce9031 -88.68 -88.68
531d9bde-3f52-47f3-a9cf-6f3566733af2 -152.23 -152.23
8266dfef-dd14-4654-a6d2-091729defde7 229.42 229.42
f8b97a47-15ef-427d-95e0-ce23cc8efb1f -777 -777
null -3.01 -3.01
null -2.94 -2.94
null 3312.5 3312.5
This code should work:
let
Source = Excel.CurrentWorkbook(){[Name="Data"]}[Content],
group = Table.Group(Source, {"TopPaymendId"}, {"TopAmount", each List.Sum([JournalLineNetAmount])}),
join = Table.Join(Source,{"TopPaymendId"},group,{"TopPaymendId"}),
replace = Table.ReplaceValue(join,each [TopAmount],each if [TopPaymendId] = null
then [JournalLineNetAmount] else [TopAmount],Replacer.ReplaceValue,{"TopAmount"})
in
replace
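The same rule (group total for rows with an id, the row's own value when the id is null) can be sketched in plain Python to check the expected numbers. This is only an illustration of the logic, not a translation of the query step by step; add_top_amount is a made-up name, and the sample rows are a subset of the table above.

```python
from collections import defaultdict


def add_top_amount(rows):
    """Add TopAmount: the per-id total, or the row's own amount if the id is None."""
    totals = defaultdict(float)
    for row in rows:
        if row["TopPaymendId"] is not None:
            totals[row["TopPaymendId"]] += row["JournalLineNetAmount"]
    return [
        dict(
            row,
            TopAmount=(
                row["JournalLineNetAmount"]
                if row["TopPaymendId"] is None
                else totals[row["TopPaymendId"]]
            ),
        )
        for row in rows
    ]


rows = [
    {"TopPaymendId": "fcbcd407-ca26-4ea0-839a-c39767d05403", "JournalLineNetAmount": -3623.98},
    {"TopPaymendId": "fcbcd407-ca26-4ea0-839a-c39767d05403", "JournalLineNetAmount": -3437.25},
    {"TopPaymendId": "ce77faac-1638-40e9-ad62-be1813ce9031", "JournalLineNetAmount": -88.68},
    {"TopPaymendId": None, "JournalLineNetAmount": -3.01},
    {"TopPaymendId": None, "JournalLineNetAmount": -2.94},
]
result = add_top_amount(rows)
```

Null ids are never added to the totals, so each null row keeps its own amount instead of being lumped together with the other null rows.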
I have time series data in two separate DataFrame columns which refer to the same parameter but are of differing lengths.
On dates where data only exist in one column, I'd like this value to be placed in my new column. On dates where there are entries for both columns, I'd like to have the mean value. (I'd like to join using the index, which is a datetime value)
Could somebody suggest a way that I could combine my two columns? Thanks.
Edit2: I've written some code which should merge the data from both of my columns, but I get a KeyError when I try to set the new values using the index generated from rows where the first column has values but the second doesn't. Here's the code:
def merge_func(df):
    null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
    df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
    notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
    df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
    df.insert(len(df.columns), 'Mean_mg/L', 0.0)
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df

merge_func(sve)
And here's the error:
KeyError: "['2004-01-14T01:00:00.000000000+0100' '2004-03-04T01:00:00.000000000+0100'\n '2004-03-30T02:00:00.000000000+0200' '2004-04-12T02:00:00.000000000+0200'\n '2004-04-15T02:00:00.000000000+0200' '2004-04-17T02:00:00.000000000+0200'\n '2004-04-19T02:00:00.000000000+0200' '2004-04-20T02:00:00.000000000+0200'\n '2004-04-22T02:00:00.000000000+0200' '2004-04-26T02:00:00.000000000+0200'\n '2004-04-28T02:00:00.000000000+0200' '2004-04-30T02:00:00.000000000+0200'\n '2004-05-05T02:00:00.000000000+0200' '2004-05-07T02:00:00.000000000+0200'\n '2004-05-10T02:00:00.000000000+0200' '2004-05-13T02:00:00.000000000+0200'\n '2004-05-17T02:00:00.000000000+0200' '2004-05-20T02:00:00.000000000+0200'\n '2004-05-24T02:00:00.000000000+0200' '2004-05-28T02:00:00.000000000+0200'\n '2004-06-04T02:00:00.000000000+0200' '2004-06-10T02:00:00.000000000+0200'\n '2004-08-27T02:00:00.000000000+0200' '2004-10-06T02:00:00.000000000+0200'\n '2004-11-02T01:00:00.000000000+0100' '2004-12-08T01:00:00.000000000+0100'\n '2011-02-21T01:00:00.000000000+0100' '2011-03-21T01:00:00.000000000+0100'\n '2011-04-04T02:00:00.000000000+0200' '2011-04-11T02:00:00.000000000+0200'\n '2011-04-14T02:00:00.000000000+0200' '2011-04-18T02:00:00.000000000+0200'\n '2011-04-21T02:00:00.000000000+0200' '2011-04-25T02:00:00.000000000+0200'\n '2011-05-02T02:00:00.000000000+0200' '2011-05-09T02:00:00.000000000+0200'\n '2011-05-23T02:00:00.000000000+0200' '2011-06-07T02:00:00.000000000+0200'\n '2011-06-21T02:00:00.000000000+0200' '2011-07-04T02:00:00.000000000+0200'\n '2011-07-18T02:00:00.000000000+0200' '2011-08-31T02:00:00.000000000+0200'\n '2011-09-13T02:00:00.000000000+0200' '2011-09-28T02:00:00.000000000+0200'\n '2011-10-10T02:00:00.000000000+0200' '2011-10-25T02:00:00.000000000+0200'\n '2011-11-08T01:00:00.000000000+0100' '2011-11-28T01:00:00.000000000+0100'\n '2011-12-20T01:00:00.000000000+0100' '2012-01-19T01:00:00.000000000+0100'\n '2012-02-14T01:00:00.000000000+0100' '2012-03-13T01:00:00.000000000+0100'\n 
'2012-03-27T02:00:00.000000000+0200' '2012-04-02T02:00:00.000000000+0200'\n '2012-04-10T02:00:00.000000000+0200' '2012-04-17T02:00:00.000000000+0200'\n '2012-04-26T02:00:00.000000000+0200' '2012-04-30T02:00:00.000000000+0200'\n '2012-05-03T02:00:00.000000000+0200' '2012-05-07T02:00:00.000000000+0200'\n '2012-05-10T02:00:00.000000000+0200' '2012-05-14T02:00:00.000000000+0200'\n '2012-05-22T02:00:00.000000000+0200' '2012-06-05T02:00:00.000000000+0200'\n '2012-06-19T02:00:00.000000000+0200' '2012-07-03T02:00:00.000000000+0200'\n '2012-07-17T02:00:00.000000000+0200' '2012-07-31T02:00:00.000000000+0200'\n '2012-08-14T02:00:00.000000000+0200' '2012-08-28T02:00:00.000000000+0200'\n '2012-09-11T02:00:00.000000000+0200' '2012-09-25T02:00:00.000000000+0200'\n '2012-10-10T02:00:00.000000000+0200' '2012-10-24T02:00:00.000000000+0200'\n '2012-11-21T01:00:00.000000000+0100' '2012-12-18T01:00:00.000000000+0100'] not in index"
You are close, but you don't need to iterate over the rows when using isnull(). The expression
df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
returns just the index of the rows where DOC_mg/L is not null and TOC_mg/L is null.
Now you can do something like this to set the values for TOC_mg/L. Note that the column is selected first and then indexed by null_index; df[null_index] looks the labels up as column names, which is what raises the KeyError:
null_index = df[(df['DOC_mg/L'].isnull() == False) & \
(df['TOC_mg/L'].isnull() == True)].index
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index] # EDIT To switch the index position.
This will use the index of the rows where TOC_mg/L is null and DOC_mg/L is not null, and set the values for TOC_mg/L to those found in DOC_mg/L in the same rows.
Note: This is not the accepted way of setting values using an index, but it is how I've been doing it for some time; the form pandas recommends is df.loc[index, 'col_name'] = values. If you do use chained indexing, make sure that the left side of the assignment is df['col_name'][index]. If col_name and index are switched, you will set the values on a copy, which is never written back to the original.
Now to set the mean, you can create a new column, which we'll call Mean_mg/L, with a default value of 0.0. Then set this new column to the mean of both columns:
# Insert a new column named 'Mean_mg/L' at the end of the DataFrame,
# with default value 0.0
df.insert(len(df.columns), 'Mean_mg/L', 0.0)
# Set this column's value to the average of DOC_mg/L and TOC_mg/L
df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
In the rows where we filled null values with the corresponding value from the other column, the average will simply equal that value.
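As a possible shortcut, which is not part of the approach described above: a row-wise mean over the two columns skips NaN by default, so it gives the single available value when only one column has data and the average when both do, without filling either original column first. A minimal sketch, assuming pandas and NumPy are installed; the sample dates and numbers are made up, and only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data on a datetime index.
index = pd.to_datetime(["2004-01-14", "2004-03-04", "2004-03-30"])
df = pd.DataFrame(
    {"DOC_mg/L": [1.0, np.nan, 3.0], "TOC_mg/L": [np.nan, 2.0, 5.0]},
    index=index,
)

# mean(axis=1) ignores NaN by default (skipna=True), so a row with a
# single value keeps that value and a row with two values gets their mean.
df["Mean_mg/L"] = df[["DOC_mg/L", "TOC_mg/L"]].mean(axis=1)
print(df)
```

A row where both columns are NaN would end up as NaN in Mean_mg/L, which may or may not be what you want.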