I can't update single values in a dataframe from a dictionary before the next calculation is executed
I am working with a dataframe and want to calculate values in different columns, row by row. After the calculations for one row are finished, the values in the rows thereafter need to be changed before a new calculation can happen, because the values in those rows depend on the results of the previous row.
To approach this, I am working with a dictionary. Currently I am working in Excel, where I manage to update the value of a single cell from the dictionary. However, to make the calculation faster I want to work with a proper dataframe.
I manage to update the dictionary based on the results, but I do not manage to update my dataframe with these new values from the dictionary.
The code that works for my current model based on Excel is:
dict1 = {1: 10, 2: 15, ..., 38: 29}  # my dictionary
for row in range(2, sheet.max_row + 1):
    # updates the table with values of the dictionary before each calculation
    sheet['F' + str(row)].value = dict1[sheet['C' + str(row)].value]
    # calculations being executed
    (.....)
    # updating the dictionary with the results of the calculations in the row
    dict1_1 = {sheet['C' + str(row)].value: sheet['F' + str(row)].value}
    dict1.update(dict1_1)
What I tried so far with a dataframe looks like this:
for row in df.T.itertuples():
    df.replace({"P_building_kg_y": dict1})  ##### <----- HERE IS THE PROBLEM!
    # calculations being executed
    (.....)
    # updating the dictionary with the results of the calculations in the row
    dict1_1 = dict(zip(df.FacilityID, df.P_building_kg_y))
    dict1.update(dict1_1)
I only want to update the values in the dataframe based on the dictionary. If you know a way to do it, I would really appreciate it!
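One note that may help frame an answer: df.replace returns a new dataframe unless you assign the result or pass inplace=True, which is likely why the marked line has no visible effect. A minimal sketch of the row-by-row update using df.at and a dictionary lookup might look like this (the column names FacilityID and P_building_kg_y are taken from the snippet above; the data is made up):

import pandas as pd

# made-up data mirroring the question's columns
df = pd.DataFrame({"FacilityID": [1, 2, 38],
                   "P_building_kg_y": [0.0, 0.0, 0.0]})
dict1 = {1: 10, 2: 15, 38: 29}

for idx in df.index:
    # overwrite the cell from the dictionary before this row's calculation
    df.at[idx, "P_building_kg_y"] = dict1[df.at[idx, "FacilityID"]]
    # calculations being executed
    # (.....)
    # write the row's result back into the dictionary for later rows
    dict1[df.at[idx, "FacilityID"]] = df.at[idx, "P_building_kg_y"]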
I have a query in Power BI that takes two parameters: Start Date and End Date.
Whenever I pass these dates, it returns a table of dates containing a few columns created according to this date range, such as Date, QuarterofYear, Year, MonthName, etc.
Can we create a mapping data flow in ADF that takes two parameters as input and returns a calculated table according to the provided dates?
Is there any function that returns a range of dates?
For your request: "I want to pass two dates, Start Date and End Date, into an ADF Mapping Data Flow, and the data flow will create a column such as "Date" that contains that number of date rows. Is there any function for this? Example: Start Date = 20-01-2019, End Date = 20-01-2020. Then the Date column values should be: 20-01-2019, 21-01-2019, ..., 20-02-2020": according to the Data Factory documentation and my experience, the answer is no, we can't achieve it in a Data Flow.
There is a solution to this, but it is a bit tricky.
TL;DR
The general data flow looks like this:
We need a dummy source with exactly one row which contains whatever.
Then we derive a column where we use the mapLoop() expression to create an array of all the dates we want to get rows for.
Finally, we need to flatten the array column which will result in one row per array entry and thus one row per date.
Walkthrough
Source dummy
Each dataflow needs a source and we need exactly one row to make our dataflow work. To achieve this I've created a dataset called empty of type CSV in my data lake which has this content:
empty
""
This is our source definition:
And its result looks like this:
Derived column days
This is where the magic happens!
We create a new column dates which is an array of all the dates we want to have in our date table:
In this scenario we want a date table starting on 2019-01-01 and reaching one year into the future. The full expression looks like this:
mapLoop(
    addDays(currentDate(), 365) - toDate('2019-01-01'),
    addDays(toDate('2019-01-01'), #index)
)
This is what happens here:
The mapLoop() function builds an array of elements: you specify the number of elements you want and the lambda expression to calculate each of them. The related mapIndex() function works similarly on an existing array; for example, mapIndex([1, 2, 3, 4], #item + 2 + #index) results in [4, 6, 8, 10]
addDays(currentDate(), 365) - toDate('2019-01-01') is the number of days between our start (2019-01-01) and end date (1 year in the future from now) and thus the number of dates we want to have in our resulting array.
addDays(toDate('2019-01-01'), #index) calculates each array item by adding #index days to our start date. This is executed once per array element, with #index being the (1-based) array position. Thus, the first element of the array will be 2019-01-01 + 1 day, the second 2019-01-01 + 2 days, and so on (see the Python sketch below for the same logic).
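If it helps to see the same logic outside the Data Flow expression language, here is a small Python equivalent, for intuition only; the actual transformation uses the expression above:

from datetime import date, timedelta

start = date(2019, 1, 1)
end = date.today() + timedelta(days=365)  # one year into the future
n = (end - start).days                    # number of array elements

# mapLoop() indexes from 1, so the first generated date is start + 1 day
dates = [start + timedelta(days=i) for i in range(1, n + 1)]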
Our stream now has these columns:
Flatten
Finally, you need a flatten transformation, which will expand each item in your array into its own row. We can also drop the now-useless empty column in this step:
And this finally results in what we wanted to achieve:
References
Data transformation expressions in mapping data flow
The data frames are not similar in any way; they do not have the same values. I want to be able to compare one column from one df with a column from the other df and graph them. For example, one df has a column named "Offical poverty total" and the other df has a column named "violent crime rate". I want to be able to compare these two.
I tried df['Offical Poverty_Total'].append(crime['Violent crime']),
but this isn't what I was looking for. To make it simple, I want a new table with the two columns so I can analyze the new table.
You're looking for
pd.concat((df['Offical Poverty_Total'], crime['Violent crime']), axis = 1)
This will align the two Series on their row indexes. If you've changed the row ordering of the dataframes and just want to glue the columns together in the order you see them, reset the indexes first:
pd.concat((df['Offical Poverty_Total'].reset_index(drop=True), crime['Violent crime'].reset_index(drop=True)), axis = 1)
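As a quick illustration with made-up numbers (column names taken from the question), the combined table can then be analyzed or plotted directly:

import pandas as pd

# made-up stand-in data for the two dataframes in the question
df = pd.DataFrame({"Offical Poverty_Total": [12.1, 9.8, 15.3]})
crime = pd.DataFrame({"Violent crime": [410, 350, 520]})

combined = pd.concat((df["Offical Poverty_Total"], crime["Violent crime"]), axis=1)
print(combined.corr())  # how strongly the two columns move together
combined.plot.scatter(x="Offical Poverty_Total", y="Violent crime")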
Note: I have simplified this code and some background information quite a bit in order to pinpoint the exact issue. If you want me to explain additional aspects of this code, please ask in the comments.
I am currently trying to make a code capable of aggregating two or more dictionaries together to ultimately make a .csv file containing the following basic setup:
                 TF-A              TF-B
ids  gene name   score sum  hits   score sum  hits   ...  gene description
id1  gene A      53.85      14     37.65      7      ...  stuff
id2  gene B      97.55      11     63.94      8      ...  stuff
id3  gene C      88.67      9      79.43      12     ...  stuff
id4  gene D      69.35      12     13.03      13     ...  stuff
...  ......      ......     ...    ...        ...    ...  ..........
idx  gene Z      49.32      8      84.03      10     ...  stuff
The dictionaries contain the names of genes as keys, with the corresponding values being an array of scores (these scores are generated by calculating the probability of a transcription factor, a.k.a. TF, binding to a gene at a certain position). Each TF has one dict, and its keys only contain genes that it has at least one score with. After the dictionaries are opened, set intersection is used to generate a list of genes that all given transcription factors have in common (see the one-liner after the next snippet), which are then placed into the "gene name" column of the dataframe in a for loop. Because of class structures I built previously (not shown), I can retrieve the common name and description of each gene easily and place it in the "gene description" column using df.set_value.
for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    df.set_value(id, 'gene name', gene_name)
    df.set_value(id, 'gene description', gene_description)
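For reference, the set intersection mentioned above could look something like this, assuming the per-TF dictionaries are collected in a list (the name tf_dicts is illustrative):

# genes that appear in every TF's score dictionary
common_names = set.intersection(*(set(d) for d in tf_dicts))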
However, the number of columns in the dataframe depends on the number of transcription factors the user wishes to analyze. So, a user putting in two TFs will add four columns to the dataframe: two columns for TF-A (sum of scores and number of scores, or 'hits'), and two columns for TF-B. Inputting three TFs will yield six columns, inputting four TFs will yield eight columns, and so on. The ID, gene name, and description columns are constant.
So, before I build the dataframe, I make a list that expands for every TF given.
ColumnList = []
for tf_id in tf_id_list:
    ColumnList += ['{} Total Sum of Hits'.format(tf_id), '{} Number of Hits'.format(tf_id)]
With this list, I concatenate it with the other column names, and then instantiate my dataframe.
df = DataFrame(columns=['ids', 'gene name'] + ColumnList + ['gene description'])
As shown above, I can easily set the names and descriptions in the correct cells of the dataframe. And I can easily calculate the number of hits and total sum of scores for each gene and TF from the original dicts, but I have NO idea how to place this information in the corresponding cells. Because the number of columns depends on the inputted TFs, I do not know what kind of code I should write to accommodate this variable number of columns, or how to specify a column based on its TF. Can anybody recommend the proper code setup and/or methods?
I have done some research, and I did see a method whereby one can add in a certain piece of information into a cell based on what kind of information is present in another column (but in the same row):
Modifying a subset of rows in a pandas dataframe
df.ix[df.A==0, 'B'] = np.nan
If you read the code in the link provided above, this piece of code adds a NaN into the 'B' column whenever a zero is present in the 'A' column (note that .ix has since been deprecated in favor of .loc). I thought I could utilize this methodology, but I need to add the number of hits and total sum of scores based on whether they relate to the first TF, second TF, third TF, and so on. Would one write:
for id in common_genes:
    for tf_id in tf_id_list:
        df.ix[df.'{} Number of Hits'.format(tf_id) == tf_id] = number_hits
        df.ix[df.'{} Total Sum of Scores'.format(tf_id) == tf_id] = sum_scores
I don't believe that's correct, since my IDE says the syntax does not compile. I should also note that I have simplified the above code a little bit: the variables "number_hits" and "sum_scores" are actually derived from a dictionary that contains gene names as keys, and a list of hits, score sum, and the pertaining TF name as values.
In the end, I decided to make a dict of dicts; I realized that this data structure was actually the most ideal for what I needed. The information from the original dicts is stored inside an outer dict (the "total_dict"), and it is accessed if and only if the gene is present in the list of common_genes (which was derived from a set intersection over all the transcription factors). A for loop goes through the TFs of the total_dict to determine whether the gene id is present in that TF's dict, and if it is, the information in the value (i.e. sum of scores and number of hits) is added to the correct row (based on the gene name, or id) and column (based on the TF at hand).
for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    df.set_value(id, 'gene name', gene_name)
    df.set_value(id, 'gene description', gene_description)
    for tf_id in total_dict.keys():
        if id in total_dict[tf_id].keys():
            df.set_value(id, '{} Score Sum'.format(tf_id), total_dict[tf_id][id][0])
            df.set_value(id, '{} Number of Hits'.format(tf_id), total_dict[tf_id][id][1])
print(df.head())
Long story short, if you are dealing with a lot of different kinds of data, make SURE the kind of data structures you work with can be manipulated in order to build the kind of results you want to generate.
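Since DataFrame.set_value has been deprecated and removed in newer pandas versions, a minimal sketch of the same dict-of-dicts approach using .at might look like this (the data here is made up for illustration):

import pandas as pd

# made-up stand-in data: one inner dict per TF, keyed by gene id,
# with [score_sum, number_of_hits] as values
total_dict = {
    "TF-A": {"id1": [53.85, 14], "id2": [97.55, 11]},
    "TF-B": {"id1": [37.65, 7], "id2": [63.94, 8]},
}
common_names = ["id1", "id2"]

columns = ["gene name"]
for tf_id in total_dict:
    columns += ["{} Score Sum".format(tf_id), "{} Number of Hits".format(tf_id)]
columns.append("gene description")

df = pd.DataFrame(index=common_names, columns=columns)
for gene_id in common_names:
    for tf_id, tf_scores in total_dict.items():
        if gene_id in tf_scores:
            # .at is the modern replacement for the removed set_value
            df.at[gene_id, "{} Score Sum".format(tf_id)] = tf_scores[gene_id][0]
            df.at[gene_id, "{} Number of Hits".format(tf_id)] = tf_scores[gene_id][1]
print(df.head())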
I have a scenario like the one below, where based on the value of num_seats I have to split each row in the target into that many rows, along with another field (seat_num) holding a counter that increments by 1. Please suggest a solution.
You can clone the rows based on a field value (in this case, num_seats), remove the original (non-cloned) row, then calculate the seat number and replace the original fields (num_seats, seat_num, last_seat, etc.) with the new values:
Here's a Gist of the above transformation: https://gist.github.com/mattyb149/e4cf796ff45983ebf87e
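If the data happened to be in a pandas dataframe instead, a minimal sketch of the same clone-and-number idea could be (column names other than num_seats and seat_num are assumed):

import pandas as pd

# made-up stand-in for the source rows in the question
df = pd.DataFrame({"booking_id": ["b1", "b2"], "num_seats": [3, 2]})

# repeat each row num_seats times, then number the seats within each original row
out = df.loc[df.index.repeat(df["num_seats"])].copy()
out["seat_num"] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)
print(out)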
I am a front-end developer new to Django. There is a certain column (server_reach) in our Postgres DB which has values of (1, 2), and I need to write a query that tells me whether at least one of the filtered rows is reachable (1 = not reachable, 2 = reachable).
I was initially told that the values of the column would be (0, 1), based on which I wrote this:
ServerAgent.objects.values('server').filter(
    app_uuid_url=app.uuid_url,
    trash=False
).annotate(serverreach=Sum('server_reach'))
The logic is simple: I fetch all the filtered rows and annotate them with the sum of the server_reach values. If this sum is more than zero, then at least one entry is non-zero.
But the issue is that the actual DB has values (1, 2), so this logic no longer works. I want to subtract 1 from the server_reach of each row before summing. I have tried F expressions as below:
ServerAgent.objects.values('server').filter(
    app_uuid_url=app.uuid_url,
    trash=False
).annotate(serverreach=Sum(F('server_reach') - 1))
But it throws the following error. Please help me get this to work.
AttributeError: 'ExpressionNode' object has no attribute 'split'
Use Avg instead of Sum. If the average value is greater than 1, then at least one row contains a value of 2.
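For illustration, the question's queryset with Avg swapped in might look like this (a sketch reusing the names from the question):

from django.db.models import Avg

qs = ServerAgent.objects.values('server').filter(
    app_uuid_url=app.uuid_url,
    trash=False,
).annotate(serverreach=Avg('server_reach'))

# any group whose average exceeds 1 must contain at least one 2 ("reachable")
reachable_servers = [row['server'] for row in qs if row['serverreach'] > 1]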