Subtract values in a data frame ignoring specific keys - python-2.7
I have two data frames as such:
df1 = pd.DataFrame({ 'pressure' : [42,42,42,42,42,42,42,36,36,36,36,36,36,36],
'load' : [350,350,350,350,350,350,350,700,700,700,700,700,700,700],
'speed' : [70,60,50,40,30,20,10,70,60,50,40,30,20,10],
'lforce' : [3.6,3.5,3.3,3.2,3.1,3.1,2.9,7.7,7.3,7.0,6.8,6.5,6.4,6.1],
'rforce' : [3.4,3.2,3.1,3.0,2.9,2.8,2.7,7.6,7.2,6.9,6.6,6.3,6.2,5.9]
}).set_index(['pressure','load','speed'])
df2 = pd.DataFrame({ 'pressure' : [47,47,47,47,47,47,47],
'load' : [20,20,20,20,20,20,20],
'speed' : [70,60,50,40,30,20,10],
'lforce' : [2.5,2.1,1.9,1.7,1.5,1.3,1.2],
'rforce' : [2.8,2.6,2.4,2.2,2.0,1.8,1.7]
}).set_index(['pressure','load','speed'])
Formatted:
>>> df1
lforce rforce
pressure load speed
42 350 70 3.6 3.4
60 3.5 3.2
50 3.3 3.1
40 3.2 3.0
30 3.1 2.9
20 3.1 2.8
10 2.9 2.7
36 700 70 7.7 7.6
60 7.3 7.2
50 7.0 6.9
40 6.8 6.6
30 6.5 6.3
20 6.4 6.2
10 6.1 5.9
>>> df2
lforce rforce
pressure load speed
47 20 70 2.5 2.8
60 2.1 2.6
50 1.9 2.4
40 1.7 2.2
30 1.5 2.0
20 1.3 1.8
10 1.2 1.7
I would like to subtract df2 from df1 on the lforce and rforce columns for each speed to get the resulting data frame df3.
My problem is that I need to ignore the pressure and load in df2 during the subtraction, but retain the originals from df1.
Desired result:
>>> df3
lforce rforce
pressure load speed
42 350 70 1.1 0.6
60 1.3 0.6
50 1.4 0.7
40 1.5 0.8
30 1.6 0.9
20 1.7 1.0
10 1.7 1.0
36 700 70 5.2 4.8
60 5.1 4.6
50 5.1 4.4
40 5.1 4.4
30 5.0 4.3
20 5.0 4.3
10 4.9 4.2
df1.sub(df2.reset_index([0, 1], drop=True), level=2)
output:
lforce rforce
pressure load speed
42 350 70 1.1 0.6
60 1.4 0.6
50 1.4 0.7
40 1.5 0.8
30 1.6 0.9
20 1.8 1.0
10 1.7 1.0
36 700 70 5.2 4.8
60 5.2 4.6
50 5.1 4.5
40 5.1 4.4
30 5.0 4.3
20 5.1 4.4
10 4.9 4.2
Maybe something like this:
>>> df3 = df1.reset_index(level=[0,1])
>>> df4 = df2.reset_index(level=[0,1])
>>> df4['pressure'] = 0
>>> df4['load'] = 0
>>> df3 - df4
pressure load lforce rforce
speed
10 42 350 1.7 1.0
10 36 700 4.9 4.2
20 42 350 1.8 1.0
20 36 700 5.1 4.4
30 42 350 1.6 0.9
30 36 700 5.0 4.3
40 42 350 1.5 0.8
40 36 700 5.1 4.4
50 42 350 1.4 0.7
50 36 700 5.1 4.5
60 42 350 1.4 0.6
60 36 700 5.2 4.6
70 42 350 1.1 0.6
70 36 700 5.2 4.8
Now you just have to move pressure and load back to the index.
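The last step, moving pressure and load back into the index, might look like the sketch below. The small `result` frame is a hand-made stand-in for the output of `df3 - df4` (only the speed 10 and 20 rows are copied in), and the names `result` and `restored` are illustrative, not part of the original answer.

```python
import pandas as pd

# Hand-made stand-in for the result of `df3 - df4`: speed is the index,
# pressure and load are ordinary columns (values copied from the output above).
result = pd.DataFrame(
    {
        "pressure": [42, 36, 42, 36],
        "load": [350, 700, 350, 700],
        "lforce": [1.7, 4.9, 1.8, 5.1],
        "rforce": [1.0, 4.2, 1.0, 4.4],
    },
    index=pd.Index([10, 10, 20, 20], name="speed"),
)

# Append pressure and load to the existing speed index, then restore the
# original level order (pressure, load, speed) and sort the index.
restored = (
    result.set_index(["pressure", "load"], append=True)
    .reorder_levels(["pressure", "load", "speed"])
    .sort_index()
)

print(restored)
```

Sorting puts the rows in ascending key order, which differs from the descending speed order shown for df1; skip `sort_index()` if the original row order matters more than sorted lookups.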
Is this what you're looking for?
d1 = df1.reset_index(['pressure','load'])
d2 = df2.reset_index(['pressure','load'])
r0 = d1.merge(d2, left_index=True, right_index=True)
r1 = r0.set_index(['pressure_x','load_x'], drop=False)
r1['lforce'] = r1.lforce_x - r1.lforce_y
r1['rforce'] = r1.rforce_x - r1.rforce_y
df3 = r1[['lforce','rforce']]
df3
Related
Able to transform observation to 0 but had issue with their total
I have a raw data set, and I tried to transform the observations that have a "T" to 0, then read in the data set and print it out. Just this. However, with my code, simply looking at the first observation in line 5 shows that something is off. For instance, the first observation for "Nov" should not be 0. I could not figure out what had gone wrong, and I wonder if anyone could give me some advice on what to try next. Thank you very much! Highly appreciated.
My code is as below:
INFILE "&DIRLSB.Pr1Snowfall1.csv" DSD FIRSTOBS=5;
DROP i;
INPUT Season $#;
INPUT Year 1-4 Season 1-7 Sep Oct Nov Dec Jan Feb Mar Apr May Total;
ARRAY Months (*) Sep -- May;
DO i = 1 TO dim(Months);
IF Months(i)=. Then Months(i)=0;
END;
RUN;
I'm guessing you have a MISSING T; statement somewhere, so T (trace) is being read as the special missing value .T, and .T does not equal the ordinary missing value (.). I would use the coalesce function. There is really no need to change missing T to 0, is there?
missing t;
data snow;
infile cards firstobs=2;
input Season:$7. Sep Oct Nov Dec Jan Feb Mar Apr May Total;
array mth[*] Sep--May;
do i = 1 to dim(mth);
mth[i] = coalesce(mth[i],0);
end;
t = sum(of mth[*]);
drop i;
cards;
Season Sep Oct Nov Dec Jan Feb Mar Apr May Total
1884-85 0 T 1 27.1 22.2 17 3.5 19.5 T 90.3
1885-86 0 1.7 8.2 8.4 16.9 16 6.5 7 0 64.7
1886-87 0 T 22.2 12.5 12 18.4 6.3 1.2 0 72.6
1893-94 0 0.5 6.1 27.6 20 29.5 5.4 13.3 0 102.4
1894-95 0 T 11.1 22.1 26.5 23.6 9.5 0.6 0 93.4
1895-96 0 1.5 5.9 8.7 22.5 39.1 45.1 1 0 123.8
1896-97 0 T 5.5 13.9 20.1 13.7 8.1 5.2 0 66.5
1897-98 0 0 10.1 18.4 32.1 26.8 1.2 2.4 0 91
1898-99 0 T 10.6 27 16.6 16.3 21.2 4.3 T 96
1899-00 T T 1.3 21.5 24.7 28.5 54 1.3 0 131.3
1906-07 0 5 5.7 18.7 11.7 15.7 3.1 2.5 1.3 63.7
1907-08 0 0 2.2 11.6 16.5 19.8 7.9 6.3 3 67.3
1908-09 0 0.5 4.6 10 22.5 6.1 9.7 9.8 3.3 66.5
1909-10 0 T 1.7 14.6 22 42.7 3.4 0.5 0 84.9
1910-11 0 2.2 15.7 29.8 9.5 30 13.5 4.7 2 107.4
1911-12 0 0 6.5 7.5 21.5 10.8 8.8 6.9 T 62
;;;;
run;
proc print;
run;
How to distinguish whether to read in as numeric or character observations?
I have two data sets with the same content, but one is in tab-delimited format and the other is in space-delimited format (Space-Delimited and Tab_Delimited). I have three questions that I could not figure out and would like to ask for help with. Any suggestions would be highly appreciated.
First, I used TextWrangler to open these two data sets. My impression is that in the space-delimited data set the values are separated by spaces and the observations in each row sit in the same positions. My understanding of the tab-delimited data set, on the other hand, is that the values are separated by blanks that do not necessarily have the same width in each row. Is my understanding correct? I am having trouble distinguishing them.
Second, I was printing out the snowfall data set mentioned above from row number 5 to row number 122, and the "T" values in the data set have to be converted to 0. My code for the space-delimited file of the snowfall data is below, and my question is about its LOG. There were many warnings about "T" but I did not receive any errors. Should I be concerned about the warnings mentioning "invalid data for month(i) in line..."?
* Trying Space-Delimited data set;
OPTIONS Errors=200;
DATA SASWEEK.SnowSpace;
DROP i MyTot diff;
INFILE "&dirLSB.RochesterSnowfallSpace.txt" FIRSTOBS= 2 OBS= 122;
INPUT Season $ Sep Oct Nov Dec Jan Feb Mar Apr May Total ;
ARRAY Month(10) Sep -- Total;
DO i = 1 TO 10 ;
IF Month(i) = . THEN Month(i) = 0 ;
MyTot = sum (of Sep -- May);
diff = round (MyTot-Total, 3);
IF diff ne 0 THEN PUT "**ERROR" MyTot= Total= diff= ;
END;
PROC PRINT DATA=sasweek.snowspace;
TITLE "Rochester Snowfall in Space-Delimited format";
RUN;
One of my professors suggested I should have read the monthly snowfall as "character", so the "T"s would not cause a warning in the LOG. I am not sure whether I should try it this way.
Lastly, I tried to use "Proc Import" for the same data set but as an xls file. The data set is at the link, and my code is as follows:
* Trying Excel file ;
OPTIONS ERRORS=200;
OPTIONS MSGLEVEL=i;
PROC IMPORT OUT=SASWEEK.SNOWxls DATAFILE= "&dirLSB.RochesterSnowfall.xls" DBMS=xls;
GETNAMES= no;
RANGE= "Sheet1$a5:k122" ;
PROC PRINT DATA= SASWEEK.SNOWxls;
TITLE "Rochester Snowfall in xls format";
RUN;
I received an error in the LOG (saved as HTML). I still printed out part of the data set, but the variable names were messed up and the output was not complete. Any ideas? Thank you all for reading and thanks for any help :)
The DATA step with an INPUT statement might be the best place to start. WARNINGs are fine, unless the goal is to have no warnings.
The data file can be cleanly read by creating an input environment built for it:
Custom informat zeroT: converts T (text) to 0 (number), which prevents the warnings.
INFILE DLM='0920'x: specifies that either a tab or a space may delimit values in the data file.
INPUT:
Wrap the fields Sep to Total in parentheses ( ) to indicate grouped input.
Wrap the informat specifiers in parentheses ( ) so they are applied over the grouped variables.
The : list input modifier advances input parsing to the next non-blank and reads until the next character is blank.
Sample Code
proc format;
invalue zeroT 'T'=0 other=[best12.];
run;
data have;
infile snowdata firstobs=2 dlm='0920'x;
INPUT Season $ (Sep Oct Nov Dec Jan Feb Mar Apr May Total) (10 * :zeroT.) ;
run;
Sample Data (from SP text viewer)
filename snowdata "%TEMP%\roc_snowfalls.txt";
* create local sample data file, text copied from sharepoint viewer;
data _null_;
file snowdata;
input;
put _infile_;
datalines;
Season Sep Oct Nov Dec Jan Feb Mar Apr May Total
1884-85 0 T 1 27.1 22.2 17 3.5 19.5 T 90.3
1885-86 0 1.7 8.2 8.4 16.9 16 6.5 7 0 64.7
1886-87 0 T 22.2 12.5 12 18.4 6.3 1.2 0 72.6
1887-88 0 0.2 2.2 9.3 21.3 4.1 13.2 0.4 0 50.7
1888-89 0 T 4 15.5 17.8 22 17.5 5.4 0 82.2
1889-90 0 T 5.7 6.1 20.2 14.8 19 T 0 65.8
1890-91 0 0 2.1 29.2 16.1 24.6 12.2 0.3 0.1 84.6
1891-92 0 0.1 9.7 4.7 26.4 10.3 25.1 0.8 T 77.1
1892-93 0 T 14 19.2 15.9 29.8 8.1 9.6 0 96.6
1893-94 0 0.5 6.1 27.6 20 29.5 5.4 13.3 0 102.4
1894-95 0 T 11.1 22.1 26.5 23.6 9.5 0.6 0 93.4
1895-96 0 1.5 5.9 8.7 22.5 39.1 45.1 1 0 123.8
1896-97 0 T 5.5 13.9 20.1 13.7 8.1 5.2 0 66.5
1897-98 0 0 10.1 18.4 32.1 26.8 1.2 2.4 0 91
1898-99 0 T 10.6 27 16.6 16.3 21.2 4.3 T 96
1899-00 T T 1.3 21.5 24.7 28.5 54 1.3 0 131.3
1900-01 0 0 17 20.3 29.8 36.9 13.7 23.8 T 141.5
1901-02 0 0.1 14.1 14.5 23.8 23 1.2 2.3 T 79
1902-03 0 0.1 4.1 27.7 18.1 15.6 2.4 0.3 0 68.3
1903-04 0 0.6 4.4 16.1 27.2 17.2 10.7 19.5 T 95.7
1904-05 0 0.2 2.1 15.8 27.5 15.2 7 0.5 0 68.3
1905-06 0 T 4 8.4 7.6 8 15.2 1.1 0 44.3
1906-07 0 5 5.7 18.7 11.7 15.7 3.1 2.5 1.3 63.7
1907-08 0 0 2.2 11.6 16.5 19.8 7.9 6.3 3 67.3
1908-09 0 0.5 4.6 10 22.5 6.1 9.7 9.8 3.3 66.5
1909-10 0 T 1.7 14.6 22 42.7 3.4 0.5 0 84.9
1910-11 0 2.2 15.7 29.8 9.5 30 13.5 4.7 2 107.4
1911-12 0 0 6.5 7.5 21.5 10.8 8.8 6.9 T 62
1912-13 0 0 7.2 6.9 10 18.6 15.2 1.3 0 59.2
1913-14 0 0.2 0.3 14.4 15.1 21.6 27.9 7.2 0 86.7
1914-15 0 0.8 4.7 16.1 22.9 9.8 6 0.5 0 60.8
1915-16 0 0 3.4 14.8 8.5 35.7 43.8 0.7 0 106.9
1916-17 0 0 11.7 24.9 22.7 16.7 14.6 2.3 T 92.9
1917-18 0 T 7.9 29.7 17.2 12.7 10.5 1.3 0 79.3
run;
Join in Dataframe
I am joining two data frames.
df1 =
Customer_id Month Weightage_Pos
76 April 1.4
76 February 1.4
76 January 1.4
76 June 1.4
76 March 1.4
76 May 1.4
106 April 1.4
106 June 1.4
106 May 1.4
177 June 1.4
212 May 1.4
313 May 1.4
580 April 1.4
580 February 1.4
732 January 1.4
861 April 2
Another data frame, df2 =
Customer_id Month Weightage_Available_Balance Credit_Card_Weightage Inflow_Weightage Final_weightage
76 April 2 0 0.15 2.15
76 February 0 0 1.8 1.8
76 January 0 0 0.15 0.15
76 June 2 0 0 2
76 March 1.8 0 2.1 3.9
76 May 2 0 0.15 2.15
106 April 2 0 0 2
106 February 2 0 0.45 2.45
106 January 0 0 0 0
106 June 2 0 0 2
106 March 2 0 0.45 2.45
106 May 2 0 0 2
119 April 0 0 0.3 0.3
119 March 0 0 0.15 0.15
119 May 0 0 2.4 2.4
177 June 1.8 1.2 0.15 3.15
177 May 0.8 1.2 0 2
198 February 0 0 0.45 0.45
198 June 0.8 0 0.45 1.25
198 March 0 0 1.2 1.2
313 April 0.8 0 0.15 0.95
313 March 0.8 0 0 0.8
313 May 0.8 0 0 0.8
397 May 0 0 0 0
547 February 0 0 0.15 0.15
547 May 0 0 0.3 0.3
I wrote this code:
final_data_frame = pd.merge(df2,df1,on= ['Customer_id','Month'],how='left')
But the output of final_data_frame is not correct: it contains the df2 columns plus the additional column Weightage_Pos, but the values show up as NaN. How can this issue be resolved? Is the join method above wrong?
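The post does not say why the merge yields NaN, so the following is only a diagnostic sketch. It assumes the keys differ by dtype or stray whitespace, which is a guess and not something the question confirms. The tiny `left` and `right` frames and the `normalise_keys` helper are hypothetical stand-ins; `indicator=True` is a real `pd.merge` option that records which side each row came from.

```python
import pandas as pd

# Hypothetical miniature frames: ids stored as text on one side and a
# trailing space in one month name, to mimic keys that look equal but are not.
left = pd.DataFrame(
    {
        "Customer_id": ["76", "106", "119"],
        "Month": ["April", "April ", "May"],
        "Final_weightage": [2.15, 2.0, 2.4],
    }
)
right = pd.DataFrame(
    {
        "Customer_id": [76, 106],
        "Month": ["April", "April"],
        "Weightage_Pos": [1.4, 1.4],
    }
)


def normalise_keys(df):
    """Return a copy with integer ids and whitespace-free month names."""
    out = df.copy()
    out["Customer_id"] = out["Customer_id"].astype(int)
    out["Month"] = out["Month"].str.strip()
    return out


# indicator=True adds a _merge column: 'both' for matched rows,
# 'left_only' for rows that found no partner in the right frame.
merged = pd.merge(
    normalise_keys(left),
    normalise_keys(right),
    on=["Customer_id", "Month"],
    how="left",
    indicator=True,
)
print(merged)
```

With a left join, NaN in Weightage_Pos is expected for any key pair that exists only in the left frame, so the `_merge` column helps separate genuine non-matches from keys that failed to match because of formatting.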
Python 2.7: Reading a text file online to a string and printing output
I am reading data from this link: http://www.weerindelft.nl/clientraw.txt. The main goal is to print out the temperature that http://www.weerindelft.nl displays. I have discovered that it is in that text file, so I only need to print out the right part of the file. This is my code:
import socket
from decimal import Decimal
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("www.weerindelft.nl" , 80))
s.sendall("GET http://www.weerindelft.nl/clientraw.txt HTTP/1.0\n\n")
write = s.recv(1427247693)
variable_1 = str(write[311:])
integer = float(variable_1[46:50])
tim = round(integer,0)
print Decimal(tim)
f = open("output.txt", "w")
f.write(write)
f.close()
s.close()
This is my output:
HTTP/1.1 200 OK
Date: Wed, 04 Jan 2017 12:34:14 GMT
Server: Apache
Last-Modified: Wed, 04 Jan 2017 12:34:12 GMT
Vary: Accept-Encoding
Content-Type: text/plain
X-Varnish: 110069959 109349321
Age: 32
Via: 1.1 varnish (Varnish/5.0)
ETag: W/"b173bdaf-2fb-54544008156e2"
Accept-Ranges: bytes
Content-Length: 763
Connection: close

12345 7.0 7.8 318 5.4 85 1016.9 1.0 4.2 4.2 0.014 0.086 18.7 38 100.0 34 0.0 0 0 0.2 -100.0 255.0 -100.0 -100.0 -100.0 -100.0 -100 -100 -100 13 20 58 WeerinDelft-13:20:58 2 100 4 1 100 100 100 100 100 100 100 2.6 4.0 8.0 5.1 34 zonnig/Gestopt_met_regenen 0.2 4 4 4 7 5 5 8 6 6 5 6 6 4 4 4 4 5 6 9 8 30.4 3.0 949.9 4/1/2017 7.5 3.6 6.0 0.9 0.5 14 12 10 12 7 11 8 5 6 10 6.8 6.9 6.7 6.5 6.4 6.5 5.7 5.3 5.1 5.3 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1.0 1.0 1.0 8.0 5.1 5.4 18.2 0 13:09:41 2017/04/01 326 522 91 -100.0 -100.0 5 0 0 0 0 102.0 18.9 18.7 4.7 1017.2 1014.5 24 12:40 10:35 6.1 0.8 6.2 1.5 15 2017 -13.9 -1 1 -1 341 336 336 309 331 358 336 318 310 318 10.0 255.0 7.5 4.4 51.97944 -4.34139 0.6 90 66 1.0 10:46 0.0 0.0 0.0 0.0 0.0 0.0 249.8 05:47 13:11 !!C10.37S13!!
I have used requests before and it worked like a charm. Unfortunately the assignment is to use the socket module. I think I know where the problem lies, but not how to solve it. I need to get rid of the HTTP status line and headers and just read the file, so I can print out the right part of it. At the moment the script only succeeds some of the time, because the text file shifts and my script relies on this fixed slice of the string: integer = float(variable_1[46:50]). I hope you understand what I mean. My apologies in advance if this post has some flaws. It's my first one and I am fairly new to programming. Thanks in advance.
An HTTP response separates the headers and the content with a blank line. So you can use write.split('\r\n\r\n', 1)[1] to get rid of the HTTP status line and headers and extract only the content of the response.
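As a rough sketch of that idea, the snippet below splits a canned response instead of a live socket read, so it runs without network access. The `split_http_response` helper and the shortened `sample` bytes are illustrative, and reading the temperature from the fifth whitespace-separated field is an assumption about the clientraw.txt layout that the post does not confirm.

```python
def split_http_response(raw):
    """Split raw HTTP response bytes into (header_block, body)."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        # No blank line yet: the response has probably not fully arrived.
        raise ValueError("header/body separator not found")
    return head, body


# Shortened stand-in for the bytes received from the socket.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"12345 7.0 7.8 318 5.4 85 1016.9"
)

head, body = split_http_response(sample)

# Split on whitespace instead of slicing fixed character offsets, so the
# lookup keeps working when the header length or field widths change.
fields = body.decode("ascii").split()
temperature = float(fields[4])  # assumed position of the temperature
print(temperature)
```

Note that a single recv call is not guaranteed to return the whole response; in the real script, keep calling recv and appending the chunks until it returns an empty string before splitting.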
Why were the ranks different?
One:
data have;
input x1 x2;
diff=x1-x2;
a_diff= round(abs(diff), .01);
* a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
Results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.5
6 2.9 4.9 -2.0 2.0 5.5
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Two:
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
Results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.0
6 2.9 4.9 -2.0 2.0 6.0
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Please note Obs 3, 9, 5 and 6: why were the ranks different? Thank you!
Run the code below and you'll see that the values are actually different. That's because of inaccuracies in numeric storage. It is similar to how 1/3 is not exactly representable in decimal notation (0.333333333333333 etc.), so 1-(1/3)-(1/3)-(1/3) is not equal to zero if you use, say, ten digits to store each result as you go (it is then equal to 0.0000000001). Any computer system will have the same issue with certain numbers that appear to store nicely in decimal (base 10) but do not in binary. The solution here is basically to round as you are, or to fuzz the result, which amounts to the same thing (it ignores differences less than 1x10^-12).
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
put a_diff= hex16.;
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
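The same effect can be reproduced outside SAS. The Python sketch below assumes the usual IEEE 754 double-precision storage and uses the two rows that tie only after rounding (25.5 vs 27.5 and 2.9 vs 4.9); `float.hex` plays roughly the role of the hex16. format above, and `math.isclose` with an absolute tolerance stands in for the fuzz idea.

```python
import math

# Observations 5 and 6 from the question.
a = abs(25.5 - 27.5)  # both inputs are exactly representable in binary
b = abs(2.9 - 4.9)    # neither input is, so the difference is slightly off

# The stored values differ in the last bit of the mantissa.
print(a.hex())
print(b.hex())
print(a == b)

# Rounding, as in the first version of the SAS step, makes them tie again.
print(round(a, 2) == round(b, 2))

# Tolerance-based comparison, similar in spirit to fuzzing the result.
print(math.isclose(a, b, abs_tol=1e-12))
```

Because the unrounded values are not equal, a ranking procedure sees two distinct values and assigns ranks 5 and 6 instead of the tied 5.5.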