Join in Dataframe - python-2.7

Joining two dataframes
df1 =
Customer_id Month Weightage_Pos
76 April 1.4
76 February 1.4
76 January 1.4
76 June 1.4
76 March 1.4
76 May 1.4
106 April 1.4
106 June 1.4
106 May 1.4
177 June 1.4
212 May 1.4
313 May 1.4
580 April 1.4
580 February 1.4
732 January 1.4
861 April 2
Another dataframe df2 =
Customer_id Month Weightage_Available_Balance Credit_Card_Weightage Inflow_Weightage Final_weightage
76 April 2 0 0.15 2.15
76 February 0 0 1.8 1.8
76 January 0 0 0.15 0.15
76 June 2 0 0 2
76 March 1.8 0 2.1 3.9
76 May 2 0 0.15 2.15
106 April 2 0 0 2
106 February 2 0 0.45 2.45
106 January 0 0 0 0
106 June 2 0 0 2
106 March 2 0 0.45 2.45
106 May 2 0 0 2
119 April 0 0 0.3 0.3
119 March 0 0 0.15 0.15
119 May 0 0 2.4 2.4
177 June 1.8 1.2 0.15 3.15
177 May 0.8 1.2 0 2
198 February 0 0 0.45 0.45
198 June 0.8 0 0.45 1.25
198 March 0 0 1.2 1.2
313 April 0.8 0 0.15 0.95
313 March 0.8 0 0 0.8
313 May 0.8 0 0 0.8
397 May 0 0 0 0
547 February 0 0 0.15 0.15
547 May 0 0 0.3 0.3
I wrote this code:
final_data_frame = pd.merge(df2, df1, on=['Customer_id', 'Month'], how='left')
But the output in final_data_frame is not correct: it shows all column values in df2 as NaN, with the additional column Weightage_Pos.
How can this issue be resolved? Is the join method above wrong?
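The post does not show the dtypes or the raw key values, so the following is only a guess at one frequent cause: the key columns look identical when printed but do not compare equal (for example, stray whitespace in Month). A minimal sketch, using hypothetical miniature frames in which the trailing spaces are an assumption made to reproduce the symptom:

```python
import pandas as pd

# Hypothetical miniature versions of df1 and df2. The trailing spaces in
# df1["Month"] are an assumption, added only to reproduce a key mismatch.
df1 = pd.DataFrame({
    "Customer_id": [76, 76, 106],
    "Month": ["April ", "June ", "May "],
    "Weightage_Pos": [1.4, 1.4, 1.4],
})
df2 = pd.DataFrame({
    "Customer_id": [76, 76, 106, 119],
    "Month": ["April", "June", "May", "April"],
    "Final_weightage": [2.15, 2.0, 2.0, 0.3],
})

# Merging as-is finds no matching keys, so Weightage_Pos is NaN everywhere.
raw = pd.merge(df2, df1, on=["Customer_id", "Month"], how="left")

def normalise_keys(df):
    """Return a copy with the join keys cleaned up (helper name is illustrative)."""
    out = df.copy()
    out["Customer_id"] = out["Customer_id"].astype(int)
    out["Month"] = out["Month"].str.strip()
    return out

# indicator=True adds a _merge column showing which rows found a partner.
fixed = pd.merge(normalise_keys(df2), normalise_keys(df1),
                 on=["Customer_id", "Month"], how="left", indicator=True)
print(fixed)
```

Rows marked left_only in the _merge column are the ones with no partner in df1; with the data shown in the question, customers such as 119 would legitimately end up there.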

Related

Able to transform observation to 0 but had issue with their total

I have a raw data set (linked in the original post).
I tried to transform the observations containing a "T" to 0, and then read in the data set and print it out. Just this.
However, with my code, just looking at the first observation in line 5, something is apparently off.
For instance, the first observation for "Nov" should not be 0.
I could not figure out what had gone wrong, and I wonder if anyone could give me some advice on what to do next. Thank you very much! Highly appreciated.
My code is as below:
INFILE "&DIRLSB.Pr1Snowfall1.csv" DSD FIRSTOBS=5;
DROP i;
INPUT Season $@;
INPUT Year 1-4 Season 1-7 Sep Oct Nov Dec Jan Feb Mar Apr May Total;
ARRAY Months (*) Sep -- May;
DO i = 1 TO dim(Months);
IF Months(i)=. Then Months(i)=0;
END;
RUN;
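I'm guessing you have a MISSING T; statement somewhere, so that T (trace) is read as the special missing value .T. Note that .T does not equal . (ordinary missing), so a test like Months(i)=. does not catch those values.
I would use the COALESCE function. There is really no need to change missing T to 0, is there?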
missing t;
data snow;
infile cards firstobs=2;
input Season:$7. Sep Oct Nov Dec Jan Feb Mar Apr May Total;
array mth[*] Sep--May;
do i = 1 to dim(mth);
mth[i] = coalesce(mth[i],0);
end;
t = sum(of mth[*]);
drop i;
cards;
Season Sep Oct Nov Dec Jan Feb Mar Apr May Total
1884-85 0 T 1 27.1 22.2 17 3.5 19.5 T 90.3
1885-86 0 1.7 8.2 8.4 16.9 16 6.5 7 0 64.7
1886-87 0 T 22.2 12.5 12 18.4 6.3 1.2 0 72.6
1893-94 0 0.5 6.1 27.6 20 29.5 5.4 13.3 0 102.4
1894-95 0 T 11.1 22.1 26.5 23.6 9.5 0.6 0 93.4
1895-96 0 1.5 5.9 8.7 22.5 39.1 45.1 1 0 123.8
1896-97 0 T 5.5 13.9 20.1 13.7 8.1 5.2 0 66.5
1897-98 0 0 10.1 18.4 32.1 26.8 1.2 2.4 0 91
1898-99 0 T 10.6 27 16.6 16.3 21.2 4.3 T 96
1899-00 T T 1.3 21.5 24.7 28.5 54 1.3 0 131.3
1906-07 0 5 5.7 18.7 11.7 15.7 3.1 2.5 1.3 63.7
1907-08 0 0 2.2 11.6 16.5 19.8 7.9 6.3 3 67.3
1908-09 0 0.5 4.6 10 22.5 6.1 9.7 9.8 3.3 66.5
1909-10 0 T 1.7 14.6 22 42.7 3.4 0.5 0 84.9
1910-11 0 2.2 15.7 29.8 9.5 30 13.5 4.7 2 107.4
1911-12 0 0 6.5 7.5 21.5 10.8 8.8 6.9 T 62
;;;;
run;
proc print;
run;

How to remove observation's correlation loading points in correlation loading plot in SAS?

Correlation Loading Plot from Proc PLS in SAS
Hi All,
I used Proc PLS to do a multivariate analysis and got the plot attached. How can I remove the green points in the picture? I think they are the observations' correlation values. For example, I have 90 observations, and each of them has a loading value on factor1 and factor2, so there are 90 green points in the picture. Can anyone tell me which option suppresses them?
For example, the data looks like this:
par1 par2 par3 par4 par5 par6 par7 location
2680 0.546089996 237 1 0.172 2.25 305 5
3750 0.54836587 140 1.55 0.111 1.06 425 5
3590 0.54878718 168 1.27 0.131 0.969 516 5
2390 0.549510935 183 1.07 0.096 1.84 260 5
3780 0.549631747 140 1.12 0.118 1.98 472 5
2790 0.549934008 200 1.1 0.221 2.13 321 5
2880 0.5499945 227 1.14 0.185 1.54 439 5
2910 0.550357733 259 1.31 0.116 1.31 289 5
2420 0.550842789 177 1.32 0.044067423 1.95 260 5
3850 0.550964187 128 1.41 0.117 1.08 471 5
3530 0.552425146 165 1.23 0.11 1.57 494 5
2730 0.552913856 223 1.03 0.17 2 330 5
3130 0.553158535 252 1.02 0.174 2.13 322 5
3040 0.553709856 272 1.21 0.155 1.97 317 5
3830 0.554139421 153 1.27 0.137 1.47 455 5
3930 0.554569654 164 1.17 0.116 1.5 481 5
2430 0.554569654 136 1.3 0.198 2.11 226 8
3630 0.555247085 137 1.17 0.1 1.75 413 5
2490 0.555432126 176 1.06 0.113 1.39 236 5
3490 0.555555556 166 1.28 0.044444444 1.65 465 5
3840 0.556173526 164 1.23 0.0949 1.66 470 5
2480 0.556173526 239 1.28 0.102 2.2 238 5
3760 0.556173526 191 1.33 0.131 2.12 447 5
3850 0.556173526 174 1.35 0.241 2.42 381 3
3410 0.557413601 174 1.14 0.107 1.48 419 5
2960 0.559284116 229 1.08 0.165 1.99 304 5
3410 0.559284116 137 1.19 0.291 2.17 375 8
3300 0.560538117 121 1.13 0.153 1.82 352 8
3090 0.560538117 134 1.16 0.167 1.17 416 4
3210 0.560538117 124 1.09 0.172 0.82 390 4
3950 0.560538117 130 1.29 0.199 1.89 440 4
3300 0.561167228 131 1.06 0.242 2.45 367 8
2210 0.561167228 162 0.885 0.288 3.32 208 4
3170 0.561797753 126 1.3 0.151 1.31 388 4
2740 0.561797753 96.1 1.22 0.245 0.827 254 3
3750 0.561797753 144 1.08 0.257 2.62 366 3
3640 0.562429696 120 1.32 0.159 1.63 347 8
3210 0.563063063 148 1.29 0.206 2.18 352 8
2300 0.563697858 179 0.936 0.181 2.29 223 2
3410 0.564334086 141 0.856 0.136 2.03 370 8
3500 0.564334086 126 1.38 0.177 1.45 355 8
3470 0.564334086 101 0.989 0.222 1.84 349 3
2260 0.564334086 171 0.942 0.224 2.08 219 2
2220 0.564334086 180 0.956 0.281 1.84 219 4
2340 0.564971751 165 1.05 0.228 2.25 240 8
2380 0.564971751 161 0.976 0.287 1.6 214 4
3220 0.56561086 148 1.21 0.121 0.568 520 6
3920 0.566251416 176 1.08 0.045300113 2.26 637 6
3830 0.566251416 137 1.48 0.203 1.23 387 3
2510 0.566251416 152 1.24 0.222 1.84 223 8
2760 0.566251416 168 0.994 0.282 1.31 280 4
2640 0.566251416 154 0.979 0.345 1.52 291 4
3570 0.566893424 165 1.33 0.155 2.18 505 6
3170 0.566893424 126 1.08 0.162 1.41 341 4
3700 0.566893424 159 1.3 0.17 1.64 449 4
3250 0.566893424 104 1.32 0.2 1.37 372 8
3740 0.566893424 159 1.23 0.216 1.69 409 1
3380 0.566893424 163 1.53 0.245 2.19 367 3
3240 0.56753689 136 1.07 0.153 1.88 383 4
3400 0.56753689 109 1.36 0.161 1.16 420 4
3760 0.56753689 150 0.93 0.169 1.68 537 4
3560 0.56753689 123 1.03 0.193 2.32 374 8
2360 0.56753689 163 0.697 0.235 1.94 243 8
2430 0.56753689 166 0.762 0.247 2.31 231 8
3330 0.568181818 148 1.11 0.174 2 393 4
3080 0.568181818 139 1.13 0.188 2.08 349 8
3230 0.568181818 116 1.23 0.199 1.77 328 8
2180 0.568181818 144 1.01 0.215 2.13 207 8
2520 0.568181818 128 0.809 0.369 1.65 306 4
3320 0.568828214 152 1.15 0.14 1.65 395 4
2300 0.568828214 134 0.908 0.221 1.56 233 8
3730 0.568828214 141 1.58 0.238 1.96 405 3
3800 0.568828214 160 1.24 0.241 2.2 402 3
2440 0.568828214 153 1.03 0.258 1.89 223 4
3910 0.568828214 209 1.26 0.275 2.26 350 3
4010 0.569476082 139 1.28 0.045558087 1.7 602 6
2340 0.570125428 167 1.1 0.18 1.57 208 2
2360 0.570125428 176 0.704 0.2 1.6 219 2
3490 0.570776256 171 1.43 0.269 2.4 360 3
2620 0.571428571 132 1.09 0.202 1.8 224 8
3740 0.571428571 172 1.27 0.256 1.92 355 3
3600 0.57208238 128 1.16 0.17 1.94 434 4
3360 0.57208238 150 1.18 0.171 1.81 353 1
3620 0.57208238 131 1.28 0.177 2.24 360 3
3560 0.57208238 139 1.15 0.229 1.9 366 3
2740 0.572737686 277 0.876 0.171 1.71 290 10
2340 0.572737686 148 0.964 0.231 1.18 250 6
2760 0.572737686 168 0.905 0.303 2.1 264 4
2890 0.572737686 204 0.857 0.331 2.32 272 2
The code is:
proc pls data=check method=rrr;
class location;
model par1-par7=location;
run;
In general, I don't think there's a simple way to do what you're looking for. You may want to construct your own graph.
You can get the template for the graph; I'll paste that here. Unfortunately all of the data printed on the graph is printed in a single statement, so it's not helpful to just comment out one line: you comment out the scatterplot x=CORRX y=CORRY and you remove all of the data. I also don't see that ODS Graphics Editor will be able to do this.
You would be best off probably constructing your own chart using this as a base, but calling it from PROC SGRENDER so you can control how the data comes in.
Here's the template, and you'll see the spot I'm talking about:
proc template;
define statgraph Stat.PLS.Graphics.CorrLoadPlot;
dynamic Radius1 Radius2 Radius3 Radius4 xLabel xShortLabel yLabel
yShortLabel CorrX CorrXLab TraceX CorrY CorrYLab TraceY _byline_
_bytitle_ _byfootnote_;
BeginGraph /;
entrytitle "Correlation Loading Plot";
layout overlayequated / equatetype=square commonaxisopts=(
tickvaluelist=(-1.0 -0.75 -0.5 -0.25 0 0.25 0.5 0.75 1.0) viewmin=
-1 viewmax=1) xaxisopts=(label=XLABEL shortlabel=XSHORTLABEL
offsetmin=0.05 offsetmax=0.05 gridDisplay=auto_off) yaxisopts=(
label=YLABEL shortlabel=YSHORTLABEL offsetmin=0.05 offsetmax=0.05
gridDisplay=auto_off);
ellipseparm semimajor=RADIUS1 semiminor=RADIUS1 slope=0.0 xorigin=
0.0 yorigin=0.0 / clip=true display=(outline) outlineattrs=(
pattern=dash) datatransparency=0.75;
scatterplot x=XCIRCLE1LABEL y=YCIRCLE1LABEL / markercharacter=
CIRCLE1LABEL datatransparency=0.75 primary=true;
ellipseparm semimajor=RADIUS2 semiminor=RADIUS2 slope=0.0 xorigin=
0.0 yorigin=0.0 / clip=true display=(outline) outlineattrs=(
pattern=dash) datatransparency=0.75;
scatterplot x=XCIRCLE2LABEL y=YCIRCLE2LABEL / markercharacter=
CIRCLE2LABEL datatransparency=0.75 primary=true;
ellipseparm semimajor=RADIUS3 semiminor=RADIUS3 slope=0.0 xorigin=
0.0 yorigin=0.0 / clip=true display=(outline) outlineattrs=(
pattern=dash) datatransparency=0.75;
scatterplot x=XCIRCLE3LABEL y=YCIRCLE3LABEL / markercharacter=
CIRCLE3LABEL datatransparency=0.75 primary=true;
ellipseparm semimajor=RADIUS4 semiminor=RADIUS4 slope=0.0 xorigin=
0.0 yorigin=0.0 / clip=true display=(outline) outlineattrs=(
pattern=dash) datatransparency=0.75;
scatterplot x=XCIRCLE4LABEL y=YCIRCLE4LABEL / markercharacter=
CIRCLE4LABEL datatransparency=0.75 primary=true;
scatterplot x=CORRX y=CORRY / group=CORRGROUP Name="ScatterVars"
markercharacter=CORRLABEL rolename=(_id1=_ID1 _id2=_ID2 _id3=
_ID3 _id4=_ID4 _id5=_ID5) tip=(y x group markercharacter _id1
_id2 _id3 _id4 _id5) tiplabel=(y=CORRXLAB x=CORRYLAB group=
"Corr Type" markercharacter="Corr ID");
SeriesPlot x=TRACEX y=TRACEY / tip=(y x) tiplabel=(y=CORRYLAB x=
CORRXLAB);
endlayout;
if (_BYTITLE_)
entrytitle _BYLINE_ / textattrs=GRAPHVALUETEXT;
else
if (_BYFOOTNOTE_)
entryfootnote halign=left _BYLINE_;
endif;
endif;
EndGraph;
end;
run;
I would consider posting this on communities.sas.com and seeing if one of the developers can give you more specific information; Sanjay and Dan often post there and may well be able to give you a simpler answer.

Python 2.7: Reading a text file online to a string and printing output

I am reading data from this link: http://www.weerindelft.nl/clientraw.txt.
The main goal is to print out the temperature that http://www.weerindelft.nl displays. I have discovered that it's in that text file, so I only need to print out the right part of the file.
This is my code:
import socket
from decimal import Decimal
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("www.weerindelft.nl" , 80))
s.sendall("GET http://www.weerindelft.nl/clientraw.txt HTTP/1.0\n\n")
write = s.recv(1427247693)
variable_1 = str(write[311:])
integer = float(variable_1[46:50])
tim = round(integer,0)
print Decimal(tim)
f = open("output.txt", "w")
f.write(write)
f.close()
s.close()
This is my output:
HTTP/1.1 200 OK
Date: Wed, 04 Jan 2017 12:34:14 GMT
Server: Apache
Last-Modified: Wed, 04 Jan 2017 12:34:12 GMT
Vary: Accept-Encoding
Content-Type: text/plain
X-Varnish: 110069959 109349321
Age: 32
Via: 1.1 varnish (Varnish/5.0)
ETag: W/"b173bdaf-2fb-54544008156e2"
Accept-Ranges: bytes
Content-Length: 763
Connection: close
12345 7.0 7.8 318 5.4 85 1016.9 1.0 4.2 4.2 0.014 0.086 18.7 38 100.0 34 0.0 0 0 0.2 -100.0 255.0 -100.0 -100.0 -100.0 -100.0 -100 -100 -100 13 20 58 WeerinDelft-13:20:58 2 100 4 1 100 100 100 100 100 100 100 2.6 4.0 8.0 5.1 34 zonnig/Gestopt_met_regenen 0.2 4 4 4 7 5 5 8 6 6 5 6 6 4 4 4 4 5 6 9 8 30.4 3.0 949.9 4/1/2017 7.5 3.6 6.0 0.9 0.5 14 12 10 12 7 11 8 5 6 10 6.8 6.9 6.7 6.5 6.4 6.5 5.7 5.3 5.1 5.3 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1.0 1.0 1.0 8.0 5.1 5.4 18.2 0 13:09:41 2017/04/01 326 522 91 -100.0 -100.0 5 0 0 0 0 102.0 18.9 18.7 4.7 1017.2 1014.5 24 12:40 10:35 6.1 0.8 6.2 1.5 15 2017 -13.9 -1 1 -1 341 336 336 309 331 358 336 318 310 318 10.0 255.0 7.5 4.4 51.97944 -4.34139 0.6 90 66 1.0 10:46 0.0 0.0 0.0 0.0 0.0 0.0 249.8 05:47 13:11 !!C10.37S13!!
I have used requests before and it worked like a charm. Unfortunately the assignment is to use the socket module. I think I know where the problem lies, but not how to solve it. I need to get rid of the HTTP status line and headers and just read the file contents, so I can print out the right part. At the moment the script only succeeds some of the time, because the contents of the text file shift and my script relies on a fixed slice:
integer = float(variable_1[46:50])
This part of the text file/string.
I hope you guys understand what I mean. My apologies in advance if this post has some flaws. It's my first one and I am fairly new to programming.
Thanks in advance.
An HTTP response separates the headers from the content with a blank line.
So you can use
write.split('\r\n\r\n', 1)[1]
to get rid of the status line and headers and extract only the content of the response.
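Once the headers are gone, splitting the body on whitespace avoids the fixed character offsets that break when the file shifts. A minimal sketch on a shortened copy of the response from the question, assuming the temperature is the fifth whitespace-separated field (index 4). That matches the 5.4 in the sample, but the post does not confirm the field position, and split_http_response is an illustrative helper name:

```python
def split_http_response(raw):
    """Return (header_text, body_text) from a raw HTTP response string."""
    head, sep, body = raw.partition("\r\n\r\n")
    if not sep:
        raise ValueError("no blank line found between headers and body")
    return head, body

# Shortened sample; in the real script this would be the data read from
# the socket (the variable called `write` in the question).
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/plain\r\n"
    "Connection: close\r\n"
    "\r\n"
    "12345 7.0 7.8 318 5.4 85 1016.9 1.0 4.2 4.2"
)

headers, body = split_http_response(sample)
fields = body.split()            # whitespace-separated values
temperature = float(fields[4])   # assumed position of the temperature
print(int(round(temperature)))   # prints 5
```

Indexing by field rather than by character position keeps working when an earlier value changes width, for example when the wind speed goes from 7.0 to 12.3.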

Perl regex: More than one line to match and output in a given format: Completely revised

If one can understand how to store stuff in memory, know where it is, and how to get it out again in an orderly way, it will go a long way to achieving results in Perl (and probably in all programming languages).
I am not a programmer.
I am trying to extract data from 'older program' output and import it into a SQL database. The extraction is the thing.
My previous question was largely incorrect, as I found out when importing data into the table, as I did not have enough data from the 'old program' output file.
I would like to learn from my mistake and re-ask my previous question, hopefully correctly this time.
I have included my poor effort at extracting the data, exactly as it was last time. It doesn't come anywhere near getting the correct data out.
I believe this is quite a complex question but maybe it isn't.
It is certainly above my level of Perl at present, and maybe ever.
Answers to my incorrectly phrased question have been partially understood. Thank you very much for them.
If I could summarize it, my main problem with this task is dealing with the type of question: 'If a line contains ..., get data from two lines up and insert it at the beginning of ...'. Seemingly impossible for me.
I tried regex over the end of line searches but was unable to get that to work.
I was unable also to arrange successive loops to insert data in lines as I wanted. If one loop worked, the next did not and so on. I was prepared to work on successive files in a step by step process but the 'two lines up' question stumped me completely.
I was able to extract other data from these output files relatively easily as they are very orderly files, but this particular question has me stumped.
My revised question is:
My input file consists of batches of data (roughly 50 to 70 lines long) in the following format:
1(P1) 3 P.ell 05/0120 W/P068819 0 12.0 98/99 380 380 C03 104 PROCESSED 21/02/16 TIME 22.16.52 KSINA=8
AGE SPH %THN %INC SV SI MAI20 HTPC VIPC AGE BA DBH HT SPH CIH% CIV% CVD BCON CMAI C0 C0CAL SI20
0 1100 .0 89.0%SPH 2 2 .00 .0 .0 20.00 1 .0 17.3 0 .0 .0 .0 .0 .0000 .000 0% .00
7 815 25.9 .0 2 2 9.90 75.5 47.2 20.00 1 26.6 17.3 330 .0 .0 .0 13.0 .2099 1.005 .000 17.30
13 550 32.5 .0
18 330 40.0 .0
45 0 100.0 .0
0SQ -4 -4 -4 = SI20 17 17 17 PLANTN---104 GREEN MEADOWS MODEL---P.ELLIOTTII MAC MAC SQ 10 SI20 22.90
HTPC 76 76 76 =MAI20 10 10 10 FROM HTPC HTPC 100 MAI20 20.71
VIPC 47 47 47 =MAI20 10 10 10 HTPC/VIPC REGRESSION---P.ELLIOTTII GENERAL 1/83 VIPC 100 MAI20 20.99
MAIDBH 0
INMAI==> 0
0INPUT FOR CALCULATING HTPC & VIPC = HT ---- ----
AGE DBH HT VTREE SPH BA TOTAL WS UTIL S A B C D TCAI CTCAI TMAI UCAI CUCAI UMAI SCAI CSCAI SMAI IAGE
1 .0 .2 .0000 979 0 0 0 0 0 0 0 0 0 .0 0 .0 .0 0 .0 .0 0 .0 1.0
2 .0 .9 .0000 979 0 0 0 0 0 0 0 0 0 .0 0 .0 .0 0 .0 .0 0 .0 2.0
3 3.9 2.0 .0007 979 1 1 1 0 0 0 0 0 0 .7 1 .2 .0 0 .0 .0 0 .0 3.0
4 7.1 3.4 .0041 979 4 4 3 1 1 0 0 0 0 3.4 4 1.0 .6 1 .2 .0 0 .0 4.0
5 9.4 4.6 .0102 979 7 10 5 5 5 0 0 0 0 5.9 10 2.0 4.1 5 .9 .0 0 .0 5.0
6 11.3 5.7 .0188 979 10 18 6 12 12 1 0 0 0 8.4 18 3.1 7.5 12 2.0 .0 0 .0 6.0
7 13.0 6.7 .0293 979 13 29 7 22 19 3 0 0 0 10.3 29 4.1 9.7 22 3.1 .0 0 .0 7.0
17%
THN 11.4 6.7 .0230 164 2 4 1 3 2 0 0 0 0
REM 13.4 6.7 .0315 815 12 26 6 20 17 3 0 0 0
8 15.0 7.6 .0453 815 14 37 6 31 21 10 0 0 0 11.2 40 5.0 10.9 33 4.1 .0 0 .0 7.6
9 16.4 8.5 .0607 815 17 49 6 43 23 20 0 0 0 12.5 52 5.8 12.2 45 5.0 .2 0 .0 8.6
10 17.4 9.4 .0771 815 19 63 7 56 24 30 2 0 0 13.4 66 6.6 13.1 58 5.8 1.3 2 .2 9.6
11 18.3 10.3 .0941 815 21 77 7 70 24 41 5 0 0 13.9 80 7.3 13.6 72 6.5 3.0 5 .4 10.6
12 19.0 11.3 .1118 815 23 91 7 84 24 50 10 0 0 14.4 94 7.8 14.1 86 7.2 5.4 10 .8 11.6
13 19.6 12.2 .1299 815 25 106 8 98 24 56 18 0 0 14.7 109 8.4 14.4 100 7.7 8.0 18 1.4 12.6
33%
THN 17.5 12.2 .1044 265 6 28 2 25 8 15 3 0 0
REM 20.6 12.2 .1421 550 18 78 5 73 16 42 15 0 0
14 21.3 13.0 .1636 550 20 90 6 84 16 44 25 0 0 11.8 121 8.6 11.6 112 8.0 10.0 28 2.0 10.4
15 22.0 13.7 .1864 550 21 103 6 97 16 45 36 0 0 12.5 133 8.9 12.3 124 8.3 11.0 39 2.6 11.2
16 22.7 14.5 .2100 550 22 116 6 109 15 46 48 0 0 13.0 146 9.1 12.7 137 8.6 12.0 51 3.2 12.0
17 23.3 15.3 .2345 550 23 129 6 123 15 46 61 0 0 13.5 160 9.4 13.2 150 8.8 12.9 64 3.8 12.8
18 23.9 15.9 .2598 550 25 143 7 136 15 46 74 1 0 13.9 174 9.6 13.6 164 9.1 13.8 78 4.3 13.6
40%
THN 21.6 15.9 .2142 220 8 47 2 45 6 19 20 0 0
REM 25.3 15.9 .2901 330 17 96 4 92 9 28 54 1 0
19 26.0 16.6 .3203 330 17 106 4 101 9 27 63 3 0 10.0 184 9.7 9.8 174 9.1 10.5 88 4.6 11.0
20 26.6 17.3 .3519 330 18 116 5 112 9 27 71 5 0 10.4 194 9.7 10.2 184 9.2 10.6 99 4.9 11.7
21 27.2 18.0 .3849 330 19 127 5 122 9 27 80 8 0 10.9 205 9.8 10.7 194 9.3 11.1 110 5.2 12.4
22 27.9 18.7 .4192 330 20 138 5 133 8 26 87 11 0 11.3 216 9.8 11.1 206 9.3 11.5 121 5.5 13.2
23 28.4 19.3 .4546 330 21 150 5 145 8 26 94 16 0 11.7 228 9.9 11.4 217 9.4 11.8 133 5.8 14.0
24 29.0 20.0 .4914 330 22 162 5 157 8 26 101 22 0 12.2 240 10.0 11.9 229 9.5 12.3 145 6.1 14.9
25 29.6 20.6 .5292 330 23 175 6 169 8 25 106 29 0 12.5 253 10.1 12.2 241 9.6 12.6 158 6.3 15.7
26 30.2 21.2 .5682 330 24 188 6 182 8 25 112 37 0 12.9 265 10.2 12.6 254 9.8 13.0 171 6.6 16.5
27 30.7 21.8 .6083 330 25 201 6 194 8 25 115 46 0 13.2 279 10.3 13.0 267 9.9 13.3 184 6.8 17.3
28 31.3 22.4 .6492 330 25 214 7 208 8 24 119 56 1 13.5 292 10.4 13.2 280 10.0 13.6 198 7.1 18.2
29 31.9 23.0 .6908 330 26 228 7 221 8 24 122 65 2 13.7 306 10.5 13.5 293 10.1 13.8 212 7.3 19.0
30 32.4 23.5 .7332 330 27 242 7 235 8 24 123 77 3 14.0 320 10.7 13.7 307 10.2 14.0 226 7.5 19.8
31 33.0 23.9 .7766 330 28 256 7 249 8 24 125 88 5 14.3 334 10.8 14.0 321 10.4 14.3 240 7.7 20.4
32 33.6 24.4 .8202 330 29 271 8 263 8 23 126 99 7 14.4 349 10.9 14.1 335 10.5 14.4 255 8.0 21.0
Firstly, the two variables in the first line (1(P1)...), in this case 'C03 104', need to be extracted and sent to OUTPUT. (Same as the previous question, but the output position changes.)
Secondly, all lines beginning with 'THN' need to be extracted as they are, except that the THN can be dropped.
If there are two, three, four or even five etc. 'THN' lines, they all need to be extracted from the batch and sent to OUTPUT. (Roughly the same as the previous question.)
Thirdly, although sequentially the second step, the last figure in the 'AGE' column of the main tabular data just before the 'THN' line needs to be attached to the extracted 'THN' line directly below it (in this case the figures 7, 13 and 18). These need to be added to their respective THN lines. See the expected output below, where the ages have been inserted after the two 'C03 104' variables in each line.
If there are no 'THN' lines in a given batch, the entire batch should be ignored, with no output, and the next batch (starting with a '1(P1)' again) considered.
The correct output expected from the above batch is:
C03 104 7 11.4 6.7 .0230 164 2 4 1 3 2 0 0 0 0
C03 104 13 17.5 12.2 .1044 265 6 28 2 25 8 15 3 0 0
C03 104 18 21.6 15.9 .2142 220 8 47 2 45 6 19 20 0 0
As will be seen from this, the two variables from the top line are inserted at the start of the output THN data line. The age figure read from the input batch is then inserted into its respective THN line and thereafter the rest of the THN line data is attached.
My effort some time ago but not updated is as follows:
while ( my $line = <INPUT> ) {
if($line =~ /\s{6,11}(\w{1}\d{1}\w{0,5})\s{0,5}(\d{3})/) {
my @c_no = "$1,$2\n";
foreach (@c_no) {
print OUTPUT $_;
}
if ($line =~ /^(\s{1}THN)(\s{1,3}\d{0,2}.\d)(\s{1,3}\d{0,2}.\d)(\s{1,2}\d{0,1}.\d{4})(\s{1,2}\d{2,4})
(\s{2,3}\d{1,2})(\s{1,6}\d{1,4})(\s{1,2}\d{1,2})(\s{1,5}\d{1,4})(\s{1,4}\d{1,4})
(\s{1,4}\d{1,4})(\s{1,4}\d{1,4})(\s{1,4}\d{1,4})(\s{1,4}\d{1,4})|(^1(P1))/x){
print OUTPUT "$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14\n";
}
}
}
Advice, guidance and help would be greatly appreciated.
This is much more simply done using split to separate each line into space-delimited fields, and quite straightforward if you maintain state variables for the two fields from the header row, and the age from any row whose first field is entirely digits. Then all that is necessary is to print these three values before the numbers on any line that starts with THN.
Note that it's simplest to pass the name of the input file as a parameter on the command line. Then all you have to do is read from <>. All the opening and error handling are already done for you.
The output format you've asked for is rather esoteric. I can't see any pattern to the column widths, and I've had to write a custom printf format to recreate it. If you need something else, all the values in each output line are in the @data array, which you can use as you wish.
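use strict;
use warnings 'all';
# State carried between lines: the two header fields and the latest age
my ($c1, $c2, $age);
while ( <> ) {
    next unless /\S/;    # skip blank lines
    my @fields = split;
    if ( $fields[0] eq '1(P1)' ) {
        # Header row: remember the two variables (e.g. C03 and 104)
        ($c1, $c2) = @fields[10,11];
    }
    elsif ( $fields[0] !~ /\D/ ) {
        # First field is all digits: remember it as the current age
        $age = $fields[0];
    }
    elsif ( $fields[0] eq 'THN' ) {
        # THN row: prefix its numbers with the saved header fields and age
        my @data = ( $c1, $c2, $age, @fields[1..13] );
        printf "%4s %5s %5d %5.1f%5.1f%7.4f%5d%4d%7d%3d%6d%5d%5d%5d%5d%5d\n", @data;
    }
}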
output
C03 104 7 11.4 6.7 0.0230 164 2 4 1 3 2 0 0 0 0
C03 104 13 17.5 12.2 0.1044 265 6 28 2 25 8 15 3 0 0
C03 104 18 21.6 15.9 0.2142 220 8 47 2 45 6 19 20 0 0
I copied and modified your example data, so this hasn't had a really good test. And I'm printing to STDOUT for testing purposes, but that should be easy to change.
The trick is to recognize that you've got line matching to do, which is great with regexes, and other processing, which is probably better with plain old code. So build a little loop, and process the lines with equal precedence (this is important for detecting errors in the file - don't try to nest things too much). Put in some state variables to help keep track of what comes next, and be sure to reset them appropriately.
Also, one thing I noticed in your example code is that you spent a lot of time getting spacing and number-of-digits right for the fields. That was almost certainly wasted time in this context, since the key was the "THN" at the start of the line. One trick with processing text is to focus on the things you really need, and use .* for the other stuff. That way, line noise or a syntax error or some strange formatting glitch won't screw up your program. (Sometimes .* becomes [^"]* or whatever, but you take the point...)
my ($line_prefix, $have_age_col, $age_col);
while (<>) {
if (/^1\(P1\).*\s(?P<two_vars>\w+\s+\w+)\s+PROCESSED .* TIME .* KSINA=.*$/) {
# Start new section
$line_prefix = $+{two_vars};
$have_age_col = 0;
$age_col = undef;
}
if (/^AGE /) {
$have_age_col = 1;
}
if ($have_age_col && /^\s{0,5}(\d+)/) {
$age_col = substr "     ".$1, -5;
}
if (/^THN /) {
die "THN encountered without header"
unless $line_prefix;
die "THN encountered without age column"
unless $have_age_col and $age_col;
s/^THN \s*//;
s/\s+$//;
my $output = "$line_prefix $age_col $_\n";
print STDOUT $output;
}
}
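For readers more comfortable with Python, the state-variable idea used in both answers can be sketched as follows. This is an illustrative port rather than either answerer's code; the function name extract_thn and the shortened sample batch are assumptions, and the field positions 10 and 11 are taken from the header line shown in the question:

```python
def extract_thn(lines):
    """Collect THN rows, each prefixed with the two header fields and the
    most recent age seen in the main table."""
    c1 = c2 = age = None
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if fields[0] == "1(P1)":
            # Header row starts a new batch: remember its two variables
            c1, c2 = fields[10], fields[11]
            age = None
        elif fields[0].isdigit():
            # First field is all digits: remember it as the current age
            age = fields[0]
        elif fields[0] == "THN":
            rows.append([c1, c2, age] + fields[1:])
    return rows

# Shortened batch copied from the question's sample data
batch = [
    "1(P1) 3 P.ell 05/0120 W/P068819 0 12.0 98/99 380 380 C03 104 PROCESSED 21/02/16 TIME 22.16.52 KSINA=8",
    "6 11.3 5.7 .0188 979 10 18 6 12 12 1 0 0 0 8.4 18 3.1 7.5 12 2.0 .0 0 .0 6.0",
    "7 13.0 6.7 .0293 979 13 29 7 22 19 3 0 0 0 10.3 29 4.1 9.7 22 3.1 .0 0 .0 7.0",
    "17%",
    "THN 11.4 6.7 .0230 164 2 4 1 3 2 0 0 0 0",
    "REM 13.4 6.7 .0315 815 12 26 6 20 17 3 0 0 0",
]

for row in extract_thn(batch):
    print(" ".join(row))
```

A batch without THN lines produces no rows, which matches the requirement to ignore such batches entirely.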

excel, vba or regex to copy values downwards based on repeated values

I have the following records:
62
STARTHERE 1.1 vol. 84 no. 1 1996 01.1 A 0 1 1996 04 24 0
STARTHERE 1.2 vol. 84 no. 2 1996 01.2 A 0 1 1996 05 23 0
STARTHERE 1.3 vol. 84 no. 3 1996 01.3 A 1 1 1996 08 13 0
STARTHERE 1.4 vol. 84 no. 4 1996 01.4 A 0 1 1996 10 15 0
STARTHERE 1.5 vol. 84 no. 5 1996 01.5 A 0 1 1997 01 22 0
STARTHERE 1.6 vol. 84 no. 6 1996 01.6 A 0 1 1997 02 10 0
63
STARTHERE 1.1 95:1 Feb 2002 1.1 A 0 1 2002 06 03 0
STARTHERE 1.2 95:2 Apr 2002 1.2 A 0 1 2002 06 17 0
STARTHERE 1.3 95:3 Jun 2002 1.3 A 0 1 2002 07 18 0
STARTHERE 1.4 95:4 Aug 2002 1.4 A 0 1 2003 02 24 0
STARTHERE 1.5 95:5 Oct 2002 1.5 A 0 1 2003 02 24 0
64
65
STARTHERE 1.1 34:1 Mar 1996 1.1 A 0 1 1996 07 16 0
STARTHERE 1.2 34:2 Jun 1996 1.2 A 0 1 1996 09 19 0
STARTHERE 1.3 34:3 Sep 1996 1.3 A 0 1 1996 12 17 0
I don't know if this is possible in Excel, with VBA in Excel, or even through regex. I want to take each numerical value (e.g. 62) and use it to replace the "STARTHERE" values in the rows below it, up until the next numerical value (63). Right now it's done manually, but I was wondering whether there is a way of doing this automatically, through an Excel formula, VBA, or regex, as these are what I'm familiar with. The result I want is shown below. It's also okay if the rows that hold only a number (like the 62 with blank values to the right) are stripped, but I'm fine even if they are not:
62
62 1.1 vol. 84 no. 1 1996 01.1 A 0 1 1996 04 24 0
62 1.2 vol. 84 no. 2 1996 01.2 A 0 1 1996 05 23 0
62 1.3 vol. 84 no. 3 1996 01.3 A 1 1 1996 08 13 0
62 1.4 vol. 84 no. 4 1996 01.4 A 0 1 1996 10 15 0
62 1.5 vol. 84 no. 5 1996 01.5 A 0 1 1997 01 22 0
62 1.6 vol. 84 no. 6 1996 01.6 A 0 1 1997 02 10 0
63
63 1.1 95:1 Feb 2002 1.1 A 0 1 2002 06 03 0
63 1.2 95:2 Apr 2002 1.2 A 0 1 2002 06 17 0
63 1.3 95:3 Jun 2002 1.3 A 0 1 2002 07 18 0
63 1.4 95:4 Aug 2002 1.4 A 0 1 2003 02 24 0
63 1.5 95:5 Oct 2002 1.5 A 0 1 2003 02 24 0
64
65
65 1.1 34:1 Mar 1996 1.1 A 0 1 1996 07 16 0
65 1.2 34:2 Jun 1996 1.2 A 0 1 1996 09 19 0
65 1.3 34:3 Sep 1996 1.3 A 0 1 1996 12 17 0
Many thanks!
I assume this data is from an Excel spreadsheet, with both the numerical values and the value "STARTHERE" in the first column (column A), and the other data in columns B, C, etc.
Basically, I loop through the first column from the top row to the bottom. If the value in the current cell is not a number, it is set to the value of the cell right above it. If it is a number, we skip to the next cell.
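Sub help()
    Dim i As Long
    ' Display column A as whole numbers
    ActiveSheet.Columns(1).NumberFormat = "0"
    For i = 1 To ActiveSheet.UsedRange.Rows.Count
        ' Non-numeric cell (e.g. "STARTHERE"): copy the value from the cell above
        If Not Information.IsNumeric(Cells(i, 1)) Then Cells(i, 1).Value = Cells(i - 1, 1).Value
    Next i
End Sub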
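The same fill-down logic can be sketched outside Excel, for instance if the records are first exported to text. This is a minimal illustration, not part of the answer above; it assumes the first column has already been read into a list of strings, and fill_down is a hypothetical helper name:

```python
def fill_down(first_column):
    """Replace every non-numeric entry with the last numeric entry above it."""
    filled = []
    last_number = None
    for value in first_column:
        if str(value).strip().isdigit():
            last_number = value       # a new group number starts here
        filled.append(last_number)    # numeric rows keep their own value
    return filled

column_a = ["62", "STARTHERE", "STARTHERE", "63", "STARTHERE",
            "64", "65", "STARTHERE"]
print(fill_down(column_a))
```

Entries that appear before the first number stay as None, which makes a malformed export easy to spot.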