Extracting a table from a text file using PowerShell - regex

I have a table that I want to extract from a batch of text files. The problem is that the table does not begin at the same line in every file. Also, the presentation, format, and reuse of keywords make it really difficult (for me at least) to write a regex for it. I've figured out how to extract information from specific lines, but this table is just a no-go for me. I've researched regex expressions and splits but have come up empty.
The top of the file looks like this:
Summary Call Volume Statistics:
Total Calls = 1000
Total Hours = 486.7
Average Call Frequency = 2.05
Summary Reliability Statistics:
Total Queued Calls = 152
Total Calls = 1000
Total On Time Calls = 710
Total Reliability = 0.7100
Total Raw Demand = 640.00
Total Covered Demand = 437.79
Summary Business Statistics:
Total Servers = 4
Total Sim Time (secs) = 1752079
Total Server Time (secs) = 7008316
Total Server Busy Time (secs) = 0
Total Business = 0.0000
Detail Node Sim Reliability:
Node Calls On Time Percent Queued UnderTm OverTm
-------- -------- -------- -------- -------- -------- --------
0 97 81 0.8351 17 1637404 0
1 115 92 0.8000 25 1637404 0
2 103 90 0.8738 16 1637404 0
3 68 53 0.7794 17 1637404 0
4 63 57 0.9048 6 1637404 0
5 35 29 0.8286 7 1637404 0
6 31 27 0.8710 4 1637404 0
7 40 36 0.9000 6 1637404 0
8 22 17 0.7727 5 1637404 0
9 26 24 0.9231 1 1637404 0
10 24 21 0.8750 3 1637404 0
11 23 0 0.0000 5 1637404 0
12 23 20 0.8696 2 1637404 0
13 15 0 0.0000 2 1637404 0
14 20 19 0.9500 1 1637404 0
15 19 0 0.0000 1 1637404 0
16 23 18 0.7826 4 1637404 0
17 12 9 0.7500 4 1637404 0
18 10 10 1.0000 0 1637404 0
19 11 0 0.0000 1 1637404 0
20 13 0 0.0000 2 1637404 0
21 9 7 0.7778 1 1637404 0
22 11 9 0.8182 1 1637404 0
23 11 0 0.0000 2 1637404 0
24 14 6 0.4286 3 1637404 0
25 6 6 1.0000 0 1637404 0
26 6 0 0.0000 0 1637404 0
27 4 0 0.0000 1 1637404 0
28 5 5 1.0000 0 1637404 0
29 12 10 0.8333 1 1637404 0
30 12 11 0.9167 1 1637404 0
31 4 2 0.5000 2 1637404 0
32 8 8 1.0000 0 1637404 0
33 4 4 1.0000 0 1637404 0
34 6 0 0.0000 0 1637404 0
35 11 10 0.9091 1 1637404 0
36 7 0 0.0000 1 1637404 0
37 5 0 0.0000 2 1637404 0
38 5 0 0.0000 0 1637404 0
39 8 0 0.0000 2 1637404 0
40 6 6 1.0000 0 1637404 0
41 9 7 0.7778 2 1637404 0
42 4 1 0.2500 1 1637404 0
43 8 5 0.6250 1 1637404 0
44 1 1 1.0000 0 1637404 0
45 2 0 0.0000 0 1637404 0
46 5 4 0.8000 0 1637404 0
47 6 5 0.8333 0 1637404 0
48 3 0 0.0000 0 1637404 0
49 3 0 0.0000 0 1637404 0
50 2 0 0.0000 0 1637404 0
51 3 0 0.0000 1 1637404 0
52 2 0 0.0000 0 1637404 0
53 3 0 0.0000 0 1637404 0
54 2 0 0.0000 0 1637404 0
-------- -------- -------- -------- -------- -------- --------
Total: 1000 710 0.7100 152 1637404 0
Later in the file there is this table:
Comparable Node Alpha Reliability:
Node Raw Dem Sim Rely Wtd Cov
-------- -------- -------- --------
0 71.0000 0.8351 59.2887
1 62.0000 0.8000 49.6000
2 56.0000 0.8738 48.9320
3 39.0000 0.7794 30.3971
4 35.0000 0.9048 31.6667
5 21.0000 0.8286 17.4000
6 20.0000 0.8710 17.4194
7 19.0000 0.9000 17.1000
8 17.0000 0.7727 13.1364
9 17.0000 0.9231 15.6923
10 16.0000 0.8750 14.0000
11 15.0000 0.0000 0.0000
12 14.0000 0.8696 12.1739
13 12.0000 0.0000 0.0000
14 12.0000 0.9500 11.4000
15 11.0000 0.0000 0.0000
16 10.0000 0.7826 7.8261
17 10.0000 0.7500 7.5000
18 9.0000 1.0000 9.0000
19 9.0000 0.0000 0.0000
20 9.0000 0.0000 0.0000
21 8.0000 0.7778 6.2222
22 8.0000 0.8182 6.5455
23 8.0000 0.0000 0.0000
24 8.0000 0.4286 3.4286
25 7.0000 1.0000 7.0000
26 6.0000 0.0000 0.0000
27 6.0000 0.0000 0.0000
28 6.0000 1.0000 6.0000
29 6.0000 0.8333 5.0000
30 6.0000 0.9167 5.5000
31 5.0000 0.5000 2.5000
32 5.0000 1.0000 5.0000
33 5.0000 1.0000 5.0000
34 5.0000 0.0000 0.0000
35 5.0000 0.9091 4.5455
36 5.0000 0.0000 0.0000
37 4.0000 0.0000 0.0000
38 4.0000 0.0000 0.0000
39 4.0000 0.0000 0.0000
40 4.0000 1.0000 4.0000
41 4.0000 0.7778 3.1111
42 4.0000 0.2500 1.0000
43 4.0000 0.6250 2.5000
44 3.0000 1.0000 3.0000
45 3.0000 0.0000 0.0000
46 3.0000 0.8000 2.4000
47 3.0000 0.8333 2.5000
48 3.0000 0.0000 0.0000
49 3.0000 0.0000 0.0000
50 3.0000 0.0000 0.0000
51 2.0000 0.0000 0.0000
52 2.0000 0.0000 0.0000
53 2.0000 0.0000 0.0000
54 2.0000 0.0000 0.0000
-------- -------- -------- --------
Total: 437.7852
I need to be able to store the two middle columns as arrays in order to do some calculations.
How do I go about doing this in PowerShell? I already have the following code that works (with generic name changes):
foreach ($file in $files) {
$fullName = [IO.Path]::GetFileNameWithoutExtension($file)
$CR = $fullName.Split("CRAPTFV")[-2]
$CT = $fullName.Split("CRAPTFV")[-3]
$P = $fullName.Split("CRAPTFV")[-4]
$A = $fullName.Split("CRAPTFV")[-5]
$S = $fullName.Split("CRAPTFV")[-6]
$CV = $fullName.Split("CRAPTFV")[-7]
$DEM = Select-String -Path $file -Pattern("Total Covered Demand = (\d*.?\d*)")
$REL = Select-String -Path $file -Pattern("\d+\t+\s+(\d+\.{1}\d+)\t+\s+(\d\.{1}\d+)\t+\s+(\d+.{1}\d+)") -AllMatches
Write-Output "$CT,$CR,$CV,$S,$A,$P,$($DEM.Matches.Groups[1].Value)" | Out-File "fileadress" -Append
}
The goal is to use the table from each file to calculate some measurements and then append them to an output file. I seem to have pulled the values out with $REL, and I can see all of them with this code:
$REL = Select-String -Path $file -Pattern("\d+\t+\s+(\d+\.{1}\d+)\t+\s+(\d\.{1}\d+)\t+\s+(\d+.{1}\d+)") -AllMatches
Write-Host $REL.Matches
But when I type the following, I can only see the first value for each file. This:
Write-Host $REL.Matches.Groups[1]
produces:
71.0000
71.0000
71.0000
71.0000
71.0000
71.0000
for all files.

Assuming four spaces stand in for a tab, here is a way to use $REL:
$REL.Matches[0].Groups[2].Value gives 0.8351
$REL.Matches[1].Groups[3].Value gives 49.6000
$REL.Matches[X].Groups[Y].Value gives the cell in column Y of row X. X starts from 0; Y starts from 1, because Groups[0] is the entire match.
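If it helps to sanity-check the row pattern outside PowerShell, here is a minimal Python sketch of the same idea (the sample rows come from the table above; the column layout is assumed to be whitespace-separated):

```python
import re

# One regex per data row: node, raw demand, sim reliability, weighted coverage.
ROW = re.compile(r'^\s*(\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s*$')

text = """\
Node   Raw Dem  Sim Rely  Wtd Cov
-------- -------- -------- --------
       0  71.0000   0.8351  59.2887
       1  62.0000   0.8000  49.6000
"""

raw_dem, sim_rely = [], []
for line in text.splitlines():
    m = ROW.match(line)
    if m:  # header and separator lines simply fail to match
        raw_dem.append(float(m.group(2)))
        sim_rely.append(float(m.group(3)))

print(raw_dem)   # [71.0, 62.0]
print(sim_rely)  # [0.8351, 0.8]
```

Group 0 is the whole match and the captured columns start at group 1, exactly as with $REL.Matches[X].Groups[Y] in PowerShell.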

Related

Repeated Measures ANOVA using PROC GLM and trying to compare treatment groups across time with an estimate statement but getting an error message

I am doing a repeated measures ANOVA using PROC GLM and getting the following error message:
ERROR 22-322: Syntax error, expecting one of the following: a numeric constant, a datetime constant,*.
ERROR: Variable Control not found.
I am unsure why it is not finding this. I am trying to add estimates comparing each of the groups with each other.
CODE
Title 'Repeated Measures ANOVA';
PROC GLM DATA=Volumeday;
CLASS Group;
MODEL Day11 Day15 Day18 Day22 Day25 Day29 Day31 = Group/ nouni;
REPEATED Volumes 7 (11 15 18 22 25 29 31) Contrast (1);
Estimate 'Control vs 50' Group Control 50;
RUN;
DATA
Group SingleID Day11 Day15 Day18 Day22 Day25 Day29 Day31
Control 1 18.3265 123.9459 277.5853 469.2007 786.0575 1200.4905 2157.7883
Control 2 8.9600 132.4787 272.9526 291.9831 358.5270 552.6809 831.4478
Control 3 30.3888 27.9484 32.9774 57.2910 102.3590 158.7149 264.3438
Control 4 152.3057 177.8362 237.3441 333.7665 541.3562 807.9747 1322.6820
Control 5 75.8382 210.9038 288.8744 526.7819 1177.8997 1495.5090 1983.0081
Control 6 109.4968 261.4477 646.6212 1045.1347 1409.9562 2100.0035 3606.0111
Control 7 69.7455 140.1223 545.3165 865.7220 1074.2843 4817.3938 5062.0829
Control 8 27.5759 140.4200 179.4372 208.1606 214.3055 244.5793 375.6873
Control 9 69.9840 278.3642 665.7404 948.2510 1291.2181 1773.4409 2526.5430
Control 10 0.0000 5.1754 28.6286 55.9888 85.3166 130.3152 228.2616
Control 11 58.3283 98.3813 250.7581 320.7870 498.5181 786.6884 1092.7527
Control 12 76.8369 213.3508 310.6329 687.7342 1158.0864 1816.7347 2468.0657
Control 13 83.2098 171.3893 494.8624 1279.0689 1586.3263 2146.5180 3179.8413
50 14 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
50 15 70.5439 289.3574 319.6232 605.7606 767.8226 1195.2030 1285.2694
50 16 56.6206 204.2804 209.3167 316.0949 874.6215 1058.7214 1066.0440
50 17 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
50 18 54.0136 260.7759 379.6304 598.2225 834.9887 1623.1321 1960.1044
50 19 0.0000 88.0999 158.8836 478.2094 594.5207 679.1422 785.5714
50 20 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
50 21 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
50 22 67.6676 176.2211 486.8332 671.0626 1510.1275 2288.6294 2493.2663
50 23 0.0000 92.0981 615.5709 942.8944 774.2735 1121.9150 1158.6388
50 24 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
50 25 0.0000 2.5323 2.0644 19.9026 44.7534 57.3573 61.9292
50 26 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
100 27 51.1564 112.6072 238.5177 560.8209 665.5958 1001.8340 1086.0031
100 28 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
100 29 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
100 30 31.9500 87.1252 95.7135 198.5657 631.4217 902.2800 1016.0448
100 31 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
100 32 0.0000 0.0000 5.6459 20.7742 59.5984 49.7367 94.8133
100 33 30.3624 179.8866 274.3788 410.4248 946.3005 1318.0000 1504.2507
100 34 0.0000 0.0000 25.1096 88.5573 145.3025 324.4817 476.9385
100 35 0.0000 41.6587 62.3404 102.9104 164.9199 179.5294 394.4932
100 36 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
100 37 0.0000 1.3099 1.7978 0.0000 0.0000 0.0000 0.0000
100 38 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
100 39 0.0000 12.9966 43.6856 207.1046 277.0362 430.6310 504.4368
200 40 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
200 41 53.2928 462.4648 695.2788 1064.9500 1172.7716 1270.2056 1507.2874
200 42 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
200 43 30.1754 151.9196 438.3676 353.3577 422.8638 460.1100 606.3912
200 44 12.1342 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
200 45 68.1526 113.8426 340.2685 706.1297 831.9715 1073.0574 1276.5542
200 46 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
200 47 16.2057 50.2423 100.9858 248.5628 339.0762 368.7926 660.4432
200 48 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
200 49 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
200 50 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
200 51 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
200 52 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
I think this would be the correct syntax. Again, post the data as TEXT and I will be able to test.
Estimate 'Control vs 50' Group 1 -1;

Sequential READ or WRITE not allowed after EOF marker

I have this code:
SUBROUTINE FNDKEY
1( FOUND ,IWBEG ,IWEND ,KEYWRD ,INLINE ,
2 NFILE ,NWRD )
IMPLICIT DOUBLE PRECISION (A-H,O-Z)
LOGICAL FOUND
CHARACTER*80 INLINE
CHARACTER*(*) KEYWRD
DIMENSION
1 IWBEG(40), IWEND(40)
C***********************************************************************
C FINDS AND READS A LINE CONTAINING A SPECIFIED KEYWORD FROM A FILE.
C THIS ROUTINE SEARCHES FOR A GIVEN KEYWORD POSITIONED AS THE FIRST
C WORD OF A LINE IN A FILE.
C IF THE GIVEN KEYWORD IS FOUND THEN THE CORRESPONDING LINE IS READ AND
C RETURNED TOGETHER WITH THE NUMBER OF WORDS IN THE LINE AND TWO INTEGER
C ARRAYS CONTAINING THE POSITION OF THE BEGINNING AND END OF EACH WORD.
C***********************************************************************
1000 FORMAT(A80)
C
FOUND=.TRUE.
IEND=0
10 READ(NFILE,1000,END=20)INLINE
NWRD=NWORD(INLINE,IWBEG,IWEND)
IF(NWRD.NE.0)THEN
IF(INLINE(IWBEG(1):IWEND(1)).EQ.KEYWRD)THEN
GOTO 999
ENDIF
ENDIF
GOTO 10
20 IF(IEND.EQ.0)THEN
IEND=1
REWIND NFILE
GOTO 10
ELSE
FOUND=.FALSE.
ENDIF
999 RETURN
END
And the following file named "2.dat" that I am trying to read:
TITLE
Example 7.5.3 - Simply supported uniformly loaded circular plate
ANALYSIS_TYPE 3 (Axisymmetric)
AXIS_OF_SYMMETRY Y
LARGE_STRAIN_FORMULATION OFF
SOLUTION_ALGORITHM 2
ELEMENT_GROUPS 1
1 1 1
ELEMENT_TYPES 1
1 QUAD_8
4 GP
ELEMENTS 10
1 1 1 19 11 20 16 21 13 22
2 1 13 21 16 23 10 24 2 25
3 1 3 26 18 27 17 28 4 29
4 1 18 30 7 31 12 32 17 27
5 1 3 33 5 34 14 35 18 26
6 1 18 35 14 36 6 37 7 30
7 1 5 38 8 39 15 40 14 34
8 1 14 40 15 41 9 42 6 36
9 1 10 23 16 43 17 32 12 44
10 1 16 20 11 45 4 28 17 43
NODE_COORDINATES 45 CARTESIAN
1 0.0000000000e+00 0.0000000000e+00
2 0.0000000000e+00 1.0000000000e+00
3 6.0000000000e+00 0.0000000000e+00
4 4.0000000000e+00 0.0000000000e+00
5 8.0000000000e+00 0.0000000000e+00
6 8.0000000000e+00 1.0000000000e+00
7 6.0000000000e+00 1.0000000000e+00
8 1.0000000000e+01 0.0000000000e+00
9 1.0000000000e+01 1.0000000000e+00
10 2.0000000000e+00 1.0000000000e+00
11 2.0000000000e+00 0.0000000000e+00
12 4.0000000000e+00 1.0000000000e+00
13 0.0000000000e+00 5.0000000000e-01
14 8.0000000000e+00 5.0000000000e-01
15 1.0000000000e+01 5.0000000000e-01
16 2.0000000000e+00 5.0000000000e-01
17 4.0000000000e+00 5.0000000000e-01
18 6.0000000000e+00 5.0000000000e-01
19 1.0000000000e+00 0.0000000000e+00
20 2.0000000000e+00 2.5000000000e-01
21 1.0000000000e+00 5.0000000000e-01
22 0.0000000000e+00 2.5000000000e-01
23 2.0000000000e+00 7.5000000000e-01
24 1.0000000000e+00 1.0000000000e+00
25 0.0000000000e+00 7.5000000000e-01
26 6.0000000000e+00 2.5000000000e-01
27 5.0000000000e+00 5.0000000000e-01
28 4.0000000000e+00 2.5000000000e-01
29 5.0000000000e+00 0.0000000000e+00
30 6.0000000000e+00 7.5000000000e-01
31 5.0000000000e+00 1.0000000000e+00
32 4.0000000000e+00 7.5000000000e-01
33 7.0000000000e+00 0.0000000000e+00
34 8.0000000000e+00 2.5000000000e-01
35 7.0000000000e+00 5.0000000000e-01
36 8.0000000000e+00 7.5000000000e-01
37 7.0000000000e+00 1.0000000000e+00
38 9.0000000000e+00 0.0000000000e+00
39 1.0000000000e+01 2.5000000000e-01
40 9.0000000000e+00 5.0000000000e-01
41 1.0000000000e+01 7.5000000000e-01
42 9.0000000000e+00 1.0000000000e+00
43 3.0000000000e+00 5.0000000000e-01
44 3.0000000000e+00 1.0000000000e+00
45 3.0000000000e+00 0.0000000000e+00
NODES_WITH_PRESCRIBED_DISPLACEMENTS 6
1 10 0.000 0.000 0.000
2 10 0.000 0.000 0.000
8 01 0.000 0.000 0.000
13 10 0.000 0.000 0.000
22 10 0.000 0.000 0.000
25 10 0.000 0.000 0.000
MATERIALS 1
1 VON_MISES
0.0
1.E+07 0.240
2
0.000 16000.0
1.000 16000.0
LOADINGS EDGE
EDGE_LOADS 5
2 3 10 24 2
1.000 1.000 1.000 0.000 0.000 0.000
4 3 7 31 12
1.000 1.000 1.000 0.000 0.000 0.000
6 3 6 37 7
1.000 1.000 1.000 0.000 0.000 0.000
8 3 9 42 6
1.000 1.000 1.000 0.000 0.000 0.000
9 3 10 12 44
1.000 1.000 1.000 0.000 0.000 0.000
*
* Monotonic loading to collapse
*
INCREMENTS 12
100.0 0.10000E-06 11 1 1 0 1 0
100.0 0.10000E-06 11 1 1 0 1 0
20.0 0.10000E-06 11 1 1 0 1 0
10.0 0.10000E-06 11 1 1 0 0 0
10.0 0.10000E-06 11 1 1 0 1 0
10.0 0.10000E-06 11 1 1 0 0 0
5.0 0.10000E-06 11 1 1 1 1 0
2.0 0.10000E-06 11 1 1 0 0 0
2.0 0.10000E-06 11 1 1 0 0 0
0.5 0.10000E-06 11 1 1 1 1 0
0.25 0.10000E-06 11 1 1 0 0 0
0.02 0.10000E-06 11 1 1 0 0 0
And I am getting the following error:
At line 22 of file GENERAL/fndkey.f (unit = 15, file = './2.dat')
Fortran runtime error: Sequential READ or WRITE not allowed after EOF marker, possibly use REWIND or BACKSPACE
The following file is the one that calls FNDKEY. When it calls FNDKEY, it passes the string "RESTART" to KEYWRD.
SUBROUTINE RSTCHK( RSTINP ,RSTRT )
IMPLICIT DOUBLE PRECISION (A-H,O-Z)
LOGICAL RSTRT
CHARACTER*256 RSTINP
C
LOGICAL AVAIL,FOUND
CHARACTER*80 INLINE
DIMENSION IWBEG(40),IWEND(40)
C***********************************************************************
C CHECKS WHETHER MAIN DATA IS TO BE READ FROM INPUT RE-START FILE
C AND SET INPUT RE-START FILE NAME IF REQUIRED
C***********************************************************************
1000 FORMAT(////,
1' Main input data read from re-start file'/
2' ======================================='///
3' Input re-start file name ------> ',A)
C
C Checks whether the input data file contains the keyword RESTART
C
CALL FNDKEY
1( FOUND ,IWBEG ,IWEND ,'RESTART',
2 INLINE ,15 ,NWRD )
IF(FOUND)THEN
C sets re-start flag and name of input re-start file
RSTRT=.TRUE.
RSTINP=INLINE(IWBEG(2):IWEND(2))//'.rst'
WRITE(16,1000)INLINE(IWBEG(2):IWEND(2))//'.rst'
C checks existence of the input re-start file
INQUIRE(FILE=RSTINP,EXIST=AVAIL)
IF(.NOT.AVAIL)CALL ERRPRT('ED0096')
ELSE
RSTRT=.FALSE.
ENDIF
C
RETURN
END
I solved the problem by adding the command BACKSPACE(NFILE) just above the RETURN:
SUBROUTINE FNDKEY
1( FOUND ,IWBEG ,IWEND ,KEYWRD ,INLINE ,
2 NFILE ,NWRD )
IMPLICIT DOUBLE PRECISION (A-H,O-Z)
LOGICAL FOUND
CHARACTER*80 INLINE
CHARACTER*(*) KEYWRD
DIMENSION
1 IWBEG(40), IWEND(40)
C***********************************************************************
C FINDS AND READS A LINE CONTAINING A SPECIFIED KEYWORD FROM A FILE.
C THIS ROUTINE SEARCHES FOR A GIVEN KEYWORD POSITIONED AS THE FIRST
C WORD OF A LINE IN A FILE.
C IF THE GIVEN KEYWORD IS FOUND THEN THE CORRESPONDING LINE IS READ AND
C RETURNED TOGETHER WITH THE NUMBER OF WORDS IN THE LINE AND TWO INTEGER
C ARRAYS CONTAINING THE POSITION OF THE BEGINNING AND END OF EACH WORD.
C***********************************************************************
1000 FORMAT(A80)
C
FOUND=.TRUE.
IEND=0
10 READ(NFILE,1000,END=20)INLINE
NWRD=NWORD(INLINE,IWBEG,IWEND)
PRINT *,KEYWRD
IF(NWRD.NE.0)THEN
IF(INLINE(IWBEG(1):IWEND(1)).EQ.KEYWRD)THEN
GOTO 999
ENDIF
ENDIF
GOTO 10
20 IF(IEND.EQ.0)THEN
IEND=1
REWIND NFILE
GOTO 10
ELSE
FOUND=.FALSE.
ENDIF
BACKSPACE(NFILE)
999 RETURN
END
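The control flow of FNDKEY (scan forward; on end-of-file rewind once and keep scanning; give up on the second end-of-file) can be sketched in Python. This is an illustrative analogue, not a translation of the Fortran:

```python
import io

def find_key(f, keyword):
    """Return the first line of f whose first word equals `keyword`.

    Scans forward from the current position; on end-of-file it rewinds
    once (like Fortran's REWIND) and keeps scanning. Returns None if a
    second end-of-file is hit, i.e. the keyword is nowhere in the file.
    """
    rewound = False
    while True:
        line = f.readline()
        if not line:          # hit end-of-file
            if rewound:
                return None
            f.seek(0)         # rewind once and try again
            rewound = True
            continue
        words = line.split()
        if words and words[0] == keyword:
            return line

# A tiny stand-in for the .dat file:
f = io.StringIO("TITLE\nExample\nELEMENTS 10\n")
f.readline()                  # start mid-file, as FNDKEY often does
print(find_key(f, "TITLE"))   # found only after the rewind
```

The original bug corresponds to the not-found path: the Fortran version returned while the unit was still positioned past the EOF marker, so the next sequential READ failed; BACKSPACE (here, the early return before any further read) avoids that.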

How to loop rows and columns in pandas while replacing values with a constant increment

I am trying to replace values in a DataFrame with 0. In the first column I need to replace the first 3 values, in the next column the first 6 values, and so on, increasing by 3 each time.
a=np.array([133,124,156,189,132,176,189,192,100,120,130,140,150,50,70,133,124,156,189,132])
b = pd.DataFrame(a.reshape(10,2), columns= ['s','t'])
for columns in b:
yy = 3
for i in xrange(yy):
b[columns][i] = 0
yy += 3
print b
the outcome is the following
s t
0 0 0
1 0 0
2 0 0
3 189 189
4 132 132
5 176 176
6 189 189
7 192 192
8 100 100
9 120 120
I am clearly missing something really simple. How do I make the loop replace 6 values instead of only 3 in column t? Any ideas?
I would do it this way:
i = 1
for c in b.columns:
b.ix[0 : 3*i-1, c] = 0
i += 1
Demo:
In [86]: b = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('abcd'))
In [87]: %paste
i = 1
for c in b.columns:
b.ix[0 : 3*i-1, c] = 0
i += 1
## -- End pasted text --
In [88]: b
Out[88]:
a b c d
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 10 0 0 0
4 8 0 0 0
5 49 0 0 0
6 55 48 0 0
7 99 43 0 0
8 63 29 0 0
9 61 65 74 0
10 15 29 41 0
11 79 88 3 0
12 91 74 11 4
13 56 71 6 79
14 15 65 46 81
15 81 42 60 24
16 71 57 95 18
17 53 4 80 15
18 42 55 84 11
19 26 80 67 59
You need to initialize yy = 3 before the loop:
yy = 3
for columns in b:
for i in xrange(yy):
b[columns][i] = 0
yy += 3
print b
Python 3 solution:
yy = 3
for columns in b:
for i in range(yy):
b[columns][i] = 0
yy += 3
print (b)
s t
0 0 0
1 0 0
2 0 0
3 189 0
4 100 0
5 130 0
6 150 50
7 70 133
8 124 156
9 189 132
Another solution:
yy= 3
for i, col in enumerate(b.columns):
b.ix[:i*yy+yy-1, col] = 0
print (b)
s t
0 0 0
1 0 0
2 0 0
3 189 0
4 100 0
5 130 0
6 150 50
7 70 133
8 124 156
9 189 132
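Note that .ix, used in the answers above, has since been deprecated and removed from pandas. On current versions the same positional logic can be written with .iloc; a sketch using the question's data:

```python
import numpy as np
import pandas as pd

a = np.array([133, 124, 156, 189, 132, 176, 189, 192, 100, 120,
              130, 140, 150, 50, 70, 133, 124, 156, 189, 132])
b = pd.DataFrame(a.reshape(10, 2), columns=['s', 't'])

# Zero the first 3 rows of column 0, the first 6 rows of column 1, and so
# on, growing the zeroed prefix by 3 for each successive column.
for j in range(len(b.columns)):
    b.iloc[:3 * (j + 1), j] = 0

print(b)
```

Writing through .iloc also avoids the chained-assignment pattern b[columns][i] = 0, which may silently operate on a copy.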

Find specific columns and replace the following column with specific value with gawk

I am trying to find all the places where my data has a repeating line and delete the repeating line. Also, I am looking for lines where the 2nd column has the value 90, so that I can replace the 2nd column of the following line with a specific number I designate.
My data looks like this:
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
7 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
I want my data to look like:
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 5 0 0 0.0000 70221
My code:
BEGIN {
priorline = "";
ERROROFFSET = 50;
ERRORVALUE[10] = 1;
ERRORVALUE[11] = 2;
ERRORVALUE[12] = 3;
ERRORVALUE[30] = 4;
ERRORVALUE[31] = 5;
ERRORVALUE[32] = 6;
ORS = "\n";
}
NR == 1 {
print;
getline;
priorline = $0;
}
NF == 6 {
brandnewline = $0
mytype = $2
$0 = priorline
priorField2 = $2;
if (mytype !~ priorField2) {
print;
priorline = brandnewline;
}
if (priorField2 == "90") {
mytype = ERRORVALUE[mytype];
}
}
END {print brandnewline}
##Here brandnewline is set to the current line, then priorline is set to the
##line we just worked on, and brandnewline takes on the next new line (i.e.,
##line 1 = brandnewline; then priorline = brandnewline, so priorline is line 1
##and brandnewline takes on line 2). The same scheme applies to column 2:
##mytype is the current column-2 value and priorField2 holds the previous one.
##Finally, an if statement says: if the column-2 value of the current line does
##not match the column-2 value of the previous line, print the current line;
##otherwise skip it. The second if statement recognizes lines where the value
##90 appears and replaces column 2 of the next line with a previously defined
##ERRORVALUE for each specific type (type 10=1, 11=2, 12=3, 30=4, 31=5, 32=6).
I have been able to successfully delete the repeating lines; however, I am unable to get the next part of my code working, which is to replace the column-2 value that follows a 90 with the ERRORVALUES I designated in BEGIN (10=1, 11=2, 12=3, 30=4, 31=5, 32=6). Essentially, I want to just replace that value in the line with my ERRORVALUE.
If anyone can help me with this I would be very grateful.
One challenge is that you can't just compare one line with the previous because the ID number will be different.
awk '
BEGIN {
ERRORVALUE[10] = 1
# ... etc
}
# print the header
NR == 1 {print; next}
NR == 2 || $0 !~ prev_regex {
prev_regex = sprintf("^\\s+\\w+\\s+%s\\s+%s\\s+%s\\s+%s\\s+%s",$2,$3,$4,$5,$6)
if (was90) $2 = ERRORVALUE[$2]
print
was90 = ($2 == 90)
}
'
For lines where the 2nd column is altered, this ruins the line formatting:
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 5 0 0 0.0000 70221
If that's a problem, you could pipe the output of gawk into column -t, or if you know the line format is fixed, use printf() in the awk program.
This might work for you:
v=99999
sed ':a;$!N;s/^\(\s*\S*\s*\)\(.*\)\s*\n.*\2/\1\2/;ta;s/^\(\s*\S*\s*\) 90 /\1'"$(printf "%5d" $v)"' /;P;D' file
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 99999 0 0 0.0000 68700
12 31 0 0 0.0000 70221
This might work for you:
awk 'BEGIN {
ERROROFFSET = 50;
ERRORVALUE[10] = 1;
ERRORVALUE[11] = 2;
ERRORVALUE[12] = 3;
ERRORVALUE[30] = 4;
ERRORVALUE[31] = 5;
ERRORVALUE[32] = 6;
}
NR == 1 { print ; next }
{ if (a[$2 $6]) { next } else { a[$2 $6]++ }
if ( $2 == 90) { print ; n++ ; next }
if (n>0) { $2 = ERRORVALUE[$2] ; n=0 }
printf("% 4i% 8i% 3i% 5i% 9.4f% 6i\n", $1, $2, $3, $4, $5, $6)
}' INPUTFILE
See it in action here at ideone.com.
IMO the BEGIN block is obvious. Then the following happens:
the NR == 1 rule prints the very first line and switches to the next input line; it applies only to that first line
Then we check whether we have already seen a line with the same 2nd and 6th columns; if so, we skip to the next line, else we mark it as seen in an array (using the concatenated column values as indices; note that this can fail if you have large values in the 2nd column and small ones in the 6th, e.g. 2 0020 concatenates to 20020, the same as 20 020, so you may want a column separator in the index, like a[$2 "-" $6], and you can use more columns to make the check even more robust)
If the line has 90 in the second column, we print it, flag a swap for the next line, and switch to the next line of the input file
On the next line we look up the 2nd column in ERRORVALUE and, if found, replace its contents.
Then we print the formatted line.
I agree with Glenn that two passes over the file is nicer. You can remove your duplicate, perhaps nonconsecutive, lines using a hash like this:
awk '!a[$2,$3,$4,$5,$6]++' file.txt
You should then edit your values as desired. If you wish to change the value 90 in the second column to 5000, try something like this:
awk 'NR == 1 { print; next } { sub(/^90$/, "5000", $2); printf("%4i% 8i% 3i% 5i% 9.4f% 6i\n", $1, $2, $3, $4, $5, $6) }' file.txt
You can see that I stole Zsolt's printf statement (thanks Zsolt!) for the formatting, but you can edit this if necessary. You can also pipe the output from the first statement into the second for a nice one-liner:
cat file.txt | awk '!a[$2,$3,$4,$5,$6]++' | awk 'NR == 1 { print; next } { sub(/^90$/, "5000", $2); printf("%4i% 8i% 3i% 5i% 9.4f% 6i\n", $1, $2, $3, $4, $5, $6) }'
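The !a[$2,$3,$4,$5,$6]++ idiom (keep a line only the first time its non-ID columns appear) translates directly to a set; a small Python sketch of the same idea:

```python
def dedupe(lines):
    """Keep only the first occurrence of each (Type..Offset) combination,
    mirroring awk's  !a[$2,$3,$4,$5,$6]++  idiom (column 1 is the row ID
    and is deliberately left out of the key)."""
    seen = set()
    for line in lines:
        key = tuple(line.split()[1:6])
        if key not in seen:
            seen.add(key)
            yield line

rows = ["6 41 0 0 0.0000 64537",
        "7 41 0 0 0.0000 64537",   # duplicate of the line above (new ID only)
        "8 70 0 0 0.0000 65106"]
print(list(dedupe(rows)))
```

Like the awk version, this removes duplicates even when they are not consecutive.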
The previous options work for the most part; however, here's the way I would do it, simple and sweet. After reviewing the other posts, I believe this is the most efficient. In addition, it handles the extra request the OP added in the comments: the line after a 90 has its value replaced with a value taken from two lines prior. It does all of this in a single pass.
BEGIN {
PC2=PC6=1337
replacement=5
}
{
if( $6 == PC6 ) next
if( PC2 == 90 ) $2 = replacement
replacement = PC2
PC2 = $2
PC6 = $6
printf "%4s%8s%3s%5s%9s%6s\n",$1, $2, $3, $4, $5, $6
}
Example Input
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
7 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
Example Output
1 70 0 0 0.000000 57850
2 31 0 0 0.000000 59371
3 41 0 0 0.000000 60909
4 70 0 0 0.000000 61478
5 31 0 0 0.000000 62999
6 41 0 0 0.000000 64537
8 70 0 0 0.000000 65106
9 11 0 0 0.000000 66627
10 21 0 0 0.000000 68165
11 90 0 0 0.000000 68700
12 21 0 0 0.000000 70221
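For comparison, here is a Python sketch of the single-pass logic as originally specified in the question (drop a line whose Offset repeats the previous line's, and remap the Type of the line that follows a 90 via the ERRORVALUE table); rows are assumed to be pre-split into fields:

```python
ERRORVALUE = {10: 1, 11: 2, 12: 3, 30: 4, 31: 5, 32: 6}

def clean(rows):
    """rows: lists of [id, type, response, acc, rt, offset].

    Drops a row whose offset equals the previous row's offset, and
    remaps the type of the row immediately after a type-90 row.
    """
    out, prev_offset, after_90 = [], None, False
    for row in rows:
        if row[5] == prev_offset:      # consecutive duplicate: skip it
            continue
        prev_offset = row[5]
        if after_90:                   # row right after a type-90 marker
            row = row[:]               # copy so the input stays untouched
            row[1] = ERRORVALUE.get(row[1], row[1])
        after_90 = (row[1] == 90)
        out.append(row)
    return out
```

On the sample data this drops the duplicated row 7 and turns row 12's Type from 31 into 5.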

replacing a specific column with a specific value using gawk

I am trying to find everywhere my data has a 90 in column 2 and, two lines above, change the value of column 2. For example, in my data below, if I see 90 at line 11, I want to change my column 2 value at line 9 from 11 to 5. I have a predetermined set of values I want to change the number to; the values will always be 10, 11, 12, 30, 31, 32, mapped to 1, 2, 3, 4, 5, 6 respectively.
My data
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
What I want
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 5 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
I have been trying to store the previous line and use that as a reference but I can only go back one line, and I need to go back two. Thank you for your help.
This should work:
function pra(a) {
for(e in a) {
printf "%s ", a[e];
}
print "";
}
BEGIN {
vals[10] = 1;
vals[11] = 2;
vals[12] = 3;
vals[30] = 4;
vals[31] = 5;
vals[32] = 6;
}
NR == 1 { split($0, a, " ") }
NR == 2 { split($0, b, " ") }
NR > 2 {
if($2 == "90") {
a[2] = vals[a[2]];
}
pra(a);
al = 0;
for(i in a) al++;
for(i = 1; i <= al; i++) {
a[i] = b[i];
}
split($0, b, " ");
}
END {
pra(a);
pra(b);
}
The rundown of how this works:
* BEGIN block - assign the translation values to vals
* NR == 1 and NR == 2 - remember the first two lines as split arrays a and b
* NR > 2 - for all lines after the first two:
* If the second column has the value 90, change it using the translation array
* Move the elements of array b into a and split the current line into b
* END block - print a and b, which are the last two lines
Sample run:
$ cat inp && awk -f mkt.awk inp
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 2 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
You can do something like this:
function pra(a) {
printf "%4d%8d%3d%5d%9.4f%6d\n", a[1], a[2], a[3], a[4], a[5], a[6]
}
BEGIN {
vals[10] = 1;
vals[11] = 2;
vals[12] = 3;
vals[30] = 4;
vals[31] = 5;
vals[32] = 6;
}
NR == 1 { print }
NR == 2 { split($0, a, " ") }
NR == 3 { split($0, b, " ") }
NR > 3 {
if($2 == "90") {
a[2] = vals[a[2]];
}
pra(a);
for(i = 1; i <= 6; i++) {
a[i] = b[i];
}
split($0, b, " ");
}
END {
pra(a);
pra(b);
}
This makes it work for this specific case, including the formatting. (Note the first line is printed as-is and the buffering starts at line 2, so the condition is NR > 3 here.) Sample run:
$ cat inp && awk -f mkt.awk inp
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 2 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
This version maintains your original formatting
awk 'BEGIN{ new["10"]=" 1"; new["11"]=" 2"; new["12"]=" 3"
new["30"]=" 4"; new["31"]=" 5"; new["32"]=" 6" }
{ line[-2]=line[-1]; line[-1]=line[0]; line[0]=$0 }
$2==90 { if( match( line[-2], /^ *[0-9]+ +(1[012]|3[012]) / ) ) {
old=substr( line[-2], RLENGTH-2,2 )
line[-2]=substr( line[-2], 1, RLENGTH-3 ) new[old] \
substr( line[-2], RLENGTH ) } }
NR>2 { printf("%s\n",line[-2]) }
END { printf("%s\n%s\n",line[-1],line[0]) }' file.in
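The two-lines-back edit is easy to prototype with a two-line buffer; a Python sketch of the same idea (it rejoins columns with single spaces, so unlike the awk version it does not preserve the fixed-width formatting):

```python
VALS = {'10': '1', '11': '2', '12': '3', '30': '4', '31': '5', '32': '6'}

def remap_two_back(lines):
    """Yield lines; when a line's 2nd column is 90, rewrite the 2nd
    column of the line two positions earlier before it is emitted."""
    buf = []                          # the last two lines, still pending
    for line in lines:
        cols = line.split()
        if len(cols) > 1 and cols[1] == '90' and len(buf) == 2:
            prev = buf[0].split()     # the line two back
            prev[1] = VALS.get(prev[1], prev[1])
            buf[0] = ' '.join(prev)
        buf.append(line)
        if len(buf) > 2:
            yield buf.pop(0)
    yield from buf
```

Holding back the last two lines before printing is exactly what the awk line[-2]/line[-1]/line[0] shuffle does.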