Remove new line after pattern and newline between 2 pattern - regex

I have to parse and filter some log files using only the Linux command line.
After applying some awk and sed commands:
awk -v RS='+++ ' '!/Diameter|REPT OM BLOCKED|REPT OM STARTING/ { print f $0 } {f=RT}' ./snmplog* | grep -v '+++' | grep -v '++-' | sed -e 's/^\s*//g' | sed -e '/^$/d'
I get output like this, which looks like an XML file:
<Alarm>
<AlarmIndex>8865</AlarmIndex>
<ObjectName>0-0-1#RcvTCAPabortRatio^0-0-3</ObjectName>
<SpecificProblem>KPI OUTSIDE OF CRITICAL THRESHOLD</SpecificProblem>
<ProbableCause>ThresholdCrossed</ProbableCause>
<NotificationIdentifier>8865</NotificationIdentifier>
<Severity>Cleared</Severity>
<AlarmType>QualityOfServiceAlarm</AlarmType>
<AdditionalText></AdditionalText>
<OMText>REPT MEAS KPI
(RcvTCAPabortRatio^0-0-3 = 20) OUTSIDE OF CRITICAL ALARM THRESHOLD (10)</O
MText>
<AlarmCode>922044</AlarmCode>
<AlarmSource>PLATFORM</AlarmSource>
<AlarmTime>Wed Mar 11 00:15:10 2015</AlarmTime>
<RepeatCount>0</RepeatCount>
<OMDBKey>/MS044</OMDBKey>
<AutoClear>1</AutoClear>
</Alarm>
<Alarm>
<AlarmIndex>8928</AlarmIndex>
<ObjectName>0-0-1#RcvTCAPabortRatio^0-0-11</ObjectName>
<SpecificProblem>KPI OUTSIDE OF CRITICAL THRESHOLD</SpecificProblem>
<ProbableCause>ThresholdCrossed</ProbableCause>
<NotificationIdentifier>8928</NotificationIdentifier>
<Severity>Cleared</Severity>
<AlarmType>QualityOfServiceAlarm</AlarmType>
<AdditionalText></AdditionalText>
<OMText>REPT MEAS KPI
(RcvTCAPabortRatio^0-0-11 = 19) OUTSIDE OF CRITICAL ALARM THRESHOLD (10)</
OMText>
<AlarmCode>922044</AlarmCode>
<AlarmSource>PLATFORM</AlarmSource>
<AlarmTime>Wed Mar 11 00:15:10 2015</AlarmTime>
<RepeatCount>0</RepeatCount>
<OMDBKey>/MS044</OMDBKey>
<AutoClear>1</AutoClear>
</Alarm>
<Alarm>
<AlarmIndex>8771</AlarmIndex>
<ObjectName>0-0-1#SuccStandaloneISDRatio</ObjectName>
<SpecificProblem>ZERO DENOMINATOR</SpecificProblem>
<ProbableCause>CorruptData</ProbableCause>
<NotificationIdentifier>8771</NotificationIdentifier>
<Severity>Cleared</Severity>
<AlarmType>ProcessingErrorAlarm</AlarmType>
<AdditionalText></AdditionalText>
<OMText>REPT MEAS KPI
CALCULATION OF (SuccStandaloneISDRatio) FAILED FOR ZERO DENOMINATOR</OMText>
<AlarmCode>922041</AlarmCode>
<AlarmSource>PLATFORM</AlarmSource>
<AlarmTime>Wed Mar 11 01:00:10 2015</AlarmTime>
<RepeatCount>0</RepeatCount>
<OMDBKey>/MS041</OMDBKey>
<AutoClear>1</AutoClear>
</Alarm>
After processing, I would like to have something like this:
<Alarm><AlarmIndex>8771</AlarmIndex>...<OMText>REPT MEAS KPI
CALCULATION OF (SuccStandaloneISDRatio) FAILED FOR ZERO DENOMINATOR</OMText><AlarmCode>922041</AlarmCode>...</Alarm>
I have to remove every newline that follows a > and keep the newlines inside the tag text.
As you can see in my log, the closing tag </OMText> can itself be split by a newline, and that newline should be removed too.
I have already tried many sed regexes found here, without success.
How can I do this?
[Edit]
As requested, please find below the original log file:
+++ FE01 2015-03-11 00:25:35 SNMP /SNM001 #310852 0-0-1 >
<Alarm>
<AlarmIndex>1119</AlarmIndex>
<ObjectName>0-0-3#destMMENotAvail</ObjectName>
<SpecificProblem>CLR error,Diameter Peer:p3.mmeccd.3gppnetwork.org</SpecificProblem>
<ProbableCause>CommunicationsSubsystemFailure</ProbableCause>
<NotificationIdentifier>1119</NotificationIdentifier>
<Severity>Minor</Severity>
<AlarmType>CommunicationAlarm</AlarmType>
<AdditionalText>The destination MME is not reachable</AdditionalText>
<OMText>CLR error,Diameter Peer:p3.mmeccd.3gppne
twork.org</OMText>
<AlarmCode>50906</AlarmCode>
<AlarmSource>SDM#RTLTE</AlarmSource>
<AlarmTime>Wed Mar 11 00:25:35 2015</AlarmTime>
<RepeatCount>0</RepeatCount>
<OMDBKey></OMDBKey>
<AutoClear>1</AutoClear>
</Alarm>
END OF REPORT #310852++-
+++ FE01 2015-03-11 00:25:58 SNMP /SNM001 #310853 0-0-1 >
<Alarm>
<AlarmIndex>8914</AlarmIndex>
<ObjectName>0-0-14#2AILogger.C!81</ObjectName>
<SpecificProblem>OM BLOCKED AILogger.C</SpecificProblem>
<ProbableCause>QueueSizeExceeded</ProbableCause>
<NotificationIdentifier>8914</NotificationIdentifier>
<Severity>Minor</Severity>
<AlarmType>QualityOfServiceAlarm</AlarmType>
<AdditionalText></AdditionalText>
<OMText>REPT OM BLOCKED FOR PROCESS PDLSU1
612 MESSAGES DISCARD
OM IDENTITY :
CRERROR BEING BLOCKED; FILE : AILogger.C LINE NUMBER : 81
</OMText>
<AlarmCode>906065</AlarmCode>
<AlarmSource>PLATFORM</AlarmSource>
<AlarmTime>Wed Mar 11 00:25:58 2015</AlarmTime>
<RepeatCount>0</RepeatCount>
<OMDBKey>/CR065</OMDBKey>
<AutoClear>1</AutoClear>
</Alarm>
END OF REPORT #310853++-
First I have to discard the messages whose tags contain "Diameter", "REPT OM BLOCKED" or "REPT OM STARTING", then keep only the message between the tags ...

awk '
/<Alarm>/,/<\/Alarm>/ {
sub(/^[[:blank:]]+/, "") # trim leading blanks
sub(/[[:blank:]]+$/, "") # trim trailing blanks
if (/>$/) # if the line ends with a tag
printf "%s", $0 # print it with no newline
else
print
}
/<\/Alarm>/ {print ""} # add a newline after each Alarm block
' log.file
outputs
<Alarm><AlarmIndex>1119</AlarmIndex><ObjectName>0-0-3#destMMENotAvail</ObjectName><SpecificProblem>CLR error,Diameter Peer:p3.mmeccd.3gppnetwork.org</SpecificProblem><ProbableCause>CommunicationsSubsystemFailure</ProbableCause><NotificationIdentifier>1119</NotificationIdentifier><Severity>Minor</Severity><AlarmType>CommunicationAlarm</AlarmType><AdditionalText>The destination MME is not reachable</AdditionalText><OMText>CLR error,Diameter Peer:p3.mmeccd.3gppne
twork.org</OMText><AlarmCode>50906</AlarmCode><AlarmSource>SDM#RTLTE</AlarmSource><AlarmTime>Wed Mar 11 00:25:35 2015</AlarmTime><RepeatCount>0</RepeatCount><OMDBKey></OMDBKey><AutoClear>1</AutoClear></Alarm>
<Alarm><AlarmIndex>8914</AlarmIndex><ObjectName>0-0-14#2AILogger.C!81</ObjectName><SpecificProblem>OM BLOCKED AILogger.C</SpecificProblem><ProbableCause>QueueSizeExceeded</ProbableCause><NotificationIdentifier>8914</NotificationIdentifier><Severity>Minor</Severity><AlarmType>QualityOfServiceAlarm</AlarmType><AdditionalText></AdditionalText><OMText>REPT OM BLOCKED FOR PROCESS PDLSU1
612 MESSAGES DISCARD
OM IDENTITY :
CRERROR BEING BLOCKED; FILE : AILogger.C LINE NUMBER : 81
</OMText><AlarmCode>906065</AlarmCode><AlarmSource>PLATFORM</AlarmSource><AlarmTime>Wed Mar 11 00:25:58 2015</AlarmTime><RepeatCount>0</RepeatCount><OMDBKey>/CR065</OMDBKey><AutoClear>1</AutoClear></Alarm>
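
The script above still emits a newline when a closing tag is itself split across two lines (the </O / MText> case in the question), and it does not drop the unwanted alarms. A minimal sketch that does both, assuming that a line whose last < has no matching > is a split tag and that the three phrases from the question are the only filter; the function name join_alarms is made up for illustration:

```shell
# Hypothetical helper: join each <Alarm> block and drop the unwanted ones.
# Reads the files given as arguments, or stdin when there are none.
join_alarms() {
  awk '
    /<Alarm>/ { buf = ""; inblock = 1 }          # start a new block
    inblock {
      sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")     # trim blanks
      # no newline after a complete tag or in the middle of a split tag
      if ($0 ~ />$/ || $0 ~ /<[^>]*$/) buf = buf $0
      else buf = buf $0 "\n"
    }
    /<\/Alarm>/ {
      inblock = 0
      # print the block only if none of the unwanted phrases occurs in it
      if (buf !~ /Diameter|REPT OM BLOCKED|REPT OM STARTING/) print buf
    }
  ' "$@"
}
```

It could be called as join_alarms ./snmplog* on the original files. Free text that contains a literal < would confuse the split-tag test, so treat this as a starting point rather than a general XML tool.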

For use in a pipeline (adapt it to the original file you posted later):
sed '
# don t care out of section
/<Alarm>/,\#</Alarm># !d
# in section
/<Alarm>/,\#</Alarm># {
# keep line in hold buffer
H
# if not the end, loop (cycle to next line and start of script)
\#</Alarm># !b
# clean current buffer
s/.*//
# exchange buffer (current/hold)
x
# remove first new line (extra due to first keep)
s/\n//
# remove the next new line (the one after <Alarm>)
s/\n//
# reformat first part until OMText
s#\(</AlarmIndex>\).*\(<OMText>\)#\1...\2#
# reformat between AlarmCode and /Alarm
s#\(</AlarmCode>\).*\(</Alarm>\)#\1...\2#
# print result at output
}' YourFile
Self-explanatory; POSIX version.

Related

match variable string at end of field with awk

Yet again my unfamiliarity with AWK lets me down: I can't figure out how to match a variable at the end of a line.
This would be fairly trivial with grep etc, but I'm interested in matching integers at the end of a string in a specific field of a tsv, and all the posts suggest (and I believe it to be the case!) that awk is the way to go.
If I want to just match a single one explicitly, that's easy.
Here's my example file:
PVClopT_11 PAU_02102 PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A No DOI found.
PVCpnf_18 PAK_3526 PAK_03186 3fxq 3fxq_A 99.7 2.7e-21 7e-26 122.2 >3fxq_A LYSR type regulator of TSAMBCD; transcriptional regulator, LTTR, TSAR, WHTH, DNA- transcription, transcription regulation; 1.85A {Comamonas testosteroni} PDB: 3fxr_A* 3fxu_A* 3fzj_A 3n6t_A 3n6u_A* 10.1111/j.1365-2958.2010.07043.x
PVCunit1_19 PAU_02807 PAU_02793 3kx6 3kx6_A 19.7 45 0.0012 31.3 >3kx6_A Fructose-bisphosphate aldolase; ssgcid, NIH, niaid, SBRI, UW, emerald biostructures, glycolysis, lyase, STRU genomics; HET: CIT; 2.10A {Babesia bovis} No DOI found.
PVClumt_17 PAU_02231 PAU_02190 3lfh 3lfh_A 39.7 12 0.0003 28.9 >3lfh_A Manxa, phosphotransferase system, mannose/fructose-speci component IIA; PTS; 1.80A {Thermoanaerobacter tengcongensis} No DOI found.
PVCcif_11 plu2521 PLT_02558 3h2t 3h2t_A 96.6 2.6e-05 6.7e-10 79.0 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_16 PAU_03338 PAU_03377 5jbr 5jbr_A 29.2 22 0.00058 23.9 >5jbr_A Uncharacterized protein BCAV_2135; structural genomics, PSI-biology, midwest center for structu genomics, MCSG, unknown function; 1.65A {Beutenbergia cavernae} No DOI found.
PVCunit1_17 PAK_2892 PAK_02622 1cii 1cii_A 63.2 2.7 6.9e-05 41.7 >1cii_A Colicin IA; bacteriocin, ION channel formation, transmembrane protein; 3.00A {Escherichia coli} SCOP: f.1.1.1 h.4.3.1 10.1038/385461a0
PVCunit1_11 PAK_2886 PAK_02616 3h2t 3h2t_A 96.6 1.9e-05 4.9e-10 79.9 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_11 PAU_03343 PAU_03382 3h2t 3h2t_A 97.4 4.4e-07 1.2e-11 89.7 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCunit1_5 afp5 PAU_02779 4tv4 4tv4_A 63.6 2.6 6.7e-05 30.5 >4tv4_A Uncharacterized protein; unknown function, ssgcid, virulence, structural genomics; 2.10A {Burkholderia pseudomallei} No DOI found.
And I can pull out all the lines which have a "_11" at the end of the first column by running the following on the commandline:
awk '{ if ($1 ~ /_11$/) { print } }' 02052017_HHresults_sorted.tsv
I want to enclose this in a loop to cover all integers from 1 - 5 (for instance), but I'm having trouble passing a variable in to the text match.
I expect it should be something like the following, but $i$ seems like it's probably incorrect and my google-fu failed me:
awk 'BEGIN{ for (i=1;i<=5;i++){ if ($1 ~ /_$i$/) { print } } }' 02052017_HHresults_sorted.tsv
There may be other issues I haven't spotted with that awk command too, as I say, I'm not very awk-savvy.
EDIT FOR CLARIFICATION
I want to separate out all the matches, so can't use a character class. i.e. I want all the lines ending in "_1" in one file, then all the ones ending in "_2" in another, and so on (hence the loop).
You can't put variables inside //. Use string concatenation, which is done by simply putting the strings adjacent to each other in awk. You don't need to use a regexp literal when you use the ~ operator, it always treats the second argument as a regexp.
awk '{ for (i = 1; i <= 5; i++) {
    if ( $1 ~ ("_" i "$") ) { print; break; }
  }
}' 02052017_HHresults_sorted.tsv
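
To get one output file per suffix, as the clarification asks, the same concatenated regex can be driven from a shell loop. This is only a sketch: the function name split_by_suffix and the suffix_N.tsv file names are made up for illustration, and it reads the input once per suffix.

```shell
# Hypothetical helper.
# $1: input TSV, $2: existing output directory (files are named suffix_N.tsv)
split_by_suffix() {
  for i in 1 2 3 4 5; do
    # -v passes the shell value into awk; "_" n "$" builds the regex _N$
    awk -v n="$i" '$1 ~ ("_" n "$")' "$1" > "$2/suffix_$i.tsv"
  done
}
```

Because the regex is anchored with $, "_1" does not match a first column ending in "_11".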
It sounds like you're thinking about this all wrong and what you really need is just (with GNU awk for gensub()):
awk '{ print > ("out" gensub(/.*_/,"",1,$1)) }' 02052017_HHresults_sorted.tsv
or with any awk:
awk '{ n=$1; sub(/.*_/,"",n); print > ("out" n) }' 02052017_HHresults_sorted.tsv
No need to loop; use a regex character class [..] (the three-argument match() is a GNU awk extension):
awk 'match($1,/_([1-5])$/,a){ print >> (a[1] ".txt") }' 02052017_HHresults_sorted.tsv

How to match a group of lines that match a pattern

I am trying to filter out a group of lines that match a pattern using a regexp but am having trouble getting the correct regexp to use.
The text file contains lines like this:
transaction 390134; promote; 2016/12/20 01:17:07 ; user: build
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
# some commit comment
/./som/file/path 11745/409 (22269/257)
# merged
version 22269/257 (22269/257)
ancestor: (22133/182)
transaction 390136; promote; 2016/12/20 01:17:08 ; user: najmi
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
/./some/other/file/path 11745/1 (22269/1)
version 22269/1 (22269/1)
ancestor: (none - initial version)
type: dir
I would like to filter out the lines that start with "transaction" and contain "user: build", all the way until the next line that starts with "transaction".
The idea is to end up with transaction lines where user is not "build".
Thanks for any help.
If you want only the transaction lines for all users except build:
grep '^transaction ' test_data| grep -v 'user: build$'
If you want the whole transaction record for such users:
awk '/^transaction /{ p = !/user: build$/};p' test_data
OR
perl -lne 'if(/^transaction /){$p = !/user: build$/}; print if $p' test_data
The -A and -v options of the grep command would have done the trick if all transaction records had the same number of lines.
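
The awk one-liner above can be spelled out with the user name passed in as a variable. A sketch, assuming (as in the sample) that every transaction header ends with "user: NAME"; the function name drop_user_transactions is hypothetical:

```shell
# Hypothetical helper.
# $1: user whose transaction records are dropped (compared literally, not as a regex)
# remaining arguments: input files (stdin when none are given)
drop_user_transactions() {
  user=$1; shift
  awk -v u="$user" '
    /^transaction / {
      name = $0
      sub(/.*user: /, "", name)      # keep only what follows "user: "
      sub(/[ \t]+$/, "", name)       # ignore trailing blanks
      keep = (name != u)             # decide once per record header
    }
    keep                             # print header and body while keep is true
  ' "$@"
}
```

For the question's data this would be run as drop_user_transactions build test_data.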

Add value between column using sed/awk based on matching value at certain column

I have a log file with many records. Every line has the same format, with the same columns. I want to use sed to match a value in a certain column and add a new value between two columns. As an example, a log line like this:
2014.3.17 23:57:11 127.0.0.3 10.21.31.141 http://vcs2.msg.yahoo.com/capacity *DENIED* Banned site: msg.yahoo.com GET 0 0 3 403 - working_time content3 -
My command should search the log for msg.yahoo.com (9th column) and, if it matches, add the value (Social Media) between columns 12 and 13. The intended output:
2014.3.17 23:57:11 127.0.0.3 10.21.31.141 http://vcs2.msg.yahoo.com/capacity *DENIED* Banned site: msg.yahoo.com GET 0 0 Social Media 3 403 - working_time content3 -
My awk code only puts Social Media between columns 12 and 13:
awk -v column=12 -v value="Social Media" '
BEGIN {
FS = OFS = " ";
}
{
for ( i = NF + 1; i > column; i-- ) {
$i = $(i-1);
}
$i = value;
print $0;
}
' access3.log
but it needs to find msg.yahoo.com in column 9 before adding the value. That is: if column
9 = msg.yahoo.com, put Social Media after column 12, i.e. between columns 12 and 13.
Workable but ugly is sed (as things so often are):
sed '/^\([^ ]* \)\{8\}msg\.yahoo\.com /s/^\(\([^ ]* \)\{12\}\)/\1Social Media /' filename
Here is the fix for awk
awk '$9=="msg.yahoo.com"{$(NF-6)=$(NF-6) " Social Media"}1' access3.log
Explanation
$9=="msg.yahoo.com" selects only the lines that have msg.yahoo.com in column 9.
$(NF-6)=$(NF-6) " Social Media" appends the new value to field NF-6, the field six positions before the last one (column 12 in this 18-column log).
1 is an always-true pattern, so every line is printed.
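
The same idea can be written with the column numbers counted from the left and passed in as variables, which may be easier to adapt. A sketch only: the function name add_category and the variable names key, want, col and value are chosen for illustration.

```shell
# Hypothetical helper: reads the files given as arguments, or stdin.
add_category() {
  awk -v key=9 -v want="msg.yahoo.com" -v col=12 -v value="Social Media" '
    # append the text to column 12 so that it lands before column 13
    $key == want { $col = $col OFS value }
    1    # print every line, changed or not
  ' "$@"
}
```

Assigning to a field makes awk rebuild the line with single spaces (OFS) between fields, so this fits logs that are already single-space separated, like the sample.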

Print remaining lines in file after regular expression that includes variable

I have the following data:
====> START LOG for Background Process: HRBkg Hello on 2013/09/27 23:20:20 Log Level 3
09/27 23:20:20 I Background process is using processing model #: 3
09/27 23:20:23 I
09/27 23:20:23 I -- Started Import for External Key
====> START LOG for Background Process: HRBkg Hello on 2013/09/30 07:31:07 Log Level 3
09/30 07:31:07 I Background process is using processing model #: 3
09/30 07:31:09 I
09/30 07:31:09 I -- Started Import for External Key
I need to extract the remaining file contents after the LAST match of ====> START LOG.....
I have tried numerous times to use sed/awk, however, I can not seem to get awk to utilize a variable in my regular expression. The variable I was trying to include was for the date (2013/09/30) since that is what makes the line unique.
I am on an HP-UX machine and can not use grep -A.
Any advice?
There's no need to test for a specific time just to find the last entry in the file:
awk '
BEGIN { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }
NR == FNR { if (/START LOG/) lastMatch=NR; next }
FNR == lastMatch { found=1 }
found
' file
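
If reading the file twice is not an option (for example when the log arrives on a pipe), a single-pass variant can buffer the current block and throw it away whenever a new header appears. A sketch, with the hypothetical function name last_log_block; it holds the last block in memory:

```shell
# Hypothetical helper: reads the files given as arguments, or stdin.
last_log_block() {
  awk '
    /^====> START LOG/ { buf = "" }   # a new header: forget the previous block
    { buf = buf $0 "\n" }             # collect the current block
    END { printf "%s", buf }          # whatever is left is the last block
  ' "$@"
}
```

If the input has no START LOG line at all, the whole input is printed.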
This might work for you (GNU sed):
a=2013/09/30
sed '\|START LOG.*'"$a"'|{h;d};H;$!d;x' file
This will return your desired output.
sed -n '/START LOG/h;/START LOG/!H;$!b;x;p' file
If you have tac available, you could easily do..
tac file | sed '/START LOG/q' | tac
Here is one in Python:
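#!/usr/bin/env python3
import re
import sys

for fn in sys.argv[1:]:
    with open(fn) as f:
        # The greedy .* (with re.S) runs to the end and backtracks,
        # so the group captures the last START LOG block.
        m = re.search(r'.*(^====> START LOG.*)', f.read(), re.S | re.M)
        if m:
            print(m.group(1))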
Then run:
$ ./re.py /tmp/log.txt
====> START LOG for Background Process: HRBkg Hello on 2013/09/30 07:31:07 Log Level 3
09/30 07:31:07 I Background process is using processing model #: 3
09/30 07:31:09 I
09/30 07:31:09 I -- Started Import for External Key
If you want to exclude the ====> START LOGS.. bit, change the regex to:
r'.*(?:^====> START LOG.*?$\n)(.*)'
For the record, you can easily match a variable against a regular expression in Awk, or vice versa.
awk -v date='2013/09/30' '$0 ~ date {p=1} p' file
This sets p to 1 if the input line matches the date, and prints if p is non-zero.
(Recall that the general form in Awk is condition { actions } where the block of actions is optional; if omitted, the default action is to print the current input line.)
This prints everything from the last START LOG on: the first pass records the line number of the last header, and the second pass prints from that line onward.
awk 'FNR==NR { if ($0~/^====> START LOG/) f=NR;next} FNR>=f' file file
You can use a variable, but if you have another file with another date, you need to know the date in advance.
var="2013/09/30"
awk '$0~v && /^====> START LOG/ {f=1}f' v="$var" file
====> START LOG for Background Process: HRBkg Hello on 2013/09/30 07:31:07 Log Level 3
09/30 07:31:07 I Background process is using processing model #: 3
09/30 07:31:09 I
09/30 07:31:09 I -- Started Import for External Key
With GNU awk (gawk) or Mike's awk (mawk) you can set the record separator (RS) to a multi-character string so that each record contains a whole log message. Then all you need to do is print the last one in the END block:
awk 'END { printf "%s", RS $0 }' RS='====> START LOG' infile
Output:
====> START LOG for Background Process: HRBkg Hello on 2013/09/30 07:31:07 Log Level 3
09/30 07:31:07 I Background process is using processing model #: 3
09/30 07:31:09 I
09/30 07:31:09 I -- Started Import for External Key
Answer in Perl:
Assuming your log is in filelog.txt:
use strict;
use warnings;

# Read the whole log into memory.
open(my $log, '<', 'filelog.txt') or die "could not open filelog.txt: $!";
my @line = <$log>;
close($log);

# Walk backwards from the last line, collecting lines (in file order)
# until the last START LOG header has been reached.
my @newarray;
for (my $i = $#line; $i >= 0; $i--) {
    unshift @newarray, $line[$i];
    last if $line[$i] =~ m/^====> START LOG/;
}
print @newarray;

How can I extract all conversations in a Postfix log from a particular relay using awk?

I am trying to extract the from address for a given sending relay IP address from a Postfix log file.
Any ideas?
Any help is much appreciated.
Ken
Nov 16 00:05:10 mailserver pfs/smtpd[4365]: 925D54E6D9B: client=client1[1.2.3.4]
Nov 16 00:05:10 mailserver pfs/cleanup[4413]: 925D54E6D9B: message-id=<11414>
Nov 16 00:05:10 mailserver pfs/qmgr[19118]: 925D54E6D9B: from=<11414#localhost>, size=40217, nrcpt=1 (queue active)
Nov 16 00:05:10 mailserver pfs/smtp[4420]: 925D54E6D9B: to, relay=[1.3.5.7]:25, delay=0.02, delays=0.02/0/0/0, dsn=5.0.0, status=bounced (host [1.3.5.7] refused to talk to me: 550 Please remove this address from your list)
Nov 16 00:05:10 mailserver pfs/bounce[4310]: 925D54E6D9B: sender non-delivery notification: 972E34E6D9F
Nov 16 00:05:10 mailserver pfs/qmgr[19118]: 925D54E6D9B: removed
Hmm, if you just want to collect the from and relay fields with their display bling, you could use this:
/: from=/ { lastFrom = $7 }
/relay=/ { print lastFrom, $8 }
If you really want to extract the core addresses, it gets slightly more complex...
/: from=/ { lastFrom = $7 }
/relay=/ {
r = $8
gsub(/from=</, "", lastFrom)
gsub(/>,*/, "", lastFrom)
gsub(/relay=\[/, "", r)
gsub(/\].*/, "", r)
print lastFrom, r
}
$ awk -f mail2.awk mail.dat
11414#localhost 1.3.5.7
As usual, these solutions work in both The One True Awk as well as gawk.
$7 ~ /^from=<.*>,$/ {
    from[$6] = substr($7, 7, length($7) - 8)
}
$8 ~ /^relay=\[/ {
    if (index($8, "[1.3.5.7]"))
        print from[$6]
    delete from[$6]
}
Each time a from-recording line is seen, this saves the address in an associative array,
indexed by the queue ID of the message. When a relay line is seen, if it's for
the relay you're interested in, the associated from address is printed. index() is
used just so you don't have to \-escape all of the metacharacters - "[", "]", ".".
Whether it's a relay you're interested in or not, the from data is cleaned
up so that the array doesn't grow without bounds.
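
The relay address does not have to be hard-coded. A sketch that takes it as an argument, assuming the field positions of the sample log (queue ID in $6, from= in $7, relay= in $8); the function name senders_for_relay is made up for illustration:

```shell
# Hypothetical helper.
# $1: relay IP to look for; remaining arguments: log files (stdin when none)
senders_for_relay() {
  relay=$1; shift
  awk -v relay="[$relay]" '
    $7 ~ /^from=</ {
      addr = $7
      sub(/^from=</, "", addr)       # strip the leading from=<
      sub(/>,?$/, "", addr)          # strip the trailing > and comma
      from[$6] = addr                # remember the sender per queue ID
    }
    $8 ~ /^relay=/ {
      # index() does a literal search, so [ ] and . need no escaping
      if (($6 in from) && index($8, relay)) print from[$6]
      delete from[$6]                # keep the array from growing
    }
  ' "$@"
}
```

A call would look like senders_for_relay 1.3.5.7 mail.dat.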