Match regex from right to left? - regex

Is there any way of matching a regex from right to left? What I'm looking for is a regex that gets
MODULE WAS INSERTED EVENT
LOST SIGNAL ON E1/T1 LINK OFF
CRC ERROR EVENT
CLK IS DIFF FROM MASTER CLK SRC OFF
from this input
CLI MUX trap received: (022) CL-B MCL-2ETH MODULE WAS INSERTED EVENT 07-05-2010 12:08:40
CLI MUX trap received: (090) IO-2 ML-1E1 EX1 LOST SIGNAL ON E1/T1 LINK OFF 04-06-2010 09:58:58
CLI MUX trap received: (094) IO-2 ML-1E1 EX1 CRC ERROR EVENT 04-06-2010 09:58:59
CLI MUX trap received: (009) CLK IS DIFF FROM MASTER CLK SRC OFF 07-05-2010 12:07:32
If I could have done the matching from right to left, I could have written something like: everything to the left of (EVENT|OFF), back to the second appearance of more than one space [ ]+
The best I have managed so far is to get everything from (022) to EVENT with the regex
CLI MUX trap received: \([0-9]+\)[ ]+(.*[ ]+(EVENT|OFF))
But that is not really what I wanted :)
edit: What language is it for? It's actually a config string for a filter we have, but my guess is that it uses the standard GNU C regex library.
edit2: I like the answers about cutting by length, but Amarghosh's answer was probably closer to what I was looking for. I don't really know why I didn't think of just cutting on length, like:
^.{56}(.{39}).*$
Super thanks for the quick answers...

In .NET you could use the RightToLeft option:
Regex RE = new Regex(Pattern, RegexOptions.RightToLeft);
Match theMatch = RE.Match(Source);

With regex, you could simply replace this:
^.{56}|.{19}$
with the empty string.
But really, you only need to cut out the string from "position 56" to "string-length - 19" with a substring function. That's easier and much faster than regex.
Here's an example in JavaScript, other languages work more or less the same:
var lines = [
'CLI MUX trap received: (022) CL-B MCL-2ETH MODULE WAS INSERTED EVENT 07-05-2010 12:08:40',
'CLI MUX trap received: (090) IO-2 ML-1E1 EX1 LOST SIGNAL ON E1/T1 LINK OFF 04-06-2010 09:58:58',
'CLI MUX trap received: (094) IO-2 ML-1E1 EX1 CRC ERROR EVENT 04-06-2010 09:58:59',
'CLI MUX trap received: (009) CLK IS DIFF FROM MASTER CLK SRC OFF 07-05-2010 12:07:32'
];
for (var i=0; i<lines.length; i++) {
alert( lines[i].substring(56, lines[i].length-19) );
}
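The same cut can be sketched in Python with a slice. The column spacing of the sample lines is not preserved on this page, so the line below is rebuilt with hypothetical padding (a 56-character prefix, a 39-character message column and the 19-character timestamp). `cut_message` is an illustrative name, not a library function.

```python
def cut_message(line, start=56, tail=19):
    """Drop the first `start` characters and the last `tail` characters."""
    # rstrip() removes the padding that fills the fixed-width column.
    return line[start:len(line) - tail].rstrip()


# Hypothetical fixed-width line: 56-char prefix, 39-char message, 19-char timestamp.
prefix = "CLI MUX trap received: (094)  IO-2  ML-1E1  EX1".ljust(56)
message = "CRC ERROR EVENT".ljust(39)
timestamp = "04-06-2010 09:58:59"
line = prefix + message + timestamp

print(cut_message(line))
```

Counting the tail from the end of the line means the slice still works if the message column grows, as long as the timestamp keeps its width.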

If tokens are guaranteed to be separated by more than one space, and the words within the string before EVENT|OFF are guaranteed to be separated by just one space, then (and only then) you can look for single-space-separated words followed by spaces followed by EVENT or OFF:
var s = "CLI MUX trap received: (022) CL-B MCL-2ETH MODULE WAS INSERTED EVENT 07-05-2010 12:08:40"
+ "\nCLI MUX trap received: (090) IO-2 ML-1E1 EX1 LOST SIGNAL ON E1/T1 LINK OFF 04-06-2010 09:58:58"
+ "\nCLI MUX trap received: (094) IO-2 ML-1E1 EX1 CRC ERROR EVENT 04-06-2010 09:58:59"
+ "\nCLI MUX trap received: (009) CLK IS DIFF FROM MASTER CLK SRC OFF 07-05-2010 12:07:32"
var r = /\([0-9]+\).+?((?:[^ ]+ )* +(?:EVENT|OFF))/g;
var m;
while ((m = r.exec(s)) != null)
    console.log(m[1]);
Output:
MODULE WAS INSERTED EVENT
LOST SIGNAL ON E1/T1 LINK OFF
CRC ERROR EVENT
CLK IS DIFF FROM MASTER CLK SRC OFF
Regex: /\([0-9]+\).+?((?:[^ ]+ )* +(?:EVENT|OFF))/g
\([0-9]+\) #digits in parentheses followed by
.+? #some characters - minimum required (non-greedy)
( #start capturing
(?:[^ ]+ )* #non-space characters separated by a space
` +` #more spaces (separating string and event/off -
#backticks added for emphasis), followed by
(?:EVENT|OFF) #EVENT or OFF
) #stop capturing
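For comparison, here is a rough Python port of the same pattern. The sample lines assume two spaces between columns and one space between words, because the original spacing is not preserved on this page. `extract` is a hypothetical helper name.

```python
import re

# Assumed spacing: two spaces between columns, one space between words.
LINES = [
    "CLI MUX trap received: (022)  CL-B  MCL-2ETH  MODULE WAS INSERTED  EVENT  07-05-2010 12:08:40",
    "CLI MUX trap received: (090)  IO-2  ML-1E1  EX1  LOST SIGNAL ON E1/T1 LINK  OFF  04-06-2010 09:58:58",
]

PATTERN = re.compile(r"\([0-9]+\).+?((?:[^ ]+ )* +(?:EVENT|OFF))")


def extract(line):
    """Return the message plus EVENT/OFF, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # Collapse the column padding inside the capture to single spaces.
    return " ".join(m.group(1).split())


for line in LINES:
    print(extract(line))
```

The capture keeps the padding between the message and EVENT/OFF, which is why the sketch normalizes whitespace afterwards.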

Does the input file fit nicely into fixed width tabular text like this? Because if it does, then the simplest solution is to just take the right substring of each line, from column 56 to column 94.
In Unix, you can use the cut command:
cut -c56-94 yourfile
See also
Wikipedia/Cut (Unix)
In Java, you can write something like this:
String[] lines = {
"CLI MUX trap received: (022) CL-B MCL-2ETH MODULE WAS INSERTED EVENT 07-05-2010 12:08:40",
"CLI MUX trap received: (090) IO-2 ML-1E1 EX1 LOST SIGNAL ON E1/T1 LINK OFF 04-06-2010 09:58:58",
"CLI MUX trap received: (094) IO-2 ML-1E1 EX1 CRC ERROR EVENT 04-06-2010 09:58:59",
"CLI MUX trap received: (009) CLK IS DIFF FROM MASTER CLK SRC OFF 07-05-2010 12:07:32",
};
for (String line : lines) {
System.out.println(line.substring(56, 94));
}
This prints:
MODULE WAS INSERTED EVENT
LOST SIGNAL ON E1/T1 LINK OFF
CRC ERROR EVENT
CLK IS DIFF FROM MASTER CLK SRC OFF
A regex solution
This is most likely not necessary, but something like this works (as seen on ideone.com):
line.replaceAll(".* \\b(.+ .+) \\S+ \\S+", "$1")
As you can see, it's not very readable, and you have to know your regex to really understand what's going on.
Essentially you match this to each line:
.* \b(.+ .+) \S+ \S+
And you replace it with whatever group 1 matched. This relies on the usage of two consecutive spaces exclusively for separating the columns in this table.

How about
.{56}(.*(EVENT|OFF))

Can you do field-oriented processing, rather than a regex? In awk/sh, this would look like:
< $datafile awk '{ print $(NF-3), $(NF-2) }' | column
which seems rather cleaner than specifying a regex.
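The field-oriented idea can also be sketched in Python by splitting on the column separator and indexing from the right-hand end. This assumes the columns are separated by runs of two or more spaces (the spacing is not preserved on this page), and `message_and_status` is a made-up name for illustration.

```python
import re


def message_and_status(line):
    """Return 'message EVENT|OFF' using fields counted from the right."""
    # Split on runs of two or more spaces, so single-spaced words stay together.
    fields = re.split(r" {2,}", line.strip())
    if len(fields) < 3:
        return None
    # fields[-1] is the timestamp, fields[-2] is EVENT/OFF, fields[-3] is the message.
    return fields[-3] + " " + fields[-2]


# Hypothetical spacing: two spaces between columns.
sample = "CLI MUX trap received: (009)  CLK IS DIFF FROM MASTER CLK SRC  OFF  07-05-2010 12:07:32"
print(message_and_status(sample))
```

Negative indexes give the "right to left" behaviour the question asks for without involving the regex engine's match direction at all.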

Related

Telegraf: How to extract from field using regex processor?

I would like to extract the values for connections, upstream and downstream using telegraf regex processor plugin from this input:
2022/11/16 22:38:48 In the last 1h0m0s, there were 10 connections. Traffic Relayed ↑ 60 MB, ↓ 4 MB.
With this configuration, the result key "upstream" is a copy of the original message in which only the matched part has been replaced.
[[processors.regex]]
tagpass = ["snowflake-proxy"]
[[processors.regex.fields]]
## Field to change
key = "message"
## All the power of the Go regular expressions available here
## For example, named subgroups
pattern = 'Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),'
replacement = "${UPSTREAM}"
## If result_key is present, a new field will be created
## instead of changing existing field
result_key = "upstream"
Current output:
2022/11/17 10:38:48 In the last 1h0m0s, there were 1 connections. Traffic 3 MB ↓ 5 MB.
How do I get the decimals?
I'm quite confused about how to use the regex here, because according to several examples on the web it should work like this. See for example: http://wiki.webperfect.ch/index.php?title=Telegraf:_Processor_Plugins
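The replacement config option specifies what you want to substitute in for any matches. Only the text matched by the pattern is replaced and the rest of the field is left as it is, so to keep just the captured value the pattern has to match the whole message.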
I think you want something closer to this:
[[processors.regex.fields]]
key = "message"
pattern = '.*Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),.*$'
replacement = "${1}"
result_key = "upstream"
to get:
upstream="60 MB"
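The difference between the two patterns can be reproduced outside Telegraf. The sketch below uses Python's `re.sub` purely as a stand-in for the processor's replace step (an assumption made for illustration; Telegraf itself uses Go's regexp package, and Python spells the group reference `\g<UPSTREAM>`).

```python
import re

message = ("2022/11/16 22:38:48 In the last 1h0m0s, there were 10 connections. "
           "Traffic Relayed ↑ 60 MB, ↓ 4 MB.")

partial = r"Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),"
whole = r".*Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),.*$"

# Only the matched span is replaced; the text around it survives.
partial_result = re.sub(partial, r"\g<UPSTREAM>", message)
# The pattern now covers the whole message, so only the capture remains.
whole_result = re.sub(whole, r"\g<UPSTREAM>", message)

print(partial_result)
print(whole_result)
```

The first result still carries the timestamp and the rest of the sentence, which mirrors the unexpected output in the question.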

Python: Match a special character with regular expression

Hi everyone, I'm using the re.match function to extract pieces of a string from each row of a file.
My code is as follows:
## fp_tmp => file pointer
for x in fp_tmp:
    try:
        cpuOverall = re.match(r"(Overall CPU load average)\s+(\S+)(%)", x)
        cpuUsed = re.match(r"(Total)\s+(\d+)(%)", x)
        ramUsed = re.match(r"(RAM Utilization)\s+(\d+\%)", x)
        #### Does not work ####
        if cpuUsed is not None: cpuused_new = cpuUsed.group(2)
        if ramUsed is not None: ramused_new = ramUsed.group(2)
        if cpuOverall is not None: cpuoverall_new = cpuOverall.group(2)
    except:
        searchbox_result = None
Each field is extracted from the following corresponding line:
ramUsed => RAM Utilization 2%
cpuUsed => Total 4%
cpuOverall => Overall CPU load average 12%
ramUsed, cpuUsed and cpuOverall are the variables where I want to store the results.
The actual lines are:
(unspecified amount of whitespace) RAM Utilization 2%
(unspecified amount of whitespace) Total 4%
(unspecified amount of whitespace) Overall CPU load average 12%
When I run the script, all three variables are None.
With other variables the script works correctly.
Why doesn't the code work in this case? I'm using Python 3.
I think the problem is that the % character is not being read.
Do you have any suggestions?
PROBLEM 2:
## fp_tmp => file pointer
for x in fp_tmp:
    try:
        emailReceived = re.match(r".*(Messages Received)\s+\S+\s+\S+\s+(\S+)", x)
        #### Does not work ####
        if emailReceived is not None: emailreceived_new = emailReceived.group(2)
    except:
        searchbox_result = None
The field appears on two lines in the file:
[....]
Counters: Reset Uptime Lifetime
Receiving
Messages Received 3,406 1,558 3,406
[....]
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0
Recipients Received 0 0 0
[....]
I want to extract only the second occurrence, that is:
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0 <-this
Do you have any suggestions?
cpuOverall line: you forgot that there is more information at the start of the line. Change to
'.*(Overall CPU load average)\s+(\S+%)'
cpuUsed line: you forgot that there is more information at the start of the line. Change to
'.*(Total)\s+(\d+%)'
ramUsed line: you forgot that there is more information at the start of the line... Change to
'.*(RAM Utilization)\s+(\d+%)'
Remember that re.match only matches at the beginning of the string:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. [..]
With these changes, your three variables are set to the percentages:
>>> print (cpuused_new,ramused_new,cpuoverall_new)
4% 2% 12%
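An alternative to prefixing each pattern with `.*` is `re.search`, which scans forward for the first match instead of requiring one at position 0. A minimal sketch, assuming a line with some leading whitespace as described in the question:

```python
import re

line = "        RAM Utilization 2%"   # leading whitespace, as in the report

pattern = r"(RAM Utilization)\s+(\d+%)"

anchored = re.match(pattern, line)    # must match at position 0
anywhere = re.search(pattern, line)   # scans forward for the first match

print(anchored)
print(anywhere.group(2))
```

Note that `%` needs no escaping in a Python regex, so it was not the cause of the failed match.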

Regex to find everything in between

I have the following regex, which works when there is no leading \d,"There is 1 interface on the system:
or a trailing ",2017-01-...
Here is the regex:
(?m)(?<_KEY_1>\w+[^:]+?):\s(?<_VAL_1>[^\r\n]+)$
Here is a sample of what I am trying to parse:
1,"There is 1 interface on the system:
Name : Mobile Broadband Connection
Description : Qualcomm Gobi 2000 HS-USB Mobile Broadband Device 250F
GUID : {1234567-12CD-1BC1-A012-C1A1234CBE12}
Physical Address : 00:a0:c6:00:00:00
State : Connected
Device type : Mobile Broadband device is embedded in the system
Cellular class : CDMA
Device Id : A1000001234f67
Manufacturer : Qualcomm Incorporated
Model : Qualcomm Gobi 2000
Firmware Version : 09010091
Provider Name : Verizon Wireless
Roaming : Not roaming
Signal : 67%",2017-01-20T16:00:07.000-0700
I am trying to extract field names where for example Cellular class would equal CDMA but for all fields beginning after:
1,"There is 1 interface on the system: (where the leading 1 increments: 1, 2, 3, 4 and so on)
and before the trailing ",2017-01....
Any help is much appreciated!
You could use look-ahead to ensure that the strings you match come before a ",\d sequence, and do not include a ". The latter would ensure you will only match between double quotes, of which the second has the pattern ",\d:
/^\h*(?<_KEY_1>[\w\h]+?)\h*:\h*(?<_VAL_1>[^\r\n"]+)(?="|$)(?=[^"]*",\d)/gm
See it on regex101
NB: I put the g and m modifiers at the end, but if your environment requires them at the start with (?m) notation, that will work too of course.
Your example string seems to be a record from a csv file. This is how I would accomplish the task with Python (2.7 or 3.x):
import csv
with open('file.csv', 'r') as fh:
    reader = csv.reader(fh)
    results = []
    for fields in reader:
        # fields[1] is the quoted multi-line column; its first line is the header sentence
        lines = fields[1].splitlines()
        # split each remaining line on the first colon only, so values may contain colons
        keyvals = [list(map(str.strip, line.split(':', 1))) for line in lines[1:]]
        results.append(keyvals)
print(results)
It can be done in a similar way with other languages.
You haven't responded to my comments or any of the answers, but here is my answer - try
^\s*(?<_KEY_1>[\w\s]+?)\s*:\s*(?<_VAL_1>[^\r\n"]+).*$
See it here at regex101.
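If the regex route is preferred, the key/value pairs can be collected into a dictionary. This is only a sketch in Python, whose `re` module has no `\h`, so `[ \t]` stands in for horizontal whitespace. The shortened record and its indentation are assumptions for illustration.

```python
import re

# Shortened, hypothetical copy of the sample record.
record = '''1,"There is 1 interface on the system:
    Name                   : Mobile Broadband Connection
    Physical Address       : 00:a0:c6:00:00:00
    Cellular class         : CDMA
    Signal                 : 67%",2017-01-20T16:00:07.000-0700'''

# Key: words and spaces at the start of a line; value: the rest up to a quote or newline.
PAIR = re.compile(r'^[ \t]*(?P<key>\w[\w ]*?)[ \t]*:[ \t]*(?P<val>[^\r\n"]+)', re.M)

fields = {m.group("key"): m.group("val").strip() for m in PAIR.finditer(record)}

for key, value in fields.items():
    print(key, "=", value)
```

The header line is skipped because the comma after the leading number cannot be part of a key, and the lazy key stops at the first colon so values such as the MAC address keep their own colons.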

Print remaining lines in file after regular expression that includes variable

I have the following data:
====> START LOG for Background Process: HRBkg Hello on 2013/09/27 23:20:20 Log Level 3
09/27 23:20:20 I Background process is using processing model #: 3
09/27 23:20:23 I
09/27 23:20:23 I -- Started Import for External Key
====> START LOG for Background Process: HRBkg Hello on 2013/09/30 07:31:07 Log Level 3
09/30 07:31:07 I Background process is using processing model #: 3
09/30 07:31:09 I
09/30 07:31:09 I -- Started Import for External Key
I need to extract the remaining file contents after the LAST match of ====> START LOG.....
I have tried numerous times to use sed/awk; however, I cannot seem to get awk to use a variable in my regular expression. The variable I was trying to include was for the date (2013/09/30), since that is what makes the line unique.
I am on an HP-UX machine and cannot use grep -A.
Any advice?
There's no need to test for a specific time just to find the last entry in the file:
awk '
BEGIN { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }
NR == FNR { if (/START LOG/) lastMatch=NR; next }
FNR == lastMatch { found=1 }
found
' file
This might work for you (GNU sed):
a=2013/09/30
sed '\|START LOG.*'"$a"'|{h;d};H;$!d;x' file
This will return your desired output.
sed -n '/START LOG/h;/START LOG/!H;$!b;x;p' file
If you have tac available, you could easily do..
tac <file> | sed '/START LOG/q' | tac
Here is one in Python:
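#!/usr/bin/python
import sys, re
for fn in sys.argv[1:]:
    with open(fn) as f:
        # The greedy .* (with re.S) runs to the end of the file and backtracks,
        # so the group starts at the last START LOG line.
        m = re.search(r'.*(^====> START LOG.*)', f.read(), re.S | re.M)
        if m:
            print(m.group(1))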
Then run:
$ ./re.py /tmp/log.txt
====> START LOG for Background Process: HRBkg Hello on 2013/09/30 07:31:07 Log Level 3
09/30 07:31:07 I Background process is using processing model #: 3
09/30 07:31:09 I
09/30 07:31:09 I -- Started Import for External Key
If you want to exclude the ====> START LOGS.. bit, change the regex to:
r'.*(?:^====> START LOG.*?$\n)(.*)'
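A line-based variant is sketched below for comparison. It avoids running one regex over the whole file contents by remembering the index of the last marker line. `tail_from_last_marker` and the shortened sample lines are made up for illustration.

```python
def tail_from_last_marker(lines, marker="====> START LOG"):
    """Return the lines from the last one starting with `marker` to the end."""
    last = None
    for index, line in enumerate(lines):
        if line.startswith(marker):
            last = index  # keep overwriting, so the final value is the last match
    return [] if last is None else lines[last:]


# Shortened sample with two log blocks.
sample = [
    "====> START LOG for Background Process: HRBkg Hello on 2013/09/27 23:20:20 Log Level 3",
    "09/27 23:20:23 I -- Started Import for External Key",
    "====> START LOG for Background Process: HRBkg Hello on 2013/09/30 07:31:07 Log Level 3",
    "09/30 07:31:09 I -- Started Import for External Key",
]

for line in tail_from_last_marker(sample):
    print(line)
```

For a real file the list could come from `f.read().splitlines()`; no date needs to be known in advance.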
For the record, you can easily match a variable against a regular expression in Awk, or vice versa.
awk -v date='2013/09/30' '$0 ~ date {p=1} p' file
This sets p to 1 if the input line matches the date, and prints if p is non-zero.
(Recall that the general form in Awk is condition { actions } where the block of actions is optional; if omitted, the default action is to print the current input line.)
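This prints everything from the last START LOG line onwards. The first pass over the file records the line number of the last match in f, and the second pass prints from that line on: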
awk 'FNR==NR { if ($0~/^====> START LOG/) f=NR;next} FNR>=f' file file
You can use a variable, but if you have another file with another date, you need to know the date in advance.
var="2013/09/30"
awk '$0~v && /^====> START LOG/ {f=1}f' v="$var" file
====> START LOG for Background Process: HRBkg Hello on 2013/09/30 07:31:07 Log Level 3
09/30 07:31:07 I Background process is using processing model #: 3
09/30 07:31:09 I
09/30 07:31:09 I -- Started Import for External Key
With GNU awk (gawk) or Mike's awk (mawk) you can set the record separator (RS) so that each record will contain a whole log message. So all you need to do is print the last one in the END block:
awk 'END { printf "%s", RS $0 }' RS='====> START LOG' infile
Output:
====> START LOG for Background Process: HRBkg Hello on 2013/09/30 07:31:07 Log Level 3
09/30 07:31:07 I Background process is using processing model #: 3
09/30 07:31:09 I
09/30 07:31:09 I -- Started Import for External Key
Answer in Perl:
Assuming your logs are in filelog.txt:
use strict;
use warnings;

# Read the whole log into an array, one line per element.
open(my $log, '<', 'filelog.txt') or die "could not open filelog.txt: $!";
my @line = <$log>;
close($log);

# Walk backwards from the last line until the last START LOG header,
# prepending each line so the original order is kept.
my @newarray;
for (my $i = $#line; $i >= 0; $i--) {
    unshift @newarray, $line[$i];
    if ($line[$i] =~ m/^====> START LOG.*/) {
        last;
    }
}
print @newarray;

Parsing (partially) non-uniform text blocks in Perl

I have a file with a few blocks that look like this in a file (and in a variable, at this point in the program).
Vlan2 is up, line protocol is up
....
reliability 255/255, txload 1/255, rxload 1/255^M
....
Last clearing of "show interface" counters 49w5d
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
....
L3 out Switched: ucast: 17925 pkt, 23810209 bytes mcast: 0 pkt, 0 bytes
33374 packets input, 13154058 bytes, 0 no buffer
Received 926 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
3094286 packets output, 311981311 bytes, 0 underruns
0 output errors, 0 interface resets
0 output buffer failures, 0 output buffers swapped out
Here's a second block, to show you how the blocks can slightly vary:
port-channel86 is down (No operational members)
...
reliability 255/255, txload 1/255, rxload 1/255
...
Last clearing of "show interface" counters 31w2d
...
RX
147636 unicast packets 0 multicast packets 0 broadcast packets
84356 input packets 119954232 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 0 CRC 0 no buffer
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop
0 input with dribble 0 input discard
0 Rx pause
TX
147636 unicast packets 0 multicast packets 0 broadcast packets
84356 output packets 119954232 bytes
0 jumbo packets
0 output error 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 output discard
0 Tx pause
0 interface resets
I want to pick out certain data elements from each block, which may or may not exist in each block. For example, in the first block I posted I may want to know that there are 0 runts, 0 input errors and 0 overrun. In the second block, I might want to know that there are 0 jumbo packets, collisions, etc. If a given query isn't in the block, it's acceptable to just return na, as this is designed to be processed uniformly.
Each block is structured in a similar way to the two I posted; newlines and spaces delimiting some entries, commas delimiting others.
I have a few ideas as to how this might work. I'm unaware if there is any kind of "look back" function in Perl, but I could attempt to look for the field names (runts, "input errors", etc) and then grab the previous integer; that seems like it would be the most elegant solution for this, but I'm unsure if it's possible.
Currently, I'm doing this in Perl. Each "block" that I'm processing is actually several of these blocks (separated by double newlines). It doesn't have to be done in a single regular expression; I believe it can be done by applying several regular expressions per block. Performance is not really a factor, as this script will run maybe once per hour.
My goal is to get all of this into a .csv file (or some other data format that's easily graphable) in an automated fashion.
Any ideas?
Edit: example output in CSV as I mentioned, which would be written line by line (for multiple entries like this) to a file as the end result. If a particular entry isn't found in the block, it is marked na in the corresponding line:
interface_name,txload,rxload,last_clearing,input_queue,output_drops,runts,....
vlan2,1,1,49w5d,0-75-0-0,0,0,....
port-channel86,1,1,31w2d,na,na,0,...
Simple hash of properties and numbers.
sub extract {
    my ($block) = @_;
    my %r;
    # Every "<number> <words>" pair becomes one entry: words => number.
    while ($block =~ /(?<num>\d+) \s (?<name>[A-Za-z\s]+)/gmsx) {
        my $name = $+{name};
        my $num = $+{num};
        $name =~ s/\A \s+//msx;
        $name =~ s/\s+ \z//msx;
        $r{$name} = $num;
    }
    return %r;
}
my $block = <<'';
Vlan2 is up, line protocol is up
⋮
my $block2 = <<'';
port-channel86 is down (No operational members)
⋮
use Data::Dumper qw(Dumper);
print Dumper {extract $block};
print Dumper {extract $block2};
I don't think a single regex could do it, nor would I want to support it if it could.
Using multiple regexes, you could easily use something like:
(\d+) runts
(\d+) input errors
...etc...
A simple array of property names and a loop could solve this pretty quickly and with very little fuss.
If you can strip down the input to smaller chunks with some preprocessing, you would be less likely to get false positives.
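The "array of property names and a loop" idea might look roughly like this. The sketch is in Python rather than Perl, and the property list, the helper name `extract_counters` and the shortened block are assumptions for illustration.

```python
import re

# Hypothetical list of counters to look for in every block.
PROPERTIES = ["runts", "input errors", "overrun", "jumbo packets"]


def extract_counters(block, names=PROPERTIES):
    """Map each property name to the integer printed before it, or 'na'."""
    result = {}
    for name in names:
        # One small regex per property: a number, a space, then the name.
        m = re.search(r"(\d+) " + re.escape(name) + r"\b", block)
        result[name] = m.group(1) if m else "na"
    return result


# Shortened copy of the first block from the question.
block = """33374 packets input, 13154058 bytes, 0 no buffer
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored"""

counters = extract_counters(block)
print(",".join(counters[name] for name in PROPERTIES))
```

Keeping the names in one list fixes the column order for the CSV output, and a property missing from a block simply comes back as na.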
Here is one way to do it in awk, but it needs a lot of tweaking to be perfect.
But again, use SNMP.
awk '{
    printf "%s", $1
    for (i=1;i<=NF;i++) {
        if ($i" "$(i+1)~/Input queue:/) printf ",%s",$(i+2)
        if ($i~/runts/) printf ",%s",$(i-1)
        if ($i~/multicast,/) printf ",%s",$(i-1)
    }
    print ""
}' RS="swapped out" file