Multiline sed regex extraction issue: part of buffer matches - regex

I have to extract data from a log, and I'm trying to use sed to extract the data from 3 lines. The log entries (after grepping) look like this:
Tuesday March 11 2014
INBOUND>>>>> 06:22:10:066 Eventid:141004(3)
[SGW-S11/S4]GTPv2C Rx PDU, from 172.9.9.1:10000 to 173.10.10.1:2123 (187)
TEID: 0x00000000, Message type: EGTP_CREATE_SESSION_REQUEST (0x20)
I need to extract the "from IP", the "to IP", and the "Message Type".
This is what I have as of now:
sed -n '1!N; s/^INBOUND>>>>>.*\n.*from \([0-9.]*\).* to \([0-9.]*\).*/\1 \2/p'
When I extend it to the third line, to extract the message type, with:
sed -n '1!N; s/^INBOUND>>>>>.*\n.*from \([0-9.]*\).* to \([0-9.]*\).*\n.*, Message type: \([A-Z_]*\).*/\1 \2/p'
The entire pattern doesn't match.
This doesn't match the string unless there is a line before the INBOUND>>>>> string, which I think should match, since the ^ indicates the start of line. (This isn't really a problem since there is a datestamp, just a curiosity)
Bash Version: GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Sed Version: GNU sed version 4.1.5
Could you please give me any pointers on this? Thanks in advance.
P.S. The IPs can be IPv4 or IPv6, but I will change the IP regex once this problem's solved.
P.P.S. I need to use a regex i.e. not awk, because there will be other patterns too; this is the first, and I'm having problems :(

Your entire pattern
sed -n '1!N; s/^INBOUND>>>>>.*\n.*from \([0-9.]*\).* to \([0-9.]*\).*\n.*, Message type:\([A-Z_]*\).*/\1 \2/p'
can't match because you're missing a space between Message type: and \([A-Z_]*\)
Are you sure there are no hidden characters before INBOUND (when you omit the first line)?
This one works for me:
sed -r 's/.*from ([0-9.:]*) to ([0-9.:]*).*Message type: ([A-Z_]*).*/\1 \2 \3/'
(note that I used the -r flag so I won't have to escape the brackets)
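For what it's worth, there is probably a second reason the three-line version never matches: `1!N` only ever puts two lines in the pattern space, so a regex containing two `\n` has nothing to match against. That would also explain the curiosity about needing a line before INBOUND, since `1!N` skips line 1 and pairs up lines 2+3, 4+5, and so on. A minimal sketch, assuming GNU sed (where `.` also matches the embedded newlines) and using the sample lines from the question; `extract_gtp` is just a hypothetical helper name:

```shell
# extract_gtp is a hypothetical helper name, not part of the original question.
# On every line starting with INBOUND>>>>>, append the next two lines to the
# pattern space, then pull out the from-IP, the to-IP and the message type.
extract_gtp() {
  sed -n '/^INBOUND>>>>>/{
    N
    N
    s/.*from \([0-9.]*\):[0-9]* to \([0-9.]*\):.*Message type: \([A-Z_]*\).*/\1 \2 \3/p
  }'
}

# Sample input copied from the question.
printf '%s\n' \
  'Tuesday March 11 2014' \
  'INBOUND>>>>> 06:22:10:066 Eventid:141004(3)' \
  '[SGW-S11/S4]GTPv2C Rx PDU, from 172.9.9.1:10000 to 173.10.10.1:2123 (187)' \
  'TEID: 0x00000000, Message type: EGTP_CREATE_SESSION_REQUEST (0x20)' |
  extract_gtp
```

Because the address is the INBOUND line itself, this does not depend on whether the entry starts on an odd or even line. The `[0-9.]*` groups are IPv4-only, as in the question.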

You can use awk and no regex:
awk -F" |:" '/^INBOUND/ {getline;print $5 RS $8;getline;print $7}' file
172.9.9.1
173.10.10.1
EGTP_CREATE_SESSION_REQUEST
You say this is data output from a grep; that step could probably be incorporated into the awk as well.
Show us all of the data and how you would like the output to look, and we will help you.
awk -F" |:" '/^INBOUND/ {getline;printf "%s %s",$5,$8;getline;print "",$7}' file
172.9.9.1 173.10.10.1 EGTP_CREATE_SESSION_REQUEST

Related

Regex command line change format of each line

I have a file that contains lines in a format similar to this...
/data/file.geojson?10,20,30,40
/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-2.8482055664062496,54.38935426009769,-0.300750732421875,55.158473983815306
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
I've tried a combination of grep, sed, gawk, and |(pipes) to try and pattern match and then change the format to be more like this...
[10,40],[30,40],[30,20],[10,20],
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979].....
Hopefully you get the idea from the first line so I don't have to type out all the examples manually!
I've got the hang of regex to match the co-ordinates. In fact the input file is the result of extracting from apache access logs. It might be easier to read/understand answers if they just match positive integer numbers, I will then be able to slot in a more complicated pattern to match the right range.
To arrange the results the way you wish, it is important to be able to access the last four values on each line.
No pattern matching is required if you use awk. You can split the input strings by a set of delimiters and reassemble the resulting fields. 40 can be accessed as $(NF), 30 as $(NF-1) and so on.
awk -F'[?,=]' '
{printf "[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n",
$(NF-3),$(NF),$(NF-1),$(NF),
$(NF-1),$(NF-2),$(NF-3),$(NF-2)
}' file
I'm using ?, , or = as the field delimiters. This makes it simple to access the columns of interest.
Output:
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979],[3.0322265625,49.05227025601607],[-5.20751953125,49.05227025601607]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-2.8482055664062496,55.158473983815306],[-0.300750732421875,55.158473983815306],[-0.300750732421875,54.38935426009769],[-2.8482055664062496,54.38935426009769]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
By the way, sed can also be used here:
sed -r 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
The command is capturing the numbers at the end each in a separate capturing group and re-assembles them in the replacement part.
Not all versions of sed support the + quantifier. The most compatible version would look like this :)
sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
sed strips off items prior to numbers, then awk splits on comma and outputs in different order. Assuming data is in a file called "td.txt"
sed 's/^[^0-9-]*//' td.txt|awk -F, '{print "["$1","$4"],["$3","$4"],["$3","$2"],["$1","$2"],"}'
This might work for you (GNU sed):
sed -r 's/^.*\?[^-0-9]*([^,]*),([^,]*),([^,]*),([^,]*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
Or with more toothpicks:
sed 's/^.*?[^-0-9]*\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
You can use the following to match:
(\/data\/file\.geojson\?(?:bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)
And replace with the following:
$1[$2,$3],[$4,$5]

Linux search and replace a patterns case within a string

Been struggling to figure out a way to do this. Basically I need to change the case of anything enclosed in {} from lower to upper within a string representing a uri (and also strip out the braces but I can use sed to do that)
E.g
/logs/{server_id}/path/{os_id}
To
/logs/SERVER_ID/path/OS_ID
The case of the rest of the string must be preserved in lower which is what has been beating me. Looked at combos of sed,awk,tr with regex so far. Any help appreciated.
sed "s/{\([^{}]*\)}/\U\1/g"
This works by matching all text enclosed within {} and replacing it with its uppercase version.
echo "/logs/{server_id}/path/{os_id}" | sed "s/{\([^{}]*\)}/\U\1/g"
Gives /logs/SERVER_ID/path/OS_ID as the result.
echo "/logs/{server_id}/path/{os_id}" \
| sed 's#{\([^{}][^{}]*\)}#\U\1#;s#{\([^{}][^{}]*\)}#\U\1#'
output
/logs/SERVER_ID/path/OS_ID
The part of the solution you seem to have missed is the 'capture groups' available in sed, i.e. \(regex\). This is then referenced by \1. You could have anywhere from 1-9 capture groups if you're a real masochist ;-)
Also note that I just repeat the same command twice: once the first {...} pair has been converted to the uppercase version (without the surrounding {}s), only the remaining {...} targets will match.
There is probably a less verbose syntax available for [^{}][^{}]*, but this will work with just about any sed going back to the 80s. I seem to recall that some seds don't support the \U directive, but for the systems I have access to, this works.
Does that help?
$ awk '{
while(match($0,/{[^}]+}/))
$0=substr($0,1,RSTART-1) toupper(substr($0,RSTART+1,RLENGTH-2)) substr($0,RSTART+RLENGTH)
}1' file
/logs/SERVER_ID/path/OS_ID
This one handles arbitrary number and format of braces:
echo "/logs/{server_id}/path/{os_id}/{foo}" | awk -v RS='{' -v FS='}' -v ORS='' -v OFS='' '!/}/ { print } /}/ { $1 = toupper($1); print}'
Output:
/logs/SERVER_ID/path/OS_ID/FOO

sed : match all instances of regex in infile1.txt, and output only these to outfile2.txt

I have a text file infile1 with 1,000's of lines.
I wish to use sed to extract the occurring instances of a regex pattern match to outfile2.
NB
Each instance of the regex pattern match may occur more than once on each line of infile1.
Each instance of the extracted regex pattern should be printed to a new line in outfile2.
Does anyone know the syntax within sed to place the regex into?
ps the regex pattern is
\(Google[ ]{1,3}“[a-zA-Z0-9 ]{1,100}[., ]{0,3}”\)
Thank you :)
I think you want
grep -oE 'Google[ ]{1,3}"[a-zA-Z0-9 ]{1,100}[., ]{0,3}"' filename
-o tells grep to print only the matches, each on a line of its own, and -E instructs it to interpret the regex in extended POSIX syntax, which your regex appears to be.
Note that [ ] could be replaced with just a space, and you might want to use [[:alnum:] ] instead of [a-zA-Z0-9 ] to cover umlauts and suchlike if they exist in the current locale.
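To see the one-match-per-line behaviour, here is a small sketch on made-up input. The sample lines and the helper name `extract_matches` are assumptions for illustration, and it uses straight double quotes and the `[[:alnum:] ]` variant mentioned above:

```shell
# extract_matches is a hypothetical helper name; it reads stdin or the files given.
extract_matches() {
  grep -oE 'Google {1,3}"[[:alnum:] ]{1,100}[., ]{0,3}"' "$@"
}

# Made-up sample: the first line holds two matches, the second line none.
printf '%s\n' 'x Google "abc" y Google  "def." z' 'no match here' |
  extract_matches
```

Each of the two matches on the first line should come out on its own line, and the second input line should produce nothing. To write the result to a file, redirect as usual: extract_matches infile1 > outfile2.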
Addendum: It is also possible to do this with sed. I don't recommend it, but you could write (using GNU sed):
sed -rn 's/Google[ ]{1,3}"[A-Za-z0-9 ]{1,100}[., ]{0,3}"/\n&\n/g; s/[^\n]*\n([^\n]*\n)/\1/g; s/\n[^\n]*$//p' filename
To make this work with older versions of BSD sed, use -En instead of -rn. -r and -E enable extended regex syntax. -r was historically used by GNU sed, -E by BSD sed; newer versions of them support both for compatibility. -n disables auto-printing.
The code works as follows:
# mark all occurrences of the regex by circumscribing them with newlines
s/Google[ ]{1,3}"[A-Za-z0-9 ]{1,100}[., ]{0,3}"/\n&\n/g
# Isolate every other line from the pattern space (the matches). This will
# leave the part behind the last match...
s/[^\n]*\n([^\n]*\n)/\1/g
# ...so we remove it afterwards and print the result of the transformation if it
# happened (the s///p flag does that). The transformation will not happen if
# there were no matches in the line (because then no newlines will have been
# inserted), so in those cases nothing will be printed.
s/\n[^\n]*$//p
It can be done with sed too, but it isn't pretty:
sed -n ':start /foo/{ h; s/\(foo\).*/\1/; s/.*\(foo\)/\1/; p; g; s/foo\(.*\)/\1/; b start; }' infile1 >outfile2
-- provided that you replace the four occurrences of foo above with your pattern Google {1,3}“[a-zA-Z0-9 ]{1,100}[., ]{0,3}”.
Yeah, I told you it isn't pretty. :)

Understanding a sed example

I found a solution for extracting the password from a Mac OS X Keychain item. It uses sed to get the password from the security command:
security 2>&1 >/dev/null find-generic-password -ga $USER | \
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
The code is here in a comment by 'sr105'. The part before the | evaluates to password: "secret". I'm trying to figure out exactly how the sed command works. Here are some thoughts:
I understand the flags -En, but what are the commas doing in this example? In the sed docs it says a comma separates an address range, but there's 3 commas.
The first 'address' /^password: / has a trailing s; in the docs s is only mentioned as the replace command like s/pattern/replacement/. Not the case here.
The ^password: "(.*)"$ part looks like the Regex for isolating secret, but it's not delimited.
I can understand the end part where the back-reference \1 is printed out, but again, what are the commas doing there??
Note that I'm not interested in an easier alternative to this sed example. This will only be part of a larger bash script which will include some more sed parsing in an .htaccess file, so I'd really like to learn the syntax even if it is obscure.
Thanks for your help!
Here is sed command:
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
Commas are used as the regex delimiter here; it could just as well be another delimiter, such as #:
sed -En '/^password: / s#^password: "(.*)"$#\1#p'
/^password: / finds an input line that starts with password:
s#^password: "(.*)"$#\1#p finds and captures double-quoted string after password: and replaces the entire line with the captured string \1 ( so all that remains is the password )
First, the command extracts passwords from a file (or stream) and prints them to stdout.
While you "normally" might execute a sed command on all lines of a file, sed offers to specify a regex pattern which describes which lines the following command should get applied to.
In your case
/^password: /
is a regex, saying that the command:
s,^password: "(.*)"$,\1,p
should get executed for all lines looking like password: "secret". The command substitutes those lines with the password itself, while the -n flag suppresses all other lines.
The substitute command might look uncommon but you can choose the delimiter in an sed command, it is not limited to /. In this case , was chosen.
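A quick way to convince yourself of this is to run the same substitution with different delimiters. This is only a sketch: the input line is made up to stand in for the output of the security command, and it assumes a sed that accepts -E:

```shell
# Made-up input standing in for the output of the security command.
input='password: "secret"'

# The same address and substitution, written with three different delimiters.
printf '%s\n' "$input" | sed -En '/^password: / s/^password: "(.*)"$/\1/p'
printf '%s\n' "$input" | sed -En '/^password: / s,^password: "(.*)"$,\1,p'
printf '%s\n' "$input" | sed -En '/^password: / s#^password: "(.*)"$#\1#p'
```

All three commands should print the same thing, the text between the double quotes. Whatever character follows the s becomes the delimiter for that command, which is handy when the pattern itself contains slashes.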

Regular Expression to parse Common Name from Distinguished Name

I am attempting to parse (with sed) just First Last from the following DN(s) returned by the DSCL command in OSX terminal bash environment...
CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu
I have tried multiple regexs from this site and others with questions very close to what I wanted... mainly this question... I have tried following the advice to the best of my ability (I don't necessarily consider myself a newbie...but definitely a newbie to regex..)
DSCL returns a list of DNs, and I would like to only have First Last printed to a text file. I have attempted using sed, but I can't seem to get the correct function. I am open to other commands to parse the output. Every line begins with CN= and then there is a comma between Last and OU=.
Thank you very much for your help!
I think all of the regular expression answers provided so far are buggy, insofar as they do not properly handle quoted ',' characters in the common name. For example, consider a distinguishedName like:
CN=Doe\, John,CN=Users,DC=example,DC=local
Better to use a real library able to parse the components of a distinguishedName. If you're looking for something quick on the command line, try piping your DN to a command like this:
echo "CN=Doe\, John,CN=Users,DC=activedir,DC=local" | python -c 'import ldap; import sys; print(ldap.dn.explode_dn(sys.stdin.read().strip(), notypes=1)[0])'
(depends on having the python-ldap library installed). You could cook up something similar with PHP's built-in ldap_explode_dn() function.
Two cut commands is probably the simplest (although not necessarily the best):
DSCL | cut -d, -f1 | cut -d= -f2
First, split the output from DSCL on commas and print the first field ("CN=First Last"); then split that on equal signs and print the second field.
Using sed:
sed 's/^CN=\([^,]*\).*/\1/' input_file
^ matches start of line
CN= literal string match
\([^,]*\) everything until a comma
.* rest
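If backslash-escaped commas such as CN=Doe\, John are a concern, the same idea can be stretched a little. The following is only a sketch, assuming a sed that supports -E; `cn_from_dn` is a hypothetical helper name, and it does not attempt to handle double-quoted values:

```shell
# cn_from_dn is a hypothetical helper name.
# The capture takes either a backslash followed by any character (an escaped
# comma, for instance) or any character that is neither a comma nor a
# backslash, and stops at the first unescaped comma.
cn_from_dn() {
  sed -E 's/^CN=((\\.|[^,\\])*).*/\1/'
}

printf '%s\n' \
  'CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu' \
  'CN=Doe\, John,CN=Users,DC=example,DC=local' |
  cn_from_dn
```

The escaped comma is kept as-is in the output (backslash included), so strip the backslash afterwards if you only want the display name.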
http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
awk -v RS=',' -v FS='=' '$1=="CN"{print $2}' foo.txt
I like awk too, so I print the substring from the fourth char:
DSCL | awk -F, '{print substr($1,4)}' > filterednames.txt
This regex will parse a distinguished name, giving name and val capture groups for each match.
When DN strings contain commas, they are meant to be quoted. This regex correctly handles both quoted and unquoted strings, and also handles escaped quotes in quoted strings:
(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>"(?:[^"]|"")+"|[^,]+))+
Here it is nicely formatted:
(?:^|,\s?)
(?:
(?<name>[A-Z]+)=
(?<val>"(?:[^"]|"")+"|[^,]+)
)+
Here's a link so you can see it in action:
https://regex101.com/r/zfZX3f/2
If you want a regex to get only the CN, then this adapted version will do it:
(?:^|,\s?)(?:CN=(?<val>"(?:[^"]|"")+"|[^,]+))