Convert regex from Python format to GNU Sed format

I'm parsing a roughly 10GB log file, and need to feed it through sed to capture some output. The necessary capture segment based on what I would use in JavaScript is:
s/method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)""/"\1","\2","\3"/
Unfortunately sed (GNU sed 4.2.1, GnuWin32 edition) is struggling over the [^"]* ranges. It refuses to match them. I've tried variations of other acceptance blocks, with [a-zA-Z0-9:\\/.]* and similar variants but there seem to always be new characters inside the block that it misses, and really I can accept any valid character held between the quotes. With sed's * routine being a greedy implementation it tends to also have problems on the final "accept" item, pulling in all the other items on the log entry right up until the end.
I need to capture everything between the quotation marks and ignore the rest of the log entry.
I've been at this for two days for some stupid thing I could have implemented directly in python if there wasn't a requirement it be executed from a script with sed. Can any regex guru out there help?
EDIT:
For the extra information about examples, this produces no matches on my system, sed 4.2.1 from the GnuWin32.sourceforge.net collection: sed -r 's/method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)""/"\1","\2","\3"/' logfile
This produces matches for some entries: sed -r 's/^.*\method\=""([A-Z]*).*path=""([a-zA-Z0-9:\/]*).*accept=""(.*)"".*/"\1","\2","\3"/ logfile
Here are some (slightly redacted but not too much) lines:
"server-01/1.2.3.4 time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""GET"" path=""/ourapp/foo/bar/AAA-123:1029"" status=""200"" message=""OK"" duration=""7"" query=""cc=1463648"" content_type=""application/json"" referer=""https://example.org/somewhere"" from=""foo#bar.com"" ip=""1.2.3.4"" agent=""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"" req_header_accept=""application/json, text/javascript, application/sord+xml; q=0.01"" req_header_accept-language=""en-US,en;q=0.8"" req_header_x-request-id=""29/Oct/2014:05:59:59.968a-abc123ABC"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"
"server-01/1.2.3.4 time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""GET"" path=""/ourapp/foo/bar:AA9.1/ABC-123/record"" status=""200"" message=""OK"" duration=""73"" query=""view=includeFields"" content_type=""application/json"" from=""None"" ip=""1.2.3.4"" req_header_accept=""application/json"" req_header_x-request-id=""ab123-abc123-12345abc"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"
"server-01/1.2.3.4 time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""HEAD"" path=""/ourapp/foo/bar:AA3.4/ABC-123/meta"" status=""200"" message=""OK"" duration=""21"" content_type=""application/json"" from=""foo#bar.com"" ip=""1.2.3.4"" agent=""Java/1.7.0_25"" req_header_accept=""application/json"" req_header_accept-language=""en"" req_header_cache-control=""no-cache"" req_header_x-request-id=""29/Oct/2014:05:59:59.882va-af527A"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"

The key to this problem turned out to be the Windows shell's interaction with the sed command. See the last section in this answer for details.
Demonstration under a Unix shell
As sample input consider:
$ cat file
some method=""this is my method"" more stuff path=""My Path"" accept=""Yes"" end of line
The following sed command processes that input:
$ sed -r 's/.*method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)"".*/"\1","\2","\3"/' file
"this is my method","My Path","Yes"
Note that the -r option is required so that unescaped parens act as grouping rather than literal characters.
Using the more complex input in the revised question:
$ sed -r 's/.*method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)"".*/"\1","\2","\3"/' input
"GET","/ourapp/foo/bar/AAA-123:1029","application/json, text/javascript, application/sord+xml; q=0.01"
"GET","/ourapp/foo/bar:AA9.1/ABC-123/record","application/json"
"HEAD","/ourapp/foo/bar:AA3.4/ABC-123/meta","application/json"
As regards the accept issue, I see two accept variables in the sample input:
req_header_accept
req_header_accept-language
Because the regex matches accept="", the former should be matched, not the latter.
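A quick way to check this claim is to run the accept="" part of the expression on its own. The sketch below uses a shortened, invented log line (not taken from the real log) and -E, which GNU sed treats the same as -r:

```shell
# Shortened, invented log line that keeps the doubled-quote layout of the real entries
line='ip=""1.2.3.4"" req_header_accept=""application/json"" req_header_accept-language=""en""'

# accept="" occurs only at the end of req_header_accept;
# req_header_accept-language="" does not contain that exact text
accept=$(printf '%s\n' "$line" | sed -E 's/.*accept=""([^"]*)"".*/\1/')
echo "$accept"
```

This prints application/json, the value of req_header_accept.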
Matching non-quotes
Consider the input:
$ cat test.txt
Billy "The Kid" Smith
Jimmy "The Fish" Stuart
Chuck "The Man" Norris
This sed command selects the quoted material:
$ sed -r 's/.*"([^"]*)".*/\1/' test.txt
The Kid
The Fish
The Man
All these tests were done on GNU sed version 4.2.1 under Linux.
Windows Shell Issues
The following are key points for making sed commands work on Windows:
Enclose sed commands in double quotes. Under the Windows shell, commands should be protected by double-quotes, not single quotes as Unix uses.
If a string needs to contain double-quotes, write them in hexadecimal coding as \x22.
Under Windows, an unquoted caret ^ is an escape character. This, however, does not affect us because, in our case, the ^ always appears inside a double-quoted string.
CygWin, if it is available, avoids Windows shell issues.
Thus, for the Billy The Kid input, try:
sed -r "s/.*\x22([^\x22]*)\x22.*/\1/" test.txt
Also, ^ is a Windows escape character but it reportedly only functions as such outside quotes. Thus, I left it as is in the above command.
For the full case, Bryan reports that the following works:
sed -r "s/^.*method\=\x22\x22([^\x22]*).*path=\x22\x22([^\x22]*).*req_header_accept=\x22\x22([^\x22]*).*$/\x22\1\x22,\x22\2\x22,\x22\3\x22/" logfile
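The \x22 spelling is not tied to Windows. Assuming GNU sed (the \xHH escape is a GNU extension), the double-quoted Billy The Kid command can also be tried under a Unix shell, which leaves \x22 and \1 untouched inside double quotes. A minimal sketch:

```shell
# Same double-quoted command as the Windows variant; needs GNU sed for \x22
kid=$(printf '%s\n' 'Billy "The Kid" Smith' | sed -r "s/.*\x22([^\x22]*)\x22.*/\1/")
echo "$kid"
```

With GNU sed this should print The Kid, the same result as the single-quoted Unix version.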

Related

GNU sed regex to fix mySQL db inserts for SQLite

I am trying to translate a huge mySQL database dump file from mySQL syntax into SQLite syntax.
At https://regex101.com/ I have successfully created an ECMAScript flavor regex to turn something like:
,'foo\'s bar!',
into:
,"foo\'s bar!"
with this regular expression:
/,'([^']+)\\'([^']+)',/"$1\\'$2"/g
testing against this short file:
(1058,'gpl5q0x51349lmdq3e0ijm4k9b6n','Henry\'s_1.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33854,'mUVk0/XGX+afIpkrqBm7LQ==','2021-01-06 03:07:23'),
(1059,'xzj8mivsenkakkrurfjytxjsaj1h','Henry\'s_2.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33555,'KfRYqfAWtSIYXZ6oQZyYbA==','2021-01-06 03:07:23'),
Resulting in:
(1058,'gpl5q0x51349lmdq3e0ijm4k9b6n'"Henry\'s_1.csv"'text/csv','{\"identified\":true,\"analyzed\":true}',33854,'mUVk0/XGX+afIpkrqBm7LQ==','2021-01-06 03:07:23'),
(1059,'xzj8mivsenkakkrurfjytxjsaj1h'"Henry\'s_2.csv"'text/csv','{\"identified\":true,\"analyzed\":true}',33555,'KfRYqfAWtSIYXZ6oQZyYbA==','2021-01-06 03:07:23'),
but for the life of me I cannot translate this into a GNU sed flavor regex.
For example, this command does not make any substitutions in the output:
sed -r s/,'([^']+)\\'([^']+)',/"$1\\'$2"/g <test.sql
...
sed -r s/,'([^']+)\\'([^']+)',/"\1\\'\2"/g <test.sql: doesn't work either.
I have looked for a regex tool online that translates between different flavors of regex but cannot find one that works on GNU sed (shipped with GIT: sed (GNU sed) 4.8). PCRE seems to be close to what sed has but that doesn't work. I tried perl as well, no luck.
Anyone know a regex expression that works or a translator tool that works?
I am just about ready to write a nodejs program to do this for me.
Also, for extra credit, how can I write a sed script to handle any number of escaped quotes within a quoted string? I have that issue to deal with as well in my DB dump file.
Examples:
'foo\'-bar' // one instance
'foo\'and\'bar' // two instances
'foo\'and\'bar\'s on the deck' // three instances
and so on...
Thanks!

You can use
sed -E "s/,'([^']+)\\\\'([^']+)',/"'"'"\\1\\\\'\\2"'"'/g test.sql
The "s/,'([^']+)\\\\'([^']+)',/"'"'"\\1\\\\'\\2"'"'/g consists of
"s/,'([^']+)\\\\'([^']+)',/" - a s/,'([^']+)\\'([^']+)',/ part (inside double quotes, so backslashes need doubling)
'"' - a " char (inside single quotes)
"\\1\\\\'\\2" - \1\\'\2 pattern (inside double quotes, so backslashes are doubled)
'"' - a " char (inside single quotes)
/g - the global flag (no quoting needed here).
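To see what sed actually receives once the shell has processed those quoted pieces, the argument can be printed instead of executed. This is only a sketch for a POSIX-style shell such as bash; script is a variable name chosen here for illustration:

```shell
# Concatenate the same quoted pieces and store the result instead of passing it to sed
script="s/,'([^']+)\\\\'([^']+)',/"'"'"\\1\\\\'\\2"'"'/g

# printf '%s\n' prints its argument verbatim, without interpreting backslashes
printf '%s\n' "$script"
```

The printed text is s/,'([^']+)\\'([^']+)',/"\1\\'\2"/g, which is the expression from the question with \1 and \2 in place of $1 and $2.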

First look at your command
sed -r s/,'([^']+)\\'([^']+)',/"\1\\'\2"/g test.sql
I prefer writing the whole sed command in single quotes. When you need a single quote, you must close the string ('), use an escaped single quote (\') and open the next string with a ', all joined: '\''.
I also added two , characters.
sed -r 's/,'\''([^'\'']+)\\'\''([^'\'']+)'\'',/,"\1\\'\''\2",/g' test.sql
# Shorter
sed -r 's/,'\''([^'\'']+\\'\''[^'\'']+)'\'',/,"\1",/g' test.sql
# Using another way to write the single quotes, with the hex notation
sed -r 's/,\x27([^\x27]+\\\x27[^\x27]+)\x27,/,"\1",/g' test.sql
This works for simple cases, not for 'foo\'and\'bar\'s on the deck'.
I think you want to replace the quotes in the simple fields too.
Suppose you want to transform
(1058,'gpl5q0x51349lmdq3e0ijm4k9b6n','Henry\'s_1.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33854,'mUVk0/XGX+afIpkrqBm7LQ==','2021-01-06 03:07:23'),
(1059,'xzj8mivsenkakkrurfjytxjsaj1h','Henry\'s_2.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33555,'KfRYqfAWtSIYXZ6oQZyYbA==','2021-01-06 03:07:23'),
(2000,'extra credit from question','foo\'and\'bar\'s on the deck','text/csv','{\"identified\":true,\"analyzed\":true}',33999,'KgSBFstbdthdsssssstvbA==','2022-01-02 13:07:23'),
into
(1058,"gpl5q0x51349lmdq3e0ijm4k9b6n","Henry\'s_1.csv","text/csv","{\"identified\":true,\"analyzed\":true}",33854,"mUVk0/XGX+afIpkrqBm7LQ==","2021-01-06 03:07:23"),
(1059,"xzj8mivsenkakkrurfjytxjsaj1h","Henry\'s_2.csv","text/csv","{\"identified\":true,\"analyzed\":true}",33555,"KfRYqfAWtSIYXZ6oQZyYbA==","2021-01-06 03:07:23"),
(2000,"extra credit from question","foo\'and\'bar\'s on the deck","text/csv","{\"identified\":true,\"analyzed\":true}",33999,"KgSBFstbdthdsssssstvbA==","2022-01-02 13:07:23"),
In this answer I don't use the '\'' but the hexadecimal notation \x27.
First "backup" the \' combinations (replace them by an unused character like \r), replace all normal quotes by double quotes and "restore the backup" (change back the \r).
sed 's/\\\x27/\r/g; s/\x27/"/g; s/\r/\\\x27/g' test.sql
# or hex value for double quote "
sed 's/\\\x27/\r/g; s/\x27/\x22/g; s/\r/\\\x27/g' test.sql

Multiline sed regex extraction issue: part of buffer matches

I have to extract data from a log, and I'm trying to use sed to extract the data from 3 lines. The log entries (after grepping) look like this:
Tuesday March 11 2014
INBOUND>>>>> 06:22:10:066 Eventid:141004(3)
[SGW-S11/S4]GTPv2C Rx PDU, from 172.9.9.1:10000 to 173.10.10.1:2123 (187)
TEID: 0x00000000, Message type: EGTP_CREATE_SESSION_REQUEST (0x20)
I need to extract the "from IP", the "to IP", and the "Message Type".
This is what I have as of now:
sed -n '1!N; s/^INBOUND>>>>>.*\n.*from \([0-9.]*\).* to \([0-9.]*\).*/\1 \2/p'
When I extend it to the third line, to extract the message type, with:
sed -n '1!N; s/^INBOUND>>>>>.*\n.*from \([0-9.]*\).* to \([0-9.]*\).*\n.*, Message type: \([A-Z_]*\).*/\1 \2/p'
The entire pattern doesn't match.
This doesn't match the string unless there is a line before the INBOUND>>>>> string, which I think should match, since the ^ indicates the start of line. (This isn't really a problem since there is a datestamp, just a curiosity)
Bash Version: GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Sed Version: GNU sed version 4.1.5
Could you please give me any pointers on this? Thanks in advance.
P.S. The IPs can be IPv4 or IPv6, but I will change the IP regex once this problem's solved.
P.P.S. I need to use a regex i.e. not awk, because there will be other patterns too; this is the first, and I'm having problems :(

Your entire pattern
sed -n '1!N; s/^INBOUND>>>>>.*\n.*from \([0-9.]*\).* to \([0-9.]*\).*\n.*, Message type:\([A-Z_]*\).*/\1 \2/p'
can't match because you're missing a space between Message type: and \([A-Z_]*\)
Are you sure there are no hidden characters before INBOUND (when you omit the first line)?
This one works for me:
sed -r 's/.*from ([0-9.:]*) to ([0-9.:]*).*Message type: ([A-Z_]*).*/\1 \2 \3/'
(note that I used the -r flag so I won't have to escape the brackets)
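Another thing worth checking is how many lines are in the pattern space. 1!N appends only one line, so sed never holds more than two lines at a time and a pattern containing two \n cannot match. It would also explain why an INBOUND>>>>> entry on the very first line is skipped, since N is not run there. A minimal sketch, assuming every INBOUND>>>>> line is followed by the two lines shown in the question, appends two lines explicitly:

```shell
log='Tuesday March 11 2014
INBOUND>>>>> 06:22:10:066 Eventid:141004(3)
[SGW-S11/S4]GTPv2C Rx PDU, from 172.9.9.1:10000 to 173.10.10.1:2123 (187)
TEID: 0x00000000, Message type: EGTP_CREATE_SESSION_REQUEST (0x20)'

# On each INBOUND line, pull the next two lines into the pattern space (N, N),
# then match across both embedded newlines and print only when the match succeeds.
fields=$(printf '%s\n' "$log" | sed -n '/^INBOUND>>>>>/{
N
N
s/^INBOUND>>>>>.*\n.*from \([0-9.]*\).* to \([0-9.]*\).*\n.*Message type: \([A-Z_]*\).*/\1 \2 \3/p
}')
echo "$fields"
```

For the sample entry this prints 172.9.9.1 173.10.10.1 EGTP_CREATE_SESSION_REQUEST.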

You can use awk and no regex:
awk -F" |:" '/^INBOUND/ {getline;print $5 RS $8;getline;print $7}' file
172.9.9.1
173.10.10.1
EGTP_CREATE_SESSION_REQUEST
You say this is data output from a grep; that step could probably be folded into the awk command.
Show us all the data and the output you want, and we can help.
awk -F" |:" '/^INBOUND/ {getline;printf "%s %s",$5,$8;getline;print "",$7}' file
172.9.9.1 173.10.10.1 EGTP_CREATE_SESSION_REQUEST

Understanding a sed example

I found a solution for extracting the password from a Mac OS X Keychain item. It uses sed to get the password from the security command:
security 2>&1 >/dev/null find-generic-password -ga $USER | \
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
The code is here in a comment by 'sr105'. The part before the | evaluates to password: "secret". I'm trying to figure out exactly how the sed command works. Here are some thoughts:
I understand the flags -En, but what are the commas doing in this example? In the sed docs it says a comma separates an address range, but there's 3 commas.
The first 'address' /^password: / has a trailing s; in the docs s is only mentioned as the replace command like s/pattern/replacement/. Not the case here.
The ^password: "(.*)"$ part looks like the Regex for isolating secret, but it's not delimited.
I can understand the end part where the back-reference \1 is printed out, but again, what are the commas doing there??
Note that I'm not interested in an easier alternative to this sed example. This will only be part of a larger bash script which will include some more sed parsing in an .htaccess file, so I'd really like to learn the syntax even if it is obscure.
Thanks for your help!

Here is sed command:
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
Commas are used as the regex delimiter here; it could just as well be another character such as #:
sed -En '/^password: / s#^password: "(.*)"$#\1#p'
/^password: / finds an input line that starts with password:
s#^password: "(.*)"$#\1#p finds and captures double-quoted string after password: and replaces the entire line with the captured string \1 ( so all that remains is the password )
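The equivalence of the delimiters is easy to verify with a made-up input line (the real output of security is not needed for this). A small sketch:

```shell
sample='password: "secret"'

# The same substitution written with three different delimiters
with_comma=$(printf '%s\n' "$sample" | sed -En '/^password: / s,^password: "(.*)"$,\1,p')
with_hash=$(printf '%s\n' "$sample" | sed -En '/^password: / s#^password: "(.*)"$#\1#p')
with_slash=$(printf '%s\n' "$sample" | sed -En '/^password: / s/^password: "(.*)"$/\1/p')

echo "$with_comma $with_hash $with_slash"
```

All three variants print secret; whichever character follows the s becomes the delimiter for that command.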

First, the command extracts passwords from a file (or stream) and prints them to stdout.
While you "normally" might execute a sed command on all lines of a file, sed lets you specify a regex address that selects which lines the following command is applied to.
In your case
/^password: /
is a regex, saying that the command:
s,^password: "(.*)"$,\1,p
should get executed for all lines starting with password: , such as password: "secret". The command replaces each such line with the password itself, and because -n is combined with the p flag, only those lines are printed.
The substitute command might look unusual, but you can choose the delimiter of a sed s command; it is not limited to /. In this case , was chosen.

Bash: Change "Title" of postscript file

I have a postscript file where I'd like to change the "Title" attribute before generating a pdf from it.
Following the beginning of the file:
%!PS-Adobe-3.0
%%BoundingBox: 0 0 595 842
%%HiResBoundingBox: 0 0 595 842
%%Title: GMT v5.1.1_r12693 [64-bit] Document from pscoast
%%Creator: GMT5
[…]
I now match the line %%Title: GMT v5.1.1_r12693 [64-bit] Document from pscoast with ^%%Title:\s.* and would like to replace everything after the colon with the content of a variable.
My non-working code so far:
sed "s/\(^%%Title:\)\s.*$/\1 $title/g" test_file.ps
My sed knowledge is very limited and my experimentation didn't yield anything useful so far - your help will be greatly appreciated.
All the best, Chris
EDIT: added my non-working code

One of the tricks for getting sed to work correctly is getting the shell quoting right. This creates a postscript file with the new title:
newtitle="Shiny New Title"
sed 's/^%%Title:.*/%%Title: '"$newtitle/" sample.ps >new.ps
This updates the postscript in place:
newtitle="Shiny New Title"
sed -i 's/^%%Title:.*/%%Title: '"$newtitle/" sample.ps
Many of the characters that one uses in sed expressions, like $, (, or *, are shell-active. To protect them from possible shell expansion, they should be in single-quotes. However, because one wants the shell to expand the $newtitle variable, it cannot be in single-quotes. Thus, if you look carefully, you will see that the above substitute expression is in two parts, one single-quoted and one double-quoted. Adding a space between them to make it clearer:
's/^%%Title:.*/%%Title: ' "$newtitle/" # Do not use this form.
Thus, the shell-active characters are protected by single-quotes and only the parts that we want the shell to mess with are in double-quotes.
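One caveat with splicing a variable into the expression: if the title itself contains /, & or \, sed will read those characters as part of the s command. A possible workaround, sketched here with the made-up variable name escaped, is to backslash-escape them first:

```shell
newtitle='Maps & Charts 1/2'

# Put a backslash in front of \, / and & so sed treats them literally in the replacement
escaped=$(printf '%s\n' "$newtitle" | sed 's|[\\/&]|\\&|g')

# Stand-in for one line of the PostScript file
result=$(printf '%s\n' '%%Title: old title' | sed 's/^%%Title:.*/%%Title: '"$escaped/")
echo "$result"
```

The title line comes out as %%Title: Maps & Charts 1/2. Titles without any of these characters pass through the escaping step unchanged.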

Maybe this is what you're looking for:
myvar="some content"
sed -e "s/^\(%%Title:\).*/\1 $myvar/" < inputfile
# output
...
%%Title: some content
...

egrep regular expression works within PHP, but doesn't work at unix shell - escaping issues?

I think my problem has something to do with escaping differences between using a regex within PHP versus using it at Bash commandline.
Here is my regex that is working in PHP:
$emailregex = '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$';
So I try giving the following at commandline and it doesn't seem to match anything.
(where emails.txt is a long plain text file with thousands of (possibly badly-formed) email addresses, one per line).
[root@host dir]# egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$' emails.txt
I have tried surrounding the regex with double-quotemarks instead of single-quotemarks, but it made no difference.
Do I need to add some backslashes into the regex?
SOLVED! Thank you!
My file was created in Windows and extra CR in the END-OF-LINE markers did not agree with the dollar sign in the regex.

Single quotes should work with bash...
It works for me with this simple case:
echo test@test.com | egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$'
In your text file, the line has to only contain the email address. Any additional spaces on the line will throw it off. For example this doesn't print anything:
echo " test@test.com" | egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$'
Your problem might be that you have a dos formatted file. In that case the extra \r will make it so that the regex doesn't match since it will think there's an extra character at the end of the line. You can run dos2unix against it, or make your regex less restrictive by removing the beginning and end markers from your regex:
egrep '[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})'
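The effect of a trailing carriage return on $ can be reproduced without any file. This sketch uses a deliberately simple pattern rather than the full email regex, and tr -d '\r' as a stand-in for dos2unix (an assumption, in case dos2unix is not installed):

```shell
# grep -c prints the number of matching lines;
# "|| true" keeps a zero count from aborting scripts that run under set -e

# DOS-style line: the CR sits between the last letter and the newline
with_cr=$(printf 'hello\r\n' | grep -Ec '^[a-z]+$' || true)

# Same line with the CR removed before matching
stripped=$(printf 'hello\r\n' | tr -d '\r' | grep -Ec '^[a-z]+$' || true)

echo "matches with CR: $with_cr, after stripping CR: $stripped"
```

The first count is 0 and the second is 1: the anchored pattern only matches once the carriage return is gone.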

Works for me:
JPP-MacBookPro-4:tmp jpp$ cat emails.txt
aa@bb.com
bb@cc.com
not an email
cc@dd.ee.ff
JPP-MacBookPro-4:tmp jpp$ egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$' emails.txt
aa@bb.com
bb@cc.com
cc@dd.ee.ff
JPP-MacBookPro-4:tmp jpp$
Beware trailing whitespace, tabs, and carriage returns; they have a way of biting regexes.
There is a great ref on shell quoting here http://www.mpi-inf.mpg.de/~uwe/lehre/unixffb/quoting-guide.html
