grep based on regex not matching anything

I am trying to grep by regex but I am having trouble figuring out what is wrong with my regex. I couldn't find any bash regex testers out there, so this has been really hard to figure out.
Here is my regex
[0-9]*\.[0-9]*[G][:space:]*\.\/[bbg-sevent-test-][0-9]*
I am trying to match my regex to this piece of text
2.0G ./bbg-sevent-test-132^M
The command I am running is:
./kafka_prefill.sh | sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g" | grep '[0-9]*\.[0-9]*[G][:space:]*\.\/[bbg-sevent-test-][0-9]*' > data3.txt
What this does is run my script, translate/remove parts of my output, then grep based on regex and put it in the file data3.txt
I am currently getting this error:
grep: Invalid range end
** update ** thanks to Ed Plunkett
updated regex:
^[0-9]*\.[0-9]*[G][[:space:]]*\.\/bbg-sevent-test-[0-9]*$
My command no longer has a regex error. However nothing is matching. Here is a sample output:
********************************************************************************^M
This is a private computer system containing information that is proprietary^M
and confidential to the owner of the system. Only individuals or entities^M
authorized by the owner of the system are allowed to access or use the system.^M
Any unauthorized access or use of the system or information is strictly^M
prohibited.^M
^M
All violators will be prosecuted to the fullest extent permitted by law.^M
********************************************************************************^M
Last login: Tue Dec 29 16:43:23 2015 from 10.81.64.204^M^M
sudo bash^M
cd /data/kafka/tmp/kafka-logs/^M
du -kh . | egrep "bbg-sevent-test-*"^M
-bash: ulimit: open files: cannot modify limit: Operation not permitted^M
### Trinity env = prod ###^M
### Kafka Broker Id = 1 ###^M
### Kafka Broker must be started as root!! ###^M
exit^M
exit^M
### Trinity env = prod ###^M
### Kafka Broker Id = 1 ###^M
### Kafka Broker must be started as root!! ###^M
^[]0;root#ip-10-81-66-20:/home/ec2-user^G^[[?1034h[root#ip-10-81-66-20 ec2-user]# cd /data/kafka/tmp/kafka-logs/^M
^[]0;root#ip-10-81-66-20:/data/kafka/tmp/kafka-logs^G[root#ip-10-81-66-20 kafka-logs]# du -kh . | egrep "bbg-sevent-test-*"^M
2.2G ./bbg-sevent-test-439^M
2.2G ./bbg-sevent-test-638^M
2.2G ./bbg-sevent-test-679^M
2.2G ./bbg-sevent-test-159^M
I am only trying to match this bit
2.2G ./bbg-sevent-test-159

Why is this in square brackets?
[bbg-sevent-test-]
If you are matching that entire literal string, including brackets, escape them:
\[bbg-sevent-test-\]
If you're not matching the brackets as literal characters, leave them out:
bbg-sevent-test-
Looks to me like you don't really want them there. In a regex, text you match literally is just slapped in there as-is, no special syntax required except for escaping special characters like []*+?(). etc.
What you've got there is, syntactically, a bracket expression: it matches a single character out of the set, and a hyphen between two characters (g-s, for example) is read as a range. However, a character set is clearly not your intent.
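The updated pattern in the question still matches nothing. One plausible cause is the trailing carriage return, shown as ^M in the sample output: with $ anchoring the end of the line, the \r sits between the digits and the line end. The following is a minimal sketch, assuming the captured lines really end in \r\n as the sample suggests. It strips the carriage returns before grep sees them, and it uses + instead of * so that empty number fields do not match.

```shell
# One line as it appears in the sample output, with the trailing carriage return (^M).
sample=$(printf '2.2G ./bbg-sevent-test-159\r')

# tr -d '\r' removes the carriage return, so the $ anchor can match right after the digits.
matched=$(printf '%s\n' "$sample" | tr -d '\r' |
  grep -E '^[0-9]+\.[0-9]+G[[:space:]]+\./bbg-sevent-test-[0-9]+$')

echo "$matched"
```

In the original pipeline the tr step would go between sed and grep.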

Related

Remove end of file after matching regex keeping the expression matched in multiple files (sed?)

I'm cleaning up a lot of markdown files to import them into Pelican (a static website generator). While compiling I get errors about the date format in multiple files. What I need to do is leave the date (yyyy-mm-dd) and delete to the end of the line after it. This is the last try I've made with sed and RegEx:
sed -i "s/\(\d{4}-\d{2}-\d{2}\)\*/\1 /g" *.md
My hope was that sed would take the whole pattern within the parenthesis as 1 and then keep it as the substitution string.
This is an example of the errors (all numbers change):
ERROR: Could not process ./2010-12-28-the-open-internet-a-case-for-net-neutrality.html.md
| ValueError: '2010-12-28 21:22:00.000000000 +01:00 true' is not a valid date
ERROR: Could not process ./2011-05-27-two-one-must-read-internet-business-book.html.md
| ValueError: '2011-05-27 13:08:00.000000000 +02:00 true' is not a valid date
I've looked around SO but all I've found is about static strings, while mine change all the time.
Thanks for your help.
Please take care with these files: at least make a backup before using sed on them.
This can be done by giving the -i flag an extension: -i.bckup.
I am not sure whether you want to modify the content of the files or the file names themselves.
An expression that would only keep the date would be:
sed -r 's/([^-]*-[^-]*-[^- ]*).*/\1/'
I suspect your sed is not treating \d as a metacharacter meaning [0-9], so use [0-9] instead.
sed -i -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2}).*/\1/' *.md
Note:
# with the -r extended regex option you do not escape your pattern groupings ()
# no need for the /g option since you are removing everything after the first match
# .* is probably the wildcard you meant to use. * matches any number of the preceding pattern and . matches any single character.
Here is a command line test:
echo '2011-05-27 13:08:00.000000000 +02:00 true' | sed -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2}).*/\1/'
which outputs:
2011-05-27
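To combine the backup advice with the substitution, here is a minimal sketch that runs against a scratch file rather than real posts. The file name post.md and the "date: " prefix are assumptions made for illustration; only the date pattern and the -i.bckup suffix come from the answers above.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf 'date: 2011-05-27 13:08:00.000000000 +02:00 true\n' > "$workdir/post.md"

# -i.bckup edits the file in place and keeps the original as post.md.bckup.
sed -i.bckup -E 's/([0-9]{4}-[0-9]{2}-[0-9]{2}).*/\1/' "$workdir/post.md"

edited=$(cat "$workdir/post.md")
backup=$(cat "$workdir/post.md.bckup")
echo "edited: $edited"
echo "backup: $backup"
```

If the result looks wrong, the .bckup copy can simply be moved back over the edited file.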

Convert regex from Python format to GNU Sed format

I'm parsing a roughly 10GB log file, and need to feed it through sed to capture some output. The necessary capture segment based on what I would use in JavaScript is:
s/method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)""/"\1","\2","\3"/
Unfortunately sed (GNU sed 4.2.1, GnuWin32 edition) is struggling over the [^"]* ranges. It refuses to match them. I've tried variations of other acceptance blocks, with [a-zA-Z0-9:\\/.]* and similar variants but there seem to always be new characters inside the block that it misses, and really I can accept any valid character held between the quotes. With sed's * routine being a greedy implementation it tends to also have problems on the final "accept" item, pulling in all the other items on the log entry right up until the end.
I need to capture everything between the quotation marks and ignore the rest of the log entry.
I've been at this for two days for some stupid thing I could have implemented directly in python if there wasn't a requirement it be executed from a script with sed. Can any regex guru out there help?
EDIT:
For the extra information about examples, this produces no matches on my system, sed 4.2.1 from the GnuWin32.sourceforge.net collection: sed -r 's/method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)""/"\1","\2","\3"/' logfile
This produces matches for some entries: sed -r 's/^.*\method\=""([A-Z]*).*path=""([a-zA-Z0-9:\/]*).*accept=""(.*)"".*/"\1","\2","\3"/' logfile
Here are some (slightly redacted but not too much) lines:
"server-01/1.2.3.4 time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""GET"" path=""/ourapp/foo/bar/AAA-123:1029"" status=""200"" message=""OK"" duration=""7"" query=""cc=1463648"" content_type=""application/json"" referer=""https://example.org/somewhere"" from=""foo#bar.com"" ip=""1.2.3.4"" agent=""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"" req_header_accept=""application/json, text/javascript, application/sord+xml; q=0.01"" req_header_accept-language=""en-US,en;q=0.8"" req_header_x-request-id=""29/Oct/2014:05:59:59.968a-abc123ABC"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"
"server-01/1.2.3.4 time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""GET"" path=""/ourapp/foo/bar:AA9.1/ABC-123/record"" status=""200"" message=""OK"" duration=""73"" query=""view=includeFields"" content_type=""application/json"" from=""None"" ip=""1.2.3.4"" req_header_accept=""application/json"" req_header_x-request-id=""ab123-abc123-12345abc"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"
"server-01/1.2.3.4 time=""Wed Oct 29 05:59:59 GMT+00:00 2014"" method=""HEAD"" path=""/ourapp/foo/bar:AA3.4/ABC-123/meta"" status=""200"" message=""OK"" duration=""21"" content_type=""application/json"" from=""foo#bar.com"" ip=""1.2.3.4"" agent=""Java/1.7.0_25"" req_header_accept=""application/json"" req_header_accept-language=""en"" req_header_cache-control=""no-cache"" req_header_x-request-id=""29/Oct/2014:05:59:59.882va-af527A"" req_header_x-forward=""1.2.3.4"" req_header_x-forwarded-for=""1.2.3.4"" ","2014-10-28T23:59:59.000-0000","someapp-01.a",production,1,"/home/someapp/log/ourapp-access.log","ut01-splunkidx18.i"
The key to this problem turned out to be Windows shell interactions with the sed command. See the last section in this answer for details.
Demonstration under a Unix shell
As sample input consider:
$ cat file
some method=""this is my method"" more stuff path=""My Path"" accept=""Yes"" end of line
The following sed command processes that input:
$ sed -r 's/.*method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)"".*/"\1","\2","\3"/' file
"this is my method","My Path","Yes"
Note that the -r option is required so that unescaped parens act as grouping rather than literal characters.
Using the more complex input in the revised question:
$ sed -r 's/.*method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)"".*/"\1","\2","\3"/' input
"GET","/ourapp/foo/bar/AAA-123:1029","application/json, text/javascript, application/sord+xml; q=0.01"
"GET","/ourapp/foo/bar:AA9.1/ABC-123/record","application/json"
"HEAD","/ourapp/foo/bar:AA3.4/ABC-123/meta","application/json"
As regards the accept issue, I see two accept variables in the sample input:
req_header_accept
req_header_accept-language
Because the regex matches accept="", the former should be matched, not the latter.
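The point about the two accept variables can be checked on a shortened line. This is only a sketch: the sample line is a trimmed, hypothetical version of the log entries in the question, and the function name extract is made up for illustration.

```shell
# Trimmed log line with both req_header_accept and req_header_accept-language.
line='server time=""x"" method=""GET"" path=""/a/b"" status=""200"" req_header_accept=""application/json, text/javascript"" req_header_accept-language=""en-US"" tail'

# Same substitution as above, using -E (the portable spelling of -r).
extract() {
  sed -E 's/.*method=""([^"]*)"".*path=""([^"]*)"".*accept=""([^"]*)"".*/"\1","\2","\3"/'
}

result=$(printf '%s\n' "$line" | extract)
echo "$result"
```

Only req_header_accept contains the literal text accept="", so the third group picks up its value even though the preceding .* is greedy.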
Matching non-quotes
Consider the input:
$ cat test.txt
Billy "The Kid" Smith
Jimmy "The Fish" Stuart
Chuck "The Man" Norris
This sed command selects the quoted material:
$ sed -r 's/.*"([^"]*)".*/\1/' test.txt
The Kid
The Fish
The Man
All these tests were done on GNU sed version 4.2.1 under Linux.
Windows Shell Issues
The following are key points for making sed commands work on Windows:
Enclose sed commands in double quotes. Under the Windows shell, commands should be protected by double-quotes, not single quotes as Unix uses.
If a string needs to contain double-quotes, write them in hexadecimal coding as \x22.
Under Windows, an unquoted caret ^ is an escape character. This, however, does not affect us because, in our case, the ^ always appears inside a double-quoted string.
CygWin, if it is available, avoids Windows shell issues.
Thus, for the Billy The Kid input, try:
sed -r "s/.*\x22([^\x22]*)\x22.*/\1/" test.txt
Also, ^ is a Windows escape character but it reportedly only functions as such outside quotes. Thus, I left it as is in the above command.
For the full case, Bryan reports that the following works:
sed -r "s/^.*method\=\x22\x22([^\x22]*).*path=\x22\x22([^\x22]*).*req_header_accept=\x22\x22([^\x22]*).*$/\x22\1\x22,\x22\2\x22,\x22\3\x22/" logfile

Sed command on Linux only matches 1, where on Solaris multiple matches are found

Note: I'm a beginner with sed.
We use a sed command that scans the output of a ClearCase command and extracts the names of users with a view:
<clearcase output> | sed -n "/Development streams:/,// s/[^ ]* *Views: *\([^_ ]*\)_.*/\1/p"
(Example clearcase output:
Project: project:xx0.0_xx0000#/xxxx/xxxxx_xxxxxxxx0
Project's mastership: XXXXXXXXX#/xxxx/xxxxx_xxxxxxxx0
Project folder: folder:XX0.0#/xxxx/xxxxx_xxxxxxxx0 (RootFolder/Xxxxxxxx/XX0.0)
Modifiable components:component:00000000000#/xxxx/xxxxxx_xxxxxxxx0
component:xxxxxxx_xxxxxxx#/xxxx/xxxxxx_xxxxxxxx0
component:xxxxxxxxxxxxxx_x#/xxxx/xxxxxx_xxxxxxxx0
component:xxxxx#/xxxx/xxxxxx_xxxxxxxx0
component:xxxxxx_xxxxxxxxxxx#/xxxx/xxxxxx_xxxxxxxx0
Integration stream: stream:xx0.0_xx0000_integration#/xxxx/xxxxx_xxxxxxxx0 (FI: nemelis)
Integration views: olduser_xx0.0_xx0000_int - Properties: dynamic ucmview readwrite nshareable_dos
nemelis_xx0.0_xx0000_int - Properties: dynamic ucmview readwrite nshareable_dos
otheruser_xx0.0_xx0000_int - Properties: dynamic ucmview readwrite nshareable_dos
Development streams: stream:nemelis_xx0.0_xx0000#/xxxx/xxxxx_xxxxxxxx0 [unlocked] - No rebase or delivery is pending.
Views:nemelis_xx0.0_xx0000
stream:otheruser_xx0.0_xx0000_streamidentifier#/xxxx/xxxxx_xxxxxxxx0 [unlocked] - No rebase or delivery is pending.
Views:otheruser_xx0.0_xx0000_streamidentifier
)
On Solaris it will output:
nemelis
otheruser
But on (Redhat-)Linux only the first name is given.
(Note: I've looked on Stack Overflow and found comments that sed is always greedy on POSIX/GNU and that Perl should be used (see Non greedy regex matching in sed?). Thus I've tried to fix it with Perl, but then I ran into a forest of problems with the p at the end, using "//", "|", missing operator before < token >, etcetera, hence my post here.)
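Not sure what you're trying to achieve by specifying the address as //. In sed, an empty regex reuses the last regular expression that was applied, so it does not mean "until the end". You probably intend the range to end at the end of the file or at a blank line. Use $ as the address in the former case, and /^$/ in the latter.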
The following might work for you:
sed -n "/Development streams:/,$ s/[^ ]* *Views: *\([^_ ]*\)_.*/\1/p"
From the manual:
$
This address matches the last line of the last file of input, or
the last line of each file when the `-i' or `-s' options are
specified.
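A minimal sketch of the /start/,$ range on a trimmed copy of the example output. The sample lines are shortened from the question and the layout is an assumption, since the original indentation is not preserved.

```shell
# Shortened ClearCase-style output: one block before the range, two views inside it.
sample='Integration views: olduser_xx0.0_xx0000_int - Properties: dynamic
Development streams: stream:nemelis_xx0.0_xx0000 [unlocked]
Views:nemelis_xx0.0_xx0000
stream:otheruser_xx0.0_xx0000_streamidentifier [unlocked]
Views:otheruser_xx0.0_xx0000_streamidentifier'

# From the first "Development streams:" line to the last line ($),
# print the user name in front of the first underscore on every Views: line.
users=$(printf '%s\n' "$sample" |
  sed -n '/Development streams:/,$ s/[^ ]* *Views: *\([^_ ]*\)_.*/\1/p')

echo "$users"
```

The script is in single quotes here, so the shell leaves the $ address alone.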

Linux Bash Regular Expressions, retrieving data from SNMPGet Output

I've been working on getting a few simple monitoring tools running at home, and decided to be funny and retrieve the printer data along with everything else. The SNMP portion is working quite well, but I can't seem to properly parse the data that my snmpget command returns on Linux. The script I am currently using is as follows:
#!/usr/bin/env bash
# RegEx for Strings: "(.+?)"| -?\d+
RegExStr='"(.+?)"| -?\d+'
# ***
# Brother HL-2150N Printer
# ***
# Order Data: Toner Name, Toner Level, Drum Name, Drum Status, Total Pages Printed, Display Status
Input=$(snmpget -v 1 -c public 192.168.16.112 SNMPv2-SMI::mib-2.43.11.1.1.6.1.1 SNMPv2-SMI::mib-2.43.11.1.1.8.1.1 SNMPv2-SMI::mib-2.43.11.1.1.6.1.2 SNMPv2-SMI::mib-2.43.11.1.1.9.1.1 SNMPv2-SMI::mib-2.43.10.2.1.4.1.1 SNMPv2-SMI::mib-2.43.16.5.1.2.1.1 -m BROTHER-MIB)
Output1=( $(echo $Input | egrep -o $RegExStr) )
# Output
echo $Input
echo ${Output1[@]}
Which, oddly enough does not work. I'm fairly certain my regular expression ( "(.+?)" ) is correct, as I've tested it numerous times in various different syntax checkers and testers. It's supposed to select all the data that's between quotation marks ("").
Anyhow, the SNMPGET return is:
SNMPv2-SMI::mib-2.43.11.1.1.6.1.1 = STRING: "Black Toner Cartridge" SNMPv2-SMI::mib-2.43.11.1.1.8.1.1 = INTEGER: -2 SNMPv2-SMI::mib-2.43.11.1.1.6.1.2 = STRING: "Drum Unit" SNMPv2-SMI::mib-2.43.11.1.1.9.1.1 = INTEGER: -3 SNMPv2-SMI::mib-2.43.10.2.1.4.1.1 = Counter32: 13630 SNMPv2-SMI::mib-2.43.16.5.1.2.1.1 = STRING: "SLAAP "
I've tried various things myself, and using grep returns a blank string. To my understanding, grep does not support every regular expression construct by itself, so I started using egrep. While this returns SOMETHING, it is everything inside the original string divided by spaces, starting at the first quotation mark.
Is there anything I'm missing? I've looked around, and adjusted my methods a few times but never seemed to get a usable array in return.
Anyhow, I appreciate any help/pointers you'd be able to give me. I'd like to be able to get this running, even if just for fun and a good learning experience. Thank you in advance though! I'll be fidgeting on with it some more myself, but will check here every now and then.
From your output:
To get all strings:
grep -oP 'STRING: *"\K[^"]*'
Black Toner Cartridge
Drum Unit
SLAAP
To get all integers:
grep -oP '(INTEGER|Counter32): *\K[^ ]*'
-2
-3
13630
With awk you can do this:
awk 'NR%2==0' RS=\" <<< "$Input"
Black Toner Cartridge
Drum Unit
SLAAP
Or into a variable
Output1=$(awk 'NR%2==0' RS=\" <<< "$Input")
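Part of the trouble with the original attempt is probably that .+? and \d are Perl-style syntax: in the extended regex dialect used by egrep, ? after .+ does not make the match lazy, and \d is not a digit class. Unquoted expansions are also split on spaces. Here is a sketch that avoids both problems with plain grep -E. The input is a shortened copy of the snmpget output from the question, and the variable names strings and numbers are made up for illustration.

```shell
# Shortened copy of the snmpget output from the question.
Input='SNMPv2-SMI::mib-2.43.11.1.1.6.1.1 = STRING: "Black Toner Cartridge" SNMPv2-SMI::mib-2.43.11.1.1.8.1.1 = INTEGER: -2 SNMPv2-SMI::mib-2.43.11.1.1.6.1.2 = STRING: "Drum Unit" SNMPv2-SMI::mib-2.43.10.2.1.4.1.1 = Counter32: 13630'

# Quoted strings: [^"]* stops at the next quote, so no lazy quantifier is needed.
strings=$(printf '%s\n' "$Input" | grep -oE '"[^"]*"' | tr -d '"')

# Numbers: match the typed value first, then keep only the number at the end.
numbers=$(printf '%s\n' "$Input" |
  grep -oE '(INTEGER|Counter32): *-?[0-9]+' |
  grep -oE -e '-?[0-9]+$')

printf '%s\n' "$strings"
printf '%s\n' "$numbers"
```

Each variable holds one value per line. Quoting "$strings" when it is used keeps values such as Black Toner Cartridge from being split on spaces.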

Bash based regex domain name validation

I want to create a script that will add new domains to our DNS Servers.
I found a regex for fully qualified domain name validation.
However, when I use it with sed, it is not working as I would expect:
echo test | sed '/(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(:[a-zA-Z]{2,})$)/p'
--------
Output is:
test
echo test.com | sed '/(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(:[a-zA-Z]{2,})$)/p'
--------
Output is:
test.com
I expected that the output of the first command should be a blank line.
What do I do wrong?
I find this to be a more comprehensive regex:
(?=^.{4,253}$)(^(?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)+([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])$)
RFC 1034§3: Allows for a length of 4-253, with the shortest operational domain I'm aware of, "t.co", still matching where the other answers don't. 255 bytes is the maximum length, minus the length octet for each label (TLD and "primary" subdomain) gives us 253: (?=^.{4,253}$)
RFC 3696§2: Single-letter TLDs are technically permitted, meaning the minimum length would be 3, but as there are currently no single-letter TLDs a minimum length of 4 is practical.
RFC 1034§3: Allows numbers in subdomains, which Conor Clafferty's apparently doesn't (by not distinguishing other subdomains from "primary" subdomains -- i.e. the domain you register -- which the DNS spec doesn't)
RFC 1034§3: Restricts individual labels to 63 characters, permitting hyphens in the middle while restricting the beginning and end to alphanumerics: (?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)
Requires a two-letter or larger TLD, but may be punycoded ([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])
RFC 3696§2: The DNS spec technically permits numerics in the TLD, as well as single-letter TLDs; however, there are currently no single-letter TLDs or TLDs with numbers, and all-numeric TLDs are not permitted, so this part of the regex has been simplified to [a-zA-Z]{2,}.
--OR--
RFC 3490§5: an internationalized domain name ccTLD (IDN ccTLD) may be punycoded, as indicated by an "xn--" prefix, after which it may contain letters, numbers, or hyphens. This approximates to xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]
Be aware that this pattern does not validate a punycode TLD! Invalid punycode will be tolerated, e.g. "xn--qqqq", because attempting to validate punycode against the appropriate encoding mechanisms is beyond the scope of a regular expression. While punycode itself technically permits an encoded string ending in a hyphen, RFC 3492§5 observes and respects the IDNA limitation that labels may not end in a hyphen.
EDIT 02/2021: Hat tip to user2241415 for pointing out that IDN ccTLDs did not match the previously-specified regex.
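For tools without lookahead support, the same rules can be approximated by checking the length separately and giving the rest to grep -E. This is a sketch under that assumption, not a drop-in equivalent. The function name is_fqdn is hypothetical, and like the pattern above it does not validate punycode.

```shell
# Return 0 if the argument looks like a fully qualified domain name.
is_fqdn() {
  name=$1
  len=${#name}

  # Stands in for the (?=^.{4,253}$) lookahead.
  [ "$len" -ge 4 ] && [ "$len" -le 253 ] || return 1

  # Labels of 1-63 characters, alphanumeric at both ends, each followed by a dot,
  # then an alphabetic TLD of two or more letters or an xn-- TLD.
  printf '%s\n' "$name" | grep -Eq \
    '^([a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])$'
}

for candidate in test test.com t.co -bad.com; do
  if is_fqdn "$candidate"; then
    echo "$candidate: accepted"
  else
    echo "$candidate: rejected"
  fi
done
```

Splitting the length check out keeps the pattern in plain extended regex syntax, so no -P support is needed.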
You are missing a question mark in your regex :
(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)
You can test your regex here
You can do what you want with grep :
$ echo test.com | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'
test.com
$ echo test | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'
$
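As for why the sed commands in the question print their input either way: without -n, sed prints every line at the end of each cycle, and /regex/p only adds a second copy of the lines that match. A small sketch of the difference, using throwaway patterns rather than the domain regex:

```shell
# Without -n the line is printed even though the address never matches.
plain=$(printf 'test\n' | sed '/no-such-text/p')

# With -n only lines selected by the p command are printed.
quiet_miss=$(printf 'test\n' | sed -n '/no-such-text/p')
quiet_hit=$(printf 'test.com\n' | sed -n '/\.com$/p')

echo "plain: [$plain]"
echo "quiet_miss: [$quiet_miss]"
echo "quiet_hit: [$quiet_hit]"
```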
No sed implementation I am aware of supports the various Perl extensions you are using in that regex. Try with Perl or grep -P or pcregrep, or simplify the regex to something sed can cope with. Here is a quick and dirty adaptation which splits the regex into a script of three different regexes, and rejects when something fails to match (or matches, in the middlemost case).
echo 'test' | sed -r '/^.{5,254}$/!d
/^([^.]*\.)*[0-9]+\./d # Seems incorrect; 112.com is valid
/^([a-zA-Z0-9_\-]{1,63}\.?)+([a-zA-Z]{2,})$/!d' # should disallow underscore
# also, what's with the question mark after the literal dot?
This also completely fails to accept IDNA domains (which can contain dashes and numbers in the TLD, among other things) so I would definitely not recommend this, but hopefully it shows you how to adapt something like this to sed if you wish to.
Pierre-Louis' answer didn't quite work for me. e.g. "kittens" is considered a domain name.
I added one slight adjustment to ensure that the domain at least had a dot in it.
(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+\.(?:[a-z]{2,})$)
There's an extra \. just before it reads the last portion of the domain.
I use grep -P to do this.
echo test | grep -P "^[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+$"
--------
Output is:
echo www.test.com | grep -P "^[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+$"
--------
Output is: www.test.com
If the domain has to exist, you can try:
$ cat test.sh
#!/bin/bash
for h in "bert" "ernie" "www.google.com"
do
host "$h" > /dev/null 2>&1
if [ $? -eq 0 ]
then
echo "$h is a FQDN"
else
echo "$h is not a FQDN"
fi
done
jalderman#mba:/tmp$ ./test.sh
bert is not a FQDN
ernie is not a FQDN
www.google.com is a FQDN