Match unknown substring with RegEx - regex

How can I get an unknown substring with a regular expression? I know what's before and after the wanted string, but I don't want the known parts in the result.
Example text:
jhgjgjgvocher_SOMETHINGHERE.dbhjjkghjkg
vocher_SOMETHINGELSE.db
I'm looking for 'SOMETHINGHERE' and 'SOMETHINGELSE' only.
vocher_ and .db are always before and after the relevant part but should not be in the result.
A working solution is:
cat test | egrep -o "vocher_.*\.db" | cut -d "_" -f2 | cut -d "." -f1
… but you know it's ugly.
Is it possible to search exactly for an unknown part with regex (in this case only the .* part), or do I need to use something like sed? Is there a better solution?

A simple solution using perl is the following:
perl -ne 'if (/vocher_(.*)\.db/){ print "$1\n";}' test_file.txt
This iterates line-by-line over the file and only prints the desired portion.

Use the following grep approach:
grep -Po '(?<=vocher_).+(?=\.db)' test
-P - allows Perl regular expressions
-o - prints only matched substrings
The output will be like below:
SOMETHINGHERE
SOMETHINGELSE
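Not every grep supports -P (BSD grep, for example, does not), and sed can do the same extraction. A minimal sketch, assuming the wanted part never contains a dot; the variable names sample and extracted are only for illustration:

```shell
# Sample input from the question, held in a variable so the sketch is self-contained.
sample='jhgjgjgvocher_SOMETHINGHERE.dbhjjkghjkg
vocher_SOMETHINGELSE.db'

# -n suppresses sed's default output; the trailing "p" prints a line only when
# the substitution succeeded. [^.]* stops at the first dot (assumption: the
# wanted part contains no dot), and \1 keeps just that captured part.
extracted=$(printf '%s\n' "$sample" | sed -n 's/.*vocher_\([^.]*\)\.db.*/\1/p')

printf '%s\n' "$extracted"
```

Lines without vocher_….db produce no output at all, which matches the behaviour of grep -o.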

Related

How to get the release value?

I have files with names in the following formats:
rzp-QAQ_SA2-5.12.0.38-quality.zip
rzp-TEST-5.12.0.38-quality.zip
rzp-ASQ_TFC-5.12.0.38-quality.zip
I want the value as: 5.12.0.38-quality.zip from the above file names.
I'm trying the following, but it doesn't give the correct value:
echo "$fl_name" | sed 's#^[-[:alpha:]_[:digit:]]*##'
fl_name is the variable containing the file name.
Thanks a lot in advance!
You are matching too much because alpha, digit, - and _ are all in the same character class, so the leading digit of the version is consumed as well.
Instead, you can match alpha and -, optionally followed by _ and alphanumerics:
sed -E 's#^[-[:alpha:]]+(_[[:alnum:]]*-)?##' file
Or you can shorten the first character class, and match a - at the end:
sed -E 's#^[-[:alnum:]_]*-##' file
Output of both examples
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
With GNU grep you could try the following code. Written and tested with the shown samples.
grep -oP '(.*?-){2}\K.*' Input_file
Or, as an alternative, use a non-capturing group (as per the fourth bird's nice suggestion):
grep -oP '(?:[^-]*-){2}\K.*' Input_file
Explanation: this uses GNU grep. The -o option prints only the matched part of each line, and -P enables the PCRE flavor. The regex (.*?-){2} lazily matches up to and including a hyphen, twice, so it consumes everything through the second hyphen. \K then drops what has been matched so far from the reported match, so only the text matched by the trailing .* (the rest of the line) is printed.
It is much easier to use cut here:
cut -d- -f3- file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
If you want sed then use:
sed -E 's/^([^-]*-){2}//' file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
Assumptions:
all filenames contain 3 hyphens (-)
the desired result always consists of stripping off the 1st two hyphen-delimited strings
OP wants to perform this operation on a variable
We can eliminate the overhead of sub-process calls (eg, grep, cut and sed) by using parameter substitution:
$ f1_name='rzp-ASQ_TFC-5.12.0.38-quality.zip'
$ new_f1_name="${f1_name#*-}" # strip off first hyphen-delimited string
$ echo "${new_f1_name}"
ASQ_TFC-5.12.0.38-quality.zip
$ new_f1_name="${new_f1_name#*-}" # strip off next hyphen-delimited string
$ echo "${new_f1_name}"
5.12.0.38-quality.zip
On the other hand if OP is feeding a list of file names to a looping construct, and the original file names are not needed, it may be easier to perform a bulk operation on the list of file names before processing by the loop, eg:
while read -r new_f1_name
do
... process "${new_f1_name}"
done < <( command-that-generates-list-of-file-names | cut -d- -f3-)
In plain bash:
echo "${fl_name#*-*-}"
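The one-line expansion can be checked against all three sample names. A small sketch, with strip_two_fields as a hypothetical helper name; it relies only on POSIX parameter expansion, so it should behave the same in sh and bash:

```shell
# Remove the shortest leading part that matches the glob "*-*-",
# i.e. everything through the second hyphen.
strip_two_fields() {
    printf '%s\n' "${1#*-*-}"
}

# The three sample names from the question.
for fl_name in \
    'rzp-QAQ_SA2-5.12.0.38-quality.zip' \
    'rzp-TEST-5.12.0.38-quality.zip' \
    'rzp-ASQ_TFC-5.12.0.38-quality.zip'
do
    strip_two_fields "$fl_name"
done
```

A name with fewer than two hyphens does not match the glob and is returned unchanged, so it may be worth checking for that case before relying on the result.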
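You can reverse the name, take the first two "-"-separated fields (the last two of the original), and reverse again. Since fl_name holds the name itself rather than a file to read, use echo instead of cat:
echo "$fl_name" | rev | cut -d'-' -f1,2 | rev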
A Perl solution capturing the digits and characters that follow a '-':
cat f_name | perl -lne 'chomp; /.*?-(\d+.*?)\z/g;print $1'

grep regex for return numerical values in between string

I'm running the Linux command
ls /etc/systemd/system | grep -o -E "[0-9]+"
which should return just the numerical values. The problem is that it also returns unwanted numbers from other parts of the names. I only want the numerical value between - and .service, so for test-blah4-1321.service it should return just 1321. What am I missing here?
example
$ ls /etc/systemd/system
test.service test-blah4-1321.service test-blah2.service test-blah5-1387.service test-blah3-1521.service
GNU grep has the -P option for perl-style regexes, and the -o option to print only what matches the pattern. These can be combined using look-around assertions (described under Extended Patterns in the perlre manpage) to remove part of the grep pattern from what is determined to have matched for the purposes of -o.
Source
Applied to your example this would be:
echo test-blah4-1321.service | grep -oP '(?<=-)\d+(?=\.service)'
When I need look-ahead or look-behind tests I usually switch to a Perl one-liner.
This should do the trick (it prints only when the pattern matches, so non-matching lines do not produce empty output):
echo test-blah4-1321.service | perl -ne 'print "$1\n" if /(?<=-)(\d+)(?=\.service)/'
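If neither grep -P nor Perl is available, the same filter can be sketched with sed. This assumes the number is always the last hyphen-separated field before .service; names and ids are illustrative variables, with the names taken from the listing in the question:

```shell
# Service names from the question, one per line.
names='test.service
test-blah4-1321.service
test-blah2.service
test-blah5-1387.service
test-blah3-1521.service'

# Keep only names that end in -<digits>.service and print just the digits.
# [0-9][0-9]* is the basic-regex spelling of "one or more digits".
ids=$(printf '%s\n' "$names" | sed -n 's/.*-\([0-9][0-9]*\)\.service$/\1/p')

printf '%s\n' "$ids"
```

Names such as test-blah2.service are skipped because the text after the last hyphen is not purely numeric.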

grep strings between "{{_(" and ")}}"

I want to parse html files to extract strings between "{{_(" and ")}}" using GREP. I tried something like this:
grep '"[^{{_(|)}}$]"' *.html
but it didn't work.
Can someone help me please?
Thanks!
You may use
grep -oP '(?<={{_\().+?(?=\)}})' file
Details
-o - output only matched substrings
-P - enable the PCRE regex engine
(?<={{_\().+?(?=\)}}) match:
(?<={{_\() - a location that is immediately preceded with {{_(
.+? - any 1 or more chars other than line break chars, as few as possible
(?=\)}}) - a location that is immediately followed with )}}.
@Wiktor Stribiżew's answer works really well. However, if you have multiple files, the output also shows the file name for each match, like this:
foo.html: content abc
foo.html: test 123
bar.html: first match
bar.html: second match
So, if you are only interested in the matching string as output, you can try sed instead
sed -n 's/.*{{_(\(.*\))}}.*/\1/p' *.html
You can also count the unique occurrence of matches and things like that...
Update:
Or just use the -h (--no-filename) option with the grep command that @Wiktor Stribiżew provided:
grep -h -oP '(?<={{_\().+?(?=\)}})' *.html
Or use the -c flag to display a count per file. Note that -c counts matching lines, not individual matches, so a line with two matches is counted once:
grep -c -oP '(?<={{_\().+?(?=\)}})' *.html
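The sed command above reports at most one match per line, because its leading greedy .* skips ahead to the last occurrence. A sketch that handles several matches per line without -P, assuming the extracted text never contains a closing parenthesis; the sample line and the variable names are made up for illustration:

```shell
# Hypothetical template line with two translatable strings.
line='<p>{{_("Hello")}} and {{_("World")}}</p>'

# Step 1: grep -o prints every {{_( ... )}} occurrence on its own line.
# Step 2: sed strips the leading {{_( and the trailing )}} from each one.
strings=$(printf '%s\n' "$line" |
    grep -oE '\{\{_\([^)]*\)\}\}' |
    sed -e 's/^{{_(//' -e 's/)}}$//')

printf '%s\n' "$strings"
```

When reading real files, replace the printf with grep -ohE '…' *.html so that file names are not prefixed to the matches.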
As in the previous answers, the same approach can be used to grep the value of an HTML attribute:
placeholder="SOME TEXT_HERE" -> grep -> "SOME TEXT_HERE"
grep -oP '(?<=placeholder=").+?(?=")' *html

Regular expression in perl does not work as expected

I have a simple bash script that uses a line of perl code + regex to extract the necessary piece of string. It looks like
ANSWER=$(host $IPW 2>/dev/null | perl -p -e 's#^.+\s\b([a-zA-Z]{4,8}\d{1,3})(?=-\d\.).+$#\1#;')
It works for the most part, but produces unexpected matches from time to time. Example:
$ echo "Host 31.201.188.199.in-addr.arpa. not found: 3(NXDOMAIN)" | perl -p -e 's#^.+?\s\b([a-zA-Z]{4,8}\d{1,3})(?=-\d\.).+?(?=\.$)#\1#;'
Host 31.201.188.199.in-addr.arpa. not found: 3(NXDOMAIN)
The regex is supposed to match parts of the string like "server100" (letters + digits) and return the corresponding part. Is there something I am missing or don't understand yet? (Sorry for bothering.)
Your regex doesn't match, so no substitution is made. The line is therefore printed as is.
If you don't want to print when there is no match, you can use -n instead of -p, plus "and print" after the substitution, so the line is printed only when the substitution succeeds:
echo "Host 31.201.188.199.in-addr.arpa. not found: 3(NXDOMAIN)" |
perl -n -e 's#^.+?\s\b([a-zA-Z]{4,8}\d{1,3})(?=-\d\.).+?(?=\.$)#\1# and print'
I assume the sample text that you show shouldn't be printed at all?
I suggest that you use a simple match instead of a substitution. I've also removed the superfluous parts of your regex pattern:
perl -lne 'print $1 if /.*\s([a-z]{4,8}\d{1,3})(?=-\d\.)/i'

Having trouble with GREP and REGEX

I have a text file that stores combinations of four numbers in the following format:
Num1,Num2,Num3,Num4
Num5,Num6,Num7,Num8
.............
I have a whole bunch of such files and what I need is to grep for all filenames that contains the pattern described above.
I constructed my grep as follows:
grep -l "{d+},{d+},{d+},{d+}" /some/path/to/file/name
The grep terminates without returning anything.
Can somebody point out what I might be doing wrong with my grep statement?
Thanks
This should do what you want:
egrep -l '[[:digit:]]+,[[:digit:]]+,[[:digit:]]+,[[:digit:]]+' /some/path/to/file/name
One way is using a perl regexp:
grep -Pl "\d+,\d+,\d+,\d+" /some/path/to/file/name
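In your pattern, d is a literal letter and the braces are literal characters too. The digit shorthand has to be escaped as \d, but that escape is not part of grep's basic or extended regular expressions, so use -P, or a bracket expression such as [[:digit:]].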
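Both answers match the pattern anywhere in a line, so a line with five or more numbers also counts. A sketch that anchors the pattern to the whole line, assuming each line holds exactly four comma-separated integers and nothing else; the temporary directory and file names are hypothetical:

```shell
# Build two throwaway files: one in the expected format, one not.
tmpdir=$(mktemp -d)
printf '1,2,3,4\n5,6,7,8\n' > "$tmpdir/numbers.txt"
printf '1,2,3,4,5\nplain text\n' > "$tmpdir/other.txt"

# -l lists only the names of files with at least one matching line.
# ^ and $ anchor the match to the whole line; (,[0-9]+){3} repeats
# ",<digits>" three times after the first number.
matches=$(grep -lE '^[0-9]+(,[0-9]+){3}$' "$tmpdir"/*.txt)

printf '%s\n' "$matches"
rm -rf "$tmpdir"
```

Only numbers.txt is listed; other.txt is skipped because its five-number line no longer fits once the pattern is anchored.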