Grabbing a substring from text with bash - regex

I am trying to extract a substring from some text and I am struggling to find the correct sed command or regex to do it.
My input text could be one of the following:
feature/XXX-9999-SomeOtherText
develop
feature/XXX-99999-SomeMoreText
bugfix/XXX-9999
feature/XXXX-9999
XXX-9999
and I want to pull out just the XXX-9999, but there can be any number of Xs and 9s. Where there are no Xs or 9s (as in the second example) I would like to return an empty value.
I have tried several ways using sed and the closest I got was
echo "feature/XXX-9999-SomeOtherText" | sed 's/.*\([[:alpha:]]\{3\}-[[:digit:]]\{4\}\).*/\1/'
which works if there are 3 Xs and 4 9s but anything else gives the full input string.

You can use grep and its -o option:
grep -o 'X\+-9\+'
If you want non-matching lines to result in empty lines you can add || echo ''.
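As a rough sketch of how that could be wrapped up: the helper below (extract_key is a made-up name) generalises the literal X and 9 to POSIX character classes, which is an assumption about what the real branch names look like. Note that -o is a GNU/BSD extension rather than POSIX.

```shell
# extract_key is a hypothetical helper name, not something from the question.
extract_key() {
    # -o prints only the matched text; -E allows + without backslashes.
    # When grep finds nothing it exits non-zero, so print an empty line instead.
    printf '%s\n' "$1" | grep -oE '[[:alpha:]]+-[[:digit:]]+' || echo ''
}

for branch in feature/XXX-9999-SomeOtherText develop bugfix/XXX-9999 XXX-9999; do
    printf '%s -> [%s]\n' "$branch" "$(extract_key "$branch")"
done
```

If a branch name could contain a second letters-hyphen-digits token, grep -o would print each one on its own line, so the result may need trimming to the first line.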

You can use this sed,
sed 's#\(^\|.*/\)\([a-zA-Z0-9]\+-[0-9]\+\).*#\2#g; /[a-zA-Z0-9]\+-[0-9]\+/!s#.*##g' yourfile

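printf '%s\n' 'feature/XXX-9999-SomeOtherText' 'noX nor 9' | sed 's/^\(.*[^[:alpha:]]\)\{0,1\}\([[:alpha:]]\{1,\}-[[:digit:]]\{1,\}\).*/\2/
t
s/.*//'
Your test uses fixed counts ({3} and {4}), so it only matches when there are at least three Xs and four 9s, and it never captures more than that. Change the counts to a minimum of one, {1,} (equivalent to + in GNU sed). The optional leading group has to end in a non-letter; with a bare greedy .* in front, all but the last X would be swallowed before the capture starts.
I also turn non-matching lines into empty lines (the line is kept, not removed). If that is not needed, remove everything from the t up to the last /.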

Run on your posted sample input file:
$ sed -r -n 's/[^X]*(X+-9+).*/\1/p' file
XXX-9999
XXX-99999
XXX-9999
XXXX-9999
XXX-9999
$ sed -r -n 's/[^X]*(X+-9+)?.*/\1/p' file
XXX-9999
XXX-99999
XXX-9999
XXXX-9999
XXX-9999
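The above IMHO shows a couple of the most likely interpretations of "where there are no Xs or 9s (as per the second example) I would like to return an empty value": the first command prints nothing at all for a non-matching line such as develop, while the second prints an empty line for it.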
If your sed doesn't support -r then this would work with any sed:
sed -n 's/[^X]*\(XX*-99*\).*/\1/p' file
sed -n 's/[^X]*\(XX*-99*\)*.*/\1/p' file
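To make the difference between the two variants visible, here is a small sketch that counts the output lines of each. It uses -E instead of -r on the assumption that your sed accepts it (GNU and BSD sed both do), and feeds the sample from a variable rather than a file.

```shell
sample='feature/XXX-9999-SomeOtherText
develop
feature/XXX-99999-SomeMoreText
bugfix/XXX-9999
feature/XXXX-9999
XXX-9999'

# Mandatory group: lines without a match are not printed at all.
strict=$(printf '%s\n' "$sample" | sed -E -n 's/[^X]*(X+-9+).*/\1/p')
# Optional group: every line is substituted, so "develop" becomes an empty line.
lenient=$(printf '%s\n' "$sample" | sed -E -n 's/[^X]*(X+-9+)?.*/\1/p')

printf 'strict:  %s lines\n' "$(printf '%s\n' "$strict" | wc -l | tr -d ' ')"
printf 'lenient: %s lines\n' "$(printf '%s\n' "$lenient" | wc -l | tr -d ' ')"
```

The strict form yields five lines and the lenient form six, the extra one being the empty line produced for develop.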

Related

How to get the release value?

I've a file with the below name formats:
rzp-QAQ_SA2-5.12.0.38-quality.zip
rzp-TEST-5.12.0.38-quality.zip
rzp-ASQ_TFC-5.12.0.38-quality.zip
I want the value as: 5.12.0.38-quality.zip from the above file names.
I'm trying as below, but not getting the correct value though:
echo "$fl_name" | sed 's#^[-[:alpha:]_[:digit:]]*##'
fl_name is the variable containing the file name.
Thanks a lot in advance!
You are matching too much with all the alpha, digit - and _ in the same character class.
You can match alpha and - and optionally _ and alphanumerics
sed -E 's#^[-[:alpha:]]+(_[[:alnum:]]*-)?##' file
Or you can shorten the first character class, and match a - at the end:
sed -E 's#^[-[:alnum:]_]*-##' file
Output of both examples
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
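A quick side-by-side sketch of why the original attempt fails, using one of the sample names. The variable names are only for illustration.

```shell
name='rzp-QAQ_SA2-5.12.0.38-quality.zip'

# Original attempt: the class also contains digits, so it eats the leading "5"
# and only stops at the first dot.
too_much=$(printf '%s\n' "$name" | sed 's#^[-[:alpha:]_[:digit:]]*##')

# Requiring a trailing "-" forces the match to end at the last hyphen
# before the first dot.
fixed=$(printf '%s\n' "$name" | sed -E 's#^[-[:alnum:]_]*-##')

printf 'attempt: %s\nfixed:   %s\n' "$too_much" "$fixed"
```

The attempt leaves .12.0.38-quality.zip, while the second form leaves 5.12.0.38-quality.zip.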
With GNU grep you could try following code. Written and tested with shown samples.
grep -oP '(.*?-){2}\K.*' Input_file
OR as an alternative use(with a non-capturing group solution, as per the fourth bird's nice suggestion):
grep -oP '(?:[^-]*-){2}\K.*' Input_file
Explanation: -o prints only the matched part of each line and -P enables PCRE. The regex (.*?-){2} lazily matches up to and including a hyphen, twice, which consumes the first two hyphen-delimited fields. \K then discards everything matched so far, so only what the trailing .* matches (the rest of the line) is printed.
It is much easier to use cut here:
cut -d- -f3- file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
If you want sed then use:
sed -E 's/^([^-]*-){2}//' file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
Assumptions:
all filenames contain 3 hyphens (-)
the desired result always consists of stripping off the 1st two hyphen-delimited strings
OP wants to perform this operation on a variable
We can eliminate the overhead of sub-process calls (eg, grep, cut and sed) by using parameter substitution:
$ f1_name='rzp-ASQ_TFC-5.12.0.38-quality.zip'
$ new_f1_name="${f1_name#*-}" # strip off first hyphen-delimited string
$ echo "${new_f1_name}"
ASQ_TFC-5.12.0.38-quality.zip
$ new_f1_name="${new_f1_name#*-}" # strip off next hyphen-delimited string
$ echo "${new_f1_name}"
5.12.0.38-quality.zip
On the other hand if OP is feeding a list of file names to a looping construct, and the original file names are not needed, it may be easier to perform a bulk operation on the list of file names before processing by the loop, eg:
while read -r new_f1_name
do
... process "${new_f1_name}"
done < <( command-that-generates-list-of-file-names | cut -d- -f3-)
In plain bash:
echo "${fl_name#*-*-}"
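A minimal sketch of that expansion applied to all three sample names. strip_two_fields is a hypothetical helper name, and the code only uses POSIX parameter expansion, so it should not need bash specifically.

```shell
# strip_two_fields is a made-up name used only for this sketch.
strip_two_fields() {
    # "#" removes the shortest prefix matching the glob *-*-,
    # i.e. the first two hyphen-delimited fields.
    printf '%s\n' "${1#*-*-}"
}

for fl_name in rzp-QAQ_SA2-5.12.0.38-quality.zip \
               rzp-TEST-5.12.0.38-quality.zip \
               rzp-ASQ_TFC-5.12.0.38-quality.zip; do
    strip_two_fields "$fl_name"
done
```

If a name contains fewer than two hyphens the glob does not match and the name comes back unchanged, so that case may need its own check.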
You can reverse the name, take the first two fields separated by "-" (the last two of the original), and then reverse again:
printf '%s\n' "$fl_name" | rev | cut -f1,2 -d'-' | rev
A Perl solution capturing digits and characters trailing a '-'
cat f_name | perl -lne 'chomp; /.*?-(\d+.*?)\z/g;print $1'

Deleting everything between two string matches in a file

I got this text in file.txt:
Osmun.Prez#mail.com:c7lB2m6b#3.a.a:tt_webid_v2=6990226111024612869; tt_webid=6990226111024612869; tt_csrf_token=VD5Nb_TQFH4RKhoJeSe2nzLB; R6kq3TV7=AHkh4PB6AQAA3LIS90nWf2ss0Q7ZTCQjUat4axctvhQY68DdUEz92RwpmVSX|1|0|e9d6917c2fe555827dcf5ee916ba9778079ab2a9; ttwid=1%7CAFodeNF0iZM2fyy-ZeiZ6HTpZoG_MSx6SmXHgGVQ-V4%7C1627538859%7C59ca1e4a56f9f537b55e655a6dabff88e44eb48502b164ed6b4199f5a5263cb0; passport_csrf_token_default=6f7653c3ce946a6ce5444723fb0c509b; passport_csrf_token=6f7653c3ce946a6ce5444723fb0c509b; sid_guard=0483b7d37f4e4bd20ab3046e29724798%7C1627538893%7C5184000%7CMon%2C+27-Sep-2021+06%3A08%3A13+GMT; uid_tt=27b52febe6222486b9f6b6a90ef4ffeace5ea25c09d29a1583be5a1ecf760996; uid_tt_ss=27b52febe6222486b9f6b6a90ef4ffeace5ea25c09d29a1583be5a1ecf760996; sid_tt=0483b7d37f4e4bd20ab3046e29724798; sessionid=0483b7d37f4e4bd20ab3046e29724798; sessionid_ss=0483b7d37f4e4bd20ab3046e29724798; store-idc=maliva; store-country-code=us; odin_tt=294845c8f7711db177f7c549a9f44edb1555031b27a2a485df809cd92c4e544ac0772bf462df5b7a100f6e488c45303cd62df3b6b950f0842520cd887850137b035d990f29cc8b752765e594560c977f; cmpl_token=AgQQAPNSF-RMpbE89z5HYF0_-2PcrxjXf4fZYP5_ZA
How can I delete everything from :tt_ to _ZA (first and only instance) in file.txt, keeping only Osmun.Prez#mail.com:c7lB2m6b#3.a.a, using bash on Linux?
Thank you
Something like:
sed -i "s/:tt_.*//" file.txt
if you want to edit the file in place. If not, remove the -i switch.
The sed command means: in each line of file.txt, replace (s) the pattern :tt_ and all the characters after it (.*) with an empty string (//).
Or the command:
sed -i "s/:tt_.*_ZA//" file.txt
which is more adherent to what you ask for, but returns the same output.
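The two commands only agree as long as the line really ends in _ZA. A small sketch without -i, using shortened made-up stand-ins for the long cookie line (both sample strings and the function names are assumptions for illustration):

```shell
# Shortened, made-up stand-ins for the long line in file.txt.
with_za='user#mail.com:secret:tt_webid=1; cmpl_token=abc_ZA'
without_za='user#mail.com:secret:tt_webid=1; cmpl_token=abc'

cut_tail()  { printf '%s\n' "$1" | sed 's/:tt_.*//'; }      # drop everything from :tt_ on
cut_to_za() { printf '%s\n' "$1" | sed 's/:tt_.*_ZA//'; }   # drop only up to the last _ZA

for line in "$with_za" "$without_za"; do
    cut_tail "$line"
    cut_to_za "$line"
done
```

The second form leaves a line untouched when it contains no _ZA. Also worth knowing before adding -i back: GNU sed accepts a bare -i, while BSD/macOS sed expects a backup suffix argument, for example -i ''.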
Use pattern substitution:
i=$(cat file.txt)
echo "${i/:tt*_ZA}"
Assuming the general requirement is to remove everything after the 2nd : ...
Sample data:
$ cat file.txt
Osmun.Prez#mail.com:c7lB2m6b#3.a.a:tt_webid_v ... to end of line
some.one#home.com:B52_m6b#9_az.more.stuff:delete from here ... to end of line
One sed idea:
$ sed -En 's/^([^:]*:[^:]*).*$/\1/p' file.txt
Osmun.Prez#mail.com:c7lB2m6b#3.a.a
some.one#home.com:B52_m6b#9_az.more.stuff
Using awk
awk 'BEGIN{FS=OFS=":"}{print $1,$2}'
Using : as the delimiter, it is easy to extract the columns before :tt
This deletes all chars from ":tt_" to the last "_ZA", inclusive, in file.txt
Mac_3.2.57$cat file.txt | sed 's/\(\)[:]tt.*_ZA\(.*\)/\1\2/'
Osmun.Prez#mail.com:c7lB2m6b#3.a.a
Mac_3.2.57$
Or if it is always the first 2 values which are separated by a colon (as per your example)
cat file.txt | cut -f1,2 -d':'

Sed : print all lines after match

I got my research result after using sed :
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
But it only shows the part that I cut. How can I print all lines after a match ?
I'm using zcat so I cannot use awk.
Thanks.
Edited :
This is my log file :
[01/09/2015 00:00:47] INFO=54646486432154646 from=steve idfrom=55516654455457 to=jone idto=5552045646464 guid=100021623456461451463 num=6 text=hi my number is 0 811 22 1/12 status=new survstatus=new
My aim is to find all users that spam my site with their telephone numbers (using grep "pattern") then print all the lines to get all the information about each spam. The problem is there may be matches in INFO or id, so I use sed to get the text first.
Printing all lines after a match in sed:
$ sed -ne '/pattern/,$ p'
# alternatively, if you don't want to print the match:
$ sed -e '1,/pattern/ d'
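A small sketch of both forms on made-up input (the five sample lines and the word MATCH are placeholders, not taken from the log in the question):

```shell
log='alpha
beta
MATCH here
gamma
delta'

# Print from the first matching line through the end of input.
from_match=$(printf '%s\n' "$log" | sed -n '/MATCH/,$p')
# Delete from line 1 through the first matching line, keeping only what follows.
after_match=$(printf '%s\n' "$log" | sed '1,/MATCH/d')

printf '%s\n---\n%s\n' "$from_match" "$after_match"
```

One thing to watch with the second form: the end of the range is only looked for starting on line 2, so a match on the very first line does not close it. GNU sed offers 0,/pattern/ for that case.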
Filtering lines when pattern matches between "text=" and "status=" can be done with a simple grep, no need for sed and cut:
$ grep 'text=.*pattern.* status='
You can use awk
awk '/pattern/,EOF'
n.b. don't be fooled: EOF is just an uninitialized variable, and by default 0 (false). So that condition cannot be satisfied until the end of file.
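To illustrate that EOF has no special meaning here, this sketch runs the range once with EOF and once with a literal 0 as the end condition; the sample lines are made up.

```shell
log='alpha
MATCH here
gamma'

# EOF is just an unset variable (false), so the range never closes.
by_eof=$(printf '%s\n' "$log" | awk '/MATCH/,EOF')
# A literal 0 states the same thing more openly.
by_zero=$(printf '%s\n' "$log" | awk '/MATCH/,0')

[ "$by_eof" = "$by_zero" ] && echo "same output"
```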
Perhaps this could be combined with all the previous answers using awk as well.
Maybe this is what you actually want? Find lines matching "pattern" and extract the field after text= up through just before status=?
zcat file* | sed -e '/pattern/s/.*text=\(.*\)status=[^/]*/\1/'
You are not revealing what pattern actually is -- if it's a variable, you cannot use single quotes around it.
Notice that \(.*\)status=[^/]* would match up through survstatus=new in your example. That is probably not what you want? There doesn't seem to be a status= followed by a slash anywhere -- you really should explain in more detail what you are actually trying to accomplish.
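Here is a sketch of that over-match on a shortened copy of the log line (a few fields are dropped for brevity). The second command is only one possible fix, assuming the status= field is always preceded by a space.

```shell
line='[01/09/2015 00:00:47] INFO=54646486432154646 from=steve text=hi my number is 0 811 22 1/12 status=new survstatus=new'

# Greedy \(.*\) runs up to the LAST "status=", which sits inside "survstatus=".
greedy=$(printf '%s\n' "$line" | sed 's/.*text=\(.*\)status=[^/]*/\1/')
# Requiring a space before status= stops the capture at the real field.
anchored=$(printf '%s\n' "$line" | sed 's/.*text=\(.*\) status=.*/\1/')

printf 'greedy:   [%s]\nanchored: [%s]\n' "$greedy" "$anchored"
```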
Your question title says "all line after a match" so perhaps you want everything after text=? Then that's simply
sed 's/.*text=//'
i.e. replace up through text= with nothing, and keep the rest. (I trust you can figure out how to change the surrounding script into zcat file* | sed '/pattern/s/.*text=//' ... oops, maybe my trust failed.)
The seldom used branch command will do this for you. Until you match, use n for next then branch to beginning. After match, use n to skip the matching line, then a loop copying the remaining lines.
cat file | sed -n -e ':start; /pattern/b match;n; b start; :match n; :copy; p; n ; b copy'
In your pipeline
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
replace the last two segments (cut and grep) with awk:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | awk '$1 ~ "pattern" {print $0}'

SED: Number of returned lines

To a file jungle.txt with following text ...
A lion sleeps in the jungle
A lion sleeps tonight
A tiger awakens in the swamp
The parrot observes
Wimoweh, wimoweh, wimoweh, wimoweh
... one could perform GREP search ...
$ grep lion jungle.txt
... or SED search ...
$ sed "/lion/p" jungle.txt
... to find occurrences of a pattern ("lion" in this case).
Is there some easy way to get a number of returned lines? Or at least to know that there was more than 1 found? As always, I've googled a lot first, but surprisingly found no answer.
Thanks!
grep can count matching lines:
grep -c 'lion' file
Output:
2
Syntax:
-c: Suppress normal output; instead print a count of matching lines for each input file. With the -v, --invert-match option (see below), count non-matching lines. (-c is specified by POSIX.)
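A short sketch using the text from the question, fed from a variable instead of jungle.txt. It also shows that -c counts matching lines rather than individual occurrences; the -o and -i options used for the occurrence count are common extensions (GNU and BSD grep), and -o in particular is not POSIX.

```shell
jungle='A lion sleeps in the jungle
A lion sleeps tonight
A tiger awakens in the swamp
The parrot observes
Wimoweh, wimoweh, wimoweh, wimoweh'

count=$(printf '%s\n' "$jungle" | grep -c 'lion')
if [ "$count" -gt 1 ]; then
    echo "more than one matching line ($count)"
fi

# -c counts lines: the last line holds four occurrences but counts once.
lines=$(printf '%s\n' "$jungle" | grep -ci 'wimoweh')
hits=$(printf '%s\n' "$jungle" | grep -oi 'wimoweh' | wc -l | tr -d ' ')
echo "wimoweh: $lines line, $hits occurrences"
```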
This might work for you (GNU sed):
sed '/lion/!d' file | sed '$=;d'
or if you prefer:
sed -n '/lion/p' file | sed -n '$='
N.B. if the file is empty or the first sed command finds nothing the result of the second sed command is blank.
You can use awk
awk '/lion/ {a++} END {print a+0}'
2
But I would say that the best solution is the one posted by Cyros using grep -c 'lion' file
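The approaches differ mainly when nothing matches. A sketch with a pattern (zebra, chosen arbitrarily) that does not occur in the text:

```shell
jungle='A lion sleeps in the jungle
A lion sleeps tonight'

# sed '$=' prints nothing at all when its input is empty.
via_sed=$(printf '%s\n' "$jungle" | sed -n '/zebra/p' | sed -n '$=')
# a+0 forces a numeric 0 even though a was never set.
via_awk=$(printf '%s\n' "$jungle" | awk '/zebra/ {a++} END {print a+0}')
# grep -c prints 0 but exits with status 1, hence the || true.
via_grep=$(printf '%s\n' "$jungle" | grep -c 'zebra' || true)

printf 'sed=[%s] awk=[%s] grep=[%s]\n' "$via_sed" "$via_awk" "$via_grep"
```

So a script that compares the count numerically is probably safer with the awk or grep form, since the sed form can hand back an empty string.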
Just pass the grep command output to the wc -l command to count the number of returned lines,
$ grep 'lion' file | wc -l
2
From wc --help
-l, --lines print the newline counts

regexp (sed) suppress "no match" output

I'm stuck on that and can't wrap my head around it: How can I tell sed to return the value found, and otherwise shut up?
It's really beyond me: Why would sed return the whole string if he found nothing? Do I have to run another test on the returned string to verify it? I tried using "-n" from the (very short) man page but it effectively suppresses all output, including matched strings.
This is what I have now :
echo plop-02-plop | sed -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/'
which returns
02 (and that is fine and dandy, thank you very much), but:
echo plop-02plop | sed -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/'
returns
plop-02plop (when it should return this = "" nothing! Dang, you found nothing so be quiet!
For crying out loud !!)
I tried checking for a return value, but this failed too ! Gasp !!
$ echo plop-02-plop | sed -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/' ; echo $?
02
0
$ echo plop-02plop | sed -e 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/' ; echo $?
plop-02plop
0
$
This last one I cannot even believe. Is sed really the tool I should be using? I want to extract a needle from a haystack, and I want a needle or nothing..?
sed by default prints all lines.
What you want to do is
/patt/!d;s//repl/
In other words, delete the lines that do not match your pattern; on the lines that do match, the empty pattern in s// reuses the last regex, so you can extract a particular element by its capturing group number. In your case it will be:
sed -e '/^.*\(.\)\([0-9][0-9]\)\1.*$/!d;s//\2/'
You can also use the -n option to suppress the automatic printing of every line; a line is then printed only when you explicitly ask for it, here with the p flag. In practice, scripts using -n are usually longer and more cumbersome to maintain. Here it will be:
sed -ne 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/p'
There is also grep, but your example shows, why sed is sometimes better.
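Since sed exits with status 0 whether or not the substitution happened, one way to get "a needle or nothing" is to test the captured output for emptiness. A minimal sketch with the -n/p form; needle is a hypothetical function name.

```shell
# needle is a made-up helper name for this sketch.
needle() {
    # -n suppresses default output; the p flag prints only substituted lines.
    printf '%s\n' "$1" | sed -n 's/^.*\(.\)\([0-9][0-9]\)\1.*$/\2/p'
}

for s in plop-02-plop plop-02plop; do
    found=$(needle "$s")
    if [ -n "$found" ]; then
        echo "found: $found"
    else
        echo "no match in $s"
    fi
done
```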
Perhaps you can use egrep -o?
input.txt:
blooody
aaaa
bbbb
odor
qqqq
E.g.
sehe#meerkat:/tmp$ egrep -o o+ input.txt
ooo
o
o
sehe#meerkat:/tmp$ egrep -no o+ input.txt
1:ooo
4:o
4:o
Of course egrep will have slightly different (better?) regex syntax for advanced constructs (back-references, non-greedy operators). I'll let you do the translation, if you like the approach.
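The same example can be written with grep -E, which is what egrep stands for; as far as I know, recent GNU grep releases print a warning that egrep is obsolescent. A sketch with the input held in a variable instead of input.txt:

```shell
words='blooody
aaaa
bbbb
odor
qqqq'

# -o prints each match on its own line; -n prefixes the input line number.
plain=$(printf '%s\n' "$words" | grep -oE 'o+')
numbered=$(printf '%s\n' "$words" | grep -noE 'o+')

printf '%s\n---\n%s\n' "$plain" "$numbered"
```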