how to extract substring and numbers only using grep/sed - regex

I have a text file containing both text and numbers, and I want to use grep to extract only the numbers I need. For example, given a file as follows:
miss rate 0.21
ipc 222
stalls n shdmem 112
So say I only want to extract the data for miss rate which is 0.21. How do I do it with grep or sed? Plus, I need more than one number, not only the one after miss rate. That is, I may want to get both 0.21 and 112. A sample output might look like this:
0.21 222 112
I need the data for plotting later.

If you really want to use only grep for this, then you can try:
grep "miss rate" file | grep -oE '[0-9.]+'
The first grep finds the matching line, and the second outputs only the runs of digits and dots.
Sed might be a bit more readable, though:
sed -n 's#miss rate ##p' file

Use awk instead:
awk '/^miss rate/ { print $3 }' yourfile
To do it with just grep, you need non-standard extensions like here with GNU grep using PCRE (-P) with positive lookbehind (?<=..) and match only (-o):
grep -Po '(?<=miss rate ).*' yourfile
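As a rough sketch, the awk approach above also covers the "more than one number" part of the question if you list several labels. The function name metric_values is made up for illustration, and the sketch assumes the value is always the last whitespace-separated field on its line.

```shell
# Hypothetical helper: print the value from each line we care about.
# Assumption: the value is the last whitespace-separated field.
metric_values() {
    awk '/^miss rate / || /^stalls n shdmem / { print $NF }'
}

# Example with the data from the question.
printf '%s\n' 'miss rate 0.21' 'ipc 222' 'stalls n shdmem 112' | metric_values
```

Using $NF instead of $3 means the label may contain any number of words.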

Using \K with the PCRE engine in GNU grep (-P), which drops everything matched so far from the reported match:
grep -oP 'miss rate \K.*' file.txt
or with perl :
perl -lne 'print $& if /miss rate \K.*/' file.txt

The grep-and-cut solution would look like:
to get the 3rd field for every successful grep use:
grep "^miss rate " yourfile | cut -d ' ' -f 3
or to get the 3rd field and the rest use:
grep "^miss rate " yourfile | cut -d ' ' -f 3-
Or if you use bash and "miss rate" only occurs once in your file you can also just do:
a=( $(grep -m 1 "miss rate" yourfile) )
echo ${a[2]}
where ${a[2]} is your result.
If "miss rate" occurs more than once, you can loop over the grep output in bash, reading only what you need.
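A minimal sketch of such a loop, assuming every matching line has the form "miss rate <value>". The function name miss_rates and the variable names are invented for the example.

```shell
# Hypothetical helper: read "miss rate <value>" lines from stdin
# and print the third word of each one.
miss_rates() {
    grep '^miss rate ' | while read -r first second value rest; do
        printf '%s\n' "$value"
    done
}

# Example: two matching lines and one unrelated line.
printf '%s\n' 'miss rate 0.21' 'ipc 222' 'miss rate 0.35' | miss_rates
```

The trailing variable rest soaks up any extra words, so value holds only the third word.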

You can use:
grep -P "miss rate \d+(\.\d+)?" file.txt
or:
grep -E "miss rate [0-9]+(\.[0-9]+)?"
Both of those commands will print out miss rate 0.21. If you want to extract the number only, why not use Perl, Sed or Awk?
If you really want to avoid those, a second grep can pick the number out of the match:
grep -oE "miss rate [0-9]+(\.[0-9]+)?" file.txt | grep -oE "[0-9]+(\.[0-9]+)?$"

I believe
sed 's|[^0-9]*\([0-9\.]*\)|\1 |g' filename
will do the trick. However, every entry will be on its own line, if that is OK. I am sure there is a way for sed to produce a comma- or space-delimited list, but I am not a super master of all things sed.
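For the space-delimited list, here is one possible sketch that uses awk rather than sed. It assumes the number you want is the last field of every line, and the function name last_fields is hypothetical.

```shell
# Hypothetical helper: join the last field of every input line
# into a single space-separated line.
last_fields() {
    awk '{ printf "%s%s", (NR > 1 ? " " : ""), $NF }
         END { if (NR) print "" }'
}

# Example with the data from the question.
printf '%s\n' 'miss rate 0.21' 'ipc 222' 'stalls n shdmem 112' | last_fields
# prints: 0.21 222 112
```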

Related

Sed : print all lines after match

I got my search result using sed:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
But it only shows the part that I cut. How can I print all lines after a match?
I'm using zcat so I cannot use awk.
Thanks.
Edited :
This is my log file :
[01/09/2015 00:00:47] INFO=54646486432154646 from=steve idfrom=55516654455457 to=jone idto=5552045646464 guid=100021623456461451463 num=6 text=hi my number is 0 811 22 1/12 status=new survstatus=new
My aim is to find all users that spam my site with their telephone numbers (using grep "pattern") then print all the lines to get all the information about each spam. The problem is there may be matches in INFO or id, so I use sed to get the text first.
Printing all lines after a match in sed:
$ sed -ne '/pattern/,$ p'
# alternatively, if you don't want to print the match:
$ sed -e '1,/pattern/ d'
Filtering lines when pattern matches between "text=" and "status=" can be done with a simple grep, no need for sed and cut:
$ grep 'text=.*pattern.* status='
You can use awk
awk '/pattern/,EOF'
n.b. don't be fooled: EOF is just an uninitialized variable, and by default 0 (false). So that condition cannot be satisfied until the end of file.
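To make the range idea concrete, here is a small sketch with two helpers. Both names are invented for illustration, and each takes an awk regular expression as its first argument. The first uses a literal 0 as the range end, which is equivalent to the uninitialized EOF variable.

```shell
# Hypothetical helper: print the first matching line and everything after it.
# The range end is the constant 0 (false), so the range never closes.
lines_from_match() {
    awk -v re="$1" '$0 ~ re, 0'
}

# Hypothetical helper: print only the lines after the first match.
lines_after_match() {
    awk -v re="$1" 'found { print } $0 ~ re { found = 1 }'
}

printf '%s\n' a b c | lines_from_match b
printf '%s\n' a b c | lines_after_match b
```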
Perhaps this could be combined with all the previous answers using awk as well.
Maybe this is what you actually want? Find lines matching "pattern" and extract the field after text= up through just before status=?
zcat file* | sed -e '/pattern/s/.*text=\(.*\)status=[^/]*/\1/'
You are not revealing what pattern actually is -- if it's a variable, you cannot use single quotes around it.
Notice that \(.*\)status=[^/]* would match up through survstatus=new in your example. That is probably not what you want? There doesn't seem to be a status= followed by a slash anywhere -- you really should explain in more detail what you are actually trying to accomplish.
Your question title says "all line after a match" so perhaps you want everything after text=? Then that's simply
sed 's/.*text=//'
i.e. replace up through text= with nothing, and keep the rest. (I trust you can figure out how to change the surrounding script into zcat file* | sed '/pattern/s/.*text=//' ... oops, maybe my trust failed.)
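A sketch of the extraction step, assuming that status= is always preceded by a space while survstatus= never is, as in the sample log line. The function name extract_text is made up for the example.

```shell
# Hypothetical helper: print what sits between "text=" and " status=".
# The leading space in " status=" keeps the match from running on
# to "survstatus=".
extract_text() {
    sed -n 's/.*text=\(.*\) status=.*/\1/p'
}

# The sample log line from the question.
line='[01/09/2015 00:00:47] INFO=54646486432154646 from=steve idfrom=55516654455457 to=jone idto=5552045646464 guid=100021623456461451463 num=6 text=hi my number is 0 811 22 1/12 status=new survstatus=new'
printf '%s\n' "$line" | extract_text
```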
The seldom used branch command will do this for you. Until you match, use n for next then branch to beginning. After match, use n to skip the matching line, then a loop copying the remaining lines.
cat file | sed -n -e ':start; /pattern/b match;n; b start; :match n; :copy; p; n ; b copy'
Your pipeline currently ends like this:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
Instead, change the last two segments of your pipeline so that it reads:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | awk '$1 ~ "pattern" {print $0}'

How to egrep the month from the date?

I want to ask how I can use egrep to extract only the month section of a date in the form of
mm/dd/yyyy at hh:mm:ss
I've tried the positive lookbehind assertion but it didn't seem to work. The context of this code is: I'm looking at multiple files and gathering the dates from each of the files into timestamp.txt. In the original files, all the dates are located after TimeStamp:(note space after colon)
I'm not too great with regular expressions so I know I'm missing the expression to block out the / after the first two digits as well. If anybody can help me with that, that would be awesome :D
egrep "(?<=TimeStamp:\s)" $CURFILE | sort >> ../timestamp.txt
Thank you!
Not sure if this is possible with egrep, but it is with perl:
echo "TimeStamp: 12/10/2012"| perl -n -e 'print $1 if m#: (..)/#'
Here's another way of doing this:
echo "TimeStamp: 12/10/2012"| awk -F/ '{print $1}' | awk '{print $2}'
echo "TimeStamp: 12/10/2012"| grep TimeStamp: | cut -d ' ' -f 2 |cut -d '/' -f 1
The code is untested and might have argument escaping problems.
The idea is to first split the input by space (or colon) and then by slash. If there are more spaces in the line you might need to manipulate -f values or add more splits.
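The lookbehind attempt cannot work with egrep, because POSIX extended regular expressions have no lookaround. As a sketch that avoids it entirely, sed can capture the two digits after "TimeStamp: ". The function name month_of and the time in the sample line are made up for illustration.

```shell
# Hypothetical helper: print the two-digit month that follows "TimeStamp: ".
# Lines without a timestamp in that form print nothing.
month_of() {
    sed -n 's/.*TimeStamp: \([0-9][0-9]\)\/.*/\1/p'
}

# Example line in the mm/dd/yyyy at hh:mm:ss form from the question.
printf '%s\n' 'TimeStamp: 12/10/2012 at 08:15:00' | month_of
```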

How can I extract the content between two brackets?

My input:
1:FAILED + *1 0 (8328832,AR,UNDECLARED)
This is what I expect:
8328832,AR,UNDECLARED
I am trying to find a general regular expression that extracts any content between two brackets.
My attempt is
grep -o '\[(.*?)\]' test.txt > output.txt
but it doesn't match anything.
Still using grep and regex
grep -oP '\(\K[^\)]+' file
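\K tells the PCRE engine to drop everything matched so far from the reported match, so only the text after the opening parenthesis is printed. The effect is similar to a positive look-behind assertion, which you can also write explicitly: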
grep -oP '(?<=\()[^\)]+' file
If your grep lacks the -P option, you can do this with perl:
perl -lne '/\(\K[^\)]+/ and print $&' file
Another simpler approach using awk
awk -F'[()]' '{print $2}' file
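If a line can contain several parenthesized groups, a sketch along these lines prints each one on its own line without needing -P. It assumes a grep that supports -o (GNU and BSD grep do), and the function name paren_contents is invented. Note that the original attempt escaped square brackets while the input uses round ones, and that the lazy .*? only works with -P.

```shell
# Hypothetical helper: print the contents of every (...) group, one per line.
# In a basic regular expression, unescaped ( and ) are literal characters.
paren_contents() {
    grep -o '([^)]*)' | sed 's/^(//; s/)$//'
}

printf '%s\n' '1:FAILED + *1 0 (8328832,AR,UNDECLARED)' | paren_contents
printf '%s\n' 'a (x) b (y,z)' | paren_contents
```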

Grep matches only of multiple separated strings

I have a file with lines containing this format:
fieldA=value1, fieldB=value2, fieldC=value3, fieldD=value4, fieldE=value5
I am interested in fieldA, fieldB, fieldD. However, fieldC may or may not be present, therefore I cannot use something like:
grep "field" * | awk -F"," '{print $1, $2, $4}'
My end goal is to have output like this, all in one line:
fieldA=value1, fieldB=value2, fieldD=value4
I tried using grep -E, but it outputs those fields in different lines, and the association between the fields breaks.
grep -o -E "field1_=\w*|field2_=\w*|field3_=\w*"
If you know the names of fields A, B and D, grep and xargs can do the job (awk or sed could certainly do it too):
grep -Po "fieldA=[^,]*|fieldB=[^,]*|fieldD=[^,]*" file|xargs -n3
that gives you:
fieldA=value1 fieldB=value2 fieldD=value4
if you want the comma in output:
grep -Po "fieldA=[^,]*,|fieldB=[^,]*,|fieldD=[^,]*" file|xargs -n3
Is a sed solution acceptable?
sed 's/^\([^ ]* [^ ]*\).*\(fieldD=[^,]*\).*/\1 \2/' filename
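The xargs -n3 grouping relies on every line yielding exactly three matches. As an alternative sketch, awk can split each line on commas and keep only the wanted fields, so the output stays one line per input line whether or not fieldC is present. The function name pick_fields is hypothetical.

```shell
# Hypothetical helper: keep only fieldA, fieldB and fieldD from each line.
# Fields are assumed to be separated by a comma and optional spaces.
pick_fields() {
    awk -F', *' '{
        out = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^field[ABD]=/) {
                out = out (out == "" ? "" : ", ") $i
            }
        }
        if (out != "") print out
    }'
}

printf '%s\n' 'fieldA=value1, fieldB=value2, fieldC=value3, fieldD=value4, fieldE=value5' | pick_fields
```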

Regular expression to extract a percentage

I have strings like the following: blabla a13724bla-bla244 35%
Notice that there is always a space before the percentage. I would like to extract the percentage number (so, without the %) from these strings using the Linux shell.
Assuming you have GNU grep:
$ grep -oP '\d+(?=%)' <<< "blabla a13724bla-bla244 35%"
35
Using sed:
echo blabla a13724bla-bla244 35% | sed 's/.*[ \t][ \t]*\([0-9][0-9]*\)%.*/\1/'
If you expect to have multiple percentages in a line then:
echo blabla 20% a13724bla-bla244 35% | \
sed -e 's/[^%0-9 ]*//g;s/  */\n/g' | sed -n '/%/p'
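A shorter sketch for the multiple-percentage case, assuming a grep that supports -o. The function name percentages is made up. It prints each number without the % sign, but it does not check that a space precedes the number.

```shell
# Hypothetical helper: print every number that is directly followed by %,
# one per line, with the % sign removed.
percentages() {
    grep -oE '[0-9]+%' | tr -d '%'
}

printf '%s\n' 'blabla 20% a13724bla-bla244 35%' | percentages
```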
You can try this
echo "blabla a13724bla-bla244 35%" | cut -d' ' -f3 | sed 's/\%//g'
NOTE: Assumption is the input is always in this format and percentage is 3rd token separated by space.
You may try this regular expression:
/\s(\d+%)/
Use this regular expression:
\s(\d{1,3})%
If you need it in shell, you can use sed or this perl one-liner:
echo "blah 35%" | perl -pe 's/.*\s(\d{1,3})%/$1/g'
35
If the percentage is always in the same whitespace-separated column, you could use awk instead of a regular expression.
cat file.txt |awk '{print $3}' |cut -d "%" -f 1
With this code you obtain the third column.