I have a (GNU) bash script which sets two variables to be matched against a file:
hour=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f1)
dom=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f4)
...and matches them against other occurrences in the file:
grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log
Here is an example of the script calculating the mean of all values in field 2 of the input file for the given hour of the day:
hMean=$(grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log | cut -f2 | awk '{sum+=$1; count++} END {printf("%.2f", sum/count)}')
Here is an example of the cleanup of the input file.
echo "removing: "$hour"th hour of the "$dom"th day of the "$month"th month"
sed -i -r '/'"$hour"'-[0-9]+-[0-9]+-'"$dom"'-'"$month"'-[0-9]{4}/d' sensorstest.log
And finally... Here is an example line in the file:
The format is:
status<tab>humidity<tab>temperature<tab>unix timestamp<tab>time/date
OK 94.4 16.9 1443058486 1-34-46-24-9-2015
I am attempting to match all instances of the hour on the day of the first entry in the file.
This works fine for numbers up to 9; however:
Problem: numbers over 9 are being matched as two single-digit numbers, resulting in 12 matching 1, 2, 12, 21, etc.
Here is an example of where it trips up:
OK 100 17.2 1442570381 9-59-41-18-9-2015
OK 100 17.1 1442570397 9-59-57-18-9-2015
Moistening 100 17.6 1442574014 11-0-14-18-9-2015
Moistening 100 17.6 1442574030 11-0-30-18-9-2015
Here the output skips to 0-0-0-19-9-2015 (and yes, I am missing an hour of entries from the log):
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 16.5,17.2,16.90,.7 1442566811 9-0-0-18-9-2015
removing: 9th hour of the 18th day of the 9th month
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 18.3,18.8,18.57,.5 1442620804 0-0-0-19-9-2015
removing: 0th hour of the 19th day of the 9th month
The problem is only happening with the hours; the day ($dom) is matching fine.
I have tried using the -w option with grep, but I think that only returns the exact match, whereas I need the whole line.
There's not much online about matching numbers literally in grep, and I found nothing on using bash variables as number literals.
Any help or relevant links would be greatly appreciated.
EDIT:
I have solved the problem after a night of dredging through the script.
The problem lay with my sed expression right at the end: I was single-quoting parts of the sed expression and double-quoting the variables so that the shell would expand them.
I took this from a suggestion on another thread.
Double quoting the whole expression solved the problem.
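For reference, here is the deletion from above with the whole expression double quoted (a reconstruction of the fix as described, using the same variables):
sed -i -r "/$hour-[0-9]+-[0-9]+-$dom-$month-[0-9]{4}/d" sensorstest.log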
The awk suggestion has greatly increased the efficiency and accuracy of the script. Thanks again.
awk to the rescue!
I think you can combine everything into a simple awk script without needing any regex. For example,
awk 'NR==1{split($NF,h,"-")} {split($NF,t,"-")} t[1]==h[1] && t[4]==h[4]'
will parse the timestamp on the first row of the file and filter to only the records whose hour and day match it.
This will also compute the average of field 2 for the matching records:
awk 'NR==1 { split($NF, h, "-") }      # remember the first line's time/date fields
     { split($NF, t, "-") }            # split every line's time/date field
     t[1]==h[1] && t[4]==h[4] {        # same hour and same day of month as line 1
         sum += $2                     # accumulate field 2
         c++
     }
     END { print "Average: " sum/c }'
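For example, with the four sample lines from the question (the first line's timestamp is hour 9 on day 18), only the two hour-9 records match, so the average of field 2 is 100:
$ awk 'NR==1{split($NF,h,"-")} {split($NF,t,"-")} t[1]==h[1] && t[4]==h[4] {sum+=$2; c++} END {print "Average: " sum/c}' sensorstest.log
Average: 100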
Related
I need to extract from this file the lines that start with a number in the range 10-20. I tried grep "[10-20]" tmp_file.txt, but on a file with this format
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.aa
12.bbb
13.cccc
14.ddddd
15.eeeeee
16.fffffff
17.gggggggg
18.hhhhhhhhh
19.iiiiiiiiii
20.jjjjjjjjjjj
21.
it returned everything and marked every digit that is a 0, 1 or 2, so it matched 1, 2, 10, 20, 21, etc. :/
With an extended regular expression (-E):
grep -E '^(1[0-9]|20)\.' file
Output:
10.aa
12.bbb
13.cccc
14.ddddd
15.eeeeee
16.fffffff
17.gggggggg
18.hhhhhhhhh
19.iiiiiiiiii
20.jjjjjjjjjjj
See: The Stack Overflow Regular Expressions FAQ
Another one with awk, using range patterns (printing from the first line that matches the start pattern through the first line that matches the end pattern):
awk '/^10\./,/^20\./' tmp_file.txt
awk '/^10\./,/^13\./' tmp_file.txt
10.aa
12.bbb
13.cccc
Try
grep -w -e '^1[[:digit:]]' -e '^20' tmp_file.txt
-w forces matches of whole words. That prevents matching lines like 100.... It's not POSIX, but it's supported by every grep that most people will encounter these days. Use grep -e '^1[[:digit:]]\.' -e '^20\.' ... if you are concerned about portability.
The -e option can be used multiple times to specify multiple patterns.
[[:digit:]] may be more reliable than [0-9]. See In grep command, can I change [:digit:] to [0-9]?.
Assuming the file might not be sorted, and using a numeric comparison:
awk -F. '$1 >= 10 && $1 <= 20' < file.txt
grep is not the tool for this, because grep finds text patterns, but does not understand numeric values. Making patterns that match the 11 values from 10-20 is like stirring a can of paint with a screwdriver. You can do it, but it's not the right tool for the job.
A much clearer way to do this is with Perl:
$ perl -n -e'print if /^(\d+)/ && $1 >= 10 && $1 <= 20' foo.txt
This says to print a line of the file if the beginning of the line (^) matches one or more digits (\d+) and if the numeric value of what was matched ($1) is between 10 and 20.
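Run against the question's tmp_file.txt, it selects the same ten lines as the grep -E answer above:
$ perl -n -e'print if /^(\d+)/ && $1 >= 10 && $1 <= 20' tmp_file.txt
printing 10.aa through 20.jjjjjjjjjjj.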
The following grep command gives me the number of requests from July 1st to July 31st between 8 a.m. and 4 p.m.:
zgrep -E "[01\-31]/Jul/2021:[08\-16]" localhost_access.log* | wc -l
I don't want the total for the whole month, though, but the requests per day. I could of course run the command 31 times, but that's tedious. Is there a way to display the requests per day one below the other (ideally sorted by number), so that I get a result like this, for example:
543
432
321
etc.
How to do that?
You want to count lines based on a certain value in each line. That's a good job for awk. With grep alone, you would have to process the input files once per day. Either way, we need to fix your regex first:
zgrep -E "[01\-31]/Jul/2021:[08\-16]" localhost_access.log* | wc -l
[08\-16] matches the characters 0, 8, -, 1 and 6. What you want to match is (0[89]|1[0-6]); that is, 0 followed by 8 or 9, or 1 followed by one of the range 0-6. To make it easier, we assume well-formed days in the date and therefore match the day with [0-9]{2} (two digits).
Here's a complete awk for your task:
awk -F/ '/[0-9]{2}\/Jul\/2021:(0[89]|1[0-6])/{a[$1]++}END{for (i in a) print "day " i ": " a[i]}' localhost_access.log*
Explanation:
/[0-9]{2}\/Jul\/2021:(0[89]|1[0-6])/ matches date + time for every day in July, at hours 08-16
{a[$1]++} builds an array keyed by day, counting the occurrences.
END{for (i in a) print "day " i ": " a[i]} prints the array once all input files have been processed.
Because we've set the field separator to /, $1 is the day here. If your real log lines contain more slashes before the actual date, change a[$1] to address the correct position (e.g. a[$3] for two more slashes). (Of course this could be solved in a more dynamic way.)
Example:
$ cat localhost_access.log
01/Jul/2021:08 log message
01/Jul/2021:08 log message
02/Jul/2021:08 log message
02/Jul/2021:07 log message
$ awk -F/ '/[0-9]{2}\/Jul\/2021:(0[89]|1[0-6])/{a[$1]++}END{for (i in a) print "day " i ": " a[i]}' localhost_access.log*
day 01: 2
day 02: 1
Use zcat and pipe into awk in case your log files are compressed, but remember the regex above searches for "Jul/2021" only.
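To get the result ideally sorted by number, as the question asks, you can pipe the awk output through sort; for example, sorting numerically and descending on the third (count) field:
awk -F/ '/[0-9]{2}\/Jul\/2021:(0[89]|1[0-6])/{a[$1]++}END{for (i in a) print "day " i ": " a[i]}' localhost_access.log* | sort -k3 -rn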
I have a string with multiple value outputs that looks like this:
SD performance read=1450kB/s write=872kB/s no error (0 0), ManufactorerID 27 Date 2014/2 CardType 2 Blocksize 512 Erase 0 MaxtransferRate 25000000 RWfactor 2 ReadSpeed 22222222Hz WriteSpeed 22222222Hz MaxReadCurrentVDDmin 3 MaxReadCurrentVDDmax 5 MaxWriteCurrentVDDmin 3 MaxWriteCurrentVDDmax 1
I would like to output only the read value (1450kB/s) using bash and sed.
I tried
sed 's/read=\(.*\)kB/\1/'
but that outputs read=1450kB but I only want the number.
Thanks for any help.
Sample input shortened for demo:
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/read=\(.*\)kB/\1/'
SD performance 1450kB/s write=872/s no error
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/.*read=\(.*\)kB.*/\1/'
1450kB/s write=872
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/.*read=\([0-9]*\)kB.*/\1/'
1450
Since the entire line has to be replaced, add .* before and after the search pattern.
* is greedy and will try to match as much as possible, so in the 2nd example you can see that it swallowed even the write value.
Since only the digits after read= are needed, use [0-9] instead of .
Running
sed 's/read=\(.*\)kB/\1/'
will replace read=[digits]kB with [digits]. If you want to replace the whole string, use
sed 's/.*read=\([0-9]*\)kB.*/\1/'
instead.
As Sundeep noticed, sed doesn't support non-greedy patterns, so this was updated to use [0-9]* instead.
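If sed isn't a hard requirement, grep can extract just the number directly (a sketch assuming a GNU grep built with PCRE support for -P; \K discards everything matched before it):
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | grep -oP 'read=\K[0-9]+'
1450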
I have JSON lines that contain multiple parts per line that look like this:
"SomeDate":"Date(-2156284800000)",
I would like to convert each occurrence in all lines into something more human readable:
"SomeDate":"1901-09-03 00:19:32",
I tried using sed to put the matched block (in this case the timestamp) into the argument list of the date command. This fails:
$ echo '"SomeDate":"Date(-2156284800000)",' | \
sed "s/Date(\([0-9\-]*\)[0-9][0-9][0-9])/$(date -d#\\1 \"+%F %T\")/g"
date: invalid date `#\\1'
"SomeDate":"",
In an attempt to debug this, I added an echo before date to validate the command it should be running:
$ echo '"SomeDate":"Date(-2156284800000)",' | \
sed "s/Date(\([0-9\-]*\)[0-9][0-9][0-9])/$(echo date -d#\\1 \"+%F %T\")/g"
"SomeDate":"date -d#-2156284800 "+%F %T"",
$ date -d#-2156284800 "+%F %T"
1901-09-03-00:19:32
Why isn't the first command running as expected?
My best guess right now is that the subshell is executed first WITHOUT the \1 substitution, and the resulting output is then used by sed.
How do I achieve what I'm trying to do?
P.S. I'm using CentOS 6.6
How about using awk:
echo '"SomeDate":"Date(-2156284800000)",' | awk '{ print gensub(/Date\(([0-9\-]+)\)/, ("date -d#\\1 \"+%F %T\"" |& getline t) ? t : "\\1", "g"); }'
Disclaimer: There's probably a better way to do this, but briefly:
gensub is like gsub but gives you access to the matched groups
Capture the Date(XXX) bit with:
/Date\(([0-9\-]+)\)/
(which gets the actual epoch in the match group \1)
The second argument is:
("date -d#\\1 \"+%F %T\"" |& getline t) ? t : "\\1"
which forms the date command, runs it (with getline) and assigns the result to the variable t. getline, bizarrely, returns 1 on success, so we check for that with the ternary (?:) operator and use the first line of output from that command.
Finally, we tell gensub to be global.
The workaround I use right now is via perl:
$ echo fee 4321432431 fie 1429882795 fum | \
perl -MPOSIX -pe 's/(\d+)/strftime "%F", localtime $1/eg'
fee 2106-12-10 fie 2015-04-24 fum
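For completeness, GNU sed can also execute the result of a substitution as a command via the e flag to s///. A minimal sketch, assuming GNU sed and a single Date(...) per line; unlike the perl version it replaces the whole line with the command's output, and the printed time depends on your timezone (as with the manual date run above):
$ echo '"SomeDate":"Date(-2156284800000)",' | sed 's/.*Date(\([0-9-]*\)[0-9]\{3\}).*/date -d @\1 "+%F %T"/e'
1901-09-03 00:19:32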
Running an rsync command produces output similar to this :
66256896 92% 4.51MB/s 0:00:01
How can I grep this output for just the percentage value?
So anything 0-100 followed by %, so that instead of the full output I only see the percentage.
The command would be:
rsync -Pav server.com::files/remotefile.tar.gz localfile.tar.gz | grep xxx
Thanks
If you really want to use sed, this ugly thing works!
rsync -Pav server.com::files/remotefile.tar.gz localfile.tar.gz | sed -e 's/%.*/%/; s/.* //'
It replaces % followed by the rest of the line with just % (thereby deleting everything after the percent sign), then deletes everything up to the last space before the percentage.
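Alternatively, GNU grep's -o option prints only the matching part of each line (a sketch, assuming GNU grep; note that rsync separates its progress updates with carriage returns rather than newlines, so convert those first):
rsync -Pav server.com::files/remotefile.tar.gz localfile.tar.gz | tr '\r' '\n' | grep -oE '[0-9]{1,3}%'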