Git log stats with regular expressions - regex

I would like to do some stats on my git log to get something like:
10 Daniel Schmidt
5 Peter
1 Klaus
The first column is the count of commits and the second the committer.
I already got as far as this:
git log --raw |
grep "^Author: " |
sort |
uniq -c |
sort -nr |
less -FXRS
The interesting part is the
grep "^Author: "
which I wanted to modify with a nice regex to exclude the mail address.
With Rubular something like this http://rubular.com/r/mEzP2hFjGb worked, but if I insert it in the grep (or in another piped one) it won't give me the right output.
Side question: Is there a way to get the count and the author separated by something other than whitespace while staying in this pipe command style? I would like to have a nicer separator between the two so I can use column later (and maybe some color ^^)
Thanks a lot for your help!

Google git-extras. It has a git summary that does this.

git shortlog -n -s gets you the same data. On the git repository, for example (piped to head to get higher numbers):
$ git shortlog -n -s | head -4
11129 Junio C Hamano
1395 Shawn O. Pearce
1103 Linus Torvalds
896 Jeff King
To get a different delimiter, you could pipe it to awk:
$ git shortlog -n -s | awk 'BEGIN{OFS="|";} { $1=$1; print $0 }' | head -4
11129|Junio|C|Hamano
1395|Shawn|O.|Pearce
1103|Linus|Torvalds
896|Jeff|King
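Note that the awk command above rewrites every run of whitespace, so multi-word names get split as well. If only the gap between the count and the name should change, a small sed filter is one option. This is a minimal sketch: count_pipe_name is a hypothetical helper name, and the sample input only imitates the layout of git shortlog -n -s output.

```shell
# Hypothetical helper (not part of git): reads "git shortlog -n -s" style
# lines ("<spaces><count><tab><name>") on stdin and prints "<count>|<name>".
count_pipe_name() {
  # Replace only the leading count and the whitespace right after it,
  # so spaces inside the name are left alone.
  sed -E 's/^[[:space:]]*([0-9]+)[[:space:]]+/\1|/'
}

# Sample input imitating the shortlog layout shown above.
printf '  11129\tJunio C Hamano\n   1395\tShawn O. Pearce\n' | count_pipe_name
# prints:
# 11129|Junio C Hamano
# 1395|Shawn O. Pearce
```

Inside a repository this would be used as git shortlog -n -s | count_pipe_name | head -4.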

You can get the full power of PCRE (which should match your experiments with Rubular) with a Perl one-liner:
perl -ane 'print if /^Author: /'
Just extend that pattern as necessary.
To reformat you can use awk (e.g. awk '{printf "%5d\t%s\n", $1, $2}')
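For the pipeline in the question itself, one way to drop the mail address is to strip it with sed before counting. The following is only a sketch under assumptions: author_counts is a hypothetical helper name, it expects git log output on stdin, and it assumes each author line has the form "Author: Name <email>".

```shell
# Hypothetical helper: reads "git log" output on stdin and prints
# "<count><tab><author name>", most active author first.
author_counts() {
  # 1. keep the author lines, 2. drop the prefix and the trailing <email>,
  # 3. count identical names, 4. put a tab between count and name.
  grep '^Author: ' |
    sed -E 's/^Author: //; s/ <[^>]*>$//' |
    sort | uniq -c | sort -nr |
    awk '{ count = $1; sub(/^ *[0-9]+ /, ""); print count "\t" $0 }'
}

# Sample input imitating "git log" output (addresses are made up).
printf '%s\n' \
  'commit 1111' 'Author: Daniel Schmidt <daniel@example.com>' \
  'commit 2222' 'Author: Peter <peter@example.com>' \
  'commit 3333' 'Author: Daniel Schmidt <daniel@example.com>' |
  author_counts
```

In a repository the call would be git log | author_counts. Because the separator is a tab, the result can be piped straight into column -t -s "$(printf '\t')" or a similar formatter.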

Related

How can I extract some strings using grep and a regular expression

I have some lines like:
2017-03-10 21:55:57.426 INFO es.sd.phase.kpi.KPIEventNotifier - ID-es2rxsf01v-54870-1489080967572-0-2605574 - KPI1: 52 ms [ValidationPhase:1#TransformationPhase:8#EnrichmentPhase:10#DynamicRouterPhase:4#PoseseadorPhase:29#generateACK:0#EndPhase:0]
The output of the grep command has to be:
2017-03-10 21:55:57.426 KPI1: 52 ms
I tried grouping both parts with:
tail -F file.log | grep -Po "(.*INFO).*(KPI1.*ms)"
But obviously it only shows:
2017-03-10 21:55:57.426 INFO es.sd.phase.kpi.KPIEventNotifier - ID-es2rxsf01v-54870-1489080967572-0-2605574 - KPI1: 52 ms
We need to avoid this part:
INFO es.sd.phase.kpi.KPIEventNotifier - ID-es2rxsf01v-54870-1489080967572-0-2605574 -
And only show this part:
2017-03-10 21:55:57.426 KPI1: 52 ms
Thanks
Javi
Instead of using grep plus another tool to filter its output, you can use awk, which is field-based. Using the default field separator (whitespace), you can write:
awk '$3=="INFO" && $8=="KPI1:"{print $1,$2,$8,$9,$10}' file.log
grep -o always prints the entire match; it cannot skip a variable-length stretch in the middle of it, and groups such as (?:INFO.*) or (?=INFO.*) do not change that. Use sed instead to output only the captured groups you need:
sed -En 's/^([-0-9.: ]+)INFO.*(KPI.+ms).*/\1\2/p' file.log
The -E option enables extended regular expressions (which have no lazy .*? quantifier, so a plain .* is used).
The -n option together with the p flag prints only the lines on which the substitution succeeded.
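A variant of the same sed idea with a more tightly anchored pattern, shown against the sample line from the question. Treat it as a sketch: kpi_summary is a hypothetical helper name, and the pattern assumes every relevant line starts with a date and a time followed by INFO and contains "KPI1: <number> ms".

```shell
# Hypothetical helper: reads log lines on stdin and prints
# "<date> <time> KPI1: <n> ms" for each matching line.
kpi_summary() {
  # \1 = date and time, \2 = the KPI1 measurement; everything else is dropped.
  sed -En 's/^([0-9-]+ [0-9:.]+) +INFO .* (KPI1: [0-9]+ ms).*/\1 \2/p'
}

line='2017-03-10 21:55:57.426 INFO es.sd.phase.kpi.KPIEventNotifier - ID-es2rxsf01v-54870-1489080967572-0-2605574 - KPI1: 52 ms [ValidationPhase:1#TransformationPhase:8#EnrichmentPhase:10#DynamicRouterPhase:4#PoseseadorPhase:29#generateACK:0#EndPhase:0]'
printf '%s\n' "$line" | kpi_summary
# prints: 2017-03-10 21:55:57.426 KPI1: 52 ms
```

For a live log this would be tail -F file.log | kpi_summary.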
It is easily solved with the cut command:
tail -F file.log | cut -f 1,2,8-10 -d " "
I often use cut instead of awk, since I think the syntax looks cleaner.

Match only first occurrence of digit

After a few hours of frustrated searching I can't figure this out.
I am piping input to grep; what I want to get is the first occurrence of any digit.
Example:
nmcli --version
nmcli tool, version 1.1.93
Pipe to grep with regex
nmcli --version |grep -o '[[:digit:]]'
Output:
1
1
9
3
What I want:
1
Yes, there is a way to do that with another pipe, but is there a "pure" single regex that does it?
With GNU grep:
nmcli --version | grep -Po ' \K[[:digit:]]'
Output:
1
See: Support of \K in regex
Although you want to avoid another process, it seems simplest just to add a head to your existing command...
grep -o '[[:digit:]]' | head -n1
echo "nmcli tool, version 1.1.93" |sed "s/[^0-9]//g" |cut -c1
1
echo "nmcli tool, version 1.1.93" |grep -o '[0-9]' |head -1
1
This can be seen as a stream editing task: reduce that one line to the first digit. Basic regex register-based referencing achieves the task:
$ echo "junk 1.2.3.4" | sed -e 's/.* \([0-9]\).*/\1/'
1
Traditionally, grep is best for searching for files and lines which match a pattern. This is why the grep solution requires the use of Perl regex; Perl regex has features that, in combination with -o, allow grep to escape "out of the box" and be used in ways it wasn't really intended: match X, but then output a substring of X. The solution is terse, but not portable to grep implementations that don't have PCRE.
Use [0-9] to match ASCII digits, by the way. The purpose of [[:digit:]] is to bring in locale-specific behavior: to be able to match digits other than just the ASCII 0x30 through 0x39.
It's fairly safe to say that nmcli isn't going to put out its --version using, say, Devanagari numerals, like १.२.३.४.
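The stream-editing idea can also be written so that it does not depend on a space before the digit and stops at the first line that contains one. A minimal sketch, using only basic regular expressions; first_digit is a hypothetical helper name.

```shell
# Hypothetical helper: prints the first ASCII digit found on stdin
# and stops reading after the first line that contains a digit.
first_digit() {
  # On the first line with a digit: strip the non-digits before it and
  # everything after it, print the result, then quit.
  sed -n '/[0-9]/{s/^[^0-9]*\([0-9]\).*/\1/p;q;}'
}

echo "nmcli tool, version 1.1.93" | first_digit
# prints: 1
```

With the real tool this would be nmcli --version | first_digit.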
You could use standard awk instead:
nmcli --version | awk 'match($0, /[[:digit:]]/) {print substr($0, RSTART, RLENGTH); exit}'
For example:
$ seq 11111 33333 | awk 'match($0, /[[:digit:]]/) {print substr($0, RSTART, RLENGTH); exit}'
1

Sed : print all lines after match

I got my search result using sed:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
But it only shows the part that I cut. How can I print all lines after a match?
I'm using zcat so I cannot use awk.
Thanks.
Edited :
This is my log file :
[01/09/2015 00:00:47] INFO=54646486432154646 from=steve idfrom=55516654455457 to=jone idto=5552045646464 guid=100021623456461451463 num=6 text=hi my number is 0 811 22 1/12 status=new survstatus=new
My aim is to find all users that spam my site with their telephone numbers (using grep "pattern") then print all the lines to get all the information about each spam. The problem is there may be matches in INFO or id, so I use sed to get the text first.
Printing all lines after a match in sed:
$ sed -ne '/pattern/,$ p'
# alternatively, if you don't want to print the match:
$ sed -e '1,/pattern/ d'
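The two sed forms above can be tried on a few lines of sample input. In this sketch from_match and after_match are hypothetical helper names, and the argument is assumed to be a basic regular expression that contains no "/".

```shell
# Hypothetical helpers; $1 is a basic regular expression without "/" in it.
from_match()  { sed -n "/$1/,\$p"; }   # print from the first matching line to the end
after_match() { sed "1,/$1/d"; }       # delete everything up to and including the match

printf 'a\nb\nc\nd\n' | from_match b    # prints b, c and d, one per line
printf 'a\nb\nc\nd\n' | after_match b   # prints c and d, one per line
```

One caveat with the second form: when the very first line matches, sed starts looking for the end of the range on line 2, so more is deleted than intended. GNU sed accepts 0,/pattern/ as the range start to cover that case.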
Filtering lines when pattern matches between "text=" and "status=" can be done with a simple grep, no need for sed and cut:
$ grep 'text=.*pattern.* status='
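If the goal is to test only the text field but still print the whole log line, the field can be isolated in awk first. This is a sketch under assumptions: match_in_text is a hypothetical helper name, the sample lines are made up in the style of the log shown in the question, and the text field is assumed to end at the first " status=".

```shell
# Hypothetical helper: prints every input line whose text=... field
# (up to " status=") matches the extended regular expression in $1.
match_in_text() {
  awk -v pat="$1" 'index($0, "text=") {
    t = $0
    sub(/.*text=/, "", t)      # drop everything up to and including text=
    sub(/ status=.*/, "", t)   # drop the status fields that follow
    if (t ~ pat) print         # print the complete original line
  }'
}

# The first line has 811 only in INFO, the second one has it in the text.
printf '%s\n' \
  '[01/09/2015 00:00:47] INFO=0811 from=steve text=hello there status=new survstatus=new' \
  '[01/09/2015 00:01:02] INFO=5 from=bob text=call 0 811 22 status=new survstatus=new' |
  match_in_text '811'
# prints only the second line
```

Since the helper reads standard input, it can follow zcat directly: zcat file* | match_in_text 'pattern'.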
You can use awk
awk '/pattern/,EOF'
n.b. don't be fooled: EOF is just an uninitialized variable, and by default 0 (false). So that condition cannot be satisfied until the end of file.
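A more explicit way to express the same thing in awk is a flag that is set on the first match and stays set. A minimal sketch; awk_from_match is a hypothetical helper name. It reads standard input, so it works behind zcat as well.

```shell
# Hypothetical helper: prints the first line matching the extended regular
# expression in $1 and every line after it.
awk_from_match() {
  # "found" is unset (false) until a line matches; afterwards the bare
  # pattern "found" is true for every line, which prints it.
  awk -v pat="$1" '$0 ~ pat { found = 1 } found'
}

printf 'a\nb\nc\n' | awk_from_match b
# prints b and c, one per line
```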
Perhaps this could be combined with all the previous answers using awk as well.
Maybe this is what you actually want? Find lines matching "pattern" and extract the field after text= up through just before status=?
zcat file* | sed -e '/pattern/s/.*text=\(.*\)status=[^/]*/\1/'
You are not revealing what pattern actually is -- if it's a variable, you cannot use single quotes around it.
Notice that \(.*\)status=[^/]* would match up through survstatus=new in your example. That is probably not what you want? There doesn't seem to be a status= followed by a slash anywhere -- you really should explain in more detail what you are actually trying to accomplish.
Your question title says "all line after a match" so perhaps you want everything after text=? Then that's simply
sed 's/.*text=//'
i.e. replace up through text= with nothing, and keep the rest. (I trust you can figure out how to change the surrounding script into zcat file* | sed '/pattern/s/.*text=//' ... oops, maybe my trust failed.)
The seldom used branch command will do this for you. Until you match, use n for next then branch to beginning. After match, use n to skip the matching line, then a loop copying the remaining lines.
cat file | sed -n -e ':start; /pattern/b match;n; b start; :match n; :copy; p; n ; b copy'
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
Instead, change the last two segments of your pipeline so that it reads:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | awk '$1 ~ "pattern" {print $0}'

grep: group capturing

I have the following string:
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
and I need to get the value of "scheme_version", which is 1234 in this example.
I have tried
grep -Eo "\"scheme_version\":(\w*)"
however it returns
"scheme_version":1234
How can I do that? I know I can add a sed call, but I would prefer to do it with a single grep.
You'll need to use a look behind assertion so that it isn't included in the match:
grep -Po '(?<=scheme_version":)[0-9]+'
This might work for you:
echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' |
sed -n 's/.*"scheme_version":\([^}]*\)}/\1/p'
1234
Sorry it's not grep, so disregard this solution if you like.
Or stick with grep and add:
grep -Eo "\"scheme_version\":(\w*)"| cut -d: -f2
I would recommend that you use jq for the job. jq is a command-line JSON processor.
$ cat tmp
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
$ cat tmp | jq .scheme_version
1234
As an alternative to the positive lookbehind method suggested by SiegeX, you can reset the match starting point to directly after scheme_version": with the \K escape sequence. E.g.,
$ grep -Po 'scheme_version":\K[0-9]+'
This restarts the matching process after having matched scheme_version":, and tends to perform far better than the positive lookbehind. Comparing the two on regex101 shows the reset-match-start method taking 37 steps and 1ms, while the positive lookbehind method takes 194 steps and 21ms.
You can compare the performance yourself on regex101 and you can read more about resetting the match starting point in the PCRE documentation.
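Both PCRE variants can be checked side by side on the string from the question. This sketch assumes a GNU grep built with PCRE support (the -P option); the variable names are only for illustration.

```shell
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

# Lookbehind: the prefix must precede the match but is not part of it.
with_lookbehind=$(printf '%s\n' "$json" | grep -Po '(?<=scheme_version":)[0-9]+')

# \K: match the prefix, then drop it from the reported match.
with_reset=$(printf '%s\n' "$json" | grep -Po 'scheme_version":\K[0-9]+')

echo "$with_lookbehind $with_reset"
# prints: 1234 1234
```

The "_id":"scheme_version" entry at the start of the string is not picked up by either pattern, because there the name is followed by ", rather than ":.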
To avoid using grep's PCRE feature, which is available in GNU grep but not in the BSD version, another method is to use ripgrep, e.g.
$ rg -o 'scheme_version.?:(\d+)' -r '$1' <file.json
1234
-r (--replace) replaces each match with the given text, in which capture group indices (e.g., $5) and names (e.g., $foo) are supported.
Another example with Python and json.tool module which can validate and pretty-print:
$ python -mjson.tool file.json | rg -o 'scheme_version[^\d]+(\d+)' -r '$1'
1234
Related: Can grep output only specified groupings that match?
You can do this:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | awk -F ':' '{print $4}' | tr -d '}'
Improving on @potong's answer, which only extracts "scheme_version", you can use this expression:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_id":["]*\([^(",})]*\)[",}].*/\1/p'
scheme_version
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_rev":["]*\([^(",})]*\)[",}].*/\1/p'
4-cad1842a7646b4497066e09c3788e724
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"scheme_version":["]*\([^(",})]*\)[",}].*/\1/p'
1234

bash script regex matching

In my bash script, I have an array of filenames like
files=( "site_hello.xml" "site_test.xml" "site_live.xml" )
I need to extract the characters between the underscore and the .xml extension so that I can loop through them for use in a function.
If this were python, I might use something like
re.match(r"site_(.*)\.xml", filename)
and then extract the first matched group.
Unfortunately this project needs to be in bash, so -- How can I do this kind of thing in a bash script? I'm not very good with grep or sed or awk.
Something like the following should work
files2=("${files[@]#site_}") # Strip the leading site_ from each element
files3=("${files2[@]%.xml}") # Strip the trailing .xml
EDIT: After correcting those two typos, it does seem to work :)
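The closest bash counterpart to the Python approach from the question is the =~ operator inside [[ ]], which stores capture groups in the BASH_REMATCH array. A minimal sketch, assuming bash (not plain sh); the names array and the re variable are introduced here only for illustration.

```shell
#!/usr/bin/env bash
# Requires bash: arrays, [[ ... =~ ... ]] and BASH_REMATCH are not POSIX sh.
files=( "site_hello.xml" "site_test.xml" "site_live.xml" )

# Keeping the regex in a variable avoids quoting surprises inside [[ ]].
re='^site_(.*)\.xml$'

names=()
for f in "${files[@]}"; do
  if [[ $f =~ $re ]]; then
    names+=( "${BASH_REMATCH[1]}" )   # first capture group
  fi
done

printf '%s\n' "${names[@]}"
# prints hello, test and live, one per line
```

Compared with the parameter expansions above, this form skips any element that does not match the pattern instead of passing it through unchanged.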
xbraer@NO01601 ~
$ VAR=`echo "site_hello.xml" | sed -e 's/.*_\(.*\)\.xml/\1/g'`
xbraer@NO01601 ~
$ echo $VAR
hello
xbraer@NO01601 ~
$
Does this answer your question?
Just run the variables through sed in backticks (``)
I don't remember the array syntax in bash, but I guess you know that well enough yourself, if you're programming bash ;)
If it's unclear, don't hesitate to ask again. :)
I'd use cut to split the string.
for i in site_hello.xml site_test.xml site_live.xml; do echo $i | cut -d'.' -f1 | cut -d'_' -f2; done
This can also be done in awk:
for i in site_hello.xml site_test.xml site_live.xml; do echo $i | awk -F'.' '{print $1}' | awk -F'_' '{print $2}'; done
If you're using arrays, you probably should not be using bash.
A more appropriate example would be
ls site_*.xml | sed 's/^site_//' | sed 's/\.xml$//'
This produces output consisting of the parts you wanted. Backtick or redirect as needed.