Formatting git log output with sed/awk/grep - regex

Summary / 'gist of' version,
if I have a set of messages with subject [SUB] and body [BODY] like below, How can I add a newline after the subject only if [BODY] exists (And replace the place holders with *)
[SUB] some subject. [BODY] some body lines
with newline chars and !### bunch of other *## chars
without [(BODY)] or [(SUB)]... and more stuff
[SUB] Another subject. with no body [BODY]
[SUB] another [BODY] some body.
I want this to be formatted like
* some subject.
some body lines
with newline chars and !### bunch of other *## chars
without [(BODY)] or [(SUB)]... and more stuff
* Another subject. with no body
* another
some body.
What I really wanna do,
So I am trying to auto-generate my CHANGELOG.md file from the git log output. The problem is, I need to put newline char only if the body of the commit message is non empty.
The current code looks like this, (broken into two lines)
git log v0.1.0..v0.1.2 --no-merges --pretty=format:'* %s -- %cn | \
[%h](http://github.com/../../commit/%H) %n%b' | grep -v Minor | grep . >> CHANGELOG.md
and a sample output,
* Added run information display (0.1.2) -- ... | [f9b1f6c](http://github.com/../../commit/...)
+ Added runs page to show a list of all the runs and run inforation, include sorting and global filtering.
+ Updated run information display panel on the run-info page
+ Changed the links and their names around.
* Update README.md -- abc | [2a90998](http://github.com/../../commit/...)
* Update README.md -- xt | [00369bd](http://github.com/../../commit/...)
You see here, the lines starting with the * are the commits, and the lines starting on + are just a part of the body for the first commit. Right now it adds a %n (newline) in front of all the body sections regardless of whether it's empty or not. I want to add this ONLY if its non empty (probably even after removing the whitespaces)
How would I achieve this? my knowledge of sed and awk is almost non-existing, and trying to learn didn't help much.
(I will can make sure all the code in the body is indented, so it wont confuse list of commits with lists in the body)
My Answer
i'm sure jthills answer is correct (and maye even a better way to do it), but while I was looking to figure out what his meant, i came up wit this. Hope it will help myself or someone in he future,
I am pasting the full shell script that I used,
mv CHANGELOG.md CHANGELOG.md.temp
printf '### Version '$1' \n\n' > CHANGELOG.md
git log $2..$1 --no-merges --pretty=format:'[SUB]%s -- %cn | \
[%h](http://github.com/<user>/<gitrepo>/commit/%H) [BODY]%b' | grep -v Minor | \
sed '{:q;N;s/\s*\[BODY\][\n\s]*\[SUB\]/\n\[SUB\]/;b q}' | \
sed 's/\[SUB\]/* /g' |
sed 's/\[BODY\]/\n\n/'>> CHANGELOG.md
cat CHANGELOG.md.temp >> CHANGELOG.md
rm CHANGELOG.md.temp
I am basically prepending the new commit log to the CHANGELOG.md using the temp file. Please feel free to suggest shorter versions for this 3 sed commands

Tag your syntax in the git log output. This will handle inserting the newlines properly, the rest you know:
git log --pretty=tformat:'%s%xFF%x01%b%xFF%x02' \
| sed '1h;1!H;$!d;g # buffer it all (see comments for details)
s/\xFF\x01\xff\x02//g # strip null bodies
s/\xFF\x01/\n/g # insert extra newline before the rest
s/\xFF.//g # cleanup
'
(edit: quote/escape typos)

For your first file in your question, you could try the following:
awk -f r.awk input.txt
where input.txt is the input file, and r.awk is :
{
line=line $0 ORS
}
END {
while (getSub()) {
getBody()
print "* " subj
if (body) {
print ""
print body
}
}
}
function getBody(ind) {
ind=index(line,"[SUB]")
if (ind) {
body=substr(line,1,ind-1)
line=substr(line,ind)
}
else
body=line
sub(/^[[:space:]]*/,"",body)
sub(/[[:space:]]*$/,"",body)
}
function getSub(ind,ind2) {
ind=index(line,"[SUB]")
if (ind) {
ind=ind+5
ind2=index(line,"[BODY]")
subj=substr(line, ind, ind2-ind)
line=substr(line,ind2+6)
return 1
}
else
return 0
}
gives output:
* some subject.
some body lines
with newline chars and !### bunch of other *## chars
without [(BODY)] or [(SUB)]... and more stuff
* Another subject. with no body
* another
some body.

I wrestled with this way longer than expected, simply trying to get a git log output with some sed tweaking of the git message to format/extract our JIRA messages.
Here is my solution:
logsheet = "!f() { git log --format='%h ^ %<(80,trunc)%s ^ A:%<(20,trunc)%an ^ D:%ad ' --no-merges --date=short $1 | sed -e 's/\\([A-Z]*-[0-9]*\\)/\\1 ^/'; }; f"
The escapes, the shell function with a ! were all needed because I had an arg as well as a pipe.
:-)

Related

Read file until regex in changelog file using bash

I want to get content in file until find a regex in content. I need this to report my last changelog
Example:
## <small>0.27.3 (2019-03-18)</small>
* Fix integration tests
* Change log message
## <small>0.27.2 (2019-03-18)</small>
* Change find to filter
* Fix bug in typo
* Format message in request
I want a regex to return oly the content of my latest version. Example:
## <small>0.27.3 (2019-03-18)</small>
* Fix integration tests
* Change log message
How can I make this using sed, grep or awk?
Thanks for this
Edit:
I can made it:
CHANGELOG_MESSAGE=$(head -n 1 CHANGELOG.md)$(head -n 20 CHANGELOG.md | tail -n 19 | sed '/##/q' | awk '!/##/')
I think that this solution is a few complex, but works
try this:
sed '1p;1,/^##/!d;/##/d' CHANGELOG.md
explanation
1p # print first line
;
1,/^##/!d # delete everything after second ##-line
;
/##/d # delete all lines start with ##
output
## <small>0.27.3 (2019-03-18)</small>
* Fix integration tests
* Change log message

sed / awk - remove space in file name

I'm trying to remove whitespace in file names and replace them.
Input:
echo "File Name1.xml File Name3 report.xml" | sed 's/[[:space:]]/__/g'
However the output
File__Name1.xml__File__Name3__report.xml
Desired output
File__Name1.xml File__Name3__report.xml
You named awk in the title of the question, didn't you?
$ echo "File Name1.xml File Name3 report.xml" | \
> awk -F'.xml *' '{for(i=1;i<=NF;i++){gsub(" ","_",$i); printf i<NF?$i ".xml ":"\n" }}'
File_Name1.xml File_Name3_report.xml
$
-F'.xml *' instructs awk to split on a regex, the requested extension plus 0 or more spaces
the loop {for(i=1;i<=NF;i++) is executed for all the fields in which the input line(s) is(are) splitted — note that the last field is void (it is what follows the last extension), but we are going to take that into account...
the body of the loop
gsub(" ","_", $i) substitutes all the occurrences of space to underscores in the current field, as indexed by the loop variable i
printf i<NF?$i ".xml ":"\n" output different things, if i<NF it's a regular field, so we append the extension and a space, otherwise i equals NF, we just want to terminate the output line with a newline.
It's not perfect, it appends a space after the last filename. I hope that's good enough...
▶    A D D E N D U M    ◀
I'd like to address:
the little buglet of the last space...
some of the issues reported by Ed Morton
generalize the extension provided to awk
To reach these goals, I've decided to wrap the scriptlet in a shell function, that changing spaces into underscores is named s2u
$ s2u () { awk -F'\.'$1' *' -v ext=".$1" '{
> NF--;for(i=1;i<=NF;i++){gsub(" ","_",$i);printf "%s",$i ext (i<NF?" ":"\n")}}'
> }
$ echo "File Name1.xml File Name3 report.xml" | s2u xml
File_Name1.xml File_Name3_report.xml
$
It's a bit different (better?) 'cs it does not special print the last field but instead special-cases the delimiter appended to each field, but the idea of splitting on the extension remains.
This seems a good start if the filenames aren't delineated:
((?:\S.*?)?\.\w{1,})\b
( // start of captured group
(?: // non-captured group
\S.*? // a non-white-space character, then 0 or more any character
)? // 0 or 1 times
\. // a dot
\w{1,} // 1 or more word characters
) // end of captured group
\b // a word boundary
You'll have to look-up how a PCRE pattern converts to a shell pattern. Alternatively it can be run from a Python/Perl/PHP script.
Demo
Assuming you are asking how to rename file names, and not remove spaces in a list of file names that are being used for some other reason, this is the long and short way. The long way uses sed. The short way uses rename. If you are not trying to rename files, your question is quite unclear and should be revised.
If the goal is to simply get a list of xml file names and change them with sed, the bottom example is how to do that.
directory contents:
ls -w 2
bob is over there.xml
fred is here.xml
greg is there.xml
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[#]}; i++));do
echo "${a_glob[i]}";
done
shopt -u nullglob
# output
bob is over there.xml
fred is here.xml
greg is there.xml
# then rename them
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[#]}; i++));do
# I prefer 'rename' for such things
# rename 's/[[:space:]]/_/g' "${a_glob[i]}";
# but sed works, can't see any reason to use it for this purpose though
mv "${a_glob[i]}" $(sed 's/[[:space:]]/_/g' <<< "${a_glob[i]}");
done
shopt -u nullglob
result:
ls -w 2
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
globbing is what you want here because of the spaces in the names.
However, this is really a complicated solution, when actually all you need to do is:
cd [your space containing directory]
rename 's/[[:space:]]/_/g' *.xml
and that's it, you're done.
If on the other hand you are trying to create a list of file names, you'd certainly want the globbing method, which if you just modify the statement, will do what you want there too, that is, just use sed to change the output file name.
If your goal is to change the filenames for output purposes, and not rename the actual files:
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[#]}; i++));do
echo "${a_glob[i]}" | sed 's/[[:space:]]/_/g';
done
shopt -u nullglob
# output:
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
You could use rename:
rename --nows *.xml
This will replace all the spaces of the xml files in the current folder with _.
Sometimes it comes without the --nows option, so you can then use a search and replace:
rename 's/[[:space:]]/__/g' *.xml
Eventually you can use --dry-run if you want to just print filenames without editing the names.

Should I use AWK or SED to remove commas between quotation marks from a CSV file? (BASH)

I have a bunch of daily printer logs in CSV format and I'm writing a script to keep track of how much paper is being used and save the info to a database, but I've come across a small problem
Essentially, some of the document names in the logs include commas in them (which are all enclosed within double quotes), and since it's in comma separated format, my code is messing up and pushing everything one column to the right for certain records.
From what I've been reading, it seems like the best way to go about fixing this would be using awk or sed, but I'm unsure which is the best option for my situation, and how exactly I'm supposed to implement it.
Here's a sample of my input data:
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"MicrosoftWordDocument1",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,
And here's what I have so far:
#!/bin/bash
#Get today's file name
yearprefix="20"
currentdate=$(date +"%m-%d-%y");
year=${currentdate:6};
year="$yearprefix$year"
month=${currentdate:0:2};
day=${currentdate:3:2};
filename="papercut-print-log-$year-$month-$day.csv"
echo "The filename is: $filename"
# Remove commas in between quotes.
#Loop through CSV file
OLDIFS=$IFS
IFS=,
[ ! -f $filename ] && { echo "$Input file not found"; exit 99; }
while read time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
do
#Remove headers
if [ "$user" != "" ] && [ "$user" != "User" ]
then
#Remove any file name with an apostrophe
if [[ "$document" =~ "'" ]];
then
document="REDACTED"; # Lazy. Need to figure out a proper solution later.
fi
echo "$time"
#Save results to database
mysql -u username -p -h localhost -e "USE printerusage; INSERT INTO printerlogs (time, username, pages, copies, printer, document, client, size, pcl, duplex, greyscale, filesize) VALUES ('$time', '$user', '$pages', '$copies', '$printer', '$document', '$client', '$size', '$pcl', '$duplex', '$greyscale', '$filesize');"
fi
done < $filename
IFS=$OLDIFS
Which option is more suitable for this task? Will I have to create a second temporary file to get this done?
Thanks in advance!
As I wrote in another answer:
Rather than interfere with what is evidently source data, i.e. the stuff inside the quotes, you might consider replacing the field-separator commas (with say |) instead:
s/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g
And then splitting on | (assuming none of your data has | in it).
Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern
There is probably an easier way using sed alone, but this should work. Loop on the file, for each line match the parentheses with grep -o then replace the commas in the line with spaces (or whatever it is you would like to use to get rid of the commas - if you want to preserve the data you can use a non printable and explode it back to commas afterward).
i=1 && IFS=$(echo -en "\n\b") && for a in $(< test.txt); do
var="${a}"
for b in $(sed -n ${i}p test.txt | grep -o '"[^"]*"'); do
repl="$(sed "s/,/ /g" <<< "${b}")"
var="$(sed "s#${b}#${repl}#" <<< "${var}")"
done
let i+=1
echo "${var}"
done

Add specific source file types to svn recursively

I know this is an ugly script but it does the job.
What I am facing now is adding a few more extensions what would clutter the scrip even more.
How can I make it more modular?
Specifically, how can I write this long regular expression (source file extensions) on multiple lines? say one extension on each line. I guess I am doing something wrong with string concatenation but not quite sure what exactly.
Here's the original file:
#!/bin/bash
COMMAND='svn status'
XARGS='xargs'
SVN='svn add'
$COMMAND | grep -E '(\.m|\.mat|\.java|\.js|\.php|\.cpp|\.h|\.c|\.py|\.hs|\.pl|\.xml|\.html|\.sh|.\asm|\.s|\.tex|\.bib|.\Makefile|.\jpg|.\gif|.\png|.\css)'$ | awk ' { print$2 } ' | $XARGS $SVN
and here's roughly what I am aiming at
...code code
'(.\m|
\.mat|
\.js|
.
.
.
.\css)'
..more code here
Anybody?
I know this doesn't answer the question directly, but from a readability perspective, I think most developers would agree that a single-line regex is the most common way to do things and therefore the most maintainable approach. Further, I'm not sure why you're including a period in each of your extensions, this should only need to be used once.
I wrote this little script to automatically add all images to svn. You should be able to simply add extensions between the pipes in the regex to add or remove different file types. Note that it makes sure to only add files that are unrecognized by making sure each line starts with a "?" (^\?) and ends with a period (\.(extensions)$). Hope it's helpful!
#!/bin/bash
svn st | grep -E "^\?.*\.(png|jpg|jpeg|tiff|bmp|gif)$" > /tmp/svn-auto-add-img
while read output; do
FILE=$(echo $output | awk '{ print $2 }')
svn add $FILE
done < /tmp/svn-auto-add-img
exit 0
How about this:
PATTERNS="
\.foo
\.bar
\.baz"
# Put them into one list separated by or ("|").
PATTERNS=`echo $PATTERNS |sed 's/\s\+/|/g'`
$COMMAND | grep -E "($PATTERNS)"
(Note that this would not work if you put quotes around $PATTERNS in the call to echo -- echo is taking care of stripping whitespace and converting newlines to spaces for us.)
#!/bin/bash
COMMAND='svn status'
XARGS='xargs'
SVNADD='svn add'
pats=
pats+=' \.m'
pats+=' \.mat'
pats+=' \.java'
pats+=' \.js'
# add your 'or-able' sub patterns here
# build the full pattern
pattern='(';for pat in $pats;do pattern+="$pat|";done;pattern=${pattern%\|}')$'
# run grep with the generated pattern
files=$($COMMAND | grep -E ${pattern} | awk ' { print $NF } ')
if [ " $files" != " " ]
then
$COMMAND | grep -E ${pattern} | awk ' { print $NF } ' | $XARGS $SVNADD
fi

sed: return last occurrence match until end of file

Using sed, how do I return the last occurance of a match until the End Of File?
(FYI this has been simplified)
So far I've tried:
sed -n '/ Statistics |/,$p' logfile.log
Which returns all lines from the first match onwards (almost the entire file)
I've also tried:
$linenum=`tail -400 logfile.log | grep -n " Statistics |" | tail -1 | cut -d: -f1`
sed "$linenum,\$!d" logfile.log
This works but won't work over an ssh connection in one command, really need it all to be in one pipeline.
Format of the log file is as follows:
(There are statistics headers with sub data written to the log file every minute, the purpose of this command is to return the most recent Statistics header together with any associated errors that occur after the header)
Statistics |
Stuff
More Stuff
Even more Stuff
Statistics |
Stuff
More Stuff
Error: incorrect value
Statistics |
Stuff
More Stuff
Even more Stuff
Statistics |
Stuff
Error: error type one
Error: error type two
EOF
Return needs to be:
Statistics |
Stuff
Error: error type one
Error: error type two
Your example script has a space before Statistics but your sample data doesn't seem to. This has a regex which assumes Statistics is at beginning of line; tweak if that's incorrect.
sed -n '/^Statistics |/h;/^Statistics |/!H;$!b;x;p'
When you see Statistics, replace the hold space with the current line (h). Otherwise, append to the hold space (H). If we are not at the end of file, stop here (b). At end of file, print out the hold space (x retrieve contents of hold space; p print).
In a sed script, commands are optionally prefixed by an "address". Most commonly this is a regex, but it can also be a line number. The address /^Statistics |/ selects all lines matching the regular expression; /^Statistics |/! selects lines not matching the regular expression; and $! matches all lines except the last line in the file. Commands with no explicit address are executed for all input lines.
Edit Explain the script in some more detail, and add the following.
Note that if you need to pass this to a remote host using ssh, you will need additional levels of quoting. One possible workaround if it gets too complex is to store this script on the remote host, and just ssh remotehost path/to/script. Another possible workaround is to change the addressing expressions so that they don't contain any exclamation marks (these are problematic on the command line e.g. in Bash).
sed -n '/^Statistics |/{h;b};H;${x;p}'
This is somewhat simpler, too!
A third possible workaround, if your ssh pipeline's stdin is not tied up for other things, is to pipe in the script from your local host.
echo '/^Statistics |/h;/^Statistics |/!H;$!b;x;p' |
ssh remotehost sed -n -f - file
If you have tac available:
tac INPUTFILE | sed '/^Statistics |/q' | tac
This might work for you:
sed '/Statistics/h;//!H;$!d;x' file
Statistics |
Stuff
Error: error type one
Error: error type two
If you're happy with an awk solution, this kinda works (apart from getting an extra blank line):
awk '/^Statistics/ { buf = "" } { buf = buf "\n" $0 } END { print buf }' input.txt
sed ':a;N;$!ba;s/.*Statistics/Statistics/g' INPUTFILE
should work (GNU sed 4.2.1).
It reads the whole file to one string, then replaces everything from the start to the last Statistics (word included) with Statistics, and prints what's remaining.
HTH
This might also work, slightly more simple version of the sed solution given by the others above:
sed -n 'H; /^Statistics |/h; ${g;p;}' logfile.log
Output:
Statistics |
Stuff
Error: error type one
Error: error type two