Extracting the number of insertions and deletions from git diff using sed - regex

I'm using git diff --shortstat filename to get the number of lines changed in my files. An example of the output is as follows:
1 file changed, 1 insertion(+), 1 deletion(-)
Now I want to pipe that command through sed and extract only the number of insertions and deletions, in this case 1 and 1.
I'm using sed to match and extract the groups, but all I get is the same text from the git command again. My command is as follows (trying to get only the number of insertions):
sed "s/\([0-9]+\) insertion/\1/"
So a complete execution looks like this:
$ git diff --shortstat filename | sed 's/\([0-9]+\) insertion/\1/'
> 1 file changed, 1 insertion(+), 1 deletion(-)
What do I need to change to get this to work, or is there another way to do it?

You may use this sed command to extract both the insertion and deletion numbers:
git diff --shortstat filename |
sed -E 's/.* ([0-9]+) insertion.* ([0-9]+) deletion.*/\1,\2/'
This will produce a comma-delimited pair of numbers, e.g.
1,1
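If you then want the two counts in shell variables, here is a minimal sketch (assuming bash; note that --shortstat drops the insertion or deletion part entirely when that count is zero, so the sed only matches when both are present):
# split the "ins,del" pair produced by the sed above into two variables
IFS=, read -r ins del < <(git diff --shortstat filename |
  sed -E 's/.* ([0-9]+) insertion.* ([0-9]+) deletion.*/\1,\2/')
echo "insertions: $ins, deletions: $del"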

Related

sed / awk - remove space in file name

I'm trying to replace the whitespace within file names, but not the spaces that separate the names.
Input:
echo "File Name1.xml File Name3 report.xml" | sed 's/[[:space:]]/__/g'
However, the output is:
File__Name1.xml__File__Name3__report.xml
Desired output
File__Name1.xml File__Name3__report.xml
You named awk in the title of the question, didn't you?
$ echo "File Name1.xml File Name3 report.xml" | \
> awk -F'.xml *' '{for(i=1;i<=NF;i++){gsub(" ","_",$i); printf i<NF?$i ".xml ":"\n" }}'
File_Name1.xml File_Name3_report.xml
$
-F'.xml *' instructs awk to split on a regex, the requested extension plus 0 or more spaces
the loop for(i=1;i<=NF;i++) is executed for all the fields into which the input line(s) are split. Note that the last field is empty (it is what follows the last extension), but we are going to take that into account...
the body of the loop
gsub(" ","_", $i) substitutes all the occurrences of space to underscores in the current field, as indexed by the loop variable i
printf i<NF?$i ".xml ":"\n" outputs different things: if i<NF it's a regular field, so we append the extension and a space; otherwise i equals NF and we just want to terminate the output line with a newline.
It's not perfect, it appends a space after the last filename. I hope that's good enough...
▶    A D D E N D U M    ◀
I'd like to address:
the little buglet of the last space...
some of the issues reported by Ed Morton
generalize the extension provided to awk
To reach these goals, I've decided to wrap the scriptlet in a shell function which, since it changes spaces into underscores, is named s2u
$ s2u () { awk -F'\.'$1' *' -v ext=".$1" '{
> NF--;for(i=1;i<=NF;i++){gsub(" ","_",$i);printf "%s",$i ext (i<NF?" ":"\n")}}'
> }
$ echo "File Name1.xml File Name3 report.xml" | s2u xml
File_Name1.xml File_Name3_report.xml
$
It's a bit different (better?) because it does not special-case printing the last field but instead special-cases the delimiter appended to each field, but the idea of splitting on the extension remains.
This seems like a good start if the filenames aren't clearly delimited:
((?:\S.*?)?\.\w{1,})\b
( // start of captured group
(?: // non-captured group
\S.*? // a non-white-space character, then 0 or more any character
)? // 0 or 1 times
\. // a dot
\w{1,} // 1 or more word characters
) // end of captured group
\b // a word boundary
You'll have to look up how a PCRE pattern converts to a shell pattern. Alternatively, it can be run from a Python/Perl/PHP script.
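If GNU grep is available, here is a rough sketch of that pattern in action (the -P flag enables PCRE; note it prints one filename per line instead of rebuilding the original line):
# pull each filename out of the line, then replace the spaces inside each match
echo "File Name1.xml File Name3 report.xml" |
grep -oP '((?:\S.*?)?\.\w{1,})\b' |
sed 's/[[:space:]]/_/g'
which should print:
File_Name1.xml
File_Name3_report.xml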
Assuming you are asking how to rename files, and not how to remove spaces in a list of file names being used for some other purpose, here are the long way and the short way. The long way uses sed. The short way uses rename. If you are not trying to rename files, your question is quite unclear and should be revised.
If the goal is to simply get a list of xml file names and change them with sed, the bottom example is how to do that.
directory contents:
ls -w 2
bob is over there.xml
fred is here.xml
greg is there.xml
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
echo "${a_glob[i]}";
done
shopt -u nullglob
# output
bob is over there.xml
fred is here.xml
greg is there.xml
# then rename them
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
# I prefer 'rename' for such things
# rename 's/[[:space:]]/_/g' "${a_glob[i]}";
# but sed works, can't see any reason to use it for this purpose though
mv "${a_glob[i]}" $(sed 's/[[:space:]]/_/g' <<< "${a_glob[i]}");
done
shopt -u nullglob
result:
ls -w 2
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
Globbing is what you want here because of the spaces in the names.
However, this is really a complicated solution, when actually all you need to do is:
cd [your space containing directory]
rename 's/[[:space:]]/_/g' *.xml
and that's it, you're done.
If, on the other hand, you are trying to create a list of file names, you'd certainly want the globbing method, which will do what you want there too if you just modify the statement, that is, use sed to change only the printed name.
If your goal is to change the filenames for output purposes, and not rename the actual files:
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
echo "${a_glob[i]}" | sed 's/[[:space:]]/_/g';
done
shopt -u nullglob
# output:
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
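If the loop is only there for display, a shorter sketch (assuming the names contain spaces but no newlines) is to print one name per line with printf and run the whole list through sed once:
printf '%s\n' *.xml | sed 's/[[:space:]]/_/g'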
You could use rename:
rename --nows *.xml
This will replace all the spaces in the names of the xml files in the current folder with _.
Sometimes it comes without the --nows option, so you can then use a search and replace:
rename 's/[[:space:]]/__/g' *.xml
Optionally, you can use --dry-run if you just want to print the resulting filenames without renaming anything.
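If rename isn't installed at all, here is a plain-bash sketch using parameter expansion that does the same renaming (the glob only picks up .xml names that actually contain a space, so nothing is moved onto itself):
shopt -s nullglob
for f in *\ *.xml; do
  # replace every whitespace character in the name with an underscore
  mv -- "$f" "${f//[[:space:]]/_}"
done
shopt -u nullglob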

Remove the data before the second repeated specified character in linux

I have a text file with the data below:
AB-NJCFNJNVNE-802ac94f09314ee
AB-KJNCFVCNNJNWEJJ-e89ae688336716bb
AB-POJKKVCMMMMMJHHGG-9ae6b707a18eb1d03b83c3
AB-QWERTU-55c3375fb1ee8bcd8c491e24b2
I need to remove the data before the second hyphen (-) and produce another text file with the below output:
802ac94f09314ee
e89ae688336716bb
9ae6b707a18eb1d03b83c3
55c3375fb1ee8bcd8c491e24b2
I am pretty new to Linux and have been trying sed commands unsuccessfully for the last couple of hours. How can I get the desired output with sed or any other useful command like awk?
You can use a simple cut call:
$ cat myfile.txt | cut -d"-" -f3- > myoutput.txt
Edit:
Some explanation, as requested in the comments:
cut breaks up a string of text to fields according to a given delimiter.
-d defines the delimiter, - in this case.
-f defines which fields to output. In this case, we want to eliminate everything before the second hyphen, or, in other words, return the third field and onwards (3-).
The rest of the command is just piping the output: cat-ing the file into cut, and then saving the result to an output file.
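As a quick illustration of -f3- on one of the sample lines:
echo "AB-KJNCFVCNNJNWEJJ-e89ae688336716bb" | cut -d"-" -f3-
e89ae688336716bb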
Or, using sed:
cat myfile.txt | sed -e 's/^.\+-//'
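Note that the greedy .\+- in that sed strips everything up to the last hyphen, which happens to be fine for this data; if the part you want to keep could itself contain hyphens, a stricter sketch anchors on exactly the first two hyphens:
sed 's/^[^-]*-[^-]*-//' myfile.txt > myoutput.txt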

How to delete a specific number of random lines matching a pattern

I have an svg file with a grid of dots represented by lines that have the word use in them. I would like to delete a specific number of random lines matching that use pattern, then save a new version of the file. This answer was very close.
So it will be a combination of this (delete one random line in a specific range):
sed -i '.svg' $((9 + RANDOM % 579))d /filename.svg
and this (delete all lines matching pattern use):
sed -i '.svg' /use/d /filename.svg
In other words, the logic would go something like this:
sed -i delete 'x' number of RANDOM lines matching 'use' from 'input.svg' and save to 'output.svg'
I'm running these commands from Terminal on a Mac and am inexperienced with syntax so formatting the command for that would be ideal.
Delete each line containing "use" with a probability of 10%:
awk '!/use/ || rand() > 0.10' file
Randomly delete exactly one line containing "use":
awk -v n="$(( RANDOM % $(grep -c "use" file) ))" '!/use/ || n-- != 0' file
Here's an example invocation:
$ cat file
some string
a line containing "use"
another use-ful line
more random data
$ awk -v n="$(( RANDOM % $(grep -c "use" file) ))" '!/use/ || n-- != 0' file
some string
another use-ful line
more random data
One of the lines containing use was removed.
This might work for you (GNU sed & sort):
sed -n '/\<use\>/=' file | sort -R | head -5 | sed 's/$/d/' | sed -i.bak -f - file
Extract the line numbers of the lines containing the word use from the file, randomly sort those line numbers, then take the first (say) 5 and build a sed script to delete them from the original file.
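Here is a sketch of the same pipeline parameterized on a count and writing to a new file instead of editing in place (assumes GNU sed and GNU sort, e.g. installed via coreutils on a Mac):
n=10   # how many matching lines to delete
sed -n '/\<use\>/=' input.svg | sort -R | head -n "$n" | sed 's/$/d/' |
sed -f - input.svg > output.svg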

Find duplicates of a file by name in a directory recursively - Linux

I have a folder which contains subfolders and some more files in them.
The files are named in the following way
abc.DEF.xxxxxx.dat
I'm trying to find duplicate files by matching only the 'xxxxxx' part of the above pattern, ignoring the rest. The extension .dat doesn't change, but the length of abc and DEF might change. The order of separation by periods also doesn't change.
I'm guessing I need to use Find in the following way
find -regextype posix-extended -regex '\w+\.\w+\.\w+\.dat'
I need help coming up with the regular expression. Thanks.
Example:
For a file named 'epg.ktt.crwqdd.dat', I need to find duplicate files containing 'crwqdd'.
You can use awk for that:
find . -type f -name '*.dat' | awk -F. 'a[$4]++'
Explanation:
Let find give the following output:
./abd.DdF.TTDFDF.dat
./cdd.DxdsdF.xxxxxx.dat
./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat
Basically, spoken in computer terms, you want to count the occurrences of the token between the last dot and .dat and print those lines where that token appears for at least the second time.
To achieve this we split the file names on the . which gives us 5(!) fields:
echo ./abd.DEF.xxxxxx.dat | awk -F. '{print $1 " " $2 " " $3 " " $4 " " $5}'
/abd DEF xxxxxx dat
Note the first, empty field. The pattern of interest is $4.
To count the occurrences of a pattern in $4 we use an associative array a and increment its value on each occurrence. Unoptimized, the awk command will look like:
... | awk -F. '{if(a[$4]++ > 0){print}}'
However, you can write an awk program in the form:
CONDITION { ACTION }
What will give us:
... | awk -F. 'a[$4]++ > 0 {print}'
print is the default action in awk. It prints the whole current line. As it is the default action it can be omitted. Also the > 0 check can be omitted because awk treats integer values greater than zero as true. This gives us the final command:
... | awk -F. 'a[$4]++'
To generalize the command we can say the pattern of interest isn't the 4th column, it is the next-to-last column. This can be expressed using awk's number-of-fields variable NF:
... | awk -F. 'a[$(NF-1)]++'
Output:
./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat
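If you also want to see the first file of each duplicate group, not just the repeats, a hedged sketch is to buffer the lines and print every one whose token occurs more than once:
find . -type f -name '*.dat' | awk -F. '
  { cnt[$(NF-1)]++; line[NR]=$0; key[NR]=$(NF-1) }                 # count tokens, remember lines
  END { for (i=1; i<=NR; i++) if (cnt[key[i]] > 1) print line[i] }
'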

How to optimize a grep regular expression to match a URL

Background:
I have a directory called "stuff" with 26 files (2 .txt and 24 .rtf) on Mac OS 10.7.5.
I'm using grep (GNU v2.5.1) to find all strings within these 26 files that match the structure of a URL, then print them to a new file (output.txt).
The regex below does work on a small scale. I ran it on a directory with 3 files (1 .rtf and 2 .txt) containing a bunch of dummy text and 30 URLs, and it executed successfully in less than 1 second.
I am using the following regex:
1
grep -iIrPoh 'https?://.+?\s' . --include=*.txt --include=*.rtf > output.txt
Problem
The current size of my directory "stuff" is 180 KB with 26 files. In terminal, I cd to this directory (stuff) then run my regex. I waited about 15 minutes and decided to kill the process as it did NOT finish. When I looked at the output.txt file, it was a whopping 19.75GB (screenshot).
Question
What could be causing the output.txt file to be so many orders of magnitude larger than the entire directory?
What could I change in my regex to streamline the processing time?
Thank you in advance for any guidance you can provide here. I've been working on many different variations of my regex for almost 16 hours, and have read tons of posts online but nothing seems to help. I'm new to writing regex, but with a small bit of hand holding, I think I'll get it.
Additional Comments
I ran the following command to see what was being recorded in the output.txt (19.75GB) file. It looks like the regex is finding the right strings, with the exception of what I think are odd characters like: curly braces } { and a string like: {\fldrslt
**TERMINAL**
$ head -n 100 output.txt
http://michacardenas.org/\
http://culturelab.asc.upenn.edu/2013/03/06/calling-all-wearable-electronics-hackers-e-textile-makers-and-fashion-activists/\
http://www.mumia-themovie.com/"}}{\fldrslt
http://www.mumia-themovie.com/}}\
http://www.youtube.com/watch?v=Rvk2dAYkHW8\
http://seniorfitnesssite.com/category/senior-fitness-exercises\
http://www.giac.org/
http://www.youtube.com/watch?v=deOCqGMFFBE"}}{\fldrslt
http://www.youtube.com/watch?v=deOCqGMFFBE}}
https://angel.co/jason-a-hoffman\
https://angel.co/joyent?save_req=mention_slugs"}}{\fldrslt
http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html"}}{\fldrslt
http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html}}
http://www.cooking-hacks.com/index.php/documentation/tutorials/ehealth-biometric-sensor-platform-arduino-raspberry-pi-medical"}}{\fldrslt
http://www.cooking-hacks.com/index.php/documentation
Catalog of regex commands tested so far
2
grep -iIrPoh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_2.txt)
3
grep -iIroh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_3.txt)
4
grep -iIrPoh 'https?://\S+\s' . --include=*.txt --include=*.rtf > sixth.txt
FAIL: took 1 second to run / produced blank file (output_4.txt)
5
grep -iIroh 'https?://' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_5.txt)
6
grep -iIroh 'https?://\S' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_6.txt)
7
grep -iIroh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_7.txt)
8
grep -iIrPoh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: let run for 1O mins and manually killed process / produced 20.63 GB file (output_8.txt) / On the plus side, this regex captured strings that were accurate in the sense that they did NOT include any odd additional characters like curly braces or RTF file format syntax {\fldrslt
9
find . -print | grep -iIPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_9.txt
FAIL: took 1 second to run / produced blank file (output_9.txt)
10
find . -print | grep -iIrPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_10.txt
FAIL: took 1 second to run / produced blank file (output_10.txt)
11
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
Editor's note: this regex only worked properly when I output strings to the terminal window. It did not work when I output to a file output_11.txt.
NEAR SUCCESS: All URL strings were cleanly cut to remove white space before and after string, and removed all special markup associated with .RTF format. Downside: of the sample URLs tested for accuracy, some were cut short losing their structure at the end. I'm estimating that about 10% of strings were improperly truncated.
Example of truncated string:
URL structure before the regex: http://www.youtube.com/watch?v=deOCqGMFFBE
URL structure after the regex: http://www.youtube.com/watch?v=de
The question now becomes:
1.) Is there a way to ensure we do not eliminate a part of the URL string as in the example above?
2.) Would it help to define an escape command for the regex? (if that is even possible).
12
grep -iIroh 'https?:\/\/[\w~#%&_+=,.?\/-]+' . --include=*.txt --include=*.rtf > output_12.txt
FAIL: took 1 second to run / produced blank file (output_12.txt)
13
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > tmp/output.txt
FAIL: let run for 2 mins and manually killed process / produced 1 GB file. The intention of this regex was to isolate grep's output file (output.txt) into a subdirectory to ensure we weren't creating an infinite loop that had grep reading back its own output. Solid idea, but no cigar (screenshot).
14
grep -iIroh 'https\?://[a-z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
FAIL: Same result as #11. The command resulted in an infinite loop with truncated strings.
15
grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
ALMOST WINNER: This captured the entirety of the URL string. It did result in an infinite loop creating millions of strings in terminal, but I can manually identify where the first loop starts and ends so this should be fine. GREAT JOB #acheong87! THANK YOU!
16
find . -print | grep -v output.txt | xargs grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' --include=*.txt --include=*.rtf > output.txt
NEAR SUCCESS: I was able to grab the ENTIRE URL string, which is good. However, the command turned into an infinite loop. After about 5 seconds of running output to terminal, it produced about 1 million URL strings, which were all duplicates. This would have been a good expression if we could figure out how to escape it after a single loop.
17
ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIF 'http' {} grep -iIFo > output.txt
NEAR SUCCESS: this command resulted in a single loop through all files in the directory, which is good b/c it solved the infinite loop problem. However, the structure of the URL strings was truncated and included the filename the strings came from.
18
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
NEAR SUCCESS: This expression prevented an infinite loop, which is good; it created a new file in the directory it was querying which was small, about 30KB. It captured all the proper characters in the string and a couple that were not needed. As Floris mentioned, in the instances where the URL was NOT terminated with a space - for example http://www.mumia-themovie.com/"}}{\fldrslt - it captured the markup syntax.
19
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[a-z./?#=%_-,~&]+'
FAIL: This expression prevented an infinite loop which is good, however it did NOT capture the entire URL string.
The expression I had given in the comments (your test 17) was intended to test two things:
1) can we make the infinite loop go away
2) can we loop over all files in the directory cleanly
I believe we achieved both. So now I am audacious enough to propose a "solution":
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
Breaking it down:
ls *.rtf *.txt - list all .rtf and .txt files
grep -v 'output.txt' - skip 'output.txt' (in case it was left from a previous attempt)
xargs - "take each line of the input in turn and substitute it
- at the end of the following command
- (or use -J xxx to sub at place of xxx anywhere in command)
grep -i - case insensitive
-I - skip binary (shouldn't have any since we only process .txt and .rtf...)
-o - print only the matched bit (not the entire line), i.e. just the URL
-h - don't include the name of the source file
-E - use extended regular expressions
'http - match starts with http (there are many other URLs possible... but out of scope for this question)
s? - next character may be an s, or is not there
:// - literal characters that must be there
[^[:space:]]+ - one or more "non space" characters (greedy... "as many as possible")
This seemed to work OK on a very simple set of files / URLs. I think that now that the iterating problem is solved, the rest is easy. There are tons of "URL validation" regexes online. Pick any one of them... the above expression really just searches for "everything that follows http until a space". If you end up with odd or missing matches let us know.
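For what it's worth, here is a sketch of the same idea that avoids parsing ls and keeps output.txt out of the search entirely, so grep can never read its own output (assumes find/xargs with -print0/-0, which both macOS and Linux have):
find . -type f \( -name '*.txt' -o -name '*.rtf' \) ! -name 'output.txt' -print0 |
xargs -0 grep -iIohE 'https?://[^[:space:]]+' > ../output.txt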
I'm guessing a bit but for a line like
http://a.b.com something foo bar
the pattern can match as
http://a.b.com
http://a.b.com something
http://a.b.com something foo
(always with space at the end).
But I don't know if grep tries to match the same line multiple times.
Better try 'https?://\S+\s' as the pattern.
"What could be causing the output.txt file to be so many orders of maginitude larger than the entire directory?" me thinks you are running a cycle with grep reading back its own output? Try directing the output to > ~/tmp/output.txt.