Replace/substitute adversarial substring variables in shell script

I have three unescaped adversarial shell variables.
$mystring
$old
$new
Remember, all three strings are adversarial. They will contain special characters. They will contain everything possible to mess up the replace. If there is a loophole in your replace, the strings will exploit it.
What is the simplest function to replace $old with $new in $mystring?
(I couldn't find any solution in stack overflow for a generic substitution that will work in all cases).

There's nothing fancy here: to ensure that your values are treated as literals in a parameter expansion, quote them inside the expansion, as described in the relevant section of BashFAQ #21:
result=${mystring/"$old"/"$new"}
Without the double quotes on the inside, $old would be interpreted as a fnmatch-style glob expression; with them, it's literal. Quoting $new matters on bash 5.2 and later, where the patsub_replacement option is enabled by default and an unquoted & in the replacement expands to the matched text.
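As a minimal sketch of that quoting rule, the function below wraps the expansion. The name replace_first is made up for illustration and is not part of BashFAQ #21. It quotes the replacement as well as the search value, as a precaution for bash 5.2 and later, where an unquoted & in the replacement stands for the matched text:

```shell
#!/usr/bin/env bash
# replace_first MYSTRING OLD NEW   (hypothetical helper name)
# Prints MYSTRING with the first literal occurrence of OLD replaced by NEW.
replace_first() {
    local mystring=$1 old=$2 new=$3 result
    # "$old" quoted: no glob matching. "$new" quoted: '&' stays literal.
    result=${mystring/"$old"/"$new"}
    printf '%s\n' "$result"
}

# Only the first literal * is replaced; neither * nor & is treated specially.
replace_first 'a*b*c' '*' '&x'
```

To replace every occurrence rather than the first, the same quoting applies to ${mystring//"$old"/"$new"}.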
To operate on streams instead, consider gsub_literal, also described in BashFAQ #21:
# usage: gsub_literal STR REP
# replaces all instances of STR with REP. reads from stdin and writes to stdout.
gsub_literal() {
# STR cannot be empty
[[ $1 ]] || return
# string manip needed to escape '\'s, so awk doesn't expand '\n' and such
awk -v str="${1//\\/\\\\}" -v rep="${2//\\/\\\\}" '
# get the length of the search string
BEGIN {
len = length(str);
}
{
# empty the output string
out = "";
# continue looping while the search string is in the line
while (i = index($0, str)) {
# append everything up to the search string, and the replacement string
out = out substr($0, 1, i-1) rep;
# remove everything up to and including the first instance of the
# search string from the line
$0 = substr($0, i + len);
}
# append whatever is left
out = out $0;
print out;
}
'
}
some_command | gsub_literal "$search" "$rep"
...which can also be used for in-place replacement on files using techniques from the following (yet again taken from the previously-linked FAQ):
# Using GNU tools to preserve ownership/group/permissions
gsub_literal "$search" "$rep" < "$file" > tmp &&
chown --reference="$file" tmp &&
chmod --reference="$file" tmp &&
mv -- tmp "$file"

Related

Replace special characters except the following ,.@

I'm looking for an option to remove special characters from a file except for the following 3 items ,.@
The following awk command gets close but it removes all punctuation.
awk '{gsub(/[[:punct:]]/,"",except(".","@",","))}1' test.csv > test2.csv
Any ideas...

POSIX regular expressions have no character class subtraction and no lookarounds, so a generic pattern cannot be restricted with a few exceptions. The only way is to spell out the members of the POSIX character class that you do want to match.
According to Character Classes and Bracket Expressions:
‘[:punct:]’
Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
You may use
/[!-+\/:-?[-`{-~-]/
See the regex demo.
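Legend:
!-+ matches ! " # $ % & ' ( ) * +
\/ matches /
:-? matches : ; < = > ?
[-` matches [ \ ] ^ _ `
{-~ matches { | } ~
the trailing - matches a literal hyphen
The comma, the period and @ fall outside these ranges, so they are left alone.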
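To try the bracket expression from the shell, here is a small sketch. The function name strip_punct is made up for illustration, and LC_ALL=C is set on the assumption that the ranges should follow ASCII order:

```shell
# strip_punct [FILE...]   (hypothetical name)
# Deletes ASCII punctuation except the comma, the period and @.
# LC_ALL=C makes the ranges in the bracket expression follow ASCII order.
strip_punct() {
    LC_ALL=C awk '{ gsub(/[!-+\/:-?[-`{-~-]/, ""); print }' "$@"
}

printf '%s\n' 'a.b?c#d@e,f' | strip_punct
```

Spelling the ranges out like this only covers ASCII punctuation; input in other encodings may contain punctuation that the expression does not list.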

All 3 of these approaches will work in any locale and will work for any character class by just changing the class name and will work for other bracket expressions or strings etc.:
1) Just look for any punct but only change it if it's not one of the chars you don't want changed:
$ echo 'a.b?c#d@e,f' |
awk '{
    new = ""
    while ( match($0,/[[:punct:]]/) ) {
        chr = substr($0,RSTART,1)
        new = new substr($0,1,RSTART-1) (chr ~ /[,.@]/ ? chr : "")
        $0 = substr($0,RSTART+RLENGTH)
    }
    print new $0
}'
a.bcd@e,f
2) Turn the chars you don't want changed into other strings first then turn them back afterwards:
$ echo 'a.b?c#d@e,f' |
awk '{
    gsub(/a/,"aA"); gsub(/,/,"aB"); gsub(/\./,"aC"); gsub(/@/,"aD")
    gsub(/[[:punct:]]/,"")
    gsub(/aD/,"@"); gsub(/aC/,"."); gsub(/aB/,","); gsub(/aA/,"a")
    print
}'
a.bcd@e,f
Changing a into aA and back is what guarantees that the strings you create when converting the @, etc. are strings that cannot exist elsewhere in the input at that time and that's why you can safely convert them back afterwards.
3) Suffix the puncts with the RS value, then remove the RS suffix from the chars you don't want changed, then change the remaining RS-suffixed puncts:
$ echo 'a.b?c#d@e,f' |
awk '{
    gsub(/[[:punct:]]/,"&"RS)
    $0 = gensub("([,.@])"RS,"\\1","g")
    gsub("[[:punct:]]"RS,"")
    print
}'
a.bcd@e,f
That one uses GNU awk for gensub(), with other awks you'd need match()+substr().
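The first approach generalizes to any class name and any set of characters to keep. The following is only a sketch: the name strip_class_except and its argument order are invented here, and because awk -v processes backslash escapes, the keep set is assumed not to contain backslashes:

```shell
# strip_class_except CLASS KEEP [FILE...]   (hypothetical helper)
# Removes every character of the POSIX class CLASS (punct, digit, ...)
# unless it appears in the string KEEP.
strip_class_except() {
    local class=$1 keep=$2
    shift 2
    awk -v re="[[:${class}:]]" -v keep="$keep" '{
        out = ""
        while (match($0, re)) {
            chr = substr($0, RSTART, 1)
            # keep the character only if it is listed in KEEP
            out = out substr($0, 1, RSTART - 1) (index(keep, chr) ? chr : "")
            $0 = substr($0, RSTART + 1)
        }
        print out $0
    }' "$@"
}

printf '%s\n' 'a.b?c#d@e,f' | strip_class_except punct ',.@'
```

Using index() for the keep test avoids building a second regular expression, so characters such as ] or ^ in the keep set need no escaping.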

SED regex find (and remove) option from a command text

I have a config file with lines of the form param=option[,option...]. Using standard bash utilities, perhaps with the help of sed, I want to remove one option from the list.
#
param=aa,bb,cc
param=aa,bb
param=bb,cc
param=bb
#
in this example, I want to remove 'bb' (and the separator) from all lines, and in the last case, because 'bb' was the sole option, remove the complete line, so the final result will be
#
param=aa,cc
param=aa
param=cc
#
option 'bb' can be alone or at the start, center or end of the list. Obviously, 'bb' embedded on another option name (ie xxbb, bbxx, etc) should not be considered.
edit: fix typo, addn'l example

Here is a sed version to remove bb parameter from any position and delete the line if bb is the only parameter:
First the input file:
#
param=aa,bb,cc
param=aa,bb
param=bb,cc
param=bb
#
Now run this sed:
sed -E '/^param=/{/=bb$/d; s/,bb(,|$)/\1/; s/=bb,/=/;}' file
This will give:
#
param=aa,cc
param=aa
param=cc
#
To use inline editing use:
sed -i.bak -E '/^param=/{/=bb$/d; s/,bb(,|$)/\1/; s/=bb,/=/;}' file

Note: The solutions below do not address updating the input file; a simple (though not fully robust) approach is to use
awk '...' file > file.$$ && mv file.$$ file
A POSIX-compliant awk solution that should work robustly:
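awk -F'=' '
$1 != "param" { print; next }
{
    sub(/,bb,/, ",", $2)      # bb in the middle of the list
    sub(/^bb,/, "", $2)       # bb at the start of the list
    sub(/(^|,)bb$/, "", $2)   # bb at the end, or as the only option
    if ($2 != "") print $1 FS $2
}
' file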
GNU awk allows for a simpler solution, using its (nonstandard) gensub() function:
awk -F'=' '
$1 != "param" { print; next }
{
    newList = gensub(/(^|,)bb(,|$)/, "\\2", 1, $2)
    sub(/^,/, "", newList)   # drop the comma left behind when bb was the first option
    if (newList != "") print $1 FS newList
}
' file
A (POSIX-compliant) field-based alternative (more verbose, but perhaps easier to generalize):
awk -F'=' '
$1 != "param" { print; next }
{
n = split($2, opts, ","); optList = ""
for (i=1; i<=n; ++i) {
if (opts[i] != "bb") {
optList = optList (optList == "" ? "" : ",") opts[i]
}
}
if (optList != "") print $1 FS optList
}
' file

Let's say your Input_file is as follows:
param=aa,bb,cc
param=aa,bb
param=bb
Then the following code:
awk -F"=" '$2=="bb"{next} {sub(/,bb/,"");print}' Input_file
outputs:
param=aa,cc
param=aa

I'd use a temporary format to be able to find the occurrences easier. And to remove lines I would suggest using grep:
sed 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d'
the s/=/=,/ converts it to:
param=,aa,bb,cc
param=,aa,bb
param=,bb
then s/$/,/ to:
param=,aa,bb,cc,
param=,aa,bb,
param=,bb,
then s/,bb,/,/
param=,aa,cc,
param=,aa,
param=,
and s/=,/=/;s/,$// will remove the commas at the beginning and end again
removing empty options can be done with grep -v '=$', or some more advanced sed magic (so it can be still used with sed -i)
EDIT:
the "sed magic" is just appending '/=$/d'
tested this one, and it works fine:
sed -i 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d' filename
or
sed 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d' filename_in > filename_out
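The same temporary-comma trick can be wrapped in a function that takes the option name as an argument. This is a sketch only: the name remove_opt is invented, and the option name is assumed to contain no characters that are special to sed (such as /, ., * or &):

```shell
# remove_opt OPTION [FILE...]   (hypothetical helper)
# Surrounds the option list with temporary commas so that OPTION can be
# matched as ",OPTION," wherever it sits, then removes the extra commas
# and deletes lines whose option list became empty.
remove_opt() {
    local opt=$1
    shift
    sed 's/=/=,/;s/$/,/;s/,'"$opt"',/,/;s/=,/=/;s/,$//;/=$/d' "$@"
}

printf '%s\n' 'param=aa,bb,cc' 'param=aa,bb' 'param=bb,cc' 'param=bb' 'param=xxbb,bbxx' |
    remove_opt bb
```

Like the one-liner it is based on, this processes every line that contains an = sign, not only the param lines, and removes one occurrence of the option per line.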

copying first string into second line

I have a text file in this format:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375 Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375 aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375 abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Here I call the first string, before the first space, the word (for example abacası).
The string which starts after the first space and ends with a decimal number is a definition (for example Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875).
I want to do this: if a line includes more than one definition (the first line has one, the second line has two, the third line has three), start a new line for each additional definition and put the first string (the word) at the beginning of the new line. Expected output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
I have almost 1.500.000 lines in my text file and the number of definition is not certain for each line. It can be 1 to 5

Small python script does the job. Input is expected in input.txt, output goes to output.txt.
import re
# the word: first run of non-space characters plus the space after it
rf = re.compile(r'([^\s]+\s).+')
# one definition: "key : 12.345"
r = re.compile(r'([^\s]+\s\:\s\d+\.\d+)')
with open("input.txt", "r") as f:
    text = f.read()
with open("output.txt", "w") as f:
    for l in text.split('\n'):
        offset = 0
        first = ""
        match = re.search(rf, l[offset:])
        if match:
            first = match.group(1)
            offset = len(first)
            while True:
                match = re.search(r, l[offset:])
                if not match:
                    break
                s = match.group(1)
                offset += len(s)
                f.write(first + " " + s + "\n")

I am assuming the following format:
word definitionkey : definitionvalue [definitionkey : definitionvalue …]
None of those elements may contain a space and they are always delimited by a single space.
The following code should work:
awk '{ for (i=2; i<=NF; i+=3) print $1, $i, $(i+1), $(i+2) }' file
Explanation (this is the same code but with comments and more spaces):
awk '
# match any line
{
# iterate over each "key : value"
for (i=2; i<=NF; i+=3)
print $1, $i, $(i+1), $(i+2) # prints each "word key : value"
}
' file
awk has some tricks that you may not be familiar with. It works on a line-by-line basis. Each stanza has an optional condition before it (awk 'NF >= 4 {…}' would make sense here, since a line with fewer than four fields cannot hold a complete definition). NF is the number of fields, and a dollar sign ($) means we want the value of the given field, so $1 is the value of the first field, $NF is the value of the last field, and $(i+1) is the value of the third field (assuming i=2). print separates its arguments with spaces by default and adds a line break at the end (otherwise, we'd need printf "%s %s %s %s\n", $1, $i, $(i+1), $(i+2), which is a bit harder to read).
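A guarded variant of the same loop might look like the sketch below. The function name split_defs is invented for illustration, and the input is assumed to follow the "word key : value" layout described above:

```shell
# split_defs [FILE...]   (hypothetical name)
# Prints one "word key : value" line per definition.
split_defs() {
    awk 'NF >= 4 {
        # i + 2 <= NF skips a trailing, incomplete "key : value" group
        for (i = 2; i + 2 <= NF; i += 3)
            print $1, $i, $(i+1), $(i+2)
    }' "$@"
}

printf '%s\n' 'w k1 : 1.5 k2 : 2.5' | split_defs
```

The NF >= 4 condition silently drops lines that hold only a word; whether that is preferable to printing them unchanged depends on the data.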

With perl:
perl -a -F'[^]:]\K\h' -ne 'chomp(@F);$p=shift(@F);print "$p ",shift(@F),"\n" while(@F);' yourfile.txt
With bash:
while read -r line
do
pre=${line%% *}
printf '%s\n' "$line" | sed 's/\([0-9]\) /\1\n'"$pre"' /g'
done < "yourfile.txt"
This script reads the file line by line. For each line, the prefix is extracted with a parameter expansion (everything up to the first space), and each space preceded by a digit is replaced with a newline and the prefix using sed. The \n in the replacement is a GNU sed feature.
edit: as tripleee suggested it, it's much faster to do all with sed:
sed -i.bak ':a;s/^\(\([^ ]*\).*[0-9]\) /\1\n\2 /;ta' yourfile.txt

Assuming there are always 4 space-separated words for each definition:
awk '{for (i=1; i<NF; i+=4) print $i, $(i+1), $(i+2), $(i+3)}' file
Or if the split should occur after that floating point number
perl -pe 's/\b\d+\.\d+\K\s+(?=\S)/\n/g' file
(This is the perl equivalent of Avinash's answer)

Bash and grep:
#!/bin/bash
while IFS=' ' read -r in1 in2 in3 in4; do
if [[ -n $in4 ]]; then
prepend="$in1"
echo "$in1 $in2 $in3 $in4"
else
echo "$prepend $in1 $in2 $in3"
fi
done < <(grep -o '[[:alnum:]][^:]\+ : [[:digit:].]\+' "$1")
The output of grep -o is putting all definitions on a separate line, but definitions originating from the same line are missing the "word" at the beginning:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
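The while loop now reads this output, using a space as the input field separator. If in4 is a non-empty string, the line starts with the "word", so we remember it and print the line as is; if in4 is a zero length string, we're on a line where the "word" is missing, so we prepend it.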
The script takes the input file name as its argument, and saving output to an output file can be done with simple redirection:
./script inputfile > outputfile

Using perl:
$ perl -nE 'm/([^ ]*) (.*)/; my $word=$1; $_=$2; say $word . " " . $_ for / *(.*?[0-9]+\.[0-9]+)/g;' < input.log
Output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Explanation:
Split the line to separate first field as word.
Then split the remaining line using the regex .*?[0-9]+\.[0-9]+.
Print word concatenated with every match of above regex.

I would approach this with one of the excellent Awk answers here; but I'm posting a Python solution to point to some oddities and problems with the currently accepted answer:
It reads the entire input file into memory before processing it. This is harmless for small inputs, but the OP mentions that the real-world input is kind of big.
It needlessly uses re when simple whitespace tokenization appears to be sufficient.
I would also prefer a tool which prints to standard output, so that I can redirect it where I want it from the shell; but to keep this compatible with the earlier solution, this hard-codes output.txt as the destination file.
with open('input.txt', 'r') as input:
    with open('output.txt', 'w') as output:
        for line in input:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            word = tokens[0]
            # each definition is three tokens: key, ':', value
            for idx in range(1, len(tokens), 3):
                print(word, ' '.join(tokens[idx:idx+3]), file=output)
If you really, really wanted to do this in pure Bash, I suppose you could:
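set -f  # disable globbing: the analyses contain [ and ], which the shell would otherwise treat as patterns
while read -r word analyses; do
    set -- $analyses
    while [ $# -gt 0 ]; do
        printf "%s %s %s %s\n" "$word" "$1" "$2" "$3"
        shift; shift; shift
    done
done <input.txt >output.txt
set +f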

Please find the following bash code
#!/bin/bash
# read.sh
while read variable
do
for i in "$variable"
do
var=`echo "$i" |wc -w`
array_1=( $i )
counter=0
for((j=1 ; j < $var ; j++))
do
if [ $counter = 0 ] #1
then
echo -ne ${array_1[0]}' '
fi #1
echo -ne ${array_1[$j]}' '
counter=$(expr $counter + 1)
if [ $counter = 3 ] #2
then
counter=0
echo
fi #2
done
done
done
I have tested and it is working.
To test
On bash shell prompt give the following command
$ ./read.sh < input.txt > output.txt
where read.sh is script , input.txt is input file and output.txt is where output is generated

here is a sed in action
sed -r '/^indirger(ken|di)/{s/([0-9]+[.][0-9]+ )(indirge)/\1\n\2/g}' my_file
output
indirgerdi indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]+YDH[Past] : 22.2626953125
indirge[Verb]+[Pos]+Hr[Aor]+YDH[Past]+[A3sg] : 18.720703125
indirgerken indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]-Yken[Adv+While] : 19.6201171875

Awk replace entire line when match is found

I have the following code:
function replaceappend() {
awk -v old="^$2" -v new="$3" '
sub(old,new) { replaced=1 }
{ print }
END { if (!replaced) print new }
' "$1" > /tmp/tmp$$ &&
mv /tmp/tmp$$ "$1"
}
replaceappend "/etc/ssh/sshd_config" "Port" "Port 222"
It works perfectly, but I am looking to modify it so that it replaces the entire line's contents rather than just the matching text.
At the moment it would do this:
Port 1234 -> Port 222 1234
I want it to work like this:
Port 1234 -> Port 222
The closest code I can find to do this is found here:
awk 'NR==4 {$0="different"} { print }' input_file.txt
This would replace the entire line of the match with the new content. How can I implement this into my existing code?

Just change:
sub(old,new) { replaced=1 }
to:
$0~old { $0=new; replaced=1 }
or:
sub(".*"old".*",new) { replaced=1 }

If you want to replace the entire line you can simplify your function. To avoid problems with metacharacters in the variables you pass to awk, I would suggest using a simple string search too:
awk -vold="$2" -vnew="$3" 'index($0,old)==1{f=1;$0=new}1;END{if(!f)print new}' "$1"
index returns the character position of the string that you are searching for, starting at 1. If the string old is at the start of the line, then the line is changed to the value of new. The 1 after the block is always true so every line is printed (this is a common shorthand for an unconditional {print} block).
As mklement0 has pointed out in the comments, the variables you pass to awk are still subject to some interpretation: for example, the string \n will be interpreted as a newline character, \t as a tab character, etc. However, this issue is much less significant than it would be using regular expressions, where things like a . would match any character.
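One way to sidestep the escape processing is to pass the strings through the environment and read them from ENVIRON, which awk does not interpret. The sketch below assumes that approach; the function name replace_or_append is invented, and the temporary file comes from mktemp rather than the /tmp/tmp$$ name used in the question:

```shell
# replace_or_append FILE PREFIX NEWLINE   (hypothetical name)
# Replaces every line that starts with the literal string PREFIX by NEWLINE;
# appends NEWLINE if no line matched.
replace_or_append() {
    local file=$1 tmp status
    tmp=$(mktemp) || return
    # ENVIRON values are taken literally: no regex, no \n or \t expansion.
    old=$2 new=$3 awk '
        index($0, ENVIRON["old"]) == 1 { $0 = ENVIRON["new"]; replaced = 1 }
        { print }
        END { if (!replaced) print ENVIRON["new"] }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Writing the result back with cat keeps the original file's ownership and permissions, at the cost of not being an atomic replacement the way mv is.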

Again, use a regular expression for that which you want to replace:
replaceappend port.txt "Port.*" "Port 222"
Here you are replacing Port (if it starts the line, as per your function definition) plus whatever follows until the end of the line with "Port 222".
EDIT: To make this part of the function instead of requiring it in the call, modify it to
function replaceappend() {
awk -v old="^$2.*" -v new="$3" '
sub(old,new) { replaced=1 }
{ print }
END { if (!replaced) print new }
' "$1" > /tmp/tmp$$ &&
mv /tmp/tmp$$ "$1"
}

Simple script to count NLOC?

Do you know a simple script to count NLOCs (netto lines of code). The script should count lines of C Code. It should not count empty lines or lines with just braces. But it doesn't need to be overly exact either.

I would do that using awk & cpp (preprocessor) & wc . awk removes all braces and blanks, the preprocessor removes all comments and wc counts the lines:
find . -name \*.cpp -o -name \*.h | xargs -n1 cpp -fpreprocessed -P |
awk '!/^[{[:space:]}]*$/' | wc -l
If you want to have comments included:
find . -name \*.cpp -o -name \*.h | xargs awk '!/^[{[:space:]}]*$/' | wc -l
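For a quick check on a single file or on standard input, the same awk filter can do the counting itself. This is a sketch with an invented function name, nloc, and like the second command above it still counts comment lines:

```shell
# nloc [FILE...]   (hypothetical name)
# Counts lines that contain something other than whitespace and braces.
nloc() {
    awk '!/^[{[:space:]}]*$/ { n++ } END { print n + 0 }' "$@"
}

printf '%s\n' 'int main(void)' '{' '' '    return 0;' '}' | nloc
```

The n + 0 in the END block makes awk print 0 instead of an empty line when nothing was counted.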

Looking up NLOC on the Net, I found mostly "Non-commented lines of code".
You don't specify if comments must be skipped...
So if I stick to your current message, the following one-liner in Perl should do the job:
perl -pe "s/^\s*[{}]?\s*\n//" Dialog.java | wc -l
I can extend it to handle line comments:
perl -pe "s#^\s*[{}]?\s*\n|^\s*//.*\n##" Dialog.java | wc -l
or perhaps
perl -pe "s#^\s*(?:[{}]?\s*|//.*)\n##" Dialog.java | wc -l
Handling block comments is slightly more tricky (I am not a Perl expert!).
[EDIT] Got it... First part can be probably improved (shorter). Was fun to experiment with.
perl -e "$x = join('', <>); $x =~ s#/\*.*?\*/##gs; print $x" Dialog.java | perl -pe "s#^\s*(?:[{}]?\s*|//.*)\n##" | wc -l
PS.: I use double quotes because I tested on Windows...

Check out DPack plugin for Visual Studio. It has a stats report for any solution/project.

Not a script, but you can try this command-line open source tool: NLOC

Source monitor is freeware source analysis software. It is a Windows application, but it can also be run with parameters from the command line.
It can analyze C++, C, C#, VB.NET, Java, Delphi, Visual Basic (VB6) or HTML.

Ohloh offers the free Ohcount which counts lines of code and comments.

If the comments can still be in, the standard Unix tools are sufficient:
grep -x -v "[[:space:]}{]*" files.c | wc

SLOCCount is not a simple script and does much more than what you need. However, it is a powerful alternative to the already mentioned Ohcount and NLOC. :)

I usually just do this:
grep -vc '^$' (my files)
Works only if your empty lines are really empty (no spaces). Sufficient for me.
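If whitespace-only lines should be skipped too, a bracket expression covers that case. A minimal sketch, with an invented function name:

```shell
# count_nonblank [FILE...]   (hypothetical name)
# Counts lines that contain at least one non-whitespace character.
count_nonblank() {
    cat "$@" | grep -vc '^[[:space:]]*$'
}

printf 'a\n\n  \t\nb\n' | count_nonblank
```

The cat in front makes grep print a single total even when several files are given, instead of one count per file.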

Locmetrics works well.

Here's a simple Perl script eLOC.pl:
#!/usr/bin/perl -w
# eLOC - Effective Lines of Code Counter
# JFS (2005)
#
# $ perl eLOC.pl --help
#
use strict;
use warnings;
use sigtrap;
use diagnostics;
use warnings::register;
no warnings __PACKAGE__;
sub DEBUG { 0 }
use English qw( -no_match_vars ) ; # Avoids regex performance penalty
use Getopt::Long qw(:config gnu_getopt);
use File::DosGlob 'glob';
use Pod::Usage;
our $VERSION = '0.01';
# globals
use constant NOTFILENAME => undef;
my %counter = (
'PHYS' => 0,
'ELOC' => 0,
'PURE_COMMENT' => 0,
'BLANK' => 0,
'LLOC' => 0,
'INLINE_COMMENT'=> 0,
'LOC' => 0,
);
my %header = (
"eloc" => "eloc",
"lloc" => "lloc",
"loc" => "loc",
"comment" => "comment",
"blank" => "blank",
"newline" => "newline",
"logicline" => "lgcline",
);
my %total = %counter; # copy
my $c = \%counter; # see format below
my $h = \%header; # see top format below
my $inside_multiline_comment = 0;
my $filename = NOTFILENAME;
my $filecount = 0;
my $filename_header = "file name";
# process input args
my $version = '';
my $help = '';
my $man = '';
my $is_deterministic = '';
my $has_header = '';
print STDERR "Input args:'" if DEBUG;
print STDERR (join("|",@ARGV),"'\n") if DEBUG;
my %option = ('version' => \$version,
'help' => \$help,
'man' => \$man,
'deterministic' => \$is_deterministic,
'header' => \$has_header
);
GetOptions( \%option, 'version', 'help', 'man',
'eloc|e', # print the eLOC counts
'lloc|s', # print the lLOC counts (code statements)
'loc|l' , # print the LOC counts (eLOC + lines of a single brace or parenthesis)
'comment|c' , # print the comments counts (count lines which contains a comment)
'blank|b' , # print the blank counts
'newline|n' , # print the newline count
'logicline|g' , # print the logical line count (= LOC + Comment Lines + Blank Lines)
'deterministic', # print the LOC determination for every line in the source file
'header', # print header line
) or invalid_options("$0: invalid options\nTry `$0 --help' for more information.");
version() if $version;
pod2usage(-exitstatus => 0, -verbose => 1) if $help ;
pod2usage(-exitstatus => 0, -verbose => 2) if $man;
#
$has_header = 1 if $is_deterministic && $has_header eq '';
#format for print_loc_metric()
my ($format, $format_top) = make_format();
print STDERR "format:\n" if DEBUG > 10;
print STDERR $format if DEBUG > 10;
eval $format;
die $@ if $@; # $EVAL_ERROR
if(DEBUG>10) {
print STDERR ("format_top:\n", $format_top);
}
if( $has_header) {
eval $format_top;
die $@ if $@; # $EVAL_ERROR
}
# process files
print STDERR ("Input args after Getopts():\n",
join("|",@ARGV),"\n") if DEBUG > 10;
expand_wildcards();
@ARGV = '-' unless @ARGV;
foreach my $fn (@ARGV) {
$filename = $fn;
unless (open(IN, "<$filename")) {
warn "$0: Unable to read from '$filename': $!\n";
next;
}
print STDERR "Scanning $filename...\n" if DEBUG;
clear_counters();
generate_loc_metric();
$filecount++;
print_loc_metric();
close(IN)
or warn "$0: Could not close $filename: $!\n";
}
# print total
if($filecount > 1) {
$filename = "total";
$c = \%total;
print_loc_metric();
}
exit 0;
#-------------------------------------------------
sub wsglob {
my @list = glob;
@list ? @list : @_; #HACK: defence from empty list from glob()
}
sub expand_wildcards {
print STDERR ("Input args before expand_wildcards():\n",
join("|",@ARGV),"\n") if DEBUG;
{
@ARGV = map( /['*?']/o ? wsglob($_) : $_ , @ARGV);
}
print STDERR ("Input args after expand_wildcards():\n",
join("|",@ARGV),"\n") if DEBUG;
}
sub clear_counters {
for my $name ( keys %counter) {
$counter{$name} = 0;
}
}
sub make_format {
my $f = 'format STDOUT =' . "\n";
$f .= '# LOC, eLOC, lLOC, comment, blank, newline, logicline and filename' . "\n";
my $f_top = 'format STDOUT_TOP =' . "\n";
my $console_screen_width = (get_terminal_size())[0];
print STDERR '$console_screen_width=' . $console_screen_width ."\n" if DEBUG>10;
$console_screen_width = 100 if $console_screen_width < 0;
my $is_print_specifiers_set =
($option{"eloc"} or
$option{"lloc"} or
$option{"loc"} or
$option{"comment"} or
$option{"blank"} or
$option{"newline"} or
$option{"logicline"});
my %o = %option;
my $fc = 0;
if( $is_print_specifiers_set ) {
$fc++ if $o{"eloc"};
$fc++ if $o{"lloc"};
$fc++ if $o{"loc"};
$fc++ if $o{"comment"};
$fc++ if $o{"blank"};
$fc++ if $o{"newline"};
$fc++ if $o{"logicline"};
if( $fc == 0 ) { die "$0: assertion failed: field count is zero" }
}
else {
# default
$fc = 7;
$o{"loc"} = 1;
$o{"eloc"} = 1;
$o{"lloc"} = 1;
$o{"comment"} = 1;
$o{"blank"} = 1;
$o{"newline"} = 1;
$o{"logicline"} = 1;
}
if (DEBUG > 10) {
while( (my ($name, $value) = each %{o}) ) {
print STDERR "name=$name, value=$value\n";
}
}
# picture line
my $field_format = '@>>>>>> ';
my $field_width = length $field_format;
my $picture_line = $field_format x $fc;
# place for filename
$picture_line .= '^';
$picture_line .= '<' x ($console_screen_width - $field_width * $fc - 2);
$picture_line .= "\n";
$f .= $picture_line;
$f_top .= $picture_line;
# argument line
$f .= '$$c{"LOC"}, ' ,$f_top .= '$$h{"loc"}, ' if $o{"loc"};
$f .= '$$c{"ELOC"}, ' ,$f_top .= '$$h{"eloc"}, ' if $o{"eloc"};
$f .= '$$c{"LLOC"}, ' ,$f_top .= '$$h{"lloc"}, ' if $o{"lloc"};
$f .= '$$c{"comment"}, ' ,$f_top .= '$$h{"comment"}, ' if $o{"comment"};
$f .= '$$c{"BLANK"}, ' ,$f_top .= '$$h{"blank"}, ' if $o{"blank"};
$f .= '$$c{"PHYS"}, ' ,$f_top .= '$$h{"newline"}, ' if $o{"newline"};
$f .= '$$c{"logicline"}, ',$f_top .= '$$h{"logicline"}, ' if $o{"logicline"};
$f .= '$filename' . "\n";
$f_top .= '$filename_header' . "\n";
# 2nd argument line for long file names
$f .= '^';
$f .= '<' x ($console_screen_width-2);
$f .= '~~' . "\n"
.' $filename' . "\n";
$f .='.' . "\n";
$f_top .='.' . "\n";
return ($f, $f_top);
}
sub generate_loc_metric {
my $is_concatinated = 0;
LINE: while(<IN>)
{
chomp;
print if $is_deterministic && !$is_concatinated;
# handle multiline code statements
if ($is_concatinated = s/\\$//) {
warnings::warnif("$0: '\\'-ending line concantinated");
increment('PHYS');
print "\n" if $is_deterministic;
my $line = <IN>;
$_ .= $line;
chomp($line);
print $line if $is_deterministic;
redo unless eof(IN);
}
# blank lines, including inside comments, don't move to next line here
increment('BLANK') if( /^\s*$/ );
# check whether multiline comments finished
if( $inside_multiline_comment && m~\*/\s*(\S*)\s*$~ ) {
$inside_multiline_comment = 0;
# check the rest of the line if it contains non-whitespace characters
#debug $_ = $REDO_LINE . $1, redo LINE if($1);
warnings::warnif("$0: expression '$1' after '*/' discarded") if($1);
# else mark as pure comment
increment('PURE_COMMENT');
next LINE;
}
# inside multiline comments
increment('PURE_COMMENT'), next LINE if( $inside_multiline_comment );
# C++ style comment at the begining of line (except whitespaces)
increment('PURE_COMMENT'), next LINE if( m~^\s*//~ );
# C style comment at the begining of line (except whitespaces)
if ( m~^\s*/\*~ ) {
$inside_multiline_comment = 1 unless( m~\*/~ );
increment('PURE_COMMENT'), next LINE;
}
# inline comment, don't move to next line here
increment('INLINE_COMMENT') if ( is_inline_comment($_) );
# lLOC implicitly incremented inside is_inline_comment($)
#
increment('LOC') unless( /^\s*$/ );
# standalone braces or parenthesis
next LINE if( /^\s*(?:\{|\}|\(|\))+\s*$/ );
# eLOC is not comments, blanks or standalone braces or parenthesis
# therefore just increment eLOC counter here
increment('ELOC'), next LINE unless( /^\s*$/ );
}
continue {
increment('PHYS');
print " [$.]\n" if $is_deterministic; # $INPUT_LINE_NUMBER
}
}
sub print_loc_metric {
$$c{'comment'} = $$c{'PURE_COMMENT'} + $$c{'INLINE_COMMENT'};
# LOC + Comment Lines + Blank Lines
$$c{'logicline'} = $$c{'LOC'} + $$c{'comment'} + $$c{'BLANK'};
unless (defined $filename) {
die "print_loc_metric(): filename is not defined";
}
my $fn = $filename;
$filename = "", $filename_header = ""
unless($#ARGV);
print STDERR ("ARGV in print_loc_metric:" , join('|',@ARGV), "\n")
if DEBUG;
write STDOUT; # replace with printf
$filename = $fn;
}
sub increment {
my $loc_type = shift;
defined $loc_type
or die 'increment(\$): input argument is undefined';
$counter{$loc_type}++;
$total{$loc_type}++;
print "\t#". $loc_type ."#" if $is_deterministic;
}
sub is_inline_comment {
my $line = shift;
defined $line
or die 'is_inline_comment($): $line is not defined';
print "\n$line" if DEBUG > 10;
# here: line is not empty, not begining both C and C++ comments signs,
# not standalone '{}()', not inside multiline comment,
# ending '\' removed (joined line created if needed)
# Possible cases:
# - no C\C++ comment signs => is_inline_comment = 0
# - C++ comment (no C comment sign)
# * no quote characters => is_inline_comment = 1
# * at least one comment sign is not quoted => is_inline_comment = 1
# * all comment signs are quoted => is_inline_comment = 0
# - C comment (no C++ comment sign)
# * no quote characters => is_inline_comment = 1,
# ~ odd number of '/*' and '*/' => $inside_multiple_comment = 1
# ~ even number => $inside_multiple_comment = 0
# * etc...
# - ...
# algorithm: move along the line from left to right
# rule: quoted comments are not counted
# rule: quoted by distinct style quotes are not counted
# rule: commented quotes are not counted
# rule: commented distinct style comments are not counted
# rule: increment('LLOC') if not-quoted, not-commented
# semi-colon presents in the line except that two
# semi-colon in for() counted as one.
#
$_ = $line; #hack: $_ = $line inside sub
# state
my %s = (
'c' => 0, # c slash star - inside c style comments
'cpp' => 0, # c++ slash slash - inside C++ style comment
'qm' => 0, # quoted mark - inside quoted string
'qqm' => 0, # double quoted - inside double quoted string
);
my $has_comment = 0;
# find state
LOOP:
{
/\G\"/gc && do { # match double quote
unless( $s{'qm'} || $s{'c'} || $s{'cpp'} ) {
# toggle
$s{'qqm'} = $s{'qqm'} ? 0 : 1;
}
redo LOOP;
};
/\G\'/gc && do { # match single quote
unless( $s{'qqm'} || $s{'c'} || $s{'cpp'} ) {
# toggle
$s{'qm'} = $s{'qm'} ? 0 : 1;
}
redo LOOP;
};
m~\G//~gc && do { # match C++ comment sign
unless( $s{'qm'} || $s{'qqm'} || $s{'c'} ) {
# on
$has_comment = 1;
$s{'cpp'} = 1;
}
redo LOOP;
};
m~\G/\*~gc && do { # match begining C comment sign
unless( $s{'qm'} || $s{'qqm'} || $s{'cpp'} ) {
# on
$has_comment = 1;
$s{'c'} = $s{'c'} ? 1 : 1;
}
redo LOOP;
};
m~\G\*/~gc && do { # match ending C comment sign
unless( $s{'qm'} || $s{'qqm'} || $s{'cpp'} ) {
# off
if( $s{'c'} ) {
$s{'c'} = 0;
}
else {
die 'is_inline_comment($): unexpected c style ending comment sign'.
"\n'$line'";
}
}
redo LOOP;
};
/\Gfor\s*\(.*\;.*\;.*\)/gc && do { # match for loop
unless( $s{'qm'} || $s{'qqm'} || $s{'cpp'} || $s{'c'} ) {
# not-commented, not-quoted semi-colon
increment('LLOC');
}
redo LOOP;
};
/\G\;/gc && do { # match semi-colon
unless( $s{'qm'} || $s{'qqm'} || $s{'cpp'} || $s{'c'} ) {
# not-commented, not-quoted semi-colon
# not inside for() loop
increment('LLOC');
}
redo LOOP;
};
/\G./gc && do { # match any other character
# skip 1 character
redo LOOP;
};
/\G$/gc && do { # match end of the line
last LOOP;
};
#default
die 'is_inline_comment($): unexpected character in the line:' .
"\n'$line'";
}
# apply state
$inside_multiline_comment = $s{'c'};
return $has_comment;
}
sub version {
# TODO: version implementation
print <<"VERSION";
NAME v$VERSION
Written by AUTHOR
COPYRIGHT AND LICENSE
VERSION
exit 0;
}
sub invalid_options {
print STDERR (@_ ,"\n");
exit 2;
}
sub get_terminal_size {
my ($wchar, $hchar) = ( -1, -1);
my $win32console = <<'WIN32_CONSOLE';
use Win32::Console;
my $CONSOLE = new Win32::Console();
($wchar, $hchar) = $CONSOLE->MaxWindow();
WIN32_CONSOLE
eval($win32console);
return ($wchar, $hchar) unless( $@ );
warnings::warnif($@); # $EVAL_ERROR
my $term_readkey = <<'TERM_READKEY';
use Term::ReadKey;
($wchar,$hchar, $wpixels, $hpixels) = GetTerminalSize();
TERM_READKEY
eval($term_readkey);
return ($wchar, $hchar) unless( $@ );
warnings::warnif($@); # $EVAL_ERROR
my $ioctl = <<'IOCTL';
require 'sys/ioctl.ph';
die "no TIOCGWINSZ " unless defined &TIOCGWINSZ;
open(TTY, "+</dev/tty")
or die "No tty: $!";
unless (ioctl(TTY, &TIOCGWINSZ, $winsize='')) {
die sprintf "$0: ioctl TIOCGWINSZ (%08x: $!)\n",
&TIOCGWINSZ;
}
($hchar, $wchar, $xpixel, $ypixel) =
unpack('S4', $winsize); # probably $hchar & $wchar should be swapped here
IOCTL
eval($ioctl);
warnings::warnif($@) if $@ ; # $EVAL_ERROR
return ($wchar, $hchar);
}
1;
__END__
=head1 NAME
eLOC - Effective Lines of Code Counter
=head1 SYNOPSIS
B<eloc> B<[>OPTIONB<]...> B<[>FILEB<]...>
Print LOC, eLOC, lLOC, comment, blank, newline and logicline counts
for each FILE, and a total line if more than one FILE is specified.
See L</"LOC Specification"> for more info, use `eloc --man'.
-e, --eloc print the {E}LOC counts
-s, --lloc print the lLOC counts (code {S}tatements)
-l, --loc print the {L}OC counts (eLOC + lines of a single brace or parenthesis)
-c, --comment print the {C}omments counts (count lines which contain a comment)
-b, --blank print the {B}lank counts
-n, --newline print the {N}ewline count
-g, --logicline print the lo{G}ical line count (= LOC + Comment Lines + Blank Lines)
--deterministic print the LOC determination for every line in the source file
--header print header line
--help display this help and exit
--man display full help and exit
--version output version information and exit
With no FILE, or when FILE is -, read standard input.
Metrics counted by the program are based on narration from
http://msquaredtechnologies.com/m2rsm/docs/rsm_metrics_narration.htm
=for TODO: Comment Percent = ( Comment Line Count / Logical Line Count ) x 100
=for TODO: White Space Percentage = (Number of spaces / Number of spaces and characters) * 100
=head1 DESCRIPTION
eLOC is a simple LOC counter. See L</"LOC Specification">.
=head2 LOC Specification
=over 1
=item LOC
Lines Of Code = eLOC + lines of a single brace or parenthesis
=item eLOC
An effective line of code or eLOC is the measurement of all lines that are
not comments, blanks or standalone braces or parentheses.
This metric more closely represents the quantity of work performed.
RSM introduces eLOC as a metrics standard.
See http://msquaredtechnologies.com/m2rsm/docs/rsm_metrics_narration.htm
=item lLOC
Logical lines of code are a metric for those lines of code which form
code statements. These statements are terminated with a semi-colon.
The control line of a "for" loop contains two semi-colons but is counted
as only one statement.
See http://msquaredtechnologies.com/m2rsm/docs/rsm_metrics_narration.htm
=item comment
comment = pure comment + inline comment
=over
=item pure comment
Comment lines are a metric for pure comment lines, that is, lines without any code in them.
See L</"inline comment">.
=item inline comment
An inline comment line is a line which contains both code and a comment.
Inline comment lines and pure comment lines (see L</"pure comment">)
are mutually exclusive; that is, a given physical line cannot be an inline comment
line and a pure comment line simultaneously.
=over
=item Example:
static const int defaultWidth = 400; // value provided in declaration
=back
=back
=item blank
A blank line is a line which contains at most whitespace.
Blank lines are counted inside comments too.
=item logicline
The logical line count = LOC + Comment Lines + Blank Lines
=back
=head1 KNOWN BUGS AND LIMITATIONS
=over
=item
It supports only C/C++ source files.
=item
Comments inside for(;;) statements are not counted
=over
=item Example:
for(int i = 0; i < N /*comment*/; i++ ); #LLOC# #LLOC# #LOC# #ELOC# #PHYS# [1]
=back
=item
'\'-ending lines are concatenated (though the newline count is valid)
=item
Input from stdin is not supported when
the script is invoked solely by name, without an explicit perl executable.
=item
Wildcards in path with spaces are not supported (like GNU utilities).
=back
=over
=begin fixed
=item Limitation: single source file
Only one source file at a time is supported
=item Limitation: LLOC is unsupported
The logical lines of code metric is unsupported.
=item missed inline comment for C style comment
#include <math.h> /* comment */ #ELOC# #PHYS# [2]
But must be
#include <math.h> /* comment */ #INLINE_COMMENT# #ELOC# #PHYS# [2]
=item wrong LOC type for the code after '*/'
/* another #PURE_COMMENT# #PHYS# [36]
trick #PURE_COMMENT# #PHYS# [37]
*/ i++; #PURE_COMMENT# #PHYS# [38]
In the last line must be
#INLINE_COMMENT# #PHYS# [38]
=end fixed
=back
=head1 SEE ALSO
Metrics counted by the program are based on narration from L<http://msquaredtechnologies.com/m2rsm/docs/rsm_metrics_narration.htm>
=cut
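
The LOC Specification in the POD above can be illustrated with a small classifier. The following is a minimal Python sketch of one reading of that specification, not a port of the eLOC script: the function names `scan_line` and `classify` are hypothetical, the labels are borrowed from the `#ELOC#`-style markers in the examples, and a "standalone brace or parenthesis" is assumed to mean a line whose only code is a single `{`, `}`, `(` or `)`. Backslash-continued lines and lLOC counting are left out.

```python
BRACES = {"{", "}", "(", ")"}  # assumption: what counts as a standalone brace


def scan_line(line, in_block=False):
    """Split one physical line into code text and comment flags.

    Returns (code, has_comment, in_block); in_block tells whether a
    /* ... */ comment is still open at the end of the line.
    """
    code = []
    has_comment = in_block          # a line starting inside /* */ holds comment text
    quote = None                    # current string/char delimiter, if any
    i = 0
    while i < len(line):
        ch, two = line[i], line[i:i + 2]
        if in_block:
            if two == "*/":
                in_block = False
                i += 2
            else:
                i += 1
        elif quote:
            code.append(ch)
            if ch == "\\":          # keep the escaped character as code too
                code.append(line[i + 1:i + 2])
                i += 2
            else:
                if ch == quote:
                    quote = None
                i += 1
        elif two == "//":
            has_comment = True
            break                   # rest of the line is a C++ comment
        elif two == "/*":
            has_comment = True
            in_block = True
            i += 2
        else:
            if ch in "\"'":
                quote = ch
            code.append(ch)
            i += 1
    return "".join(code), has_comment, in_block


def classify(line, in_block=False):
    """Return (labels, in_block) for one physical line."""
    code, has_comment, in_block = scan_line(line, in_block)
    if not line.strip():
        return {"BLANK"}, in_block          # blank lines count inside comments too
    if not code.strip():
        return {"PURE_COMMENT"}, in_block
    labels = {"LOC"}
    if code.strip() not in BRACES:
        labels.add("ELOC")
    if has_comment:
        labels.add("INLINE_COMMENT")
    return labels, in_block


if __name__ == "__main__":
    state = False
    for text in ["/* another", "   trick", "*/ i++;", "{", ""]:
        labels, state = classify(text, state)
        print(sorted(labels), repr(text))
```

Carrying `in_block` from one call to the next is what lets the `*/ i++;` case from the fixed-bugs list come out as an inline comment rather than a pure comment.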

The following script will get the line counts of all files matching a pattern in a given directory.
# START OF SCRIPT
var str files
var str dir
set $files = "*.cpp" # <===================== Set your file name pattern here.
set $dir = "C:/myproject" # <===================== Set your project directory here.
# Get the list of files in variable fileList.
var str fileList
find -rn files($files) dir($dir) > $fileList
# Declare variables where we will save counts of individual files.
var int c # all lines
var int nb # non-blank lines
# Declare variables where we will save total counts for all files.
var int totalc # sum-total of all lines
var int totalnb # sum-total of all non-blank lines
# Declare variable where we will store file count.
var int fileCount
# We will store the name of the file we are working on currently, in the following.
var str file
# Go through $fileList one file at a time.
while ($fileList<>"")
do
# Extract the next file.
lex "1" $fileList >$file
# Check if this is a flat file. We are not interested in directories.
af $file >null # We don't want to see the output.
# We only want to set the $ftype variable.
if ($ftype=="f")
do
# Yes, this is a flat file.
# Increment file count.
set $fileCount = $fileCount+1
# Collect the content of $file in $content
var str content # Content of one file at a time
repro $file >$content
# Get count and non-blank count.
set $c={len -e $content}
set $nb={len $content}
echo -e "File: " $file ", Total Count: " $c ", Non-blank Count: " $nb
# Update total counts.
set $totalc = $totalc + $c
set $totalnb = $totalnb + $nb
done
endif
done
# Show sum-totals
echo "**********************************************************************************************************************************"
echo "Total Count of all lines:\t" $totalc ",\tTotal Count of non-blank lines:\t" $totalnb ", Total files: " $fileCount
echo "**********************************************************************************************************************************"
# END OF SCRIPT
If you want line counts in files modified in year 2008 only, add ($fmtime >= "2008"), etc.
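
For readers without biterscripting, the same totals can be approximated in Python. This is a rough sketch rather than a translation of the script above: the function names `count_lines` and `print_report` are hypothetical, the search is assumed to be recursive, and "non-blank" is assumed to mean a line with at least one non-whitespace character.

```python
from pathlib import Path


def count_lines(directory, pattern="*.cpp"):
    """Return (per_file, total, total_non_blank) for files matching pattern.

    per_file is a list of (path, line_count, non_blank_count) tuples.
    """
    per_file = []
    for path in sorted(Path(directory).rglob(pattern)):
        if not path.is_file():          # skip directories, like the $ftype check
            continue
        lines = path.read_text(errors="replace").splitlines()
        non_blank = sum(1 for line in lines if line.strip())
        per_file.append((path, len(lines), non_blank))
    total = sum(count for _, count, _ in per_file)
    total_non_blank = sum(nb for _, _, nb in per_file)
    return per_file, total, total_non_blank


def print_report(directory, pattern="*.cpp"):
    """Print one line per file followed by the sum-totals."""
    per_file, total, total_non_blank = count_lines(directory, pattern)
    for path, count, non_blank in per_file:
        print(f"File: {path}, Total Count: {count}, Non-blank Count: {non_blank}")
    print(f"Total lines: {total}, non-blank lines: {total_non_blank}, "
          f"files: {len(per_file)}")
```

Calling `print_report("C:/myproject", "*.cpp")` would correspond to the directory and pattern set at the top of the script.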
If you don't have biterscripting, get it from .com .

Not a simple script, but CCCC (C and C++ Code Counter) has been around for a while and it works great for me.

I have a program called scc that strips C comments (and C++ comments, though with C99 they're the same). Apply that plus a filter to remove blank lines and, if so desired, lines containing just open and close braces, to generate the line counts. I've used that on internal projects, though I have not needed to discount open/close braces there. Those scripts were more complex, comparing the source code for two different versions of a substantial project stored in ClearCase. They also did statistics on files added and removed, and on lines added and removed from common files, etc.
Not counting braces makes quite a difference:
Black JL: co -q -p scc.c | scc | sed '/^[ ]*$/d' | wc -l
208
Black JL: co -q -p scc.c | scc | sed '/^[ {}]*$/d' | wc -l
144
Black JL: co -p -q scc.c | wc -l
271
Black JL:
So, 144 lines under your rules; 208 counting open and close brace lines; 271 counting everything.
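
The two sed filters in that pipeline can be sketched in Python for anyone who wants the counting step without sed. This is an illustration only: it assumes the input has already had its comments stripped (by scc or some other tool), the name `count_kept` is hypothetical, and the patterns treat tabs as blanks as well as spaces, which may or may not match the sed expressions as originally typed.

```python
import re

# Lines that contain nothing but spaces or tabs.
BLANK = re.compile(r"^[ \t]*$")
# Lines that contain nothing but spaces, tabs and braces.
BLANK_OR_BRACES = re.compile(r"^[ \t{}]*$")


def count_kept(lines, drop):
    """Count the lines that do NOT match the drop pattern."""
    return sum(1 for line in lines if not drop.match(line))


if __name__ == "__main__":
    stripped = ["int main(void)", "{", "", "    return 0;", "}"]
    print("counting brace lines:", count_kept(stripped, BLANK))
    print("ignoring brace lines:", count_kept(stripped, BLANK_OR_BRACES))
```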
Lemme know if you want the code for scc (send email to first dot last at gmail dot com). It's 13 KB of gzipped tar file including man page, torture test, and some library files.
@litb commented that 'cpp -fpreprocessed -P file' handles stripping of
comments. It mostly does. However, when I run it on the stress test
for SCC, it complains when (in my opinion) it should not:
SCC has been trained to handle 'q' single quotes in most of
the aberrant forms that can be used. '\0', '\', '\'', '\\
n' (a valid variant on '\n'), because the backslash followed
by newline is elided by the token scanning code in CPP before
any other processing occurs.
When the CPP from GCC 4.3.2 processes this, it complains (warns):
SCC has been trained to handle 'q' single quotes in most of
<stdin>:2:56: warning: missing terminating ' character
the aberrant forms that can be used. '\0', '\', '\'', '\\
<stdin>:3:27: warning: missing terminating ' character
n' (a valid variant on '\n'), because the backslash followed
by newline is elided by the token scanning code in CPP before
any other processing occurs.
Section 5.1.1.2 Translation Phases of the C99 standard says:
The precedence among the syntax rules of translation is specified by the following phases.(Footnote 5)
1. Physical source file multibyte characters are mapped, in an implementation-defined
manner, to the source character set (introducing new-line characters for
end-of-line indicators) if necessary. Trigraph sequences are replaced by
corresponding single-character internal representations.
2. Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
Footnote 5 is:
(5) Implementations shall behave as if these separate phases occur, even
though many are typically folded together in practice.
Consequently, in my view, CPP is mishandling phase two in the example text. Or, at least, the warning is not what I want - the construct is valid C and it is not self-evident that the warning is warranted.
Granted, it is an edge case, and extra warnings are permitted. But it would annoy the living daylights out of me. If I didn't have my own, possibly better tool for the job, then using 'cpp -fpreprocessed -P' would do - it is an extreme edge case that I'm complaining about (and it might be legitimate to argue that it is more likely that there is a problem than not, though a better heuristic would observe that the line was spliced and the result was a legitimate single character constant, and would therefore suppress the complaint; if the result was not a legitimate single character constant, then the complaint should be produced). (On my test case - admittedly a torture test - CPP yields 13 problems, mostly related to the one I'm complaining about, where SCC correctly yields 2.)
(I observe that the '-P' manages to suppress a '#line' directive in the output that appears when the option is omitted.)
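
The phase-two splice quoted from the standard is small enough to sketch. The Python below is an illustration only, not code from scc or CPP, and the name `splice_lines` is hypothetical: it deletes every backslash that is immediately followed by a new-line, which is why the character constant in the stress-test text ends up as an ordinary '\n' before any tokenizing happens.

```python
def splice_lines(source):
    """Delete each backslash immediately followed by a new-line (phase 2)."""
    return source.replace("\\\n", "")


if __name__ == "__main__":
    # The six characters: quote, backslash, backslash, new-line, n, quote.
    text = "'\\\\\nn'"
    print(repr(text), "->", repr(splice_lines(text)))
```

Only the backslash directly before the new-line is removed, so the first backslash survives and pairs with the `n` on the following physical line.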
