sed command line to change "aas=aas" into "aas = aas" - regex

A question for the regular expression / sed experts out there:
I need to beautify some C++ code.
The code is littered with various versions of the assignment operator with different types of spacing.
i.e.
a=b
a =b
a= B
a = b
a= b
A = B // the correct format, so sed must leave it unchanged
There should be exactly one space on each side of the =. If more are found, the extras must be removed.
I need to make a script that will scan through all files in a folder and its subfolders and search and replace as needed.
There are some variations, like a+=b etc.
I run on OS X but have Linux and Windows machines available.
Help much appreciated.

You can use this sed to insert a single space before and after all = operators:
Input file:
cat file
a ==b
a=b
a =b
a/=b
a *=b
a+= b
a-= b
a= B
a%= B
a = b
a= b
A = B
sed command:
sed -E 's~[[:blank:]]*([-+*/%=]?=)[[:blank:]]*~ \1 ~g' file
a == b
a = b
a = b
a /= b
a *= b
a += b
a -= b
a = B
a %= B
a = b
a = b
A = B
This is the regex used for matching (using ~ as the delimiter):
~[[:blank:]]*([-+*/%=]?=)[[:blank:]]*~ - matches zero or more blanks (spaces or tabs), then an optional character from -+*/%= followed by a literal =, then zero or more blanks. The operator itself is captured in group #1.
This is the pattern used in the replacement:
~ \1 ~ - which means a single space before and after the string captured in group #1.
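For anyone who would rather script this outside sed, the same substitution can be sketched in Python. This is only an illustration under a few assumptions: the name normalize_assignment_spacing is made up for this example, [ \t] stands in for [[:blank:]], and comparison operators such as != or <= get no special treatment, just as with the sed pattern.

```python
import re

# Same idea as the sed expression: optional blanks, an operator ending in '=',
# optional blanks. [ \t] is used here in place of POSIX [[:blank:]].
ASSIGNMENT = re.compile(r'[ \t]*([-+*/%=]?=)[ \t]*')


def normalize_assignment_spacing(line):
    """Put exactly one space on each side of =, ==, +=, -=, *=, /= and %=.

    Hypothetical helper name; operators like != or <= are not handled.
    """
    return ASSIGNMENT.sub(r' \1 ', line)


if __name__ == "__main__":
    for sample in ["a ==b", "a=b", "a/=b", "a%= B", "A = B"]:
        print(normalize_assignment_spacing(sample))
```

Applying it to every file in a folder tree would still need a loop around this function (for example with os.walk), which is left out here.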

You may be interested in doing it with Perl.
A simple file.cpp:
#include <iostream>
int main(){
int i = 3;
i += 3;
i-=3;
i * = 3; // not valid just for sure
i/=3;
int i2
=
3;
if( i
=
= i2 ){} // not valid just for sure
}
perl -lpe '$/=undef;s/\s*([=!%\*\/+-])?\s*(=)\s*/ $1$2 /g' file.cpp
the output:
#include <iostream>
int main(){
int i = 3;
i += 3;
i -= 3;
i *= 3;
i /= 3;
int i2 = 3;
if( i == i2 ){}
}
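The difference from a line-by-line approach is that the whole file is treated as one string, so \s can also swallow newlines and an operator split across lines is rejoined. A rough Python sketch of that idea follows; tidy_operators is a hypothetical name, and the pattern is a direct transcription of the Perl one, so it shares its limitations (it will also touch = inside string literals and comments).

```python
import re

# \s also matches newlines, so an operator that was split across
# several lines is pulled back together.
OPERATOR = re.compile(r'\s*([=!%*/+-])?\s*(=)\s*')


def tidy_operators(source):
    """Rewrite every (optionally prefixed) '=' as ' <op>= ' in the whole text.

    Hypothetical helper. An unmatched optional group is substituted as an
    empty string in the replacement (Python 3.5 and later).
    """
    return OPERATOR.sub(r' \1\2 ', source)


if __name__ == "__main__":
    print(tidy_operators("i-=3;"))
    print(tidy_operators("int i2\n=\n3;"))
```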

Related

Regex - Would Like to Match Patterns Across Multiple Lines and Consolidate the Lines

I have a file of program code and would like to use Regex to combine multiple lines.
A typical series of lines might look like this:
rp = 10;
cp = 15;
wd = 2;
ht = 1;
dr = 3;
ds = 10 + 5 * x;
sp = 50;
er = 1;
Anim(rp, cp, wd, ht, dr, ds, sp, er);
All the lines except the last assign values to parameters.
The last line is a call to function Anim() using the parameters.
The file of program code will have multiple instances of blocks of code in this format.
I would like to perform an edit that rewrites the lines that assign values
as one line:
rp = 10; cp = 15; wd = 2; ht = 1; dr = 3; ds = 10 + 5 * x; sp = 50; er = 1;
Anim(rp, cp, wd, ht, dr, ds, sp, er);
What makes this challenging is that sometimes the parameters are assigned in
a different order. Also, not all the parameters are assigned because they may already
have the correct values. So sometimes the program code might look like this:
cp = 15;
rp = 10;
dr = 3;
sp = 50;
er = 1;
Anim(rp, cp, wd, ht, dr, ds, sp, er);
What I can say for sure is that the variables being assigned always have two letters.
And the block of lines to be rewritten always has this pattern:
one or more of these: ^\s*[a-z][a-z]\s*=\s*.*;\r\n (Note - call this line1 or line2 or ...)
followed by: ^\s*Anim(rp, cp, wd, ht, dr, ds, sp, er);\r\n
and I want to rewrite as: line; line2; line3; etc. \r\n
Anim(rp, cp, wd, ht, dr, ds, sp, er);\r\n
I would be very grateful for any suggestions on how I can use Regex to perform these edits,
if possible.
I use Notepad++.
Thank you.
Ctrl+H
Find what: (?<!\);)\R(?!\w+\()
Replace with: A SPACE
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
(?<!\);) # negative lookbehind, make sure we don't have ");" before
\R # any kind of linebreak (i.e. \r, \n, \r\n)
(?!\w+\() # negative lookahead, make sure we don't have a word and a parenthesis after
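The same two lookarounds can be tried outside Notepad++. Below is a minimal Python sketch, with some assumptions stated up front: join_assignment_lines is a made-up name, Python's re module has no \R so a plain \n is used, and the text is assumed to have been read in text mode so that line endings are already \n.

```python
import re

# A line break is replaced by a space unless it follows ");" (the end of a
# call) or is followed by a word and "(" (the start of a call).
# Assumption: line endings are "\n"; Python's re has no \R.
JOINABLE_BREAK = re.compile(r'(?<!\);)\n(?!\w+\()')


def join_assignment_lines(text):
    """Hypothetical helper: merge consecutive assignment lines into one."""
    return JOINABLE_BREAK.sub(' ', text)


if __name__ == "__main__":
    block = "rp = 10;\ncp = 15;\nwd = 2;\nAnim(rp, cp, wd, ht, dr, ds, sp, er);"
    print(join_assignment_lines(block))
```

Note that the lookahead only recognises a call that starts at the very beginning of the next line, so indented calls would need the pattern adjusted.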

I need to replace C comments with a blank space or new line using sed

The input file is like this
#include <stdio.h>
int main()
{
// this is a function
float alpha = 0;
// test
/* */
int y = 11; // comment
y = y + 15;
//
char z = 'n';
/* end of file
c */
}
And the desired output should look like this
#include <stdio.h>
int main()
{
*
float alpha = 0;*
*
*
int y = 11; *
y = y + 15;*
*
char z = 'n';*
*
*
}
Here the * represents EOL.
I have tried this but it simply deletes the spaces and new lines too.
sed '/^[ \t]*\/\//d;/\/*\*\//d;/^[ \t]*\/\*/d' $[input file]
This might work for you (GNU sed):
sed -z 's#//[^\n]*##g
s#/\*#\x00#g
s#\*/#\x01#g
s/\x00[^\n\x00\x01]*\x01//g;tb
:b;s/\x00[^\x01\n]*\n/\n\x00/;tb;s/\x00[^\n]*\x01//;tb' file
The solution comes in 3 parts:
Single line comments are removed
Multi line comments on a single line are removed.
Multi line comments on multiple lines are removed but the newlines of such lines are retained.
The solution uses the -z option which may cause problems if the file contains null characters.
N.B. This solution is only partial as many corner cases may break it e.g. literal comments as part of a variable value.
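To make the "remove the comment but keep the newlines" idea concrete, here is a small Python sketch. It is not a translation of the sed script: it uses a single regular expression with a replacement function, strip_comments_keep_lines is a hypothetical name, and it has the same weakness mentioned in the N.B. (comment markers inside string literals are treated as comments).

```python
import re

# Either a // comment up to the end of the line, or a /* ... */ comment that
# may span several lines (DOTALL lets '.' match newlines).
COMMENT = re.compile(r'//[^\n]*|/\*.*?\*/', re.DOTALL)


def strip_comments_keep_lines(source):
    """Hypothetical helper: drop comments but keep every line break."""
    def keep_newlines(match):
        # Replace the comment by as many newlines as it contained,
        # so the line count of the file does not change.
        return '\n' * match.group(0).count('\n')
    return COMMENT.sub(keep_newlines, source)


if __name__ == "__main__":
    sample = "int y = 11; // comment\n/* end of file\nc */\n}"
    print(repr(strip_comments_keep_lines(sample)))
```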
sed 's/\/\/.*//;s/\/\*.*//;s/.*\*\///' file
The command works as follows: the first part searches for the string "//" and deletes it together with everything after it on the same line. The second part searches for the string "/*" and deletes it together with everything after it on the same line. The third part searches for the string "*/" and deletes it together with everything before it on the same line. This solution fails when a multi-line comment spans more than two lines, because the lines in between contain neither marker and are left untouched.

How to remove anything after a non-slash character in a string?

The problem I am encountering is strange. Suppose I have:
a = "www.XXXXXXX.com"
b = "www.XXXXXXX.com/laskdfj/=*&9809f/12-613"
c = "www.XXXX.comllkjldfjlsadjfjldsf"
d = "http://www.XXXX.CoMmasldfjl"
e = "www.XXX.us/sdf"
f = "www.XXX.us0948klsdf"
If following after the ".com" or ".us" is not a slash, then remove it. So the result would be like:
a = "www.XXXXXXX.com"
b = "www.XXXXXXX.com/laskdfj/=*&9809f/12-613"
c = "www.XXXX.com"
d = "http://www.XXXX.CoM"
e = "www.XXX.us/sdf"
f = "www.XXX.us"
Regular expressions are new to me. I have read several blogs about them, but none seem to cover how to use an if-statement to handle my situation. Any hints?
You can utilize sub for this task:
sub('(.*\\.(?i:com|us))[^/]+', '\\1', x)
If you're wanting a more general approach, you can use:
sub('(.*\\.[[:alpha:]]{2,3})[^/]*', '\\1', x)
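No explicit if-statement is needed because [^/]+ only matches when the text after the domain does not start with a slash; otherwise the pattern fails and the string is left alone. A Python sketch of the first pattern, assuming Python 3.6 or later for the scoped (?i:...) flag and using the made-up name trim_after_tld:

```python
import re

# Capture everything up to ".com" or ".us" (case-insensitive), then require
# at least one non-slash character after it. If a slash (or nothing)
# follows, the pattern does not match and the string is returned unchanged.
TRAILING_JUNK = re.compile(r'(.*\.(?i:com|us))[^/]+')


def trim_after_tld(url):
    """Hypothetical helper mirroring the first sub() call above."""
    return TRAILING_JUNK.sub(r'\1', url, count=1)


if __name__ == "__main__":
    for url in ["www.XXXX.comllkjldfjlsadjfjldsf", "www.XXX.us/sdf"]:
        print(trim_after_tld(url))
```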

Easily aligning characters after whitespace in vim

I would like to create a mapped vim command that helps me align assignments for variables across multiple lines. Imagine I have the following text in a file:
foo = 1;
barbar = 2;
asdfasd = 3;
jjkjfh = 4;
baz = 5;
If I select multiple lines and use the regex below, noting that column 10 is in the whitespace for all lines, stray whitespace after column 10 will be deleted up to the equals sign.
:'<,'>s/^\(.\{10}\)\s*\(=.*\)$/\1\2/g
Here's the result:
foo       = 1;
barbar    = 2;
asdfasd   = 3;
jjkjfh    = 4;
baz       = 5;
Is there a way to get the current cursor position (specifically the column position) while performing a visual block selection and use that column in the regular expression?
Alternatively, if it is possible to find the max column for any of the equals signs on the selected lines and insert whitespace so all equals signs are aligned by column, that is preferred to solving the previous problem. Imagine quickly converting:
foo = 1;
barbar = 2;
asdfasd = 3;
jjkjfh = 4;
baz = 5;
to:
foo     = 1;
barbar  = 2;
asdfasd = 3;
jjkjfh  = 4;
baz     = 5;
with a block selection and a key-combo.
Without plugins
In this case
foo = 1
fizzbuzz = 2
bar = 3
You can add many spaces with a macro:
0f=10iSPACEESCj
where 10 is an arbitrary number just to add enough space.
Apply the macro M times (for M lines) and get
foo           = 1
fizzbuzz           = 2
bar           = 3
Then remove excessive spaces with a macro that removes all characters till some column N:
0f=d12|j
where 12 is the column number you want to align along and | is a vertical bar (SHIFT + \). Together 12| is a "go to column 12" command.
Repeat for each line and get
foo        = 1
fizzbuzz   = 2
bar        = 3
You can combine the two macros into one:
0f=10iSPACEESCd11|j
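The end result the macros are aiming for, every = padded out to a common column, can also be described in a few lines of Python. This is only a sketch of the target transformation, not a Vim recipe: align_equals is a hypothetical name, and it pads to the longest left-hand side instead of a fixed column such as 12.

```python
def align_equals(lines):
    """Hypothetical helper: pad the text before the first '=' on each line
    so that all '=' signs end up in the same column.

    Lines without '=' are returned unchanged. Compound operators such as
    '+=' or '==' are not treated specially in this sketch.
    """
    lefts = [line.split('=', 1)[0].rstrip() for line in lines if '=' in line]
    width = max((len(left) for left in lefts), default=0)
    aligned = []
    for line in lines:
        if '=' not in line:
            aligned.append(line)
            continue
        left, right = line.split('=', 1)
        aligned.append(left.rstrip().ljust(width) + ' =' + right)
    return aligned


if __name__ == "__main__":
    for line in align_equals(["foo = 1", "fizzbuzz = 2", "bar = 3"]):
        print(line)
```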
Not completely satisfied with Tabular and Align, I've recently built another similar, but simpler plugin called vim-easy-align.
Check out the demo screencast: https://vimeo.com/63506219
For the first case, simply visual-select the lines and enter the command :EasyAlign= to do the trick.
If you have defined a mapping such as,
vnoremap <silent> <Enter> :EasyAlign<cr>
you can do the same with just two keystrokes: Enter and =
The case you mentioned in the comment,
final int foo = 3;
public boolean bar = false;
can be easily aligned using ":EasyAlign*\ " command, or with the aforementioned mapping, Enter, *, and space key, yielding
final  int     foo = 3;
public boolean bar = false;
There are two plugins for that: Either the older Align - Help folks to align text, eqns, declarations, tables, etc, or Tabular.

Regex Replacing &#58; to ":" etc

I've got a bunch of strings like:
"Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
I would like to replace that with
"Hello, here's a test colon:. Here's a test semi-colon;"
And so on for all printable ASCII values.
At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).
Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.
Thanks,
Dom
The big advantage of using a regex is that it deals with tricky cases like &#38;#38;. Entity replacement isn't iterative, it's a single step, so that input should become &#38; and not &. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.
I'd say a regex was the right choice.
Is it the best regex, though? You know you need two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32; to &#126;. For that reason, you could consider &#1?\d\d;.
As for replacing the content, I'd use the basic algorithm described for boost::regex::replace :
For each match // Using regex_iterator<>
Print the prefix of the match
Remove the first 2 and last character of the match (&#;)
lexical_cast the result to int, then truncate to char and append.
Print the suffix of the last match.
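The steps above can be sketched in Python with re.finditer standing in for the regex iterator. This is not the Boost code the answer has in mind, just an illustration of the same loop; decode_ncrs is a made-up name, and the pattern &#1?\d\d; is taken as-is, so the sketch does not check that the number is really within the printable range.

```python
import re

NCR = re.compile(r'&#(1?\d\d);')


def decode_ncrs(text):
    """Hypothetical helper: replace each &#NN; by its character in one pass."""
    pieces = []
    last_end = 0
    for match in NCR.finditer(text):
        # The "prefix": text between the previous match and this one.
        pieces.append(text[last_end:match.start()])
        # The digits between "&#" and ";" converted to a character.
        pieces.append(chr(int(match.group(1))))
        last_end = match.end()
    # The "suffix": whatever follows the last match.
    pieces.append(text[last_end:])
    return ''.join(pieces)


if __name__ == "__main__":
    print(decode_ncrs("a test colon&#58;. a test semi-colon&#59;"))
```

Because the input is scanned only once, a string such as &#38;#38; comes out as &#38; and is not decoded a second time.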
This will probably earn me some down votes, seeing as this is not a c++, boost or regex response, but here's a SNOBOL solution. This one works for ASCII. Am working on something for Unicode.
NUMS = '1234567890'
MAIN LINE = INPUT :F(END)
SWAP LINE ? '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
OUTPUT = LINE :(MAIN)
END
* Repaired SNOBOL4 Solution
* &#38;#38; -> &#38;
digit = '0123456789'
main line = input :f(end)
result =
swap line arb . l
+ '&#' span(digit) . n ';' rem . line :f(out)
result = result l char(n) :(swap)
out output = result line :(main)
end
I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.
Here's a Python implementation:
import re
s = "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)
Producing:
"Hello, here's a test colon:. Here's a test semi-colon;"
I've looked some at boost now and I see it has a regex_replace function. But C++ really confuses me so I can't figure out if you could use a callback for the replace part. But the string matched by the (\d\d) group should be available in $1 if I read the boost docs correctly. I'd check it out if I were using boost.
The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:
dd = "0123456789"
ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.
echo "Hello, here's a test colon&#58;. Here's a test semi-colon&#59; Further test &#38;#65;. abc.&#126;.def." | perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) ) { return sprintf("%c",$x); } else { return "&#".$x.";"; } } while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'
Pretty-printing that:
#!/usr/bin/perl -w
sub translate
{
my $x=$_[0];
if ( ($x >= 32) && ($x <= 126) )
{
return sprintf( "%c", $x );
}
else
{
return "&#" . $x . ";" ;
}
}
while (<>)
{
s/&#(1?\d\d);/&translate($1)/ge;
print;
}
Though perl being perl, I'm sure there's a much better way to write that...
Back to C code:
You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
Here's another Perl one-liner (see @mrree's answer):
a test file:
$ cat ent.txt
Hello, &#12; here's a test colon&#58;.
Here's a test semi-colon&#59; '&#131;'
the one-liner:
$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt
or using more specific regex:
$ perl -pe"s~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg" ent.txt
both one-liners produce the same output:
Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
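The "more specific regex" is easy to get wrong by one digit, so it can be worth checking which numbers it accepts. A brute-force check in Python (the name PRINTABLE_NCR is introduced here for illustration only):

```python
import re

# The numeric part of the more specific pattern, without "&#" and ";".
PRINTABLE_NCR = re.compile(r'1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]')

# Try every number from 0 to 999 and keep those the pattern fully matches.
accepted = [n for n in range(1000) if PRINTABLE_NCR.fullmatch(str(n))]

if __name__ == "__main__":
    print(accepted[0], accepted[-1], len(accepted))  # 32 126 95
```

The three alternatives cover 100 to 126, 32 to 39 and 40 to 99, which together is exactly the printable ASCII range.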
The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.
// spirit_ncr2a.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>
int main() {
using namespace BOOST_SPIRIT_CLASSIC_NS;
std::string line;
while (std::getline(std::cin, line)) {
assert(parse(line.begin(), line.end(),
// match "&#(\d+);" where 32 <= $1 <= 126 or any char
*(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
| anychar_p[&putchar])).full);
putchar('\n');
}
}
compile:
$ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp
run:
$ echo "Hello, &#12; here's a test colon&#58;." | spirit_ncr2a
output:
"Hello, &#12; here's a test colon:."
I did think I was pretty good at regex, but I have never seen lambdas being used in a regex. Please enlighten me!
I'm currently using Python and would have solved it with this one-liner:
''.join([x.isdigit() and chr(int(x)) or x for x in re.split(r'&#(\d+);', THESTRING)])
Does that make any sense?
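It does work for typical input, because re.split with a capturing group returns the captured digits interleaved with the surrounding text. One catch is that the isdigit() test cannot tell a captured number from ordinary text that happens to consist only of digits. A hedged variant that relies on position instead (decode_by_split is a hypothetical name, and no range check is done on the number):

```python
import re


def decode_by_split(text):
    """Hypothetical helper: decode &#NN; using re.split.

    With one capturing group, re.split returns
    [text, digits, text, digits, ..., text], so the captured digits are
    always at the odd indices.
    """
    parts = re.split(r'&#(\d+);', text)
    return ''.join(
        chr(int(part)) if index % 2 else part
        for index, part in enumerate(parts)
    )


if __name__ == "__main__":
    print(re.split(r'&#(\d+);', "7&#58;8"))  # ['7', '58', '8']
    print(decode_by_split("7&#58;8"))        # 7:8
```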
Here's a NCR scanner created using Flex:
/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
/**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}
To make an executable:
$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl
Example:
$ echo "Hello, &#12; here's a test colon&#58;.
> Here's a test semi-colon&#59; '&#131;'
> &#38;#59; <-- may be recursive" \
> | ncr2a
It prints for non-recursive version:
Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
&#59; <-- may be recursive
And the recursive one produces:
Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
; <-- may be recursive
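The difference between the two variants can be reproduced in a few lines of Python. In this sketch the recursive behaviour is approximated by repeating the single pass until nothing changes, which is an assumption of the example and may differ from flex's unput() in unusual edge cases; decode_once and decode_recursive are made-up names.

```python
import re

# Decimal NCRs for printable ASCII, 32..126.
NCR = re.compile(r'&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);')


def decode_once(text):
    """Single pass: decoded characters are never looked at again."""
    return NCR.sub(lambda match: chr(int(match.group(1))), text)


def decode_recursive(text):
    """Repeat the single pass until the text stops changing."""
    while True:
        decoded = decode_once(text)
        if decoded == text:
            return decoded
        text = decoded


if __name__ == "__main__":
    sample = "&#38;#59; <-- may be recursive"
    print(decode_once(sample))
    print(decode_recursive(sample))
```

The loop always terminates because every successful replacement makes the string shorter.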
This is one of those cases where the original problem statement apparently isn't very complete, it seems, but if you really want to only trigger on cases which produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in-range and others are not).
dd = "0123456789"
ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
It would not be particularly difficult to handle that case as well (e.g. &#131;&#58; produces "&#131;:"):
dd = "0123456789"
ccp = "#" (span(dd) $ n ";") $ enc
+ *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = replace(line,char(10),"#") :(rdl)
done
end
Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.
#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>
int main()
{
boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
const int subs[] = {-1, 1}; // non-match & subexpr
boost::sregex_token_iterator end;
std::string line;
while (std::getline(std::cin, line)) {
boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);
for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
if (isncr) { // convert NCR e.g., '&#58;' -> ':'
const int d = boost::lexical_cast<int>(*tok);
assert(32 <= d && d < 127);
std::cout << static_cast<char>(d);
}
else
std::cout << *tok; // output as is
}
std::cout << '\n';
}
}