Ignore C comments and include statements using sed (//, /**, **/, #) - regex

So I've got some example C code
/**
example text
**/
#include <stdio.h>
int main(){
int example = 0;
// example text
return;
}
How would I specifically use sed to ignore all lines starting with // or # while also ignoring lines in the range of /** to **/?
I've tried things along the lines of sed -E '/(^#|\/\*/,/\*\/|^\/\/)/!s/example/EXAMPLE/g' but I have a feeling I'm not using the | correctly as it pops an error saying "unmatched ("
My desired final output should be
/**
example text
**/
#include <stdio.h>
int main(){
int EXAMPLE = 0;
// example text
return;
}
The change from the sed command would have changed instances of the word "example" in the program to the uppercase version "EXAMPLE", and what I'm trying to do is make sure words on commented lines are not being changed.

As sin and melpomene point out in the comments, sed may not be the right tool for this job in general, but the command below does the trick for your particular exercise:
sed -E '/^(#|\/\/)/b ; /\/\*\*/,/\*\*\//b; s/example/EXAMPLE/g' file
/**
example text
**/
#include <stdio.h> example
int main(){
int EXAMPLE = 0;
// example text
return;
}
The sed command b branches to a label. The sed manual describes it as follows:
'b LABEL'
Unconditionally branch to LABEL. The LABEL may be omitted, in
which case the next cycle is started.
In other words, instead of negating a pattern with /pattern/!, you can use /pattern/b without a label: when /pattern/ matches, sed branches to the end of the script and starts the next cycle, skipping the s/example/EXAMPLE/g substitution.
Your attempt does not work because you try to use the logical OR | to combine plain patterns like # or // with a range like /\/\*\*/,/\*\*\//. A range is two addresses separated by a comma, not a regex, so it cannot be one of the alternatives inside ( | ).
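For comparison, here is a minimal C++ sketch of the same skip logic, in case the job ever outgrows a one-liner. It is not from the answer above: the names replace_all and substitute_outside_comments are made up for illustration, it skips only lines that begin with # or // (as the question asks), and it treats a /** ... **/ range the way sed does, testing for the end marker from the next line on.

```cpp
#include <sstream>
#include <string>

// Replace every occurrence of `from` with `to` in `line` (plain text, no regex).
std::string replace_all(std::string line, const std::string& from, const std::string& to) {
    if (from.empty()) return line;
    for (std::size_t pos = line.find(from); pos != std::string::npos;
         pos = line.find(from, pos + to.size())) {
        line.replace(pos, from.size(), to);
    }
    return line;
}

// Lines starting with '#' or "//" and lines inside a "/**" ... "**/" range
// pass through untouched; every other line gets the substitution.
std::string substitute_outside_comments(const std::string& source,
                                        const std::string& from,
                                        const std::string& to) {
    std::istringstream in(source);
    std::string line, out;
    bool in_block = false;
    while (std::getline(in, line)) {
        if (in_block) {
            // Inside the range: copy the line, leave the range at "**/".
            if (line.find("**/") != std::string::npos) in_block = false;
        } else if (line.find("/**") != std::string::npos) {
            in_block = true;  // the end marker is only tested from the next line on
        } else if (line.rfind("#", 0) != 0 && line.rfind("//", 0) != 0) {
            line = replace_all(line, from, to);
        }
        out += line;
        out += '\n';
    }
    return out;
}
```

Calling substitute_outside_comments(text, "example", "EXAMPLE") on the sample program changes only the int example = 0; line. Like the sed command, this is line-based and will not cope with a comment that starts in the middle of a code line.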

Related

How to implement syntax highlighting using sed command in the gnome-terminal?

I want to highlight function names of a C program using the sed command on the linux terminal.
I am able to do it using tput to color the function name. For which I have provided the code below. (first line)
I am not able to do the coloring if I use printf/echo/command substitution to color the output of the terminal. (second line of code). I guess this is because I am not able to reference the strings with \1 and \2. When using this it shows some other characters instead of the function names.
The regular expression I have used reads that, the first character of function name can be an alphabet or an underscore and the second character can be alphanumeric and underscore and the third character should be an open parenthesis. I want to reference the Regex by using \1 \2 and \3 and colour everything except \3. This is the idea I have come up with.
My question is, is there any other way to not color the open parenthesis or a way to use the printf and color the function name.
sed -E "s,([a-zA-Z_])([a-zA-Z0-9_]*)(\(),$(tput setaf 1)\1\2$(tput sgr0)\3," Sample.c
sed -E "s,([a-zA-Z_])([a-zA-Z0-9_]*)(\(),$(printf "\033[0;36m\1\2\033[0m\3")," Sample.c
Sample.c:
#include <stdio.h>
int main()
{
int array[100], maximum, size, c, location = 1;
printf("Enter the number of elements in array\n");
scanf("%d", &size);
printf("Enter %d integers\n", size);
for (c = 0; c < size; c++)
scanf("%d", &array[c]);
return 0;
}
Expected result -> main, printf, scanf should be coloured in Sample.c.
The tput version works because tput only emits the escape sequences and leaves \1 and \2 for sed. The printf version fails because the command substitution runs before sed does, and printf itself interprets \1, \2 and \3 as octal escapes, so sed never sees the backreferences.
Bash's $'...' quoting (ANSI-C quoting) may work for you here. Three capture groups didn't seem necessary, so I used two:
BEGC=$'\033[0;36m' ENDC=$'\033[0m'; \
sed -E "s,([a-zA-Z_][a-zA-Z0-9_]*)(\(),${BEGC}\1${ENDC}\2," Sample.c
However, there is another issue: only the first call on each line is highlighted, so nested calls are missed. Notice in the updated Sample.c here that the fictitious "getSize()" function will not be highlighted:
#include <stdio.h>
int main()
{
int array[100], maximum, size, c, location = 1;
printf("Enter the number of elements in array\n");
scanf("%d", &size);
printf("Enter %d integers\n", getSize(size));
for (c = 0; c < size; c++)
scanf("%d", &array[c]);
return 0;
}
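This is not really a recursion problem: without the g flag, s replaces only the first match on each line, so appending g to the s command (...\2,g) highlights getSize( as well. What a simple regex cannot do is tell real calls apart from lookalikes inside strings or comments, or from keywords written as while(; for that you would need something closer to a tokenizer (awk could probably do it).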
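For comparison, a minimal C++ sketch of the same substitution. The function name highlight_calls is made up for illustration and the color is the cyan sequence from the question. Note that std::regex_replace replaces every match on the line by default, the same effect as adding g to sed's s command, so a nested call such as getSize( gets colored too.

```cpp
#include <regex>
#include <string>

// Wrap every identifier that is directly followed by '(' in ANSI color codes.
// Group 1 is the identifier, group 2 the opening parenthesis (left uncolored).
std::string highlight_calls(const std::string& line) {
    static const std::regex call(R"re(([A-Za-z_][A-Za-z0-9_]*)(\())re");
    return std::regex_replace(line, call, "\033[0;36m$1\033[0m$2");
}
```

Feeding each line of Sample.c through highlight_calls and printing the result gives the colored listing. It shares the limits of the sed version: it also colors things that merely look like calls, for example inside string literals.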

Extracting a function body by name with BASH and regex

I have some automatically generated code from MATLAB coder. I would like to make a script to find my entries out of large file. I've successfully plowed my way through regex with BASH to get the main function main\( *([^)]+?)\), and then the body with /\{([^}]+)\}/; however, I'm having a terrible time glueing those together. All I need is the function names contained in main().
I realize that this could be a terrible exercise, but the automatically generated code gives me simple functions that looks like:
int main(int argc, const char * const argv[])
{
(void)argc;
(void)argv;
/* Initialize the application. You do not need to do this more than one time. */
RT_initialize();
/* Invoke the entry-point functions. You can call entry-point functions multiple times. */
main_RT();
/* Terminate the application. You do not need to do this more than one time. */
RT_terminate();
return 0;
}
I would like to extract that function and its body, but my regex is poorer than I recalled.
Any guidance would be greatly appreciated.
A simple way to fairly reliably extract the entire function body is to run the code through a formatter first:
indent -kr < mymain.c | sed -n '/^int main(/,/^}/p'
cflow can give you a function call graph. eg:
cflow -d2 mymain.c
Because of some restrictions of being on BSD, I ended up with the following BASH function, which gets the body of a C function by name from a source file. It was only tested with the well-formatted C code from MATLAB Coder.
function getFunctionInC(){
TMPFILEIDENT="/tmp/indent.$$.tmp" # temp file for the formatted source
indent "$1" "$TMPFILEIDENT" # BSD indent syntax: indent input-file output-file
awk -v name="$2" '
BEGIN { state = 0; last = "" }
# Definition line: the previous line holds the return type, so print it first.
$0 ~ ("^" name "\\(") { print last; state = 1 }
state == 1 { print }
# A closing brace in column 0 ends the function.
/^}/ { if (state) state = 2 }
{ last = $0 }
' "$TMPFILEIDENT"
rm -f "$TMPFILEIDENT"
}
The formatting is terrible on the outputs, but I can easily pull the function names to dynamically create defines. Thanks to everyone who read the question.

std regex_search to match only current line

I use various regexes to parse a C source file, line by line. First I read the whole content of the file into a string:
ifstream file_stream("commented.cpp",ifstream::binary);
std::string txt((std::istreambuf_iterator<char>(file_stream)),
std::istreambuf_iterator<char>());
Then I use a set of regexes, which should be applied in turn until a match is found. Here I give only one as an example:
vector<regex> rules = { regex("^//[^\n]*$") };
char * search =(char*)txt.c_str();
int position = 0, length = 0;
for (int i = 0; i < rules.size(); i++) {
cmatch match;
if (regex_search(search + position, match, rules[i],regex_constants::match_not_bol | regex_constants::match_not_eol))
{
position += ( match.position() + match.length() );
}
}
But it doesn't work. Instead of matching a comment only on the current line, it searches the whole string for the first match. I expected regex_constants::match_not_bol and regex_constants::match_not_eol to make regex_search treat ^ and $ as the start/end of a line only, not the start/end of the whole block. So here is my file:
commented.cpp:
#include <stdio.h>
//comment
The code should fail: my logic is that, with those options to regex_search, the match should fail, because it should search for the pattern in the first line:
#include <stdio.h>
But instead it searches the whole string and immediately finds //comment. I need help making regex_search match only in the current line. The options match_not_bol and match_not_eol do not help me. Of course I could read the file line by line into a vector and then match all rules against each string in the vector, but that is very slow; I have done it, and it takes too long to parse a big file that way. That's why I want to let the regex deal with newlines and use a position counter.
If it is not what you want please comment so I will delete the answer
The way you are using the regex library is not how it is meant to be used.
So here is my advice for anyone who wants to use the std::regex library.
Its default grammar is a modified ECMAScript, which is somewhat poorer than what most modern regex libraries offer.
It has plenty of bugs (these are just the ones I found):
the same regex but different results on Linux and Windows only C++
std::regex and ignoring flags
std::regex_match and lazy quantifier with strange behavior
In some cases (I tested specifically with std::match_results) it is 200 times slower than std.regex in the D language.
Its match flags are very confusing and mostly did not work (at least for me).
Conclusion: do not use it at all.
But if you still have to use C++, then you can:
use boost::regex from the Boost library, because:
It supports Perl-compatible syntax
It has fewer bugs (I have not seen any)
It produces a smaller binary (I mean the executable file after compiling)
It is faster than std::regex
use gcc version 7.1.0 and NOT below. The last bug I found is in version 6.3.0
use clang version 3 or above
If you can be persuaded NOT to use C++, then you can:
Use the D regular expression library, std.regex, for large tasks, because it is:
Fast (see Faster Command Line Tools in D)
Easy
Flexible drn
Use native pcre or pcre2, which are written in C
Extremely fast but a little complicated
Use Perl for a simple task, especially a Perl one-liner
#include <stdio.h>
//comment
The code should fail: my logic is that, with those options to regex_search, the match should fail, because it should search for the pattern in the first line:
#include <stdio.h>
But instead it searches the whole string and immediately finds //comment. I need help making regex_search match only in the current line.
Are you trying to match all // comments in a source code file, or only the first line?
The former can be done like this:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
int main()
{
auto input = std::ifstream{"stream_union.h"};
// Compile the pattern once, outside the loop.
const auto pattern = std::regex(R"(//)");
for(auto line = std::string{}; std::getline(input, line); )
{
auto submatch = std::smatch{};
// Skip lines without a // comment; print the rest.
if(!std::regex_search(line, submatch, pattern)) continue;
std::cout << line << std::endl;
}
std::cout << std::endl;
return EXIT_SUCCESS;
}
And the latter can be done like this:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
int main()
{
auto input = std::ifstream{"stream_union.h"};
auto line = std::string{};
std::getline(input, line); // read only the first line
auto submatch = std::smatch{};
const auto pattern = std::regex(R"(//)");
// Fail unless the first line contains a // comment.
if(!std::regex_search(line, submatch, pattern)) { return EXIT_FAILURE; }
std::cout << line << std::endl;
return EXIT_SUCCESS;
}
If for any reason you need the position of the match, submatch.position(0) gives its offset within the line; input.tellg() only tells you how far the stream has been read.
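On the original question of keeping the whole file in one string: match_not_bol and match_not_eol do the opposite of what the question expects, since they tell the engine that ^ and $ must not match at the ends of the range. To test a rule only at the current position, std::regex_constants::match_continuous is the relevant flag: it makes regex_search accept only a match that begins exactly at the first iterator. A minimal sketch, where the function name comment_length_at is made up for illustration and the rule is the // comment rule from the question without the ^ and $ anchors:

```cpp
#include <regex>
#include <string>

// Length of a "//" comment that starts exactly at `offset` in `text`,
// or 0 if the text at that position is something else.
std::size_t comment_length_at(const std::string& text, std::size_t offset) {
    static const std::regex comment(R"(//[^\n]*)");
    std::smatch match;
    if (offset <= text.size() &&
        std::regex_search(text.begin() + offset, text.end(), match, comment,
                          std::regex_constants::match_continuous)) {
        return static_cast<std::size_t>(match.length(0));
    }
    return 0;
}
```

A scanner built on this tries each rule at the current position, advances by the returned length when one matches, and otherwise moves on by its own fallback rule. No per-line copies of the file are needed.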

Extracting individual sentences from a text file ... I haven't got it right YET

As part of a larger program, I'm extracting individual sentences from a text file and placing them as strings into a vector of strings. I first decided to use the procedure I've commented out. But then, after a test, I realized that it's doing 2 things wrong:
(1) It's not separating sentences when they are separated by a new line.
(2) It's not separating sentences when they end in a quotation mark. (Ex. The sentences Obama said, "Yes, we can." Then the audience gave a thunderous applause. would not be separated.)
I need to fix those problems. However, I'm afraid this is going to end up as spaghetti code, if it isn't already. Am I going about this wrong? I don't want to keep going back and fixing things. Maybe there's some easier way?
// Extract sentences from Plain Text file
std::vector<std::string> get_file_sntncs(std::fstream& file) {
// The sentences will be stored in a vector of strings, strvec:
std::vector<std::string> strvec;
// Print out error if the file could not be found:
if(file.fail()) {
std::cout << "Could not find the file. :( " << std::endl;
// Otherwise, proceed to add the sentences to strvec.
} else {
char curchar;
std::string cursentence;
/* While we haven't reached the end of the file, add the current character to the
string representing the current sentence. If that current character is a period,
then we know we've reached the end of a sentence if the next character is a space or
if there is no next character; we then must add the current sentence to strvec. */
while (file >> std::noskipws >> curchar) {
cursentence.push_back(curchar);
if (curchar == '.') {
if (file >> std::noskipws >> curchar) {
if (curchar == ' ') {
strvec.push_back(cursentence);
cursentence.clear();
} else {
cursentence.push_back(curchar);
}
} else {
strvec.push_back(cursentence);
cursentence.clear();
}
}
}
}
return strvec;
}
Given your request to detect sentence boundaries by punctuation, whitespace, and certain combinations of them, using a regular expression seems to be a good solution. You can use a regular expression to describe possible sequences of characters that indicate sentence boundaries, e.g.
[.!?]\s+
which means: "one of dot, exclamation mark, or question mark, followed by one or more whitespace characters".
One particularly convenient way of using regular expressions in C++ is to use the regex implementation included in the Boost library. Here is an example of how it works in your case:
#include <algorithm>
#include <string>
#include <vector>
#include <iostream>
#include <iterator>
#include <boost/regex.hpp>
int main()
{
/* Input. */
std::string input = "Here is a short sentence. Here is another one. And we say \"this is the final one.\", which is another example.";
/* Define sentence boundaries. */
boost::regex re("(?: [\\.\\!\\?]\\s+" // case 1: punctuation followed by whitespace
"| \\.\\\",?\\s+" // case 2: end of quotation
"| \\s+\\\")", // case 3: start of quotation
boost::regex::perl | boost::regex::mod_x);
/* Iterate through sentences. */
boost::sregex_token_iterator it(begin(input),end(input),re,-1);
boost::sregex_token_iterator endit;
/* Copy them onto a vector. */
std::vector<std::string> vec;
std::copy(it,endit,std::back_inserter(vec));
/* Output the vector, so we can check. */
std::copy(begin(vec),end(vec),
std::ostream_iterator<std::string>(std::cout,"\n"));
return 0;
}
Notice I used the boost::regex::perl and boost::regex::mod_x options to construct the regex matcher. This allowed me to use extra whitespace inside the regex to make it more readable.
Also note that certain characters, such as . (dot) and ? (question mark), are metacharacters with special meanings outside a character class, so there they need to be escaped (i.e. you need to put \\ in front of them in a C++ string literal). Inside [...] as above, the escapes are harmless but not required.
When compiling/linking the code above, you need to link it with the boost-regex library. Using GCC the command looks something like:
g++ -W -Wall -std=c++11 -o test test.cpp -lboost_regex
(assuming your program is stored in a file called test.cpp).
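If linking against Boost is not an option, roughly the same idea can be sketched with the standard <regex> header. This is an assumption-laden variant rather than a translation of the code above: split_sentences is a made-up name, the boundary rule is simplified to "punctuation, optional closing quote, whitespace", and the punctuation is kept in the output instead of being dropped. Because \s also matches newlines, it covers both problems from the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split text at sentence boundaries, keeping the closing punctuation.
std::vector<std::string> split_sentences(const std::string& text) {
    // Group 1: closing punctuation plus an optional closing quote.
    // The trailing \s+ covers spaces and newlines alike and is discarded.
    static const std::regex boundary(R"re(([.!?]"?)\s+)re");
    std::vector<std::string> sentences;
    std::size_t start = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), boundary), end; it != end; ++it) {
        const auto stop = static_cast<std::size_t>(it->position(1) + it->length(1));
        sentences.push_back(text.substr(start, stop - start));
        start = static_cast<std::size_t>(it->position(0) + it->length(0));
    }
    // Whatever follows the last boundary is the final sentence.
    if (start < text.size()) sentences.push_back(text.substr(start));
    return sentences;
}
```

Abbreviations such as "Mr. Smith" will still be split in the wrong place; a pattern like this only goes so far before a proper sentence tokenizer is needed.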

Why my perl script isn't finding bad indentation from my regex match

My work's coding standard uses this bracket indentation:
some declaration
{
stuff = other stuff;
};
control structure, function, etc()
{
more stuff;
for(some amount of time)
{
do something;
}
more and more stuff;
}
I'm writing a perl script to detect incorrect indentation. Here's what I have in the body of a while(<some-file-handle>):
# $prev holds the previous line in the file
# $current holds the current line in the file
if($prev =~ /^(\t*)[^;]+$/ and $current =~ /^(?<=!$1\t)[\{\}].+$/) {
print "$file # line ${.}: Bracket indentation incorrect\n";
}
Here, I'm trying to match:
$prev: A line not ended with a semi-colon, followed by...
$current: A line not having the number of leading tabs+1 of the previous line.
This doesn't seem to match anything, at the moment.
The $prev pattern needs some modification.
It should be something like \t*, then .+, then not ending in a semicolon.
Also, $current should be like:
anything ending in ; or { or } that does not have the number of leading tabs + 1 of the previous line.
EDIT
the perl code to try the $prev
#!/usr/bin/perl -l
open(FP,"example.cpp");
while(<FP>)
{
if($_ =~ /^(\t*)[^;]+$/) {
print "got the line: $_";
}
}
close(FP);
//example.cpp
for(int i = 0;i<10;i++)
{
//not this;
//but this
}
//output
got the line: {
got the line: //but this
got the line: }
it did not detect the line with the for loop ...
am i missing something...
I see a couple of problems...
Your $prev regex matches only lines that have no ; anywhere, which breaks on lines like for (int x = 1; x < 10; x++).
If the indent of the opening { is incorrect, you will not detect that.
Try this instead; it only cares that the line does not end in ; or { (ignoring trailing whitespace):
/^(\s*).*[^{;]\s*$/
Now change your strategy so that whenever you see a line which does not end in { or ;, you increment the indent counter.
If you see a line which ends in }; or }, decrement your indent counter.
Compare all lines against this:
/^\t{$counter}[^\s]/
so...
$counter = 0;
if (!($curr =~ /^\t{$counter}[^\s]/)) {
# error detected
}
if ($curr =~ /\};?\s*$/) {
$counter--; # line ends in } or };
} elsif ($curr =~ /^(\s*).*[^{;]\s*$/) {
$counter++; # line does not end in { or ;
}
sorry for not styling my code according to your standards... :)
And you intend to only count tabs (not spaces) for indentation?
Writing this kind of checker is complicated. Just think about all the possible constructs that uses braces that should not change indentation:
s{some}{thing}g
qw{ a b c }
grep { defined } @a
print "This is just a { provided to confuse";
print <<END;
This {
$is = not $code
}
END
But anyway, if the issues above aren't important to you, consider whether the semicolon is important at all in your regex. After all, writing
while($ok)
{
sort { some_op($_) }
grep { check($_) }
my_func(
map { $_->[0] } @list
);
}
should be possible.
Have you considered looking at Perltidy?
Perltidy is a Perl script that reformats Perl code into set standards. Granted, what you have isn't part of the Perl standard, but you can probably tweak the curly braces via the configuration file Perltidy uses. If all else fails, you can hack through the code. After all, Perltidy is just a Perl script.
I haven't really used it, but it might be worth looking into. Your problem is trying to locate all the various edge cases, and making sure you're handling them correctly. You can parse 100 programs only to find that the 101st reveals problems in your formatter. Perltidy has been used by thousands of people on millions of lines of code. If there is an issue, it probably already has been found.