BASH sed expression optimisation or conversion to native bash substitution - regex

I have a sed expression which might run anywhere from 0
to thousands of times, the input is piped and substituted:
somefunc() { sed "s/\s*//g; s/[\"\'~\!#\\\/\$%\^&\*\(\)\=]//g; s/\.\.//g"; }
And I simply use it like this:
echo 'Hello world' | somefunc
This is quite slow, so I tried to convert it to a native bash substitution and failed, and I don't know if there's a way for me to optimise it, so I decided to ask here.
Is there a way to do this? Maybe convert it to a native bash substitution, maybe use a different tool; anything that is even slightly faster helps.
Thanks in advance

somefunc() {
    # Delete all whitespace and the listed special characters in one
    # pass, then strip every ".." from the result.
    local tmp=${1//[[:space:]\"\'~!#\\$%^&*\/()=]/}
    printf '%s' "${tmp//../}"
}
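Note that this version takes the string as an argument instead of reading stdin, so call sites change accordingly. A minimal usage sketch (the sample string is made up):
clean=$(somefunc 'He**o "world"!')   # -> Heoworld
printf '%s\n' "$clean"
Since everything happens inside the shell, this avoids forking a sed process per call, which is usually where the time goes.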

Related

How to pass a variable line number in sed substitute command

I am trying to do a sed operation like this
sed -i '100s/abc/xyz/' filename.txt
I want the 100 to come from a variable, say $vars, in a Perl script. So I am trying:
system("sed -i "${vars}s/abc/xyz/" filename.txt");
This throws an error. When I instead put the command in single quotes:
system('sed -i "${vars}s/abc/xyz/" filename.txt');
it substitutes the wrong thing. What can be done?
Better and safer is to use the LIST variant of system, because it avoids unsafe shell command-line parsing. The command, sed in your case, will receive its command-line arguments unaltered and without the need to quote them.
NOTE: I added -MO=Deparse just to illustrate what the one-liner compiles to.
NOTE: I added -e to be on the safe side, since you have -i on the command line and -i can take a parameter (a backup suffix) on some sed variants.
$ perl -MO=Deparse -e 'system(qw{sed -i -e}, "${vars}s/abc/xyz/", qw{filename.txt})'
system(('sed', '-i', '-e'), "${vars}s/abc/xyz/", 'filename.txt');
-e syntax OK
Of course in reality it would be easier just to do the processing in Perl itself instead of calling sed...
Shelling out to sed from within Perl is a road to unnecessary pain. You're introducing extra layers of quoting and variable expansion, which at best makes your code less clear, and at worst introduces bugs accidentally.
Why not just do it in native Perl, which is considerably more effective? Perl even allows you to do in-place editing if you want.
But it's as simple as:
open( my $input, '<', 'filename.txt' )      or die "Can't open input: $!";
open( my $output, '>', 'filename.txt.new' ) or die "Can't open output: $!";
select $output;    # make print default to the new file
while ( <$input> ) {
    s/abc/xyz/ if $. == $vars;    # $. is the current input line number
    print;
}
close $output;
Or if you're really keen on the in-place edit, you can look into setting $^I:
Perl in place editing within a script (rather than one liner)
But I'd suggest that 'just' renaming the new file over the old one when you're done is as easy.
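For completeness, that rename is a shell one-liner, using the file names from the snippet above:
mv filename.txt.new filename.txt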

Finding and modifying function definitions (C++) via bash-script

Currently I am working on a fairly large project. In order to increase the quality of our code, we decided to enforce the handling of return values (error codes) for every function. GCC supports a warning about ignored return values; however, the function definition has to be preceded by the following attribute:
static __attribute__((warn_unused_result)) ErrorCode test() { /* code goes here */ }
I want to implement a bash script that parses the entire source code and issues a warning in case the
__attribute__((warn_unused_result))
is missing.
Note that all functions that require this kind of modification return a type called ErrorCode.
Do you think this is possible via a bash script ?
Maybe you can use sed with regular expressions. The following worked for me on a couple of test files I tried:
sed -r "s/ErrorCode\s+\w+\s*(.*)\s*\{/__attribute__((warn_unused_result)) \0/g" test.cpp
If you're not familiar with regex, the pattern basically translates into:
ErrorCode, some whitespace, some alphanumerics (function name), maybe some whitespace, open parenthesis, anything (arguments), close parenthesis, maybe some whitespace, open curly brace.
If this pattern is found, it is prefixed with __attribute__((warn_unused_result)). Note that this only works if you always put the opening curly brace on the same line as the arguments and you don't have line breaks in your function declarations.
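Since the question asks for a warning rather than a rewrite, here is a minimal checker sketch built on the same pattern (assuming GNU grep; the --include globs are examples):
grep -rnE 'ErrorCode\s+\w+\s*\(.*\)\s*\{' --include='*.cpp' --include='*.h' . \
    | grep -v 'warn_unused_result' \
    && echo 'WARNING: the functions above lack __attribute__((warn_unused_result))'
The first grep lists candidate definitions with file and line number, the second drops those already annotated on the same line, and the echo only fires if offenders remain. The same single-line-definition caveat applies.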
An easy way I could imagine is via ctags. You create a tag file over all your source code, and then parse the tags file. However, I'm not quite sure about the format of the tags file. The variant I'm using here (Exuberant Ctags 5.8) seems to put an "f" in the fourth column, if the tag represents a function. So in this case I would use awk to filter all tags that represent functions, and then grep to throw away all lines without __attribute__((warn_unused_result)).
So, in a nutshell, first you do
$ ctags **/*.c
This creates a file called "tags" in the current directory. The command might also be ctags-exuberant, depending on your variant. The **/*.c is a glob pattern that might work in your shell - if it doesn't, you have to supply your source files in another way (look at the ctags options).
Then you filter the functions:
$ awk -F '\t' '$4 == "f"' tags | grep -v "__attribute__((warn_unused_result))"
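Both filters can also be folded into a single awk invocation that prints the function name and file (assuming the tab-separated Exuberant Ctags format described above):
awk -F '\t' '$4 == "f" && $0 !~ /warn_unused_result/ {print $1 " in " $2}' tags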
No, it is not possible in the general case. The C++ grammar is the most complex of all the languages I know of, and C++ is not parsable via regular expressions in the general case. You might succeed if you limit yourself to a very narrow set of uses, but I am not sure how feasible it is in your case.
I also do not think the exercise is worth the effort, since sometimes ignoring the result of a function is an OK thing.

grep not matching strings when they come from a variable

I'm writing a script that is helping me process log files. In it, I have my grep flags stored in a variable. The flags and strings themselves work just fine, but when I pass them to grep using a variable, the parts of the string that use escaped characters don't produce any matches. See below:
grepvars="-B4 -Psihe 'caused\sby|unable|fault|error|deadlock|checkpoint|corrupt|fail|exception|fatal|severe|\tat\s'"
grep -B4 -Psihe 'caused\sby|unable|fault|error|deadlock|checkpoint|corrupt|fail|exception|fatal|severe|\tat\s' adapter_15.log > adapter_15-error1.log
grep $grepvars adapter_15.log > adapter_15-error2.log
wc -l *-error?.log
51398 adapter_15-error1.log
25032 adapter_15-error2.log
As you can see, the \tat\s part does not produce matches when passed through a variable to grep. What that is supposed to match is a (literal tab)at(literal space). Although this works correctly without using a variable, I'd rather use one since it makes my multiple grep calls easier to manage. What do I have to do to ensure that grep will perform this match correctly when passed through a variable?
After not having any sort of luck with this, I found a workaround: create a function and call it when needed. Here's what I came up with:
grep4j () {
    unset IFS    # IFS=$'\n' is needed elsewhere in the script, so drop it only for this call
    nice -n 15 grep -B3 -Psihe '\tat\s|caused\sby|unable|fault|error|deadlock|checkpoint|corrupt|fail|exception|fatal|severe' $1
    IFS=$'\n'    # restore it afterwards
}
Yes, I did try unsetting IFS before and after the grep calls that were using the variable; it didn't work (and I need it to be set for other things to work). Doing it as a function like this met my needs, and maybe it will help someone else as well. Cheers!
In case you're curious, this is designed to get relevant messages out of log4j-formatted logs. It saves me a lot of time.
If you're storing all the grep options in a string, then I guess you need to use the evil eval:
str="grep $grepvars adapter_15.log > adapter_15-error2.log"
eval "$str"
It may be easier to stuff options into the environment variable GREP_OPTIONS (note that newer versions of GNU grep deprecate it), and patterns into a file, like so:
grep -f <file-with-patterns> ...
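A sketch of the pattern-file half, with a hypothetical file name. Patterns go one per line; printf interprets \t, so a real tab lands in the file, and plain -E patterns are used because grep -P has historically accepted only a single pattern (a literal space stands in for \s):
printf '%s\n' 'caused by' unable fault error deadlock checkpoint corrupt fail exception fatal severe > errpatterns.txt
printf '\tat \n' >> errpatterns.txt
grep -B4 -sihE -f errpatterns.txt adapter_15.log > adapter_15-error2.log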

Bash quote behavior and sed

I wrote a short bash script that is supposed to strip the leading tabs/spaces from a string:
#!/bin/bash
RGX='s/^[ \t]*//'
SED="sed '$RGX'"
echo " string" | $SED
It works from the command line, but the script gets this error:
sed: -e expression #1, char 1: unknown command: `''
My guess is that something is wrong with the quotes, but I'm not sure what.
Putting commands into variables and getting them back out intact is hard, because quoting doesn't work the way you expect (see BashFAQ #050, "I'm trying to put a command in a variable, but the complex cases always fail!"). There are several ways to deal with this:
1) Don't do it unless you really need to. Seriously, unless you have a good reason to put your command in a variable first, just execute it and don't deal with this messiness.
2) Don't use eval unless you really really really need to. eval has a well-deserved reputation as a source of nasty and obscure bugs. They can be avoided if you understand them well enough and take the necessary precautions to avert them, but this should really be a last resort.
3) If you really must define a command at one point and use it later, define it either as a function or as an array. Here's how to do it with a function:
RGX='s/^[ \t]*//'
SEDCMD() { sed "$RGX"; }
echo " string" | SEDCMD
Here's the array version:
RGX='s/^[ \t]*//'
SEDCMD=(sed "$RGX")
echo " string" | "${SEDCMD[#]}"
The idiom "${SEDCMD[#]}" lets you expand an array, keeping each element a separate word, without any of the problems you're having.
It does: something is indeed wrong with the quotes. Try:
#!/bin/bash
RGX='s/^[ \t]*//'
#SED='$RGX'
echo " string" | sed "$RGX"
This works.
The issue you have is with quotes and spaces: double-quoted strings are passed as single arguments.
Add set -x to your script and you'll see that variables inside single quotes are not expanded.
To expand on my comment above:
#!/bin/bash
RGX='s/^[[:space:]]+//'
SED="sed -r '$RGX'"
eval "printf \" \tstring\n\" | $SED"
Note that this also makes your regex an extended one, for no particular reason. :-)

Is Perl file processing limited in size?

I've made a translator in Perl for a message-board migration. All I do is apply regexes and print the result; I write stdout to a file and off we go. But the problem is that my program stops working after 18 MB have been written!
I've made a translate.pl ( https://gist.github.com/914450 )
and launch it with this line :
$ perl translate.pl mydump.sql > mydump-bbcode.sql
Really sorry for the quality of the code, but I never use Perl... I tried sed for the same job but didn't manage to apply the regexes from the original script.
[EDIT]
I reworked the code and sanitized some regexes (see gist.github.com/914450), but I'm still stuck. When I split the big dump into 15 MB files, I launched translate.pl seven processes at a time to use all cores, but each one still stops at a varying size. A "tail" of the output doesn't show a complete message or URL when it stops...
Thanks guys! I'll let you know if I finally manage it.
yikes - start with the basics:
use strict;
use warnings;
...at the top of your script. It will complain about not properly declaring your lexicals, so go ahead and do that. I don't see anything obvious that would be truncating your file, but perhaps one or more of your regexes is pathological. Also, the undefs at the end are not needed.
For what you are doing, you might consider just using sed.
You say the "script stops". It keeps running but produces no more output? Or actually stops running? If it stops running, what does:
perl translate.pl mydump.sql > mydump-bbcode.sql
echo $?
show? And if you add a print STDERR "done!\n"; after your loop, does that show up?
Perl can certainly handle files much larger than 18 MB. I know because I routinely run files of 5 GB through Perl.
I think that your problem is in while($html=<FILE>).
Whenever $html is set to an empty line the while will evaluate as False and exit the loop.
You need to use something like while( defined( $html = <FILE> ) )
Edit:
Hmm. I had always thought you need the defined, but in my testing just now it didn't exit on blank lines or 0. Must be more of that special Perl magic: in fact it is documented magic, because when the while condition is exactly an assignment from a readline, Perl implicitly wraps it in defined().
Indeed, if you restructure the while loop enough, you can fool Perl into working the way I always thought it worked. (And it might have, in Perl 4 or in earlier versions of Perl 5.)
This will fail:
$x = <>;
chomp $x;
while( $x ) {    # stops on any line that chomps to "" or "0"
    print $x;
    $x = <>;
    chomp $x;
}
There could be any number of things going on:
Try adding $| = 1; to the top of your script. This will make all output unbuffered.
One of your regexes is going crazy and is deleting strings when you're not expecting it.
You've run out of disk space.
There's nothing really wrong with your script (other than you're missing use strict; use warnings; and you're not using the three-argument form of open()) that would cause it to stop working after some magic number of bytes.
Hello guys, and thank you so much for your help and ideas!
After trying to cut up and parallelize the jobs, I split my program into 3 programs: translate1.pl, translate2.pl and translate3.pl. The job is done, and it's fast with 8 active cores!
Then my launcher.sh starts the 3 scripts in succession for each split file, done with 2 loops, and here we go :)
Regards, Yoann