Bash script to match segments in lines of source code - regex

I'm trying to learn a new programming language, and it's big. Thousands of new terms to learn. I know programming, but I often don't know the name this language uses for a particular procedure or constant. But I have a script file that I put together that helps tremendously by searching through a large selection of source files, as long as I get some group of characters right.
But now I want to use && to match up multiple segments in the same line, and I want to pass this whole expression to the script file as one argument, so I might pass it this with a read command:
moo && cow
And it would match this:
Moonlight over Moscow
But not this:
I heard a cow mooing.
If I wanted it either way I would pass it this:
moo && cow || cow && moo
It's tricky, and probably outside what you can normally do with the available syntax. But then I'm no expert, so I don't really know.
I'm flexible on what gets passed to the script, like single &s and |s, the use of brackets, and so on. I just need to understand the rules involved and which utility can do it for me. Or set of utilities if it comes to that.

If you only want to check for the two elements in order, simply match anything between them with .*:
my_str="moonlight over moscow"
if [[ $my_str =~ moo.*cow ]]; then
    echo "match"
fi
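For the && / || combinations you describe, one option is to skip regex entirely and let a small wrapper split the expression itself. Below is a rough sketch, not a polished solution: the script name matchline.sh and the calling convention are made up, it assumes the expression only combines plain substrings with && and || (single & and | work too), and it needs bash 4+ for the ${var,,} lowercasing:
#!/usr/bin/env bash
# matchline.sh (hypothetical): exit 0 if LINE satisfies EXPR, else exit 1.
# EXPR is alternatives separated by || (or |); each alternative is a list of
# substrings separated by && (or &) that must all appear in LINE, in order.
# Matching is case-insensitive.
# Usage: matchline.sh 'moo && cow || cow && moo' 'Moonlight over Moscow'
expr_str=$1
line=${2,,}                                       # lowercase the line (bash 4+)
IFS='|' read -r -a alts <<< "${expr_str//||/|}"   # split into alternatives
for alt in "${alts[@]}"; do
    ok=1
    rest=$line
    IFS='&' read -r -a parts <<< "${alt//&&/&}"   # split into required segments
    for part in "${parts[@]}"; do
        read -r part <<< "$part"                  # trim surrounding whitespace
        part=${part,,}
        [[ -z $part ]] && continue
        if [[ $rest == *"$part"* ]]; then
            rest=${rest#*"$part"}                 # later segments must follow earlier ones
        else
            ok=0
            break
        fi
    done
    (( ok )) && exit 0                            # this alternative matched
done
exit 1
Your existing script could then test each candidate line with something like matchline.sh "$expr" "$line" && echo "$line".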

Related

Finding and modifying function definitions (C++) via bash-script

Currently I am working on a fairly large project. In order to increase the quality of our code, we decided to enforce the treatment of return values (Error Codes) for every function. GCC supports a warning concerning the return value of a function; however, the function definition has to be preceded by the following attribute.
static __attribute__((warn_unused_result)) ErrorCode test() { /* code goes here */ }
I want to implement a bash script that parses the entire source code and issues a warning in case the
__attribute__((warn_unused_result))
is missing.
Note that all functions that require this kind of modification return a type called ErrorCode.
Do you think this is possible via a bash script?
Maybe you can use sed with regular expressions. The following worked for me on a couple of test files I tried:
sed -r "s/ErrorCode\s+\w+\s*\(.*\)\s*\{/__attribute__((warn_unused_result)) &/g" test.cpp
If you're not familiar with regex, the pattern basically translates into:
ErrorCode, some whitespace, some alphanumerics (function name), maybe some whitespace, open parenthesis, anything (arguments), close parenthesis, maybe some whitespace, open curly brace.
If this pattern is found, it is prefixed by __attribute__((warn_unused_result)). Note that this only works if you always put the opening curly brace on the same line as the arguments and you don't have line breaks in your function declarations.
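Since the original goal was to issue a warning rather than rewrite the files, a report-only variant of the same idea could look like this (a rough sketch assuming GNU grep and, as above, that each declaration fits on one line; the --include globs are only examples):
# list ErrorCode-returning definitions, then drop the ones that are already annotated
grep -rnE 'ErrorCode\s+\w+\s*\(.*\)\s*\{' --include='*.cpp' --include='*.h' . |
    grep -v 'warn_unused_result'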
An easy way I could imagine is via ctags. You create a tag file over all your source code, and then parse the tags file. However, I'm not quite sure about the format of the tags file. The variant I'm using here (Exuberant Ctags 5.8) seems to put an "f" in the fourth column, if the tag represents a function. So in this case I would use awk to filter all tags that represent functions, and then grep to throw away all lines without __attribute__((warn_unused_result)).
So, in a nutshell, first you do
$ ctags **/*.c
This creates a file called "tags" in the current directory. The command might also be ctags-exuberant, depending on your variant. The **/*.c is a glob pattern that might work in your shell - if it doesn't, you have to supply your source files in another way (look at the ctags options).
Then you filter the functions:
$ cat tags | awk -F '\t' '$4 == "f" {print $0}' | grep -v "__attribute__((warn_unused_result))"
No, it is not possible in the general case. The C++ grammar is the most complex of all the languages I know of, and C++ cannot be parsed with regular expressions in the general case. You might succeed if you limit yourself to a very narrow set of uses, but I am not sure how feasible that is in your case.
I also do not think the exercise is worth the effort, since sometimes ignoring the result of a function is an OK thing to do.

build a control file to reformat source file with <wbr>

My problem: long chemical terms, without any guidance to a browser about where to break the term. Some terms are over 70 characters.
My goal: introduce <wbr> at logical insertion points.
Example of problem:
isoquinolinetetramethylenesulfoxidetetrachlororuthenate (55 chars)
Example of opportunities to break a chemical term (e.g. the way a person would pronounce the term as opposed to typing the term):
iso<wbr>quinoline
tetra<wbr>methylene
methylene<wbr>sulfoxide
tetra<wbr>chloro
Usually (but not always) iso, tetra, and methyl are word_break_opportunities.
In general how should I set up an environment with:
control file with "rules" that introduce word_break opportunities
file on which to apply the rules from the control file
The control file will be updated with new rules as new chemical terms are encountered.
Would like to use: sed, awk, regex.
Perhaps the environment would look like:
awk -f rules.awk inputfile.txt > outputfile.txt
I'm prepared for trial and error, so I would appreciate a basic explanation so I can refine the control file myself.
My platform: Windows 7; 64 bit; 8 GB memory; GNUwin32; sed 4.1.5.4013; awk 3.1.6.2962
Thank you in advance.
Your first job is to come up with a list of what is and isn't breakable. Once you have this you can define a format to interpret, and build some code around it.
For example, I would probably go something like:
1. Opening chars:
iso
tetra
then some code like:
for Each openingString {
    if (string.startsWith(openingString)) {
        insert wbr after opening string
    }
}
2. Opening chars, unless followed by:
iso|"tope|bob"
tetra|"pak"
for Each openingString {
    if (string.startsWith(openingString)) {
        get the next element from the row (after the |, surrounded by ")
        split around the |
        for each part
            if (!string.startsWith(part, openingString.length)) {
                insert wbr after openingString
            }
    }
}
then build up from there. It's a pretty monumental task though; it's going to take a lot of building on to get to something useful, but it's doable if you're committed to it. The first task, though, is to decide how you're going to hold these mappings.
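To make stage 1 concrete with the awk you mentioned, here is a minimal sketch. The file names rules.txt, inputfile.txt and outputfile.txt are placeholders; the control file holds one breakable fragment per line, and the fragments are used as plain regular expressions (so rules containing regex metacharacters or & would need escaping). Rules are applied in file order, so an early rule can insert a <wbr> that stops a longer, later rule from matching.
# rules.txt holds one fragment per line, e.g.:
#   iso
#   tetra
#   methylene
#   chloro
awk '
NR == FNR {                          # first file: load the break rules
    if ($0 != "") rules[++n] = $0
    next
}
{                                    # second file: apply every rule to each line
    line = $0
    for (i = 1; i <= n; i++)
        gsub(rules[i], rules[i] "<wbr>", line)
    print line
}
' rules.txt inputfile.txt > outputfile.txt
On cmd.exe the quoting is easier if you save the program part to a file (say wbr.awk) and run awk -f wbr.awk rules.txt inputfile.txt > outputfile.txt instead.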

Hunspell/Aspell data conversion to human-readable inflection list

Is there an easy way to generate a human-readable inflection list from Hunspell/Aspell dictionary data files?
For example, I'd like to generate the following outputs (for different languages):
...
book, books
book, books, booked, booking
...
go, goes, went, gone, going
...
I looked at the Hunspell/Aspell docs, but couldn't find an API call that would do this.
There is a way the command-line tool does this, but it doesn't output quite the format you're looking for. You could also do it manually, if you wanted, with some simple regex scripting.
The format for each set of affixes is
TYPE TAG REMOVE REPLACE MATCH
Where TAG matches what follows the / in a given word in the .dic file, you can do the following (presuming you've already stripped the word of the /...):
$word =~ s/$remove$/$replace/ if $word =~ /$match$/;
Notice the $ there matching the end-of-line/word. Adjust with ^ if it's a prefix.
There are three caveats:
The $match directly from the .aff file is in almost all cases equivalent to standard regex. There are minor variations: if the match is something like [abc-gh], you'd be better off changing it to (a|b|c|-|g|h) or [abcgh-] (Hunspell doesn't use the hyphen as a metacharacter), otherwise it will be interpreted as [abcdefgh] (standard regex). For a negated character class, your options are to manually move the - to the end of the expression (e.g. [^a-df] to [^adf-]) or to use negative lookbehinds.
If $replace is 0, then you should change it to an empty string.
If your result ends with /..., you need to reprocess it again because it has a double affix.
Be careful. By my rough calculations, the dictionary I'm working on could have more than 50 million words being formed (and I wouldn't be surprised if it hits beyond 100 million).
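To illustrate the mechanics with a made-up example (the rule and dictionary entry here are simplified; a real .aff file has many rules per TAG that you would loop over): given a suffix rule SFX S 0 s . and a .dic entry book/S, the expansion step looks like this:
word=book
remove=0; replace=s; match=.
[[ $remove == 0 ]] && remove=""          # caveat 2: a REMOVE of "0" means remove nothing
if [[ $word =~ ${match}$ ]]; then        # MATCH is anchored at the end for a suffix
    echo "${word%$remove}$replace"       # strip REMOVE, append REPLACE -> prints "books"
fi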

Perl replace every occurrence differently

In a perl script, I need to replace several strings. At the moment, I use:
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/\>$1/g;
The aim is to format every sequence name in a FASTA file. It works well in my case, so I don't need to touch this part. However, it happens that a sequence name appears several times in the file, and I must not end up with the same sequence name twice or more. I thus need to have, for instance:
seqName1
seqName2
etc.
(instead of seqName, seqName, etc.)
Is it possible to somehow process every occurrence differently, automatically? I don't know how many sequences there are, whether there are similar names, etc. An idea would be to concatenate a random string at every occurrence, for instance, hence my question.
Many thanks.
John perfectly solved it, and chepner helped with the smart idea to avoid conflicts; here is the final result:
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/
    sub {
        return '>' . $1 . $i++;
    }->();
/eg;
Many many thanks.
I was actually trying to do something like this the other day, here's what I came up with
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/
    sub {
        # return random string
    }->();
/eg;
The /e modifier interprets the substitution as code, not text. I use an anonymous code ref so that I can return at any point.
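A self-contained way to try the idea out (the sample headers below are invented, so the pattern may need adjusting for real FASTA data): a counter in the /e code is enough to make every replaced name unique, no random strings needed.
printf '>x_seqName/1 desc\n>y_seqName/2 desc\n' |
    perl -0777 -pe 's/>[^_]+_([^\/]+)[^\n]+/">" . $1 . ++$i/ge'
# prints:
#   >seqName1
#   >seqName2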

bash rename using regex array substitution

I have a question very similar to this post.
I would like to know how to rename occurrences within a filename with designated substitutions. For example, if the original file is called 'the quick brown quick brown fox.avi', I would like to rename it to 'the slow red slow red fox.avi'.
I tried this:
new="(quick=>'slow',brown=>'red')"
regex="quick|brown"
rename -v "s/($regex)/$new{$1}/g" *
but no love :(
I also tried with
regex="qr/quick|brown/"
but this just gives errors. Any idea what I'm doing wrong?
Based on your example, I think you want multiple substitutions (not just converting "quick brown" to "slow red", but converting a list of words to a list of new words). You can separate the substitutions with a semicolon. Here's a solution that works for your example:
rename -v 's/quick/slow/g;s/brown/red/g' *
And if you're really bent on using an array to map the old strings to the new string, you can cram even more Perl into the argument to rename (but at some point you might just write the Perl script as a stand-alone script):
rename -v '%::new=(quick=>"slow",brown=>"red");s/(quick|brown)/$::new{$1}/g' *
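Either form can be previewed first with -n (no act) before actually touching the files, assuming this is the Perl-based rename (sometimes packaged as prename or file-rename) rather than the util-linux one:
rename -n 's/quick/slow/g;s/brown/red/g' *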