Remove the first comment from a collection of Java source files using Perl - regex

I'm trying to remove the first C-style comment (only the first) from a collection of Java source files. At first I tried a multi-line sed, but that didn't work properly so after some Googling it seemed Perl was the way to go. I used to like Perl, it was the first language I ever used to make a web program with, but I've run into a wall trying to get this script to work:
#!/usr/bin/perl -i.bak
$s=join("",<>);
$s=~ s/("(\\\\|\\"|[^"])*")|(\/\*([^*]|\*(?=[^\/]))*\*\/)|(\/\/.*)/$1 /;
print $s;
I call it with the filename(s) of the files to be processed, e.g. ./com.pl test.java. According to everything on the Internet, -i (in-place edit) should redirect output from print statements to the file instead of printing to stdout. Now here's the thing: it doesn't. No matter what I try, I can't seem to get it to replace the file with the print output. I've tried $^I too but that doesn't work either.
I don't know if it's relevant but I'm on Ubuntu 11.04.
P.S. I'm aware of the pitfalls of regexing source code :)

Does the following not work from the command line?
$ perl -pi.bak -e 's|your_regex|here|' *.java
Inside a script
The script equivalent of the above is:
#!/usr/bin/perl -pi.bak
s|your_regex|here|;
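The original post was missing the p flag, as pointed out by triplee in his comment. Note that -p processes the input one line at a time, so a pattern that has to span several lines (such as a multi-line comment) also needs the whole file read at once, e.g. by adding -0777.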
See perldoc perlrun for more.
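For comparison, here is a minimal Python sketch of the same idea: read the whole file, skip over double-quoted string literals, and delete only the first comment. The names strip_first_comment and rewrite_in_place are made up for this illustration, and the pattern is a simplification (it does not handle char literals such as '"', for instance), so treat it as a starting point rather than a Java parser.

```python
import re
import shutil
from pathlib import Path

# Alternatives are tried left to right at each position:
# a string literal (captured), a block comment, or a line comment.
TOKEN = re.compile(
    r'("(?:\\.|[^"\\])*")'   # group 1: double-quoted string literal
    r'|/\*.*?\*/'            # /* ... */ block comment (may span lines)
    r'|//[^\n]*',            # // line comment
    re.DOTALL,
)


def strip_first_comment(source):
    """Return source with its first comment removed, leaving strings intact."""
    for match in TOKEN.finditer(source):
        if match.group(1) is None:  # not a string literal, so it is a comment
            return source[:match.start()] + source[match.end():]
    return source


def rewrite_in_place(path):
    """Hypothetical helper: keep a .bak copy, then overwrite the file."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(strip_first_comment(path.read_text(encoding="utf-8")),
                    encoding="utf-8")
```

Calling rewrite_in_place("test.java") would mirror what -i.bak does in Perl: the original is kept as test.java.bak and the stripped text replaces the file.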

Regex removing bold markdown from inside codeblock only

I'm editing in bulk some markdown files to be compliant with mkdocs syntax (material theme).
My previous documentation software accepted bold inside code blocks, but I've now discovered that this is far from standard.
I have more than 10k code blocks in this documentation, spread over more than 300 md files in nested directories, and most of them use ** to bold some word.
To be precise I should make any CodeBlock from this:
this is a **code block** with some commands
```my_lexer
enable
configure **terminal**
interface **g0/0**
```
to this
this is a **code block** with some commands
```my_lexer
enable
configure terminal
interface g0/0
```
The fun parts:
there are bold words in the rest of the document I would like to maintain (outside code block)
not every row of the code block has bold in it
not even every code block has necessarily bold in it
Right now I'm using Visual Studio Code with its replace-in-files feature, and most of the easy regexes I wrote for the porting are working. But it doesn't use standard regex syntax (for example, groups are denoted with $1 instead of \1, and there may be other differences I don't know about).
But I accept other software (regex flavors) too if they are more regex compliant and accept 'replace in all files and subdirectories' (like notepad++, atom, etc..)
Sadly, I don't even know how to start something so complicated.
The most advanced I did is this: https://regex101.com/r/vRnkop/1 (there is also the text i'm using to test it)
(^```.*\n)(.*?\*\*(.*?)\*\*.*$\n)*
I hardly think this is a good start to do that!
Thanks
Visual Studio is not my forte, but I did read that you should be able to use PCRE2 regex syntax. Therefore, try substituting the following pattern with an empty string:
\*\*(?=(((?!^```).)*^```)(?:(?1){2})*(?2)$)
See an online demo. The pattern seems a bit rocky and maybe someone else knows a much simpler pattern. However, I did want to make sure this would both leave italic alone and turn bold+italic into italic. Note that . matches newline here.
If you have Unix tools like sed, it is quite easy:
sed '/^```my_lexer/,/^```/ s/\*\*//g' orig.md >new.md
/regex1/,/regex2/ cmd looks for a group of lines where the first line matches the first regex and the final line matches the second regex, and then runs cmd on each of them. This limits the replacements to the relevant sections of the file.
s/\*\*//g does the search and replace (I have assumed any instance of ** should be deleted).
Some versions of sed allow "in-place" editing with -i. For example, to edit file.md and keep original version as file.md.orig:
sed -i.orig '...' file.md
and you can edit multiple files with something like:
find . -name '*.md' -exec sed -i.orig '...' {} +
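If sed is not available, the same range-limited replacement can be sketched in Python. This is only an illustration: the function name strip_bold_in_fences is invented here, and it assumes every fence starts at the beginning of a line with three backticks and that fences are never nested.

```python
def strip_bold_in_fences(markdown):
    """Remove ** markers, but only on lines inside ``` fenced blocks."""
    out = []
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        if line.startswith("```"):
            # A fence line opens or closes a block; keep it unchanged.
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            out.append(line.replace("**", ""))
        else:
            out.append(line)
    return "".join(out)


if __name__ == "__main__":
    sample = (
        "this is a **code block** with some commands\n"
        "```my_lexer\n"
        "enable\n"
        "configure **terminal**\n"
        "interface **g0/0**\n"
        "```\n"
    )
    print(strip_bold_in_fences(sample), end="")
```

Bold text outside the fences is left alone because the flag is only set between an opening and a closing fence line.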

Bash regex can't match xml? [duplicate]

This question already has answers here:
Why does BASH_REMATCH not work for a quoted regular expression?
I'm trying to integrate Jenkins to grab stories from Pivotal Tracker which have been marked as finished if they have been included in a build (we already have integration with GitHub on both Jenkins and Pivotal Tracker). Using curl, I am able to grab related stories using Pivotal's web API, which returns XML with the story values. I have tested a bash regex (<current_state>(.*)<\/current_state>) on regexraptor.net, to match the node value in a string (in this case, <story><current_state>this</current_state></story>), with the goal of returning this (the desired text from the element) for assignment to a bash variable. regexraptor tells me it matches, but when I try to output the text, it comes up blank.
Currently, my code to output it goes like this:
$XML_STRING='<current_state>this</current_state>'
[[ $XML_STRING =~ '<current_state>(.*)<\/current_state>' ]]
echo "${BASH_REMATCH[1]}"
which outputs nothing.
How can I get it to output 'this'?
Due to the constraints of the Oracle Linux distro on the server I'm using at work, I need to get this working using pure bash if possible. sed and grep are available, but I've had no luck with them: sed finds a match but outputs the whole stream of XML rather than the single word value I want, and I just plain can't get grep working. xpath is unavailable, and xmllint has proved unworkable as well.
In bash, the regular expression on the right-hand side of =~ must be written without quotes; a quoted pattern is matched as a literal string. The following code returns "this" on my Mac in bash:
XML_STRING='<current_state>this</current_state>'
[[ $XML_STRING =~ \<current_state\>(.*)\</current_state\> ]]
echo "${BASH_REMATCH[1]}"
Assuming that our XML looks like the XML in this question, you could extract the current_state value using xmllint like this:
$ xmllint --xpath 'string(/stories/story/current_state)' myfile.xml
accepted
If the input file contains more than one story, this will only return the first. You can select other stories by providing the appropriate index to your XPath expression:
$ xmllint --xpath 'string((/stories/story/current_state)[2])' myfile.xml
unscheduled
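Where Python happens to be installed, the same lookup can be sketched with the standard library's ElementTree. The SAMPLE document below is invented to match the two outputs shown above; the real Pivotal Tracker response will contain many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the XML assumed in this answer.
SAMPLE = """<stories>
  <story><current_state>accepted</current_state></story>
  <story><current_state>unscheduled</current_state></story>
</stories>"""


def current_states(xml_text):
    """Return the text of every /stories/story/current_state element."""
    root = ET.fromstring(xml_text)  # root is the <stories> element
    return [element.text for element in root.findall("./story/current_state")]


if __name__ == "__main__":
    states = current_states(SAMPLE)
    print(states[0])  # first story
    print(states[1])  # second story
```

Indexing the returned list plays the role of the [2] predicate in the XPath expression, except that Python lists count from 0.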
If you actually have:
XML_STRING='<current_state>this</current_state>'
You can get at the value text in the middle like this:
$ echo "$XML_STRING" | cut -f2 -d'>' | cut -f1 -d'<'
this
Check your file on Linux in the following way:
cat -vte filename
If you see a strange ^@ in every second position, or anything else unusual, then your file contains control characters that are causing the problem. You may also want to check for Windows line endings and convert them to Linux line endings.

How can a sequence of regular expressions (with replacements and extractions) be executed in order on a single body of text?

What is a good way to run a sequence of regular expressions on a body of text?
Both regex for finding and replacing patterns
I’ve spent days searching for it without success. I would be happy with an answer that was nothing more than a Google search in the right direction.
Given your requirements, I would approach it with sed similar to:
sed -e 's/\n\w/#*4#/g' -e 's/\n/[enter]/g' -e 's/[#][*]4[#]/START/g'
As a quick example of the successive application of the regex, consider:
echo "this is absolutely absent minded bs" | \
sed -e 's/ab/#*4#/g' -e 's/b/[enter]/g' -e 's/[#][*]4[#]/START/g'
Output:
this is STARTsolutely STARTsent minded [enter]s
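Your newline-and-word regex is applied in the same successive fashion. Keep in mind that sed reads one line at a time, so \n only matches once several lines are in the pattern space (for example with GNU sed's -z option).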
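The successive application shown with sed can also be written as an ordered list of rules in Python. This is a small sketch; apply_rules and RULES are names chosen for the illustration, and the rules are the three from the echo example above.

```python
import re

# (pattern, replacement) pairs, applied strictly in this order.
RULES = [
    (r"ab", "#*4#"),            # mark the text to protect with a placeholder
    (r"b", "[enter]"),          # replace what is left
    (r"[#][*]4[#]", "START"),   # turn the placeholder into its final form
]


def apply_rules(text, rules):
    """Run each substitution on the result of the previous one."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    print(apply_rules("this is absolutely absent minded bs", RULES))
```

Because the whole text is one string here, a pattern containing \n can match across lines without any extra option.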
I would consider using a simple script to do something like this.
On Windows that could be a .bat file; on Linux (or if you have access to something like Cygwin), a bash (.sh) script.
The find and replace functions should be easy to look up once you have chosen a scripting language.
A lot has changed since I asked the question. For those who are looking for an answer: this can be done in Ruby, Python, JavaScript, Perl, and pretty much any language that supports regex. I still don't know of a great online or GUI program for it.
A while ago, I created a simple GUI for it in Python. Here's how to install it.
Install Python: 2.7
(https://www.python.org/downloads/)
Install the Regex Python Module:
If on Mac, open the Terminal app. If on Windows, open Command Prompt. Then
type pip install regex and let it install
Download this Python file:
here
Run the file: Open the file with Python Launcher 2.7
The program should be pretty self-explanatory, but I created a help document here if anyone has any trouble.
An example of an in-code alternative would be to create a Python file (TextFile.py) and have the following code in it.
import regex as re  # third-party module: pip install regex

# Input your file path
location_ = 'YourFilePath'
with open(location_, 'r') as f:
    text_ = f.read()

# Input your regular expressions; they are applied in order
text_ = re.sub(r'Your_Regex_Find', 'Your_Regex_Replace', text_)
text_ = re.sub(r'Your_Regex_Find2', 'Your_Regex_Replace2', text_)
print(text_)

How to search and replace this string with sed?

I'm desperately trying to search the following:
<texit info> author=MySelf title=MyTitle </texit>
and replace it with blank.
What I've tried so far is the following:
sed –I '1,5s/<texit//;s/info>//;s/author=MySelf//;s/title=MyTitle//' test.txt
But it doesn't work.
Don't edit XML with sed -- the right tool would be something like XMLStarlet, with a line like the following:
xmlstarlet ed -u '//texit[@info]' -v 'author=NewAuthor title=NewTitle'
...if your goal were to update the text within the tag.
Regular expressions are not expressive enough to correctly handle XML (even formally -- regular expressions are theoretically sufficient to parse regular languages; XML is not one). For instance, your original would be just as valid written with newlines, as:
<texit
info >author=MySelf title=MyTitle</texit>
...and writing a sed command to handle that case would not be fun. XML-native tools, on the other hand, can correctly handle all of XML's corner cases.
That said, the sed expression you gave does indeed "work", inasmuch as it does exactly what it's written to do.
sed -e '1,5s/<texit//;s/info>//;s/author=MySelf//;s/title=MyTitle//' \
<<<"<texit info>author=MySelf title=MyTitle foo bar</texit>"
returns the output
foo bar</texit>
which is exactly what it should do, as it's removing the <texit string, the info> string, the author=MySelf, title=MyTitle, but leaving the closing </texit> and any excess text, just as you asked. If you expect or desire it to do something different, you should explain what that is.
sed 's/<texit\s\+info>\s*author=MySelf\s\+title=MyTitle\s*<\/texit>//g' test.txt
You should generally not edit XML with a regex, but if you only want to strip these tags, the above will work. You don't need multiple s commands, just use a single pattern with correctly defined whitespace.
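The single-pattern approach translates directly to Python, where \s also matches newlines, so the multi-line variant mentioned earlier is covered too. This is a sketch with an invented function name (remove_texit); it only handles this exact author and title, and is not a general XML tool.

```python
import re

# One pattern with flexible whitespace between the fixed pieces.
TEXIT = re.compile(
    r"<texit\s+info>\s*author=MySelf\s+title=MyTitle\s*</texit>"
)


def remove_texit(text):
    """Delete every matching <texit info>...</texit> block."""
    return TEXIT.sub("", text)


if __name__ == "__main__":
    print(repr(remove_texit(
        "<texit info> author=MySelf title=MyTitle </texit>\nkeep me\n"
    )))
```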

Removing Different URLs with Regex

I am looking to remove a ton of bad spam URL links from my forums using regex in either grep or vim and subsequently using find/replace commands. I am looking for a way to select just the bad URLs to do that.
All of the URLs are different and are preceded by \n________\n. (That's 8 underscores.)
Here is an example of one of the URLs:
\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
So basically I was trying to use the \n... and the [/URL] as boundaries to select that and everything in between. What I came up with is this:
[\\]n[_][_][_][_][_][_][_][_][\\]n.*\[\/URL\]]
Using that does not correctly close the search and selects pretty much everything. I am very new at this and appreciate any insight. Thanks.
Assuming GNU ERE, this should work:
\\n_{8}\\n\s\[URL=(.*)].*\[/URL]
RegexBuddy seems to agree with me:
That said,
> grep -E \\n_{8}\\n\s\[URL=(.*)].*\[/URL] test.txt
doesn't work on my system (Cygwin with GNU grep 2.6.3; test.txt's contents are shown in the screenshot above).
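One way to take shell quoting out of the picture is to try a similar pattern in a short Python script. The sketch below assumes, as in the question, that the \n sequences are a literal backslash followed by n in the text; the name remove_spam_links is made up, and the optional \s* covers both the variant with and without a space before [URL=.

```python
import re

# Literal "\n", eight underscores, literal "\n", then one [URL=...]...[/URL] link.
SPAM = re.compile(r"\\n_{8}\\n\s*\[URL=[^\]]*\].*?\[/URL\]")


def remove_spam_links(text):
    """Strip signature-style spam links that follow the underscore divider."""
    return SPAM.sub("", text)


if __name__ == "__main__":
    post = r"Nice post!\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]"
    print(remove_spam_links(post))
```

Links that are not preceded by the divider are left untouched, since the divider is part of the match.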
If you want to give sed a chance, the following will do the job:
sed 's/^.*\(\[URL.*\)$/\1/' file.txt
PS: You can do the same with :s/^.*\(\[URL.*\)$/\1/ in your vi session as well.
OUTPUT
For the file.txt that contains:
\n__\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
It produces:
[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
In Vim this should remove all lines that match the pattern:
:g/\\n\%(\\_\)\{8}\\n \[URL=.\{-}\/URL\]/d
That pattern matches the sample text taken literally, all in one line.
I was actually able to do this in Microsoft Word using the following:
[\\]n_{8}[\\]n?*/URL\]
Thank you for all the input, couldn't have done it without the help!