Removing comments from a group of php files via command line in Windows - regex

I have a bunch of files and I would like to send them to deployment sans any comments but with whitespace intact (so that I can make any quick changes in production in emergency cases).
The comments can be either single line comments (#, //) or multi line syntax /**/ and at any indentation level.
I want to create a batch file that when executed from any directory reads all php files and strips their comments.
I am not even sure what to try. I know I can fetch all the files with .php extension easily and loop through them. Replacing their content is easy enough as well. What I am stuck on is how to remove the comments.

There are multiple ways to do it, but I think the most efficient is to use regular expressions.
I'm not a regex guru, but I know that you can use grep (or any powerful text editor such as Notepad++, Sublime Text, etc.) to replace the matched expressions with an empty string.
For example, in Sublime Text, I've tested this regular expression, which finds multi-line comments as well as both single-line styles:
\/\*([\s\S]*?)\*/|//.*|#.*
Quick explanation:
the regexp is made of three alternatives, separated by the OR symbol (the pipe symbol |)
the first alternative means "something starting with /*, followed by any characters (including whitespace and line breaks), as few as possible, and finishing with */"
the second means "anything starting with // and every character following it up to the end of the line"
the third means "anything starting with # and every character following it up to the end of the line"
Once you've set this search expression in Sublime Text, you can replace whatever it matches with an empty replacement.
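For the batch part of the question, here is a minimal Python sketch of the same idea, offered as one possible approach rather than a tested deployment tool. The function names are made up for illustration, and the single-line alternatives use [^\r\n]* instead of .* so that Windows line endings are left untouched. Like the editor regex, it also matches comment markers that appear inside string literals (for example the // in a quoted URL), so the result should be reviewed before deploying.

```python
import re
from pathlib import Path

# The three alternatives from the answer: /* ... */, // ..., and # ...
# [^\r\n]* is used instead of .* so a trailing \r (CRLF files) is preserved.
COMMENT = re.compile(r"/\*[\s\S]*?\*/|//[^\r\n]*|#[^\r\n]*")


def strip_comments(source: str) -> str:
    """Replace every comment match with an empty string; keep all other text."""
    return COMMENT.sub("", source)


def strip_comments_in_tree(root: str) -> int:
    """Rewrite every .php file under root in place; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*.php"):
        # newline="" disables newline translation so line endings survive.
        with path.open(encoding="utf-8", newline="") as handle:
            original = handle.read()
        stripped = strip_comments(original)
        if stripped != original:
            with path.open("w", encoding="utf-8", newline="") as handle:
                handle.write(stripped)
            changed += 1
    return changed
```

Calling strip_comments_in_tree(".") from the project directory would process every .php file below it; running it on a copy of the files first is the safer choice.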

Related

Regex NotePad++ or batch script to find and replace double bracketed text with CR LF -- would prefer NP++

I managed to do most of my conversion in VBA Macro (Word > txt) but some changes were made also that I could not forego or get around. Unfortunately, I had not been in the habit of using styles and precise formatting in my docs... (Which is why a PanDoc conversion did not "pan" out well, if you'll excuse the pun.)
In my docs, I was using bold text/lines for in-text titles (not Heading 2 alas) but as I was converting mid-sentence one or two-word bold phrases into phrases to go between double square brackets, the makeshift titles/headings were also changed to [[some title]] format in the process.
With Find and Replace (a batch script that goes through all files in a folder would also do), I would like to search for each and any number of instances of CRLF [[some title CRLF]]CRLF and replace the brackets with ** (to make the title bold), or perhaps ## to make the headings I was missing back in MS Word (I would of course need the line breaks as well).
For better understanding, please see attached picture here:
I am fairly sure that all instances are similarly syntaxed. If not, I may be able to tailor your regex code to differing instances later on.
As you can see, I was trying to do it in two steps, but that's no good, because the second step (which I couldn't even get right) would probably have altered other text I need intact (there must be sentences that start with double brackets after CRLF).
I would need the two steps in one so that only the targeted double bracketed text would be changed to bold or Heading 2.
Basically what I could not do is: find the proper regex solution for matching double CRLF-ed and square-bracketed text for any number of words that may occupy more than one line and that starts with a capital letter. I would need an empty line above and below the title as indicated in the image (the VBA macro somehow made two instances of CRLF and carried the brackets to a new line, which I do not like, either).
EDIT.
In the meantime I managed to cook something up but now I couldn't insert the CRLF in front of the match string. At this point this is not enough as other instances are also changed, even lowercase in-line items, for some reason...
Regex:
\[\[([A-Z][\S\s]+?)\]\]
Substitution:
## $1\r\n
https://regex101.com/r/mH6B9N/1
Since then, I made improvements towards what I wanted (I had to test in NotePad++ and not Regex101, for different results), but now in multiple documents I have found match across spill-over lines, as described in here:
Single line regex search in Notepad++
Is it possible that I cannot do what I want? The problem is having non-title text strings having line-break, double brackets and capitalized letters.
What it looks like in other documents:
See here.
I circled around with red in image for clarification. See also:
https://regex101.com/r/8XsIGx/1
Is it possible to match a certain word like "címnél" and not execute on that match if that word is present in a line?
Thanks very much in advance,
F.
You can use
(?s)\R\K\[\[((?:(?!\[\[|]]).)*)\R*]](?=\R)
Replace with ## $1. See the regex demo.
Details:
(?s) - equivalent of the . matches newline option
\R - a line break sequence
\K - omit the text matched so far (the newlines)
\[\[ - a [[ text
((?:(?!\[\[|]]).)*) - Group 1: any char, as many as possible occurrences, that does not start a [[ or ]] char sequence
\R* - zero or more line breaks
]] - a ]] text
(?=\R) - immediately to the right, there must be a line break.
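For a batch script over many files, the same pattern can be approximated in Python. This is only a sketch under some assumptions: Python's re module supports neither \R nor \K, so the line break is spelled out as \r?\n and the leading break is captured and re-emitted instead of being dropped with \K. The title group is also made lazy here so that the line break just before ]] is not included in the heading. The function name is hypothetical.

```python
import re

# Adaptation of the Notepad++ pattern for Python's re module:
#   \R  -> \r?\n           (re has no \R)
#   \K  -> capture group 1 (re has no \K; the break is put back on replace)
TITLE = re.compile(
    r"(\r?\n)\[\[((?:(?!\[\[|\]\]).)*?)(?:\r?\n)*\]\](?=\r?\n)",
    re.DOTALL,
)


def brackets_to_heading(text: str) -> str:
    """Turn a [[title]] that sits on its own line(s) into '## title'."""
    return TITLE.sub(lambda m: f"{m.group(1)}## {m.group(2)}", text)
```

In-line [[phrases]] are left alone because they are not preceded and followed by a line break.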

Remove Word smart quotes from a text file using vim

I have a large text file, originally generated in Microsoft Word, that contains these four character sequences, alongside regular text:
?~#~\
?~#~]
?~#~X
?~#~Y
From the content of what is written in the file, it appears that the sequences respectively correspond to open double quotes, close double quotes, open single quote, and close single quote. When displayed in Vim, everything in the sequences other than the question mark appears in blue.
I cannot remove them with a command such as
:.,$s/?~#~Y//
This command results in the following error from vim:
E33: No previous substitute regular expression
E476: Invalid command
Press ENTER or type command to continue
These commands also produce errors:
:.,$s/\?~#~Y//
:.,$s/\?\~\#\~Y//
Specifically,
E866: (NFA regexp) Misplaced ?
E476: Invalid command
Press ENTER or type command to continue
What would be the correct way to automatically remove or replace the sequences? Ideally, I'd like to remove the double quotes, and replace the open/close single quotes with a traditional single quote or apostrophe.
Since "everything in the sequences other than the question mark appears in blue", all characters except the question mark are probably binary characters. I'd suggest this approach:
go to the first sequence and yank it: press v to start marking, extend the mark to the end of the sequence, then press y
paste the sequence as the search pattern from the unnamed register: type :%s/ then press Ctrl-r followed by ", then type //g and press Enter
repeat for the remaining sequences.
If you’re using a unicode-compatible encoding (such as utf-8) and your font supports it, the smart quotes will show properly.
Additionally, the digraphs for them are 6', 6", 9', and 9". This makes it pretty easy to chain a couple of substitutes to swap them for straight variants:
%s/<C-k>6'\|<C-k>9'/'/g
Etc. Wrap it in a function or command to make it easier for later.
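Outside Vim, the same swap can be sketched in a few lines of Python, assuming the file decodes cleanly as UTF-8. The table below maps the four curly quotes (U+201C, U+201D, U+2018, U+2019) to their straight variants; the function name is made up for illustration.

```python
# Translation table: curly quotes -> straight quotes.
SMART_QUOTES = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with Word-style smart quotes replaced by ASCII quotes."""
    return text.translate(SMART_QUOTES)
```

Mapping a key to None instead of a string would delete that character, which covers the case where the double quotes should be removed entirely.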
Sorry to bump an old thread but I stumbled upon this late at night while trying to figure out how to remove the exact same characters from a bind9 configuration file that I had pasted in from a website. The aberrant characters were "~#~X", "~#~Y", " | ", and I believe another but I can't remember it at the moment. Anyway, regular expressions couldn't seem to find and replace using the above methods, but I was able to find a solution.
If you can set VIM to show the special characters in their binary representation, then you can use regex to find that. Here's how I did it:
Steps to fix
Open the file with the problem characters in VIM
(a) original method - :set encoding=latin1|set isprint=|set display+=uhex
(b) easier method - :set encoding=utf-8
NOTE: either of these should display the digraph characters in their binary form (e.g. <80>, <99>, ...)
Then search and replace with VIM regex like so
:%s:\%xNN:':g
(replace NN with the byte code, i.e. 80, 99, etc.)
Let's break that command down, shall we:
%s: - the substitute command. The % at the start is the range and applies the command to every line in the file, and the 's' stands for substitute. The ':' (colon) has been used as the delimiter in this case, but you can use other symbols to delimit the command.
\%x - matches a character by its hexadecimal code (i.e. the two hex digits shown between the angle brackets)
NN - replace with the two chars inside of the <> that you're looking to replace in your file. In my case, the byte codes were <e2>, <80>, <99>, which I had to search for separately.
:' - then the colon delimiting the replacement, where I'm specifying a single quote to replace the byte code; you could put whatever text you want here.
:g - finally, the last colon delimiter and the 'g' flag, which replaces every occurrence on each line instead of only the first one.
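The byte codes <e2>, <80>, <99> mentioned above are the UTF-8 encoding of the right single quotation mark, so the three bytes can also be replaced as one unit. The following Python sketch works on raw bytes and therefore does not depend on how the file is decoded. It follows the original question's wish (drop the double quotes, turn the single quotes into an apostrophe); the names are hypothetical.

```python
# UTF-8 byte sequences of the four smart quotes and what to put in their place.
REPLACEMENTS = {
    b"\xe2\x80\x9c": b"",    # left double quote  -> removed
    b"\xe2\x80\x9d": b"",    # right double quote -> removed
    b"\xe2\x80\x98": b"'",   # left single quote  -> apostrophe
    b"\xe2\x80\x99": b"'",   # right single quote -> apostrophe
}


def replace_smart_quote_bytes(data: bytes) -> bytes:
    """Replace each three-byte smart-quote sequence in data."""
    for sequence, replacement in REPLACEMENTS.items():
        data = data.replace(sequence, replacement)
    return data
```

Reading the file with open(path, "rb"), passing the contents through this function and writing the result back with "wb" leaves every other byte untouched.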
You can do more research in VIM's help with:
:help isprint
Anyway, I hope this helps someone else in the future.
References:
https://blog-en.openalfa.com/how-to-edit-non-printing-and-unicode-characters-in-vim-editor
https://unix.stackexchange.com/questions/108020/can-vim-display-ascii-characters-only-and-treat-other-bytes-as-binary-data
VIM How do I search for a <XX> single byte representation

Use REGEX to find line breaks within a wrapped content

The direct question: How can I use REGEX lookarounds to find instances of \r\n that occur between a set of characters (stand in open and closing tags), "[ and ]" with arbitrary characters and line breaks inside as well?
The situation:
I have a large database exported to tab- or comma-delimited text files that I'm trying to import into Excel. The problem is that some of the cells come from text areas that contain line breaks, and are qualified by double quotes. When importing into Excel, these line breaks are treated as new rows. I cannot adjust how the file is exported. The data needs to be preserved, but the exact format doesn't, so I was planning on using some placeholder for the returns or ~
Here's a generic illustration of the format of my data:
column1rowA column2rowA column3rowA column4rowA
column1rowB column2rowB "column3rowB
3Bcont
3Bcont
3Bcont
" column4rowB
column1rowC column2rowC column4rowC
column1rowD column2rowD "column3rowD
3Dcont" column4rowD
My thought has been to try to select and replace line breaks within the quotes using REGEX search and replace in Notepad++. To try and make it simpler I have tried adding a character to the double quotes to help indicate whether it is an opening or closing quote:
"[column3rowB
3Bcont
3Bcont
3Bcont
]"
I am new to REGEX. The progress I've made (which isn't much) is:
(?<="[) missing some sort of wildcard \r\n(?=.*]")
Every iteration I've tried has also included every line break between the first "[ and last ]"
I would also appreciate any other approaches that solve the underlying problem
If you can use some tool other than Notepad++, you can use this regex (see my working example on regex101):
(?!\n(([^"]*"){2})*[^"]*$)\n
It uses a negative lookahead to find line breaks only when not followed by an even number of quotes. You could replace them with <br>, spaces, or whatever is appropriate.
Breakdown:
(?! ... ) This is the negative lookahead, necessary because it's zero-width. Anything matched by it will still be available to match again.
(([^"]*"){2})* This is the other key piece. It ensures even-numbered pairs of non-quote characters followed by a quote.
[^"]*$ This is ensuring that there are no more quotes from there until the end of the string.
Caveat:
I couldn't get it to work in Notepad++ because it always recognizes $ as the end of a line, not the end of the entire string.
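As one tool other than Notepad++, the regex can be applied from Python, where \Z anchors at the very end of the string regardless of line boundaries. This is a sketch under a few assumptions: quotes are balanced, quoted fields contain no embedded quote characters, and the placeholder (here ~) is free to choose. The function name is made up, and because the lookahead rescans the rest of the text for every line break, it may be slow on very large exports.

```python
import re

# A line break is "inside quotes" when the text from that break to the end
# of the string does NOT contain an even number of double quotes.
BREAK_IN_QUOTES = re.compile(r'(?!\r?\n(?:(?:[^"]*"){2})*[^"]*\Z)\r?\n')


def flatten_quoted_breaks(text: str, placeholder: str = "~") -> str:
    """Replace line breaks inside double-quoted cells with placeholder."""
    return BREAK_IN_QUOTES.sub(placeholder, text)
```

Line breaks that separate rows are followed by an even number of quotes, so they are kept and Excel still sees one row per record.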
Great answer from Brian. I added a variant that also handles carriage returns (i.e. \r\n line breaks), which worked for my CSV file:
(?!(?:\r\n|\n|\r)(([^"]*"){2})*[^"]*$)(?:\r\n|\n|\r)

Replace line of text Notepad++ or UltraEdit

Real quick question here that I can't work out.
I have a bunch of text files across many directories. Within these directories are text files named init.txt
In these many text files are lots of lines starting with
Effective =
What I need to do is replace any line that contains that string with another string,
preferably in Notepad++, or UltraEdit if need be.
In Notepad++, I've found Search -> Replace in Files... which lets me specify a starting directory, but I can't get it to replace the entire line with my new line.
I have never used regular expressions before (if that's the best way to do this) as I've never needed to, so any help would be very much appreciated.
Thank you for helping me out.
For your problem, a little regular expression may help a lot. I use regex search in Notepad++ nearly every day, and it is really useful.
I do not want to intimidate you with complicated regex grammar. Instead, I hope that after reading my answer you will see that the basics of regular expressions are not so exotic, and that they are for regular people's everyday use.
Follow these instructions:
In Notepad++ press Ctrl-F and switch to the Find in Files tab. In the Search Mode section (at the bottom of the dialog), select Regular expression.
In the Find what field, what you need to enter may vary according to the specific pattern of the text you want to replace.
If the text fragment you want to substitute always
shows up at the beginning of a line,
has NO LEADING WHITESPACE before it,
contains EXACTLY ONE SPACE CHARACTER before the = character,
then ^Effective = should be used as the pattern in the Find what field.
The ^ symbol in ^Effective = matches the beginning of a line (so if Effective = appears in the middle of a line, it will be ignored), and the rest is the exact text to be matched. To replace the entire line rather than just this prefix, append .* so that the pattern becomes ^Effective =.* (with the ". matches newline" option unchecked, .* stops at the end of the line).
However, if the above conditions are not all satisfied, e.g.
the text segment may contain leading whitespace,
the number of whitespace characters between the word Effective and the = symbol may vary, from one to unlimited,
then you may need to use ^Effective\s+=. To allow leading whitespace as well, prefix it with [ \t]*, giving ^[ \t]*Effective\s+=.
The \s+ part in ^Effective\s+= matches one or more whitespace characters (including spaces \x20, tabs \t, carriage returns \r, and newlines \n).
If you want to match zero or more whitespace characters between Effective and =, replace \s+ with \s*.
In the Replace with field, enter changeLine
In the Filters field, select the file type you want to search
Check In all sub-folders
Click the Replace in Files button
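The steps above can also be sketched as a small Python script for cases where a scripted run is preferred over the dialog. It is only an illustration: the function names are hypothetical, the file name init.txt and the replacement changeLine are taken from the question and answer, and UTF-8 is assumed as the file encoding. The pattern allows leading spaces or tabs and any number of spaces before the = sign, then consumes the rest of the line without touching the line ending.

```python
import re
from pathlib import Path

# Whole line that starts (after optional spaces/tabs) with "Effective ="
EFFECTIVE_LINE = re.compile(r"^[ \t]*Effective[ \t]*=[^\r\n]*", re.MULTILINE)


def replace_effective_lines(text: str, new_line: str) -> str:
    """Replace every 'Effective = ...' line in text with new_line."""
    # A function is used as the replacement so backslashes in new_line
    # are inserted literally.
    return EFFECTIVE_LINE.sub(lambda match: new_line, text)


def replace_in_tree(root: str, new_line: str, filename: str = "init.txt") -> int:
    """Apply the replacement to every matching file under root."""
    changed = 0
    for path in Path(root).rglob(filename):
        with path.open(encoding="utf-8", newline="") as handle:
            original = handle.read()
        updated = replace_effective_lines(original, new_line)
        if updated != original:
            with path.open("w", encoding="utf-8", newline="") as handle:
                handle.write(updated)
            changed += 1
    return changed
```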
Set the search mode in Notepad++
Find: Effective =
Replace with: changeLine
Search Mode: Extended (\n, \t, etc)
From: https://superuser.com/questions/34451/notepad-find-and-replace-string-with-a-new-line

Multiline Regular Expression search and replace!

I've hit a wall. Does anybody know a good text editor that has search and replace like Notepad++ but can also do multi-line regex search and replace? Basically, I am trying to find something that can match a regex like:
search oldlog\(.*\n\s+([\r\n.]*)\);replace newlog\(\1\)
Any ideas?
Notepad++ can now handle multi line regular expressions (just update to the latest version - feature was introduced around March '12).
I needed to remove all onmouseout and onmouseover statements from an HTML document and I needed to create a non-greedy multi line match.
onmouseover=.?\s*".*?"
Make sure you check the: [ ] . matches newline checkbox if you want to use the multi line match capability.
EditPad Pro has better regex capabilities than any other editor I've ever used.
Also, I suspect you have an error in your regex: [\r\n.] will match only carriage returns, newlines, and full stops. If you're trying to match any character (i.e. the dot operator plus CR and LF), try [\s\S] instead.
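The difference is easy to check in any regex engine that follows the usual character-class rules. A small Python sketch (the helper name is made up for illustration):

```python
import re


def matches_whole(pattern: str, text: str) -> bool:
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# Inside a character class the dot is a literal full stop, so letters,
# commas and spaces are not matched by [\r\n.].
print(matches_whole(r"[\r\n.]*", "a,\n b"))   # False
print(matches_whole(r"[\r\n.]*", ".\r\n."))   # True

# [\s\S] is "whitespace or non-whitespace", i.e. any character at all.
print(matches_whole(r"[\s\S]*", "a,\n b"))    # True
```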
My personal recommendation is IDM Computing's UltraEdit (www.ultraedit.com) - it can do regular expressions (both search and replace) with Perl, Unix and UltraEdit syntax. Multi-line matching is one of the capabilities in Perl regex mode in it.
It also has other nice search capabilities (e.g search in specific character column range, search in multiple files, search history, search favorites, etc...)
The Zeus editor can do multi-line search and replace.
I use Eclipse, which is free and that you may already have if you are a developer. '\R' acts as platform independent line delimiter. Here is an example of multi-line search:
search:
\bibitem.(\R.)?\R?{([^{])}$\R^([^\].[^}]$\R.$\R.)
and replace:
\defcitealias{$2}{$3}
I'm pretty sure Notepad++ can do that now via the TextFX plugin (which is included by default). Hit Control-R in Notepad++ and have a play.
TextPad has good Regex search and replace capabilities; I've used it for a while and am pretty happy with it.
From the Features:
Powerful search/replace engine using UNIX-style regular expressions, with the power of editor macros. Sets of files in a directory tree can be searched, and text can be replaced in all open documents at once.
For more options than you could possibly need, check out "Notepad++ Alternatives" at AlternativeTo.net.
You can use the Python Script plugin for multi-line regular expression search and replace:
- http://npppythonscript.sourceforge.net/docs/latest/scintilla.html?highlight=pymlreplace#Editor.pymlreplace
# This example replaces any <br/> that is followed by another on the next line (with optional spaces in between), with a single one
editor.pymlreplace(r"<br/>\s*\r\n\s*<br/>", "<br/>\r\n")
I use Notepad++ all the time but its regex support has always been a bit lacking.
Sublime Text is what you want.
EditPlus does a good job at search/replace using regex (including multiline)
You could use Visual Studio. Download Express for free if you don't have a copy.
VS's regex is non-standard, so you'd have to use \n:b+[\r\n] instead.
The latest version of UltraEdit has multiline find and replace w/ regex support.
Or if you're OK with using a more specialized regular expression tool for this, there's Regex Hero. It has the side benefit of being able to do everything on the fly. In other words, you don't have to click a button to test your regular expression because it's automatically tested after every keypress.
Personally, I'd use UltraEdit if I'm looking to replace text in multiple files. That way I can just select the files to replace as a batch and click Replace. But if I'm working with a single text file and I'm in need of writing a more complex regular expression then I'd paste it into Regex Hero and work with it there. That's because Regex Hero can save time when you see everything happen in real-time.
ED for Windows has two versions of regex and three sorts of cut and paste (selection, lines or blocks), and you can shift from one to the next with just a mouse click while you are highlighting (unlike UltraEdit, which is clunky at best), with no need to pull down a menu. The sheer speed of getting the job done is incredible; like reading on a Kindle, you don't have to think about it.
You can use a recent version of Notepad++ (Mine is 6.2.2).
No need to use the ". matches newline" option as suggested in another answer. Instead, use an appropriate regular expression with ^ for "beginning of line" and $ for "end of line". Then use \r\n after the $ for a "new line" in a DOS file (or just \n in a Unix file, as the carriage return is mainly used in DOS/Windows text files):
Ex.: to remove all lines starting with the tag OBJE after a line starting with the tag UID (from a GEDCOM file, used in genealogy), I used the following search regex:
^UID (.*)$\r\n^(OBJE (.*)$\r\n)+
And the following replace value:
UID \1\r\n
This is matching lines like this:
UID 4FBB852FB485B2A64DE276675D57A1BA
OBJE #M4#
OBJE #M3#
OBJE #M2#
OBJE #M1#
and the output of the replacement is
UID 4FBB852FB485B2A64DE276675D57A1BA
550 instances have been replaced in less than 1 sec. Notepad++ is really efficient!
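The same replacement can be sketched in Python, with one adaptation that is an assumption of this example rather than part of the Notepad++ answer: Python's re treats only \n as a line boundary for $, so the pattern $\r\n never matches there. The sketch therefore matches the line content with [^\r\n]* and the line break with \r?\n, which also covers Unix files. The function name is hypothetical.

```python
import re

# A UID line (kept via group 1) followed by one or more OBJE lines (dropped).
UID_WITH_OBJE = re.compile(
    r"^(UID [^\r\n]*\r?\n)(?:OBJE [^\r\n]*\r?\n)+",
    re.MULTILINE,
)


def drop_obje_after_uid(text: str) -> str:
    """Remove OBJE lines that directly follow a UID line."""
    return UID_WITH_OBJE.sub(lambda match: match.group(1), text)
```

OBJE lines that do not directly follow a UID line are left in place, as in the original search expression.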
Otherwise, to validate a regular expression I like to use the .NET RegEx Tester (http://regexhero.net/tester/). It's really great for writing and testing a regex on the fly.
PS: Also, you can use [\s\S] in your regex to match any character including new lines. So, if you are looking for any block of multi-line text starting with "xxx" and ending with "abc", the following regex will do: ^xxx[\s\S]*?abc$ where "*?" matches as little as possible between xxx and abc.