Regex to replace block comment with line comment - regex

There are tons of examples for converting C-style line comments to single-line block comments. But I need to do the opposite: find a regex that replaces a multi-line block comment with line comments.
From:
This text must not be touched
/*
This
is
random
text
*/
This text must not be touched
To:
This text must not be touched
// This
// is
// random
// text
This text must not be touched
I was wondering whether there is a way to represent the concept of "each line" in a regex, and then just add // in front of each line. Something like
\/\*\n(?:(.+)\n)+\*\/ -> // $1
But because a repeated capture group keeps only its last iteration, $1 matches just the last line before */. I know Perl and other languages have some advanced regex features like recursion, but I need to do this in a standard engine. Is there any trick to accomplish this?
EDIT: To clarify, I'm looking for pure regex solution, not involving any programming language. Should be testable on sites like https://regex101.com/.

If you are interested in a single regex pass in a modern JavaScript engine (or another regex engine that supports unbounded-length patterns in lookbehinds), you can use
/(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n)(?=[\s\S]*?^\*\/)|(?:\r?\n)?(?:^\/\*|^\*\/)/gm
Replace with $1$1, see the regex demo.
Details
(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n) - a positive lookbehind that matches a location that is immediately preceded with
^(\/)\* - /* substring at the start of a line (with / captured into Group 1)
(?:(?!^\/\*)[\s\S])*? - any char, zero or more occurrences, as few as possible, not starting a /* char sequence that appears at the start of a line
\r?\n - a CRLF or LF ending
(?=[\s\S]*?^\*\/) - a positive lookahead that requires any 0 or more chars as few as possible followed with */ at the start of a line, immediately to the right of the current location
| - or
(?:\r?\n)? - an optional CRLF or LF linebreak
(?:^\/\*|^\*\/) - and then either /* or */ at the start of a line.
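As a rough sketch of how this single-pass replacement behaves in JavaScript: the helper name convertBlockComment and the sample text below are illustrative assumptions, not part of the answer, and the engine is assumed to support variable-length lookbehind (recent Node.js and browsers do).

```javascript
// The pattern from the answer above, unchanged.
const blockToLine =
  /(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n)(?=[\s\S]*?^\*\/)|(?:\r?\n)?(?:^\/\*|^\*\/)/gm;

// Hypothetical helper name, used only for this illustration.
function convertBlockComment(text) {
  // At each line start inside the comment, Group 1 holds "/", so "$1$1" inserts "//".
  // For the delimiter lines, Group 1 did not participate, so "$1$1" is empty and
  // the delimiter (plus the line break before it) is removed.
  return text.replace(blockToLine, '$1$1');
}

const sampleText = [
  'This text must not be touched',
  '/*',
  'This',
  'is',
  'random',
  'text',
  '*/',
  'This text must not be touched',
].join('\n');

const converted = convertBlockComment(sampleText);
console.log(converted);
```

Note that the inserted prefix is // without a trailing space, because the replacement can only reuse the captured / character.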

As usual in such cases, two regular expressions—the second applied to the matches of the first—can do what one cannot achieve.
const txt = `This text must not be touched
/*
This
is
random
text
*/
This text must not be touched`;
const to1line = str => str.replace(
  /\/\*\s*(.*?)\s*\*\//gs,
  // Second regex: prefix every line of the captured comment body.
  (_, comment) => comment.replace( /^/mg, '// ')
);
console.log( to1line( txt ));

Related

How to write RegEx for the Below Expression

12345678912345678T / 14750,47932 SS
Vis à 6PC 45H Din 913 M 8x20
Art. client: 294519
QTE: 200 Pce
I want to write a regex that can find multi-line strings like the one above in a long text file. The starting condition is an 18-character word consisting of digits and uppercase letters, and the ending condition is "Pce".
I have written this much, and it only reads the first line, but I don't know what to write next:
^[0-9A-Z]{18,18}.*
Any type of help will be appreciated.
. in most engines doesn't match newlines, hence your match stopping at the end of the line. You could either use the DOTALL flag if available, or work around it with an "include-all" class, for example [\s\S] (a character that is either whitespace or not whitespace).
With a lazy quantifier, you could use for example:
^[0-9A-Z]{18}[\s\S]*?Pce$
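A minimal sketch of the difference in JavaScript, assuming the record sits between two made-up lines (the "header line" and "trailer line" strings and all variable names are illustrative, not from the question):

```javascript
const record = [
  '12345678912345678T / 14750,47932 SS',
  'Vis à 6PC 45H Din 913 M 8x20',
  'Art. client: 294519',
  'QTE: 200 Pce',
].join('\n');

// Hypothetical surrounding text so the match has something to skip.
const longText = `header line\n${record}\ntrailer line`;

// Without DOTALL, "." stops at the line break: only the first line is matched.
const firstLineOnly = longText.match(/^[0-9A-Z]{18}.*/gm);

// The [\s\S] class crosses line breaks; the lazy *? stops at the first "Pce" at a line end.
const viaClass = longText.match(/^[0-9A-Z]{18}[\s\S]*?Pce$/gm);

// Same idea with the s (DOTALL) flag, where "." also matches line breaks.
const viaDotAll = longText.match(/^[0-9A-Z]{18}.*?Pce$/gms);

console.log(firstLineOnly, viaClass, viaDotAll);
```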
You didn't specify a programming language so something like this would work:
/^[\dA-Z]{18}[^\dA-Za-z].*?Pce$/gms
^[\dA-Z]{18} - start with 18 digits and/or capital letters
[^\dA-Za-z] - not a digit nor letter
.*? - anything, lazily
substitute with [\s\S]*? if single line modifier is not available to you
Pce$ - end with Pce
gms - global, multi line, and single line modifiers
https://regex101.com/r/RXaAT4/1

startWith with regex kotlin [duplicate]

I am trying to work on regular expressions. I have a mainframe file which has several fields. I have a flat file parser which distinguishes several types of records based on the first three letters of every line. How do I write a regular expression where the first three letters are 'CTR'.
Beginning of line or beginning of string?
Start and end of string
/^CTR.*$/
/ = delimiter
^ = start of string
CTR = literal CTR
$ = end of string
.* = zero or more of any character except newline
Start and end of line
/^CTR.*$/m
/ = delimiter
^ = start of line
CTR = literal CTR
$ = end of line
.* = zero or more of any character except newline
m = enables multi-line mode, this sets regex to treat every line as a string, so ^ and $ will match start and end of line
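To illustrate the effect of the m flag, here is a small JavaScript sketch; the record contents are invented for the example and only the CTR prefix comes from the question:

```javascript
// Hypothetical flat-file contents, one record per line.
const fileText = [
  'HDR20240101',
  'CTR0001 first',
  'DTL0001 CTR inside',
  'CTR0002 second',
].join('\n');

// Without m, ^ and $ anchor to the whole string, which starts with "HDR".
const withoutMultiline = fileText.match(/^CTR.*$/g);

// With m, ^ and $ anchor to every line, so both CTR records are found.
const withMultiline = fileText.match(/^CTR.*$/gm);

console.log(withoutMultiline, withMultiline);
```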
While in multi-line mode you can still match the start and end of the string with \A\Z permanent anchors
/\ACTR.*\Z/m
\A = means start of string
CTR = literal CTR
.* = zero or more of any character except newline
\Z = end of string
m = enables multi-line mode
As such, another way to match the start of the line would be like this:
/(\A|\r|\n|\r\n)CTR.*/
or
/(^|\r|\n|\r\n)CTR.*/
\r = carriage return / old Mac OS newline
\n = line-feed / Unix/Mac OS X newline
\r\n = windows newline
Note: if you use the backslash \ in a program string that supports escaping, such as PHP's double-quoted strings "", you need to escape the backslashes first,
so to run \r\nCTR.* you would write it as "\\r\\nCTR.*"
^CTR
or
^CTR.*
edit:
To be more clear: ^CTR will match the start of a line followed by those characters. If all you want to do is test a line (and you already have the line in hand), then that is all you really need. But in that case you may be better off using a built-in substr()-type function. I don't know what language you are using. If you are trying to match and grab the whole line, you will need something like .* or .*$, depending on the language or regex function you are using.
Regex symbol to match at beginning of a line:
^
Add the string you're searching for (CTR) to the regex like this:
^CTR
Example: regex
That should be enough!
However, if you need to get the text from the whole line in your language of choice, add a "match anything" pattern .*:
^CTR.*
Example: more regex
If you want to get crazy, use the end of line matcher
$
Add that to the growing regex pattern:
^CTR.*$
Example: lets get crazy
Note: Depending on how and where you're using regex, you might have to use a multi-line modifier to get it to match multiple lines. There could be a whole discussion on the best strategy for picking lines out of a file to process them, and some of the strategies would require this:
Multi-line flag m (this is specified in various ways in various languages/contexts)
/^CTR.*/gm
Example: we had to use m on regex101
Try ^CTR.*, which literally means start of line, CTR, anything.
This will be case-sensitive, and how to turn on case-insensitive matching depends on your programming language; alternatively, use ^[Cc][Tt][Rr].* if cross-environment case-insensitivity matters.
^CTR.*$
matches a line starting with CTR.
Not sure how to apply that to your file on your server, but typically, the regex to match the beginning of a string would be :
^CTR
The ^ means beginning of string / line
There are ambiguities in the question.
What is your input string? Is it the entire file? Or is it 1 line at a time? Some of the answers are assuming the latter. I want to answer the former.
What would you like to return from your regular expression? Do you just want a true / false on whether a match was made? Or do you want to extract the entire line that begins with CTR? I'll assume you only want a true / false match.
To do this, we just need to determine if the CTR occurs at either the start of a file, or immediately following a new line.
/(?:^|\n)CTR/
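A short sketch of this true / false check over a whole file's text in JavaScript; the function name hasCtrRecord and the sample strings are assumptions made for the illustration:

```javascript
// No g flag, so test() keeps no state between calls.
const ctrAtLineStart = /(?:^|\n)CTR/;

// Hypothetical helper: true if any line of the file starts with CTR.
function hasCtrRecord(fileText) {
  return ctrAtLineStart.test(fileText);
}

console.log(hasCtrRecord('CTR0001\nDTL0001'));      // CTR at the start of the file
console.log(hasCtrRecord('HDR0001\nCTR0001'));      // CTR right after a line break
console.log(hasCtrRecord('HDR CTR\nDTL0001 CTR'));  // CTR only in the middle of lines
```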
(?i)^[ \r\n]*CTR
(?i) -- case insensitive -- Remove if case sensitive.
[ \r\n] -- ignore space and new lines
* -- 0 or more times the same
CTR - your starts with string.

i need help in regex

I have some MATLAB code, and one of the lines doesn't have a semicolon (;) at the end.
I want to find that line.
For a start:
sad= sdfsdf ; %this is comment
sad = awaww ;
n= sdfdsfd ;
m = (asd + adsf(asd,asd)) %this is comment
Let's say I want to find the 4th line because it doesn't have a (;) at the end of the line.
So far I'm stuck at this:
/(^[-a-zA-Z0-9]+\s*=[-a-zA-Z0-9#:%,_\+.()~#?&//= ]+)(?!;)$/gim
This works fine: it finds only the fourth line.
But what if there is a (;) in the middle of the line, but not at the end or before the comment?
w=sss (;)aaa ; % I don't want this line to be selected
w=sss (;)aaa % I want this line to be selected
http://regexr.com/3cfor
Well, let's find all lines which end with a semicolon:
^.+?;
optionally followed by horizontal whitespace:
^.+?;[ \t]*
and an optional comment:
^.+?;[ \t]*(?:%.*)?
This expression easily matches all the lines you don't want. So, invert it:
^(?!.+?;[ \t]*(?:%.*)?$).+
Unfortunately, that's too easy. It fails to match lines which contain a semicolon in a comment. We could replace .+? with [^%\r\n]+? but this would fail on lines containing a % in a string.
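The simple inverted pattern, including the limitation just described, can be tried with a sketch like the following in JavaScript. The last sample line is an invented case added to show the limitation, and the variable names are assumptions:

```javascript
// The simple pattern from above, applied per line with the m flag.
const missingSemicolon = /^(?!.+?;[ \t]*(?:%.*)?$).+/gm;

const matlabCode = [
  'sad= sdfsdf ; %this is comment',
  'sad = awaww ;',
  'n= sdfdsfd ;',
  'm = (asd + adsf(asd,asd)) %this is comment',
  'w=sss (;)aaa ; % terminated, so not reported',
  'w=sss (;)aaa % not terminated, so reported',
  'y = 2 % trailing;', // invented line: not terminated, but the ";" in the comment hides it
].join('\n');

const flagged = matlabCode.match(missingSemicolon);
console.log(flagged);
```

The last line is missing its semicolon but is not reported, which is the weakness the more robust pattern is meant to address.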
If you need a more robust pattern, you'll have to account for all of this.
So let's start the same way, by defining what a "correct" line should look like. I'll use the PCRE syntax for atomic grouping, so you'll have to use perl = TRUE.
A string is: '(?>[^']+|'')*'
Other code (except string, comments and semicolons) is covered by: [^%';\r\n]+
So "normal" code is:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?
Then, we add the required semicolon and optional comment:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$
Finally, we invert all of this:
^(?!(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$).+
And we have the final pattern. Demo.
You don't need to fully tokenize the input, you only have to recognize the different "lexer modes". I hope handling strings and comments is enough, but I didn't check the Matlab syntax thoroughly.
You could use this with other regex engines that do not support atomic groups by replacing (?> with (?: but you'll expose yourself to the catastrophic backtracking problem.

Regular expression for a string that does not start with a /*

I use the EditPad Pro text editor.
I need to find strings in code, but I need to ignore lines that start with "/*" or tab + /*, for example:
/**
* Light up the dungeon using "claravoyance"
*
* memorizes all floor grids too.
**/
/** This function returns TRUE if a "line of sight" **/
#include "cave.h"
(tab here) /* Vertical "knights" */
if (g->multiple_objects) {
/* Get the "pile" feature instead */
k_ptr = &k_info[0];
}
put_str("Text inside", hgt - 1, (wid - COL_MAP) / 2);
/* More code*** */
I would like to get:
"Text inside"
I have tried this (after reading "Regular expression for a string that does not start with a sequence"), but it does not work for me:
^(?! \*/\t).+".*"
Any help?
Edit: I used:
^(?!#| |(\t*/)|(/)).+".*"
And it returns:
put_str("Text inside"
I'm close to finding the solution.
EditPad evidently supports variable-length lookbehind in Pro version 6 and Lite version 7, since its flavor is indicated as "JGsoft": the Just Great Software regular expression engine.
Knowing this and without the use of capture groups, you could combine two variable length lookbehinds:
(?<!^[ \t]*/?[*#][^"\n]*")(?<=^[^"\n]*")[^"]+
(?<!^[ \t]*/?[*#][^"\n]*") A negative lookbehind that rejects a quoted part preceded by a comment marker, [ \t]*/?[*#], which may itself be preceded by any amount of spaces or tabs. The / is optional because a line inside a multi-line comment can also start with *.
(?<=^[^"\n]*") A positive lookbehind that requires any number of [^"\n] characters (neither quotes nor newlines) from the start of the line, followed by one quote, immediately before the match.
[^"]+ Assuming the quotes are always balanced, this matches the non-quote characters after the first double quote (which is inside the lookbehind).
If a single " may occur in any line (not balanced), change the end: [^"]+ to [^"\n]+(?=")
Possibly there are different solutions for the problem. Hope it helps :)
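JavaScript also supports variable-length lookbehind, so the combined pattern can be sketched there as well. This is only an illustration under that assumption, not EditPad itself; the / has to be escaped inside a JavaScript regex literal, the m flag makes ^ match at line starts, and the variable names are made up:

```javascript
const firstQuotedText = /(?<!^[ \t]*\/?[*#][^"\n]*")(?<=^[^"\n]*")[^"]+/gm;

// The sample from the question, with a real tab in front of the indented comment.
const sourceCode = [
  '/**',
  '* Light up the dungeon using "claravoyance"',
  '*',
  '* memorizes all floor grids too.',
  '**/',
  '/** This function returns TRUE if a "line of sight" **/',
  '#include "cave.h"',
  '\t/* Vertical "knights" */',
  'if (g->multiple_objects) {',
  '/* Get the "pile" feature instead */',
  'k_ptr = &k_info[0];',
  '}',
  'put_str("Text inside", hgt - 1, (wid - COL_MAP) / 2);',
  '/* More code*** */',
].join('\n');

const quotedInCode = sourceCode.match(firstQuotedText);
console.log(quotedInCode);
```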
Here's one approach: ^(?!\t*/\*).*?"(.+?)"
Breakdown:
^(?!\t*/\*) This is a negative lookahead anchored to the beginning of the line,
to ensure that there is no `/*` at the beginning (with or
without tabs)
.*?" Next is any amount of characters, up to a double-quote. It's lazy
so it stops at the first quote
(.+?)" This is the capture group for everything between the quotes, again
lazy so it doesn't slurp other quotes
You can use this regex:
/\*.*\*/(*SKIP)(*FAIL)|".*?"
Working demo
Edit: if you use EditPad then you can use this regex:
"[\w\s]+"(?!.*\*/)

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
This one comes from nanorc.sample available in many linux distros. It is used for syntax highlighting of C style strings
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
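A small JavaScript sketch of the combined single- and double-quote pattern. The g flag, the findQuoted helper, and the sample input are assumptions added for the illustration; String.raw is used so the backslashes in the sample stay literal:

```javascript
// Same pattern as above; inside a JS regex literal the single quotes need no escaping.
const quotedString = /"([^"\\]*(\\.[^"\\]*)*)"|'([^'\\]*(\\.[^'\\]*)*)'/g;

// Hypothetical helper: returns every quoted string, including its quotes.
function findQuoted(source) {
  return source.match(quotedString) ?? [];
}

// Sample text containing escaped quotes of both kinds.
const sampleSource = String.raw`say("a \"b\" c") + 'it\'s'`;

const foundStrings = findQuoted(sampleSource);
console.log(foundStrings);
```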
Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, since it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
Every quoted string starts with the character ";
It may contain any number of any characters: .*? (lazy match), ending with a non-escape character [^\\];
Statement (2) is lazily optional because the string can be empty (""). So: (.*?[^\\])??
Finally, every quoted string ends with the character ", which can be preceded by any number of escape-sign pairs (\\\\)+; this part is possessively optional, ((\\\\)+)?+, because the string can be empty or have no trailing pairs.
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
const matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
The regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
) #
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
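Here is one that works with both " and ', and you can easily add other quote characters at the start.
("|')(?:\\\1|(?!\1).)*?\1
It uses the backreference \1 to match exactly what the first group captured (" or '). A backreference has no special meaning inside a character class, so "any character other than the opening quote" is written as (?!\1). rather than [^\1].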
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
If your IDE is IntelliJ Idea, you can forget all these headaches and store your regex into a String variable and as you copy-paste it inside the double-quote it will automatically change to a regex acceptable format.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"