i need help in regex - regex

so i have (matlab) code .. and of the lines doesnt have (;) after the line
i want to find that line
for a starter :
sad= sdfsdf ; %this is comment
sad = awaww ;
n= sdfdsfd ;
m = (asd + adsf(asd,asd)) %this is comment
lets say i want to find the 4th line because it doesnt have (;) at the end of line ..
so far im stuck at this :
/(^[-a-zA-Z0-9]+\s*=[-a-zA-Z0-9#:%,_\+.()~#?&//= ]+)(?!;)$/gim
so this will work fine.. it will find the fourth line only
but what if i wanted (;) in middle of the line but not at end or before the comment .. ?
w=sss (;)aaa **;** % i dont want this line to be selected
w=sss (;)aaa %i want this line to be selected
http://regexr.com/3cfor

Well, let's find all lines which end with a semicolon:
^.+?;
optionally followed by horizontal whitespace:
^.+?;[ \t]*
and an optional comment:
^.+?;[ \t]*(?:%.*)?
This expression easily matches all the lines you don't want. So, inverse it:
^(?!.+?;[ \t]*(?:%.*)?$).+
Unfortunately, that's too easy. It fails to match lines which contain a semicolon in a comment. We could replace .+? with [^%\r\n]+? but this would fail on lines containing a % in a string.
If you need a more robust pattern, you'll have to account for all of this.
So let's start the same way, by defining what a "correct" line should look like. I'll use the PCRE syntax for atomic grouping, so you'll have to use perl = TRUE.
A string is: '(?>[^']+|'')*'
Other code (except string, comments and semicolons) is covered by: [^%';\r\n]+
So "normal" code is:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?
Then, we add the required semicolon and optional comment:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$
Finally, we invert all of this:
^(?!(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$).+
And we have the final pattern. Demo.
You don't need to fully tokenize the input, you only have to recognize the different "lexer modes". I hope handling strings and comments is enough, but I didn't check the Matlab syntax thoroughly.
You could use this with other regex engines that do not support atomic groups by replacing (?> with (?: but you'll expose yourself to the catastrophic backtracking problem.

Related

Regex to remove double lines ignoring the punctation marks and spaces in Notepad++ [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 months ago.
Is it possible to remove duplications with ignoring the punctation marks and spaces in Notepad++? I would keep one of them matching lines (doesn't matter which to keep).
My examples are from the txt file:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rough work, iconoclasm, but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Rule No.1: Never lose money. Rule No.2: Never forget rule No.1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
Self-esteem isn't everything it's just that there's nothing without it. Gloria Steinem
You said she's a senior? Babe we're all crazy.
You said, she's a senior! Babe we're ALL crazy.
You said, she's a senior? Babe we're ALL crazy!
Result I need:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
You said, she's a senior! Babe we're ALL crazy.
I can delete 100% matching duplications with regex, but can't find a regex rule to ignore spaces and marks.
I don't think regex is the best tool for this task, but it's a nice challenge. You can match single words using a nested structure like:
((\w+)\W+((\w+)\W+( ... ((\w+)\W+)? ... )?)?(\w*))
When matching this, capture groups 2 to n contain the words 1 to n-1 of a line. The nested structure is necessary to make it non-ambiguous - otherwise, running the regex takes too long.
To match the duplicate lines, we use a similar structure with back-references:
\1\W+(\2\W+( ... (\9\W+)? ... )?)?
This will also match lines that are substrings of the previous line, which is again helpful to improve performance.
Notice that you have to use the \g{n}-notation when using more than 9 references in Notepad++. Moreover, to avoid matching line breaks you should use [^\w\n\r] instead of \W. To further improve performance, unnecessary groups should be non-matching, i.e., (?: ... ).
To generate the rather long regex that solves the problem for, e.g., up to 20 words per line, you can use the following script:
MAX_WORDS = 20
punct = "[^\\w\\n\\r]"
backref = (i) => `\\g{${i}}`
patternKeep = (i) => "(\\w+)[^\\w\\n\\r]+" + (i < 0 ? "" : `(?:${patternKeep(i-1)})?`)
patternRemove = (i) => `${backref(MAX_WORDS-i + 2)}(?:${punct}+` + (i < 0 ? "" : patternRemove(i-1)) + ")?"
console.log("^(" + patternKeep(MAX_WORDS) + "(\\w*))(\\r?\\n" + patternRemove(MAX_WORDS)+ `${punct}*${backref(MAX_WORDS+4)}${punct}*)+$`)
When copying this to Notepad++ with settings "Wrap around" on and "Match case" off and replacing with $1, it will remove all duplicate lines in your example.
I doubt that it can be done purely with regular expressions. If it can then I imagine that the expression would be difficult to understand and difficult to maintain. Instead I would suggest a multi-step approach.
Step 1 - modify each line to be: original-line separator original-line.
Step 2 - convert it to be line-without-punctuation separator original-line.
Step 3 - sort the lines
Step 4 - remove duplicated lines
Step 5 - remove line-without-punctuation and separator leaving just the original line.
In more detail:
In all the replaces below: select "Wrap around", unselect "Dot matches newline", unselect "Match whole word only" and unselect "Match case".
Step 1 - choose a separator, some text that is not punctuation and does not occur in the file. Here I use qqq. Do a regular expression replace of ^(.+)$ with \1qqq\1.
Step 2 - remove any punctuation before the separator. Repeatedly do a regular expression replace of [!',-.:?]+(.*qqq) with \1 until no more replacements are made. This expression matches all the punctuation in the example, but you may need to add more for your full text. Also need to reduce multiple spaces to singles, so repeatedly do a regular expression replace of +(.*qqq) with \1 until no more replacements are made. One final step to handle spaces before the qqq do a regular expression replace of qqq with qqq (this could also use a non-regular expression replace).
Step 3 - sort the lines lexicographically.
Step 4 - remove duplicated lines. Repeatedly do a regular expression replace of ^(.*qqq).*\R\1 with \1 until no more replacements are made.
Step 5 - Remove unwanted text leaving the original line. Do a regular expression replace of ^.*qqq with nothing (the empty string).
If all punctuation can be deleted and the result being a line without punctuation then could simple do a regular expression replace of [!',-.:? ]+ with , a sort and finally a remove duplicates.
Previously this question attracted an answer, but the author deleted it. To me it was so interesting because a special technique was illustrated. In a comment the answerer pointed me towards another thread to read more about it.
After experimenting a bit with that answer, an idea was the following pattern. Settings in NP++ are to uncheck: [ ] match case, [ ] .matches newline - Replace with emptystring.
^(?>[^\w\n]*(\w++)(?=.*\R(\2?+[^\w\n]*\1\b)))+[^\w\n]*\R(?=\2[^\w\n]*$)
Here is the demo in Regex101 - Assumption is, that duplicate lines are consecutive (like sample).
Most of the used regex-tokens can be looked up in the Stack Overflow Regex FAQ.
In short words, the mechanism used is to capture words from one line to the first group (\w++) while inside the lookahead (?=.*\R(\2?+...\1\b))) a second group in the consecutive line is "growing" from itself plus the captures until \R(?=\2...$) it either matches all words or fails.
Illustration of some steps from the regex101 debugger:
The second group holds the substring of the consecutive line that matches words and order of the previous line. It expands at each repetition from optionally itself and a word from the previous line. Separated by [^\w\n]* any amount of characters that are not word characters or newline.
For making it work, matching is done without giving back at crucial points (prevent backtracking).

Regex to replace block comment with line comment

There are tons of examples to do the conversion from C-style line comment to 1-line block comment. But I need to do the opposite: find a regex to replace multi-line block comment with line comments.
From:
This text must not be touched
/*
This
is
random
text
*/
This text must not be touched
To
This text must not be touched
// This
// is
// random
// text
This text must not be touched
I was thinking if there's a way to represent "each line" concept in regex, then just add // in front of each line. Something like
\/\*\n(?:(.+)\n)+\*\/ -> // $1
But the greediness nature of the regex engine makes $1 just match the last line before */. I know Perl and other languages have some advanced regex features like recursion, but I need to do this in a standard engine. Is there any trick to accomplish this?
EDIT: To clarify, I'm looking for pure regex solution, not involving any programming language. Should be testable on sites like https://regex101.com/.
If you are interested in a single regex pass in the modern JavaScript engine (and other regex engines supporting infinite length patterns in lookbehinds), you can use
/(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n)(?=[\s\S]*?^\*\/)|(?:\r?\n)?(?:^\/\*|^\*\/)/gm
Replace with $1$1, see the regex demo.
Details
(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n) - a positive lookbehind that matches a location that is immediately preceded with
^(\/)\* - /* substring at the start of a line (with / captured into Group 1)
(?:(?!^\/\*)[\s\S])*? - any char, zero or more occurrences, as few as possible, not starting a /* char sequence that appears at the start of a line
\r?\n - a CRLF or LF ending
(?=[\s\S]*?^\*\/) - a positive lookahead that requires any 0 or more chars as few as possible followed with */ at the start of a line, immediately to the right of the current location
| - or
(?:\r?\n)? - an optional CRLF or LF linebreak
(?:^\/\*|^\*\/) - and then either /* or */ at the start of a line.
As usual in such cases, two regular expressions—the second applied to the matches of the first—can do what one cannot achieve.
const txt = `This text must not be touched
/*
This
is
random
text
*/
This text must not be touched`;
const to1line = str => str.replace(
/\/\*\s*(.*?)\s*\*\//gs,
(_, comment) => comment.replace( /^/mg, '//')
);
console.log( to1line( txt ));

Regex to match all lines starting with a specific string

I have this very long cfg file, where I need to find the latest occurrence of a line starting with a specific string. An example of the cfg file:
...
# format: - search.index.[number] = [search field]:element.qualifier
...
search.index.1 = author:dc.contributor.*
...
search.index.12 = language:dc.language.iso
...
jspui.search.index.display.1 = ANY
...
I need to be able to get the last occurrence of the line starting with search.index.[number] , more specific: I need that number. For the above snippet, that number would be 12.
As you can see, there are other lines too containing that pattern, but I do not want to match those.
I'm using Groovy as a programming/scripting language.
Any help is appreciated!
Have you tried:
def m = lines =~ /(?m)^search\.index\.(\d+)/
m[ -1 ][ 1 ]
Try this as your expression :
^search\.index\.(\d+)/
And then with Groovy you can get your result with:
matcher[0][0]
Here is an explanation page.
I don't think you should go for it but...
If you can do a multi-line search (anyway you have to here), the only way would be to read the file backward. So first, eat everything with a .* (om nom nom)(if you can make the dot match all, (?:.|\s)* if you can't). Now match your pattern search\.index\.(\d+). And you want to match this pattern at the beginning of a line: (?:^|\n) (hoping you're not using some crazy format that doesn't use \n as new line character).
So...
(?:.|\s)*(?:^|\n)search\.index\.(\d+)
The number should be in the 1st matching group. (Test in JavaScript)
PS: I don't know groovy, so sorry if it's totally not appropriate.
Edit:
This should also work:
search\.index\.(\d+)(?!(?:.|\s)*?(?:^|\n)search\.index\.\d+)

Regex match all lines that don't end with ,0 and ,1

I have a malformed CSV file which has two columns: Text,Value
The value is either 1 or 0, but some lines are malformed and span two lines:
1. "This line is fine, but there are some that are not like this",0
2. "Another good line",1
4. "Oh, I'm so bad!!
5. I spanned two lines!",0
6. "Why did you break me? FileHelpers can't read two lines!!",1
Line 4 and 5 are supposed to be one line, but the CSV file I got is broken and they span two lines, this causes the FileHelpers engine to fail while reading the csv file.
I have two CSV files with about 3000 lines each and I will only need to fix them once. I want to use notepad++ to find all the lines that are not ending in ,0 or ,1, what kind of regex can I use for that? Or maybe to regular expressions, one for the ,0 case the other one for the ,1 case.
Update:
Dan's answer works without the comma [^01]$ instead of ,[^01]$, but it only matches lines that are not ending with 0 or 1... it works sufficiently well in my case, but it does skip lines that are broken and actually end with 0 or 1.
I don't know how the other answer would work:
Something like the below is what I would use in Notepad++
[^,][^01]$
Here are the steps I did:
Use ([^,][^01])$ to match the lines and replaced with \1{marked}
Then switched to extended mode and replaced {marked}\r\n with `` ( empty ) to get a single line.
Screenshots below:
The expression you would use is
([^,].|,[^01])$
But unfortunately, notepad++ does not support alternation (the | operator). [1]
You can match the broken lines with these two expressions then:
[^,].$
,[^01]$
Except, of course, if the "Text" part does end in ,0 or ,1 itself. :-)
[1] http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Unsupported_Regex_Operators
,[^01]$
Make sure regex mode is on.
General considerations
In general, to match a line that does not end with a specific pattern, you may use
^(?!.*pattern$).*$
where ^ matches the start of a line, (?!.*pattern$) is a negative lookahead that fails the match if there are 0 or more chars other than line break chars, as few as possible (.*) followed with pattern at the end of the line ($), and the .*$ actually matches the line.
To remove a line that does not end with some pattern together with a line break at the end, use
^(?!.*pattern$).*\R?
where \R? is an optional line break sequence.
In case of several fixed strings, you may use
^(?!.*(?:pattern|pattern2|patternN)$).*\R?
If there is just one or two fixed strings to check at the end of the line, you may use a bit quicker regex like
^.*$(?<!a)(?<!bcd)
that will match any line not ending with a and bcd.
^.*$(?<!1)(?<!0)
Current problem solution
So, for the current issue, to match a line not ending with 1 or 0, you may use
^(?!.*[01]$).*$ # without the line break
^(?!.*[01]$).*$\R? # with the line break
Or,
^.*(?<![01])$ # without the line break
^.*(?<![01])$\R? # with the line break
To remove/replace a line break on a line that does not end with a specific pattern you may use
(?<![01])$\R?
Replace with either an empty string (to remove the line break) or with any other delimiter string or character.

Regex Ignore Comments In Java

I can't seem to find a concise answer on this and as I know very little about regex, I feel the easiest option is to ask.
I am trying to count lines of code in eclipse, which I can do, but it includes the comments.
Basically, my regex pattern is "\n"
Pretty basic, yes, but try as I might, I can't seem to figure out a way to ignore a line starting with "//"
I've tried [^(//)] but that seems to count every "/". I've tried the same thing without the delimiter: "\"
Any ideas, even if you just point me in the right direction, my google searches didn't turn up anything useful.
Better to use negative lookahead here. For your case code like this will work:
String str = "Line1\n" +
"/Line2\n" +
"//Line3\n" +
"Line4\n" +
" // Line5\n" +
"Line6\n";
Pattern pt = Pattern.compile("^(?!\\s*//)", Pattern.MULTILINE);
matcher matcher = pt.matcher(str);
int c=0;
while (matcher.find()) c++;
System.out.println("# of lines: " + c);
Output
# of lines: 4
(?!\\s*//) is negative lookahead that is saying match only if a line doesn't start with 0 or more spaces followed by //
As you can see there are 2 lines above starting with comment // hence they are not counted.
Also it is important to use Pattern.MULTILINE flag here to make every line recognize start of line character ^.
Simplified regex, will improve if you need more:
^//
Anything in [] is a character class, which means match one of the symbols within it. Also adding ^ is the inverse of that and is not the same as a ^ outside which mathces the beginning of the string.
You can also do something like:
^[^/][^/].*
to match lines not starting with //
There are better ways and other tools to count lines of code, for instance a test coverage tool or (I don't know if this still works with the newest version): http://metrics.sourceforge.net/
If you just ignore //, then you will still count the package and import declarations along with multi-line comments such as:
/**
* Javadoc
*/
or brackets that sit alone on lines like:
while(...)
{
...
}
I'm not familiar with eclipse, but if it has lookahead this should do it:
\n(?!//)
^\s*(?!//)\S.*$
Highlights any line that starts ("^") with any number of spaces (\s*"), then does not start a comment ("(?!//)"), then has a non-space character ("\S"), then has any number of any characters until end of the line (".*$").