Regex to match all lines starting with a specific string - regex

I have this very long cfg file, in which I need to find the last occurrence of a line starting with a specific string. An example of the cfg file:
...
# format: - search.index.[number] = [search field]:element.qualifier
...
search.index.1 = author:dc.contributor.*
...
search.index.12 = language:dc.language.iso
...
jspui.search.index.display.1 = ANY
...
I need to be able to get the last occurrence of the line starting with search.index.[number]. More specifically, I need that number. For the above snippet, that number would be 12.
As you can see, there are other lines too containing that pattern, but I do not want to match those.
I'm using Groovy as a programming/scripting language.
Any help is appreciated!

Have you tried:
def m = lines =~ /(?m)^search\.index\.(\d+)/
m[ -1 ][ 1 ]
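For comparison, here is a minimal Python sketch of the same idea: multiline mode so that ^ anchors at every line start, then take the last match. The function name last_index_number is made up for illustration, and it assumes the whole cfg file has already been read into one string.

```python
import re

def last_index_number(cfg_text):
    """Return the number from the last line that starts with 'search.index.<n>'."""
    # re.MULTILINE makes ^ match at the start of every line, not just the string.
    numbers = re.findall(r"^search\.index\.(\d+)", cfg_text, flags=re.MULTILINE)
    return int(numbers[-1]) if numbers else None

sample = """# format: - search.index.[number] = [search field]:element.qualifier
search.index.1 = author:dc.contributor.*
search.index.12 = language:dc.language.iso
jspui.search.index.display.1 = ANY
"""
print(last_index_number(sample))
```

Lines such as the format comment or jspui.search.index.display.1 are skipped because they do not start with search.index.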

Try this as your expression (the (?m) flag makes ^ match at the start of every line):
(?m)^search\.index\.(\d+)
And then with Groovy you can get the number from the last match with:
matcher[-1][1]
Here is an explanation page.

I don't think you should go for it but...
If you can do a multi-line search (anyway you have to here), the only way would be to read the file backward. So first, eat everything with a .* (om nom nom)(if you can make the dot match all, (?:.|\s)* if you can't). Now match your pattern search\.index\.(\d+). And you want to match this pattern at the beginning of a line: (?:^|\n) (hoping you're not using some crazy format that doesn't use \n as new line character).
So...
(?:.|\s)*(?:^|\n)search\.index\.(\d+)
The number should be in the 1st matching group. (Test in JavaScript)
PS: I don't know groovy, so sorry if it's totally not appropriate.
Edit:
This should also work:
search\.index\.(\d+)(?!(?:.|\s)*?(?:^|\n)search\.index\.\d+)

Related

Look for any character that surrounds one of any character including itself

I am trying to write a regex code to find all examples of any character that surrounds one of any character including itself in the string below:
b9fgh9f1;2w;111b2b35hw3w3ww55
So ‘b2b’ and ‘111’ would be valid, but ‘3ww5’ would not be.
Could someone please help me out here?
Thanks,
Nikhil
You can use this regex, which matches three characters where the first and third are the same (using a backreference), whereas the middle one can be any character:
(.).\1
Demo
Edit:
The regex above only gives you non-overlapping matches. Since you want all matches, including overlapping ones, you can use this positive-lookahead-based regex. It does not consume the next two characters but captures them in group 2, so for your desired output you can concatenate group 1 and group 2.
(.)(?=(.\1))
Demo with overlapping matches
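Here is some Java code (I've never programmed in Ruby) demonstrating the regex; you can write the same logic in your favorite programming language.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "b9fgh9f1;2w;111b2b35hw3w3ww55";
// Group 1 is the first character; the lookahead captures the next two in group 2 without consuming them.
Pattern p = Pattern.compile("(.)(?=(.\\1))");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group(1) + m.group(2));
}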
Prints all your intended matches,
111
b2b
w3w
3w3
w3w
Also, here is some Python code that may help if you know Python:
import re
s = 'b9fgh9f1;2w;111b2b35hw3w3ww55'
# Each result is a tuple (group 1, group 2); joining them gives the three-character match.
for m in re.findall(r'(.)(?=(.\1))', s):
    print(m[0] + m[1])
Prints all your expected matches,
111
b2b
w3w
3w3
w3w

i need help in regex

I have some (MATLAB) code, and one of the lines doesn't have a (;) at the end of the line.
I want to find that line.
For a start:
sad= sdfsdf ; %this is comment
sad = awaww ;
n= sdfdsfd ;
m = (asd + adsf(asd,asd)) %this is comment
Let's say I want to find the 4th line, because it doesn't have a (;) at the end of the line.
So far I'm stuck at this:
/(^[-a-zA-Z0-9]+\s*=[-a-zA-Z0-9#:%,_\+.()~#?&//= ]+)(?!;)$/gim
This works fine: it finds only the fourth line.
But what if I want to allow a (;) in the middle of the line, but not at the end or before the comment?
w=sss (;)aaa ; % I don't want this line to be selected
w=sss (;)aaa %I want this line to be selected
http://regexr.com/3cfor
Well, let's find all lines which end with a semicolon:
^.+?;
optionally followed by horizontal whitespace:
^.+?;[ \t]*
and an optional comment:
^.+?;[ \t]*(?:%.*)?
This expression easily matches all the lines you don't want. So, invert it:
^(?!.+?;[ \t]*(?:%.*)?$).+
Unfortunately, that's too easy. It fails to match lines which contain a semicolon in a comment. We could replace .+? with [^%\r\n]+? but this would fail on lines containing a % in a string.
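To see both the behaviour and the limitation, here is a small Python sketch that runs the simple inverted pattern over a few lines. The helper name lines_missing_semicolon and the last sample line are made up for illustration; the other lines are adapted from the question.

```python
import re

# The simple inverted pattern: skip lines that end in ';' plus optional
# horizontal whitespace and an optional % comment.
SIMPLE = re.compile(r"^(?!.+?;[ \t]*(?:%.*)?$).+", re.MULTILINE)

def lines_missing_semicolon(code):
    """Return every line the simple pattern flags as lacking a final semicolon."""
    return SIMPLE.findall(code)

sample = "\n".join([
    "sad= sdfsdf ; %this is comment",
    "m = (asd + adsf(asd,asd)) %this is comment",
    "w=sss (;)aaa ; % not selected",
    "w=sss (;)aaa %selected",
    "x = 1 % a semicolon in the comment;",  # hypothetical line, wrongly skipped
])

for line in lines_missing_semicolon(sample):
    print(line)
```

The last line has no terminating semicolon in its code part, yet the pattern skips it because the comment happens to end in a semicolon.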
If you need a more robust pattern, you'll have to account for all of this.
So let's start the same way, by defining what a "correct" line should look like. I'll use the PCRE syntax for atomic grouping, so you'll have to use perl = TRUE.
A string is: '(?>[^']+|'')*'
Other code (except string, comments and semicolons) is covered by: [^%';\r\n]+
So "normal" code is:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?
Then, we add the required semicolon and optional comment:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$
Finally, we invert all of this:
^(?!(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$).+
And we have the final pattern. Demo.
You don't need to fully tokenize the input, you only have to recognize the different "lexer modes". I hope handling strings and comments is enough, but I didn't check the Matlab syntax thoroughly.
You could use this with other regex engines that do not support atomic groups by replacing (?> with (?: but you'll expose yourself to the catastrophic backtracking problem.
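If a regex is not a hard requirement, the same "lexer modes" idea can be sketched as a tiny hand-written scanner in Python. This is a different technique from the pattern above, not a translation of it. The names code_part and lacks_semicolon are made up, and the sketch assumes that every ' starts or ends a string (so it ignores MATLAB's use of ' as the transpose operator) and that there are no block comments.

```python
def code_part(line):
    """Return the part of a line before its % comment, respecting '...' strings."""
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string:
            if ch == "'":
                if i + 1 < len(line) and line[i + 1] == "'":
                    i += 1  # '' inside a string is an escaped quote, stay in string mode
                else:
                    in_string = False
        elif ch == "'":
            in_string = True
        elif ch == "%":
            return line[:i]  # comment starts here, drop the rest
        i += 1
    return line

def lacks_semicolon(line):
    """True if the line has code and that code does not end with a semicolon."""
    code = code_part(line).rstrip()
    return bool(code) and not code.endswith(";")

for line in ["w=sss (;)aaa ; % comment", "w=sss (;)aaa %comment", "s = 'a % b;' % note;"]:
    print(lacks_semicolon(line), line)
```

Because the scanner only tracks two modes (inside a string or not), there is no backtracking to worry about.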

Find missing entries in one file

I've got two files:
1st: Entries.txt
confirmation.resend
send
confirmation.showResendForm
login.header
login.loginBtn
2nd: Used_Entries.txt
confirmation.showResendForm = some value
login.header = some other value
I want to find all entries from the first file (Entries.txt) that have not been asigned a value in the 2nd file (Used_Entries.txt)
In this example I'd like the following result:
confirmation.resend
send
login.loginBtn
In the result confirmation.showResendForm and login.header do not show up because these exist in the Used_Entries.txt
How do I do this? I've been playing around with regular expressions but haven't been able to solve it. A bash script or sth would be much appreciated!
You can do this with regex, but a regex cannot match against two files at once, and we do want to match both contents in a single pass. So you need at least a little scripting around it: concatenate the contents of the two files, with at least one newline in between.
This regex solution expects your string to be matched to be in this format:
text (no equals sign)
text
text
...
key (no equals sign) ␣ (optional whitespace) = (literal equal) whatever (our regex will skip this part.)
key=whatever
key=whatever
Do I have your attention? Yes? Please see the following regex (using techniques accessible to most regex engines):
/(^[^=\n]+$)(?!(?s).*^\1\s*=)/m
Inspired by a recent answer I saw from zx81: you can turn on the (?s) flag in the middle of a pattern to switch to DOTALL mode from that point on, so that . also matches newlines. Using this technique and the input format above, here is what the regex does:
(^[^=\n]+$) Goes through all the text (no equals sign) elements. Enforces no equals signs or newlines in the capture. This means our regex hits every text element as a line, and tries to match it appropriately.
(?! Opens a negative lookahead group. Asserts that this match will not locate the following:
(?s).* Any number of characters or new lines - As this is a greedy match, will throw our matcher pointer to the very end of the string, skipping to the last parts of the document to backtrack and scoop up quickly.
^\1\s*= The captured key, followed by an equals sign after some optional whitespaces, in its own line.
) Ends our group.
View a Regex Demo!
A regex demo with more test cases
I'm stupid. I could have just written this:
/(^[^=\n]+$)(?!.*^\1\s*=)/sm
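As a rough illustration, here is how that last pattern could be driven from Python. The function missing_entries is a made-up name, and it takes the two file contents as strings (reading Entries.txt and Used_Entries.txt is left out). The s and m modifiers become re.DOTALL and re.MULTILINE.

```python
import re

# (^[^=\n]+$)      a whole line with no equals sign (an entry)
# (?!.*^\1\s*=)    ...that never shows up later as "<entry> =" at a line start
PATTERN = re.compile(r"(^[^=\n]+$)(?!.*^\1\s*=)", re.MULTILINE | re.DOTALL)

def missing_entries(entries_text, used_text):
    """Return entries that have no 'key = value' line in used_text."""
    # Concatenate both contents with a blank line in between, as described above.
    combined = entries_text.rstrip("\n") + "\n\n" + used_text
    return PATTERN.findall(combined)

entries = "confirmation.resend\nsend\nconfirmation.showResendForm\nlogin.header\nlogin.loginBtn\n"
used = "confirmation.showResendForm = some value\nlogin.header = some other value\n"
print(missing_entries(entries, used))
```

For large files the lookahead rescans the rest of the text for every entry, so a plain set difference is likely to be faster.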
I've been making this a little bit too complex, and just solved it with a small script in Scala:
import scala.io.Source
object HelloWorld {
  def main(args: Array[String]): Unit = {
    val entries = (for (line <- Source.fromFile("Entries.txt").getLines()) yield {
      line
    }).toList
    val usedEntries = (for (line <- Source.fromFile("Used_Entries.txt").getLines()) yield {
      // Keep only the key: everything before the first space.
      line.dropRight(line.length - line.indexOf(' '))
    }).toList
    println(entries)
    println(usedEntries)
    val missingEntries = (for {
      entry <- entries
      if !usedEntries.exists(_ == entry)
    } yield {
      entry
    }).toList
    println(missingEntries)
    println("Missing Entries: ")
    println()
    for {
      missingEntry <- missingEntries
    } yield {
      println(missingEntry)
    }
  }
}
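import re
# Read the candidate entries, one per line.
with open("Entries.txt") as e:
    entries = [line.strip() for line in e]
# Drop everything from "= " onward so that only the keys remain.
with open("Used_Entries.txt") as u:
    used = {re.sub(r"= .*", "", line).strip() for line in u}
# Print the entries that were never assigned a value.
for entry in entries:
    if entry and entry not in used:
        print(entry)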

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".*?" with "[^(\?=)]*?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
A good practice in my experience is not to use .*? but instead to use * without the ?, and to refine the character class. In this case, [^?]* matches a sequence of non-question-mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
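A minimal Python sketch of the negated-character-class approach, for illustration only. The name third_fields is made up, and the pattern assumes that none of the three fields contains a literal ? of its own.

```python
import re

# =?charset?X?text?=  with every field restricted to non-'?' characters,
# so no field can run past its own delimiter.
ENCODED_WORD = re.compile(r"=\?([^?]+)\?(.)\?([^?]*)\?=")

def third_fields(subject):
    """Return the third '?'-delimited field of every =?...?= pair."""
    return [m.group(3) for m in ENCODED_WORD.finditer(subject)]

subject = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?="
print(third_fields(subject))
```

Because each field stops at the first ?, there is nothing for the engine to be lazy or greedy about, and each pair is matched separately.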
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Lazily match any characters until the rest of the regex can match, i.e. up to the next '?='.
\?= # Literal characters '?='
\s* # Match optional whitespace.
=\? # Literal characters '=?'
(.*?) # Lazily match any characters, again up to the next '?='.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?', followed by a '?'. Repeated twice to skip the 'utf-8?Q?' part.
([^?]*) # Capture the following characters up to the next '?'.
/g # Repeat the match until the end of the string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression is that the first group is now lazy: (.+) became (.*?). Critical, and I'd forgotten it.
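The difference is easy to reproduce. Here is a short Python comparison of the two expressions on the sample subject line (the variable names are just for this sketch):

```python
import re

subject = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?="

# Greedy first group: (.+) swallows as much as it can, so the whole line
# ends up as one match and only the last pair's text lands in group 2.
greedy = re.findall(r"=\?(.+)\?.\?(.*?)\?=", subject)

# Lazy first group: (.*?) stops at the first '?', so each pair matches on its own.
lazy = re.findall(r"=\?(.*?)\?.\?(.*?)\?=", subject)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The lazy quantifier on the second group alone is not enough, because the greedy first group has already consumed everything up to the last pair before the second group gets a chance to match.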

Regex Ignore Comments In Java

I can't seem to find a concise answer on this and as I know very little about regex, I feel the easiest option is to ask.
I am trying to count lines of code in eclipse, which I can do, but it includes the comments.
Basically, my regex pattern is "\n"
Pretty basic, yes, but try as I might, I can't seem to figure out a way to ignore a line starting with "//"
I've tried [^(//)] but that seems to count every "/". I've tried the same thing without the delimiter: "\"
Any ideas, even if you just point me in the right direction, my google searches didn't turn up anything useful.
Better to use negative lookahead here. For your case code like this will work:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String str = "Line1\n" +
    "/Line2\n" +
    "//Line3\n" +
    "Line4\n" +
    " // Line5\n" +
    "Line6\n";
// Match the start of every line that is not a // comment.
Pattern pt = Pattern.compile("^(?!\\s*//)", Pattern.MULTILINE);
Matcher matcher = pt.matcher(str);
int c = 0;
while (matcher.find()) c++;
System.out.println("# of lines: " + c);
Output
# of lines: 4
(?!\\s*//) is a negative lookahead that says: match only if the line doesn't start with 0 or more whitespace characters followed by //
As you can see, there are 2 lines above starting with the comment marker //, hence they are not counted.
It is also important to use the Pattern.MULTILINE flag here, so that ^ matches at the start of every line and not only at the start of the input.
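The same lookahead can be tried quickly in Python. This sketch (count_non_comment_lines is a made-up name) splits the text into lines first and applies the lookahead to each line, which sidesteps any question of how ^ treats a trailing newline in a given engine.

```python
import re

# Succeeds at the start of a line unless optional whitespace is followed by //
NOT_COMMENT = re.compile(r"(?!\s*//)")

def count_non_comment_lines(source):
    """Count lines that do not start with a // comment."""
    return sum(1 for line in source.splitlines() if NOT_COMMENT.match(line))

text = "Line1\n/Line2\n//Line3\nLine4\n // Line5\nLine6\n"
print("# of lines:", count_non_comment_lines(text))
```

Note that /Line2 still counts, since a single slash is not a comment marker.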
Simplified regex, will improve if you need more:
^//
Anything in [] is a character class, which means: match one of the symbols within it. A ^ as the first character inside the brackets negates the class; that is not the same as a ^ outside the brackets, which matches the beginning of the string.
You can also do something like:
^[^/][^/].*
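to match lines not starting with // (note that it requires both of the first two characters to be something other than /, so it also skips lines that start with a single / or have / as their second character)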
There are better ways and other tools to count lines of code, for instance a test coverage tool or (I don't know if this still works with the newest version): http://metrics.sourceforge.net/
If you just ignore //, then you will still count the package and import declarations along with multi-line comments such as:
/**
* Javadoc
*/
or brackets that sit alone on lines like:
while(...)
{
...
}
I'm not familiar with eclipse, but if it has lookahead this should do it:
\n(?!//)
^\s*(?!//)\S.*$
Highlights any line that starts ("^") with any number of whitespace characters ("\s*"), then does not start a comment ("(?!//)"), then has a non-space character ("\S"), then has any number of any characters until the end of the line (".*$").
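A small Python sketch of that last pattern, for anyone who wants to try it outside Eclipse. The name code_lines and the sample text are made up for illustration. Since \s can also match a newline, a match may begin on a preceding blank line, so the sketch strips each match.

```python
import re

# Optional leading whitespace, not a // comment, at least one non-space character.
CODE_LINE = re.compile(r"^\s*(?!//)\S.*$", re.MULTILINE)

def code_lines(source):
    """Return the non-blank, non-// lines of source, stripped of whitespace."""
    return [m.group().strip() for m in CODE_LINE.finditer(source)]

sample = "int a = 1;\n\n// comment\n   // indented\n  int b = 2;\n{\n"
print(len(code_lines(sample)), code_lines(sample))
```

As noted above, lone brackets and multi-line comments are still counted by this approach.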