Matching everything except a specified regex - regex

I have a huge file, and I want to blow away everything in the file except for what matches my regex. I know I can get matches and just extract those, but I want to keep my file and get rid of everything else.
Here's my regex:
"Id":\d+
How do I say "Match everything except "Id":\d+". Something along the lines of
!("Id":\d+) (pseudo regex) ?
I want to use it with a Regex Replace function. In English, I want to say:
Get all text that isn't "Id":\d+ and replace it with an empty string.

Try this:
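// requires: using System.IO; using System.Text.RegularExpressions;
string path = @"c:\temp.txt"; // your file here
string pattern = @".*?(Id:\d+\s?).*?|.+";
Regex rx = new Regex(pattern);
var lines = File.ReadAllLines(path); // read everything before the file is overwritten
using (var writer = File.CreateText(path))
{
    foreach (string line in lines)
    {
        // keep only the captured Id:Number parts of the line
        string result = rx.Replace(line, "$1");
        if (result == "")
            continue; // no Id on this line, so drop it
        writer.WriteLine(result);
    }
}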
The pattern will preserve spaces between multiple Id:Number occurrences on the same line. If you only have one Id per line you can remove the \s? from the pattern. File.CreateText will open and overwrite your existing file. If a replacement results in an empty string it will be skipped over. Otherwise the result will be written to the file.
The first part of the pattern matches Id:Number occurrences. It includes an alternation for .+ to match lines where Id:Number does not appear. The replacement uses $1 to replace the match with the contents of the first group, which is the actual Id part: (Id:\d+\s?).
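For readers who are not on .NET, the same replace-with-group-1 idea can be sketched in Python. This is only an illustration under assumptions: the helper names keep_ids and filter_lines are made up here, and the pattern is the one from the answer above (without the quotes around Id).

```python
import re

# Same pattern as in the answer above: either a stretch of text containing
# an Id:Number (captured in group 1), or a run of text with no Id at all.
ID_PATTERN = re.compile(r".*?(Id:\d+\s?).*?|.+")


def keep_ids(line: str) -> str:
    """Return the line with everything except Id:Number occurrences removed."""
    # group(1) is None when the ".+" alternative matched, so fall back to "".
    return ID_PATTERN.sub(lambda m: m.group(1) or "", line)


def filter_lines(lines):
    """Yield only non-empty results, mirroring the 'continue' in the C# loop."""
    for line in lines:
        result = keep_ids(line)
        if result:
            yield result


print(list(filter_lines(["abc Id:12 def Id:34 xyz", "no ids here"])))
```

The optional \s? keeps the space that follows each Id, so a line with several Ids stays readable but may end with a trailing space.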

well, the opposite of \d is \D in perl-ish regexes. Does .net have something similar?

Sorry, but I totally don't get what your problem is. Shouldn't it be easy to grep the matches into a new file?
Yoo wrote:
Get all text that isn't "Id":\d+ and replace it with an empty string.
A logical equivalent would be:
Get all text that matches "Id":\d+ and place it in a new file. Replace the old file with the new one.
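A rough Python sketch of that "extract the matches instead" approach follows. The function name extract_ids and the sample text are assumptions for illustration; writing the result to a new file and swapping it for the old one is left out.

```python
import re


def extract_ids(text: str) -> str:
    """Collect every "Id":<digits> match and join them one per line."""
    matches = re.findall(r'"Id":\d+', text)
    return "\n".join(matches)


sample = '{"Id":12,"Name":"a"} junk {"Id":345}'
print(extract_ids(sample))
```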

I haven't used .NET before, but the following works in Java:
System.out.println("abcd Id:12351abcdf".replaceAll(".*(Id:\\d+).*","$1"));
produces output
Id:12351
Strictly speaking, this doesn't match "everything except Id:\d+", but it does the job.

Related

Kotlin: Can't parse regular expression containing multiple back slashes - Why do I have a "Unclosed group"?

I am trying to match the anchor link that is "triple-escaped" in this string example:
blablabla some text <a href=\\\"#anchor\\\"> some more text
This is my regular expression:
href=(\\\\\\)(\"#.*)(\\\\\\)\"
If I test it on regex101.com it works, but I need to do this filtering in Kotlin, which I thought I could do like this:
fun findEscapedAnchors(text: String): String {
val pattern = "href=(\\\\\\)(\"#.*)(\\\\\\)\""
val regex = pattern.toRegex()
val matches = regex.find(text)
// do something with the matches
}
First of all, if I paste this String into my code (in Android Studio), it gets auto-escaped even more, but it doesn't work. If I edit it to match the above String, it complains that there's an unclosed group. I thought I could maybe put it in triple quotations to not have to escape characters, but that also failed. What am I doing wrong?
I have figured it out myself: A raw string (triple quotations) was indeed the way to go, but Regex apparently still needs the character escapes in the string. Before, I had them removed, because I thought that's how raw strings work, but I was wrong. So it works now with this:
val regex = """href=(\\\\\\)(\"#.*)(\\\\\\)\"""".toRegex()
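The same distinction between string-literal escaping and regex escaping shows up in other languages. As an illustration only (Python rather than Kotlin, with r'...' playing the role of the triple-quoted raw string), the sketch below builds the sample text with three literal backslashes before each quote and matches it with six backslashes in the pattern:

```python
import re

# Build the sample text explicitly: three literal backslashes before each quote.
BS3 = "\\" * 3
text = "blablabla some text <a href=" + BS3 + '"#anchor' + BS3 + '"> some more text'

# Raw string: Python leaves the backslashes alone, so the regex engine sees
# six backslashes, which it reads as three escaped (literal) backslashes.
pattern = re.compile(r'href=(\\\\\\)("#.*)(\\\\\\)"')

match = pattern.search(text)
print(match.group(2))
```

A raw string only switches off the string literal's own escape processing. The regex engine still treats a backslash as an escape character, so each literal backslash has to be written twice in the pattern.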

Regular expression to find specific text within a string enclosed in two strings, but not the entire string

I have this type of text:
string1_dog_bit_johny_bit_string2
string1_cat_bit_johny_bit_string2
string1_crocodile_bit_johny_bit_string2
string3_crocodile_bit_johny_bit_string4
string4_crocodile_bit_johny_bit_string5
I want to find all occurrences of “bit” that occur only between string1 and string2. How do I do this with regex?
I found the question Regex Match all characters between two strings, but the regex there matches the entire string between string1 and string2, whereas I want to match just parts of that string.
I am doing a global replacement in Notepad++. I just need regex, code will not work.
Thank you in advance.
Roman
If I understand correctly, here is some code that does what you want:
// requires: using System; using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
var input = new List<string>
{
    "string1_dog_bit_johny_bit_string2",
    "string1_cat_bit_johny_bit_string2",
    "string1_crocodile_bit_johny_bit_string2",
    "string3_crocodile_bit_johny_bit_string4",
    "string4_crocodile_bit_johny_bit_string5"
};
Regex regex = new Regex(@"(?<bitGroup>bit)");
var allMatches = new List<string>();
foreach (var str in input)
{
    // only look inside strings delimited by string1 ... string2
    if (str.StartsWith("string1") && str.EndsWith("string2"))
    {
        var matchCollection = regex.Matches(str);
        allMatches.AddRange(matchCollection.Cast<Match>().Select(match => match.Groups["bitGroup"].Value));
    }
}
Console.WriteLine("All matches {0}", allMatches.Count);
This regex will do the job:
^string1_(?:.*(bit))+.*_string2$
^ means the start of the text (or line if you use the m option like so: /<regex>/m )
$ means the end of the text
. means any character
* means the previous character/expression is repeated 0 or more times
(?:<stuff>) means a non-capturing group (<stuff> won't be captured as a result of the matching)
You could use ^string1_(.*(bit).*)*_string2$ if you don't care about performance or don't have large/many strings to check. The outer parentheses allow multiple occurrences of "bit".
If you provide us with the language you want to use, we could give more specific solutions.
edit: As you added that you're trying a replacement in Notepad++ I propose the following:
Use (?<=string1_)(.*)bit(.*)(?=_string2) as the regex and $1xyz$2 as the replacement pattern (replace xyz with your string). Then run "Replace All" repeatedly until Notepad++ finds no more matches. The drawback is that this regex matches only one bit per line per pass, so it has to be applied repeatedly.
Btw. even if a regexp matches the whole line, you can still only replace parts of it using capturing groups.
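To see why the replacement has to be repeated, here is a small Python sketch of the same "replace until nothing changes" loop. It is an illustration, not a Notepad++ solution: the function name replace_between and the replacement text xyz are assumptions, and the loop assumes the replacement itself never contains "bit".

```python
import re

# Regex from the Notepad++ suggestion above; one "bit" per line is replaced per pass.
PATTERN = re.compile(r"(?<=string1_)(.*)bit(.*)(?=_string2)")


def replace_between(text: str, replacement: str = "xyz") -> str:
    """Replace every 'bit' between string1_ and _string2, one pass at a time."""
    # Assumption: replacement does not contain "bit", otherwise this never ends.
    while True:
        text, count = PATTERN.subn(
            lambda m: m.group(1) + replacement + m.group(2), text
        )
        if count == 0:
            return text


sample = "\n".join([
    "string1_dog_bit_johny_bit_string2",
    "string3_crocodile_bit_johny_bit_string4",
])
print(replace_between(sample))
```

Because (.*) is greedy, each pass replaces the last remaining "bit" on a line, and lines that do not start with string1_ are never touched.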
You can use the regex:
(?:string1|\G)(?:(?!string2).)*?\Kbit
regex101 demo. Tried it on notepad++ as well and it's working.
There's a description on the demo site, but if you want more explanation, let me know and I'll elaborate!

regular expression matching issue

I've got a string which has the following format
some_string = ",,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,"
and this is the content of a text file called f
I want to search for a specific term within the xxx (let's say that term is 'silicon')
note that the xxx can all be different and can contain any special characters (including meta characters) except for a new line
match = re.findall(r",{3}(.*?silicon.*?),{3}", f.read())
print match
But this doesn't seem to work because it returns results which are in the format:
["xxx,,,xxx,,,xxx,,,xxx,,,silicon", "xxx,,,xxx,,,xxx,,,xxsiliconxx"] but I only want it to return ["silicon", "xxsiliconxx"]
What am I doing wrong?
Try the following regex:
(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})
Example:
>>> s = ',,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx'
>>> re.findall(r'(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})', s)
['silicon', 'xxsiliconxx']
I am assuming that the content in the xxx can contain commas, just not three consecutive commas or it would end the field. If the content in the xxx sections cannot contain any commas, you can use the following instead:
(?<=,{3})[^,\r\n]*?silicon.*?(?=,{3})
The reason your current approach doesn't work is that even though .*? will try to match as few characters as possible, the match will still start as early as possible. So for example the regex a*?b would match the entire string "aaaab". The only time the regex will advance the starting position is when the regex fails to match, and since ,,, can be matched by the .*?, your match will always start at the beginning of the string or just after the previous match.
The lookbehind and lookahead are used to address the issue raised by JaredC in comments, basically re.findall() won't return overlapping matches, so you need the leading and trailing ,,, to not be a part of the match.
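The difference between "lazy" and "leftmost" can be checked directly. The snippet below is a small sketch using the sample string from the example above; the variable names naive and fixed are just labels chosen here.

```python
import re

s = ",,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx"

# Lazy quantifier, but the match still starts at the leftmost possible position.
leftmost = re.search(r"a*?b", "aaaab").group()

# Original pattern: .*? is free to run across ,,, separators.
naive = re.findall(r",{3}(.*?silicon.*?),{3}", s)

# Lookaround version: the part before "silicon" may not cross a ,,, separator.
fixed = re.findall(r"(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})", s)

print(leftmost)
print(naive)
print(fixed)
```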

What is the regular expression (used in actionscript) to find the first newline (\n) in a text?

What is the regular expression to find the first newline (\n) in a text (used to find and delete the newline)? I'm using the regular expression in ActionScript and tried
ta.text = ta.text.replace(/\n*/,'')
but it doesn't seem to work
Thanks
You're using the regular expression \n*, which matches the first occurrence of zero (!) or more line feed characters. The first match of this regex is thus always at the very start of the string. If the string starts with line feed characters, those will be matched. If the string starts with something else, the zero-length string at the start of the string will be matched.
Use \n to match the first line feed character. Use \n+ to match the first sequence of line feed characters. Use [\r\n]+ to match the first sequence of line breaks, regardless of the line break style used (LF only, CRLF, etc.). Use \r?\n to match a single line break as either LF only or CRLF.
In your ActionScript code, use two slashes to delimit the regex you want to use:
ta.text = ta.text.replace(/[\r\n]+/,'');
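The zero-length match is easy to observe outside ActionScript as well. The following is a Python sketch (an assumption made for illustration, since the thread itself uses ActionScript); count=1 mimics a replace that only touches the first match.

```python
import re

text = "Hello\nWorld\n\nAgain"

# \n* can match the empty string, so the first match sits at position 0
# and replacing it changes nothing.
first = re.search(r"\n*", text)
unchanged = re.sub(r"\n*", "", text, count=1)

# \n removes the first line feed only.
removed_one = re.sub(r"\n", "", text, count=1)

# [\r\n]+ removes the first run of line breaks, whatever their style.
removed_run = re.sub(r"[\r\n]+", "", "a\r\n\r\nb\nc", count=1)

print(first.span(), repr(unchanged), repr(removed_one), repr(removed_run))
```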
Just tested this and it worked for me:
ta.text = ta.text.replace("\n",'');
My actual code was (cut and pasted):
var testString:String = "Hello\nWorld";
trace(testString);
testString = testString.replace("\n", '');
trace(testString);
Which yielded the output:
Hello
World
HelloWorld
Alternatively, you can define a pattern along the lines of what you were attempting:
var pattern:RegExp = /AB\*C/;
And that works as well. The modified code would become:
var pattern:RegExp = /\n/;
var testString:String = "Hello\nWorld";
trace(testString);
testString = testString.replace(pattern, '');
trace(testString);
Note that the code above only replaces the first instance of a newline character (as you requested). To replace every instance, add the global flag to the RegExp (for example /\n/g).
I hope that helps in some way,
--gMale
EDIT: given the comment discussion below, try working with one of these events, instead:
change Dispatched when text in the TextArea control changes through user input.
dataChange Dispatched when the data property changes.
textInput Dispatched when the user types, deletes, or pastes text into the control.

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
The character class [^;] matches any character except ;, so the group captures everything between B= and the first ; that follows.
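Either suggestion can be wrapped in something like the asker's getInnerString helper. Below is a minimal Python sketch, assuming a "find" rather than a "replace" and using the lazy (.*?) form so that multi-character end tokens also work. The name get_inner_string is hypothetical, and re.escape is used so the tokens are treated as literal text.

```python
import re
from typing import Optional


def get_inner_string(text: str, start_token: str, end_token: str) -> Optional[str]:
    """Return the text between start_token and the next end_token, or None."""
    # Escape the tokens so characters like + or . are matched literally.
    pattern = re.escape(start_token) + r"(.*?)" + re.escape(end_token)
    match = re.search(pattern, text)
    return match.group(1) if match else None


my_string = "A=abc;B=def_3%^123+-;C=123;"
print(get_inner_string(my_string, "B=", ";"))
```

Note that . does not match a newline by default in Python, so this sketch only finds values that sit on a single line.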
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.