I have a well-structured XML file with several grouped units, each containing a consistent number of child elements.
I am trying to find a way, using regex in Notepad++, to search all of these groups for a certain attribute that contains a single word. I have found a way of doing this, but the problem is that I want to find the negation of this word: for instance, if the word is "downward", I want to find anything that is NOT "downward".
Here is an example:
<xml:jus id="84" trek="spanned" place="downward">
I came up with <xml:jus id="\d+" trek="[\w]*" place="\<downward"> to find these tags, but I need to find all other matches that do not have "downward" in the place= attribute. I tried <xml:jus id="\d+" trek="[\w]*" place="^\<downward">, but without success.
Any help is appreciated.
If the attributes and the string are always in the same format, you could also use (*SKIP)(*FAIL) to first match what you want to exclude and then discard it:
<xml:jus id="\d+" trek="\w+" place="downward">(*SKIP)(*F)|<xml:jus id="\d+" trek="\w+" place="[^"]+">
You might be able to use a negative lookahead to exclude downward from being the place:
<[^>]+ place="(?!downward").*?"[^>]*>
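For anyone who wants to try the lookahead idea outside Notepad++, here is a minimal Python sketch. It is only an illustration: the second and third sample tags are made up, and the pattern is a stricter variant of the one above ([^"]+ instead of .*?). Python's standard re module does not support (*SKIP)(*F), so only the lookahead approach is shown.

```python
import re

# Only the first tag comes from the question; the other two are
# hypothetical samples added for illustration.
tags = [
    '<xml:jus id="84" trek="spanned" place="downward">',
    '<xml:jus id="85" trek="spanned" place="upward">',
    '<xml:jus id="86" trek="open" place="downwards">',
]

# (?!downward") rejects the tag only when the whole attribute value
# is exactly "downward"; values such as "downwards" still match.
pattern = re.compile(r'<xml:jus id="\d+" trek="\w+" place="(?!downward")[^"]+">')

matches = [tag for tag in tags if pattern.fullmatch(tag)]
for tag in matches:
    print(tag)
```

Including the closing quote inside the lookahead is a deliberate choice here: without it, any value that merely starts with "downward" would be excluded as well.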
I have a project that requires extracting data from XML files (the values inside the <Number>...</Number> tag). However, my regular expression fails to extract entries that contain multiple values separated by newlines; see the example below:
As you can see above, I couldn't replicate the multiple lines detection by my regular expression.
If you are using a script somewhere, your first plan should be to use an XML parser. Almost every language has one, and it should be far more accurate than regex. However, if you just want to use regex to search for strings inside Notepad++, then you can use \s to match the newline after each number:
<Number>(\d+\s)+<\/Number>
https://regex101.com/r/MwvBxz/1
I'm not sure I fully understand what you are trying to do so if this doesn't do it then let me know what you are going for.
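As a rough illustration of the parser route mentioned above, here is a small Python sketch using the standard xml.etree.ElementTree module. The sample document, including the <Record> wrapper element, is an assumption, since the original file is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file layout is not shown in the question.
xml_text = """<Record>
<Type>1</Type>
<Number>123
456
789</Number>
</Record>"""

root = ET.fromstring(xml_text)

# Collect the whitespace-separated values of every <Number> element.
numbers = [
    value
    for node in root.iter("Number")
    for value in (node.text or "").split()
]
print(numbers)
```

Splitting the element text on whitespace sidesteps the multi-line problem entirely, because the parser hands back the whole text of the element regardless of line breaks.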
You can use this find-and-replace combination to keep only what is between the <Number> tags and remove everything around them:
Find:
.*?<Number>(.*?)<\/Number>.*
Replace:
$1
Finally, I was able to find the right regular expression. I'll leave it below in case anyone needs it:
<Type>\d</Type>\n<Number>(\d+\n)+(\d+</Number>)
Explanation:
\d: Shortcut for digits, same as [0-9]
\n: Newline.
+: Find the previous element 1 to many times.
Have a good day everybody,
After giving it some more thought I decided to write a second answer.
You can make use of lookarounds:
(?<=<Number>)[\d\s]+(?=<\/Number>)
https://regex101.com/r/FiaTKD/1
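The same lookaround pattern can be tried in Python, as a minimal sketch. The input string is a made-up sample, because the original data is not shown in the question.

```python
import re

# Hypothetical sample with values spread over several lines.
text = "<Type>1</Type>\n<Number>123\n456\n789</Number>\n"

# The lookbehind and lookahead anchor the match between the tags
# without including the tags themselves in the result.
block = re.search(r"(?<=<Number>)[\d\s]+(?=</Number>)", text)

numbers = block.group().split() if block else []
print(numbers)
```

The match is the raw text between the tags, newlines included, so a final split() is still needed to get the individual values.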
I am trying to match a specific string, but only when it's not part of a couple specific literal strings. I wish to exclude results falling within the literal strings <span class='highlight'> and </span>. So if I search for "light", "high", "pan", "an", etc. I want to match any other occurrences that are not part of those two literals.
I'm not trying to parse full HTML, only those two strings listed, which will never change. The class value will never change from 'highlight'.
I have tried all manners of lookarounds, capturing groups, non-capturing groups, etc that I can think of and have come up with nothing. Lookarounds don't seem to be working, I'm betting because the position(s) of the string in relation to the cases to be excluded are not guaranteed to be in a certain order.
Is this possible with only regex?
Would this method work for you?
Search-and-replace those two tags with the empty string:
s/(<span class='highlight'>|<\/span>)//g
Search for your string
Of course, you might end up with your search string being "around" one of those bits, e.g. searching for abcd and matching ab</span>cd. You could get around that by replacing the tags with some character sequence that you are sure cannot be searched for.
You'll also lose the context of the situation of the string you're looking for relative to those tags, but not knowing what you're trying to achieve exactly, it's difficult to say whether that is important for you or not.
Oops, I thought I was properly simplifying my question, but it turns out I was wrong. I inherited code that was taking a string and doing a regex replace on a list of search terms by looping through them one at a time and wrapping matches in <span class="highlight"></span>. That resulted in a phrase like "Look into the light" ending up looking incorrect if you searched for "the light". "the" was matched and replaced, then "light" was matched, but would match the newly replaced tag for "the". The trick wasn't to fix the regex that got run on each individual word, but to change it into a regex that processed all of them together. Rather than regex replace using the, then light, the regex just needed to be the|light.
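A minimal Python sketch of that single-pass idea follows. The highlight function name is hypothetical, and sorting the terms longest-first is an extra assumption that was not part of the original fix; the list of terms is assumed to be non-empty.

```python
import re

def highlight(text, terms):
    """Wrap every occurrence of any term in a highlight span, in one pass."""
    # Escape each term so it is matched literally, and try longer terms
    # first so that overlapping terms prefer the longest match.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.sub(
        alternation,
        lambda m: '<span class="highlight">' + m.group(0) + "</span>",
        text,
    )

result = highlight("Look into the light", ["the", "light"])
print(result)
```

Because the substitution runs once over the original text, the inserted span tags are never scanned again, so "light" can no longer match inside class="highlight".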
I use Notepad++.
I need to search for and replace entire words that contain a specific keyword.
Ex: someting HELP.blablabla.blabla someting
I would like to search the entire text for words that contain the keyword "HELP", matching up to the first space OR the first comma.
In this case: HELP.blablabla.blabla
Thanks a lot.
Go to the search panel, check the regex option at the bottom, and try: (HELP)([^ ,]*)
Note: There is a space character after the ^
This regex means: search for the literal word HELP (HELP), followed by anything that isn't a space or a comma ([^ ,]*). The ^ inside the brackets is a negation.
Edit:
You can use just HELP[^ ,]*. The parentheses only create capturing groups, in case you need to refer to the specific groups in the replacement later. As pointed out by #alphabravo.
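To check the pattern outside Notepad++, here is a quick Python sketch run against the sample line from the question. The replacement text REPLACED is just a placeholder chosen for illustration.

```python
import re

line = "someting HELP.blablabla.blabla someting"

# HELP followed by any run of characters that are neither a space nor a comma.
pattern = re.compile(r"HELP[^ ,]*")

found = pattern.findall(line)
replaced = pattern.sub("REPLACED", line)

print(found)
print(replaced)
```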
You say search and replace an entire word but if it were that simple then I wonder why a regular search and replace isn't sufficient. So I'm reading between the lines and assuming you want to match on full lines of text.
I think I've used npp enough to get the syntax right. I don't remember any eccentricities that would apply. Is the comma/space optional?
^[^, ]*HELP[^, ]*[, ]
I'm kinda thinking this one might be good enough:
^[^, ]*HELP
Okay so I am having an issue getting a repeat to work at all, let alone the way I want it to work...
I will be bringing in a string with the following information
NETWORK;PASS;1;THIS TEXT|CAN BE|RANDOM|WITH|PIPE|SEPERATORS;\r
what I have so far
(?:NETWORK;.*;(?:0|1);)([^|]*)
this currently leaves me the first block matched
THIS TEXT
What I am trying to do is set it up so I can programmatically specify which block to match. The pipe-separated text will have between 3 and 7 "blocks", and depending on the situation I may need to match any one of them, but only one at a time.
I had thought about just duplicating
([^|]*)
and making all but one of the copies non-capturing, but I can't seem to get it to match anything if I duplicate that group, nor can I get repeat operators to work on the group.
I am a bit lost, so this may not make complete sense; if clarification is required, I will provide it on request. Any help is appreciated.
Why not just split THIS TEXT|CAN BE|RANDOM|WITH|PIPE|SEPERATORS on the pipe symbol? Much easier than a dynamically-generated regex.
But if you really want to generate a regex:
Start with (?:NETWORK;.*;(?:0|1);)
To get the nth element (indexed from 0), add (?:[^|]+[|]){n} (replace n with the number to skip), followed by ([^|]+)
Example:
(?:NETWORK;.*;(?:0|1);)(?:[^|]+[|]){3}([^|]+)
This matches WITH in your example.
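Both suggestions can be sketched in Python as follows. The helper names block_pattern and block_by_split are hypothetical, and the split version assumes the pipe-separated text is always the fourth semicolon-separated field, as in the sample line.

```python
import re

line = "NETWORK;PASS;1;THIS TEXT|CAN BE|RANDOM|WITH|PIPE|SEPERATORS;\r"

def block_pattern(n):
    """Build a regex that captures the nth pipe-separated block (0-based)."""
    return re.compile(
        r"(?:NETWORK;.*;(?:0|1);)(?:[^|]+[|]){" + str(n) + r"}([^|]+)"
    )

def block_by_split(text, n):
    """Same result without a generated regex: split on ';' and then on '|'."""
    return text.split(";")[3].split("|")[n]

first = block_pattern(0).search(line).group(1)
fourth = block_pattern(3).search(line).group(1)
fourth_split = block_by_split(line, 3)

print(first, fourth, fourth_split)
```

One difference worth noting: for the last block, the regex version also captures the trailing ";" and line ending, because [^|]+ only stops at a pipe, while the split version does not.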
I have a text that consists of information enclosed by a certain pattern.
The only thing I know is the pattern: ${template.start} and ${template.end}
To keep it simple I will substitute ${template.start} and ${template.end} with "a" in the example.
So one entry in the text would be:
aINFORMATIONHEREa
I do not know how many of these entries are concatenated in the text. So the following is correct too:
aFOOOOOOaaASDADaaASDSDADa
I want to write a regular expression to extract the information enclosed by the "a"s.
My first attempt was to do:
a(.*)a
which works as long as there is only one entry in the text. As soon as there is more than one entry it fails, because the .* matches everything. So using a(.*)a on aFOOOOOOaaASDADaaASDSDADa results in only one capturing group, containing everything between the first and the last character of the text, which are both "a":
FOOOOOOaaASDADaaASDSDAD
What I want to get is something like
captureGroup(0): aFOOOOOOaaASDADaaASDSDADa
captureGroup(1): FOOOOOO
captureGroup(2): ASDAD
captureGroup(3): ASDSDAD
It would be great to be able to extract each entry from the text and, from each entry, the information enclosed between the "a"s. By the way, I am using the QRegExp class of Qt4.
Any hints? Thanks!
Markus
Multiple variations of this question have been seen before. Various related discussions:
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Using regular expressions how do I find a pattern surrounded by two other patterns without including the surrounding strings?
Use RegExp to match a parenthetical number then increment it
Regex for splitting a string using space when not surrounded by single or double quotes
What regex will match text excluding what lies within HTML tags?
and probably others...
Simply use non-greedy expressions, namely:
a(.*?)a
You need to match something like:
a[^a]*a
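Here is a small Python sketch comparing the two suggestions on the sample string from the question. Python is used only for illustration, with a capturing group added to the negated-class version. As far as I know, QRegExp does not accept the lazy .*? syntax and offers setMinimal(true) instead, so the negated character class may be the more portable choice there.

```python
import re

text = "aFOOOOOOaaASDADaaASDSDADa"

# Lazy quantifier: stop at the first closing "a".
lazy = re.findall(r"a(.*?)a", text)

# Negated character class: the payload simply cannot contain "a".
negated = re.findall(r"a([^a]*)a", text)

print(lazy)
print(negated)
```

Both patterns consume the closing delimiter of each entry, which works here because every entry has its own opening and closing "a".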
You have a couple of working answers already, but I'll add a little gratuitous advice:
Using regular expressions for parsing is a road fraught with danger
Edit: To be less cryptic: for all their power, flexibility and elegance, regular expressions are not sufficiently expressive to describe any but the simplest grammars. They are adequate for the problem asked here, but are not a suitable replacement for state machines or recursive descent parsers if the input language becomes more complicated.
So, choosing to use RE for parsing input streams is a decision that should be made with care and with an eye towards the future.