Regex: Trying to extract all values (separated by new lines) within an XML tag

Regex: Trying to extract all values (separated by new lines) within an XML tag - regex

I have a project that demands extracting data from XML files (values inside the <Number>... </Number> tag), however, in my regular expression, I haven't been able to extract lines that had multiple data separated by a newline, see the below example:
As you can see above, I couldn't replicate the multiple lines detection by my regular expression.

If you are using a script somewhere, your first plan should be to use a XML parser. Almost every language has one and it should be far more accurate compared to using regex. However, if you just want to use regex to search for strings inside npp, then you can use \s+ to capture multiple new lines:
<Number>(\d+\s)+<\/Number>
https://regex101.com/r/MwvBxz/1
I'm not sure I fully understand what you are trying to do so if this doesn't do it then let me know what you are going for.

You can use this find+replace combo to remove everything which is not a digit in between the <Number> tag:
Find:
.*?<Number>(.*?)<\/Number>.*
Replace:
$1

finally i was able to find the right regular expression, I'll leave it below if anyone needs it:
<Type>\d</Type>\n<Number>(\d+\n)+(\d+</Number>)
Explanation:
\d: Shortcut for digits, same as [1-9]
\n: Newline.
+: Find the previous element 1 to many times.
Have a good day everybody,

After giving it some more thought I decided to write a second answer.
You can make use of look arounds:
(?<=<Number>)[\d\s]+(?=<\/Number>)
https://regex101.com/r/FiaTKD/1

Related

Notepad++ Regex Search XML argument for anything but certain word

I have a well structured XML file with several grouped units, which contain a consistent number of child elements.
I am trying to find a way, through Regex in Notepad++, to search throughout all of these groups for a certain argument that contains a single word. I have found a way of doing this but the problem is I want to find the negation of this word, that means for instance, if the word is "downward" I want to find anything that is NOT "downward".
Here is an example:
<xml:jus id="84" trek="spanned" place="downward">
I've came up with <xml:jus id="\d+" trek="[\w]*" place="\<downward"> to find these tags, but I need to find all other matches that do not have "downward" in place= argument. I tried <xml:jus id="\d+" trek="[\w]*" place="^\<downward"> but without success.
Any help is appreciated.

If the properties and the string is in the same format, you could also make use of SKIP FAIL to first match what you want to exclude.
<xml:jus id="\d+" trek="\w+" place="downward">(*SKIP)(*F)|<xml:jus id="\d+" trek="\w+" place="[^"]+">
Regex demo

You might be able to use a negative lookahead to exclude downward from being the place:
<[^>]+ place="(?!downward").*?"[^>]*>
Demo

How to replace all lines based on previous lines in Notepad++?

I have an XML code:
<Line1>Matched_text Other_text</Line1>
<Line2>Text_to_replace</Line2>
How to tell Notepad++ to find Matched_text and replace Text_to_replace to Replaced_text? There are several similar blocks of code, with one exactly Matched _text and different Other_text and Text_to_replace. I want to replace all in once.
My idea is to put
Matched_text*<Line2>*</Line2>
in the Find field, and
Matched_text*<Line2>Replaced_text</Line2>
in the Replace field. I know that \1 in regex might be useful, but I don't know where to start.
The actual code is:
<Name>Matched_text, Other_text</Name>
<IsBillable>false</IsBillable>
<Color>-Text_to_replace</Color>

The regex you're looking for is something like the following.
Find: (Matched_text[\w,\s<>\/]*<Color>-).*(</Color>)
Replace: \1Replaced_text\2
Broken down:
`()` is how you tell regex that you want to keep things (for use in /1, /2, etc.), these are called capture groups in regex land.
`Matched_text[\w,\s<>\/]*` means you want your anchor `Matched_text` and everything after it up till the next part of the expression.
`<Color>-).*(</Color>)` Select everything between <Color>- and </Color> for replacement.
If you have any questions about the expression, I highly recommend looking at a regex cheatsheet.

Replacing char in a String with Regular Expression

I got a string like this:
PREFIX-('STRING WITH SPACES TO REPLACE')
and i need this:
PREFIX-('STRING_WITH_SPACES_TO_REPLACE')
I'm using Notepad++ for the Regex Search and Replace, but i'm shure every other Editor capable of regex replacements can do it to.
I'm using:
PREFIX-\('(.*)(\s)(.*)'\)
for search and
PREFIX-('\1_\3')
for replace
but that replaces only one space from the string.

The regex search feature in Notepad++ is very, very weak. The only way I can see to do this in NPP is to manually select the part of the text you want to work on, then do a standard find/replace with the In selection box checked.
Alternatively, you can run the document through an external script, or you can get a better editor. EditPad Pro has the best regex support I've ever seen in an editor. It's not free, but it's worth paying for. In EPP all I had to do was this:
search: ((?:PREFIX-\('|\G)[^\s']+)\s+
replace: $1_
EDIT: \G matches the position where the previous match ended, or the beginning of the input if there was no previous match. In other words, the first time you apply the regex, \G acts like \A. You can prevent that by adding a negative lookahead, like so:
((?:PREFIX-\('|(?!\A)\G)[^\s']+)\s+
If you want to prevent a match at the very beginning of the text no matter what it starts with, you can move the lookahead outside the group:
(?!\A)((?:PREFIX-\('|\G)[^\s']+)\s+
And, just in case you were wondering, a lookbehind will work just as well as a lookahead:
((?:PREFIX-\('|(?<!\A)\G)[^\s']+)\s+

You have to keep matching from the beggining of the string untill you can match no more.
find /(PREFIX-\('[^\s']*)\s([^']*'\))/
replace $1_$2
like: while (/(PREFIX-\('[^\s']*)\s([^']*'\))/$1_$2/) {}

How about using Replace all for about 20 times? Or until you're sure no string contains more spaces

Due to nature of regex, it's not possible to do this in one step by normal regular expression.
But if I be in your place, I do such replaces in several steps:
find such patterns and mark them with special character
(Like replacing STRING WITH SPACES TO REPLACE with #STRING WITH SPACES TO REPLACE#
Replace #([^#\s]*)\s to #\1_ server times.
Remove markers!

I studied a little the regex tool in Notepad++ because I didn't know their possibilities.
I conclude that they aren't powerful enough to do what you want.
Your are obliged to learn and use a programming language having a real regex capability. There are a number of them. Personnaly, I use Python. It would take 1 mn to do what you want with it

You'd have to run the replace several times for each space but this regex will work
/(?<=PREFIX-\(')([^\s]+)\s+/g
Replace with
\1_ or $1_
See it working at http://refiddle.com/10z

Regular expression question

I have some text like this:
dagGeneralCodes$_ctl1$_ctl0
Some text
dagGeneralCodes$_ctl2$_ctl0
Some text
dagGeneralCodes$_ctl3$_ctl0
Some text
dagGeneralCodes$_ctl4$_ctl0
Some text
I want to create a regular expression that extracts the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0 from the text above.
the result should be: dagGeneralCodes$_ctl4$_ctl0
Thanks in advance
Wael

This should do it:
.*(dagGeneralCodes\$_ctl\d\$_ctl0)
The .* at the front is greedy so initially it will grab the entire input string. It will then backtrack until it finds the last occurrence of the text you want.
Alternatively you can just find all the matches and keep the last one, which is what I'd suggest.
Also, specific advice will probably need to be given depending on what language you're doing this in. In Java, for example, you will need to use DOTALL mode to . matches newlines because ordinarily it doesn't. Other languages call this multiline mode. Javascript has a slightly different workaround for this and so on.

You can use:
[\d\D]*(dagGeneralCodes\$_ctl\d+\$_ctl0)
I'm using [\d\D] instead of . to make it match new-line as well. The * is used in a greedy way so that it will consume all but the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0.

I really like using this Regular Expression Cheatsheet; it's free, a single page, and printed, fits on my cube wall.

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

I have an text that consists of information enclosed by a certain pattern.
The only thing I know is the pattern: "${template.start}" and ${template.end}
To keep it simple I will substitute ${template.start} and ${template.end} with "a" in the example.
So one entry in the text would be:
aINFORMATIONHEREa
I do not know how many of these entries are concatenated in the text. So the following is correct too:
aFOOOOOOaaASDADaaASDSDADa
I want to write a regular expression to extract the information enclosed by the "a"s.
My first attempt was to do:
a(.*)a
which works as long as there is only one entry in the text. As soon as there are more than one entries it failes, because of the .* matching everything. So using a(.*)a on aFOOOOOOaaASDADaaASDSDADa results in only one capturing group containing everything between the first and the last character of the text which are "a":
FOOOOOOaaASDADaaASDSDAD
What I want to get is something like
captureGroup(0): aFOOOOOOaaASDADaaASDSDADa
captureGroup(1): FOOOOOO
captureGroup(2): ASDAD
captureGroup(3): ASDSDAD
It would be great to being able to extract each entry out of the text and from each entry the information that is enclosed between the "a"s. By the way I am using the QRegExp class of Qt4.
Any hints? Thanks!
Markus
Multiple variation of this question have been seen before. Various related discussions:
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Using regular expressions how do I find a pattern surrounded by two other patterns without including the surrounding strings?
Use RegExp to match a parenthetical number then increment it
Regex for splitting a string using space when not surrounded by single or double quotes
What regex will match text excluding what lies within HTML tags?
and probably others...

Simply use non-greedy expressions, namely:
a(.*?)a

You need to match something like:
a[^a]*a

You have a couple of working answers already, but I'll add a little gratuitous advice:
Using regular expressions for parsing is a road fraught with danger
Edit: To be less cryptic: for all there power, flexibility and elegance, regular expression are not sufficiently expressive to describe any but the simplest grammars. Ther are adequate for the problem asked here, but are not a suitable replacement for state machine or recursive decent parsers if the input language become more complicated.
SO, choosing to use RE for parsing input streams is a decision that should be made with care and with an eye towards the future.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex: Trying to extract all values (separated by new lines) within an XML tag - regex

You can use this find+replace combo to remove everything which is not a digit in between the <Number> tag: Find: .?<Number>(.?)<\/Number>.* Replace: $1

finally i was able to find the right regular expression, I'll leave it below if anyone needs it: <Type>\d</Type>\n<Number>(\d+\n)+(\d+</Number>) Explanation: \d: Shortcut for digits, same as [1-9] \n: Newline. +: Find the previous element 1 to many times. Have a good day everybody,

After giving it some more thought I decided to write a second answer. You can make use of look arounds: (?<=<Number>)[\d\s]+(?=<\/Number>) https://regex101.com/r/FiaTKD/1

Related

Notepad++ Regex Search XML argument for anything but certain word

How to replace all lines based on previous lines in Notepad++?

Replacing char in a String with Regular Expression

Regular expression question

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex: Trying to extract all values (separated by new lines) within an XML tag - regex

You can use this find+replace combo to remove everything which is not a digit in between the <Number> tag: Find: .*?<Number>(.*?)<\/Number>.* Replace: $1

finally i was able to find the right regular expression, I'll leave it below if anyone needs it: <Type>\d</Type>\n<Number>(\d+\n)+(\d+</Number>) Explanation: \d: Shortcut for digits, same as [1-9] \n: Newline. +: Find the previous element 1 to many times. Have a good day everybody,

After giving it some more thought I decided to write a second answer. You can make use of look arounds: (?<=<Number>)[\d\s]+(?=<\/Number>) https://regex101.com/r/FiaTKD/1

Related

Notepad++ Regex Search XML argument for anything but certain word

How to replace all lines based on previous lines in Notepad++?

Replacing char in a String with Regular Expression

Regular expression question

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

Categories

Resources

You can use this find+replace combo to remove everything which is not a digit in between the <Number> tag: Find: .?<Number>(.?)<\/Number>.* Replace: $1