RegEx to extract first XML element name with optional namespace prefix

I have to use a regex to extract the name of the first element in the XML, ignoring the optional namespace prefix.
Here is the sample XML1:
<ns1:Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">
<foodType>
<vegtables>
<carrots>1</carrots>
</vegtables>
<foodType>
</ns1:Monkey>
And here is similar XML that is without namespace, XML2:
<Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">
<foodType>
<vegtables>
<carrots>1</carrots>
</vegtables>
<foodType>
</Monkey>
I need a regex that returns "Monkey" for either XML1 or XML2.
So far I have tried the regex <(\w+:)(\w+), which works for XML1, but I don't know how to make it work for XML2.

Since this seems to be a one-time job and you really do not have access to an XML parser, you can use either of the two regexes below (they will work only for XML files like the samples you provided):
<(\w+:)?(\w+)(?=\s*xmlns="http://myurlisrighthereheremonkey\.com/monkeynamespace")
Demo 1
Or (if you check the whole single file contents with the regex):
^\s*<(\w+:)?(\w+)
Demo 2
There are two main changes:
(\w+:)? - adding the ? quantifier makes the first capturing group optional, so the prefix may be absent. The element name is then in capturing group 2.
^\s* anchors the match at the beginning of the string (I assume there is no XML declaration there); alternatively, the look-ahead (?=\s*xmlns="http://myurlisrighthereheremonkey\.com/monkeynamespace") allows a match only if it is followed by optional whitespace and the literal xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace".
However, you really should consider switching to code that supports XML parsing; it will make life easier for you and for whoever maintains the code.
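As a rough illustration of the second pattern, here is a minimal Python sketch. The helper name first_element_name and the shortened sample strings are assumptions made for this example, not part of the answer above, and it assumes nothing (such as an XML declaration or comment) precedes the root element.

```python
import re

# Optional "prefix:" group, then the local element name, anchored at the start.
ROOT_NAME = re.compile(r'^\s*<(\w+:)?(\w+)')

def first_element_name(xml_text):
    """Return the first element's local name, or None if nothing matches."""
    match = ROOT_NAME.search(xml_text)
    return match.group(2) if match else None

# Shortened versions of XML1 and XML2 from the question.
xml1 = '<ns1:Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">\n<foodType>\n</ns1:Monkey>'
xml2 = '<Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">\n<foodType>\n</Monkey>'

print(first_element_name(xml1))
print(first_element_name(xml2))
```

Both calls should yield the same local name, because the prefix group is simply skipped when it is absent.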


.net regex - strings that don't contain full stop preceding last `<item>` - Attempt 2

This question follows from .net regex - strings that don't contain full stop on last list item
Problem is now the below. Note that examples have been amended and more added - all need to be satisfied. Good examples should return no matches, and bad examples should return matches.
I'm trying to use .net regex for identifying strings in XML data that don't contain a full stop before the last tag. I have not much experience with regex. I'm not sure what I need to change & why to get the result I'm looking for.
There are line breaks and carriage returns at end of each line in the data.
A schema is used for the XML.
We have no access to .Net code - just users using a custom built application.
Example 1 of bad XML Data - should give 1 match:
<randlist prefix="unorder">
<item>abc</item>
<item>abc</item>
<item>abc</item>
</randlist>
Example 2 of bad XML Data - should give 1 match:
<randlist prefix="unorder">
<item>abc. abc</item>
<item>abc. abc</item>
<item>abc. abc</item>
</randlist>
Example 1 of good XML Data - regexp should give no matches - full stop preceding last </item>:
<randlist prefix="unorder">
<item>abc</item>
<item>abc</item>
<item>abc.</item>
</randlist>
Example 2 of good XML Data - regexp should give no matches - full stop preceding last </item>:
<randlist prefix="unorder">
<item>abc. abc</item>
<item>abc. abc</item>
<item>abc. abc.</item>
</randlist>
Reg exp patterns I tried that didn't work (either false positives or no matches using https://regex101.com/) for criteria above in the bad XML data (not tested on good XML data):
^<randlist \w*=[\S\s]*\.*[^.]*<\/item>[\n]*<\/randlist>$
^\s+<item>[^<]*?(?<=\.)<\/item>$
Seeing how you are using .NET, you could:
1. Load the XML file into an XML document.
2. Use the GetElementsByTagName method to get all the item tags within the randlist element.
3. Get the last element returned by step 2.
4. Check whether its text ends with the period character.
The above should be more readable, and if the structure of the XML changes, you won't have to rewrite half your script.
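The same parser-based idea can be sketched in Python, which is used here only for illustration since the answer itself refers to .NET. The function name last_item_ends_with_period is hypothetical, and the sketch assumes randlist is the root of the fragment being checked and that a full stop must be the last non-whitespace character of the last item.

```python
import xml.etree.ElementTree as ET

def last_item_ends_with_period(xml_text):
    """True if the last <item> under the root ends with a full stop."""
    root = ET.fromstring(xml_text)
    items = root.findall('item')  # direct <item> children of <randlist>
    if not items:
        return False
    # Join all text inside the last item and ignore trailing whitespace.
    last_text = "".join(items[-1].itertext()).rstrip()
    return last_text.endswith('.')

bad = '<randlist prefix="unorder">\n<item>abc. abc</item>\n<item>abc. abc</item>\n</randlist>'
good = '<randlist prefix="unorder">\n<item>abc. abc</item>\n<item>abc. abc.</item>\n</randlist>'

print(last_item_ends_with_period(bad))
print(last_item_ends_with_period(good))
```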
The regexp pattern below works for us - tested in Notepad++
[^.]<\/item>\s{1,2}<\/randlist>
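To see how this pattern behaves on the four examples from the question, here is a small Python check. It is only a sketch: the variable names are made up for the example, and the samples use \r\n line endings on the assumption that this is what "line breaks and carriage returns" means in the data.

```python
import re

# Any character other than "." right before the last </item> of a randlist.
MISSING_STOP = re.compile(r'[^.]</item>\s{1,2}</randlist>')

def randlist(last_item):
    """Build a sample list whose final item has the given text."""
    return ('<randlist prefix="unorder">\r\n'
            '<item>abc. abc</item>\r\n'
            '<item>' + last_item + '</item>\r\n'
            '</randlist>')

samples = {
    'bad 1': randlist('abc'),
    'bad 2': randlist('abc. abc'),
    'good 1': randlist('abc.'),
    'good 2': randlist('abc. abc.'),
}

for name, text in samples.items():
    print(name, len(MISSING_STOP.findall(text)))
```

The \s{1,2} part is what lets the pattern cope with either a single line feed or a carriage return plus line feed between the two closing tags.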

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim the trailing ")" and "." characters from your output:
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
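For a self-contained version of the trimming step, the sketch below fills in your_output with two of the lines from the question. That sample value is an assumption made for the example; in practice it would be whatever your first regex produced.

```python
import re

# (?m) makes $ match at the end of every line, not just the end of the string.
trailing = re.compile(r'(?m)[.)]+$')

your_output = (
    'http://rda.ucar.edu/datasets/ds117.0/.\n'
    'http://www.discover-earth.org/index.html).'
)

cleaned = trailing.sub('', your_output)
print(cleaned)
```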
And this regex seems to work for extracting the URLs from your original sample text:
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo (edited from https?:[\S]*\/)
A Python script might look something like this:
import re
ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
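The two observations above (that $-_ is a range which happens to include /, and that the simplified pattern still swallows the trailing ")." characters) can be checked with a short Python sketch. The sample sentence is made up for the example.

```python
import re

# Inside a character class, '$-_' is a range from '$' (0x24) to '_' (0x5F).
in_range = [chr(c) for c in range(ord('$'), ord('_') + 1)]
print('/' in in_range, ':' in in_range)

# Without the accidental range, '/' is no longer matched by the class...
no_slash = re.fullmatch(r'[$\w#.&+!*(),-]', '/')
print(no_slash)

# ...so it has to be added explicitly, as in the final simplified pattern.
simplified = re.compile(r'https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+')
text = 'see http://www.discover-earth.org/index.html). for details'
print(simplified.findall(text))
```

The last line still includes the closing parenthesis and the full stop, since both characters are members of the class.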
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
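import re
# des holds the text to search.
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Drop everything after the last '/', which removes the trailing "." or ")."
    # (note that this also drops a final file name such as index.html).
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)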

Regular expression for XML element with arbitrary attribute value

I'm not very comfortable with RegEx.
I have a text file with a lot of data and different formats. I want to keep this kind of string.
<data name=\"myProptertyValue\" xml:space=\"preserve\">
Only the value of the name property can change.
So I imagined a regex like this <data name=\\\"(.)\\\" xml:space=\\\"preserve\\\"> but it's not working.
Any tips?
Try this:
<data name=\\".*?\\" xml:space=\\"preserve\\">
There is no need to escape the " character itself.
Your (.) will capture only a single character; add a quantifier like + (“one or more”):
/<data name=\\"(.+)\\" xml:space=\\"preserve\\">/
Depending on what exactly your input is (element by element or the entire document) and on what you want to achieve (removing, replacing, testing, or capturing), you should make the regex global by adding the g flag, so that it is applied more than once. You should also make the + quantifier lazy by adding a ? to it. That makes it non-greedy, which is what you want here: capturing should stop at the closing quote of the attribute (similar to matching anything but a quotation mark: [^"]). It will then look like this:
/<data name=\\"(.+?)\\" xml:space=\\"preserve\\">/g
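The difference between the greedy and the lazy quantifier shows up as soon as two elements sit on the same line. The sketch below uses Python, where findall plays the role of the g flag; the attribute values "first" and "second" are made up for the example.

```python
import re

# The input contains a literal backslash before each quote, so the pattern
# uses \\ to match one backslash followed by the quote character.
greedy = re.compile(r'<data name=\\"(.+)\\" xml:space=\\"preserve\\">')
lazy = re.compile(r'<data name=\\"(.+?)\\" xml:space=\\"preserve\\">')

# Two elements on one line.
line = (r'<data name=\"first\" xml:space=\"preserve\">'
        r'<data name=\"second\" xml:space=\"preserve\">')

print(greedy.findall(line))  # one long capture spanning both elements
print(lazy.findall(line))    # one capture per element
```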
<data name=\\"(.+)\\" xml:space=\\"preserve\\">
It will catch what's inside "data name".
If you're having trouble with regex, sites like https://regex101.com/ or http://regexr.com/ can help you build and test your pattern.

Remove first level xml (and namespace) tag regex

Let's say I have the following XML:
<Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">
<foodType>
<vegtables>
<carrots>1</carrots>
</vegtables>
<foodType>
</Monkey>
Now, the service I'm using already has the root-level element and namespace, so I really just need:
<foodType>
<vegtables>
<carrots>1</carrots>
</vegtables>
<foodType>
Is there a regex I can use that generically removes both the top level opening and closing tags? By generic I mean that it doesn't have to start with the type Monkey.
Thanks in advance!
You can match the following regex:
^<[^>]+>(.*)<\/\w+>$
and replace with the first captured group \1 or $1
Demo
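As a hedged sketch of how the replacement might look in Python: the pattern is the one above, but the re.DOTALL flag and the final strip() are assumptions added here, because the sample spans several lines and . does not match line breaks by default. The pattern also assumes the closing tag has no namespace prefix and that nothing follows it.

```python
import re

xml = (
    '<Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">\n'
    '<foodType>\n'
    '<vegtables>\n'
    '<carrots>1</carrots>\n'
    '</vegtables>\n'
    '<foodType>\n'
    '</Monkey>'
)

# re.DOTALL lets (.*) run across line breaks; \1 keeps only the inner content.
inner = re.sub(r'^<[^>]+>(.*)</\w+>$', r'\1', xml, flags=re.DOTALL).strip()
print(inner)
```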

Contents within an attribute for both single and multiple ending tags

How can I fetch the contents of the value attribute of the tags below across files?
<h:graphicImage .... value="*1.png*" ...../>
<h:graphicImage .... value="*2.png*" ....>...</h:graphicImage>
My regular expression search should result in:
1.png
2.png
All I could find handles tags with a separate closing tag, but what about self-closing tags?
Use an XML parser instead, regex cannot truly parse XML properly, unless you know the input will always follow a particular form.
However, here is a regex you can use to extract the value attribute of h:graphicImage tags, but read the caveats after:
<h:graphicImage[^>]+value="\*(.*?)\*"
and the 1.png or 2.png will be in the first captured group.
Caveats:
here I have assumed that your 1.png, 2.png etc are always surrounded by asterisks as that is what it seems from your question (that is what the \* is for)
this regex will fail if one of the attributes has a ">" character in it, for example
<h:graphicImage foo=">" value="*1.png*"
This is what I mentioned before about regex never being able to parse XML properly.
You could work around this by adjusting your regex:
<h:graphicImage.+?value="\*(.*?)\*"
But this means that if you had <h:graphicImage /><foo value="*1.png*"> then the 1.png from the foo tag is extracted, when you only want to extract from the graphicImage tag.
Again, regex will always have issues with corner cases for XML, so you need to adjust according to your application (for example, if you know that only the graphicImage tag will ever have a "value" attribute, then the second case may be better than the first).
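To make the first pattern and its main caveat concrete, here is a small Python sketch. The id attributes and the surrounding markup are invented for the example; only the pattern itself comes from the answer.

```python
import re

VALUE = re.compile(r'<h:graphicImage[^>]+value="\*(.*?)\*"')

# One self-closing tag and one tag with a separate closing tag.
markup = (
    '<h:graphicImage id="a" value="*1.png*" />\n'
    '<h:graphicImage id="b" value="*2.png*">text</h:graphicImage>\n'
)
print(VALUE.findall(markup))

# The caveat: a ">" inside an earlier attribute stops [^>]+ too early.
tricky = '<h:graphicImage foo=">" value="*1.png*" />'
print(VALUE.findall(tricky))
```

The pattern does not care how the tag ends, so both tag styles are handled by the same expression; the second input shows the case where it silently finds nothing.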