Regular expression for XML element with arbitrary attribute value

I'm not very comfortable with regex.
I have a text file with a lot of data and different formats. I want to keep this kind of string.
<data name=\"myProptertyValue\" xml:space=\"preserve\">
Only the value of the name property can change.
So I imagined a regex like this <data name=\\\"(.)\\\" xml:space=\\\"preserve\\\"> but it's not working.
Any tips?

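Try this:
<data name=\\".*?\\" xml:space=\\"preserve\\">
There is no need to escape the " itself with an extra \; the \\" already matches a literal backslash followed by a quote.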

Your (.) will capture only a single character; add a quantifier like + (“one or more”):
/<data name=\\"(.+)\\" xml:space=\\"preserve\\">/
Depending on your input (a single element or an entire document) and on what you want to achieve (removing, replacing, testing, or capturing), you should make the regex global by adding the g flag, so that it is applied more than once. You should also make the + quantifier lazy by appending a ? to it, because capturing should stop at the closing quote of the attribute. A negated character class such as [^"] (anything but a quotation mark) has the same effect. The lazy version looks like this:
/<data name=\\"(.+?)\\" xml:space=\\"preserve\\">/g
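The difference between the greedy and lazy quantifiers can be sketched in Python, where re.findall plays the role of the g flag. The sample text is made up for illustration and assumes, as in the question, that every quote in the file is preceded by a literal backslash.

```python
import re

# Hypothetical sample: two elements on one line, each quote preceded by a literal backslash.
text = (
    r'<data name=\"first\" xml:space=\"preserve\">'
    r'<data name=\"second\" xml:space=\"preserve\">'
)

# In each pattern, \\" matches a literal backslash followed by a quote.
greedy = re.compile(r'<data name=\\"(.+)\\" xml:space=\\"preserve\\">')
lazy = re.compile(r'<data name=\\"(.+?)\\" xml:space=\\"preserve\\">')
# Negated class: the value may contain anything except a quote or a backslash.
negated = re.compile(r'<data name=\\"([^"\\]+)\\" xml:space=\\"preserve\\">')

greedy_names = greedy.findall(text)    # one over-long capture spanning both elements
lazy_names = lazy.findall(text)        # one capture per element
negated_names = negated.findall(text)  # same result as the lazy version here

print(greedy_names)
print(lazy_names)
print(negated_names)
```

The greedy pattern swallows everything up to the last closing quote on the line, which is why the lazy or negated-class form is usually the safer choice.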

<data name=\\"(.+)\\" xml:space=\\"preserve\\">
It will capture the value of the name attribute.
If you're having trouble with regex, sites such as https://regex101.com/ or http://regexr.com/ can help you build and test your expression.

Related

Regex Python, Find Everything Inbetween Quotes after Keyword

I have strings that look like this:
"Grand Theft Auto V (5)" border="0" src="/product_images/Gaming/Playstation4 Software/5026555416986_s.jpg" title="Grand... (the string continues for a while here)
I want to use regex to grab this: /product_images/Gaming/Playstation4 Software/5026555416986_s.jpg
Basically, everything in src="..."
At the moment I produce a list using re.findall(r'"([^"]*)"', line) and grab the appropriate one, but there's a lot of quotes in the full string and I'd like to be more efficient.
Can anyone help me put together an expression for this please?

Try with this
(?<=src=").+(?=" )

Use this as the regex:
src="(.+?)"
It returns the result you want:
re.findall('src="(.+?)"', text_to_search_from)
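As a rough sketch of how the two answers behave, the snippet below runs both patterns against a shortened version of the string from the question. The title and alt attributes at the end are made up, since the original string is truncated. It also swaps in a negated character class, [^"]*, which is an alternative to the lazy .+? and not something either answer proposed.

```python
import re

# Shortened sample; the trailing title and alt attributes are hypothetical.
line = (
    '"Grand Theft Auto V (5)" border="0" '
    'src="/product_images/Gaming/Playstation4 Software/5026555416986_s.jpg" '
    'title="Grand Theft Auto V" alt="cover"'
)

# Capture group with a negated class: stops at the first closing quote.
match = re.search(r'src="([^"]*)"', line)
src = match.group(1) if match else None

# Greedy look-around version from the first answer: .+ runs to the LAST '" ' on the line.
greedy_match = re.search(r'(?<=src=").+(?=" )', line)
greedy_src = greedy_match.group(0) if greedy_match else None

print(src)
print(greedy_src)
```

With more attributes after src, the greedy look-around version over-matches, so the capture-group form is the one to prefer.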

What is the regular expression in Jmeter for a dynamic string?

This is getting generated in a request output in Jmeter and I need to capture the dynamic value.
<update id="javax.faces.ViewState"><![CDATA[-8480553014738537212:-8925834053543623028]]></update>
The - (hyphen) symbol in the output is also dynamic.
I have tried handling this using
<update id="javax.faces.ViewState"><![CDATA[(.+?)]]></update>
but it does not work. Please suggest a fix.

The correct way to grab the data is by using the XPath Extractor with the following XPath:
//update[@id='javax.faces.ViewState']/text()
It gets the update tags that have id attribute with the javax.faces.ViewState value and extracts the text from these nodes.
Your regex is not correct because the [ characters (and the literal dots) must be escaped in a regular expression. It can be fixed as <update\s+id="javax\.faces\.ViewState"><!\[CDATA\[([^\]<]+)]]></update>.
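Outside JMeter, both approaches can be checked with a small Python sketch. The partial-response wrapper around the update element is an assumption made so that the XPath has something to search in. The pattern also escapes the closing ]] for readability, which is optional and differs slightly from the regex above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical response body; only the <update> element comes from the question.
response = (
    '<partial-response><changes>'
    '<update id="javax.faces.ViewState">'
    '<![CDATA[-8480553014738537212:-8925834053543623028]]>'
    '</update></changes></partial-response>'
)

# Regex approach: the [ characters and the dots are escaped.
pattern = re.compile(
    r'<update\s+id="javax\.faces\.ViewState"><!\[CDATA\[([^\]<]+)\]\]></update>'
)
match = pattern.search(response)
view_state_regex = match.group(1) if match else None

# Parser approach: ElementTree supports the [@attr='value'] predicate.
root = ET.fromstring(response)
node = root.find(".//update[@id='javax.faces.ViewState']")
view_state_xpath = node.text if node is not None else None

print(view_state_regex)
print(view_state_xpath)
```

Both variables should end up holding the same value, including the leading hyphen.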

RegEx to extract first XML element name with optional namespace prefix

I have to extract, with a regex, the first element name in the XML (ignoring an optional namespace prefix).
Here is the sample XML1:
<ns1:Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">
<foodType>
<vegtables>
<carrots>1</carrots>
</vegtables>
<foodType>
</ns1:Monkey>
And here is similar XML that is without namespace, XML2:
<Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">
<foodType>
<vegtables>
<carrots>1</carrots>
</vegtables>
<foodType>
</Monkey>
I need a regex that will return "Monkey" for either XML1 or XML2.
So far I have tried the regex <(\w+:)(\w+), which works for XML1, but I don't know how to make it work for XML2.

Since this seems to be a one-time job and you do not have access to an XML parser, you can use either of the following two regexes (they will only work for XML files like the samples you provided):
<(\w+:)?(\w+)(?=\s*xmlns="http://myurlisrighthereheremonkey\.com/monkeynamespace")
Or (if you check the whole single file contents with the regex):
^\s*<(\w+:)?(\w+)
There are two main changes:
(\w+:)? - adding the ? quantifier makes the first capturing group optional.
^\s* anchors the match at the beginning of the string (assuming there is no XML declaration there); alternatively, the look-ahead (?=\s*xmlns="http://myurlisrighthereheremonkey\.com/monkeynamespace") allows a match only when it is followed by optional whitespace and the literal xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace".
However, you really should consider switching to code that supports XML parsing. It will make life easier for you and for whoever maintains the code later.
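A minimal Python sketch of the second regex, assuming the whole file content is available as one string and starts with the root element. The helper name first_element_name and the shortened sample documents are made up for illustration.

```python
import re

# Optional "prefix:" in group 1, local element name in group 2.
FIRST_ELEMENT = re.compile(r'^\s*<(\w+:)?(\w+)')

def first_element_name(xml_text):
    """Return the local name of the first element, or None if nothing matches."""
    match = FIRST_ELEMENT.search(xml_text)
    return match.group(2) if match else None

# Shortened versions of XML1 and XML2 from the question.
xml1 = ('<ns1:Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">\n'
        '<carrots>1</carrots>\n</ns1:Monkey>')
xml2 = ('<Monkey xmlns="http://myurlisrighthereheremonkey.com/monkeynamespace">\n'
        '<carrots>1</carrots>\n</Monkey>')

print(first_element_name(xml1))
print(first_element_name(xml2))
```

When the prefix is absent, group 1 simply does not participate in the match and group 2 still holds the element name.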

Regex Assistance for a url filepath

Can someone assist in creating a Regex for the following situation:
I have about 2000 records in which I need to do a search/replace on a known item that looks like this in each record:
<li>View Product Information</li>
The FILEPATH and FILE are variable, but the surrounding HTML is always the same. Can someone assist with what kind of Regex I would substitute for the "FILEPATH/FILE" part of the search?

You can match the constant part and use grouping to put it back:
(<li>View Product Information</li>)
Then replace the string with $1your_replacement$2, where $1 is the first matching group and $2 the second (in Python, for instance, you would call Match.group(1) and Match.group(2)).
You would have to escape \ chars if you're using Java instead.

Regexp for finding tags without nested tags

I'm trying to write a regexp that will help find non-translated text in HTML code.
Translated text is text that goes through a special tag (fmt:message) or through the ${...} construct.
Ex. non-translated:
<h1>Hello</h1>
Translated texts are:
<h1><fmt:message key="hello" /></h1>
<button>${expression}</button>
I've written the following expression:
\<(\w+[^>])(?:.*)\>([^\s]+?)\</\1\>
It finds correct strings like:
<p>text</p>
Correctly skips
<a><fmt:message key="common.delete" /></a>
But also catches:
<li><p><fmt:message key="common.delete" /></p></li>
I also can't figure out how to add an exception for ${...} strings to this expression.
Can anybody help me?

If I understand you properly, you want to ensure the data inside the "tag" doesn't contain fmt:message or ${...}.
You might be able to use a negative lookahead in conjunction with a . to assert that the characters matched by the . do not start one of those cases:
/<(\w+)[^>]*>(?:(?!<fmt:message|\$\{|<\/\1>).)*<\/\1>/i
If you want to avoid capturing any "tags" inside the tag, you can drop the <fmt:message portion and use [^<] instead of a . to match only characters other than <:
/<(\w+)[^>]*>(?:(?!\$\{)[^<])*<\/\1>/i
Added from a comment: if you also want to exclude "empty" tags, add another negative lookahead, this time (?!\s*<), to ensure that the content of the tag is not empty and does not consist only of whitespace:
/<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*<\/\1>/i
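The last pattern can be tried against the samples from the question with a short Python sketch. The list name samples and the variable flagged are made up for illustration, and the JavaScript-style /.../i delimiters are replaced by re.IGNORECASE.

```python
import re

# Tag with text content that contains neither a nested tag nor a ${...} expression.
UNTRANSLATED = re.compile(
    r'<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*</\1>',
    re.IGNORECASE,
)

# Samples taken from the question.
samples = [
    '<h1>Hello</h1>',
    '<h1><fmt:message key="hello" /></h1>',
    '<button>${expression}</button>',
    '<p>text</p>',
    '<a><fmt:message key="common.delete" /></a>',
    '<li><p><fmt:message key="common.delete" /></p></li>',
]

# Keep only the lines that the pattern reports as non-translated.
flagged = [s for s in samples if UNTRANSLATED.search(s)]
print(flagged)
```

Only the two plain-text samples are flagged; the nested fmt:message cases and the ${...} case are skipped.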

If the format is as simple as in your examples, you can try this:
<(\w+)>(?:(?!<fmt:message).)+</\1>

Rewritten into a more formal question:
Can you match
aba
but not
aca
without catching
abcba ?
Yes.
FSM:
Start->A->B->A->Terminate
Insert abcba and run it
Start is ready for input.
a -> MATCH, transition to A
b -> MATCH, transition to B
c -> FAIL, return fail.
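The state machine above can be sketched in a few lines of Python. The function name matches_aba is made up, and the state is simply the index of the next expected character, so index 0 is Start and index 3 is Terminate.

```python
def matches_aba(text):
    """Accept exactly the string 'aba' by walking Start -> A -> B -> A -> Terminate."""
    expected = "aba"
    state = 0  # number of characters matched so far
    for ch in text:
        # Fail on extra input after Terminate or on an unexpected character.
        if state >= len(expected) or ch != expected[state]:
            return False
        state += 1
    # Accept only if the machine reached the Terminate state.
    return state == len(expected)

print(matches_aba("aba"))
print(matches_aba("aca"))
print(matches_aba("abcba"))
```

For abcba the loop returns at the third character, which is the c -> FAIL step in the trace.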

I've used a simple one like this with success:
<([^>]+)[^>]*>([^<]*)</\1>
Of course, if there is any CDATA containing '<' in the content, it's not going to work so well, but it should do fine for simple XML.
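A quick way to see what this pattern captures is to run it through re.findall in Python. The sample fragments below are made up for illustration.

```python
import re

# Group 1: tag name (found via backtracking when attributes are present).
# Group 2: text content without any nested tags.
SIMPLE_ELEMENT = re.compile(r'<([^>]+)[^>]*>([^<]*)</\1>')

# Hypothetical flat XML fragment.
pairs = SIMPLE_ELEMENT.findall('<name>Ann</name><age>30</age>')

# An element with an attribute still matches, because group 1 can shrink to the bare name.
with_attr = SIMPLE_ELEMENT.findall('<a href="x">link</a>')

print(pairs)
print(with_attr)
```

Elements whose content contains another tag are not matched as a whole, which is the limitation mentioned above.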

also see
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
for a discussion of using regex to parse html
executive summary: don't
