I have a String like this:
val rawData = "askljdld<a>content to extract</a>lkdsjkdj<a>more content to extract</a>sdkdljk"
and I want to extract the content between the <a> tags.
I've tried this, but the end of the regex doesn't work as I expected:
val regex = "<a>(.*)</a>".r
for(m <- regex.findAllIn(rawData)){
println(m)
}
the output is:
<a>content to extract</a>lkdsjkdj<a>more content to extract</a>
I understand what's happening: the regex finds the first <a> and the last </a>.
How can I get an iterator with two separate entries?
<a>content to extract</a>
<a>more content to extract</a>
thanks in advance
It's simple: use "<a>(.*?)</a>"
.*? is a lazy (non-greedy) quantifier: it matches as few characters as possible, so it stops at the first </a> instead of the last.
Your regex is greedy, so .* runs all the way to the last </a>. Use <a>(.*?)</a> instead:
val rawData = "askljdld<a>content to extract</a>lkdsjkdj<a>more content to extract</a>sdkdljk"
val regex = "<a>(.*?)</a>".r
regex.findAllIn(rawData).foreach(println)
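For comparison, here is the same greedy-versus-lazy difference as a small Python sketch. The variable names are mine, not from the question, and re.findall returns only the captured group rather than the whole match.

```python
import re

raw_data = "askljdld<a>content to extract</a>lkdsjkdj<a>more content to extract</a>sdkdljk"

# Greedy: .* runs to the last </a>, so there is a single, overlong match.
greedy = re.findall(r"<a>(.*)</a>", raw_data)

# Lazy: .*? stops at the first </a> it can reach, giving one match per tag pair.
lazy = re.findall(r"<a>(.*?)</a>", raw_data)

print(greedy)  # one entry spanning both tag pairs
print(lazy)    # two entries, one per tag pair
```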
I would like to match a string using a regex in Python that contains a specific string (a lazy match), but I haven't figured out how to do so.
For instance, in the following example, how do I return just '<tag1>some text<tag2>some other text</tag2></tag1>'
and not the whole string?
#!/bin/python3
import re
pattern = r'(<([a-zA-Z0-9]+?)\b[^>]*>.*?<tag2>some other text</tag2>.*?</\2>)'
text = '<root> <tag1>some text<tag2>some other text</tag2></tag1> </root>'
print(re.search(pattern, text, re.DOTALL).groups(0))
The code above prints <root> <tag1>some text<tag2>some other text</tag2></tag1> </root> when I want it to print <tag1>some text<tag2>some other text</tag2></tag1>
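Of course, this all assumes that any tag can appear in place of tag1.
It turns out the solution is quite simple. Here's the regex that works (the leading greedy .* pushes the start of the captured group as far right as it can go; read the result from group 1):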
.*(<([a-zA-Z0-9]+?)\b[^>]*>.*?<tag2>some other text</tag2>.*?</\2>).*
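A short Python sketch of why the extra .* helps, using the text from the question. The names inner, without_prefix and with_prefix are just illustrative. Without the prefix, the search starts at the leftmost tag (<root>); with it, the greedy .* swallows as much as possible first, so the group starts at the last tag that still satisfies the rest of the pattern.

```python
import re

text = '<root> <tag1>some text<tag2>some other text</tag2></tag1> </root>'
inner = r'(<([a-zA-Z0-9]+?)\b[^>]*>.*?<tag2>some other text</tag2>.*?</\2>)'

# Leftmost match wins, so the group starts at <root>.
without_prefix = re.search(inner, text, re.DOTALL).group(1)

# The leading greedy .* moves the start of group 1 as far right as it can.
with_prefix = re.search(r'.*' + inner + r'.*', text, re.DOTALL).group(1)

print(without_prefix)
print(with_prefix)
```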
Here is some text:
<a class="mkapp-btn mab-download" href="javascript:void(0);" onclick="zhytools.downloadApp('C100306099', 'appdetail_dl', '24', 'http://appdlc.hicloud.com/dl/appdl/application/apk/f4/f44d320c2c1b466389e6f6b3d3f5cff4/com.uniquestudio.android.iemoji.1806141014.apk?sign=portal#portal1531621480529&source=portalsite' , 'v1.1.4');">
I want to extract
http://appdlc.hicloud.com/dl/appdl/application/apk/f4/f44d320c2c1b466389e6f6b3d3f5cff4/com.uniquestudio.android.iemoji.1806141014.apk?sign=portal#portal1531621480529&source=portalsite
I use the code below to extract it.
m = re.search("mkapp-btn mab-download.*'http://[^']'", apk_page)
I thought I could use .* to match the text between mkapp-btn mab-download and http, but it fails.
EDIT
I also tried:
m = re.search("(?<=mkapp-btn mab-download.*)http://[^']'", apk_page)
You need to add + after the negated character class ([^']) because the URL is more than one character long. You also need to group with parentheses to extract only the part you want.
m = re.search("mkapp-btn mab-download.*'(http[^']+)'", apk_page)
m.groups()
And the output will be
('http://appdlc.hicloud.com/dl/appdl/application/apk/f4/f44d320c2c1b466389e6f6b3d3f5cff4/com.uniquestudio.android.iemoji.1806141014.apk?sign=portal#portal1531621480529&source=portalsite',)
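Here is a self-contained sketch of the same fix. The apk_page string is a shortened stand-in for the real page, and the example.com URL is a made-up placeholder. It also shows that the original pattern finds nothing, because [^'] without + matches exactly one character.

```python
import re

# Shortened stand-in for the page source; the URL is a made-up placeholder.
apk_page = (
    '<a class="mkapp-btn mab-download" href="javascript:void(0);" '
    "onclick=\"zhytools.downloadApp('C100306099', 'appdetail_dl', '24', "
    "'http://example.com/dl/app.apk?sign=portal&source=portalsite' , 'v1.1.4');\">"
)

# Original attempt: [^'] allows only ONE character between http:// and the quote.
broken = re.search(r"mkapp-btn mab-download.*'http://[^']'", apk_page)

# Fixed: [^']+ takes everything up to the closing quote; the group isolates the URL.
m = re.search(r"mkapp-btn mab-download.*'(http[^']+)'", apk_page)
url = m.group(1) if m else None

print(broken)
print(url)
```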
I have to extract some substrings from a plain-text document that contains XML-like markup, such as
lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf
Can I extract this pattern in a single command?
For a case like this, I tried using a Matcher and its group method to extract the single match.
I don't want to do something like
String pattern = /<AAA>(.*)<\/AAA>/;
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher("lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf");
if (m.find( )) {
System.out.println("Found value: " + m.group(0) );
}
There must be a more elegant way.
Edit:
Thank you tim_yates, I was looking for something like that.
Could you explain a little why you use [0][1] on the result of
def extract = (input =~ '<AAA>(.+?)</AAA>')[0][1]
Answer by tim_yates:
=~ returns a Matcher, so [0] gets the first match, which is a list of 2 elements: the first is the full text that matched the pattern (here <AAA>myString</AAA>), and the second, [1], is the group you defined in your expression.
Thank you so much for your help, and thanks to all the readers.
Power of a community !!!
Can't you just do:
def input = 'lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf'
def extract = (input =~ '<AAA>(.+?)</AAA>')[0][1]
assert extract == 'myString'
This is the shortest (not the best) way I can think of without external libs:
String str = "lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf";
System.out.println(str.substring(str.indexOf(">") + 1, str.lastIndexOf("<")));
Or using StringUtils from Apache Commons Lang (which is a million times better than my previous suggestion with substring):
StringUtils.substringBetween(str, "<AAA>", "</AAA>");
Still, of all these I'd go with matcher() as you proposed.
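The difference between the whole match and the captured group (what [0][1] picks out in the Groovy one-liner) can be sketched in Python like this. It is only an analogy, not Groovy's API: group(0) is the full matched text and group(1) is the first parenthesised group.

```python
import re

text = "lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf"

m = re.search(r"<AAA>(.+?)</AAA>", text)

whole_match = m.group(0)  # the text matched by the entire pattern, tags included
extract = m.group(1)      # only what the parentheses captured

print(whole_match)
print(extract)
```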
I have the following tag from an XML file:
<msg><![CDATA[Method=GET URL=http://test.de:80/cn?OP=gtm&Reset=1(Clat=[400441379], Clon=[-1335259914], Decoding_Feat=[], Dlat=[0], Dlon=[0], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[7363], ntCoent-Length=[15783], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:270/CSI:-/Me:1/Total:271]]>
Now I'm trying to extract Clon, Dlat, Dlon and Clat from this message.
So far I have created the following regex:
(?<=Clat=)[\[\(\d+\)\n\n][^)n]+]
The problem is that I would like to get only the numbers, without the brackets. I have tried some other expressions without success.
Do you know how I can change this expression to get only the values without the brackets?
Thank you very much in advance.
Best regards
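The regex
(clon|dlat|dlon|clat)=\[(-?\d+)\]
gives you the field name in group 1 and the number, without the brackets, in group 2. Match case-insensitively, since the names in the message are capitalized.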
As I stated before, if you use this regex to extract the information out of this CDATA element, that's okay. But you really want to get to the contents of that element using an XML parser.
Example usage
// requires: using System.Text.RegularExpressions;
Regex r = new Regex(@"(clon|dlat|dlon|clat)=\[(-?\d+)\]", RegexOptions.IgnoreCase);
string s = ".. here's your cdata content .. ";
foreach (Match match in r.Matches(s))
{
    var name = match.Groups[1].Value; // will contain "Clon", "Dlat", "Dlon" or "Clat", cased as in the input
    var inner_value = match.Groups[2].Value; // will contain the value inside the square brackets, e.g. "400441379"
    // Do something with the matches
}
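The same pattern can be tried quickly in Python. This is a minimal sketch on an excerpt of the message from the question; the names msg, pairs and coords are mine. re.findall returns one (name, value) tuple per match because the pattern has two groups.

```python
import re

# Excerpt of the CDATA content from the question.
msg = ("Method=GET URL=http://test.de:80/cn?OP=gtm&Reset=1(Clat=[400441379], "
       "Clon=[-1335259914], Decoding_Feat=[], Dlat=[0], Dlon=[0], "
       "Accept-Encoding=gzip, Accept=*/*)")

# Group 1 is the field name, group 2 the (possibly negative) integer.
pairs = re.findall(r"(clon|dlat|dlon|clat)=\[(-?\d+)\]", msg, re.IGNORECASE)

# Names keep the casing they have in the input; values are converted to int.
coords = {name: int(value) for name, value in pairs}

print(coords)
```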
I have some output from which I'd like to fetch the value of the CMEngine node, i.e. everything inside the CMEngine node. Please help me with a regex. I already have Java code in place that uses the regex, so I just need the regex itself. Thanks
My XML
<General>
<LanguageID>en_US</LanguageID>
<CMEngine>
<CMServer/> <!-- starting here -->
<DaysToKeepHistory>4</DaysToKeepHistory>
<PreprocessorMaxBuf>5000000</PreprocessorMaxBuf>
<ServiceRefreshInterval>30</ServiceRefreshInterval>
<ReuseMemoryBetweenRequests>true</ReuseMemoryBetweenRequests>
<Trace Enabled="false">
<ActiveCategories>
<Category>ENVIRONMENT</Category>
<Category>EXEC</Category>
<Category>EXTERNALS</Category>
<Category>FILESYSTEM</Category>
<Category>INPUT_DOC</Category>
<Category>INTERFACES</Category>
<Category>NETWORKING</Category>
<Category>OUTPUT_DOC</Category>
<Category>PREPROCESSOR_INPUT</Category>
<Category>REQUEST</Category>
<Category>SYSTEMRESOURCES</Category>
<Category>VIEWIO</Category>
</ActiveCategories>
<SeverityLevel>ERROR</SeverityLevel>
<MessageInfo>
<ProcessAndThreadIds>true</ProcessAndThreadIds>
<TimeStamp>true</TimeStamp>
</MessageInfo>
<TraceFile>
<FileName>CMEngine_log.txt</FileName>
<MaxFileSize>1000000</MaxFileSize>
<RecyclingMethod>Restart</RecyclingMethod>
</TraceFile>
</Trace>
<JVMLocation>C:\Informatica\9.1.0\java\jre\bin\server</JVMLocation>
<JVMInitParamList/> <!-- Ending here -->
</CMEngine>
</General>
If it has to be a regex, and if there is only one CMEngine tag per string:
Pattern regex = Pattern.compile("(?<=<CMEngine>)(?:(?!</CMEngine>).)*", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
String resultString = null; // stays null if no <CMEngine> element is found
if (regexMatcher.find()) {
    resultString = regexMatcher.group();
}
Since that output appears to be machine-generated and is unlikely to contain comments or other stuff that might confuse the regex, this should work quite reliably.
It starts at a position right after a <CMEngine> tag: (?<=<CMEngine>) and matches all characters until the next </CMEngine> tag: (?:(?!</CMEngine>).)*.
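To see the two parts of that pattern at work, here is a minimal Python sketch on a cut-down version of the XML (the xml string is a shortened sample, not the full file). The lookbehind fixes the start just after <CMEngine>, and the negative lookahead is checked before every character, so the match cannot run past </CMEngine>.

```python
import re

# Shortened sample of the XML from the question.
xml = """<General>
<LanguageID>en_US</LanguageID>
<CMEngine>
<CMServer/>
<DaysToKeepHistory>4</DaysToKeepHistory>
</CMEngine>
</General>"""

# re.DOTALL lets . match newlines, like Pattern.DOTALL in the Java version.
pattern = re.compile(r"(?<=<CMEngine>)(?:(?!</CMEngine>).)*", re.DOTALL)

m = pattern.search(xml)
inner = m.group() if m else None

print(inner)
```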