Perl, replace multiple matches in string - regex

So, I'm parsing an XML file and have run into a problem. The XML has objects containing a script, which looks roughly like this:
return [
['measurement' : org.apache.commons.io.FileUtils.readFileToByteArray(new File('tab_2_1.png')),
'kpi' : org.apache.commons.io.FileUtils.readFileToByteArray(new File('tab_2_2.png'))]]
I need to replace all the filenames while keeping the file extension, and I need to replace every occurrence of the pattern, because a string can also look like this:
['measurement' : org.apache.commons.io.FileUtils.readFileToByteArray(new File('tab_2_1.png'))('tab_2_1.png'))('tab_2_1.png')),
and I still need to replace every image name before .png.
I used this regexp: .*\(\'(.*)\.png\'\),
but it catches only the last match on each line (before the \n), not every match in the whole string.
Can you help me correct this regexp?

The problem is that .* is greedy: it matches everything it can. So .*x matches everything up to the very last x in the string, even if the matched text itself contains x's. You need the non-greedy version:
s/\('(.*?)\.png/('$replacement.png/g;
where the ? makes .* match only up to the first .png. The \(' is needed to anchor the pattern to the start of the filename. This correctly replaces the filenames in the shown examples.
Another way to do this is \('([^.]*)\.png, where [^.] is a negated character class, matching anything that is not a dot. With the * quantifier it again matches everything up to the first .png.
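To see the difference between the greedy and non-greedy patterns, here is a minimal sketch in Python rather than Perl (Python's re module behaves the same way for these three patterns). The replacement name new_name is a made-up value for illustration only.

```python
import re

# One line from the question, with the filename repeated three times.
line = ("['measurement' : org.apache.commons.io.FileUtils.readFileToByteArray"
        "(new File('tab_2_1.png'))('tab_2_1.png'))('tab_2_1.png')),")
replacement = "new_name"  # hypothetical replacement, not from the question

greedy = re.findall(r"\('(.*)\.png", line)      # one long match up to the last .png
lazy = re.findall(r"\('(.*?)\.png", line)       # one match per filename
negated = re.findall(r"\('([^.]*)\.png", line)  # same result via the negated class

# Rough equivalent of s/\('(.*?)\.png/('$replacement.png/g
replaced = re.sub(r"\('(.*?)\.png", lambda m: "('" + replacement + ".png", line)

print(lazy)
print(replaced)
```

The greedy pattern finds a single match that swallows the first two filenames, while the other two patterns find each filename separately.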
The question doesn't say how exactly you are "parsing an XML", but I dearly hope that it is with a library like XML::LibXML or XML::Twig. Please do not attempt that with regex. The tool is just not adequate for the job, and you will find that out the hard way. A lot has been written about this over the years; search SO.
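As a rough illustration of that division of labour (a parser for the XML, a regex only for the script text inside it), here is a sketch using Python's standard xml.etree.ElementTree in place of XML::LibXML. The element names objects, object and script are assumptions, since the question does not show the real document structure, and new_name is a made-up replacement.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document shape; the real element names are not given in the question.
xml_text = """<objects>
  <object><script>return [['measurement' : readFileToByteArray(new File('tab_2_1.png'))]]</script></object>
</objects>"""

root = ET.fromstring(xml_text)
for script in root.iter("script"):
    # The parser hands over plain text; the regex only ever touches that text.
    script.text = re.sub(r"\('(.*?)\.png", "('new_name.png", script.text)

result = ET.tostring(root, encoding="unicode")
print(result)
```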

Related

Regular Expressions - Select the Second Match

I have a txt file with <i> and </i> between words that I would like to remove using Editpad
For example, I'd like to keep when it's like this:
<i>Phrases and words.</i>
And I'd like to remove the </i> and <i> tags inside the phrase, when it's like this:
<i>Phrases</i>and<i> words.</i>
<i>Phrases</i>and <i>words.</i>
I was trying to do that using regex, but I couldn't do it.
As the tag is followed by a space or a word character, I could find the lines that have the double tag with
/ <i>|<\/i> /
but this way I can't simply replace everything with nothing; I have to edit each line I find one by one.
Is there any way to accomplish that?
* Edited *
Another example of lines found on the subtitle text
<i>- find me on the chamber.</i>
- What? <i>Go. Go, go, go!</i>
Rule number one: you can't parse HTML with regex.
That being said, if you know each line follows a certain pattern, you can usually hack something together to work. ;)
If I've understood correctly, it looks like you can simply remove all <i> and </i> that aren't either at the beginning or end of the lines. In that case, one method you could try is the following regex:
(?<=.)\<\/?i\>(?=.)
This will match the tags, with a lookahead and a lookbehind to make sure that we aren't at the end or start of a line (by checking that another character exists after and before the tag). (Note that characters matched inside a lookahead or lookbehind are not part of the match, so they won't be replaced when you search/replace.)
Disclaimer: this works on regex101, but Notepad++ may have some differences from the PCRE regex style.
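A quick way to check the lookaround pattern against the sample lines is a small script. This is a sketch in Python, whose re module supports the same lookahead and lookbehind syntax; the function name strip_inner_tags is made up for the example.

```python
import re

# Tag with at least one character before it and one after it on the same line.
INNER_TAG = re.compile(r"(?<=.)</?i>(?=.)")

def strip_inner_tags(line):
    """Remove <i> and </i> tags that are not at the very start or end of the line."""
    return INNER_TAG.sub("", line)

samples = [
    "<i>Phrases and words.</i>",
    "<i>Phrases</i>and<i> words.</i>",
    "<i>Phrases</i>and <i>words.</i>",
    "- What? <i>Go. Go, go, go!</i>",
]
for sample in samples:
    print(strip_inner_tags(sample))
```

Note that on the last sample (from the edited question) the opening tag is not at the start of the line, so it is removed while the closing tag stays. The pattern only fits lines where the italic span covers the whole line.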
update to work with Editpad
EDIT: since this question actually asks how to do this in Editpad, below is a modified alternative:
Try searching for the regex: (.)\<\/?i\>(.). This will match (and capture) exactly one character before and after the <i> tags.
When replacing, use backreferences to replace the entire match with the two captured characters - a replacement string of \1\2 should work.
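The capture-and-backreference variant can be tried the same way. A minimal Python sketch (Python writes the backreferences as \1\2 too; the function name is made up):

```python
import re

def strip_inner_tags_backref(line):
    """Replace 'X<tag>Y' with 'XY', where X and Y are the captured neighbours."""
    return re.sub(r"(.)</?i>(.)", r"\1\2", line)

print(strip_inner_tags_backref("<i>Phrases</i>and<i> words.</i>"))
print(strip_inner_tags_backref("x</i><i>y"))
```

Because the neighbouring characters are consumed by the match, two tags that sit directly next to each other are not both removed in one pass: the second example leaves the opening tag in place. Running the replacement a second time handles that case.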

Perl regex to match only if not followed by both patterns

I am trying to write a pattern match to only match when a string is not followed by both following patterns. Right now I have a pattern that I've tried to manipulate but I can't seem to get it to match correctly.
Current pattern:
/(address|alias|parents|members|notes|host|name)(?!(\t{5}|\S+))/
I am trying to match when a string is not spaced correctly but not if it is part of a larger word.
For example I want it to match,
host \t{4} something
but not,
hostgroup \t{5} something
In the above example it matches hostgroup and ends up splitting it into two separate words, "host" and "group".
Match:
notes \t{4} something
but not,
notes_url \t{5} something
Using my pattern it ends up turning into:
notes \t{5} _url
Hopefully that makes a bit more sense.
I'm not at all clear what you want, but word boundaries will probably do what you ask.
Does this work for you?
/\b(address|alias|parents|members|notes|host|name)\b(?!\t{5})/
Update
Having understood your problem better, does this do what you want?
/\b(address|alias|parents|members|notes|host|name)\b(?!\t{5}(?!\t))/
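To check the updated pattern against the cases from the question, here is a small sketch in Python (the regex syntax used is the same as in Perl). It assumes the keyword is followed directly by the tabs, with no extra spaces, and the helper name badly_spaced is made up.

```python
import re

KEYWORD = re.compile(
    r"\b(address|alias|parents|members|notes|host|name)\b(?!\t{5}(?!\t))"
)

def badly_spaced(line):
    """True if a whole keyword is followed by anything other than exactly five tabs."""
    return KEYWORD.search(line) is not None

print(badly_spaced("host" + "\t" * 4 + "something"))       # four tabs
print(badly_spaced("host" + "\t" * 5 + "something"))       # exactly five tabs
print(badly_spaced("hostgroup" + "\t" * 5 + "something"))  # part of a larger word
```

The \b after the keyword is what keeps hostgroup and notes_url from matching, since g and _ are both word characters.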

parsing url for specific param value

I'm looking to use a regular expression to parse a URL, extracting a specific section of it and returning nothing if the pattern is not found.
A url example is
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5#c452fds-634d-f424fds-cdsa&bf_action=jildape
I wish to get the bolded text in it (the value after id=, up to the #).
Currently I'm using the regex "d=([^#]*)", but the problem is that I'm also running across URLs of this pattern:
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape
and the regex still returns a match for it.
I would prefer that it not match this URL at all, because it doesn't contain the #.
Regexes are not a magic tool that you should always use just because the problem involves a string. In this case, your language probably has a tool to break apart URLs for you. In PHP, this is parse_url(). In Perl, it's the URI::URL module.
You should almost always prefer an existing, well-tested solution to a common problem like this rather than writing your own.
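As a sketch of what the library route could look like, here is a version using Python's standard urllib.parse (standing in for parse_url() or the Perl module). It rests on two assumptions taken from the example URLs: the wanted id sits inside the nested URL carried by the jklfe parameter, and "contains a #" can be tested as "has a non-empty fragment". The function name is made up.

```python
from urllib.parse import urlsplit, parse_qs

def id_before_fragment(url, outer_param="jklfe"):
    """Return the nested id value, or None when the URL has no '#' part."""
    outer = urlsplit(url)
    if not outer.fragment:  # no fragment, so report nothing
        return None
    nested = parse_qs(outer.query).get(outer_param)
    if not nested:
        return None
    ids = parse_qs(urlsplit(nested[0]).query).get("id")
    return ids[0] if ids else None

with_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels"
             "/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5"
             "#c452fds-634d-f424fds-cdsa&bf_action=jildape")
without_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels"
                "/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape")

print(id_before_fragment(with_hash))
print(id_before_fragment(without_hash))
```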
So you want to match the value of the id parameter, but only if it has a trailing section containing a '#' symbol (without matching the '#' or what's after it)?
Not knowing the specifics of what style of regexes you're using, how about something like:
id=([^#&]*)#
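A short check of this pattern against the two URLs from the question, sketched in Python (the helper name extract_id is made up):

```python
import re

ID_BEFORE_HASH = re.compile(r"id=([^#&]*)#")

def extract_id(url):
    """Return the id value only when it is terminated by '#', else None."""
    match = ID_BEFORE_HASH.search(url)
    return match.group(1) if match else None

with_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels"
             "/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5"
             "#c452fds-634d-f424fds-cdsa&bf_action=jildape")
without_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels"
                "/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape")

print(extract_id(with_hash))
print(extract_id(without_hash))
```

Excluding & from the character class is what stops the second URL from matching: the value ends at & and the required # never follows.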
regex = "id=([\\w-]+?)#"
This will grab everything in the character class [a-zA-Z_0-9-] between 'id=' and '#', assuming everything between 'id=' and '#' is in that character class (i.e. if an '&' is in there, the regex will fail).
id=
-Self-explanatory: this looks for the exact match of 'id='.
([\\w-]+?)
-The parentheses capture the whole value. Inside them, [\\w-] defines a character class. The \\w is an escaped \w, as needed in a Java string literal. '\w' is a predefined character class equal to [a-zA-Z_0-9]. I added '-' to this class because of the assumed pattern from your examples.
+?
-This is a reluctant quantifier that looks for the shortest possible match. It sits inside the group so that the group captures every matched character rather than only the last one.
#
-The end of the regex, the last character we are looking for to match the pattern.
If you are looking to grab every character between 'id=' and the first '#' following it, the following will work and it uses the same logic as above, but replaces the character class [\\w-] with ., which matches anything.
regex = "id=(.+?)#"

How to match a string that does not end in a certain substring?

How can I write a regular expression that matches names that do not have some string at the end?
In my project, all classes whose names don't end with certain strings, such as "controller" and "map", should inherit from a base class. How can I do this using a regular expression?
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Do a search for all filenames matching this:
(?<!controller|map|anythingelse)$
(Remove the |anythingelse if no other keywords, or append other keywords similarly.)
If you can't use negative lookbehinds (the (?<!..) bit), do a search for filenames that do not match this:
(?:controller|map)$
And if that still doesn't work (might not in some IDEs), remove the ?: part and it probably will - that just makes it a non-capturing group, but the difference here is fairly insignificant.
If you're using something where the full string must match, then you can just prefix either of the above with ^.* to do that.
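Both approaches (the negative lookbehind, and the inverted plain match) can be compared on a few names. This is a sketch in Python with made-up class names. One engine-specific detail: Python's re module rejects a lookbehind whose alternatives differ in length, so the sketch chains two separate lookbehinds instead of writing (?<!controller|map).

```python
import re

# Python needs fixed-width lookbehinds, hence two of them rather than one alternation.
NOT_SUFFIX = re.compile(r"(?<!controller)(?<!map)$")
SUFFIX = re.compile(r"(?:controller|map)$")

names = ["usercontroller", "sitemap", "orderservice"]  # hypothetical names

by_lookbehind = [n for n in names if NOT_SUFFIX.search(n)]
by_negation = [n for n in names if not SUFFIX.search(n)]

print(by_lookbehind)
print(by_negation)
```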
Update:
In response to this:
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Not quite sure what you're attempting with the public/class stuff there, so try this:
public.*class.*(?<!controller|map)$
The . is a regex char that means "anything except newline", and the * means zero or more times.
If this isn't what you're after, edit the question with more details.
Depending on your regex implementation, you might be able to use a lookbehind for this task. This would look like
(?<!SomeText)$
This matches any lines NOT having "SomeText" at their end. If you cannot use that, the expression
^(?!.*SomeText$).*$
matches any line not ending with "SomeText" as well.
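The lookahead form works in engines without lookbehind support. A minimal Python sketch with made-up sample lines:

```python
import re

NOT_ENDING = re.compile(r"^(?!.*SomeText$).*$")

lines = ["keep this", "drop SomeText", "SomeText in the middle is fine"]
kept = [line for line in lines if NOT_ENDING.match(line)]

print(kept)
```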
You could write a regex that contains two groups: one consists of one or more characters before controller or map, the other contains controller or map and is optional. The first group has to be lazy, otherwise it swallows the suffix and the second group is always empty.
^(.+?)(controller|map)?$
With that you can match your string, and if the regex API you use has a group() method, an empty group(2) means the string does not end with controller or map.
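A sketch of the two-group idea in Python, with made-up names. It uses a lazy .+? for the first group, because a greedy .+ would consume the suffix itself and leave group 2 unset for every input. In Python an optional group that did not take part in the match is reported as None.

```python
import re

TWO_GROUPS = re.compile(r"^(.+?)(controller|map)?$")

def suffix_of(name):
    """Return 'controller' or 'map' if the name ends with it, else None."""
    match = TWO_GROUPS.match(name)
    return match.group(2) if match else None

for name in ["usercontroller", "sitemap", "orderservice"]:
    print(name, suffix_of(name))
```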
Check if the name does not match [a-zA-Z]*controller or [a-zA-Z]*map.
Finally, I did it this way:
public.*class.*[^(controller|map|spec)]$
It worked.

Regex for all strings not containing a string? [duplicate]

Ok, so this is something completely stupid, but it is something I simply never learned to do, and it's a hassle.
How do I specify a string that does not contain a sequence of other characters? For example, I want to match all lines that do NOT end in '.config'.
I would think that I could just do
.*[^(\.config)]$
but this doesn't work (why not?)
I know I can do
.*[^\.][^c][^o][^n][^f][^i][^g]$
but please please please tell me that there is a better way
You can use negative lookbehind, e.g.:
.*(?<!\.config)$
This matches all strings except those that end with ".config"
Your question contains two questions, so here are a few answers.
Match lines that don't contain a certain string (say .config) at all:
^(?:(?!\.config).)*$\r?\n?
Match lines that don't end in a certain string:
^.*(?<!\.config)$\r?\n?
and, as a bonus: Match lines that don't start with a certain string:
^(?!\.config).*$\r?\n?
(each time including newline characters, if present.)
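The three patterns can be compared side by side on a handful of lines. This is a sketch in Python with made-up file names; the trailing \r?\n? is left out because the lines are already split.

```python
import re

lines = ["app.config", "config.txt", ".config.bak", "readme.md"]

def matching(pattern):
    """Return the lines that the pattern matches from start to end."""
    return [line for line in lines if re.match(pattern, line)]

not_containing = matching(r"^(?:(?!\.config).)*$")
not_ending = matching(r"^.*(?<!\.config)$")
not_starting = matching(r"^(?!\.config).*$")

print(not_containing)
print(not_ending)
print(not_starting)
```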
Oh, and to answer why your version doesn't work: [^abc] means "any one (1) character except a, b, or c". Your other solution would also fail on test.hg, because it also ends in the letter g: your regex looks at each character individually instead of the entire .config string. That's why you need lookaround to handle this.
(?<!\.config)$
:)
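The test.hg case is easy to reproduce. A small Python sketch comparing the character-class attempt from the question with the lookbehind version (file names are illustrative):

```python
import re

char_class = re.compile(r".*[^(\.config)]$")  # the attempt from the question
lookbehind = re.compile(r".*(?<!\.config)$")

names = ["app.config", "test.hg", "readme.txt"]

by_class = [n for n in names if char_class.match(n)]
by_lookbehind = [n for n in names if lookbehind.match(n)]

print(by_class)       # test.hg is rejected because it ends in g
print(by_lookbehind)  # only app.config is rejected
```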
By using the [^] construct, you have created a negated character class, which matches any single character except those you have named. The order of the characters in the class does not matter, so this will fail on any string that ends in any of the characters (\.config) (or, equivalently, )gi.\onc().
Use a negative lookahead (with Perl-style regexes) like so: ^(?!.*\.config$). This will match all strings that do not end with the literal ".config".
Unless you are "grepping" ... since you are not using the result of a match, why not search for the strings that do end in .config and skip them? In Python:
import re
isConfig = re.compile(r'\.config$')
# List lst is given
# search() is needed here: match() only matches at the start of the string
filteredList = [f.strip() for f in lst if not isConfig.search(f.strip())]
I suspect that this will run faster than a more complex re.
As you have asked for a "better way": I would try a "filtering" approach. I think it is quite easy to read and to understand:
#!/usr/bin/perl
while(<>) {
next if /\.config$/; # ignore the line if it ends with ".config"
print;
}
As you can see I have used perl code as an example. But I think you get the idea?
added:
This approach can also chain several filter patterns while remaining readable and easy to understand:
next if /\.config$/; # ignore the line if it ends with ".config"
next if /\.ini$/; # ignore the line if it ends with ".ini"
next if /\.reg$/; # ignore the line if it ends with ".reg"
# now we have filtered out all the lines we want to skip
... process only the lines we want to use ...
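The same chained-filter idea, sketched in Python for comparison. The list of patterns mirrors the three next if lines above; the function name and sample names are made up.

```python
import re

# One compiled pattern per suffix to skip, mirroring the Perl "next if" lines.
SKIP_PATTERNS = [re.compile(p) for p in (r"\.config$", r"\.ini$", r"\.reg$")]

def wanted(lines):
    """Yield only the lines that match none of the skip patterns."""
    for line in lines:
        if any(p.search(line) for p in SKIP_PATTERNS):
            continue
        yield line

names = ["web.config", "setup.ini", "keys.reg", "notes.txt", "config.md"]
kept = list(wanted(names))
print(kept)
```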
I used Regexpal before finding this page and came up with the following solution when I wanted to check that a string doesn't contain a file extension:
^(.(?!\.[a-zA-Z0-9]{3,}))*$
I used the m checkbox option so that I could present many lines and see which of them did or did not match.
So, to find a string that doesn't contain another, use "^(.(?!" + the expression you don't want + "))*$".
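That recipe can be wrapped in a small helper. A sketch in Python; the function name not_containing is made up, and the expression is inserted as-is, so it must already be a valid regex fragment.

```python
import re

def not_containing(expr):
    """Build '^(.(?!expr))*$' for the given regex fragment."""
    return re.compile("^(.(?!" + expr + "))*$")

NO_EXTENSION = not_containing(r"\.[a-zA-Z0-9]{3,}")

for candidate in ["readme", "readme.txt", "v1.2", ".txt"]:
    print(candidate, bool(NO_EXTENSION.match(candidate)))
```

One limitation shows up in the last case: because the lookahead is only checked after each consumed character, a string that begins with the unwanted expression still matches. The ^(?:(?!...).)*$ form from an earlier answer checks before each character and does not have that gap.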