How do I make this "Use of uninitialized value" warning go away? - regex

Let's say I want to write a regular expression to change all <abc>, <def>, and <ghi> tags into <xyz> tags, and I also want to change their closing tags to </xyz>. This seems like a reasonable regex (ignore the backticks; StackOverflow has trouble with the less-than signs if I don't include them):
`s!<(/)?(abc|def|ghi)>!<${1}xyz>!g;`
And it works, too. The only problem is that for opening tags, the optional $1 variable gets assigned undef, and so I get a "Use of uninitialized value..." warning.
What's an elegant way to fix this? I'd rather not make this into two separate regexes, one for opening tags and another for closing tags, because then there are two copies of the taglist that need to be maintained, instead of just one.
Edit: I know I could just turn off warnings in this region of the code, but I don't consider that "elegant".

Move the question mark inside the capturing bracket. That way $1 will always be defined, but may be a zero-length string.

How about:
`s!(</?)(abc|def|ghi)>!${1}xyz>!g;`

s!<(/?)(abc|def|ghi)>!<${1}xyz>!g;
The only difference is changing "(/)?" to "(/?)". You have already identified several functional solutions. This one has the elegance you asked for, I think.
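The difference between an optional group and a group with optional content can be seen in most regex flavors. Here is a minimal sketch in Python rather than Perl (the names TAGS, first_group, and rename_tags are made up for illustration). Python reports a non-participating group as None, which is the counterpart of Perl's undef:

```python
import re

TAGS = r"abc|def|ghi"

# Optional group: when "/" is absent, the group takes no part in the match.
optional_group = re.compile(rf"<(/)?({TAGS})>")
# Optional content: the group always takes part, possibly as an empty string.
optional_content = re.compile(rf"<(/?)({TAGS})>")


def first_group(pattern, text):
    """Return what group 1 captured at the start of text."""
    return pattern.match(text).group(1)


def rename_tags(text):
    """Rewrite opening and closing tags from the taglist to xyz."""
    return optional_content.sub(r"<\g<1>xyz>", text)


print(repr(first_group(optional_group, "<abc>")))    # None
print(repr(first_group(optional_content, "<abc>")))  # ''
print(rename_tags("<abc>text</abc><ghi>"))           # <xyz>text</xyz><xyz>
```

With "(/?)" the replacement always has a defined string to interpolate, so nothing needs to be silenced.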

Here is one way:
s!<(/?)(abc|def|ghi)>!<$1xyz>!g;
Update: Removed irrelevant comment about using (?:pattern).

You could just make your first match be (</?), and get rid of the hard-coded < on the "replace" side. Then $1 would always have either < or </. There may be more elegant solutions to address the warning issue, but this one should handle the practical problem.

To make the regex capture $1 in either case, try:
s!<(/|)?(abc|def|ghi)>!<${1}xyz>!g;
Note the pipe symbol inside the first group, meaning '/' or ''.
For '<abc>' this will capture the '' between '<' and 'abc>', and for '</abc>', capture '/' between '<' and 'abc>'.

I'd rather not make this into two
separate regexs, one for opening tags
and another for closing tags, because
then there are two copies of the
taglist that need to be maintained
Why? Put your taglist into a variable and interpolate that variable into as many regexes as you like. I'd consider this even with a single regex because it's much more readable with a complicated regex (and what regex isn't complicated?).
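A rough sketch of that idea, written in Python rather than Perl. The names TAGS, OPEN_TAG, CLOSE_TAG, and rename_with_two_patterns are invented for this example:

```python
import re

# Single source of truth for the tag names.
TAGS = ["abc", "def", "ghi"]
TAG_ALTERNATION = "|".join(re.escape(tag) for tag in TAGS)

# Both patterns are built from the same list, so there is only one copy to maintain.
OPEN_TAG = re.compile(rf"<(?:{TAG_ALTERNATION})>")
CLOSE_TAG = re.compile(rf"</(?:{TAG_ALTERNATION})>")


def rename_with_two_patterns(text, new_name="xyz"):
    """Rename opening tags, then closing tags, using two separate regexes."""
    text = OPEN_TAG.sub(f"<{new_name}>", text)
    return CLOSE_TAG.sub(f"</{new_name}>", text)


print(rename_with_two_patterns("<abc>text</abc> <def/> <ghi>"))
```

Neither pattern has an optional capture, so there is nothing that could be left undefined.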

Be careful, because HTML is a bit harder than it looks at first glance. For example, do you want to change "<abc foo='bar'>" to "<xyz foo='bar'>"? Your regex won't. Do you want to change "<img alt='<abc>'>"? The regex will. Instead, you might want to do something like this:
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content("<abc>asdf</abc>");
for my $tag (qw<abc def ghi>) {
    for my $elem ($tree->look_down(_tag => $tag)) {
        $elem->tag('xyz');
    }
}
print $tree->as_HTML;
That keeps you from having to do the fiddly bits of parsing HTML yourself.

Add
no warnings 'uninitialized';
or
s!<(/)?(abc|def|ghi)>! join '', '<', ${1}||'', 'xyz>' !ge;

Related

Get only first match in Regex

Given this string: hello"C07","73" (quotes included) I want to get "C07". I'm using (?:hello)|(?<=")(?<screen>[a-zA-Z0-9]+)?(?=") to try to do this. However, it consistently matches "73" as well. I've tried ...0-9]+){1}..., but that doesn't work either. I must be misunderstanding how this is supposed to work, but I can't figure out any other way.
How can I get just the first set of characters between quotes?
EDIT: Here's a link to show my problem.
EDIT: Ok, here's exactly what I'm trying to do:
Basically, what I'm trying to get is this: 1) a positive match on "hello", 2) a group named "screen" with, in this case, "C07" in it and 3) a group named "format" with, in this case, "73" in it.
Both the "C07" and "73" will vary. "hello" will always be the same. There may or may not be an extra comma between "hello" and the first double-quote.
For your initial question of how to stop after the first match, either removing the global search or anchoring the search at the start of the string would accomplish that.
For the latter question you can name your groups and just keep extending the pattern throughout the line(s).
hello"(?<screen>[^"]+)","(?<format>[^"]+)"
Demo: http://regex101.com/r/PBXe8l/1
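For illustration, here is roughly how the named-group pattern might look in Python, which spells named groups as (?P<name>...). The optional comma after "hello" is an addition based on the question's note that an extra comma may appear; PATTERN and parse are made-up names:

```python
import re

# ",?" tolerates the optional comma between hello and the first quote.
PATTERN = re.compile(r'hello,?"(?P<screen>[^"]+)","(?P<format>[^"]+)"')


def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


print(parse('hello"C07","73"'))   # {'screen': 'C07', 'format': '73'}
print(parse('hello,"C07","73"'))  # {'screen': 'C07', 'format': '73'}
print(parse('goodbye"C07"'))      # None
```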
Based on your regex example, why not:
^(?:hello)"([a-zA-Z\d]+)"

Regex to match text between two strings, including the strings

I'm trying to fix some conflicts in a git merge. There are a lot of <<<<<<< HEAD and ======= blocks that I want to be able to find and replace with an empty string in a lot of files.
I found this regex pattern that correctly matches everything between the two strings, but it leaves out the beginning and ending strings, and I want to be able to match them also.
(?s)(?<=<<<<<<< HEAD).*?(?=\=\=\=\=\=\=\=)
So, match <<<<<<< HEAD, ======= and everything between them to do a search/replace.
Can anyone help me out? I would be running this on files I'm certain I don't want anything between those strings, I guess that's also why I didn't try a "use theirs" flag when doing the merge, because I need to see the files first.
Just leave out the look-arounds mentioned by Xufox
(?s)(<<<<<<< HEAD)(.*?)(\=\=\=\=\=\=\=)
The .*? is wrapped in parentheses so you can reference it in the replacement: \1 is the opening marker, \2 is everything in between, and \3 is the closing marker (but the syntax can vary).
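A small sketch of the same pattern in Python, assuming the file content is already in a string. CONFLICT_HEAD and strip_head_side are hypothetical names, and re.DOTALL plays the role of the inline (?s) flag:

```python
import re

# Group 1: opening marker, group 2: text in between, group 3: closing marker.
CONFLICT_HEAD = re.compile(r"(<<<<<<< HEAD)(.*?)(=======)", re.DOTALL)


def strip_head_side(text):
    """Delete the opening marker, the closing marker, and everything between them."""
    return CONFLICT_HEAD.sub("", text)


sample = "a\n<<<<<<< HEAD\nours\n=======\ntheirs\n>>>>>>> branch\nb\n"
print(repr(CONFLICT_HEAD.search(sample).group(2)))  # '\nours\n'
print(repr(strip_head_side(sample)))
```

Note that this leaves the >>>>>>> line in place, since the pattern stops at the ======= marker.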
I think you might be asking the wrong question here. The best way to actually handle merge conflicts is with a merge tool. You should look into something like meld. And specifically setting git merge tool to use that. Manual merges are not fun...
Use a pretty ui to analyze the merge instead
You want to match on the opening and closing tags of the conflict sections?
Parentheses are primarily used for capturing groups. If you write it like so:
(<<<<<<< HEAD)(.*\s)+(\s*=======)
it will create three capturing groups (four if you count group 0, the whole match), whose contents you can access.
Tested: http://regexr.com/

Regex skipping first match?

I'm trying to match every plain "twitter-like hashtags" in a text, and make hyperlinks of them.
I've had some success, but strangely enough my regular expression /(^|\s)#(\w*[a-zA-Z_]+\w*)/ is skipping the first case when the string starts with the # sign (the rest are properly matched). If I write anything else before, then the first case is properly matched. It fails only when a # sign appears to be the first character of the string.
Do you know why that could be?
function make_hashes_into_twitter_hashtag_urls($content){
    $content = preg_replace('/(^|\s)#(\w*[a-zA-Z_]+\w*)/', '<span class="color_my_hash">\1#</span>\2 ', $content);
    echo $content;
}
add_filter('the_content','make_hashes_into_twitter_hashtag_urls');
Thanks,
The the_content(); template tag outputs HTML and text, and HTML and regular expressions don't play nice together.
I don't know exactly why something like <p># (which is what the_content(); was outputting) made my regex fail.
I'd love to know why, and how my regex should be - to avoid being trapped (even if, as stated, it's not recommended to mix regex and HTML).
My question is mainly answered, but any additional details on how to hack/workaround this situation would be hugely appreciated!
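One likely explanation is that (^|\s) only matches at the very start of the string or at a whitespace character, and in <p>#tag the character before the # is a ">". Here is a minimal Python sketch of that behaviour. The LOOSE_HASHTAG variant, which also accepts ">" before the #, is just one possible workaround and an assumption on my part, not a robust way to handle HTML:

```python
import re

# Same pattern as in the question.
HASHTAG = re.compile(r"(^|\s)#(\w*[a-zA-Z_]+\w*)")
# Hypothetical variant: additionally allow ">" (the end of an HTML tag) before "#".
LOOSE_HASHTAG = re.compile(r"(^|\s|>)#(\w*[a-zA-Z_]+\w*)")


def find_hashtags(pattern, text):
    """Return the hashtag names (without '#') that the pattern finds."""
    return [match.group(2) for match in pattern.finditer(text)]


print(find_hashtags(HASHTAG, "#first and #second"))               # ['first', 'second']
print(find_hashtags(HASHTAG, "<p>#first and #second</p>"))        # ['second']
print(find_hashtags(LOOSE_HASHTAG, "<p>#first and #second</p>"))  # ['first', 'second']
```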

Removing empty bbcode tags using regex

Using regex I'm trying to remove empty bbcode tags. By empty I mean nothing in between them:
[tag][/tag]
If there is something between them then it should be kept.
I've searched a lot and played around with a regex tester but haven't come up with anything that works right.
Edit: I realize now why I was having a hard time with this. In addition to the example above, I also have one's like:
[url=http://www.somedomain.com/][/url]
I'm trying to cleanup bbcode when a form is submitted so it's not stored since it's unneeded.
In JavaScript, you could do:
str.replace(/\[([^\[\]]*)\]\[\/\1\]/g, '');
The operative aspect of the regex in this case is the use of backreferences inside the pattern. These are not universally supported, but .NET does support them (it uses its own regex engine rather than PCRE).
The pattern, then, is [, a word, ][/, the same word, ]. If we assume the word has simply the quality of "does not contain ]", then an appropriate regex to match an empty tag is \[([^\]]+)\]\[/\1\], escaped as necessary in context.
For the second case, if we assume the form [tag=arg][/tag], and that tag and arg each don't contain any ']' (not a reasonable assumption! but dealing with it is left as an exercise for the reader -- and I'm quite sure most bbcode implementations don't actually deal with that problem, either), one could use a regex \[([^\]=]+)(=[^\]]*)?\]\[/\1\].
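As a sketch, the last pattern can be tried out in Python like this (EMPTY_TAG, strip_empty_tags, and strip_nested_empty_tags are names invented here). The repeated pass for nested empty tags goes beyond what the answer describes and is only a suggestion:

```python
import re

# [tag] or [tag=arg], immediately followed by the matching [/tag].
EMPTY_TAG = re.compile(r"\[([^\]=]+)(=[^\]]*)?\]\[/\1\]")


def strip_empty_tags(text):
    """Remove empty tag pairs in a single pass."""
    return EMPTY_TAG.sub("", text)


def strip_nested_empty_tags(text):
    """Repeat the pass until nothing changes, so [b][i][/i][/b] disappears entirely."""
    while True:
        stripped = EMPTY_TAG.sub("", text)
        if stripped == text:
            return text
        text = stripped


print(strip_empty_tags("[b][/b]keep[i]x[/i]"))
print(strip_empty_tags("[url=http://www.somedomain.com/][/url]"))
print(strip_nested_empty_tags("[b][i][/i][/b]"))
```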

Regular expression that allows square-brackets

This works perfectly:
result2 = Regex.Replace(result2, "[^A-Za-z0-9/.,>#:\s]", "", RegexOptions.Compiled)
But I need to allow square brackets ([ and ]).
Does this look correct to allow Brackets without changing what is allowed and not allowed from the above?
result2 = Regex.Replace(result2, "[^A-Za-z0-9\[\]/.,>#:\s]", "", RegexOptions.Compiled)
Reason I need a second opinion is that I think if this is correct something else is blocking it that is out of my control.
I can't say any one person did or did not answer the question or try to help; I would split the solution among everyone if I could, because it made me think. The key was to escape each bracket with a single \ . Thanks everyone for your help.
result = Regex.Replace(result, "[^A-Za-z0-9/\[\].,>#\s]", "", RegexOptions.Compiled)
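For comparison, the same negated character class with escaped brackets behaves as expected in Python's re module too. This is a sketch only; DISALLOWED and sanitize are made-up names, and the class is the one from the question with \[ and \] added:

```python
import re

# Anything NOT in this set is removed; \[ and \] put the brackets on the allowed list.
DISALLOWED = re.compile(r"[^A-Za-z0-9\[\]/.,>#:\s]")


def sanitize(text):
    """Strip every character that is not explicitly allowed."""
    return DISALLOWED.sub("", text)


print(sanitize("a[1]: ok! <b>"))  # a[1]: ok b>
```

If the brackets are still being stripped with a pattern like this, the cause is probably somewhere other than the regex.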
The gimmick #tripleee mentioned does indeed work in .NET. Just make sure ] is the first character (or in this case, first after the ^).
result2 = Regex.Replace(result2, "[^][A-Za-z0-9/.,>#:\s]", "");
But be careful about porting the regex to other flavors. Some will treat it as a syntax error, and some will treat it as two atoms: [^] and [A-Za-z0-9/.,>#:\s], the first of which matches literally anything but nothing--i.e., any character including newlines.
On a side note, why are you using the RegexOptions.Compiled option? That's something you should use only when you know you need it. The increased performance will almost never be significant, and it comes with a pretty high price tag, as explained here.
http://msdn.microsoft.com/en-us/library/8zbs0h2f.aspx