how to write link parser with regex - regex

I have a line: "a herf = sdfsjkdhfks http://www.google.com 134"
I want to get the "http://www.google.com" part only if there is a "<" at the beginning and a ">" in the end
For now my regex is "(?i)(http)(s:| :).+\.[A-Za-z]{2,}/?"
How can I check that the angle brackets exist without making them part of the match? I mean, I do not want the angle brackets to appear in the output of the match.
In this case, the output should be null because there are no angle brackets, but if there are, I want the output to be just "www.google.com".
Thanks in advance

Include the bracket as part of your regex, then as a second step after you've found the match, strip it out of that result string before you return the result.
If you're anchoring the angled brackets to the start and end of the regex, this could be as simple as something like .substring(1,matchedString.length()-1).
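A variation on the two-step idea above, sketched in Python: rather than stripping the brackets afterwards, the pattern requires them but puts only the URL in a capturing group. The pattern and the `extract_link` helper are assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical pattern: "<", anything that is not a bracket, the URL (captured),
# anything that is not a bracket, then ">". The brackets must be present but
# are never part of group 1.
BRACKETED_LINK = re.compile(r"<[^<>]*?(https?://[^\s<>]+)[^<>]*>")

def extract_link(line):
    """Return the URL if it sits inside <...>, otherwise None."""
    match = BRACKETED_LINK.search(line)
    return match.group(1) if match else None

print(extract_link("<a href = sdfsjkdhfks http://www.google.com 134>"))
print(extract_link("a href = sdfsjkdhfks http://www.google.com 134"))
```

The first call should print the URL and the second should print None, since that line has no angle brackets.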

This will get the link part, skipping anything at the start and end.
import re
content = "<ahref = 123 http://googl 235>"
# Raw string so that \s reaches the regex engine unchanged; the group captures the URL.
print(re.findall(r"<a\s*href\s*=.*(http://[^> ]*)\s*.*>", content))

Related

How to replace specific strings between <> using regex

Can anyone tell me how to do the following task using regex?
replace all the ABC with DEF only when ABC is inside both <> and ""
original string:
<tagA nameABC1="attr1ABCx xyzABC" name2="attABCa"> outside"ABC"xyz</tagA>
<tagB nameABC2="attr2ABCx cccABC" name3="testABCb"> outside_"ABC"</tagB>
desired string after replacing:
<tagA nameABC1="attr1DEFx xyzDEF" name2="attDEFa"> outside"ABC"xyz</tagA>
<tagB nameABC2="attr2DEFx cccDEF" name3="testDEFb"> outside_"ABC"</tagB>
Edited:
Thank you guys.
I've decided to use HTML parser library jsoup to handle all html text properly.
Assuming well formed input (no dangling quotes or brackets):
Search: ABC(?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)(?=[^<>]*>)
Replace: DEF
This works by applying two look aheads:
the first look ahead (?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$) requires there to be an odd number of quote characters in the remaining input, which in turn means the match is inside quotes
the other look ahead (?=[^<>]*>) requires the next angle bracket to be a closing bracket, which in turn means the match is inside an angle bracket pair
This is not bulletproof; for example, it doesn't cater for closing angle brackets inside quotes. Even that could be handled with a more complicated look ahead that applies logic similar to the first look ahead when matching angle brackets... an exercise left for the reader.
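The search/replace pair above can be tried directly in Python, whose `re` module supports the same look ahead syntax. This is only a sketch on the first sample line from the question; the variable names are made up for the example.

```python
import re

# Pattern copied from the answer: ABC must be followed by an odd number of
# quotes (inside quotes) and the next angle bracket must be ">" (inside a tag).
INSIDE_TAG_AND_QUOTES = re.compile(
    r'ABC(?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)(?=[^<>]*>)'
)

original = '<tagA nameABC1="attr1ABCx xyzABC" name2="attABCa"> outside"ABC"xyz</tagA>'
replaced = INSIDE_TAG_AND_QUOTES.sub("DEF", original)
print(replaced)
```

The attribute name `nameABC1` and the quoted `"ABC"` outside the tag should be left alone, because each fails one of the two look aheads.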

Perl: regex pattern matching

Whenever I find the string ".abc.corp:" in a line of a file, I would like to exclude that line.
Example line:
kubernte-fileserver-NN.abc.corp:/srv/export/storage/nsp_updates 1231231 123112 123123 89% /devops
Can someone help me find the correct regex pattern?
I'm trying the pattern below, but I can't get it to work:
/^(.*(?!\.abc\.corp).*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\%\s+([\/\w\d\.-]+)$/
I'm confused about how subpattern group $1 interacts with the negative lookaround!
By "exclude it", I assume you want to exclude the entire line.
Your attempt will not exclude anything, because Perl can always find some point at which to split the share path so that the split point is not immediately followed by .abc.corp, for example:
kubernte-fileserver-N
N.abc.corp:/srv/export/storage/nsp_updates
or (as it's actually going to do) the first .* can simply consume everything, leaving nothing for the second one.
I'd instead first try to match the string you're trying to avoid, and failing to do so, proceed with the actual handling:
if (/^\S+\.abc\.corp:/) {
# SKIP
}
elsif (/^(.*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\%\s+([\/\w\d\.-]+)$/) {
...
}
Besides actually working, this makes the code much more readable.
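The same "test for the unwanted string first, then parse" structure can be sketched in Python. The second input line and the `parse_line` helper are invented for illustration; the two patterns mirror the ones in the Perl snippet above.

```python
import re

# Lines whose first field contains ".abc.corp:" are skipped outright.
SKIP = re.compile(r"^\S+\.abc\.corp:")
# Same capture groups as the Perl pattern: share, three numbers, percent, mount point.
DF_LINE = re.compile(r"^(.*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)%\s+([/\w.-]+)$")

def parse_line(line):
    """Return the captured fields, or None for skipped or unparseable lines."""
    if SKIP.search(line):
        return None
    match = DF_LINE.match(line)
    return match.groups() if match else None

lines = [
    "kubernte-fileserver-NN.abc.corp:/srv/export/storage/nsp_updates 1231231 123112 123123 89% /devops",
    "otherhost.example.com:/srv/data 100 40 60 40% /mnt/data",  # made-up line
]
for line in lines:
    print(parse_line(line))
```

Keeping the exclusion in its own pattern avoids the negative lookahead entirely, which is the point the answer makes about readability.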

emacs copy text after symbol in each line

I have a block of text that looks something like
par.dm_std;
par.dm_POM;
par.dm_CaCO3;
and I want it to look like
par.dm_std = dm_std;
par.dm_POM = dm_POM;
par.dm_CaCO3 = dm_CaCO3;
So I am essentially trying to copy everything after the "." and put an equals sign before and a semicolon afterward. I tried to run a query replace with
par\.\(.*\) -> par\.\1 = \1;
but then emacs returns the error message
Invalid use of `\' in replacement text
I can't figure out for the life of me what I am doing wrong here?
By the way, this is matlab code I am working with.
You should not escape . in the replacement text. You also should have a literal ; at the end of the match expression; otherwise, it will be included in \1 and you'll get an extra semicolon before the equal sign.
Replace regexp: par\.\(.*\);
Replace with: par.\1 = \1;
Apparently, I should have used replace-regexp rather than query-replace-regexp.
With the former, everything just works.
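For comparison, here is a rough Python equivalent of the corrected replacement. It is not Emacs syntax: Python writes the group as `( ... )` rather than `\( ... \)`, but the idea is the same, with the `;` kept outside the group and an unescaped `.` in the replacement.

```python
import re

text = "par.dm_std;\npar.dm_POM;\npar.dm_CaCO3;"

# The ";" is outside the group, so it is not copied into \1.
result = re.sub(r"par\.(.*);", r"par.\1 = \1;", text)
print(result)
```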

How can I strip double quotes and braces from my strings before insert in Rails4?

I am parsing values from xml and saving them to variables. I was able to strip all but the braces and double quotes from the string. The value displays like this on the page: ["MPEG Video"].
Here is an example of the parse saving it to a variable:
@video_format = REXML::XPath.each(media_parse_doc, "//track[@type='Video']/Format/text()") { |element| element }
I tried using .ts like this:
@video_format = (REXML::XPath.each(media_parse_doc, "//track[@type='Video']/Format/text()") { |element| element } ).ts('[]"','')
but it did not work. I saw some examples telling you to use gsub, and I looked at the API dock for gsub, but I don't understand the logic in the examples well enough to apply it correctly to my own case. Here is one of the examples:
"foobar".gsub(/^./, "") # => "oobar"
I understand it is removing the first character, but I don't know how to set it up to remove " and [.
Why the /^? Is that ASCII for something? Can someone please show me the correct syntax to remove the double quotes and braces from my variables and explain the logic so I can better understand how to use it on my own in the future?
Thank you for the help!
If you want to understand regular expressions, check out http://rubular.com/.
"foobar".gsub(/^./, "") # => "oobar"
That particular example will substitute the first letter of the string with "" (i.e., nothing). The reason is that the ^ says "pin the match to the beginning of the string", and the . says "match any character", so it'll match any character at the beginning of the string. The enclosing / characters are just the standard delimiters for a regular expression, so it's only the ^. that you need to figure out.
To replace double quotes: 'fo"o"bar'.gsub(/"/, "") # => "foobar"
To replace a left square bracket: 'fo[o[bar'.gsub(/\[/, "") # => "foobar" (because square brackets are special characters in regex, you have to prefix them with a \ when you want to use them as 'normal' characters).
To replace all quotes and square brackets in one go: 'fo[o"[b]"ar'.gsub(/("|\[|\])/, "") # => "foobar"
(The parentheses indicate a group, and the pipes | indicate 'or'. So, ("|\[|\]) means "match any of the things in this group: a quote, or a left square bracket, or a right square bracket".)
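The same idea can be written with a character class instead of a group of alternatives. The sketch below uses Python's `re.sub` rather than Ruby's `gsub`, purely to illustrate the pattern; the sample value is the one shown in the question.

```python
import re

# A character class matching a double quote, "[" or "]".
# Inside the class the brackets still need a backslash.
QUOTES_AND_BRACKETS = re.compile(r'["\[\]]')

value = '["MPEG Video"]'
cleaned = QUOTES_AND_BRACKETS.sub("", value)
print(cleaned)
```

A character class is usually a little easier to read than `("|\[|\])` when every alternative is a single character.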
But really, what you should do is work through a good introductory tutorial on regular expressions and start from the basics. Once you understand those, it shouldn't be too hard to start composing simple regular expressions of your own.
If you're on a Mac, this app is very useful for writing your own regexes: http://krillapps.com/patterns/

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regex reference that explains how one can specify the "NEXT" token rather than the one at the end.
Any and all help greatly appreciated.
Similar questions on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
The character class [^;]+ matches any run of characters other than ;, so the group captures everything between B= and the first ; that follows.
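To see the difference between the greedy, lazy and negated-class forms side by side, here is a small Python sketch on the string from the question. It uses a plain search for the `B=...;` part only, without the surrounding `.+`, so it isolates the behaviour of the capturing group.

```python
import re

s = "A=abc;B=def_3%^123+-;C=123;"

# Greedy: (.+) grabs as much as it can, so the ";" matched is the last one.
greedy = re.search(r"B=(.+);", s).group(1)
# Lazy: (.+?) grabs as little as it can, so the ";" matched is the first one.
lazy = re.search(r"B=(.+?);", s).group(1)
# Negated class: the group cannot cross a ";" at all.
negated = re.search(r"B=([^;]+);", s).group(1)

print(greedy)
print(lazy)
print(negated)
```

Only the greedy form runs past the first semicolon; the other two stop at it.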
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.
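The `getInnerString` pseudocode from the question could be sketched along the "find, don't replace" lines like this. It is written in Python as an assumption, since the question's language is unknown; `get_inner_string` is a hypothetical name, and escaping the tokens with `re.escape` is an addition so that tokens containing regex metacharacters are treated literally.

```python
import re

def get_inner_string(text, start_token, end_token):
    """Return the text between start_token and the next end_token, or None."""
    # Lazy group so the match stops at the first end_token after start_token.
    pattern = re.escape(start_token) + r"(.*?)" + re.escape(end_token)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None

my_string = "A=abc;B=def_3%^123+-;C=123;"
print(get_inner_string(my_string, "B=", ";"))
print(get_inner_string(my_string, "D=", ";"))
```

Returning None when the start token is missing avoids the replace-based version's habit of handing back the whole input unchanged.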