Get data using regex

Hello,
I want to get data from a site using regex:
http://helwa.maktoob.com/sec8180/art97048/pno1/title_%D8%B7%D8%A8%D9%82-%D9%81%D9%8A%D8%AA%D9%88%D8%AA%D8%B4%D9%8A%D9%86%D9%8A-%D8%A8%D8%A7%D9%84%D8%AE%D8%B6%D8%A7%D8%B1/index.htm
I used this regex: /<div class="txtblk"(.*)?<div class="imgv cls">/is
but it gave me "Invalid RegExp". Why?
I want to get the data inside <div class="txtblk"></div>.

Try escaping your double-quotes. Depending on your regex interpreter, those might be causing you problems.

The regex itself looks valid.
It depends on where/how you are using it, though; JavaScript for example doesn't know the /s modifier. To simulate a dot-matches-all mode in JavaScript, use [\s\S] instead of ..
Then, you might be running into problems with the quotes depending on the quoting rules for your language.
Also, you probably want to use (.*?) instead of (.*)?. (Or, if it's JavaScript, ([\s\S]*?)).
Finally, using regex to match HTML is not recommended. Use a DOM parser.
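For instance, a DOM/XPath tool can grab that div from a shell in one line (a sketch, assuming the page has been saved locally as page.html, a placeholder name, and that libxml2's xmllint is installed; 2>/dev/null only hides parser warnings about sloppy HTML):
xmllint --html --xpath '//div[@class="txtblk"]' page.html 2>/dev/null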

You may need to use a site that collects RSS from links, like this:
http://www.allwebdesignresources.com/webdesignblogs/graphics/turn-html-web-sites-into-rss-feeds-20-tools-converters-for-html-to-rss-conversions/

Related

Bash - Regex for HTML contents

I'm learning about Bash scripting, and need some help understanding regex's.
I have a variable that is basically the html of a webpage (exported using wget):
currentURL="https://www.example.com"
currentPage=$(wget -q -O - $currentURL)
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
I started with this, but I need to modify the regex:
Test string (this is what $currentPage contains; there can be zero to many instances of this):
<img src="./download/file.php?id=123456&t=1">
Current Regex:
.\/download\/file.php\?id=[0-9]{6}\&mode=view
Here's the regex I created, but it doesn't seem to work in bash.
The best solution would be to have the ID of each file. In this case, simply 123456. But if we can start with getting the /download/file.php?id=123456, that'd be a good start.
Don't parse XML/HTML with regex, use a proper XML/HTML parser.
Theory:
According to compiler theory, HTML can't be parsed using regexes, which are based on finite state machines. Due to the hierarchical construction of HTML you need a pushdown automaton and a grammar (LALR, for example) manipulated with a tool like YACC.
realLife©®™ everyday tools in a shell:
You can use one of the following:
xmllint: often installed by default with libxml2; XPath 1
xmlstarlet: can edit, select, transform... not installed by default; XPath 1
xpath: installed via Perl's XML::XPath module; XPath 1
xidel: XPath 3
saxon-lint: my own project, a wrapper over Michael Kay's Saxon-HE Java library; XPath 3
Or you can use high-level languages and proper libs; I think of:
Python's lxml (from lxml import etree)
Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Check: Using regular expressions with HTML tags
Example using xidel:
xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'
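The same idea works with xmllint from the list above if xidel isn't installed (a sketch, assuming the page was saved to page.html, a placeholder name; the question's sample line is an <img> tag, so this pulls the src attributes and then extracts the numeric id with grep):
xmllint --html --xpath '//img/@src' page.html 2>/dev/null | grep -oP 'id=\K[0-9]+'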
Let's first clarify a couple of misunderstandings.
I'm learning about Bash scripting, and need some help understanding regex's.
You seem to be implying some sort of relation between Bash and regex.
As if Bash was some sort of regex engine.
It isn't. The [[ builtin is the only thing I recall in Bash that supports regular expressions, but I think you mean something else.
There are some common commands executed in Bash that support some implementation of regular expressions such as grep or sed and others. Maybe that's what you meant. It's good to be specific and accurate.
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
This suggests an underlying assumption that if you want to extract content from an HTML document, then regex is the way to go. That assumption is incorrect.
Although it's best to extract content from HTML using an XML parser (using one of the suggestions in Gilles' answer),
and reaching for regex is not a good reflex,
for simple cases like yours it might just be good enough:
grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)' file.html
Take note that you escaped the wrong characters in the regex:
/ and & don't have a special meaning and don't need to be escaped
. and ? have special meaning and need to be escaped
Some extra tricks in the above regex are good to explain:
The -P flag of grep enables Perl style (powerful) regular expressions
\K is a Perl-specific escape; it means the content matched before the \K is not included in the match
The (?=...) is a zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in the match.
The \K and the lookahead trickery are there to work with grep -o, which outputs only the matched part. Without them, the matched part would be, for example, ./download/file.php?id=123456&mode=view, which is more than what you want.
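For example, reusing the variables from the question (a sketch; note that the sample line in the question ends in &t=1 rather than &mode=view, so the lookahead here only requires an & after the digits; adjust it to whatever really follows the id in your page):
currentPage=$(wget -q -O - "$currentURL")
printf '%s\n' "$currentPage" | grep -oP '\./download/file\.php\?id=\K\d+(?=&)'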

Use regex to get both a link and the text associated with it (anchor tag)

I created a regex string that I hoped would get both the link and the associated text in an html page. For instance, if I had a link such as:
<a href='www.la.com/magic.htm'>magicians of los angeles</a>
Then the link I want is 'www.la.com/magic.htm' and the text I want is 'magicians of los angeles'.
I used the following regex expression:
strsearch = "\<a\s+(.*?)\>(.*?)\</a\s*?\>|"
But my VB program told me I was getting too many matches.
Is there something wrong with the regex?
The round brackets (parentheses) are meant to capture 'groups' that can be back-referenced.
Thanks
What about this one:
\<a href=.+\</a>
All there is left to do is to go over each match and extract the substrings using regular string manipulation.
Check here (although regexr follows javascript regex implementation, it is still useful in our scenario)
With that being said, I often see people stating that regexes are not suited for parsing HTML. You might need to use an HTML parser for this. The two I know of to recommend are HtmlAgilityPack, which is not maintained anymore, and AngleSharp.
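If you just want a quick result from a shell rather than from .NET, an XPath tool such as xidel (mentioned earlier on this page) prints each link and its text without any regex (a sketch; page.html is a placeholder filename):
xidel -s page.html -e '//a/concat(@href, " | ", normalize-space())'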
I tried with the following pattern, and it worked:
\<a href=(.*?)\>(.*?)\<\/a\s*?\>|
I also found two errors in your original string:
you missed an escape on /a
the word 'href' is captured in the first group
Lastly, I would like to recommend a great site for testing regex strings; it helps you debug really fast. Refer to this (it also demonstrates the result you want):
REGEX101

Removing everything between a tag (including the tag itself) using Regex / Eclipse

I'm fairly new to figuring out how Regex works, but this one is just frustrating.
I have a massive XML document with a lot of <description>blahblahblah</description> tags. I want to basically remove any and all instances of <description></description>.
I'm using Eclipse and have tried a few examples of Regex I've found online, but nothing works.
<description>(.*?)</description>
Shouldn't that work?
EDIT:
Here is the actual code.
<description><![CDATA[<center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3"><th>ID</th><td>308</td></tr></table></center>]]></description>
I'm not familiar with Eclipse, but I would expect its regex search facility to use Java's built-in regex flavor. You probably just need to check a box labeled "DOTALL" or "single-line" or something similar, or you can add the corresponding inline modifier to the regex:
(?s)<description>(.*?)</description>
That will allow the . to match newlines, which it doesn't by default.
EDIT: This is assuming there are newlines within the <description> element, which is the only reason I can think of why your regex wouldn't work. I'm also assuming you really are doing a regex search; is that automatic in Eclipse, or do you have to choose between regex and literal searching?
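Outside Eclipse, the same inline (?s) modifier works in other Perl-style engines; for instance, a rough Perl one-liner run from a shell (a sketch; file.xml is a placeholder name, -0777 slurps the whole file so the pattern can span newlines, and the result goes to stdout):
perl -0777 -pe 's/(?s)<description>.*?<\/description>//g' file.xml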

Regex and Yahoo Pipes: How to replace end of url

Here's the Pipe though you may not need it to answer the question: http://pipes.yahoo.com/pipes/pipe.info?_id=85a288a1517e615b765df9603fd604bd
I am trying to modify all URLs like so, replacing:
http://mediadownloads.mlb.com/mlbam/2009/08/12/mlbf_6073553_th_3.jpg with
http://mediadownloads.mlb.com/mlbam/2009/08/12/mlbtv_6073553_1m.mp4
The syntax should be something like: in item.mediaUrl replace "f" with "tv", and in item.mediaUrl replace the last 8 characters with "1m.mp4".
mlbf_(\d+)_.* replaced with mlbtv_$1_1m.mp4
breaks the RSS feed, though I know I am close.
Any idea as to what syntax I need there?
Your regex and replacement look okay to me, assuming the regex is being applied only to the URLs. If it were being applied to the surrounding text as well, the .* would tend to consume a lot more than you wanted. See what happens if you change the regex to this:
mlbf_(\d+)_[\w.]+
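A quick way to sanity-check that pattern outside Pipes is to run it over the sample URL with sed (a sketch; \d and \w are written as POSIX classes because sed -E doesn't understand those shorthands), which should print the desired mlbtv URL:
echo 'http://mediadownloads.mlb.com/mlbam/2009/08/12/mlbf_6073553_th_3.jpg' | sed -E 's/mlbf_([0-9]+)_[[:alnum:]_.]+/mlbtv_\1_1m.mp4/'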
I do not know how Yahoo Pipes works, but this regex should do it according to this site:
Regex:
.*?/([0-9]*)/([0-9]*)/([0-9]*)/mlbf_([0-9]*)_.*
Substitution:
http://mediadownloads.mlb.com/mlbam/$1/$2/$3/mlbtv_$4_1m.mp4

replace url paths using Regex

How can I change the url of my images from this:
http://www.myOLDwebsite.com/**********.*** (i have gifs, jpgs, pngs)
to this:
http://www.myNEWwebiste.com/somedirectory/**********.***
Using a regexp text editor?
Thanks a lot for your time
[]'s
Mateus
Why use regex?
Using conventional means, replace:
src="http://www.myOLDwebsite.com/
with:
src="http://www.myNEWwebiste.com/somedirectory/
Granted, this assumes your image tags always follow the 'src="<url>"' pattern, with double quotes and everything.
Using regex is of course also possible. Replace this:
(src\s*=\s*["'])http://www\.myOLDwebsite\.com/
with:
\1http://www.myNEWwebiste.com/somedirectory/
alternatively, if your text editor uses $ to mark back references:
$1http://www.myNEWwebiste.com/somedirectory/
On second thought - why do your images have absolute URLs in the first place? Isn't that unnecessary?
Well, the easiest way is probably to use sed in in-place mode:
sed -ri \
's#http://www[.]myOLDwebsite[.]com/#http://www.myNEWwebsite.com/subdirectory/#g' \
file1 file2 ...
If for some reason you need to actually interpret the HTML (rather than just do a simple string replacement), a quick script built around BeautifulSoup is going to be safer -- lots of people try to do HTML or XML parsing via regular expressions, but it's very hard if not impossible to cover all corner cases.
All that said, it'd be better if you were using relative links to not have your HTML depend on the server it's hosted on. See also the <BASE HREF="..."> element you can put in your <HEAD> to specify a location all URLs are relative to; if you were using that, you'd only need to do a single replacement.