Bash - Regex for HTML contents

Bash - Regex for HTML contents - regex

I'm learning about Bash scripting, and need some help understanding regex's.
I have a variable that is basically the html of a webpage (exported using wget):
currentURL = "https://www.example.com"
currentPage=$(wget -q -O - $currentURL)
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
I started with this, but I need to modify the regex:
Test string (this is what currentURL contains, there can be zero to many instances of this):
<img src="./download/file.php?id=123456&t=1">
Current Regex:
.\/download\/file.php\?id=[0-9]{6}\&mode=view
Here's the regex I created, but it doesn't seem to work in bash.
The best solution would be to have the ID of each file. In this case, simply 123456. But if we can start with getting the /download/file.php?id=123456, that'd be a good start.

Don't parse XML/HTML with regex, use a proper XML/HTML parser.
theory :
According to the compiling theory, HTML can't be parsed using regex based on finite state machine. Due to hierarchical construction of HTML you need to use a pushdown automaton and manipulate LALR grammar using tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint often installed by default with libxml2, xpath1
xmlstarlet can edit, select, transform... Not installed by default, xpath1
xpath installed via perl's module XML::XPath, xpath1
xidel xpath3
saxon-lint my own project, wrapper over #Michael Kay's Saxon-HE Java library, xpath3
or you can use high level languages and proper libs, I think of :
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Check: Using regular expressions with HTML tags
Example using xidel:
xidel -s "$currentURL" -e '//a/extract(#href,"id=(\d+)",1)'

Let's first clarify a couple of misunderstandings.
I'm learning about Bash scripting, and need some help understanding regex's.
You seem to be implying some sort of relation between Bash and regex.
As if Bash was some sort of regex engine.
It isn't. The [[ builtin is the only thing I recall in Bash that supports regular expressions, but I think you mean something else.
There are some common commands executed in Bash that support some implementation of regular expressions such as grep or sed and others. Maybe that's what you meant. It's good to be specific and accurate.
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
This suggests an underlying assumption that if you want to extract content from an HTML, then regex is the way to go. That assumption is incorrect.
Although it's best to extract content from HTML using an XML parser (using one of the suggestions in Gilles' answer),
and trying to use regex for it is not a good reflect,
for simple cases like yours it might just be good enough:
grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)' file.html
Take note that you escaped the wrong characters in the regex:
/ and & don't have a special meaning and don't need to be escaped
. and ? have special meaning and need to be escaped
Some extra tricks in the above regex are good to explain:
The -P flag of grep enables Perl style (powerful) regular expressions
\K is a Perl specific symbol, it means to not include in the match the content before the \K
The (?=...) is a zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in the match.
The \K and the lookahead trickery is to work with grep -o, which outputs only the matched part. But without these trickeries the matched part would be for example ./download/file.php?id=123456&mode=view, which is more than what you want.

Related

Regular expression - ignore first 7 character from extracted data

Extracting data from a page source. In the extracted data, need to display text after the ". Tried different options. Didn't work. Any suggestions
Page source text
enter image description here
input type name=loginForm_SUBMIT value="1" /input type=""name="faces.ViewState" id="faces.ViewState" value="9uiY/UWJ1/w3PQ==" /><
regular expression: value="[^"1" ].*\w==
Output: value="9uiY/UWJ1/w3PQ==
Expected Output: 9uiY/UWJ1/w3PQ==

Don't parse XML/HTML with regex, use a proper XML/HTML parser and a powerful xpath query.
theory :
According to the compiling theory, XML/HTML can't be parsed using regex based on finite state machine. Due to hierarchical construction of XML/HTML you need to use a pushdown automaton and manipulate LALR grammar using tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint often installed by default with libxml2, xpath1 (check my wrapper to have newlines delimited output
xmlstarlet can edit, select, transform... Not installed by default, xpath1
xpath installed via perl's module XML::XPath, xpath1
xidel xpath3
saxon-lint my own project, wrapper over #Michael Kay's Saxon-HE Java library, xpath3
or you can use high level languages and proper libs, I think of :
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
ruby nokogiri, check this example
php DOMXpath, check this example
Check: Using regular expressions with HTML tags
Example using xpath :
xmllint --html --xpath 'string(//input[#value][2]/#value)' file
Output :
9uiY/UWJ1/w3PQ==

You may try this
(?:value[^v]*value=\")([^\"]*)
The output you want is captured in group 1, and you can retrieve it by backreference \1 or $1. Demo
"value=" is occurred twice in your sample text, so you seemed use the regex(value="[^"1" ].*\w==) to avoid the first one and match second one.
But the regex is wrong because character class'[...]' means one character. If the character class is followed by the quantifier(repeater) *, +, or {min,max} etc, then it's possible the regex means the string which has multiple characters.

Is posible to add characters to a string as part of a regular expression (regex)

I use an application to find specific text patterns in free text fields in XML records. It uses regex to identify the pattern and then it is tagged in the XML. For a specific project, it would be a great time saver (I am working with about 18 million records) if I could add 2 characters 27 in front of one of the pattern I have to use.
Can this be done or am I just going to have to go the long way around?

No, you can't have a regex match text that isn't there. A regex will only be able to return text that is part of the original text.
However, if you matched into groups, you could potentially use the group name for extra information about what you're matching.

Regex is not the right tool if you'd like to edit an XML file. Instead, use a modern language like Python, Perl, Ruby, PHP, Java with a proper XML parser module. If you work in Unix like shell, I recommend xmlstarlet
That said, if you'd like to go ahead with a substitution, you can try sed (at your own risks) :
sed -i -r 's/987654/27&/g' files*.xml
(use only -i switch only to modify in-place)

get data using regex

hello
i want to get data from a site using regex
http://helwa.maktoob.com/sec8180/art97048/pno1/title_%D8%B7%D8%A8%D9%82-%D9%81%D9%8A%D8%AA%D9%88%D8%AA%D8%B4%D9%8A%D9%86%D9%8A-%D8%A8%D8%A7%D9%84%D8%AE%D8%B6%D8%A7%D8%B1/index.htm
i used that regex /<div class="txtblk"(.*)?<div class="imgv cls">/is
but i gave me Invalid RegExp
why ?
i want to get data inside <div class="txtblk"></div>

Try escaping your double-quotes. Depending on your regex interpreter, those might be causing you problems.

The regex itself looks valid.
It depends on where/how you are using it, though; JavaScript for example doesn't know the /s modifier. To simulate a dot-matches-all mode in JavaScript, use [\s\S] instead of ..
Then, you might be running into problems with the quotes depending on the quoting rules for your language.
Also, you probably want to use (.*?) instead of (.*)?. (Or, if it's JavaScript, ([\s\S]*?)).
Finally, using regex to match HTML is not recommended. Use a DOM parser.

u may need to use a site that collects rss from links
like this
http://www.allwebdesignresources.com/webdesignblogs/graphics/turn-html-web-sites-into-rss-feeds-20-tools-converters-for-html-to-rss-conversions/

Regular expression extraction in text editors

I'm kind of new to programming, so forgive me if this is terribly obvious (which would be welcome news).
I do a fair amount of PHP development in my free time using pregmatch and writing most of my expressions using the free (open source?) Regex Tester.
However frequently I find myself wanting to simply quickly extract something and the only way I know to do it is to write my expression and then script it, which is probably laughable, but welcome to my reality. :-)
What I'd like is something like a simple text editor that I can feed my expression to (given a file or a buffer full of pasted text) and have it parse the expression and return a document with only the results.
What I find is usually regex search/replace functions, as in Notepad++ I can easily find (and replace) all instances using an expression, but I simply don't know how to only extract it...
And it's probably terribly obvious, can expression match only the inverse? Then I could use something like (just the expression I'm currently working on):
([^<]*)
And replace everything that doesn't match with nothing. But I'm sure this is something common and simple, I'd really appreciate any poniters.
FWIW I know grep and I could do it using that, but I'm hoping their are better gui'ified solution I'm simply ignorant of.
Thanks.
Zach
What I was hoping for would be something that worked in a more standard set of gui tools (ie, the tools I might already be using). I appreciate all the responses, but using perl or vi or grep is what I was hoping to avoid, otherwise I would have just scripted it myself (of course I did) since their all relatively powerful, low-level tools.
Maybe I wasn't clear enough. As a senior systems administrator the cli tools are familiar to me, I'm quite fond of them. Working at home however I find most of my time is spent in a gui, like Netbeans or Notepad++. I just figure there would be a simple way to achieve the regex based data extraction using those tools (since in these cases I'd already be using them).
Something vaguely like what I was referring to would be this which will take aa expression on the first line and a url on the second line and then extract (return) the data.
It's ugly (I'll take it down after tonight since it's probably riddled with problems).
Anyway, thanks for your responses. I appreciate it.

If you want a text editor with good regex support, I highly recommend Vim. Vim's regex engine is quite powerful and is well-integrated into the editor. e.g.
:g!/regex/d
This says to delete every line in your buffer which doesn't match pattern regex.
:g/regex/s/another_regex/replacement/g
This says on every line that matches regex, do another search/replace to replace text matching another_regex with replacement.
If you want to use commandline grep or a Perl/Ruby/Python/PHP one-liner any other tool, you can filter the current buffer's text through that tool and update the buffer to reflect the results:
:%!grep regex
:%!perl -nle 'print if /regex/'

Have you tried nregex.com ?
http://www.nregex.com/nregex/default.aspx
There's a plugin for Netbeans here, but development looks stalled:
http://wiki.netbeans.org/Regex
http://wiki.netbeans.org/RegularExpressionsModuleProposal
You might also try The Regulator:
http://sourceforge.net/projects/regulator/

Most regex engines will allow you to match the opposite of the regex.
Usually with the ! operator.

I know grep has been mentioned, and you don't want a cli tool, but I think ack deserves to be mentioned.
ack is a tool like grep, aimed at
programmers with large trees of
heterogeneous source code.
ack is written purely in Perl, and
takes advantage of the power of Perl's
regular expressions.

A good text editor can be used to perform the actions you are describing. I use EditPadPro for search and replace functionality and it has some other nice feaures including code coloring for most major formats. The search panel functionality includes a regular expression mode that allows you to input a regex then search for the first instance which identifies if your expression matches the appropriate information then gives you the option to replace either iteratively or all instances.
http://www.editpadpro.com

My suggestion is grep, and cygwin if you're stuck on a Windows box.
echo "text" | grep ([^<]*)
OR
cat filename | grep ([^<]*)

What I'd like is something like a
simple text editor that I can feed my
expression to (given a file or a
buffer full of pasted text) and have
it parse the expression and return a
document with only the results.
You have just described grep. This is exactly what grep does. What's wrong with it?

Do calculation on captured number in regex before using it in replacement

Using a regex, I am able to find a bunch of numbers that I want to replace. However, I want to replace the number with another number that is calculated using the original - captured - number.
Is that possible in notepad++ using a kind of expression in the replacement-part?
Edit: Maybe a strange thought, but could the calculation be done in the search part, generating a second captured number that would effectively be the result?

Even if it is possible, it will almost certainly be "messy" - why not do the replacements with a simple script instead? For example..
#!/usr/bin/env ruby
f = File.new("f1.txt", File::RDWR)
contents = f.read()
contents.gsub!(/\d+/){|m|
m.to_i + 1 # convert the current match to an integer, and add one
}
f.truncate(0) # empty the existing file
f.seek(0) # seek to the start of the file, before writing again
f.write(contents) # write modified file
f.close()
..and the output:
$ cat f1.txt
This was one: 1
This two two: 2
$ ruby replacer.rb
$ cat f1.txt
This was one: 2
This two two: 3
In reply to jeroen's comment,
I was actually interested if the possibility existed in the regular expression itself as they are so widespread
A regular expression is really just a simple pattern matching syntax. To do anything more advanced than search/replace with the matches would be up to the text-editors, but the usefulness of this is very limited, and can be achieved via scripting most editors allow (Notepad++ has a plugin system, although I've no idea how easy it is to use).
Basically, if regex/search-and-replace will not achieve what you want, I would say either use your editors scripting ability or use an external script.

Is that possible in notepad++ using a kind of expression in the replacement-part?
Interpolated evaluation of regular-expression matches is a relatively advanced feature that I probably would not expect to find in a general-purpose text editing application. I played around with Notepad++ a bit but was unable to get this to work, nor could I find anything in the documentation that suggests this is possible.

Hmmm... I'd have to recommend AWK to do this.
http://en.wikipedia.org/wiki/AWK

notepad++ has limited regular expressions built in. There are extensions that add a bit more to the regular expression find and replace, but I've found those hard to use. I would recommend writing a little external program to do it for you. Either Ruby, Perl or Python would be great for it. If you know those languages. I use Ruby and have had lots of success with it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js