perl regex problem -- &amp; in yahoo finance page - regex

I found an old Perl hack on the O'Reilly site http://oreilly.com/pub/h/1041 and decided to check it out. After a little fiddling around it started to run, but the regexes are out of date.
Here is the question: with this
/<a href="\/q\/op\?s=(.*?)\&m=(.*?)">/
as the first line of regex, what needs to be modified to make the regex function again? The following are snippets from
http://finance.yahoo.com/q/op?s=FISV
<a href="/q/op?s=FISV&k=55.000000">
and
<a href="/q/os?s=FISV&m=2011-04-15">
.
The original hack is dated 2004, and back then option symbols looked like FQVAH or FQVFF instead of fisv110416c00060000 for a call option and fisv110416p00090000 for a put option. The first thing I did to get it going was to change all instances of $url to $curl, because until the name was changed the symbol was not being passed to Yahoo for lookup. The &amp; is giving me the most trouble. If this is found to run without modification I would be very surprised, and would very much like to know what system and perl -V output you have. SLES 10 and Perl 5.8.0 is what I am currently using.
Any suggestions would be helpful. It could be a useful script to anyone who is serious about protecting themselves from a falling equity market.
Thanks,
robm

I'm not /100%/ sure what you're asking, but if I'm understanding, you want a regex that will capture "fisv110416c00060000" and tell you the first few letters, whether it's a call or a put, and the amount?
If so, you're looking for something like:
/([a-z]+)(\d+)([cp])(\d+)/
That should capture the following for the first example
$1 = "fisv"
$2 = 110416
$3 = c
$4 = 00060000
The original regex was very specific to that HTML string. You can include the beginning bits of it if you need to check that the entire string is there as well. Of course, make your regex as tight as possible to avoid over-matches and wasted time pattern matching. I'm just not sure of the exact pattern you're trying to match (i.e., is it always "fisv"?).
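For anyone who wants to try the pattern outside Perl, here is a minimal Python sketch of the same four-group split. The function name `parse_option_symbol` is made up for illustration, and the readings of the digit groups (a YYMMDD expiry and a strike in thousandths of a dollar) are assumptions, not something stated in the question.

```python
import re
from datetime import datetime

# Same four groups as the pattern above: root, expiry digits, c/p flag, strike digits.
OPTION_RE = re.compile(r"([a-z]+)(\d+)([cp])(\d+)")


def parse_option_symbol(symbol):
    """Split a symbol such as 'fisv110416c00060000' into its parts, or return None."""
    m = OPTION_RE.fullmatch(symbol.lower())
    if m is None:
        return None
    root, expiry, kind, strike = m.groups()
    return {
        "root": root,
        # Assumption: the six digits are a YYMMDD expiry date.
        "expiry": datetime.strptime(expiry, "%y%m%d").date(),
        "type": "call" if kind == "c" else "put",
        # Assumption: the strike is stored in thousandths of a dollar.
        "strike": int(strike) / 1000,
    }
```

Using `fullmatch` rather than a plain search keeps the pattern from matching in the middle of some longer, unrelated string.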

You should either unescape the HTML first, which would turn the &amp; into a &, or just change the regex, like this:
/<a href="\/q\/os\?s=(.*?)\&(?:amp;)?m=(.*?)">/
To match both types of urls:
/<a href="\/q\/o[ps]\?s=(.*?)\&(?:amp;)?[mk]=(.*?)">/
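Both suggestions can be tried side by side. The following is a small Python sketch, not the original Perl hack: `extract_links` and the `sample` string are hypothetical, with the sample built from the two snippets in the question (one escaped, one not).

```python
import html
import re

# Accepts /q/op or /q/os, "&" or "&amp;", and either the m= or the k= parameter.
LINK_RE = re.compile(r'<a href="/q/o[ps]\?s=(.*?)&(?:amp;)?[mk]=(.*?)">')


def extract_links(page):
    """Return (symbol, value) pairs for every matching option link."""
    return LINK_RE.findall(page)


# Hypothetical sample: one link with an escaped ampersand, one without.
sample = (
    '<a href="/q/op?s=FISV&amp;k=55.000000">'
    '<a href="/q/os?s=FISV&m=2011-04-15">'
)

pairs = extract_links(sample)
# Alternative: unescape first, so the optional (?:amp;)? part is never needed.
pairs_unescaped = extract_links(html.unescape(sample))
```

Either route gives the same pairs here; the tolerant regex is simply the smaller change to an existing script.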

Related

How to use Regex to replace a tag in a word document with Powershell

First post on stackoverflow for me so sorry if something is out of norm or similar ^^
Currently I'm trying to find a way to read vouchers out of a .csv that I get from my pfsense.
Plan is to read it out of the .csv and write it down in a Word document so that secretaries can print it out and give them out to coworkers.
So far I have no problems replacing names and room numbers, all I gotta do now is to find a way to replace the voucher codes, but since they obviously always change I tried to use regex, here's the current state of that part of my code:
if ($Vouchers -match '((\d|\w){11})*') {
$matches.0 }
ReplaceTag -Document $Doc -FindText '<Vouchers>' -replacewithtext $matches
The regex itself is working perfectly fine (already tested it on regex101) so I guess it's the code.
I'm assuming that it's trying to literally match "((\d|\w){11})*" instead of using the pattern :\
Any kinda help would be welcomed!

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (eg: http://www.link.com/folder/file.html) from a document with notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will leave you with a list of all your links. There are two issues, though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
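To see the effect of this find/replace, including the leftover text after the last link, here is a rough Python sketch of the same substitution. The `keep_links` helper and the sample text are made up for illustration; `re.DOTALL` stands in for the ". matches newline" option.

```python
import re

# Same pattern as the Find field; re.DOTALL plays the role of ". matches newline".
FIND = re.compile(r".*?(http://www\.[a-zA-Z0-9./-]+)", re.DOTALL)


def keep_links(text):
    """Replace each 'junk followed by a link' stretch with the link and a newline."""
    return FIND.sub(r"\1\n", text)


# Hypothetical input: two links surrounded by other text.
text = "foo http://www.a.com/x.html bar\nhttp://www.b.com/y.html tail"
result = keep_links(text)
# result is "http://www.a.com/x.html\nhttp://www.b.com/y.html\n tail"
```

The trailing " tail" survives because no link follows it, which is the second issue listed above.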
The answer posted earlier by @psxls was a great help when I needed to do something similar.
However, that regex was written six years ago, so I had to adjust it to work with more recent links, because:
a lot of URLs now use HTTPS instead of HTTP
many websites no longer use www as their main subdomain
some links contain punctuation marks (which have to be preserved)
I finally changed the search pattern to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http://www\.[a-zA-Z0-9./-]\+' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thus:
grep -o 'https\?://[a-zA-Z0-9./-]\+' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different from most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy, and grep's default syntax has no lazy quantifiers: appending \? would only make the preceding item optional, not lazy (lazy matching needs a Perl-style engine such as grep -P, where available). The way you have your URL match formulated, greediness won't cause problems, but if you change your match to something much looser than [a-zA-Z0-9./-], check that it still stops at the end of each URL.
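If installing grep is not an option, the "print only the matches" idea can be sketched in a few lines of Python. This is an illustration rather than a drop-in replacement: `find_urls` and the sample strings are hypothetical, and the character class is the one used in the grep command above.

```python
import re

# http or https, no www requirement, same character class as the grep command.
URL_RE = re.compile(r"https?://[a-zA-Z0-9./-]+")


def find_urls(text):
    """Return every URL-like match, one list entry per match (like grep -o)."""
    return URL_RE.findall(text)


line = "see http://www.a.com/x.html and https://b.org/y for details"
urls = find_urls(line)

# A looser class such as [^ ] also picks up trailing punctuation like ")" or ",".
loose = re.findall(r"https?://[^ ]+", "(https://b.org/y), then")
```

The second call shows why a tight character class is usually worth keeping: the loose version returns "https://b.org/y)," with the punctuation attached.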
I did this a different way.
Find everything up to the first/next (https or http) (then everything that comes next) up to (html or htm), then output just the '(https or http)(everything next) then (html or htm)' with a line feed/ carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: <a href="\1\2\3">\1\2\3</a>\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer won't be RegEx related, but here is another efficient way to get lines containing URLs.
This won't remove text around links like Toto mentioned in comments.
It works at least when all the links share a nice pattern, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

underscore to camelCase RegEx

Our standards have changed and I want to do a 'find and replace' in, say, Dreamweaver (it allows for RegEx; we also just got Visual Studio 2010, if that allows searching by RegEx) for all the underscores and camelCase them.
What would be the RegEx to do that?
RegEx baffles me. I definitely need to do some studying.
Thanks in advance!
Update: A little more info - I'm searching within my html,aspx,cfm or css documents for any string that contains an underscore and replacing it with the following letter capitalized.
I had this problem, but I also needed to handle converting fields like gap_in_cover_period_d_last_5_yr into gapInCoverPeriodDLast, and found that 9 out of 10 other sed expressions don't like one-letter words or numbers.
So to answer the question, use sed. sed's s command is the equivalent of the :s command in vim, so the expression below works as a command in either one (the trailing c is vim's confirm flag; leave it off for sed).
This seemed to work, although I did have to run it twice: word_a_word becomes wordA_word on the first pass and wordAWord on the second. (sed backward commands are just too magical for my muggle blood.)
s/\([A-Za-z0-9]\+\)_\([0-9a-z]\)/\1\U\2/gc
I recently had to approach a similar situation you asked about. Here is a regex I've been using in VIM which does the job for me:
%s/_\([a-zA-Z]\)/\u\1/g
As an example:
this_is_a_test becomes thisIsATest
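The same "drop the underscore, upper-case the next character" rule can be sketched in Python for anyone scripting the conversion instead of using an editor. The `to_camel` name is made up, and allowing digits after the underscore is an assumption added so that names like last_5_yr are handled too.

```python
import re


def to_camel(name):
    """Drop each underscore and upper-case the character that follows it."""
    # Each match consumes only "_x", so one-letter words need no second pass.
    return re.sub(r"_([a-zA-Z0-9])", lambda m: m.group(1).upper(), name)


print(to_camel("this_is_a_test"))  # thisIsATest
print(to_camel("word_a_word"))     # wordAWord
```

Because the pattern does not capture the text before the underscore, adjacent matches never overlap and a single pass is enough.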
I don't think there is a good way to do this purely with regex. Searching for _ characters is easy, something like ._. should work to find an _ with something on either side, but you need a more powerful scripting system to change the case of the character following the _. I suggest perl :)
I have a solution in PHP
preg_replace_callback('/_(.)/', function ($m) { return strtoupper($m[1]); }, $str)
There may be a more elegant selector criteria but I wanted to keep it simple.

REGEX: best practice to insert before, after or between?

i'm nervous as hell asking this question since there's a LOT of RegEx posts out there. but i'm asking for best method as well, so i'm going to risk it (fully expecting a rep hit if i botch the job...)
i've been given a list to reformat. 120 questions and answers (240 tag sets total). * glark * all i need to do is make the text between the tags a link, like so:
<li>do snails make your feet itch?</li>
has to become
<li><a href="#n">do snails make your feet itch?</a></li>
THIS IS NOT A JAVASCRIPT/PHP RegEx question. it is JUST RegEx that i can drop into the search/replace fields of my IDE. i'll likely try and do a batch replace afterwards with PERL to insert the 'n' variable so the links point properly.
and i know you're going to ask 'if you can use PERL for that, why not the whole shebang?' and that's a valid question, but i want to be using RegEx more for the power it has for big lists like this. plus my PERL skills are sketchy at best... unless you want to tack that on as well... :D heh heh.
if this question can't be answered or is wrong for this part of the forum, please accept my apologies and point me in the right direction.
many thanks!
WR!
You can do it in two steps.
Substitute <li> with <li><a href="#n">
Substitute </li> with </a></li>
Or you can try to be clever and do it in one. Here is a substitute command in Perl syntax ($1 references what was matched in the brackets).
s,<li>(.*)</li>,<li><a href="#n">$1</a></li>,
And while you are there it's easy to replace the second part of the replacement pattern with an expression that will increment n
s,<li>(.*)</li>,q{<li><a href="#} . ++$n . qq{">$1</a></li>},e
See how you can run this from the command line:
echo '<li>do snails make your feet itch?</li>' |
perl -pe 's,<li>(.*)</li>,q{<li><a href="#} . ++$n . qq{">$1</a></li>},e'
<li><a href="#1">do snails make your feet itch?</a></li>
Search
<li>(.*?)</li>
Replace
<li><a href="#n">$1</a></li>
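For the numbering step the question mentions doing afterwards, here is a hedged Python sketch that wraps each item in an anchor and fills in n in one pass. The `link_items` function and its `start` parameter are hypothetical, and the #1, #2, ... numbering scheme is an assumption about how the links should point.

```python
import re
from itertools import count


def link_items(html_text, start=1):
    """Wrap the text of every <li> in an anchor numbered #start, #start+1, ..."""
    counter = count(start)

    def wrap(m):
        # m.group(1) is the text between <li> and </li>.
        return '<li><a href="#{}">{}</a></li>'.format(next(counter), m.group(1))

    # The lazy (.*?) keeps two items on one line from merging into one match.
    return re.sub(r"<li>(.*?)</li>", wrap, html_text)
```

For example, `link_items("<li>do snails make your feet itch?</li>")` returns the item wrapped in an anchor pointing at #1.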

Do calculation on captured number in regex before using it in replacement

Using a regex, I am able to find a bunch of numbers that I want to replace. However, I want to replace the number with another number that is calculated using the original - captured - number.
Is that possible in notepad++ using a kind of expression in the replacement-part?
Edit: Maybe a strange thought, but could the calculation be done in the search part, generating a second captured number that would effectively be the result?
Even if it is possible, it will almost certainly be "messy" - why not do the replacements with a simple script instead? For example..
#!/usr/bin/env ruby
f = File.new("f1.txt", File::RDWR)
contents = f.read()
contents.gsub!(/\d+/) { |m|
  m.to_i + 1 # convert the current match to an integer, and add one
}
f.truncate(0) # empty the existing file
f.seek(0) # seek to the start of the file, before writing again
f.write(contents) # write modified file
f.close()
..and the output:
$ cat f1.txt
This was one: 1
This two two: 2
$ ruby replacer.rb
$ cat f1.txt
This was one: 2
This two two: 3
In reply to jeroen's comment,
I was actually interested if the possibility existed in the regular expression itself as they are so widespread
A regular expression is really just a simple pattern matching syntax. To do anything more advanced than search/replace with the matches would be up to the text-editors, but the usefulness of this is very limited, and can be achieved via scripting most editors allow (Notepad++ has a plugin system, although I've no idea how easy it is to use).
Basically, if regex/search-and-replace will not achieve what you want, I would say either use your editor's scripting ability or use an external script.
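As one more example of the external-script route, the same "calculate on the captured number" idea looks like this in Python. This is only a sketch: `bump_numbers` and its `delta` parameter are made-up names, and adding one simply mirrors the Ruby example.

```python
import re


def bump_numbers(text, delta=1):
    """Replace every run of digits with that number plus delta."""
    # The replacement is a function, so any calculation on the match is possible.
    return re.sub(r"\d+", lambda m: str(int(m.group(0)) + delta), text)


print(bump_numbers("This was one: 1"))  # This was one: 2
```

Swapping the lambda for a different expression (multiplying, formatting with leading zeros, and so on) is the part a plain replacement string cannot do.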
Is that possible in notepad++ using a kind of expression in the replacement-part?
Interpolated evaluation of regular-expression matches is a relatively advanced feature that I probably would not expect to find in a general-purpose text editing application. I played around with Notepad++ a bit but was unable to get this to work, nor could I find anything in the documentation that suggests this is possible.
Hmmm... I'd have to recommend AWK to do this.
http://en.wikipedia.org/wiki/AWK
notepad++ has limited regular expressions built in. There are extensions that add a bit more to the regular expression find and replace, but I've found those hard to use. I would recommend writing a little external program to do it for you. Ruby, Perl or Python would all be great for it, if you know those languages. I use Ruby and have had lots of success with it.