REGEX: best practice to insert before, after or between?

i'm nervous as hell asking this question since there's a LOT of RegEx posts out there. but i'm asking for best method as well, so i'm going to risk it (fully expecting a rep hit if i botch the job...)
i've been given a list to reformat. 120 questions and answers (240 tag sets total). * glark * all i need to do is make the text between the tags a link, like so:
<li>do snails make your feet itch?</li>
has to become
<li><a href="#n">do snails make your feet itch?</a></li>
THIS IS NOT A JAVASCRIPT/PHP RegEx question. it is JUST RegEx that i can drop into the search/replace fields of my IDE. i'll likely try and do a batch replace afterwards with PERL to insert the 'n' variable so the links point properly.
and i know you're going to ask 'if you can use PERL for that, why not the whole shebang?' and that's a valid question, but i want to be using RegEx more for the power it has for big lists like this. plus my PERL skills are sketchy at best... unless you want to tack that on as well... :D heh heh.
if this question can't be answered or is wrong for this part of the forum, please accept my apologies and point me in the right direction.
many thanks!
WR!

You can do it in two steps.
Substitute <li> with <li><a href="#n">
Substitute </li> with </a></li>
Or you can try to be clever and do it in one. Here is a substitute command in Perl syntax ($1 references what was matched in the parentheses).
s,<li>(.*)</li>,<li><a href="#n">$1</a></li>,
And while you are there it's easy to replace the second part of the replacement pattern with an expression that will increment n
s,<li>(.*)</li>,q{<li><a href="#} . ++$n . qq{">$1</a></li>},e
See how you can run this from the command line:
echo '<li>do snails make your feet itch?</li>' |
perl -pe 's,<li>(.*)</li>,q{<li><a href="#} . ++$n . qq{">$1</a></li>},e'
<li><a href="#1">do snails make your feet itch?</a></li>
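For comparison, the same one-pass idea (wrap the item text in a link and number the links as you go) can be sketched in Python. This is only an illustration, not part of the answer above: the function name link_items is made up, and the lazy (.*?) is my choice so that several items on one line are handled separately.

```python
import re
from itertools import count

def link_items(html: str) -> str:
    """Wrap the text of each <li>...</li> in an anchor numbered 1, 2, 3, ..."""
    counter = count(1)  # running number shared by all matches in this call

    def repl(match: re.Match) -> str:
        # match.group(1) is the text captured between <li> and </li>
        return f'<li><a href="#{next(counter)}">{match.group(1)}</a></li>'

    # (.*?) is lazy, so two items on the same line become two links
    return re.sub(r"<li>(.*?)</li>", repl, html)

print(link_items("<li>do snails make your feet itch?</li>"))
```

Using a replacement function plays the same role as Perl's /e flag: the replacement is computed per match instead of being a fixed string.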

Search
<li>(.*?)</li>
Replace
<li><a href="#n">$1</a></li>

Related

Regular expression cannot match "</p>" correctly

everyone.
I'm having trouble using a regular expression to grep text out of HTML that contains </p> tags.
I'm using unsung hero.*</p> to grep the paragraph I'm interested in, but I cannot make the match stop at the next </p>
The command I use is:
egrep "unsung hero.*</p>" test
and in test is a webpage like:
<p>There are going to be outliers among us, people with extraordinary skill at recognizing faces. Some of them may end up as security officers or gregarious socialites or politicians. The rest of us are going to keep smiling awkwardly at office parties at people we\'re supposed to know. It\'s what happens when you stumble around in the 21st century with a mind that was designed in the Stone Age.</p>\n <p>(SOUNDBITE OF MUSIC)</p>\n <p>VEDANTAM: This week\'s show was produced by Chris Benderev and edited by Jenny Schmidt. Our supervising producer is Tara Boyle. Our team includes Renee Cohen, Parth Shah, Laura Kwerel, Thomas Lu and Angus Chen.</p>\n <p>Our unsung hero this week is Alexander Diaz, who troubleshoots technical problems whenever they arise and has the most unflappable, kind disposition in the face of whatever crisis we throw his way. Producers at NPR have taken to calling him Batman because he\'s constantly, silently, secretly saving the day. Thanks, Batman.</p>\n <p>If you like today\'s episode, please take a second to share it with a friend. We\'re always looking for new people to discover our show. I\'m Shankar Vedantam, and this is NPR.</p>\n <p>(SOUNDBITE OF MUSIC)</p>\n\n <p class="disclaimer">Copyright © 2019 NPR. All rights reserved. Visit our website terms of use and permissions pages at www.npr.org for further information.</p>\n\n <p class="disclaimer">NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.</p>\n</div><div class="share-tools share-tools--secondary" aria-label="Share tools">\n <ul>\n
I'm expecting to match before
</p>\n <p>If you like
But it actually went way further than that.
I suspect the regular expression I used is the problem, but I don't know what is wrong with it. Any help will be appreciated.
Thanks!
20190523:
Thanks for your guys' suggestions.
I tried
egrep "unsung hero.*?</p>" test
But it didn't give me the result I want, instead it's like
Leo, I feel like this is a useful expression and I'd like to get it right. could you explain a bit?
The other test I did for
[^<]*
Actually gave the result expected
With .* the match will be greedy and match the longest substring possible. (Which is in your case until the last paragraph.)
What you actually want is a non-greedy match with .*?
Your specific command should most likely look like this:
grep -P -o "unsung hero.*?</p>" test
Another solution would be to extend your regex to the end of the string/webpage and then pick out the wanted substring with a group.
UPDATE
As Charles Duffy pointed out correctly, this will not work with the standard (POSIX ERE) syntax. Therefore the command above uses the -P flag (a GNU grep extension) to specify that it is a Perl-compatible regular expression.
If your system or application does not support Perl regular expressions and you are ok with matching until the first < (instead of matching until the first </p>), matching every character except < is the way to go.
With this, the complete command should look like this:
grep -o "unsung hero[^<]*</p>" test
Thanks to Charles for pointing that out in the comments.
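The difference between the three patterns is easy to see on a small input. The sketch below uses Python's re module, which supports the lazy .*? quantifier; the sample text is a shortened, made-up stand-in for the transcript, not the real page.

```python
import re

# Shortened stand-in for the page: two paragraphs on one line
text = "<p>Our unsung hero this week is Batman.</p> <p>If you like the show, share it.</p>"

greedy = re.search(r"unsung hero.*</p>", text).group(0)      # .* grabs as much as possible
lazy = re.search(r"unsung hero.*?</p>", text).group(0)       # .*? stops at the first </p>
negated = re.search(r"unsung hero[^<]*</p>", text).group(0)  # [^<]* cannot cross any tag

print(greedy)
print(lazy)
print(negated)
```

The greedy match runs through to the last </p> on the line, while the lazy and negated-class versions both stop at the end of the first paragraph. The negated class works in any regex flavour, but it fails if the paragraph itself contains another tag.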

UNIX: How would I grep in a script using a variable as a search parameter for a file?

Before I Start, this isn't exactly how it seems and I did search the web for a while before coming here. Basically I have a script where the user passes in a string and stores it in a variable. I then have to take that word and search for all the subwords that could be made from it in a dictionary file. The problem I am having is I need to make sure the words are at least 4 characters long. I do not have the best grasp on regular expressions. I'm aware of the techniques you can use just logically can't piece it together sometimes. I will show you the line of code and explain my reasoning behind why I think it should be this way. Then, could someone correct me on my logic? I am not looking for someone to send me the working line of code but perhaps correct my logic so I can understand better and derive the answer on my own.
words=$(grep -iE '(["$text"]{4,})' /usr/dict/words)
echo "$words"
For example if I pass in string college I should get output like
cell
cello
clee
cleg
etc.....
I am storing the command in another variable to echo. I am not sure why exactly, It just seems from what I saw online most people were rather fond of this. Using grep with -i for ignore case and -E for regular expression or (egrep) I believe the expression needs to be enclosed in single quote parenthesis for expressions. $text is the variable I stored the users input in. I know $ usually signifies the ending in and [] is a range and "" makes it read the variable rather than print what is there. Then {4,} meaning four or more characters. then the last part is the path to the file. Any input would be appreciated and again, I do not like being spoon fed answers it's an easy way to learn nothing. I would just like corrections on my logic if all possible. Thanks everyone!!
If by "subwords" you mean permutations of its letters, then your command is fine except for the quotes. Unfortunately you have to do it like this:
words=$(grep -iE '(['"$text"']{4,})' /usr/dict/words)
This way you pass to grep the single quoted string so that the shell doesn't interpret its special symbols. But at the same time you have to expand your $text var, thus you have to make a gap inside your single-quoted string, and in that gap place your variable in double quotes.
Hope I didn't spoil it for you.
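One caveat on the logic, separate from the quoting: a bracket expression only restricts which letters may appear. It does not limit how often each letter is used, and without ^ and $ it matches any line that merely contains such a run. The Python sketch below contrasts the anchored class with a letter-count check. It rests on my assumption that a "subword" may use each letter at most as often as it occurs in the input, and the function names are made up.

```python
import re
from collections import Counter

def class_matches(text: str, word: str) -> bool:
    """What an anchored ^[letters]{4,}$ accepts: only letters from text, any number of times."""
    pattern = re.compile(rf"^[{re.escape(text)}]{{4,}}$", re.IGNORECASE)
    return pattern.search(word) is not None

def is_subword(text: str, word: str) -> bool:
    """True if word has 4+ letters and uses no letter more often than text provides it."""
    need = Counter(word.lower())
    have = Counter(text.lower())
    # Counter subtraction keeps only positive counts, so an empty result means "enough letters"
    return len(word) >= 4 and not (need - have)

for candidate in ["cell", "cello", "cleg", "google"]:
    print(candidate, class_matches("college", candidate), is_subword("college", candidate))
```

"google" passes the character class because g, o, l and e are all in "college", but it needs two g's and two o's, so the count check rejects it.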

basic refresher needed: regexp syntax to replace a grep that doesn't work

This will probably take any of you folks about 1 minute to answer, so I apologize for the brain lapse. But I've overthought this so much that I am totally forgetting what I know is simple regexp using Perl.
I have an array @array containing several values:
ball_123456789
glove_234578901
bat_1458158568
ball_6319254815
hat_2343581451
ball_again_3353585885
ball_4845555555
racket_343581558
... and I want to extract only the elements in the array beginning with "ball_" (but not "ball_again_", above.) In other words, I want @found to include ball_123456789, ball_6319254815, ball_4845555555.
Obviously something like "@found = grep /ball_/, @array" isn't effective because it would grab not only "ball_123456789" but "ball_again_3353585885".
What I lack is enough knowledge in regexp to formulate an effective pattern-matching statement.
Help ?
ball_\d+
This simple regex should do it for you. See demo:
https://regex101.com/r/vD5iH9/29
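To check the pattern against the sample data, here is a small Python sketch. The list mirrors the values from the question; fullmatch is my addition, and it means the whole element must be "ball_" followed only by digits.

```python
import re

items = [
    "ball_123456789", "glove_234578901", "bat_1458158568", "ball_6319254815",
    "hat_2343581451", "ball_again_3353585885", "ball_4845555555", "racket_343581558",
]

# fullmatch anchors the pattern at both ends of each element
found = [item for item in items if re.fullmatch(r"ball_\d+", item)]
print(found)
```

"ball_again_3353585885" is rejected because the character after "ball_" is a letter, not a digit. The Perl equivalent of the anchored form would be grep /^ball_\d+$/, @array.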

perl regex problem -- $amp in yahoo finance page

I found an old perl hack on the O'Reilly site http://oreilly.com/pub/h/1041 and decided to check it out. After a little fiddling around it started to run but the regex are out of date.
Here is the question: with this
/<a href="\/q\/op\?s=(.*?)\&m=(.*?)">/
as the first line of regex, what needs to be modified to make the regex function again? The following are snippets from
http://finance.yahoo.com/q/op?s=FISV
<a href="/q/op?s=FISV&amp;k=55.000000">
and
<a href="/q/os?s=FISV&amp;m=2011-04-15">
.
The original hack is dated 2004 and option symbols looked like this (FQVAH or FQVFF) back then instead of fisv110416c00060000 for a call option and fisv110416p00090000 for a put option. First thing I did to get it going was to modify all instances of $url to $curl because until the name was changed the symbol was not being passed to yahoo for lookup. The &amp is giving me the most trouble. If this is found to run without modification I would be very surprised and would very much like to know what system and perl -V is installed. SLES 10 and perl 5.8.0 is what I am currently using.
Any suggestions would be helpful. It could be a useful script to anyone who is serious about protecting themselves from a falling equity market.
Thanks,
robm
I'm not /100%/ sure what you're asking, but if I'm understanding, you want a regex that will capture "fisv110416c00060000" and tell you the first few letters, whether it's a call or a put, and the amount?
If so, you're looking for something like:
/([a-z]+)(\d+)([cp])(\d+)/
That should capture the following for the first example
$1 = "fisv"
$2 = "110416"
$3 = "c"
$4 = "00060000"
The original regex was very specific to that html string. You can include the beginning bits of it if you need to use it to check that the entire string is there as well. Of course, make your regex as tight as possible to avoid over-matches and wasted time pattern matching. I'm just not sure the exact pattern you're trying to match (ie: is it always "fisv"?).
You should either first unescape the HTML, which would turn the &amp; into a &, or just change the regex, like this:
/<a href="\/q\/os\?s=(.*?)\&(?:amp;)?m=(.*?)">/
To match both types of urls:
/<a href="\/q\/o[ps]\?s=(.*?)\&(?:amp;)?[mk]=(.*?)">/
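A quick way to confirm that the combined pattern accepts both the escaped and the unescaped ampersand is to run it over both forms. The sketch below is in Python rather than Perl, and the two sample strings are adapted from the snippets in the question, one with &amp; and one with a bare &.

```python
import re

# Same pattern as above; Python has no need to escape "/" inside a regex
LINK = re.compile(r'<a href="/q/o[ps]\?s=(.*?)&(?:amp;)?[mk]=(.*?)">')

samples = [
    '<a href="/q/op?s=FISV&amp;k=55.000000">',  # HTML-escaped ampersand
    '<a href="/q/os?s=FISV&m=2011-04-15">',     # plain ampersand
]

for sample in samples:
    symbol, value = LINK.search(sample).groups()
    print(symbol, value)
```

The optional non-capturing group (?:amp;)? is what lets one pattern cover both spellings without shifting the capture group numbers.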

Need simple regex for LaTeX

In my LaTeX files, I have literally thousands of occurrences of the following construct:
$\displaystyle{...math goes here...}$
I'd like to replace these with
\mymath{...math goes here...}
Note that the $'s disappear, but the curly braces remain---if not for the trailing $, this would be a basic find-and-replace. If only I knew any regex, I'm sure it would handle this with no problem. What's the regex I need to make this happen?
Many thanks in advance.
Edit: Some issues and questions have arisen, so let me clarify:
Yes, $\displaystyle{ ... }$ can occur multiple times on the same line.
No, nested }$'s (such as $\displaystyle{...{more math}$...}$) cannot occur. I mean, I suppose it could if you put it in an \mbox or something, but I can't imagine why anyone would ever do that inside a $\displaystyle{}$ construct, the purpose of which is to display math inline with text. At any rate, it's not something I've ever done or am likely to do.
I tried using the perl suggestion, but while the shell raised no objections, the files remained unaffected.
I tried using the sed suggestion, but the shell objected to an "unexpected token near `('". I've never used sed before (and "man sed" was obtuse), but here's what I did: navigated to a directory containing .tex files and typed "sed s/\$\\displaystyle({[^}]+})\$/\\mymath\1/g *.tex". No luck. How do I use sed to do what I want?
Again, many many thanks for all offered help.
Be very careful when using REGEX to do this type of substitution
because the theoretical answer is that
REGEX is incapable of matching this type of pattern.
REGEX is a finite state machine; it does not incorporate a pushdown stack so
it cannot work with nested structures such as "{...math goes here...}" if
there is any possibility of nesting such that something like "{more math}$"
can appear as part of a "math goes here" string. You need at a minimum a
context free grammar to describe this type of construct - a state machine
just doesn't cut it!
Now having said that, you may still be able to pull this off using REGEX
provided none of your "math goes here" strings are more complex than
what a state machine can handle.
Give it a shot.... but beware of the results!
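sed (use -E for extended syntax and single-quote the script so the shell leaves it alone):
sed -E 's/\$\\displaystyle(\{[^}]+\})\$/\\mymath\1/g'
perl (the leading $ must be escaped too, otherwise it is not treated as a literal dollar sign and the pattern never matches):
perl -pi -e 's/\$\\displaystyle(\{.*)\}\$/\\mymath$1}/g' *.tex
If several }$ occur on the same line, you need the non-greedy version:
perl -pi -e 's/\$\\displaystyle(\{.*?)\}\$/\\mymath$1}/g' *.tex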
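The non-greedy substitution can be tried on a single line before touching any files. Below is a minimal Python sketch of the same replacement; the sample sentence is made up, and it relies on the asker's statement that }$ never occurs inside the math.

```python
import re

# \$ and \\ match a literal dollar sign and backslash; (\{.*?\}) lazily captures the braced math
PATTERN = re.compile(r"\$\\displaystyle(\{.*?\})\$")

def convert(line: str) -> str:
    # A function replacement sidesteps escape processing of "\m" in a replacement template
    return PATTERN.sub(lambda m: r"\mymath" + m.group(1), line)

src = r"Let $\displaystyle{x^2}$ and $\displaystyle{\frac{a}{b}}$ be given."
print(convert(src))
```

Because the lazy match ends at the first }$ rather than the first }, inner braces such as \frac{a}{b} are handled, which the [^}]+ variant cannot do.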