Extract some strings - regex

need some help with this RegEx magic..
I have this:
<a href="/en/node/1032/delete?destination=node%2F5%2Fblog">delete</a>
and this:
(<a)*([^>]*>)[^<]*(</a>)
$1 = <a
$2 = href="/en/node/1032/delete?destination=node%2F5%2Fblog">
$3 = </a>
I need some additional strings:
1032
href="/en/ (the "en" is dynamic!)
How can I get these strings?
This is used in PHP.

Your sample could be captured with
(<a)\b.*?((href="/en/).*?(?<=/)(\d+)/.*?".*?>).*?(</a>)
...but perhaps replacing the "en" with something broader, depending on what you want to capture.
HOWEVER, and I want to emphasize this, don't use regex to parse HTML. The above regex won't work for certain HTML-valid input, and due to the limitations of regex it cannot be refined to work for every possible case. You'll get better, more correct results with an HTML or XML parser.

Try ([^/ ]+). That will give you
href="
en
node
1032
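
To make the capture-group idea concrete, here is a minimal sketch in Python (the question is about PHP, where preg_match accepts a very similar pattern). The pattern and the variable names are illustrative assumptions, and it only handles links shaped like the sample above.

```python
import re

# Sample link from the question.
html = '<a href="/en/node/1032/delete?destination=node%2F5%2Fblog">delete</a>'

# Group 1: the dynamic language segment right after href="/
# Group 2: the first path segment that consists only of digits.
pattern = re.compile(r'href="/([^/"]+)/(?:[^/"?]+/)*?(\d+)/')

match = pattern.search(html)
if match:
    lang, node_id = match.group(1), match.group(2)
    print(lang, node_id)
```

As the answer above says, a real HTML parser is the safer choice for anything beyond links of this known shape.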

Related

RegExp find wrong tags

I have some URLs saved in the DB like hello world
with break tags, so I need to delete them. The problem is that <br/> tags are in other places too, so I can't delete all of them.
I wrote the RegExp <*"*<br\/?>"> but it selects not only the <br> but the quotes too.
You really shouldn't be using regular expressions for parsing HTML or XML.
Having said that, as I understand it, you have br tags inside the href attribute of a tags.
Try:
href\s*?=\s*?\"(.*?)(<br\/?\>)\"
If you are trying to find the right rows in the database, this is your regex extended to match the whole line:
<.*\".*<br\/>\">.*>
After this you can match the '<br/>' directly in those lines. Which language do you use to edit your DB?
Some of the other answers here are okay. I'll offer an alternative:
https://regex101.com/r/uG5PBA/2
This'll put the break tags in a capture group -- group 1, so that you can simply nix them.
Regex:
<a[\s\S]*?(\<br\/>)[\s\S]*?<\/a>
Test String:
hello worldhello world
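
One way to remove only the break tags that sit inside an href value can be sketched in Python as follows. The sample string is hypothetical, since the original markup was not preserved in the question, and strip_br is a helper name invented for this example.

```python
import re

# Hypothetical sample: a <br/> inside the href value and another one outside it.
html = '<a href="http://example.com/page<br/>">hello world</a><br/>text'

def strip_br(match):
    # Remove break tags from the attribute value only.
    value = re.sub(r'<br\s*/?>', '', match.group(1))
    return 'href="' + value + '"'

# Rewrite each double-quoted href value; everything else is left untouched.
cleaned = re.sub(r'href\s*=\s*"([^"]*)"', strip_br, html)
print(cleaned)
```

The break tag outside the attribute survives, which is the behaviour the question asks for.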

Regex Python, Find Everything Inbetween Quotes after Keyword

I have strings that looks like this:
"Grand Theft Auto V (5)" border="0" src="/product_images/Gaming/Playstation4 Software/5026555416986_s.jpg" title="Grand... (the string continues for a while here)
I want to use regex to grab this: /product_images/Gaming/Playstation4 Software/5026555416986_s.jpg
Basically, everything in src="..."
At the moment I produce a list using re.findall(r'"([^"]*)"', line) and grab the appropriate one, but there's a lot of quotes in the full string and I'd like to be more efficient.
Can anyone help me put together an expression for this please?
Try with this
(?<=src=").+(?=" )
Use this as RE :
src="(.+?)"
This will return the result you want:
re.findall(r'src="(.+?)"', text_to_search_from)
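
If only the single src value is needed, a short sketch with re.search avoids building a list at all. The line below is the truncated sample from the question, and the variable names are illustrative.

```python
import re

# Truncated sample line from the question.
line = ('"Grand Theft Auto V (5)" border="0" '
        'src="/product_images/Gaming/Playstation4 Software/5026555416986_s.jpg" '
        'title="Grand')

# [^"]* stops at the closing quote, so no lazy quantifier is needed.
match = re.search(r'\bsrc="([^"]*)"', line)
src = match.group(1) if match else None
print(src)
```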

finding text between <script></script> tags with RegEx for Coldfusion including linebreaks

I am trying to extract javascript code from HTML content that I receive via CFHTTP request.
I have this simple regex that catches everything as long as there is no linebreak in the code between the tags.
var result=REMatch("<script[^>]*>(.*?)</script>",html);
This will catch:
<script>testtesttest</script>
but not
<script>
testtest
</script>
I have tried to use (?m) for multiline, but it doesn't work like that.
I am using the reference to figure it out but I am just not getting it with regex.
Heads up, normally there would be javascript between the script tags, not simple text so also characters like {}();:-_ etc.
Can anyone help me out?
Cheers
[[UPDATE]]
Thanks guys, I will try the solutions. I favor regex, but I will look into the HTML parser too.
(?m) multiline mode is for making ^ and $ match on line breaks (not just start/end of string as is default), but what you're trying to do here is make . include newlines - for that you want (?s) (dot-all mode).
However, I probably wouldn't do this with regex - an HTML parser is a more robust solution. Here's how to do it with jSoup:
var result = jsoup.parse(html).select('script').text();
More details on using jSoup in CF are available here, or alternatively you can use the TagSoup parser, which ships with CF10 (so you don't need to worry about jars/etc).
If you really want regex, then you can use this:
var result = rematch('<script[^>]*>(?:[^<]+|<(?!/script>))+',html);
Unlike using (?s).*? this avoids matching empty blocks (but it will still fail in certain edge cases - if accuracy is required use an HTML parser).
To extract just the text from the first script block, you can strip the script tag with this:
result = ListRest( result[1] , '>' );
You can use dot matches all mode or replace . with [\s\S] to get the same effect.
<script[^>]*>[\s\S]*?</script> would match everything including newlines.
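
The difference between multiline and dot-all mode is easy to see in a small Python sketch, where re.DOTALL plays the role of (?s). The HTML sample is made up for illustration, and the same caveat applies: a parser is more robust than this pattern.

```python
import re

# Made-up sample: one multi-line script block and one empty external script tag.
html = '<p>intro</p>\n<script>\nvar a = {x: 1};\n</script>\n<script src="x.js"></script>'

# re.DOTALL lets "." match newlines, so the body may span several lines.
pattern = re.compile(r"<script[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)

# Keep only non-empty bodies, trimmed of surrounding whitespace.
blocks = [body.strip() for body in pattern.findall(html) if body.strip()]
print(blocks)
```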

Regex Assistance for a url filepath

Can someone assist in creating a Regex for the following situation:
I have about 2000 records in which I need to do a search/replace on a known item in each record that looks like this:
<li>View Product Information</li>
The FILEPATH and FILE are variable, but the surrounding HTML is always the same. Can someone assist with what kind of Regex I would substitute for the "FILEPATH/FILE" part of the search?
You can match the constant parts and use grouping to put them back:
(<li>View Product Information</li>)
Then replace the match with $1your_replacement$2, where $1 is the first captured group and $2 the second (in Python, for instance, you would call Match.group(1) and Match.group(2)).
If you're using Java, you would have to escape the \ characters.
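
Here is a rough sketch of the grouping approach in Python. The anchor markup is an assumption, because the HTML in the question was not preserved, and both the record and the replacement path are hypothetical.

```python
import re

# Assumed markup: the real HTML was lost, so this anchor shape is a guess.
record = '<li><a href="docs/old/sheet.pdf">View Product Information</a></li>'

# Group 1 and group 2 are the constant parts; [^"]* is the variable FILEPATH/FILE.
pattern = re.compile(r'(<li><a href=")[^"]*(">View Product Information</a></li>)')

# \g<1> and \g<2> put the constant parts back around the new (hypothetical) path.
updated = pattern.sub(r'\g<1>new/path/file.pdf\g<2>', record)
print(updated)
```

The \g<1> form is used instead of \1 so that a replacement beginning with a digit cannot be misread as part of the group number.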

use regex to modify srt file?

one format of srt file looks like this:
0:00:04 --> 00:00:10
and another format looks like this
0:00:04,000 --> 00:00:10,000
I want to process the first kind of file and append ,000 to each timestamp for compatibility purposes, so that the file ends up in the second format shown above.
I was thinking of trying to use some string functions like mid(), right(), instring() but wondered if regex might do the job better, any suggestions on how to do this?
You can use this regex to match the first format:
^([0-9]{1,2}:[0-9]{2}:[0-9]{2}) --> ([0-9]{1,2}:[0-9]{2}:[0-9]{2})$
And then replace $1 by $1 + ",000" and $2 by $2 + ",000"
Since you don't indicate which language you used, I did a simple example in PHP :
<?php
// Append ",000" to both timestamps of a short-format cue line.
$pattern = '/^([0-9]{1,2}:[0-9]{2}:[0-9]{2}) --> ([0-9]{1,2}:[0-9]{2}:[0-9]{2})$/';
$replacement = '$1,000 --> $2,000';
echo preg_replace($pattern, $replacement, '0:00:04 --> 00:00:10');
// output : 0:00:04,000 --> 00:00:10,000
?>
With sed (it's available on Windows too):
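sed -i '/[0-9]\+:[0-9]\+:[0-9]\+ --> [0-9]\+:[0-9]\+:[0-9]\+/ s_\([0-9]\+:[0-9]\+:[0-9]\+\)\s*-->\s*\([0-9]\+:[0-9]\+:[0-9]\+\)\s*_\1,000 --> \2,000_' INPUT.srt
The file is edited in place. (This uses [0-9] rather than \d, because sed does not treat \d as a digit class; \+ and \s are GNU sed extensions.)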
(And I know it's not the correct regex to capture time definitions... but it works for this job.)
Sure, that sounds like a good idea. A simple approach would be to match for (\d?\d:\d\d:\d\d) and replace it with the match itself plus ,000 (for "the match itself" use a back reference, which might be something like \1 or $1, depending on your language).
Try implementing this, and if you need further help, start a new question where you mention what you have tried, where you are stuck and which language you are using.
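
The back-reference approach described above could look like this in Python. This is only a sketch: add_millis is an invented helper name, and the negative lookahead is an extra guard (not part of the answer) so that timestamps which already carry milliseconds are left alone.

```python
import re

# H:MM:SS or HH:MM:SS that is not already followed by a comma or another digit.
TIME = re.compile(r'\b(\d?\d:\d\d:\d\d)(?![,\d])')

def add_millis(text):
    # \1 is the matched timestamp; ",000" is appended after it.
    return TIME.sub(r'\1,000', text)

print(add_millis("0:00:04 --> 00:00:10"))
```

Running the same function over a file that is already in the long format leaves it unchanged, so it is safe to apply twice.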
Why not simply
sed -e '/ --> /s/ -->\|$/,000&/g' old.srt >new.srt
provided that old.srt consistently contains the shorter format only.