Removing newline from text within tags - regex

I would like to remove all the newlines within a specific html-tag that contains a block of text.
I'm sure this is basic stuff, but I have no experience with regex, so any help would be welcome.
Thanks

You haven’t specified your language, so I’ll just give you the regex (no code):
\n(?=[^<>]*</)
Replace all matches with a blank (to “delete” them).
This assumes well-formed markup with no stray angle brackets in the text (strictly speaking, HTML is not a subset of XML; XHTML is).
It works by requiring any matched newline to be followed by characters such that the next angle bracket encountered is a closing tag.
It’s not bulletproof, but will probably work for most cases, and hopefully your case.
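Since no language was specified, here is a minimal sketch of that replacement in Python, assuming the markup is available as a string. The function name and the sample text are made up for illustration.

```python
import re

# A newline matches only if the next angle bracket after it starts a closing tag ("</").
NEWLINE_BEFORE_CLOSING_TAG = re.compile(r"\n(?=[^<>]*</)")

def strip_newlines_in_text(markup: str) -> str:
    """Delete newlines that sit inside the text of an element."""
    return NEWLINE_BEFORE_CLOSING_TAG.sub("", markup)

sample = "<p>first line\nsecond line\n</p>\n<p>next</p>"
result = strip_newlines_in_text(sample)
print(result)
```

The newline between `</p>` and `<p>` survives because the next bracket after it opens a tag. A newline sitting between two closing tags would be removed as well, which is one of the ways this pattern is not bulletproof.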

I guess you want to do this :
str.replace(/<(html|div)>(.*)\n+(?=[\s\S]*<\/\1>)/g, "<$1>$2 ")
This regex targets the html or div tags; you can add more by extending the alternation, e.g. (html|div|p|input|html6tag).
However, you have to rerun the replacement until no more matches are found.
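The "run until nothing is replaced" loop can be sketched in Python as follows. This is only an illustration under the same assumptions as the pattern above (tags without attributes); the function name is hypothetical.

```python
import re

# Same pattern as above: an opening tag, the rest of that line, then one or more
# newlines, provided the matching closing tag appears somewhere later.
TAG_LINE_BREAK = re.compile(r"<(html|div)>(.*)\n+(?=[\s\S]*</\1>)")

def join_lines_in_tags(markup: str) -> str:
    """Repeat the substitution until a pass makes no replacement."""
    while True:
        markup, count = TAG_LINE_BREAK.subn(r"<\1>\2 ", markup)
        if count == 0:
            return markup

sample = "<div>a\nb\nc</div>"
joined = join_lines_in_tags(sample)
print(joined)
```

Each pass only joins the first line break after the opening tag, which is why the loop is needed.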

Regex: Replace double double quotes (solved), but only in lines that contain a special string (subcondition unsolved)

1. Summary of the problem
I have a csv file where I want to replace normal quotes in text with typographic ones.
It was hard (because HTML is also included), but I have meanwhile created a good regex expression that does just the right thing: in three "capturing groups" I find the left and right quotation marks and the text inside. Replacing then is a piece of cake.
2. Regex engine
I can use the regex engine of Notepad++ (Boost) or a PCRE2-compatible one; for development and testing purposes I have used https://regex101.com.
3. What I'm having a hard time with and just can't get right, where I need your help is here:
I want to add a sub-condition so that the text in quotes is found only in certain lines. I want to identify these lines by the language, e.g. ENGLISH or FRENCH (see also the example in the screenshot).
Screenshot of a sample
The string indicating the language is always in the same line before the text to be found, BUT only the text in quotes (main condition) should be marked after matching the sub condition, so that I will be able to replace them.
It is about a few thousand records in the csv file, in the worst case I could also replace it manually. But I'm pretty sure that this should also work via regex.
4. What I have tried
Different approaches with look arounds and non-capturing groups didn't lead me to the desired result - possibly because I didn't really understand how they work.
An example can be found here: https://regex101.com/r/ketwwm/1
It only contains the regex to match and mark the (three) groups WITHOUT the sub-condition I am looking for:
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
Hopefully someone in the community can help. (Hopefully I have not missed anything; it's my first post here.)
5. Update 03/18/2022: Almost resolved with two slightly different approaches (thank you all!) What is still unsolved ..
Solution of @Thefourthbird (see answer 1)
^(?!.*?"ENGLISH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
Nearly perfect, just missing matches in an HTML section. HTML sections in the csv file are always enclosed by double quotes and may have line feeds (LF). https://regex101.com/r/x5shnx/1
Solution of @Wiktor Stribiżew (see in comments below)
^.*?"ENGLISH".*?\K("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
The same with matches in HTML sections, see above. Plus: Doesn't match text in double double quotes if more than one such entry occurs within a text. https://regex101.com/r/I4NTdb/1
Screenshot (only to illustrate)
If you want to match multiple occurrences, you can use (*SKIP)(*F) to match and discard all lines that do not start with FRENCH:
^"(?!FRENCH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
The pattern matches:
^ Start of string
" Match literally
(?!FRENCH") Negative lookahead, assert not FRENCH" directly to the right
[^"]*" Match any char except " and match "
.*(*SKIP)(*F) Match the rest of the line and skip it
| Or
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$))) Your current pattern
Regex demo
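Python's built-in re module does not support (*SKIP)(*F), so the sketch below swaps in a different technique with the same intent: filter the lines by language first, then apply the quote pattern only to the lines that qualify. It assumes the language is the first quoted field on each line, as in the pattern above; the function name and sample rows are made up.

```python
import re

# The quote pattern from the question, unchanged.
QUOTED = re.compile(r'("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))')

def curl_quotes_for_language(csv_text: str, language: str = "FRENCH") -> str:
    """Replace ""text"" with typographic quotes, but only on lines of one language."""
    prefix = f'"{language}"'
    out = []
    for line in csv_text.split("\n"):
        if line.startswith(prefix):
            # Keep group 2 (the inner text) and swap the doubled quotes around it.
            line = QUOTED.sub(lambda m: "\u201c" + m.group(2) + "\u201d", line)
        out.append(line)
    return "\n".join(out)

rows = '"ENGLISH","He said ""hello"" twice"\n"FRENCH","Il a dit ""bonjour"" hier"'
converted = curl_quotes_for_language(rows)
print(converted)
```

Because it works line by line, this sketch does not handle quoted HTML sections that span several lines.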

Regex strip all html except background style url

I have the following regex that will find all the background style URLs in my HTML. I'm trying to strip all the HTML except for the background image URLs. My goal is to abstract a list of background image URLs from my HTML page.
Expression URL\(\s*(['"]?)(.*?)\1\s*\)
Example HTML
<img style="background-image: url(http://domain.com/2003-Th.jpg)">
I'd just like to match the inverse of this expression.
I don't know netbeans ide, so this is a guess only.
But beware: you search for url(...) everywhere. It does not matter where the text occurs: in a css block, in html style-attributes, in javascript, but also in pure text and comments!
General modifications
If you really want to include background-images only, you should state that in your regex, too. So it becomes
\bbackground-image\s*:\s*URL\(\s*(['"]?)(.*?)\1\s*\)
To speed things up (at least in some implementations), try to prevent backreferences. In this case
\bbackground-image\s*:\s*URL\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)
It's a bit more, but at least in sublime text it's worth it.
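If the goal is simply a list of the URLs, the backreference-free pattern can also be used to collect matches directly instead of deleting everything else. A small sketch in Python, assuming case-insensitive matching (the pattern is written with URL while the sample HTML uses url); the function name and sample markup are illustrative.

```python
import re

# Three alternatives: single-quoted, double-quoted, or unquoted URL.
BACKGROUND_URL = re.compile(
    r"""\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)""",
    re.IGNORECASE,
)

def background_urls(html: str) -> list:
    """Return every background-image URL; exactly one of the three groups is non-empty."""
    return [single or double or bare
            for single, double, bare in BACKGROUND_URL.findall(html)]

html = (
    '<img style="background-image: url(http://domain.com/2003-Th.jpg)">'
    "<div style=\"background-image:url('b.png')\"></div>"
    "<p>plain url(ignored.gif) in text</p>"
)
urls = background_urls(html)
print(urls)
```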
Use
To replace everything but the urls from background-images, you could use the single regex
[\s\S]*?\bbackground-image\s*:\s*URL\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)|[\s\S]+
and replace everything with $1$2$3\n. There are (almost) always two \n at the end, but I think that should be no problem.
This won't work in regex engines where the length of the match, not the order of the alternatives, decides which alternative wins.
However, if it's a problem, you can try to use
[\s\S]*?\bbackground-image\s*:\s*URL\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)[\s\S]*?(?=\z|\bbackground-image\s*:\s*URL\(\s*(?:'[^']+'|"[^"]+"|[^)]+)\s*\))
and replace everything with $1$2$3\n.
[\s\S] means every character (including \n)
\b is a word boundary
(?= ... ) is a positive lookahead. It has to match but is not part of the result
\z is the end of the text
(maybe you have to tweak the regex a bit to fit into netbeans)
Anyway, not every regex implementation supports lookaheads. If they are not supported by netbeans, you have to use a multi-step approach:
First step
Replace
[\s\S]*?\bbackground-image\s*:\s*URL\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)
with >-BG-URL:$1$2$3\n.
>-BG-URL: is a marker that identifies the values and distinguishes them from the rest.
Second step
Manually remove everything after the last match (you won't need >-BG-URL: then) or replace
^>-BG-URL:(.*)|^[\s\S]+
with $1
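The two steps can be tried out in Python as a rough sketch. It assumes case-insensitive matching and multiline mode for ^, and relies on Python (3.5 and later) substituting an empty string for groups that did not take part in the match. The function name and sample markup are made up.

```python
import re

URL_PART = r"""\bbackground-image\s*:\s*url\(\s*(?:'([^']+)'|"([^"]+)"|([^)]+))\s*\)"""

# Step one: everything up to and including each URL becomes a marker line.
STEP_ONE = re.compile(r"[\s\S]*?" + URL_PART, re.IGNORECASE)
# Step two: keep the value of each marker line, drop whatever is left over.
STEP_TWO = re.compile(r"^>-BG-URL:(.*)|^[\s\S]+", re.MULTILINE)

def list_background_urls(html: str) -> str:
    marked = STEP_ONE.sub(r">-BG-URL:\1\2\3\n", html)
    return STEP_TWO.sub(r"\1", marked)

html = (
    '<img style="background-image: url(a.jpg)"><p>x</p>'
    "<div style=\"background-image:url('b.png')\">"
)
listing = list_background_urls(html)
print(listing)
```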

replace single line javascript comments with multiline style comments in Notepad++ using regular expressions

I would like to use Notepad++ to search a javascript file or a html file containing some javascript and replace all single line comments with a multiline style comment.
For example // some comment goes here to be replaced with /* some comment goes here */
Using Notepad++ search and replace with Regular Expression selected with (//.*)(\r\n) for search and \/*\1\*/\r\n kinda works.
Problems:
It only finds // some comment goes here if there is at least one space before the //. It will not find it if there is a tab before it, if it is at the start of a line, or if there is a letter/number before it. I could work around that by first doing a global non-regex search and replace to replace all occurrences of // with a space followed by //.
// some comment goes here is replaced with /*// some comment goes here*/; that is, the two forward slashes are not removed. I can work around this afterwards by doing a global non-regex search and replace to replace all occurrences of /*// with /*.
The javascript may be in an HTML file, in which case somewhere in the file there is likely to be something like http://msdn.microsoft.com/. Clearly I would not like this to be replaced with http:/*msdn.microsoft.com/*/. I could work around this in advance by replacing all :// with, say, :/ZZZ/, where ZZZ is some escape marker, and then afterwards replacing :/ZZZ/ with ://.
There will be problems with the likes of <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">. I guess that I will have to look after these manually.
This is not really a Notepad++ problem. I am sure that I would have the same difficulties using any regular search and replace system.
All suggestions gratefully received.
Thank you for taking the time to read this
Quick way
Use this regex:
(?<=\s)//([^\n\r]*)
Replace with:
/\*$1\*/
Explanatory way
1 - The double slashes // weren't being removed because you had them inside the capturing group, so they were captured and inserted again.
2 - Most of the time there is nothing but a space or a newline (\n) before a comment begins. I included this as a lookbehind to make sure of it. This way, URLs and the DOCTYPE won't get touched.
* I can't vouch for this search-and-replace method in every situation; however, it should work in most cases.
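To see the effect of the lookbehind outside Notepad++, here is a small Python sketch of the same search and replace. The function name and sample source are illustrative only.

```python
import re

# "//" must be preceded by whitespace; the slashes stay outside the capturing group.
LINE_COMMENT = re.compile(r"(?<=\s)//([^\n\r]*)")

def to_block_comments(source: str) -> str:
    """Rewrite // comments as /* ... */ comments."""
    return LINE_COMMENT.sub(r"/*\1*/", source)

sample = 'var a = 1; // set a\nvar u = "http://msdn.microsoft.com/";\n'
converted = to_block_comments(sample)
print(converted)
```

The URL is left alone because its // follows a colon rather than whitespace. For the same reason, a comment at the very first position of the file (with nothing before it) is not matched.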
Update
Take care with the settings. You should have your cursor at the very beginning of the file content.
Then do a Replace All
This is not a job for regex, use a parser instead. Have a look at: Esprima for example.

Find / Replace functionality that allows for boundary replacements instead of expressions

Apologies in advance for the confusing title. My issue is as follows, I have the following text in about 600 files:
$_REQUEST['FOO']
I would like to replace it with the following:
$this->input->post('FOO')
To clarify, I am matching against the following:
$_REQUEST any number of A-Za-z\d followed by a ]
and replacing it with:
$this->input->post( the alphanumeric word from above followed by a )
Or in general:
Anchor token TEXT TO KEEP end anchor token
This differs from standard find/replace as I want to retain text inside of two word boundaries.
Is this functionality present in any text editors (Eclipse,np++,etc). Or am I going to need to write some type of program to parse these 600 files to make the replacement?
s/\$__REQUEST\[(.*?)]/$this->input->post(\1)/
The .*? will match everything from [ to the first ] rather than the last although it's unlikely that it will matter in this case.
By the way, the PHP superglobal is $_REQUEST rather than $__REQUEST, so the pattern should start with \$_REQUEST to match it.
You can do this in Notepad++ using regular expressions. Replace
\$_REQUEST\['([^']*)'\]
with
$this->input->post('$1')
If you ever have double quotes too, you can use a more complex expression to handle both cases, though I'm not sure Notepad++ supports backreferences; replace
\$_REQUEST\[(['"])(.*?)\1\]
with
$this->input->post($1$2$1)
Note that I've reverted to using @ExplosionPills' suggested (.*?) here; it may be better, actually.
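If an editor turns out not to be an option, the same replacement is easy to script. A minimal Python sketch of the backreference variant, working on a string; the function name is hypothetical, and reading and writing the 600 files is left out.

```python
import re

# Group 1 is the quote character, group 2 the key; \1 requires the same closing quote.
REQUEST_LOOKUP = re.compile(r"""\$_REQUEST\[(['"])(.*?)\1\]""")

def migrate_request_calls(php_source: str) -> str:
    """Rewrite $_REQUEST['KEY'] as $this->input->post('KEY'), keeping the quote style."""
    return REQUEST_LOOKUP.sub(r"$this->input->post(\1\2\1)", php_source)

print(migrate_request_calls("$id = $_REQUEST['FOO'];"))
```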

Need assistance regex matching a single quote, but do not include the quote in the result

I'm trying to find out a way to match the following test string:
token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';
The data I want returned is...
1866FB352F4DF76BCB92C3482DB7D7B4F562
I've tried the following; the closest I have is this, but it's including the single quote at the end:
(?!token = ')(\w+)';
Now, another one, which works closely, but it's including the last single quote:
'([^']+)'
Anyone want to take a stab at this?
Update: After looking at what I need to parse, I found the same value in the html, in the form area, which looks like it might be easier to grab:
name="token" value="482CD1FE037F68D5A36F4C961A6D57D9"
Again, I just need the contents within value="*"
However, the regex will have to parse the entire html source, so I assume I will need to search for name="token" value= but not include that in the result set.
If your regex engine supports lookaround, you can use
(?<=')\w+(?=')
This matches an alphanumeric word if it's surrounded by single quotes, without making those quotes a part of the actual match. If you only want to match hexadecimal numbers, use
(?i)(?<=')[0-9A-F]+(?=')
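As a quick check of both lookaround patterns, here is a short Python sketch run against the test string from the question (Python's re module supports both lookahead and fixed-width lookbehind).

```python
import re

line = "token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';"

# The quotes are asserted by the lookarounds but are not part of the match.
word = re.search(r"(?<=')\w+(?=')", line)
hex_only = re.search(r"(?i)(?<=')[0-9A-F]+(?=')", line)

print(word.group())
print(hex_only.group())
```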
EDIT:
Since you have now added that you're using JMeter, and because JMeter doesn't support lookbehind assertions for reasons incomprehensible to me (because Java itself does support it just fine), you can possibly cheat like this:
\b[0-9A-F]+(?=')
only checks whether an entire hex number occurs right before a ' character. It does not check for the presence of an opening quote, but chances are that this won't matter.
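The lookahead-only workaround behaves as follows in a Python sketch (used here only to illustrate the pattern, not JMeter itself). The second input is a made-up example of the missing opening-quote check.

```python
import re

# No lookbehind: only the closing quote is asserted.
TOKEN = re.compile(r"\b[0-9A-F]+(?=')")

line = "token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';"
found = TOKEN.search(line).group()
print(found)

# Without an opening-quote check, this also matches.
loose = TOKEN.search("x = BEEF'").group()
print(loose)
```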