RegExp , Notepad++ Replace / remove several values - regex

I have this dataset: (about 10k times)
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Superior Singing Method - Online Singing Course]]></Title>
<Description><![CDATA[High Quality Vocal Improvement Product With High Conversions. Online Singing Lessons Course Converts Like Crazy Using Content Packed Sales Video. You Make 75% On Every Sale Including Front End, Recurring, And 1-click Upsells!]]></Description>
<HasRecurringProducts>true</HasRecurringProducts>
<Gravity>45.9395</Gravity>
<PercentPerSale>74.0</PercentPerSale>
<PercentPerRebill>20.0</PercentPerRebill>
<AverageEarningsPerSale>74.9006</AverageEarningsPerSale>
<InitialEarningsPerSale>70.1943</InitialEarningsPerSale>
<TotalRebillAmt>16.1971</TotalRebillAmt>
<Referred>75.0</Referred>
<Commission>75</Commission>
<ActivateDate>2011-06-23</ActivateDate>
</Site>
I am trying to do the following:
Get the data from within the tags, and use it to create a URL, so in this example it should make
http://www.reviews.how2sing.domain.com
also, all other data has to go, i want to perform a REGEX function that will just give me a list of URLS.
I prefer to do it using notepad++ but i suck at regex, any help would be welome

To keep the regex relatively simple you can just use:
.*?<id>(.+?)</id>
Replace with:
http://www.reviews.\1.domain.com\n
That will search and replace all instances of Id tag and preceding text. You can then just remove the last manually.
Make sure matches newline is selected.
Regex is straightforward, only slightly tricky part is that it uses +? and *? which are non-greedy. This prevents the whole file from being matched. The () indicate a capture group that is used in the replacement, i.e. \1.
If you want to a regex that will include replacing the last part then use:
.*?(?:(<id>)?(.+?)</id>).+?(?:<id>|\Z)
This is a bit more tricky, it uses:
?:. A non-capturing group.
| OR
\Z end of file
Basically, the first time it will match everything up to the end of the first </id> and replace up to and including the next <id>. After that it will have replaced the starting <id> so everything before </id> goes in the group. On the last match it will match the end of file \Z.

If you only want the Id values, you can do:
'<Id>([^<]*)<\/Id>'
Then you can get the first captured group \1 which is the Id text value and then create a link from it.
Here is a demo:
http://regex101.com/r/jE9qN8
[UPDATE]
To get rid of all other lines, match this regex: '.*<Id>([^<]*)<\/Id>.*' and replace by first captured group \1. Note for the regex match, since there are multiple lines, you will need to have the DOTALL or /s flag activated to also match newlines.
Hope that helps.

Related

RegEx Replace - Remove Non-Matched Values

Firstly, apologies; I'm fairly new to the world of RegEx.
Secondly (more of an FYI), I'm using an application that only has RegEx Replace functionality, therefore I'm potentially going to be limited on what can/can't be achieved.
The Challange
I have a free text field (labelled Description) that primarily contains "useless" text. However, some records will contain either one or multiple IDs that are useful and I would like to extract said IDs.
Every ID will have the same three-letter prefix (APP) followed by a five digit numeric value (e.g. 12911).
For example, I have the following string in my Description Field;
APP00001Was APP00002TEST APP00003Blah blah APP00004 Apple APP11112OrANGE APP
THE JOURNEY
I've managed to very crudely put together an expression that is close to what I need (although, I actually need the reverse);
/!?APP\d{1,5}/g
Result;
THE STRUGGLE
However, on the Replace, I'm only able to retain the non-matched values;
Was TEST Blah blah Apple OrANGE APP
THE ENDGAME
I would like the output to be;
APP00001 APP00002 APP00003 APP00004 APP11112
Apologies once again if this is somewhat of a 'noddy' question; but any help would be much appreciated and all ideas welcome.
Many thanks in advance.
You could use an alternation | to capture either the pattern starting with a word boundary in group 1 or match 1+ word chars followed by optional whitespace chars.
What you capture in group 1 can be used as the replacement. The matches will not be in the replacement.
Using !? matches an optional exclamation mark. You could prepend that to the pattern, but it is not part of the example data.
\b(APP\d{1,5})\w*|\w+\s*
See a regex demo
In the replacement use capture group 1, mostly using $1 or \1

RegEx for transforming the next text using PhpStorm's search and replace dialog

I need to transform text using regex
TPI +2573<br>
NM$ +719<br>
Молоко +801<br>
Прод. жизнь +6.5<br>
Оплод-сть +3.6<br>
Л. отела 6.3/3.9<br>
Вымя +1.48<br>
Ноги +1.61<br>
to this one
<strong>TPI</strong> +2573<br>
<strong>NM$</strong> +719<br>
<strong>Молоко</strong> +801<br>
<strong>Прод. жизнь</strong> +6.5<br>
<strong>Оплод-сть</strong> +3.6<br>
<strong>Л. отела</strong> 6.3/3.9<br>
<strong>Вымя</strong> +1.48<br>
<strong>Ноги</strong> +1.61<br>
Is it possible with regex in PhpStorm's search and replace dialog?
Given your text, you can use this regex,
.* +
and replace it with <strong>$0</strong> (Notice there is a space after </strong>)
We're using .* to capture everything but stop just before one (possible one or more) space because that's the point after which we want the text to remain intact. Once we capture the text, we use back-reference $0 to replace the match with <strong>$0</strong> so only matched text is placed within <strong> tags.
Regex Demo
Just in case, if this doesn't work for any of the samples you haven't included in your post, then please list the rules of replacement and I will give you a more robust solution, that will work flawlessly for your given rules.

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches has what I can describe as positioners. These are shown as a question mark, followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators cant even confirm the flavour of Regex it is - they believe its Python but I'm unsure.
Regex Image
Update
Unfortunately none of the codes below have worked so far. I thought to test each code in live (Rather than via regex thinking may work different but unfortunately still didn't work)
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
At first, match the ?21 and repace it with a distinctive character, #, etc
\?21
Demo
and you may try this regex to find what you want
(T(?:\d{7}|[\#\d]{8}))\s
Demo,,, in which target string is captured to group 1 (or \1).
Finally, replace # with ?21 or something you like.
Python script may be like this
ss="""T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre= re.compile(r'\?21')
regx= re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
for m in regx.findall(rexpre.sub('#',ss)):
print(m)
print()
for m in regx.findall(rexpre.sub('#',ss)):
print(re.sub('#',r'?21', m))
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2
Usage Example:

Extracting address with Regex

I'm trying to looking for Street|St|Drive|Dr and then get all the contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do i match only whole words and not substrings so it ignores words that contain those words (the first match for example).
One option is to use the the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do i match only whole words and not substrings so it ignores words that contain those words (the first match for example).
However, it seems like there are additional issues with your regex.

Finding Ten Digit Number using regex in notepad++

I am trying to replace everything from a data dump and keep only the ten digit numbers from that dump using notepad++ regex.
Trying to do something like this (?<!\d)0\d{7}(?!\d) but no luck.
Forward
There where problems in older versions of Notepad++ which wouldn't handle PCRE expressions. This proposed solution was tested in NotePad++ v6.8.8, but should work in any version later than v6.2.
Description
([0-9]{10})|.
Replace with: $1
This expression will do the following:
capture 10 digit numbers and place them into capture group 1, which is then just reinserted into the output string
matches everything less and removes it.
How To in Notepad ++
From Notepad++
press the ctrlh to enter the find and replace
mode
Select the Regular Expression option
In the "Find what" field place the regular expression
in the "Replace with" field enter $1
Click Replace all
Example
Live Demo
https://regex101.com/r/fZ9vH7/1
Source Text
fdsafasfa1234567890zzzzzzz12345
After Replacement
1234567890
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-9]{10} any character of: '0' to '9' (10 times)
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
Extra credit
The OP wasn't clear on what to do with substrings of numbers longer than 10 characters. If strings of numbers longer than 10 digits are undesirable and need to be removed in their entirity, then use this
([0-9]{10})(?![0-9])|[0-9]+|.
Replace with: $1
Live Demo: https://regex101.com/r/aS4sN1/1
Try this:
Find: .*(\d{10}).*
Replace: \1
This has been tested in Notepad++.
As an example of a different procedure but answering the question by examples: How to get a list of ID's of your Facebook group to avoid removal of active users, it's used to reduce as well a group from 10.000 to 5000 members as well as removal of not active members from a group on Facebook.
It might be outdated but don't mind old program just look below what will do what, since explanation was to understand FIND: and Replace: what it does:
As well as a different example of how to parse as well as text and code out of HTML. And a range of numbers if they are with 2 digits up to 30.
You can try this to purge the list of member_id= and with them along with numbers from 2 to up to 30 digits long. Making sure only numbers and whole "member_id=12456" or "member_id=12" are written to the file. Later you can replace out the member_id= with blanking it out. Then copy the whole list to a duplicate scanner or remove duplicates. And have all unique IDs. And then use it in the Java code below.
"This is used to purge all Facebook user ID's by a group out of a single HTML file after you saved it scrolling down the group"
You should use the "Regular Expression" and ". matches newline" on the code below. This represents the removal of all FIND by $1 zeroing out everything:
Find: (member_id=\d{2,30})|.
Replace: $1
Second use the Extended Mode on this mode:
Find: member_id=
Replace: \n
That will make new lines with \n and with an easy way to remove all Fx0 in all lines to manually remove all the extra characters that come in buggy Notepad++
Then you can easily as well then remove all duplicates.
Connect all lines into one single space between.
The option was to use this tool which aligns the whole text with one space between each ID since its removing all duplicates: https://www.tracemyip.org/tools/remove-duplicate-words-in-text/
As well then again "use Normal option in Notepad++":
Remember to add ' to beginning and end
Find: "ONE SPACE"
Replace ','
Then you can copy the whole line into your java edit and then remove all members who are not active. If you though use a whole scrolled down HTML of a page.
['21','234','124234'] <-- remember right characters from beginning.
Extra secure would be to add your IDs to the beginning.
The facebook group removal java code is here:
https://gist.github.com/michaelv/11145168