Finding Ten Digit Number using regex in Notepad++

I am trying to replace everything from a data dump and keep only the ten digit numbers from that dump using notepad++ regex.
Trying to do something like this (?<!\d)0\d{7}(?!\d) but no luck.

Foreword
There were problems in older versions of Notepad++, which did not handle PCRE expressions. This proposed solution was tested in Notepad++ v6.8.8, but should work in any version later than v6.2.
Description
([0-9]{10})|.
Replace with: $1
This expression will do the following:
capture 10 digit numbers and place them into capture group 1, which is then just reinserted into the output string
matches everything else and removes it.
How To in Notepad ++
From Notepad++
Press Ctrl+H to open the Find and Replace dialog
Select the Regular Expression option
In the "Find what" field place the regular expression
in the "Replace with" field enter $1
Click Replace all
Example
Live Demo
https://regex101.com/r/fZ9vH7/1
Source Text
fdsafasfa1234567890zzzzzzz12345
After Replacement
1234567890
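For readers who would rather script the same replacement, here is a minimal sketch using Python's re module (an assumption, since the answer itself only covers Notepad++). The function name keep_ten_digit_runs is made up for illustration.

```python
import re

# Same idea as the Find/Replace pair above: keep capture group 1, drop the rest.
TEN_DIGITS_OR_ANYTHING = re.compile(r"([0-9]{10})|.")


def keep_ten_digit_runs(text):
    """Hypothetical helper: return only the ten-digit runs found in text."""
    # group(1) is None whenever the "." branch matched, so substitute "" then.
    return TEN_DIGITS_OR_ANYTHING.sub(lambda m: m.group(1) or "", text)


print(keep_ten_digit_runs("fdsafasfa1234567890zzzzzzz12345"))  # prints 1234567890
```

As in Notepad++ without ". matches newline", the dot does not match line breaks here, so line breaks in the input are left in place.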
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-9]{10} any character of: '0' to '9' (10 times)
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
Extra credit
The OP wasn't clear on what to do with substrings of numbers longer than 10 characters. If strings of numbers longer than 10 digits are undesirable and need to be removed in their entirety, then use this
([0-9]{10})(?![0-9])|[0-9]+|.
Replace with: $1
Live Demo: https://regex101.com/r/aS4sN1/1
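The stricter expression can be tried outside Notepad++ as well. This is a sketch under the assumption that Python's re module is acceptable; the sample text and the name keep_exact_ten_digit_runs are invented for the example.

```python
import re

# Branch 1 keeps a ten-digit run only when no further digit follows it.
# Branch 2 swallows any other digit run whole; branch 3 removes everything else.
STRICT_TEN_DIGITS = re.compile(r"([0-9]{10})(?![0-9])|[0-9]+|.")


def keep_exact_ten_digit_runs(text):
    """Hypothetical helper: keep runs of exactly ten digits, drop the rest."""
    return STRICT_TEN_DIGITS.sub(lambda m: m.group(1) or "", text)


sample = "id 1234567890\nbad 12345678901\nok 5555555555"
print(repr(keep_exact_ten_digit_runs(sample)))
# prints '1234567890\n\n5555555555'
```

The eleven-digit run on the second line is consumed by the [0-9]+ branch in one piece, so none of its digits leak into the output.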

Try this:
Find: .*(\d{10}).*
Replace: \1
This has been tested in Notepad++.

Here is a different procedure that answers the question by example: how to get a list of the member IDs of your Facebook group so that active users are not removed. It has been used to reduce a group from 10,000 to 5,000 members and to remove inactive members from a Facebook group.
It might be outdated, but never mind the old program; the point is to look at what each Find and Replace step below does.
It also serves as an example of how to parse text and code out of HTML, and of matching a range of numbers from 2 up to 30 digits long.
You can try this to reduce the file to the member_id= entries together with their numbers, from 2 up to 30 digits long, making sure that only whole entries such as "member_id=12456" or "member_id=12" are written to the file. Later you can blank out the member_id= prefix, then copy the whole list into a duplicate scanner or remove the duplicates yourself, so that only unique IDs remain. Then use the list in the Java code below.
"This is used to extract all Facebook user IDs of a group from a single HTML file that you saved after scrolling down the group"
First, select "Regular Expression" and ". matches newline" for the pattern below. It replaces every match with $1, which blanks out everything except the captured member_id entries:
Find: (member_id=\d{2,30})|.
Replace: $1
Second use the Extended Mode on this mode:
Find: member_id=
Replace: \n
That puts each ID on its own line (\n), which makes it easy to manually remove any Fx0 and other stray characters that a buggy Notepad++ may leave in the lines.
You can then easily remove all duplicates as well.
Next, join all lines into a single line with one space between the IDs.
One option is this tool, which removes all duplicates and aligns the whole text with one space between each ID: https://www.tracemyip.org/tools/remove-duplicate-words-in-text/
Then, using the Normal search mode in Notepad++:
Find: "ONE SPACE"
Replace: ','
Remember to add ' to the beginning and end of the line.
Then you can copy the whole line into your Java edit and remove all members who are not active, provided you used the HTML of the fully scrolled-down page.
['21','234','124234'] <-- remember the right characters at the beginning.
For extra safety, add your own IDs to the beginning.
The facebook group removal java code is here:
https://gist.github.com/michaelv/11145168
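The manual steps above (extract the member_id values, drop duplicates, quote and join them) can also be sketched in a few lines of Python. This is only an illustration, not part of the original procedure: the function names and the sample HTML are invented, and it assumes the IDs appear in the saved page as member_id= followed by 2 to 30 digits.

```python
import re


def extract_member_ids(html):
    """Hypothetical helper: unique member IDs in order of first appearance."""
    ids = re.findall(r"member_id=(\d{2,30})", html)
    # dict.fromkeys keeps the first occurrence of each ID and preserves order.
    return list(dict.fromkeys(ids))


def to_quoted_list(ids):
    """Format the IDs like ['21','234'] for pasting into the script."""
    return "[" + ",".join("'" + i + "'" for i in ids) + "]"


# Invented sample markup; a real saved group page will look different.
sample_html = (
    '<a href="/p?member_id=21">A</a> '
    '<a href="/p?member_id=234">B</a> '
    '<a href="/p?member_id=21">A</a>'
)
print(to_quoted_list(extract_member_ids(sample_html)))  # prints ['21','234']
```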

Related

Notepad++ delete all text in a file but not strings that begin with 8D

I am racking my brains trying to figure out how to solve this problem I have. Here are some sample records from my text file:
active users 8D1DF3
active users by test 8D04R0
active users by maker 8DZZ99
active users by report class 8D2CV6
I am trying to find a way, using regular expressions in Notepad++, to remove all of the text except for the strings that start with 8D. The result would be this:
8D1DF3
8D04R0
8DZZ99
8D2CV6
In my research I have only found the possibility to remove lines based on strings being found not the ability to remove all text from lines other than the strings I want to keep.
Any clues as to how I can achieve this would be much appreciated.
Try the regex below:
(?<!\S)(?!8D)\S+|\h+
and replace with nothing.
Breakdown:
(?<!\S) Shouldn't be preceded by a non-whitespace character (it shouldn't start match from the middle)
(?!8D) A sub-string shouldn't start with 8D
\S+ Match the rest
|\h+ Or match horizontal whitespaces
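The breakdown can be checked with a short Python sketch. Python's re module has no \h, so the sketch substitutes [ \t]+ for horizontal whitespace; that swap, the variable names, and the shortened sample are assumptions made for the illustration.

```python
import re

# (?<!\S)  start only at the beginning of a whitespace-delimited token
# (?!8D)   skip tokens that begin with 8D
# \S+      consume the unwanted token
# [ \t]+   stand-in for \h+, which Python's re module does not support
NOT_8D_TOKEN_OR_SPACES = re.compile(r"(?<!\S)(?!8D)\S+|[ \t]+")

sample = (
    "active users 8D1DF3\n"
    "active users by test 8D04R0\n"
    "active users by maker 8DZZ99"
)
cleaned = NOT_8D_TOKEN_OR_SPACES.sub("", sample)
print(cleaned)
```

Line breaks are neither tokens nor horizontal whitespace, so each kept code stays on its own line.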
The regex you want could be this; replace with '\1':
.*(8D.{4})
It matches everything on the line up to and including the code, but captures '8D' and the four characters after it in a group, which you can use in the replacement.

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a Regex to remove the 5th and 6th column? The numbers in the 5th and 6th column are variable in length.
Another problem is the User row can also contain a |, to make it even worse.
I can use a macro to fix this, but the file is a few millions lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open for suggestions on how to do this with another program, command line utility, either Linux or Windows.
Match \|[^|]+\|[^|]+(\|[^|]+$)
Replace $1
Basically, anchor to the end of the line and remove columns [-2] and [-3], keeping the last column. (I assume columns can't be empty. Replace + with * if they can.)
If you need finer detail than that, I'd recommend writing a Java or Python script to manually parse and rewrite the file for you.
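A script along those lines could look like the sketch below. It assumes Python, counts columns from the end of the line so that extra | characters in the user column do no harm, and uses invented names (drop_two_columns_before_last, filter_file) and UTF-8 file encoding as assumptions.

```python
def drop_two_columns_before_last(line):
    """Hypothetical helper: remove the two columns just before the last one."""
    fields = line.split("|")
    if len(fields) < 4:
        # Too few columns to remove two and still keep anything: leave as is.
        return line
    # Everything except the last three fields, plus the last field itself.
    return "|".join(fields[:-3] + fields[-1:])


def filter_file(src_path, dst_path):
    """Hypothetical helper: stream the file so millions of lines fit in memory."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(drop_two_columns_before_last(line.rstrip("\n")) + "\n")


row = "8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389"
print(drop_two_columns_before_last(row))
```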
I've captured three groups and given them names. If you use a replace utility like sed or Vim's regex, you can replace the group named remove with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
You may have to remove the group namings and use \1 etc. instead, depending on what environment you use.
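As one example of adapting the group syntax, Python spells named groups (?P<name>...) rather than (?<name>...). The sketch below concatenates keep_before and keep_after as described; the function name is invented, and note that the pattern counts columns from the left, so it assumes the first three columns contain no extra | characters (which does not hold for the third sample row).

```python
import re

# Python needs (?P<name>...) for named groups; the rest of the pattern is unchanged.
COLUMNS = re.compile(
    r"^(?P<keep_before>(?:[^|]+\|){3})"
    r"(?P<remove>(?:[^|]+\|){2})"
    r"(?P<keep_after>.*)$"
)


def remove_middle_columns(line):
    """Hypothetical helper: join keep_before and keep_after, dropping 'remove'."""
    m = COLUMNS.match(line)
    if m is None:
        return line  # line does not have the expected shape; leave it untouched
    return m.group("keep_before") + m.group("keep_after")


row = "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1"
print(remove_middle_columns(row))
```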
From Notepad++ hit ctrl + h then enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click Replace All and you are done.
Regex Explain:
\|\d+ : match 1st string that starts with | followed by number
\|\d+ : match 2nd string that starts with | followed by number
(\|[0-9a-z]+): match and capture the string after the 2nd number.
$ : This forces the regex to match at the end of the line.
Replacement:
$1 : replace the matched text with the contents of the capture group, i.e. whatever was matched between the parentheses (\|[0-9a-z]+)

Replace spaces with dashes, but only for text found between quotes in the text TAGS=""

Is it possible to do the following with Notepad++'s FIND/REPLACE function?
I have a text file where I want to replace spaces found in between the quotes of the text TAGS="*" with dashes.
Example:
TAGS="tag1,tag2,tag 3,tag4,tag 5"
should become:
TAGS="tag1,tag2,tag-3,tag4,tag-5"
So far I can find the text I want using:
FIND WHAT: TAGS="*"
But how do I have it replace spaces with dashes?
--------------------- UPDATE -----------------
My question before used tag1,tag2, but the actual data in the file does not have numbers, only words.
These following are three actual lines from the file. I need to find spaces between the quotes of TAGS="*" and replace only those spaces with dashes:
<DT>Kundalini Yoga - Pranayama - Breathing Techniques
<DT>40 Ways The World Makes Awesome Hot Dogs | Food Republic
<DT>Fix Windows boot, Fix your Boot sequence with BcdEdit, BootSect, BCDboot, WINRE,...
In the lines above, there are 3 instances of TAGS="*" which I've extracted here to make them easy to see:
TAGS="kundalini,yoga,fire breath,breathing,breath of fire"
TAGS="recipe,cooking,hot dog"
TAGS="windows stuff,bcdboot,bcdsect,repair,boot"
which, after the FIND/REPLACE, should look like:
TAGS="kundalini,yoga,fire-breath,breathing,breath-of-fire"
TAGS="recipe,cooking,hot-dog"
TAGS="windows-stuff,bcdboot,bcdsect,repair,boot"
Use the following regex:
Find what: (?:\G(?!^)|\bTAGS=")[^\s"]*\K\s+
Replace with: -
Details:
(?:\G(?!^)|\bTAGS=") - Finds either the end of the previous successful match (\G(?!^)) or the literal text TAGS=" starting at a word boundary (\bTAGS=")
[^\s"]* - 0+ chars other than a space and "
\K - match reset operator discarding the text matched so far
\s+ - 1+ whitespaces
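Python's re module supports neither \G nor \K, so the pattern above cannot be copied there unchanged. A rough equivalent, offered as a sketch with an invented function name, is to match each whole TAGS="..." attribute and replace the whitespace inside it with a callback:

```python
import re

# Capture the text between the quotes of each TAGS="..." attribute.
TAGS_ATTR = re.compile(r'\bTAGS="([^"]*)"')


def dash_spaces_in_tags(text):
    """Hypothetical helper: turn whitespace runs inside TAGS="..." into dashes."""
    def fix(match):
        return 'TAGS="' + re.sub(r"\s+", "-", match.group(1)) + '"'
    return TAGS_ATTR.sub(fix, text)


line = 'Hot Dogs TAGS="recipe,cooking,hot dog" Food Republic'
print(dash_spaces_in_tags(line))
# prints Hot Dogs TAGS="recipe,cooking,hot-dog" Food Republic
```

Text outside the quotes is left alone, which is the same guarantee the \G-based pattern gives in Notepad++.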
Use the following find/replace pattern in regex mode, and do a replace all to cover the entire document (or selection which you want). Note that I make no effort to check for TAGS="...", under the assumption that you don't have strings of the form tag123 or tag 123 anywhere else in your document.
Find:
tag\s+(\d*)
Replace:
tag-$1
Input:
tag1,tag2,tag 3,tag4,tag 5
Output:
tag1,tag2,tag-3,tag4,tag-5

RegExp , Notepad++ Replace / remove several values

I have this dataset: (about 10k times)
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Superior Singing Method - Online Singing Course]]></Title>
<Description><![CDATA[High Quality Vocal Improvement Product With High Conversions. Online Singing Lessons Course Converts Like Crazy Using Content Packed Sales Video. You Make 75% On Every Sale Including Front End, Recurring, And 1-click Upsells!]]></Description>
<HasRecurringProducts>true</HasRecurringProducts>
<Gravity>45.9395</Gravity>
<PercentPerSale>74.0</PercentPerSale>
<PercentPerRebill>20.0</PercentPerRebill>
<AverageEarningsPerSale>74.9006</AverageEarningsPerSale>
<InitialEarningsPerSale>70.1943</InitialEarningsPerSale>
<TotalRebillAmt>16.1971</TotalRebillAmt>
<Referred>75.0</Referred>
<Commission>75</Commission>
<ActivateDate>2011-06-23</ActivateDate>
</Site>
I am trying to do the following:
Get the data from within the tags, and use it to create a URL, so in this example it should make
http://www.reviews.how2sing.domain.com
Also, all other data has to go; I want to perform a regex function that will just give me a list of URLs.
I prefer to do it using Notepad++ but I suck at regex, any help would be welcome.
To keep the regex relatively simple you can just use:
.*?<id>(.+?)</id>
Replace with:
http://www.reviews.\1.domain.com\n
That will search and replace all instances of the Id tag and the text preceding it. You can then just remove the text left after the last match manually.
Make sure matches newline is selected.
Regex is straightforward, only slightly tricky part is that it uses +? and *? which are non-greedy. This prevents the whole file from being matched. The () indicate a capture group that is used in the replacement, i.e. \1.
If you want a regex that also replaces the last part, then use:
.*?(?:(<id>)?(.+?)</id>).+?(?:<id>|\Z)
This is a bit more tricky, it uses:
?:. A non-capturing group.
| OR
\Z end of file
Basically, the first time it will match everything up to the end of the first </id> and replace up to and including the next <id>. After that it will have replaced the starting <id> so everything before </id> goes in the group. On the last match it will match the end of file \Z.
If you only want the Id values, you can do:
'<Id>([^<]*)<\/Id>'
Then you can get the first captured group \1 which is the Id text value and then create a link from it.
Here is a demo:
http://regex101.com/r/jE9qN8
[UPDATE]
To get rid of all other lines, match this regex: '.*<Id>([^<]*)<\/Id>.*' and replace by first captured group \1. Note for the regex match, since there are multiple lines, you will need to have the DOTALL or /s flag activated to also match newlines.
Hope that helps.
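Outside Notepad++, the same capture group can feed a short script that builds the URL list directly. This is a sketch in Python with an invented function name; lowercasing the Id is an assumption based on the lowercase how2sing in the example URL of the question.

```python
import re


def build_review_urls(xml_text):
    """Hypothetical helper: one review URL per <Id> element found in the text."""
    ids = re.findall(r"<Id>([^<]*)</Id>", xml_text)
    # Assumption: the question's example URL is lowercase, so lowercase the Id.
    return ["http://www.reviews." + i.lower() + ".domain.com" for i in ids]


sample = (
    "<Id>HOW2SING</Id>\n"
    "<PopularityRank>1</PopularityRank>\n"
    "<Gravity>45.9395</Gravity>\n"
)
print("\n".join(build_review_urls(sample)))
# prints http://www.reviews.how2sing.domain.com
```

Because findall returns only the captured Id values, no DOTALL flag or cleanup of the surrounding lines is needed.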

Remove all lines that don't match regex in Notepad++

I have a range of files of a specific format. I have pasted an example here.
----------------------------------------------
Attempting to factor n = 160000000000110400000000018963... Found a factor: 400000000000129
Time (s): 18.9561
----------------------------------------------
Attempting to factor n = 164025000000137700000000028819... Found a factor: 405000000000179
Time (s): 22.3426
----------------------------------------------
Attempting to factor n = 168100000000155800000000036051... Found a factor: 410000000000197
Time (s): 101.183
I would like a regular expression that I can use to capture the times, e.g. for all the lines with format "Time (s): X.Y" I want to keep X.Y on a separate line, and throw EVERYTHING ELSE away.
I have the following expression: Time (s):\s+(\d+.\d+), which captures these. This captures the lines I need, but Notepad++ only seems to have functionality to replace with something, not save what it matches. So I can remove all those lines, which is nearly the opposite of what I want.
Any help?
Well, I don't know Notepad++, but it's likely that you can use the result of capture groups in the replacement field. Either try
\1
or
$1
1 = first capture group. In your expression the unescaped (s) already forms group 1, so the time ends up in group 2 and you basically replace the match with \2 in your case.
Use this on the command line:
for /f "usebackq tokens=3" %a in (`findstr /b "Time" 1.txt`) do @echo %a
Follow next steps (Notepad++ 6.2.3):
Clean and mark
Replace: ^(Time \(s\):)+ ([.\r]*) with: #\2
Remove unmarked lines
Replace: ^[^#]+[.\n]* with: (empty)
Remove mark
Replace: ^#(.*) with: \1
Use the following expression to match the entire line:
.*\(s\)\:\s+(\d+\.\d+)
Now you can replace this with
\1
which gives you the matched group number 1 (the only group in the above expression) that matches the time
Adjust your regular expression so it either matches a "Time" line and captures the time within, or matches the whole line. Then replace with the captured text, which will be blank for ignored lines.
Find what: (Time \D+(\d+.\d+)|.*)
Replace with: \2
This leaves you with a sequence of captured times plus blank lines, which can be removed using TextFX's Remove blank lines, or Extended Replace on "\r\n\r\n".
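If a script is an option, the "keep only the captured times" idea avoids the blank-line cleanup altogether. The following is a sketch in Python rather than Notepad++; the variable names are invented and the log text is the sample from the question.

```python
import re

# ^ with re.MULTILINE anchors at the start of every line; the parentheses
# around "s" and the decimal point are escaped so they match literally.
TIME_LINE = re.compile(r"^Time \(s\):\s+(\d+\.\d+)", re.MULTILINE)

log = """----------------------------------------------
Attempting to factor n = 160000000000110400000000018963... Found a factor: 400000000000129
Time (s): 18.9561
----------------------------------------------
Attempting to factor n = 164025000000137700000000028819... Found a factor: 405000000000179
Time (s): 22.3426
----------------------------------------------
Attempting to factor n = 168100000000155800000000036051... Found a factor: 410000000000197
Time (s): 101.183
"""

times = TIME_LINE.findall(log)
print("\n".join(times))
```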
Similar to MaurizioRam's answer (which led me to figuring out this answer), you can take advantage of the "Mark" tab in the Find window.
As you probably know Ctrl+F opens a window with Find and Replace tabs. It also has tabs Find In Files, Find In Projects, and Mark.
Mark will let you add a special highlight (a mark) to everything your regex matches, by pressing "Mark All".
After pressing "Mark All" you can "Copy Marked Text" which will copy everything that your regex matched into your clipboard.
You can now paste this into a new file, which will give you a file with only the text your regex matched.