How to Match Tilde-Delimited Data Using Regex - regex

I have data like this:
~10~682423~15~Test Data~10~68276127~15~More Data~10~6813~15~Also Data~
I'm trying to use Notepad++ to find and replace the values within tag 10 (682423, 68276127, 6813) with zeroes. I thought the syntax below would work, but it selects the first occurrence of the text I want and the rest of the line, instead of just the text I want (~10~682423~, for example). I also tried dozens of variations from searching online, but they also either did the same thing or wouldn't return any results.
~10~.*~

You can use: (?<=~10~)\d+(?=~) and replace with 0. This uses lookarounds to check that ~10~ precedes the digit sequence and the (?=~) ensures a ~ follows the digit sequence. If any character could be after the ~10~ field, use (?<=~10~)[^~]+(?=~).
The problem with ~10~.*~ is that the * is greedy, so it just slurps away matching any character and ~.

Use
\b10~\d+
Replace with 10~0. See proof. \b10~ will capture 10 as entire number (no match in 210 is allowed) and \d+ will match one or more digits.

Related

regex to match specific pattern of string followed by digits

Sample input:
___file___name___2000___ed2___1___2___3
DIFFERENT+FILENAME+(2000)+1+2+3+ed10
Desired output (eg, all letters and 4-digit numbers and literal 'ed' followed immediately by a digit of arbitrary length:
file name 2000 ed2
DIFFERENT FILENAME 2000 ed10
I am using:
[A-Za-z]+|[\d]{4}|ed\d+ which only returns:
file name 2000 ed
DIFFERENT FILENAME 2000 ed
I see that there is a related Q+A here:Regular Expression to match specific string followed by number?
eg using ed[0-9]* would match ed#, but unsure why it does not match in the above.
As written, your regex is correct. Remember, however, that regex tries to match its statements from left to right. Your ed\d+ is never going to match, because the ed was already consumed by your [A-Za-z] alternative. Reorder your regex and it'll work just fine:
ed\d+|[a-zA-Z]+|\d{4}
Demo
Nick's answer is right, but because in-order matching can be a less readable "gotcha", the best (order-insensitive) ways to do this kind of search are 1) with specified delimiters, and 2) by making each search term unique.
Jan's answer handles #1 well. But you would have to specify each specific delimiter, including its length (e.g. ___). It sounds like you may have some unusual delimiters, so this may not be ideal.
For #2, then, you can make each search term unique. (That is, you want the thing matching "file" and "name" to be distinct from the thing matching "2000", and both to be distinct from the thing matching "ed2".)
One way to do this is [A-Za-z]+(?![0-9a-zA-Z])|[\d]{4}|ed\d+. This is saying that for the first type of search term, you want an alphabet string which is followed by a non-alphanumeric character. This keeps it distinct from the third search term, which is an alphabet string followed by some digit(s). This also allows you to specify any range of delimiters inside of that negative lookbehind.
demo
You might very well use (just grab the first capturing group):
(?:^|___|[+(]) # delimiter before
([a-zA-Z0-9]{2,}) # the actual content
(?=$|___|[+)]) # delimiter afterwards
See a demo on regex101.com

How can I delete this part of the text with regex?

I have a problem that I really hope that somebody could help me. So, I want to delete some parts of text from a notepad++ document using Regex. If there's another software that I can use to delete this part of text, let me know please, I am really really noob with regex
So, my document its like this:
1
00:00:00,859 --> 00:00:03,070
text over here
2
00:00:03,070 --> 00:00:09,589
text over here
3
00:00:09,589 --> 00:00:10,589
some numbers here
4
00:00:10,589 --> 00:00:12,709
Text over here
5
00:00:12,709 --> 00:00:18,610
More text with numbers here
What I want to learn is how can I delete the first 2 lines of numbers in all the document? So I could get only the text parts (the "text over here" parts)
I would really appreciate any kind of help!
My solution:
^[\s\S]{1,5}\d{1,3}:\d{1,3}:\d{1,3},\d{1,5}\s-->\s*?\d{1,3}:\d{1,3}:\d{1,3},\d{1,5}\s
This solution match both types: either all data in one line, or numbers in one line and data in the second.
Demo: https://regex101.com/r/nKD0DQ/1/
Simplest solution;
\d+(\r\n|\r|\n)\d{2}:\d{2}.*(\r\n|\r|\n)
Get line with some number \d+ with its line break (\r\n|\r|\n)
Also the next line that starts with two 2-digit numbers and a colon \d{2}:\d{2} with the rest .* and its line break. No need to match all since we already are in the correct line, since subtitle file is defined well with its predictable structure.
Put this as Find what: value in Search -> Replace.. in Notepad++, with Seach Mode: Regular Expression and with replace value (Replace with:) of empty space. Will get you the correct result, lines of expected text with empty line in between each.
to see it on action on regex101
Subtitles, for accuracy you can use this:
\d+(\r\n|\n|\r)(\d\d:){2}\d\d,\d{3}\s*-->\s*(\d\d:){2}\d\d,\d{3}(\r\n|\n|\r)
Check Regular Expression, Find what with this and Replace with empty would do.
Regxe Demo
srt subtitles are basically ordered. And it's better accurate than lose texts.
\d : a single digit.
+ : one or more of occurances of the afore character or group.
\r\n: carriage and return. (newline)
* : zero or more of occurances of the afore character or group.
| : Or, match either one.
{3}: Match afore character or group three times.
I'm going for a less specific regex:
^[0-9]*\n[0-9:,]*\s-->\s[0-9:,]*
Demo # regex101

How to use an RE to match a line of ===== and the line above

I want to match two lines like the following using a Regular Expression:-
abcmnoxyz
=========
The first line is essentially random, the second line will be all the same character of a limited number of possibles (=, - and maybe a couple more). The lines can probably be required to be the same length but it would be nice if they didn't have to be. It would be OK to have multiple REs, one for each possible 'underline' character.
Can anyone come up with a way to do this?
This regex should do what you're trying to do :
regex = "(.*)\n(.)\2{2,}$"
group 1 will give you the line before the repeated linet
Live demo here
EXPLANATION
(.*)\n: match anything followed by a new line
(.)\2{2,} : capture something then check if its followed by same character 2+ more no. of times. You don't need to worry about which character is repeated.
In case you've a set of characters that can be repeated you can put a character set like this : [=-] instead of dot (.)
Use Grep's -B Flag
Matching with Alternation
Given your example, you can use extended regular expressions with alternations and a range operator. The -B flag tells grep how many lines before the match to include in the output.
$ grep -E -B1 '^(={5,}|-{5,})$' sample.txt
abcmnoxyz
=========
You can add alternations for additional characters if you want, although boundary markers ought to be as consistent as you can make them. You can also adjust the minimum number of sequential characters required for a match to suit your needs. I used a five-character range in the example because that's what was posted as the criterion in your original topic sentence, and because a shorter boundary marker is more likely to accidentally match truly random text.
Matching with a Character Class
Also, note that the following does the same job, but is a bit more concise. It uses a character class and a backreference to avoid alternations, which can get messy if you add many more boundary characters. Both versions are equally effective at matching your example.
$ grep -E -B1 '^([=-])\1{4,}$'
abcmnoxyz
========
A regex like this
^([^=\v]+)\v=+$
will do. Check it out at example 1
Explanation:
^([^=\v]+) # 1 or more matches of anything that is not a '=' or vertical space \v
\v=+$ # match a vertical space followed by 1 or more '='
If you want to extend this to more characters like '-' you could do this:
^([^=\-\v]+)\v(-|=)\2+$
Look at example 2
And, thanks to Ashish Ranjan, suppose you wanted to have = and/or - on the first line, use something like this:
^(.+)\v(-|=)\2+$
which would even allow you to have a first line like "=====". Having my doubts if OP had this in mind, though. Look at example 3
Hope this works
^([a-z]{1,})\n([=-]{1,})
\n and \r you have try both based on file format (unix or dos)
\1 will give you first line
\2 will give you second line
If the file contains same pattern over the text, then it might give you lot occurrence.
This answer is irrespective of number of characters in one line.
Ex: Tester

Interesting easy looking Regex

I am re-phrasing my question to clear confusions!
I want to match if a string has certain letters for this I use the character class:
[ACD]
and it works perfectly!
but I want to match if the string has those letter(s) 2 or more times either repeated or 2 separate letters
For example:
[AKL] should match:
ABCVL
AAGHF
KKUI
AKL
But the above should not match the following:
ABCD
KHID
LOVE
because those are there but only once!
that's why I was trying to use:
[ACD]{2,}
But it's not working, probably it's not the right Regex.. can somebody a Regex guru can help me solve this puzzle?
Thanks
PS: I will use it on MYSQL - a differnt approach can also welcome! but I like to use regex for smarter and shorter query!
To ensure that a string contains at least two occurencies in a set of letters (lets say A K L as in your example), you can write something like this:
[AKL].*[AKL]
Since the MySQL regex engine is a DFA, there is no need to use a negated character class like [^AKL] in place of the dot to avoid backtracking, or a lazy quantifier that is not supported at all.
example:
SELECT 'KKUI' REGEXP '[AKL].*[AKL]';
will return 1
You can follow this link that speaks on the particular subject of the LIKE and the REGEXP features in MySQL.
If I understood you correctly, this is quite simple:
[A-Z].*?[A-Z]
This looks for your something in your set, [A-Z], and then lazily matches characters until it (potentially) comes across the set, [A-Z], again.
As #Enigmadan pointed out, a lazy match is not necessary here: [A-Z].*[A-Z]
The expression you are using searches for characters between 2 and unlimited times with these characters ACDFGHIJKMNOPQRSTUVWXZ.
However, your RegEx expression is excluding Y (UVWXZ])) therefore Z cannot be found since it is not surrounded by another character in your expression and the same principle applies to B ([ACD) also excluded in you RegEx expression. For example Z and A would match in an expression like ZABCDEFGHIJKLMNOPQRSTUVWXYZA
If those were not excluded on purpose probably better can be to use ranges like [A-Z]
If you want 2 or more of a match on [AKL], then you may use just [AKL] and may have match >= 2.
I am not good at SQL regex, but may be something like this?
check (dbo.RegexMatch( ['ABCVL'], '[AKL]' ) >= 2)
To put it in simple English, use [AKL] as your regex, and check the match on the string to be greater than 2. Here's how I would do in Java:
private boolean search2orMore(String string) {
Matcher matcher = Pattern.compile("[ACD]").matcher(string);
int counter = 0;
while (matcher.find())
{
counter++;
}
return (counter >= 2);
}
You can't use [ACD]{2,} because it always wants to match 2 or more of each characters and will fail if you have 2 or more matching single characters.
your question is not very clear, but here is my trial pattern
\b(\S*[AKL]\S*[AKL]\S*)\b
Demo
pretty sure this should work in any case
(?<l>[^AKL\n]*[AKL]+[^AKL\n]*[AKL]+[^AKL\n]*)[\n\r]
replace AKL for letters you need can be done very easily dynamicly tell me if you need it
Is this what you are looking for?
".*(.*[AKL].*){2,}.*" (without quotes)
It matches if there are at least two occurences of your charactes sorrounded by anything.
It is .NET regex, but should be same for anything else
Edit
Overall, MySQL regular expression support is pretty weak.
If you only need to match your capture group a minimum of two times, then you can simply use:
select * from ... where ... regexp('([ACD].*){2,}') #could be `2,` or just `2`
If you need to match your capture group more than two times, then just change the number:
select * from ... where ... regexp('([ACD].*){3}')
#This number should match the number of matches you need
If you needed a minimum of 7 matches and you were using your previous capture group [ACDF-KM-XZ]
e.g.
select * from ... where ... regexp('([ACDF-KM-XZ].*){7,}')
Response before edit:
Your regex is trying to find at least two characters from the set[ACDFGHIJKMNOPQRSTUVWXZ].
([ACDFGHIJKMNOPQRSTUVWXZ]){2,}
The reason A and Z are not being matched in your example string (ABCDEFGHIJKLMNOPQRSTUVWXYZ) is because you are looking for two or more characters that are together that match your set. A is a single character followed by a character that does not match your set. Thus, A is not matched.
Similarly, Z is a single character preceded by a character that does not match your set. Thus, Z is not matched.
The bolded characters below do not match your set
ABCDEFGHIJKLMNOPQRSTUVWXYZ
If you were to do a global search in the string, only the italicized characters would be matched:
ABCDEFGHIJKLMNOPQRSTUVWXYZ

regex remove all numbers from a paragraph except from some words

I want to remove all numbers from a paragraph except from some words.
My attempt is using a negative look-ahead:
gsub('(?!ami.12.0|allo.12)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
But this doesn't work. I get this:
"." "" "ami.. " "allo."
Or my expected output is:
"." "" 'ami.12.0','allo.12'
You can't really use a negative lookahead here, since it will still replace when the cursor is at some point after ami.
What you can do is put back some matches:
(ami.12.0|allo.12)|[[:digit:]]+
gsub('(ami.12.0|allo.12)|[[:digit:]]+',"\\1",
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
I kept the . since I'm not 100% sure what you have, but keep in mind that . is a wildcard and will match any character (except newlines) unless you escape it.
Your regex is actually finding every digit sequence that is not the start of "ami.12.0" or "allo.12". So for example, in your third string, it gets to the 12 in ami.12.0 and looks ahead to see if that 12 is the start of either of the two ignored strings. It is not, so it continues with replacing it. It would be best to generalize this, but in your specific case, you can probably achieve this by instead doing a negative lookbehind for any prefixes of the words (that can be followed by digit sequences) that you want to skip. So, you would use something like this:
gsub('(?<!ami\\.|ami\\.12\\.|allo\\.)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)