regex irregular format string input - regex

I have strings in this format:
MY TITLE OF STRING 5 - EP.2
MY TITLE OF STRING 6 - EP.3
But in some cases the spacing is irregular and my string can look like this:
MY TITLE OF STRING 5- EP.2
MY TITLE OF STRING 6-EP.3
This is my regex:
(\d*)\s-\s.*?EP.\s*(\d*)
but it only works for the standard case.

You can make the whitespace around the hyphen optional (zero or more occurrences) using the * quantifier:
(\d+)\s*-\s*EP.\s*(\d+)
See the regex demo
If you need to match any 0+ chars, as few as possible, between the - and EP, re-insert .*? into the pattern:
(\d+)\s*-\s*.*?EP.\s*(\d+)
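A quick way to check the pattern against both the regular and the irregular spacing is a short Python sketch. The sample titles are the ones from the question. The helper name extract_numbers is made up for illustration, and the dot after EP is escaped here (an assumption about the intent) so that it only matches a literal dot.

```python
import re

# Pattern from the answer above, with the dot after EP escaped (assumption).
PATTERN = re.compile(r"(\d+)\s*-\s*EP\.\s*(\d+)")

def extract_numbers(title):
    """Hypothetical helper: return (number, episode) as strings, or None."""
    m = PATTERN.search(title)
    return m.groups() if m else None

titles = [
    "MY TITLE OF STRING 5 - EP.2",
    "MY TITLE OF STRING 6 - EP.3",
    "MY TITLE OF STRING 5- EP.2",
    "MY TITLE OF STRING 6-EP.3",
]
for title in titles:
    print(title, "->", extract_numbers(title))
```

All four titles should yield a pair of digit strings, because \s* accepts zero whitespace characters on either side of the hyphen.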

Just for fun, since Wiktor already gave a working answer, this one also works:
(\d+)[\s-]+EP\.(\d+)$
Explanation
(\d+) at least one digit
[\s-]+ one or more hyphens or whitespace characters
EP\. followed by EP.
(\d+)$ at least one digit until end of string

Related

How to extract a word that could possibly be followed with another word

I want to extract [games, games, things, things] from the following array.
Today_games
Today_games_freq
Today_things
Today_things_freq
I have tried Today_(\w+)(?=_freq)?, which still gives me the extra "freq".
I also tried some other combinations, but I couldn't figure out how to get just the part after the first underscore.
You can use
Today_(\w+?)(?:_freq)?$
See the regex demo. This matches Today_, then captures any one or more word chars (as few as possible) into Group 1 (with (\w+?)), and then (?:_freq)?$ matches an optional occurrence of a _freq substring and asserts the position at the end of string.
Or,
Today_([^\W_]+)
See this regex demo.
Here, Today_ is matched and the ([^\W_]+) pattern captures one or more alphanumeric chars into Group 1 (same as \w+ with _ subtracted from \w).
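As a rough check, both patterns can be run over the four names from the question in Python. The variable names below are illustrative only.

```python
import re

names = ["Today_games", "Today_games_freq", "Today_things", "Today_things_freq"]

# Lazy capture, with an optional _freq suffix anchored at the end of the string.
lazy = re.compile(r"Today_(\w+?)(?:_freq)?$")
# Letters and digits only, so the capture stops at the next underscore.
no_underscore = re.compile(r"Today_([^\W_]+)")

lazy_results = [lazy.match(n).group(1) for n in names]
class_results = [no_underscore.match(n).group(1) for n in names]

print(lazy_results)
print(class_results)
```

The second pattern is the simpler of the two when the wanted word never contains an underscore. The first one also copes with words that do, as long as the only suffix to drop is _freq.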

golang regex get the string including the search character

I am extracting a piece of string from a string (link):
https://arteptweb-vh.akamaihd.net/i/am/ptweb/100000/100000/100095-000-A_0_VO-STE%5BANG%5D_AMM-PTWEB_XQ.1V7rLEYkPH.smil/master.m3u8
The desired output should be 100000/100000/100095-000-A_
I am using the regex ^.*?(/[i,na,fm,d]([,/]?)(/am/ptweb/|.+=.+,))([^_]*).*?$ in the Golang flavor, and I can only get Group 4, with the following output: 100000/100000/100095-000-A
However I want the underscore after A.
I'm a bit stuck on this, so any help is appreciated.
You can use
(/(i|na|fm|d)(/am/ptweb/|.+=.+,))([^_]*_?)
See the regex demo.
Details:
(/(i|na|fm|d)(/am/ptweb/|.+=.+,)) - Group 1:
/ - a / char
(i|na|fm|d) - Group 2: i, na, fm or d
(/am/ptweb/|.+=.+,) - Group 3: /am/ptweb/ or one or more chars as many as possible (other than line break chars), =, one or more chars as many as possible (other than line break chars) and a , char
([^_]*_?) - Group 4: zero or more chars other than _ and then an optional _.
You can match the underscore after the A like:
^.*?(/(?:[id]|na|fm)([,/]?)(/am/ptweb/|.+=.+,))([^_]*_).*$
See a regex demo
A few notes about the pattern that you tried:
The notation [i,na,fm,d] is a character class (it matches a single char out of the listed ones), whereas what you need is a grouping: (?:[id]|na|fm)
In this group ([,/]?) you optionally capture either , or / so in theory it could match a string that has /i//am/ptweb/
The last part .*?$ does not have to be non greedy as it is the last part of the pattern
This part [^_]* can also match spaces and newlines
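The first suggested pattern can be sanity-checked outside Go as well. The sketch below uses Python's re module rather than Go's regexp package, on the assumption that the pattern relies only on syntax both engines share (groups, alternation, a negated character class). The URL is the one from the question.

```python
import re

url = ("https://arteptweb-vh.akamaihd.net/i/am/ptweb/100000/100000/"
       "100095-000-A_0_VO-STE%5BANG%5D_AMM-PTWEB_XQ.1V7rLEYkPH.smil/master.m3u8")

# Group 4 takes everything up to the first underscore, plus that underscore if present.
pattern = re.compile(r"(/(i|na|fm|d)(/am/ptweb/|.+=.+,))([^_]*_?)")

m = pattern.search(url)
segment = m.group(4) if m else None
print(segment)
```

Because _? is optional, Group 4 still matches when the input has no underscore at all. Use _ without the ? if the underscore must be present.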

Regex: match patterns starting from the end of string

I wish to match a filename with column and line info, e.g.
\path1\path2\a_file.ts:17:9
//what i want to achieve:
match[1]: a_file.ts
match[2]: 17
match[3]: 9
This string can have garbage before and after the pattern, like
(at somewhere: \path1\path2\a_file.ts:17:9 something)
What I have now is this regex, which manages to match the column and line, but I got stuck on the filename-capturing part. I guess a negative lookahead is the way to go, but it seems to match all previous groups and the garbage text at the end of the string.
(?!.*[\/\\]):(\d+):(\d+)\D*$
Here's a link to current implementation regex101
You can replace the lookahead with a negated character class:
([^\/\\]+):(\d+):(\d+)\D*$
See the regex demo. Details:
([^\/\\]+) - Group 1: one or more chars other than / and \
: - a colon
(\d+) - Group 2: one or more digits
: - a colon
(\d+) - Group 3: one or more digits
\D*$ - zero or more non-digit chars till end of string.
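To see the negated character class at work, here is a small Python sketch run on both sample strings from the question. The helper name parse_location is hypothetical. The slash is left unescaped inside the class because Python does not need it escaped.

```python
import re

# The negated class keeps Group 1 from crossing a / or \ path separator.
LOCATION = re.compile(r"([^/\\]+):(\d+):(\d+)\D*$")

def parse_location(text):
    """Hypothetical helper: return (filename, first number, second number) or None."""
    m = LOCATION.search(text)
    return m.groups() if m else None

plain = parse_location(r"\path1\path2\a_file.ts:17:9")
noisy = parse_location(r"(at somewhere: \path1\path2\a_file.ts:17:9 something)")
print(plain)
print(noisy)
```

Note that \D*$ only tolerates trailing text without digits, so a string with another number after the location would not match.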

Kotlin / Regex - Replace a group of pattern with a repeating character

I would like to mask the email passed to the maskEmail function. The problem I'm currently facing is that the asterisk * is not repeated when I replace groups 2 and 4 of my pattern.
Here is my code:
fun maskEmail(email: String): String {
    return email.replace(Regex("(\\w)(\\w*)\\.(\\w)(\\w*)(#.*\\..*)$"), "$1*.$3*$5")
}
Here is the input:
tom.cat#email.com
cutie.pie#email.com
captain.america#email.com
Here is the current output of that code:
t*.c*#email.com
c*.p*#email.com
c*.a*#email.com
Expected output:
t**.c**#email.com
c****.p**#email.com
c******.a******#email.com
Edit:
I know this could be done easily with a for loop, but I need this to be done with a regex. Thank you in advance.
For your problem, you need to match each character in the email address that is not the first character in a word and occurs before the #. You can do that with a negative lookbehind for a word boundary and a positive lookahead for the # symbol:
(?<!\b)\w(?=.*?#)
The matched characters can then be replaced with *.
Note we use a lazy quantifier (?) on the .* to improve efficiency.
Demo on regex101
Note also as pointed out by #CarySwoveland, you can replace (?<!\b) with \B i.e.
\B\w(?=.*?#)
Demo on regex101
As pointed out by #Thefourthbird, this can be made more efficient still by replacing the .*? with [^\r\n#]*, i.e.
\B\w(?=[^\r\n#]*#)
Demo on regex101
Or, if you're only matching single strings, just [^#]*:
\B\w(?=[^#]*#)
Demo on regex101
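The single-character approach is easy to try in other languages too. Below is a minimal Python sketch of the last variant, using the three inputs from the question. The name mask_email is illustrative only, and # is kept as the separator exactly as it appears in the sample data.

```python
import re

# \B\w: a word char that is not at the start of a word.
# (?=[^#]*#): only when a # still follows somewhere later in the string.
MASK = re.compile(r"\B\w(?=[^#]*#)")

def mask_email(email):
    """Hypothetical helper: replace each matched char with one asterisk."""
    return MASK.sub("*", email)

for email in ["tom.cat#email.com", "cutie.pie#email.com", "captain.america#email.com"]:
    print(email, "->", mask_email(email))
```

Since every match is exactly one character, each replacement yields exactly one asterisk, which is what makes the mask repeat without any loop.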
I suggest keeping the char at the start of the string and any char that directly follows a dot, and replacing every other char with * as long as a # still appears somewhere after it:
((?:\.|^).)?.(?=.*#)
Replace with $1*. See the regex demo. This will handle emails that happen to contain chars other than just word (letter/digit/underscore) and . chars.
Details
((?:\.|^).)? - an optional capturing group matching a dot or start of string position and then any char other than a line break char
. - any char other than a line break char...
(?=.*#) - if followed with any 0 or more chars other than line break chars as many as possible and then #.
Kotlin code (with a raw string literal used to define the regex pattern so as not to have to double escape the backslash):
fun maskEmail(email: String): String {
    return email.replace(Regex("""((?:\.|^).)?.(?=.*#)"""), "$1*")
}
See a Kotlin test online:
val emails = arrayOf<String>("captain.am-e-r-ica#email.com","my-cutie.pie+here#email.com","tom.cat#email.com","cutie.pie#email.com","captain.america#email.com")
for (email in emails) {
    val masked = maskEmail(email)
    println("${email}: ${masked}")
}
Output:
captain.am-e-r-ica#email.com: c******.a*********#email.com
my-cutie.pie+here#email.com: m*******.p*******#email.com
tom.cat#email.com: t**.c**#email.com
cutie.pie#email.com: c****.p**#email.com
captain.america#email.com: c******.a******#email.com

Regex matching a text after a specific string until another specific string

If I have the following example:
X-FileName: pallen (Non-Privileged).pst
Here is our forecast
Message-ID: <15464986.1075855378456.JavaMail.evans#thyme>
How can I select the text
Here is our forecast
after "X-FileName .... \n" and up to, but not including, "Message-ID"?
I read about lookahead and lookbehind and tried this, but it didn't work:
(?<=X-FileName:(\n)+$).+(?=Message-ID:)
This should do it:
(?:X-FileName:[^\n]+)\n+([^\n]+)\n+(?:Message-ID:) (group #1 is the match)
Explanation:
(?:X-FileName:[^\n]+) matches X-FileName: followed by one or more characters that aren't newlines, without capturing it (?:).
\n+ matches one or more consecutive newlines.
([^\n]+) matches and captures one or more consecutive characters that aren't newlines.
\n+, again, matches one or more consecutive newlines.
(?:Message-ID:) matches Message-ID: without capturing it (?:).
Edit: as #WiktorStribiżew mentioned though, splitting your text into lines may be an easier/cleaner way to retrieve what you want.
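The pattern above can be tried in Python roughly as follows. The sample text assumes the header lines are separated by blank lines. Since \n+ accepts one or more newlines, the same pattern should also work when they are not.

```python
import re

# Sample text, assuming blank lines between the three parts.
text = (
    "X-FileName: pallen (Non-Privileged).pst\n"
    "\n"
    "Here is our forecast\n"
    "\n"
    "Message-ID: <15464986.1075855378456.JavaMail.evans#thyme>"
)

# Group 1 is the single line between the X-FileName line and Message-ID.
pattern = re.compile(r"(?:X-FileName:[^\n]+)\n+([^\n]+)\n+(?:Message-ID:)")

m = pattern.search(text)
body = m.group(1) if m else None
print(body)
```

Keep in mind that ([^\n]+) captures a single line only, so a body spanning several lines would need a different pattern.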
There are two approaches here, and they depend on the broader context. If your expected substring is the second paragraph, just split with \n\n (or \r\n\r\n) and get the second item from the resulting list.
If it is a text inside some larger text, use a regex.
See a Python demo:
import re

s = '''X-FileName: pallen (Non-Privileged).pst

Here is our forecast

Message-ID: <15464986.1075855378456.JavaMail.evans#thyme>'''

# Non-regex way, for a string in exactly this format (paragraphs separated by blank lines)
print(s.split('\n\n')[1])

# Regex way to get some substring in a known context
m = re.search(r'X-FileName:.*[\r\n]+(.+)', s)
if m:
    print(m.group(1))
The regex means:
X-FileName: - a literal substring
.* - any 0+ chars other than line break chars
[\r\n]+ - 1 or more CR or LF chars
(.+) - Group 1: one or more chars other than line break chars, as many as possible.
See the regex demo.