Simple regex doesn't work in Chrome extension (Popup.js) - regex

I have an issue that is driving me nuts. All my other regexes work, but a regex "stops" working as soon as it includes a letter.
The string is as follows:
مسلسل The Outsider 2020 الموسم الاول الحلقة 8 مترجمة
I want to get the number "8". The text can be understood as "Series x Season y Episode NUMBER". I'm trying to get the number using the following regex:
/(?<=.*الحلقة.*)[0-9]+(?=.*)/g
It works fine on https://regexr.com/ and in the console, but not in the Popup.js file where I'm implementing it.
The following regex works without issues:
/(?<=title=")(.*?)(?=")/
My JS code that applies the regexes:
let episodeNumber = item.match(/(?<=title=")(.*?)(?=")/)[0];
console.log("t: " + episodeNumber);
episodeNumber = episodeNumber.match(/(?<=.*الحلقة.*)[0-9]+(?=.*)/g);
console.log(episodeNumber);
The first log gives me the expected result (the text above).
The second log gives null in the console. It's really weird.
If the regex doesn't display correctly, replace "الحلقة" with "Episode". The word is Arabic, which is why it may look odd in the formatting.
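The thread does not show a confirmed cause. One thing worth ruling out, purely as an assumption, is that the Arabic literal in Popup.js is decoded differently in the extension than in the console. The sketch below writes the word with \u escapes and uses a capture group instead of the lookbehind; the names EPISODE_WORD and extractEpisodeNumber are hypothetical.

```javascript
// "الحلقة" spelled as \u escapes, so the pattern does not depend on
// how the source file itself is encoded (assumption: that is the issue).
const EPISODE_WORD = "\u0627\u0644\u062D\u0644\u0642\u0629";

// Hypothetical helper: first run of digits after the episode word, or null.
function extractEpisodeNumber(title) {
  const m = title.match(new RegExp(EPISODE_WORD + "\\D*(\\d+)"));
  return m ? Number(m[1]) : null;
}

const title = "مسلسل The Outsider 2020 الموسم الاول الحلقة 8 مترجمة";
console.log(extractEpisodeNumber(title));
```

The capture group returns only the digits, so the year earlier in the title is never a candidate.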

Related

Need to use regex to extract a part of a string

I'm a regex noob that's trying to use the regexp_extract() function in data studio to extract part of a string. Could you help me out?
I need to extract the part of the string that comes after 'May'. Everything before 'May' is exactly the same across all campaigns.
I've tried googling the solution and spent a lot of time on regexer.com, but I can't figure it out.
Current Campaign Name:
Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24
Expected Campaign Names:
Comedy Movie Fans18-24
South Asian Film Fans18-24
Cricket Enthusiasts18-24
Motorcycle Enthusiasts18-24
EDIT: I'm trying to use this in data studio in the REGEXP_EXTRACT(Campaign,"regex_code_here") function. I think the acceptable syntax is re2.
You may actually use REGEXP_REPLACE here to remove all before and including May:
REGEXP_REPLACE(Campaign, '.*May', '')
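In engines that support lookbehind, the regex you need is this (note that RE2, which the question mentions, does not support lookbehind):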
(?<=May).*$
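Since RE2 has no lookbehind, a capture group is one way to get the same result without it. The sketch below shows the idea in JavaScript; the helper name afterMay is made up for illustration, and whether REGEXP_EXTRACT returns the captured group for a pattern like May(.*)$ is an assumption to check against the Data Studio documentation.

```javascript
// Hypothetical helper: return whatever follows the first "May", or null.
function afterMay(campaign) {
  const m = campaign.match(/May(.*)$/);
  return m ? m[1] : null;
}

console.log(afterMay("Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24"));
// Comedy Movie Fans18-24
```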
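You can use replace:
^.*?May - matches everything up to and including the first occurrence of May
"$`" - inserts the portion of the string that precedes the match; the match starts at the beginning of the string, so that portion is empty and the match is simply removed (an empty string "" as the replacement does the same)
let arr = [
  "Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24",
  "Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24",
  "Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24",
  "Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24"
];
// The pattern is anchored at ^, so the g flag is not needed.
let op = arr.map(str => str.replace(/^.*?May/, "$`"));
console.log(op);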

Type? of text blocks regEx function

I have recently started to use regEx for work and have now found a rather peculiar problem which I apparently can't solve myself...
Problem: I receive data from customers (from all over the world) and have to analyze it. This time the data has some peculiarities.
Example raw data:
Screw М4х20 , DIN7985 - This is the original text with the problem
Screw M4x20 , DIN7985 - This is manually typed text, which gives me perfect results
If I now try to pick out the dimension "M4x20" with the following regex:
(\b[M]?\d+x\d+\b)
it yields no results, neither in Excel nor on websites like regExr.
If I delete the M4x20 and type it anew, I do get results.
I have absolutely no idea where the problem lies, except that it is caused by the M and x characters. For reference, the other letters (a-z) in the text don't match either, while the numbers work fine.
Is there some way to analyze it?
Edit:
There is, and I just found out: the letters are Cyrillic, so the Latin letters in my pattern don't match them.
Though they can apparently be changed to latin letters quite easily.
The two characters that look like M and x are actually Cyrillic letters; in a regex they can be written as \u041C (Cyrillic capital Em) and \u0445 (Cyrillic small Ha).
VBA code:
Set re = CreateObject("VBScript.RegExp")
re.Global = True
re.Pattern = "\u041C?\d+\u0445\d+"
For Each Match In re.Execute("Screw М4х20 , DIN7985")
    Debug.Print Match.Value
Next
Output:
М4х20
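To answer the "is there some way to analyze it?" part, one option is to list the code points of any non-ASCII characters in the text, and to make the pattern accept both the Latin and the Cyrillic look-alikes. The following is a minimal JavaScript sketch; findNonAscii and dimension are illustrative names, not part of the original code.

```javascript
// Hypothetical helper: list every non-ASCII character with its code point.
function findNonAscii(text) {
  return [...text]
    .filter((ch) => ch.codePointAt(0) > 0x7f)
    .map((ch) => {
      const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
      return `${ch} U+${hex}`;
    });
}

// Accepts Latin M/x as well as the Cyrillic look-alikes U+041C and U+0445.
const dimension = /[M\u041C]?\d+[x\u0445]\d+/g;

// The problematic text, with the Cyrillic letters written as escapes.
const raw = "Screw \u041C4\u044520 , DIN7985";
console.log(findNonAscii(raw));   // two entries: U+041C and U+0445
console.log(raw.match(dimension)); // one match: the Cyrillic-spelled dimension
```

Listing the code points first makes it obvious which characters need to be normalized or added to the character classes.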

Driving Licenses siebel Issue

I have Florida driving licenses like A123-123-12-123-1 and A123456789321. I am currently using the expression below, and I want to display the data like XXXX-XXX-XX-XX1231.
([\s.,:])([a-zA-Z)\d{12}|([a-zA-Z)\d{3}[\s{1}-]\d{2}[\s{1}-]\d{3}[\s{1}-]\d{1}([\s.,:]).
Please let me know how I can use the above expression to remove all spaces from the expression and display the format I mentioned above.
Thanks
It seems there is a mismatch in the input and output, e.g.
A123-123-12-123-1
XXXX-XXX-XX-XX1231
there are two extra characters (ignoring dashes) in the desired output.
So assuming you want to make the output longer by repeating "12", e.g.
A123-123-12-123-1
A123-123-12-121231
Here is the code:
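// licence holds the input text; the pattern expects a separator before and after the number
const regex = /(?:[\s.,:])([a-zA-Z])(\d{3})[\s-]?(\d{3})[\s-]?(\d{2})[\s-]?(\d{2})(\d{1})[\s-]?(\d{1})(?:[\s.,:])/;
const fixed = licence.replace(regex, "$1$2-$3-$4-$5$5$6$7");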
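If the goal is instead to mask the number, a different reading of the question can be sketched as follows. Everything here is an assumption: that a licence is one letter plus 12 digits, that only the last four digits stay visible, and that the output keeps the 4-3-2-3-1 dash layout of the first input format. The name maskLicence is hypothetical.

```javascript
// Assumption: a licence is one letter followed by 12 digits, optionally
// separated by spaces or dashes (both formats from the question).
function maskLicence(raw) {
  const compact = raw.replace(/[\s-]/g, "");
  if (!/^[a-zA-Z]\d{12}$/.test(compact)) return null;
  // Assumption: keep only the last four digits visible.
  const masked = "X".repeat(9) + compact.slice(-4);
  // Regroup as 4-3-2-3-1, the dash layout of the first input format.
  return masked.replace(/^(.{4})(.{3})(.{2})(.{3})(.)$/, "$1-$2-$3-$4-$5");
}

console.log(maskLicence("A123-123-12-123-1")); // XXXX-XXX-XX-123-1
console.log(maskLicence("A123456789321"));     // XXXX-XXX-XX-932-1
```

Stripping the separators first means both input formats go through the same masking and regrouping step.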

How to include 2 words within Regex and result must be based on only those 2 words VB.NET

I would like to know how to require 2 or more keywords within a regex, so that results are shown only when all of the defined words are present, not just one of them.
What I currently have works with multiple keywords, but I want it to require BOTH words, not either one.
For example:
Dim pattern As String = "(?i)[\t ](?<w>((arma)|(crapo))[a-z0-9]*)[\t ]"
Now the code works fine by including 'arma' or 'crapo'. I only want it to include BOTH 'arma' AND 'crapo'; otherwise it should not show any results.
I'm finding certain keywords within a PDF document, and I only want to be shown results if the PDF document includes BOTH 'arma' and 'crapo'. (It works fine showing results for 'arma' OR 'crapo'; I want to see results based on 'arma' AND 'crapo'.)
Sorry for sounding so repetitive.
Edit: Here is my code. Please read comment.
Dim filesz() As String = GetPatternedFiles("c:\temp\", New String() {"tes*.pdf", "fes*.pdf", "Bas*.pdf"})
' GetPatternedFiles and GetTextFromPDF are my own functions.
For Each s As String In filesz
    Dim thetext As String = Nothing
    Dim pattern As String = "(?i)[\t ](?<w>(crapo)|(arma)[a-z0-9]*)[\t ]"
    thetext = GetTextFromPDF(s)
    For Each m As Match In Regex.Matches(thetext, pattern)
        ListBox1.Items.Add(s)
    Next
Next
You can use this regex:
\barma\b.*?\bcrapo\b|\bcrapo\b.*?\barma\b
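The idea is to match arma, then anything, then crapo (or crapo, then anything, then arma), with word boundaries to avoid matching inside words like karma. Note that . does not match a newline by default in .NET, so if the two words can be on different lines of the PDF text, add RegexOptions.Singleline (or the inline (?s) flag).
However, if you want to match karma or crapotos as you asked in your comment, you can use: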
arma.*?crapo|crapo.*?arma
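An alternative to one combined pattern is to test each keyword separately and require that all tests pass, which also scales to more than two words. The idea is sketched here in JavaScript rather than VB.NET; the helper names are hypothetical, and the sketch assumes whole-word, case-insensitive matching is what is wanted.

```javascript
// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: true only if every keyword occurs as a whole word.
function containsAllWords(text, words) {
  return words.every((w) =>
    new RegExp(`\\b${escapeRegExp(w)}\\b`, "i").test(text)
  );
}

console.log(containsAllWords("Arma on page 1\ncrapo on page 2", ["arma", "crapo"])); // true
console.log(containsAllWords("karma and crapo", ["arma", "crapo"]));                 // false
```

Because each word is tested on its own, the order of the words and any line breaks between them do not matter.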

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is this, because this chunk of data was just fine as-is before I removed the tabs. Now I need a way to put them back after they've been removed, or to exclude this small group from the document-wide tab removal my code starts with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5} matches exactly 5 repetitions of the previous atom. The parentheses around \d{5} form a capture group, which makes the captured text available in the replacement string as $1. The (?!\d) is a more advanced construct known as a negative lookahead assertion: it peeks at the next character to check that it is not a digit (otherwise the match could be the first 5 digits of a longer number). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing a letter, digit, or underscore directly after the digits.
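The same replacement can be tried out quickly in JavaScript, whose regex syntax agrees with .NET for this pattern. This is only a sketch: the helper name tabBeforeZip is made up, and the sample line is a shortened version of the anonymized data from the question.

```javascript
// Hypothetical helper: replace the space before a 5-digit ZIP with a tab.
function tabBeforeZip(line) {
  return line.replace(/ (\d{5})(?!\d)/g, "\t$1");
}

// JSON.stringify makes the inserted tab visible as \t.
console.log(JSON.stringify(tabBeforeZip("CITY ST 99999 999/999-9999")));
// "CITY ST\t99999 999/999-9999"
```

The phone number is left alone because none of its digit runs is a space followed by exactly five digits.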