Extract address from description with regex

I'm trying to extract an address (written in French) out of a listing using regex.
Here is the example:
"Don't wait, this home won't be on the market for long!
Pictures can be forwarded upon request.
123 de la street - city
345-555-1234 "
Imagine that whole thing is item.description. Here is a working set so far:
In "item.description", replace "^\d{1,4} des|de la|du [^,\s]+$" with "whatever"
and the address (123 de la street) will be correctly overwritten with whatever. BUT if I try to make it the only thing kept from the description, with something like this, it doesn't work:
In "item.description" replace "(.)(^\d{1,4} des|de la|du [^,\s]+$)(.)" with "$2"
What would be the best way to replace the whole description with just the address?
Thanks!

Try adding * to the first and last token, plus watch out for ^$ signs! (They match start and end of the text.)
"^(.*)(\d{1,4} des|de la|du [^,\s]+)(.*)$"
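A minimal Python sketch of the capture-and-replace idea, assuming the description is a multi-line string. Two details differ from the pattern above and are assumptions rather than part of the answer: the street particles are grouped as (?:des|de la|du) so the alternation does not split the whole pattern, and the leading .*? is lazy so it does not swallow the first digits of the street number. The name extract_address is hypothetical.

```python
import re

# Street number, a French particle, then one word for the street name.
# The (?:...) group keeps the alternation from splitting the whole pattern.
ADDRESS = r"\d{1,4} (?:des|de la|du) [^,\s]+"

# re.DOTALL lets "." cross line breaks; the lazy ".*?" stops at the
# first position where the address pattern can match.
WHOLE = re.compile(r"^.*?(" + ADDRESS + r").*$", re.DOTALL)


def extract_address(description: str) -> str:
    """Return only the address, or the description unchanged if none is found."""
    return WHOLE.sub(r"\1", description, count=1)


description = (
    "Don't wait, this home won't be on the market for long!\n"
    "Pictures can be forwarded upon request.\n"
    "123 de la street - city\n"
    "345-555-1234 "
)
print(extract_address(description))  # 123 de la street
```

With a greedy (.*) in front, the engine would give back as little as possible, so the captured street number could lose its leading digits.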

Related

Regex to get the [nth] name following a specific set of text

I don't have a great grasp of regex, but I am attempting to grab names following the word "sortname", but only after the nth time that word appears.
I have (thanks to Wikipedia's API) a list of governors in the United States, listed alphabetically by state name. (https://en.wikipedia.org/w/api.php?action=parse&prop=wikitext&page=List_of_current_United_States_governors&section=1&format=json)
If you do ctrl+f you will see that each name follows the word "sortname" and there are 50 of them. So if I wanted to see who the Governor of Texas is, I would get the name that follows the 43rd instance of the word "sortname". Furthermore, the first and last name of each governor is formatted as "sortname|Kay|Ivey" or "sortname|Michelle|Lujan Grisham".
Thanks for the help!
After some more testing I have ended up with the following pattern sortname([^;]*)[^}|]}
It collects more than necessary, but it's going in the right direction. I can use Python to sort it out from there.
Assuming a string text contains the whole text, would you please try:
import re
# findall returns every match in order, so index 42 is the 43rd "sortname"
m = re.findall(r'sortname\|[^|]+\|[^}]+', text, re.DOTALL)
print(m[42])
Output:
sortname|Greg|Abbott
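As a variation on the findall approach above, here is a small sketch that captures the first and last name separately and picks the n-th entry. The sample string is an invented stand-in for the wikitext, not the real API response, and nth_sortname is a hypothetical helper name.

```python
import re

# Captures the first and last name that follow "sortname|".
# [^|}]+ stops at the next pipe or closing brace.
SORTNAME = re.compile(r"sortname\|([^|}]+)\|([^|}]+)")


def nth_sortname(text: str, n: int) -> str:
    """Return 'First Last' for the n-th (1-based) sortname entry."""
    matches = SORTNAME.findall(text)
    first, last = matches[n - 1]  # raises IndexError if fewer than n entries
    return f"{first} {last}"


# Invented sample in the same shape as the question's examples.
sample = "| {{sortname|Kay|Ivey}} || ... | {{sortname|Michelle|Lujan Grisham}} || ..."
print(nth_sortname(sample, 2))  # Michelle Lujan Grisham
```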

Simple regex doesn't work in Chrome extension (Popup.js)

I have an issue that is driving me nuts. All my other regexes are working, but a regex "stops" working as soon as it includes a letter.
The string is as follows:
مسلسل The Outsider 2020 الموسم الاول الحلقة 8 مترجمة
I want to get the number "8". The text can be understood as "Series x Season y Episode NUMBER". I'm trying to get the number using the following regex:
/(?<=.*الحلقة.*)[0-9]+(?=.*)/g
It's working fine when applying it here https://regexr.com/ OR on the console. I'm implementing it in the Popup.js file.
The following regex is working fine without issues
/(?<=title=")(.*?)(?=")/
My JS code to get the regex:
let episodeNumber = item.match(/(?<=title=")(.*?)(?=")/)[0];
console.log("t: " + episodeNumber);
episodeNumber = episodeNumber.match(/(?<=.*الحلقة.*)[0-9]+(?=.*)/g);
console.log(episodeNumber);
The first log gives me the result (the text above)
The second log gives null in the console. It's really weird.
If the regex isn't formatted well, please replace "الحلقة" with "Episode". It's in Arabic, that's why it would look weird in the formatting.
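One way to sidestep the lookbehind entirely is to match the word and capture the digits that follow it. The sketch below shows that idea in Python rather than in the question's JavaScript, and it does not diagnose why the original pattern returns null inside the extension. The names EPISODE_WORD and episode_number are hypothetical.

```python
import re

EPISODE_WORD = "الحلقة"  # the word for "episode", copied from the question's title

# Match the word, optional whitespace, then capture the digits that follow.
EPISODE = re.compile(re.escape(EPISODE_WORD) + r"\s*(\d+)")


def episode_number(title: str):
    """Return the episode number as an int, or None if it is not found."""
    match = EPISODE.search(title)
    return int(match.group(1)) if match else None


title = "مسلسل The Outsider 2020 الموسم الاول الحلقة 8 مترجمة"
print(episode_number(title))  # 8
```

Because the digits must come directly after the word, the year 2020 earlier in the title is not picked up.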

Need to use regex to extract a part of a string

I'm a regex noob that's trying to use the regexp_extract() function in data studio to extract part of a string. Could you help me out?
I need to extract the part of the string that comes after 'May'. Everything before 'May' is exactly the same across all campaigns.
I've tried googling the solution and killed a lot of time on regexer.com, but I can't figure it out.
Current Campaign Name:
Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24
Expected Campaign Names:
Comedy Movie Fans18-24
South Asian Film Fans18-24
Cricket Enthusiasts18-24
Motorcycle Enthusiasts18-24
EDIT: I'm trying to use this in data studio in the REGEXP_EXTRACT(Campaign,"regex_code_here") function. I think the acceptable syntax is re2.
You may actually use REGEXP_REPLACE here to remove all before and including May:
REGEXP_REPLACE(Campaign, '.*May', '')
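The regex you need is this:
(?<=May).*$
Note that this relies on a lookbehind, which RE2 does not support, so it only helps in engines that allow lookbehind (PCRE or JavaScript, for example).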
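You can use replace:
^.*?May - match everything up to and including the first occurrence of May
"$`" - replace with the portion of the string that precedes the match, which is empty here because the match starts at the beginning of the string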
let arr = ["Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24","Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24","Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24","Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24"]
let op = arr.map(str=> str.replace(/^.*?May/g, "$`"))
console.log(op)
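The same prefix removal can be sketched in Python with re.sub, using the campaign names from the question. This only illustrates the regex itself, not Data Studio's REGEXP_REPLACE; the helper name audience is hypothetical.

```python
import re

campaigns = [
    "Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24",
    "Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24",
    "Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24",
    "Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24",
]


def audience(campaign: str) -> str:
    """Drop everything up to and including the first 'May'."""
    # The lazy ".*?" stops at the first "May"; count=1 replaces only that prefix.
    return re.sub(r"^.*?May", "", campaign, count=1)


for name in campaigns:
    print(audience(name))
```

The lazy quantifier matters only if "May" could appear again later in the name; a greedy .*May would cut at the last occurrence instead.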

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. \d matches a digit, and {5} matches exactly five repetitions of the preceding atom. The parentheses surrounding \d{5} form a capture group, which makes the captured text available in the replacement as $1. The (?!\d) is a more advanced construct known as a negative lookahead assertion: it peeks at the next character to check that it is not a digit (otherwise the match could be the first five digits of a longer number). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
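For comparison, here is the same substitution sketched in Python, where the capture group is written \1 in the replacement instead of $1. The sample line reuses the placeholder data from the question; tab_before_zip is a hypothetical helper name.

```python
import re

# A space, exactly five digits (captured), and no further digit after them.
ZIP_AFTER_SPACE = re.compile(r" (\d{5})(?!\d)")


def tab_before_zip(line: str) -> str:
    """Replace the space in front of a five-digit ZIP code with a tab."""
    return ZIP_AFTER_SPACE.sub("\t" + r"\1", line)


line = "999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
print(repr(tab_before_zip(line)))

# A six-digit number is left alone because of the negative lookahead.
print(repr(tab_before_zip("ID 123456")))
```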

Regex: Finding all line breaks without an " as previous character

I have a file with a bunch of data looking like this:
"sc14b61ecf5ef162","sc14b61b07ba1690","1264806000","1264806000","780","1080","Navn arrangement:
Dørene åpner:
Arr.start:
Arr:slutt:
Dørene stenger:
HA (navn):
HA (tlf):
Type arrangement: (her: om konsert, gjerne sjanger)
Forvetet antall gjester:"
"sc14b61f9e35f569","sc14b61bf07647db","1265583600","1265583600","1020","1260","Nord/Sør
Foredrag
Ønsker skjenking"
This repeats itself many times (with different data). I would like it to look like this:
"sc14b61ecf5ef162","sc14b61b07ba1690","1264806000","1264806000","780","1080","Navn arrangement:Dørene åpner:Arr.start:Arr:slutt:Dørene stenger:HA (navn):HA (tlf):Type arrangement: (her: om konsert, gjerne sjanger)Forvetet antall gjester:"
"sc14b61f9e35f569","sc14b61bf07647db","1265583600","1265583600","1020","1260","Nord/Sør Foredrag Ønsker skjenking"
I think that what I need is some way to remove all line breaks that do not have a " in front of them, but my regex is weak.
I'm using Textwrangler (the text editor for OS X).
This is called a negative lookbehind. This should do the trick.
(?<!")\n
Per @ax in the comments, on a Mac, you may need to change the \n to \r like so:
(?<!")\r
If that still doesn't work, sometimes you may need to combine the two:
(?<!")\r\n
One of these should meet your needs.
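Outside a text editor, the same negative lookbehind can be sketched in Python. The sample data is a shortened, invented version of the rows in the question, and join_wrapped_lines is a hypothetical name. The line endings are normalized first, as an assumption of this sketch, so that a single pattern covers \n, \r, and \r\n.

```python
import re


def join_wrapped_lines(text: str) -> str:
    """Remove line breaks that are not directly preceded by a double quote."""
    # Normalize CRLF and bare CR to LF so one pattern covers every case.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # (?<!") is the negative lookbehind: the break must not follow a quote.
    return re.sub(r'(?<!")\n', "", text)


data = (
    '"id1","780","Navn:\nArr.start:\nHA (navn):"\n'
    '"id2","1020","Nord/Sør\nForedrag"\n'
)
print(join_wrapped_lines(data))
```

Breaks that end a record follow the closing quote and are kept, so each record ends up on one line.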