Need to use regex to extract a part of a string - regex

I'm a regex noob that's trying to use the regexp_extract() function in data studio to extract part of a string. Could you help me out?
I need to extract the part of the string that comes after 'May'. Everything before 'May' is exactly the same across all campaigns.
I've tried googling the solution and killed a lot of time on regexer.com, but I can't figure it out.
Current Campaign Name:
Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24
Expected Campaign Names:
Comedy Movie Fans18-24
South Asian Film Fans18-24
Cricket Enthusiasts18-24
Motorcycle Enthusiasts18-24
EDIT: I'm trying to use this in data studio in the REGEXP_EXTRACT(Campaign,"regex_code_here") function. I think the acceptable syntax is re2.

You may actually use REGEXP_REPLACE here to remove all before and including May:
REGEXP_REPLACE(Campaign, '.*May', '')

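In engines that support lookbehind, this regex matches everything after May:
(?<=May).*$
Note that RE2 does not support lookbehind, so in Data Studio a capturing group is the safer choice; REGEXP_EXTRACT should then return the captured part:
REGEXP_EXTRACT(Campaign, 'May(.*)')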

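In JavaScript you can use replace:
^.*?May - matches everything up to and including the first occurrence of May
"$`" - inserts the portion of the string that precedes the match. Because the match starts at the beginning of the string, that portion is empty, so the match is simply removed.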
let arr = ["Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24","Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24","Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24","Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24"]
let op = arr.map(str=> str.replace(/^.*?May/g, "$`"))
console.log(op)
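As a quick sanity check of the two patterns discussed above, here is a minimal Python sketch. It is not Data Studio code: Python's re module is a different engine from RE2, and the helper names strip_prefix and extract_suffix are made up for illustration. Both patterns use only syntax that the two engines share.

```python
import re

campaigns = [
    "Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24",
    "Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24",
    "Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24",
    "Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24",
]

def strip_prefix(name):
    # Same idea as REGEXP_REPLACE(Campaign, '.*May', ''):
    # the greedy .* removes everything through the last "May".
    return re.sub(r".*May", "", name, count=1)

def extract_suffix(name):
    # Same idea as a capturing group in REGEXP_EXTRACT:
    # keep only what follows the first "May".
    match = re.search(r"May(.*)", name)
    return match.group(1) if match else None

stripped = [strip_prefix(c) for c in campaigns]
extracted = [extract_suffix(c) for c in campaigns]
print(stripped)
```

The two helpers agree here because each campaign name contains "May" exactly once; if an audience name could itself contain "May", the greedy and first-match versions would differ.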

Related

Regex to get the [nth] name following a specific set of text

I don't have a great grasp on Regex; but I am attempting to grab names following the word "sortname", but only after the nth time that word appears.
I have (thanks to Wikipedia's API) a list of governors in the United States, listed in order of their states name alphabetically. (https://en.wikipedia.org/w/api.php?action=parse&prop=wikitext&page=List_of_current_United_States_governors&section=1&format=json)
If you do ctrl+f you will see that each name follows the word "sortname" and there are 50 of them. So if I wanted to see who the Governor of Texas is, I would get the name that follows the 43rd instance of the word "sortname". Furthermore, the first and last name of each governor are formatted as "sortname|Kay|Ivey" or "sortname|Michelle|Lujan Grisham".
Thanks for the help!
After some more testing I have ended up with the following pattern sortname([^;]*)[^}|]}
It collects more than necessary, but it's going in the right direction. I can use Python to sort it out from there.
Assuming a string text contains the whole response, would you please try:
import re
m = re.findall(r'sortname\|[^|]+\|[^}]+', text)
print(m[42])  # index 42 is the 43rd match
Output:
sortname|Greg|Abbott
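If the first and last names are wanted separately, capture groups can split them in the same pass. The following is only a sketch: the sample string is a made-up stand-in for the real wikitext (which is much longer), and nth_governor is a hypothetical helper name.

```python
import re

# Made-up stand-in for the API response; only the sortname parts matter here.
sample = (
    "{{sortname|Kay|Ivey}} ... "
    "{{sortname|Michelle|Lujan Grisham}} ... "
    "{{sortname|Greg|Abbott}}"
)

def nth_governor(text, n):
    """Return (first, last) for the n-th 'sortname' entry, counting from 1."""
    # With two groups, findall returns a list of (first, last) tuples.
    # [^|}]+ stops at the next pipe or closing brace.
    names = re.findall(r"sortname\|([^|}]+)\|([^|}]+)", text)
    return names[n - 1]

first, last = nth_governor(sample, 2)
print(first, last)
```

On the full response, nth_governor(text, 43) would correspond to m[42] in the answer above.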

Regex String for Restructuring Author Firstname, Lastname, Title

I want to convert strings in the format
The European Union - A Very Short Introduction - Pinder, John
to
John Pinder - The European Union - A Very Short Introduction
I am having trouble matching on "Pinder" and "John" to reformat in the desired way.
You can use:
^(.*?)(?:-\s+(\w+),\s+(\w+))$
If you may have authors with multiple names (such as 'von Clausewitz, Carl') this won't work. Instead, maybe:
^(.*)(?:-\s+([^,]+?),\s+(\w+))$
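For reference, the second pattern can be applied as a substitution roughly like this in Python. This is a sketch under assumptions: the pattern is a slight variation (whitespace around the hyphen is matched outside the first group so the title comes out trimmed, and the first name may contain spaces), and the function name reorder is hypothetical.

```python
import re

# Greedy (.*) backtracks to the LAST " - " that is followed by "Last, First".
PATTERN = re.compile(r"^(.*)\s+-\s+([^,]+),\s+(.+)$")

def reorder(line):
    # \1 = title(s), \2 = last name, \3 = first name
    return PATTERN.sub(r"\3 \2 - \1", line)

print(reorder("The European Union - A Very Short Introduction - Pinder, John"))
print(reorder("On War - von Clausewitz, Carl"))
```

Lines that do not match the expected "Title - Last, First" shape are returned unchanged, which makes it easy to spot entries that need manual attention.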
There are many ways to approach the problem, all requiring some assumptions not specified in your question. Here is one solution...
^.+-\s+(.+),\s+(.+)$
It works by greedily consuming as many characters as possible up to the last hyphen followed by whitespace. It then assumes the last name runs up to a comma followed by whitespace, and that the first name is everything from there to the end of the string.
Depending on what you know about the uniformity of the data, this may or may not work for you, but I thought it would be nice to have a solution which does not try to restrict characters in name, but rather the rest of the format.
Use this code:
$count = preg_match_all('/(?:.*?) - (?:.*?) - (.*?), (.*)/', $string, $matches);
This will give you an array: $matches[1] holds the last name (in this case "Pinder") and $matches[2] holds the first name ("John"). You can then turn each back into a string if you want, using $lastname = implode('', $matches[1]);.

variable number of capturing groups

I have a xpath expression which I want to use to extract City and date from a td which contains a string of this kind:
City(may contain spaces and may be missing, but the following space is always present) on 2013/07/20
So far, I got to the following solution for extracting the date, which works partially:
//path/to/my/td/text()/replace(.,'(.*) on (.*)','$2')
This works when City is present, but when City is missing I get "on 2013/07/20" as a result.
I think this is because the first capturing group fails and so the number of groups is different.
How can I get this expression to work?
I did not fully check your regex, but it looks fine at first sight. Anyway, you can also go an easier way if you only want to get the date by extracting the text after "on ":
//path/to/my/td/text()/substring-after(.,'on ')
edit: or you may go the substring-way and select the last 10 characters of the content:
//path/to/my/td/text()/substring(., string-length(.) - 9)
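The underlying idea, keeping the number of groups fixed by letting the city group match an empty string, can be illustrated outside XPath. The Python sketch below is not an XPath solution; it assumes the date always has the YYYY/MM/DD shape shown in the question, and parse_cell is a made-up helper name.

```python
import re

# Group 1 (city) may be empty; group 2 (date) is always present,
# so the group numbering never shifts.
CELL = re.compile(r"^(.*?)\s*\bon\s+(\d{4}/\d{2}/\d{2})\s*$")

def parse_cell(cell):
    """Return (city or None, date), or None if the cell does not match."""
    match = CELL.match(cell)
    if match is None:
        return None
    city, date = match.groups()
    return (city or None, date)

print(parse_cell("New York on 2013/07/20"))
print(parse_cell(" on 2013/07/20"))
```

Anchoring on the date also avoids a pitfall of searching for the plain text "on ": a city such as "London" contains "on " itself once the following space is included.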

Regex to remove footer using wildcards

Ok - this is well beyond my limited knowledge of regular expressions. We receive a report from a banking entity in a fixed-width text file format. Unfortunately their system exports page headers with the data file that must be removed before processing on our end. The page headers start and end with the same text but the content changes (dates and page numbers). A typical one looks like:
00007xxxxx LAST1,FIRST1 111111 20120930
ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16
MRK014 Report Date: 10/04/12
Acct# Name SH. Balance QTR (YYYYMMDD)
----------------------------------------------------------------------------------------------------
00007xxxxx LAST2,FIRST2 222222 20120930
So each header starts with "ABCD" (actually the name of the bank, just removed here for privacy) and ends with the row of -------------------.
What I need to get it down to is the customer data on two rows (00007xxxxx - those account numbers change per person).
So I need to select from the " ABCD" to the end of the "---" to remove that block of text.
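Try this regex. This is Java code, but you can use the same pattern in your language:
str = str.replaceAll("(?s)ABCD.*?-{3,}\\r?\\n?", "");
Here str contains the data above. The (?s) flag lets . match line breaks, and the lazy .*? stops at the first run of three or more dashes, so each match covers exactly one header block.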
To make sure you remove only the header block, I would use a more specific regex pattern:
(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with an empty string.
However, if your environment does not support regex lookbehind, you have to use this pattern instead:
([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with the first group.
For example in JavaScript it would be:
str.replace(/([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+/g, "$1")
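The same approach can be sketched in Python, using multiline anchors instead of a captured line break. This is an illustration under assumptions: the sample text is rebuilt from the excerpt in the question, the exact number of dashes is a guess, and "ABCD" stands in for the real bank name.

```python
import re

report = (
    "00007xxxxx LAST1,FIRST1 111111 20120930\n"
    "ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16\n"
    "MRK014 Report Date: 10/04/12\n"
    "Acct# Name SH. Balance QTR (YYYYMMDD)\n"
    + "-" * 100 + "\n"
    "00007xxxxx LAST2,FIRST2 222222 20120930\n"
)

# MULTILINE: ^ matches at the start of every line.
# DOTALL: . also matches line breaks, so .*? can span the header lines.
# The lazy .*? stops at the first line that consists of dashes.
HEADER = re.compile(
    r"^[ \t]*ABCD\s+EXPORT\s+RPT\b.*?^-+\r?\n",
    re.MULTILINE | re.DOTALL,
)

cleaned = HEADER.sub("", report)
print(cleaned)
```

Requiring the header to start at the beginning of a line keeps the pattern from firing on a customer row that merely contains the bank name.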

Regex to replace string with another string in MS Word?

Can anyone help me with a regex to turn:
filename_author
to
author_filename
I am using MS Word 2003 and am trying to do this with Word's Find-and-Replace. I've tried the "Use wildcards" feature but haven't had any luck.
Am I only going to be able to do it programmatically?
Here is the regex:
([^_]*)_(.*)
And here is a C# example:
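using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string test = "filename_author";
        // @"..." is a verbatim string, so the pattern needs no extra escaping.
        string result = Regex.Replace(test, @"([^_]*)_(.*)", "$2_$1");
        Console.WriteLine(result); // author_filename
    }
}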
Here is a Python example:
from re import sub
test = "filename_author"
result = sub(r'([^_]*)_(.*)', r'\2_\1', test)
Edit: In order to do this in Microsoft Word using wildcards use this as a search string:
(<*>)_(<*>)
and replace with this:
\2_\1
Also, please see Add power to Word searches with regular expressions for an explanation of the syntax I have used above:
The asterisk (*) returns all the text in the word.
The less than and greater than symbols (< >) mark the start and end of each word, respectively. They ensure that the search returns a single word.
The parentheses and the space between them divide the words into distinct groups: (first word) (second word). The parentheses also indicate the order in which you want search to evaluate each expression.
Here you go:
s/^([a-zA-Z]+)_([a-zA-Z]+)$/\2_\1/
Depending on the context, that might be a little greedy.
Search pattern:
([^_]+)_(.+)
Replacement pattern:
$2_$1
In .NET you could use ([^_]+)_([^_]+) as the regex and then $2_$1 as the substitution pattern, for this very specific type of case. If you need more than 2 parts it gets a lot more complicated.
Since you're in MS Word, you might try a non-programming approach. Highlight all of the text, select Table -> Convert -> Text to Table. Set the number of columns at 2. Choose Separate Text At, select the Other radio, and enter an _. That will give you a table. Switch the two columns. Then convert the table back to text using the _ again.
Or you could copy the whole thing to Excel, construct a formula to split and rejoin the text and then copy and paste that back to Word. Either would work.
In C# you could also do something like this.
string[] parts = "filename_author".Split('_');
return parts[1] + "_" + parts[0];
You asked about regex of course, but this might be a good alternative.