Extract digits in between 2 different parameters - regex

I have all data being imported into one cell as:
"<blank space><email address><blank space><CustomerId><blank space><(email address)><line break for next entry>"
Example:
email1#provider.com 12345678 (email1#provider.com)
email224#provider.com 23902490 (email224#provider.com)
I need to extract only the customer ID's, while separating them with a comma, so I tried the following: regexreplace(A2,"([^[:digit:]])",","), however, this also extracts the numbers associated with the emails, so it returns me:
,,,,,1,,,,,,,,,,,,,,12345678,,,,,,,1,,,,,,,,,,,,,,
,,,,,224,,,,,,,,,,,,,,23902490,,,,,,,224,,,,,,,,,,,,,,
Since the email address is set by the user, I don't have control how many digits or if only digits are used in it. I can't seem to understand how to isolate the CustomerIds alone.
Please help!
Edit1:
CustomerID: 64-bit int field, randomly assigned to a client, therefore checking by the length of the string would not work.
Edit2:
For now, I am using the formula below, but I would still be interested in a solution using Regex.
filter(transpose(split($B$4," ")),isnumber(transpose(split($B$4," "))))

If they are separated by a space you should be able to set the space to be your delimiter and extract from there.
https://zapier.com/blog/split-text-excel-zapier/

use:
=ARRAYFORMULA(TEXTJOIN(", ", 1, IFERROR(REGEXEXTRACT(A1:A2, "(?s)(\d{8})"))))

Related

Pull a positioned grouped and repeating string using regex?

Lets say I have data output such as the following:
0; root.; 0; MLG.; 247; root.; 249; MLG.; 2390; toasty.; ... someNumber; username.;
I am trying to pull the username while ignoring the number and semicolon previous to it from when the numbers are not 0, and to do this for an unknown number of times. In regards to this data the output should appear favorably as:
root. MLG. toasty.
The syntax MUST be perl format. The application using this takes no exceptions. While I do have complete control over how this data is being presented (such as I can remove the semicolons next to the numbers and attach the unique number to the username with a period and separate with a semicolon) I would like to know the method to do this regardless.
Some of the many current regexs I've tried is as follows...
(The (?<field> is to specify the regex following the end > a field name to be specified and shown by the application in case anyone is wondering)
Pulls all data to the last semicolon starting from where the first number shown is not 0.
"(?<users_online>[1-9].*;)"
Pulls all data after the first instance of non-zero number and a semicolon occurs.
"[1-9];(?<users_online>.*?)"
Pulls all data after the first instance of a non-zero number and a semicolon and outputs any alphanumeric values up until the next "word" boundry.
"[1-9];((?<users_online> \w+\b))"
Any and all help is appreciated.
You can do it like this:
my $s = "0; root.; 0; MLG.; 247; root.; 249; MLG.; 2390; toasty.; ... someNumber; username.;";
# ^--- added
print join " ", $s=~/[1-9]0*; ([^;]+)/g;
You only need to describe at least one digit that isn't a zero before eventual zeros and to use an appropriate character class that excludes the semicolon.

Extract string of numbers from URL using regex PIG

I'm using PIG to generate a list of URLs that have been recently visited. In each of the URLs, there is a string of numbers that represents the product page visited. I'm trying to use a regex_extract_all() function to extract just the string of numbers, which vary in length from 6-8. The string of digits can be found directly after jobs2/view/ and usually ends with +&cd but sometimes they may end with ).
Here are a few example URLs:
(http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=cl k&gl=hk)
Here is the current regex I am using:
J = FOREACH jpage GENERATE FLATTEN(REGEX_EXTRACT_ALL(TEXTCOLUMN, '\/view\/(\d+)\+\&')) as (output:chararray)
I have also tried other forms such as:
'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', 'view.([0-9]+)', 'view\/([\d]+)\+',
'[0-9][0-9][0-9]+', and
'[0-9][0-9][0-9]*'; none of which work.
Can anybody assist here or have another way of going about it?
Much appreciated,
MM
Reason for"Unexpected character 'D'" is, you need to put double backslash instead of single backslash. eg just replace [\d+] to [\\d+]
Here your solution, please validate all your inputs strings
input.txt
http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk
http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248
Updated Pigscript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE REGEX_EXTRACT(line,'.*/view/(\\d+)([+|&|cd|)?]+)?',1);
dump B;
(17069404)
(5977065)
(16988928)
(16988928)
(16988928)
(16988928)
I'm not familiar with PIG, but this regex will match your target:
(?<=/jobs2/view/)\d+
By using a (non-consuming) look behind, the entire match (not just a group of the match) is your number.

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim r As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
Dim input As String = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} is known as a capture group, and is responsible for putting what's captured in a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion, and it basically peeks at the next character to check that it's not a digit (because then it could be a 6-or-more digit number, where the first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.

How can I match all strings unless it contains a certain string?

So I want to match every string in this list, except the ones that contain the product SKU, which is /s7892632 <---- random string of numbers. I've been trying to do this for quite some time and have been unsuccessful. Any insight would be greatly appreciated.
/account/login?returnurl=/account/forgotpassword
/account/login?returnurl=/account/orders
/account/orders
/account/updateaddress
/account/updateemail
/account/updaterewardscard
/brands/havaianas
/careers
/Category List
/checkout
/checkout/addresses
/checkout/addresses/delivery
/checkout/addresses/deliverymethod
/checkout/affilinetbasket
/checkout/anonymous
/checkout/confirmation
/checkout/express
/checkout/login
/checkout/login?returnurl=/checkout/addresses
/checkout/null
/checkout/payment
/checkout/paypal
/checkout/quickshop/
/checkout/verify
/click-and-collect
/click-and-collect/click-and-collect-overview
/corporate/about-matalan
/corporate/careers
/corporate/cookies
/corporate/history
/customer-services/accessibility
/customer-services/contact
/customer-services/customer-services-home
/customer-services/delivery
/customer-services/faq
/customer-services/fitting-room
/customer-services/here-to-help
/customer-services/size-guides
/delivery
/events/mothers-day
/events/mothers-day/s2516241/tassle-detail-slouch-bag
/events/mothers-day/s2518752/waxed-jacket
/events/mothers-day/s2519237/fabric-buckle-tote-bag
/events/mothers-day/s2521182/heart-print-nightie
/events/mothers-day/s2521184/heart-print-dressing-gown
/events/mothers-day/s2521185/heart-print-pyjama-set
/events/mothers-day/s2521679/structured-tote-bag
/events/mothers-day/s2522143/chiffon-print-dress
/events/mothers-day/s2522347/butterfly-enamel-bowl-32cm-x-8cm
/events/mothers-day/s2526013/animal-print-jersey-blazer
/events/mothers-day/s2527624/croc-tote-bag
/events/mothers-day/s2529731/shift-dress
/events/mothers-day?page=1&size=120&cols=4&sort=&id=/events/mothers-day&priceRange[min]=2&priceRange[max]=59
/events/mothers-day?page=2&size=120&cols=4&sort=&id=/events/mothers-day&priceRange[min]=2&priceRange[max]=59
/events/mothers-day?page=2&size=36&cols=4&sort=&id=/events/mothers-day&priceRange[min]=2&priceRange[max]=59
/events/mothers-day?page=3&size=36&cols=4&sort=&id=/events/mothers-day&priceRange[min]=2&priceRange[max]=59
The following should work:
^(?!.*/s\d{7}/).*
Example: http://regexr.com?343nf
This assumes you have each string as a separate element in a list. If this is actually matching one big string with multiple lines you can use the same regex, but you may need to enable global and multiline options depending on the tool you are using (and make sure dotall/singleline is disabled).
Try this:
boolean noSku = !line.matches(".*/s\\d{5,}.*");
This uses {5,} which allows for any number of digits in the SKU greater than 4 (giving you flexibility with matching). You can change the number to whatever suits.
this matches lines that don't have the code....
^((?!s\d{7}).)*$

Regular expressions, what a trouble!

I need your kind help to resolve this question.
I state that I am not able to use regolar expressions with Oracle PL/SQL, but I promise that I'll study them ASAP!
Please suppose you have a table with a column called MY_COLUMN of type VARCHAR2(4000).
This colums is populated as follows:
Description of first no.;00123457;Description of 2nd number;91399399119;Third Descr.;13456
You can see that the strings are composed by couple of numbers (which may begin with zero), and strings (containing all alphanumeric characters, and also dot, ', /, \, and so on):
Description1;Number1;Description2;Number2;Description3;Number3;......;DescriptionN;NumberN
Of course, N is not known, this means that the number of couples for every record can vary from record to record.
In every couple the first element is always the number (which may begin with zero, I repeat), and the second element is the string.
The field separator is ALWAYS semicolon (;).
I would like to transform the numbers as follows:
00123457 ===> 001-23457
91399399119 ===> 913-99399119
13456 ===> 134-56
This means, after the first three digits of the number, I need to put a dash "-"
How can I achieve this using regular expressions?
Thank you in advance for your kind cooperation!
I don't know Oracle/PL/SQL, but I can provide a regex:
([[:digit:]]{3})([[:digit:]]+)
matches a number of at least four digits and remembers the first three separately from the rest.
RegexBuddy constructs the following code snippet from this:
DECLARE
result VARCHAR2(255);
BEGIN
result := REGEXP_REPLACE(subject, '([[:digit:]]{3})([[:digit:]]+)', '\1-\2', 1, 0, 'c');
END;
If you need to make sure that those numbers are always directly surrounded by ;, you can alter this slightly:
(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)
However, this will not work if two numbers can directly follow each other (12345;67890 will only match the first number). If that's not a problem, use
result := REGEXP_REPLACE(subject, '(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)', '\1\2-\3\4', 1, 0, 'c');