pattern to extract linkedin username from text - regex

I am trying to extract a LinkedIn URL that is written in this format:
text = "patra 12 EXPERIENCE in / in/sambhu-patra-49b4759/ 2020 - Now O Skin Curate Research Pvt Ltd Embedded System Developer, WB 0 /bindasssambhul O SKILLS LANGUAGES Arduino English Raspberry Pi Movidius Hindi Bengali ICS Intel Compute Stick PCB Design Python UI Design using Tkinter HOBBIES HTML iti CSS G JavaScript JQuery IOT\n"
pattern = \/?in\/.+\/?\s+
I need to extract in/sambhu-patra-49b4759/ from any noisy text like the one above.
It's a LinkedIn URL written in short form.
My pattern is not working.

You can use
import re
m = re.search(r'\bin\s*/\s*(\S+)', text)
if m:
    print(m.group(1))
See the regex demo.
Details:
\b - word boundary
in - the literal text in
\s* - zero or more whitespaces
/ - a / char
\s* - zero or more whitespaces
(\S+) - Capturing group 1: one or more non-whitespace characters.
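To make the difference concrete, here is a minimal sketch that runs the pattern from the question and the suggested pattern on a shortened copy of the sample text. The shortening and the variable names greedy and fixed are my own, not part of the original posts.

```python
import re

# Shortened copy of the sample text from the question (assumption: the
# middle of the line is irrelevant to the behaviour shown here).
text = "patra 12 EXPERIENCE in / in/sambhu-patra-49b4759/ 2020 - Now O Skin Curate IOT\n"

# Pattern from the question: the greedy .+ runs to the end of the line,
# so the match contains far more than the profile path.
greedy = re.search(r"\/?in\/.+\/?\s+", text)

# Suggested pattern: \S+ stops at the first whitespace character.
fixed = re.search(r"\bin\s*/\s*(\S+)", text)

print(repr(greedy.group()))
print(fixed.group(1))
```

On this particular text the suggested pattern starts matching at the stray "in /" that precedes the link, which is why group 1 still contains the "in/" prefix. On text without that stray fragment, group 1 would hold only the part after "in/".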

Another option matching word characters, optionally repeated by a - and word characters with an optional / at the end:
(?<!\S)in/\w+(?:-\w+)*/?
The pattern matches:
(?<!\S) Assert a whitespace boundary to the left
in/ Match literally
\w+(?:-\w+)* match 1+ word chars, optionally repeated by - and 1+ word chars
/? Match optional /
Regex demo
import re
s = ("patra 12 EXPERIENCE in / in/sambhu-patra-49b4759/ 2020 - Now O Skin Curate Research Pvt Ltd Embedded System Developer, WB 0 /bindasssambhul O SKILLS LANGUAGES Arduino English Raspberry Pi Movidius Hindi Bengali ICS Intel Compute Stick PCB Design Python UI Design using Tkinter HOBBIES HTML iti CSS G JavaScript JQuery IOT")
m = re.search(r"(?<!\S)in/\w+(?:-\w+)*/?", s)
if m:
    print(m.group())
Output
in/sambhu-patra-49b4759/

How about just:
text.split(" ")[5]
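Indexing a fixed position only works while the link happens to be the sixth token. A slightly more tolerant sketch, assuming the link is always a single whitespace-separated token that starts with in/, could scan the tokens instead. The function name find_profile_token is hypothetical.

```python
def find_profile_token(text):
    """Return the first whitespace-separated token that starts with 'in/', or None."""
    for token in text.split():
        # Require something after the prefix so a bare "in/" is ignored.
        if token.startswith("in/") and len(token) > len("in/"):
            return token
    return None


text = "patra 12 EXPERIENCE in / in/sambhu-patra-49b4759/ 2020 - Now"
print(find_profile_token(text))
```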

This can be done without using any regex:
>>> text = "patra 12 EXPERIENCE in / in/sambhu-patra-49b4759/ 2020 - Now O Skin Curate Research Pvt Ltd Embedded System Developer, WB 0 /bindasssambhul O SKILLS LANGUAGES Arduino English Raspberry Pi Movidius Hindi Bengali ICS Intel Compute Stick PCB Design Python UI Design using Tkinter HOBBIES HTML iti CSS G JavaScript JQuery IOT\n"
>>> s = text[text.find(' in/')+1:]
>>> print (s[0:s.find(' ')])
in/sambhu-patra-49b4759/

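Here is one way. Note that it relies on a / (optionally followed by a space) appearing right before in/, as it does in the sample text.
import re
regex = re.compile(r"/\s?in/(.*?)/")
def check(s):
    # Group 1 is the text between "in/" and the next "/"
    search = regex.search(s)
    if search is not None:
        print(search.group(1))
check(text)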
Output
sambhu-patra-49b4759

Related

Advanced regex: What would be the regex for this pattern?

Want to identify names of all authors in the following text:
#misc{diaz2006automatic,
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},
month=jul # "~12",
note={EP Patent 1,678,025}
}
#article{standefer1984sitting,
title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},
volume={14},
number={6},
pages={649--658},
year={1984},
publisher={LWW}
}
#article{gentsch1992identification,
title={Identification of group A rotavirus gene 4 types by polymerase chain reaction.},
author={GenTSCH, JoN R and Glass, RI and Woods, P and Gouvea, V and Gorziglia, M and Flores, J and Das, BK and Bhan, MK},
journal={Journal of Clinical Microbiology},
volume={30},
number={6},
pages={1365--1373},
year={1992},
publisher={Am Soc Microbiol}
}
For the above text, regex should match:
match1 - Diaz, Navarro David
match2 - Gines, Rodriguez Noe
match3 - Standefer, Michael
match4 - Bay, Janet W
match5 - Trusso, Russell
...and so on
What you want should be easily achievable by capturing the contents between { and } on every line starting with author= and then splitting that text with the regex \s*(?:,|\band\b)\s*, which gives you all the name parts.
But in case your regex engine is PCRE based, you can use this regex, whose group 1 content gives you the names.
^\s*author={|(?!^)\G((?:(?! and|, )[^}\n])+)(?: *and *)?(?:[^\w\n]*)
This regex uses the \G anchor: the first alternative matches the start of a line beginning with author={, and the (?!^)\G((?:(?! and|, )[^}\n])+)(?: *and *)?(?:[^\w\n]*) part then continues from the end of the previous match, capturing each run of characters that contains neither " and" nor ", ".
Regex Demo
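Python's built-in re module does not support \G, so there the capture-then-split route is the practical one. The sketch below is one way to do it under my own assumptions: the sample is shortened, and it splits on the word "and" only, so that each "Surname, Given" pair stays together as in the expected matches listed in the question.

```python
import re

# Shortened sample in the same shape as the question's data.
bib = """#misc{diaz2006automatic,
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006}
}
#article{standefer1984sitting,
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery}
}"""

authors = []
# Capture the text between the braces on every line that starts with author=.
for field in re.findall(r"^\s*author=\{([^}]*)\}", bib, flags=re.MULTILINE):
    # Split on the standalone word "and"; commas are left alone on purpose.
    authors.extend(re.split(r"\s+and\s+", field.strip()))

print(authors)
```

Splitting on commas as well, as the \s*(?:,|\band\b)\s* pattern does, would separate each surname from its given names.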

Regex currency Python 3.5

I am trying to reformat the Euro currency in text data. The original format is like this: EUR 3.000.00 or also EUR 33.540.000.- .
I want to standardise the format to €3000.00 or €33540000.00.
I have reformatted EUR 2.500.- successfully using this code:
import re
format1 = "a piece of text with currency EUR 2.500.- and some other information"
regexObj = re.compile(r'EUR\s*\d{1,3}[.](\d{3}[.]-)')
text1 = regexObj.sub(lambda m:"\u20ac"+"{:0.2f}".format(float(re.compile('\d+(.\d+)?(\.\d+)?').search(m.group().replace('.','')).group())),format1)
Out: "a piece of text with currency €2500.00 and some other information"
This gives me €2500.00 which is correct. I've tried applying the same logic to the other formats to no avail.
format2 = "another piece of text EUR 3.000.00 and EUR 5.000.00. New sentence"
regexObj = re.compile('\d{1,3}[.](\d{3}[.])(\d{2})?')
text2 = regexObj.sub(lambda m:"\u20ac"+"{:0.2f}".format(float(re.compile('\d+(.\d+)?(\.\d+)?').search(m.group().replace('.','')).group())),format2)
Out: "another piece of text EUR €300000.00 and EUR €500000.00. New sentence"
and
format3 = "another piece of text EUR 33.540.000.- and more text"
regexObj = regexObj = re.compile(r'EUR\s*\d{1,3}[.](\d{3}[.])(\d{3}[.])(\d{3}[.]-)')
text3 = regexObj.sub(lambda m:"\u20ac"+"{:0.2f}".format(float(re.compile('\d+(.\d+)?(.\d+)?').search(m.group().replace('.','')).group())),format3)
Out: "another piece of text EUR 33.540.000.- and more text"
I think the problem might be with the regexObj.sub(), as the .format() part of it is confusing me. I've tried to change re.compile('\d+(.\d+)?(.\d+)?') within that, but I can't seem to generate the result I want. Any ideas much appreciated. Thanks!
Let's start with the regex. My proposal is:
EUR\s*(?:(\d{1,3}(?:\.\d{3})*)\.-|(\d{1,3}(?:\.\d{3})*)(\.\d{2}))
Details:
EUR\s* - The starting part.
(?: - Start of a non-capturing group - a container for alternatives.
  ( - Start of capturing group #1 (integer part with ".-" instead of the decimal part).
    \d{1,3} - 1 to 3 digits.
    (?:\.\d{3})* - ".ddd" part, 0 or more times.
  ) - End of group #1.
  \.- - ".-" ending.
| - Alternative separator.
  ( - Start of capturing group #2 (integer part).
    \d{1,3}(?:\.\d{3})* - Like in alternative 1.
  ) - End of group #2.
  (\.\d{2}) - Capturing group #3 (dot and decimal part).
) - End of the non-capturing group.
Instead of a lambda function I used an ordinary replacement function, which I called repl. It has 2 branches: one for group 1 and one for groups 2 + 3.
In both variants the dots in the integer part are deleted, but the final dot (after the integer part) belongs to group 3, so it is kept.
So the whole script can look like this:
import re
def repl(m):
    g1 = m.group(1)
    if g1: # Alternative 1: ".-" instead of decimal part
        res = g1.replace('.','') + '.00'
    else: # Alternative 2: integer part (group 2) + decimal part (group 3)
        res = m.group(2).replace('.','') + m.group(3)
    return "\u20ac" + res
# Source string
src = 'xxx EUR 33.540.000.- yyyy EUR 3.000.00 zzzz EUR 5.210.15 vvvv'
# Regex
pat = re.compile(r'EUR\s*(?:(\d{1,3}(?:\.\d{3})*)\.-|(\d{1,3}(?:\.\d{3})*)(\.\d{2}))')
# Replace
result = pat.sub(repl, src)
print(result)
The result is:
xxx €33540000.00 yyyy €3000.00 zzzz €5210.15 vvvv
As you can see, no need to use float or format.
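As a quick check against the strings from the question, the same regex and replacement logic can be wrapped in a small helper. The names EUR_PATTERN and normalise_eur are my own and purely illustrative.

```python
import re

# Same pattern as in the script above.
EUR_PATTERN = re.compile(
    r'EUR\s*(?:(\d{1,3}(?:\.\d{3})*)\.-|(\d{1,3}(?:\.\d{3})*)(\.\d{2}))'
)


def normalise_eur(text):
    """Rewrite 'EUR 2.500.-' or 'EUR 3.000.00' style amounts as '€2500.00'."""
    def repl(m):
        if m.group(1):  # ".-" ending: append an explicit ".00"
            return "\u20ac" + m.group(1).replace('.', '') + '.00'
        # Decimal ending: group 3 already holds the dot and the two decimals.
        return "\u20ac" + m.group(2).replace('.', '') + m.group(3)
    return EUR_PATTERN.sub(repl, text)


format1 = "a piece of text with currency EUR 2.500.- and some other information"
format2 = "another piece of text EUR 3.000.00 and EUR 5.000.00. New sentence"
print(normalise_eur(format1))
print(normalise_eur(format2))
```

The sentence-ending dot after 5.000.00 is left in place because it is not followed by two digits and so cannot be part of group 3.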

Skip multiple lines in YML Regex

I am creating YML templates to match files (parsed with Python). In each template I enter regex fields that are matched against the input file, and Python then writes the matches to a database (a CSV file).
But I am facing a problem matching company details. A portion of the file looks like this:
COMPANY DETAILS
Date : 01-06-2018
ABC Industries
12-31 Lane
New York
Contact No. 1111
And the company is actually ABC Industries. But in the file that I have, the Date is coming between the COMPANY DETAILS text and the actual company details.
I matched the Date as:
date: Date :\s+(\d+\-\d+\-\d+)
in the YML template file. But I am unable to match the company details.
I am using a regex like this to skip the line starting with the text Date :
company: COMPANY DETAILS\s+^(Date :.*)?([A-Za-z\s*]*)\s+Contact No.
But it isn't working. Please help me out with a proper Regex which skips any blank lines or the lines which start with Date : so that I can extract the proper company details from the text.
Thanks in advance.
EDIT
This problem is solved now.
COMPANY DETAILS\s+Date :\s+\d+\-\d+\-\d+\s+([A-Z ]*)\n
Did the trick.
You may use
COMPANY DETAILS\s+Date\s*:.*\s*(.+)
See the regex demo
Details
COMPANY DETAILS - a literal substring
\s+ - 1+ whitespace
Date\s*: - Date, 0+ whitespaces, :
.*\s* - the rest of the Date line, then 0+ whitespaces (including line breaks)
(.+) - Group 1: the line with the company data.
Python demo:
import re
rx = r"COMPANY DETAILS\s+Date\s*:.*\s*(.+)"
s = "COMPANY DETAILS\n\nDate : 01-06-2018\n\nABC Industries\n12-31 Lane\nNew York\n\nContact No. 1111"
m = re.search(rx, s) # no flags needed: the pattern has no ^ or $ anchors
if m:
    print(m.group(1)) # => ABC Industries
Using re.search
Demo:
import re
s = """COMPANY DETAILS
Date : 01-06-2018
ABC Industries
12-31 Lane
New York
Contact No. 1111"""
# Skip the optional Date line, then lazily capture everything up to "Contact"
m = re.search(r"COMPANY DETAILS\s+(?:Date\s*:[^\n]*\s+)?(?P<company>.*?)\s*(?=Contact)", s, flags=re.DOTALL)
if m:
    print(m.group('company'))
Output:
ABC Industries
12-31 Lane
New York

PIG REGEX_EXTRACT_ALL is not working

I have the below data.
• PRT_Edit & Set Shopping Cart in Retail
• PRT_Confirm Shopping Cart for Goods
o PRT-Ret_Process Supplier Invoice
o PRT-Web_Overview of Orders
o PRT_Update Outfirst Agreement
PRT_Axn_-Purchase and Requisition
The data has special symbols, tab space and spaces. I want to extract only the text part from this data as:
PRT_Edit & Set Shopping Cart in Retail
PRT_Confirm Shopping Cart for Goods
PRT-Ret_Process Supplier Invoice
PRT-Web_Overview of Orders
PRT_Update Outfirst Agreement
I have tried using REGEX_EXTRACT_ALL in Pig Script as below but it does not work.
PRT = LOAD '/DATA' USING TEXTLOADER() AS (LINE:CHARARRAY);
Cleansed = FOREACH PRT GENERATE REGEX_EXTRACT_ALL(LINE,'[A-Z]*') AS DATA;
When I try dumping Cleansed, it does not show any data. Can any one please help.
You can use
Cleansed = FOREACH PRT GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(LINE, '^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$'))
    AS (FIELD1:chararray), LINE;
The regex matches the following:
^ - start of string
[^a-zA-Z]* - 0 or more characters other than the Latin letters in the character class
([a-zA-Z].*[a-zA-Z]) - a capturing group that we'll refer to as FIELD1 later, matching:
[a-zA-Z].*[a-zA-Z] - a Latin letter, then any characters, as many as possible (the greedy * is used, not the lazy *?), and then a final Latin letter
[^a-zA-Z]* - 0 or more characters other than the Latin letters
$ - end of string
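The pattern itself can be tried outside Pig. The sketch below uses Python's re.fullmatch as a stand-in, on the assumption (which I believe holds, but have not verified against the Pig source here) that REGEX_EXTRACT_ALL only returns groups when the pattern matches the whole line. The sample lines and the exact bullet and tab characters are my reconstruction of the data in the question.

```python
import re

PATTERN = re.compile(r'^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$')

# Reconstructed sample lines: a bullet character, tabs and trailing spaces.
lines = [
    "\u2022\tPRT_Edit & Set Shopping Cart in Retail",
    "\u2022  PRT_Confirm Shopping Cart for Goods  ",
    "o\tPRT-Web_Overview of Orders",
]

# Whole-line match, keeping only the captured group.
cleaned = []
for line in lines:
    m = PATTERN.fullmatch(line)
    cleaned.append(m.group(1) if m else None)
print(cleaned)

# The pattern from the question cannot match a whole line, so nothing comes back.
print(re.fullmatch(r'[A-Z]*', lines[0]))
```

One caveat the last sample line shows: if the sub-bullet is a literal letter o, as it appears in the question, it counts as a Latin letter and stays in the captured text, so that case would need separate handling.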

Regex for non-greedy range of digits

I am trying to use regex to grab the first occurrence of InvoiceItemID in the JSON data shown below.
This is the regex string that I currently have:
\d{6}(?=","Location":"ARN:190801210003100)
The current regex string returns two matches, but I am only interested in the first occurrence/match. I understand that what I need to do is make the regex non-greedy, which usually involves using something like this: (.*?) but I do not know where to implement this non-greedy code. Any help would be appreciated. Thanks.
Here is some raw data if needed for testing purposes:
{"ClientRef":"","Date":"2015-09-29 10:02:51 AM","InvoiceID":"451393","InvoiceItemID":"495340","Location":"ARN:193602013349538<br\/>16 LEIGHLAND DR , CITY OF MARKHAM, ON, L3R 7R4","ReportID":"268172,","Type":"ICI Commercial \/ Industrial Report"},{"ClientRef":"","Date":"2015-09-28 8:39:41 PM","InvoiceID":"451035","InvoiceItemID":"494939","Location":"ARN:190801210003100<br\/>2250 SHEPPARD AVE W, CITY OF TORONTO, ON, M9M 1L7","ReportID":"267810,","Type":"Basic Report"},{"ClientRef":"","Date":"2015-09-28 8:39:20 PM","InvoiceID":"451034","InvoiceItemID":"494938","Location":"ARN:190801210003100<br\/>2250 SHEPPARD AVE W, CITY OF TORONTO, ON, M9M 1L7","ReportID":"267809,","Type":"ICI Commercial \/ Industrial Report"},{"ClientRef":"","Date":"2015-09-28 2:59:03 PM","InvoiceID":"450515","InvoiceItemID":"494348","Location":"ARN:240201011110900<br\/>26-34 PLAINS RD E, BURLINGTON CITY, ON, L7T 2B9","ReportID":"267272,","Type":"ICI Commercial \/ Industrial Report"}
Your regex is not the way to go, as it fails to match all the items. It works with this regex, where group 1 holds the ID:
InvoiceItemID":"(\d{6})
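Whether only the first match is returned depends on the function you call, not on greediness. As a minimal Python sketch (the language is my choice, and the data is a trimmed copy of the raw data from the question), re.search stops at the first match, and the lookahead from the question can still be used to restrict it to one location. Since the data is a comma-separated run of JSON objects, wrapping it in brackets and parsing it is another option.

```python
import json
import re

# Trimmed copy of the raw data from the question (most keys removed).
data = (
    '{"InvoiceID":"451393","InvoiceItemID":"495340","Location":"ARN:193602013349538"},'
    '{"InvoiceID":"451035","InvoiceItemID":"494939","Location":"ARN:190801210003100"},'
    '{"InvoiceID":"451034","InvoiceItemID":"494938","Location":"ARN:190801210003100"}'
)

# re.search returns the first match only, so no lazy quantifier is needed.
first_any = re.search(r'InvoiceItemID":"(\d{6})', data).group(1)

# The lookahead from the question limits the match to one specific location.
first_for_arn = re.search(r'\d{6}(?=","Location":"ARN:190801210003100)', data).group()

# Alternative without regex: wrap the objects in [] and parse them as a JSON array.
records = json.loads("[" + data + "]")
first_parsed = next(
    r["InvoiceItemID"] for r in records
    if r["Location"].startswith("ARN:190801210003100")
)

print(first_any, first_for_arn, first_parsed)
```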