Regex on Splitting String - regex

I am trying to split the following string into the proper output using regex. Answers do not have to be in Perl; a general regex is fine:
Username is required.
A multi-word name is optional.
This is followed by a useless word, which is present but should not be captured.
Followed by an optional number.
Followed by an IP in angle brackets < > (required).
String = username optional multistring name uselessword 45 <100.100.100.100>
Output should be:
Match 1 = username
Match 2 = optional multistring name
Match 3 = 45
Match 4 = 100.100.100.100

This sort of thing is easier to handle with multiple regexes. Here is an example:
my @arr = (
'username optional multistring name uselessword 45 <100.100.100.100>',
'username 45 <100.100.100.100>'
);
for (@arr) {
## you can use the anchors ^ and $ here
if (/(\S+) (.+?) (\d+) <(.+?)>/) {
print "$1\n$2\n$3\n$4\n";
}
## you can use the anchors ^ and $ here
elsif (/(\S+) (\d+) <(.+?)>/) {
## no name in this form, so print an empty line in its place
print "$1\n\n$2\n$3\n";
}
print "==========\n";
}
The first if block looks for four groups in the input, and the elsif block looks for three.
If needed, you can use [ ]+ to handle multiple spaces between the groups.
You can also restrict the optional group (.+?) to the characters you expect, usually with a character class such as [bla].
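For readers who prefer Python, the same two-pattern idea might be sketched roughly as follows. The names WITH_NAME, WITHOUT_NAME and split_record are made up for this illustration, and the patterns are anchored and use " +" for runs of spaces, which are assumptions on top of the answer. As in the Perl version, the second field still includes the trailing useless word.

```python
import re

# Hypothetical names; the patterns mirror the two Perl regexes above,
# with ^ $ anchors and " +" to tolerate repeated spaces.
WITH_NAME = re.compile(r"^(\S+) +(.+?) +(\d+) +<(.+?)>$")
WITHOUT_NAME = re.compile(r"^(\S+) +(\d+) +<(.+?)>$")

def split_record(line):
    """Return (username, name, number, ip), or None if neither pattern fits."""
    m = WITH_NAME.match(line)
    if m:
        return m.groups()
    m = WITHOUT_NAME.match(line)
    if m:
        user, number, ip = m.groups()
        return (user, "", number, ip)  # empty string stands in for the missing name
    return None

for record in (
    "username optional multistring name uselessword 45 <100.100.100.100>",
    "username 45 <100.100.100.100>",
):
    print(split_record(record))
```

Trying the more specific pattern first keeps each regex simple, at the cost of scanning some lines twice.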

Related

Parsing between characters using Perl in SAS

I am sure this is a simple thing to do but I cannot seem to find any examples or make it past the numerous documentation sources I have been using.
I have a variable in a table (called location) such as: OH_DRT HOME_G4-T7 77 Cafe Entrance
I want to be able to parse this into several columns based on some delimiters. There is variability in my data set so I thought using perl expressions for pattern matching would be the way to go. I am trying to take that string and break it up into something like this:
State  Building   Name  Desc
OH     DRT HOME   G4    T7 Cafe Entrance
FL     Cleveland  RG    03 Back Entry
I am able to split the first part out
Data Mydata;
Set Int_Data;
retain re;
if _N_ = 1 Then re = prxparse("/(\D{2})/");
if prxmatch(re, location) Then Do;
State = prxposn(re, 1, location);
end;
run;
It is parsing out the other sections that I am at a loss for. The only one I have been able to get working correctly is the State. I assume I should be able to pull out anything between two characters.
In my head I should be able to split something like this:
Anything before the first _, anything between the first _ and second _, anything second _ to first -, and then finally anything after the -
Are all records exactly the same? If so:
use warnings;
use strict;
my $data = 'OH_DRT HOME_G4-T7 77 Cafe entrance';
my ($state, $building, $name, $desc);
if ($data =~ /^([A-Z]{2})_(.*)_(\w{2})-\w{2}\s+(.*)$/) {
$state = $1;
$building = $2;
$name = $3;
$desc = $4;
}
print "$state, $building, $name, $desc\n";
The regex works as follows:
Capture two uppercase letters at the start of the string and put them into $1
Skip an underscore, capture everything up to the next underscore, and put it into $2
Skip that underscore, capture the following two word characters, and put them into $3
Skip a hyphen and the following two word characters along with any amount of whitespace, and put the remaining portion of the string into $4
Assign the numbered matches into the more descriptive named variables
Note that if any of the matches/captures fail, all of the named variables will be undefined.
The output of the above is:
OH, DRT HOME, G4, 77 Cafe entrance
You can use a pattern with four capture groups. Note, however, that if you follow this remark from the question literally, the last group will contain T7 77 Cafe entrance:
and then finally anything after the -
If you want to match everything between the underscores and the hyphen, you can use a negated character class, which matches any character except the ones it lists.
To keep a match from crossing newlines, add a newline and a carriage return to the class: [^_\r\n]+
^([^_]+)_([^_]+)_([^-]+)-(.*)
Explanation
^ Start of string
([^_]+)_ Capture 1+ chars other than _ in group 1, then match the _
([^_]+)_ Capture 1+ chars other than _ in group 2, then match the _
([^-]+)- Capture 1+ chars other than - in group 3, then match the -
(.*) Capture the rest of the line, after the -, in group 4
If you want 77 Cafe entrance in group 4:
^([^_]+)_([^_]+)_([^-]+)-[^\s-]*\s*(.*)
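To compare the two negated-class patterns side by side, here is a small Python sketch. The function name parse_location and the skip_code flag are invented for this illustration; only the two regexes come from the answer.

```python
import re

# Pattern 1 keeps everything after the hyphen in group 4.
LOCATION = re.compile(r"^([^_]+)_([^_]+)_([^-]+)-(.*)")
# Pattern 2 additionally skips the token right after the hyphen (e.g. "T7").
LOCATION_SKIP_CODE = re.compile(r"^([^_]+)_([^_]+)_([^-]+)-[^\s-]*\s*(.*)")

def parse_location(text, skip_code=False):
    """Return (state, building, name, desc), or None when the text does not fit."""
    pattern = LOCATION_SKIP_CODE if skip_code else LOCATION
    m = pattern.match(text)
    return m.groups() if m else None

sample = "OH_DRT HOME_G4-T7 77 Cafe Entrance"
print(parse_location(sample))
print(parse_location(sample, skip_code=True))
```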
I'm sure the regex solution works fine. If you would rather have a SCAN solution:
Data WANT(Keep STATE BUILDING NAME DESC);
Length State $2 Building $50 Name $2 Desc $100;
TEST="OH_DRT HOME_G4-T7 77 Cafe Entrance";
State=scan(test,1,"_");
Building=scan(test,2,"_");
temp=scan(test,3,"_");
Name=scan(temp,1,"-");
Desc=scan(temp,2,"-");
Run;

Regular Expression To Extract Names

I have strings in this form:
"""00.000000 00.000000; X-XX000-0000-0; France; Paris; Street 12a;
00.000000 00.000000; X-XX000-0000-0; Spain; Barcelona; Street 123;"""
I want to extract specific data (the towns) from the string above. How do I get this data?
If you just want to get the city for your given example, you could use a positive lookahead:
\b[^;]+(?=;[^;]+;$)
Explanation
\b # Word boundary
[^;]+ # Match NOT ; one or more times
(?= # Positive lookahead that asserts what follows is
; # Match semicolon
[^;]+ # Match NOT ; one or more times
; # Match ;
$ # Match end of the string
) # Close lookahead
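Because the sample input spans two lines, the $ in the lookahead only lines up with each row if it is allowed to match at every line end. A minimal Python sketch, assuming the pattern is run with re.MULTILINE (the variable names here are illustrative):

```python
import re

text = """00.000000 00.000000; X-XX000-0000-0; France; Paris; Street 12a;
00.000000 00.000000; X-XX000-0000-0; Spain; Barcelona; Street 123;"""

# re.MULTILINE makes $ match at the end of each line, not only at the
# end of the whole string, so every row can contribute a city.
CITY = re.compile(r"\b[^;]+(?=;[^;]+;$)", re.MULTILINE)

towns = CITY.findall(text)
print(towns)
```

The word boundary at the start is what keeps the leading space after the previous semicolon out of the match.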
Assuming Python (triple-quoted string):
string = """00.000000 00.000000; X-XX000-0000-0; France; Paris; Street 12a;
00.000000 00.000000; X-XX000-0000-0; Spain; Barcelona; Street 123;"""
towns = [part[3] for line in string.split("\n") for part in [line.split("; ")]]
print(towns)
Which yields
['Paris', 'Barcelona']
No regex needed, really.
If you have the city on the 4th field, you can match it using this pattern:
/(?:[^;]*;){3}([^;]*);/
[^;]*; matches a field made of non-semicolon characters, ending with a semicolon
(?:...){3} repeats that three times without capturing
([^;]*); then captures the content of the 4th field (without its semicolon)
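A short Python sketch of this field-counting pattern, applied one line at a time so the count restarts on every row. The helper name fourth_field is hypothetical, and the strip() call is an addition here, since the captured field keeps the space that follows the previous semicolon.

```python
import re

FOURTH_FIELD = re.compile(r"(?:[^;]*;){3}([^;]*);")

def fourth_field(line):
    """Return the 4th semicolon-separated field of a line, or None."""
    m = FOURTH_FIELD.match(line)
    # The capture still carries the leading space, so trim it.
    return m.group(1).strip() if m else None

text = """00.000000 00.000000; X-XX000-0000-0; France; Paris; Street 12a;
00.000000 00.000000; X-XX000-0000-0; Spain; Barcelona; Street 123;"""

towns = [fourth_field(line) for line in text.splitlines()]
print(towns)
```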

Extracting email addresses from messy text in OpenRefine

I am trying to extract just the emails from a text column in OpenRefine. Some cells have just the email, but others have the name and email in john doe <john@doe.com> format. I have been using the following GREL/regex, but it does not return the entire email address. For the above example I'm getting ["n@doe.com"]
value.match(
/.*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*/
)
Any help is much appreciated.
The n is captured because you are using .* before the capturing group. Since .* greedily matches any 0+ chars other than line break chars, the only char that can land in Group 1 during backtracking is the char right before @.
If partial matches are acceptable, get rid of the .* and use
/[^<\s]+@[^\s>]+/
Details
[^<\s]+ - 1 or more chars other than < and whitespace
@ - a @ char
[^\s>]+ - 1 or more chars other than whitespace and >.
Python/Jython implementation:
import re
res = ''
m = re.search(r'[^<\s]+@[^\s>]+', value)
if m:
res = m.group(0)
return res
There are other ways to match these strings. In case you need a full string match, use .*<([^<]+@[^>]+)>.* where the leading .* will not gobble the start of the address, since it has to stop before the obligatory <.
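To see the three behaviours next to each other outside OpenRefine, here is a plain-Python sketch. It assumes the address separator is @ and uses a made-up sample cell; inside OpenRefine the cell text would come from value instead.

```python
import re

cell = "john doe <john@doe.com>"  # sample cell, assumed for illustration

# Original pattern: the greedy leading .* leaves only one char before the @.
greedy = re.match(r".*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*", cell)

# Partial match: no leading .*, so the whole address is found.
partial = re.search(r"[^<\s]+@[^\s>]+", cell)

# Full-string match: the leading .* must stop before the required <.
full = re.match(r".*<([^<]+@[^>]+)>.*", cell)

print(greedy.group(1))
print(partial.group(0))
print(full.group(1))
```

Only the partial match also handles cells that contain a bare address, because the full-string variant requires the angle brackets.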
If some cells contain just the email, it's probably better to use @wiktor-stribiżew's partial match. In the development version of OpenRefine, there is now a value.find() function that can do this, but it will only be officially available in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:
import re
return re.findall(r"[^<\s]+@[^\s>]+", value)[0]

Regex to extract string between two symbols

I have a string like this
Affiliation / Facility Name = Provider 1069860 # Admissions = 1 #
Potentially Avoidable = 0
I want a Regex Expression to extract the value Provider 1069860 from it.
I tried "= [a-zA-Z]+ #" but it is giving a blank result
Use this pattern:
.*Facility Name = ([a-zA-Z 0-9]+) #.*
The value you want is captured in group 1.
https://regex101.com/r/EnYZ55/1
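The original attempt fails because [a-zA-Z]+ cannot cross the space and digits in "Provider 1069860". A small Python check of both patterns, assuming the two lines of the sample are joined by a single space:

```python
import re

# Sample text from the question; joining the two lines with a space
# is an assumption made for this sketch.
text = ("Affiliation / Facility Name = Provider 1069860 # Admissions = 1 # "
        "Potentially Avoidable = 0")

# Letters only: nothing in the text has the shape "= letters #".
attempt = re.search(r"= [a-zA-Z]+ #", text)

# Letters, digits and spaces, anchored on the "Facility Name" label.
fixed = re.match(r".*Facility Name = ([a-zA-Z 0-9]+) #.*", text)
facility = fixed.group(1) if fixed else None

print(attempt)
print(facility)
```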

Regex - Extracting a number when preceded OR followed by a currency sign

if (preg_match_all('((([£€$¥](([ 0-9]([0-9])*)((\.|\,)(\d{2}|\d{1}))|([ 0-9]([0-9])*)))|(([0-9]([0-9])*)((\.|\,)(\d{2}|\d{1})(\s{0}|\s{1}))|([0-9]([0-9])*(\s{0}|\s{1})))[£€$¥]))', $Commande, $matches)) {
$tot1 = $matches[0];
}
This is my tested solution.
It works for all 4 currencies, whether the sign is placed before or after the amount, with or without a space in between.
It works with a dot or a comma as the decimal separator.
It works without decimals, or with just 1 digit after the dot or comma.
It extracts several amounts from the same string, in any mix of the formats described above, as long as there is a space between them.
I think it covers everything, although I am sure it can be simplified.
It was needed for an international order form where clients enter the amounts themselves, along with the description, in the same field.
You can use a conditional:
if (preg_match_all('~(\$ ?)?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?(?:[pcm]|bn|[mb]illion)?(?(1)| ?\$)~i', $order, $matches)) {
$tot = $matches[0];
}
Explanation:
I put the currency in the first capturing group: (\$ ?) and I make it optional with a ?
At the end of the pattern, I use an if then else:
(?(1) # if the first capturing group exists
# then match nothing
| # else
[ ]?\$ # matches the currency
) # end of the conditional
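Python's re module accepts the same (?(1)...|...) conditional syntax, so the idea can be tried there as well. The sketch below is a simplified version of the pattern: it handles only the dollar sign and leaves out the unit suffixes, and the name find_amounts is invented for the example.

```python
import re

# Group 1 is the optional leading "$"; the conditional at the end
# demands a trailing "$" only when group 1 did not take part in the match.
AMOUNT = re.compile(
    r"(\$ ?)?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?(?(1)| ?\$)"
)

def find_amounts(text):
    """Return every amount that carries a $ either before or after it."""
    return [m.group(0) for m in AMOUNT.finditer(text)]

print(find_amounts("Total $ 45.50 or 1,200$ today, 12 items"))
```

A bare number such as the 12 in the sample is skipped, because without a leading sign the conditional insists on a trailing one.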
You should also allow for an optional $ at the end of the amount:
\$? ?(\d[\d ,]*(?:\.\d{1,2})?|\d[\d,](?:\.\d{2})?) ?\$?(?:[pcm]|bn|[mb]illion)