Extract data from dataset - regex

I need to extract the title from each name, but I cannot understand how this code works. The code is below:
combine = [traindata, testdata]
for dataset in combine:
    dataset["title"] = dataset["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
There is no error, but I need to understand how the above code works.
Name
Braund, Mr. Owen Harris
Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Heikkinen, Miss. Laina
Futrelle, Mrs. Jacques Heath (Lily May Peel)
Allen, Mr. William Henry
Moran, Mr. James
Above is the Name feature from the CSV file; dataset["title"] stores the title from each name, i.e. Mr, Miss, Master, etc.

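Your code extracts the title from each name using the pandas.Series.str.extract function, which takes a regular expression.
pandas.Series.str.extract - Extract capture groups in the regex pat as columns in a DataFrame.
' ([A-Za-z]+)\.' is the regex pattern in your code. In each string of the Name column it finds a space, followed by a run of letters, followed by a literal . character.
(leading space) - the letters must be preceded by a space
[A-Za-z] - matches a single letter in the ranges a-z and A-Z
+ - means one or more of the preceding item, so the whole word is matched
( ) - the parentheses form a capture group; only this part (the letters) is returned
\. - matches a literal . immediately after the letters
With expand = False and a single capture group, the result is a Series rather than a DataFrame, so it can be assigned directly to dataset["title"].
The pandas documentation for str.extract includes an example that extracts parts of a string and puts them in separate columns.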
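To see what the pattern does on its own, here is a minimal sketch using Python's built-in re module on the sample names from the question. str.extract runs the same kind of search on every row and returns the first capture group. The names list and the extract_title helper are illustrative only and are not part of the original code.

```python
import re

# Sample values of the Name column, copied from the question.
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
    "Moran, Mr. James",
]

# A space, one or more letters (captured), then a literal period.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")


def extract_title(name):
    """Return the captured title, or None when the pattern does not match."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


titles = [extract_title(name) for name in names]
print(titles)  # ['Mr', 'Mrs', 'Miss', 'Mrs', 'Mr', 'Mr']
```

The surname is never returned because it is followed by a comma, not a period, so only the word directly before the first period qualifies.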

I found this response and its link very helpful for understanding how to use the str.extract method, and how changing expand from True to False switches the result from DataFrame columns to a Series.

Related

Convert MS Outlook formatted email addresses to names of attendees using RegEx

I'm trying to use Notepad++ find and replace with regex to extract names from MS Outlook formatted meeting attendee details.
I copied and pasted the attendee details and got names like this:
Fred Jones <Fred.Jones@example.org.au>; Bob Smith <Bob.Smith@example.org.au>; Jill Hartmann <Jill.Hartmann@example.org.au>;
I'm trying to wind up with
Fred Jones; Bob Smith; Jill Hartmann;
I've tried a number of permutations of
\B<.*>; \B
on Regex 101.
Regex quantifiers are greedy: <.*> matches from the first < to the last > in one fell swoop. You want to say "any character which is neither of these" instead of just "any character".
 *<[^<>]*>
The pattern starts with a single space followed by an asterisk; together they consume any spaces before the match. Replace these matches with nothing and you will be left with just the names, like in your example.
This is a very common FAQ.
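As a quick check outside Notepad++, the same pattern can be tried with Python's re module. This is only a sketch: the strip_addresses helper name is made up for illustration, and the input is the sample line from the question.

```python
import re

# Optional spaces, then "<", any characters that are neither "<" nor ">", then ">".
ADDRESS_PATTERN = re.compile(r" *<[^<>]*>")


def strip_addresses(attendees):
    """Remove every <address> part, including the spaces in front of it."""
    return ADDRESS_PATTERN.sub("", attendees)


attendees = (
    "Fred Jones <Fred.Jones@example.org.au>; "
    "Bob Smith <Bob.Smith@example.org.au>; "
    "Jill Hartmann <Jill.Hartmann@example.org.au>;"
)
print(strip_addresses(attendees))  # Fred Jones; Bob Smith; Jill Hartmann;
```

Because the negated class cannot cross a < or >, each match stops at the first closing bracket instead of running to the last one on the line.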

Regex - get string after full date and before standard text

I'm stuck on another regex. I'm extracting email data. In the below example, only the time, date and message in quotes changes.
Message Received 6:06pm 21st February "Hello. My name is John Smith" Some standard text.
Message Received 8:08pm 22nd February "Hello. My name is "John Smith"" Some standard text.
How can I get the message only if I need to start with the positive lookbehind, (?<=Message Received ) to begin searching at this particular point of the data? The message will always start and end with quotes but the user is able to insert their own quotes as in the second example.
You can just use a negated character class in a capturing group:
/Message Received.*?"([^\n]+)"/
Snippet:
$input = 'Message Received 6:06pm 21st February "Hello. My name is John Smith" Some standard text.
Message Received 8:08pm 22nd February "Hello. My name is "John Smith"" Some standard text.';
preg_match_all('/Message Received.*?"([^\n]+)"/', $input, $matches);
foreach ($matches[1] as $match) {
    echo $match . "\r\n";
}
Output:
> Hello. My name is John Smith
> Hello. My name is "John Smith"
To extract the message between the double quotes:
(?=Message Received)[^\"]+\K\"[\w\s\"\.]+\"
Regex demo
You can capture the message in a group:
(?<=Message Received)[^"]*(.*)(?=\s+Some standard text)
Two out of the other three posted answers on this page provide an incorrect result. None of the other posted answers are as efficient as they could be:
To correctly extract the substring between the outer double quotes, use one of the following patterns:
/Message Received[^"]+"\K[^\n]+(?=")/ (No capture group, takes 132 steps, Demo)
/Message Received[^"]+"([^\n]+)"/ (Capture group, takes 130 steps, Demo)
Both patterns provide maximum accuracy and efficiency using negated character classes leading up to and including the targeted substring. The first pattern reduces preg_match_all()'s output array bloat by 50% by using \K instead of a capture group. For these reasons, one of these patterns should be used in your project. As your input string increases in size, my patterns provide increasingly better performance versus the other posted patterns.
PHP Implementation:
$in represents your input string.
Pattern #1 Method:
var_export(preg_match_all('/Message Received[^"]+"\K[^\n]+(?=")/',$in,$out)?$out[0]:[]);
// notice the output array only has elements in the fullstring subarray [0]
Output:
array (
  0 => 'Hello. My name is John Smith',
  1 => 'Hello. My name is "John Smith"',
)
Pattern #2 Method:
var_export(preg_match_all('/Message Received[^"]+"([^\n]+)"/',$in,$out)?$out[1]:[]);
// notice because a capture group is used, [0] subarray is ignored, [1] is used
Output:
array (
  0 => 'Hello. My name is John Smith',
  1 => 'Hello. My name is "John Smith"',
)
Both methods provide the desired output.
Anirudha's incorrect pattern: /(?<=Message Received)[^"]*(.*)(?=\s+Some standard text)/ (345 steps + a capture group + includes the unwanted outer double quotes)
Josh Crozier's pattern: /Message Received.*?"([^\n]+)"/ (174 steps + a capture group)
Sahil Gulati's incorrect pattern: /(?=Message Received)[^\"]+\K\"[\w\s\"\.]+\"/ (109 steps + includes the unwanted outer double quotes + unnecessarily escapes characters in the pattern)
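For readers who want to try the capture-group pattern outside PHP, here is a rough Python equivalent. Python's built-in re module does not support \K, so only the capture-group variant is shown. The variable names are illustrative, and the input is the two sample lines from the question.

```python
import re

text = (
    'Message Received 6:06pm 21st February "Hello. My name is John Smith" Some standard text.\n'
    'Message Received 8:08pm 22nd February "Hello. My name is "John Smith"" Some standard text.'
)

# [^"]+ skips ahead to the first double quote; the greedy [^\n]+ then runs to the
# end of the line and backtracks to the last double quote on that line.
MESSAGE_PATTERN = re.compile(r'Message Received[^"]+"([^\n]+)"')

# With exactly one capture group, findall returns the group contents directly.
messages = MESSAGE_PATTERN.findall(text)
for message in messages:
    print(message)
```

The outer quotes are dropped while any quotes the user typed inside the message are kept, which matches the output shown for the PHP versions.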

Using regex and vba, extracting parts of data

I have an Excel spreadsheet whose contents are formatted like this:
Street Name, Street Number Street Direction (may not be present; represented by N, S, W, or E)
So it could look like John Doe Ave, 900 E or Jane Doe DR, 100
However, the people who used this spreadsheet also entered business names and other information that shouldn't be present.
The part I'm stuck on is the regex pattern; I'm not familiar with regex and it confuses me.
I have this variable:
Dim strPattern As String: strPattern = "^(.+),\s(\d+)\s([NWSEnwse])"
This works only partly. What changes could I make so that the street direction (NWSEnwse) is optional? Right now it detects the address only when the street direction is present.
You may use this regex pattern to match it.
^(.+),\s+(\d+)(\s+[NWSEnwse])?
The ? at the end signifies that that part is optional.
I also replaced \s with \s+ to account for any extra spaces that might have crept in.
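The effect of the optional group can be checked with a short Python sketch (Python rather than VBA, purely for illustration). The parse_address helper and the "Acme Widgets Ltd" business name are hypothetical; the two addresses come from the question.

```python
import re

# Same pattern as above: street name, comma, street number, optional direction.
ADDRESS_PATTERN = re.compile(r"^(.+),\s+(\d+)(\s+[NWSEnwse])?")


def parse_address(cell):
    """Return (street, number, direction) or None if the cell is not an address.

    direction is None when the optional third group did not take part in the match.
    """
    match = ADDRESS_PATTERN.match(cell)
    if match is None:
        return None
    street, number, direction = match.groups()
    # The third group includes its leading whitespace, so strip it.
    return street, number, direction.strip() if direction else None


print(parse_address("John Doe Ave, 900 E"))  # ('John Doe Ave', '900', 'E')
print(parse_address("Jane Doe DR, 100"))     # ('Jane Doe DR', '100', None)
print(parse_address("Acme Widgets Ltd"))     # None
```

Cells that return None are the ones holding business names or other stray information, so they can be skipped or flagged.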

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
The negative lookahead includes a word boundary so that suburbs whose names merely start with a state abbreviation are still matched.
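To see how the lookahead behaves on the sample data, here is a small Python sketch. It is not Alteryx syntax; the extract_suburb helper is a made-up name, and the rows are the input values from the question.

```python
import re

# Repeated group: optional space, then a word made of letters that is NOT
# a whole-word Australian state abbreviation.
SUBURB_PATTERN = re.compile(
    r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+"
)


def extract_suburb(value):
    """Return the leading run of suburb words, or None if there is none."""
    match = SUBURB_PATTERN.match(value)
    return match.group(0) if match else None


rows = [
    "CABRAMATTA",
    "CANLEY HEIGHTS",
    "ST JOHNS PARK",
    "Parramatta NSW 2150",
    "Claymore 2559",
    "CASULA",
]
suburbs = [extract_suburb(row) for row in rows]
print(suburbs)
```

The match stops before "NSW" because the lookahead rejects it, and before "2559" because digits are not in the letter class, so both the state and the postcode are left out.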
Expanding on Bohemian's answer, you can use capture groups to do a REGEX_Replace in Alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This keeps whatever matches the first group (so just the suburb). The second and third groups match the state and the postcode. It is not a perfect regex (for example, it requires a state abbreviation to be present), but it should get you most of the way there.

Regular Expression help needed to convert lst file to csv

I have a file (ratings.lst) downloaded from IMDB Interfaces. The content appears to be in the following format:
Distribution Votes Rating Title
0000001222 297339 8.4 Reservoir Dogs (1992)
0000001223 64504 8.4 The Third Man (1949)
0000000115 48173 8.4 Jodaeiye Nader az Simin (2011)
0000001232 324564 8.4 The Prestige (2006)
0000001222 301527 8.4 The Green Mile (1999)
My aim is to convert this file into a CSV file (comma separated) with the following desired result (example for 1 line) :
Distribution Votes Rating Title
0000001222, 301527, 8.4, The Green Mile (1999)
I am using TextPad, which supports regex-based search and replace. I'm not sure what regex is needed to achieve the desired result. Can somebody please help me with this? Thanks in advance.
The other regular expressions are somewhat overcomplicated. Because whitespace is guaranteed not to appear in the first three columns, you don't have to do a fancy match - "three columns of anything separated by whitespace" will do.
Try replacing ^(.+?)\s+(.+?)\s+(.+?)\s+(.+?)$ with \1,\2,\3,"\4" giving the following output (using Notepad++)
Distribution,Votes,Rating,"Title"
0000001222,297339,8.4,"Reservoir Dogs (1992)"
0000001223,64504,8.4,"The Third Man (1949)"
0000000115,48173,8.4,"Jodaeiye Nader az Simin (2011)"
0000001232,324564,8.4,"The Prestige (2006)"
0000001222,301527,8.4,"The Green Mile (1999)"
Note the use of a non-greedy quantifier, .+?, to prevent accidentally matching more than we should. Also note that I've enclosed the fourth column with quote marks "" in case a comma appears in the movie title - otherwise the software you use to read the file would interpret Avatar, the Last Airbender as two columns.
The nice tabular alignment is gone - but if you open the file in Excel it will look fine.
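The same find-and-replace can be sketched in Python for anyone who prefers a script to an editor. This is an illustration under assumptions, not a full converter: the to_csv_line helper is hypothetical, each line is stripped first in case the real file has leading spaces, and embedded double quotes are doubled so the quoted title stays valid CSV.

```python
import re

# Three lazily matched columns separated by whitespace, then the rest of the line.
ROW_PATTERN = re.compile(r"^(.+?)\s+(.+?)\s+(.+?)\s+(.+?)$")


def to_csv_line(line):
    """Convert one whitespace-separated ratings line into a CSV line."""
    cleaned = line.strip().replace('"', '""')  # double any embedded quotes
    return ROW_PATTERN.sub(r'\1,\2,\3,"\4"', cleaned)


sample = [
    "Distribution Votes Rating Title",
    "0000001222 297339 8.4 Reservoir Dogs (1992)",
    "0000001222 301527 8.4 The Green Mile (1999)",
]
for line in sample:
    print(to_csv_line(line))
```

Only the first three gaps are treated as column separators, so spaces inside the title survive untouched.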
Alternately, just do the entire thing in Excel.
First replace all " with "" then do this:
Find: ^\([0-9]+\)[ \t]+\([0-9]+\)[ \t]+\([^ \t]+\)[ \t]+\(.*\)
Replace with: \1,\2,\3,"\4"
Press F8 to open Replace dialog
Make sure Regular Expression is selected
In Find what: put: ^([[:digit:]]{10})[[:space:]]+([[:digit:]]+)[[:space:]]+([[:digit:]]{1,2}\.[[:digit:]])[[:space:]]+(.*)$
In Replace with: put \1,\2,\3,"\4"
Click Replace All
Note: This uses 1 or more spaces between fields from ratings.lst - you might be better off specifying the exact number of spaces if you know it.
Also Note: I didn't put spaces between the comma-separated items, as generally you don't, but feel free to add those in.
Final Note: I put the movie title in quotes, so that if it contains a comma it doesn't break the CSV format. You may want to handle this differently.
MY BAD: this is a C# program. I will leave it up as an alternate solution.
The IgnorePatternWhitespace option is there so that the pattern can be commented.
This will create data which can be placed into a CSV file. Note that CSV files do not have optional whitespace in them as in your example.
string data = @"Distribution Votes Rating Title
0000001222 297339 8.4 Reservoir Dogs (1992)
0000001223 64504 8.4 The Third Man (1949)
0000000115 48173 8.4 Jodaeiye Nader az Simin (2011)
0000001232 324564 8.4 The Prestige (2006)
0000001222 301527 8.4 The Green Mile (1999)
";
string pattern = @"
^                       # Always start at the beginning of a line
(                       # Grouping
  (?<Value>[^\s]+)      # Place all text into the Value named capture
  (?:\s+)               # Match but don't capture one or more whitespace characters
){3}                    # 3 groups of data
(?<Value>[^\n\r]+)      # Append the rest of the line to the Value named capture group
";
// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
var result = Regex.Matches(data, pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace)
                  .OfType<Match>()
                  .Select(mt => string.Join(",", mt.Groups["Value"].Captures
                                                   .OfType<Capture>()
                                                   .Select(c => c.Value)));
// result is a sequence of lines, so join it before printing
Console.WriteLine(string.Join(Environment.NewLine, result));
/* output
Distribution,Votes,Rating,Title
0000001222,297339,8.4,Reservoir Dogs (1992)
0000001223,64504,8.4,The Third Man (1949)
0000000115,48173,8.4,Jodaeiye Nader az Simin (2011)
0000001232,324564,8.4,The Prestige (2006)
0000001222,301527,8.4,The Green Mile (1999)
*/