Python regex re.sub() is not matching and replacing as expected - regex

The following regex isn't replacing substrings as expected.
I've tried running the code with the following modifications (one at a time, of course) all with no luck:
Utilizing list comprehensions (current)
Using a traditional for loop
Adding the regex result back to the iterator itself
Appending the regex result to a new list
Checked the type of 'name' (it's a string)
Utilized (copied) code format from another regex in my notebook that is currently working
Put the regex into regex101.com to verify that it's functioning (you can see the regex and data I'm using here)
Adding/removing the raw string indicators preceding the regex and substitution patterns
names is a list of strings
reg_pattern = r"(?!\\s)(\\W[^\\W,]+)(?!,) and\\s([^ ]+ )([^ ]+)"
sub_pattern = r"\\1 \\3 \\2\\3"
cleaned_names = []
cleaned_names = [re.sub(reg_pattern, sub_pattern, name) for name in names]
The goal can be seen in the link above (particularly in the 'substitution' section at the bottom of that page), but ultimately, I need to append group3 of the regex to the end of group1.
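One thing worth checking, assuming the doubled backslashes shown in reg_pattern and sub_pattern are really present in the notebook (they may only be an artifact of how the snippet was pasted): inside a raw string, \\s is a regex for a literal backslash followed by "s", not the whitespace class, and \\1 in the replacement is a literal backslash followed by "1", not a group reference. A minimal sketch with a made-up sample string:

```python
import re

text = "Jan and Edgar"  # made-up sample, not taken from the question's data

# Raw string with doubled backslashes: matches a literal backslash, then "s".
doubled = re.search(r"\\s", text)
# Raw string with a single backslash: matches one whitespace character.
single = re.search(r"\s", text)

print(doubled)         # None, because the text contains no backslash
print(single.group())  # the space after "Jan"

# The same applies to the replacement template.
literal = re.sub(r"(Jan)", r"\\1", text)   # inserts a backslash and "1"
grouped = re.sub(r"(Jan)", r"<\1>", text)  # inserts the captured group
print(literal)
print(grouped)
```

If the notebook really contains the doubled form, using single backslashes inside the raw strings would be the first thing to try.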

I'm guessing that you're trying to re.sub the couples' names, for which you could write an expression similar to:
([A-Z][a-z]+)\s+and\s+(.*)([A-Z]\S*)
This works if you don't have edge cases. If you do, you'd probably want to modify the character classes, such as [A-Z], and add the other characters you need.
Demo
Test
import re
l = ['George Rosario, Ali Jones, Barbara Boll, and Lindsay McKelvoy', 'Jan and Edgar Adelman', 'Bill Mack and Les Lieberman', 'Dr. Susan Muehle-Bussel, Ray Morales, and Dr. Samuel Barker', 'Dan Barroso and Emily High', 'Cassie and George Sorenson', 'Tom Scott and Mark Smith', 'The scene at IDEAL School & Academy’s 10th\xa0Annual Gala.',
'Les Lieberman, Barri Lieberman, Isabel Kallman, Trish Iervolino, and Ron Iervolino', 'Chuck Grodin', 'Diana Rosario, Ali Sussman, Sarah Boll, Jen Zaleski, Alysse Brennan, and Lindsay Macbeth', 'Kelly and Tom Murro', 'Udo Spreitzenbarth', 'Ron Iervolino, Trish Iervolino, Russ Middleton, and Lisa Middleton', 'Barbara Loughlin, Dr. Gerald Loughlin, and Debbie Gelston', 'Julianne Michelle']
e = r'([A-Z][a-z]+)\s+and\s+(.*)([A-Z]\S*)'
l_out = []
for names in l:
    if re.match(e, names):
        l_out.append(re.sub(e, r'\1 \3 and \2\3', names))
    else:
        l_out.append(names)
print(l_out)
Output
['George Rosario, Ali Jones, Barbara Boll, and Lindsay McKelvoy', 'Jan
Adelman and Edgar Adelman', 'Bill Mack and Les Lieberman', 'Dr. Susan
Muehle-Bussel, Ray Morales, and Dr. Samuel Barker', 'Dan Barroso and
Emily High', 'Cassie Sorenson and George Sorenson', 'Tom Scott and
Mark Smith', 'The scene at IDEAL School & Academy’s 10th\xa0Annual
Gala.', 'Les Lieberman, Barri Lieberman, Isabel Kallman, Trish
Iervolino, and Ron Iervolino', 'Chuck Grodin', 'Diana Rosario, Ali
Sussman, Sarah Boll, Jen Zaleski, Alysse Brennan, and Lindsay
Macbeth', 'Kelly Murro and Tom Murro', 'Udo Spreitzenbarth', 'Ron
Iervolino, Trish Iervolino, Russ Middleton, and Lisa Middleton',
'Barbara Loughlin, Dr. Gerald Loughlin, and Debbie Gelston', 'Julianne
Michelle']
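A possible variation on the loop above, as a sketch rather than the answer's own method: re.match only tests at the start of the string, while re.sub searches anywhere, which is why the loop needs the re.match guard. Anchoring the pattern with ^ gives the same start-only behaviour in a single call. The names COUPLE and expand_couple are made up for this illustration.

```python
import re

# Same expression as above, anchored so it can only match at the start.
COUPLE = re.compile(r'^([A-Z][a-z]+)\s+and\s+(.*)([A-Z]\S*)')

def expand_couple(name):
    """Give the first partner the shared surname; leave other strings as-is."""
    return COUPLE.sub(r'\1 \3 and \2\3', name)

samples = ['Jan and Edgar Adelman', 'Bill Mack and Les Lieberman', 'Chuck Grodin']
cleaned = [expand_couple(s) for s in samples]
print(cleaned)
```

Strings that do not start with "First and ..." come back unchanged, so no else branch is needed.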
Or you can try
import re
l = ['George Rosario, Ali Jones, Barbara Boll, and Lindsay McKelvoy', 'Jan and Edgar Adelman', 'Bill Mack and Les Lieberman', 'Dr. Susan Muehle-Bussel, Ray Morales, and Dr. Samuel Barker', 'Dan Barroso and Emily High', 'Cassie and George Sorenson', 'Tom Scott and Mark Smith', 'The scene at IDEAL School & Academy’s 10th\xa0Annual Gala.',
'Les Lieberman, Barri Lieberman, Isabel Kallman, Trish Iervolino, and Ron Iervolino', 'Chuck Grodin', 'Diana Rosario, Ali Sussman, Sarah Boll, Jen Zaleski, Alysse Brennan, and Lindsay Macbeth', 'Kelly and Tom Murro', 'Udo Spreitzenbarth', 'Ron Iervolino, Trish Iervolino, Russ Middleton, and Lisa Middleton', 'Barbara Loughlin, Dr. Gerald Loughlin, and Debbie Gelston', 'Julianne Michelle']
e = r'([A-Z][a-z]+)\s+and\s+(.*)([A-Z]\S*)'
l_out = []
for names in l:
    if re.match(e, names):
        l_out.append(re.sub(e, r'\1 \3', names))
        l_out.append(re.sub(e, r'\2\3', names))
    else:
        l_out.append(names)
print(l_out)
Output
['George Rosario, Ali Jones, Barbara Boll, and Lindsay McKelvoy', 'Jan
Adelman', 'Edgar Adelman', 'Bill Mack and Les Lieberman', 'Dr. Susan
Muehle-Bussel, Ray Morales, and Dr. Samuel Barker', 'Dan Barroso and
Emily High', 'Cassie Sorenson', 'George Sorenson', 'Tom Scott and Mark
Smith', 'The scene at IDEAL School & Academy’s 10th\xa0Annual Gala.',
'Les Lieberman, Barri Lieberman, Isabel Kallman, Trish Iervolino, and
Ron Iervolino', 'Chuck Grodin', 'Diana Rosario, Ali Sussman, Sarah
Boll, Jen Zaleski, Alysse Brennan, and Lindsay Macbeth', 'Kelly
Murro', 'Tom Murro', 'Udo Spreitzenbarth', 'Ron Iervolino, Trish
Iervolino, Russ Middleton, and Lisa Middleton', 'Barbara Loughlin, Dr.
Gerald Loughlin, and Debbie Gelston', 'Julianne Michelle']
If you wish to simplify/modify/explore the expression, it's explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.

Related

Pyspark: Create additional column based on Regex

I recently started with PySpark and I'm trying to figure out the regex matching.
For the regexes, I've created a list, and if one of the items in the list is found in the name column, the added column must be true. This regex matching must not be case sensitive, as seen in the example below.
I have a table in the following format:
seqno  name
-----  ------------
1      john jones
2      John Jones
3      John Stones
4      Mary Wild
5      William Wurt
6      steven wurt
I need to change the Table above to the format of the Table below. This is just a small part of the actual table so hard coding is not going to cut it unfortunately.
seqno  name          regex
-----  ------------  -----
1      john jones    True
2      John Jones    True
3      John Stones   True
4      Mary Wild     False
5      William Wurt  True
6      steven wurt   True
Here is the code to create part of the Table:
regex_list = ['john', 'wurt']
columns = ['seqno', 'name']
data = [('1', 'john jones'),
        ('2', 'John Jones'),
        ('3', 'John Stones'),
        ('4', 'Mary Wild'),
        ('5', 'William Wurt'),
        ('6', 'steven wurt')]
df = spark.createDataFrame(data=data, schema=columns)
I've been trying numerous applications with .isin and .rlike but can't seem to make it work. Any help would be gladly appreciated.
Thanks in advance!
Use rlike to check whether any of the listed regexes matches the name. To make the test case-insensitive, convert both the list and the column to the same case while testing. Code below:
from pyspark.sql.functions import col, upper
df.withColumn('regex', upper(col('name')).rlike('|'.join([x.upper() for x in regex_list]))).show()
+-----+------------+-----+
|seqno|        name|regex|
+-----+------------+-----+
|    1|  john jones| true|
|    2|  John Jones| true|
|    3| John Stones| true|
|    4|   Mary Wild|false|
|    5|William Wurt| true|
|    6| steven wurt| true|
+-----+------------+-----+
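The matching logic itself can be checked without a Spark session. The following is only a plain-Python sketch of the same idea (join the patterns with | and ignore case), not PySpark code; the helper name matches_any is made up for the illustration.

```python
import re

regex_list = ['john', 'wurt']
names = ['john jones', 'John Jones', 'John Stones',
         'Mary Wild', 'William Wurt', 'steven wurt']

# One alternation pattern, compiled case-insensitively instead of upper-casing.
combined = re.compile('|'.join(regex_list), re.IGNORECASE)

def matches_any(name):
    """True if any pattern in regex_list is found anywhere in the name."""
    return combined.search(name) is not None

flags = [matches_any(n) for n in names]
print(flags)
```

Using a case-insensitive flag rather than upper-casing the patterns also avoids changing the meaning of escapes such as \d, should the list ever contain them.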

Google Sheets ARRAYFORMULA count preceding rows that meet condition

Let's say I have a spreadsheet that looks something like this:
Name                   D-List
---------------------  ------
Arnold Schwarzenegger
Bruce Willis
Dolph Lundgren
Dwayne Johnson
Jason Statham
Keanu Reeves
Samuel L. Jackson
Sylvester Stallone
Vin Diesel
For the D-List column, I'd like to count the number of rows up to and including the current one that contain the string "d". If the row doesn't contain a "d", then I want it to return an empty string.
For any given row, I can get this to work with the following pseudo formula:
=IF(REGEXMATCH(A<row>, "d"), COUNTIF(A<row>, "*d*"), "")
Name                   D-List
---------------------  ------
Arnold Schwarzenegger  1
Bruce Willis
Dolph Lundgren         2
Dwayne Johnson         3
Jason Statham
Keanu Reeves
Samuel L. Jackson
Sylvester Stallone
Vin Diesel             4
I can turn this into an expression that can be duplicated between rows by using INDIRECT and ROW:
=IF(REGEXMATCH(A2, "(?i)d"), COUNTIF(INDIRECT("A2:A" & ROW(A2)), "*D*"), "")
Name                   D-List
---------------------  ------
Arnold Schwarzenegger  1
Bruce Willis
Dolph Lundgren         2
Dwayne Johnson         3
Jason Statham
Keanu Reeves
Samuel L. Jackson
Sylvester Stallone
Vin Diesel             4
However, if I try to stick it in an ARRAYFORMULA, it doesn't work.
=ARRAYFORMULA(IF(REGEXMATCH(A2:A, "D"), COUNTIF(INDIRECT("A2:A" & ROW(A2:A)), "*D*"), ""))
Name                   D-List
---------------------  ------
Arnold Schwarzenegger
Bruce Willis
Dolph Lundgren         1
Dwayne Johnson         1
Jason Statham
Keanu Reeves
Samuel L. Jackson
Sylvester Stallone
Vin Diesel             1
What am I missing?
try:
=ARRAYFORMULA(IF(
REGEXMATCH(A2:A, "(?i)d"), COUNTIFS(
REGEXMATCH(A2:A, "(?i)d"),
REGEXMATCH(A2:A, "(?i)d"), ROW(A2:A), "<="&ROW(A2:A)), ))
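For each row, the COUNTIFS above counts the rows that match "d" (case-insensitively) and whose row number is less than or equal to the current row's. That is a running count, which can be sketched outside Sheets like this (the function name running_d_count is made up for the illustration):

```python
def running_d_count(names):
    """Running count of names containing 'd' (any case); '' for non-matches."""
    count = 0
    out = []
    for name in names:
        if 'd' in name.lower():
            count += 1          # this row matches, so it is counted too
            out.append(count)
        else:
            out.append('')      # rows without a "d" stay blank
    return out

names = ['Arnold Schwarzenegger', 'Bruce Willis', 'Dolph Lundgren',
         'Dwayne Johnson', 'Jason Statham', 'Keanu Reeves',
         'Samuel L. Jackson', 'Sylvester Stallone', 'Vin Diesel']
print(running_d_count(names))
```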

How can I read certain elements from a text file?

I have a text file that has multiple sets of book information (title, author, etc). I need to be able to use a loop to read from the file and assign each piece of info to a corresponding string. I have it working to the point where it goes through the entire file, but the fields get mixed up along the way.
Book readOne(ifstream &fin) {
    string titleOne;
    getline(fin, titleOne, ',');
    string firstOne;
    getline(fin, firstOne, ',');
    string lastOne;
    getline(fin, lastOne, ',');
    string formatOne;
    getline(fin, formatOne, ',');
    string pubDateOne;
    getline(fin, pubDateOne, ',');
    string priceOne;
    getline(fin, priceOne);
Here is the text file:
Gone With the Wind, Margaret Mitchell, Hardcover, 1936, 17.49
The Adventures of Sherlock Holmes, Arthur Doyle, Paperback, 1892, 6.85
The Illustrated A Brief History of Time, Stephen Hawking, Hardcover, 1996, 9.59
Frankenstein, Mary Shelley, Paperback, 1818, 7.99
Command Authority, Tom Clancy, Paperback, 2013, 15.99
Origin, Dan Brown, Ebook, 2017, 14.99
The Lost Order, Steve Berry, Audiobook, 2017, 5.95
The Hunt for Red October, Tom Clacy, Audiobook, 1984, 7.00
Patriot Games, Tom Clancy, Audiobook, 1987, 22.50
The 14th Colony, Steve Berry, Paperback, 2016, 9.99
The Bishop's Pawn, Steve Berry, Ebook, 2018, 14.99
Pride and Prejudice, Jane Austen, Ebook, 1813, 8.99
Sense and Sensibility, Jane Austen, Hardcover, 1811, 19.99
Wuthering Heights, Emily Bronte, Paperback, 1847, 6.99
Jane Eyre, Charlotte Bronte, Hardcover, 1847, 10.95
Anna Karenina, Leo Tolstoy, Paperback, 1877, 5.99
Sahara, Clive Cussler, Ebook, 1992, 5.99
The Notebook, Nicholas Sparks, Hardcover, 1996, 12.59
A Walk to Remember, Nicholas Sparks, Ebook, 1999, 7.99
See Me, Nicholas Sparks, Ebook, 2015, 7.99
The Last Song, Nicholas Sparks, Paperback, 2009, 5.99
The Wedding, Nicholas Sparks, Ebook, 2003, 7.99
My thinking was that it would read until a comma, assign that piece of info to the string, then continue. Instead, it outputs as if it didn't see certain commas.
Gone With the Wind by Margaret Mitchell Hardcover on 1936. Published on 17.49
The Adventures of Sherlock Holmes. It costs $0
The Illustrated A Brief History of Time by Stephen Hawking Hardcover on 1996. Published on 9.59
Frankenstein. It costs $0
Command Authority by Tom Clancy Paperback on 2013. Published on 15.99
Origin. It costs $0
The Lost Order by Steve Berry Audiobook on 2017. Published on 5.95
The Hunt for Red October. It costs $0
Patriot Games by Tom Clancy Audiobook on 1987. Published on 22.50
The 14th Colony. It costs $0
The Bishop's Pawn by Steve Berry Ebook on 2018. Published on 14.99
Pride and Prejudice. It costs $0
Sense and Sensibility by Jane Austen Hardcover on 1811. Published on 19.99
Wuthering Heights. It costs $0
Jane Eyre by Charlotte Bronte Hardcover on 1847. Published on 10.95
Anna Karenina. It costs $0
Sahara by Clive Cussler Ebook on 1992. Published on 5.99
The Notebook. It costs $0
A Walk to Remember by Nicholas Sparks Ebook on 1999. Published on 7.99
See Me. It costs $0
The Last Song by Nicholas Sparks Paperback on 2009. Published on 5.99
The Wedding. It costs $0
This is just a CSV (comma-separated values) file, and there are ample code samples for reading one. Refer to this sample.
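Note that each line of the file has five comma-separated fields (the author's first and last name are separated by a space, not a comma), while readOne reads six. The fifth getline therefore runs past the end of the line into the next title, which matches the output shown, where every second book loses its details. A minimal sketch of the per-line parsing, written in Python rather than C++ and using a made-up parse_book helper:

```python
def parse_book(line):
    """Split one 'title, author, format, year, price' line into fields."""
    title, author, fmt, year, price = [field.strip() for field in line.split(',')]
    # The author field holds "First Last" separated by a space, not a comma.
    first, _, last = author.partition(' ')
    return {
        'title': title,
        'first': first,
        'last': last,
        'format': fmt,
        'year': int(year),
        'price': float(price),
    }

book = parse_book('Gone With the Wind, Margaret Mitchell, Hardcover, 1936, 17.49')
print(book)
```

In the C++ version, the equivalent change would be to read the author with a single comma-delimited getline and split it on the space afterwards. This sketch assumes no title or name contains a comma.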

Printing Out Linked List - C++

I'm trying to print out all of the employees in my linked list, but I'm running into an issue where all but the last employee are printed. My printRoster() function correctly prints the names of all 3 employees in my list, but my print function only prints 2. (I can post more code if necessary)
Here is my text file:
START_OF_FILE
INSERT_EMPLOYEE
123456
John
Smith
64000
35
INSERT_EMPLOYEE
345678
Mike
Jones
70000
30
INSERT_EMPLOYEE
234567
Dean
Thomas
72000
40
PRINT_ROSTER
PRINT_EMPLOYEE
John
Smith
PRINT_EMPLOYEE
Mike
Jones
PRINT_EMPLOYEE
Dean
Thomas
END_OF_FILE
My output:
John Smith, 123456
Mike Jones, 345678
Dean Thomas, 234567
John Smith, 123456
Salary: 64000
Hours: 35
Mike Jones, 345678
Salary: 70000
Hours: 30
Expected output:
John Smith, 123456
Mike Jones, 345678
Dean Thomas, 234567
John Smith, 123456
Salary: 64000
Hours: 35
Mike Jones, 345678
Salary: 70000
Hours: 30
Dean Thomas, 234567
Salary: 72000
Hours: 40
The problem is in your printEmployee function's while loop: while (tempEmployee->next != NULL)
You are checking whether the next employee is present, and the loop body runs only if it is.
In your case, when the loop reaches the last employee, it checks for a next employee, finds none, and skips the body, so the last employee's info is never printed.
You should change your while loop like this:
while (tempEmployee != NULL)
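The difference between the two loop conditions can be seen in a small sketch. It is written in Python and only approximates the question's list, since the actual printEmployee code is not shown; the Employee class and both function names are made up for the illustration.

```python
class Employee:
    """Minimal singly linked list node."""
    def __init__(self, name, next=None):
        self.name = name
        self.next = next

def names_checking_next(head):
    """Loop while node.next exists: stops before visiting the last node."""
    out = []
    node = head
    while node is not None and node.next is not None:
        out.append(node.name)
        node = node.next
    return out

def names_checking_node(head):
    """Loop while node exists: visits every node, including the last."""
    out = []
    node = head
    while node is not None:
        out.append(node.name)
        node = node.next
    return out

roster = Employee('John Smith', Employee('Mike Jones', Employee('Dean Thomas')))
print(names_checking_next(roster))
print(names_checking_node(roster))
```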

GNU Sed format USA address to Street, City, State, Zip

I have the following data for some of my customers:
719 13th Street East, Glencoe MN, 55336
626 Valley Road, Montclair NJ, 07043
666 EAST DYER ROAD, SANTA ANA CA, 92705
20800 N. 135th Ave, Sun City West AZ, 85375
9775 Herring Gull Drive, Indianapolis IN, 46280
712 21st Street, Vero Beach FL, 32960
PO BOX 324, PORT SALERNO FL, 34992
207 Middleton Road, Lafayette LA, 70503
5091 nw fiddle leaf ct, port saint lucie FL, 34986
347 Mayberry Lane, Dover DE, 19904
2648 SW 137th Ave, Miramar FL, 33027
4410 Williams Dr SUITE 104, Georgetown TX, 78628
17020 Windsor Court, Homer Glen IL, 60491
11 Technology Drive North, Warren NJ, 07059
655 Boylston St, Boston MA, 02116
1375 bishops terrace, wixom MI, 48393
4705 Center Blvd Apt. 808, Long Island City NY, 11109
5340 CORNELIA HWY, ALTO GA, 30510
1541 Paces Ferry North, Smyrna GA, 30080
603 west pacific coast hwy, wilmington CA, 90744
2503Paddock CT, Louisville KY, 40216
9421 Dunbar dr, Oakland CA, 94603
1804 Third Avenue Apt #8, New York NY, 10029
2504 bellaire st, wantagh NY, 11793
1380 avon lane apt 21, north lauderdale FL, 33068
How can I use SED regex to format it like
Street Address|City|State|Zip
eg.
719 13th Street East|Glencoe|MN|55336
626 Valley Road|Montclair|NJ|07043
666 EAST DYER ROAD|SANTA ANA|CA|92705
Thanks!
sed 's/^\(.*\), *\(.*\) \(..\), \([0-9][0-9][0-9][0-9][0-9]\)/\1|\2|\3|\4/'
or:
sed -r 's/^(.*), *(.*) (..), ([0-9]{5})/\1|\2|\3|\4/'
Output:
719 13th Street East|Glencoe|MN|55336
626 Valley Road|Montclair|NJ|07043
666 EAST DYER ROAD|SANTA ANA|CA|92705
20800 N. 135th Ave|Sun City West|AZ|85375
9775 Herring Gull Drive|Indianapolis|IN|46280
712 21st Street|Vero Beach|FL|32960
PO BOX 324|PORT SALERNO|FL|34992
207 Middleton Road|Lafayette|LA|70503
5091 nw fiddle leaf ct|port saint lucie|FL|34986
347 Mayberry Lane|Dover|DE|19904
2648 SW 137th Ave|Miramar|FL|33027
4410 Williams Dr SUITE 104|Georgetown|TX|78628
17020 Windsor Court|Homer Glen|IL|60491
11 Technology Drive North|Warren|NJ|07059
655 Boylston St|Boston|MA|02116
1375 bishops terrace|wixom|MI|48393
4705 Center Blvd Apt. 808|Long Island City|NY|11109
5340 CORNELIA HWY|ALTO|GA|30510
1541 Paces Ferry North|Smyrna|GA|30080
603 west pacific coast hwy|wilmington|CA|90744
2503Paddock CT|Louisville|KY|40216
9421 Dunbar dr|Oakland|CA|94603
1804 Third Avenue Apt #8|New York|NY|10029
2504 bellaire st|wantagh|NY|11793
1380 avon lane apt 21|north lauderdale |FL|33068
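For comparison, the same expression carries over to Python's re module almost unchanged. This is only a sketch of the second sed command above, not a replacement for it; the names ADDRESS and format_address are made up for the illustration.

```python
import re

# Street, then city, a two-character state, and a five-digit zip code.
ADDRESS = re.compile(r'^(.*), *(.*) (..), ([0-9]{5})')

def format_address(line):
    """Rewrite 'street, city ST, zip' as 'street|city|ST|zip'."""
    return ADDRESS.sub(r'\1|\2|\3|\4', line)

lines = ['719 13th Street East, Glencoe MN, 55336',
         '626 Valley Road, Montclair NJ, 07043',
         '666 EAST DYER ROAD, SANTA ANA CA, 92705']
for line in lines:
    print(format_address(line))
```

Lines that do not fit the pattern are returned unchanged, as with sed.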
Try with this:
sed -e 's/\([A-Z]*\) \([A-Z][A-Z]\),/\1\|\2,/g' -e 's/, /\|/g'
It replaces every ", " with |. Before that, it searches for AAAA AA, and changes it to AAAA|AA, for the City|State part.
Test
$ sed -e 's/\([A-Z]*\) \([A-Z][A-Z]\),/\1\|\2,/g' -e 's/, /\|/g' your_file
719 13th Street East|Glencoe|MN|55336
626 Valley Road|Montclair|NJ|07043
666 EAST DYER ROAD|SANTA ANA|CA|92705
20800 N. 135th Ave|Sun City West|AZ|85375
9775 Herring Gull Drive|Indianapolis|IN|46280
712 21st Street|Vero Beach|FL|32960
PO BOX 324|PORT SALERNO|FL|34992
207 Middleton Road|Lafayette|LA|70503
5091 nw fiddle leaf ct|port saint lucie|FL|34986
347 Mayberry Lane|Dover|DE|19904
2648 SW 137th Ave|Miramar|FL|33027
4410 Williams Dr SUITE 104|Georgetown|TX|78628
17020 Windsor Court|Homer Glen|IL|60491
11 Technology Drive North|Warren|NJ|07059
sed -e 's/, /|/g' -e 's/ \([^ ]\+\)$/|\1/' file