regular expression to replace lastname, firstname middle initial to email format - regex

Need a little help creating a regular expression to take, for example :
Smith, John R
and turn it into
john.r.smith#gmail.com

You can use the following regex in C++11:
string s = "Smith, John R"; // to john.r.smith#gmail.com
const regex r("(.*), (.*) (.*)");
const string fmt("$2.$3.$1#gmail.com");
cout << regex_replace(s, r, fmt) << endl;
Note: this gives you John.R.Smith#gmail.com. If you need john.r.smith#gmail.com, convert the result to lowercase afterwards, which is an easy task.

Since the language is not specified, I tried this in Vim and it works perfectly.
%s/\v(\w*),\s*(\w*)\s*(\w)/\L\2.\L\3.\1#gmail.com/

Same as previous answer, but in Shell:
echo "Smith, John R" | awk '{print tolower($0)}' | sed 's/\(.*\),\s\(.*\)\s\(.*\)/\2.\3.\1#gmail.com/g'
john.r.smith#gmail.com
Actually, thanks to @KP6, I realized sed can lowercase too :)
So a much simpler version would be:
echo "Smith, John R" | sed 's/\(.*\),\s\(.*\)\s\(.*\)/\L\2.\L\3.\L\1#gmail.com/g'
john.r.smith#gmail.com

Just capture the needed tokens:
(.+?),\s(.+?)\s(.+)
$1 is last name.
$2 is first name
$3 is middle name
Now build your email address.
As mentioned by others, regex seems like overkill here.
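As a rough illustration of "capture the tokens, then build the address", here is a minimal Python sketch. The function names are hypothetical, the "#gmail.com" suffix is copied verbatim from the question, and the input is assumed to always look like "Last, First Middle". The second function shows the non-regex route hinted at above.

```python
import re

# Same three capture groups as the pattern above, anchored to the whole string.
NAME_RE = re.compile(r"^(.+?),\s(.+?)\s(.+)$")


def email_with_regex(name: str) -> str:
    """Build the address from the three captured tokens (hypothetical helper)."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected name format: {name!r}")
    last, first, middle = m.groups()
    return f"{first}.{middle}.{last}#gmail.com".lower()


def email_with_split(name: str) -> str:
    """Same result without a regex, assuming the 'Last, First Middle' layout."""
    last, rest = name.split(", ", 1)
    first, middle = rest.split(" ", 1)
    return f"{first}.{middle}.{last}#gmail.com".lower()


print(email_with_regex("Smith, John R"))
print(email_with_split("Smith, John R"))
```

Both calls lowercase the whole result in one step, so no per-group case conversion is needed.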

Related

Find the first name that starts with any letter other than S using regex

I am new to regex and I am trying to find, in a text file, the lines where the last name starts with S, followed by a comma and a space, and the first name does not start with S.
I am using the terminal on a MacBook.
This is my regex
^[S\w][,]?[' ']?[A-RT-Z]?
My full command
cat People.txt | grep -E ^[S\w][,]?[' ']?[A-RT-Z]?
The first name is the second word and the last name is the first word on each line.
The results I get:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
What I am expecting to get
Schmidt, Paul
Smith, Peter
The first rule of writing regular expressions in a shell script (or at the terminal) is "enclose the regular expression in single quotes" so that the shell doesn't try to interpret the metacharacters in the regex. You might sometimes use double quotes instead of single quotes if you need to match single quotes but not double quotes or if you need to interpolate a variable, but aim to use single quotes. Also, avoid UUoC — Useless Use of cat.
Your question currently shows two regular expressions:
^[S\w][,]?[' ']?[A-RT-Z]?
cat People.txt | grep -E ^[S\w][,]?[' ']?[P\w+]?
If written as suggested, these would become:
grep -E -e '^[Sw],? ?[A-RT-Z]?' People.txt
grep -E -e '^[Sw],? ?[Pw+]?' People.txt
The shell removes the backslashes in your rendition. The + in the character class matches a plus sign. You don't need square brackets around the comma (though they do no major harm). I use the -e option for explicitness, and so I can add extra arguments after the regex (-w or -l or -n or …) when editing commands via history. (I also dislike having options recognized after non-option arguments; I often run with $POSIXLY_CORRECT set in my environment. That's a personal quirk.)
The first of the two commands looks for a line starting S or w, followed by an optional comma, an optional blank, and an optional upper-case letter other than S. The second is similar except that it looks for an optional P or w. None of this bears much relationship to the question.
You need an expression more like one of these:
grep -E -e '^[S][[:alpha:]]*, [^S]' People.txt
grep -E -e '^[S][a-zA-Z]*, [^S]' People.txt
These allow single-character names — just S — but you can use + instead of * to require one or more letters.
There are lots of refinements possible, depending on how much you want to work, but this does the primary job of finding 'first word on the line starts with S, and is followed by a comma, a blank, and the second word does not start with S'.
Given a file People.txt containing:
Randall, Steven
Rogers, Timothy
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Titus, Persephone
Williams, Shirley
Someone
S
Your regular expressions produce the output:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Someone
S
My commands produce:
Schmidt, Paul
Smith, Peter
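For comparison, the same filter can be sketched in Python. This is only an illustration of the pattern '^S[a-zA-Z]*, [^S]' applied to the sample data; the list and function names are made up for the example.

```python
import re

# Last name starts with S, then letters, a comma, a blank, and a first
# name whose first character is not S.
PATTERN = re.compile(r"^S[A-Za-z]*, [^S]")

# Sample data copied from the People.txt listing above.
people = [
    "Randall, Steven",
    "Rogers, Timothy",
    "Schmidt, Paul",
    "Sells, Simon",
    "Smith, Peter",
    "Stephens, Sheila",
    "Titus, Persephone",
    "Williams, Shirley",
    "Someone",
    "S",
]


def matching_lines(lines):
    """Return the lines that the pattern matches, in input order."""
    return [line for line in lines if PATTERN.search(line)]


print("\n".join(matching_lines(people)))
```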
Something like this seems to work fine:
^S.*, [^S].*$
^S.* - the line must start with S; then match everything
, [^S] - up to a comma, a space, and a character that is not S
.*$ - match the rest of the line
https://regex101.com/r/76bfji/1

Sed: replace only the first occurrence of a string

I have several strings in my text file that look like this:
Brisbane, Queensland, Australia|BNE
I know how to use the sed command to replace one character with another. This time I want to replace the comma-space sequence with a pipe, but only for the first match, so that the country name is not affected.
I need to convert it to something like this:
Brisbane|Queensland, Australia|BNE
As you can see, only the first comma-space was replaced, not the second one, so the name "Queensland, Australia" stays complete. Can someone help me achieve this? Thanks.
Here is a sample of my file:
Brisbane, Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
Running sed 's/, /|/' file.txt doesn't work.
The output should look like this:
Brisbane|Queensland, Australia|BNE
Simply don't use the g option. Your sed command should look like this:
sed 's/, /|/'
By default, the s command replaces only the first occurrence of the pattern on each line (in sed terms, in the pattern space), unless you pass the g option.
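The "first occurrence only" behaviour has a direct counterpart in most languages. A small Python sketch, with a hypothetical function name and the sample lines taken from the question:

```python
import re


def first_comma_to_pipe(line: str) -> str:
    """Replace only the first ', ' with '|', like sed 's/, /|/' without g."""
    return line.replace(", ", "|", 1)


def first_comma_to_pipe_re(line: str) -> str:
    """Same idea with a regex; count=1 limits the substitution to one match."""
    return re.sub(r", ", "|", line, count=1)


sample = [
    "Brisbane, Queensland, Australia|BNE",
    "Bristol, VA|TRI",
]

for line in sample:
    print(first_comma_to_pipe(line))
```

Like the sed command, this also rewrites lines that contain a single comma-space; whether that is wanted depends on the expected output for those lines.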
Since you have not posted the expected output for your whole test file, we can only guess what you need. Here is my guess:
awk -F", *" 'NF>2{$0=$1"|"$2 OFS $3}1' OFS=", " file
Brisbane|Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
As you can see, it counts the fields to decide whether a | is needed. If it is, it reconstructs the line.

Why does this regex not work with grep?

I have a text file like this:
"an arbitrary string" = "this is the text one"
"other arbitrary string" = "second text"
"a third arbitrary string" = "the text number three"
I want to obtain only this
an arbitrary string
other arbitrary string
a third arbitrary string
That is, the text inside the first quotes, or between the first " and the " =. I used this regex
(?!").*(?=(" =))
This is working when I tried it in RegExr and in this online tool. But in my OSX Terminal it does not work, the output is empty
grep -o '(?!").*(?=(" =))' input.txt
What is wrong here? Do I have to escape some characters? I tried escaping each of them and nothing changes.
Thank you so much and please excuse my lack of knowledge about this topic.
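Lookaheads and lookbehinds are PCRE features, so you have to use the -P option. Note that -P is a GNU grep option; the BSD grep that ships with macOS does not support it, so you may need to install GNU grep first: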
grep -Po '(?!").*(?=(" =))' input.txt
This should do:
awk -F\" '{print $2}' file
It uses " as the field separator and then prints the second field.
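The field-splitting idea carries over to other languages unchanged. A minimal Python sketch (the function name is hypothetical, and every line is assumed to contain at least one quoted string):

```python
def first_quoted_field(line: str) -> str:
    """Split on double quotes and return the second field, like awk -F'"' '{print $2}'."""
    return line.split('"')[1]


lines = [
    '"an arbitrary string" = "this is the text one"',
    '"other arbitrary string" = "second text"',
    '"a third arbitrary string" = "the text number three"',
]

for line in lines:
    print(first_quoted_field(line))
```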
steffen's answer is right: you have to use the -P flag. But there is also a problem with your regex.
Imagine this input:
"an arbitrary string" = " =this is the text one"
Your regex will fail dramatically.
To solve this you have to use something like this:
grep -Po '^"\K.*?(?=(" =))'
^ to prevent other matches that do not begin from the line start.
\K drops everything matched so far from the reported match. It is easier to read than a lookbehind, and unlike a lookbehind it lets the part before it have arbitrary length.
.*? to make it non-greedy.
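To see why the non-greedy quantifier matters, here is a small Python sketch on the tricky input. Python's built-in re module has no \K, so a capture group is swapped in for it; the variable names are only for illustration.

```python
import re

# Greedy: .* runs to the LAST '" =' on the line.
GREEDY = re.compile(r'^"(.*)" =')
# Lazy: .*? stops at the FIRST '" =' after the opening quote.
LAZY = re.compile(r'^"(.*?)" =')

tricky = '"an arbitrary string" = " =this is the text one"'

greedy_result = GREEDY.match(tricky).group(1)
lazy_result = LAZY.match(tricky).group(1)

print(repr(greedy_result))
print(repr(lazy_result))
```

The greedy version swallows the first closing quote and the equals sign, while the lazy version returns only the text inside the first pair of quotes.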

RegEx Find and Replace Sentence

I'm looking for a way to find and replace a sentence using regex. The regex should be able to find a sentence of any length. I can get the entire sentence with .* but that doesn't allow it to replace with \1.
FIND:
"QUESTION1" = "What is the day satellite called?"
"ANSWER1" = "The sun"
REPLACE:
<key>What is the day satellite called?</key>
<key>The sun</key>
You need to use a capturing group so that you can refer to the captured text through a back-reference.
Regex:
.*(?<= \")([^"]*).*
Replacement string:
<key>\1</key>
Find using the following expression (modifiers required: g and m):
^[^=]+= "(.*?)"$
and then replace them using:
<key>$1</key>
or
<key>\1</key>
using perl:
> cat temp
"QUESTION1" = "What is the day satellite called?"
"ANSWER1" = "The sun"
> perl -lne 'print "<key>".$1."<\/key>" if(/\".*?\".*?\"(.*?)\"/)' temp
<key>What is the day satellite called?</key>
<key>The sun</key>
>
Perl One-Liner
A compact approach: search for (?m)"([^"]+)"$
Replace with <key>$1</key> if you want <key>What is the day satellite called?</key>
or
Replace: "<key>$1</key>" if you want "<key>What is the day satellite called?</key>"
With a perl one-liner:
perl -pe 's!(?m)"([^"]+)"$!<key>$1</key>!g' yourfile
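The same capture-and-backreference replacement can be sketched in Python. This uses the pattern ^[^=]+= "(.*?)"$ from one of the answers above with the multiline flag; the function name is hypothetical.

```python
import re

text = (
    '"QUESTION1" = "What is the day satellite called?"\n'
    '"ANSWER1" = "The sun"'
)


def to_keys(source: str) -> str:
    """Replace each '"NAME" = "value"' line with '<key>value</key>'."""
    # re.MULTILINE makes ^ and $ match at every line boundary;
    # \1 in the replacement refers to the first capture group.
    return re.sub(r'^[^=]+= "(.*?)"$', r"<key>\1</key>", source, flags=re.MULTILINE)


print(to_keys(text))
```

re.sub replaces every match by default, which corresponds to the g modifier mentioned above.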
For those coming here from a Google search, you can experiment with this tool to find the right regular expression: http://regexr.com/

Grep Regex - Words in brackets?

I want to know the regex in grep to match everything that isn't a specific word. I know how to match everything that isn't a single character:
gibberish blah[^.]*jack
That would match blah, jack and everything in between as long as the in between didn't contain a period. But is it possible to do something like this?
gibberish blah[^joe]*jack
Match blah, jack and everything in between as long as the in between didn't contain the word "joe"?
UPDATE:
I can also use AWK if that would better suit this purpose.
So basically, I just want to get the sentence "gibberish blah other words jack", as long as "joe" isn't in the other words.
Update 2 (The Answer, to a different question):
Sorry, I am tired. The sentence actually can contain the word "joe", but not two of them. So "gibberish blah jill joe moo jack" would be accepted, but "gibberish blah jill joe moo joe jack" wouldn't.
Anyway, I figured out the solution to my problem. Just grep for "gibberish.*jack" and then do a word count (wc) to see how many "joes" are in that sentence. If wc comes back with 1, then it's ok, but if it comes back with 2 or more, the sentence is wrong.
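The match-then-count approach described above could look like this in Python. It is a sketch under assumptions: the function name is made up, "joe" is counted as a whole word, and a sentence with no "joe" at all is treated as acceptable.

```python
import re

SENTENCE = re.compile(r"gibberish.*jack")
JOE = re.compile(r"\bjoe\b")


def is_acceptable(line: str) -> bool:
    """True if the line contains the sentence and at most one 'joe' inside it."""
    m = SENTENCE.search(line)
    if m is None:
        return False
    # Count whole-word occurrences of 'joe' within the matched sentence.
    return len(JOE.findall(m.group(0))) <= 1


print(is_acceptable("gibberish blah jill joe moo jack"))
print(is_acceptable("gibberish blah jill joe moo joe jack"))
```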
So, sorry for asking a question that wouldn't even solve my problem. I will mark sputnick's answer as the right one, since his answer looks like it would solve the original post's problem.
What you're looking for is called lookaround, an advanced regex technique from PCRE and Perl that most modern languages support. grep can handle these expressions if it has the -P switch. If you don't have -P, try pcregrep instead (or any modern language).
See
http://www.perlmonks.org/?node_id=518444
http://www.regular-expressions.info/lookaround.html
NOTE
If you just want to negate a regex, a simple grep -v "regex" may be sufficient (it depends on your needs):
$ echo 'gibberish blah other words jack' | grep -v 'joe'
gibberish blah other words jack
$ echo 'gibberish blah joe other words jack' | grep -v 'joe'
$
See
man grep | less +/invert-match
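Try the negative lookahead syntax (this needs grep -P or another PCRE-capable tool). A lookaround cannot be repeated on its own, so pair it with a dot and repeat the pair:
gibberish blah((?!joe).)*jack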
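One common way to say "anything in between, as long as it does not contain the word joe" is a negative lookahead checked before every character. A minimal Python sketch of that idea (the pattern and function name are illustrative, not taken from the answers above):

```python
import re

# (?:(?!joe).)* consumes one character at a time, but only while the text
# at that position does not start with "joe".
NO_JOE = re.compile(r"blah(?:(?!joe).)*jack")


def matches_without_joe(line: str) -> bool:
    """True if 'blah ... jack' occurs with no 'joe' between the two words."""
    return NO_JOE.search(line) is not None


print(matches_without_joe("gibberish blah other words jack"))
print(matches_without_joe("gibberish blah joe other words jack"))
```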