Sed remove only first occurence of a string - regex

I have several string in my text file witch have this case:
Brisbane, Queensland, Australia|BNE
I know how to use the SED command, to replace any character by another one. This time I want to replace the characters coma-space by a pipe, only for the first match to not affect the country name at the same time.
I need to convert it to something like that:
Brisbane|Queensland, Australia|BNE
As you can see, only the first coma-space was replaced, not the second one and I keep the country name "Queensland, Australia" complete. Can someone help me to achieve this, thanks.
Here is a sample of my file:
Brisbane, Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
If you do: sed 's/, /|/' file.txt doesn't work.
The output should be like that:
Brisbane|Queensland, Australia|BNE

Simply don't use the g option. Your sed command should look like this:
sed 's/, /|/'
The s command will by default only the replace the first occurrence of a string in the pattern buffer - unless you pass the g option.

Since you have not posted the output of your test file, we can only guess what you need. And here is may guess:
awk -F", *" 'NF>2{$0=$1"|"$2 OFS $3}1' OFS=", " file
Brisbane|Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
As you see it counts fields to see if it needs | or not. If it neds | then reconstruct the line.

Related

Find the first name that starts with any letter than S using regex

I am new to regex and I am trying to find the last names that only start with S followed by comma and then space and then the first names that doesn't start with S from a text file.
I am using the terminal on a MacBook.
This is my regex
^[S\w][,]?[' ']?[A-RT-Z]?
My full command
cat People.txt | grep -E ^[S\w][,]?[' ']?[A-RT-Z]?
The first name is the second word and the last name is the first word on each line.
The results I get:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
What I am expecting to get
Schmidt, Paul
Smith, Peter
The first rule of writing regular expressions in a shell script (or at the terminal) is "enclose the regular expression in single quotes" so that the shell doesn't try to interpret the metacharacters in the regex. You might sometimes use double quotes instead of single quotes if you need to match single quotes but not double quotes or if you need to interpolate a variable, but aim to use single quotes. Also, avoid UUoC — Useless Use of cat.
Your question currently shows two regular expressions:
^[S\w][,]?[' ']?[A-RT-Z]?
cat People.txt | grep -E ^[S\w][,]?[' ']?[P\w+]?
If written as suggested, these would become:
grep -E -e '^[Sw],? ?[A-RT-Z]?' People.txt
grep -E -e '^[Sw],? ?[Pw+]?' People.txt
The shell removes the backslashes in your rendition. The + in the character class matches a plus sign. You don't need square brackets around the comma (though they do no major harm). I use the -e option for explicitness, and so I can add extra arguments after the regex (-w or -l or -n or …) when editing commands via history. (I also dislike having options recognized after non-option arguments; I often run with $POSIXLY_CORRECT set in my environment. That's a personal quirk.)
The first of the two commands looks for a line starting S or w, followed by an optional comma, an optional blank, and an optional upper-case letter other than S. The second is similar except that it looks for an optional P or w. None of this bears much relationship to the question.
You need an expression more like one of these:
grep -E -e '^[S][[:alpha:]]*, [^S]' People.txt
grep -E -e '^[S][a-zA-Z]*, [^S]' People.txt
These allow single-character names — just S — but you can use + instead of * to require one or more letters.
There are lots of refinements possible, depending on how much you want to work, but this does the primary job of finding 'first word on the line starts with S, and is followed by a comma, a blank, and the second word does not start with S'.
Given a file People.txt containing:
Randall, Steven
Rogers, Timothy
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Titus, Persephone
Williams, Shirley
Someone
S
Your regular expressions produce the output:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Someone
S
My commands produce:
Schmidt, Paul
Smith, Peter
Something like this seems to work fine:
^S.*, [^S].*$
^S.* - must start with S and start capturing everything
, [^S] - leading up to a comma, space, not S
.*$ - capture the rest of the string
https://regex101.com/r/76bfji/1

Regex to replace a string in context but not the context

I am new to regex and want to do the following task:
I have a string say, JOHN.S and I would want to replace the period with tab. However, the replacement should only occur if the period is between two letters. Something that I don't want it to happen is to replace period in John, S. with a tab. Instead, I will just replace , with a tab, which I know how to do.
If I try to replace /[a-zA-Z]\.[a-zA-Z]/, then the surrounding letters will be removed but obviously I want to keep them. They should just be used to identify the context.
I have searched for a long time but have not come up a solution. More specifically, I am working with bash. So maybe sed is what I am going to use.
Thank you.
It is just a matter of catching the surrounding information with () and printing them back with \1, \2, etc:
sed -r 's/(\w)\.(\w)/\1\t\2/g' file
Using your syntax:
sed -r 's/([a-zA-Z])\.([a-zA-Z])/\1\t\2/g' file
Test
$ cat file
John, S.
JOHN.S
blabla
$ sed -r 's/(\w)\.(\w)/\1\t\2/g' file
John, S.
JOHN S
blabla

regular expression to replace lastname, firstname middle initial to email format

Need a little help creating a regular expression to take, for example :
Smith, John R
and turn it into
john.r.smith#gmail.com
You can use the following regex in C++11:
string s = "Smith, John R"; // to john.r.smith#gmail.com
const regex r("(.*), (.*) (.*)");
const string fmt("$2.$3.$1#gmail.com");
cout << regex_replace(s, r, fmt) << endl;
Note: this will give you John.R.Smith#gmail.com, you may further need to change it to lowercase if you need john.r.smith#gmail.com, which is quite a easy task.
Since the language is not specified , i tried this in VIM and it works perfectly.
%s/\v(\w*),\s*(\w*)\s*(\w)/\L\2.\L\3.\1#gmail.com/
Attached is screenshot
Same as previous answer, but in Shell:
echo "Smith, John R" | awk '{print tolower($0)}' | sed 's/\(.*\),\s\(.*\)\s\(.*\)/\2.\3.\1#gmail.com/g'
john.r.smith#gmail.com
Actually thanks to #KP6, I realized sed can lowercase too :)
So, much simpler version would be:
echo "Smith, John R" | sed 's/\(.*\),\s\(.*\)\s\(.*\)/\L\2.\L\3.\L\1#gmail.com/g'
john.r.smith#gmail.com
Just capture the needed tokens:
(.+?),\s(.+?)\s(.+)
$1 is last name.
$2 is first name
$3 is middle name
Now build your email address.
As mention by other, regex seems like overkill

Regex code for address separated by commas

How can I extract the state text which is before third comma only using the regex code?
54 West 21st Street Suite 603, New York,New York,United States, 10010
I've managed to extract the rest how I wanted but this one is a problem.
Also, how can I extract the "United States" please?
It looks like you want to use capturing groups:
.*,.*,(.*),(.*),.*
The first capturing group will be "New York" and the second will be "United States" (try it on Rubular).
Or you can split by commas (which will probably be even simpler) as #Jerry points out, assuming the language/tool you're using supports that.
You can use this regex:
(?:[^,]*,){2}([^,]*)
And use captured group # 1 for your desired String.
TL;DR
A lot depends on your regular expression engine, and whether you really need a regular expression or field-splitting. You can do field-splitting in Ruby and Awk (among others), but sed and grep only do regular expressions. See some examples below to get you started.
Ruby
str = '54 West 21st Street Suite 603, New York,New York,United States, 10010'
str.match /(?:.*?,){2}([^,]+)/
$1
#=> "New York"
GNU sed
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
sed -rn 's/([^,]+,){2}([^,]+).*/\2/p'
GNU awk
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
awk -F, '{print $3}'

Grep Regex - Words in brackets?

I want to know the regex in grep to match everything that isn't a specific word. I know how to not match everything that isn't a single character,
gibberish blah[^.]*jack
That would match blah, jack and everything in between as long as the in between didn't contain a period. But is it possible to do something like this?
gibberish blah[^joe]*jack
Match blah, jack and everything in between as long as the in between didn't contain the word "joe"?
UPDATE:
I can also use AWK if that would better suit this purpose.
So basically, I just want to get the sentence "gibberish blah other words jack", as long as "joe" isn't in the other words.
Update 2 (The Answer, to a different question):
Sorry, I am tired. The sentence actually can contain the word "joe", but not two of them. So "gibberish blah jill joe moo jack" would be accepted, but "gibberish blah jill joe moo joe jack" wouldn't.
Anyway, I figured out the solution to my problem. Just grep for "gibberish.*jack" and then do a word count (wc) to see how many "joes" are in that sentence. If wc comes back with 1, then it's ok, but if it comes back with 2 or more, the sentence is wrong.
So, sorry for asking a question that wouldn't even solve my problem. I will mark sputnick's answer as the right one, since his answer looks like it would solve the original posts problem.
What you're looking for is named look around, it's an advanced regex technique in pcre & perl. It's used in modern languages. grep can handle this expressions if you have the -P switch. If you don't have -P, try pcregrep instead. (or any modern language).
See
http://www.perlmonks.org/?node_id=518444
http://www.regular-expressions.info/lookaround.html
NOTE
If you just want to negate a regex, maybe a simple grep -v "regex" will be sufficient. (It depends of your needs) :
$ echo 'gibberish blah other words jack' | grep -v 'joe'
gibberish blah other words jack
$ echo 'gibberish blah joe other words jack' | grep -v 'joe'
$
See
man grep | less +/invert-match
Try the negative lookbehind syntax:
blahish blah(?<!joe)*jack