regex to catch substring in a line and change output with sed - regex

I am trying to put together a regex that will find a substring that has the following format:
[a-z].* ('x' number of lowercase letters)
A '-' sign
[a-z].* ('x' number of lowercase letters)
[0-9].* ('x' number of numbers from 0 to 9)
Anything else in the line that follows this substring (including space or ',') would not be caught by the regex and then I would add a new line to the results so they are in a list.
If this regex works the way I would like then from the following string
file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201',
I would receive this output
abcd-efg123
zfdh-eif23
reox-bmo552
coor-dto201
This is what I have so far. I'm trying to use the regex and then store the result as two variables which I can then put back into sed. I'm not getting the results I expected.
The regex/sed I am using is
sed 's/\([a-z].*\)-\([a-z].*[0-9].*\)/\2 \1 \n/g'
Here is the command straight from the prompt
macbook:~ user$ echo "file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201'," | sed 's/\([a-z].*\)-\([a-z].*[0-9].*\)/\2 \1 \n/g'
dto201', file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor n

Here's the regex I would use to match :
[a-z]*-[a-z]*[0-9]*
The main problem in your regex was the use of .* where you obviously meant *. As I commented, * is the quantifier that means "any number of times (including 0)", while . is the "any character" wildcard. You want to apply the quantifier to your previous character class rather than to ., the . has no reason to be here.
Note that using * includes 0 repetition, so the regex would match a single dash, which might not be to your taste.
Maybe you could be more specific, with a regex along those lines :
[a-z]{4}-[a-z]{3}[0-9]{2,3}
Here, instead of using * as a quantifier, we use numbers between curly brackets: they let us specify an exact number of repetitions (e.g. .{4} means "any 4 characters") or a range of repetitions (e.g. [0-9]{2,6} means "2 to 6 digits"). With grep this syntax requires the -E option (or escaped braces, \{4\}, in basic regex mode). You could also use +, a quantifier that means "at least one time", as mentioned by Kenavoz.
And here's how I would use it in a linux command :
grep -o '[a-z]*-[a-z]*[0-9]*'
or
grep -Eo '[a-z]{4}-[a-z]{3}[0-9]{2,3}'
Here it is in action :
$ echo "file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201'," | grep -o '[a-z]*-[a-z]*[0-9]*'
abcd-efg123
zfdh-eif23
reox-bmo552
coor-dto201
Or with the more specific regex :
$ echo "file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201'," | grep -Eo "[a-z]{4}-[a-z]{3}[0-9]{2,3}"
abcd-efg123
zfdh-eif23
reox-bmo552
coor-dto201
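As a rough sketch of the + variant mentioned above: requiring at least one character in each part keeps the pattern from matching a lone dash. The variable names (input, hosts, lone, none) and the 'a - b' test string are illustrative assumptions, not part of the original post.

```shell
input="file.txt: hostname abcd-efg123, zfdh-eif23 , reox-bmo552, 'coor-dto201',"

# + means "one or more", so every part of the name must be present
hosts=$(printf '%s\n' "$input" | grep -Eo '[a-z]+-[a-z]+[0-9]+')
printf '%s\n' "$hosts"

# Contrast on a string that only contains a bare dash:
lone=$(printf 'a - b\n' | grep -o '[a-z]*-[a-z]*[0-9]*')            # the * version still matches "-"
none=$(printf 'a - b\n' | grep -Eo '[a-z]+-[a-z]+[0-9]+' || true)   # the + version matches nothing
printf 'star: "%s"  plus: "%s"\n' "$lone" "$none"
```

The `|| true` only keeps the script going when grep finds no match and returns a non-zero exit status.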

Related

How to search for multiple words of a specific pattern and separator?

I'm trying to trim out multiple hex words from my string. I'm searching for exactly 3 words, separated by exactly 1 dash each time.
i.e. for this input:
wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar
I'd like to get this output:
wonder-indexing-service-0.20.0.jar
I was able to remove the hex words by repeating the pattern. How can I simplify it? Also, I wasn't able to change * to +, to avoid allowing empty words. Any idea how to do that?
What I've got so far:
# Good, but how can I simplify?
% echo 'wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar' | sed 's/\-[a-fA-F0-9]*\-[a-fA-F0-9]*\-[a-fA-F0-9]*//g'
wonder-indexing-service-0.20.0.jar
# Bad, I'm allowing empty words
% echo 'wonder-indexing-service-0.20.0-1605296913-49b045f-.jar' | sed 's/\-[a-fA-F0-9]*\-[a-fA-F0-9]*\-[a-fA-F0-9]*//g'
wonder-indexing-service-0.20.0.jar
Thank you!
EDIT: I had a typo in original output, thank you anubhava for pointing out.
You may use this sed:
s='wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar'
sed -E 's/(-[a-fA-F0-9]{3,})+//' <<< "$s"
wonder-indexing-service-0.20.0.jar
Breakup:
(: Start a group
-: Match a hyphen
[a-fA-F0-9]{3,}: Match 3 or more hex characters
)+: End the group. Repeat this group 1+ times
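If you want to use + in sed's default (basic) regex syntax, you have to escape it as \+ (a GNU extension). You can also match exactly 3 words, each preceded by a hyphen, by repeating a group with an interval quantifier, whose parentheses and braces also need escaping: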
\(-[a-fA-F0-9]\+\)\{3\}
Example
echo 'wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar' | sed 's/\(-[a-fA-F0-9]\+\)\{3\}//g'
Output
wonder-indexing-service-0.20.0.jar
If you don't want to allow a trailing - then you can match the .jar and put that back in the replacement.
echo 'wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar' | sed 's/\(-[a-fA-F0-9]\+\)\{3\}\(\.jar$\)/\2/g'
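Another option, if the part to keep is always the first four hyphen-separated fields, is to cut them out and append .jar again:
printf "wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar" | cut -d'-' -f1-4 | sed 's#$#.jar#'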
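The escaped quantifiers above can also be written without backslashes by switching sed to extended regex mode with -E. The following is a minimal sketch under that assumption; the function name strip_hex_words is hypothetical and only used for illustration.

```shell
# Remove exactly three "-<hex word>" groups that sit directly before ".jar".
# With -E, ( ) { } and + are quantifier/group syntax without backslashes.
strip_hex_words() {
    printf '%s\n' "$1" | sed -E 's/(-[a-fA-F0-9]+){3}(\.jar)$/\2/'
}

good=$(strip_hex_words 'wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar')
bad=$(strip_hex_words 'wonder-indexing-service-0.20.0-1605296913-49b045f-.jar')

printf '%s\n' "$good"   # the three hex words are removed
printf '%s\n' "$bad"    # left unchanged: the third word is empty, so + does not match
```

Because + requires at least one hex character per word, the input with the empty third word is passed through untouched instead of being silently trimmed.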

Delete all lines which don't match a pattern

I am looking for a way to delete all lines that do not follow a specific pattern (from a txt file).
Pattern which I need to keep the lines for:
x//x/x/x/5/x/
x could be any amount of characters, numbers or special characters.
5 is always a combination of alphanumeric - 5 characters - e.g Xf1Lh, always appears after the 5th forward slash.
/ are actual forward slashes.
Input:
abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
Desired output:
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
grep selects lines according to a regular expression and your x//x/x/x/5/x/ just needs minor changes to make it into a regular expression:
$ grep -E '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' file
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Explanation:
"x could be any amount of characters, numbers or special characters". In a regular expression that is .* where . means any character and * means zero or more of the preceding character (which in this case is .).
"5 is always a combination of alphanumeric - 5 characters". In POSIX regular expressions, [[:alnum:]] means any alphanumeric character, and {5} means five of the preceding item. [[:alnum:]] is locale-aware, so in a UTF-8 locale it can also match non-ASCII letters and digits.
Possible improvements
One issue is how x should be interpreted. In the above, x was allowed to be any character. As triplee points out, however, another reasonable interpretation is that x should be any character except /. In that case:
grep -E '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
Also, we might want this regex to match only complete lines. In that case, we can either surround the regex with ^ and $ or we can use grep's -x option:
grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
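To make the effect of -x concrete, here is a small sketch that runs the whole-line pattern on the sample input from the question. The extra last line (ending in "extra") is a made-up test case, and the variable names input, kept and loose are illustrative assumptions.

```shell
input='abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
yh//x/x/x/5Fsaf/x/extra'

pattern='[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/'

# -x: the pattern has to match the entire line
kept=$(printf '%s\n' "$input" | grep -xE "$pattern")
# without -x a matching substring is enough, so the "extra" line slips through
loose=$(printf '%s\n' "$input" | grep -cE "$pattern")

printf '%s\n' "$kept"
printf 'lines matched without -x: %s\n' "$loose"
```

With -x only the two lines from the desired output remain, while the unanchored search also accepts the line with trailing text.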
I was figuring out how to do it in awk at the same time as the other answer and came up with:
awk -F/ '$2=="" && $6~/^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$/ && NF==8' file
The awk I worked it out on didn't support the {5} regexp frob.
You can use the -P option for Perl-compatible regular expressions (available in GNU grep), like
grep -P '^(?:[^/]*/){5}[A-Za-z0-9]{5}(?:/|$)' input
Output
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Regex Breakdown
^ #Start of line
(?: #Non capturing group
[^/]* #Match anything except /
/ #Match / literally
){5} #Repeat this 5 times
[A-Za-z0-9]{5} #Match alphanumerics. You can use \w if you want to allow _ along with [A-Za-z0-9]
(?: #Non capturing group
/ #Next character should be /
| #OR
$ #End of line
)
Using sed and in place edit to delete all lines that do not follow a specific pattern (from a txt file):
$ sed -i.bak -n "/.*\/\/.*\/.*\/.*\/[a-zA-Z0-9]\{5\}\/.*\//p" test.in
$ cat test.in
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
-i.bak edits the file in place and creates a test.in.bak backup file, -n is quiet mode (do not print non-matching lines to the output),
and the trailing p in "/.../p" prints the matching lines.

Shell script linux, validating integer

This code is meant to check whether a value is an integer or not (I think). I'm trying to understand what each part of that line means by checking the grep man pages, but it's really difficult for me. I found it on the internet. Could anyone explain the grep part, and what each piece of it means?
echo $character | grep -Eq '^(\+|-)?[0-9]+$'
Thanks people!!!
Analyse this regex:
'^(\+|-)?[0-9]+$'
^ - Line Start
(\+|-)? - Optional + or - sign at start
[0-9]+ - One or more digits
$ - Line End
Overall it matches strings like +123 or -98765 or just 9
Here -E is for extended regex support and -q is for quiet in grep command.
PS: btw you don't need grep for this check and can do this directly in pure bash:
re='^(\+|-)?[0-9]+$'
[[ "$character" =~ $re ]] && echo "its an integer"
I like this cheat sheet for regex:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
It is very useful, you could easily analyze the
'^(\+|-)?[0-9]+$'
as
^: Line must begin with...
(): grouping
\: ESC character (because + means something ... see below)
+|-: plus OR minus signs
?: 0 or 1 repetition
[0-9]: range of numbers from 0-9
+: one or more repetitions
$: end of line (no more characters allowed)
so it accepts values like: -312353243 or +1243 or 5678
but does not accept: 3 456 or 6.789 or 56$ (as dollar sign).
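The breakdown above can be checked with a small script. This is only a sketch: the function name is_integer is hypothetical, and the sign is written as the bracket expression [+-], which is assumed here to be equivalent to (\+|-) for this purpose.

```shell
# Succeeds (exit status 0) when the argument is an optionally signed integer.
is_integer() {
    # -E: extended regex, -q: quiet, only the exit status is used
    printf '%s\n' "$1" | grep -Eq '^[+-]?[0-9]+$'
}

for value in 5678 +1243 -312353243 '3 456' 6.789 '56$'; do
    if is_integer "$value"; then
        echo "$value: integer"
    else
        echo "$value: not an integer"
    fi
done
```

The loop reuses the accepted and rejected values listed above, so each line of output can be compared against that list.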

Regex get value between bracket and comma

I have a few strings like these:
new google.maps.LatLng(52.80359, -4.7127),
new google.maps.LatLng(53.80645306, -5.45455287),
new google.maps.LatLng(51.8035914546, -4.7123622894287),
I need to get both the longitude and latitude, so one regex for each number, the - symbol needs including where possible.
I have tried a few online tools, but none seem to pick up a decent pattern.
Simply use grep: grep -oE '[-0-9]+\.[0-9]+'
$ echo "new google.maps.LatLng(52.80359, -4.7127)," | grep -oE '[-0-9]+\.[0-9]+'
52.80359
-4.7127
$ echo "new google.maps.LatLng(53.80645306, -5.45455287)," | grep -oE '[-0-9]+\.[0-9]+'
53.80645306
-5.45455287
$ echo "new google.maps.LatLng(51.8035914546, -4.7123622894287)," | grep -oE '[-0-9]+\.[0-9]+'
51.8035914546
-4.7123622894287
grep is the command-line tool for matching lines in files (or standard input) against a pattern. The -o option tells grep to display only the part of the line that matches (by default grep displays the whole matching line). The -E option tells grep to use extended regular expressions.
In the pattern, [-0-9] matches either a minus sign - or a digit, and the following + means "repeat the previous item one or more times", so in abc123xyz it matches 123, not just 1. The \. matches the decimal point; it has to be escaped with \ because an unescaped . matches any character. Finally, [0-9]+ matches the digits after the decimal point.
See the reference for more information on regular expressions.
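Since the question asks for the two numbers separately, one possible sketch is to pick the first and second match from the grep output. The pattern [-]?[0-9]+\.[0-9]+ is a slightly stricter variant assumed here (at most one leading minus sign), and the variable names line, numbers, lat and lng are illustrative.

```shell
line='new google.maps.LatLng(53.80645306, -5.45455287),'

# One match per output line: an optional minus sign, digits, a literal dot, digits
numbers=$(printf '%s\n' "$line" | grep -oE '[-]?[0-9]+\.[0-9]+')

lat=$(printf '%s\n' "$numbers" | sed -n '1p')   # first match
lng=$(printf '%s\n' "$numbers" | sed -n '2p')   # second match

echo "lat=$lat lng=$lng"
```

Writing the minus sign as [-] avoids a pattern that starts with a dash, which grep could otherwise mistake for an option.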
I would use this approach:
LatLng\((-*\d+\.*\d+),\s(-*\d+\.*\d+)\)
While it matches more than what you probably need, it places the latitude in capture group 1 and the longitude in capture group 2, both excluding the surrounding parentheses and the comma.
See it in action here: http://regexr.com?32od6
In C#, use Regex.Match as follows:
using System.Text.RegularExpressions;
...
Match match = Regex.Match(input, @"([-]?\d+(?:[.]\d+)?)\D+?([-]?\d+(?:[.]\d+)?)");
if (match.Success)
{
    string Lat = match.Groups[1].Value;
    string Lng = match.Groups[2].Value;
}

Grep regular expression for digits in character string of variable length

I need some way to find words that contain any combination of characters and digits but exactly 4 digits only, and at least one character.
EXAMPLE:
a1a1a1a1 // Match
1234 // NO match (no characters)
a1a1a1a1a1 // NO match
ab2b2 // NO match
cd12 // NO match
z9989 // Match
1ab26a9 // Match
1ab1c1 // NO match
12345 // NO match
24 // NO match
a2b2c2d2 // Match
ab11cd22dd33 // NO match
To match a digit in grep you can use [0-9]. To match anything but a digit, you can use [^0-9]. Since there can be any number of non-digit characters, or none at all, you add a "*" (any number of the preceding item). So what you want is, logically:
(anything not a digit or nothing)* (any single digit) (anything not a digit or nothing)* ....
until you have 4 "any single digit" groups. i.e. [^0-9]*[0-9]...
I find that with long grep patterns, especially ones with long strings of special chars that need to be escaped, it's best to build up slowly so you're sure you understand what's going on. For example,
#this will highlight your matches, and make it easier to understand
alias grep='grep --color=auto'
echo 'a1b2' | grep '[0-9]'
will show you how it's matching. You can then extend the pattern once you understand each part.
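Following that build-up idea, here is a sketch of the fully extended pattern applied to the sample words from the question. It assumes grep -E so the repeated group can be written as {4}, and it adds a second grep to enforce the "at least one character" rule; the variable names words, pattern and matches are illustrative.

```shell
words='a1a1a1a1
1234
a1a1a1a1a1
ab2b2
cd12
z9989
1ab26a9
1ab1c1
12345
24
a2b2c2d2
ab11cd22dd33'

# (non-digits, then one digit) exactly four times, then only non-digits to the end
pattern='^([^0-9]*[0-9]){4}[^0-9]*$'

# The second grep drops words that consist of digits only, such as 1234
matches=$(printf '%s\n' "$words" | grep -E "$pattern" | grep '[^0-9]')
printf '%s\n' "$matches"
```

Anchoring with ^ and $ is what turns "at least four digits" into "exactly four digits", since nothing outside the four groups may be a digit.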
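I'm not sure about all the other input you might take (e.g. is ax12ax12ax12ax12 valid?), but this matches words made of four character-plus-digit pairs such as a1a1a1a1; note that it does not cover inputs like z9989 or 1ab26a9, where the digits are not evenly interleaved: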
%> grep -P "^(?:\w\d){4}$" fileWithInput
With grep:
grep -iE '^([a-z]*[0-9]){4}[a-z]*$' | grep -vE '^[0-9]{4}$'
Do it in one pattern with Perl:
perl -ne 'print if /^(?!\d{4}$)([^\W\d_]*\d){4}[^\W\d_]*$/'
The funky [^\W\d_] character class is a cosmopolitan way to spell [A-Za-z]: it catches all letters rather than only the English ones.
If you don't mind using a little shell as well, you could do something like this:
echo "a1a1a1a1" |grep -o '[0-9]'|wc -l
which would display the number of digits found in the string. If you like, you could then test for a given number of matches:
max_match=4
[ "$(echo "a1da4a3aaa4a4" | grep -o '[0-9]'|wc -l)" -le $max_match ] || echo "too many digits."
Assuming you only need ASCII, and you can only access the (fairly primitive) regexp constructs of grep, the following should be pretty close:
grep '^[a-zA-Z]*[0-9][a-zA-Z]*[0-9][a-zA-Z]*[0-9][a-zA-Z]*[0-9][a-zA-Z]*$' | grep '[a-zA-Z]'
You might try
[^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*
But this will match 1234. Why doesn't that match your criteria?
For words that strictly alternate one letter and one digit, the regex is:
([A-Za-z]\d){4}
[A-Za-z] - for a letter
\d - for a digit
You wrap them in () to group them, indicating the format "a letter followed by a digit".
{4} - indicating that there must be 4 repetitions
You can use a plain shell script (bash); there is no need for a complicated regex.
var=a1a1a1a1
alldigits=${var//[^0-9]/}    # keep only the digits
allletters=${var//[0-9]/}    # keep everything that is not a digit
case "${#alldigits}" in
    4)
        if [ "${#allletters}" -gt 0 ]; then
            echo "ok: 4 digits and letters: $var"
        else
            echo "Invalid: all numbers and exactly 4: $var"
        fi
        ;;
    *) echo "Invalid: $var";;
esac
Thanks for your answers.
Finally I wrote a script and it works perfectly:
./P ab2b2 cd12 z9989 1ab26a9 1ab1c1 1234 24 a2b2c2d2
#!/bin/bash
# Write one argument per line to the file "sorting"
echo "$@" | tr -s " " "\n" > sorting
cat sorting | while read tostr
do
    l=$(echo $tostr | tr -d "\n" | wc -c)                   # total length of the word
    temp=$(echo $tostr | tr -d a-z | tr -d "\n" | wc -c)    # length without lowercase letters
    if [ $temp -eq 4 ]; then
        if [ $l -gt 4 ]; then
            printf "%s " "$tostr"
        fi
    fi
done
echo