Regex trying to match everything before backslash / not working - regex

I am attempting to use regex to match everything before the /, but when I try the following I get no output. I double-checked my regex and it seems okay, so I'm not sure why it isn't working.
[user@user my_dir]$ tar -tf abc_de123_01.02.03.4.tgz | grep -m1 /
abc_de123_01.02.03.4/abcde.ini
[user@user my_dir]$ tar -tf abc_de123_01.02.03.4.tgz | grep -m1 .*\/
[user@user my_dir]$ tar -tf abc_de123_01.02.03.4.tgz | grep -m1 /$
expected output:
abc_de123_01.02.03.4/

There are three problems here.
One problem is that * has a special meaning to your shell; if you run echo grep -m1 .*\/, you'll see that your shell is expanding .*\/ in a way you don't expect.
One problem is that grep prints matching lines by default. If you want it to print just the matching part of a line, you need the -o flag.
One problem that's not actually breaking your command, but that you should nonetheless fix, is that your shell uses \ as a quoting (escape) character, so \/ actually means just /. (The reason this doesn't break anything is that / isn't special to grep anyway, so you didn't actually need the \ for anything.)
So:
grep -m1 -o '.*/'
which finds the first line containing /, and prints everything up through the last / on that line.
Incidentally, / is not a backslash, but simply a slash (or sometimes forward slash). A backslash is \.
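As a minimal sketch of the corrected command, the snippet below replaces the tar listing with a hypothetical list_archive function that prints two entries (the second path is made up for illustration), so it can be run without the .tgz file:

```shell
# Hypothetical stand-in for: tar -tf abc_de123_01.02.03.4.tgz
list_archive() {
    printf '%s\n' \
        'abc_de123_01.02.03.4/abcde.ini' \
        'abc_de123_01.02.03.4/sub/file.txt'
}

# -m1 stops after the first matching line; -o prints only the matched part.
# The pattern is single-quoted so the shell passes .*/ to grep unchanged.
first_dir_prefix() {
    grep -m1 -o '.*/'
}

list_archive | first_dir_prefix   # abc_de123_01.02.03.4/
```

Because .* is greedy, a first line with several slashes would be printed up to its last slash, not its first.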

By default, grep works line by line. If you need only part of the string from every line, you can use cut instead.
tar -tf abc_de123_01.02.03.4.tgz | cut -d'/' -f1
If you need only the first part of the first line, sed comes in handy:
tar -tf abc_de123_01.02.03.4.tgz | sed "1q;d" | cut -d'/' -f1

Related

Delete any special character using Sed

I have yet another list of subdomains. I want to remove any wildcard subdomain that includes one of these special characters:
()!&$#*+?
The junk characters are mostly a prefix, but they can also appear in the middle. Here is some sample data:
(www.imgur.com
***************diet.blogspot.com
*-1.gbc.criteo.com
------------------------------------------------------------i.imgur.com
This has been quite an inconvenience while scanning through the list. As always, I'm trying sed to fix it:
sed -i "/[!()#$&?+]/d" foo.txt ###Didn't work
sed -i "/[\!\(\)\#\$\&\?\+]/d" ###Escaping char didn't work
Running the commands above still leaves the list unchanged and the file in its original state. My next idea is to chain a series of sed commands that remove the characters one by one:
cat foo.txt | sed -e "/!/d" -e "/#/d" -e "/\*/d" -e "/\$/d" -e "/(/d" -e "/)/d" -e "/+/d" -e "/\'/d" -e "/&/d" >> foo2.txt
cat foo.txt | sed -e "/\!/d" | sed -e "/\#/d" | sed -e "/\*/d" | sed -e "/\$/d" | sed -e "/\+/d" | sed -e "/\'/d" | sed -e "/\&/d" >> foo2.txt
If escaping every special character doesn't work, my logic must be wrong. Adding /g did not help either.
As a side note: I don't want - to be deleted, as some valid subdomains contain a - character:
line-apps.com
line-apps-beta.com
line-apps-rc.com
line-apps-dev.com
Any help would be cherished.
Using sed
$ sed '/[[:punct:]]/d' input_file
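This deletes every line that contains a punctuation character. Note that [[:punct:]] also includes . and -, so on a list of domain names it would delete every line; to keep those, list the unwanted characters explicitly instead. It would also help if you provided sample data.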
To do what you're trying to do in your answer (which adds [ and ] and more to the set of characters in your question) would be:
sed '/[][!?+,#$&*() ]/d'
or just:
grep -v '[][!?+,#$&*() ]'
Per POSIX to include ] in a bracket expression it must be the first character otherwise it indicates the end of the bracket expression.
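A small sketch of that bracket expression applied to lines from the question; the sample_list and drop_special function names are only illustrative:

```shell
# Sample lines taken from the question, plus one valid name.
sample_list() {
    printf '%s\n' \
        '(www.imgur.com' \
        '***************diet.blogspot.com' \
        '*-1.gbc.criteo.com' \
        'line-apps.com'
}

# ] comes first inside the bracket expression, so it is taken literally.
# Single quotes keep the shell from touching $, & and *.
drop_special() {
    sed '/[][!?+,#$&*() ]/d'
}

sample_list | drop_special   # line-apps.com
```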
Consider printing lines you want instead of deleting lines you do not want, though, e.g.:
grep -E '^[[:alnum:]_.-]+$' file
to print lines that only contain letters, numbers, underscores, dashes, and/or periods.
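A hedged sketch of the keep-what-you-want approach, using a few lines from the question as input (the function names are hypothetical). The bracket expression needs a quantifier such as + so that it can match a whole line rather than a single character:

```shell
sample_domains() {
    printf '%s\n' \
        '(www.imgur.com' \
        '*-1.gbc.criteo.com' \
        'line-apps.com' \
        'line-apps-beta.com'
}

# Keep only lines made entirely of letters, digits, _, . and -
keep_clean_domains() {
    grep -E '^[[:alnum:]_.-]+$'
}

sample_domains | keep_clean_domains
# line-apps.com
# line-apps-beta.com
```

A line that consists only of hyphens, letters, digits and dots, such as the long dashed i.imgur.com entry in the question, still passes this filter, so it may need a stricter pattern.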

Cryptic sed command syntax confusion

Can someone explain, how this sed command works here?
pkg info | sed -e 's/\([^.]*\).*/\1/' -e 's/\(.*\)-.*/\1/'
This command removes version numbers from packages and prints into stdout like this
yajl-2.1.0 Portable JSON parsing and serialization library in ANSI C
youtube_dl-2018.12.03 Program for downloading videos from YouTube.com
zathura-0.4.1 Customizable lightweight pdf viewer
zathura-pdf-poppler-0.2.9_1 Poppler render PDF plugin for Zathura PDF viewer
zip-3.0_1 Create/update ZIP files compatible with PKZIP
zsh-5.6.2 The Z shell
and turns into this
yajl
youtube_dl
zathura
zathura-pdf-poppler
zip
zsh
But I am having a hard time understanding the parts \([^.]*\).* and \(.*\)-.*. I understand \, -e, and s, but those wildcards seem very cryptic here.
In your regex \([^.]*\).*, \( starts a capturing group. [^.] matches any character except a literal dot and * means zero or more of them, so [^.]* captures everything up to the first dot. \) closes the group, and the trailing .* matches whatever remains after group 1.
The explanation for \(.*\)-.* is similar: \(.*\) greedily captures as much as it can, which is everything up to the last hyphen; then - matches that hyphen and the final .* matches the remaining text.
To explain with an example, let's take youtube_dl-2018.12.03.
Here, \([^.]*\) captures everything up to the first dot, that is youtube_dl-2018, and the remaining .* matches .12.03 and the rest of the line. The whole match is replaced by \1, so youtube_dl-2018 is passed to the next expression, -e 's/\(.*\)-.*/\1/'.
In the second regex, \(.*\)-.*, the group \(.*\) captures youtube_dl, because that is everything before the last hyphen; the hyphen is then matched and .* matches the remaining text, 2018. Since the match is replaced by \1, the final text is youtube_dl.
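The two steps can be traced one at a time. This is only a sketch that feeds one line from the question through each expression separately; first_expr and second_expr are illustrative names:

```shell
line='youtube_dl-2018.12.03 Program for downloading videos from YouTube.com'

# First expression: keep everything before the first dot.
first_expr() {
    sed -e 's/\([^.]*\).*/\1/'
}

# Second expression: keep everything before the last hyphen.
second_expr() {
    sed -e 's/\(.*\)-.*/\1/'
}

printf '%s\n' "$line" | first_expr                 # youtube_dl-2018
printf '%s\n' "$line" | first_expr | second_expr   # youtube_dl
```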
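Looking at your data, I believe you can also simplify your command, as the first regex in your sed command seems redundant. Try the following command and see whether it outputs the same result:
pkg info | sed -e 's/\(.*\)-.*/\1/'
This simplified command only works because none of your descriptions contains a hyphen: the greedy .* backs off only to the last - on the line, so a - in the description would leave the version number in place. Your original two-expression command avoids that by first cutting the line at the first dot, although it would in turn truncate a package name that contains a dot.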
Also, on another note, if you use -r (or -E for OS X; GNU sed accepts -E as well) for extended regex, you don't need to escape the parentheses and you can write your regex as:
pkg info | sed -r -e 's/([^.]*).*/\1/' -e 's/(.*)-.*/\1/'
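To see when the one-expression and two-expression forms differ, here is a small sketch using a hypothetical package line (foo-1.0 is made up) whose description contains a hyphen:

```shell
# Hypothetical package line; the description contains a hyphen.
line='foo-1.0 Command-line tool'

two_expressions() {
    sed -e 's/\([^.]*\).*/\1/' -e 's/\(.*\)-.*/\1/'
}

one_expression() {
    sed -e 's/\(.*\)-.*/\1/'
}

printf '%s\n' "$line" | two_expressions   # foo
printf '%s\n' "$line" | one_expression    # foo-1.0 Command
```

The single expression cuts at the last hyphen on the whole line, which here sits inside the description.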
It is a complicated way of saying:
Remove all substrings starting with a dot or hyphen.
The part before the delimiter is matched and remembered.
Alternatives:
# Incorrect: these cut from the first hyphen, not the last:
# pkg info | sed 's/[-.].*//'
# pkg info | cut -d "-" -f1 | cut -d"." -f1
# pkg info | awk -F "-|[.]" '{print $1}'
# The dot is not needed when you remove the substring starting with the last hyphen
pkg info | sed 's/-[^-]*$//'
pkg info | rev | cut -d"-" -f2- | rev
pkg info | awk -F "[.]" '{print $1}' | awk -F "[-]" -vOFS='-' 'NF>1 { NF--;print;}'
Silly invisible-text GNU grep method that works on the console,
but which would fail if sent to a file or piped to a filter:
pkg info | GREP_COLORS='ms=30;30;30' grep '\-[^-]*\s.*$'
How it works: grep is used to find the last hyphen before a
space, and everything after that, (i.e. everything we don't want
to see), which grep shows in highlighted colors as defined in the
GREP_COLORS environment variable. Since the highlight color
30;30;30 is a black font (on a black background), the unwanted
text is invisible.
If the terminal background is already black, GREP_COLORS='ms=30'
would be sufficient.
sed method based on not printing the grep regex:
pkg info | sed 's#\(^.*\)\(-[^-]*[[:space:]].*$\)#\1#'
...this method can be sent to pipes and filters. Shorter version using GNU sed:
pkg info | sed 's#\(^.*\)\(-.*\s.*\)#\1#'

“sed” command to remove a line that matches an exact string on first word

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word
...but only partially, because that solution only works if I run it pretty much exactly as the answerer wrote it.
They answered:
sed -i "/^maria\b/Id" file.txt
...to delete only lines that start with the word "maria", and not lines where maria is not the first word.
I want to chop out a specific URL in a file, for example "cnn.com", but I also have a bunch of localhost addresses and 0.0.0.0 entries, and some of both have a single space in front. I also don't want to chop out subdomains like ads.cnn.com. So that code "should" work, but it doesn't when I string in more commands with the -e option. My code below seems to clean things up well, except that I can't get it to remove cnn.com! My file is called raw.txt.
sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com
When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".
ads.cnn.com
cl.cnn.com
cnn.com <-- the one I don't want
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
If I just use that one piece of code with the cnn.com chop out it seems to work.
sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
* I'm not using the "-e" option
Result:
ads.cnn.com
cl.cnn.com
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
Nothing I do seems to work when I string commands together with the "-e" option. I need some help on getting my multiple option command kicking with SED.
Any advice?
Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2
The . is a metacharacter in regex that means "match any one character". So you accidentally created a regex that will also match cnnPcom, cnn com, or cnn\com. While it probably works for your needs, it would be better to be more explicit:
sed -r '/^cnn\.com\b/d' raw.txt
The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
As for your lines that start with a space, you can catch those in a single regex (Again escaping the period metacharacter):
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt
Here (^[ ]*|^) matches a line that starts with any number of spaces (^[ ]*) or (|) simply the start of the line (^), followed by your match for 127.0.0.1. Since [ ]* also matches zero spaces, ^[ ]* alone would be enough.
To string these together, you can use the | OR operator inside parentheses to catch all of your matches:
sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt
Alternatively you can use a ; semicolon to separate out the different regexes:
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt
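One plausible reading of why the chained command in the question keeps cnn.com, assuming raw.txt holds hosts-file lines such as "0.0.0.0 cnn.com" (the question does not show the raw file): after the address is stripped, the line still starts with a space, so ^cnn no longer matches. The sketch below strips the address together with the surrounding spaces before deleting; the sample lines and function names are hypothetical, and \b is a GNU sed extension:

```shell
# Hypothetical raw.txt content; the real file is not shown in the question.
raw_sample() {
    printf '%s\n' \
        '0.0.0.0 cnn.com' \
        ' 0.0.0.0 cnn.com' \
        '0.0.0.0 ads.cnn.com' \
        '127.0.0.1 xcnn.com'
}

# 1st expression: drop the leading address and the spaces around it.
# 2nd expression: delete lines whose remaining text starts with cnn.com.
clean_hosts() {
    sed -r -e 's/^[ ]*(127\.0\.0\.1|0\.0\.0\.0)[ ]*//' -e '/^cnn\.com\b/d'
}

raw_sample | clean_hosts
# ads.cnn.com
# xcnn.com
```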
sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to try to get sed to act as if it does, see Is it possible to escape regex metacharacters reliably with sed. To remove a line whose first space-separated word is "foo" is just:
awk '$1 != "foo"' file
To remove lines that start with any of "foo" or "bar" is just:
awk '($1 != "foo") && ($1 != "bar")' file
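If you have more than just a couple of words then the approach is to list them all, create a hash table indexed by them, and test whether the first word of each line is an index of the hash table. Note that split() stores the words as array values, so they are copied into the indices of a second array first:
awk 'BEGIN{n=split("foo bar other word",tmp); for (i=1;i<=n;i++) badWords[tmp[i]]} !($1 in badWords)' file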
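A runnable sketch of the lookup-table idea on made-up input. Since split() fills an array with the words as values (indexed 1, 2, ...), the words are copied into the keys of a second array before the in test; the names words, bad and remove_first_word_matches are illustrative:

```shell
remove_first_word_matches() {
    awk 'BEGIN {
             n = split("cnn.com 0.0.0.0", words)        # words[1]="cnn.com", ...
             for (i = 1; i <= n; i++) bad[words[i]]     # make each word a key
         }
         !($1 in bad)'                                  # print lines whose first word is not a key
}

printf '%s\n' 'cnn.com' ' cnn.com' 'ads.cnn.com' 'xcnn.com' |
    remove_first_word_matches
# ads.cnn.com
# xcnn.com
```

With the default field separator awk ignores leading blanks, so the line with a space in front of cnn.com is removed as well.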
If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.

Matching arbitrary number of digits using grep regex

I've got a file that has lines in it that look similar to the following:
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
What I am looking to do is use regex to match any line that starts with data and ends with later AND has numbers in between. Here is what I've concocted so far:
^[D,d]ata[0-9]*later$
However the output includes all datalater lines. I suppose I could pipe the output and grep -v datalater, but I feel like a single expression should do the trick.
Use + instead of *.
+ matches one or more of the preceding item.
* matches zero or more.
^[Dd]ata[0-9]+later$
In grep's default (basic) syntax you need to escape the +, which GNU grep supports as \+. Note that \d is a Perl-style shorthand that grep only understands with -P, so use [0-9] (or [[:digit:]]) to match a digit.
^[Dd]ata[0-9]\+later$
In your example file you also have a line:
datafhj893724897290384later
This currently will not be matched because there are letters between data and the numbers. We can fix this by adding [^0-9]* to match anything after data until the digits.
Our final command will be:
grep '^[Dd]ata[^0-9]*[0-9]\+later$' filename
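A portable sketch of both patterns on the sample data from the question. It uses grep -E with [0-9] rather than \d, which plain grep does not define; the function names are illustrative:

```shell
sample_lines() {
    printf '%s\n' data datalater 983290842 Data387428later \
        datafhj893724897290384later 4329804928later
}

# Only digits allowed between "data" and "later".
strict_match() {
    grep -E '^[Dd]ata[0-9]+later$'
}

# Non-digits may come first, but at least one digit must precede "later".
loose_match() {
    grep -E '^[Dd]ata[^0-9]*[0-9]+later$'
}

sample_lines | strict_match
# Data387428later
sample_lines | loose_match
# Data387428later
# datafhj893724897290384later
```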
Using Cygwin, the commands from the other answers that rely on \d didn't work as written. I had to modify them to get the desired results.
$ cat > file.txt <<EOL
> data
> datalater
> 983290842
> Data387428later
> datafhj893724897290384later
> 4329804928later
> EOL
I always like to make sure my file has what I expect it to have:
$ cat file.txt
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
$
I needed to run Perl-style expressions with the -P flag, because \d is Perl syntax. With -P the quantifier has to be written as a plain +, since \+ means a literal plus sign there. In place of the [^0-9]* whose necessity @Tom_Cammann aptly pointed out, I used .*, which matches any sequence of characters and then backtracks far enough for \d+ to match the digits. Here are my command and output.
$ grep -P '^[Dd]ata.*\d+later$' file.txt
Data387428later
datafhj893724897290384later
$
The reason Perl expressions are needed is that \d is a Perl-style shorthand: grep's basic and extended syntaxes do not define it, so without -P you have to write [0-9] or [[:digit:]] instead. This is not specific to Cygwin's grep.
System Info
$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
My Results from the previous answers
$ grep '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep '^[Dd]ata\d+later$' file2.txt
$ grep -P '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep -P '^[Dd]ata\d+later$' file2.txt
Data387428later
$
You're matching zero or more digits with the * quantifier. Try
^[Dd]ata\d+later$
instead. You were also matching commas at the beginning of the string (e.g. ",ata1234later"), because [D,d] is a character class that contains a comma. And \d is a shorthand for any digit character in Perl-style regex (grep -P). So I changed those as well.
You should put a "+" (which means one or several) instead of "*" (which means zero, one or several).
The "+" syntax only works for extended-regexp, not standard grep.
At least, that's my experience on RHEL.
To use extended-regexp, run egrep or pass "-E" / "--extended-regexp"
Examples...
Standard grep
echo abc123n1 | grep "abc[0-9]+n1"
<no output>
egrep
echo abc123n1 | egrep "abc[0-9]+n1"
abc123n1
grep with -E
echo abc123n1 | grep -E "abc[0-9]+n1"
abc123n1
HTH
grep -Eio "^(data)[0-9]+(later)$"
^[dD]ata=^d later$=r$
🎯 MOTIVATION
The other answers don't work on all systems.
🗒️ REQUISITES
grep
The option: --extended-regexp
Character groups, aka: [:group:]
Matching one or more of the preceding, aka: +
Optionally setting as starting or ending: ^whatever$
📟 COMMAND
grep --extended-regexp "[[:group:]]+"
🗂️ GROUPS
alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit
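Applied to the question, the command template might look like the sketch below, with digit chosen from the group list and the ^...$ anchors added; the sample_input and match_digits names are illustrative:

```shell
sample_input() {
    printf '%s\n' data datalater 983290842 Data387428later 4329804928later
}

# [[:digit:]]+ requires at least one digit between "data" and "later".
match_digits() {
    grep --extended-regexp '^[Dd]ata[[:digit:]]+later$'
}

sample_input | match_digits   # Data387428later
```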

sed for removing trailing zeroes - regex - nongreedy

I have a file which has a few lines as below
ABCD|100.19000|90.100|1000.000010|SOMETHING
BCD|10.100|90.1|100.019900|SOMETHING
Now, after applying sed on this, I would like the output to be as below (To use it for further processing)
ABCD|100.19|90.1|1000.00001|SOMETHING
BCD|10.1|90.1|100.0199|SOMETHING
i.e. I would like all the trailing zeros (the ones before the |) to be removed from the result.
I tried the following: (regtest is the file containing the original data as shown above)
cat regtest | sed 's/|\([0-9]*\)\.\([0-9]*\)0*|/|\1\.\2|/g'
This did not work, I think because \([0-9]*\) is greedy and swallows the trailing zeros before 0* gets a chance to match them.
cat regtest | sed 's/|\([0-9]*\)\.\([0-9]*\)0|/|\1\.\2|/g'
This works, but I would have to apply the sed command repeatedly to the same file to remove the zeros one at a time, which does not make sense.
How can I go about it? Thanks!
$ echo "ABCD100|100.19000|90.100|1000.000010|STH" | \
sed -r -e 's/\|/||/g' -e 's/(\|[0-9.]+[1-9])0+\|/\1|/g' -e 's/\|\|/|/g'
ABCD100|100.19|90.1|1000.00001|STH
If you want to depend on the | following the zeroes to be removed
cat regtest | sed -r 's/(00*)(\|)/\2/g'
If you want to remove zeroes not trailed by a . or a digit
cat regtest | sed -r 's/(00*)([^.0-9])/\2/g'
(Note I'm using the 00* instead of 0+ to avoid unique features of GNU sed not available in other versions)
Edit: answer to a comment requesting removal of trailing zeroes only between a decimal point and a pipe:
cat regtest | sed -r 's/(\.[0-9]*[1-9])00*(\|)/\1\2/g'
Using Perl's extended regular expressions
perl -pe 's{\.\d*?\K0*(\||$)}{$1}g'
This removes zeroes that occur between (a dot and optionally some digits) and (a pipe or the end of the line).
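For comparison, a sed-only sketch of the same idea run on the two lines from the question. It assumes a sed that accepts -E (GNU and BSD sed do; -r is the older GNU spelling) and only handles fractions that contain at least one non-zero digit; strip_trailing_zeros is an illustrative name:

```shell
# Capture the dot and the digits up to the last non-zero digit,
# then drop the zeros that sit between that digit and the next pipe.
strip_trailing_zeros() {
    sed -E 's/(\.[0-9]*[1-9])0+\|/\1|/g'
}

printf '%s\n' \
    'ABCD|100.19000|90.100|1000.000010|SOMETHING' \
    'BCD|10.100|90.1|100.019900|SOMETHING' |
    strip_trailing_zeros
# ABCD|100.19|90.1|1000.00001|SOMETHING
# BCD|10.1|90.1|100.0199|SOMETHING
```

Integer fields such as 100 are left alone because the match must start at a decimal point; a field like 10.000 is also left unchanged by this pattern.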