Find a string after a certain character - regex

An example will explain it better:
structure_1/structure_2/<I NEED WHAT'S HERE>/structure_3
Structure_1 is always the same value
Structure_2 is a string that can be of any size, sometimes with _ or -
What I need comes after the second forward slash
I don't care what comes after
Other example:
order/shirt/blue_stripes/America
order/pants_ripped/green/Europe
order/skirts/yellow-folded/Asia
order/socks/orange/Africa
Results that I want to get after applying the regex:
blue_stripes
green
yellow-folded
orange
I'm writing a BASH script for my Unix machine
UPDATE
I first used a regex in order to do this but I was informed by Flying that it would be better to use the command 'awk' and this did the trick with ease!

This one will do the trick: ^(?:[^\/]+\/){2}([^\/]+). You basically need to skip the first two slash-terminated groups of characters; the part you want ends up in capture group 1. You can check it yourself here.
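The (?:...) non-capturing group is Perl-style syntax, so the pattern above needs a PCRE-capable tool. As a rough sketch, assuming a sed that accepts -E (GNU and BSD sed do), the same idea can be written with an ordinary group, which moves the wanted value to group 2. The sample paths are taken from the question; the variable names are only illustrative.

```shell
# Sample paths from the question (illustrative data).
paths='order/shirt/blue_stripes/America
order/skirts/yellow-folded/Asia
order/socks/orange/Africa'

# ERE has no (?:...) group, so the repeated "segment/" group becomes \1
# and the segment we want becomes \2. The trailing .* drops the rest.
third=$(printf '%s\n' "$paths" | sed -E 's#^([^/]+/){2}([^/]+).*#\2#')
printf '%s\n' "$third"
```

Using # as the s command delimiter avoids having to escape every slash in the pattern.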
UPDATE: Since, as clarified in the comments, the actual task is not about finding the correct regular expression but about extracting information from a file on Unix, it is much better to use awk instead:
awk -F"/" '{print $3}' orders.txt
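If the input is not in a file yet, the same command can read from a pipe. A minimal sketch using two of the sample lines from the question (the variable names are made up for illustration), with cut shown as an equivalent for this simple single-character delimiter:

```shell
# Two of the sample lines from the question.
orders='order/shirt/blue_stripes/America
order/socks/orange/Africa'

# awk: split on "/" and print the third field of every line.
with_awk=$(printf '%s\n' "$orders" | awk -F/ '{print $3}')

# cut: same result when the delimiter is a single character.
with_cut=$(printf '%s\n' "$orders" | cut -d/ -f3)

printf '%s\n' "$with_awk"
```

Both variables end up holding the third field of each line, one per line.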

Related

How to conditionally remove characters and preserve a text in between?

How could sed or another POSIX command be used to remove the braces, but only when we encounter "codeBlock":{"_id":{"varying24characters"}? There may be multiple matches of this pattern in the line, and I want to avoid removing the braces from something that looks similar, like the smoreBlock.
Input (a single line)
test,"codeBlock":{"_id":{"4c9d4e1fe2c101000138eb4b"},morestuff,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},hey,stuff,test,"codeBlock":{"_id":{"7c9d4e1fe7c101111138eb4b"},otherstuff
Desired output
test,"codeBlock":{"_id":"4c9d4e1fe2c101000138eb4b",morestuff,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},hey,stuff,test,"codeBlock":{"_id":"7c9d4e1fe7c101111138eb4b",otherstuff
I've been banging my head reading about sed backreferences and can't even get close to what I'm looking for. Unfortunately this is not homework. I could write a small program to brute force through it but I know there has got to be a way for sed, awk, or perl to handle this. Planning to run this on a RHEL7 or CENTOS7 host.
Think of it the other way around: match the needed and the unneeded parts together, but keep only the needed parts in capturing groups. Then you can replace the whole match with just those groups.
sed 's/\("codeBlock":{"_id":\){\("[0-9a-f]\{24\}"\)}/\1\2/g' file
Or, if you have GNU sed:
sed -E 's/("codeBlock":\{"_id":)\{("[0-9a-f]{24}")\}/\1\2/g' file
both yield:
test,"codeBlock":{"_id":"4c9d4e1fe2c101000138eb4b",morestuff,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},hey,stuff,test,"codeBlock":{"_id":"7c9d4e1fe7c101111138eb4b",otherstuff
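To see the capture-group bookkeeping in isolation, here is a small sketch on a shortened line (the line is a cut-down version of the question's input, not real data). \1 holds the "codeBlock":{"_id": prefix, \2 holds the quoted 24-character id, and the braces around the id are matched but never captured, so they disappear from the replacement.

```shell
# Shortened version of the question's input: one codeBlock, one smoreBlock.
line='a,"codeBlock":{"_id":{"4c9d4e1fe2c101000138eb4b"},b,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},c'

# BRE form from the answer: \( \) capture, \{24\} is the repeat count,
# and a bare { or } is a literal brace.
fixed=$(printf '%s\n' "$line" |
  sed 's/\("codeBlock":{"_id":\){\("[0-9a-f]\{24\}"\)}/\1\2/g')
printf '%s\n' "$fixed"
```

Only the codeBlock id loses its braces; the smoreBlock entry does not contain the literal text "codeBlock" and is left alone.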

Is there an alternative to negative look ahead in sed

In sed I would like to be able to match /js/ but not /js/m. I cannot use /js/[^m] because that would match /js/ plus whatever character comes after it. Negative lookahead does not work in sed, or I would have used /js/(?!m) and called it a day. Is there a way to achieve this with sed that would work for most similar situations, where you want a section of text that is not followed by another section of text?
Is there a better tool for what I am trying to do than sed? Possibly one that allows look ahead. awk seems a bit too much with its own language.
Well you could just do this:
$ echo 'I would like to be able to match /js/ but not /js/m' |
sed 's:#:#A:g; s:/js/m:#B:g; s:/js/:<&>:g; s:#B:/js/m:g; s:#A:#:g'
I would like to be able to match </js/> but not /js/m
You didn't say what you wanted to do with /js/ when you found it so I just put <> around it. That will work on all UNIX systems, unlike a perl solution since perl isn't guaranteed to be available and you're not guaranteed to be allowed to install it.
The approach I use above is a common idiom in sed, awk, etc. to create strings that can't be present in the input. It doesn't matter what character you use for # as long as it's not present in the string or regexp you're really interested in, which in the above is /js/. s/#/#A/g ensures that every occurrence of # in the input is followed by A. So now when I do s/foobar/#B/g I have replaced every occurrence of foobar with #B and I KNOW that every #B represents foobar because all other #s are followed by A. So now I can do s/foo/whatever/ without tripping over foo appearing within foobar. Then I just unwind the initial substitutions with s/#B/foobar/g; s/#A/#/g.
In this case though since you aren't using multi-line hold-spaces you can do it more simply with:
sed 's:/js/m:\n:g; s:/js/:<&>:g; s:\n:/js/m:g'
since there can't be newlines in a newline-separated string. The above will only work in seds that support use of \n to represent a newline (e.g. GNU sed) but for portability to all seds it should be:
sed 's:/js/m:\
:g; s:/js/:<&>:g; s:\
:/js/m:g'
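The #A/#B placeholder idiom described above matters most when the input already contains the placeholder character. A small sketch, assuming an input line invented for illustration that contains a literal #:

```shell
# Illustrative input that contains both targets and a literal "#".
input='see /js/ and /js/m and a literal # sign'

# 1. protect existing "#" as "#A"   2. hide /js/m as "#B"
# 3. mark the remaining /js/        4. restore /js/m   5. restore "#"
result=$(printf '%s\n' "$input" |
  sed 's:#:#A:g; s:/js/m:#B:g; s:/js/:<&>:g; s:#B:/js/m:g; s:#A:#:g')
printf '%s\n' "$result"
```

Because every original # is turned into #A first, a #B in the intermediate text can only have come from /js/m, so the literal # survives the round trip unchanged.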

Using sed to fix format of date string

The question specifically involves modifying a string of form
abc_MM-DD-YY_XX.jpg
(where XX can be comprised of two or three digits) to
xyz_YYYY-MM-DD_XXX.jpg
I was able to do this using:
sed 's/\(.*_\)\(.\{5\}\)-\([0-9][0-9]\)_\([0-9][0-9]\.\)/xyz_20\3-\2_0\4/'
I was wondering, though, if there are any better, perhaps more concise alternatives. Also, is using TRE (tagged regular expression) the only way sed can accomplish such a task? Thanks!
EDIT: Sorry, to clarify, the original string can either be in the format "abc_MM-DD-YY_XX.jpg" or "abc_MM-DD-YY_XXX.jpg", but the output format must be "xyz_YYYY-MM-DD_XXX.jpg". So in the first case I would want to pad "XX" with a 0 and in the second case I would want to leave it as is. I also realized that my expression doesn't work for the second case...
Note that this only works for dates in the current century (20YY)!
Using awk
I would use awk for that. It is simpler to use:
awk -F'[-_.]' '$0="xyz_20"$4"-"$2"-"$3"_"sprintf("%03d",$5)"."$6' <<<'abc_03-24-15_11.jpg'
will give you:
xyz_2015-03-24_011.jpg
while:
awk -F'[-_.]' '$0="xyz_20"$4"-"$2"-"$3"_"sprintf("%03d",$5)"."$6' <<<'abc_03-24-15_111.jpg'
will give you:
xyz_2015-03-24_111.jpg
which should be what you want.
Explanation:
I'm using -, _ or . as the field delimiter and simply reorganize the fields. Splitting on . as well keeps the extension out of the number field (otherwise %03d would swallow ".jpg"), so it can be appended again as $6. To pad an XX value to XXX I'm using sprintf(). (Thanks Amadan)
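The field shuffling can also be spelled out with printf, which some readers find easier to follow than assigning to $0. This is only a sketch: the function name rename_awk is made up, and it splits on . in addition to - and _ so that the counter and the extension land in separate fields.

```shell
rename_awk() {
  # abc_03-24-15_11.jpg -> $1=abc $2=03 $3=24 $4=15 $5=11 $6=jpg
  # %03d zero-pads the counter to three digits; "20" is hard-coded,
  # so this assumes every year is 20YY.
  printf '%s\n' "$1" |
    awk -F'[-_.]' '{ printf "xyz_20%s-%s-%s_%03d.%s\n", $4, $2, $3, $5, $6 }'
}

short=$(rename_awk 'abc_03-24-15_11.jpg')
long=$(rename_awk 'abc_03-24-15_111.jpg')
printf '%s\n%s\n' "$short" "$long"
```

Feeding the name through a pipe instead of a <<< here-string keeps the sketch usable in shells other than bash.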
Using sed
By the way, you can simplify the sed command a lot if you use the -r option (extended regular expressions in GNU sed; -E is the more portable spelling) and match each field as a run of characters that are not its delimiter:
sed -r 's/([^_]+)_([^-]+)-([^-]+)-([^_]+)_([^.]+)/xyz_20\4-\2-\3_0\5/;' <<<'abc_03-24-15_12.jpg'
(This is not quite right yet: it always prepends a 0, so a three-digit XXX becomes 0XXX.)
To solve that you can simply append another s command:
s/0([0-9]{3})\./\1./
which will replace the sequence 0123 by 123. The final command looks like this:
sed -r 's/([^_]+)_([^-]+)-([^-]+)-([^_]+)_([^.]+)/xyz_20\4-\2-\3_0\5/;s/0([0-9]{3})\./\1./' <<<'abc_03-24-15_12.jpg'
Doesn't it look simpler using -r ;) (hihi)
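To check both cases at once, the final command can be wrapped in a small function. This is a sketch only: the name fix_name is hypothetical, and -E is used instead of -r on the assumption that the installed sed accepts it (GNU and BSD sed do).

```shell
fix_name() {
  # First s: reorder the date and always prepend 0 to the counter.
  # Second s: if that produced 0 followed by three digits, drop the 0 again.
  printf '%s\n' "$1" |
    sed -E 's/([^_]+)_([^-]+)-([^-]+)-([^_]+)_([^.]+)/xyz_20\4-\2-\3_0\5/; s/0([0-9]{3})\./\1./'
}

two=$(fix_name 'abc_03-24-15_12.jpg')
three=$(fix_name 'abc_03-24-15_111.jpg')
printf '%s\n%s\n' "$two" "$three"
```

The second substitution only fires when a 0 is followed by exactly three digits and the dot, so the two-digit case keeps its padding.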

Regex that matches contents of () with nested () in it

I have a messy text file (due to the nature of the contents I cannot paste it).
In the file I want to match everything inside the outermost (unnested) parentheses.
Here is sample that includes the problem:
a(b()c((d)e)f()g)h(i)
The output that I need is:
(b()c((d)e)f()g)(i)
(basically everything in the outermost parentheses, dropping 'a' and 'h')
Again I cannot post the actual contents but above example illustrates the problem I have in original file.
I am working on this from bash, I am familiar with sed, grep, but not awk unfortunately.
Thanks
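Since a regex finds the longest possible match, you can just use the following, provided each line contains only one outermost group (otherwise the match also swallows whatever sits between the first and the last group):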
\(.*\)
If you care about nesting and want to find the outermost, e.g. for ((a)) and (b)))) you want to find ((a)) and (b), then that's a typical example of a grammar that you technically can't match with regular expressions.
However, since you tagged your post PCRE:
grep -P -o '(?xs)(?(DEFINE) (?<c>([^()]|(?&p))) (?<p>\((?&c)*\)))((?&p))'
Ha, I know there is already a good answer, but I'm posting the one I came up with to enrich the range of ideas. (demo).
(?x)
(?(DEFINE)(?<nest>(\((?:[^()]*(?1)?[^()]*)\))))
\((?:[^()]*(?&nest)?[^()]*)*\)
Of course it needs to be flattened onto the grep line.
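Where grep -P is not available, the nesting can be tracked with a plain depth counter instead of a regex. This is a different technique from the PCRE answers above, shown only as a sketch; the function name outer_groups is made up, and it prints the outermost groups of each input line joined on one line, as in the question's desired output.

```shell
outer_groups() {
  awk '{
    depth = 0; out = ""
    for (i = 1; i <= length($0); i++) {
      ch = substr($0, i, 1)
      if (ch == "(") depth++                # entering a group
      if (depth > 0) out = out ch           # keep everything inside a group
      if (ch == ")" && depth > 0) depth--   # leaving a group; stray ")" is ignored
    }
    print out
  }'
}

result=$(printf '%s\n' 'a(b()c((d)e)f()g)h(i)' | outer_groups)
printf '%s\n' "$result"
```

Characters are kept only while the depth is above zero, which is why 'a' and 'h' are dropped and the nested parentheses are preserved.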

Unix grep regex containing 'x' but not containing 'y'

I need a single-pass regex for Unix grep that matches lines containing, say, alpha but not beta, i.e. the equivalent of:
grep 'alpha' <> | grep -v 'beta'
The other answers here show some ways you can contort different varieties of regex to do this, although I think it does turn out that the answer is, in general, “don’t do that”. Such regular expressions are much harder to read and probably slower to execute than just combining two regular expressions using the boolean logic of whatever language you are using. If you’re using the grep command at a unix shell prompt, just pipe the results of one to the other:
grep "alpha" | grep -v "beta"
I use this kind of construct all the time to winnow down excessive results from grep. If you have an idea of which result set will be smaller, put that one first in the pipeline to get the best performance, as the second command only has to process the output from the first, and not the entire input.
Well as we're all posting answers, here it is in awk ;-)
awk '/x/ && !/y/' infile
I hope this helps.
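A quick way to convince yourself that the two-grep pipeline and the awk one-liner select the same lines, using four made-up sample lines:

```shell
# Made-up sample input covering all four combinations.
lines='alpha only
alpha and beta
beta only
neither'

# Two passes: keep lines with alpha, then drop those with beta.
piped=$(printf '%s\n' "$lines" | grep 'alpha' | grep -v 'beta')

# One pass: awk prints a line when both conditions hold.
single=$(printf '%s\n' "$lines" | awk '/alpha/ && !/beta/')

printf '%s\n' "$piped"
```

Both variables contain just the line "alpha only".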
^((?!beta).)*alpha((?!beta).)*$ would do the trick, I think. Note that the (?!...) lookahead needs a PCRE-capable grep, e.g. grep -P.
I'm pretty sure this isn't possible with true regular expressions. The [^y]*x[^y]* example would match yxy, since the * allows zero or more non-y matches.
EDIT:
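Actually, this seems to work: ^[^y]*x[^y]*$. It basically means "match any line that starts with zero or more non-y characters, then has an x, then ends with zero or more non-y characters". Note that it only works when x and y are single characters, because a bracket expression such as [^y] excludes characters, not words.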
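The difference the anchors make can be checked with grep -c, which prints the number of matching lines. A minimal sketch with the single-character x and y from the answers; the test strings yxy and axb are invented for illustration:

```shell
# grep -c exits non-zero when the count is 0; "|| true" keeps that
# from aborting a script that runs under "set -e".
unanchored=$(printf 'yxy\n' | grep -c '[^y]*x[^y]*' || true)
anchored=$(printf 'yxy\n' | grep -c '^[^y]*x[^y]*$' || true)
clean=$(printf 'axb\n' | grep -c '^[^y]*x[^y]*$' || true)

printf '%s %s %s\n' "$unanchored" "$anchored" "$clean"
```

Without anchors the pattern is satisfied by the bare x in the middle of yxy; with ^ and $ the whole line has to be free of y.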
Try using the excludes operator: [^y]*x[^y]*
Q: How to match x but not y in grep without pipe if y is a directory
A: grep x --exclude-dir='y'
Simplest solution:
grep "alpha" * | grep -v "beta"
Mind the spaces and the double quotes.