sed or awk regex, stop matching after semi-colon - regex

I have multiple strings in a file that look like this:
TXT 20131101 094502,20131101 094502,Fri Nov 1 09:45:02 UTC 2013;
I want a regex that will get everything after TXT and display it only up to the ; using sed or awk.
I have tried many ways, but I can't seem to get it to stop at the ;
Thanks for any help

I want a regex that will get everything after TXT and only display
that up until the ;
grep -oP 'TXT[^;]*' filename
Using awk:
awk -F';' '{print $1}' filename
Using sed:
sed 's/\([^;]*\).*/\1/' filename

sed 's/TXT\([^;]*\);.*/\1/' filename
This prints what lies between TXT (so including the first space, if any) and the first ; (not included).
The reply from devnull includes the "TXT" in the output.
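To compare the two behaviours side by side, here is a minimal sketch that runs both sed variants on the sample line from the question. The variable names and the use of printf in place of a file are assumptions made for illustration, and the second pattern is anchored and uses [[:blank:]]* to drop the space after TXT, which the answers above do not do.

```shell
# Sample line from the question, held in a variable instead of a file (illustrative assumption).
line='TXT 20131101 094502,20131101 094502,Fri Nov 1 09:45:02 UTC 2013;'

# Keeps the leading "TXT": everything up to, but not including, the first ';'.
with_txt=$(printf '%s\n' "$line" | sed 's/\([^;]*\).*/\1/')

# Drops "TXT" and any blanks after it, then stops at the first ';'.
without_txt=$(printf '%s\n' "$line" | sed 's/^TXT[[:blank:]]*\([^;]*\);.*/\1/')

printf '%s\n' "$with_txt" "$without_txt"
```

The key piece in both is [^;]*, which cannot cross a semicolon, so the match stops at the first ; no matter what follows it.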

Related

Extract the string matched in a regex, not the line, with awk

This should not be too difficult but I could not find a solution.
I have a HTML file, and I want to extract all URLs with a specific pattern.
The pattern is /users/<USERNAME>/ - I actually only need the USERNAME.
I got only to this:
awk '/users\/.*\//{print $0}' file
But this gives me the complete line, and I don't want the line.
Even just the whole URL would be fine (e.g. /users/USERNAME/), but I really only need the USERNAME.
If you want to do this in a single awk command, use the match function:
awk -v s="/users/" 'match($0, s "[^/[:blank:]]+") {
print substr($0, RSTART+length(s), RLENGTH-length(s))
}' file
Or else this grep + cut will do the job:
grep -Eo '/users/[^/[:blank:]]+' file | cut -d/ -f3
Set the delimiter to / and do a literal match on the second field, then print the third. This assumes the line starts with /users/.
$ awk -F/ '$2=="users"{print $3}' file
Assuming your statement gives you the entire line of something like
/users/USERNAME/garbage/otherStuff/
You could pipe this result through head assuming you always know that it will be
/users/USERNAME/....
After piping through head, you can also use cut commands to remove more of the end text until you have only the piece you want.
The command will look something like this
awk '/users\/.*\//{print $0}' file | head (options) | cut (options)
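As a quick check of the grep + cut approach, here is a small sketch run on a made-up HTML fragment. The fragment, the usernames alice and bob, and the variable names are all hypothetical; only the grep pattern comes from the answer above.

```shell
# Hypothetical HTML fragment with two matching links on one line and one line without any.
html='<a href="/users/alice/">Alice</a> <a href="/users/bob/profile">Bob</a>
<p>no link here</p>'

# grep -o prints each match on its own line, e.g. "/users/alice";
# splitting on "/" then makes the username the third field (the first field is empty).
usernames=$(printf '%s\n' "$html" | grep -Eo '/users/[^/[:blank:]]+' | cut -d/ -f3)

printf '%s\n' "$usernames"
```

Because grep -o emits every match separately, this also copes with several URLs on the same line, which a plain field split on the whole line would not.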

sed replace AFTER match and retain

I've been racking my brains for hours on this, but it seems simple enough. I have a large list of strings similar to the ones below and would like to replace the hyphens only after the comma, to commas:
abc-d-ef,1-2-3-4
gh-ij,1-2-3-4
to this
abc-def,1,2,3,4
gh-ij,1,2,3,4
I can't use s/-/,/2g to replace from the second occurrence because the data differs. I also thought about using cut, but there must be a way to use sed with something like:
"s/\(,\).*-/\1,&/g"
Thank you
This is more suitable for awk as we can break all lines using comma as field separator:
awk 'BEGIN{FS=OFS=","} {gsub(/-/, OFS, $2)} 1' file
abc-d-ef,1,2,3,4
gh-ij,1,2,3,4
If you want a sed-only solution, use:
sed -E -e ':a' -e 's/([^,]+,[^-]+)-/\1,/g;ta' file
abc-d-ef,1,2,3,4
gh-ij,1,2,3,4
An awk proposal that also reproduces the abc-def of the expected output by rewriting d-ef explicitly:
awk -F, '{sub(/d-ef/,"def"); gsub(/-/,",",$2)}1' OFS=, file
abc-def,1,2,3,4
gh-ij,1,2,3,4
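The awk answers above rewrite only the second field. As a minimal sketch, assuming some lines might contain more than one comma (an assumption; the question shows exactly one per line), looping over every field after the first gives the same result on the sample data. The third input line and the variable names are made up for illustration.

```shell
# First two lines are from the question; the third is a hypothetical line with two commas.
input='abc-d-ef,1-2-3-4
gh-ij,1-2-3-4
k-l,5-6,7-8'

# Leave field 1 untouched and turn every hyphen in the remaining fields into a comma.
result=$(printf '%s\n' "$input" | awk '
  BEGIN { FS = OFS = "," }
  { for (i = 2; i <= NF; i++) gsub(/-/, OFS, $i) }
  1
')

printf '%s\n' "$result"
```

Lines without any comma have a single field, so the loop never runs and they pass through unchanged.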

Get specific Text between Specific Tags

At the top of my HTML files, I have...
<H2>City</H2>
<P>Liverpool</P>
or
<H2>City</H2>
<P>Dublin</P>
I want to output the text between the tags straight after <H2>City</H2> instances. So in the examples above which are separate files, I want to print out Liverpool and in the second example, Dublin.
Looking at this thread, I try:
sed -e 's/City\(.*\)\/P/\1/'
which I hoped would get me halfway there, but that just prints out the entire file. Any ideas?
awk to the rescue! You need multi-char RS support though (gawk has it)
$ awk -F'[<>]' -v RS='<H2>City</H2>' 'NF{print $3}' file
Another approach:
$ awk 'c&&c--{gsub(/<[^>]*>/,""); print} /<H2>City<\/H2>/{c=1}' file
This finds the line after City and strips the tags from it (gsub rather than sub, so that the closing </P> is removed as well).
Try using the following regex :
(?s)(?<=City<\/H2>\n<P>).*?(?=<\/P>)
see regex demo / explanation
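sed does not support PCRE constructs such as (?s), lazy quantifiers, or lookarounds, so this pattern needs a PCRE-capable tool, for example perl reading the whole file at once:
perl -0777 -ne 'print "$&\n" while /(?<=City<\/H2>\n<P>).*?(?=<\/P>)/sg' file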
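I checked, and \s does not seem to work here; use the newline character \n instead. Because sed reads one line at a time, \n can only match after N has appended the next line to the pattern space:
sed -n '/<H2>City<\/H2>/{N;s/.*\n<P>\(.*\)<\/P>.*/\1/p;}' file
There is no need for a lookbehind like the one above; that is overkill.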
With sed, you can use the n command to read the next line after your pattern, then remove the tags to output the content. The braces limit the substitution to the line that follows the heading:
sed -n '/<H2>City<\/H2>/{n;s/ *<\/*P> *//gp;}' file
I think this should work on your Mac:
echo -e "<H2>City</H2>\n<P>Dublin</P>" |awk -F"[<>]" '/City/{getline;print $3}'
Dublin
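For a version that does not depend on gawk's multi-character RS, here is a minimal sketch using a flag in plain awk. The sample HTML (including the extra Report and Country lines) and the names grab and city are assumptions for illustration; it also assumes the <P> line directly follows the heading, as in the question.

```shell
# Hypothetical file content: the City block from the question plus unrelated lines around it.
html='<H1>Report</H1>
<H2>City</H2>
<P>Liverpool</P>
<H2>Country</H2>
<P>England</P>'

city=$(printf '%s\n' "$html" | awk '
  grab { gsub(/<[^>]*>/, ""); print; grab = 0 }   # line after the heading: strip all tags, print
  /<H2>City<\/H2>/ { grab = 1 }                   # heading seen: the next line is the one wanted
')

printf '%s\n' "$city"
```

The rule order matters: the flag is tested before the heading is matched, so the heading line itself is never printed.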

Deleting lines matching a pattern from a Unix file

I have a file containing strings of the following format:
05|KEEP|REDEFINES|NO_TYPE|PIC|9.
05|DELETE|REDEFINES|VARIABLE.
05|KEEP2|REDEFINES|VARIABLE2
|PIC|9(5).
I want to be able to use something like sed or awk to delete lines containing the word REDEFINES, but NOT if the word PIC is also in there, or if there is no full stop at the end of the line (which means the string has been split over 2 lines). So out of the 4 lines (3 strings) above, I would only want to delete 05|DELETE|REDEFINES|VARIABLE.
I thought you might be able to use some kind of negation or lookahead, but these don't seem to be available or I can't get them to work.
Using awk, this deletes anything containing REDEFINES in the string, following the pattern in the example above:
awk '!/[[:print:]]*\|REDEFINES[[:print:]]*\./'
Similarly using sed:
sed '/[[:print:]]*|REDEFINES[[:print:]]*\./d'
I just can't work out how to extend it to do what I need. Is this possible in sed or awk or do I need another tool?
Any help greatly appreciated.
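Using awk in paragraph mode (RS= treats blank lines as record separators, so this assumes the strings are separated by blank lines in the file):
awk -v RS= '!/REDEFINES/ || /PIC/' file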
05|KEEP|REDEFINES|NO_TYPE|PIC|9.
05|KEEP2|REDEFINES|VARIABLE2
|PIC|9(5).
Using sed (with older input data):
sed -i.bak '/REDEFINES/{/PIC/!d;}' file
05|KEEP|REDEFINES|NO_TYPE|PIC|9.
You can try the command below. It prints a line if it contains PIC or if it does not contain REDEFINES, which keeps the logic easy to read and maintain. Note that it tests each physical line on its own, so it does not treat a string split over two lines as one unit.
awk '{if ($0 ~ /PIC/ || $0 !~ /REDEFINES/){print $0}}' input.txt
Why don't you just use grep? Using negations on your question, here is what I understood:
keep the lines terminated with a full-stop, containing both REDEFINES and PIC.
So grep seems easy:
$ grep -E 'REDEFINES.*\.$' file | grep PIC
05|KEEP|REDEFINES|NO_TYPE|PIC|9.
Hope this helps.
This might work for you (GNU sed):
sed -r '/REDEFINES/{/PIC|[^.]$/!d}' file
or perhaps more easily:
sed '/PIC/b;/REDEFINES.*\.$/d' file
or if you prefer:
sed '/PIC/!{/REDEFINES.*\.$/d}' file
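Most of the commands above test each physical line separately. As a minimal sketch of the other reading of the question, the awk program below first glues continuation lines together (a string ends at the first line that finishes with a full stop) and only then applies the REDEFINES/PIC test to the whole string. The variable names buf and kept are made up, and keeping an unterminated trailing string is an assumption, since the question does not cover that case.

```shell
# The four lines (three strings) from the question.
input='05|KEEP|REDEFINES|NO_TYPE|PIC|9.
05|DELETE|REDEFINES|VARIABLE.
05|KEEP2|REDEFINES|VARIABLE2
|PIC|9(5).'

kept=$(printf '%s\n' "$input" | awk '
  { buf = (buf == "" ? $0 : buf "\n" $0) }      # collect the physical lines of one string
  /\.$/ {                                        # a trailing full stop ends the string
    if (buf !~ /REDEFINES/ || buf ~ /PIC/) print buf
    buf = ""
  }
  END { if (buf != "") print buf }               # assumption: keep an unterminated last string
')

printf '%s\n' "$kept"
```

With this approach a split string is kept or dropped as a whole, so a REDEFINES string without PIC is removed even when it spans two lines.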

Replacing Part of Text Using Sed

I have the following text file
Eif2ak1.aSep07
Eif2ak1.aSep07
LOC100042862.aSep07-unspliced
NADH5_C.0.aSep07-unspliced
LOC100042862.aSep07-unspliced
NADH5_C.0.aSep07-unspliced
What I want to do is to remove all the text starting from period (.) to the end.
But why doesn't this command do it?
sed 's/\.*//g' myfile.txt
What's the right way to do it?
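You're missing a period there: \.* matches zero or more literal periods, whereas \..* matches one literal period followed by everything after it. You want:
s/\..*$//g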
You can use awk or cut, since dots are your delimiters.
$ awk -F"." '{print $1}' file
Eif2ak1
Eif2ak1
LOC100042862
NADH5_C
LOC100042862
NADH5_C
$ cut -d"." -f1 file
Eif2ak1
Eif2ak1
LOC100042862
NADH5_C
LOC100042862
NADH5_C
This is easier than using a regular expression.
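To see why the original command fails, here is a small sketch that runs the command from the question and the corrected one on two of the sample names. The variable names are assumptions for illustration.

```shell
# Two of the sample names from the question.
names='Eif2ak1.aSep07
NADH5_C.0.aSep07-unspliced'

# \.* matches runs of literal dots (including empty runs), so only the dots themselves disappear.
broken=$(printf '%s\n' "$names" | sed 's/\.*//g')

# \..* matches the first literal dot and everything after it up to the end of the line.
fixed=$(printf '%s\n' "$names" | sed 's/\..*$//')

printf '%s\n' "$broken" "$fixed"
```

The first command leaves the text after each dot in place, which is why the output in the question still contained aSep07 and the rest.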