Why doesn't this simple regex match what I think it should? - regex

I have a data file that looks like the following example. I've added '%' in lieu of \t, the tab control character.
1234:56% Alice Worthington
alicew% Jan 1, 2010 10:20:30 AM% Closed% Development
Digg:
Reddit:
Update%% file-one.txt% 1.1% c:/foo/bar/quux
Add%% file-two.txt% 2.5.2% c:/foo/bar/quux
Remove%% file-three.txt% 3.4% c:/bar/quux
Update%% file-four.txt% 4.6.5.3% c:/zzz
... many more records of the above form
The records I'm interested in are the lines beginning with "Update", "Add", "Remove", and so on. I won't know what the lines begin with ahead of time, or how many lines precede them. I do know that they always begin with a string of letters followed by two tabs. So I wrote this regex:
generate-report-for 1234:56 | egrep "^[[:alpha:]]+\t\t.+"
But this matches zero lines. Where did I go wrong?
Edit: I get the same results whether I use '...' or "..." for the egrep expression, so I'm not sure it's a shell thing.

Apparently \t isn't a special character for egrep. You can either use grep -P to enable the Perl-compatible regex engine, or insert literal tabs by typing Ctrl+V followed by Ctrl+I.
Even better, you could use the excellent ack
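As a cross-check outside grep, here is a minimal Python sketch of the intended filter. Python's re module does interpret \t as a tab, which is the behaviour grep -P gives you. The SAMPLE text and the record_lines name are illustrative only, and [A-Za-z] stands in for [[:alpha:]], which Python's re does not support.

```python
import re

# Lines from the question, with real tab characters where it shows '%'.
SAMPLE = (
    "1234:56\t Alice Worthington\n"
    "alicew\t Jan 1, 2010 10:20:30 AM\t Closed\t Development\n"
    "Digg:\n"
    "Reddit:\n"
    "Update\t\t file-one.txt\t 1.1\t c:/foo/bar/quux\n"
    "Add\t\t file-two.txt\t 2.5.2\t c:/foo/bar/quux\n"
    "Remove\t\t file-three.txt\t 3.4\t c:/bar/quux\n"
    "Update\t\t file-four.txt\t 4.6.5.3\t c:/zzz\n"
)

# Letters at the start of the line, then exactly two tabs, then anything.
RECORD = re.compile(r"^[A-Za-z]+\t\t.+")


def record_lines(text):
    """Return the lines that start with letters followed by two tabs."""
    return [line for line in text.splitlines() if RECORD.match(line)]


if __name__ == "__main__":
    for line in record_lines(SAMPLE):
        print(line)
```

If this finds the records but egrep does not, the data is fine and the problem is how the pattern's \t reaches or is interpreted by egrep.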

It looks like the shell is parsing "\t\t" before it is sent to egrep. Try "\\t\\t" or '\t\t' instead. That is 2 slashes in double quotes and one in single quotes.

The file might not be exactly what you see. Maybe there are control characters hidden. It happens, sometimes. My suggestion is that you debug this. First, reduce to the minimum regex pattern that matches, and then keep adding stuff one by one, until you find the problem:
egrep "[[:alpha:]]"
egrep "[[:alpha:]]+"
egrep "[[:alpha:]]+\t"
egrep "[[:alpha:]]+\t\t"
egrep "[[:alpha:]]+\t\t.+"
egrep "^[[:alpha:]]+\t\t.+"
There are variations on that sequence, depending on what you find out at each step. Also, the first step can really be skipped, but this is just for the sake of showing the technique.
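The same build-it-up technique can be automated. This is only a sketch: the STEPS list mirrors the sequence above with [A-Za-z] substituted for [[:alpha:]] (Python's re lacks POSIX classes), and match_counts and first_failing are made-up helper names.

```python
import re

# Progressively longer patterns, mirroring the egrep sequence above.
STEPS = [
    r"[A-Za-z]",
    r"[A-Za-z]+",
    r"[A-Za-z]+\t",
    r"[A-Za-z]+\t\t",
    r"[A-Za-z]+\t\t.+",
    r"^[A-Za-z]+\t\t.+",
]


def match_counts(lines, steps=STEPS):
    """Return (pattern, number of matching lines) for every step."""
    return [(p, sum(1 for line in lines if re.search(p, line))) for p in steps]


def first_failing(lines, steps=STEPS):
    """Return the first pattern that matches no line, or None if all match."""
    for pattern, count in match_counts(lines, steps):
        if count == 0:
            return pattern
    return None
```

Whichever step first drops to zero matches points at the piece of the pattern (or of the data) that is not what you expected. Printing repr(line) for a suspect line will also reveal hidden control characters.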

you can use awk
awk '/^[[:alpha:]]+\t\t/' file

Related

How to replace spaces after a certain pattern with commas?

I am new to coding and I'm trying to format some bioinformatics data. I am trying to replace all the spaces after GT:GL:GOF:GQ:NR:NV with commas, but not anything outside of the format xx:xx:xx:xx:xx (like the example). I know I need to use sed with a regex, but I'm not very familiar with how to use it. I've never actually used sed before and got confused trying, so any help would be appreciated. Sorry if I formatted this poorly (this is my first post).
EDIT 2: I got actual data from the file this time which may help solve the problem. Removed the bad example.
New Example: I pulled this data from my actual file (this is just two samples), and it is surrounded by other data. Essentially the line has a bunch of data followed by "GT:GL:GOF:GQ:NR:NV ", after this there is more data in the format shown below, and finally there is some more random data. Unfortunately I can't post a full line of the data because it is extremely long and will not fit.
Input
0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0
Output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
With Basic Regular Expressions, you can use character classes and backreferences to accomplish your task, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\)[ ]\([0-9][0-9]*:[0-9][0-9]*\)/\1,\2/g' file
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT BB
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 10:13:12,41:41:1:13,13:131:1:1 AB GT RT
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT
Which basically says:
find and capture any [0-9][0-9]* one or more digits,
separated by a :, and
followed by [0-9][0-9]* one or more digits -- as capture group 1,
match a space following capture group 1, followed by capture group 2 (which uses the same pattern as capture group 1),
then replace the space separating the capture groups with a comma reinserting the capture group text using backreference 1 and 2 (e.g. \1 and \2), finally
make the replacement global (e.g. g) to replace all matching occurrences.
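To see the capture-and-backreference idea in isolation, here is a rough Python equivalent of the substitution described above. The join_pairs name and the test strings are mine, not from the question; re.sub replaces every non-overlapping match by default, which plays the role of sed's g flag.

```python
import re

# One "digits:digits" token, captured as a group.
PAIR = r"([0-9]+:[0-9]+)"


def join_pairs(line):
    """Replace the single space between two digits:digits tokens with a comma."""
    return re.sub(PAIR + " " + PAIR, r"\1,\2", line)


if __name__ == "__main__":
    print(join_pairs("x 1:12 3:4 y"))
```

One limitation is worth knowing: a token consumed as the right-hand side of one match cannot also be the left-hand side of the next, so "1:2 3:4 5:6" becomes "1:2,3:4 5:6" in a single pass.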
Edit Based On New Input Posted
If you still need all of the original commas added, and you now also want to add a comma between ,0 0/ (where a comma precedes a single digit, followed by the space to be replaced with a comma, followed by a single digit and a forward slash), then all you need to do is make your capture groups conditional: either capturing the original data as above or capturing this new segment. You do that by including an OR (e.g. \| in basic regex terms, which is a GNU sed extension) between the conditions.
For instance by adding \|,[0-9] at the end of the first capture group and \|[0-9][/] at the end of the second, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\|,[0-9]\)[ ]\([0-9][0-9]*:[0-9][0-9]*\|[0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
If you have other caveats in your file, I suggest you post several complete lines of input, and if they are too long, then create a zip, gzip, bzip or xz file and post it to a site like pastebin and add the link to your question.
If all you really care about now is the space in ,0 0/, then you can shorten the sed command to:
$ sed 's/\(,[0-9]\)[[:space:]]\([0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
(note: I've included [[:space:]] to handle any whitespace (space, tab, ...) instead of just the literal [ ] (space) in the new example)
Let me know if this fixes the issue.
I'm assuming that the xx:xx:xx or xx:xx:xx:xx can have any number of parts, since some have 3, and some have 4.
This is quite difficult to do reliably with sed, as it does not support lookarounds, which seem like they might be needed for this example.
You can try something like:
perl -pe 's/(?<=\d) (?=\d+(:\d+){2,})/,/g' input.txt
If you've got your heart set on sed, you can try this, but it may miss some cases:
sed -r 's/(:[0-9]+) ([0-9]+:)/\1,\2/g' input.txt
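For anyone without perl to hand, the lookaround version ports directly to Python. This is a sketch under the same assumption as the perl one-liner (a sample is three or more colon-separated runs of digits); join_samples is just an illustrative name.

```python
import re

# A space that has a digit before it and "digits:digits:digits..." after it.
# Both conditions are zero-width, so only the space itself is replaced.
GAP = re.compile(r"(?<=\d) (?=\d+(?::\d+){2,})")


def join_samples(line):
    """Turn the spaces between adjacent colon-separated samples into commas."""
    return GAP.sub(",", line)


if __name__ == "__main__":
    print(join_samples("GT:GL 1:12:314 132:13:31 AB"))
```

Because the lookarounds consume nothing, neighbouring samples are all joined in one pass, which is exactly what the plain capture-group approach struggles with.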
Could you please try the following. It also prints the lines that do not match the regex. The regex inside match could probably be shortened with an interval expression such as ([0-9]+:){4}[0-9]+, but I only had an old awk to test on, so I couldn't try that.
awk '
BEGIN{
OFS=","
}
match($0,/GT:GL:GOF:GQ:NR:NV [0-9]+:[0-9]+:[0-9]+:[0-9]+:[0-9]+/){
value=substr($0,RSTART!=1?1:RSTART,RSTART+RLENGTH-1)
value1=substr($0,RSTART+RLENGTH+1)
gsub(/[[:space:]]+/,",",value1)
print value,value1
next
}
1
' Input_file
You may also achieve your desired result without regex, using awk:
awk '{printf "%s", $1FS$2FS$3FS$4FS$5","$6","$7; for (i=8;i<=NF;i++) printf "%s", FS$i; print ""}' input.txt
Basically, it outputs from field 1 to 5 with the default field separator ("space"), then from field 5 to 7 with the comma separator, then from field 8 onwards with default separator again.
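The field-based idea, sketched in Python for comparison. It assumes, like the awk one-liner, that the samples always sit in whitespace-separated fields 5 to 7; the function name and its first/last parameters are hypothetical.

```python
def comma_join_fields(line, first=5, last=7):
    """Join fields first..last (1-based, inclusive) with commas, the rest with spaces."""
    fields = line.split()
    head = fields[:first - 1]                   # fields before the range
    middle = ",".join(fields[first - 1:last])   # the range, comma-joined
    tail = fields[last:]                        # fields after the range
    return " ".join(head + [middle] + tail)


if __name__ == "__main__":
    print(comma_join_fields("1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12 3:4 5:6 AB GT"))
```

This only works when the number of samples per line is fixed, so it is a poor fit if the count varies.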
perl myscript.pl '0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0'
myscript.pl,
#!/usr/local/ActivePerl-5.20/bin/env perl
my $input = $ARGV[0];
$input =~ s/ /\,/g;
print $input, "\n";
__DATA__
output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
This will remove all spaces, not just the space in question

Using sed to find inside of quotes and skip escaped quotes

I have a curl call that queries JIRA REST API and returns a JSON string like the following (except on a single line):
{
"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog",
"id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112",
"key":"FOO-1218",
"fields":
{"summary":"the \"special\" field is not returning what is expected"}
}
I was trying to parse out the "summary" field using this sed script:
sed 's/^.*summary":"\([^"]*\)".*$/\1/'
Which works fine if the "summary" doesn't have an escaped \" inside of it, but of course, with the escaped quote, all I get back from the example is:
the \
My desired output would either be:
the \"special\" field is not returning what is expected
Or even more fancily this:
the "special" field is not returning what is expected
It doesn't appear that I can do a lookbehind in sed, is there a simple way to solve this in a bash script?
You're asking for a JSON parser written in sed. Sorry, but this is insane.
Here's an example of a sane way to do this in python:
import requests
response = requests.get(JIRA_API_ENDPOINT, headers = JIRA_HEADERS)
obj = response.json()
obj['fields']['summary']
There's also a good JIRA API wrapper in python, called jira-python. Just use that and you won't have to do any parsing at all. I've used it to good effect before. Link here: http://jira-python.readthedocs.org/en/latest/
Your coworkers will thank you.
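If the JSON text is already in hand (for example, captured from curl), the standard-library json module is enough and no extra packages are needed. A minimal sketch, using the payload from the question as RAW; summary_of is an illustrative name.

```python
import json

# The single-line payload from the question, kept raw so \" stays escaped.
RAW = r'''{"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog","id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112","key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}'''


def summary_of(raw):
    """Parse the JSON text and return fields.summary with escapes decoded."""
    return json.loads(raw)["fields"]["summary"]


if __name__ == "__main__":
    print(summary_of(RAW))
```

The parser handles the escaped quotes for you, so the result is already the "fancy" form the question asks for.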
For the inside of double quotes, you really want at least one of these facilities:
lookarounds (so you can check that what precedes and follows are quotes).
\K (so you can drop the opening quote)
the ability to examine capture groups (so you can match the whole quote, but only capture what's inside).
Typically, you would want something like this:
(?<=(?<!\\)")(?:\\"|[^"])*(?=")
In grep -P mode, which uses PCRE, you can tap into even more features, such as the possessive quantifier I'll add here:
(?<=(?<!\\)")(?:\\"|[^"])*+(?=")
Note that the [^"] can normally run across multiple lines, which you'd typically control with [^"\r\n], but grep only looks line by line anyway.
For this limited case, you could use something like
vnix$ sed -n 's/.*summary":"\(\([^\\"]*\|\\.\)*\)".*/\1/p' file.json
the \"special\" field is not returning what is expected
Inside the quoted string, double quotes are disallowed, except any character is allowed immediately after a literal backslash. The character class disallows backslashes, too, to prevent a backslash from "leaking" into the wrong partial match. The repeat after the character class is just an optimization to avoid needless backtracking.
Any attempt at generalizing this will quickly become quite unwieldy. The Friedl book has an example which stretches over more than a page just to illustrate the futility of this.
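The "anything but quote or backslash, or a backslash plus any character" idea reads a little more easily in a regex dialect with non-capturing groups. Here is a sketch in Python of the same limited-case pattern; raw_summary and unescape_quotes are illustrative names, and this is still not a JSON parser.

```python
import re

# Inside the quotes: either a character that is neither quote nor backslash,
# or a backslash followed by any single character.
SUMMARY = re.compile(r'"summary":"((?:[^"\\]|\\.)*)"')


def raw_summary(text):
    """Return the summary value exactly as written, escapes included."""
    m = SUMMARY.search(text)
    return m.group(1) if m else None


def unescape_quotes(s):
    """Turn \\" back into a plain double quote (other escapes are left alone)."""
    return s.replace('\\"', '"')
```

Only the \" escape is undone here; sequences such as \n or \u00e9 would need a real JSON decoder.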
After serious struggling, I have figured out a method that is working for this very specific use-case. I convert the escaped quotes (\") into an even more obscure character sequence of five underscores (_), do the regex, and then convert it back:
sed -e 's/\\"/_____/g' -e 's/^.*summary":"\([^"]*\)".*$/\1/' -e 's/_____/"/g'
So the full test looks like this:
echo '{"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog","id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112","key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}' | sed -e 's/\\"/_____/g' -e 's/^.*summary":"\([^"]*\)".*$/\1/' -e 's/_____/"/g'
And the output looks like this:
the "special" field is not returning what is expected

Change delimiter of grep command

I am using grep to detect <a href="xxxx">something here</a>
This is not working when the link is split across two lines in the input. I want grep to keep reading until it detects a </a>, but right now it only takes input up to the next newline.
So if the input is like <a href="xxxx">something here</a> it works, but if the input is like
<a href="xxxx">
something here /a>
, then it doesn't.
Any solutions?
I'd use awk rather than grep. This should work:
awk '/a href="xxxx">/,/\/a>/' filename
I think you would have much less trouble using some XSLT tool, but you can do it with sed, awk, or pcregrep, an extended version of grep that is capable of multiline patterns (-M).
I'd suggest folding the input so that opening and closing tags are on the same line, then checking the line against the pattern. An idiomatic approach using sed(1):
sed '/<[Aa][^A-Za-z]/{ :A
/<\/[Aa]>/ bD
N
bA
:D
/\n/ s// /g
}
# now try your pattern
/<[Aa] href="xxxx"[^>]*>[^<]*something here[^<]*<\/[Aa]>/!d'
This is probably a repeat question:
Grep search strings with line breaks
You can try it with the tr '\n' ' ' command, as was explained in one of the answers, if all you need is to find the files and not the line numbers.
Consider egrep -3 '(<a|</a>)'
"-3" prints up to 3 surrounding lines around each regex match (3 lines before and 3 lines after the match). You can use -1 or -2 as well if that works better.
perl -e '$_=join("", <>); m#<a.*?>.*?<.*?/a>#s; print "$&\n";'
So the trick here is that the entire input is read into $_. Then a standard /.../ regex is run. I used the alternate syntax m#...# so that I do not have to backslash "/"s which are used in xml. Finally the "s" postfix makes multiline matches work by making "." also match newlines (note also option "m" which changes the meaning of ^ and $). "$&" is the matched string. It is the result you are looking for. If you want just the inner-text, you can put round brackets around that part and print $1.
I am assuming that you meant </a> rather than /a> as an xml closing delimiter.
Note the .*? is a non-greedy version of .* so for <a>1</a><a>2</a>, it only matches <a>1</a>.
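The slurp-then-match approach looks much the same in Python, where re.DOTALL plays the role of perl's s modifier. This sketch assumes the closing tag is a proper </a>; first_link is an illustrative name, and it inherits the usual caveats of matching markup with regexes.

```python
import re

# Non-greedy so the match stops at the first closing tag; DOTALL lets "."
# cross newlines, so a link split over several lines is still found.
LINK = re.compile(r"<a\b.*?>.*?</a>", re.DOTALL)


def first_link(text):
    """Return the first <a ...>...</a> element in text, or None."""
    m = LINK.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(first_link('x\n<a href="xxxx">\nsomething here </a>\ny'))
```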
Note that nested nodes may cause problems eg <a><a></a></a>. This is the same as when trying to match nested brackets "(", ")" or "{", "}". This is a more interesting problem. Regex's are normally stateless so they do not by themselves support keeping an unlimited bracket-nesting-depth. When programming parsers, you normally use regex's for low-level string matching and use something else for higher level parsing of tokens eg bison. There are bison grammars for many languages and probably for xml. xslt might even be better but I am not familiar with it. But for a very simple use case, you can also handle nested blocks like this in perl:
Nested bracket-handling code: (this could be easily adapted to handle nested xml blocks)
$_ = "a{b{c}e}f";
my($level)=(1);
s/.*?({|})/$1/; # throw away everything before first match
while(/{|}/g) {
    if($& eq "{") {
        ++$level;
    } elsif($& eq "}") {
        --$level;
        if($level == 1) {
            print "Result: ".$`.$&."\n";
            $_=$'; # reset searchspace to after the match
            last;
        }
    }
}
Result: {b{c}e}
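The depth-counting idea is not tied to perl. A minimal Python sketch of the same technique, scanning character by character instead of using a regex; first_balanced_block and its parameters are hypothetical names.

```python
def first_balanced_block(text, open_ch="{", close_ch="}"):
    """Return the first balanced open_ch...close_ch block in text, or None."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i            # remember where the outermost block begins
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:           # back at the outermost level: block is complete
                return text[start:i + 1]
    return None                      # no block, or the block was never closed


if __name__ == "__main__":
    print(first_balanced_block("a{b{c}e}f"))
```

Stray closing brackets before the first opening one are ignored, and an unterminated block yields None rather than a partial result.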

Regular Expression - Capture and Replace Select Sequences

Take the following file...
ABCD,1234,http://example.com/mpe.exthttp://example/xyz.ext
EFGH,5678,http://example.com/wer.exthttp://example/ljn.ext
Note that "ext" is a constant file extension throughout the file.
I am looking for an expression to turn that file into something like this...
ABCD,1234,http://example.com/mpe.ext
ABCD,1234,http://example/xyz.ext
EFGH,5678,http://example.com/wer.ext
EFGH,5678,http://example/ljn.ext
In a nutshell I need to capture everything up to the urls. Then I need to capture each URL and put them on their own line with the leading capture.
I am working with sed to do this and I cannot figure out how to make it work correctly. Any ideas?
If the number of URLs in each line is guaranteed to be two, you can use:
sed -r "s/([A-Z0-9,]{10})(.+\.ext)(.+\.ext)/\1\2\n\1\3/" < input
This does not require the first two fields to be a particular width or limit the set of (non-comma) characters between the commas. Instead, it keys on the commas themselves.
sed 's/\(\([^,]*,\)\{2\}\)\(.*\.ext\)\(http:.*\)/\1\3\n\1\4/' inputfile.txt
You could change the "2" to match any number of comma-delimited fields.
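If the number of URLs per line is not fixed, a short script may be easier than a single sed expression. A sketch, assuming every URL starts with http:// and ends at the first following .ext, as in the sample; split_urls and key_fields are made-up names.

```python
import re

# Non-greedy: each URL runs from "http://" to the nearest ".ext".
URL = re.compile(r"http://.*?\.ext")


def split_urls(line, key_fields=2):
    """Emit one 'prefix,url' line per URL found after the leading fields."""
    parts = line.split(",", key_fields)
    prefix = ",".join(parts[:key_fields]) + ","
    return [prefix + url for url in URL.findall(parts[key_fields])]


if __name__ == "__main__":
    for out in split_urls("ABCD,1234,http://example.com/mpe.exthttp://example/xyz.ext"):
        print(out)
```

A URL that contains ".ext" somewhere in the middle of its path would be cut short, so check that against the real data first.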
I have no sed available to me at the moment.
Wouldn't
sed -r 's/(....),(....),(.*\.ext)(http.*\.ext)/\1,\2,\3\n\1,\2,\4/g'
do the trick?
Edit: removed the lazy quantifier

matching text in quotes (newbie)

I'm getting totally lost in shell programming, mainly because every site I use offers different tool to do pattern matching. So my question is what tool to use to do simple pattern matching in piped stream.
Context: I have a named.conf file, and I need all zone names in a simple file for further processing. So I do ~$ cat named.local | grep zone and get totally lost here. My output is a hundred or so lines of the form 'zone "domain.tld" {' and I need the text in double quotes.
Thanks for showing a way to do this.
J
I think what you're looking for is sed... it's a stream editor which will let you do replacements on a line-by-line basis.
As you're explaining it, the command `cat named.local | grep zone' gives you an output a little like this:
zone "domain1.tld" {
zone "domain2.tld" {
zone "domain3.tld" {
zone "domain4.tld" {
I'm guessing you want the output to be something like this, since you said you need the text in double quotes:
"domain1.tld"
"domain2.tld"
"domain3.tld"
"domain4.tld"
So, in reality, from each line we just want the text between the double-quotes (including the double-quotes themselves.)
I'm not sure you're familiar with Regular Expressions, but they are an invaluable tool for any person writing shell scripts. For example, the regular expression /.o.e/ would match any line containing a four-character sequence whose 2nd character is a lower-case o and whose 4th is e. This would match strings containing words like "zone", "tone", or even "I am tone-deaf."
The trick there was to use the . (dot) character to mean "any character". There are a couple of other special characters, such as * which means "repeat the previous item 0 or more times". Thus a regular expression like a* would match "a", "aaaaaaa", or an empty string: ""
So you can match the string inside the quotes using: /".*"/
There's another thing you should know about sed (and by the comments, you already do!): it supports backreferences. Once you've told it how to recognize a word, you can have it use that word as part of the replacement. For example, let's say that you wanted to turn this list:
Billy "The Kid" Smith
Jimmy "The Fish" Stuart
Chuck "The Man" Norris
Into this list:
The Kid
The Fish
The Man
First, you'd look for the string inside the quotes. We already saw that, it was /".*"/.
Next, we want to use what's inside the quotes. We can group it using parens: /"(.*)"/
If we wanted to replace the text with the quotes with an underscore, we'd do a replace: s/"(.*)"/_/, and that would leave us with:
Billy _ Smith
Jimmy _ Stuart
Chuck _ Norris
But we have backreferences! They let us recall what was inside the parens, using the symbol \1. So if we do now: s/"(.*)"/\1/ we'll get:
Billy The Kid Smith
Jimmy The Fish Stuart
Chuck The Man Norris
Because the quotes weren't in the parens, they weren't part of the contents of \1!
To only leave the stuff inside the double-quotes, we need to match the entire line. To do that we have ^ (which means "beginning of line"), and $ (which means "end of line".)
So now if we use s/^.*"(.*)".*$/\1/, we'll get:
The Kid
The Fish
The Man
Why? Let's read the regular expression s/^.*"(.*)".*$/\1/ from left-to-right:
s/ - Start a substitution regular expression
^ - Look for the beginning of the line. Start from there.
.* - Keep going, reading every character, until...
" - ... until you reach a double-quote.
( - start a group of characters we might want to recall later with a backreference.
.* - Keep going, reading every character, until...
) - (pssst! close the group!)
" - ... until you reach a double-quote.
.* - Keep going, reading every character, until...
$ - The end of the line!
/ - use what's after this to replace what you matched
\1 - paste the contents of the first group (what was in the parens) matched.
/ - end of regular expression
In plain English: "Read the entire line, copying aside the text between the double-quotes. Then replace the entire line with the content between the double quotes."
You can even add double-quote around the replacing text s/^.*"(.*)".*$/"\1"/, so we'll get:
"The Kid"
"The Fish"
"The Man"
And that can be used by sed to replace the line with the content from within the quotes:
sed -e "s/^.*\"\(.*\)\".*$/\"\1\"/"
(This is just shell-escaped to deal with the double-quotes and slashes and stuff.)
So the whole command would be something like:
cat named.local | grep zone | sed -e "s/^.*\"\(.*\)\".*$/\"\1\"/"
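If you want to experiment with the substitution without the shell escaping getting in the way, the same regular expression can be tried in Python. A small sketch; quoted_part and keep_quotes are illustrative names.

```python
import re


def quoted_part(line, keep_quotes=False):
    """Replace the whole line with the text found between double quotes."""
    replacement = r'"\1"' if keep_quotes else r"\1"
    # ^.* and .*$ swallow everything outside the quotes; \1 is the group inside.
    return re.sub(r'^.*"(.*)".*$', replacement, line)


if __name__ == "__main__":
    print(quoted_part('Billy "The Kid" Smith'))
    print(quoted_part('zone "domain1.tld" {', keep_quotes=True))
```

A line with no double quotes at all does not match and comes back unchanged, which is also what sed does.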
Well, nobody mentioned cut yet, so, to prove that there are many ways to do something with the shell:
% grep '^zone' /etc/bind/named.conf | cut -d' ' -f2
"gennic.net"
"generic-nic.net"
"dyn.generic-nic.net"
"langtag.net"
1.
zoul#naima:etc$ cat named.conf | grep zone
zone "." IN {
zone "localhost" IN {
file "localhost.zone";
zone "0.0.127.in-addr.arpa" IN {
2.
zoul#naima:etc$ cat named.conf | grep ^zone
zone "." IN {
zone "localhost" IN {
zone "0.0.127.in-addr.arpa" IN {
3.
zoul#naima:etc$ cat named.conf | grep ^zone | sed 's/.*"\([^"]*\)".*/\1/'
.
localhost
0.0.127.in-addr.arpa
The regexp is .*"\([^"]*\)".*, which matches:
any number of any characters: .*
a quote: "
starts to remember for later: \(
any characters except quote: [^"]*
ends group to remember: \)
closing quote: "
and any number of characters: .*
When calling sed, the syntax is 's/what_to_match/what_to_replace_it_with/'. The single quotes are there to keep your regexp from being expanded by bash. When you “remember” something in the regexp using parens, you can recall it as \1, \2 etc. Fiddle with it for a while.
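For comparison, the anchored [^"]* pattern can do the filtering and the extraction in one step when written in Python. This is a sketch only: CONF is a condensed stand-in for a real named.conf, and zone_names is a made-up helper.

```python
import re

# ^zone at the start of a line (MULTILINE), then the first quoted string.
ZONE = re.compile(r'^zone\s+"([^"]*)"', re.MULTILINE)

# Condensed sample; the indented "file" line must not be picked up.
CONF = '''zone "." IN {
zone "localhost" IN {
        file "localhost.zone";
zone "0.0.127.in-addr.arpa" IN {
'''


def zone_names(conf_text):
    """Return the quoted name from every line that begins with 'zone'."""
    return ZONE.findall(conf_text)


if __name__ == "__main__":
    for name in zone_names(CONF):
        print(name)
```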
You should have a look at awk.
As long as someone is pointing out sed/awk, I'm going to point out that grep is redundant.
sed -ne '/^zone/{s/.*"\([^"]*\)".*/\1/;p}' /etc/bind/named.conf
This gives you what you're looking for without the quotes (move the quotes inside the parentheses to keep them). In awk, it's even simpler with the quotes:
awk '/^zone/{print $2}' /etc/bind/named.conf
I try to avoid pipelines as much as possible (but not more). Remember: don't pipe cat. It's not needed. And, inasmuch as awk and sed duplicate grep's work, don't pipe grep either, at least not into sed or awk.
Personally, I'd probably have used perl. But that's because I probably would have done the rest of whatever you're doing in perl, making it a minor detail (and being able to slurp the whole file in and regex against everything simultaneously, ignoring \n's would be a bonus for cases where I don't control /etc/bind, such as on a shared webhost). But, if I were to do it in shell, one of the above two would be the way I'd approach it.