Extract info from a large file, use newline character in awk regex - c++

I need to extract some information from a very large file.
I want to extract specific lines using regular expressions.
What is the fastest way to do this?
I'm coding in C++ on Linux.
I want to use grep, but it seems my regex is not working as expected.
For example, \s and \w are not working properly.
The grep man page says that \w and [_[:alnum:]] are synonyms, so \w should work properly, but it doesn't.
I need to use newline characters in my regex, so I couldn't use grep; therefore, I decided to use awk.
How should I use a newline character in an awk regex?
Suppose we have a file (test.txt) with the content below:
HELLO worl_d5 ;
some statement
HELLO world1 ;
some statement
hi hi
some statement
...
And I want to get only these lines:
HELLO worl_d5 ;
some statement
HELLO world1 ;
some statement
That is, I want to find lines that start with the word HELLO, followed by space character(s), then a word made of alphanumeric characters (possibly containing /), followed by space character(s), and then a single ;. But I want these lines only when they are followed by a "some statement" line.
I wrote:
awk '/HELLO[[:space:]]([[:alnum:]]|\/)+[[:space:]];\n[[:space:]]*some[[:space:]] statement [[:space:]];/ { print }' test.txt
But I couldn't get the results I needed.
Alternatively, just provide an example where a newline is used in a regex.

I solved this by using pcregrep and newline just worked fine!
pcregrep -M '(HELLO[[:space:]]([[:alnum:]]|\/|_)+[[:space:]];)[\r\n]([[:space:]]*some[[:space:]]statement[[:space:]];)' test.txt
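The same multi-line idea can be sketched in Python, purely as an illustration: read the text as one string so that the regex can see the line breaks. The sample text below is an assumption that mirrors test.txt from the question, and the pattern is a simplified variant of the pcregrep one, not a drop-in replacement for it.

```python
import re

# Hypothetical sample mirroring test.txt from the question.
text = (
    "HELLO worl_d5 ;\n"
    "some statement\n"
    "HELLO world1 ;\n"
    "some statement\n"
    "hi hi\n"
    "some statement\n"
)

# HELLO, blanks, a word of alphanumerics, "_" or "/", blanks, ";",
# then a line break followed by a line starting with "some statement".
pattern = re.compile(
    r"^HELLO[ \t]+[\w/]+[ \t]+;[ \t]*\r?\n[ \t]*some[ \t]+statement",
    re.MULTILINE,
)

matches = [m.group(0) for m in pattern.finditer(text)]
for block in matches:
    print(block)
```

Because the whole text is held in memory, this is only reasonable when the file fits in RAM; for very large files a streaming tool such as pcregrep -M is likely the better fit.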

Related

How to locate a mismatched text delimiter

I'm trying to remove double quotes that appear within a string coming from a DB because they are causing a stream error in another application. I can't clean up the DB to remove these, so I need to replace the character on the fly.
I've tried using sed, ssed, and perl all without success. This regular expression is locating the problem quotes, but when I plug it into sed to replace them with a single quote my output still contains the double quote.
sed "s/(\?<\!\t|^)\"(\?\!\t|$)/'/g" test.txt
I'm on a Mac, in case this looks a bit odd.
The regex is valid, but when I test on a tab-delimited file containing this:
"foo" "rea"son" "text's"
My output is identical to the above. Any idea what I'm doing wrong?
Thanks
I assume you want to replace every occurrence of " that is not on a field boundary (i.e. a quote that is either preceded or followed by a tab or the beginning/end of the string) with '.
This can be done using perl and the following substitution:
s/(?<=[^\t])"(?=[^\t\n])/'/g;
(With sed this is not directly possible as it does not support look-behind / look-ahead assertions.)
To use this code on the command line, it needs to be escaped for whatever shell you're using. Assuming bash or a similar sh-like shell:
perl -pe 's/(?<=[^\t])"(?=[^\t\n])/'\''/g' test.txt
Here I use '...' to quote most of the code. To get a single ' into the quoted string, I leave the quoted area ...', add an escaped single quote \', and switch back into a single-quoted string '.... That's why a literal ' turns into '\'' on the command line.
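For comparison, the same look-behind / look-ahead substitution can be sketched in Python, which avoids the shell quoting problem altogether. The sample line is an assumption: it is the test line from the question with the fields separated by real tab characters.

```python
import re

# Hypothetical tab-delimited line based on the sample in the question.
line = '"foo"\t"rea"son"\t"text\'s"\n'

# Replace a double quote only when it has a non-tab character before it
# and a character other than tab or newline after it.
fixed = re.sub(r'(?<=[^\t])"(?=[^\t\n])', "'", line)

print(fixed, end="")
```

Only the quote inside rea"son is rewritten; the quotes that delimit each field are left alone because they touch a tab, the start of the line, or the trailing newline.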

How to replace spaces after a certain pattern with commas?

I am new to coding and I'm trying to format some bioinformatics data. I am trying to replace all the spaces after GT:GL:GOF:GQ:NR:NV with commas, but not anything outside of the format xx:xx:xx:xx:xx (like the example). I know I need to use sed with the regex option, but I'm not very familiar with how to use it. I've never actually used sed before and got confused trying, so any help would be appreciated. Sorry if I formatted this poorly (this is my first post).
EDIT 2: I got actual data from the file this time which may help solve the problem. Removed the bad example.
New Example: I pulled this data from my actual file (this is just two samples), and it is surrounded by other data. Essentially the line has a bunch of data followed by "GT:GL:GOF:GQ:NR:NV ", after this there is more data in the format shown below, and finally there is some more random data. Unfortunately I can't post a full line of the data because it is extremely long and will not fit.
Input
0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0
Output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
With Basic Regular Expressions, you can use character classes and backreferences to accomplish your task, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\)[ ]\([0-9][0-9]*:[0-9][0-9]*\)/\1,\2/g' file
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT BB
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 10:13:12,41:41:1:13,13:131:1:1 AB GT RT
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT
Which basically says:
find and capture any [0-9][0-9]* one or more digits,
separated by a :, and
followed by [0-9][0-9]* one or more digits -- as capture group 1,
match a space following capture group 1, followed by capture group 2 (which uses the same pattern as capture group 1),
then replace the space separating the capture groups with a comma reinserting the capture group text using backreference 1 and 2 (e.g. \1 and \2), finally
make the replacement global (e.g. g) to replace all matching occurrences.
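The capture-group and backreference steps above can also be sketched in Python. The input line is an assumption, reconstructed by putting the spaces back into the first output line shown above; [0-9]+ is used as shorthand for [0-9][0-9]*.

```python
import re

# Hypothetical input line, reconstructed from the sed output shown above.
line = "1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314 213:132:13:31 14:31:31 AB GT BB"

# Group 1 and group 2 each capture "digits:digits"; the single space
# between them is the only character that gets replaced.
pair = re.compile(r"([0-9]+:[0-9]+) ([0-9]+:[0-9]+)")

# \1 and \2 reinsert the captured text around the new comma.
# re.sub replaces every non-overlapping match, like the g flag in sed.
result = pair.sub(r"\1,\2", line)

print(result)
```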
Edit Based On New Input Posted
If you still need all of the original commas added, and you now also want to add a comma between ,0 0/ (where a comma precedes a single digit, followed by the space to be replaced with a comma, followed by a single digit and a forward slash), then all you need to do is make your capture groups conditional (on either capturing the original data as above or capturing this new segment). You do that by including an OR (e.g. \| in basic regex terms, which is a GNU sed extension) between the conditions.
For instance by adding \|,[0-9] at the end of the first capture group and \|[0-9][/] at the end of the second, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\|,[0-9]\)[ ]\([0-9][0-9]*:[0-9][0-9]*\|[0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
If you have other caveats in your file, I suggest you post several complete lines of input, and if they are too long, then create a zip, gzip, bzip or xz file and post it to a site like pastebin and add the link to your question.
If all you really care about now is the space in ,0 0/, then you can shorten the sed command to:
$ sed 's/\(,[0-9]\)[[:space:]]\([0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
(note: I've included [[:space:]] to handle any whitespace (space, tab, ...) instead of just the literal [ ] (space) in the new example)
Let me know if this fixes the issue.
I'm assuming that the xx:xx:xx or xx:xx:xx:xx can have any number of parts, since some have 3, and some have 4.
This is quite difficult to do reliably with sed, as it does not support lookarounds, which seem like they might be needed for this example.
You can try something like:
perl -pe 's/(?<=\d) (?=\d+(:\d+){2,})/,/g' input.txt
If you've got your heart set on sed, you can try this, but it may miss some cases:
sed -r 's/(:[0-9]+) ([0-9]+:)/\1,\2/g' input.txt
Could you please try the following. It also prints the lines that do NOT match the regex. The regex in match() could be made a bit shorter with an interval expression (something like ([0-9]+:){4}[0-9]+), but this was tested on an old awk, so I couldn't test that form.
awk '
BEGIN{
OFS=","
}
match($0,/GT:GL:GOF:GQ:NR:NV [0-9]+:[0-9]+:[0-9]+:[0-9]+:[0-9]+/){
value=substr($0,RSTART!=1?1:RSTART,RSTART+RLENGTH-1)
value1=substr($0,RSTART+RLENGTH+1)
gsub(/[[:space:]]+/,",",value1)
print value,value1
next
}
1
' Input_file
You may also achieve your desired result without regex, using awk:
awk '{printf "%s", $1FS$2FS$3FS$4FS$5","$6","$7; for (i=8;i<=NF;i++) printf "%s", FS$i; print ""}' input.txt
Basically, it outputs from field 1 to 5 with the default field separator ("space"), then from field 5 to 7 with the comma separator, then from field 8 onwards with default separator again.
perl myscript.pl '0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0'
myscript.pl,
#!/usr/local/ActivePerl-5.20/bin/env perl
my $input = $ARGV[0];
$input =~ s/ /\,/g;
print $input, "\n";
__DATA__
output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
This will replace all spaces, not just the space in question.

Regex for string matching ****${****}***

I am trying to write a regex that matches and excludes all strings in a file that contain ${ followed by } with any characters between or around it. In between could be any characters/numbers/underscores/dashes/etc (there won't be another brace inside).
Example matches:
hello ${VAR}
${HELLO_VAR} world
https://${WEB_VAR}
I came up with this: egrep -v '^\${[a-zA-Z?]', though it seems to be working only partially and I am not too sure if it's right. How can I do this?
The input file has strings separated by a newline, very similar to simple java properties.
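You can try using the sed command.
sed -E 's/\$\{[^}]*\}//g' input_file > output_file
Here sed removes every ${...} sequence (everything from '${' through the next '}') and writes the result to a new output file. The -E option selects extended regular expressions, where \{ and \} match literal braces; replace input_file and output_file with your own file names.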
You can give this one a shot:
\$\{[^}]*\}
Match ${ literally, followed by everything except }, followed by }
You say you're trying to exclude all strings in a file, so it sounds like you need something a bit more advanced than just a regex with grep. I'd do this with an awk script:
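awk '{while(match($0,/\$\{[^}]*\}/)){$0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)}} 1' input.txt
Or, split for easier reading and commenting:
{
  # Keep looping while the line still contains a ${...} sequence.
  while (match($0,/\$\{[^}]*\}/)) {
    # Rebuild the line from the text before and after the match (awk strings are 1-indexed).
    $0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
  }
}
# A bare true pattern prints the (possibly modified) line.
1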
The idea here is that for each line, we'll check to see whether the regex matches anything on the line. If it does, we'll replace the line with the parts around the matched regex. (We could alternatively use sub(/RE/,""), but that would require applying the regex twice per match rather than once.)
The final 1 is shorthand that says "print the current line". It runs whether or not the loop processed any matches.
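The question can be read two ways: strip the ${...} part from each line, or drop every line that contains one. A minimal Python sketch of both readings follows; the function names and the plain=value sample line are made up for illustration.

```python
import re

# "${", then any run of characters other than "}", then "}".
PLACEHOLDER = re.compile(r"\$\{[^}]*\}")


def strip_placeholders(line):
    """Remove every ${...} sequence but keep the rest of the line."""
    return PLACEHOLDER.sub("", line)


def drop_placeholder_lines(lines):
    """Keep only the lines that contain no ${...} sequence at all."""
    return [line for line in lines if not PLACEHOLDER.search(line)]


# The first three entries come from the question; the last is hypothetical.
sample = ["hello ${VAR}", "${HELLO_VAR} world", "https://${WEB_VAR}", "plain=value"]

stripped = [strip_placeholders(s) for s in sample]
kept = drop_placeholder_lines(sample)

print(stripped)
print(kept)
```

The second function corresponds to what grep -v would do with the same pattern, while the first corresponds to the awk loop above.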
Just use the global wildcard .* around the two sequences, as in:
.*\$\{.*\}.*
As you want to match entire lines, you have to use the wildcard on both sides to extend the regexp to both ends (it doesn't matter whether you anchor it with ^ and $, as the greedy algorithm will try to extend the match as far as possible). Note that the $, { and } must be escaped, as they are reserved by the regexp language.
This can be seen in action here.
Note: the title of this question doesn't specify that the substring between the two curly braces should not contain a }, and as you only want to match the whole line, it is not necessary to check for anything other than a }; the only requirement is that a } comes after the ${ in the line. Anyway, this has no drawback in efficiency, as the NFA that parses this regexp has the same number of states as the other.

Perl: regular expression: capturing group

In a code file, I want to remove any (one or more) consecutive blank lines (lines that contain only zero or more spaces/tabs and then a newline) that sit between code text and the concluding } of a block. This concluding } may have spaces for indentation before it, so I want to keep them.
Here is what I try to do:
perl -i -0777 -pe 's/\s+\n([ ]*)\}/\n($1)\}/g' file
For example, if my code file looks like (□ is the space character):
□□□□while (true) {\n
□□□□□□□□print("Yay!");□□□□□□\n
□□□□□□□□□□□□□□□□\n
□□□□}\n
Then I want it to become:
□□□□while (true) {\n
□□□□□□□□print("Yay!");\n
□□□□}\n
However it does not do the change I expected. Any idea what I am doing wrong here?
The only issues I can see with your regex are that you don't need the parentheses around the matching variable, and that the use of a character class when extracting the match is redundant (unless you want to match tabs as well as spaces).
So, you could try
s/\s+\n( *)\}/\n$1\}/g
instead.
This works as expected when run on your test input.
To tidy it up even more, you could try the following.
s/\s+(\n *\})/$1/g
If there might be tabs as well as spaces, you can use a character class. (You do not need to include '|' inside the character class).
s/\s+(\n[ \t]*\})/$1/g
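As a cross-check, the last substitution can be sketched in Python on the sample from the question (with the □ markers written as real spaces). Python writes the backreference as \1 rather than $1; everything else is the same pattern.

```python
import re

# The example block from the question, with real spaces and newlines.
code = (
    "    while (true) {\n"
    '        print("Yay!");      \n'
    "                \n"
    "    }\n"
)

# \s+ swallows trailing blanks and blank lines; the group keeps the final
# newline plus the indentation in front of the closing brace.
cleaned = re.sub(r"\s+(\n[ \t]*\})", r"\1", code)

print(cleaned, end="")
```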
perl -pi -0777 -e's/^\s*\n(?=\s*})//mg' yourfile
(Remove whitespace from the beginning of a line through a newline that precedes a line with } as the first non-whitespace.)
Try using this regex instead, which uses a positive look-ahead assertion. This way you only capture the part that you want to remove, and then replace it with nothing:
s/\s+(?=\n[ ]*\})//g
You can try the following one liner
perl -0777 -pe 's/\s*\n*(\s*\n)/$1/g' test

Change delimiter of grep command

I am using grep to detect <a href="xxxx">something here</a>.
This does not work when the link is split across two lines in the input. I want grep to keep checking until it detects a </a>, but right now it only takes input up to the next newline.
So if the input is like <a href="xxxx">something here</a> it works, but if the input is like
<a href="xxxx">
something here /a>
, then it doesn't.
Any solutions?
I'd use awk rather than grep. This should work:
awk '/a href="xxxx">/,/\/a>/' filename
I think you would have much less trouble using some XSLT tool, but you can do it with sed, awk, or pcregrep, an extended version of grep that is capable of multiline patterns (-M).
I'd suggest folding the input so that the opening and closing tags are on the same line, then checking the line against the pattern. An idiomatic approach using sed(1):
sed '/<[Aa][^A-Za-z]/{ :A
/<\/[Aa]>/ bD
N
bA
:D
/\n/ s// /g
}
# now try your pattern
/<[Aa] href="xxxx"[^>]*>[^<]*something here[^<]*<\/[Aa]>/ !d'
This is probably a repeat question:
Grep search strings with line breaks
You can try it with the tr '\n' ' ' command, as explained in one of the answers there, if all you need is to find the files and not the line numbers.
Consider egrep -3 '(<a|</a>)'
"-3" prints up to 3 surrounding lines around each regex match (3 lines before and 3 lines after the match). You can use -1 or -2 as well if that works better.
perl -e '$_=join("", <>); m#<a.*?>.*?<.*?/a>#s; print "$&\n";'
So the trick here is that the entire input is read into $_. Then a standard /.../ regex is run. I used the alternate syntax m#...# so that I do not have to backslash "/"s which are used in xml. Finally the "s" postfix makes multiline matches work by making "." also match newlines (note also option "m" which changes the meaning of ^ and $). "$&" is the matched string. It is the result you are looking for. If you want just the inner-text, you can put round brackets around that part and print $1.
I am assuming that you meant </a> rather than /a> as an xml closing delimiter.
Note the .*? is a non-greedy version of .* so for <a>1</a><a>2</a>, it only matches <a>1</a>.
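The whole-input, dot-matches-newline approach can be sketched in Python as well, where re.DOTALL plays the role of the s modifier. The sample text is an assumption based on the question, using a proper </a> closing tag.

```python
import re

# Hypothetical input with the link split across two lines.
html = 'before\n<a href="xxxx">\nsomething here </a>\nafter <a>2</a>\n'

# re.DOTALL lets "." match newlines; ".*?" is non-greedy, so the match
# stops at the first closing tag instead of the last one.
match = re.search(r"<a\b[^>]*>.*?</a>", html, re.DOTALL)

if match:
    print(match.group(0))
```

As with the Perl version, this is only a sketch for simple, non-nested markup.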
Note that nested nodes may cause problems eg <a><a></a></a>. This is the same as when trying to match nested brackets "(", ")" or "{", "}". This is a more interesting problem. Regex's are normally stateless so they do not by themselves support keeping an unlimited bracket-nesting-depth. When programming parsers, you normally use regex's for low-level string matching and use something else for higher level parsing of tokens eg bison. There are bison grammars for many languages and probably for xml. xslt might even be better but I am not familiar with it. But for a very simple use case, you can also handle nested blocks like this in perl:
Nested bracket-handling code: (this could be easily adapted to handle nested xml blocks)
$_ = "a{b{c}e}f";
my($level)=(1);
s/.*?({|})/$1/; # throw away everything before first match
while(/{|}/g) {
    if($& eq "{") {
        ++$level;
    } elsif($& eq "}") {
        --$level;
        if($level == 1) {
            print "Result: ".$`.$&."\n";
            $_=$'; # reset searchspace to after the match
            last;
        }
    }
}
Result: {b{c}e}
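The depth-counting idea can be sketched in Python too, without any regex. The function name first_balanced_block is made up for this illustration, and the counter starts at 0 rather than 1, which is just a different convention for "outside any block".

```python
def first_balanced_block(text, open_ch="{", close_ch="}"):
    """Return the first balanced open_ch...close_ch block, or None."""
    depth = 0
    start = None
    for index, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = index  # remember where the outermost block begins
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                # Back at the outermost level: the block is complete.
                return text[start:index + 1]
    return None  # no opening bracket, or the block was never closed


print(first_balanced_block("a{b{c}e}f"))
```

A single pass with a counter is enough here because only one kind of bracket is tracked; mixing several bracket types, or real XML, would need a stack or a proper parser.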