Replace similar strings in a file in place - regex

I have a file with the following types of pairs of strings:
Call Stack: [UniqueObject1] | [UnOb2] | [SuspectedObject1] | [SuspectedObject2] | [SuspectedObject3] | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects
Call Stack: [UniqueObject1] | [UnOb2] | 0x28798765 | 0x18793765 | 0x48792767 | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects
There are many such pairs that occur in the file.
The distinguishing attribute of each pair is that the first line contains "SuspectedObject1", "SuspectedObject2", and so on, which in the second line are replaced by the hex values of those objects' addresses.
What I want to do is remove the second line of each pair.
Please note the pairs do not occur in any specific order and might be separated by many lines in between.
I plan to iterate through each line of this file; whenever I see a hex string given as an address instead of a suspected object, I want to compare the following pattern
Call Stack: [UniqueObject1] | [UnOb2] | * | * | * | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects
against the whole file, and if another line matches, I want to remove the hex line from the file.
Can someone suggest a shell way to do this?

If I have understood your question correctly, you may need to use awk. Run it like:
awk -f script.awk file file
Contents of script.awk:
BEGIN {
    # fields are separated by " | "
    FS=" \\| "
}

# first pass: count every line with fields 3-5 blanked out
FNR==NR {
    $3=$4=$5=""
    a[$0]++
    next
}

# second pass: drop a hex-address line if its blanked-out form was seen
# more than once, i.e. it has a matching suspected-object line
# (letters a-f included in the class, since addresses are hex)
$3 ~ /^0x[0-9a-fA-F]{8}$/ {
    r=$0
    $3=$4=$5=""
    if (a[$0]<2) {
        print r
    }
    next
}

# print everything else untouched
1
Alternatively, here's the one-liner (file{,} is brace expansion for file file):
awk -F ' \\| ' 'FNR==NR { $3=$4=$5=""; a[$0]++; next } $3 ~ /^0x[0-9a-fA-F]{8}$/ { r=$0; $3=$4=$5=""; if (a[$0]<2) print r; next }1' file{,}
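As a quick check (assuming the two sample lines above are saved as file), the hex twin is dropped and the suspected-object line is kept:
$ awk -f script.awk file file
Call Stack: [UniqueObject1] | [UnOb2] | [SuspectedObject1] | [SuspectedObject2] | [SuspectedObject3] | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects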

Related

How to use a capture variable as a field name in JQ?

I'm trying to use jq to automate changing i18n string files from the format taken by one library to another.
I have a JSON file which looks like this:
{
  "some_label": {
    "message": "a string in English with a $VARIABLE$",
    "description": "directions to translators",
    "placeholders": {
      "VARIABLE": {
        "content": "{variable}"
      }
    }
  },
  // more of the same...
}
And I need that to turn into "some-label": "a string in English with a {variable}"
I am pretty close to getting it. Currently, I'm using
jq '[.
| to_entries
| .[]
| .key |= (gsub("_";"-"))
| .value.placeholders as $p
| .value.message |= (sub("\\$KEY_NAME\\$";$p.KEY_NAME.content))
| .value = .value.message
] | from_entries'
The next step is to use a capture group in the sub call so I can programmatically get variables with different names, but I'm not sure how to use the capture group to index into $p.
I've tried sub("\\$(?<id>VARIABLE)\\$";$p.(.id).content) which gave a compiler error, and I'm pretty much stuck on what to try next.
Here is one way of achieving the desired result (it could be simplified further, too). At the top level it avoids to_entries/from_entries by enclosing the whole filter in with_entries() and modifying the .value field as required:
with_entries(
.key |= ( gsub("_";"-") ) |
.value.placeholders as $p |
.value.message as $m |
( $m | match(".*\\$(.*)\\$") | .captures[0].string ) as $c |
( $p | .[$c].content ) as $v |
( "\\$" + $c + "\\$" ) as $t |
.value = ( $m | sub($t; $v) )
)
In my view, the key parts of the expression are:
The part $m | match(".*\\$(.*)\\$") | .captures[0].string makes a regex match to extract the part within the $..$ in the .message
The part $p | .[$c].content does a generic object index fetch using the dynamic value of $c
Since the first argument of the sub()/gsub() functions is a regex, the captured value $c needs to be built up as \\$VARIABLE\\$
jqplay - Demo
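Assuming the sample above is saved as strings.json (minus the // comment line, since JSON itself has no comments), running the filter should produce:
{
  "some-label": "a string in English with a {variable}"
}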
Here's a basic JQ. Haven't tried with complex inputs, and haven't accommodated for $. I guess you can build on top of this -
to_entries | map(. as $kv | { "\($kv.key)": $kv.value.placeholders | to_entries | map(. as $p | $kv.value.message | sub("\\$\($p.key)\\$"; $p.value.content))[0]}) | add
output -
{
"some_label": "a string in English with a {variable}"
}
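If messages can contain several differently named placeholders, a sketch building on the same idea (assuming every entry has a placeholders object; strings.json is a hypothetical file name) is to reduce over the placeholder entries:
jq 'with_entries(
      .key |= gsub("_"; "-")
      | .value = reduce (.value.placeholders | to_entries[]) as $p
          (.value.message; sub("\\$\($p.key)\\$"; $p.value.content))
    )' strings.json
Each reduce step substitutes one $KEY$ token in the accumulated message with its placeholder content.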

grep for a particular string and count the number of fatals and errors

I have a file called violations.txt as below:
column1 column2 column3 column4 Situation
Data is preesnt | Bgn | Status (!) | There are no current runs | Critical level
Data is not existing | Nbgn | Status (*) | There are runs | Medium level
Data limit is exceeded | Gnp | Status (!) | The runs are not present | Higher level
Dats existing|present | Esp | Status (*) | The runs are present | Normal|Higher level
I need the output like this:
violations.txt:
Fatal:
Bgn : 1
Gnp : 1
Total number of fatals : 2
Errors:
Nbgn : 1
Esp : 1
Total number of errors : 2
What I am trying to do: if column3 of violations.txt contains Status (!), count the row as a fatal; if it contains Status (*), count it as an error; and report the counts of each.
I tried the code below but am not getting the expected output:
#!/bin/bash
pwd
echo " " ;
File="violations.txt"
for g in $File;
do
    awk -F' +\\| +'
    if "$3"== "Status (!) /" "$File" ; then
        'BEGIN{ getline; getline }
        truncate -s -1 "$File"
        echo "$g:";
        { a[$2]++ }
        END{ for(i in a){ print i, a[i]; s=s+a[i] };
        print "Total numer of fatals:", s}' violations.txt
    else
        echo "$g:";
        'BEGIN{ getline; getline }
        truncate -s -1 "$File"
        echo "$g:";
        { a[$2]++ }
        END{ for(i in a){ print i, a[i]; s=s+a[i] };
        print "Total numer of errors:", s}' violations.txt
    fi
done
Haven't we already covered this in a somewhat different reincarnation?
$ cat tst.awk
BEGIN {
    FS="[[:blank:]][|][[:blank:]]"
    OFS=" : "
}
FNR>1 {
    gsub(/[[:blank:]]/, "", $2)
    gsub(/[[:blank:]]/, "", $3)
    a[$3][$2]++
}
END {
    #PROCINFO["sorted_in"]="#ind_str_desc"
    print "Out" OFS
    for (i in a) {
        print (i~/!/ ? "Fatal" : "Error") OFS
        t=0
        for (j in a[i]) {
            print "\t" j, a[i][j]
            t+=a[i][j]
        }
        print "Total", t
        t=0
    }
}
running awk -f tst.awk myFile results in:
Out :
Fatal :
Gnp : 1
Bgn : 1
Total : 2
Error :
Esp : 1
Nbgn : 1
Total : 2
Could you please try the following, written and tested with the shown samples in https://ideone.com/rsVIV4.
awk '
BEGIN{
  FS="\\|"
}
FNR==1{ next }
/Status \(\!\)/{
  match($0,/\| +[a-zA-Z]+ +\| Status/)
  val=substr($0,RSTART,RLENGTH)
  gsub(/\| +| +\| Status/,"",val)
  countEr[val]++
  val=""
}
/Status \(\*\)/{
  match($0,/\| +[a-zA-Z]+ +\| Status/)
  val=substr($0,RSTART,RLENGTH)
  gsub(/\| +| +\| Status/,"",val)
  countSu[val]++
  val=""
}
END{
  print "Fatal:"
  for(i in countEr){
    print "\t"i,countEr[i]
    sumEr+=countEr[i]
  }
  print "Total number of fatals:" sumEr
  print "Errors:"
  for(i in countSu){
    print "\t"i,countSu[i]
    sumSu+=countSu[i]
  }
  print "Total number of errors:" sumSu
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
  FS="\\|" ##Setting field separator as | for all lines here.
}
FNR==1{ next } ##If FNR==1 (the header line), skip to the next line without doing anything.
/Status \(\!\)/{ ##Checking condition if line contains Status (!) then do following.
  match($0,/\| +[a-zA-Z]+ +\| Status/) ##Using match function to match pipe, spaces, letters, spaces, pipe, space and the Status string here.
  val=substr($0,RSTART,RLENGTH) ##Creating sub-string from current line here.
  gsub(/\| +| +\| Status/,"",val) ##Globally substituting pipe-space and the Status keyword with NULL in val here.
  countEr[val]++ ##Creating array countEr with index of val and incrementing its count by 1 here.
  val="" ##Nullifying val here.
}
/Status \(\*\)/{ ##Checking condition if line contains Status (*) then do following.
  match($0,/\| +[a-zA-Z]+ +\| Status/) ##Using match function to match pipe, spaces, letters, spaces, pipe, space and the Status string here.
  val=substr($0,RSTART,RLENGTH) ##Creating sub-string from current line here.
  gsub(/\| +| +\| Status/,"",val) ##Globally substituting pipe-space and the Status keyword with NULL in val here.
  countSu[val]++ ##Creating array countSu with index of val and incrementing its count by 1 here.
  val="" ##Nullifying val here.
}
END{ ##Starting END block of this program from here.
  print "Fatal:" ##Printing Fatal keyword here.
  for(i in countEr){ ##Traversing through countEr here.
    print "\t"i,countEr[i] ##Printing tab, i and value of countEr with index i here.
    sumEr+=countEr[i] ##Creating sumEr and keep adding value of countEr here.
  }
  print "Total number of fatals:" sumEr ##Printing string Total number of fatals: and value of sumEr here.
  print "Errors:" ##Printing Errors keyword here.
  for(i in countSu){ ##Traversing through countSu here.
    print "\t"i,countSu[i] ##Printing tab, i and value of countSu with index i here.
    sumSu+=countSu[i] ##Creating sumSu and keep adding value of countSu here.
  }
  print "Total number of errors:" sumSu ##Printing string Total number of errors: with value of sumSu here.
}
' Input_file ##Mentioning Input_file name here.
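Against the shown samples, the END block should print something like this (awk's for-in traversal order may vary):
Fatal:
        Bgn 1
        Gnp 1
Total number of fatals:2
Errors:
        Nbgn 1
        Esp 1
Total number of errors:2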
With GNU awk for various extensions and using the fact that your input is fixed-width fields:
$ cat tst.awk
BEGIN {
    FIELDWIDTHS="24 1 11 1 15 1 27 1 *"
}
NR>1 {
    type = ($5 ~ /!/ ? "Fatal" : "Error")
    keyTot[type][gensub(/\s/,"","g",$3)]++
    tot[type]++
}
END {
    for (type in tot) {
        print type ":"
        for (key in keyTot[type]) {
            print " " key " : " keyTot[type][key]
        }
        print "Total number of " type " : " tot[type]+0
    }
}
$ awk -f tst.awk file
Error:
Esp : 1
Nbgn : 1
Total number of Error : 2
Fatal:
Gnp : 1
Bgn : 1
Total number of Fatal : 2
Your file looks very badly formatted from a computer's point of view; let me explain why:
column1 column2 column3 column4 Situation
Data is preesnt | Bgn | Status (!) | There are no current runs | Critical level
Data is not existing | Nbgn | Status (*) | There are runs | Medium level
Data limit is exceeded | Gnp | Status (!) | The runs are not present | Higher level
Dats existing|present | Esp | Status (*) | The runs are present | Normal|Higher level
The headers of columns 1, 3 and 4 start at the same character position as the contents below them, but for columns 2 and 5 this is not the case.
You are using the pipe character "|" as a separator between your columns, but also as a separator within the columns themselves. This combination is very bad for automatic parsing based on the "|" character.
Therefore I have the following proposals for improving your file:
First, let's align the first characters of the column headings with their columns:
column1 column2 column3 column4 Situation
Data is preesnt | Bgn | Status (!) | There are no current runs | Critical level
Data is not existing | Nbgn | Status (*) | There are runs | Medium level
Data limit is exceeded | Gnp | Status (!) | The runs are not present | Higher level
Dats existing|present | Esp | Status (*) | The runs are present | Normal|Higher level
If you agree to this, you can use character counts (fixed column widths) for reading your columns.
Second, let's change the internal separator (replace it by a slash character):
column1 column2 column3 column4 Situation
Data is preesnt | Bgn | Status (!) | There are no current runs | Critical level
Data is not existing | Nbgn | Status (*) | There are runs | Medium level
Data limit is exceeded | Gnp | Status (!) | The runs are not present | Higher level
Dats existing/present | Esp | Status (*) | The runs are present | Normal/Higher level
Do you agree with my first or second proposal? If yes, please adapt your question (by adding the agreed proposal); this will make everything easier to handle.

Removing symbols and making a tab delimited file while keeping all the words after a certain string in one column

I have a file full of such lines:
>Mouse|chr9:95713136-95716028 | element 1367 | positive | hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[3/8] | other[7/8]
>Mouse|chr16:90449561-90451327 | element 1672 | positive | forebrain[4/8] | heart[6/8]
>Mouse|chr3:137446183-137449401 | element 4 | positive | heart[3/4]
What I want to get is something like this:
Mouse chr9 95713136 95716028 element 1367 positive hindbrain (rhombencephalon)[5/8]|midbrain (mesencephalon)[3/8]|other[7/8]
Such that all the words after "positive" are in one column of their own separated by a pipe, and all the columns are separated by tab.
This is what I did:
sed -E 's/ *[>\|:-] */\t/g' mouse_genome_vista1.txt > mouse_genome_vista2.txt
sed "s/^[ \t]*//" -i mouse_genome_vista2.txt
My output was like this:
Mouse chr9 95713136 95716028 element 1367 positive hindbrain (rhombencephalon)[5/8] midbrain (mesencephalon)[3/8] other[7/8]
Mouse chr16 90449561 90451327 element 1672 positive forebrain[4/8] heart[6/8]
Mouse chr3 137446183 137449401 element 4 positive heart[3/4]
It works if I have just one word after "positive": it'll be alone in its column. However, if I have more than one, I get multiple columns. For instance, hindbrain, midbrain, and other each end up in their own tab-delimited column; I want them pipe-separated in one column.
You may try this with perl or awk:
[|:-](?=.*positive)|positive\s+\K\|
Regex 101 Demo
Sample Perl solution (note that it operates on a multi-line string rather than a file):
use strict;
my $str = 'Mouse|chr9:95713136-95716028 | element 1367 | positive | hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[3/8] | other[7/8]
Mouse|chr16:90449561-90451327 | element 1672 | positive | forebrain[4/8] | heart[6/8]
Mouse|chr3:137446183-137449401 | element 4 | positive | heart[3/4]
';
my $regex = qr/[|:-](?=.*positive)|positive\s+\K\|/xmp;
my $subst = "\t"; # an actual tab character, not the two literal characters \ and t
my $result = $str =~ s/$regex/$subst/rg;
print $result;
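For comparison, here is a plain awk sketch (not part of the answer above; it assumes the input is saved as mouse_genome_vista1.txt and that every line follows the layout shown) that prints the tab-separated columns and re-joins everything after "positive" with pipes:
awk -F' *\\| *' -v OFS='\t' '{
    sub(/^>/, "", $1)             # drop the leading ">" before "Mouse"
    split($2, pos, /[:-]/)        # "chr9:95713136-95716028" -> chr, start, end
    rest = $5                     # first term after "positive"
    for (i = 6; i <= NF; i++)     # re-join any remaining terms with pipes
        rest = rest "|" $i
    print $1, pos[1], pos[2], pos[3], $3, $4, rest
}' mouse_genome_vista1.txt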

Command line to remove in-line dupes

What is fast and succinct way to remove dupes from within a line?
I have a file in the following format:
alpha • a | b | c | a | b | c | d
beta • h | i | i | h | i | j | k
gamma •  m | n | o
delta • p | p | q | r | s | q
So there's a headword in column 1, and then various words delimited by pipes, with an unpredictable amount of duplication. The desired output has the dupes removed, as:
alpha • a | b | c | d
beta • h | i | j | k
gamma •  m | n | o
delta • p | q | r | s
My input file is a few thousand lines. The Greek names above correspond to category names (e.g., "baseball"); and the alphabet corresponds English dictionary words (which might contain spaces or accents), e.g. "ball game | batter | catcher | catcher | designated hitter".
This could be programmed many ways, but I suspect there's a smart way to do it. I encounter variations of this scenario a lot, and wonder if there's a concise and elegant way to do this. I am using MacOS, so a few fancy unix options are not available.
Bonus complexity, I often have a comment at the end which should be retained, e.g.,
zeta • x | y | x | z | z ; comment here
P.S. this input is actually the output of a prior StackOverflow question:
Command line to match lines with matching first field (sed, awk, etc.)
BSD awk does not have the sort functions built in that GNU awk does, but I'm not sure they're necessary. The bullet, • (U+2022), causes some grief with awk, so I suggest pre-processing it to a single-byte character. I chose #, but you could use Control-A or something else if you prefer. I assume your data is in a file named data. I note that there was a double space before m in the gamma line; I'm assuming that isn't significant.
sed 's/•/#/' data |
awk -F ' *[#|] *' '
{
    delete names
    delete comments
    delete fields
    if ($NF ~ / *;/) { split($NF, comments, / *; */); $NF = comments[1] }
    j = 1
    for (i = 2; i <= NF; i++)
    {
        if (names[$i]++ == 0)
            fields[j++] = $i
    }
    printf("%s", $1)
    delim = "•"
    for (k = 1; k < j; k++)
    {
        printf(" %s %s", delim, fields[k])
        delim = "|"
    }
    if (comments[2])
        printf(" ; %s", comments[2])
    printf("\n")
}'
Running this yields:
alpha • a | b | c | d
beta • h | i | j | k
gamma • m | n | o
delta • p | q | r | s
zeta • x | y | z ; comment here
With bash, sort, xargs, sed:
while IFS='•;' read -r a b c; do
    IFS="|" read -ra array <<< "$b"
    array=( "${array[@]# }" )
    array=( "${array[@]% }" )
    readarray -t array < <(printf '%s\0' "${array[@]}" | sort -zu | xargs -0n1)
    SAVE_IFS="$IFS"; IFS="|"
    s="$a• ${array[*]}"
    [[ $c != "" ]] && s="$s ;$c"
    sed 's/|/ | /g' <<< "$s"
    IFS="$SAVE_IFS"
done < file
Output:
alpha • a | b | c | d
beta • h | i | j | k
gamma • m | n | o
delta • p | q | r | s
zeta • x | y | z ; comment here
I suppose the two spaces before "m" are a typo.
This might work for you (GNU sed):
sed 'h;s/.*• \([^;]*\).*/cat <<\\! | sort -u |\1|!/;s/\s*|\s*/\n/2ge;s/\n/ | /g;G;s/^\(.*\)\n\(.*• \)[^;]*/\2\1/;s/;/ &/' file
The sketch of the idea: remove the head and tail of each line, morph the remaining data into a mini file, use standard utilities to sort it and remove duplicates, then put the line back together again.
A copy of the line is held in the hold space while the id and any comment are removed. The data is munged into a here-document piped through sort (use sort | uniq if your sort lacks the -u option). The pattern space is then evaluated as a shell command, and the line is reassembled by appending the original line from the hold space and using regexp pattern matching.
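To make the trick concrete: for the alpha line, the first two substitutions leave (roughly) the following shell snippet in the pattern space; the e flag executes it, and its sorted, de-duplicated output becomes the new pattern space:
cat <<\! | sort -u
a
b
c
a
b
c
d
!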

How can I use sed/awk/perl to replace a matched pattern with an equivalent number of dashes?

I would like to search for all occurrences of a pattern in a file and replace the matches with equivalent-length padding such as spaces or dashes. It is important to note that I DO NOT WANT TO ALTER THE FILE! I would like to print the result to standard output, which is why I prefer sed. The output should be the same length as the input, since I want each pattern found by the regex replaced by that many dashes.
Example: Say the file contains the following:
data | more data | "to be dashed"
Desired Output:
data | more data | --------------
I currently have some thing like this:
sed -e 's/["][^"]*["]/-/g' file
which results in:
data | more data | -
Any Thoughts?
With Perl:
perl -pe 's/(".*?")/ "-" x length($1) /ge' <<END
data | more data | "to be dashed"
data | "more data" | "multi words " "to be dashed"
END
data | more data | --------------
data | ----------- | -------------- --------------
Since you need to find the string length of the matched text, you need to run the substitution part of s/// through a round of evaluation, hence the e flag.
Using GNU awk:
gawk 'BEGIN{ FS = "" }{ while (match($0, /^(.*)(["][^"]*["])(.*)$/, a)){ gsub(/./, "-", a[2]); $0 = a[1] a[2] a[3]; } } 1' file
Examples:
$ echo 'data | more data | "to be dashed"' | gawk 'BEGIN{ FS = "" }{ while (match($0, /^(.*)(["][^"]*["])(.*)$/, a)){ gsub(/./, "-", a[2]); $0 = a[1] a[2] a[3]; } } 1'
data | more data | --------------
$ echo 'data | more data | "to be dashed" x "1234"' | gawk 'BEGIN{ FS = "" }{ while (match($0, /^(.*)(["][^"]*["])(.*)$/, a)){ gsub(/./, "-", a[2]); $0 = a[1] a[2] a[3]; } } 1'
data | more data | -------------- x ------
A sed solution:
sed -r '
:loop
h # copy pattspace to holdspace
s/(.*)("[^"]+")(.*)/\1\n\3/ # replace quoted field with newline
T # if no replacement occurred, start next cycle
x # exchange pattspace and holdspace
s/.*("[^"]+").*/\1/ # isolate quoted field
s/./-/g # change all chars to dashes
G # append newline and holdspace to pattspace
s/(-*)\n(.*)\n(.*)/\2\1\3/ # reorder fields using newlines
t loop # repeat (must be conditional for T to work)
' file
OSX/BSD may not have the T command (jump to label (or next cycle) if substitution has not been made since last line read or last conditional jump). In that case, replace T with:
t keeplooping # branch over b if substitution occurred
b # unconditional branch to next cycle
:keeplooping