I have files where missing data is inserted as '+'. So lines look like this:
substring1+++++substring2++++++++++++++substring3+substring4
I wanna replace all repetitions of '+' >5 with 'MISSING'. This makes it more readable for my team and makes it easier to see the difference between missing data and data entered as '+' (up to 5 is allowed).
So far I have:
while read l; do
echo "${l//['([+])\1{5}']/'MISSING'}"
done < /path/file.txt
but this replaces every '+' with 'MISSING'. I need it to say 'MISSING' just once.
Thanks in advance.
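You can't use regex in Bash variable expansion; ${var//pattern/replacement} only accepts glob patterns, so backreferences and {5} quantifiers have no effect there.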
In your loop, you may use
sed 's/+\{6,\}/MISSING/g' <<< "$l"
Or, you may use sed directly on the file
sed 's/+\{6,\}/MISSING/g' /path/file.txt
The +\{6,\} POSIX BRE pattern matches a literal + (+) 6 or more times (\{6,\}), so runs of up to 5 are left untouched.
For example:
sed 's/+\{6,\}/MISSING/g' <<< "substring1+++++substring2++++++++++++++substring3+substring4"
# => substring1+++++substring2MISSINGsubstring3+substring4
If you need to change the file itself, use sed's in-place editing (for example, the -i option of GNU sed).
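Since the question allows up to five '+' as real data, the threshold can be made a parameter. The following is only a sketch: mark_missing is a hypothetical helper name, and it assumes a sed that supports POSIX interval expressions (\{n,\}).

```shell
# Hypothetical helper: reads lines on stdin and replaces every run of '+'
# longer than the allowed length with the word MISSING.
mark_missing() {
  allowed=${1:-5}        # longest run of '+' that still counts as data
  min=$((allowed + 1))   # shortest run that is treated as missing
  sed "s/+\{${min},\}/MISSING/g"
}

line='substring1+++++substring2++++++++++++++substring3+substring4'
printf '%s\n' "$line" | mark_missing 5
# => substring1+++++substring2MISSINGsubstring3+substring4
```

Piping the file through the function (mark_missing 5 < /path/file.txt) avoids the while-read loop entirely.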
new to regex and have a problem. I want to replace hyphens with underscores in certain places in a file. To simplify things, let's say I want to replace the first hyphen. Here's an example "file":
dont-touch-these-hyphens
leaf replace-these-hyphens
I want to replace hyphens in all lines found by
grep -P "leaf \w+-" file
I tried
sed -i 's/leaf \(\w+\)-/leaf \1_/g' file
but nothing happens (wrong replacement would have been better than nothing). I've tried a few tweaks but still nothing. Again, I'm new to this so I figure the above "should basically work". What's wrong with it, and how do I get what I want? Thanks.
You can simplify things by using two distinct regexes: one for matching the lines that need processing, and one for matching what must be modified.
You can try something like this:
$ sed '/^leaf/ s/-/_/' file
dont-touch-these-hyphens
leaf replace_these-hyphens
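As for why the original attempt did nothing: sed uses POSIX basic regular expressions by default, and there an unescaped + is a literal plus sign rather than a quantifier. A small sketch of the difference follows. It swaps \w (a GNU extension) for the bracket expression [[:alnum:]_], which is my substitution and not part of the question.

```shell
line='leaf replace-these-hyphens'

# Basic regex: "+" is a literal plus sign here, so nothing matches
# and the line comes back unchanged.
unchanged=$(printf '%s\n' "$line" | sed 's/leaf \([[:alnum:]_]+\)-/leaf \1_/')

# Basic regex with the interval \{1,\}, which means "one or more".
fixed=$(printf '%s\n' "$line" | sed 's/leaf \([[:alnum:]_]\{1,\}\)-/leaf \1_/')

printf '%s\n%s\n' "$unchanged" "$fixed"
# => leaf replace-these-hyphens
# => leaf replace_these-hyphens
```

With GNU sed, writing \+ or switching to extended syntax with -E should have the same effect as the interval form.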
Just use awk:
$ awk '$1=="leaf"{ sub(/-/,"_",$2) } 1' file
dont-touch-these-hyphens
leaf replace_these-hyphens
It gives you much more precise control over what you're matching (e.g. the above is doing a string instead of regexp comparison on "leaf" and so would work even if that string contained regexp metacharacters like . or *) and what you're replacing (e.g. the above only does the replacement in the text AFTER leaf and so would continue to work even if leaf itself contained -s):
$ cat file
dont-touch-these-hyphens
leaf-foo.*bar replace-these-hyphens
leaf-foobar dont-replace-these-hyphens
Correct output:
$ awk '$1=="leaf-foo.*bar"{ sub(/-/,"_",$2) } 1' file
dont-touch-these-hyphens
leaf-foo.*bar replace_these-hyphens
leaf-foobar dont-replace-these-hyphens
Wrong output:
$ sed '/^leaf-foo.*bar/ s/-/_/' file
dont-touch-these-hyphens
leaf_foo.*bar replace-these-hyphens
leaf_foobar dont-replace-these-hyphens
(note the "-" in leaf-foo being replaced by "_" in each of the last 2 lines, including the one that does not start with the string "leaf-foo.*bar").
That awk script will work as-is using any awk on any UNIX box.
I have a file that contains lines in a format similar to this...
/data/file.geojson?10,20,30,40
/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-2.8482055664062496,54.38935426009769,-0.300750732421875,55.158473983815306
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
I've tried a combination of grep, sed, gawk, and |(pipes) to try and pattern match and then change the format to be more like this...
[10,40],[30,40],[30,20],[10,20],
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979].....
Hopefully you get the idea from the first line so I don't have to type out all the examples manually!
I've got the hang of regex to match the co-ordinates. In fact the input file is the result of extracting from apache access logs. It might be easier to read/understand answers if they just match positive integer numbers, I will then be able to slot in a more complicated pattern to match the right range.
To arrange the results the way you wish, it is important to be able to access the last four values per line.
No pattern matching is required if you use awk. You can split the input strings by a set of delimiters and reassemble the resulting fields. 40 can be accessed as $(NF), 30 as $(NF-1) and so on.
awk -F'[?,=]' '
{printf "[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n",
$(NF-3),$(NF),$(NF-1),$(NF),
$(NF-1),$(NF-2),$(NF-3),$(NF-2)
}' file
I'm using ?, , or = as the field delimiters. This makes it simple to access the columns of interest.
Output:
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979],[3.0322265625,49.05227025601607],[-5.20751953125,49.05227025601607]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-2.8482055664062496,55.158473983815306],[-0.300750732421875,55.158473983815306],[-0.300750732421875,54.38935426009769],[-2.8482055664062496,54.38935426009769]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
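To see why counting from the end works whether or not the bbox= prefix is present, it can help to print the field count next to the fields. This is just an illustrative sketch; show_fields is a hypothetical name and the second input line is made up.

```shell
# Print the number of fields, then the first and last of the four trailing numbers.
show_fields() {
  awk -F'[?,=]' '{ print NF, $(NF-3), $NF }'
}

printf '%s\n' '/data/file.geojson?10,20,30,40' \
              '/data/file.geojson?bbox=1,2,3,4' | show_fields
# => 5 10 40
# => 6 1 4
```

The extra "bbox" field shifts NF from 5 to 6, but $(NF-3) through $NF still point at the four numbers.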
Btw, also sed can be used here:
sed -r 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
The command is capturing the numbers at the end each in a separate capturing group and re-assembles them in the replacement part.
Not all versions of sed support the + quantifier. The most compatible version would look like this :)
sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
sed strips off items prior to numbers, then awk splits on comma and outputs in different order. Assuming data is in a file called "td.txt"
sed 's/^[^0-9-]*//' td.txt|awk -F, '{print "["$1","$4"],["$3","$4"],["$3","$2"],["$1","$2"],"}'
This might work for you (GNU sed):
sed -r 's/^.*\?[^-0-9]*([^,]*),([^,]*),([^,]*),([^,]*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
Or with more toothpicks:
sed 's/^.*?[^-0-9]*\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
You can use the following to match:
(\/data\/file\.geojson\?(?:bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)
And replace with the following:
$1[$2,$3],[$4,$5]
I have a script written in bash, with one particular grep command I need to modify.
Generally I have two patterns: A & B. There is a textfile that can contain lines with all possible combinations of those patterns, that is:
"xxxAxxx", "xxxBxxx", "xxxAxxxBxxx", "xxxxxx", where "x" are any characters.
I need to match ALL lines APART FROM the ones containing ONLY "A".
At the moment, it is done with "grep -v (A)", but this is a false track, as this would exclude also lines with "xxxAxxxBxxx" - which are OK for me. This is why it needs modification. :)
The tricky part is that this grep sits in the middle of a 'multiply-piped' command with many other greps, seds and awks inside, so forming a smarter pattern would be the best solution. Other approaches would mean a lot of extra work changing the other commands there, and would even affect other parts of the code.
Therefore, the question is: is there a possibility to match pattern and exclude a subpattern in one grep, but allow them to appear both in one line?
Example:
A file contains those lines:
fooTHISfoo
fooTHISfooTHATfoo
fooTHATfoo
foofoo
and I need to match
fooTHISfooTHATfoo
fooTHATfoo
foofoo
a line with "THIS" but without "THAT" is not allowed.
You can use this awk command:
awk '!(/THIS/ && !/THAT/)' file
fooTHISfooTHATfoo
fooTHATfoo
foofoo
Or, equivalently, by applying De Morgan's law to the boolean expression:
awk '!/THIS/ || /THAT/' file
fooTHISfooTHATfoo
fooTHATfoo
foofoo
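If you want to convince yourself that the two forms really select the same lines, a quick check along these lines should do. It is only a sketch, using the sample lines from the question held in a shell variable instead of a file.

```shell
sample='fooTHISfoo
fooTHISfooTHATfoo
fooTHATfoo
foofoo'

# Form 1: drop lines that contain THIS without THAT.
first=$(printf '%s\n' "$sample" | awk '!(/THIS/ && !/THAT/)')
# Form 2: keep lines that lack THIS or contain THAT.
second=$(printf '%s\n' "$sample" | awk '!/THIS/ || /THAT/')

[ "$first" = "$second" ] && printf 'same output\n'
# => same output
```

Both commands read standard input when no file is given, so either one can take the place of the grep -v in the middle of the existing pipeline.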
You want to match lines that contain B, or don't contain A. Equivalently, to delete lines containing A and not B. You could do this in sed:
sed -e '/A/{;/B/!d}'
Or in this particular case:
sed '/THIS/{/THAT/!d}' file
Tricky for grep alone. However, replace that with an awk call: Filter out lines with "A" unless there is a "B"
echo "xxxAxxx
xxxBxxx
xxxAxxxBxxx
xxxBxxxAxxx
xxxxxx" | awk '!/A/ || /B/'
xxxBxxx
xxxAxxxBxxx
xxxBxxxAxxx
xxxxxx
grep solution. It uses Perl-compatible regexps (-P) for negative lookaheads, which assert that something does not follow at the current position.
grep -Pv '^((?!THAT).)*THIS((?!THAT).)*$' file
I've got some textfiles that hold names, phone numbers and region codes. One combination per line.
The syntax is always "Name Region_code number"
With any number of spaces between the 3 variables.
What I want to do is search for specific region codes, like 23 or 493, for example.
The problem is that these numbers might appear in the longer numbers too, which might enable a return that shouldn't have been returned.
I was thinking of this sort of command:
grep '04' numbers.txt
But if I do that, a line that contains 04 in the number but not as region code will show as a result too... which is not correct.
I'm sure you are about to get buried in clever regular expressions, but I think in this case all you need to do is include one of the spaces on each side of your region code in the grep.
grep ' 04 ' numbers.txt
I'd do:
awk '$2 == "04"' < numbers.txt
and with grep:
grep -e '^[^ ]*[ ][ ]*04[ ][ ]*[^ ]*$' numbers.txt
If you want region codes alone, you should use:
grep "[[:space:]]04[[:space:]]"
This way it only matches numbers in the middle column: [[:space:]] requires a real whitespace character on both sides, whereas a word boundary such as \b would also match at the start or end of the line.
You can even do:
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" FILE
}
replacing FILE with the name of your file,
and use
search_region_codes 04
or even
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" "$2"
}
and using
search_region_codes NUMBER FILE
Are you searching for an entire region code, or a region code that contains the subpattern?
If you want the whole region code, and there is at least one space on either side, then you can format the grep by adding a single space on either side of the specific region code. There are other ways to indicate word boundaries using regular expressions.
grep ' 04 ' numbers.txt
If there can be spaces in the name or phone number fields, then that solution might not work. Also, if the pattern can be a sub-part of the region code, then awk is a better tool. This assumes that the 'name' field contains no spaces. The matching operator '==' requires that the pattern exactly match the field. This can be tricky when there is whitespace on either side of the field.
awk '$2 == "04" {print $0}' < numbers.txt
If the file has a delimiter, that can be set using the '-F' argument to awk, which sets the field separator character. In this example, a comma is used as the field separator. In addition, the matching operator in this example is a '~', allowing the pattern to be any part of the region code (if that is applicable). The "\y" is a gawk-specific way to match word boundaries at the beginning and end of the expression.
awk -F , '$2 ~ /\y04\y/ {print $0}' < numbers.txt
In both examples, the {print $0} is optional, if you want the full line to be printed. However, if you want to do any formatting on the output, that can be done inside that block.
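If the region code comes from a shell variable rather than being typed into the awk program, one catch is that awk may compare two number-like values numerically, so 04 would also match 4. A possible sketch, with search_region_codes as a hypothetical helper and made-up sample data:

```shell
# Hypothetical helper: print lines of FILE ($2) whose second field equals CODE ($1).
search_region_codes() {
  # Appending "" turns code into a plain string, which forces a string
  # comparison, so 04 does not match 4.
  awk -v code="$1" '$2 == (code "")' "$2"
}

# Made-up sample data, for illustration only.
tmp=$(mktemp)
printf '%s\n' 'Alice 04 5550104' 'Bob 23 0412345' \
              'Carol 4 0404040' 'Dave   04    999' > "$tmp"
search_region_codes 04 "$tmp"
# => Alice 04 5550104
# => Dave   04    999
rm -f "$tmp"
```

Because awk splits on runs of whitespace by default, the varying number of spaces between columns does not matter here.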
Use word boundaries. With GNU grep, \b works as-is and the \s+ form needs -E; in other regex implementations I'd likewise surround it with whitespace or word boundary patterns:
'\s+04\s+' or '\b04\b'
Something like that