Files that have at most two ‘a’ characters in their full path - regex

I'm trying to figure out how to find files that have at most two 'a' characters in their full path with AWK.
The following is what I've come up with so far, but it's not doing the job.
BEGIN{}
{
if( match( $1, ".*[a].*[a].*[^a]+" ) )
print $1
}
END{}
It reads the file names with their full paths from a file called "data" created separately via the following command.
find / -name '*'
What should I modify?

The following is judged too short to be an answer on its own, but it's all I meant to write:
^[^a]*(a[^a]*(a[^a]*)?)?$
By the way, you don't need awk. grep -E would work fine.
But now that I think of it, if you are going to use awk, the following is even simpler:
awk '!/a.*a.*a/'
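A quick way to check both suggestions is to run them on a few made-up paths (the sample paths below are assumptions, not taken from the question). Both commands should keep only the lines with at most two a characters:

```shell
# Hypothetical sample paths; in the question these would come from the "data" file.
paths='/usr/bin/awk
/var/data/alpha
/var/tmp/a
/etc/hosts
/banana/a'

# Anchored regex: up to two "a"s, each surrounded only by non-"a" characters.
with_grep=$(printf '%s\n' "$paths" | grep -E '^[^a]*(a[^a]*(a[^a]*)?)?$')

# Negated match: print the lines that do NOT contain three "a"s.
with_awk=$(printf '%s\n' "$paths" | awk '!/a.*a.*a/')

printf '%s\n' "$with_grep"
```

Both variables should end up holding the same three paths: /usr/bin/awk, /var/tmp/a and /etc/hosts.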

You have three errors.
You need to include the start-of-line and end-of-line anchors ^ and $; otherwise an arbitrary prefix or suffix may contain more a characters.
You need to make the occurrences of a optional, by using parentheses and ?.
.* can contain a so you need to use [^a] to match the non-a characters.
The result would be a regular expression like:
^([^a]*a)?([^a]*a)?[^a]*$
Edit:
As Ed points out in the comments below his answer, if you pass the --re-interval flag to Awk, you can use intervals.
The expression would then be:
^([^a]*a){0,2}[^a]*$
This lets us say we want to find between 0 and 2 a characters.
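To see the effect of these fixes, here is a small sketch that runs the pattern from the question and the corrected one against two made-up paths. The helper name check and the sample paths are illustrative assumptions; grep -E is used only because it accepts the same extended syntax.

```shell
# The pattern from the question (unanchored, uses .*) and the corrected one.
broken='.*[a].*[a].*[^a]+'
fixed='^([^a]*a)?([^a]*a)?[^a]*$'

# Prints "match" or "no match" for a pattern ($1) and a path ($2).
check() {
    if printf '%s\n' "$2" | grep -Eq "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

three_as_broken=$(check "$broken" '/aaa/x')   # three "a"s, wrongly accepted
three_as_fixed=$(check "$fixed" '/aaa/x')     # three "a"s, rejected
no_as_broken=$(check "$broken" '/etc')        # no "a" at all, wrongly rejected
no_as_fixed=$(check "$fixed" '/etc')          # no "a" at all, accepted
echo "$three_as_broken / $three_as_fixed / $no_as_broken / $no_as_fixed"
```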

The correct solution is this:
awk '!/(.*a){3}/' file
or either of these if your awk doesn't support RE intervals:
awk 'gsub(/a/,"&") < 3' file
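awk 'split($0,x,/a/) < 4' file
(split returns the number of pieces, which is one more than the number of "a"s, hence 4 rather than 3.) So in either case, if you want to test for fewer than 17 "a"s you just change the number, for example 3 to 17: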
awk '!/(.*a){17}/' file
rather than writing:
awk '^[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*(a[^a]*)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?)?$'
or similar.
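The counting approach also makes it easy to pass the limit in as a parameter. A minimal sketch, assuming made-up sample paths and a hypothetical variable name max; it relies only on gsub, so it should not need interval support:

```shell
# Hypothetical sample input.
paths='/usr/bin/awk
/var/data/alpha
/var/tmp/a'

# gsub() returns the number of replacements; replacing "a" with "&" (itself)
# counts the "a"s without changing the line. "max" is an illustrative name.
at_most_two=$(printf '%s\n' "$paths" | awk -v max=2 'gsub(/a/, "&") <= max')
at_most_five=$(printf '%s\n' "$paths" | awk -v max=5 'gsub(/a/, "&") <= max')

printf '%s\n' "$at_most_two"
```

With max=2 the path /var/data/alpha (five "a"s) is dropped; with max=5 all three paths are kept.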

Related

how to replace repetitive string of variable length with another string in bash?

I have files where missing data is inserted as '+'. So lines look like this:
substring1+++++substring2++++++++++++++substring3+substring4
I want to replace every run of more than 5 '+' with 'MISSING'. This makes it more readable for my team and makes it easier to see the difference between missing data and data entered as '+' (up to 5 is allowed).
So far I have:
while read l; do
echo "${l//['([+])\1{5}']/'MISSING'}"
done < /path/file.txt
but this replaces every '+' with 'MISSING'. I need it to say 'MISSING' just once.
Thanks in advance.
You can't use regex in Bash variable expansion.
In your loop, you may use
sed 's/+\{1,\}/MISSING/g' <<< "$l"
Or, you may use sed directly on the file
sed 's/+\{1,\}/MISSING/g' /path/file.txt
The +\{1,\} POSIX BRE pattern matches a literal + (+) 1 or more times (\{1,\}).
A quick demonstration:
sed 's/+\{1,\}/MISSING/g' <<< "substring1+++++substring2++++++++++++++substring3+substring4"
# => substring1MISSINGsubstring2MISSINGsubstring3MISSINGsubstring4
If you need to change the file itself, use one of the techniques for editing a file in place with sed (for example, GNU sed's -i option).
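Note that the commands above replace every run of +, including a single one, while the question says that up to 5 consecutive + are legitimate data. A minimal variation, assuming "more than 5" means 6 or more and using a made-up input line, only changes the lower bound of the interval:

```shell
# Hypothetical input: runs of 5, 6 and 1 plus signs.
line='a+++++b++++++c+d'

# \{6,\} is a POSIX BRE interval: 6 or more of the preceding literal "+".
result=$(printf '%s\n' "$line" | sed 's/+\{6,\}/MISSING/g')

printf '%s\n' "$result"
```

Here only the run of six is replaced; the runs of five and one are left alone.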

Sed replace hyphen with underscore

I'm new to regex and have a problem. I want to replace hyphens with underscores in certain places in a file. To simplify things, let's say I want to replace the first hyphen. Here's an example "file":
dont-touch-these-hyphens
leaf replace-these-hyphens
I want to replace hyphens in all lines found by
grep -P "leaf \w+-" file
I tried
sed -i 's/leaf \(\w+\)-/leaf \1_/g' file
but nothing happens (wrong replacement would have been better than nothing). I've tried a few tweaks but still nothing. Again, I'm new to this so I figure the above "should basically work". What's wrong with it, and how do I get what I want? Thanks.
You can simplify things by using two distinct regexes: one for matching the lines that need processing, and one for matching what must be modified.
You can try something like this:
$ sed '/^leaf/ s/-/_/' file
dont-touch-these-hyphens
leaf replace_these-hyphens
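As for why the original attempt does nothing: sed uses POSIX basic regular expressions by default, where an unescaped + is a literal plus sign, so \w+ never matches this input. A minimal sketch of a corrected version of that command, assuming a sed that supports -E (extended regular expressions) and spelling \w as [[:alnum:]_] for portability:

```shell
input='dont-touch-these-hyphens
leaf replace-these-hyphens'

# With -E, "+" is a quantifier and plain parentheses form a capture group.
fixed=$(printf '%s\n' "$input" | sed -E 's/^(leaf [[:alnum:]_]+)-/\1_/')

printf '%s\n' "$fixed"
```

Only the first hyphen on the line starting with "leaf " is replaced; the other line is left untouched.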
Just use awk:
$ awk '$1=="leaf"{ sub(/-/,"_",$2) } 1' file
dont-touch-these-hyphens
leaf replace_these-hyphens
It gives you much more precise control over what you're matching (e.g. the above is doing a string instead of regexp comparison on "leaf" and so would work even if that string contained regexp metacharacters like . or *) and what you're replacing (e.g. the above only does the replacement in the text AFTER leaf and so would continue to work even if leaf itself contained -s):
$ cat file
dont-touch-these-hyphens
leaf-foo.*bar replace-these-hyphens
leaf-foobar dont-replace-these-hyphens
Correct output:
$ awk '$1=="leaf-foo.*bar"{ sub(/-/,"_",$2) } 1' file
dont-touch-these-hyphens
leaf-foo.*bar replace_these-hyphens
leaf-foobar dont-replace-these-hyphens
Wrong output:
$ sed '/^leaf-foo.*bar/ s/-/_/' file
dont-touch-these-hyphens
leaf_foo.*bar replace-these-hyphens
leaf_foobar dont-replace-these-hyphens
(note the "-" in leaf-foo being replaced by "_" in each of the last 2 lines, including the one that does not start with the string "leaf-foo.*bar").
That awk script will work as-is using any awk on any UNIX box.

Regex command line change format of each line

I have a file that contains lines in a format similar to this...
/data/file.geojson?10,20,30,40
/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-2.8482055664062496,54.38935426009769,-0.300750732421875,55.158473983815306
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
I've tried a combination of grep, sed, gawk, and |(pipes) to try and pattern match and then change the format to be more like this...
[10,40],[30,40],[30,20][10,20],
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979].....
Hopefully you get the idea from the first line so I don't have to type out all the examples manually!
I've got the hang of regex to match the co-ordinates. In fact the input file is the result of extracting from apache access logs. It might be easier to read/understand answers if they just match positive integer numbers, I will then be able to slot in a more complicated pattern to match the right range.
To arrange the results the way you wish, it is important to be able to access the last four values per line.
No pattern matching is required if you use awk. You can split the input strings by a set of delimiters and reassemble the resulting fields. 40 can be accessed as $(NF), 30 as $(NF-1) and so on.
awk -F'[?,=]' '
{printf "[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n",
$(NF-3),$(NF),$(NF-1),$(NF),
$(NF-1),$(NF-2),$(NF-3),$(NF-2)
}' file
I'm using ?, , or = as the field delimiters. This makes it simple to access the columns of interest.
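To illustrate why the fields are addressed from the end of the line, the sketch below prints the field count for a line with and without the bbox= prefix. The second sample line is a made-up combination of the two shapes in the question:

```shell
# Two line shapes: without and with the "bbox=" prefix.
plain='/data/file.geojson?10,20,30,40'
bbox='/data/file.geojson?bbox=10,20,30,40'

# Print the field count, then the first and last of the four trailing values.
plain_fields=$(printf '%s\n' "$plain" | awk -F'[?,=]' '{print NF, $(NF-3), $NF}')
bbox_fields=$(printf '%s\n' "$bbox" | awk -F'[?,=]' '{print NF, $(NF-3), $NF}')

printf '%s\n%s\n' "$plain_fields" "$bbox_fields"
```

The field count differs (5 versus 6) because "bbox" becomes a field of its own, but $(NF-3) and $NF still pick out 10 and 40 in both cases.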
Output:
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979],[3.0322265625,49.05227025601607],[-5.20751953125,49.05227025601607]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-2.8482055664062496,55.158473983815306],[-0.300750732421875,55.158473983815306],[-0.300750732421875,54.38935426009769],[-2.8482055664062496,54.38935426009769]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
Btw, also sed can be used here:
sed -r 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
The command is capturing the numbers at the end each in a separate capturing group and re-assembles them in the replacement part.
Not all versions of sed support the + quantifier. The most compatible version would look like this :)
sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
sed strips off items prior to numbers, then awk splits on comma and outputs in different order. Assuming data is in a file called "td.txt"
sed 's/^[^0-9-]*//' td.txt|awk -F, '{print "["$1","$4"],["$3","$4"],["$3","$2"],["$1","$2"],"}'
This might work for you (GNU sed):
sed -r 's/^.*\?[^-0-9]*([^,]*),([^,]*),([^,]*),([^,]*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
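Or with more toothpicks (the ? is left unescaped here because, in a basic regular expression, GNU sed treats \? as a quantifier rather than a literal question mark):
sed 's/^.*?[^-0-9]*\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file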
You can use the following to match:
(\/data\/file\.geojson\?(?:bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)
And replace with the following:
$1[$2,$3],[$4,$5]

bash grep regexp - excluding subpattern

I have a script written in bash, with one particular grep command I need to modify.
Generally I have two patterns: A & B. There is a textfile that can contain lines with all possible combinations of those patterns, that is:
"xxxAxxx", "xxxBxxx", "xxxAxxxBxxx", "xxxxxx", where "x" are any characters.
I need to match ALL lines APART FROM the ones containing ONLY "A".
At the moment, it is done with "grep -v (A)", but this is a false track, as this would exclude also lines with "xxxAxxxBxxx" - which are OK for me. This is why it needs modification. :)
The tricky part is that this one grep lies in the middle of a 'multiply-piped' command with many other greps, seds and awks inside. Thus forming a smarter pattern would be the best solution. Others would cause much additional work on changing other commands there, and even would impact another parts of the code.
Therefore, the question is: is there a possibility to match pattern and exclude a subpattern in one grep, but allow them to appear both in one line?
Example:
A file contains those lines:
fooTHISfoo
fooTHISfooTHATfoo
fooTHATfoo
foofoo
and I need to match
fooTHISfooTHATfoo
fooTHATfoo
foofoo
A line with only "THIS" is not allowed.
You can use this awk command:
awk '!(/THIS/ && !/THAT/)' file
fooTHISfooTHATfoo
fooTHATfoo
foofoo
Or by reversing the boolean expression:
awk '!/THIS/ || /THAT/' file
fooTHISfooTHATfoo
fooTHATfoo
foofoo
You want to match lines that contain B, or don't contain A. Equivalently, to delete lines containing A and not B. You could do this in sed:
sed -e '/A/{;/B/!d}'
Or in this particular case:
sed '/THIS/{/THAT/!d}' file
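The awk and sed variants above express the same condition, which can be checked on the sample from the question. In this sketch the sed script is split across several -e options, which is an assumption made for portability rather than the form used in the answer:

```shell
input='fooTHISfoo
fooTHISfooTHATfoo
fooTHATfoo
foofoo'

# "not (THIS and not THAT)" is the same as "not THIS, or THAT" (De Morgan).
awk_negated=$(printf '%s\n' "$input" | awk '!(/THIS/ && !/THAT/)')
awk_direct=$(printf '%s\n' "$input" | awk '!/THIS/ || /THAT/')

# sed: on lines containing THIS, delete the line unless it also contains THAT.
sed_delete=$(printf '%s\n' "$input" | sed -e '/THIS/{' -e '/THAT/!d' -e '}')

printf '%s\n' "$awk_direct"
```

All three variables should hold the same three lines, with only fooTHISfoo removed.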
Tricky for grep alone. However, replace that with an awk call: Filter out lines with "A" unless there is a "B"
echo "xxxAxxx
xxxBxxx
xxxAxxxBxxx
xxxBxxxAxxx
xxxxxx" | awk '!/A/ || /B/'
xxxBxxx
xxxAxxxBxxx
xxxBxxxAxxx
xxxxxx
A grep solution. It uses Perl-compatible regexps (-P) for negative lookaheads, which assert that a pattern does not match at a given position.
grep -Pv '^((?!THAT).)*THIS((?!THAT).)*$' file

Problem with regular expression using grep

I've got some textfiles that hold names, phone numbers and region codes. One combination per line.
The syntax is always "Name Region_code number"
With any number of spaces between the 3 variables.
What I want to do is search for specific region codes, like 23 or 493, for example.
The problem is that these numbers might appear in the longer numbers too, which might enable a return that shouldn't have been returned.
I was thinking of this sort of command:
grep '04' numbers.txt
But if I do that, a line that contains 04 in the number but not as region code will show as a result too... which is not correct.
I'm sure you are about to get buried in clever regular expressions, but I think in this case all you need to do is include one of the spaces on each side of your region code in the grep.
grep ' 04 ' numbers.txt
I'd do:
awk '$2 == "04"' < numbers.txt
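The difference from a plain grep can be seen on a few made-up lines (the names and numbers below are assumptions in the "Name Region_code number" format described in the question):

```shell
# Hypothetical sample data; "04" also occurs inside two of the phone numbers.
numbers='Alice   04   555123
Bob     23   555048
Carol   493  040404'

# Plain substring search: matches "04" anywhere on the line.
plain_grep=$(printf '%s\n' "$numbers" | grep '04')

# awk splits on runs of whitespace, so $2 is the region code whatever the spacing.
by_field=$(printf '%s\n' "$numbers" | awk '$2 == "04"')

printf '%s\n' "$by_field"
```

The plain grep returns all three lines, while the field comparison returns only the line for Alice.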
and with grep:
grep -e '^[^ ]*[ ][ ]*04[ ][ ]*[^ ]*$' numbers.txt
If you want region codes alone, you should use:
grep "[[:space:]]04[[:space:]]"
This way it only matches the number in the middle column, because the code must have whitespace on both sides; the start and end of a line do not count as whitespace here.
You can even do:
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" FILE
}
replacing FILE with the name of your file,
and use
search_region_codes 04
or even
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" "$2"
}
and using
search_region_codes NUMBER FILE
Are you searching for an entire region code, or a region code that contains the subpattern?
If you want the whole region code, and there is at least one space on either side, then you can format the grep by adding a single space on either side of the specific region code. There are other ways to indicate word boundaries using regular expressions.
grep ' 04 ' numbers.txt
If there can be spaces in the name or phone number fields, then that solution might not work. Also, if the pattern can be a sub-part of the region code, then awk is a better tool. This assumes that the 'name' field contains no spaces. The comparison operator '==' requires that the pattern exactly match the field. This can be tricky when there is whitespace on either side of the field.
awk '$2 == "04" {print $0}' < numbers.txt
If the file has a delimiter, that can be set with awk's '-F' argument, which sets the field separator character. In this example, a comma is used as the field separator. In addition, the matching operator in this example is '~', allowing the pattern to be any part of the region code (if that is applicable). The "\y" is a way to match word boundaries at the beginning and end of the expression (it is a GNU awk extension).
awk -F , '$2 ~ /\y04\y/ {print $0}' < numbers.txt
In both examples, the {print $0} is optional, if you want the full line to be printed. However, if you want to do any formatting on the output, that can be done inside that block.
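As a sketch of such formatting, the block can use printf to print only selected fields. The sample data and the output layout below are made up for illustration:

```shell
# Hypothetical sample data in "Name Region_code number" format.
numbers='Alice   04   555123
Bob     23   555048'

# Print only the name and number of matching lines, in a custom layout.
formatted=$(printf '%s\n' "$numbers" | awk '$2 == "04" { printf "%s: %s\n", $1, $3 }')

printf '%s\n' "$formatted"
```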
use word boundaries. not sure if this works in grep, but in other regex implementations i'd surround it with whitespace or word boundary patterns
'\s+04\s+' or '\b04\b'
Something like that