I am trying to use TextWrangler to take a bunch of text files, match everything within some angle-bracket tags (so far so good), and for every match, substitute all occurrences of a specific character with another.
For instance, I'd like to take something like
xx+xx <f>bar+bar+fo+bar+fe</f> yy+y <f>fee+bar</f> zz
match everything within <f> and </f> and then substitute all +'s with, say, *'s (but ONLY inside the "f" tag).
xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
I think I can easily match "f" tags containing +'s with an expression like
<f>[^<]*\+[^<]*</f>
but I have no idea on how to substitute only a subclass of character for each match. I don't know a priori how many +'s there are in each tag.
I think I should run a regular expression for all matches of the first regular expression, but I am not really sure how to do that.
(In other words, I would like to match all +'s but only inside specific angle-bracket tags).
Does anyone have a hint?
Thanks a lot,
Daniele
In case you're OK with an awk solution:
$ awk '{
while ( match($0,/<f>[^<]*\+[^<]*<\/f>/) ) {
tgt = substr($0,RSTART,RLENGTH)
gsub(/\+/,"*",tgt)
$0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
print
}' file
xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
The above will work using any awk in any shell on any UNIX box. It relies on there being no < within each <f>...</f> as indicated by your sample code. If there can be then include that in your example and we can tweak the script to handle it:
$ awk '{
gsub("</f>",RS)
while ( match($0,/<f>[^\n]*\+[^\n]*\n/) ) {
tgt = substr($0,RSTART,RLENGTH)
gsub(/\+/,"*",tgt)
$0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
gsub(RS,"</f>")
print
}' file
xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
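If sed is more convenient than awk, the same idea can be sketched as a loop that converts one + per pass. This is only an alternative sketch, not the method above: the function name plus_to_star_in_f is made up for illustration, and it makes the same assumption that no < appears between <f> and </f>.

```shell
# Hypothetical helper name; reads stdin or the files given as arguments.
plus_to_star_in_f() {
  # Replace the first remaining "+" inside an <f>...</f> pair, then branch
  # back to label "a" (the "t" command) for as long as a substitution succeeds.
  sed -e ':a' \
      -e 's/\(<f>[^<+]*\)+\([^<]*<\/f>\)/\1*\2/' \
      -e 'ta' "$@"
}

printf '%s\n' 'xx+xx <f>bar+bar+fo+bar+fe</f> yy+y <f>fee+bar</f> zz' | plus_to_star_in_f
# prints: xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
```

The pattern is a basic regular expression, so the bare + is a literal plus sign and the +'s outside the tags are never touched.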
Related
I have an old app that generates something like:
USERLIST (
"jasonr"
"jameso"
"tommyx"
)
ROLELIST (
"op"
"admin"
"ro"
)
I need some form of regex that changes ONLY the USERLIST section to USERLIST("jasonr", "jameso", "tommyx") while the rest of the text remains intact:
USERLIST("jasonr", "jameso", "tommyx")
ROLELIST (
"op"
"admin"
"ro"
)
In addition to the multiline issue, I don't know how to handle the replacement in only part of the string. I've tried perl (-0pe) and sed, can't find a solution. I don't want to write an app to do this, surely there is a way...
perl -0777 -wpe'
s{USERLIST\s*\(\K ([^)]+) }{ join ", ", $1 =~ /("[^"]+")/g }ex' file
Prints the desired output on the shown input file. Broken over lines for easier view.
With -0777 switch the whole file is read at once into a string ("slurped") and is thus in $_. With /x modifier literal spaces in the pattern are ignored so can be used for readability.
Explanation
Capture what follows USERLIST (, up to the first closing parenthesis. This assumes there is no such paren inside USERLIST( ... ). With the \K lookbehind, everything matched before it stays in the string (is not "consumed") and is excluded from $&, so we don't have to re-enter it on the replacement side.
The replacement side is evaluated as code, courtesy of the /e modifier. In it we capture all double-quoted substrings from the initial $1 capture (assuming no nested quotes) and join that list with ", ". The resulting string then replaces what was in the parentheses following USERLIST.
With your shown samples in GNU awk please try following awk code.
awk -v RS='(^|\n)USERLIST \\(\n[^)]*\\)\n' '
RT{
sub(/[[:space:]]+\(\n[[:space:]]+/,"(",RT)
sub(/[[:space:]]*\n\)\n/,")",RT)
gsub(/"\n +"/,"\", \"",RT)
print RT
}
END{
printf("%s",$0)
}
' Input_file
Explanation: RS (the record separator) is set to the regex (^|\n)USERLIST \\(\n[^)]*\\)\n, so the whole USERLIST block is consumed as a separator and ends up in RT. In the main block, whenever RT is not empty, the first sub replaces [[:space:]]+\(\n[[:space:]]+ with (, the second sub replaces [[:space:]]*\n\)\n with ), and the gsub replaces every "\n +" with ", "; RT is then printed. Finally, the END block prints the last record ($0) with printf to output the rest of the file.
Output will be as follows:
USERLIST("jasonr", "jameso", "tommyx")
ROLELIST (
"op"
"admin"
"ro"
)
This might work for you (GNU sed):
sed '/USERLIST/{:a;N;/^)$/M!ba;s/(\n\s*/(/;s/\n)/)/;s/\n\s*/, /g}' file
If a line contains USERLIST, gather up the list and format as required.
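For boxes without GNU sed or GNU awk, the same "gather up the list" idea can be sketched in plain awk with a flag. This is only a sketch under assumptions: the function name flatten_userlist is made up, the opening line is assumed to start with USERLIST followed by (, and the closing ) is assumed to sit on a line of its own.

```shell
# Hypothetical helper name; reads stdin or the files given as arguments.
flatten_userlist() {
  awk '
    # Opening line of the block: start collecting items.
    /^USERLIST[ \t]*\(/ { inlist = 1; items = ""; next }
    # Closing parenthesis: emit the collected items on one line.
    inlist && /^[ \t]*\)/ { print "USERLIST(" items ")"; inlist = 0; next }
    # Inside the block: trim the line and append it to the list.
    inlist {
      gsub(/^[ \t]+|[ \t]+$/, "")
      if ($0 != "") items = (items == "" ? $0 : items ", " $0)
      next
    }
    # Everything else passes through unchanged.
    { print }
  ' "$@"
}
```

Running flatten_userlist file on the sample should leave the ROLELIST block untouched, since only lines between the USERLIST opener and its closing parenthesis are rewritten.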
I have a large dictionary file that contains one word per line.
I want to extract all lines that contain only one kind of vowel, so "see" and "best" and "levee" and "whenever" would be extracted, but "like" or "house" or "and" wouldn't. It's fine for me having to go over the file a few times, changing the vowel I'm looking for each time.
This command: grep -io '\b[eqwrtzpsdfghjklyxcvbnm]*\b' dictionary.txt
returns no words containing any other vowels but E, but it also gives me words like BBC or BMW. How can I make the contained vowel a requirement?
How about
grep -i '^[^aiou]*e[^aiou]*$'
?
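Since going over the file once per vowel is acceptable, the pattern can be generated instead of retyped. A minimal sketch, assuming a made-up helper name only_vowel and the five plain vowels a, e, i, o, u:

```shell
# Hypothetical helper: only_vowel VOWEL [FILE...]
only_vowel() {
  v=$1
  shift
  # The four vowels that must not appear, e.g. "aiou" when v is "e".
  others=$(printf 'aeiou' | tr -d "$v")
  # Same shape as the grep above: no other vowel before or after one required vowel.
  grep -i "^[^$others]*$v[^$others]*\$" "$@"
}

printf 'see\nbest\nlevee\nwhenever\nlike\nhouse\nand\nBBC\nBMW\n' | only_vowel e
# prints: see, best, levee, whenever (one per line)
```

Calling it in a loop over a e i o u with dictionary.txt as the file argument would give one result set per vowel.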
Here is an Awk attempt which collects all the hits in a single pass over the input file, then prints each bucket.
awk 'BEGIN { split("a:e:i:o:u", vowel, ":")
c = "[b-df-hj-np-tv-z]"
for (v in vowel)
regex = (regex ? regex "|" : "") "^" c "*" vowel[v] c "*(" vowel[v] c "*)*$" }
$0 ~ regex { for (v in vowel) if ($0 ~ vowel[v]) {
hit[v] = ( hit[v] ? hit[v] ORS : "") $0
next } }
END { for (v in vowel) {
printf "=== %s ===\n", vowel[v]
print hit[v] } }' /usr/share/dict/words
You'll notice that it prints words with syllabic y like jolly and cycle. A more complex regex should fix that, though the really thorny cases (like rhyme) need a more sophisticated model of English orthography.
The regex is clumsy because Awk does not support backreferences; an earlier version of this answer contained a simpler regex which would work with grep -E or similar, but would then collect all matches in the same bucket.
Demo: https://ideone.com/wNrvPu
Using -P (perl) option:
^(?=.*e)[^aiou]+$
Explanation:
^ # beginning of line
(?=.*e) # positive lookahead, make sure we have at least 1 "e"
[^aiou]+ # 1 or more characters that are not a, i, o or u
$ # end of line
cat file.txt
see
best
levee
whenever
like
house
and
BBC
BMW
grep -P '^(?=.*e)[^aiou]+$' file.txt
see
best
levee
whenever
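Where grep lacks -P, the lookahead can be read as two independent conditions: the line contains an e, and it contains none of a, i, o, u. A rough awk sketch of that reading follows; the function name only_e is made up, and unlike the grep -P command above it lowercases the line first, so it ignores case.

```shell
# Hypothetical helper name; reads stdin or the files given as arguments.
only_e() {
  # Condition 1: at least one "e". Condition 2: no other vowel anywhere.
  awk 'tolower($0) ~ /e/ && tolower($0) !~ /[aiou]/' "$@"
}

printf 'see\nbest\nlevee\nwhenever\nlike\nhouse\nand\nBBC\nBMW\n' | only_e
# prints: see, best, levee, whenever (one per line)
```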
I am writing a shell script which calls an awk script and then I take some user input in the BEGIN using getline, and I save the input to some variables.
BEGIN {printf "What's the word?"
getline word < "-"
}
Now, one of these variables is called "word" and I want to use it in another pattern in the script to print all lines containing the word given. I tried something like this:
/(^| )word( |$)/
which will print all lines containing the word "word", and I know that it's not gonna work because it's not recognized as being a variable. I'd searched a lot and found patterns starting with
$0~
but it's not working either in my case. Is there a way I could pass a variable to this pattern and print all lines containing the word stored in the variable?
If you use a BEGIN section to build a variable with your full pattern, you can refer to it later:
awk -v word="hello" '
BEGIN {
pattern = "(^|[[:space:]])" word "([[:space:]]|$)"
}
$0 ~ pattern { print $0 }
'
...that said, you don't even need to do that, if you don't mind the overhead of reconstructing the pattern for every line:
awk -v word="hello" '$0 ~ ("(^|[[:space:]])" word "([[:space:]]|$)") { print $0 }'
(Why [[:space:]] instead of a literal space? That way tabs and other whitespace characters besides the plain hex-20 space can also act as word separators.)
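One caveat with building the pattern from user input: regex metacharacters in the word (a dot, for instance) are interpreted as regex. If the word should be taken literally, comparing whitespace-separated fields is a possible alternative. A sketch, assuming a made-up helper name lines_with_word and a word without backslashes (awk -v processes escape sequences in the value):

```shell
# Hypothetical helper: lines_with_word WORD [FILE...]
lines_with_word() {
  w=$1
  shift
  awk -v word="$w" '{
    for (i = 1; i <= NF; i++) {
      # Appending "" forces a string comparison rather than a numeric one.
      if (($i "") == word) { print; next }
    }
  }' "$@"
}

printf 'hello world\nhelloworld\nsay hello\na.b c\naxb c\n' | lines_with_word a.b
# prints: a.b c
```

With the regex version the word a.b would also match axb; here only the exact field does.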
Another alternative is the word-boundary operator, which you can embed in the variable with some backslash escaping:
awk -v v="\\\y$word\\\y" '$0~v' file
Not all awks support this, though: \y is a GNU awk extension. With gawk you can alternatively use \< and \> for the left and right boundaries.
Sorry for the nth simple question on regexps, but I can't get what I need without what seems to me an overly complicated solution. I'm parsing a file containing sequences of only 3 letters, A, E and D, as in
AADDEEDDA
EEEEEEEE
AEEEDEEA
AEEEDDAAA
and I'd like to identify only those that start with E and end in D, with only one change in the sequence, as for example in
EDDDDDDDD
EEEDDDDDD
EEEEEEEED
I'm fighting with the proper regexp to do that. Here my last attempt
echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E[(ED){1,1}]*D$/ && $2 !~ /^E[(ED){2,}]*D$/) print $0}'
which does not work. Any help?
Thanks in advance.
If I understand your request correctly, a simple
awk '/^E+D+$/' file.input
will do the trick.
UPDATE: if each line also carries leading and trailing numbers (the trailing one optional), as shown in the example, a pure-regex adaptation is possible (as an alternative to the field-separator switch -F,):
awk '/^[0-9]+,E+D+(,[0-9]+)?$/' input.test
First of all, you need the regular expression:
^E+[^ED]*D+$
This matches one or more Es at the beginning, zero or more characters that are neither E nor D in the middle, and one or more Ds at the end.
Then your AWK program will look like
$2 ~ /^E+[^ED]*D+$/
$2 refers to the 2nd field of the current record, ~ is the regex matching operator, and /s delimit a regular expression. Together, these components form what is known in AWK jargon as a "pattern", which amounts to a boolean filter for input records. Note that there is no "action" (a series of statements in {s) specified here. That's because when no action is specified, AWK assumes that the action should be { print $0 }, which prints the entire line.
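To see what the pattern-only program does in practice, here is a small run on the question's sample data. The fourth row (EEAADD) is an assumption added purely for illustration: it shows that the middle class [^ED]* tolerates a run of A between the Es and the Ds, whereas the stricter ^E+D+$ allows exactly one change.

```shell
sample='1,AAEDDEED,1
2,EEEEDDDD,2
3,EDEDEDED
4,EEAADD,4'

# Pattern from the answer: a run of other letters is tolerated in the middle.
loose=$(printf '%s\n' "$sample" | awk -F, '$2 ~ /^E+[^ED]*D+$/')
# Stricter variant: Es followed directly by Ds.
strict=$(printf '%s\n' "$sample" | awk -F, '$2 ~ /^E+D+$/')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
# loose prints rows 2 and 4; strict prints row 2 only
```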
If I understand you correctly, you want to match patterns that start with at least one E and then continue with at least one D until the end.
printf '1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED\n' | gawk -F, '{if($2 ~ /^E+D+$/) print $0}'
I want to print all the lines of a file where the first element of each line begins with a number using awk. Below are the details on the data contained in the file and command used:
filename contents:
12.44.4444goad ABCDEF/END
LMNOP/START joker
98.0 kites
command used:
awk '{ $1 ~ /^\d[a-zA-Z0-9]*/ }' filename
After running the above command, no results are displayed on the prompt.
Please let me know if there is any correction that needs to be made to the above command.
To print the lines starting with a digit, you can try the following:
awk '/^[[:digit:]]+/' file
As pointed out by #HenkLangeveld, your syntax is incorrect. Also, the regex \d is not available in awk.
If you only need to match at least one digit at the start of the line, all you need is ^ to match the start of a line and [0-9] to match a digit.
You can use curly brackets with an if statement:
awk '{if($1 ~ /^[0-9]/) print $0}' filename
But that would just be longhand for this:
awk '$1 ~ /^[0-9]/' filename
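A quick check on the question's three sample lines, fed through printf here instead of a file, suggests that both spellings select the same lines:

```shell
sample='12.44.4444goad ABCDEF/END
LMNOP/START joker
98.0 kites'

# Explicit if inside a block.
longhand=$(printf '%s\n' "$sample" | awk '{if($1 ~ /^[0-9]/) print $0}')
# Bare pattern, relying on the default action.
shorthand=$(printf '%s\n' "$sample" | awk '$1 ~ /^[0-9]/')

printf '%s\n' "$shorthand"
# prints the "12.44.4444goad ..." and "98.0 kites" lines
```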
From your attempted solution, it looks like you want:
awk 'NF>1 && $1 ~ /^[0-9.]*$/' filename
You need to explicitly match the . if you want to include the decimal point, and you need the $ anchor to make the * meaningful. This will miss lines in which the first column looks like 5e39 or -2.3. You can try to catch those cases with:
awk 'NF>1 && $1 ~ /^-?[0-9.]*(e[0-9]+)?$/' filename
but at this point I would tell you to use perl and stop trying to be more robust with awk.
Perhaps (this will print blank lines...not sure which behavior you want):
perl -lane 'use POSIX qw(strtod); my ($num, $end) = strtod($F[0]);
print unless $end;' filename
This uses strtod to parse the number and tells you the number of characters at the end of the string that are not part of it.
Drop the braces and the \d, like this:
awk ' $1 ~ /^[0-9]/ ' filename
Awk programs come in chunks. A chunk is a pattern-block pair, where the block defaults to { print }. (An empty pattern defaults to true.)
The /\d/ is a perl-ism and might work in some versions of awk, though not in those that I tried*. You need either the traditional /^[0-9]/ or the POSIX /^[[:digit:]]/ notation.
* gnu and ast
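The pattern/block split also explains why the original command stayed silent. A small sketch on the question's sample lines (variable names are arbitrary): with the test inside the braces, the match is evaluated as a statement and its result is thrown away, so nothing is printed.

```shell
sample='12.44.4444goad ABCDEF/END
LMNOP/START joker
98.0 kites'

# Test inside the block: evaluated for every line, result discarded, no output.
inside=$(printf '%s\n' "$sample" | awk '{ $1 ~ /^[0-9]/ }')
# Test as the pattern, no block: the default { print } applies.
pattern=$(printf '%s\n' "$sample" | awk '$1 ~ /^[0-9]/')
# Block without a pattern: the empty pattern is true, so every line is printed.
every=$(printf '%s\n' "$sample" | awk '{ print }')

printf 'inside=[%s]\n' "$inside"
# prints: inside=[]
```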