I have a huge text file in the following pattern
####
Some Question 1
answer 1
####
####
Some Question 2
answer 2
some answer 2
another answer 2
####
####
Some Question 3
answer 3
some answer 3
####
In my project I need to:
1. find the lines between two #### delimiters, which I already did with (####)(.+?)(####)
2. put a question mark at the end of the first line after ####
3. put a slash before the second and third answer lines
to get a result like this:
Some Question 1 ? answer 1
Some Question 2 ? answer 2 / some answer 2 / another answer 2
Some Question 3 ? answer 3 / some answer 3
As mentioned, I have already matched the text into 3 groups: \1 and \3 are the #### delimiters and \2 is the lines in between. How can I separate those lines and make the desired changes?
I recommend doing this job outside of Notepad, using a script launched from the command line.
If you have awk installed on your system, save the following script as, say, script.awk:
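#!/usr/bin/awk -f
# delimiter line: print the block collected so far, then reset
/^####$/ {
    if (q != "") {
        print q a
    }
    q = ""
    a = ""
    next
}
# other lines: the first one is the question, the rest are answers
{
    if (q == "") {
        q = $0 " ? "
    } else {
        if (a == "") {
            a = $0
        } else {
            a = a " / " $0
        }
    }
}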
Assuming your input is in file input.txt, you can run this script from the command line (after making it executable with chmod +x script.awk) by issuing:
./script.awk input.txt
or:
awk -f script.awk input.txt
I assume you can work in a Unix-like environment.
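For a quick check without creating a script file, the same idea can be sketched as a shell function. The name format_qa is hypothetical, and the END rule is an added assumption that a final block without a closing #### should still be printed.

```shell
# Hypothetical helper: reads the ####-delimited blocks on stdin and prints
# one "question ? answer / answer" line per block.
format_qa() {
  awk '
    /^####$/ {                              # delimiter: flush the finished block
      if (q != "") print q " ? " a
      q = a = ""
      next
    }
    q == "" { q = $0; next }                # first line of a block is the question
    { a = (a == "" ? $0 : a " / " $0) }     # later lines are answers
    END { if (q != "") print q " ? " a }    # assumption: flush an unterminated block
  '
}

printf '%s\n' '####' 'Some Question 1' 'answer 1' '####' \
              '####' 'Some Question 2' 'answer 2' 'some answer 2' '####' | format_qa
```

With the two sample blocks above this prints "Some Question 1 ? answer 1" and "Some Question 2 ? answer 2 / some answer 2".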
How to skip current awk rule when its sanity check failed?
{
if (not_applicable) skip;
if (not_sanity_check2) skip;
if (not_sanity_check3) skip;
# the rest of the actions
}
IMHO, it's much cleaner to write code this way than,
{
if (!not_applicable) {
if (!not_sanity_check2) {
if (!not_sanity_check3) {
# the rest of the actions
}
}
}
}
1;
I need to skip the current rule because I have a catch all rule at the end.
UPDATE, the case I'm trying to solve.
There are multiple match points in a file that I want to match & alter; however, there's no other obvious marker that lets me match only what I want.
Let me simplify it this way: I want to match & alter the first match, then skip the rest of the matches and print them as-is.
As far as I understand your requirement, you are looking for if / else if here. You could also use the switch statement available in newer versions of gawk.
Let's take an example Input_file:
cat Input_file
9
29
Following is the awk code here:
awk -v var="10" '{if($0<var){print "Line " FNR " is less than var"} else if($0>var){print "Line " FNR " is greater than var"}}' Input_file
This will print as follows:
Line 1 is less than var
Line 2 is greater than var
If you look at the code carefully, it checks:
First condition: if the current line is less than var, the if block is executed.
Second condition: in the else if block, if the current line is greater than var, the corresponding message is printed.
I'm really not sure what you're trying to do, but if I focus on just that last sentence in your question, "I want to match & alter the first match and skip the rest of the matches and print them as-is", is this what you're trying to do?
{ s=1 }
s && /abc/ { $0="uvw"; s=0 }
s && /def/ { $0="xyz"; s=0 }
{ print }
e.g. to borrow @Ravinder's example:
$ cat Input_file
9
29
$ awk -v var='10' '
{ s=1 }
s && ($0<var) { $0="Line " FNR " is less than var"; s=0 }
s && ($0>var) { $0="Line " FNR " is greater than var"; s=0 }
{ print }
' Input_file
Line 1 is less than var
Line 2 is greater than var
I used the boolean flag variable name s for "sane", as you also mentioned something in your question about the conditions tested being sanity checks, so each condition can be read as "is the input sane so far and is this next condition true?".
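If the goal is to alter only the first matching line in the whole file (rather than the first matching rule per line), the flag can simply be left set instead of being reset on every record. A minimal sketch, reusing the abc/uvw placeholders from above; the function name alter_first_match and the variable done are hypothetical:

```shell
# Hypothetical helper: rewrites only the first line matching /abc/,
# every other line (including later matches) reaches the catch-all rule unchanged.
alter_first_match() {
  awk '
    !done && /abc/ { $0 = "uvw"; done = 1 }   # fires once per run, not once per line
    { print }                                  # catch-all rule
  '
}

printf '%s\n' 'abc 1' 'xyz' 'abc 2' | alter_first_match
```

This prints uvw, xyz and abc 2. Because the first rule does not call next, the catch-all rule still runs for the altered line.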
I would like to be able to use awk in place of a while loop to remove subdomains from an input string if it also contains the main domain.
Source file:
1234.f.dsfsd.test.com
abc.test.com
ad.sdk.kaffnet.com
amazon.co.uk
analytics.test.dailymail.co.uk
bbc.co.uk
bbc.test.com
dailymail.co.uk
kaffnet.com
sdk.kaffnet.com
sub.test.bbc.co.uk
t.dailymail.co.uk
test.amazon.co.uk
test.bbc.co.uk
test.com
test.dailymail.co.uk
Desired Output:
amazon.co.uk
bbc.co.uk
dailymail.co.uk
kaffnet.com
test.com
Solution: @EdMorton
Check the last part of a domain and see which string is the shortest one among them:
BEGIN{FS="."}
{
ind=$(NF-1) FS $NF;
if (!(ind in size) || (size[ind] > length)) {
size[ind]=length # check the minimum size for this domain
domain[ind]=$0 # store the string with the minimum size on this domain
}
}
END {for (ind in domain) print domain[ind]}
As a one-liner:
$ awk 'BEGIN{FS="."} {ind=$(NF-1) FS $NF; if (!(ind in size) || (size[ind] > length)) { size[ind]=length; domain[ind]=$0}} END {for (ind in domain) print domain[ind]}' file
test.com
bbc.co.uk
Previous approach, that works for top level domains:
Just make use of the field separator and set it to the dot. This way, it is just a matter of storing the penultimate and last fields as one string and checking how many different ones you find:
$ awk -F. '{a[$(NF-1) FS $NF]} END{for (i in a) print i}' file
test.com
How does this work? a[] is an array to which we keep adding indices. The index is built from the penultimate field followed by a dot and the last field. This way, any new bla.test.com will still have the same index and does not add extra info to the array.
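Since for (i in a) visits indices in an unspecified order, a small variation prints each index the first time it is seen, which keeps the input order. This is only a sketch; the function name last_two_labels and the NF >= 2 guard (to skip empty or dot-less lines) are assumptions, not part of the answer above.

```shell
# Hypothetical helper: prints each distinct "penultimate.last" label pair
# once, in order of first appearance.
last_two_labels() {
  awk -F. '
    NF >= 2 {                              # guard: need at least two labels
      key = $(NF-1) FS $NF                 # same index as in the answer above
      if (!seen[key]++) print key          # print only on first occurrence
    }
  '
}

printf '%s\n' 1234.f.dsfsd.test.com abc.test.com bla.com another.bla.com | last_two_labels
```

For this input it prints test.com followed by bla.com.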
With other inputs:
$ cat file
1234.f.dsfsd.test.com
abc.test.com
bbc.test.com
test.com
bla.com
another.bla.com
$ awk -F. '{a[$(NF-1) FS $NF]} END{for (i in a) print i}' file
test.com
bla.com
New answer based on new requirements and new sample input file:
$ cat tst.awk
{ doms[$0] }
END {
for (domA in doms) {
hasSubDom = 0
for (domB in doms) {
if ( index(domA,domB ".") == 1 ) {
hasSubDom = 1
}
}
if ( !hasSubDom ) {
print domA
}
}
}
$ rev file | awk -f tst.awk | rev
bbc.co.uk
dailymail.co.uk
amazon.co.uk
kaffnet.com
test.com
$ rev file | sort |
awk -F'.' 'index($0,prev FS)!=1{ print; prev=$1 FS $2 }' |
rev
bbc.co.uk
test.com
The above just implements the algorithm you described in your question. It reverses the chars on each line and then sorts the result, just like you were already doing. Then, if the previously printed line was foo.bar.stuff, prev is foo.bar. If the current line is foo.bar.otherstuff, the call to index WILL find that foo.bar. occurs at the start (index position 1) of the current line, so we will NOT print that line and prev will remain as is. Note the . at the end: adding that last . to the comparison is important so that foo.bar doesn't falsely match foo.barristers.wig. If, on the other hand, the current line is my.sharona.song, then prev (foo.bar) DOES NOT occur at the start of that line, so that line IS printed and prev gets set to my.sharona. Finally it just reverses the chars on each output line back to their original order.
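The effect of the trailing dot can be checked in isolation with index(), using the example strings from the explanation above. The function name prefix_demo is hypothetical.

```shell
# Hypothetical demo: compares index() with and without the trailing dot.
prefix_demo() {
  awk 'BEGIN {
    prev = "foo.bar"
    print index("foo.barristers.wig", prev)       # 1: false positive without the dot
    print index("foo.barristers.wig", prev ".")   # 0: no match once the dot is appended
    print index("foo.bar.otherstuff", prev ".")   # 1: genuine continuation of foo.bar
  }'
}

prefix_demo
```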
You can test against a dynamic regex inside awk by building it in a variable and applying the ~ operator:
awk 'NR==1{a=$0} NR>1{if(length(a)>0){regex="^"a;if($0~regex){print a}}a=$0}'
Example (using tac and rev to facilitate the reversion)
The problem with your method is that you need at least 2 lines for the domain because you only display the previous line, but what if you did not have a previous line? Maybe that is not an issue for you if your domains always come with at least 2 lines.
For what it's worth, here is a version that works without requiring reversing and sorting the input.
awk -F. 'BEGIN {
SLDs = "co.uk,gov.uk,add.others" # general-use second-level domains we recognize
split(SLDs, slds, /,/);
for (i in slds) slds[slds[i]] = 1
}
/./ {
tld = $(NF-1) "." $(NF)
if (NF > 2 && tld in slds) tld = $(NF-2) "." tld
lines[NR] = $0
tlds[NR] = tld
if (tld == $0) existing_tlds[tld] = 1
}
END {
for (i = 1; i <= length(lines); i++) {
line = lines[i]; tld = tlds[i]
if (!(tld in existing_tlds) || tld == line) print(line)
}
}' input_file
This goes through the file and builds an array of existing TLDs. In the END block it prints a line only when it is a TLD itself or its TLD does not exist in said array.
When input_file is
1234.f.dsfsd.test.com
abc.test.com
amazon.co.uk
bbc.co.uk
bbc.test.com
sub.test.bbc.co.uk
test.amazon.co.uk
test.bbc.co.uk
test.com
it prints
amazon.co.uk
bbc.co.uk
test.com
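A variation that needs neither rev/sort nor a list of known second-level domains is to treat a name as a subdomain whenever one of its proper suffixes is itself listed in the input. This is a different technique from the answers above, shown only as a sketch: the function name drop_subdomains is hypothetical, duplicate input lines are not collapsed, and a bare suffix such as co.uk appearing in the input would hide everything beneath it.

```shell
# Hypothetical helper: keeps a name only if none of its parent domains
# (the name with one or more leading labels removed) also appears in the input.
drop_subdomains() {
  awk '
    { seen[$0] = 1; order[NR] = $0 }        # remember every name and the input order
    END {
      for (n = 1; n <= NR; n++) {
        name = order[n]; rest = name; covered = 0
        # strip one leading label at a time and look the remainder up
        while ((dot = index(rest, ".")) > 0) {
          rest = substr(rest, dot + 1)
          if (rest in seen) { covered = 1; break }
        }
        if (!covered) print name
      }
    }
  '
}

printf '%s\n' abc.test.com amazon.co.uk bbc.co.uk test.bbc.co.uk test.com | drop_subdomains
```

For this shortened input it prints amazon.co.uk, bbc.co.uk and test.com, in input order.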
I have a file rev.txt like this:
header1,header2
1, some text here
2, some more text here
3, text and more text here
I also have a vocabulary document with all unique words from rev.txt, like so (but sorted):
a
word
list
text
here
some
more
and
I want to generate a term frequency table for rev.txt that lists, for each line, the number of occurrences of each vocabulary word in that line, like so:
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 2 1 0 1 1
They could be comma separated as well.
This is similar to a question here. However, instead of searching through the entire document, I want to do this line by line, using the complete vocabulary I already have.
Re: Jean-François Fabre
Actually, I am performing these in MATLAB. However, bash (I believe) would be faster for this preprocessing as I have direct disk access to the files.
Normally I would use Python, but limiting myself to bash, this hacky one-liner works for the given test case.
perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt | sed '1d' | awk -F' ' 'FILENAME=="wordlist.txt" {wc[$1]=0; wl[wllen++]=$1; next}; {for(i=1; i<=NF; i++){wc[$i]++}; for(i=0; i<wllen; i++){print wc[wl[i]]" "; wc[wl[i]]=0; if(i+1==wllen){print "\n"} }}' ORS="" wordlist.txt -
Explanation/My thinking...
The first part, perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt, pulls out everything after the first comma (also removing the leading whitespace) from "rev.txt".
The next part, sed '1d', removes the first line, i.e. the header.
The next part, awk -F' ' ... ORS="" wordlist.txt -, uses whitespace as the field delimiter, sets the output record separator to the empty string (note: we print the separators ourselves as we go), and reads input from wordlist.txt (i.e. the "vocabulary document with all unique words from rev.txt") and then from stdin.
In the awk command, if FILENAME is equal to "wordlist.txt", then (1) initialize the array wc, whose keys are the vocab words and whose counts start at 0, and (2) build a list wl in which the word order is the same as in wordlist.txt.
FILENAME=="wordlist.txt" {
wc[$1]=0;
wl[wllen++]=$1;
next
};
After initialization, for each word in a line of stdin (i.e. the tidy rev.txt), increment the count of the word in wc.
{ for (i=1; i<=NF; i++) {
wc[$i]++
};
After the word counts have been added for a line, for each word in the list of words wl, print the count of that word followed by a space and reset the count in wc back to 0. If the word is the last in the list, then add a newline to the output.
for (i=0; i<wllen; i++) {
print wc[wl[i]]" ";
wc[wl[i]]=0;
if(i+1==wllen){
print "\n"
}
}
}
Overall, this should produce the specified output.
Here's one in awk (GNU awk, since it relies on PROCINFO["sorted_in"]). It reads in the vocabulary file voc.txt (it's a piece of cake to produce it automatically in awk), copies the word list for each row of text and counts the word frequencies:
$ cat program.awk
BEGIN {
PROCINFO["sorted_in"]="@ind_str_asc" # order for copying vocabulary array w
}
NR==FNR { # store the voc.txt to w
w[$1]=0
next
}
FNR>1 { # process text files to matrix
for(i in w) # copy voc array
a[i]=0
for(i=2; i<=NF; i++) # count freqs
a[$i]++
for(i in a) # output matrix row
printf "%s%s", a[i], OFS
print ""
}
Run it:
$ awk -f program.awk voc.txt rev.txt
0 0 1 0 0 1 1 0
0 0 1 0 1 1 1 0
0 1 1 0 1 0 2 0
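If the columns should follow the order of the vocabulary file rather than sorted order, and gawk is not available, the same counting can be sketched in portable awk. The function name term_freq and the temporary files are hypothetical; the sketch assumes the first line of the text file is a header and that everything up to the first comma is an id column.

```shell
# Hypothetical helper. Usage: term_freq VOCAB_FILE TEXT_FILE
term_freq() {
  awk '
    NR == FNR { vocab[++n] = $1; next }     # first file: vocabulary, kept in file order
    FNR > 1 {                               # second file: skip the header line
      split("", count)                      # clear the per-line counts (portable idiom)
      sub(/^[^,]*,[ ]*/, "")                # drop the leading id column; fields are re-split
      for (i = 1; i <= NF; i++) count[$i]++
      row = ""
      for (j = 1; j <= n; j++)
        row = row (j > 1 ? " " : "") (count[vocab[j]] + 0)
      print row
    }
  ' "$1" "$2"
}

vocab=$(mktemp); text=$(mktemp)
printf '%s\n' a word list text here some more and > "$vocab"
printf '%s\n' 'header1,header2' '1, some text here' '2, some more text here' \
              '3, text and more text here' > "$text"
term_freq "$vocab" "$text"
rm -f "$vocab" "$text"
```

With the vocabulary in the order shown in the question, this reproduces the three rows of the requested table.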
I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see, the only way to separate the columns is to find the character columns that contain nothing but spaces on every line. How can we identify these columns and replace them with a unique separator like , ?
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Note:
If we find every run of contiguous character columns that contain only white space (nothing else) and replace each whole run with a single , the problem will be solved.
Better explanation of the question by josifoski :
Per block of matrix characters, if all are 'space' then all block should be replaced vertically with one , on every line.
$ cat tst.awk
BEGIN{ FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR {
for (i=1;i<=NF;i++) {
if ($i == " ") {
space[i]
}
else {
nonSpace[i]
}
}
next
}
FNR==1 {
for (i in nonSpace) {
delete space[i]
}
}
{
for (i in space) {
$i = ","
}
gsub(/,+/,",")
print
}
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Another in awk
awk 'BEGIN{OFS=FS=""} # Sets field separator to nothing so each character is a field
FNR==NR{for(i=1;i<=NF;i++)a[i]+=$i!=" ";next} #Counts, per character position,
#how many lines have a non-space character there.
#Skips all further commands for first file.
{ # In second file (same file but second time)
for(i=1;i<=NF;i++) #Loops through fields
if(!a[i]){ #If no line has a non-space at this position
$i="," #Change field to ","
x=i #Set x to field number
while(++x<=NF && !a[x]){ # While the following positions are blank too (stop at end of line)
$x="" # Change field to nothing
i=x # Set i to x so those fields are not processed again
}
}
}1' test{,} #Print and use the same file twice
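Both awk answers above rely on an empty FS splitting each line into characters, which is a common extension rather than POSIX behaviour. A sketch of the same two-pass idea using substr() instead is shown here. The function name blank_columns_to_commas is hypothetical, and since the exact spacing of the original file is not recoverable from the question, the two-space gaps in the sample are an assumption.

```shell
# Hypothetical helper. Usage: blank_columns_to_commas FILE
blank_columns_to_commas() {
  awk '
    NR == FNR {                               # pass 1: mark positions holding a non-space
      if (length($0) > width) width = length($0)
      for (i = 1; i <= length($0); i++)
        if (substr($0, i, 1) != " ") used[i] = 1
      next
    }
    {                                         # pass 2: rebuild each line
      out = ""; in_gap = 0
      for (i = 1; i <= width; i++) {
        if (i in used) {                      # keep characters in occupied positions
          out = out substr($0, i, 1); in_gap = 0
        } else if (!in_gap) {                 # first position of an all-space run
          out = out ","; in_gap = 1
        }
      }
      print out
    }
  ' "$1" "$1"
}

sample=$(mktemp)
printf '%s\n' 'aaa  b b  ccc  345' 'ddd  fgt  f u  3456' 'e r  der  der  5 674' > "$sample"
blank_columns_to_commas "$sample"
rm -f "$sample"
```

Reading the file twice keeps memory use flat for large inputs; the file name is passed twice to awk for that reason.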
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.
# required package
require(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)
I have a given file:
application_1.pp
application_2.pp
#application_2_version => '1.0.0.1-r1',
application_2_version => '1.0.0.2-r3',
application_3.pp
#application_3_version => '2.0.0.1-r4',
application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp
#application_5_version => '3.0.0.1-r8',
application_5_version => '3.0.0.2-r9',
I would like to be able to read this file and search for the string
".pp"
When that string is found, it adds that line into a variable and stores it.
It then reads the next line of the file. If it encounters a line preceded by a # it ignores it and moves onto the next line.
If it comes across a line that does not contain ".pp" and doesn't start with # it should print out that line next to the last stored variable in a new file.
The output would look like this:
application_1.pp
application_2.pp application_2_version => '1.0.0.2-r3',
application_3.pp application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp application_5_version => '3.0.0.2-r9',
I would like to achieve this with awk. If somebody knows how to do this and it is a simple solution, I would be happy if they could share it with me. If it is more complex, it would be helpful to know what in awk I need to understand in order to know how to do this (arrays, variables, etc). Can it even be achieved with awk, or is another tool necessary?
Thanks,
I'd say
awk '/\.pp/ { if(NR != 1) print line; line = $0; next } NF != 0 && substr($1, 1, 1) != "#" { line = line $0 } END { print line }' filename
This works as follows:
/\.pp/ {                               # if a line contains ".pp"
  if(NR != 1) {                        # unless we just started
    print line                         # print the last assembled line
  }
  line = $0                            # and remember this new one
  next                                 # and we're done here.
}
NF != 0 && substr($1, 1, 1) != "#" {   # otherwise, unless the line is empty
                                       # or a comment
  line = line $0                       # append it to the line we're building
}
END {                                  # in the end,
  print line                           # print the last line.
}
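The program above appends the version line exactly as it appears, so the spacing between the two parts depends on how the line is indented in the file. A variant that trims the indentation and joins with a single space could look like the sketch below. The function name join_versions and the have flag are hypothetical, and the indentation of the sample version lines is an assumption.

```shell
# Hypothetical helper: joins each .pp line with the uncommented version line after it.
join_versions() {
  awk '
    /\.pp/ {                           # a manifest name starts a new record
      if (have) print line             # flush the previous record, if any
      line = $0; have = 1
      next
    }
    /^[ \t]*#/ || NF == 0 { next }     # ignore commented-out and blank lines
    {
      sub(/^[ \t]+/, "")               # trim indentation before appending
      line = line " " $0
    }
    END { if (have) print line }       # flush the last record
  '
}

printf '%s\n' 'application_1.pp' 'application_2.pp' \
  "    #application_2_version => '1.0.0.1-r1'," \
  "    application_2_version => '1.0.0.2-r3'," | join_versions
```

This prints application_1.pp on its own line, followed by application_2.pp application_2_version => '1.0.0.2-r3', on the next.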
You can use sed:
#n
/\.pp/{
h
:loop
n
/[^#]application.*version/{
H
g
s/\n[[:space:]]*/\t/
p
b
}
/\.pp/{
x
p
}
b loop
}
If you save this as s.sed and run
sed -f s.sed file
You will get this output
application_1.pp
application_2.pp application_2_version => '1.0.0.2-r3',
application_3.pp application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp application_5_version => '3.0.0.2-r9',
Explanation
The #n on the first line suppresses normal output.
Once we match the /\.pp/, we store that line into the hold space with h, and start the loop.
We go to the next line with n
If it matches /[^#]application.*version/, meaning "application" is preceded by a character other than # (so the line is not commented out), then we append the line to the hold space with H, copy the hold space to the pattern space with g, and replace the newline and any subsequent whitespace with a tab. Finally we print with p, and skip to the end of the script with b.
If it matches /\.pp/, then we swap the pattern and hold spaces with x, and print with p.