Separate string of digits into 3 columns using awk/sed

I have a string of digits in rows as below:
6390212345678912011012112121003574820069121409100000065471234567810
6390219876543212011012112221203526930428968109100000065478765432196
That I need to split into 6 columns as below:
639021234567891,201101211212100,3574820069121409,1000000,654712345678,10
639021987654321,201101211222120,3526930428968109,1000000,654787654321,96
Conditions:
Field 1 = 15 Char
Field 2 = 15 Char
Field 3 = 15 or 16 Char
Field 4 = 7 Char
Field 5 = 12 Char
Field 6 = 2 Char
Final Output:
639021234567891,3574820069121409,654712345678
639021987654321,3526930428968109,654787654321

It's not clear how to detect whether field 3 should have 15 or 16 chars. But as a draft for the first 3 fields you could use something like this:
echo 63902910069758520110121121210035748200670169758510 |
awk '{ printf("%s,%s,%s\n", substr($1,1,15), substr($1,16,15), substr($1,31,15)); }'

Or with sed:
echo "$NUM" | sed -r 's/^([0-9]{15})([0-9]{15})([0-9]{15,16}) ...$/\1,\2,\3, .../'
This will use 15 or 16 for the length of field 3 based on the length of the whole string.

If you're using gawk:
gawk -v f3w=16 'BEGIN {OFS=","; FIELDWIDTHS="15 15 " f3w " 7 12 2"} {print $1, $3, $5}'
Do you know ahead of time what the width of Field 3 should be? Do you need it to be programmatically determined? How? Based on the total length of the line? Does it change line by line?
Edit:
If you don't have gawk, then this is a similar approach:
awk -v f3w=16 'BEGIN {OFS=","; FIELDWIDTHS="15 15 " f3w " 7 12 2"; n=split(FIELDWIDTHS,fw," ")} { p=1; r=$0; for (i=1;i<=n;i++) { $i=substr(r,p,fw[i]); p += fw[i]}; print $1,$3,$5}'
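If the width of field 3 can be derived from the total line length, one possible sketch is shown here. It assumes every line is 66 or 67 digits long, so that whatever remains after the five fixed-width fields (15 + 15 + 7 + 12 + 2 = 51 characters) belongs to field 3. The function name pick_fields is made up for this example and does not come from the answers above.

```shell
# pick_fields is a hypothetical helper, not part of the original answers.
# Assumption: the fixed-width fields total 51 characters, so the width of
# field 3 is the line length minus 51 (expected to be 15 or 16).
pick_fields() {
  awk '{
    f3w = length($0) - 51             # characters left over for field 3
    if (f3w != 15 && f3w != 16) next  # skip lines that do not fit the layout
    print substr($0, 1, 15) "," substr($0, 31, f3w) "," substr($0, 31 + f3w + 7, 12)
  }' "$@"
}

printf '%s\n' \
  6390212345678912011012112121003574820069121409100000065471234567810 \
  6390219876543212011012112221203526930428968109100000065478765432196 |
  pick_fields
# 639021234567891,3574820069121409,654712345678
# 639021987654321,3526930428968109,654787654321
```

Because this uses only substr() and length(), it should not need gawk's FIELDWIDTHS.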

Related

Awk if-statement to count the number of characters (wc -m) coming from a pipe

I have been scratching my head over this issue and can't understand what is wrong with my one-liner below.
Given that
echo "5" | wc -m
2
and that
echo "55" | wc -m
3
I tried to add a zero in front of all numbers below 10 with an awk if-statement as follows:
echo "5" | awk '{ if ( wc -m $0 -eq 2 ) print 0$1 ; else print $1 }'
05
which is "correct", however with 2 digits numbers I get the same zero in front.
echo "55" | awk '{ if ( wc -m $0 -eq 2 ) print 0$1 ; else print $1 }'
055
How come? I assumed this was going to return only 55 instead of 055. I now understand I'm constructing the if-statement wrong.
What, then, is the right way (if there is one) to ask awk to evaluate whether whatever comes from the | has 2 characters, as one would do with wc -m?
I'm not interested in the optimal way to add leading zeros in the command line (there are enough duplicates of that).
Thanks!

I suggest to use printf:
printf "%02d\n" "$(echo 55 | wc -m)"
03
printf "%02d\n" "$(echo 123456789 | wc -m)"
10
Note: printf is available as a bash builtin. It mainly follows the conventions of the C function printf(). Check:
help printf # For the bash builtin in particular
man 3 printf # For the C function

Facts:
In AWK strings or variables are concatenated just by placing them side by side.
For example: awk '{b="v" ; print "a" b}'
In AWK undefined variables are equal to an empty string or 0.
For example: awk '{print a "b", -a}'
In AWK non-zero strings are true inside if.
For example: awk '{ if ("a") print 1 }'
wc -m $0 -eq 2 is parsed as follows (the binary - operator has higher precedence than string concatenation):
wc -m $0 -eq 2
( wc - m ) ( $0 - eq ) 2
                       ^ - integer value 2, converted to string "2"
                  ^^ - undefined variable `eq`, converted to integer 0
             ^^ - input line, so string "5" converted to integer 5
                ^ - subtracts 5 - 0 = 5
           ^^^^^^^^^^^ - integer 5, converted to string "5"
       ^ - undefined variable "m", converted to integer 0
  ^^ - undefined variable "wc" converted to integer 0
^^^^^^^^^^ - subtracts 0 - 0 = 0, converted to a string "0"
^^^^^^^^^^^^^^^^^^^^^^^^ - string concatenation, results in string "052"
The result of wc -m $0 -eq 2 is string 052 (see awk '{ print wc -m $0 -eq 2 }' <<<'5'). Because the string is not empty, if is always true.
It should return only 55 instead of 055
No, it should not.
Am I constructing the if statement wrong?
No, the if statement has valid AWK syntax. Your expectations of how it works do not match how it really works.

To actually make it work (not that you would want to):
echo 5 | awk '
{
cmd = "echo " $1 " | wc -m"
cmd | getline len
if (len == 2)
print "0"$1
else
print $1
}'
But why when you can use this instead:
echo 5 | awk 'length($1) == 1 { $1 = "0"$1 } 1'
Or even simpler with the various printf solutions seen in the other answers.
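To answer the question as literally asked, awk's built-in length() is the closest equivalent of wc -m. The sketch below assumes single-line input; note that length($0) does not count the trailing newline, so a line for which wc -m reports 2 has a length of 1 in awk. The name pad2 is invented for this example.

```shell
# pad2 is a hypothetical helper name used only for this sketch.
pad2() {
  awk '{
    n = length($0)                  # characters in the line, newline excluded
    if (n == 1) printf "0%s\n", $0  # one character: prepend a zero
    else        print $0            # anything else: print unchanged
  }'
}

echo 5  | pad2   # prints 05
echo 55 | pad2   # prints 55
```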

How can I group unknown (but repeated) words to create an index?

I have to create a shellscript that indexes a book (text file) by taking any words that are encapsulated in angled brackets (<>) and making an index file out of that. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are encapsulated within angled brackets.
I found a similar question that was asked but required words inside of square brackets and tried to manipulate their code but am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of the angled brackets and now I am receiving an error saying:
line 5: .*: ambiguous redirect
This has been answered
I also now need to take my index and reformat it like so, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!

Could you please try the following? It does not guarantee any output order; if you need sorted output, pipe the result through sort.
awk '
BEGIN{
FS=":"
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
for(key in name){
print key": "name[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=":" ##Setting field separator as : here.
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1 ##Building array name, indexed by $2; each $1 is appended to the value already stored under that index.
}
END{ ##Starting END block of this code here.
for(key in name){ ##Traversing through name array here.
print key": "name[key] ##Printing key colon and array name value with index key
}
}
' Input_file ##Mentioning Input_file name here.
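Since for (key in name) visits keys in an unspecified order, a variant that remembers the order in which each word first appears may be useful. This is only a sketch: the function name group_index and the array names order and nums are made up for the example, and it expects the same number:word input as above.

```shell
# group_index is a hypothetical helper; order[] and nums[] are illustrative names.
group_index() {
  awk -F: '
    {
      if ($2 in nums) {
        nums[$2] = nums[$2] " " $1   # word seen before: append this line number
      } else {
        order[++n] = $2              # first sighting: remember its position
        nums[$2] = $1
      }
    }
    END {
      for (i = 1; i <= n; i++)       # print in order of first appearance
        print order[i] ": " nums[order[i]]
    }
  ' "$@"
}

printf '%s\n' 1:big 3:big 9:big 2:but 4:sun 6:sun | group_index
# big: 1 3 9
# but: 2
# sun: 4 6
```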

If you want to extract multiple occurrences of substrings in between angle brackets with GNU grep, you may consider a PCRE regex based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
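If grep was built without PCRE support, a rough alternative is to match the brackets with a basic regex and strip them afterwards. The sketch below assumes a grep that supports -o (GNU and BSD grep both do); extract_tagged is a name invented for this example. Quoting the pattern is what avoids the "ambiguous redirect" error from the question, because an unquoted < or > is taken by the shell as a redirection.

```shell
# extract_tagged is a hypothetical helper; assumes grep supports -o.
extract_tagged() {
  # Single quotes stop the shell from treating < and > as redirections.
  # [^<>][^<>]* requires at least one character between the brackets.
  grep -on '<[^<>][^<>]*>' "$@" | tr -d '<>'
}

printf '%s\n' 'a <big> red <sun>' 'no tags here' 'the <sun> again' | extract_tagged
# 1:big
# 1:sun
# 3:sun
```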

Something like this might be what you need, it outputs the paragraph number, line number within the paragraph, and character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!
I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
.
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
for (lineNr=1; lineNr<=NF; lineNr++) {
line = $lineNr
idx = 1
while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
word = substr(line,idx+RSTART,RLENGTH-2)
locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
idx += (RSTART + RLENGTH - 1)
}
}
}
END {
for (word in locs) {
print word, locs[word]
}
}
.
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns

GNU datamash is a handy tool for working on groups of columnar data (Plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8

To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
Shorter version of the same suggested by RavinderSingh13:
awk -F: '
{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort

Bash select valid rows from file with awk

I have a large data set with some invalid rows. I want to copy to another file only rows which start with valid date (regex digits).
Basically check if awk $1 is digit ([0-9]), if yes, write whole row ($0) to output file, if no skip this row, go to next row.
How I imagine it like (both versions give syntax error):
awk '{if ($1 =~ [0-9]) print $0 }' >> output.txt
awk '$1 =~ [0-9] {print $0}' filename.txt
while this does print the first field, I have no idea how to proceed.
awk '{ print $1 }' filename.txt
19780101
19780102
19780103
a
19780104
19780105
19780106
...
Full data set:
19780101 1 1 1 1 1
19780102 2 2 2 2 2
19780103 3 3 3 3 3
a a a a a a
19780104 4 4 4 4 4
19780105 5 5 5 5 5
19780106 6 6 6 6 6
19780107 7 7 7 7 7
19780108 8 8 8 8 8
19780109 9 9 9 9 9
19780110 10 10 10 10 10
19780111 11 11 11 11 11
19780112 12 12 12 12 12
19780113 13 13 13 13 13
19780114 14 14 14 14 14
19780115 15 15 15 15 15
19780116 16 16 16 16 16
a a a a a a
19780117 17 17 17 17 17
19780118 18 18 18 18 18
19780119 19 19 19 19 19
19780120 20 20 20 20 20
The data set can be reproduced with R
library(dplyr)
library(DataCombine)
N <- 20
df = as.data.frame(matrix(seq(N),nrow=N,ncol=5))
df$date = format(seq.Date(as.Date('1978-01-01'), by = 'day', len = N), "%Y%m%d")
df <- df %>% select(date, everything())
df <- InsertRow(df, NewRow = rep("a", 6), RowNum = 4)
df <- InsertRow(df, NewRow = rep("a", 6), RowNum = 18)
write.table(df,"filename.txt", quote = FALSE, sep="\t",row.names=FALSE)
Questions about reading first N rows don't address my need, because my invalid rows could be anywhere. This solution doesn't work for some reason.

Since you have a large data set and such a simple requirement, you could just use grep for this, which is likely to be faster than awk:
grep '^[0-9]' file

Based on your data, you can check that the first column consists of exactly 8 digits, as in a date in YYYYMMDD format, using this command:
awk '$1 ~ /^[0-9]{8}$/' file > output
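The {8} interval expression may not be available in every awk (some older implementations are reported to lack it). A sketch that avoids it, under the same assumption that a valid first field is exactly 8 digits, could look like this; valid_rows is a made-up name for the example.

```shell
# valid_rows is a hypothetical helper, not from the original answers.
valid_rows() {
  # Keep a row when field 1 is 8 characters long and contains no non-digit.
  awk 'length($1) == 8 && $1 !~ /[^0-9]/' "$@"
}

printf '%s\n' '19780103 3 3 3 3 3' 'a a a a a a' '19780104 4 4 4 4 4' | valid_rows
# 19780103 3 3 3 3 3
# 19780104 4 4 4 4 4
```

Like the original command, this checks only the shape of the field, not whether the digits form a real calendar date.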

You can just go with this:
awk '/^[0-9]+/' file.txt >> output.txt
By default awk processes its input line by line, so this selects the lines that start (^) with at least one digit ([0-9]+) and prints them, appending the result to output.txt.
Hope this helps.

You can also try this..
sed '/^[0-9]/!d' inputfile > outputfile

Getting the last column of a grep match for each line

Let's say I have
this is a test string
this is a shest string
this est is another example of sest string
I want the number of the character in string of the last "t" IN THE WORDS [tsh]EST, how do I get it? (In bash)
EDIT2: I can get the wanted substring with [tsh]*est if I'm not wrong.
I cannot rely on the first match (awk where=match(regex,$0) ) since it gives the first character position but the size of the match is not always the same.
EDIT: Expected output ->
last t of [tsh]*est at char number: 14
last t of [tsh]*est at char number: 15
last t of [tsh]*est at char number: 35
Hope I was clear, I think I edited the question too many times sorry !

What you got wrong
where=match(regex,$0)
The syntax of match is wrong: the string comes first, followed by the regex. That is, match($0, regex).
Correction
$ awk '{print match($0, "t[^t]*$")}' input
17
18
38
EDIT
Get number of the character in string of the last "t" IN THE WORDS [tsh]EST,
$ awk '{match($0, "(t|sh|s)est"); print RSTART+RLENGTH-1}' input
14
15
35
OR
a much simpler version
$ awk 'start=match($0, "(t|sh|s)est")-1{$0=start+RLENGTH}1' input
14
15
35
Thanks Jidder for the suggestion
EDIT
To use the regex same as OP has provided
$ awk '{for(i=NF; match($i, "(t|sh|s)*est") == 0 && i > 0; i--); print index($0,$i)+RLENGTH-1;}' input
14
15
35
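The arithmetic in these commands relies on the two variables that match() sets: RSTART, the position where the match begins, and RLENGTH, its length. As a small illustration (match_end is a name invented here, and the regex is the one used in this answer), the following prints all three values for one of the sample lines:

```shell
# match_end is a hypothetical helper used only to show RSTART and RLENGTH.
match_end() {
  awk '{
    if (match($0, /(t|sh|s)est/))   # string first, then the regex
      print RSTART, RLENGTH, RSTART + RLENGTH - 1
    else
      print "no match"
  }'
}

echo 'this is a shest string' | match_end   # prints 11 5 15
```

The last number, RSTART + RLENGTH - 1, is the position of the final character of the match, which is the closing "t" the question asks for.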

You can use this awk using same regex as provided by OP:
awk -v re='[tsh]*est' '{
i=0;
s=$0;
while (p=match(s, re)) {
p+=RLENGTH;
i+=p-1;
s=substr(s, p)
}
print i;
}' file
14
15
35

Try:
awk '{for (i=NF;i>=0;i--) { if(index ($i, "t") != 0) {print i; break}}}' myfile.txt

This will print the column with the last word that contains t
awk '{s=0;for (i=1;i<=NF;i++) if ($i~/t/) s=i;print s}' file
5
5
8
awk '{s=w=0;for (i=1;i<=NF;i++) if ($i~/t/) {s=i;w=$i};print "last t found in word="w,"column="s}' file
last t found in word=string column=5
last t found in word=string column=5
last t found in word=string column=8

joining lines with a specific pattern in a text file

I'm trying to join rows based on the first value in a row. My file looks like this:
the structure is: ID, KEY, VALUE
1 1 Joe
1 2 Smith
1 3 30
2 2 Doe
2 1 John
2 3 20
The KEY stands for some kind of attribute of the ID, in this case KEY '1' is first name, '2' is surname and '3' is age.
The output should look like this:
1 Joe Smith 30
2 John Doe 20
I know that this can be done by fairly simple awk script, but I'm having trouble finding it on SO or with Google.

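{
    a[$1,$2]=$3                  # store VALUE under (ID, KEY)
    if ($1>m) {m=$1}             # track the largest ID seen
}
END {
    for(i=1;i<=m;i++)
    {
        j=1
        printf "%s ", i          # explicit format, so a % in the data is harmless
        while ((i,j) in a)       # walk KEY 1, 2, 3, ... until one is missing
        {
            printf "%s ", a[i,j]
            j++
        }
        printf "\n"
    }
}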

This awk command should work:
awk '$2==1{fn=$3} $2==2{ln=$3} $2==3{age=$3} NR>1 && NR%3==0 {print $1,fn,ln,age}' file

One way with awk
awk '{a[$1]=(a[$1])?a[$1]FS$3:$3}END{for(;x<length(a);)print ++x,a[x]}' <(sort -n -k1,1 -k2,2 file)
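The answers above assume either exactly three rows per ID or IDs numbered 1..N. A sketch that drops both assumptions is to sort by ID and KEY first and then print a line each time the ID changes. The function name join_by_id is invented for this example; the input is assumed to be whitespace-separated ID, KEY, VALUE rows as in the question.

```shell
# join_by_id is a hypothetical helper, not part of the original answers.
join_by_id() {
  sort -n -k1,1 -k2,2 "$@" |        # order rows by ID, then by KEY
  awk '
    NR == 1 || $1 != id {           # first row, or the ID has changed
      if (NR > 1) print line        # flush the previous ID
      id = $1
      line = $1
    }
    { line = line " " $3 }          # append this VALUE
    END { if (NR > 0) print line }  # flush the last ID
  '
}

printf '%s\n' '1 1 Joe' '1 2 Smith' '1 3 30' '2 2 Doe' '2 1 John' '2 3 20' | join_by_id
# 1 Joe Smith 30
# 2 John Doe 20
```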
