splitting bash string by delimiter (last line with delimiter) into array - regex

I'm having a hard time splitting a string like this:
444,555,text with, separator
into this:
444
555
text with, separator
i.e. into a 3-element array (last element may contain comma)
I tried sed but I end up having 4 elements due to the last comma.
Any ideas?
Thanks,

With bash and array:
s='444,555,text with, separator'
IFS=, read -r a b c <<< "$s"
array=("$a" "$b" "$c")
declare -p array
Output:
declare -a array='([0]="444" [1]="555" [2]="text with, separator")'
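For comparison, here is a minimal sketch of the same three-way split without read, using only POSIX parameter expansion. The variable names first, second, third and rest are made up for illustration, and it assumes the string contains at least two commas.

```shell
s='444,555,text with, separator'

first=${s%%,*}       # everything before the first comma
rest=${s#*,}         # everything after the first comma
second=${rest%%,*}   # everything before the next comma
third=${rest#*,}     # the remainder, later commas included

printf '%s\n' "$first" "$second" "$third"
```

Because only the first two commas are ever used as split points, any further commas stay in the last piece.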

sed lets you replace the N-th match of a regexp (i.e. the N-th occurrence of the pattern within a line). With GNU sed, which accepts \n in the replacement:
str="444,555,text with, separator"
sed 's/,/\n/1; s/,/\n/1' <<< "$str"
The output:
444
555
text with, separator
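s/,/\n/1 - the 1 is a numeric flag that selects the first occurrence of , to replace with \n. Once the first substitution has run, the former second comma is the first one left, so the second s/,/\n/1 replaces it.
The following gives the same result, since a substitution without a numeric flag replaces the first match:
sed 's/,/\n/; s/,/\n/' <<< "$str"
Two consecutive substitutions yield 3 lines (chunks).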
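To see the numeric flag select something other than the first match, here is a small sketch that replaces only the second comma. It uses | as the replacement (my choice, so the example does not rely on \n support in the replacement), and out is a made-up variable name.

```shell
str='444,555,text with, separator'

# The 2 flag skips the first comma and replaces only the second one.
out=$(printf '%s\n' "$str" | sed 's/,/|/2')
printf '%s\n' "$out"   # 444,555|text with, separator
```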

echo "444,555,text with, separator" | sed "s/\([0-9]*\),\([0-9]*\),\(.*\)/\1\n\2\n\3/"
Output:
444
555
text with, separator

How to check last 3 chars of a string are alphabets or not using awk?

I want to check whether the last 3 characters in column 1 are letters and print those rows. What am I doing wrong?
My code:
awk -F '|' ' {print str=substr( $1 , length($1) - 2) } END{if ($str ~ /^[A-Za-z]/ ) print}' file
cat file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
.*/|982376
0NRT0|928731
expected output :
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
$ awk -F'|' '$1 ~ /[[:alpha:]]{3}$/' file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
Regarding what's wrong with your script:
You're doing the test for alphabetic characters in the END section for the final line read instead of once per input line.
You're trying to use shell variable syntax $str instead of awk str.
You're testing for literal character ranges in the bracket expression instead of using a character class so YMMV on which characters that includes depending on your locale.
You're testing for a string that starts with a letter instead of a string that ends with 3 letters.
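If you would rather keep the substr() approach from the question, the four points above could be applied roughly like this. It is only a sketch: the sample lines are a subset of the file from the question, out is a made-up shell variable, and the class is written out three times instead of using {3} so that awks without interval expressions also accept it.

```shell
out=$(printf '%s\n' '12300USD|0392' '.*/|982376' '0NRT0|928731' 'aabccde|38731' |
  awk -F'|' '
    {
      # awk variable, no $ prefix; recomputed for every input line
      str = substr($1, length($1) - 2)
    }
    # pattern with no action: print the line when str is exactly three letters
    str ~ /^[[:alpha:]][[:alpha:]][[:alpha:]]$/
  ')
printf '%s\n' "$out"
```

This prints the 12300USD and aabccde lines and drops the other two.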
Use grep:
grep -P '^[^|]*[A-Za-z]{3}[|]' in_file > out_file
Here, GNU grep uses the following option:
-P : Use Perl regexes.
The regex means this:
^ : Start of the string.
[^|]* : Any non-pipe character, repeated 0 or more times.
[A-Za-z]{3} : 3 letters.
[|] : Literal pipe.
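Nothing in this pattern actually needs Perl syntax, so on a grep without -P the same regex should work with -E (extended regular expressions). A quick sketch on two of the sample lines, with out as a made-up variable:

```shell
out=$(printf '%s\n' '12300USD|0392' '0NRT0|928731' |
  grep -E '^[^|]*[A-Za-z]{3}[|]')
printf '%s\n' "$out"   # 12300USD|0392
```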
sed -n '/^[^|]*[A-Za-z][A-Za-z][A-Za-z]|/p' file
grep '^[^|]*[A-Za-z][A-Za-z][A-Za-z]|' file
{m,g}awk '!+FS<NF' FS='^[^|]*[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '$!_!~"[|]"' FS='[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '($!_~"[|]")<NF' FS='[A-Za-z][A-Za-z][A-Za-z][|]' # to play it safe
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287

Search for Pattern in Text String, then Extract Matched Pattern

I am trying to match and then extract a pattern from a text string. I need to extract any pattern that matches the following in the text string:
10289 20244
Text File:
KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009
I am trying to achieve this using the following bash code:
Bash Code:
cat text_file | grep -Eow '\s10[0-9].*\s' | head -n 4 | awk '{print $1}'
The above code attempts to search for any group of approximately five numeric characters that begin with 10 followed by three numeric characters. After matching this pattern, the code prints out the rest of text string, capturing the second group of five numeric characters, beginning with 20.
I need a better, more reliable way to accomplish this because currently, this code fails. The numeric groups I need are separated by a space. I have attempted to account for this by inserting \s into the grep portion of the code.
grep solution:
grep -Eow '10[0-9]{3}\b.*\b20[0-9]{3}' text_file
The output:
10289 20244
[0-9]{3} - matches 3 digits
\b - word boundary
awk '{print $(NF-2),$(NF-1)}' text_file
10289 20244
Prints the third-to-last and next-to-last fields.
awk '$17 ~ /^10[0-9]{3}$/ && $18 ~ /^20[0-9]{3}$/ { print $17, $18 }' text_file
This will check field 17 for "10xxx" and field 18 for "20xxx", and when BOTH match, print them.
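If the two groups are not always in fields 17 and 18, the same test can be run over every pair of adjacent fields. This is a sketch under that assumption; line and out are made-up variable names, and the digits are spelled out instead of using {3} for awks without interval expressions.

```shell
line='KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009'

out=$(printf '%s\n' "$line" | awk '{
  # look at each field together with its right-hand neighbour
  for (i = 1; i < NF; i++)
    if ($i ~ /^10[0-9][0-9][0-9]$/ && $(i + 1) ~ /^20[0-9][0-9][0-9]$/)
      print $i, $(i + 1)
}')
printf '%s\n' "$out"   # 10289 20244
```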

regex, repeat, count group

I need some help with a regex for the following format:
The first part of the string is an email address, followed by eight columns separated by ";".
a.test#test.com;Alex;Test;Alex A.Test;Alex;12;34;56;78
The first part I have is (.*#.*com)
these are also possible source strings:
a.test#test.com;Alex;;Alex A.Test;;12;34;56;78
a.test#test.com;Alex;;Alex A.Test;Alex;;34;;78
a.test#test.com;Alex;Test;;Alex;12;34;56; and so on
You can try this regex:
^(.*#.*com)(([^";\n]*|"[^"\n]*");){8}(([^";\n]*|"[^"\n]*"))$
If you have a different number of columns after the address, change the number between { and }.
For your data, these are the captures:
1. `a.test#test.com`
2. `56;`
3. `56`
4. `78`
If you are sure there will be no " in your strings you can use this:
^(.*#.*com)(([^;\n]*);){8}([^;\n]*)$
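To try the quote-free regex from the command line, something like the following should work with grep -E. Treat it as a sketch: I dropped \n from the bracket expressions because grep already works line by line, and in a POSIX bracket expression \n would mean a literal backslash and the letter n. The third input line is a made-up example with too few columns; re and matches are made-up variable names.

```shell
re='^(.*#.*com)(([^;]*);){8}([^;]*)$'

# -c counts the lines that match the whole pattern
matches=$(printf '%s\n' \
  'a.test#test.com;Alex;Test;Alex A.Test;Alex;12;34;56;78' \
  'a.test#test.com;Alex;;Alex A.Test;Alex;;34;;78' \
  'a.test#test.com;Alex;12' |
  grep -cE "$re")
printf '%s\n' "$matches"   # 2
```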
Edit:
OP suggested this usage:
To use the first regex with sed you need the -i, -n and -E flags, and you have to escape the " characters.
The result will look like this:
sed -i -n -E "/(.*#.*com)(([^\";\n]*|\"[^\"\n]*\");){8}(([^\";\n]*|\"[^\"\n]*\"))/p"
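You can use something like:
".*#.*\.com;[A-Za-z]*;[A-Za-z]*;[A-Za-z .]*;[A-Za-z]*;[0-9][0-9];[0-9][0-9];[0-9][0-9];[0-9][0-9]"
This assumes every number has exactly two digits. Inside a bracket expression a comma is a literal character, so the ranges are written without separators.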
Using awk you can do this easily:
awk -F ';' '$1 ~ /\.com$/{print NF}' file
9
9
9
cat file
a.test#test.com;Alex;;Alex A.Test;;12;34;56;78
a.test#test.com;Alex;;Alex A.Test;Alex;;34;;78
a.test#test.com;Alex;Test;;Alex;12;34;56; and so on
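Building on the field count, a small sketch that reports lines which do not have exactly 9 fields. The second input line is a made-up bad record, and bad is a made-up variable name.

```shell
bad=$(printf '%s\n' \
  'a.test#test.com;Alex;;Alex A.Test;;12;34;56;78' \
  'a.test#test.com;Alex;12' |
  awk -F';' 'NF != 9 { print NR ": " NF " fields" }')
printf '%s\n' "$bad"   # 2: 3 fields
```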

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second column. I copied the line, but I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk which will be the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although GNU awk will handle that as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
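The quoting point can be checked without awk at all by looking at the string the shell would hand over. In this sketch I use $4 rather than $0 and set the shell's positional parameters myself, purely so the result is predictable; the variable names are made up.

```shell
set -- w x y FOURTH             # give the shell its own $4, just for this demo

double_quoted="{ print $4 }"    # the shell expands $4 before awk ever runs
single_quoted='{ print $4 }'    # passed through untouched

printf 'double: %s\n' "$double_quoted"   # double: { print FOURTH }
printf 'single: %s\n' "$single_quoted"   # single: { print $4 }

col4=$(printf 'a b c d\n' | awk "$single_quoted")
printf '%s\n' "$col4"                    # d
```

Only the single-quoted version reaches awk with $4 intact, which is why awk '...' is the safer habit.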
Why not just use this simple awk:
awk '$0=$4' DATA.txt
-0.185
0.185
It sets $0 to the value of $4 and, because the assignment is used as the pattern, prints the line with the default action whenever that value is true.
PS: do not use cat with a program that can read the file itself, like awk.
If field 4 can be 0 or empty, the pattern is false and the line is skipped; this version is more robust:
awk '{$0=$4}1' DATA.txt
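A quick way to see the difference between the two forms is to feed them a line whose fourth field is 0. The two-line input and the variable names below are made up for the demonstration.

```shell
data='a b c 0
a b c 5'

# Assignment used as the pattern: a value of 0 is false, so that line is dropped.
as_pattern=$(printf '%s\n' "$data" | awk '$0=$4')

# Assignment inside an action, then the always-true pattern 1 prints every line.
as_action=$(printf '%s\n' "$data" | awk '{$0=$4}1')

printf 'pattern form:\n%s\n' "$as_pattern"
printf 'action form:\n%s\n' "$as_action"
```

The pattern form prints only 5, while the action form prints both 0 and 5.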
If you're trying to split the input according to 3 or 4 spaces then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" passes a regex as the FS value, so the field separator becomes three or four spaces. In a regex, {min,max} is a range quantifier that repeats the previous token between min and max times.
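Some awk builds do not enable {min,max} in regexes, so here is a sketch of the same separator spelled out as three spaces plus an optional fourth. The spacing in the two sample lines is my assumption (three spaces before a negative number, four before a positive one), since the exact whitespace of the original data is not visible in the post; out is a made-up variable name.

```shell
out=$(printf '%s\n' \
  '1979 1   -0.176   -0.185   -0.412' \
  '1979 1   -0.176    0.185   -0.412' |
  awk -v FS='[ ][ ][ ][ ]?' '{ print $3 }')   # 3 spaces, optionally a 4th
printf '%s\n' "$out"
```

With this input, "1979 1" stays together as the first field because it contains only a single space, so the value of interest is in $3.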

Regex with perl one liner

I have the following:
XXUM_7_mauve_999119_ser_11.255255
UXUM_566_mauve_999119_ser_11.255255
IXUM_23_mauve_999119_ser_11.255255
and my attempt at a perl one-liner to extract the first number, which did not work, is as follows:
perl -pi -e "s/\S+_(\.+)_.+/Number$1/g" *.txt
I expected the following results:
Number 007
Number 566
Number 023
pls help
I'd use the -n option instead of the -p option and do the printing and formatting in the code:
perl -i~ -ne 'if (($num) = /[0-9]+/g) {
printf "Number %03d\n", $num;
} else {
print
}' *.txt
The problem is that this regex pattern /\S+_(\.+)_.+/ looks for a sequence of one or more literal dots . surrounded by underscores, so something like _..._ would match, but such a sequence doesn't exist in your file. I think you didn't mean to escape the dot. But even then, because the \S+ is greedy, it would find and capture the last field delimited by underscores, and so would capture ser from all three lines. Perhaps you meant to write \d+ instead of \.+, which is pretty much what I have written below.
This will do as you ask. It looks for the first occurrence of an underscore that is followed by a number of decimal digits, and uses printf to format the number as three digits.
You can add the -i qualifier, but I suggest you test it as it is first to save overwriting your data with erroneous results. Of course you could redirect the output to another file if you wished.
perl -ne'/_(\d+)/ and printf "Number %03d\n", $1' myfile
output
Number 007
Number 566
Number 023
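If perl is not a requirement, the same output can be sketched with awk by splitting on the underscore and formatting the second field. This swaps the regex for field splitting and assumes the number is always the second underscore-separated field, as in the three sample lines; out is a made-up variable name.

```shell
out=$(printf '%s\n' \
  'XXUM_7_mauve_999119_ser_11.255255' \
  'UXUM_566_mauve_999119_ser_11.255255' \
  'IXUM_23_mauve_999119_ser_11.255255' |
  awk -F_ '{ printf "Number %03d\n", $2 }')   # %03d pads to three digits
printf '%s\n' "$out"
```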
cat > /tmp/test
XXUM_7_mauve_999119_ser_11.255255
UXUM_566_mauve_999119_ser_11.255255
IXUM_23_mauve_999119_ser_11.255255
perl -i -ne 'if ($_=~/^\w+\_(\d+)\_mauve/g) { printf "Number %03d\n", $1; }' /tmp/test
cat /tmp/test
Number 007
Number 566
Number 023