Extract a dictionary of substring in bash? - regex

I have files of the kind:
(1), (2), (3), (4), (5), (6), (10), (11), (12), (13), (14), (15), (16), (17), (18), (24), (25), (26), (27), (28), (29), (30), (31), (32), (33), (34), (35), (36), (37), (38), (39), (40), (41), (42), (43), (51), (52), (53), (54), (55), (56), (57), (58), (62), (63), (64), (65), (66), (67), (68), (69), (70), (71), (72), (73), (74) Use method number 1. (7), (8), (9), (19), (20), (21), (22), (23), (59), (60), (61) Use method number 2. (44), (45), (46), (47), (48), (49), (50) Use method number 3.
I would like to build a dictionary containing the numbers between parentheses and link them to the sentences of the type: "Use method number #". So, in this case:
1,2,3,4,5...74 --> Use method number 1.
7,8,9,19....61 --> Use method number 2.
Currently I am building a complex while loop that matches a regex (^ *\([0-9]+\)), extracts each number, deletes the match, and starts again until the regex is no longer found; then it extracts the sentence. But this performs poorly and is tedious to maintain.
Do you have any suggestions on how to improve this with something more compact than the while loop?
I am not bothered by the dictionary structure, do not consider it right now if it does not imply modifying the method.
EDIT. ADDING REAL DATA STRING:
(12), (13), (14), (15) P.S.: 3 días en cultivo de invernadero. Efectuar un máximo de 6 aplicaciones por
campaña a intervalos de 7 días utilizando un volumen máximo de caldo de 600 l/Ha. y un máximo de
7,5 Kg de cobre inorgánico por campaña.
(28) Tratamiento en otoño, pulverizando hasta una altura de 1,5 m.
(44), (45), (46), (47), (48), (49), (50), (51) Efectuar sólo tratamientos desde la cosecha hasta la
floración, limitando la aplicación a 1200 l. de caldo/Ha. y un máximo de 3 aplicaciones por campaña
(con un intervalo de tratamientos de 14 días) y un máximo de 7,5 Kg. de cobre inorgánico/Ha.por
campaña.

You can use sed:
sed -r 's/( *\(|\))//g;s/\./\n/g' input.txt
This assumes that your input file does not contain line breaks. If it contains line breaks the command needs to get modified a bit.
Explanation:
The first command s/( *\(|\))//g removes the parentheses and additional whitespace. The second command s/\./\n/g replaces each dot with a newline.
Oh, I missed that you want to add an additional -->. If you really need that, the second sed command needs to be modified:
sed -r 's/( *\(|\))//g;s/U[^.]+\./--> \0\n/g' input.txt
Now the second command searches for a sequence that starts with U and runs up to the next dot, prepends --> to it, and adds the newline after the dot.
Output:
1,2,3,4,5,6,10,...,74 --> Use method number 1.
7,8,9,19,20,21,22,23,59,60,61 --> Use method number 2.
44,45,46,47,48,49,50 --> Use method number 3.
One more thing: the above command adds an extra newline at the end of the output. You can suppress that by adding a third sed command, s/\n$//, which removes the trailing newline:
sed -r 's/( *\(|\))//g;s/U[^.]+\./--> \0\n/g;s/\n$//' input.txt
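If the input does contain line breaks, as in the real data from the EDIT, one option is to join the lines first and then walk the text with match(). The following is only a minimal sketch in portable awk, not the sed method above; the function name number_to_sentence and the tab-separated output format are assumptions made for illustration. It also assumes that a sentence never contains a bare "(N)" of its own.

```shell
# Hypothetical helper: print "number<TAB>sentence" pairs, one per line.
number_to_sentence() {
    awk '
    { text = (NR > 1 ? text " " : "") $0 }     # join wrapped lines with a space
    END {
        # Each entry is a run "(N), (N), ..." followed by free text that
        # lasts until the next "(N)" or the end of the input.
        while (match(text, /\([0-9]+\)(, *\([0-9]+\))*/)) {
            keys = substr(text, RSTART, RLENGTH)
            text = substr(text, RSTART + RLENGTH)
            if (match(text, /\([0-9]+\)/))
                body = substr(text, 1, RSTART - 1)
            else
                body = text
            gsub(/^ +| +$/, "", body)           # trim the sentence
            gsub(/[() ]/, "", keys)             # "(1), (2)" -> "1,2"
            n = split(keys, k, ",")
            for (i = 1; i <= n; i++) print k[i] "\t" body
        }
    }' "$@"
}

printf '%s\n' '(1), (2) Use method number 1. (7) Use method number 2.' | number_to_sentence
```

Each output line can then be read into whatever dictionary structure you settle on, for example with while IFS="$(printf '\t')" read -r key sentence.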

Quite an idiomatic GNU awk solution:
awk -v RS="Use method number [0-9]." \
    -v OFS=" --> " \
    'NF{gsub(/\s*|\(|\)/, ""); print $0, RT}' file
Test
$ awk -v RS="Use method number [0-9]." -v OFS=" --> " 'NF{gsub(/\s*|\(|\)/, ""); print $0, RT}' a
1,2,3,4,5,6,10,11,12,13,14,15,16,17,18,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,51,52,53,54,55,56,57,58,62,63,64,65,66,67,68,69,70,71,72,73,74 --> Use method number 1.
7,8,9,19,20,21,22,23,59,60,61 --> Use method number 2.
44,45,46,47,48,49,50 --> Use method number 3.
Explanation
-v RS="Use method number [0-9]." sets the record separator to the string "Use method number X.", X being a digit.
-v OFS=" --> " sets the output field separator used by print.
NF{gsub(/\s*|\(|\)/, ""); print $0, RT} main code
-- NF {} if there is at least one field, proceed.
-- gsub(/\s*|\(|\)/, "") remove all spaces, ( and ) from the string.
-- print $0, RT print the replaced string together with the record separator that was used ("Use method number X."). Using RT instead of RS so that we catch the value of the specific X used in the string.
From man awk:
RT
The record terminator. Gawk sets RT to the input text that matched
the character or regular expression specified by RS.

you can very intuitively do it with an ed script,
:: ed.script ::
# first you split your data in multiple lines
,s/\(\(([0-9]*), \)*([0-9]*)\)/\
\1\
/g
# then for each matching line with numbers, you remove unwanted chars
# and append " --> " to the next line
,g/\(\(([0-9]*), \)*([0-9]*)\)/\
s/[)( ]//g\
a\
-->\
.
# and finally you join lines
,g/^ -->/-1,+1j
# save if you want
w
Then you launch it with the following command:
cat ed.script | ed -s file.txt
That was the intuitive part... and it works with your sample data.

Related

Combining multiple regex expressions

I have to perform multiple operations on a few strings to sanitize them. I have been able to do so, but only as multiple separate operations in bash:
# Getting the Content between START and END
var1=$(sed '/START/,/END/!d;//d' <<< "$content")
# Getting the 4th Line
var2=$(sed '4q;d' <<< "$content")
# Stripping all the new lines
var1=${var1//$'\n'/}
var2=${var2//$'\n'/}
# Escaping the double quotes i.e. A"B => A\"B
var1=$(sed "s/\"/\\\\\"/g" <<< "$var1")
var2=$(sed "s/\"/\\\\\"/g" <<< "$var2")
# Removing the contents wrapped in brackets i.e. A[...]B => AB
var1=$(sed -e 's/\[[^][]*\]//g' <<< "$var1")
var2=$(sed -e 's/\[[^][]*\]//g' <<< "$var2")
No doubt it's extremely bad to read the same thing over and over again when the same can be done in single operation.
Any suggestions?
Working Example:
SAMPLE INPUT
[1][INTRO]
[2][NAV]
ABAQUESNE, Masséot
...
START
French ceramist, who was the first grand-master of the glazed pottery
at Sotteville-ls-Rouen (20 years before [8]Bernard Palissy). He took
part in the development of the ceramic factory of Rouen. He was the
author - among others - of the ceramic triptych representing the Flood
(1550, couen, Muse de la Renaissance).
END
DESIRED OUTPUT
ABAQUESNE, Masséot
French ceramist, who was the first grand-master of the glazed pottery at Sotteville-ls-Rouen (20 years before Bernard Palissy). He took part in the development of the ceramic factory of Rouen. He was the author - among others - of the ceramic triptych representing the Flood (1550, couen, Muse de la Renaissance).
You can use awk for this:
awk 'NR==4{sub(/^[[:blank:]]+/, ""); print}' file
ABAQUESNE, Masséot
and 2nd awk:
awk '{sub(/^[[:blank:]]+/, "")}
/^START/{p=1; next}
/^END/{gsub(/\[[^]]*\]/, "", s); gsub(/"/, "\\\\&", s); print s; p=0; next}
p{s = (s == "" ? "" : s " ") $0}' file
French ceramist, who was the first grand-master of the glazed pottery at Sotteville-ls-Rouen (20 years before Bernard Palissy). He took part in the development of the ceramic factory of Rouen. He was the author - among others - of the ceramic triptych representing the Flood (1550, couen, Muse de la Renaissance).
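If you would rather stay with sed, the repeated steps can also be folded into one helper that is applied to each extracted piece. This is a minimal sketch, not a drop-in replacement: the function name sanitize and the cut-down sample text are assumptions made for illustration, and the two extraction commands are the ones from the question.

```shell
# Hypothetical helper: join lines, drop [..] groups, escape double quotes,
# and trim the trailing space left over from joining.
sanitize() {
    tr '\n' ' ' |
        sed -e 's/\[[^][]*\]//g' \
            -e 's/"/\\"/g' \
            -e 's/ *$//'
}

# Cut-down, made-up sample in the same shape as the input in the question.
content='[1][INTRO]
[2][NAV]

Some Title
START
French "ceramist" (20 years before
[8]Bernard Palissy).
END'

var1=$(printf '%s\n' "$content" | sed '/START/,/END/!d;//d' | sanitize)
var2=$(printf '%s\n' "$content" | sed '4q;d' | sanitize)
printf '%s\n%s\n' "$var2" "$var1"
```

The design choice here is simply to keep extraction (which differs per variable) separate from sanitizing (which is identical), so the shared part is written once.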

Sed Regex to delete all numbers except ordinals

I need to delete all numbers from a file except those followed by (ST|TH|[RN]D) (ordinal numbers). I'm not sure how to introduce an exception like that into sed (I know of [^], but that wouldn't let me specify the optional string (ST|TH|[RN]D)).
It seems that lookaheads might be the answer but my construction isn't working
s/[0-9][0-9]*(?!(ST|[RN]D))//g
Sample input:
12663 METRO CONDOMINIUM AS DESC IN INST# 200800031138 UNIT A
126TH AVENUE INDUSTRIAL PARK
13 AND 12-29-19
102-1st AVE CONDO
Just added the last one, and that is a doozy of input. I would really like to eliminate the preceding numbers but leave the ordinal. Revo's example worked pretty well. But this edge case is actually important to me.
Expected output:
METRO CONDOMINIUM AS DESC IN INST# UNIT A
126TH AVENUE INDUSTRIAL PARK
AND --
-1st AVE CONDO
Don't care about eliminating spaces. Can do that on my own.
Sed doesn't support look-ahead, but Perl does. However, your regex isn't quite right: In 123RD it matches 12 (because 12 is a sequence of digits that's not followed by ST or ND or RD; it's followed by 3).
You can fix this by adding [0-9] to the look-ahead:
perl -pe 's/[0-9][0-9]*(?!([0-9]|ST|[RN]D))//g'
Also, you don't need the inner capturing parens in the look-ahead group, XX* can be simplified to X+, and we want to exclude TH as well:
perl -pe 's/[0-9]+(?![0-9]|ST|[RN]D|TH)//g'
Sample output from your test input:
METRO CONDOMINIUM AS DESC IN INST# UNIT A
126TH AVENUE INDUSTRIAL PARK
AND --
-st AVE CONDO
Note that the 1 in 1st was removed. This is because S does not match s. We can fix that by making the regex case insensitive:
perl -pe 's/[0-9]+(?![0-9]|ST|[RN]D|TH)//ig' test.txt
METRO CONDOMINIUM AS DESC IN INST# UNIT A
126TH AVENUE INDUSTRIAL PARK
AND --
-1st AVE CONDO
Since sed doesn't have support for lookarounds you have to define each path using:
[0-9]+(([sS]([^Tt]|$)|[Tt]([^Hh]|$)|[RNrn]([^Dd]|$))|[^RNSTrnst0-9]|$)
For case-insensitivity I included both upper and lower cases into bracket notations.
GNU sed command (POSIX ERE):
sed -r 's/[0-9]+(([sS]([^Tt]|$)|[Tt]([^Hh]|$)|[RNrn]([^Dd]|$))|[^RNSTrnst0-9]|$)/\1/g' file
Regex breakdown:
[0-9]+               # Match digits
(                    # Start of Capturing Group #1
  (                  # Start of Capturing Group #2
    [sS]             # Match S or s
    (                # Start of Capturing Group #3
      [^Tt]          # If a character exists after S it shouldn't be T
      |              # Or
      $              # Match end of line position
    )                # End of Capturing Group #3
    |                # Or
    [Tt]             # Match T or t
    (                # Start of Capturing Group #4
      [^Hh]          # If a character exists after T it shouldn't be H
      |              # Or
      $              # Match end of line position
    )                # End of Capturing Group #4
    |                # Or
    [RNrn]           # Match a letter from set
    (                # Start of Capturing Group #5
      [^Dd]          # If a character exists after R or N it shouldn't be D
      |              # Or
      $              # Match end of line position
    )                # End of Capturing Group #5
  )                  # End of Capturing Group #2
  |                  # Or
  [^RNSTrnst0-9]     # Match a character other than the ones in set
  |                  # Or
  $                  # Match end of line position
)                    # End of Capturing Group #1
Perhaps this will get you most of the way there: a sequence of digits not followed by an alphanumeric character or end-of-line
$ cat file
foo 1234 bar 32nd gaz 1234
1234hello
$ sed -E 's/[[:digit:]]+($|[^[:alnum:]])/\1/g' file
foo bar 32nd gaz
1234hello
sed is for simple substitutions on individual lines (e.g. s/old/new/), that is all. For anything else you should be using awk. With GNU awk for multi-char RS, RT, and IGNORECASE:
$ awk -v RS='[0-9]+(ST|TH|[RN]D)' -v IGNORECASE=1 '{gsub(/[0-9]+/,""); ORS=RT} 1' file
METRO CONDOMINIUM AS DESC IN INST# UNIT A
126TH AVENUE INDUSTRIAL PARK
AND --
-1st AVE CONDO
With sed and your input file
sed -E 's/(\<[0-9]+\>)//g' infile
output
METRO CONDOMINIUM AS DESC IN INST# UNIT A
126TH AVENUE INDUSTRIAL PARK
AND --
-1st AVE CONDO
This might work for you (GNU sed):
sed -r 's/^/\n/;:a;s/\n([^0-9]+)/\1\n/;ta;s/\n([0-9]*(1st|2nd|3rd|[4-90]th))/\1\n/I;ta;s/\n[0-9]+/\n/;ta;s/\n//' file
Use a newline as a delimiter to parse each line. Insert a newline at the head of the line. If the string following the newline is not numeric, pass over that string. If the string following the newline is ordinal, also pass over the string. If the string following the newline is numeric, remove it. At the end of the line, remove the newline delimiter.
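For comparison, the same "pass over ordinals, drop other digit runs" idea can be written without look-aheads or GNU extensions. This is a minimal sketch in portable awk, not any of the answers above; the function name strip_non_ordinals is an assumption made for illustration. It walks each line with match() and keeps a run of digits only when ST, ND, RD or TH (in any case) follows it immediately.

```shell
# Hypothetical helper: delete every digit run that is not directly
# followed by an ordinal suffix (ST, ND, RD, TH; case-insensitive).
strip_non_ordinals() {
    awk '{
        out = ""; rest = $0
        while (match(rest, /[0-9]+/)) {
            num  = substr(rest, RSTART, RLENGTH)
            tail = substr(rest, RSTART + RLENGTH)
            out  = out substr(rest, 1, RSTART - 1)   # text before the digits
            # Keep the digits only when an ordinal suffix follows immediately.
            if (toupper(substr(tail, 1, 2)) ~ /^(ST|ND|RD|TH)$/) out = out num
            rest = tail
        }
        print out rest
    }' "$@"
}

printf '%s\n' '102-1st AVE CONDO' '13 AND 12-29-19' | strip_non_ordinals
```

Like the other answers, this leaves the surrounding spaces untouched.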

Edit Floating Specific Point Field Preserving/Padding w/ Whitespace in AWK?

I have a file with the line:
CH1 12.30 4.800 12 !
I want to replace a specific field ... say $2 with some equivalent scaled by chosen floating point scalar on [0.0,1.0). However, I want to keep the same number of decimal digits and further to pad the front end with spaces to maintain the original length.
I'm thinking some combination len/gsub/printf in awk could accomplish this.
As an example of what I have tried currently:
scalar=0.00; echo 'CH1 12.30 4.800 12 !' | awk -v sc=$scalar '/CH1/{gsub(/[0-9]*\.[0-9]*/,$2*sc,$2);} {print;}'
Output:
CH1 0 4.800 12 !
Notes:
Correctly outputs scaled #, but spaces are stripped from not just field $2, but entire line.
scalar=0.00; echo 'CH1 12.30 4.800 12 !' | awk -v sc=$scalar '/CH1/{gsub(/$2/,$2*sc,$0);} {print;}'
Output:
CH1 12.30 4.800 12 !
Notes:
Does nothing! Output is unchanged.
Assumptions:
Fields $2 and $3 may be the same, but I ONLY want to change field $2.
Field $1 contains only alphanumeric characters.
Fields $2 and $3 are floating point numbers with an arbitrary # of decimal digits, typically with the # of digits being on the range [1,4]. The whole part has no more than 3 digits.
Field $4 is an integer on the range [8,99].
Anything after field $4 is a comment and may contain special characters.
Searching for similar questions I've come across some questions pertaining to whitespace preservation, and those have given me some ideas... but mine is a bit different because I actually want to add whitespace, to keep the decimal place effectively locked in the same spot on the line, to keep user formatting nice in the targeted file.
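The gsub(/$2/,...) expressions fail because nothing is expanded inside a regex literal: /$2/ does not refer to whatever is in field 2. There, $ is the end-of-line anchor, so the pattern asks for a 2 after the end of the line and can never match. (And gsub is overkill since we are only changing one instance, so plain sub suffices, but gsub is harmless here.)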
We can use just $2 (without slashes, although it's going to be treated as a regular expression rather than a literal string):
$ scalar=0.00; echo 'CH1 12.30 4.800 12 !' |
awk -v sc=$scalar '/CH1/{gsub($2,$2*sc);} {print;}'
CH1 0 4.800 12 !
This loses the decimal place stuff too, so is still not quite what we want, but shows that your approach can work.
Given that sprintf() can produce a string according to a format directive like "%5.2f" (which is what we would want to get 12.30), all we need to do is figure out the total length of the field $2 and the length of the fractional part (after the .), which is easy using split and length. Constructing the replacement string is even easier than it might first look, because instead of a literal 5 and 2, we can use * to extract integer arguments. Hence:
$ cat foo.sh
#! /bin/sh
scalar=0.00
echo 'CH1 12.30 4.800 12 !'
echo 'CH1 12.30 4.800 12 !' |
awk -v sc=$scalar '
$2 ~ /[0-9]*\.[0-9]*/ {
    split($2, parts, /\./)
    ofraclen = length(parts[2])
    repl = sprintf("%*.*f", length($2), ofraclen, $2 * sc)
    sub(/[0-9]*\.[0-9]*/, repl)
}
{print}
'
$ sh foo.sh
CH1 12.30 4.800 12 !
CH1  0.00 4.800 12 !
I put in the extra echo so that we can see that the fields still line up. I changed the matching criteria to $2 ~ ... so that we are guaranteed that $2 will split properly. We split it into its integer and fractional parts, grab the length of the fractional part, produce the replacement string, and then use sub on the first occurrence of a floating point number in the line. This is safe only if field $1 never looks like a floating point number: there is no test for that, and if $1 did match we would substitute the wrong field.
(I actually like the semicolons after each statement, but I took them all out here since they're not strictly required. Also, most of the temporary variables can be eliminated, keeping just parts, but the result will be difficult to understand.)
This is a general approach to reproducing the padding from the input in the output after operating on some field(s):
$ cat tst.awk
NR==1 {
    # Find the width of each space-padded, right-aligned field:
    rec = $0
    for (i=1; i<=NF; i++) {
        match(rec,/[^[:space:]]+/)
        w[i] = RSTART - 1 + RLENGTH
        rec = substr(rec,w[i]+1)
    }
    # Find the precision of the target field:
    match($2,/\..*/)
    p = RLENGTH - 1
}
{
    # print the original just for comparison
    print
    # do the math:
    $2 = sprintf("%.*f", p, $2 * scalar)
    # print the updated record:
    for (i=1;i<=NF;i++) {
        printf "%*s", w[i], $i
    }
    print ""
}
$ awk -v scalar=0 -f tst.awk file
CH1 12.30 4.800 12 !
CH1 0.00 4.800 12 !
$ awk -v scalar=0.5 -f tst.awk file
CH1 12.30 4.800 12 !
CH1 6.15 4.800 12 !
$ awk -v scalar=9 -f tst.awk file
CH1 12.30 4.800 12 !
CH1 110.70 4.800 12 !
The above will work no matter what the value of scalar or which floating point field you want to change (easy tweak to work for decimal fields too if desired) and no matter what the value of $1.
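If only field $2 should change and the rest of the line must stay byte-for-byte identical, another option is to splice the new value into $0 at the column where $2 starts. This is a minimal sketch under stated assumptions, not a replacement for either answer: the function name scale_field2 is hypothetical, $2 is assumed to look like digits.digits, and the scaled value is assumed to need no more characters than the original (true for a scalar on [0.0,1.0)). The format string is built by concatenation, so it does not rely on awk supporting * in printf formats.

```shell
# Hypothetical helper: scale field 2 by the scalar given as $1, keeping the
# original width and number of decimals; all other columns stay untouched.
scale_field2() {
    awk -v sc="$1" '
    $2 ~ /^[0-9]+\.[0-9]+$/ {
        old = $2                                # copy before $0 is rewritten
        match($0, /^[ \t]*[^ \t]+[ \t]+/)       # field 1 plus the gap after it
        start = RLENGTH + 1                     # column where field 2 begins
        prec = length(old) - index(old, ".")    # digits after the decimal point
        fmt = "%" length(old) "." prec "f"      # e.g. "%5.2f" for 12.30
        $0 = substr($0, 1, start - 1) sprintf(fmt, old * sc) substr($0, start + length(old))
    }
    { print }'
}

printf '%s\n' 'CH1   12.30   4.800  12 !' | scale_field2 0.5
```

Because the position of $2 is found from the start of the line, a $3 with the same value as $2 is left alone.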

sed: Dynamically remove all text columns except positions defined by pattern

By searching and trying (no regex expert), I have managed to process a text output using sed or grep, and extract some lines, formatted this way:
Tree number 280:
1 0.500 1 node_15 6 --> H 1551.code
1 node_21 S ==> H node_20
Tree number 281:
1 0.500 1 node_16 S ==> M 1551.code
1 node_20 S --> H node_19
Then, using
sed 's/^.\{35\}\(.\{9\}\).*/\1/' infile , I get the desired part, plus some output which I get rid of later (not a problem).
Tree number 280:
6 --> H
S ==> H
Tree number 281:
S ==> M
S --> H
However, the horizontal position of the C --> C pattern may vary from file to file, although it is always aligned. Is there a way to extract the --> or ==> including the single preceding and following characters, no matter which columns they are found in?
The Tree number # part is not necessary and could be left blank as well, but there has to be a separator of a kind.
UPDATE (alternative approach)
Trying to use grep, I issued
grep -Eo '(([a-zA-Z0-9] -- |[a-zA-Z0-9] ==)> [a-zA-Z0-9]|Changes)' infile.
A sample of my initial file follows, if anyone thinks of a better, more efficient approach, or my use of regex is insane, please comment!
..MISC TEXT...
Character change lists:
Character CI Steps Changes
----------------------------------------------------------------
1 0.000 1 node_235 H --> S node
1 node_123 S ==> 6 1843
1 node_126 S ==> H 2461
1 node_132 S ==> 6 1863
1 node_213 H --> I 1816
1 node_213 H --> 8 1820
..CT...
Character change lists:
Character CI Steps Changes
----------------------------------------------------------------
1 0.000 1 node_165 H --> S node
1 node_123 S ==> 6 1843
1 node_231 H ==> S 1823
..MISC TEXT...
Grep is a bit easier for just extracting the matching text. If you need different separators, list the characters directly inside the bracket expression, as in [-=]; a | inside brackets is matched literally rather than acting as a separator.
grep -o '. [-=][-=]> .' infile
Or, if you really want to use sed for this, the following should do it: the first part selects only lines that contain the pattern, and the second part extracts only the matching text.
sed -n '/[-=][-=]>/{s/.*\(. [-=][-=]> .\).*/\1/p}' infile
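Since the question asks for some kind of separator between trees, one way to keep the "Tree number" lines while extracting the arrows is a second sed expression. This is a minimal sketch; the function name extract_changes is an assumption made for illustration, and the input is assumed to look like the pre-filtered sample in the question.

```shell
# Hypothetical helper: keep "Tree number" header lines as separators and
# print only the "X --> Y" / "X ==> Y" part of every other matching line.
extract_changes() {
    sed -n -e '/^Tree number/p' \
           -e 's/.*\(. [-=][-=]> .\).*/\1/p' "$@"
}

printf '%s\n' 'Tree number 280:' \
              '1 0.500 1 node_15 6 --> H 1551.code' \
              '1 node_21 S ==> H node_20' | extract_changes
```

Nothing here depends on the column position, because the leading .* absorbs whatever comes before the arrow.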

understand Regular expression in a sed command

I am learning sed and I've been banging my head for an hour trying to understand this command. Here is the example from my book:
$ sed -n -e '/^[^|]*|[^|]*|56/w dpt_56' -e '/^[^|]*|[^|]*|89/w dpt_89' tel2.txt
$ cat dpt_56
Karama Josette|256 rue de la tempete|56100|Lorient|85.26.45.58
Zanouri Joel|45/48 boulevard du Gard|56100|Lorient|85/56/45/58
$ cat dpt_89
Joyeux Giselle|12. rue de la Source|89290|Vaux|45.26.28.47
Here is what I understand:
- the purpose of this command is to store in the dpt_56 file the lines of the people from the 56... district, and the same for the 89 district in dpt_89.
What I don't understand is the purpose or effect of the "|" and "^" characters in the regex => What does ^[^|]*|[^|]*|56 mean? All I see is "choose every line that doesn't begin with zero or more "|", OR that has zero or more "|"...", but I get confused.
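The leading ^ anchors the match to the start of the line. Inside brackets, a leading ^ negates the set, so the expression [^|]*| means "any number of characters that aren't |, followed by a |". In sed's default (basic) regex syntax, a | outside brackets is a literal pipe, not alternation.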
The reason [^|] is used instead of . is to ensure that the . wildcard doesn't greedily eat too much input.
It looks like the sed command itself is checking the 3rd field of a pipe delimited input. If the value starts with 56 then it writes it to dpt_56, if the value starts with 89, then it writes it to dpt_89.
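To see why [^|] matters, it can help to run both variants on a record where 56 shows up in a later field. This is a small illustrative sketch: the second record (Durand Paul) is made up for the comparison and is not from the book's tel2.txt.

```shell
# First record is from the question; the second one is hypothetical and has
# 56 at the start of its LAST field, not in the postcode (third field).
data='Karama Josette|256 rue de la tempete|56100|Lorient|85.26.45.58
Durand Paul|7 rue X|75001|Paris|56.11.22.33'

# Field by field: [^|]* cannot cross a pipe, so 56 must start field 3.
# Prints only the Karama line.
printf '%s\n' "$data" | sed -n '/^[^|]*|[^|]*|56/p'

# With .* each wildcard may swallow pipes, so a 56 after any later pipe
# is enough. Prints both lines.
printf '%s\n' "$data" | sed -n '/^.*|.*|56/p'
```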