Combining multiple regex expressions - regex

I have to perform several operations on a few strings to sanitize them. I have been able to do so, but only as multiple separate operations in bash:
# Getting the Content between START and END
var1=$(sed '/START/,/END/!d;//d' <<< "$content")
# Getting the 4th Line
var2=$(sed '4q;d' <<< "$content")
# Stripping all the new lines
var1=${var1//$'\n'/}
var2=${var2//$'\n'/}
# Escaping the double quotes i.e. A"B => A\"B
var1=$(sed "s/\"/\\\\\"/g" <<< "$var1")
var2=$(sed "s/\"/\\\\\"/g" <<< "$var2")
# Removing the contents wrapped in brackets i.e. A[...]B => AB
var1=$(sed -e 's/\[[^][]*\]//g' <<< "$var1")
var2=$(sed -e 's/\[[^][]*\]//g' <<< "$var2")
Processing the same data over and over is clearly wasteful when the same result could be achieved in a single operation.
Any suggestions?
Working Example:
SAMPLE INPUT
[1][INTRO]
[2][NAV]
ABAQUESNE, Masséot
...
START
French ceramist, who was the first grand-master of the glazed pottery
at Sotteville-ls-Rouen (20 years before [8]Bernard Palissy). He took
part in the development of the ceramic factory of Rouen. He was the
author - among others - of the ceramic triptych representing the Flood
(1550, couen, Muse de la Renaissance).
END
DESIRED OUTPUT
ABAQUESNE, Masséot
French ceramist, who was the first grand-master of the glazed pottery at Sotteville-ls-Rouen (20 years before Bernard Palissy). He took part in the development of the ceramic factory of Rouen. He was the author - among others - of the ceramic triptych representing the Flood (1550, couen, Muse de la Renaissance).

You can use awk for this:
awk 'NR==4{sub(/^[[:blank:]]+/, ""); print}' file
ABAQUESNE, Masséot
and a second awk command for the text between START and END:
awk '{sub(/^[[:blank:]]+/, "")}
/^START/{p=1; s=""; next}
/^END/{gsub(/\[[^]]*\]/, "", s); gsub(/"/, "\\\\&", s); print s; p=0; next}
p{s = (s == "" ? $0 : s " " $0)}' file
French ceramist, who was the first grand-master of the glazed pottery at Sotteville-ls-Rouen (20 years before Bernard Palissy). He took part in the development of the ceramic factory of Rouen. He was the author - among others - of the ceramic triptych representing the Flood (1550, couen, Muse de la Renaissance).
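If you would rather stay with sed, the repeated clean-up steps from the question can be folded into one small helper and applied to both variables. This is only a sketch: the function name sanitize is made up here, and like the question's code it deletes newlines outright rather than replacing them with spaces.

```shell
# sanitize: read text on stdin, join its lines, drop [bracketed] parts,
# and escape double quotes (A"B => A\"B).
# The function name is an assumption made for this sketch.
sanitize() {
  tr -d '\n' | sed -e 's/\[[^][]*\]//g' -e 's/"/\\"/g'
}

# Intended use with the commands from the question (bash here-strings):
#   var1=$(sed '/START/,/END/!d;//d' <<< "$content" | sanitize)
#   var2=$(sed '4q;d' <<< "$content" | sanitize)
result=$(printf 'A"B[8]C\nD\n' | sanitize)
printf '%s\n' "$result"
```

Passing several -e expressions to a single sed call avoids starting a new process for every substitution.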

Related

Convert Python Regex to Bash Regex

I am trying to write a bash script to convert files for streaming on the home network.
I am wondering if the community could recommend something that would allow me to use my existing regex to search a string for the presence of a pattern and replace the text following a pattern.
Part of this involves naming the file to include the quality, release year and episode information (if any of these are available).
I have some Python regex I am trying to convert to a bash regex search and replace.
There are a few options such as Sed, Grep or AWK but I am not sure what is best for my approach.
My existing python regex apparently uses an extended perl form of regex.
# Captures quality 1080p or 720p
determinedQuality = re.findall("[0-9]{3}[PpIi]{1}|[0-9]{4}[PpIi]{1}", next_line)
# Captures year (4 characters long and only numeric)
yearInitial = str(re.findall("[0-9]{4}[^A-Za-z]", next_line))
# Lazy programming on my part to clear up the string gathered from the year
determinedYear = re.findall("[0-9]{4}", yearInitial)
# If the string has either S00E00 or 1x99 present then it's a TV show
determinedEpisode = re.findall("[Ss]{1}[0-9]{2}[Ee]{1}[0-9]{2}|[0-9]{1}[x]{1}[0-9]{2}", next_line)
My aim is to end up with a filename all in lowercase with underscores instead of spaces in the filename along with quality information if possible:
# Sample of desired file names
harry_potter_2001_720p_philosphers_stone.mkv
S01E05_fringe_1080p.mkv
I simplified the regexes. For example, if you need 3 or 4 repetitions you can use {3,4}, and {1} is redundant, so you can remove it.
#!/bin/bash
INPUT="harry_potter_2001_720p_philosphers_stone.mkv"
#INPUT="S01E05_fringe_1080p.mkv"
determinedQuality=$(echo "$INPUT" | grep -Po '[0-9]{3,4}[PpIi]')
determinedYear=$(echo "$INPUT" | grep -Po '[0-9]{4}[^A-Za-z]' | grep -Po '[0-9]{4}')
determinedEpisode=$(echo "$INPUT" | grep -Po '[Ss][0-9]{2}[Ee][0-9]{2}|[0-9]x[0-9]{2}')
echo "quality: $determinedQuality"
echo "year: $determinedYear"
echo "episode: $determinedEpisode"
output for first one:
quality: 720p
year: 2001
episode:
output for second one:
quality: 1080p
year:
episode: S01E05
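None of these patterns needs Perl-only features, so grep -E (extended regular expressions) should be enough where grep -P is unavailable. The sketch below wraps the extraction and the renaming in two helpers; the names first_match and normalize_name, and the sample file name, are assumptions for illustration and not part of the original script.

```shell
# Hypothetical helpers; the names are not from the original post.

# Print the first match of an extended regex ($1) in a string ($2).
first_match() {
  printf '%s\n' "$2" | grep -Eo -- "$1" | head -n 1
}

# Lowercase a name and turn spaces into underscores.
normalize_name() {
  printf '%s\n' "$1" | tr ' ' '_' | tr '[:upper:]' '[:lower:]'
}

input='Fringe S01E05 1080p.mkv'
first_match '[0-9]{3,4}[PpIi]' "$input"                           # quality
first_match '[Ss][0-9]{2}[Ee][0-9]{2}|[0-9]x[0-9]{2}' "$input"   # episode
normalize_name "$input"                                           # file name
```

When nothing matches, first_match prints nothing, which mirrors the empty values in the output above.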

Best way to split data of varying length into columns [closed]

I have some data that includes 4 columns. The first column is a place and the last three columns are numbers or ranges of numbers.
What is the best way to split this data into four columns?
Red Coast Van 240-250 240-250 285-365
Beanbelt 310-400 310-400 450-540
North Star 310-400 310-400 450-540
Hamilton Fines, TA 380-390 380-390 505-530
Western Beanbelt 310-400 310-400 450-525
Main, PA 370-380 370-380 505-525
Dust Dodge, NY 380-390 380-390 520-525
Midwest Bean Belt (Des) m 400-475 400-475 572-615
Desired Output
Red Coast Van; 240-250; 240-250; 285-365
Beanbelt; 310-400; 310-400; 450-540
North Star; 310-400; 310-400; 450-540
Hamilton Fines, TA; 380-390; 380-390; 505-530
Western Beanbelt; 310-400; 310-400; 450-525
Main, PA; 370-380; 370-380; 505-525
Dust Dodge, NY; 380-390; 380-390; 520-525
Midwest Bean Belt (Des) m; 400-475; 400-475; 572-615
This is simple to do in Vim:
:%s/ \(\d\)/; \1/g
You instantly get the result you want (24 substitutions on 8 lines).
Notepad++
Ctrl+H
Find: (.+) (\d+-\d+) (\d+-\d+) (\d+-\d+)$
Replace: \1; \2; \3; \4
Select the "Regular expression" search mode
Based on your example and description, it looks like the 2nd, 3rd, and 4th columns are space-separated. You can therefore do this with awk as follows:
set the input field separator FS to a space and the output field separator OFS to "; " (you can do this once, in the BEGIN block);
join all the fields from $1 to $(NF - 3) (i.e. the columns from the 1st to the 4th-from-last) into a single field, $1;
shift the remaining fields $(NF - 2)-$NF to $2-$4 and set NF to 4;
print the modified record.
From words to code:
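awk '
BEGIN { FS = " "; OFS = "; " }
{
    # join fields 1..NF-3 into $1 (the place name)
    for (i = 2; i <= NF - 3; ++i)
        $1 = $1 " " $i
}
{
    # move the last three fields into positions 2-4 and drop the rest
    $2 = $(NF - 2)
    $3 = $(NF - 1)
    $4 = $NF
    NF = 4
}
{ print $0 }' yourfile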
A sed solution is also possible, which looks pretty much like the vim one, except that it has no fancy stuff (e.g. no \d in place of [0-9]):
sed 's/ \([0-9]\+-[0-9]\+\) \([0-9]\+-[0-9]\+\) \([0-9]\+-[0-9]\+\)$/; \1; \2; \3/' yourfile
You can simplify it with the -E option (extended regular expressions), which removes the need for most backslashes:
sed -E 's/ ([0-9]+-[0-9]+) ([0-9]+-[0-9]+) ([0-9]+-[0-9]+)$/; \1; \2; \3/' yourfile
Since you tagged vim, here's a Vim solution:
:%s/ \(\d\+-\d\+\) \(\d\+-\d\+\) \(\d\+-\d\+\)$/; \1; \2; \3/
which can be "prettified" with verymagic, or \v:
:%s/\v (\d+-\d+) (\d+-\d+) (\d+-\d+)$/; \1; \2; \3/
Note that there's no need to match whatever the first column is made up of; you only need to match the last 3 columns and prepend ; to each.
The above solution makes no assumption about how the first column is structured. If, instead, the first column is guaranteed not to contain a digit preceded by a space, then a simpler solution is possible (a slimmed-down version of another answer):
:%s/\ze \d/;/g
Obviously, if any of your lines is of the form
Western 666 Beanbelt 310-400 310-400 450-525
this last solution will break the first column in two.
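By contrast, the variant anchored to the last three columns copes with such a line. A minimal sketch, wrapping the sed -E command from above in a function (the name split_columns is chosen here for illustration):

```shell
# split_columns: insert "; " before each of the three trailing
# number ranges; the first column is left untouched.
# (Function name chosen for this sketch.)
split_columns() {
  sed -E 's/ ([0-9]+-[0-9]+) ([0-9]+-[0-9]+) ([0-9]+-[0-9]+)$/; \1; \2; \3/' "$@"
}

printf '%s\n' 'Western 666 Beanbelt 310-400 310-400 450-525' | split_columns
# prints: Western 666 Beanbelt; 310-400; 310-400; 450-525
```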

Removing specific character from anywhere between two specific strings?

I have a large text file that contains content as per the below example:
number="+123 123 123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456 789" text="Numbers here should keep their spaces"
number="+9 8 7 6 5" text="example 123 123 123"
What I would like is to remove any whitespace character between two identifying strings, in this case number= and " text= without touching the rest of the line. So that the desired output would be:
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
A regex like (?<=[0-9])(\s)(?=[0-9]) will interfere with the text field, which is undesirable.
I have tested a few variations of using something along the lines of (?<=address)(\s)(?=date) but this doesn't work. I think the problem lies in not being able to deal with the extra possible numbers in between the whitespace and the markers?
Adding wildcard matches to the lookbehinds/lookaheads, such as (?<=address.*)(\s)(?=.*date), doesn't seem to be valid, or else I've done it wrong. Making the whitespace match lazy with (\s+?) doesn't seem to help either, but this is about where my knowledge of regex really falls to pieces :)
Ideally I would also like to include the extra equals and quote characters in the markers for safety, i.e. number=" as the beginning marker and text=" as the end marker.
Any sed/awk or similar solutions are also welcomed if easier.
Using awk:
awk 'BEGIN{FS=OFS="\""}{gsub(/ /,"",$2)}1' file
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
Using a substitution and a loop:
sed ':l s/\(number="[^" \t]*\)\s\s*/\1/g;tl' input
this one gives:
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
Search: [ ](?=[^"]*" text=) (the [brackets] around the space are optional, they are there for clarity)
Replace: empty string.
In the regex demo, see the substitutions at the bottom.
Command-Line Syntax
I don't know the sed syntax to search and replace. With Perl (courtesy of #jaypal and #AvinashRaj):
perl -pe 's/ (?=[^"]*" text=)//g' file
From perl --help,
-p assume loop like -n but print line also, like sed
-e program one line of program (several -e's allowed, omit programfile)
Another awk solution:
awk -F ' text="' '{ gsub(/ /, "", $1); print $1 FS $2 }' file
-F ' text="' splits each input line into the part before text=" ($1) and the part after ($2); the -F option sets awk's special FS (field separator) variable to a regex that awk uses to split each input line into fields.
gsub(/ /, "", $1) (global substitution) removes all spaces from $1, the part before text=", by replacing each space with the empty string.
print $1 FS $2 prints the output: the modified $1 (spaces removed), joined with FS (i.e., text="), joined with $2 (the unmodified part after text=").
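For the stricter variant the question asks about, where a line is only touched if it really starts with number=" and contains " text=", one possible sketch uses awk's index() and substr(). The function name strip_number_spaces is an assumption made for this example; lines that lack either marker are printed unchanged.

```shell
# strip_number_spaces: remove spaces and tabs inside the number="..." value,
# but only on lines that start with number=" and contain " text=".
# (Hypothetical helper name; other lines pass through untouched.)
strip_number_spaces() {
  awk '{
    start = index($0, "number=\"")
    stop  = index($0, "\" text=\"")
    if (start == 1 && stop > 0) {
      head = substr($0, 1, stop - 1)   # number="+123 123 123
      tail = substr($0, stop)          # " text="..."
      gsub(/[ \t]/, "", head)
      $0 = head tail
    }
    print
  }' "$@"
}

printf '%s\n' 'number="+9 8 7 6 5" text="example 123 123 123"' | strip_number_spaces
# prints: number="+98765" text="example 123 123 123"
```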
Note: This is a complement to the existing answers to compare their performance.
Test environments:
OS X 10.9.4.
FreeBSD awk 20070501
FreeBSD sed (cannot tell version number)
Perl v5.16.2
Ubuntu 14.04
GNU Awk 4.0.1
sed (GNU sed) 4.2.2
Perl v5.18.2
The short of it:
The awk solutions are fastest.
On OS X, #jaypal's solution is faster, on Ubuntu it's #mklement0's (mine).
Followed by the perl solution.
The sed solution (accepted answer) is slowest.
Note that removing the unnecessary g option does improve things measurably, but doesn't change the big picture.
On OS X, the differences aren't dramatic.
On Ubuntu, the differences between the awk and the perl solutions are small, but the sed solution is dramatically slower.
Sample numbers, running against a 100,000-line input file 10 times.
Don't compare them directly (Ubuntu is running in a VM on the OS X machine), just look at their ratios. (Curiously, though, awk and perl ran faster in the Ubuntu VM):
OS X:
# awk (#jaypal)
real 0m3.848s
user 0m3.773s
sys 0m0.049s
# awk (#mklement0)
real 0m4.011s
user 0m3.959s
sys 0m0.045s
# perl
real 0m4.382s
user 0m4.291s
sys 0m0.063s
# sed
real 0m4.867s
user 0m4.816s
sys 0m0.044s
# sed (no `g`)
real 0m4.510s
user 0m4.460s
sys 0m0.044s
Ubuntu:
# awk (#mklement0)
real 0m1.850s
user 0m1.788s
sys 0m0.020s
# awk (#jaypal)
real 0m2.055s
user 0m1.996s
sys 0m0.012s
# perl
real 0m2.349s
user 0m2.276s
sys 0m0.024s
# sed
real 0m8.278s
user 0m8.196s
sys 0m0.016s
# sed (no `g`)
real 0m7.580s
user 0m7.488s
sys 0m0.028s

understand Regular expression in a sed command

I am learning sed and have been banging my head for an hour trying to understand this command. Here is the example from my book:
$ sed -n -e '/^[^|]*|[^|]*|56/w dpt_56' -e '/^[^|]*|[^|]*|89/w dpt_89' tel2.txt
$ cat dpt_56
Karama Josette|256 rue de la tempete|56100|Lorient|85.26.45.58
Zanouri Joel|45/48 boulevard du Gard|56100|Lorient|85/56/45/58
$ cat dpt_89
Joyeux Giselle|12. rue de la Source|89290|Vaux|45.26.28.47
Here is what I understand:
- the purpose of this command is to store in the dpt_56 file the lines for people from district 56, and likewise the lines for district 89 in dpt_89.
What I don't understand is the purpose or effect of the "|" and "^" characters in the regular expression. What does ^[^|]*|[^|]*|56 mean? All I see is "choose every line that doesn't begin with zero or more "|" OR that has zero or more "|"", but I get confused.
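The expression [^|]*| means "any number of characters that aren't |, followed by a |". In sed's default (basic) regular expressions a bare | is a literal pipe character, not alternation. The first ^ anchors the match to the start of the line, while the ^ inside the brackets negates the character class.
The reason [^|] is used instead of . is that .* would greedily match across | delimiters, so the pattern could match 56 at the start of any later field rather than only the third.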
It looks like the sed command itself is checking the 3rd field of a pipe delimited input. If the value starts with 56 then it writes it to dpt_56, if the value starts with 89, then it writes it to dpt_89.
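To see the anchoring at work, here is a small sketch that prints the matching lines instead of writing them to files. The helper name third_field_starts_with is made up for this example, and its argument is assumed to contain only digits.

```shell
# third_field_starts_with PREFIX: print the pipe-delimited records
# (read from stdin) whose third field begins with PREFIX.
# Hypothetical helper; PREFIX is assumed to contain only digits.
third_field_starts_with() {
  sed -n "/^[^|]*|[^|]*|$1/p"
}

records='Karama Josette|256 rue de la tempete|56100|Lorient|85.26.45.58
Zanouri Joel|45/48 boulevard du Gard|56100|Lorient|85/56/45/58
Joyeux Giselle|12. rue de la Source|89290|Vaux|45.26.28.47'

printf '%s\n' "$records" | third_field_starts_with 89
# prints only the "Joyeux Giselle" record
```

Note that the first record contains 256 in its second field and the second record contains 56 in its last field; neither occurrence matters, because the pattern only looks at the start of the third field.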

How to split a string or file that may be delimited by a combination of comments and spaces, tabs, newlines, commas, or other characters

Suppose the file list.txt contains really ugly data like this:
aaaa
#bbbb
cccc, dddd; eeee
ffff;
#gggg hhhh
iiii
jjjj,kkkk ;llll;mmmm
nnnn
How do we parse/split that file with a bash script, excluding the commented lines and delimiting it by all commas, semicolons, and white-space (including tabs, spaces, newlines, and carriage-return characters)?
Using shell commands:
grep -v '^[[:space:]]*#' file | tr ';,' '\n\n' | awk '$1=$1'
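The same idea can also be sketched in a single awk call that additionally splits on whitespace inside a line. This is an illustrative variant, not the pipeline above: the function name split_tokens is made up, and the delimiter set is written out explicitly (space, tab, carriage return, comma, semicolon) rather than with a character class.

```shell
# split_tokens: print one token per line, skipping comment lines and
# splitting on spaces, tabs, carriage returns, commas, and semicolons.
# (Function name chosen for this sketch.)
split_tokens() {
  awk '
    /^[ \t]*#/ { next }                      # skip comment lines
    {
      n = split($0, tok, /[ \t\r;,]+/)       # split on any run of delimiters
      for (i = 1; i <= n; i++)
        if (tok[i] != "") print tok[i]       # drop empty pieces at either end
    }
  ' "$@"
}

printf 'aaaa\n  #bbbb\ncccc, dddd; eeee\nffff;\n' | split_tokens
```

Unlike the awk '$1=$1' idiom, this version keeps a token that is just 0.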
It can be done with the following code:
#!/bin/bash
### read file:
file="list.txt"
array=()
while IFS= read -r line || [[ -n $line ]]; do
    ### drop a trailing carriage return, if any
    line=${line%$'\r'}
    ### skip lines that begin with a "#" or "<whitespace>#"
    match_pattern='^[[:space:]]*#'
    if [[ $line =~ $match_pattern ]]; then
        continue
    fi
    ### replace semicolons and commas with a space everywhere...
    temp_line=${line//[;,]/ }
    ### ...then split the line at whitespace into an array of words
    ### (read -a uses the default IFS here and does no glob expansion)
    read -r -a split_line_arr <<< "$temp_line"
    ### push each word in split_line_arr onto the final array
    array+=("${split_line_arr[@]}")
done < "$file"
echo "Array items:"
for item in "${array[@]}"; do
    printf " %s\n" "$item"
done
This was not posted as a question, but rather as a better solution to what others have touched upon when answering other related questions. What is unique here is that those other questions/solutions did not really address how to split a string when it is delimited by a combination of spaces, other characters, and comments; this is one solution that addresses all three simultaneously...
Related questions:
How to split one string into multiple strings separated by at least one space in bash shell?
How do I split a string on a delimiter in Bash?
Additional notes:
Why do this with bash when other scripting languages are better suited for splitting? A bash script is more likely to have all the libraries it needs when running from a basic upstart or cron (sh) shell, compared with a perl program for example. An argument list is often needed in these situations and we should expect the worst from people who maintain those lists...
Hopefully this post will save bash newbies a lot of time in the future (including me)... Good luck!
sed 's/[# \t,]/REPLACEMENT/g' input.txt
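The above command replaces comment characters ('#'), spaces (' '), tabs ('\t'), and commas (',') with an arbitrary string ('REPLACEMENT'). Note that it does not skip commented lines.
To replace newlines as well, you can pipe the result through tr; because tr maps single characters, each newline can only become one character (a space here):
sed 's/[# \t,]/REPLACEMENT/g' input.txt | tr '\n' ' '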
if you have Ruby on your system
File.open("file").each_line do |line|
  next if line[/^\s*#/]
  puts line.split(/\s+|[;,]/).reject{|c|c.empty?}
end
output
# ruby test.rb
aaaa
cccc
dddd
eeee
ffff
iiii
jjjj
kkkk
llll
mmmm
nnnn