Regex rookie here, hoping to change that. I have a seemingly very simple problem, but I cannot figure out the correct regex for it. Basically, I have a file with lines that look like this:
time:3:35PM
I am just trying to cut out all characters up to and including ONLY the FIRST ':' delimiter, keeping the rest intact, using sed, so that I can process many files with the same format. What I am trying to get is this:
3:35PM
The below is the closest I got, but it uses the last occurrence of the delimiter instead of the first:
sed 's/.*://'
I have also tried Python, but I had trouble applying a function to every line of many files as opposed to just one file.
Any help would be greatly appreciated.
You can do this in just about every text processing tool (many without using regular expressions at all).
ed
If the in-place editing is really important, the canonical correct tool is not sed (the stream editor) but ed (the file editor). In the script below, `,s` runs the substitution over every line and w writes the result back to the file.
ed "$file" << EOF
,s/^[^:]*://g
w
EOF
sed
(Pretty much the same command as in ed, formatted a little differently. Note that [^:]* matches only non-colon characters, so unlike your greedy .*, it cannot reach past the first colon.)
sed 's/^[^:]*://' < "$file" > "$file".new
mv "$file".new "$file"
bash
This one doesn't cause any new processes to be spawned. (For whatever that's worth.)
while IFS=: read _ time; do
    printf '%s\n' "$time"
done < "$file" > "$file".new
mv "$file".new "$file"
awk
awk -F: 'BEGIN{ OFS=":" } { print $2,$3 }' < "$file" > "$file".new
mv "$file".new "$file"
cut
cut -d: -f2- < "$file" > "$file".new
mv "$file".new "$file"
Since you don't need a regular expression to match a single, known character, consider using cut instead of sed.
This simple invocation sets : as the delimiter (-d) and emits fields (-f) 2 onwards (the trailing -):
cut -d: -f2-
Example:
% echo 'time:3:35PM' | cut -d: -f2-
3:35PM
kojiro's answer has plenty of great alternatives, but you asked how to do it with a regex. Here are some pure regex solutions:
grep -oP '[^:]*:\K.*' file.txt
\K makes grep discard everything matched before it, so only the text after \K is reported.
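For example (sample run, assuming a GNU grep built with PCRE support):
% echo 'time:3:35PM' | grep -oP '[^:]*:\K.*'
3:35PM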
But if you know the exact prefix, you can use the lookbehind feature:
grep -oP '(?<=^time:).*' file.txt
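Sample run (again assuming grep with -P support):
% echo 'time:3:35PM' | grep -oP '(?<=^time:).*'
3:35PM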
Note that most regex implementations do not support these features. You can use them in grep with the -P flag, and in perl itself. I wonder if any other utility supports them.
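For instance, the same first-colon extraction in perl, using an ordinary capture group instead of \K (a minimal sketch):
perl -nle 'print $1 if /^[^:]*:(.*)/' file.txt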
To remove everything up to and including the first : you could do...
sed -i.bak 's/^[^:]*://' file.txt
On multiple .txt files:
sed -i.bak 's/^[^:]*://' *.txt
The -i option specifies that files are to be edited in place: sed creates a temporary file, sends output to it rather than to standard output, and then replaces the original. The .bak suffix tells sed to keep the original as a backup with that extension (file.txt.bak).
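A quick check on the question's sample line (sample session; file.txt is assumed to contain it):
% sed -i.bak 's/^[^:]*://' file.txt
% cat file.txt
3:35PM
% cat file.txt.bak
time:3:35PM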
Please consider my answer here:
How to use regex with cut at the command line?
You could for example just write:
echo 'time:3:35PM' | cutr -d : -f 2- -r :
In your particular case, you could simply use cut though:
echo 'time:3:35PM' | cut -d : -f 2-
Any feedback is welcome. cutr isn't perfect yet, and before I invest too much more time into it, I wanted to hear what people think.
Related
I have a requirement where I have to split a large file into smaller files. Each line of the large file that contains the matching string should be put into another file, named after the matching string. For one string I can get it done via awk as shown below.
awk '/apple/{print}' large_file.txt > apple.txt
I want a script that takes the matching strings (regular expressions) from another file and puts the results into files named after each matching string. How can I get this done with awk?
Let's say the string to be matched is put into a file called matching_string.txt the contents of which would look like this:
apple
orange
mango
If the large_file.txt is something like:
apple is a great fruit
we should eat apple
orange is juicy
mango is the king of fruits
litchi is a seasonal fruit
then the resulting files should be
apple.txt:
apple is a great fruit
we should eat apple
orange.txt:
orange is juicy
mango.txt:
mango is the king of fruits
I am new to the Linux environment and beginner level at scripting. Any other solution using regular expression, sed, python etc. should be also okay.
EDIT
Working Script:
I tweaked my script a little based on the answer by @Stephen Quan; it works in the tcsh shell.
#!/bin/tcsh -f
foreach word ("`cat pattern.txt`")
    if (-r ${word}.txt) then
        rm -rf ${word}.txt
    endif
    awk "/${word}/ { print }" large.txt > ${word}.txt
end
Why use awk? Grep does the job too. Usually, awk '/pattern/{print}' can be replaced by the shorter grep -e 'pattern'.
pattern=apple
grep -e "$pattern" large.txt > "$pattern.txt"
Write a script or a shell function. For instance, a simple shell function can be defined ad-hoc and then called.
filter() { grep -e "$1" large.txt > "$1.txt"; }
for pattern in apple orange mango; do filter "$pattern"; done
As a shell script (e.g. filter.sh):
#!/bin/sh
grep -e "$1" large.txt > "$1.txt"
The script file must have the executable bit set, or it cannot be run.
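For instance (assuming the script was saved as filter.sh in the current directory):
chmod +x filter.sh
./filter.sh apple    # writes every line matching 'apple' to apple.txt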
Assuming your pattern file (e.g. pattern.txt) contains one pattern per line (the patterns are read on file descriptor 3 so that commands inside the loop cannot accidentally consume them from standard input):
#!/bin/sh
while IFS= read -r pattern <&3; do
    filter "$pattern"
    # or: ./filter.sh "$pattern"
done 3< pattern.txt
All of that can be done without a script or function if you just want a one-shot task (though defining and using the function is not really more complicated than calling its body directly):
while IFS= read -r pattern <&3; do
    grep -e "$pattern" large.txt > "$pattern.txt"
done 3< pattern.txt
Note that a for loop cannot be used here, since your program will break as soon as one of your patterns contains space or tab characters.
To do this in awk:
for word in $(cat matching_string.txt)
do
    awk "/${word}/ { print }" large_file.txt > "${word}.txt"
done
while IFS= read -r word
do
    if [ -f "${word}.txt" ]; then rm "${word}.txt"; fi
    awk "/${word}/ { print }" large_file.txt > "${word}.txt"
done < matching_string.txt
In awk, a rule is a regex pattern followed by an action. Note that when you get into regex capture groups, you may find that the implementation of awk varies from one platform to another.
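For example, GNU awk (gawk) accepts a third array argument to match() for capture groups, a GNU extension that many other awks lack (a sketch using the question's 'apple' pattern):
gawk 'match($0, /(.*)(apple)(.*)/, m) { print m[1] m[2] m[3] }' large_file.txt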
Even for a simplistic regex like this one, I prefer perl, because in cross-platform environments (particularly macOS and Git Bash on Windows) perl has a more consistent regex implementation. In this case, the perl solution would be:
while IFS= read -r word
do
    if [ -f "${word}.txt" ]; then rm "${word}.txt"; fi
    perl -ne "if (/${word}/) { print }" < large_file.txt > "${word}.txt"
done < matching_string.txt
I also wanted to demonstrate capture groups. In this case it is a bit over-engineered to represent your line as three capture groups (prefix, word, postfix), but I do this because it serves as a template for more complex regex capture-group processing scenarios:
while IFS= read -r word
do
    if [ -f "${word}.txt" ]; then rm "${word}.txt"; fi
    perl -ne "if (/(.*)(${word})(.*)/) { print \$1 . \$2 . \$3 . \"\\n\" }" < large_file.txt > "${word}.txt"
done < matching_string.txt
Use grep -e pattern:
pattern=orange
grep -e "$pattern" large.txt > "$pattern.txt"
Then use the read command to read all patterns and generate all files:
filename='patternfile.txt'
while IFS= read -r pattern; do
    grep -e "$pattern" large.txt > "$pattern.txt"
done < "$filename"
I have a collection of plain text files named yymmdd_nnnnnnnnnn.txt, and I want to append another number sequence to each filename, so that it becomes yymmdd_nnnnnnnnnn_iiiiiiiii.txt, where iiiiiiiii is taken from the one line in each file that contains the text "GST: 123456789⏎" (or similar) at the end of the line. While I am sure there will be only one such matching line in each file, I don't know which line it will be on.
I need an elegant one-liner that I can run over the files in a folder, from a bash script, to rename each file by appending its specific GST number, as found within the file itself.
Before even getting to the renaming stage, I have encountered a problem with this. Here is what I tried, which didn't work...
# awk '/\d+$/' | grep -E 'GST: ' 150101_2224567890.txt
The grep command alone works perfectly to find the relevant line within the file, but the awk doesn't return just the final digits group. It fails with the error "warning: regexp escape sequence \d is not a known regexp operator". I had assumed that this regex should return any number of digits which are at the end of the line. The text file in question contains a line which ends with "GST: 112060340⏎". Can someone please show me how to make this work, and maybe also to help with the appropriate coding to move the collection of files to the new filenames? Thanks.
Thanks to a comment from @Renaud, I now have the following code working to obtain just the GST registration number from within a text file, which puts me a step closer towards a workable solution.
awk '/GST: / {printf $NF}' 150101_2224567890.txt
I still need to loop this over the collection instead of just specifying one filename. I also need to be able to use the output from @Renaud's contribution to rename the files. I'm getting closer to a working solution, thanks!
This awk should work for you:
awk '$1=="GST:" {fn=FILENAME; sub(/\.txt$/, "", fn); print "mv", FILENAME, fn "_" $2 ".txt"; nextfile}' *_*.txt | sh
To make it more readable:
awk '$1 == "GST:" {
fn = FILENAME
sub(/\.txt$/, "", fn)
print "mv", FILENAME, fn "_" $2 ".txt"
nextfile
}' *_*.txt | sh
Remove | sh from above to see all mv commands together.
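For the question's example file, the generated command would look like this (filename and GST number taken from the question's sample data):
mv 150101_2224567890.txt 150101_2224567890_112060340.txt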
You may try
for f in *_*.txt; do echo mv "$f" "${f%.txt}_$(sed '/.*GST: /!d; s///; q' "$f").txt"; done
Drop the echo if you're satisfied with the output. (In the sed expression, /.*GST: /!d deletes every line that doesn't match, the empty pattern in s/// reuses the previous regex to strip everything up to the number, and q quits after the first matching line.)
As you are sure there is only one matching line, you can try:
$ n=$(awk '/GST:/ {print $NF}' 150101_2224567890.txt)
$ mv 150101_2224567890.txt "150101_2224567890_$n.txt"
Or, for all .txt files:
for f in *.txt; do
    n=$(awk '/GST:/ {print $NF}' "$f")
    if [[ -z "$n" ]]; then
        printf '%s: GST not found\n' "$f"
        continue
    fi
    mv "$f" "${f%.txt}_$n.txt"
done
Another one-line solution to consider, although perhaps not so elegant.
for original_filename in *_*.txt; do \
    new_filename=${original_filename%'.txt'}_$(
        grep -E 'GST: ' "$original_filename" | \
        sed -E 's/.*GST//g; s/[^0-9]//g'
    )'.txt' && \
    mv "$original_filename" "$new_filename"; \
done
Output:
150101_2224567890_123456789.txt
If you are open to a multi-line script:
#!/bin/sh
for f in *.txt; do
    prefix=$(echo "${f}" | sed s'#\.txt##')
    cp "${f}" f1
    sed -i s'#GST#%GST#' "./f1"
    cat "./f1" | tr '%' '\n' > f2
    number=$(cat "./f2" | sed -n '/GST/'p | cut -d':' -f2 | tr -d ' ')
    newname="${prefix}_${number}.txt"
    mv -v "${f}" "${newname}"
    rm -v "./f1"
    rm -v "./f2"
done
In general, if you want to make your files easy to work with, leave as many places as possible where they can be split with newlines. It is much easier to alter files by putting what you want to delete or print on its own line than it is to search for things horizontally with regular expressions.
I have a bunch of files with filenames composed of underscore and dots, here is one example:
META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
I want to remove the .bed.nodup.sortedbed.roadmap.sort.fgwas.gz part, so the expected filename output would be META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
I am using these sed commands but neither one works:
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo $stringZ | sed -e 's/\([[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.\)//g'
echo $stringZ | sed -e 's/\[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.//g'
Any solution in sed or awk would help a lot.
Don't use external utilities and regexes for such a simple task! Use parameter expansions instead.
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo "${stringZ/.bed.nodup.sortedbed.roadmap.sort.fgwas.gz}"
To perform the renaming of all the files containing .bed.nodup.sortedbed.roadmap.sort.fgwas.gz, use this:
shopt -s nullglob
substring=.bed.nodup.sortedbed.roadmap.sort.fgwas.gz
for file in *"$substring"*; do
echo mv -- "$file" "${file/"$substring"}"
done
Note: I left echo in front of mv so that nothing is actually renamed; the commands are only displayed on your terminal. Remove the echo if you're satisfied with what you see.
Your regex doesn't really feel much more general than the fixed pattern would be, but if you want to make it work, you need to allow for more than one lowercase character between each dot. Right now you're looking for exactly one, but you can fix that with \+ after each [[:lower:]], like
printf '%s' "$stringZ" | sed -e 's/\([[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.\)//g'
which, with
stringZ="META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params"
gives me the output
META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
Try this:
#!/bin/bash
for line in META*; do
    f2=$(echo "$line" | sed 's/\.bed\.nodup\.sortedbed\.roadmap\.sort\.fgwas\.gz//')
    mv "$line" "$f2"
done
I'm working on some stuff related to converting files, and I'm trying to find a shell command to remove the original file extension.
For example, if I convert a file called text.rtf, it will be converted into text.rtf.mobi. I'd like to use something to remove the .rtf (or any other extension) so it's only text.mobi.
I've been playing with awk and sed but I couldn't get anything to work. I'm not sure how to get it to pick up both the original extension and the .mobi, but only remove the original extension.
Somewhat related, where should I be going to pick up regex and actually understand it instead of just immense amounts of Googling? Thanks.
EDIT: I was a little unclear in the original post so let me clarify. The shell command I need is for removing the original extension in a converted file, such as text.ANYTHING.mobi. Sorry about the confusion.
The classic way is the basename command:
file="text.rtf"
new=$(basename "$file" .rtf).mobi
The more modern way avoids exercising other programs:
file="text.rtf"
new="${file%.rtf}.mobi"
If you really must use awk, then I suppose you use:
file="text.rtf"
new=$(echo "$file" | awk '/\.rtf$/ { sub(/\.rtf$/, ".mobi"); } { print }')
For sed, you use:
file="text.rtf"
new=$(echo "$file" | sed 's/\.rtf$/.mobi/')
For a really good explanation of regular expressions, then you want Friedl's "Mastering Regular Expressions" book.
To convert text.rtf.mobi to text.mobi, you can use any of the tools previously shown with minor adaptations:
new=$(basename "$file" .rtf.mobi).mobi
new="${file%.rtf.mobi}.mobi"
new=$(echo "$file" | awk '/\.rtf\.mobi$/ { sub(/\.rtf\.mobi$/, ".mobi"); } { print }')
new=$(echo "$file" | sed 's/\.rtf\.mobi$/.mobi/')
And things are only marginally different if the .rtf can be any other extension, but you start to ask yourself "why doesn't he remove the original extension from the file before converting it, or use the file naming facilities in the converter to get the required output name?"
There is no longer a sensible way to do it with basename.
new="${file/.[!.]*.mobi/}" # bash
new=$(echo "$file" | awk '/\.[^.]+\.mobi$/ { sub(\.[^.]*\.mobi$/, ".mobi"); } { print }')
new=$(echo "$file" | sed 's/\.[^.]*\.mobi$/.mobi/')
Just remove all the extensions and then add back in .mobi
$ x=something.whatever.mobi
$ echo ${x%%.*}.mobi
something.mobi
Here's an example using xargs and basename to strip .wav from file names for batch conversion of wave files into mp3s using lame:
/bin/ls | xargs -I % basename % ".wav" | xargs -I $ lame --tl Xenologix --ta Drøn --tg IDM --tt $ $.wav $.mp3
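A loop-based sketch of the same conversion that avoids piping ls (same lame tag options as above; adjust to taste):
for f in *.wav; do
    t=${f%.wav}
    lame --tl Xenologix --ta Drøn --tg IDM --tt "$t" "$f" "$t.mp3"
done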
for f in *.mobi
do
    mv "$f" "$(echo "$f" | cut -d'.' -f1).mobi"
done
for f in *.mobi
do
    mv "$f" "$(echo "$f" | awk -F'.' '{print $1"."$3}')"
done
In my bash script, I have an array of filenames like
files=( "site_hello.xml" "site_test.xml" "site_live.xml" )
I need to extract the characters between the underscore and the .xml extension so that I can loop through them for use in a function.
If this were python, I might use something like
re.match("site_(.*)\.xml")
and then extract the first matched group.
Unfortunately this project needs to be in bash, so: how can I do this kind of thing in a bash script? I'm not very good with grep or sed or awk.
Something like the following should work
files2=( "${files[@]#site_}" )  # Strip the leading site_ from each element
files3=( "${files2[@]%.xml}" )  # Strip the trailing .xml
EDIT: After correcting two typos (the array subscripts must be [@], as shown above), it does seem to work :)
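A quick interactive check (sample session):
$ files=( "site_hello.xml" "site_test.xml" "site_live.xml" )
$ files2=( "${files[@]#site_}" )
$ files3=( "${files2[@]%.xml}" )
$ printf '%s\n' "${files3[@]}"
hello
test
live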
xbraer@NO01601 ~
$ VAR=`echo "site_hello.xml" | sed -e 's/.*_\(.*\)\.xml/\1/g'`
xbraer@NO01601 ~
$ echo $VAR
hello
xbraer@NO01601 ~
$
Does this answer your question?
Just run the variables through sed in backticks (``)
I don't remember the array syntax in bash, but I guess you know that well enough yourself, if you're programming bash ;)
If it's unclear, don't hesitate to ask again. :)
I'd use cut to split the string.
for i in site_hello.xml site_test.xml site_live.xml; do echo "$i" | cut -d'.' -f1 | cut -d'_' -f2; done
This can also be done in awk:
for i in site_hello.xml site_test.xml site_live.xml; do echo "$i" | awk -F'.' '{print $1}' | awk -F'_' '{print $2}'; done
If you're using arrays, you probably should not be using bash.
A more appropriate example would be
ls site_*.xml | sed 's/^site_//' | sed 's/\.xml$//'
This produces output consisting of the parts you wanted. Backtick or redirect as needed.