Find string in a file after a string pattern using shell script - regex

I have an output file with 4 lines:
storefront/storefront.war/location/header-info.jsp:30:<input type="hidden" id="welcomeConfigValue" value="${welcomeConfig}"/>
storefront/storefront.war/location/header-info.jsp:31:<span id="selected-location" class="top-txt top-nav-fix">
storefront/storefront.war/location/header-info.jsp:33:<span id="headRestName"></span><span class="header-spacing"> | </span><span id="headRestPhone"></span><span class="header-spacing"> | </span>
storefront/storefront.war/location/header-info.jsp:35:<a href="#" class="capitalize link-wht" id="location-show"><fmt:message
I'd like to extract the string after id= using the UNIX shell.
I.e., output should be like this:
welcomeConfigValue
selected-location
headRestName
headRestPhone
location-show

You can try grep (this needs GNU grep: -P enables Perl-compatible regexes, and \K discards the text matched so far):
grep -Po '\sid="\K[^"]*' file

Command:
sed -r 's/(^.*id=")([^"]+)(.*$)/\2/g' < file.txt
Output:
sdlcb@Goofy-Gen:~/AMD$ sed -r 's/(^.*id=")([^"]+)(.*$)/\2/g' < ff.txt
welcomeConfigValue
selected-location
headRestPhone
location-show
Here we split each line into three groups using "(" and ")". The first group contains everything from the beginning of the line up to and including 'id="'. The second group contains the characters between that quote and the next '"'. The third group contains the remaining characters up to the end of the line. The replacement keeps only the second group. Because the leading .* is greedy, only the last id on a line is printed, which is why headRestName is missing from the output above.
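To make the difference between the two answers concrete, here is a small sketch run against the one input line that carries two id attributes. It uses a portable grep -o plus cut pipeline (an assumption swapped in for illustration, not the exact command from either answer) next to a basic-regex form of the sed command. Note that the plain id=" pattern would also match attributes whose names merely end in "id".

```shell
# Sample line with two id attributes (taken from the question's input).
line='<span id="headRestName"></span><span class="header-spacing"> | </span><span id="headRestPhone"></span>'

# sed with a greedy leading .* keeps only the last id on the line.
last_only=$(printf '%s\n' "$line" | sed 's/^.*id="\([^"]*\)".*$/\1/')

# grep -o prints every match on its own line; cut then extracts the quoted value.
all_ids=$(printf '%s\n' "$line" | grep -o 'id="[^"]*"' | cut -d'"' -f2)

printf 'sed:  %s\n' "$last_only"
printf 'grep: %s\n' $all_ids
```

The sed variant reports only headRestPhone, while the grep pipeline reports both headRestName and headRestPhone.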

Related

SED - remove attribute from HTML tag

I want to remove a specific attribute (name in my example) from an HTML tag. The attribute might be in a different position on each line of my file.
Example Input:
<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">
Expected output:
<img src="https://websiteurl.com/286.jpg" alt="img">
My code:
sed 's/name=".*"//g' <<< '<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">'
but it only prints <img >, so I am losing the src attribute as well.
Notes:
The name attribute might be in any position in a tag (not necessarily at the beginning).
You can use sed, awk, Perl, or anything you like, as long as it works on the command line.
Your sed expression matches the text up to the last " in the line. It should have been:
sed 's/ name="[^"]*"//g'
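A quick way to see the greedy match in action is to run both expressions side by side. The sketch below uses the tag from the question plus a hypothetical second tag (not from the question) where name sits in the middle.

```shell
# The tag from the question, plus a hypothetical variant with name in the middle.
tag='<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">'
mid='<img src="https://websiteurl.com/286.jpg" name="x" alt="img">'

# .* is greedy: it runs to the last double quote on the line.
greedy=$(printf '%s\n' "$tag" | sed 's/name=".*"//g')

# [^"]* stops at the first closing quote, so only the name attribute goes.
fixed=$(printf '%s\n' "$tag" | sed 's/ name="[^"]*"//g')
fixed_mid=$(printf '%s\n' "$mid" | sed 's/ name="[^"]*"//g')

printf '%s\n' "$greedy" "$fixed" "$fixed_mid"
```

The first line printed is the truncated <img >, and the other two keep src and alt intact.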
With your shown samples, please try the following. Written and tested in GNU awk.
awk '/^<img/ && match($0,/src.*/){print substr($0,1,4),substr($0,RSTART,RLENGTH)}' Input_file
2nd solution: using awk's sub (substitute) function.
awk '/^<img/{sub(/name="[^"]*" /,"")} 1' Input_file
Explanation:
1st solution: uses awk's match function to match from src to the end of the line, then prints the first 4 characters of the line followed by a space and the matched text.
2nd solution: if the line starts with <img, replaces name=", everything up to the next ", and the following space with an empty string, then prints the line.

How to cut a string till first numerical value appears using regex

I am trying to write a script that extracts the part of a string up to the first number.
For example, I have a file named typed-list-4.1.3.Final.jar and I want the output to be typed-list.jar.
All the files have different names, but they all end with a version number and the .jar extension, so I was trying to use sed to cut everything from the first number onward and then append .jar.
My files look like:
log4j-slf4j-impl-2.8.2.jar, hibernate-core-5.0.12.Final.jar etc
I tried the following sed command, but it does not work (test1.sh contains the string "typed-list-4.1.3.Final.jar"):
sed -i 's/-[0-9]*$//g' test1.sh
How about:
sed 's/-\([0-9]\+\.\)\+[0-9]\+.*\.jar/.jar/' Input_file
Results for the provided inputs:
typed-list.jar
log4j-slf4j-impl.jar
hibernate-core.jar
The regex matches a substring that:
starts with a dash -
continues with a repetition of digit(s), dot, digit(s) ...
may contain some other substring in between (such as Final)
ends with the extension .jar
Then the sed command replaces the matched substring with just the extension.
Hope this helps.
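The same idea can be checked against all three file names from the question. This is a sketch using extended regular expressions (sed -E) instead of the escaped basic-regex form above; strip_version is a hypothetical helper name introduced only for this illustration.

```shell
# Remove "-<digits>.<digits>..." and anything after it, keeping the .jar extension.
# -E selects extended regular expressions (supported by GNU and BSD sed).
strip_version() {
    printf '%s\n' "$1" | sed -E 's/-[0-9]+(\.[0-9]+)+.*\.jar$/.jar/'
}

for f in typed-list-4.1.3.Final.jar \
         log4j-slf4j-impl-2.8.2.jar \
         hibernate-core-5.0.12.Final.jar; do
    strip_version "$f"
done
```

A name without a version suffix does not match the pattern and is passed through unchanged.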
Sed:
sed -E 's/(.*)-([[:digit:]]+\.){2}[[:digit:]]+.*(\.[^.]+)$/\1\3/' dat
log4j-slf4j-impl.jar
hibernate-core.jar
typed-list.jar
echo typed-list-4.1.3.Final.jar | awk 'sub(/-4.{10}/,"",$0)'
typed-list.jar

Finding and replacing a numeric string between colons, before a space, using sed?

I am attempting to change all coordinate information in a fastq file to zeros. My input file is composed of millions of entries in the following repeating 4-line structure:
@HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I would like to replace the two numeric strings in the first line 16030:89299 with zeros in a generic way, such that any numeric string between the colons, before the space, is replaced. I would like the output to appear as follows, replacing the two strings globally throughout the file with zeros:
@HWI-SV007:140:C173GACXX:6:2215:0:0 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I am attempting to do this using the following sed:
sed 's/:^[0-9]+$:^[0-9]+$\s/:0:0 /g'
However, this does not behave as expected.
You need sed's -r option (extended regular expressions) so that + means "one or more".
Also, ^ matches the beginning of the line and $ matches the end of the line, so they do not belong in the middle of the pattern.
This command line works against your sample:
sed -r 's/:[0-9]+:[0-9]+\s/:0:0 /g'
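The effect of the option can be seen by running the same pattern with and without it. This sketch uses -E, which GNU sed accepts as a synonym for -r, and a literal space in place of \s; the header is the sample from the question, written with the leading @ that FASTQ headers use.

```shell
header='@HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC'

# Basic regex: + is a literal plus sign, so nothing matches and the line is unchanged.
bre=$(printf '%s\n' "$header" | sed 's/:[0-9]+:[0-9]+ /:0:0 /')

# Extended regex: + means "one or more", so the pair before the space is replaced.
ere=$(printf '%s\n' "$header" | sed -E 's/:[0-9]+:[0-9]+ /:0:0 /')

printf '%s\n' "$bre" "$ere"
```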
Some alternatives:
awk -F ":" 'BEGIN{ OFS = ":" }{ if ( NF > 1 ) {$6 = 0; sub( /^[0-9]*/, 0, $7)}; print $0 }' YourFile
This treats : as the column separator.
sed 's/^\(\([^:]*:\)\{5\}\)[^[:blank:]]*/\10:0/' YourFile
This keeps the first 5 colon-separated fields and replaces everything after them, up to the first blank, with 0:0.
As for your own sed command:
sed 's/:[0-9]\+:[0-9]\+\(\s\)/:0:0\1/'
^ and $ anchor to the whole line, not to the current word.
The capture group keeps the original whitespace character instead of replacing it with a plain space (useful when the separator is a tab, for example).
g is not needed (and is better left out here) because there is normally only one occurrence per line.
Because the pattern is short, you need to be sure it cannot match anywhere else (that is, there is never whitespace after an earlier pair of numbers).
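One way to rule out accidental matches elsewhere is to touch only the header line of each 4-line record. The following is a minimal sketch with awk, assuming every record is exactly four lines as described in the question; the sequence and quality lines are shortened placeholders.

```shell
# One 4-line record; sequence and quality strings are shortened placeholders.
record='@HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC
GATTACAGATTACA
+
###FFFDFHGGDHII'

# NR % 4 == 1 selects lines 1, 5, 9, ... (the headers); all lines are printed.
zeroed=$(printf '%s\n' "$record" |
    awk 'NR % 4 == 1 { sub(/:[0-9]+:[0-9]+ /, ":0:0 ") } { print }')

printf '%s\n' "$zeroed"
```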

Filter words starting and ending with hyphen but not when it's found in the middle

I have a list of words I want to filter: I want only those that start or end with a hyphen, not those with a hyphen in the middle. That is, I want to keep entries like "a-" or "-cefalia" but not "castellano-manchego".
I have tried many options, and the closest thing I've found is grep -E '*\-' minilemario.txt; however, it matches every word containing a hyphen. Could you please provide a solution?
a
a-
aarónico
aaronita
amuzgo
an-
-án
ana
-ana
ana-
anabaptismo
anabaptista
blablá
bla-bla-bla
blanca
castellano
castellanohablante
castellano-leonés
castellano-manchego
castellanoparlante
cedulario
cedulón
-céfala
cefalalgia
cefalálgico
cefalea
-cefalia
cefálica
cefálico
cefalitis
céfalo
-céfalo
cefalópodo
cefalorraquídeo
cefalotórax
cefea
ciabogar
cian
cian-
cianato
cianea
cianhídrico
cianí
ciánico
cianita
ciano-
cianógeno
cianosis
cianótico
cianuro
ciar
ciática
ciático
zoo
zoo-
zoófago
Using grep, say:
grep -E '^-|-$' filename
to get the words starting or ending with -. And
grep -v -E '^-|-$' filename
to exclude the words starting or ending with -.
^ and $ are anchors denoting the start and end of line respectively. You used '*\-' which would match anything followed by - (it doesn't say that - is at the end of the line).
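As a quick check of the two anchored patterns, here is a sketch that runs them over a handful of words taken from the list in the question.

```shell
# A few words from the question's list, one per line.
words=$(printf '%s\n' a a- -ana ana castellano-manchego bla-bla-bla zoo-)

# Keep words with a hyphen at the very start or the very end.
edge=$(printf '%s\n' "$words" | grep -E '^-|-$')

# -v inverts the match: everything else, including words hyphenated in the middle.
rest=$(printf '%s\n' "$words" | grep -v -E '^-|-$')

printf 'edge: %s\n' $edge
printf 'rest: %s\n' $rest
```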
Here is a bash only solution. Please see the comments for details:
#!/usr/bin/env bash
# Assign the first argument (e.g. a text file) to a variable
input="$1"
# Bash 4 - read the data line by line into an array
readarray -t data < "$input"
# Bash 3 - read the data line by line into an array
#while IFS= read -r line; do
#    data+=("$line")
#done < "$input"
# For each item in the array do something
for item in "${data[@]}"; do
    # Line starts with "-" or ends with "-"
    [[ "$item" =~ ^-|-$ ]] && echo "$item"
done
This will produce the following output:
$ ./script input.txt
a-
an-
-án
-ana
ana-
-céfala
-cefalia
-céfalo
cian-
ciano-
zoo-
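The same filter can also be written without regular expressions or arrays. The sketch below is an alternative, not the script above: it uses glob patterns in a case statement, which should also run in a plain POSIX sh. The function name filter_edge_hyphen is hypothetical.

```shell
# Print only the input lines that start or end with a hyphen.
filter_edge_hyphen() {
    while IFS= read -r word; do
        case $word in
            -*|*-) printf '%s\n' "$word" ;;   # leading or trailing hyphen
        esac
    done
}

result=$(printf '%s\n' a a- -án castellano-manchego zoo- | filter_edge_hyphen)
printf '%s\n' "$result"
```

To filter a file, redirect it into the function, for example filter_edge_hyphen < minilemario.txt.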

How to find/extract a pattern from a file?

Here are the contents of my text file named 'temp.txt'
---start of file ---
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
---end of file ----
I want to write a bash script that captures the string 'b687' in a variable. This is really a pattern (the letter 'b' followed by 'n' digits). I can do it the hard way by looping through the file and extracting the desired string (b687 in the example above). Is there an easier way to do so, perhaps using awk or sed?
Try using grep
v=$(grep -oE '\bb[0-9]{3}\b' file)
This will search for a word starting with b followed by exactly 3 digits.
Using sed
v=$(sed -nr 's/.*\b(b[0-9]{3})\b.*/\1/p' file)
varname=$(awk '/HEROKU_POSTGRESQL_AQUA_URL/{print $4}' filename)
This reads the file and, on the line matching HEROKU_POSTGRESQL_AQUA_URL, prints the 4th whitespace-separated token, which in this case is b687.
Your other option is to use sed:
varname=$(sed -n 's/.* \(b[0-9][0-9]*\)/\1/p' filename)
Here we look for the pattern you mentioned (b followed by digits) and print only that. The -n option tells sed not to print lines by default. The rest of the command is a substitution: .* matches any text at the beginning of the line, and the \(...\) group holds the regex that matches your b#### string. The replacement \1 keeps only that group, and the p at the end prints the result, which is needed because -n suppressed the default output.
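Since the question asks for b followed by any number of digits, the three-digit grep pattern can be loosened with +. The sketch below assumes GNU grep (for -o and the \b word boundary) and feeds the sample text through a pipe instead of reading temp.txt; the second input line is a hypothetical longer id.

```shell
sample='HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done'

# b followed by one or more digits, matched as a whole word.
backup_id=$(printf '%s\n' "$sample" | grep -oE '\bb[0-9]+\b')

# Hypothetical longer id to show that the digit count is no longer fixed.
long_id=$(printf '%s\n' '----backup---> b12345' | grep -oE '\bb[0-9]+\b')

printf '%s\n' "$backup_id" "$long_id"
```

The word boundary keeps the b in "backup" from matching, because it is followed by a letter, not a digit.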