RegEx exclude sets while grouping all characters 2 by 2 - regex

I want to modify a binary file with a pattern. I've converted the file to a plain hexdump with xxd (from package vim). The plain file looks like this (only 1 line with no trailing LF):
$ xxd -ps file.bin | tr -d '\n' | tee out.txt
3a0a5354...
I want to remove all patterns that match \x01[^\xFF]*\xFF (an opening token and a closing token and everything between them except another closing token) in the original file, but sed doesn't work like this.
Example Input and Desired Match:
020202020101010101feeffeefff0000...
        ~~~~~~~~~~~~~~~~~~~~
And I'm thinking about doing this:
sed 's/regex//g' in.file > out.file
Now I'm trying to match all characters 2-by-2 while excluding ff. Any ideas?

This should do the trick:
((..)|01([0-9a-e][0-9a-f]|[0-9a-f][0-9a-e])*ff)*
That is, between the opening 01 and the closing ff we match pairs of hexadecimal digits in which either the first or the second digit can be f, but not both. The surrounding context must also be matched two characters at a time, so that every match starts at an even offset, i.e. on a byte boundary.
Obviously, you must add something that actually removes the inner group from the output, which is specific to your regex engine. I realized only after posting this that a simple s/ won't do.
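One way to sidestep the pair-alignment problem altogether is to apply the original byte pattern to the raw bytes instead of to the hexdump. The following is only a minimal Python sketch of that idea, not the sed-based approach discussed above; the function names are made up for illustration.

```python
import re

# Opening byte 0x01, then any run of bytes other than 0xff, then the closing 0xff.
TOKEN_RUN = re.compile(rb"\x01[^\xff]*\xff")


def strip_token_runs(data: bytes) -> bytes:
    # Remove every 0x01 ... 0xff run from the raw bytes.
    return TOKEN_RUN.sub(b"", data)


def strip_token_runs_hex(hexdump: str) -> str:
    # Same operation on a plain hexdump string such as the output of xxd -ps.
    # Decoding to bytes first means a "01" that straddles two bytes
    # (as in "1012") can never be mistaken for an opening token.
    return strip_token_runs(bytes.fromhex(hexdump)).hex()
```

With the example input from the question, strip_token_runs_hex("020202020101010101feeffeefff0000") returns "020202020000". For the real file, reading it in binary mode and calling strip_token_runs on its contents avoids the hexdump round trip entirely.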

Related

Substitute any other character except for a specific pattern in Perl

I have text files with lines like this:
U_town/u_LN0_pk_LN3_bnb_LN155/DD0 U_DESIGN/u_LNxx_pk_LN99_bnb_LN151_LN11_/DD5
U_master/u_LN999_pk_LN767888_bnb_LN9772/Dnn111 u_LN999_pk_LN767888_bnb_LN9772_LN9999_LN11/DD
...
I am trying to remove every character except / and words matching the pattern _LN\d+_, using a Perl one-liner.
So the edited version would look like:
/_LN0__LN3__LN155/ /_LN99__LN151_LN11_/
/_LN999__LN767888_/ _LN999__LN767888__LN9772_LN9999_/
I tried below which returned empty lines
perl -pe 's/(?! _LN\d+_)[^\/].+//g' file
Below returned only '/'.
perl -pe 's/(?! _LN\d+_)\w+//g' file
Is this perhaps not possible with a one-liner, so that I should instead write code that parses character by character and checks whether a matching word _LN\d+_ or a character / is there?
To merely remove everything other than these patterns, you can simply match the patterns and join the matches back together
perl -wnE'say join "", m{/ | _LN[0-9]+_ }gx' file
or perhaps, depending on details of the requirements
perl -wnE'say join "", m{/ | _LN[0-9]+(?=_) }gx' file
(See explanation in the last bullet below.)
Prints, for the first line (of the two) of the shown sample input
/_LN0__LN3_//_LN99__LN151_/
...
or, in the second version
/_LN0_LN3//_LN99_LN151_LN11/
...
The _LN155 is not there because it is not followed by _. See below.
Questions:
Why are there spaces after some / in the "edited version" shown in the question?
The pattern to keep is shown as _LN\d+_ but _LN155 is shown to be kept even though it is not followed by a _ in the input (but by a /) ...?
Are underscores optional by any chance? If so, append ? to them in the pattern
perl -wnE'say join "", m{/ | _?LN[0-9]+_? }gx' file
with output
/_LN0__LN3__LN155//_LN99__LN151_LN11_/
(It's been clarified that the extra space in the shown desired output is a mistake.)
If the underscores "overlap," like in _LN155_LN11_, in the regex they won't be both matched by the _LN\d+_ pattern, since the first one "takes" the underscore.
But if such overlapping instances need to be kept, then replace the trailing _ with a lookahead for it, which doesn't consume it, so it's still there for the leading _ of the next pattern
perl -wnE'say join "", m{/ | _LN[0-9]+(?=_) }gx' file
(if the underscores are optional and you use _?LN\d+_? pattern then this isn't needed)
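For comparison, the same match-and-join idea can be sketched in Python. This is an illustrative translation of the two Perl patterns above rather than part of the original answer, and the names KEEP_CONSUMING, KEEP_LOOKAHEAD and keep_only are invented here.

```python
import re

# Variant 1: the trailing underscore is consumed by the match.
KEEP_CONSUMING = re.compile(r"/|_LN[0-9]+_")
# Variant 2: the trailing underscore is only checked by a lookahead,
# so it stays available as the leading underscore of the next match.
KEEP_LOOKAHEAD = re.compile(r"/|_LN[0-9]+(?=_)")


def keep_only(pattern: re.Pattern, line: str) -> str:
    # findall returns every non-overlapping match; joining them drops the rest.
    return "".join(pattern.findall(line))
```

On the first sample line, keep_only(KEEP_CONSUMING, line) loses _LN11_ because its leading underscore was already consumed by _LN151_, while keep_only(KEEP_LOOKAHEAD, line) keeps _LN11 but drops the trailing underscores.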

Remove anything before primary domain or after forward slash

How can I extract domain names from the text input below? I tried this but it didn't work as expected:
grep -oP '(?<=[.])\w+(?=[.])'
Is there any way to do this in sed/awk or any other Linux command?
Input:
netgear.com
myapi.arlo.com
https://updates.netgear.com/arlo
https://bugcrowd-pub.bounty.accellion.net
client-api.arkoselabs.com
Output desired:
netgear.com
arlo.com
netgear.com
accellion.net
arkoselabs.com
I found many solutions through Google and tried to craft my own regex:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
[a-zA-Z0-9-]+\.[a-zA-Z]+($|(?=\/))
awk -F"." '{print $(NF-1)"."$NF}'
It looks like you are not only trying to remove the /, you are actually trying to extract the main domain from those URLs.
If you put the input in a file called input.txt, the following works for me on Ubuntu 20.10:
cat input.txt | sed -e 's;.*\.\([a-zA-Z0-9-]*\.[a-zA-Z0-9-]*\).*$;\1;'
As a brief explanation:
The domain name "parts" (the words between the dots) can only use numbers, letters and the dash symbol as characters. That pattern can be represented as:
[a-zA-Z0-9-]*
The regex above will match 2 of those, separated by a dot, preceded by a dot (and possibly a number of characters), and followed by either the end of line or a group of characters that are not part of the previous groups. I believe the greedy nature of .* will make sure that only the main domain is captured.
There are probably more robust solutions available too.
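The greedy backtracking described above can be checked with a small Python sketch. It assumes the same idea (keep the last two dot-separated labels that are preceded by a dot); the name main_domain is hypothetical, and multi-part suffixes such as co.uk are not handled.

```python
import re

# Greedy prefix up to a dot, then two labels separated by a dot, then the rest.
LAST_TWO_LABELS = re.compile(r"^.*\.([a-zA-Z0-9-]*\.[a-zA-Z0-9-]*).*$")


def main_domain(line: str) -> str:
    # When the pattern does not match (for example a bare "netgear.com",
    # which has no dot before its two labels) the line is returned unchanged.
    return LAST_TWO_LABELS.sub(r"\1", line, count=1)
```

Because the leading .* is greedy, the engine first tries the rightmost dot and backs off one dot at a time until two labels fit, which is why "https://updates.netgear.com/arlo" yields "netgear.com" rather than "updates.netgear".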

How to replace spaces after a certain pattern with commas?

I am new to coding and I'm trying to format some bioinformatics data. I am trying to replace all the spaces after GT:GL:GOF:GQ:NR:NV with commas, but not anything outside of the format xx:xx:xx:xx:xx (like the example). I know I need to use sed with a regex, but I'm not very familiar with how to use it. I've never actually used sed before and got confused trying, so any help would be appreciated. Sorry if I formatted this poorly (this is my first post).
EDIT 2: I got actual data from the file this time which may help solve the problem. Removed the bad example.
New Example: I pulled this data from my actual file (this is just two samples), and it is surrounded by other data. Essentially the line has a bunch of data followed by "GT:GL:GOF:GQ:NR:NV ", after this there is more data in the format shown below, and finally there is some more random data. Unfortunately I can't post a full line of the data because it is extremely long and will not fit.
Input
0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0
Output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
With Basic Regular Expressions, you can use character classes and backreferences to accomplish your task, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\)[ ]\([0-9][0-9]*:[0-9][0-9]*\)/\1,\2/g' file
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT BB
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 10:13:12,41:41:1:13,13:131:1:1 AB GT RT
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT
Which basically says:
find and capture any [0-9][0-9]* one or more digits,
separated by a :, and
followed by [0-9][0-9]* one or more digits -- as capture group 1,
match a space following capture group 1, followed by capture group 2 (which uses the same pattern as capture group 1),
then replace the space separating the capture groups with a comma reinserting the capture group text using backreference 1 and 2 (e.g. \1 and \2), finally
make the replacement global (e.g. g) to replace all matching occurrences.
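The steps above can be mirrored in a short Python sketch. The original example input was removed from the question, so the input line used in the note below is an assumption reconstructed from the output shown above, and join_pairs is a made-up name.

```python
import re

# Two digits:digits groups separated by one space, as in the sed command.
PAIR_SPACE_PAIR = re.compile(r"([0-9]+:[0-9]+) ([0-9]+:[0-9]+)")


def join_pairs(line: str) -> str:
    # Reinsert both captured groups with a comma where the space was;
    # re.sub replaces all non-overlapping matches, like the g flag in sed.
    return PAIR_SPACE_PAIR.sub(r"\1,\2", line)
```

Assuming an input line of "1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314 213:132:13:31 14:31:31 AB GT BB", join_pairs produces the first output line shown above, and the spaces around AB, GT and BB are left alone because they are not flanked by digits:digits groups.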
Edit Based On New Input Posted
If you still need all of the original commas added, and you now also want a comma between ,0 0/ (where a comma precedes a single digit, followed by the space to be replaced, followed by a single digit and a forward slash), then all you need to do is make your capture groups conditional (on either capturing the original data as above -or- capturing this new segment). You do that by including an OR (e.g. \| in basic regex terms) between the conditions.
For instance by adding \|,[0-9] at the end of the first capture group and \|[0-9][/] at the end of the second, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\|,[0-9]\)[ ]\([0-9][0-9]*:[0-9][0-9]*\|[0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
If you have other caveats in your file, I suggest you post several complete lines of input, and if they are too long, then create a zip, gzip, bzip or xz file and post it to a site like pastebin and add the link to your question.
If all you really care about now is the space in ,0 0/, then you can shorten the sed command to:
$ sed 's/\(,[0-9]\)[[:space:]]\([0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
(note: I've included [[:space:]] to handle any whitespace (space, tab, ...) instead of just the literal [ ] (space) in the new example)
Let me know if this fixes the issue.
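As a cross-check of the shortened command, here is a minimal Python sketch of the same substitution applied to the sample line from the question. The function name join_samples is invented for illustration.

```python
import re

# ",<digit>" then one whitespace character then "<digit>/", as in the last sed command.
SAMPLE_GAP = re.compile(r"(,[0-9])\s([0-9]/)")


def join_samples(line: str) -> str:
    # Replace only the whitespace between two samples with a comma.
    return SAMPLE_GAP.sub(r"\1,\2", line)
```

Only whitespace that sits between ",<digit>" and "<digit>/" is touched, so other spaces on the line are preserved.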
I'm assuming that the xx:xx:xx or xx:xx:xx:xx can have any number of parts, since some have 3, and some have 4.
This is quite difficult to do reliably with sed, as it does not support lookarounds, which seem like they might be needed for this example.
You can try something like:
perl -pe 's/(?<=\d) (?=\d+(:\d+){2,})/,/g' input.txt
If you've got your heart set on sed, you can try this, but it may miss some cases:
sed -r 's/(:[0-9]+) ([0-9]+:)/\1,\2/g' input.txt
Could you please try the following? It also prints the lines that do not match the regex. The regex passed to match could probably be made a bit shorter with an interval expression (something like [0-9]+\.{4}), but I only had an old awk available and couldn't test that.
awk '
BEGIN{
  OFS=","
}
match($0,/GT:GL:GOF:GQ:NR:NV [0-9]+:[0-9]+:[0-9]+:[0-9]+:[0-9]+/){
  value=substr($0,RSTART!=1?1:RSTART,RSTART+RLENGTH-1)
  value1=substr($0,RSTART+RLENGTH+1)
  gsub(/[[:space:]]+/,",",value1)
  print value,value1
  next
}
1
' Input_file
You may also achieve your desired result without regex, using awk:
awk '{printf "%s", $1FS$2FS$3FS$4FS$5","$6","$7; for (i=8;i<=NF;i++) printf "%s", FS$i; print ""}' input.txt
Basically, it outputs from field 1 to 5 with the default field separator ("space"), then from field 5 to 7 with the comma separator, then from field 8 onwards with default separator again.
perl myscript.pl '0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0'
myscript.pl,
#!/usr/local/ActivePerl-5.20/bin/env perl
my $input = $ARGV[0];
$input =~ s/ /\,/g;
print $input, "\n";
__DATA__
output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
Note that this replaces every space in the input with a comma, not just the space in question.

sed is returning more than I need

Every line of the input file will match one of the patterns:
"SCnnnn"
"SC-nnnn"
"SC_nnnn"
( n=[0-9], SC is literal but may be upper or lowercase and will be followed immediately by 1-4 digits delimited at the end by an alphanumeric, space or other non-numeric character)
Somewhere in the line there will also be a file extension (matching ".abc") where abc = upper|lower alphanumeric in any position.
I want to extract the first pattern and print this together with the extracted file extension for each line. This is what I have so far:
sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
Here's a sample input line:
SCSCSCSCSCSCSCSCSC1867SCBrSCSCSCSC&SCBlSCkSCSCBSCrSCbSCckSC.xyz
with required output being:
SC1867.xyz
but what I am getting is:
SCSCSCSCSCSCSCSCSC1867.xyz
Can someone please tell me why this is returning the "SC"s before the part I want? I know it's something to do with greediness, but I can't get my head around it.
(Everything works fine where my "SCnnnn" match is at the beginning of the line.)
I am open to other tools - e.g. awk - if they offer a more straightforward solution.
EDIT: I think I found a solution - at least it appears to work:
sed -E -n 's/.*([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p'
It's actually not necessarily the greediness that is at play here. The reason this is happening is because sed is replacing a part of a line and then printing the whole line (the suffix of p on your s// command does this).
To more clearly see what's happening, make infile contain a more obvious string like 0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz and run your first command. The following is the result
[user@localhost ~]$ sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
0o0o0o0o0o0o0o0oSC1867.xyz
As a slow-mo: sed finds your [Ss][Cc] characters beginning after the 0o0o0s and dutifully replaces the string you have described with the desired substitution; namely, it maintains the SC_-like part and four digits, then deletes everything after the numbers until the suffix. The problem is seen when the p command prints out the partially-changed line, including all of the unwanted 0oze.
Alternately
As an alternate solution, not involving printing partially changed lines but instead matching an entire line and altering it to your purpose, the following command extracted the correct answer to stdout for a file containing your example string:
[user@localhost ~]$ sed -e 's/^.*\([Ss][Cc][-_]\?[0-9]\{4\}\).*\(\.[a-zA-Z0-9]\{3\}\)$/\1\2/' infile
SC1867.xyz
To break that regex down a bit: the regex begins with a beginning of line (^), consumes all characters (.*) until it sees an SC (upper or lower, [Ss][Cc]), then it checks for an optional hyphen or underscore ([-_]\?), followed by exactly four digits ([0-9]\{4\}). Then, all characters are consumed until a dot (\.) is seen, followed by exactly three alphanumeric characters ([a-zA-Z0-9]\{3\}) and an end of line ($). The two expressions not consumed by a wildcard are saved in capture groups and concatenated (\1\2).
... sed -E 's/^.*([Ss][Cc][-_]?[0-9]{4}).*(\.[a-zA-Z0-9]{3})$/\1\2/' infile works too, if you don't enjoy backslashes as much as I do.
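The difference between substituting inside a line and building the output from the captured groups can be illustrated with a small Python sketch. It is not one of the sed commands above: the digit count {1,4} follows the question's description, and the function names are made up.

```python
import re

# First SC code (optional - or _, then 1 to 4 digits), anything, then a 3-character extension.
SC_AND_EXT = re.compile(r"([Ss][Cc][-_]?[0-9]{1,4}).*(\.[A-Za-z0-9]{3})")


def substitute_in_place(line: str) -> str:
    # Like an unanchored sed s command: only the matched span is replaced,
    # so any text before the match survives in the result.
    return SC_AND_EXT.sub(r"\1\2", line, count=1)


def extract(line: str):
    # Build the result from the two groups only; returns None when nothing matches.
    m = SC_AND_EXT.search(line)
    return m.group(1) + m.group(2) if m else None
```

For the test string 0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz, substitute_in_place keeps the leading 0o0o... prefix, whereas extract returns just SC1867.xyz.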

sed - match regex in specific position

I'm having some trouble creating a one liner or a simple script to edit some fixed length files using sed.
Supposing my file has lines in this format:
IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
IPITTYTH BARBUTIDONOTPITTYTHEFOO
If each entire line is considered as a string, I want to match the substring that starts at position 10 and has length 3 against a regex. If it matches the regex, I want to add some other string to the end of that line.
Assuming the matching regex is B.R, and the string to append in the end of the line is NOT, I would want my file to turn into:
IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
IPITTYTH BARBUTIDONOTPITTYTHEFOONOT
The lines in the files are bigger than the ones in this sample.
So far I have this:
sed -i '/B.R/ s/$/NOT/' file.name
The problem is that this ignores the position where the regex is matched, making the first line of the example a match as well:
IPITTYTHEFOOBUTIDONOTPITTYTHEBARNOT
IPITTYTH BARBUTIDONOTPITTYTHEFOONOT
I'm open to use awk as well.
Thanks in advance.
You are almost there. You just need to specify the characters that come before B.R. If B is at the 10th position, then there must be exactly 9 characters before it, counted from the start of the line:
sed -i '/^.\{9\}B.R/s/$/NOT/' file.name
Example:
$ sed '/^.\{9\}B.R/s/$/NOT/' file
IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
IPITTYTH BARBUTIDONOTPITTYTHEFOONOT
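The same fixed-position test can also be expressed by slicing the field out of the line first. The following Python sketch is an illustration only; append_if_match and its parameters are hypothetical names, with positions counted from 1 as in the question.

```python
import re


def append_if_match(line, pattern=r"B.R", start=10, length=3, suffix="NOT"):
    # Cut out the fixed-width field (start is 1-based, as in the question).
    field = line[start - 1 : start - 1 + length]
    # fullmatch requires the whole field to match, so matches elsewhere in the line are ignored.
    if re.fullmatch(pattern, field):
        return line + suffix
    return line
```

For the first sample line the field is FOO, so the line is returned unchanged even though BAR occurs later in it; for the second line the field is BAR and NOT is appended.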