Remove the data before the second occurrence of a specified character in Linux - regex

I have a text file which has some below data:
AB-NJCFNJNVNE-802ac94f09314ee
AB-KJNCFVCNNJNWEJJ-e89ae688336716bb
AB-POJKKVCMMMMMJHHGG-9ae6b707a18eb1d03b83c3
AB-QWERTU-55c3375fb1ee8bcd8c491e24b2
I need to remove the data before the second hyphen (-) and produce another text file with the below output:
802ac94f09314ee
e89ae688336716bb
9ae6b707a18eb1d03b83c3
55c3375fb1ee8bcd8c491e24b2
I am pretty new to Linux and have been trying the sed command, without success, for the last couple of hours. How can I get the desired output with sed or any other useful command like awk?

You can use a simple cut call:
$ cat myfile.txt | cut -d"-" -f3- > myoutput.txt
Edit:
Some explanation, as requested in the comments:
cut breaks up a line of text into fields according to a given delimiter.
-d defines the delimiter, - in this case.
-f defines which fields to output. In this case, we want to eliminate everything before the second hyphen, or, in other words, return the third field and onwards (3-).
The rest of the command is just plumbing: cating the file into cut, and then saving the result to an output file.

Or, using sed:
cat myfile.txt | sed -e 's/^.\+-//'
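Since you mentioned awk, here is a minimal awk sketch as well; it strips everything up to and including the second hyphen, so any hyphens later in the line are preserved:
awk '{sub(/^[^-]*-[^-]*-/,""); print}' myfile.txt > myoutput.txt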

Modifying a pattern-matched line as well as next line in a file

I'm trying to write a script that, among other things, automatically enables multilib. That means that in my /etc/pacman.conf file, I have to turn this
#[multilib]
#Include = /etc/pacman.d/mirrorlist
into this
[multilib]
Include = /etc/pacman.d/mirrorlist
without accidentally removing # from lines like these
#[community-testing]
#Include = /etc/pacman.d/mirrorlist
I already accomplished this by using this code
linenum=$(rg -n '\[multilib\]' /etc/pacman.conf | cut -f1 -d:)
sed -i "$((linenum))s/#//" /etc/pacman.conf
sed -i "$((linenum+1))s/#//" /etc/pacman.conf
but I'm wondering whether this can be solved in a single line of code without any math expressions.
With GNU sed: find the row starting with #[multilib], append the next line to the pattern space (N) and then remove all # from the pattern space (s/#//g).
sed -i '/^#\[multilib\]/{N;s/#//g}' /etc/pacman.conf
If the two lines contain further # characters, those are also removed.
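If you only want to strip the leading # from each of the two lines and leave any other # characters alone, a variation on the same idea (a sketch, again assuming GNU sed):
sed -i '/^#\[multilib\]/{N;s/^#//;s/\n#/\n/}' /etc/pacman.conf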
Could you please try the following, written with the shown samples only. It assumes that you only want to deal with the multilib line and the very next line after it.
awk '
/multilib/ || found{
  found=$0~/multilib/?1:""
  sub(/^#+/,"")
  print
}
' Input_file
Explanation:
First, check whether the line contains multilib or the variable found is set; if so, run the instructions inside the block.
Inside the block, set found to 1 if the line contains multilib, otherwise clear it, so that only the line immediately after the multilib line gets processed.
Use awk's sub function to replace one or more leading hashes with nothing.
Then print the current line.
This will work using any awk in any shell on every UNIX box:
$ awk '$0 == "#[multilib]"{c=2} c&&c--{sub(/^#/,"")} 1' file
[multilib]
Include = /etc/pacman.d/mirrorlist
and if you had to uncomment 500 lines instead of 2, you'd just change c=2 to c=500 (as opposed to typing N 500 times as with the currently accepted solution). Note that you also don't have to escape any characters in the string you're matching on. So in addition to being robust and portable, this is a much more generally useful idiom to remember than the other solutions you have so far. See printing-with-sed-or-awk-a-line-following-a-matching-pattern/17914105#17914105 for more.
A perl one-liner:
perl -0777 -api.back -e 's/#(\[multilib]\R)#/$1/' /etc/pacman.conf
Modify in place, keeping a backup of the original in /etc/pacman.conf.back.
If there is only one [multilib] entry, with ed and the shell's printf
printf '/^#\[multilib\]$/;+1s/^#//\n,p\nQ\n' | ed -s /etc/pacman.conf
Change Q to w to edit pacman.conf
Match #[multilib]
; include the next address
+1 the next line (plus one line below)
s/^#// remove the leading #
,p prints everything to stdout
Q exit/quit ed without error message.
-s means do not print any message.
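For example, to apply the change in place instead of previewing it, the same command with w in place of ,p and Q (a sketch based on the notes above):
printf '/^#\[multilib\]$/;+1s/^#//\nw\n' | ed -s /etc/pacman.conf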
Ed can do this.
cat > edjoin.txt << EOF
/multilib/;+j
s/#//
s/#/\
/
wq
EOF
ed -s pacman.conf < edjoin.txt
rm -v ./edjoin.txt
This will only work on the first match. If you have more matches, repeat as necessary.
This might work for you (GNU sed):
sed '/^#\[multilib\]/,+1 s/^#//' file
Focus on a range of lines (in this case, two) where the first line begins with #[multilib], and remove the first character on those lines if it is a #.
N.B. The [ and ] must be escaped in the regexp otherwise they will match a single character that is m,u,l,t,i or b. The range can be extended by changing the integer +1 to +n if you were to want to uncomment n lines plus the matching line.
To remove all comments in a [multilib] section, perhaps:
sed '/^#\?\[[^]]*\]$/h;G;/^#\[multilib\]/M s/^#//;P;d' file

add characters each two places within sed

I am working with CSV files; they are seismic catalogs from a database, and I need to arrange them in the USGS format in order to start the next steps.
My input data format is:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909,7,23,170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913,12,14,024500,-17.780,-63.170,5.6,0,PRE-GEM-ISC
The USGS input format is
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T17:00:00,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T02:45:00,-17.780,-63.170,5.6,0,PRE-GEM-ISC
To "convert" my input to USGS format I did the following steps:
archi='catalog.txt'
sed 's/,/-/1' $archi > temp1.dat # to change "," to "-"
sed 's/,/-/1' temp1.dat > temp2.dat # same as above
sed 's/,/T/1' temp2.dat > temp3.dat # To add T between date and time
sed -i.bak "1 s/^.*$/DatesT,Latitude,Longitude,Magnitude,Depth,Catalog/" temp3.dat #to preserve the header.
I have the following output:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T024500,-17.780,-63.170,5.6,0,PRE-GEM-ISC
I tried to implement the following command:
sed 's/.\{13\}/&: /g' temp3.dat > temp4.dat
Unfortunately it did not work as I expected, because the characters to change are not at the same position on every line.
Do you have any idea to improve my code?
One way using GNU sed:
sed -r 's/([0-9]{4}),([0-9]{1,2}),([0-9]{1,2}),([0-9]{2})([0-9]{2})([0-9]{2})(,.*)/\1-\2-\3T\4:\5:\6\7/' file
You split each line into individual tokens: the first column as token one, the second column as token two, and so on; when it comes to the fourth column, take two digits at a time as a token, and then substitute as required.
You can do:
cat initialfile.csv|perl -p -e "s/^(\d{4}),(\d+),(\d+),(\d{2})(\d{2})(\d{2}),([0-9.-]+),([0-9.-]+),(.*)$/\1-\2-\3T\4:\5:\6,\7,\8,\9/g"
or, for an in-place edit:
perl -p -i -e "s/^(\d{4}),(\d+),(\d+),(\d{2})(\d{2})(\d{2}),([0-9.-]+),([0-9.-]+),(.*)$/\1-\2-\3T\4:\5:\6,\7,\8,\9/g" initialfile.csv
which should output USGS format
This might work for you (GNU sed):
sed -E '1!s/^([^,]*),([^,]*),([^,]*),(..)(..)/\1-\2-\3T\4:\5:/' file
Forget about the header.
Replace the first and second field delimiters (all fields are delimited by a comma ,) with a dash -.
Replace the third field delimiter with T.
Split the fourth field into three equal parts and separate each part by a colon :.
N.B. The last part of the fourth field will stay as is and so does not need to be defined.
Sometimes as programmers we become too focused on data and would be better served by looking at the problem as an artist and coding what we see.
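For example, a rough awk sketch in that spirit: take the pieces we can see in each data line (year, month, day and a six-digit time), print them back in the desired shape, and pass the header and the remaining fields through untouched:
awk -F, 'NR==1{print; next}
{t=sprintf("%s:%s:%s", substr($4,1,2), substr($4,3,2), substr($4,5,2))
 out=$1"-"$2"-"$3"T"t
 for(i=5;i<=NF;i++) out=out","$i
 print out}' catalog.txt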

How to use awk and grep on 300GB .txt file?

I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
The error message tells you:
line 117897124 has too many fields (>32767).
You'd better check it out:
sed -n '117897124{p;q}' file_name
Use cut to extract 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: You may change ' ' to whatever the field separator is. The default is $'\t'.
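So the whole pipeline could look something like this (a sketch, assuming the columns are space-separated as in your original command):
cut -d ' ' -f 1 < file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt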
The 'number of fields' is the number of 'columns' in the input file, so if one of the lines contains a very large number of fields, that could cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might avoid the awk problem entirely (the sed command replaces the entire line with just the leading text that matches the pattern, and prints the line only when the pattern matches).
It seems to me that your awk implementation has an upper limit on the number of records it can read in one go, 117,897,124 in this case. The limits can vary according to your implementation and your OS.
Maybe a sane way to approach this problem is to program a custom script that uses split to split the large file into smaller ones, with no more than 100,000,000 records each.
In case you don't want to split the file, you could look for the limits file corresponding to your awk implementation. Maybe you can define unlimited as the Number of Records value, although I believe that is not a good idea, as you might end up using a lot of resources...
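A rough sketch of that split-based approach (assuming the deduplication still has to happen once, over the combined output of all the chunks):
split -l 100000000 file_name chunk_
for f in chunk_*; do
  awk '{print $1}' "$f" | grep -o '/ns/.*'
done | awk '!seen[$0]++' > test1.txt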
If you have enough free space on disk (because Vim creates a temporary .swp file), I suggest using Vim. Vim's regex flavour differs slightly, but you can convert from standard regex to Vim regex with this tool: http://thewebminer.com/regex-to-vim
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If that uses too much memory, then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.

sed match pattern \tTEXT\t not working

I use the following command on a huge text file
sed 's/\tEN-GB\t//g' "/home/ubuntu/0214/corpus/C.txt"
The file contains a [tab]EN-GB[tab] in each row, but what I get is the original text. I cannot figure out why.
NOTE: when I'm using 's/\t//g' it works and the resulting string is [a lot of no-tabs]EN-GB[a lot of no-tabs] in each row, so the tabs vanished.
UPDATE: Here is the offending part of the output from cat -vet:
^@2^@0^@0^@7^@0^@1^@0^@4^@~^@1^@6^@3^@2^@4^@3^@^I^@^I^@0^@^I^@E^@N^@-^@G^@B^@^I^@T^@h^@e^@ ^@a^@d^@m^@i^@n^@i^@s^@t^@
I'm out of black magic... thanks in advance
It appears that your sed command is correct, but you have some null characters in your text file.
Run this sed command to remove nulls first:
sed -i.bak 's/\x0//g; s/\tEN-GB\t//g' "/home/ubuntu/0214/corpus/C.txt"
You can use ANSI-C quoting to represent the TAB character:
sed 's/'$'\tEN-GB\t''//g' filename
EDIT: The output of cat -vet suggests that you have NULL characters in your input. Remove those before piping the results to the above command. Say:
tr -d '\x0' < filename | sed 's/'$'\tEN-GB\t''//g'
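If those nulls are there because the file is actually UTF-16 encoded (the null byte in front of each character hints at big-endian UTF-16), another option is to convert the encoding first; a sketch assuming UTF-16BE input:
iconv -f UTF-16BE -t UTF-8 "/home/ubuntu/0214/corpus/C.txt" | sed 's/\tEN-GB\t//g'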

How can I remove all characters in each line after the first space in a text file?

I have a large log file from which I need to extract file names.
The file looks like this:
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
.... about 10 million times
I need to extract the file names like this:
loremIpsumDolor.sit
anotherFile.ext
I figure my first strategy is to find/replace all /path/to/ with ''. But I'm stuck on how to remove all the characters after the first space.
Can you help?
sed 's/ .*//' file
It doesn't take any more than that. The transformed output appears on standard output, of course.
In theory, you could also use awk to grab the filename from each line as:
awk '{ print $1 }' input_file.log
That, of course, assumes that there are no spaces in any of the filenames. awk defaults to looking for whitespace as the field delimiters, so the above snippet would take the first "field" from your log file (your filename) for each line, and output it.
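If you also need to drop the leading directories, as in your sample output, a small extension of the same idea (a sketch; it assumes none of the filenames contain spaces):
awk '{sub(/ .*/,""); sub(/.*\//,""); print}' input_file.log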
Pass it to cut:
cut '-d ' -f1 yourfile
a bash-only solution:
while read -r path otherstuff; do
  echo "${path##*/}"
done < filename