extract specific column from file? - regex

I have one file with records like the ones below
AAA***000***LLL
BBB***111***PPP
I want only the second column's values in the output file.
Output file:
000
111
Is there any way I could do it using a Linux command?

The simplest way is to use awk
awk -v FS='[*]{3}' '{print $2}' file
The FS='[*]{3}' means three literal *s will be used as the field separator. Notice that setting FS='***' is wrong, since *** is not a valid regular expression (* is a repetition operator and cannot appear at the start of a pattern).
If awk is not available, which is highly unlikely on a Linux box, you can use GNU sed:
sed -En 's/[*]{3}/\n/; s/[*]{3}.*//; s/.*\n//p' file
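If cut is available, a simpler (if less flexible) sketch is to treat every single * as a delimiter and take the fourth field, assuming the data itself never contains a *:
cut -d'*' -f4 file
With the sample input above, both approaches should print 000 and 111.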

Related

Why does grep match all the lines no matter what the pattern is

I'm having a problem using grep.
I have a file http://pastebin.com/HxAcciCa that I want to check for certain patterns, and when I'm trying to search for one, grep returns all the lines of the file, provided the pattern exists somewhere in it.
To explain more this is the code that I'm running
grep -F "ENVIRO" "$file_pos" >> blah
No matter what else I try, even if I provide a whole line as a pattern, grep always returns all the lines.
These are variations of what I'm trying:
grep -F "E20" "$file_pos" >> blah
grep E20 "$file_pos" >> blah
grep C:\E20-II\ENVIRO\SSNHapACS480.dll "$file_pos" >> blah
grep -F C:\E20-II\ENVIRO\SSNHapACS480.dll "$file_pos" >> blah
Also, for some strange reason, when adding the -x option to grep, it doesn't return any lines despite the fact that the exact pattern exists.
I've searched the web and the bash documentation for the cause but couldn't find anything.
My final test was the following
grep -F -C 1 "E20" "$store_pos" >> blah #store_pos has the same value as $file_pos
I thought maybe it was printing the lines after the result but that was not the case.
I was using the blah file to see the output.
Also, I'm using Linux Mint Rebecca.
Finally, although the naming is quite familiar, this question is not similar to Why does grep match all lines for the pattern "\'"
And finally I would like to say that I am new to bash.
I suspect the error might be due to the main file http://pastebin.com/HxAcciCa rather than the code.
From the comments, it appears that the file has carriage returns delimiting the lines, rather than the linefeeds that grep expects; as a result, grep sees the file as one huge line that either matches or fails to match as a whole.
(Note: there are at least three different conventions about how to delimit the lines in a "plain text" file -- unix uses newline (\n), DOS/Windows uses carriage return followed by newline (\r\n), and pre-OSX versions of MacOS used just carriage return (\r).)
I'm not clear on how your file wound up in this format, but you can fix it easily with:
tr '\r' '\n' <badfile >goodfile
or on the fly with:
tr '\r' '\n' <badfile | grep ...
Check the line endings in your input file: file, wc -l (see the example after this answer).
Check you are indeed using the correct grep: which grep.
Use > to redirect the output, or | more or | less, so you are not confused by earlier attempts that you appended with >>.
Edit: Looks like your file has the wrong line endings (old Mac OS (CR) perhaps). If you have dos2unix you can try to convert them to Unix style line endings (LF).
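For illustration, here is roughly what those checks show for a file with old Mac style (CR-only) line endings (badfile is a hypothetical name, and file's exact wording may vary):
$ file badfile
badfile: ASCII text, with CR line terminators
$ wc -l badfile
0 badfile
wc -l counts newline characters, so a CR-only file reports 0 lines even though it clearly has content.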
I don't have access to a PC at the moment, but what could possibly help you troubleshoot:
1. Use grep --color -F to see if it matches correctly.
2. After your statement, use | cat -A to see if there are any surprising control characters; lines should end in $, and other characters like ^I (tab) or ^M (carriage return) can sometimes be a headache.
I suspect number 2, as it seems to be Windows output, in which case cat filename | dos2unix | grep stmt should solve it.
Did you save the dos2unix output as another file?
Just double check the file, it should be similar to this:
[root@pro-mon9001 ~]# cat -A Test.txt
Windows^M$
Style^M$
Files^M$
Are^M$
Hard ^M$
To ^M$
Parse^M$
[root@pro-mon9001 ~]# dos2unix Test.txt
dos2unix: converting file Test.txt to Unix format ...
[root@pro-mon9001 ~]# cat -A Test.txt
Windows$
Style$
Files$
Are$
Hard$
To$
Parse$
Now it should parse properly - so just verify that the file was actually converted
Good luck!

Unable to create sed substitution to deduplicate file

I have the file with many duplicates of the form
a
a
b
b
c
c
Which I need to reduce to
a
b
c
So I wrote a sed command: sed -r 's/^(.*)$\n^(.*)$/\1/mg' filename, but the file was still showing duplicates. However I'm sure this regex works because I tested it here.
So what am I doing wrong?
I suspect it may be related to the -r option, as I'm not really sure what that does (but without it I get an "invalid reference \1 on `s' command's RHS" error).
Either of two simpler approaches should work for you.
A simple awk command that prints a line only the first time it is seen, by maintaining an array of already printed lines:
awk '!seen[$0]++' file
a
b
c
Since the file is already sorted, you can also use uniq:
uniq file
a
b
c
Edit: Newer gnu-awk versions also support in-place editing using:
awk -i 'inplace' '!seen[$0]++' file
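As for why the original sed attempt fails: sed reads one line at a time, so the pattern space never contains a \n for the regex to match unless you join lines yourself with the N command. If you really want a sed solution, a sketch of the classic one-liner that drops consecutive duplicate lines (emulating uniq) is:
sed '$!N; /^\(.*\)\n\1$/!P; D' file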

Extract columns from a CSV file using Linux shell commands

I need to "extract" certain columns from a CSV file. The list of columns to extract is long and their indices do not follow a regular pattern. So far I've come up with a regular expression for a comma-separated value, but I find it frustrating that in the RHS of sed's substitute command I cannot reference more than 9 saved groups. Any ideas around this?
Note that comma-separated values that contain a comma must be quoted so that the comma is not mistaken for a field delimiter. I'd appreciate a solution that can handle such values properly. Also, you can assume that no value contains a new line character.
With GNU awk:
$ cat file
a,"b,c",d,e
$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $2}' file
"b,c"
$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $3}' file
d
$ cat file
a,"b,c",d,e,"f,g,h",i,j
$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, -vcols="1,5,7,2" 'BEGIN{n=split(cols,a,/,/)} {for (i=1;i<=n;i++) printf "%s%s", $(a[i]), (i<n?OFS:ORS)}' file
a,"f,g,h",j,"b,c"
See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content for details. I doubt if it'd handle escaped double quotes embedded in a field, e.g. a,"b""c",d or a,"b\"c",d.
See also What's the most robust way to efficiently parse CSV using awk? for how to parse CSVs with awk in general.
CSV is not as easy to parse as it might look at first.
This is because there can be plenty of different delimiters or fixed column widths separating the data, and the data may also contain the delimiter itself (escaped).
As I already said here, I would use a programming language that provides a CSV library for this.
Use
Python
Perl
Ruby
PHP
or even C.
Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
I provided sample code within my answer here: parse csv file using gawk
There is a command-line tool, csvtool, available - https://colin.maudry.com/csvtool-manual-page/
# apt-get install csvtool
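As a rough sketch of how it can pick columns by index (file.csv is a placeholder; check the manual page above for the exact option syntax):
csvtool col 1,5,7 file.csv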

extract filename and path with slash delimiter from text using grep and lookaround

I am trying to write a bash script to automate checkin of changed files after comparing a local folder to another remote folder.
To achieve this, I am trying to extract the filename with a portion of the remote folder's path, to be used in the check-in commands. I am seeking assistance on extracting the filename with its path.
To achieve the comparison I used diff command as follows
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/
The above command prints output in the following format:
Files /home/xxxx/myprojects/company/apps/product/package/test/fileName.java and /productdev/product/product121/java/package/test/filename.java differ
I want to extract the file name between 'and' and 'differ', so I used a lookaround regular expression in a grep command:
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ | grep -oP '(?<=and) .*(?=differ)'
which gave me:
/productdev/product/product121/java/package/test/filename.java
How can I display the path starting from java to the end of the text, as in java/package/test/filename.java?
You could try the below grep command,
grep -oP 'and.*\/\Kjava.*?(?= differ)'
That is,
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ | grep -oP 'and.*\/\Kjava.*?(?=\s*differ)'
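With the sample diff line shown above, this should print:
java/package/test/filename.java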
The way I see it, you will be comparing all the files in both folders and will get several lines like the ones you mentioned.
So the first step would be grep-ing for all the lines with "differ" in them (in case that command produces any other kind of lines too).
You can ignore the above step if I am wrong and didn't understand it right.
So the next step is grep-ing both the paths. For that you can use these :
awk '{print $2,$4}'
This will print only the 2nd and 4th fields, i.e., both paths.
awk splits fields on whitespace by default, so this works no matter how many spaces separate the fields.
Another simple way of doing it is :
cut -d" " -f 2,4
This will also do the same.
Here, the -d flag specifies the delimiter that separates the fields, and the -f flag specifies which fields to pick (so the 2nd and 4th fields).
Once you get these paths, you can always store them in two variables and cut or awk out whatever parts of them you want.
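Putting those pieces together, a sketch that prints just the remote path starting at java/ (assuming every interesting line ends in "differ" and the remote path always contains a java/ component):
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ | awk '$NF == "differ" {print $4}' | sed 's|.*/java/|java/|'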

Find matches between 2 files

I'm trying to output matching lines in 2 files using awk. I made it easier by making 2 files with just one column each; they're phone numbers. I found many people asking the same question and getting the answer to use:
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
The problem I encountered was that it simply doesn't want to work. The first file is small (~5MB) and the second file is considerably larger (~250MB).
I have some general knowledge of awk and know that the above script should work, yet I'm unable to figure out why it's not working.
Is there any other way I can achieve the same result?
GREP is a nice tool, but it clogs up the RAM and dies within seconds due to the file size.
I did run some spot checks to find out whether there are matches: when I took random numbers from the smaller file and grep'd them through the big one, I did find matches, so I'm sure that there are.
Any help is appreciated!
[edit as requested by @Jaypal]
Sample data from both files:
File1:
01234567895
01234577896
01234556894
File2:
01234642784
02613467246
01234567895
Output:
01234567895
What I get:
xxx@xxx:~$ awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
xxx@xxx:~$
Update
The problem happens to be with the kind of file you were using. Apparently it came from a DOS system and had many \r characters around. To solve it, "sanitize" the files with:
dos2unix
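For example, a sketch that converts both files in place and re-runs the original command (assuming dos2unix is installed; tr -d '\r' < file > file.clean would do the same job non-destructively):
dos2unix file1 file2
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2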
Former answer
Your awk command is fine. However, you can also compare files with grep -f:
grep -f file1 file2
This uses the lines of file1 as patterns and prints the lines of file2 that match any of them.
You can add options to make the matching more precise:
grep -wFf file1 file2
-w matches whole words only.
-F matches fixed strings (no regex).
Examples
$ cat a
hello
how are
you
I am fine areare
$ cat b
hel
are
$ grep -f b a
hello
how are
I am fine areare
$ grep -wf b a
how are