Extract columns from a CSV file using Linux shell commands - regex

I need to "extract" certain columns from a CSV file. The list of columns to extract is long and their indices do not follow a regular pattern. So far I've come up with a regular expression for a comma-separated value, but I find it frustrating that in the right-hand side of sed's substitute command I cannot reference more than 9 saved strings. Any ideas around this?
Note that comma-separated values that contain a comma must be quoted so that the comma is not mistaken for a field delimiter. I'd appreciate a solution that can handle such values properly. Also, you can assume that no value contains a new line character.

With GNU awk:
$ cat file
a,"b,c",d,e
$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $2}' file
"b,c"
$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $3}' file
d
$ cat file
a,"b,c",d,e,"f,g,h",i,j
$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, -vcols="1,5,7,2" 'BEGIN{n=split(cols,a,/,/)} {for (i=1;i<=n;i++) printf "%s%s", $(a[i]), (i<n?OFS:ORS)}' file
a,"f,g,h",j,"b,c"
See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content for details. I doubt if it'd handle escaped double quotes embedded in a field, e.g. a,"b""c",d or a,"b\"c",d.
See also What's the most robust way to efficiently parse CSV using awk? for how to parse CSVs with awk in general.

CSV is not as easy to parse as it might look at first.
This is because there can be plenty of different delimiters or fixed column widths to separate the data, and the data may itself contain the (escaped) delimiter.
As I already said here, I would use a programming language that has a CSV library for that.
Use
Python
Perl
Ruby
PHP
or even C.
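As a minimal sketch of the library approach, here is how the column extraction from the question could look with Python's standard csv module. The function name extract_columns and the 1-based column list are illustrative choices, not something taken from the question.

```python
import csv
import io


def extract_columns(text, cols):
    """Return CSV text holding only the 1-based columns in `cols`, in that order."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # csv.reader understands quoted fields, so "f,g,h" stays a single value.
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([row[c - 1] for c in cols])
    return out.getvalue()


if __name__ == "__main__":
    sample = 'a,"b,c",d,e,"f,g,h",i,j\n'
    print(extract_columns(sample, [1, 5, 7, 2]), end="")
```

The writer re-quotes any field that contains a comma or a double quote, so values such as "b""c" should survive the round trip.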

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
I provided sample code within my answer here: parse csv file using gawk

There is command-line csvtool available - https://colin.maudry.com/csvtool-manual-page/
# apt-get install csvtool

Related

extract specific column from file?

I've one file having records like below
AAA***000***LLL
BBB***111***PPP
Want only second column values in output file.
OutputFile
000
111
Is there any way I could do it using linux command ?
The simplest way is to use awk
awk -v FS='[*]{3}' '{print $2}' file
The FS='[*]{3}' means three *s will be used verbatim as field separator. Notice that setting the FS as FS='***' is wrong since the *** is not a valid regular expression.
If awk is not available, which is highly unlikely on a Linux box, you can use GNU sed:
sed -En 's/[*]{3}/\n/; s/[*]{3}.*//; s/.*\n//p' file
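For comparison, the same extraction can be sketched in Python, where str.split treats the separator as a literal string, so no regular-expression escaping is needed. The helper name second_field is hypothetical.

```python
def second_field(line, sep="***"):
    """Return the second field of `line`, splitting on the literal separator."""
    return line.rstrip("\n").split(sep)[1]


if __name__ == "__main__":
    for record in ["AAA***000***LLL", "BBB***111***PPP"]:
        print(second_field(record))
```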

How to replace using sed command in shell scripting to replace a string from a txt file present in one directory by another?

I am very new to shell scripting and trying to learn the "sed" command functionality.
I have a file called configurations.txt with some variables defined in it with some string values initialised to each of them.
I am trying to replace a string in a file (values.txt) which is present in some other directory by the values of the variables defined. The name of the file is values.txt.
Data present in configurations.txt:-
mem="cpu.memory=4G"
proc="cpu.processor=Intel"
Data present in the values.txt (present in /home/cpu/script):-
cpu.memory=1G
cpu.processor=Dell
I am trying to make a shell script called repl.sh and I don't have a lot of code in it for now, but here is what I've got:-
#!/bin/bash
source /home/configurations.txt
sed <need some help here>
The expected result is that, after I run the script with sh repl.sh, my values.txt contains the following data:-
cpu.memory=4G
cpu.processor=Intel
These values were originally 1G and Dell.
Would highly appreciate some quick help. Thanks
This question lacks some sort of abstract routine and looks like "help me do something concrete please". Thus it's very unlikely that anyone would provide a full solution for that problem.
What you should do is try to split this task into a number of small pieces.
1) Iterate over configurations.txt and get the values from each line. To do that you need to get X and Y from a value="X=Y" string.
This regex could be helpful here - ([^=]+)=\"([^=]+)=([^=]+)\". It contains 3 capture groups: the variable name, X, and Y. For example,
>> sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\1/' configurations.txt
mem
proc
>> sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\2/' configurations.txt
cpu.memory
cpu.processor
>> sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\3/' configurations.txt
4G
Intel
2) For each X and Y find X=Z in values.txt and substitute it with a X=Y.
For example, let's change cpu.memory value in values.txt with 4G:
>> X=cpu.memory; Y=4G; sed -r "s/(${X}=).*/\1${Y}/" values.txt
cpu.memory=4G
cpu.processor=Dell
Use -i flag to do changes in place.
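The two steps above can also be sketched in Python. This is only an illustration under the assumption that every line of configurations.txt has the name="key=value" shape shown in the question; the function names parse_config and substitute are made up for the example.

```python
import re

# Same three capture groups as the sed regex: variable name, key (X), value (Y).
CONFIG_LINE = re.compile(r'([^=]+)="([^=]+)=([^=]+)"')


def parse_config(lines):
    """Step 1: collect {key: value} from lines such as mem="cpu.memory=4G"."""
    pairs = {}
    for line in lines:
        match = CONFIG_LINE.fullmatch(line.strip())
        if match:
            _name, key, value = match.groups()
            pairs[key] = value
    return pairs


def substitute(lines, pairs):
    """Step 2: replace the value of every key=old line whose key is in `pairs`."""
    result = []
    for line in lines:
        text = line.rstrip("\n")
        key, sep, _old = text.partition("=")
        result.append(f"{key}={pairs[key]}" if sep and key in pairs else text)
    return result


if __name__ == "__main__":
    config = ['mem="cpu.memory=4G"', 'proc="cpu.processor=Intel"']
    values = ["cpu.memory=1G", "cpu.processor=Dell"]
    print("\n".join(substitute(values, parse_config(config))))
```

Lines whose key is not listed in the configuration are passed through unchanged.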
Here is an awk based answer:
$ cat config.txt
cpu.memory=4G
cpu.processor=Intel
$ cat values.txt
cpu.memory=1G
cpu.processor=Dell
cpu.speed=4GHz
$ awk -F= 'FNR==NR{a[$1]=$2; next;}; {if($1 in a){$2=a[$1]}}1' OFS== config.txt values.txt
cpu.memory=4G
cpu.processor=Intel
cpu.speed=4GHz
Explanation: First read config.txt & save in memory. Then read values.txt. If a particular value was defined in config.txt, use the saved value from memory (config.txt).

how to handle unix command having \x in python code

I want to execute command
sed -e 's/\x0//g' file.xml
using Python code.
But getting error ValueError: invalid \x escape
You are not showing your Python code, so there is room for speculation here.
But first, why does the file contain null bytes in the first place? It is not a valid XML file. Can you fix the process which produces this file?
Secondly, why do you want to do this with sed? You are already using Python; use its native functions for this sort of processing. If you expect to read the file line by line, something like
with open('file.xml', 'r') as xml:
    for line in xml:
        line = line.replace('\x00', '')
        # ... your processing here
or if you expect the whole file as one long byte string:
with open('file.xml', 'r') as handle:
    xml = handle.read()
xml = xml.replace('\x00', '')
If you really do want to use an external program, tr would be more natural than sed. What syntax exactly to use depends on the dialect of tr or sed as well, but the fundamental problem is that backslashes in Python strings are interpreted by Python. If there is a shell involved, you also need to take the shell's processing into account. But in very simple terms, try this:
os.system("sed -e 's/\\x0//g' file.xml")
or this:
os.system(r"sed -e 's/\x0//g' file.xml")
Here, the single quotes inside the double quotes are required because a shell interprets this. If you use another form of quoting, you need to understand the shell's behavior under that quoting mechanism, and how it interacts with Python's quoting. But you don't really need a shell here in the first place, and I'm guessing in reality your processing probably looks more like this:
import subprocess
sed = subprocess.Popen(['sed', '-e', r's/\x0//g', 'file.xml'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
result, err = sed.communicate()
Because no shell is involved here, all you need to worry about is Python's quoting. Just like before, you can relay a literal backslash to sed either by doubling it, or by using a r'...' raw string.
Hex escapes in Python need two hex digits.
\x00

how to retrieve filename or extension within bash [duplicate]

This question already has answers here:
Extract filename and extension in Bash
(38 answers)
Closed 8 years ago.
i have a script that is pushing out some filesystem data to be uploaded to another system.
it would be very handy if i could tell myself what 'kind' of file each file actually is, because it will help with some querying later on down the road.
so, for example, say that my script is spitting out the following:
/home/myuser/mydata/myfile/data.log
/home/myuser/mydata/myfile/myfile.gz
/home/myuser/mydata/myfile/mod.conf
/home/myuser/mydata/myfile/security
/home/myuser/mydata/myfile/last
in the end, i'd like to see:
/home/myuser/mydata/myfile/data.log log
/home/myuser/mydata/myfile/myfile.gz gz
/home/myuser/mydata/myfile/mod.conf conf
/home/myuser/mydata/myfile/security security
/home/myuser/mydata/myfile/last last
there's gotta be a way to do this with regular expressions and sed, but i can't figure it out.
any suggestions?
EDIT:
i need to get this info via the command line. looking at the answers so far, i obviously have not made this clear. so with the example data i provided, assume that data is all being fed via greps and seds (data is already sterlized). i need to be able to pipe the example data to sed/grep/awk/whatever in order to produce the desired results.
Print the last field, using any non-alphabetic character as the field separator.
awk -F '[^[:alpha:]]' '{ print $0,$NF }'
/home/myuser/mydata/myfile/data.log log
/home/myuser/mydata/myfile/myfile.gz gz
/home/myuser/mydata/myfile/mod.conf conf
/home/myuser/mydata/myfile/security security
/home/myuser/mydata/myfile/last last
This should work for you:
x='/home/myuser/mydata/myfile/security'
( IFS='/.' && arr=( $x ) && echo "${arr[@]:(-1):1}" )
security
x='/home/myuser/mydata/myfile/data.log'
( IFS='/.' && arr=( $x ) && echo "${arr[@]:(-1):1}" )
log
To extract the last element in a filename path:
filename=${path##*/}
To extract characters after a dot in a filename:
extension=${filename##*.}
But (my comment) rather than looking at the extension, it might be better to use file. See man file.
As others have already answered, to parse the file names:
extension="${full_file_name##*.}" # BASH and Kornshell/POSIX only
filename=$(basename "$full_file_name")
dirname=$(dirname "$full_file_name")
Quotes are needed if file names could have spaces, tabs, or other strange characters in them.
You can also test whether a file is a directory, a regular file, or a link with the test command (which is linked to [, so that test -f foo is the same as [ -f foo ]).
However, you said: "it would be very handy if i could tell myself what kind of file each file actually is".
In that case, you may want to investigate the file command. This command will return the file type as determined by some sort of magic file (traditionally in /etc/magic), but newer implementations can use the user's own scheme. This can tell file type by extension and by the magic number in the file's header, or by looking at the first few lines in the file (for example, looking for a regular expression such as ^#! .*/bash$ in the first line).
This extracts the last component after a slash or a dot.
awk -F '[/.]' '{ print $NF }'
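The same "last component after a slash or a dot" rule can be sketched in Python. The function names are hypothetical; note that os.path.splitext is deliberately not used here, because it returns an empty extension for names such as security, which is not what the question asks for.

```python
import re


def last_component(path):
    """Return whatever follows the final '/' or '.' in `path`."""
    return re.split(r"[/.]", path)[-1]


def annotate(path):
    """Return the path followed by its 'kind', as in the desired output."""
    return f"{path} {last_component(path)}"


if __name__ == "__main__":
    for p in ["/home/myuser/mydata/myfile/data.log",
              "/home/myuser/mydata/myfile/security"]:
        print(annotate(p))
```

A path that ends in a slash or a dot would yield an empty string, so such input may need separate handling.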

Extracting username from UNIX path using Regex

I need to get a username from an Unix path with this format:
/home/users/myusername/project/number/files
I just want "myusername". I've been trying for almost an hour and I'm completely clueless.
Any idea?
Thanks!
Maybe just /home/users/([a-zA-Z0-9_\-]*)/.*?
Note that the critical part [a-zA-Z0-9_\-]* has to contain all valid characters for unix usernames. I took from here that a username should only contain digits, letters, dashes and underscores.
Also note that the extracted username is not the whole matching, but the first group (indicated by (...)).
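As an illustration of reading the first group rather than the whole match, here is a minimal Python sketch using the same pattern. The function name username is an assumption made for the example.

```python
import re

# Same pattern as above; group 1 is the username, group 0 is the whole match.
USER_RE = re.compile(r"/home/users/([a-zA-Z0-9_\-]*)/.*")


def username(path):
    """Return the username component, or None if the path has a different shape."""
    match = USER_RE.match(path)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(username("/home/users/myusername/project/number/files"))
```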
The best answer to this depends on what you are trying to achieve. If you want to know the user who owns that file, you can use the stat command. Unfortunately its syntax differs slightly depending on the operating system, but the following two commands work:
Mac OS X
stat -f '%Su' /home/users/myusername/project/number/files
Red Hat/Fedora/CentOS
stat -c '%U' /home/users/myusername/project/number/files
If you really do want the string following /home/users, then either of the regexes provided above will do that. You could use one in a bash script as follows (Mac OS X):
USERNAME=$(echo '/home/users/myusername/project/number/files' | \
sed -E -e 's!^/home/users/([^/]+)/.*$!\1!g')
Check http://rubular.com/r/84zwJmV62G. The first match, not the entire match, is the username.
In a Bourne shell, something like:
string="/home/users/STRINGWEWANT/some/subdir/here"
echo "$string" | awk -F/ '{print $4}'
would be one option, assuming the username is always the third component of the path (field 1 is the empty string before the leading slash, so the third component is $4). There are more lightweight options that use only the shell builtins:
echo ${x#*users/}
will strip out everything up to and including 'users/', and
echo ${y%%/*}
will strip out the remainder.
So to put it all together:
export path="/home/users/STRINGWEWANT/some/other/dirs"
export y=`echo ${path#*users/}` && echo ${y%%/*}
STRINGWEWANT
Also check out the bash manpage and search for "Parameter Expansion".
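For readers more comfortable with Python, the two parameter expansions map onto plain string operations. This is a rough sketch; the function name and the default marker are illustrative.

```python
def username_from_path(path, marker="users/"):
    """Mimic ${path#*users/} followed by ${y%%/*} with plain string methods."""
    # Drop everything up to and including the first occurrence of the marker.
    _, found, rest = path.partition(marker)
    if not found:
        rest = path  # like the shell, leave the value unchanged when nothing matches
    # Keep only the text before the first remaining slash.
    return rest.split("/", 1)[0]


if __name__ == "__main__":
    print(username_from_path("/home/users/myusername/project/number/files"))
```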
(\/home\/users\/)([^\/]+)
The 2nd capture group will be myusername; the 1st holds the /home/users/ prefix.