extract data column-wise from text file using regex


I have a text file like
27/02/2017 17:59:39 562803 299060 235155
27/02/2017 17:59:44 562803 299058 235158
27/02/2017 17:59:49 562803 299057 235158
27/02/2017 17:59:54 562803 299057 235158
I want to extract data from a particular column using regex. Which expression will extract the 3rd column?

If you are on Linux, awk is well suited to this:
awk '{print $3}' file
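awk splits each line into fields, separated by runs of spaces or tabs by default. $3 is the third field (column).
Another good utility is cut. Unlike awk, it treats every single space as a delimiter, so it only works when the columns are separated by exactly one space, as they are here: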
cut -d " " -f3 file
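Since the question asks for a regex specifically, here is a minimal sketch with sed, assuming the columns are separated by spaces or tabs. The function name third_column and the sample variable are made up for this example.

```shell
# Two lines of sample data copied from the question.
sample='27/02/2017 17:59:39 562803 299060 235155
27/02/2017 17:59:44 562803 299058 235158'

# third_column (hypothetical helper): skip two whitespace-delimited
# fields, capture the third one, and drop the rest of the line.
third_column() {
  sed -E 's/^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+([^[:space:]]+).*/\1/'
}

printf '%s\n' "$sample" | third_column
```

For plain column extraction the awk one-liner is shorter. The regex form is mainly useful if you also need to constrain what the column may contain.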

Related

Concatenate urls based on result of two columns

I would like to first extract the text inside the parentheses in the first column, which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz

Something like
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt

If a file named a contains your input, you can try this:
$ awk -F'[()]' '
{
split($3,parts," +")
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). With the field separator -F'[()]', the third field contains everything after the right parenthesis, including the leading space. Splitting that field on spaces leaves the version string in parts[2]. I probably should have searched for an awk "trim" equivalent.
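awk has no built-in trim, but gsub can play that role. The following is only a sketch of that idea, not a replacement for the answer above. The function name drupal_urls and the variable version are made up for the example.

```shell
# drupal_urls (hypothetical helper): read lines such as
# "Admin Toolbar (admin_toolbar) 8.x-2.5" and print the download URL.
drupal_urls() {
  awk -F'[()]' '{
    version = $3
    gsub(/^[ \t]+|[ \t]+$/, "", version)   # trim leading and trailing blanks
    printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, version
  }' "$@"
}

printf '%s\n' 'Admin Toolbar (admin_toolbar) 8.x-2.5' | drupal_urls
```

Copying $3 into a variable before trimming leaves the original record untouched, in case it is needed later in the program.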

In the example data, the second-to-last column holds the parenthesized part you are interested in, and the last column holds the version.
If that is always the case, you can remove the parentheses from the second-to-last column, then append a hyphen and the last column.
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
Another option uses a regex with GNU awk: match with two capture groups captures what is between the parentheses and the field that follows. The third argument to match is a GNU awk extension.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file

This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
Each substitution matches an unwanted part of the line and replaces it with the required string.
N.B. Using # as the delimiter of the first substitution command avoids having to backslash-escape the slashes in the literal replacement.
The above solution could be condensed into a single substitution:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

remove duplicate lines (only first part) from a file

I have a list like this
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And I want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
So I want to remove every line whose part before the | has already appeared on an earlier line,
and keep only the first one.

A simple way using awk
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator, "|" in this case, after which the individual columns can be accessed as $1, $2 and so on. This answer keeps an array named seen, keyed by the value of $1, and prints a line only if its $1 has not been seen before.
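If the !seen[$1]++ idiom looks cryptic, the same logic can be spelled out with an explicit test. This is just an expanded sketch of the one-liner above. The function name first_per_key is made up.

```shell
# first_per_key (hypothetical helper): print a line only the first
# time its key (the text before the first "|") is encountered.
first_per_key() {
  awk -F'|' '{
    if (!($1 in seen)) {   # first time this key appears
      seen[$1] = 1         # remember it for later lines
      print
    }
  }' "$@"
}

printf '%s\n' 'ABC|Hello1' 'ABC|Hello2' 'DEF|Test' 'GHJ|Blabla1' 'GHJ|Blabla2' | first_per_key
```

The one-liner does the same thing in a single expression: seen[$1]++ evaluates to 0 on the first occurrence and then increments, so the negation is true only once per key.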

Adding commas when necessary to a csv file using regex

I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second row is missing a comma. Is there a regex or an awk-like tool that can append commas to the end of a line when the row has too few of them?
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was ok with any of them. Of course, the first alternative is the best but I wasn't sure if this was easy to implement or not.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.

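You can use this awk command to pad every row after the header with empty cells, based on the number of columns in the header row, so that the number of columns is not hard-coded. Assigning to field number nc makes awk create any missing fields and rebuild the line with the output separator: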
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
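The question also allows passing the number of columns as an argument. A minimal sketch of that variant follows, assuming the count is given as the first argument. The function name pad_columns is made up for the example.

```shell
# pad_columns N (hypothetical helper): append empty cells to every
# non-empty row that has fewer than N comma-separated fields.
pad_columns() {
  awk -v nc="$1" 'BEGIN { FS = OFS = "," }
    NF && NF < nc { $nc = "" }   # assigning to field nc creates the missing empty fields
    1'
}

printf '%s\n' 'entity_name,data_field_name,type' 'Unit,id' 'Track,id,LONG' | pad_columns 3
```

Rows that already have N or more fields pass through unchanged, so this does not flag rows with too many columns.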

I would use sed. This appends a comma only to lines that contain exactly one comma:
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar

Try this:
$ awk -F , -v OFS=, 'NF==2{$2=$2","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG

With another awk:
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv

To balance all the awk solutions, here is a Vim-only one:
:v/,.*,/norm A,
Rationale:
/,.*,/ matches lines that contain at least two commas
:v applies a command to each line NOT matching the pattern
norm A, enters normal mode and appends a , to the end of the line

This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG

Delete rows with extra delimiter from csv file in unix

I have a CSV file with 3 columns separated by a ',' delimiter. Some values contain a , in the data, and I would like to remove those records entirely. Can I do this with sed, awk or grep?
Input file :
monitor,display,45
keyboard,input,20
loud,speaker,output,20
mount,input,20
Expected Output :
monitor,display,45
keyboard,input,20
mount,input,20

I used the grep command to filter out rows with extra commas:
grep -v '.*,.*,.*,.*' input_file > output_file
The pattern matches any line that contains at least three commas; each .* stands for whatever text surrounds them.
-v excludes the records that match the pattern.
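The grep -v approach drops rows with too many commas but keeps rows with too few. If you want only rows with exactly three fields, an anchored pattern can express that directly. This is a sketch. The function name exactly_three_fields and the extra test row broken,row are made up.

```shell
# exactly_three_fields (hypothetical helper): keep only lines made of
# three comma-separated fields, i.e. lines with exactly two commas.
exactly_three_fields() {
  grep -E '^[^,]*(,[^,]*){2}$' "$@"
}

# The first four rows come from the question; "broken,row" is an
# invented row with too few fields.
printf '%s\n' 'monitor,display,45' 'keyboard,input,20' \
  'loud,speaker,output,20' 'mount,input,20' 'broken,row' | exactly_three_fields
```

Because the pattern is anchored with ^ and $, no -v is needed: the lines that match are the ones to keep.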

Here is how you can do the same with awk. Basically, you want the records that have exactly 3 fields:
$ awk -F, 'NF==3 {print $0}' data1.txt
monitor,display,45
keyboard,input,20
mount,input,20

Extracting Last column using VI

I have a CSV file with about 1000 fields. The header looks something like this:
v1,v2,v3,v4,v5....v1000
I want to extract the last column, i.e. v1000 and its values.
I tried %s/,[^,]*$// but it does the exact opposite of what I expected: it deletes the last column instead of keeping it. Is there any way to invert this expression in vi?
I know it can be done with awk as awk -F "," '{print $NF}' myfile.csv, but I want to do it in vi with a regular expression. Please also note that I have vi, not Vim, and I am working on UNIX, so I can't use the visual mode trick either.
Many thanks in advance. Any help is much appreciated.

Don't you just want this?
%s/.*,\s*//
.*, matches everything up to and including the last comma, and \s* removes any whitespace that follows it. If your vi does not understand \s, %s/.*, *// should do the same for plain spaces.

You already accepted an answer, but you can still use awk and other nice UNIX tools from within vi or Vim. The technique is to filter the contents of the buffer through an external command with :{range}!{cmd}
As a demo, let's rearrange the records of a CSV file with the sort command:
first,last,email
john,smith,john@example.com
jane,doe,jane@example.com
:2,$!sort -t',' -k2
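-t',' sets the field delimiter and -k2 sorts on the key that starts at the second field. The 2,$ range leaves the header line in place.
Extracting the last column with awk is just as easy: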
:%!awk -F "," '{print $NF}'

Don't forget cut!
:%!cut -d , -f 6
Where 6 is the number of the last field.
Or if you don't want to count the number of fields:
:%!rev | cut -d , -f 1 | rev
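If you would rather keep the regex from the substitution answer, the same pattern can be run through sed as a filter. This is a sketch, assuming a POSIX sed is available. The function name last_column is made up.

```shell
# last_column (hypothetical helper): keep only the text after the
# last comma on each line, dropping any whitespace right after it.
# [[:space:]] is the POSIX spelling of \s, which not every sed accepts.
last_column() {
  sed 's/.*,[[:space:]]*//'
}

printf '%s\n' 'v1,v2,v3' '1,2, 3' | last_column
```

From inside vi the same filter would be :%!sed 's/.*,[[:space:]]*//' and, like the cut version, it replaces the whole buffer with the last column.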
