CLI method for vlookup-like search [closed]

I have a huge csv file, demo.csv (few GBs in size) which has 3 columns like the following:
$ cat demo.csv
call_start_time,called_no,calling_no
43284.85326,1111111111,2222222222
43284.83192,3333333333,1111111111
43284.83205,2222222222,1111111111
43284.81304,4444444444,3333333333
I am trying to find the rows that have the same pair of values in columns 2 and 3, regardless of order. For example, this should be the output for the data shown above:
call_start_time,called_no,calling_no
43284.85326,1111111111,2222222222
43284.83205,2222222222,1111111111
I tried to use csvkit:
csvsql --query "select called_no, calling_no, call_start_time, count(1) from file123 group by called_no,calling_no having count(1)>1" file123.csv > new.csv

With awk you can build an associative array a with records as values and keys k built from the fields $2 and $3, sorted and joined with a pipe.
awk -F, 'NR==1 {print; next} { k=($3<$2) ? $3"|"$2 : $2"|"$3; if (a[k]) { if (a[k]!="#") {print a[k];a[k]="#"} print} else a[k]=$0}' file
If the current record has a key that already exists, the stored record is printed (only if it is the first time) and the current record is printed too.

$ awk -F',' '
NR==1 { print; next }
{ key = ($2>$3 ? $2 FS $3 : $3 FS $2) }
seen[key]++ { print orig[key] $0; delete orig[key]; next }
{ orig[key] = $0 ORS }
' file
call_start_time,called_no,calling_no
43284.85326,1111111111,2222222222
43284.83205,2222222222,1111111111

Related

How to sort textfile text in this way? [duplicate]

I have a textfile that contains the following:
#_.5_sh
#handa12247
#lshydymhmwd
#ahmr0784
#f7j.i
#carameljeddah
#lnqm_iii2
#raghad.ayman.524
#asfhfdfgt4355
#kuw871
#nouralhuda_muhammad
#gogo56817gma
#kaald10000
#sal_0221
#kaled_24009165
#km_kn124
#princess.hana89
#fulefulemm
#norah.0._
#ommajed965
#lam3aastar
#alimarar265
#klthmlmdy
#anas.sasan55
#s.m_b.b
#asnosy_almgrhe_
#norh7132
#880ali7
#tv.creativity
#ksakking3
I'd like to reformat them so there are 5 users on each line:
#_.5_sh #handa12247 #lshydymhmwd #ahmr0784 #f7j.i
#carameljeddah #lnqm_iii2 #raghad.ayman.524 #asfhfdfgt4355 #kuw871
#nouralhuda_muhammad #gogo56817gma #kaald10000 #sal_0221 #kaled_24009165
#km_kn124 #princess.hana89 #fulefulemm #norah.0._ #ommajed965
#lam3aastar #alimarar265 #klthmlmdy #anas.sasan55 #s.m_b.b
#asnosy_almgrhe_ #norh7132 #880ali7 #tv.creativity #ksakking3
I've played around with seq and awk, with failed attempts. I'd appreciate help reformatting the list this way.
Using awk
awk 'ORS=NR%5?FS:RS' file
#_.5_sh #handa12247 #lshydymhmwd #ahmr0784 #f7j.i
#carameljeddah #lnqm_iii2 #raghad.ayman.524 #asfhfdfgt4355 #kuw871
#nouralhuda_muhammad #gogo56817gma #kaald10000 #sal_0221 #kaled_24009165
#km_kn124 #princess.hana89 #fulefulemm #norah.0._ #ommajed965
#lam3aastar #alimarar265 #klthmlmdy #anas.sasan55 #s.m_b.b
#asnosy_almgrhe_ #norh7132 #880ali7 #tv.creativity #ksakking3
It sets the Output Record Separator (ORS) to a space (FS) after most lines, and to a newline (RS) after every 5th line; since the assigned value is always a non-empty string, the pattern is true and every line is printed.
Edit: this does not work on a DOS-format file, so run dos2unix yourfile before awk
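An alternative without awk is paste, which can consume a fixed number of input lines per output row. A sketch; the printf merely stands in for the file above:

```shell
# reshape a one-item-per-line list into rows of five, space-separated;
# each "-" tells paste to consume the next line of standard input
printf '%s\n' '#_.5_sh' '#handa12247' '#lshydymhmwd' '#ahmr0784' '#f7j.i' \
    '#carameljeddah' '#lnqm_iii2' '#raghad.ayman.524' '#asfhfdfgt4355' '#kuw871' |
paste -d' ' - - - - -
# -> #_.5_sh #handa12247 #lshydymhmwd #ahmr0784 #f7j.i
# -> #carameljeddah #lnqm_iii2 #raghad.ayman.524 #asfhfdfgt4355 #kuw871
```

With a real file it would simply be `paste -d' ' - - - - - < file`.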

Extracting part of lines with specific pattern and sum the digits using bash

I am just learning bash scripting and commands, and I need some help with this assignment.
I have txt file that contains the following text and i need to:
Extract guest name ( 1.1.1 ..)
Sum guest result and output the guest name with result.
I used sed with a simple regex to extract the name and the digits, but I have no idea how to sum the numbers, because each guest has multiple records in the file, as you can see below. Note: I can't use awk for processing.
Here is my code:
cat file.txt | sed -E 's/.*([0-9]{1}.[0-9]{1}.[0-9]{1}).*([0-9]{1})/\1 \2/'
And result is:
1.1.1 4
2.2.2 2
1.1.1 1
3.3.3 1
2.2.2 1
Here is the .txt file:
Guest 1.1.1 have "4
Guest 2.2.2 have "2
Guest 1.1.1 have "1
Guest 3.3.3 have "1
Guest 2.2.2 have "1
and the output should be:
1.1.1 = 5
2.2.2 = 3
3.3.3 = 1
Thank you in advance
I know your teacher won't let you use awk but, since beyond this one exercise you're trying to learn how to write shell scripts, FYI here's how you'd really do this job in a shell script:
$ awk -F'[ "]' -v OFS=' = ' '{sum[$2]+=$NF} END{for (id in sum) print id, sum[id]}' file
3.3.3 = 1
2.2.2 = 3
1.1.1 = 5
and here's a bash builtins equivalent which may or may not be what you've covered in class and so may or may not be what your teacher is expecting:
$ cat tst.sh
#!/usr/bin/env bash
declare -A sum
while read -r _ id _ cnt; do
(( sum[$id] += ${cnt#\"} ))
done < "$1"
for id in "${!sum[@]}"; do
printf '%s = %d\n' "$id" "${sum[$id]}"
done
$ ./tst.sh file
1.1.1 = 5
2.2.2 = 3
3.3.3 = 1
See https://www.artificialworlds.net/blog/2012/10/17/bash-associative-array-examples/ for how I'm using the associative array. It'll be orders of magnitude slower than the awk script and I'm not 100% sure it's bullet-proof (since shell isn't designed to process text there are a LOT of caveats and pitfalls) but it'll work for the input you provided.
OK -- since this is a class assignment, I will tell you how I did it, and let you write the code.
First, I sorted the file. Then, I read the file one line at a time. If the name changed, I printed out the previous name and count, and set the count to be the value on that line. If the name did not change, I added the value to the count.
Second solution used an associative array to hold the counts, using the guest name as the index. Then you just add the new value to the count in the array element indexed on the guest name.
At the end, loop through the array, print out the indexes and values.
It's a lot shorter.
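For reference, the first (sort, then scan line by line) approach the answer describes could be sketched as below; the function name sum_guests is made up for the example:

```shell
# sum per-guest counts: sort by the guest-name field, then scan,
# flushing the running total whenever the name changes
sum_guests() {
  sort -k2,2 "$1" | {
    prev='' total=0
    while read -r _ id _ cnt; do
      cnt=${cnt#\"}                     # strip the leading double quote
      if [[ $id == "$prev" ]]; then
        (( total += cnt ))              # same guest: keep accumulating
      else
        # new guest: flush the previous guest's total, then reset
        [[ -n $prev ]] && printf '%s = %d\n' "$prev" "$total"
        prev=$id total=$cnt
      fi
    done
    [[ -n $prev ]] && printf '%s = %d\n' "$prev" "$total"
  }
}
```

Called as `sum_guests file.txt`, this prints the three `name = total` lines shown in the question.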

match variable string at end of field with awk

Yet again my unfamiliarity with awk lets me down: I can't figure out how to match a variable at the end of a line.
This would be fairly trivial with grep etc, but I'm interested in matching integers at the end of a string in a specific field of a tsv, and all the posts suggest (and I believe it to be the case!) that awk is the way to go.
If I want to just match a single one explicitly, that's easy.
Here's my example file:
PVClopT_11 PAU_02102 PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A No DOI found.
PVCpnf_18 PAK_3526 PAK_03186 3fxq 3fxq_A 99.7 2.7e-21 7e-26 122.2 >3fxq_A LYSR type regulator of TSAMBCD; transcriptional regulator, LTTR, TSAR, WHTH, DNA- transcription, transcription regulation; 1.85A {Comamonas testosteroni} PDB: 3fxr_A* 3fxu_A* 3fzj_A 3n6t_A 3n6u_A* 10.1111/j.1365-2958.2010.07043.x
PVCunit1_19 PAU_02807 PAU_02793 3kx6 3kx6_A 19.7 45 0.0012 31.3 >3kx6_A Fructose-bisphosphate aldolase; ssgcid, NIH, niaid, SBRI, UW, emerald biostructures, glycolysis, lyase, STRU genomics; HET: CIT; 2.10A {Babesia bovis} No DOI found.
PVClumt_17 PAU_02231 PAU_02190 3lfh 3lfh_A 39.7 12 0.0003 28.9 >3lfh_A Manxa, phosphotransferase system, mannose/fructose-speci component IIA; PTS; 1.80A {Thermoanaerobacter tengcongensis} No DOI found.
PVCcif_11 plu2521 PLT_02558 3h2t 3h2t_A 96.6 2.6e-05 6.7e-10 79.0 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_16 PAU_03338 PAU_03377 5jbr 5jbr_A 29.2 22 0.00058 23.9 >5jbr_A Uncharacterized protein BCAV_2135; structural genomics, PSI-biology, midwest center for structu genomics, MCSG, unknown function; 1.65A {Beutenbergia cavernae} No DOI found.
PVCunit1_17 PAK_2892 PAK_02622 1cii 1cii_A 63.2 2.7 6.9e-05 41.7 >1cii_A Colicin IA; bacteriocin, ION channel formation, transmembrane protein; 3.00A {Escherichia coli} SCOP: f.1.1.1 h.4.3.1 10.1038/385461a0
PVCunit1_11 PAK_2886 PAK_02616 3h2t 3h2t_A 96.6 1.9e-05 4.9e-10 79.9 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_11 PAU_03343 PAU_03382 3h2t 3h2t_A 97.4 4.4e-07 1.2e-11 89.7 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCunit1_5 afp5 PAU_02779 4tv4 4tv4_A 63.6 2.6 6.7e-05 30.5 >4tv4_A Uncharacterized protein; unknown function, ssgcid, virulence, structural genomics; 2.10A {Burkholderia pseudomallei} No DOI found.
And I can pull out all the lines which have a "_11" at the end of the first column by running the following on the commandline:
awk '{ if ($1 ~ /_11$/) { print } }' 02052017_HHresults_sorted.tsv
I want to enclose this in a loop to cover all integers from 1 - 5 (for instance), but I'm having trouble passing a variable in to the text match.
I expect it should be something like the following, but $i$ seems like it's probably incorrect and my google-fu failed me:
awk 'BEGIN{ for (i=1;i<=5;i++){ if ($1 ~ /_$i$/) { print } } }' 02052017_HHresults_sorted.tsv
There may be other issues I haven't spotted with that awk command too, as I say, I'm not very awk-savvy.
EDIT FOR CLARIFICATION
I want to separate out all the matches, so can't use a character class. i.e. I want all the lines ending in "_1" in one file, then all the ones ending in "_2" in another, and so on (hence the loop).
You can't put variables inside //. Use string concatenation, which is done by simply putting the strings adjacent to each other in awk. You don't need to use a regexp literal when you use the ~ operator, it always treats the second argument as a regexp.
awk '{ for (i = 1; i <= 5; i++) {
if ( $1 ~ ("_" i "$") ) { print; break; }
} }' 02052017_HHresults_sorted.tsv
It sounds like you're thinking about this all wrong and what you really need is just (with GNU awk for gensub()):
awk '{ print > ("out" gensub(/.*_/,"",1,$1)) }' 02052017_HHresults_sorted.tsv
or with any awk:
awk '{ n=$1; sub(/.*_/,"",n); print > ("out" n) }' 02052017_HHresults_sorted.tsv
No need to loop: use the regex character class [..] (this needs GNU awk for the three-argument match()):
awk 'match($1,/_([1-5])$/,a){ print >> (a[1] ".txt") }' 02052017_HHresults_sorted.tsv

Extracting values from columns in tab delim text file [closed]

I have a tab-delimited text file with 3 columns like the following:
Name Description Ontology
dda1 box1_homodomain gn=box1 os=homo C:Cell;C:surface;F:binding;P:toy
dda2 sox2_plurinet gn=plu os=mouse C:Organ;F:transport;P:carrier;P:avi
dd13 klf4_iPSC gn=klf os=Bos C:Cell;F:tiad;P:abs;P:digestion
Now I would like to split the values (gn=xxx and os=xxx) in the Description column and the values in the Ontology column (C:xxx;F:xxx;P:xxx) into separate columns, like the following:
Name Description gn os C F P
dda1 box1_homodomain box1 homo Cell;surface binding toy
dda2 sox2_plurinet plu mouse Organ; transport carrier;avi
dd13 klf4_iPSC klf Bos Cell; tiad abs;digestion
I want to export this as a tab-delimited file or an Excel file. It would be really great if someone could guide me on how to achieve that in Perl. Please help me.
Thanks in advance
I came back to a Perl question after 5 years of Java, and I was excited to do this exercise. Now that I remember how things work, I've pasted the code below. Just enrich the same code for the last column 'Ontology' with a regexp and the same hash concept and you are done. There are multiple ways to do this in Perl; some need more code, but this is the way I remember.
#!/usr/bin/perl
use Data::Dumper;
my %output;
open(IN, "stack.txt") or die "cannot open stack.txt: $!";
while (<IN>) {
    next if /^Name/;    # skip the header line
    my ($group1, $group2, $group3, $group4, $group5, $group6, $group7) =
        ($_ =~ m/(\w+)\s+(\w+)\s+(\w+)\=(\w+)\s+(\w+)\=(\w+)\s+(.*)/);
    # Column 1
    my @nameColumns = @{$output{'Name'} || []};
    push(@nameColumns, $group1);
    $output{'Name'} = [@nameColumns];
    # Column 2
    #print "$group2, $group3, $group4, $group5, $group6, $group7";
    my @descriptionColumns = @{$output{'Description'} || []};
    push(@descriptionColumns, $group2);
    $output{'Description'} = [@descriptionColumns];
    # column 3
    my @column3 = @{$output{$group3} || []};
    push(@column3, $group4);
    $output{$group3} = [@column3];
    # column 4
    my @column4 = @{$output{$group5} || []};
    push(@column4, $group6);
    $output{$group5} = [@column4];
    # Column ... ($group7 still holds the Ontology column)
}
close(IN);
print Dumper(\%output);
$VAR1 = {
'gn' => [
'box1',
'plu',
'klf'
],
'os' => [
'homo',
'mouse',
'Bos'
],
'Name' => [
'dda1',
'dda2',
'dd13'
],
'Description' => [
'box1_homodomain',
'sox2_plurinet',
'klf4_iPSC'
]
};
Note: the output is shown above. If you still can't figure out how to finish this program, let me know and I'll spend more time on it.

Add value between column using sed/awk based on matching value at certain column

I have a log file with many records; all rows have the same format. I want to use sed to match a value in a certain column and insert a new value between two columns. For example, given a log line like this:
2014.3.17 23:57:11 127.0.0.3 10.21.31.141 http://vcs2.msg.yahoo.com/capacity *DENIED* Banned site: msg.yahoo.com GET 0 0 3 403 - working_time content3 -
My command will search the log for msg.yahoo.com (column 9) and, if it matches, add the value (Social Media) between columns 12 and 13. Intended output:
2014.3.17 23:57:11 127.0.0.3 10.21.31.141 http://vcs2.msg.yahoo.com/capacity *DENIED* Banned site: msg.yahoo.com GET 0 0 Social Media 3 403 - working_time content3 -
My awk code inserts Social Media between columns 12 and 13, but unconditionally:
awk -v column=12 -v value="Social Media" '
BEGIN {
FS = OFS = " ";
}
{
for ( i = NF + 1; i > column; i-- ) {
$i = $(i-1);
}
$i = value;
print $0;
}
' access3.log
but it needs to find msg.yahoo.com in column 9 before adding the value. That is: if column 9 = msg.yahoo.com, put Social Media after column 12, i.e. between columns 12 and 13.
Workable but ugly is sed (as things so often are):
sed '/\([^ ]* \)\{8\}msg\.yahoo\.com/s/\(\([^ ]* \)\{12\}\)/\1Social Media /' filename
Here is the fix for awk
awk '$9=="msg.yahoo.com"{$(NF-6)=$(NF-6) " Social Media"}1' access3.log
Explanation
$9=="msg.yahoo.com" only target on the line which msg.yahoo.com in column 9
$(NF-6)=$(NF-6) " Social Media" : $(NF-6) is the 6th field counting back from the end (field 12 in this log format); appending " Social Media" to it effectively inserts the new value between columns 12 and 13.
1 is always true, so awk performs its default action and prints the (possibly modified) line.
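For completeness, the asker's own field-shifting loop can simply be gated on the column-9 test. A sketch; note that for the new value to land between the original columns 12 and 13 it must become field 13, so column=13 here:

```shell
awk -v column=13 -v value="Social Media" '
$9 == "msg.yahoo.com" {
    # shift fields right from the end so the slot at "column" is free,
    # then drop the new value in; assigning a field rebuilds $0 with OFS
    for (i = NF + 1; i > column; i--)
        $i = $(i - 1)
    $i = value
}
{ print }   # print every line, modified or not
' access3.log
```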