How to select a column from a file with command line - regex

I have a file that appears as follows:
some random text : azoidfalkrnalrkazlkja
zlazekamzlekazmlekalzkemlkmlkmlkmlkmlkml
o&kjoik&oék"&po"éképo"k&éo"kéo"koé"kk"k"
Column1 Column2 Column3 Column4 Column5
=======================================
0 1 1000 No_Light X Disabled (Persistent)
1 1 1010 Online X E-Port 10:20:30:40:50:60:70:80 "some comment"
2 1 1020 Online X F-Port 10:00:00:00:00:00:00:00
3 1 1030 No_Light X Disabled (Persistent)
I can extract all lines with "Online" status using grep "^ *[0-9].*Online" ./myfile. How can I then extract further information from each line (for instance, add each value to a $COLUMN variable)?
I would like to extract all data from the 3rd column, and then treat the result as an array to extract the data from each line.
EDIT: Quoting Jotne's answer, I did something like this:
COLUMN=3
MYVARIABLE=($(awk '/Online/ {print $c}' c="$COLUMN" file))
echo ${MYVARIABLE[0]}
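The resulting array can then be indexed (zero-based) or iterated over; a minimal sketch:
# loop over every extracted value
for value in "${MYVARIABLE[@]}"; do
    echo "value: $value"
done
# count of matching lines
echo "number of matches: ${#MYVARIABLE[@]}"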

To get information from, e.g., column #3 for lines that are Online:
COLUMN=3
awk '/Online/ {print $c}' c="$COLUMN" file
1010
1020
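Equivalently, the variable can be passed with awk's standard -v option, which also makes it visible in a BEGIN block:
COLUMN=3
awk -v c="$COLUMN" '/Online/ {print $c}' file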


How to sum by group in Power Query Editor?

My table looks like this:
Serial WO# Value Indicator
A 333 10 333-1
A 333 4 333-2
B 456 5 456-1
A 334 1 334-1
A 334 5 334-2
I want to create a new column that sums up the Values based on WO#. It should look like this:
Serial WO# Value Indicator SumValue
A 333 10 333-1 14
A 333 4 333-2 14
B 456 5 456-1 5
A 334 1 334-1 6
A 334 5 334-2 6
Eventually I will remove duplicates on the WO# and remove the Value and Indicator columns from the data. I can't seem to find a function in M that allows summing by group. Thanks in advance!
If you load the data with Power Query, there is a Group By command on the ribbon that will do just that.
Make sure to use the Advanced option and add all columns you want to retain to the grouping section. (The original answer included screenshots of the Group By dialog from Excel and from Power BI.)
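If you prefer to write the step in M yourself, here is a minimal sketch. The inline #table literal stands in for the real source query; grouping on WO# with an "All Rows" aggregation keeps the original rows, which are then re-expanded so every row carries its group's SumValue:
let
    // sample data standing in for the real source query
    Source = #table(
        {"Serial", "WO#", "Value", "Indicator"},
        {{"A", 333, 10, "333-1"}, {"A", 333, 4, "333-2"},
         {"B", 456, 5, "456-1"}, {"A", 334, 1, "334-1"},
         {"A", 334, 5, "334-2"}}),
    // group on WO#: compute the sum, and keep each group's original rows
    Grouped = Table.Group(Source, {"WO#"}, {
        {"SumValue", each List.Sum([Value]), type number},
        {"AllRows", each _, type table}}),
    // re-expand the kept rows so every original row gets its group's SumValue
    Expanded = Table.ExpandTableColumn(Grouped, "AllRows", {"Serial", "Value", "Indicator"})
in
    Expanded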

Extract text before a specific word with hive

I have data in a column that looks like below :
Avenue 1 HE1 345 HOUSE 123.
FLAT 202 HRE2 D34 HOUSE 345.
DOOR 324 HA1 345 HOUSE 675.
I need to extract the postcode, which always comes directly before HOUSE and is 6-7 characters long in all cases. There is always a whitespace before HOUSE, one inside the postcode, and one before the postcode.
Desired output :
HE1 345
HRE2 D34
HA1 345
I've tried using substring_index twice, only to find that Hive doesn't support that function. I'm pretty much a novice at Hive. Any help, or a pointer to reference material, would be appreciated.
Thanks in advance.
You can use this regex pattern: ' (\\w+ \\w+) HOUSE'. This means one space, one or more word characters, one space, one or more word characters, one space, then HOUSE. The part in parentheses is the group to be extracted; its group index is 1.
Demo:
select regexp_extract(s,' (\\w+ \\w+) HOUSE',1)
from
(select 'Avenue 1 HE1 345 HOUSE 123.' s union all
select 'FLAT 202 HRE2 D34 HOUSE 345.' s union all
select 'DOOR 324 HA1 345 HOUSE 67' s) s;
OK
HE1 345
HRE2 D34
HA1 345
Time taken: 26.472 seconds, Fetched: 3 row(s)
For case-insensitive matching, use the (?i) modifier:
hive>
>
> select regexp_extract(s,' (\\w+ \\w+) (?i)HOUSE',1)
> from
> (select 'Avenue 1 HE1 345 HOUSe 123.' s union all
> select 'FLAT 202 HRE2 D34 HOUsE 345.' s union all
> select 'DOOR 324 HA1 345 HOuSE 67' s) s;
OK
HE1 345
HRE2 D34
HA1 345
See regex docs here: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
About case insensitive: http://www.regular-expressions.info/modifiers.html
Alternatively, you can save that data as a CSV file (copy the contents into Notepad and save it with a .csv extension).
Now you can create a table in Hive and load the data from the CSV file into the table.
hive> create table text(column1 string,column2 string,column3 string,column4 string,column5 string, column6 string) Row format delimited fields terminated by ' ' ;
OK
Time taken: 0.137 seconds
To load data into the table, use:
hive> load data LOCAL inpath 'location of your file' overwrite into table text;
hive> load data LOCAL inpath '/home/cloudera/FinalProjects/text.csv' overwrite into table text;
Loading data to table default.text
Table default.text stats: [numFiles=1, numRows=0, totalSize=84, rawDataSize=0]
OK
Time taken: 0.59 seconds
hive> select column3, column4 from text;
OK
HE1 345
HRE2 D34
HA1 345
Time taken: 0.145 seconds, Fetched: 3 row(s)
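If the columns don't always split this cleanly, the two approaches can be combined: load each row into a single string column and apply regexp_extract to it. A sketch; the table name raw_lines and the file path are placeholders:
hive> create table raw_lines(line string);
hive> load data LOCAL inpath '/path/to/text.csv' overwrite into table raw_lines;
hive> select regexp_extract(line, ' (\\w+ \\w+) HOUSE', 1) from raw_lines;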

how to add new line between header and data in flat file using informatica?

How can I add a new line between the header and the data in a flat file using Informatica?
Below is an example.
current:
ID NAME AGE
1 RAJA 28
2 JOHN 29
3 JOE 20
EXPECTED:
ID NAME AGE

1 RAJA 28
2 JOHN 29
3 JOE 20
Use the Header Command option in the session to generate the header row and the newlines, e.g.:
echo ID NAME AGE;echo;echo
Alternatively, use a command task in a workflow, or a post-session command, and provide the Unix script below:
awk 'NR==2{print ""} {print}' FileName > TempFile && mv TempFile FileName
This inserts a blank line right after the header line in the file. Note the temporary file: redirecting awk's output straight back to FileName would truncate the file before awk could read it.
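The same effect can be had with head and tail, writing to a new file (a sketch; NewFile is a placeholder name):
{ head -n 1 FileName; echo; tail -n +2 FileName; } > NewFile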

how can I sort the lines of a file according to a specific part of the line

This is the file content (just a simplified example) which I want to sort:
1 33 blabla_0_banana
2 32333 lablab_4_apple
3 1232312 hahaah_1_banana
4 3342222 ohohoh_2_apple
And I want to sort the results with two requirements:
first by the end word (e.g. banana/apple),
second by the number between the two "_" symbols: _[number]_ (e.g. 0/4/1/2).
This is the result I want:
4 3342222 ohohoh_2_apple
2 32333 lablab_4_apple
1 33 blabla_0_banana
3 1232312 hahaah_1_banana
And finally, I want to delete the lines whose second number is >100000; this is the result I want then:
2 32333 lablab_4_apple
1 33 blabla_0_banana
How can I do this? Maybe with 'sort', 'awk', or other commands.
Using sort:
sort -t_ -k3 -k2n file
4 3342222 ohohoh_2_apple
2 32333 lablab_4_apple
1 33 blabla_0_banana
3 1232312 hahaah_1_banana
To keep only rows with the 2nd column < 100000, use awk:
awk '$2<100000' file | sort -t_ -k3 -k2n
2 32333 lablab_4_apple
1 33 blabla_0_banana
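For reference, a commented version of the same pipeline, spelling out what each sort flag does:
# -t_   use "_" as the field separator, so field 2 is the number and field 3 is the word
# -k3   primary key: the word after the second underscore
# -k2n  secondary key: the number between the underscores, compared numerically
awk '$2<100000' file | sort -t_ -k3 -k2n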

space delimited file handling

I have insider transactions of a company in a space delimited file. Sample data looks like the following:
1 Gilliland Michael S January 2,2013 20,000 19
2 Still George J Jr January 2,2013 20,000 19
3 Bishkin S. James February 1,2013 150,000 21
4 Mellin Mark P May 28,2013 238,000 25.26
Col1 is a serial# that I don't need to print.
Col2 is the name of the person who did the trades. This column is not consistent: it has a first name and a last name, a middle initial, and for some insiders salutations as well (Mr., Dr., Jr., etc.).
Col3 is the date, in the format Month Day,Year.
Col4 is the number of shares traded.
Col5 is the price at which shares were purchased or sold.
I need your help printing each column value separately. Thanks!
Count the total number of fields read; the difference between that and the number of non-name fields gives you the width of the name.
#!/bin/bash
# uses bash features, so needs a /bin/bash shebang, not /bin/sh

# read all fields of each line into an array
while read -r -a fields; do
    # calculate name width assuming 5 non-name fields
    name_width=$(( ${#fields[@]} - 5 ))
    cur_field=0

    # read initial serial number
    ser_id=${fields[cur_field]}; (( ++cur_field ))

    # read name
    name=''
    for ((i=0; i<name_width; i++)); do
        name+=" ${fields[cur_field]}"; (( ++cur_field ))
    done
    name=${name# } # trim leading space

    # date spans two fields due to containing a space
    date=${fields[cur_field]}; (( ++cur_field ))
    date+=" ${fields[cur_field]}"; (( ++cur_field ))

    # final fields are one span each
    num_shares=${fields[cur_field]}; (( ++cur_field ))
    price=${fields[cur_field]}; (( ++cur_field ))

    # print in newline-delimited form
    printf '%s\n' "$ser_id" "$name" "$date" "$num_shares" "$price" ""
done
Run as follows (if you saved the script as process):
./process <input.txt >output.txt
It might be a little easier in perl.
perl -lane '
@date = splice @F, -4, 2;
@left = splice @F, -2, 2;
splice @F, 0, 1;
print join "|", "@F", "@date", @left
' file
Gilliland Michael S|January 2,2013|20,000|19
Still George J Jr|January 2,2013|20,000|19
Bishkin S. James|February 1,2013|150,000|21
Mellin Mark P|May 28,2013|238,000|25.26
You can change the delimiter in the join as per your requirement.
Here is the data separated using awk
awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file
1|Gilliland Michael S|January 2,2013|20,000|19
2|Still George J Jr|January 2,2013|20,000|19
3|Bishkin S. James|February 1,2013|150,000|21
4|Mellin Mark P|May 28,2013|238,000|25.26
You now have your data in variables c1 to c5.
Or better displayed here:
awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file | column -t -s "|"
1 Gilliland Michael S January 2,2013 20,000 19
2 Still George J Jr January 2,2013 20,000 19
3 Bishkin S. James February 1,2013 150,000 21
4 Mellin Mark P May 28,2013 238,000 25.26
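For readability, the same awk program can be written out with comments (functionally identical):
awk '{
    c1 = $1                    # serial number
    c5 = $NF                   # price (last field)
    c4 = $(NF-1)               # number of shares traded
    c3 = $(NF-3) FS $(NF-2)    # date spans two space-separated fields
    # blank out everything except the name fields
    $1 = $NF = $(NF-1) = $(NF-2) = $(NF-3) = ""
    gsub(/^ | *$/, "")         # trim the leftover spaces
    c2 = $0                    # what remains is the name
    print c1 "|" c2 "|" c3 "|" c4 "|" c5
}' file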