How to identify '\N' character in data using Pig - regex

I'm getting a very weird character sequence, '\N', in my data. I want to remove or replace it. Below is a data sample:
Girls Shoes,1325051884
\N,\N
Men's Shirts,\N
Delimiter: comma (,)
I tried a couple of ways to identify and replace this \N character, but none of them worked.

In Pig, positional notation is indicated with the dollar sign ($) and begins with zero (0); for example, $0, $1, $2.
So, in the data mentioned above, the first field is identified by $0 (e.g. "Girls Shoes") and the second by $1 (e.g. 1325051884).
The following script has the logic to replace '\N':
A = LOAD '/data.txt' USING PigStorage(',');
B = FILTER A BY ($0 != '\\N') OR ($1 != '\\N');
dump B;
C = FOREACH B GENERATE ($0 == '\\N' ? '' : $0), ($1 == '\\N' ? '' : $1);
dump C;
Where '/data.txt' contains following data:
Girl's Shoes,1325051884
\N,\N
Men's Shirts,\N
\N,Boy's Pants
Logic:
A = LOAD '/data.txt' USING PigStorage(',');
Loads data, by assuming the delimiter to be comma (,).
B = FILTER A BY ($0 != '\\N') OR ($1 != '\\N');
For each loaded record, filter the records by condition: $0 (first field) NOT EQUALS '\N' OR $1 (second field) NOT EQUALS '\N'
Output of this stage would be (2nd record containing both '\N' is filtered out):
(Girl's Shoes,1325051884)
(Men's Shirts,\N)
(\N,Boy's Pants)
C = FOREACH B GENERATE ($0 == '\\N' ? '' : $0), ($1 == '\\N' ? '' : $1);
For each of the records generated in the 2nd step, it checks: if $0 is equal to '\N'. If yes, it emits blank (''), else emits $0. Similar logic is applied to $1.
Output of this stage would be:
(Girl's Shoes,1325051884)
(Men's Shirts,)
(,Boy's Pants)
You can see that, '\N' is replaced by blank ('').
I am using Apache Pig 0.15. This script worked perfectly for your data.
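To check the expected output outside of Pig, the same two steps (drop records where every field is \N, then blank out the remaining \N values) can be sketched in plain Python. This is only a cross-check of the logic under the assumption that the data is small enough to read locally; the function name clean_rows is made up for illustration.

```python
# The marker is a literal backslash followed by a capital N.
NULL_MARKER = r"\N"

def clean_rows(lines, delimiter=","):
    """Mirror the FILTER and FOREACH steps of the Pig script."""
    cleaned = []
    for line in lines:
        fields = line.split(delimiter)
        # FILTER step: skip records in which every field is the marker.
        if all(field == NULL_MARKER for field in fields):
            continue
        # FOREACH step: replace the marker with an empty string.
        cleaned.append(["" if field == NULL_MARKER else field for field in fields])
    return cleaned

sample = ["Girl's Shoes,1325051884", r"\N,\N", r"Men's Shirts,\N", r"\N,Boy's Pants"]
for row in clean_rows(sample):
    print(row)
```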

A = FILTER data BY $2 == '\\N';
This lists every record in which that field contains the \N marker ($2 is the third field; use the position of the field you want to check).

Related

perl regular expression match scalar plus punctuation

I have scalars (columns in a table) that hold one or two email addresses separated by a comma, such as 'Joek@xyznco.com, jrancher@candyco.us' or 'jsmith@wellingent.com,mjones@wellingent.com'. For several of these records I need to remove a bad/old email address and the trailing comma (if one exists).
If jsmith@wellingent.com is no longer valid, how do I remove that address and the trailing comma?
This only removes the address but leaves the comma.
my $general_email = 'jsmith@wellingent.com,mjones@wellingent.com';
my $bad_addr = 'jsmith@wellingent.com';
$general_email =~ s/$bad_addr//;
Thanks for any help.
You may be better off without a regex but with list splitting:
use strict;
use warnings;
sub remove_bad {
    my ($full, $bad) = @_;
    my @emails = split /\s*,\s*/, $full; # split at comma, allowing for spaces around the comma
    my @filtered = grep { $_ ne $bad } @emails;
    return join ",", @filtered;
}
print 'First: ' , remove_bad('me@example.org, you@example.org', 'me@example.org'), "\n";
print 'Last: ', remove_bad('me@example.org, you@example.org', 'you@example.org'), "\n";
print 'Middle: ', remove_bad('me@example.org, you@example.org, other@eample.org', 'you@example.org'), "\n";
First, split the email address list at the comma, creating an array. Then filter that array with grep to remove the bad address. Finally, join the remaining elements back into a string.
The above code prints:
First: you@example.org
Last: me@example.org
Middle: me@example.org,other@eample.org
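For comparison, the same split, filter, and join idea can be sketched in Python. This is only an illustrative translation of the Perl answer above; the function name remove_bad is carried over from it, and the addresses are placeholders.

```python
import re

def remove_bad(full, bad):
    """Remove one address from a comma-separated list and rejoin the rest."""
    # Split at commas, allowing for spaces around each comma.
    emails = [e for e in re.split(r"\s*,\s*", full.strip()) if e]
    # Keep every address except the bad one, then rejoin with plain commas.
    return ",".join(e for e in emails if e != bad)

print("First:", remove_bad("me@example.org, you@example.org", "me@example.org"))
print("Last:", remove_bad("me@example.org, you@example.org", "you@example.org"))
```

Because the list is rebuilt from its elements, no leading or trailing comma is left behind regardless of where the bad address was.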

Get matching group from previous line

I'm working on a sed script to run through a file and make substitutions. The existing file will have a sequence of floating point numbers, and the sequence ends when a letter is found. Most of the substitutions are straightforward and look like this:
s/(-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) l/lineto(\1,\2);/g
To just replace the raw command with a function call.
Some commands have no 1:1 equivalent to a function call, because they depend on coordinates found on the previous line.
So I need to turn this:
1.068 7.399 m
-11.794 13.153 -11.843 12.234 v
Into this:
move(1.068,7.399);
curveto(1.068,7.399,-11.794,13.153,-11.843,12.234);
The last set of coordinates from the previous line needs to be used as the first set of coordinates for this line. The coordinates in the previous line don't always end in the same token, so that this:
-7.451 17.792 -10.366 16.42 -11.198 14.444 c
-11.794 13.153 -11.843 12.234 v
Needs to become this:
curveto(-7.451,17.792,-10.366,16.42,-11.198,14.444);
curveto(-11.198,14.444,-11.794,13.153,-11.843,12.234);
Here's my attempt (which is not working, broken into lines for readability, this is a one liner):
s/
.*(-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) [a-zA-Z]$^(-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) y/
curveto(\1,\2,\3,\4,\5,\6);/
g
What's the correct way to do this?
For your problem, you could try something along these lines:
$NF == "m" { print "move(" $1 "," $2 ");" }
$NF == "v" { print "curveto(" one "," two "," $1 "," $2 "," $3 "," $4 ");" }
$NF == "c" { print "curveto(" $1 "," $2 "," $3 "," $4 "," $5 "," $6 ");" }
{ one = $(NF-2); two = $(NF - 1) }
$NF is the last field of each line and is used to select which transformation to apply. The two fields preceding the command are assigned to the variables one and two (x and y might be a better choice).
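The same idea, remembering the last coordinate pair of each line so that a following v command can reuse it, can be sketched in Python. This is a rough illustration that only handles the m, l, c, and v commands mentioned in the question; the function name convert is hypothetical.

```python
def convert(lines):
    """Translate path commands into function calls, carrying the last point forward."""
    out = []
    prev = None  # last coordinate pair of the previous line
    for line in lines:
        if not line.strip():
            continue
        *coords, cmd = line.split()
        if cmd == "m":
            out.append("move(" + ",".join(coords) + ");")
        elif cmd == "l":
            out.append("lineto(" + ",".join(coords) + ");")
        elif cmd == "c":
            out.append("curveto(" + ",".join(coords) + ");")
        elif cmd == "v":
            if prev is None:
                raise ValueError("'v' needs a previous point")
            # Prepend the previous line's last point to this line's coordinates.
            out.append("curveto(" + ",".join(prev + coords) + ");")
        prev = coords[-2:]
    return out

for call in convert(["1.068 7.399 m", "-11.794 13.153 -11.843 12.234 v"]):
    print(call)
```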

Finding columns with only white space in a text file and replace them with a unique separator

I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see the only way that we can separate the columns is by finding columns that have only one or more spaces. How can we identify these columns and replace them with a unique separator like ,.
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Note:
If we find all continuous columns with one or more white spaces (nothing else) and replace them with , (all the column) the problem will be solved.
Better explanation of the question by josifoski :
Per block of matrix characters, if all are 'space' then all block should be replaced vertically with one , on every line.
$ cat tst.awk
BEGIN{ FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR {
    for (i=1;i<=NF;i++) {
        if ($i == " ") {
            space[i]
        }
        else {
            nonSpace[i]
        }
    }
    next
}
FNR==1 {
    for (i in nonSpace) {
        delete space[i]
    }
}
{
    for (i in space) {
        $i = ","
    }
    gsub(/,+/,",")
    print
}
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Another approach in awk:
awk 'BEGIN{OFS=FS=""} # Set the field separator to nothing so each character is a field
FNR==NR{for(i=1;i<=NF;i++)a[i]+=$i!=" ";next} # First pass: a[i] counts the lines that have
# a non-space character at position i.
# Skip all further commands during the first pass.
{ # Second pass (same file, read a second time)
    for(i=1;i<=NF;i++) # Loop through the character positions
        if(!a[i]){ # If position i is a space on every line
            $i="," # Change the field to ","
            x=i # Set x to the field number
            while(!a[++x]){ # While the following positions are also all-space
                $x="" # Change the field to nothing
                i=x # Set i to x so it does not process those fields again
            }
        }
}1' test{,} # Print; test{,} expands to "test test", so the same file is read twice
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.
# required package
require(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)

Replace several occurences of the same character in a different way in AWK

I want to replace several characters in a csv file depending on the characters around them using AWK.
For example in this line:
"Example One; example one; EXAMPLE ONE; E. EXAMPLE One"
I would like to replace every capital "E" with "EE" if it is within a word that uses only capitals, and with "Ee" if it is in a word with upper- and lower-case letters or in an abbreviation (like the E.; it's an address file, so there are no cases where this could also be the end of a sentence), so it should look like this:
"Eexample One; example one; EEXAMPLEE ONEE; Ee. EEXAMPLEE One"
Now what I have tried is this:
{if ($0 ~/E[A-Z]+/)
$0 = gensub(/E/,"EE","g",$0)
else if ($0 ~/[A-Z]E/)
$0 = gensub(/E/,"EE","g",$0)
else
$0 = gensub(/E/,"Ee","g",$0)
}
This works fine in most cases, but for lines (or fields, for that matter) that contain several "E"s where I'd want one to be replaced with "Ee" and another with "EE", as in "E. EXAMPLE One", it matches the E in "EXAMPLE" and just replaces all "E"s in that line with "EE".
Is there a better way to do this? Can I maybe somehow use if within gensub?
ps: Hope this makes sense, I just started learning the basics of programming!
$ cat tst.awk
{
    head = ""
    tail = $0
    while ( match(tail,/[[:alpha:]]+\.?/) ) {
        tgt = substr(tail,RSTART,RLENGTH)
        add = (tgt ~ /^[[:upper:]]+$/ ? "E" : "e")
        gsub(/E/,"&"add,tgt)
        head = head substr(tail,1,RSTART-1) tgt
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}
$ awk -f tst.awk file
Eexample One; example one; EEXAMPLEE ONEE; Ee. EEXAMPLEE One
It's not clear though how you distinguish a string of letters followed by a period as an abbreviation or just the end of a sentence.
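The per-word decision made by the awk script above can be sketched in Python with a replacement callback, which is roughly what "using if within gensub" would amount to. This is an illustrative translation; the function name double_e is hypothetical.

```python
import re

# A word is a run of letters, optionally followed by a period (an abbreviation).
WORD = re.compile(r"[A-Za-z]+\.?")

def double_e(text):
    """Turn E into EE in all-caps words and into Ee everywhere else."""
    def fix(match):
        word = match.group(0)
        # Only words made up entirely of capitals (no period) get "EE".
        add = "E" if re.fullmatch(r"[A-Z]+", word) else "e"
        return word.replace("E", "E" + add)
    return WORD.sub(fix, text)

print(double_e("Example One; example one; EXAMPLE ONE; E. EXAMPLE One"))
```

Because each word is examined on its own, a line can contain both kinds of replacement.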

I want to replace ',' on the 150th location in a String with a <br>

My String is : PI Last Name equal to one of
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN','ALVAREZ','AMARYAN','AMBESI-IMPIOMBATO','AMEGBETO','AMOWITZ', 'ANAGNOSTARAS','ANAND','ANDERSEN','ANDERSON', 'ANDRADE','ANDREEFF','ANDROPHY','ANGER','ANHOLT','ANTHONY','ANTLE','ANTONELLI','ANTONY', 'ANZULOVICH', 'APODACA','APOSHIAN','APPEL','APPLEBY','APRIL','ARAUJO','ARBIB','ARBOLEDA', 'ARCHAKOV','ARCHER', 'ARECHAVALETA-VELASCO','ARENS','ARGON','ARGYROKASTRITIS', 'ARIAS','ARIZAGA','ARMSTRONG','ARNON', 'ARSHAVSKY','ARVIN','ASATRYAN','ASCOLI','ASKENASE','ASSI','ATALAY','ATANASOVA','ATKINSON','ATTYGALLE','ATWEH','AU','AVETISYAN','AWE','AYOUB','AZAD','BACSO','BAGASRA','BAKER','BALAS', 'BALCAZAR','BALK','BALKAY','BALLOU','BALRAJ','BALSTER','BANERJEE','BANKOLE','BANTA','BARAL','BARANOWSKA','BARBAS', 'BARBER','BARILLAS-MURY','BARKHOLT','BARNES','BARNETT','BARRETT','BARRIA','BARROW','BARROWS','BARTKE','BARTLETT','BASSINGTHWAIGHTE','BASSIOUNY','BASU','BATES','BATTAGLIA','BATTERMAN','BAUER','BAUERLE','BAUM','BAUME', 'BAUMLER','BAVISTER','BAWA','BAYNE','BEASLEY','BEATTY','BEATY','BEBENEK','BECK','BECKER','BECKMAN','BECKMAN-SUURKULA' ,'BEDFORD','BEDOLLA','BEEBE','BEEMON','BEHETS','BEHRMAN','BEIER','BEKKER','BELL','BELLIDO','BELMAIN', 'BENATAR','BENBENISHTY','BENBROOK','BENDER','BENEDETTI','BENNETT','BENNISH','BENZ','BERG','BERGER','BERGEY','BERGGREN','BERK','BERKOWITZ','BERLIN','BERLINER','BERMAN','BERTINO','BERTOZZI','BERTRAND','BERWICK','BETHONY','BEYERS','BEYRER' 
,'BEZPROZVANNY','BHAGWAT','BHANDARI','BHARGAVA','BHARUCHA','BHUJWALLA','BIANCO','BIDLACK','BIELERT','BIER','BIESSMANN','BIGELOW' ,'BILLER','BILLINGS','BINDER','BINDMAN','BINUTU','BIRBECK','BIRGE','BIRNBAUM','BIRO','BIRT','BISHAI','BISHOP','BISSELL','BJORKEGREN','BJORNSTAD','BLACK','BLANCHARD','BLASS','BLATTNER','BLIGNAUT','BLOCH','BLOCK','BLOOM','BLOOM,','BLUM','BLUMBERG' ,'BLUMENTHAL','BLYUKHER','BODDULURI','BOFFETTA','BOGOLIUBOVA', 'BOLLINGER','BOLLS','BOMSZTYK','BONANNO','BONNER','BOOM','BOOTHROYD','BOPPANA','BORAWSKI','BORG','BORIS-LAWRIE','BORISY','BORLONGAN','BORNSTEIN','BORODOVSKY','BORST','BOS','BOTO','BOWDEN','BOWEN','BOYCE-JACINO','BRADEN','BRADY' ,'BRAITHWAITE','BRANN','BRASH','BRAUNSTEIN', 'BREMAN','BRENNAN','BRENNER','BRETSCHER','BREW','BREYSSE','BRIGGS','BRITES','BRITT','BRITTENHAM','BRODIE','BRODY','BROOK','BROOTEN','BROSCO','BROSNAN','BROWN','BROWNE','BRUCKNER','BRUNENGRABER','BRYL','BRYSON','BU','BUCHAN','BUDD','BUDNIK', 'BUEKENS','BUKRINSKY','BULLMORE','BULUN','BURBANO','BURGENER','BURGESS','BURKS','BURMEISTER','BURNETT','BURNHAM','BURNS','BURRIDGE','BURTON','BUSCIGLIO','BUSHEK','BUSIJA','BUZSAKI','BZYMEK','CABA')
I need a regex that greedily matches up to 150 characters, with the last character being a ',', and then replaces that last ',' of the 150 with a <br />.
Any suggestions, please?
I used this ','(?=[^()]*\)) but it replaces all the occurrences. I want only every 150th one to be replaced.
Thanks everyone for your suggestions. I managed to do it with Java code instead of regex.
StringBuilder sb = new StringBuilder(html);
int i = 0;
while ((i = sb.indexOf("','", i + 150)) != -1) {
int j = sb.lastIndexOf("','", i + 150);
sb.insert(i+1, "<BR>");
}
return sb.toString();
However, this breaks at the first occurrence of ',' in the 150 chars.
Can anyone help modify my code so that it breaks at the last occurrence of ',' within the 150 chars?
You'll want something like this:
Look for every occurrence of \([^)]*,[^)]*\) (that is, find a parenthesis-wrapped string with a comma in it), and then run the following regular expression on each of the matched elements:
(.{135,150}[^,]*?),
The first number is the minimum number of characters you want to match before you add a break tag -- the second is the maximum number of characters you would like to match before inserting a break tag. If there is no , between the characters in question then the regular expression will continue to consume characters until it finds a comma.
You could probably do it like this:
regex ~ /(^.{1,14}),/
replacement ~ '\1<replacement>' or "$1<insert your text>"
In Perl:
$target = ','x 22;
$target =~ s/(^ .{1,14}) , /$1<15th comma>/x;
print $target;
Output
,,,,,,,,,,,,,,<15th comma>,,,,,,,
Edit: As an alternative, if you want to break the string up into successive chunks of 150 characters or fewer,
you could do it this way:
regex ~ /(.{1,150},)/sg
replacement ~ '\1<br/>' or "$1<br\/>"
// That is a regex of type global (/g) and include newlines (/s)
In Perl:
$target = "
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN', ... )
";
if ($target =~ s/( .{1,150} , )/$1<br\/>/sxg) {
print $target;
}
Output:
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI',<br/>'ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA',<br/>'AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE',<br/>'ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN',<br/> ... )
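The same greedy pattern can be tried in Python. The sketch below is an assumption-laden illustration rather than the poster's code: the function name insert_breaks and its parameters are made up, and it treats every comma as a possible break point, including a comma inside a quoted value such as 'BLOOM,'.

```python
import re

def insert_breaks(text, limit=150, tag="<br/>"):
    """Append a tag after the last comma found within each run of up to limit characters."""
    # Greedy .{1,limit} backtracks to the last comma that still fits in the window.
    pattern = re.compile(r".{1,%d}," % limit, re.DOTALL)
    return pattern.sub(lambda m: m.group(0) + tag, text)

names = ",".join("'NAME%02d'" % i for i in range(60))
print(insert_breaks(names, limit=30))
```

Text after the final comma is left as it is, since the pattern only matches chunks that end in a comma.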