How to remove lines containing % in Pig?

I have a file whose third column contains data that I want to filter using Pig before performing other operations on it.
The strings look like %D0%90%D0%BB%D0%B0; the other strings are similar, and all of them contain the % character.
How can I filter this data? This is what I am doing:
Z = FILTER A BY not (a3 matches '.*%%D0%%*.');

First, make sure that a3 is a Chararray. Then, you should filter like this:
Z= FILTER A BY NOT a3 MATCHES '.*%D0%.*';
As far as I know, there is no need to escape %, and a single % is enough if you want to keep only the rows that do not contain %D0%. However, if you want to drop the values that contain the literal text %%D0%%, your expression should work once the trailing *. is changed to .* (as written, it requires one extra character at the end of the string).

This worked in my case:
Z = FILTER A BY NOT a3 matches '.*.[%].*.';
With this filter I am able to remove the lines containing '%'.
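
The difference between the two patterns above can be checked outside Pig. The following is a minimal Python sketch, not Pig code: it assumes that Pig's MATCHES has to match the whole string (as Java's String.matches does), which re.fullmatch mirrors. The sample rows other than the first one are made up for illustration.

```python
import re

# Hypothetical values for the third column (a3); only the first one
# comes from the question.
rows = ["%D0%90%D0%BB%D0%B0", "plain", "100%", "%"]

# Whole-string match: '.*%.*' therefore means "contains a % anywhere".
contains_percent = re.compile(r".*%.*")

# Pattern from the last answer: the extra dots require at least one
# character on each side of the %.
strict = re.compile(r".*.[%].*.")

# Keep only the rows that do NOT match, as FILTER ... BY NOT ... does.
kept = [r for r in rows if not contains_percent.fullmatch(r)]
kept_strict = [r for r in rows if not strict.fullmatch(r)]

print(kept)         # rows without any %
print(kept_strict)  # also keeps rows where % is the first or last character
```

If values can start or end with %, the simpler '.*%.*' form is the safer choice.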

Related

Regular expression to get specific pattern in snowflake

I have a table with column data like below.
Column5 :
1) ["[\"( "ABC12345678", "ABC00123451","ABC00543211")\"]"]
2) ["[\"( ABC87654321\"]"]
I just need to clean this column and fetch it like below.
1) ABC12345678,ABC00123451,ABC00543211
2) ABC87654321
Currently I am using the replace function repeatedly to clean the data:
replace( replace (replace (replace(replace(replace(replace (Column5,'[',''),']',''),'',''),'"',''),'\'',''),')',''),'(','') as column5list
Is there a regular expression I can use to clean the data?
The pattern stays the same: ABC followed by 8 digits.

Nested REPLACE calls can be simplified with TRANSLATE:
SELECT Column5, TRANSLATE(Column5, $$[]"'()$$, '') AS result
FROM tab;
Sample data:
CREATE OR REPLACE TABLE tab AS
SELECT '["[\"( "ABC12345678", "ABC00123451","ABC00543211")\"]"]' AS column5;
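
Since the question states that every value is ABC followed by 8 digits, another option is to extract the matches rather than strip the surrounding characters. The following is a minimal Python sketch of that idea, not Snowflake SQL; the function name extract_ids is hypothetical.

```python
import re

def extract_ids(raw):
    """Return every 'ABC' + 8 digits token in raw, joined by commas."""
    return ",".join(re.findall(r"ABC\d{8}", raw))

# The two sample values from the question.
row1 = '["[\\"( "ABC12345678", "ABC00123451","ABC00543211")\\"]"]'
row2 = '["[\\"( ABC87654321\\"]"]'

print(extract_ids(row1))
print(extract_ids(row2))
```

Extracting the wanted tokens also drops the spaces and backslashes without having to list every unwanted character.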

Google Sheets formula to add case-insensitive text + text in cell

I have some text in column A, and I want to type a search term into cell E1 and filter column A by it with a formula like this:
=Filter(A1:A10;ArrayFormula(E1 REGEXMATCH(A1:A10;E1)))
but I want it to match cells that CONTAIN the text, not only cells with the EXACT text:
=filter(A1:A10;REGEXMATCH(A1:A10;"(i?) TEX"))
This works, but I want to use a cell value instead of hard-coded text,
so I need to combine the two somehow.
If I put (?i)TEX in cell E1, it finds TEXT in column A, but I want to keep (?i) in the formula itself and can't work out how to do it.
I tried:
=Filter(A1:A10;ArrayFormula(E1 REGEXMATCH(A1:A10;"(i?) +"E1"")))
doesn't work
=Filter(A1:A10;ArrayFormula(E1 REGEXMATCH(A1:A10;"(i?)"+E1)))
doesn't work
=filter(A1:A10;REGEXMATCH(A1:A10;"(i?)&" "&E1"))
doesn't work
I really have no idea how to add (i?) to the cell value.

To make the match case-insensitive you need (?i) instead of (i?). I believe this should work:
=filter(A1:A10;REGEXMATCH(A1:A10; "(?i)"&E1))
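
The same idea (prefix the search term with the inline flag (?i), then test whether the pattern occurs anywhere in the text) can be sketched in Python. This is an illustration only, not Sheets code; the function filter_contains and the sample cells are made up.

```python
import re

def filter_contains(values, term):
    """Keep the values that contain term, ignoring case."""
    # "(?i)" + term plays the role of "(?i)"&E1 in the formula.
    pattern = re.compile("(?i)" + term)
    # search() succeeds on a partial match, so this is a "contains" test.
    return [v for v in values if pattern.search(v)]

cells = ["Some TEXT here", "text again", "nothing", "conTEXt"]
matches = filter_contains(cells, "TEX")
print(matches)
```

If the term may contain regex metacharacters such as . or (, wrapping it in re.escape(term) keeps them literal.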

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
The output has to be in separate columns (one per field), with no "_" and excluding the final number, as follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
Any help is appreciated.
Thanks in advance (and sorry for my English).

=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace every _ with parentheses to create capture groups, while also excluding the digit string at the end, and then surround the whole string with parentheses.
We then use REGEXEXTRACT to actually pull the pieces out; the groups automatically push them into their own cells/columns.

To solve this you can use the SPLIT and REGEXREPLACE functions.
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]). The text is first cleaned with REGEXREPLACE(A1,"_+\d*$","") to remove 123421, and SPLIT then gives you one column for each word delimited by "_".
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = REGEXREPLACE(A1,"_+\d*$","") //This gives you: campaign1_attribute1_whatever_yes
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
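
The two steps above (strip the trailing "_" plus digits, then split on "_") can be sketched in Python with the same regular expression. This is an illustration, not Sheets code; the function name split_fields is hypothetical.

```python
import re

def split_fields(text):
    """Drop the trailing '_<digits>' suffix, then split on underscores."""
    # Same pattern as in the REGEXREPLACE call: underscores followed by
    # optional digits, anchored at the end of the string.
    trimmed = re.sub(r"_+\d*$", "", text)
    return trimmed.split("_")

fields = split_fields("campaign1_attribute1_whatever_yes_123421")
print(fields)
```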

I finally figured it out yesterday on Stack Overflow in Spanish: https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all.
The reason I asked for a regex-only solution for Google Sheets is that I need to use it in Google Data Studio (which has the same regex functions as Sheets).
To get each column, use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to change in the formula to switch columns is the number inside {}, which is (column number - 1).
If you do not have the final number, just leave out the last "_".
Lastly, remember to recreate all the calculated fields, because (for example) you get an error with CPC, CTR and other AdWords metrics that are calculated automatically.
Hope it helps!
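
To see how the number inside {} selects the column, here is a minimal Python sketch that builds the same pattern for a given index. It is an illustration only, not Data Studio code; the function name nth_field is hypothetical.

```python
import re

def nth_field(text, n):
    """Return field n (0-based) using ^(?:[^_]*_){n}([^_]*)_, or None."""
    # Skip n "field_" prefixes, then capture the next field, which must
    # itself be followed by an underscore.
    pattern = r"^(?:[^_]*_){" + str(n) + r"}([^_]*)_"
    match = re.search(pattern, text)
    return match.group(1) if match else None

campaign = "campaign1_attribute1_whatever_yes_123421"
columns = [nth_field(campaign, n) for n in range(4)]
print(columns)
print(nth_field(campaign, 4))  # no underscore follows the final number
```

Because the pattern ends with "_", the final number never matches, which is why it is excluded from the output.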

PostgreSQL - How do I extract the first occurrence of a substring in a string using a regular expression pattern?

I am trying to extract a substring from a text column using a regular expression, but in some cases, there are multiple instances of that substring in the string.
In those cases, I am finding that the query does not return the first occurrence of the substring. Does anyone know what I am doing wrong?
For example:
If I have this data:
create table data1
(full_text text, name text);
insert into data1 (full_text)
values ('I 56, donkey, moon, I 92')
I am using
UPDATE data1
SET name = substring(full_text from '%#"I ([0-9]{1,3})#"%' for '#')
and I want to get 'I 56' not 'I 92'

You can use regexp_matches() instead:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1];
As no additional flag is passed, regexp_matches() only returns the first match. It returns an array, though, so you need to pick the first (and only) element from the result (that's the [1] part).
It is probably a good idea to limit the update to only those rows that match the regex in the first place:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1]
where full_text ~ 'I [0-9]{1,3}';

Try the following expression. It will return the first occurrence:
SUBSTRING(full_text, 'I [0-9]{1,3}')

You can use regexp_match() in PostgreSQL 10+:
select regexp_match('I 56, donkey, moon, I 92', 'I [0-9]{1,3}');
Quote from the documentation:
In most cases regexp_matches() should be used with the g flag, since
if you only want the first match, it's easier and more efficient to
use regexp_match(). However, regexp_match() only exists in PostgreSQL
version 10 and up. When working in older versions, a common trick is
to place a regexp_matches() call in a sub-select...
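
The difference between "first match" and "last match" can be illustrated outside the database. The following is a Python sketch, not PostgreSQL: it shows that a plain regex search returns the first occurrence, while a greedy leading wildcard pushes the captured group to the last one. Whether this is exactly what happens inside substring(... from ... for ...) is an assumption.

```python
import re

text = "I 56, donkey, moon, I 92"

# A plain search scans left to right and stops at the first occurrence.
first = re.search(r"I [0-9]{1,3}", text).group(0)

# A greedy '.*' in front consumes as much as possible, so the captured
# group ends up being the last occurrence.
last = re.match(r".*(I [0-9]{1,3})", text).group(1)

print(first)
print(last)
```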

R: replacing special character in multiple columns of a data frame

I am trying to replace the German special character "ö" with "oe" in a data frame. The character occurs in multiple columns, so I would like to do this in one go without having to specify individual columns.
Here is a small example of the data frame
data <- data.frame(a=c("aö","ab","ac"),b=c("bö","bb","ab"),c=c("öc","öb","acö"))
I tried :
data[data=="ö"]<-"oe"
but this did not work since I would need to work with regular expressions here. However when I try :
data[grepl("ö",data)]<-"oe"
I do not get what I want.
The dataframe at the end should look like:
> data
a b c
1 aoe boe oec
2 ab bb oeb
3 ac ab acoe
>
The file is a CSV that I import with read.csv. However, there seems to be no option to fix this in the import statement.
How do I get the desired outcome?

Here's one way to do it:
data <- apply(data,2,function(x) gsub("ö",'oe',x))
Explanation:
Your grepl doesn't work because grepl just returns logical values (TRUE/FALSE) indicating where the regex matches; applied to a whole data frame, it yields one value per column. What the assignment then does is replace not just the character you want replaced but entire values. To replace part of a string, you need sub (if you want to replace just once in each string) or gsub (if you want all occurrences replaced). To apply that to every column, you loop over the columns using apply.

If you want to return a data frame, you can use:
data.frame(lapply(data, gsub, pattern = "ö", replacement = "oe"))
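
For comparison, the same column-by-column substitution can be sketched in Python. This is an illustration only, not R: a plain dict of lists stands in for the data frame, and the variable name cleaned is made up.

```python
# Same sample data as in the question, as a dict of columns.
data = {
    "a": ["aö", "ab", "ac"],
    "b": ["bö", "bb", "ab"],
    "c": ["öc", "öb", "acö"],
}

# Replace every occurrence of "ö" inside each value (like gsub),
# looping over all columns (like lapply).
cleaned = {
    column: [value.replace("ö", "oe") for value in values]
    for column, values in data.items()
}

for column, values in cleaned.items():
    print(column, values)
```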

We Keep Coding
