I have a DataFrame in Spark that looks something like this:
ID | Column
------ | ----
1 | STRINGOFLETTERS
2 | SOMEOTHERCHARACTERS
3 | ANOTHERSTRING
4 | EXAMPLEEXAMPLE
What I would like to do is extract the first 5 characters from the column plus the 8th character and create a new column, something like this:
ID | New Column
------ | ------
1 | STRIN_F
2 | SOMEO_E
3 | ANOTH_S
4 | EXAMP_E
I can't use the following code, because the values in the columns differ and I don't want to split on a specific character, but at the 6th character:
import pyspark.sql.functions
split_col = pyspark.sql.functions.split(DF['column'], ' ')
newDF = DF.withColumn('new_column', split_col.getItem(0))
Thanks all!
Use something like this:
from pyspark.sql.functions import concat, lit

df.withColumn('new_column', concat(df.Column.substr(1, 5),
                                   lit('_'),
                                   df.Column.substr(8, 1)))
This uses the substr and concat functions, which together solve your problem.
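For reference, here's a minimal, self-contained sketch of the same approach (the session setup and sample data are illustrative, not from the original question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'STRINGOFLETTERS'), (2, 'SOMEOTHERCHARACTERS'),
     (3, 'ANOTHERSTRING'), (4, 'EXAMPLEEXAMPLE')],
    ['ID', 'Column'])

# substr(startPos, length) is 1-based: take characters 1-5, then the single 8th character
result = df.withColumn('New Column', concat(df.Column.substr(1, 5),
                                            lit('_'),
                                            df.Column.substr(8, 1)))
result.show()  # STRIN_F, SOMEO_E, ANOTH_S, EXAMP_E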
Related
I have a table of rows with cells containing multiple strings. Like this:
K1111=V1111;K1=V1;kv13_key4=--xxxxxsomething;id5=true;impid=23123123;location=domain_co_uk
I need to extract the substring that begins with kv13_key4= and runs up to the next semicolon; the lengths all vary, and the key=value pairs are separated by semicolons (;). I tried
REGEXP_EXTRACT(customtargeting,'%in2w_key4%;') As contains_key_Value
but it didn't work. I need something like this:
| Original Cell | Extracted |
| ------ | ------ |
| key88=1811111;id89=9990string;K1=V1;23234234234tttttttt13_key4=--x;id5=true;impid=23123;url=domain_co_uk | kv13_key4=--x |
| K1111=V1111;K1=V1;kv13_key4=--xsomething;id5=true;impid=23123123;location=domain_co_uk | kv13_key4=--xsomething |
| ;id5=true;T6791=V1111;K1=V1;kv13_key4=--xxxxxsomething123;impid=23123 | kv13_key4=--xxxxxsomething123 |
Consider below
select *, regexp_extract(customtargeting, r'kv13_key4=[^;]+') as Extracted
from your_table
If applied to the sample data in your question, the output matches the Extracted column shown above.
Does this regex work:
(?<=kv13_key4=)[^;]+(?=;)
It captures everything between 'kv13_key4=' and the nearest ';'
Your REGEXP_EXTRACT would look like:
REGEXP_EXTRACT(customtargeting,r'(?<=kv13_key4=)[^;]+(?=;)')
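Either pattern can be sanity-checked outside SQL with Python's re module; note that lookbehind support varies by SQL engine, so test in your own environment. A minimal sketch using a sample cell from the question:

import re

cell = 'K1111=V1111;K1=V1;kv13_key4=--xsomething;id5=true;impid=23123123;location=domain_co_uk'

# Simple form: match the key itself plus everything up to the next ';'
print(re.search(r'kv13_key4=[^;]+', cell).group())            # kv13_key4=--xsomething

# Lookaround form: capture only the value between 'kv13_key4=' and the nearest ';'
print(re.search(r'(?<=kv13_key4=)[^;]+(?=;)', cell).group())  # --xsomething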
I need to get a value in a single cell that looks at a two-row, multi-column array and finds the values that are the same between the two rows.
I'm pretty sure an INDEX / MATCH function should do the job, but I haven't been able to find a combination that adequately achieves the result.
A working example can be summarised as such:
The array looks like this:
ColA | ColB | ColC | ColD | ColE | ColF
Row1 | Dogs | Cats | Mice | Frog | Goat
Row2 | Mice | Frog
The function needs to look at all the values in row 1, compare them to all the values in row 2, find the matching ones, and output them (with a delimiter) in another cell.
The desired output is "Mice-Frog"
=ARRAYFORMULA(TEXTJOIN("-", 1,
IFERROR(REGEXEXTRACT(1:1, TEXTJOIN("|", 1, 2:2)))))
or, as already mentioned:
=ARRAYFORMULA(JOIN("-", HLOOKUP(INDIRECT("A2:"&ADDRESS(2, COUNTA(2:2))), 1:1, 1, 0)))
Perhaps try this:
=join("-",ARRAYFORMULA(hlookup(A2:B2,A1:E1,1,0)))
I need to extract the domain name from a URL using Redshift PostgreSQL. Example: extract 'google.com' from 'www.google.com'. Each URL in my dataset has a different top-level domain (TLD). My approach was to first join the matching TLD to the dataset, then use a regex to extract 'first_string.TLD'. In Redshift, I'm getting the error 'The pattern must be a valid UTF-8 literal character expression'. Is there a way around this?
A sample of my dataset:
+---+------------------------+--------------+
| id| trimmed_domain | tld |
+---+------------------------+--------------+
| 1 | sample.co.uk | co.uk |
| 2 | www.sample.co.uk | co.uk |
| 3 | www3.sample.co.uk | co.uk |
| 4 | biz.sample.co.uk | co.uk |
| 5 | digital.testing.sam.co | co |
| 6 | sam.co | co |
| 7 | www.google.com | com |
| 8 | 1.11.220 | |
+---+------------------------+--------------+
My code:
SELECT t1.extracted_domain, COUNT(DISTINCT(t1.id))
FROM(
SELECT
d.id,
d.trimmed_domain,
CASE
WHEN d.tld IS null THEN d.trimmed_domain ELSE
regexp_replace(d.trimmed_domain,'(.*\.)((.[a-z]*).*'||replace(tld,'.','\.')||')','\2')
END AS "extracted_domain"
FROM dataset d
)t1
GROUP BY 1
ORDER BY 2;
Expected output:
+------------------------+--------------+
| extracted_domain | count |
+------------------------+--------------+
| sample.co.uk | 4 |
| sam.co | 2 |
| google.com | 1 |
| 1.11.220 | 1 |
+------------------------+--------------+
I'm not so sure about the query. However, you can use this tool to design any expression you wish and modify your query accordingly.
My guess is that maybe this would help:
^(?!d|b|www3).*
You can list any prefix you wish to exclude using the alternation (?!d|b|www3).
You may want to add your desired URLs to an expression similar to:
^(sam|www.google|1.11|www.sample|www3.sample).*
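As a quick illustration of the exclusion idea in Python (the prefix list is only an example, not a complete rule for your data):

import re

domains = ['www.sample.co.uk', 'biz.sample.co.uk',
           'www3.sample.co.uk', 'digital.testing.sam.co', 'sam.co']

# Negative lookahead: keep only strings that do not start with d, b, or www3
kept = [d for d in domains if re.match(r'^(?!d|b|www3).*', d)]
print(kept)  # ['www.sample.co.uk', 'sam.co']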
So, I've found a solution. Redshift does not support column-based regex patterns, so the alternative is to use a Python UDF:
1. Convert the tld column into a regex pattern.
2. Go row by row and extract the domain name using that regex pattern.
3. Group by the extracted_domain and count the users.
The SQL query is as below:
CREATE OR REPLACE FUNCTION extractor(col_domain varchar)
RETURNS varchar
IMMUTABLE AS $$
    import re
    # Build a pattern that captures "<label>.<tld>", with the TLD's dots escaped
    return r'([^/.]+\.({}))'.format(re.escape(col_domain))
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION regex_match(in_pattern varchar, input_str varchar)
RETURNS varchar
IMMUTABLE AS $$
    import re
    # Fall back to the raw input when there is no pattern or no match
    if not in_pattern:
        return str(input_str)
    match = re.search(in_pattern, input_str)
    return str(match.group()) if match else str(input_str)
$$ LANGUAGE plpythonu;
SELECT
t2.extracted_domain,
COUNT(DISTINCT(t2.id)) AS "Unique Users"
FROM(
SELECT
t1.id,
t1.trimmed_domain,
regex_match(t1.regex_pattern, t1.trimmed_domain) AS "extracted_domain"
FROM(
SELECT
id,
trimmed_domain,
CASE WHEN tld is null THEN '' ELSE extractor(tld) END AS "regex_pattern"
FROM dataset
)t1
)t2
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
Python UDFs seem to be slow on large datasets, so I'm open to suggestions for improving the query.
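For what it's worth, the UDF logic can be sanity-checked locally in plain Python before creating the functions in Redshift (a sketch mirroring the two UDF bodies above):

import re

def extractor(col_domain):
    # Mirror of the Redshift UDF: build a pattern for "<label>.<tld>"
    return r'([^/.]+\.({}))'.format(re.escape(col_domain))

def regex_match(in_pattern, input_str):
    # Mirror of the Redshift UDF: fall back to the raw input on no pattern or no match
    if not in_pattern:
        return str(input_str)
    match = re.search(in_pattern, input_str)
    return str(match.group()) if match else str(input_str)

pattern = extractor('co.uk')
print(regex_match(pattern, 'www.sample.co.uk'))  # sample.co.uk
print(regex_match('', '1.11.220'))               # 1.11.220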
If you know the prefixes you would like to remove from the domains, then why not just exclude them? The following query simply removes the known www/http/etc. prefixes from domain names and counts the normalized domain names.
SELECT COUNT(*) FROM
(SELECT REGEXP_REPLACE(domain, '^(https|http|www|biz)') AS normalized_domain FROM domains) t
GROUP BY normalized_domain;
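The same normalization can be sketched in Python for quick experimentation (the '\.?' here also drops the separator dot after the prefix, which the SQL above leaves in place):

import re

domains = ['www.google.com', 'biz.sample.co.uk', 'sample.co.uk']

# Strip a known prefix, plus its trailing dot, from the start of each domain
normalized = [re.sub(r'^(https|http|www|biz)\.?', '', d) for d in domains]
print(normalized)  # ['google.com', 'sample.co.uk', 'sample.co.uk']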
I need to execute a metadata query that will dump a list of tables into a file. However, I need a way to eliminate all formatting besides the tableId itself. Can this be done with a regex? I appreciate any help in advance.
+-------------------------------------+-------+
| tableId | Type |
+-------------------------------------+-------+
| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |
You have some options, but I would suggest something like this:
^\| (\S+)
It will match from the start of the line: a pipe, a space, and then a run of non-spaces. The non-spaces will be your tableId. Here is a small example in Python:
import re

my_string = '''| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |'''

my_list = my_string.split('\n')
for line in my_list:
    match = re.search(r"^\| (\S+)", line)
    print(match.group(1))
This will give you:
t_margins
t_rev_test
t_rev_share
The following regexp captures just the values of the first column:
^\| (\w+)
https://regex101.com/r/gODhra/3
I am doing the following in Robot Framework (RFW):
STEP 1: I need to match the "NUM_FLOWS" value in the following command output.
STEP 2: If it is zero (0), the test case should FAIL. If it is non-zero, the test case should PASS.
Sample command output:
router-7F2C13#show app stats gmail on TEST/switch1234-15E8CC
--------------------------------------------------------------------------------
APPLICATION BYTES_IN BYTES_OUT NUM_FLOWS
--------------------------------------------------------------------------------
gmail 0 0 4
--------------------------------------------------------------------------------
router-7F2C13#
How can I do this with the "Should Match Regexp" and "Should Match" keywords? How do I check only that number sub-pattern? (Example: in the command output above, NUM_FLOWS is non-zero, so the test case should PASS.)
Please help me achieve this.
Thanks in advance.
My new robot file content:
| | Write | show dpi app stats BitTorrent_encrypted on AVC/ap7532-15E8CC
| | ${raw_text}= | Read Until Regexp | .*#
| | ${data}= | parse output | ${raw_text}
| | Should not be equal as integers | ${data[0].num_flows} | 0
| | ... | Expected num_flows to be non-zero but it was zero | values=False
There are many ways to solve this. A simple way is to use robot's regular expression keywords to look for "gmail" at the start of a line, then two numbers, and then a 0 (zero) followed by the end of the line. This assumes that a) NUM_FLOWS is always the last column, and b) there is only one line that begins with "gmail". I don't know if those are valid assumptions or not.
Because the data spans multiple lines, the pattern includes (?m) (the multiline flag) so that $ means "end of line" in addition to "end of string".
| | Should not match regexp | ${data} | (?m)\\s+gmail\\s+\\d+\\s+\\d+\\s+0\\s*$
| | ... | Expected non-zero value in the fourth column for gmail, but it was zero.
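The pattern itself is easy to verify in plain Python (backslashes are doubled in the robot file but single here; the sample output is abbreviated):

import re

data = '''APPLICATION   BYTES_IN   BYTES_OUT   NUM_FLOWS
gmail         0          0           4'''

# Matches only when the NUM_FLOWS column for gmail is exactly 0
pattern = r'(?m)\s+gmail\s+\d+\s+\d+\s+0\s*$'
print(bool(re.search(pattern, data)))  # False, so "Should Not Match Regexp" passes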
There are plenty of other ways to solve the problem. For example, if you need to check for other values in other columns, you might want to write a python keyword that parses the data and returns some sort of data structure.
Here's a quick example. It's not bulletproof and makes some assumptions about the data passed in. I wouldn't use it in production, but it illustrates the technique. The keyword returns a list of items, and each item is a custom object with four attributes: name, bytes_in, bytes_out, and num_flows:
# python library
import re

def parse_output(data):
    class Data(object):
        def __init__(self, raw_text):
            # split the row on runs of whitespace into its four columns
            columns = re.split(r'\s+', raw_text.strip())
            self.name = columns[0]
            self.bytes_in = int(columns[1])
            self.bytes_out = int(columns[2])
            self.num_flows = int(columns[3])

    lines = data.split("\n")
    result = []
    # skip the header lines at the top and the separator/prompt lines at the bottom
    for line in lines[4:-3]:
        result.append(Data(line))
    return result
Using it in a test:
*** Test Cases ***
| | # <put your code here to get the data from the >
| | # <router and store it in ${raw_text} >
| | ${raw_text}= | ...
| | ${data}= | parse output | ${raw_text}
| | Should not be equal as integers | ${data[0].num_flows} | 0
| | ... | Expected num_flows to be non-zero but it was zero | values=False