Hive Split function to split a variable on \n - regex

Task: I want to split a column called "website" in a Hive table to get all the websites, which are delimited by either a space or \n.
Issue: When I use either of the following queries:
SELECT website,split(website, '[\\s]') as websites FROM temp_pages
SELECT website,split(website, '[\\s, \\n]') as websites FROM temp_pages
I am unable to achieve the desired results.
Here are the results that I get
Expected Output - delimited on space
Input: http://www.insync4all.com http://www.insync4all.nl
Output: ["http://www.insync4all.com","http://www.insync4all.nl"]
Unexpected output - delimited on \n.
When the value contains \n, the websites are not split on it; instead, the output shows \\n:
Input: www.imtherealthing.com\nwww.childmodelmagazine.com
Output: ["www.imtherealthing.com\\nwww.childmodelmagazine.com"]
Can someone help me split the website field on \n? It would also be good to understand what is going wrong in the \n case.
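One possible explanation, offered as an assumption rather than a confirmed diagnosis: the doubled backslash in the output hints that the field holds the two literal characters \ and n, not a real newline, and \s only matches real whitespace. The Python sketch below illustrates the difference; the names split_sites, whitespace_only and whitespace_or_literal are made up for the example. In Hive the same idea would probably need an extra level of escaping (something like '\\s+|\\\\n'), which is untested here.

```python
import re

# Assumption: the stored value contains the two characters '\' and 'n',
# not an actual newline (the doubled backslash in the output hints at this).
literal = r"www.imtherealthing.com\nwww.childmodelmagazine.com"
real_newline = "www.imtherealthing.com\nwww.childmodelmagazine.com"

# \s matches real whitespace only, so the literal form stays in one piece.
whitespace_only = re.compile(r"\s+")

# Adding an alternative for backslash followed by 'n' handles both forms.
whitespace_or_literal = re.compile(r"\s+|\\n")


def split_sites(text, pattern=whitespace_or_literal):
    """Split on the given pattern and drop empty pieces."""
    return [part for part in pattern.split(text) if part]
```

Calling split_sites on either sample string returns the two site names separately, while whitespace_only.split(literal) returns the input unchanged.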

BigQuery remove <0x00> hidden characters from a column

I have a table with unwanted hidden characters such as my_table:
id | fruits
1  | STuff1 stuff_2 ����������������������
2  | Blahblah-blahblah �������������
3  | nothing
How do I remove ���������������������� when selecting this column?
Current query:
SELECT fruits, TRIM(REGEXP_REPLACE(fruits, r'[^a-zA-Z,0-9,-]', ' ')) AS new_fruits
FROM `project-id.MYDATASET.my_table`
This query is too broad: I'm worried it could accidentally exclude or replace important data. I only want to target these specific weird characters.
Upon opening the data as CSV, the weird characters show as <0x00>. How do I solve this?
First you have to identify which character this is: because it is non-printable, the symbol you see is just a placeholder representation. To replace it without removing any other important information, do the following:
Identify the hexadecimal code of the character. Copy it from the CSV and paste it into this site:
Use the replace function in BigQuery to replace the character with that hex code, as follows:
SELECT trim(replace(string_field_1,chr(0xfffd)," ")) FROM `<project>.<dataset>.<table>`;
If your character's code is different from fffd, put your value in the chr() function.
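If no lookup site is at hand, the same two steps (find the code point, then replace only that character) can be sketched in Python. This is only an illustration: code_points and strip_char are hypothetical helper names, and the sample string is invented to resemble the rows in the question.

```python
def code_points(text):
    """Return the hex code points of all characters outside printable ASCII."""
    return sorted({hex(ord(ch)) for ch in text if ord(ch) < 0x20 or ord(ch) > 0x7E})


def strip_char(text, code_point=0xFFFD):
    """Replace one specific code point with a space, then trim the ends."""
    return text.replace(chr(code_point), " ").strip()


# Invented sample: U+FFFD is the replacement character shown in the question.
sample = "STuff1 stuff_2 \ufffd\ufffd\ufffd"
```

Here code_points(sample) reports which code point to target, and strip_char(sample) removes only that character, leaving letters, digits and punctuation untouched. If the character turns out to be a NUL byte, strip_char(text, 0x00) does the same for it.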

Extract a text string with regex

I have a large set of data I need to clean with open refine.
I am quite bad with regex and I can't think of a way to get what I want,
which is extracting a text string between quotes that includes lots of special characters like " ' / \ # # -
In each cell, it has the same format
caption': u'text I want to extract', u'likes':
Any help would be highly appreciated!
If you want to extract a text string that contains lots of special characters and sits between single quotes ' ', you can do it in general this way:
\'[\S\s]*?\'
In your case, if you want to extract only the middle quoted string from this: caption': u'text I want to extract', u'likes': , try this regex:
(?<=u\')[\V]*?(?=\'\,)
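To see how the lookbehind and lookahead behave, here is a small Python sketch of the same idea. It is an approximation, not the exact pattern above: Python's re module has no \V, so a lazy .*? (which also stops at newlines by default) stands in for it, and the variable names are made up for the example.

```python
import re

# The cell format quoted in the question.
cell = "caption': u'text I want to extract', u'likes':"

# Lookbehind for u' and lookahead for ', so neither delimiter is captured.
between_quotes = re.compile(r"(?<=u').*?(?=',)")

match = between_quotes.search(cell)
caption = match.group(0) if match else None
```

Because the match is lazy and ends at the first ', sequence, quotes and other special characters inside the caption are kept as long as they are not immediately followed by a comma.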
We designed OpenRefine with a few smart functions to handle common cases such as yours without using Regex.
Two other cool ways to handle this in OpenRefine.
Using the drop-down menu:
Edit column > Split into several columns > by separator, with ' as the separator
Using smartSplit
(string s, optional string sep)
returns: array
Returns the array of strings obtained by splitting s with separator sep. Handles quotes properly. Guesses tab or comma separator if "sep" is not given.
value.smartSplit("'")[2]

Load file in pig based on whitespace

I am trying to load a file in Pig in which two words may be separated by spaces or tabs (possibly more than one). Is there a way to delimit the file load using a regex for whitespace? Or is there any other way to achieve the below?
Input:
COUNTESS This young gentlewoman had a father,--O, that
Output:
COUNTESS
This
young
gentlewoman
had
a
father,--O,
that
It would be great to have a comma delimiter also, but that would make it more complex. For now, only the whitespace delimiter should work for me.
Load the file as a line and then use TOKENIZE. If you have a mixture of tabs and spaces, then after loading the data add a step to replace the tabs with spaces in the line, and then use TOKENIZE.
-- Load each line of the file as a single chararray field.
A = LOAD 'test2.txt' AS (line:chararray);
-- TOKENIZE splits the line into words; FLATTEN emits one token per row.
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line));
C = FOREACH B GENERATE TOBAG(*);
DUMP C;
I don't really know PIG, but here's some info:
https://pig.apache.org/docs/r0.9.1/func.html#strsplit
STRSPLIT(string, regex, limit)
The regex could be something like [\s,]+. That will split on any run of whitespace and commas. So, for instance, a b,c ,d, e would be split into the individual letters. The order of spaces and commas does not matter.
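The behaviour of that character class is easy to check outside Pig. The Python sketch below applies the same [\s,]+ pattern; split_words and SEPARATORS are illustrative names, and this says nothing about how STRSPLIT itself treats leading or trailing separators.

```python
import re

# One or more whitespace characters or commas, in any order.
SEPARATORS = re.compile(r"[\s,]+")


def split_words(line):
    """Split on runs of whitespace/commas and drop empty pieces."""
    return [token for token in SEPARATORS.split(line) if token]
```

For example, split_words("a b,c ,d, e") yields the five letters, and a line mixing tabs and multiple spaces is split into its words the same way.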

Is there an efficient way to scrape substrings from column values in Postgres?

I have a column called user_response, on which I want to do a variety of operations, like taking out words contained in quotes and taking out the part of the string after a colon (:).
One such operation is this:
Let's say for a record
user_response = "My company: 'XYZ Co.' has allowed to use:: the following \n \n kind of product: RealMadridTShirts"
Now, I want to scrape the part of the string after last colon(:). Hence, my output should be RealMadridTShirts
I could achieve this somehow with the following hack:
SELECT reverse(split_part(reverse(user_response), ' :', 1))
However, this is grossly inefficient, especially when I have to do this over 500,000 rows. It's not an operation that I will be doing throughout the day. This operation is for a once-a-day load, but even then the load is becoming very expensive.
Coming from Oracle, I know I could have used the INSTR and SUBSTR functions to achieve it in a more elegant fashion (without having to reverse the string and all).
Also, what if I had to scrape the text after the second last colon?
Find the string after the last colon, right?
My company: 'XYZ Co.' has allowed to use:: the following \n \n kind of product: RealMadridTShirts
It's trivial with a regular expression:
regress=> SELECT (regexp_matches(
'My company: ''XYZ Co.'' has allowed to use:: the following \n \n kind of product: RealMadridTShirts',
'.*:(.*?)$')
)[1];
regexp_matches
--------------------
RealMadridTShirts
(1 row)
The apparent lack of a function to request the position of a string counting from a particular starting point makes it harder to do without using a regexp, but as a regexp is sure to be the fastest way to solve this I doubt that's an issue.
Your bigger problem is likely to be that you're scanning so much data. That's never going to be fast.
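As for the follow-up about the second-to-last colon, the logic can be sketched in Python by counting colons from the right. This only illustrates the idea and is not a Postgres solution; after_nth_last_colon and regex_tail are hypothetical names, and the sample is the string from the question, with \n kept as two literal characters.

```python
import re

response = (
    r"My company: 'XYZ Co.' has allowed to use:: the following \n \n "
    r"kind of product: RealMadridTShirts"
)


def after_nth_last_colon(text, n=1):
    """Return the segment that follows the n-th colon counted from the end.

    The segment runs up to the next colon (or the end of the string).
    Returns None when the text has fewer than n colons.
    """
    parts = text.rsplit(":", n)
    if len(parts) <= n:
        return None
    return parts[1].strip()


def regex_tail(text):
    """Same result as n=1, using a greedy prefix as in the regex answer."""
    match = re.search(r".*:(.*)$", text)
    return match.group(1).strip() if match else None
```

With n=1 this gives the text after the last colon; with n=2 it gives the segment between the second-to-last and the last colon.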

Is it possible to tokenize text in PL/PGSQL using regular expressions?

I want to tokenize text in my database with RegEx and store the resulting tokens in a table. First I want to split the words by spaces, and then each token by punctuation.
I'm doing this in my application, but executing it in the database might speed it up.
Is it possible to do this?
There are a number of functions for tasks like that.
To retrieve the 2nd word of a text:
SELECT split_part('split this up', ' ', 2);
Split the whole text and return one word per row:
SELECT regexp_split_to_table('split this up', E'\\s+');
Actually, the last example splits on any stretch of whitespace.
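The two-stage tokenization described in the question (whitespace first, then punctuation) can be prototyped outside the database before porting it. The Python sketch below is one assumed interpretation, treating any non-word character as punctuation; tokenize_text is a made-up name.

```python
import re


def tokenize_text(text):
    """Split on whitespace, then split each word on non-word characters."""
    tokens = []
    for word in re.split(r"\s+", text.strip()):
        # Empty strings appear when a word starts or ends with punctuation.
        tokens.extend(piece for piece in re.split(r"[^\w]+", word) if piece)
    return tokens
```

Note that an apostrophe counts as punctuation under this rule, so "It's" becomes two tokens; whether that is desirable depends on the data.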