I'm not sure how to do this in a dataframe context
I have the table below with text information:
TEXT |
-------------------------------------------|
"Get some new #turbo #stacks today!" |
"Is it one or three? #phone" |
"Mayhaps it be three afterall..." |
"So many new issues with phone... #iphone" |
I want to reduce each row so that only the words starting with a '#' symbol are kept, as in the result below.
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"" |
"#iphone" |
I'd also like to know whether it's possible to eliminate the empty rows, either by checking for NaN or with some other kind of condition, to get this result:
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"#iphone" |
I'm using Python 2.7 and pandas for this.
You could try using regex and extractall:
df.TEXT.str.extractall(r'(#\w+)').groupby(level=0)[0].apply(' '.join)
Output:
0 #turbo #stacks
1 #phone
3 #iphone
Name: 0, dtype: object
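If you also want to drop the rows that end up empty (your second table), one option is to write the extracted hashtags back and drop the missing rows. A rough sketch with the same toy data, since rows with no match simply disappear from extractall's result:
import pandas as pd

df = pd.DataFrame({'TEXT': ["Get some new #turbo #stacks today!",
                            "Is it one or three? #phone",
                            "Mayhaps it be three afterall...",
                            "So many new issues with phone... #iphone"]})

# rows with no '#' word produce no matches, so they are absent after the groupby
tags = df.TEXT.str.extractall(r'(#\w+)').groupby(level=0)[0].apply(' '.join)

df['TEXT'] = tags                # unmatched rows become NaN on index alignment
df = df.dropna(subset=['TEXT'])  # drop them to get the second desired result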
I was trying to get some insight into regexp_extract in PySpark, and I ran a small check to build a better understanding.
Below is my dataframe
data = [('2345', 'Checked|by John|for kamal'),
('2398', 'Checked|by John|for kamal '),
('2328', 'Verified|by Srinivas|for kamal than some random text'),
('3983', 'Verified|for Stacy|by John')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+----------------------------------------------------+
|ID  |Notes                                               |
+----+----------------------------------------------------+
|2345|Checked|by John|for kamal                           |
|2398|Checked|by John|for kamal                           |
|2328|Verified|by Srinivas|for kamal than some random text|
|3983|Verified|for Stacy|by John                          |
+----+----------------------------------------------------+
So here I was trying to identify whether an ID is checked or verified by John
With the help of SO members I was able to work out how to use regexp_extract and came to the solution below:
result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))
result.show()
+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |        |
+----+----------------------------------------------------+--------+
For a few IDs this gives me the perfect result, but for the last ID it didn't print Verified. Could someone please let me know whether anything else needs to change in the regular expression?
What I suspect is that (Checked|Verified)(\\|by John) matches only adjacent values. I tried * and $, but it still didn't print Verified for ID 3983.
I would have phrased the regex as:
(Checked|Verified)\b.*\bby John
This pattern finds Checked/Verified followed by "by John", with the two separated by any amount of text. Note that I use word boundaries here instead of pipes.
Updated code:
result = df.withColumn('Employee', regexp_extract(col('Notes'), '\\b(Checked|Verified)\\b.*\\bby John', 1))
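As a quick side note, the same pattern can be sanity-checked in plain Python before going through Spark (just an illustrative sketch, not part of the Spark job):
import re

pattern = re.compile(r'\b(Checked|Verified)\b.*\bby John')
for note in ['Checked|by John|for kamal', 'Verified|for Stacy|by John']:
    m = pattern.search(note)
    print(m.group(1) if m else '')  # prints Checked, then Verified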
You can try this regex:
import pyspark.sql.functions as F
result = df.withColumn('Employee', F.regexp_extract('Notes', '(Checked|Verified)\\|.*by John', 1))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...| |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+
Another way is to check whether the column Notes contains the string 'by John':
df.withColumn('Employee',
              F.when(F.col('Notes').like('%Checked|by John%'), 'Checked')
               .when(F.col('Notes').like('%by John'), 'Verified')
               .otherwise(" ")).show(truncate=False)
+----+----------------------------------------------------+--------+
|ID |Notes |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal |Checked |
|2398|Checked|by John|for kamal |Checked |
|2328|Verified|by Srinivas|for kamal than some random text| |
|3983|Verified|for Stacy|by John |Verified|
+----+----------------------------------------------------+--------+
I have a column in Google Sheets where each cell contains pre-defined logic. For example, something like the second column in this table:
| 1 | =A1*-1 |
| 2 | =B2*-1 |
| -3 | =C2*-1 |
Let's say later I want to add the same logic to each cell in column B. For example, make it such that it looks like:
| 1 | =MAX(A1*-1,0) |
| 2 | =MAX(B2*-1,0) |
| -3 | =MAX(C2*-1,0) |
What is the fastest way to do this, besides manually typing MAX(...,0) in each cell? Normal Sheets functions act on the value of the cell, not the logic, so I'm a bit lost.
To my knowledge there isn't a function that pipes in the logic from one cell to another ...
try:
=ARRAYFORMULA(IF(A1:A="",,IF(SIGN(A1:A)<0, A1:A*-1, 0)))
=ARRAYFORMULA(IF(A1:A="",,IF(SIGN(A1:A)>0, A1:A, 0)))
I figured out how to read files into my pyspark shell (and script) from an S3 directory, e.g. by using:
rdd = sc.wholeTextFiles('s3n://bucketname/dir/*')
But, while that's great in letting me read all the files in ONE directory, I want to read every single file from all of the directories.
I don't want to flatten them or load everything at once, because I will have memory issues.
Instead, I need it to automatically go load all the files from each sub-directory in a batched manner. Is that possible?
Here's my directory structure:
S3_bucket_name -> year (2016 or 2017) -> month (max 12 folders) -> day (max 31 folders) -> sub-day folders (max 30; each day's collection is simply partitioned into these).
Something like this, except it'll go for all 12 months and up to 31 days...
BucketName
|
|
|---Year(2016)
| |
| |---Month(11)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(12)
|
|---Year(2017)
| |
| |---Month(1)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(2)
Each level above represents a fork: I've been collecting data for two years, so there are two folders at the "year" level. Each year then has up to 12 months, each month up to 31 possible day folders, and each day up to 30 sub-folders, simply because I split the data up that way...
I hope that makes sense...
I was looking at another post (read files recursively from sub directories with spark from s3 or local filesystem) where I believe they suggested using wildcards, so something like:
rdd = sc.wholeTextFiles('s3n://bucketname/*/data/*/*')
But the problem with that is it assumes a common folder name among the various subdirectories; in my case there are no such guarantees and I just need everything.
However, following that line of reasoning, I thought: what if I did...
rdd = sc.wholeTextFiles("s3n://bucketname/*/*/*/*/*')
But the issue is that now I get OutOfMemory errors, probably because it's loading everything at once and freaking out.
Ideally, what I would be able to do is this:
Go to the sub-directory level of the day and read those in, so e.g.
First read in 2016/12/01, then 2016/12/02, up until 2016/12/31, and then 2017/01/01, then 2017/01/02, ... 2017/01/31 and so on.
That way, instead of using five wildcards (*) as I did above, it would somehow know to look through each sub-directory at the level of "day".
I thought of using a python dictionary to specify the file path to each of the days, but that seems like a rather cumbersome approach. What I mean by that is as follows:
file_dict = {
0:'2016/12/01/*/*',
1:'2016/12/02/*/*',
...
30:'2016/12/31/*/*',
}
basically for all the folders, and then iterating through them and loading them in using something like this:
sc.wholeTextFiles('s3n://bucketname/' + file_dict[i])
But I don't want to manually type out all those paths. I hope this made sense...
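I suppose I could also generate the day-level patterns instead of typing them all out, something like the untested sketch below (this assumes zero-padded month/day folder names; days with no data would also need to be caught or skipped):
from datetime import date, timedelta

day, end = date(2016, 11, 1), date(2017, 2, 28)
while day <= end:
    path = 's3n://bucketname/{:%Y/%m/%d}/*/*'.format(day)
    rdd = sc.wholeTextFiles(path)  # one day's batch at a time
    # ... process this day's rdd here before moving on ...
    day += timedelta(days=1)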
EDIT:
Another way of asking the question is, how do I read the files from a nested sub-directory structure in a batched way? How can I enumerate all the possible folder names in my s3 bucket in python? Maybe that would help...
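For instance, I imagine something like boto3 could list the prefixes that actually exist, and then I could batch over them. A rough, untested sketch (bucket name is a placeholder; credentials assumed to be configured):
import boto3

s3 = boto3.client('s3')
day_prefixes = set()
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='bucketname'):
    for obj in page.get('Contents', []):
        parts = obj['Key'].split('/')
        if len(parts) > 3:
            day_prefixes.add('/'.join(parts[:3]))  # year/month/day

for prefix in sorted(day_prefixes):
    rdd = sc.wholeTextFiles('s3n://bucketname/' + prefix + '/*/*')
    # ... process one day's worth of files, then move on ...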
EDIT2:
The structure of the data in each of my files is as follows:
{json object 1},
{json object 2},
{json object 3},
...
{json object n},
For it to be "true json", it either just needed to be like the above without a trailing comma at the end, or something like this (note square brackets, and lack of the final trailing comma:
[
{json object 1},
{json object 2},
{json object 3},
...
{json object n}
]
The reason I did this entirely in PySpark, as a script I submit, is that I forced myself to handle this formatting quirk manually. If I used Hive/Athena, I'm not sure how I would deal with it.
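For context, a minimal sketch of that kind of manual handling (assuming each JSON object sits on its own line, as in the layout above) might look like:
import json

raw = sc.wholeTextFiles('s3n://bucketname/2016/12/01/*/*')      # (path, content) pairs
records = (raw.flatMap(lambda kv: kv[1].splitlines())
              .map(lambda line: line.strip().rstrip(','))       # drop the trailing comma
              .filter(lambda line: line not in ('', '[', ']'))  # skip blanks and brackets
              .map(json.loads))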
Why don't you use Hive, or even better, Athena? Both will deploy tables on top of the file system and give you access to all the data. Then you can capture this into Spark.
Alternatively, I believe you can also use HiveQL in Spark to set up a tempTable on top of your file system location, and it will register it all as a Hive table which you can execute SQL against. It's been a while since I've done that, but it is definitely doable.
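A rough sketch of that route (assuming the files can be read as JSON once the formatting quirk above is handled, and using the HiveContext/registerTempTable API of that Spark era; the path is a placeholder):
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
df = sqlContext.read.json('s3n://bucketname/2016/12/01/*/*')  # one day as a DataFrame
df.registerTempTable('events')
sample = sqlContext.sql('SELECT * FROM events LIMIT 10')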
Browsing through examples all over the web, I can see that people generate the path using something like "parent_id.node_id". Examples:
uid | name | tree_id
--------------------
1 | Ali | 1.
2 | Abu | 2.
3 | Ita | 1.3.
4 | Ira | 1.3.
5 | Yui | 1.3.4
But as explained in this question - Sorting tree with a materialized path? - adding zero padding to the tree_id makes it easy to sort by creation order.
uid | name | tree_id
--------------------
1 | Ali | 0001.
2 | Abu | 0002.
3 | Ita | 0001.0003.
4 | Ira | 0001.0003.
5 | Yui | 0001.0003.0004
Using a fixed-length string like this also makes it easy to calculate the level: length(tree_id)/5. What I'm worried about is that it would limit me to a maximum of 9999 users overall rather than 9999 per branch. Am I right here?
9999 | Tar | 0001.9999
10000 | Tor | 0001.??
You are correct -- zero-padding each node ID would allow you to sort the entire tree quite simply. However, you have to make the padding width match the upper limit of digits of the ID field, as you have pointed out in your last example. E.g., if you're using an int unsigned field for your ID, the highest value would be 4,294,967,295. This is ten digits, meaning that the record set from your last example might look like:
uid | name | tree_id
9999 | Tar | 0000000001.0000009999
10000 | Tor | 0000000001.0000010000
As long as you know you're not going to need to change your ID field to bigint unsigned in the future, this will continue to work, though it might be a bit data-hungry depending on how huge your tables get. You could shave off two bytes per node ID by storing the values in hexadecimal, which would still be sorted correctly in a string sort:
uid | name | tree_id
9999 | Tar | 00000001.0000270F
10000 | Tor | 00000001.00002710
I can imagine this would make things a real headache when trying to update the paths (pruning nodes, etc) though.
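Just to illustrate the hex trick (a quick sketch in plain Python, outside the database): zero-padded hex strings sort in the same order as the underlying numbers, which is what keeps the string sort correct:
ids = [255, 9999, 10000, 4096]
padded_hex = ['{:08X}'.format(i) for i in ids]  # e.g. 9999 -> '0000270F'
assert sorted(padded_hex) == ['{:08X}'.format(i) for i in sorted(ids)]
print('{:08X} {:08X}'.format(9999, 10000))      # 0000270F 00002710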
You can also create extra fields for sorting, e.g.:
uid | name | tree_id | name_sort
9999 | Tar | 00000001.0000270F | Ali.Tar
10000 | Tor | 00000001.00002710 | Ali.Tor
There are limitations, however, as laid out by this guy's answer to a similar materialized path sorting question. The name field would have to be padded to a set length (fortunately, in your example, each name seems to be three characters long), and it would take up a lot of space.
In conclusion, given the above issues, I've found that the most versatile way to do sorting like this is to simply do it in your application logic -- say, using a recursive function that builds a nested array, sorting the children of each node as it goes.
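For illustration only, a minimal sketch of that approach (the parent_id column and sort key here are hypothetical, not taken from your schema):
def build_tree(rows, parent=None):
    # pick this node's children, sort them, and recurse into each one
    children = sorted((r for r in rows if r['parent_id'] == parent),
                      key=lambda r: r['name'])
    return [{'node': r, 'children': build_tree(rows, r['uid'])} for r in children]

rows = [
    {'uid': 1, 'parent_id': None, 'name': 'Ali'},
    {'uid': 3, 'parent_id': 1,    'name': 'Ita'},
    {'uid': 4, 'parent_id': 1,    'name': 'Ira'},
    {'uid': 5, 'parent_id': 4,    'name': 'Yui'},
]
tree = build_tree(rows)  # nested list of {'node': ..., 'children': [...]} dicts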
I'm trying to capture the id of an element that will be randomly generated. I can successfully capture the value of my element id like this...
| storeAttribute | //div[1]@id | variableName |
Now my variable will be something like...
divElement-12345
I want to remove 'divElement-' so that the variable I am left with is '12345' so that I can use it later to select the 'form-12345' element associated with it...something like this:
| type | //tr[@id='form-${variableName}']/td/form/fieldset/p[1]/input | Type this |
How might I be able to accomplish this?
You have two options in Selenium: XPath and CSS selectors. I have read that CSS selectors are better for doing tests in both Firefox and IE.
Using the latest version of Selenium IDE (3/5/2009) I've had success with using storeEval which evaluates Javascript expressions, giving you access to javascript string functions.
XPath:
storeAttribute | //div[1]@id | divID
storeEval | '${divID}'.replace("divElement-", "") | number
type | //tr[@id='form-${number}']/td/form/fieldset/p[1]/input | Type this
CSS Selector:
storeAttribute | css=div[1]@id | divID
storeEval | '${divID}'.replace("divElement-", "") | number
type | css=tr[id='form-${number}'] > td > form > fieldset > p[1] > input | Type this
There are many functions in XPath which should solve your problem. Assuming "divElement-" is a constant that will not change and that you are using XPath 2.0, I would suggest:
substring-after(div[1]/@id, "divElement-")