Equivalent of SQL LIKE operator in R - regex

In an R script, I have a function that creates a data frame of files in a directory that have a specific extension.
The dataframe is always two columns with however many rows as there are files found with that specific extension.
The data frame ends up looking something like this:
| Path | Filename |
|:------------------------:|:-----------:|
| C:/Path/to/the/file1.ext | file1.ext |
| C:/Path/to/the/file2.ext | file2.ext |
| C:/Path/to/the/file3.ext | file3.ext |
| C:/Path/to/the/file4.ext | file4.ext |
Forgive the archaeic way that I express this question. I know that in SQL, you can apply where functions with like instead of =. So I could say `where Filename like '%1%' and it would pull out all files with a 1 in the name. Is there a way use something like this to set a variable in R?
I have a couple of different scripts that need to use the Filename pulled from this dataframe. The only reliable way I can think to tell the script which one to pull from is to set a variable like this.
Ultimately I would like these two (pseudo)expressions to yield the same thing.
x <- file1.ext
and
x like '%1%'
should both give x = file1.ext

you can use grepl() as in this answer
subset(a, grepl("1", a$filename))
Or if you're coming from an SQL background, you might want to look into sqldf

you can use like from data.table to get your sql like behaviour here.
From the documentation see this example
library(data.table)
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))
DT[Name %like% "^Mar"]
for your problem suppose you have a data.frame df like this
path filename
1: C:/Path/to/the/file1.ext file1.ext
2: C:/Path/to/the/file2.ext file2.ext
3: C:/Path/to/the/file3.ext file3.ext
4: C:/Path/to/the/file4.ext file4.ext
do
library(data.table)
DT<-as.data.table(df)
DT[filename %like% "1"]
should give
path filename
1: C:/Path/to/the/file1.ext file1.ext

Related

nested if loop in splunk

I would like to write in splunk a nested if loop:
What I want to achieve
if buyer_from_France:
do eval percentage_fruits
if percentage_fruits> 10:
do summation
if summation>20:
total_price
if total_price>$50:
do(trigger bonus coupon)
My current code (that works):
> | eventstats sum(buyers_fruits) AS total_buyers_fruits by location
> | stats sum(fruits) as buyers_fruits by location buyers
> | eval percentage_fruits=fruits_bought/fruits_sold
> | table fruits_bought fruits_sold buyers
> | where percentage_fruits > 10
> | sort - percentage_fruits
How do I complete the syntax/expression for the 2nd (summation) and consequently, 3rd (total price), 4th if-loop (trigger)?
SPL doesn't do "loops". A close [enough] analog is that each line in SPL is similar to a single command in bash (hence the pipe separator between commands). IOW, SPL is purely linear in processing. Use a multi-condition eval..if like this:
index=ndx sourcetype=srctp
| eval myfield=if(match(fieldA,"someval") AND !match(fieldC,"notthis"),"all true","else val")
Or like this:
| eval myfield=if(match(fieldA,"someval"),if(match(fieldB,"otherval"),"matched A&B",if(!match(fieldC,"notthis"),"not A & not C","else val")))
If you can explain your use case/end goal better, we can probably provide better direction

How to use regexp_replace in spark.sql() to extract hashtags from string

I need to write a regexg_replace query in spark.sql() and I'm not sure how to handle it. For readability purposes, I have to utilize SQL for it. I am trying to pull out the hashtags from the table. I know how to do this using the python method but most of my team are SQL users.
My dataframe example looks like so:
Insta_post
Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…
RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…
RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…
My code:
I create a tempview:
post_df.createOrReplaceTempView("post_tempview")
post_df = spark.sql("""
select
regexp_replace(Insta_post, '.*?(.|'')(#)(\w+)', '$1') as a
from post_tempview
where Insta_post like '%#%'
""")
My end result:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|a |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… |
|RT #NALCABPolicy: Meeting with #RepDarrenSoto . Thanks for taking the time to meet with #LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
|RT #Tharryry: I am delighted that #RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…|
+--------------------------------------------------------------------------------------------------------------------------------------------+
desired result:
+---------------------------------+
|a |
+---------------------------------+
| #SaveTheInternet, #NetNeutrality|
| #NALCABPolicy2018 |
| #NetNeutrality |
+---------------------------------+
I haven't really used regexp_replace too much so this is new to me. Any help would be appreciated as well as an explanation of how to structure the subsets!
For Spark 3.1+, you can use regexp_extract_all function to extract multiple matches:
post_df = spark.sql("""
select regexp_extract_all(Insta_post, '(#\\\\w+)', 1) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+----------------------------------+
#|a |
#+----------------------------------+
#|[#SaveTheInternet, #NetNeutrality]|
#|[#NALCABPolicy2018] |
#|[#NetNeutrality] |
#+----------------------------------+
For Spark <3.1, you can use regexp_replace to remove all that doesn't match the hashtag pattern :
post_df = spark.sql("""
select trim(trailing ',' from regexp_replace(Insta_post, '.*?(#\\\\w+)|.*', '$1,')) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+-------------------------------+
#|a |
#+-------------------------------+
#|#SaveTheInternet,#NetNeutrality|
#|#NALCABPolicy2018 |
#|#NetNeutrality |
#+-------------------------------+
Note the use trim to remove the unnecessary trailing commas created by the first replace $,.
Do you really need a view? Because the following code might do it:
df = df.filter(F.col('Insta_post').like('%#%'))
col_trimmed = F.trim((F.regexp_replace('Insta_post', '.*?(#\w+)|.+', '$1 ')))
df = df.select(F.regexp_replace(col_trimmed,'\s',', ').alias('a'))
df.show(truncate=False)
# +--------------------------------+
# |a |
# +--------------------------------+
# |#SaveTheInternet, #NetNeutrality|
# |#NALCABPolicy2018 |
# |#NetNeutrality |
# +--------------------------------+
I ended up using two of regexp_replace, so potentially there could be a better alternative, just couldn't think of one.

Including parenthesis when joining dataframes using rlike in pyspark

I have 2 pyspark dataframes that I am trying to join where some of the values in the columns have parenthesis.
For example one of the values is
"Mangy (Dog)"
If I try joining like so:
df1.join(df2 expr("df1.animal rlike df2.animal_stat")
I don't get any results.
So I tried filtering using rlike just to see if I am able to capture the values.
The filtering worked on all values except those with parenthesis. For example when i try to filter like so:
df.filter(col('animal').rlike("Mangy (Dog)")).show()
I don't get any results.
However, if I filter with rlike("Mangy") or rlike("(Dog)" it seems to work. Even though I specified parenthesis in (Dog).
Is there a way to make rlike to include parenthesis in its matches?
EDIT:
I have 2 dataframes df1 and df2 like so:
+-----------------+-------+
| animal| origin|
+-----------------+-------+
| mangy (dog)|Streets|
| Cat| house|
|[Bumbling] Bufoon| Utopia|
| Cheetah| Congo|
|(Sprawling) Snake| Amazon|
+-----------------+-------+
+-------------------+-----------+
| animal_stat|destination|
+-------------------+-----------+
| ^dog$| House|
| ^Cat$| Streets|
|^[Bumbling] Bufoon$| Circus|
| ^Cheetah$| Zoo|
| ^(Sprawling)$| Glass Box|
+-------------------+-----------+
I am trying to join the two using rlike using the following method:
dff1=df1.alias('dff1')
dff2=df2.alias('dff2')
combine=dff1.join(dff2, expr("dff1.animal rlike dff2.animal_stat"), how='left')
.drop(dff2.animal_stat)
I would like the output dataframe to be like so:
+-----------------+-------+-----------+
| animal| origin|destination|
+-----------------+-------+-----------+
| mangy (dog)|Streets| House|
| Cat| house| Streets|
|[Bumbling] Bufoon| Utopia| Circus|
| Cheetah| Congo| Zoo|
|(Sprawling) Snake| Amazon| Glass Box|
+-----------------+-------+-----------+
Edit:
combine = df1.alias('df1').join(
df2.withColumn('animal_stat', F.regexp_replace(F.regexp_replace(F.regexp_replace(F.regexp_replace('animal_stat', '\\(', '\\\\('), '\\)', '\\\\)'), '\\[', '\\\\['), '\\]', '\\\\]')).alias('df2'),
F.expr('df1.animal rlike df2.animal_stat'),
'left'
)
If you're not using any regex, you probably want to use like instead of rlike. For example, you can do
df1.join(df2, expr("df1.animal like concat('%', df2.animal_stat, '%')"))
To do a filter, you can try
df.filter(col('animal').like("%Mangy (Dog)%")).show()
.rlike() is the same as .like() except it uses regex. You need to escape the parentheses. Try filtering like this:
df.filter(col('animal').rlike("Mangy \(Dog\)")).show()
Not sure I can help with the original join issue without some sample data.

search for specific characters within column and then create different columns from it

I have param_Value column that have different values. I need to extract these values and create columns for all of them.
|PARAM_NAME |param_Value |
__________|____________
|Step 4 | SP:0.09 |
|Procedure | MAX:125 |
|Step 4 | SP:Ambient|
|(null) | +/-:N/A |
|Steam | SP:2 |
|Step 3 | MIN:0 |
|Step 4 | RDPHN427B |
|Testing De | N/A |
I only want columns with: And give them names:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;
I'm more familiar with TSQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
, CASE WHEN "param_Value" LIKE 'SP:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END SET_POINT_VALUE
, CASE WHEN "param_Value" LIKE '+/-:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END UPPER_LOWER_LIMIT
, CASE WHEN "param_Value" LIKE 'MAX:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MAX_VALUE
, CASE WHEN "param_Value" LIKE 'MIN:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MIN_VALUE
FROM PROCESS_STEPS
;
The basic concept here is identifying the information you want via LIKE, then using SUBSTR and INSTR to extract it. While LIKE is normally something to stay away from, since there's no leading % in your case, it's Sargable, and thus probably not a total efficiency sink.
Really, though, I have to ask you to question why you're laying out your data like this - substring operations are slow in any language, and a DB is no exception. Why not use another column for your limit type? Why not lay it out in the view you're currently looking at?

How to retrieve data from multiple Stata files?

I have 53 Stata .dta files each of them is 150 - 200 Mb and contain identical set of variables, but for different years. It is not useful to combine or merge them due to their size .
I need to retrieve some averaged values (percentages etc.) Therefore, I want to create a new Stata file New.dta and write a .do file that would run on that new Stata file in the following way: it should open each of those 53 Stata files, make certain calulations, and store the results in the new Stata file, New.dta.
I am not sure how i can keep two Stata file open simultaneuosly, and how can i store the calculated values?
When I open a second .dta file, how can i make the first one still be open? How can i store the calculated values in the global variable?
What springs to mind here is the use of postfile.
Here is a simple example. First, I set up an example of several datasets. You already have this.
clear
forval i = 1/10 {
set obs 100
gen foo = `i' * runiform()
save test`i'
clear
}
Now I set up postfile. I need to set up a handle, what variables will be used, and what file will be used. Although I am using a numeric variable to hold file identifiers, it will perhaps be more typical to use a string variable. Also, looping over filenames may be a bit more challenging than this. fs from SSC is a convenience command that helps put a set of filenames into a local macro; its use is not illustrated here.
postfile mypost what mean using alltest.dta
forval i = 1/10 {
use test`i', clear
su foo, meanonly
post mypost (`i') (`r(mean)')
}
Now flush results
postclose mypost
and see what we have.
u alltest
list
+-----------------+
| what mean |
|-----------------|
1. | 1 .5110765 |
2. | 2 1.016858 |
3. | 3 1.425967 |
4. | 4 2.144528 |
5. | 5 2.438035 |
|-----------------|
6. | 6 3.030457 |
7. | 7 3.356905 |
8. | 8 4.449655 |
9. | 9 4.381101 |
10. | 10 5.017308 |
+-----------------+
I didn't use any global macros (not global variables) here; you should not need to.
An alternative approach is to loop over files and use collapse to "condense" these files to the relevant means, and than append these condensed files. Here is an adaptation of Nick's example:
// create the example datasets
clear
forval i = 1/10 {
set obs 100
gen foo = `i' * runiform()
gen year = `i'
save test`i', replace
clear
}
// use collapse and append
// to create the dataset you want
use test1, clear
collapse (mean) year foo
save means, replace
forvalues i = 2/10 {
use test`i', clear
collapse (mean) year foo
append using means
save means, replace
}
// admire the result
list
Note that if your data sets are not named sequentially like test1.dta, test2.dta, ..., test53.dta, but rather like results-alaska.dta, result_in_alabama.dta, ..., "wyoming data.dta" (note the space and hence the quotes), you would have to organize the cycle over these files somewhat differently:
local allfiles : dir . files "*.dta"
foreach f of local allfiles {
use `"`f'"', clear
* all other code from Maarten's or Nick's approach
}
This is a more advanced of local macros, see help extended macro functions. Note also that Stata will produce a list that will look like "results-alaska.dta" "result_in_alabama.dta" "wyoming data.dta" with quotes around file names, so when you invoke use, you will have to enclose the file name into compound quotes.