R inverted regex pattern for ls

In R I load one environment from a file that contains various timeseries plus one configuration object/vector.
I want to process all timeseries in the environment in a loop but want to exclude the configuration object.
At the moment my code is like this:
for(x in ls(myEnv)) {
  if(x != "configData") {
    # do something, e.g.
    View(myEnv[[x]], x)
  }
}
Is there a way to use the pattern parameter of the ls-function to omit the if clause?
for(x in ls(myEnv, pattern="magic regex picks all but *configData*")) {
  # do something, e.g.
  View(myEnv[[x]], x)
}
All examples I could find for pattern were based on a whitelist-approach (positive list), but I'd like to get all except configData.
Is this possible?
Thanks.

for( x in setdiff(ls(myEnv), "configData") )
and
for(x in grep("configData", ls(myEnv), value=TRUE, invert=TRUE))
both work fine, thanks.
BTW, cool! I wasn't aware of hiding it by using a leading "." ... so the best solution for me is to make sure that configData becomes .configData in the source file so that ls() won't show it.

Two actions after then in OCaml

Is it possible to perform two actions after a then in OCaml?
I searched and found that I could use a semicolon.
Should I use it like this?
let test (a:int)=
if a = 0
then print_int(1);print_int(2)
else()
;;
It's just an example. In my case I want to call a function and then return a tuple, like this:
let move_square(x,y:int*int):int*int=
..
let direction : int = Random.int(5);
if direction = 0
then draw_square(x,y+1);x,y+1
else ..
Thanks for helping me
You can refer to the "Séquence" section of https://caml.inria.fr/pub/old_caml_site/FAQ/qrg-fra.html.
In general, you have to group a sequence of OCaml expressions inside an if-then-else,
either explicitly with the begin and end keywords, or with parentheses. Without grouping, if a then b; c parses as (if a then b); c, so only the first expression belongs to the then branch.

Pig capture matching string with regex

I am trying to capture image URLs from inside tweets.
REGISTER 'hdfs:///user/cloudera/elephant-bird-pig-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-core-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-hadoop-compat-4.1.jar';
--Load Json
loadJson = LOAD '/user/cloudera/tweetwall' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map []);
B = FOREACH loadJson GENERATE flatten(json#'tweets') as (m:map[]);
tweetText = FOREACH B GENERATE FLATTEN(m#'text') as (str:chararray);
The intermediate data looks like this:
(#somenameontwitter your nan makes me laugh with some of the things she comes out with like http://somepics.com/my.jpg)
then I try to do the following to get only the image url back :
x = foreach tweetText generate REGEX_EXTRACT_ALL(str, '((http)(.*)(.jpg|.bmp|.png))');
dump x;
but that doesn't seem to work. I have also been trying with filter to no avail.
Even when trying the above with .* it returns empty results () or (())
I'm not good with regex and pretty new to Pig so it could be that I'm missing something simple here that I'm just not seeing.
update
example input data
{"tweets":[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"#Beace_ your nan makes me laugh with some of the things she comes out with blabla http://t.co/b7hjMWNg is an url, but not a valid one http://www.something.com/this.jpg should be a valid url","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_
count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}]}
Try this and let me know if this works
x = foreach tweetText generate REGEX_EXTRACT(str,'.*(http://.*.[jpg|bmp|png])',1);
DUMP x;
I managed to get it working (though I doubt it is totally optimal)
x = foreach tweetText generate REGEX_EXTRACT(str,'(http://.*(.jpg|.bmp|.png))',1) as image;
filtered = FILTER x BY $0 is not null;
dump filtered;
so the initial problem was just the regex (and my lack of knowledge on the subject).
Thanks for the assistance sivasakthi jayaraman!
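For anyone comparing the patterns above, here is a minimal sketch in Python rather than Pig. It assumes that Python's re module treats these particular constructs (character classes, alternation, the unescaped dot) the same way as the Java regex engine Pig uses. The names char_class, image_url and extract_image_url are made up for illustration. It shows why [jpg|bmp|png] does not test for a file extension, and one tighter alternative with an escaped dot.

```python
import re

# Sample text taken from the example tweet in the question.
text = ("blabla http://t.co/b7hjMWNg is an url, but not a valid one "
        "http://www.something.com/this.jpg should be a valid url")

# [jpg|bmp|png] is a character class: it matches ONE character out of
# j, p, g, |, b, m, n, so this pattern does not test for a file extension.
char_class = re.compile(r"http://.*.[jpg|bmp|png]")

# Grouped alternation with an escaped dot; \S+ keeps the match inside one URL.
image_url = re.compile(r"http://\S+\.(?:jpg|bmp|png)")

def extract_image_url(s):
    """Return the first image URL in s, or None if there is none."""
    m = image_url.search(s)
    return m.group(0) if m else None

print(char_class.search(text).group(0))  # runs past the end of the image URL
print(extract_image_url(text))
```

The same pattern should carry over to REGEX_EXTRACT, but the backslashes may need doubling inside a Pig string literal, so treat that part as untested.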

Having an issue with a complicated regex in Python

s = {"densityThreshold": 2.4543288981124E+14}
I was thinking something like this
re.search(".[A-Za-z]*.:\s\d\.\d+..\d+", k) or if re.search(".[A-Za-z]*.:\s\d\.\w+.\d+", k):
but neither seems to work.
I need to capture "densityThreshold" and "2.4543288981124E+14" as groups to create another dictionary. I would usually use group(), but I'm stuck at search!
import re
x = 's = {"densityThreshold": 2.4543288981124E+14}'
k = re.search(r".[A-Za-z]*.:\s\d\.\d+..\d+", x)
print(k.group())
You can do this if you want the whole match in one group. If you want the key and the value separately, use capture groups:
x = 's = {"densityThreshold": 2.4543288981124E+14}'
k = re.search(r"(.[A-Za-z]*.):(\s\d\.\d+..\d+)", x)
print(k.groups())
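Since the goal is to build another dictionary, here is a minimal sketch of a stricter pattern that captures the key without its quotes and the number with its exponent, then converts the value to a float. The names pattern and to_dict are hypothetical, and the sketch assumes keys contain only letters and values are plain numbers in decimal or scientific notation.

```python
import re

x = 's = {"densityThreshold": 2.4543288981124E+14}'

# Group 1: the key between double quotes.
# Group 2: a number with optional sign, fraction, and exponent.
pattern = re.compile(r'"([A-Za-z]+)":\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)')

def to_dict(text):
    """Collect every "key": number pair found in text into a dict."""
    return {key: float(value) for key, value in pattern.findall(text)}

print(to_dict(x))
```

With two capture groups, findall returns a list of (key, value) tuples, which is why the comprehension can unpack them directly.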

Mallet in R regex error: java.lang.NoSuchMethodException: No suitable method for the given parameters

I've been following the tutorial on how to use mallet in R to create topic models. My text file has one sentence per line. It has about 50 sentences and looks like this:
Thank you again and have a good day :).
This is an apple.
This is awesome!
LOL!
i need 2.
.
.
.
This is my code:
Sys.setenv(NOAWT=TRUE)
#setup the workspace
# Set working directory
dir<-"/Users/jxn"
Dir <- "~/Desktop/Chat/malletR/text" # adjust to suit
require(mallet)
documents1 <- mallet.read.dir(Dir)
View(documents1)
stoplist1<-mallet.read.dir("~/Desktop/Chat/malletR/stoplists")
View(stoplist1)
mallet.instances <- mallet.import(documents1$id, documents1$text, "~/Desktop/Chat/malletR/stoplists/en.txt", token.regexp ="\\p{L}[\\p{L}\\p{P}]+\\p{L}")
Everything works except for the last line of the code:
mallet.instances <- mallet.import(documents1$id, documents1$text, "~/Desktop/Chat/malletR/stoplists/en.txt", token.regexp ="\\p{L}[\\p{L}\\p{P}]+\\p{L}")
I keep getting this error :
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.NoSuchMethodException: No suitable method for the given parameters
According to the package, this is how the function should be:
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt",
token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
I believe it has something to do with the token.regexp argument, as
documents1 <- mallet.read.dir(Dir) works just fine, which means that the first 3 arguments supplied to mallet.import were correct.
This is a link to the git repo whose tutorial I was following.
https://github.com/shawngraham/R/blob/master/topicmodel.R
Any help would be much appreciated.
Thanks,
J
I suspect the problem is with your text file. I have encountered the same error and resolved it by using the as.character() function as follows:
mallet.instances <- mallet.import(as.character(documents$id),
as.character(documents$text),
"en.txt",
FALSE,
token.regexp="\\p{L}[\\p{L}\\p{P}]+\\p{L}")
Are you sure you also converted the id field to character? It is easy to overlook that advice and leave it as an integer.
Also, there is a typo in the code sample: the backslashes have to be escaped:
token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}"
This usually occurs because the HTML text editor eats up one backslash.

Stacking related lines together in Notepad++

I'm trying to use find and replace in Notepad++ with regular expressions to do the following.
I have two sets of lines.
first set:
[c][eu][e]I37ANKCB[/e]
[c][eu][e]OIL8ZEPW[/e]
[c][eu][e]4OOEL75O[/e]
[c][eu][e]PPNW5FN4[/e]
[c][eu][e]E2BXCWUO[/e]
[c][eu][e]SD9UQNT8[/e]
[c][eu][e]E6BK6IGO[/e]
second set:
[u]7ubju2jvioks[u2]_261
[u]89j408tah1lz[u2]_262
[u]j673xnd49tq0[u2]_263
[u]dv73osmh1wzu[u2]_264
[u]twz3u4yiaeqr[u2]_265
[u]cuhtg6r71kud[u2]_266
[u]yts0ktvt9a3r[u2]_267
Now I want each line of the second set to be placed after the corresponding line of the first set, like this:
[c][eu][e]I37ANKCB[/e][u]7ubju2jvioks[u2]_261
[c][eu][e]OIL8ZEPW[/e][u]89j408tah1lz[u2]_262
[c][eu][e]4OOEL75O[/e][u]j673xnd49tq0[u2]_263
[c][eu][e]PPNW5FN4[/e][u]dv73osmh1wzu[u2]_264
[c][eu][e]E2BXCWUO[/e][u]twz3u4yiaeqr[u2]_265
[c][eu][e]SD9UQNT8[/e][u]cuhtg6r71kud[u2]_266
[c][eu][e]E6BK6IGO[/e][u]yts0ktvt9a3r[u2]_267
Any suggestions?
You can mark the second block in column mode using ALT and the left mouse button. Then just copy and paste it at the end of the first row.
Regex is not needed here (and not possible for this).
I would solve this via a simple script written in Python or Ruby or something equally quick. This works, for example:
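import os

# Resolve the input files relative to this script's location.
path = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(path, 'file1')) as file1, \
     open(os.path.join(path, 'file2')) as file2:
    # Pair line N of file1 with line N of file2.
    lines = zip(file1.readlines(), file2.readlines())
    # Strip the newline from the first line so the second is appended to it.
    print(''.join(a.rstrip() + b for a, b in lines), end='')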
Running it gives the correct result:
> python join.py
[c][eu][e]I37ANKCB[/e][u]7ubju2jvioks[u2]_261
[c][eu][e]OIL8ZEPW[/e][u]89j408tah1lz[u2]_262
[c][eu][e]4OOEL75O[/e][u]j673xnd49tq0[u2]_263
[c][eu][e]PPNW5FN4[/e][u]dv73osmh1wzu[u2]_264
[c][eu][e]E2BXCWUO[/e][u]twz3u4yiaeqr[u2]_265
[c][eu][e]SD9UQNT8[/e][u]cuhtg6r71kud[u2]_266
[c][eu][e]E6BK6IGO[/e][u]yts0ktvt9a3r[u2]_267
Customize to suit your needs.
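One customization worth considering: zip stops at the shorter input, so if the two sets differ in length the extra lines are silently dropped. Here is a minimal sketch that keeps them, assuming the two sets are already available as lists of strings. The name stack_lines is a hypothetical helper, not part of the script above.

```python
from itertools import zip_longest

def stack_lines(first, second):
    """Append line N of `second` to line N of `first`.

    Unmatched lines from the longer list are kept as they are.
    """
    return [a.rstrip("\n") + b.rstrip("\n")
            for a, b in zip_longest(first, second, fillvalue="")]

first_set = ["[c][eu][e]I37ANKCB[/e]\n", "[c][eu][e]OIL8ZEPW[/e]\n"]
second_set = ["[u]7ubju2jvioks[u2]_261\n", "[u]89j408tah1lz[u2]_262\n"]
print("\n".join(stack_lines(first_set, second_set)))
```

The lists could come from readlines() on two files, or from one file split at the point where the second set begins.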