How to replace substring in DataFrame column by variable pattern in Julia?

Suppose I have a DataFrame with two columns - gibberish and letter.
I want to replace substrings in gibberish so that only the ones matching letter remain, e.g. If gibberish is "kjkkj" and the letter is "j" I want gibberish to equal "jj".
The DataFrame is defined as:
df = DataFrame(gibberish = ["dqzzzjbzz", "jjjvjmjjkjjjjjjj", "mmbmmlvmbmmgmmf"], letter = ["z", "j", "m"])
If I had no letter variable and wanted only, let's say "x" to remain I would do:
df.gibberish.= replace.(gibberish, r"[^x;]" => "")
and that works fine, but when I try doing the same, but putting in letter column as a variable in the regex expression, it just breaks.
I tried doing that the "normal" DataFrames.jl way and with the DataFramesMeta.jl shortcut @transform:
df.gibberish.= replace.(gibberish, Regex(join(["[^", letter, ";]"])) => "")
which results in an error of
ERROR: UndefVarError: letter not defined
while the @transform way just doesn't do anything:
julia> @transform(df, filtered = replace(:gibberish, Regex.(join(["[^", :letter, ";]"])) => ""))
3×3 DataFrame
│ Row │ letter │ gibberish │ filtered │
│ │ String │ String │ String │
├──────┼────────┼───────────────────┼───────────────────┤
│ 1 │ z │ dqzzzjbzz │ dqzzzjbzz │
│ 2 │ j │ jjjvjmjjkjjjjjjj │ jjjvjmjjkjjjjjjj │
│ 3 │ m │ mmbmmlvmbmmgmmf │ mmbmmlvmbmmgmmf │
I'm a very fresh beginner in Julia and I'm probably missing something very basic, but the proper solution just escapes me.
How do I solve this problem, other than writing a rowwise loop which would be horribly inefficient?

replace.(gibberish, Regex(join(["[^", letter, ";]"]))
letter refers here to a Julia variable (which is not defined), not to a column of the DataFrame.
You could try something like
Regex.(string.("[^" .* df.letter .* ";]"))
to construct an array of Regexes using a DataFrame column as input.
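For comparison, the same per-row idea can be sketched in Python with the standard re module. This is only an illustration, assuming the two columns are available as plain lists; the helper name keep_only is hypothetical and not part of any library. As in the Julia snippets above, the character class keeps the row's own letter and ";".

```python
import re

gibberish = ["dqzzzjbzz", "jjjvjmjjkjjjjjjj", "mmbmmlvmbmmgmmf"]
letters = ["z", "j", "m"]


def keep_only(text, letter):
    """Drop every character except `letter` and ';' (hypothetical helper)."""
    # re.escape guards against letters that are special inside a regex.
    pattern = "[^" + re.escape(letter) + ";]"
    return re.sub(pattern, "", text)


# Build one pattern per row by pairing each string with its own letter.
filtered = [keep_only(text, letter) for text, letter in zip(gibberish, letters)]
```

The point is the same as in the answer: the pattern has to be built once per row from the column values, not from a single global variable.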


Fibonacci problem causes Arithmetic overflow

The problem: create a function with one input. Return the index of an array containing the fibonacci sequence (starting from 0) whose element matches the input to the function.
def fib(n)
  return 0 if n == 0

  last = 0u128
  current = 1u128

  (n - 1).times do
    last, current = current, last + current
  end

  current
end

# ...

def usage
  progname = String.new(ARGV_UNSAFE.value)

  STDERR.puts <<-H
  #{progname} <integer>
  Given Fibonacci; determine which fib value would
  exist at <integer> index.
  H

  exit 1
end

if ARGV.empty?
  usage
end

begin
  i = ARGV[0].to_i
  puts fib i
rescue e
  STDERR.puts e
  usage
end
My solution to the problem is in no way elegant and I did it at 2AM when I was quite tired. So I'm not looking for a more elegant solution. What I am curious about is that if I run the resultant application with an input larger than 45 then I'm presented with Arithmetic overflow. I think I've done something wrong with my variable typing. I ran this in Ruby and it runs just fine so I know it's not a hardware issue...
Could someone help me find what I did wrong in this? I'm still digging, too. I just started working with Crystal this week. This is my second application/experiment with it. I really like it, but I am not yet aware of some of its idiosyncrasies.
EDIT
Updated script to reflect the suggested change and the runtime outcome of that change. With the change, I can now run the program successfully with inputs above 45, but only up to the low 90s. So that's interesting. I'm gonna run through this and see where I may need to add additional explicit casting. It seems very unintuitive that changing the type at initialization didn't "stick" through the entire runtime; I tried that first and it failed. Something doesn't make sense here to me.
Original Results
$ crystal build fib.cr
$ ./fib 45
1836311903
$ ./fib 46
Arithmetic overflow
$ ./fib.rb 460
985864329041134079854737521712801814394706432953315\
510410398508752777354792040897021902752675861
Latest Results
$ ./fib 92
12200160415121876738
$ ./fib 93
Arithmetic overflow
./fib <integer>
Given Fibonacci; determine which fib value would
exist at <integer> index.
Edit ^2
Now also decided that maybe ARGV[0] is the problem. So I changed the call to f() to:
begin
  i = ARGV[0].to_u64.as(UInt64)
  puts f i
rescue e
  STDERR.puts e
  usage
end
and added a debug print to show the types of the variables in use:
return 0 if p == 0

puts "p: %s\tfib_now: %s\tfib_last: %s\tfib_hold: %s\ti: %s" % [typeof(p), typeof(fib_now), typeof(fib_last), typeof(fib_hold), typeof(i)]
loop do
p: UInt64 fib_now: UInt64 fib_last: UInt64 fib_hold: UInt64 i: UInt64
Arithmetic overflow
./fib <integer>
Given Fibonacci; determine which fib value would
exist at <integer> index.
Edit ^3
Updated with latest code after bug fix solution by Jonne. Turns out the issue is that I'm hitting the limits of the structure even with 128 bit unsigned integers. Ruby handles this gracefully. It seems that in Crystal, it's up to me to handle it gracefully.
The default integer type in Crystal is Int32, so if you don't explicitly specify the type of an integer literal, you get that.
In particular the lines
fib_last = 0
fib_now = 1
turn the variables into the effective type Int32. To fix this, make sure you specify the type of these integers. Given that you don't need negative numbers, UInt64 seems most appropriate here:
fib_last = 0u64
fib_now = 1u64
Also note the literal syntax I'm using here. Your 0.to_i64's create an Int32 and then an Int64 out of that. The compiler will be smart enough to do this conversion at compile time in release builds, but I think it's nicer to just use the literal syntax.
Edit answering the updated question
Fibonacci is defined as F(0) = 0, F(1) = 1, F(n) = F(n-2) + F(n-1), so 0, 1, 1, 2, 3, 5.
Your algorithm is off by one. It calculates F(n+1) for a given n > 1, in other words 0, 1, 2, 3, 5; in yet other words, it basically skips F(2).
Here's one that does it correctly:
def fib(n)
  return 0 if n == 0
  last = 0u64
  current = 1u64
  (n - 1).times do
    last, current = current, last + current
  end
  current
end
This correctly gives 7540113804746346429 for F(92) and 12200160415121876738 for F(93). However, it still overflows for F(94), because that would be 19740274219868223167, which is bigger than 2^64 = 18446744073709551616, so it doesn't fit into UInt64. To clarify once more, your version tries to calculate F(94) when being asked for F(93), hence you get the overflow "too early".
So if you want to support calculating Fn for n > 93 then you need to venture into the experimental Int128/UInt128 support or use BigInt.
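As a cross-check of the limits discussed above, here is a small sketch in Python, whose integers are arbitrary-precision (much like Ruby's). The function names fib and largest_index_fitting are made up for this illustration; the second one finds the largest n for which F(n) still fits under a given maximum value.

```python
def fib(n):
    """Return F(n) with F(0) = 0 and F(1) = 1."""
    last, current = 0, 1
    for _ in range(n):
        last, current = current, last + current
    return last


def largest_index_fitting(max_value):
    """Largest n such that F(n) <= max_value (hypothetical helper)."""
    n = 0
    while fib(n + 1) <= max_value:
        n += 1
    return n


int32_limit = largest_index_fitting(2**31 - 1)     # 46
uint64_limit = largest_index_fitting(2**64 - 1)    # 93
uint128_limit = largest_index_fitting(2**128 - 1)  # 186
```

This matches the behavior reported in the question: with the off-by-one version, an input of 45 already computes F(46), the last value that fits into Int32, and an input of 92 computes F(93), the last value that fits into UInt64.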
I think one more thing should be mentioned to explain the Ruby/Crystal difference, besides the fact that integer literals default to Int32.
In Ruby, a dynamically typed interpreted language, there is no concept of variable type, only value type. All variables can hold values of any type.
This allows it to transparently turn a Fixnum into a Bignum behind the scenes when it would overflow.
Crystal, on the contrary, is a statically typed compiled language. It looks and feels like Ruby thanks to type inference and type unions, but the variables themselves are typed.
This allows it to catch a large number of errors at compile time and run Ruby-like code at C-like speed.
I think, but don't take my word for it, that Crystal could in theory match Ruby's behavior here, but it would be more trouble than it's worth. It would require all operations on integers to return a type union with BigInt, at which point, why not leave the primitive types alone and use big integers directly when necessary?
Long story short, if you need to work with very large integer values beyond what a UInt128 can hold, require "big" and declare the relevant variables BigInt.
edit: see also here for extreme cases, apparently BigInts can overflow too (I never knew) but there's an easy remedy.

How to get RPM installed package names and versions using regular expressions in Python

I want to parse rpm -qa output to get package name and version info. My idea is to search for first occurrence of r'(-\w+\.)' (this regex should match the first occurrence of a substring that lies between '-' and '.') and split the data with it. The first part will be the package name and the matched regex with omitted '-' concatenated with second part will be the version.
Example:
boost-license-1.36.0-12.3.1: '-1.' should be first occurrence of the regex matched part
boost-license: The first part after splitting string with -1. will be the package name
-1. + 36.0-12.3.1: from matched part remove '-' and add it to second part to obtain the version.
How can I implement this in Python, and is there an alternative way to identify the package name and version?
boost-license-1.36.0-12.3.1 -> boost-license and 1.36.0-12.3.1
yast2-schema-2.17.5-0.5.42 -> yast2-schema and 2.17.5-0.5.42
release-notes-sles-11.3.34-0.7.1 -> release-notes-sles and 11.3.34-0.7.1
yast2-country-data-2.17.55-0.7.1 -> yast2-country-data and 2.17.55-0.7.1
Code part:
import paramiko
from paramiko import SSHException

command = 'rpm -qa'
pkgList = []
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    client.connect('ipaddress', username='user', password='pwd')
except SSHException as error:
    print(str(error) + "\n" + "Authentication error")
else:
    stdin, stdout, stderr = client.exec_command(command)
    # One package per line of `rpm -qa` output.
    for line in stdout:
        pkgList.append(line.strip('\n'))
    for line in stderr:
        print('' + line.strip('\n'))
Glad you could solve part of your problem; I have solved the other half for you.
Here is a simple function for it:
import re

def slicer(pkgList):
    # Description: take packages and slice them into package names and versions.
    # Param: pkgList (list) - package strings as printed by `rpm -qa`.
    # Returns a dict with package names as keys and versions as values.
    packages = {}
    non_packages = []
    for item in pkgList:
        # The first "-<digits>." marks where the version starts.
        target = re.search(r'(-\d+\.)', item)
        if target is None:
            # No version-like part, e.g. gpg-pubkey-307e3d54.
            non_packages.append(item)
            continue
        start = target.start()
        package_name = item[:start]
        package_version = item[start + 1:]
        packages[package_name] = package_version
    print('Non Packages:\n', non_packages)
    return packages

# The returned value is a dict, so to get the packages:
# packages = slicer(pkgList)
# for names, versions in packages.items():
#     print(names, '\t', versions)
Hope this helps
I have implemented this in Python. I remove the items that have no regex match from the list.
Eg: gpg-pubkey-307e3d54
I use one more list, removedList, to remove elements from pkgList, because removing elements directly from pkgList while iterating over it misses some of them.
Eg: the following loop removes elements while iterating through the same list and skips some items:
for eachPkgVersion in pkgList:
    if not p.search(eachPkgVersion):
        pkgList.remove(eachPkgVersion)
The above code removes only 3 items from pkgList even though it contains the 6 items below.
Input:
gpg-pubkey-307e3d54
gpg-pubkey-39db7c82
gpg-pubkey-3d25d3d9
gpg-pubkey-50a3dd1c
gpg-pubkey-9c800aca
gpg-pubkey-b37b98a9
Output: the list still has the items below in it
gpg-pubkey-39db7c82
gpg-pubkey-50a3dd1c
gpg-pubkey-b37b98a9
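To make the skipping behavior concrete, here is a small self-contained sketch using the six entries above and the same pattern. The variable names keys, buggy and fixed are only for this illustration. Each removal shifts the remaining items one position to the left, so the loop steps over the element that follows every removed one; iterating over a shallow copy avoids that.

```python
import re

p = re.compile(r'(-\w+\.)')

keys = [
    'gpg-pubkey-307e3d54',
    'gpg-pubkey-39db7c82',
    'gpg-pubkey-3d25d3d9',
    'gpg-pubkey-50a3dd1c',
    'gpg-pubkey-9c800aca',
    'gpg-pubkey-b37b98a9',
]

# Buggy: the list is modified while it is being iterated.
buggy = list(keys)
for entry in buggy:
    if not p.search(entry):
        buggy.remove(entry)

# Safer: iterate over a shallow copy, remove from the original.
fixed = list(keys)
for entry in fixed[:]:
    if not p.search(entry):
        fixed.remove(entry)
```

After running this, buggy still holds every second entry, while fixed is empty because none of the six entries contains a "-...." part.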
Complete Fix:
import re

p = re.compile(r'(-\w+\.)')
removedList = []
# Collect the non-matching items first instead of removing them mid-iteration.
for eachPkgVersion in pkgList:
    if not p.search(eachPkgVersion):
        removedList.append(eachPkgVersion)
for eachPkgVersion in removedList:
    pkgList.remove(eachPkgVersion)
for eachPkgVersion in pkgList:
    delimiter = p.search(eachPkgVersion).group(1)
    # Split only at the first occurrence of the delimiter.
    parts = eachPkgVersion.split(delimiter, 1)
    pkgName = parts[0]
    pkgVersion = delimiter.strip('-') + parts[1]
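As an alternative sketch, the split can be done in one step with a lookahead, so the delimiter does not have to be stripped and re-attached. This assumes, like the first answer, that the version starts at the first "-" followed by digits and a dot; the names VERSION_START and split_package are hypothetical, and real rpm -qa output may contain names that break this assumption.

```python
import re

# The first "-" that is followed by digits and a dot, e.g. the "-" in "-1.".
VERSION_START = re.compile(r'-(?=\d+\.)')


def split_package(entry):
    """Return (name, version), or None if no version-like part is found."""
    match = VERSION_START.search(entry)
    if match is None:
        return None
    return entry[:match.start()], entry[match.end():]


entries = [
    'boost-license-1.36.0-12.3.1',
    'yast2-schema-2.17.5-0.5.42',
    'release-notes-sles-11.3.34-0.7.1',
    'yast2-country-data-2.17.55-0.7.1',
    'gpg-pubkey-307e3d54',
]

parsed = {}
skipped = []
for entry in entries:
    result = split_package(entry)
    if result is None:
        skipped.append(entry)  # no version found, e.g. gpg-pubkey-...
    else:
        name, version = result
        parsed[name] = version
```

Because non-matching entries are collected in a separate list rather than removed from the list being iterated, nothing is skipped.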

Using for statement, join each line in 2 lists and store in variable

a = [['gigethernet1/0/1', 234], ['vlan1', 675876], ['fastethernet1/0', 3534]]
b = [['gigethernet1/0/1', 78678], ['vlan1', 6789679687], ['fastethernet1/0', 67896786]]
anewlist = [line for line in a if "thernet" in line[0]]
bnewlist = [line for line in b if "thernet" in line[0]]
I have 2 variables, each holding a list of lists. From variable a I create a new variable that keeps only the lines containing a certain string, 'thernet', and I do the same for variable b. I want to merge the filtered a and b line by line, but include only 1 element from each line of b; see the desired output below.
['gigethernet1/0/1', 234, 78678]
['fastethernet1/0', 3534, 67896786]
Try this:
data = [dict(anewlist), dict(bnewlist)]  # Create a list of dicts
# dict([['a','b'],['c','d']]) => {'a': 'b', 'c': 'd'}
output = [[key] + [d.get(key) for d in data] for key in set().union(*data)]
# The '*' operator unpacks the list of dicts. Google 'python unpacking lists' for more info.
# set().union(*data) creates a set of the keys of the dicts passed in. Read about Python sets, especially the `union` method, and what happens when a dict is passed as an argument to `set()`.
# Note that sets are unordered, so the rows of output may come out in any order.
# The rest is list comprehension and dictionary operations!
Hope this helps!
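If the order of a should be preserved, one possible variant is to turn b into a lookup dict and walk through a in order. This is a sketch under the assumption that interface names are unique within each list; the names b_lookup and merged are made up for the example.

```python
a = [['gigethernet1/0/1', 234], ['vlan1', 675876], ['fastethernet1/0', 3534]]
b = [['gigethernet1/0/1', 78678], ['vlan1', 6789679687], ['fastethernet1/0', 67896786]]

# Map each interface name in b to its value for direct lookup.
b_lookup = {name: value for name, value in b}

# Keep the order of a, filter on the name, and append the matching value from b.
merged = [
    [name, value, b_lookup[name]]
    for name, value in a
    if "thernet" in name and name in b_lookup
]
```

Here merged holds the two rows from the desired output, in the same order as they appear in a.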

How to check if previous element is similar to next element in Python

I have a text file like:
abc
abc
abc
def
def
def
...
...
...
...
Now I would like to create lists
list1=['abc','abc','abc']
list2=['def','def','def']
....
....
....
I would like to know how to check if next element is similar to previous element in a python for loop.
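You can use a list comprehension to check whether the ith element is equal to the (i-1)th element in your list. Note that for i = 0 the index i-1 is -1, which refers to the last element, so the first comparison wraps around; use range(1, len(list1)) if you want to skip it.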
[ list1[i]==list1[i-1] for i in range(len(list1)) ]
>>> list1=['abc','abc','abc']
>>> [ list1[i]==list1[i-1] for i in range(len(list1)) ]
[True, True, True]
>>> list1=['abc','abc','abd']
>>> [ list1[i]==list1[i-1] for i in range(len(list1)) ]
[False, True, False]
This can be written within a for loop as well:
aux_list = []
for i in range(len(list1)):
    aux_list.append(list1[i]==list1[i-1])
Check this post:
http://www.pythonforbeginners.com/lists/list-comprehensions-in-python/
for i in range(1, len(list)):
    if list[i] == list[i-1]:
        # Here list[i] is equal to the previous element, i.e. list[i-1]
        pass
file = open('workfile', 'r') # open the file
splitStr = file.read().split()
# will look like splitStr = ['abc', 'abc', 'abc', 'def', ....]
I think the best way to progress from here would be to use a dictionary
words = {}
for eachStr in splitStr:
    if eachStr in words:  # we have already found this word
        words[eachStr] = words[eachStr] + 1  # increment the count (value) for this key
    else:  # we have not found this word yet
        words[eachStr] = 1  # initialize the new key-value pair
This will create a dictionary so the result would look like
print(list(words.items()))
[('abc', 3), ('def', 3)]
This way you store all of the information you want. I proposed this solution because it's rather messy to create an unknown number of lists to accommodate what you want to do, but it is easy and memory efficient to store the data in a dictionary from which you can create a list if need be. Furthermore, using dictionaries and sets allows you to have a single copy of each string (in this case).
If you absolutely need new lists let me know and I will try to help you figure it out
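If separate lists per run of equal lines are really needed, one option is itertools.groupby from the standard library, which groups consecutive equal elements. This is a sketch that assumes the lines have already been read into a list, as with splitStr above; the names lines and groups are only illustrative.

```python
from itertools import groupby

lines = ['abc', 'abc', 'abc', 'def', 'def', 'def']

# groupby yields one (key, iterator) pair per run of consecutive equal items.
groups = [list(run) for _, run in groupby(lines)]
```

Here groups is a list of lists, one per run, so there is no need to create an unknown number of separately named variables such as list1 and list2.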

How to use regular expression with ANY array operator

I have a column containing an array of authors. How can I use the ~* operator to check if any of its values match a given regular expression?
The ~* operator takes the string to check on the left and the regular expression to match on the right. The documentation says the ANY operator has to be on the right side so, obviously
SELECT '^p' ~* ANY(authors) FROM book;
does not work as PostgreSQL tries to match the string ^p against expressions contained in the array.
Any idea?
The first obvious idea is to use your own regexp-matching operator with commuted arguments:
create function commuted_regexp_match(text,text) returns bool as
'select $2 ~* $1;'
language sql;
create operator ~!## (
procedure=commuted_regexp_match(text,text),
leftarg=text, rightarg=text
);
Then you may use it like this:
SELECT '^p' ~!## ANY(authors) FROM book;
Another way of looking at it is to unnest the array and formulate in SQL the equivalent of the ANY construct:
select bool_or(r) from
(select author ~* '^j' as r
from (select unnest(authors) as author from book) s1
) s2;
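As a rough analogy only, in Python rather than SQL, the unnest plus bool_or formulation has the same shape as any() over the array elements with the regular expression applied to each one. The function name any_author_matches is hypothetical, and Python's regex dialect is not identical to PostgreSQL's, so this illustrates the logic, not a drop-in replacement.

```python
import re


def any_author_matches(pattern, authors):
    """True if at least one author matches pattern, ignoring case (like ~*)."""
    regex = re.compile(pattern, re.IGNORECASE)
    # Each element is the string being tested; the pattern stays fixed.
    return any(regex.search(author) for author in authors)


starts_with_p = any_author_matches('^p', ['ika', 'pchu'])
no_match = any_author_matches('^p', ['ika', 'chu'])
```

The key point carried over from the SQL is the argument order: every array element is matched against the one pattern, instead of the pattern string being matched against the elements.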
SELECT * FROM book WHERE EXISTS ( SELECT * FROM unnest(authors) AS x WHERE x ~* '^p' )
Here's an idea if you can make reasonable assumptions about the data. Just concatenate the array into a string and do a regex-search against the whole string.
select array_to_string(ARRAY['foo bar', 'moo cow'], ',') ~ 'foo'
Off the cuff, and without any measurements to back me up, I would say that most performance issues related to the regex stuff could be dealt with by smart uses of regex, and maybe some special delimiter characters. Creating the string may be a performance issue, but I wouldn't even dare to speculate on that.
I use this:
create or replace function regexp_match_array(a text[], regexp text)
returns boolean
strict immutable
language sql as $_$
select exists (select * from unnest(a) as x where x ~ regexp);
$_$;
comment on function regexp_match_array(text[], text) is
'returns TRUE if any element of a matches regexp';
create operator ~ (
procedure=regexp_match_array,
leftarg=text[], rightarg=text
);
comment on operator ~(text[], text) is
'returns TRUE if any element of ARRAY (left) matches REGEXP (right); think ANY(ARRAY) ~ REGEXP';
Then use it much like you'd use ~ with text scalars:
=> select distinct gl from x where gl ~ 'SH' and array_length(gl,1) < 7;
┌──────────────────────────────────────┐
│ gl │
├──────────────────────────────────────┤
│ {MSH6} │
│ {EPCAM,MLH1,MSH2,MSH6,PMS2} │
│ {SH3TC2} │
│ {SHOC2} │
│ {BRAF,KRAS,MAP2K1,MAP2K2,SHOC2,SOS1} │
│ {MSH2} │
└──────────────────────────────────────┘
(6 rows)
You can define your own operator to do what you want.
Reverse the order of the arguments and call the appropriate function :
create function revreg (text, text) returns boolean
language sql immutable
as $$ select texticregexeq($2,$1); $$;
(revreg ... please choose your favorite name).
Add a new operator using our revreg() function :
CREATE OPERATOR ### (
PROCEDURE = revreg,
LEFTARG = text,
RIGHTARG = text
);
Test:
test=# SELECT '^p' ### ANY(ARRAY['ika', 'pchu']);
t
test=# SELECT '^p' ### ANY(ARRAY['ika', 'chu']);
f
test=# SELECT '^p' ### ANY(ARRAY['pika', 'pchu']);
t
test=# SELECT '^p' ### ANY(ARRAY['pika', 'chu']);
t
Note that you may want to set JOIN and RESTRICT clauses on the new operator to help the planner.
My solution
SELECT a.* FROM books a
CROSS JOIN LATERAL (
SELECT author
FROM unnest(authors) author
WHERE author ~ E'p$'
LIMIT 1
)b;
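This uses a CROSS JOIN LATERAL: the subquery is evaluated for every row of the table "books". If one of the rows returned by unnest meets the condition, the subquery returns exactly one row (because of the LIMIT), so the book appears once in the result; otherwise it returns no rows and the book is dropped.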
I use a generalization of Reece's approach:
select format($$
create function %1$s(a text[], regexp text) returns boolean
strict immutable language sql as
%2$L;
create operator %3$s (procedure=%1$s, leftarg=text[], rightarg=text);
$$, /*1*/nameprefix||'_array_'||oname, /*2*/q1||o||q2, /*3*/oprefix||o
)
from (values
('tilde' , '~' ), ('bang_tilde' , '!~' ),
('tilde_star' , '~*' ), ('bang_tilde_star' , '!~*' ),
('dtilde' , '~~' ), ('bang_dtilde' , '!~~' ),
('dtilde_star', '~~*'), ('bang_dtilde_star', '!~~*')
) as _(oname, o),
(values
('any', '', 'select exists (select * from unnest(a) as x where x ', ' regexp);'),
('all', '#', 'select true = all (select x ', ' regexp from unnest(a) as x);')
) as _2(nameprefix, oprefix, q1, q2)
\gexec
Executing this in psql creates 16 functions and 16 operators that cover all applicable 8 matching operators for arrays -- plus 8 variations prefixed with # that implement the ALL equivalent.
Very handy!