NOT operator in redis scan command when pattern matching - regex

Consider these commands:
First I create a sample trip as a hash:
HMSET trip:1734 id 1734 start 08:35 end 09:30
Then I need to store some route points for the trip; I do it with a list like this:
RPUSH trip:1734:routes 22444.345566,32.875553 77.3,44.3
Now I need to retrieve only the trips (a list of trips) without the route part. How can I do that?
I tried:
SCAN 0 MATCH trip:*[^:routes]
and it is working well.
But why does this not work?
SCAN 0 MATCH trip:*[^:*]
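For what it's worth, MATCH patterns are glob-style rather than regular expressions: a bracket expression matches exactly one character, and * is literal inside the brackets, so trip:*[^:*] still matches trip:1734:routes because its last character ('s') is neither ':' nor '*'. A simple workaround is to scan broadly and drop the route keys on the client; here is a rough redis-py sketch of mine (connection details assumed):
import redis

# Minimal sketch (not from the question): scan everything under trip:* and
# filter out the per-trip route lists on the client side.
r = redis.StrictRedis(host='localhost', port=6379, db=0)

trip_keys = [
    key for key in r.scan_iter(match='trip:*')
    if not key.endswith(b':routes')
]
print(trip_keys)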

Related

How do I conditionally remove text from a string in a column in a Scala dataframe?

I'm currently exploring Azure Databricks for a POC (Scala and Databricks are both completely new to me). I'm using this (Cars - Corgis) sample dataset to show off the manipulation capabilities of Databricks.
My problem is that I have a dataframe column called 'model' that contains data like '2009 Audi A3' and '2005 Mercedes E550'. What I would like to be able to do is alter that column so that, instead of the aforementioned, it reads as 'Audi A3' or 'Mercedes E550'. I have a separate model year column, so I'm trying to reduce the size of the columns where possible.
From what I have seen, replaceAllIn doesn't seem to work on anything other than plain strings in Scala.
This is my code so far:
//Use the dataframe from the previous cell and trim the model year from the model column so for example it reads as 'Audi A3' instead of '2009 Audi A3'
import scala.util.matching.Regex
val modelPrefixPatternMatch = "[0-9 ]".r
val newModel = modelPrefixPatternMatch.replaceAllIn((specificColumnsDf.select("model")),"")
However, when I run this code, I get the following error message:
command-1778339999318469:5: error: overloaded method value replaceAllIn with alternatives:
(target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
(target: CharSequence,replacement: String)String
cannot be applied to (org.apache.spark.sql.DataFrame, String)
val newModel = modelPrefixPatternMatch.replaceAllIn((specificColumnsDf.select("model")),"")
I have also tried doing this with Spark SQL but didn't have any luck there either.
Thanks!
In Spark you would normally add additional columns using withColumn and then select only the columns you want. In this simple example, I use the regexp_replace function to trim out the years, something like this:
%scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
df
.withColumn("cleanColumn", regexp_replace($"`Identification.Model Year`", "20[0-2][0-9] ","") )
.select($"`Identification.Model Year`", $"cleanColumn").distinct
.show(false)
My results:
We could probably make the regular expression tighter, e.g. tie it to the start of the column or open it up for years like 1980, 1990, etc.; this is just an example.
If the year is always at the start then you could just use substring and start at position 5. The regex approach at least protects against the year not being there for some records.
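For illustration only, here is a rough PySpark sketch of that tightened, start-anchored pattern (my own sketch, not from the original answer; the Scala version is analogous, and the tiny inline DataFrame with a 'model' column just mirrors the question's examples):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2009 Audi A3",), ("2005 Mercedes E550",)], ["model"]
)

# Anchor the year to the start of the value and accept 19xx as well as 20xx.
cleaned = df.withColumn(
    "model_clean", F.regexp_replace(F.col("model"), r"^(19|20)\d{2} ", "")
)
cleaned.show(truncate=False)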
HTH

HiveQL: Parse strings and count

I am using HiveQL to work with millions of rows of domain name text data stored in HDFS. The following is a hand-selected subset to illustrate lexical diversity. There are duplicate entries.
dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.
mgmtsubnet.mgmtvcn.oraclevcn.com.
asdf.mgmtvcn.oraclevcn.com.
dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.
localhost.
a.localhost.
img.pulsemgr.com.
36.136.154.156.in-addr.arpa.
accounts.spotify.com.
_dmarc.ixia-devops.com.
&eventtype=close&reason=4&duration=35.
&eventtype=close&reason=3&duration=10336.
I am trying to get a count of the number of rows based on the last two levels of the domain, where sometimes the 2nd level is absent (e.g. localhost.). For example:
domain_root count
oraclevcn.com. 4
localhost. 1
a.localhost. 1
pulsemgr.com. 1
in-addr.arpa. 1
spotify.com. 1
ixia-devops.com 1
It would also be nice to see how to filter out domains where the 2nd level is absent.
I am not sure where to start. I have seen use of the SPLIT() function, but that may not be robust since there could be many levels to a domain name, for example: a.b.c.d.e.f.g.h.i etc.
Any ideas or implementations are appreciated.
Below is a query using regexp_extract.
select domain_root, count(*)
from (
  select regexp_extract('dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.',
                        '[A-Za-z0-9-]+\.[A-Za-z0-9-]+\.$', 0) as domain_root
  from table
) A
group by A.domain_root
-- replace the first argument of regexp_extract with your column name
The regex extracts the domain root, allowing alphanumeric characters and the special character '-'.
Hope this helps.
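To sanity-check that pattern outside Hive, here is a small Python sketch of mine (not from the answer) that applies the same regular expression to the sample rows and counts the roots; note that single-label entries such as localhost. and the querystring-like junk rows simply don't match, which also covers the "filter out domains where the 2nd level is absent" part:
import re
from collections import Counter

rows = [
    "dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.",
    "mgmtsubnet.mgmtvcn.oraclevcn.com.",
    "asdf.mgmtvcn.oraclevcn.com.",
    "dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.",
    "localhost.",
    "a.localhost.",
    "img.pulsemgr.com.",
    "36.136.154.156.in-addr.arpa.",
    "accounts.spotify.com.",
    "_dmarc.ixia-devops.com.",
    "&eventtype=close&reason=4&duration=35.",
    "&eventtype=close&reason=3&duration=10336.",
]

# Same character class as the Hive regex: the last two dot-separated labels.
pattern = re.compile(r"[A-Za-z0-9-]+\.[A-Za-z0-9-]+\.$")

counts = Counter(m.group(0) for row in rows if (m := pattern.search(row)))
for domain_root, count in counts.most_common():
    print(domain_root, count)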

Extract data from stdout_lines in Ansible

I'm trying to extract a specific number from my stdout_lines in Ansible and use it as a variable. I'm running a show command in my playbook, and all I want to get from the output is the highest sequence number from my crypto map. For example, this is my playbook:
- asa_command:
    commands:
      - show run crypto map
    provider: "{{ base_provider }}"
  register: result

- debug: var=result.stdout_lines
This produces the output fine but I'm not sure how to go about extracting the sequence number from the following (I have omitted most of the crypto map just to make it easier to explain).
"crypto map map1 60 set ikev1 transform-set test",
"crypto map map1 60 set security-association lifetime seconds 3600",
"crypto map map1 61 set peer 1.1.1.1 ",
"crypto map map1 61 set ikev1 transform-set test1",
"crypto map map1 61 set security-association lifetime seconds 3600",
"crypto map map1 interface outside"
So basically, I would like to extract the highest sequence number (in this case "61") so I can input it as a variable in the same playbook. Any thoughts would be appreciated :-)
I tried looking at some jinja2 filters but I couldn't figure out what would be most appropriate for my usage.
http://ansible-docs.readthedocs.io/zh/stable-2.0/rst/playbooks_filters.html
I also tried the suggestions on this page but I didn't get far with that either.
ansible parse text string from stdout
Note that I'm pants-ing this in a notepad without full access to the tools, so please check my syntax, especially on those double-backslash escapes. That said, let's take a stab at a chain of filters that gets what you need. How about:
- debug: msg="{{ result.stdout |
regex_findall ('^"crypto map map1 \\d\\d set ') |
regex_replace ('^"crypto map map1 (\\d\\d) set .*',
'\\1') |
max
}}"

redis: matching partial keys of hash

In a hash, I have a bunch of key-value pairs.
My keys are in the following format: name:city
john:newyork
kate:chicago
lisa:atlanta
I'm using Python to access Redis, and in https://redis-py.readthedocs.org/en/latest/ I don't see any hash operation that does partial matching.
I would like to be able to get all keys in the hash that match a given city name.
Is that possible?
It is possible, though not with HASH objects but with sorted sets. As long as all elements in a sorted set have the same score, you can do lexicographical prefix matching.
Let's say you do the following (raw Redis commands, but the same applies with the Python client):
ZADD foo 0 john:newyork:<somevalue>
ZADD foo 0 john:chicago:<somevalue>
ZADD foo 0 kate:chicago:<somevalue>
....
You can then query by using ZRANGEBYLEX:
ZRANGEBYLEX foo [john: (john:\xff
will give you all entries that start with john, and you can extract the value with regular expressions or splitting.
Note that this is a prefix search and not a suffix search. If you want "all entries in New York" you need to reverse the order in the sorted set.
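A rough redis-py version of the same idea (my own sketch, not from the answer; the connection details and the valueN placeholders standing in for <somevalue> are assumptions, and the zadd mapping form is redis-py 3.x syntax):
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# All members share score 0, so the sorted set orders them lexicographically.
r.zadd('foo', {
    'john:newyork:value1': 0,
    'john:chicago:value2': 0,
    'kate:chicago:value3': 0,
})

# Prefix query: everything starting with "john:" (bytes keep the \xff bound exact).
for entry in r.zrangebylex('foo', b'[john:', b'(john:\xff'):
    print(entry)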
I was able to achieve partial matching of hash keys with:
import redis

pool = redis.ConnectionPool(host='localhost', port=6379, db=0)
r = redis.StrictRedis(connection_pool=pool)

# Issue HSCAN with a MATCH pattern as a raw command.
cmd = "hscan <hashname> 0 match *:atlanta"
print(r.execute_command(cmd))

Counting unique login using Map Reduce

Let's say I have a very big log file in this kind of format (based on where a user logged in):
UserId1 , New York
UserId1 , New Jersey
UserId2 , Oklahoma
UserId3 , Washington DC
....
userId999999999, London
Note that UserId1 logged in from New York first and then flew to New Jersey and logged in again from there.
If I need to get the number of unique user logins (meaning 2 logins with the same userid count as 1), how should I map and reduce it?
My initial plan is to map it first to this kind of format:
UserId1, 1
UserId1, 1
UserId2, 1
UserId3, 1
And then reduce it to
UserId1, 2
UserId2, 1
UserId3, 1
But wouldn't this still produce a lot of output (especially if the common behaviour is for a user to log in only 1 or 2 times a day)? Or is there a better way to implement this?
Do map-reduce.
For example, say you have 10,000 lines of data but can only process 1,000 lines at a time.
Then process the data as 10 batches of 1,000 lines, deduplicating the user ids in each batch.
If the combined output of the 10 batches is still more than 1,000 lines:
do the step above again on that output.
else:
deduplicate it directly with a set.
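A rough Python sketch of that batched idea (my own illustration; the file name and batch size are made up):
from itertools import islice

def count_unique_users(log_path, batch_size=1000):
    # Deduplicate user ids batch by batch; the running set only ever holds
    # unique ids, which is far smaller than the raw log.
    seen = set()
    with open(log_path) as log:
        while True:
            batch = list(islice(log, batch_size))
            if not batch:
                break
            seen.update(line.split(',')[0].strip() for line in batch)
    return len(seen)

print(count_unique_users('logins.txt'))  # hypothetical file name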
I recommend making use of a custom key in the map phase. You can refer to the tutorial here for writing and using custom keys. The custom key should have two parts: 1) userid and 2) placeid. So essentially, in the mapper phase you are doing this:
emit(<userid, place>, 1)
In the reduce phase, you just have to access the key and emit the two parts of the key separately.
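As a rough illustration of the overall flow (my own Python sketch in Hadoop-streaming style, not taken from either answer), a mapper that emits the user id as the key and a reducer that counts distinct keys could look like this; it assumes a single reducer sees the sorted keys:
import sys

def mapper(lines):
    # Emit the user id as the key; the value is irrelevant for a distinct count.
    for line in lines:
        user_id = line.split(',')[0].strip()
        if user_id:
            print(f"{user_id}\t1")

def reducer(lines):
    # Keys arrive grouped and sorted after the shuffle, so counting distinct
    # user ids is just counting key changes.
    unique_users = 0
    previous_key = None
    for line in lines:
        key = line.split('\t')[0]
        if key != previous_key:
            unique_users += 1
            previous_key = key
    print(unique_users)

if __name__ == "__main__":
    # Run as "python count_logins.py map" or "python count_logins.py reduce".
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)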