Regex to calculate straight poker hand?

Is there a regex to detect a straight in a poker hand?
I'm using strings to represent the sorted cards, like:
AAAAK#sssss = four aces and a king, all spades.
A2345#ddddd = a straight flush, all diamonds.
In Java, I'm using these regexes:
regexPair = Pattern.compile(".*(\\w)\\1.*#.*");
regexTwoPair = Pattern.compile(".*(\\w)\\1.*(\\w)\\2.*#.*");
regexThree = Pattern.compile(".*(\\w)\\1\\1.*#.*");
regexFour = Pattern.compile(".*(\\w)\\1{3}.*#.*");
regexFullHouse = Pattern.compile("((\\w)\\2\\2(\\w)\\3|(\\w)\\4(\\w)\\5\\5)#.*");
regexFlush = Pattern.compile(".*#(\\w)\\1{4}");
How can I detect a straight (a sequence of face values) with a regex?
EDIT
I opened another question to solve the same problem using the ASCII values of the characters, to keep the regex short. Details here.
Thanks!

I have to admit that regular expressions are not the first tool I would have thought of for doing this. I can pretty much guarantee that any RE capable of doing that to an unsorted hand is going to be far more hideous and far less readable than the equivalent procedural code.
Assuming the cards are sorted by face value (and they seem to be, otherwise your listed regexes wouldn't work either), and you must use a regex, you could use a construct like
2345A|23456|34567|...|9TJQK|TJQKA
to detect the face value part of the hand.
In fact, from what I gather here of the "standard" hands, the following should be checked in order of decreasing priority:
Royal/straight flush: "(2345A|23456|34567|...|9TJQK|TJQKA)#(\\w)\\2{4}"
Four of a kind: ".*(\\w)\\1{3}.*#.*"
Full house: "((\\w)\\2\\2(\\w)\\3|(\\w)\\4(\\w)\\5\\5)#.*"
Flush: ".*#(\\w)\\1{4}"
Straight: "(2345A|23456|34567|...|9TJQK|TJQKA)#.*"
Three of a kind: ".*(\\w)\\1\\1.*#.*"
Two pair: ".*(\\w)\\1.*(\\w)\\2.*#.*"
One pair: ".*(\\w)\\1.*#.*"
High card: (none)
Basically, those are the same as yours except I've added the royal/straight flush and the straight. Provided you check them in order, you should get the best score from the hand. There's no regex for the high card since, at that point, it's the only score you can have.
I also changed the steel wheel (wrap-around) straights from A2345 to 2345A since they'll be sorted that way.
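To make the ordering concrete, here is a minimal sketch of the priority check. It is in Python for brevity (the same pattern strings drop into Java's Pattern.matches), and it spells out the full straight list under the assumptions above: faces sorted low to high, T for ten, and the ace sorted high except in the 2345A wheel:
import re

STRAIGHTS = '2345A|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA'

# Patterns in decreasing order of hand strength; the first match wins.
RANKED = [
    ('Straight flush',  rf'({STRAIGHTS})#(\w)\2{{4}}'),
    ('Four of a kind',  r'.*(\w)\1{3}.*#.*'),
    ('Full house',      r'((\w)\2\2(\w)\3|(\w)\4(\w)\5\5)#.*'),
    ('Flush',           r'.*#(\w)\1{4}'),
    ('Straight',        rf'({STRAIGHTS})#.*'),
    ('Three of a kind', r'.*(\w)\1\1.*#.*'),
    ('Two pair',        r'.*(\w)\1.*(\w)\2.*#.*'),
    ('One pair',        r'.*(\w)\1.*#.*'),
]

def score(hand):
    for name, pattern in RANKED:
        if re.fullmatch(pattern, hand):  # fullmatch mirrors Java's matches()
            return name
    return 'High card'

print(score('2345A#ddddd'))  # Straight flush
print(score('AAAAK#sssss'))  # Four of a kind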

I rewrote the regex for this because I found it frustrating and confusing; groupings make much more sense for this type of logic. The sorting is done with JavaScript's standard array sort method, hence the strange order of the cards: they are in alphabetical order. I did mine in JavaScript, but the regexes could be applied in Java.
hands = [
  { regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#(.)\2{4}.*/g , name: 'Straight flush' },
  { regex: /(.)\1{3}.*#.*/g , name: 'Four of a kind' },
  { regex: /((.)\2{2}(.)\3{1}#.*|(.)\4{1}(.)\5{2}#.*)/g , name: 'Full house' },
  { regex: /.*#(.)\1{4}.*/g , name: 'Flush' },
  { regex: /(2345A|23456|34567|45678|56789|6789T|789JT|89JQT|9JKQT|AJKQT)#.*/g , name: 'Straight' },
  { regex: /(.)\1{2}.*#.*/g , name: 'Three of a kind' },
  { regex: /(.)\1{1}.*(.)\2{1}.*#.*/g , name: 'Two pair' },
  { regex: /(.)\1{1}.*#.*/g , name: 'One pair' },
];

Related

NLP with fewer than 20 words on Google Cloud

According to this documentation, the classifyText method requires at least 20 words:
https://cloud.google.com/natural-language/docs/classifying-text#language-classify-content-nodejs
If I send in fewer than 20 words, I get this no matter how clear the content is:
Invalid text content: too few tokens (words) to process.
Looking for a way to work around this without disrupting the NLP too much. Are there neutral vector words that can be appended to short phrases that would allow classifyText to process them anyway?
ex.
async function quickstart() {
  const language = require('@google-cloud/language');
  const client = new language.LanguageServiceClient();

  // Fewer than 20 words. What if I append some other neutral words
  // (a, of, it, to), or would it be better to repeat the phrase?
  const text = 'The Atlanta Braves is the best team.';
  const document = {
    content: text,
    type: 'PLAIN_TEXT',
  };

  const [classification] = await client.classifyText({document});
  console.log('Categories:');
  classification.categories.forEach(category => {
    console.log(`Name: ${category.name}, Confidence: ${category.confidence}`);
  });
}
quickstart();
The problem with this is you're adding bias no matter what kind of text you send.
Your only option is to pad your string up to the minimum word count with filler words that will be filtered out by the preprocessor and tokenizer before they reach the neural network.
I would try to add a string suffix at the end of the sentence with just stopwords from NLTK, like this:
document.content += ". and ourselves as herself for each all above into through nor me and then by doing"
Why the end? Because usually text has more information at the beginning.
In case Google does not filter stopwords behind the scenes (which I doubt), this would add just white noise where the network has no focus or attention.
Remember: DO NOT add this string when you have enough words because you are billed for 1K character blocks before they are filtered.
I would also add that string suffix to the sentences in your train/test/validation set that have fewer than 20 words and see how it works. The network should learn to ignore the appended words.
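To illustrate, here is a small Python sketch of that padding idea; the helper name pad_to_min_words is mine, and it assumes NLTK is installed with its stopword corpus downloaded (nltk.download('stopwords')):
from nltk.corpus import stopwords

MIN_WORDS = 20  # classifyText's documented minimum

def pad_to_min_words(text, min_words=MIN_WORDS):
    # Hypothetical helper: appends NLTK stopwords only when the text is
    # too short, so long texts are not padded (or billed) unnecessarily.
    words = text.split()
    missing = min_words - len(words)
    if missing <= 0:
        return text
    return text + '. ' + ' '.join(stopwords.words('english')[:missing])

print(pad_to_min_words('The Atlanta Braves is the best team.'))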

Map Reduce Removing Duplicates

I have been given a large text file and want to find the number of different words that start with each letter. I am trying to understand input and output values for map and reduce functions.
I understand a simpler problem which does not need to deal with duplicate words: determine the frequency with which each letter of the alphabet starts a word in the text using map reduce.
Map input: <0, “everyday i am city in tomorrow easy over school i iterate tomorrow city community”>
Map output: [<e,1>,<i,1>,<a,1>,<c,1>,<i,1>,<t,1>,<e,1>,<o,1>,<s,1>,<i,1>,<i,1>,<t,1>,<c,1>,<c,1>]
Reduce input: <a,[1]>,<c,[1,1,1]>,<e,[1,1]>,<i,[1,1,1,1]>,<o,[1]>,<s,[1]>,<t,[1,1]>
Reduce output: [<a,1>,<c,3>,<e,2>,<i,4>,<o,1>,<s,1>,<t,2>]
For the above problem, the words 'i', 'city' and 'tomorrow' appear more than once, so my final output should be:
Reduce output: [<a,1>,<c,2>,<e,2>,<i,3>,<o,1>,<s,1>,<t,1>]
I am unsure how I would ensure duplicate words are removed in the above problem (would it be done in a preprocessing phase, or could it be implemented in either the map or reduce functions?). If I could get help understanding the map and reduce outputs of the new problem, I would appreciate it.
You can do it in two map-reduce passes:
find all the distinct words, by using the word as the map output and, in reduce, outputting each word once
the problem you already solved: find the frequency of each initial letter over the unique words
Alternatively, you might think of caching the words in the mapper and outputting each one (or its first letter) only once, so the reduce would be identical to your simpler problem. That alone won't work, though, because the same word can appear in different mappers. But you can still cache the words in the mapper in the first solution and output each word only once per mapper: a little less traffic between map and reduce.
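To make the data flow concrete, here is a minimal in-memory sketch of the two passes, with plain Python dictionaries standing in for a real MapReduce framework:
from collections import defaultdict

text = ('everyday i am city in tomorrow easy over school '
        'i iterate tomorrow city community')

# Pass 1: map emits <word, 1>; reduce emits each distinct word once.
pass1 = defaultdict(list)
for word in text.split():
    pass1[word].append(1)
unique_words = list(pass1)

# Pass 2: map emits <first letter, 1> per unique word; reduce sums per key.
pass2 = defaultdict(int)
for word in unique_words:
    pass2[word[0]] += 1

print(sorted(pass2.items()))
# [('a', 1), ('c', 2), ('e', 2), ('i', 3), ('o', 1), ('s', 1), ('t', 1)]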
Maybe something like this would help (in Swift),
let str = "everyday i am city in tomorrow easy over school i iterate tomorrow city community"
let duplicatesRemoved = Set(str.split(separator: " "))
Output:
["city", "community", "tomorrow", "easy", "everyday", "over", "in", "iterate", "i", "am", "school"]
And maybe you don't need those map statements and can achieve something like this,
Code
var varCount = [Character: Int]()
for subStr in duplicatesRemoved {
    if let firstChar = subStr.first {
        varCount[firstChar] = (varCount[firstChar] ?? 0) + 1
    }
}
Output
["i": 3, "t": 1, "e": 2, "c": 2, "s": 1, "a": 1, "o": 1]

Pandas: Grouping rows by list in CSV file?

In an effort to make our budgeting life a bit easier and to help myself learn, I am creating a small program in Python that takes data from our exported bank CSV.
I will give you an example of what I want to do with this data. Say I want to group all of my fast food expenses together. There are many different names with different totals in the description column, but I want to see them all tabulated as one "Fast Food" expense.
For instance, the CSV is set up like this:
Date Description Debit Credit
1/20/20 POS PIN BLAH BLAH ### 1.75 NaN
I figured out how to group them with an OR pattern:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
I would ultimately like to have it read from a list. I would like to group all my expenses into categories, and check those category variable names so that it would only output from that list.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also, I have looked through quite a few posts here on Stack Overflow and have yet to find the answer (although I am sure I overlooked it).
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
import re
import pandas as pd

df = pd.DataFrame({"description": ['Macdonald something', 'Whataburger something', 'pizza hut something',
                                   'Whataburger something', 'Macdonald something', 'Macdonald otherthing'],
                   "debit": [1.75, 2.0, 3.5, 4.5, 1.5, 2.0]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})', flags=re.I)
print(df.groupby("found").sum())
             debit
found
Macdonald     5.25
Whataburger   6.50
pizza hut     3.50
Use dynamic pattern building:
import re

fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags=re.I, regex=True)]
The \b word boundaries match whole words, not partial words.
re.escape protects special characters so they are parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex

Matching PoS tags with specific text with `textacy.extract.pos_regex_matches(...)`

I'm using textacy's pos_regex_matches method to find certain chunks of text in sentences.
For instance, assuming I have the text: Huey, Dewey, and Louie are triplet cartoon characters., I'd like to detect that Huey, Dewey, and Louie is an enumeration.
To do so, I use the following code (on textacy 0.3.4, the version available at the time of writing):
import textacy
sentence = 'Huey, Dewey, and Louie are triplet cartoon characters.'
pattern = r'<PROPN>+ (<PUNCT|CCONJ> <PUNCT|CCONJ>? <PROPN>+)*'
doc = textacy.Doc(sentence, lang='en')
lists = textacy.extract.pos_regex_matches(doc, pattern)
for list in lists:
    print(list.text)
which prints:
Huey, Dewey, and Louie
However, if I have something like the following:
sentence = 'Donald Duck - Disney'
then the - (dash) is recognised as <PUNCT> and the whole sentence is recognised as a list -- which it isn't.
Is there a way to specify that only , and ; are valid <PUNCT> for lists?
I've looked for some reference about this regex language for matching PoS tags with no luck; can anybody help? Thanks in advance!
PS: I tried to replace <PUNCT|CCONJ> with <[;,]|CCONJ>, <;,|CCONJ>, <[;,]|CCONJ>, <PUNCT[;,]|CCONJ>, <;|,|CCONJ> and <';'|','|CCONJ> as suggested in the comments, but it didn't work...
In short, it is not possible: see this official page.
However, the merge request contains the code of the modified version described on that page, so one can recreate the functionality, although it performs worse than spaCy's Matcher (see code and example, though I have no idea how to reimplement my problem using a Matcher).
If you want to go down this lane anyway, you have to change the line:
words.extend(map(lambda x: re.sub(r'\W', '', x), keyword_map[w]))
with the following:
words.extend(keyword_map[w])
otherwise every symbol (like , and ; in my case) will be stripped off.

How to find the day of the week from a timestamp

I have a timestamp, 2015-11-01 21:45:25,296, like I mentioned above. Is it possible to extract the day of the week (Mon, Tue, etc.) using any regular expression or grok pattern?
Thanks in advance
This is quite easy if you want to use the ruby filter. I am lazy, so that is all I am doing.
Here is my filter:
filter {
    ruby {
        code => "
            p = Time.parse(event['message']);
            event['day-of-week'] = p.strftime('%A');
        "
    }
}
The 'message' variable is the field that contains your timestamp.
With stdin and stdout and your string, you get:
artur#pandaadb:~/dev/logstash$ ./logstash-2.3.2/bin/logstash -f conf2/
Settings: Default pipeline workers: 8
Pipeline main started
2015-11-01 21:45:25,296
{
        "message" => "2015-11-01 21:45:25,296",
       "@version" => "1",
     "@timestamp" => "2016-08-03T13:07:31.377Z",
           "host" => "pandaadb",
    "day-of-week" => "Sunday"
}
Hope that is what you need,
Artur
What you want is:
Assuming your string is 2015-11-01 21:45:25,296:
mydate='2015-11-01 21:45:25,296'
date +%a -d "${mydate% *}"
This will give you what you want.
The short answer is no, you can't.
A regex, according to Wikipedia:
...is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations.
So, a regex allows you to parse a string; it searches for information within the string, but it doesn't perform calculations on it.
If you want to make such calculations, you need help from a programming language (Java, C#, or Ruby, like @pandaadb suggested) or some other tool that makes those calculations (e.g. Epoch Converter).
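For example, in Python the parse-then-compute step is just a couple of lines:
from datetime import datetime

stamp = '2015-11-01 21:45:25,296'
parsed = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S,%f')  # %f accepts the ,296 fraction
print(parsed.strftime('%a'))  # Sun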