I am new to reinforcement learning. I recently learned about approximate Q-learning, or feature-based Q-learning, in which you describe states by features to save space. I have tried to implement this in a simple grid game. Here, the agent is supposed to learn not to go into a firepit (signaled by an f) and instead to eat up as many dots as possible. Here is the grid used:
...A
.f.f
.f.f
...f
Here A signals the agent's starting location. When implementing this, I set up two features. One was 1/((distance to closest dot)^2), and the other was (distance to firepit) + 1. When the agent enters a firepit, the program returns with a reward of -100. If it goes to a non-firepit position that was already visited (and thus there is no dot to be eaten), the reward is -50. If it goes to an unvisited dot, the reward is +500. In the above grid, no matter what the initial weights are, the program never learns the correct weight values. Specifically, in the output, the first training session gains a score (how many dots it ate) of 3, but for all other training sessions the score is just 1, and the weights converge to an incorrect value of -125 for weight 1 (distance to firepit) and 25 for weight 2 (distance to unvisited dot). Is there something specifically wrong with my code, or is my understanding of approximate Q-learning incorrect?
I have tried to play around with the rewards that the environment is giving and also with the initial weights. None of these have fixed the problem.
Here is the link to the entire program: https://repl.it/repls/WrongCheeryInterface
Here is what is going on in the main loop:
while(points != NUMPOINTS){
    bool playerDied = false;
    if(!start){
        if(!atFirepit()){
            r = 0;
            if(visited[player.x][player.y] == 0){
                points += 1;
                r += 500;
            }else{
                r += -50;
            }
        }else{
            playerDied = true;
            r = -100;
        }
    }
    //Update visited
    visited[player.x][player.y] = 1;
    if(!start){
        //This is based off the q learning update formula
        pairPoint qAndA = getMaxQAndAction();
        double maxQValue = qAndA.q;
        double sample = r;
        if(!playerDied && points != NUMPOINTS)
            sample = r + (gamma2 * maxQValue);
        double diff = sample - qVal;
        updateWeights(player, diff);
    }
    // checking end game condition
    if(playerDied || points == NUMPOINTS) break;
    pairPoint qAndA = getMaxQAndAction();
    qVal = qAndA.q;
    int bestAction = qAndA.a;
    //update player and q value
    player.x += dx[bestAction];
    player.y += dy[bestAction];
    start = false;
}
I would expect both weights to still be positive, but one of them is negative (the one giving distance to the firepit).
I also expected the program to learn over time that it is bad to enter a firepit and also bad, but not as bad, to go to an already visited position with no dot left.
Probably not the answer you want to hear, but:
Have you tried to implement the simpler tabular Q-learning before approximate Q-learning? In your setting, with a few states and actions, it will work perfectly. If you are learning, I strongly recommend you start with the simpler cases in order to get a better understanding/intuition about how Reinforcement Learning works.
Do you know the implications of using approximators instead of learning the exact Q function? In some cases, due to the complexity of the problem (e.g., when the state space is continuous), you should approximate the Q function (or the policy, depending on the algorithm), but this may introduce some convergence problems. Additionally, in your case, you are trying to hand-pick some features, which usually requires deep knowledge of the problem (i.e., the environment) and the learning algorithm.
Do you understand the meaning of the hyperparameters alpha and gamma? You cannot choose them randomly. Sometimes they are critical to obtaining the expected results, not always, depending heavily on the problem and the learning algorithm. In your case, looking at the convergence curve of your weights, it's pretty clear that you are using a value of alpha that is too high. As you pointed out, after the first training session your weights remain constant.
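For what it's worth, the tabular update suggested above is only a few lines. Here is a minimal sketch; the state encoding, the state/action counts, and the alpha/gamma values are placeholder assumptions, not taken from your program:
// Minimal tabular Q-learning update (a sketch, not tied to your code).
// S = number of states, A = number of actions; encode a grid cell
// (plus whatever extra information you need) as an integer state id.
#include <algorithm>

const int S = 16, A = 4;
double Q[S][A] = {};              // Q-table, initialised to zero
double alpha = 0.1;               // learning rate: how far each sample moves the estimate
double gammaDiscount = 0.9;       // discount factor: how much future reward matters

void qUpdate(int s, int a, double r, int sNext, bool terminal)
{
    double maxNext = 0.0;
    if (!terminal)
    {
        maxNext = Q[sNext][0];
        for (int a2 = 1; a2 < A; ++a2)
            maxNext = std::max(maxNext, Q[sNext][a2]);
    }
    double sample = r + gammaDiscount * maxNext;   // bootstrapped target
    Q[s][a] += alpha * (sample - Q[s][a]);         // move a fraction alpha toward it
}
With a table this small you can print all the Q-values after every episode and watch how quickly they move, which makes the effect of alpha and gamma much easier to see than staring at two weights.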
Therefore, practical recommendations:
Be sure to solve your grid game using a tabular Q-learning algorithm before trying more complex things.
Experiment with different values of alpha, gamma and rewards.
Read more in depth about approximate RL. A very good and accessible book (starting from zero knowledge) is the classic Sutton and Barto book, Reinforcement Learning: An Introduction, which you can obtain for free and which was updated in 2018.
To start, I want to thank everyone who has helped me so far with previous problems I have had working through the CGAL library; it is greatly appreciated.
Background on myself: I am still very new to C++ and my coding experience is in MATLAB, so there are a lot of concepts that I am learning very quickly and that are therefore very new to me, so please excuse any erroneous language I may use with regard to C++.
The Problem:
I recently wrote some code that finds the Minkowski sum of a polyline and a circle (i.e., the buffer of a polyline) using the code found in the documentation of Boolean Set Operations on General Polygons.
Here, a General_polygon_set_2 concept is utilized in the output, and if the output code from the example above is used, I can get the following output of a Polygon_with_holes_2 class:
48 [775.718 -206.547 --> 769.134 -157.991] (769 -157 1 1) [769.134 -157.991 --> 770 -157] (769 -157 1 1) [770 -157 --> 768.866 -156.009] [768.866 -156.009 --> 762.282 -107.453] [762.282 -107.453 --> 703.282 -115.453] [703.282 -115.453 --> 708.072 -150.778] ...
7 15 [549.239 -193.612 --> 569.403 -216.422] ... 3 [456.756 -657.812 --> 657.930 908.153] ...
Here, if I understand correctly, the first integer refers to the number of vertices in the .outer_boundary(), followed by descriptions of the curves for each "edge" of the general polygon. In my problem, the outputs will only consist of linear functions and circular arcs.
Linear: [775.718 -206.547 --> 769.134 -157.991]
Circular Arc (x-monotone): (769 -157 1 1) [769.134 -157.991 --> 770 -157]
The linear element is simple: go from this x-y coordinate to the other one by a line. As for the circular arc, it is a little different: it says to use the circle described by the arguments in the parentheses () to go from one x-y coordinate to the other contained in the brackets []. The arguments to the circle are: (x, y, radius, orientation).
Next, since we have holes, after the .outer_boundary() has been written out, two more integers are displayed. The first one states the number of holes, the second states the number of vertices in this hole, followed by those vertices for that hole. Once that hole is written out, another integer is written describing the number of vertices in the next hole, and this continues for all of the holes, completing the description of the polygon.
So with that, my current problem is parsing out each individual curve one at a time so that I can do operations on them.
I have the following functions from the documentation to work with:
.outer_boundary(): returns the general polygon that represents the outer boundary.
.holes_begin(): returns the begin iterator of the holes.
.holes_end(): returns the past-the-end iterator of the holes.
So my thought is to break the General_polygon_set_2 down into General_polygon_2 objects, then break those down into the .outer_boundary() and the different holes. Finally, for each set of curves, break those down into individual curves.
I am not really sure how to go about this; I just know that I need individual curve data so I can do my own operations on them. Any help will be, as always, greatly appreciated!
Note: I actually deleted this post after reading through the arrangements documentation, thinking that this was too obvious of an answer, but after some time I still really do not see how to pull this info out properly. I think the biggest issue is my lacking knowledge of C++. Sorry about this being a noob-ish question.
Solution in Progress:
list<Polygon_with_holes_2> res;
S.polygons_with_holes (back_inserter (res));
list<Polygon_with_holes_2>::iterator i = res.begin();
Polygon_with_holes_2 mink = *i;
minkOuter = mink.outer_boundary();
cout << minkOuter << endl;
int numHoles = mink.holes_end()-mink.holes_begin();
cout << numHoles << endl;
Now I am working on isolating the holes, followed by breaking those down into each individual curve.
The doc here states that the value_type of a Hole_const_iterator is a General_polygon_2, which means you can iterate through all of the holes using holes_begin() and holes_end(), like you thought. To do that, use the following syntax:
for(auto h_it = mink.holes_begin(); h_it != mink.holes_end(); ++h_it)
{
    //in here h_it is an iterator with value type General_polygon_2, so *h_it will be the polygon describing a hole. Every step of this loop will give you another hole.
}
Then, you can iterate the curves of each polygon with curves_begin() and curves_end() the same way.
So to iterate each curve of a polygon_with_holes:
for(auto h_it = mink.holes_begin(); h_it != mink.holes_end(); ++h_it)
{
    for(auto curve_it = h_it->curves_begin(); curve_it != h_it->curves_end(); ++curve_it)
    {
        //*curve_it gives you a curve.
    }
}
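Putting the pieces together, here is a sketch that visits every curve of the whole polygon_with_holes: the outer boundary first, then each hole. It assumes the same mink variable and typedefs as in your snippet, and that the general polygon type exposes curves_begin()/curves_end() as described above:
// Sketch: visit every curve (line segment or circular arc) of "mink".
const auto& outer = mink.outer_boundary();
for (auto curve_it = outer.curves_begin(); curve_it != outer.curves_end(); ++curve_it)
{
    // *curve_it is one X-monotone curve of the outer boundary.
    std::cout << "outer boundary curve: " << *curve_it << std::endl;
}

for (auto h_it = mink.holes_begin(); h_it != mink.holes_end(); ++h_it)
{
    for (auto curve_it = h_it->curves_begin(); curve_it != h_it->curves_end(); ++curve_it)
    {
        // *curve_it is one X-monotone curve of this hole.
        std::cout << "hole curve: " << *curve_it << std::endl;
    }
}
Each *curve_it is the same kind of object that produced the [x1 y1 --> x2 y2] and (x y r o) pieces in your printed output, so from here you can query its endpoints (and, for arcs, the supporting circle) and run your own computations on them; the exact accessor names depend on the traits class you are using, so check its documentation.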
I would like to use a convolutional LSTM in my research, but I'm having a difficult time figuring out the exact way to implement this class in TensorFlow. Here is what I have so far. I get no errors, but I am seriously doubting my implementation. Can anyone confirm if I am doing this correctly?
n_input = 4
x = tf.placeholder(tf.float32,shape=[None,n_input,HEIGHT,WIDTH,2])
y = tf.placeholder(tf.float32,shape=[None,HEIGHT,WIDTH,2])
convLSTM_cell = tf.contrib.rnn.ConvLSTMCell(
    conv_ndims=2,
    input_shape=[HEIGHT, WIDTH, DEPTH],
    output_channels=2,
    kernel_shape=[3, 3]
)
outputs, states = tf.nn.dynamic_rnn(convLSTM_cell, x, dtype=tf.float32)
weights = tf.Variable(tf.random_normal([3,3,2,2]))
biases = tf.Variable(tf.random_normal([2]))
conv_out = tf.nn.conv2d(outputs[-1],weights,strides=[1,1,1,1],padding='SAME')
out = tf.nn.sigmoid(conv_out + biases)
UPDATE:
printing the size of outputs gives the shape=(?,4,436,1024,2) but I think I want (?,5,436,1024,2) or (?,1,436,1024,2).
UPDATE2:
So according to a fellow lab mate, the 4 outputs correspond to the LSTM outputs for each frame, so it is working correctly. Apparently all I have to do is take output #4, and that is the predicted future time frame.
A stackoverflow confirmation would put my mind at ease on this whole thing.
Yes, you are correct!
The output dimension will match the input dimension. If you actually want the (?,5,436,1024,2) output, you will have to look at the history, state.h. The last four [-4] of it will still correspond to the output.
I am new to Python, coming from MATLAB, and long ago from C. I have written a script in MATLAB which simulates sediment transport in rivers as a Markov process. The code randomly places circles of a random diameter within a rectangular area of a specified dimension. The circles are non-uniform in size, drawn randomly from a specified range of sizes. I do not know how many times I will step through the circle placement operation, so I use a while loop to complete the process. In an attempt to be more community oriented, I am translating the MATLAB script to Python. I used the online tool OMPC to get started, and have been working through it manually from the auto-translated version (it was not that helpful, which is not surprising). To debug the code as I go, I use the MATLAB-generated results to generally compare and contrast against results in Python. It seems clear to me that I have declared variables in a way that introduces problems as calculations proceed in the script. Here are two examples of consistent problems between different instances of code execution. First, the code generated what I think are arrays within arrays, because the script is returning results which look like:
array([[ True]
[False]], dtype=bool)
This result was generated for the following code snippet at the overlap_logix operation:
CenterCoord_Array = np.asarray(CenterCoordinates)
Diameter_Array = np.asarray(Diameter)
dist_check = ((CenterCoord_Array[:,0] - x_Center) ** 2 + (CenterCoord_Array[:,1] - y_Center) ** 2) ** 0.5
radius_check = (Diameter_Array / 2) + radius
radius_check_update = np.reshape(radius_check,(len(radius_check),1))
radius_overlap = (radius_check_update >= dist_check)
# Now actually check the overlap condition.
if np.sum([radius_overlap]) == 0:
    # The new circle does not overlap so proceed.
    newCircle_Found = 1
    debug_value = 2
elif np.sum([radius_overlap]) == 1:
    # The new circle overlaps with one other circle
    overlap = np.arange(0,len(radius_overlap), dtype=int)
    overlap_update = np.reshape(overlap,(len(overlap),1))
    overlap_logix = (radius_overlap == 1)
    idx_true = overlap_update[overlap_logix]
    radius = dist_check(idx_true,1) - (Diameter(idx_true,1) / 2)
A similar result for the same run was produced for variables:
radius_check_update
radius_overlap
overlap_update
Here is the same code snippet for the working MATLAB version (as requested):
distcheck = ((Circles.CenterCoordinates(1,:)-x_Center).^2 + (Circles.CenterCoordinates(2,:)-y_Center).^2).^0.5;
radius_check = (Circles.Diameter ./ 2) + radius;
radius_overlap = (radius_check >= distcheck);
% Now actually check the overlap condition.
if sum(radius_overlap) == 0
    % The new circle does not overlap so proceed.
    newCircle_Found = 1;
    debug_value = 2;
elseif sum(radius_overlap) == 1
    % The new circle overlaps with one other circle
    temp = 1:size(radius_overlap,2);
    idx_true = temp(radius_overlap == 1);
    radius = distcheck(1,idx_true) - (Circles.Diameter(1,idx_true)/2);
In the Python version I have created arrays from lists to more easily operate on the contents (the first two lines of the code snippet). The array-within-array result, and having to create arrays just to access the data, suggest to me that I have incorrectly declared variable types, but I am not sure. Furthermore, some variables have a size of, for example, (2L,) (the numerical dimension will change as circles are placed), where there is no second dimension. This produces obvious problems when I try to use such an array in an operation with another array of size (2L,1L). Because of these problems I started reshaping arrays, and then stopped because I decided these were hacks and that I had declared one or more variables incorrectly. Second, for the same run I encountered the following error:
TypeError: 'numpy.ndarray' object is not callable
for the operation:
radius = dist_check(idx_true,1) - (Diameter(idx_true,1) / 2)
which occurs at the bottom of the above code snippet. I have posted the entire script at the following link because it is probably more useful to execute the script for oneself:
https://github.com/smchartrand/MarkovProcess_Bedload
I have set-up the code to run with some initial parameter values so decisions do not need to be made; these parameter values produce the expected results in the MATLAB-based script, which look something like this when plotted:
So, I seem to specifically be having issues with operations on lines 151-165, depending on the test value np.sum([radius_overlap]), and I think it is because I incorrectly declared variable types, but I am really not sure. I can say with confidence that the Python version and the MATLAB version are consistent in output through the first step of the while loop, and through code line 127, which is entering the second step of the while loop. Below this point in the code the issues documented above eventually cause the script to crash. Sometimes the script executes to 15% complete, and sometimes it does not make it to 5%; this is due to the random nature of circle placement. I am preparing the code in the Spyder (Python 2.7) IDE and will share the working code publicly as a part of my research. I would greatly appreciate any help that can be offered to identify my mistakes and misapplications of Python coding practice.
I believe I have answered my own question, and maybe it will be of use for someone down the road. The main sources of instruction for me can be found at the following three web pages:
Stackoverflow Question 176011
SciPy FAQ
SciPy NumPy for Matlab users
The third web page was very helpful for me coming from MATLAB. Here is the modified and working Python code snippet which relates to the original snippet provided above:
dist_check = ((CenterCoordinates[0,:] - x_Center) ** 2 + (CenterCoordinates[1,:] - y_Center) ** 2) ** 0.5
radius_check = (Diameter / 2) + radius
radius_overlap = (radius_check >= dist_check)
# Now actually check the overlap condition.
if np.sum([radius_overlap]) == 0:
    # The new circle does not overlap so proceed.
    newCircle_Found = 1
    debug_value = 2
elif np.sum([radius_overlap]) == 1:
    # The new circle overlaps with one other circle
    overlap = np.arange(0,len(radius_overlap[0]), dtype=int).reshape(1, len(radius_overlap[0]))
    overlap_logix = (radius_overlap == 1)
    idx_true = overlap[overlap_logix]
    radius = dist_check[idx_true] - (Diameter[0,idx_true] / 2)
In the end it was clear to me that it was more straightforward for this example to use numpy arrays vs. lists to store results for each iteration of filling the rectangular area. For the corrected code snippet this means I initialized the variables:
CenterCoordinates, and
Diameter
as numpy arrays whereas I initialized them as lists in the posted question. This made a few mathematical operations more straightforward. I was also incorrectly indexing into variables with parentheses () as opposed to the correct method using brackets []. Here is an example of a correction I made which helped the code execute as envisioned:
Incorrect: radius = dist_check(idx_true,1) - (Diameter(idx_true,1) / 2)
Correct: radius = dist_check[idx_true] - (Diameter[0,idx_true] / 2)
This example also shows that I had issues with array dimensions which I corrected variable by variable. I am still not sure if my working code is the most pythonic or most efficient way to fill a rectangular area in a random fashion, but I have tested it about 100 times with success. The revised and working code can be downloaded here:
Working Python Script to Randomly Fill Rectangular Area with Circles
Here is an image of a final result from a successful run of the working code:
The main lessons for me were (1) numpy arrays are more efficient for repetitive numerical calculations, and (2) the dimensionality of the arrays I created was not always what I expected, so care must be taken when establishing arrays. Thanks to those who looked at my question and asked for clarification.
I have been trying to write a small program that runs behind Counter-Strike: Global Offensive. The program must be completely external, as there is no way to access live match data or the API during a competitive ranked match. So far, the best approach I could find is as follows:
case WM_TIMER:
    COLORREF curColor, safetyColor;
    for(int a = 1750; a < 1830; a++)
        for (int b = 75; b < 100; b++)
        {
            curColor = GetPixel(csgoDC, a, b);
            safetyColor = GetPixel(csgoDC, 1856, 1060);
            if (GetRValue(curColor) == 255 && GetGValue(curColor) == 255 && GetBValue(curColor) == 255 && GetRValue(safetyColor) != 255)
                PlaySound(_T("C:\\headshot.wav"), NULL, SND_FILENAME | SND_ASYNC);
        }
    break;
This is triggered by a timer. It does not work reliably, and I am fairly certain it won't work at all in fullscreen mode (Hard to test when it isn't reliable to begin with). The indices in the nested for loop correspond to the place on the screen where the killfeed shows up. Here is an example of the UI (From Google)
Screenshot
Note that I am not specifically going for a pixel-based approach; this is just the only solution I could think of.
A much more work-intensive way of doing this would be to get runtime information from your machine, i.e. variables and associated values in memory. This would require, at least to some degree, reverse engineering the CS:GO client code. But, if you could manage it, I imagine it would be the most reliable way of determining the information you desire. (Be wary of a ban-hammer to the face, though.)
I'm a junior programmer and I know the basics of Pascal and C++. I made a Player vs. Computer Tic Tac Toe game, and the game is all finished.
The computer generates a random place on the board where the Os go, and that's not good.
I thought that I should write multiple procedures that check every winning position, so the computer could either try to block the player's Xs or make a winning position itself, BUT that would have cost a lot of time because of all the ifs.
Then I thought of a simpler version with some kind of ifs, but it would still have taken a lot of time to write.
Then I thought deeper: what about a Connect Four game? How on earth would someone manage to check every space available, and how would it be possible to write a function that checks absolutely any winning position or progress of the player/computer? Oh, and wait, that's not ALL: what if the player is doing some tricks so he blocks the computer? How would the computer know that? For sure, that would take ages to program. And I am not talking about something that seems even more impossible: chess.
So here I am, telling myself that there SHOULD be a much simpler way for the computer to search and solve such problems than tons of ifs.
In this case, if any of you know a way of solving this, how can I manage to make the simplest procedure to block and beat the player in a Tic Tac Toe game?
If someone wants to check my code or use it: http://pastebin.com/jhyUn7d1
What you're looking for is Minimax.
Using this algorithm the computer will never lose a Tic Tac Toe game, or you could adjust the depth to which the computer analyzes the moves in order to achieve some kind of medium difficulty.
It's not hard to implement: you should be familiar with recursion and you're set. Of course, the implementation differs according to your code, but the Wikipedia page offers a pretty good starting point.
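If it helps, here is a rough sketch of minimax for a 3x3 board. The board encoding (0 empty, 1 player, 2 computer) and the helper functions are my own assumptions, so adapt them to your representation:
#include <algorithm>

const int EMPTY = 0, PLAYER = 1, COMPUTER = 2;

// True if "who" has three in a row anywhere on the board.
bool hasWon(int b[3][3], int who)
{
    for (int i = 0; i < 3; ++i)
    {
        if (b[i][0] == who && b[i][1] == who && b[i][2] == who) return true; // row i
        if (b[0][i] == who && b[1][i] == who && b[2][i] == who) return true; // column i
    }
    if (b[0][0] == who && b[1][1] == who && b[2][2] == who) return true;     // main diagonal
    if (b[0][2] == who && b[1][1] == who && b[2][0] == who) return true;     // anti-diagonal
    return false;
}

bool boardFull(int b[3][3])
{
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            if (b[r][c] == EMPTY) return false;
    return true;
}

// Score from the computer's point of view: +1 win, -1 loss, 0 draw.
int minimax(int b[3][3], bool computersTurn)
{
    if (hasWon(b, COMPUTER)) return +1;
    if (hasWon(b, PLAYER))   return -1;
    if (boardFull(b))        return 0;

    int best = computersTurn ? -2 : +2;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            if (b[r][c] == EMPTY)
            {
                b[r][c] = computersTurn ? COMPUTER : PLAYER;  // try the move
                int score = minimax(b, !computersTurn);
                b[r][c] = EMPTY;                              // undo it
                best = computersTurn ? std::max(best, score)
                                     : std::min(best, score);
            }
    return best;
}

// Choose the computer's move: the empty square with the best minimax score.
void bestMove(int b[3][3], int &outRow, int &outCol)
{
    int best = -2;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            if (b[r][c] == EMPTY)
            {
                b[r][c] = COMPUTER;
                int score = minimax(b, false);                // player replies next
                b[r][c] = EMPTY;
                if (score > best) { best = score; outRow = r; outCol = c; }
            }
}
For the "medium difficulty" idea mentioned above, you would add a depth parameter and return 0 (or some heuristic score) once the depth limit is reached.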
A Tic Tac Toe algorithm is something like the following (a rough code sketch follows the list):
Take spot if going to win
Take spot if going to lose
Take corner
Take non-corner non-center
Take center
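Here is a sketch of those rules in code; the board encoding (0 empty, 1 X, 2 O) and the makesThreeInARow helper are my own assumptions, not part of the list above:
// True if placing "who" at (r,c) completes three in a row.
bool makesThreeInARow(int b[3][3], int who, int r, int c)
{
    b[r][c] = who;                                 // try the move...
    bool wins =
        (b[r][0] == who && b[r][1] == who && b[r][2] == who) ||
        (b[0][c] == who && b[1][c] == who && b[2][c] == who) ||
        (b[0][0] == who && b[1][1] == who && b[2][2] == who) ||
        (b[0][2] == who && b[1][1] == who && b[2][0] == who);
    b[r][c] = 0;                                   // ...and undo it
    return wins;
}

// Pick a square following the priority list above.
void pickMove(int b[3][3], int me, int opponent, int &outRow, int &outCol)
{
    int order[2] = { me, opponent };               // 1. win if possible, 2. otherwise block
    for (int k = 0; k < 2; ++k)
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                if (b[r][c] == 0 && makesThreeInARow(b, order[k], r, c))
                {
                    outRow = r; outCol = c;
                    return;
                }

    // 3. corner, 4. non-corner non-center (edge), 5. center
    int corners[4][2] = { {0,0}, {0,2}, {2,0}, {2,2} };
    int edges[4][2]   = { {0,1}, {1,0}, {1,2}, {2,1} };
    for (int i = 0; i < 4; ++i)
        if (b[corners[i][0]][corners[i][1]] == 0) { outRow = corners[i][0]; outCol = corners[i][1]; return; }
    for (int i = 0; i < 4; ++i)
        if (b[edges[i][0]][edges[i][1]] == 0) { outRow = edges[i][0]; outCol = edges[i][1]; return; }
    outRow = 1; outCol = 1;                        // center is all that is left
}
This is the whole "tons of ifs" collapsed into one helper and a few loops; it will not play perfectly, but it covers the listed priorities.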
The short answer is "try all the different moves until the game is won, and record which ones lead to computer winning".
Long answer:
For a limited-size TTT game, the number of possible moves before the game is won isn't that large, so simply try each possible move, then recursively try all possible opponent moves, and keep going until the game ends. Give each move a "score" of how well it went (e.g., how many of the resulting games ended successfully for the computer and how many ended successfully for the opponent), and pick the one with the "best" result. Beware that you will probably end up with something that is nearly impossible to win against if you do this well.
I recently dealt with this, although my code was in C#.
I came up with a way of scoring each candidate move. The approach I took creates a score based on the number of moves that would then be required for a win (fewer moves needed results in a higher score).
My algorithm also considers the combined number of moves for multiple squares. As a result, the algorithm favors moves that would produce multiple potential wins (the only real tactic I know about for Tic Tac Toe). For example, it is possible sometimes to make a move that produces two potential wins that must be blocked. Since the opponent can only block one, it produces a win.
I posted my entire code and a description of it in the article A Tic-Tac-Toe Game Engine.
I did this once, a long time ago. I don't know if I still have the code lying around...
Anyway, I created a function, return type int, which returned the square in which the computer should place its piece (assuming 0 is the top-left and 8 is the bottom-right square). Yours uses a 2D array, so it would be a little different.
Anyway, for each row, column and diagonal, check to see if any two pieces on that line belong to the player. If they don't, check for the same but belonging to the computer. For the first line where this is true, check the remaining square: if it's empty, put your piece there for the win. If it's a player-dominated line instead, check that you don't already have a piece there and place yours to block.
const int PlayerPiece = 1;
const int CPiece = 2;
const int Empty = 0;
int board[3][3];
if(board[0][0] == PlayerPiece && board[0][1] == PlayerPiece && board[0][2] == Empty)
{
    //Put_Your_Piece_In_[0][2]
}
You could then go on to change it so that it checks each row, i.e.
int numRows = 3;
for(int i = 0; i < numRows; i++)
{
    if(board[i][0] == PlayerPiece && board[i][1] == PlayerPiece && board[i][2] == Empty)
    {
        //Put_Piece_In_[i][2]
    }
}
Then, do the same for the columns and the two diagonals; a sketch of those checks follows.
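Here is a sketch of the corresponding column and diagonal checks, using the same board and constants assumed above:
// Columns: same idea, with the column index fixed per iteration.
int numCols = 3;
for(int j = 0; j < numCols; j++)
{
    if(board[0][j] == PlayerPiece && board[1][j] == PlayerPiece && board[2][j] == Empty)
    {
        //Put_Piece_In_[2][j]
    }
}

// The two diagonals have to be checked explicitly.
if(board[0][0] == PlayerPiece && board[1][1] == PlayerPiece && board[2][2] == Empty)
{
    //Put_Piece_In_[2][2]
}
if(board[0][2] == PlayerPiece && board[1][1] == PlayerPiece && board[2][0] == Empty)
{
    //Put_Piece_In_[2][0]
}
As with the row example, a complete version also needs to cover the cases where the empty square is the first or middle one in the line, plus the equivalent checks with CPiece for the computer's own winning moves.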
You could always consider that Tic-Tac-Toe is essentially just a magic square, described fairly well here: http://www.sciforums.com/showthread.php?134281-An-isomorphism-Tic-Tac-Toe-on-Magic-Square
There is a perfect strategy for Tic-Tac-Toe available on Wikipedia. It is really simple. Due to the small size of the grid, the number of cases you need to test (e.g., testing whether there are two marks in a row) is very small.