Introduction
The CartPole system is a classic benchmark for nonlinear control. It is an inherently unstable and underactuated mechanical system. Its dynamics are used to understand tasks involving the maintenance of balance, such as walking, control of rocket thrusters and self-balancing mechanical systems.
The system consists of a pole, which acts as an inverted pendulum, attached to a cart. The force applied to the cart can be controlled, and the goal is to swing the pole up and balance it around the upward position.
Our algorithms are applied and tested on a Python simulation that we implemented ourselves. The simulation follows the physical equations of motion and thus represents the problem as it appears in the real world.
Approaches
All our algorithms are based on reinforcement learning.
The state space consists of (θ, θ̇) pairs that describe the angle of the pendulum and its angular velocity. Each state is independent of the others. In addition, we defined three actions: left, right and no action. If the pole is within some angle range around the downward position, the reward is −1. If it is within some angle range around the upward position, the reward is 5. Otherwise, the reward is 0.1, to encourage the algorithm to keep searching for the best solution. To prevent the pole from spinning uncontrollably at high speed and thereby accumulating positive rewards indefinitely, we subtract a value proportional to the pole's absolute horizontal velocity. This encourages the agent to stabilize the pole within the rewarded angle range while minimizing its horizontal velocity.
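The reward shaping above can be sketched as follows. The angle windows (±0.3 rad), the reward magnitudes and the penalty coefficient k are illustrative assumptions, not the exact values used in our experiments:

```python
import math

def reward(theta, x_dot, k=0.1):
    """Illustrative reward for the CartPole swing-up task.

    theta: pole angle, with +pi/2 taken as upright and -pi/2 as hanging
    down; x_dot: horizontal velocity of the pole tip. The angle windows,
    reward magnitudes and coefficient k are assumptions for illustration.
    """
    angle = math.atan2(math.sin(theta), math.cos(theta))  # wrap to (-pi, pi]
    if abs(angle - math.pi / 2) < 0.3:     # near the upright goal position
        r = 5.0
    elif abs(angle + math.pi / 2) < 0.3:   # near the hanging-down position
        r = -1.0
    else:
        r = 0.1
    # penalize high speed so the pole cannot accumulate reward by spinning
    return r - k * abs(x_dot)
```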
It should be noted that the underlying state space is very large and continuous, but it can also be discretized. It is therefore interesting to test whether this choice has any impact on the results. Hence, we propose two different approaches towards a solution:
- Standard Q-Learning with a discrete state space.
- Deep Q-Learning^{[1]} and Linear Q-Learning^{[2]} with a continuous state space.
In addition, we present two different approaches to the exploration vs. exploitation dilemma:
- ε-Greedy.
- SoftMax.
Finally, we define two different system settings:
- Limited movement space: we limit the cart's movement by placing a wall on each side of it.
- Unlimited space: the cart can move along an infinite horizontal axis. This is a simpler problem than the first, since it has fewer predefined limitations.
Methods

Q-Learning:
The Q(state, action) function is represented by a table of (s, a) pairs against Q-values. A state is represented by s = (θ, θ̇) and an action by a ∊ {−f, 0, f}, where f is the force applied to the cart.
The state space is discretized by rounding θ and θ̇ to one decimal place. That is, θ ∊ {0.0, 0.1, ..., 6.1} ⊂ [0, 2π) is limited to only 62 values. However, θ̇ ∊ {0.0, 0.1, ...} is theoretically unbounded. But above some angular velocity the Q-value becomes significantly low, due to the subtraction of the pole's horizontal velocity from the reward; hence, states with high angular velocities are rarely visited.
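A minimal sketch of this discretization; the wrapping of the angle into [0, 2π) and the handling of the rounding edge near 2π are our assumptions:

```python
import math

def discretize(theta, theta_dot):
    """Round (theta, theta_dot) to one decimal place, as described above."""
    t = round(theta % (2 * math.pi), 1)  # wrap the angle into [0, 2*pi)
    if t >= 6.2:  # angles just below 2*pi round back onto the 0.0 grid point
        t = 0.0
    return (t, round(theta_dot, 1))
```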

Exploration vs. Exploitation:
- ε-Greedy: the amount of exploration is globally controlled by a parameter ε that determines the randomness of action selection. One advantage of ε-Greedy is that no memorization of exploration-specific data is required, which makes the method particularly interesting for very large state spaces. At each time step, the agent selects a random action with a fixed probability 0 ≤ ε ≤ 1, instead of greedily selecting one of the learned optimal actions with respect to the Q-function.
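The selection rule can be sketched generically; `Q` here is a hypothetical dictionary mapping (state, action) pairs to Q-values, not our exact code:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else a greedy one.

    Q maps (state, action) pairs to Q-values; unseen pairs default to 0.
    """
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```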
- SoftMax: in contrast to ε-Greedy, SoftMax biases exploration towards promising actions: the action-selection method grades action probabilities by their estimated values. We used the Gibbs distribution as our SoftMax function over the Q-values of the actions:

P(a | s) = e^{Q(s,a)/τ} / Σ_{a'} e^{Q(s,a')/τ},

where τ is a temperature parameter controlling the amount of exploration.
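A sketch of sampling from the Gibbs distribution over Q-values; the temperature value and the `Q` dictionary interface are assumptions for illustration:

```python
import math
import random

def softmax_action(Q, state, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / tau).

    Q-values are shifted by their maximum before exponentiation, a standard
    trick to avoid numerical overflow; it does not change the distribution.
    """
    qs = [Q.get((state, a), 0.0) for a in actions]
    m = max(qs)
    weights = [math.exp((q - m) / tau) for q in qs]
    return random.choices(actions, weights=weights, k=1)[0]
```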

Linear Q-Learning:
In this case the state space is continuous, so we cannot apply a regular Q-Learning algorithm. Instead, we use a linear function approximation. That is, we approximate the true action-value function q_{π} by a linear combination of n state-action features:

q̂(s, a, w) = x(s, a)^{T} w,

where x(s, a) ∊ R^{n} is the feature vector. Given features x, our goal is to choose the parameters w ∊ R^{n} such that the mean-squared error between q̂ and q_{π} is minimized. Doing so using stochastic gradient descent yields the update rule:

w ← w + α (r' + γ max_{a'} q̂(s', a', w) − q̂(s, a, w)) x(s, a),

where r' is the immediate reward associated with taking action a from state s, and s' is the successor state.
Let s = [θ_{0}, θ_{1}, θ̇_{0}, θ̇_{1}], where θ_{0} is the horizontal position of the cart and θ_{1} is the angle of the pole. We define the feature vector x = [f_{1}, f_{2}, f_{3}, f_{4}], where f_{1} = normalAngle(θ_{1}), f_{2} = θ̇_{1}, f_{3} = aθ_{1}, f_{4} = aθ̇_{1}, and normalAngle(φ) is the difference between φ and its closest multiple of 2π.
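The feature construction and gradient step can be sketched as follows. The encoding of the action a as a direction in {−1, 0, 1} and the values of α and γ are assumptions here:

```python
import math

def normal_angle(phi):
    """Difference between phi and its closest multiple of 2*pi."""
    two_pi = 2 * math.pi
    return phi - two_pi * round(phi / two_pi)

def features(theta1, theta1_dot, a):
    """Feature vector x(s, a); a in {-1, 0, 1} encodes the force direction."""
    return [normal_angle(theta1), theta1_dot, a * theta1, a * theta1_dot]

def q_hat(w, x):
    """Linear approximation: the dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_update(w, x, r, q_next_max, q_sa, alpha=0.01, gamma=0.99):
    """One stochastic-gradient step on the mean-squared TD error:
    w <- w + alpha * (r + gamma * max_a' q(s', a') - q(s, a)) * x."""
    td = r + gamma * q_next_max - q_sa
    return [wi + alpha * td * xi for wi, xi in zip(w, x)]
```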

Deep Q-Learning:
We implemented the neural network described in paper [1]. The paper proposes a neural-network-based reinforcement learning controller that is able to learn control policies in a highly data-efficient manner. This makes it possible to apply reinforcement learning directly to real plants; neither a transition model nor a simulation model of the plant is needed for training. The only training information provided to the controller is the set of transition experiences collected from interactions with the real plant. By storing these transition experiences explicitly, they can be reconsidered for updating the neural Q-function in every training step, which results in a stable learning process of a neural Q-value function.
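The core idea of [1], retraining the Q-function on the full set of stored transitions at every step, can be sketched independently of the network architecture. The `q` and `fit` callables below are hypothetical placeholders for the neural network, not a real API:

```python
def nfq_iteration(transitions, q, fit, gamma=0.99, actions=(-1, 0, 1)):
    """One batch-style update in the spirit of the controller in paper [1].

    transitions: list of (s, a, r, s2) tuples collected from the plant.
    q(s, a): current Q-value estimate; fit(inputs, targets): trains a new
    estimate on the whole pattern set and returns it (both placeholders).
    """
    inputs, targets = [], []
    for s, a, r, s2 in transitions:
        # bootstrap target from the stored successor state
        target = r + gamma * max(q(s2, b) for b in actions)
        inputs.append((s, a))
        targets.append(target)
    return fit(inputs, targets)  # re-train on all experiences at once
```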
Results

Q-Learning:
For best results, we examined different values of the discount factor γ, the learning rate α and ε. We measured the quality of the parameters based on the "learning process", i.e. the reward as a function of time: every 100 episodes, all the rewards over those episodes were summed and divided by 100. For a full elaboration on these parameters, please see the report.
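The averaging used for these learning curves can be sketched as:

```python
def learning_curve(rewards, window=100):
    """Average total reward over consecutive blocks of `window` episodes,
    as used for the learning-process plots described above."""
    return [sum(rewards[i:i + window]) / window
            for i in range(0, len(rewards) - window + 1, window)]
```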
After choosing the best values for the parameters, we created a graphical representation of the maximum Q-value for each state, which we call a heatmap. The heatmap was created in order to track the Q-table conveniently.
At the beginning of the training session most states' Q-values are close to zero, and the negative values are around the initial state (−π/2, 0), as shown in figure 5.1. Later, higher values appear around the goal state (π/2, 0) and lower values around (−π/2, 0), as shown in figure 5.2. At the end of the training session there are high values around the goal state, and a trail of states connects it to the initial state, as shown in figure 5.3. This indicates that the agent has learned how to swing the pole up and stabilize it through these specific states.

ε-Greedy vs. SoftMax:
The following graphs describe the reward as a function of time for both algorithms in the unbounded setting:
In addition, we measured how long it took each model to swing the pole up and how long the pole remained stable (averaged over 5 tries).
- ε-Greedy: swing-up took 2.848 seconds and the pole remained stable for 18.992 seconds.
- SoftMax: swing-up took 2.426 seconds and the pole remained stable for 18.924 seconds.
It is clear that ε-Greedy reaches higher rewards in a shorter amount of time. It is also clear that, unlike ε-Greedy, SoftMax stops exploring at some point and therefore converges to a fixed reward. As for the performance of both algorithms, SoftMax is slightly better at the swing-up: after the training process it always chooses the actions that maximize the reward, i.e. lead to the goal state, as opposed to ε-Greedy, which keeps exploring even when the agent already knows the best actions that lead to the goal state.
The following graphs describe the reward as a function of time for both algorithms in the bounded setting, i.e. with a wall on each side of the cart:
Though this problem may be harder, the space constraint does not affect the results much. However, there is a difference between ε-Greedy and SoftMax: it is much harder for ε-Greedy to keep the pole stable, and it takes more time to train it to good results.
The state space stays the same as before, and the limited movability of the cart does not seem to really challenge the agents. However, this setting of the problem is more realistic, since real-life control problems have a finite space, so this is a very positive outcome for us.

Deep Q-Learning vs. Q-Learning:
Unfortunately, our neural network implementation does not seem to learn any reasonable policy. It should be noted that this is the first time we have tried to implement such a network, i.e. one trained on a 'history' component rather than on examples known in advance. Nonetheless, we are still interested in comparing this method with the ones we implemented, so we use existing code^{[3]} for the comparison. The only difference is that the pole's initial position is upward, so the agent only needs to balance it without the swing-up.
We ran our Q-Learning algorithms and the neural network (training step = 1850, Adam optimizer, learning rate = 0.001, cross-entropy loss) for 6 minutes each (or until convergence in the SoftMax case). Both methods gave the same results: the agents were able to learn how to stabilize the pole, and after the training process the pole did not fall even once (out of 10 tests).
We conclude that this aspect of the problem, whether the state space is treated as discrete or continuous, does not affect the ability to solve it.

Linear Q-Learning:
As with the neural network, here too we were unable to learn any reasonable policy.
Conclusions
We conclude that the SoftMax policy is much better than ε-Greedy: SoftMax is much more stable and reliable than ε-Greedy, which tends to stumble often.
These learning algorithms and policies are as good as a neural network for the predefined settings we examined, and they are much simpler. As for other settings, it is hard for us to say how well our models would compare with deep learning, but it is clear that other approaches need to be taken in order to solve related but harder problems.
As for further work, one could consider how to adjust our algorithms to the double inverted pendulum problem. We tried to use the same methods for this problem, i.e. (θ_{0}, θ_{1}, θ̇_{0}, θ̇_{1}) for the state space, but the state space is too big for this approach.
Additional Information
References
[1] Martin Riedmiller, "Neural Reinforcement Learning to Swing-up and Balance a Real Pole".
[2] Fredrik Gustafsson, "Control of Inverted Double Pendulum using Reinforcement Learning".
[3] Neural network, Cart.