PolicyIteration

class safe_learning.PolicyIteration(policy, dynamics, reward_function, value_function, gamma=0.98)

A class for policy iteration.

Parameters:
policy : callable

The policy that maps states to actions.

dynamics : callable

A function that can be called with states and actions as inputs and returns future states.

reward_function : callable

A function that takes the state, action, and next state as input and returns the reward corresponding to this transition.

value_function : instance of DeterministicFunction

The function approximator for the value function. It is used to evaluate the value function at states.

gamma : float

The discount factor for reinforcement learning.
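
A minimal construction sketch follows. All of the callables and the value-function approximator below are illustrative placeholders, not part of safe_learning; in practice value_function has to be a DeterministicFunction instance (e.g. a piecewise linear approximator), whose construction is problem specific and omitted here.

    import safe_learning

    # Placeholder one-dimensional dynamics: x' = x + u.
    def dynamics(states, actions):
        return states + actions

    # Placeholder reward: negative quadratic cost for the transition.
    def reward_function(states, actions, next_states):
        return -(states ** 2 + 0.1 * actions ** 2)

    # Placeholder policy: proportional state feedback.
    def policy(states):
        return -0.5 * states

    # Placeholder: a DeterministicFunction instance (construction omitted).
    value_function = ...

    rl = safe_learning.PolicyIteration(policy, dynamics, reward_function,
                                       value_function, gamma=0.98)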

Methods

bellmann_error(self, states) Compute the squared Bellman error.
discrete_policy_optimization(self, action_space) Optimize the policy for a given value function.
future_values(self, states[, policy, …]) Return the value at the current states.
optimize_value_function(self, **solver_options) Optimize the value function using cvxpy.
value_iteration(self) Perform one step of value iteration.
bellmann_error(self, states)

Compute the squared Bellman error.

Parameters:
states : array

The states at which to compute the Bellman error.

Returns:
error : float

The squared Bellman error.
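
Conceptually, the Bellman error measures the mismatch between the current value estimate and the one-step look-ahead values returned by future_values; how it is aggregated over the states is an implementation detail. A hedged usage sketch, continuing the construction example above:

    import numpy as np

    # Illustrative state grid of shape (n_states, state_dim).
    states = np.linspace(-1., 1., 101)[:, None]

    # Roughly: error ~ || V(states) - future_values(states) ||^2
    error = rl.bellmann_error(states)
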
discrete_policy_optimization(self, action_space, constraint=None)

Optimize the policy for a given value function.

Parameters:
action_space : ndarray

The discrete set of candidate parameter values (actions) to evaluate for each policy parameter. This is geared towards piecewise linear function approximators.

constraint : callable

A function that can be called with a policy and returns the slack of the safety constraint for each state. A policy is safe if the slack is >= 0 for all constraints.
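
A usage sketch, continuing the example above; the action grid and the constraint callable are illustrative assumptions rather than part of the API:

    import numpy as np

    # Discretized candidate action values (assumed shape: (n_actions, action_dim)).
    action_space = np.linspace(-1., 1., 21)[:, None]

    # Hypothetical safety constraint: non-negative slack at every state
    # means the corresponding policy is safe.
    def constraint(policy):
        return np.zeros((101, 1))  # placeholder slack values

    rl.discrete_policy_optimization(action_space, constraint=constraint)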

future_values(self, states, policy=None, actions=None, lyapunov=None, lagrange_multiplier=1.0)

Return the value at the current states.

Parameters:
states : ndarray

The states at which to compute future values.

policy : callable, optional

The policy to evaluate. Defaults to self.policy. This argument is ignored if actions is not None.

actions : array or tensor, optional

The actions to be taken for the states.

lyapunov : instance of Lyapunov, optional

A Lyapunov function that acts as a constraint for the optimization.

lagrange_multiplier : float, optional

A scaling factor for the slack of the optimization problem.

Returns:
The expected long-term reward when taking an action according to the policy: the immediate reward plus the discounted value of self.value_function at the resulting next states.
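
A usage sketch, continuing the example above; passing explicit actions overrides the policy:

    import numpy as np

    states = np.linspace(-1., 1., 101)[:, None]

    # One-step look-ahead values under the stored policy ...
    v_policy = rl.future_values(states)

    # ... or under explicit actions (zero actions here, purely for illustration).
    v_zero = rl.future_values(states, actions=np.zeros_like(states))
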
optimize_value_function(self, **solver_options)

Optimize the value function using cvxpy.

Parameters:
solver_options : kwargs, optional

Additional solver options passed to cvxpy.Problem.solve.

Returns:
assign_op : tf.Tensor

An assign operation that updates the value function.
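
Since the method returns a TensorFlow assign operation, it only takes effect once it is run; a sketch assuming the TensorFlow 1.x session style used by the library:

    import tensorflow as tf

    # Build the cvxpy problem and the corresponding assign operation.
    assign_op = rl.optimize_value_function()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Running the op writes the optimized parameters into the value function.
        sess.run(assign_op)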

value_iteration(self)

Perform one step of value iteration.
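
Repeating such single steps gives the usual value-iteration loop; in a full policy-iteration scheme one would typically alternate them with policy updates such as discrete_policy_optimization. An illustrative loop:

    # Apply a fixed number of value-iteration steps (illustrative only).
    # Depending on the backend, the update may instead be returned as a
    # TensorFlow operation that has to be run in a session.
    for _ in range(100):
        rl.value_iteration()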