Conquering the ball from an attacking player is risky and difficult to implement, because (i) it is hard to devise a simple scheme that handles the broad variety of dribbling strategies in use, (ii) there is a risk of over-specializing to some types of dribble strategy and losing generalization to others, which lowers the overall effectiveness of the scheme, and (iii) the duel between attacker and defender matters: if the defending player loses this duel, the attacker overruns him and gains more space and better opportunities with few defenders ahead.
The Brainstormers team has employed an effective scheme for the hassling task, called neuroHassle, since the RoboCup 2007 competitions. We are working on an enhanced version of this approach to be embedded in our block mechanism. The goal of this problem is to train defensive agents with reinforcement learning to hassle an attacker. In other words, a naïve defender finds, by trial and error, a policy to conquer the ball from an opposing ball-leading player with no a priori knowledge of his dribbling capabilities. The proposed reinforcement learning solution is value-function estimation with a multi-layer perceptron neural network. The architecture of our proposed solution differs slightly from the one explained in [1], yet uses similar basics and training concepts.
Architecture:
An MLP neural network with one hidden layer of 20 neurons with a sigmoidal activation function. The network is trained in batch mode and uses back-propagation to minimize the mean squared error of the value-function approximation.
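As a rough illustration (not our actual code), such a value-function approximator could look like the NumPy sketch below; the learning rate and weight initialization are arbitrary placeholders.

```python
import numpy as np

class ValueMLP:
    """One-hidden-layer MLP value-function approximator (20 sigmoid units)."""

    def __init__(self, n_inputs, n_hidden=20, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def predict(self, X):
        self.h = self._sigmoid(X @ self.W1 + self.b1)   # hidden activations
        return self.h @ self.W2 + self.b2               # linear output = V(s)

    def train_batch(self, X, targets):
        """One batch back-propagation step minimizing mean squared error."""
        V = self.predict(X)
        err = V - targets.reshape(-1, 1)                # dMSE/dV (up to a constant)
        n = X.shape[0]
        # output-layer gradients
        dW2 = self.h.T @ err / n
        db2 = err.mean(axis=0)
        # hidden-layer gradients (sigmoid derivative is h * (1 - h))
        dh = (err @ self.W2.T) * self.h * (1.0 - self.h)
        dW1 = X.T @ dh / n
        db1 = dh.mean(axis=0)
        # gradient-descent update
        self.W2 -= self.lr * dW2; self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1; self.b1 -= self.lr * db1
        return float((err ** 2).mean())                 # current batch MSE
```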
Inputs: These features are extracted from the environment and fed to the neural network.
1. Distance between the defender and the ball-possessing attacker (scalar)
2. Distance between the ball and our goal (scalar)
3. Velocity of the defender (vector, relative)
4. Velocity of the attacker (scalar: absolute value of the velocity)
5. Position of the ball (vector, relative)
6. Defender's body angle (relative)
7. Attacker's body angle (relative to his direction toward the center of our goal)
8. Strategic angle GÔM (G is the center of our goal, O is the position of the opponent, M is the position of our player)
9. Stamina of the defender
The coordinate system is centered on our player, with the abscissa aligned along the line through our player and the opponent. The degree of partial observability is kept low. A sketch of this feature extraction is given below.
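The following is one possible way to assemble this feature vector; the defender/attacker/ball objects and their attributes (pos, vel, body_angle, stamina) are placeholders for whatever world model the agent maintains, not a fixed API.

```python
import math
import numpy as np

def hassle_features(defender, attacker, ball, goal_center):
    """Assemble the 9 features listed above into one input vector."""
    def_pos = np.asarray(defender.pos, dtype=float)
    att_pos = np.asarray(attacker.pos, dtype=float)
    ball_pos = np.asarray(ball.pos, dtype=float)
    goal = np.asarray(goal_center, dtype=float)

    # Relative frame: origin at our player, abscissa along the line
    # through our player and the opponent (as described above).
    axis = att_pos - def_pos
    theta = math.atan2(axis[1], axis[0])
    rot = np.array([[ math.cos(theta), math.sin(theta)],
                    [-math.sin(theta), math.cos(theta)]])   # world -> relative

    def_vel_rel = rot @ np.asarray(defender.vel, dtype=float)   # feature 3
    ball_rel = rot @ (ball_pos - def_pos)                        # feature 5

    goal_dir = goal - att_pos
    att_to_goal = math.atan2(goal_dir[1], goal_dir[0])

    # feature 8: strategic angle G-O-M measured at the opponent's position
    to_me = def_pos - att_pos
    strategic = math.atan2(to_me[1], to_me[0]) - att_to_goal

    return np.array([
        np.linalg.norm(axis),                   # 1 defender-attacker distance
        np.linalg.norm(goal - ball_pos),        # 2 ball-goal distance
        def_vel_rel[0], def_vel_rel[1],         # 3 defender velocity (relative)
        np.linalg.norm(attacker.vel),           # 4 attacker speed (scalar)
        ball_rel[0], ball_rel[1],               # 5 ball position (relative)
        defender.body_angle - theta,            # 6 defender body angle (relative)
        attacker.body_angle - att_to_goal,      # 7 attacker body angle vs. goal direction
        strategic,                              # 8 strategic angle
        defender.stamina,                       # 9 stamina
    ])
```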
Training:
A large training data set should be provided for this task. It should cover: various velocities and body angles of the players and initial positions of the ball between them (to handle different start-up situations for dribbling and defending); various regions of the field (because dribbling players are very likely to behave differently depending on where they are positioned); different adversary agents (to avoid over-specialization and maintain generalization); and different stamina levels of the defender (to reflect realistic game situations).
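For illustration, a scenario generator along these lines might look as follows; the numeric ranges and the opponent pool are assumptions, not the actual values we use.

```python
import random

# Illustrative values only; the real training set is tuned to the
# soccer-server field dimensions and the opponent binaries available.
FIELD_REGIONS = [(-52.5, -17.5), (-17.5, 17.5), (17.5, 52.5)]   # x-bands of the pitch
OPPONENT_POOL = ["dribbler_a", "dribbler_b", "dribbler_c"]       # placeholder adversaries

def sample_training_scenario(rng=random):
    """Draw one randomized start-up situation for a hassling episode."""
    x_lo, x_hi = rng.choice(FIELD_REGIONS)
    attacker_pos = (rng.uniform(x_lo, x_hi), rng.uniform(-30.0, 30.0))
    return {
        "attacker_pos":   attacker_pos,
        "defender_dist":  rng.uniform(2.0, 8.0),       # start-up gap to the attacker
        "defender_body":  rng.uniform(-180.0, 180.0),  # body angles (degrees)
        "attacker_body":  rng.uniform(-180.0, 180.0),
        "defender_speed": rng.uniform(0.0, 1.0),
        "attacker_speed": rng.uniform(0.0, 1.0),
        "ball_offset":    rng.uniform(0.0, 0.7),       # ball placed between the players
        "opponent":       rng.choice(OPPONENT_POOL),   # vary the adversary agent
        "stamina":        rng.uniform(2000.0, 8000.0), # defender stamina budget
    }
```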
Reinforcement Signal: The outcome of a training scenario falls into one of several categories. Depending on the outcome, a different reinforcement is given to the agent (a sketch of this mapping follows the list):
- Erroneous Episode: The attacker loses the ball by his own mistake, the ball goes out of the field, the agent localizes itself incorrectly, etc. Such episodes are omitted from the training data.
- Success: The defender conquers the ball, either holding it inside his kickable area or having a probably successful tackle opportunity. This outcome is rewarded with a large value.
- Opponent Panic: Non-dribbling behavior of the attacking, ball-leading opponent. It occurs (i) when the defender approaches the attacker, (ii) when the defender hassles him too much, or (iii) when he simply does not consider the situation suitable for dribbling. In these cases the attacker kicks the ball away as a pass, toward the goal, or somewhere else (usually forward). This outcome is considered a draw and, depending on whether the kick goes toward our goal or not, the situation is penalized or rewarded with a small value.
- Failure: None of the other cases occurs; the attacker keeps the ball in his kick range and overruns the defender by some distance, or approaches the goal so closely that a goal shot is hardly stoppable. This outcome is punished with a large value.
- Time Out: The struggle over the ball does not reach one of the above states within a reasonable time. This situation is punished or rewarded based on the offset of the ball from its initial position.
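A minimal sketch of this outcome-to-reinforcement mapping is given below; the reward magnitudes are illustrative placeholders rather than the tuned values.

```python
from enum import Enum, auto

class Outcome(Enum):
    ERRONEOUS = auto()       # attacker's own mistake, ball out, bad localization, ...
    SUCCESS = auto()         # defender conquered the ball / has a good tackle chance
    OPPONENT_PANIC = auto()  # attacker gave up dribbling and kicked the ball away
    FAILURE = auto()         # attacker overran the defender / reached shooting range
    TIMEOUT = auto()         # no terminal state within a reasonable time

def terminal_reinforcement(outcome, panic_shot_on_goal=False, ball_gain_toward_goal=0.0):
    """Return the terminal reinforcement for an episode, or None to discard it.

    The magnitudes (1.0, 0.1, 0.05) are assumptions; `ball_gain_toward_goal`
    is the signed displacement of the ball toward our goal since kickoff
    of the episode.
    """
    if outcome is Outcome.ERRONEOUS:
        return None                                   # dropped from the training data
    if outcome is Outcome.SUCCESS:
        return +1.0                                   # rewarded by a large value
    if outcome is Outcome.FAILURE:
        return -1.0                                   # punished by a large value
    if outcome is Outcome.OPPONENT_PANIC:
        return -0.1 if panic_shot_on_goal else +0.1   # draw, small penalty/reward
    if outcome is Outcome.TIMEOUT:
        return -0.05 * ball_gain_toward_goal          # scaled by the ball offset
    raise ValueError(f"unknown outcome: {outcome}")
```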
Actions:
An agent is allowed to choose the low-level actions turn(x) and dash(y), where the domains of both commands' parameters (x from [−180°, 180°], y from [−100, 100]) are discretized such that in total 76 actions are available to the agent at each time step.
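One possible discretization that yields exactly 76 actions (40 dash powers plus 36 turn angles) is sketched below; the actual split and step sizes used in neuroHassle may differ, and agent.dash / agent.turn stand in for whatever client interface sends the commands to the soccer server.

```python
import numpy as np

# Assumed split: 40 dash powers + 36 turn angles = 76 primitive actions.
DASH_POWERS = np.arange(-95, 101, 5)       # 40 values in [-95, 100]
TURN_ANGLES = np.arange(-175, 176, 10)     # 36 values in [-175, 175] degrees

ACTIONS = [("dash", float(p)) for p in DASH_POWERS] + \
          [("turn", float(a)) for a in TURN_ANGLES]
assert len(ACTIONS) == 76

def apply_action(agent, index):
    """Send the selected low-level command through the (placeholder) client interface."""
    kind, param = ACTIONS[index]
    if kind == "dash":
        agent.dash(param)   # dash power in [-100, 100]
    else:
        agent.turn(param)   # turn moment in [-180, 180] degrees
```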
Although the effectiveness of the policy will be influenced by the presence of other players on the field, and the attacker may behave differently, a good formation of the remaining defenders, which makes passes between opponent players riskier, makes this policy all the more important.
Future Work:
- Enable a defender to shout for help if his stamina level decreases to a critical level;
- When the team is winning by a comfortable margin, the defenders try to reach a Time Out state and save energy by merely preventing the attacker from dribbling ahead;
- When the team in possession has the ball in its own defensive area and there is a gap in midfield, our players start to hassle them in the opponent's defensive area to conquer the ball and gain a good scoring chance;
- Train a defender to hassle when one additional player from each team (one more attacker and one more defender) is present on the field, so that the hassling player can also block passes at the source.
[1] http://www.springerlink.com/content/p15566725w553751/