Tuesday, February 16, 2010

Brainstormers: NeuroHassle

Defending against incoming attacks and recapturing the ball is a crucial task for every team. The defending strategy consists of two sub-tasks: positioning and hassling. The former arranges players in free spaces so that they can intercept potential opponent passes, cover their direct opponents, mark the attacker who possesses the ball, and deny the opponent a clear shot toward the goal. The latter improves the aggressiveness of a defender so that he can interfere with the opponent ball leader, “hassle” him, and bring the ball under his own control while simultaneously hindering him from dribbling ahead. Assigning these two tasks is challenging because they can conflict and lead to undesired situations: either nobody interferes with the attacker, or two players decide to hassle the ball leader and leave a breach in the defensive formation, leaving an opponent player uncovered. The assignment should also maximize the collaborative defensive utility. A common choice is to give the hassling task to the player closest to the ball while the others maintain a good defensive coverage formation.
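To make the common assignment rule mentioned at the end of this paragraph concrete, here is a minimal sketch; the Player objects and their fields are hypothetical, not part of any team's code base.

```python
import math

def assign_defensive_tasks(defenders, ball_position):
    """Give the hassle task to the defender closest to the ball;
    all other defenders keep the positioning (coverage) task.
    Player objects with .id and .position are assumed for illustration."""
    def dist_to_ball(player):
        dx = player.position[0] - ball_position[0]
        dy = player.position[1] - ball_position[1]
        return math.hypot(dx, dy)

    hassler = min(defenders, key=dist_to_ball)
    return {p.id: ("hassle" if p is hassler else "position") for p in defenders}
```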
Conquering the ball from an attacking player is risky and difficult to implement, because (i) it is hard to devise a simple scheme that handles the broad variety of dribbling strategies in use, (ii) there is a risk of over-specializing to some types of dribble strategies and losing generalization to others, which lowers the overall efficiency of the scheme, and (iii) the duel between attacker and defender matters greatly: if the defending player loses this duel, the attacker overruns him and gains more space and better opportunities with fewer defenders ahead.

The “Brainstormers” team has employed an effective scheme for the hassling task, called NeuroHassle, since the RoboCup 2007 competition. We are working on an enhanced version of this approach to be embedded in our block mechanism. The goal is to train defensive agents with reinforcement learning to hassle an attacker. In other words, a naïve defender finds, by trial and error, a policy to conquer the ball from an opponent ball-leading player with no a priori knowledge of his dribbling capabilities. The proposed reinforcement learning solution is value-function estimation with a multi-layer perceptron neural network. The architecture of our solution differs slightly from the one explained in [1], yet uses similar basics and training concepts.

Architecture:
An MLP neural network with one hidden layer of 20 neurons with sigmoidal activation functions. The network is trained in batch mode and uses back-propagation to minimize the mean squared error of the value-function approximation.
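As an illustration, the sketch below builds such a network; PyTorch, the input encoding (the two vector-valued features below counted as two units each), the optimizer, and the learning rate are assumptions, not details taken from the team's implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the value-function approximator described above.
# N_INPUTS is the length of the feature vector listed below.
N_INPUTS = 11   # nine features, two of which are 2-D vectors (assumed encoding)

value_net = nn.Sequential(
    nn.Linear(N_INPUTS, 20),  # one hidden layer of 20 neurons
    nn.Sigmoid(),             # sigmoidal activation
    nn.Linear(20, 1),         # scalar state-value output
)

loss_fn = nn.MSELoss()        # mean squared error of the value approximation
optimizer = torch.optim.SGD(value_net.parameters(), lr=1e-2)  # assumed settings

def train_batch(states, target_values):
    """One batch-mode back-propagation step on a set of (state, target) pairs."""
    optimizer.zero_grad()
    loss = loss_fn(value_net(states), target_values)
    loss.backward()
    optimizer.step()
    return loss.item()
```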
Inputs: The following features are extracted from the environment and fed to the neural network.
1. Distance between defender and ball possessing attacker (Scalar)
2. Distance between ball and our goal (Scalar)
3. Velocity of defender (Vectored and Relative)
4. Velocity of attacker (Scalar: The absolute value of velocity)
5. Position of the ball (Vectored and Relative)
6. Defender body angle (Relative)
7. Attacker body angle (Relative to his direction toward our center of goal)
8. Strategic angle (GÔM, where G is the center of our goal, O is the position of the opponent, and M is the position of our player)
9. Stamina of the defender
The coordinate system is centered on our player, with the abscissa aligned along the line from our player to the opponent. The degree of partial observability is kept low.
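The sketch below shows how these nine features could be assembled into the network's input vector in that relative frame; the state fields, helper names, and the absence of normalization are assumptions for illustration.

```python
import math
import numpy as np

def relative_angle(angle, reference):
    """Normalize an angle difference into [-180, 180) degrees."""
    return (angle - reference + 180.0) % 360.0 - 180.0

def build_input_vector(defender, attacker, ball, own_goal):
    """Assemble the nine features listed above in a frame centered on the
    defender, with the x-axis pointing from the defender toward the attacker.
    The .x, .y, .vx, .vy, .body_angle, .stamina fields are assumed names."""
    # Abscissa: direction from our player toward the opponent.
    axis = math.degrees(math.atan2(attacker.y - defender.y, attacker.x - defender.x))

    def to_local(vx, vy):
        """Rotate a world-frame vector into the defender-centered frame."""
        a = math.radians(-axis)
        return (vx * math.cos(a) - vy * math.sin(a),
                vx * math.sin(a) + vy * math.cos(a))

    dist_def_att = math.hypot(attacker.x - defender.x, attacker.y - defender.y)  # 1
    dist_ball_goal = math.hypot(ball.x - own_goal.x, ball.y - own_goal.y)        # 2
    def_vel = to_local(defender.vx, defender.vy)                                 # 3 (vector, relative)
    att_speed = math.hypot(attacker.vx, attacker.vy)                             # 4 (absolute value)
    ball_pos = to_local(ball.x - defender.x, ball.y - defender.y)                # 5 (vector, relative)
    def_body = relative_angle(defender.body_angle, axis)                         # 6
    goal_dir = math.degrees(math.atan2(own_goal.y - attacker.y, own_goal.x - attacker.x))
    att_body = relative_angle(attacker.body_angle, goal_dir)                     # 7
    # 8: magnitude of the strategic angle G-O-M at the opponent's position.
    ang_to_goal = math.degrees(math.atan2(own_goal.y - attacker.y, own_goal.x - attacker.x))
    ang_to_me = math.degrees(math.atan2(defender.y - attacker.y, defender.x - attacker.x))
    strategic = abs(relative_angle(ang_to_goal, ang_to_me))
    stamina = defender.stamina                                                   # 9

    return np.array([dist_def_att, dist_ball_goal, *def_vel, att_speed,
                     *ball_pos, def_body, att_body, strategic, stamina])
```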

Training:
A large training data set has to be provided for this task. It should cover various player velocities and body angles and various initial positions of the ball between the players (to handle different start-up situations for dribbling and defending), various regions of the field (because dribbling players are likely to behave differently depending on where they are positioned), different adversary agents (to avoid over-specialization and maintain generalization), and different stamina levels of the defender (to reflect realistic game situations).
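Randomized start situations of this kind could be generated along the following lines; the field size, sampling ranges, and data layout are illustrative assumptions rather than the team's actual values.

```python
import math
import random

FIELD_X, FIELD_Y = 52.5, 34.0   # assumed half-lengths of a 2D-simulation pitch

def random_training_scenario(adversary_teams):
    """Sample one start situation for a hassling episode: player positions,
    velocities, body angles, ball placement, field region, opponent dribbler,
    and defender stamina (all ranges are illustrative)."""
    attacker = {
        "pos": (random.uniform(-FIELD_X, FIELD_X), random.uniform(-FIELD_Y, FIELD_Y)),
        "body_angle": random.uniform(-180.0, 180.0),
        "speed": random.uniform(0.0, 1.0),
    }
    # Place the defender a few meters away in a random direction.
    gap, direction = random.uniform(2.0, 8.0), math.radians(random.uniform(0.0, 360.0))
    defender = {
        "pos": (attacker["pos"][0] + gap * math.cos(direction),
                attacker["pos"][1] + gap * math.sin(direction)),
        "body_angle": random.uniform(-180.0, 180.0),
        "speed": random.uniform(0.0, 1.0),
        "stamina_fraction": random.uniform(0.3, 1.0),   # vary stamina for realism
    }
    # Ball starts somewhere on the line between the two players.
    t = random.uniform(0.1, 0.9)
    ball = tuple((1 - t) * a + t * d for a, d in zip(attacker["pos"], defender["pos"]))
    adversary = random.choice(adversary_teams)   # rotate opponents for generalization
    return attacker, defender, ball, adversary
```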
Reinforcement Signal: The outcome of a training scenario falls into one of several categories, and the agent receives a different reinforcement depending on this outcome (a sketch of such a reward scheme follows the list):
  • Erroneous Episode: The attacker loses the ball through a mistake of his own, the ball goes out of the field, the agent mislocalizes itself, etc. Such episodes are considered erroneous and are omitted from the training data.
  • Success: The defender conquers the ball, either having it inside his kickable area or having a likely successful tackling opportunity. This outcome is rewarded with a large value.
  • Opponent Panic: Non-dribbling behavior of the ball-leading opponent. This occurs (i) when a defender approaches the attacker, (ii) when the defender hassles him too much, or (iii) when he simply does not consider the situation suitable for dribbling. In these cases the attacker kicks the ball away as a pass, toward the goal, or somewhere else (usually forward). This outcome is considered a draw; depending on whether the kick goes toward our goal or not, we penalize or reward the situation with a small value.
  • Failure: None of the other cases has happened. This means the attacker has the ball in his kick range and has overrun the defender by some distance, or has approached the goal so closely that a goal shot is hardly stoppable. This outcome is punished with a large value.
  • Time Out: The struggle over the ball does not reach one of the above-mentioned states within a reasonable time. This situation is punished or rewarded based on the offset of the ball from its initial position.
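As referenced above, a minimal sketch of such an outcome-dependent reinforcement is given below; the outcome labels and the concrete magnitudes are illustrative assumptions, only their signs and relative sizes follow the description.

```python
def episode_reward(outcome, ball_offset=0.0, panic_shot_on_goal=False):
    """Map the outcome of a training episode to a reinforcement value.
    ball_offset: signed displacement of the ball toward our goal (assumed convention)."""
    if outcome == "erroneous":
        return None                      # episode discarded, not used for training
    if outcome == "success":
        return +1.0                      # defender conquered the ball: large reward
    if outcome == "failure":
        return -1.0                      # attacker overran the defender: large penalty
    if outcome == "opponent_panic":
        # Treated as a draw: small penalty if the panic kick went toward our goal,
        # small reward otherwise.
        return -0.1 if panic_shot_on_goal else +0.1
    if outcome == "timeout":
        # Reward or punish in proportion to how far the ball moved from its
        # initial position (scale chosen arbitrarily for illustration).
        return -0.05 * ball_offset
    raise ValueError(f"unknown outcome: {outcome}")
```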
The learning task is episodic and the scenario is reset after each episode, so there is no need for discounting (the discount factor is 1.0). To enable exploration of better and more effective defensive solutions, we combine an energy-saving criterion with Boltzmann exploration to modify the online greedy policy during training. The idea behind this choice is that, although large training sets and randomized episode start situations bring about a good level of state-space exploration (as assumed in the paper), the resulting policy may be inefficient in terms of stamina, may not cover the various dribbling tricks well enough, and may not generalize properly.
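A sketch of the Boltzmann part of this exploration scheme is shown below; the temperature and the way the energy-saving criterion enters (here, as a penalty on high dash powers) are assumptions made for illustration.

```python
import numpy as np

def boltzmann_action(action_values, dash_powers, temperature=1.0, stamina_weight=0.0):
    """Pick an action stochastically from its estimated values (softmax /
    Boltzmann exploration), optionally biased against high dash powers so
    the learned policy also economizes stamina. dash_powers holds the dash
    parameter of each action (0 for pure turns)."""
    # Penalize stamina-hungry actions before the softmax (energy-saving criterion).
    biased = np.asarray(action_values, dtype=float) - stamina_weight * np.abs(dash_powers)
    # Numerically stable softmax over the biased values.
    z = (biased - biased.max()) / temperature
    probs = np.exp(z)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```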

Actions:
The agent is allowed to choose the low-level actions turn(x) and dash(y), where the domains of the commands' parameters (x from [−180°, 180°], y from [−100, 100]) are discretized such that in total 76 actions are available to the agent at each time step.
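The post does not give the exact discretization, so the bins below are only one hypothetical way to arrive at 76 actions (36 turn angles in 10° steps plus 40 dash powers in steps of 5).

```python
# Hypothetical discretization of turn(x) and dash(y) yielding 76 actions in
# total; the Brainstormers' actual bins may differ.
TURN_ANGLES = list(range(-180, 180, 10))    # 36 turn(x) actions
DASH_POWERS = list(range(-100, 100, 5))     # 40 dash(y) actions

ACTIONS = [("turn", a) for a in TURN_ANGLES] + [("dash", p) for p in DASH_POWERS]
assert len(ACTIONS) == 76
```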

Although the effectiveness of the policy is influenced by the presence of other players on the field, and the attacker may behave differently in their presence, the policy gains importance when the other defenders maintain a good formation that makes passes between opponent players riskier.

Future Works:
  • Enable a defender to shout for help if his stamina level decreases to a critical level;
  • When the team has a comfortable winning margin, the defenders try to reach a Time Out state and save energy simply by preventing the attacker from dribbling ahead;
  • When the attacking team has the ball in its own defensive area and there is a gap in the midfield, our players start hassling in the opponent's defensive area to conquer the ball and gain a good chance of scoring;
  • Train the defender to hassle when one additional player from each team is present on the field, so that the hassling player learns to block passes at their source.
Reference:
[1] http://www.springerlink.com/content/p15566725w553751/
