Publication Date

Fall 2020

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Teng Moh

Second Advisor

Katerina Potika

Third Advisor

Mike Wu

Keywords

multi-agent deep reinforcement learning, MADRL

Abstract

This project was motivated by the search for an AI method that moves toward Artificial General Intelligence (AGI), that is, one more similar to the learning behavior of human beings. As of today, Deep Reinforcement Learning (DRL) is the closest to AGI among machine learning methods. To better understand DRL, we compare and contrast it with related methods: Deep Learning, Dynamic Programming, and Game Theory.

We apply one of the state-of-the-art DRL algorithms, Proximal Policy Optimization (PPO), to robot walker locomotion: a simple yet challenging environment with an inherently continuous and high-dimensional state/action space.

The end goal of this project is to train the agents by finding the optimal sequence of actions (policy/strategy) for multiple walkers, leading them to move forward as far as possible and thereby maximize the accumulated reward (performance). This goal is accomplished by tuning the hyperparameters of the PPO algorithm while monitoring performance in multi-agent DRL (MADRL) settings.
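For context on what PPO optimizes during this tuning, its core clipped surrogate objective can be written as a minimal NumPy sketch. This is the standard PPO-clip loss from the literature, not the project's actual implementation; the function name and the clipping coefficient value are illustrative.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Negative clipped surrogate objective of PPO.

    ratio:     pi_new(a|s) / pi_old(a|s) per sampled transition
    advantage: advantage estimate per transition
    eps:       clipping coefficient (0.2 is a common default)
    """
    unclipped = ratio * advantage
    # Clipping the ratio to [1 - eps, 1 + eps] limits how far a single
    # minibatch update can move the policy away from the old one.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Take the pessimistic (minimum) objective, negated for minimization.
    return -np.mean(np.minimum(unclipped, clipped))

# Example: a ratio of 1.5 with positive advantage is clipped to 1.2.
print(ppo_clip_loss(np.array([1.5]), np.array([1.0])))  # -1.2
```

Because the loss averages over a minibatch of sampled transitions, the minibatch size directly controls the variance of this estimate, which is the sensitivity the tuning experiments probe.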

At the end, we draw three conclusions from our findings across the various MADRL experiments: 1) Unlike DL with explicit target labels, DRL needs a larger minibatch size for a better estimate of values from various gradients. Therefore, the minibatch size and its pool size (experience replay buffer) are critical hyperparameters in the PPO algorithm. 2) For homogeneous multi-agent environments, hyperparameters are mutually transferable between single-agent and multi-agent environments, so a tuned set can be reused. 3) For homogeneous multi-agent environments with a well-tuned hyperparameter set, parameter sharing is a better strategy for MADRL in terms of performance and efficiency, with fewer parameters and less memory.
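The memory advantage of parameter sharing can be illustrated with a rough parameter count: homogeneous agents can reuse one policy network instead of each holding its own copy. The layer sizes and agent count below are illustrative assumptions, not the networks used in this project.

```python
def mlp_param_count(sizes):
    """Weights plus biases of a fully connected network with layer sizes `sizes`."""
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1]
               for i in range(len(sizes) - 1))

n_agents = 3                 # hypothetical number of homogeneous walkers
layers = [31, 64, 64, 4]     # hypothetical obs dim 31, two hidden layers, action dim 4

independent = n_agents * mlp_param_count(layers)  # one network per agent
shared = mlp_param_count(layers)                  # one network reused by all agents

print(independent, shared)  # shared uses 1/n_agents of the parameters
```

Beyond the reduced footprint, a shared policy also sees the experience of every agent, which is one way the homogeneous setting makes single-agent tuning transferable.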

To conclude, DRL, as reward-driven, sequential, and evaluative learning, would be closer to AGI if multiple DRL agents learn to collaborate to capture the true signal from the shared environment. This work provides one instance of implicit cooperative learning in MADRL.
