Disclaimer: This article is part of the Industry 4.0 Open Educational Resources (OER) Publication Initiatives jointly supported by Duke Learning Innovation Center and DKU Center for Teaching and Learning under the Carrying the Innovation Forward program. This article belongs to the OER Series No. 2 Computational Economics Spring 2022 collection. The Spring 2022 collection is partly supported by the Social Science Divisional Chair’s Discretionary Fund to encourage faculty engagement in undergraduate research and enhance student-faculty scholarly interactions outside of the classroom. The division chair is Prof. Keping Wu, Associate Professor of Anthropology at Duke Kunshan University. The co-author Tianyu Wu was the Teaching and Research Assistant for Prof. Luyao Zhang in the course: COMPSCI/ECON 206 Computational Microeconomics at Duke Kunshan University Spring 2022, when he completed the joint article. The co-authors are forever indebted to Prof. Vincent Conitzer, who presented “Computer Science Meets Economics” as a distinguished guest lecture for this course on Apr. 19, 2022.
Reinforcement learning is not a novel research method; its history dates back to the 1960s (Waltz and Fu 1965). However, with the rapid growth of computing power in recent years, reinforcement learning has gradually become not only one of the major directions in computer science research but also a handy tool for interdisciplinary research.
In brief, reinforcement learning is “the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment” (Çalışır and Pehlivanoğlu 2019). In simple terms, reinforcement learning consists of one or multiple agents and an environment. The agent needs to interact with the environment to learn, by some means (usually by trial and error), the optimal behavioral patterns that will yield the greatest reward.
Compared with traditional machine learning and deep learning approaches, reinforcement learning offers several notable advantages. For example, online reinforcement learning does not require a huge labeled dataset, which greatly reduces the cost of data processing; moreover, compared to traditional supervised learning, reinforcement learning agents can figure out the optimal behavior by trial and error, instead of exactly imitating the behavior taught in a labeled dataset. This has allowed reinforcement learning to beat humans in many complex tasks, as exemplified by AlphaGo (Silver et al. 2016). Reinforcement learning also offers further advantages in performance and generalizability.
Figure 1 shows the basic structure of reinforcement learning. To begin with, the agent takes an action in the environment; the environment reacts to the action and returns the new state and reward to the agent; based on this returned information, the agent takes its next action, and the cycle continues. As the number of cycles increases, the agent learns from experience and optimizes its actions. The loop stops when the environment signals termination. To date, reinforcement learning has been widely applied in areas such as self-driving cars, industrial automation, finance, and natural language processing.
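This interaction loop can be sketched in a few lines of Python. The toy corridor environment, its states, and its rewards below are hypothetical illustrations, not from any real RL library:

```python
import random

class ToyEnvironment:
    """A tiny corridor: the agent moves right (+1) or left (-1) from position 0.

    The episode ends when the agent reaches position +3 (reward 1) or -3 (reward 0).
    """
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action                       # the environment reacts to the action
        reward = 1 if self.state == 3 else 0       # reward for reaching the goal at +3
        done = self.state in (3, -3)               # termination signal
        return self.state, reward, done            # new state and reward go back to the agent

env = ToyEnvironment()
done = False
total_reward = 0
while not done:                                    # the loop stops when the environment signals termination
    action = random.choice([-1, 1])                # the agent takes an action (here: at random)
    state, reward, done = env.step(action)
    total_reward += reward
print("episode finished at state", env.state, "with total reward", total_reward)
```

A learning agent would replace the random action choice with a policy that improves as rewards accumulate over many such episodes.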
As shown in figure 2, there are several common approaches to classifying reinforcement learning (RL) algorithms. The first divides RL into model-based and model-free algorithms, depending on whether the environment is mathematically modeled; the second divides RL into non-deep and deep reinforcement learning, depending on whether deep neural networks are used; and the third divides RL into on-policy and off-policy algorithms, depending on whether the policy being learned is the same one used to interact with the environment. We now elaborate on representative algorithms under these three taxonomies with specific examples.
In the context of reinforcement learning, the term “model” refers to “an ensemble of acquired environmental knowledge” (Kaiser et al. 2020). In simple terms, a model is a mathematical representation as an abstract of the environment.
As defined by Moerland et al. (2022), model-based reinforcement learning (MBRL) covers any MDP (Markov Decision Process) approach in which the agent learns from a known or learned model of the environment to approximate a global value function or a policy mapping states to optimal actions. In a nutshell, in the model-based approach, the agent uses the model to predict the reward of different actions and selects the optimal one under the circumstances of the environment.
Most games are human-defined closed environments in which the agent can predict the consequences of any action. Hence, a typical application of model-based algorithms is building game AI. For instance, Kaiser et al. (2020) used model-based reinforcement learning to play Atari games and achieved better scores than humans. In economic scenarios, a typical application of model-based RL lies in planning. Thanks to a given or learned model of the environment, the agent can assist decision-makers in evaluating the consequences of various actions, in applications ranging from dynamic portfolio optimization (Yu et al. 2019) in simple scenarios to infectious disease control (Wan, Zhang, and Song 2021) at the governmental decision level.
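The planning idea can be illustrated with value iteration on a known model. The two-state "investment" MDP below, including its transition probabilities and rewards, is a made-up toy example (not from the cited papers), assuming a discount factor of 0.9:

```python
# model[state][action] = list of (probability, next_state, reward):
# the "ensemble of environmental knowledge" the agent plans with.
model = {
    "low":  {"wait":   [(1.0, "low", 0.0)],
             "invest": [(0.6, "high", 1.0), (0.4, "low", -0.5)]},
    "high": {"wait":   [(1.0, "high", 1.0)],
             "invest": [(1.0, "high", 1.5)]},
}
gamma = 0.9                       # discount factor
V = {s: 0.0 for s in model}       # state-value estimates

# Value iteration: repeatedly back up each state's value through the model,
# without ever acting in the real environment.
for _ in range(200):
    for s in model:
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in model[s].values()
        )

# Read off the greedy policy implied by the converged values.
policy = {
    s: max(model[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in model[s][a]))
    for s in model
}
print(policy)
```

Because the model is given, the agent can evaluate "what if" questions (here, whether to invest) purely by computation, which is exactly what makes model-based RL attractive for planning problems.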
Instead of evaluating through a model, model-free reinforcement learning (MFRL) algorithms determine the value of each action in different states by trial and error (Sutton and Barto 2018). Q-Learning (Watkins and Dayan 1992) is one of the early classical MFRL algorithms. Q-Learning records the values of all state-action pairs in a table and updates them by trial and error until convergence. However, the drawback of this simple algorithm is that for even moderately complex problems, the required table storage space becomes staggering. One application scenario of MFRL is financial portfolio management (Sato 2019) in complex market environments. The real market environment usually contains too many variables and too much information to model, so using MFRL to evaluate investment actions may be a better choice.
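A minimal tabular Q-Learning agent can be sketched as follows. The five-state chain environment, reward values, and learning parameters are hypothetical illustrations:

```python
import random

random.seed(0)
ACTIONS = [1, -1]                        # move right / move left
# The Q-table: one value per state-action pair, all starting at zero.
Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration rate

def step(state, action):
    """Toy chain: states 0..4, reward 1 only for reaching state 4."""
    nxt = min(4, max(0, state + action))
    reward = 1.0 if nxt == 4 else 0.0
    return nxt, reward, nxt == 4

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the table, occasionally explore.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Update the table entry toward the bootstrapped target.
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt

# After training, the greedy policy read from the table moves right everywhere.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)])
```

With 5 states and 2 actions the table has only 10 entries; the storage problem mentioned above appears once states are high-dimensional (e.g., screen images), where the number of entries explodes.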
A “policy” is a function that returns feasible actions given a state. Before introducing the on-policy/off-policy taxonomy, let's distinguish two kinds of policies: the behavior policy and the target policy. The behavior policy is the policy the agent uses to select actions while interacting with the environment during training; the target policy is the policy the agent is actually learning and evaluating. When the behavior policy is identical to the target policy, the algorithm is on-policy; when the two differ, it is called off-policy.
In on-policy learning, the agent explores the environment using its behavior policy, which is the very policy being learned, and continuously updates it by evaluating the rewards of different actions. The optimal strategy the agent eventually obtains is therefore closely related to how it explores the environment. SARSA (Rummery and Niranjan 1994) is a typical on-policy algorithm. However, one drawback of on-policy learning is that it lacks comprehensive exploration of the environment and easily converges to a local optimum.
In contrast to on-policy learning, the final optimal policy obtained by an off-policy algorithm does not depend on how the agent explores: the behavior policy used to gather experience is separated from the target policy being learned. Q-Learning (Watkins and Dayan 1992) is a typical example. In Q-Learning, the algorithm constructs a Q-table that stores the value of every state-action pair and updates it by iteratively exploring the actions in each state. Finally, in production, the agent uses the Q-table to derive the optimal policy.
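The distinction shows up concretely in the bootstrap target each method uses. The snippet below contrasts the two targets for a single hypothetical transition; the Q estimates, reward, and epsilon are made-up numbers, and the SARSA side is written in its expected form so the comparison is deterministic:

```python
gamma = 0.9
Q_next = {"left": 0.2, "right": 0.8}   # current estimates of Q(s', .)
r = 1.0                                # reward observed on the transition

# Off-policy (Q-Learning): bootstrap from the BEST next action,
# regardless of which action the behavior policy will actually take.
q_learning_target = r + gamma * max(Q_next.values())

# On-policy (SARSA): bootstrap from the action a' actually chosen by the
# same epsilon-greedy policy the agent follows. Averaging over that
# choice gives the target's expected value.
epsilon = 0.1
expected_value = ((1 - epsilon) * max(Q_next.values())
                  + epsilon * sum(Q_next.values()) / len(Q_next))
sarsa_expected_target = r + gamma * expected_value

print(q_learning_target, sarsa_expected_target)
```

Because the on-policy target averages in the occasional exploratory (worse) action, it is slightly lower here, which is why SARSA tends to learn more cautious policies than Q-Learning.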
On-policy and off-policy methods differ only in how they update the policy; hence, neither approach is restricted to specific areas of application in economics, and both are widely used.
Sometimes, researchers prefer to classify reinforcement learning algorithms by whether or not they include deep neural networks in their models. When we replace the model, policy function, or value function with a deep neural network in any reinforcement learning method, it becomes a deep reinforcement learning method (deep RL) (Li 2018).
In the previous section, we mentioned that Q-Learning struggles to store information about huge numbers of state-action pairs in complex environments. However, by replacing the table with a deep neural network (DNN), we can obtain a more accurate value estimate for most state-action pairs in far less space. The algorithm that combines Q-Learning with a DNN is the Deep Q-Network (DQN) (Mnih et al. 2015), which achieves high performance on Atari games from the screen alone. In economic scenarios, deep RL also provides new methodologies for mechanism design (Tang 2017). In a way, deep RL agents learn much as humans do, by trial and error, so evaluating an agent's behavior can help accelerate the design of mechanisms whose participants are rational humans; for example, designing an optimal sales mechanism for e-commerce (Cai et al. 2018).
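The core idea of swapping the table for a parameterized function can be sketched without a deep network. The linear approximator below is a simplified stand-in for DQN's neural network (the feature function, learning rate, and target value are made-up), but the update has the same shape: nudge the parameters so the predicted value moves toward the bootstrapped target:

```python
def features(state, action):
    """Hypothetical feature vector for a state-action pair."""
    return [1.0, state, action, state * action]

# Four parameters replace an entire table of Q entries.
w = [0.0, 0.0, 0.0, 0.0]

def q_value(state, action):
    """Q(s, a) is now computed from parameters, not looked up in a table."""
    return sum(wi * fi for wi, fi in zip(w, features(state, action)))

def update(state, action, target, lr=0.05):
    """Semi-gradient step: move q_value(state, action) toward the target."""
    error = target - q_value(state, action)
    phi = features(state, action)
    for i in range(len(w)):
        w[i] += lr * error * phi[i]

# Repeatedly training on one transition drives the prediction to its target.
for _ in range(1000):
    update(state=2.0, action=1.0, target=1.5)
print(round(q_value(2.0, 1.0), 3))   # approaches the target 1.5
```

DQN replaces this linear function with a deep network (plus stabilization tricks such as experience replay and a target network), which lets it generalize across states it has never visited, something a table cannot do.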
In this article, we briefly introduce several taxonomies of reinforcement learning algorithms: model-based versus model-free, on-policy versus off-policy, and deep versus non-deep reinforcement learning. We illustrate different applications of representative algorithms in each taxonomy. Currently, reinforcement learning methods have not yet been widely adopted in economics. However, with the parallel development of computer algorithms and hardware, machine learning and reinforcement learning are becoming increasingly powerful tools, especially in complex decision environments where human intelligence alone is not enough. Reinforcement learning has great potential to empower the future of human decision-making and economic prosperity.
Çalışır, Sinan, and Meltem Kurt Pehlivanoğlu. 2019. “Model-Free Reinforcement Learning Algorithms: A Survey.” In 2019 27th Signal Processing and Communications Applications Conference (SIU), 1–4. https://doi.org/10.1109/SIU.2019.8806389.
Kaiser, Lukasz, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, et al. 2020. “Model-Based Reinforcement Learning for Atari.” arXiv. https://doi.org/10.48550/arXiv.1903.00374.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. 2015. “Human-Level Control through Deep Reinforcement Learning.” Nature 518 (7540): 529–33.
Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, et al. 2016. “Mastering the Game of Go with Deep Neural Networks and Tree Search.” Nature 529 (7587): 484–89. https://doi.org/10.1038/nature16961.
Wan, Runzhe, Xinyu Zhang, and Rui Song. 2021. “Multi-Objective Model-Based Reinforcement Learning for Infectious Disease Control.” In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 1634–44. KDD ’21. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3447548.3467303.
Yu, Pengqian, Joon Sern Lee, Ilya Kulyatin, Zekun Shi, and Sakyasingha Dasgupta. 2019. “Model-Based Deep Reinforcement Learning for Dynamic Portfolio Optimization.” arXiv. https://doi.org/10.48550/arXiv.1901.08740.
Zesen Zhuang is an active member of the SciEcon NFT research lab and a junior at Duke Kunshan University majoring in Data Science. He has a solid foundation in computer science and excels in data science courses. His areas of interest include reinforcement learning and algorithmic trading. Under the guidance of Prof. Luyao Zhang, he works on the combined application of algorithmic trading and reinforcement learning. He is also involved in several projects at SciEcon CIC, for which he provides core technical support, and serves as the Chair of Technology Development at SciEcon CIC, exploring the possibilities of decentralized networks.
Xinyu Tian is a rising senior majoring in Data Science at Duke Kunshan University (DKU) and a full-admission scholarship recipient at DKU. Her areas of interest include the theory and applications of cooperative AI, blockchain trust and consensus, game theory, and computer vision. Her paper on meta-learning algorithms was published at the 2021 International Conference on Computer Engineering and Application (ICCEA), and she was supported by the Summer Research Scholarship (SRS) program at Duke Kunshan University in 2021 and 2022. She is now working on her Signature Work about cooperative AI, mentored by Prof. Luyao Zhang. Beyond her academic interests, she hopes to support communication between academia and industry, serving as Chair of Communication at SciEcon CIC and contributing to the Industry 4.0 Open Educational Resources (OER).
An agent is anything that perceives its environment, takes actions autonomously to achieve goals, and may improve its performance through learning or the use of knowledge.
The transition probability distribution (or transition model) and the reward function are often collectively called the "model" of the environment (or MDP).
A policy defines how an agent acts from a specific state.
The value function represents how good a state is for an agent to be in.
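Formally, under the standard discounted-return formulation (with a discount factor γ, which the glossary above leaves implicit), the value of a state s under a policy π can be written as:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_{0} = s \right]
```

i.e., the expected cumulative discounted reward obtained by starting from state s and following π thereafter.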