Zhihong Deng’s Homepage

Causality x Agents: TMLR (2023) The Latest Survey on Causal Reinforcement Learning

2023-12-03T00:00:00-08:00

This blog post was originally written in Chinese. Readers interested in the original text can visit this link. Thank you! 😸

Links to our paper:

TMLR: https://openreview.net/pdf?id=qqnttX9LPo
ArXiv: https://arxiv.org/abs/2307.01452

Tutorial:

Causal Reinforcement Learning: Empowering Agents with Causality

Github Project (We will keep this project regularly updated.): Feel free to give us a star ⭐ if you found this helpful. We also greatly appreciate everyone contributing together! 😊

Awesome-Causal-RL: A curated list of causal reinforcement learning resources.

Preface

This time, I’d like to share with you the recent publication of our comprehensive review paper on Causal Reinforcement Learning (Causal RL), which has officially been accepted by Transactions on Machine Learning Research (TMLR). Causal RL is an intriguing, burgeoning field that centers on empowering agents to better comprehend the causality within an environment, thereby enabling them to make more informed and effective decisions. In this blog post, alongside delving into segments of our survey paper, I’ll also briefly share some reflections on this topic. To keep our discussion focused, let’s establish a few key points:

1️⃣ A Fact: Agents Still Lack Understanding of Causality

By the end of 2023, agents still struggle considerably in grasping causality. ChatGPT stands as one of the most advanced agents based on LLMs. Let’s use it as an example:

A failure case of ChatGPT.

In this example, we created a hypothetical food called ‘tomapotato’ and shared with ChatGPT a supposed connection between eating this food and developing a mysterious disease (correlation). Additionally, we pointed out that individuals who regularly eat instant noodles rarely suffer from this disease. We then asked ChatGPT about whether sailors would fall ill with this disease if they had plenty of instant noodles (causality). ChatGPT ‘smoothly’ fell into our prepared trap and responded: ‘As long as the sailors avoid consuming tomapotato and stick to the provided instant noodles, they should not be susceptible to the disease.’ This is a classic case of mistaking correlation for causation. It’s noteworthy that if we were to replace this scenario with a historically accurate example—replacing tomapotato with contaminated meat and the mysterious disease with scurvy—ChatGPT would provide the correct answer. This example has been documented in various sources, so ChatGPT’s correct response isn’t surprising. However, minor alterations could mislead such a state-of-the-art agent, rendering its responses unreliable and potentially causing real risks. This emphasizes the need for reflection and further investigation.

Similarly, researchers from TU Darmstadt described a related phenomenon in a paper recently published in TMLR¹:

LLMs are `causal parrots.`

The authors referred to this phenomenon as ‘Causal Parrots.’ In simple terms, this implies that even the most powerful agents, although they might exhibit human-like behavior, are merely repeating causal knowledge that already exists in the training data, without genuinely understanding it, much like parrots mimicking speech.

2️⃣ A Value: Agents Should Understand Causality

While acknowledging the aforementioned facts, we sometimes hear a voice suggesting that intelligent agents do not need to understand causality since humans make mistakes about causality as well. Sometimes, decisions based on correlation alone seem satisfactory, so why bother pursuing causality for agents? Although agents like ChatGPT don’t fully grasp causality, it doesn’t stop them from quickly becoming an indispensable part of many people’s workflow. In the task of conversation, leveraging internet-scale data, sufficiently large model capacities, and proper alignment, many tasks based on natural language seem to have been solved. However, we believe there remains a significant gap between ‘usable’ and ‘reliably usable.’ If we hope for intelligent agents to more deeply engage in human society, particularly in making important decisions related to humans, we cannot settle for ‘usable’ but must forge a new path toward ‘reliably usable’—and Causal RL may be such a path. To quote Judea Pearl²:

I believe that causal reasoning is essential for machines to communicate with us in our own language about policies, experiments, explanations, theories, regret, responsibility, free will, and obligations—and, eventually, to make their own moral decisions.

Therefore, it is valuable for agents to understand causality, and it’s definitely something worth pursuing. The value is not only reflected in enabling agents to naturally communicate their choices and intentions with humans but also in enhancing our trust in agents, freeing us from the fear of super artificial intelligence.

3️⃣ A Method: How to Enable Agents to Understand Causality? Still Under Research

While we may reach consensus on the previous two points, the third one admits various answers. Should we impart causal knowledge to artificial agents directly, or merely offer several first principles for them to learn on their own? Should agents learn causality through interacting with the environment or by studying historical data alone? Should causal reasoning modules be integrated internally within the agent, or should they exist as external tools? Each distinct choice can lead to many intriguing solutions, employing differing techniques. In the survey, we organized the existing work on causal RL based on four major problems: enhancing sample efficiency, advancing generalizability and knowledge transfer, addressing spurious correlations, and promoting explainability, fairness, and safety. As research on agents continues to evolve, new challenges will emerge, and we genuinely welcome everyone to contribute and share your thoughts and insights on this topic.

Before diving into the main content, let’s add a bit of fundamental knowledge to help understand how causality is formalized mathematically.

Background Knowledge

First, let’s introduce Judea Pearl’s Structural Causal Model (SCM)³:

An SCM is a tuple comprising a set of endogenous variables, a set of exogenous variables, a set of structural equations, and a joint distribution over the exogenous variables. Endogenous variables are the ones we are interested in for a given research problem, such as the states and rewards in an MDP (Markov Decision Process). Exogenous variables are the ones we don’t specifically care about, also known as background variables or noise variables. These variables are linked through structural equations. Each structural equation specifies how one endogenous variable is determined: the variable on the left side of the equation is the effect, and the variables on the right side (its endogenous parents together with the relevant exogenous variables) are the corresponding causes. Because the equations themselves are deterministic, all randomness stems from the exogenous variables. As a result, given the values of all exogenous variables, the values of the endogenous variables are determined. In this way, an SCM describes the regularities of how a system (or the world) operates, enabling us to use it to discuss and understand various causal concepts.
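As a rough illustration of the description above, here is a minimal Python sketch of an SCM. The three binary variables (`Z`, `X`, `Y`), the noise names, and all probabilities are made up for this example and are not taken from the survey.

```python
import random

def sample_exogenous(rng):
    """Draw the exogenous (noise) variables: the only source of randomness."""
    return {"u_z": rng.random(), "u_x": rng.random(), "u_y": rng.random()}

# Hypothetical structural equations. Each one deterministically maps the
# endogenous parents and one noise term to the value of an endogenous variable.
def f_z(u):
    return int(u["u_z"] < 0.5)

def f_x(z, u):
    return int(u["u_x"] < (0.8 if z else 0.2))

def f_y(x, z, u):
    return int(u["u_y"] < 0.1 + 0.2 * x + 0.5 * z)

def solve(u):
    """Evaluate the equations in causal order: Z, then X, then Y."""
    z = f_z(u)
    x = f_x(z, u)
    y = f_y(x, z, u)
    return {"Z": z, "X": x, "Y": y}

rng = random.Random(0)
u = sample_exogenous(rng)
print(solve(u))  # the same u always yields the same endogenous values
```

Once the noise values are fixed, calling `solve` again returns exactly the same result, which is the sense in which the equations are deterministic.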

Earlier, we mentioned that ‘correlation does not imply causation,’ and we can better grasp this distinction using a pie chart:

Suppose the complete pie chart corresponds to the population we are interested in, such as all sailors. This population can be divided into three sections based on the food consumed, which we denote as variable $X$, and we use variable $Y$ to indicate whether a sailor gets the disease. The question we want to ask is: does consuming a certain food make sailors more or less likely to get the disease? In traditional machine learning, we might attempt to build a predictive model to fit the conditional distribution $P(Y \vert X)$. However, the conditional distribution only studies a single section; it can answer questions about correlation, for instance: if we (passively) observe sailors consuming a certain food, how likely are they to get the disease? Yet the question we are actually interested in is a causal one: what changes if we (actively) carry out a certain intervention? To distinguish between the two, researchers introduced the do-operator. The intervention distribution $P(Y \vert \text{do}(X=x))$ represents how likely all sailors are to get the disease when they are required to consume a certain food. It might seem like a slight difference, but it’s crucial. If a model can only answer questions about correlation, it will be challenging to derive reliable conclusions from its predictions, because the correlation could arise from some confounding factor omitted by the model (e.g., a common cause of $X$ and $Y$). Perhaps consuming the food itself doesn’t lead to the disease, but an unknown gene makes people both prefer this food and be more susceptible to the disease (which would create a spurious correlation between the two). If all sailors consume this food regardless of such unknown causes, we can resolve this mystery. In the language of SCM, $\text{do}(X=x)$ removes the structural equation of $X$ and directly sets its value to $x$, thereby creating a world that satisfies $X=x$ while differing as little as possible from the original world.
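To make the gap between the two distributions concrete, here is a small sketch that computes both exactly for the "unknown gene" story. The numbers are invented for illustration: `Z` is the hidden gene, `X` the food, `Y` the disease, and the food is assumed to have no effect at all on the disease.

```python
from itertools import product

P_Z = {0: 0.5, 1: 0.5}                 # hypothetical prevalence of the gene

def p_x_given_z(x, z):
    p1 = 0.9 if z else 0.1             # the gene makes people prefer the food
    return p1 if x == 1 else 1 - p1

def p_y_given_xz(y, x, z):
    p1 = 0.8 if z else 0.1             # the gene drives the disease; x is ignored
    return p1 if y == 1 else 1 - p1

def joint(x, y, z, do_x=None):
    """Joint probability in the original model, or under do(X=do_x)."""
    if do_x is None:
        px = p_x_given_z(x, z)
    else:
        px = 1.0 if x == do_x else 0.0  # X's equation replaced by a constant
    return P_Z[z] * px * p_y_given_xz(y, x, z)

def p_y1_given_x(x, do_x=None):
    num = sum(joint(x, 1, z, do_x) for z in (0, 1))
    den = sum(joint(x, y, z, do_x) for y, z in product((0, 1), repeat=2))
    return num / den

print(round(p_y1_given_x(1), 3))           # P(Y=1 | X=1)     -> 0.73
print(round(p_y1_given_x(1, do_x=1), 3))   # P(Y=1 | do(X=1)) -> 0.45
```

In this toy model, observing that someone eats the food raises the disease probability to 0.73 because it is evidence of the gene, whereas forcing everyone to eat it leaves the probability at the population rate of 0.45.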

Perhaps someone might ask: Is it really feasible to force all sailors to consume the same food? This approach seems unrealistic and unethical. If we cannot achieve this, how can we ensure that the model learns the intervention distribution we are truly interested in? This brings up one of the most important achievements in causal science — how to predict the effects of an intervention without actually enacting it. To explain with the previous example, we can express assumptions in the form of a causal graph, where each node represents an endogenous variable, and the presence or absence of edges represents our assumptions about causal relationships. $X \rightarrow Y$ represents our assumption that consuming a certain food might lead to a disease. $X \leftarrow Z \rightarrow Y$ represents our assumption that there exists a confounding variable $Z$, which affects both the food consumed and the likelihood of getting the disease. In this way, we can obtain Figure 3.3:

Similarly, to express that all sailors are forced to consume the same food, we can remove the edge $X \leftarrow Z$ from Figure 3.3 to obtain Figure 3.4. In other words, Figure 3.4 represents an intervened world, where the intervention distribution $P(Y \vert \text{do}(X=x))$ we care about is equivalent to the conditional distribution $P_m(Y \vert X=x)$; the subscript $m$ marks probabilities in this modified model. Leveraging basic knowledge of probability theory, we can make the following deductions:

\[\begin{split} P(Y=y \vert \text{do}(X=x)) &= P_m (Y=y \vert X=x)\\ &= \sum_z P_m(Y=y\vert X=x, Z=z) P_m(Z=z \vert X=x)\\ &= \sum_z P_m(Y=y\vert X=x, Z=z) P_m(Z=z)\\ &= \sum_z P(Y=y\vert X=x, Z=z) P(Z=z)\\ \end{split}\]

The second equality is the law of total probability. The third equality holds because, in the modified graph, the edge $X \leftarrow Z$ has been removed and there is no edge $X \rightarrow Z$, so $X$ and $Z$ are independent and the conditional probability equals the marginal probability. The fourth equality is based on two invariances: the marginal probability $P(Z=z)$ remains unchanged before and after intervention since removing $X \leftarrow Z$ doesn’t affect the value of $Z$; the conditional probability $P(Y=y \vert X=x, Z=z)$ remains unchanged before and after intervention because whether $X$ changes spontaneously or is fixed to a certain value, the structural equation for $Y$ (the process that generates $Y$) remains unchanged. So, what’s the use of this conclusion? It solves precisely the problem that had been bothering us earlier. Looking at the right side of the equation, we can see that both terms are ordinary probabilities, without any do-operator or subscript ‘$m$’. This means we can use observational data to answer causal questions! Of course, in reality, variable $Z$ may not always be observable, so sometimes the causal effect of interest is unidentifiable.
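The adjustment formula derived above can be tried out on simulated observational data. The sketch below is only an illustration under made-up assumptions: a binary confounder `Z` that is observed, a food `X` with no real effect on the disease `Y`, and invented probabilities. The function names are hypothetical.

```python
import random

def simulate(n, rng):
    """Observational samples (z, x, y) from a hypothetical confounded model."""
    data = []
    for _ in range(n):
        z = int(rng.random() < 0.5)                   # confounder
        x = int(rng.random() < (0.9 if z else 0.1))   # Z influences the food eaten
        y = int(rng.random() < (0.8 if z else 0.1))   # Z influences the disease; X does not
        data.append((z, x, y))
    return data

def naive_estimate(data, x):
    """Empirical P(Y=1 | X=x): a purely correlational quantity."""
    ys = [yi for _, xi, yi in data if xi == x]
    return sum(ys) / len(ys)

def backdoor_estimate(data, x):
    """Empirical sum_z P(Y=1 | X=x, Z=z) P(Z=z), i.e. the adjustment formula."""
    total = 0.0
    for z in (0, 1):
        p_z = sum(1 for zi, _, _ in data if zi == z) / len(data)
        ys = [yi for zi, xi, yi in data if zi == z and xi == x]
        total += (sum(ys) / len(ys)) * p_z
    return total

data = simulate(100_000, random.Random(0))
print(naive_estimate(data, 1), backdoor_estimate(data, 1))
```

In this model the true value of $P(Y=1 \vert \text{do}(X=1))$ is 0.45, so the adjusted estimate should land close to it while the naive estimate stays far above.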

Apart from intervention, there’s another highly important concept in causal science: the counterfactual. In traditional machine learning, we seldom discuss this concept, but it’s quite common in our daily lives. Continuing from the previous example, after observing sailors who consumed “tomapotato” falling ill with a mysterious disease, we often contemplate a question: What would have been the outcome if these sailors hadn’t consumed “tomapotato” initially? Traditional statistics lacks a language to articulate such a question because we can’t precisely characterize “traveling back in time” (if … initially) and “intervention” (hadn’t consumed) solely through passively observed empirical data. To articulate the correct question, we must consider causal relationships and go one step beyond intervention. Why is that? With the aid of the do-operator, we can attempt to write down the question as:

\[P(Y \vert \text{do}(X=x'), Y=1).\]

We can see that there are two $Y$s in this expression, each with a different meaning. The first $Y$ represents the outcome when sailors do not consume ‘tomapotato,’ while the second $Y$ is the actual outcome for sailors who consumed ‘tomapotato.’ To differentiate between these two scenarios, we need to introduce a new language called ‘potential outcomes.’⁴ This involves using subscripts to specify the values of specific variables. For an individual $u$, we can denote their potential outcome if they consumed ‘tomapotato’ as $Y_{X=x}(u)$ and their potential outcome if they did not consume it as $Y_{X=x'}(u)$. Obviously, for the same individual, we can only observe one potential outcome (the outcome in the factual world). For those unobserved potential outcomes, we can imagine they are the ones observed in parallel worlds (counterfactual worlds). Through potential outcomes, we can express the quantities of interest as follows:

\[P(Y_{X=x'} \vert X=x, Y_{X=x}=1).\]

Here, we first merge the condition $\text{do}(X=x')$ into the target variable $Y$ to obtain $Y_{X=x'}$, as this is the actual variable of interest. Then, we complete the conditions: the original condition $Y=1$ refers to observing $X=x$ and $Y_{X=x}=1$. Moreover, since the potential outcome $Y_{X=x}$ coincides with the observed $Y$ whenever $X=x$ (the consistency property), we can rewrite this probability as:

\[P(Y_{X=x'} \vert X=x, Y=1).\]

This expression represents: “If we observe $X=x, Y=1$, what would be the outcome in the world of $X=x'$?” This is precisely the question we want to ask. Using SCM, we can calculate this counterfactual quantity through three steps: first, we infer the values of exogenous variables $U$ based on known facts $X=x$ and $Y=1$; then, we modify the structural equations according to $\text{do}(X=x')$ to establish a parallel world; finally, by substituting the values of $U$ into the modified structural equations, we can calculate the value of the counterfactual variable. This process itself is transparent, so counterfactuals are not just a hypothetical concept; they can be rigorously characterized using mathematical language. However, because SCM is typically unknown in practical applications, this process might not always be applicable. In such cases, we often rely on qualitative assumptions (such as causal graphs) to assist in computations involving counterfactuals.
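The three steps can be sketched in a few lines of Python. This is a deliberately simple illustration that assumes a fully known SCM with one hypothetical structural equation, $Y = 2X + U_Y$, so that the noise can be recovered exactly; in general the first step yields a posterior distribution over the exogenous variables rather than a single value.

```python
def f_y(x, u_y):
    """Hypothetical structural equation for the outcome: Y = 2*X + U_Y."""
    return 2.0 * x + u_y

def counterfactual_y(x_obs, y_obs, x_cf):
    # Step 1 (abduction): recover the noise consistent with the observed facts.
    u_y = y_obs - 2.0 * x_obs
    # Step 2 (action): replace the equation of X by the constant x_cf.
    x = x_cf
    # Step 3 (prediction): push the recovered noise through the modified model.
    return f_y(x, u_y)

# Observed X=1, Y=3. What would Y have been had X been 0?
print(counterfactual_y(x_obs=1.0, y_obs=3.0, x_cf=0.0))  # 1.0
```

Setting `x_cf` equal to the observed value returns the observed outcome, which matches the consistency property of potential outcomes.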

So far, we’ve introduced SCM, which provides a way to mathematically model a world following specific causal mechanisms. We emphasized the distinction between correlation and causation, introducing interventions and counterfactuals based on SCM. These concepts aren’t merely jargon; they address the limitation in traditional statistics which focuses on correlation rather than causation. Will a new recommendation algorithm increase click-through rates and user retention? Would a patient have recovered better if they hadn’t undergone a specific treatment? Does a university’s admission decision involve gender or racial bias? These are all causal questions and they drive scientific and social progress. Consider a world governed by correlation rather than causation, where we might be attacked by sharks because we eat more ice cream or avoid seeking medical treatment due to a fear of death. Such a world would be absurd.

Okay, the basics end here. Causal science is an extensive field with many concepts, terminologies, and techniques worth exploring. However, to keep this post concise, we won’t delve further into it. 🕊️ Interested readers can refer to related books³⁵, our survey paper and blogs for more information:

Zhihong Deng: [Practical Notes] Reading Notes on “Causal Inference in Statistics: A Primer” (in Chinese)

Back to the topic of reinforcement learning, after going through the content on causality, you might have some thoughts:

What is the connection between policy in reinforcement learning and intervention in causality?
What about the environmental model in model-based reinforcement learning (MBRL) and the causal model we mentioned earlier?

Being able to instantly come up with these two questions demonstrates your sharp thinking! 😎💡 Firstly, in reinforcement learning, the process of interaction between agents and the environment is itself a form of intervention. However, this intervention typically doesn’t directly set the action variable $A_t$ to a fixed value but retains its dependence on the state variable $S_t$. Since we acknowledge that a policy represents a form of intervention, a natural question arises: Does reinforcement learning itself learn causal relationships? For on-policy RL, agents indeed learn the total effect of actions on outcomes, and the observation that on-policy methods often perform better than off-policy methods is consistent with this point. However, it’s essential to recognize that the environment itself contains other types of causal relationships, such as the causal relationships among various state dimensions in adjacent time steps. Without understanding these causal relationships, the ability of agents will be limited to a single fixed environment. When the environment changes (e.g., due to external interventions affecting other factors in the environment), the agent’s performance decreases significantly. Regarding the second question, although traditional MBRL learns the environment model (sometimes referred to as the world model), it stops at the level of correlations. Once unobservable confounding factors show up, the learned model becomes unreliable (think about the earlier example).

At the end of this section, we present the definition of causal RL used in the paper:

Definition (Causal reinforcement learning). Causal RL is an umbrella term for RL approaches that incorporate additional assumptions or prior knowledge to analyze and understand the causal mechanisms underlying actions and their consequences, enabling agents to make more informed and effective decisions.

The definition is broad, focusing primarily on two aspects: first, it emphasizes an understanding of causal relationships rather than superficial correlations to enhance agents’ decision-making abilities; and second, to achieve the former, causal RL usually requires incorporating additional assumptions and prior knowledge. Unlike other domains such as meta RL or offline RL, causal RL does not have a unified problem formulation or strive to solve a certain type of task. Research in causal RL typically designs solutions tailored to specific difficulties in a research problem. Therefore, in our survey, we adopt a problem-oriented taxonomy to organize the existing work on causal RL: 1) sample efficiency; 2) generalization and knowledge transfer; 3) spurious correlations; 4) explainability, fairness, and safety. Next, I will select a few papers to briefly illustrate the new perspectives that causal RL offers on these issues.

Sample Efficiency

First, we know that sample inefficiency is often criticized in reinforcement learning. There has been a considerable amount of research centered around improving sample efficiency, and here, I’ll introduce two works related to causal RL.

The first one was presented at NeurIPS 2021, titled “Causal Influence Detection for Improving Efficiency in Reinforcement Learning.”⁶ It deals with a robotic manipulation problem:

In reinforcement learning, the state space is usually considered as a whole. However, it can be decomposed based on the physical meaning of state dimensions, where each part corresponds to an entity, such as the end effector of a robotic arm or objects on a table. For robotic manipulation tasks, an agent must establish physical contact with an object to succeed. Only then can it potentially move the object to a target location. From a causal perspective, the essence of establishing physical contact is whether actions can causally influence objects. Therefore, we can construct a causal quantity to assess the effect of actions on objects, referred to as ‘causal influence’ in the paper. Based on causal influence, we can guide the agent on how to explore the environment more effectively: prioritizing exploration in areas where physical contact with objects can be established, thereby improving sample efficiency. This approach is straightforward and effectively prevents the agent from engaging in meaningless exploration in the vast state-action space (e.g., randomly waving its arm in the air).
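One way to quantify such influence, in the spirit of the paper, is the conditional mutual information between the action and the object’s next state in a given state. The sketch below computes it for a made-up discrete toy with two actions and two object positions; the distributions are hypothetical, all action probabilities are assumed to be positive, and the paper itself works with learned models rather than known tables.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def action_influence(next_dists, action_probs):
    """I(S'_obj; A | S=s) = E_a[ KL( p(s'|s,a) || p(s'|s) ) ] for one state s.

    next_dists[i] is the distribution of the object's next position under
    action i; action_probs[i] is the (positive) probability of action i.
    """
    n = len(next_dists[0])
    marginal = [sum(pa * d[i] for pa, d in zip(action_probs, next_dists))
                for i in range(n)]
    return sum(pa * kl(d, marginal) for pa, d in zip(action_probs, next_dists))

uniform = [0.5, 0.5]
far_from_object = [[1.0, 0.0], [1.0, 0.0]]   # object stays put whatever we do
in_contact = [[1.0, 0.0], [0.0, 1.0]]        # push-left vs. push-right

print(action_influence(far_from_object, uniform))  # 0.0
print(action_influence(in_contact, uniform))       # log 2, about 0.693
```

A score like this could serve as an exploration bonus that is zero while the arm waves in free space and positive once the agent can actually move the object.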

The second work I want to discuss here is DeepMind’s paper accepted by ICLR 2019, titled “Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search.”⁷ It studies an episodic POMDP problem:

The idea is quite simple: Many existing MBRL methods synthesize data from scratch. However, a common human practice involves leveraging counterfactual reasoning to make full use of experiential data. For instance, after encountering a failure, we often contemplate changing a previous decision at some point in time to alter the outcome. This mode of thinking significantly improves our learning efficiency. Similarly, we aim for agents to harness the power of counterfactual reasoning as well:

The experimental environment is Sokoban, where the green entity represents the agent. A 3×3 area of cells centered on the agent is observable, while the rest of the map is masked with a 90% probability, making it a POMDP problem. The authors employed an idealized setup assuming known dynamics, allowing the agent to focus on inferring the distribution of exogenous variables. The experimental results supported this approach: counterfactual reasoning significantly improved sample efficiency.
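The idea of reusing a logged episode counterfactually can be sketched as follows. This is a heavily simplified toy and not the paper’s algorithm: the dynamics `s' = s + a + u` are hypothetical and fully observed, so the noise of each step can be recovered exactly, whereas the paper deals with partial observability and a posterior over the noise.

```python
def step(s, a, u):
    """Hypothetical known dynamics: next state = state + action + noise."""
    return s + a + u

def infer_noise(states, actions):
    """Abduction: recover the noise of each step from a logged trajectory."""
    return [s_next - s - a
            for s, a, s_next in zip(states, actions, states[1:])]

def replay(s0, new_actions, noise):
    """Re-run the episode with the same noise but different actions."""
    states = [s0]
    for a, u in zip(new_actions, noise):
        states.append(step(states[-1], a, u))
    return states

logged_states = [0, 2, 1, 3]       # observed episode
logged_actions = [1, -1, 1]
noise = infer_noise(logged_states, logged_actions)   # [1, 0, 1]
print(replay(0, [1, 1, 1], noise))                   # [0, 2, 3, 5]
```

The replayed trajectory answers "what would have happened in this very episode under other actions", instead of sampling a fresh episode from scratch.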

In addition to the methods discussed in these two works, another common approach involves learning causal representations — removing redundant dimensions, enabling agents to concentrate on those state dimensions that affect the outcomes. This approach effectively simplifies the problem, thereby indirectly enhancing sample efficiency.

Generalization Ability and Knowledge Transfer

Traditional RL mostly involves training and testing in the same environment. However, in recent years, a growing body of research and benchmarks has focused on agents’ generalization and adaptation abilities. For instance, attention has been drawn to settings like meta-learning, multi-task learning, and continual learning. These settings commonly involve multiple different yet similar environments and tasks. From a causal perspective, we can interpret the changes involved as different interventions on some contextual variables. For example, in robotic manipulation tasks, demanding that an agent be robust to color changes essentially treats color as a contextual variable. Then, the agent needs to learn the optimal policy for the corresponding contextual MDP. This contextual MDP can be described using a causal model, where altering the color only intervenes on one variable of the model, and this variable might be irrelevant to the outcome. In such cases, an agent that understands the causality behind color changes exhibits robustness and can easily generalize to new environments. If the change involves altering an object’s mass, the agent simply needs to adapt the module corresponding to mass, without having to tune the entire model.

In this section, I will also explain some interesting ideas provided by causal RL using two works as examples. The first work is ‘A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning,’⁸ presented at ICLR 2022. This work investigates the model generalization problem in MBRL, where context variables determining changes (such as weather, terrain, etc.) are unobservable. In traditional MBRL approaches, trajectory segments are commonly used to encode context information. However, this process tends to encode irrelevant information from states and actions into the context, thereby affecting the usefulness of the estimated context.

The authors proposed a causality-inspired solution with the following framework:

Firstly, the learning objective involves a prediction term, i.e., predicting the next state $S’$ based on the current state $S$, action $A$, and contextual information $Z$. Additionally, to understand how to encode context, there needs to be a relational term. Its aim is to make the contextual variables of steps within the same trajectory or from similar trajectories generated from the same environment as consistent as possible. The challenge here lies in determining whether trajectories come from the same environment. From a causal perspective, within the same environment, the causal effects of contextual variables on $S’$ should be similar. Therefore, we can use this causal effect to determine if steps belong to the same environment. Furthermore, there exist multiple causal paths from $Z$ to $S’$, such as $Z \rightarrow S \rightarrow S’$ and $Z \rightarrow A \rightarrow S’$. The authors suggest that these indirect paths are susceptible to noise. Therefore, they choose to measure the controlled direct effect (CDE), which is the causal effect transmitted solely through the direct path $Z \rightarrow S’$. Experiments indicate that this method not only reduces prediction errors in state transitions but also achieves excellent zero-shot generalization in new environments. Based on the PCA results in Figure 1, it’s evident that the learned contextual variables effectively capture environmental changes.

The second paper to be introduced was also presented at ICLR 2022, titled “AdaRL: What, Where, And How to Adapt in Transfer Reinforcement Learning.”⁹ This paper focuses on the problem of transfer learning in reinforcement learning. Specifically, during training, the agent can access multiple source domains, and during testing, it needs to achieve good performance with only a small number of samples from the target domain. The paper proposes a framework called AdaRL:

In particular, the authors divide the model into domain-specific and domain-shared parts. The domain-specific parameters are low-dimensional, depicting variations of a specific domain. These variations can manifest in the model’s observations, states, and rewards within that specific domain. The domain-shared parameters include the causal relationships (edges on a causal graph) among different state dimensions, actions, and rewards, and their underlying causal mechanisms. By utilizing data from multiple source domains, agents can effectively learn causal models that can be reliably transferred as knowledge. Simultaneously, the low-dimensional domain-specific parameters require only a small number of samples from the target domain, making this framework highly flexible.
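To see why a low-dimensional domain-specific parameter needs so few target samples, consider the toy below. It is not AdaRL’s model: it simply assumes that every domain shares the hypothetical structure `s' = s + theta * a` and differs only in the scalar `theta`, which is then fitted by least squares from three target-domain transitions.

```python
def fit_theta(transitions):
    """Least-squares estimate of theta in s' = s + theta * a.

    The functional form is the shared part; theta is the only
    domain-specific quantity in this toy.
    """
    num = sum(a * (s_next - s) for s, a, s_next in transitions)
    den = sum(a * a for _, a, _ in transitions)
    return num / den

def predict(s, a, theta):
    """Shared dynamics model, specialised to a domain by theta."""
    return s + theta * a

# Three (s, a, s') samples from a hypothetical target domain with theta = 0.5
target_samples = [(0.0, 1.0, 0.5), (0.5, 2.0, 1.5), (1.5, -1.0, 1.0)]
theta_target = fit_theta(target_samples)
print(theta_target, predict(1.0, 2.0, theta_target))  # 0.5 2.0
```

Because the structure is reused across domains, only one number has to be estimated in the target domain.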

Spurious Correlations

Spurious correlation is a very common phenomenon, but traditional reinforcement learning rarely focuses on this issue. Below, we’ll use recommendation systems as an example to introduce two types of spurious correlations:

The first type is called confounding bias, where there might not be a causal relationship between two variables, but because an unobservable confounding variable simultaneously affects both of them, they exhibit a strong correlation. The second type is termed selection bias, in which the association between two variables arises only because we condition on a third variable that both of them influence (for example, when the data were collected only for certain values of that variable). Techniques for analyzing and addressing these two types of issues in causal inference are well-established, so our primary concern lies in identifying the spurious correlations present in reinforcement learning problems.
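Selection bias is easy to reproduce with an exact toy calculation. In the sketch below, `A` and `B` are two independent fair coins (a made-up setup, not from the survey), and a record enters the dataset only if at least one of them is 1. Conditioning on that selection makes the two coins dependent.

```python
from itertools import product

def p_b1_given_a(a, selected_only):
    """P(B=1 | A=a) for two independent fair coins A and B.

    If selected_only is True, we additionally condition on the selection
    variable S = (A or B) being 1. All four worlds are equally likely, so
    conditional probabilities reduce to counting.
    """
    worlds = [(x, y) for x, y in product((0, 1), repeat=2)
              if x == a and (not selected_only or x or y)]
    return sum(y for _, y in worlds) / len(worlds)

print(p_b1_given_a(0, selected_only=False))  # 0.5
print(p_b1_given_a(1, selected_only=False))  # 0.5
print(p_b1_given_a(0, selected_only=True))   # 1.0
print(p_b1_given_a(1, selected_only=True))   # 0.5
```

In the full population, knowing `A` says nothing about `B`; within the selected records, `A=0` guarantees `B=1`, which is a correlation created purely by how the data were collected.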

Here, let’s delve deeper into confounding bias in policy learning through the paper ‘Causal Confusion in Imitation Learning’¹⁰ presented at NeurIPS 2019. Intuitively, we often believe that the more information available for decision-making, the better: for instance, we prefer multimodal over unimodal data, or construct various feature interactions rather than using raw features. However, this paper makes an intriguing point, namely that having more information isn’t always better:

In this example depicted in the figure, the appearance of pedestrians is a confounding variable that causes both the brake light activation and the braking behavior. If an agent can observe the brake light, it might mistakenly believe that the brake light causes the act of braking. However, if the agent can only focus on the road ahead, it will correctly learn that the pedestrian in front is the actual cause of the braking behavior. If the agent learns the former, it will drive dangerously.

Meanwhile, our paper titled ‘False Correlation Reduction for Offline Reinforcement Learning,’¹¹ accepted by TPAMI this year, focuses on addressing spurious correlations in offline RL:

Due to the limited size of the sample set, the value of some suboptimal actions may appear ‘overrated.’ If an agent doesn’t consider the spurious correlations brought about by epistemic uncertainty, it may be influenced by these suboptimal actions, thereby failing to learn the optimal policy. For a detailed explanation, you can also refer to our blog posts:
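The ‘overrated’ effect and one generic remedy can be shown with a tiny bandit-style example. This is not the algorithm from the paper: the dataset is invented, and the count-based penalty `c / sqrt(n)` is just one simple, assumed form of uncertainty penalty of the kind used in pessimistic offline RL.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def greedy_action(returns):
    """Pick the action with the highest empirical mean return."""
    return max(returns, key=lambda a: mean(returns[a]))

def pessimistic_action(returns, c=1.0):
    """Pick the action with the best lower bound: mean - c / sqrt(n)."""
    return max(returns,
               key=lambda a: mean(returns[a]) - c / math.sqrt(len(returns[a])))

# Hypothetical offline dataset: "a" is well covered, "b" was tried only twice
# and happened to get one lucky return.
returns = {"a": [0.9, 1.1] * 50, "b": [2.5, 0.5]}
print(greedy_action(returns), pessimistic_action(returns))  # b a
```

The greedy rule trusts the two lucky samples of `b`, while the penalized rule prefers the well-supported action `a`.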

Explainability, Fairness, and Safety

Due to time constraints, I’ll update this section when I have more free time. Will try not to delay! 🕊️

Overall, we can summarize some common ideas provided by existing works in causal RL using the following diagram:

Limitations

So far, we have demonstrated the immense potential of causal RL methods in enhancing the decision-making capabilities of agents. However, it is not a panacea. Recognizing the limitations of existing methods is also important. One of the most significant limitations of Causal RL is its requirement for domain knowledge. Many causal RL methods rely on causal graphs, thus making accurate causal assumptions crucial. For instance, when dealing with confounding biases, some methods necessitate the use of proxy variables and instrumental variables, which, if not handled carefully, could potentially introduce additional risks.

Moreover, in some real-world scenarios, raw data is often highly unstructured, such as image and text data. The causal variables involved in the data generation process are usually unknown, necessitating the development of methods to extract meaningful causal representations from high-dimensional raw data – causal representation learning. These methods usually require learning from multi-domain data or allowing explicit interventions on environmental components to simulate the generation of intervention data. The quality and granularity of the learned representations heavily depend on available distributional shifts, interventions, or relevant signals, while real-world agents often only have access to a limited number of domains. Some methods not only aim to learn causal representations but also attempt to learn the underlying causal model, which is highly challenging. In some cases, learning a causal model might be more difficult than directly learning the optimal policy, potentially offsetting the gains in sample efficiency brought by using the model.

Finally, we need to acknowledge the limitations associated with counterfactual reasoning. Obtaining accurate and reliable counterfactual estimates often requires making strong assumptions about the underlying causal structure since, by definition, counterfactuals cannot be directly observed. Some counterfactual quantities are nearly impossible to identify, while others can be identified under appropriate assumptions, such as the effect of treatment on the treated (ETT). Additionally, the computational complexity of counterfactual reasoning is a bottleneck, especially when dealing with high-dimensional state and action spaces. This complexity might hinder real-time decision-making in complex tasks.

Resources

Prof. Elias Bareinboim¹² is one of the earliest scholars to systematically investigate the field of Causal RL. Many significant works in this field came from his research group. He gave two tutorials, at UAI 2019 and ICML 2020 respectively:

【Tutorial】Towards Causal Reinforcement Learning [Video] (UAI 2019)
【Tutorial】Towards Causal Reinforcement Learning [Video] (ICML 2020)

Dr. Chaochao Lu¹³ has also done excellent work in popularizing Causal RL. He believes that Causal RL is one of the pathways toward achieving Artificial General Intelligence (AGI):

Prof. Yoshua Bengio has been actively promoting the development of the emerging field of causal representation learning in recent years. His collaborative paper with Prof. Bernhard Schölkopf¹⁴ also includes discussions related to Causal RL.

Towards Causal Representation Learning [Slides]

Dr. Yan Zeng from Tsinghua University also wrote a survey on causal RL. Unlike the taxonomy adopted in our survey, her paper categorizes existing work by whether causal information is known, and we recommend reading the two papers together for a comprehensive understanding.

A Survey on Causal Reinforcement Learning [Paper]

Finally, there was a tutorial we launched at ADMA 2023:

【Tutorial】Causal Reinforcement Learning: Empowering Agents with Causality [Slides]

For more resources, you can refer to our GitHub project. Feel free to give it a star ⭐️:

Awesome-Causal-RL: A curated list of causal reinforcement learning resources.

Afterword

‘Agent’ was a relatively niche concept when I began writing the survey, mainly discussed among scholars in the fields of reinforcement learning and robotics. However, within a single year, riding the wave triggered by ChatGPT, from the Stanford village¹⁵ released at the beginning of the year to the recent release of OpenAI’s GPTs store¹⁶, LLM-driven agents have demonstrated amazing levels of intelligence and tremendous commercial potential. This has not only sparked extensive discussions within academia but also garnered attention across various industries. Amidst this buzz, many researchers are exploring diverse research pathways and the relationship between agents and human beings. Besides multimodality and embodied intelligence, causality is also a great entry point. We anticipate that causal RL will offer greater insights and play a larger role in the era of agents.

A quick advertisement: for those interested in Causal Reinforcement Learning, feel free to reach out! I’m also open to collaborating on new research papers and projects in this field!

Please check Zhihong Deng’s Homepage for further information. Thank you! Enjoy your day~ 😊

References

“Causal Parrots: Large Language Models May Talk Causality But Are Not Causal.” https://arxiv.org/abs/2308.13067 ↩
“The Book of Why” - Introduction. http://bayes.cs.ucla.edu/WHY/ ↩
“Causality.” http://bayes.cs.ucla.edu/BOOK-2K/ ↩ ↩²
“Causal Inference Using Potential Outcomes.” https://www.jstor.org/stable/27590541 ↩
“Causal Inference in Statistics: A Primer.” http://bayes.cs.ucla.edu/PRIMER/ ↩
“Causal Influence Detection for Improving Efficiency in Reinforcement Learning.” https://proceedings.neurips.cc/paper/2021/hash/c1722a7941d61aad6e651a35b65a9c3e-Abstract.html ↩
“Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search.” https://openreview.net/forum?id=BJG0voC9YQ ↩
“A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning.” https://openreview.net/forum?id=YRq0ZUnzKoZ ↩
“AdaRL: What, Where, And How to Adapt in Transfer Reinforcement Learning.” https://openreview.net/forum?id=8H5bpVwvt5 ↩
“Causal Confusion in Imitation Learning.” https://proceedings.neurips.cc/paper_files/paper/2019/hash/947018640bf36a2bb609d3557a285329-Abstract.html ↩
“False Correlation Reduction for Offline Reinforcement Learning.” https://ieeexplore.ieee.org/document/10301548 ↩
https://causalai.net/ ↩
https://causallu.com/ ↩
“Towards Causal Representation Learning.” https://ieeexplore.ieee.org/abstract/document/9363924 ↩
“Generative Agents: Interactive Simulacra of Human Behavior.” https://dl.acm.org/doi/abs/10.1145/3586183.3606763 ↩
“Introducing GPTs.” https://openai.com/blog/introducing-gpts ↩

The Insights and Story behind TPAMI (2023): False Correlation Reduction for Offline Reinforcement Learning

2023-11-16T00:00:00-08:00

This blog post was originally written in Chinese. Readers interested in the original text can visit

this link. Thank you! 😸

Links to our paper:

TPAMI: https://ieeexplore.ieee.org/document/10301548
ArXiv: https://arxiv.org/abs/2110.12468

Code: familyld/SCORE: Author’s implementation of SCORE

I’m delighted to share my paper here. The initial idea for this paper can be traced back to early 2021, the first draft was completed in October 2021, and it was officially accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) in October 2023. TPAMI is one of the best and most influential journals in the field of artificial intelligence, with an impact factor of 24.314. From developing the initial idea to its final publication, nearly three years have passed. This experience, despite its challenges, has been quite rewarding. Therefore, besides sharing the contents of our paper, I also want to reflect on the whole process. If my journey can offer you some inspiration, feel free to give it a thumbs up, star our GitHub repo, or cite our paper in your future research. 😉👻 Thank you!

TLDR: High-quality uncertainty estimation + The pessimism principle -> Reliable offline reinforcement learning

The Beginning: A Tale of Pessimism

Back in late 2020, the first year of the pandemic, I had just shifted my research focus to reinforcement learning. Following my supervisor’s suggestion, I chose to study the emerging field of offline RL. Sergey Levine’s review paper¹ was the first paper I read when I delved into this field. At that time, offline RL wasn’t particularly popular, and there weren’t as many papers available, so each idea was quite novel. From the early SPIBB and BCQ to later ones like BEAR and CQL, we can draw the following intuition: online RL can afford trial-and-error, but offline RL cannot. Therefore, algorithms must try to avoid/suppress queries of out-of-distribution (OOD) actions, as the extrapolation error accumulated in the learning process could lead to failures in offline RL. Based on this intuition, many algorithmic solutions were devised, such as fitting a behavioral policy for regularization or sampling OOD actions and applying additional penalties to the Q-values of these samples. However, did such intuitions genuinely grasp the essence of offline RL?

A meme, credited to Google Research (https://offline-rl.github.io/).

As the last week of December 2020 approached, a paper titled “Is Pessimism Provably Efficient for Offline RL?”² appeared on arXiv:

At that time, researchers might have been relatively unfamiliar with the term ‘pessimism,’ but those who have studied the exploration problem in reinforcement learning might have some understanding of ‘optimism.’ In online reinforcement learning, optimism is provably efficient. This means that using uncertainty as a bonus to guide the agent’s exploration improves efficiency. Conversely, if we flip the sign (use uncertainty as a penalty), we arrive at ‘pessimism,’ as seen in the title of the paper.

In fact, such a straightforward method is provably efficient in offline reinforcement learning! This conclusion is not only rigorous in mathematical terms but also intuitive and elegant. If we can gather feedback from the environment, embracing uncertainty is a highly efficient strategy. However, when we are unable to collect new information, avoiding uncertainty becomes more critical. It makes so much sense! 😌

How is this conclusion derived mathematically? First, we need to define a metric to evaluate the performance of offline reinforcement learning:

\[\text{SubOpt}(\pi;x) = V^{\pi^*}_1(x) - V^{\pi}_1(x),\]

where $V^{\pi^ *}_1(x)$ represents the state value under the optimal policy for the initial state $s_1=x$, while $V^{\pi}_1(x)$ is the state value under the learned policy $\pi$. The difference between the two is termed the “suboptimality.” A larger value indicates a greater deviation of the learned policy from the optimal policy, and hence poorer performance in offline RL. It can be easily verified that the suboptimality of the optimal policy $\pi^ *$ is zero.

Next, we aim to dissect the core problem of offline RL by decomposing the suboptimality:

\[\begin{split} \text{SubOpt}(\hat{\pi};x) &= V^{\pi^*}_1(x) - V^{\hat{\pi}}_1(x)\\ &= V^{\pi^*}_1(x) - \hat{V}_1(x) + \hat{V}_1(x) - V^{\hat{\pi}}_1(x)\\ &= -\left(\hat{V}_1(x) - V^{\pi^*}_1(x)\right) + \left(\hat{V}_1(x) - V^{\hat{\pi}}_1(x)\right)\\ &= -\left(\sum_{h=1}^H \mathbb{E}_{\pi^*}\left[\langle \hat{Q}_h(s_h, \cdot), \hat{\pi}(\cdot | s_h) - \pi^*(\cdot | s_h) \rangle_\mathcal{A} |s_1=x\right] + \sum_{h=1}^H \mathbb{E}_{\pi^*}\left[ \hat{Q}_h(s_h, a_h) - \left(\mathbb{B}_h \hat{V}_{h+1}\right)\left(s_h,a_h\right) |s_1=x\right] \right)\\ &\ \quad+ \left(\sum_{h=1}^H \mathbb{E}_{\hat{\pi}_h}\left[\langle \hat{Q}_h(s_h, \cdot), \hat{\pi}_h(\cdot | s_h) - \hat{\pi}_h(\cdot | s_h) \rangle_\mathcal{A} |s_1=x\right] + \sum_{h=1}^H \mathbb{E}_{\hat{\pi}_h}\left[ \hat{Q}_h(s_h, a_h) - \left(\mathbb{B}_h \hat{V}_{h+1}\right)\left(s_h,a_h\right) |s_1=x\right] \right)\\ &= \underbrace{- \sum_{h=1}^H \mathbb{E}_{\hat{\pi}_h} \left[\iota_h(s_h,a_h) | s_1=x\right]}_{\text{(i) Spurious Correlation}} + \underbrace{\sum_{h=1}^H \mathbb{E}_{\pi^*} \left[\iota_h(s_h,a_h) | s_1=x\right]}_{\text{(ii) Intrinsic Uncertainty}}\\ &\ \quad+\underbrace{\sum_{h=1}^H \mathbb{E}_{\pi^*}\left[\langle \hat{Q}_h(s_h, \cdot), \pi^*(\cdot | s_h) - \hat{\pi}_h(\cdot | s_h) \rangle_\mathcal{A} |s_1=x\right]}_{\text{(iii) Optimization Error}}. \end{split}\]

The third step here leverages the Extended Value Difference Lemma³, and the fourth step makes the expression more concise by introducing the model evaluation error $\iota_h(x,a) = \left(\mathbb{B} _h\hat{V} _{h+1}\right)\left(x,a\right) - \hat{Q} _h(x,a)$. By definition, $\iota_h$ is the error incurred when estimating the Bellman operator $\mathbb{B} _h$ using an offline dataset, quantifying the uncertainty in approximating $\mathbb{B} _h\hat{V} _{h+1}$.

We can see that suboptimality depends on three core factors, but how do we understand them? First, let’s look at optimization error, which is relatively simple. If we directly adopt a greedy policy w.r.t $\hat{Q}_h$, then this term is less than or equal to zero; it does not increase suboptimality. Now, consider intrinsic uncertainty. The paper demonstrates that this term originates from the information-theoretic lower bound and thus is irremovable. Intuitively, this term examines the dataset’s coverage of the optimal trajectory. Without sufficient information, learning the optimal policy $\pi^*$ becomes challenging.

Lastly, let’s focus on spurious correlation. Despite appearing simple, it is crucial in reducing suboptimality. The only difference between it and intrinsic uncertainty is that it calculates the expectation w.r.t the trajectories generated by $\hat{\pi}$ instead of $\pi^*$. This poses a problem: both $\hat{\pi}$ and $\iota$ are dependent on the dataset, leading to a spurious correlation between them. It might be a bit challenging to understand this purely from the formulas, so let’s consider the following example for clarity:

What is presented here is a typical Multi-Armed Bandit (MAB) problem, where there is no need to consider states and state transitions. The horizontal axis represents different actions, while the vertical axis represents the value of these actions. $\mu(a)$ stands for the true value of action $a$, while $\hat{\mu}(a)$ is the sample average estimator following a Gaussian distribution $\mathcal{N}\left(\mu(a), 1/N(a)\right)$. Under this setting, the model evaluation error is $\iota(a) = \mu(a) - \hat{\mu}(a)$. It can be observed that when the policy $\hat{\pi}$ greedily selects actions based on $\hat{\mu}(a)$, it ends up choosing actions with overestimated values due to insufficient knowledge. Hence, we say that $\hat{\pi}$ and $\iota$ spuriously correlate with each other.
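To make this concrete, here is a minimal simulation sketch of such a bandit. The specific arm values and sample counts are hypothetical choices for illustration (they are not taken from the paper or the figure): one optimal arm that the dataset covers well, and twenty worse arms that were each sampled only twice.

```python
import random
import statistics

def simulate_greedy(trials=5000, seed=0):
    rng = random.Random(seed)
    # Hypothetical bandit: arm 0 is optimal and well covered by the data;
    # arms 1..20 are worse, but each was sampled only twice.
    true_mean = [1.0] + [0.5] * 20
    counts = [400] + [2] * 20

    iota_selected = []      # iota(a) = mu(a) - mu_hat(a) at the greedy arm
    picked_rare_arm = 0
    for _ in range(trials):
        # mu_hat(a) ~ N(mu(a), 1 / N(a)), matching the setup in the text
        mu_hat = [rng.gauss(m, (1.0 / n) ** 0.5) for m, n in zip(true_mean, counts)]
        greedy = max(range(len(mu_hat)), key=mu_hat.__getitem__)
        iota_selected.append(true_mean[greedy] - mu_hat[greedy])
        picked_rare_arm += greedy != 0
    return picked_rare_arm / trials, statistics.mean(iota_selected)

rare_fraction, mean_iota = simulate_greedy()
print(f"greedy picks a rarely sampled arm in {rare_fraction:.1%} of trials")
print(f"mean iota at the greedy arm: {mean_iota:.2f}")
```

In this toy setting, the greedy policy almost always lands on one of the rarely sampled arms, and $\iota$ evaluated at the chosen arm is negative on average. The policy and the evaluation error are coupled through the same dataset, which is exactly the spurious correlation described above.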

Here is a less rigorous example: In some reports, we might see a tendency to highlight a certain entrepreneur’s dropout from school in their youth, as if dropping out contributed to their success in their career. However, the value of this behavior is overestimated, and the strong correlation between dropping out and career success is merely due to a small sample size. Due to such spurious correlations, $\hat{\pi}$ results in significant suboptimality. In sequential decision-making problems, this issue becomes more complicated, as OOD actions are not the only factor to consider. OOD states and in-distribution samples with relatively higher uncertainty also pose similar threats. To enhance the performance of offline RL, we need an effective approach to reduce spurious correlations.

Now it’s time to introduce pessimism. Earlier, we briefly mentioned that pessimism refers to using uncertainty as a penalty. Here, let’s further specify the type of uncertainty we aim to quantify:

To put it precisely, what we need is epistemic uncertainty rather than aleatoric uncertainty. Given perfect knowledge, the former can be reduced to zero, while the latter may not necessarily be zero. By utilizing the uncertainty quantifier $\Gamma_h$, we can define a pessimistic Bellman operator:

\[\left(\hat{\mathbb{B}}_h^-\hat{V}_{h+1}\right)\left(x,a\right) \colon = \left(\hat{\mathbb{B}}_h\hat{V}_{h+1}\right)\left(x,a\right) - \Gamma_h(x,a).\]

If we approximate $\mathbb{B} _h\hat{V} _{h+1}$ with it, then when event $\mathcal{E}$ occurs, the following inequality holds:

\[0 \leq \iota_h(x,a) = \left(\mathbb{B}_h\hat{V}_{h+1}\right)\left(x,a\right) - \left(\hat{\mathbb{B}}_h^-\hat{V}_{h+1}\right)\left(x,a\right) \leq 2\Gamma_h(x,a).\]
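For readers wondering where the two-sided bound comes from, here is a short derivation sketch. It rests on two assumptions that are not spelled out in this post: that $\mathcal{E}$ is the event on which $\Gamma_h$ dominates the estimation error of the empirical Bellman operator $\hat{\mathbb{B}}_h$ at every state-action pair (which is how the uncertainty quantifier is defined in the pessimism paper²), and that $\hat{Q}_h$ equals the output of the pessimistic operator (truncation is ignored).

\[\begin{split} \text{On } \mathcal{E}:\quad &\left|\left(\hat{\mathbb{B}}_h\hat{V}_{h+1}\right)\left(x,a\right) - \left(\mathbb{B}_h\hat{V}_{h+1}\right)\left(x,a\right)\right| \leq \Gamma_h(x,a),\\ \iota_h(x,a) &= \left(\mathbb{B}_h\hat{V}_{h+1}\right)\left(x,a\right) - \left(\hat{\mathbb{B}}_h\hat{V}_{h+1}\right)\left(x,a\right) + \Gamma_h(x,a)\\ &\in \left[-\Gamma_h(x,a) + \Gamma_h(x,a),\ \Gamma_h(x,a) + \Gamma_h(x,a)\right] = \left[0,\ 2\Gamma_h(x,a)\right]. \end{split}\]

In words: subtracting $\Gamma_h$ shifts the estimate down by at least as much as it could have been over-estimated, so the remaining error can only be an under-estimate, and it is at most twice the penalty.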

At this point, $\iota_h$ is non-negative, hence the contribution of spurious correlation to suboptimality is less than or equal to zero, no longer increasing suboptimality! Meanwhile, we have also obtained an upper bound for suboptimality:

\[\quad \text{SubOpt}(\hat{\pi};x) \leq 2\sum_{h=1}^H \mathbb{E}_{\pi^*} \left[\Gamma_h(s_h,a_h) | s_1=x\right].\]
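In a bandit (where $H=1$), this bound reduces to $\text{SubOpt} \leq 2\Gamma(a^*)$, which can be checked numerically. The sketch below is only an illustration under assumed settings: the arm values, the sample counts, and the choice $\Gamma(a) = \beta/\sqrt{N(a)}$ with $\beta = 3$ are hypothetical, not taken from the paper.

```python
import random
import statistics

def mean_suboptimality(beta, trials=5000, seed=0):
    """Average of mu(a*) - mu(a_hat), where a_hat maximizes mu_hat(a) - beta / sqrt(N(a))."""
    rng = random.Random(seed)
    true_mean = [1.0] + [0.5] * 20   # hypothetical bandit, arm 0 is optimal
    counts = [400] + [2] * 20        # arm 0 is well covered, the rest are not
    gaps = []
    for _ in range(trials):
        mu_hat = [rng.gauss(m, (1.0 / n) ** 0.5) for m, n in zip(true_mean, counts)]
        # Gamma(a) = beta / sqrt(N(a)) plays the role of the uncertainty quantifier
        scores = [m - beta / n ** 0.5 for m, n in zip(mu_hat, counts)]
        chosen = max(range(len(scores)), key=scores.__getitem__)
        gaps.append(max(true_mean) - true_mean[chosen])
    return statistics.mean(gaps)

greedy_gap = mean_suboptimality(beta=0.0)        # no penalty: plain greedy selection
pessimistic_gap = mean_suboptimality(beta=3.0)   # penalized (pessimistic) selection
bound = 2 * 3.0 / 400 ** 0.5                     # 2 * Gamma(a*) for beta = 3

print(f"greedy: {greedy_gap:.3f}, pessimistic: {pessimistic_gap:.3f}, bound: {bound:.3f}")
```

Note that the bound depends only on how well the dataset covers the optimal arm. The twenty poorly covered arms make the greedy policy fail, but they do not appear in the bound at all.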

The conclusion is very elegant, as it doesn’t rely on assumptions about the offline dataset $\mathcal{D}$, nor does it constrain the disparity between the final learned policy and the behavioral policy. This means that even if the behavioral policy isn’t optimal and the dataset contains many low-quality trajectories unrelated to the optimal policy, pessimism still ensures that the algorithm effectively utilizes those valuable segments (experimental results on the random and medium datasets in D4RL provide strong evidence).

This paper has deepened my understanding of offline RL while also raising a new question: How to construct a sufficiently “small” uncertainty quantifier that meets the definition, thereby establishing a tight enough suboptimality upper bound? This paper focuses on the linear MDP setting, where it provides an analytical form for $\Gamma_h$ and integrates it into the framework of value iteration. However, implementing the pessimism principle in a more general setting remains a major challenge.
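For reference, the analytical form used in the linear MDP setting is, as far as I can tell from the paper, $\Gamma_h(x,a) = \beta\sqrt{\phi(x,a)^\top \Lambda_h^{-1} \phi(x,a)}$ with $\Lambda_h = \sum_\tau \phi(x_h^\tau,a_h^\tau)\phi(x_h^\tau,a_h^\tau)^\top + \lambda I$. Below is a minimal NumPy sketch of this quantity. The function name and the values of `beta` and `lam` are placeholders chosen for illustration; the paper sets $\beta$ from the problem dimensions and the confidence level.

```python
import numpy as np

def linear_uncertainty(features, query, beta=1.0, lam=1.0):
    """Gamma(x, a) = beta * sqrt(phi(x, a)^T Lambda^{-1} phi(x, a)).

    features: (K, d) array whose rows are phi(x_k, a_k) from the offline dataset.
    query:    (M, d) array of feature vectors to score.
    beta, lam: scaling and ridge coefficients (placeholder values).
    """
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)   # Lambda = sum phi phi^T + lam * I
    solved = np.linalg.solve(gram, query.T)          # Lambda^{-1} phi for each query
    return beta * np.sqrt(np.sum(query.T * solved, axis=0))

# With one-hot features this reduces to a count-based penalty beta / sqrt(N + lam):
# the first state-action pair appears 99 times, the second only 3 times.
dataset = np.array([[1.0, 0.0]] * 99 + [[0.0, 1.0]] * 3)
gamma = linear_uncertainty(dataset, np.eye(2))
print(gamma)
```

The one-hot case shows the intuition: the penalty shrinks as a state-action pair is observed more often, and stays large for pairs the dataset barely covers.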

I first came across this paper on social media when it was just released. However, at that time, I was occupied with other experiments, so it wasn’t until around March 2021 that I read through it for the first time. There were certain things that I couldn’t fully comprehend, prompting me to reach out via email to the authors of the paper, namely Ying Jin, Prof. Zhuoran Yang, and Prof. Zhaoran Wang. All three authors graciously answered my questions, and it was this interaction that led to subsequent collaborations.

Hands-on Pessimism

As April approached, we delved into a phase of extensive experimentation after finalizing our collaboration. We aimed to submit to NeurIPS 2021, whose deadline was at the end of May, so time was rather tight. However, there was a minor hiccup: a critical bug was identified in D4RL, leading to the release of its v2 version. Some algorithms performed poorly on the updated dataset. To ensure a fair comparison, Yijun Yang and I reran experiments of various baselines using a consistent standard on the v2 dataset. Because certain algorithms underperformed on the updated version of the dataset, we needed to adjust their hyperparameters, which made time even more pressing.

In D4RL-v2, CQL encountered issues with Q-value explosions.

That’s how it went: we worked diligently to reproduce existing methods while designing new algorithms. Simultaneously, we kept a close eye on the latest papers released on arXiv. In fact, later on, we found that most concurrent research also revolved around the principle of pessimism. However, everyone chose different paths. Some methods were effective in practice but exhibited significant gaps with theory, and we aimed to bridge such gaps.

During this period, I had the opportunity to meet Dr. Chenjia Bai and discuss how to implement pessimism together. Chenjia’s OB2I⁴ maintains a set of critics and effectively estimates epistemic uncertainty through the bootstrapped ensemble method. Based on such techniques, OB2I constructed a UCB-bonus to implement optimism, significantly improving sample efficiency and achieving excellent results in the Atari Games benchmark. Similar approaches can be observed in continuous control problems, as seen in OAC⁵ and SUNRISE⁶. Therefore, it seems natural to consider employing a similar method to construct an uncertainty quantifier in offline RL.
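The general idea can be sketched in a few lines. This is only an illustration of using ensemble disagreement as an uncertainty proxy, not the actual OB2I, OAC, or SUNRISE implementation; the function name and the coefficient `beta` are hypothetical.

```python
import numpy as np

def ensemble_estimate(q_values, beta, optimistic):
    """q_values: (N, B) array with the predictions of N critics for B state-action pairs."""
    mean = q_values.mean(axis=0)
    std = q_values.std(axis=0)     # disagreement across critics as an uncertainty proxy
    # Online RL: add the bonus (optimism). Offline RL: subtract it (pessimism).
    return mean + beta * std if optimistic else mean - beta * std

# Three critics agree on the first pair and disagree on the second.
q = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, 6.0]])
ucb = ensemble_estimate(q, beta=1.0, optimistic=True)
lcb = ensemble_estimate(q, beta=1.0, optimistic=False)
print(ucb, lcb)
```

Where the critics agree, both estimates coincide with the mean; where they disagree, the optimistic and pessimistic values move apart by the same amount in opposite directions.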

However, in practice, methods based on uncertainty often do not work well. For instance, the appendix of the BCQ paper⁷ mentions an experiment conducted on an expert dataset. The authors utilized the standard deviation of an ensemble of critic networks to quantify uncertainty, aiming to train the policy network with parameters $\phi$ by minimizing such uncertainty:

\[\phi \leftarrow {\arg\min}_\phi \sum_{(s,a) \in \mathcal{B}} \sigma\left(\{Q_{\theta_i}(s,a)\}_{i=1}^N\right).\]

This approach can actually be seen as a policy-constraint method that doesn’t require considering returns. If the uncertainty is accurately measured, the agent should be able to effectively mimic an expert policy. However, experimental results indicate that this method falls far behind BCQ. The authors explain the experimental results in the following way:

Minimizing uncertainty can stabilize the value function, but it may not sufficiently constrain the action space. When considering settings beyond expert datasets, it’s also crucial to carefully balance the weight of estimated value and uncertainty in policy optimization. In Sergey Levine’s tutorial¹, we can see the following discussion:

Earlier, we mentioned that optimism in online RL also utilizes uncertainty. However, in the online setup, trial-and-error is possible, so even if uncertainty estimation is inaccurate, the agent can correct errors by collecting feedback. In contrast, in offline RL, because the agent learns only from a fixed dataset, inaccurate uncertainty estimates significantly reduce the effectiveness of the method. High-quality uncertainty estimation is essential!

Despite the obstacles facing model-free uncertainty-based methods, model-based uncertainty-based methods have shown promising results, e.g., MOPO⁸ and MOReL⁹:

However, in a subsequent work (COMBO)¹⁰, the authors of MOPO explicitly pointed out the difficulty in quantifying model-based uncertainty, leading them to abandon the uncertainty-based approach. Instead, they replaced uncertainty penalties with a regularization term similar to CQL. This also confirms the challenges we faced in our experiments: when reproducing MOPO and MOReL, considerable effort was required to obtain reasonable results on the v2 version of the dataset.

Time flew by, with rapid trial and error, analysis, discussions, and another round of trial and error. Despite working diligently on experiments and analysis, we couldn’t produce satisfactory outcomes before the NeurIPS deadline. Nevertheless, this period allowed us to accumulate valuable experience. For instance, employing uncertainty weighting failed to yield the desired results. Those familiar with offline RL might recall the UWAC¹¹ algorithm proposed by researchers at Apple; however, in our attempts to reproduce the experimental results, we found it heavily reliant on policy constraints, necessitating a combination with BEAR or similar algorithms to obtain strong performance. Furthermore, uncertainty estimates obtained through dropout do not converge as the amount of data increases, which contradicts the definition of the uncertainty quantifier required by the pessimism principle.

In mid to late June, several interesting papers were released on arXiv. One of them, MILO¹², also adopts the principle of pessimism, employing a model-based approach similar to MOReL. It utilizes model disagreement to quantify uncertainty and has shown promising results in offline imitation learning problems. Upon reviewing the code, we discovered that the authors made a small modification to the environment, allowing the agent to observe the velocity in the x-direction, which is a component of the reward and hence crucial for learning the model. This modification might have somewhat simplified the problem. Nevertheless, the experimental results of this paper still strongly support the pessimism theory, demonstrating that pessimism can help agents effectively utilize expert data.

Another paper, contributed by researchers in France, introduces TD3-CVAE in the name of anti-exploration¹³. This method entails pretraining a Conditional Variational Autoencoder (CVAE), then leveraging prediction errors to construct uncertainty estimates, and using them in the training process of the value function and policy:

Although the paper didn’t have open-source code at the time, its idea was quite concise, so I quickly implemented it. However, the empirical results were unexpectedly poor; the trained policy completely failed. After carefully inspecting the code, we reached out to the authors of this paper, providing our experimental results and the code. The authors compared the two implementations but couldn’t identify the issue due to limited familiarity with PyTorch. The two differences they spotted (LayerNorm and dividing the bonus by the dimension of the action space) didn’t help after being added to our code. Finally, after a thorough investigation, we found that the core issue lay in backpropagating through the uncertainty term during the actor’s update! This practice is quite similar to the experiment described in the appendix of the BCQ paper. Allowing gradients to flow back through the uncertainty, rather than treating the uncertainty as a fixed scalar, introduces a policy constraint into the policy optimization process. This constraint plays a crucial role in the effectiveness of TD3-CVAE, yet it also makes the learned policy dependent on the behavior policy.
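A one-dimensional toy problem illustrates why this matters. Everything here is a hypothetical stand-in, not the TD3-CVAE code: the critic is $Q(a) = a$, the uncertainty is $U(a) = (a - a_{\text{data}})^2$ (mimicking a prediction error that grows away from the dataset action), and the gradients are written out by hand.

```python
def train_actor(backprop_through_uncertainty, steps=200, lr=0.05):
    """Gradient ascent on J(a) = Q(a) - beta * U(a) for a 1-D toy problem."""
    q_slope, beta, data_action = 1.0, 5.0, 0.0   # hypothetical constants
    action = 0.0
    for _ in range(steps):
        grad = q_slope                      # dQ/da for the toy critic Q(a) = q_slope * a
        if backprop_through_uncertainty:
            # U(a) = (a - data_action)^2 grows away from the data, so its
            # gradient pulls the action back toward the behavior data.
            grad -= beta * 2.0 * (action - data_action)
        # Otherwise U(a) is treated as a constant (a detached scalar) and
        # contributes nothing to the actor's gradient.
        action += lr * grad
    return action

constrained = train_actor(backprop_through_uncertainty=True)
unconstrained = train_actor(backprop_through_uncertainty=False)
print(f"with gradient through U: {constrained:.3f}, without: {unconstrained:.3f}")
```

With the gradient flowing through $U$, the action settles close to the dataset action; without it, the actor simply follows the critic and drifts away. The uncertainty gradient therefore acts as an implicit policy constraint.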

During the same period, Scott Fujimoto, the author of BCQ (and also TD3), introduced TD3-BC¹⁴, which established a new state-of-the-art (SoTA) in policy constraint methods. Without the need for pre-training to fit behavioral policies, simply integrating an extremely simple behavior cloning term during policy optimization resulted in remarkably impressive performance. This truly lives up to the name “Minimalist”:

Here’s a summary of the information we’ve gathered so far: ① Model-based uncertainty-based methods are feasible; ② The use of a value regularization method akin to CQL (to suppress the Q-values of out-of-distribution samples) is currently state-of-the-art; ③ Combining model-free uncertainty-based methods with policy constraints is feasible, and BC is a simple yet effective policy-constraint.
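As an illustration of how lightweight the BC constraint in point ③ is, here is a NumPy sketch of a TD3-BC-style actor objective. It follows my understanding of the published formulation (maximize $\lambda Q(s,\pi(s)) - (\pi(s)-a)^2$ with $\lambda = \alpha / \text{mean}|Q|$, and $\alpha = 2.5$ as the commonly used default); the function and argument names are my own, and a real implementation would compute this with an autodiff framework.

```python
import numpy as np

def td3_bc_actor_loss(q_pi, policy_actions, dataset_actions, alpha=2.5):
    """Loss to minimize: -lambda * mean(Q(s, pi(s))) + mean((pi(s) - a)^2).

    q_pi:            (B,) critic values at the policy's actions.
    policy_actions:  (B, A) actions proposed by the policy.
    dataset_actions: (B, A) actions stored in the offline dataset.
    """
    # lambda rescales the Q term so that alpha has a comparable meaning across
    # tasks; in an autodiff framework it would be treated as a constant.
    lam = alpha / np.abs(q_pi).mean()
    bc_term = np.mean((policy_actions - dataset_actions) ** 2)
    return -lam * q_pi.mean() + bc_term

q_pi = np.array([10.0, 30.0])
data = np.array([[0.0], [1.0]])
loss_same = td3_bc_actor_loss(q_pi, data, data)        # policy copies the dataset
loss_far = td3_bc_actor_loss(q_pi, data + 1.0, data)   # policy deviates by 1
print(loss_same, loss_far)
```

For the same critic values, deviating from the dataset actions raises the loss, so the policy leaves the data only when the critic promises enough gain to pay for it.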

Reflecting on this experience, at this point, we chose three distinct paths. Yijun continued along the model-based approach to develop new algorithms, framing the maximization of model returns and minimization of uncertainty as a bi-objective optimization problem, resulting in P3¹⁵; Dr. Bai explored the approach based on OOD sampling, proposing a model-free algorithm named PBRL¹⁶; SCORE, on the other hand, employs behavioral cloning to “warm up” the value and policy functions, aiding the agent in obtaining better uncertainty estimates. While the underlying concepts are quite intuitive and can be easily explained in a single sentence, reaching this point is far from simple.

A Journey Full of Twists and Turns

In the latter half of September 2021, we established the theoretical framework corresponding to the designed algorithm and completed the primary experiments. The results were indeed promising; compared to existing algorithms, our method demonstrated comparable, if not superior, performance. Even when applied to random datasets (where the behavioral policy significantly deviates from the optimal policy), SCORE managed to learn fairly effective policies. This indicates that SCORE effectively extracts useful information from the dataset, thereby reducing interference from irrelevant information.

This is a real breakthrough. In practice, behavioral policies are often not optimal, so a good algorithm should “automatically ‘adapt’ to the support of the trajectory induced by $\pi^ *$, even though $\pi^ *$ is unknown a priori.”² Through conducting ablation experiments, we can obtain more evidence that pessimism eliminates spurious correlations effectively:

Without resorting to pessimism, performance may exhibit greater fluctuations or even significant degradation. Furthermore, by examining the curve of Q-values, we can see that the agent has overestimated the value of the actions it chose.

Despite achieving promising results, unfortunately, we were unable to complete all the ablation studies before the ICLR deadline. Therefore, we decided to first upload the paper to arXiv after completing all experiments and subsequently submit it to ICML 2022. In October 2021, we uploaded the SCORE paper to arXiv. After nearly half a year of effort, we had finally made progress, and I happily shared it on my social network. Although offline RL requires pessimism, in reality, we were all hoping for the pandemic to end soon so that we could shake off those pessimistic feelings:

"Cat-centric Supervised Learning."

During the rebuttal phase of the PBRL paper, Rishabh Agarwal from Google Brain suggested using the “rliable” toolkit for evaluation. At that time, they had authored a paper (later awarded Outstanding Paper at NeurIPS 2021) focusing on investigating the reliability of RL evaluation results and had open-sourced the “rliable” toolkit. To enhance the reliability of the reported experimental results, we also incorporated this suggestion in our SCORE paper, and the outcomes have been highly promising:

In January 2022, when we officially submitted SCORE to ICML, offline RL had gained significant attention after NeurIPS 2021 and ICLR 2022, firmly establishing itself as a primary research direction within the field of reinforcement learning. Not only did algorithmic research flourish, but applications based on offline RL algorithms also emerged. At this juncture, pessimism had gradually evolved into a sort of “consensus”. Consequently, SCORE encountered a very common issue during the submission process — its novelty was questioned. One of the reviewers commented that the motivation, experiments, and theory in this paper are all very good, but then cited a paper on risk-averse RL to argue that “the novelty of the proposed method is very limited, some related works are missing.” 😵‍💫 However, apart from this, no other specific feedback was provided.

Despite our best efforts to address the reviewers’ concerns, we received a rejection from ICML 2022. However, throughout this process, we’ve gathered some valuable rebuttal cases worth sharing. For instance, when faced with controversy regarding novelty, the authors of TD3-BC (NeurIPS 2021 Spotlight paper) responded in the following way:

Another example is as follows (taken from a blog post on Zhihu, where the author writes in Chinese):

In simple terms, the author suggests that the controversy surrounding novelty can be addressed by analyzing the differences in motivation, methodology, and outcomes.

We genuinely welcome everyone to share their rebuttal techniques and insights!

Despite the disappointment of rejection, I bounced back, revised the paper, and resubmitted it to NeurIPS 2022. The algorithms in Offline RL are evolving rapidly, yet we were committed to developing an approach that adheres to the principles of pessimism theory while remaining simple and effective. In order to complete new comparative and ablation experiments, I pushed through several nights of intense work in a highly tense mental state.

At the end of July, we received the reviews from NeurIPS. There were four reviewers in total, with two giving positive scores and the other two giving negative scores. The main reasons for the negative scores were twofold: novelty and experimental results. Regarding the latter, this is what the reviewers said:

SCORE does not significantly outperform PBRL-prior overall when considering the variance of both methods.

The specific numbers are as follows: SCORE - 77.0±2.0, PBRL-prior - 76.6±2.4, PBRL - 74.4±5.3. PBRL-prior is an upgraded version of PBRL that enhances uncertainty estimation by utilizing a set of random prior networks. However, this approach increases computational overhead. These results were obtained using an ensemble of 10 critics, meaning that each forward pass involves running 20 networks. In contrast, SCORE’s results were achieved using only 5 critics in the ensemble, without the need for random prior networks or out-of-distribution (OOD) sampling (a very time-consuming operation), making it simpler and more efficient. Moreover, SCORE has greater advantages over PBRL in theory. It eliminates the need for OOD sampling, thus bypassing the strong assumption of access to the exact OOD target values.

Unfortunately, the reviewers did not respond or revise the scores. By the end of August, we received the rejection notification. The AC summarized it as follows:

The post-rebuttal reviewer discussion ended in a split recommendation with two reviewers suggesting rejection and two reviewers recommending acceptance. On the rejection side, reviewers remain uncertain about the novelty and contributions of the paper. On the acceptance side, reviewers have pointed out that simple method should be much preferred to complex (and often unnecessary) technical extensions.

The two sides were in opposition, with an average score on the borderline, and ultimately, the AC decided to reject our paper. While facing paper rejection is common in academic research, having our cherished research rejected twice was quite disheartening. Nevertheless, we decided to thoroughly revise the paper and submit it again. This time, we were determined to submit it to TPAMI.

Submitting to TPAMI was a difficult decision, given that the journal’s reviewing process is much slower than that of conferences, and a PhD program typically spans only 3 to 4 years. This time, we were fortunate. After a 9-month wait, we received a notification for minor revisions, with each reviewer rating the paper as excellent. We made slight adjustments based on the reviewers’ feedback, emphasizing the differences between SCORE and PBRL (lighter-weight and fewer assumptions) and the theoretical contribution (extending the results of PEVI from episodic MDPs to infinite-horizon regularized MDPs and taking policy optimization into consideration). Finally, after submitting the revisions and waiting another month and a half, we were thrilled to receive the news of acceptance.

Afterword

For those interested in the technical and theoretical details of the paper, please refer directly to the main paper and the appendix. The purpose of this blog post is to reflect on and share the experience. Almost three years have passed. With the end of the pandemic and the acceptance of this paper, it’s time to close the chapter on pessimism and embrace a new beginning. During this period, there were moments when I felt I couldn’t overcome the challenges. I am very grateful for the support of my friends, collaborators, and supervisors. I especially want to thank my wife, who always found ways to help me relax when I was feeling down.

Finally, here’s a little something adorable for my dear readers:

A cute little seal photographed during my trip to Kangaroo Island.

I hope all your papers can be smoothly accepted, and I wish you all the best in your future endeavors!

A quick advertisement: for those interested in Causal Reinforcement Learning, feel free to reach out! I’m also open to collaborating on new research papers and projects in this field!

Please check Zhihong Deng’s Homepage for further information. Thank you! Enjoy your day~ 😊

References

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. https://arxiv.org/abs/2005.01643
Is Pessimism Provably Efficient for Offline RL? https://proceedings.mlr.press/v139/jin21e
Provably Efficient Exploration in Policy Optimization. https://proceedings.mlr.press/v119/cai20d
Principled Exploration via Optimistic Bootstrapping and Backward Induction. http://proceedings.mlr.press/v139/bai21d.html
Better Exploration with Optimistic Actor Critic. https://proceedings.neurips.cc/paper/2019/hash/a34bacf839b923770b2c360eefa26748-Abstract.html
SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning. https://proceedings.mlr.press/v139/lee21g.html
Off-Policy Deep Reinforcement Learning without Exploration. http://proceedings.mlr.press/v97/fujimoto19a.html
MOPO: Model-based Offline Policy Optimization. https://proceedings.neurips.cc/paper/2020/hash/a322852ce0df73e204b7e67cbbef0d0a-Abstract.html
MOReL: Model-Based Offline Reinforcement Learning. https://papers.nips.cc/paper/2020/hash/f7efa4f864ae9b88d43527f4b14f750f-Abstract.html
COMBO: Conservative Offline Model-Based Policy Optimization. https://proceedings.neurips.cc/paper/2021/hash/f29a179746902e331572c483c45e5086-Abstract.html
Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning. https://proceedings.mlr.press/v139/wu21i.html
Mitigating Covariate Shift in Imitation Learning via Offline Data With Partial Coverage. https://proceedings.neurips.cc/paper/2021/hash/07d5938693cc3903b261e1a3844590ed-Abstract.html
Offline Reinforcement Learning as Anti-exploration. https://ojs.aaai.org/index.php/AAAI/article/view/20783
A Minimalist Approach to Offline Reinforcement Learning. https://papers.nips.cc/paper/2021/hash/a8166da05c5a094f7dc03724b41886e5-Abstract.html
Pareto Policy Pool for Model-based Offline Reinforcement Learning. https://openreview.net/forum?id=OqcZu8JIIzS
Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning. https://openreview.net/forum?id=Y4cs1Z3HnqL