Sergey Levine, UC Berkeley: On the bottlenecks to generalization, why simulation is doomed to succeed, and picking good research problems

March 1, 2023


Sergey Levine, an assistant professor of EECS at UC Berkeley, is one of the pioneers of modern deep reinforcement learning. His research focuses on developing general-purpose algorithms for autonomous agents to learn how to solve any task. In this episode, we talked about the evolution of deep reinforcement learning, how previous robotics approaches were replaced, and why offline RL is significant for future generalization.

Below are some highlights from our conversation as well as links to the papers, people, and groups referenced in the episode.

Some highlights from our conversation

“I do think that, in science, it is a really good idea to sometimes see how extreme a design can still work because you learn a lot from doing that. This is, by the way, something, I get a lot of comments on this. You know, I’ll be talking to people and they’ll be like, ‘Well, we know how to do, like, robotic grasping, and we know how to do inverse kinematics, and we know how to do this and this, so why don’t you use those parts?’ And it’s, yeah, you could, but if you want to understand the utility, the value of some particular new design, it kind of makes sense to really zoom in on that and really isolate it and really just understand its value instead of trying to put in all these crutches to compensate for all the parts where we might have better existing kind of ideas.”

“The thing is, robots, if they are autonomous robots–they should be collecting data way more cheaply in a way larger scale than data we harvest from humans. For this reason, I actually think that robotics in the long run may actually be at a huge advantage in terms of its ability to collect data. We’re just not seeing this huge advantage now in robotic manipulation because we’re stuck at the smaller scale, more due to economics, rather than, I would say, science.”

“We want simplicity because simplicity makes it easy to make things work on a large scale. You know, if your method is simple, there are essentially fewer ways that it could go wrong. I don’t think the problem with clever prompting is that it’s too simple or primitive. I think the problem might actually be, that it might be too complex and that developing a good, effective reinforcement learning or planning method might actually be a simpler, more general solution.”

“I think, in reality, for any practical deployment of these kinds of ideas at scale, it would actually be many robots all collecting data, sharing it, and exchanging their brains over a network and all that. That’s the more scalable way to think about on the learning side. But, I do think that also on the physical side, there’s a lot of practical challenges, and just, you know, what kind of methods should we even have if we want the robot in your home to practice cleaning your dishes for three days. I mean, if you just run a reinforcement learning algorithm for a robot in your home, probably, the first thing it’ll do is wave its arm around, break your window, then break all your dishes, then break itself, and then spend the remaining time it has, just sitting there at the broken corner. So there’s a lot of practicalities in this.”

Referenced in this podcast

Full transcript

[00:00:00] Sergey Levine: I do think that in science, it is a really good idea to sometimes see how extreme of a design can still work because you learn a lot from doing that, and, this is by the way something I get a lot of comments on this–like, you know, I’ll, I’ll be talking to people and they’ll be like, “well, we know how to do like robotic grasping, and we know how to do inverse kinematics, and we know how, how to do this and this, so why don’t you like use those parts?” And it’s like, yeah, you could, but if you wanna understand the utility, the value of some particular new design, it kind of makes sense to really zoom in on that and really isolate it and really just understand its value instead of trying to put in all these crutches to compensate for all the parts where we might have better existing kind of ideas.

[00:00:38] Kanjun Qiu: We’re really excited to have you, Sergey. Welcome to the podcast. We start with: how did you develop your initial research interests and how have they evolved over time? I know you took a class from Andrew Ng where you decided to switch mid-grad school to machine learning. What were you doing before and then what happened after?

[00:01:37] Sergey Levine: When I started graduate school, I wanted to work on computer graphics. I was always interested in virtual worlds, CG film, video games, that sort of thing. And it was just really fascinating to me how you could essentially create a synthetic environment in a computer.

[00:01:53] Sergey Levine: So I really wanted to figure out how we could advance the technology for doing that. When I thought about which area of computer graphics to concentrate in, I decided on computer animation, specifically character animation, because by that point, this would’ve been 2009, we had pretty good technology for simulating physics, for how inanimate objects behave.

[00:02:11] Sergey Levine: And the big challenge was getting plausible behavior out of virtual humans. When I started working on this, I pretty quickly discovered that, you know, essentially the bottleneck with virtual humans is simulating their minds. All the way from very basic things like, you know, how do you decide how to move your legs if you wanna climb some stairs to more complex things like, you know, how should you conduct yourself if you’re playing a soccer game with teammates and opponents?

[00:02:35] Kanjun Qiu: Mm-hmm.

[00:02:36] Sergey Levine: This naturally leads us to think about decision making in AI, you know, in my case, initially in service to creating plausible virtual humans. But once I realized how big that problem really was, it became natural to think of it more as just developing artificial intelligence systems. And certainly the initial wave of the deep learning revolution, which started around that same time, was, you know, a really big part of what got me to switch over from pure computer graphics research into things that involved a combination of control and machine learning.

[00:03:08] Kanjun Qiu: Hmm. Was that in 2011? 2012?

[00:03:12] Sergey Levine: Yeah. So my first paper, well, my first machine learning paper was actually earlier than that, but my first paper that involved something that we might call deep learning was in 2012. And it was actually right around the same time as the DeepMind Atari work came out. It was actually a little before, and it focused on using what today we would call deep reinforcement learning, which back then was not really a term that was widely used, for locomotion behaviors for 3D humans.

[00:03:39] Kanjun Qiu: Interesting. And then what happened after that?

[00:03:42] Sergey Levine: I worked on this problem for a little while, for the last couple of years in grad school. And then after I finished my PhD I started looking for postdoc jobs because, you know, I was really only, only about partway through switching from graphics to machine learning. So I wasn’t well established, really, in either community at that point. Perhaps a lesson for the PhD students that are listening to this: switching gears in the fourth year of your PhD is a little chancy. Because I sort of ended up with like one foot on either side of that threshold and nobody really knew me very well. So I decided to do a postdoc in some area where I could get a little bit more established in machine learning.

[00:04:16] Sergey Levine: So, and, and I wanted to stay in the Bay Area for personal reasons. So I got in touch with professor Pieter Abbeel, who’s now my colleague here at UC Berkeley, about a postdoc position. It was kind of interesting because I interviewed for this job and I thought the interview went horribly because, well, really, it wasn’t my fault because when I showed up for the interview at UC Berkeley, with Pieter’s lab, they had moved the deadline for IROS, which is a major robotics conference. It was supposed to be earlier, so I would show up after the deadline, and then, you know, all the students would listen to my talk and Pieter presumably would be a little relaxed. They moved the deadline to be that evening.

[00:04:48] Sergey Levine: So everyone listening to my talk was kind of stressed out. I could tell that like they, they, their minds were elsewhere. Afterwards there was a certain remark, I’m sure Pieter won’t mind me sharing this, but he mentioned something to the effect of like, “Oh, you know, I don’t think that I want my lab working on all this, like, animation stuff.” So I, I kind of felt like I really blew it. But, he gave me a call a few weeks later and offered me the job, which was fantastic.

[00:05:13] Sergey Levine: And I guess it was kind of generous on his part because at the time he was presumably taking a little bit of a chance, but it worked out really well. And I switched over to robotics and that was actually a very positive change, in that a lot of the things that, that, I was trying to figure out in computer animation, they would be tested more rigorously and more thoroughly in the robotics domain because there you really deal with all the complexity of the real world.

[00:05:35] Kanjun Qiu: Mm-hmm. That makes sense. Now, how do you think about all the kind of progress with generative environments and animation? Like, do you feel like the original problems you were working on in animation are largely solved, or do you feel like there’s a lot more to do there?

[00:05:49] Sergey Levine: Yeah, that’s a good question. So I took a break from the computer graphics world for a while, but then over the last five years, there was actually a student in my lab, Jason Peng, who’s now a professor at Simon Fraser University in Canada. He just graduated last year and he, more or less, in his PhD, I would say, basically solved the problems that I had tried to do in my own PhD a decade prior. I think he did a much better job with it than I ever did. So he had several works that essentially took deep RL techniques, and combined them with large scale generative adversarial networks to, more or less, provide a pretty comprehensive solution to the computer animation problem. So in his latest work, which was actually done in collaboration with NVIDIA, the kind of approach that he adopted is he takes a large data set of motion capture data. You can kind of think of it as like all the motion capture data we can get our hands on, and trains a latent variable GAN on it that will generate human-like motion and embed it into a latent space that will provide kind of a higher level space for control. So, you can sort of think of his method as producing this model where you, you feed it in random numbers and for every random number, it’ll produce some natural motion, running, jumping, whatever, and then those random numbers serve as a higher level action space. So that latent space, now, everything in that latent space is plausible motion. And then you can train some higher level policy that will steer it in that latent space. And that actually turns out to be a really effective way to do animation because once you get that latent space, now you can forget about whether the motion is plausible–it’ll always be plausible and realistic, and now you can just be entirely goal driven in that space.
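As a rough illustration of the two-level structure described here, the toy sketch below pairs a frozen latent-to-motion map with a higher-level controller that acts only in the latent space. Everything in it is an assumption made for illustration: the "generator" is a fixed random matrix followed by tanh rather than a trained GAN, the "motion" is a bounded 2-D step, and the high-level policy is a greedy random search rather than a learned RL policy. The names `generate_motion`, `high_level_policy`, and `MAX_STEP` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained generator: a frozen map from a latent vector z to a
# 2-D displacement. tanh bounds every step, mimicking "always plausible" motion.
G = rng.normal(size=(2, 4))   # hypothetical frozen generator weights
MAX_STEP = 0.2


def generate_motion(z):
    """Low level: decode a latent vector into one bounded movement step."""
    return MAX_STEP * np.tanh(G @ z)


def high_level_policy(position, goal, n_candidates=64):
    """High level: pick the latent whose decoded step ends nearest the goal."""
    candidates = rng.normal(size=(n_candidates, 4))
    ends = np.array([position + generate_motion(z) for z in candidates])
    best = np.argmin(np.linalg.norm(ends - goal, axis=1))
    return candidates[best]


position = np.zeros(2)
goal = np.array([1.0, -0.5])
start_distance = np.linalg.norm(goal - position)

# The controller only ever chooses latents; it never edits the motion itself.
for _ in range(20):
    z = high_level_policy(position, goal)
    position = position + generate_motion(z)

final_distance = np.linalg.norm(goal - position)
print(f"distance to goal: {start_distance:.2f} -> {final_distance:.2f}")
```

The point of the design is the division of labor: the low-level map guarantees that whatever comes out is within bounds, so the high-level controller can be purely goal driven.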

[00:07:24] Kanjun Qiu: That’s very clever.

[00:07:25] Sergey Levine: He has a demo at SIGGRAPH this past year where he has virtual characters, you know, doing sword fighting and jumping around and so on. And, this is what in my PhD, if, if someone showed it to me, I would’ve said this is, like, science fiction. It was kinda like the dream for the computer graphics community for a long time. I think Jason really did a fantastic job of it. So if anyone listening to this is interested in computer animation, Jason Peng, that’s his work. His work is worth checking out. He is part-time at NVIDIA, too, so he is doing some mysterious things that he hasn’t… He’s very cagey with the details on it, but I think there might be something big coming out in the imminent future.

[00:08:02] Kanjun Qiu: That’s really interesting. So you feel like your PhD work is at least solved by Jason?

[00:08:08] Sergey Levine: Yeah, I think he kind of, yeah, he kind of took care of that one.

[00:08:12] Kanjun Qiu: When you first got started in robotics, what did you feel like were the important problems?

[00:08:16] Sergey Levine: Robotics traditionally is thought of as very much like a geometry problem plus a physics problem. So if you open up like the Springer handbook on, on robotics or a more kind of classical advanced robotics course textbook, a lot of what you will learn about has to do with understanding the geometries of objects and modeling the mechanics of articulated rigid body systems. This approach took us very far from the earliest days of robotics in the fifties and sixties all the way to the kind of robots that are used in manufacturing all the time today. In some ways, the history of, of robotic technology is one of building abstractions, taking those abstractions as far as we can take them, and then hitting some really, really difficult wall. The really difficult wall that robotics generally hits with this kind of approach has to do with situations that are not as fully and cleanly structured as the rigid body abstraction would have us believe. Not just ’cause they have physical phenomena that are outside of this model, but also because they have challenges having to do with characterization and identifiability. So if you’re, you know, let’s say you have a robot in your home that’s supposed to just like clean up your home and put away all the objects, even if those are rigid objects that, in principle, fit within that abstraction, you don’t know exactly what shape they are, what their mass distribution is and all this stuff.

[00:09:33] Sergey Levine: You don’t, you don’t have perception and things like that. So all of those things, together, more or less put us in, in this place where the clean abstraction kind of like really doesn’t give us anything. The analogy here is in the earliest days of computer vision, the first thing that people basically thought of when they thought about how to do computer vision is that, well, computer vision is like the inverse graphics problem. So if you believe the world is made out of shapes, you know, they have geometry, let’s figure out their vertices and their edges and so on, and people kind of tried to do this for a while and it was very reasonable and very sensible from an engineering perspective until, in 2012, Alex Krizhevsky had a solution to the ImageNet challenge that didn’t use any of that stuff, whatsoever, and just used a giant neural net. So I kind of suspect that like, the robotics world is kind of just getting to that point, like, right around in the last half a decade or so.

[00:10:24] Kanjun Qiu: Hmm. Interesting. And so when you first joined Pieter Abbeel’s lab as a postdoc, you kind of saw the world of robotics where everything was these like rigid body abstractions. And what were you thinking? Like, were you like, okay, well, seems like, you know, nobody’s really using deep learning. No one’s really doing end-to-end learning. I’m gonna do that. Or kind of how did you end up?

[00:10:47] Sergey Levine: Yeah, so actually, I started working with a student who, his most recent accomplishment was to actually take ideas that were basically rooted in this kind of geometric approach to robotics and extend them somewhat so they could accommodate deformable objects, ropes and cloth, that sort of thing. So they had been doing laundry folding and knot tying. And, I won’t go too much into the technical details, but, but it was kind of in the same wheelhouse as these geometry based methods that had been pioneered for rigid objects and, and grasping in the decades prior. And with some clever extensions, they could do some knot tying and things like that.

[00:11:19] Sergey Levine: And I, I started working on how we could, kind of more or less, throw all that out and replace it with end-to-end learning from deep nets. And I intentionally wanted to make it like a little bit extreme. So instead of trying to like gently turn these geometry-based methods into ones that use learning more and more, we actually decided that we would actually just completely do, you know, the maximally end-to-end thing. The student in question was John Schulman, and he ended up doing his PhD on end-to-end deep reinforcement learning, and later on developed the most widely used reinforcement learning method today, which is PPO. So he now works at OpenAI and perhaps his most recent accomplishment and something that some of your viewers might have heard about, it’s called ChatGPT. But that’s maybe a story for another time. So we did some algorithms work there, and then in terms of robotics applications, I worked with another student that some of your listeners might also know, Chelsea Finn. She’s now a professor at Stanford. There we wanted to see if we could introduce kind of the latest and greatest convolutional neural network techniques to directly control robot motion.

[00:12:19] Sergey Levine: And again, we chose to have a very end-to-end design there where we took the PR2 Robot, and we basically looked through the PR2 manual, and we found the lowest level control you could possibly have. You can’t command motor torques exactly, but you can command what’s called motor effort, which apparently is roughly proportional to current on the electric motors. So I did a little bit of coding to set up a controller that would directly command these efforts at some reasonable frequency. Chelsea coded up the ConvNet component. We wired it all together, managed to get it training end-to-end, and then we set up a set of experiments that were really intentionally meant to evaluate whether the end-to-end part really mattered. So this was not, you know, these, these days, this would be something that people would more or less take for granted. It’s like, yeah, of course end-to-end is better than plugging in a bunch of geometric stuff. But we really wanted to convince people that this was true. So we ran experiments where we separated out localization from control. We had like more traditional computer vision techniques, geometry based techniques, and we tried to basically see whether going directly from raw pixels all the way to these motor effort commands could do better. And we set up some experiments that would actually validate that.

[00:13:19] Sergey Levine: So we had these experiments where the robot would take a little colored shape and insert it into a shape sorting cube. So it’s a children’s toy where you’re supposed to match the shape to the shape of the hole. And one of the things that we were able to actually demonstrate is that the end-to-end approach was in fact better because essentially it could trade off errors more smartly. So if you’re inserting this shape into a hole, you don’t really need to be very, very accurate and figure out where the hole is vertically because you’ll just be pushing it down all the time. So that’s more robust to inaccuracy. But then in terms of errors in the other direction, there it’s a little more sensitive. So we could show that we could actually do better with end-to-end training than if we had localized the hole separately and then commanded a separate controller. That work actually resulted in a paper that was basically the first deep reinforcement learning paper for image-based, real world robotic manipulation. It was also rejected numerous times by robotics reviewers because at the time this was a little bit of a taboo to do, too many neural nets. Eventually, it ended up working out.

[00:14:17] Kanjun Qiu: One thing I’m really curious about with this end-to-end experiment is, did it just work? Like, you set everything up, you code up the CNN, was it really tricky to get working or did it work much better than you expected?

[00:14:29] Sergey Levine: It’s always very difficult to disentangle these things in science, because like, obviously it didn’t just work on the first try, but a big part of why it didn’t work on the first try had to do with a bunch of coding things. For example, this was sort of before there were really, really nice clean tools for GPU-based acceleration of ConvNets. So back then, Caffe was one of the things that everybody would use, and it was very difficult for us to get this running onboard the robots. So we actually had some fairly complicated jury-rigged system where the ConvNet would actually run on one machine. Then in the middle of the network it would send the activations over to a different machine onboard the robot for the real time controller, so it’s still an end-to-end neural net, but like half the neural net was running on one computer, half of it was running on another computer, and then the gradients would have to get passed back. So it was like, it was a little complicated, and the bulk of the challenges we had had more to do with systems design and that sort of thing. But part of why it did basically work once we debugged things was that the algorithm itself was based on things that I had developed for previous projects, just without the computer vision components. So going from low dimensional input to actions was something that had already been developed and basically worked.
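The split described above, with activations sent forward across a machine boundary and gradients sent back, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the system used in the project: the layer sizes, the tanh nonlinearity, the squared-error loss, and the function names are all hypothetical, and the "network transfer" is just a function call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 visual features in, 4 hidden units, 2 motor efforts out.
W1 = rng.normal(size=(4, 8))   # first half of the net (vision machine, "A")
W2 = rng.normal(size=(2, 4))   # second half (onboard controller, "B")


def vision_forward(x):
    """Machine A: compute mid-network activations and send them onward."""
    return np.tanh(W1 @ x)


def controller_forward(h):
    """Machine B: turn the received activations into an action."""
    return W2 @ h


def controller_backward(h, action, target):
    """Machine B: gradients of 0.5 * ||action - target||^2 w.r.t. W2 and h."""
    err = action - target
    grad_W2 = np.outer(err, h)
    grad_h = W2.T @ err            # this is what gets sent back to machine A
    return grad_W2, grad_h


def vision_backward(x, h, grad_h):
    """Machine A: finish backpropagation using the gradient it received."""
    grad_pre = grad_h * (1.0 - h ** 2)   # derivative of tanh
    return np.outer(grad_pre, x)


x = rng.normal(size=8)
target = np.array([0.5, -0.5])

h = vision_forward(x)                                      # A -> B: activations
action = controller_forward(h)
grad_W2, grad_h = controller_backward(h, action, target)   # B -> A: grad_h
grad_W1 = vision_backward(x, h, grad_h)
```

Because only `h` and `grad_h` cross the boundary, the two halves produce the same gradients a single-machine network would, which is why the split network can still be trained end-to-end.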

[00:15:36] Sergey Levine: This was a continuation of my PhD work, so a lot of the challenges that we had had to do with getting the systems parts right. They also had to do with getting the design of the component to be effective in relatively lower data regimes because these robot experiments, they would collect maybe four or five hours of data. So one of the things that Chelsea had to figure out is how to get a neural net architecture that could be relatively efficient. She basically used a proxy task that we designed where instead of actually iterating on the full control task on the real robot, we would have a little pose detection task that we would use to just prototype the network and that she could iterate on just entirely offline. So she would test out the ConvNet on that, get it working properly, and then once we knew that it worked for this task, then we kind of knew that it was roughly good enough in terms of sample efficiency and then we just retrained with the end-to-end thing.

[00:16:22] Kanjun Qiu: That makes sense.

[00:16:23] Sergey Levine: So the moral of the story to, to folks who might be listening and working on these kinds of robotic learning systems, it does actually help to break it up into components. Even if you’re doing the end-to-end thing in the end, because you can kind of get the individual neural net components all working nicely and then just redo it with the end-to-end thing. And that does tend to take out a lot of the pain.

[00:16:44] Kanjun Qiu: Right. It sounds like you kind of got the components working first. It’s interesting you made this comment about just making the problem a lot more extreme when you were talking about the student using thin plate splines, and I’m curious, is this an approach you’ve used elsewhere? Kind of making the problem much more extreme and throwing out everything.

[00:17:01] Sergey Levine: I think it’s a good approach. I mean, it depends a little bit on what you wanna do, because if you really want to build a system that works really well, then of course you want to sort of put everything and the kitchen sink in there, and just like use the best tools for every piece of it. But I do think that in science it is a really good idea to sometimes see how extreme of a design can still work, because you learn a lot from doing that. And this is by the way, something, I get a lot of comments on this. Like, you know, I’ll, I’ll be talking to people and they’ll be like, “well, we know how to do like robotic grasping, and we know how to do inverse kinematics, and we know how, how to do this and this, so why don’t you like use those parts?” And it’s like, yeah, you could, but if you wanna understand the utility, the value of some particular new design, it kind of makes sense to really zoom in on that and really isolate it and really just understand its value instead of trying to put in all these crutches to compensate for all the parts where we might have better existing kind of ideas.

[00:17:52] Sergey Levine: You know, like as an analogy, if you wanna design better engines for electric cars, like maybe you do build just like, not like a fancy hybrid car, but really just like an electric race car or something. Just see like how fast can it go? And then whatever technology you develop there, like yeah, you can then put it in, you know, combine it with all these pragmatic and very sober decisions and make it work, afterwards.

[00:18:11] Kanjun Qiu: That’s really interesting. So kind of do the hardest thing first. Do the most extreme thing.

[00:18:15] Sergey Levine: Yeah.

[00:18:15] Kanjun Qiu: So after you published this extremely controversial paper that gets rejected everywhere, what happened then? What were you interested in next?

[00:18:22] Sergey Levine: There were a few things that we wanted to do there, but perhaps the most important one that we came to realize is, and this is going to lead to things that in some ways I’m still working on, is that of course we don’t really want end-to-end robotic deep learning systems that just train with like four or five hours of data. The full power of deep learning is really only realized once you have very large amounts of data that can enable broad generalization. So this was a nice technology demo in that it showed that deep nets could work with robots for manipulation, and of course, you know, many people took that up and there’s a lot more work on using deep nets for robotic manipulation now. But it didn’t realize the full promise of deep learning because the full promise of deep learning required large data sets. And that was really the next big frontier. So what I ended up working on after this was some work that was done at Google. So I started at Google in 2015, and there we wanted to basically scale up deep robotic learning. And what we did is, we again, we took a fairly extreme approach. We intentionally chose not to do all sorts of fancy transfer learning and so on. We went for like the pure brute force thing and we put 18 robots in a room, and we turned them on for months and months and months and had them collect enormous amounts of data autonomously.

[00:19:42] Sergey Levine: And that led to what’s sometimes referred to as the arm farm project. Um, it might have actually been Jeff Dean who coined that term. At one point we wanted to call it the armpit, but I think really like for this project, we wanted to pick a robotic task that was kind of basic in the sense that it was something that like everybody would want, and it was fairly broad that like all robots should have that capability and it was something that could be applied to large sets of objects or something that really needed generalization. So we went with robotic grasping, like basically bin picking. Because that’s not, maybe that’s not the most glamorous thing, but it is something that really needs to generalize because you can pick all sorts of different objects. It’s something that every robot needs to have, and it’s something that we could scale up. So we went for that because that seemed like the right target for this kind of very extreme purist brute force approach. Basically what we did is we went down to Costco and Walmart and we bought tons of plastic junk and we would put it in front of these robots, and just like day after day, we would load up the bins in front of them and they would just run basically as much as possible. One of the things that I spent a lot of time on is just like getting the uptime on the robots to be as high as it could be. So, Peter Pastor, who’s a roboticist at Google AI, we basically did a lot of work to increase that uptime, and of course, it was with a great team that was also supporting the effort. Peter Pastor was probably the main one who did a lot of that stuff, and after several months it got to a point where actually relatively simple techniques could acquire very effective robotic grasping policies.
An interesting anecdote here is we were doing this work, and it took us a while to do it, so it came out in 2016, just a few months after AlphaGo was announced. Alex Krizhevsky, who was working with us on the ConvNet design, when AlphaGo was announced, he actually told me something to the effect of like, “Oh, you know, for AlphaGo they have like a billion something games, and you gave me only a hundred thousand grasping episodes.

[00:21:33] Sergey Levine: This doesn’t seem like this is gonna work.” So I remember I had some snarky retort where I said, “Well, yeah, they have like a billion games, but they still can’t pick up the go pieces.” But on a more serious note, like around this time, I was actually starting to get kind of disappointed because this thing didn’t really work very well. And I think some of this robotics wisdom had rubbed off on me. So I was saying, well, like, okay, maybe we should like put in some more like, you know, domain knowledge about the shapes of objects and so on. I remember Alex also told me like, “Oh, no, no, just like, be patient. Just like add more data to it.” So I heeded that advice and after a few more months, it took a little while, but after a few more months, basically the same things that he had been trying back then just started working once there was enough of a critical mass. Obviously there were a few careful design decisions in there, but we did more or less succeed in this fairly extreme kind of purist way of tackling the problem, which again, it was not by any means the absolute best way to build a grasping system. And actually since then, people have developed more hybrid grasping systems that use depth and 3D and simulation and also use deep learning, and I think it’s fair to say that they do work better. But it was a pretty interesting experience for us that just getting robots in a room for several months with some simple but careful design choices could result in a very effective grasping system.

[00:22:46] Kanjun Qiu: Mm-hmm. Mm-hmm. That’s really interesting.

[00:22:46] Josh Albrecht: One of the things that’s interesting for me is that the scale of that data, to his point about, you know, like a billion go games or like GPT-3, like the amount of data that it’s trained on, the scale of these robotics things is just so much smaller, like a few months, like what was the total number of months in arms? Like the total amount of time in that data set is only on the order of years. Right?

[00:23:07] Sergey Levine: Yeah, so it’s a little hard to judge because obviously the uptime for the robots is not a hundred percent, but roughly speaking, yeah, it’s, if I do a little bit of quick mental math, it would be on the order of a couple of years of robot time, and the total size of the data set was on the order of several hundred thousand trials which amounts to about 10 million images. But of course, you know, the images are correlated in time. So basically, it’s roughly like ImageNet sized, but not much bigger than that.

[00:23:35] Kanjun Qiu: Mm-hmm. Right. Right. And the images are much less diverse than ImageNet.

[00:23:40] Sergey Levine: Of course, yes.

[00:23:40] Kanjun Qiu & Josh Albrecht: Yeah. That’s interesting. It’s surprising that it worked at all, given how small the data set is. Mm-hmm.

[00:23:48] Sergey Levine: Well, although, one thing I will say on this topic is that I think a lot of people are very concerned that large data sets in robotics might be impractical. And there’s a lot of work, a lot of very good work, I should say, on all sorts of transfer learning ideas. But I do think that it’s perhaps instructive to think about the problem as a prototype for a larger system because if someone actually builds, let’s say a home robot, and let’s say that one in a hundred people in America buy this robot and put it in their homes, that’s on the order of 3 million people, 3 million robots, and if those 3 million robots do things for even one month in those homes, that is a lot of data. So the thing is, robots, if they’re autonomous robots, they should be collecting data way more cheaply, in a way larger scale than data that we harvest from humans. So for this reason, I actually think that robotics in the long run may actually be at a huge advantage in terms of its ability to collect data.

[00:24:48] Sergey Levine: We’re just not seeing this huge advantage now in robotic manipulation because we’re stuck at the smaller scale, more due to economics rather than, I would say science. And by the way, here’s an example that maybe hammers this point home. If you work at Tesla, you probably don’t worry about the size of your data set. You might worry about the number of labels, you’re not gonna worry about the number of images you’ve got because that robot is actually used by many people. So if robotic arms get to the same point, we won’t worry about how many images we’re collecting.

[00:25:17] Kanjun Qiu: Mm-hmm. I’m curious what your ideal robot to deploy would be like. What do you think about the humanoid robot versus some other robot type?

[00:25:24] Sergey Levine: Yeah, that’s a great question. If I was more practically minded, if I was a little more entrepreneurial, I would probably give maybe a more compelling answer. But to be honest, I actually think that the most interesting kinds of robots to deploy, especially with reinforcement learning technology, might actually be robots that are very unlike humans. Of course it’s very tempting from science fiction stories and so on, to think, okay, well, robots, they’ll be like Rosie from the Jetsons or, you know, Commander Data from Star Trek or something. They’ll look like people and they will kind of do things like people and maybe they will, that’s fine.

[00:25:56] Sergey Levine: There’s nothing wrong with that, and that’s kind of exciting. But perhaps even more exciting is the possibility that we could have morphologies that are so unlike us that we wouldn’t even know how these things could do stuff. You know, maybe your home robot will be a swarm of a hundred quadrotors that just like fly around like little flies and like clean up your house, right? So they will actually behave in ways that we would not have been able to design manually and where good reinforcement learning methods would actually figure out ways to control these bizarre morphologies in ways that are actually really effective.

[00:26:27] Kanjun Qiu: Huh. That’s really interesting.

[00:26:27] Josh Albrecht: It’d be interesting to see happen. I think maybe one other, I mean there’s lots of things against the humanoid structure, but one thing that it does have going for it is most of the world is currently made for people. So like to open this door, right? This sliding door is like kind of heavy. It’s almost impossible. The quadrotor, it doesn’t matter how clever it is ’cause it just doesn’t have enough force. But yeah, it would be interesting to think about like what kind of crazy strategies they might come up with.

[00:26:52] Kanjun Qiu: You worked on this Google Arm farm project for a while and eventually, it seems like enough data allows you to use relatively simple algorithms to be able to solve the grasping problem in this kind of extreme setup. What were you thinking about after that?

[00:27:06] Sergey Levine: After that, the next frontier that we need to address is to have systems that can handle a wide range of tasks. So grasping is great, but it’s a little special. It’s special in the sense that one very compact task definition, which is like, “Are you holding an object in your gripper?”, can encompass a great deal of complexity. Most tasks aren’t like that. For most tasks, you need to really specify what it is that you want the robot to do, and it needs to be deliberate about pursuing that specific goal and not some other goal. So that leads us into things like multi-task learning. It leads us into things like goal specification and instructions.

[00:27:42] Sergey Levine: One of the things that my students and I worked on when I started as a professor at UC Berkeley is trying to figure out how we can get goal-conditioned reinforcement learning to work really well. So we sat down and we thought, well, like this grasping thing, that was great because like one very concise task definition leads to a lot of complexity. So you can define like a very simple thing, like are you holding an object? Lots of complexity emerges from that just through kind of autonomous interaction. So can we have something like that, some very compact definition that encompasses a wide range of different behaviors? The thing that we settled on to start with was goal-conditioned reinforcement learning, where essentially the robot gets, in the early days, literally a picture of what the environment should be, and it tries to manipulate the environment until it matches that picture. Of course, you can do goal-conditioned reinforcement learning in other ways. For example, more recently, the way that we and many others have been approaching it is by defining the goal through language. But just defining it through pictures is fine to get started because there you kind of just focus on just the visual and the control aspect of the problem.

[00:28:44] Sergey Levine: The very first work that we had on this image goal-conditioned reinforcement learning, this was work done by two students, Vitchyr Pong and Ashvin Nair, who both incidentally work at OpenAI now, but back then they were working on this image-based robotic control. The robot could do very simple things. It was like push an upside down blue bowl, like five inches across the table, right? That was the task. But that was the first ever demonstration of an image-based goal-conditioned RL system. So other people had done non-image based goal-conditioned things, but in the real world with images, that was the first demonstration, and yes, pushing an upside down blue bowl five inches across the table is kind of lame. But it was a milestone. They got things rolling. From there they did other things that were a little more sophisticated. One of the experiments that really stands out in my mind that I thought was pretty neat is, we had set up a robot in front of a little cabinet with a door, and Vitchyr and Ash had developed an exploration algorithm where the robot would actually directly imagine possible images using a generative model.

[00:29:39] Sergey Levine: It was a VAE-based model that would literally hypothesize the kinds of images that it could accomplish in this environment, attempt to reach them and then update its model. So it was like the robot is sort of like dreaming up what it could do, attempting it and seeing if it actually works, and if it doesn’t work, imagining something else. They ran this experiment, and obviously it was a smaller scale experiment than the Arm farm. They ran it over just one day, but within 24 hours it would actually first figure out how to move the gripper around because it was really interesting that the gripper moved. But well then once it started touching the door, it saw that, oh, actually, like the door starts swinging open. So now it imagines lots of different angles for the open door. And from there it starts actually manipulating it. And then it learns how to open the door to any desired angle at the end, and that was entirely autonomous, right? You just put it in front of the door and wait. So that was a pretty neat kind of sign of things to come, obviously at a much smaller scale, suggesting that if you have this kind of goal image thing, then you could push it further and further.

[00:30:28] Sergey Levine: And of course, since then we and many others have pushed this further. In terms of more recent work on this topic, there’s a very nice paper from Google called Actionable Models, where they actually combine this with offline reinforcement learning using a bunch of these large multi-robot data sets that have been collected at Google, to learn very general goal-conditioned policies that could do things like rearrange things on a table and stuff like that. So this stuff has come a long way since then.
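
The imagine, attempt, update loop described above can be sketched in a few lines. This is a toy illustration only, not the system from the episode: the door is reduced to a single angle, the VAE is replaced by a stand-in that samples goals near previously visited states, and the names `ToyGoalModel`, `attempt`, and `explore` are hypothetical.

```python
import random


class ToyGoalModel:
    """Stand-in for the generative model: remembers visited states
    and 'imagines' goals close to them (assumption, not a real VAE)."""

    def __init__(self, initial_state):
        self.visited = [initial_state]

    def sample_goal(self, rng, noise=5.0):
        # Imagine a goal: a previously seen door angle plus some noise,
        # clipped to the physically possible range of 0 to 90 degrees.
        base = rng.choice(self.visited)
        return min(90.0, max(0.0, base + rng.gauss(0.0, noise)))

    def update(self, states):
        # "Retrain" the model on everything the robot has now seen.
        self.visited.extend(states)


def attempt(state, goal, max_steps=10, step=2.0):
    """Hypothetical controller: move the door angle toward the goal."""
    trajectory = []
    for _ in range(max_steps):
        delta = max(-step, min(step, goal - state))
        state = state + delta
        trajectory.append(state)
    return trajectory


def explore(num_rounds=200, seed=0):
    rng = random.Random(seed)
    state = 0.0  # door starts closed
    model = ToyGoalModel(state)
    for _ in range(num_rounds):
        goal = model.sample_goal(rng)        # imagine something to try
        trajectory = attempt(state, goal)    # try to reach it
        model.update(trajectory)             # learn from what happened
        state = trajectory[-1]
    return model


model = explore()
print(f"largest door angle reached: {max(model.visited):.1f} degrees")
```

Because goals are sampled near states the robot has already reached, the set of imagined goals can widen as the door opens further, which mirrors the progression from moving the gripper to imagining many door angles.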

[00:30:51] Josh Albrecht: For goal conditioning on language, like from an image perspective, it’s easy to tell like, is this image the image that I wanted? But from language, like what sort of techniques are you excited about for evaluating whether this goal has actually been accomplished?

[00:31:05] Sergey Levine: There’s a lot of interesting work going on in this area right now, and some of it my colleagues and I at Google are working on. There are many other groups that are working on this, like Dieter Fox’s lab is doing wonderful work in this area within Nvidia. And, well, so this is something that people have had on their mind for a while, but I think that most recently, the thing that has really stimulated a lot of research in this area is the advent of vision language models like CLIP that actually work.

[00:31:30] Sergey Levine: And in some ways I feel a certain degree of vindication myself in focusing on just the image part of the problem for so long. Because I think one of the things that good vision language models allow you to do is not worry about the language so much, because if you have good visual goal models, then you can plug them in with vision language models and the vision language model almost acts like a front end for interfacing these non-linguistic robotic controllers with language. As a very kind of simple example of this, my student Dhruv Shah has a paper called LM-Nav that basically does this for navigation. So Dhruv had been working on just purely image-based navigation, kind of in a similar regime where you specify an image goal. And then together with Brian Ichter from Google and Błażej from the University of Warsaw, they have a recent paper where they basically just kind of do the obvious thing. They take a vision language model, they take CLIP, and they just weld it onto this thing as a language front end. So everything underneath is just purely image based. And then CLIP just says like, okay, among these images, which one matches the instruction the user provided, and that basically does the job. It’s kind of nice that now progress on visual language models, which can take place entirely outside of robotics, would basically lead to better and better language front ends for purely visual goal-conditioned systems.
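
A minimal sketch of the "language front end" idea: score each candidate goal image against the instruction and hand the best match to an image-goal controller. The embeddings below are made-up three-dimensional vectors standing in for what a vision language model such as CLIP would produce, and `select_goal_image` is a hypothetical helper, not a function from LM-Nav or any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_goal_image(text_embedding, image_embeddings):
    """Pick the candidate image whose embedding best matches the instruction."""
    scores = {
        name: cosine(text_embedding, emb)
        for name, emb in image_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumed values); a real system would compute these
# with the image and text encoders of a vision language model.
candidate_images = {
    "stop_sign": [0.9, 0.1, 0.0],
    "blue_truck": [0.1, 0.9, 0.1],
    "picnic_table": [0.0, 0.2, 0.9],
}
instruction_embedding = [0.2, 0.8, 0.1]  # stands in for "go to the blue truck"

goal_name, scores = select_goal_image(instruction_embedding, candidate_images)
print(goal_name)  # the selected image becomes the goal for the image-based controller
```

Everything downstream of `goal_name` stays purely image based, which is why a better vision language model can improve the front end without retraining the controller.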

[00:32:41] Kanjun Qiu: That’s interesting. How far do you feel like visual goal-conditioned systems can go, especially with imagination?

[00:32:48] Sergey Levine: I think they can go pretty far actually. And I think that the important thing there though is to kind of think about it the right way. Like I think we shouldn’t take the whole matching pixels thing too literally. It’s really more like the robot’s goal–there’s kind of a funny version of this that actually came up in a project on robotic navigation that Dhruv and I were doing where we had data of robots driving around at different times of day and there’s almost like a philosophical problem. You give it a picture of a building at night and it’s currently during the day, so what should it do? Should it, like, drive to the building and then wait until it’s night, or should it, like, you know, wait around until it gets dark because that’s closer? So you kind of have to be able to learn representations that abstract away all of these kind of non-functional things. But if you’re reaching your goal in a reasonable representation space, then it actually does make sense. And fortunately, with deep learning, there are a lot of ways to learn good representations. So as long as we don’t take the pixels thing too literally, and we use appropriate representation learning methods, it’s actually a fairly solid approach.

[00:33:46] Kanjun Qiu: Right. That makes sense. And that’s actually a really interesting question. Kind of if you give a picture of a building at night and it’s daytime, it doesn’t matter in some situations, but in other situations it really does matter. It really depends on kind of what the higher level goal is, but it doesn’t have that concept of higher level goal yet.

[00:34:00] Sergey Levine: Yeah. So in reinforcement learning, people have thought about these problems a bit. So from a very technical standpoint, goal-conditioned policies do not represent all possible tasks that an agent could perform. But the set of state distributions does define the set of all possible outcomes. So if you can somehow lift it up from just conditioning on a single goal state to conditioning on a distribution over states, then that provably allows you to represent all tasks that could possibly be done. There are different ways that people have approached this problem that are very interesting. They’ve approached it from the standpoint of these things called successor features, which are based on successor representations. You can roughly think of these as low dimensional projections of state distributions. More recently there’s some really interesting work that I’ve seen out of FAIR. This is by a fellow named Ahmed Ander two researchers at Meta. They’re developing techniques for unsupervised acquisition of these kind of feature spaces where you can project state representations and get policies that are sort of conditioned on any possible task. So there’s a lot of active research in this area. It’s something I’m really interested in. I think it’s possible to kind of take these goal-conditioned things a little further and really condition on any notion of a task.
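
To make the link between state distributions and tasks concrete, here is a small tabular sketch of a successor representation: for a fixed policy, entry M[s][s'] is the discounted expected number of visits to s' starting from s, so the value of any reward vector is just M times that vector. This uses the standard textbook definition and is not necessarily the formulation used in the work mentioned above; the three-state chain and its transition matrix are invented for illustration.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def successor_representation(P, gamma, iters=500):
    """Iterate M <- I + gamma * P @ M, which converges to sum_t gamma^t P^t."""
    n = len(P)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        PM = matmul(P, M)
        M = [
            [identity[i][j] + gamma * PM[i][j] for j in range(n)]
            for i in range(n)
        ]
    return M


def value(M, reward):
    """Value of each start state for a given reward vector: V = M @ reward."""
    return [sum(m * r for m, r in zip(row, reward)) for row in M]


# Hypothetical 3-state chain under some fixed policy (each row sums to 1).
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
GAMMA = 0.9
M = successor_representation(P, GAMMA)

# Two different tasks evaluated with the same M and no further learning.
v_reach_last = value(M, [0.0, 0.0, 1.0])
v_stay_first = value(M, [1.0, 0.0, 0.0])
print(v_reach_last, v_stay_first)
```

Successor features replace the one-hot state indicator with a learned feature vector, which is what makes the projection low dimensional; the tabular case here is only the simplest instance.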

[00:35:11] Josh Albrecht: When you’re thinking about what directions to pursue, and especially given, you know, the number of people that you collaborate with and the number of students and things like that, like how do you think about picking which research questions to answer and how has that evolved over the years?

[00:35:25] Sergey Levine: There are a couple of things I could say here. Obviously the right way to pick research questions really depends a lot on one’s research values and what they want out of their research. But for me, I think that something that serves as a really good compass is to think about some very distant end goal that I would really like to see. Like generally capable robotic systems, generally capable AI systems–AI systems that could do anything that humans can do. Then when thinking about research questions, I ask myself, “If a research project that I do is wildly successful, the most optimistic sort of upper confidence bound estimate of success, will it make substantive progress towards this very distant end goal?” You really want to be optimistic when making that gauge, because obviously the expected outcome of any research project is failure. Like, you know, research is failure. That’s kind of the truth of it. But if the most optimistic outcome for your research project is not making progress on your long-term goals, then something is wrong. So I always make sure to look at whether the most optimistic guess at the outcome makes substantial progress towards the most distant and most ambitious goal that I have in mind.

[00:36:34] Kanjun Qiu: Has your distant end goal changed over time?

[00:36:36] Sergey Levine: Not in a huge way. But I think it’s easy to have a goal that doesn’t change much over time if it’s distant enough and big enough.

[00:36:44] Kanjun Qiu: That’s right.

[00:36:45] Sergey Levine: So if your end goal is something as broad as like, I just want generally capable AI systems that can do anything a person can do, it’s… I mean that may be a very far away target to hit, but it’s also such a big target to hit that it’s probably gonna be reasonably conservative over time.

[00:36:58] Kanjun Qiu: That’s right. That makes sense. And that’s yours, is to make general purpose.

[00:37:01] Sergey Levine: Yeah.

[00:37:02] Kanjun Qiu: What do you feel like are the most interesting questions to you right now?

[00:37:05] Sergey Levine: One thing that I can maybe mention here is that I think that especially over the last one or two years, there have been a lot of advances in machine learning systems, both in robotics and in other areas like vision and language, that do a really good job of emulating people through imitation learning, through supervised learning. That’s what language models do essentially, right? They’re trained to imitate huge amounts of human produced data. Imitation learning in robotics has been tremendously successful, but I think that ultimately we really need machine learning systems that do a good job of going beyond the best that people can do.

[00:37:43] Sergey Levine: That’s really the promise of reinforcement learning. If we were to chart the course of this kind of research, it was something like, well, about five years back, when there was a lot of excitement about reinforcement learning, things like AlphaGo–a really exciting prospect there was that emergent capabilities from these algorithms could lead to machines that are superhuman, that are significantly more capable than people at certain tasks. But it turned out that it was very difficult to make that recipe by itself scale because a lot of the most capable RL systems relied in a really strong way on simulation.

[00:38:18] Sergey Levine: So in the last few years, a lot of the major advances have taken a step back from that and instead focused on ways to bring in even more data, which is great because that leads to really good generalization. But when using purely supervised methods with that, you get at best an emulation of human behavior, which in some cases, like with language models, is tremendously powerful, because if you have the equivalent or even a loose approximation of human behavior for like typing text, that’s tremendously useful.

[00:38:45] Sergey Levine: But I do think that we need to figure out how to take these advances and combine them with reinforcement learning methods because that’s the only way that we’ll get to above human behavior, to actually have emergent behavior that improves on the typical human. I think that’s actually where there’s a major open question on how to combine, not the simulation-based, but the data-driven approach with reinforcement learning in a very effective way.

[00:39:09] Kanjun Qiu: Hmm. That’s interesting. Do you feel like you have any thoughts on how to do that combination?

[00:39:14] Sergey Levine: In my group at Berkeley, we’ve been focusing a lot on what we call offline reinforcement learning algorithms. And the idea is that traditionally reinforcement learning is thought of as a very online and interactive learning regime, right? So if you open up the classic Sutton and Barto textbook, the most canonical diagram that everyone remembers is the cycle where the agent interacts with the environment and then produces an action. The environment produces some state and it all goes in a loop. It’s a very online, interactive picture of the world. But the most successful large scale machine learning systems, language models, giant ConvNets, et cetera, they’re all trained on data sets that have been collected and that are stored to disk and then reused repeatedly.

[00:39:56] Sergey Levine: Because if you’re going to train on billions and billions of images or billions of documents of text, you don’t wanna recollect those interactively each time you retrain your system. So the idea in offline reinforcement learning is to take a large dataset like that and extract a policy by analyzing the dataset, not by interacting directly with a simulator or physical process. You could have some fine tuning afterwards, a little bit of interaction, but the bulk of your understanding of the world should come from a static dataset because that’s much more scalable. That’s the premise behind offline reinforcement learning. We’ve actually come a long way in developing algorithms that are effective for this. So when we started on this research in 2019, it was basically like nothing worked. You would take algorithms that worked great for online RL and in the offline regime, they just didn’t do anything, whereas now we actually have pretty respectable algorithms for doing this, and we’re starting to apply them, including to the training of language models.

[00:40:47] Sergey Levine: We had a paper called Implicit Language Q Learning on this earlier this year, as well as pre-training large models for robotic control. That stuff is really just starting to work now, and I think that’s one of the things that we’ll see a lot of progress on very imminently.
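
As a rough illustration of the offline setting described above, the sketch below runs tabular Q-learning over a fixed log of transitions and never calls an environment. The four-state chain, its rewards, and the helper name `offline_q_learning` are assumptions made for this example; real offline RL methods use neural networks and additional safeguards.

```python
GAMMA = 0.9

# A static log of (state, action, reward, next_state, done) tuples, as if
# collected earlier by some other policy. Action 1 moves right, action 0
# moves left; reaching state 3 ends the episode with reward 1.
dataset = [
    (0, 1, 0.0, 1, False),
    (1, 1, 0.0, 2, False),
    (2, 1, 1.0, 3, True),
    (1, 0, 0.0, 0, False),
    (2, 0, 0.0, 1, False),
    (0, 0, 0.0, 0, False),
]


def offline_q_learning(dataset, num_states, num_actions, sweeps=200, lr=0.5):
    """Repeatedly replay the stored transitions; no new data is collected."""
    q = [[0.0] * num_actions for _ in range(num_states)]
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            target = r if done else r + GAMMA * max(q[s_next])
            q[s][a] += lr * (target - q[s][a])
    return q


q = offline_q_learning(dataset, num_states=4, num_actions=2)

# Greedy policy for the non-terminal states, extracted purely from the log.
policy = [max(range(2), key=lambda a: q[s][a]) for s in range(3)]
print(policy)
```

Here every state-action pair appears in the log, so the learned values can be trusted. The harder case, where the greedy policy wants to take actions the log never contains, is the distributional shift problem discussed later in the conversation.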

[00:40:59] Kanjun Qiu: That’s interesting. When you first started working on offline RL, what were the problems that you felt like needed to be solved in order to get offline RL to work at all?

[00:41:06] Sergey Levine: So the basic problem with offline RL, which people, well, I can step back a little bit–in the past, people thought that offline RL really wasn’t that different from kind of traditional value-based methods like Q-learning, and you just needed to kind of come up with appropriate objectives and representations and then, you know, whatever you do to fit Q functions from online interaction, maybe you could just do the same thing with static data and that would kind of work. It actually did work in the olden days when everyone was using linear function approximators, because linear function approximators are fairly low dimensional and you can run them on offline data and they kind of do more or less the same thing that they would do with online data, which is not much to be honest. But then with deep neural nets, when you run them with offline data, you get a problem because deep nets do a really good job of fitting to the distribution they’re trained on, and the trouble is that if you’re doing offline RL, the whole point is to change your policy.

[00:42:01] Sergey Levine: And when you change your policy, then the distribution that you will see when you run that policy is different from the one you trained on, and because neural nets are so good at fitting to the training distribution, that strength becomes a weakness when the distribution changes. It turns out this is something that people only started realizing a couple years back, but now it is a very widely accepted notion that this distributional shift is a very fundamental challenge in offline reinforcement learning. And it really deeply connects to counterfactual inference. Reinforcement learning is really about counterfactuals. It’s about saying, well, I saw you do this and that was the outcome, and I saw you do that, and that was the outcome. What if you did something different? Would the outcome be better or worse?

[00:42:38] Sergey Levine: That’s the basic question that reinforcement learning asks. And that is a counterfactual question. And with counterfactual questions, you have to be very careful because some questions you simply cannot answer. So if you’ve only seen cars driving on a road and you’ve never seen them swerve off the road and go into the ditch, you actually can’t answer the question: what would happen if you go into the ditch? The data simply is not enough to tell you. So in offline RL the correct answer then is don’t do it because you don’t know what will happen. Avoid the distributional shift for which there’s no way for you to produce a reasonable answer. But at the same time, you still have to permit the model to generalize. You know, if there’s something new that you can do that is sufficiently in distribution that you do believe you can produce an accurate estimate of the outcome, then you should do that because you need generalization to improve over the behavior that you saw in the dataset, and that’s a very delicate balance to strike.

[00:43:26] Josh Albrecht: Is there a principled answer to that, or is it just a sort of like heuristic, ah, we just pick something in the middle and it kind of works sometimes.

[00:43:34] Sergey Levine: There are multiple principled answers, but one answer that seems pretty simple and seems to work very well for us, it was developed in a few different concurrent papers, but in terms of the algorithms that people tend to use today, probably one of the most widely used formulations, it was in a paper called Conservative Q-Learning by Aviral Kumar, one of my students here.

[00:43:54] Sergey Levine: The answer was, well, be pessimistic. So essentially, if you are evaluating the value of some action and that action looks a little bit unfamiliar, give it a lower value than your network thinks it has, and the more unfamiliar it is, the lower the value you should give it. And if you’re pessimistic in just the right way, that pessimism will cancel out any erroneous overestimation that you would get from mistakes in your neural network. That actually tends to work. It’s simple, it doesn’t require very sophisticated uncertainty estimation. It essentially harnesses the network’s own generalization abilities because this pessimism affects the labels for the network and then the network will generalize from those labels.

[00:44:36] Sergey Levine: So in a sense, the degree to which it penalizes unfamiliar actions is very closely linked to how it’s generalizing. So that actually allows it to still make use of generalization while avoiding the really weird stuff that it should just not do.
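
The "be pessimistic" idea can be shown on a tiny one-step problem. The sketch below is a simplified tabular reading of a conservative penalty in the spirit of Conservative Q-Learning, not the paper's implementation: on top of the usual squared error it adds `alpha * (logsumexp(Q) - Q[a_data])`, which pushes all action values down while pulling the values of actions seen in the data back up. The dataset, the initial overestimate for the unseen action, and the hyperparameters are all assumed for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax; this is the gradient of logsumexp."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fit_q(dataset, q_init, alpha, epochs=2000, lr=0.05):
    """Per-sample gradient descent on
    0.5 * (Q[a] - r)^2 + alpha * (logsumexp(Q) - Q[a])."""
    q = list(q_init)
    for _ in range(epochs):
        for action, reward in dataset:
            probs = softmax(q)
            for b in range(len(q)):
                grad = alpha * probs[b]              # pushes every action's value down
                if b == action:
                    grad += (q[b] - reward) - alpha  # fit the data, pull the seen action up
                q[b] -= lr * grad
    return q


def greedy(q):
    return max(range(len(q)), key=lambda a: q[a])


# Single state, three actions. Action 2 never appears in the data, and its
# initial value of 5.0 stands in for an erroneous network overestimate.
dataset = [(0, 1.0), (1, 0.0)]
q_init = [0.0, 0.0, 5.0]

naive = fit_q(dataset, q_init, alpha=0.0)
conservative = fit_q(dataset, q_init, alpha=0.5)
print("naive picks action", greedy(naive))
print("conservative picks action", greedy(conservative))
```

Without the penalty nothing ever corrects the value of the unseen action, so the greedy policy chooses exactly the action the data says nothing about. With the penalty, that value is driven down and the policy falls back to the best action that is supported by data.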

[00:44:48] Kanjun Qiu: That’s interesting.

[00:44:48] Josh Albrecht: So then in offline RL, thinking about techniques for going forward, do you feel like there’s a lot left to be done in offline RL, or are we sort of at the point where like, we have decent techniques, we’re learning a lot from these data sets that we have and we sort of need something else to move forward and actually make systems that are significantly better than what’s in the data already?

[00:45:10] Sergey Levine: Yeah. Yeah. I think we’ve made a lot of progress on offline RL. I do think there are major challenges still to address. And I would say that these major challenges fall into two broad categories. So the first category has to do with something that’s not really unique to offline RL. Actually, like it’s a problem for all RL methods, and that has to do with their stability and scalability.

[00:45:32] Sergey Levine: So, RL methods, not just offline RL, all of them are harder to use than supervised learning methods, and a big part of why they’re harder to use is that, for example, with value-based methods like Q-learning, they are not actually equivalent to gradient descent. So gradient descent is really easy to do. Gradient descent plus backprop, supervised learning, you know, cross entropy loss.

[00:45:52] Sergey Levine: Great. Like, fair to say that that’s kind of at a point where it’s a turnkey thing: you code it up in PyTorch or JAX, it works wonderfully. Value-based RL is not gradient descent. It’s fixed point iteration disguised as gradient descent. Because of that, a lot of the nice things that make gradient descent so simple and easy to use start going a little awry when you’re doing Q-learning or value iteration type methods. We’ve actually made some progress in understanding this. There’s work on this in my group, there’s work on this in several other groups including for example Shimon Whiteson’s group at Oxford, many others. Just recently we’ve sort of started to scratch the surface for what is it that really goes wrong when you use Q-learning style methods, these fixed point iteration methods, rather than gradient descent. And the answer seems to be, and this is kind of preliminary, but the answer seems to be that some of the things that make supervised deep learning so easy actually make RL hard. So let me unpack this a little bit.

[00:46:49] Sergey Levine: If you told somebody who’s like a machine learning theorist in, let’s say, the early 2000s that you’re going to train a neural net with like a billion parameters with gradient descent for like image recognition, they would probably tell you, well, yeah, that’s really dumb because you’re going to overfit and it’s gonna suck. So like, why are you even doing this? Based on the theory at that time, it would’ve been completely right. The surprising thing that happens is that when we train with supervised learning, with gradient descent, there’s some kind of magical, mysterious fairy that comes in and applies some magic regularization that makes it not overfit, and in machine learning theory, one of the really active areas of research has been to understand like, who is that fairy, what is the magic, and how does that work out? And there are a number of hypotheses that have been put forward that are pretty interesting, that all have to do with some kind of regularizing effect that basically makes it so this giant overparameterized neural net actually somehow comes up with a simple solution rather than an overly complex one. This is sometimes referred to as implicit regularization–implicit in the sense that it emerges implicitly from the interplay of deep nets and stochastic gradient descent, and it’s really good. Like that’s kind of what saves our bacon when we use these giant networks. And it seems to be that for reinforcement learning, because it’s not exactly gradient descent, that implicit regularization effect actually sometimes doesn’t play in our favor.

[00:48:07] Sergey Levine: Like sometimes it’s not actually a fairy, it’s like an evil demon that comes in and like screws up your network. And that’s really worrying, right? Because like we have this like, mysterious thing that seems to have been like really helping us for supervised learning, and now suddenly we’re doing RL and it comes in and hurts us instead. And at least to a degree, that seems to be part of what’s happening. So now that there’s a slightly better understanding of that question, and I don’t wanna overclaim how good our understanding of that is because there’s like major holes in that. So there’s a lot to do there. But at least we have an inkling. We have a suspect, so to speak, even if we can’t prove that they did it. We can start trying to solve the problem. We can try, for example, inserting explicit regularization methods that could counteract some of the ill effects of the no longer helpful implicit regularization.

[00:48:45] Sergey Levine: We can start designing architectures that are maybe more resilient to these kinds of effects. So that’s something that’s happening now, and it’s not by any means like a solved thing, but that’s where we could look for potential solutions to these kind of instability issues that seem to afflict reinforcement learning.
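
The remark that value-based RL is "fixed point iteration disguised as gradient descent" can be made concrete with a one-state example. In the sketch below, each inner loop is ordinary gradient descent on a squared error, but the regression target is recomputed from the current estimate between inner loops, so the loss being minimized keeps moving. The single self-looping state, its reward, and the function name `fitted_q_iteration` are assumptions chosen to keep the example minimal.

```python
REWARD, GAMMA = 1.0, 0.9


def fitted_q_iteration(outer_iters=100, inner_steps=50, lr=0.5):
    """One state that loops to itself with a fixed reward.
    Returns the final estimate and the sequence of regression targets."""
    q = 0.0
    targets = []
    for _ in range(outer_iters):
        # Freeze the target using the current estimate (the "fixed point" part).
        target = REWARD + GAMMA * q
        targets.append(target)
        for _ in range(inner_steps):
            # Gradient of 0.5 * (q - target)^2 with the target held constant
            # (the "gradient descent" part).
            grad = q - target
            q -= lr * grad
    return q, targets


q, targets = fitted_q_iteration()
print(f"q = {q:.4f}, fixed point = {REWARD / (1 - GAMMA):.4f}")
print("first targets:", [round(t, 3) for t in targets[:4]])
```

Each inner problem is easy, but the overall procedure only converges because the Bellman backup is a contraction in this tabular case. With a neural network in the loop there is no single fixed loss being descended, which is one way to read the stability concerns raised above.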

[00:49:00] Kanjun Qiu: What’s the intuition behind why implicit regularization seems to help in supervised networks, but be harmful in RL?

[00:49:07] Sergey Levine: The intuition is roughly that given a wide range of possible solutions, a wide range of different assignments to the weights of a neural net, you would select the one that is simpler, that results in the simpler function. So there are many possible values of neural net weights that would all give you a low training loss, but many of them are bad because they overfit, and implicit regularization leads to selecting those assignments to the weights that result in simpler functions and yet still fit your training data and therefore generalize better.

[00:49:35] Kanjun Qiu: And so the intuition for RL is okay, for whatever reason, implicit regularization results in learning simpler functions, but actually those simpler functions are worse in an RL regime.

[00:49:47] Sergey Levine: Yeah, so in RL, it seems like you get one of two things. You either get that the whole thing kind of fails entirely and you get really, really complicated functions, and roughly speaking, that’s like overfitting to your target values. Basically, your target values are incorrect in the early stages.

[00:50:00] Sergey Levine: So you overfit to them and you get some crazy function. Essentially you get like a little bit of noise in your value estimates, and that noise gets exacerbated more and more and more until all you’ve got is noise. Or, on the other hand, the other thing that seems to sometimes happen, and experimentally this actually seems fairly common, is that this thing goes into overdrive and you actually discard too much of the detail and then you get an overly simple function.

[00:50:19] Sergey Levine: But somehow it seems hard to hit that sweet spot. The kind of sweet spot that you hit every time with supervised learning seems annoyingly hard to hit with reinforcement learning.

[00:50:27] Kanjun Qiu: That’s interesting. How much does data diversity help? Like if you were to add a lot more offline data of various types, does that seem to do anything to this problem or not really?

[00:50:39] Sergey Levine: We actually have a recent study on this. This was done by some of my students together actually in collaboration with Google on large-scale offline RL actually for Atari games, and there we study what happens when you have lots of data and also large networks. It seems like the conclusion that we reached is actually that if you’re careful in your choice of architecture, basically select architectures that are very easy to optimize, like ResNets, for example, and you use larger models than you think would be appropriate, larger than what you would need even for supervised learning.

[00:51:09] Sergey Levine: Then things actually seem to work out a lot better. And in that paper, kind of our takeaway was that actually a lot of the reason why large-scale RL efforts were so difficult before is that people were sort of applying their supervised learning intuition and selecting architectures according to that, when in fact if you go, like, somewhat larger than that, maybe two times larger in terms of architecture size, that actually seems to mitigate some of the issues.

[00:51:34] Sergey Levine: It probably doesn’t fully solve them, but it does make things a lot easier. It’s not clear why that’s true, but one guess might be that when you’re doing reinforcement learning, you don’t just need to represent the final solution at the end. You don’t just need to represent the optimal solution. You also need to represent everything in between. You need to represent all those suboptimal behaviors on the way there, and those suboptimal behaviors might be a lot more complicated. Like, the final optimal behavior might be hard to find, but it might actually be a fairly simple, parsimonious behavior.

[00:51:59] Sergey Levine: The suboptimal things where you’re like kind of, okay, here, kind of okay there maybe kind of optimal over there. Those might actually be more complicated and you might require more representational capacity to go on that journey and ultimately reach the optimal solution.

[00:52:11] Kanjun Qiu: It’s really interesting that in RL you need to do this counterfactual reasoning pretty explicitly. And so you’d need to represent these suboptimal behaviors, but in, let’s say, a language model, you don’t need to. They’re often quite bad at counterfactual reasoning, and we do see that they get better at that as they get larger. So there’s something interesting here.

[00:52:29] Sergey Levine: Yeah, absolutely. And actually trying to improve language models through reinforcement learning, particularly value-based reinforcement learning, is something that my students and I are doing quite a bit of work on these days. So obviously many of your listeners are probably familiar with the success of RL with human preferences in recent language models work. But one of the ways in which that falls short is that a lot of the ways that people do RL with language models now treat the language model’s task as a one-step problem. So it’s just supposed to generate like one response and that response should get the maximal reward. But if we’re thinking about counterfactuals, that is typically situated in a multi-step process. So maybe I would like to help you debug some kind of technical problem. Like maybe you’re having trouble reinstalling your graphics driver. Maybe I might ask you a question like, well, what kind of operating system do you have? Have you tried running this diagnostic? Now, in order to learn how to ask those questions appropriately, the system needs to understand that if it has some piece of information, then it can produce the right answer. And if it asks the question that can get that piece of information, it’s a multi-step process.

[00:53:36] Sergey Levine: And if it has suboptimal data of humans that were doing this task, maybe not so well, then it needs to do this counterfactual reasoning to figure out what is the most optimal questions to ask and so on. And that’s stuff that you’re not going to get with these kind of one step human preferences formulations. And certainly it’s not what you’re going to get with regular supervised learning formulations, which will simply copy the behavior of the typical human. So I think there’s actually a lot of potential to get much more powerful language models with appropriate value-based reinforcement learning, the kind of reinforcement learning that we do in robotics and other RL applications.
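To make the one-step versus multi-step distinction concrete, here is a minimal sketch of the tech-support example as a tiny decision process. Everything in it is hypothetical: the two states, the success probabilities used as rewards, and the discount factor are made-up numbers for illustration, not anything measured or described in the episode.

```python
GAMMA = 0.95  # discount factor (assumed)

# Hypothetical support dialogue. The agent can answer right away, or first
# ask a clarifying question and answer once it has the information.
# transitions[state][action] = (expected_reward, next_state or None if done)
transitions = {
    "no_info":  {"answer": (0.3, None), "ask": (0.0, "has_info")},
    "has_info": {"answer": (0.9, None)},
}

def q_values(state):
    """Multi-step action values via Bellman backups (plain recursion, since this toy problem has no loops)."""
    values = {}
    for action, (reward, next_state) in transitions[state].items():
        future = max(q_values(next_state).values()) if next_state else 0.0
        values[action] = reward + GAMMA * future
    return values

# One-step view: pick the action with the highest immediate reward.
one_step_choice = max(
    transitions["no_info"], key=lambda a: transitions["no_info"][a][0]
)

# Multi-step view: pick the action with the highest long-run value.
multi_step_q = q_values("no_info")
multi_step_choice = max(multi_step_q, key=multi_step_q.get)

print("one-step choice:  ", one_step_choice)
print("multi-step choice:", multi_step_choice, multi_step_q)
```

Under these assumed numbers the one-step formulation answers immediately, because asking a question earns nothing by itself, while the value-based formulation prefers to ask first because the question makes a much better answer possible on the next turn.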

[00:54:06] Josh Albrecht: Digging into that a little bit, like how does that work tactically for you and for students at your lab, given that the larger you make these language models, the more capable they are, and you know, it’s kind of hard to run even inference for these things on the kind of compute that’s usually available at an academic institution. I mean, you guys have a decent amount of compute for universities, but still not quite the same as say, Google or OpenAI.

[00:54:27] Sergey Levine: Yeah, it’s certainly not easy, but I think it’s entirely possible to take that problem and subdivide it into its constituent parts so that if we’re developing an algorithm that is supposed to enable reinforcement learning with language models, well, that can be done with a smaller model evaluating the algorithm appropriately to just make sure that it’s like doing what it’s supposed to be doing. And that’s a separate piece of work from the question of how can it be scaled up to the largest size to really see how far it could be pushed. So subdividing the problem appropriately can make this quite feasible, and I don’t think that’s actually something that is uniquely demanded in academia.

[00:55:00] Sergey Levine: Like even if you work for a large company, even if you have all the TPUs and GPUs that you could wish for at your fingertips, which, by the way, researchers at large companies don’t always have, even then it’s a good idea to chop up your problem into parts because you don’t wanna be waiting three weeks just to see that you implemented something incorrectly in your algorithm.

[00:55:18] Sergey Levine: So in some ways it’s not actually that different, just that there’s that last stage of really fully scaling it up. But, you know, I mean, I think for graduate students that wanna finish their PhD, in many cases, they’re happy to leave that to somebody who is more engineering focused to get that last mile anyway. So as long as we have good ways to vet things, good benchmarks and good research practices, we can make a lot of progress on this stuff.

[00:55:39] Josh Albrecht: Mm-hmm. Is there any worry that emergent behaviors that you see at much larger scales would kind of cause you to make the wrong conclusion at a larger scale with some of these experiments?

[00:55:48] Sergey Levine: Yes, that’s definitely a really important thing to keep in mind. So, I think that it is important to have a loop, not just a one-directional pipeline. But there’s a middle ground to this. So we have to kind of hit that middle ground. We don’t wanna be entirely–we don’t wanna commit the same sin that all too often people committed in the olden days of reinforcement learning research where we do things at too small of a scale to see the truth, so to speak. But at the same time, we wanna do it at a small enough scale that we can make progress, get some kind of turnaround, maybe find the right collaborators in an industrial setting once we do get something working so that we can work together to scale it up and complete the life cycle that way.

[00:56:24] Josh Albrecht: Yeah. Yeah. Actually, that brings me back to another question I was going to ask earlier. When you were talking about the examination of performance on Atari games as you made the models just much larger, like it does seem like in reinforcement learning, the models are much, much smaller than they are in many other parts of machine learning. Do you have any sense for exactly why that is? Is it just historical? Is it merely a performance thing? Like it just seems like, you know, I see a lot of like three-layer ConvNets or something, like, not even a ResNet, or like a two-layer MLP or something that’s just much, much simpler and with very small dimensions.

[00:56:57] Sergey Levine: Well, that has to do with the problems that people are working on. So it’s quite reasonable to say that if your images are Atari game images, it’s a reasonable guess that the visual representations that you would need for that are less complex than what you would need for realistic images, and when you start attacking more realistic problems, more or less exactly what you expect happens: the more modern architectures do become tremendously useful as the problem becomes more realistic. Certainly in our robotics work the kind of architectures we use generally are much closer to the latest architectures in computer vision.

[00:57:28] Josh Albrecht: Mm-hmm. So it’s really just with relation to the problem, like as you get closer to the real world, the more the larger networks start to pay off quite a bit. Although, I guess the interesting thing about the Atari thing was like, as you made these larger, they seem to help anyway. Right?

[00:57:42] Sergey Levine: Yes, so that was kind of the surprising thing, is that certainly in robotics, this was not news that, you know, in robotics people, us and many others have used larger models, and yes, it was helping, but the fact that for these Atari games where if you just wanted to, let’s say, imitate good behavior, you get away with a very small network. Learning that good behavior with offline value-based reinforcement learning really benefited from the larger networks. And it seems to have more to do with kind of optimization benefits rather than just being able to represent the final answer.

[00:58:13] Kanjun Qiu: In terms of the goal of getting to more general intelligence, some people, they feel, if we just keep scaling up language models and adding things onto them, doing, you know, multi-step human preferences formulations, and finding some way to spend compute at inference so that it can do reasoning, then we’ll be able to get all the way with just these language-based formulations. What are your thoughts on that? And kind of like the importance of robotics versus not.

[00:58:39] Sergey Levine: There are a couple of things that I could say on this topic. So first, let’s just keep the discussion just to language models to start with. So let’s say that we believe that doing all the language tasks somebody would want to do is, is kind of, that’s good enough and that’s fine. Like, and there’s all you can do that way. Is it sufficient to simply build larger language models? I think that the answer there, in my opinion, would be no. Because there are really two things that you need: the ability to learn patterns in data, and the ability to plan. Now, plan is a very loaded word and I use that term in the same sense that, for example, like Rich Sutton would use it, where plan really refers to some kind of computational process that determines a course of action. It doesn’t necessarily need to be literally like, you think of individual steps in a plan. It could be reinforcement learning. Reinforcement learning is a kind of amortized planning. But there’s some kind of process that you need where you’re actually reflecting on the patterns you learned through some sort of optimization to find good actions rather than merely average actions. And that could be done at training time.

[00:59:38] Sergey Levine: So that could be like the value-based RL. It could also be done at test time. It could simply be that all you learn from your data is a predictive language model, but then at test time instead of simply doing the maximum a posteriori decoding, instead of simply finding the most likely answer, you actually do some kind of optimization to find an answer that actually leads to an outcome that you wanna see.

[00:59:55] Sergey Levine: So maybe I’m trying to debug your graphics driver problem. And what I want is, I want you to say at the end, “thank you so much, you did a good job, you fixed my graphics driver.” So I might ask the model, well, what could I say now that would maximize the probability that I’ll actually fix your graphics driver? And if the model can answer that question, maybe some kinda optimization procedure can answer that question. That’s planning. Planning could also mean just running Q-learning. That’s fine too. So whatever thing it is, that’s actually very important. And I will say something here that a lot of people, when they appeal to the possibility that you can simply build larger and larger models, they often reference Rich Sutton’s Bitter Lesson essay.
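The test-time version of planning described above can be sketched very simply: instead of returning the reply the language model considers most likely, score each candidate by an estimate of the outcome you care about and return the best one. The candidate replies and both sets of numbers below are invented for illustration; in a real system the outcome estimate would have to come from a learned value function or from the model itself.

```python
# Hypothetical candidate replies with made-up scores.
# "likelihood": stand-in for the language model's probability of the reply.
# "p_fixed":    stand-in for an estimate of how likely the conversation is
#               to end with the user's graphics driver actually fixed.
candidates = [
    {"reply": "Have you tried restarting?",         "likelihood": 0.50, "p_fixed": 0.20},
    {"reply": "Which operating system are you on?", "likelihood": 0.30, "p_fixed": 0.70},
    {"reply": "Sorry, I can't help with that.",     "likelihood": 0.20, "p_fixed": 0.01},
]

# Most-likely decoding: what a typical response would look like.
most_likely = max(candidates, key=lambda c: c["likelihood"])

# Test-time planning: optimize for the outcome rather than for typicality.
planned = max(candidates, key=lambda c: c["p_fixed"])

print("most likely reply:", most_likely["reply"])
print("planned reply:    ", planned["reply"])
```

The interesting (and hard) part is where the outcome estimate comes from, which is exactly the role value-based RL would play at training time.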

[01:00:30] Sergey Levine: It’s a great essay. I would actually strongly recommend to everybody to read it, but to actually read it because he doesn’t say that you should use big models and lots of data. He says you should use learning and planning. That’s very, very important because learning is what gets you the patterns and planning is what gets you to be better than the average thing in those patterns.

[01:00:51] Kanjun Qiu: Yeah.

[01:00:52] Sergey Levine: So, that’s, we need the planning.

[01:00:54] Josh Albrecht: Yeah. Yeah. I’ve been telling people to actually read the–

[01:00:51] Kanjun Qiu: This is also Josh’s takeaway.

[01:01:02] Josh Albrecht: Yeah, yeah. But I guess, just to push back on that slightly as a devil’s advocate for a second, like it might be the case that, I think, you know, some of these people advocating large language models are saying, maybe we can get away with sort of simple types of planning in language.

[01:01:14] Josh Albrecht: So for example, chain of thought ensembling, or asking the language model, like, what would you do next? Or just sort of like, kind of heuristic, simple kind of bolted on planning in language afterwards.

[01:01:25] Sergey Levine: I think that’s a perfectly reasonable hypothesis, for what it’s worth. I think that the part that I might actually take issue with is that that’s actually an easier way to do it. I think it might actually be more complex. It’s just ultimately what we want is something that is–we want simplicity, because simplicity makes it easy to make things work at a large scale. Like, you know, if your method is simple, there’s essentially fewer ways that it could go wrong. So I don’t think the problem with clever prompting is that it’s too simple or primitive. I think the problem might actually be that it might be too complex and that developing a good, effective reinforcement learning or planning method might actually be a simpler or more general solution.

[01:02:03] Josh Albrecht: What do you think of, other types of reinforcement learning setups? Like, I’m not sure if you saw the work by Anthropic maybe earlier this week or very recently. Basically, instead of doing RL with human feedback, they propose doing RL with AI feedback. It’s like, oh, okay, we’ll train this other preference model and then sort of use that to do the feedback loop as a way of sort of automating this and getting human outta the loop as maybe an alternative to offline RL.

[01:02:29] Sergey Levine: Yeah. I like that work very much. I think that the part I might subtly disagree with is that I don’t think it’s an alternative to offline RL. I think it’s actually a very clever way to do offline RL. I like that line of work very much because I think it gets at a similar goal of trying to essentially do planning, an optimization procedure, at training time using what is in effect a model: the language model is being used as a model. And that’s great because then you can get emergent behavior. And I think it’s actually, in my mind, it’s actually more interesting than leveraging human feedback. Because with human feedback you’re essentially relying on human teachers to like hammer this into you, which is pragmatic. Like if you wanna build a company and you really want things to work today, like yeah, it’s great to leverage humans because you can hire lots of humans and get them to hammer your model until it does what you want.

[01:03:10] Sergey Levine: But the prospect of having an autonomous improvement procedure, you know, that’s essentially the dream of reinforcement learning. Like an autonomous improvement procedure where the more compute you throw at it, the better it gets. So yeah, I read that paper. I think it’s great. In terms of technical details, I think a multi-step decision-making process would be better than a single-step decision-making process. But I think a lot of the ideas in terms of leveraging the language models themselves to facilitate that improvement are great. And I think that is actually a reinforcement learning algorithm, an offline reinforcement learning algorithm in disguise, actually a very thin disguise.

[01:03:39] Kanjun Qiu: These language models aside from what we talked about earlier with translating, images into language, can we use the embeddings that are learned or anything like that for robotics type problems?

[01:03:54] Sergey Levine: Yeah, so I think that perhaps one of the most immediate things that we get out of that is a kind of human front end in effect where we can build robotic systems that understand visual-motor control, basically how to manipulate the world and how to change things in the environment. We can build those kinds of systems and then we can hook them up to an interface that humans can talk to by using these visual language models.

[01:04:15] Sergey Levine: So that’s kinda the most obvious, most immediate application. I do think that a really interesting potential is for it to not simply be a front end, but to actually have it be a bidirectional thing where potentially these things can also take knowledge contained in language models and import it into robotic behavior. Because one of the things that language models are very good at is acting as almost like really, really fancy, like, relational databases, like kind of stuff that AI people were doing in the eighties and nineties where you come up with a bunch of logical propositions and you can say, well, like, is A true? And you look up some facts and you figure out, you know, A is like B, et cetera. Language models are great at essentially doing that. So if you want the robot to figure out like, oh, I’m in this building, where do I go if I wanna get a glass of milk, it’s like, well, the milk is probably in the fridge. The fridge is probably in the kitchen. The kitchen is probably down the hallway in the open area because kitchens tend to be near a break area. It’s an office building. Like all this kind of factual stuff about the world, you can probably get a language model to just tell you that. And if you have a vision language model that acts as an interface between the symbolic linguistic world and the physical world, then you can import that knowledge into your robot, essentially, and now for all this factual stuff, it’ll kind of take care of it.

[01:05:25] Kanjun Qiu: Mm-hmm.

[01:05:27] Sergey Levine: It won’t take care of all the low level stuff. It won’t tell the robot how to like move its fingers. The robot still needs us to do that. But it does a great job of taking care of these kind of factual semantic things.

[01:05:36] Kanjun Qiu: Right, right. Mm-hmm. And there’s a bunch of work using these language models for higher-level planning and then telling the instructions to the robot. What do you think about the approach of collecting a lot of robotic data sets and then making a much larger model and then training on this diversity of data sets to kind of “simulate” the generality of something that you would get from one of these large-scale self-supervised models?

[01:06:00] Sergey Levine: That’s a great direction, and I should say that my students and I have been doing a lot of work and a lot of planning on how to build general and reusable robotic control models. So far, one of our results that’s kind of closest to this is a paper by Dhruv Shah called General Navigation Models which deals with a problem of robotic navigation. And what Dhruv did is basically, he went to all of his friends who work on robotic navigation and borrowed their data sets. So we put together a data set with 8 different robots. So it’s not a huge number, it’s only 8, but they really run the gamut all the way from small-scale RC cars. So these are all mobile robots, so a small-scale RC car, something that’s like 10 inches long, all the way to full-scale ATVs. So these are off-road vehicles that are used for research. Like, you can actually sit in it. So there’s a large kind of car and everything in between.

[01:06:47] Sergey Levine: I think there’s a Spot Mini in there. There’s a bunch of other stuff. And he trained a single model that does goal-based navigation just using data from all these robots. The model is not actually told which robot it’s driving. It’s given a little context, so it has a little bit of memory, and basically just by looking at that memory, you can sort of guess roughly what the properties of the robot it’s currently driving are, and the model will actually generalize to drive new robots. So we actually got it, for example, to fly a quadrotor. Now, the quadrotor had to pretend to be a car. So the quadrotor is still controlled only in 2D because there were no flying vehicles in the data set. But it has, you know, a totally different camera. It has this fisheye lens. Obviously it flies, so it wobbles a bit. And the model could just in zero shot immediately fly the quadrotor. In fact, we put that demo together before a deadline. So the model worked on the first try. What took us the most time is figuring out how to replace the battery in the quadrotor, because we hadn’t used it for a year. Once we figured out how to replace the battery, the model could actually figure out how to fly the drone immediately. So I mean navigation obviously is simpler in some ways than robotic manipulation because you’re not making contact with the environment, at least if everything’s going well.

[01:07:52] Sergey Levine: So in that sense, it’s a simpler problem, but it does seem like multi-robot generalization there was very effective for us. And we’re certainly exploring multi-robot generalization for manipulation. Right now, we’re trying to collaborate with a number of other folks that have different kinds of robots. There’s a large data collection effort from Chelsea Finn’s group at Stanford that we’re also partnering up with. So I think we’ll see a lot more of that coming in the future, and I’m really hopeful that a few years from now, the standard way that people approach robotics research will be, just like in vision and NLP, to start with a pre-trained, multi-robot model that has basic capability and really build their stuff on top of that.

[01:08:28] Kanjun Qiu: That’s cool. That’s really interesting. In terms of thinking about the next few years, like let’s say next five years, do you have a sense of what kind of developments you’d be most excited to see that you kind of expect will happen aside from pre-trained models for robotics?

[01:08:42] Sergey Levine: Obviously the pre-trained models one is a very pragmatic thing. That’s something that’s super important. But the thing that I would really hope to see is something that makes lifelong robotic learning really the norm. I think we’ve made a lot of progress on figuring out how to do large-scale imitation learning. We’ve developed good RL methods. We’ve built a lot of building blocks, but to me, the real promise of robotic learning is that you can turn on a robot, leave it alone for a month, come back and suddenly it’s like figured out something amazing that you wouldn’t have thought of yourself. And I think to get there, we really need to get in the mindset of robotic learning being an autonomous, continual, and largely unattended process. If I can get to the point where I can walk into the lab, turn on my robot and come back in a few days, and it’s actually spent the intervening time productively, I would consider that to be a really major success.

[01:09:34] Josh Albrecht: Hmm. How much of that do you think is important to focus on the actual lifetime of the individual robot? Like treating it as an individual versus like, well, it’s just like a data collector for the offline RL data set, and it just sends it up there and like gets whatever coming back down afterwards.

[01:09:49] Sergey Levine: Oh, I think that’s perfectly fine. Yeah. And I think that in reality for any practical deployment of these kinds of ideas at scale, it would actually be many robots all collecting data, sharing it, exchanging their brains over a network and all that. That’s the more scalable way to think about it on the learning side. But I do think that also on the physical side, there’s a lot of practical challenges and just like, you know, what kind of methods should we even have if we want the robot in your home to practice, you know, cleaning your dishes for three days. I mean, like, if you just run a reinforcement learning algorithm for a robot in your home, probably the first thing it’ll do is wave its arm around, break your window, then break all of your dishes, then break itself, and then spend the remaining time it has just sitting there broken in the corner. So there’s a lot of practicalities in this.

[01:10:32] Kanjun Qiu: That’s right. And it won’t go out and buy more dishes, which is what you’d want it to do.

[01:10:38] Josh Albrecht: No, no. I don’t think you’d want that to go outside and buy more dishes. That would go outside, fall down the steps, hurt someone, get in the middle of the road and cause an accident.

[01:10:44] Sergey Levine: In all seriousness, that’s where I think a lot of these challenges are wrapped up because in some ways, all of these difficulties that happen in the real world, they’re also opportunities. Maybe the breaking of the dishes is extreme, but if it like drops something on the ground, well great. Figure out how to pick it up off the ground. If it spills something, great, good time to figure out how to get out the sponge and clean up your spill. Like robots should be able to treat all these unexpected events that happen as new learning opportunities rather than things that just cause ‘em to fail.

[01:11:09] Sergey Levine: And I think that there’s a lot of interesting research wrapped up in that. It’s just hard to attack that research because it always kind of falls in between different disciplines. Like it doesn’t slot neatly into just developing a better RL method or just developing a better controller or something.

[01:11:21] Kanjun Qiu: Hmm. That’s really interesting, huh? That, yeah, it’s kind of like somewhere between continual learning and robotics and some other stuff.

[01:11:31] Josh Albrecht: And it’s all about the messy deployment parts. Like the part about the quadcopter taking longer to replace the battery than to train probably wasn’t even in the paper. It wasn’t even in the appendix.

[01:11:40] Sergey Levine: No, it wasn’t in the appendix. It might be in the undergraduate student’s grad school application essay.

[01:11:47] Kanjun Qiu: Right. Right. Looking into the past, whose work do you feel like has impacted you the most?

[01:11:54] Sergey Levine: That’s an interesting question. There’s some kind of like very standard answers I could give, but I actually think that one body of work that I wanna highlight that maybe not many people are familiar with, that was actually quite influential on me, is work by Emanuel Todorov. So most people know about Professor Todorov from his work on developing the MuJoCo simulator. But before he did that, he actually did a lot of research at the intersections between control theory, reinforcement learning, and neuroscience. And in many ways, the work that he did was quite ahead of its time in terms of combining reinforcement learning ideas with probabilistic inference concepts and controls.

[01:12:34] Sergey Levine: And besides, you know, at some low technical level, a lot of the ideas that I capitalized on in developing new RL algorithms were based on some of these controls-as-inference concepts that his work as well as the work of other people in that area pioneered. But also I think the general approach and philosophy of combining very technical ideas in probabilistic inference, RL, and neuroscience and controls all together like that was something that, I would say, really shaped my approach to research, because essentially I think one of the things that he and others in that kind of neck of the woods did really well is really tear down the boundaries between these things. As an example of something like this, there’s this idea that’s sometimes referred to as Kalman duality, which is basically the concept that a forward-backward message passing algorithm, like what you would use in a hidden Markov model, is more or less the same thing as a control algorithm.

[01:13:26] Sergey Levine: So, inferring the most likely state, you know, given a sequence of observations, kind of looks an awful lot like inferring the optimal action given some reward function, and that could be made into a mathematically precise statement.
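One standard place where the estimation/control duality mentioned above becomes exact is the linear-Gaussian case: the Riccati recursion for a Kalman filter’s state covariance has the same form as the Riccati recursion for a linear-quadratic regulator’s cost-to-go, once the matrices are swapped appropriately. The sketch below checks this numerically; the matrices are random placeholders and the function names are made up for this example, so treat it as an illustration rather than anything taken from the work discussed.

```python
import numpy as np

def kalman_covariance_step(P, A, C, W, V):
    """One step of the Kalman filter Riccati recursion for the predicted state covariance."""
    innovation_cov = C @ P @ C.T + V
    return A @ P @ A.T + W - A @ P @ C.T @ np.linalg.solve(innovation_cov, C @ P @ A.T)

def lqr_cost_step(S, A, B, Q, R):
    """One backward step of the LQR Riccati recursion for the cost-to-go matrix."""
    control_term = B.T @ S @ B + R
    return A.T @ S @ A + Q - A.T @ S @ B @ np.linalg.solve(control_term, B.T @ S @ A)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))   # state dynamics (placeholder values)
C = rng.normal(size=(2, 3))   # observation matrix
W = 0.1 * np.eye(3)           # process noise covariance
V = 0.5 * np.eye(2)           # observation noise covariance

P = np.eye(3)   # filter covariance, propagated forward in time
S = np.eye(3)   # control cost-to-go, propagated backward in time
for _ in range(10):
    P = kalman_covariance_step(P, A, C, W, V)
    # Dual control problem: dynamics A^T, control matrix C^T,
    # state cost W, control cost V.
    S = lqr_cost_step(S, A.T, C.T, W, V)

recursions_match = np.allclose(P, S)
print("filter and control recursions agree:", recursions_match)
```

In words: noise covariances in the estimation problem play the role of costs in the control problem, and running the filter forward in time corresponds to running the controller’s recursion backward.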

[01:13:37] Kanjun Qiu: Mm-hmm.

[01:13:38] Sergey Levine: So it’s not merely interdisciplinary, it’s really tearing down the boundaries between these areas and showing the underlying commonality that emerges when you, basically, reason about sequential processes, and I think that was actually very influential on me in terms of how I thought about the technical concepts in these areas.

[01:13:55] Kanjun Qiu: That’s really interesting. It reminds me of a lot of folks are really interested in, or maybe not a lot, but a few people are very interested in formulating RL, the RL formulation, as kind of a sequence model formulation. And so, it feels like there’s like maybe a similar thing going on here. I’m curious what you think about this formulation.

[01:14:12] Sergey Levine: Yeah, I think to a degree that’s true. So certainly the idea that inference in sequence models looks a lot like control is a very old idea. The reason the Kalman duality is called the Kalman duality is because it actually did show up in Kalman’s original papers. That’s not what most people took away from them. Most people took away that it’s a good way to do state estimation.

[01:14:30] Sergey Levine: And, you know, that was in the age of the space race and people used it for state estimation, for like the Apollo program. But buried in there is the relationship between control and inference in sequence models, that the same way that you would figure out what state you’re in given a sequence of observations could be used to figure out what action to take to achieve some outcome. And yeah, it’s probably fair to say that the relationship between sequence models and control is an extremely old one. And there’s still more to be gained from that connection.

[01:14:55] Kanjun Qiu: Do you feel like you’ve read any papers or work recently that you were really surprised by?

[01:15:02] Sergey Levine: There are a few things… This is maybe a little bit tangential to what we discussed so far, but I have been a bit surprised by some of the investigations into how language models act as few-shot learners. So I worked a lot on meta learning, kind of, I would say at this point, really the previous generation of meta learning algorithms. So kind of the few-shot stuff that was in 2018, 2019. But with language models, there’s a very interesting question as to the degree to which they actually act as meta learners or not. And there’s been somewhat contradictory evidence, like one way or the other.

[01:15:34] Sergey Levine: And some of that was kind of surprising to me. Like, for example, you can take a few-shot prompt and attach incorrect labels to it, and then the model will look at it and then start producing correct labels, which maybe kind of suggests that perhaps it’s not paying attention to the labels, but more to the format of the problem. But of course, all these studies are empirical, and it’s always a question as to whether the next generation of models still exhibits the same behavior or not. So you kinda have to take it with a grain of salt. But I have found some of the conclusions there to be kind of surprising: that maybe these things aren’t really meta learners. Rather, they’re just getting, like, a format specification out of the problems.
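
To make the experiment described above concrete, here is a minimal sketch of how one might build a clean few-shot prompt and a corrupted-label variant of the same prompt. Everything here is hypothetical and for illustration only: the `build_few_shot_prompt` helper, the "Review / Sentiment" template, and the example data are assumptions, not the setup used in any particular study.

```python
import random


def build_few_shot_prompt(examples, query, label_set, corrupt_labels=False, seed=0):
    """Format (text, label) demonstrations plus an unlabeled query as one prompt.

    If corrupt_labels is True, each demonstration label is swapped for a
    different label from label_set, so the format stays intact while the
    input-to-label mapping becomes wrong.
    """
    rng = random.Random(seed)
    blocks = []
    for text, label in examples:
        if corrupt_labels:
            # Pick any label other than the true one.
            label = rng.choice([l for l in label_set if l != label])
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The query follows the same template but leaves the label blank.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


examples = [("Great movie", "positive"), ("Terrible plot", "negative")]
labels = ["positive", "negative"]
clean_prompt = build_few_shot_prompt(examples, "Loved the soundtrack", labels)
corrupted_prompt = build_few_shot_prompt(
    examples, "Loved the soundtrack", labels, corrupt_labels=True
)
```

Comparing a model's answers on `clean_prompt` and `corrupted_prompt` is one way to probe whether it relies on the demonstrated labels or mostly on the format.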

[01:16:07] Kanjun Qiu: Yeah, they’re like really, really, really good pattern matchers. Interesting. Also, as they get bigger, some people say they take less data to fine-tune, and so maybe they’re doing some kind of few-shot learning during training as well.

[01:16:21] Sergey Levine: There’s an interesting tension there, because I think in the end you would really like the ideal meta learning method to be something that can get a little bit of data for a new problem, use that to solve that problem, but also use it to improve the model. And that’s something that’s always been a little tough with meta learning algorithms, because typically the process of adapting to a new problem is very, very separate from the process of training the model itself. Certainly that’s true in the classic way of using language models with prompts as well.

[01:16:45] Sergey Levine: But it’s very appealing to have a model that can fine-tune on small amounts of data, because then the process of adapting to a task is the same as the process of improving the model. And the model actually gets better with every task. So you could imagine, for example, that the logical conclusion of this kind of stuff is a kind of lifelong online meta learning procedure where, for every new task that you’re exposed to, you can adapt to it more quickly and you can use it to improve your model so that it can adapt to the next task even more quickly. So I think that, in the world of meta learning, that’s actually kind of an important open problem: how to move towards lifelong and online meta learning procedures that really do get better at both the meta and the low level. And it’s not actually obvious how to do that, or whether the advent of large language models makes that easier or harder. It’s an important problem.

[01:17:27] Kanjun Qiu: What do you feel like are some underrated or overlooked approaches that you don’t see many people looking at today, that aren’t very popular, but that you think might be important?

[01:17:38] Sergey Levine: One thing that comes to mind, I don’t know how much this counts as overlooked or underrated, but I do think that to some degree model-based RL is a little bit underutilized because, well, it sort of makes sense that if we’ve seen big advances in generative models, then more explicit model-based RL techniques perhaps can do better than they do now. And it may also be that there’s room for very effective methods to be developed that hybridize model-based and model-free RL in interesting ways, that can do a lot better than either one individually, perhaps by leveraging the latest ideas from building very effective generative models.

[01:18:14] Sergey Levine: Just as one point about what these things could look like: model-based reinforcement learning at its core uses some mechanism that predicts the future. But typically we think of predicting the future kind of the way we think about movies and videos. Like, you predict the world one frame at a time. But there isn’t really any reason to think about it that way. All you really need to predict is what will happen in the future if you do something. That doesn’t have to be one time step or one frame at a time. It could be that you predict something that will happen at some future point. Maybe you don’t even need to know which future point in particular, like soon or not so soon, right?

[01:18:42] Sergey Levine: And it may be that this kind of more flexible way of looking at prediction could provide for models that are easier to train, that maybe leverage ideas from current generative models, and that are sufficient to do control, to do decision making, but not as complicated as, like, full-on frame-by-frame, pixel-by-pixel prediction of everything that your robot will see for the next hour.

[01:19:02] Josh Albrecht: Yeah. Why do you think that we haven’t seen more advances there on model-based reinforcement learning, given the success of these large generative models? I mean, people have been making large generative models really good for more than a few years now, but I feel like we haven’t really seen them applied in the RL setting directly.

[01:19:20] Sergey Levine: Well, there is a big challenge there. The challenge is that prediction is actually often much harder than generation. One way you can think about it is: if your task is to generate, let’s say, a picture of an open door, you can draw any door you want. You know, it can be any color as long as it’s open. But if your goal is to predict this particular door in my office, what it would look like if I were to open it, now you really have to get all the other details right.

[01:19:45] Kanjun Qiu: Mm-hmm.

[01:19:46] Sergey Levine: And you really have to get them right if you want to use that for control, because you want to get the system to figure out what thing in the scene needs to actually change. So if you messed up a bunch of other parts, or it’s not open the same way that this particular door opens, that’s actually much less useful to you. So prediction can be a lot harder than generation because with just straight-up generation, you kinda have a lot of freedom to fudge a lot of the details. When you get the freedom to fudge the details, you can basically do the easiest thing you know how to do for all the stuff except for the main subject of the picture.

[01:20:12] Kanjun Qiu: Mm-hmm. Yeah. It’s like once you have to do prediction, you need consistency, you need it over long time horizons. There are all of these other things to work on. Why do you think we still see a lot of model-based RL that does these kinds of frame-by-frame rollouts versus predicting a point in the future or something like that?

[01:20:31] Josh Albrecht: Or also versus predicting some aspects of the future, as you were mentioning before, right? Like maybe this thing will happen or maybe this attribute will change, or maybe I expect this particular piece of the future.

[01:20:40] Sergey Levine: Well, I do think the decomposition into a predictive model and a planning method is very clean. So it’s very tempting to say, well, we know how to run RL against a simulator, so as long as we get a model that basically acts as a slot-in replacement for a simulator, then we know exactly how to use it. So it’s a very clean and tempting decomposition. And part of why I think we should think about breaking that decomposition is because this notion of a very clean decomposition makes me harken back to the end-to-end stuff. Like, you know, in robotics we used to have another very clean decomposition, which is the decomposition between estimation and control. It used to be that perception and control were kept very separate because it’s such a clean decomposition, and maybe here too, prediction and planning are kept very separate because it’s such a clean decomposition. But just because it’s clean doesn’t mean it’s right. That’s a notion that we ought to challenge.

[01:21:24] Kanjun Qiu: I see. And it kind of just feels like it hasn’t been challenged so extremely so far.

[01:21:44] Josh Albrecht: One question, just going back to the importance of making robots that don’t smash all your dishes and smash all the windows and everything like that, which does seem like a very useful thing for people to be working on and does seem a little bit underserved by existing incentives. Like, do you have any ideas how to fix that? Is it a new conference? Is it a new way of judging papers? Is it just people being open about the importance of this problem? Like, how do we actually make progress on that? Besides industry–industry can certainly make progress, but in academia…

[01:21:57] Sergey Levine: It’s something that I think about a lot. I think one great way to approach that problem is to actually like set your goal to build a robot that has some kind of existence, that has some kind of life of its own. I spent part of my time hanging out with robotics at Google, the Google Brain Robotics Research Lab. And there, I think we’ve actually done a pretty good job of this where, if you walk into our office, we’ll get a Googler to escort you, obviously, like, you know, don’t break into our office. But if you walk into our office legally, you will see robots just driving around and you walk into the micro kitchen there where people will go in and get their snacks and you might be standing in line behind a robot that’s getting a snack.

[01:22:31] Sergey Levine: And people have gotten into this habit of like, well, the robotics experiment is continual, it’s ongoing and it’ll live in the world that you live in, and you better deal with it. And you deal with that as a researcher, and that actually like gets you into this mindset where things do need to be more robust and they need to be more configured in such a way that they support this continual process and they don’t break the dishes. On the technical side, there’s still a lot to do there, but just getting to that mode of thinking about the research process, I think helps a ton. And we’re starting to move in that direction here at UC Berkeley too. We’ve got our little mobile robot roving around the building on a regular basis.

[01:23:04] Sergey Levine: We’ve got our robotic arm in the corner constantly trying to pick up objects. And I think once you start doing research that way, now it becomes much more natural to be thinking about these kinds of challenges.

[01:23:13] Kanjun Qiu: It’s another example of breaking down a barrier. In this case, it’s between the experimental environment and your real-life environment. Do you feel like there was a work of yours that was most overlooked?

[01:23:22] Sergey Levine: I think every researcher thinks that all work of theirs has been most overlooked, but one thing maybe I could talk about a little bit is some work that two of my postdocs, Nick Rhinehart and Glen Berseth, did recently with me and a number of other collaborators, studying intrinsic motivation from a very different perspective. So intrinsic motivation in reinforcement learning is often thought of as the problem of seeking out novelty in the absence of supervision. People formulate it in different ways: you know, find something that’s surprising, find something that your model doesn’t fit, et cetera. Nick and Glen took a very different approach to it that was actually inspired by some neuroscience and cognitive science work from a gentleman named Karl Friston from the UK. There’s this idea that perhaps intrinsic motivation can actually be driven by the opposite objective.

[01:24:08] Sergey Levine: The objective of minimizing surprise. And the intuition for why this might be true is that if you imagine kind of a very ecological view of intelligence, let’s say you’re a creature in the jungle and you’re hanging out there and you wanna survive. Well, maybe you actually don’t wanna find surprising things. Like, you know, a tiger eating you would be very surprising, and you would rather that not happen.

[01:24:25] Sergey Levine: So you’d rather like kind of find your niche, hang out there and be safe and comfortable. And that actually requires minimizing surprise. But minimizing surprise might require taking some kind of coordinated action. So you might think, well it might rain tomorrow and then I’ll get wet. And that kind of kicks me out of my comfortable niche. So maybe I’ll actually go on a little adventure and find some materials to build shelter, which, you know, that might be a very uncomfortable thing to do. It might be very surprising. But once I’ve built that shelter, now I’ll put myself in a more stable niche where I’m less likely to get surprised by something.

[01:24:54] Sergey Levine: So perhaps, paradoxically, minimizing surprise might actually lead to some behavior that looks like curiosity or novelty seeking, in service of getting yourself to be more comfortable later. It’s a very kind of strange idea in some ways, but perhaps a really powerful one if we want to situate agents in open-world settings where we want them to explore without human supervision, but at the same time not get distracted by the million different things that could happen. Like, you know, they should explore, but they should explore in a way that kind of gets them to be more capable, that sort of accumulates capabilities and things like that, accumulates some ability to affect their world.

[01:25:27] Sergey Levine: So we had several papers that studied this: one called SMiRL, Surprise Minimizing Reinforcement Learning, and another one called IC2, which is Information Capture for Intrinsic Control. Both of these papers looked at how minimizing novelty can actually lead to emergent behavior, either by minimizing the entropy of your own beliefs, meaning you manipulate the world so that you’re more certain about how the world works, or by simply minimizing the entropy of your state, meaning you manipulate the world so that you occupy a narrow range of states. And this was like very experimental, preliminary, half-baked kind of stuff. But I think that’s maybe a direction that has some interesting implications in the future.
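
As a rough illustration of the "minimize the entropy of your state" idea described above, here is a minimal sketch in which the agent keeps a running density model of the states it has visited and is rewarded with the log-likelihood of each new state under that model, so familiar states score high and surprising states score low. This is a simplified sketch under stated assumptions, not the implementation from the papers: the `GaussianStateModel` class, the diagonal-Gaussian choice, the `min_var` floor, and the `surprise_minimizing_reward` helper are all hypothetical names introduced here.

```python
import math


class GaussianStateModel:
    """Running diagonal-Gaussian density over visited states (illustrative only)."""

    def __init__(self, dim, min_var=1e-2):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations (Welford's method)
        self.min_var = min_var  # variance floor to keep log-probabilities finite

    def update(self, state):
        """Fold one observed state into the running mean and variance."""
        self.n += 1
        for i, x in enumerate(state):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def log_prob(self, state):
        """Log-density of a state under the current model (unit variance before any data)."""
        total = 0.0
        for i, x in enumerate(state):
            var = max(self.m2[i] / self.n, self.min_var) if self.n > 0 else 1.0
            total += -0.5 * (math.log(2 * math.pi * var) + (x - self.mean[i]) ** 2 / var)
        return total


def surprise_minimizing_reward(model, state):
    """Score the new state under the model of past states, then update the model."""
    reward = model.log_prob(state)
    model.update(state)
    return reward


model = GaussianStateModel(dim=2)
for _ in range(10):
    model.update([0.0, 0.0])
familiar = model.log_prob([0.0, 0.0])
novel = model.log_prob([5.0, 5.0])
```

In this sketch `familiar` is larger than `novel`, so an RL agent maximizing this reward would be pushed toward keeping its state distribution narrow and predictable, which is the sense of "occupying a narrow range of states" in the discussion.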

[01:26:05] Kanjun Qiu: That’s really interesting. That’s a very unusual formulation. What controversial or unusual research opinions do you feel like you have that other people don’t seem to agree with?

[01:26:15] Sergey Levine: I have quite a few, although I should say that I do tend to be open-minded and pragmatic about these things, so I’m more than happy to work with people even on projects that might invalidate some of these opinions. But some of the things that I think many people don’t entirely agree with: for one thing, there’s a lot of activity in robotic learning around using simulation to learn policies for real-world robots. And I think that’s very pragmatic. I think if I were to start a company today, that’s an approach that I might explore. The controversial part is I think in the long run we’re not gonna do that. And the reason that I think we’re not gonna do that in the long run is that ultimately it’ll be much easier to use data rather than simulation to enable robots to do things.

[01:26:59] Sergey Levine: And I think that’ll be true for several reasons. One of the reasons is that once we get the robots out there, data is much more available and there’s a lot less reason to use simulation. So if you’re in the Tesla regime, if you have, you know, a million robots out there, now suddenly simulation doesn’t look as appealing because, hey, getting lots of data is easy. The other reason is that I think the places where we’ll really want learning to attain really superhuman performance will be ones where the robot will need to figure things out in sort of tight coupling with the world. So if we understand something well enough to simulate it really accurately, maybe that’s actually not the place where we most need learning. The other reason is that, well, if you look at other domains, if you look at NLP or computer vision, I mean, nobody in NLP thinks about coding up a simulator to simulate how people produce language. Like, that sounds ridiculous. Using data is the way to go. I mean, you might use synthetic data from a language model, but you’re not gonna write a computer program that simulates how human fingers and vocal cords work and create text, you know, type on keyboards or emit sounds. Like, that just sounds crazy. You’d use data. In computer vision maybe there’s a little bit more simulation, but still, using real images is just so much easier than generating synthetic images. Some people do work on synthetic images, but the data-driven paradigm is so powerful and relatively easy to use that most people just do that. And I think that we’ll get to that point in robotics too.

[01:28:13] Sergey Levine: Another one that I might mention, well, this is maybe coming back to something that we discussed already, but I think there’s a lot of activity in robotics and also in other areas around using essentially imitation-learning-style approaches. So get humans to perform some tasks, maybe robotic tasks, or maybe not, maybe they’re booking flights on the internet or something. Whatever task you wanna do, get humans to generate lots of data for it, and then basically do a really good job of emulating that behavior. And again, I think this is one of those things that I would put into the category of very pragmatic approaches that would be very good to leverage if you’re starting a company right now.

[01:28:46] Sergey Levine: But if you want to really get general-purpose, highly effective AI systems, I think we really need to go beyond that. And there’s a really cute quote that my former postdoc Glen posted on Twitter after a recent conference. He said something like, oh, I saw a lot of papers on imitation learning, but perhaps, harkening back to an earlier quote by Rodney Brooks, imitation learning is doomed to succeed. So Rodney Brooks had a quote years ago where he said simulation is doomed to succeed. What he meant by that is that when people do robotics research in simulation, it always works. It always succeeds, but then it’s hard to make that same thing work in the real world. And I think Glen’s point was that with imitation learning, it’s easy to get it to work, but then you kinda hit a wall, where it’s really good for the thing that imitation learning is good for. So it looks deceptively effective. But then if you wanna go beyond that, if you really wanna do something that people are not good at, then you just hit a wall. And I think that’s a really big deal. I think in robotics and in other areas where we want rational, intelligent decision making, we should really be thinking hard about planning, reinforcement learning, things that go beyond just copying humans.

[01:29:46] Kanjun Qiu: Yeah. That’s really interesting. I love this imitation is doomed to succeed.

[01:29:51] Sergey Levine: The third one, and maybe this is the last one that’s big enough to be interesting: to be honest, I’m actually very skeptical about the utility of language in the long run as a driving force for artificial intelligence. I think that language is very, very useful right now. I think there’s kind of a very cognitive science view of language where it says, well, people think in symbolic terms, and language is sort of our expression of those symbolic concepts, and therefore language is a fundamental substrate of thought. I think that’s a very reasonable idea. What I’m skeptical about is the degree to which that is really a prerequisite for intelligence, because there are a lot of animals that are much more intelligent than our robots that do not possess language. They might possess some kind of symbolic, rational thought, but they certainly don’t speak to us.

[01:30:37] Sergey Levine: They certainly don’t express their thoughts in language. And because of that, my suspicion is actually that the success of things like language models has less to do with the fact that it’s language and more to do with the fact that we’ve got an internet full of language data. And that perhaps it’s really not so much about language.

[01:30:54] Sergey Levine: It’s really about the fact that there is this structured repository that happens to be written in language, and that perhaps in the long run we’ll figure out how to do all the wonderful things that we do with language models, but without the language, using, for example, sensorimotor streams, videos, whatever. And we’ll get that generality and we’ll get that power. And it’ll come more from understanding the physical and visual concepts in the world, rather than necessarily parsing words in English or something of the like.

[01:31:20] Kanjun Qiu: Earlier, we talked about hitting walls, methods that hit walls. Do you think that the language-based method, when we think about an artificial general intelligence, would at some point hit a wall?

[01:31:30] Sergey Levine: Oh, absolutely. I do think, though, that we should be a little careful with that, because language models hit walls, but you can build ladders over those walls using other mechanisms. Certainly, in recent robotics research, including research that the team I work with at Google has done as well as many others, we’ve seen a lot of really excellent innovations from people where they use visual or visuomotor models that understand action and understand images to bridge the gap between the symbolic world of language models and the physical world. And I think that we’ve come a long way in doing that, but I do think that purely language-based systems by themselves do have a major limitation in terms of the inability to really ground things out in the lowest level of perception and action.

[01:32:16] Sergey Levine: And that’s very problematic, because actually the reason that we don’t have a lot of text on the internet saying, oh, if you wanna throw a football, then you should fire this neuron and actuate this muscle and so on, is that it’s so easy for us. We don’t put that in text because it’s so easy for us, but that doesn’t mean that it’s easy for our machines. The thing where the gap between human capability and machine capability is largest is exactly the thing that we’re not gonna express in language.

[01:32:40] Kanjun Qiu: Mm. So basically the way in which the internet data set is skewed is that all of the easy stuff is not on there. And so it doesn’t get that.

[01:32:49] Sergey Levine: Yeah.

[01:32:49] Kanjun Qiu: That’s interesting. What do you think about the idea that we might get an AGI that is able to solve all digital tasks on your computer? Like do everything digitally that a human can do, but we’ll still be many, many, many years away?

[01:33:02] Sergey Levine: Well, maybe there’s something comforting about that because then it can’t like go out into the world and start doing things that are too nefarious. But I think that kind of stuff is possible. In research, I do tend to be a little bit of an optimist, and I do think that we can figure out many of the nitty gritty, physical, robotic things.

[01:33:16] Sergey Levine: I’m not sure how long that’ll take exactly. But I’m also kind of hopeful that if we figure them out, we’ll actually get a better solution for some of the symbolic things. Like, you know, if your model understands how the physical world works, you can probably do a better job in the digital world because the digital world influences the physical world and a lot of the most important things there really do have a physical kind of connection. So maybe it’s actually gonna go the other way that figuring out the physical stuff will lead to better understanding of how to manipulate language.

[01:33:40] Kanjun Qiu: Yeah. Totally agree. Thank you so much. This was super fun, and we really enjoyed the conversation. Yeah, thanks a bunch.

[01:33:47] Sergey Levine: Yeah. Thank you very much.

Thanks to Tessa Hall for editing the podcast.