Jim Fan, NVIDIA: On foundation models for embodied agents, scaling data, and why prompt engineering will become irrelevant

Jim Fan is a research scientist at NVIDIA and got his PhD at Stanford under Fei-Fei Li. Jim is interested in building generally capable autonomous agents, and he recently published MineDojo, a massively multitask benchmarking suite built on Minecraft, which was an Outstanding Paper at NeurIPS. In this episode, we discuss foundation models for embodied agents, scaling data, and why prompt engineering will become irrelevant.

Below are some highlights from our conversation as well as links to the papers, people, and groups referenced in the episode.

Some highlights from our conversation

“The second implication of RLHF is that prompt engineering will go away eventually. Like, it is something fleeting, and the prompt engineers… it’s just not a real job. Let’s face it. The reason prompt engineering will not be relevant forever is because RLHF – why prompt engineering even exists in the first place – is because these systems are misaligned with what humans want, so we have to kind of coerce the model to give us what we want by typing out very unnatural sentences, and to essentially trick the model into solving the task.”

“I’m still really amazed by how humans do this task. Because we’re doing the lowest level of control, right? Like, we do the keyboard and mouse controls. And if we want to be, like, stricter about the concepts, we’re sending neural signals to our fingers and then controlling the finger torques, the torques in each joint, to operate a keyboard and also using a mouse. It’s incredible how low level we are going, as humans, to do World of Bits, and we seem to have very little problem with our computational efficiency, but I guess procrastination is our unique problem. So that is our unique problem. But otherwise, we’re computationally efficient. We’re very efficient. So I’m just wondering, like, maybe there’s a way to actually make the lowest level, the most general action space, computationally attractive and even, like, more efficient than we thought it would be.”

“When I was starting to play Minecraft, I watched YouTube videos. I also went to Wiki to look up what to do in my first and, and Wiki tells you that, ‘Okay, these are the tools that you must craft and you need to, like, prepare food, otherwise you will starve, and what kind of foods are good, right?’ It’s all in, in the Wiki, and I also go to Reddit whenever I have a question. I treat that as a Stack Overflow, and Reddit people give a lot of good advice. That’s how I played Minecraft even as a human. That gets me thinking, right, like why shouldn’t our AI use all of this internet-scale knowledge? And if we want our AI algorithm to play this from scratch, it’s almost impossible because exploration is intractable. If you just take random actions, kind of how big is the chance that you stumble upon a diamond–it’s almost literally zero, right? So that also inspired the algorithm approach that we did.”

“What we want is to develop – or maybe discover, right – like, general principles of embodied intelligence. That’s what we wanna do. That’s what MineDojo and Avalon want to achieve, want to enable, right? Not just kind of solving these particular 1000 tasks in the, kind of, the most brute force way. So, yeah, just a word of caution to researchers: resist the urge to overfit, to cheat, to use things that are super specific to Minecraft that will not transfer elsewhere.”

Referenced in this podcast

Thanks to Tessa Hall for editing the podcast.