
RSS · Spotify · Apple Podcasts · Pocket Casts
Vincent Sitzmann is a postdoc at MIT. His work is on neural scene representations in computer vision. Ultimately, he wants to build representations that AI agents can use to solve the visual tasks humans solve routinely but that are currently impossible for AI. His most recent paper at NeurIPS presents SIREN, or Sinusoidal Representation Networks, an MLP architecture that uses the periodic sine function as its non-linearity. It targets a weakness of current networks, which struggle to model signals with fine detail and their higher-order derivatives.
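To make the idea of a sine non-linearity concrete, here is a minimal pure-Python sketch of a SIREN-style forward pass. It is an illustration, not the paper's implementation: the function names `init_siren` and `siren_forward` are hypothetical, the frequency scale `omega_0 = 30` and the uniform weight bounds follow the scheme described in the SIREN paper as we understand it, and the zero biases are a simplifying assumption.

```python
import math
import random


def init_siren(layer_sizes, omega_0=30.0, seed=0):
    """Build (weights, biases) for each layer with SIREN-style uniform bounds."""
    rng = random.Random(seed)
    layers = []
    for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        # First layer: U(-1/n_in, 1/n_in).
        # Later layers: U(-sqrt(6/n_in)/omega_0, sqrt(6/n_in)/omega_0).
        bound = 1.0 / n_in if i == 0 else math.sqrt(6.0 / n_in) / omega_0
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out  # assumption: zero biases, for simplicity
        layers.append((weights, biases))
    return layers


def siren_forward(layers, coords, omega_0=30.0):
    """Map an input coordinate to an output; hidden layers use sin(omega_0 * z)."""
    h = list(coords)
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        is_last = i == len(layers) - 1
        h = z if is_last else [math.sin(omega_0 * v) for v in z]
    return h


# Example: a 2D pixel coordinate mapped to a 3-channel (e.g. RGB) value.
layers = init_siren([2, 16, 16, 3])
rgb = siren_forward(layers, [0.1, -0.4])
```

Because the derivative of a sine is a shifted sine, every derivative of such a network is again a smooth, periodic function of the same kind, which is the property the paper relies on for fitting higher-order derivatives.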


Highlights from our conversation:
👁 “Vision is about the question of building representations”
🧠 “We (humans) likely have a 3D inductive bias”
🤖 “All computer vision should be 3D computer vision. Our world is a 3D world.”
Below are the show notes and full transcript. As always, please feel free to reach out with feedback, ideas, and questions!
[05:52] Vincent’s research:
“Vision is fundamentally about the question of building representations. That’s what my research is about. I think that many tasks that are currently not framed in terms of operating on persistent representations of the world would really profit from being framed in these terms.”
[08:36] Vincent’s opinion on how the brain makes visual representations:
“I think it is likely that we have a 3D inductive bias. It’s likely that we have structure that makes it such that all visual observations that we capture are explained in terms of a 3D representation. I think that is highly likely. It’s not entirely clear how to think about that representation because clearly it’s not like a computer graphics representation…”
“Our brain, most likely, also doesn’t have just a single representation of our environment, but there might very well be several representations. Some of them might be task-specific. Some of them might be in a hierarchy of increasing complexity or increasing abstractness.”
[15:32] Why neural implicit representations are so exciting:
“My personal fascination is certainly in this realm of neural implicit representations. It’s a very general representation in many ways. It’s actually very intuitive. And in many ways it’s basically the most general 3D representation you can think of. The basic idea is to say that a 3D scene is basically a continuous function that maps an XYZ coordinate to whatever is at that XYZ coordinate. And so it turns out that any representation from computer science is an instantiation of that basic function.”
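A small sketch can make the "scene as a function" idea tangible. Here the scene is a hand-written signed distance function for a sphere rather than a neural network, and a voxel grid is recovered simply by sampling that function on a regular lattice. The names `sphere_scene` and `voxelize` and the default parameters are hypothetical choices for illustration.

```python
import math


def sphere_scene(x, y, z, radius=1.0):
    """Continuous scene function: signed distance to a sphere at the origin.

    Negative inside the sphere, zero on the surface, positive outside.
    """
    return math.sqrt(x * x + y * y + z * z) - radius


def voxelize(scene, resolution, extent=1.5):
    """Sample a continuous scene function on a regular grid.

    Returns a dict mapping (i, j, k) indices to True where the point is
    inside or on the surface. The grid spans [-extent, extent] per axis.
    """
    step = 2.0 * extent / (resolution - 1)
    grid = {}
    for i in range(resolution):
        for j in range(resolution):
            for k in range(resolution):
                point = (-extent + i * step,
                         -extent + j * step,
                         -extent + k * step)
                grid[(i, j, k)] = scene(*point) <= 0.0
    return grid


# The same scene function can be queried at any coordinate...
surface_value = sphere_scene(1.0, 0.0, 0.0)
# ...or discretized into a classic voxel occupancy grid.
occupancy = voxelize(sphere_scene, resolution=5)
```

In this view a voxel grid is just one way of storing samples of the underlying function. A neural implicit representation replaces the hand-written `sphere_scene` with a network that is fitted to observations.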
[21:17] One big challenge for implicit neural representations, compositionality:
“Right now, that is not something we have addressed in a satisfying manner. If you have a model that is hierarchical, then inferring these hierarchies becomes much harder. How do you do that then? Versus, if you only have a flat hierarchy of objects, where you could say every object is separate, then it’s much easier to infer which object is which, but then you are failing to model this fractal aspect that you talk about.”
[26:08] The binding problem in computer vision:
“Assuming that you have something like a hierarchy of symbols or a hierarchy of concepts, given a visual observation, how do you bind that visual observation to one of these concepts? That’s referred to as the binding problem.”
[31:52] What Vincent showed in his semantic segmentation paper:
“We show that you can use these features that are inferred by scene representation networks to render colors, to infer geometry, but we show that you can also use these features for very sparsely supervised semantic segmentation of these objects that you’re inferring representations of.”
[56:13] Gradient descent is an encoder:
“Fundamentally, gradient descent is also an encoder on function spaces. If you think about the input-output behavior, if you say that your implicit representation is a neural network, then you give it a set of observations, you run gradient descent to fit these observations, and out comes your representation of these observations. So gradient descent is also nothing else but an encoder.”
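The "gradient descent as an encoder" view can be sketched in a few lines. To keep the example short, the model here is not a neural network but a weighted sum of three fixed sine features, which is a deliberate simplification. The names `FREQS`, `predict`, and `encode`, the step count, and the learning rate are all hypothetical choices.

```python
import math

FREQS = [1.0, 2.0, 3.0]  # assumption: fixed sine features, not learned


def predict(weights, x):
    """Evaluate the represented function at coordinate x."""
    return sum(w * math.sin(f * x) for w, f in zip(weights, FREQS))


def encode(observations, steps=500, lr=0.1):
    """Encoder: observations in, fitted weights (the 'code') out.

    Runs plain gradient descent on mean squared error from a fixed start.
    """
    weights = [0.0] * len(FREQS)
    n = len(observations)
    for _ in range(steps):
        grads = [0.0] * len(FREQS)
        for x, y in observations:
            err = predict(weights, x) - y
            for k, f in enumerate(FREQS):
                grads[k] += 2.0 * err * math.sin(f * x) / n
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Observations of a signal: 20 evenly spaced samples over one period.
xs = [2.0 * math.pi * i / 20 for i in range(20)]
observations = [(x, 0.5 * math.sin(x) - 0.25 * math.sin(3.0 * x)) for x in xs]

code = encode(observations)  # the representation of these observations
```

Seen from the outside, `encode` behaves like any other encoder: a set of observations goes in and a compact representation comes out. The optimization loop is the only thing that does the encoding.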
Thanks to Luke Cheng for writing drafts of this post and Tessa Hall for editing the podcast.