Meta’s AI Models Are Trained By Watching Video Footage
Yann LeCun, Meta’s chief AI scientist, sees promise in the V-JEPA model, suggesting it could be a step toward artificial general intelligence.

Meta’s AI researchers have unveiled a novel model that diverges from the traditional methods of training large language models (LLMs). Instead of relying on written text, this new model learns from video footage, marking a significant departure in AI development.

Typically, LLMs are trained on vast datasets of sentences or phrases in which certain words are masked, forcing the model to fill in the blanks. Through this process, they gain a basic understanding of the world. Yann LeCun, the head of Meta’s FAIR (Fundamental AI Research) group, believes AI models could learn more efficiently by applying a similar masking technique to video content.
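To make the masking idea concrete, here is a minimal, hypothetical sketch (not Meta’s code) of the fill-in-the-blank objective used to pretrain text models: random tokens are hidden, and the model is scored only on how well it recovers them.

```python
# Toy sketch of masked-token pretraining (illustrative assumptions, not Meta's code).
import torch
import torch.nn as nn

VOCAB, DIM, MASK_ID = 1000, 64, 0          # toy vocabulary; id 0 reserved as the [MASK] token
tokens = torch.randint(1, VOCAB, (8, 32))  # a batch of 8 sequences, 32 tokens each

# Randomly hide ~15% of the tokens; the model must reconstruct them.
mask = torch.rand(tokens.shape) < 0.15
inputs = tokens.masked_fill(mask, MASK_ID)

embed = nn.Embedding(VOCAB, DIM)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=2
)
head = nn.Linear(DIM, VOCAB)               # predicts a distribution over the vocabulary

logits = head(encoder(embed(inputs)))
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # only masked positions count
print(f"masked-prediction loss: {loss.item():.3f}")
```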

LeCun articulated the ambition behind this endeavor, stating, “Our goal is to build advanced machine intelligence that can learn more like humans do, forming internal models of the world around them to learn, adapt, and forge plans efficiently in the service of completing complex tasks.”

At the core of LeCun’s vision is a research model named Video Joint Embedding Predictive Architecture (V-JEPA). It analyzes unlabeled video in which segments have been masked out and predicts what probably happened during the hidden intervals.

It’s important to note that V-JEPA isn’t a generative model; rather, it builds an internal conceptual model of the world. Meta’s researchers say that after pretraining with video masking, V-JEPA excels at detecting and understanding detailed interactions between objects.
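A rough way to picture the difference from a generative model is the following hypothetical sketch (assumptions for illustration, not Meta’s released implementation): rather than regenerating the masked pixels, a predictor is trained to match the embeddings that a separate target encoder produces for the hidden video patches.

```python
# Toy JEPA-style objective on video (illustrative assumptions, not Meta's code):
# predict the *representations* of masked space-time patches, not the pixels.
import torch
import torch.nn as nn

PATCHES, DIM = 196, 128                       # toy number of space-time patches per clip
clip = torch.randn(4, PATCHES, DIM)           # pretend patchified video features (batch of 4)

mask = torch.rand(4, PATCHES) < 0.5           # hide half of the patches

context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=2
)
target_encoder = nn.TransformerEncoder(       # in practice typically a momentum copy of the context encoder
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=2
)
predictor = nn.Linear(DIM, DIM)

visible = clip.masked_fill(mask.unsqueeze(-1), 0.0)   # crude stand-in for removing masked patches
pred = predictor(context_encoder(visible))            # predict representations from visible context
with torch.no_grad():
    target = target_encoder(clip)                     # targets come from the full, unmasked clip

loss = nn.functional.mse_loss(pred[mask], target[mask])  # compare only at the hidden patches
print(f"embedding-prediction loss: {loss.item():.3f}")
```

Predicting in embedding space rather than pixel space is the design choice LeCun emphasizes: the model is not forced to reproduce every unpredictable visual detail, only the abstract features needed to anticipate what happens in the hidden region.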

The implications of this research extend beyond Meta, potentially reshaping the broader AI landscape. Meta has previously discussed the concept of a “world model” in the context of augmented reality glasses, envisioning an AI assistant that anticipates user needs and preferences based on an audio-visual understanding of the surroundings.

Moreover, V-JEPA could change how AI models are trained. Pretraining foundation models currently demands substantial time and compute, which often restricts the work to large organizations. More efficient training techniques would lower that barrier to entry, in line with Meta’s practice of releasing its research as open source.

LeCun argues that LLMs’ inability to learn from visual and auditory input is holding back progress toward artificial general intelligence.

Meta’s next phase involves adding audio to the video, giving the model an additional sensory channel, much as a child watching television learns from both sight and sound.

Meta intends to release the V-JEPA model under a Creative Commons noncommercial license, fostering collaboration and further exploration of its capabilities by researchers.
