
ChatGPT can now see, hear, and speak

ChatGPT is now equipped with enhanced capabilities, including the ability to see, hear, and speak. These new features offer a more intuitive way to interact with ChatGPT, enabling voice conversations and image sharing for a variety of purposes.

Voice and image capabilities expand the versatility of ChatGPT in users’ daily lives. For instance, you can snap a photo of a landmark while traveling and have a live conversation about what makes it interesting. At home, you can take pictures of your refrigerator and pantry to help plan your meals, then ask follow-up questions for step-by-step recipes. You can even help your child with homework by photographing a math problem, circling the part in question, and having ChatGPT share hints with both of you.

These voice and image features are gradually rolling out to Plus and Enterprise users over the next two weeks. Voice functionality will be available on both iOS and Android devices (opt-in through your settings), while image capabilities will be accessible across all platforms.

Conversational Interaction with ChatGPT

You can now engage in dynamic back-and-forth conversations with ChatGPT using your voice. This feature allows you to converse with your assistant on the go, request bedtime stories for your family, or resolve dinner table debates. To get started with voice conversations, navigate to Settings > New Features on the mobile app and opt-in to voice interactions. Then, tap the headphone icon in the top-right corner of the home screen to select your preferred voice from five available options.

The new voice feature is powered by an advanced text-to-speech model capable of generating human-like audio from just text and a short sample of speech. Each of the available voices was crafted in collaboration with professional voice actors. In the other direction, Whisper, OpenAI’s open-source speech recognition system, transcribes your spoken words into text.

Image-Based Conversations

You can now share one or more images with ChatGPT to address various tasks, such as troubleshooting a malfunctioning grill, planning meals by examining the contents of your fridge, or analyzing complex graphs for work-related data. To initiate image-based interactions, tap the photo button to capture or select an image. If you are using iOS or Android, tap the plus button first. You can also discuss multiple images or utilize the drawing tool in the mobile app to guide your assistant.

The understanding of images is powered by multimodal GPT-3.5 and GPT-4 models, which apply their language reasoning skills to a wide array of images, including photographs, screenshots, and documents that contain both text and images.

Gradual Deployment for Safety

OpenAI’s primary objective is to create safe and beneficial artificial general intelligence (AGI). The phased deployment approach allows for continuous improvement and risk mitigation while preparing users for more advanced systems in the future. This strategy is particularly crucial when introducing advanced voice and vision capabilities.

Voice Technology

The new voice technology, capable of creating realistic synthetic voices from short samples of real speech, has numerous creative and accessibility-focused applications. However, it also introduces new risks, such as the potential for impersonating public figures or committing fraud. To mitigate these risks, OpenAI is restricting the technology to a single use case, voice chat, and uses only voices created in direct collaboration with voice actors rather than cloning arbitrary speakers.

Image Input

Vision-based models present their own set of challenges, including potential misinterpretation of images in critical contexts. Prior to broader deployment, rigorous testing with red teamers and alpha testers was conducted to assess risks related to extremism and scientific proficiency. These findings helped establish guidelines for responsible usage.

Ensuring Utility and Safety

The vision feature in ChatGPT aims to assist users in their daily lives effectively by understanding visual information. This approach is influenced by collaboration with Be My Eyes, a mobile app for blind and low-vision individuals, to grasp the utility and limitations of visual interactions. Measures have been implemented to restrict ChatGPT’s ability to analyze and make direct statements about individuals, respecting privacy concerns.


OpenAI is committed to transparency regarding the model’s limitations and encourages users to exercise caution with high-risk use cases that lack proper verification. Additionally, the model is proficient at transcribing English text but performs poorly with some other languages, especially those written in non-Roman scripts, a limitation that non-English users are advised to consider.

