Since 2012, Be My Eyes has been creating technology for the community of over 250 million people who are blind or have low vision. The Danish startup connects people who are blind or have low vision with volunteers for help with hundreds of daily life tasks like identifying a product or navigating an airport.
With GPT-4’s new visual input capability, Be My Eyes began developing a GPT-4-powered Virtual Volunteer™ within the Be My Eyes app that can provide the same level of context and understanding as a human volunteer.
“In the short time we’ve had access, we have seen unparalleled performance to any image-to-text object recognition tool out there,” says Michael Buckley, CEO of Be My Eyes. “The implications for global accessibility are profound. In the not so distant future, the blind and low vision community will utilize these tools not only for a host of visual interpretation needs, but also to have a greater degree of independence in their lives.”
Now, when someone sends an image of, say, the contents of their fridge, GPT-4 not only recognizes and names what’s inside but also extrapolates what could be made with those ingredients. The user could then ask it for a good recipe. The use cases are almost unlimited.
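As an illustration only (not Be My Eyes’ actual integration), this kind of interaction can be sketched as a multimodal chat request that pairs an image with a follow-up question. The message shape below mirrors vision-capable chat APIs; the model name and image URL are placeholders, not details from the article:

```python
import json


def build_vision_request(image_url: str, question: str,
                         model: str = "gpt-4-vision-preview") -> dict:
    """Assemble a chat-style payload pairing an image with a text prompt.

    The model name and URL are placeholder assumptions for illustration;
    they do not represent Be My Eyes' production setup.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


# Example: ask what could be cooked from a photo of a fridge's contents.
payload = build_vision_request(
    "https://example.com/fridge.jpg",  # placeholder image URL
    "What ingredients do you see, and what could I cook with them?",
)
print(json.dumps(payload, indent=2))
```

Because the request carries the full conversation history, the user can re-prompt the same way a sighted friend would be asked a follow-up question, which is the conversational loop Buckley describes.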
“That’s game changing,” says Buckley. “Ultimately, whatever the user wants or needs, they can re-prompt the tool to get more information that is usable, beneficial and helpful, nearly instantly.”
In early February, the company started beta-testing the GPT-backed assistant with a small group of employees; the results have been so positive that the feature will be in users’ hands within weeks.
“There’s just an incredible potential for our community,” Buckley says. “Our beta testers, including Lucy Edwards, already love what this does.”
The difference between GPT-4 and other language and machine learning models, explains Jesper Hvirring Henriksen, CTO of Be My Eyes, is both the ability to have a conversation and the greater degree of analytical prowess offered by the technology. “Basic image recognition applications only tell you what’s in front of you,” he says. “They can’t have a discussion to understand if the noodles have the right kind of ingredients or if the object on the ground isn’t just a ball, but a tripping hazard—and communicate that.”
Already the company has seen a user navigate the railway system—a daunting task even for sighted travelers—receiving not only details about their location on a map but also point-by-point instructions for safely reaching their destination.
Yet traversing the complicated physical world is only half the story. Understanding what’s on a screen can be just as arduous for a person who isn’t sighted. Screen readers, built into most modern operating systems, work through a web page or desktop application line by line, section by section, speaking each element aloud. Images, central to communication on the web, are even harder: without descriptive alt text, a screen reader has little to say about them.
Now, Henriksen says, they can show GPT-4 a webpage, and the system—after countless hours of training in which deep learning algorithms learn to identify the “important” parts of a page—knows which part to read or summarize. This not only simplifies tasks like reading the news online but also grants people who need visual assistance access to some of the most cluttered pages on the web: shopping and e-commerce sites. GPT-4 can summarize search results the way sighted people naturally scan them—skipping the minuscule details and bouncing between the important data points—and help those needing sight support make the right purchase in real time.
“This is a fantastic development for humanity,” Buckley says, “but it also represents an enormous commercial opportunity.”