
The Unsung Heroes of AI Development
Generative AI systems, such as those developed by Google and OpenAI, have captivated the world with their seemingly magical ability to produce human-like responses to a wide range of questions. Behind the scenes, however, a crucial group of workers, the prompt engineers and analysts who rate chatbot outputs, is working tirelessly to improve the accuracy of these systems.
The Concerns Surrounding Gemini
A new internal guideline that Google passed down to contractors working on Gemini has raised concerns that the system may be more prone to producing inaccurate information on highly sensitive topics, such as healthcare. Under the updated guideline, contractors working for GlobalLogic, an outsourcing firm owned by Hitachi, are no longer allowed to skip certain prompts, even if they lack the domain expertise required to evaluate them.
The Role of Contractors in AI Development
Contractors working with GlobalLogic play a critical role in evaluating the quality of AI-generated responses. They are tasked with rating outputs according to various factors, including "truthfulness." Previously, contractors could skip a prompt if they lacked the necessary expertise to judge it, but this is no longer an option.
The Updated Guidelines: A Concern for Accuracy
According to internal correspondence seen by TechCrunch, the updated guidelines read:
"You should not skip prompts that require specialized domain knowledge. Instead, you should rate the parts of the prompt you understand and include a note that you don’t have domain knowledge."
This change has led to concerns about Gemini’s accuracy on certain topics, as contractors are sometimes tasked with evaluating highly technical AI responses about issues like rare diseases.
Contractors Weigh in: Concerns About Accuracy
Internal correspondence shows that contractors raised concerns as soon as the updated guidelines arrived. One contractor noted:
"I thought the point of skipping was to increase accuracy by giving it to someone better?"
The new guidelines now allow contractors to skip prompts in only two cases:
- If they’re "completely missing information" like the full prompt or response.
- If they contain harmful content that requires special consent forms to evaluate.
Google’s Response: A Commitment to Factual Accuracy
When asked for comment, Google spokesperson Shira McNamara responded:
"We are constantly working to improve factual accuracy in Gemini. Raters perform a wide range of tasks across many different Google products and platforms. They do not solely review answers for content; they also provide valuable feedback on style, format, and other factors."
McNamara also noted that the ratings contractors provide do not directly impact the algorithms used to generate responses.
The Impact on Factual Accuracy
The updated guidelines have raised concerns about Gemini’s ability to produce accurate information on sensitive topics. By requiring contractors to evaluate prompts outside their domain expertise, the new policy risks letting inaccurate responses pass review unchecked.
Conclusion
The development of generative AI systems relies heavily on the prompt engineers and analysts who evaluate their outputs. Recent changes to internal guidelines, however, have raised concerns about Gemini’s accuracy on sensitive topics. Human evaluation remains a cornerstone of AI development, and it is crucial that companies like Google prioritize factual accuracy in their AI systems.