Measuring AI Success by the Value It Brings to Users

Aug 25, 2024

As a software tool enthusiast, I am, perhaps unsurprisingly, a paying subscriber to all three of the world's top-tier Large Language Models (LLMs). My first paid membership was OpenAI's ChatGPT: I was willing to pay for access to the latest model and its marketplace of plugin extensions (though the marketplace has since disappeared). Then Google, not wanting to be left behind, rebranded its Bard LLM as Gemini, integrated it with Google Workspace, and offered a two-month free trial, which I promptly took advantage of. Finally, Anthropic released its flagship Claude Opus model, with benchmarks claiming it outperformed both GPT-4 and Gemini Ultra; with my constant need for writing and coding assistance, I subscribed.

However, just a month after I subscribed, OpenAI's spring update announced the world-shaking GPT-4o model. Its user experience so thoroughly overshadowed the other two that I canceled my Claude subscription. This made me wonder: for the end user, is the ability to smoothly complete a task (the job-to-be-done) more important than the accuracy of the output?

GPT-4o brings a better user experience to ChatGPT

I used an open-source application called ChatALL to compare the responses of different models and see which output best met my needs. Initially, my expectations for LLMs centered mainly on the accuracy of the responses. However, because the definition of accuracy and the methods for evaluating it vary across domains, and drawing on my experience with cross-functional collaboration and communication, my expectations for LLM outputs shifted toward overall quality, with my own past experience as the standard of evaluation.

Recently, I have been working on side projects that use LLMs to speed up my coding. This is where ChatGPT clearly surpasses Claude in user experience. Because files can be passed along directly, I can paste a screenshot of an error message in my code without copying text or clicking an upload button, which greatly improves the experience. In addition, GPT-4o's response speed is noticeably faster than previous models', making real-time interaction flow well enough to approach the feel of personal tutoring, as seen in the video of Salman Khan and his son using ChatGPT to learn math.

Measuring AI by the value it brings to users

As someone who has been building Computer Vision and Machine Learning (CVML) products since 2018, I understand well the difficulties of building AI products. The process differs vastly from building traditional software, whether in economic cost, time cost, or team composition. Now that I am on the other side, as a user rather than a developer, I can finally evaluate these products from the user's perspective.

In the past, when developing products, I used "The Elements of Value" to assess whether we should build a feature and what level of value it could bring to users. The most basic level is functional value, such as saving time, reducing cost, and quality. Today, the common evaluation methods for text generation, such as GPQA, MMLU, and MGSM, still center mostly on the value of quality. Likewise, the LMSYS Chatbot Arena Leaderboard, a crowdsourced open platform that ranks language models through head-to-head votes on their outputs, is also centered on output quality.
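To make the voting mechanism concrete: arena-style leaderboards aggregate many pairwise "which response was better?" votes into a single rating per model. The sketch below uses a plain Elo update as an illustration; the actual Chatbot Arena methodology fits a Bradley-Terry model, and the function name and K-factor here are my own assumptions, not the leaderboard's code.

```python
def elo_update(r_a, r_b, winner, k=32):
    """Return updated ratings for models A and B after one vote.

    winner: 'a', 'b', or 'tie'. Higher rating = preferred more often.
    """
    # Expected score of A given the current rating gap (logistic curve).
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    # Actual score of A from this vote.
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    # Move each rating toward the observed outcome; updates are zero-sum.
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Example: two models start level at 1000; model A wins one vote.
ra, rb = elo_update(1000, 1000, "a")
print(round(ra), round(rb))  # 1016 984
```

Note that a rating produced this way still only measures perceived output quality, which is exactly the point of the paragraph above: the whole pipeline stays at the functional level of value.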

As competition among LLMs grows increasingly fierce, how can closed-source companies build products that deliver value beyond the functional level, and thereby improve user retention and freemium-to-premium conversion? I think this is something product builders will need to consider going forward. That said, the functional level, although the most basic, is also the most important: it is the foundation. If output quality cannot meet human standards across the various evaluation criteria, I believe it will take another one to three years before users experience the emotional and life-changing levels of value.