The first Alpha Arena: Alibaba's Qwen3-Max wins with a 22.32% return, GPT-5 loses over 62%

On November 5th it was reported that the US research institute Nof1 had recently launched a live-trading test: six top AI large language models (LLMs) were each given $10,000 in initial funds to trade in real markets.


The first Alpha Arena has officially concluded, with Alibaba's Qwen3-Max finishing in the lead and taking the championship with a 22.32% return.

Of the six top global models in the contest, Qwen3-Max, DeepSeek v3.1, GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, and Grok 4, all but Qwen and DeepSeek lost money; GPT-5 lost more than 62%.

Alpha Arena aims to test these models' quantitative-trading capabilities in a dynamic, competitive environment.

While the AI models could carry out the assigned tasks, the researchers point out that they showed significant variation in risk management, trading behavior, holding time, directional preference, and more.

The team stressed that the goal was not to "select the strongest model" but to push AI research from static, test-set benchmarking toward "real-world", "real-time" decision-making.

Experimental design

  • Each model received initial funds of $10,000 (note: roughly RMB 71,218 at the current exchange rate) to trade cryptocurrency contracts (including BTC, ETH, SOL, BNB, DOGE, XRP) on the Hyperliquid trading platform.
  • Models could only act on numerical market data (price, volume, technical indicators, etc.) and had no access to news or current events.
  • Each model's goal was to maximize PnL, with the Sharpe ratio reported as a risk-adjusted metric.
  • Trading actions were simplified to: buy (long), sell (short), hold, and close. All models used the same prompt, the same data interface, and no model-specific fine-tuning.
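The setup above, a fixed four-action space, a shared prompt over numerical data only, and free-text model replies mapped back onto actions, can be sketched as follows. This is a minimal illustration of that protocol; every name, field, and string format here is an assumption for the sketch, not Nof1's actual code or prompt.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    """The simplified action space described in the article."""
    BUY = "buy"      # open/increase a long position
    SELL = "sell"    # open/increase a short position
    HOLD = "hold"    # keep the current position
    CLOSE = "close"  # flatten the position

@dataclass
class MarketSnapshot:
    """One tick of the numerical inputs the models were limited to.
    Field names are illustrative assumptions."""
    symbol: str   # e.g. "BTC", "ETH", "SOL"
    price: float
    volume: float
    rsi: float    # example technical indicator

def build_prompt(history: list[MarketSnapshot]) -> str:
    """Render numerical market data only (no news feeds),
    matching the experiment's input restriction."""
    lines = [f"{s.symbol} price={s.price} vol={s.volume} rsi={s.rsi}"
             for s in history]
    return ("Recent market data:\n" + "\n".join(lines) +
            "\nRespond with one of: buy, sell, hold, close.")

def parse_action(model_reply: str) -> Action:
    """Map a free-text model reply onto the four allowed actions,
    defaulting to HOLD when the reply is unparseable (one plausible
    way to handle the 'action execution' failures the article notes)."""
    reply = model_reply.strip().lower()
    for action in Action:
        if action.value in reply:
            return action
    return Action.HOLD
```

A harness like this would call `build_prompt`, send the text to each model through the same interface, and route the parsed `Action` to the exchange, which is what lets all six models be compared under identical conditions.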

Preliminary results

The report indicates that although every model operated under the same framework, they differed significantly in trading style, risk appetite, holding time, and trading frequency. For example, some models favored going short while others almost never did; some held positions for long stretches and traded infrequently, while others traded constantly.

On sensitivity to data format, the team observed that changing the order of the data in the prompt from newest-first to oldest-first fixed errors caused by some models misreading the series.
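The ordering change described above is a one-line difference in how the prompt is rendered. A minimal sketch of the two orderings, with an explicit header so the direction of time is stated rather than implied; the function and its format are assumptions for illustration, not the experiment's actual prompt:

```python
def format_series(prices: list[float], newest_first: bool = False) -> str:
    """Render a price series for a prompt.

    `prices` is assumed to arrive in chronological (oldest-first) order.
    The header line makes the ordering explicit, since the article reports
    that some models misread the series when the direction was ambiguous.
    """
    ordered = list(reversed(prices)) if newest_first else prices
    header = ("Prices, newest first:" if newest_first
              else "Prices, oldest first:")
    return "\n".join([header] + [str(p) for p in ordered])
```

With the same three ticks, `format_series([1.0, 2.0, 3.0], newest_first=True)` puts `3.0` on the first data line, while the default puts `1.0` there; the team's fix amounted to switching from the former presentation to the latter.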

The study also notes the test's limitations: a small sample size, a short run time, no model performance history, and no capacity for cumulative learning. The team says next season will add more controls, more features, and greater statistical power.

Meaning and observation

The project seeks to answer a basic question: can a large language model operate as a zero-shot trading system in a genuine trading environment?

Through the experiment, Nof1 aims to push AI research toward "real, dynamic, risk-bearing benchmarks" rather than static datasets alone.

While the experiment does not settle which model is strongest, it reveals that even the most advanced LLMs still face multiple challenges in live trading, such as action execution, market-state understanding, and data-format sensitivity.
