On November 5th, the American research institute Nof1 launched a live trading test: six top AI large language models (LLMs) were each given $10,000 in initial funds and allowed to trade in real markets.

The first Alpha Arena has now officially ended, with Alibaba's Qwen3-Max finishing in the lead and taking the investment crown with a 22.32% return.
Of the six top global models tested (Qwen3-Max, DeepSeek v3.1, GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, and Grok 4), all except Qwen and DeepSeek ended with losses, and GPT-5 lost more than 62%.
Alpha Arena aims to test these models' quantitative-trading capabilities in a dynamic, competitive environment.
While the AI models can perform the assigned tasks, the researchers point out that they show significant variation in risk management, trading behaviour, holding time, directional preference, and more.
The team stressed that the goal was not to "pick the strongest model" but to push AI research from static, test-set benchmarking toward "real-world", "real-time" decision-making.
Experimental design
- Each model starts with $10,000 in initial funds (roughly RMB 71,218 at the current exchange rate) to trade cryptocurrency contracts (including BTC, ETH, SOL, BNB, DOGE, XRP) on the Hyperliquid trading platform.
- Models can rely only on numerical market data (price, volume, technical indicators, etc.) and are not allowed access to news or current events.
- Each model's objective is to maximize PnL, with the Sharpe ratio reported as a risk-adjusted metric.
- Trades are simplified to: buy (long), sell (short), hold, and close. All models use the same prompt, the same data interface, and no model-specific fine-tuning.
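The setup above can be sketched in code. This is an illustrative reconstruction only: Nof1 has not published its harness, so the action names and the Sharpe-ratio helper below are assumptions, not the project's actual implementation.

```python
from enum import Enum
from statistics import mean, stdev

class Action(Enum):
    """The simplified action space described above (names are illustrative)."""
    BUY = "buy"      # open or extend a long position
    SELL = "sell"    # open or extend a short position
    HOLD = "hold"    # keep the current position unchanged
    CLOSE = "close"  # flatten (exit) the position

def sharpe_ratio(returns, risk_free=0.0):
    """Risk-adjusted metric reported alongside raw PnL: mean excess
    return divided by the standard deviation of returns."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)

# Hypothetical per-period returns for one model, for illustration.
rets = [0.02, -0.01, 0.03, 0.01, -0.02]
print(f"Sharpe: {sharpe_ratio(rets):.3f}")
```

Two models with the same final PnL can thus rank differently once volatility of returns is taken into account, which is why the experiment reports the Sharpe ratio in addition to raw profit.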
Preliminary results
The report indicates that although every model operates under the same framework, they differ significantly in trading style, risk preference, holding time, and trading frequency. For example, some models leaned heavily toward short positions while others almost never went short; some held positions for long periods and traded infrequently, while others traded constantly.
On sensitivity to data format, the team observed that changing the ordering of market data in the prompt from newest-first to oldest-first fixed misreading errors in some of the models.
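The ordering effect can be illustrated with a minimal sketch. The field names and candle data below are hypothetical, not Nof1's actual prompt format; the point is only that the same series reads very differently depending on which end is "most recent".

```python
# Hypothetical price bars; the last element is the most recent.
candles = [
    {"t": "10:00", "close": 101.2},
    {"t": "10:05", "close": 100.8},
    {"t": "10:10", "close": 102.1},  # most recent bar
]

def render(series, newest_first):
    """Render the series as prompt text in either ordering."""
    ordered = list(reversed(series)) if newest_first else series
    return "\n".join(f'{c["t"]} close={c["close"]}' for c in ordered)

# Newest-first: a model may misread the top row as the oldest bar.
print(render(candles, newest_first=True))
# Oldest-first (chronological): the ordering that reportedly fixed
# the misreads for some models.
print(render(candles, newest_first=False))
```

That a pure presentation change altered trading decisions is itself evidence of the data-format sensitivity the team highlights.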
The study also notes the test's limitations: small sample size, short running time, no history of model performance, and no capacity for cumulative learning. The team says the next season will add more controls, more features, and greater statistical power.
Meaning and observation
The project seeks to answer a basic question: can a large language model act as a zero-shot trading system in a genuine trading environment?
Through the experiment, Nof1 aims to push AI research toward "real, dynamic, risk-bearing benchmarks" rather than static datasets alone.
While the experiment does not settle which model is "strongest", it reveals that even the most advanced LLMs still face multiple challenges in live trading, such as action execution, market-state understanding, and data-format sensitivity.