{"id":29500,"date":"2025-02-24T11:14:23","date_gmt":"2025-02-24T03:14:23","guid":{"rendered":"https:\/\/www.1ai.net\/?p=29500"},"modified":"2025-02-24T11:14:23","modified_gmt":"2025-02-24T03:14:23","slug":"openai-%e6%9c%80%e6%96%b0%e7%a0%94%e7%a9%b6%ef%bc%9a%e5%bd%93%e5%89%8d-ai-%e6%a8%a1%e5%9e%8b%e4%bb%8d%e6%97%a0%e6%b3%95%e5%aa%b2%e7%be%8e%e4%ba%ba%e7%b1%bb%e7%a8%8b%e5%ba%8f%e5%91%98","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/29500.html","title":{"rendered":"OpenAI Study: Current AI Models Still Can't Compare to Human Programmers"},"content":{"rendered":"<p>February 24, 2025 - Although <a href=\"https:\/\/www.1ai.net\/en\/tag\/openai\" title=\"[View articles tagged with [OpenAI]]\" target=\"_blank\" >OpenAI<\/a> CEO Sam Altman insists that by the end of this year <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e4%ba%ba%e5%b7%a5%e6%99%ba%e8%83%bd%e6%a8%a1%e5%9e%8b\" title=\"_Other Organiser\" target=\"_blank\" >artificial intelligence models<\/a> will be able to outperform \"low-level\" software engineers, a new study by the company's own researchers suggests that even the most advanced AI models still cannot match human <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e7%a8%8b%e5%ba%8f%e5%91%98\" title=\"[Sees articles with [programmer] labels]\" target=\"_blank\" >programmers<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-29501\" title=\"8f59bf89j00ss63mz004gd000lv00p9p\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/02\/8f59bf89j00ss63mz004gd000lv00p9p.jpg\" alt=\"8f59bf89j00ss63mz004gd000lv00p9p\" width=\"787\" height=\"909\" \/><\/p>\n<p>In a new paper, the researchers note that even cutting-edge models (that is, the most innovative and groundbreaking AI systems) <strong>\"still can't solve most\" programming tasks<\/strong>.
To that end, the researchers developed a new benchmark called SWE-Lancer, built on more than 1,400 software engineering tasks from the freelance website Upwork. With this benchmark, OpenAI tested three large language models (LLMs): its own o1 reasoning model, its flagship GPT-4o, and Anthropic's Claude 3.5 Sonnet.<\/p>\n<p>Specifically, <strong>the new benchmark evaluates how these LLMs perform on two types of Upwork tasks<\/strong>: individual tasks, which involve resolving bugs and implementing fixes for them, and management tasks, which require the models to make higher-level decisions from a broader perspective. It is worth noting that the models were denied Internet access during testing, <strong>so they could not simply copy similar answers already available online.<\/strong><\/p>\n<p>The models took on tasks cumulatively worth hundreds of thousands of dollars on Upwork, but they could only fix surface-level software problems and could not find bugs in larger projects or trace their root causes. Such \"half-baked\" solutions will be familiar to anyone who has worked with AI: <strong>AI is good at producing confident-sounding output that often falls apart under scrutiny.<\/strong><\/p>\n<p>While the paper notes that the three LLMs can often complete tasks \"far faster than humans,\" they fail to grasp how widespread a bug is or to understand its context. <strong>This results in solutions that are \"wrong or incomplete\".<\/strong><\/p>\n<p>The researchers explain that Claude 3.5 Sonnet outperformed the two OpenAI models and \"earned\" more than o1 and GPT-4o in the tests. Even so, <strong>most of its answers were still wrong<\/strong>.
The researchers noted that <strong>any model needs to be \"more reliable\" before it can be used for real programming tasks.<\/strong><\/p>\n<p>In short, the paper suggests that while these cutting-edge models can work quickly on narrowly scoped tasks, they handle those tasks with far less skill than a human engineer.<\/p>\n<p>Although large language models have advanced rapidly in recent years and will keep improving, their current software engineering skills are still not sufficient to replace humans. Yet 1AI notes that this does not seem to have stopped some CEOs from firing human programmers in favor of these immature AI models.<\/p>","protected":false},"excerpt":{"rendered":"<p>February 24, 2025 - While OpenAI CEO Sam Altman insists that AI models will be able to outperform \"low-level\" software engineers by the end of the year, a new study by the company's researchers suggests that even the most advanced AI models can't compete with human programmers. In a new paper, the researchers note that even cutting-edge models (the most innovative and groundbreaking AI systems) \"still cannot solve most\" programming tasks. To that end, the researchers developed a new benchmarking tool called SWE-Lancer, which is based on more than 1,400 software engineering tasks on the freelance website Upwork.
By<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[190,599,1410],"collection":[],"class_list":["post-29500","post","type-post","status-publish","format-standard","hentry","category-news","tag-openai","tag-599","tag-1410"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/29500","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=29500"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/29500\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=29500"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=29500"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=29500"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=29500"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}