Evaluating language models for efficient code generation

J Liu, S Xie, J Wang, Y Wei, Y Ding, L Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Differential Performance Evaluation (DPE), a framework designed to reliably
evaluate Large Language Models (LLMs) for efficient code generation. Traditional coding …

Reasoning runtime behavior of a program with LLM: How far are we?

J Chen, Z Pan, X Hu, Z Li, G Li, X Xia - arXiv preprint cs.SE …, 2024 - ginolzh.github.io
Large language models for code (i.e., code LLMs) have shown strong code understanding
and generation capabilities. To evaluate the capabilities of code LLMs in various aspects …

SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications

L Ma, S Liu, L Bu, S Li, Y Wang, Y Liu - arXiv preprint arXiv:2409.12866, 2024 - arxiv.org
Large Language Models have achieved impressive performance in automated software
engineering. Extensive efforts have been made to evaluate the abilities of code LLMs in …

Best practices and lessons learned on synthetic data for language models

R Liu, J Wei, F Liu, C Si, Y Zhang, J Rao… - arXiv preprint arXiv …, 2024 - arxiv.org
The success of AI models relies on the availability of large, diverse, and high-quality
datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and …

LiveCodeBench: Holistic and contamination free evaluation of large language models for code

N Jain, K Han, A Gu, WD Li, F Yan, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) applied to code-related applications have emerged as a
prominent field, attracting significant interest from both academia and industry. However, as …

Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

F Lin, E La Malfa, V Hofmann, EM Yang, A Cohn… - arXiv preprint arXiv …, 2024 - arxiv.org
Reasoning about asynchronous plans is challenging since it requires sequential and
parallel planning to optimize time costs. Can large language models (LLMs) succeed at this …

Neurosymbolic Repair of Test Flakiness

Y Chen, R Jabbarvand - Proceedings of the 33rd ACM SIGSOFT …, 2024 - dl.acm.org
Test flakiness, a non-deterministic behavior of builds irrelevant to code changes, is a major
and continuing impediment to delivering reliable software. The very few techniques for the …

Uncovering Weaknesses in Neural Code Generation

X Lian, S Wang, J Ma, F Liu, X Tan, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Code generation, the task of producing source code from prompts, has seen significant
advancements with the advent of pre-trained large language models (PLMs). Despite these …

Design and Optimization of Heat Exchangers Using Large Language Models

S Mishra, VS Jadhav, S Karande… - Fourth Workshop on …, 2024 - ceur-ws.org
Heat exchangers (HEs) are essential in process industries for efficient thermal energy
transfer. Their design and optimization are crucial for improving energy efficiency, reducing …

SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

Y Yang, Y Nie, Z Wang, Y Tang, W Guo, B Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing works have established multiple benchmarks to highlight the security risks
associated with Code GenAI. These risks are primarily reflected in two areas: a model …