MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …

LawBench: Benchmarking legal knowledge of large language models

Z Fei, X Shen, D Zhu, F Zhou, Z Han, S Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated strong capabilities in various aspects.
However, when applying them to the highly specialized, safety-critical legal domain, it is …

CyberMetric: a benchmark dataset based on retrieval-augmented generation for evaluating LLMs in cybersecurity knowledge

N Tihanyi, MA Ferrag, R Jain, T Bisztray… - … Conference on Cyber …, 2024 - ieeexplore.ieee.org
Large Language Models (LLMs) are increasingly used across various domains, from
software development to cyber threat intelligence. Understanding all the different …

Where Are Large Language Models for Code Generation on GitHub?

X Yu, L Liu, X Hu, JW Keung, J Liu, X Xia - arXiv preprint arXiv:2406.19544, 2024 - arxiv.org
The increasing use of Large Language Models (LLMs) in software development has
garnered significant attention from researchers assessing the quality of the code they …

CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation

Y Xu, Y Chen, X Zhang, X Lin, P Hu… - Proceedings of …, 2024 - proceedings.mlsys.org
Among the thriving ecosystem of cloud computing and the proliferation of Large Language
Model (LLM)-based code generation tools, there is a lack of benchmarking for code …

Polymath: A challenging multi-modal mathematical reasoning benchmark

H Gupta, S Verma, U Anantheswaran, K Scaria… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities
in various domains, but their visual comprehension and abstract reasoning skills remain …

From Effectiveness to Efficiency: Comparative Evaluation of Code Generated by LCGMs for Bilingual Programming Questions

W Jiang, X Gao, J Zhai, S Ma, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Code Generation Models (LCGMs) have garnered significant attention and achieved
promising results across various programming tasks. However, concerns arise regarding …

VersiCode: Towards Version-controllable Code Generation

T Wu, W Wu, X Wang, K Xu, S Ma, B Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Significant research has focused on improving the performance of large language models on
code-related tasks due to their practical importance. Although performance is typically …

Assessing and Optimizing Large Language Models on Spondyloarthritis Multi-Choice Question Answering: Protocol for Enhancement and Assessment

A Wang, Y Wu, X Ji, X Wang, J Hu… - JMIR Research …, 2024 - researchprotocols.org
Background: Spondyloarthritis (SpA), a chronic inflammatory disorder, predominantly
impacts the sacroiliac joints and spine, significantly escalating the risk of disability. SpA's …

Multi-Intent Inline Code Comment Generation via Large Language Model

X Zhang, Z Chen, Y Cao, L Chen… - International Journal of …, 2024 - researchgate.net
Comment generation (aka code summarization) refers to the process of generating concise
and fluent natural language descriptions for a piece of code [1–4]. It is considered a …