Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation.
JMIR Med Inform 2024-07-02
MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering.
Artif Intell Med 2024-08-09
Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning.
IEEE Trans Vis Comput Graph 2024-09-10
Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context.
Digit Health 2024-10-11
LLM-CGM: A Benchmark for Large Language Model-Enabled Querying of Continuous Glucose Monitoring Data for Conversational Diabetes Management.
Pac Symp Biocomput 2024-12-13
A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options.
ArXiv 2025-01-13
A Comprehensive Analysis of a Social Intelligence Dataset and Response Tendencies Between Large Language Models (LLMs) and Humans.
Sensors (Basel) 2025-01-25
APBench and benchmarking large language model performance in fundamental astrodynamics problems for space engineering.
Sci Rep 2025-03-06