ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains

Why Do We Need ChroKnowledge?

Understanding how knowledge accumulates and evolves over time is crucial for improving large language models (LLMs), particularly in domains like science and law where facts change. Current evaluation methods often fail to address this dynamic nature: they focus on a static, single time stamp and miss the broader, accumulative view of knowledge. This can result in outdated or incomplete information, undermining the models' reliability.
We present ChroKnowledge, a novel framework for evaluating and updating LLMs' chronological knowledge across various domains. It allows models to recall and adapt to evolving facts more accurately without retraining, helping them stay relevant and accurate over time.

ChroKnowBench

Overview of time-variant dataset generation in ChroKnowBench. We accumulate knowledge along three key aspects:
(1) multiple domains: general, biomedical, legal, commonsense, and mathematics
(2) time dependency: knowledge that can change as time goes by
(3) temporal state: dynamic (has evolved over the period) and static (no change occurred during the period)
Trends of Correct answers for each year, shown as line plots, reveal differences among domains and temporal states.
Each highlighted portion is chronologically Known.
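
The dynamic/static distinction in aspect (3) can be illustrated with a minimal sketch. It assumes each piece of knowledge is stored as a mapping from year to a single object string. The function name `temporal_state` and the example values are hypothetical and are not taken from the benchmark's actual data format.

```python
from typing import Mapping


def temporal_state(objects_by_year: Mapping[int, str]) -> str:
    """Label one piece of knowledge by whether its object changed over the period.

    Assumption: exactly one object per year, compared by exact string match.
    """
    if not objects_by_year:
        raise ValueError("need at least one year of data")
    distinct_objects = set(objects_by_year.values())
    # One distinct object over the whole period means nothing changed.
    return "static" if len(distinct_objects) == 1 else "dynamic"


# Hypothetical examples: an affiliation that changes, and one that does not.
changing = {2019: "Team A", 2020: "Team A", 2021: "Team B"}
unchanged = {2019: "Team A", 2020: "Team A", 2021: "Team A"}
print(temporal_state(changing))   # dynamic
print(temporal_state(unchanged))  # static
```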

General Domain

These two heatmaps show performance in the general domain with generation templates. For both dynamic and static datasets, a common trend across models is that performance is stronger in the intermediate years but declines in recent years, reflecting the data cutoff point. Dynamic knowledge (above) shows more variation than static knowledge (below).

Biomedical Domain

These two heatmaps show performance in the biomedical domain with generation templates. Compared with the general domain, both dynamic and static datasets show lower variability, reflecting a domain-specific tendency toward consistency in knowledge changes. Both show a performance decrease between 2022 and 2023, aligning with the cutoff pattern noted in the general domain.

Legal Domain

These two heatmaps show performance in the legal domain with generation templates. Among the time-variant domains, the legal domain shows the most stable results on the static dataset, while its gap between dynamic and static datasets is the largest among domains.

CommonSense & Math

Performance in the commonsense and mathematics domains. The upper row is commonsense and the lower row is mathematics. The three line plots show each template's results: from left, Generation, MCQA, and TF. All models clearly show the characteristic specific to these domains: the knowledge is invariant even when it comes with temporal attributes. Overall results are lower with generation templates, as it is challenging to correctly recall exactly one object in these domains.

Template-wise Results

Template-wise results of ChroKnowledge. The three spider plots show results for the general, biomedical, and legal domains, comparing the three templates. In the general domain, generation performance declines over time, whereas MCQA and TF appear to rise. In the biomedical domain, generation performance declines as in the general domain, but MCQA and TF continue to perform well. Lastly, comparing templates, generation shows the lowest performance, while the TF setting answers correctly extraordinarily often in the legal domain.

ChroKnowPrompt

Chronological categorization based on each answer and its time stamp. If the model answers correctly for all time stamps, the knowledge is re-categorized as Known. ChroKnowPrompt targets Partial Known, where the model's knowledge is confused across the time stamps.
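
The rule stated here can be written as a small sketch. Only Known and Partial Known come from the description above. The "Unknown" bucket for a model that is never correct, the function name `categorize`, and the per-year boolean input are assumptions made for illustration, and the benchmark may distinguish further categories.

```python
from typing import Mapping


def categorize(correct_by_year: Mapping[int, bool]) -> str:
    """Categorize one piece of knowledge from per-time-stamp correctness.

    - "Known": correct at every time stamp (as described in the text).
    - "Partial Known": correct at some time stamps only.
    - "Unknown": never correct (an assumed label, not from the source).
    """
    if not correct_by_year:
        raise ValueError("need at least one time stamp")
    outcomes = list(correct_by_year.values())
    if all(outcomes):
        return "Known"
    if not any(outcomes):
        return "Unknown"
    return "Partial Known"


print(categorize({2020: True, 2021: True, 2022: True}))   # Known
print(categorize({2020: True, 2021: False, 2022: True}))  # Partial Known
```

Under this reading, the Partial Known items are the ones passed on to ChroKnowPrompt.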

Overview of ChroKnowPrompt. The algorithm traverses the time spans step by step, appending each span's result as a few-shot example for the following steps. The range of the previous and next spans is predefined, and time stamps are visited in order of nearness to the target Tn. The model proposes a final candidate answer Cn, verified and refined over several steps, which is finally checked against the object in the benchmark. The detailed algorithm is given below.
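
As a rough illustration of the traversal described above, the following is a simplified sketch and not the authors' full algorithm. It assumes a caller-supplied function `ask(year, shots)` that queries a model for the object at a given year using the accumulated few-shot examples. The names `nearest_first` and `chrono_prompt`, the default span sizes, and the tie-breaking rule (earlier year first) are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Shot = Tuple[int, str]  # (year, answer) pair reused as a few-shot example


def nearest_first(target: int, years: List[int],
                  prev_span: int, next_span: int) -> List[int]:
    """Years inside the predefined previous/next spans, nearest to the target first."""
    window = [y for y in years
              if y != target and target - prev_span <= y <= target + next_span]
    # Assumed tie-break: at equal distance, visit the earlier year first.
    return sorted(window, key=lambda y: (abs(y - target), y))


def chrono_prompt(target: int, years: List[int],
                  ask: Callable[[int, List[Shot]], str],
                  prev_span: int = 2, next_span: int = 2) -> str:
    """Traverse neighbouring years, accumulate few-shot examples, refine the candidate."""
    shots: List[Shot] = []
    candidate: Optional[str] = None
    for year in nearest_first(target, years, prev_span, next_span):
        # Ask about the neighbouring year and keep its answer as a new example.
        shots.append((year, ask(year, shots)))
        # Re-ask the target year with the enlarged context to refine the candidate.
        candidate = ask(target, shots)
    if candidate is None:  # no neighbouring years available
        candidate = ask(target, shots)
    return candidate


# Stand-in for a real model call, used only to make the sketch runnable.
def fake_model(year: int, shots: List[Shot]) -> str:
    return f"object@{year}"


final = chrono_prompt(2020, list(range(2015, 2025)), fake_model)
print(final)  # object@2020
```

The returned candidate would then be compared with the benchmark's object for the target year, for example by exact string match.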

Results

Results of ChroKnowPrompt across multiple domains with unchanged objects. For each domain, the left side shows the percentage of Partial Known and the right side shows the percentage of Known. Each model has results for both the dynamic (yellow-blue bars) and static (red-green bars) datasets, with arrows indicating the actual increase. As the plots show, the most effective results are observed in the biomedical domain, where the unchanging characteristic is stronger than in the general domain. While the static dataset of the legal domain shows improvement, many models struggle with its unstructured format, resulting in the lowest performance among the dynamic datasets.

BibTeX

@inproceedings{park2025chroknowledge,
      title={ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains},
      author={Yein Park and Chanwoong Yoon and Jungwoo Park and Donghyeon Lee and Minbyul Jeong and Jaewoo Kang},
      booktitle={The Thirteenth International Conference on Learning Representations},
      year={2025},
      url={https://openreview.net/forum?id=whaO3482bs}
   }
