Regulations Challenge
Introduction
The financial industry operates within a labyrinth of regulations and industry standards designed to maintain market integrity and ensure reliability in financial reporting and compliance processes. These intricate rules present significant challenges for financial professionals and organizations. Financial regulations and industry standards are characterized by:
Complexity: Financial regulations and industry standards are intricate due to the sophisticated nature of financial markets and products. They require detailed provisions to address various financial activities and risks.
Frequent updates: Regulations are regularly updated to adapt to market changes, technological advancements, and new risks, ensuring they remain effective and relevant.
Jurisdictional differences: Regulations and industry standards differ across countries due to varying legal systems and economic conditions, posing challenges for multinational financial activities that must comply with multiple frameworks.
Specialized terminology: Financial regulations and industry standards use specialized language to ensure precision, requiring practitioners to master this terminology for accurate interpretation and application.
Understanding and applying these regulations and industry standards requires not only expertise but also the ability to interpret nuanced language and anticipate regulatory implications.
Large language models (LLMs), such as GPT-4o, Llama 3.1, and Mistral Large 2, have shown remarkable capabilities in natural language understanding and generation, making them promising for applications in the financial sector [1]. However, current LLMs face challenges in the domain of financial regulations and industry standards. These challenges include grasping specialized regulatory language, maintaining up-to-date knowledge of evolving regulations and industry standards, and ensuring interpretability and ethical considerations in their responses [2].
The Regulations Challenge aims to push the boundaries of LLMs in understanding, interpreting, and applying regulatory knowledge in the finance industry. In this challenge, participants will tackle 9 tasks that explore key issues, including, but not limited to, regulatory complexity, ethical considerations, domain-specific terminology, industry standards, and interpretability. We welcome students, researchers, and practitioners who are passionate about finance and LLMs. We encourage participants to develop solutions that advance the capabilities of LLMs in addressing the challenges of financial regulations and industry standards.
Task Definition
The tasks are designed to test the capabilities of large language models (LLMs) to generate responses related to regulatory texts. Your primary goal is to enable the LLM to accurately interpret and respond to a variety of regulatory questions in the following 9 tasks.
Dataset Usage: Participants are able to use our provided data sources and/or gather relevant data themselves to fine-tune and enhance their model's performance. This could include sourcing updated regulations, additional context from regulatory bodies, or other relevant documents to ensure the LLM is trained with both comprehensive and current information.
Input: The input for these tasks consists of a primary request and the content of the request; some tasks also provide additional context to supplement the request.
Context: This is additional information that helps the LLM understand and accurately respond to the primary request.
Primary Request: This directs the focus of the response to what the user desires.
Content: This is the data that the primary request is about and what the model will act on. This will typically differ from query to query.
Output: The expected output is the answer to the question, reflecting the request provided.
These tasks assess the LLM's ability to handle different types of questions within the regulatory domain, as follows:
1. Abbreviation Recognition Task:
Goal: Match an abbreviation with its expanded form.
Input Template: "Expand the following acronym into its full form: {acronym}. Answer:"
2. Definition Recognition Task:
Goal: Correctly define a regulatory term or phrase.
Input Template: "Define the following term: {regulatory term or phrase}. Answer:"
3. Named Entity Recognition (NER) Task:
Goal: Ensure the output correctly identifies entities and places them into groups that the user specifies.
Input Template: "Given the following text, only list the following for each: specific Organizations, Legislations, Dates, Monetary Values, and Statistics: {input text}."
4. Question Answering Task:
Goal: Ensure the output matches the correct answer to a detailed question about regulatory practices or laws.
Input Template: "Provide a concise answer to the following question: {detailed question}? Answer:"
5. Link Retrieval Task:
Goal: Ensure the link output matches the actual law.
Input Template: "Provide a link for ____ law, Write in the format of ("{Law}: {Link}" or "{Law}: Not able to find a link for the law")"
6. Certificate Question Task:
Goal: Select the correct answer choice to a question that may be based on additional context.
Input Template: "(This context is used for the question that follows: {context}). Please answer the following question with only the letter and associated description of the correct answer choice: {question and answer choices}. Answer:"
7. XBRL Analytics Task:
Goal: Ensure the output strictly matches the correct answer to a detailed question about financial data extraction and application tasks via XBRL filings. These standardized digital documents contain detailed financial information.
Input Template: "Provide the exact answer to the following question: {detailed question}? Answer:"
8. Common Domain Model (CDM) Task:
Goal: Deliver precise responses to questions about the Fintech Open Source Foundation's (FINOS) Common Domain Model (CDM).
Input Template: "Provide a concise answer to the following question related to Financial Industry Operating Network's (FINO) Common Domain Model (CDM): {detailed question}? Answer:"
9. Model Openness Framework (MOF) Licenses Task:
Goal: Deliver precise responses to questions concerning licensing requirements under the Model Openness Framework (MOF).
Input Template: "Provide a concise answer to the following question about MOF's licensing requirements: {detailed question}? Answer:"
These elements are structured to evaluate how effectively LLMs can process and respond to context-based queries within the realm of regulatory texts. The use of a standardized input template helps maintain consistency and focus across different types of queries. Examples will be shown in the following section.
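For illustration, a query can be formed by filling a task template with the query-specific content. The sketch below uses the Abbreviation Recognition template with a placeholder acronym; the example is chosen for this sketch and is not drawn from the challenge dataset.

```python
# Minimal sketch: filling the Abbreviation Recognition template for one query.
ABBREVIATION_TEMPLATE = "Expand the following acronym into its full form: {acronym}. Answer:"

def build_prompt(template: str, **fields: str) -> str:
    """Fill a task template with query-specific content."""
    return template.format(**fields)

prompt = build_prompt(ABBREVIATION_TEMPLATE, acronym="ESMA")
# The prompt is then sent to the LLM; the expected output for this example
# would be the expanded form, e.g. "European Securities and Markets Authority".
print(prompt)
```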
Datasets
1. Abbreviation Data:
This dataset is designed to evaluate and benchmark the performance of LLMs in understanding and generating expansions for abbreviations within the context of regulatory and compliance documentation. It provides a collection of abbreviations commonly encountered in regulatory texts, along with their full forms.
2. Definition Data:
This dataset is crafted to evaluate the capability of LLMs to accurately understand and generate definitions for terms commonly used in regulatory and compliance contexts. It includes a curated collection of key terms, along with their definitions as used in regulatory documents.
3. NER Data:
Named Entity Recognition (NER) tests an LLM's ability to identify entities and categorize them into groups. For example, if the input text mentions "the European Securities and Markets Authority," the LLM should recognize it as an organization.
4. Question Answering Data:
This dataset is designed to assess the performance of LLMs in the context of long-form question answering. It focuses on complex inquiries related to regulatory and compliance issues, challenging the models to provide detailed, accurate, and contextually relevant responses. It includes a set of questions, along with their answers as used in regulatory documents.
5. Link Retrieval Data:
The Legal Link Retrieval dataset is designed to assess an LLM's capabilities in retrieving relevant links for regulations set by the European Securities and Markets Authority (ESMA). All questions and legal references in our dataset are either publicly accessible or are modified versions of publicly accessible documents.
6. Certificate Question Data:
The Certificate Question Answering dataset is designed to evaluate the capabilities of LLMs in answering context-based certification-level questions accurately. The dataset includes mock multiple-choice questions from the CFA and CPA exams, specifically focusing on ethics and regulations. All questions in our dataset are either publicly accessible or modified versions of publicly accessible questions.
7. XBRL Question Answering Data:
XBRL (eXtensible Business Reporting Language) is a globally recognized standard for the electronic communication of business and financial data. XBRL filings are structured digital documents that contain detailed financial information. XBRLBench is a benchmark dataset meticulously curated to evaluate and enhance the capabilities of large language models (LLMs) in both accurately extracting and applying financial data. The dataset includes a diverse array of XBRL filings from companies in the Dow Jones 30 index, ensuring a broad representation of different reporting practices and industry-specific requirements. All involved XBRL filings are reported under US GAAP standards.
XBRLBench Dataset Structure:
Ticker (str): Ticker symbol of the company (e.g., AAPL, AMZN).
Name (str): Company name (e.g., Apple Inc., Amazon.com Inc).
Question (str): Specific financial question.
Answer (str): Ground-truth answer extracted from XBRL filings.
question_type (str): Type of question, either 'Accuracy Verification' or 'Detailed Breakdown'.
question_reasoning (str): Reasoning type needed to solve the question: 'data extraction', 'comparison', 'calculation', 'verification'.
doc_path (str): Unique document identifier. Path of the XBRL instance file (ending with "_htm.xml") used to load the XBRL filing from the XBRLBench GitHub repository. Typical format: XBRL_Filings/DowJones30/{ticker}-{reportEndDate}/{ticker}-{reportEndDate}_htm.xml. The provided XBRL filings follow this structure.
XBRLBench Document Information Structure:
doc_path (str): Unique Document Identifier.
doc_type (str): Type of the Document: {"10K", "10Q", "8K"}.
doc_period (int): Period of the relevant financial document.
source_link (str): SEC URL of the XBRL document, where XBRL files can be downloaded.
company (str): Company name.
company_sector_gics (str): Company sector in terms of GICS standard.
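As a rough sketch (not an official loader for XBRLBench), a filing referenced by doc_path can be parsed with a generic XML library such as lxml once the instance file has been downloaded from the GitHub repository or the source_link SEC URL. The path below is a hypothetical example that follows the documented format.

```python
# Rough sketch: loading a locally downloaded XBRL instance file and listing
# its reported US GAAP facts. Requires: pip install lxml
from lxml import etree

doc_path = "XBRL_Filings/DowJones30/AAPL-20230930/AAPL-20230930_htm.xml"  # hypothetical example path
tree = etree.parse(doc_path)
root = tree.getroot()

# XBRL facts are child elements whose namespace belongs to a reporting
# taxonomy (e.g. us-gaap); contextRef ties each fact to a period and entity.
for element in root:
    if isinstance(element.tag, str) and "us-gaap" in element.tag:
        name = etree.QName(element).localname
        context = element.get("contextRef")
        print(name, context, element.text)
```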
8. Common Domain Model (CDM) Data:
This dataset comprises a curated collection of questions and answers designed to explore various aspects of the Fintech Open Source Foundation's (FINOS) Common Domain Model (CDM). The Common Domain Model (CDM) is a standardized, machine-readable, and machine-executable data and process model for how financial products are traded and managed across the transaction lifecycle. The dataset is designed to aid in fine-tuning language models for financial product management and process understanding under the CDM.
9. Model Openness Framework (MOF) Licenses Data:
Model Openness Framework (MOF) is a comprehensive system for evaluating and classifying the completeness and openness of machine learning models. This dataset is designed to facilitate the understanding and application of the various licensing requirements outlined in the Model Openness Framework (MOF). It includes a series of questions and answers that delve into the specifics of licensing protocols for different components of machine learning model development, such as research papers, software code, and datasets.
Evaluation Metrics
We use different metrics to evaluate different tasks:
1. Abbreviation Recognition Task
Metric: Accuracy
2. Definition Recognition Task
Metric: BERTScore
3. NER Task
Metric: F1 Score
4. Question Answering Task
Metric: FActScore
5. Link Retrieval Task
Metric: Accuracy
6. Certificate Question Task
Metric: Accuracy
7. XBRL Analytics Task
Metric: Accuracy and FActScore
8. Common Domain Model (CDM) Task
Metric: FActScore
9. Model Openness Framework (MOF) Licenses Task
Metric: Accuracy and FActScore
The final score is determined by the weighted average of the metrics for the 9 tasks. We assign a weight of 10% to each of Tasks 1-5, 20% to Task 6, and 10% to each of Tasks 7-9. The final score is computed as follows, where Sᵢ represents the score for task i within [0,1] and wᵢ represents the weight assigned to task i:
Final Score = Σᵢ wᵢ · Sᵢ
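For concreteness, the following minimal sketch computes the weighted final score. Only the weights come from the description above; the per-task scores are hypothetical placeholders.

```python
# Minimal sketch of the final-score computation (per-task scores are hypothetical).
task_weights = [0.10, 0.10, 0.10, 0.10, 0.10,   # Tasks 1-5: 10% each
                0.20,                            # Task 6: 20%
                0.10, 0.10, 0.10]                # Tasks 7-9: 10% each

task_scores = [0.85, 0.72, 0.64, 0.58, 0.47,     # hypothetical scores in [0, 1]
               0.66, 0.51, 0.60, 0.70]

assert abs(sum(task_weights) - 1.0) < 1e-9       # weights sum to 100%

final_score = sum(w * s for w, s in zip(task_weights, task_scores))
print(f"Final score: {final_score:.4f}")
```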
The formulas and explanations of metrics used follow:
BERTScore: BERTScore is an automatic evaluation metric for text generation. BERTScore represents reference and candidate sentences as contextual embeddings that are computed by models like BERT. The similarity between the contextual embeddings for reference and candidate sentences is then measured using cosine similarity. Each token in the reference sentence is matched with the most similar token in the candidate sentence to compute recall. The opposite occurs to compute precision. These are used to calculate an F1 score.
Inverse document frequency is optionally used to weight rare words' importance. Baseline rescaling is then used to make the score more human readable. The rescaled score X̂ is computed from the raw score X and an empirical baseline b as follows [3]: X̂ = (X − b) / (1 − b), applied to each of precision, recall, and F1.
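As an illustration (not part of the official evaluation pipeline), the sketch below computes BERTScore with the open-source bert-score package; the candidate and reference sentences are made-up placeholders.

```python
# Minimal sketch: computing BERTScore for candidate definitions against references.
# Requires: pip install bert-score
from bert_score import score

candidates = ["A systemically important bank is a bank whose failure could trigger a financial crisis."]
references = ["A systemically important bank is one whose distress or failure would cause significant disruption to the wider financial system."]

# Returns per-sentence precision, recall, and F1 tensors; rescaling applies the baseline b described above.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```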
FActScore: FActScore (Factual precision in Atomicity Score) is an automatic fine-grained evaluation of factual precision. It breaks down generated text into a series of atomic facts, short statements that each contain one piece of information. These atomic facts are then validated against a knowledge source. This validation process uses zero-shot prompting of an evaluator language model. The process involves four different prompt construction methods (No-context LM, Retrieve→LM, Nonparametric Probability [NP], and Retrieve→LM + NP). Predictions are then made by comparing the conditional probability of the facts being True or False, as determined by the evaluator language model. If logit values are unavailable, the prediction is made based on whether the generated text contains True or False. FActScore is defined as follows [4]: for a generation y with atomic facts A_y and knowledge source C, f(y) = (1 / |A_y|) × Σ_{a ∈ A_y} 1[a is supported by C], and FActScore is the average of f(y) over the evaluated generations.
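The following minimal sketch illustrates only the aggregation step of this definition, assuming the atomic facts have already been extracted and judged by an evaluator model; it is not the official FActScore implementation, and the labels shown are hypothetical.

```python
from typing import List

def factscore(fact_supported: List[List[bool]]) -> float:
    """Aggregate FActScore over generations.

    fact_supported[i][j] is True if atomic fact j of generation i is
    supported by the knowledge source (labels assumed to come from an
    evaluator LLM, as described above).
    """
    per_generation = [
        sum(facts) / len(facts)            # f(y): fraction of supported atomic facts
        for facts in fact_supported if facts
    ]
    return sum(per_generation) / len(per_generation)

# Hypothetical labels for two generations with 3 and 2 atomic facts.
print(factscore([[True, True, False], [True, False]]))  # -> (2/3 + 1/2) / 2 ≈ 0.583
```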
Accuracy: Accuracy is the proportion of correct predictions out of the total number of predictions. It is calculated as follows:
Accuracy = Number of Correct Predictions / Total Number of Predictions
Precision: Precision is the proportion of true positive predictions out of the total predicted positives: Precision = TP / (TP + FP). For a multiclass classification task, precision is calculated for each class and then macro-averaged.
Recall: Recall is the proportion of true positive predictions out of the total actual positives: Recall = TP / (TP + FN). For a multiclass classification task, recall is calculated for each class and then macro-averaged.
F1 Score: The F1 Score is the harmonic mean of precision and recall. Depending on the task, we may compute the F1 Score using the macro-averaged precision and macro-averaged recall rather than their non-macro-averaged versions. The F1 Score is computed as follows:
F1 = 2 · (Precision · Recall) / (Precision + Recall)
Example Precision and Recall Calculation for Multiclass Classification
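Here is one such worked example, sketched in code; the labels and predictions are hypothetical and chosen only to illustrate macro-averaging, not taken from the challenge data.

```python
# Hypothetical predictions for a 3-class problem (classes: ORG, LAW, DATE).
y_true = ["ORG", "ORG", "LAW", "DATE", "DATE", "LAW"]
y_pred = ["ORG", "LAW", "LAW", "DATE", "ORG", "LAW"]

classes = sorted(set(y_true))
precisions, recalls = [], []
for c in classes:
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    precisions.append(tp / (tp + fp) if tp + fp else 0.0)   # per-class precision
    recalls.append(tp / (tp + fn) if tp + fn else 0.0)      # per-class recall

macro_p = sum(precisions) / len(classes)
macro_r = sum(recalls) / len(classes)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
print(f"Macro precision={macro_p:.3f}, recall={macro_r:.3f}, F1={macro_f1:.3f}")
```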
[1] Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, et al. (2024). FinBen: A holistic financial benchmark for large language models. (https://arxiv.org/abs/2402.12659)
[2] Zhiyu Cao and Zachary Feinstein (2024). Large Language Model in Financial Regulatory Interpretation. (https://arxiv.org/abs/2405.06808v1)
[3] Tianyi Zhang et al. (2020). BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations. (https://arxiv.org/abs/1904.09675)
[4] Sewon Min et al. (2023). FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251. (https://arxiv.org/abs/2305.14251)
Registration
We welcome students, researchers, and industry practitioners who are passionate about finance and LLMs to participate in this challenge.
Please register here. Please choose a unique team name and ensure that all team members provide their full names, emails, institutions, and the team name. Every team member should register using the same team name. We encourage you to use your institutional email to register.
Important Dates
Training Set Release: September 15, 2024
Training Data Details: Summary of Question Dataset
Validation Set Release: October 31, 2024
Validation question dataset: here
Systems Submission: November 13, 2024 (extended from November 7, 2024)
Submission: https://softconf.com/coling2025/FinNLP25/
Testing dataset: here
Release of Results: November 21, 2024 (originally November 12, 2024)
Paper Submission Deadline: November 28, 2024 (extended from November 25, 2024)
Submission: https://softconf.com/coling2025/FinNLP25/ (Regulations Challenge Track)
Notification of Acceptance: December 5, 2024
Camera-ready Paper Deadline: December 13, 2024
Workshop Date: January 19-20, 2025
Paper Submission
The ACL Template MUST be used for your submission(s). The main text is limited to 8 pages. The appendix is unlimited and placed after references.
The paper title format is fixed: "[Model or Team Name] at the Regulations Challenge Task: [Title]".
The reviewing process will be single-blind. Accepted papers will be published in the proceedings in the ACL Anthology.
Shared task participants may be asked to review other teams' papers during the review period.
Submissions must be in electronic form using the paper submission software linked above.
At least one author of each accepted paper must register and present their work in person at the FinNLP-FNP-LLMFinLegal workshop; papers with a “No Show” may be redacted. Authors will be required to agree to this requirement at the time of submission, as it is a rule for all COLING 2025 workshops.
Task Organizers
Keyi Wang, Columbia University, Northwestern University
Lihang (Charlie) Shen, Columbia University
Haoqiang Kang, Columbia University
Xingjian Zhao, Rensselaer Polytechnic Institute
Namir Xia, Rensselaer Polytechnic Institute
Christopher Poon, Rensselaer Polytechnic Institute
Jaisal Patel, Rensselaer Polytechnic Institute
Andy Zhu, Rensselaer Polytechnic Institute
Shengyuan Lin, Rensselaer Polytechnic Institute
Daniel Kim, Rensselaer Polytechnic Institute
Jaswanth Duddu, Rensselaer Polytechnic Institute
Matthew Tavares, Rensselaer Polytechnic Institute
Shanshan Yang, Stevens Institute of Technology
Sai Gonigeni, Stevens Institute of Technology
Kayli Gregory, Stevens Institute of Technology
Katie Ng, Stevens Institute of Technology
Andrew Thomas, Stevens Institute of Technology
Dong Li, FinAI
Supervisors
Yanglet Xiao-Yang Liu, Rensselaer Polytechnic Institute, Columbia University
Steve Yang, School of Business at Stevens Institute of Technology
Kairong Xiao, Roger F. Murray Associate Professor of Business at Columbia Business School
Matt White, Executive Director, PyTorch Foundation. GM of AI, Linux Foundation
Cailean Osborne, University of Oxford
Wes Turner, Rensselaer Center for Open Source (RCOS), Rensselaer Polytechnic Institute
Neha Keshan, Rensselaer Polytechnic Institute
Luca Borella, PM of AI Strategic Initiative, FINOS Ambassador, Linux Foundation
Karl Moll, Technical Project Advocate, FINOS, Linux Foundation
Related Workshops
International Workshop on Multimodal Financial Foundation Models (MFFMs) @ ICAIF'24
The workshop is dedicated to advancing the integration and utility of Generative AI in finance with an emphasis on reproducibility, transparency, and usability. Multimodal Financial Foundation Models (MFFMs) emerge as critical tools capable of handling complex and dynamic financial data from a variety of sources. This event is a collaborative initiative by Columbia University, Oxford University, and the Linux Foundation. It aims to tackle significant challenges including model cannibalism and openwashing, striving to set new standards for ethical deployment and the development of transparent, reliable financial models.
Contact
Contestants can ask questions on Discord in the #coling2025-finllm-workshop channel.
Contact email: colingregchallenge2025@gmail.com