Organizations invest substantial resources in training: staff time, money for instructors or platforms, and the opportunity cost of time staff spend in training rather than on other work. Without systematic evaluation of training effectiveness, organizations cannot determine whether these investments are worthwhile or how to improve training. Evaluation answers critical questions: Did learners actually learn? Did they change their behavior? Is the organization better off because of the training? How can we improve training for next time?
This lesson explores the Kirkpatrick Model—a foundational framework for training evaluation—and provides practical guidance for evaluating AI training effectiveness in nonprofit contexts.
Donald Kirkpatrick's Four Levels of Training Evaluation provides a comprehensive framework for assessing training at multiple levels, from immediate learner reaction to long-term organizational impact. Understanding and applying this model ensures training evaluation is thorough and useful.
Level 1: Reaction measures learner satisfaction and engagement. Did learners find the training valuable? Was the content relevant to their work? Were the materials clear? Was the facilitator engaging? Reaction evaluation uses surveys asking learners to rate their satisfaction with training. While reaction data alone isn't sufficient to determine training effectiveness, it provides useful information: if learners didn't enjoy training or find it relevant, they're unlikely to have learned or applied what was taught.
Level 2: Learning measures whether learners actually acquired the knowledge or skills the training was designed to teach. Did participants understand AI concepts explained in training? Can they explain how AI systems work? Do they understand the governance implications of AI adoption? Learning is typically assessed through knowledge tests, quizzes, case study analysis, or skill demonstrations. Learning evaluation answers whether the training content was actually learned.
Level 3: Behavior measures whether learners actually use what they learned in their jobs. Did grant professionals who learned about AI governance return to their organizations and implement AI governance practices? Are they considering AI applications thoughtfully rather than reactively? Are they managing data responsibly? Behavior evaluation addresses the most critical question: does training actually change what people do? Many training programs fail at this level: learners find training interesting and learn the content, but don't change their actual behavior.
Level 4: Results measures organizational outcomes resulting from training. Did the nonprofit's AI governance improve? Did the organization make better AI implementation decisions? Did staff confidence and competence increase? Did the organization avoid AI-related failures? Results evaluation looks at whether training contributes to organizational goals. This level is challenging to evaluate because many factors affect organizational outcomes, making it hard to isolate training's specific impact. However, it's the most important level because ultimately, training only matters if it contributes to organizational success.
Kirkpatrick's Four Levels of Training Evaluation assess training at multiple levels: Level 1 (Reaction) measures satisfaction; Level 2 (Learning) measures knowledge and skill acquisition; Level 3 (Behavior) measures whether learners apply what they learned; Level 4 (Results) measures organizational outcomes. Comprehensive evaluation addresses all four levels.
Before designing evaluation, clarify what training is intended to accomplish and define success. Training objectives should be specific and measurable: not "Train grant professionals about AI" but "Enable grant professionals to explain three AI applications relevant to grant evaluation and identify governance considerations for each." Clear objectives make evaluation straightforward: design assessment to determine whether this specific objective has been met.
Success criteria define what level of achievement constitutes success. Is training successful if 80% of learners pass the knowledge test? If 70%? If 100%? If 50% of learners implement what they learned within six months? Establishing these criteria before training helps focus evaluation and prevents moving the goalposts later. Different organizations and training programs may have different success criteria; the key is being explicit about what success looks like.
Comparing learner knowledge and skills before and after training shows what change the training produces. Pre-assessment (testing knowledge before training) establishes a baseline; post-assessment (testing after training) measures what changed. The difference between pre and post results demonstrates learning gains. For example, pre-assessment might find that grant professionals can explain AI in only vague terms; post-assessment might show they can now explain three specific AI techniques and their applications.
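To make the pre/post comparison concrete, here is a minimal sketch in Python; the learner names and scores are illustrative placeholders, and in practice they would come from your own assessments administered before and after training.

```python
# Minimal sketch: comparing pre- and post-training assessment scores.
# Scores are hypothetical percentages; replace with your own test results.

pre_scores = {"learner_a": 40, "learner_b": 55, "learner_c": 35}   # percent correct before training
post_scores = {"learner_a": 75, "learner_b": 80, "learner_c": 70}  # percent correct after training

# Gain for each learner is simply post-score minus pre-score.
gains = {name: post_scores[name] - pre_scores[name] for name in pre_scores}
average_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: {pre_scores[name]} -> {post_scores[name]} (gain {gain} points)")
print(f"Average learning gain: {average_gain:.1f} points")
```

Even a simple table like this, shared alongside the success criteria defined before training, makes it easy to see whether the learning objective was met.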
Pre-assessment also provides valuable information for tailoring training. If pre-assessment reveals that some learners already have significant AI knowledge while others are complete novices, trainers can differentiate instruction. Learners with prior knowledge might move to advanced topics while others get foundational content.
Assessing learning (Level 2 evaluation) requires measures of knowledge and skill. Knowledge Tests can use multiple-choice, short-answer, or essay formats to assess understanding of concepts. For AI training, knowledge tests might include items such as "Explain the difference between supervised and unsupervised learning" or "Describe the steps a nonprofit should take when implementing an AI system." Multiple-choice tests are quick to grade but may not fully capture understanding; essay or short-answer questions provide a richer assessment of depth of understanding.
Skill Demonstrations assess whether learners can actually do what they've learned. Rather than asking "Explain how to evaluate an AI tool," skill assessment might ask "Evaluate this AI tool using the framework presented in training." Demonstrations might include: working through a case study, analyzing a sample situation, completing a practical exercise, or presenting their thinking to peers.
Portfolios of work collected during and after training show learner growth over time. A learner might collect samples of their thinking about AI at the beginning, middle, and end of training, demonstrating growing sophistication and understanding.
Assessing whether learners change their behavior (Level 3) is challenging because behavior change is influenced by many factors beyond training. However, several strategies help assess behavioral change: (1) 360-Degree Feedback: Gathering feedback from peers, supervisors, and others about whether learners' behavior has changed; (2) Self-Report: Asking learners what they're doing differently; (3) Observation: Direct observation of work or decision-making processes; (4) Work Products: Examining actual outputs (e.g., AI governance policies, evaluation designs) to assess whether they reflect training content; (5) Interviews or Focus Groups: Discussing with learners and peers how they're applying training; (6) Organizational Records: Reviewing whether documented processes or decisions have changed.
Behavioral assessment typically occurs weeks or months after training (unlike knowledge assessment which can happen immediately after). This delay allows time for learners to return to work and attempt to apply what they learned. Organizations must commit to following up on training to assess behavior change rather than considering training complete when the final session ends.
Assessing organizational results (Level 4) is most challenging because attributing organizational outcomes to training requires isolating training's effect from other factors. Strategies for results assessment include: (1) Pre-Post Organizational Metrics: Measuring relevant organizational metrics before and after training (e.g., quality of AI decisions, number of documented governance policies, staff confidence with AI); (2) Comparison Groups: If possible, comparing organizations or departments that received training to similar ones that didn't; (3) Longitudinal Tracking: Following organizational performance over time to see if improvements are sustained; (4) Qualitative Description: Documenting how the organization has changed as a result of improved staff capacity; (5) Return on Investment Calculation: Estimating the financial value created by training (improved decisions, avoided failures, increased efficiency).
Comprehensive evaluation combines quantitative and qualitative methods. Quantitative Methods (surveys with numeric ratings, tests with right/wrong answers, counts of behaviors) provide numerical data enabling comparison and statistical analysis. Qualitative Methods (interviews, focus groups, observation, open-ended survey responses) provide rich description and explanation of what's happening and why.
Mixed methods evaluation might include: a quantitative knowledge test (Level 2) showing learning gains, qualitative interviews (Level 3) describing how learners are applying training, and quantitative metrics (Level 4) showing organizational improvement alongside qualitative description of changes. Together, these methods provide a comprehensive picture of training effectiveness.
Beyond formal evaluation, continuous collection of feedback enables training improvement. Mechanisms for gathering feedback include: (1) Post-Training Surveys: Quick surveys asking what learners found valuable and what should change; (2) Focus Groups: Facilitated discussion with groups of learners about training; (3) Suggestion Mechanisms: Inviting ongoing feedback through email or online forms; (4) Informal Discussion: Conversations with learners, facilitators, and organizational leaders about what's working; (5) Observation: Trainers observing learner engagement and adjusting in real-time.
Training impact fades over time if not reinforced. Evaluating longer-term retention and impact requires: (1) Follow-Up Assessments: Re-assessing knowledge and behavior weeks or months after training; (2) Longitudinal Studies: Tracking learners over extended periods; (3) Booster Sessions: Periodic refresher training to reinforce concepts; (4) Ongoing Support: Providing resources and support continuing beyond initial training; (5) Organizational Culture: Creating culture that values and reinforces what training taught.
Research on learning shows that without reinforcement, people forget much of what they learned. Training effectiveness depends partly on what happens after training: Do managers reinforce concepts? Does the organization change policies and practices to reflect training content? Are learners part of communities supporting continued learning?
Comparing training effectiveness across different offerings, formats, or trainers helps identify what works best. Comparative questions include: Is online training as effective as in-person for AI learning? Do shorter, focused workshops produce better behavior change than longer programs? Do different trainer styles affect learning outcomes? Collecting comparable data across training variations enables these comparisons and helps optimize training design.
In resource-constrained nonprofits, cost-effectiveness matters. ROI analysis compares training costs to the value created: Did improved grant decisions resulting from AI training create more funding? Did better data governance reduce compliance risks? Did improved staff competence reduce turnover? While quantifying value is challenging, the basic approach involves: (1) calculating training costs (facilitator time and fees, platform costs, participant time), (2) estimating the value created (improved decisions, cost savings, prevented failures, increased revenue), and (3) calculating the return (value divided by cost gives a benefit-cost ratio; value minus cost, divided by cost, gives ROI).
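As a rough illustration of these three steps, the sketch below uses Python with entirely hypothetical cost and value figures; the line items and dollar amounts are placeholders to replace with your organization's own estimates.

```python
# Rough ROI sketch with hypothetical figures; substitute your own estimates.

# Step 1: total the training costs.
costs = {
    "facilitator_fees": 3000,
    "platform_and_materials": 500,
    "participant_time": 4500,   # e.g., 15 staff x 6 hours x $50/hour (illustrative)
}
total_cost = sum(costs.values())

# Step 2: estimate the value created.
value = {
    "improved_grant_decisions": 6000,
    "avoided_compliance_failures": 2500,
    "efficiency_gains": 1500,
}
total_value = sum(value.values())

# Step 3: calculate the return.
benefit_cost_ratio = total_value / total_cost           # value divided by cost
roi = (total_value - total_cost) / total_cost           # net return per dollar spent

print(f"Total cost: ${total_cost:,}   Total value: ${total_value:,}")
print(f"Benefit-cost ratio: {benefit_cost_ratio:.2f}   ROI: {roi:.0%}")
```

The point is not precision; it is making the estimates and assumptions explicit so leadership can judge whether the training investment was worthwhile.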
Several tools support evaluation: (1) Surveys: Online surveys collecting reaction feedback and self-reported behavior change; (2) Knowledge Tests: Pre-designed or custom tests assessing learning; (3) Interview Guides: Structured guides for conducting interviews about training application; (4) Focus Group Protocols: Scripts for facilitating group discussions; (5) Observation Checklists: Lists of behaviors to observe indicating behavior change; (6) Portfolio Templates: Guides for collecting work samples demonstrating learning. Organizations can create custom tools aligned with their specific training objectives.
Evaluation is only valuable if results are used. Reporting results in accessible ways increases their use: (1) Summary Reports: Brief documents summarizing key findings and recommendations; (2) Data Visualizations: Charts and graphs showing learning gains, behavior change, or organizational impact; (3) Case Studies: Stories of specific learners showing how they applied training; (4) Presentations: Sharing findings with organizational leadership and staff; (5) Action Plans: Specific recommendations for training improvement with assigned responsibility.
Design evaluation for an AI training program you're planning or delivering. Determine your training objectives (what should learners be able to do after training?). For each level of Kirkpatrick's model, identify how you'll evaluate: Level 1 (Reaction) - what will you ask learners about their satisfaction?; Level 2 (Learning) - how will you assess whether learners acquired intended knowledge?; Level 3 (Behavior) - how will you determine whether learners apply training?; Level 4 (Results) - what organizational outcomes would indicate training success? Create evaluation instruments (survey, test, interview guide) for at least one level.
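If it helps to organize your thinking, the sketch below shows one possible way to outline such a plan across the four levels as a simple Python data structure; the instruments, timings, and success criteria shown are placeholders, not prescriptions.

```python
# One possible outline of an evaluation plan across Kirkpatrick's four levels.
# Every instrument, timing, and criterion here is a placeholder to adapt.

evaluation_plan = {
    "level_1_reaction": {
        "instrument": "post-session satisfaction survey",
        "timing": "end of each session",
        "success_criterion": "average rating of 4 out of 5 or higher",
    },
    "level_2_learning": {
        "instrument": "pre/post knowledge test on AI concepts and governance",
        "timing": "first and last sessions",
        "success_criterion": "average gain of 20+ percentage points",
    },
    "level_3_behavior": {
        "instrument": "follow-up interviews and review of work products",
        "timing": "three months after training",
        "success_criterion": "half of learners applying at least one practice",
    },
    "level_4_results": {
        "instrument": "organizational metrics (e.g., documented governance policies)",
        "timing": "6-12 months after training",
        "success_criterion": "AI governance policy adopted and in use",
    },
}

for level, plan in evaluation_plan.items():
    print(f"{level}: {plan['instrument']} ({plan['timing']})")
```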
Continue building your expertise in AI governance, standards, and nonprofit leadership with the CAGP Level 5 certification program.
Explore the Program