Agent Evaluation Intern

Description

We are hiring an intern to work on evaluation and reliability infrastructure for a real-world LLM agent system in the user acquisition (UA) performance marketing field. The agent performs multi-step reasoning, retrieves context, selects tools, executes actions, handles user confirmations, and interacts with external services.

The goal of this internship is to build transferable expertise in agent evaluation engineering: evaluating tool use, measuring trajectory quality, designing benchmarks, analyzing traces, comparing model and prompt variants, and improving the reliability of agentic AI systems.

This role is ideal for someone interested in future opportunities in LLM agent evaluation, AI safety evaluation, research engineering, LLMOps, or applied AI infrastructure.

Responsibilities

Research state-of-the-art frameworks for evaluating agentic workflows, in both industry and academic research.
Apply that research to build automated evaluation pipelines that can run agent scenarios, capture execution artifacts, score results, and detect regressions.
Evaluate tool-use behavior, including whether the agent selects the right tool, passes correct arguments, avoids unnecessary calls, and handles tool errors appropriately.
Analyze agent trajectories using traces, logs, intermediate steps, and final outputs to identify reasoning failures, context misuse, hallucinated assumptions, and brittle workflow patterns.
Design metrics for agent reliability, including success rate, tool-call precision, argument accuracy, recovery rate, retry count, latency, cost, and safety-related failure rates.
Create reusable evaluation datasets from synthetic cases, golden workflows, and real anonymized executions.
Support experiments comparing prompts, model providers, tool descriptions, memory strategies, context construction methods, and execution modes.
Help build human evaluation workflows and rubrics for judging agent correctness, faithfulness, usefulness, and risk awareness.
Work with engineers to translate evaluation findings into better tests, monitoring signals, tool interfaces, prompts, and guardrails.
Potentially write research papers and submit them for publication at scientific conferences.

Requirements

Currently pursuing, or recently graduated from, a Master’s or PhD program in Computer Science, Artificial Intelligence, Machine Learning, Software Engineering, Data Science, or a related field.
Strong Python fundamentals and interest in AI systems.
Curious about how LLM agents work, fail, and improve.
Interested in evaluation methodology, not just application building.
Comfortable reading logs, traces, test cases, and structured data.
Detail-oriented and able to define clear, measurable criteria for ambiguous agent behavior.
Prior experience with LLMs, LangChain-like agents, tool calling, pytest, data analysis, or observability tools is helpful but not required.
