Master AI — Complete AI Engineering Guide¶

A comprehensive, interview-ready guide to building AI-powered Java applications. Covers foundational theory, frameworks, production patterns, and hands-on projects.

Advanced · 4-6 weeks

Prerequisites

Complete Java Basics, OOP, and Java Backend modules. Familiarity with REST APIs and Spring Boot is assumed.

Part 1 — Understanding LLMs¶

What is a Large Language Model?¶

A Large Language Model (LLM) is a neural network trained on massive text datasets to predict the next token (word/subword) in a sequence. Through this simple objective, LLMs learn grammar, facts, reasoning patterns, and coding ability.

Key concepts:

Concept	Description
Transformer	The neural network architecture behind all modern LLMs (GPT, Claude, Llama, Gemini)
Token	A subword unit. "Hello world" → ["Hello", " world"] (2 tokens). ~4 chars per token on average
Context window	Maximum tokens the model can process at once (e.g., GPT-4o: 128K, Claude 3.5: 200K)
Temperature	Controls randomness. 0 ≈ greedy decoding (near-deterministic); higher values (e.g., 1) give more varied output
Top-p (Nucleus Sampling)	Sample only from the smallest set of most-probable tokens whose cumulative probability reaches `p`
Top-k	Only consider the `k` most probable next tokens
System prompt	Instructions that define the model's behavior/persona
Embeddings	Dense vector representations of text enabling semantic similarity search
Inference	The process of generating output from a trained model
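
The sampling controls in the table can be sketched in plain Java. This is a minimal illustration, not how any particular provider implements decoding; the class and method names (`SamplingSketch`, `softmax`, `topK`, `topP`) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SamplingSketch {

    /** Softmax over logits divided by temperature (temperature must be > 0). */
    static double[] softmax(double[] logits, double temperature) {
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l / temperature);
        double sum = 0.0;
        double[] probs = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            // Subtract the max before exponentiating for numerical stability
            probs[i] = Math.exp(logits[i] / temperature - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }

    /** Token indices sorted by descending probability. */
    private static List<Integer> ranked(double[] probs) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) idx.add(i);
        idx.sort(Comparator.comparingDouble((Integer i) -> -probs[i]));
        return idx;
    }

    /** Top-k: keep the k most probable token indices. */
    static List<Integer> topK(double[] probs, int k) {
        List<Integer> idx = ranked(probs);
        return new ArrayList<>(idx.subList(0, Math.min(k, idx.size())));
    }

    /** Top-p: keep the smallest ranked prefix whose cumulative probability reaches p. */
    static List<Integer> topP(double[] probs, double p) {
        List<Integer> kept = new ArrayList<>();
        double cumulative = 0.0;
        for (int i : ranked(probs)) {
            kept.add(i);
            cumulative += probs[i];
            if (cumulative >= p) break;
        }
        return kept;
    }
}
```

A real decoder would renormalize the kept probabilities and draw one token at random from them; lowering the temperature concentrates probability mass on the top-ranked token.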

How a Transformer Works (Simplified)¶

Input Text
    ↓
1. Tokenization (BPE / SentencePiece)
    ↓  "Hello world" → [15496, 995]
2. Token Embedding + Positional Encoding
    ↓  Each token → dense vector (d=768 to 12288)
3. Self-Attention (Multi-Head)
    ↓  Each token attends to other tokens (in decoder-only LLMs, only earlier ones)
    ↓  Q×K^T / √d_k → softmax → × V
4. Feed-Forward Network
    ↓  Per-token non-linear transformation
5. Repeat steps 3-4 (12 to 96+ layers)
    ↓
6. Final Linear + Softmax → probability distribution
    ↓
7. Sample next token → append → repeat

Tokenization methods:

Method	Description	Used By
BPE (Byte Pair Encoding)	Iteratively merges most frequent character pairs	GPT, Llama
SentencePiece	Language-agnostic subword tokenizer	T5, Gemini
WordPiece	Similar to BPE but uses likelihood instead of frequency	BERT

Self-Attention Mechanism¶

Self-attention allows each token to "look at" every other token in the sequence to understand context.

For each token:
  Q (Query)  = token × W_Q    "What am I looking for?"
  K (Key)    = token × W_K    "What do I contain?"
  V (Value)  = token × W_V    "What do I provide?"

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

Multi-head attention runs multiple attention computations in parallel (e.g., 12-96 heads), each learning different relationships (syntax, semantics, coreference, etc.).
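
The attention formula above can be traced with a small single-head sketch in plain Java. It assumes row-major matrices (one row per token) and omits the learned projections, the causal mask, and multi-head splitting; `AttentionSketch` is a name invented for this example.

```java
final class AttentionSketch {

    /** Row-wise softmax; subtracts each row's max for numerical stability. */
    static double[][] softmaxRows(double[][] scores) {
        double[][] out = new double[scores.length][];
        for (int i = 0; i < scores.length; i++) {
            double max = Double.NEGATIVE_INFINITY;
            for (double s : scores[i]) max = Math.max(max, s);
            double sum = 0.0;
            out[i] = new double[scores[i].length];
            for (int j = 0; j < scores[i].length; j++) {
                out[i][j] = Math.exp(scores[i][j] - max);
                sum += out[i][j];
            }
            for (int j = 0; j < out[i].length; j++) out[i][j] /= sum;
        }
        return out;
    }

    /** Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head. */
    static double[][] attention(double[][] q, double[][] k, double[][] v) {
        int dk = k[0].length;
        double[][] scores = new double[q.length][k.length];
        for (int i = 0; i < q.length; i++) {
            for (int j = 0; j < k.length; j++) {
                double dot = 0.0;
                for (int c = 0; c < dk; c++) dot += q[i][c] * k[j][c];
                scores[i][j] = dot / Math.sqrt(dk); // scale by sqrt(d_k)
            }
        }
        double[][] weights = softmaxRows(scores);
        double[][] out = new double[q.length][v[0].length];
        for (int i = 0; i < q.length; i++) {
            for (int j = 0; j < v.length; j++) {
                for (int c = 0; c < v[0].length; c++) {
                    out[i][c] += weights[i][j] * v[j][c]; // weighted sum of value rows
                }
            }
        }
        return out;
    }
}
```

When all scores are equal, every output row is simply the average of the value rows; when one key matches a query much better than the others, the output is dominated by that key's value.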

Model Comparison¶

Model	Provider	Context	Strengths	API Cost (approx)
GPT-4o	OpenAI	128K	General reasoning, coding, multimodal	$2.50/$10 per 1M tokens (in/out)
GPT-4o-mini	OpenAI	128K	Cost-effective, fast	$0.15/$0.60 per 1M tokens
Claude 3.5 Sonnet	Anthropic	200K	Long-context, nuanced writing, coding	$3/$15 per 1M tokens
Gemini 2.0 Flash	Google	1M	Massive context, multimodal, fast	$0.10/$0.40 per 1M tokens
Llama 3.1 (70B)	Meta	128K	Open-weight, self-hosted, no API cost	Free (compute cost only)
Mistral Large	Mistral	128K	Open-weight, strong reasoning	$2/$6 per 1M tokens

Inference Optimization¶

Technique	Description	Benefit
Quantization	Reduce weight precision (FP16 → INT8/INT4)	2-4× smaller model, faster inference
KV Cache	Cache key-value pairs from previous tokens	Avoids redundant computation during generation
Speculative Decoding	Small model drafts, large model verifies	2-3× faster generation
Batching	Process multiple requests simultaneously	Higher throughput
Distillation	Train a smaller model to mimic a larger one	Smaller, faster model with similar quality
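
To make the quantization row concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Java. It is a simplified illustration under the assumption of a single scale per weight array; production libraries typically use per-channel or per-group scales and calibrated ranges. `QuantizationSketch` is a made-up name.

```java
final class QuantizationSketch {

    /** Scale that maps the largest-magnitude weight to 127. */
    static float scaleFor(float[] weights) {
        float maxAbs = 0f;
        for (float w : weights) maxAbs = Math.max(maxAbs, Math.abs(w));
        return maxAbs == 0f ? 1f : maxAbs / 127f;
    }

    /** Rounds each weight to the nearest multiple of scale, stored as one signed byte. */
    static byte[] quantize(float[] weights, float scale) {
        byte[] q = new byte[weights.length];
        for (int i = 0; i < weights.length; i++) {
            int rounded = Math.round(weights[i] / scale);
            q[i] = (byte) Math.max(-127, Math.min(127, rounded)); // clamp to the INT8 range used
        }
        return q;
    }

    /** Recovers approximate float weights from the INT8 values. */
    static float[] dequantize(byte[] q, float scale) {
        float[] w = new float[q.length];
        for (int i = 0; i < q.length; i++) w[i] = q[i] * scale;
        return w;
    }
}
```

Each weight now occupies one byte instead of two (FP16) or four (FP32), at the price of a rounding error of at most half the scale per weight.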

Part 2 — Prompt Engineering¶

Why Prompt Engineering Matters¶

The same model can produce vastly different outputs depending on how you prompt it. Prompt engineering is the skill of crafting inputs to get reliable, high-quality outputs. It is usually the cheapest lever to try first, because it requires no training and no extra infrastructure.

Prompting Techniques¶

Technique	Description	When to Use
Zero-shot	No examples, just the instruction	Simple tasks the model already understands
Few-shot	Provide 2-5 examples before the question	Pattern-following, classification, formatting
Chain-of-Thought (CoT)	Ask model to "think step by step"	Math, logic, multi-step reasoning
Self-Consistency	Generate multiple CoT paths, take majority vote	Improved accuracy on reasoning tasks
ReAct	Reason + Act — interleave thinking with tool use	Agent-based systems
Tree-of-Thought	Explore multiple reasoning branches	Complex problem solving
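
Self-consistency from the table above reduces to "sample several answers, keep the most common one". A minimal sketch, assuming the final answer has already been extracted from each chain-of-thought response; the `Supplier<String>` stands in for a model call at temperature > 0 and is hypothetical, as is the class name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

final class SelfConsistencySketch {

    /** Returns the most frequent (normalized) answer; ties go to the answer seen first. */
    static String majorityVote(List<String> answers) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String a : answers) {
            counts.merge(a.trim().toLowerCase(), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best; // null when no answers were given
    }

    /** Draws n answers from the sampler (a stand-in for a model call) and votes. */
    static String solve(Supplier<String> sampler, int n) {
        List<String> answers = new ArrayList<>();
        for (int i = 0; i < n; i++) answers.add(sampler.get());
        return majorityVote(answers);
    }
}
```

The technique only helps when answers can be compared for equality, which is why it is mostly used for math, classification, and other short-answer tasks.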

Prompt Templates in Spring AI¶

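// Using Spring AI's PromptTemplate (Spring AI 0.8.x API)
import java.util.Map;

import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.stereotype.Service;

@Service
public class StructuredPromptService {
    private final ChatClient chatClient;

    public StructuredPromptService(ChatClient chatClient) {
        this.chatClient = chatClient;
    }
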
    // Few-shot prompt template
    private static final String FEW_SHOT_TEMPLATE = """
        Classify the following customer message into one of these categories:
        - billing
        - technical
        - general

        Examples:
        Message: "My payment failed" → Category: billing
        Message: "App crashes on login" → Category: technical
        Message: "What are your hours?" → Category: general

        Message: "{userMessage}" → Category:
        """;

    public String classifyMessage(String userMessage) {
        PromptTemplate template = new PromptTemplate(FEW_SHOT_TEMPLATE);
        Prompt prompt = template.create(Map.of("userMessage", userMessage));
        return chatClient.call(prompt).getResult().getOutput().getContent();
    }

    // Chain-of-Thought prompt
    private static final String COT_TEMPLATE = """
        You are a senior Java developer. Analyze the following code for bugs.

        Think step by step:
        1. First, understand what the code is trying to do
        2. Check for null pointer risks
        3. Check for concurrency issues
        4. Check for resource leaks
        5. Provide your final assessment

        Code:
        ```java
        {code}
        ```

        Analysis:
        """;

    public String analyzeCode(String code) {
        PromptTemplate template = new PromptTemplate(COT_TEMPLATE);
        Prompt prompt = template.create(Map.of("code", code));
        return chatClient.call(prompt).getResult().getOutput().getContent();
    }
}

System Prompt Best Practices¶

Principle	Example
Be specific about role	"You are a senior Java backend engineer with 10 years of experience"
Define output format	"Respond in JSON with fields: category, confidence, explanation"
Set constraints	"Only use information from the provided context. If unsure, say 'I don't know'"
Provide examples	Include 2-3 examples of desired input/output pairs
Specify what NOT to do	"Do not make up information. Do not include code you haven't verified"

Part 3 — Spring AI Framework¶

Overview¶

Spring AI provides a unified, Spring-native API for integrating AI models. It abstracts provider-specific implementations so you can swap between OpenAI, Anthropic, Ollama, etc. with configuration changes only.

Setup¶

<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    <version>0.8.0</version>
</dependency>

<!-- For local models via Ollama -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
    <version>0.8.0</version>
</dependency>

Configuration¶

# application.yml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o-mini
          temperature: 0.7
          max-tokens: 2048

    # Alternative: Ollama (local, no API key)
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3

    # Retry configuration
    retry:
      max-attempts: 3
      backoff:
        initial-interval: 1000
        multiplier: 2
        max-interval: 10000

Simple Chat Service¶

import java.util.List;

import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.stereotype.Service;

@Service
public class AIChatService {
    private final ChatClient chatClient;

    public AIChatService(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    public String chat(String userInput) {
        Prompt prompt = new Prompt(List.of(
            new SystemMessage("You are a helpful Java programming assistant."),
            new UserMessage(userInput)
        ));
        return chatClient.call(prompt).getResult().getOutput().getContent();
    }
}

// REST Controller
@RestController
@RequestMapping("/api/chat")
public class ChatController {
    private final AIChatService chatService;

    public ChatController(AIChatService chatService) {
        this.chatService = chatService;
    }

    @PostMapping
    public Map<String, String> chat(@RequestBody Map<String, String> request) {
        String response = chatService.chat(request.get("message"));
        return Map.of("response", response);
    }
}

Structured Output Parsing¶

// Define your output structure
public record MovieRecommendation(
    String title,
    int year,
    String genre,
    double rating,
    String reason
) {}

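@Service
public class MovieService {
    private final ChatClient chatClient;
    // Reuse one mapper: ObjectMapper is thread-safe once configured
    private final ObjectMapper mapper = new ObjectMapper();

    public MovieService(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    public List<MovieRecommendation> getRecommendations(String preferences) {
        String prompt = """
            Based on these preferences: %s
            Recommend 3 movies. For each, provide:
            - title, year, genre, rating (out of 10), and reason.
            Respond with a JSON array only, no markdown fences or commentary.
            """.formatted(preferences);

        String response = chatClient.call(new Prompt(prompt))
            .getResult().getOutput().getContent();

        // Parse JSON response into typed objects
        try {
            return mapper.readValue(response,
                new TypeReference<List<MovieRecommendation>>() {});
        } catch (JsonProcessingException e) {
            // Models occasionally return malformed JSON; surface it instead of failing silently
            throw new IllegalStateException("Model did not return valid JSON: " + response, e);
        }
    }
}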

Streaming Responses¶

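@Service
public class StreamingChatService {
    // In Spring AI 0.8.x, streaming lives on StreamingChatClient rather than ChatClient
    private final StreamingChatClient chatClient;

    public StreamingChatService(StreamingChatClient chatClient) {
        this.chatClient = chatClient;
    }

    // For real-time streaming (SSE)
    public Flux<String> streamChat(String userInput) {
        Prompt prompt = new Prompt(List.of(
            new SystemMessage("You are a helpful assistant."),
            new UserMessage(userInput)
        ));

        // mapNotNull drops empty chunks; Flux.map would fail on a null content value
        return chatClient.stream(prompt)
            .mapNotNull(response -> response.getResult() == null
                ? null
                : response.getResult().getOutput().getContent());
    }
}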

@RestController
public class StreamController {
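    private final StreamingChatService streamService;

    public StreamController(StreamingChatService streamService) {
        this.streamService = streamService;
    }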

    @GetMapping(value = "/api/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> streamChat(@RequestParam String message) {
        return streamService.streamChat(message);
    }
}

Part 4 — LangChain4j¶

What is LangChain4j?¶

LangChain4j is a Java library for building LLM-powered applications. It is inspired by LangChain but is an independent project, not an official port. It provides abstractions for:

Model interactions (chat, completion, embedding)
Memory management (conversation history)
Chains (composable pipelines)
Agents (autonomous tool-using systems)
RAG (retrieval-augmented generation)

Setup and Basic Usage¶

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j</artifactId>
    <version>0.28.0</version>
</dependency>
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-open-ai</artifactId>
    <version>0.28.0</version>
</dependency>

ChatLanguageModel model = OpenAiChatModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName("gpt-4o-mini")
    .build();

String response = model.generate("Explain Java Streams in 3 sentences.");
System.out.println(response);

Chat with Memory¶

List<ChatMessage> messages = new ArrayList<>();
messages.add(new SystemMessage("You are a Java tutor."));
messages.add(new UserMessage("What is a HashMap?"));

// In LangChain4j 0.28.0, generate(List<ChatMessage>) returns Response<AiMessage>
Response<AiMessage> response1 = model.generate(messages);
messages.add(response1.content());

// Follow-up: the model itself is stateless, so context comes from resending the full history
messages.add(new UserMessage("How does it handle collisions?"));
Response<AiMessage> response2 = model.generate(messages);

AI Services (Declarative Interface)¶

// Define your AI service as a Java interface
interface JavaTutor {

    @SystemMessage("You are a patient Java tutor. Explain concepts simply with examples.")
    String explain(@UserMessage String concept);

    @SystemMessage("You are a code reviewer. Be constructive.")
    String review(@UserMessage String code);
}

// LangChain4j generates the implementation
JavaTutor tutor = AiServices.builder(JavaTutor.class)
    .chatLanguageModel(model)
    .chatMemory(MessageWindowChatMemory.withMaxMessages(20))
    .build();

String explanation = tutor.explain("What are Java generics?");
String codeReview = tutor.review("public void process(List list) { ... }");

Part 5 — RAG (Retrieval-Augmented Generation)¶

What is RAG?¶

RAG augments an LLM's knowledge by retrieving relevant documents from a knowledge base before generating a response. This solves two key problems:

Knowledge cutoff: LLMs don't know about events after training
Hallucination: By grounding responses in actual documents, hallucinations are reduced

RAG Pipeline¶

User Query
    ↓
1. Embed the query → vector (e.g., 1536 dimensions)
    ↓
2. Search vector store → top-k relevant documents
    ↓
3. Construct prompt = system instructions + retrieved docs + user query
    ↓
4. Send to LLM → generate response grounded in retrieved context
    ↓
5. (Optional) Cite sources in the response
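
Steps 2 and 3 of the pipeline can be sketched without any library: rank stored vectors by cosine similarity, then assemble the prompt. This is a brute-force illustration with hand-written vectors; the names (`RetrievalSketch`, `topK`, `buildPrompt`) are assumptions for the example, and a real system would get the vectors from an embedding model and the search from a vector store.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RetrievalSketch {

    /** Cosine similarity; returns 0 when either vector has zero length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Indices of the k document vectors most similar to the query, best first. */
    static List<Integer> topK(double[] query, List<double[]> docs, int k) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) idx.add(i);
        idx.sort(Comparator.comparingDouble((Integer i) -> -cosine(query, docs.get(i))));
        return new ArrayList<>(idx.subList(0, Math.min(k, idx.size())));
    }

    /** Prompt = instructions + retrieved chunks + user question. */
    static String buildPrompt(String question, List<String> chunks) {
        return "Answer using only the context below.\n\nContext:\n"
            + String.join("\n---\n", chunks)
            + "\n\nQuestion: " + question + "\nAnswer:";
    }
}
```

Brute-force search is linear in the number of stored vectors, which is why larger collections rely on approximate nearest-neighbor indexes.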

RAG Techniques Overview¶

Technique	Description	When to Use
Simple RAG	Encode documents → vector store → retrieve top-k	Starting point, small knowledge bases
BM25 RAG	Keyword-based retrieval (TF-IDF variant)	When exact keyword matching matters
Hybrid RAG	Combine dense (embedding) + sparse (BM25) retrieval	Best of both worlds, production systems
ReRanker RAG	Initial retrieval → re-rank with a cross-encoder	Improve precision of top results
Sentence Window	Retrieve sentence + surrounding context	Fine-grained retrieval
Auto Merging	Index small chunks under parent chunks; when enough sibling chunks are retrieved, return the parent instead	Hierarchical documents, less fragmented context
HyDE	Generate hypothetical answer → use it as query	Abstract or vague queries
Query Transformation	Rewrite/expand the query before retrieval	Complex or ambiguous queries
Self Query	Model generates structured filters from natural language	Metadata-filtered retrieval
RAG Fusion	Multiple retrievals → merge and re-rank results	Comprehensive coverage
RAPTOR	Hierarchical summarization for multi-level retrieval	Large document collections
ColBERT	Token-level dense retrieval	High-precision search
Graph RAG	Knowledge graph-based retrieval	Relationship-heavy data
Agentic RAG	Agent decides when and how to retrieve	Complex multi-step reasoning
Vision RAG	Multi-modal retrieval (text + images)	Documents with diagrams/charts
CAG	Cache-augmented generation: preload the knowledge base into the context (and KV cache) instead of retrieving per query	Small, stable knowledge bases
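
Hybrid RAG and RAG Fusion both need a way to merge several ranked result lists. One common choice is Reciprocal Rank Fusion (RRF), sketched below; the table does not prescribe a merge method, so treat this as one option. Each document scores the sum of 1 / (k + rank) over the lists it appears in, with k = 60 as a conventional default. `RankFusionSketch` is a made-up name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RankFusionSketch {

    /** Merges ranked lists of document IDs with Reciprocal Rank Fusion, best first. */
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int rank = 0; rank < ranking.size(); rank++) {
                // Ranks are 1-based in the RRF formula, hence rank + 1
                scores.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);
            }
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> fused = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries) fused.add(e.getKey());
        return fused;
    }
}
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.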

Embedding Models Comparison¶

Model	Dimensions	Strengths	Cost
OpenAI text-embedding-3-small	1536	Good general purpose, low cost	$0.02 / 1M tokens
OpenAI text-embedding-3-large	3072	Higher quality	$0.13 / 1M tokens
Cohere embed-v3	1024	Multilingual, search-optimized	$0.10 / 1M tokens
BGE-large-en	1024	Open-source, high quality	Free (self-hosted)
all-MiniLM-L6-v2	384	Fast, lightweight, open-source	Free (self-hosted)

Vector Database Options¶

Database	Type	Key Feature	Best For
pgvector	PostgreSQL extension	No new infra; lives in your existing DB	Teams already on PostgreSQL
Pinecone	Managed cloud	Fully managed, scalable	Production with minimal ops
Weaviate	Open-source	GraphQL API, hybrid search	Flexible self-hosted
Chroma	Open-source	Simple API, easy to start	Prototyping, small projects
Milvus	Open-source	High performance, GPU support	Large-scale production
Qdrant	Open-source	Rust-based, fast filtering	Performance-critical apps

Complete RAG Implementation¶

public class RAGSystem {
    private final ChatLanguageModel model;
    private final EmbeddingModel embeddingModel;
    private final InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();

    public RAGSystem(ChatLanguageModel model, EmbeddingModel embeddingModel) {
        this.model = model;
        this.embeddingModel = embeddingModel;
    }

    // Index documents
    public void indexDocuments(List<String> documents) {
        for (String doc : documents) {
            // Chunk the document first
            List<String> chunks = splitIntoChunks(doc, 500, 50); // size=500 chars, overlap=50 chars
            for (String chunk : chunks) {
                Embedding emb = embeddingModel.embed(chunk).content();
                // Use the factory method; the constructor rejects a null Metadata
                store.add(emb, TextSegment.from(chunk));
            }
        }
    }

    // Query with RAG
    public String query(String question) {
        // 1. Embed the question
        Embedding queryEmb = embeddingModel.embed(question).content();

        // 2. Retrieve relevant documents
        List<EmbeddingMatch<TextSegment>> matches = store.findRelevant(queryEmb, 3);

        // 3. Filter by relevance score
        String context = matches.stream()
            .filter(m -> m.score() > 0.7) // Only high-relevance matches
            .map(m -> m.embedded().text())
            .collect(Collectors.joining("\n---\n"));

        if (context.isEmpty()) {
            return "I don't have enough information to answer that question.";
        }

        // 4. Generate grounded response
        String prompt = """
            Based on the following context, answer the question.
            If the answer is not in the context, say "I don't have that information."

            Context:
            %s

            Question: %s

            Answer:
            """.formatted(context, question);

        return model.generate(prompt);
    }

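    private List<String> splitIntoChunks(String text, int chunkSize, int overlap) {
        if (overlap >= chunkSize) {
            // Otherwise the loop below would never advance
            throw new IllegalArgumentException("overlap must be smaller than chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += (chunkSize - overlap)) {
            int end = Math.min(i + chunkSize, text.length());
            chunks.add(text.substring(i, end));
            if (end == text.length()) break; // avoid a trailing chunk that only repeats the overlap
        }
        return chunks;
    }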
}

Chunking Strategies¶

Strategy	Description	Best For	Chunk Overlap
Fixed size	Split every N characters/tokens	Simple, predictable	10-20% overlap recommended
Sentence-based	Split on sentence boundaries	Preserving meaning	1-2 sentence overlap
Paragraph-based	Split on paragraph breaks	Structured documents	No overlap needed
Semantic chunking	Split when topic changes (using embeddings)	High-quality retrieval	Automatic
Recursive	Try largest split first, fall back to smaller	General purpose	Configurable
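
As an illustration of the sentence-based row, the sketch below splits text on sentence-ending punctuation and groups sentences into overlapping chunks. The regex is a deliberate simplification (it will mis-split abbreviations such as "e.g."), and `SentenceChunker` with its parameters is a name chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SentenceChunker {

    /** Groups sentences into chunks of sentencesPerChunk, sharing `overlap` sentences between neighbors. */
    static List<String> chunk(String text, int sentencesPerChunk, int overlap) {
        if (sentencesPerChunk <= 0 || overlap < 0 || overlap >= sentencesPerChunk) {
            throw new IllegalArgumentException("require 0 <= overlap < sentencesPerChunk");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return chunks;

        // Split after '.', '!' or '?' followed by whitespace (naive sentence boundary)
        String[] sentences = trimmed.split("(?<=[.!?])\\s+");
        int step = sentencesPerChunk - overlap;
        for (int start = 0; start < sentences.length; start += step) {
            int end = Math.min(start + sentencesPerChunk, sentences.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(sentences, start, end)));
            if (end == sentences.length) break; // last sentence reached
        }
        return chunks;
    }
}
```

With five sentences, three per chunk, and an overlap of one, this yields two chunks that share the third sentence.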

RAG Evaluation Metrics¶

Metric	What It Measures	How to Compute
Faithfulness	Is the answer supported by retrieved context?	LLM-as-judge against context
Answer Relevance	Does the answer address the question?	Semantic similarity (question ↔ answer)
Context Precision	Are the retrieved docs relevant to the question?	Ratio of relevant docs in top-k
Context Recall	Did we retrieve all necessary information?	Coverage of ground-truth answer
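
When you have labeled data (for each test question, the set of document IDs that should be retrieved), the two context metrics can be approximated with simple set arithmetic. This is a simplified, ID-based sketch; evaluation frameworks often use rank-aware or LLM-judged variants instead. The names below are assumptions for the example.

```java
import java.util.List;
import java.util.Set;

final class RetrievalMetrics {

    /** Number of retrieved IDs that are in the relevant set. */
    private static long hits(List<String> retrieved, Set<String> relevant) {
        return retrieved.stream().filter(relevant::contains).count();
    }

    /** Fraction of retrieved documents that are relevant (0 if nothing was retrieved). */
    static double contextPrecision(List<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        return (double) hits(retrieved, relevant) / retrieved.size();
    }

    /** Fraction of relevant documents that were retrieved (0 if nothing is labeled relevant). */
    static double contextRecall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        return (double) hits(retrieved, relevant) / relevant.size();
    }
}
```

Raising top-k usually improves recall and lowers precision, so the two numbers are best tracked together.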

Part 6 — AI Agents¶

What is an Agent?¶

An agent is an AI system that can autonomously decide which actions to take to accomplish a goal. Unlike simple chat, agents can:

Reason about a problem (observe → think)
Select and use tools (act)
Process tool results (observe)
Iterate until the task is complete

Agent Patterns¶

1. Reflection Pattern¶

The agent evaluates its own output and iteratively improves it.

public class ReflectionAgent {
    private final ChatLanguageModel model;

    public ReflectionAgent(ChatLanguageModel model) {
        this.model = model;
    }

    public String generateWithReflection(String task, int maxIterations) {
        String draft = model.generate("Complete this task: " + task);

        for (int i = 0; i < maxIterations; i++) {
            // Self-critique; ask for an explicit sentinel so the stop check is reliable
            String critique = model.generate(
                "Review this response to the task \"" + task + "\" for errors, missing details, " +
                "and improvements. If nothing needs to change, reply with exactly APPROVED.\n" + draft
            );

            // Check if good enough
            if (critique.trim().equalsIgnoreCase("APPROVED")) {
                break;
            }

            // Refine based on critique
            draft = model.generate(
                "Original: " + draft + "\nCritique: " + critique +
                "\nGenerate an improved version addressing the critique."
            );
        }

        return draft;
    }
}

2. ReAct Pattern (Reason + Act)¶

The agent interleaves reasoning steps with tool calls.

Thought: I need to find the user's order status. Let me query the database.
Action: queryDatabase("SELECT status FROM orders WHERE user_id = 123")
Observation: [{status: "shipped", tracking: "1Z999AA10123456784"}]
Thought: The order is shipped. I should provide the tracking number.
Answer: Your order has been shipped! Tracking: 1Z999AA10123456784

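// Requires: java.util.Map, java.util.function.Function, java.util.regex.Matcher, java.util.regex.Pattern
public class ReActAgent {
    // Matches "Action: toolName(argument)" on a single line
    private static final Pattern ACTION = Pattern.compile("Action:\\s*(\\w+)\\((.*)\\)");

    private final ChatLanguageModel model;
    private final Map<String, Function<String, String>> tools;

    public ReActAgent(ChatLanguageModel model, Map<String, Function<String, String>> tools) {
        this.model = model;
        this.tools = tools;
    }

    public String solve(String question) {
        StringBuilder scratchpad = new StringBuilder();
        String systemPrompt = """
            You are a helpful agent. Use the following format:
            Thought: your reasoning
            Action: toolName(argument)
            ... (wait for Observation)
            Thought: reasoning about observation
            Answer: final answer to the user

            Available tools: %s
            """.formatted(tools.keySet());

        for (int step = 0; step < 5; step++) {
            String response = model.generate(
                systemPrompt + "\nQuestion: " + question + "\n" + scratchpad
            );

            // Run a requested tool first: the model may guess an Answer before seeing the Observation
            Matcher action = ACTION.matcher(response);
            if (action.find()) {
                String toolName = action.group(1);
                String arg = action.group(2);
                Function<String, String> tool = tools.get(toolName);
                String observation = (tool != null)
                    ? tool.apply(arg)
                    : "Error: unknown tool '" + toolName + "'";
                // Keep the text only up to the action so invented observations are discarded
                scratchpad.append(response, 0, action.end())
                    .append("\nObservation: ").append(observation).append("\n");
                continue;
            }

            int answerAt = response.indexOf("Answer:");
            if (answerAt >= 0) {
                return response.substring(answerAt + "Answer:".length()).trim();
            }
        }
        return "I couldn't find an answer within the step limit.";
    }
}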

3. Tool Use Pattern¶

import dev.langchain4j.agent.tool.Tool;

public class DeveloperTools {

    @Tool("Search for Java documentation")
    public String searchDocs(String query) {
        return "Documentation result for: " + query;
    }

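    @Tool("Execute a read-only SQL query against the database")
    public String executeQuery(String sql) {
        // Allow-list instead of a keyword deny-list: only a single SELECT statement passes.
        // This is a coarse guard; also run the query under a read-only database user.
        String normalized = sql.trim().toLowerCase();
        if (!normalized.startsWith("select") || normalized.contains(";")) {
            return "Error: Only single SELECT statements are allowed.";
        }
        return "Query results: [...]";
    }

    @Tool("Get current system metrics")
    public String getMetrics() {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() is the allocated heap; subtract freeMemory() to get what is in use
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / 1024 / 1024;
        return String.format("Memory: %dMB used / %dMB max, Processors: %d",
            usedMb,
            rt.maxMemory() / 1024 / 1024,
            rt.availableProcessors());
    }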

    @Tool("Send an HTTP GET request to a URL")
    public String httpGet(String url) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url)).GET().build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (Exception e) {
            return "Error: " + e.getMessage();
        }
    }
}

4. Planning Pattern¶

public class PlanningAgent {
    private final ChatLanguageModel model;

    public String solve(String goal) {
        // Step 1: Create plan
        String plan = model.generate(
            "Break this goal into numbered steps (max 5): " + goal
        );

        // Step 2: Execute each step, passing the goal and earlier results as context
        StringBuilder results = new StringBuilder();
        for (String step : plan.split("\n")) {
            if (step.trim().isEmpty()) continue;
            String result = model.generate(
                "Goal: " + goal + "\nResults so far:\n" + results +
                "\nExecute this step: " + step);
            results.append(step).append("\nResult: ").append(result).append("\n\n");
        }

        // Step 3: Synthesize
        return model.generate("Summarize these results into a coherent answer:\n" + results);
    }
}

5. Multi-Agent Pattern¶

Multiple specialized agents collaborate to solve complex problems.

public class MultiAgentSystem {
    private final ChatLanguageModel researcher;
    private final ChatLanguageModel coder;
    private final ChatLanguageModel reviewer;

    public String buildFeature(String requirement) {
        // Agent 1: Research
        String research = researcher.generate(
            "Research the best approach for: " + requirement +
            "\nConsider: design patterns, performance, edge cases."
        );

        // Agent 2: Implement
        String code = coder.generate(
            "Based on this research, write production-quality Java code:\n" + research +
            "\nInclude error handling, logging, and javadoc."
        );

        // Agent 3: Review
        String review = reviewer.generate(
            "Review this code for bugs, security issues (OWASP), and improvements:\n" + code
        );

        // Iterate if needed: hand the review back to the coder (one fix pass, not re-reviewed)
        if (review.toLowerCase().contains("critical") || review.toLowerCase().contains("bug")) {
            code = coder.generate(
                "Fix these issues in the code:\n" + review + "\n\nOriginal code:\n" + code
            );
        }

        return "Code:\n" + code + "\n\nReview:\n" + review;
    }
}

Agent Memory Types¶

Memory Type	Description	Implementation
Short-term	Current conversation context	Message list / sliding window
Long-term	Facts learned across sessions	Vector store / database
Episodic	Past experiences and outcomes	Event log with embeddings
Procedural	Learned procedures and workflows	Tool descriptions, prompts
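
The short-term row ("message list / sliding window") can be sketched as a bounded queue that evicts the oldest message once a limit is reached. This is a simplified stand-in that stores plain strings; `SlidingWindowMemory` is a name invented here, and real implementations usually also pin the system message and count tokens rather than messages.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class SlidingWindowMemory {
    private final Deque<String> messages = new ArrayDeque<>();
    private final int maxMessages;

    SlidingWindowMemory(int maxMessages) {
        if (maxMessages <= 0) throw new IllegalArgumentException("maxMessages must be positive");
        this.maxMessages = maxMessages;
    }

    /** Appends a message and evicts the oldest ones beyond the window size. */
    void add(String message) {
        messages.addLast(message);
        while (messages.size() > maxMessages) {
            messages.removeFirst();
        }
    }

    /** Snapshot of the current window, oldest first. */
    List<String> messages() {
        return List.copyOf(messages);
    }
}
```

The window keeps prompt size bounded, at the cost of the model losing anything said before the oldest retained message.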

Part 7 — MCP (Model Context Protocol)¶

What is MCP?¶

The Model Context Protocol is an open standard for connecting AI models to external data sources and tools. It defines a client-server architecture where:

MCP Client: The connector inside the AI application (the host, e.g., your Spring Boot app) that talks to a server
MCP Server: Exposes tools, resources, and prompts to the client

Why MCP Matters¶

Without MCP, every AI integration requires custom code. MCP standardizes how models access external capabilities — similar to how HTTP standardized web communication.

Without MCP	With MCP
Custom integration per tool	Standard protocol for all tools
Tight coupling to AI provider	Provider-agnostic tool access
No discoverability	Tools self-describe their capabilities
Manual context management	Automatic context injection

Building an MCP Server¶

// Illustrative sketch: annotation names and packages vary by MCP SDK and version
@McpServer
public class JavaDocsServer {

    @McpTool(description = "Search Java API documentation for a class")
    public String searchJavaDocs(String className) {
        return "Documentation for " + className + ": ...";
    }

    @McpTool(description = "Get all method signatures for a Java class")
    public String getMethods(String className) {
        return "Methods for " + className + ": ...";
    }

    @McpTool(description = "Run a Java code snippet and return the output")
    public String executeJava(String code) {
        // Sandboxed execution
        return "Output: ...";
    }

    @McpResource(uri = "docs://java/tutorials")
    public String getTutorials() {
        return "Available tutorials: Streams, Collections, Concurrency...";
    }
}

Part 8 — Fine-Tuning Concepts¶

When to Fine-Tune vs RAG vs Prompt Engineering¶

┌─────────────────────────────────────────────────────────────┐
│ START: Can prompt engineering solve it?                      │
│   YES → Use prompt engineering (cheapest, fastest)           │
│   NO  → Does the model need external knowledge?             │
│           YES → Use RAG (retrieval-augmented generation)     │
│           NO  → Does the model need a new behavior/style?   │
│                   YES → Fine-tune                            │
│                   NO  → Combine RAG + better prompts         │
└─────────────────────────────────────────────────────────────┘

Approach	When to Use	Cost	Latency
Prompt Engineering	Model knows how but needs guidance	No training cost (pay per token)	Same
RAG	Model needs external/updated knowledge	Storage + retrieval	+100-500ms
Fine-Tuning	Model needs new behavior, style, or domain expertise	Training compute	Often lower (shorter prompts)
RAG + Fine-Tuning	Both new knowledge and new behavior	Highest	Variable

Fine-Tuning Methods¶

Method	Description	Resource Needs
Full Fine-Tuning	Update all model parameters	Very high (multiple GPUs)
LoRA	Low-Rank Adaptation — freeze base weights, train small adapter matrices	Low (single GPU)
QLoRA	LoRA with quantized base model (4-bit)	Very low (consumer GPU)
Prefix Tuning	Prepend learnable tokens to input	Low
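
A short calculation shows why LoRA is so much cheaper than full fine-tuning. The dimensions below (d = k = 4096, rank r = 8) are illustrative assumptions, not figures for any specific model.

```latex
% Frozen weight W_0 \in \mathbb{R}^{d \times k}; LoRA learns a low-rank update
W = W_0 + \Delta W, \qquad \Delta W = BA, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

% Trainable parameters per adapted matrix
\text{full fine-tuning: } d \cdot k \qquad \text{LoRA: } r\,(d + k)

% Example with d = k = 4096 and r = 8
d \cdot k = 4096^2 = 16{,}777{,}216 \qquad r\,(d + k) = 8 \cdot 8192 = 65{,}536 \qquad \frac{65{,}536}{16{,}777{,}216} = \frac{1}{256} \approx 0.39\%
```

Under these assumptions, each adapted matrix trains well under 1% of the parameters that full fine-tuning would update, which is what lets it fit on a single GPU.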

Dataset Preparation¶

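// Training data format (OpenAI style), pretty-printed here for readability.
// In the actual .jsonl file, each example sits on one line and comments are not allowed.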
{"messages": [
  {"role": "system", "content": "You are a Java code reviewer."},
  {"role": "user", "content": "Review this code: public void process(List items) {...}"},
  {"role": "assistant", "content": "Issues found:\n1. Raw type List..."}
]}

Guidelines:

Minimum 50-100 examples (more = better, diminishing returns after ~1000)
High quality > quantity — curate carefully
Include edge cases and varied examples
Validate with a held-out test set

Part 9 — Evaluation and Observability¶

LLM Evaluation Metrics¶

Metric	What It Measures	Method
Accuracy	Correct answers vs total	Exact match / fuzzy match
Faithfulness	Answer supported by context (no hallucination)	LLM-as-judge
Relevance	Answer addresses the question	Semantic similarity
Toxicity	Harmful or inappropriate content	Classifier / LLM-as-judge
Latency	Time to first token / total generation time	Instrumentation
Cost	Total token usage × price	Token counting

Hallucination Detection¶

@Service
public class HallucinationDetector {
    private final ChatLanguageModel judge;

    public boolean isHallucination(String context, String answer) {
        String prompt = """
            Given the following context and answer, determine if the answer
            contains any claims NOT supported by the context.

            Context: %s

            Answer: %s

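            Respond with "FAITHFUL" or "HALLUCINATION" as the first word, followed by an explanation.
            """.formatted(context, answer);

        String verdict = judge.generate(prompt);
        // Check only the leading label: the explanation may itself mention "hallucination"
        return verdict.trim().toUpperCase().startsWith("HALLUCINATION");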
    }
}

Token Usage and Cost Tracking¶

@Component
public class TokenUsageTracker {
    private final AtomicLong totalInputTokens = new AtomicLong(0);
    private final AtomicLong totalOutputTokens = new AtomicLong(0);

    // Pricing per 1M tokens (example: GPT-4o-mini)
    private static final double INPUT_COST_PER_M = 0.15;
    private static final double OUTPUT_COST_PER_M = 0.60;

    public void track(ChatResponse response) {
        // Spring AI 0.8.x names these prompt (input) and generation (output) tokens
        Usage usage = response.getMetadata().getUsage();
        totalInputTokens.addAndGet(usage.getPromptTokens());
        totalOutputTokens.addAndGet(usage.getGenerationTokens());
    }

    public double getTotalCost() {
        return (totalInputTokens.get() / 1_000_000.0) * INPUT_COST_PER_M +
               (totalOutputTokens.get() / 1_000_000.0) * OUTPUT_COST_PER_M;
    }

    public String getReport() {
        return String.format("Input: %d tokens | Output: %d tokens | Cost: $%.4f",
            totalInputTokens.get(), totalOutputTokens.get(), getTotalCost());
    }
}

Part 10 — Production Patterns¶

Rate Limiting¶

@Component
public class AIRateLimiter {
    // Fixed-window rate limiter: permits are topped up once per minute
    private final Semaphore permits;
    private final ScheduledExecutorService scheduler;

    public AIRateLimiter(
        @Value("${ai.rate-limit.requests-per-minute:60}") int rpm
    ) {
        this.permits = new Semaphore(rpm);
        this.scheduler = Executors.newSingleThreadScheduledExecutor();

        // Replenish permits every minute
        scheduler.scheduleAtFixedRate(
            () -> permits.release(rpm - permits.availablePermits()),
            1, 1, TimeUnit.MINUTES
        );
    }

    public <T> T executeWithRateLimit(Supplier<T> aiCall) {
        if (!permits.tryAcquire(5, TimeUnit.SECONDS)) {
            throw new RateLimitExceededException("AI rate limit exceeded. Try again later.");
        }
        return aiCall.get();
    }
}
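
The limiter above refills its permits on a fixed one-minute schedule, so a burst at the end of one window and another at the start of the next can briefly exceed the intended rate. A sliding window avoids this by tracking the timestamp of each accepted request. The following is a minimal sketch of that technique using only the JDK; `SlidingWindowLimiter` and its injectable clock are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

final class SlidingWindowLimiter {
    private final int maxRequests;
    private final long windowNanos;
    private final LongSupplier clock;                 // injectable so tests can control time
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowLimiter(int maxRequests, long windowNanos, LongSupplier clock) {
        this.maxRequests = maxRequests;
        this.windowNanos = windowNanos;
        this.clock = clock;
    }

    // Returns true and records the request if fewer than maxRequests
    // were accepted during the last windowNanos; otherwise returns false.
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        // Evict timestamps that have slid out of the window
        while (!timestamps.isEmpty() && now - timestamps.peekFirst() >= windowNanos) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= maxRequests) {
            return false;
        }
        timestamps.addLast(now);
        return true;
    }
}
```

In production the clock would typically be `System::nanoTime` and the window `TimeUnit.MINUTES.toNanos(1)`. Memory use grows with the request limit, since one timestamp is kept per accepted request.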

Caching Strategies¶

@Service
public class CachedAIService {
    private final ChatLanguageModel model;
    private final EmbeddingModel embeddingModel;
    private final Cache<String, String> cache;

    public CachedAIService(ChatLanguageModel model, EmbeddingModel embeddingModel) {
        this.model = model;
        this.embeddingModel = embeddingModel;
        this.cache = Caffeine.newBuilder()
            .maximumSize(1000)
            .expireAfterWrite(1, TimeUnit.HOURS)
            .build();
    }

    // Exact-match caching: only byte-identical inputs share a cached response
    public String chat(String input) {
        String cacheKey = hashInput(input); // helper not shown, e.g. a SHA-256 hex digest
        return cache.get(cacheKey, k -> model.generate(input));
    }

    // Semantic caching: embed the query and check similarity
    // to cached queries before calling the model.
    // findSimilar and cacheWithEmbedding are helpers not shown here.
    public String semanticCachedChat(String input) {
        Embedding queryEmb = embeddingModel.embed(input).content();
        // Reuse a cached response if a stored query has similarity >= 0.95
        Optional<CacheEntry> cached = findSimilar(queryEmb, 0.95);
        if (cached.isPresent()) return cached.get().response();

        String response = model.generate(input);
        cacheWithEmbedding(queryEmb, input, response);
        return response;
    }
}
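
The similarity lookup used for semantic caching can be sketched with plain arrays and cosine similarity. This is an illustration under stated assumptions, not the guide's actual helper: `SemanticCache`, `Entry`, and the linear scan are hypothetical, and embeddings are represented as `float[]` rather than a library type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class SemanticCache {
    record Entry(float[] embedding, String response) {}

    private final List<Entry> entries = new ArrayList<>();

    // Cosine similarity: dot(a, b) / (|a| * |b|); defined as 0 for a zero vector
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    void put(float[] embedding, String response) {
        entries.add(new Entry(embedding, response));
    }

    // Returns the response of the most similar entry at or above the threshold
    Optional<String> findSimilar(float[] query, double threshold) {
        Entry best = null;
        double bestScore = threshold;
        for (Entry e : entries) {
            double score = cosine(query, e.embedding());
            if (score >= bestScore) {
                best = e;
                bestScore = score;
            }
        }
        return Optional.ofNullable(best).map(Entry::response);
    }
}
```

The linear scan is fine for a few thousand entries; beyond that, a vector store with an approximate nearest-neighbor index is the usual choice. The threshold is a trade-off: a lower value raises the hit rate but risks returning an answer to a different question.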

Fallback Chains¶

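@Service
public class ResilientAIService {
    // SLF4J logger used by the catch blocks below
    private static final Logger log = LoggerFactory.getLogger(ResilientAIService.class);

    // Constructor injection (one qualified bean per provider) omitted for brevity
    private final ChatLanguageModel primary;   // GPT-4o
    private final ChatLanguageModel secondary; // Claude 3.5
    private final ChatLanguageModel fallback;  // Local Ollama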

    public String generate(String prompt) {
        // Try primary
        try {
            return primary.generate(prompt);
        } catch (Exception e) {
            log.warn("Primary model failed: {}", e.getMessage());
        }

        // Try secondary
        try {
            return secondary.generate(prompt);
        } catch (Exception e) {
            log.warn("Secondary model failed: {}", e.getMessage());
        }

        // Fallback to local
        try {
            return fallback.generate(prompt);
        } catch (Exception e) {
            log.error("All models failed", e);
            throw new AIServiceUnavailableException("All AI providers are unavailable");
        }
    }
}
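
The three try/catch blocks above repeat the same pattern, which generalizes to an ordered list of providers. A minimal sketch, assuming each provider can be modeled as a `Function<String, String>`; `FallbackChain` is a hypothetical name, and `IllegalStateException` stands in for the guide's `AIServiceUnavailableException`.

```java
import java.util.List;
import java.util.function.Function;

final class FallbackChain {
    private final List<Function<String, String>> providers;

    FallbackChain(List<Function<String, String>> providers) {
        if (providers.isEmpty()) {
            throw new IllegalArgumentException("at least one provider is required");
        }
        this.providers = List.copyOf(providers);
    }

    // Tries each provider in order and returns the first successful result
    String generate(String prompt) {
        RuntimeException last = null;
        for (Function<String, String> provider : providers) {
            try {
                return provider.apply(prompt);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next provider
            }
        }
        // Every provider failed; keep the final cause for diagnostics
        throw new IllegalStateException("All AI providers are unavailable", last);
    }
}
```

With the service above, the chain could be built as `new FallbackChain(List.of(primary::generate, secondary::generate, fallback::generate))`. Adding or reordering providers then becomes a configuration change rather than a code change.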

Content Filtering / Guardrails¶

@Service
public class AIGuardrails {

    // Input guardrails — validate before sending to model
    public String sanitizeInput(String userInput) {
        // 1. Check for prompt injection attempts
        if (containsPromptInjection(userInput)) {
            throw new SecurityException("Potential prompt injection detected");
        }

        // 2. Check length limits
        if (userInput.length() > 10_000) {
            throw new ValidationException("Input too long");
        }

        // 3. Remove PII (emails, phone numbers, SSN)
        return removePII(userInput);
    }

    // Output guardrails — validate before returning to user
    public String sanitizeOutput(String modelOutput) {
        // 1. Check for harmful content
        if (containsHarmfulContent(modelOutput)) {
            return "I'm unable to provide that information.";
        }

        // 2. Remove any leaked system prompt content
        modelOutput = removeSystemPromptLeaks(modelOutput);

        return modelOutput;
    }

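    // Naive keyword heuristic: easy to bypass by rephrasing and prone to false
    // positives ("you are now" appears in ordinary text). Use it as a first filter only.
    private boolean containsPromptInjection(String input) {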
        String lower = input.toLowerCase();
        return lower.contains("ignore previous instructions") ||
               lower.contains("you are now") ||
               lower.contains("system prompt");
    }
}
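
The `removePII` step referenced in the input guardrail is not shown in the class. One simple way to approximate it is regex-based redaction, sketched below. `PiiRedactor` is a hypothetical helper, and the patterns are assumptions that cover only common email addresses, US-style Social Security numbers, and 10-digit US phone numbers; they are not a complete PII detector.

```java
import java.util.regex.Pattern;

final class PiiRedactor {
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern SSN =
        Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final Pattern PHONE =
        Pattern.compile("\\b\\d{3}[-. ]\\d{3}[-. ]\\d{4}\\b");

    // Replaces each match with a placeholder so the prompt keeps its shape
    static String redact(String text) {
        String out = EMAIL.matcher(text).replaceAll("[EMAIL]");
        out = SSN.matcher(out).replaceAll("[SSN]");
        out = PHONE.matcher(out).replaceAll("[PHONE]");
        return out;
    }
}
```

Placeholders are used instead of deletion so the model still sees that a value was present. International formats, names, and addresses need a dedicated PII detection tool.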

Part 11 — Practical Projects¶

Project 1: Document Q&A System¶

Build a system where users upload documents and ask questions about them.

@Service
public class DocumentQAService {
    private final ChatLanguageModel model;
    private final EmbeddingModel embeddingModel;
    private final InMemoryEmbeddingStore<TextSegment> store;

    public void ingestDocument(String content) {
        List<String> chunks = splitIntoChunks(content, 500);
        for (String chunk : chunks) {
            Embedding emb = embeddingModel.embed(chunk).content();
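            // TextSegment.from creates a segment with empty metadata;
            // the two-argument constructor rejects a null Metadata
            store.add(emb, TextSegment.from(chunk));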
        }
    }

    public String askQuestion(String question) {
        Embedding queryEmb = embeddingModel.embed(question).content();
        List<EmbeddingMatch<TextSegment>> relevant = store.findRelevant(queryEmb, 3);

        String context = relevant.stream()
            .map(m -> m.embedded().text())
            .collect(Collectors.joining("\n\n"));

        return model.generate(
            "Answer based on context only. If not found, say 'not found'.\n\n" +
            "Context:\n" + context + "\n\nQuestion: " + question
        );
    }

    private List<String> splitIntoChunks(String text, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += chunkSize) {
            chunks.add(text.substring(i, Math.min(i + chunkSize, text.length())));
        }
        return chunks;
    }
}
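
Fixed-size chunks like those produced by `splitIntoChunks` can cut a sentence in half, leaving neither chunk with the full statement. A common mitigation is to let consecutive chunks overlap. The sketch below shows that variant; `OverlappingChunker` and the `overlap` parameter are hypothetical additions, not part of the project code above.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlappingChunker {
    // Splits text into chunks of at most chunkSize characters, where each
    // chunk starts (chunkSize - overlap) characters after the previous one.
    static List<String> split(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk reached the end; avoid a redundant tail
            }
        }
        return chunks;
    }
}
```

For example, `split("abcdefghij", 4, 2)` yields `abcd`, `cdef`, `efgh`, `ghij`. Splitting on sentence or paragraph boundaries usually retrieves better than splitting on character counts, at the cost of more code.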

Project 2: Code Review Agent¶

An AI agent that reviews Java code for bugs, security issues, and best practices.

@Service
public class CodeReviewAgent {
    private final ChatLanguageModel model;

    public CodeReviewResult review(String code) {
        String review = model.generate(
            "You are a senior Java developer. Review this code for:\n" +
            "1. Bugs and logical errors\n" +
            "2. Security vulnerabilities (OWASP Top 10)\n" +
            "3. Performance issues\n" +
            "4. Best practice violations\n" +
            "5. Suggestions for improvement\n\n" +
            "Code:\n```java\n" + code + "\n```\n\n" +
            "Format: For each issue, provide [SEVERITY] Description and Fix."
        );

        return new CodeReviewResult(review);
    }
}
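
Because the prompt asks for one `[SEVERITY] Description and Fix` entry per issue, the raw review text can be turned into structured records. The following sketch assumes the model puts each issue on its own line in that format, which is not guaranteed; `ReviewParser` and `Issue` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReviewParser {
    record Issue(String severity, String text) {}

    // Matches lines such as "[HIGH] SQL injection in findUser. Fix: ..."
    private static final Pattern LINE =
        Pattern.compile("^\\s*\\[([A-Za-z]+)\\]\\s*(.+)$");

    // Extracts one Issue per matching line; other lines are ignored
    static List<Issue> parse(String review) {
        List<Issue> issues = new ArrayList<>();
        for (String line : review.split("\\R")) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                issues.add(new Issue(
                    m.group(1).toUpperCase(Locale.ROOT),
                    m.group(2).strip()));
            }
        }
        return issues;
    }
}
```

Lines that do not match are dropped silently here. If structure matters, asking the model for JSON and validating it is generally more reliable than parsing free text.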

Project 3: Conversational Database Agent¶

An agent that translates natural language to SQL and queries your database.

@Service
public class DatabaseAgent {
    private final ChatLanguageModel model;
    private final JdbcTemplate jdbc;

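    @Tool("Execute a read-only SQL query and return results")
    public String queryDatabase(String sql) {
        // Safety: allow a single SELECT statement only. A prefix check alone would
        // accept "SELECT 1; DROP TABLE users", so drop one trailing semicolon and
        // reject any statement separator that remains.
        String cleaned = sql.trim().replaceFirst(";\\s*$", "");
        if (!cleaned.toUpperCase().startsWith("SELECT") || cleaned.contains(";")) {
            return "Error: Only a single SELECT query is allowed";
        }
        // Defense in depth: connect with a database user that has read-only privileges
        List<Map<String, Object>> results = jdbc.queryForList(cleaned);
        return results.toString();
    }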

    public String ask(String question, String schema) {
        String prompt = """
            Given this database schema:
            %s

            Convert this natural language question to SQL:
            "%s"

            Rules:
            - Only use SELECT queries
            - Use proper JOINs
            - Add LIMIT 10 to prevent large result sets

            SQL:
            """.formatted(schema, question);

        String sql = model.generate(prompt).trim();
        String results = queryDatabase(sql);

        return model.generate(
            "Based on these query results, answer the user's question in natural language.\n" +
            "Question: " + question + "\nResults: " + results
        );
    }
}
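
Model output often arrives wrapped in a markdown fence, and a `SELECT` prefix alone does not prove a query is read-only. The sketch below shows one possible pre-check to run before executing generated SQL. `ReadOnlySqlGuard` and its keyword list are assumptions made for illustration; the check is deliberately conservative and can reject valid queries, for example one with a string literal containing "delete".

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class ReadOnlySqlGuard {
    // Keywords that indicate a write or DDL statement
    private static final Pattern FORBIDDEN = Pattern.compile(
        "\\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|GRANT|REVOKE|MERGE|EXEC|CALL)\\b");

    // Removes a surrounding markdown fence and one trailing semicolon
    static String clean(String modelOutput) {
        String sql = modelOutput.strip()
            .replaceFirst("^```(?:sql|SQL)?\\s*", "")
            .replaceFirst("\\s*```$", "")
            .strip();
        if (sql.endsWith(";")) {
            sql = sql.substring(0, sql.length() - 1).strip();
        }
        return sql;
    }

    // True only for a single SELECT with no comments and no write keywords
    static boolean isReadOnlySelect(String sql) {
        String upper = sql.toUpperCase(Locale.ROOT);
        return upper.startsWith("SELECT")
            && !upper.contains(";")
            && !upper.contains("--")
            && !upper.contains("/*")
            && !FORBIDDEN.matcher(upper).find();
    }
}
```

A string-level check like this is a convenience filter, not a security boundary. The dependable safeguard is a database account that only has read privileges.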