AI Engineering, JavaScript

Building My First LLM-Powered App with the OpenAI API

GPT-4 launched on March 14, 2023, and like every other developer on Twitter, I immediately wanted to build something with it. The demos were impressive, but I wanted to understand what it actually takes to go from API key to working application. So I built a code review assistant, a tool that takes a git diff and returns structured feedback on the changes. Here's everything I learned.

Getting Started with the SDK

OpenAI's Node.js SDK is straightforward. Install it and configure your API key:

npm install openai

const OpenAI = require('openai');

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

Never hardcode API keys. Use environment variables. I'm saying this because the number of "I accidentally committed my API key" posts on Reddit is alarming. Use a `.env` file locally and proper secrets management in production.
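One way to make a missing key obvious is to fail at startup rather than on the first API call. This is a minimal sketch; `requireEnv` is a hypothetical helper, not part of the OpenAI SDK. If you load a `.env` file with the `dotenv` package (an assumption, since the post doesn't name a loader), you would call `require('dotenv').config()` before this runs.

```javascript
// Hypothetical helper: read a required environment variable or fail fast.
// The second argument exists only so the function is easy to test.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage at startup (commented out so the snippet runs without a real key):
// const client = new OpenAI({ apiKey: requireEnv('OPENAI_API_KEY') });
```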

Chat Completions: The Core API

The chat completions endpoint is what you'll use for almost everything. It takes an array of messages, each with a role (`system`, `user`, or `assistant`) and content:

const response = await client.chat.completions.create({
  model: 'gpt-4',
  messages: [
    {
      role: 'system',
      content: 'You are a senior software engineer performing code reviews. Focus on bugs, security issues, and performance problems. Be concise and specific.'
    },
    {
      role: 'user',
      content: 'Review this git diff:\n\n' + gitDiff
    }
  ],
  temperature: 0.3,
  max_tokens: 2000,
});

The three message roles serve different purposes:

  • `system`: Sets the behavior and personality of the model. This is your prompt engineering foundation. The model treats system messages as high-priority instructions.
  • `user`: The human input. This is the question, the code to review, the text to summarize.
  • `assistant`: Previous model responses. Including these creates a conversation history, which is how you build multi-turn chat interfaces.
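
To make the multi-turn idea concrete, here is a rough sketch of how a conversation could be assembled for the next request. `buildConversation` and the sample turns are hypothetical names and data used only for illustration.

```javascript
// Hypothetical helper: build the messages array for the next request from
// a fixed system prompt, the earlier turns, and the new user input.
function buildConversation(systemPrompt, history, userInput) {
  return [
    { role: 'system', content: systemPrompt },
    ...history, // earlier user/assistant turns, oldest first
    { role: 'user', content: userInput },
  ];
}

// Made-up earlier turns: one user message and the model's reply to it.
const history = [
  { role: 'user', content: 'Review this diff:\n\n+ const user = users.find(u => u.id === id).name;' },
  { role: 'assistant', content: 'find() can return undefined, so .name may throw.' },
];

const messages = buildConversation(
  'You are a senior software engineer performing code reviews.',
  history,
  'Can you suggest a fix?'
);
```

The API is stateless, so the whole array is sent on every request. That means a long conversation also costs more input tokens each turn.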

Temperature and Token Limits

Two parameters you need to understand immediately:

`temperature` controls randomness. Range is 0 to 2, but you'll almost always stay between 0 and 1. Lower values (0.1 to 0.3) give more deterministic, focused responses. Good for code review, classification, and extraction tasks. Higher values (0.7 to 1.0) give more creative, varied responses. Good for brainstorming and creative writing. For my code review tool, I use 0.3 because I want consistent, precise feedback, not creative interpretations.

`max_tokens` caps the response length. One token is roughly 4 characters of English text, or about three-quarters of a word, so a 2000-token response is about 1,500 words. Set this based on how long you expect responses to be. If you set it too low, responses get cut off mid-sentence. If you don't set it, the response can run until it fills whatever is left of the model's context window, which can get expensive.
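
Two small helpers make the token limit easier to work with. This is a sketch: `estimateTokens` and `wasTruncated` are hypothetical names. The first applies the 4-characters-per-token rule of thumb, which is only a rough estimate. The second checks the `finish_reason` field on the response, which the API sets to `'length'` when the output hit `max_tokens`.

```javascript
// Rough heuristic: about 4 characters per token for English text.
// Use a real tokenizer when you need an exact count.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// finish_reason is 'length' when the response was cut off by max_tokens,
// and 'stop' when the model finished on its own.
function wasTruncated(response) {
  return response.choices[0].finish_reason === 'length';
}
```

If `wasTruncated` returns true, you can raise `max_tokens`, ask for a shorter format, or tell the user the review is incomplete.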

Streaming Responses

For user-facing applications, streaming is essential. Without streaming, the user stares at a loading spinner for 5 to 15 seconds while the model generates the full response. With streaming, tokens appear as they're generated, giving that characteristic ChatGPT typing effect.

const stream = await client.chat.completions.create({
  model: 'gpt-4',
  messages: messages,
  stream: true,
});

for await (const chunk of stream) {
  // Guard against chunks that arrive with no choices or no content
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

For a web application, you'd pipe this through Server-Sent Events (SSE) to the browser:

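// Express route handler
app.get('/api/review', async (req, res) => {
  // SSE headers: keep the connection open and disable caching
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = await client.chat.completions.create({
      model: 'gpt-4',
      messages: buildMessages(req.query.diff),
      stream: true,
    });

    for await (const chunk of stream) {
      // Guard against chunks that arrive with no choices or no content
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        res.write('data: ' + JSON.stringify({ content }) + '\n\n');
      }
    }

    res.write('data: [DONE]\n\n');
  } catch (error) {
    // Once streaming has started the status code can't change,
    // so report the failure as an event instead of leaving the request hanging
    res.write('data: ' + JSON.stringify({ error: 'Review failed' }) + '\n\n');
  } finally {
    res.end();
  }
});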

On the client side, the `EventSource` API or a fetch with a readable stream picks up these events and appends text to the UI in real time.
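
For the fetch route, the client has to split the incoming text into events itself, because a network chunk can end in the middle of an event. Below is a minimal sketch of that parsing step. `parseSseEvents` is a hypothetical helper, and it assumes the server sends only single-line `data:` events like the handler above.

```javascript
// Hypothetical helper: split a raw SSE buffer into complete events.
// Returns the parsed JSON payloads, any trailing partial event to prepend
// to the next network chunk, and whether the [DONE] marker was seen.
function parseSseEvents(buffer) {
  const parts = buffer.split('\n\n');
  const rest = parts.pop(); // last piece may be an incomplete event
  const payloads = [];
  let done = false;

  for (const part of parts) {
    if (!part.startsWith('data: ')) continue; // skip anything that isn't a data line
    const data = part.slice('data: '.length);
    if (data === '[DONE]') {
      done = true;
      break;
    }
    payloads.push(JSON.parse(data));
  }

  return { payloads, rest, done };
}
```

In the browser you would read chunks from `response.body.getReader()`, decode them with a `TextDecoder`, append them to a buffer, call this function, and keep `rest` as the start of the next buffer.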

Prompt Engineering: What Actually Works

After a lot of experimentation, here are the prompt techniques that made the biggest difference for my code review tool:

Few-shot examples. Instead of just describing what you want, show the model an example of ideal output:

const systemPrompt = 'You are a code reviewer. Given a diff, return feedback as JSON.\n\n' +
  'Example input:\n' +
  '+ const data = JSON.parse(userInput);\n\n' +
  'Example output:\n' +
  '[{"line": 1, "severity": "high", "issue": "Unsanitized JSON.parse on user input can throw on malformed data. Wrap in try/catch.", "suggestion": "try { const data = JSON.parse(userInput); } catch (e) { // handle error }"}]\n\n' +
  'Now review the following diff:';

Giving the model a concrete example of the output format and level of detail you expect is worth more than a paragraph of instructions describing it.
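
Asking for JSON doesn't guarantee you get JSON, so it helps to validate the response before using it. Here is a rough sketch of that check. `parseReview` is a hypothetical helper, and the `low`/`medium`/`high` severity scale is an assumption, since only `high` appears in the example above.

```javascript
// Assumed severity scale; the example prompt only shows "high".
const SEVERITIES = new Set(['low', 'medium', 'high']);

// Hypothetical helper: parse the model's text and keep only well-formed issues.
function parseReview(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    return { ok: false, error: 'Response was not valid JSON' };
  }
  if (!Array.isArray(parsed)) {
    return { ok: false, error: 'Expected a JSON array of issues' };
  }
  // Drop entries that are missing fields or use unexpected types
  const issues = parsed.filter(
    (item) =>
      item !== null &&
      typeof item === 'object' &&
      Number.isInteger(item.line) &&
      SEVERITIES.has(item.severity) &&
      typeof item.issue === 'string'
  );
  return { ok: true, issues };
}
```

When `ok` is false, a reasonable fallback is to retry once or show the raw text to the user instead of failing silently.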

Chain of thought. For complex diffs, asking the model to reason step by step produces better results:

const systemPrompt = 'You are a code reviewer. For each issue you find:\n' +
  '1. Identify the specific line or lines.\n' +
  '2. Explain what the current code does.\n' +
  '3. Explain the potential problem.\n' +
  '4. Suggest a fix with code.\n' +
  'Think through each change carefully before giving your assessment.';

Asking the model to think step by step makes it write out its reasoning, which tends to improve accuracy and makes the output more useful to the developer reading the review.

Cost Management

GPT-4 pricing in March 2023: $0.03 per 1K input tokens and $0.06 per 1K output tokens. That adds up fast. A typical code review request with a 500-line diff might use 2,000 input tokens and 1,000 output tokens, costing about $0.12 per review. Run that 100 times a day across a team, and you're looking at $360 per month.
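
The arithmetic above is easy to keep in a small function so you can plug in your own numbers. This sketch uses the March 2023 GPT-4 prices quoted in this post; `estimateCost` is a hypothetical helper, and the prices will need updating for other models.

```javascript
// GPT-4 prices quoted above, in USD per 1K tokens (March 2023).
const PRICE_PER_1K = { input: 0.03, output: 0.06 };

function estimateCost(inputTokens, outputTokens) {
  return (inputTokens / 1000) * PRICE_PER_1K.input +
         (outputTokens / 1000) * PRICE_PER_1K.output;
}

const perReview = estimateCost(2000, 1000); // 0.06 + 0.06
const perMonth = perReview * 100 * 30;      // 100 reviews a day for 30 days

console.log(perReview.toFixed(2), perMonth.toFixed(2)); // prints "0.12 360.00"
```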

Practical cost controls:

  • Set `max_tokens` on every request. Never leave it unbounded.
  • Use GPT-3.5 Turbo for simpler tasks (classification, formatting, extraction) and reserve GPT-4 for complex reasoning. At its launch price of $0.002 per 1K tokens, GPT-3.5 Turbo is roughly 15x cheaper on input and 30x cheaper on output.
  • Count tokens before sending. The `tiktoken` library lets you count tokens client-side so you can estimate costs and truncate inputs that are too long.
  • Cache responses for identical inputs. If the same diff gets reviewed twice, return the cached result.

const { encoding_for_model } = require('tiktoken');

function countTokens(text, model) {
  const enc = encoding_for_model(model);
  const tokens = enc.encode(text);
  enc.free();
  return tokens.length;
}

// Check before sending
const inputTokens = countTokens(fullPrompt, 'gpt-4');
if (inputTokens > 6000) {
  // Truncate or split the diff
}
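
The caching bullet above could look something like this. It is a minimal sketch, not the tool's actual implementation: the in-memory `Map`, `cacheKey`, and `cachedReview` are all hypothetical, and a real deployment would more likely use a persistent store with an expiry.

```javascript
const crypto = require('crypto');

// In-memory cache for illustration only; it is lost when the process restarts.
const reviewCache = new Map();

// Hash the model and diff together so the key stays short even for large diffs.
function cacheKey(model, diff) {
  return crypto.createHash('sha256').update(model + '\n' + diff).digest('hex');
}

// reviewFn is whatever function actually calls the API; it must return a promise.
async function cachedReview(model, diff, reviewFn) {
  const key = cacheKey(model, diff);
  if (reviewCache.has(key)) {
    return reviewCache.get(key); // identical input: skip the API call
  }
  const result = await reviewFn(model, diff);
  reviewCache.set(key, result);
  return result;
}
```

Because responses are not deterministic, a cache hit returns the first review that was generated. A fresh call might have been worded differently.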

Error Handling and Rate Limits

The OpenAI API will rate-limit you, and it will occasionally return errors. You need retry logic:

async function callWithRetry(fn, maxRetries = 3) {
  let lastError;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (error.status === 429) {
        // Rate limited: back off exponentially (1s, 2s, 4s, ...) and retry
        const waitTime = Math.pow(2, i) * 1000;
        await new Promise(r => setTimeout(r, waitTime));
        continue;
      }
      if (error.status >= 500) {
        // Server error: retry after a short fixed delay
        await new Promise(r => setTimeout(r, 1000));
        continue;
      }
      throw error; // Client error, don't retry
    }
  }
  // Attach the last API error so callers can see why the retries failed
  throw new Error('Max retries exceeded', { cause: lastError });
}

Exponential backoff on 429s is important. If you hammer the API after getting rate-limited, the failed requests still count against your limit, so you stay throttled longer. Double the wait time on each retry.
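
A common refinement, which is an addition here and not something the original tool did, is to cap the delay and add random jitter so that many clients rate-limited at the same moment don't all retry together. This is a sketch under those assumptions; `backoffDelay` and its option names are hypothetical.

```javascript
// Delay in milliseconds before retry number `attempt` (0-based).
// The base delay doubles each attempt (1s, 2s, 4s, ...) and is capped at maxMs.
// The result is somewhere between half of that value and the full value.
function backoffDelay(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  const half = exponential / 2;
  return half + random() * half;
}

// Example: print one possible schedule for the first four retries.
for (let attempt = 0; attempt < 4; attempt++) {
  console.log(`retry ${attempt}: wait ${Math.round(backoffDelay(attempt))} ms`);
}
```

The `random` option exists only so the function can be tested with a fixed value. In normal use you would leave it as `Math.random`.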

Embeddings: A Quick Introduction

Beyond chat completions, the embeddings API is worth knowing about. It converts text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors. This enables semantic search: instead of keyword matching, you find content that's conceptually related.

const embedding = await client.embeddings.create({
  model: 'text-embedding-ada-002',
  input: 'How do I handle authentication in Express?',
});

// embedding.data[0].embedding is a 1536-dimensional vector
// Store it, then use cosine similarity to find similar content

For my code review tool, I used embeddings to build a knowledge base of past review comments. When the model generates a review, I check if similar feedback has been given before and reference the previous resolution. This makes reviews more consistent and contextually aware over time.
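
The similarity lookup itself is only a few lines. Here is a minimal sketch using toy 3-dimensional vectors instead of real 1536-dimensional embeddings. `cosineSimilarity` and `mostSimilar` are hypothetical helpers, and a linear scan like this is only reasonable for a small knowledge base.

```javascript
// Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite.
// Returns NaN if either vector is all zeros.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the stored item whose vector is closest to the query vector.
// items is an array of { text, vector } objects.
function mostSimilar(queryVector, items) {
  let best = null;
  for (const item of items) {
    const score = cosineSimilarity(queryVector, item.vector);
    if (best === null || score > best.score) {
      best = { text: item.text, score };
    }
  }
  return best;
}

// Toy data standing in for embedded past review comments.
const pastComments = [
  { text: 'Wrap JSON.parse in try/catch', vector: [1, 0, 0] },
  { text: 'Avoid N+1 queries in loops', vector: [0, 1, 0] },
];
const match = mostSimilar([0.9, 0.1, 0], pastComments);
```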

What I Learned

Building this tool taught me that the hard part of working with LLMs isn't the API integration. That's straightforward. The hard parts are: writing prompts that produce consistent, useful output across a wide range of inputs; managing costs so your side project doesn't generate a surprise bill; handling the inherent non-determinism (the same input won't always produce the same output); and knowing when the model's response is wrong, because it will be wrong sometimes and it will sound confident about it.

The LLM is a powerful tool, but it requires engineering around it: input validation, output parsing, cost controls, error handling, and human oversight. The API is the easy part. The system design around it is where the real work lives.

Tags: openai, llm, gpt-4, api, prompt-engineering, streaming