Prompting is central to how large language models interpret intent and respond to complex tasks. Well-structured inputs significantly influence the quality, relevance, and consistency of the output.
In this topic, you'll learn about the fundamentals of prompt engineering, its significance in crafting instructions for generative models, and how various configurations shape the final response.
What is a prompt?
A prompt is a set of instructions or inputs in natural language that tells a model what task to perform. The purpose of prompting is to steer the model toward a specific outcome, such as answering a question, summarizing information, or generating images. It acts as the bridge between your intent and the patterns the model learned during training.
LLMs operate by predicting the most contextually appropriate continuation of the text they receive. When the input is clear and well-structured, the model can infer intent more reliably and generate responses that better align with your expectations. It does not "know" facts in the human sense; instead, it calculates which words are most likely to follow your specific sequence of words based on what it learned during training.
Because LLMs are flexible and can perform many tasks, even a single word will get a response from them. However, not every input generates helpful content. If you just type "write," the model might write a poem, code, or a grocery list. You need to provide detailed information and context so that the LLM's response is useful. This often requires you to be specific about the format, length, and style you want.
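For instance, compare a vague prompt with a refined one. Both prompts below are invented purely for illustration:

```python
# A vague prompt leaves the task, format, length, and style up to the model.
vague_prompt = "Write about dogs."

# A specific prompt pins all of those down, so the output is far more predictable.
specific_prompt = (
    "Write a three-sentence product description for a dog harness, "
    "aimed at first-time dog owners, in a friendly tone, "
    "ending with a call to action."
)
```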
This process usually involves continuous iteration to get the desired outcomes. You might write a prompt, see the result, and realize you missed a key constraint. You then rewrite the prompt to be more precise. This cycle of writing, testing, and refining is the core workflow of a prompt engineer. It transforms a raw idea into a reliable instruction that LLMs can follow consistently.
Model settings
While the text of your prompt guides the content, the model's behavior is heavily influenced by configuration settings. The most common setting is temperature, which controls the randomness of the output. A low temperature (near 0) makes the model consistent and focused, selecting only the most likely words. This is ideal for tasks like math or coding, where precision matters and there is usually a single correct answer. A high temperature (near 1) lets the model take risks and choose less common words, which is useful for creative writing or brainstorming.
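To make this concrete, here is a minimal, self-contained sketch of how temperature reshapes the probability distribution the model samples from. The candidate words and their scores are made up for illustration; real models work with vocabularies of tens of thousands of tokens.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    scaled = [score / temperature for score in logits]
    max_score = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - max_score) for score in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate next words.
candidates = ["Paris", "Lyon", "Marseille", "banana"]
logits = [5.0, 2.5, 2.0, -1.0]

for t in (0.2, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:")
    for word, p in zip(candidates, probs):
        print(f"  {word}: {p:.3f}")
```

At temperature 0.2 almost all of the probability collapses onto "Paris", so the model behaves deterministically; at temperature 1.0 the probability spreads out and less likely words have a real chance of being picked.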
Two other settings, Top P and Top K, work alongside temperature to filter the word choices. Top P (also called nucleus sampling) limits the model's choices to the smallest set of most likely words whose probabilities add up to a chosen threshold (for example, 0.9). This cuts off the unlikely, nonsensical tail while keeping some variety. Top K is simpler; it tells the model to pick only from the K most likely words (for example, the top 50).
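Both filters can be sketched in a few lines of plain Python. The probabilities below are invented for illustration:

```python
def top_k_filter(word_probs, k):
    """Keep only the k most likely words."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(word_probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept

# Hypothetical next-word probabilities.
probs = {"Paris": 0.70, "Lyon": 0.15, "Marseille": 0.10, "banana": 0.05}

print(top_k_filter(probs, 2))     # keeps Paris and Lyon
print(top_p_filter(probs, 0.90))  # keeps Paris, Lyon, and Marseille (cumulative 0.95)
```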
To prevent the model from repeating itself, we use frequency and presence penalties. The frequency penalty lowers the chance of a word appearing if the model has already used it many times in the text. This stops the model from getting stuck in a loop, repeating the same phrase over and over. The presence penalty is slightly different; it applies a penalty if a word has appeared even once. This encourages the model to move on to new topics and use a wider vocabulary, rather than staying focused on one subject for too long.
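Conceptually, both penalties are subtractions applied to a word's score before sampling. The sketch below roughly mirrors the adjustment described in some provider documentation; the exact formula and the meaningful range of values vary by model and API.

```python
def apply_penalties(score, count, frequency_penalty, presence_penalty):
    """Lower a word's score based on how often it has already appeared.

    The frequency penalty grows with the repetition count; the presence
    penalty is applied once as soon as the word has appeared at all.
    """
    penalized = score
    penalized -= frequency_penalty * count                    # scales with repetition
    penalized -= presence_penalty * (1 if count > 0 else 0)   # flat, one-time
    return penalized

# Hypothetical: the word "great" has already appeared 3 times in the output.
print(apply_penalties(score=4.0, count=3, frequency_penalty=0.5, presence_penalty=0.6))
# 4.0 - 0.5 * 3 - 0.6 = 1.9, so "great" is now less likely to be chosen again
```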
Another critical constraint is the max tokens setting, which imposes a hard limit on the output length. This is different from the context window, which is the total amount of text (your prompt plus the model's answer) the model can process at one time. Setting a max token limit is important for controlling costs and ensuring the application remains fast. However, if the limit is too low, the model might cut off mid-sentence. You must balance this setting to ensure the model has enough space to finish its thought.
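A quick back-of-the-envelope calculation with hypothetical numbers shows how the two limits interact:

```python
# Hypothetical numbers: an 8,192-token context window.
context_window = 8192
prompt_tokens = 7900         # tokens already consumed by the prompt
requested_max_tokens = 500   # the max tokens setting

# The reply can never be longer than the space left in the window.
available_for_output = context_window - prompt_tokens
actual_limit = min(requested_max_tokens, available_for_output)
print(actual_limit)  # 292 -- the reply may be cut off mid-sentence
```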
Finally, stop sequences act as a manual brake for the generation process. You can define specific words or symbols—such as "User:" or "End"—that tell the model to stop writing immediately. This is very useful when building chatbots. For example, if you are writing a script between two characters, you can set a stop sequence for the other character's name. This forces the model to stop after writing its part, waiting for you to provide the next input instead of writing the whole conversation itself.
Note that some settings are only accessible to developers who interact with models programmatically via APIs.
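As a rough illustration, here is what several of these settings look like in a single request using the OpenAI Python SDK. The model name and values are arbitrary examples; other providers expose similar parameters under slightly different names, and some settings (Top K, for instance) are not available in every API.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",    # example model name
    messages=[{"role": "user", "content": "Suggest three names for a travel blog."}],
    temperature=0.9,        # more creative word choices
    top_p=0.95,             # sample only from the top 95% of probability mass
    frequency_penalty=0.4,  # discourage repeating the same words
    presence_penalty=0.2,   # nudge the model toward new topics
    max_tokens=150,         # hard cap on the length of the reply
    stop=["User:"],         # stop generating if this sequence appears
)

print(response.choices[0].message.content)
```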
Simple vs chat prompts
In the early days of LLMs, interactions were based on simple prompts, also known as text completion. In this format, the input is just a single block of text, and the model simply adds words to the end of it. It works like a smart autocomplete, but for whole paragraphs. If you feed it the text "The capital of France is", the model predicts "Paris". This method is easy to use, but it makes it hard to manage complex instructions or long back-and-forth conversations.
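For comparison, a completion-style request is just a raw string that the model continues. The sketch below uses OpenAI's legacy completions endpoint with an instruct model as an example; many providers have deprecated or replaced this style, so treat it as illustrative rather than a recommendation.

```python
from openai import OpenAI

client = OpenAI()

# A completion-style request: one block of text, and the model appends to it.
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # example completion-style model
    prompt="The capital of France is",
    max_tokens=5,
    temperature=0,
)

print(response.choices[0].text)  # e.g., " Paris."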
Modern applications primarily use chat prompts, which structure the input as a list of messages. This format mimics a real conversation and assigns specific roles to each message. The most distinct role is the system message. This message is usually hidden from the final user and acts as the "inner rules" for the model. It defines who the model is (e.g., "You are a helpful coding assistant") and sets boundaries that apply to the whole conversation.
The user role represents the human's input—the actual question or command you type. In a chat app, every time you send a message, it gets labeled as a "user" message. The assistant role represents the model's previous answers. By including these past messages in the prompt, you give the model "memory." This allows the model to look back at what it said earlier to maintain context.
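In code, a chat prompt is typically just a list of role-tagged messages. The sketch below follows the widely used OpenAI-style schema; other providers use similar structures with slightly different field names.

```python
messages = [
    # Hidden "inner rules" that apply to the whole conversation.
    {"role": "system", "content": "You are a helpful coding assistant. Answer concisely."},

    # What the human typed.
    {"role": "user", "content": "How do I reverse a list in Python?"},

    # The model's earlier reply, included so it can "remember" what it said.
    {"role": "assistant", "content": "Use my_list[::-1] or my_list.reverse()."},

    # A follow-up question that only makes sense with that context.
    {"role": "user", "content": "Which of those two modifies the original list?"},
]
```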
A crucial addition in modern systems is the tool role (sometimes called the function role). This role is used when the model connects to external software, such as a calculator, a database, or the internet. If the model needs to do math, it can emit a structured tool call; your application executes it and sends the result back to the conversation as a message with the "tool" role. This lets the model see the actual result of an action and use that real-time data to answer your question accurately.
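A tool interaction adds two extra steps to the message list: the model's structured tool call and your application's reply tagged with the tool role. The sketch below again follows the OpenAI-style schema, with a hypothetical multiply function and call ID; field names differ between providers.

```python
import json

messages = [
    {"role": "user", "content": "What is 37 * 412?"},

    # The model replies not with text but with a structured tool call.
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",  # hypothetical identifier
            "type": "function",
            "function": {"name": "multiply", "arguments": json.dumps({"a": 37, "b": 412})},
        }],
    },

    # Your application runs the tool and reports the result back.
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": "15244",
    },

    # With the real result in the conversation, the model can now answer accurately.
]
```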
Understanding the difference between simple completion and structured chat roles is vital, especially for developers. While simple prompts are fine for quick one-off tasks, chat prompts are the standard for building applications.
Conclusion
In this topic, we established that a prompt is a clear set of instructions that guides an LLM to a specific goal. We explored how settings like temperature, Top P, and penalties change the "flavor" of the output, letting you choose between consistent facts and creative ideas. We also looked at the importance of managing token limits to keep responses complete and efficient.
We also distinguished between simple text completion and structured chat prompts. You learned that modern prompting uses specific roles—system, user, assistant, and tool—to manage context and external data. Mastering both the writing of the prompt and the configuration of the model is the foundation of effective prompt engineering.