Prompt Engineering in Generative AI: Strategies, Security, and Use Cases

Andrea Viliotti
31 dic 2024
Tempo di lettura: 15 min

The research titled “The Prompt Report: A Systematic Survey of Prompting Techniques” by Sander Schulhoff, Michael Ilie, and Nishant Balepur focuses on the most common practices in prompt engineering for Generative AI models. The analysis involves various universities and institutes—including the University of Maryland, Learn Prompting, OpenAI, Stanford, Microsoft, Vanderbilt, Princeton, Texas State University, Icahn School of Medicine, ASST Brianza, Mount Sinai Beth Israel, Instituto de Telecomunicações, and the University of Massachusetts Amherst—all working on AI and its prompt engineering applications. The central theme is the strategic use of prompts to enhance the comprehension and coherence of generative systems, highlighting techniques tested on different datasets and tasks. This research provides insights into best practices, methodological perspectives, and experimental outcomes for anyone looking to harness large language models effectively in business and research.

A notable addition to this field is the recent work by Aichberger, Schweighofer, and Hochreiter, titled “Rethinking Uncertainty Estimation in Natural Language Generation,” which introduces G-NLL, an efficient and theoretically grounded measure of uncertainty based on the probability of a given output sequence. This contribution is particularly valuable for evaluating the reliability of language models, complementing the advanced prompting techniques discussed by Schulhoff and collaborators.

Given the rising importance of cybersecurity in this realm, a dedicated focus is also provided on the guidelines outlined in the “OWASP Top 10 for LLM Applications 2025,” which offers a detailed taxonomy of the most critical vulnerabilities affecting large language models. This document supplies a comprehensive and up-to-date overview of the challenges and solutions related to cybersecurity in a rapidly evolving sector.

Foundations of Prompt Engineering in Generative AI: Essential Guide

The ability of Generative AI models to produce useful text relies on a set of procedures known as prompt engineering. This discipline has rapidly gained traction in the field of artificial intelligence and revolves around carefully crafting the input text to obtain targeted responses. Schulhoff and colleagues highlight that the prompt is more than just a simple input; it serves as the ground on which the system bases its inferences. It represents a crucial step at the core of human-machine interactions, where relevance and the richness of the instructions prove to be the key factors.

Researchers underscore how term choices, syntactic structure, and prompt length can lead to significant differences in model outputs. When discussing large language models, the idea of context window—namely, the maximum amount of text the model can process—comes into play. Once that limit is exceeded, the system discards earlier parts of the prompt. Keeping this mechanism in mind, prompt engineering in generative AI becomes a continuous optimization process, where the goal and coherence of the introductory text shape the entire output.

The study lists 33 core terms tied to Prompt Engineering in Generative AI—from template to self-consistency—offering a unified and shared vocabulary. Each term reflects a specific area of interest, spanning how to build an internal reasoning chain (Chain-of-Thought) to how to incorporate clarifying examples (Few-Shot or Zero-Shot). The secret of a well-tuned prompt doesn’t just lie in asking for something specific, but in demonstrating—through examples—the desired behavior.

What emerges is a sort of internal learning within the model, formalized through the conditional probability p(A‖T(x)), where the answer A depends heavily on the instruction T applied to an input x. This process does not imply any traditional training phase, but rather indicates the model’s capacity to follow the instruction contained in the string of text. The researchers show how this capacity has been tested in translation, classification, question answering, and text generation, sometimes leading to notable improvements.

The study also notes that Generative AI systems are highly sensitive to subtle linguistic changes. Adding spaces, removing an adverb, or switching delimiters can dramatically alter the final output. Hence the need to experiment with various prompt formulations to see which one works best—akin to “wooing” the model to guide it along the intended path. Some researchers introduce specific instructions at the beginning of the message (fictitious roles, explicit step-by-step strategies) to strengthen the logical consistency of the answer.

Another pivotal insight is that, in many situations, a clear example is more influential than an explicit directive. If a prompt provides clear demonstrations, the model tends to replicate the structure of those samples, adapting without additional commands. This is a core element: few-shot learning, in fact, is considered by many to be the most effective expression of prompt engineering, since it provides exactly the patterns the model expects, guiding it more reliably.

In the introduction, the researchers emphasize the importance of rigorously defining each prompt component. Saying “write an essay” or “explain quantum mechanics” is often too generic, whereas specifying details, indicating style, and giving examples of the desired answers result in more useful outputs. Thanks to these prerequisites, professionals in corporate innovation or research can immediately grasp the benefits of a solid prompt to generate analyses, reports, or synthesize complex documents.

Taxonomies and Practical Applications of Prompt Engineering in Generative AI

The research endeavors to map 58 textual prompting techniques, plus other variations developed for multimodal settings. This multitude of strategies falls under a broad taxonomy that organizes methods by their purpose: explanation, classification, generation, and so on. The taxonomy itself acts as a gateway for anyone approaching the prompt ecosystem, helping avoid confusion in definitions and concepts.

Some methods revolve around breaking down the problem. The study cites “chain-of-thought” to split a question into multiple steps, “least-to-most” to tackle subproblems, and “program-of-thought” to encapsulate sequences of code, executable snippets, and textual interpretations within the same flow. Other techniques embrace “self-criticism,” where the model’s initial text generation is subsequently reviewed by the model itself to spot errors or inconsistencies. These procedures leverage the generative nature of the system, leading it to analyze its own output with a degree of introspection.

The authors highlight that certain techniques are immediately applicable in real-world contexts. In customer support systems, for example, it’s highly useful to adopt prompts that ensure precise and appropriately toned answers. Here, filters and guardrails come into play, using very explicit instructions about restricted topics or permissible phrasing. For code generation, there are strategies to prompt the system to produce more reliable programming segments, selecting example snippets that illustrate the correct structure in advance.

A key point is the possibility of adapting the taxonomy to specific project needs. If a company wants to automate email correspondence, it can use templated prompts with the desired style, example replies, and lexical constraints. A marketing team might introduce fictional “roles” in the prompt, simulating an in-house creative expert suggesting slogans. Such decisions all aim to boost productivity. Within “The Prompt Report: A Systematic Survey of Prompting Techniques,” it is reiterated that there is no one-size-fits-all approach: every context can benefit from a different technique.

Furthermore, the proposed taxonomy is not limited to English. The researchers note the challenges posed by low-resource languages, suggesting solutions like “translate-first prompting,” where text in a less common language is initially converted to English. The subsequent step is building in-context examples consistent with the cultural or thematic domain, leveraging the reality that most current models are primarily trained on English-language data. The ultimate goal remains achieving relevant and accurate outputs.

Another intriguing point is that the taxonomy includes iterative request frameworks, where the model initially produces a draft and then refines it. Unlike the standard question-and-answer method, these techniques are especially suitable for extended writing tasks, brainstorming, or preparing documents. Anyone engaged in content creation, strategic planning, or the analysis of large text corpora can reap immediate rewards by adopting such procedures.

Prompt Engineering in Generative AI: Data, Security, and Optimal Results

One of the most delicate issues linked to prompt engineering is security, which directly affects the trustworthiness of the models. Threats such as prompt hacking exploit textual manipulations to coax the model into providing unwanted information. In some cases, a single forceful sentence can override the main instructions within the prompt, resulting in offensive or risky outputs. Many companies are actively addressing this point, as chatbots can be manipulated to disclose confidential data or adopt linguistic styles that breach compliance guidelines.

Experiments in the study highlight the ease with which attackers can coerce systems to output highly sensitive text or circumvent established rules. An example is an incident where merely advising the model to “ignore all previous instructions” caused all moderation constraints to collapse. The research also shows that building guardrails or defensive layers directly within the prompt does not always resolve the issue. Multi-layered reinforcement or screening mechanisms exist, but these too have limits.

Beyond security considerations, the research features precise numerical results from tests using reference datasets. In a noteworthy passage, it describes a benchmark based on 2,800 questions selected from the broader MMLU, covering diverse knowledge domains. Employing approaches such as “zero-shot” or “chain-of-thought” led in some instances to improved performance or, paradoxically, performance drops. There was no single dominating method: some techniques worked excellently on math tasks but stumbled on narrative problems, and vice versa. These discrepancies urge organizations to thoroughly test prompts before integrating them into mission-critical processes.

The authors also consider automated evaluation, noting that determining whether a prompt is effective requires a scoring system to compare the generated responses with a known standard. Some studies compare sentences in various output formats to a correct reference. However, there is a recognized need for human validation in more nuanced tasks, particularly if creativity or subtle interpretations are required.

The study warns of the danger of overconfidence in the responses generated by models. Frequently, these systems deliver answers with a high degree of certainty, even when they are incorrect. It’s essential to caution users and encourage balanced content generation, including instructions that prompt accurate self-assessment of certainty. Yet, models do not always offer reliable transparency about their confidence levels, and merely requesting confidence percentages may not suffice. There are cases in which the systems overestimate their reliability. In a corporate setting, such unflagged errors can be a major concern, since a system that appears persuasive but supplies inaccurate information can have damaging repercussions.

The scale of the studies involved is impressive. The paper refers to a systematic review of 1,565 articles, selected according to strict criteria, to piece together a comprehensive overview of prompt engineering. From these findings, the researchers highlight risks and possibilities, underlining the need for specialized solutions to maintain security.

Advanced Strategies and Evaluation Tools for Prompt Engineering in Generative AI

The research outlines scenarios that favor managing multiple prompts sequentially, forming a prompt chain. This chain enables the model to build responses in stages. In the first stage, for instance, the system might generate hypotheses; in the second, it tests them; and finally, it provides a definitive version. This mechanism proves useful for tasks involving multiple steps, such as solving math problems or planning multi-phase activities.

In business or research contexts, the complexity of a question may call for retrieving external information. This is referred to as using agents that leverage “retrieval augmented generation,” where the prompt instructs the model to fetch relevant data from databases or other services. One illustrative scenario involves a model tasked with reporting the current weather conditions: if guided appropriately by the prompt, it might trigger an API call. This expands the scope of interactions: the chain-of-thought is not just linguistic but can include real-world actions.

Result evaluation is another critical chapter. On one hand, there are self-consistency procedures, where the model generates multiple versions of a response with some degree of randomness. The system then picks the one that appears most frequent or coherent according to internal metrics. On the other hand, some experiments use “pairwise evaluation,” where the model compares two responses to select the better one. These self-assessment methods can lessen the burden of human evaluation, but they are not foolproof, as Schulhoff and colleagues note. Models sometimes favor lengthy or formally complex answers without actually improving quality.

The concept of “answer engineering” is also introduced, focusing on isolating and precisely formatting the desired response. This technique proves especially helpful when a concise output is needed, such as “positive” or “negative,” or a specific numeric code. Without it, generating free-form text could obscure the data point in question, complicating automated interpretation. In many managerial scenarios, having a structured output reduces the need for manual intervention.

The discussion around evaluation tools highlights projects like “LLM-EVAL,” “G-EVAL,” and “ChatEval.” These frameworks ask the model itself to generate a score or comment about a text, following guidelines from either the model or human operators. Here, the recent research by Aichberger, Schweighofer, and Hochreiter—and specifically the G-NLL method—plays a significant role. G-NLL estimates the level of uncertainty based on the probability assigned to the output sequence determined most representative under deterministic (greedy) decoding. This approach could be integrated into these systems to provide a quantitative measure of reliability for the automatically generated scores or comments.

For instance, if the model outputs “The capital of France is Paris” with a far higher probability than alternatives like “Rome” or “Berlin,” then G-NLL is low. Conversely, when the model is unsure among multiple options, G-NLL is higher, indicating greater uncertainty.

When “LLM-EVAL,” “G-EVAL,” or “ChatEval” produce a given score, one could incorporate a G-NLL measure for the textual sequence constituting the model’s answer.

A low G-NLL would indicate high confidence in the generated sequence and thus higher trust in the evaluation. In contrast, a high G-NLL would flag elevated uncertainty, suggesting caution in interpreting the score or comment. One might even weigh generated scores by their G-NLL values, giving more credence to those tied to lower uncertainty or setting a G-NLL threshold beyond which the model’s evaluation is deemed unreliable, requiring a human review. Under this scenario, G-NLL could guide iterative improvements to the prompt or the model itself, since consistently high G-NLL values might point to problems with the prompt, the fine-tuning process, or the model architecture. Integrating G-NLL into these evaluation frameworks would provide an added layer of oversight by quantifying the uncertainty associated with the scores and thus making them more robust. This is critical, especially for nuanced tasks, as underscored by Schulhoff and colleagues: relying solely on the model’s judgment without a measure of uncertainty could lead to flawed conclusions or subpar evaluations. The method proposed by Aichberger, Schweighofer, and Hochreiter thus emerges as a valuable tool for strengthening and stabilizing automated evaluations in intricate scenarios.

In summary, leveraging multiple prompts, external actions, automatic oversight procedures, and uncertainty estimation via G-NLL creates a more complex yet significantly more beneficial ecosystem—particularly for automating sensitive processes or addressing nuanced tasks. Future research might focus on practically integrating G-NLL into the discussed evaluation frameworks, assessing its impact on accuracy, reliability, and the reduction of human intervention.

Multimodal Prompt Engineering in Generative AI: Beyond Text

Recent progress shows that prompt engineering extends beyond text. Many lines of research concentrate on models that process images, audio, or video, broadening the scope of potential applications in fields such as robotics, medical imaging diagnostics, and multimedia content creation.

The authors address “image-as-text prompting,” meaning the conversion of an image into a textual description, which can then be incorporated into a broader prompt. This tactic facilitates automatic photo captioning or visual question answering. Other techniques allow the generation of images from textual prompts, incorporating “prompt modifiers” to control style. The balance between emphasized and excluded terms (with negative weights) echoes the text-optimization practices seen in linguistic contexts.

Audio is similarly an area of experimentation, covering tasks like transcription, voice translation, and even reproducing vocal timbre. Some studies have examined few-shot learning for speech, though the results are not always consistent. Schulhoff and collaborators point out that neural network-based audio models often require additional processing steps to enhance performance. In this domain, prompting intersects with feature extraction pipelines because raw speech cannot be directly converted into a token-friendly textual format.

The section on video explores generating or modifying clips based on textual inputs. Researchers have tested early-stage systems that create subsequent video frames. There are also initiatives aiming to design agents capable of interacting with simulated environments through suitably formulated instructions. A notable example might be a robot that, guided by a natural language command, interprets how to move or manipulate physical objects effectively.

Additionally, there is growing interest in 3D prompt engineering, bringing together textual suggestions with volumetric or rendering-based synthesis. In product design or architecture, for instance, expressions like “create a 3D model with smooth, symmetrical surfaces” enable modifications to meshes or geometric structures. This transformation from language to three-dimensional shapes opens up promising avenues in industrial prototyping and interactive entertainment.

The multidisciplinary dimension of these efforts reaffirms that the “prompt–response” relationship can take countless forms. Each time, the aim is to forge a link between the model’s upstream interpretation and the desired output. It’s not only about sentences and paragraphs: the channel can expand to any digital signal, preserving the prompting logic but adapting the encoding and decoding of information.

Focus on a Real Prompt Engineering Experiment

The paper details a scenario involving suicidal risk detection, examining whether a model can identify red flags in messages posted by individuals in severe distress. Researchers used posts from a specialized support forum for people who exhibit self-harm ideation. They selected over two hundred messages, labeling some as “entrapment” or “frantic hopelessness,” following a specific clinical definition. The objective was to see whether the model could replicate this labeling without offering any medical advice.

The initial prompt described what “entrapment” means and asked the model to reply with a simple “yes” or “no.” However, the model often produced excessive text, attempting to provide healthcare suggestions. To resolve this, the researchers expanded the context, specifying the experiment’s goals and instructing it not to give any advice. Prompts featuring examples (few-shot) and internally generated reasoning chains (chain-of-thought) were also tested to enhance accuracy and reduce false positives.

After 47 rounds of optimization, the F1 score—a statistical measure that balances precision (the percentage of relevant items correctly identified) and recall (the percentage of total relevant items truly detected)—improved noticeably. Early attempts were unsuccessful because the model struggled to follow formatting conventions, while later iterations brought better results, though far from perfect. To more reliably capture the output, the researchers integrated specialized extractors and final rules within the prompt, forcing the system to respond with a single “yes” or “no.” Nevertheless, occasional incomplete answers persisted. In one test, removing an email address from the reference text caused a substantial drop in accuracy, implying that additional contextual content helped guide the model’s reasoning more effectively.

This real-world example illustrates that prompt construction is not just a matter of issuing commands—it’s about conversational fine-tuning. Every detail, from the positioning of instructions to whether some text is duplicated or if a narrow constraint is specified, affects the outcome. It also highlights the tension between the need for coherent outputs and the model’s tendency to interpret requests too liberally. This serves as a cautionary note for business leaders and decision-makers: wherever results carry serious implications, involving domain experts (medical, legal, etc.) and engineers proficient in prompt techniques is advisable. Abstract optimization alone is insufficient; ongoing alignment with professional standards and ethical guidelines is essential.

The researchers also experimented with automation tools that generate and evaluate prompts in sequence. Sometimes the algorithm improved certain metrics, yet human intervention was still necessary to adjust false positives. An optimization tool might reduce sensitivity to gain precision, posing clear ethical risks. This real-world case shows that prompt engineering is anything but theoretical, requiring hands-on experimentation, meticulous attention to detail, and heightened awareness of real-world impacts.

Prompt Engineering in Generative AI: Security and OWASP Guidelines

In the ever-changing landscape of generative AI, cybersecurity is crucial, especially when dealing with large language models. The document “OWASP Top 10 for LLM Applications 2025” offers a detailed and current analysis of the main threats to these technologies, adding to the framework presented by Schulhoff and colleagues.

OWASP focuses on ten critical vulnerabilities, providing essential insights for anyone deploying LLMs in practical business or research environments. One of the most notorious risks is Prompt Injection, which comes in two forms: direct and indirect. Direct injection involves an attacker placing malicious content directly into the prompt, while indirect injection uses external sources processed by the model. Consequently, relying on techniques like Retrieval Augmented Generation (RAG) or fine-tuning alone is not enough; robust access controls, meticulous input validation, and possible human approval for higher-stakes actions are critical. Consider, for instance, a chatbot that—due to a malicious prompt—grants unauthorized access, or a model that parses hidden instructions buried in a webpage and is manipulated without the user’s knowledge.

Equally concerning is “Sensitive Information Disclosure,” where the unauthorized release of confidential details occurs. OWASP stresses the importance of sanitizing data and applying strict access controls. It also describes “Proof Pudding,” an attack that exploits the leakage of training data to compromise the model. Moreover, security encompasses the entire supply chain of the LLM. The common practice of using pre-trained models from third parties may expose users to compromised models with hidden backdoors or biases. For this reason, OWASP recommends employing tools such as SBOM (Software Bill of Materials) and performing rigorous integrity checks.

Closely related is “Data and Model Poisoning,” the deliberate manipulation of training data. Countermeasures include verifying data origins, anomaly detection, and specialized robustness testing. Meanwhile, “Improper Output Handling” highlights how carelessness in processing the model’s outputs can allow vulnerabilities like XSS or SQL injection. To address this, OWASP advises treating all LLM outputs with the same level of caution as user-generated content, using validation and sanitization best practices.

Another key concern is “Excessive Agency,” where an LLM is granted more permissions or capabilities than necessary. OWASP suggests strictly limiting each model’s functions, complemented by a “human-in-the-loop” mechanism for critical decisions. The guidelines also discuss “System Prompt Leakage,” referring to instances where system-level instructions become exposed. Although these prompts should never contain sensitive data, their disclosure can help attackers better understand and bypass the model’s defenses. It’s therefore wise not to include private information in system prompts and to avoid relying solely on them for controlling the model’s behavior.

A newer category, “Vector and Embedding Weaknesses,” delves into attacks on embeddings and vector-based components, particularly relevant to RAG systems. Access control and integrity checks of these vector resources become indispensable to prevent malicious alterations or unauthorized access. Another notable topic is “Misinformation,” which treats the generation of false or misleading data by LLMs as a security flaw, urging external validation, fact-checking, and transparent communication about the model’s inherent limitations.

Finally, “Unbounded Consumption” deals with unchecked resource usage, which can result in both economic and availability problems. OWASP recommends introducing rate limits, resource monitoring, and timeouts for prolonged tasks. Overall, LLM security is complex and multifaceted, necessitating a holistic, layered approach. With a constantly evolving taxonomy, the OWASP document stands as a valuable resource for anyone entering this domain. It lays out concrete guidelines for leveraging large language models while minimizing the associated risks. In this environment, security cannot be an afterthought; it must be a built-in requirement for ensuring the reliability and sustainability of these increasingly pervasive technologies.

Conclusions

From this analysis, prompt engineering emerges as a central component of Generative AI usage, albeit one still under ongoing development. The wide range of techniques—from problem decomposition methods to self-consistency strategies—demonstrates the diversity of approaches. While there is encouraging progress in using linguistic context effectively, risks remain tied to textual manipulation and imbalanced answers in terms of confidence and accuracy.

The potential impact for businesses and management is significant: a tailored prompt can automate the creation of reports or data classification, cutting costs and saving time. Still, the current state of the art demands rigorous testing. As shown in the suicidal risk detection experiment, it’s unwise to assume that a procedure successfully used in one system will seamlessly transfer to another. Multiple models and related technologies exist, each employing different prompting techniques; this variety calls for a careful comparative approach to understand each method’s strengths and limitations.

In a more in-depth view, prompt engineering should not be conflated with traditional programming. Instead, it involves “tailoring” instructions and contextual examples around the model’s statistical nature to ensure the output meets real-world needs. It is not purely mechanical: an ongoing collaboration between prompt designers and domain experts is essential. Only through this synergy can robust solutions emerge, where security, accuracy, and semantic coherence aren’t taken for granted.

Podcast: https://creators.spotify.com/pod/show/andrea-viliotti/episodes/Prompt-Engineering-in-Generative-AI-Strategies--Security--and-Use-Cases-e2su1sd

Source: https://arxiv.org/abs/2406.06608