Preference Discerning in Recommender Systems: Generative Retrieval for Personalized Recommendations
The study titled “Preference Discerning with LLM-Enhanced Generative Retrieval,” led by researchers Fabian Paischer, Liu Yang, and Linfeng Liu (affiliated with the ELLIS Unit, LIT AI Lab, the Institute for Machine Learning at JKU Linz, the University of Wisconsin–Madison, and Meta AI), opens a new chapter in how we think about recommender systems. Their work investigates “preference discerning”: conditioning sequential product recommendations on preferences that users express in natural language. The ultimate vision is to give personalization an added dimension: users can express both positive and negative sentiments (steering instructions) so that the recommendation engine better reflects each person’s nuanced tastes and constraints. From a business perspective, especially in e-commerce, this approach can lead to a measurable performance lift (some experiments suggest improvements of up to 45% in Recall@10) while also nurturing a deeper bond between user and system by minimizing irrelevant results.

Why Preference Discerning in Recommender Systems Redefines Sequential Recommendations

Traditional recommender systems often rely on a user’s past behavior, such as purchase history or clicks, to guess future preferences. Yet such techniques risk overlooking people’s dynamic needs and explicit feedback (e.g., “I’d prefer something allergy-friendly” or “I want to avoid this brand”). With the concept of preference discerning, the authors propose to go beyond mere user-item embeddings: their generative retrieval mechanism deliberately incorporates statements that a user has offered in natural language. They emphasize that real-life preferences can be highly specific, ranging from sentiment-based prohibitions (“I hate scratchy materials”) to broader wishes (“I’d like something lighter for my hikes”).
Traditional systems may fail to capture these nuances, while preference discerning integrates them at the core of the recommendation process. In this approach, the model does not simply scan for the nearest neighbor in an embedding space. Instead, it generates the next relevant item by conditioning on textual preferences. The authors employ a two-step workflow: first, preference approximation extracts the user’s key tastes from data like reviews and item descriptions; second, preference conditioning feeds these preferences into the generative component, shaping recommendations in real time. This two-stage design helps the model pivot quickly in response to new information, such as a user declaring a sudden aversion to a chemical ingredient or wanting to try a new style.

Empirical findings from the study show that classical baseline models struggle with fine-grained changes in user sentiment or abruptly shifting tastes over time. By contrast, a preference-discerning system follows detailed cues, whether a “fine-grained” shift (subtle variations in otherwise stable tastes) or a “coarse-grained” one (a big departure from the user’s historical pattern). Thus, if a person who usually buys synthetic running shoes suddenly wants “the same shoe model but in a natural fiber,” the algorithm does not default to the old preference but adapts accordingly.

Beyond that, the researchers highlight an often-neglected phenomenon: sentiment following. Many recommender systems are adept at identifying what someone likes but do poorly at interpreting what the individual decidedly dislikes. From an e-commerce standpoint, ignoring these negative signals can be costly, since suggesting unwanted products can alienate customers. By embedding user aversions into the generation loop, this new approach aims to reduce friction and zero in on the user’s genuine preferences.
Semantic IDs: Key to Generative Retrieval in Preference Discerning Systems

At the core of the paper, generative retrieval is extended to incorporate textual constraints. One of the structural elements enabling this capability is the design of semantic IDs for items, formally expressed as $RQ(e, C, D) = (k_1, \dots, k_N) \in [K]^N$. This formula captures how the system quantizes a continuous item embedding e into a tuple of N discrete tokens, each taking one of K values. The benefit is significant: the recommender can handle huge catalogs without being bogged down by purely numeric embedding vectors that are hard to interpret. Instead, items are discretized and can be more directly “linked” to natural-language preferences. This pairing of textual preferences with token-based item representations leads to more precise suggestions.

Initial results come from tests on Amazon categories (Beauty, Toys and Games, Sports and Outdoors) as well as the Steam platform. Across these datasets, the investigators observe that text-driven preference modeling raises recall, that is, the system’s ability to place the correct item among the top ten recommendations. The advantage is particularly striking for businesses aiming to reduce user churn: a well-targeted suggestion can reassure prospective buyers that the platform “understands” them.

Moreover, the authors cite history consolidation as a crucial test: the model must distinguish which aspects of the user’s history still matter, while filtering out stale or contradictory preferences. This capacity to sift through a user’s evolving tastes is especially relevant in real-world scenarios: imagine a frequent traveler who once raved about a certain brand but now actively avoids it. If the system can pivot to incorporate these fresh aversions, conversions are likely to go up, with fewer irrelevant items cluttering the user’s search.
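To make the semantic-ID mapping in this section more concrete, here is a minimal sketch in plain Python. It reads RQ as residual quantization, which is an assumption: the text above gives only the signature. It is likewise an assumption that C is a list of N codebooks with K entries each and that D is a distance function (squared Euclidean here). The codebook values and the item embedding are hypothetical toy numbers, not data from the study.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance, used here as the assumed distance D."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def residual_quantize(
    e: Vector,
    codebooks: Sequence[Sequence[Vector]],
    distance: Callable[[Vector, Vector], float] = squared_l2,
) -> Tuple[Tuple[int, ...], List[float]]:
    """Map embedding e to one code index per codebook level.

    At each level the nearest codebook entry is selected and subtracted,
    so the next level only has to describe what is still unexplained.
    Returns the code tuple (k_1, ..., k_N) and the final residual.
    """
    residual = list(e)
    codes = []
    for codebook in codebooks:
        # Index of the entry closest to the current residual.
        k = min(range(len(codebook)), key=lambda i: distance(residual, codebook[i]))
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tuple(codes), residual


# Hypothetical toy setup: N = 2 levels, K = 3 entries per level, 2-d embeddings.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],   # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # finer correction level
]
item_embedding = [1.2, 0.9]

semantic_id, leftover = residual_quantize(item_embedding, toy_codebooks)
print("semantic ID:", semantic_id)
print("unexplained residual:", leftover)
```

In this toy setup the first code picks the coarse cluster and the second refines it, which is why items with similar embeddings tend to share leading tokens. In a real system the codebooks would be learned rather than written by hand.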
How Benchmarks Validate Preference-Based Models

To rigorously validate their methods, the authors propose a benchmark across five dimensions:

1. Preference-based recommendation: The model is given a textual preference, such as “only gluten-free products,” and tested to see if it can produce the correct item next. Training, validation, and test sets are structured in a way that ensures old and new preferences do not overly overlap.
2. Fine-grained steering: This checks if a system can follow incremental changes in preference. For instance, a user might typically seek a certain style of running shoe but now demands an even lighter variant.
3. Coarse-grained steering: The system is tested with drastic preference shifts, like jumping from sneakers to formal dress shoes.
4. Sentiment following: The model must handle strong user sentiment, for or against certain brands, materials, or categories, and either highlight or exclude relevant items.
5. History consolidation: The system processes a wide array of user preferences, some of which are no longer relevant. The goal is to filter out the noise and keep track of what still matters.

Across these axes, classical systems can falter, especially on negative sentiments, because they often rely on positive correlations to make a recommendation. If your previous purchases favored brand X, standard models might keep suggesting that brand even if you have recently expressed distaste. Preference-discerning systems aim to close that gap and so reflect user desires more completely.

Mender: The Future of Multimodal Recommender Systems

At the center of these innovations stands the model known as Mender, short for Multimodal Preference Discerner. Mender uses semantic IDs to generate new recommendations based on text-based user preferences, further refining the principle of preference discerning. Unlike typical recommender architectures that compare items in pairs, Mender employs an autoregressive approach.
Given a user’s current state (history, textual instructions, or both), Mender directly predicts which item should appear next. Concretely, Mender relies on the mapping $RQ(e, C, D) = (k_1, \dots, k_N) \in [K]^N$ to convert embeddings into discrete token codes. This bridge lets the model combine linguistic constraints (“avoid certain allergens,” “aim for sustainable materials,” etc.) with vast product spaces. The result is a system capable of “translating” user prompts into recommended items, circumventing the need for complicated retrieval heuristics. Instead, the system “generates” the next item in a manner reminiscent of how text-generation models produce the next word in a sentence.

Technically, Mender relies on a pre-trained language encoder and a specialized decoder that outputs these semantic token sequences. The cross-attention mechanism couples the user’s textual instructions and purchase history with the generation process. Two variations illustrate Mender’s versatility:

- MenderEmb: Maintains separate embeddings for user preferences and items, later aligning them.
- MenderTok: Merges the history and user instructions into a single textual stream, prompting the model to treat all of the data as one sequential input.

Notably, MenderTok often excels in performance benchmarks. On Amazon Beauty, the Recall@10 metric rises from roughly 0.0697 with certain baseline models to around 0.0937 with MenderTok, a relative gain of about 34%. In Sports and Outdoors, it moves from 0.0355 to 0.0427, about 20% in relative terms. These gains, small as they look in absolute numbers, have tangible implications for real-world e-commerce, translating into more potential conversions.

A pivotal feature is the model’s ability to adapt swiftly to new user profiles or novel constraints expressed in natural language. By leaning on a generative approach, Mender does not require laborious re-training for every shift in user preference.
Instead, it processes textual statements or clarifications in real time, updating its recommendations accordingly. This adaptability is invaluable for businesses looking to scale to large catalogs while maintaining a personalized edge.

The study’s authors underscore that Mender’s effectiveness also hinges on high-quality preference inputs. In trials, roughly 75% of preferences extracted from user reviews closely mirrored true user inclinations. Mender capitalizes on these well-curated preferences by filtering out extraneous noise and concentrating on relevant signals. Combining user-provided text with historical data in this way paves the way for expansions to related items, bringing fresh but contextually aligned suggestions into play.

For enterprises wishing to embed Mender within their data pipelines, the combination of semantic embeddings and user instructions holds promise for interoperability: product reviews, social media mentions, or direct user queries can all feed into this model. Because Mender leverages a single encoder-decoder architecture, explainability and transparency may be more feasible, making it easier to justify recommendations to end users or to adapt for corporate objectives (like highlighting high-margin items).

E-commerce Innovations with Preference Discerning

The study evaluates four main datasets: three from Amazon (Beauty, Toys and Games, Sports and Outdoors) and one from the Steam platform. Action counts range from 167,597 for Toys and Games to nearly 600,000 on Steam, reflecting both the diversity and scale of the tested domains. Textual preferences are not invented in a vacuum: the authors draw on real user reviews and refine them via large language models, weeding out repetitive references and random artifacts. This ensures that the preferences fed into Mender align with authentic consumer language. Performance is judged using standard recommender metrics like Recall@5, Recall@10, and NDCG@10.
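As a rough illustration of the metrics just named, the sketch below computes Recall@k and NDCG@k for next-item prediction and then the relative lift implied by the Recall@10 figures quoted earlier for Amazon Beauty and Sports and Outdoors. It assumes exactly one held-out target item per user sequence, a common setup that is not confirmed here as the study's exact protocol. The ranked lists in `toy_cases` are hypothetical.

```python
import math
from typing import Callable, Hashable, List, Sequence, Tuple

Case = Tuple[Sequence[Hashable], Hashable]  # (ranked recommendations, true next item)


def recall_at_k(ranked: Sequence[Hashable], target: Hashable, k: int) -> float:
    """1.0 if the held-out item appears among the first k recommendations."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[Hashable], target: Hashable, k: int) -> float:
    """NDCG@k with a single relevant item.

    The ideal DCG is 1 in that case, so the score reduces to
    1 / log2(rank + 1) for a hit at 1-based position `rank`.
    """
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def mean_metric(metric: Callable[..., float], cases: List[Case], k: int) -> float:
    """Average a per-user metric over all evaluation cases."""
    return sum(metric(ranked, target, k) for ranked, target in cases) / len(cases)


def relative_lift(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Hypothetical toy evaluation set: three users, k = 2.
toy_cases: List[Case] = [
    (["a", "b", "c"], "b"),  # hit at rank 2
    (["x", "y", "z"], "z"),  # target at rank 3, so a miss at k = 2
    (["p", "q", "r"], "p"),  # hit at rank 1
]
recall_2 = mean_metric(recall_at_k, toy_cases, k=2)
ndcg_2 = mean_metric(ndcg_at_k, toy_cases, k=2)

# Recall@10 values quoted in the article (baseline vs. MenderTok).
beauty_lift = relative_lift(0.0697, 0.0937)
sports_lift = relative_lift(0.0355, 0.0427)

print(f"Recall@2 = {recall_2:.3f}, NDCG@2 = {ndcg_2:.3f}")
print(f"Beauty lift = {beauty_lift:.1%}, Sports and Outdoors lift = {sports_lift:.1%}")
```

With the quoted figures, the relative lift comes to roughly 34% on Beauty and 20% on Sports and Outdoors, which shows how small absolute differences in Recall@10 can correspond to sizeable relative gains.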
The system’s consistency in capturing negative preferences, such as excluding a disliked brand from top results, proves especially impactful. Many existing models, if not specifically trained on negative data, will keep recommending items that the user has explicitly rejected. Preference discerning addresses this failing by baking negative signals into the generative routine. For instance, if an individual strongly opposes a certain brand, Mender is designed to deprioritize it or remove it from top suggestions.

Another highlight is how Mender processes multiple evolving preferences, some of which may clash. This so-called history consolidation arises when a user accumulates many preferences over time but no longer needs all of them. While standard generative models might attempt to juggle all hints at once, Mender zeroes in on the ones that truly matter for the recommendation at hand. It thus balances reliability (remembering past signals) with flexibility (overriding them when they are outdated).

From a business standpoint, this capacity to move between continuity and controlled shifts means that an e-commerce platform could pilot new product clusters for a user without alienating them by ignoring old preferences. In practical terms, managers can direct the system to encourage or emphasize certain product lines, letting the model find a sweet spot between user satisfaction and company objectives.

Expanding the Potential of Generative Retrieval

The paper’s methods for textual preference integration open doors for a variety of industries. Whether in e-commerce, travel, healthcare, or media streaming, the ability to parse user preferences rapidly and accurately can enhance loyalty and reduce friction. If a user says “Only show me cruelty-free options” or “I’d like to avoid violent films,” a robust preference-discerning engine becomes indispensable for an engaging, trust-building experience.
From a technical vantage point, merging large language models with item embeddings can be computationally complex, but the authors propose to release code and benchmarks that enable peer review and replication. This forward-looking approach should help the field measure Mender’s performance against emerging alternatives, ensuring that the underlying technology keeps pace with new breakthroughs.

It is also important to recognize that metrics like Recall@5 and Recall@10 only scratch the surface when it comes to user satisfaction. The immediacy of feedback, the interpretability of results, and the model’s capacity to respond to real-time prompts will become even more decisive in industries where user experience is paramount. As large language models continue to improve, more sophisticated textual commands, potentially covering style, ethical concerns, or budget constraints, will become routine in recommendation dialogues.

By spotlighting explicit preference conditioning, this study advances a vision of the user as a co-creator in the recommendation process. An enterprise can overlay its own guidelines (e.g., business intelligence targets, marketing priorities) without drowning out the user’s personal voice, provided the system is carefully balanced. Mender’s generative nature readily accommodates prompts that might arise from ephemeral online interactions or fast-evolving social-media trends, where user opinions change suddenly or must be integrated on the fly.

Concluding Reflections

Overall, the findings underscore how explicitly weaving user preferences into the generative engine can heighten recommendation quality and open new avenues for personalization. Mender and its associated benchmark handle text-based instructions with relative ease, aligning well with the ongoing shift toward large language models. In practical terms, this implies fewer bad recommendations, more potential to branch out into specialized product categories, and a user base that feels genuinely heard.
Although other generative retrieval systems are already experimenting with language-based constraints, this paper’s central innovation lies in clearly separating the generation of user preferences from the actual conditioning phase. That means preferences can be created even in the absence of exhaustive user histories, making the system more amenable to brand-new users. In effect, the authors point to a future in which positive and negative sentiments expressed in plain language can shape the system’s behavior in real time.

For corporate decision-makers, adopting preference-discerning methods might be more than just another technical upgrade: it signals a strategic pivot toward user-driven experiences. By letting textual preferences guide the model’s next move, businesses effectively amplify the user’s voice. This fosters a climate of responsiveness and trust where the user’s personal needs and the organization’s goals can align more harmoniously. In so doing, Mender and generative retrieval point toward adaptive recommendation engines that balance personalization, efficiency, and user agency.

Podcast: https://creators.spotify.com/pod/show/andrea-viliotti/episodes/Preference-Discerning-in-Recommender-Systems-Generative-Retrieval-for-Personalized-Recommendations-e2svs4u
Source: https://arxiv.org/abs/2412.08604
Generative AI: Strategic Perspectives for Neuroscience, Security, and Governance
The growing complexity of today’s corporate and innovation landscape arises from the convergence of multiple factors: language models that surpass the expertise of human specialists, the integration of Generative AI into highly regulated industries such as banking, the need for specialized skills in the public sector, new security challenges related to LLM applications, increasingly complex game-based testing environments, and the establishment of control standards to avoid critical vulnerabilities. This is not merely a technological issue: it marks the advent of a scenario in which the ability to process, analyze, manage, and control AI becomes a true competitive, strategic, and cultural lever.

The complexity of LLMs comes into sharp focus when they are compared with areas of knowledge that were once the exclusive domain of high-level specialists. The article “BrainBench: Language Models Surpass Neuroscience Experts” shows how these models can synthesize decades of research and predict the outcomes of neuroscientific experiments, often more efficiently than humans themselves. This no longer means competing on the playing field of mere information retrieval; it means surpassing humans in forecasting. Yet this extraordinary efficiency contains a paradox: LLMs’ ability to uncover hidden patterns and correlations unknown to experts creates new responsibilities in controlling, aligning, and verifying the quality of their predictions.

As AI’s predictive power advances, securing its implementations takes on a critical role. The article “LLMs and Security: MRJ-Agent for a Multi-Round Attack” describes the evolution of threats, showing how multi-round attack agents can bypass sophisticated defenses.
We’re no longer talking about simple glitches or temporary weaknesses: the vulnerability landscape is becoming dynamic, with attacks adapting to the defensive responses of the models. In the past, raising a few walls might have sufficed; today, we need a comprehensive defensive strategy, from detecting malicious patterns to calibrating the autonomy of agents, to designing continuous testing. Security thus becomes a fluid process, not a final state reached once and for all. At the same time, adopting " GenAI in Banking " paves the way for a deep transformation of customer interactions, regulatory compliance, and risk management. Here, the stakes are extremely high: integrating the power of AI into decision-making processes can boost productivity, improve customer experience, and optimize data analysis. However, businesses must contend with cybersecurity challenges, system quality issues, and a gradual adoption process. This is not just a technical question; it’s a strategic consideration that demands investment decisions, public-private partnerships, and ongoing staff training. The goal is not merely to cut costs or increase efficiency, but to build an ecosystem of trust between the financial institution and its stakeholders. The public sector is also part of this shift. " AI Governance for Public Sector Transformation " emphasizes how the adoption of AI in public administrations requires technical, managerial, and political skills, as well as good governance practices that ensure transparency, reliability, and compliance with regulations. The topic is not confined to technology; it forms an ecosystem of policies, guidelines, staff training, and continuous alignment with human values. This scenario becomes a testing ground for the legitimacy of innovation: if AI in the public sector is not managed with rigor and ethics, there is a risk of undermining citizens’ trust, reducing innovation to nothing more than an empty exercise. 
The spectrum of LLM applications grows even wider when considering highly complex and dynamic environments such as gaming. The article " Gaming and Artificial Intelligence. BALROG the New Standard for LLMs and VLMs " shows how testing models in gaming contexts can reveal shortcomings in long-term planning, exploration capabilities, and the management of multimodal inputs. BALROG is a benchmark designed to test the agent-like abilities of the models, bringing to light their limitations in environments that simulate real-world scenarios, where AI must address unpredictable challenges. This approach helps identify weaknesses and gaps in reasoning, driving research toward more robust, versatile models capable of adapting to complex and ever-changing situations. The need to control and prevent vulnerabilities is no mere add-on. " OWASP Top 10 LLM: Ten Vulnerabilities for LLM-Based Applications " provides a detailed picture of the risks: from prompt injection to the disclosure of sensitive information, from supply chain weaknesses to generated disinformation. Although these vulnerabilities are technical in nature, they raise strategic questions: how can we protect resources, ensure financial resilience, and maintain public trust? Implementing integrated approaches, from data sanitization to defining operational limits and including human supervision for critical actions, is essential. Companies must invest not only in technical capabilities but also in awareness, internal training, and partnerships with security experts, making security a source of added value. Taken as a whole, the emerging landscape is one of profound transformation that cannot be left to chance. Companies and institutions are called to integrate expertise, control strategies, and ethical visions. Generative AI is not simply another tool to add to one’s technological arsenal: it is a paradigm shift that forces a rethinking of processes, business models, and governance methodologies. 
Faced with this scenario, the future belongs to those who can adopt hybrid solutions, balancing the power of LLMs with human oversight, the rigor of security with the flexibility of innovation, the capacity to foresee risks with the determination to seize opportunities. And as an imaginary ancestor of mine used to say, folding his arms with a smile somewhere between resigned and amused: “You may have the entirety of human knowledge at your fingertips, son, but truly knowing when to stop and look elsewhere always requires a touch of humanity.” And as his words fade, all that remains is the echo of advice no algorithm can ever update with a patch.
Generative AI and Quantum Technologies: Redefining Business Frontiers
In the dynamic landscape of corporate innovation, generative AI and quantum technologies are redefining business possibilities and creating new opportunities and challenges for companies across all sectors.

Let us begin with “Q-PnV: a new approach to quantum consensus for consortium blockchains,” which explores the integration of quantum consensus mechanisms into blockchain technology. This approach not only strengthens the security of business transactions against future threats posed by quantum computing but also demonstrates how companies can adopt advanced technologies to ensure integrity and reliability in their distributed systems. The blockchain, envisioned as an immutable public ledger, finds a powerful ally in Q-PnV to withstand attacks from quantum computers, thanks to techniques such as quantum entanglement, a phenomenon in which the states of correlated particles remain linked regardless of the distance between them.

Moving on, “Tech Trends 2025. Artificial Intelligence, the Cognitive Substrate for the Digital Future” paints a picture in which AI is no longer an isolated technology but becomes the invisible fabric that permeates every technological, social, and economic aspect of business. This transformation is comparable to electrification, which was initially groundbreaking and then became indispensable. AI, as a cognitive substrate, enables faster and more precise decision-making processes, deeply integrating with data, systems, and corporate workflows. Companies are called to rethink strategies and business models, adapting their internal expertise to fully exploit the potential of an advanced cognitive infrastructure.

In this evolving scenario, “Artificial consciousness and biological naturalism: a perspective between computation, living dynamics, and ethical implications” introduces a critical ethical and philosophical debate.
Although artificial consciousness is not yet a reality, it raises fundamental questions about the nature of intelligence and awareness in machines. Businesses must consider not only technological efficiency but also the ethical implications of their innovations, ensuring that AI development is conducted responsibly and sustainably. Within the context of advanced technologies, “ How the RevThink Framework Enhances Efficiency in LLM Models ” presents an innovative approach to improving the deductive capabilities of large language models (LLMs). The RevThink framework leverages inverse reasoning, a technique that enables models to verify the answers they generate, increasing both accuracy and consistency in their deductions. This method not only optimizes model performance but also reduces the need for massive training datasets, making AI more accessible and sustainable for companies looking to implement advanced solutions without incurring excessive computational costs. AI’s impact also extends to traditional sectors such as accounting and finance. In “ Impact of AI in Accounting and Finance ” we see how AI is automating manual processes, enhancing the accuracy of financial forecasting, and improving risk management. Generative AI technologies are transforming financial reporting and strategic analysis, showcasing the power of generative AI and quantum technologies in traditional sectors, allowing companies to make more informed and timely decisions. However, adopting these technologies requires a transformation of professional skill sets and careful governance to address ethical and data security concerns. Finally, “ Multimodal AI in medicine ” illustrates how integrating data from various sources—such as genetics, imaging, and wearable sensors—is spurring innovation in diagnostics and personalized medical treatments. Multimodal AI enables a holistic view of a patient’s health, improving diagnostic precision and therapeutic effectiveness. 
This technology not only enhances the quality of care but also optimizes healthcare resource management, making the system more efficient and sustainable. In a world of continuous technological evolution, integrating generative AI and quantum technologies marks a strategic shift for businesses aiming to maintain a competitive edge. However, this transformation requires a forward-looking vision, targeted investments, and strong ethical governance. As an imaginary ancestor of mine might say, “To innovate without looking to the future is like sailing without a compass: you risk getting lost in the seas of change.” Who knows—the future will surprise us with even more innovations. But one thing is certain: those who know how to connect intelligence and technology will hold the keys to unlocking the doors of success.
Generative AI and Globalization: How Businesses Can Leverage Trends and Overcome Challenges
In a world where we wake up every day to conflicting news about the state of the economy and the transformative impact of Generative AI and globalization, and where doubts persist about the direction technology might take, some recent reflections offer an interesting overview that blends the most human needs with the aspiration for ever more advanced business models. It is a mix that spans from globalization (with all its potential opportunities and contradictions) to the development of generative artificial intelligence, passing through future scenarios of companies ready to experiment with new productivity formulas. A complex mosaic, then, to be observed with curiosity, but also with the awareness that every innovation, especially in generative AI and globalization, entails non-trivial challenges and ethical, social, and economic implications.

It is enough to read some analyses of the global situation, such as those contained in “Ipsos Global Trends 2024: Analysis of Tensions Between Global Uncertainties and Individualism,” to realize that globalization is far from over, although there are very strong forces pushing for the protection of local markets and a strengthening of national pride. In several emerging countries, the idea of entering an increasingly interconnected market even appears stimulating, demonstrating that when the benefits of globalization feel concrete, it becomes natural to support its expansion. Yet the data also show the growth of phenomena like economic nationalism, as if to maintain a distinctive identity in the face of an unstoppable flow of ideas and goods. Within just a few lines, we come across a kind of paradox: the same person, convinced of the advantages of interconnection, may also strongly desire to protect their country’s autonomy.
For businesses, knowing how to navigate between localism and global vocation means calibrating strategies, brand identity, and operating models that consider different cultures, evolving markets, and, above all, a public opinion that is not always linear. In parallel, in the coming years, the issue of artificial intelligence will end up intertwining with social trends in an even more pronounced way. A window onto this near future is offered by “ 2025: AI Scenarios in Business ” a contribution that already presents situations in which companies rely on generative AI to speed up product design, reduce errors, and increase productivity. If terms like “AI agents” seem abstract, it’s worth specifying that an AI agent is software capable of acting autonomously on data or systems, performing analytical (and sometimes decision-making) tasks that, without automatic support, would require a massive investment of human time. These tools, far from replacing existing professional skills, tend to reframe their contours: repetitive work is eliminated, and the focus shifts to strategic and creative aspects. It makes sense, however, that each transition of this kind demands new skills and attention to “Responsible AI,” a set of methodologies aimed at designing systems that respect privacy, ethical values, and transparency rules. From a broader perspective, “ Technology 2025: Evolving Global Dynamics ” encourages us to look further and ask ourselves how geopolitical dynamics and markets will develop, considering the increasing importance of elements like cybersecurity, supply chain management, humanoid robotics, and the convergence with augmented and immersive realities. 
The arrival of 5G and (in the future) 6G networks, the approach of quantum computing scenarios (a term indicating the capability of special machines to solve complex problems by leveraging quantum properties), and the need to revise encryption protocols all intertwine with political tensions, fueled also by those who see greater protectionism as an opportunity to reshape global balances. Consequently, companies looking to expand on an international scale must balance efficiency and competitiveness while safeguarding the cultural aspects of the countries where they operate. This could encourage the adoption of “glocal” production systems, where innovation can emerge from multiple regional hubs without necessarily being centralized in a single location.
Still in this context, “Tech Trends 2025. Artificial Intelligence, the Cognitive Substrate for the Digital Future” delves into the idea that AI won’t just be “used” consciously but will act as a pervasive infrastructure, like electricity or the Internet, that future users might not even perceive as “extraordinary.” This shift demands both technical and cultural reflection: on one hand, it requires specialized hardware (for example, GPUs, which are graphics processors suited to parallel computations) and robust energy management; on the other hand, it has implications for how people will train, communicate, and verify the reliability of information. Consider, for instance, how the use of voice assistants on smartphones or in smart homes has already evolved: initially seen as a gadget, it has begun to blend into daily life, often without the user reflecting on the scope of these tools.
However, one cannot ignore the ethical and social dimension. This is where “Generative AI Ethics: Implications, Risks, and Opportunities for Businesses” comes into play, addressing how the production of images, texts, and videos by increasingly sophisticated algorithms affects work, art, education, and privacy protection.
The concept of deepfakes (videos or audio created to seem real but generated by an AI system) is only the tip of the iceberg in a context where the ease of generating content could influence the spread of fake news or potentially harmful information. At the same time, for a brand or institution, being able to leverage generative AI can open new spaces for creativity, experimentation, and service personalization. The real challenge, as highlighted in many studies, is establishing a framework of shared rules and responsibilities: protecting intellectual property, preventing sensitive data from indiscriminately ending up in training datasets, and adopting “Responsible AI” practices to avoid dangerous distortions and manipulations. In this interplay between globalization and cutting-edge AI, some constants emerge. On the one hand, there is a widespread demand for transparency: consumers and citizens want to know the impact of what they purchase, the production chain, and how companies handle data. On the other hand, there is a need for skill sets that go beyond mere technological knowledge, encompassing the ability to interpret economic trends, grasp cultural sensitivities, and preempt social tensions. Returning to social tensions, the data highlighted by Ipsos show how the very concept of inequality has changed shape in an era when anyone can establish virtual contacts with others, and where precariousness is sometimes perceived in more subtle forms, sometimes more striking. For organizations, this translates into a responsibility: implementing strategies, not solely oriented toward profit, that consider a trust that must be earned day by day, especially in diverse markets and communities. Thought then goes to a future scenario where companies find themselves evaluating, on the one hand, the benefits of AI capable of handling an enormous flow of information, and on the other, the need not to offload an excess of complexity onto individuals. 
We might see AR (Augmented Reality) tools that make training processes more immersive and faster, or e-commerce platforms capable of hyper-personalizing the shopping experience. These technologies, if well balanced, can improve efficiency and even create job opportunities never imagined before. Yet we should not overlook the risk of informational saturation and decision-making overload, which could penalize those who lack the tools (or time) to keep up with constant updates. In other words, while systems evolve, a collective responsibility is needed to avoid forms of exclusion and manipulation, whether subtle or overt.
Another common thread in the perspectives mentioned above concerns governance. If generative AI technologies begin to make an impact in previously unthinkable areas, defining reliable protocols becomes urgent. It’s not enough to rely on the goodwill of individual developers: a broader pact is needed among companies, institutions, scientific communities, and end users. Managers who are sensitive to innovation see opportunities for cost savings and creative momentum, but they also need to establish internal auditing processes and cross-sector collaboration to mitigate the risk of a race to the bottom. It’s not about overregulating, but about sharing minimum standards, for example on responsible data management or security mechanisms that prevent a system from generating content contrary to the public interest.
Ultimately, the persistent tension between localism and a global outlook, between protectionist impulses and a desire for cooperation, seems to merge with the broader debate on AI and on its generative form, capable of automating creative and analytical activities once reserved for humans alone. Anyone envisioning a future in which humanoid robots integrate into the workforce is not a naive optimist but rather an observer of signals already visible in certain cutting-edge sectors.
Likewise, those who highlight fears about misinformation, data breaches, political manipulation, and cultural homogenization are not merely alarmist but recognize the need for rules, a culture of caution, and mechanisms of continuous validation. In between lie extraordinary possibilities: boosting medical research, setting up more sustainable production chains, making education inclusive and free from geographical constraints.
How to navigate so many stimuli? Perhaps it’s helpful to focus on cross-functional skills: the ability to interpret data, assess social impact, and envision an organization that is as resilient as possible and ready to revise strategic choices. In an era when even news reporting and communication can be disrupted by automated generation systems, transparency becomes an indispensable safeguard, a credibility criterion for businesses seeking to endure over time. Adopting AI does not mean imposing a miraculous solution from above but building an ecosystem where machines and people coexist, each with their own role, so that the final outcome is truly sustainable and open to innovation that brings tangible benefits.
By way of conclusion on this journey through technological perspectives and global reflections, one might say that, although superintelligent systems can make our lives easier, our humanity also resides in the enthusiasm for learning and the pleasure of challenging ourselves. If a device already knew how to do everything in our place, we might end up forgetting the satisfaction of a well-designed idea or a personal discovery. And perhaps precisely in this tension between convenience and curiosity lies the ultimate meaning of innovation: providing the tools but leaving people the freedom to explore, learn, and make mistakes. Because only in this way do we remain critical, aware, and truly ready to seize whatever lies ahead. The rest… is still there to be discovered.
Mender: Preference Discerning and Generative Retrieval for Personalized Recommendations
“Preference Discerning with LLM-Enhanced Generative Retrieval,” by Fabian Paischer, Liu Yang, and Linfeng Liu, involves the ELLIS Unit, LIT AI Lab, Institute for Machine Learning at JKU Linz, the University of Wisconsin, Madison, and Meta AI. The research addresses sequential recommendation from a generative retrieval perspective, exploiting user preferences expressed in natural language. It opens opportunities for sharper personalization, with the possibility of guiding the system through negative indications (sentiment) or specific wishes (steering). For companies operating in e-commerce, the data suggest that including preferences expressed as text in recommender systems can improve performance, with an estimated increase of up to 45% on metrics such as Recall@10. This metric measures a system’s ability to place relevant items among the first 10 results shown, which is crucial for improving the user experience. The approach offers practical guidance for optimizing the offering of products and services, saving resources and fostering greater audience engagement.
Preference Discerning: The New Standard for Sequential Recommendation
The preference discerning paradigm stands out as an innovative practice for explicitly integrating user preferences into generative retrieval models, refining personalization in recommendations. Generative retrieval does not merely compare static representations of items and users but directly produces the most suitable next item. The underlying idea is that interaction history alone is not enough to capture the user’s true intent, because users often express wishes or constraints of various kinds, especially through reviews or textual notes that are hard to encode in traditional approaches.
In the work by Paischer, Yang, and Liu, the notion of generative retrieval takes on a distinctly textual character. The system incorporates preferences such as “I prefer lightweight products that do not contain certain substances” or “I avoid certain uncomfortable materials altogether.” These wishes become key variables for generating the next item in a purchase sequence. To reach this level of personalization, the researchers introduce a two-stage approach: preference approximation and preference conditioning. The first identifies each user’s personal inclinations from data such as reviews and descriptions of items already purchased; the second conditions the generative model on these preferences, making the recommendation flexible and responsive to both positive and negative instructions.
The numerical results reveal that standard methods struggle to interpret particular details, such as sentiment preferences or changes in personal taste over time. A preference discerning system addresses the issue by offering, among others, a “fine-grained steering” evaluation (the ability to adjust the recommendation precisely) and a “coarse-grained steering” evaluation (a more generic adaptation that is still attentive to new preferences). For example, if a user specifies avoiding synthetic materials for footwear, the system not only stops proposing unwanted products but also suggests alternatives consistent with the preferred direction. The research also shows that many existing models handle sentiment following poorly, that is, understanding whether a user expresses a clear rejection of, or a strong attraction to, a certain brand or material. The innovation in terms of generative retrieval lies in building these aversions and inclinations into the generation of the output.
This is particularly useful for those who run e-commerce services and want to limit unwanted suggestions that would risk frustrating the user. A central concept is the formula for representing items as semantic IDs, namely RQ(e, C, D) = (k_1, ..., k_N) ∈ [K]^N, which defines a quantization process that converts embeddings into discrete representations. This step makes it possible to generate interpretable tokens even with millions of different products. Tests on datasets such as Amazon (Beauty, Toys and Games, Sports and Outdoors) and Steam show that, as textual information increases, recommendations become more targeted. For companies running e-commerce, it is particularly effective to combine the integration of user preferences (history consolidation) with personalized steering instructions. This lets the company detect changes in user behavior over time and adapt its product presentation strategy accordingly. The approach favors higher conversion rates and a lower risk of overloading the user with irrelevant content.
Benchmarking in Preference Discerning: Tests and Innovative Methodologies
The researchers set up a benchmark with five evaluation axes: preference-based recommendation, fine-grained steering, coarse-grained steering, sentiment following, and history consolidation. Each axis highlights a different usage scenario and subjects recommendation models to particular challenges. In preference-based recommendation, the model receives a specific, previously generated preference (for example, “Opt for products free of certain allergens”) and must predict which item will be desired.
To validate the robustness of the solutions, training, validation, and test sets are used that avoid overlap between preferences already seen and new ones, so as to measure the ability to generalize to unseen users. For fine-grained steering, the aim is to understand whether the system can capture preferences very close to the item actually purchased. Imagine a user who has always chosen ultralight running shoes: the preference might specify wanting to try a version that is “even lighter but with a certain type of cushioning.” The method must orient itself reliably toward related directions, producing items that are similar but not identical. By contrast, coarse-grained steering assesses the ability to respond to preferences that move the recommendation far from the past, such as going from “fitness sneakers” to “formal dress shoes.” The research reveals that traditional models (for example TIGER or solutions with simple vocabulary extension) often fail across these distances, whereas a system well conditioned on preferences can handle even drastic changes.
Sentiment following stands out as a key function. If a user has written negative reviews of a specific brand, the generated preference can stress avoiding that brand. It emerges, however, that many existing models make poor use of negative data: the m@k metric (derived from hit rate) indicates whether the system manages to include the item in the recommendation set when the preference is positive and to exclude it when the preference is negative. The results show very low scores (around 0.004 on some datasets) for methods that were not trained on explicitly negative preferences, while the new strategy improves markedly when examples of this kind are provided.
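As a rough illustration of the m@k idea described above, the following Python sketch scores a test case as a hit only when the target item appears in the top-k list under the positive preference and is absent from the top-k list under the negative one. How the two conditions are combined (a logical AND, averaged over cases) is an assumption made here for illustration, and the function names and toy item IDs are hypothetical; the paper's exact definition may differ.

```python
from typing import Iterable, Sequence, Tuple

def combined_hit(pos_topk: Sequence[str], neg_topk: Sequence[str],
                 target: str, k: int = 10) -> int:
    """Return 1 if `target` is recommended under the positive preference
    and NOT recommended under the negative preference, else 0."""
    hit_pos = target in pos_topk[:k]
    hit_neg = target in neg_topk[:k]
    return int(hit_pos and not hit_neg)

def m_at_k(cases: Iterable[Tuple[Sequence[str], Sequence[str], str]],
           k: int = 10) -> float:
    """Average the combined hit over (positive list, negative list, target) cases."""
    cases = list(cases)
    if not cases:
        return 0.0
    return sum(combined_hit(p, n, t, k) for p, n, t in cases) / len(cases)

# Toy cases (hypothetical item IDs).
cases = [
    (["a", "b", "c"], ["d", "e", "f"], "a"),  # followed both preferences -> 1
    (["a", "b", "c"], ["a", "e", "f"], "a"),  # still shown despite negative -> 0
    (["x", "y", "z"], ["d", "e", "f"], "a"),  # missed under positive -> 0
]
score = m_at_k(cases, k=3)
print(round(score, 3))
```

Under this reading, a model that ignores negative preferences keeps returning the same item in both lists and therefore scores close to zero, which is consistent with the very low values reported for methods not trained on negative preferences.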
The last dimension, history consolidation, raises the issue that many preferences do not actually help identify the right item at a given moment and instead create noise. Giving the model a mixed set of preferences, not all of which concern the final product, is a robustness test: the system must filter the useful hints and ignore irrelevant preferences. According to the authors, the ability to handle these cases is crucial in real scenarios, where users accumulate preferences and later discard some of them. The evaluations use well-known metrics such as Recall@5, Recall@10, NDCG@5, and NDCG@10 and show, across several experiments, that the preference discerning paradigm improves recommendation quality on all five axes. The margin over standard models varies, sometimes reaching +45% in Recall@10.
Mender: The Generative Model That Redefines Preference Discerning and Generative Retrieval
Mender, short for Multimodal Preference Discerner, is a key innovation in the generative retrieval landscape. The model uses semantic IDs to generate recommendations based on user preferences expressed in natural language. The system handles items as sequences of semantic tokens, applying autoregressive modeling. This approach predicts the next item in a sequence directly instead of comparing items pairwise, which improves the efficiency and accuracy of the process. A key aspect is the use of the formula RQ(e, C, D) = (k_1, ..., k_N) ∈ [K]^N, which quantizes the embeddings, that is, transforms complex numerical representations of items into discrete codes.
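The quantization step can be sketched as a plain residual quantizer: at each of N levels, pick the nearest entry of that level's codebook, record its index, and pass the remaining residual to the next level. This is a minimal sketch under stated assumptions: the text here does not define C and D, so the code simply takes a list of N codebooks with K entries each, uses squared Euclidean distance, and hard-codes tiny toy codebooks (in practice they would be learned). All names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]

def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def residual_quantize(e: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Map embedding `e` to one code index per level, i.e. (k_1, ..., k_N)."""
    residual = list(e)
    codes = []
    for codebook in codebooks:
        # Nearest codebook entry to what is still unexplained.
        k = min(range(len(codebook)),
                key=lambda i: squared_distance(residual, codebook[i]))
        codes.append(k)
        # Subtract the chosen entry; the next level refines the remainder.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes

# Toy setup: N = 2 levels, K = 2 entries per level, 2-dimensional embeddings.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],  # level 1: coarse direction
    [[0.5, 0.0], [0.0, 0.5]],  # level 2: finer correction
]
semantic_id = residual_quantize([1.1, 0.4], codebooks)
print(semantic_id)
```

Each item is then identified by a short tuple of indices rather than by a dense vector, which is what lets a decoder generate items token by token.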
This transformation links the textual preferences expressed by users more precisely to the universe of available items, improving the degree of personalization of the recommendations. With this methodology, Mender provides a sophisticated match between user preferences and suggested items, making the system more effective and user-friendly.
What sets Mender apart is its structure, consisting of a pre-trained language encoder and a decoder that generates the semantic tokens corresponding to the recommended items. The decoder uses cross-attention over the encoder, turning user instructions and purchase history into an autoregressive prediction, that is, a predictive sequence based on the inputs provided. Two versions of the model were developed: MenderEmb and MenderTok. MenderEmb encodes user preferences and items separately as embeddings, that is, numerical representations specific to each component. MenderTok, by contrast, merges history and preferences into a single sequence of text tokens, letting the model treat the whole input as one stream of information. This dual configuration offers flexibility in managing and optimizing recommendations according to the specific needs of the system.
In the experimental results, MenderTok stands out with better performance than other approaches, thanks to its ability to represent all information in textual form. For example, on Amazon Beauty, Recall@10 rises from 0.0697, obtained with some baseline models, to about 0.0937. Similarly, on Sports and Outdoors, it increases from 0.0355 to 0.0427.
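To put the Recall@10 figures above in perspective, the short Python sketch below shows one common way to compute Recall@k for a single ranked list and converts the two quoted baseline/improved pairs into relative lifts. The helper names and the toy ranked list are hypothetical; only the four Recall@10 values come from the text.

```python
from typing import Sequence

def recall_at_k(ranked: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of the relevant items that appear in the first k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def relative_lift(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline` (0.10 means +10%)."""
    return (improved - baseline) / baseline

# Toy ranked list: one of the two relevant items is in the top 3.
toy_recall = recall_at_k(["a", "b", "c", "d"], ["b", "z"], k=3)

# Recall@10 pairs quoted in the text.
beauty_lift = relative_lift(0.0697, 0.0937)
sports_lift = relative_lift(0.0355, 0.0427)
print(f"Beauty: {beauty_lift:+.1%}, Sports and Outdoors: {sports_lift:+.1%}")
```

On these two pairs the relative gains come out at roughly 34% and 20%, so the 45% peak cited for the study refers to other dataset and baseline combinations.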
This significant improvement is attributed to the system’s ability to adapt to new user profiles using explicit constraints expressed in natural language, thus avoiding complex retraining procedures. The model generates a set of semantic codes by evaluating items in a latent space, an abstract representation that captures the main characteristics of the items. These codes are then translated into discrete IDs, allowing the system to handle a very large catalog efficiently while maintaining a high degree of personalization and precision.
The research paper stresses that Mender’s success also depends on the quality of the generated preferences, that is, how closely they actually match the user’s profile. The authors conducted a survey, which found that about 75% of the textual preferences do correspond to people’s inclinations. A system like Mender benefits from the precision of these preferences, reducing irrelevant suggestions. Moreover, combining textual signals with past purchases makes it easier to broaden the offering to related items without distorting the user’s tastes.
For companies interested in implementing Mender, the synergy between semantic embeddings and user input paves the way for models that can integrate with existing data streams, such as reviews, social media posts, and direct feedback. The prospect of encoding items and textual preferences in a single encoder-decoder can increase the transparency and explainability of recommendations for the end user.
Mender and Generative Retrieval: Strategic Impacts for E-commerce
The tests cover four well-known datasets: three Amazon subsets (Beauty, Toys and Games, Sports and Outdoors) and Steam. Total actions range from 167,597 for Toys and Games to 599,620 for Steam, with differences in item distribution as well.
Consistent with the idea of preference discerning, the researchers generated preferences from real reviews using large language models, filtering them with post-processing mechanisms to remove noise or repetitive references. Recommendation performance is evaluated with several metrics. For some combinations of dataset and parameters, MenderTok reaches a Recall@10 close to 0.20 on Steam, while models without explicit preferences often remain below 0.19. On Amazon, the gaps between solutions are even more marked, with improvements that, according to the data presented, peak at about +45% over baselines such as TIGER or LC-REC.
A decisive point is the ability to change the recommendation based on negative preferences. In the evaluation called sentiment following, if the user states that they avoid a certain brand, the algorithm must remove the corresponding item from the top positions of the list. The results show that, without targeted training on negative preferences, many solutions keep that item in the recommendations, irritating the user. With the preference discerning approach, the combined hit rate metric improves, indicating a greater ability to distinguish what the user likes from what annoys them.
History consolidation cases were also analyzed, in which a user accumulates multiple preferences and revises some of them. The system must select which preferences are relevant and ignore information that is no longer central. The authors point out that standard generative models, if they lack an adequate textual conditioning phase, struggle to filter out irrelevant preferences. Mender, by contrast, shows a balanced trade-off between reliability and adaptability: even when preferences misaligned with the final item appear, it maintains competitive performance.
For businesses, these multi-scenario tests suggest that personalization carries growing weight in conversions. Having a single system that can move from recommendations consistent with the past to recommendations that break with it in a controlled way can help in experimenting with new product clusters while maximizing user satisfaction.
Preference Discerning and Generative Retrieval: Applications and Future
The explicit focus on textual preferences carries this line of research into very diverse areas. In e-commerce, negative preferences make it possible to propose items that avoid what the user does not want, while positive preferences refine the choice of models, technical features, or design. Business managers can turn these systems into tools for more targeted retargeting or cross-selling, reducing wasted time and advertising budget.
On the technical side, combining embeddings with natural-language preferences brings an increase in complexity that is manageable thanks to open large language models. The researchers promise to release the code and benchmarks to support reproducibility and extension to new datasets. This will make it possible to compare Mender with other rapidly emerging approaches and keep refining the technology.
It is worth stressing that analyses based on metrics such as Recall@5, Recall@10, and NDCG@10 are a critical step in assessing the ability to respect specific preferences. In sectors such as tourism, healthcare, or streaming platforms, the need to interpret users’ tastes and aversions quickly is vital. Adopting solutions that can take natural-language commands, such as “Look for sustainable products” or “Avoid violent content,” can make a difference to retention rates.
With preference discerning, the user becomes a co-protagonist in the generation process, giving direct instructions about what they want. A business manager, for their part, can define business intelligence guidelines, telling the system which business preferences to favor, for example higher-margin products or items included in promotional campaigns. The generative technology used by Mender proves flexible enough to absorb external prompts, a strategic aspect when preferences emerge not only from past data but also from volatile online contexts or real-time textual input.
Conclusions
The findings of the study show that the ability to condition recommendation models explicitly on user preferences has a concrete effect on generation performance and on the possibilities for personalizing the offering. Unlike related solutions, Mender and its benchmark introduce direct handling of textual instructions, in line with the emerging trend of integrating large language models effectively. The possible consequences for business are finer tuning of the offering, fewer wrong recommendations, and the potential to explore vertical markets with more detailed personalization rules. Looking at similar technologies, some generative retrieval solutions are starting to experiment with linguistic mechanisms, but they rarely achieve such a clear separation between preference generation and actual conditioning. From this perspective, the choice to generate preferences even when no explicit linked history exists appears strategic, favoring adaptability to new users. Overall, the research opens the way to even finer personalization, in which positive and negative preferences, freely expressed in natural language, guide systems in a more informed way.
Business leaders are invited to consider adopting these methods not simply as another technical advance, but as a strategic shift toward a highly responsive, listening-oriented system in which the user’s voice plays a central role in the recommendation process. This approach makes it possible to integrate the preferences expressed by users directly, turning them into a fundamental tool for improving the experience and personalization of the services offered.
Podcast: https://creators.spotify.com/pod/show/andrea-viliotti/episodes/Mender-Preference-Discerning-e-Generative-Retrieval-per-raccomandazioni-personalizzate-e2svr5e
Source: https://arxiv.org/abs/2412.08604
The LIGER Hybrid Model: Transforming Sequential Recommendation Systems
Over the past decade, recommendation systems have evolved into a critical technology for any enterprise that relies on guiding user choices, whether in e-commerce, streaming services, or digital platforms that provide content and entertainment. As people navigate vast catalogs of products and information, algorithms shoulder the task of pinpointing items that suit individual tastes. One area of research, known as sequential recommendation, focuses on predictions informed by user history: if someone viewed or purchased specific items in the past, what might they be interested in next?
A recent investigation, authored by Liu Yang, Fabian Paischer, and Kaveh Hassani in collaboration with the University of Wisconsin–Madison, the ELLIS Unit at the LIT AI Lab (Johannes Kepler University, Linz), and Meta AI, lays out fresh insights into two distinct but equally influential approaches. The first is known as dense retrieval, in which each item is compressed into a numerical representation or “embedding,” allowing a system to measure similarity among items by comparing these embeddings. The second, generative retrieval, draws on Transformer-based architectures to produce, in a more direct manner, the semantic code that identifies the next item in a sequence. Their work highlights challenges such as memory demands, the incorporation of brand-new items (the so-called cold-start dilemma), and overall system performance, all of which are pressing for enterprises operating at scale.
Yet these insights also go a step further by showcasing how dense retrieval and generative retrieval each come with benefits and trade-offs. By delving into recall scores and memory footprints, the research underscores a shared objective: to propose the most relevant items while balancing computational efficiency and adaptability.
To bridge the gap, the team introduces a hybrid model called LIGER (LeveragIng dense retrieval for GEnerative Retrieval), which combines the strengths of dense similarity-based ranking with the flexible generation of new semantic codes. This reinterpretation of the study will traverse the key components: (1) how dense and generative retrieval differ in technique and resource requirements, (2) why cold-start items pose a particularly vexing problem for generative retrieval, and (3) how LIGER aims to integrate these two methods to reach a middle ground. We’ll also reflect on pragmatic aspects for businesses managing massive catalogs, where each new approach must not only outperform older systems but also remain nimble enough to handle shifting market demands.
LIGER Hybrid Model: Contrasting Dense and Generative Retrieval Approaches
For many years, dense retrieval has been viewed as a natural extension of traditional recommendation algorithms. This approach assigns each item in a catalog a unique high-dimensional vector (or embedding) that captures its distinctive attributes: brand, category, textual description, or any relevant metadata. When a user’s past interactions are also transformed into an embedding, the system computes mathematical similarities (often an inner product or cosine similarity) to identify the items that most closely match the user’s profile.
Pros and Cons of Dense Retrieval
High Accuracy: Because each item is coded by a rich, learned representation, dense retrieval frequently achieves robust performance in standard benchmarks, especially for “in-set” items that the system has already seen during training.
Resource Intensiveness: As a catalog grows into the millions, the system must store an embedding for every single item and compare user embeddings with all potential item embeddings.
Even if efficient similarity search structures exist, scaling can be computationally and financially costly.
Cold-Start Handling: When brand-new items enter the mix, a dense retrieval system can still generate embeddings using textual or categorical descriptions. While it doesn’t solve the cold-start challenge entirely, it often retains at least a moderate capacity to guess which new entries might interest users, thanks to textual representations.
In short, the hallmark of dense retrieval is its ability to rank familiar items accurately. The system excels in memory-rich settings where the overhead of storing countless vectors does not pose a dire problem. This makes it particularly appealing for businesses with well-established catalogs that seldom alter drastically or those with ample computational resources dedicated to serving recommendations.
Generative Retrieval: Leveraging Transformer Models
As an alternative, generative retrieval utilizes a Transformer-based model (akin to those found in neural machine translation or advanced language processing) to generate the semantic ID of the next recommended item. Each item’s “ID” is not just a product name or numerical identifier, but a richer tapestry of textual cues: title, brand, category, and price, among other relevant descriptors.
During training, the model observes sequences of item interactions. By seeing the progression of codes that led a user from one purchase to another, it learns to predict the next set of codes. During recommendation, a beam search can be employed: the system generates various candidate code sequences, retaining only the most promising among them. Hence, instead of scanning an entire catalog of item embeddings, the model “writes” the next item’s semantic code directly.
Pros and Cons of Generative Retrieval
Efficient Scaling: Rather than storing a dedicated vector for each item, the system mainly stores the distinct building blocks that form an item’s semantic representation.
For instance, if a catalog includes 50 possible brands and 100 possible categories, the number of codes might be just 150. Whether there are 2,000 items or 20,000, the memory footprint for storing codes does not expand proportionally with the number of items.
Cold-Start Weakness: Generative retrieval can struggle significantly when confronted with items that never appeared during training. Since the model typically leans on previously observed codes, brand-new items remain invisible to the learned patterns. Consequently, the probability of generating truly novel combinations is often negligible, making it hard to surface fresh content.
Performance Gap: Across standard metrics such as Recall@10, purely generative approaches often lag behind dense retrieval. The difference in performance (3% or 4% in some experiments) might not appear enormous on paper, but in commercial settings, such a gap can translate to a substantial difference in user satisfaction or revenue.

This generative idea presents an undeniably attractive path for businesses that aim to handle large catalogs without excessive overhead. Yet it also reveals a trade-off: a system that excels at storing minimal item representations might lose out on the fine-grained precision critical for personalizing recommendations.

How the LIGER Hybrid Model Tackles the Cold-Start Dilemma

For recommendation engines, the cold-start problem has long been recognized as one of the hardest challenges. When new items are introduced to a platform, there is no interaction history to guide the algorithm toward the right audience. Understanding how the two major retrieval strategies tackle this issue becomes crucial for any business that regularly updates its catalog.

Dense Retrieval in Cold-Start Scenarios

Thanks to textual embeddings, dense retrieval can still produce a ballpark representation for items with no prior clicks or purchases.
A beauty product, for instance, could be given an embedding built from text referencing its brand, fragrance type, and target demographic, helping the system connect it to similar items from the past. The match might not be spot-on, but it generally does better than random guessing, so the new item retains a modest but real chance of being discovered in the top recommended slots.

Generative Retrieval in Cold-Start Scenarios

By contrast, generative retrieval can struggle to even place brand-new items into the candidate set. Given that the system has learned to generate item codes (brand, category, etc.) from existing examples, it strongly favors items that it “knows.” If an entirely unfamiliar brand or category arises, the model’s probability of generating that code in the next semantic sequence is extremely low, so low that the item typically fails to appear in the final beam of candidates. Empirical results from the research show recall values near zero for generative approaches in these cold-start cases, especially in categories like Amazon Toys or Amazon Sports.

Within a dynamic marketplace, where seasonal trends, rotating inventories, or brand partnerships result in a steady influx of new goods, this limitation cannot be overlooked. Some have proposed quick fixes, like artificially forcing the system to consider a small set of fresh items. Yet these solutions often rely on guesses about how many new items might appear at once or require manual heuristics. The outcome is a partial patch, but a far cry from an elegant, robust remedy.

Bridging Dense and Generative Retrieval with the LIGER Hybrid Model

In seeking a remedy that capitalizes on the best qualities of both methods, the authors propose the LIGER Hybrid Model, short for LeveragIng dense retrieval for GEnerative Retrieval. The LIGER Hybrid Model endeavors to blend the flexible generation of item codes with the robust similarity scoring typical of dense retrieval.
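The cold-start failure of generative retrieval described above can be illustrated with a toy beam search over two-part semantic codes (brand, then category). This is only a sketch: the probability tables, the code names, and the `beam_search` helper are invented for illustration and are not taken from the study. It shows how a code the model rarely predicts is pruned before it can reach the final candidate list.

```python
import math

# Toy next-code probabilities. All names and numbers are illustrative
# assumptions, not values from the study.
BRAND_PROBS = {"brand_a": 0.60, "brand_b": 0.38, "brand_new": 0.02}
CATEGORY_PROBS = {
    "brand_a": {"shampoo": 0.7, "lotion": 0.3},
    "brand_b": {"shampoo": 0.4, "lotion": 0.6},
    "brand_new": {"shampoo": 0.5, "lotion": 0.5},
}


def beam_search(beam_width):
    """Return the beam_width most probable (brand, category) codes."""
    # Step 1: score every brand and keep only the best beam_width of them.
    beams = [((brand,), math.log(p)) for brand, p in BRAND_PROBS.items()]
    beams.sort(key=lambda beam: beam[1], reverse=True)
    beams = beams[:beam_width]

    # Step 2: extend each surviving brand with every category, then prune again.
    expanded = []
    for (brand,), log_p in beams:
        for category, p in CATEGORY_PROBS[brand].items():
            expanded.append(((brand, category), log_p + math.log(p)))
    expanded.sort(key=lambda beam: beam[1], reverse=True)
    return expanded[:beam_width]


for width in (2, 4):
    codes = [code for code, _ in beam_search(width)]
    print(width, codes)
```

With these toy numbers, no code starting with `brand_new` survives at beam widths 2 or 4; it only appears once the beam is wide enough to hold every possible code, which is the situation the text describes for unseen items.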
Architectural Highlights

Dual Optimization Path
LIGER maintains two internal pathways during training:
A dense-based component that measures how similar the Transformer’s output is to the textual embedding of the next item. By maximizing cosine similarity (modulated by a temperature parameter τ), this part of the system ensures that the model does not lose sight of close semantic matches.
A generative-based component that learns to produce the semantic code of the future item. The model employs its Transformer layers to sequentially predict the brand, category, or other attributes that define each item.

Combined Loss Function
These two training targets are consolidated into one overarching objective, encouraging the model to be simultaneously skilled at identifying the “closest” items (dense retrieval) and generating the relevant codes (generative retrieval).

Inference Strategy
Once trained, LIGER draws an initial set of K candidate items via generative retrieval. This set is then augmented with potential new items (which might not appear in the generative scope) and evaluated more precisely through dense ranking. By enlarging K, one can gradually approach the performance of a fully dense-based system, but with improved efficiency and coverage for fresh or rarely seen items.

Practical Outcomes
Studies across four real-world datasets (Amazon Beauty, Amazon Sports, Amazon Toys, and Steam) reveal how LIGER narrows the performance gap between a purely dense strategy and the generative approach, particularly for in-set items. For cold-start items, LIGER surpasses its generative-only counterpart, which otherwise stagnates near zero recall, by introducing a mechanism that dips into dense retrieval’s ability to guess representations for previously unseen products. This fusion proves especially beneficial in domains where item turnover is significant and brand-new content arrives constantly.
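The inference strategy outlined above (generate K candidates, add new items, then rank densely) might look roughly like the following. This is a minimal sketch under stated assumptions: the function name `rerank_candidates`, the two-dimensional toy embeddings, and the item IDs are hypothetical, and a real system would take the candidate list from a trained model's beam search rather than a hard-coded list.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rerank_candidates(query_vec, generated_ids, new_item_ids, text_embeddings, top_n=10):
    """Merge generated candidates with new items and rank them by cosine similarity."""
    # Drop duplicates while keeping the original order.
    candidates = list(dict.fromkeys([*generated_ids, *new_item_ids]))
    scored = [
        (item_id, cosine(query_vec, text_embeddings[item_id]))
        for item_id in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical text embeddings; "new_1" stands in for a cold-start item.
text_embeddings = {
    "seen_1": [1.0, 0.0],
    "seen_2": [0.8, 0.6],
    "new_1": [0.6, 0.8],
}
ranked = rerank_candidates(
    query_vec=[0.5, 0.9],            # stand-in for the Transformer's output
    generated_ids=["seen_1", "seen_2"],  # stand-in for the K generated candidates
    new_item_ids=["new_1"],
    text_embeddings=text_embeddings,
)
print([item_id for item_id, _ in ranked])  # ['new_1', 'seen_2', 'seen_1']
```

Only the merged candidate list is scored, not the whole catalog, which is where the efficiency argument in the text comes from; the cold-start item gets a chance because it is added to the list explicitly rather than having to be generated.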
While LIGER does incur some additional computational overhead compared to a purely generative method, it remains more memory-efficient than a purely dense system. This middle ground, where a business can manage large catalogs without storing an embedding for every single new item yet still remain relevant to brand-new products, has immediate commercial implications.

Detailed Examination of the Research Findings

To test their models, the authors used datasets that vary in size and domain:
Amazon Beauty: ~22,000 users, ~12,000 items, and ~198,000 interactions; 43 new items.
Amazon Sports: ~35,000 users, ~12,000 items, and ~296,000 interactions; 56 new items.
Amazon Toys: ~19,000 users, ~12,000 items, and ~167,000 interactions; 81 new items.
Steam: ~47,000 users, ~18,000 items, and ~599,000 interactions; 400 new items.

They evaluated systems through standard metrics like Recall@10 (the proportion of relevant items captured in the top ten recommendations) and NDCG@10 (a measure that weights the position of correct recommendations). For “in-set” testing, where items from the training set appear again in evaluation, dense retrieval often leads the pack or at least matches robust baselines such as SASRec or RecFormer. Meanwhile, purely generative retrieval tends to rank slightly lower, missing some of the subtle item-user connections.

In the cold-start setting, purely generative approaches can virtually fail to identify brand-new items, sometimes scoring near zero in Recall@10. By integrating a dense retrieval step, LIGER rectifies this shortfall, lifting recall to meaningful levels. When LIGER is given a wider candidate set (larger K), it draws closer to dense retrieval’s performance. Indeed, the Normalized Performance Gap (NPG) steadily decreases as K rises, striking a balance between generative speed and dense precision.
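For readers who want to reproduce the two metrics mentioned above, here is a small sketch. It assumes the common next-item evaluation setup in which each test sequence has exactly one held-out target item; that setup, the function names, and the toy data are assumptions for illustration rather than details confirmed by the article.

```python
import math


def recall_at_k(ranked_items, target, k=10):
    """1.0 if the held-out target is among the top-k recommendations, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG with a single relevant item: 1 / log2(rank + 1), or 0.0 if missed."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)


def mean_metric(metric, predictions, targets, k=10):
    """Average a per-sequence metric over all test sequences."""
    scores = [metric(ranked, target, k) for ranked, target in zip(predictions, targets)]
    return sum(scores) / len(scores)


# Two toy test sequences: the first target is ranked second, the second is missed.
predictions = [["a", "b", "c"], ["d", "e", "f"]]
targets = ["b", "z"]
print(mean_metric(recall_at_k, predictions, targets, k=3))  # 0.5
print(mean_metric(ndcg_at_k, predictions, targets, k=3))
```

Recall only checks whether the target made the list, while NDCG also rewards placing it higher, which is why the two numbers can diverge for the same system.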
Recommendations for Businesses

For enterprises, these differences highlight crucial design choices:
Abundant Resources, High Precision Needs: If a company has robust computing systems, a purely dense approach may still be ideal. Its recall advantage for items seen during training remains consistently strong.
Fast-Changing Catalogs, Efficiency Concerns: In a scenario with rapidly introduced items or restricted memory budgets, generative retrieval appears appealing, though it struggles to handle unseen items. This is where LIGER’s hybrid method can offer a workable solution.
Managed Trade-Offs: LIGER allows for configurable K values, enabling organizations to dial up or down the emphasis on dense-based accuracy versus generative flexibility.

Within this context, the LIGER model highlights the idea that no single solution can do it all, particularly in business environments that shift unpredictably. Instead, it guides teams to adopt a layered approach: generative modules identify an initial set of candidate items (including brand-new arrivals, if properly integrated), while dense modules refine these suggestions to maintain accuracy. For those dealing with extremely large catalogs (sometimes numbering in the millions), this synergy could greatly lower the memory footprint without sacrificing too much in performance metrics.

Future Directions of the LIGER Hybrid Model

As product lines balloon in size or shift rapidly according to trends, memory usage becomes a serious concern. Dense retrieval demands the storage of unique vectors for every item, and the overhead involved in updating or recalculating them can be daunting. By contrast, generative retrieval collapses many items into a concise set of codes. LIGER exploits this advantage by retaining the text-based benefits of dense retrieval but only for a narrower set of candidates produced through generation. It is not hard to imagine an e-commerce platform with tens of thousands of new products debuting monthly.
For them, an architecture that can quickly update which codes are valid, without re-embedding every item in high-dimensional space, might deliver real competitive benefits. Moreover, the research indicates that once K surpasses a certain threshold, performance draws near to a purely dense approach, giving technical teams the power to choose how large that threshold should be, based on hardware constraints and business objectives.

Personalization and the Transformative Capacity of Generative Models

Dense approaches excel at known items, but generative retrieval has a special flair for forging links between user behavior and items that might initially appear unrelated. A Transformer-based system can tap into latent features, possibly connecting a user’s interest in, say, “eco-friendly household products” with a previously unassociated brand. By merging these two vantage points, LIGER holds the promise of robust personalization, especially relevant when the platform’s content extends beyond straightforward categories.

From a more humanistic angle, this interplay between known objects and newly imagined possibilities resonates with how we explore culture and knowledge in everyday life. We rely on established patterns to recognize what’s familiar, but we also remain open to fresh and unexpected ideas. LIGER’s hybrid framework thus mirrors the dual nature of human cognition: building on existing knowledge while having room for novelty.

The Potential Role of Large Language Models (LLMs)

As the study hints, continuing advances in Large Language Models (such as GPT-like architectures) may blur the line between dense and generative retrieval. These more advanced models can potentially produce item embeddings on the fly or generate new item codes with remarkable accuracy. They might also address cold-start challenges better by tapping into extensive real-world textual knowledge that extends beyond a single dataset.
However, the paper also underscores that applying LLMs at industrial scale remains an open question, involving significant computational costs and the need for careful fine-tuning. Real-world performance, especially for massive catalogs, might differ considerably from lab settings. This leaves plenty of territory for further experimentation, both academic and commercial.

Industry Adoption and Gradual Integration

For businesses with existing dense retrieval pipelines, one plausible roadmap involves integrating a generative subsystem step by step. They might first train a Transformer to produce candidate items, then pass those candidates to their dense rankers. Over time, they can test how well the generative module captures new releases and whether it helps reduce memory overhead. Alternatively, companies that begin with generative retrieval might incorporate a dense refinement layer only for high-traffic items or premium content. In either case, LIGER’s versatility accommodates incremental changes rather than demanding a complete overhaul of a well-functioning system.

Final Observations

By weaving together the mathematical robustness of dense retrieval and the flexible coding of generative retrieval, LIGER forges a practical path toward adaptive, resource-friendly recommendation systems. In a market that continuously demands up-to-date offerings, any system that fails to handle novel items gracefully stands at a disadvantage. Yet businesses also cannot overlook the accuracy gap that arises when they rely exclusively on generative retrieval.

The solutions outlined in the research point to a bigger theme: there is rarely a one-size-fits-all formula for recommendation tasks. Instead, engineers, data scientists, and business strategists must chart their path by weighing the importance of memory costs, computational budgets, and the diversity of product catalogs.
For some enterprises, a system purely anchored in dense retrieval remains indispensable; for others, generative retrieval offers a means of exploring a vast item space without drowning in memory demands. LIGER shows that the conversation between these two extremes need not be a stalemate. By merging generative candidate selection with dense verification and refinement, it provides a flexible blueprint that narrows the performance gap while empowering companies to manage new inventory more seamlessly.

As the next era of recommendation systems continues to unfold, approaches like LIGER may well represent the new mainstream: forging alliances between established and emerging methods to serve the needs of an ever-changing marketplace, and of the individuals who rely on these technologies day after day.

Podcast: https://creators.spotify.com/pod/show/andrea-viliotti/episodes/The-LIGER-Hybrid-Model-Transforming-Sequential-Recommendation-Systems-e2sv9se
Source: https://arxiv.org/abs/2411.18814
LIGER Hybrid Model: The New Frontier in Sequential Recommendations
“Unifying Generative and Dense Retrieval for Sequential Recommendation” is the title of the study by Liu Yang, Fabian Paischer, and Kaveh Hassani, in collaboration with the University of Wisconsin–Madison, the ELLIS Unit of the LIT AI Lab at JKU Linz (Austria), and Meta AI. The study explores sequential recommender systems by comparing two approaches: dense retrieval, which relies on learning rich representations for every item, and generative retrieval, based on models that directly predict the index of the next item. Several aspects are of particular interest to businesses, since they involve efficiency, memory management, the integration of new content (cold-start), and the overall performance of recommender systems.

Dense and Generative Retrieval: How the LIGER Model Revolutionizes Sequential Recommendations

Sequential recommendation is one of the most studied areas in recommender systems. The idea is to analyze a user’s interaction history to predict the next item, surfacing correlations between the sequence of past clicks or purchases and the likelihood that new content will interest the user. The research by Liu Yang and colleagues investigates the impact of two different methodologies: dense retrieval on one side, and on the other a generative approach that aims to produce the index of the item to recommend.

Dense retrieval, as described in the scientific literature, builds on learned data representations. Each item in the database is transformed into an embedding, a numerical representation that summarizes the key characteristics of its content.
The recommendation process computes the inner product (a mathematical measure of similarity) between the embedding of the user, or of the user’s interaction sequence, and the representations of all available items. The item with the highest similarity score is suggested as the preferred option. With large datasets, however, this approach requires comparing the user against every item, which is costly in memory and compute. Even so, dense retrieval often outperforms simpler approaches.

Generative retrieval is an alternative to dense retrieval. Instead of computing the similarity between the user and all available items, this method uses a Transformer model designed to directly predict the “semantic label” of the next item. The term “semantic ID” denotes an identifier made of several components that summarize the item’s main characteristics, such as title, brand, category, and price. Each item is thus described by a structured combination of these attributes, often represented as a tuple of codes. During training, the generative model learns to predict the next sequence of codes from the user’s interaction history. Once training is complete, the system can find the next item using beam search, a heuristic search method that explores several candidate paths in parallel and keeps only the most promising ones, up to a fixed number of options (the “beam width”).
In other words, instead of examining every possible combination, the system focuses on a subset of paths that look most likely, improving efficiency without sacrificing too much solution quality.

A notable aspect of this strategy, known as generative retrieval, is that it scales more efficiently as the number of items grows. This comes from a significant reduction in memory costs: instead of storing one embedding per item, the system keeps only t codes, where t is the number of distinct elements used to describe the items. For example, if the database holds 10,000 items but only 100 categories and 50 distinct brands, t is the sum of the distinct elements needed to represent them (here, 100 categories + 50 brands = 150 codes), regardless of the total number of items N. This makes generative retrieval particularly advantageous on large datasets, with better scalability and more efficient use of computational resources.

The study’s analysis shows that the two approaches have distinct strengths and weaknesses. Dense retrieval excels in accuracy, especially in tests on known or “in-set” items, that is, settings where the items to recommend at evaluation time were already present in the training set. This scenario simplifies the task, since the system only has to identify items it has already “seen” and memorized. In these settings, dense retrieval reached Recall@10 values (a metric measuring how well relevant items are retrieved within the top ten positions) on the order of 0.18-0.20 in some experiments.
On the other hand, dense retrieval pays the price of growing computational costs, especially when recommending to millions of users or working with a very large number of items. Generative retrieval, by contrast, has a lighter structure that stores item information more compactly and allows fast inference through beam search. However, it shows a performance gap relative to dense retrieval, particularly in accuracy. The gap is evident in the numerical results on the same datasets: in the tests, the difference in Recall@10 amounts to 3-4%. This means that generative retrieval, while more efficient and scalable, may be less effective at proposing relevant items, especially where precision is crucial.

For businesses, this direct comparison highlights the need to balance recommendation accuracy against infrastructure costs and the flexibility of catalog updates. Investing in a dense retrieval system can be ideal when compute is plentiful and the goal is to maximize the relevance of suggested items. A generative system, instead, can adapt more nimbly to settings where items change continuously, especially if reducing storage costs is crucial.

Cold-Start and Generative Retrieval: Challenges and Solutions with the LIGER Model

The cold-start phenomenon has always been a central issue in recommender systems. When an item enters the market, or when a new commercial partner supplies previously unseen products, there may be no interaction history, which makes it hard to connect users and items.
The research analyzes how dense and generative approaches react to the appearance of completely new items. The results reported in some of the performance tables paint a contrasting picture. In dense retrieval, the availability of textual representations for each item (for example descriptions, brand, and categories) makes it possible to generate an embedding even for products never seen before. The model therefore keeps a non-zero recommendation capacity for content that has no recorded interactions yet. The researchers note that Recall@10 in the cold-start case stays positive, although lower than for known items.

Generative retrieval shows more pronounced limits. The problem discussed is overfitting to items already present in training: when the model tries to generate the semantic code of the next item, it tends to favor codes it has already encountered. During inference, the generation probability p⋆ of the correct item ends up far below the threshold pK needed for the item to appear among the beam search results. In other words, if the item is new and absent from the training set, its generation probability is so low that it is excluded from the final recommendations. The analyses show that on datasets such as Amazon Toys or Amazon Sports, generative retrieval struggles to rise above 0.0 in Recall@10 for items not present in training.

From a business standpoint, when frequent product turnover is expected or new releases must be launched continuously, fixing this deficit becomes crucial. Some propose reserving a share of the K candidates for cold-start items, forcing the model to suggest a certain number of unexplored items.
This, however, presupposes knowing in advance the proportion of new items relative to old ones, information that is not always available. According to the authors, then, generative retrieval clearly needs more refined strategies for handling never-seen content, which leaves room for improvement and further research.

Further confirmation comes from tests on four datasets: Amazon Beauty, Amazon Sports, Amazon Toys, and Steam. On the first three, the cold-start difference is most evident, with generative retrieval hovering around zero in many measurements. On Steam, a collection of games with richer attributes such as genre, specifications, tags, and price, the generative approach looks more competitive but does not fully close the cold-start gap. Anyone running an e-commerce portal, a service platform, or a constantly evolving catalog should therefore weigh the adoption of a “pure” generative method carefully, bearing in mind that, at least on small and medium-scale datasets, dense retrieval remains better at handling unseen items.

LIGER Hybrid Model: Overcoming the Shortcomings of Generative Retrieval

To address the performance gap and the difficulties tied to cold-start, the research proposes a hybrid model called LIGER (LeveragIng dense retrieval for GEnerative Retrieval), designed to combine the strengths of both approaches. LIGER’s architecture combines the textual information of items with their semantic codes and uses two distinct optimization methods. The first is based on the cosine similarity between the Transformer’s output and the textual representation of the next item. It measures how close the two representations are in meaning.
The second method, instead, focuses on directly predicting the semantic code associated with the next item.

The model uses an objective function with two main components. The first part is a logarithmic term that normalizes the cosine similarity through a parameter called the “temperature factor” (τ). This parameter controls the probability distribution, making the difference between options more or less pronounced. In practice, the model tries to maximize the similarity between the Transformer’s output and the correct textual representation while minimizing the probability assigned to incorrect representations.

The second part of the objective function concentrates on predicting the semantic code. The model predicts each component of the semantic code using the Transformer’s output and the information from the preceding elements of the sequence.

In short, the combined function pushes the model to integrate two fundamental capabilities:
Dense retrieval: maximizes the match between the Transformer’s output and the correct textual element, favoring an accurate semantic association.
Generative prediction: predicts the sequence of semantic codes, improving the model’s ability to anticipate complex information based on what it has already analyzed.

This dual strategy lets LIGER do well both at accurately identifying related items and at generating useful, detailed predictions. The researchers stress that the approach exploits the advantages of both methods jointly, optimizing performance on tasks that require both understanding and generation.
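The two-part objective described above can be sketched numerically as follows. This is an illustrative reading of the description, not the authors' implementation: the equal weighting of the two terms, the default temperature of 0.1, and every function and variable name are assumptions. The dense term is a temperature-scaled softmax over cosine similarities; the generative term is the negative log-likelihood of the target item's codes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dense_loss(output_vec, item_text_vecs, target_index, tau):
    """Negative log of the softmax over cosine similarities divided by tau."""
    logits = [cosine(output_vec, vec) / tau for vec in item_text_vecs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_normalizer = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_normalizer - logits[target_index]


def generative_loss(target_code_probs):
    """Negative log-likelihood of the correct codes (e.g. brand, then category)."""
    return -sum(math.log(p) for p in target_code_probs)


def combined_loss(output_vec, item_text_vecs, target_index, target_code_probs, tau=0.1):
    # Assumption: the two terms are simply added with equal weight.
    return dense_loss(output_vec, item_text_vecs, target_index, tau) + generative_loss(
        target_code_probs
    )


# Toy check: the loss is lower when the output points toward the target item.
items = [[1.0, 0.0], [0.0, 1.0]]
aligned = combined_loss([1.0, 0.0], items, 0, [0.9, 0.8])
misaligned = combined_loss([0.0, 1.0], items, 0, [0.9, 0.8])
print(aligned < misaligned)  # True
```

A smaller τ sharpens the softmax, so near-misses in similarity are penalized more heavily; a larger τ flattens it, which matches the text's remark that the temperature makes differences between options more or less pronounced.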
During inference, the LIGER hybrid model takes K candidates obtained through generative retrieval, supplements them with any new items, and then evaluates them with dense methods. The tests show that as K increases, LIGER progressively narrows the gap to fully dense retrieval. The so-called “Normalized Performance Gap (NPG)” shrinks steadily: performance starts close to that of generative retrieval (for low K) and reaches results more comparable to dense retrieval (for high K). For example, in the Amazon Beauty and Amazon Toys case studies, raising K from 20 to 80 made Recall@10 for “in-set” items tend toward the dense retrieval results while still allowing new items to be explored.

This strategy has clear relevance for business. A model that handles a large volume of content efficiently (limiting computational effort) while still producing effective recommendations, even for newly published items, translates into concrete business value. Lower storage costs for item information (thanks to semantic IDs) together with a good level of accuracy make a hybrid architecture attractive to companies, especially where product variety grows quickly.

LIGER Model: Tests and Performance on Four Datasets

The comparison was carried out on four representative datasets: Amazon Beauty, Amazon Sports, Amazon Toys, and Steam. Amazon Beauty has 22,363 users, 12,101 items, and 198,502 actions, with 43 entirely new cold-start items.
Amazon Sports has 35,598 users, 11,924 items, and 296,337 actions, with 56 new items; Amazon Toys has 19,412 users, 11,924 items, 167,597 actions, and 81 cold-start items. Steam, finally, with 47,761 users and 18,357 items, contains 599,620 actions and 400 new items. The authors tested a series of traditional methods, such as SASRec, S3-Rec, FDSA, and other Transformer-based variants including UniSRec and RecFormer, alongside TIGER (pure generative retrieval) and then the LIGER model. Methods that rely exclusively on the item ID turn out to be weak on unseen items, because they lack any information on where to place that never-seen content. This explains Recall@10 scores of practically zero in the cold-start scenario.

In the in-set tests, NDCG@10 and Recall@10 peak for the dense models and for some text-enhanced generative models, but generative retrieval tends to trail by a few percentage points. On Amazon Beauty, for example, Recall@10 for dense retrieval can exceed 0.07 in certain configurations, while the generative version stops lower. On Amazon Toys, generative retrieval comes close to 0.05782 in Recall@10, well below some dense solutions that go above 0.07.

The picture is more complex for cold-start. Here the data show generative values dropping to 0.0 on several datasets, reflecting the model’s inability to “guess” semantic codes it never encountered during training. LIGER, by contrast, brings a tangible improvement. On the Toys category, for example, the reported tests show LIGER reaching as much as 0.13063 in Recall@10 for cold-start items (with K=20), while TIGER stays at 0.0.

A key aspect is the handling of the threshold K.
Raising K increases the chances that the correct item is included in the generated set, but it also raises inference costs. The research shows that with K around 40 or 60, on Amazon Sports and Amazon Toys, LIGER reaches a compromise between computational cost and accuracy. For a company that manages large volumes of items and does not want to miss opportunities on new releases and low-frequency products, LIGER looks like an interesting compromise: depending on resources and goals, the parameters can be tuned to get as close as possible to dense retrieval results while keeping computational complexity in check.

LIGER Model: Strategic Opportunities for the Future of Recommendations

Integrating a hybrid method like LIGER is not just an exercise in algorithm engineering; it touches several aspects of organization and business development strategy. First, there is the question of scalability. When the item base reaches considerable size, storing a unique embedding for each object can become a problem in storage and update costs. A generative system, by contrast, reduces the number of vectors to store, since it keeps almost exclusively the semantic codes. This translates into tangible savings, useful for companies that offer millions of products and face high turnover.

Second, the question of personalization becomes more subtle. Dense retrieval provides accurate search for well-established items, while the generative approach captures latent connections between items and users thanks to the Transformer’s ability to produce new semantic codes. LIGER, by pairing the two procedures, gives encouraging results: it avoids getting trapped in the biases of pure generative retrieval while keeping the flexibility needed not to penalize emerging content.
This translates into a direct improvement for customers, who could receive more relevant suggestions for new or niche products. As for integration with business systems, those who already have an infrastructure based on dense models and want to reduce costs can adopt LIGER gradually. On one side, the existing embedding network is kept for the fine-ranking stage; on the other, a generative module is added for candidate generation. The hybrid model tends to cover a wide range of situations and is also relevant in verticals such as streaming platforms or digital-product marketplaces. Finally, the research points to some possible future extensions. The use of Large Language Models (LLMs) for generative retrieval could shift the balance between the two paradigms again, although for now the tests cited here focus on small and medium-sized datasets. A definitive test at industrial scale is missing, where the authors themselves acknowledge that tuning parameters, data distribution, and infrastructure optimization can change the results. It is plausible that further refinements of generative algorithms will achieve performance close to that of dense retrieval, if not better, especially where the flow of new items is very intense. Conclusions The research suggests that dense retrieval and generative retrieval are two sides of the same goal: enabling the best interaction between users and items based on behavioral history. The most evident difference lies in storage and computation costs. Dense retrieval offers accuracy but requires considerable resources, while generative retrieval stands out for its reduced memory footprint and its ability to manipulate semantic codes.
LIGER, by merging dense ranking with the generative component, appears to be a realistic alternative that narrows the performance gap and makes it possible to include cold-start items with good Recall@10 results. Comparing the results with existing technologies, adopting large pre-trained models such as BERT or T5 for dense retrieval shows remarkable potential, but it remains tied to the need to store many vectors. At the same time, the latest generative methods are gaining ground, especially when more scalable tokenization mechanisms are used or the power of general-purpose language models is integrated. LIGER sits on a line of strategic convergence: it is reasonably lightweight compared with pure dense retrieval, without neglecting the precision needed to keep engagement high. For businesses, the data suggest that choosing a hybrid system can be a concrete advantage, especially when managing a continuously updated catalog or when storage costs are a concern. No single solution is preferable in absolute terms, since the scale of the context and the available resources largely determine effectiveness. What emerges is a push toward a future in which dense and generative retrieval can coexist, perhaps with further optimizations that improve the generation of unexplored items and reduce response time. The dynamic balance between the two methods, already shown by LIGER, could spark new ideas for those building recommendation solutions that are increasingly flexible and ready to adapt to constantly evolving markets. Podcast: https://creators.spotify.com/pod/show/andrea-viliotti/episodes/LIGER-il-modello-ibrido-che-combina-recupero-denso-e-generativo-per-raccomandazioni-sequenziali-precise-e-scalabili-e2sv9i5 Source: https://arxiv.org/abs/2411.18814
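The hybrid scheme described in the LIGER article above (a generative module proposes K candidates, and dense embeddings then rank them) can be illustrated with a minimal Python sketch. It is not the authors' implementation: `candidate_ids` is a hypothetical stand-in for whatever the generative module would decode, the item vectors are toy values, and `hybrid_rank`, `recall_at_k`, and `ndcg_at_k` are names chosen here for illustration. The metrics assume a single correct next item per test case.

```python
import numpy as np

def recall_at_k(ranked_items, target, k):
    """Return 1.0 if the target is among the first k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with one relevant item: 1/log2(rank + 1) if found in the top k."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return float(1.0 / np.log2(rank + 1))
    return 0.0

def hybrid_rank(query_vec, item_vecs, candidate_ids):
    """Rank the given candidates by cosine similarity to the query vector.

    candidate_ids is assumed to come from a generative module (not shown);
    only these candidates are scored, not the full catalog.
    """
    cands = np.asarray(candidate_ids)
    vecs = item_vecs[cands]
    sims = (vecs @ query_vec) / (
        np.linalg.norm(vecs, axis=1) * np.linalg.norm(query_vec)
    )
    order = np.argsort(-sims)  # highest similarity first
    return cands[order].tolist()

# Toy catalog of five items with 2-dimensional embeddings (illustrative values).
item_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.9, 0.1]])
query_vec = np.array([1.0, 0.0])   # hypothetical representation of the user history
candidate_ids = [1, 2, 4]          # hypothetical output of the generative step

ranked = hybrid_rank(query_vec, item_vecs, candidate_ids)
print(ranked, recall_at_k(ranked, 2, 2), ndcg_at_k(ranked, 2, 10))
```

Passing more candidates to `hybrid_rank` makes it more likely that the target is among them, at the cost of scoring more vectors, which is the trade-off the article attributes to K.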
AI-Driven Materials Innovation: Transforming Research, Product Development, and Workforce Dynamics
In the United States, a growing body of work explores how advanced artificial intelligence, particularly deep learning, can accelerate scientific discovery and reshape the process of material innovation. A notable example is the research paper “Artificial Intelligence, Scientific Discovery, and Product Innovation” by Aidan Toner-Rodgers, developed at MIT with the collaboration of economists Daron Acemoglu and David Autor. This research scrutinizes how the introduction of a specialized deep learning tool in a large industrial laboratory impacts scientists’ productivity and alters the strategic decisions companies make about which projects to pursue. The study focuses on a cohort of 1,018 scientists working for a major corporation eager to expedite the creation of novel materials by tapping into AI-driven techniques. In doing so, the paper illustrates how AI can bring significant changes not only to the nuts and bolts of research and development (R&D) but also to the skill sets and roles of the workforce involved. Its findings are crucial for executives and entrepreneurs alike, highlighting both the substantial gains in patent activity and the broader diversification of prototypes. Yet the authors also discover an uneven distribution of these benefits among different categories of researchers, an imbalance that intensifies the need for strategic oversight and training. A recurring theme is that human expertise remains indispensable for interpreting and validating the AI’s output. The best outcomes surface when experienced scientists leverage the model’s suggestions effectively. This synergy of technology and specialized knowledge underscores the transformative potential of AI-driven materials innovation for corporate management, scientific progress, and beyond.
How AI-Driven Materials Innovation Expands Corporate Horizons The study offers a detailed narrative about how AI-driven materials innovation might spur the development of new materials in critical industries—think consumer electronics, medical devices, or advanced manufacturing. Historically, creating a new type of functional material often involved a significant amount of guesswork. Researchers spent countless hours devising hypothetical combinations of chemical elements, measuring their properties using expensive tools, and discarding a large portion of these attempts as unviable. High failure rates and the need to control costs pushed many R&D teams toward incremental adjustments rather than bold leaps into the unknown. In contrast, the deep learning system described in Toner-Rodgers’s paper uses graph neural networks (GNNs) to propose fresh “recipes” for new chemical structures. In essence, the algorithm is trained on substantial datasets detailing known materials and their physical or chemical traits—such as mechanical strength, resistance to temperature extremes, or unique optical properties. Building on that foundation, it synthesizes new formulas by scanning the statistical relationships in the data. Additional methods—often described as “diffusion models” and sophisticated probabilistic estimations—help identify solutions that might never have been tried but show promise on paper. It is vital to note that this AI-driven process is not purely automated. The scientists still take center stage in verifying the feasibility of the AI’s recommendations. They identify which material proposals are suitable for lab synthesis, weeding out those that might fail a practical test. The MIT researchers had a prime vantage point to witness these steps, as the rollout of AI tools occurred in several waves across various teams, allowing for direct comparisons between those already using the AI tool and those who had not yet received access. 
This staggered adoption schedule formed the backbone of a robust research design. It enabled the authors to isolate the effect of the AI system by comparing the performance of scientists actively engaging with it to the outcomes of scientists still working under conventional protocols. As a result, the study captured both quantitative metrics—like the number of patents produced, the novelty of the chemical structures, and the eventual prototyping rate—and qualitative shifts in how scientists spend their time and adapt their approach to problem-solving. Unprecedented Advances in Material Science with AI Across the board, results point to a marked uptick in productivity following the introduction of the AI platform. Scientists using the system saw a surge in the number of new formulas generated, many of which possessed unique chemical backbones or properties not previously explored in the lab’s portfolio. This had a ripple effect on multiple R&D metrics: Higher Count of Newly Validated Materials The study reports a 44% increase in new materials validated by researchers who had switched over to using the AI tool. These materials tended to align better with the performance targets R&D teams had in mind, suggesting that the AI’s recommendations were not just more numerous but also more precisely tailored. Growth in Patent Applications Patent filing activity rose by 39% among the scientists with AI access, an important gauge of how well a company can protect its intellectual property and achieve a competitive edge. Increased Prototyping Moving from the realm of theoretical studies to creating tangible prototypes, the study found a 17% jump in the number of lab projects that reached an advanced stage of physical development. One of the more intriguing findings is that these increases are not limited to volume. The study also notes an uptick in the intrinsic novelty of the discoveries. 
Patents began referencing technical terms that were new to the corporate lexicon, and entire product lines emerged that would have seemed impractical in earlier eras. This challenges the idea that AI might only recycle existing knowledge or push scientists to remain in familiar territory. Instead, it appears that the algorithm actively explores underexplored regions of the design space, occasionally coming up with materials that deviate significantly from existing norms. Balancing AI and Human Expertise in Material R&D Yet the successful exploration of these uncharted territories hinges critically on the expertise of human scientists. When faced with large volumes of AI-generated suggestions, researchers with considerable knowledge in chemistry, materials physics, or advanced synthesis techniques can rapidly spot which leads are truly viable. They can also detect “false positives,” seemingly impressive solutions that fail under the practical constraints of manufacturing, cost, or long-term stability. In the absence of robust human judgment, the AI might generate a flood of marginally useful formulas. Although the cost of creating new ideas with AI drops significantly (for example, reusing computational simulations is much more economical than running each set of experiments from scratch), there is a separate cost in weeding through poor-quality recommendations. The research suggests that scientists began devoting a larger share of time to screening, refinement, and validation, while automatic generation consumed a smaller chunk of the overall workflow. This transformation of research routines has fundamental implications for how labs structure their teams and set budgets for various research phases. Tackling Challenges in AI-Driven Research Designs One of the aspects that makes Toner-Rodgers’s analysis especially persuasive is the multi-phase experimental design.
The laboratory introduced the AI tool to different groups of scientists in a staggered manner, using something akin to a randomized assignment. Because of that structure, the researchers could observe real-time changes in both the group that had immediate AI access and the group awaiting the rollout. Interestingly, the comparison groups did not show a drop in motivation or productivity—they did not, for instance, scramble to file as many patent applications as possible for fear of losing out. That observation helps debunk a simplistic assumption that the introduction of AI mainly creates an “arms race” dynamic where everyone is desperate to outdo the competition, whether or not the tool itself has inherent merits. To bolster their findings, the authors documented each step of the scientists’ work—from scanning academic literature to setting up real-world experiments—and tracked how these activities evolved. A sharp spike in validated materials signaled that, on average, AI suggestions were valuable. However, deeper data show a growing load of “false starts” for some teams, underscoring that AI systems can produce misleading ideas if not properly curated. Still, for teams capable of navigating those pitfalls, the payoff was substantial. They leveraged the AI tool not merely to add incremental variants to existing formula sets but to conceive entire new categories of compounds with higher odds of commercializing. This progression was especially notable in areas where advanced mechanical or thermal properties are vital, indicating that for certain specialized domains, AI drastically broadens the design range researchers consider plausible. Addressing Workforce Inequality in AI-Powered Labs One of the more socially and organizationally significant findings is that the gains from AI are not spread evenly across the workforce. 
The study draws attention to a phenomenon in which “top scientists”—those who were already highly productive before the AI introduction—reap the greatest improvements, often doubling their output. Meanwhile, scientists in the lower-performing tiers see more modest benefits, if any at all. Why does this disparity matter? For one, it magnifies existing inequalities. If a laboratory or a company prizes patent generation as a metric of success, then the “star performers” capable of handling AI-driven workflows get even further ahead, both in prestige and in compensation. This dynamic could eventually spur a reorganization of R&D hierarchies, with management teams choosing to invest more resources in the high achievers while phasing out less productive personnel. Indeed, the paper points out that the company studied eventually downsized a small subset of researchers who consistently underperformed, effectively consolidating resources among those most adept at leveraging AI. From a talent management standpoint, this reality prompts questions about training and recruitment. Could specialized upskilling programs help mid-tier researchers improve their capacity to filter AI proposals effectively, thus narrowing the performance gap? Or will the workforce naturally gravitate toward intensifying the dominance of the most knowledgeable experts? Interviews with the scientists revealed that domain knowledge—like familiarity with prior attempts, specific interactions among chemical elements, and unique conditions for synthesis—is crucial for sorting the good leads from the worthless ones. The massive influx of computationally suggested formulas does not negate the need for seasoned judgment. On the contrary, it heightens the need for people who can quickly detect whether an algorithmic output is truly workable. 
This dynamic challenges the perception that “AI can replace experience” and instead places a premium on those who can blend sophisticated computational tools with a deep grasp of the field’s underlying principles. Strategic Shifts in Management with AI in R&D Business leaders face a dual set of promises and perils when contemplating how best to incorporate AI into R&D. On one hand, the data highlight significant gains in the speed and volume of patentable ideas and prototypes. The corporation studied saw new product lines emerging that would have been unlikely under traditional practices. Such leaps are especially enticing in high-stakes industries where being first to secure a patent can lead to market dominance and robust revenue streams. On the other hand, these breakthroughs require an equally robust infrastructure for testing and vetting. AI can generate a near-endless supply of theoretical solutions, but a company must also invest heavily in the lab capacity, engineering teams, and pilot production lines needed to turn those theories into tangible products. If such downstream resources are in short supply, the organization might experience bottlenecks, leading to frustration and diminishing the real value of AI-driven innovation. Additionally, the study suggests that once AI starts delivering promising leads, management must address the tension between building upon successful existing products and venturing into uncharted territory. While incremental improvements remain essential for stable revenue, the AI tool fosters exploration of dramatically different approaches that might demand substantial new investments and longer lead times to get to market. Deciding how to allocate resources between these parallel R&D channels becomes a critical strategic question. Licensing and data ownership concerns also come to the forefront.
As more businesses rely on proprietary machine learning solutions or external cloud-based platforms, safeguarding intellectual property becomes complicated. The underlying code, training datasets, and resulting materials constitute valuable assets that may require specialized legal arrangements. Large corporations may opt to develop in-house AI solutions, thus keeping full control of the knowledge base. Smaller firms could partner with vendors or academic institutions, potentially sacrificing some independence but acquiring immediate access to advanced tools that would be impractical to build from scratch. Overall, Toner-Rodgers and his collaborators posit that AI might accelerate innovation in ways that macroeconomic literature has historically underemphasized. While AI’s role in assembly lines and supply chain optimization is well established, its contribution to frontline scientific discovery could exert an even more profound impact, especially if companies learn to integrate these tools effectively into their R&D pipelines. Yet the authors caution that any robust adoption plan must factor in labor dynamics, intellectual property management, and the strategic realignment of R&D resources. Adapting Workforce Morale in the Age of AI A lesser-known yet critical dimension of this AI adoption story is the effect on job satisfaction. The study’s surveys reveal that 82% of researchers reported reduced satisfaction with their work, due mainly to diminished feelings of creativity. Many scientists expressed the sentiment that their primary enjoyment stemmed from the imaginative act of conceiving new compounds. In this new environment, the machine handles most of the “ideation” phase, and human participants are left predominantly with the job of evaluating and filtering.
While such a shift in responsibilities yields tangible gains in productivity and patent counts, it can erode the sense of personal fulfillment many researchers derive from the creative aspects of their job. Even high-performing scientists, who see their overall productivity rise, sometimes report a hollow victory. Their success feels less tied to their individual ingenuity and more to the fact that they have become efficient gatekeepers of the AI’s output. Consequently, management teams must walk a fine line. As AI becomes more ingrained in R&D, the workforce might split into those who thrive under this new mode of work—often the experts with deep knowledge and strong screening abilities—and those who feel disenchanted because they are no longer applying the same level of intellectual originality. Over the long term, that disenchantment could undermine an organization’s culture of innovation, particularly if it triggers the departure of employees who once brought vitality to the team. On a more optimistic note, the data show that the majority of respondents do understand the capacity of AI to accelerate the pace of scientific exploration, and many express a willingness to “reskill” or further refine their expertise to stay aligned with changing job demands. Some see a silver lining: freed from manual data searches, they can focus on analyzing the deeper rationale behind which ideas get flagged by the AI, creating a knowledge feedback loop that enriches both the human expert and the machine-learning system. 
Expanding AI’s Impact Beyond Material Science Although Toner-Rodgers’s case study centers on materials science, the authors propose that the same pattern will likely play out in any field where researchers must sift through countless possible configurations: pharmaceuticals (e.g., discovering new drug molecules), advanced robotics (e.g., designing novel sensor-actuator assemblies), climatology (e.g., modeling global climate variables), or even mathematical conjectures in pure research. Any problem domain with a vast “solution space” can benefit from AI’s capacity to propose creative possibilities grounded in data patterns, but such a domain also depends on human discernment to validate those possibilities in real-world contexts. From a managerial standpoint, the biggest lure is the promise of drastically reduced time-to-market for breakthroughs. Managers at large corporations see AI solutions as a way to compress multi-year R&D cycles into significantly shorter spans, thus beating competitors or securing an early patent in a niche area. This impetus for faster innovation might reshape how companies budget for R&D, how they organize interdisciplinary teams, and how they distribute risk across a portfolio of short-term and long-term projects. However, the authors repeatedly emphasize that greater speed in generating ideas does not magically ensure those ideas will be grounded in practicality or cost-effectiveness. The presence of AI makes it easier to produce large volumes of speculation, and some portion of that speculation will always be illusory unless curated by skilled personnel. This inherent dependence on expert oversight underscores the continuing need for robust education, specialized hiring, and dynamic upskilling programs. The study even cites the reduction of about 3% of the staff in the company’s less experienced ranks, a move that the firm believed was necessary to optimize the synergy between advanced modeling and seasoned human assessment.
The Future of AI in Sustainable Scientific Innovation Another intriguing takeaway is how AI-driven R&D can prompt the creation of product lines that diverge significantly from a firm’s legacy offerings, thereby expanding market possibilities. Yet these radical new directions require a broader supply chain, additional training for engineers, and possibly new manufacturing processes. Even if the AI tool offers an impressive initial concept, the path to large-scale commercialization can still take years—especially if the technology is fundamentally different from anything the company has produced before. On a larger scale, there is the question of who ultimately benefits from this accelerated innovation cycle. If, for instance, a handful of multinational corporations perfect AI-empowered labs, they may secure a near-monopoly on cutting-edge patents, potentially raising barriers to entry for smaller competitors. Some might argue that a more democratized model, in which open-source AI tools help smaller labs, fosters healthy competition and innovation. But that, too, requires accessible training resources, robust data-sharing channels, and frameworks to protect intellectual property without stifling creativity. Toner-Rodgers’s study projects that as AI models become more sophisticated, we are likely to see an even greater impact on how scientific research unfolds in industrial settings. Costs associated with experimentation may continue to fall, encouraging a surge in bold product proposals. The flip side is that researchers might feel less ownership of projects where AI has done much of the conceptual heavy lifting, potentially fueling the ongoing talent reshuffling and motivation dilemmas mentioned earlier. In the face of these rapid advances, it remains critical for R&D directors and senior management to nurture a balanced workplace. 
The best outcomes will likely come from frameworks that keep top scientists motivated to push boundaries while ensuring mid-level talent receives sufficient training and involvement in truly creative tasks. This environment can help maintain a sense of purpose and collective problem-solving, which, over time, can strengthen a company’s capability to integrate computational insights with human-driven expertise. Conclusions Toner-Rodgers’s research reveals a vivid picture of how advanced deep learning can dramatically enhance an organization’s capacity to devise and patent novel materials, offering tantalizing prospects for corporate growth. The empirical data and interviews paint a scenario in which AI is not merely a tool for incremental improvement but a mechanism that accelerates the very formation of new scientific ideas. By analyzing nearly every step of the R&D process and examining how teams responded to the technology, the study offers a nuanced view that goes well beyond standard assessments of productivity gains. Nonetheless, the findings underscore a more complex landscape than one might expect. While overall productivity and patent numbers rise, there is also a stark divergence in performance across different skill levels, which in turn influences hiring practices, training programs, and corporate structures. Beyond pure metrics of efficiency, the study highlights significant shifts in job satisfaction and creativity, along with the need for advanced screening competencies—a blend of algorithmic insight and deep, experience-based knowledge. Ultimately, these observations inform corporate leaders that adopting AI tools for materials research (and likely for other scientific domains) is not just a matter of plugging in a new software platform. 
It demands a serious reevaluation of internal roles, the creation or enhancement of high-level “judgment” positions, a willingness to invest in and empower top experts, and thoughtful planning for the workers who might feel their creative drive diminished. The journey toward AI-augmented science involves recalibrating an entire organizational ecosystem to tap into the promise of large-scale automation without losing sight of the intangible spark that fuels human ingenuity. As companies look to the future, they will find themselves balancing the incredible promise of more efficient and wide-ranging innovation with the challenges of fostering an engaged and skilled research community. Those that succeed in melding computational power with domain-specific expertise stand poised to achieve dynamic leaps in product development. Yet they must remain attentive to the people behind the process—ensuring that passion, creativity, and collective knowledge continue to propel discovery forward on both scientific and human terms. Podcast : https://creators.spotify.com/pod/show/andrea-viliotti/episodes/AI-Driven-Materials-Innovation-Transforming-Research--Product-Development--and-Workforce-Dynamics-e2sv0f7 Source : https://arxiv.org/abs/2412.17866
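The staggered rollout described in the article above compares scientists who already had the AI tool with those still waiting for it. One textbook way to express that comparison is a difference-in-differences estimate, sketched below in Python. This is an assumption for illustration only: the article does not spell out the study's estimator, the function name `diff_in_diff` is hypothetical, and the numbers are made-up toy values, not results from the paper.

```python
from statistics import mean

def diff_in_diff(pre_treated, post_treated, pre_control, post_control):
    """Change in the treated group minus change in the not-yet-treated group."""
    treated_change = mean(post_treated) - mean(pre_treated)
    control_change = mean(post_control) - mean(pre_control)
    return treated_change - control_change

# Toy per-scientist output counts (illustrative only, not data from the study).
pre_treated, post_treated = [10, 12], [15, 17]   # group that received the tool
pre_control, post_control = [10, 12], [11, 13]   # group still waiting for access

effect = diff_in_diff(pre_treated, post_treated, pre_control, post_control)
print(effect)
```

Subtracting the change in the waiting group removes trends that affect everyone in the lab, so what remains is attributed to the tool under the assumption that both groups would otherwise have moved in parallel.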
AI and Innovative Materials: Potential Across Research, Innovation, and New Roles at Work
“Artificial Intelligence, Scientific Discovery, and Product Innovation” by Aidan Toner-Rodgers, developed at MIT with the collaboration of Daron Acemoglu and David Autor, examines how the adoption of a deep learning technology in a large U.S. laboratory affects scientific productivity and companies’ strategic decisions. The research covers 1,018 scientists at a large industrial group interested in accelerating the creation of innovative materials through AI, showing how AI can transform not only development processes but also the work of researchers, redefining skills and roles. For executives and entrepreneurs, the crucial finding is that AI can significantly increase the number of patents and the diversification of prototypes, while also generating imbalances among researchers. The strongest effects emerge where human expertise is able to make the most of the model’s suggestions. Innovation with AI and Innovative Materials: Growth Prospects for Companies The investigation conducted at the laboratory of a large U.S. company tells a distinctive story of how artificial intelligence can influence the creation of new materials, an essential step in sectors such as consumer electronics, medical devices, and industrial manufacturing. In the past, functional materials were discovered through countless attempts: researchers imagined increasingly complex compositions, tested their properties with expensive instruments, and recorded a high failure rate. This long and costly dynamic encouraged R&D teams to reduce risk, often favoring incremental projects over bolder applications. The technology described in the research uses an advanced deep learning approach based on graph neural networks (GNNs) to produce new chemical “recipes.”
The network, trained on large datasets of known materials, generates candidate compositions that meet specified target properties (for example robustness, resistance to extreme temperatures, or particular optical characteristics). The algorithm also draws on “diffusion” techniques and probabilistic predictions capable of suggesting solutions never made before but potentially valid. This is not mechanical automation: scientists remain essential for assessing the feasibility of the combinations and selecting which candidates to synthesize in the lab. The MIT researchers were able to study the processes from the inside thanks to a gradual introduction organized in three phases, involving a total of 1,018 scientists spread across various teams. As each group began using the artificial intelligence tool, its new materials research could be compared with the work of colleagues who were not yet using the AI-based platform. This experimental approach, with random assignment across multiple phases, provided a clear picture of the changes that took place. Overall productivity therefore rises sharply, and this also brings qualitative improvements, since the generated formulas have more original chemical structures, with potentially interesting implications for patentability and market differentiation. For companies, this means opportunities to bring additional innovations to market while reducing the research cost of each patent filed. Another significant aspect concerns the measurement of novelty itself. There was not simply an increase in the volume of projects filed, but greater intrinsic creativity, namely patents that introduce technical terms that had not appeared before.
In addition, the share of prototypes that give rise to radically new product lines grew. This evidence contradicts those who fear that AI only induces “recycling” of already known information, amplifying safe research with no real impact. On the contrary, the technology proved able to explore less obvious regions of the design space. The experimental method also underscores the value of highly qualified human support. Without adequate skills, the technology can generate many more “false promises” than projects that prove feasible. Yet when scientists use AI intelligently, they prioritize testing the compounds most likely to succeed, avoiding wasted time and money. This synergy is a key point for any business strategy that aims to introduce artificial intelligence into R&D. Research and Development with AI and Innovative Materials: Experimental Evidence Toner-Rodgers’s work analyzes data from every stage of R&D: from idea generation to patent filing, through to the appearance of pre-commercial prototypes. A substantial section is devoted to the quantitative study of the more than one thousand scientists in the company’s laboratory, over a period between 2020 and 2024. Every aspect of their work (the type of materials evaluated, simulation results, hours spent on experiments, prior skills) was mapped precisely. The key step that makes the evidence solid is the random way in which researchers were divided, each given access to AI in a different phase. This makes it possible to check whether the increase in patentable ideas actually depends on the technology or on simple internal competition.
Analysis of the workflows shows no drop in performance in the groups that had not yet received access to the AI. This rules out a “race effect” (that is, researchers without AI rushing to outpace colleagues already equipped with the tool). The headline numbers show a notable transformation: the study finds a 44% increase in new materials validated by researchers using the technology. These compounds have more advanced physico-chemical properties, better matching the initial quality targets set by the R&D teams. The study also reports a 39% increase in patent filings, a decisive metric for intellectual property and competitive differentiation. The effect carries through to the subsequent prototyping stage, with a 17% increase in projects actually realized at scale. Overall, the cost of producing and testing ideas does not rise in proportion to the benefits, because the AI allows partial “reuse” of simulations: supplying the reference parameters is enough to obtain a multitude of candidate chemical formulas. The flip side, however, is the need for more critical human evaluation: the “judgment” phase, as the study calls it, expands. Researchers spend less time generating ideas and more time filtering, understanding, and testing the algorithm's suggestions. The study found that 57% of idea-generation tasks are automated, creating a new challenge in telling promising compounds from “false positives”. A takeaway for entrepreneurs and managers lies in how differently this technology benefits workers with different skill levels.
The study shows that the most capable group, the “top scientists”, nearly doubles its productivity, while the bottom third of the distribution gains little. In performance terms, this creates a very marked inequality. Organizationally, companies might decide to change their hiring and training criteria, favoring a strong ability to “evaluate” and correctly interpret the AI's proposals. The study shows that the machine's intervention is not neutral but actively shapes the direction of innovation. This does not mean “less innovative” materials, but solutions that are easier to produce: the study highlights consistent quality in the results and even the creation of product lines that previously seemed implausible. For a manager focused on business prospects, this points to the possibility of introducing products that are not just improvements on earlier ones but radically different. This multifaceted impact also intersects with long-term investment policy: the study notes that, although more patents are filed, some of them will still need years of development to reach the market, especially when they involve entirely new technologies.
AI and Innovative Materials: How Experts Can Reduce Inequalities
A surprising aspect of the study concerns how the effects are distributed. Although artificial intelligence raises average efficiency and boosts patent generation, it also reveals a marked disparity among researchers. According to Toner-Rodgers' analysis, the technology is especially advantageous for scientists who already have excellent skills in their field.
Measuring initial productivity by materials discovered and prior publications, the study observed that researchers in the “top decile” nearly double their output, while those in the lower range remain essentially static. The reasons emerge from how tasks are reallocated. Previously, a scientist devoted a substantial share of time to theorizing about possible chemical structures, imagining variations, reading articles, and consulting databases. The AI covers most of these functions, returning hypotheses for materials that are new or radically different from those already tested. At that point, it falls to the human to interpret and select: a researcher with broad knowledge of applied chemistry, materials physics, or specific synthesis processes will quickly sense whether a molecule suggested by the software has a serious chance of working. By contrast, someone without such expertise risks trusting the proposals too blindly and heading down many dead ends. The unequal distribution of benefits is also felt in patenting and prototype development. If a material requires multiple tests to verify its stability and whether the molecule can be integrated into an actual product, every misjudgment becomes a waste of resources, delaying potentially promising projects. The gap between those who have accumulated years of experience evaluating chemical structures and those who lack it therefore widens quickly. For company leadership, the resulting scenario is one in which the most expert receive merit raises, while less prepared staff are exposed to criticism or, in some cases, to the risk of leaving the organization, as happened in the final phase of the study.
The paper cites interviews that confirm the centrality of “domain knowledge”. Some researchers explain that they recognize “false positives” more readily thanks to specific publications, direct experience, or “intuitions” developed over years of work on similar compounds. This partly contradicts the idea that AI, with its access to massive data, can eliminate the importance of human experience. In practice, the large volume of solutions generated by deep learning needs trained eyes to separate real opportunities from noise. The quantitative analysis shows that, for researchers with weaker evaluation skills, the learning curve with respect to the AI's proposals stays flat: even after several months of use, they keep running into the same selection difficulties. By contrast, those in the most qualified third steadily refine their screening criteria, not only getting better at recognizing projects to discard but also prioritizing tests of those whose chemical characteristics best fit the original goals. From a managerial perspective, this opens the door to rethinking staff development plans and to incentive mechanisms more focused on screening skills. The experience of the lab studied shows that, a little more than a year after large-scale adoption, management began to adjust hiring and to lay off a small group of researchers, concentrating the workforce in the most efficient tier. If extended to other sectors, this process could change the makeup of industrial and research labs, pushing toward greater specialization in interpretive skills.
AI and Innovative Materials: Consequences for Workplace Well-Being and Motivation
The personal consequences should not be overlooked: the study reveals that 82% of researchers report reduced job satisfaction, mainly because of the diminished role of creativity. In the survey-based discussion, Toner-Rodgers highlights how much of the staff is deeply motivated by the “joy of discovery”, by designing materials from scratch, and by being able to say that a given chemical formula came from their own intuition. If the AI automates much of the ideation process, researchers feel they lose that creative part. According to the questionnaires, many find themselves doing verification and control work that is repetitive and less engaging, and feel their skills are less valued. Even those who see tangible results grow (more patents, more prototypes) are not always satisfied: the feeling is that their abilities go unused and that the “judgment” task is not as challenging as conceptualization. Satisfaction tied to professional success usually grows with a sense of fulfillment and mastery, whereas in the new setting the prevailing perception is of “being a support” to choices generated by a model. Many researchers say they believe AI's real potential lies in its ability to accelerate productivity in their fields. Significantly, most of the researchers involved changed their views, showing greater confidence in these systems' ability to speed up innovation while remaining convinced that human work will stay essential. This is because, despite technological progress, critical evaluation and intuition remain aspects that require human input.
However, awareness has grown that AI will profoundly transform the skills needed to succeed. As a result, 71% of survey participants said they intend to reskill, that is, acquire new skills, to avoid being left behind by the changes under way. For an entrepreneur who wants to integrate AI into R&D, these data suggest carefully weighing team dynamics and personnel management, in terms not only of productivity but also of satisfaction. Losing the motivation of some of the most creative staff could, in the long run, affect company culture and the variety of ideas generated. In short, a balance is needed that preserves specialists' creative drive while making control and screening work less alienating. Some companies might rely on new forms of incentive; others will prefer to restructure roles and entrust the evaluation phase to a small circle of “super-experts”. The data also indicate that researchers did not anticipate the intensity of certain effects, a sign that the relationship between AI and human skills remains hard to predict ex ante. Concrete experience with these systems has shown that Big Data and neural networks generate solutions that can advance science, but do not remove the need for a strong contribution from specialist knowledge.
AI and Innovative Materials: Competitiveness Strategies Across Sectors
The study stresses that these findings generalize to sectors where research relies on complex combinations of elements, such as pharmaceuticals (where optimal molecules must be found among an enormous variety of configurations), advanced robotics, computational climatology, or even some branches of mathematics.
The common element is a broad “exploration surface”, in which AI can identify promising structures from statistical patterns, while human expertise retains the role of validating the credibility and practical usefulness of the results. For many companies, the prospect of significantly accelerating the path to patentable discoveries is extremely attractive. The study does not merely show that artificial intelligence can boost idea generation; it also underlines the need for organizational change. Among these changes is rising demand for experts who can act as qualified evaluators: a new area of competence opens up, in which deep knowledge of the scientific field and the ability to interpret algorithmic output become essential requirements. It is no surprise, then, that the company studied began an internal reorganization, reducing by 3% the staff less able to judge the proposed compounds. For top managers, using AI in research raises strategic questions about how R&D divisions are structured. If idea generation accelerates massively, testing and prototyping pipelines must be strengthened to avoid downstream delays. In addition, the tension between entirely new products and improvements to existing lines becomes sharper: the study observes an increase in prototypes that depart from products already on the market, the fruit of radically different materials. On one hand this can bring enormous competitive advantages if the solutions work; on the other it can slow the go-to-market phase. Another interesting consequence concerns positioning in the knowledge supply chain.
In this case, some of the deep learning models were developed on the company's internal datasets, enriched with public databases, but larger companies can be expected to buy or develop proprietary solutions, while smaller ones will turn to external platforms, perhaps cloud-based. Licensing agreements and intellectual property protection therefore become relevant issues, especially given that creating new compounds makes it possible to establish a position in emerging markets. As for future prospects, the study suggests that these tools could accelerate industrial innovation more than earlier macroeconomic studies assumed, since those often treated AI as relevant mainly to the production chain. Here, instead, innovation itself is strengthened, understood not only as new product development but also as progress in “scientific discovery”, generating a potentially cumulative benefit. At the same time, problems emerge, such as growing inequality among workers and the need to renew training paradigms, involving both universities and companies. For a company that intends to stay competitive, it is therefore essential to act promptly, considering not only the potential of these systems but also the importance of developing in-house skills able to work with them effectively. Regarding possible developments, the study notes that the technology, being in rapid evolution, may open even more significant opportunities for product creation, further reducing experimentation costs. The question remains how to ensure that the excitement of the moment does not come at the expense of researchers' motivation; they could leave the field if they feel their roles have been hollowed out.
This human-machine interaction is therefore a first-order strategic challenge for anyone planning medium- to long-term business processes.
Conclusions
Toner-Rodgers' data and observations outline a scenario in which AI clearly increases the capacity to conceive and patent novel materials, with important consequences for companies aiming to broaden their product lines. But the author also paints a realistic picture: the overall gain is not matched by full workforce satisfaction, and the dynamic between human skills and algorithmic suggestions proved more complex than expected. The novelty lies in the evidence that neural networks do not eliminate the creative side of scientific thinking but shift it to a different level, requiring an expert eye that can exploit the suggestions. The interaction between model and specialist is crucial when it comes to screening out unreliable results, a task of strategic importance for avoiding wasted time and resources. The study also considers competition with other AI-based technological solutions, which have already had a significant impact in several sectors. However, the distinctive feature of this application, its ability to accelerate an ideation phase that is rarely automated, suggests the potential could be even greater. For an entrepreneur, integrating this technology could translate into cutting development times by at least a third in the search for new substances and innovative processes, offering a significant competitive advantage. It is clear, however, that internal reorganization becomes indispensable: creating new professional roles, updating the skills of those already working in the system, and investing in highly qualified profiles are fundamental steps.
The study calls for a measured, strategic approach: adopting deep learning models dedicated to corporate R&D has the potential to profoundly transform established processes, pushing executives to devise organizational strategies that effectively combine the enormous potential of automation with the uniqueness of human intuition. The main challenge is not only to increase productivity but also to preserve and value the people able to turn insights into tangible, valuable innovations.
Podcast: https://creators.spotify.com/pod/show/andrea-viliotti/episodes/AI-e-materiali-innovativi-il-potenziale-tra-ricerca--innovazione-e-nuovi-ruoli-nel-lavoro-e2suuhh
Source: https://arxiv.org/abs/2412.17866
Prompt Engineering for Generative AI: Strategies, Security, and Practical Applications
The paper “The Prompt Report: A Systematic Survey of Prompting Techniques” by Sander Schulhoff, Michael Ilie, Nishant Balepur, and colleagues focuses on the most widespread practices in prompt engineering for Generative AI models. The analysis involves universities and institutions such as University of Maryland, Learn Prompting, OpenAI, Stanford, Microsoft, Vanderbilt, Princeton, Texas State University, Icahn School of Medicine, ASST Brianza, Mount Sinai Beth Israel, Instituto de Telecomunicações, and University of Massachusetts Amherst, all engaged in studying AI and applying prompt engineering. The main theme is the strategic use of prompts to improve the comprehension and coherence of generative systems, highlighting techniques tested on various datasets and tasks. This article looks at best practices, methodological perspectives, and experimental results for those who want to harness language models in business and research. An important addition to this landscape is the recent work by Aichberger, Schweighofer, and Hochreiter, “Rethinking Uncertainty Estimation in Natural Language Generation”, which introduces G-NLL, an efficient and theoretically grounded uncertainty measure based on the probability of the output sequence. This contribution is especially valuable for assessing the reliability of language models, complementing the advanced prompting techniques discussed in the paper by Schulhoff and colleagues.
In addition, given the growing importance of cybersecurity in this area, a specific section is devoted to the guidelines in the “OWASP Top 10 for LLM Applications 2025” document, which provides a detailed taxonomy of the most critical vulnerabilities for large language models, offering a complete and up-to-date picture of the cybersecurity challenges and solutions in this fast-evolving field.
Foundations of Prompt Engineering for Generative AI: An Essential Guide
The ability of Generative AI models to produce useful text rests on a set of practices called prompt engineering. This discipline has spread rapidly in artificial intelligence and consists of carefully formulating the input text to obtain targeted responses. Schulhoff and colleagues explain that the prompt is not a mere input but the ground on which the system bases its inferences. It sits at the heart of human-machine interaction, where the key lies in the relevance and richness of the instruction provided. The researchers stress that word choice, syntactic structure, and prompt length can all change the responses. With large models, the issue of the context window arises, meaning the maximum amount of text that can be processed; exceeding it causes the system to disregard the earliest parts. With this mechanism in mind, prompt engineering becomes a process of continuous optimization, in which the purpose and coherence of the introductory text influence the entire output. The paper lists 33 fundamental terms related to prompt engineering, from template to self-consistency, providing a unified, shared vocabulary.
Each term reflects an area of interest, from how to build an internal chain of reasoning (Chain-of-Thought) to whether illustrative examples are supplied (Few-Shot) or not (Zero-Shot). The secret of a well-calibrated prompt is not only asking for something precise but demonstrating, through examples, the expected behavior. What emerges is a kind of in-context learning by the model, formalized with the conditional probability p(A | T(x)), in which the response A depends crucially on the instruction (prompt template) T applied to an input x. This step involves no traditional training, but rather an ability to follow the instruction contained in the text string. The researchers report that this ability has been tested on translation, classification, question answering, and text generation, sometimes with significant improvements. The paper also notes that Generative AI models are highly sensitive to small linguistic changes. Adding spaces, removing an adverb, or changing a delimiter can upend the final result. Hence the need to try prompt variants to see which works best. It is a way of “courting” the model so that it finds the right path. In addition, some researchers place instructions at the start of the message (fictional roles, explicit step-by-step thinking strategies) to make the logical coherence of the response more robust. Another major insight is that, in many situations, the logic of the example matters more than the explicit instruction. If a prompt contains clear demonstrations, the model tends to replicate the structure of those samples, adapting without further commands. This is pivotal: few-shot learning is seen by many as the most effective expression of prompt engineering, because it offers exactly what the model expects to see and guides it in a more controlled way.
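To make the few-shot idea concrete, here is a minimal sketch of a prompt template T applied to an input x. The function name `build_few_shot_prompt`, the `Input:`/`Output:` delimiters, and the sample reviews are illustrative assumptions, not something prescribed by the paper; the resulting string would still need to be sent to whatever model API is in use.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble an instruction, worked examples, and the new input into one prompt.

    The "Input:"/"Output:" delimiters are an arbitrary choice for this sketch.
    """
    lines = [instruction.strip(), ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final block repeats the pattern but leaves the output open for the model.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [
        ("Great battery life.", "positive"),
        ("Stopped working after a week.", "negative"),
    ],
    "The screen is sharp and bright.",
)
print(prompt)
```

Because models can be sensitive to small formatting changes, a helper like this also makes it easy to vary one element at a time (delimiter, example order, number of examples) and compare results.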
The researchers' introductory section highlights the importance of rigorously defining the individual components of a prompt. Saying “write an essay” or “explain quantum mechanics” is often too generic, whereas specifying details, indicating the style, and providing examples of desired answers yields more useful output. With these prerequisites, those working in business innovation or research can already see the advantages of a good prompt for generating analyses and reports or summarizing complex documents.
Taxonomies and Practical Applications of Prompt Engineering for Generative AI
The paper maps 58 text-based prompting techniques, plus others devised for multimodal settings. This large set of schemes fits into a broad taxonomy that organizes methods by purpose: explaining, classifying, generating, and so on. The taxonomy also serves as a bridge for newcomers to prompting, preventing confusion over terms and concepts. Some methods rely entirely on problem decomposition. The paper cites “chain-of-thought” for breaking a question into several steps, “least-to-most” for tackling subproblems, and “program-of-thought” for combining executable code and textual interpretation in a single flow. Other methods follow a “self-criticism” logic, in which the model's initial output is reviewed by the model itself, which tries to correct errors or inconsistencies. These procedures exploit the generative nature of a model, leading it to analyze its own output with a degree of introspection. The authors point out that some techniques have immediate practical application. In customer-support systems, for example, it is very useful to use prompts that encourage precise answers free of inappropriate tone.
Here filters and guardrails come in, with very clear instructions on topics to avoid or permitted styles of speech. For code writing, there are strategies to make the model generate more reliable code segments by selecting in advance example snippets that show the correct structure. A salient point is that the taxonomy can be adapted to one's own project needs. A company that wants to automate email correspondence can opt for prompt templates with the desired style, sample replies, and lexical constraints. Marketers could introduce “roles” in the prompt, pretending to have a creative expert proposing slogans. All of these choices aim at greater productivity. “The Prompt Report: A Systematic Survey of Prompting Techniques” reiterates that no single approach is universally valid: each scenario may benefit from one technique more than others. Moreover, the proposed taxonomy is not limited to English text. The researchers highlight problems with low-resource languages, for which dedicated solutions are suggested, such as “translate-first prompting”, in which text in a minority language is first converted to English. The next step is to build in-context examples consistent with the culture or specific semantic domain, bearing in mind that many current models are trained mainly on English-language text. The ultimate goal remains relevance and accuracy. Notably, the broad taxonomy also covers iterative request structures, in which the model first generates a draft and then refines it. Unlike a traditional question-and-answer method, these techniques suit extended writing, brainstorming, and document drafting.
So anyone engaged in content creation, strategic planning, or analysis of large volumes of text can benefit immediately from adopting such procedures.
Prompt Engineering: Data, Security, and Numerical Results for Optimal Use
One of the most delicate aspects of prompt engineering is security, which directly affects model reliability. Attacks such as prompt hacking use textual deception to force the model to provide undesired information. In some cases, a single sentence in an imperative tone can override the prompt's main instructions, producing offensive or risky output. Many companies are working on this, because chatbots can be manipulated into disclosing confidential data or adopting unintended linguistic styles. The experiments described in the paper show that systems can be manipulated to extract highly confidential text fragments or to evade established rules. One episode is mentioned in which simply suggesting that all previous instructions be ignored caused the moderation constraints to collapse. The work also notes that guardrails or defensive barriers implemented directly in the prompt are not always decisive. Reinforcement or multi-level screening mechanisms have been developed, but these too show limits. Beyond security, precise numerical results emerge from tests on benchmark datasets. In one significant passage, the paper describes a benchmark based on 2,800 questions selected from the broader MMLU, which spans many knowledge categories. Using approaches such as “zero-shot” or “chain-of-thought” improved performance in some cases and, paradoxically, lowered it in others.
No single method was found to dominate: some techniques work wonderfully on mathematical reasoning tasks but fail on narrative problems, or vice versa. These discrepancies urge companies to test prompts extensively before integrating them into critical processes. The authors also address automatic evaluation, explaining that determining whether a prompt is effective requires a scoring system that compares the generated answer with a reference standard. In some studies, responses were produced in several output formats and then compared with the known ground truth. However, human validation is still needed in more complex situations, especially when the task is creative or open to subtle interpretation. The paper highlights the risk of overconfidence in model-generated answers. These systems often respond with a high degree of confidence even when they are wrong. It is essential to warn the user and promote more balanced generation, including requests that prompt an accurate estimate of the level of certainty. However, models do not always show reliable statistical transparency, and asking for confidence percentages may not be enough. Models are in fact found to overestimate the reliability of their own answers. In business, this failure to flag errors can be a significant problem, because a system that appears credible but provides inaccurate information risks causing harm. The numerical scale of these studies is impressive. The paper describes a systematic review that included 1,565 articles, filtered by strict criteria, in order to assemble a complete picture of prompt engineering.
On this basis, the researchers brought out the risks and potential related to security, highlighting the need for specialized solutions.
Prompt Engineering: Advanced Strategies and Evaluation Tools
The paper describes scenarios in which it is preferable to handle several prompt steps in sequence, creating a prompt chain. Such a chain lets the model build up its answers gradually. A system might first generate hypotheses, then verify them in a second step, and finally deliver a definitive version. This mechanism helps with complex tasks, such as solving mathematical problems and planning multi-step activities. In business or research settings, the complexity of the question may require retrieving external information. This is where agents come in that integrate “retrieval augmented generation”, in which the prompt directs the model to fetch items from databases or other services. To cite one of the examples given, a model that must answer about the weather could call a dedicated API, if steered correctly by the prompt. All this opens the door to more dynamic interactions: the chain-of-thought is not only linguistic but can include real actions in an external environment. Evaluating the result is another crucial chapter. On one hand, there are self-consistency procedures, in which the same model generates several versions of the answer with a degree of randomness. Among these, the most frequent or the most coherent is selected, according to some internal metric. On the other, there are experiments with “pairwise evaluation”, in which the model compares two answers and chooses the better one.
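The self-consistency procedure can be sketched as a simple majority vote over repeated samples. In this illustration, `sample_answer` stands in for a call to a language model with non-zero temperature; it is a hypothetical callable, and the canned answers exist only so that the example runs without any external service.

```python
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw several answers and return the most frequent one (majority vote)."""
    answers = [sample_answer() for _ in range(n_samples)]
    # most_common(1) returns a list with the single (answer, count) pair
    # that has the highest count.
    return Counter(answers).most_common(1)[0][0]


# Stand-in for a stochastic model: a fixed sequence of sampled final answers.
canned_samples = iter(["42", "41", "42", "24", "42"])


def fake_model() -> str:
    return next(canned_samples)


final_answer = self_consistency(fake_model, n_samples=5)
print(final_answer)  # "42" appears three times out of five
```

Voting on the extracted final answer, rather than on the full reasoning text, is what makes the comparison meaningful: two chains of thought rarely match word for word even when they reach the same result.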
Self-judgment methods can lighten the load of human evaluation, but they are not infallible, as Schulhoff and colleagues explain in their paper, because models sometimes prefer long or formally complex answers that are not actually better. The paper then introduces the concept of “answer engineering,” the practice of isolating and precisely formatting the desired answer. The technique is particularly useful when concise predictions are needed, such as “positive” or “negative,” or a specific numeric code. Without it, free-form generated text risks burying the information being sought, which complicates automated interpretation. In a managerial setting, having an already structured output can significantly reduce the need for manual intervention.

The discussion of evaluation tools highlights projects such as “LLM-EVAL,” “G-EVAL,” and “ChatEval.” These are frameworks that ask the model to generate a score or a comment on a text, following guidelines written by the model itself or by human operators. It is in this context that the recent research by Aichberger, Schweighofer, and Hochreiter, and in particular their G-NLL method, becomes especially relevant. G-NLL estimates uncertainty from the probability of the output taken as most representative, obtained through deterministic (greedy) decoding: the lower that probability, the higher the negative log-likelihood. Such an approach could be integrated into these systems to provide a quantitative measure of how reliable the generated scores or comments are. For example, if a model generates the sentence “The capital of France is Paris” with a much higher probability than alternatives such as “Rome” or “Berlin,” G-NLL will be low. If instead the model is uncertain among several options, G-NLL will be higher, indicating greater uncertainty.
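As a rough illustration of the idea, G-NLL can be read as the negative log-likelihood of the greedily decoded sequence, that is, the negated sum of the log-probabilities of the tokens chosen at each step. The sketch below assumes those per-token probabilities are already available from the model; the function name `g_nll` and all numbers are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence


def g_nll(greedy_token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of a greedily decoded sequence.

    Each entry is the probability the model assigned to the token it
    actually picked at that step (assumed to be in the range (0, 1]).
    """
    return -sum(math.log(p) for p in greedy_token_probs)


# Invented probabilities for a confident answer and an uncertain one.
confident = [0.99, 0.98, 0.97]
uncertain = [0.50, 0.40, 0.35]

print(f"confident: {g_nll(confident):.3f}")
print(f"uncertain: {g_nll(uncertain):.3f}")
```

Because only one greedy pass is needed, this kind of score is cheap compared with approaches that sample many outputs, which is the efficiency argument the text attributes to the method.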
Indeed, when “LLM-EVAL,” “G-EVAL,” or “ChatEval” produce an evaluation, it could be accompanied by the G-NLL of the text sequence that makes up the model’s response. A low G-NLL would then indicate a high probability for the generated sequence and, consequently, a more reliable evaluation. Conversely, a high G-NLL would signal high uncertainty and call for caution in interpreting the score or comment. One could even weight the generated scores by their G-NLL value, giving more weight to those associated with lower uncertainty, or set a G-NLL threshold above which the model’s evaluation is considered unreliable and discarded, triggering human intervention. In this scenario, G-NLL could also guide iterative improvement of the prompt or of the model itself, since consistently high G-NLL values might suggest revisiting the prompt, the fine-tuning process, or the model architecture.

Integrating G-NLL into these evaluation frameworks would therefore add a further layer of control, quantifying the uncertainty attached to the evaluations and making them more reliable. This matters most when tasks become nuanced, as Schulhoff and colleagues point out, because relying solely on the model’s judgment, without a measure of its uncertainty, could lead to wrong decisions or inaccurate evaluations. The approach of Aichberger, Schweighofer, and Hochreiter thus looks like a valuable tool for making automated evaluation more robust and reliable in complex settings.
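The two uses suggested above, a rejection threshold and uncertainty-based weighting, might look like the following sketch. Everything here is an assumption made for illustration: the `Evaluation` record, the threshold value, and in particular the choice of `exp(-G-NLL)` (the sequence probability) as the weight, which is just one plausible option.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Evaluation:
    score: float  # score produced by the LLM judge
    g_nll: float  # G-NLL of the judge's generated text (lower = more certain)


def split_by_threshold(
    evals: Sequence[Evaluation], max_g_nll: float
) -> Tuple[List[Evaluation], List[Evaluation]]:
    """Separate evaluations to keep from those to send to a human reviewer."""
    accepted = [e for e in evals if e.g_nll <= max_g_nll]
    needs_review = [e for e in evals if e.g_nll > max_g_nll]
    return accepted, needs_review


def weighted_score(evals: Sequence[Evaluation]) -> float:
    """Average the scores, weighting each by exp(-G-NLL) (assumed scheme)."""
    if not evals:
        raise ValueError("at least one evaluation is required")
    weights = [math.exp(-e.g_nll) for e in evals]
    return sum(w * e.score for w, e in zip(weights, evals)) / sum(weights)


# Invented example: one confident judgment and one uncertain judgment.
evaluations = [Evaluation(score=4.0, g_nll=0.1), Evaluation(score=2.0, g_nll=3.0)]
kept, flagged = split_by_threshold(evaluations, max_g_nll=1.0)
print(len(kept), len(flagged))
print(round(weighted_score(evaluations), 2))
```

In practice the threshold would have to be calibrated on held-out data, since raw G-NLL values grow with sequence length and are not directly comparable across very different outputs.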
In short, the combination of multiple prompts, external actions, automated checks, and uncertainty estimation via G-NLL forms an ecosystem that grows in complexity but also in potential usefulness, especially when automating delicate processes or dealing with nuanced tasks. Future research could focus on the practical integration of G-NLL into evaluation frameworks like those discussed, assessing its impact on accuracy, reliability, and the reduction of human intervention.

Multimodal Prompt Engineering: Beyond Text

Recent developments show that prompt engineering does not apply to text alone. Much research targets models that process images, audio, or video. This broadens the range of uses for these systems, with potentially large repercussions for fields such as robotics, medical image diagnosis, and multimedia content creation. The authors discuss “image-as-text prompting”: the idea of converting an image into a textual description so that it can be inserted into a longer prompt. This makes tasks such as automatic photo captioning or visual question answering easier. Techniques also exist for generating images from a textual prompt, where “prompt modifiers” are added to control style. Balancing the terms to emphasize against those to exclude with a negative weight resembles the text optimization practices seen in the linguistic domain.

Audio is also the subject of experiments focused on transcription, speech translation, and even voice timbre reproduction. Some studies have explored applying few-shot learning, that is, learning from a handful of examples, to speech, although the results are not always consistent or reliable.
The analyses presented by Schulhoff and collaborators show that neural audio models often need additional processing stages to improve performance. Here prompting becomes intertwined with feature-extraction pipelines, because a speech sequence cannot be converted directly into a token-like textual format.

The section on video explores the possibility of generating or editing clips from a descriptive input. It describes preliminary tests on video segments in which the system creates initial versions of the subsequent frames. Work is also under way on agents that, given suitably worded instructions, can interact with a simulated environment and produce targeted actions. A notable example would be a robot that, guided by a natural-language command, works out how to move through a space or handle physical objects effectively.

Finally, there is growing interest in 3D prompt engineering, an approach that combines textual prompts with volumetric synthesis or rendering models. In product design or architecture, for example, phrases such as “create a 3D model with smooth, symmetrical surfaces” make it possible to generate changes to meshes or geometric structures. This translation from language to three-dimensional forms opens up appealing prospects, with significant repercussions for industrial prototyping and interactive entertainment.

The multidisciplinary nature of this research confirms that the “prompt-response” step can take many forms. Each time, the aim is to connect the model’s upstream interpretation with the output one wants to obtain.
It is no longer only a matter of sentences and paragraphs: it is a channel that can extend to any digital signal, where the logic of prompting stays the same but the way information is encoded and decoded changes.

Focus on a Real Prompt Engineering Experiment

The paper describes a suicidal risk detection scenario, in which the researchers tried to determine whether a model could recognize signs of severe crisis in texts posted by users in distress. The posts came from a forum specialized in supporting people with thoughts of self-harm. The researchers selected over two hundred messages, labeling some of them with the category “entrapment” or “frantic hopelessness,” according to the clinical definition of interest. The goal was for the model to replicate that labeling without dispensing medical advice.

The initial prompt gave a brief description of what “entrapment” means and asked the model to return a simple “yes” or “no.” The researchers ran into problems with excessive text generation, in which the model tried to offer health advice. To address this, richer context was added that explicitly explained the purpose of the experiment and asked the model not to give advice. Prompts with examples (few-shot) and with chains of reasoning generated by the model itself were then tested, with the aim of improving precision and reducing false positives. Over a series of forty-seven optimization steps, the F1 score, a statistical measure that balances precision (the share of items flagged as relevant that truly are relevant) and recall (the share of all relevant items that were actually found), improved noticeably.
Scores went from extremely low values, caused by the model’s initial inability to respect the required format, to more satisfactory results, although still far from perfect. To capture the output more reliably, the researchers built specific extractors and closing rules into the prompt, forcing the system to answer with a plain “yes” or “no” and nothing else. Even with these precautions, incomplete answers still occurred occasionally. In one experiment, removing an email from the reference text caused a sharp drop in accuracy, suggesting that the additional content was crucial in steering the model to reason more effectively.

This real example shows that building a prompt is not a matter of issuing a simple command but of painstakingly fine-tuning the wording. Every small detail, such as the position of the instructions, the presence of duplicated text, or the definition of a strict constraint, affects the outcome. A tension also emerges between the need for consistency and the model’s tendency to interpret the request too freely.

This is a strong signal for entrepreneurs and executives: where results have sensitive implications, it is prudent to involve domain experts (clinical, legal, and so on) and engineers who specialize in prompts. Optimization in the abstract is not enough; continuous alignment with professional ethics guidelines is required. The researchers also tried automation tools to generate and evaluate prompts in sequence, finding that the algorithm sometimes improved certain scores. Yet human interaction proved just as decisive in controlling false positives: an optimization tool tended to sacrifice sensitivity in favor of greater precision, with obvious ethical risks.
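The two ingredients of this experiment, a strict yes/no extractor and an F1 score over the extracted labels, can be sketched as follows. This is an illustration under stated assumptions, not the researchers' actual code: the function names are hypothetical, the sample outputs are invented, and counting an unparseable answer as "not flagged" is only one possible convention.

```python
import re
from typing import List, Optional, Sequence


def extract_yes_no(raw_output: str) -> Optional[bool]:
    """Return True for 'yes', False for 'no', None if neither word is found.

    The first standalone 'yes' or 'no' in the text decides the label.
    """
    match = re.search(r"\b(yes|no)\b", raw_output.lower())
    if match is None:
        return None
    return match.group(1) == "yes"


def f1_score(gold: Sequence[bool], predicted: Sequence[Optional[bool]]) -> float:
    """F1 for the positive class; None predictions count as not flagged."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p is True)
    fp = sum(1 for g, p in pairs if not g and p is True)
    fn = sum(1 for g, p in pairs if g and p is not True)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented model outputs: the third one ignores the required format.
outputs = ["Yes.", "no", "I would suggest talking to a doctor.", "Answer: YES"]
gold_labels = [True, False, True, False]

predictions: List[Optional[bool]] = [extract_yes_no(o) for o in outputs]
print(predictions)
print(f1_score(gold_labels, predictions))
```

The third output shows why formatting failures depress the score: an answer that cannot be parsed becomes a missed positive, regardless of what the model "meant".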
This account of a concrete case shows that prompt engineering is not a theoretical exercise but a process that requires an exploratory mindset, attention to detail, and awareness of real-world consequences.

Prompt Engineering, Generative AI, and Security: OWASP Guidelines

In the constantly evolving landscape of generative artificial intelligence, cybersecurity takes on primary importance, especially where large language models are concerned. The document “OWASP Top 10 for LLM Applications 2025” offers a detailed, up-to-date examination of the main threats facing these technologies, enriching the picture drawn by the paper by Schulhoff and colleagues. In particular, OWASP focuses on ten critical vulnerabilities, offering an indispensable perspective for anyone using LLMs in real operational settings, whether in business or in research.

One of the most insidious problems is undoubtedly Prompt Injection, which comes in two variants: direct and indirect. In the first, the attacker inserts malicious input directly into the prompt; in the second, the attacker exploits external sources processed by the model. It is therefore not enough to rely on techniques such as Retrieval Augmented Generation (RAG) or fine-tuning; it is crucial to implement solid access controls, validate inputs carefully, and require human approval for the most sensitive actions. Imagine, for example, a chatbot that grants unauthorized access because of a malicious prompt, or a model that is manipulated without the user’s knowledge while processing instructions hidden in a web page. Equally critical is the unintended disclosure of sensitive information, so-called “Sensitive Information Disclosure.”
The OWASP document stresses the need to sanitize data and apply strict access controls, and it introduces the “Proof Pudding” attack, which exploits the leakage of training data to compromise the model. Security, however, is not limited to incoming and outgoing data; it extends to the entire LLM supply chain. Using pre-trained models from third parties, an increasingly common practice, carries the risk of running into compromised models with backdoors or hidden biases. For this reason, OWASP suggests adopting tools such as SBOMs (Software Bills of Materials) and running thorough integrity checks. Closely related is the “Data and Model Poisoning” vulnerability, which concerns the intentional manipulation of training data. To counter it, in addition to careful verification of data provenance, anomaly detection techniques and specific robustness tests are proposed.

Model output that is not handled correctly, or “Improper Output Handling,” can open the door to vulnerabilities such as XSS or SQL injection; hence the recommendation to treat LLM output as potentially dangerous and to apply the same validation and sanitization techniques used for user input. Another crucial aspect is “Excessive Agency,” which occurs when an LLM has permissions or capabilities beyond what is necessary. To mitigate this risk, OWASP suggests keeping the models’ functionality and permissions to a minimum and integrating “human-in-the-loop” mechanisms for the most critical actions. The document then introduces the “System Prompt Leakage” category, that is, the leakage of information about the system prompt.
Although this prompt should never contain sensitive data, its exposure can help an attacker better understand how the model works and bypass any controls. It is therefore better not to place confidential information in system prompts and not to rely on them alone to control the model’s behavior. New and of particular interest is the “Vector and Embedding Weaknesses” category, which focuses on vulnerabilities tied to the use of vectors and embeddings, especially in the context of RAG. Access and integrity controls therefore become essential to prevent manipulation of, or unauthorized access to, these vital components. No less important is the problem of “Misinformation”: OWASP treats the generation of false or misleading information by LLMs as a specific vulnerability, recommending external verification, fact-checking, and transparent communication about the limits of these models. Finally, “Unbounded Consumption” concerns excessive resource consumption, with possible consequences for cost and service availability. Rate limiting, resource monitoring, and timeouts for the longest operations are some of the suggested countermeasures.

In conclusion, LLM security is a complex, multifaceted topic that requires a holistic, layered approach. The OWASP document, with its detailed and continually evolving taxonomy, is a valuable resource for anyone who wants to explore this field, providing concrete guidelines for exploiting the potential of large language models while minimizing the associated risks. Security, in this context, cannot be optional; it is a fundamental requirement, built in from the design stage, to ensure the reliability and sustainability of these increasingly pervasive technologies.
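The recommendation to treat LLM output like untrusted user input can be illustrated with two standard-library safeguards: escaping before rendering as HTML, and parameterized queries before writing to a database. This is a minimal sketch rather than a complete defense; the function names and the `replies` table are hypothetical and chosen only for the example.

```python
import html
import sqlite3


def render_reply(llm_output: str) -> str:
    """Escape model output before embedding it in an HTML page."""
    return f"<p>{html.escape(llm_output)}</p>"


def store_reply(conn: sqlite3.Connection, llm_output: str) -> None:
    """Store model output with a placeholder, never via string formatting."""
    conn.execute("INSERT INTO replies (body) VALUES (?)", (llm_output,))


# Demo with an in-memory database and deliberately hostile "model output".
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE replies (body TEXT)")

hostile = "<script>alert(1)</script>"
print(render_reply(hostile))

store_reply(connection, "x'); DROP TABLE replies; --")
print(connection.execute("SELECT COUNT(*) FROM replies").fetchone()[0])
```

Escaping and placeholders address only the XSS and SQL injection cases named in the text; output passed to shells, file paths, or downstream tools needs its own context-specific handling.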
Conclusions

The analysis shows that prompt engineering has become a central building block in the use of Generative AI, but it remains an evolving field. The wide range of techniques, from methods for decomposing problems to self-consistency strategies, shows how varied the landscape is. While there is encouraging progress in exploiting linguistic context, risks persist from textual deception and from answers that are unbalanced in terms of confidence and accuracy. The prospects for business and management are significant, because a targeted prompt can automate report generation or data classification, cutting time and costs. At the same time, the state of the art demands rigorous testing: as the experiments on recognizing crisis signals showed, it is not enough to transfer a procedure from one system to another and trust that it will work. The existence of similar models and technologies, capable of operating with different prompting strategies, suggests evaluating each solution comparatively and understanding its limits and potential.

On closer inspection, prompt engineering is not the same as traditional programming. It is rather a matter of “stitching” instructions and example contexts around the statistical nature of the model, so that the output responds precisely to real needs. It is not a merely mechanical scheme: it requires ongoing dialogue between those who build the prompts and those who know the application domain in depth. This synergy produces the most reliable solutions, in which the balance between security, accuracy, and semantic coherence is never taken for granted.

Podcast: https://creators.spotify.com/pod/show/andrea-viliotti/episodes/Prompt-Engineering-Generative-AI-strategie--sicurezza-e-applicazioni-pratiche-e2su3fa
Source: https://arxiv.org/abs/2406.06608
Prompt Engineering in Generative AI: Strategies, Security, and Use Cases
The research titled “The Prompt Report: A Systematic Survey of Prompting Techniques” by Sander Schulhoff, Michael Ilie, and Nishant Balepur focuses on the most common practices in prompt engineering for Generative AI models. The analysis involves various universities and institutes (including the University of Maryland, Learn Prompting, OpenAI, Stanford, Microsoft, Vanderbilt, Princeton, Texas State University, Icahn School of Medicine, ASST Brianza, Mount Sinai Beth Israel, Instituto de Telecomunicações, and the University of Massachusetts Amherst), all working on AI and its prompt engineering applications. The central theme is the strategic use of prompts to enhance the comprehension and coherence of generative systems, highlighting techniques tested on different datasets and tasks. This research provides insights into best practices, methodological perspectives, and experimental outcomes for anyone looking to harness large language models effectively in business and research. A notable addition to this field is the recent work by Aichberger, Schweighofer, and Hochreiter, titled “Rethinking Uncertainty Estimation in Natural Language Generation,” which introduces G-NLL, an efficient and theoretically grounded measure of uncertainty based on the probability of a given output sequence. This contribution is particularly valuable for evaluating the reliability of language models, complementing the advanced prompting techniques discussed by Schulhoff and collaborators. Given the rising importance of cybersecurity in this realm, a dedicated focus is also provided on the guidelines outlined in the “OWASP Top 10 for LLM Applications 2025,” which offers a detailed taxonomy of the most critical vulnerabilities affecting large language models. This document supplies a comprehensive and up-to-date overview of the challenges and solutions related to cybersecurity in a rapidly evolving sector.
Foundations of Prompt Engineering in Generative AI: Essential Guide

The ability of Generative AI models to produce useful text relies on a set of procedures known as prompt engineering. This discipline has rapidly gained traction in the field of artificial intelligence and revolves around carefully crafting the input text to obtain targeted responses. Schulhoff and colleagues highlight that the prompt is more than just a simple input; it serves as the ground on which the system bases its inferences. It represents a crucial step at the core of human-machine interactions, where relevance and the richness of the instructions prove to be the key factors. Researchers underscore how term choices, syntactic structure, and prompt length can lead to significant differences in model outputs. When discussing large language models, the idea of a context window, namely the maximum amount of text the model can process, comes into play. Once that limit is exceeded, the system discards earlier parts of the prompt. Keeping this mechanism in mind, prompt engineering in generative AI becomes a continuous optimization process, where the goal and coherence of the introductory text shape the entire output. The study lists 33 core terms tied to Prompt Engineering in Generative AI, from template to self-consistency, offering a unified and shared vocabulary. Each term reflects a specific area of interest, spanning how to build an internal reasoning chain (Chain-of-Thought) to how to incorporate clarifying examples (Few-Shot or Zero-Shot). The secret of a well-tuned prompt doesn’t just lie in asking for something specific, but in demonstrating the desired behavior through examples. What emerges is a sort of internal learning within the model, formalized through the conditional probability p(A | T(x)), where the answer A depends heavily on the instruction T applied to an input x.
This process does not imply any traditional training phase, but rather indicates the model’s capacity to follow the instruction contained in the string of text. The researchers show how this capacity has been tested in translation, classification, question answering, and text generation, sometimes leading to notable improvements. The study also notes that Generative AI systems are highly sensitive to subtle linguistic changes. Adding spaces, removing an adverb, or switching delimiters can dramatically alter the final output. Hence the need to experiment with various prompt formulations to see which one works best—akin to “wooing” the model to guide it along the intended path. Some researchers introduce specific instructions at the beginning of the message (fictitious roles, explicit step-by-step strategies) to strengthen the logical consistency of the answer. Another pivotal insight is that, in many situations, a clear example is more influential than an explicit directive. If a prompt provides clear demonstrations, the model tends to replicate the structure of those samples, adapting without additional commands. This is a core element: few-shot learning, in fact, is considered by many to be the most effective expression of prompt engineering, since it provides exactly the patterns the model expects, guiding it more reliably. In the introduction, the researchers emphasize the importance of rigorously defining each prompt component. Saying “write an essay” or “explain quantum mechanics” is often too generic, whereas specifying details, indicating style, and giving examples of the desired answers result in more useful outputs. Thanks to these prerequisites, professionals in corporate innovation or research can immediately grasp the benefits of a solid prompt to generate analyses, reports, or synthesize complex documents. 
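To make the role of the instruction template concrete, the sketch below assembles a few-shot prompt from an instruction, some demonstration pairs, and a new input, roughly the T(x) in the conditional probability mentioned earlier. The `Input:`/`Output:` layout and the function name are assumptions chosen for illustration, not a format taken from the paper.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: Sequence[Tuple[str, str]],
    new_input: str,
) -> str:
    """Combine an instruction, demonstrations, and a new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        # Each demonstration shows the model the exact pattern to imitate.
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The prompt ends on an open "Output:" so the model completes the pattern.
    lines.append(f"Input: {new_input}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved it", "positive"), ("Terrible service", "negative")],
    "The delivery was late",
)
print(prompt)
```

Passing an empty list of examples turns the same function into a zero-shot template, which makes it easy to compare the two styles on the same inputs.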
Taxonomies and Practical Applications of Prompt Engineering in Generative AI The research endeavors to map 58 textual prompting techniques, plus other variations developed for multimodal settings. This multitude of strategies falls under a broad taxonomy that organizes methods by their purpose: explanation, classification, generation, and so on. The taxonomy itself acts as a gateway for anyone approaching the prompt ecosystem, helping avoid confusion in definitions and concepts. Some methods revolve around breaking down the problem. The study cites “chain-of-thought” to split a question into multiple steps, “least-to-most” to tackle subproblems, and “program-of-thought” to encapsulate sequences of code, executable snippets, and textual interpretations within the same flow. Other techniques embrace “self-criticism,” where the model’s initial text generation is subsequently reviewed by the model itself to spot errors or inconsistencies. These procedures leverage the generative nature of the system, leading it to analyze its own output with a degree of introspection. The authors highlight that certain techniques are immediately applicable in real-world contexts. In customer support systems, for example, it’s highly useful to adopt prompts that ensure precise and appropriately toned answers. Here, filters and guardrails come into play, using very explicit instructions about restricted topics or permissible phrasing. For code generation, there are strategies to prompt the system to produce more reliable programming segments, selecting example snippets that illustrate the correct structure in advance. A key point is the possibility of adapting the taxonomy to specific project needs. If a company wants to automate email correspondence, it can use templated prompts with the desired style, example replies, and lexical constraints. A marketing team might introduce fictional “roles” in the prompt, simulating an in-house creative expert suggesting slogans. 
Such decisions all aim to boost productivity. Within “The Prompt Report: A Systematic Survey of Prompting Techniques,” it is reiterated that there is no one-size-fits-all approach: every context can benefit from a different technique. Furthermore, the proposed taxonomy is not limited to English. The researchers note the challenges posed by low-resource languages, suggesting solutions like “translate-first prompting,” where text in a less common language is initially converted to English. The subsequent step is building in-context examples consistent with the cultural or thematic domain, leveraging the reality that most current models are primarily trained on English-language data. The ultimate goal remains achieving relevant and accurate outputs. Another intriguing point is that the taxonomy includes iterative request frameworks, where the model initially produces a draft and then refines it. Unlike the standard question-and-answer method, these techniques are especially suitable for extended writing tasks, brainstorming, or preparing documents. Anyone engaged in content creation, strategic planning, or the analysis of large text corpora can reap immediate rewards by adopting such procedures. Prompt Engineering in Generative AI: Data, Security, and Optimal Results One of the most delicate issues linked to prompt engineering is security, which directly affects the trustworthiness of the models. Threats such as prompt hacking exploit textual manipulations to coax the model into providing unwanted information. In some cases, a single forceful sentence can override the main instructions within the prompt, resulting in offensive or risky outputs. Many companies are actively addressing this point, as chatbots can be manipulated to disclose confidential data or adopt linguistic styles that breach compliance guidelines. Experiments in the study highlight the ease with which attackers can coerce systems to output highly sensitive text or circumvent established rules. 
An example is an incident where merely advising the model to “ignore all previous instructions” caused all moderation constraints to collapse. The research also shows that building guardrails or defensive layers directly within the prompt does not always resolve the issue. Multi-layered reinforcement or screening mechanisms exist, but these too have limits. Beyond security considerations, the research features precise numerical results from tests using reference datasets. In a noteworthy passage, it describes a benchmark based on 2,800 questions selected from the broader MMLU, covering diverse knowledge domains. Employing approaches such as “zero-shot” or “chain-of-thought” led in some instances to improved performance or, paradoxically, performance drops. There was no single dominating method: some techniques worked excellently on math tasks but stumbled on narrative problems, and vice versa. These discrepancies urge organizations to thoroughly test prompts before integrating them into mission-critical processes. The authors also consider automated evaluation, noting that determining whether a prompt is effective requires a scoring system to compare the generated responses with a known standard. Some studies compare sentences in various output formats to a correct reference. However, there is a recognized need for human validation in more nuanced tasks, particularly if creativity or subtle interpretations are required. The study warns of the danger of overconfidence in the responses generated by models. Frequently, these systems deliver answers with a high degree of certainty, even when they are incorrect. It’s essential to caution users and encourage balanced content generation, including instructions that prompt accurate self-assessment of certainty. Yet, models do not always offer reliable transparency about their confidence levels, and merely requesting confidence percentages may not suffice. There are cases in which the systems overestimate their reliability. 
In a corporate setting, such unflagged errors can be a major concern, since a system that appears persuasive but supplies inaccurate information can have damaging repercussions. The scale of the studies involved is impressive. The paper refers to a systematic review of 1,565 articles, selected according to strict criteria, to piece together a comprehensive overview of prompt engineering. From these findings, the researchers highlight risks and possibilities, underlining the need for specialized solutions to maintain security. Advanced Strategies and Evaluation Tools for Prompt Engineering in Generative AI The research outlines scenarios that favor managing multiple prompts sequentially, forming a prompt chain. This chain enables the model to build responses in stages. In the first stage, for instance, the system might generate hypotheses; in the second, it tests them; and finally, it provides a definitive version. This mechanism proves useful for tasks involving multiple steps, such as solving math problems or planning multi-phase activities. In business or research contexts, the complexity of a question may call for retrieving external information. This is referred to as using agents that leverage “retrieval augmented generation,” where the prompt instructs the model to fetch relevant data from databases or other services. One illustrative scenario involves a model tasked with reporting the current weather conditions: if guided appropriately by the prompt, it might trigger an API call. This expands the scope of interactions: the chain-of-thought is not just linguistic but can include real-world actions. Result evaluation is another critical chapter. On one hand, there are self-consistency procedures, where the model generates multiple versions of a response with some degree of randomness. The system then picks the one that appears most frequent or coherent according to internal metrics. 
On the other hand, some experiments use “pairwise evaluation,” where the model compares two responses to select the better one. These self-assessment methods can lessen the burden of human evaluation, but they are not foolproof, as Schulhoff and colleagues note. Models sometimes favor lengthy or formally complex answers without actually improving quality. The concept of “answer engineering” is also introduced, focusing on isolating and precisely formatting the desired response. This technique proves especially helpful when a concise output is needed, such as “positive” or “negative,” or a specific numeric code. Without it, generating free-form text could obscure the data point in question, complicating automated interpretation. In many managerial scenarios, having a structured output reduces the need for manual intervention. The discussion around evaluation tools highlights projects like “LLM-EVAL,” “G-EVAL,” and “ChatEval.” These frameworks ask the model itself to generate a score or comment about a text, following guidelines from either the model or human operators. Here, the recent research by Aichberger, Schweighofer, and Hochreiter—and specifically the G-NLL method—plays a significant role. G-NLL estimates the level of uncertainty based on the probability assigned to the output sequence determined most representative under deterministic (greedy) decoding. This approach could be integrated into these systems to provide a quantitative measure of reliability for the automatically generated scores or comments. For instance, if the model outputs “The capital of France is Paris” with a far higher probability than alternatives like “Rome” or “Berlin,” then G-NLL is low. Conversely, when the model is unsure among multiple options, G-NLL is higher, indicating greater uncertainty. When “LLM-EVAL,” “G-EVAL,” or “ChatEval” produce a given score, one could incorporate a G-NLL measure for the textual sequence constituting the model’s answer. 
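As a rough sketch of the idea, G-NLL can be computed as the negative sum of the log-probabilities that the model assigned to each token of its greedily decoded output. The function below assumes those per-token log-probabilities are already available (how to obtain them depends on the model API and is not shown), and the numbers in the example are invented for illustration.

```python
import math
from typing import Sequence


def g_nll(token_logprobs: Sequence[float]) -> float:
    """Negative log-likelihood of a greedily decoded sequence.

    token_logprobs holds the log-probability of each generated token,
    assumed to come from greedy (deterministic) decoding.
    """
    return -sum(token_logprobs)


# Invented per-token probabilities for two hypothetical answers.
confident = [math.log(p) for p in (0.90, 0.95, 0.99)]
unsure = [math.log(p) for p in (0.40, 0.35, 0.50)]

# The confident sequence yields the lower G-NLL (less uncertainty).
print(g_nll(confident) < g_nll(unsure))  # prints True
```

Because the value grows with sequence length, comparing scores across answers of very different lengths would need extra care.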
A low G-NLL would indicate high confidence in the generated sequence and thus higher trust in the evaluation. In contrast, a high G-NLL would flag elevated uncertainty, suggesting caution in interpreting the score or comment. One might even weigh generated scores by their G-NLL values, giving more credence to those tied to lower uncertainty or setting a G-NLL threshold beyond which the model’s evaluation is deemed unreliable, requiring a human review. Under this scenario, G-NLL could guide iterative improvements to the prompt or the model itself, since consistently high G-NLL values might point to problems with the prompt, the fine-tuning process, or the model architecture. Integrating G-NLL into these evaluation frameworks would provide an added layer of oversight by quantifying the uncertainty associated with the scores and thus making them more robust. This is critical, especially for nuanced tasks, as underscored by Schulhoff and colleagues: relying solely on the model’s judgment without a measure of uncertainty could lead to flawed conclusions or subpar evaluations. The method proposed by Aichberger, Schweighofer, and Hochreiter thus emerges as a valuable tool for strengthening and stabilizing automated evaluations in intricate scenarios. In summary, leveraging multiple prompts, external actions, automatic oversight procedures, and uncertainty estimation via G-NLL creates a more complex yet significantly more beneficial ecosystem—particularly for automating sensitive processes or addressing nuanced tasks. Future research might focus on practically integrating G-NLL into the discussed evaluation frameworks, assessing its impact on accuracy, reliability, and the reduction of human intervention. Multimodal Prompt Engineering in Generative AI: Beyond Text Recent progress shows that prompt engineering extends beyond text. 
Many lines of research concentrate on models that process images, audio, or video, broadening the scope of potential applications in fields such as robotics, medical imaging diagnostics, and multimedia content creation. The authors address “image-as-text prompting,” meaning the conversion of an image into a textual description, which can then be incorporated into a broader prompt. This tactic facilitates automatic photo captioning or visual question answering. Other techniques allow the generation of images from textual prompts, incorporating “prompt modifiers” to control style. The balance between emphasized and excluded terms (with negative weights) echoes the text-optimization practices seen in linguistic contexts. Audio is similarly an area of experimentation, covering tasks like transcription, voice translation, and even reproducing vocal timbre. Some studies have examined few-shot learning for speech, though the results are not always consistent. Schulhoff and collaborators point out that neural network-based audio models often require additional processing steps to enhance performance. In this domain, prompting intersects with feature extraction pipelines because raw speech cannot be directly converted into a token-friendly textual format. The section on video explores generating or modifying clips based on textual inputs. Researchers have tested early-stage systems that create subsequent video frames. There are also initiatives aiming to design agents capable of interacting with simulated environments through suitably formulated instructions. A notable example might be a robot that, guided by a natural language command, interprets how to move or manipulate physical objects effectively. Additionally, there is growing interest in 3D prompt engineering, bringing together textual suggestions with volumetric or rendering-based synthesis. 
In product design or architecture, for instance, expressions like “create a 3D model with smooth, symmetrical surfaces” enable modifications to meshes or geometric structures. This transformation from language to three-dimensional shapes opens up promising avenues in industrial prototyping and interactive entertainment. The multidisciplinary dimension of these efforts reaffirms that the “prompt–response” relationship can take countless forms. Each time, the aim is to forge a link between the model’s upstream interpretation and the desired output. It’s not only about sentences and paragraphs: the channel can expand to any digital signal, preserving the prompting logic but adapting the encoding and decoding of information. Focus on a Real Prompt Engineering Experiment The paper details a scenario involving suicidal risk detection, examining whether a model can identify red flags in messages posted by individuals in severe distress. Researchers used posts from a specialized support forum for people who exhibit self-harm ideation. They selected over two hundred messages, labeling some as “entrapment” or “frantic hopelessness,” following a specific clinical definition. The objective was to see whether the model could replicate this labeling without offering any medical advice. The initial prompt described what “entrapment” means and asked the model to reply with a simple “yes” or “no.” However, the model often produced excessive text, attempting to provide healthcare suggestions. To resolve this, the researchers expanded the context, specifying the experiment’s goals and instructing it not to give any advice. Prompts featuring examples (few-shot) and internally generated reasoning chains (chain-of-thought) were also tested to enhance accuracy and reduce false positives. 
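When a model wraps a yes/no label in extra text, one option is to extract the label after the fact. The sketch below is an assumption about how such an extractor could look, not the one used in the study: the helper name `extract_yes_no` is hypothetical, and it simply takes the first standalone "yes" or "no" in the response.

```python
import re
from typing import Optional

# Matches "yes" or "no" only as whole words, ignoring case.
_LABEL = re.compile(r"\b(yes|no)\b", re.IGNORECASE)


def extract_yes_no(response: str) -> Optional[str]:
    """Return 'yes' or 'no' if the response contains one, else None."""
    match = _LABEL.search(response)
    return match.group(1).lower() if match else None


print(extract_yes_no("Yes. The post describes feeling trapped."))  # prints yes
print(extract_yes_no("I cannot determine this from the text."))    # prints None
```

Returning `None` instead of guessing lets incomplete answers be routed to a human reviewer. Taking the first match is a simplification and can mislabel a response that mentions both words.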
After 47 rounds of optimization, the F1 score, the harmonic mean of precision (the share of items flagged as positive that are truly positive) and recall (the share of truly positive items that are detected), improved noticeably. Early attempts were unsuccessful because the model struggled to follow formatting conventions, while later iterations brought better results, though far from perfect. To capture the output more reliably, the researchers integrated specialized extractors and final rules within the prompt, forcing the system to respond with a single “yes” or “no.” Nevertheless, occasional incomplete answers persisted. In one test, removing an email address from the reference text caused a substantial drop in accuracy, implying that additional contextual content helped guide the model’s reasoning more effectively. This real-world example illustrates that prompt construction is not just a matter of issuing commands; it is about conversational fine-tuning. Every detail, from the positioning of instructions to whether some text is duplicated or a narrow constraint is specified, affects the outcome. It also highlights the tension between the need for coherent outputs and the model’s tendency to interpret requests too liberally. This serves as a cautionary note for business leaders and decision-makers: wherever results carry serious implications, involving domain experts (medical, legal, etc.) and engineers proficient in prompt techniques is advisable. Abstract optimization alone is insufficient; ongoing alignment with professional standards and ethical guidelines is essential. The researchers also experimented with automation tools that generate and evaluate prompts in sequence. Sometimes the algorithm improved certain metrics, yet human intervention was still necessary to adjust false positives. An optimization tool might reduce sensitivity to gain precision, posing clear ethical risks.
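The trade-off between precision and sensitivity (recall) can be made concrete with a short calculation. The counts below are hypothetical and are not taken from the study; they only show how a prompt that gains precision by missing more true cases can end up with a similar F1 while being worse at catching at-risk messages.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for two prompt variants on 50 truly positive messages.
sensitive_prompt = precision_recall_f1(tp=40, fp=20, fn=10)
precise_prompt = precision_recall_f1(tp=30, fp=5, fn=20)

for name, (p, r, f1) in [("sensitive", sensitive_prompt), ("precise", precise_prompt)]:
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this made-up case the second variant has higher precision but misses twice as many true positives, which is the kind of shift that a single aggregate metric can hide.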
This real-world case shows that prompt engineering is anything but theoretical, requiring hands-on experimentation, meticulous attention to detail, and heightened awareness of real-world impacts. Prompt Engineering in Generative AI: Security and OWASP Guidelines In the ever-changing landscape of generative AI, cybersecurity is crucial, especially when dealing with large language models. The document “ OWASP Top 10 for LLM Applications 2025 ” offers a detailed and current analysis of the main threats to these technologies, adding to the framework presented by Schulhoff and colleagues. OWASP focuses on ten critical vulnerabilities, providing essential insights for anyone deploying LLMs in practical business or research environments. One of the most notorious risks is Prompt Injection, which comes in two forms: direct and indirect. Direct injection involves an attacker placing malicious content directly into the prompt, while indirect injection uses external sources processed by the model. Consequently, relying on techniques like Retrieval Augmented Generation (RAG) or fine-tuning alone is not enough; robust access controls, meticulous input validation, and possible human approval for higher-stakes actions are critical. Consider, for instance, a chatbot that—due to a malicious prompt—grants unauthorized access, or a model that parses hidden instructions buried in a webpage and is manipulated without the user’s knowledge. Equally concerning is “Sensitive Information Disclosure,” where the unauthorized release of confidential details occurs. OWASP stresses the importance of sanitizing data and applying strict access controls. It also describes “Proof Pudding,” an attack that exploits the leakage of training data to compromise the model. Moreover, security encompasses the entire supply chain of the LLM. The common practice of using pre-trained models from third parties may expose users to compromised models with hidden backdoors or biases. 
For this reason, OWASP recommends employing tools such as SBOM (Software Bill of Materials) and performing rigorous integrity checks. Closely related is “Data and Model Poisoning,” the deliberate manipulation of training data. Countermeasures include verifying data origins, anomaly detection, and specialized robustness testing. Meanwhile, “Improper Output Handling” highlights how carelessness in processing the model’s outputs can allow vulnerabilities like XSS or SQL injection. To address this, OWASP advises treating all LLM outputs with the same level of caution as user-generated content, using validation and sanitization best practices. Another key concern is “Excessive Agency,” where an LLM is granted more permissions or capabilities than necessary. OWASP suggests strictly limiting each model’s functions, complemented by a “human-in-the-loop” mechanism for critical decisions. The guidelines also discuss “System Prompt Leakage,” referring to instances where system-level instructions become exposed. Although these prompts should never contain sensitive data, their disclosure can help attackers better understand and bypass the model’s defenses. It’s therefore wise not to include private information in system prompts and to avoid relying solely on them for controlling the model’s behavior. A newer category, “Vector and Embedding Weaknesses,” delves into attacks on embeddings and vector-based components, particularly relevant to RAG systems. Access control and integrity checks of these vector resources become indispensable to prevent malicious alterations or unauthorized access. Another notable topic is “Misinformation,” which treats the generation of false or misleading data by LLMs as a security flaw, urging external validation, fact-checking, and transparent communication about the model’s inherent limitations. Finally, “Unbounded Consumption” deals with unchecked resource usage, which can result in both economic and availability problems. 
OWASP recommends introducing rate limits, resource monitoring, and timeouts for prolonged tasks. Overall, LLM security is complex and multifaceted, necessitating a holistic, layered approach. With a constantly evolving taxonomy, the OWASP document stands as a valuable resource for anyone entering this domain. It lays out concrete guidelines for leveraging large language models while minimizing the associated risks. In this environment, security cannot be an afterthought; it must be a built-in requirement for ensuring the reliability and sustainability of these increasingly pervasive technologies. Conclusions From this analysis, prompt engineering emerges as a central component of Generative AI usage, albeit one still under ongoing development. The wide range of techniques—from problem decomposition methods to self-consistency strategies—demonstrates the diversity of approaches. While there is encouraging progress in using linguistic context effectively, risks remain tied to textual manipulation and imbalanced answers in terms of confidence and accuracy. The potential impact for businesses and management is significant: a tailored prompt can automate the creation of reports or data classification, cutting costs and saving time. Still, the current state of the art demands rigorous testing. As shown in the suicidal risk detection experiment, it’s unwise to assume that a procedure successfully used in one system will seamlessly transfer to another. Multiple models and related technologies exist, each employing different prompting techniques; this variety calls for a careful comparative approach to understand each method’s strengths and limitations. In a more in-depth view, prompt engineering should not be conflated with traditional programming. Instead, it involves “tailoring” instructions and contextual examples around the model’s statistical nature to ensure the output meets real-world needs. 
It is not purely mechanical: an ongoing collaboration between prompt designers and domain experts is essential. Only through this synergy can robust solutions emerge, where security, accuracy, and semantic coherence aren’t taken for granted.
Podcast: https://creators.spotify.com/pod/show/andrea-viliotti/episodes/Prompt-Engineering-in-Generative-AI-Strategies--Security--and-Use-Cases-e2su1sd
Source: https://arxiv.org/abs/2406.06608
Smarter & Inclusive Cities: Innovation and Inclusivity for the Urban Future
“Smarter & Inclusive Cities” is a course developed by Arup, TalTech, and Climate-KIC in collaboration with the M4EG initiative, supported by the European Union and UNDP. The general theme is empowering cities through the adoption of digital tools, participatory methodologies, and social inclusion programs. The goal of Smarter & Inclusive Cities is to promote new urban growth models that consider the variety of local communities, paying particular attention to efficiency, sustainability, and equal opportunities. The course offers practical examples and performance indicators for public administrators, citizens, and anyone interested in delving deeper into these dynamics.
Smarter & Inclusive Cities: Local and Global Challenges for Inclusivity
The digital transformations at the heart of the “Smarter & Inclusive Cities” concept have opened up unprecedented avenues for improvement, making available platforms, apps, and solutions aimed at more effective management of public services and infrastructure. The idea of making cities “smart” is closely linked to the goal of including all segments of the population in this process, from younger to older age groups, while recognizing that people with disabilities and minorities often encounter greater barriers and limitations in accessing technological tools. The “Smarter & Inclusive Cities” document highlights that 66% of the world’s population is connected to the Internet, whereas Europe shows an average of about 89%. This gap indicates the need for guidelines and interventions to reduce the digital divide, which would otherwise widen existing economic and social disparities. The study notes that 69% of men worldwide have internet access compared to 63% of women, and that the percentage of people living in urban areas who are connected is higher than in rural areas.
This digital gap is not only geographic but also generational, given that 75% of young people between 15 and 24 use the Internet—ten percentage points higher than the average in other age brackets. It becomes evident that new technologies can act either as a catalyst for development or as a source of marginalization, depending on how they are designed and implemented. This leads to a reflection on the importance of creating accessible digital services to reduce gender and socioeconomic disparities, paying attention to real skills and the risks of exclusion. Experience from some urban areas confirms that simply deploying sensors and high-tech solutions does not automatically improve residents’ quality of life. The case of Songdo in South Korea—designed as an ultra-technological, low environmental impact city—demonstrated how limited public engagement can undermine initial objectives. Similarly, Santander in Spain introduced a vast number of sensors but had to deal with major maintenance demands and data privacy issues. Based on these negative examples, the “Smarter & Inclusive Cities” analysis underscores the importance of placing people at the center, with participatory processes starting from the design phase. Technology should not be the only driver of development but rather a tool to address real problems, from air pollution to access to basic services like education and healthcare. For a city to be inclusive and smart, a holistic approach is therefore needed, one in which social and economic dimensions interact with the digital realm, all with a focus on meeting people’s needs. This implies involving residents in decision-making processes, fostering partnerships with universities, startups, local businesses, and community associations, and developing data analysis methods that respect privacy and fundamental rights. 
Smarter & Inclusive Cities: Fostering Participation and Digital Inclusion The inclusion strategies proposed by “Smarter & Inclusive Cities” focus on a key element: community engagement. Experiences in Rotterdam (Netherlands) show how co-designing digital solutions can have positive impacts on service efficiency and management costs. With the creation of the Meld’R app—co-designed with citizens and aimed at real-time reporting of issues in public spaces—70% of all reports are now channeled through the digital platform, saving over 180,000 euros in the first year alone. After this success, the local administration decided to extend the same user-centric approach to other municipal areas, confirming how co-creation can lead to tangible results and increased institutional trust. At the same time, various governments have launched digital training programs targeting people with limited computer skills, often women and seniors. In Greater Manchester (UK), the Digital Inclusion Agenda aims to achieve total coverage, providing skills and devices to those who cannot afford them. The same spirit drives Sihanoukville in Cambodia, where public areas are being equipped with free Internet and IT literacy courses for the most vulnerable groups. This initiative is rooted in the idea that only by removing barriers to new technologies can all social segments be given a real opportunity to participate, thereby enhancing the economic and social vitality of a region. The experiences of Tartu in Estonia demonstrate how a participatory budgeting project, defined through inclusive procedures and user-friendly digital platforms, can make citizens key players in the decision-making process. In Tartu, the city allocated a portion of resources (about 1% of the investment budget) for projects proposed directly by residents. 
This mechanism, called Participatory Budgeting, generated concrete initiatives such as urban redevelopments and additional services while reducing the distance between the municipality and the public. The main takeaway is that an informed and involved local population feels more responsible for the outcome, promoting more efficient resource management. In the context of Smarter & Inclusive Cities, it is also important not to underestimate the risk of exclusion created by overly relying on digital-only channels. The approach supported by the researchers of “Smarter & Inclusive Cities” involves keeping hybrid formats for service provision, allowing physical access to information and onsite support for those who do not have adequate technological skills. This dual-track method is essential to ensure the continuity of services during the digital transition, preventing certain user groups (such as the elderly or individuals with special needs) from being left behind. Participation and co-creation thus acquire a concrete and proactive dimension, where urban development processes are built together with the community rather than merely imposing top-down solutions. Smarter & Inclusive Cities: Security and Digital Access A core point emerging from the “Smarter & Inclusive Cities” document is the importance of digital infrastructure and its proper management. It is not sufficient to cover a large portion of the urban territory with broadband or 5G connections; it is also necessary to think about how to make these infrastructures accessible and reliable, aligned with robust security standards. The European Union, for instance, aims to provide Gigabit coverage to all populated areas by 2030, but reaching this goal requires ongoing dialogue among public agencies, telecommunications companies, universities, and civil society organizations. Such dialogue is crucial to anticipating any potential issues related to cost or long-term sustainability. 
Amsterdam in the Netherlands offers a significant example of how data management must be addressed with well-defined and transparent procedures. The city has developed a standard for handling mobility data (City Data Standard for Mobility – CDS-M) intended to use information produced by citizens without violating privacy and the rights of those traveling daily. This approach combines the need to optimize traffic flows with the desire not to turn residents into “passive data providers.” Such initiatives illustrate how clear rules of digital governance are indispensable for aligning economic and urban innovation objectives with ethical and legal principles. Alongside data management, “Smarter & Inclusive Cities” emphasizes the central role of cybersecurity. Between July 2021 and June 2022, 24% of recorded global cyberattacks targeted public administration. As local governments become more digital, they find themselves increasingly exposed to threats that can disrupt entire services, leading to reputational and economic harm. Therefore, it is crucial to incorporate cybersecurity into strategic planning, starting with staff training and progressing to emergency protocols that safeguard operational continuity. Helsinki, through its innovation entity Forum Virium, and Singapore, through targeted investments in data protection, represent two successful examples of reconciling a connected city’s needs with the protection of information systems. In this setting, strong institutional leadership is key to promoting accountability policies and targeted public information, ensuring that greater digitalization does not lead to security gaps that are difficult to manage. Transitioning to more inclusive and smarter cities from an infrastructure standpoint thus involves a renewal process spanning technological, regulatory, and sociocultural aspects. 
Models of data ownership must be studied to ensure transparency and respect for individual freedom, paying special attention to open standards and interoperability among different systems. Collaboration between public and private entities is vital: for instance, telecommunications companies have an economic interest in expanding networks but need suitable guidelines and incentives to ensure accessibility for the more vulnerable segments of the population. In this regard, the analysis points out that an “intelligent city” is not just a collection of interconnected devices but an ecosystem of decisions and shared responsibilities. Smarter & Inclusive Cities: Driving Innovation in Urban Governance Inclusion and smart approaches to urban areas also translate into collaborative governance. According to the researchers, municipalities have a dual task: govern and coordinate a multi-stakeholder framework by initiating processes of consultation, cooperation, and information exchange. The “Smarter & Inclusive Cities” document describes how Bristol in the UK has adopted the so-called One City Approach, a model that brings together public and private organizations, universities, and civil society groups to pursue common objectives. The city has established thematic boards on transport, healthcare, and economic development, ensuring that responsibility does not fall solely on the municipal authority. To make an inclusive approach a reality, action is required on multiple fronts. On one hand, it is essential to develop medium- to long-term strategies that encompass carbon emission reductions, cultural enhancement, and digital skills growth. On the other hand, administrative procedures need to be set up to facilitate the launch of pilot projects—often rapid and relatively low-cost minipilots (in the range of tens of thousands of euros)—to test innovative ideas in small, controlled environments. 
Some Northern European cities, such as Turku in Finland, have used this strategy to reduce car traffic in central areas by deploying intelligent camera systems to analyze vehicle flow. A small-scale experimental approach makes it possible to evaluate costs and benefits without affecting the entire urban system. Another critical step is defining Key Performance Indicators (KPIs). It is not enough to merely state goals; they must be measured in a clear and consistent way. Indicators can encompass a variety of dimensions: reducing energy consumption, increasing the recycling rate (as in Ljubljana, which managed to exceed 63% correct waste differentiation thanks to a smart waste management system that uses sensors and optimizes vehicle routes, coupled with constant awareness campaigns for citizens), promoting the use of green transportation, expanding co-housing projects, or creating digital social and health services for vulnerable groups. Vienna, with its Smart City Strategy, shows how regular monitoring pushes the administration to periodically review its policies by producing evaluation reports in which results are compared with initially stated objectives. This data-driven mindset promotes timely adjustments and continual improvement of initiatives. Cities venturing down this path must nevertheless navigate institutional complexity, insufficient long-term structural funding, and, at times, cultural resistance among certain social groups. Here, political leadership proves essential for creating a shared vision while also supporting training and participation efforts. Without an adaptable governance structure that engages citizens, there remains a risk of isolated interventions that are difficult to scale or integrate systematically. Within the framework presented by the study, intersectoral coordination is indispensable: bringing together traditionally separate fields such as transport, construction, welfare, and digital innovation. 
Smarter & Inclusive Cities: Projects and Future Opportunities Analysis of initiatives introduced in various urban settings clearly demonstrates that adopting agile and experimental frameworks can produce widespread benefits. The concept of the living lab, for instance, describes physical or virtual spaces where public administrations, research centers, private companies, and citizens collaborate to identify solutions in real time. Some of these labs organize specific challenges—for example, eliminating architectural barriers for people with disabilities or improving air quality through monitoring and emission-reduction mechanisms in specific neighborhoods. Tartu’s strategy for promoting public bike use or Narva’s Well-being Score project, both in Estonia, illustrate how a seemingly limited idea can create a ripple effect. When a project proves successful, it moves on to the upscaling phase, expanding testing to other parts of the city, engaging additional partners, or replicating the solutions in different geographical contexts while adapting them to respective legal and cultural requirements. These expansion or replication mechanisms are critical for transitioning from smaller-scale interventions—often less visible—to a systemic vision capable of reshaping essential services such as mobility, waste management, public lighting, or energy networks. Not all pilot projects, however, achieve satisfactory outcomes. Failure is often part of the learning process and should be considered a valuable resource for refining planning. Sometimes new expertise is needed, or a complete overhaul of the approach is required. Helsinki, for example, constantly experiments with minipilots and, when they do not work, uses negative results to improve subsequent calls for proposals. This agile method treats errors not as defeats but as steps toward refining public policies. Looking at the international landscape, certain challenges are shared. 
In November 2021, UN-Habitat and the Swedish government launched the Climate Smart Cities Challenge, an international initiative involving four cities: Bristol (UK), Curitiba (Brazil), Makindye Ssabagabo (Uganda), and Bogotá (Colombia). The strength of such programs lies in the exchange of practices and expertise. Transferring a prototype from one city to another is not always straightforward due to different regulations and specific socioeconomic conditions. However, the opportunity to create city networks that share their experiences enables collective progress and significantly reduces research and development costs. Looking ahead, as repeatedly emphasized in the “Smarter & Inclusive Cities” document, we are likely to see ever-closer collaboration among local governments, international organizations (like UNDP), and private investors who are prepared to support solutions that combine profitability with social impact. Training administrators and officials to manage complex digital projects will be critical, as will the presence of communities and associations that can take part in decision-making processes, and the construction of solid infrastructure to foster large-scale connectivity. Conclusions The findings from “Smarter & Inclusive Cities” encourage a realistic reflection on how digital technologies, social inclusion, and urban management converge. The data presented does not fully capture the complexity of the subject, but it does provide a clear picture of what cities have already tested and what remains to be done to make territories more livable, sustainable, and accessible. Public and private actors are already working together to develop e-governance platforms, co-design services, and data analysis methods, while the advancement of practices such as participatory budgeting and living labs makes citizen engagement less theoretical and more tangible. 
Other similar, already mature and widespread technologies intersect with environmental monitoring projects or ecological transportation systems. Here, innovation does not simply translate into faster services or widespread sensor deployment but rather into the ability to integrate dimensions that might appear distant, such as administrative efficiency, privacy protection, the reduction of inequality, and long-term sustainability. For entrepreneurs and corporate executives, this approach represents an opportunity to develop cooperative business models grounded in investments that yield both economic and social benefits. A look at global challenges—from climate change to urban robotics—confirms that only those who manage to blend creativity with strategic planning will remain competitive in an interconnected world. This is why the collaboration among international bodies, municipalities, businesses, and citizens continues to be the preferred route for building genuinely inclusive and smart cities, ensuring that the benefits of technological progress do not remain exclusive to a few but become shared resources for an improved quality of life. Podcast: https://spotifycreators-web.app.link/e/Nh2NHaaOKPb Source: https://www.undp.org/eurasia/publications/smarter-and-inclusive-cities