
Preference Discerning in Recommender Systems: Generative Retrieval for Personalized Recommendations

Andrea Viliotti

The study titled “Preference Discerning with LLM-Enhanced Generative Retrieval,” led by researchers Fabian Paischer, Liu Yang, and Linfeng Liu—affiliated with the ELLIS Unit, LIT AI Lab, the Institute for Machine Learning at JKU Linz, the University of Wisconsin–Madison, and Meta AI—opens a new chapter in how we think about recommender systems. Focusing on the practice of “preference discerning,” their work investigates how recommender systems can leverage natural-language user input for sequential product recommendations. The ultimate vision is to give personalization an added dimension: users can express both positive and negative sentiments (steering instructions) so that the recommendation engine can better reflect each person’s nuanced tastes and constraints. From a business perspective, especially in e-commerce, this approach can lead to a measurable performance lift—some experiments suggest improvements of up to 45% in Recall@10—while also nurturing a deeper bond between user and system by minimizing irrelevant results.


Why Preference Discerning in Recommender Systems Redefines Sequential Recommendations

Traditional recommender systems often rely on a user’s past behavior, such as purchase history or clicks, to guess future preferences. Yet such techniques risk overlooking people’s dynamic needs and explicit feedback (e.g., “I’d prefer something allergy-friendly” or “I want to avoid this brand”). With the concept of preference discerning, the authors propose to go beyond mere user-item embeddings: their generative retrieval mechanism deliberately incorporates statements a user might have offered in natural language. They emphasize that real-life preferences can be highly specific, ranging from sentiment-based prohibitions (“I hate scratchy materials”) to broader aspirational requests (“I’d like something lighter for my hikes”). Traditional systems may fail to capture these nuances, while preference discerning integrates them at the core of the recommendation process.


In this approach, the model does not simply scan for the nearest neighbor in an embedding space. Instead, it generates the next relevant item by conditioning on textual preferences. The authors employ a two-step workflow: first, preference approximation extracts the user’s key tastes from data like reviews and item descriptions; second, preference conditioning infuses these preferences into the generative component, shaping recommendations in real time. This dual-stage design helps the model pivot quickly in response to new information—such as a user disclosing a sudden aversion to a chemical ingredient or wanting to try a new style.
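To make the two-stage flow concrete, here is a minimal sketch in Python. The helper names (approximate_preferences, PreferenceConditionedRecommender) are illustrative stand-ins, not the paper’s released code: stage one would normally invoke an LLM, and stage two a trained generative retriever.

```python
# Minimal sketch of the two-stage workflow. The helper names are
# illustrative stand-ins, not the paper's released code.

def approximate_preferences(reviews: list[str], descriptions: list[str]) -> list[str]:
    """Stage 1 (preference approximation): distill natural-language
    preferences from user data. The paper uses an LLM here; this trivial
    stand-in just keeps sentences that voice a like or dislike."""
    cues = ("love", "hate", "prefer", "avoid", "want")
    return [s for s in reviews + descriptions if any(c in s.lower() for c in cues)]

class PreferenceConditionedRecommender:
    """Stage 2 (preference conditioning): generate the next item
    conditioned on the preference text. A real Mender-style model
    autoregressively decodes a semantic ID; this stub only shows the
    conditioning interface."""
    def recommend(self, history: list[str], preference_text: str) -> str:
        return f"<next item | {len(history)} past items | prefs: {preference_text}>"

reviews = ["I love lightweight trail shoes", "Shipping was slow"]
prefs = approximate_preferences(reviews, [])
model = PreferenceConditionedRecommender()
print(model.recommend(["shoe-123", "sock-9"], " ; ".join(prefs)))
```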


Empirical findings from the study show that classical baseline models struggle with fine-grained changes in user sentiment or abruptly shifting tastes over time. By contrast, a preference-discerning system follows detailed cues—either a “fine-grained” shift (subtle variations in otherwise stable tastes) or a “coarse-grained” one (a big departure from the user’s historical pattern). Thus, if a person who usually buys synthetic running shoes suddenly wants “the same shoe model but in a natural fiber,” the algorithm does not default to the old preference but adapts accordingly.


Beyond that, the researchers highlight an often-neglected phenomenon: sentiment following. Many recommender systems are adept at identifying what someone likes but do poorly at interpreting what the individual decidedly dislikes. From an e-commerce standpoint, ignoring these negative signals can be disastrous, since suggesting unwanted products can alienate customers. By embedding user aversions into the generation loop, this new approach looks to reduce friction and zero in on the user’s genuine preferences.


Semantic IDs: Key to Generative Retrieval in Preference Discerning Systems

At the core of the paper, the concept of generative retrieval is enlarged to incorporate textual constraints. One of the structural elements enabling this capability is the design of semantic IDs for items. Formally expressed as:

RQ(e, C, D) = (k_1, \dots, k_N) \in [K]^N,

this formula captures how the system quantizes a continuous item embedding e into N discrete tokens, each drawn from a codebook of size K. The benefit is significant: the recommender can handle huge catalogs without being bogged down by purely numeric embedding vectors that are often hard to interpret. Instead, items are discretized and can be more directly “linked” to natural-language preferences. This synergy between textual preferences and token-based item representations leads to more precise suggestions.
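The RQ in the formula denotes residual quantization: each of the N levels picks the nearest code, then hands the remaining residual to the next level. Below is a hedged, self-contained sketch of that step; the random codebooks are placeholders for ones that would be learned end to end in practice.

```python
# Self-contained sketch of residual quantization, matching the formula:
# each of the N levels picks the nearest code, then quantizes the residual.
# Random codebooks stand in for learned ones.

import numpy as np

def residual_quantize(e: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Map a continuous item embedding e (shape (d,)) onto N discrete
    tokens, one per codebook of shape (K, d)."""
    residual = e.copy()
    tokens = []
    for C in codebooks:
        dists = np.linalg.norm(C - residual, axis=1)  # distance to every code
        k = int(np.argmin(dists))                     # nearest code index in [K]
        tokens.append(k)
        residual = residual - C[k]                    # pass the remainder down
    return tokens

rng = np.random.default_rng(0)
d, K, N = 64, 256, 4                      # embedding dim, codebook size, levels
codebooks = [rng.normal(size=(K, d)) for _ in range(N)]
semantic_id = residual_quantize(rng.normal(size=d), codebooks)
print(semantic_id)                        # e.g. [212, 77, 140, 5]
```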


Initial results come from tests on Amazon categories—Beauty, Toys and Games, Sports and Outdoors—as well as the Steam platform. Across these datasets, the investigators observe that text-driven preference modeling elevates recall measures, effectively boosting the system’s ability to identify the correct items among the top ten recommendations. The advantage is particularly striking for businesses aiming to reduce user churn: a well-targeted suggestion can reassure prospective buyers that the platform “understands” them.

Moreover, the authors cite the notion of history consolidation as a crucial test: the model must distinguish which aspects of the user’s history still matter, while filtering out stale or contradictory preferences. This capacity to sift through a user’s evolving tastes is especially relevant in real-world scenarios—imagine a frequent traveler who once raved about a certain brand but now actively avoids it. If the system can dynamically pivot to incorporate these fresh aversions, conversions are likely to go up, with fewer irrelevant items cluttering the user’s search.


How Benchmarks Validate Preference-Based Models

To rigorously validate their methods, the authors propose a benchmark across five dimensions (an evaluation-loop sketch follows the list):

  1. Preference-based recommendation: The model is given a textual preference—such as “only gluten-free products”—and tested to see if it can produce the correct item next. Training, validation, and test sets are structured in a way that ensures old and new preferences do not overly overlap.

  2. Fine-grained steering: This checks if a system can follow incremental changes in preference. For instance, a user might typically seek a certain style of running shoe but now demands an even lighter variant.

  3. Coarse-grained steering: The system is tested with drastic preference shifts, like jumping from sneakers to formal dress shoes.

  4. Sentiment following: The model must handle strong user sentiment—for or against certain brands, materials, or categories—and either highlight or exclude relevant items.

  5. History consolidation: The system processes a wide array of user preferences, some of which are no longer relevant. The goal is to filter out the noise and keep track of what still matters.
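Taken together, the five axes suggest a simple evaluation loop. Below is a hedged sketch of such a harness; the case fields ("history", "preference_text", "target_item") and the recommend() interface are assumptions made for illustration, since the benchmark’s actual data format lives in the authors’ released code.

```python
# Hedged sketch of an evaluation harness over the five benchmark axes.
# Field names and the recommend() interface are assumptions for illustration.

AXES = [
    "preference_based_recommendation",
    "fine_grained_steering",
    "coarse_grained_steering",
    "sentiment_following",
    "history_consolidation",
]

def evaluate(recommend, benchmark: dict, k: int = 10) -> dict:
    """recommend(history, preference_text, k) -> top-k candidate item IDs."""
    scores = {}
    for axis in AXES:
        cases = benchmark.get(axis, [])
        hits = sum(
            case["target_item"] in recommend(case["history"], case["preference_text"], k)
            for case in cases
        )
        # One ground-truth item per case, so this is Recall@k on that axis.
        scores[axis] = hits / len(cases) if cases else 0.0
    return scores

# Toy usage with a stub recommender:
stub = lambda history, pref, k: ["item-1", "item-2"][:k]
toy = {axis: [{"history": [], "preference_text": "", "target_item": "item-1"}]
       for axis in AXES}
print(evaluate(stub, toy))  # every axis scores 1.0 on this toy data
```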


Across these axes, classical systems can falter, especially on negative sentiments, because they often rely on positive correlations to make a recommendation. If your previous purchases favored brand X, standard models might keep suggesting that brand even if you have recently expressed distaste. Preference-discerning systems aim to fix that loophole, thereby ensuring a more holistic reflection of user desires.


Mender: The Future of Multimodal Recommender Systems

At the center of these innovations stands the model known as Mender—short for Multimodal Preference Discerner. Mender uses semantic IDs to generate new recommendations based on text-based user preferences, further refining the principle of preference discerning. Unlike typical recommender architectures that compare items in pairs, Mender employs an autoregressive approach. Given a user’s current state—history, textual instructions, or both—Mender directly predicts which item should appear next.

Concretely, Mender implements the formula:

RQ(e, C, D) = (k_1, \dots, k_N) \in [K]^N,

as a method to convert embeddings into discrete token codes. This bridging helps the model marry linguistic constraints (“avoid certain allergens,” “aim for sustainable materials,” etc.) with vast product spaces. The result is a system capable of “translating” user prompts into recommended items, circumventing the need for complicated retrieval heuristics. Instead, the system “generates” the next item in a manner reminiscent of how text-generation models produce the next word in a sentence.


Technically, Mender relies on a pre-trained language encoder and a specialized decoder that outputs these semantic token sequences. The cross-attention mechanism couples the user’s textual instructions and purchase history with the generation process.
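As a concrete illustration, the snippet below wires a generic T5 encoder-decoder (a stand-in, not Mender’s actual checkpoint) to accept preference text plus a semantic-ID history and decode the next item’s tokens. The token scheme and prompt format are invented for this example; an off-the-shelf checkpoint would need fine-tuning on recommendation sequences before its generated tokens mean anything.

```python
# Stand-in for Mender's architecture using a generic T5 checkpoint. The
# semantic-ID tokens and prompt format below are invented for illustration;
# a real system fine-tunes on recommendation sequences first.

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Extend the vocabulary with one token per codebook entry (4 levels x 256 codes).
new_tokens = [f"<id_{level}_{k}>" for level in range(4) for k in range(256)]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Encoder input: preference text plus the history rendered as semantic IDs;
# the decoder then emits the next item's token sequence via cross-attention.
prompt = ("preference: avoid synthetic materials | "
          "history: <id_0_137><id_1_42><id_2_201><id_3_88>")
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=4)  # decode one semantic ID
print(tokenizer.decode(out[0], skip_special_tokens=True))
```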


Two variations illustrate Mender’s versatility, contrasted in the toy sketch after this list:

  1. MenderEmb: Maintains separate embeddings for user preferences and items, later aligning them.

  2. MenderTok: Merges the history and user instructions into a single textual stream, prompting the model to treat the entire data as one sequential input.
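A toy contrast between the two input styles might look like the following; the exact formatting is invented for illustration, and the stand-in “embedding” is deliberately trivial.

```python
# Toy contrast between the two variants; the exact formatting is invented.

preferences = ["prefers lightweight running shoes", "avoids brand X"]
history = ["<id_137><id_42><id_201>", "<id_9><id_250><id_13>"]  # semantic IDs

# MenderTok: merge preferences and history into one textual stream that the
# encoder-decoder consumes as a single sequence.
mendertok_input = ("preferences: " + " ; ".join(preferences)
                   + " | history: " + " ".join(history))
print(mendertok_input)

# MenderEmb: embed preferences and items separately, aligning the two sets
# of vectors later inside the model. A trivial stand-in "embedding":
def encode_separately(texts: list[str]) -> list[tuple[str, int]]:
    return [(t, len(t)) for t in texts]

pref_vecs = encode_separately(preferences)
item_vecs = encode_separately(history)
```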


Notably, MenderTok often excels in performance benchmarks. On datasets like Amazon Beauty, the Recall@10 metric jumps from roughly 0.0697 with certain baseline models to around 0.0937 with MenderTok. In Sports and Outdoors, it inches upward from 0.0355 to 0.0427. These gains, while expressed as raw numbers, have tangible implications for real-world e-commerce, translating into more potential conversions.


A pivotal feature is the model’s ability to adapt swiftly to new user profiles or novel constraints expressed in natural language. By leaning on a generative approach, Mender does not require laborious re-training for every shift in user preference. Instead, it processes textual disclaimers or clarifications in real time, updating its recommendations accordingly. This adaptability is invaluable for businesses looking to scale to large catalogs while maintaining a personalized edge.


The study’s authors underscore that Mender’s effectiveness also hinges on high-quality preference inputs. In trials, roughly 75% of preferences extracted from user reviews closely mirrored true user inclinations. Mender capitalizes on these well-curated preferences by filtering out extraneous noise and concentrating on relevant signals. Such synergy between user-provided text and historical data paves the way for expansions to related items, bringing fresh but contextually aligned suggestions into play.


For enterprises wishing to embed Mender within their data pipelines, the synergy of semantic embeddings and user instructions holds promise for interoperability: product reviews, social media mentions, or direct user queries can all feed into this model. Because Mender leverages a single encoder-decoder architecture, explainability and transparency may be more feasible, making it easier to justify recommendations to end users or to adapt for corporate objectives (like highlighting high-margin items).


E-commerce Innovations with Preference Discerning

The study evaluates four main datasets—three from Amazon (Beauty, Toys and Games, Sports and Outdoors) and the Steam platform. Action counts range from 167,597 for Toys and Games to nearly 600,000 on Steam, reflecting both the diversity and scale of the tested domains. Textual preferences are not invented in a vacuum: the authors draw real user reviews and refine them via large language models, weeding out repetitive references and random artifacts. This ensures that the preferences fed into Mender align with authentic consumer language.


Performance is judged using standard recommender metrics like Recall@5, Recall@10, and NDCG@10. The system’s consistency in capturing negative preferences—such as excluding a disliked brand from top results—proves especially impactful. Many existing models, if not specifically trained on negative data, will keep recommending items that the user has explicitly denounced. Preference discerning addresses this failing by baking negative signals into the generative routine. For instance, if an individual strongly opposes a certain brand, Mender ensures it is deprioritized or removed from top suggestions.
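For reference, these metrics are straightforward to implement. The sketch below computes them for a single user; averaging over users yields the dataset-level numbers reported in the paper.

```python
# Reference implementations of the reported metrics for a single user;
# averaging over users yields the dataset-level numbers.

import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / max(len(relevant), 1)

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(rank + 2)          # ranks are 0-indexed
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

# In next-item prediction there is one ground-truth item, so Recall@10
# reduces to "was the true next item anywhere in the top ten?".
print(recall_at_k(["a", "b", "c"], {"b"}, k=10))  # 1.0
print(ndcg_at_k(["a", "b", "c"], {"b"}, k=10))    # ~0.63
```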


Another highlight is how Mender processes multiple evolving preferences—some of which may clash. This so-called history consolidation can occur when a user accumulates many preferences over time but no longer needs all of them. While standard generative models might attempt to juggle all hints at once, Mender zeroes in on the ones that truly matter for the recommendation at hand. Hence it sustains a harmonious balance between reliability (remembering past signals) and flexibility (overriding them when outmoded).


From a business standpoint, this capacity to toggle seamlessly between continuity and controlled shifts means that an e-commerce platform could pilot new product clusters for a user without alienating them by ignoring old preferences. In practical terms, managers can direct the system to encourage or emphasize certain product lines, letting the model find a sweet spot between user satisfaction and company objectives.


Expanding the Potential of Generative Retrieval

The paper’s methods for textual preference integration open doors for a variety of industries. Whether in e-commerce, travel, healthcare, or media streaming, the ability to parse user preferences rapidly and accurately can enhance loyalty and reduce friction. If a user says “Only show me cruelty-free options” or “I’d like to avoid violent films,” a robust preference-discerning engine becomes indispensable for an engaging, trust-building experience.


From a technical vantage, merging large language models with item embeddings can be computationally complex, but the authors propose to release code and benchmarks that enable peer review and replication. This forward-looking approach should help the field measure Mender’s performance against emerging alternatives, ensuring that the underlying technology keeps pace with new breakthroughs.


It is also important to recognize that metrics like Recall@5 and Recall@10 only scratch the surface when it comes to user satisfaction. The immediacy of feedback, the interpretability of results, and the model’s capacity to respond to real-time prompts will become even more decisive in industries where user experience is paramount. As large language models continue to improve, more sophisticated textual commands—potentially covering style, ethical concerns, or budget constraints—will become routine in recommendation dialogues.


By spotlighting explicit preference conditioning, this study advances a vision of the user as a co-creator in the recommendation process. An enterprise can overlay its own guidelines (e.g., business intelligence targets, marketing priorities) without drowning out the user’s personal voice, provided the system is carefully balanced. Mender’s generative nature readily accommodates prompts that might arise from ephemeral online interactions or fast-evolving social-media trends—where user opinions change suddenly or must be integrated on the fly.


Concluding Reflections

Overall, the findings underscore how explicitly weaving user preferences into the generative engine can heighten recommendation quality and open new avenues for personalization. Mender and its associated benchmark handle text-based instructions with relative ease, aligning well with the ongoing shift toward large language models. In practical terms, it implies fewer bad recommendations, more potential to branch out into specialized product categories, and a user base that feels genuinely heard.


Although other generative retrieval systems are already experimenting with language-based constraints, this paper’s central innovation lies in clearly segregating the generation of user preferences from the actual conditioning phase. That means preferences can be created even in the absence of exhaustive user histories, making the system more amenable to brand-new users. In effect, the authors point to a future in which positive and negative sentiments expressed in plain language can shape the system’s behavior in real time.


For corporate decision-makers, adopting preference-discerning methods might be more than just another technical upgrade: it signals a strategic pivot toward user-driven experiences. By letting textual preferences guide the model’s next move, businesses effectively amplify the user’s voice. This fosters a climate of responsiveness and trust where the user’s personal needs and the organization’s goals can align more harmoniously. In so doing, Mender and generative retrieval herald a path toward adaptive recommendation engines that gracefully balance personalization, efficiency, and user agency.

