Adversarial Item Promotion on Visually-Aware Recommender Systems by Guided Diffusion (2024)

Lijian Chen (The University of Queensland, Brisbane, QLD, Australia, uqlche22@uq.edu.au), Wei Yuan (The University of Queensland, Brisbane, QLD, Australia, w.yuan@uq.edu.au), Tong Chen (The University of Queensland, Brisbane, QLD, Australia, tong.chen@uq.edu.au), Guanhua Ye (Deep Neural Computing Company Limited, Shenzhen, China, rex.ye@dncc.tech), Quoc Viet Hung Nguyen (Griffith University, Gold Coast, QLD, Australia, henry.nguyen@griffith.edu.au), and Hongzhi Yin (The University of Queensland, Brisbane, QLD, Australia, db.hongzhi@gmail.com)


Abstract.

Visually-aware recommender systems have found widespread application in domains where visual elements significantly contribute to inferring users' potential preferences. While incorporating visual information promises to enhance recommendation accuracy and alleviate the cold-start problem, it may also introduce substantial security challenges. Existing works have shown that an item provider can manipulate item exposure rates to its advantage by constructing adversarial images. However, these works cannot reveal the real vulnerability of visually-aware recommender systems because (1) the generated adversarial images are markedly distorted, rendering them easily detectable by human observers; and (2) the effectiveness of these attacks is inconsistent and even negligible in some scenarios or datasets. To shed light on the real vulnerabilities of visually-aware recommender systems when confronted with adversarial images, this paper introduces a novel attack method, IPDGI (Item Promotion by Diffusion Generated Image). Specifically, IPDGI employs a guided diffusion model to generate adversarial samples designed to promote the exposure rates of target items (e.g., long-tail items). Because diffusion models accurately capture the distribution of benign images, the generated adversarial images have high fidelity to the originals, ensuring the stealthiness of IPDGI. To demonstrate the effectiveness of our proposed method, we conduct extensive experiments on two commonly used e-commerce recommendation datasets (Amazon Beauty and Amazon Baby) with several typical visually-aware recommender systems. The experimental results show that our attack method significantly improves both the performance of promoting long-tail (i.e., unpopular) items and the quality of the generated adversarial images.

visually-aware recommender system, image poisoning attack, diffusion model

copyright: acmlicensed; journalyear: 2024; doi: XXXXXXX.XXXXXXX; journal: TOIS; ccs: Information systems, Recommender systems

1. Introduction

With the exponential growth of data, recommender systems have become nearly indispensable across various industry sectors due to their ability to provide personalized suggestions (Li et al., 2021; Yin et al., 2015; Yuan et al., 2023a). Traditionally, recommender systems predict users' preferences by learning user and item latent features from extensive collaborative data, such as user-item interactions (Chen et al., 2018; Yin et al., 2024). While these traditional recommender systems have achieved notable success, they often fall short in certain domains where users' preferences and decisions are strongly influenced by visual factors, such as fashion, food, and micro-video recommendations (Cheng et al., 2023). Furthermore, given the persistent reality of data sparsity within collaborative datasets, traditional recommender systems struggle with cold-start problems (Schein et al., 2002; Zheng et al., 2023). To address these two challenges, researchers have incorporated item visual information to assist systems in making recommendations, giving rise to visually-aware recommender systems (He and McAuley, 2016; Kang et al., 2017; Tang et al., 2019).

While harnessing visual features offers numerous advantages, incorporating these features may also introduce vulnerabilities into visually-aware recommender systems. Numerous existing works (Long et al., 2022; Nguyen et al., 2022; Zeng et al., 2019) in the field of computer vision have demonstrated that, by constructing adversarial images, adversaries can disrupt even state-of-the-art deep neural network models. These adversarial images look like normal images but carry imperceptible perturbations that are carefully crafted according to specific objectives. As it is challenging to distinguish adversarial images from normal ones (Elsayed et al., 2018), such adversarial attacks pose serious threats to the application of computer vision models. In recommender systems, the number of items can exceed the millions (Yin et al., 2014), and item images are usually provided by external parties (e.g., item merchants) on social media and e-commerce platforms. This setting leaves a backdoor for untrusted image providers to upload poisoned images to achieve certain adversarial goals, such as promoting a target item's ranking for financial gain. In light of this, it is necessary to validate the threats of adversarial images in visually-aware recommender systems.

Existing attacks on visually-aware recommender systems can generally be categorized into two types: classifier-targeted attacks and ranker-targeted attacks. The classifier-targeted attack (Di Noia et al., 2020) aims to change the predicted item categories, which cannot directly change items' ranking. In contrast, the ranker-targeted attack uses adversarial samples to directly manipulate the top-K recommendation ranker. Liu et al. (Liu and Larson, 2021) present the first work to investigate how to deceive recommender systems via perturbed visual information. However, their approach places loose constraints on the scale of noise added to adversarial images, rendering their attacks impractical since the generated images are easily detectable by users. Figure 1 illustrates the adversarial images generated by their proposed AIP attack. Furthermore, the efficacy of these attacks tends to be highly unstable. Cohen et al. (Cohen et al., 2021) explore a black-box setting of adversarial image attack. Nevertheless, this approach heavily relies on the assumption that each target user has a surrogate user, a condition that may not hold in many practical scenarios, since an untrusted third party (e.g., an item merchant) has little chance of knowing a target user's complete interaction behavior. As a result, existing adversarial image attacks on visually-aware recommender systems cannot reveal the real threats, as they are ineffective under realistic conditions.

[Figure 1: Examples of adversarial images generated by the AIP attack and our IPDGI.]

The primary objective of this work is to develop an adversarial attack (Nguyen et al., 2024) that discloses the real vulnerability of visually-aware recommender systems, highlighting the security concerns of using images provided by third parties. To achieve this, the adversarial image attack should meet two requirements. First, the attack should be both effective and inconspicuous; that is, the generated adversarial image should closely resemble the original image while successfully misleading the recommender system. Second, the underlying assumptions of the attack methodology should align with real-world conditions and constraints.

Diffusion models have garnered remarkable success in the field of image generation (Dhariwal and Nichol, 2021). Drawing inspiration from their impressive capacity to model real data distributions, we exploit diffusion models to generate adversarial images. Since the generated adversarial images remain within the normal image distribution, these poisoned images are imperceptible, ensuring the stealthiness of our attack. Nevertheless, constructing a diffusion model-based attack tailored to visually-aware recommender systems presents at least two challenges. The first is how to keep the generated adversarial images consistent with the original images: due to the inherent randomness of the diffusion model, the content of images generated by a general diffusion model varies widely and can noticeably differ from the originals. The second, which is also non-trivial, is how to incorporate the adversarial goal into the general diffusion model, i.e., how to ensure the effectiveness of the adversarial images it generates.

In this paper, we propose a novel adversarial attack method, namely Item Promotion by Diffusion Generated Images (IPDGI). To address the generation randomness of the diffusion model, we introduce a conditional constraint into the reverse process. This constraint ensures that the generated adversarial images are as similar as possible to the original image in terms of both data distribution and visual appearance. To enhance the effectiveness of the adversarial images generated by the diffusion model in the context of visually-aware recommender systems, we devise a novel mechanism for perturbation generation, as shown in Figure 3. In essence, the underlying idea is to perturb the target item image so that its visual features align with those of popular items. The perturbation generation mechanism involves several key steps. Initially, a clustering model identifies the cluster to which the target item image belongs. We then select the image of the most popular item within that cluster as a reference image. Following this, we iteratively optimize the perturbation to align the feature vector of the target item image with that of the reference image. Crucially, because the reference image is chosen from the same cluster, it is semantically close to our target image, so a slight perturbation suffices to align the two feature vectors. Nevertheless, severe distortion might still arise if we directly applied the perturbation to the original image. To overcome this challenge, we integrate the optimized perturbation into the diffusion model's general Gaussian noise, as shown in Figure 2. By doing so, the corrupted image produced by the forward process of the diffusion model contains the perturbation. Later, during the reverse process, since the diffusion model was trained on normal images, it tends to denoise the image back to the normal/clean image domain while preserving the perturbation. Ultimately, the adversarial images generated by IPDGI are capable of deceiving the top-K ranker for item promotion (Zheng et al., 2024; Yuan et al., 2024) while maintaining high similarity to the original images, as shown in Figure 1.

To validate the effectiveness of our proposed IPDGI, we conduct extensive experiments on two widely used recommendation datasets with three visually-aware recommender systems. The experimental results demonstrate the effectiveness of our method, IPDGI, in promoting items across all experimental datasets and visually-aware recommender systems. Our method outperforms existing ranker-targeted attacks. Furthermore, the experimental results indicate that the side effects caused by IPDGI are minimal, i.e., the original performance of the recommender system undergoes no significant changes under the worst-case attack scenario. Lastly, the image quality of the adversarial images generated by IPDGI surpasses that of the baseline attack, showcasing a noticeable improvement.

To sum up, the main contributions of this paper are as follows:

  • This is the first work to employ the diffusion model to generate adversarial samples against visually-aware recommender systems for item promotion.

  • We reveal the real vulnerability of visually-aware recommender systems with respect to their utilization of visual features.

  • We evaluate our method, IPDGI, across three representative visually-aware recommender systems on two real-world datasets to assess the effectiveness and stealthiness of the attack.

The remaining sections of the paper are organized as follows: In Section 2, we review related work. Section 3 discusses the preliminaries of this research, encompassing the base visually-aware recommender systems and the adversarial approaches of IPDGI. Section 4 presents the technical details of our method, IPDGI. Section 5 provides comprehensive details on the experiments and results. Section 6 discusses potential defense methods. Finally, in Section 7, we draw conclusions from our work.

2. Related Work

2.1. Visually-Aware Recommender Systems

Visually-aware recommender systems are those that incorporate visual information into the recommendation ranking mechanism or the prediction of users' preferences. Before the deep learning era, most works that adopted visual information relied on image retrieval for recommendation tasks. Kalantidis et al. (Kalantidis et al., 2013) propose an approach that commences with the segmentation of a query image, followed by the retrieval of visually similar items within each of the predicted classes; it integrates semantic information derived from the images to enhance retrieval performance. Following this, Jagadeesh et al. (Jagadeesh et al., 2014) emphasize the pivotal role of semantic information in the retrieval process, highlighting its significance in refining and enhancing the overall efficiency of the retrieval procedure. In their work, they curated the extensive Fashion-136K dataset, enriched with detailed annotations, and introduced multiple retrieval-based methodologies to recommend matching items corresponding to a given query image.

With the advancements in Convolutional Neural Networks (CNNs) (He et al., 2016; Simonyan and Zisserman, 2014) and the development of deep learning-based recommender systems (He et al., 2017; Qiu et al., 2019, 2020; Qu et al., 2021), numerous studies have concentrated on more intricate modeling that integrates visual features into user-item interactions. IBR (McAuley et al., 2015) suggests complementary items by analyzing the styles inherent in the visual features of each item, taking into consideration human perceptions of similarity. Later, several works further incorporate visual information into Collaborative Filtering (CF)-based recommender models to exploit the latent factors of users and items along with visual features simultaneously. Examples include VBPR (He and McAuley, 2016) and Fashion DNA (Bracher et al., 2016). Notably, VBPR is the pioneer in integrating pre-extracted CNN visual features into CF-based recommender models, acknowledging the significance of visual information in scenarios such as fashion-related recommendations. Furthermore, the utilization of visual features helps address persistent issues in traditional recommender systems, such as data sparsity and cold starts, leading to performance improvements.

In contrast to VBPR, which directly uses pre-extracted CNN visual features, DVBPR (Kang et al., 2017) takes a different approach to handling image information. Specifically, Kang et al. (Kang et al., 2017) adopt an end-to-end framework for DVBPR that trains a CNN model on raw image input for visual feature extraction while simultaneously training the recommender model. ImRec (Neve and McConville, 2020) leverages reciprocal information between user groups through the use of image features. Chen et al. (Chen et al., 2017) propose ACF, which incorporates the attention mechanism into the CF model. It comprises item-level and component-level attention: the item-level attention identifies the most representative items that characterize individual users, while the component-level attention extracts the most informative features from multimedia auxiliary information for each user.

2.2. Adversarial Attack on Visually-Aware Recommender Systems

While visually-aware recommender systems indeed improve recommendation performance and alleviate the cold-start issue, they also introduce new threats. Tang et al. (Tang et al., 2019) reveal a security threat caused by malicious alterations to item images and further propose a robustness-focused visually-aware recommender model, AMR. Yin et al. (Yin et al., 2023) propose a framework capable of maintaining recommendation performance by denoising adversarial perturbations from attacked images (e.g., FGSM (Goodfellow et al., 2014), PGD (Kurakin et al., 2018)), as well as detecting adversarial attacks. However, this work is constrained to defending against untargeted attacks under the assumption of a white-box attack. Merra et al. (Merra et al., 2023) introduce a novel method called AiD, which removes adversarial perturbations from attacked images prior to their use in visually-aware recommender systems. To effectively remove perturbations from adversarial images, the AiD model requires a training dataset consisting of clean images and their corresponding noised images. Alternatively, the defender must know the specific attack method used to generate the adversarial images in order to produce the necessary noised images for training the AiD model.

To date, adversarial approaches for the visual data in the domain of visually-aware recommender systems can be categorized into two main types: classifier-targeted attacks and ranker-targeted attacks.

In a classifier-targeted attack, the objective is to modify the predictions of item categories without directly impacting the ranking of items. TAaMR (Di Noia et al., 2020) is a representative work in this category. Specifically, classifier-targeted attacks such as TAaMR generate adversarial images that deceive the image classifier, shifting predictions from the source class (e.g., bottles) to the target class (e.g., shoes) while maintaining visual consistency with the original image. However, such attacks have an inherent limitation arising from the obligatory use of class labels: adversarial images generated by classifier-targeted attacks can be ineffective when the source and target classes are the same. For instance, attacking a shoe image could be highly challenging if the target class is also shoes.

In a ranker-targeted attack, adversarial images are purposefully crafted to directly perturb the ranker of recommender systems, aiming to either promote or demote items. To the best of our knowledge, AIP (Liu and Larson, 2021) stands as the first and only work dedicated to ranker-targeted attacks within the context of visually-aware recommender systems. Specifically, AIP generates adversarial images by adding an optimized perturbation to the target item's image, aiming to reduce the distance between the visual vectors of the target item and popular items. However, the effectiveness of the AIP attack is constrained by its loose perturbation scale and its selection of popular images. Additionally, noticeable image distortion occurs, making the adversarial image easily recognizable by users.

Therefore, neither type of existing attack on the visual data of visually-aware recommender systems is able to expose the genuine vulnerability of these systems.

2.3. Diffusion Model

Diffusion models are well-known for their ability to generate high-quality data, particularly in the computer vision domain. Drawing inspiration from non-equilibrium thermodynamics (Sohl-Dickstein et al., 2015), the diffusion model operates with a different mechanism compared to other commonly used generative models such as GANs (Goodfellow et al., 2020) and VAEs (Kingma and Welling, 2013). Specifically, the diffusion model is composed of two Markov chains that represent the forward and reverse processes, respectively. In the forward process, noise randomly sampled from a Gaussian distribution is gradually added to the original data. In the reverse process, the model predicts and removes the noise to generate new data (Yang et al., 2023).

DDPM (Ho et al., 2020) represents a pioneering effort in adopting the diffusion model for generating high-quality images. However, due to the unpredictability of noise removal in the reverse process, the content of the generated image is random. To address this, Dhariwal et al. (Dhariwal and Nichol, 2021) propose a guided diffusion model that introduces a condition to constrain noise removal, thereby controlling the data distribution of the generated image. Beyond computer vision, several works (Li et al., 2023; Du et al., 2023; Wang et al., 2023) utilize the diffusion model in the context of recommendation. Yuan et al. (Yuan et al., 2023b) present the first work employing the diffusion model to enhance the security of federated recommender systems.

3. Preliminaries

In this section, we discuss the preliminaries of our research by introducing the base visually-aware recommender systems as well as the adversarial goals and prior knowledge employed by IPDGI.

3.1. Base Visually-Aware Recommender Systems

In broad terms, visually-aware recommender systems are those that incorporate visual features into preference predictions. We have selected three representative visually-aware recommender systems as base models to evaluate the effectiveness and imperceptibility of the IPDGI attack: VBPR (He and McAuley, 2016), DVBPR (Kang et al., 2017), and AMR (Tang et al., 2019), each distinguished by a unique mechanism.

3.1.1. VBPR

Visual Bayesian Personalized Ranking (VBPR) is a pioneering visually-aware recommender system that leverages images as auxiliary information for predicting users' preferences, specifically designed to alleviate the cold-start issue. It is a Bayesian Personalized Ranking (BPR)-based method (Rendle et al., 2012) extended to incorporate visual features into the latent features of users and items. More precisely, VBPR integrates items' visual features, pre-extracted by a Convolutional Neural Network (CNN) (He et al., 2016; Simonyan and Zisserman, 2014), with the latent (non-visual) features to form the item representation. The predictive model of VBPR can be articulated as follows:

(1) $y_{u,i} = \chi + \eta_u + \eta_i + \lambda_u^T\lambda_i + \phi_u^T\phi_i + \eta_v^T f_i$

(2) $\phi_i = \mathbf{E} f_i$

where $y_{u,i}$ is the predicted score that user $u$ gives to item $i$; $\chi$ is the global offset; $\eta_u$ and $\eta_i$ are the biases associated with user $u$ and item $i$, respectively; $\lambda_u$ and $\lambda_i$ are the latent features for user $u$ and item $i$, respectively; and $\phi_u$ and $\phi_i$ represent the visual features for user $u$ and item $i$, respectively. The visual feature for item $i$ is denoted as $f_i$, obtained through CNN extraction, and the visual bias is represented by $\eta_v^T f_i$. Due to its high dimensionality, the extracted image feature $f_i$ cannot be used directly as the visual feature $\phi_i$. Instead, He et al. (He and McAuley, 2016) proposed a learnable embedding $\mathbf{E}$ to transform the extracted feature $f_i$ from the CNN space into a lower-dimensional visual space, as described by Eq. 2. In addition to its efficacy in mitigating the cold-start issue, VBPR has demonstrated enhanced recommendation performance, along with notable transparency and interpretability in the recommendation process.
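For concreteness, the following is a minimal PyTorch-style sketch of the VBPR predictor in Eqs. 1-2; all tensor names and dimensions are our own illustrative choices, not the authors' implementation.

```python
import torch

def vbpr_score(chi, eta_u, eta_i, lam_u, lam_i, phi_u, eta_v, E, f_i):
    """Sketch of the VBPR predictor (Eqs. 1-2); shapes are illustrative.

    chi, eta_u, eta_i : scalar global offset and user/item biases
    lam_u, lam_i      : non-visual latent factors, shape (k,)
    phi_u             : user visual factors, shape (k2,)
    eta_v             : visual bias vector, shape (d,)
    E                 : learnable projection, shape (k2, d)
    f_i               : pre-extracted CNN feature, shape (d,), e.g. d = 4096
    """
    phi_i = E @ f_i                   # Eq. 2: project CNN feature into visual space
    return (chi + eta_u + eta_i
            + lam_u @ lam_i           # non-visual user-item interaction
            + phi_u @ phi_i           # visual user-item interaction
            + eta_v @ f_i)            # global visual bias term
```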

3.1.2. DVBPR

Deep Visual Bayesian Personalized Ranking (DVBPR) is a recommender system built upon the foundation of VBPR, specially tailored for fashion recommendation scenarios. DVBPR distinguishes itself from the original VBPR through its approach to leveraging item images. Following (Lei et al., 2016; Veit et al., 2015), DVBPR (Kang et al., 2017) employs a CNN model to extract visual features directly from item images rather than relying on pre-extracted CNN visual features. Specifically, DVBPR utilizes an end-to-end framework that concurrently employs a CNN model to extract visual features and a recommender model to learn user latent factors. Kang et al. (Kang et al., 2017) argue that discarding the item bias and non-visual latent factors is justified, as the remaining terms adequately capture implicit factors under the end-to-end approach to extracting visual features. Consequently, the preference predictor of DVBPR can be expressed as:

(3) $y_{u,i} = \chi + \eta_u + \phi_u^T \Psi(\mathbf{X}_i)$

where $\Psi(\cdot)$ denotes the CNN model and $\mathbf{X}_i$ is item $i$'s image. Similar to VBPR, the recommender model of DVBPR is also BPR-based, with the primary objective of optimizing rankings over triplets $(u,i,j) \in \mathcal{D}$. This can be defined as:

(4) $y_{u,i,j} = y_{u,i} - y_{u,j}$, where $\mathcal{D} = \{(u,i,j) \mid u \in \mathcal{U} \wedge i \in \mathcal{I}_u^+ \wedge j \in \mathcal{I} \backslash \mathcal{I}_u^+\}$

In Eq. 4, $\mathcal{U}$ and $\mathcal{I}$ represent the sets of users and items, respectively. An item $i \in \mathcal{I}_u^+$ is one that user $u$ has interacted with or expressed interest in, while $j \in \mathcal{I} \backslash \mathcal{I}_u^+$ is one that user $u$ has not. Moreover, following the BPR formulation, the global bias $\chi$ and user bias $\eta_u$ cancel between $y_{u,i}$ and $y_{u,j}$ and can be eliminated. Consequently, the DVBPR predictor (see Eq. 3) can be further simplified, yielding the final form of the preference predictor:

(5) $y_{u,i} = \phi_u^T \Psi(\mathbf{X}_i)$
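A minimal sketch of how DVBPR's preference predictor (Eq. 5) and BPR triplet objective (Eq. 4) might be wired together is given below; the tiny placeholder network stands in for the paper's jointly trained CNN $\Psi(\cdot)$, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVBPRScorer(nn.Module):
    """Sketch of the DVBPR predictor (Eq. 5): y_{u,i} = phi_u^T Psi(X_i).

    The tiny `cnn` below is only a placeholder for the jointly trained
    visual extractor Psi, not the actual architecture used by DVBPR.
    """
    def __init__(self, n_users, k=64):
        super().__init__()
        self.phi = nn.Embedding(n_users, k)        # user visual factors phi_u
        self.cnn = nn.Sequential(                  # placeholder Psi(.)
            nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, k),
        )

    def forward(self, u, img):
        return (self.phi(u) * self.cnn(img)).sum(dim=-1)   # Eq. 5

def bpr_triplet_loss(model, u, img_i, img_j):
    # Eq. 4: rank the observed item i above the unobserved item j for user u.
    return -F.logsigmoid(model(u, img_i) - model(u, img_j)).mean()
```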

3.1.3. AMR

Adversarial Multimedia Recommendation (AMR) is a visually-aware recommender system with a focus on robustness. It is built upon VBPR and utilizes the same preference predictor (see Eq. 1). Specifically, AMR integrates the VBPR recommender model with an adversarial training procedure to enhance model robustness. In this approach, adversarial perturbations are proactively introduced to the visual features of items during recommender model training, as defined by:

(6) $y_{u,i} = \chi + \eta_u + \eta_i + \lambda_u^T\lambda_i + \phi_u^T \mathbf{E}(f_i + \Delta_i) + \eta_v^T (f_i + \Delta_i)$

where $\Delta_i$ represents the adversarial perturbations optimized to exert the most significant influence on the recommender model, corresponding to the worst-case scenario. The optimization process for these adversarial perturbations is detailed in Eq. 7.

(7) $\Delta_i = \underset{\Delta}{\arg\max}\; L_{BPR}^{adv} = \underset{\Delta}{\arg\max} \sum_{(u,i,j)\in\mathcal{D}} -\ln\varsigma(y_{u,i}^{adv} - y_{u,j}^{adv})$, where $\lVert\Delta_i\rVert \leq \upsilon,\; i = 1,...,|\mathcal{I}|$; $\lVert\Delta_j\rVert \leq \upsilon,\; j = 1,...,|\mathcal{I}|$

Here, $\varsigma(\cdot)$ denotes the sigmoid function, $\lVert\cdot\rVert$ represents the $L_2$ norm, and $\upsilon$ is the magnitude restricting the perturbations. As the AMR method involves a minimax game, the perturbations $\Delta$ are learned to maximize the loss function of the recommender model, while simultaneously the model parameters $\Theta$ are learned to minimize both the loss function and the adversary's loss (see Eq. 8).

(8) $\Theta = \underset{\Theta}{\arg\min}\; L_{BPR} + \varphi L_{BPR}^{adv} = \underset{\Theta}{\arg\min} \sum_{(u,i,j)\in\mathcal{D}} -\ln\varsigma(y_{u,i} - y_{u,j}) - \varphi\ln\varsigma(y_{u,i}^{adv} - y_{u,j}^{adv}) + \tau\lVert\Theta\rVert^2$

where $\tau$ regulates the strength of the $L_2$ regularization on model parameters, and $\varphi$ is a hyper-parameter controlling the impact of the adversary on model optimization; the adversary has no impact if $\varphi$ is set to 0. This dual learning approach enhances the model's robustness to adversarial perturbations in multimedia content, resulting in a diminished impact on the model's predictions.
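As a rough sketch of AMR's minimax training (Eqs. 7-8), the snippet below approximates the inner maximization with a single normalized gradient step on the visual features, a common practical shortcut; `model(u, f)` is an assumed VBPR-style scorer on a visual feature `f`, and all names and hyper-parameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def amr_training_loss(model, u, f_i, f_j, upsilon=0.007, phi=1.0):
    """Sketch of one AMR objective evaluation (Eqs. 7-8).

    model(u, f) : assumed VBPR-style scorer on visual features f
    f_i, f_j    : visual features of a positive and a negative item
    upsilon     : L2 bound on the perturbation (Eq. 7 constraint)
    phi         : weight of the adversarial loss term (Eq. 8)
    """
    f_i = f_i.detach().requires_grad_(True)
    f_j = f_j.detach().requires_grad_(True)
    clean_loss = -F.logsigmoid(model(u, f_i) - model(u, f_j)).mean()

    # Inner maximization (Eq. 7), approximated by one normalized gradient
    # ascent step projected onto the L2 ball of radius upsilon.
    g_i, g_j = torch.autograd.grad(clean_loss, [f_i, f_j], retain_graph=True)
    delta_i = upsilon * g_i / (g_i.norm() + 1e-12)
    delta_j = upsilon * g_j / (g_j.norm() + 1e-12)

    # Outer minimization (Eq. 8): clean loss plus weighted adversarial loss;
    # the L2 term on Theta is typically handled by the optimizer's weight
    # decay. The caller backpropagates the result into the model parameters.
    adv_loss = -F.logsigmoid(model(u, f_i + delta_i)
                             - model(u, f_j + delta_j)).mean()
    return clean_loss + phi * adv_loss
```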

3.2. Adversarial Approaches for Visually-Aware Recommender Systems

Adversarial Goal. The primary objective of employing adversarial images in this paper is to promote target items within the top-K ranker of visually-aware recommender systems, i.e., to enhance the exposure rate of the target items. Additionally, the adversarial images should closely resemble the visual appearance of the original images, appearing natural to users so as to maintain stealthiness while preserving the effectiveness of the attack. Furthermore, the overall recommendation performance should not be significantly compromised when the recommender system is under attack, even in the worst-case scenario.

Adversarial Prior Knowledge. In this paper, we explore the vulnerabilities of visually-aware recommender systems within a real-world, practical context. Therefore, we assume that adversaries have minimal internal knowledge of the system. The only prior knowledge we attribute to adversaries is familiarity with the visual feature extraction model employed in the target visually-aware recommender system, denoted as $\Psi$.

4. Our Approach

In this section, we delve into the details of our approach, the Item Promotion by Diffusion Generated Images (IPDGI) attack. Figures 2 and 3 illustrate the overview and the perturbation generation process of IPDGI, respectively. Additionally, Algorithm 1 gives the pseudocode of the IPDGI attack.

[Figure 2: Overview of the IPDGI attack.]

4.1. Base Diffusion Model

Diffusion models such as DDPM (Ho et al., 2020) are probabilistic generative models consisting of two processes, the forward process and the reverse process, both of which can be represented as Markov chains. In the forward process, noise randomly sampled from a Gaussian distribution is added to the input image over $T$ steps, gradually transforming the original image into pure Gaussian noise. In the reverse process, the diffusion model is trained to iteratively reverse the noising of the forward process, recovering a clean image that shares the same data distribution as the input image.

4.1.1. Forward Process

The forward process, a.k.a. the diffusion process, is a Markov chain whose goal is to transform the data distribution of the input image into a Gaussian distribution by iteratively adding Gaussian noise. Formally, according to the chain rule of probability and the Markov property, the forward process generates the noisy samples $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T$ as follows:

(9) $q(\mathbf{x}_1,...,\mathbf{x}_T|\mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1})$

Here, $T$ denotes the number of diffusion steps, and the transition $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ turns the data distribution $q(\mathbf{x}_0)$ into a tractable prior distribution by gradually adding Gaussian noise. In DDPM, $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ has the following representation:

(10) $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t\mathbf{I}), \;\forall t \in \{1,...,T\}$

where $\mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t\mathbf{I})$ follows the general form $\mathcal{N}(\mathbf{x}_t; \mu, \sigma^2)$, indicating that $\mathbf{x}_t$ is generated by a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. $\beta_t$ represents the pre-scheduled noise level at step $t$; the schedule $\beta$ is commonly generated in a linear, cosine, or square-root manner. Following Ho et al. (Ho et al., 2020), we can simplify Eq. 10 by directly computing $\mathbf{x}_t$ conditioned on $\mathbf{x}_0$ at an arbitrary diffusion step with the following transformation:

(11) $q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$, where $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$

Then, with the reparameterization trick, $\mathbf{x}_t$ can be computed as follows:

(12) $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\cdot\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\cdot\delta$, where $\delta \sim \mathcal{N}(0,\mathbf{I})$
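The closed-form jump of Eq. 12 is what makes diffusion training efficient: any $\mathbf{x}_t$ can be sampled in one shot rather than by iterating Eq. 10. Below is a short sketch under a linear $\beta$ schedule; the schedule endpoints and function names are illustrative defaults, not values prescribed by the paper.

```python
import torch

def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    # A common linear noise schedule; endpoints are illustrative defaults.
    return torch.linspace(beta_1, beta_T, T)

def q_sample(x0, t, alpha_bar):
    """One-shot forward diffusion (Eq. 12): sample x_t directly from x_0.

    x0        : clean images, shape (B, C, H, W)
    t         : per-sample step indices, shape (B,)
    alpha_bar : cumulative products of alpha_t = 1 - beta_t (Eq. 11)
    """
    a = alpha_bar[t].view(-1, 1, 1, 1)
    delta = torch.randn_like(x0)                    # delta ~ N(0, I)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * delta  # Eq. 12
    return x_t, delta

betas = linear_beta_schedule()
alpha_bar = torch.cumprod(1.0 - betas, dim=0)       # alpha-bar_t (Eq. 11)
```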

4.1.2. Reverse Process

Unlike the forward process, which gradually corrupts the original image $\mathbf{x}_0$ into Gaussian noise according to a schedule, the reverse process is a trainable Markov chain that approximates $\mathbf{x}_0$ by predicting and removing noise from $\mathbf{x}_T$. Formally, the learnable reverse process from step $T$ to 0 can be defined as:

(13) $p_\theta(\hat{\mathbf{x}}_0,...,\hat{\mathbf{x}}_t,...,\hat{\mathbf{x}}_{T-1}|\mathbf{x}_T) = \prod_{t=1}^{T} p_\theta(\hat{\mathbf{x}}_{t-1}|\hat{\mathbf{x}}_t)$

where $\theta$ denotes the model parameters. The learnable reverse step $p_\theta(\hat{\mathbf{x}}_{t-1}|\hat{\mathbf{x}}_t)$ takes the diffused input $\hat{\mathbf{x}}_t$ and its corresponding time embedding $t$ to predict the mean $\mu_\theta(\hat{\mathbf{x}}_t, t)$ and variance $\Sigma_\theta(\hat{\mathbf{x}}_t, t)$, as shown in Eq. 14.

(14) $p_\theta(\hat{\mathbf{x}}_{t-1}|\hat{\mathbf{x}}_t) = \mathcal{N}(\hat{\mathbf{x}}_{t-1}; \mu_\theta(\hat{\mathbf{x}}_t,t), \Sigma_\theta(\hat{\mathbf{x}}_t,t))$

(15) $\mu_\theta(\hat{\mathbf{x}}_t,t) = \frac{1}{\sqrt{\alpha_t}}\left(\hat{\mathbf{x}}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, z_\theta(\hat{\mathbf{x}}_t,t)\right)$

In practice (Ho et al., 2020), the variance $\Sigma_\theta(\hat{\mathbf{x}}_t, t)$ is treated as a constant to reduce training complexity. As a result, the objective of the reverse process simplifies to reducing the distance between the real noise $\mathbf{z}_t$ and the predicted noise $z_\theta(\hat{\mathbf{x}}_t, t)$ at a random step $t$, as defined in Eq. 16:

(16) $\mathcal{L}_{simple} = \mathbb{E}_{t\sim[1,T]}\,\mathbb{E}_{\mathbf{x}_0\sim p(\mathbf{x}_0)}\,\mathbb{E}_{\mathbf{z}_t\sim\mathcal{N}(0,\mathbf{I})}\,\lVert \mathbf{z}_t - z_\theta(\hat{\mathbf{x}}_t,t)\rVert^2$

where $\mathbf{x}_0 \sim p(\mathbf{x}_0)$ is a normal image sampled from the training data.
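Putting Eqs. 12 and 16 together, a training iteration can be sketched as follows, reusing the `q_sample` helper from the forward-process sketch above; `eps_model` is an assumed noise-prediction network standing in for $z_\theta$.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alpha_bar, T=1000):
    """Simplified DDPM objective (Eq. 16), assuming eps_model(x_t, t)
    predicts the injected noise z_t (the paper's z_theta)."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # t ~ [1, T] (0-indexed)
    x_t, noise = q_sample(x0, t, alpha_bar)       # closed-form jump, Eq. 12
    return F.mse_loss(eps_model(x_t, t), noise)   # ||z_t - z_theta(x_t, t)||^2
```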

4.2. Guided Diffusion for Adversarial Item Promotion

While diffusion models (Ho et al., 2020) can produce high-quality synthetic images, they inherently introduce diversity in the generated outputs. In other words, the images generated by a diffusion model can be random and divergent from the original input image. In the context of our visual attack on recommender systems, this randomness must be avoided, as our adversarial goal is to promote items while maintaining high similarity between the content of adversarial images and their corresponding originals. Essentially, this randomness stems from the fact that sample generation in the reverse process occurs without conditional constraints. Inspired by guided diffusion (Dhariwal and Nichol, 2021) in the computer vision domain, which uses a classifier to guide the reverse process, we adopt a conditional constraint in the reverse process to guide diffusion sampling at each step $t$. Specifically, the reverse process in Eq. 14 is transformed into Eq. 17 with a condition:

(17) $p_\theta(\mathbf{x}_{\mathrm{adv}}^{t-1}|\mathbf{x}_{\mathrm{adv}}^{t}, \mathcal{C}) = p_\theta(\mathbf{x}_{\mathrm{adv}}^{t-1}|\mathbf{x}_{\mathrm{adv}}^{t})\, p(\mathcal{C})$

where $\mathcal{C}$ denotes the conditional constraint. Here, $p(\mathcal{C})$ is defined as $p(\mathbf{x}_0|\mathbf{x}_{\mathrm{adv}}^{t})$, since we aim to force the reverse steps to give more weight to the original input image. Intuitively, $p(\mathbf{x}_0|\mathbf{x}_{\mathrm{adv}}^{t})$ can be interpreted as "the possibility of recovering the original image $\mathbf{x}_0$ from the current reversed image $\mathbf{x}_{\mathrm{adv}}^{t}$". Then, following (Sohl-Dickstein et al., 2015; Dhariwal and Nichol, 2021), we approximate Eq. 17 as:

(18) $\log p_\theta(\mathbf{x}_{\mathrm{adv}}^{t-1}|\mathbf{x}_{\mathrm{adv}}^{t}, \mathcal{C}) \approx \log p_\theta(\mathbf{x}_{\mathrm{adv}}^{t-1}|\mathbf{x}_{\mathrm{adv}}^{t})\, p(\mathcal{C}) \approx \log p_\theta(\mathbf{x}_{\mathrm{adv}}^{t-1}|\mathbf{x}_{\mathrm{adv}}^{t})\, p(\mathbf{x}_0|\mathbf{x}_{\mathrm{adv}}^{t}) \approx \log(z)$

(19) $z \sim \mathcal{N}\big(\mu_\theta(\mathbf{x}_{\mathrm{adv}}^{t}, t) + \sigma_t^2 \nabla_{\mathbf{x}_{\mathrm{adv}}^{t}} \log p(\mathbf{x}_0|\mathbf{x}_{\mathrm{adv}}^{t}),\; \sigma_t^2\mathbf{I}\big)$

Further, we adopt the Mean Squared Error (MSE) loss as the condition:

(20)
$$p(\mathbf{x}_0 \mid \mathbf{x}_{\mathrm{adv}}^{t}) = \exp\!\left(\xi \lVert \mathbf{x}_0 - \mathbf{x}_{\mathrm{adv}}^{t} \rVert\right),$$

where $\xi$ denotes the guidance scale. This design effectively guides the reverse process toward generating images that are similar to the original images, particularly in terms of pixel values.

Finally, we define the reverse process of guided diffusion for adversarial samples as:

(21)
$$p_{\theta}(\mathbf{x}_{\mathrm{adv}}^{t-1} \mid \mathbf{x}_{\mathrm{adv}}^{t}, \mathcal{C}) = \mathcal{N}\!\left(\mathbf{x}_{\mathrm{adv}}^{t-1};\; \mu_{\theta}(\mathbf{x}_{\mathrm{adv}}^{t}, t) + \xi \cdot \sigma_t^{2}\, \nabla_{\mathbf{x}_{\mathrm{adv}}^{t}} \lVert \mathbf{x}_0 - \mathbf{x}_{\mathrm{adv}}^{t} \rVert,\; \sigma_t^{2}\mathbf{I}\right)$$
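To make Eq. 21 concrete, the following is a minimal PyTorch-style sketch of a single guided reverse step. The helper `model_mean_and_var`, standing in for the pre-trained diffusion model's $\mu_{\theta}(\mathbf{x}_{\mathrm{adv}}^{t}, t)$ and $\sigma_t^{2}$, is a hypothetical wrapper and not part of the original method description.

```python
import torch

def guided_reverse_step(x_adv_t, x_0, t, model_mean_and_var, xi):
    """One reverse step following Eq. 21: shift the model mean by the
    scaled gradient of the guidance distance, then sample."""
    x_adv_t = x_adv_t.detach().requires_grad_(True)
    # model_mean_and_var is a hypothetical wrapper returning
    # mu_theta(x_t, t) and sigma_t^2 from the pre-trained diffusion model.
    mu, sigma2 = model_mean_and_var(x_adv_t, t)
    # Gradient of ||x_0 - x_adv^t|| with respect to x_adv^t.
    dist = torch.linalg.vector_norm(x_0 - x_adv_t)
    grad = torch.autograd.grad(dist, x_adv_t)[0]
    shifted_mean = mu + xi * sigma2 * grad
    # Sample x_adv^{t-1} ~ N(shifted_mean, sigma_t^2 I).
    return (shifted_mean + (sigma2 ** 0.5) * torch.randn_like(x_adv_t)).detach()
```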

4.3. Perturbations for Adversarial Sample

We utilize perturbations to generate adversarial images for the target item, which is typically an unpopular item. These adversarial images are intended to boost the ranking of the target item within the top-K recommendations. The fundamental idea of our attack is to add human-imperceptible noise to the target item's image so as to shrink the distance between the target item's visual feature vector and those of popular items:

(22)
$$\underset{\varepsilon}{\arg\min}\; \lVert \Psi(\mathbf{x}_{\mathrm{ref}}) - \Psi(\mathbf{x}_{\mathrm{adv}}) \rVert_{2}, \quad \text{where } \mathbf{x}_{\mathrm{adv}} = \mathbf{x}_i + \varepsilon,\; \varepsilon \sim \mathcal{N}(0, \mathbf{I}),$$

where $\mathbf{x}_{\mathrm{ref}}$, $\mathbf{x}_i$, and $\varepsilon$ represent the reference image, the target image, and the perturbation, respectively, and $\Psi(\cdot)$ refers to the image feature extraction model (Simonyan and Zisserman, 2014; He et al., 2016) used in the visually-aware recommender system.

To optimize Eq. 22, the first step is to find an appropriate popular item's image to serve as the reference image. Following Liu et al. (Liu and Larson, 2021), the reference image is commonly chosen from the images of popular items, because the AIP attack operates by shifting the target item's image closer to the reference image through carefully designed perturbations. However, the selection of the reference image is crucial to the attack's effectiveness and stealthiness, since the objects depicted in different images vary. Specifically, if the reference image differs semantically from the target item's image, the resulting adversarial sample may become noticeably distorted, as a larger $\varepsilon$ is needed to align $\Psi(\mathbf{x}_{\mathrm{adv}})$ with $\Psi(\mathbf{x}_{\mathrm{ref}})$.

In this paper, the reference image is selected as follows. We first apply k-means clustering to the images in the dataset, categorizing them into clusters based on their visual features. We then choose, as the reference image, the image of the most popular item whose feature vector falls in the same cluster as the target item, thereby minimizing the semantic difference between them; a minimal sketch of this selection step follows. Figure 3 illustrates the perturbation generation process of IPDGI.
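As a concrete illustration, here is a minimal sketch of the reference-image selection, assuming `features` is an (N, d) array of visual features extracted by $\Psi$ and `popularity` holds per-item interaction counts; the number of clusters is a placeholder, since the paper does not specify it.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_reference(features, popularity, target_idx, n_clusters=10):
    """Pick the most popular item in the same visual cluster as the target.

    features:   (N, d) array of visual features from the extractor Psi
    popularity: (N,) array of per-item interaction counts
    target_idx: index of the (unpopular) target item
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    same_cluster = np.flatnonzero(labels == labels[target_idx])
    # The most popular item sharing the target's cluster serves as x_ref.
    return same_cluster[np.argmax(popularity[same_cluster])]
```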

[Figure 3. The perturbation generation process of IPDGI.]

4.4. Adversarial Samples Generation

In Section 4.3, we discussed how to create perturbations to promote target items. Although we reduce the perturbation scale by selecting the semantically nearest reference image, customers may still discern the adversarial images resulting from this visual attack, given that the perturbations are applied directly to regular images. To further improve the stealthiness of our attack, IPDGI incorporates the perturbation within the diffusion model. This choice is motivated by the proven efficacy of diffusion models in producing high-quality synthetic images that closely resemble real-world counterparts. The details of combining attack perturbations and diffusion models are as follows.

We first generate an adversarial perturbation $\varepsilon$ for the target item's image following Eq. 22. We then combine it with the Gaussian noise $\delta$ from the base diffusion model (see Section 4.1.1). As illustrated in the perturbation generator of IPDGI in Figure 3, the optimized adversarial perturbation is initially sampled from a Gaussian distribution, allowing it to fuse naturally with the Gaussian noise $\delta$, denoted as $\zeta := \varepsilon + \delta$ (see Figure 2). At this point, we have a perturbed Gaussian noise ready for the forward process of the guided diffusion model in IPDGI. The forward process is then defined as:

(23)
$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \zeta$$

Since the pre-trained diffusion model we use was trained on normal (i.e., uncorrupted and undistorted) images, the denoising in the reverse process tends to restore $\mathbf{x}_T$ (the perturbed Gaussian noise image produced by the forward process) to the domain of clean images. Specifically, the reverse process removes the noise that causes image distortion while preserving the perturbations that deceive the top-K ranker of the recommender system. The reverse process remains as defined in Eq. 21 in Section 4.2.
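A minimal sketch of the perturbed forward step (Eq. 23), where `alpha_bar_t` is the cumulative noise-schedule product $\bar{\alpha}_t$ taken from the pre-trained model's schedule:

```python
import torch

def perturbed_forward(x0: torch.Tensor, eps: torch.Tensor,
                      alpha_bar_t: float) -> torch.Tensor:
    """Eq. 23: closed-form forward diffusion of x_0, with the standard
    Gaussian noise replaced by the perturbed noise zeta = eps + delta."""
    delta = torch.randn_like(x0)   # ordinary diffusion noise
    zeta = eps + delta             # fuse in the adversarial perturbation
    return (alpha_bar_t ** 0.5) * x0 + ((1.0 - alpha_bar_t) ** 0.5) * zeta
```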

Algorithm 1. IPDGI adversarial image generation.

Input:
  $\mathbf{x}_0$: original image of the target item
  $\epsilon$: perturbation magnitude
  $e$: perturbation epochs
  $T$: diffusion steps
  $\xi$: guidance scale
  $\Psi$: visual feature extractor
  $\kappa$: k-means clustering model
Output:
  $\mathbf{x}_{\mathrm{adv}}$: adversarial image of the target item

1:  function GeneratePerturbation($\mathbf{x}_0$, $\epsilon$, $e$, $\Psi$, $\kappa$)
2:    $\varepsilon \leftarrow$ sample from $\mathcal{N}(0, \mathbf{I})$  ▷ initialise the perturbation from a Gaussian distribution
3:    $l_{\mathbf{x}_0} \leftarrow \kappa(\mathbf{x}_0)$  ▷ cluster label of the original image given by $\kappa$
4:    $\mathbf{x}_{\mathrm{ref}} \leftarrow$ image of the most popular item within cluster $l_{\mathbf{x}_0}$
5:    for $i \leftarrow 1$ to $e$ do
6:      $f_{\mathrm{ref}} \leftarrow \Psi(\mathbf{x}_{\mathrm{ref}})$  ▷ visual feature of the reference image
7:      $f_{x_i} \leftarrow \Psi(\mathbf{x}_{i-1} + \varepsilon)$  ▷ visual feature of the target image carrying $\varepsilon$ at epoch $i$
8:      $\varepsilon \leftarrow \underset{\varepsilon}{\arg\min}\, \lVert f_{\mathrm{ref}} - f_{x_i} \rVert_2$  ▷ Eq. 22
9:      $\varepsilon \leftarrow$ rescale $\varepsilon$ according to the perturbation magnitude $\epsilon$
10:   end for
11:   return $\varepsilon$
12: end function
13: $\varepsilon \leftarrow$ GeneratePerturbation($\mathbf{x}_0$, $\epsilon$, $e$, $\Psi$, $\kappa$)
14: $\delta \leftarrow$ sample from $\mathcal{N}(0, \mathbf{I})$
15: $\zeta \leftarrow \varepsilon + \delta$  ▷ perturbed Gaussian noise for the diffusion process
16: $\mathbf{x}_T \leftarrow$ diffuse $\mathbf{x}_0$ into complete Gaussian noise through the forward process  ▷ Eq. 23
17: for $t \leftarrow T$ to $1$ do
18:   $\bm{\mu}, \Sigma \leftarrow \mu_{\theta}(\mathbf{x}_{\mathrm{adv}}^{t}, t), \sigma_t^{2}$
19:   $\mathbf{x}_{\mathrm{adv}}^{t-1} \leftarrow$ sample from $\mathcal{N}(\bm{\mu} + \xi \cdot \Sigma\, \nabla_{\mathbf{x}_{\mathrm{adv}}^{t}} \lVert \mathbf{x}_0 - \mathbf{x}_{\mathrm{adv}}^{t} \rVert, \Sigma\mathbf{I})$  ▷ Eq. 21
20: end for
21: $\mathbf{x}_{\mathrm{adv}} \leftarrow \mathbf{x}_{\mathrm{adv}}^{0}$
22: return $\mathbf{x}_{\mathrm{adv}}$
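The inner optimization of Algorithm 1 (the GeneratePerturbation function) admits a short PyTorch sketch. The budget handling via clamping and the learning rate are assumptions; the paper only states that $\varepsilon$ is rescaled according to the magnitude $\epsilon$, and `psi` is assumed to be a differentiable feature extractor.

```python
import torch

def generate_perturbation(x0, x_ref, psi, epochs=30, eps_max=16.0, lr=1e-3):
    """Optimise eps so that Psi(x0 + eps) approaches Psi(x_ref) (Eq. 22),
    keeping eps within the magnitude budget eps_max after every step."""
    eps = torch.randn_like(x0).requires_grad_(True)  # init from N(0, I)
    optimiser = torch.optim.Adam([eps], lr=lr)
    f_ref = psi(x_ref).detach()                      # fixed reference feature
    for _ in range(epochs):
        loss = torch.linalg.vector_norm(psi(x0 + eps) - f_ref)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        with torch.no_grad():
            eps.clamp_(-eps_max, eps_max)            # respect the budget
    return eps.detach()
```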

5. Experiments

In this section, we conduct extensive experiments to answer the following research questions (RQs):

  • RQ1: How effective is the proposed IPDGI attack?

  • RQ2: How stealthy is the proposed IPDGI attack?

  • RQ3: What impact does IPDGI have on normal recommendation performance?

  • RQ4: How do the hyper-parameters influence the effectiveness and imperceptibility of IPDGI?

5.1. Experimental Settings

5.1.1. Dataset

We conduct experiments on two real-world recommendation datasets, Amazon Beauty and Amazon Baby, both derived from the Amazon website (McAuley et al., 2015). We chose these two datasets because visual signals are vital in influencing customers' final decisions in these domains; moreover, both datasets are of moderate size, making our experimental results easy to reproduce. For both datasets, we filter out users and items with fewer than ten interactions (i.e., 10-core filtering), following (Qu et al., 2023; Zhao et al., 2022). After filtering, the Amazon Beauty dataset consists of 8,787 users, 1,480 items, and 62,631 user-item interactions, while the Amazon Baby dataset includes 6,158 users, 1,009 items, and 44,335 user-item interactions. Each item has one corresponding image. Then, following the common setting for implicit-feedback recommender models (Yuan et al., 2023b; He et al., 2017; Zhang et al., 2021b), we binarize the user-item ratings by transforming all ratings in the dataset to $r_{ij} = 1$, and negative instances are sampled at a 1:4 ratio (He et al., 2017; Xia et al., 2022) to train the visually-aware recommender systems. Table 1 summarizes the statistics of the two datasets; a minimal sketch of this preprocessing follows the table.

Table 1. Statistics of the two datasets.

Dataset         User#   Item#   Interactions#   Sparsity
Amazon Beauty   8,787   1,480   62,631          99.52%
Amazon Baby     6,158   1,009   44,335          99.29%
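The following is a minimal pandas sketch of the 10-core filtering described above, assuming an interaction DataFrame with `user` and `item` columns; the 1:4 negative sampling is noted in a comment, since its exact implementation depends on the training loop.

```python
import pandas as pd

def ten_core_filter(df: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Iteratively drop users and items with fewer than k interactions
    until the interaction table stabilises (k-core filtering)."""
    while True:
        before = len(df)
        df = df[df.groupby("user")["item"].transform("size") >= k]
        df = df[df.groupby("item")["user"].transform("size") >= k]
        if len(df) == before:
            return df

# Implicit feedback: every remaining interaction becomes a positive
# (r_ij = 1); during training, negatives are sampled at a 1:4 ratio.
```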

5.1.2. Evaluation Protocol

We employ the standard leave-one-out protocol (Magnusson et al., 2019; He et al., 2017; Zhang et al., 2022) to construct the training and testing data for each user. Specifically, for each user we hold out the last interacted item as the test item, while the remaining interacted items are used for training. We additionally select the last interacted item in the training data for validation at each training epoch. To simulate a more realistic attack scenario, we choose a set of unpopular items as the target items to promote in the top-K recommendation. Following (He and McAuley, 2016; Kang et al., 2017; Tang et al., 2019), we first train the chosen base visually-aware recommender systems with uncorrupted images, then substitute the images of target items with the generated adversarial images to improve their rankings in the top-K recommendations. We analyze the performance of our attack from three perspectives: attack effectiveness, attack imperceptibility, and recommendation accuracy. We employ the Exposure Rate at Rank K (ER@K) (Zhang et al., 2022; Yuan et al., 2023a) and the Normalized Discounted Cumulative Gain at K (NDCG@K) to measure attack effectiveness and recommendation accuracy in the top-K recommendation, respectively. The impact of an attack can thus be measured by the difference in these metrics before and after the integration of adversarial images: a greater improvement in ER@K signifies a more effective attack, whereas a smaller decrease in NDCG@K indicates a more subtle impact on recommendation accuracy. When calculating ER@K and NDCG@K, we rank all items; therefore, our ER@K and NDCG@K values appear much lower than those calculated on a small set of randomly selected items (e.g., 100 randomly sampled negative items) (He and McAuley, 2016; Kang et al., 2017; Tang et al., 2019; Liu and Larson, 2021). In addition, we adopt the Fréchet Inception Distance (FID) (Heusel et al., 2017) to evaluate the quality of adversarial images; a smaller FID score indicates higher similarity between the adversarial and original images, and hence a more imperceptible attack. To quantify the improvement or difference between two values, such as "No Attack" and IPDGI, we use Eq. 24:

(24)
$$\mathrm{Improvement} = \frac{\omega' - \omega}{\omega} \times 100$$

Here, $\omega$ and $\omega'$ denote the original and the new/updated ER@K values, respectively. For example, in the "a vs. b" scenario in Table 2, we have $\omega \leftarrow \mathrm{a}$ and $\omega' \leftarrow \mathrm{b}$, indicating the change from value "a" to value "b".
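For instance, applying Eq. 24 to the ER@5 scores of VBPR on Amazon Beauty from Table 2 reproduces the reported "a vs. c" improvement:

```python
def improvement(omega: float, omega_new: float) -> float:
    """Eq. 24: relative change between two metric values, in percent."""
    return (omega_new - omega) / omega * 100

# ER@5 of VBPR on Amazon Beauty: No Attack (a) = 0.0095, IPDGI (c) = 0.0149.
print(round(improvement(0.0095, 0.0149), 2))  # 56.84
```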

5.1.3. Baselines

Since our attack method is ranker-targeted, we choose the same type of attack as our baseline: the Adversarial Item Promotion (AIP) attack (Liu and Larson, 2021). To the best of our knowledge, the AIP attack is the only existing ranker-targeted attack designed for visually-aware recommender systems. Like IPDGI, AIP adds noise to the target item's image to shrink its feature distance to popular items. However, it chooses the popular item used to optimize Eq. 22 without regard to semantic similarity, resulting in unstable attack performance and obvious image distortion. In addition, we include "No Attack" to show the original ranking of the target items.

5.2. Implementation Details

In this section, we provide the implementation details of our experiments. The experimental pipeline is as follows. First, we train the base visually-aware recommender systems, including VBPR (He and McAuley, 2016), DVBPR (Kang et al., 2017), and AMR (Tang et al., 2019), using uncorrupted item images. We then calculate the average ER@K score of the target items and the NDCG@K score of the testing data for the three base recommender systems. We choose unpopular items (with fewer than 20 interactions) as the target items to promote. We apply the adversarial attacks, including the baseline attack and our IPDGI, to generate corrupted images, which then substitute the regular images of the unpopular items for promotional purposes within the visually-aware recommender systems. The efficacy of the attacks is evaluated by comparing the changes in ER@K and NDCG@K scores and by examining the FID values.

5.2.1. Implementation of Visually-Aware Recommender Systems

All visually-aware recommender models are implemented in PyTorch (Paszke et al., 2019). The images used to train the base recommender systems are uncorrupted. The user and item embedding size for all three models is set to 100 (Liu and Larson, 2021). We use a pre-trained ResNet152 (He et al., 2016) as the image feature extractor $\Psi$ for all recommender models, yielding visual features of size 2048. Following (Kang et al., 2017; Tang et al., 2019; Liu and Larson, 2021), we train VBPR, DVBPR, and AMR for 2000, 50, and 2000 epochs, respectively. We optimize with Adam (Kingma and Ba, 2014) using a learning rate of 0.0001 and a weight decay of 0.001. After training, we choose the checkpoint with the lowest validation loss as the final model for each recommender system.
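As an illustration, a minimal torchvision sketch of such a feature extractor $\Psi$ (assuming torchvision ≥ 0.13 for the weights API; the standard ImageNet preprocessing pipeline is an assumption):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ResNet152 with the classification head removed yields the
# 2048-dimensional visual features used by all recommender models here.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def psi(pil_image):
    x = preprocess(pil_image).unsqueeze(0)   # (1, 3, 224, 224)
    return extractor(x).flatten(1)           # (1, 2048)
```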

5.2.2. Implementation of Attacks

For the baseline AIP attack, we implement and execute it with the same settings as its original paper (Liu and Larson, 2021). Specifically, the perturbation is trained for 5000 epochs per target image using Adam with a 0.001 learning rate, and the maximum perturbation size $\epsilon$ is set to 32.

Our proposed IPDGI attack is a novel ranker-targeted attack based on the diffusion model, designed to promote a target item within the top-K ranker of a visually-aware recommender system in a stealthy manner. In IPDGI, we employ the 256×256 unconditional diffusion model weights (https://openaipublic.blob.core.windows.net/diffusion/jul-2021/256x256_diffusion_uncond.pt) pre-trained by (Dhariwal and Nichol, 2021) on the ImageNet (Russakovsky et al., 2015) dataset.
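A hedged sketch of loading these weights with OpenAI's guided-diffusion codebase (https://github.com/openai/guided-diffusion); the constructor flags below follow that repository's documented 256×256 unconditional configuration and should be treated as assumptions rather than part of our method:

```python
import torch
from guided_diffusion.script_util import (
    model_and_diffusion_defaults, create_model_and_diffusion,
)

# Configure the 256x256 unconditional ImageNet model, then load the
# published checkpoint (256x256_diffusion_uncond.pt).
options = model_and_diffusion_defaults()
options.update(image_size=256, class_cond=False,
               attention_resolutions="32,16,8", num_channels=256,
               num_head_channels=64, num_res_blocks=2,
               resblock_updown=True, use_fp16=False,
               use_scale_shift_norm=True, diffusion_steps=1000,
               noise_schedule="linear", learn_sigma=True)
model, diffusion = create_model_and_diffusion(**options)
model.load_state_dict(torch.load("256x256_diffusion_uncond.pt",
                                 map_location="cpu"))
model.eval()
```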

Table 2. Attack effectiveness (ER@K) under (a) No Attack, (b) AIP, and (c) IPDGI, with relative improvements (higher is better).

Dataset        Model   K    (a) No Attack   (b) AIP   (c) IPDGI   a vs. b    a vs. c   b vs. c
Amazon Beauty  VBPR    5    0.0095          0.0091    0.0149      -4.21%     56.84%    63.74%
                       10   0.0270          0.0270    0.0273      0%         1.11%     1.11%
                       20   0.0670          0.0666    0.0674      -0.60%     0.60%     1.20%
               DVBPR   5    0.0153          0.0153    0.0171      0%         11.76%    11.76%
                       10   0.0313          0.0314    0.0336      0.32%      7.35%     7.01%
                       20   0.0639          0.0640    0.0667      0.16%      4.38%     4.22%
               AMR     5    0.0183          0.0173    0.0187      -5.46%     2.19%     8.09%
                       10   0.0312          0.0312    0.0317      0%         1.60%     1.60%
                       20   0.0651          0.0650    0.0657      -0.15%     0.92%     1.08%
Amazon Baby    VBPR    5    0.0180          0.0160    0.0187      -11.11%    3.89%     16.88%
                       10   0.0327          0.0320    0.0330      -2.14%     0.92%     3.13%
                       20   0.0677          0.0680    0.0697      0.44%      2.95%     2.50%
               DVBPR   5    0.0170          0.0168    0.0180      -1.18%     5.88%     7.14%
                       10   0.0338          0.0340    0.0348      0.59%      2.96%     2.35%
                       20   0.0661          0.0663    0.0696      0.30%      5.30%     4.98%
               AMR     5    0.0157          0.0153    0.0165      -2.55%     5.10%     7.84%
                       10   0.0317          0.0307    0.0333      -3.15%     5.05%     8.47%
                       20   0.0710          0.0717    0.0723      0.99%      1.83%     0.84%

5.3. The Attack Effectiveness of IPDGI (RQ1)

In this paper, we evaluate the efficacy of an attack based on the exposure rate (ER@K, where K ∈ {5, 10, 20}). We conduct experiments on two datasets, Amazon Beauty and Amazon Baby, with three visually-aware recommender systems (VBPR, DVBPR, and AMR). Table 2 presents a comparative analysis of the effectiveness of the attacks: the baseline AIP attack (column "b"), our proposed IPDGI attack (column "c"), and the original ranking performance, labeled "No Attack" (column "a"). We also report the relative improvements for "No Attack" vs. AIP (a vs. b), "No Attack" vs. IPDGI (a vs. c), and AIP vs. IPDGI (b vs. c).

First, we examine the effectiveness of the AIP attack in Table 2. The largest improvement achieved by AIP is 0.99%, observed in ER@20 for AMR on Amazon Baby compared to "No Attack". AIP also yields improvements for DVBPR on both datasets under the ER@10 and ER@20 metrics. However, under the ER@5 metric, the AIP attack fails to promote target items in any scenario; notably, there is an 11.11% decline (i.e., -11.11%) in ER@5 for VBPR on Amazon Baby compared to "No Attack". Moreover, for ER@10 with VBPR, ER@5 with DVBPR, and ER@10 with AMR on Amazon Beauty, the AIP attack achieves the same exposure rate as "No Attack". The failure of item promotion by AIP can be attributed to the ineffectiveness of the perturbations added to the original images: AIP consistently selects the most popular item's image as the reference for all target items, without using a technique such as a clustering model to account for semantic differences between images. As a result, the perturbations cause a large shift in the data distribution of the generated images relative to the originals, leading to image distortion and rendering the attack ineffective. These observations show that the AIP attack offers limited improvement, or is even ineffective, in certain item-promotion scenarios against the top-K ranker of visually-aware recommender systems.

Second, we assess the performance of our IPDGI attack. As shown in Table 2, IPDGI consistently outperforms both the baseline AIP attack and "No Attack", effectively promoting target items in all scenarios. For ER@5 with VBPR on Amazon Beauty, IPDGI achieves its largest gains: a 56.84% improvement over "No Attack" (a vs. c) and a 63.74% improvement over AIP (b vs. c). Even against the robustness-focused recommender system AMR, IPDGI achieves ER@5 improvements of 2.19% and 8.09% over "No Attack" and AIP, respectively, on Amazon Beauty; on Amazon Baby, the corresponding improvements are 5.10% and 7.84%.

Based on these observations, we posit that IPDGI is effective against the base visually-aware recommender systems and offers a significant improvement over the baseline attack method.

Table 3. Image quality (FID, lower is better) of adversarial images, and the relative improvement of IPDGI over AIP.

Dataset         Attack Method   FID ↓    Improvement
Amazon Beauty   AIP             114.55   87.56%
                IPDGI           14.25
Amazon Baby     AIP             265.48   91.51%
                IPDGI           22.54

5.4. The Attack Imperceptibility of IPDGI (RQ2)

In this paper, an attack's imperceptibility is measured by the consistency of the adversarial images with the original ones: an attack capable of generating high-fidelity images is less noticeable to customers. We use the FID metric to assess the similarity between the adversarial images generated by each attack (AIP and IPDGI) and the original images, as shown in Table 3. A lower FID score signifies a smaller difference between the real and generated images; an FID of 0 would indicate no difference at all. According to Table 3, IPDGI outperforms AIP on both datasets, achieving FID scores of 14.25 on Amazon Beauty and 22.54 on Amazon Baby. In contrast, the FID scores of the adversarial images generated by AIP are much higher, at 114.55 for Amazon Beauty and 265.48 for Amazon Baby, reflecting significantly lower image quality. Consequently, IPDGI improves image quality over AIP by 87.56% and 91.51% on Amazon Beauty and Amazon Baby, respectively. Figures 1 and 4 present examples of adversarial images generated by the different attacks: the AIP images exhibit severe distortions and a loss of finer details compared to the originals, whereas the IPDGI images maintain a high degree of similarity with them. A sketch of the FID computation follows.
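A minimal sketch of this FID evaluation using the third-party pytorch-fid package; the folder paths are placeholders:

```python
# pip install pytorch-fid
from pytorch_fid.fid_score import calculate_fid_given_paths

# Compare the folder of original item images against the folder of
# adversarial images produced by an attack (paths are placeholders).
fid = calculate_fid_given_paths(
    ["images/original", "images/adversarial"],
    batch_size=50, device="cpu", dims=2048,
)
print(f"FID: {fid:.2f}")  # lower means the two image sets are closer
```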

[Figure 4. Examples of adversarial images generated by the different attacks.]

5.5. The Attack Impact on Normal Recommendation Performance (RQ3)

In this experiment, we study the impact of the attack on normal recommendation performance (i.e., recommendation accuracy). To capture the side effects of attacks on the recommender system, we evaluate the change in NDCG@K on the testing data before and after the attacks; we use NDCG@K because it directly reflects item position changes in a ranking list. Table 4 reports the impact of the attacks on the visually-aware recommender systems. The values under "No Attack" indicate the average NDCG@K over all testing data when the recommender system is not corrupted, i.e., no adversarial images are used for the unpopular items. The columns "AIP" and "IPDGI" show the average NDCG@K after applying the attacks. Specifically, all target items (112 unpopular items on Amazon Beauty and 6 on Amazon Baby) are corrupted with adversarial images, representing the worst-case scenario. Overall, a smaller difference from the "No Attack" values (i.e., "a vs. b" and "a vs. c") implies that the attack has more subtle side effects.

As shown in Table 4, we evaluate the average NDCG@K (where K ∈ {5, 10, 20}) for the two attack methods (AIP and IPDGI) across three visually-aware recommender systems on two datasets. On Amazon Beauty, the IPDGI attack causes slight declines for all base recommender systems. This can be attributed to the successful promotion of the target items, which are long-tail items: the adversarial images expose the target items to a larger user audience, which inevitably lowers recommendation accuracy in the offline evaluation setting. It should be noted that recommendation performance may not be compromised in a real-world online evaluation setting. On Amazon Baby, both AIP and IPDGI show no change compared to the "No Attack" scenario. This is attributable to the small number of target items (only 6) in this dataset, a minor proportion relative to the total number of items; consequently, the overall performance of the recommender systems is unaffected by this limited amount of corrupted data.

By synthesizing the results from Tables3 and 4, we can infer that the adversarial images generated by the IPDGI attack are imperceptible in terms of image quality and have a minimal impact on the visually-aware recommender systems, even under the worst-case scenario.

Table 4. Recommendation accuracy (NDCG@K) under (a) No Attack, (b) AIP, and (c) IPDGI, with relative differences.

Dataset        Model   K    (a) No Attack   (b) AIP   (c) IPDGI   a vs. b   a vs. c
Amazon Beauty  VBPR    5    0.0306          0.0302    0.0297      -1.31%    -2.94%
                       10   0.0468          0.0464    0.0461      -0.85%    -1.50%
                       20   0.0715          0.0712    0.0711      -0.42%    -0.56%
               DVBPR   5    0.0312          0.0307    0.0303      -1.60%    -2.88%
                       10   0.0472          0.0465    0.0467      -1.48%    -1.06%
                       20   0.0721          0.0713    0.0716      -1.11%    -0.69%
               AMR     5    0.0311          0.0305    0.0299      -1.93%    -3.86%
                       10   0.0473          0.0468    0.0464      -1.06%    -1.90%
                       20   0.0721          0.0717    0.0714      -0.55%    -0.97%
Amazon Baby    VBPR    5    0.0323          0.0323    0.0323      0%        0%
                       10   0.0485          0.0485    0.0485      0%        0%
                       20   0.0735          0.0735    0.0735      0%        0%
               DVBPR   5    0.0325          0.0325    0.0325      0%        0%
                       10   0.0487          0.0487    0.0487      0%        0%
                       20   0.0737          0.0737    0.0737      0%        0%
               AMR     5    0.0334          0.0334    0.0334      0%        0%
                       10   0.0495          0.0495    0.0495      0%        0%
                       20   0.0743          0.0743    0.0743      0%        0%

5.6. The Effects of Hyper-Parameters on IPDGI (RQ4)

In this section, we examine how varying the hyper-parameters of IPDGI affects the generation of adversarial images. IPDGI has four key hyper-parameters: the maximum perturbation scale $\epsilon$, the perturbation optimization epochs $e$, the diffusion steps $T$, and the guidance scale $\xi$. Specifically, $\epsilon$ determines the strength of the perturbation used to generate adversarial images; $e$ is the number of iterations used to generate the perturbation; $T$ is the number of steps in the forward and reverse processes of the diffusion model; and $\xi$ controls the strength of guidance during the reverse process. When investigating a single hyper-parameter, the remaining three are held at their default values: 16 for $\epsilon$, 30 for $e$, and 15 for both $T$ and $\xi$. For each hyper-parameter, we test five candidate values.

Figure 5 illustrates how varying the hyper-parameter values affects the effectiveness and imperceptibility of IPDGI. Each sub-figure depicts the changes in exposure rate (ER@5, red triangles, left y-axis) and image quality (FID, blue dots, right y-axis) across the tested values of the corresponding hyper-parameter (x-axis).

Impact of $\epsilon$. As depicted in the top-left sub-figure of Figure 5, increasing $\epsilon$ from 16 to 256 raises the FID score of the adversarial images, indicating degraded image quality. This is expected, since a larger $\epsilon$ introduces stronger noise into the original images. Notably, however, increasing $\epsilon$ does not necessarily improve attack effectiveness: although the attack attains its highest ER@5 at $\epsilon = 256$, the ER@5 at $\epsilon = 32$ or $\epsilon = 64$ surpasses that at $\epsilon = 128$. A larger perturbation therefore does not consistently yield better attack performance.

Impact of $e$. As shown in the top-right sub-figure of Figure 5, the two highest ER@5 results are achieved at $e = 20$ and $e = 100$, and the two best FID scores are obtained at the same epochs. These findings suggest that a judicious choice of epochs yields well-optimized adversarial perturbations for the target image, achieving a high ER@5 score while minimizing the impact on image quality (i.e., a low FID score).

Impact of $T$. The number of diffusion steps significantly influences both ER@5 and FID. As illustrated in the bottom-left sub-figure of Figure 5, larger diffusion steps result in lower image quality, because longer diffusion processes increase the likelihood of the generated image deviating from the original. Regarding attack effectiveness, the two highest ER@5 scores are achieved at $T = 30$ and $T = 100$. Before reaching its peak at $T = 30$, ER@5 increases rapidly while image quality remains good (the FID score increases only slowly). Based on these trends, we contend that around 30 diffusion steps achieves a desirable ER@5 score while maintaining high image quality for the generated adversarial images.

Impact of $\xi$. The guidance scale $\xi$ regulates the strength of guidance during the reverse process; in this paper, the guidance is the MSE loss between the original and reversed images, so a larger $\xi$ makes the generated image more similar to the original. In the bottom-right sub-figure of Figure 5, the changes in ER@5 and FID exhibit highly similar trends: a higher guidance scale corresponds to a better ER@5 score, while the overall FID scores remain relatively low compared to those of the other hyper-parameters, indicating good image quality. We therefore argue that a relatively high guidance scale is essential for keeping the generated image close to the original and achieving better noise removal.

Cross-examining the changes in ER@5 and FID across the four hyper-parameters (perturbation epsilon $\epsilon$, perturbation epochs $e$, diffusion steps $T$, and guidance scale $\xi$), we conclude that image quality is a crucial implicit factor affecting not only the imperceptibility of IPDGI but also its effectiveness.

[Figure 5. The effects of the four hyper-parameters on ER@5 and FID.]

5.7. Ablation Study

To validate the significance and necessity of each component of IPDGI, we conduct an ablation study on three base visually-aware recommender systems using the Amazon Beauty dataset. The results are presented in Figure 6.

For each recommender system, we compare ER@5 scores under three settings: IPDGI, IPDGI w/o Clustering, and IPDGI w/o Attack. In the "IPDGI" setting, recommender systems are evaluated with adversarial images generated by the fully functional IPDGI. In the "IPDGI w/o Clustering" setting, the image clustering model is disabled and the reference image is simply the image of the most popular item (i.e., the item with the most interactions in the dataset). Finally, in the "IPDGI w/o Attack" setting, perturbations are not fused into the Gaussian noise before the forward diffusion process; in other words, we only employ the base guided diffusion model (see Section 4.2) to generate an image.

As depicted in Figure 6, the fully functional IPDGI achieves the best ER@5 scores for all visually-aware recommender systems, whereas the other two settings show a notable decline in ER@5. The drop observed in "IPDGI w/o Clustering" highlights the significance of reference image selection; moreover, the effect of IPDGI may fluctuate with the dissimilarities, or distances, between clusters in the clustering model: the greater the inter-cluster distance, the more effective IPDGI becomes. The comparison between "IPDGI" and "IPDGI w/o Attack" confirms the effectiveness of our perturbation generator. Interestingly, "IPDGI w/o Clustering" sometimes performs even worse than "IPDGI w/o Attack" (i.e., no perturbation at all). This may be because the most popular item's image can differ from some target images in semantic content, so using it as the reference image cannot guide the model toward the optimal perturbation.

[Figure 6. Ablation study results (ER@5) on the Amazon Beauty dataset.]

6. Discussion of Potential Defense Methods

Detecting adversarial images generated by IPDGI poses a significant challenge due to their high fidelity and imperceptibility to humans, particularly within visually-aware recommender systems. Unlike classifier-targeted attacks, IPDGI is a ranker-targeted attack, which further complicates defense efforts. Potential defenses primarily aim to diminish or eliminate the perturbations present in adversarial images and generally fall into two categories: image compression and image reconstruction.

In the field of computer vision, image compression is a widely discussed defense strategy against adversarial images (Xu et al., 2017; Guo et al., 2017; Dziugaite et al., 2016; Das et al., 2018; Jia et al., 2019). This method preprocesses (i.e., compresses) the input image to reduce adversarial perturbations before it is fed to the model. Importantly, image compression requires no retraining or modification of the model, making it a practical defense in real-world scenarios; a minimal sketch is given below. However, image compression cannot completely remove adversarial perturbations and may discard image information, and its effectiveness is limited against strong adversarial perturbations (Dai et al., 2020; Zhang et al., 2021a). Furthermore, in practice, compression would be applied to all images, potentially diminishing the utility of visual information in visually-aware recommender systems (Liu and Larson, 2021).
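For illustration, a minimal sketch of a JPEG-compression preprocessing defense using Pillow; the quality value is an assumption:

```python
import io
from PIL import Image

def jpeg_compress(image: Image.Image, quality: int = 75) -> Image.Image:
    """Re-encode an image as JPEG to attenuate adversarial perturbations.
    Needs no model retraining, but also degrades the visual detail of
    every input image, whether benign or adversarial."""
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)
```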

Similar to image compression, image reconstruction (Yuan et al., 2023b; Dai et al., 2020; Zhang et al., 2021a; Song et al., 2017; Yin et al., 2023; Merra et al., 2023) is a defense method that does not require retraining or modifying the model. This approach passes the input through an image reconstruction network, producing an image that appears identical to the original but with the adversarial perturbations removed. Compared with image compression, image reconstruction tends to be better at diminishing the perturbations in adversarial images while preserving more image information.

7. Conclusion and Future Work

In this paper, we propose a novel attack specifically designed for visually-aware recommender systems, namely Item Promotion by Diffusion Generated Image (IPDGI). It adopts the diffusion model as the core framework for generating adversarial images that promote item rankings within the top-K ranker of the recommender model. To ensure the effectiveness of the attack, we introduce an adversarial perturbation generator that produces optimized perturbations for the target item's image, effectively popularizing the item. To maintain imperceptibility, we impose a conditional constraint at every step of the reverse diffusion process to preserve visual consistency between the final adversarial image and the original. Extensive experiments on two real-world datasets with three visually-aware recommender systems demonstrate the effectiveness and imperceptibility of the proposed attack, and a hyper-parameter analysis and an ablation study provide additional insights. Having highlighted this security vulnerability of visually-aware recommender systems, in future research we plan to explore defense methods and propose a recommender system that is more robust against such visual threats.

Acknowledgements.

This work is supported by the Australian Research Council under the streams of Future Fellowship (Grant No. FT210100624), Discovery Early Career Researcher Award (Grants No. DE230101033 and No. DE200101465), Discovery Project (Grants No. DP240101108, and No. DP240101814), and Industrial Transformation Training Centre (Grant No. IC200100022).

References

  • Christian Bracher, Sebastian Heinz, and Roland Vollgraf. 2016. Fashion DNA: merging content and sales data for recommendation and article mapping. arXiv preprint arXiv:1609.02489 (2016).
  • Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 335–344.
  • Tong Chen, Hongzhi Yin, Hongxu Chen, Lin Wu, Hao Wang, Xiaofang Zhou, and Xue Li. 2018. TADA: trend alignment with dual-attention multi-task recurrent neural networks for sales prediction. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 49–58.
  • Yu Cheng, Yunzhu Pan, Jiaqi Zhang, Yongxin Ni, Aixin Sun, and Fajie Yuan. 2023. An Image Dataset for Benchmarking Recommender Systems with Raw Pixels. arXiv preprint arXiv:2309.06789 (2023).
  • Rami Cohen, Oren Sar Shalom, Dietmar Jannach, and Amihood Amir. 2021. A black-box attack model for visually-aware recommender systems. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 94–102.
  • Tao Dai, Yan Feng, Dongxian Wu, Bin Chen, Jian Lu, Yong Jiang, and Shu-Tao Xia. 2020. DIPDefend: Deep image prior driven defense against adversarial examples. In Proceedings of the 28th ACM International Conference on Multimedia. 1404–1412.
  • Nilaksh Das, Madhuri Shanbhogue, Shang-Tse Chen, Fred Hohman, Siwei Li, Li Chen, Michael E. Kounavis, and Duen Horng Chau. 2018. Shield: Fast, practical defense and vaccination for deep learning using JPEG compression. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 196–204.
  • Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34 (2021), 8780–8794.
  • Tommaso Di Noia, Daniele Malitesta, and Felice Antonio Merra. 2020. TAaMR: Targeted adversarial attack against multimedia recommender systems. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, 1–8.
  • Hanwen Du, Huanhuan Yuan, Zhen Huang, Pengpeng Zhao, and Xiaofang Zhou. 2023. Sequential Recommendation with Diffusion Models. arXiv preprint arXiv:2304.04541 (2023).
  • Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel M. Roy. 2016. A study of the effect of JPG compression on adversarial images. arXiv preprint arXiv:1608.00853 (2016).
  • Gamaleldin Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alexey Kurakin, Ian Goodfellow, and Jascha Sohl-Dickstein. 2018. Adversarial examples that fool both computer vision and time-limited humans. Advances in Neural Information Processing Systems 31 (2018).
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Communications of the ACM 63, 11 (2020), 139–144.
  • Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
  • Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117 (2017).
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  • Ruining He and Julian McAuley. 2016. VBPR: visual Bayesian personalized ranking from implicit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
  • Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
  • Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017).
  • Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
  • Vignesh Jagadeesh, Robinson Piramuthu, Anurag Bhardwaj, Wei Di, and Neel Sundaresan. 2014. Large scale visual recommendations from street fashion images. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1925–1934.
  • Xiaojun Jia, Xingxing Wei, Xiaochun Cao, and Hassan Foroosh. 2019. ComDefend: An efficient image compression model to defend adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6084–6092.
  • Yannis Kalantidis, Lyndon Kennedy, and Li-Jia Li. 2013. Getting the look: clothing recognition and segmentation for automatic product suggestions in everyday photos. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval. 105–112.
  • Wang-Cheng Kang, Chen Fang, Zhaowen Wang, and Julian McAuley. 2017. Visually-aware fashion recommendation and design with generative image models. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE, 207–216.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
  • Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. 2018. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security. Chapman and Hall/CRC, 99–112.
  • Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, and Houqiang Li. 2016. Comparative deep learning of hybrid representations for image recommendations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2545–2553.
  • Yang Li, Tong Chen, Peng-Fei Zhang, and Hongzhi Yin. 2021. Lightweight self-attentive sequential recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 967–977.
  • Zihao Li, Aixin Sun, and Chenliang Li. 2023. DiffuRec: A Diffusion Model for Sequential Recommendation. arXiv preprint arXiv:2304.00686 (2023).
  • Zhuoran Liu and Martha Larson. 2021. Adversarial item promotion: Vulnerabilities at the core of top-N recommenders that use images to address cold start. In Proceedings of the Web Conference 2021. 3590–3602.
  • Teng Long, Qi Gao, Lili Xu, and Zhangbing Zhou. 2022. A survey on adversarial attacks in computer vision: Taxonomy, visualization and future directions. Computers & Security (2022), 102847.
  • Måns Magnusson, Michael Andersen, Johan Jonasson, and Aki Vehtari. 2019. Bayesian leave-one-out cross-validation for large data. In International Conference on Machine Learning. PMLR, 4244–4253.
  • Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 43–52.
  • Felice Antonio Merra, Vito Walter Anelli, Tommaso Di Noia, Daniele Malitesta, and Alberto Carlo Maria Mancino. 2023. Denoise to Protect: A Method to Robustify Visual Recommenders from Adversaries. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1924–1928.
  • James Neve and Ryan McConville. 2020. ImRec: Learning reciprocal preferences using images. In Proceedings of the 14th ACM Conference on Recommender Systems. 170–179.
  • Thanh Tam Nguyen, Thanh Trung Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. 2022. A survey of machine unlearning. arXiv preprint arXiv:2209.02299 (2022).
  • Thanh Toan Nguyen, Quoc Viet Hung Nguyen, Thanh Tam Nguyen, Thanh Trung Huynh, Thanh Thi Nguyen, Matthias Weidlich, and Hongzhi Yin. 2024. Manipulating Recommender Systems: A Survey of Poisoning Attacks and Countermeasures. arXiv preprint arXiv:2404.14942 (2024).
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019).
  • Ruihong Qiu, Jingjing Li, Zi Huang, and Hongzhi Yin. 2019. Rethinking the item order in session-based recommendation with graph neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 579–588.
  • Ruihong Qiu, Hongzhi Yin, Zi Huang, and Tong Chen. 2020. GAG: Global attributed graph neural network for streaming session-based recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 669–678.
  • Liang Qu, Ningzhi Tang, Ruiqi Zheng, Quoc Viet Hung Nguyen, Zi Huang, Yuhui Shi, and Hongzhi Yin. 2023. Semi-decentralized Federated Ego Graph Learning for Recommendation. In Proceedings of the ACM Web Conference 2023. 339–348.
  • Liang Qu, Huaisheng Zhu, Ruiqi Zheng, Yuhui Shi, and Hongzhi Yin. 2021. ImGAGN: Imbalanced network embedding via generative adversarial graph networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1390–1398.
  • Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618 (2012).
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (2015), 211–252.
  • Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. 2002. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 253–260.
  • Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
  • Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. 2017. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766 (2017).
  • Jinhui Tang, Xiaoyu Du, Xiangnan He, Fajie Yuan, Qi Tian, and Tat-Seng Chua. 2019. Adversarial training towards robust multimedia recommender system. IEEE Transactions on Knowledge and Data Engineering 32, 5 (2019), 855–867.
  • Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala, and Serge Belongie. 2015. Learning visual clothing style with heterogeneous dyadic co-occurrences. In Proceedings of the IEEE International Conference on Computer Vision. 4642–4650.
  • Wenjie Wang, Yiyan Xu, Fuli Feng, Xinyu Lin, Xiangnan He, and Tat-Seng Chua. 2023. Diffusion Recommender Model. arXiv preprint arXiv:2304.04971 (2023).
  • Lianghao Xia, Chao Huang, Yong Xu, Jiashu Zhao, Dawei Yin, and Jimmy Huang. 2022. Hypergraph contrastive collaborative filtering. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 70–79.
  • Weilin Xu, David Evans, and Yanjun Qi. 2017. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155 (2017).
  • Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. 2023. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys 56, 4 (2023), 1–39.
  • Hongzhi Yin, Bin Cui, Zi Huang, Weiqing Wang, Xian Wu, and Xiaofang Zhou. 2015. Joint modeling of users' interests and mobility patterns for point-of-interest recommendation. In Proceedings of the 23rd ACM International Conference on Multimedia. 819–822.
  • Hongzhi Yin, Bin Cui, Yizhou Sun, Zhiting Hu, and Ling Chen. 2014. LCARS: A spatial item recommender system. ACM Transactions on Information Systems (TOIS) 32, 3 (2014), 1–37.
  • Hongzhi Yin, Liang Qu, Tong Chen, Wei Yuan, Ruiqi Zheng, Jing Long, Xin Xia, Yuhui Shi, and Chengqi Zhang. 2024. On-Device Recommender Systems: A Comprehensive Survey. arXiv preprint arXiv:2401.11441 (2024).
  • Minglei Yin, Bin Liu, Neil Zhenqiang Gong, and Xin Li. 2023. Securing Visually-Aware Recommender Systems: An Adversarial Image Reconstruction and Detection Framework. arXiv preprint arXiv:2306.07992 (2023).
  • Wei Yuan, Quoc Viet Hung Nguyen, Tieke He, Liang Chen, and Hongzhi Yin. 2023a. Manipulating Federated Recommender Systems: Poisoning with Synthetic Users and Its Countermeasures. arXiv preprint arXiv:2304.03054 (2023).
  • Wei Yuan, Chaoqun Yang, Liang Qu, Guanhua Ye, Quoc Viet Hung Nguyen, and Hongzhi Yin. 2024. Robust Federated Contrastive Recommender System against Model Poisoning Attack. arXiv preprint arXiv:2403.20107 (2024).
  • Wei Yuan, Shilong Yuan, Chaoqun Yang, Quoc Viet Hung Nguyen, and Hongzhi Yin. 2023b. Manipulating Visually-aware Federated Recommender Systems and Its Countermeasures. ACM Transactions on Information Systems (2023).
  • Xiaohui Zeng, Chenxi Liu, Yu-Siang Wang, Weichao Qiu, Lingxi Xie, Yu-Wing Tai, Chi-Keung Tang, and Alan L. Yuille. 2019. Adversarial attacks beyond the image space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4302–4311.
  • Shudong Zhang, Haichang Gao, and Qingxun Rao. 2021a. Defense against adversarial attacks by reconstructing images. IEEE Transactions on Image Processing 30 (2021), 6117–6129.
  • Shijie Zhang, Hongzhi Yin, Tong Chen, Zi Huang, Lizhen Cui, and Xiangliang Zhang. 2021b. Graph embedding for recommendation against attribute inference attacks. In Proceedings of the Web Conference 2021. 3002–3014.
  • Shijie Zhang, Hongzhi Yin, Tong Chen, Zi Huang, Quoc Viet Hung Nguyen, and Lizhen Cui. 2022. PipAttack: Poisoning federated recommender systems for manipulating item promotion. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1415–1423.
  • Wayne Xin Zhao, Zihan Lin, Zhichao Feng, Pengfei Wang, and Ji-Rong Wen. 2022. A revisiting study of appropriate offline evaluation for top-N recommendation algorithms. ACM Transactions on Information Systems 41, 2 (2022), 1–41.
  • Ruiqi Zheng, Liang Qu, Tong Chen, Kai Zheng, Yuhui Shi, and Hongzhi Yin. 2024. Poisoning Decentralized Collaborative Recommender System and Its Countermeasures. arXiv preprint arXiv:2404.01177 (2024).
  • Zhengetal. (2023)Ruiqi Zheng, Liang Qu,Bin Cui, Yuhui Shi, andHongzhi Yin. 2023.Automl for deep recommender systems: A survey.ACM Transactions on Information Systems41, 4 (2023),1–38.