
Domain-specific paper subscription

Scan the QR code on WeChat to follow {晓理紫} for daily paper updates. If you find this useful, please forward it to classmates who need it. Thank you for your support and suggestions.


Categories:

[晓理紫] Daily paper sharing (with abstracts and source code or project links)

== LLM ==

Title: Improving Domain Adaptation through Extended-Text Reading Comprehension

Authors: Ting Jiang, Shaohan Huang, Shengyue Luo

Abstract: To enhance the domain-specific capabilities of large language models, continued pre-training on a domain-specific corpus is a prevalent method. Recent work demonstrates that adapting models using reading comprehension data formatted by regex-based patterns can significantly improve performance on domain-specific tasks. However, regex-based patterns are incapable of parsing raw corpora using domain-specific knowledge. Furthermore, the question-and-answer pairs extracted directly from the corpus in predefined formats offer limited context. To address this limitation, we improve reading comprehension via LLM and clustering. The LLM focuses on leveraging domain knowledge within the corpus to refine the comprehension stage, while clustering supplies relevant knowledge by extending the context to enrich the reading stage. Additionally, our method incorporates parameter-efficient fine-tuning to improve the efficiency of domain adaptation. In comparison to AdaptLLM, our method achieves an improvement exceeding 5% on domain-specific tasks. Our code will be available at https://github.com/microsoft/LMOps.
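
To make the clustering idea concrete, here is a minimal sketch of grouping domain-corpus chunks by embedding similarity and appending same-cluster passages as extended context. The embedding model, cluster count, and chunking are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: cluster domain-corpus chunks so that a reading-comprehension
# example can be extended with related passages from the same cluster.
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer  # assumed embedding backbone

def build_extended_contexts(chunks, n_clusters=8, max_chars=4000):
    """Cluster corpus chunks and extend each chunk with passages from its cluster."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    embeddings = embedder.encode(chunks, normalize_embeddings=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

    extended = []
    for i, chunk in enumerate(chunks):
        peers = [c for j, c in enumerate(chunks) if labels[j] == labels[i] and j != i]
        # Concatenate the chunk with its cluster peers to form the extended reading context.
        extended.append((chunk + "\n\n" + "\n\n".join(peers))[:max_chars])
    return extended
```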

[Downlink:]http://arxiv.org/abs/2401.07284v2

[GitHub:]https://github.com/microsoft/LMOps


Title: AUTOACT: Automatic Agent Learning from Scratch via Self-Planning

Authors: Shuofei Qiao, Ningyu Zhang, Runnan Fang

Abstract: Language agents have achieved considerable performance on various complex tasks. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model to serve multiple functions. To this end, we introduce AutoAct, an automatic agent learning framework that does not rely on large-scale annotated data or synthetic trajectories from closed-source models (e.g., GPT-4). Given limited data and a tool library, AutoAct first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AutoAct leverages a division-of-labor strategy to automatically differentiate, based on the target task information and synthesized trajectories, into a sub-agent group that completes the task. We conduct comprehensive experiments with different LLMs, which demonstrate that AutoAct yields better or comparable performance compared to various strong baselines. We even notice that AutoAct, when using the Llama-2-13b model, can achieve performance comparable to that of the zero-shot GPT-3.5-Turbo agent. Code will be available at https://github.com/zjunlp/AutoAct.
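
A rough sketch of the division-of-labor idea as described: one base LLM differentiated into planner, tool-caller, and reflector sub-agents. The `generate` callable and role prompts are placeholders, not AutoAct's real interfaces.

```python
# Hypothetical sketch of a division-of-labor agent group (plan / act / reflect).
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubAgent:
    role: str
    prompt_prefix: str
    generate: Callable[[str], str]  # placeholder for an LLM call

def run_task(question: str, planner: SubAgent, tool_agent: SubAgent, reflector: SubAgent) -> str:
    plan = planner.generate(planner.prompt_prefix + question)                     # decompose the task
    observation = tool_agent.generate(tool_agent.prompt_prefix + plan)            # decide on and call tools
    answer = reflector.generate(reflector.prompt_prefix + question + "\n" + observation)  # finalize the answer
    return answer
```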

[Downlink:]http://arxiv.org/abs/2401.05268v2

[GitHub:]https://github.com/zjunlp/AutoAct


Title: CLadder: Assessing Causal Reasoning in Language Models

Authors: Zhijing Jin, Yuen Chen, Felix Leeb

Abstract: The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the “causal inference engine” postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.
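
To illustrate the kind of query CLadder poses, here is a toy structural causal model where the interventional quantity P(Y=1 | do(X=1)) is estimated by Monte Carlo and compared with the confounded observational probability. The graph and parameters are invented for illustration and are not drawn from the CLadder dataset.

```python
# Toy SCM with a confounder: Z -> X, Z -> Y, X -> Y.
import random

def sample(do_x=None):
    z = 1 if random.random() < 0.5 else 0
    # Without intervention, X depends on Z; under do(X=1) we force X regardless of Z.
    x = do_x if do_x is not None else (1 if random.random() < (0.8 if z else 0.2) else 0)
    p_y = 0.3 + 0.4 * x + 0.2 * z          # illustrative structural equation for Y
    return 1 if random.random() < p_y else 0

n = 100_000
interventional = sum(sample(do_x=1) for _ in range(n)) / n   # ~ 0.3 + 0.4 + 0.2*0.5 = 0.80
observational = sum(sample() for _ in range(n)) / n          # ~ 0.3 + 0.4*0.5 + 0.2*0.5 = 0.60
print(interventional, observational)                          # the gap reflects confounding by Z
```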

[Downlink:]http://arxiv.org/abs/2312.04350v3

[Project:]https://huggingface.co/datasets/causalNLP/cladder

[GitHub:]https://github.com/causalNLP/cladder


Title: MISS: A Generative Pretraining and Finetuning Approach for Med-VQA

Authors: Jiawei Chen, Dingkang Yang, Yue Jiang

Abstract: Medical visual question answering (VQA) is a challenging multimodal task in which Vision-Language Pre-training (VLP) models can effectively improve generalization performance. However, most methods in the medical field treat VQA as an answer-classification task, which is difficult to transfer to practical application scenarios. Additionally, due to the privacy of medical images and the expensive annotation process, large-scale medical image-text pair datasets for pretraining are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that extends the feature space of single-modal image datasets using large language models (LLMs), enabling traditional medical vision-task data to be applied to VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models. The code and model weights will be released upon the paper’s acceptance.
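
A hedged sketch of the Transfer-and-Caption idea: turning label-only medical images into image-text pairs by asking an LLM to write a caption from the label. The prompt and the `llm` callable are stand-ins, not the authors' implementation.

```python
# Hypothetical Transfer-and-Caption loop: label-only images -> (image, caption) pairs for VLP.
def llm_caption(label: str, llm) -> str:
    prompt = (f"Write a one-sentence medical-style caption for an image "
              f"whose classification label is '{label}'.")
    return llm(prompt)  # `llm` is any text-generation callable

def transfer_and_caption(dataset, llm):
    """dataset: iterable of (image, label); returns (image, caption) pairs usable for pretraining."""
    return [(image, llm_caption(label, llm)) for image, label in dataset]
```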

[Downlink:]http://arxiv.org/abs/2401.05163v2


Title: Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

Authors: Haoran Xu, Amr Sharaf, Yunmo Chen

Abstract: Moderate-sized large language models (LLMs) – those with 7B or 13B parameters – exhibit promising machine translation (MT) performance. However, even the top-performing 13B LLM-based translation models, like ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning (SFT) for LLMs in the MT task, emphasizing the quality issues present in the reference data despite being human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating adequate but not perfect translations. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on the WMT’21, WMT’22, and WMT’23 test datasets.
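
For intuition, a sketch of a CPO-style preference objective in PyTorch: a sigmoid preference term that favors the better translation plus a likelihood term on the preferred output. The weighting and the absence of a reference model follow my reading of the abstract and are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cpo_style_loss(logp_chosen: torch.Tensor,
                   logp_rejected: torch.Tensor,
                   beta: float = 0.1,
                   nll_weight: float = 1.0) -> torch.Tensor:
    """logp_*: summed token log-probs of preferred / dispreferred translations, shape (batch,)."""
    # Push the preferred translation's likelihood above the dispreferred one.
    preference = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
    # Keep the likelihood of the preferred outputs high (behavior-cloning style regularizer).
    nll = -logp_chosen.mean()
    return preference + nll_weight * nll
```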

[Downlink:]http://arxiv.org/abs/2401.08417v2


Title: Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering

Authors: Qing Li, Lei Li, Yu Li

Abstract: ChatGPT explores a strategic blueprint of question answering (QA) in delivering medical diagnosis, treatment recommendations, and other healthcare support. This is achieved through the increasing incorporation of medical domain data via natural language processing (NLP) and multimodal paradigms. By transitioning the distribution of text, images, videos, and other modalities from the general domain to the medical domain, these techniques have expedited the progress of medical domain question answering (MDQA). They bridge the gap between human natural language and sophisticated medical domain knowledge or expert manual annotations, handling large-scale, diverse, unbalanced, or even unlabeled data analysis scenarios in medical contexts. Central to our focus is the use of language models and multimodal paradigms for medical question answering, aiming to guide the research community in selecting appropriate mechanisms for their specific medical research requirements. Specialized tasks such as unimodal question answering, reading comprehension, reasoning, diagnosis, relation extraction, probability modeling, and others, as well as multimodal tasks like visual question answering, image captioning, cross-modal retrieval, report summarization, and generation, are discussed in detail. Each section delves into the intricate specifics of the respective methods under consideration. This paper highlights the structures and advancements of medical domain explorations against general domain methods, emphasizing their applications across different tasks and datasets. It also outlines current challenges and opportunities for future medical domain research, paving the way for continued innovation and application in this rapidly evolving field.

[Downlink:]http://arxiv.org/abs/2401.07510v2


== VLM ==

Title: UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Authors: Bowen Shi, Peisen Zhao, Zichen Wang

Abstract: Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding both vision and textual tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularity, encompassing image-level, region-level, and pixel-level captions/tags. Accordingly, we develop a unified multi-granularity learning framework, named UMG-CLIP, that simultaneously empowers the model with versatile perception abilities across different levels of detail. Equipped with parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP models and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We hope UMG-CLIP can serve as a valuable option for advancing vision-language foundation models.
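
A rough sketch of multi-granularity alignment: the same symmetric contrastive loss applied to image-level, region-level, and pixel-level embedding pairs and summed. How each granularity's embeddings are pooled is assumed here, not taken from UMG-CLIP's architecture.

```python
import torch
import torch.nn.functional as F
from typing import Dict, Tuple

def info_nce(visual: torch.Tensor, text: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over matched (visual, text) embedding pairs of shape (N, D)."""
    v = F.normalize(visual, dim=-1)
    t = F.normalize(text, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def multi_granularity_loss(pairs: Dict[str, Tuple[torch.Tensor, torch.Tensor]]) -> torch.Tensor:
    # pairs: {"image": (img_emb, cap_emb), "region": (...), "pixel": (...)}
    return sum(info_nce(v, t) for v, t in pairs.values())
```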

[Downlink:]http://arxiv.org/abs/2401.06397v2


Title: MISS: A Generative Pretraining and Finetuning Approach for Med-VQA

Authors: Jiawei Chen, Dingkang Yang, Yue Jiang

Abstract: Medical visual question answering (VQA) is a challenging multimodal task in which Vision-Language Pre-training (VLP) models can effectively improve generalization performance. However, most methods in the medical field treat VQA as an answer-classification task, which is difficult to transfer to practical application scenarios. Additionally, due to the privacy of medical images and the expensive annotation process, large-scale medical image-text pair datasets for pretraining are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that extends the feature space of single-modal image datasets using large language models (LLMs), enabling traditional medical vision-task data to be applied to VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models. The code and model weights will be released upon the paper’s acceptance.

[Downlink:]http://arxiv.org/abs/2401.05163v2


Title: MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance

Authors: Renjie Pi, Tianyang Han, Yueqi Xie

Abstract: The deployment of multimodal large language models (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. We delve into the novel challenge of defending MLLMs against such attacks. We discovered that images act as a “foreign language” that is not considered during alignment, which can make MLLMs prone to producing harmful responses. Unfortunately, unlike the discrete tokens considered in text-based LLMs, the continuous nature of image signals presents significant alignment challenges, making it difficult to thoroughly cover the possible scenarios. This vulnerability is exacerbated by the fact that open-source MLLMs are predominantly fine-tuned on limited image-text pairs, far fewer than the extensive text-based pretraining corpus, which makes MLLMs more prone to catastrophic forgetting of their original abilities during explicit alignment tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy combining a lightweight harm detector and a response detoxifier. The harm detector’s role is to identify potentially harmful outputs from the MLLM, while the detoxifier corrects these outputs to ensure the response complies with safety standards. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the model’s overall performance. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.
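
A minimal sketch of the plug-and-play pipeline: score the MLLM's draft response with a lightweight harm detector and rewrite it with the detoxifier only when the score crosses a threshold. Both components are shown as generic callables; their architectures are not specified by this sketch.

```python
from typing import Callable

def protected_generate(mllm: Callable[[str, bytes], str],
                       harm_detector: Callable[[str], float],
                       detoxifier: Callable[[str], str],
                       prompt: str,
                       image: bytes,
                       threshold: float = 0.5) -> str:
    """Plug-and-play guard: detect potentially harmful MLLM output and detoxify it."""
    draft = mllm(prompt, image)             # original multimodal response
    if harm_detector(draft) >= threshold:   # lightweight classifier score, assumed in [0, 1]
        return detoxifier(draft)            # rewrite so the response meets safety standards
    return draft
```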

[Downlink:]http://arxiv.org/abs/2401.02906v2


Title: Exploring Vulnerabilities of No-Reference Image Quality Assessment Models: A Query-Based Black-Box Method

Authors: Chenxi Yang, Yujia Liu, Dingquan Li

Abstract: No-Reference Image Quality Assessment (NR-IQA) aims to predict image quality scores consistent with human perception without relying on pristine reference images, serving as a crucial component in various visual tasks. Ensuring the robustness of NR-IQA methods is vital for reliable comparisons of different image processing techniques and consistent user experiences in recommendations. Attack methods for NR-IQA provide a powerful instrument to test this robustness. However, current attack methods heavily rely on the gradient of the NR-IQA model, leading to limitations when gradient information is unavailable. In this paper, we present a pioneering query-based black-box attack against NR-IQA methods. We propose the concept of a score boundary and leverage an adaptive iterative approach with multiple score boundaries. Meanwhile, the initial attack directions are also designed to leverage the characteristics of the Human Visual System (HVS). Experiments show that our method outperforms all compared state-of-the-art attack methods and is far ahead of previous black-box methods. The effective NR-IQA model DBCNN suffers a Spearman’s rank-order correlation coefficient (SROCC) decline of 0.6381 when attacked by our method, revealing the vulnerability of NR-IQA models to black-box attacks. The proposed attack method also provides a potent tool for further exploration of NR-IQA robustness.
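
A simplified sketch of a query-based, gradient-free attack loop: random perturbations within an L-infinity budget are kept only when they push the black-box quality score further from its original value. The paper's score-boundary schedule and HVS-guided initialization are not reproduced here.

```python
import numpy as np

def query_attack(score_fn, image: np.ndarray, eps: float = 8 / 255,
                 step: float = 1 / 255, n_queries: int = 1000) -> np.ndarray:
    """score_fn: black-box NR-IQA model returning a scalar quality score; image values in [0, 1]."""
    original_score = score_fn(image)
    adv = image.copy()
    best_gap = 0.0
    for _ in range(n_queries):
        direction = np.sign(np.random.randn(*image.shape))              # random sign perturbation
        candidate = np.clip(adv + step * direction, image - eps, image + eps)
        candidate = np.clip(candidate, 0.0, 1.0)
        gap = abs(score_fn(candidate) - original_score)                 # push the predicted score away
        if gap > best_gap:                                              # keep only improving queries
            best_gap, adv = gap, candidate
    return adv
```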

[Downlink:]http://arxiv.org/abs/2401.05217v2


== diffusion model ==

Title: Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping

Authors: Zijie Pan, Jiachen Lu, Xiatian Zhu

Abstract: High-resolution 3D object generation remains a challenging task, primarily due to the limited availability of comprehensive annotated training data. Recent advancements have aimed to overcome this constraint by harnessing image generative models, pretrained on extensive curated web datasets, using knowledge transfer techniques like Score Distillation Sampling (SDS). Efficiently addressing the requirements of high-resolution rendering often necessitates the adoption of latent-representation-based models, such as the Latent Diffusion Model (LDM). In this framework, a significant challenge arises: to compute gradients for individual image pixels, it is necessary to backpropagate gradients from the designated latent space through the frozen components of the image model, such as the VAE encoder used within the LDM. However, this gradient propagation pathway has never been optimized, remaining uncontrolled during training. We find that the unregulated gradients adversely affect the 3D model’s capacity to acquire texture-related information from the image generative model, leading to poor-quality appearance synthesis. To address this overarching challenge, we propose an innovative operation termed Pixel-wise Gradient Clipping (PGC), designed for seamless integration into existing 3D generative models, thereby enhancing their synthesis quality. Specifically, we control the magnitude of stochastic gradients by clipping the pixel-wise gradients efficiently, while preserving crucial texture-related gradient directions. Despite its simplicity and minimal extra cost, extensive experiments demonstrate the efficacy of PGC in enhancing the performance of existing 3D generative models for high-resolution object rendering.
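
A sketch of what pixel-wise gradient clipping could look like in PyTorch: rescale each pixel's gradient so its channel-wise norm stays below a threshold, bounding the magnitude while preserving the direction. The norm choice and threshold are assumptions about the operation described in the abstract, not the paper's exact formulation.

```python
import torch

def pixelwise_gradient_clip(grad: torch.Tensor, max_norm: float = 1.0) -> torch.Tensor:
    """grad: image-space gradient of shape (B, C, H, W); clip each pixel's C-dim gradient norm."""
    pixel_norm = grad.norm(dim=1, keepdim=True)                   # (B, 1, H, W)
    scale = (max_norm / (pixel_norm + 1e-12)).clamp(max=1.0)      # shrink only over-large pixels
    return grad * scale                                           # direction preserved, magnitude bounded

# Illustrative use inside an SDS-style loop: register a hook on the rendered image tensor, e.g.
# rendered.register_hook(lambda g: pixelwise_gradient_clip(g, max_norm=0.1))
```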

[Downlink:]http://arxiv.org/abs/2310.12474v4

[Project:]https://fudan-zvg.github.io/PGC-3D


Title: Hierarchical Fashion Design with Multi-stage Diffusion Models

Authors: Zhifeng Xie, Hao li, Huiming Ding

Abstract: Cross-modal fashion synthesis and editing offer intelligent support to fashion designers by enabling the automatic generation and local modification of design drafts. While current diffusion models demonstrate commendable stability and controllability in image synthesis, they still face significant challenges in generating fashion designs from abstract design elements and in fine-grained editing. Abstract sensory expressions, e.g., office, business, and party, form the high-level design concepts, while measurable aspects such as sleeve length, collar type, and pant length are considered the low-level attributes of clothing. Controlling and editing fashion images using lengthy text descriptions is difficult. In this paper, we propose HieraFashDiff, a novel fashion design method using a shared multi-stage diffusion model that encompasses high-level design concepts and low-level clothing attributes in a hierarchical structure. Specifically, we categorize the input text into different levels and feed them at different time steps to the diffusion model according to the criteria of professional clothing designers. HieraFashDiff allows designers to add low-level attributes after high-level prompts for incremental interactive editing. In addition, we design a differentiable loss function in the sampling process with a mask to preserve non-edited areas. Comprehensive experiments performed on our newly constructed hierarchical fashion dataset demonstrate that our proposed method outperforms other state-of-the-art competitors.
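
A sketch of the two mechanisms as I read them from the abstract: choose which prompt level conditions the denoiser according to the current timestep, and blend denoised and original latents through a mask so non-edited regions are preserved. The timestep split and the blending are illustrative, not the paper's exact schedule.

```python
import torch

def prompt_for_timestep(t: int, t_switch: int, high_level: str, low_level: str) -> str:
    # Early (large t): high-level design concept; late (small t): low-level garment attributes.
    return high_level if t >= t_switch else low_level

def keep_non_edit_regions(denoised: torch.Tensor, original: torch.Tensor,
                          edit_mask: torch.Tensor) -> torch.Tensor:
    """edit_mask: 1 where the design may change, 0 where the original latent must be kept."""
    return edit_mask * denoised + (1.0 - edit_mask) * original
```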

[Downlink:]http://arxiv.org/abs/2401.07450v2


Title: Adversarial Examples are Misaligned in Diffusion Model Manifolds

Authors: Peter Lorenz, Ricard Durall, Janis Keuper

Abstract: In recent years, diffusion models (DMs) have drawn significant attention for their success in approximating data distributions, yielding state-of-the-art generative results. Nevertheless, the versatility of these models extends beyond their generative capabilities to encompass various vision applications, such as image inpainting, segmentation, and adversarial robustness, among others. This study is dedicated to the investigation of adversarial attacks through the lens of diffusion models. However, our objective does not involve enhancing the adversarial robustness of image classifiers. Instead, our focus lies in utilizing the diffusion model to detect and analyze the anomalies introduced by these attacks on images. To that end, we systematically examine the alignment of the distributions of adversarial examples when subjected to the process of transformation using diffusion models. The efficacy of this approach is assessed across the CIFAR-10 and ImageNet datasets, including varying image sizes in the latter. The results demonstrate a notable capacity to discriminate effectively between benign and attacked images, providing compelling evidence that adversarial instances do not align with the learned manifold of the DMs.
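
A hedged sketch of how a diffusion model could expose adversarial inputs: noise the image, reconstruct it with the DM, and use the reconstruction error as a detection score, since off-manifold inputs tend to reconstruct worse. The `diffuse` and `denoise` callables are placeholders, and this is only one plausible reading of the transformation-based analysis, not the paper's exact procedure.

```python
import torch

def manifold_score(image: torch.Tensor, diffuse, denoise, t: int = 200) -> float:
    """Higher score = larger reconstruction error = less consistent with the DM's learned manifold.

    diffuse(x, t): adds t steps of forward noise; denoise(x_t, t): runs the reverse process back to an image.
    Both are placeholders for a pretrained diffusion model's forward/reverse operators.
    """
    with torch.no_grad():
        noisy = diffuse(image, t)
        reconstruction = denoise(noisy, t)
    return torch.mean((reconstruction - image) ** 2).item()

def is_adversarial(image, diffuse, denoise, threshold: float) -> bool:
    # Flag inputs whose score exceeds a threshold calibrated on benign images.
    return manifold_score(image, diffuse, denoise) > threshold
```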

[Downlink:]http://arxiv.org/abs/2401.06637v3


== Visual Navigation ==

Title: Exploring Vulnerabilities of No-Reference Image Quality Assessment Models: A Query-Based Black-Box Method

Authors: Chenxi Yang, Yujia Liu, Dingquan Li

Abstract: No-Reference Image Quality Assessment (NR-IQA) aims to predict image quality scores consistent with human perception without relying on pristine reference images, serving as a crucial component in various visual tasks. Ensuring the robustness of NR-IQA methods is vital for reliable comparisons of different image processing techniques and consistent user experiences in recommendations. Attack methods for NR-IQA provide a powerful instrument to test this robustness. However, current attack methods heavily rely on the gradient of the NR-IQA model, leading to limitations when gradient information is unavailable. In this paper, we present a pioneering query-based black-box attack against NR-IQA methods. We propose the concept of a score boundary and leverage an adaptive iterative approach with multiple score boundaries. Meanwhile, the initial attack directions are also designed to leverage the characteristics of the Human Visual System (HVS). Experiments show that our method outperforms all compared state-of-the-art attack methods and is far ahead of previous black-box methods. The effective NR-IQA model DBCNN suffers a Spearman’s rank-order correlation coefficient (SROCC) decline of 0.6381 when attacked by our method, revealing the vulnerability of NR-IQA models to black-box attacks. The proposed attack method also provides a potent tool for further exploration of NR-IQA robustness.

[Downlink:]http://arxiv.org/abs/2401.05217v2


Domain-specific paper subscription

Scan the QR code on WeChat to follow {晓理紫} for daily paper updates. If you find this useful, please forward it to classmates who need it. Thank you for your support and suggestions.

Original post: https://blog.csdn.net/u011573853/article/details/135735873

