Voxify3D Logo

From Mesh to Voxel Art with Palette Discretization and Semantic Guidance

Abstract

Stylized voxel art is widely used in games and digital media, but turning 3D meshes into visually appealing voxel forms remains challenging and often requires manual effort. Existing methods struggle to preserve semantic structure and offer limited control over stylization, particularly over discrete color choices and the level of abstraction. We present Voxify3D, a differentiable two-stage framework for generating stylized voxel art from 3D meshes. In the first stage, we initialize a coarse voxel grid via neural volume rendering. In the second stage, we refine the grid under six-view orthographic pixel art supervision, guided by a discrete color palette derived from clustering strategies (e.g., k-means, Max-Min, Median Cut). To support differentiable palette-based quantization, we design a Gumbel-Softmax-based rendering mechanism and incorporate a CLIP-based perceptual loss that enforces semantic alignment between the voxel renderings and the original mesh.
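To make the palette step concrete, the sketch below shows one way to derive a discrete palette from rendered views with k-means. It is a minimal illustration, not the paper's implementation: the use of scikit-learn, the array shapes, and the default palette size of 8 are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_palette(views: np.ndarray, palette_size: int = 8) -> np.ndarray:
    """Cluster the RGB pixels of rendered views into a discrete color palette.

    views: float array of shape (N, H, W, 3) in [0, 1], e.g. the six
           orthographic pixel art renderings.
    Returns a (palette_size, 3) array of palette colors.
    """
    pixels = views.reshape(-1, 3)                  # flatten to a list of RGB samples
    km = KMeans(n_clusters=palette_size, n_init=10).fit(pixels)
    return km.cluster_centers_.astype(np.float32)  # cluster centers become palette entries
```

Max-Min or Median Cut palettes could be substituted by swapping out the clustering call; the downstream color selection only needs the final (palette_size, 3) table.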

Pipeline

Voxify3D Pipeline Overview

Our two-stage voxel art generation pipeline. (a) Coarse voxel grid training: Given a 3D mesh, we render multi-view images and optimize a voxel-based radiance field (DVGO [Sun et al. 2022]) with an MSE loss to learn coarse geometry and appearance. (b) Orthographic pixel art fine-tuning: We refine the voxel grid using six orthographic pixel art views, from which we also extract the discrete color palette (e.g., via k-means). Optimization combines appearance, depth, and alpha losses. (c) CLIP-guided optimization: A CLIP loss computed between rendered patches and mesh images encourages semantic alignment while keeping memory usage low. (d) Differentiable discrete color selection via Gumbel-Softmax: Each voxel stores palette logits, and Gumbel-Softmax sampling makes the discrete color choice differentiable, enabling end-to-end color optimization and yielding coherent, stylized voxel art.
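As a rough sketch of stage (d), the snippet below shows per-voxel discrete color selection using PyTorch's `gumbel_softmax`. The tensor layout, temperature, and straight-through (`hard=True`) setting are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def select_voxel_colors(logits: torch.Tensor, palette: torch.Tensor,
                        tau: float = 1.0, hard: bool = True) -> torch.Tensor:
    """Differentiably pick one palette color per voxel.

    logits:  (X, Y, Z, P) unnormalized palette scores stored per voxel.
    palette: (P, 3) fixed RGB palette extracted from the pixel art views.
    Returns: (X, Y, Z, 3) voxel colors. With hard=True each voxel snaps to a
             single palette entry in the forward pass, while gradients flow
             through the soft relaxation (straight-through estimator).
    """
    weights = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)  # (X, Y, Z, P), one-hot-like
    return weights @ palette                                        # mix palette rows -> (X, Y, Z, 3)
```

The resulting colors can then be fed to the volume renderer, so the palette assignment is optimized jointly with the appearance, depth, and alpha losses of stage (b).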

Result Gallery

Color Palette


Palette Size


Voxel Size

Interactable Meshes

[Side-by-side viewers comparing each input mesh (Input) with our stylized voxel art result (Ours).]

Evaluation

Quantitative results (CLIP-IQA).

Qualitative results (user study).
