Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Data summarization tasks are often modeled as $k$-clustering problems, where
the goal is to choose $k$ data points, called cluster centers, that best
represent the dataset by minimizing a clustering objective. A popular objective
is to minimize the maximum distance between any data point and its nearest
center, which is formalized as the $k$-center problem. While in some
applications all data points can be chosen as centers, in the general setting,
centers must be chosen from a predefined subset of points, referred to as
facilities or suppliers; this is known as the $k$-supplier problem. In this
work, we focus on fair data summarization modeled as the fair $k$-supplier
problem, where data consists of several groups, and a minimum number of centers
must be selected from each group while minimizing the $k$-supplier objective.
The groups can be disjoint or overlapping, leading to two distinct problem
variants, each with different computational complexity.
We present $3$-approximation algorithms for both variants, improving the
previously known factor of $5$. For disjoint groups, our algorithm runs in
polynomial time, while for overlapping groups, we present a fixed-parameter
tractable algorithm, where the exponential runtime depends only on the number
of groups and centers. We show that these approximation factors match the
theoretical lower bounds, assuming standard complexity theory conjectures.
Finally, using an open-source implementation, we demonstrate the scalability of
our algorithms on large synthetic datasets and assess the price of fairness on
real-world data, comparing solution quality with and without fairness
constraints. |
---|---|
DOI: | 10.48550/arxiv.2410.12913 |
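
The summary above defines the fair $k$-supplier problem only in words. As a minimal sketch of that definition, and not the authors' implementation, the following Python snippet evaluates the $k$-supplier objective (the maximum distance from any data point to its nearest chosen center) and checks the per-group minimum requirements. The function names `k_supplier_cost` and `satisfies_fairness`, the Euclidean metric, and the toy data are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def k_supplier_cost(clients, centers):
    """Maximum distance from any client to its nearest chosen center."""
    return max(min(dist(c, s) for s in centers) for c in clients)


def satisfies_fairness(centers, groups, requirements):
    """True if every group g contributes at least requirements[g] centers.

    groups maps a group name to the set of facilities in that group;
    the sets may overlap, which covers both problem variants.
    """
    chosen = set(centers)
    return all(len(chosen & groups[g]) >= requirements[g] for g in requirements)


# Hypothetical toy instance: three data points, three candidate facilities.
clients = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
groups = {"a": {(0.0, 1.0), (5.0, 0.0)}, "b": {(9.0, 0.0)}}
requirements = {"a": 1, "b": 1}

centers = [(0.0, 1.0), (9.0, 0.0)]  # one candidate solution with k = 2
cost = k_supplier_cost(clients, centers)
feasible = satisfies_fairness(centers, groups, requirements)

# A solution that ignores group "b" violates the fairness constraint.
unfair_centers = [(0.0, 1.0), (5.0, 0.0)]
unfair_feasible = satisfies_fairness(unfair_centers, groups, requirements)
```

This only scores and validates a given set of centers. It does not search for a solution, so it says nothing about the $3$-approximation algorithms described in the summary.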