My research is mainly in the computer vision domain. Below, I present illustrations of some of my research activities.

Object and face tracking


PixelTrack is a very fast adaptive algorithm for tracking generic deformable objects online.

Paper: ICCV2013

Find more information on the PixelTrack web page.

Visual Focus Of Attention estimation

We developed a method to learn the Visual Focus Of Attention (VFOA) of people in video-conferencing or meeting-room scenarios online, without supervision, and in real time. The method uses a specific type of online clustering (similar to sequential k-means) to group face-patch appearances coming from a face tracking algorithm. By learning these appearance clusters from low-level features, we can estimate the VFOA directly, without the intermediate step of head pose estimation. Further, the method needs no prior knowledge about the room configuration or the persons in the room. Our experimental results on 110 minutes of annotated data from different sources show that this approach can outperform a classical supervised method based on head orientation estimation and Gaussian Mixture Models.
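The online clustering step can be sketched as follows. This is a minimal, generic sequential-k-means-style sketch, not our actual implementation; the feature vectors, the distance threshold, and the cluster limit `k` are illustrative assumptions.

```python
import numpy as np

def sequential_kmeans(stream, k, dist_threshold=2.0):
    """Cluster feature vectors one at a time (sequential k-means style).

    Each incoming vector is assigned to the nearest existing centre,
    which is then moved towards it as a running mean; a new cluster is
    opened when the nearest centre is farther than dist_threshold
    (up to at most k clusters).
    """
    centres, counts, labels = [], [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if centres:
            d = [np.linalg.norm(x - c) for c in centres]
            j = int(np.argmin(d))
        if not centres or (d[j] > dist_threshold and len(centres) < k):
            centres.append(x.copy())        # open a new appearance cluster
            counts.append(1)
            labels.append(len(centres) - 1)
        else:
            counts[j] += 1
            centres[j] += (x - centres[j]) / counts[j]   # running-mean update
            labels.append(j)
    return np.array(centres), labels
```

In the real system, each cluster of face-patch appearances would then be associated with a gaze target, avoiding explicit head pose estimation.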

Papers: AVSS2013

Evaluation data: TA2 database (full-HD resolution reduced to 960×540), IHPD, PETS 2003

Annotation: VFOA annotation. Please cite our paper (AVSS2013) if you use our data or annotation (bibtex).

Online multi-modal speaker detection and tracking

In the TA2 project, we developed a real-time system for multi-modal speaker detection and tracking in a living-room environment where the camera and microphone array are not colocated. For a varying number of persons, the system detects the positions of faces, where each person is looking (i.e. the visual focus of attention), when someone is speaking, and who is speaking, and it maps the speaker IDs to the respective visually tracked faces. Moreover, a keyword spotting algorithm has been integrated. The purpose of this processing is to automatically cut to interesting portions of the video picture (i.e. to automatically edit the video) in real time, based on these audio-visual cues.
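The mapping from audio cues to tracked faces can be illustrated with a deliberately simplified sketch. It assumes a calibrated linear mapping from sound azimuth to horizontal image position, which the real system cannot assume (camera and microphone array are not colocated); the function name and parameters are hypothetical.

```python
import numpy as np

def associate_speaker(azimuth_deg, face_centres_x, image_width, fov_deg=60.0):
    """Map an audio direction of arrival to the nearest tracked face.

    Assumes azimuth 0 points at the image centre and that the camera
    field of view maps linearly to pixel columns; returns the index of
    the face whose horizontal position is closest to the sound source.
    """
    x_audio = (azimuth_deg / fov_deg + 0.5) * image_width
    return int(np.argmin([abs(x_audio - fx) for fx in face_centres_x]))
```

For example, with three faces centred at x = 100, 330, and 600 in a 640-pixel-wide image, a source at azimuth 0° is associated with the middle face.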

Demo: Controlled environment (9.4MB)

Papers: ICME2011, MMM2012, AM2013

Evaluation data: TA2 database

Online multiple face tracking

Also in the context of the TA2 project, we proposed an efficient online multi-face tracker based on a particle filter with MCMC sampling. It deals reliably with false and missing face detections over longer periods of time. The proposed method is based on a probabilistic framework that takes longer-term observations into account, rather than reasoning only on a frame-by-frame basis, as is commonly done.
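The MCMC sampling idea can be illustrated with a toy single-chain Metropolis sampler over the joint state of several targets. This is only a sketch of the sampling principle, not our tracker: the observation likelihood is assumed given, and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcmc_tracking_step(state, likelihood, n_samples=500, burn_in=100, step=1.0):
    """Single-chain Metropolis sampling over a joint multi-target state.

    state: (n_targets, 2) array of 2-D positions. At each iteration one
    target is picked at random, a Gaussian move is proposed for it, and
    the move is accepted with the Metropolis ratio of the observation
    likelihoods. The mean of the kept samples estimates the posterior.
    """
    current = state.astype(float).copy()
    p_cur = likelihood(current)
    samples = []
    for i in range(n_samples + burn_in):
        t = rng.integers(len(current))          # move one target at a time
        proposal = current.copy()
        proposal[t] += rng.normal(0.0, step, size=2)
        p_prop = likelihood(proposal)
        if rng.random() < min(1.0, p_prop / p_cur):
            current, p_cur = proposal, p_prop   # accept the move
        if i >= burn_in:
            samples.append(current.copy())
    return np.mean(samples, axis=0)
```

Moving one target at a time keeps the proposal local, which is the main efficiency argument for MCMC sampling over plain joint importance sampling when the number of targets grows.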

Demos: TA2 video (12MB)

Papers: FG2011, TIP2013

Evaluation data: TA2 database

Effective multi-cue object tracking

Fusing multiple cues (such as colour and texture) in a tracking framework is not a trivial task. We developed an efficient sampling technique, called Dynamic Partitioned Sampling (DPS), that fuses several cues in a principled way, using a dynamic estimate of the reliability of each cue at each point in time. We evaluated this method in a face tracking application on several videos.
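The idea of weighting cues by a dynamically estimated reliability can be sketched as follows. The weighted geometric-mean fusion and the exponential reliability update are illustrative choices, not the exact DPS formulation.

```python
import numpy as np

def fuse_cues(cue_scores, reliabilities):
    """Combine per-cue likelihood scores with reliability weights,
    here as a weighted geometric mean (weights normalised to sum to 1)."""
    w = np.asarray(reliabilities, dtype=float)
    w = w / w.sum()
    return float(np.prod(np.asarray(cue_scores, dtype=float) ** w))

def update_reliabilities(reliabilities, cue_errors, rate=0.3):
    """Move each cue's reliability towards exp(-error): cues that
    currently track the target well gain weight, drifting cues lose it."""
    target = np.exp(-np.asarray(cue_errors, dtype=float))
    return (1 - rate) * np.asarray(reliabilities, dtype=float) + rate * target
```

A cue whose recent tracking error is low (e.g. colour while the face is frontal) thus dominates the fused likelihood, while a failing cue (e.g. texture during occlusion) is progressively down-weighted.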

In the following example, we use colour (blue rectangle), texture (red rectangle), and shape (green ellipse) cues to track a head. During the video, the estimated reliability of the different cues is dynamically adapted, e.g. when the girl turns around or when her head gets occluded.

Demos: girl (1.8MB) (left: no dynamic fusion, middle: partitioned sampling (fixed order), right: Dynamic Partitioned Sampling)
twopeople (0.5MB)

Papers: BMVC2009

Face image processing

We presented several novel approaches to facial image processing based on convolutional neural networks (CNN). After applying the Convolutional Face Finder (CFF) (Garcia and Delakis, PAMI 2004) to grey-scale images, we obtain face images that we further process with different models.

Face alignment

Our Convolutional Face Aligner (CFA) iteratively estimates and applies geometric transformations (translation, rotation, and zoom) to precisely find the correct face bounding box.
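The iterative refinement loop can be sketched as follows, with the trained network replaced by a generic `estimate_transform` callable (a hypothetical stand-in); the box parametrisation is also an assumption.

```python
def align_face(box, estimate_transform, n_iter=5):
    """Iteratively refine a face bounding box (CFA-style sketch).

    box: (cx, cy, size, angle). estimate_transform stands in for the
    trained network: given the current box it returns a residual
    (dx, dy, scale_factor, dangle), which is applied before repeating.
    """
    cx, cy, size, angle = box
    for _ in range(n_iter):
        dx, dy, ds, da = estimate_transform((cx, cy, size, angle))
        cx, cy = cx + dx, cy + dy   # shift the box
        size *= ds                  # zoom
        angle += da                 # rotate
    return cx, cy, size, angle
```

Even if each estimate only partially corrects the misalignment, repeating the estimate-and-apply cycle converges geometrically towards the correct box.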

Some results of face alignment with CFA on an Internet dataset. For each image pair: on the left, the face bounding box detected by CFF together with the desired face bounding box (both in white); on the right, the face bounding box found with CFA (in white).

Papers: VISAPP2008

Facial feature localisation

In this work, we estimate the location of four facial features: the eyes, the nose, and the mouth, in low-quality grey-scale images. As illustrated in the following diagram, we first extract normalised face patches found by the CFF, then our convolutional neural network highlights the positions of each of the four features in so-called feature maps, and finally, we project these positions back onto the original image.
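The last step, reading feature positions off the maps and projecting them back, can be sketched as follows (a minimal version assuming one heat map per feature and a simple peak read-out):

```python
import numpy as np

def locate_features(feature_maps, patch_box):
    """Project feature-map peaks back onto the original image.

    feature_maps: (n_features, H, W) array, one map per facial feature
    (here: two eyes, nose, mouth). patch_box = (x, y, w, h) is the
    normalised face patch's bounding box in the original image.
    """
    x0, y0, w, h = patch_box
    points = []
    for fmap in feature_maps:
        # peak of the map marks the feature in patch coordinates
        iy, ix = np.unravel_index(np.argmax(fmap), fmap.shape)
        # scale from feature-map coordinates back to image coordinates
        points.append((x0 + (ix + 0.5) / fmap.shape[1] * w,
                       y0 + (iy + 0.5) / fmap.shape[0] * h))
    return points
```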

Here are some results on difficult examples showing the robustness of this method to extreme illumination, poor quality and contrast, head-pose variation, and partial occlusions:

Papers: ISPA2005, CORESA2005 (best student paper), MLSP2007

Other work on facial image analysis with low-quality grey-scale images, e.g. face recognition and gender recognition, can be found here: IWAPR2007, PHD-THESIS2007

Object detection in images

We applied convolutional neural networks (CNN) to the task of object detection in still images. The models are "light-weight" but very robust to common types of noise and object/image variations.

Here are some results on cars and motorbikes:

Papers: VOCC2005

Some results on the difficult task of transparent logo detection (France 2 TV logo):

Papers: ICANN2006

Object segmentation in videos

We proposed a spatio-temporal segmentation of moving objects using active contours with non-parametric region descriptors based on dense motion vectors. Here are some examples of segmentations of single video frames.
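The use of non-parametric region descriptors on motion data can be illustrated with a greedy sketch that is much simpler than the actual active-contour evolution: each pixel is assigned to the region (object or background) whose motion-magnitude histogram explains it better, and both histograms are then re-estimated. All parameters here are illustrative.

```python
import numpy as np

def histogram_loglik(values, hist, edges):
    """Log-likelihood of motion magnitudes under a normalised histogram
    (a simple non-parametric region descriptor)."""
    idx = np.clip(np.digitize(values, edges) - 1, 0, len(hist) - 1)
    return np.log(hist[idx] + 1e-8)

def segment_by_motion(magnitude, mask, n_iter=5, bins=16):
    """Greedy region-competition sketch on a motion-magnitude image.

    magnitude: (H, W) array of per-pixel motion magnitudes.
    mask: (H, W) boolean initialisation of the object region.
    """
    edges = np.linspace(magnitude.min(), magnitude.max() + 1e-6, bins + 1)
    for _ in range(n_iter):
        # re-estimate the two non-parametric region models
        h_in, _ = np.histogram(magnitude[mask], bins=edges, density=True)
        h_out, _ = np.histogram(magnitude[~mask], bins=edges, density=True)
        # reassign every pixel to the better-explaining region
        ll_in = histogram_loglik(magnitude, h_in, edges)
        ll_out = histogram_loglik(magnitude, h_out, edges)
        mask = ll_in > ll_out
    return mask
```

The actual method evolves a contour with a spatial regularisation term instead of this per-pixel reassignment, but the competing histogram descriptors play the same role.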

(Figure panels: motion vectors for one video frame, motion magnitude, initial contour, intermediate contour, final result)

Papers: GRETSI2005, JMIV2006, MASTER-THESIS2004