My research is mainly in computer vision. Below are illustrations of some of my research activities.
Object and face tracking
PixelTrack
A very fast adaptive algorithm for tracking deformable generic objects online.
Find more information on the PixelTrack web page.
Visual Focus Of Attention estimation
We developed a method that learns the Visual Focus Of Attention (VFOA) of people in video-conferencing or meeting-room scenarios online, without supervision, and in real time. The method uses a specific type of online clustering (similar to sequential k-means) to group face patch appearances coming from a face tracking algorithm. By learning these appearance clusters from low-level features, we can estimate the VFOA directly, without the intermediate step of head pose estimation. Further, the method does not need any prior knowledge about the room configuration or the persons in the room. Our experimental results on 110 minutes of annotated data from different sources show that this approach can outperform a classical supervised method based on head orientation estimation and Gaussian Mixture Models.
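To give an idea of what this kind of online clustering looks like, here is a minimal sketch of plain sequential k-means in Python. It is not the method from the paper, which is only described as similar to it. The class name, the `new_cluster_dist` threshold for opening a new cluster, and the use of short feature vectors in place of face patch descriptors are all assumptions made for illustration.

```python
import math


class SequentialKMeans:
    """Online clustering: each new sample updates only its nearest centroid."""

    def __init__(self, new_cluster_dist):
        # Hypothetical threshold: samples farther than this from every
        # existing centroid start a new cluster.
        self.new_cluster_dist = new_cluster_dist
        self.centroids = []
        self.counts = []

    def update(self, x):
        """Assign x to a cluster, adapt that cluster, and return its index."""
        if self.centroids:
            dists = [math.dist(x, c) for c in self.centroids]
            k = min(range(len(dists)), key=dists.__getitem__)
            if dists[k] <= self.new_cluster_dist:
                self.counts[k] += 1
                rate = 1.0 / self.counts[k]  # running-mean learning rate
                self.centroids[k] = [
                    c + rate * (xi - c) for c, xi in zip(self.centroids[k], x)
                ]
                return k
        # No cluster is close enough (or none exists yet): create one.
        self.centroids.append(list(x))
        self.counts.append(1)
        return len(self.centroids) - 1


model = SequentialKMeans(new_cluster_dist=1.0)
labels = [model.update(v) for v in ([0.0, 0.0], [0.2, 0.0], [5.0, 5.0])]
```

In such a scheme, the cluster index returned for each tracked face patch would play the role of a discrete attention target, and no labelled data is needed because the centroids are adapted as the samples arrive.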
Online multi-modal speaker detection and tracking
In the TA2 project, we developed a real-time system for multi-modal speaker detection and tracking in a living-room environment where the camera and microphone array are not co-located. For a varying number of persons, the system detects the position of faces, where each person is looking (i.e. the visual focus of attention), when someone is speaking, and who is speaking, and it maps the speaker IDs to the respective visually tracked faces. Moreover, a keyword spotting algorithm has been integrated. The purpose of this processing is to automatically cut to interesting portions of the video picture (i.e. to automatically edit the video) in real time, based on these audio-visual cues.
Evaluation data: TA2 database
Online multiple face tracking
Also in the context of the TA2 project, we proposed an efficient particle filter-based online multi-face tracker with MCMC sampling. It deals reliably with false face detections and with detections that are missing over longer periods of time. The proposed method is based on a probabilistic framework that takes longer-term observations into account, rather than reasoning only on a frame-by-frame basis, as is commonly done.
Demos: TA2 video (12MB)
Evaluation data: TA2 database
Effective multi-cue object tracking
Fusing multiple cues (such as colour and texture) in a tracking framework is not a trivial task. We developed an efficient sampling technique, called Dynamic Partitioned Sampling (DPS), that fuses several cues in a principled way, using a dynamic estimate of the reliability of each cue at each point in time. We evaluated this method in a face tracking application on several videos.
In the following example, we use colour (blue rectangle), texture (red rectangle), and shape (green ellipse) cues to track a head. During the video, the estimated reliability of the different cues is dynamically adapted, e.g. when the girl turns around or when her head gets occluded.
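The idea of adapting cue reliabilities over time can be illustrated with a much simpler scheme than DPS itself. The Python sketch below is not the DPS algorithm: it fuses per-cue likelihoods as a reliability-weighted product and smooths the reliabilities exponentially. The function names, the `rate` parameter, and the per-cue scores are hypothetical.

```python
def fuse_likelihoods(likelihoods, reliabilities):
    """Weighted geometric combination: product of p_i ** r_i over all cues."""
    fused = 1.0
    for cue, p in likelihoods.items():
        fused *= p ** reliabilities[cue]
    return fused


def update_reliabilities(reliabilities, scores, rate=0.2):
    """Move each reliability towards the cue's share of the total score."""
    total = sum(scores.values())
    if total <= 0:
        return dict(reliabilities)  # no evidence: keep the old values
    return {
        cue: (1.0 - rate) * reliabilities[cue] + rate * scores[cue] / total
        for cue in reliabilities
    }


reliabilities = {"colour": 0.5, "texture": 0.5}
# Assumed scores: colour currently discriminates better than texture.
reliabilities = update_reliabilities(
    reliabilities, {"colour": 3.0, "texture": 1.0}, rate=0.5
)
weight = fuse_likelihoods({"colour": 0.25, "texture": 1.0}, reliabilities)
```

With this weighting, a cue that becomes unreliable, for example colour under an occlusion, contributes less to the fused likelihood until its score recovers.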
Face image processing
We presented several novel approaches to facial image processing based on convolutional neural networks (CNN). After applying the Convolutional Face Finder (CFF) (Garcia and Delakis, PAMI 2004) to grey-scale images, we obtain face images that we further process with different models.
Our Convolutional Face Aligner (CFA). Geometric transformations (translation, rotation and zoom) are iteratively estimated and applied to precisely find the correct face bounding box.
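The iterative estimate-and-apply loop can be sketched as follows in Python. This is only an illustration under stated assumptions: in CFA the corrections come from a convolutional neural network, whereas here `estimate_correction` is a hypothetical callable, and the box parametrisation `(cx, cy, angle, scale)`, `max_iters`, and `tol` are made up for the example.

```python
def align_box(box, estimate_correction, max_iters=30, tol=1e-3):
    """Iteratively refine a face box given as (cx, cy, angle, scale).

    estimate_correction(box) must return (dx, dy, dangle, zoom); it stands
    in for the network that predicts the remaining misalignment.
    """
    for _ in range(max_iters):
        dx, dy, dangle, zoom = estimate_correction(box)
        cx, cy, angle, scale = box
        box = (cx + dx, cy + dy, angle + dangle, scale * zoom)
        # Stop once the predicted correction is negligible.
        if max(abs(dx), abs(dy), abs(dangle), abs(zoom - 1.0)) < tol:
            break
    return box


# Toy estimator: corrects half of the remaining error towards a known target.
TARGET = (10.0, 20.0, 0.0, 2.0)


def toy_estimator(box):
    cx, cy, angle, scale = box
    return (
        0.5 * (TARGET[0] - cx),
        0.5 * (TARGET[1] - cy),
        0.5 * (TARGET[2] - angle),
        (TARGET[3] / scale) ** 0.5,
    )


aligned = align_box((0.0, 0.0, 0.3, 1.0), toy_estimator)
```

Applying small corrections repeatedly means the estimator only has to be accurate for moderate misalignments, since each pass starts from a better box than the previous one.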
Some results of face alignment with CFA on an Internet dataset. For each image pair, on the left: the face bounding box detected by CFF (white) with the desired face bounding box (white); on the right: the face bounding box detected with CFA (in white)
Facial feature localisation
In this work, we estimate the location of four facial features: the eyes, the nose, and the mouth, in low-quality grey-scale images. As illustrated in the following diagram, we first extract normalised face patches found by the CFF, then our convolutional neural network highlights the positions of each of the four features in so-called feature maps, and finally, we project these positions back onto the original image.
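The last two steps, reading a position from a feature map and projecting it back onto the original image, can be sketched in Python as below. The sketch assumes that the feature position is the maximum of its map and that the face patch is an axis-aligned box `(x, y, width, height)` in the original image; both are assumptions for illustration, and the function names are hypothetical.

```python
def peak_location(feature_map):
    """Return (row, col) of the strongest response in a 2-D list of floats."""
    positions = (
        (r, c)
        for r in range(len(feature_map))
        for c in range(len(feature_map[0]))
    )
    return max(positions, key=lambda rc: feature_map[rc[0]][rc[1]])


def project_to_image(peak, map_size, face_box):
    """Map a feature-map cell to (x, y) coordinates in the original image."""
    row, col = peak
    rows, cols = map_size
    x, y, width, height = face_box
    # Use the cell centre so the point lies in the middle of its cell.
    return (x + (col + 0.5) * width / cols, y + (row + 0.5) * height / rows)


# A 4x4 toy feature map for one feature, with a single strong response.
nose_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
peak = peak_location(nose_map)
point = project_to_image(peak, (4, 4), face_box=(100, 50, 40, 40))
```

If the face patch had been rotated during normalisation, the inverse rotation would also have to be applied in `project_to_image`; the sketch leaves this out.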
Here are some results on difficult examples showing robustness of this method to extreme illumination, poor quality and contrast, head pose variation, and partial occlusions:
Object detection in images
We applied convolutional neural networks (CNN) to the task of object detection in still images. The models are "light-weight" but very robust to common types of noise and object/image variations.
Here are some results on cars and motorbikes:
Some results on the difficult task of transparent logo detection (France 2 TV logo):
Object segmentation in videos
We proposed a spatio-temporal segmentation of moving objects using active contours with non-parametric region descriptors based on dense motion vectors. Here are some examples of segmentations of single video frames.