Department of Computer Science
Stevens Institute of Technology
Office: Lieb 215
THIS PAGE HAS NOT BEEN UPDATED IN A LONG TIME. PLEASE LOOK AT THE PUBLICATIONS PAGE FOR THE LATEST PAPERS.
RESEARCH PROJECT AT STEVENS
Note: All videos on this page were made using the Xvid codec, which can be downloaded from http://www.xvid.org/.
Evaluation of Confidence Measures for Stereo
I am currently working on a formal evaluation of confidence measures for stereo. In the context of this effort, I proposed a new confidence measure called the Self-Aware Matching Measure (SAMM), which has been accepted for publication at ICCV 2009. More details and images will be added in this space soon. In the meantime, the paper can be found below.
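The details of SAMM are in the paper itself; as a simpler illustration of what a stereo confidence measure computes, the sketch below implements the classical peak-ratio measure, which scores a pixel by how clearly its best matching cost beats the runner-up (the function name and example values are my own, not from the paper):

```python
import numpy as np

def peak_ratio_confidence(costs):
    """Peak-ratio confidence for one pixel's matching-cost curve:
    the ratio of the second-best cost to the best cost. Values well
    above 1 indicate an unambiguous match."""
    c = np.sort(np.asarray(costs, dtype=float))
    return c[1] / (c[0] + 1e-9)

# A cost curve with a single sharp minimum is more confident than
# one with two near-equal minima.
sharp = peak_ratio_confidence([5.0, 1.0, 4.0, 6.0, 7.0])
flat = peak_ratio_confidence([5.0, 1.0, 1.1, 6.0, 7.0])
```

Measures like this can be used to rank pixels by reliability, which is exactly the behavior a formal evaluation of confidence measures needs to quantify.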
Object Extraction from Large-Scale 3D Point Clouds
In August 2007, I started working on the DARPA Urban Reasoning and Geospatial Exploitation Technology (URGENT) project. The goal of the project is the analysis of very large point clouds captured in urban environments by terrestrial and airborne LIDAR sensors to recognize several classes of objects. The UPenn team, led by Ben Taskar, Kostas Daniilidis and Jianbo Shi, aims at extracting objects from the unorganized data for further processing and finer granularity recognition by our collaborators.
As part of our work on URGENT, Alexander Patterson, Kostas Daniilidis and I published a method for object detection in large-scale 3D datasets. We achieved 74.1% recall on 1221 manually annotated cars with a 7.6% false alarm rate. (In other words, our algorithm correctly detected 905 of the 1221 cars, reported 74 non-cars as cars and missed 316 cars.) Screenshots of cars detected in a very large uncontrolled environment and colored randomly can be seen below.
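The reported numbers follow from the standard definitions of recall and false alarm rate; a minimal sketch (function and parameter names are illustrative):

```python
def detection_metrics(true_positives, false_positives, annotated):
    """Recall = correctly detected objects / annotated objects;
    false alarm rate = false detections / all detections."""
    recall = true_positives / annotated
    false_alarm_rate = false_positives / (true_positives + false_positives)
    return recall, false_alarm_rate

# The numbers reported above: 905 of 1221 cars found, 74 false alarms.
recall, far = detection_metrics(905, 74, 1221)
# recall ≈ 0.741, far ≈ 0.076
```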
See a fly through part of the dataset in this video.
A follow-up to this work, in collaboration with Qihui Zhu, addresses the extraction of roads from airborne LIDAR data by formulating the problem as the computation of a minimum cover of road likelihood maps. This was published in 3DIM 2009. More details coming soon.
Video-based Reconstruction of Urban Environments
Between September 2005 and May 2007, I worked on the DARPA UrbanScape project, which aimed at the real-time 3-D reconstruction of urban scenes. The video-based part of UrbanScape was carried out by the UNC computer vision group led by Marc Pollefeys and the computer vision group from the Center for Visualization and Virtual Environments of the University of Kentucky led by David Nistér. We worked both on the video collection system, which can record eight high-resolution video streams to disk at 30 frames per second, and on the 3D reconstruction system that generates 3D models from these videos. More information on UrbanScape can be found at the Urban 3D Modelling from Video webpage (somewhat outdated now).
Several theoretical contributions were also made during our efforts to develop an accurate and fast system. These include:
Screenshots of models can be seen below. These reconstructions were made using videos from two non-overlapping cameras. The cameras were mounted on a moving vehicle which was also equipped with GPS and an inertial navigation system (INS). The model at the bottom was reconstructed from more than 2,600 frames. Close-ups show that features such as the ground plane, which is viewed at an angle from the camera, and thin objects, such as the light pole on the right, are reconstructed well. More pictures are available at the Urban 3D Modelling from Video webpage.
In collaboration with the GAMMA group at UNC, we presented a technique for coupling simulated fluid phenomena that interact with real dynamic scenes captured as a binocular video sequence. We first process the binocular video sequence to obtain a complete 3D reconstruction of the scene, including velocity information. We use stereo for the visible parts of 3D geometry and surface completion to fill the missing regions. We then perform fluid simulation within a 3D domain that contains the object, enabling one-way coupling from the video to the fluid. In order to maintain temporal consistency of the reconstructed scene and the animated fluid across frames, we develop a geometry tracking algorithm that combines optic flow and depth information with a novel technique for ''velocity completion''. The velocity completion technique uses local rigidity constraints to hypothesize a motion field for the entire 3D shape, which is then used to propagate and filter the reconstructed shape over time. This approach not only generates smoothly varying geometry across time, but also simultaneously provides the necessary boundary conditions for one-way coupling between the dynamic geometry and the simulated fluid. Finally, we employ a GPU based scheme for rendering the synthetic fluid in the real video, taking refraction and scene texture into account.
Project webpage: http://gamma.cs.unc.edu/FluidInVideo/
I also worked (October 2006-July 2007) on a project entitled "3D Content Extraction from Video Streams". It is part of the DTO Video Analysis and Content Extraction program. The 3D computer vision group at UNC still works under this award on the development of algorithms that can automatically extract 3D information from videos captured by unknown cameras under unknown conditions.
The focus is on robust Structure from Motion algorithms and auto-calibration that would allow us to recover the camera parameters and poses from the videos. This will be followed by model selection to determine whether we can achieve a 3D reconstruction, a panoramic mosaic (if the camera only rotates) or no reconstruction at all from the video sequence. The emphasis is on obtaining high quality Structure from Motion and dense reconstruction results from video sequences on which we have very little control. The reconstruction would then be the basis for further analysis of the scene. A very fast recognition module will be used to detect previously observed landmarks to stitch partial models potentially reconstructed from different videos. Among the goals are quantitative distance and size measurements on the static background and the objects of the scene, as well as higher level inferences such as whether the scene is natural or man-made.
Multiple-View Reconstruction using Graph Cuts on an Adaptive Tetrahedral Mesh
In this project, Sudipta Sinha, Marc Pollefeys and I formulated multi-view 3D shape reconstruction as the computation of a minimum cut on the dual graph of a semi-regular, multi-resolution, tetrahedral mesh. Our method uses photo-consistency to guide the adaptive subdivision of a coarse mesh. This generates a multi-resolution volumetric mesh that is densely tessellated in the parts likely to contain the unknown surface and coarse in parts that are empty. The graph-cut on the dual graph of this tetrahedral mesh produces a minimum cut corresponding to a triangulated surface that minimizes a global surface cost functional. We make no assumptions about topology and can recover deep concavities when enough cameras observe them. Our formulation also allows silhouette constraints to be enforced during the graph-cut step to counter its inherent bias for producing minimal surfaces. Local shape refinement via surface deformation is used to recover details in the reconstructed surface. Reconstructions of the Multi-View Stereo Evaluation benchmark datasets and several other real datasets show the effectiveness of our method.
In ICCV 2007, Scott Larsen, Marc Pollefeys, Henry Fuchs and I presented an approach for 3D reconstruction from multiple video streams taken by static, synchronized and calibrated cameras that is capable of enforcing temporal consistency on the reconstruction of successive frames. We attempted to improve the quality of the reconstruction by finding corresponding pixels in subsequent frames of the same camera using optical flow, but also to at least maintain the quality of the single time-frame reconstruction when these correspondences are wrong or cannot be found. This allows us to process scenes with fast motion, occlusions and self-occlusions where optical flow fails for large numbers of pixels. To this end, we modify the belief propagation algorithm to operate on a 3D graph that includes both spatial and temporal neighbors and to be able to discard messages from outlying neighbors. We also propose methods for introducing a bias and for suppressing noise typically observed in uniform regions. The bias term encapsulates information about the background and aids in achieving a temporally consistent reconstruction and in the mitigation of errors caused by occlusion.
In the fall of 2005, Scott Larsen, Marc Pollefeys, Henry Fuchs and I worked on the development of a belief propagation framework applicable to multiple-view reconstruction. The beliefs for the depth of each pixel are initialized using the plane sweep algorithm, which is repeated after each belief propagation iteration taking into account the updated visibility information. Probability density functions for the depth of each pixel along the ray emanating from the camera center are maintained for all pixels of all images. The main novelty of our work is a scheme for performing belief propagation in adaptive neighborhoods that include 3D neighbors besides the classic 4-neighbors in the image that contains each pixel (see figure below). Essentially, each pixel has four constant neighbors in its own image and a number of other neighbors that are determined based on its projection onto the other images. Messages are passed among all the neighbors, modulated by a compatibility function that takes into account similarity in color, to mitigate the effects of occlusion, and distance in 3D, to suppress the influence of points on different surfaces. For these computations to be feasible, we had to simplify the belief propagation algorithm, hence the title of the project.
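A minimal sketch of the kind of message update described above, with a scalar compatibility weight that discounts unreliable neighbors by blending the message toward uniform. The exact compatibility function and schedule in our system differ; all names and values here are illustrative:

```python
import numpy as np

def send_message(evidence, incoming_msgs, smoothness, compat):
    """Sum-product message from pixel i to one neighbor j.

    evidence: data term over depth labels at pixel i
    incoming_msgs: messages to i from its other neighbors
    smoothness[di, dj]: pairwise compatibility of depth labels
    compat: weight in [0, 1]; low values discount an unreliable
            neighbor by blending the message toward uniform
    """
    product = evidence * np.prod(incoming_msgs, axis=0)
    msg = smoothness.T @ product
    msg = compat * msg + (1.0 - compat) * msg.mean()
    return msg / msg.sum()

labels = 4
evidence = np.array([0.1, 0.6, 0.2, 0.1])
others = [np.full(labels, 0.25)]                 # one uninformative neighbor
smooth = 0.8 * np.eye(labels) + 0.2 / labels     # Potts-like smoothness
m = send_message(evidence, others, smooth, compat=0.9)
```

The same update applies whether the neighbor is one of the four in-image neighbors or a 3D neighbor from another view; only the compatibility weight changes.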
Binocular Stereo using Tensor Voting
One of the many projects I worked on at USC, and arguably the one I spent the most time on, was binocular stereo. This work was based on the preliminary approach of Mi-Suen Lee and Gérard Medioni that addressed stereo based on the premise that correct pixel correspondences reconstructed in 3D form the scene surfaces, while wrong correspondences do not form salient surfaces. Under this approach, stereo can be posed as a perceptual organization problem and tensor voting (see below) can be used to infer the surfaces. I worked to develop an algorithm that would use the same philosophy, but would be more effective on challenging real examples and benchmark data. After experimenting with a number of options for establishing pixel correspondences and for integrating monocular information in a way that mitigates the effects of occlusion without committing to premature decisions, we presented an algorithm that offers certain advantages. These include:
Left images, ground truth depth maps, the depth maps we generated and error maps for the Middlebury Stereo Evaluation webpage datasets. White in the error maps indicates errors below 0.5 disparity levels, gray errors between 0.5 and 1 disparity level and black errors greater than 1 disparity level. The error metric is the percentage of pixels above a certain error in disparity. First row: Tsukuba. Second row: Venus. Third row: Teddy. Fourth row: Cones.
We have submitted our results to the Middlebury Stereo Evaluation webpage and rank 14th among 25 algorithms (as of 11/11/2006) when the error threshold is set to 1 disparity level and 9th when the threshold is set to 0.5 disparity levels.
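The error metric used by the evaluation can be written down in a few lines; a sketch, assuming the disparity maps are NumPy arrays:

```python
import numpy as np

def bad_pixel_rate(disparity, ground_truth, threshold):
    """Percentage of pixels whose absolute disparity error exceeds
    the threshold (the Middlebury-style error metric)."""
    return 100.0 * np.mean(np.abs(disparity - ground_truth) > threshold)

d = np.array([10.0, 10.4, 11.2, 9.0])
gt = np.array([10.0, 10.0, 10.0, 10.0])
loose = bad_pixel_rate(d, gt, 1.0)   # errors above 1 disparity level
strict = bad_pixel_rate(d, gt, 0.5)  # errors above 0.5 disparity levels
```

Tightening the threshold from 1 to 0.5 disparity levels penalizes sub-pixel inaccuracy, which is why an algorithm's rank can change between the two settings.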
At USC I also worked on multiple view stereo, where the input is a set of more than two images with known calibration. Our goal was to develop an approach with minimum reliance on binocular processing that addresses the problem in 3D and not 2 1/2-D. We also did not want to be restricted by constraints such as having to place all the cameras on the same part of the scene, perform background segmentation or merge partial results. When data from all images are processed simultaneously, the difficulties caused by occlusion and uniform surfaces are reduced. Merging partial noisy depth maps is not guaranteed to have the same effect. Of course, this is only feasible for relatively small sets of images, such as the ones we processed here, which do not exceed 36 images. The only binocular step is the detection of potential pixel correspondences, which are then reconstructed in 3D and used as input for tensor voting. Correct correspondences receive a lot of support, as parts of salient surfaces, from their neighbors, while wrong correspondences do not. Tensor voting on a rather large number of potential correspondences (1.1 million) takes around 45 minutes. Since this work was done for the most part before the integration of monocular information in binocular stereo, there is still room for improvement. Some day I may find the time to improve these results and their visualization...
Six of the input images captured at the CMU dome, a view of the inputs from above (note that the cameras are inside the set of points), a view of the most salient points and a zoomed in view at the center of the dome where the person is.
Besides working on specific computer vision and machine learning problems using tensor voting, I put a lot of effort into understanding, evaluating and extending the framework. Tensor voting is a perceptual organization approach based on the Gestalt principles of proximity and good continuation. It has mainly been applied to organizing generic tokens into coherent groups in core perceptual organization scenarios, as well as to computer vision problems formulated as perceptual organization. The two fundamental aspects of tensor voting are the representation of the data by second-order, symmetric, non-negative definite tensors and the information propagation mechanism among the inputs, which cast and receive votes to and from their neighbors. Following the standards set by Gérard Medioni and a number of his students who also worked on this, I tried to ensure that all modifications and additional functionalities adhere to our philosophy and result in an approach that is:
My other major contribution to the framework was a fully general N-D implementation that allows us to tackle problems in high-dimensional spaces. See below for our work on dimensionality estimation, manifold learning and function approximation. What made this implementation feasible is a geometric observation that allowed us to simplify the vote generation process and made the pre-computation of huge high-dimensional voting fields unnecessary.
I also did research on figure completion, a perceptual organization process triggered by the presence of certain configurations of keypoints. For example, for contour completion to occur behind an occluder, two T-junctions at appropriate positions and orientations have to exist. The integration of first-order information allows us to detect keypoints such as endpoints of curves, T-junctions and L-junctions. These are indicators of potential completions and are used to generate hypotheses. An important aspect of our approach is that the decision between modal completion, which occurs along the boundary of the occluder, and amodal completion, which occurs along the direction of the occluded contour, can be made completely automatically. If a hypothesis is supported by at least two keypoints, we can infer the completion in a second pass of tensor voting. It should be noted that, while we do not address real images, since integrated edge and junction detection in them is far from solved, our algorithm can explain a few illusions, such as the Koffka crosses shown below, the Ehrenstein stimulus and the Poggendorff illusion.
Top row: two examples of the Koffka cross. Notice that the perceived completion by the human visual system changes from a circle to a square depending on the width of the cross' arms.
Bottom row: zoomed in views of the illusory contours produced by our algorithm that detects that modal completion is feasible and completes the low contrast occluder. Due to pixel quantization the circle appears slightly squared. Note that the junctions in the case of the square completion have been explicitly detected.
Using the N-D implementation of tensor voting, we were able to tackle problems in instance-based learning. One such problem is the estimation of the intrinsic dimensionality of the data given a set of observations in a high-dimensional space. We can perform this estimation after a round of tensor voting, since the eigenstructure of the resulting tensors provides an estimate of the dimensionality of the structure going through the point. This point-wise estimation makes our method applicable to challenging datasets with varying dimensionality and datasets that are not manifolds, as is the case when they contain intersections. Moreover, the absence of global computations allows us to process very large datasets at reasonable computational costs. We show results of accurate dimensionality estimation at the point level in spaces of up to 150-D.
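A sketch of the point-wise estimate described above, assuming the voting tensor at a point is already available: sorting the eigenvalues in decreasing order, the largest gap separates the normal space (the large eigenvalues) from the tangent space, whose size is the intrinsic dimensionality. This is a simplified illustration of the idea, not the full procedure from the papers:

```python
import numpy as np

def estimate_dimensionality(tensor):
    """Intrinsic dimensionality from a second-order voting tensor:
    the k eigenvalues before the largest gap span the normal space,
    so the tangent (intrinsic) dimensionality is N - k."""
    lam = np.sort(np.linalg.eigvalsh(tensor))[::-1]  # decreasing order
    gaps = lam[:-1] - lam[1:]
    k = int(np.argmax(gaps)) + 1                     # normal-space size
    return len(lam) - k

# A point on a 2D surface in 3D has one dominant normal direction;
# a point on a 1D curve in 3D has two.
surface_tensor = np.diag([5.0, 0.5, 0.4])
curve_tensor = np.diag([5.0, 4.8, 0.2])
```

Because the estimate is local to each point, nearby points can legitimately receive different dimensionalities, which is what makes datasets with varying dimensionality and intersections tractable.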
This work was extended to address manifold learning and function approximation, going beyond the early work on dimensionality estimation. (Note that the formulation of N-D voting presented in the JMLR 2010 paper is more efficient than that of the IJCAI 2005 paper.) In summary, manifolds in high-dimensional spaces are inferred by estimating geometric relationships among the input instances. Unlike conventional manifold learning, we do not perform dimensionality reduction, but instead perform all operations in the original input space. Analyzing the estimated local structure at the inputs after tensor voting, we are able to obtain reliable dimensionality and structure estimates at each instance. These local estimates enable us to measure geodesic distances and perform nonlinear interpolation for data sets with varying density, outliers, perturbation and intersections, that cannot be handled by state-of-the-art methods. Quantitative results on the estimation of local manifold structure using ground truth data are presented in the JMLR paper. In addition, we compare our approach with several leading methods for manifold learning at the task of measuring geodesic distances. Finally, we show competitive function approximation results on real data.
Top row: data of varying dimensionality in a 4D space. (The fourth dimension has been dropped for visualization purposes.) The input consists of an empty 3D sphere in 4D (which appears as a full 3D sphere when the fourth dimension is dropped), a 2D cone and a curve.
Bottom row: points classified according to their dimensionality as 1D, 2D and 3D. Notice that the intersection between the cone and the curve is correctly classified as 3D.
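Geodesic distances on a sampled manifold are commonly approximated by shortest paths on a k-nearest-neighbor graph with Euclidean edge weights (the Isomap-style approximation). Our tensor-voting-based procedure in the JMLR paper differs, so the following is only a generic sketch of the graph approximation used for comparison by such methods:

```python
import heapq
import math

def geodesic_distances(points, source, k=2):
    """Graph-geodesic distances from `source`: Dijkstra on a
    k-nearest-neighbor graph with Euclidean edge weights."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    # k nearest neighbors of each point (excluding the point itself)
    nbrs = [sorted(range(n), key=lambda j: dist(i, j))[1:k + 1]
            for i in range(n)]
    d = [math.inf] * n
    d[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > d[u]:
            continue  # stale queue entry
        for v in nbrs[u]:
            alt = du + dist(u, v)
            if alt < d[v]:
                d[v] = alt
                heapq.heappush(heap, (alt, v))
    return d

# Four points along a line: distances accumulate along the chain.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
dists = geodesic_distances(pts, 0, k=2)
```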
I spent a few years demonstrating and evaluating the 3D face reconstruction and recognition technology developed by Geometrix, Inc.
For a 3-D model of my face using these two pictures click on the pictures or here. This model was created with the Facevision 200 Series system.
While I never wrote any code directly used in this project, I thoroughly tested numerous versions of the Geometrix software and hardware systems over a five-year period. The reconstruction system matured to the point that recognition could be reliably performed using 3D information only. The fact that appearance is not used at all makes the system invariant to illumination and viewpoint variations.
This is a screenshot of a verification test on my face using two models made two months apart. There are large variations in lighting, pose and my appearance, which do not throw the system off.
In 2003, Gérard Medioni and I collaborated with Ory Dor and Charles G. Sammis from the Department of Earth Sciences at USC on developing a technique that uses computer vision to assist in characterizing the orientation distribution of slip surfaces in fault breccia. My contribution was to use the Facevision stereo rig to reconstruct rock samples from the fault. I also wrote software that detected markers corresponding to slip planes and slip lines, computed their normals or tangents, respectively, and collected statistics that were analyzed by our collaborators. They were able to draw useful conclusions on the mechanical origin of the set of surfaces. The use of stereo vision made the process considerably faster and more accurate compared to manual measurements of each slip surface.
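The geometric core of the normal computation is simple: given three detected markers on a reconstructed slip surface, the surface normal follows from a cross product. A minimal sketch of that step (the actual marker detection is more involved; names are illustrative):

```python
import numpy as np

def plane_normal(p0, p1, p2):
    """Unit normal of the plane through three non-collinear 3D points,
    e.g. markers detected on a reconstructed slip surface."""
    p0, p1, p2 = map(np.asarray, (p0, p1, p2))
    n = np.cross(p1 - p0, p2 - p0)
    return n / np.linalg.norm(n)

# Three markers lying in the z = 0 plane give a normal along z.
n = plane_normal([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Accumulating such normals over many slip surfaces yields the orientation statistics that were handed to our collaborators.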
During the Spring semester of 1999, I worked as a Research Assistant at the Signal and Image Processing Institute in the Electrical Engineering Department of USC, doing research on Magnetic Resonance Imaging with Richard M. Leahy. My task was to develop software able to segment MR images of the brain into gray matter, white matter and cerebrospinal fluid using morphological processing in 3D.
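A minimal sketch of the kind of 3D morphological operation involved, assuming the segmentation is represented as a boolean voxel mask. This implements one binary erosion step with a 6-connected structuring element in plain NumPy (the actual pipeline combined several such operations):

```python
import numpy as np

def erode3d(mask):
    """One step of 3D binary erosion with a 6-connected structuring
    element: a voxel survives only if it and all six face neighbors
    are foreground."""
    m = np.pad(mask, 1, constant_values=False)  # False border
    out = m[1:-1, 1:-1, 1:-1].copy()
    for axis in range(3):
        # AND with both face neighbors along each axis; the wrapped
        # values from np.roll fall in the padding and are cropped out.
        out &= np.roll(m, 1, axis)[1:-1, 1:-1, 1:-1]
        out &= np.roll(m, -1, axis)[1:-1, 1:-1, 1:-1]
    return out

# A solid 3x3x3 cube erodes down to its single center voxel.
cube = np.ones((3, 3, 3), dtype=bool)
core = erode3d(cube)
```

Composing erosion with its dual (dilation) gives opening and closing, the building blocks typically used to clean up tissue masks.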
RESEARCH PROJECT AT THE ARISTOTLE UNIVERSITY OF THESSALONIKI
Lossless Image Compression and Watermarking
For my undergraduate diploma thesis in the Electrical and Computer Engineering Department of the Aristotle University of Thessaloniki, Greece, I developed a plug-in for the Windows version of Netscape Navigator that decodes pyramid encoded images and extracts an embedded watermark from them. The thesis was supervised by Michael G. Strintzis.