Performance Evaluation using Ground-Truthed, 3D Data Sets

We conducted two experiments to evaluate the accuracy of our 3D modeling pipeline, using two image data sets - Fountain-P11 and Herz-Jesu-P25 - for which ground-truth 3D structures are available (acquired with a Lidar system). The color images and ground-truth Lidar depth maps are provided by Drs. Strecha, von Hansen, Van Gool, Fua, and Thoennessen of the computer vision labs of EPFL and ETHZ. These data sets are available for download at http://cvlabwww.epfl.ch/data/multiview and were introduced in the following academic paper: C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On Benchmarking Camera Calibration and Multi-View Stereo for High Resolution Imagery. CVPR 2008.


Evaluation Methodologies


We tried to emulate - as faithfully as possible - what a commercial 3D modeling system needs to accomplish for a client submitting 3D modeling tasks through a Web-service model. To this end:


  • We down-sampled the input images to VGA size (640x480) for upload and processing; a sketch of this down-sampling step appears after this list. While the original images have a very high spatial resolution (3072x2048), it is unrealistic to expect such high-resolution images in real-world scenarios. Uploading them to a back-end 3D server is an expensive proposition, especially for a mobile client with limited bandwidth whose user may have to pay for the data transferred (e.g., a user with an Apple iPhone or an Android phone).


    For example, the Herz-Jesu-P25 data set (25 images at 3072x2048) totals 151MB, and the Fountain-P11 data set (11 images at 3072x2048) totals 64MB. It is highly unlikely that a client can upload that much data to a back-end server reliably and in reasonable time, and processing such large images on a client machine - be it a mobile device, a notebook, or a desktop - would tie up local resources for a long time and is not a feasible solution. By contrast, Herz-Jesu-P25 in VGA is 2.5MB total and Fountain-P11 in VGA is 1MB total.


  • We do not use any externally generated camera calibration data (i.e., we do not use the intrinsic or extrinsic camera parameters supplied with the images). Again, in real-world application scenarios such parameters are not available, and most, if not all, consumer-market digital cameras and phones are not calibrated. A 3D modeling pipeline must be able to calibrate the intrinsic and extrinsic camera parameters automatically, using nothing but the input images, without any outside assistance.


  • We concentrated on the "end results" and ignored the "by-products" of the 3D processing. That is, we compare the faithfulness of the 3D models, not the accuracy of the recovered intrinsic and extrinsic camera parameters. We believe that in the consumer market, end users are mainly interested in the 3D models. Furthermore, a system that estimates erroneous camera parameters yet somehow still obtains correct 3D structures fortuitously is extremely unlikely (we have never observed such a phenomenon).


    To summarize, our methodology is to exercise the full 3D modeling pipeline, leading from input images directly to output 3D models, without any user intervention or parameter tuning, without any external calibration data (i.e., the camera's intrinsic and extrinsic parameters) beyond what is embedded in the input images themselves, and to do all this with photos of a reasonable size (VGA). This is what we envision a robust, commercial-grade 3D photo modeling system must be able to do.
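As an illustration of the down-sampling step above, here is a minimal sketch in Python, assuming the Pillow library and hypothetical input/ and vga/ folders (the folder names and JPEG quality are illustrative, not part of our pipeline):

    from PIL import Image
    import glob, os

    os.makedirs("vga", exist_ok=True)
    for path in glob.glob("input/*.jpg"):          # hypothetical input folder
        img = Image.open(path)
        # Shrink to fit within 640x480; thumbnail() preserves the aspect
        # ratio, so a 3:2 source such as 3072x2048 becomes 640x427.
        img.thumbnail((640, 480), Image.LANCZOS)
        img.save(os.path.join("vga", os.path.basename(path)), quality=90)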


Evaluation Procedures


We down-sampled the ground-truth data 20x for comparison. Down-sampling is necessary because the 3D ground-truth data uncompresses to 1GB for Fountain-P11 and 1.4GB for Herz-Jesu-P25, and MeshLab crashed when opening files that large. This file-size limitation was present even on a state-of-the-art Windows 7 desktop with an Intel Core i3 3.3GHz processor, 6GB of memory, and 1TB of disk space.
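One way to perform such 20x decimation is sketched below. This is an illustration only, assuming the Open3D library and a hypothetical file name; it keeps every 20th point rather than reproducing our exact decimation procedure:

    import open3d as o3d

    gt = o3d.io.read_point_cloud("herzjesu_gt.ply")        # hypothetical file name
    gt_small = gt.uniform_down_sample(every_k_points=20)   # keep every 20th point
    o3d.io.write_point_cloud("herzjesu_gt_20x.ply", gt_small)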


After our 3D pipeline finished generating 3D models from the input VGA images, these models were aligned with the ground-truth models in a two-stage procedure:


  • We first used the manual alignment process provided by MeshLab to roughly align our 3D models with the ground-truth models: we loaded both models into MeshLab, manually specified a small number of corresponding points in the two models to establish a rough alignment, and then let MeshLab refine that initial alignment with the ICP (iterative closest point) algorithm. A sketch of this refinement step appears after this list.


  • After the models were roughly aligned as in the previous step, we loaded both models into our own display program, which allows small x, y, z rotations and translations to be applied to the ground-truth models. We applied such small translations and rotations interactively and eyeballed the display for the best qualitative alignment.
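The ICP refinement in the first step can be illustrated as follows. This is a minimal sketch, assuming the Open3D library, hypothetical file names, and a placeholder initial transform standing in for the manually picked correspondences; it is analogous to, not identical to, what MeshLab does internally:

    import numpy as np
    import open3d as o3d

    ours = o3d.io.read_point_cloud("our_model.ply")       # hypothetical file names
    gt = o3d.io.read_point_cloud("ground_truth_20x.ply")

    init = np.eye(4)  # placeholder for the rough alignment from picked points
    est = o3d.pipelines.registration.TransformationEstimationPointToPoint()
    result = o3d.pipelines.registration.registration_icp(
        ours, gt, 0.05, init, est)  # 0.05 = scene-dependent match distance
    ours.transform(result.transformation)  # apply the refined alignment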


After the models had been aligned, we computed an absolute error measure for every 3D point in our models: the minimum distance from that point to the points of the corresponding ground-truth model. We then converted this to a percentage error by dividing the absolute distance by the largest dimension of the ground-truth model in the x, y, or z direction.
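In code, this error measure can be computed as in the following sketch, assuming NumPy/SciPy and that ours and gt are (N,3) and (M,3) arrays of points from the aligned models:

    import numpy as np
    from scipy.spatial import cKDTree

    tree = cKDTree(gt)                 # index the ground-truth points
    abs_err, _ = tree.query(ours)      # distance to nearest ground-truth point

    # Percentage error: normalize by the largest bounding-box dimension
    # of the ground-truth model.
    largest_dim = (gt.max(axis=0) - gt.min(axis=0)).max()
    pct_err = 100.0 * abs_err / largest_dim
    print(pct_err.max(), np.median(pct_err), pct_err.mean())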


Evaluation Results using VGA-size Sequences


Our modeling pipeline ran on a Windows 7 desktop with an Intel Core i3 3.3GHz processor, 6GB of memory, and 1TB of disk space.


For the Herz-Jesu-P25 data set (25 VGA images), our pipeline recovered 342,949 3D points and 735,419 faces in 11min 01sec; the maximum, median, and average percentage errors (over the 342,949 3D points) are 4.74%, 0.62%, and 0.41%, respectively.


For the Fountain-P11 data set (11 VGA images), our pipeline recovered 140,785 3D points and 315,195 faces in 4min 40sec; the maximum, median, and average percentage errors (over the 140,785 3D points) are 5.07%, 0.62%, and 0.44%, respectively.


The model alignments for the two data sets are shown below (white is the ground-truth model; colored is our model):


Herz-Jesu:


Fountain:



Evaluation Results using Larger-size Sequences


For comparison purposes, we also ran our pipeline on larger images of size 2150x1434 for both sequences. These larger sequences total about 14MB for Herz-Jesu-P25 (vs. 2.5MB in VGA) and 6MB for Fountain-P11 (vs. 1MB in VGA). Here are the statistics:


For the Herz-Jesu-P25 data set (25 images at 2150x1434), our pipeline recovered 1,437,984 3D points and 2,781,106 faces in 59min 48sec; the maximum, median, and average percentage errors (over the 1,437,984 3D points) are 4.42%, 0.44%, and 0.26%, respectively.


For the Fountain-P11 data set (11 images at 2150x1434), our pipeline recovered 1,072,139 3D points and 2,107,543 faces in 7min 21sec; the maximum, median, and average percentage errors (over the 1,072,139 3D points) are 1.92%, 0.23%, and 0.17%, respectively.


The model alignments for the two larger data sets are shown below (white is the ground-truth model; colored is our model):


Herz-Jesu:


Fountain:


These results are summarized in the following table:

    Data set        Images  Image size  3D points   Faces       Time       Max err  Median err  Average err
    Herz-Jesu-P25   25      640x480     342,949     735,419     11min 01s  4.74%    0.62%       0.41%
    Fountain-P11    11      640x480     140,785     315,195     4min 40s   5.07%    0.62%       0.44%
    Herz-Jesu-P25   25      2150x1434   1,437,984   2,781,106   59min 48s  4.42%    0.44%       0.26%
    Fountain-P11    11      2150x1434   1,072,139   2,107,543   7min 21s   1.92%    0.23%       0.17%

Visual inspection suggests that while model density and accuracy improve with a larger image size, the models are less complete than before. This is probably due to the more stringent feature analysis we put in place to filter out the outliers that tend to arise from matching unstable, fine-scale features in high-resolution images.



Data


• To retrieve the small (VGA) 3D model for Herz_Jesu_p25 in PLY format (82MB uncompressed, 21MB compressed), click here. You need to apply this transform to the ground-truth 3D model to align the two.

• To retrieve the large (2150x1434) 3D model for Herz_Jesu_p25 in PLY format (256MB uncompressed, 62MB compressed), click here. You need to apply this transform to the ground-truth 3D model to align the two.

• To retrieve the small (VGA) 3D model for Fountain_p11 in PLY format (27MB uncompressed, 7MB compressed), click here. You need to apply this transform to the ground-truth 3D model to align the two.

• To retrieve the large (2150x1434) 3D model for Fountain_p11 in PLY format (121MB uncompressed, 28MB compressed), click here. You need to apply this transform to the ground-truth 3D model to align the two.
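Applying such an alignment transform can be done as in the following sketch, assuming the Open3D library, hypothetical file names, and that the transform is stored as a plain-text 4x4 matrix (the actual file format of the transforms above may differ):

    import numpy as np
    import open3d as o3d

    gt = o3d.io.read_point_cloud("ground_truth.ply")   # hypothetical file names
    T = np.loadtxt("transform.txt").reshape(4, 4)      # the 4x4 alignment transform
    gt.transform(T)                                    # map into our model's frame
    o3d.io.write_point_cloud("ground_truth_aligned.ply", gt)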


Ground-truth data sets are available for download at http://cvlabwww.epfl.ch/data/multiview.


• This page demonstrates the accuracy of our core 3D reconstruction technology. Please visit our home page for more examples. Please note that Photomodel3D was developed in-house by VisualSize, with no third-party licensing or royalty payments required. Please contact us for an in-depth evaluation and customized solutions.