Visual Concept Detection in Images and Videos

Mühling, Markus

Title:	Visual Concept Detection in Images and Videos
Author:	Mühling, Markus
Other contributors:	Freisleben, Bernd (Prof. Dr.)
Published:	2014
URI:	https://archiv.ub.uni-marburg.de/diss/z2014/0483
URN:	urn:nbn:de:hebis:04-z2014-04834
DOI:	https://doi.org/10.17192/z2014.0483
DDC:	Computer Science
Title (German translation):	Erkennung visueller Konzepte in Bildern und Videos
Publication date:	2015-01-05
License:	https://rightsstatements.org/vocab/InC-NC/1.0/

Keywords:
Computer Science, Video, Image, Pattern Recognition, Video Retrieval, Image Retrieval, Visual Concept Detection, Information Retrieval, Image Understanding, Machine Learning

Summary:
The rapid proliferation of digital images and videos makes content-based search in multimedia databases increasingly important. A prerequisite for effective image and video search is the automatic analysis and indexing of media content. Current approaches in the field of image and video retrieval focus on semantic concepts that serve as an intermediate description to bridge the “semantic gap” between the data representation and its human interpretation. Due to the high complexity and variability in the appearance of visual concepts, the detection of arbitrary concepts is a very challenging task. This thesis addresses the following aspects of visual concept detection systems.

First, enhanced local descriptors for mid-level feature coding are presented. Based on the observation that scale-invariant feature transform (SIFT) descriptors with different spatial extents yield large performance differences, a novel concept detection system is proposed that combines feature representations for different spatial extents using multiple kernel learning (MKL). A multi-modal video concept detection system is presented that relies on Bag-of-Words representations for visual features and, in particular, for audio features. Furthermore, a method for the SIFT-based integration of color information, called color moment SIFT, is introduced. Comparative experimental results demonstrate the superior performance of the proposed systems on the MediaMill Challenge and the VOC Challenge.

Second, an approach is presented that systematically utilizes the results of object detectors. Novel object-based features are generated from object detection results using different pooling strategies. For videos, detection results are assembled into object sequences, and a shot-based confidence score as well as further features, such as position, frame coverage, or movement, are computed for each object class.
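
The pooling and shot-level aggregation described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the detection format `(class_name, confidence, box)`, the function names `pool_image_detections` and `shot_features`, the choice of max/mean/count pooling, and the 0.5 threshold are all assumptions made for illustration. The "temporal coverage" value computed here (fraction of frames in which a class is detected above the threshold) is likewise an assumed feature and is not necessarily identical to the "frame coverage" used in the thesis; position and movement features are omitted.

```python
from collections import defaultdict
from statistics import mean

# Assumed (hypothetical) detection format: (class_name, confidence, (x, y, w, h))


def pool_image_detections(detections, classes):
    """Pool detector confidences per object class into a fixed-length vector.

    For every class in `classes`, append [max confidence, mean confidence,
    number of detections]; classes without detections contribute zeros.
    """
    scores = defaultdict(list)
    for cls, conf, _box in detections:
        scores[cls].append(conf)
    features = []
    for cls in classes:
        s = scores.get(cls)  # None if the class was never detected
        if s:
            features.extend([max(s), mean(s), float(len(s))])
        else:
            features.extend([0.0, 0.0, 0.0])
    return features


def shot_features(frames, classes, threshold=0.5):
    """Aggregate the per-frame detections of one video shot per object class.

    `frames` holds one detection list per frame. For every class, append
    [shot confidence, temporal coverage]: the shot confidence is the mean over
    frames of the best per-frame confidence (0 if the class is absent), and the
    temporal coverage is the fraction of frames whose best confidence reaches
    `threshold`.
    """
    n = len(frames)
    features = []
    for cls in classes:
        best = [max((conf for c, conf, _box in frame if c == cls), default=0.0)
                for frame in frames]
        confidence = sum(best) / n if n else 0.0
        coverage = sum(b >= threshold for b in best) / n if n else 0.0
        features.extend([confidence, coverage])
    return features


# Usage: one image with two "person" detections and one "car" detection.
image_dets = [("person", 0.9, (10, 10, 50, 100)),
              ("person", 0.5, (80, 20, 40, 90)),
              ("car", 0.7, (0, 0, 100, 60))]
print(pool_image_detections(image_dets, ["person", "car", "dog"]))
```

In this sketch, the resulting vectors have a fixed length per class list, so they could be concatenated with other image or shot features before classifier training.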

These object-based features are used as additional input for the support vector machine (SVM)-based concept classifiers, so that related concepts, and not only the object classes themselves, can profit from them. Extensive experiments on the MediaMill, VOC, and TRECVid Challenges show significant improvements in retrieval performance, not only for the object classes but in particular also for a large number of indirectly related concepts. Moreover, it is demonstrated that a small number of object-based features is beneficial for a large number of concept classes. On the VOC Challenge, the additional use of object-based features led to a superior image classification performance of 63.8% mean average precision (mAP). Furthermore, the generalization capabilities of concept models are investigated. It is shown that a mismatch between source and target domains leads to a severe loss in concept detection performance; in these cross-domain settings, object-based features achieve a significant performance improvement. Since running a large number of single-class object detectors is inefficient, it is additionally demonstrated how a concurrent multi-class object detection system can be constructed to speed up the detection of many object classes in images.

Third, a novel, purely web-supervised learning approach for modeling heterogeneous concept classes in images is proposed. Tags and annotations of multimedia data on the World Wide Web are rich sources of information that can be employed for learning visual concepts. The presented approach aims at continuous long-term learning of appearance models and at improving these models periodically. For this purpose, several components have been developed: a crawling component, a multi-modal clustering component for spam detection and subclass identification, a novel learning component called “random savanna”, a validation component, an updating component, and a scalability manager.
Only a single word describing the visual concept is required to initiate the learning process. Experimental results demonstrate the capabilities of the individual components.

Finally, a generic concept detection system is applied to support interdisciplinary research in psychology and media science. The research question addressed in the behavioral sciences is whether and how playing computer games with violent content may induce aggression. To this end, novel semantic concepts, most notably “violence”, are detected in computer game videos to gain insights into the relationship between violent game events and the brain activity of the player. Experimental results demonstrate the excellent performance of the proposed automatic concept detection approach for such interdisciplinary research.
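
The first contribution described in the summary rests on two standard building blocks: Bag-of-Words histograms, obtained by assigning local descriptors (visual SIFT descriptors or, for audio, short-time frame features) to their nearest codebook entry, and a combined kernel K = Σ_m β_m K_m over several feature representations, such as SIFT computed at different spatial extents. The Python sketch below is illustrative only and assumes toy 2-D descriptors, a given codebook, hard assignment, and a chi-square kernel exp(-γ·χ²); the function names are hypothetical. The weights β_m are fixed by hand here, whereas an MKL solver would learn them jointly with the SVM.

```python
import math


def bow_histogram(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword (squared Euclidean
    distance) and return the L1-normalised histogram of codeword counts."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(d, codebook[j])))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


def chi2_kernel(x, y, gamma=1.0):
    """exp(-gamma * chi2(x, y)) with chi2 = sum (x_i - y_i)^2 / (x_i + y_i);
    bins that are empty in both histograms are skipped."""
    chi2 = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * chi2)


def combined_kernel(reps_x, reps_y, betas, gamma=1.0):
    """Weighted kernel sum K = sum_m beta_m * K_m.

    reps_x, reps_y: one histogram per representation (e.g. per spatial extent).
    betas: non-negative weights, fixed here; MKL would learn them.
    """
    return sum(b * chi2_kernel(hx, hy, gamma)
               for b, hx, hy in zip(betas, reps_x, reps_y))


# Usage: two images, each described by two hypothetical representations.
codebook = [(0.0, 0.0), (1.0, 1.0)]
img_a = [bow_histogram([(0.1, 0.0), (0.9, 1.0)], codebook),
         bow_histogram([(0.0, 0.2), (0.1, 0.1)], codebook)]
img_b = [bow_histogram([(1.0, 0.9), (0.8, 1.0)], codebook),
         bow_histogram([(0.0, 0.1), (0.9, 0.9)], codebook)]
print(combined_kernel(img_a, img_b, betas=[0.5, 0.5]))
```

Evaluating `combined_kernel` for all pairs of training images yields a kernel matrix that could be handed to any SVM implementation accepting precomputed kernels.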

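The 63.8% VOC result mentioned in the summary is a mean average precision. As a reminder of how such a figure is formed, the following minimal Python sketch computes non-interpolated average precision (the mean of precision@k over the ranks k of the relevant items) for each concept and averages it over concepts. It is a simplification: official evaluation tools differ in detail (PASCAL VOC, for example, uses an interpolated precision-recall curve, and TRECVid has used inferred AP estimated from incomplete judgments), so this sketch will not reproduce official challenge numbers exactly. Function names and the input layout are assumptions.

```python
def average_precision(scores, labels):
    """Non-interpolated AP of a ranking: mean of precision@k at every rank k
    that holds a relevant item. Ties keep input order (stable sort)."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for k, (_score, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_concept):
    """per_concept maps a concept name to (scores, labels); returns mean AP."""
    aps = [average_precision(s, l) for s, l in per_concept.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Usage with two toy concepts:
runs = {
    "person": ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),  # AP = (1/1 + 2/3) / 2 = 5/6
    "car": ([0.2, 0.9], [1, 0]),                     # AP = (1/2) / 1 = 1/2
}
print(mean_average_precision(runs))  # (5/6 + 1/2) / 2 = 2/3, i.e. about 0.667
```

Because AP only depends on the order of the scores, any monotone rescaling of classifier outputs leaves it unchanged.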
Bibliographie / References

T. Stadelmann. Voice Modeling Methods for Automatic Speaker Recognition. PhD thesis, Department of Mathematics and Computer Science, University of Marburg, Germany, 2010.
M. Kloft, U. Rückert, and P. Bartlett. A Unifying View of Multiple Kernel Learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML- PKDD'10), number II, pages 66–81, Barcelona, Spain, 2010. Springer.
R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning Object Categories from Google's Image Search. In Proceedings of the 10 th IEEE International Conference on Computer Vision (ICCV'05), pages 1816–1823, Beijing, China, 2005. IEEE.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, second edition, 2005.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2009.
K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text Classification from Labeled and Unlabeled Documents Using EM. Machine Learning, 39(2): 103–134, 2000.
P. Viola and M. J. Jones. Robust Real-Time Face Detection. International Journal of Computer Vision, 57(2):137–154, 2004.
Y.-G. Jiang, C.-W. Ngo, and J. Yang. Towards Optimal Bag-of-Features for Object Categorization and Semantic Video Retrieval. In Proceedings of the 6 th ACM International Conference on Image and Video Retrieval (CIVR'07), pages 494–501, Amsterdam, The Netherlands, 2007. ACM.
B. Leibe, A. Leonardis, and B. Schiele. Robust Object Detection with Interleaved Categorization and Segmentation. International Journal of Computer Vision, 77(1-3):259–289, 2008.
L. Fei-Fei, R. Fergus, and P. Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In Computer Vision and Image Understanding, volume 106, pages 59–70. Elsevier, 2007.
T. Joachims. Text Categorization With Support Vector Machines: Learning With Many Relevant Features. In Proceedings of the 10 th European Conference on Machine Learning (ECML'98), pages 137–142, Chemnitz, Germany, 1998. Springer.
P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial Structures for Object Recog- nition. International Journal of Computer Vision, 61(1):55–79, Jan. 2005.
J. Wu, D. Ding, X.-S. Hua, and B. Zhang. Tracking Concept Drifting with an Online-Optimized Incremental Learning Framework. In Proceedings of the 7 th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR'05), pages 33–40, Singapore, Singapore, 2005. ACM.
J. Yang and A. G. Hauptmann. (Un)Reliability of Video Concept Detection. In Proceedings of the 7 th ACM International Conference on Image and Video Retrieval (CIVR'08), pages 85–94, Niagara Falls, Canada, 2008. ACM.
F. Bach, G. Lanckriet, and M. Jordan. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In Proceedings of the 21 st International Conference on Machine Learning (ICML'04), pages 1–8, Banff, Alberta, Canada, 2004. ACM.
W. Jiang, C. Cotton, S.-F. Chang, D. Ellis, and A. Loui. Short-Term Audio- Visual Atoms for Generic Video Concept Classification. In Proceedings of the 17 th ACM International Conference on Multimedia (MM'09), pages 5–14, Van- couver, British Columbia, Canada, 2009a. ACM.
P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object De- tection With Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, Sept. 2010b.
R. Ewerth, M. Mühling, T. Stadelmann, J. Gllavata, M. Grauer, and B. Freisleben. Videana: A Software Toolkit for Scientific Film Studies. In Proceedings of the International Workshop on Digital Tools in Film Studies, pages 1–16, Siegen, Germany, 2007b. Transcript Verlag.
K. Yu, T. Zhang, and Y. Gong. Nonlinear Learning using Local Coordinate Coding. Advances in Neural Information Processing Systems (NIPS), 22:2223– 2231, 2009.
J. Gall and V. Lempitsky. Class-Specific Hough Forests for Object Detection. In Proceedings of the 22 nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09), pages 1022–1029, Miami Beach, Florida, USA, 2009.
C. Wang, F. Jing, L. Zhang, and H.-J. Zhang. Scalable Search-Based Image Anno- tation of Personal Images. In Proceedings of the 8 th ACM International Work- shop on Multimedia Information Retrieval (MIR'06), pages 269–278, Santa Barbara, CA, USA, 2006. ACM.
J. Yang, K. Yu, Y. Gong, and T. Huang. Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification. In Proceedings of the 22 nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09), pages 1794–1801, Miami Beach, Florida, USA, June 2009. IEEE.
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
A. Yao, J. Gall, and L. Van Gool. A Hough Transform-Based Voting Frame- work for Action Recognition. In Proceedings of the 23 rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR'10), pages 2061–2068, San Francisco, CA, USA, 2010. IEEE.
C. G. Snoek, K. E. A. van de Sande, O. de Rooij, B. Huurnink, E. Gavves, D. Odijk, M. D. Rijke, T. Gevers, M. Worring, D. C. Koelma, and A. W. M. Smeulders. The MediaMill TRECVID 2010 Semantic Video Search En- gine. In Proceedings of the TREC Video Retrieval Evaluation Workshop (TRECVid'11), Gaithersburg, Maryland, USA, 2011. NIST. URL http: //www-nlpir.nist.gov/projects/tvpubs/tv10.papers/mediamill.pdf.
L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification. In Proceedings of the 24 th Annual Conference on Neural Information Processing Systems (NIPS'10), pages 1–9, Vancouver, British Columbia, Canada, 2010b.
D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Inter- national Journal of Computer Vision, 60(2):91–110, 2004.
T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of the 16 th International Conference on Machine Learning (ICML'99), pages 200–209, Bled, Slovenia, 1999.
H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). Journal of Computer Vision and Image Understanding, 110(3):346– 359, 2008. –183–
J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-Constrained Linear Coding for Image Classification. In Proceedings of the 23 rd IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR'10), pages 3360– 3367, San Francisco, CA, USA, 2010.
E. Yilmaz, E. Kanoulas, and J. Aslam. A Simple and Efficient Sampling Method for Estimating AP and NDCG. In Proceedings of the 31 st Annual Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval, pages 603–610, Singapore, Singapore, 2008. ACM.
P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade Object Detection with Deformable Part Models. In Proceedings of the 23 rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR'10), pages 2241–2248, San Francisco, CA, USA, 2010a. IEEE.
C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A Practical Guide to Support Vector Classification. Technical Report 1, 2010.
J. C. v. Gemert, C. G. M. Snoek, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek. Comparing Compact Codebooks for Visual Categorization. Computer Vision and Image Understanding, 114(4):450–462, Apr. 2010a.
B. H. Lee, R. Grosse, R. Ranganath, and A. Ng. Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks. Com- munications of the ACM, 54(10):95–103, 2011.
R. Lienhart, S. Pfeiffer, and W. Effelsberg. The MoCA Workbench: Support for Creativity in Movie Content Analysis. In Proceedings of the 3 rd IEEE Conference on Multimedia Computing and Systems, pages 314–321, Hiroshima, Japan, 1996. IEEE.
R. Lienhart, W. Effelsberg, and R. Jain. VisualGREP: A Systematic Method to Compare and Retrieve Video Sequences. Multimedia Tools and Applications, 10(1):47–72, 1999a.
R. Lienhart, S. Pfeiffer, and W. Effelsberg. Video Abstracting. Communications of the ACM, 40(12):54–62, 1997.
L. Kennedy, S.-F. Chang, and I. Kozintsev. To Search or To Label?: Predicting the Performance of Search-Based Automatic Image Classifiers. In Proceedings of the 8 th ACM International Workshop on Multimedia Information Retrieval (MIR'06), pages 249–258, Santa Barbara, CA, USA, 2006. ACM.
J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C. Ngo. Evaluating Bag-of-Visual- Words Representations in Scene Classification. In Proceedings of the Interna- tional Workshop on Multimedia Information Retrieval (MIR'07), pages 197– 206, Augsburg, Bavaria, Germany, 2007a. ACM.
B. S. Manjunath, J.-R. Ohm, and V. V. Vasudevan. Color and Texture Descrip- tors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6): 703–715, 2001.
W.-L. Zhao, X. Wu, and C.-W. Ngo. On the Annotation of Web Videos by Efficient Near-Duplicate Search. IEEE Transactions on Multimedia, 12(5): 448–461, Aug. 2010.
Y.-G. Jiang, C.-W. Ngo, and S.-F. Chang. Semantic Context Transfer Across Heterogeneous Sources for Domain Adaptive Video Search. In Proceedings of the 17 th ACM International Conference on Multimedia (MM'09), pages 155– 164, Vancouver, British Columbia, Canada, 2009b. ACM.
E. Yilmaz and J. Aslam. Estimating Average Precision with Incomplete and Imperfect Judgments. In Proceedings of the 15 th ACM International Confer- ence on Information and Knowledge Management (CIKM'06), pages 102–111, Arlington, Virginia, USA, 2006. ACM.
V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.
S. Mallat and Z. Zhang. Matching Pursuits with Time-frequency Dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
Y.-G. Jiang, J. Yang, C.-W. Ngo, and A. Hauptmann. Representations of Keypoint-Based Semantic Concept Detection: A Comprehensive Study. IEEE Transactions on Multimedia, 12(1):42–53, 2010a.
K. E. A. v. d. Sande, T. Gevers, and C. G. M. Snoek. Evaluating Color Descriptors for Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 32(9):1582–96, Sept. 2010.
B. D. Lukas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 674–679, Vancouver, Canada, 1981.
P. Kruizinga and N. Petkov. Nonlinear Operator for Oriented Texture. IEEE Transactions on Image Processing, 8(10):1395–407, Jan. 1999.
L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
Y. Freund and R. E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
M. Ester and J. Sander. Knowledge Discovery in Databases. Springer Verlag, Berlin, 2000.
O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.
R. Ewerth, M. Schwalb, P. Tessmann, and B. Freisleben. Estimation of Ar- bitrary Camera Motion in MPEG Videos. In Proceedings of the 17 th Interna- tional Conference on Pattern Recognition (ICPR'04), volume 1, pages 512–515, Cambridge, UK, Aug. 2004. IEEE.
C. G. Snoek, M. Worring, J. C. van Gemert, J.-M. Geusebroek, and A. W. M. Smeulders. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of the 14 th Annual ACM International Bibliography Conference on Multimedia (MM'06), pages 421–430, Santa Barbara, CA, USA, 2006b. ACM.
A. F. Smeaton, P. Over, and W. Kraaij. Evaluation Campaigns and TRECVid. In Proceedings of the 8 th ACM International Workshop on Multimedia Infor- mation Retrieval (MIR'06), pages 321–330, New York, New York, USA, 2006. ACM.
G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang. Correlative Multi- Label Video Annotation. In Proceedings of the 15 th International Conference on Multimedia (MM'07), pages 17–26, Augsburg, Germany, 2007a. ACM.
R. Weber, U. Ritterfeld, and K. Mathiak. Does Playing Violent Video Games Induce Aggression? Empirical Evidence of a Functional Magnetic Resonance Imaging Study. Media Psychology, 8:39–60, 2006.
R. Ewerth and B. Freisleben. Video Cut Detection without Thresholds. In Proceedings of the 11 th International Workshop on Signals, Systems and Image Processing (IWSSIP'04), pages 227–230, Poznan, Poland, 2004.
S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large Scale Multiple Kernel Learning. Journal of Machine Learning Research, 7(1):1531–1565, 2006.
F. Felzenszwalb, D. McAllester, and D. Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. In Proceedings of the 21 st IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), pages 1–8, Anchor- age, Alaska, USA, June 2008. IEEE.
R. Datta, D. Joshi, J. Li, and J. Wang. Tagging Over Time: Real-World Image Annotation by Lightweight Meta-Learning. In Proceedings of the 15 th Inter- national Conference on Multimedia (MM'07), pages 393–402, Augsburg, Ger- many, 2007. ACM.
I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In Pro- ceedings of the 21 st International Conference on Machine Learning (ICML'04), pages 104–112, Banff, Alberta, Canada, 2004. ACM.
W. M. Campbell, D. E. Sturim, D. A. Reynolds, and W. Street. Support Vec- tor Machines using GMM Supervectors for Speaker Verification. IEEE Signal Processing Letters, 13(5):308–311, 2006.
K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Clas- sification with Sets of Image Features. In Proceedings of the 10 th IEEE Inter- national Conference on Computer Vision (ICCV'05), number October, pages 1458–1465, Beijing, China, 2005. IEEE.
A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(12):1349–1380, 2000.
D. Zhou and C. J. C. Burges. Spectral Clustering and Transductive Learning with Multiple Views. In Proceedings of the 24 th International Conference on Machine Learning (ICML'07), pages 1159–1166, Corvallis, Oregon, USA, 2007. ACM.
H. Xu and T.-S. Chua. Fusion of AV Features and External Information Sources for Event Detection in Team Sports Video. Transactions on Multimedia Com- puting, Communications, and Applications, 2(1):44–67, Feb. 2006.
M. Naphade and J. R. Smith. On the Detection of Semantic Concepts at TRECVID. In Proceedings of the 12 th Annual ACM International Conference on Multimedia (MM'04), pages 660–667, New York, New York, USA, 2004. ACM.
J. A. Aslam, V. Pavlu, and E. Yilmaz. A Statistical Method for System Evalua- tion using Incomplete Judgments. In Proceedings of the 29 th Annual Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval, pages 541–548, New York, New York, USA, 2006. ACM.
J. Yang, R. Yan, and A. G. Hauptmann. Cross-Domain Video Concept Detection Using Adaptive SVMs. In Proceedings of the 15 th International Conference on Multimedia (MM'07), pages 188–197, Augsburg, Germany, 2007b. ACM.
G. Kühne, S. Richter, and M. Beier. Motion-based Segmentation and Contour- based Classification of Video Objects. In Proceedings of the 9 th ACM In- ternational Conference on Multimedia (MM), pages 41–50, Ottawa, Ontario, Canada, 2001.
A. F. Smeaton, P. Over, and W. Kraaij. High-Level Feature Detection from Video in TRECVid: a 5-Year Retrospective of Achievements. In Multimedia Content Analysis, chapter Theory and, pages 151–174. 2009.
V. Viitaniemi and J. Laaksonen. Experiments on Selection of Codebooks for Local Image Feature Histograms. In Proceedings of the 10 th International Conference on Visual Information Systems. Web-Based Visual Information Search and Management (VISUAL'08), pages 126–137, Salerno, Italy, 2008. Springer.
M. Mühling, R. Ewerth, and B. Freisleben. On the Spatial Extents of SIFT De- scriptors for Visual Concept Detection. In Proceedings of the 8 th International Conference on Computer Vision Systems (ICVS'11), pages 71–80, Sophia An- tipolis, France, 2011b. Springer.
T. Tuytelaars, C. H. Lampert, M. B. Blaschko, and W. Buntine. Unsupervised Object Discovery: A Comparison. International Journal of Computer Vision, 88(2):284–302, July 2009.
P. Koniusz and K. Mikolajczyk. Spatial Coordinate Coding to Reduce Histogram Representations, Dominant Angle and Colour Pyramid Match. In Proceedings of the 18 th IEEE International Conference on Image Processing (ICIP'11), pages 661–664, Brussels, Belgium, 2011. IEEE.
I. Feki, A. B. Ammar, and A. M. Alimi. Audio Stream Analysis for Environmen- tal Sound Classification. In Proceedings of the International Conference on Multimedia Computing and Systems (ICMCS'11), Ouarzazate, Morocco, 2011. IEEE.
X. Tian, L. Yang, and J. Wang. Transductive Video Annotation via Local Learn- able Kernel Classifier. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'08), pages 1509–1512, Hannover, Germany, 2008. IEEE.
W. Wojcikiewicz, A. Binder, and M. Kawanabe. Enhancing Image Classifica- tion with Class-Wise Clustered Vocabularies. In Proceedings of the 20 th In- ternational Conference on Pattern Recognition (ICPR'10), pages 1060–1063, Istanbul, Turkey, Aug. 2010. IEEE.
M. Mühling, R. Ewerth, and B. Freisleben. Improving Semantic Video Retrieval via Object-Based Features. In Proceedings of the 3 rd IEEE International Con- ference on Semantic Computing (ICSC'09), pages 109–115, Berkeley, CA, USA, 2009a. IEEE.
C. G. Snoek, M. Worring, J.-M. Geusebroek, D. C. Koelma, F. J. Seinstra, and A. W. M. Smeulders. The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28(10):1678–1689, Oct. 2006a.
M. Mühling, R. Ewerth, T. Stadelmann, B. Freisleben, R. Weber, and K. Math- iak. Semantic Video Analysis for Psychological Research on Violence in Com- puter Games. In Proceedings of the 6 th ACM International Conference on Image and Video Retrieval (CIVR'07), pages 611–618, Amsterdam, The Netherlands, July 2007a. ACM.
G. Wang, T.-S. Chua, and M. Zhao. Exploring Knowledge of Sub-Domain in a Multi-Resolution Bootstrapping Framework for Concept Detection in News Video. In Proceedings of the 16 th ACM International Conference on Multimedia (MM'08), pages 249–258, Vancouver, British Columbia, Canada, 2008a. ACM.
Z. Gong, Q. Liu, and J. Guo. Deriving Semantic Terms for Images by Mining the Web. In Proceedings of the 11 th International Conference on Electronic Commerce (ICEC'09), number 2, pages 323–328, New York, New York, USA, 2009. ACM. ISBN 9781605585864.
J. Fan, C. Yang, Y. Shen, N. Babaguchi, and H. Luo. Leveraging Large-Scale Weakly-Tagged Images to Train Inter-Related Classifiers for Multi-Label An- notation. In Proceedings of the 1 st ACM Workshop on Large-Scale Multimedia Bibliography Retrieval and Mining (LS-MMRM'09), pages 27–34, New York, New York, USA, 2009. ACM.
X. Zhang, Y.-C. Song, J. Cao, Y.-D. Zhang, and J.-T. Li. Large Scale Incremental Web Video Categorization. In Proceedings of the 1 st Workshop on Web-Scale Bibliography Multimedia Corpus (WSMC'09), pages 33–40, New York, New York, USA, 2009. ACM.
H. Xu, X. Zhou, M. Wang, Y. Xiang, and B. Shi. Exploring Flickr's Related Tags for Semantic Annotation of Web Images. In Proceedings of the 8 th ACM Bibliography International Conference on Image and Video Retrieval (CIVR'09), page 1, Santorini, Fira, Greece, 2009. ACM.
E. Mbanya, S. Gerke, and P. Ndjiki-Nya. Spatial Codebooks for Image Catego- rization. In Proceedings of the 1 st ACM International Conference on Multimedia Retrieval (ICMR'11), Trento, Italy, 2011. ACM.
Bibliography R. Ewerth. Robust Video Content Analysis via Transductive Ensemble Learning. PhD thesis, Department of Mathematics and Computer Science, University of Marburg, Germany, 2008.
J. Gans. Deciding What's News: a Study of CBS Evening News, NBC Nightly News, Newsweek, and Time. Vintage Books, New York, NY, USA, 1980.
A. Bosch, A. Zisserman, and X. Muñoz. Scene Classification Using a Hybrid Generative/Discriminative Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4):712–727, 2008.
J. C. v. Gemert, J. M. Geusebroek, C. Veenman, and A. W. M. Smeulders. Ker- nel Codebooks for Scene Categorization. In Proceedings of the 10 th European Conference on Computer Vision (ECCV'08), pages 696–709, Marseille, France, 2008. Springer.
J. C. v. Gemert, C. Veenman, A. Smeulders, and J. Geusebroek. Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7):1271–1283, 2010b.
A. Ulges, M. Worring, and T. Breuel. Learning Visual Contexts for Image Anno- tation From Flickr Groups. IEEE Transactions on Multimedia, 13(2):330–341, Apr. 2011.
J. Gllavata and R. Ewerth. Text Detection in Images Based on Unsupervised Classification of High-Frequency Wavelet Coefficients. In Proceedings of 17 th International Conference on Pattern Recognition (ICPR'04), pages 425–428, Cambridge, UK, 2004. IEEE.
A. Abdel-Hakim and A. Farag. CSIFT: A SIFT Descriptor with Color Invariant Characteristics. In Proceedings of the 19 th IEEE Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 1978–1983, New York, New York, USA, 2006. IEEE.
F. Schroff, A. Criminisi, and A. Zisserman. Harvesting Image Databases from the Web. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(4):754–66, Apr. 2011.
R. Ewerth, K. Ballafkir, M. Mühling, D. Seiler, and B. Freisleben. Long-Term In- cremental Web-Supervised Learning of Visual Concepts via Random Savannas. IEEE Transactions on Multimedia, 14(4):1008–1020, 2012.
S. Jeannin and B. Mory. Video Motion Representation for Improved Content Access. IEEE Transactions on Consumer Electronics, 46(3):645–655, 2000.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
L. Lu, H.-J. Zhang, and S. Z. Li. Content-Based Audio Classification and Segmen- tation by Using Support Vector Machines. Multimedia Systems, 8(6):482–492, Apr. 2003.
M. Mühling, R. Ewerth, B. Shi, and B. Freisleben. Multi-Class Object Detection with Hough Forests Using Local Histograms of Visual Words. In Proceedings of 14 th International Conference on Computer Analysis of Images and Patterns (CAIP'11), pages 386–393, Seville, Spain, 2011c. Springer.
M. Mühling, R. Ewerth, J. Zhou, and B. Freisleben. Multimodal Video Con- cept Detection via Bag of Auditory Words and Multiple Kernel Learning. In Proceedings of the 18 th International Conference on Advances in Multimedia Modeling (MMM'12), pages 40–50, Klagenfurt, Austria, 2012. Springer.
A. Torralba, R. Fergus, and W. T. Freeman. 80 Million Tiny Images: a Large Data Set for Nonparametric Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(11):1958–1970, Nov. 2008.
R. Lienhart, L. Liang, and A. Kuranov. A Detector Tree of Boosted Classifiers for Real-Time Object Detection and Tracking. In Proceedings of the 4 th IEEE International Conference on Multimedia and Expo (ICME'03), number c, pages 277–280, Baltimore, Maryland, USA, 2003. IEEE.
M. Naphade, L. Kennedy, J. R. Kender, S.-F. Chang, J. R. Smith, P. Over, and A. Hauptmann. A Light Scale Concept Ontology for Multimedia Understand- ing for TRECVID 2005. Technical report, 2005.
M. Porter. An Algorithm for Suffix Stripping. Program: Electronic Library and Information Systems, 14(3):130–137, 1980.
X. Wang, L. Zhang, X. Li, and W. Ma. Annotating Images by Mining Image Search Results. IEEE Transactions on Pattern Analysis and Machine Intelli- gence (PAMI), 30(11):1919–1932, 2008c.
M. Riley, E. Heinen, and J. Ghosh. A Text Retrieval Approach to Content-Based Audio Retrieval. In Proceedings of the 9 th International Conference of Music Information Retrieval (ISMIR'08), pages 295–300, Philadelphia, Pennsylvania, USA, 2008.
L. Lu and A. Hanjalic. Audio Keywords Discovery for Text-Like Audio Content Analysis and Retrieval. IEEE Transactions on Multimedia, 10(1):74–85, 2008.
X. Tong, Q. Liu, L. Duan, H. Lu, C. Xu, and Q. Tian. A Unified Framework for Semantic Shot Representation of Sports Video. In Proceedings of the 7 th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR'05), pages 127–134, Singapore, Singapore, 2005. ACM. Bibliography A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing Visual Features for Multiclass and Multiview Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):854–869, 2007.
G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset. Tech- nical report, California Institute of Technology, 2007. URL http://authors. library.caltech.edu/7694. Bibliography A. Hauptmann, R. Yan, and W.-H. Lin. How Many High-Level Concepts Will Fill the Semantic Gap in News Video Retrieval? In Proceedings of the 6 th ACM International Conference on Image and Video Retrieval (CIVR'07), pages 627– 634, Amsterdam, The Netherlands, 2007. ACM.
S. Ahmadi and A. Spanias. Cepstrum-based pitch detection using a new statis- tical V/UV classification algorithm. IEEE Transactions on Speech and Audio Processing, 7(3):333–338, May 1999.
D. A. Sadlier and N. E. O'Connor. Event Detection in Field Sports Video Using Audio-Visual Features and a Support Vector Machine. IEEE Transactions on Circuits and Systems for Video Technology, 15(10):1225–1233, Oct. 2005. Bibliography K. E. A. v. d. Sande, T. Gevers, and C. G. M. Snoek. A Comparison of Color Features for Visual Concept Classification. In Proceedings of the 7 th ACM Inter- national Conference on Content-Based Image and Video Retrieval (CIVR'08), pages 141–150, Niagara Falls, Ontario, Canada, 2008. ACM.
P. Abend, T. Thielmann, R. Ewerth, D. Seiler, M. Mühling, J. Döring, M. Grauer, and B. Freisleben. Geobrowsing the Globe: A Geovisual Analysis of Google Earth Usage. Linking GeoVisualization with Spatial Analysis and Modeling (GeoViz), 2011.
A. Bosch, A. Zisserman, and X. Muoz. Image Classification using Random Forests and Ferns. In Proceedings of the 11 th IEEE International Conference on Com- puter Vision (ICCV'07), pages 1–8, Rio de Janeiro, Brazil, 2007. IEEE.
Y. Bengio, O. Delalleau, and N. L. Roux. Label Propagation and Quadratic Criterion. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, pages 193–216. MIT Press, 2006.
Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning Mid-Level Features for Recognition. In Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR'10), pages 2559–2566, San Francisco, CA, USA, 2010. IEEE.
T. Tuytelaars and K. Mikolajczyk. Local Invariant Feature Detectors: A Survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008.
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-Norm Multiple Kernel Learning. Journal of Machine Learning Research, 12(1):953–997, 2011.
J. M. Martinez. MPEG-7 Overview. Technical report, ISO/IEC, Klagenfurt, Austria, 2002.
S. Maji and J. Malik. Object Detection Using a Max-Margin Hough Transform. In Proceedings of the 22nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09), pages 1038–1045, Miami Beach, Florida, USA, 2009. IEEE.
L.-J. Li and L. Fei-Fei. OPTIMOL: Automatic Online Picture Collection via Incremental Model Learning. International Journal of Computer Vision, 88(2):147–168, July 2009.
Y. Peng, Z. Lu, and J. Xiao. Semantic Concept Annotation Based on Audio PLSA Model. In Proceedings of the 17th ACM International Conference on Multimedia (MM'09), pages 841–844, New York, New York, USA, 2009. ACM.
D. Arthur and S. Vassilvitskii. K-Means++: The Advantages of Careful Seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, New Orleans, Louisiana, USA, 2007.
G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. Bona, A. Binder, C. Gehl, and V. Franc. The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, 11(1):1799–1802, 2010.
G.-J. Qi, X.-S. Hua, Y. Song, and H.-J. Zhang. Transductive Inference with Hierarchical Clustering for Video Annotation. In Proceedings of the 8th IEEE International Conference on Multimedia and Expo (ICME'07), pages 643–646, Beijing, China, 2007b. IEEE.
J. Wang, Y. Zhao, X. Wu, and X.-S. Hua. Transductive Multi-Label Learning for Video Concept Detection. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval (MIR'08), pages 298–304, Vancouver, British Columbia, Canada, 2008b. ACM.
P. Over, T. Ianeva, W. Kraaij, and A. F. Smeaton. TRECVID 2006 – An Overview. Technical report, Gaithersburg, Maryland, USA, 2007. URL http://www-nlpir.nist.gov/projects/tvpubs/tv6.papers/tv6overview.pdf.
Y. Liu, D. Xu, I. W. Tsang, and J. Luo. Using Large-Scale Web Data to Facilitate Textual Query Based Retrieval of Consumer Photos. In Proceedings of the 17th ACM International Conference on Multimedia (MM'09), pages 55–64, Vancouver, British Columbia, Canada, 2009. ACM.
A. Vedaldi and B. Fulkerson. VLFeat – An Open and Portable Library of Computer Vision Algorithms. In Proceedings of the 18th ACM International Conference on Multimedia (MM'10), pages 1469–1472, Florence, Italy, 2010. ACM.
D. Seiler, R. Ewerth, S. Heinzl, T. Stadelmann, M. Mühling, B. Freisleben, and M. Grauer. Eine Service-Orientierte Grid-Infrastruktur zur Unterstützung medienwissenschaftlicher Filmanalyse. In Proceedings of the Workshop on Gemeinschaften in Neuen Medien (GeNeMe'09), pages 79–89, Dresden, Germany, Sept. 2009.
J.-Y. Bouguet. Pyramidal Implementation of the Affine Lucas Kanade Feature Tracker. 2001. URL http://pages.slc.edu/~aschultz/mocap/Bouget_Affine.pdf.
K. Mathiak and R. Weber. Toward Brain Correlates of Natural Behavior: fMRI during Violent Video Games. Human Brain Mapping, 27(12):948–56, Dec. 2006.
J. C. Platt. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, Apr. 1999.
J. Tang, S. Yan, R. Hong, G.-J. Qi, and T.-S. Chua. Inferring Semantic Concepts from Community-Contributed Images and Noisy Tags. In Proceedings of the 17th ACM International Conference on Multimedia (MM'09), pages 223–232, New York, New York, USA, 2009. ACM.
N. Inoue, T. Saito, K. Shinoda, and S. Furui. High-Level Feature Extraction Using SIFT GMMs and Audio Models. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR'10), pages 3220–3223, Istanbul, Turkey, Aug. 2010a. IEEE.
N. Inoue and K. Shinoda. A Fast MAP Adaptation Technique for GMM-Supervector-Based Video Semantic Indexing Systems. In Proceedings of the 19th ACM International Conference on Multimedia (MM'11), pages 1357–1360, Scottsdale, Arizona, USA, 2011. ACM.
N. Inoue and K. Shinoda. A Fast and Accurate Video Semantic Indexing System Using Fast MAP Adaptation and GMM Supervectors. IEEE Transactions on Multimedia, 6(1):1–22, 2012.
S. Kopf. Computergestützte Inhaltsanalyse von digitalen Videoarchiven. PhD thesis, Department of Computer Science, University of Mannheim, Germany, 2006.
T. Dittrich, S. Kopf, P. Schaber, B. Guthier, and W. Effelsberg. Saliency Detection for Stereoscopic Video. In Proceedings of the 4th ACM Multimedia Systems Conference (MMSYS), pages 12–23, Oslo, Norway, 2013.
H. Bredin, L. Koenig, and J. Farinas. IRIT @ TRECVid 2010: Hidden Markov Models for Context-aware Late Fusion of Multiple Audio Classifiers. In Proceedings of the TREC Video Retrieval Evaluation Workshop (TRECVid'10), Gaithersburg, Maryland, USA, 2010. NIST. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
Y.-G. Jiang, X. Zeng, G. Ye, S. Bhattacharya, D. Ellis, M. Shah, and S.-F. Chang. Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching. In Proceedings of the TREC Video Retrieval Evaluation Workshop (TRECVid'10), Gaithersburg, Maryland, USA, 2010b. NIST. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
N. Elleuch, M. Zarka, I. Feki, A. Ben Ammar, and A. M. Alimi. REGIMVID at TRECVID 2010: Semantic Indexing. In Proceedings of the TREC Video Retrieval Evaluation Workshop (TRECVid'10), Gaithersburg, Maryland, USA, 2010. NIST. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
R. Ewerth, M. Mühling, T. Stadelmann, E. Qeli, B. Agel, D. Seiler, and B. Freisleben. University of Marburg at TRECVID 2006: Shot Boundary Detection and Rushes Task Results. In Proceedings of the TREC Video Retrieval Evaluation Workshop (TRECVid'06), Gaithersburg, Maryland, USA, 2006b. NIST. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
S.-F. Chang, W. Hsu, L. Kennedy, L. Xie, A. Yanagawa, E. Zavesky, and D.-Q. Zhang. Columbia University TRECVID-2005 Video Search and High-Level Feature Extraction. In Proceedings of the TREC Video Retrieval Evaluation Workshop (TRECVid'05), Gaithersburg, Maryland, USA, 2005. NIST. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
P. Over, G. Awad, J. Fiscus, B. Antonishek, M. Michel, A. Smeaton, W. Kraaij, and G. Quénot. TRECVID 2010 – An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics. In Proceedings of the TREC Video Retrieval Evaluation Workshop (TRECVid'10), pages 1–34, Gaithersburg, Maryland, USA, 2011. National Institute of Standards and Technology (NIST). URL http://www-nlpir.nist.gov/projects/tvpubs/tv10.papers/tv10overview.pdf.
P. Over, G. Awad, J. Fiscus, B. Antonishek, M. Michel, A. Smeaton, W. Kraaij, and G. Quénot. TRECVID 2011 – An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics. In Proceedings of the TREC Video Retrieval Evaluation Workshop (TRECVid'11), pages 1–56, Gaithersburg, Maryland, USA, 2012. National Institute of Standards and Technology (NIST). URL http://www-nlpir.nist.gov/projects/tvpubs/tv11.papers/tv11overview.pdf.
P. Over, J. Fiscus, G. Sanders, B. Shaw, G. Awad, M. Michel, A. Smeaton, W. Kraaij, and G. Quénot. TRECVID 2012 – An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics. In Proceedings of the TREC Video Retrieval Evaluation Workshop (TRECVid'12), pages 1–58, Gaithersburg, Maryland, USA, 2013. National Institute of Standards and Technology (NIST). URL http://www-nlpir.nist.gov/projects/tvpubs/tv12.papers/tv12overview.pdf.
V. Kumar and I. Patras. A Discriminative Voting Scheme for Object Detection using Hough Forests. In Proceedings of the 2nd British Machine Vision Conference Postgraduate Workshop, pages 1–10, Aberystwyth, UK, 2010. British Machine Vision Association.
K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The Devil is in the Details: an Evaluation of Recent Feature Encoding Methods. In Proceedings of the 22nd British Machine Vision Conference (BMVC'11), pages 76.1–76.12, Dundee, Scotland, UK, 2011. British Machine Vision Association.
A. Ben-Hur and J. Weston. A User's Guide to Support Vector Machines. 2011. URL http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
Y. Hu and P. Loizou. Subjective Comparison of Speech Enhancement Algorithms. In Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'06), pages 153–156, Toulouse, France, 2006. IEEE.
R. Ewerth, M. Mühling, and B. Freisleben. Self-Supervised Learning of Face Appearances in TV Casts and Movies. In Proceedings of the 8th IEEE International Symposium on Multimedia (ISM'06), pages 78–85, Washington, DC, USA, 2006a. IEEE.
J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal of Computer Vision, 73(2):213–238, 2007.
F. Moosmann, B. Triggs, and F. Jurie. Fast Discriminative Visual Codebooks using Randomized Clustering Forests. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS'06), pages 1–7, Vancouver, British Columbia, Canada, 2006.
E. Nowak, F. Jurie, and B. Triggs. Sampling Strategies for Bag-of-Features Image Classification. In Proceedings of the 9th European Conference on Computer Vision (ECCV'06), pages 490–503, Graz, Austria, 2006. Springer.
K. Mikolajczyk and C. Schmid. Indexing Based on Scale Invariant Interest Points. In Proceedings of the 8th IEEE International Conference on Computer Vision (ICCV'01), pages 525–531, Vancouver, British Columbia, Canada, 2001. IEEE.
F. Jurie and B. Triggs. Creating Efficient Codebooks for Visual Recognition. In Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV'05), pages 604–610, Beijing, China, 2005. IEEE.
N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 18th IEEE Conference on Computer Vision and Pattern Recognition (CVPR'05), pages 886–893, San Diego, CA, USA, 2005. IEEE.
J. van de Weijer and C. Schmid. Coloring Local Feature Extraction. In Proceedings of the 9th European Conference on Computer Vision (ECCV'06), pages 334–348, Graz, Austria, 2006. Springer.
S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of the 19th IEEE Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 2169–2178, Washington, DC, USA, 2006. IEEE.
F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher Kernel for Large-Scale Image Classification. In Proceedings of the 11th European Conference on Computer Vision (ECCV'10), pages 143–156, Heraklion, Crete, Greece, 2010. Springer.
J. Krapac, J. Verbeek, and F. Jurie. Modeling Spatial Layout with Fisher Vectors for Image Categorization. In Proceedings of the 13th International Conference on Computer Vision (ICCV'11), pages 1487–1494, Barcelona, Spain, Nov. 2011. IEEE.
G. Fanelli, J. Gall, and L. Van Gool. Hough Transform-Based Mouth Localization for Audio-Visual Speech Recognition. In Proceedings of the 20th British Machine Vision Conference (BMVC'09), London, UK, 2009. British Machine Vision Association.
R. Lienhart, S. Pfeiffer, and W. Effelsberg. Scene Determination based on Video and Audio Features. In IEEE International Conference on Multimedia Computing and Systems, pages 685–690, Florence, Italy, 1999b.