Lectures on Speech and Music Information Processing

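Date:      Monday, December 22, 2008, 14:00 - 17:30
Venue:     Seminar Rooms A & D, 3rd floor, Engineering Building No. 6,
           Hongo Campus, The University of Tokyo (moving to another room later)
           (Access: http://hil.t.u-tokyo.ac.jp/info/transportation.html)
Organizer: IEEE Signal Processing Society Japan Chapter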

Program:

14:00-14:20
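(1) Talk: "Genre Recognition Using Percussive Pattern Features in Audio
    Music Signals"
    Speaker: 角尾 衣未留 / Emiru Tsunoo (Master's student, Graduate School
    of Information Science and Technology, The University of Tokyo)
    Abstract:
    We apply the harmonic/percussive sound separation method developed in
    our laboratory to genre classification. Percussive patterns shared
    within each genre are clustered and extracted, and features based on
    these patterns are used to improve genre recognition accuracy.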

14:20-14:40
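(2) Talk: "Two-Channel Speech Recognition Using Sparseness-Based Blind
    Source Separation"
    Speaker: 西亀 健太 (Master's student, Graduate School of Information
    Science and Technology, The University of Tokyo)
    Abstract:
    We propose two-channel speech recognition in noisy, reverberant
    environments that uses sparseness-based blind source separation as its
    front end. Two-channel blind source separation extracts the speech
    from observed signals on which noise and reverberation are
    superimposed, and Cepstral Mean Normalization (CMN) then further
    removes the distortion remaining in the separated speech. On a
    connected-digit recognition task with multiple interfering sources and
    reverberation, the proposed method reduced errors by up to 72%
    compared with conventional methods.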
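
    Cepstral Mean Normalization, named in the abstract above, subtracts
    the per-utterance mean of each cepstral coefficient from every frame.
    Because a fixed convolutive channel becomes an additive constant in
    the cepstral domain, this removes such stationary distortion. The
    following is a minimal sketch, assuming features are given as a list
    of frames (each a list of cepstral coefficients); the function name
    `cmn` is hypothetical and this is not the speaker's implementation.

```python
from typing import List


def cmn(frames: List[List[float]]) -> List[List[float]]:
    """Cepstral Mean Normalization (illustrative sketch).

    Subtracts the mean of each cepstral coefficient, computed over all
    frames of one utterance, from every frame of that utterance.
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Per-coefficient mean over the whole utterance.
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    # A constant channel offset in the cepstral domain cancels out here.
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


if __name__ == "__main__":
    clean = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.0]]
    channel = [0.7, -0.2]  # hypothetical constant channel offset
    observed = [[c + h for c, h in zip(frame, channel)] for frame in clean]
    print(cmn(clean))
    print(cmn(observed))  # approximately equal to cmn(clean)
```

    Per-utterance normalization is the simplest variant; it assumes the
    channel is constant over the utterance.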

14:40-15:20
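(3) Talk: "New Developments in Source Separation with Microphone Arrays"
    Speaker: 小野 順貴 / Nobutaka Ono (Lecturer, Graduate School of
    Information Science and Technology, The University of Tokyo)
    Abstract:
    Our laboratory studies microphone arrays for noise suppression and
    source separation in real environments where many kinds of noise are
    present. This talk introduces our work, which approaches the problem
    from several angles, including signal processing, array layout
    design, and signal acquisition methods.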

15:20-15:40 Break

15:40-16:40
(4) Talk: "From Text to Media: A Unified Approach to Multimedia
          Pattern Recognition"
    Speaker: Prof. Chin-Hui Lee (Georgia Institute of Technology)
    Abstract:
    With an increasing amount of audio and video materials made
    available on the web, information extraction from multimedia
    documents is becoming a key area of growing business and
    technology interest. Research opportunities range from traditional
    topics, such as multimedia signal representation, processing,
    coding, modeling, authentication, and recognition, to emerging
    subjects, such as language modeling, semantic concept decoding,
    media data mining, and knowledge discovery. Conventional
    multimedia processing often focuses on techniques developed for an
    individual medium. However, for multimedia pattern recognition
    purposes, a number of algorithms are well-positioned and
    applicable to many cross-media applications.

    We present three families of such algorithms. The first, derived
    from speech and image coding, is unsupervised tokenization of
    multimedia patterns into a finite set of alphabets through segment
    or block quantization. Acoustic and visual lexicons can then be
    constructed. The second, derived from information retrieval, is a
    vector space representation of multimedia documents via extraction
    of high-dimensional salient feature vectors using co-occurrence
    statistics of acoustic and visual words. This can be accomplished
    through a feature extraction and feature reduction framework,
    known as latent semantic analysis (LSA), serving as a unified
    representation of multimedia patterns.  This allows us to convert
    heterogeneous multimedia patterns into uniform text-like
    documents. Finally, we discuss decision-feedback discriminative
    learning, derived from automatic speech and speaker recognition,
    for document classification, such as text categorization (TC) or
    topic identification.  Machine learning techniques have been
    extensively used in the TC community to design discriminative
    classifiers. We present a recently developed maximal
    figure-of-merit (MFoM) learning framework for TC. It attempts to
    optimize parameters for any classifier with any feature
    representation on any desired performance metric, and was shown to
    outperform other well-known machine learning algorithms, such as
    support vector machine (SVM), especially for topics with only very
    few training documents.
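
    The first two families can be illustrated in miniature: each frame's
    feature vector is quantized to its nearest codebook entry (an
    "acoustic word"), and a document is then represented by its count
    vector over that alphabet, i.e. one column of the term-document
    matrix that LSA would subsequently decompose. This toy sketch assumes
    a fixed, hand-picked codebook (in practice it would be learned without
    supervision, e.g. by k-means); all function names are hypothetical and
    the LSA and MFoM steps themselves are not shown.

```python
from collections import Counter
from typing import List, Sequence


def nearest_codeword(x: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Index of the codeword with the smallest squared Euclidean distance to x."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((xi - ci) ** 2 for xi, ci in zip(x, codebook[k])),
    )


def tokenize(frames: List[Sequence[float]], codebook: List[Sequence[float]]) -> List[int]:
    """Vector-quantize each frame into a discrete token ('acoustic word')."""
    return [nearest_codeword(f, codebook) for f in frames]


def bag_of_words(tokens: List[int], vocab_size: int) -> List[int]:
    """Count vector over the token alphabet (one term-document matrix column)."""
    counts = Counter(tokens)
    return [counts.get(k, 0) for k in range(vocab_size)]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # hypothetical codebook
    frames = [[0.1, -0.2], [0.9, 1.2], [4.8, 5.1], [0.2, 0.1]]
    tokens = tokenize(frames, codebook)
    print(tokens)                     # [0, 1, 2, 0]
    print(bag_of_words(tokens, 3))    # [2, 1, 1]
```

    Once every medium is mapped to token counts like these, text-style
    tools (term weighting, LSA, document classifiers) can be applied
    uniformly, which is the cross-media point the abstract makes.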

    The mathematical formulation of the above three sets of techniques
    will be described in detail first, followed by their applications
    to text categorization, automatic image annotation, video story
    segmentation, audio fingerprinting, and automatic language
    identification. The three frameworks, all derived from the speech
    and language processing community, provide a natural linkage to
    language characterization and concept modeling of multimedia
    documents and seem to serve as an ideal combination of tools for
    bridging the gap from conventional, low-level, content-based
    signal processing to high-level, concept-based processing of
    multimedia patterns.

16:50-17:30
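(5) Lab tour: research of the Systems Information Laboratory No. 1
          (Sagayama-Ono Lab), Graduate School of Information Science and
          Technology, The University of Tokyo (in Room 140, Engineering
          Building No. 6)
    Demonstrations:
    a: Harmonic/percussive equalizer
    b: Real-time tempo and pitch conversion
    c: Automatic composition system Orpheus
    d: Automatic accompaniment system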