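Lecture Meeting on Speech and Music Information Processing
Date and time: Monday, December 22, 2008, 14:00 - 17:30
Venue: Seminar Rooms A & D, 3rd floor, Engineering Building 6, Hongo Campus, The University of Tokyo (the group moves to another room later)
(Access information: http://hil.t.u-tokyo.ac.jp/info/transportation.html)
Organizer: IEEE Signal Processing Society Japan Chapter
Program: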
14:00-14:20
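(1) Talk: "Genre Recognition Using Percussive Pattern Features in
Acoustic Music Signals"
Speaker: Emiru Tsunoo (角尾 衣未留) (Master's student, Graduate School of
Information Science and Technology, The University of Tokyo)
Abstract:
We apply the harmonic/percussive component separation method developed
in our laboratory to genre classification. Percussive patterns common
to a genre are clustered and extracted, and features based on them are
used with the aim of improving the genre recognition rate.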
14:20-14:40
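(2) Talk: "Two-Channel-Input Speech Recognition Using Sparseness-Based
Blind Source Separation"
Speaker: Kenta Nishiki (西亀 健太) (Master's student, Graduate School of
Information Science and Technology, The University of Tokyo)
Abstract:
We propose two-channel-input speech recognition under noisy and
reverberant conditions that uses sparseness-based blind source
separation as a front end. Two-channel blind source separation extracts
the speech from an observed signal on which noise and reverberation are
superimposed, and Cepstral Mean Normalization further reduces the
distortion that remains in the separated speech. On a connected-digit
speech recognition task with multiple interfering sounds and
reverberation, the proposed method reduced errors by up to 72% compared
with the conventional method.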
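As a rough illustration of the Cepstral Mean Normalization step named in the abstract above, the following minimal sketch subtracts the per-coefficient mean over all frames of one utterance. The function name `cepstral_mean_normalization` and the toy data are assumptions made for illustration; this is the textbook form of the technique, not necessarily the speaker's implementation.

```python
from typing import List


def cepstral_mean_normalization(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean taken over all frames of one utterance.

    `frames` is a list of cepstral vectors, one per frame (hypothetical layout).
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each cepstral coefficient across time.
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    # Remove that mean from every frame.
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy utterance: two frames with two cepstral coefficients each.
cepstra = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_normalization(cepstra)

# The same utterance with a constant offset added to every frame, which is how
# a stationary channel distortion appears in the cepstral domain.
shifted = [[1.5, 3.0], [3.5, 7.0]]
normalized_shifted = cepstral_mean_normalization(shifted)
```

Because a fixed convolutional distortion becomes an additive constant in the cepstral domain, both toy inputs normalize to the same frames.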
14:40-15:20
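(3) Talk: "New Developments in Sound Source Separation with Microphone
Arrays"
Speaker: Nobutaka Ono (小野 順貴) (Lecturer, Graduate School of Information
Science and Technology, The University of Tokyo)
Abstract:
Our laboratory studies microphone arrays for noise suppression and
sound source separation in real environments where various kinds of
noise are present. This talk introduces our work, which approaches the
problem from several aspects, including signal processing, array
geometry design, and signal acquisition methods.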
15:20-15:40 Break
15:40-16:40
(4) Talk: "From Text to Media: A Unified Approach to Multimedia
Pattern Recognition"
Speaker: Prof. Chin-Hui Lee (Georgia Institute of Technology)
Abstract:
With an increasing amount of audio and video materials made
available on the web, information extraction from multimedia
documents is becoming a key area of growing business and
technology interest. Research opportunities range from traditional
topics, such as multimedia signal representation, processing,
coding, modeling, authentication, and recognition, to emerging
subjects, such as language modeling, semantic concept decoding,
media data mining, and knowledge discovery. Conventional
multimedia processing often focuses on techniques developed for an
individual medium. However, for multimedia pattern recognition
purposes, a number of algorithms are well-positioned and
applicable to many cross-media applications.
We present three families of such algorithms. The first, derived
from speech and image coding, is unsupervised tokenization of
multimedia patterns into a finite set of alphabets through segment
or block quantization. Acoustic and visual lexicons can then be
constructed. The second, derived from information retrieval, is a
vector space representation of multimedia documents via extraction
of high-dimensional salient feature vectors using co-occurrence
statistics of acoustic and visual words. This can be accomplished
through a feature extraction and feature reduction framework,
known as latent semantic analysis (LSA), serving as a unified
representation of multimedia patterns. This allows us to convert
heterogeneous multimedia patterns into uniform text-like
documents. Finally, we discuss decision-feedback discriminative
learning, derived from automatic speech and speaker recognition,
for document classification, such as text categorization (TC) or
topic identification. Machine learning techniques have been
extensively used in the TC community to design discriminative
classifiers. We present a recently developed maximal
figure-of-merit (MFoM) learning framework for TC. It attempts to
optimize parameters for any classifier with any feature
representation on any desired performance metric, and was shown to
outperform other well-known machine learning algorithms, such as
support vector machines (SVMs), especially for topics with only very
few training documents.
The mathematical formulation of the above three sets of techniques
will be described in detail first, followed by their applications
to text categorization, automatic image annotation, video story
segmentation, audio fingerprinting, and automatic language
identification. The three frameworks, all derived from the speech
and language processing community, provide a natural linkage to
language characterization and concept modeling of multimedia
documents and seem to serve as an ideal combination of tools for
bridging the gap from conventional, low-level, content-based
signal processing to high-level, concept-based processing of
multimedia patterns.
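The first two families of algorithms in the abstract above (tokenization by quantization, then a vector space representation from co-occurrence statistics) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the codebook is fixed by hand rather than learned, the names `tokenize` and `cooccurrence_vector` are hypothetical, and the dimensionality reduction step (LSA) is omitted.

```python
from typing import List, Sequence


def nearest_codeword(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codeword closest to `vec` (squared Euclidean distance)."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


def tokenize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Quantize each feature vector to a token, turning a signal into a 'text'."""
    return [nearest_codeword(v, codebook) for v in features]


def cooccurrence_vector(tokens: Sequence[int], vocab_size: int) -> List[int]:
    """Build a document vector: unigram counts followed by adjacent-pair counts."""
    unigrams = [0] * vocab_size
    bigrams = [0] * (vocab_size * vocab_size)
    for t in tokens:
        unigrams[t] += 1
    for a, b in zip(tokens, tokens[1:]):
        bigrams[a * vocab_size + b] += 1
    return unigrams + bigrams


# Hypothetical two-word codebook; in practice it would be learned without
# supervision (for example by clustering) from acoustic or visual features.
codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [0.0, 0.2]]

tokens = tokenize(features, codebook)
doc_vector = cooccurrence_vector(tokens, vocab_size=len(codebook))
```

Once every document is such a count vector, the same text-oriented tools (feature reduction, discriminative classifiers) apply regardless of the original medium.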
16:50-17:30
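(5) Lab tour: research at System Information Laboratory 1
(Sagayama/Ono Laboratory), Graduate School of Information Science and
Technology, The University of Tokyo (at Room 140, Engineering Building 6)
Contents:
a: Harmonic/percussive equalizer
b: Real-time tempo and pitch conversion
c: Automatic composition system Orpheus
d: Automatic accompaniment system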