IBM Cognitive Services: Speech API, Visual API and Language APIs – Part 1


Cognitive computing is fast becoming a business essential, and it marks the next leap in the technology race.

First, let us understand what cognitive computing is and why the term has lately become so important to business.

“Cognitive computing is the process of mimicking the way the human brain works. It involves various artificial intelligence applications and machine learning algorithms, such as natural language processing, neural networks, virtual reality and robotics.”

Many players in the market provide cognitive capabilities and services. IBM Watson is one of them, offering cognitive computing through its products and APIs.

The key enablers for this technology shift are:

  • Data: IoT and mobility have led to exponential growth in data.
  • Computation power: In the era of cloud computing, computing power is no longer a limiting factor.
  • Artificial Intelligence algorithms: Many AI algorithms are available, backed by years of research by scholars.

Here are some of the most important Watson APIs, which we use to provide services:

SPEECH

Speech to Text

The Speech to Text service lets you add speech transcription capabilities to your applications. The service leverages machine intelligence to combine knowledge of grammar and language structure with knowledge of the composition of the audio signal. It continuously returns and updates the transcription as more speech is heard. When the audio contains multiple speakers, it labels each speaker’s part of the conversation.

It can return alternative and interim transcription results, and it can apply filtering to sanitize the output. It can also convert dates, times, numbers, phone numbers, and currency values in final transcripts of US English audio into more readable, conventional forms.

Supported languages: Brazilian Portuguese, French, Japanese, Mandarin Chinese, Modern Standard Arabic, Spanish, English.
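To make the speaker labeling concrete, here is a minimal sketch of how an application might turn a transcription result into per-speaker turns. The sample data and its field names (`results`, `alternatives`, `timestamps`, `speaker_labels`) are assumptions about the response shape, and `transcript_by_speaker` is a hypothetical helper, so check both against the service documentation before relying on them.

```python
from itertools import groupby

# Hypothetical sample, shaped like a transcription response that carries
# word timestamps and speaker labels. Field names are assumptions.
SAMPLE_RESPONSE = {
    "results": [
        {
            "final": True,
            "alternatives": [
                {
                    "transcript": "hello there hi how are you ",
                    "timestamps": [
                        ["hello", 0.0, 0.4],
                        ["there", 0.4, 0.8],
                        ["hi", 1.2, 1.4],
                        ["how", 1.4, 1.6],
                        ["are", 1.6, 1.8],
                        ["you", 1.8, 2.0],
                    ],
                }
            ],
        }
    ],
    "speaker_labels": [
        {"from": 0.0, "to": 0.4, "speaker": 0},
        {"from": 0.4, "to": 0.8, "speaker": 0},
        {"from": 1.2, "to": 1.4, "speaker": 1},
        {"from": 1.4, "to": 1.6, "speaker": 1},
        {"from": 1.6, "to": 1.8, "speaker": 1},
        {"from": 1.8, "to": 2.0, "speaker": 1},
    ],
}


def transcript_by_speaker(response):
    """Return a list of (speaker, text) turns in spoken order."""
    # Map each word's start time to the speaker assigned to that word.
    speaker_at = {
        label["from"]: label["speaker"]
        for label in response.get("speaker_labels", [])
    }
    words = []
    for result in response.get("results", []):
        if not result.get("final"):
            continue  # skip interim hypotheses that may still change
        best = result["alternatives"][0]  # first alternative is the top guess
        for word, start, _end in best.get("timestamps", []):
            words.append((speaker_at.get(start), word))
    # Merge consecutive words from the same speaker into one turn.
    return [
        (speaker, " ".join(word for _, word in group))
        for speaker, group in groupby(words, key=lambda item: item[0])
    ]


for speaker, text in transcript_by_speaker(SAMPLE_RESPONSE):
    print(f"Speaker {speaker}: {text}")
```

Only final results are used here, because interim results can still be revised as more speech is heard.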

Text to Speech

The Text to Speech service uses speech-synthesis capabilities to convert written text to natural-sounding speech in real time (with minimal delay). The service accepts plain text or text that is tagged with the Speech Synthesis Markup Language (SSML), an XML-based markup language that provides annotations of text for speech. SSML input can include an expressive element that lets you indicate a speaking style with an emotional tone. SSML also lets you shape the voice by controlling pitch, rate, and timbre. Text to Speech provides a customization interface that lets you specify how it pronounces unusual words that occur in your input.

Supported languages: English, French, German, Italian, Japanese, Spanish, and Brazilian Portuguese. For each language, the service offers at least one male or female voice, and sometimes both.
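As an illustration of SSML-tagged input, the sketch below builds a small SSML document with Python's standard library. It uses only the standard `prosody` element with `pitch` and `rate` attributes. `build_ssml` is a hypothetical helper, and which attribute values a given voice accepts is an assumption to confirm in the service documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, pitch="medium", rate="medium"):
    """Wrap plain text in an SSML <speak> document with prosody hints."""
    speak = ET.Element("speak", {"version": "1.0"})
    # <prosody> carries the pitch and rate annotations for the enclosed text.
    prosody = ET.SubElement(speak, "prosody", {"pitch": pitch, "rate": rate})
    prosody.text = text  # ElementTree escapes characters such as & and <
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Hello & welcome", pitch="high", rate="slow"))
```

Building the markup with an XML library rather than string concatenation keeps the input well-formed even when the text contains characters such as `&`.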

VISUAL

Visual Recognition

The Visual Recognition service uses deep learning algorithms to analyze images for scenes, objects, faces, and other content. It understands the contents of images and can tag an image, find human faces, estimate age and gender, and find similar images in a collection. The response includes keywords that provide information about the content. A set of built-in classes provides highly accurate results without training, and the service can also be trained with custom classifiers to create specialized classes. Custom functionality can be built around it, for example to detect a product in a shop or to identify damaged inventory.
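As a rough sketch of how an application might consume the keywords in a classification response, the example below keeps only the classes whose score clears a threshold. The sample data, its field names (`images`, `classifiers`, `classes`, `score`) and the helper `confident_tags` are assumptions for illustration, not the confirmed response format.

```python
# Hypothetical sample, shaped like an image classification response.
# Field names and scores are assumptions for illustration only.
SAMPLE_RESPONSE = {
    "images": [
        {
            "image": "shelf.jpg",
            "classifiers": [
                {
                    "classifier_id": "default",
                    "classes": [
                        {"class": "bottle", "score": 0.92},
                        {"class": "glass", "score": 0.31},
                        {"class": "beverage", "score": 0.71},
                    ],
                }
            ],
        }
    ]
}


def confident_tags(response, threshold=0.5):
    """Map each image name to its class names, best score first."""
    tags = {}
    for image in response.get("images", []):
        found = []
        for classifier in image.get("classifiers", []):
            for item in classifier.get("classes", []):
                if item["score"] >= threshold:
                    found.append((item["class"], item["score"]))
        # Sort by score so the most confident tag comes first.
        found.sort(key=lambda pair: pair[1], reverse=True)
        tags[image.get("image", "")] = [name for name, _ in found]
    return tags


print(confident_tags(SAMPLE_RESPONSE))
```

A use case such as detecting a product in a shop could apply the same filter to the classes returned by a custom classifier.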

Please go to this link for the Language processing APIs.
