Flowstorm Socket V2
Socket V2 is a new, efficient way to integrate speech-capable client software with Flowstorm.
Despite the name, Socket V2 is not based on WebSocket technology; instead it leverages common HTTP requests in the style known as HTTP long polling (clients may wait for the complete response from the server for a variable amount of time, depending on how long the conversational turn takes to process). This makes the communication protocol simpler and more reliable, since conversational interaction is (mostly) linear; HTTP requests also allow the client to write and read channel events only alternately, not simultaneously.
The Socket V2 implementation is provided by the Core Application at a single endpoint (`/client`) and supports two HTTP methods:

* PUT for writing
  * the input of a conversational turn and reading its response (i.e. an object representing user/client input, or audio/video input stream data, followed by the finally recognized input and the response items)
  * other client-related data to the server (e.g. client logs)
* GET for reading interim recognized input results and for being notified that recognition has finished (signalled by the end of the response to that request) during audio/video input streaming
To see how a client using the V2 socket can be implemented, take a look at the source code of the platform-independent ClientV2 class, as well as the JavaHttpSocketClient class, which shows a Java-based client implementation of the HTTP polling socket (interface HttpSocketClient).
Data formats
Audio/Video formats
Currently, the V2 socket supports audio/video input in one of the following formats:

| Format | Content Type |
|---|---|
| Raw PCM audio | audio/basic |
| WAV audio file | audio/wav |
| MP3 audio file | audio/mpeg |
Request/response content types
The V2 socket supports several input formats (specified in the `Content-Type` header) for requests:

* Plain text (`Content-Type: text/plain`), allowing only simple text input from the user
* A stream of JSON objects representing one or more V2 client channel events (`Content-Type: application/json`):
  * InputEvent
  * BinaryEvent
  * InputStreamEvent
* Audio/video input (see the list of supported audio/video formats above)
The server responds with JSON objects representing output events, except for plain-text requests, where the response is also plain text (see the description of plain-text communication below). Here is the list of supported output events:

* RecognizedInputEvent (GET and PUT)
* ResponseItemEvent (PUT only)
* ResponseEvent (PUT only)
* ErrorEvent (GET and PUT)
Please note that the server generates the response continuously, so some time may pass between individual output events.

See the detailed list of Channel Events.
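Because the server writes output events continuously, a client typically parses the response body as a stream of concatenated JSON objects rather than as one document. Here is a minimal sketch in Python; the event payloads used below are illustrative only, not the actual Flowstorm event schema:

```python
import json

def parse_event_stream(body: str):
    """Parse a response body containing concatenated JSON objects.

    The server may emit events one after another (optionally separated
    by whitespace/newlines), so we decode incrementally instead of
    calling json.loads() on the whole body at once.
    """
    decoder = json.JSONDecoder()
    events = []
    pos = 0
    while pos < len(body):
        # Skip whitespace between objects
        while pos < len(body) and body[pos].isspace():
            pos += 1
        if pos >= len(body):
            break
        event, pos = decoder.raw_decode(body, pos)
        events.append(event)
    return events

# Hypothetical event payloads, for illustration only
stream = '{"type": "ResponseItem", "text": "Hello"}\n{"type": "Response", "items": []}'
for event in parse_event_stream(stream):
    print(event["type"])
```

This approach also works when the response arrives incrementally, as long as the client buffers until `raw_decode` succeeds.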
Configuration and session ID handling
The V2 protocol uses HTTP request query parameters OR extra request headers to configure the pipeline. You have to pass them in every turn, as it is not guaranteed that subsequent turns in the same session will be processed by the same pipeline instance.

Here is the full list of supported configuration values:
| Request Header | Query Parameter | Default Value (empty if mandatory) | Options / comments |
|---|---|---|---|
| X-Key | key | | |
| X-DeviceId | deviceId | | Your client device's unique identifier |
| Accept-Language | - | en-US | See the IETF language tag specification |
| X-TimeZone | timeZone | Europe/Prague | |
| X-SttMode | sttMode | SingleUtterance | Only SingleUtterance is supported by V2 |
| X-SttModel | sttModel | General | See the SttModel enum |
| X-SttSampleRate | sttSampleRate | 16000 | In Hertz |
| X-SttEncoding | sttEncoding | LINEAR16 | |
| X-TtsFileType | ttsFileType | mp3 | |
| X-VideoFile | videoFile | false | |
| X-SendResponseItems | sendResponseItems | true | Send separate response items followed by a response object with an empty items array |
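Since the configuration must be repeated in every turn, it is convenient to keep it in one place and derive the query string (or headers) from it. A minimal sketch with Python's standard library; the host, key and device ID below are placeholders, and the parameter names follow the table above:

```python
from urllib.parse import urlencode

# Example configuration; key and deviceId are mandatory (placeholder values)
config = {
    "key": "YOUR_APP_KEY",
    "deviceId": "test-device-001",
    "sttSampleRate": "16000",
    "ttsFileType": "mp3",
}

def build_url(base: str, params: dict) -> str:
    """Append the configuration as query parameters to the endpoint URL.

    Equivalently, the same values could be sent as the X-Key, X-DeviceId,
    X-SttSampleRate and X-TtsFileType request headers.
    """
    return base + "?" + urlencode(params)

url = build_url("https://example.com/client", config)  # placeholder host
print(url)
```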
Session ID
Unlike the configuration values, the session ID is transported in a cookie named `flowstorm-session-id`. A V2 socket client is obliged to store or update the value received in the `Set-Cookie` response header and send it back using the `Cookie` request header. The client should discard the stored session value when it receives an ErrorEvent, OR a ResponseEvent with the property sessionEnded set to true and sessionTimeout set to zero. If sessionTimeout is a positive value, the session ID should be discarded only if the user does not start a new conversation within the specified timeout in seconds.
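The discard rules can be expressed as a small decision function. This is a sketch assuming events arrive as dictionaries with a `type` field; the exact field names should be checked against the Channel Events reference:

```python
def should_discard_session(event: dict) -> bool:
    """Decide whether the stored flowstorm-session-id should be
    dropped immediately after receiving this event."""
    if event.get("type") == "Error":
        return True
    if event.get("type") == "Response" and event.get("sessionEnded"):
        # Discard immediately only when there is no grace timeout.
        # A positive sessionTimeout means: discard only if the user
        # does not start a new conversation within that many seconds.
        return event.get("sessionTimeout", 0) == 0
    return False

assert should_discard_session({"type": "Error"}) is True
assert should_discard_session({"type": "Response", "sessionEnded": True, "sessionTimeout": 0}) is True
assert should_discard_session({"type": "Response", "sessionEnded": True, "sessionTimeout": 30}) is False
```

A `sessionTimeout` of 30 would mean the client keeps the cookie for 30 more seconds in case the user resumes the conversation.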
Supported types of communication
Plain text
Send your text input as the plain-text request body and you will receive a plain-text response in return. This kind of interaction might be useful for testing purposes. Every line of the response represents one part of it, introduced by a special single character:
In response to PUT requests:

* `<` for speech text (preceded by the voice or persona name in square brackets)
* `#` for other response item properties and their values (audio, video, code, background), specified in parentheses, multiple values separated by commas
* `!` in case of error, in the form `source: text`; e.g. `DialogueManagerV2: Action #action1 not found in dialogue`
* `.` if the session ends

In response to GET requests:

* `~` for interim speech recognition result text; e.g. `~ What's ~ What's the weather ~ What's the weather to be like`
Don't forget that if you want to continue the conversation (in case the session has not ended), you have to keep sending the `flowstorm-session-id` cookie in subsequent HTTP requests.
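The single-character markers make plain-text responses easy to classify line by line. A minimal sketch (the persona name below is hypothetical):

```python
def parse_plain_line(line: str) -> tuple:
    """Classify one plain-text response line by its leading marker."""
    markers = {
        "<": "speech",           # speech text, e.g. "< [Gabriela] Hello!"
        "#": "item_properties",  # other response item properties
        "!": "error",            # error in the form "source: text"
        ".": "session_end",      # session has ended
        "~": "interim",          # interim recognition result (GET)
    }
    marker, rest = line[0], line[1:].strip()
    return markers.get(marker, "unknown"), rest

# "Gabriela" is a hypothetical persona name used for illustration
kind, text = parse_plain_line("< [Gabriela] Hello!")
print(kind, text)
```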
Audio/Video input file
In case you have prerecorded audio input from the user (e.g. you implement a "push to talk" style of audio input), you can send it at once in the PUT request body with the corresponding Content-Type.
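In that case the whole file is transported at once in the PUT request body. A sketch of building such a request with Python's standard library (the URL, key and device ID are placeholders, and the request is only constructed here, not sent):

```python
import urllib.request

wav_bytes = b"RIFF....WAVEfmt "  # stands in for a real recorded WAV file

req = urllib.request.Request(
    url="https://example.com/client?key=YOUR_APP_KEY&deviceId=test-device-001",
    data=wav_bytes,
    method="PUT",
    headers={"Content-Type": "audio/wav"},
)
# urllib.request.urlopen(req) would send it; the output events arrive
# in the response body of this same PUT request.
print(req.get_method(), req.get_header("Content-type"))
```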
Audio/Video input streaming
If you want to pass audio input from the microphone to the socket directly and let the Flowstorm pipeline's streams do the recognition work on the fly, you can do it by writing to the PUT request body either:

* Audio/video data blocks (see the list of supported audio/video formats). The request URL must contain the query parameter `realtime=1` to notify the server that the audio data is not prerecorded and therefore cannot be transported at once (as in the audio/video input file form of communication), but will be sent in real time while being captured by the microphone/camera.
* JSON objects representing channel events:
  * multiple BinaryEvents transporting blocks of raw PCM data (one block should contain roughly 50-100 ms of audio, i.e. thousands of bytes, depending on sample rate and format)
  * optionally followed by an OutputStreamCloseEvent to indicate that the user has switched from audio to text input (no more BinaryEvents should follow), followed by an InputEvent containing the user's text input

The request body must be transferred chunked (using the request header `Transfer-Encoding: chunked`), otherwise the Flowstorm server application would not be able to process it continuously.
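The 50-100 ms block sizing guideline translates into concrete byte counts depending on sample rate and encoding. A sketch for LINEAR16 (16-bit samples, i.e. 2 bytes per sample, mono assumed):

```python
def pcm_block_size(sample_rate: int, block_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes needed for one audio block of block_ms milliseconds."""
    return sample_rate * bytes_per_sample * block_ms // 1000

def split_into_blocks(pcm: bytes, block_size: int):
    """Split captured raw PCM data into blocks, e.g. one per BinaryEvent."""
    return [pcm[i:i + block_size] for i in range(0, len(pcm), block_size)]

size = pcm_block_size(16000, 100)  # 100 ms at 16 kHz LINEAR16
print(size)                        # 3200 bytes per block
blocks = split_into_blocks(b"\x00" * 10000, size)
print([len(b) for b in blocks])    # [3200, 3200, 3200, 400]
```

At the default 16 kHz sample rate, a 100 ms block is thus 3200 bytes, which matches the "thousands of bytes" estimate above.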
Input streaming uses a standard HTTP request, which by design allows only half-duplex transmission (optional writing followed by mandatory reading within one request), in contrast to Flowstorm Socket V1, which leverages WebSockets and their full-duplex transmission. We therefore cannot send the interim or final result of input (speech) recognition, represented by RecognizedInputEvent, to the client within the same PUT request, as the client would not know when to stop audio streaming and start reading the server response. So in the case of input streaming, the client has to make a second GET request running in parallel to obtain interim recognition results and, above all, to be notified by the completion of the GET response that recognition has finished, at which point it must stop writing more data and start reading the response from the PUT request.
Don't forget to use the same socket properties and share the `flowstorm-session-id` cookie value across all PUT and GET requests in your client implementation; otherwise input streaming will end prematurely, leading to the processing of the `#silence` action in the dialogue after the timeout specified by the speech recognizer's configuration (5 seconds by default).
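The coordination between the streaming PUT and the parallel GET can be sketched with a shared flag: the GET reader signals completion, and the PUT writer then stops sending audio blocks and reads its response. The network calls are simulated here, and all names are illustrative only:

```python
import threading
import time

recognition_done = threading.Event()

def get_reader():
    """Stands in for the parallel GET request: it would deliver interim
    RecognizedInputEvents and complete when recognition has finished."""
    time.sleep(0.05)        # stand-in for server-side endpointing
    recognition_done.set()  # GET response completed -> stop streaming

def put_writer(sent: list):
    """Stands in for the chunked PUT body: keep sending audio blocks
    until the GET side signals that recognition has finished."""
    while not recognition_done.is_set():
        sent.append(b"\x00" * 3200)  # one ~100 ms BinaryEvent payload
        time.sleep(0.01)
    # ...then stop writing and read the PUT response (response items etc.)

sent_blocks = []
reader = threading.Thread(target=get_reader)
writer = threading.Thread(target=put_writer, args=(sent_blocks,))
reader.start(); writer.start()
reader.join(); writer.join()
print(len(sent_blocks) > 0)  # True: some blocks were streamed before completion
```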