Cloud Engineer, from Ostermundigen
Raiffeisen - Speech-to-Text translation POC
In a multilingual country like Switzerland, language barriers and comprehension issues are a common difficulty in internal meetings, sometimes even requiring the help of translators.
Building on the document translation application we developed, the objective was to create a POC to evaluate the possibility of providing live speech-to-text transcription and translation for internal meetings, in order to improve the employee experience across the whole organization.
As meetings need to be translated in real time, the customer faced several challenges in handling the various audio and data streams:
- Audio has to be streamed to and converted into text by the transcription service.
- Transcriptions have to be sent to the translation service.
- Translations have to be streamed back to all participants.
- Participants’ status also needs to be streamed.
- The solution must offer 3rd party transcription and translation services.
As with the document translation application, the objective was to develop a serverless solution. The web application runs alongside the video conferencing service used by the participants.
For the POC, we chose a simple approach in which each client transcribes its own audio and then sends the text for translation. Even though this approach increases the data flow and latency, it simplifies the management of audio and data streams and was considered acceptable for a POC.
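The per-client flow described above can be sketched as follows. This is only an illustration in Python (the actual client is an Angular application); the function name, the `transcribe` and `send` callables, and the JSON message fields are assumptions made for the example and are not taken from the project.

```python
import json
from typing import Callable, Iterable, Optional


def run_client_pipeline(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], Optional[str]],
    send: Callable[[str], None],
    meeting_id: str,
    speaker_id: str,
    source_language: str,
) -> int:
    """Transcribe the client's own audio and send each text segment
    onward for translation. Returns the number of segments sent."""
    sent = 0
    for chunk in audio_chunks:
        # Assumption: `transcribe` stands in for the streaming
        # transcription service and returns None or "" when no
        # text segment is available for this chunk.
        text = transcribe(chunk)
        if not text:
            continue
        # Hypothetical message envelope sent over the meeting web socket.
        message = {
            "action": "transcript",
            "meetingId": meeting_id,
            "speakerId": speaker_id,
            "sourceLanguage": source_language,
            "text": text,
        }
        send(json.dumps(message))
        sent += 1
    return sent
```

Because only text leaves the client on the translation path, the backend never has to mix or route audio streams, which is the simplification the approach trades for the extra hop and latency.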
The following picture shows the resulting architecture, which was defined using Terraform as the IaC tool.
Tools & Technologies
- The web application is an Angular application hosted in an Amazon S3 bucket and delivered to clients through Amazon CloudFront.
- Amazon CloudFront leverages AWS Web Application Firewall (WAF) to restrict access to the customer's public IPs and ensure accessibility from the customer's internal network only.
- Amazon API Gateway, AWS Lambda and Amazon DynamoDB are used to schedule and manage the meetings.
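Conceptually, restricting access to the customer's public IPs amounts to an allowlist check on the source address of each request. The sketch below illustrates that check with Python's standard `ipaddress` module; it is not how AWS WAF is implemented or configured, and the CIDR ranges are placeholders from the reserved documentation ranges, not the customer's addresses.

```python
import ipaddress
from typing import Iterable

# Hypothetical allowlist (reserved documentation ranges, not real customer IPs).
ALLOWED_CIDRS = ("203.0.113.0/24", "198.51.100.16/28")


def is_allowed(source_ip: str, allowed_cidrs: Iterable[str] = ALLOWED_CIDRS) -> bool:
    """Return True if source_ip falls inside any allowed CIDR range."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

Requests from any address outside the listed ranges would be rejected before they reach the web application.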
Realtime Transcription & Translation:
- Meeting organizers can select 3rd party transcription and translation service providers for domain-specific transcriptions and translations (e.g. transcription of Swiss-German dialect).
- Each client application requests a direct streaming connection to Amazon Transcribe from the backend. Each client also establishes a WebSocket connection to the WebSocket API gateway. The same Amazon DynamoDB table used to store meeting schedules is also used to track participants' connections and status.
- Clients who enable their microphones stream their audio directly to Amazon Transcribe and receive the transcription of their audio. They then stream their transcription on the meeting web socket.
- The backend uses Amazon Translate to translate the transcription streamed by clients over the meeting web socket and streams the translation in 4 languages back to all connected participants. To make the solution more efficient, the backend performs translations and streaming simultaneously.
- All AWS Lambda functions send application logs and business metrics to Amazon CloudWatch. Logs can be used for troubleshooting and business metrics to monitor the health of the application.
- AWS X-Ray tracing is enabled on the AWS Lambda functions to identify potential performance bottlenecks, and troubleshoot requests that resulted in an error.
- User authentication is handled by Amazon Cognito, configured in a separate AWS account and federated with the customer's on-premises Microsoft Active Directory.
- Data is encrypted in transit and at rest. The Amazon DynamoDB table is encrypted at rest using AWS KMS with Customer Master Keys (CMKs).
- Each AWS Lambda function is executed with AWS IAM roles and scoped down policies according to least privilege principles.
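The translation step described above, where the backend translates and streams at the same time, can be sketched as a parallel fan-out that forwards each translation as soon as it is ready. This is a minimal sketch under stated assumptions, not the project's implementation: the function name is hypothetical, the four language codes are assumed (the text does not name them), and `translate` and `broadcast` are injected callables. In a real backend they might wrap the Amazon Translate `translate_text` call and the API Gateway Management API `post_to_connection` call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable

# Assumption: the four target languages are not named in the text.
TARGET_LANGUAGES = ("de", "fr", "it", "en")


def translate_and_broadcast(
    text: str,
    source_language: str,
    translate: Callable[[str, str, str], str],
    broadcast: Callable[[str, str], None],
    target_languages: Iterable[str] = TARGET_LANGUAGES,
) -> Dict[str, str]:
    """Translate `text` into all target languages in parallel and
    broadcast each translation as soon as it completes."""
    targets = list(target_languages)
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        # One translation task per target language.
        futures = {
            pool.submit(translate, text, source_language, lang): lang
            for lang in targets
        }
        # Forward results in completion order instead of waiting for all.
        for future in as_completed(futures):
            lang = futures[future]
            results[lang] = future.result()
            broadcast(lang, results[lang])
    return results
```

Broadcasting in completion order means a slow translation for one language does not delay the others, at the cost of participants receiving the languages in a non-deterministic order.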
Governance and Compliance:
- We use AWS CloudTrail to capture audit logs of all API calls made to AWS services in the AWS account.
- AWS Config is used to monitor and log all infrastructure configuration changes as well as the fulfillment or lack of compliance with established configuration rules.
- The architecture is written in Terraform and deployed in the customer's environment using AWS CodePipeline and AWS CodeBuild.
Results & Benefits
Building on the architecture of the document translation application and using AWS services enabled us to build a real-time translation service. We were able to quickly come up with an architecture and deliver a first version of the POC, which we improved through multiple iterations.
Using the tools and serverless services provided by AWS offered major benefits. First, the provisioning of infrastructure and application servers is fully outsourced to the cloud provider. Second, we could easily build the first version of the POC by leveraging ready-to-use services for streaming audio transcription, translation, and WebSocket connections. Offloading the complexity of deploying and maintaining such services to the cloud provider allows us to focus on the customer's business requirements and user experience.
This application is currently deployed in the customer’s AWS environment and is going through a first pilot phase. Open challenges are:
- Enhancing the transcription by training a custom language model for the customer.
- Enhancing the data flow between the application client, the transcription and the translation service.
- Improving backend performance.
The Raiffeisen Group is the leading Swiss retail bank. The Group is the third largest player in the Swiss banking sector with around 1.96 million cooperative members and 3.6 million clients. The Raiffeisen Group is represented at 820 locations throughout Switzerland. The 219 legally autonomous cooperative Raiffeisen banks are aligned within Raiffeisen Switzerland Cooperative.
As an AWS Advanced Consulting and Training Partner, we support Swiss customers on their way to the cloud. Cloud-native technologies are part of our DNA. Since the company’s foundation in 2011, we have been supporting cloud projects and implementing and developing cloud-based solutions.