Project Background
Empeek’s team developed a new product for a client that operates in the vendor management software market.
The primary idea behind the Product is to use Artificial Intelligence (AI) to recognize and classify important data from emails, texts, and other sources.
The goal is to minimize the time employees spend reviewing various alerts and notifications, allowing them to focus solely on critical matters.
Timeline: 6+ months
Team Size: 3-4
Tech Stack
Languages
Python, C#, React
Neural Network Model Types
PyTorch, ONNX
Framework
Transformers
Infrastructure
Google Colab, Microsoft Azure ML Studio
Challenges and Requirements
Initially, there was no clear vision of how the system should work. The team quickly ruled out third-party services and Large Language Models (LLMs) because of performance concerns, cost, and the likelihood that the system would handle personal client information.
The project required a solution that could:
- Accurately identify and extract relevant tokens (entities, keywords, etc.) from text data.
- Classify alerts and notifications into appropriate categories.
- Summarize lengthy texts for easy comprehension.
- Ensure privacy and security by processing data locally without sending it to external services.
- Provide high performance and cost-effectiveness.
Chosen Approach
After evaluating various options, the team decided to use specific pre-trained models from the HuggingFace repository and to implement the solution in Python with the HuggingFace Transformers library.
To optimize model size and performance, the team converts the models to the ONNX format, allowing them to be used directly in .NET projects (with some limitations due to the lack of robust libraries).
Named Entity Recognition (NER)
NER models from HuggingFace are used to identify tokens (entities, keywords, etc.) in text data. These models are fine-tuned on pre-processed client data.
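Token-classification models usually emit one label per token, so a post-processing step has to merge those labels into whole entities. The sketch below assumes the common BIO tagging scheme. The entity types (VENDOR, DATE) and the sample sentence are hypothetical and are not taken from the client's data.

```python
from typing import List, Optional, Tuple


def group_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token BIO tags into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # Start a new entity; a stray I- tag is treated as a start.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical model output for one alert.
tokens = ["Invoice", "from", "Acme", "Corp", "due", "Friday"]
tags = ["O", "O", "B-VENDOR", "I-VENDOR", "O", "B-DATE"]
entities = group_bio_tags(tokens, tags)
print(entities)
```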
Classification
Sentence-transformer models from the Sentence-Transformers (SBERT) library are employed for alert classification. The SetFit technique is used for efficient model fine-tuning, even with small training datasets.
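SetFit copes with small datasets partly because its first stage turns a handful of labeled examples into many sentence pairs for contrastive fine-tuning: texts with the same label form positive pairs, and texts with different labels form negative pairs. The sketch below illustrates only that pairing idea in plain Python. It is not the SetFit library's implementation, which samples pairs rather than enumerating all of them, and the alert texts and labels are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def build_contrastive_pairs(
    examples: List[Tuple[str, str]]
) -> List[Tuple[str, str, float]]:
    """Return (text_a, text_b, target) with 1.0 for same label, 0.0 otherwise."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs


# Hypothetical few-shot training set: two examples per alert type.
few_shot = [
    ("Server disk usage above 90%", "infrastructure"),
    ("Database node unreachable", "infrastructure"),
    ("Invoice payment is overdue", "billing"),
    ("Contract renewal due next week", "billing"),
]

pairs = build_contrastive_pairs(few_shot)
positives = [p for p in pairs if p[2] == 1.0]
negatives = [p for p in pairs if p[2] == 0.0]
print(len(pairs), len(positives), len(negatives))
```

Even four labeled alerts yield six training pairs, which is why the number of usable training signals grows much faster than the number of annotated examples.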
Summarization
A pre-trained summarization model is also incorporated, which can be fine-tuned if needed.
Categorization
Embedding vectors are used to measure semantic similarity, allowing for the best match between the given text and category/subcategory descriptions.
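As a minimal sketch of this matching step, the example below picks the category whose description vector has the highest cosine similarity to the text vector. It assumes the embeddings have already been computed by a sentence-embedding model. The three-dimensional vectors and category names are toy values chosen for illustration only.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_category(
    text_vec: List[float], category_vecs: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return the category name with the highest similarity and its score."""
    scored = (
        (name, cosine_similarity(text_vec, vec))
        for name, vec in category_vecs.items()
    )
    return max(scored, key=lambda item: item[1])


# Toy embeddings standing in for real category-description embeddings.
category_vecs = {
    "billing": [0.9, 0.1, 0.0],
    "infrastructure": [0.1, 0.9, 0.2],
    "security": [0.0, 0.2, 0.9],
}
alert_vec = [0.2, 0.8, 0.1]  # toy embedding of an incoming alert

name, score = best_category(alert_vec, category_vecs)
print(name, round(score, 3))
```

The same function can be applied twice, first against category descriptions and then against the subcategories of the winning category.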
Training & Development Process
The training process is currently conducted on Google Colab, using a custom-built pipeline that fine-tunes the chosen HuggingFace base models on client data prepared in a specific format. The resulting models are then uploaded to Google Drive or Azure Storage.
The data preparation process involves:
- Manual analysis and annotation of client data by support personnel and developers.
- Thorough data validation to ensure consistency and accuracy.
- Categorization of alerts into different types.
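The validation step above could be partly automated. The sketch below checks annotated records for empty texts, unknown alert types, and identical texts that received different labels. The record fields (`text`, `alert_type`) and the allowed alert types are hypothetical, since the actual data format is not described here.

```python
from collections import defaultdict
from typing import Dict, List, Set


def validate_annotations(records: List[Dict], allowed_types: Set[str]) -> List[str]:
    """Return a list of human-readable problems found in annotated records."""
    problems: List[str] = []
    labels_by_text = defaultdict(set)

    for index, record in enumerate(records):
        text = (record.get("text") or "").strip()
        alert_type = record.get("alert_type")
        if not text:
            problems.append(f"record {index}: empty text")
            continue
        if alert_type not in allowed_types:
            problems.append(f"record {index}: unknown alert type {alert_type!r}")
        # Track labels per normalized text to detect inconsistent annotation.
        labels_by_text[text.lower()].add(str(alert_type))

    for text, labels in labels_by_text.items():
        if len(labels) > 1:
            problems.append(f"conflicting labels for {text!r}: {sorted(labels)}")
    return problems


# Hypothetical annotated records, three of which have problems.
allowed = {"infrastructure", "billing", "security"}
records = [
    {"text": "Disk usage above 90%", "alert_type": "infrastructure"},
    {"text": "disk usage above 90%", "alert_type": "billing"},
    {"text": "", "alert_type": "billing"},
    {"text": "Login from new device", "alert_type": "securty"},
]

problems = validate_annotations(records, allowed)
for problem in problems:
    print(problem)
```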
Empeek’s team initially focused on achieving 80% accuracy in classifying the top 8 alert types and plans to continuously expand and improve the system.
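One way to track such a target is to measure, for each of the most frequent alert types, the share of alerts of that type that were classified correctly, and to flag any type that falls below the threshold. This is a sketch under that reading of the target; the source does not say whether the 80% figure is measured per type or overall. The labels and predictions are made up, and `top_n` is set to 2 to keep the example small.

```python
from collections import Counter
from typing import Dict, List


def per_type_accuracy(
    true_labels: List[str], predicted_labels: List[str], top_n: int = 8
) -> Dict[str, float]:
    """Share of correctly classified alerts for the top_n most frequent types."""
    totals = Counter(true_labels)
    correct = Counter(
        true for true, pred in zip(true_labels, predicted_labels) if true == pred
    )
    top_types = [label for label, _ in totals.most_common(top_n)]
    return {label: correct[label] / totals[label] for label in top_types}


def types_below_target(accuracy: Dict[str, float], target: float = 0.8) -> List[str]:
    """Alert types whose accuracy is strictly below the target."""
    return sorted(label for label, value in accuracy.items() if value < target)


# Made-up evaluation data: 5 billing, 4 infrastructure, 1 security alert.
true_labels = ["billing"] * 5 + ["infrastructure"] * 4 + ["security"]
predicted_labels = (
    ["billing"] * 4 + ["security"]
    + ["infrastructure"] * 3 + ["billing"]
    + ["security"]
)

accuracy = per_type_accuracy(true_labels, predicted_labels, top_n=2)
needs_work = types_below_target(accuracy, target=0.8)
print(accuracy, needs_work)
```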
Future Plans
- The team plans to expand the number of clients using the X version of the Product, offering automated alert classification, prioritization, and ticket creation to minimize human intervention.
- The long-term goal is to release the beta version of the Product, incorporating feedback from alpha users to enhance functionality.
- Ongoing model refinement and customization options will be available to cater to specific client needs and preferences for alert handling.
About Empeek
Engineering a better healthcare future.
Empeek is a custom healthcare software development company that helps healthtech startups and medical facilities create and leverage innovative, HIPAA/HITECH-compliant technology solutions such as EMR and telemedicine systems, patient-centered cross-platform apps, AI-powered tools, and IoT ecosystems.
- 150+ Specialists
- On the Market Since 2015
- HIPAA & GDPR Compliant