A health insurance company implements a machine learning solution, powered by natural language processing, for its customer database to ensure greater accuracy in its network confirmation process.


Our client is a health insurance company with a large data network of health service providers. Its main prospects are companies and organizations that hold collective health insurance policies for their employees or members, and it receives daily information about the health service providers these prospects use. Our client needs to compare and match that information against its own network, which is held in multiple databases with unstructured data files, to determine which providers are part of the network and which are not. The key difficulty in matching this data is that the names and addresses of providers and facilities rarely match exactly those in our client's network. Matching has to take place on several levels, and a manual comparison of names and identifiers is often required to reach a minimum level of certainty in the results. Performed manually, this process took the client's team of seven employees an entire day, was vulnerable to human error, and required exhaustive training. To alleviate these problems, our client asked Apex to automate the matching with a machine learning (ML) solution based on natural language processing (NLP).

1,080 Employee Hours Saved Monthly


Our team defined and built a Python framework that identifies unstructured data and standardizes it into structured datasets. The team then built a staging layer of all the client's relevant internal databases and defined a pipeline strategy to perform the following tasks:

  • Standardize providers' data and provide the ability to upload it into a PostgreSQL database on demand and automatically. 
  • Narrow down the best-matching candidates with analytics algorithms, then calculate their similarity using NLP and build a confidence ratio matrix to feed an ML classification model.
  • Train and incorporate a K-Nearest Neighbors (KNN) model to group the best matches according to the match types defined by the user. 
  • Store match results in the database and provide results with a confidence level ratio. 
  • Build a Flask UI to upload the data to be ingested, monitor the pipeline progress, and download the results from an internal PostgreSQL database.
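
The matching steps in the list above can be illustrated with a minimal Python sketch. It is not the client's implementation: the abbreviation table, the ZIP-code blocking rule, the field names, the match-type labels, and the training rows are all hypothetical, and the standard library's `difflib` stands in for whatever NLP similarity measure the production pipeline uses. The KNN vote is written out by hand so the example has no external dependencies.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "ctr": "center", "hosp": "hospital"}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and expand abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def similarity(a, b):
    """String similarity in [0, 1], computed on the normalized forms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidates(record, network):
    """Narrowing step (assumed rule): only compare records sharing a ZIP code."""
    return [entry for entry in network if entry["zip"] == record["zip"]]


def confidence_vector(record, candidate):
    """One row of the confidence matrix: (name score, address score)."""
    return (
        similarity(record["name"], candidate["name"]),
        similarity(record["address"], candidate["address"]),
    )


def knn_classify(vector, labeled, k=3):
    """Majority vote among the k labeled vectors closest to `vector`."""
    def squared_distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    nearest = sorted(labeled, key=lambda item: squared_distance(vector, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Made-up network entries and one incoming provider record.
network = [
    {"id": 1, "name": "Saint Mary Hospital", "address": "100 Main Street", "zip": "10001"},
    {"id": 2, "name": "Lakeside Dental Center", "address": "42 Oak Avenue", "zip": "10001"},
    {"id": 3, "name": "Saint Mary Hospital", "address": "7 Hill Road", "zip": "20002"},
]
incoming = {"name": "St. Mary Hosp.", "address": "100 Main St", "zip": "10001"}

# Made-up labeled vectors; the match types would come from the user.
labeled = [
    ((1.00, 1.00), "exact"), ((0.95, 0.90), "exact"), ((0.90, 0.95), "exact"),
    ((0.80, 0.30), "name_only"), ((0.85, 0.20), "name_only"), ((0.90, 0.10), "name_only"),
    ((0.20, 0.10), "no_match"), ((0.10, 0.30), "no_match"), ((0.30, 0.20), "no_match"),
]

for candidate in candidates(incoming, network):
    vector = confidence_vector(incoming, candidate)
    print(candidate["id"], vector, knn_classify(vector, labeled))
```

Restricting the comparison to a small candidate set first keeps the more expensive similarity scoring from running against every record in the network.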


Our team built an automated solution that compares new provider data files against the large volume of data in the client's network. The solution uses NLP to define and quantify a confidence ratio that identifies the best matches. It also uses a KNN classification model to classify the type of match according to definitions set by the client. This process takes less than two hours, saving 1,080 working hours monthly, and delivers greater confidence in the final product. The tool feeds into a user interface where the team can make a final review of non-matching cases to account for atypical ones. The results the tool generates are also stored in a normalized backend database for direct consumption by the client.
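
As a rough illustration of how results might be split before the final manual review, the sketch below separates rows that can be accepted from rows a reviewer should look at. The row fields, the `no_match` label, and the 0.85 threshold are assumptions made for the example, not details of the delivered tool.

```python
def split_for_review(results, threshold=0.85):
    """Separate confident matches from cases that need a human decision."""
    accepted, review = [], []
    for row in results:
        # Assumed rule: non-matches and low-confidence rows go to a reviewer.
        needs_review = row["match_type"] == "no_match" or row["confidence"] < threshold
        (review if needs_review else accepted).append(row)
    # Show the least certain cases first.
    review.sort(key=lambda row: row["confidence"])
    return accepted, review


# Made-up pipeline output.
results = [
    {"provider": "A", "match_type": "exact", "confidence": 0.97},
    {"provider": "B", "match_type": "no_match", "confidence": 0.40},
    {"provider": "C", "match_type": "name_only", "confidence": 0.62},
]
accepted, review = split_for_review(results)
```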