Building BetterBasket's product database with AI

Using entity recognition to extract item relationships

Vagelis Viskadouros

Feb 17, 2025

BetterBasket's AI Product Matching Algorithm

Overcoming the Challenges of Product Matching in Grocery

In today's competitive grocery landscape, effective price optimization demands tracking market and competitor prices. Achieving real-time accuracy remains challenging, however: grocers spend over 20 hours per store per week on manual data handling. The complexity increases with fresh produce and private-label products, where comparing items consistently across different sizes and units of measurement is difficult.

BetterBasket uses AI and entity recognition to automate price matching with high precision, so businesses can execute strategic pricing and stay ahead of market trends.

Entity Recognition and NLP

Entity recognition is a Natural Language Processing (NLP) technique that identifies and classifies specific entities within text. It extracts and categorizes essential data points, such as a product's brand or size, so that algorithms can draw connections and comparisons across vast datasets.

BetterBasket's entity recognition system is tailored to grocery and takes into account characteristics like product type, brand, size, origin, and more to allow for more accurate product mapping. Our system uses models that have been trained on millions of product data points. These models draw on a variety of features and techniques, such as:

• Word Embeddings: Capturing the context and meaning of product names and descriptions.
• Similarity Scoring: Using cosine similarity and Jaccard similarity algorithms to determine how closely two products align based on their attributes.
• Weighted Scoring Systems: Assigning different weights to entities like brand, size, and type, allowing for adaptable matching strategies across different categories.
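
As a rough illustration of the similarity and weighting ideas listed above, the following Python sketch scores two products attribute by attribute. It is not BetterBasket's implementation: the bag-of-words tokenization stands in for learned word embeddings, and the function names and the WEIGHTS values are assumptions made for this example.

```python
import math
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase a string and split it into word tokens."""
    return text.lower().replace(",", " ").split()


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (a stand-in for embeddings)."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical weights for one category; real values would be tuned per category.
WEIGHTS = {"type": 0.5, "brand": 0.2, "size": 0.3}


def weighted_score(a: dict[str, str], b: dict[str, str],
                   weights: dict[str, float]) -> float:
    """Combine per-entity Jaccard similarities into one normalized score."""
    total = sum(weights.values())
    score = sum(
        w * jaccard_similarity(a.get(field, ""), b.get(field, ""))
        for field, w in weights.items()
    )
    return score / total if total else 0.0
```

Under these assumptions, a category where brand matters less (private label, for instance) would simply use a smaller "brand" weight, while the scoring code stays the same.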

Case Study 1: Details in Egg Products

Consider the challenge of eggs. A simple algorithm might struggle to differentiate between “Large Brown Cage-Free Eggs” and “Medium White Organic Eggs,” resulting in poor accuracy or irrelevant matches. Our entity recognition system separates each available data point and aligns them based on relevant attributes like size, color, and organic status.

For example, for products "A" and "B" below:

     • Product A: "Large Brown Cage-Free Eggs, 12 Count"
     • Product B: "Medium White Organic Eggs, 18 Count"

The entity recognition algorithm compares:

     • Size: Large vs. Medium
     • Color: Brown vs. White
     • Quantity: 12 Count vs. 18 Count
     • Attributes: Cage-Free vs. Organic
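
The comparison above can be sketched in a few lines of Python. This is a simplified illustration that uses small hand-written vocabularies in place of a trained entity recognition model; the VOCAB contents and the function names extract_entities and compare are hypothetical.

```python
import re

# Hypothetical vocabularies; a production system would learn these from data.
VOCAB = {
    "size": {"small", "medium", "large", "jumbo"},
    "color": {"brown", "white"},
    "attributes": {"cage-free", "organic", "free-range"},
}
COUNT_RE = re.compile(r"(\d+)\s*count", re.IGNORECASE)


def extract_entities(name: str) -> dict[str, object]:
    """Pull size, color, attributes, and quantity out of a product name."""
    # Keep hyphenated words such as "cage-free" together as one token.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", name.lower())
    entities: dict[str, object] = {}
    for field, vocab in VOCAB.items():
        found = [w for w in words if w in vocab]
        entities[field] = found[0] if found else None
    match = COUNT_RE.search(name)
    entities["quantity"] = int(match.group(1)) if match else None
    return entities


def compare(a: str, b: str) -> dict[str, tuple]:
    """Align two products field by field: (value_a, value_b, is_equal)."""
    ea, eb = extract_entities(a), extract_entities(b)
    return {field: (ea[field], eb[field], ea[field] == eb[field]) for field in ea}


result = compare(
    "Large Brown Cage-Free Eggs, 12 Count",
    "Medium White Organic Eggs, 18 Count",
)
for field, (value_a, value_b, same) in result.items():
    print(f"{field}: {value_a} vs. {value_b} (match: {same})")
```

Once the fields are separated like this, a downstream pricing rule can react to a specific difference (such as quantity) instead of treating the two names as unrelated strings.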

By aligning products based on such attributes, we can accurately identify similarities and differences. This approach is especially valuable for items like eggs, where pricing strategies may apply different multipliers to 18-count versus 12-count SKUs, and it leads to more precise matching and pricing decisions.

Case Study 2: The Complexity of Private Labels

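Private label products now account for over 20% of grocery sales and significantly shape price perception. However, benchmarking these products adds complexity: each merchant sells under its own brand, so brand names rarely line up, and sizes often differ between merchants.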

This is where BetterBasket's entity recognition shines. For example, when comparing private label items, product type takes precedence over brand, which often holds less significance. Size variations between merchants are common, so we separate numerical sizes and units from the rest of the product description. Grocers can then set flexible parameters, such as a 15% size difference, to define what qualifies as a match.

By prioritizing product type and similar sizing over brand distinctions, our algorithm achieves accurate product matches, supporting sound pricing strategies for your key private label items.
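
A size tolerance like the 15% example above could be checked as follows. This is a minimal sketch under stated assumptions: the difference is measured relative to the larger of the two sizes, both sizes are already expressed in the same unit, and the names within_size_tolerance and is_private_label_match are invented for this illustration.

```python
def within_size_tolerance(size_a: float, size_b: float,
                          tolerance: float = 0.15) -> bool:
    """True when two sizes differ by at most `tolerance` of the larger size."""
    if size_a <= 0 or size_b <= 0:
        return False
    return abs(size_a - size_b) / max(size_a, size_b) <= tolerance


def is_private_label_match(a: dict, b: dict, tolerance: float = 0.15) -> bool:
    """Match on product type and comparable size; brand is deliberately ignored."""
    return (
        a["type"] == b["type"]
        and a["unit"] == b["unit"]
        and within_size_tolerance(a["size"], b["size"], tolerance)
    )


store_a = {"brand": "Store A Brand", "type": "peanut butter", "size": 16.0, "unit": "oz"}
store_b = {"brand": "Store B Brand", "type": "peanut butter", "size": 14.5, "unit": "oz"}
print(is_private_label_match(store_a, store_b))  # sizes differ by about 9%
```

Measuring against the larger size keeps the check symmetric, so the result does not depend on which merchant's product is listed first. Other definitions of "size difference" are equally plausible.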

Technology

Our algorithm is meticulously designed to handle the nuances inherent in food and beverage products. The process begins with identifying the product's UPC (Universal Product Code). If the UPC is recognized, we can precisely extract key components related to the brand and product attributes.

The first step involves standardizing brands. We start by examining the UPC prefix, comparing it against our database of known code and brand pairs. If a match is found, we further validate this pairing using cosine similarity to ensure accuracy. In cases where the UPC is unrecognized or lacks a known prefix, we leverage a Large Language Model (LLM) to analyze the product details and extract the brand information if available.
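
The brand step described above might look roughly like the sketch below. Everything specific here is an assumption: the KNOWN_PREFIXES table, the six-digit prefix length, the 0.5 threshold, and the choice to fall back to the LLM when validation fails. The similarity function and the LLM call are passed in as plain callables, with simple stand-ins so the example runs on its own.

```python
from typing import Callable, Optional

# Hypothetical table of UPC prefix -> brand pairs.
KNOWN_PREFIXES = {"012345": "Acme Farms", "098765": "Green Valley"}


def resolve_brand(
    upc: Optional[str],
    description: str,
    similarity: Callable[[str, str], float],
    llm_extract_brand: Callable[[str], Optional[str]],
    threshold: float = 0.5,
    prefix_len: int = 6,
) -> Optional[str]:
    """Look up the brand by UPC prefix, validate it, else ask the LLM."""
    candidate = KNOWN_PREFIXES.get(upc[:prefix_len]) if upc else None
    # Accept the prefix match only if it also agrees with the product text.
    if candidate is not None and similarity(candidate, description) >= threshold:
        return candidate
    # Unknown UPC, unknown prefix, or failed validation: fall back to the LLM.
    return llm_extract_brand(description)


def token_overlap(brand: str, description: str) -> float:
    """Stand-in for cosine similarity: share of brand tokens found in the text."""
    brand_tokens = set(brand.lower().split())
    text_tokens = set(description.lower().split())
    return len(brand_tokens & text_tokens) / len(brand_tokens) if brand_tokens else 0.0


def fake_llm(description: str) -> Optional[str]:
    """Stand-in for an LLM call; here it just takes the first two words."""
    words = description.split()
    return " ".join(words[:2]) if len(words) >= 2 else None


print(resolve_brand("012345678905", "Acme Farms Large Brown Eggs", token_overlap, fake_llm))
print(resolve_brand(None, "Sunny Hen Organic Eggs", token_overlap, fake_llm))
```

Keeping the similarity function and the LLM call as parameters makes the lookup logic easy to test without a model or network access.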

Once the brand is identified, we proceed to determine the product sizing by focusing on four key elements: unit of measurement, size, pack size, and pack unit. If our algorithm detects all four components, we assign the corresponding values. If not, we prompt the LLM to extract any missing details and add any new entries to our dictionary of aliases, continuously refining our system for future matches.
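
To make the four sizing elements concrete, here is a small regex-based sketch of the extraction step. It is an illustration only: the alias dictionaries are tiny hypothetical samples, parse_sizing and llm_fill are invented names, and the step that writes newly learned aliases back to the dictionary is left out.

```python
import re
from typing import Callable, Optional

# Hypothetical alias dictionaries mapping spellings to canonical units.
UNIT_ALIASES = {"fl oz": "fl oz", "ounces": "oz", "ounce": "oz", "oz": "oz",
                "lbs": "lb", "lb": "lb", "ml": "ml", "g": "g", "l": "l"}
PACK_UNIT_ALIASES = {"pack": "pack", "pk": "pack", "count": "count", "ct": "count"}


def _alternation(aliases: dict[str, str]) -> str:
    """Build a regex alternation, longest alias first."""
    return "|".join(re.escape(a) for a in sorted(aliases, key=len, reverse=True))


SIZE_RE = re.compile(rf"(\d+(?:\.\d+)?)\s*({_alternation(UNIT_ALIASES)})\b", re.IGNORECASE)
PACK_RE = re.compile(rf"(\d+)\s*-?\s*({_alternation(PACK_UNIT_ALIASES)})\b", re.IGNORECASE)


def parse_sizing(
    text: str,
    llm_fill: Optional[Callable[[str, list[str]], dict]] = None,
) -> dict:
    """Extract size, unit, pack size, and pack unit from a product description."""
    result = {"size": None, "unit": None, "pack_size": None, "pack_unit": None}
    size_match = SIZE_RE.search(text)
    if size_match:
        result["size"] = float(size_match.group(1))
        result["unit"] = UNIT_ALIASES[size_match.group(2).lower()]
    pack_match = PACK_RE.search(text)
    if pack_match:
        result["pack_size"] = int(pack_match.group(1))
        result["pack_unit"] = PACK_UNIT_ALIASES[pack_match.group(2).lower()]
    # Hand any fields the rules could not find to the (assumed) LLM fallback.
    missing = [field for field, value in result.items() if value is None]
    if missing and llm_fill is not None:
        result.update(llm_fill(text, missing))
    return result


print(parse_sizing("Sparkling Water 12 fl oz, 6 pack"))
print(parse_sizing("Large Brown Cage-Free Eggs, 12 Count"))
```

In the second call only the pack fields are found, which is the situation where the fallback described above would be asked for the remaining details.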