Our Technology

Identity Graph & Probabilistic Matching

Our Unique Matching Process

Our Modest Research on this Matter

Our user matching solution is completely unsupervised: ground truth for such matches is unavailable. Most matching methods in the academic literature, as well as industry solutions, require ground truth and labels.

1 - Matching Sessions

Our algorithms use user sessions to infer matches between users. Sessions (which are available on all known data sources) carry the most valuable information for the matching problem: user-agent and device properties (browser, OS, screen size…), as well as temporal and geolocation data, are all available at the session level. Sessions can also be expected to be recorded in similar ways across your business's available databases, from Google Analytics to CRMs and retailers.

We compute the probability that two sessions match, which is then used to cluster their respective users. The more sessions a user is known to have, the higher our confidence in any match involving that user.
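To make the idea concrete, here is a minimal sketch of such a pairwise score, assuming a simplified session record and hand-picked feature weights; the actual features and their calibration are specific to each deployment.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Session:
    """Simplified, hypothetical session record; field names are illustrative only."""
    browser: str
    os: str
    screen_size: str
    geo: str          # e.g. a coarse location code
    timestamp: datetime

def session_match_probability(a: Session, b: Session) -> float:
    """Toy pairwise score: agreement on device and geo features, discounted by the
    time gap. Real weights would be calibrated on each data source, not hand-picked."""
    score = 0.0
    score += 0.3 * (a.browser == b.browser)
    score += 0.2 * (a.os == b.os)
    score += 0.2 * (a.screen_size == b.screen_size)
    score += 0.2 * (a.geo == b.geo)
    gap_days = abs((a.timestamp - b.timestamp).days)
    score += 0.1 * max(0.0, 1.0 - gap_days / 30.0)   # far-apart sessions are weaker evidence
    return min(score, 1.0)
```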

If you have data sources that are not structured with sessions, our algorithms are flexible enough to extract valuable information and still utilize the data for matching.

2 - Divide-and-Conquer: Clustering Sessions

Our algorithms are optimized for both accuracy and speed. A naive approach would incur quadratic time in the number of sessions, which becomes prohibitive with only a few thousand sessions [Sar12]. Instead, our algorithm scales to billions of sessions in reasonable time. We eliminate highly unlikely matches so that we only compare sessions that could plausibly be a match: we start by partitioning the set of all sessions into clusters of likely candidates, leveraging our proprietary techniques to create meaningful clusters from their identifying attributes and timestamps.
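As an illustration of the blocking idea (not our proprietary partitioning), the sketch below groups sessions by a cheap key so that only sessions sharing that key are ever compared; the dict-shaped sessions and the key choice are assumptions made for the example.

```python
from collections import defaultdict

def block_sessions(sessions, key_fn):
    """Partition sessions into candidate blocks so that only sessions sharing a
    blocking key are ever compared, avoiding the quadratic all-pairs comparison."""
    blocks = defaultdict(list)
    for s in sessions:
        blocks[key_fn(s)].append(s)
    return blocks

# Illustrative only: block on coarse location plus the day of the session.
sessions = [
    {"geo": "paris", "day": "2023-05-01", "browser": "firefox"},
    {"geo": "paris", "day": "2023-05-01", "browser": "chrome"},
    {"geo": "lyon",  "day": "2023-05-02", "browser": "safari"},
]
blocks = block_sessions(sessions, key_fn=lambda s: (s["geo"], s["day"]))
# -> {("paris", "2023-05-01"): [two sessions], ("lyon", "2023-05-02"): [one session]}
```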


To alleviate any risk of missing matches in the clustering step, we also keep track of cluster canopies. Cluster canopies [Cal00, Kop10] are overlapping supersets built so that every session's nearest neighbors are considered for matches regardless of which cluster they belong to. The figure below illustrates the overlap between canopies. This lets us keep all potential matches while providing a fast algorithm that scales linearly with the number of sessions.
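For reference, here is a minimal sketch of canopy construction in the spirit of [Cal00], using a cheap distance function and two assumed thresholds: a loose one for canopy membership and a tight one for retiring future centers. Our production implementation and its calibration differ.

```python
def build_canopies(sessions, distance, loose_t, tight_t):
    """Build overlapping canopies: every session within `loose_t` of the current
    center joins its canopy (and may join others too), while sessions within
    `tight_t` stop being candidates for future centers. Requires loose_t > tight_t."""
    remaining = set(range(len(sessions)))
    canopies = []
    while remaining:
        center = remaining.pop()
        canopy = {center}
        for idx in list(remaining):
            d = distance(sessions[center], sessions[idx])
            if d < loose_t:
                canopy.add(idx)           # overlap: idx may end up in several canopies
            if d < tight_t:
                remaining.discard(idx)    # close enough to never become a center itself
        canopies.append(canopy)
    return canopies
```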

Our algorithm utilizes carefully calibrated distance metrics and probabilistic models to provide matches across data sources, even without available ground truth.

3 - Probabilistic Matching and Gibbs Sampling

We then compute a similarity metric among the likely candidates using the extracted features, and select the sessions that maximize this similarity. During this step, we compute a confidence measure on every match using Gibbs sampling. Thanks to our probabilistic matching and confidence measures, a mismatch between sessions is unlikely to create a mismatch between users: session matches with lower probabilities carry less weight in the matching of the associated users.
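As a simplified illustration of how sampling can yield such a confidence measure (our actual probabilistic model is more involved), the toy sketch below runs a Gibbs sampler over cluster assignments and reports, for each pair of sessions, the fraction of samples in which they land in the same cluster; every parameter here is an assumption for the example.

```python
import math
import random

def gibbs_match_confidence(n, similarity, n_iters=200, burn_in=50, seed=0):
    """Toy Gibbs sampler over cluster assignments of n sessions.

    Each session is reassigned in turn with probability proportional to
    exp(average similarity to the candidate cluster); opening a new singleton
    cluster is always an option. The confidence of a pair is the fraction of
    post-burn-in sweeps in which both sessions share a cluster."""
    rng = random.Random(seed)
    assign = list(range(n))                      # start from singleton clusters
    together = [[0] * n for _ in range(n)]
    kept = 0
    for sweep in range(n_iters):
        for i in range(n):
            labels = sorted({assign[j] for j in range(n) if j != i})
            labels.append(max(assign) + 1)       # option to open a new cluster
            weights = []
            for lbl in labels:
                members = [j for j in range(n) if j != i and assign[j] == lbl]
                avg_sim = sum(similarity[i][j] for j in members) / len(members) if members else 0.0
                weights.append(math.exp(avg_sim))
            r = rng.random() * sum(weights)
            acc = 0.0
            for lbl, w in zip(labels, weights):
                acc += w
                if r <= acc:
                    assign[i] = lbl
                    break
        if sweep >= burn_in:
            kept += 1
            for i in range(n):
                for j in range(i + 1, n):
                    if assign[i] == assign[j]:
                        together[i][j] += 1
    return {(i, j): together[i][j] / kept for i in range(n) for j in range(i + 1, n)}
```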

4 - From Matched Sessions to Matched Users

Once we have established matched sessions with a high confidence level, we can match the associated users together. An additional probabilistic matching step ensures that an erroneous match between sessions will not cause several separate users to be incorrectly merged.

This step is modeled as a correlation clustering problem. We represent sessions as vertices of a graph, where edges between sessions are weighted by the match probability estimated in the divide-and-conquer step. The goal is then to find clusters of sessions that maximize agreement and minimize disagreement [Dem06]. We then match the users associated with the clustered sessions.

Such a problem has been proven to be NP-hard [Ban04]. Thanks to our divide-and-conquer method, we deal with small, disjoint graphs, and heuristics let us approach the optimal solution in reduced time.
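As one example of such a heuristic (not necessarily the one we deploy), here is a sketch of a random-pivot scheme for correlation clustering: repeatedly pick a pivot session and cluster with it every remaining session whose estimated match probability is high enough. The threshold below is an assumption for the example.

```python
import random

def pivot_correlation_clustering(sessions, match_prob, threshold=0.5, seed=0):
    """Random-pivot heuristic: pick a pivot, group with it every remaining session
    whose estimated match probability exceeds `threshold`, then recurse on the rest."""
    rng = random.Random(seed)
    remaining = list(sessions)
    clusters = []
    while remaining:
        pivot = remaining.pop(rng.randrange(len(remaining)))
        cluster, rest = [pivot], []
        for s in remaining:
            (cluster if match_prob(pivot, s) > threshold else rest).append(s)
        remaining = rest
        clusters.append(cluster)
    return clusters
```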

Robustness to Heterogeneous Data Sources

Each and every step of our algorithm is calibrated for your business's specific needs and for the desired confidence level on the resulting user matching. Thanks to our algorithm's performance and flexibility, we are able to add new data sources as our partnership continues.

Our probabilistic model also allows augmenting datasets with truncated data or data from less reliable sources. Once calibrated and tested on the data sources of a specific brand and country, our matching algorithm can also be scaled to other brands that use the same data sources.

Our algorithm's results are also entirely explainable, giving your business's teams a full understanding of our results. Not only should this increase the teams' trust in our algorithm, it also enables us to tweak the model while it is deployed and working on real data.

References

  • [Cal00] McCallum, Andrew, Kamal Nigam, and Lyle H. Ungar. "Efficient clustering of high-dimensional data sets with application to reference matching." Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. 2000.
  • [Ban04] Bansal, Nikhil, Avrim Blum, and Shuchi Chawla. "Correlation clustering." Machine learning 56.1-3 (2004): 89-113.
  • [Dem06] Demaine, Erik D., et al. "Correlation clustering in general weighted graphs." Theoretical Computer Science 361.2-3 (2006): 172-187.
  • [Kop10] Köpcke, Hanna, and Erhard Rahm. "Frameworks for entity matching: A comparison." Data & Knowledge Engineering 69.2 (2010): 197-210.
  • [Sar12] Das Sarma, Anish, et al. "An automatic blocking mechanism for large-scale de-duplication tasks." Proceedings of the 21st ACM international conference on Information and knowledge management. 2012.