Meghan Mergui

Structured Clustering on Large Data Sources

Data Scientist – Intuit

Bio

Meghan is a new data scientist at Intuit, having joined in January 2023. She enjoys pilates, dance, training, baking, and spending time with friends and family. She lives with her husband, Ruben, and their three kids, Ezra, Shirel and Meir, in Ramat Gan.

She is originally from France and came to Israel seven years ago.

Meghan has an MSc in Industrial Engineering and Applied Mathematics from France and an MSc in Machine Learning and Data Science from the Technion. Her thesis was on Behavioral Economics and Data Science. During her MSc at the Technion, she started working at Data Science Consulting Group, a data science consulting company that works with all kinds of clients.

Three years ago, she started working at BigID, a startup that quickly became a unicorn, helping other companies to find and manage sensitive data for their clients.

Abstract

Companies of all sorts store their data for years in different data sources. They can face major challenges in analyzing, managing, and organizing this data, including data discovery and data governance challenges such as those arising from the General Data Protection Regulation (GDPR).

Finding all similar or duplicate data across large data sources is challenging. Clustering similar columns can facilitate data management and is helpful during data analysis. Clustered columns can be classified and labeled across the system, allowing convenient management and analysis of the stored data. Moreover, clustering can help remove duplicate (or near-duplicate) data and reduce storage. The results can be represented in a graph database, allowing fast querying and retrieval of similar data subsets.

Common approaches usually rely on schema matching (regex) or column name similarity. This talk instead presents a method based on the data content of the columns, with no need for prior knowledge. The method addresses both data pattern matching (pattern-based data similarity) and data content matching in a scalable manner.
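
To make the idea of pattern-based data similarity concrete, here is a minimal Python sketch. It is an illustration only and not the method presented in the talk: the character classes, the helper names (value_pattern, pattern_profile, cosine_similarity) and the choice to collapse repeated characters are all assumptions made for this example. Each value is reduced to a coarse shape, a column becomes a frequency profile of shapes, and two profiles are compared with cosine similarity, so no column names or schema knowledge are needed.

```python
import math
import re
from collections import Counter


def value_pattern(value: str) -> str:
    """Reduce a value to a coarse shape (hypothetical scheme for illustration).

    Digits become '9', letters become 'A', other characters are kept,
    and runs of the same symbol are collapsed to a single symbol.
    """
    shape = []
    for ch in value:
        if ch.isdigit():
            shape.append("9")
        elif ch.isalpha():
            shape.append("A")
        else:
            shape.append(ch)
    # Collapse runs such as "9999" into "9" so value length does not matter.
    return re.sub(r"(.)\1*", r"\1", "".join(shape))


def pattern_profile(values) -> dict:
    """Relative frequency of each shape among a column's values."""
    counts = Counter(value_pattern(str(v)) for v in values)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


def cosine_similarity(p: dict, q: dict) -> float:
    """Cosine similarity between two sparse pattern profiles."""
    dot = sum(weight * q.get(pattern, 0.0) for pattern, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


if __name__ == "__main__":
    # Made-up sample columns; the names are irrelevant to the comparison.
    birth_dates = ["2023-01-15", "1999-12-31", "2001-07-04"]
    signup_dates = ["2010-03-09", "2015-11-20"]
    emails = ["ana@example.com", "bob@example.org"]
    print(cosine_similarity(pattern_profile(birth_dates), pattern_profile(signup_dates)))
    print(cosine_similarity(pattern_profile(birth_dates), pattern_profile(emails)))
```

In this sketch two date columns share the same shape profile even though they have no values in common, while a date column and an email column share none, which is the kind of signal column-name matching cannot provide.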

Because performance and scale matter, exhaustive pairwise comparison of columns is not feasible. Instead, proxy hash fingerprints, based on the column values and metadata, are used. Comparing the proxy hash fingerprints gives a good estimate of similarity measures such as Cosine and Jaccard similarity applied between vectors representing the original data subsets (e.g., columns). Moreover, it significantly reduces the search space and allows the search and comparison to scale.
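
The abstract does not name the hashes used in the fingerprints, so the following Python sketch uses MinHash with locality-sensitive banding purely as a well-known stand-in for the general idea. Everything here is an assumption for illustration: the function names, the use of BLAKE2b as the hash, and the parameters num_hashes=128 and bands=32. It shows how a fixed-size fingerprint per column can estimate Jaccard similarity, and how bucketing fingerprint bands yields candidate pairs without comparing every column to every other.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of a value under a given seed."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_fingerprint(values, num_hashes: int = 128) -> list:
    """Fixed-size fingerprint: the minimum hash of the distinct values per seed."""
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("cannot fingerprint an empty column")
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def estimate_jaccard(fp_a: list, fp_b: list) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)


def exact_jaccard(a, b) -> float:
    """Exact Jaccard similarity, used here only to check the estimate."""
    set_a, set_b = {str(v) for v in a}, {str(v) for v in b}
    return len(set_a & set_b) / len(set_a | set_b)


def candidate_pairs(fingerprints: dict, bands: int = 32) -> set:
    """Columns that share at least one identical band become candidates.

    Only columns that land in the same bucket are ever paired, which avoids
    scoring all column pairs.
    """
    rows = len(next(iter(fingerprints.values()))) // bands
    buckets = defaultdict(set)
    for name, fp in fingerprints.items():
        for band in range(bands):
            key = (band, tuple(fp[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


if __name__ == "__main__":
    # Made-up columns: "b" overlaps "a" by half of its values, "c" is disjoint.
    columns = {
        "a": range(300),
        "b": range(150, 450),
        "c": range(1000, 1300),
    }
    fps = {name: minhash_fingerprint(vals) for name, vals in columns.items()}
    print(estimate_jaccard(fps["a"], fps["b"]), exact_jaccard(columns["a"], columns["b"]))
    print(candidate_pairs(fps))
```

The number of hashes trades fingerprint size against estimation error, and the number of bands controls how similar two columns must be before they are likely to share a bucket; both would need tuning on real data.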

Speaker Meghan Mergui worked on this project from research through productization and a patent as part of her work at BigID. She tested her algorithm on synthetic data. The product is used by all customers and is one of the company's key components. With approval from her previous employer, she will present the hashes used in the fingerprints and the algorithm itself.

Planned Agenda

8:45 Reception
9:30 Opening words by WiDS TLV ambassador Nitzan Gado and by Lily Ben Ami, CEO of the Michal Sela Forum
9:50 Prof. Bracha Shapira – Data Challenges in Recommender Systems Research: Insights from Bundle Recommendation
10:20 Juan Liu – Accounting Automation: Making Accounting Easier So That People Can Forget About It
10:50 Break
11:00 Lightning talks
12:20 Lunch & poster session
13:20 Roundtable session & poster session
14:05 Roundtable closure
14:20 Break
14:30 Merav Mofaz – “Every Breath You Take and Every Move You Make…I'll Be Watching You:” The Sensitive Side of Smartwatches
14:50 Reut Yaniv – Ad Serving in the Online Geo Space Along Routes
15:10 Rachel Wities – It’s Not Just the Doctor’s Handwriting: Challenges and Opportunities in Healthcare NLP
15:30 Closing remarks
15:40 End