Meghan is a new data scientist at Intuit, having joined in January 2023. She enjoys Pilates, dance, training, baking, and spending time with friends and family. She lives with her husband, Ruben, and their three kids, Ezra, Shirel and Meir, in Ramat Gan.
She is originally from France and came to Israel seven years ago.
Meghan has an MSc in Industrial Engineering and Applied Mathematics from France and an MSc in Machine Learning and Data Science from the Technion. Her thesis was on Behavioral Economics and Data Science. During her MSc at the Technion, she started working at Data Science Consulting Group, a data science consulting company that works with all kinds of clients.
Three years ago, she started working at BigID, a startup that quickly became a unicorn and helps other companies find and manage sensitive data for their clients.
Companies of all sorts store their data for years in different data sources. They can face major challenges in analyzing, managing, and organizing that data. These include data discovery and data governance challenges, such as meeting the requirements of the General Data Protection Regulation (GDPR).
Finding all similar or duplicate data across large data sources is challenging. Clustering similar columns facilitates data management and helps during data analysis. Clustered columns can be classified and labeled across the system, allowing convenient management and analysis of the stored data. Clustering can also assist in removing duplicate (or near-duplicate) data and so help improve data storage. The results can be represented in a graph database, which allows fast querying for similar data subsets.
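As an illustration of the clustering idea above, the following minimal sketch treats columns as graph nodes and similarity results as edges, then groups columns into connected components with a small union-find. The column names, the `cluster_columns` helper, and the choice of connected components as the clustering rule are assumptions made for this example; the abstract does not say how the actual product forms its clusters or which graph database it uses.

```python
from typing import Iterable, List, Tuple


def cluster_columns(
    columns: Iterable[str], similar_pairs: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Group columns into clusters: two columns share a cluster when a
    chain of 'similar' edges connects them (connected components)."""
    columns = list(columns)
    parent = {c: c for c in columns}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for c in columns:
        clusters.setdefault(find(c), []).append(c)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical column identifiers and similarity edges.
    cols = ["crm.email", "billing.contact_email", "hr.salary", "logs.user_mail"]
    edges = [
        ("crm.email", "billing.contact_email"),
        ("billing.contact_email", "logs.user_mail"),
    ]
    for cluster in cluster_columns(cols, edges):
        print(cluster)
```

In this toy input the three email-like columns end up in one cluster and `hr.salary` stays on its own, so a label assigned to one member of a cluster could be propagated to the others.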
Common approaches usually rely on schema matching (regex) or column name similarity. This talk instead presents a method based on the data content of the columns, with no need for prior knowledge. The method addresses both data pattern matching (pattern-based data similarity) and data content matching in a scalable manner.
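To make "pattern-based data similarity" concrete, here is a small sketch under stated assumptions: each value is reduced to a character-class pattern (letters become `A`, digits become `9`, everything else is kept), and two columns are compared by the Jaccard similarity of their pattern sets. The pattern alphabet and the helper names `value_pattern` and `pattern_similarity` are hypothetical; the abstract does not describe how the talk's method derives patterns.

```python
from typing import Iterable


def value_pattern(value: str) -> str:
    """Map a value to a character-class pattern, e.g. 'AB-1234' -> 'AA-9999'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep punctuation and separators as-is
    return "".join(out)


def pattern_similarity(col_a: Iterable[str], col_b: Iterable[str]) -> float:
    """Jaccard similarity between the sets of patterns seen in two columns."""
    patterns_a = {value_pattern(v) for v in col_a}
    patterns_b = {value_pattern(v) for v in col_b}
    union = patterns_a | patterns_b
    if not union:
        return 0.0
    return len(patterns_a & patterns_b) / len(union)


if __name__ == "__main__":
    phones_a = ["054-1234567", "052-7654321"]
    phones_b = ["050-1112223"]
    emails = ["a@b.com"]
    print(pattern_similarity(phones_a, phones_b))  # same format, different values
    print(pattern_similarity(phones_a, emails))    # different formats
```

The two phone columns share no values yet have identical patterns, which is the kind of match that content comparison alone would miss and that column-name matching would only catch if the names happened to be similar.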
Because performance and scale are critical, exhaustive pairwise comparison of the raw data is not feasible. Instead, proxy hash fingerprints, based on the column values and metadata, are used. Comparing the proxy hash fingerprints gives a good estimate of similarity measures, such as cosine and Jaccard similarity, applied between vectors representing the original data subsets (e.g., columns). Moreover, it significantly reduces the search space and allows the search and comparison to scale.
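One well-known way to build a fingerprint whose comparison estimates Jaccard similarity is MinHash, sketched below. This is offered only as an illustration of the general idea; the abstract does not name the hashes used in the talk, and the function names, the use of salted BLAKE2b as the hash family, and the fingerprint length of 128 are all assumptions of this example.

```python
import hashlib
from typing import Iterable, List


def _hash(value: str, seed: int) -> int:
    """64-bit hash of a value; the seed is used as a salt to get a hash family."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_fingerprint(values: Iterable[object], num_hashes: int = 128) -> List[int]:
    """Fixed-size fingerprint: the minimum hash of the column's distinct
    values under each of num_hashes seeded hash functions."""
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("cannot fingerprint an empty column")
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def estimate_jaccard(fp_a: List[int], fp_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)


def exact_jaccard(col_a: Iterable[object], col_b: Iterable[object]) -> float:
    """Reference value computed on the raw data, for comparison only."""
    set_a, set_b = {str(v) for v in col_a}, {str(v) for v in col_b}
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    col_a = range(0, 1000)
    col_b = range(500, 1500)
    fp_a, fp_b = minhash_fingerprint(col_a), minhash_fingerprint(col_b)
    print("estimated:", estimate_jaccard(fp_a, fp_b))
    print("exact:    ", exact_jaccard(col_a, col_b))
```

Each column is hashed once, and afterwards only the short fingerprints are compared, regardless of how many rows the column has. A production system would typically add an indexing step (for example locality-sensitive hashing over the fingerprints) so that only likely matches are compared at all; that step is omitted here.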
Speaker Meghan Mergui worked on this project from research to productization and a patent as part of her work at BigID. She tested her algorithm on synthetic data. The product is used by all customers and is one of the company's key components. With approval from her previous employer, she will present the hashes used in the fingerprints and the algorithm itself.
8:45 | Reception |
---|---|
9:30 | Opening words by WiDS TLV ambassador Nitzan Gado and by Lily Ben Ami, CEO of the Michal Sela Forum |
9:50 | Prof. Bracha Shapira – Data Challenges in Recommender Systems Research: Insights from Bundle Recommendation |
10:20 | Juan Liu – Accounting Automation: Making Accounting Easier So That People Can Forget About It |
10:50 | Break |
11:00 | Lightning talks |
12:20 | Lunch & poster session |
13:20 | Roundtable session & poster session |
14:05 | Roundtable closure |
14:20 | Break |
14:30 | Merav Mofaz – “Every Breath You Take and Every Move You Make…I'll Be Watching You:” The Sensitive Side of Smartwatches |
14:50 | Reut Yaniv – Ad Serving in the Online Geo Space Along Routes |
15:10 | Rachel Wities – It’s Not Just the Doctor’s Handwriting: Challenges and Opportunities in Healthcare NLP |
15:30 | Closing remarks |
15:40 | End |
WiDS Tel Aviv is an independent event organized by Intuit’s WiDS TLV ambassadors as part of the annual WiDS Worldwide conference, the WiDS Datathon, and an estimated 200 WiDS Regional Events worldwide. Everyone is invited to attend all WiDS conference and WiDS Datathon Workshop events, which feature outstanding women doing outstanding work.
© 2018-2023 WiDS TLV – Intuit. All rights reserved.