Table Discovery in Data Lakes: State-of-the-art and Future Directions

tutorial

Authors: Grace Fan, Jin Wang, Yuliang Li, and Renée J. Miller

SIGMOD '23: Companion of the 2023 International Conference on Management of Data

June 2023

Pages 69 - 75

Published: 05 June 2023

    Abstract

    Data discovery refers to a set of tasks that enable users and downstream applications to explore and gain insights from massive collections of data sources such as data lakes. In this tutorial, we will provide a comprehensive overview of the most recent table discovery techniques developed by the data management community. We will cover table understanding tasks such as domain discovery, table annotation, and table representation learning which help data lake systems capture semantics of tables. We will also cover techniques enabling various query-driven discovery and table exploration tasks, as well as how table discovery can support key data science applications such as machine learning and knowledge base construction. Finally, we will discuss future research directions on developing new table discovery paradigms by combining structured knowledge and dense table representations, as well as improving the efficiency of discovery using state-of-the-art indexing techniques, and more.
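    As a rough illustration of the query-driven discovery the abstract mentions, the sketch below ranks data lake columns by how much of a query column they contain, a simple proxy for joinability. It is not the method of any system covered in the tutorial: the exact set-containment score, the function names `containment` and `top_k_joinable`, and the toy `lake` are all assumptions made for this example, and a real data lake would need an index rather than a full scan.

```python
from typing import Dict, List, Set, Tuple


def containment(query: Set[str], candidate: Set[str]) -> float:
    """Fraction of the query column's distinct values found in the candidate."""
    if not query:
        return 0.0
    return len(query & candidate) / len(query)


def top_k_joinable(
    query: Set[str], lake: Dict[str, Set[str]], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every column in the lake (brute force) and return the k best."""
    scored = [(name, containment(query, values)) for name, values in lake.items()]
    # Highest containment first; ties broken by column name for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical toy data lake: column name -> set of distinct cell values.
lake = {
    "cities.name": {"toronto", "boston", "seattle", "austin"},
    "teams.city": {"toronto", "boston", "chicago"},
    "products.sku": {"a1", "b2", "c3"},
}
query = {"toronto", "boston", "seattle"}

results = top_k_joinable(query, lake, k=2)
for name, score in results:
    print(f"{name}: {score:.2f}")
```

    Containment is used here instead of Jaccard similarity because it is not penalized when the candidate column is much larger than the query column, which is common in data lakes.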

    Index Terms

    1. Table Discovery in Data Lakes: State-of-the-art and Future Directions

      1. Information systems

        1. Data management systems

          1. Database design and models

            1. Information integration

              1. Mediators and data integration

            2. Information retrieval

              1. Retrieval models and ranking

                1. Top-k retrieval in databases

          Published In

          SIGMOD '23: Companion of the 2023 International Conference on Management of Data

          June 2023

          330 pages

          ISBN:9781450395076

          DOI:10.1145/3555041

          • General Chairs: Sudipto Das (Amazon Web Services, USA), Ippokratis Pandis (Amazon Web Services, USA)
          • Program Chairs: K. Selçuk Candan (Arizona State University, USA), Sihem Amer-Yahia (CNRS, Université Grenoble Alpes, France)

          Copyright © 2023 ACM.

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Sponsors

          • SIGMOD: ACM Special Interest Group on Management of Data

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. data integration
          2. data lake
          3. dataset discovery
          4. unionable tables

          Qualifiers

          • Tutorial

          Funding Sources

          • National Science Foundation

          Conference

          SIGMOD/PODS '23

          Sponsor:

          • SIGMOD

          Acceptance Rates

          Overall Acceptance Rate 785 of 4,003 submissions, 20%
