Abstract
Funding Organization: Japan Society for the Promotion of Science, System Name: Grants-in-Aid for Scientific Research, Category: Grant-in-Aid for Scientific Research (C), Fund Type: competitive_research_funding, Overall Grant Amount: - (direct: 2300000, indirect: 690000)
Similarity joins on massive datasets are useful for detecting many-to-many relationships within the target datasets. However, many join algorithms for various similarity functions are known to perform unstably on map/reduce systems. The objective of this research is to clarify the causes of this instability and to resolve it. To that end, the research proposes two new algorithmic frameworks. The first is a hybrid-hash join enhanced with bucket-regrouping techniques, named HSJ+BR. It resolves unexpected load imbalance between reducers without requiring intermediate map/reduce jobs. The second is a two-stage hash-partitioning strategy. It can greatly reduce the shuffle overhead caused by the excessive record replication associated with many similarity join algorithms. Using these two frameworks, the research shows that similarity joins on map/reduce systems achieve stable and efficient performance, with m-to-n equi-join and edit-distance join used as typical cases.
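The general idea of regrouping hash buckets to even out reducer load can be illustrated with a minimal sketch. This is not the HSJ+BR algorithm, whose details are not given here. It assumes that an estimated cost per hash bucket is already known and swaps in a simple greedy largest-first heuristic. The names `regroup_buckets` and `reducer_loads`, and the cost figures, are hypothetical.

```python
import heapq


def regroup_buckets(bucket_costs, num_reducers):
    """Assign buckets to reducers with a greedy largest-first heuristic.

    bucket_costs maps a bucket id to its estimated processing cost
    (an assumed input; how costs are estimated is outside this sketch).
    Returns a dict mapping bucket id to reducer index.
    """
    # Min-heap of (current load, reducer index): the least-loaded reducer pops first.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Place the most expensive buckets first; ties are broken by bucket id.
    for bucket, cost in sorted(bucket_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[bucket] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment


def reducer_loads(bucket_costs, assignment, num_reducers):
    """Sum the estimated cost that each reducer receives under an assignment."""
    loads = [0] * num_reducers
    for bucket, cost in bucket_costs.items():
        loads[assignment[bucket]] += cost
    return loads


# Hypothetical skewed workload: even-numbered buckets are expensive.
costs = {0: 8, 1: 1, 2: 7, 3: 1, 4: 6, 5: 1}
naive = {b: b % 2 for b in costs}          # plain modulo partitioning
balanced = regroup_buckets(costs, 2)       # greedy regrouping
naive_loads = reducer_loads(costs, naive, 2)
balanced_loads = reducer_loads(costs, balanced, 2)
```

With this toy input, plain modulo partitioning sends all three expensive buckets to the same reducer, while the greedy regrouping spreads them across both. Because the regrouping is a mapping computed from bucket statistics, it can in principle be applied in the partitioner of the same job, which is the kind of property the abstract attributes to HSJ+BR.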