At work today we needed to filter out malformed email domains, ones that look deceptively similar to real domains, from a list of 4.2 million email ids. For example:
- gnail.com or gmali.com instead of gmail.com
- hotmali.com instead of hotmail.com
The problem boils down to two things:
- How do you define “similar”?
- How do you do this efficiently for a list of 4.2 million email ids?
After staring at the problem for a moment, it occurred to me that this is a perfect candidate for the Levenshtein distance (aka edit distance) algorithm.
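The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. A minimal pure-Python sketch (just for illustration; the script itself uses a compiled package) makes the idea concrete:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Dynamic programming over prefixes: prev[j] holds the distance
    # between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca with cb
            ))
        prev = curr
    return prev[-1]

print(levenshtein("gnail.com", "gmail.com"))  # 1: one substitution
print(levenshtein("gmali.com", "gmail.com"))  # 2: a swap costs two edits
```

So a threshold like “distance of 2 or less” catches both the single-typo and the swapped-letter variants of a real domain.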
I decided to use pandas, as it provides high-level constructs for manipulating CSV data. I also found editdistance, a neat package implementing the edit-distance algorithm.
Leveraging these two packages, I hacked a script together in less than an hour and processed the 4.2M email ids to get 1342 suspect email domains, along with the number of email ids using each of them.
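The counting step in pandas might look like the sketch below. The column name "email" and the sample data are assumptions for illustration; the real input file and columns will differ:

```python
import pandas as pd

# Hypothetical input: one email id per row in a column named "email".
df = pd.DataFrame({"email": [
    "a@gnail.com", "b@gnail.com", "c@hotmali.com", "d@gmail.com",
]})

# Extract the domain part after "@" and count ids per unique domain.
domain_counts = (
    df["email"].str.split("@").str[-1]
      .value_counts()
      .rename_axis("domain")
      .reset_index(name="num_ids")
)
print(domain_counts)
```

A table of unique domains with counts like this is what the edit-distance pass below then filters down to the suspect ones.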
Being able to pattern-match a real-world problem to an algorithm and solve it is a joy seldom enjoyed by software engineers.
import csv

import editdistance


def get_similar_emails(unique_domains_file, result_file):
    # Popular providers that typo'd domains tend to resemble.
    real_mails = ["gmail.com", "hotmail.com", "outlook.com",
                  "yahoo.com", "msn.com", "live.com"]
    count = 0
    similar_mail_count = 0
    with open(unique_domains_file, newline="") as read_file, \
         open(result_file, "w", newline="") as write_file:
        csvreader = csv.reader(read_file)
        writer = csv.writer(write_file)
        for row in csvreader:
            for real_mail in real_mails:
                if real_mail == row[0]:
                    continue  # exact match: a real domain, not a typo
                # "Similar" means within 2 edits of a real domain, or
                # containing it as a substring (e.g. "gmail.com.co").
                dist = editdistance.eval(row[0], real_mail)
                ispartof = real_mail in row[0]
                if dist <= 2 or ispartof:
                    similar_mail_count += 1
                    row.append(real_mail)
                    writer.writerow(row)
            count += 1
    return count, similar_mail_count