At work today we needed to filter out malformed email domains from a list of 4.2 million email ids, i.e. domains that look deceptively similar to real ones. For example:

  • gnail.com, gmali.com instead of gmail.com
  • hotmali.com instead of hotmail.com

The problem boils down to two things:

  1. How do you define “similar”?
  2. How do you do this efficiently for a list of 4.2 million email ids?

After staring at the problem for a moment, it occurred to me that this is a perfect candidate for the Levenshtein distance (aka edit distance) algorithm.
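The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. A minimal pure-Python sketch of the classic dynamic-programming version (the function name is mine, not from any package):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("gnail.com", "gmail.com"))  # 1: substitute n -> m
print(levenshtein("gmali.com", "gmail.com"))  # 2: the a/i swap costs two edits
```

Note that the typo domains above sit at distance 1 or 2 from the real thing, which is what the threshold in the script below relies on.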

I decided to use pandas, as it provides high-level constructs for manipulating CSVs. I also found the neat editdistance package, which implements the edit-distance algorithm.

Leveraging these two packages, I hacked a script together in less than an hour and boiled the 4.2M email ids down to 1342 email domains, along with the number of email ids using each of them.
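The domain-counting step is a couple of lines in pandas. A minimal sketch, assuming the input CSV has a single `email` column (the column name and sample data here are made up):

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv("emails.csv") on the real 4.2M-row file.
df = pd.DataFrame({"email": ["a@gnail.com", "b@gmail.com", "c@gnail.com"]})

# Take everything after the "@" and count how many email ids use each domain.
domains = df["email"].str.split("@").str[-1]
counts = domains.value_counts()
print(counts)
```

Writing `counts` back out with `counts.to_csv(...)` produces the unique-domains file the script below reads.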

Being able to pattern-match a real-world problem to an algorithm and solve it is a joy seldom enjoyed by software engineers.

import csv

import editdistance


def get_similar_emails(unique_domains_file, result_file):
    real_mails = ["gmail.com", "hotmail.com", "outlook.com",
                  "yahoo.com", "msn.com", "live.com"]

    similar_mail_count = 0

    with open(unique_domains_file, newline="") as read_file, \
         open(result_file, "w", newline="") as write_file:
        reader = csv.reader(read_file)
        writer = csv.writer(write_file)

        for row in reader:
            domain = row[0]
            for real_mail in real_mails:
                if real_mail == domain:
                    continue  # an exact match is the real domain, not a typo

                dist = editdistance.eval(domain, real_mail)
                # Also catch domains that merely embed a real one,
                # e.g. "gmail.com.br" contains "gmail.com".
                is_part_of = real_mail in domain
                if dist <= 2 or is_part_of:
                    similar_mail_count += 1
                    row.append(real_mail)
                    writer.writerow(row)
                    break  # one match per domain is enough

    return similar_mail_count