At work today we needed to filter out malformed email domains, ones that look deceptively similar to real domains, from a list of 4.2 million email ids. For example:
- gnail.com or gmali.com instead of gmail.com
- hotmali.com instead of hotmail.com
The problem boils down to two things:
- How do you define “similar”?
- How do you do this efficiently for a list of 4.2 million email ids?
After staring at the problem for a moment, it occurred to me that this is a perfect candidate for the Levenshtein distance (aka edit distance) algorithm.
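The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. A minimal pure-Python sketch (just for illustration; the script itself uses a compiled package) makes the idea concrete:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Dynamic programming over prefixes: prev[j] holds the distance
    # between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca with cb
            ))
        prev = curr
    return prev[-1]

print(levenshtein("gnail.com", "gmail.com"))  # 1: one substitution
print(levenshtein("gmali.com", "gmail.com"))  # 2: a swap costs two edits
```

So a threshold like “distance of 2 or less” catches both the single-typo and the swapped-letter variants of a real domain.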
I decided to use pandas, as it provides high-level constructs for manipulating CSV data. I also found editdistance, a neat package implementing the edit-distance algorithm.
Leveraging these two packages, I hacked a script together in less than an hour and processed the 4.2M email ids to get 1342 suspect email domains, along with the number of email ids using each of them.
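The counting step in pandas might look like the sketch below. The column name "email" and the sample data are assumptions for illustration; the real input file and columns will differ:

```python
import pandas as pd

# Hypothetical input: one email id per row in a column named "email".
df = pd.DataFrame({"email": [
    "a@gnail.com", "b@gnail.com", "c@hotmali.com", "d@gmail.com",
]})

# Extract the domain part after "@" and count ids per unique domain.
domain_counts = (
    df["email"].str.split("@").str[-1]
      .value_counts()
      .rename_axis("domain")
      .reset_index(name="num_ids")
)
print(domain_counts)
```

A table of unique domains with counts like this is what the edit-distance pass below then filters down to the suspect ones.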
Being able to pattern-match a real-world problem to an algorithm and solve it is a joy seldom enjoyed by software engineers.
import csv

import editdistance


def get_similar_emails(unique_domains_file, result_file):
    # Popular providers that typo'd domains tend to resemble.
    real_mails = ["gmail.com", "hotmail.com", "outlook.com",
                  "yahoo.com", "msn.com", "live.com"]
    count = 0
    similar_mail_count = 0
    with open(unique_domains_file, newline="") as read_file, \
         open(result_file, "w", newline="") as write_file:
        csvreader = csv.reader(read_file)
        writer = csv.writer(write_file)
        for row in csvreader:
            for real_mail in real_mails:
                if real_mail == row[0]:
                    continue  # exact match: a real domain, not a typo
                # "Similar" means within 2 edits of a real domain, or
                # containing it as a substring (e.g. "gmail.com.co").
                dist = editdistance.eval(row[0], real_mail)
                ispartof = real_mail in row[0]
                if dist <= 2 or ispartof:
                    similar_mail_count += 1
                    row.append(real_mail)
                    writer.writerow(row)
            count += 1
    return count, similar_mail_count