Levenshtein Distance in PySpark

3 min readJun 15, 2019

Levenshtein distance is used to compare two strings to find how different they are. The difference is calculated based on the number of edits (insertion, deletion or substitutions) required to convert one string to another.

Spark has a built-in method for Levenshtein distance which we use to compare difference between strings in two different dataframes.

I have picked examples from Commonly misspelled English words page on Wikipedia to create two dataframes to compare.

The logical flow of code below is as follows:

We create 2 data frames — one with dictionary words and another with dictionary words and misspelled words.
Cross join these two data frames because we want to compare every word of 2nd data frame with every word of 1st data frame
Compute Levenshtein distance
Look at word combinations with Levenshtein distance < certain value

Now, lets get right into it.

import pyspark.sql.functions as F
import pyspark.sql.types as Tdf_dict = spark.createDataFrame(data=[["absence"], ["acceptable"], ["accidentally"], ["accidently"], ["controversy"], ["fulfil"]], schema=["dict_words"])

You view this data frame using show

df_dict.show()+------------+
|  dict_words|
+------------+
|     absence|
|  acceptable|
|accidentally|
|  accidently|
| controversy|
|      fulfil|
+------------+

Now lets create df_test which also has misspelled words

df_test = spark.createDataFrame(data=[["absence"], ["abcense"], ["absance"], ["acceptable"], ["acceptible"], ["accidentally"], ["accidently"], ["accidentaly"], ["contraversy"], ["fullfil"]], schema=["test_words"])df_test.show()+------------+
|  test_words|
+------------+
|     absence|
|     abcense|
|     absance|
|  acceptable|
|  acceptible|
|accidentally|
|  accidently|
| accidentaly|
| contraversy|
|     fullfil|
+------------+

Now we will cross-join df_dict and df_test to get all combinations of words between the two columns

df1_x_df2 = df_dict.crossJoin(df_test)df1_x_df2.show()+----------+------------+
|dict_words|  test_words|
+----------+------------+
|   absence|     absence|
|   absence|     abcense|
|   absence|     absance|
|   absence|  acceptable|
|   absence|  acceptible|
|   absence|accidentally|
|   absence|  accidently|
|   absence| accidentaly|
|   absence| contraversy|
|   absence|     fullfil|
|acceptable|     absence|
|acceptable|     abcense|
|acceptable|     absance|
|acceptable|  acceptable|
|acceptable|  acceptible|
|acceptable|accidentally|
|acceptable|  accidently|
|acceptable| accidentaly|
|acceptable| contraversy|
|acceptable|     fullfil|
+----------+------------+
only showing top 20 rows

Next we will compute Levenshtein distance in a new column,

df1_x_df2 = df1_x_df2.withColumn("levenshtein", F.levenshtein(F.col("dict_words"), F.col("test_words")))df1_x_df2.show()+----------+------------+-----------+
|dict_words|  test_words|levenshtein|
+----------+------------+-----------+
|   absence|     absence|          0|
|   absence|     abcense|          2|
|   absence|     absance|          1|
|   absence|  acceptable|          7|
|   absence|  acceptible|          7|
|   absence|accidentally|          9|
|   absence|  accidently|          7|
|   absence| accidentaly|          8|
|   absence| contraversy|         10|
|   absence|     fullfil|          7|
|acceptable|     absence|          7|
|acceptable|     abcense|          6|
|acceptable|     absance|          7|
|acceptable|  acceptable|          0|
|acceptable|  acceptible|          1|
|acceptable|accidentally|          5|
|acceptable|  accidently|          6|
|acceptable| accidentaly|          5|
|acceptable| contraversy|         10|
|acceptable|     fullfil|          9|
+----------+------------+-----------+
only showing top 20 rowsdf1_x_df2.filter("levenshtein < 2").show()+------------+------------+-----------+
|  dict_words|  test_words|levenshtein|
+------------+------------+-----------+
|     absence|     absence|          0|
|     absence|     absance|          1|
|  acceptable|  acceptable|          0|
|  acceptable|  acceptible|          1|
|accidentally|accidentally|          0|
|accidentally| accidentaly|          1|
|  accidently|  accidently|          0|
|  accidently| accidentaly|          1|
| controversy| contraversy|          1|
|      fulfil|     fullfil|          1|
+------------+------------+-----------+

At the end

df1_x_df2.filter("levenshtein < 2").show()

shows the word combinations with Levenshtein distance ≤ 1, i.e. single or no edit is needed to change one word to match the other. Quite easily you can see that it picks up the commonly misspelled words in language along with the identical cases where the distance is 0.

In practice you can try different methods to arrive at a threshold to decide whether the misspelled word should be substituted with the words in your dictionary or not. Hopefully, we will discuss that (and more) in a later post.

Levenshtein Distance in PySpark

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Abhay Shukla

Responses (1)