Regular Expression (Regexp) in PySpark

Rohit Kumar Prajapati
Dev Genius
Published in
3 min readMar 12, 2023

--

Image credit — CoderPad

What is Regex?

A regular expression is a sequence of characters that specifies a match pattern in the text.

To get more information on Regex you can check out https://regexr.com/ and https://regex101.com/ these sites will help you to understand how regex works.

In Pyspark we have a few functions that use the regex feature to help us in string matches.

Below are the regexp that used in pyspark

  1. regexp_replace
  2. rlike
  3. regexp_extract

1.regexp_replace — as the name suggested it will replace all substrings if a regexp match is found in the string.

pyspark.sql.functions.regexp_replace(str, pattern, replacement)

By using regexp_replace() Spark function you can replace a column’s string value with another string/substring. regexp_replace() uses Java regex for matching, if the regex does not match it returns an empty string. The below example replaces the number and special characters value with an empty string on address column.

[^A-Za-z\s] Matches only characters and whitespace


df = df.withColumn("clean_address", F.regexp_replace(F.col("address"), "[^A-Za-z\s]", ""))
df.show(truncate=False)

#o/p
+------------------------------------------+--------------------------------+
|address |clean_address |
+------------------------------------------+--------------------------------+
|93 NORTH 9TH STREET, BROOKLYN NY 11211 | NORTH TH STREET BROOKLYN NY |
|380 WESTMINSTER ST, PROVIDENCE RI 02903 | WESTMINSTER ST PROVIDENCE RI |
|177 MAIN STREET, LITTLETON NH 03561 | MAIN STREET LITTLETON NH |
|202 HARLOW ST, BANGOR ME 04401 | HARLOW ST BANGOR ME |
|46 FRONT STREET, WATERVILLE, ME 04901 | FRONT STREET WATERVILLE ME |
|22 SUSSEX ST, HACKENSACK NJ 07601 | SUSSEX ST HACKENSACK NJ |
|75 OAK STREET, PATCHOGUE NY 11772 | OAK STREET PATCHOGUE NY |
|1 CLINTON AVE, ALBANY NY 12207 | CLINTON AVE ALBANY NY |
|7242 ROUTE 9, PLATTSBURGH NY 12901 | ROUTE PLATTSBURGH NY |
|520 5TH AVE, MCKEESPORT PA 15132 | TH AVE MCKEESPORT PA |
|122 W 3RD STREET, GREENSBURG PA 15601 | W RD STREET GREENSBURG PA |
+------------------------------------------+--------------------------------+

2. rlike — SQL RLIKE expression (LIKE with Regex). Returns a boolean column based on a regex match.

Based on the expression it will match with the string and filter out records.

10023$filter records that ends with 10023 zip_code

df_rlike = df.filter(F.col("address").rlike("10023$"))
df_rlike.show(truncate=False)

#o/p
+---------------------------------------+
|address |
+---------------------------------------+
|380 WESTMINSTER ST, PROVIDENCE RI 10023|
+---------------------------------------+

3. regexp_extract — Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.

pyspark.sql.functions.regexp_extract(str, pattern, idx)

It takes 3 arguments
1. str — a column where you want to perform a regexp match.
2. pattern — regexp pattern to extract substring from the specified string column
3. idx — part of the match we need to extract from the group of match

(\d*)Match 0 or more digits (0–9)

Below we extracted number, city, state, zip_code from the address using regexp_extract.
Here we are creating groups pattern and based on need we are selecting a specific group by providing idx.

df_clean = df.withColumn("number", F.regexp_extract(F.col("address"), "(\d*)", 1))\
.withColumn("city", F.regexp_extract(F.col("address"), "(,) (\w*)", 2))\
.withColumn("state", F.regexp_extract(F.col("address"), "(\w*) (\w*$)", 1))\
.withColumn("zip_code", F.regexp_extract(F.col("address"), "(\d*$)", 1))

df_clean.show(truncate=False)

#o/p
+------------------------------------------+------+------------+-----+--------+
|address |number|city |state|zip_code|
+------------------------------------------+------+------------+-----+--------+
|93 NORTH 9TH STREET, BROOKLYN NY 11211 |93 |BROOKLYN |NY |11211 |
|380 WESTMINSTER ST, PROVIDENCE RI 02903 |380 |PROVIDENCE |RI |02903 |
|177 MAIN STREET, LITTLETON NH 03561 |177 |LITTLETON |NH |03561 |
|202 HARLOW ST, BANGOR ME 04401 |202 |BANGOR |ME |04401 |
|46 FRONT STREET, WATERVILLE, ME 04901 |46 |WATERVILLE |ME |04901 |
|22 SUSSEX ST, HACKENSACK NJ 07601 |22 |HACKENSACK |NJ |07601 |
|75 OAK STREET, PATCHOGUE NY 11772 |75 |PATCHOGUE |NY |11772 |
|1 CLINTON AVE, ALBANY NY 12207 |1 |ALBANY |NY |12207 |
|7242 ROUTE 9, PLATTSBURGH NY 12901 |7242 |PLATTSBURGH |NY |12901 |
|520 5TH AVE, MCKEESPORT PA 15132 |520 |MCKEESPORT |PA |15132 |
|122 W 3RD STREET, GREENSBURG PA 15601 |122 |GREENSBURG |PA |15601 |
|901 UNIVERSITY DR, STATE COLLEGE PA 16801 |901 |STATE |PA |16801 |
|240 W 3RD ST, WILLIAMSPORT PA 17701 |240 |WILLIAMSPORT|PA |17701 |
|41 N 4TH ST, ALLENTOWN PA 18102 |41 |ALLENTOWN |PA |18102 |
+------------------------------------------+------+------------+-----+--------+

Hope you enjoyed it.
Happy Coding !!

--

--