Function to clean text data in Spark RDD

4/11/2023

I am very new to data engineering/machine learning and am learning on my own. While working on a sample problem, I came across the following data cleaning task:

1. Remove extra whitespace (keep one space between words but remove any more).
2. Turn all the words to lower case and remove stop words (list from NLTK).
3. Remove duplicate words in the ASSEMBLY_NAME column.

Though I've written code to perform these tasks during college assignments, I've never done it in a single piece of code (for any project), and I am looking for guidance from experts who can point me towards the best approach to get it done (in Python or Scala).

1. Read data from parquet file

    partFitmentDF = ("/mnt/blob/devdatasciencesto/pga-parts-forecast/raw/parts-fits/")

2. Create table from DF

    partFitmentDF.createOrReplaceTempView("partsFits")
    ("overwrite").format("delta").saveAsTable("partsFitsTable")

3. Rearrange the data in the table so that all the fits_assembly_name and fits_assembly_id values roll up into a single row for each distinct itemno

    %sql
    select itemno,
           concat_ws(' | ', collect_set(cast(fits_assembly_id as int))) as fits_assembly_id,
           concat_ws(' | ', collect_set(fits_assembly_name)) as fits_assembly_name

Now I need to roll up these multiple rows into one row per itemno (all the assembly names and ids belonging to one itemno should be in one row). Then I need to perform tasks 1, 2 and 3 listed at the very top to clean the fits_assembly_name column, and save the processed data to a final dataframe or table with the itemno, fits_assembly_id and fits_assembly_name columns. I am not sure how to get started in Python to do it. Could you please help me by suggesting an approach (and a hint of code) so that I can take this task further?

Check if the following works for you.
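One possible approach, as a minimal sketch rather than a definitive answer: put the three cleaning tasks into a plain Python function and apply it to the rolled-up column through a UDF. The names clean_assembly_name, roll_up_and_clean and STOP_WORDS are made up for illustration, and the small stop-word set is only a stand-in for the NLTK list the question asks for. The sketch also assumes the dataframe has the columns itemno, fits_assembly_id and fits_assembly_name, and it joins the names with a single space instead of ' | ' so that the separator is not treated as a word during de-duplication.

```python
# Hypothetical stand-in for the NLTK list, kept tiny so the sketch runs anywhere.
# Assuming NLTK and its stopwords corpus are installed, it could be replaced by:
#   from nltk.corpus import stopwords
#   STOP_WORDS = set(stopwords.words("english"))
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def clean_assembly_name(text, stop_words=STOP_WORDS):
    """Lower-case the text, collapse whitespace, drop stop words and repeated words."""
    if text is None:
        return None
    # split() with no arguments splits on any run of whitespace, so joining the
    # words back with one space removes the extra whitespace (task 1).
    words = text.lower().split()
    seen = set()
    kept = []
    for word in words:
        # Skip stop words (task 2) and words already kept (task 3),
        # preserving the order of first appearance.
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        kept.append(word)
    return " ".join(kept)


def roll_up_and_clean(df):
    """Roll rows up to one per itemno, then clean fits_assembly_name.

    Assumes df has columns itemno, fits_assembly_id and fits_assembly_name.
    PySpark is imported here so the cleaning function above stays usable without it.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    clean_udf = F.udf(clean_assembly_name, StringType())
    ids = F.collect_set(F.col("fits_assembly_id").cast("int").cast("string"))
    names = F.collect_set("fits_assembly_name")
    return (
        df.groupBy("itemno")
        .agg(
            F.concat_ws(" | ", ids).alias("fits_assembly_id"),
            # Joined with a space (an assumption of this sketch, not the
            # question's ' | ') so the UDF only ever sees words.
            F.concat_ws(" ", names).alias("fits_assembly_name"),
        )
        .withColumn("fits_assembly_name", clean_udf("fits_assembly_name"))
    )
```

With this sketch, clean_assembly_name("Front   BUMPER and the front bumper cover") returns "front bumper cover". A Python UDF is used here because it keeps all three tasks in one easily tested function; on large tables it may be slower than built-in column functions, so it could be worth measuring before settling on it.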
