Talend Open Studio for Data Integration helps you to efficiently and effectively manage all facets of data extraction, data transformation, and data loading. This leading open source ETL tool boosts developer productivity with a rich set of features.
Now let's look at a simple use case: removing duplicate values from a CSV file.
The sample CSV file contains a duplicate record. Let's dive straight into the solution.
- Open Talend and create a new job.
- Insert a tFileInputDelimited component. This component reads the values from the CSV file.
- Insert a tSortRow component, which sorts the records in the order you specify.
- Insert a tUniqRow component. This component separates duplicates from the rest of the records and gives two outputs: unique records and duplicate records.
- Insert two tLogRow components, which display the records in the console of the tool.
In the tFileInputDelimited component, choose the file location and enter the schema of the file. Refer to this post (Getting the schema of a CSV file) for configuring the schema of the component.
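If it helps to see the idea outside the Studio, here is a minimal Python sketch of what reading a delimited file against a schema amounts to. It is not Talend code. The sample data, the `SCHEMA` list, and the `read_delimited` helper are all hypothetical; only the three column names come from this post.

```python
import csv
import io

# Hypothetical sample data standing in for the CSV file on disk.
SAMPLE_CSV = """first_name,last_name,email
John,Smith,john.smith@example.com
Alice,Brown,alice.brown@example.com
John,Smith,john.smith@example.com
"""

# Assumed schema: three string columns, in file order.
SCHEMA = ["first_name", "last_name", "email"]


def read_delimited(text, schema, delimiter=",", header_rows=1):
    """Read delimited text into a list of dicts keyed by the schema columns."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Skip the header line(s), then map each remaining row onto the schema.
    data_rows = list(reader)[header_rows:]
    return [dict(zip(schema, row)) for row in data_rows if row]


rows = read_delimited(SAMPLE_CSV, SCHEMA)
print(len(rows), "rows read")
```

The schema here only names the columns. In the component you also pick a type for each column, which this sketch leaves out.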
Right-click the tFileInputDelimited component, select Row --> Main, and click the tSortRow component.
In the properties section of the tSortRow component, click the + icon three times and select first name, last name and email in the Schema column. Choose alpha for all three in the Sort num or alpha column and asc in the Order column, as shown in the screenshot below.
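For readers who think in code, the sort configured above behaves roughly like a multi-column string sort. This is a Python sketch with made-up rows, not what Talend generates. It assumes plain ascending string comparison on each column, in the order first name, last name, email.

```python
from operator import itemgetter

# Hypothetical rows; only the column names come from this post.
rows = [
    {"first_name": "John", "last_name": "Smith", "email": "john.smith@example.com"},
    {"first_name": "Alice", "last_name": "Brown", "email": "alice.brown@example.com"},
    {"first_name": "John", "last_name": "Doe", "email": "john.doe@example.com"},
    {"first_name": "John", "last_name": "Smith", "email": "john.smith@example.com"},
]

# Sort ascending by first name, then last name, then email (string comparison).
sorted_rows = sorted(rows, key=itemgetter("first_name", "last_name", "email"))

for row in sorted_rows:
    print(row["first_name"], row["last_name"], row["email"])
```

After the sort, identical records sit next to each other, which makes the duplicates easy to spot in the output.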
Right-click the tSortRow component, select Row --> Main, and click the tUniqRow component.
Configure the tUniqRow component as shown in the screenshot below. Whichever columns you select as key attributes, the component checks for duplicates on the combination of those columns (an AND condition, if you are a developer).
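The "AND condition" can be sketched in a few lines of Python. This is an illustration only, with a hypothetical helper name and made-up rows: a row counts as a duplicate only when every key column matches a row seen earlier, and the first occurrence always goes to the unique output.

```python
def split_unique_duplicates(rows, key_columns):
    """Split rows into (uniques, duplicates) using a combined key."""
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        # All key columns together form one key, so every column must match.
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            uniques.append(row)
    return uniques, duplicates


# Hypothetical rows: the third one shares the name but not the email.
rows = [
    {"first_name": "Alice", "last_name": "Brown", "email": "alice.brown@example.com"},
    {"first_name": "John", "last_name": "Smith", "email": "john.smith@example.com"},
    {"first_name": "John", "last_name": "Smith", "email": "j.smith@example.com"},
    {"first_name": "John", "last_name": "Smith", "email": "john.smith@example.com"},
]

uniques, duplicates = split_unique_duplicates(
    rows, ["first_name", "last_name", "email"]
)
print(len(uniques), "unique,", len(duplicates), "duplicate")
```

With all three columns as the key, only the last row is a duplicate. If you passed just `["first_name", "last_name"]`, the third row would be flagged as well, so the choice of key attributes decides what counts as a duplicate.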
Now right-click the tUniqRow component, select Row --> Uniques for the unique records, and click tLogRow_1. Then right-click it again, select Row --> Duplicates, and map it to tLogRow_2.
If you run the job now, you will get both the unique and the duplicate records in the console. To see only the duplicates, delete tLogRow_1 and run the job again. The console then shows just the duplicate values, as shown in the screenshot below.
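To tie the steps together, here is a rough end-to-end sketch in Python of the same flow: read, sort, and keep only the duplicates. It is an illustration under assumptions, not a translation of the job. The sample data and the `find_duplicates` helper are hypothetical, and it relies on the rows being sorted so that `itertools.groupby` can group equal records.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical input; in the job this would be the CSV file on disk.
SAMPLE_CSV = """first_name,last_name,email
John,Smith,john.smith@example.com
Alice,Brown,alice.brown@example.com
John,Smith,john.smith@example.com
"""

# Combined key: all three columns must match for a row to be a duplicate.
KEY = itemgetter("first_name", "last_name", "email")


def find_duplicates(text):
    """Read, sort, and return every row after the first in each key group."""
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=KEY)
    duplicates = []
    for _, group in groupby(rows, key=KEY):
        # The first row of each group is the unique one; the rest are duplicates.
        duplicates.extend(list(group)[1:])
    return duplicates


duplicates = find_duplicates(SAMPLE_CSV)
for row in duplicates:
    # Print each duplicate as one pipe-separated line.
    print("|".join(row.values()))
```

Printing only the `duplicates` list plays the same role as keeping just tLogRow_2 in the job.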
Cheers!!! Have a great day...