Using Bash to Change the Delimiter in a CSV File

 

The Problem

I recently had a situation where I had a comma separated value (CSV) file that I wanted to easily parse within a shell script. Unfortunately the CSV data contained some double quoted strings with embedded commas, for example:

"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

This made parsing the file quite painful, particularly as only the strings with an embedded comma were double quoted like this.

The Solution

I devised a utility script that can parse the data and replace the delimiter with a new character of the users choice, which makes CSV files far easier to work with if you pick a character you know won’t be in your data.

The Script

This script borrows the core CSV parsing from this rather good post on CSV parsing with awk, but I’ve edited it to allow the substitution of the delimiter and ensure it still outputs a single record per line.

#!/bin/bash

input=$1
delimiter=$2

if [ -z "$input" ];
then
	echo "Input file must be passed as an argument!"
	exit 98
fi

if ! [ -f $input ] || ! [ -e $input ];
then
	echo "Input file '"$input"' doesn't exist!"
	exit 99
fi

if [ -z "$delimiter" ];
then
	echo "Delimiter character must be passed as an argument!"
	exit 98
fi

gawk '{
	c=0
	$0=$0","
	while($0) {
		delimiter=""
		if (c++ > 0) # Evaluate and then increment c
		{
			delimiter="'$delimiter'"
		}

		match($0,/ *"[^"]*" *,|[^,]*,/)
		# save what matched in f
		s=substr($0,RSTART,RLENGTH)
		# remove extra stuff
		gsub(/^ *"?|"? *,$/,"",s)
		printf (delimiter s)
		# "consume" what matched
		$0=substr($0,RLENGTH+1)
	}
	printf ("\n")
}' $input

Sample Input


$ cat testprovinces.csv
Province,ProvinceCode,CriteriaId,CountryCode,Country
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
Bío-Bío,CL-BI,20154,CL,Chile

Sample Output


$ ./change-delimiter testprovinces.csv '^'
Province^ProvinceCode^CriteriaId^CountryCode^Country
Australian Capital Territory^AU-ACT^20034^AU^Australia
Piaui^BR-PI^20100^BR^Brazil
Adygeya, Republic^RU-AD^21250^RU^Russian Federation
Bío-Bío^CL-BI^20154^CL^Chile

This entry was posted in Uncategorized and tagged , , , . Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>