Now that we have defined the rules for CSV files, we can
implement a CSV reader that is able to find out which character is used as the
separator.
Here is the complete C# source code of the method that detects
the separator in a CSV stream:
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static char Detect(TextReader reader, int rowCount, IList<char> separators)
{
    // One counter per candidate separator, in the same order as in separators.
    IList<int> separatorsCount = new int[separators.Count];
    int character;
    int row = 0;
    bool quoted = false;
    bool firstChar = true;

    while (row < rowCount)
    {
        character = reader.Read();
        switch (character)
        {
            case '"':
                if (quoted)
                {
                    if (reader.Peek() != '"')
                    {
                        // Value is quoted, current character is " and next character is not ":
                        // this is the closing quote.
                        quoted = false;
                    }
                    else
                    {
                        // Value is quoted and current and next characters are "":
                        // read (skip) the peeked quote.
                        reader.Read();
                    }
                }
                else if (firstChar)
                {
                    // Set value as quoted only if this quote is the first char in the value.
                    quoted = true;
                }
                break;
            case '\n':
                if (!quoted)
                {
                    // Unquoted new line ends the current row; the next character starts a new value.
                    ++row;
                    firstChar = true;
                    continue;
                }
                break;
            case -1:
                // End of stream: force the loop to finish.
                row = rowCount;
                break;
            default:
                if (!quoted)
                {
                    int index = separators.IndexOf((char)character);
                    if (index != -1)
                    {
                        // Unquoted candidate separator: count it; the next character starts a new value.
                        ++separatorsCount[index];
                        firstChar = true;
                        continue;
                    }
                }
                break;
        }

        // Any character that did not start a new value means we are now inside a value.
        if (firstChar)
            firstChar = false;
    }

    int maxCount = separatorsCount.Max();
    return maxCount == 0 ? '\0' : separators[separatorsCount.IndexOf(maxCount)];
}
The CSV stream is represented by the reader
parameter, which is used for reading characters from the stream. The rowCount parameter tells the method how many rows should be read
before determining the separator, and the separators parameter
is a list of the characters that are possible
separators.
The method maintains its internal state in these local variables:
· separatorsCount – counts how many times each possible separator
occurred as a separator (that is, outside quotes) in the CSV stream,
· character – the last character that was read from the CSV
stream,
· row – index of the row that is currently being processed in the CSV
stream,
· quoted – true if the characters that are read next are enclosed
in quotes, otherwise false,
· firstChar – true if the next character to be read is
the first character of the next entry in the CSV stream. This variable is needed because
we consider a value to be enclosed in quotes only if the opening quote is the first
character of the CSV entry.
When rowCount rows have been read or the CSV
stream has been read to the end, the method returns the first of the possible separators that
has the maximum number of occurrences as a separator in the CSV stream. If none of the
possible separators occurred as a separator in the CSV stream, '\0' is returned.
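The selection rule described above can be looked at in isolation. The following is a minimal sketch, not part of the original method: the class name SeparatorChoice and the method name Pick are made up for illustration, and the counts are assumed to have been collected already.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SeparatorChoice
{
    // Hypothetical helper: picks the winning separator from per-candidate counts.
    // counts[i] is assumed to hold the number of unquoted occurrences of separators[i].
    public static char Pick(IList<char> separators, IList<int> counts)
    {
        int maxCount = counts.Max();

        // No candidate was ever seen outside quotes.
        if (maxCount == 0)
            return '\0';

        // IndexOf returns the first index that holds maxCount, so on a tie
        // the candidate that appears earliest in the separators list wins.
        return separators[counts.IndexOf(maxCount)];
    }
}
```

Because ties go to the earliest candidate, the order of the separators list acts as a priority order: Pick(new[] { ',', ';' }, new[] { 3, 3 }) yields the comma.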
The method takes care when reading quotes, separators and new
line characters that are part of a quoted value. When a quote is
read inside a quoted value, the method peeks into the CSV stream to see if the next character is also a
quote. If it is, the pair is an escaped quote and the second quote is skipped; otherwise the quote is considered to be the closing quote. New line
and separator characters are ignored if contained in a quoted value.
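To make the peek step concrete, here is a small sketch that reads a single quoted value using the same rule. It is an illustration only: the class QuotedValueDemo and the method ReadQuotedValue are hypothetical names, and the reader is assumed to be positioned just after the opening quote.

```csharp
using System.IO;
using System.Text;

public static class QuotedValueDemo
{
    // Hypothetical helper: call it right after the opening quote has been consumed.
    // Returns the content of the quoted value with doubled quotes unescaped.
    public static string ReadQuotedValue(TextReader reader)
    {
        var value = new StringBuilder();
        int character;
        while ((character = reader.Read()) != -1)
        {
            if (character == '"')
            {
                // A quote followed by anything other than a quote closes the value.
                if (reader.Peek() != '"')
                    break;

                // Doubled quote: consume the second one and keep a single literal quote.
                reader.Read();
            }

            // Separators and new lines inside the quotes are ordinary content.
            value.Append((char)character);
        }
        return value.ToString();
    }
}
```

For the input $2,130" followed by a new line, the helper returns $2,130 and leaves the new line unread, so the comma never reaches the separator counting.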
For example, in the following Employees.csv file:
Name,Surname,Salary
John,Doe,"$2,130"
Fred;Nurk;"$1,500"
Hans;Meier;"$1,650"
Ivan;Horvat;"$3,200"
The method detects that the CSV separator is [;] although the total
number of occurrences of [;] is 6 and the total number of occurrences of [,] is 8.
That is because 4 of the occurrences of [,] are enclosed in quotes, so they do not
qualify as possible separators. So the total number of occurrences of [,] as
a separator is 4 and the total number of occurrences of [;] as a separator is 6,
which makes [;] the most probable CSV separator.
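The counts for Employees.csv can be reproduced with a short tally routine. This is a simplified sketch of the counting part only, under some assumptions: the names SeparatorTally, Count and CountEmployees are invented here, there is no row limit, and the file content is embedded as a string with '\n' line endings.

```csharp
using System.Collections.Generic;
using System.IO;

public static class SeparatorTally
{
    // Hypothetical helper: counts unquoted occurrences of each candidate separator
    // until the end of the stream (no rowCount limit in this sketch).
    public static int[] Count(TextReader reader, IList<char> separators)
    {
        var counts = new int[separators.Count];
        bool quoted = false;
        bool firstChar = true;
        int character;
        while ((character = reader.Read()) != -1)
        {
            if (character == '"')
            {
                if (quoted)
                {
                    if (reader.Peek() == '"')
                        reader.Read();   // escaped quote inside the value
                    else
                        quoted = false;  // closing quote
                }
                else if (firstChar)
                {
                    quoted = true;       // opening quote must be the first char of the value
                }
                firstChar = false;
                continue;
            }

            if (!quoted)
            {
                int index = separators.IndexOf((char)character);
                if (index != -1)
                    ++counts[index];

                // A separator or an unquoted new line starts the next value.
                if (index != -1 || character == '\n')
                {
                    firstChar = true;
                    continue;
                }
            }
            firstChar = false;
        }
        return counts;
    }

    // Runs the tally on the Employees.csv content from the article.
    public static int[] CountEmployees()
    {
        const string csv =
            "Name,Surname,Salary\n" +
            "John,Doe,\"$2,130\"\n" +
            "Fred;Nurk;\"$1,500\"\n" +
            "Hans;Meier;\"$1,650\"\n" +
            "Ivan;Horvat;\"$3,200\"\n";
        return Count(new StringReader(csv), new[] { ',', ';' });
    }
}
```

CountEmployees() returns 4 for [,] and 6 for [;]: the four commas inside the quoted salaries are skipped, which is why [;] wins.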
Bundled with this article is a WPF solution that demonstrates
auto-detection of the CSV separator in action. The solution can be downloaded here.
The application is located in the bin/Release folder. The original article can be viewed here.