Algorithms Wiki
Advertisement

In this article I will introduce some thoughts on a program I am developing to convert CSV (Comma Seperated Variable) files to structured HTML (HyperText Markup Language) Files. Hopefully others will join in the fun and help me to refine the thought process to the point that I can get the end result working nicely. I'm going to request that any chances to code sections be fully documented in the discussion tab.

Thank you, --Jacob Nyquist [1] 15:02, 16 April 2009 (UTC)

We are making the following assumptions.

  1. The CSV has new rows deliniated by a line break
  2. The first row contains the column names
  3. All the data can be stored as Character Strings.
  4. No more than 50 columns of data

Data Structures[]

TABLE columndata
FIELD column_name CHARACTER_STRING
FIELD column_id INTEGER.

TABLE rowdata
FIELD field_data CHARACTER_STRING
FIELD column_id INTEGER
FIELD row_id INTEGER. 

The first table is merely to grab the information we will need for recreating the header row. The second table will store actual data from the file. The field rows.row_id will be a system generated during the data gathering phase.

Getting the Input File[]

Each system has it's own method for obtaining a file name, I'm going to leave this as a code specific item for each person to obtain themselves. Should someone decide they only want to use this algorithm on one file they can hard code it - should they decide they want the user to have to type the full path they can code that... this is a piece that doesn't impact the logic.

OBTAIN input_filename.


Obtaining the Column Information[]

Now to read the data in to program. Many languages that I'm familiar with have routines either built in, or easily available to include, for processing comma delimited documents. The trick is to get this data into our tables. I decided to ensure an end of line by testing for two NUL's or blank values before considering the data we read to truly be terminated.

Why is counter_max set to "counter - 1" ? Because counter_max lets us know exactly where the last real value is in our array. If your language has the [0] as an actual value, then change to [0]..[49] for a 50 element array.

DEFINE VARIABLE temp_array CHARACTER_STRING EXTENT 50.  

OPEN input_filename.
IMPORT DELIMITER "," 
  temp_array[1] .. temp_array[50].

Determining Where the Columns End[]

Our next stage is to process through the array to determine where it ends. For this we count through until we find two consecutive NUL/UNDEFINED/""/blank or otherwise unused positions in our array. We can then extrapolate that the data in the arrays only charts until the position prior to this.

DEFINE VARIABLE counter_max INTEGER INITIALIZE 0.
DEFINE VARIABLE counter INTEGER INITIALIZE 0.

DO WHILE counter_max = 0.
  counter = counter + 1.
  IF temp_array[counter] EQ NUL && temp_array[counter+1] EQ NUL THEN
    counter_max = counter - 1.
END.

Translating the Array into the First Table[]

To translate our array into our first table, we're going to reset the counter variable, and use another DO WHILE loop, this one to simulate a FOR i = x TO y looping structure. This should enable us to get all the data into the table. If your arrays contain data in [0] then set counter = 0 instead of 1.

counter = 1.
DO WHILE counter <= counter_max.
  CREATE columndata.
  ASSIGN 
    columndata.column_name = temp_array[counter]
    columndata.column_id = counter.
  counter = counter + 1.
END.

Obtaining the rest of the data[]

Now we've got our header row, so the columns are defined for us. If your language allows you to setup dynamic imports you're five steps ahead of where I'm at. The looping structure used should continue until we reach EOF.

IMPORT DELIMITER "," 
  temp_array[1 FOR 50].
    
REPEAT:
  counter = 1.
  DO WHILE counter <= counter_max.
    CREATE rowdata.
    FIND columndata WHERE columndata.column_id = counter.
    ASSIGN 
      rowdata.column_id = columndata.column_id
      rowdata.field_data = temp_array[counter]
      rowdata.row_id = rowcounter
      counter = counter + 1.
  END.
  rowcounter = rowcounter + 1.
  IMPORT DELIMITER "," 
    temp_array[1 FOR 50].
END.


output phase[]

We're skipping the sorting and the ability to break into files based on data in a particular column, so we can have a result by end of day - for anyone that wants to help write the in between pieces feel free. {SKIP inserts a hard return into the output file... consider it equiv to /n}

OUTPUT TO output_filename.
OUTPUT TO VALUE("v:\cstars.dat\test.html").
PUT UNFORMATTED  "<HTML>" SKIP.
PUT UNFORMATTED "  <BODY>" SKIP.
PUT UNFORMATTED "    <TABLE>" SKIP.
PUT UNFORMATTED "      <TR>".

FOR EACH columndata.
  PUT UNFORMATTED "<TH>" + columndata.column_name + "</TH>".
END.

PUT "</TR>" SKIP .

FOR EACH rowdata BREAK BY row_id.
  IF FIRST-OF (row_id) THEN PUT UNFORMATTED "      <TR>" .
  PUT UNFORMATTED "<TD>" + rowdata.field_data + "</TD>".
  IF LAST-OF (row_id) THEN PUT UNFORMATTED "</TR>" SKIP .
END.

PUT UNFORMATTED "    </TABLE>" SKIP .
PUT UNFORMATTED "  </BODY>" SKIP .
PUT UNFORMATTED "</HTML>" SKIP .

Full Source Code - Progress 4gl programming language[]

{i:\ah-global\includes\loadfile.i}
{i:\ah-global\includes\savefile.i}

DEF TEMP-TABLE columndata
FIELD column_name AS CHAR FORMAT "x(50)"
FIELD column_id   AS INT
.

DEF TEMP-TABLE rowdata
FIELD field_data  AS CHAR FORMAT "x(50)"
FIELD column_id AS INT
FIELD row_id AS INT
.

DEF VAR input_filename AS CHAR.

input_filename = loadafile("CSV File","*.csv").

DEF VAR temp_array AS CHAR EXTENT 50.

DEF VAR counter_max AS INTEGER INIT 0.
DEF VAR counter AS INTEGER INIT 0.

INPUT FROM VALUE(input_filename).
IMPORT DELIMITER "," 
  temp_array[1 FOR 50].

DO WHILE counter_max = 0.
  counter = counter + 1.
  IF temp_array[counter] EQ "" AND temp_array[counter + 1] EQ "" THEN
    counter_max = counter - 1.
END.

counter = 1.
DO WHILE counter <= counter_max.
  CREATE columndata.
  ASSIGN 
    columndata.column_name = temp_array[counter]
    columndata.column_id = counter.
  counter = counter + 1.
END.

DEF VAR rowcounter AS INT INIT 1.

IMPORT DELIMITER "," 
  temp_array[1 FOR 50].
    
REPEAT:
  counter = 1.
  DO WHILE counter <= counter_max.
    CREATE rowdata.
    FIND columndata WHERE columndata.column_id = counter.
    ASSIGN 
      rowdata.column_id = columndata.column_id
      rowdata.field_data = temp_array[counter]
      rowdata.row_id = rowcounter
      counter = counter + 1.
  END.
  rowcounter = rowcounter + 1.
  IMPORT DELIMITER "," 
    temp_array[1 FOR 50].
END.

/* OUTPUT */
OUTPUT TO VALUE(savefile("v:\","convertedcsv.html")).
PUT UNFORMATTED  "<HTML>" SKIP.
PUT UNFORMATTED "  <BODY>" SKIP.
PUT UNFORMATTED "    <TABLE>" SKIP.
PUT UNFORMATTED "      <TR>".

FOR EACH columndata.
  PUT UNFORMATTED "<TH>" + columndata.column_name + "</TH>".
END.

PUT "</TR>" SKIP .

FOR EACH rowdata BREAK BY row_id.
  IF FIRST-OF (row_id) THEN PUT UNFORMATTED "      <TR>" .
  PUT UNFORMATTED "<TD>" + rowdata.field_data + "</TD>".
  IF LAST-OF (row_id) THEN PUT UNFORMATTED "</TR>" SKIP .
END.

PUT UNFORMATTED "    </TABLE>" SKIP .
PUT UNFORMATTED "  </BODY>" SKIP .
PUT UNFORMATTED "</HTML>" SKIP .
Advertisement