Thursday, September 01, 2016

Splitting Large File with Headers (Mac)

I was given a 3.6 gb CSV data file (43 million rows x ~12 columns). I have an automation that's supposed to intake and process those files automatically, but the file was too large. What I needed were smaller files, each with the same header.

I had to do a number of Google searches to figure out how to do this, so I thought I would bring together everything I learned into a single post in case it would help someone else with a similar scenario.

Experience: Moderate

Step 1: Split the file into multiple files, each 1,000,000 rows long.

split -l 1000000 filename.csv filename-

This will give you multiple files with filenames like filename-aa, filename-ab, filename-ac, etc.

Step 2: Move the large file (filename.csv) to another directory. You don't need it anymore, but you'll want to hang on to it until everything's as you want.

Step 3: Rename all the files as CSV files.

for i in *; do mv "$i" "$i.csv"; done

Step 4: Open filename-aa.csv with Excel. Copy the first row (the headers) into a new worksheet. Save as "header.csv" Close filename-aa.csv (don't save).

Step 5: Move filename-aa.csv to a different directory. We don't want to add the header to it a second time.

Step 6: Open header.csv with a text editor to make sure there's a "return" at the end of the row. (If you move to the last character in your header and press right-arrow, does it drop to a new line? If not, press return or enter and then save. If so, you're good.)

Step 7: Rename header.csv to header.txt - it will warn you, but it's ok.

Step 8: Add the header to all of the other files

for i in *.csv; do mv "$i" tmp; cat header.txt tmp > "$i"; done

Step 9: Move filename-aa.csv back into the directory

You should now have a bunch of files, each with a header and 1,000,000 rows.  (And a file called "tmp" that you can safely delete.)







No comments: