HACKERS GONNA HACK
Importing large databases into Amazon RDS

I recently had to import a large database (20GB+) into Amazon RDS. I first tried the method suggested in the RDS documentation for large databases: export the table data into CSV files, then use the mysqlimport utility to load the data into the tables on your RDS instance. However, when the database tables have foreign key constraints this approach doesn’t work, as you will invariably run into FK constraint errors (as I did). 
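For reference, that documented flow boils down to something like the following Ruby-flavored sketch. The host, credentials, database, table, and file names are all placeholders, and it mirrors the import_data helper further down rather than quoting the docs verbatim.

# 1. On the source server, dump a table to CSV with SELECT ... INTO OUTFILE:
ActiveRecord::Base.connection.execute("SELECT * FROM users INTO OUTFILE '/tmp/users.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'")

# 2. Then push the file at the RDS instance with mysqlimport:
`mysqlimport --local --compress --host=myinstance.rds.amazonaws.com --user=admin --password=secret --fields-terminated-by=',' --fields-optionally-enclosed-by='"' mydb /tmp/users.csv`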

It also didn’t appear that there was a way to turn off foreign key checks with mysqlimport like I could if I were using LOAD DATA INFILE, i.e. (file and table names here are just placeholders):

SET FOREIGN_KEY_CHECKS = 0;
LOAD DATA INFILE 'xxx' INTO TABLE some_table;
SET FOREIGN_KEY_CHECKS = 1;

So, what to do? 

The key was to recognize that what I was essentially dealing with was a sorting problem. That is, if you have two tables A and B, and B has a foreign key a_id which points to table A, then table A needs to be imported before table B (the toy sketch below spells this out). So, simple enough, I wrote a script to sort the schema based on foreign key constraints, imported the data according to that ordering, and voilà… it didn’t work!  Well, I did solve the FK violations this way, but now I ran into a new problem: newlines. 
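To make the ordering concrete, here is a toy sketch with hypothetical table names (separate from the real sort_schema function further down): each table lists the tables it REFERENCES, and a table is only imported once everything it points at has been imported.

# Hypothetical FK graph: table => tables it REFERENCES.
deps = {
  "comments" => ["posts", "users"],
  "posts"    => ["users"],
  "users"    => []
}

# Dependency-first walk; no cycle handling, purely for illustration.
def import_order(deps, table, seen = [])
  deps[table].each { |parent| import_order(deps, parent, seen) }
  seen << table unless seen.include?(table)
  seen
end

order = deps.keys.inject([]) { |memo, t| import_order(deps, t, memo) }
puts order.inspect   # => ["users", "posts", "comments"]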

Amazon’s approach for importing data into RDS is to use mysqlimport. To keep things performant, you need to ensure that the files you are loading with mysqlimport are of manageable size. This particular database had tables that were several gigabytes each, which made for very unmanageable file sizes. The solution Amazon suggests is to use the split utility to chop these large files up into smaller chunks.

The split utility splits a file based upon newlines. This is great unless the records in your table contain newlines, in which case you end up splitting records across files, which of course causes errors on import. I tried several different approaches here but ended up settling on one suggested by a friend: change the line terminator in the output file to something other than a newline and then split on that. A super simple solution!  Well, not exactly.  

Unfortunately, the split utility doesn’t have an option for changing the line terminator. You can control how the file gets chunked, but not what it treats as a line. So, I ended up having to write my own crude version of split for this purpose.

After much trial and tribulation, I present to you the final version of my script (really two rake tasks), which can be used to dump a database of any size and import it into RDS!  

Share and enjoy.

NEWLINE = "MYNEWLINE" # sentinel line terminator; assumed to never appear in the actual data

# Sorts the database tables based on foreign key dependencies so that when we
# load data in the rds:import task below we do not get any FK violations, i.e., 
# if table A has a b_id FK to table B, we would first load the B table and 
# then load the A table.
def sort_schema(schema_file)
  sql = File.read(schema_file)
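  # mysqldump separates each table's section with a three-line "--" comment
  # block; splitting on those blocks gives one chunk of SQL per table.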
  tables = sql.split(/\-\-\n\-\-.*?\n\-\-\n/)
  table_names = tables.map { |t| t.scan(/CREATE TABLE \`(.*)?\`/).first ? t.scan(/CREATE TABLE \`(.*)?\`/).first.first : nil }.compact
  sorted = table_names.clone
  changed = 0
  begin
    changed = 0
    tables.each { |t| 
      name = t.scan(/CREATE TABLE \`(.*)?\`/).first ? t.scan(/CREATE TABLE \`(.*)?\`/).first.first : nil
      next unless name
      t_i = sorted.index(name)
      fks = t.scan(/REFERENCES \`(.*)?\` /).map { |f| f[0] }
      for fk in fks do
        fk_i = sorted.index(fk)
        if fk_i and fk_i > t_i
          sorted.delete_at(fk_i)
          sorted.insert(t_i, fk)
          changed += 1
        end
      end
    }
  end while changed > 0 
  sorted
end

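# Loads a CSV file (or a shell glob of part files) into the target RDS database
# with mysqlimport, using the same field and line terminators that rds:dump
# writes out.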
def import_data(hostname, username, password, database, file)
  print "Importing data from #{file}..."
  `mysqlimport --local --compress --user=#{username} --password=#{password} --host=#{hostname} --fields-terminated-by=',' --lines-terminated-by='#{NEWLINE}' --fields-optionally-enclosed-by='"' #{database} #{file}`
  puts "Done"
end

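# Crude stand-in for the split(1) utility: writes out part files of roughly
# size MB each, only ever breaking on the custom line terminator so a record
# never gets split across two part files.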
def split(filename, size, newline = "\n")
  max    = size * 1024 * 1024
  infile = File.open(filename, "r")
  part   = 0
  begin
    data = String.new
    begin
      data << infile.readline(newline)
    end while data.size <= max && !infile.eof?
    partfile = "#{filename}.part_#{sprintf("%02d", part)}"
    puts "Writing part #{partfile}..."
    File.open(partfile, "w") do |outfile|
      outfile.write data
    end
    part += 1 
  end while !infile.eof?
  infile.close
end

# Note: we have to skip triggers here because creating them requires SUPER
# privileges, and Amazon does not grant SUPER privileges to anyone on RDS.
def dump_schema(dir)
  config = ActiveRecord::Base.configurations[Rails.env].symbolize_keys
  `mysqldump -u #{config[:username]} -p#{config[:password]} -S #{config[:socket]} --skip-triggers -d #{config[:database]} > #{dir}/schema.sql`
end

namespace :rds do
  desc "Dumps the data into the target directory; has to be run on the database server."
  task :dump, :target, :needs => :environment do |task, args|
    dirname = args[:target].chomp
    dump_schema(dirname)
    tables = ActiveRecord::Base.connection.execute("show tables")
    tables.each { |t| 
      table = t[0]
      filename = "#{dirname}/#{table}.csv"
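      # Dump the table to CSV on the database server, using the sentinel line
      # terminator so rows containing real newlines can be split safely below.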
      ActiveRecord::Base.connection.execute("SELECT * FROM #{table} INTO OUTFILE '#{filename}' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '#{NEWLINE}'")
      # Split anything over 1GB into ~100MB chunks so the part files stay manageable
      max_size = 1024*1024*1024
      if File.size(filename) > max_size
        puts "Splitting #{filename}..."
        split(filename, 100, NEWLINE)
      end
    }
  end
  
  desc "Imports data from the source directory; must contain files dumped using rds:dump"
  task :import, :data_dir, :host, :user, :pass, :db, :needs => :environment do |t, args|
    dirname = args[:data_dir].chomp
    schema = "#{dirname}/schema.sql"
    puts "Importing schema..."
    `mysql -u #{args[:user]} -p#{args[:pass]} -h #{args[:host]} #{args[:db]} < #{schema}`
    puts "Sorting tables..."
    table_names = sort_schema(schema)
    dir = Dir.new(dirname)
    files = dir.entries.delete_if { |e| [".","..","schema.sql"].include?(e) }
    puts "Importing data..."
    for table in table_names do
      if files.include?("#{table}.csv.part_00")
        import_data(args[:host], args[:user], args[:pass], args[:db], "#{dirname}/#{table}.csv.part_*")
      elsif files.include?("#{table}.csv")
        import_data(args[:host], args[:user], args[:pass], args[:db], "#{dirname}/#{table}.csv")
      else
        raise "Missing data file for table #{table}"
      end
    end
  end
end
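For what it’s worth, the invocations end up looking something like rake rds:dump[/mnt/dump] run on the source database server (SELECT ... INTO OUTFILE writes its files there), followed by rake rds:import[/mnt/dump,myinstance.rds.amazonaws.com,admin,secret,mydb] from any machine that can reach the RDS endpoint; the directory, host, and credentials are of course placeholders.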