I recently had to import a large database (20GB+) into Amazon RDS. I first tried using the method suggested in the RDS documentation for large databases, i.e., export the table data into CSV files then use the mysqlimport utility to import the data into the tables on your RDS instance. However, when the database tables have foreign key constraints this approach doesn’t work as you will invariably run into FK constraint errors (as I did).
It also didn’t appear that there was a way to turn off foreign key checks with mysqlimport like I could if I were using LOAD DATA INFILE, i.e.,
SET FOREIGN_KEY_CHECKS = 0 LOAD DATA INFILE 'xxx'; SET FOREIGN_KEY_CHECKS = 1
So, what to do?
The key was to recognize that what I was essentially dealing with was a sorting problem. That is if you have two tables A and B and B has a foreign key a_id which points to table A then table A needs to be imported before table B. So, simple enough, I wrote a script to sort the schema based on foreign key constraints and imported the data according to that ordering and viola… it didn’t work! Well, I did solve the FK violations this way, but now I ran into a new problem: newlines.
Amazon’s approach for importing data into RDS is to use mysqlimport. In order to keep things performant, you need to ensure that the files you are loading with mysqlimport are manageable in size. This particular database had tables that were several gigabytes in size resulting in very unmanageable file sizes. The solution Amazon suggests is to use the split utility to chop these large files up into smaller chunks.
The split utility splits a file based upon newlines. This is great unless the records in your table contain newlines in which case you end up splitting records across files which of course causes errors on import. I tried several different approaches here but ended up settling on one suggested by a friend: change the line terminator in the output file to something other than a newline and then split on that. A super simple solution! Well, not exactly.
Unfortunately, the split utility doesn’t have an option for changing the line terminator. You can change the split pattern but not the line terminator. So, I ended up having to write my own crude version of split for this purpose.
After much trial and tribulation I present to you the final version of my script (really two rake tasks) which should be able to be used to dump any sized database and import it into RDS!
Share and enjoy.
NEWLINE = "MYNEWLINE"
# Sorts the database tables based on foreign key dependencies so that when we
# load data in the rds:import task below we do not get any FK violations, i.e.,
# if table A has a b_id FK to table B, we would first load the B table and
# then load the A table.
def sort_schema(schema_file)
sql = File.read(schema_file)
tables = sql.split(/\-\-\n\-\-.*?\n\-\-\n/)
table_names = tables.map { |t| t.scan(/CREATE TABLE \`(.*)?\`/).first ? t.scan(/CREATE TABLE \`(.*)?\`/).first.first : nil }.compact
sorted = table_names.clone
changed = 0
begin
changed = 0
tables.each { |t|
name = t.scan(/CREATE TABLE \`(.*)?\`/).first ? t.scan(/CREATE TABLE \`(.*)?\`/).first.first : nil
next unless name
t_i = sorted.index(name)
fks = t.scan(/REFERENCES \`(.*)?\` /).map { |f| f[0] }
for fk in fks do
fk_i = sorted.index(fk)
if fk_i and fk_i > t_i
sorted.delete_at(fk_i)
sorted.insert(t_i, fk)
changed += 1
end
end
}
end while changed > 0
sorted
end
def import_data(hostname, username, password, database, file)
print "Importing data from #{file}..."
`mysqlimport --local --compress --user=#{username} --password=#{password} --host=#{hostname} --fields-terminated-by=',' --lines-terminated-by='#{NEWLINE}' --fields-optionally-enclosed-by='"' #{database} #{file}`
puts "Done"
end
def split(filename, size, newline = "\n")
max = size * 1024 * 1024
infile = File.open(filename, "r")
part = 0
begin
data = String.new
begin
data << infile.readline(newline)
end while data.size <= max && !infile.eof?
partfile = "#{filename}.part_#{sprintf("%02d", part)}"
puts "Writing part #{partfile}..."
File.open(partfile, "w") do |outfile|
outfile.write data
end
part += 1
end while !infile.eof?
infile.close
end
# Note, we have to get rid of triggers here because they require SUPER user
# privileges to create and Amazon does not grant SUPER privileges to anyone
def dump_schema(dir)
config = ActiveRecord::Base.configurations[Rails.env].symbolize_keys
`mysqldump -u #{config[:username]} -p#{config[:password]} -S #{config[:socket]} --skip-triggers -d #{config[:database]} > #{dir}/schema.sql`
end
namespace :rds do
desc "Dumps the data into the target directory; has to be run on the database server."
task :dump, :target, :needs => :environment do |task, args|
dirname = args[:target].chomp
dump_schema(dirname)
tables = ActiveRecord::Base.connection.execute("show tables")
tables.each { |t|
table = t[0]
filename = "#{dirname}/#{table}.csv"
ActiveRecord::Base.connection.execute("SELECT * FROM #{table} INTO OUTFILE '#{filename}' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '#{NEWLINE}'")
max_size = 1024*1024*1024
if File.new(filename).stat.size > max_size
puts "Splitting #{filename}..."
split(filename, 100, NEWLINE)
end
}
end
desc "Imports data from the source directory; must contain files dumped using rds:dump"
task :import, :data_dir, :host, :user, :pass, :db, :needs => :environment do |t, args|
dirname = args[:data_dir].chomp
schema = "#{dirname}/schema.sql"
puts "Importing schema..."
`mysql -u #{args[:user]} -p#{args[:pass]} -h #{args[:host]} #{args[:db]} < #{schema}`
puts "Sorting tables..."
table_names = sort_schema(schema)
dir = Dir.new(dirname)
files = dir.entries.delete_if { |e| [".","..","schema.sql"].include?(e) }
puts "Importing data..."
for table in table_names do
if files.include?("#{table}.csv.part_00")
import_data(args[:host], args[:user], args[:pass], args[:db], "#{dirname}/#{table}.csv.part_*")
elsif files.include?("#{table}.csv")
import_data(args[:host], args[:user], args[:pass], args[:db], "#{dirname}/#{table}.csv")
else
raise "Missing data file for table #{table}"
end
end
end
end