I recently had to import a large database (20GB+) into Amazon RDS. I first tried using the method suggested in the RDS documentation for large databases, i.e., export the table data into CSV files then use the mysqlimport utility to import the data into the tables on your RDS instance. However, when the database tables have foreign key constraints this approach doesn’t work as you will invariably run into FK constraint errors (as I did).
It also didn’t appear that there was a way to turn off foreign key checks with mysqlimport like I could if I were using LOAD DATA INFILE, i.e.,
SET FOREIGN_KEY_CHECKS = 0 LOAD DATA INFILE 'xxx'; SET FOREIGN_KEY_CHECKS = 1
So, what to do?
The key was to recognize that what I was essentially dealing with was a sorting problem. That is if you have two tables A and B and B has a foreign key a_id which points to table A then table A needs to be imported before table B. So, simple enough, I wrote a script to sort the schema based on foreign key constraints and imported the data according to that ordering and viola… it didn’t work! Well, I did solve the FK violations this way, but now I ran into a new problem: newlines.
Amazon’s approach for importing data into RDS is to use mysqlimport. In order to keep things performant, you need to ensure that the files you are loading with mysqlimport are manageable in size. This particular database had tables that were several gigabytes in size resulting in very unmanageable file sizes. The solution Amazon suggests is to use the split utility to chop these large files up into smaller chunks.
The split utility splits a file based upon newlines. This is great unless the records in your table contain newlines in which case you end up splitting records across files which of course causes errors on import. I tried several different approaches here but ended up settling on one suggested by a friend: change the line terminator in the output file to something other than a newline and then split on that. A super simple solution! Well, not exactly.
Unfortunately, the split utility doesn’t have an option for changing the line terminator. You can change the split pattern but not the line terminator. So, I ended up having to write my own crude version of split for this purpose.
After much trial and tribulation I present to you the final version of my script (really two rake tasks) which should be able to be used to dump any sized database and import it into RDS!
Share and enjoy.
NEWLINE = "MYNEWLINE"
# Sorts the database tables based on foreign key dependencies so that when we
# load data in the rds:import task below we do not get any FK violations, i.e.,
# if table A has a b_id FK to table B, we would first load the B table and
# then load the A table.
def sort_schema(schema_file)
sql = File.read(schema_file)
tables = sql.split(/\-\-\n\-\-.*?\n\-\-\n/)
table_names = tables.map { |t| t.scan(/CREATE TABLE \`(.*)?\`/).first ? t.scan(/CREATE TABLE \`(.*)?\`/).first.first : nil }.compact
sorted = table_names.clone
changed = 0
begin
changed = 0
tables.each { |t|
name = t.scan(/CREATE TABLE \`(.*)?\`/).first ? t.scan(/CREATE TABLE \`(.*)?\`/).first.first : nil
next unless name
t_i = sorted.index(name)
fks = t.scan(/REFERENCES \`(.*)?\` /).map { |f| f[0] }
for fk in fks do
fk_i = sorted.index(fk)
if fk_i and fk_i > t_i
sorted.delete_at(fk_i)
sorted.insert(t_i, fk)
changed += 1
end
end
}
end while changed > 0
sorted
end
def import_data(hostname, username, password, database, file)
print "Importing data from #{file}..."
`mysqlimport --local --compress --user=#{username} --password=#{password} --host=#{hostname} --fields-terminated-by=',' --lines-terminated-by='#{NEWLINE}' --fields-optionally-enclosed-by='"' #{database} #{file}`
puts "Done"
end
def split(filename, size, newline = "\n")
max = size * 1024 * 1024
infile = File.open(filename, "r")
part = 0
begin
data = String.new
begin
data << infile.readline(newline)
end while data.size <= max && !infile.eof?
partfile = "#{filename}.part_#{sprintf("%02d", part)}"
puts "Writing part #{partfile}..."
File.open(partfile, "w") do |outfile|
outfile.write data
end
part += 1
end while !infile.eof?
infile.close
end
# Note, we have to get rid of triggers here because they require SUPER user
# privileges to create and Amazon does not grant SUPER privileges to anyone
def dump_schema(dir)
config = ActiveRecord::Base.configurations[Rails.env].symbolize_keys
`mysqldump -u #{config[:username]} -p#{config[:password]} -S #{config[:socket]} --skip-triggers -d #{config[:database]} > #{dir}/schema.sql`
end
namespace :rds do
desc "Dumps the data into the target directory; has to be run on the database server."
task :dump, :target, :needs => :environment do |task, args|
dirname = args[:target].chomp
dump_schema(dirname)
tables = ActiveRecord::Base.connection.execute("show tables")
tables.each { |t|
table = t[0]
filename = "#{dirname}/#{table}.csv"
ActiveRecord::Base.connection.execute("SELECT * FROM #{table} INTO OUTFILE '#{filename}' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '#{NEWLINE}'")
max_size = 1024*1024*1024
if File.new(filename).stat.size > max_size
puts "Splitting #{filename}..."
split(filename, 100, NEWLINE)
end
}
end
desc "Imports data from the source directory; must contain files dumped using rds:dump"
task :import, :data_dir, :host, :user, :pass, :db, :needs => :environment do |t, args|
dirname = args[:data_dir].chomp
schema = "#{dirname}/schema.sql"
puts "Importing schema..."
`mysql -u #{args[:user]} -p#{args[:pass]} -h #{args[:host]} #{args[:db]} < #{schema}`
puts "Sorting tables..."
table_names = sort_schema(schema)
dir = Dir.new(dirname)
files = dir.entries.delete_if { |e| [".","..","schema.sql"].include?(e) }
puts "Importing data..."
for table in table_names do
if files.include?("#{table}.csv.part_00")
import_data(args[:host], args[:user], args[:pass], args[:db], "#{dirname}/#{table}.csv.part_*")
elsif files.include?("#{table}.csv")
import_data(args[:host], args[:user], args[:pass], args[:db], "#{dirname}/#{table}.csv")
else
raise "Missing data file for table #{table}"
end
end
end
end
This error (and a related one: undefined method `unpack’ for nil) threw me for a loop today. Here’s what I was trying to do:
> s = ActionController::Base.helpers.strip_tags("foo:bar")
=> "foo:bar"
> CGI.escape(s)
NoMethodError: undefined method `bytesize' for nil:NilClass
Strange. Even less helpful was the error message that I’d get from calling the view helper url_escape on the string: undefined method `unpack’ for nil. A quick Google search, however, revealed the issue (which you can read about here, here, and here).
Basically the issue is that you get an ActiveSupport::SafeBuffer, not a String, from strip_tags (and other helpers presumably) which is not compatible with CGI.escape or any method that internally makes use of $1, $2, etc. variables to perform string replacements. This incompatibility has to do with the fact that SafeBuffer attempts to provide it’s own version of gsub which won’t work due to an interesting property of Ruby.
...the $_ and $~ variables (and related vars like $1, $&, etc) are "special", and only certain core class methods are able to modify them (in the caller's scope) while no Ruby code can. It is for this reason that you can't alias and wrap any of those methods without breaking them.
A simple fix for this is to call the to_str method on SafeBuffer before passing it into a method like CGI.escape like so:
> s = ActionController::Base.helpers.strip_tags("foo:bar")
=> "foo:bar"
> CGI.escape(s.to_str)
=> "foo:bar"
If you are following the excellent work-flow laid out by Vincent in this post and implemented in his git extension then you will need to know how to deploy a specific tag to Heroku. The syntax for this was not obvious to me and there was little information about it on the interwebs. So, here it is (run from the master branch):
git push heroku +1.1.1^{commit}:master
What this command does is force (+) push the commit represented by tag 1.1.1 to a remote named heroku. Share and enjoy.
Something that I’ve always struggled with until recently is the apparent shift in color you get when you save or export (depending on the app) an image “for the web.” Recently, I learned that it has to do with the “color profile” your image editing app is using. For example, in Pixelmator, you should change to Image > Color Management > Web (sRGB) if you are going to be working with web destined images. A detailed explanation regarding color profiles and the interwebs can be found here: http://www.gballard.net/psd/srgbforwww.html
Heroku allows you to run cron jobs at most every hour. However, there are many tasks that need to be run more frequently than that. Solution? delayed_job; here’s how…
# Called by Heroku once per day (for free!)
task :cron => :environment do
# Runs this task every five minutes (24*60/5=288) using delayed_job
1.upto(288) do |run|
SomeModel.delay(:run_at => (run * 5).minutes.from_now).some_method
end
end
PrinceXML will work on Heroku; here’s how:
Thanks for that reminder, delayed_job! Almost as helpful as “Windows error 32.”
The real issue here is that the handler field in your delayed_jobs table is not large enough to hold the serialized version of your delayed method call. The “text” data type has a maximum storage capacity of 65535 bytes (in MySQL) which might be too small for some calls. For example, if you are using an object that contains one or more text field attributes.
The solution is to change your handler field from text to longtext (yes, that’s right - longtext). So, for MySQL users just…
alter table delayed_jobs change column handler handler longtext null default null;