How to read a text file in batches with Ruby
While I was reviewing a colleague's source code, I found an awful piece of code in his program.
Like this:
```ruby
File.open("./production.log") do |file|
  results = {}
  file.readline # skip header
  file.each_line do |line|
    key, value = process(line)
    results[key] = value
    next if results.size < 2000
    output(results)
    results = {}
  end
  output(results)
end
```
Oops! How did this happen!?
He said, "We need to purge results from memory every 2,000 lines, because otherwise the program runs out of memory. And Ruby has no method to read a text file in batches of lines."
I got it: we need a method to read files in batches, like ActiveRecord's "find_in_batches". So I searched for such a method, and found the same question as ours on Stack Overflow.
Those answers would work, but they aren't elegant. In the end, I couldn't find a great way.
Ideal ways
I'd like an interface similar to "find_in_batches", used like this:
```ruby
File.open("./production.log", skip_header: true) do |file|
  file.batch_line(batch_size: 2000) do |lines|
    results = process(lines)
    output(results)
  end
end
```
Or it seems nice to wrap the text file in an object that has a "batch_line*1" method:
```ruby
file = TextFilePager.new("./production.log", skip_header: true)
file.batch_line(batch_size: 2000) do |lines|
  results = process(lines)
  output(results)
end
```
Both are better than the previous code: we can easily understand what the code does.
First Proposal
Using "Enumerator::Lazy" and "each_slice" seems like one of the best approaches:
```ruby
File.open("./production.log") do |file|
  file.lazy.drop(1).each_slice(2000) do |lines|
    results = process(lines)
    output(results)
  end
end
```
It's a really simple approach, because we need no additional class or method.
All we have to understand is "Enumerator::Lazy*2" and "each_slice"; that's it.
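As a quick sanity check, here is a self-contained sketch of that approach, using a StringIO in place of a real log file (the sample data and the batch size of 2 are just for the example):

```ruby
require "stringio"

# A fake "log file": one header line followed by five data lines.
log = StringIO.new(<<~LOG)
  key,value
  a,1
  b,2
  c,3
  d,4
  e,5
LOG

batches = []
# drop(1) skips the header lazily; each_slice(2) yields arrays of up to 2 lines.
log.lazy.drop(1).each_slice(2) do |lines|
  batches << lines.map(&:chomp)
end

p batches # [["a,1", "b,2"], ["c,3", "d,4"], ["e,5"]]
```

StringIO includes Enumerable and enumerates lines, so it behaves like a File here; the last batch simply comes up short when the line count isn't a multiple of the batch size.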
Second Proposal
Making a new class, like "TextFilePager", seems like a nice idea:
```ruby
class TextFilePager
  DEFAULT_BATCH_SIZE = 1000

  def initialize(file_path, skip_header: false, delete_line_break: false)
    @file_path = file_path
    @skip_header = skip_header
    @delete_line_break = delete_line_break
  end

  def batch_line(batch_size: DEFAULT_BATCH_SIZE)
    File.open(@file_path) do |file|
      file.gets if skip_header?
      loop do
        line, lines = "", []
        batch_size.times do
          break if (line = file.gets).nil?
          lines << (delete_line_break? ? line.chomp : line)
        end
        yield lines unless lines.empty? # avoid yielding an empty trailing batch
        break if line.nil?
      end
    end
  end

  def skip_header?
    @skip_header
  end

  def delete_line_break?
    @delete_line_break
  end
end
```
You can use this class like this:
```ruby
file = TextFilePager.new("./production.log", skip_header: true)
file.batch_line(batch_size: 2000) do |lines|
  results = process(lines)
  output(results)
end
```
This code is exactly the same as in the section above. The reason is clear: I created this class to match my ideal interface, in the spirit of TDD.
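As an aside, the same interface could also be built more compactly on top of "each_line" and a lazy "each_slice". The sketch below is hypothetical (the class name SimpleTextFilePager and the Tempfile demo are mine), and it drops the delete_line_break option for brevity:

```ruby
require "tempfile"

# Hypothetical simplified pager: same batch_line interface,
# but built on each_line + Enumerator::Lazy instead of a manual loop.
class SimpleTextFilePager
  def initialize(file_path, skip_header: false)
    @file_path = file_path
    @skip_header = skip_header
  end

  def batch_line(batch_size: 1000)
    File.open(@file_path) do |file|
      enum = file.each_line.lazy
      enum = enum.drop(1) if @skip_header # skip the header without reading everything
      enum.each_slice(batch_size) { |lines| yield lines }
    end
  end
end

# Demo against a temporary five-line file with a header.
Tempfile.create("demo.log") do |tmp|
  tmp.puts "header"
  (1..5).each { |i| tmp.puts "row #{i}" }
  tmp.flush

  pager = SimpleTextFilePager.new(tmp.path, skip_header: true)
  pager.batch_line(batch_size: 2) do |lines|
    p lines.map(&:chomp) # ["row 1", "row 2"], then ["row 3", "row 4"], then ["row 5"]
  end
end
```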
Conclusion
I prefer the first proposal because it only requires knowledge of standard Ruby, and we don't need to define a new class. Too many classes burden us, so we shouldn't add one unless we really need it.
But the second way gives us a good, intuitive interface. Plus, we can easily pass options to the object. If you need this kind of text file processing time and time again, adding a new class may be a good choice.
Perhaps it would be nice to add a "batch_line" method to the "IO" or "File" class itself. But I don't like modifying core classes, because doing so affects our code widely.
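For completeness, if you did want "batch_line" on File itself, a Ruby refinement limits the change to code that opts in with "using", instead of patching the core class globally. This is just a sketch (the module name and the demo are mine, and refinements have their own caveats):

```ruby
require "tempfile"

# Hypothetical refinement: batch_line is visible only where
# `using FileBatching` is in effect, not across the whole program.
module FileBatching
  refine File do
    def batch_line(batch_size: 1000)
      each_line.each_slice(batch_size) { |lines| yield lines }
    end
  end
end

using FileBatching

Tempfile.create("refine-demo") do |tmp|
  (1..5).each { |i| tmp.puts "line #{i}" }
  tmp.rewind
  tmp.batch_line(batch_size: 2) do |lines|
    p lines.map(&:chomp) # ["line 1", "line 2"], then ["line 3", "line 4"], then ["line 5"]
  end
end
```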
Hence, when it comes to this problem, I'll reach for "Enumerator::Lazy" and "each_slice" first.
Thanks for reading.
Appendix
Here is my original post about that matter, written in Japanese.