Optimizing Performance of Get-Content for Large Files

Every so often a question comes across the PowerShell newsgroup microsoft.public.windows.powershell asking why Get-Content is so slow when reading large log files.  The typical answer is to use Microsoft’s LogParser utility.  However, that isn’t always satisfying if you want to use PowerShell-specific features.  In that case, you really need to be able to get the best performance out of Get-Content when reading large files. 

As for why the performance is not so good by default, I’ll chalk that up to PowerShell being a v1.0 product.  The PowerShell team has said several times that the goal for v1.0 was high quality, not necessarily high performance.  I suspect that the product team is working on performance issues for a future release but as it stands the perf isn’t too bad except for some special cases – like reading large files.

Fortunately, the Get-Content cmdlet has a parameter that allows you to improve its performance by telling it to send lines down the pipe in larger chunks, i.e., arrays of lines instead of line-by-line.  Keep in mind that when you specify this via the -ReadCount parameter, Get-Content actually sends an array of strings down the pipeline instead of one string at a time.  This can have a surprising effect.  For instance, this issue came up:

(gc test.txt -read 1 | ? {$_ -like '*a*'} | Measure-Object).count
(gc test.txt -read 5 | ? {$_ -like '*a*'} | Measure-Object).count

In this case the count varies based on the value of -ReadCount.  That may seem counterintuitive *if* you are expecting Measure-Object to count the lines in test.txt.  However, Measure-Object is just counting the number of objects it sees.  In the first command, that corresponds to the number of lines in the file because Get-Content passes each line as a string to the Where-Object (alias ?) cmdlet.  Where-Object simply evaluates the specified expression and, if it evaluates to true, passes the object down the pipeline unmodified.  In the second command, Where-Object receives an array of 5 strings (System.Object[]).  The -like operator works on arrays as well as scalar values, but against an array it returns the matching elements rather than a Boolean.  Where-Object treats any non-empty result as true, so the whole 5-line array is sent to Measure-Object, where it is counted as "one" object. 

You can see this easily by executing:

gc test.txt -read 5 | %{$_.psobject.typenames[0]}
System.Object[]
System.Object[]
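
To see where the difference in counts comes from, note how -like itself treats a scalar versus an array; here is a quick illustration (the sample strings are made up, not from test.txt):

'apple' -like '*a*'                     # scalar: returns True
'plum'  -like '*a*'                     # scalar: returns False
@('apple','pear','plum') -like '*a*'    # array: returns the matching elements (apple, pear)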

Now, when reading a large file like a 75 MB log file, it helps performance to use a -ReadCount of something like 1000.  Just remember that you will be getting arrays of lines instead of single lines down the pipeline.  However, this is simple to rectify if you need to process line-by-line, e.g.:

(gc test.txt -read 5 | %{$_} | ? {$_ -like '*a*'} | Measure-Object).count

Even with shredding the arrays, the performance is still better using the higher -ReadCount.  Here’s an experiment that I ran to confirm this on a 75 MB text file:

23> $ht = @{};for ($i = 1; $i -le 10MB; $i *= 10) {
>>      write-progress "Measuring gc -readCount $i" "% Complete" `
>>          -perc ([math]::log10($i)*100/[math]::log10(10MB))
>>      $ts = measure-command { gc large.txt -read $i | %{$_} | measure }
>>      $ht[$i] = $ts
>>  }
>>
24> $ht.Keys | sort | 
>>    select @{n='ReadCount';e={$_}}, `
>>           @{n='ElapsedTime';e={$ht[$_].ToString()}} | 
>>    ft -a
>>

ReadCount ElapsedTime
--------- -----------
        1 00:03:07.4161690
       10 00:01:38.5779661
      100 00:01:17.9219998
     1000 00:01:14.9202370
    10000 00:01:22.1434037
   100000 00:01:17.8457756
  1000000 00:01:17.9850525
 10000000 00:01:19.0217524

If you happen to have the slick PowerGadgets snapin, then you can view the data like so:

33> $ht.Keys | sort |
>>    select @{n='ReadCount';e={"RC: $_"}},
>>           @{n='ElapsedTime';e={$ht[$_].TotalSeconds}} |
>>    out-chart -title 'Optimal ReadCount for 75MB Text File'
>>

As you can see, in this particular case (a 75 MB file) a -ReadCount of 1000 is optimal, but if in doubt: measure, measure, measure.  :-)


3 Responses to Optimizing Performance of Get-Content for Large Files

  1. Josh Friberg says:

    I was wondering if you still think this is the best way to process large text files, or do you feel Roman’s solution is better? You commented on this over on Stack Overflow.

    http://stackoverflow.com/questions/4192072/how-to-process-a-file-in-powershell-line-by-line-as-a-stream

    Did you run any testing with his code to see how much faster it was?

    • rkeithhill says:

      For very large files, yes, it is probably still better to avoid the overhead of cmdlet processing, although with script compilation in PowerShell V3 the performance of straight PowerShell is a bit better.
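
      For reference, here is a minimal sketch of that raw .NET approach using System.IO.StreamReader; the file name and the '*a*' filter below are just placeholders, not Roman’s actual code:

      $count = 0
      $reader = [System.IO.File]::OpenText('large.txt')   # placeholder path
      try {
          while (-not $reader.EndOfStream) {
              $line = $reader.ReadLine()
              if ($line -like '*a*') { $count++ }          # substitute whatever per-line work you need
          }
      }
      finally {
          $reader.Dispose()                                # always release the file handle
      }
      $count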

  2. edwaa says:

    Great post! I found that you can use this technique to parse large XML files (e.g. 200MB).

    [xml]$myXmlObj = Get-Content $largeXmlFile -ReadCount 0
