Nothing’s Perfect Including PowerShell

Today I needed to count the number of errors in a log file.  Pretty straightforward stuff that I would typically accomplish like so:

5> Select-String '^\d+,Error' Messages.log | Measure-Object

And that normally works well for me – except for today.  It turns out that this log file is big, really big – as in 600MB worth of log file!  The command above runs for quite some time and then fails ignominiously with a System.OutOfMemoryException.  Sure enough, a quick execution of “gps -id $pid” revealed that the PowerShell process was consuming 1.7 GB of private memory.  No wonder we hit an OOM exception.
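
For reference, a quick way to spot-check the session's private memory yourself (roughly what the gps call above reports) is something like this one-liner:

(Get-Process -Id $PID).PrivateMemorySize64 / 1MB   # current private bytes, expressed in MB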

So back to the drawing board on how to accomplish this in PowerShell.  But first I had to do something about the memory footprint of my current PowerShell session.  In PowerShell Community Extensions we have a Collect function (which just calls [System.GC]::Collect()).  This brought the private memory footprint back down to ~76MB, which tells me that PowerShell's pipeline or one of the cmdlets above is hoarding memory.  No matter.  One of the best things about PowerShell is the awesome escape hatch it provides: direct access to the .NET Framework.  Fortunately there is a simple class in the .NET Framework called System.IO.StreamReader that allows you to read text files a line at a time, which is important when you're dealing with huge log files.  Here is the resulting solution I came up with:

7> $sr = new-object System.IO.StreamReader("$pwd\Messages.log")
8> $sum = 0; while (($line = $sr.ReadLine()) -ne $null) {if ($line -match '^\d+,Error') {$sum++}}; $sum
2702996
9> $sr.Dispose()

I monitored the private memory usage of the PowerShell process during the execution of this script.  The private memory usage increased by about 200K and then didn't budge until the script finished.  No doubt this contributed to the script finishing much faster than my first attempt took to, err, run out of memory: 1 min 43 secs versus 7 min 16 secs respectively.
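
Timings like these are easy to capture; a minimal sketch using Measure-Command over the same approach:

Measure-Command {
  $sr = new-object System.IO.StreamReader("$pwd\Messages.log")
  $sum = 0
  # count matching lines without accumulating them in memory
  while (($line = $sr.ReadLine()) -ne $null) { if ($line -match '^\d+,Error') { $sum++ } }
  $sr.Dispose()
}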

When it comes to reading files, another useful .NET Framework member is the static method [System.IO.File]::ReadAllText(string path), which returns a single string containing the file's entire contents.  If you ever need to load the entire contents of a file into a variable for manipulation (say you need to execute a regex over an entire file's contents rather than just line by line), this method is a good way to go.  I find the ReadAllText() method a bit easier to use in this case than Get-Content piped to Out-String.  The other benefit of ReadAllText() is that it doesn't add an extra line terminator to the end of the string, which is something Out-String will do.  It seems like Get-Content should have a parameter to indicate that it should read the entire file into a single string and output that.  Update: Get-Content now has the -Raw parameter that will read the entire file into a single string.
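
For example, a minimal sketch (using the same Messages.log as above) of running a regex across the whole file at once rather than line by line:

$text = [System.IO.File]::ReadAllText("$pwd\Messages.log")
[regex]::Matches($text, '(?m)^\d+,Error').Count   # (?m) makes ^ match at each line start

# or, with the newer parameter mentioned above:
$text = Get-Content "$pwd\Messages.log" -Raw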


7 Responses to Nothing’s Perfect Including PowerShell

  1. Unknown says:

    I am all for trying to be consistent when using tools for doing your job, but almost two minutes to process a 600MB file just seems like a really long time.  I did a quick experiment with a similarly-sized file that I had laying around.  These are the results I got with PowerShell:

    PSH$ Measure-Command -Expression { $sr = new-object System.IO.StreamReader("$pwd\corpus.txt"); $sum = 0; while (($line = $sr.ReadLine()) -ne $null) {if ($line -match '^the\b') {$sum++}}; write-host $sum; $sr.Dispose() }
    572398

    Days              : 0
    Hours             : 0
    Minutes           : 7
    Seconds           : 42
    Milliseconds      : 35
    Ticks             : 4620355879
    TotalDays         : 0.00534763411921296
    TotalHours        : 0.128343218861111
    TotalMinutes      : 7.70059313166667
    TotalSeconds      : 462.0355879
    TotalMilliseconds : 462035.5879

    Here is the result using Perl (the result is the same; Measure-Command eats the output):

    PSH$ Measure-Command -Expression {perl -e "my `$sum = 0; while(<>){++`$sum if /^the\b/;} print `$sum;" corpus.txt }

    Days              : 0
    Hours             : 0
    Minutes           : 0
    Seconds           : 36
    Milliseconds      : 533
    Ticks             : 365339868
    TotalDays         : 0.000422847069444444
    TotalHours        : 0.0101483296666667
    TotalMinutes      : 0.60889978
    TotalSeconds      : 36.5339868
    TotalMilliseconds : 36533.9868

    This probably doesn't surprise anyone, as Perl was designed for this sort of task, but I think it is important to use the right tool for each job.

  2. Shay says:

    Hi
     
    I\’m curious, have you tried measuring it with get-content?.
     
    —–
    Shay Levy
    $cript Fanatic
    http://scriptolog.blogspot.com

  3. Stephen says:

    I had a log file that wasn't quite as large and didn't have as many errors, but I did a quick test on a few different ways of doing the search.  I think my third solution works out very well.  They all returned the correct answer of 294809.
     
    The first one is similar to your first attempt.  In my case it ran fairly quickly, but PowerShell did use 250 MB of RAM.
     
    Select-String \’file\’ "C:\\temp\\chkde.log" | measure-object# 1 Minutes 9 Seconds – 250 MB Memory Usage
     
    Here I used your solution.  It solved the memory problem, but was much slower (over 3 minutes). 
     
    $sr = new-object System.IO.StreamReader("c:\temp\chkde.log")
    $sum = 0; while (($line = $sr.ReadLine()) -ne $null) {if ($line -match 'file') {$sum++}}; $sum
    $sr.Dispose()
    # 3 minutes 16 seconds, 65 MB memory usage
     
    And finally I combined the two solutions and used Select-String plus summing it manually.  It didn't use much memory and ran very fast as well.
    $sum = 0; Select-String 'file' "C:\temp\chkde.log" | % { $sum++ }; $sum
    294809
    # 1 minute 3 seconds, 65 MB memory usage
     
     

  4. Keith says:

    I wasn\’t nearly as concerned about execution time as I was about just getting the command to not fail with an OOM.  🙂   In general I agree about using the right tool for the job.  However I never learned Perl and don\’t have any desire to at this point.  If Flowering Weeds was reading my blog he would tell me to use LogParser, which probably would have been *the* right tool for this particular job.  However since I\’m trying to educate specifically on PowerShell I figured it was worth pointing out that depending on how construct your script you can run out of memory.  I guess the other thing I should have pointed out is that since Select-String captures the match info in an object that includes each line then the memory footprint can be quite large (if you match a lot of lines within a huge file).  This would actually argue against using Select-String for this task.

  5. Dmitry says:

    Measure-Object actually has an issue of not releasing objects when processing them, which leads to the memory consumption troubles you describe.  See also: http://dmitrysotnikov.wordpress.com/2008/01/18/measure-vs-count-vs-i/
    Jeffrey promised to beat the appropriate engineer for that: http://dmitrysotnikov.wordpress.com/2008/01/18/measure-vs-count-vs-i/#comment-3523 🙂

  6. Keith says:

    Ah, that would explain why [gc]::Collect() recovers the memory.

  7. Pingback: Coding for speed – FoxDeploy.com
