Thursday, January 17, 2013

Powershell advanced functions

As I am quite new to PowerShell, I'm learning how to implement functions in the spirit of the language.
I'll walk through the refactoring of a basic function into a more PowerShell-like one.
I have the following function word2pdf that converts Word files to PDF.
function word2pdf([String]$dir)
{
 $wdFormatPDF = 17
 $word = New-Object -ComObject word.application
 $word.visible = $false

 foreach($file in Get-ChildItem -Path $dir -Filter *.docx)
 {
    $doc = $word.documents.open($file.Fullname)
    $pdfFilename = $file.Fullname + ".pdf"
    $doc.saveas([ref] $pdfFilename, [ref]$wdFormatPDF)
    $doc.close()
 }

 $word.Quit()
}

The function is called like this:
word2pdf "."

This works, but I'd like to be able to send the list of files through the pipeline instead of providing a whole directory. Here are the changes:
function word2pdf([System.IO.FileInfo]$file)
{
 $wdFormatPDF = 17
 $word = New-Object -ComObject word.application
 $word.visible = $false
 $doc = $word.documents.open($file.Fullname)
 $pdfFilename = $file.Fullname + ".pdf"
 $doc.saveas([ref] $pdfFilename, [ref]$wdFormatPDF)
 $doc.close()
 $word.Quit()
}

The function is called like this:
dir *.docx | foreach-object { word2pdf $_ }

So far so good. However, as you can see, word2pdf is called through a foreach-object. If you have a look at PowerShell's built-in cmdlets, they take the pipeline directly as their input. This is what we will do.
First, we convert our function into an advanced function. We add CmdletBinding, define the parameters in a Param block
and enclose the source code in the Process section. What does this mean?
CmdletBinding makes our function an advanced function; among other things, it will be able to process the pipeline.
In the Param block, we list the parameters; in our case we have one mandatory parameter that can take its value from the pipeline.
How is the pipeline dealt with in advanced functions? Very easily. For each object coming in from the pipeline, the parameter ($file in our case)
is set to that object and the Process section is called.
If the pipeline contains one object, the Process section is called once; if it contains 10 objects, it is called 10 times.
You do not have to implement the loop yourself, great, isn't it?

function word2pdf
{
 [CmdletBinding()]
 Param([Parameter(Mandatory=$True,ValueFromPipeline=$True)]$file)
 Process
 {
  $wdFormatPDF = 17
  $word = New-Object -ComObject word.application
  $word.visible = $false
  $doc = $word.documents.open($file.Fullname)
  $pdfFilename = $file.Fullname + ".pdf"
  $doc.saveas([ref] $pdfFilename, [ref]$wdFormatPDF)
  $doc.close()
  $word.Quit()
 }
}
dir *.docx | word2pdf

Note: do not forget the Process section, otherwise the function will only process the last item of the pipeline (without a Process section, the body is treated as the End section).
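A quick way to convince yourself, with a throwaway function (the name and the numbers are just an illustration):

function Show-Item
{
 [CmdletBinding()]
 Param([Parameter(Mandatory=$True,ValueFromPipeline=$True)]$item)
 Process { "Got $item" }
}
1..3 | Show-Item

This prints "Got 1", "Got 2", "Got 3". Remove the Process section and only "Got 3" is printed: the body then runs once, after the whole pipeline, with $item bound to the last object.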

As you may have noticed, this version instantiates a new Word object each time a file is processed. This is less efficient than the original version, which performed only one instantiation.
Advanced functions provide the handy Begin and End sections to deal with this. They behave much like JUnit's setUp and tearDown methods.
The Begin section is executed before the loop, and all the variables declared in it are available in the Process and End sections. Once the loop is finished, the End section is executed.
Here is the final implementation:

function word2pdf
{
 [CmdletBinding()]
 Param([Parameter(Mandatory=$True,ValueFromPipeline=$True)]$file)
 Begin
 {
  $wdFormatPDF = 17
  $word = New-Object -ComObject word.application
  $word.visible = $false
 }
 Process
 {
  $doc = $word.documents.open($file.Fullname)
  $pdfFilename = $file.Fullname + ".pdf"
  $doc.saveas([ref] $pdfFilename, [ref]$wdFormatPDF)
  $doc.close()
 }
 End
 {
  $word.Quit()
 }
}
dir *.docx | word2pdf

As you can see, we now have a function that accepts pipeline input and processes it efficiently. The change from the original implementation to the final one is straightforward once you have done it a couple of times.
I like advanced functions for several reasons. First of all, I can use the pipeline easily, with no need for a special variable or anything else: PowerShell lets me use ordinary variables and calls the Process section natively. Moreover, the final source code clearly describes my intent, which is always a good point.

Wednesday, January 9, 2013

My git setup at work - the synchronisation script


In my previous post, I detailed my git setup.
Although I can synchronize the backup manually, it is more convenient to have an automatic solution.

I went for a scheduled task running a small PowerShell script; a post-commit hook would have been another option.

I wanted to write the output of the git push command to the Windows event log. Thanks to PowerShell, it is straightforward. First, create an event source:
New-EventLog -LogName Application -Source "Git_Backup"
Note that you may need to run this command from an elevated PowerShell prompt.
Then, I created the following script in c:\repo_git\backup.ps1:
$gitDir = "c:\repo_git"
$ret = cmd /c "git --git-dir=$gitDir\.git --work-tree=$gitDir push backup master 2>&1"

$id = $lastExitCode
$type = "Information"
if( $id -ne 0)
{
 $type = "Error"
}

Write-EventLog -LogName "Application" -Source "Git_Backup" -EntryType $type -Message $($ret -join [Environment]::NewLine) -EventId $id
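Once the task has run a few times, the entries can be checked from PowerShell as well (just a convenience command, not part of the script):
Get-EventLog -LogName Application -Source "Git_Backup" -Newest 5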
The command line for the scheduled task will then be:
C:\WINDOWS\system32\windowspowershell\v1.0\powershell.exe -NonInteractive -NoLogo -WindowStyle Hidden c:\repo_git\backup.ps1
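For completeness, the task itself can also be registered from the command line; something along these lines should do (the task name and the hourly schedule are only what I use, adapt them to your needs):
schtasks /Create /TN "Git_Backup" /SC HOURLY /TR "C:\WINDOWS\system32\windowspowershell\v1.0\powershell.exe -NonInteractive -NoLogo -WindowStyle Hidden c:\repo_git\backup.ps1"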

Monday, January 7, 2013

My git setup at work


Although I do not code much any more, I have kept up the good habit of versioning.
Basically, I have a local git repository on my desktop and all the files I work with (text, keynotes, excel, scripts...) are located inside it.

I make changes and commit them just as I would with code. With git, it's simple and efficient. The git log is also a good way to know what I worked on on a given date, which is very convenient for reporting.

The only remaining problem was backup. Indeed, if my laptop disk crashed, I would lose everything. I'm not allowed to set up a git repository at work, but our IT provides everybody with a shared private folder. The advantage of this shared folder is that IT backs it up every night.

The first thing I did was simply to move my repository to the shared folder. It worked great...until I wanted to access my files from a location without network access! So keeping the repository on the shared folder was not a solution.

I decided to create a backup git repository on the shared folder. This way, I would still work on the laptop and I would have a backup on the shared drive.

Here is the set of commands to do this. I'll suppose that the shared folder has been mounted on drive x:\ and that my repository is located in c:\repo_git :

First, we create the backup repository:
cd x:\
mkdir repo_git_backup
cd repo_git_backup
git init --bare .
The --bare option means that the repository will only contain the git database files: there is no working tree, so you cannot check anything out in it directly. That's fine for my use case, because I just need a backup.
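A quick way to double-check that the repository is really bare (run it inside x:\repo_git_backup):
git rev-parse --is-bare-repository
It should print true.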

We can now register the new repository as a remote named backup:
cd c:\repo_git
git remote add backup file://x:/repo_git_backup
Now everything is in place. In order to start a backup, just execute:
git push backup master
If your git repository contains a lot of files or has a long history, this first push may take some time. Subsequent pushes should be much faster.

Let's say my disk crashes and I lose all the data in c:\repo_git. How do I retrieve the files from the backup? Just create a folder and execute:
git clone file://x:/repo_git_backup
Everything will be back to the latest "git push".
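One more detail: in the fresh clone, the origin remote points at the backup repository. To get back to my original layout I simply rename it inside the new clone (optional, but it keeps the same commands working):
git remote rename origin backup
git push backup master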

In order to push automatically, I decided to create a scheduled task running a small PowerShell script; I'll describe it in the next post.

Monday, December 31, 2012

Powershell and binary output


I've been working a lot with powershell recently. I must admit that after a steep learning curve, it is a great language.
Pipelining objects (instead of plain text) is a beautiful idea. If you compare scripts written in powershell and bash, the powershell ones are much shorter and easier to read.
Basically you can perform advanced scripting without using the magic formulas of sed, awk and their friends.

If you've never played with powershell and use windows on a daily basis, just start...you will enjoy it. But please, give it some time, do not expect to master it in a few days.

I've been implementing scripts to automatically extract git commits and faced a major problem: git commands usually write directly to stdout and do not provide command-line options to write to a file (e.g. git show). Everything is more or less OK as long as the output is text, but as soon as you have to deal with binary output, you start entering hell...

First of all, you have to understand that PowerShell strings are UTF-16! So if you do something like this:
git show master:image.png | out-file 'image-master.png'
PowerShell will interpret the git output as lines of text: it looks for line-break characters to split the output, converts each line to a UTF-16 string, and then passes those strings to out-file.

Surprisingly, this is also true when you type:
git show master:image.png > image-master.png

Execute this in a PowerShell prompt and in a DOS prompt, and the content of the file will be different. From the DOS prompt it works as expected, whereas PowerShell mangles the content of the file. Basically, Microsoft broke backward compatibility on this particular point...I wonder how this decision was taken internally...
After failing with ">", I tried every possible option of the out-file cmdlet. It does not work, period. By the way, Set-Content won't work either.
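If you want to see the damage for yourself, compare both redirections on the same file (the file names are only examples; this assumes an image.png committed on master, as above):
git show master:image.png > powershell-redirect.png
cmd /c "git show master:image.png > cmd-redirect.png"
(Get-Item powershell-redirect.png).Length
(Get-Item cmd-redirect.png).Length
The file written through the PowerShell redirection is larger (roughly twice the size, because of the UTF-16 conversion) and is no longer a valid PNG, while the one written by cmd is intact.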

Now, I really needed to deal with binary output, so I had to do something about it. It took me a couple of hours and a lot of reading on the web, and I ended up with my own PowerShell function. If you want to save the output of command1 in file.out, execute:
execSaveBinary 'file.out' 'command1'


If command1 needs parameters, pass the whole command line as a single string:
execSaveBinary 'file.out' 'command1 parameter1 switch1'
Here is the source code; I've been using it extensively and so far it works fine:
function execSaveBinary([string] $file, [string] $command)
{
 # The first token of the command line is the executable, the rest are its arguments.
 $processName  = $command.split(" ") | select-object -first 1
 $argumentList = $command.split(" ") | select-object -skip 1

 # Start the process ourselves so that we can read its stdout as a raw byte stream,
 # bypassing PowerShell's text (UTF-16) conversion.
 $process = New-Object System.Diagnostics.Process
 $process.StartInfo.FileName = (Get-Command $processName -totalcount 1).Definition
 $process.StartInfo.WorkingDirectory = (Get-Location).Path
 $process.StartInfo.Arguments = $argumentList
 $process.StartInfo.UseShellExecute = $false
 $process.StartInfo.RedirectStandardOutput = $true
 $process.StartInfo.RedirectStandardError  = $true

 # Start() returns a boolean; assign it to $null so it does not pollute the function output.
 $null = $process.Start()

 $fileStream = New-Object System.IO.FileStream($file, [System.IO.FileMode]'Create', [System.IO.FileAccess]'Write')

 # Copy stdout to the file block by block, as raw bytes.
 $blockSize = 4096
 $bytesRead = 0
 $bytes = New-Object byte[] $blockSize

 do
 {
     $bytesRead = $process.StandardOutput.BaseStream.Read($bytes, 0, $blockSize)

     if($bytesRead -gt 0)
     {
      $fileStream.Write($bytes, 0, $bytesRead)
     }
 } while($bytesRead -gt 0)

 $process.WaitForExit()

 $fileStream.Close()
}
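With this function, the image from the earlier example can finally be extracted correctly:
execSaveBinary 'image-master.png' 'git show master:image.png'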