SplitPipeline

1.6.1

en-US/SplitPipeline.dll-Help.xml

<?xml version="1.0" encoding="utf-8"?>

<helpItems xmlns="http://msh" schema="maml">

<command:command xmlns:maml="http://schemas.microsoft.com/maml/2004/10" xmlns:command="http://schemas.microsoft.com/maml/dev/command/2004/10" xmlns:dev="http://schemas.microsoft.com/maml/dev/2004/10">

<command:details>

<command:name>Split-Pipeline</command:name>

<maml:description>

<maml:para>Splits pipeline input and processes its parts by parallel pipelines.</maml:para>

</maml:description>

<command:verb>Split</command:verb>

<command:noun>Pipeline</command:noun>

</command:details>

<maml:description>

<maml:para>The cmdlet splits the input, processes its parts by parallel pipelines, and

outputs the results for further processing. It may work without collecting

the whole input, large or infinite.

When Load is omitted the whole input is collected and split evenly between

Count parallel pipelines. This method shows the best performance in simple

cases. In other cases, e.g. on large or slow input, Load should be used in

order to enable processing of partially collected input.

The cmdlet creates several pipelines. Each pipeline is created when input

parts are available, created pipelines are busy, and their number is less

than Count. Each pipeline is used for processing one or more input parts.

Because each pipeline works in its own runspace, variables, functions, and

modules from the main script are not automatically available for pipeline

scripts. Such items should be specified by Variable, Function, and Module

parameters in order to be available.

The Begin and End scripts are invoked for each created pipeline once before

and after processing. Each input part is piped to the script block Script.

The Finally script is invoked last, even on failures or stopping.

If the number of created pipelines is equal to Count and all pipelines are busy

then incoming input items are enqueued for later processing. If the queue

size hits the limit then the algorithm waits for any pipeline to complete.

Input parts are not necessarily processed in the order in which they arrive.

To output the results in input order, use the switch Order.

In rare scenarios when synchronous code must be invoked in pipelines,

use the helper $Pipeline.Lock, see the repository tests for examples.</maml:para>

</maml:description>

<command:syntax>

<command:syntaxItem>

<maml:name>Split-Pipeline</maml:name>

<command:parameter required="true" position="1" >

<maml:name>Script</maml:name>

<command:parameterValue required="true">ScriptBlock</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="2" >

<maml:name>InputObject</maml:name>

<command:parameterValue required="true">PSObject</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>ApartmentState</maml:name>

<command:parameterValue required="true">ApartmentState</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Begin</maml:name>

<command:parameterValue required="true">ScriptBlock</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Count</maml:name>

<command:parameterValue required="true">Int32[]</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>End</maml:name>

<command:parameterValue required="true">ScriptBlock</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Filter</maml:name>

<command:parameterValue required="true">PSObject</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Finally</maml:name>

<command:parameterValue required="true">ScriptBlock</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Function</maml:name>

<command:parameterValue required="true">String[]</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Load</maml:name>

<command:parameterValue required="true">Int32[]</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Module</maml:name>

<command:parameterValue required="true">String[]</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Variable</maml:name>

<command:parameterValue required="true">String[]</command:parameterValue>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Order</maml:name>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Refill</maml:name>

</command:parameter>

</command:syntaxItem>

</command:syntax>

<command:parameters>

<command:parameter required="true" position="1" >

<maml:name>Script</maml:name>

<maml:description>

<maml:para>The script invoked for each input part of each pipeline with an input

part piped to it. The script either processes the whole part ($input)

or each item ($_) separately in the &quot;process&quot; block. Examples:

    # Process the whole $input part:

    ... | Split-Pipeline { $input | %{ $_ } }

    # Process input items $_ separately:

    ... | Split-Pipeline { process { $_ } }

The script may have any of &quot;begin&quot;, &quot;process&quot;, and &quot;end&quot; blocks:

    ... | Split-Pipeline { begin {...} process { $_ } end {...} }

Note that &quot;begin&quot; and &quot;end&quot; blocks are called for each input part but

scripts defined by parameters Begin and End are called for pipelines.</maml:para>

</maml:description>

</command:parameter>

<command:parameter required="false" pipelineInput="true (ByValue)" position="2" >

<maml:name>InputObject</maml:name>

<maml:description>

<maml:para>Input objects processed by parallel pipelines. Normally this parameter

is not used directly; objects are sent using the pipeline. But it is

fine to specify the input using this parameter.</maml:para>

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>ApartmentState</maml:name>

<maml:description>

<maml:para>Specify either &quot;MTA&quot; (multi-threaded) or &quot;STA&quot; (single-threaded) for

the apartment states of the threads used to run commands in pipelines.</maml:para>

<maml:para>Values : STA, MTA, Unknown</maml:para>

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Begin</maml:name>

<maml:description>

<maml:para>The script invoked for each pipeline on creation before processing. The

goal is to initialize the runspace to be used by the pipeline, normally

to set some variables, dot-source scripts, import modules, etc.</maml:para>

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Count</maml:name>

<maml:description>

<maml:para>Specifies the parallel pipeline count. The default value is the number

of processors. For intensive jobs use the default or decreased value,

especially if there are other tasks working at the same time. But for

jobs not consuming much processor resources increasing the number may

improve performance.

The parameter accepts an array of one or two integers. A single value

specifies the recommended number of pipelines. Two arguments specify

the minimum and maximum numbers and the recommended value is set to

Max(Count[0], Min(Count[1], ProcessorCount)).</maml:para>
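
<maml:para>A minimal sketch of both forms, assuming the formula above and a
hypothetical machine with 4 processors:

    # One value: 2 is the recommended number of pipelines.
    ... | Split-Pipeline -Count 2 {process{ $_ }}

    # Two values: minimum 2, maximum 8. With 4 processors the
    # recommended value is Max(2, Min(8, 4)) = 4.
    ... | Split-Pipeline -Count 2, 8 {process{ $_ }}</maml:para>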

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>End</maml:name>

<maml:description>

<maml:para>The script invoked for each pipeline once after processing. The goal

is, for example, to output some results accumulated during processing

of input parts by the pipeline. Consider using Finally for releasing

resources instead of End or in addition to it.</maml:para>
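
<maml:para>As an illustration only, a sketch that accumulates per-pipeline
totals. It assumes that a variable set by Begin stays visible to Script and
End in the same runspace; the name $data is hypothetical:

    1..10 | Split-Pipeline -Count 2 -Begin {
        $data = @{ Sum = 0 }            # once per pipeline, before processing
    } -Script {
        process { $data.Sum += $_ }     # for each item of each input part
    } -End {
        $data.Sum                       # once per pipeline, after processing
    } | Measure-Object -Sum

Each created pipeline outputs its own partial total, and Measure-Object
adds them together.</maml:para>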

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Filter</maml:name>

<maml:description>

<maml:para>Either a hashtable for collecting unique input objects or a script used

in order to test an input object. Input includes extra objects added in

Refill mode. In fact, this filter is mostly needed for Refill.

A hashtable is used in order to collect and enqueue unique objects. In

Refill mode it may be useful for avoiding infinite loops.

A script is invoked in a child scope of the scope where the cmdlet is

invoked. The first argument is the object being tested. Returning $true

adds the object to the input queue.</maml:para>
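
<maml:para>Two hedged sketches of the forms described above; the input
values are made up for illustration:

    # Hashtable: repeated input objects are enqueued only once.
    1, 2, 2, 3, 3 | Split-Pipeline -Filter @{} {process{ $_ }}

    # Script: only objects for which it returns $true are enqueued,
    # here the even numbers.
    1..10 | Split-Pipeline -Filter { $args[0] % 2 -eq 0 } {process{ $_ }}</maml:para>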

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Finally</maml:name>

<maml:description>

<maml:para>The script invoked for each opened pipeline before its closing, even on

terminating errors or stopping (Ctrl-C). It is normally needed in order

to release resources created by Begin. Output is ignored. If Finally

fails then its errors are written as warnings because it has to be

called for remaining pipelines.</maml:para>

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Function</maml:name>

<maml:description>

<maml:para>Functions imported from the current runspace to parallel runspaces.</maml:para>
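
<maml:para>For example, a sketch with a hypothetical function Get-Square
defined in the calling scope:

    function Get-Square($n) { $n * $n }
    1..5 | Split-Pipeline -Function Get-Square {process{ Get-Square $_ }}

Without -Function the parallel runspaces would not know Get-Square.</maml:para>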

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Load</maml:name>

<maml:description>

<maml:para>Enables processing of partially collected input and specifies input

part limits. If it is omitted then the whole input is collected and

split evenly between pipelines.

The parameter accepts an array of one or two integers. The first is the

minimum number of objects per pipeline. If it is less than 1 then Load

is treated as omitted. The second number is the optional maximum.

If processing is fast then it is important to specify a proper minimum.

Otherwise Split-Pipeline may work even slower than a standard pipeline.

Setting the maximum causes more frequent output. For example, this may

be important for feeding simultaneously working downstream pipelines.

Setting the maximum number is also needed for potentially large input

in order to limit the input queue size and avoid out of memory issues.

The maximum queue size is set internally to Load[1] * Count.

Use the switch Verbose in order to get some statistics which may help

to choose suitable load limits.

CAUTION: The queue limit may be ignored and exceeded if Refill is used.

Any number of objects written via [ref] go straight to the input queue.

Thus, depending on the data, Refill scenarios may run out of memory.</maml:para>
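
<maml:para>A sketch with illustrative numbers, not tuned recommendations:

    # At least 10 and at most 100 objects per input part. With
    # -Count 4 the queue limit described above is 100 * 4 = 400.
    1..10000 | Split-Pipeline -Count 4 -Load 10, 100 -Verbose {process{ $_ }}</maml:para>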

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Module</maml:name>

<maml:description>

<maml:para>Modules imported to parallel runspaces.</maml:para>

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Order</maml:name>

<maml:description>

<maml:para>Outputs part results in the same order as the input parts arrive.

The algorithm may work slower.</maml:para>
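
<maml:para>A sketch based on Example 1 with the switch added:

    1..10 | Split-Pipeline -Count 10 -Order {process{ $_; sleep 1 }}

With -Order the results are written in input order, 1 through 10;
without it the order depends on which pipeline finishes first.</maml:para>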

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Refill</maml:name>

<maml:description>

<maml:para>Refills the input with [ref] objects from the output. Other objects

go to output as usual. This convention is used for processing items of

hierarchical data structures: child container items come back to input,

leaf items or other data produced by processing go to output.

NOTE: Refilled input makes infinite loops possible for some data. Use

Filter in order to exclude already processed objects and avoid loops.</maml:para>

</maml:description>

</command:parameter>

<command:parameter required="false" position="named" >

<maml:name>Variable</maml:name>

<maml:description>

<maml:para>Variables imported from the current runspace to parallel runspaces.</maml:para>

</maml:description>

</command:parameter>

</command:parameters>

<command:inputTypes>

<command:inputType>

<dev:type>

<maml:name>Object</maml:name>

</dev:type>

<maml:description>

<maml:para>Input objects processed by parallel pipelines.</maml:para>

</maml:description>

</command:inputType>

</command:inputTypes>

<command:returnValues>

<command:returnValue>

<dev:type>

<maml:name>Object</maml:name>

</dev:type>

<maml:description>

<maml:para>Output of the Begin, Script, and End script blocks. The scripts Begin

and End are invoked once for each pipeline before and after processing.

The script Script is invoked repeatedly with input parts piped to it.</maml:para>

</maml:description>

</command:returnValue>

</command:returnValues>

<command:examples>

<command:example>

<maml:title>-------------------------- EXAMPLE 1 --------------------------</maml:title>

<dev:code>1..10 | . {process{ $_; sleep 1 }}

1..10 | Split-Pipeline -Count 10 {process{ $_; sleep 1 }}</dev:code>

<dev:remarks>

<maml:para>Two commands perform the same job simulating long but not processor

consuming operations on each item. The first command takes about 10

seconds. The second takes about 2 seconds due to Split-Pipeline.</maml:para>

<maml:para></maml:para>

</dev:remarks>

</command:example>

<command:example>

<maml:title>-------------------------- EXAMPLE 2 --------------------------</maml:title>

<dev:code>$PSHOME | Split-Pipeline -Refill {process{

    foreach($item in Get-ChildItem -LiteralPath $_ -Force) {

        if ($item.PSIsContainer) {

            [ref]$item.FullName

        }

        else {

            $item.Length

        }

    }

}} | Measure-Object -Sum</dev:code>

<dev:remarks>

<maml:para>This is an example of Split-Pipeline with refilled input. By convention,

output [ref] objects refill the input; other objects go to output as usual.

The code calculates the number and size of files in $PSHOME. It is a &quot;how

to&quot; sample; no performance gain is expected because the code is trivial

and works relatively fast.

See also another example with simulated slow data requests:

https://github.com/nightroman/SplitPipeline/blob/master/Tests/Test-Refill.ps1</maml:para>

<maml:para></maml:para>

</dev:remarks>

</command:example>

<command:example>

<maml:title>-------------------------- EXAMPLE 3 --------------------------</maml:title>

<dev:remarks>

<maml:para>Because each pipeline works in its own runspace, variables, functions, and

modules from the main script are not automatically available for pipeline

scripts. Such items should be specified by Variable, Function, and Module

parameters in order to be available.

&gt; $arr = @(&apos;one&apos;, &apos;two&apos;, &apos;three&apos;); 0..2 | . {process{ $arr[$_] }}

one

two

three

&gt; $arr = @(&apos;one&apos;, &apos;two&apos;, &apos;three&apos;); 0..2 | Split-Pipeline {process{ $arr[$_] }}

Split-Pipeline : Cannot index into a null array.

...

&gt; $arr = @(&apos;one&apos;, &apos;two&apos;, &apos;three&apos;); 0..2 | Split-Pipeline -Variable arr {process{ $arr[$_] }}

one

two

three</maml:para>

</dev:remarks>

</command:example>

</command:examples>

<maml:relatedLinks>

<maml:navigationLink>

<maml:linkText>Project site:</maml:linkText>

<maml:uri>https://github.com/nightroman/SplitPipeline</maml:uri>

</maml:navigationLink>

</maml:relatedLinks>

</command:command>

</helpItems>