en-US/SplitPipeline.dll-Help.xml

<?xml version="1.0" encoding="utf-8"?>
<helpItems xmlns="http://msh" schema="maml">
<command:command xmlns:maml="http://schemas.microsoft.com/maml/2004/10" xmlns:command="http://schemas.microsoft.com/maml/dev/command/2004/10" xmlns:dev="http://schemas.microsoft.com/maml/dev/2004/10">
<command:details>
<command:name>Split-Pipeline</command:name>
<maml:description>
<maml:para>Splits pipeline input and processes its parts by parallel pipelines.</maml:para>
</maml:description>
<command:verb>Split</command:verb>
<command:noun>Pipeline</command:noun>
</command:details>
<maml:description>
<maml:para>The cmdlet splits the input, processes its parts by parallel pipelines, and
outputs the results for further processing. It may work without collecting
the whole input, large or infinite.
 
When Load is omitted the whole input is collected and split evenly between
Count parallel pipelines. This method shows the best performance in simple
cases. In other cases, e.g. on large or slow input, Load should be used in
order to enable processing of partially collected input.
 
The cmdlet creates several pipelines. Each pipeline is created when input
parts are available, created pipelines are busy, and their number is less
than Count. Each pipeline is used for processing one or more input parts.
 
Because each pipeline works in its own runspace variables, functions, and
modules from the main script are not automatically available for pipeline
scripts. Such items should be specified by Variable, Function, and Module
parameters in order to be available.
 
The Begin and End scripts are invoked for each created pipeline once before
and after processing. Each input part is piped to the script block Script.
The Finally script is invoked after all, even on failures or stopping.
 
If number of created pipelines is equal to Count and all pipelines are busy
then incoming input items are enqueued for later processing. If the queue
size hits the limit then the algorithm waits for any pipeline to complete.
 
Input parts are not necessarily processed in the same order as they come.
But output parts can be ordered according to input, use the switch Order.
 
In rare scenarios when synchronous code must be invoked in pipelines,
use the helper $Pipeline.Lock, see the repository tests for examples.</maml:para>
</maml:description>
<command:syntax>
<command:syntaxItem>
<maml:name>Split-Pipeline</maml:name>
<command:parameter required="true" position="1" >
<maml:name>Script</maml:name>
<command:parameterValue required="true">ScriptBlock</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="2" >
<maml:name>InputObject</maml:name>
<command:parameterValue required="true">PSObject</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>ApartmentState</maml:name>
<command:parameterValue required="true">ApartmentState</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Begin</maml:name>
<command:parameterValue required="true">ScriptBlock</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Count</maml:name>
<command:parameterValue required="true">Int32[]</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>End</maml:name>
<command:parameterValue required="true">ScriptBlock</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Filter</maml:name>
<command:parameterValue required="true">PSObject</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Finally</maml:name>
<command:parameterValue required="true">ScriptBlock</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Function</maml:name>
<command:parameterValue required="true">String[]</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Load</maml:name>
<command:parameterValue required="true">Int32[]</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Module</maml:name>
<command:parameterValue required="true">String[]</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Variable</maml:name>
<command:parameterValue required="true">String[]</command:parameterValue>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Order</maml:name>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Refill</maml:name>
</command:parameter>
</command:syntaxItem>
</command:syntax>
<command:parameters>
<command:parameter required="true" position="1" >
<maml:name>Script</maml:name>
<maml:description>
<maml:para>The script invoked for each input part of each pipeline with an input
part piped to it. The script either processes the whole part ($input)
or each item ($_) separately in the &quot;process&quot; block. Examples:
 
    # Process the whole $input part:
    ... | Split-Pipeline { $input | %{ $_ } }
 
    # Process input items $_ separately:
    ... | Split-Pipeline { process { $_ } }
 
The script may have any of &quot;begin&quot;, &quot;process&quot;, and &quot;end&quot; blocks:
 
    ... | Split-Pipeline { begin {...} process { $_ } end {...} }
 
Note that &quot;begin&quot; and &quot;end&quot; blocks are called for each input part but
scripts defined by parameters Begin and End are called for pipelines.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" pipelineInput="true (ByValue)" position="2" >
<maml:name>InputObject</maml:name>
<maml:description>
<maml:para>Input objects processed by parallel pipelines. Normally this parameter
is not used directly, objects are sent using the pipeline. But it is
fine to specify the input using this parameter.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>ApartmentState</maml:name>
<maml:description>
<maml:para>Specify either &quot;MTA&quot; (multi-threaded ) or &quot;STA&quot; (single-threaded) for
the apartment states of the threads used to run commands in pipelines.</maml:para>
<maml:para>Values : STA, MTA, Unknown</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Begin</maml:name>
<maml:description>
<maml:para>The script invoked for each pipeline on creation before processing. The
goal is to initialize the runspace to be used by the pipeline, normally
to set some variables, dot-source scripts, import modules, and etc.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Count</maml:name>
<maml:description>
<maml:para>Specifies the parallel pipeline count. The default value is the number
or processors. For intensive jobs use the default or decreased value,
especially if there are other tasks working at the same time. But for
jobs not consuming much processor resources increasing the number may
improve performance.
 
The parameter accepts an array of one or two integers. A single value
specifies the recommended number of pipelines. Two arguments specify
the minimum and maximum numbers and the recommended value is set to
Max(Count[0], Min(Count[1], ProcessorCount)).</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>End</maml:name>
<maml:description>
<maml:para>The script invoked for each pipeline once after processing. The goal
is, for example, to output some results accumulated during processing
of input parts by the pipeline. Consider to use Finally for releasing
resources instead of End or in addition to it.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Filter</maml:name>
<maml:description>
<maml:para>Either a hashtable for collecting unique input objects or a script used
in order to test an input object. Input includes extra objects added in
Refill mode. In fact, this filter is mostly needed for Refill.
 
A hashtable is used in order to collect and enqueue unique objects. In
Refill mode it may be useful for avoiding infinite loops.
 
A script is invoked in a child scope of the scope where the cmdlet is
invoked. The first argument is an object being tested. Returned $true
tells to add an object to the input queue.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Finally</maml:name>
<maml:description>
<maml:para>The script invoked for each opened pipeline before its closing, even on
terminating errors or stopping (Ctrl-C). It is normally needed in order
to release resources created by Begin. Output is ignored. If Finally
fails then its errors are written as warnings because it has to be
called for remaining pipelines.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Function</maml:name>
<maml:description>
<maml:para>Functions imported from the current runspace to parallel.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Load</maml:name>
<maml:description>
<maml:para>Enables processing of partially collected input and specifies input
part limits. If it is omitted then the whole input is collected and
split evenly between pipelines.
 
The parameter accepts an array of one or two integers. The first is the
minimum number of objects per pipeline. If it is less than 1 then Load
is treated as omitted. The second number is the optional maximum.
 
If processing is fast then it is important to specify a proper minimum.
Otherwise Split-Pipeline may work even slower than a standard pipeline.
 
Setting the maximum causes more frequent output. For example, this may
be important for feeding simultaneously working downstream pipelines.
 
Setting the maximum number is also needed for potentially large input
in order to limit the input queue size and avoid out of memory issues.
The maximum queue size is set internally to Load[1] * Count.
 
Use the switch Verbose in order to get some statistics which may help
to choose suitable load limits.
 
CAUTION: The queue limit may be ignored and exceeded if Refill is used.
Any number of objects written via [ref] go straight to the input queue.
Thus, depending on data Refill scenarios may fail due to out of memory.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Module</maml:name>
<maml:description>
<maml:para>Modules imported to parallel runspaces.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Order</maml:name>
<maml:description>
<maml:para>Tells to output part results in the same order as input parts arrive.
The algorithm may work slower.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Refill</maml:name>
<maml:description>
<maml:para>Tells to refill the input by [ref] objects from output. Other objects
go to output as usual. This convention is used for processing items of
hierarchical data structures: child container items come back to input,
leaf items or other data produced by processing go to output.
 
NOTE: Refilled input makes infinite loops possible for some data. Use
Filter in order to exclude already processed objects and avoid loops.</maml:para>
</maml:description>
</command:parameter>
<command:parameter required="false" position="named" >
<maml:name>Variable</maml:name>
<maml:description>
<maml:para>Variables imported from the current runspace to parallel.</maml:para>
</maml:description>
</command:parameter>
</command:parameters>
<command:inputTypes>
<command:inputType>
<dev:type>
<maml:name>Object</maml:name>
</dev:type>
<maml:description>
<maml:para>Input objects processed by parallel pipelines.</maml:para>
</maml:description>
</command:inputType>
</command:inputTypes>
<command:returnValues>
<command:returnValue>
<dev:type>
<maml:name>Object</maml:name>
</dev:type>
<maml:description>
<maml:para>Output of the Begin, Script, and End script blocks. The scripts Begin
and End are invoked once for each pipeline before and after processing.
The script Script is invoked repeatedly with input parts piped to it.</maml:para>
</maml:description>
</command:returnValue>
</command:returnValues>
<command:examples>
<command:example>
<maml:title>-------------------------- EXAMPLE 1 --------------------------</maml:title>
<dev:code>1..10 | . {process{ $_; sleep 1 }}
1..10 | Split-Pipeline -Count 10 {process{ $_; sleep 1 }}</dev:code>
<dev:remarks>
<maml:para>Two commands perform the same job simulating long but not processor
consuming operations on each item. The first command takes about 10
seconds. The second takes about 2 seconds due to Split-Pipeline.</maml:para>
<maml:para></maml:para>
</dev:remarks>
</command:example>
<command:example>
<maml:title>-------------------------- EXAMPLE 2 --------------------------</maml:title>
<dev:code>$PSHOME | Split-Pipeline -Refill {process{
    foreach($item in Get-ChildItem -LiteralPath $_ -Force) {
        if ($item.PSIsContainer) {
            [ref]$item.FullName
        }
        else {
            $item.Length
        }
    }
}} | Measure-Object -Sum</dev:code>
<dev:remarks>
<maml:para>This is an example of Split-Pipeline with refilled input. By the convention
output [ref] objects refill the input, other objects go to output as usual.
 
The code calculates the number and size of files in $PSHOME. It is a &quot;how
to&quot; sample, performance gain is not expected because the code is trivial
and works relatively fast.
 
See also another example with simulated slow data requests:
https://github.com/nightroman/SplitPipeline/blob/master/Tests/Test-Refill.ps1</maml:para>
<maml:para></maml:para>
</dev:remarks>
</command:example>
<command:example>
<maml:title>-------------------------- EXAMPLE 3 --------------------------</maml:title>
<dev:remarks>
<maml:para>Because each pipeline works in its own runspace variables, functions, and
modules from the main script are not automatically available for pipeline
scripts. Such items should be specified by Variable, Function, and Module
parameters in order to be available.
 
&gt; $arr = @(&apos;one&apos;, &apos;two&apos;, &apos;three&apos;); 0..2 | . {process{ $arr[$_] }}
one
two
three
 
&gt; $arr = @(&apos;one&apos;, &apos;two&apos;, &apos;three&apos;); 0..2 | Split-Pipeline {process{ $arr[$_] }}
Split-Pipeline : Cannot index into a null array.
...
 
&gt; $arr = @(&apos;one&apos;, &apos;two&apos;, &apos;three&apos;); 0..2 | Split-Pipeline -Variable arr {process{ $arr[$_] }}
one
two
three</maml:para>
</dev:remarks>
</command:example>
</command:examples>
<maml:relatedLinks>
<maml:navigationLink>
<maml:linkText>Project site:</maml:linkText>
<maml:uri>https://github.com/nightroman/SplitPipeline</maml:uri>
</maml:navigationLink>
</maml:relatedLinks>
</command:command>
</helpItems>